TL;DR

多臂老虎机问题是在有限尝试次数下，通过平衡探索（尝试未知选项）与利用（选择当前最优选项），最大化总奖励的经典决策问题。

UCB1（Upper Confidence Bound 1）基于置信上界，对每个臂的估计奖励加上一个“不确定性”项，偏向乐观选择；理论保证强，但探索略显保守。
汤普森采样（Thompson Sampling）采用贝叶斯方法，从每个臂的后验分布中采样并选择最大者，实现“按需探索”，自适应性强。

在实验中（4臂、5000步、最优臂 p=0.637）：

汤普森采样的总悔憾仅15.6，远低于UCB1的59.8；
更快锁定最优臂（<500步 vs. ~1000步）；
非最优臂尝试次数更少（32次 vs. 123次）。

汤普森采样在实际应用中通常更高效、更智能，是工业界首选；UCB1则胜在理论清晰，适合分析场景。

什么是多臂老虎机问题？

有一排老虎机，每一台机器（称为一个“臂”）拉一次会以某个未知概率吐出硬币（奖励为 1），否则什么也没有（奖励为 0）。你的目标是在有限次数（比如 5000 次）内，尽可能多地获得硬币。

多臂老虎机问题
有 $K$ 个臂（actions），每次行动 $a \in {1,2,\dots,K}$ 对应一个未知的 Bernoulli 分布，成功概率为 $p_{a} \in [0,1]$ 。在每一轮 $t = 1,2,\dots,T$ ，你选择一个臂 $a_t$ ，并观察到随机奖励 $r_t \sim \text{Bernoulli}(p_{a_t})$ 。目标：最大化总奖励 $\sum_{t=1}^T r_t$ ，或等价地，最小化遗憾（Regret）。

遗憾（Regret）定义
设最优臂为 $a^* = \argmax_a p_a$ ，其期望奖励为 $p^* = \max_a p_a$ 。伪遗憾（Pseudo-Regret）定义为：

$R_T = T \cdot p^* - \sum_{t=1}^T \mathbb{E}[r_t] = \sum_{t=1}^T (p^* - p_{a_t})$

它衡量了你因未始终选择最优臂而“损失”的期望奖励。

利用（Exploitation） vs 探索（Exploration）

每次拉动老虎机的一个臂，你都在做一个决策：是继续使用目前看起来“最好”的那个臂，还是去试试那些还不太了解的臂？

利用（Exploitation）意味着选择当前估计奖励最高的臂——比如历史平均回报最大的那个。这是“吃老本”的策略，能立即获得高收益。
探索（Exploration）则意味着主动尝试那些被拉动次数较少的臂，哪怕它们当前表现平平。这是为了获取更多信息，避免因早期偶然结果而错过真正的最优臂。

这两种行为天然对立。如果只利用，你可能会固守在一个次优臂上（例如，某个臂前几次碰巧失败，就被永远抛弃）；如果只探索，你虽然收集了大量信息，却浪费了本可用于高回报臂的机会，导致总收益低下。因此，优秀的 bandit 算法必须在探索与利用之间动态地、自适应地取得平衡。UCB 和汤普森采样正是通过不同的数学机制，巧妙地实现了这一目标——一个基于置信上界的“乐观估计”，另一个基于贝叶斯后验的“概率匹配”。

接下来，我们看两种经典方法如何实现这种平衡。

UCB1：基于置信上界的乐观选择

UCB1（Upper Confidence Bound 1）的核心思想是：对每个臂的奖励估计加上一个“不确定性”项，然后选择最“乐观”的那个。

算法步骤

记：

$n_a(t)$ ：到时间 $t$ 为止臂 $a$ 被拉动的次数。
$\hat{\mu}_a(t)$ ：臂 $a$ 的样本均值（即历史平均奖励）。

UCB1 在第 $t$ 步选择的臂为：

$a_t = \argmax_{a \in {1,\dots,K}} \left[ \hat{\mu}_a(t-1) + \sqrt{ \frac{\alpha \ln t}{2 n_a(t-1)} } \right]$

其中 $\alpha > 0$ 是一个超参数（通常取 3），控制探索强度。在初始阶段，需确保每个臂至少被拉一次，避免除零错误。

公式解析

第一项 $\hat{\mu}_a$ ：代表利用 —— 当前对臂 $a$ 性能的最佳估计。
第二项 $\sqrt{ \frac{\alpha \ln t}{2 n_a} }$ $\frac{α l n t}{2 n _{a}}$ ：代表探索——这是 Hoeffding 不等式导出的置信区间上界。
- 当 $n_a$ 很小（对该臂了解少），这一项很大 → 鼓励探索。
- 当 $n_a$ 很大（对该臂了解多），这一项趋近于 0 → 退化为纯利用。
- 随着总步数 $t$ 增加，即使所有臂都被多次拉动，仍会偶尔探索低频臂（因为 $\ln t$ 缓慢增长）。

因此，UCB1 是一种乐观面对不确定性（Optimism in the Face of Uncertainty）的策略：它假设每个臂的真实性能可能比当前估计更好，而“更好”的程度由统计置信度决定。

汤普森采样：贝叶斯视角下的概率匹配

汤普森采样（Thompson Sampling）采用贝叶斯方法，通过维护每个臂奖励分布的后验概率，直接从后验中采样来决定下一步行动。

算法步骤（针对 Bernoulli 奖励）

为每个臂 $a$ 维护一个 Beta 后验分布： $\theta_a \sim \text{Beta}(\alpha_a, \beta_a)$ ，初始时 $\alpha_a = 1, \beta_a = 1$ （即均匀先验）。在每一步 $t$ ：

采样：对每个臂 $a$ ，从其后验中抽取一个样本 $\tilde{p}_a \sim \text{Beta}(\alpha_a, \beta_a)$
选择： $a_t = \argmax_a \tilde{p}_a$
观察：得到奖励 $r_t \in {0,1}$
更新：
- 若 $r_t = 1$ ，则 $\alpha_{a_t} \leftarrow \alpha_{a_t} + 1$
- 若 $r_t = 0$ ，则 $\beta_{a_t} \leftarrow \beta_{a_t} + 1$

公式解析

Beta 分布是 Bernoulli 分布的共轭先验，非常自然：

$\alpha_a - 1$ ≈ 臂 $a$ 的成功次数
$\beta_a - 1$ ≈ 臂 $a$ 的失败次数

采样过程相当于：“假设每个臂的真实成功概率是多少？根据当前数据，随机猜一个可能值。”选择最大样本意味着：“在当前信念下，哪个臂最有可能是最优的？”这被称为概率匹配（Probability Matching）：选择某个臂的概率 ≈ 该臂是全局最优的概率。

实验结果

实验设置

总步数：5000，4 个臂，每个臂的真实概率：

Arm 0: p = 0.637 ✅ 最优臂
Arm 1: p = 0.270
Arm 2: p = 0.041
Arm 3: p = 0.017

实验结论

UCB1	汤普森采样

UCB1算法分析图，同样包含五个子图：前四个展示每个臂的估计均值和不确定性（UCB-Mean差）随时间的变化（横轴为时间步数，纵轴为值），虚线表示真实概率，阴影区域表示不确定性；第五个子图展示累积伪悔憾曲线和各臂总拉动次数（横轴为时间或臂索引，纵轴为悔憾或拉动次数）。	汤普森采样（Thompson Sampling）分析图，包含五个子图：前四个展示各臂成功概率的后验分布随时间的演化（横轴为成功概率 $p \in [0,1]$ ，纵轴为概率密度），其中黑色虚线表示真实概率；第五个子图显示累积伪悔憾随时间变化（横轴为时间步数，纵轴为累计悔憾），以及各臂被拉动的总次数（横轴为臂索引，纵轴为拉动次数）。

UCB1

汤普森采样

UCB1算法分析图，同样包含五个子图：前四个展示每个臂的估计均值和不确定性（UCB-Mean差）随时间的变化（横轴为时间步数，纵轴为值），虚线表示真实概率，阴影区域表示不确定性；第五个子图展示累积伪悔憾曲线和各臂总拉动次数（横轴为时间或臂索引，纵轴为悔憾或拉动次数）。

汤普森采样（Thompson Sampling）分析图，包含五个子图：前四个展示各臂成功概率的后验分布随时间的演化（横轴为成功概率

p \in [0,1]

，纵轴为概率密度），其中黑色虚线表示真实概率；第五个子图显示累积伪悔憾随时间变化（横轴为时间步数，纵轴为累计悔憾），以及各臂被拉动的总次数（横轴为臂索引，纵轴为拉动次数）。

在本次实验设定下（4 臂 Bernoulli 老虎机，5000 步，最优臂 $p = 0.637$ ），Thompson Sampling（汤普森采样）在几乎所有关键指标上均显著优于 UCB1：

悔憾控制更优：TS 的总伪悔憾（≈15.61）仅为 UCB1（≈59.80）的四分之一，表明其决策更接近理想最优策略；
收敛速度更快：TS 在前 500 步内就基本锁定最优臂，而 UCB1 需要约 1000 步才能稳定；
探索效率更高：TS 仅用 32 次非最优拉取就完成有效探索，而 UCB1 浪费了 123 次机会；
估计精度相当：两者对各臂真实成功率的估计均准确，但 TS 提供了更丰富的不确定性信息（完整后验分布）；
行为更智能：TS 的贝叶斯机制实现了“按需探索”，避免了 UCB1 因置信界公式导致的冗余尝试。

因此，在实际应用中（如推荐系统、在线广告、A/B 测试等），Thompson Sampling 通常是比 UCB1 更高效、更稳健的选择。虽然 UCB1 具有坚实的理论保证，但 Thompson Sampling 凭借其自适应性、低悔憾和易实现性，已成为工业界事实上的标准方法。简言之：当目标是最大化收益而非证明理论边界时，选汤普森采样。

效率对比

指标	UCB1	Thompson Sampling
总伪悔憾（Total Regret）	~59.80	~15.61
最优臂拉取次数	4877	4968
其他臂拉取次数	共计 123 次	共计 32 次
收敛速度	较慢，约 1000 步后稳定	更快，前 500 步已基本锁定最优臂

✅ Thompson Sampling 的性能显著优于 UCB1，尤其是在累积悔憾和探索效率上。

累积伪悔憾（Cumulative Pseudo-Regret）

UCB1 悔憾曲线在初期快速上升（因为初始阶段需探索），之后趋于平缓，但整体增长明显。TS 曲线陡峭上升期更短，后期几乎平坦，最终悔憾远低于 UCB1。UCB1 的悔憾为 59.80，意味着平均每次决策损失了约 $\frac{59.8}{5000} \approx 0.012$ 的期望奖励。TS 的悔憾仅为 15.61，说明其“浪费”的机会成本极低。这表明 TS 能更快识别出最优臂，并长期利用它，从而大幅减少非最优选择带来的损失。因此，TS 在后悔控制方面表现更优，是更强的算法。

收敛速度与学习效率

UCB1 Arm 0（最优臂），初始估计波动大（因样本少），但很快收敛到真实值附近。不确定性区域（UCB-Mean）迅速缩小，说明置信区间随样本增加而收紧。Arm 1~3 均值估计稳定在真实值附近，但被拉动次数极少（共 123 次），说明 UCB1 成功抑制了对劣质臂的过度探索。尽管能收敛，但收敛速度较慢，尤其在早期阶段仍会频繁尝试次优臂。

Thompson Sampling Arm 0，后验分布从均匀分布逐渐收缩至峰值位于 p≈0.637 处，且在 t=630 左右已高度集中。表明 TS 很早就锁定了最优臂的真实参数。Arm 1~3，后验分布始终偏向低 p 区域，且随着失败次数增多，分布向左移动并变窄。只有少数几次被采中，体现了“只在必要时探索”。TS 的贝叶斯更新机制天然具备自适应能力——它不会盲目探索，而是根据当前信念决定是否值得一试。TS 收敛更快、学习效率更高，因为它将探索集中在最有潜力的臂上。

每个臂的估计准确度

我们可以通过以下方式评估估计精度：

（1）均值估计 vs. 真实值

臂	真实 p	UCB1 估计	TS 估计
0	0.637	≈0.63	≈0.637
1	0.270	≈0.27	≈0.270
2	0.041	≈0.04	≈0.041
3	0.017	≈0.017	≈0.017

两者都能很好地估计各臂的真实成功率，准确性相当。UCB1 使用确定性的置信边界（基于 Hoeffding 不等式），虽然保守，但在样本较少时可能高估不确定性。TS 使用完整的后验分布，不仅给出均值，还提供了完整的概率密度信息，更能反映“我们知道多少”。TS 的不确定性建模更具信息量，适合需要置信度或风险评估的应用场景。

探索行为比较

指标	UCB1	Thompson Sampling
最优臂拉取次数	4877	4968
次优臂拉取总数	123	32
探索频率	中等偏高	极低（仅在必要时）

UCB1 是一种“主动探索”策略：即使某个臂看起来很差，只要它的 UCB 值暂时较高，就可能被选中。这导致了一些不必要的尝试。TS 是一种“被动探索”策略：只有当某个臂的后验分布存在较大不确定性时，才有可能被选中。否则，它只会选择最可能最优的那个。

附录

具体实现

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import matplotlib.style as style

# 设置一个更美观的绘图风格
style.use('seaborn-v0_8-whitegrid')

class BernoulliBandit:
    def __init__(self, n_arms, seed=None):
        self.rng = np.random.default_rng(seed)
        # 每个臂的真实成功概率随机生成
        self.p = self.rng.uniform(0.0, 1.0, size=n_arms)
        self.n_arms = n_arms

    def pull(self, arm):
        return 1 if self.rng.random() < self.p[arm] else 0

def ucb1(n_arms, steps, bandit, alpha=3.0, seed=None, record_ucb=True):
    """
    UCB1: 选择 argmax_a [ mean[a] + sqrt(alpha * ln(t) / (2 * n[a])) ]
    alpha 常用 3.0，也可调节探索强度
    """
    rng = np.random.default_rng(seed)

    counts = np.zeros(n_arms, dtype=int)
    means = np.zeros(n_arms, dtype=float)
    actions = np.zeros(steps, dtype=int)
    rewards = np.zeros(steps, dtype=float)

    if record_ucb:
        mean_hist = np.zeros((steps, n_arms), dtype=float)
        delta_hist = np.zeros((steps, n_arms), dtype=float)
        ucb_hist = np.zeros((steps, n_arms), dtype=float)
    else:
        mean_hist = delta_hist = ucb_hist = None

    # 先保证每个臂至少拉一次
    t = 0
    for a in range(n_arms):
        if t >= steps:
            break
        r = bandit.pull(a)
        counts[a] += 1
        means[a] = r
        actions[t] = a
        rewards[t] = r

        if record_ucb:
            # 在早期，为了数值稳定性，分母可以加一个极小值或特殊处理
            # 这里为了简化，直接用 counts，因为我们保证了它不为 0
            # 使用 t+1 避免 log(0)
            delta = np.sqrt(alpha * np.log(t + 1) / (2 * counts))
            ucb = means + delta
            mean_hist[t] = means
            delta_hist[t] = delta
            ucb_hist[t] = ucb
        t += 1

    # 主循环
    for t in range(t, steps):
        # 计算每个 arm 的 UCB 值
        # np.log(t) 而不是 t+1，因为此时 t 代表已经过去的步数，从 n_arms 开始
        delta = np.sqrt(alpha * np.log(t) / (2 * counts))
        ucb = means + delta

        if record_ucb:
            mean_hist[t] = means
            delta_hist[t] = delta
            ucb_hist[t] = ucb

        # 选择 UCB 值最大的 arm
        a = int(np.argmax(ucb))
        r = bandit.pull(a)

        # 增量更新均值
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]

        actions[t] = a
        rewards[t] = r

    return actions, rewards, counts, means, mean_hist, delta_hist, ucb_hist


def thompson_sampling(n_arms, steps, bandit, seed=None, record_posteriors=True):
    """
    Thompson Sampling: 为每个臂维护一个 Beta 分布。
    """
    rng = np.random.default_rng(seed)

    # beta_params 存储每个臂的 Beta 分布参数 [alpha, beta]
    # 初始为 Beta(1, 1)，即均匀分布（无信息先验）
    beta_params = np.ones((n_arms, 2), dtype=float)
    
    actions = np.zeros(steps, dtype=int)
    rewards = np.zeros(steps, dtype=float)

    # 用于记录每一步的后验分布参数（可选）
    if record_posteriors:
        posterior_hist = np.zeros((steps, n_arms, 2), dtype=float)
    else:
        posterior_hist = None

    for t in range(steps):
        # 1. 采样：从每个臂的当前 Beta 后验分布中抽取一个样本
        samples = rng.beta(beta_params[:, 0], beta_params[:, 1])

        # 记录（在选臂之前记录这一时刻的后验分布）
        if record_posteriors:
            # .copy() 很重要，否则 history 会全部指向最后的状态
            posterior_hist[t] = beta_params.copy()

        # 2. 选择：选择样本值最大的臂
        a = int(np.argmax(samples))
        
        # 3. 拉动选择的臂并观察奖励
        r = bandit.pull(a)

        # 4. 更新后验分布：
        # 如果 r=1 (成功), alpha 增加 1
        # 如果 r=0 (失败), beta 增加 1
        beta_params[a, 0] += r
        beta_params[a, 1] += (1 - r)

        # 记录动作和奖励
        actions[t] = a
        rewards[t] = r
        
    # 计算最终的alpha和beta，可以用来推断每个臂的成功次数和失败次数
    # 成功次数 = alpha - 1, 失败次数 = beta - 1
    counts = (beta_params[:, 0] + beta_params[:, 1] - 2).astype(int)
    # 贝叶斯估计的均值是 alpha / (alpha + beta)
    estimated_means = beta_params[:, 0] / (beta_params[:, 0] + beta_params[:, 1])

    return actions, rewards, counts, estimated_means, posterior_hist


def plot_ucb1_results(bandit, actions, rewards, counts, mean_hist, ucb_hist):
    """
    功能强大的可视化函数，为每个臂单独绘制 UCB 演化图。
    """
    T = len(rewards)
    n_arms = bandit.n_arms
    true_p = bandit.p
    best_arm = np.argmax(true_p)
    time_steps = np.arange(1, T + 1)
    colors = plt.cm.viridis(np.linspace(0, 1, n_arms))

    # --- 计算子图网格布局 ---
    # 总图数 = 臂的数量 + 2 (悔憾图和次数图)
    n_plots = n_arms + 2
    n_cols = 2
    # 向上取整计算行数
    n_rows = int(np.ceil(n_plots / n_cols))

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 4 * n_rows))
    # 将 2D 数组扁平化为 1D，方便索引
    axes = axes.flatten()
    fig.suptitle('UCB1 Algorithm Analysis (Individual Arm View)', fontsize=20)

    # --- 图 1 to n_arms: 每个臂的 UCB 值与均值估计的演化 ---
    for a in range(n_arms):
        ax = axes[a]
        # 绘制真实的概率 p (水平线)
        ax.axhline(y=true_p[a], color='gray', ls=':', lw=2,
                   label=f'True p={true_p[a]:.3f}')
        # 绘制均值估计
        ax.plot(time_steps, mean_hist[:, a], color=colors[a], ls='-',
                label='Estimated Mean')
        # 用半透明区域填充均值和 UCB 之间的空间，代表“不确定性”
        ax.fill_between(time_steps, mean_hist[:, a], ucb_hist[:, a],
                         color=colors[a], alpha=0.3, label='Uncertainty (UCB-Mean)')

        title = f'Arm {a} Analysis'
        if a == best_arm:
            title += ' (Best Arm)'
            ax.set_facecolor('gold') # 给最优臂的子图一个背景色
            ax.patch.set_alpha(0.15)
            
        ax.set_title(title)
        ax.set_xlabel('Time Steps')
        ax.set_ylabel('Value')
        ax.legend(loc='lower right')
        ax.grid(True, which='both', linestyle='--', linewidth=0.5)

    # --- 倒数第二个图: 累积伪悔憾 ---
    ax_regret = axes[n_arms]
    best_p = true_p.max()
    instant_regret = best_p - true_p[actions]
    cumulative_regret = np.cumsum(instant_regret)

    ax_regret.plot(time_steps, cumulative_regret, color='crimson')
    ax_regret.set_title('Cumulative Pseudo-Regret Over Time')
    ax_regret.set_xlabel('Time Steps')
    ax_regret.set_ylabel('Cumulative Regret')
    ax_regret.text(T * 0.05, cumulative_regret[-1] * 0.8,
                   f'Total Regret: {cumulative_regret[-1]:.2f}', fontsize=12)

    # --- 最后一个图: 各臂被选择次数 ---
    ax_counts = axes[n_arms + 1]
    arm_indices = np.arange(n_arms)
    bar_colors = [colors[i] for i in arm_indices]
    
    bars = ax_counts.bar(arm_indices, counts, color=bar_colors, edgecolor='black')
    ax_counts.set_title('Total Pull Counts per Arm')
    ax_counts.set_xlabel('Arm Index')
    ax_counts.set_ylabel('Number of Pulls')
    ax_counts.set_xticks(arm_indices)
    ax_counts.bar_label(bars, label_type='edge')
    # 高亮最优臂
    bars[best_arm].set_color('gold')
    bars[best_arm].set_edgecolor('black')
    ax_counts.legend([bars[best_arm]], [f'Best Arm ({best_arm})'])

    # --- 隐藏多余的子图 ---
    for i in range(n_plots, len(axes)):
        axes[i].axis('off')

    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.savefig("source/_drafts/多臂老虎机问题/ucb1.png")
    plt.show()


def plot_thompson_sampling_results(bandit, actions, rewards, counts, posterior_hist, title=None):
    """
    Visualizes the results of the Thompson Sampling algorithm.

    - For each arm: plots the evolution of the Beta posterior distribution at different time steps.
    - Plots the cumulative pseudo-regret over time.
    - Plots the total pull counts for each arm.
    
    Args:
        bandit (BernoulliBandit): The bandit environment.
        actions (np.array): The sequence of actions taken.
        rewards (np.array): The sequence of rewards received.
        counts (np.array): The final pull counts for each arm.
        posterior_hist (np.array): History of posterior parameters (alpha, beta) for each arm.
        title (str, optional): The main title for the plot.
    """
    T = len(rewards)
    n_arms = bandit.n_arms
    true_p = bandit.p
    best_arm = np.argmax(true_p)
    time_steps = np.arange(1, T + 1)
    
    # Consistent colors for arms across different plots
    arm_colors = plt.cm.viridis(np.linspace(0, 1, n_arms))

    # --- Setup subplot grid ---
    n_plots = n_arms + 2
    n_cols = 2 if n_arms > 1 else 1 # Handle single-arm case
    n_rows = int(np.ceil(n_plots / n_cols))

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 4 * n_rows))
    axes = axes.flatten()
    
    # Set main title
    fig.suptitle(title or 'Thompson Sampling Analysis', fontsize=20)

    # --- Plot 1 to n_arms: Posterior Distribution Evolution ---
    # Select a few time steps (logarithmically spaced) to show the evolution
    if T > 10:
        t_points = np.logspace(1, np.log10(T - 1), num=4, dtype=int)
    else: # Handle small T
        t_points = np.linspace(0, T - 1, num=min(T, 4), dtype=int)
    
    # Use a sequential colormap where darker colors represent later times
    time_colors = plt.cm.cividis_r(np.linspace(0.2, 1, len(t_points)))
    x_pdf = np.linspace(0, 1, 300)

    for a in range(n_arms):
        ax = axes[a]
        # Plot the true probability as a vertical line
        ax.axvline(true_p[a], color='black', ls='--', lw=2, label=f'True p={true_p[a]:.3f}')
        
        # Plot the PDF at different time steps
        for i, t in enumerate(t_points):
            alpha, beta = posterior_hist[t, a]
            if alpha > 0 and beta > 0: # Ensure valid Beta parameters
                pdf = stats.beta.pdf(x_pdf, alpha, beta)
                ax.plot(x_pdf, pdf, color=time_colors[i], label=f't={t+1}')

        plot_title = f'Arm {a}: Posterior Distribution'
        if a == best_arm:
            plot_title += ' (Best Arm)'
            ax.set_facecolor('gold')
            ax.patch.set_alpha(0.15)

        ax.set_title(plot_title)
        ax.set_xlabel('Success Probability p')
        ax.set_ylabel('Probability Density')
        ax.set_xlim(0, 1)
        ax.set_ylim(bottom=0)
        ax.legend(loc='upper right')
        ax.grid(True, which='both', linestyle='--', linewidth=0.5)

    # --- Plot n_arms + 1: Cumulative Pseudo-Regret ---
    ax_regret = axes[n_arms]
    best_p = true_p.max()
    instant_regret = best_p - true_p[actions]
    cumulative_regret = np.cumsum(instant_regret)

    ax_regret.plot(time_steps, cumulative_regret, color='crimson')
    ax_regret.set_title('Cumulative Pseudo-Regret Over Time')
    ax_regret.set_xlabel('Time Steps')
    ax_regret.set_ylabel('Cumulative Regret')
    ax_regret.text(T * 0.05, cumulative_regret[-1] * 0.8,
                   f'Total Regret: {cumulative_regret[-1]:.2f}', fontsize=12)
    ax_regret.grid(True)

    # --- Plot n_arms + 2: Total Pull Counts ---
    ax_counts = axes[n_arms + 1]
    arm_indices = np.arange(n_arms)
    
    bars = ax_counts.bar(arm_indices, counts, color=arm_colors, edgecolor='black')
    ax_counts.set_title('Total Pull Counts per Arm')
    ax_counts.set_xlabel('Arm Index')
    ax_counts.set_ylabel('Number of Pulls')
    ax_counts.set_xticks(arm_indices)
    ax_counts.bar_label(bars, label_type='edge')
    
    # Highlight the best arm
    bars[best_arm].set_color('gold')
    bars[best_arm].set_edgecolor('black')
    ax_counts.legend([bars[best_arm]], [f'Best Arm ({best_arm})'])

    # --- Hide unused subplots ---
    for i in range(n_plots, len(axes)):
        axes[i].axis('off')

    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.savefig("source/_drafts/多臂老虎机问题/ts.png")
    plt.show()


if __name__ == "__main__":
    N = 4
    T = 5000
    bandit = BernoulliBandit(n_arms=N, seed=0)

    actions, rewards, counts, means, mean_hist, delta_hist, ucb_hist = ucb1(
        n_arms=N, steps=T, bandit=bandit, alpha=3.0, seed=0, record_ucb=True
    )

    best_arm = int(np.argmax(bandit.p))
    best_p = bandit.p[best_arm]
    total_reward = rewards.sum()
    regret = T * best_p - total_reward

    print("True p per arm:", np.round(bandit.p, 3))
    print("Best arm:", best_arm, "best p:", round(best_p, 3))
    print("Pull counts:", counts)
    print("Estimated means:", np.round(means, 3))
    print("Total reward:", int(total_reward))
    print("Pseudo-regret (from true p):", round(regret, 2))

    # 调用新的、按臂分开的可视化函数
    plot_ucb1_results(bandit, actions, rewards, counts, mean_hist, ucb_hist)


    actions, rewards, counts, estimated_means, posterior_hist = thompson_sampling(
        n_arms=N, steps=T, bandit=bandit, seed=0, record_posteriors=True
    )

    best_arm = int(np.argmax(bandit.p))
    best_p = bandit.p[best_arm]
    total_reward = rewards.sum()
    regret = T * best_p - total_reward

    print("Algorithm: Thompson Sampling")
    print("True p per arm:", np.round(bandit.p, 3))
    print("Best arm:", best_arm, "best p:", round(best_p, 3))
    print("Total pull counts:", counts)
    print("Estimated means (from posteriors):", np.round(estimated_means, 3))
    print("Total reward:", int(total_reward))
    print("Pseudo-regret:", round(regret, 2))

    # 可视化 Beta 后验分布随时间的变化
    plot_thompson_sampling_results(
        bandit=bandit,
        actions=actions,
        rewards=rewards,
        counts=counts,
        posterior_hist=posterior_hist,
        title="Thompson Sampling Analysis"
    )