# Multi-armed bandit
The multi-armed bandit problem is a sequential decision-making problem with a single state and multiple actions to choose from. The reward depends on the action taken and may be deterministic, stochastic, or adversarial.
The reward distribution of each arm is initially unknown to the decision-maker; learning these distributions while maximizing the reward over time is the core of the problem, i.e. finding the best trade-off between exploration and exploitation.
Performance is measured by minimizing the expected (pseudo-)regret, i.e. the reward lost by choosing suboptimal arms instead of the optimal one.
- A multi-armed bandit is a tuple $\langle\mathcal{A}, \mathcal{R}\rangle$
- $\mathcal{A}$ is a known set of $m$ actions (or "arms")
- $\mathcal{R}^a(r)=\mathbb{P}[r \mid a]$ is an unknown probability distribution over rewards
- At each step $t$ the agent selects an action $a_t \in \mathcal{A}$
- The environment generates a reward $r_t \sim \mathcal{R}^{a_t}$
- The goal is to maximize cumulative reward $\sum_{\tau=1}^t r_\tau$
- The action-value is the mean reward for action $a$,
$
Q(a)=\mathbb{E}[r \mid a]
$
- The optimal value $V^*$ is
$
V^*=Q\left(a^*\right)=\max _{a \in \mathcal{A}} Q(a)
$
- The (instantaneous) regret is the opportunity loss for one step: the difference between the optimal value and the expected reward of the chosen action.
$
I_t=\mathbb{E}\left[V^*-Q\left(a_t\right)\right]
$
- The total regret is the total opportunity loss: the sum of instantaneous regrets over time.
$
L_t=\mathbb{E}\left[\sum_{\tau=1}^t\left(V^*-Q\left(a_\tau\right)\right)\right]
$
- Maximize cumulative reward $\equiv$ minimize total regret: while the goal is ultimately to maximize cumulative reward, most multi-armed bandit algorithms are framed in terms of minimizing regret.
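As a concrete illustration of these definitions, here is a minimal sketch of a stochastic (Bernoulli) bandit environment together with the total-regret computation; the class and function names are illustrative, not part of any standard library.

```python
import numpy as np

class BernoulliBandit:
    """Stochastic bandit: arm a pays 1 with probability means[a], else 0."""

    def __init__(self, means, seed=None):
        self.means = np.asarray(means, dtype=float)   # true (unknown) Q(a)
        self.rng = np.random.default_rng(seed)

    @property
    def optimal_value(self):
        # V* = max_a Q(a)
        return self.means.max()

    def pull(self, a):
        # r_t ~ R^{a_t}: Bernoulli reward with mean Q(a)
        return float(self.rng.random() < self.means[a])


def total_regret(bandit, actions):
    """L_t = sum_{tau=1}^t (V* - Q(a_tau)) for a sequence of chosen arms."""
    return sum(bandit.optimal_value - bandit.means[a] for a in actions)
```

For instance, on a bandit with means $(0.2, 0.5, 0.7)$, always pulling arm 1 for ten steps incurs a total regret of $10 \cdot (0.7 - 0.5) = 2$.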
## Exploration vs exploitation
To approach the optimal policy we need to find a trade-off between:
- **exploration**: choosing new, unexplored actions, even at random, to increase our knowledge about the problem.
- **exploitation**: using only the current knowledge to make decisions, following known-good paths but risking missing some opportunities.
At the beginning we want to lean more towards exploration, to see all the possible opportunities, and then shift more and more towards exploiting the accumulated knowledge, as in the sketch below.
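One simple way to encode this shift (a generic illustration, not the UCB or Thompson Sampling strategies discussed below) is an exploration probability that decays with the round number, e.g. $\varepsilon$-greedy with $\varepsilon_t = 1/t$:

```python
import numpy as np

def decaying_epsilon_greedy_action(q_estimates, t, rng=None):
    """Explore with probability eps_t = 1/t, otherwise exploit the best current estimate."""
    rng = rng or np.random.default_rng()
    eps_t = 1.0 / max(t, 1)                         # mostly exploration early, mostly exploitation later
    if rng.random() < eps_t:
        return int(rng.integers(len(q_estimates)))  # explore: uniformly random arm
    return int(np.argmax(q_estimates))              # exploit: arm with the best estimate so far
```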
## Upper Confidence Bound (UCB)
The Upper Confidence Bound (UCB) algorithm addresses the exploration-exploitation dilemma by embracing the principle of "optimism in the face of uncertainty".
It selects actions, or "arms," based on their potential for being optimal, considering both their estimated rewards and the uncertainty surrounding those estimates.
This uncertainty is quantified using upper confidence bounds on the expected rewards, guiding the algorithm away from prematurely converging on suboptimal choices.
Key to UCB's approach is the application of the Hoeffding inequality, which helps calculate these bounds, with variations like UCB1, UCBV, and BayesUCB offering different strategies for computing them.
Consider a set of i.i.d. random variables $X_1, \ldots, X_t$ within $[0,1]$. The sample mean $\bar{X}_t=\frac{1}{t} \sum_{\tau=1}^t X_\tau$ approximates the expected value, and the Hoeffding inequality gives us a way to bound the probability of the true mean exceeding this estimate by a margin $u$:
$
\mathbb{P}\left[\mathbb{E}[X]>\bar{X}_t+u\right] \leq e^{-2 t u^2}
$
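Applying this bound to the $N_t(a)$ rewards observed for a given arm $a$ and solving $e^{-2 N_t(a) u^2}=p$ for the margin $u$ gives the width of an upper confidence bound at confidence level $p$. Choosing a confidence level that shrinks with time, e.g. $p=t^{-4}$ (the standard choice behind UCB1, added here as a reasoning step not spelled out above), yields the exploration bonus used below:
$
u=\sqrt{\frac{-\log p}{2 N_t(a)}} \quad \stackrel{p=t^{-4}}{\longrightarrow} \quad U_t(a)=\sqrt{\frac{2 \log t}{N_t(a)}}
$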
In practice, the UCB1 algorithm selects the action $a_t$ at time $t$ as follows:
$
a_t=\underset{a \in \mathcal{A}}{\operatorname{argmax}} \left[Q(a)+\sqrt{\frac{2 \log t}{N_t(a)}}\right]
$
Where $Q(a)$ is the estimated mean reward for action $a$, and $N_t(a)$ is the number of times action $a$ has been selected prior to time $t$.
The term $\sqrt{\frac{2 \log t}{N_t(a)}}$ serves as the "uncertainty" or "confidence interval" component, which decreases as the number of times $N_t(a)$ an arm $a$ is played increases, thus reducing the exploration incentive for that arm over time.
This formula ensures a balance between exploiting actions with high estimated rewards and exploring actions with fewer trials.
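For a concrete (purely illustrative) sense of scale: at $t=100$, an arm pulled 10 times gets a bonus of $\sqrt{2 \log 100 / 10} \approx 0.96$, while an arm pulled 50 times gets only $\sqrt{2 \log 100 / 50} \approx 0.43$, so the less-explored arm receives the larger optimism boost.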
The UCB approach is known for its logarithmic asymptotic total regret, indicating efficient performance over time.
To recap, the algorithm:
1) iteratively selects the arm with the highest upper confidence bound (calculated as the sum of the estimated payoff and an uncertainty term),
2) observes the reward,
3) updates the arm's estimated payoff and recalculates the bounds.
This loop continues across a specified number of rounds, optimizing action selection based on both past rewards and the potential for discovery.
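A minimal sketch of this loop, assuming the illustrative `BernoulliBandit` interface above (i.e. a `pull(arm)` method returning a reward in $[0, 1]$):

```python
import math
import numpy as np

def ucb1(bandit, n_arms, n_rounds):
    """UCB1: play each arm once, then pick the arm with the highest upper confidence bound."""
    counts = np.zeros(n_arms)                   # N_t(a): number of pulls per arm
    values = np.zeros(n_arms)                   # Q(a): running mean reward per arm
    rewards = []

    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            a = t - 1                           # initialisation: try every arm once
        else:
            bonus = np.sqrt(2.0 * math.log(t) / counts)
            a = int(np.argmax(values + bonus))  # optimism in the face of uncertainty

        r = bandit.pull(a)                      # observe the reward
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]  # incremental mean update
        rewards.append(r)

    return rewards, values
```

With the `BernoulliBandit` sketched earlier, `ucb1(BernoulliBandit([0.2, 0.5, 0.7]), 3, 1000)` typically concentrates most pulls on the best arm after a short exploration phase.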
## Thompson Sampling
Thompson Sampling, a Bayesian method, optimizes decision-making under uncertainty by balancing exploration and exploitation. This approach starts with a prior distribution for each option or "arm," typically assuming all outcomes are equally likely (uniform distribution). In each round, it samples from these priors, selects the arm with the highest value, and updates the priors based on the outcome (success or failure), adjusting the alpha or beta parameters of the distribution.
The core principle of Thompson Sampling is probability matching, guided by Bayes' law to compute posterior distributions $\mathbb{P}[\mathcal{R} \mid h_t]$ based on the historical data $h_t$.
Unlike methods that directly estimate reward distributions, Thompson Sampling maintains and updates a belief over these distributions, allowing for a natural integration of exploration and exploitation. By sampling from the posterior distributions, it inherently explores less certain options while exploiting known rewarding ones.
Thompson Sampling's efficiency stems from the fact that its regret upper bound matches the theoretical lower bound, showcasing (asymptotically) optimal performance. The algorithm favors the action with the largest sampled expected reward $\hat{r}(a_i)$ at each step.
In the context of Bernoulli trials, where outcomes are binary (success/failure), Thompson Sampling benefits from using a beta distribution as the conjugate prior.
This choice simplifies Bayesian updating, as the posterior remains in the same family (a beta distribution), facilitating analytical tractability. The beta distribution, defined by parameters $\alpha$ and $\beta$, adjusts its shape based on observed outcomes, starting from a uniform $\mathrm{Beta}(1, 1)$ prior for each arm.
The updating mechanism is straightforward:
- In case of success: $\phi(t + 1) = \mathrm{Beta}(\alpha_t + 1, \beta_t)$
- In case of failure: $\phi(t + 1) = \mathrm{Beta}(\alpha_t, \beta_t + 1)$
This iterative process refines our beliefs about each arm's success probability, guiding the selection towards the most rewarding options over time.
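A minimal sketch of Beta-Bernoulli Thompson Sampling under the same illustrative `pull(arm)` interface used above:

```python
import numpy as np

def thompson_sampling(bandit, n_arms, n_rounds, seed=None):
    """Beta-Bernoulli Thompson Sampling: sample each posterior, play the arm with the largest sample."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_arms)            # Beta(1, 1) uniform prior for every arm
    beta = np.ones(n_arms)
    rewards = []

    for _ in range(n_rounds):
        theta = rng.beta(alpha, beta)  # one sample per arm from its current posterior
        a = int(np.argmax(theta))      # probability matching: play the largest sample
        r = bandit.pull(a)             # observe success (1) or failure (0)
        alpha[a] += r                  # success: Beta(alpha + 1, beta)
        beta[a] += 1 - r               # failure: Beta(alpha, beta + 1)
        rewards.append(r)

    return rewards, alpha, beta
```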