Learning Techniques in Multi-Armed Bandits

Thumbnail Image


Publication or External Link






Multi-Armed bandit problem is a classic example of the exploration vs. exploitation dilemma in which a collection of one-armed bandits, each with unknown but fixed reward probability, is given. The key idea is to develop a strategy, which results in the arm with the highest reward probability to be played such that the total reward obtained is maximized. Although seemingly a simplistic problem, solution strategies are important because of their wide applicability in a myriad of areas such as adaptive routing, resource allocation, clinical trials, and more recently in the area of online recommendation of news articles, advertisements, coupons, etc. to name a few.

In this dissertation, we present different types of Bayesian Inference based bandit algorithms for Two and Multiple Armed Bandits which use Order Statistics to select the next arm to play. The Bayesian strategies, also known in literature as Thompson Method are shown to function well for a whole range of values, including very small values, outperforming UCB and other commonly used strategies. Empirical analysis results show a significant improvement on multiple datasets.

In the second part of the dissertation, two types of Successive Reduction

(SR) strategies - 1) Successive Reduction Hoeffding (SRH) and 2) Successive

Reduction Order Statistics (SRO) are introduced. Both use an Order Statistics based Sampling method for arm selection, and then successively eliminate bandit arms from consideration depending on a confidence threshold. While SRH uses Hoeffding Bounds for elimination, SRO uses the probability of an arm being superior to the currently selected arm to measure confidence. The empirical results show that the performance advantage of

proposed SRO scheme increasing persistently with the number of bandit arms while the SRH scheme shows similar performance as pure Thompson Sampling Method.

In the third part of the dissertation, the assumption of the reward

probability being fixed is removed. We model problems where reward

probabilities are drifting , and introduce a new method

called Dynamic Thompson Sampling (DTS) which adapts the reward probability estimate faster than traditional schemes and thus leads to improved performance in terms of lower regret. Our empirical results demonstrate that DTS method outperforms the state-of-the-art techniques, namely pure Thompson Sampling, UCB-Normal and UCB-f, for the case of dynamic reward probabilities. Furthermore, the performance advantage of the proposed DTS scheme increases persistently with the number of bandit arms.

In the last part of the dissertation, we delve into arm space decomposition and use of multiple agents in the Bandit process. The three most important characteristics of a multi-agent systems are 1) Autonomy --- agents are completely or partially autonomous, 2) Local views --- agents are restricted to a local view of information, and 3) Decentralization of control --- each agent influences a limited part of the overall decision space. We study and compare Centralized vs. Decentralized Sampling Algorithm in Multi-Armed Bandit problems in the context of common payoff games .

In the Centralized Decision Making, a central agent maintains a global view of the currently available information and makes a decision to choose the next arm just as the regular Bayesian Algorithm. In Decentralized Decision Making, each agent maintains a local view of the arms and makes decisions just based on the local information available at its end without communicating with other agents. The Decentralized Decision Making is modeled as a Game Theory problem. Our results show that the Decentralized systems perform well for both the cases of Pure as well Mixed Nash equilibria and their performance scales well with the increase in the number of arms due to reduced dimensionality of the space.

We thus believe that this dissertation establishes Bayesian Multi-Armed bandit strategies as one of the prominent strategies in the field of bandits and opens up venues for new interesting research in the future.