Multi-Armed Bandits¶
What is the multi-armed bandit problem?¶
Imagine you are in front of K slot machines. Each machine has a possibly different, unknown payout rate. You are allowed T pulls in total. The problem is how to allocate those pulls among the machines to maximize your total payout.
The mathematical setup is as follows:
- K random variables \(\{A_1, ..., A_K\}\) with unknown payout distributions
- At each time step \(t=1, ..., T\) we pull a single arm \(a_t\)
- We observe only the payout \(r_{t,a_t}\) of the pulled arm (the payout \(r_{t,a}\) for any \(a \neq a_t\) is unobserved)
- We want to maximize the total payout \(\sum_{t=1}^{T} r_{t,a_t}\)
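As a concrete illustration of this setup, here is a minimal sketch with Bernoulli payouts; the BernoulliBandit class, the payout rates, and the random policy are assumptions made for the example, not part of any particular package:

    import random

    class BernoulliBandit:
        """K slot machines; pulling arm k pays 1 with (unknown) probability payout_rates[k]."""

        def __init__(self, payout_rates):
            self.payout_rates = payout_rates  # hidden from the player in a real problem

        def pull(self, arm):
            # Only the payout r_{t, a_t} of the chosen arm is observed
            return 1 if random.random() < self.payout_rates[arm] else 0

    # Three machines with payout rates we pretend not to know, and T = 1000 pulls
    bandit = BernoulliBandit([0.02, 0.05, 0.03])
    total_payout = sum(bandit.pull(random.randrange(3)) for _ in range(1000))
    print(total_payout)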
Tradeoffs¶
Each pull trades off gaining knowledge about the payout rates against getting the biggest payout with current knowledge (exploration vs. exploitation). We can define the regret after \(T\) pulls as follows:

\[R_T = T\mu^* - \sum_{t=1}^{T} r_{t,a_t},\]

where \(\mu^*\) is the expected payout of the best arm.
Note
Any policy that explores at all will have non-zero expected regret.
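To make the regret definition concrete, the sketch below compares a purely random policy against always playing the best arm; the payout rates and the horizon are illustrative assumptions:

    import random

    T = 10000
    payout_rates = [0.02, 0.05, 0.03]     # unknown to the policy
    best_rate = max(payout_rates)         # mean payout mu* of the best arm

    def pull(arm):
        return 1 if random.random() < payout_rates[arm] else 0

    # A purely exploratory policy: pick an arm uniformly at random every round
    realized_payout = sum(pull(random.randrange(len(payout_rates))) for _ in range(T))

    # Regret R_T = T * mu* - total realized payout; positive on average,
    # because uniform exploration keeps pulling the worse arms
    regret = T * best_rate - realized_payout
    print(regret)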
Applications¶
There are many applications that map well onto this problem. For example, we can model a Click Through Rate (CTR) problem as a multi-armed bandit instance. In this case, each arm is an ad or search result, each click is a success, and our objective is to maximize clicks.
Another application is experimentation (A/B testing), where we would like to find the best option as fast as possible while limiting how often bad options are chosen.
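As an illustration of the CTR framing, the sketch below treats three ad variants as arms and allocates impressions with a simple epsilon-greedy rule; the variant names, click-through rates, and epsilon value are made-up assumptions, not measured data:

    import random

    true_ctrs = {"ad_a": 0.031, "ad_b": 0.045, "ad_c": 0.027}   # hypothetical CTRs
    clicks = {ad: 0 for ad in true_ctrs}
    shows = {ad: 0 for ad in true_ctrs}
    epsilon = 0.1   # fraction of impressions spent on exploration

    for _ in range(50000):
        if random.random() < epsilon:
            ad = random.choice(list(true_ctrs))   # explore: show a random ad
        else:
            # exploit: show the ad with the best observed CTR so far
            ad = max(true_ctrs, key=lambda a: clicks[a] / shows[a] if shows[a] else 0.0)
        shows[ad] += 1
        clicks[ad] += 1 if random.random() < true_ctrs[ad] else 0

    # Most impressions end up on ad_b, the variant with the highest true CTR
    print({ad: shows[ad] for ad in true_ctrs})

Over time the exploit branch concentrates impressions on the variant with the highest observed CTR, which is exactly the allocation problem described above.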
Policies¶
There are many different policies for this problem.
We have implemented the following policies in our package:
Other policies include:
- Weighted random choice
- Epsilon-decreasing *
- UCB2 *
- SoftMax * (see the sketch after this list)
* Regret bounded as \(t \rightarrow \infty\)
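As one example from the list above, here is a minimal sketch of SoftMax arm selection, which chooses each arm with probability proportional to the exponential of its estimated payout; the temperature value and the payout estimates are illustrative assumptions:

    import math
    import random

    def softmax_choice(estimated_payouts, temperature=0.1):
        """Pick an arm with probability proportional to exp(estimate / temperature)."""
        weights = [math.exp(p / temperature) for p in estimated_payouts]
        total = sum(weights)
        r = random.random() * total
        cumulative = 0.0
        for arm, w in enumerate(weights):
            cumulative += w
            if r <= cumulative:
                return arm
        return len(weights) - 1   # guard against floating-point rounding

    # Arm 1 has the highest estimated payout, so it is chosen most often,
    # but the other arms still get explored
    arm_counts = [0, 0, 0]
    for _ in range(10000):
        arm_counts[softmax_choice([0.02, 0.05, 0.03])] += 1
    print(arm_counts)

Lower temperatures make the choice greedier; higher temperatures make it closer to uniform exploration.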
Pointers¶
You can learn more about multi-armed bandits at: http://www.youtube.com/watch?v=qAN6iyYPbEE
The slides for the talk are available at: http://slidesha.re/1zOrOJy