Multi-Armed Bandits
===================

**Contents:**

    #. `What is the multi-armed bandit problem?`_
    #. `Applications`_
    #. `Policies`_
    #. `Pointers`_
    #. `Examples`_

What is the multi-armed bandit problem?
---------------------------------------

Imagine you are in front of *K* slot machines. Each machine has a possibly different, unknown payout rate. You are allowed to pull an arm *T* times in total. The problem is how to allocate pulls among the machines to maximize your payout.

**The mathematical setup is as follows:**

* *K* random variables :math:`\{A_1, \ldots, A_K\}` with unknown payout distributions
* At each time step :math:`t = 1, \ldots, T` we pull a single arm :math:`a_t` and observe its payout
* The payout is :math:`r_{t,a_t}` (note: :math:`r_{t,a}` for :math:`a \neq a_t` is *unobserved*)
* We want to maximize

.. math:: \mathbb{E}\left[ \sum_{t=1}^T r_{t,a_t} \right] = \text{expected total reward}

**Tradeoffs**

Each pull trades off gaining knowledge about the payout rates against getting the biggest payout with current knowledge (Exploration vs. Exploitation). We can define regret as follows:

.. math:: \text{Regret} = \mathbb{E}\left[ \sum_{t=1}^T r_{t,a_t^{\star}} \right] - \mathbb{E}\left[ \sum_{t=1}^T r_{t,a_t} \right] = \text{best possible reward} - \text{actually obtained reward}

where :math:`a_t^{\star}` is the arm with the highest expected payout. A small simulation of this setup and its regret is sketched in `Examples`_ below.

.. note:: Any policy that explores at all will have non-zero regret.

Applications
------------

There are many applications that map well onto this problem. For example, we can model a Click Through Rate (CTR) problem as a multi-armed bandit instance: each arm is an ad or search result, each click is a success, and our objective is to maximize clicks.

Another application is experimentation (A/B testing), where we would like to find the best variant as quickly as possible while limiting how often bad variants are shown.

Policies
--------

There are many different policies for this problem. We have implemented the following policies in our package:

* :mod:`~moe.bandit.epsilon.epsilon_first.EpsilonFirst`
* :mod:`~moe.bandit.epsilon.epsilon_greedy.EpsilonGreedy`
* :mod:`~moe.bandit.ucb.ucb1.UCB1`
* :mod:`~moe.bandit.ucb.ucb1_tuned.UCB1Tuned`
* :mod:`~moe.bandit.bla.bla.BLA`

Other policies include:

* Weighted random choice
* `Epsilon-decreasing`_ \*
* `UCB2`_ \*
* `SoftMax`_ \*

\* Regret bounded as :math:`t \rightarrow \infty`

Sketches of the epsilon-greedy and UCB1 selection rules are given in `Examples`_ below.

.. _Epsilon-decreasing: http://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies
.. _UCB2: http://moodle.technion.ac.il/pluginfile.php/192340/mod_resource/content/0/UCB.pdf
.. _SoftMax: http://arxiv.org/pdf/1402.6028v1.pdf

Pointers
--------

You can learn more about multi-armed bandits at: http://www.youtube.com/watch?v=qAN6iyYPbEE

The slides for the talk are available at: http://slidesha.re/1zOrOJy
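
Examples
--------

To make the setup above concrete, here is a minimal simulation sketch in plain Python. Everything in it is illustrative rather than part of the package API: it assumes Bernoulli payout distributions with made-up rates, and the names ``TRUE_RATES``, ``pull``, and ``run_policy`` are hypothetical.

.. code-block:: python

    import random

    # Hypothetical true (unknown) payout rates of K = 3 Bernoulli arms.
    TRUE_RATES = [0.2, 0.5, 0.7]


    def pull(arm):
        """Sample a payout r_{t, a_t} from the chosen arm's distribution."""
        return 1.0 if random.random() < TRUE_RATES[arm] else 0.0


    def run_policy(choose_arm, horizon=1000):
        """Pull arms for ``horizon`` steps; return total reward and regret."""
        counts = [0] * len(TRUE_RATES)     # pulls of each arm so far
        rewards = [0.0] * len(TRUE_RATES)  # observed payout of each arm so far
        total_reward = 0.0
        for t in range(horizon):
            arm = choose_arm(counts, rewards, t)
            payout = pull(arm)  # only r_{t, a_t} is observed
            counts[arm] += 1
            rewards[arm] += payout
            total_reward += payout
        # Regret = best possible expected reward - actually obtained reward.
        regret = horizon * max(TRUE_RATES) - total_reward
        return total_reward, regret

A policy is any function of the observed history. The uniformly random policy ``lambda counts, rewards, t: random.randrange(len(counts))`` never exploits what it has learned, so its regret grows linearly with *T*.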
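
The selection rule below is the textbook epsilon-greedy strategy, written against the hypothetical interface above; it is a sketch, not a transcription of :mod:`~moe.bandit.epsilon.epsilon_greedy.EpsilonGreedy`. With probability ``epsilon`` it pulls a uniformly random arm (explore); otherwise it pulls the arm with the best observed mean payout (exploit).

.. code-block:: python

    import random


    def epsilon_greedy(counts, rewards, t, epsilon=0.1):
        """Textbook epsilon-greedy arm selection (illustrative sketch)."""
        if random.random() < epsilon:
            # Explore: pull a uniformly random arm.
            return random.randrange(len(counts))
        # Exploit: pull the arm with the highest observed mean payout;
        # unpulled arms get an infinite mean so each is tried at least once.
        means = [r / n if n > 0 else float("inf")
                 for r, n in zip(rewards, counts)]
        return max(range(len(counts)), key=lambda arm: means[arm])

Because it keeps exploring at a constant rate, epsilon-greedy's regret grows linearly with *T*; epsilon-decreasing variants shrink ``epsilon`` over time to reduce this.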
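
UCB1 removes the fixed exploration rate entirely by always pulling the arm with the highest upper confidence bound on its mean payout. The sketch below uses the standard UCB1 index of Auer, Cesa-Bianchi, and Fischer (2002); the package's :mod:`~moe.bandit.ucb.ucb1.UCB1` may differ in its details.

.. code-block:: python

    import math


    def ucb1(counts, rewards, t):
        """Standard UCB1 index: observed mean plus an exploration bonus."""
        # Initialization: pull each arm once before applying the index.
        for arm, n in enumerate(counts):
            if n == 0:
                return arm
        total_pulls = sum(counts)

        def index(arm):
            mean = rewards[arm] / counts[arm]
            # The bonus shrinks as an arm is pulled more often.
            bonus = math.sqrt(2.0 * math.log(total_pulls) / counts[arm])
            return mean + bonus

        return max(range(len(counts)), key=index)

For payouts in :math:`[0, 1]`, UCB1's expected regret grows only logarithmically in *T*, which is why a shrinking exploration bonus is usually preferred over a fixed exploration rate when the horizon is long.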