Multi-Armed Bandits in Python

Multi-Armed Bandits: Upper Confidence Bound Algorithms with Python Code

Learn about the different Upper Confidence Bound bandit algorithms (UCB1, UCB1-Tuned, UCB1-Normal). Python code provided for all experiments.

Introduction

In this series of posts, we experiment with different bandit algorithms in order to optimise our movie nights — more specifically how we select movies and restaurants for food delivery!

For newcomers, the name bandit comes from slot-machines (known as one-armed bandits). You can think of it as something that can reward you (or not) every time you interact with it (pull its arm). The objective is, given a bunch of bandits that give different rewards, to identify the one that gives the highest ones, as fast as possible. As we start playing and continuously collect data about each bandit, the bandit algorithm helps us choose between exploiting the one that gave us the highest rewards so far and exploring others.


In our story, we use restaurants instead of bandits and our reward is food satisfaction! 🙂

Multi-Armed Bandits: Epsilon-Greedy Algorithm with Python Code

Learn how Epsilon-Greedy works. Full Python code provided for all experiments.

Multi-Armed Bandits: Optimistic Initial Values Algorithm with Python Code

Everything’s great until proven otherwise. Learn about the Optimistic Initial Values algorithm. Python code provided…

It’s been a while since you and your friend started using bandit algorithms to optimise your movie nights. So far, you’ve been using Epsilon-Greedy (with 𝛆 = 1/#actions) to find the best restaurant, and your friend has stubbornly been using the Optimistic Initial Values algorithm to pick movie directors (even though your simulations have shown it’s likely not as good for your use case).
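As a quick refresher, here is a minimal sketch of the Epsilon-Greedy selection rule with 𝛆 = 1/#actions; the function and variable names are illustrative and not taken from the series’ code:

```python
import numpy as np

def epsilon_greedy_choice(estimated_rewards, rng):
    """Pick a random action with probability epsilon, otherwise the current best."""
    epsilon = 1.0 / len(estimated_rewards)  # epsilon = 1 / #actions
    if rng.random() < epsilon:
        return int(rng.integers(len(estimated_rewards)))  # explore
    return int(np.argmax(estimated_rewards))              # exploit

# Example: running estimates of mean food satisfaction per restaurant
rng = np.random.default_rng(0)
print(epsilon_greedy_choice([4.2, 3.9, 4.5], rng))
```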

While killing time on the internet, trapped inside because of the lockdown, you come across a family of bandit algorithms based on Upper Confidence Bounds.
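As a preview, here is a minimal sketch of the standard UCB1 selection rule (the textbook formulation, not the article’s own code; names and values are illustrative): each arm’s estimated mean is inflated by an exploration bonus that shrinks as the arm is pulled more often.

```python
import numpy as np

def ucb1_choice(counts, mean_rewards, t):
    """Pick the arm maximising its mean reward plus an exploration bonus."""
    counts = np.asarray(counts, dtype=float)
    mean_rewards = np.asarray(mean_rewards, dtype=float)
    if np.any(counts == 0):
        # Play every arm once before applying the UCB1 formula
        return int(np.argmin(counts))
    ucb = mean_rewards + np.sqrt(2.0 * np.log(t) / counts)
    return int(np.argmax(ucb))

# Example: 3 restaurants, tried 5, 3 and 2 times so far, at round t = 10
print(ucb1_choice([5, 3, 2], [4.1, 4.3, 3.8], t=10))
```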



Python implementation of UCB, EXP3 and Epsilon greedy algorithms

kulinshah98/Multi-Armed-Bandit-Algorithms


Multi-Armed Bandit Algorithms

Python implementations of several multi-armed bandit algorithms: the Upper Confidence Bound (UCB) algorithm, the Epsilon-Greedy algorithm, and the EXP3 algorithm.

  • Implemented all algorithms for a 2-armed bandit.
  • Each algorithm uses a time horizon T of 10000.
  • Each experiment is repeated 100 times to obtain mean results.
  • Plotted the cumulative regret at time t against the rounds t = 1, …, T.
  • Plotted the percentage of times the optimal arm is played against the rounds t = 1, …, T.
  • Final plots are given in the Figures/ folder.
  • Code for all algorithms is given in the Code/ folder.
  • The input of each algorithm is the mean of the first arm and the mean of the second arm.
  • Note that, for simplicity, I assumed the mean of the first arm is greater than the mean of the second arm.
  • To check the effect of epsilon on the Epsilon-Greedy algorithm, I ran it for epsilon = 0.01 and 0.1.
  • Figures for this epsilon study are also given in the Figures/ folder; a minimal sketch of the overall experimental setup is shown below.
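To make the setup above concrete, here is a hedged sketch of a 2-armed Bernoulli bandit experiment in this spirit. It is not the repository’s actual code; the function name, epsilon value, and arm means are illustrative only.

```python
import numpy as np

def run_epsilon_greedy(mean1, mean2, epsilon=0.1, T=10000, seed=0):
    """Epsilon-Greedy on a 2-armed Bernoulli bandit; returns cumulative regret."""
    rng = np.random.default_rng(seed)
    means = np.array([mean1, mean2])       # assumes mean1 > mean2
    counts = np.zeros(2)
    estimates = np.zeros(2)
    regret = np.zeros(T)
    for t in range(T):
        if rng.random() < epsilon:
            arm = int(rng.integers(2))             # explore
        else:
            arm = int(np.argmax(estimates))        # exploit
        reward = float(rng.random() < means[arm])  # Bernoulli reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        regret[t] = means.max() - means[arm]       # expected (pseudo-)regret
    return np.cumsum(regret)

print(run_epsilon_greedy(0.9, 0.6, epsilon=0.1)[-1])  # final cumulative regret
```

Averaging the returned cumulative-regret curve over many repetitions (100 in the repository’s setup) gives the kind of plots described above.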


Multi-armed bandit implementation

In the multi-armed bandit (MAB) problem we try to maximise our gain over time by "gambling" on slot machines ("bandits") that have different but unknown expected outcomes.

The concept is typically used as an alternative to A/B testing in marketing research or website optimization. For example, testing which marketing email leads to the most newsletter signups, or which webshop design leads to the most sales.

The benefit of viewing website optimization as a multi-armed bandit problem instead of an A/B test is that no pre-defined sample sizes are needed and the algorithm starts optimizing the outcome (e.g. click rate) from the beginning, while an A/B test needs to run through all predefined samples before a conclusion can be drawn.

What follows will illustrate how to implement and solve a very simple multi-armed bandit problem with a probabilistic algorithm.

```python
# Imports
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']

import numpy as np
import scipy
import scipy.stats as stats
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')
np.random.seed(42)
```

Defining the bandit experiment

We will simulate 3 bandits, each with an underlying probability of winning stored in p_bandits. The pull(i) method will "pull the arm" of bandit $i$ and randomly decide if the pull was a win (1) or not (0) based on that underlying probability.

```python
# Define the multi-armed bandits
nb_bandits = 3  # Number of bandits
# True probability of winning for each bandit
p_bandits = [0.45, 0.55, 0.60]


def pull(i):
    """Pull arm of bandit with index `i` and return 1 if win, else return 0."""
    if np.random.rand() < p_bandits[i]:
        return 1
    else:
        return 0
```

Bayesian interpretation

Each pull of a specific bandit will result in a win with a certain probability. The higher this probability, the more likely pulling the arm of that bandit is to result in a win. However, we don’t know what this probability is, so we will have to model it based on our observations of a certain bandit resulting in a win or not. We are going to model the unknown probability with a parameter $\theta$. Based on observing a single outcome $x$ we can model this as the distribution $P(\theta \mid x)$. Using Bayes’ rule we can write this as:

$$P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}$$

$P(\theta \mid x)$ is called the posterior distribution of $\theta$ after observing $x$. We will be able to compute this via the likelihood $P(x \mid \theta)$ and the prior $P(\theta)$. In this case the likelihood function $P(x \mid \theta)$ follows the Bernoulli distribution.

Beta prior

A good choice for the prior $P(\theta)$ would be the Beta distribution since it is a conjugate prior to the Bernoulli distribution. This means that if the likelihood function $P(x \mid \theta)$ is Bernoulli distributed and the prior distribution $P(\theta)$ is Beta distributed, then the posterior $P(\theta \mid x)$ will also be Beta distributed. More specifically, if the prior is $Beta(\alpha, \beta)$ then after observing a win $x=1$ the posterior becomes $Beta(\alpha+1, \beta)$, and after observing a loss $x=0$ the posterior becomes $Beta(\alpha, \beta+1)$ [1].
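For clarity, here is the standard one-line conjugacy derivation (added here; it is not spelled out in the original post), combining the Bernoulli likelihood with the Beta prior:

$$P(\theta \mid x) \;\propto\; P(x \mid \theta)\, P(\theta) \;\propto\; \theta^{x}(1-\theta)^{1-x} \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1} \;=\; \theta^{(\alpha+x)-1}(1-\theta)^{(\beta+1-x)-1},$$

which is the kernel of a $Beta(\alpha+x, \beta+1-x)$ distribution, i.e. $Beta(\alpha+1, \beta)$ after a win ($x=1$) and $Beta(\alpha, \beta+1)$ after a loss ($x=0$).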

This means that after every observation we can use the posterior as the prior for the next time we pull the bandit’s arm.
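As a small illustration of this counting-based update (the counts below are made up, and the variable names are not taken from the code further down), the posterior for one bandit can be built directly from its win and trial counts:

```python
import scipy.stats as stats

n_wins, n_trials = 7, 10  # hypothetical counts for a single bandit

# Starting from a Beta(1, 1) prior, the posterior after these observations is
posterior = stats.beta(a=1 + n_wins, b=1 + n_trials - n_wins)

print(posterior.mean())          # posterior mean of the win probability
print(posterior.interval(0.95))  # 95% credible interval
```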

Maximize reward / Minimize regret

The goal of the multi-armed bandit problem is to maximize reward (minimize regret). There is an exploitation-exploration tradeoff we have to make here. The more we pull the arm of our perceived best bandit, the more certain we become of that bandit’s probability. But other bandits that we haven’t pulled as often might have a lower expected probability with higher uncertainty. There is a chance that they are actually better than our perceived best bandit.

We will use Thompson sampling to overcome this problem. In Thompson sampling, at each step we sample a probability $\theta$ from each bandit’s prior and pull the bandit with the highest sampled probability, repeating this until finished.

We will start with the prior $Beta(\alpha=1, \beta=1)$, which corresponds to a uniform prior between 0 and 1. The run is simulated for 1000 steps and the results at certain steps are plotted below.

```python
# Define plotting functions
# Iterations to plot
plots = [1, 2, 5, 10, 25, 50, 100, 200, 500, 1000]


def plot(priors, step, ax):
    """Plot the priors for the current step."""
    plot_x = np.linspace(0.001, .999, 100)
    for prior in priors:
        y = prior.pdf(plot_x)
        p = ax.plot(plot_x, y)
        ax.fill_between(plot_x, y, 0, alpha=0.2)
    ax.set_xlim([0, 1])
    ax.set_ylim(bottom=0)
    ax.set_title(f'Priors at step {step:d}')
```

```python
# Simulate multi-armed bandit process and update posteriors

# Setup plot
fig, axs = plt.subplots(5, 2, figsize=(8, 10))
axs = axs.flat

# The number of trials and wins will represent the prior for each
# bandit with the help of the Beta distribution.
trials = [0, 0, 0]  # Number of times we tried each bandit
wins = [0, 0, 0]    # Number of wins for each bandit

n = 1000
# Run the trial for `n` steps
for step in range(1, n+1):
    # Define the prior based on current observations
    bandit_priors = [
        stats.beta(a=1+w, b=1+t-w) for t, w in zip(trials, wins)]
    # Plot prior
    if step in plots:
        plot(bandit_priors, step, next(axs))
    # Sample a probability theta for each bandit
    theta_samples = [d.rvs(1) for d in bandit_priors]
    # Choose a bandit
    chosen_bandit = np.argmax(theta_samples)
    # Pull the bandit
    x = pull(chosen_bandit)
    # Update trials and wins (defines the posterior)
    trials[chosen_bandit] += 1
    wins[chosen_bandit] += x

plt.tight_layout()
plt.show()
```
