Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-stationary Rewards

Read:: - [ ] Besbes et al. (2019) - Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-stationary Rewards ➕2024-12-23 !!2 rd citation todoist Print::  ❌ Zotero Link:: Zotero Files:: attachment Reading Note:: Web Rip:: url:: http://arxiv.org/abs/1405.3316

TABLE without id
file.link as "Related Files",
title as "Title",
type as "type"
FROM "" AND -"Obsidian Assets"
WHERE citekey = "besbesOptimalExplorationExploitationMultiArmedBandit2019" 
SORT file.cday DESC

Abstract

In a multi-armed bandit (MAB) problem a gambler needs to choose at each round of play one of K arms, each characterized by an unknown reward distribution. Reward realizations are only observed when an arm is selected, and the gambler’s objective is to maximize his cumulative expected earnings over some given horizon of play T. To do this, the gambler needs to acquire information about arms (exploration) while simultaneously optimizing immediate rewards (exploitation); the price paid due to this trade-off is often referred to as the regret, and the main question is how small can this price be as a function of the horizon length T. This problem has been studied extensively when the reward distributions do not change over time; an assumption that supports a sharp characterization of the regret, yet is often violated in practical settings. In this paper, we focus on a MAB formulation which allows for a broad range of temporal uncertainties in the rewards, while still maintaining mathematical tractability. We fully characterize the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward “variation” and the minimal achievable regret. Our analysis draws some connections between two rather disparate strands of literature: the adversarial and the stochastic MAB frameworks.
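To make the exploration-exploitation trade-off and the regret notion above concrete, the sketch below simulates a small non-stationary bandit instance: arm means drift slowly over the horizon T, a simple epsilon-greedy learner plays against it, and cumulative dynamic regret is measured against the best arm at each round. This is only an illustrative toy, not the paper's policy; the drift model, the epsilon value, and all variable names are assumptions made for the example. In the paper's setting, the total amount of drift would be constrained by a "variation" budget, which is what links the extent of allowable reward variation to the minimal achievable regret.

```python
# Illustrative sketch (not the paper's algorithm): a K-armed Bernoulli bandit
# whose mean rewards drift over time, played by a simple epsilon-greedy policy.
# Dynamic regret is measured against the best arm at each round. All names and
# parameter values are hypothetical choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 10_000
eps = 0.1                       # exploration probability (assumed, not tuned)

# Slowly drifting arm means; the total drift plays the role of a variation budget.
t = np.arange(T)
means = np.stack([0.5 + 0.3 * np.sin(2 * np.pi * (t / T) + phase)
                  for phase in np.linspace(0, np.pi, K)])  # shape (K, T)

counts = np.zeros(K)
estimates = np.zeros(K)
regret = 0.0

for step in range(T):
    # Epsilon-greedy: explore uniformly with probability eps, otherwise exploit.
    if rng.random() < eps:
        arm = int(rng.integers(K))
    else:
        arm = int(np.argmax(estimates))
    reward = float(rng.random() < means[arm, step])   # Bernoulli draw
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
    # Dynamic regret: compare to the arm that is best *now*, not a fixed arm.
    regret += means[:, step].max() - means[arm, step]

print(f"cumulative dynamic regret over T={T}: {regret:.1f}")
```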

Quick Reference

Top Notes

Tasks

Topics

Whittle (1988) introduced the term restless bandits: a model in which the states (associated with reward distributions) of arms change in each step according to an arbitrary, yet known, stochastic process. Considered a “hard” class of problems (cf. Papadimitriou and Tsitsiklis 1994), this line of work has led to various approximations (see, e.g., Bertsimas and Niño-Mora 2000), relaxations (see, e.g., Guha and Munagala 2007), and restrictions of the state transition mechanism (see, e.g., Ortner et al. (2012) for irreducible Markov processes, and Azar et al. (2014) for a class of history-dependent rewards).

Extracted Annotations and Comments

Page 2

Examples include clinical trials (Zelen 1969), strategic pricing (Bergemann and Välimäki 1996), investment in innovation (Bergemann and Hege 2005), packet routing (Awerbuch and Kleinberg 2004), on-line auctions (Kleinberg and Leighton 2003), assortment selection (Caro and Gallien 2007), and on-line advertising (Pandey et al. 2007).


Useful

Page 3

in many application domains (including the ones mentioned above) temporal changes in the reward distribution structure are an intrinsic characteristic of the problem, and several attempts have been made to incorporate this into a stochastic MAB formulation.