Linear Bandits in Unknown Environments

Abstract : In contextual bandit problems, an agent has to choose an action among a bigger set of available ones at each decision step, according to features observed on them. The goal is to define a decision strategy that maximizes the cumulative reward of actions over time. We focus on the specific case where the features of each action correspond to some kind of a constant profile, which can be used to determine its intrinsic utility for the task in concern. If there exists an unknown linear application that allows rewards to be mapped from profiles, this can be leveraged to greatly improve the exploitation-exploration trade-off of stationary stochastic methods like UCB. In this paper, we consider the case where action profiles are unknown beforehand. Instead, the agent only observes sample vectors, with mean equal to the true profiles, for a subset of actions at each decision step. We propose a new algorithm, called SampLinUCB, and derive a finite time high probability upper bound on its regret. We also provide numerical experiments on a task of focused data capture from online social networks.
Liste complète des métadonnées
Contributor : Sylvain Lamprier <>
Submitted on : Tuesday, August 30, 2016 - 4:41:25 PM
Last modification on : Thursday, March 21, 2019 - 2:18:36 PM



Thibault Gisselbrecht, Sylvain Lamprier, Patrick Gallinari. Linear Bandits in Unknown Environments. ECML PKDD 2016 - European Conference on Machine Learning and Knowledge Discovery in Databases, Sep 2016, Riva Del Garda, Italy. pp.282-298, ⟨10.1007/978-3-319-46227-1_18⟩. ⟨hal-01357953⟩



Record views