Deterministic MDPs with Adversarial Rewards and Bandit Feedback
Raman Arora, Ofer Dekel, Ambuj Tewari
We consider a Markov decision process with deterministic state-transition dynamics, adversarially generated rewards that change arbitrarily from round to round, and a bandit feedback model in which the decision maker observes only the rewards it receives. In this setting, we present a novel and efficient online decision-making algorithm named MarcoPolo. Under mild assumptions on the structure of the transition dynamics, we prove that MarcoPolo enjoys a regret of O(T^(3/4) sqrt(log(T))) against the best deterministic policy in hindsight. Notably, our analysis does not rely on the stringent unichain assumption, which underlies much of the previous work on this topic.
PDF Link: /papers/12/p93-arora.pdf
@INPROCEEDINGS{p93-arora,
  AUTHOR = "Raman Arora and Ofer Dekel and Ambuj Tewari",
  TITLE = "Deterministic MDPs with Adversarial Rewards and Bandit Feedback",
  BOOKTITLE = "Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI-12)",
  PUBLISHER = "AUAI Press",
  ADDRESS = "Corvallis, Oregon",
  YEAR = "2012",
  PAGES = "93--101"
}