From: GzLi (笑梨), Board: DataMining
Title: Machine Learning 47(2/3) <2>
Posted at: Nanjing University Lily BBS (Thu Jul 18 00:45:08 2002), on-site post
Article: Finite-time Analysis of the Multiarmed Bandit Problem
Journal: Machine Learning
ISSN: 0885-6125
Volume/Issue: Vol. 47, No. 2/3    Publication date: May/June 2002
Pages: 235-256 (22 pages in total)
Authors:
Peter Auer, University of Technology Graz, A-8010 Graz, Austria.
pauer@igi.tu-graz.ac.at
Nicolò Cesa-Bianchi, DTI, University of Milan, via Bramante 65, I-26013
Crema, Italy. cesa-bianchi@dti.unimi.it
Paul Fischer, Lehrstuhl Informatik II, Universität Dortmund, D-44221
Dortmund, Germany. fischer@ls2.informatik.uni-dortmund.de
Abstract:
Reinforcement learning policies face the exploration versus exploitation
dilemma, i.e. the search for a balance between exploring the environment
to find profitable actions and taking the empirically best action as often
as possible. A popular measure of a policy's success in addressing this
dilemma is the regret, that is, the loss due to the fact that the globally
optimal policy is not followed all the time. One of the simplest examples
of the exploration/exploitation dilemma is the multi-armed bandit problem.
Lai and Robbins were the first to show that the regret for this problem
has to grow at least logarithmically in the number of plays. Since then,
policies which asymptotically achieve this regret have been devised by
Lai and Robbins and many others. In this work we show that the optimal
logarithmic regret is also achievable uniformly over time, with simple
and efficient policies, and for all reward distributions with bounded
support.
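
The simplest of the policies the paper analyzes is UCB1: play each arm once,
then always pull the arm maximizing its empirical mean reward plus
sqrt(2*ln(n)/n_j), where n is the overall number of plays done so far and n_j
the number of plays of arm j. Below is a minimal Python sketch of UCB1 under
the paper's assumption of rewards in [0, 1]; the arms-as-callables interface
and the Bernoulli test arms are illustrative choices of mine, not from the
paper.

import math
import random

def ucb1(arms, horizon):
    # Minimal UCB1 sketch. `arms` is a list of zero-argument callables
    # returning rewards in [0, 1]; `horizon` is the total number of plays.
    k = len(arms)
    counts = [0] * k      # n_j: number of times arm j has been played
    means = [0.0] * k     # x_bar_j: empirical mean reward of arm j

    # Initialization: play each arm once.
    for j in range(k):
        counts[j] = 1
        means[j] = arms[j]()

    # Main loop: pull the arm maximizing x_bar_j + sqrt(2 ln n / n_j),
    # where n is the overall number of plays done so far.
    for n in range(k, horizon):
        j = max(range(k), key=lambda i:
                means[i] + math.sqrt(2.0 * math.log(n) / counts[i]))
        reward = arms[j]()
        counts[j] += 1
        means[j] += (reward - means[j]) / counts[j]  # incremental mean update

    return means, counts

# Illustrative run: two Bernoulli arms with success rates 0.4 and 0.6
# (hypothetical test data, not from the paper).
if __name__ == "__main__":
    arms = [lambda: float(random.random() < 0.4),
            lambda: float(random.random() < 0.6)]
    means, counts = ucb1(arms, 10000)
    print("empirical means:", means, "play counts:", counts)

Running this, the suboptimal arm's play count grows only logarithmically in
the horizon, which is the uniform-over-time logarithmic regret the abstract
refers to.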
--
*** Dignified and steady, humble and forbearing; see things through, keep the good of others at heart ***
Have you mined today? DataMining http://DataMining.bbs.lilybbs.net
MathTools http://bbs.sjtu.edu.cn/cgi-bin/bbsdoc?board=MathTools
※ Modified: by GzLi on Jul 18 00:46:50. [FROM: 211.80.38.29]
※ Source: Nanjing University Lily BBS bbs.nju.edu.cn [FROM: 211.80.38.29]