<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Gerald Tesauro. Temporal difference learning and TD-Gammon. <em>Communications of the ACM</em>, 38(3):58--68, 1995. (<a href="http://www.research.ibm.com/massdist/tdl.html">html</a>)

<p>Slides: <a href="http://www.cs.duke.edu/~mlittman/courses/cps370/slides/lect4.fm.ps">postscript</a>

<p>Another interesting application of TD(lambda) and neural networks applied to MDP-like problems is Crites and Barto's elevator controller.

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Robert H. Crites and Andrew G. Barto. Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, editors, <em>Advances in Neural Information Processing Systems 8</em>, 1996. The MIT Press. (<a href="http://www-anw.cs.umass.edu./cgi-bin/getfile/pub/anw/pub/crites/nips8.ps.Z">compressed postscript</a>)

<p>An even simpler and similarly successful example for cellular phone channel assignment is based on TD(0) and a linear function approximator.

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Satinder Singh and Dimitri Bertsekas. Reinforcement learning for dynamic channel allocation in cellular telephone systems. To appear in <em>Advances in Neural Information Processing Systems 9</em>, 1997. The MIT Press. (<a href="ftp://ftp.cs.colorado.edu/users/baveja/Papers/CellPhone.ps">postscript</a>)

<p>Tesauro's work is difficult to generalize because it simultaneously addresses many unsolved problems. More recent work has begun to tease apart the effect of using function approximation in dynamic programming from the use of the temporal difference algorithm.

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. S. Touretzky and T. K. Leen, editors, <em>Advances in Neural Information Processing Systems 7</em>, 1995. The MIT Press. (<a href="http://www.cs.cmu.edu/afs/cs/project/reinforcement/papers/boyan.funapprox-dp.ps.Z">compressed postscript</a>, <a href="http://www.cs.cmu.edu/afs/cs/project/reinforcement/ml95/index.html">Recent workshop on value function approximation</a>)

<p>These results were not convincing to everyone.

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In <em>Advances in Neural Information Processing Systems 8</em>, 1996. The MIT Press. (<a href="ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-96.ps">postscript</a>)

<p>Slides: <a href="http://www.cs.duke.edu/~mlittman/courses/cps370/slides/lect6.fm.ps">postscript</a>

<p>Some of the most interesting recent work has concerned theoretical results on when function approximation will and will not result in a convergent algorithm. Results exist concerning gradient descent methods and averaging methods.

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Armand Prieditis and Stuart Russell, editors, <em>Proceedings of the Twelfth International Conference on Machine Learning</em>, 30--37, 1995. Morgan Kaufmann. (<a href="http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/baird/papers/residual.ps.Z">compressed postscript</a>, <a href="http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/baird/papers/residual">html</a>)

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Geoffrey J. Gordon. Stable function approximation in dynamic programming. In Armand Prieditis and Stuart Russell, editors, <em>Proceedings of the Twelfth International Conference on Machine Learning</em>, 261--268, 1995. Morgan Kaufmann. (<a href="http://www.cs.cmu.edu/afs/cs.cmu.edu/user/ggordon/www/ml95-stable-dp.ps.Z">compressed postscript</a>)

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. <em>Machine Learning</em>, 22(1/2/3):59--94, 1996. (<a href="http://www.cs.duke.edu/~mlittman/courses/cps370/papers/vanroy-converge.ps">local postscript</a>)

<p>Homework: Propose a research project.

<h3>Stochastic Planning</h3>

The work on function approximation attempts to exploit structure in the state space but treats actions as black-box transformations from states to distributions over states. A promising alternative is to use symbolic descriptions of the actions to reason about entire classes of state-to-state transitions all at once. This is the approach taken in AI planning.

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> David McAllester and David Rosenblitt. Systematic nonlinear planning. In <em>Proceedings of the 9th National Conference on Artificial Intelligence</em>, 1991. (<a href="http://www.ai.mit.edu/people/dam/aaai91c.ps">postscript</a>, <a href="http://www.cse.fau.edu/~matt/cap4630/AIMA-lisp-code/planning/snlp/">LISP code</a>)

<p>In the last two years, planning algorithms have been proposed that differ substantially from the classic planners. Although unconventional, these planners have been shown empirically to result in much shorter running times (up to several orders of magnitude faster). Blum and Furst's algorithm views planning as a type of graph search, while Kautz and Selman reduce planning to a boolean satisfiability problem.

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Avrim Blum and Merrick Furst. Fast planning through planning graph analysis. In <em>Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI)</em>, pages 1636--1642, 1995. (<a href="http://www.cs.cmu.edu/People/avrim/Papers/planning.ps.gz">extended version in compressed postscript</a>)

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Henry Kautz and Bart Selman. Pushing the envelope: Planning, propositional logic, and stochastic search. In <em>Proceedings of the 13th National Conference on Artificial Intelligence</em>, 1996. (<a href="http://www.research.att.com:80/~kautz/papers-ftp/plan.ps">postscript</a>)

<p>Slides: <a href="http://www.cs.duke.edu/~mlittman/courses/cps370/slides/lect7b.fm.ps">postscript</a>

<p>In spite of recent algorithmic advances, the traditional view of planning ignores uncertainty. Uncertainty can be introduced gently by assuming a deterministic domain with some randomness added in.

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Jim Blythe. Planning with external events. In <em>Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence</em>, 1994. (<a href="http://calvin.prodigy.cs.cmu.edu/papers/uai94.ps">postscript</a>)

<p>The Buridan system introduces a more general representation for stochastic STRIPS operators and extends partial-order planning to stochastic domains. Its representation is equivalent in expressiveness to MDPs.

<p>Slides: <a href="http://www.cs.duke.edu/~mlittman/courses/cps370/slides/lect8a.fm.ps">postscript</a>

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Nicholas Kushmerick, Steve Hanks and Daniel S. Weld. An algorithm for probabilistic planning. <em>Artificial Intelligence</em>, 76(1-2):239--286, 1995. (<a href="file://cs.washington.edu/tr/1993/06/UW-CSE-93-06-03.PS.Z">compressed postscript</a>)

<p>The Buridan system has been expanded so that the plan representation is more powerful (though less powerful than a policy-like representation).
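<p>Since Buridan's representation is equivalent in expressiveness to MDPs, the decision problems it encodes can in principle be handed to standard dynamic programming. A minimal value-iteration sketch on a toy two-state MDP (the states, actions, transition probabilities, and rewards below are made up for illustration; they come from none of the papers above):

```python
# Value iteration on a tiny hypothetical MDP.
# P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {'a': [(0.8, 1, 1.0), (0.2, 0, 0.0)],
        'b': [(1.0, 0, 0.2)]},
    1: {'a': [(1.0, 1, 0.0)],
        'b': [(0.9, 0, 2.0), (0.1, 1, 0.0)]},
}
gamma = 0.9  # discount factor

# Repeatedly apply the Bellman optimality backup (synchronous sweeps).
V = {s: 0.0 for s in P}
for _ in range(1000):
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values())
         for s in P}

# Extract the greedy policy with respect to the converged values.
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
print(V, policy)
```

With a discount of 0.9, a thousand sweeps is far more than enough for convergence; the point is only that once a stochastic plan representation is grounded as states and probabilistic transitions, this kind of backup solves it exactly, which is why the expressiveness comparison above matters.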
<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Denise Draper, Steve Hanks and Dan Weld. Probabilistic planning with information gathering and contingent execution. Technical Report 93-12-04, University of Washington, Department of Computer Science, Seattle, WA, December 1993. (<a href="ftp://ftp.cs.washington.edu/tr/1993/12/UW-CSE-93-12-04.PS.Z">compressed postscript</a>)

<p>Slides: <a href="http://www.cs.duke.edu/~mlittman/courses/cps370/slides/lect8b.fm.ps">postscript</a>

<p>An area of intense interest (but remarkably little work!) is combining direct manipulation of STRIPS-type actions with a dynamic-programming-based algorithm. Several papers adopt the view that function approximation is a form of ``abstraction,'' the form of which can be derived automatically from a propositional representation of the planning problem.

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Richard Dearden and Craig Boutilier. Abstraction and approximate decision theoretic planning. To appear in <em>Artificial Intelligence</em>. (<a href="http://www.cs.ubc.ca/spider/cebly/Papers/abstraction.ps">postscript</a>)

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Craig Boutilier, Richard Dearden and Moises Goldszmidt. Exploiting structure in policy construction. To appear in <em>Proceedings of the International Joint Conference on Artificial Intelligence</em>, 1995. (<a href="http://www.cs.ubc.ca/spider/cebly/Papers/spi.ps">postscript</a>)

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Craig Boutilier and Richard Dearden. Approximating value trees in structured dynamic programming. To appear in <em>Proceedings of the Thirteenth International Conference on Machine Learning</em>, 1996. (<a href="http://www.cs.ubc.ca/spider/cebly/Papers/pruning.ps">postscript</a>)

<p>Slides: <a href="http://www.cs.duke.edu/~mlittman/courses/cps370/slides/lect8c.fm.ps">postscript</a>

<p>Summary Slides: <a href="http://www.cs.duke.edu/~mlittman/courses/cps370/slides/lect9.fm.ps">postscript</a>

<p><a name="now"></a>Next we'll deal with partially observable Markov decision processes and how to solve them.

<h3>Advanced Topics</h3>

If we make unexpectedly fast progress through the core topics, there are a number of interesting issues we could explore, including: hierarchical solutions to MDPs, partially observable MDPs, and solving games.

<p>Here are some papers I've recently found that might be useful to discuss. The first gives a notation for representing multi-player games of incomplete information that might be useful in defining a similar notation for MDPs. The second describes how to solve very large (and even continuous-state) MDPs efficiently using linear programming.

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Daphne Koller and Avi Pfeffer. Representations and solutions for game-theoretic problems. Preliminary version appeared in <em>Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI)</em>, Montreal, Canada, August 1995, pages 1185--1192. (<a href="http://robotics.Stanford.EDU/~koller/papers/galapaper.ps">postscript</a>)

<p><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/book.gif"> Michael Trick and Stanley Zin. A linear programming approach to solving stochastic dynamic programs, 1993. (<a href="http://mat.gsia.cmu.edu/dyn.ps">postscript</a>)

<p><hr>

<h2><A NAME="projects"><img src="http://www.cs.duke.edu/~mlittman/courses/cps370/bulb-t.gif"> Project Ideas</A></h2>

The major thrust of the seminar will be to undertake a group project exploring some facet of the problem of planning under uncertainty. Some sample project ideas are:

<UL>
<LI> Are there methods for efficiently evaluating plans in complex domains?
<LI> Do MDP aggregation methods have anything to offer?
<LI> Can hierarchical methods be applied to propositional state spaces?
<LI> Can Blum and Furst's <a href="http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/avrim/graphplan.html">Graphplan</a> algorithm be extended to stochastic domains?
<LI> How do the known results concerning the use of function approximation in dynamic programming relate to one another?
<LI> How does Boutilier and Dearden's structured policy iteration algorithm perform on some simple structured MDPs?
</UL>

We're starting to make some progress on a domain for later project development. Stephen Majercik has written up his ideas on a load-balancing MDP, which are accessible from his <a href="http://www.cs.duke.edu:80/~majercik/HomePage.html">home page</a>.

<p>We've talked about using the <a href="http://robotics.Stanford.EDU/~koller/gala.html">GALA</a> system as a basis for a general declarative language for specifying MDPs. Another option is a <a href="http://envy.cs.umass.edu/People/sutton/RLinterface/RLinterface.html">standard</a> proposed by Rich Sutton.

<p><hr>

<address>Last modified: Thu Nov 7 14:32:49 EST 1996 by Michael Littman, mlittman@cs.duke.edu</address></font></body>