Document Type: BL
Record Number: 877147
Main Entry: Sutton, Richard S.
Title & Author: Reinforcement learning : an introduction / Richard S. Sutton and Andrew G. Barto.
Edition Statement: Second edition.
Publication Statement: Cambridge, Massachusetts ; London, England : The MIT Press, [2018], ©2018
Series Statement: Adaptive computation and machine learning
Physical Description: xxii, 526 pages : illustrations (some color) ; 24 cm
ISBN: 0262039249
ISBN: 9780262039246
Bibliographies/Indexes: Includes bibliographical references and index.
Contents: Machine generated contents note: 1. Introduction -- 1.1. Reinforcement Learning -- 1.2. Examples -- 1.3. Elements of Reinforcement Learning -- 1.4. Limitations and Scope -- 1.5. An Extended Example: Tic-Tac-Toe -- 1.6. Summary -- 1.7. Early History of Reinforcement Learning -- 2. Multi-armed Bandits -- 2.1. A k-armed Bandit Problem -- 2.2. Action-value Methods -- 2.3. The 10-armed Testbed -- 2.4. Incremental Implementation -- 2.5. Tracking a Nonstationary Problem -- 2.6. Optimistic Initial Values -- 2.7. Upper-Confidence-Bound Action Selection -- 2.8. Gradient Bandit Algorithms -- 2.9. Associative Search (Contextual Bandits) -- 2.10. Summary -- 3. Finite Markov Decision Processes -- 3.1. The Agent-Environment Interface -- 3.2. Goals and Rewards -- 3.3. Returns and Episodes -- 3.4. Unified Notation for Episodic and Continuing Tasks -- 3.5. Policies and Value Functions -- 3.6. Optimal Policies and Optimal Value Functions -- 3.7. Optimality and Approximation -- 3.8. Summary -- 4. Dynamic Programming
Contents: Note continued: 4.1. Policy Evaluation (Prediction) -- 4.2. Policy Improvement -- 4.3. Policy Iteration -- 4.4. Value Iteration -- 4.5. Asynchronous Dynamic Programming -- 4.6. Generalized Policy Iteration -- 4.7. Efficiency of Dynamic Programming -- 4.8. Summary -- 5. Monte Carlo Methods -- 5.1. Monte Carlo Prediction -- 5.2. Monte Carlo Estimation of Action Values -- 5.3. Monte Carlo Control -- 5.4. Monte Carlo Control without Exploring Starts -- 5.5. Off-policy Prediction via Importance Sampling -- 5.6. Incremental Implementation -- 5.7. Off-policy Monte Carlo Control -- 5.8. *Discounting-aware Importance Sampling -- 5.9. *Per-decision Importance Sampling -- 5.10. Summary -- 6. Temporal-Difference Learning -- 6.1. TD Prediction -- 6.2. Advantages of TD Prediction Methods -- 6.3. Optimality of TD(0) -- 6.4. Sarsa: On-policy TD Control -- 6.5. Q-learning: Off-policy TD Control -- 6.6. Expected Sarsa -- 6.7. Maximization Bias and Double Learning
Contents: Note continued: 6.8. Games, Afterstates, and Other Special Cases -- 6.9. Summary -- 7. n-step Bootstrapping -- 7.1. n-step TD Prediction -- 7.2. n-step Sarsa -- 7.3. n-step Off-policy Learning -- 7.4. *Per-decision Methods with Control Variates -- 7.5. Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm -- 7.6. *A Unifying Algorithm: n-step Q(σ) -- 7.7. Summary -- 8. Planning and Learning with Tabular Methods -- 8.1. Models and Planning -- 8.2. Dyna: Integrated Planning, Acting, and Learning -- 8.3. When the Model Is Wrong -- 8.4. Prioritized Sweeping -- 8.5. Expected vs. Sample Updates -- 8.6. Trajectory Sampling -- 8.7. Real-time Dynamic Programming -- 8.8. Planning at Decision Time -- 8.9. Heuristic Search -- 8.10. Rollout Algorithms -- 8.11. Monte Carlo Tree Search -- 8.12. Summary of the Chapter -- 8.13. Summary of Part I: Dimensions -- 9. On-policy Prediction with Approximation -- 9.1. Value-function Approximation -- 9.2. The Prediction Objective (VE)
Contents: Note continued: 9.3. Stochastic-gradient and Semi-gradient Methods -- 9.4. Linear Methods -- 9.5. Feature Construction for Linear Methods -- 9.5.1. Polynomials -- 9.5.2. Fourier Basis -- 9.5.3. Coarse Coding -- 9.5.4. Tile Coding -- 9.5.5. Radial Basis Functions -- 9.6. Selecting Step-Size Parameters Manually -- 9.7. Nonlinear Function Approximation: Artificial Neural Networks -- 9.8. Least-Squares TD -- 9.9. Memory-based Function Approximation -- 9.10. Kernel-based Function Approximation -- 9.11. Looking Deeper at On-policy Learning: Interest and Emphasis -- 9.12. Summary -- 10. On-policy Control with Approximation -- 10.1. Episodic Semi-gradient Control -- 10.2. Semi-gradient n-step Sarsa -- 10.3. Average Reward: A New Problem Setting for Continuing Tasks -- 10.4. Deprecating the Discounted Setting -- 10.5. Differential Semi-gradient n-step Sarsa -- 10.6. Summary -- 11. *Off-policy Methods with Approximation -- 11.1. Semi-gradient Methods -- 11.2. Examples of Off-policy Divergence
Contents: Note continued: 11.3. The Deadly Triad -- 11.4. Linear Value-function Geometry -- 11.5. Gradient Descent in the Bellman Error -- 11.6. The Bellman Error is Not Learnable -- 11.7. Gradient-TD Methods -- 11.8. Emphatic-TD Methods -- 11.9. Reducing Variance -- 11.10. Summary -- 12. Eligibility Traces -- 12.1. The λ-return -- 12.2. TD(λ) -- 12.3. n-step Truncated λ-return Methods -- 12.4. Redoing Updates: Online λ-return Algorithm -- 12.5. True Online TD(λ) -- 12.6. *Dutch Traces in Monte Carlo Learning -- 12.7. Sarsa(λ) -- 12.8. Variable λ and γ -- 12.9. Off-policy Traces with Control Variates -- 12.10. Watkins's Q(λ) to Tree-Backup(λ) -- 12.11. Stable Off-policy Methods with Traces -- 12.12. Implementation Issues -- 12.13. Conclusions -- 13. Policy Gradient Methods -- 13.1. Policy Approximation and its Advantages -- 13.2. The Policy Gradient Theorem -- 13.3. REINFORCE: Monte Carlo Policy Gradient -- 13.4. REINFORCE with Baseline -- 13.5. Actor-Critic Methods
Contents: Note continued: 13.6. Policy Gradient for Continuing Problems -- 13.7. Policy Parameterization for Continuous Actions -- 13.8. Summary -- 14. Psychology -- 14.1. Prediction and Control -- 14.2. Classical Conditioning -- 14.2.1. Blocking and Higher-order Conditioning -- 14.2.2. The Rescorla-Wagner Model -- 14.2.3. The TD Model -- 14.2.4. TD Model Simulations -- 14.3. Instrumental Conditioning -- 14.4. Delayed Reinforcement -- 14.5. Cognitive Maps -- 14.6. Habitual and Goal-directed Behavior -- 14.7. Summary -- 15. Neuroscience -- 15.1. Neuroscience Basics -- 15.2. Reward Signals, Reinforcement Signals, Values, and Prediction Errors -- 15.3. The Reward Prediction Error Hypothesis -- 15.4. Dopamine -- 15.5. Experimental Support for the Reward Prediction Error Hypothesis -- 15.6. TD Error/Dopamine Correspondence -- 15.7. Neural Actor-Critic -- 15.8. Actor and Critic Learning Rules -- 15.9. Hedonistic Neurons -- 15.10. Collective Reinforcement Learning -- 15.11. Model-based Methods in the Brain
Contents: Note continued: 15.12. Addiction -- 15.13. Summary -- 16. Applications and Case Studies -- 16.1. TD-Gammon -- 16.2. Samuel's Checkers Player -- 16.3. Watson's Daily-Double Wagering -- 16.4. Optimizing Memory Control -- 16.5. Human-level Video Game Play -- 16.6. Mastering the Game of Go -- 16.6.1. AlphaGo -- 16.6.2. AlphaGo Zero -- 16.7. Personalized Web Services -- 16.8. Thermal Soaring -- 17. Frontiers -- 17.1. General Value Functions and Auxiliary Tasks -- 17.2. Temporal Abstraction via Options -- 17.3. Observations and State -- 17.4. Designing Reward Signals -- 17.5. Remaining Issues -- 17.6. Experimental Support for the Reward Prediction Error Hypothesis.
Abstract: "Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the field's key ideas and algorithms."--
Subject: Reinforcement learning.
Subject: 54.72 artificial intelligence.
Subject: Machine Learning.
Subject: Reinforcement, Psychology.
Dewey Classification: 006.3/1
LC Classification: Q325.6.S88 2018
LC Classification: Q325.6.R45 2018
Added Entry: Barto, Andrew G.