Main Paper https://arxiv.org/pdf/1502.05477.pdf
Trust Region Policy Optimization
Abstract
We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
/* The paper introduces an iterative procedure for optimizing policies with guaranteed improvement, obtained by applying several approximations to a theoretically justified procedure. TRPO is similar to natural policy gradient methods and is effective at optimizing large nonlinear policies such as neural networks. Despite the approximations that deviate from the theory, it tends to give monotonic improvement with little hyperparameter tuning. */
Introduction
Categories of algorithms for policy optimization
1. policy iteration
: alternately estimates the value function under the current policy and improves the policy.
2. policy gradient
: uses an estimator of the gradient of the expected return obtained from sample trajectories. This is the approach TRPO builds on (see the sketch after this list).
3. derivative-free optimization
: methods such as CEM (Cross-Entropy Method) and CMA (Covariance Matrix Adaptation), which treat the return as a black-box function to be optimized with respect to the policy parameters.
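Since TRPO builds on the policy gradient family, a minimal sketch of the score-function (REINFORCE-style) estimator of the expected-return gradient may help. The toy two-state MDP, the tabular softmax policy, and every name below are illustrative assumptions of mine, not anything specified in the paper.

    # Illustrative sketch (not from the paper): a REINFORCE-style score-function
    # estimate of the policy gradient, grad E[R] ~ mean over sampled trajectories
    # of sum_t grad log pi_theta(a_t | s_t) * (reward-to-go from t).
    import numpy as np

    rng = np.random.default_rng(0)
    N_STATES, N_ACTIONS, HORIZON = 2, 2, 10

    def step(s, a):
        """Toy dynamics: action 1 tends to lead to state 1, which pays reward 1."""
        s_next = a if rng.random() < 0.9 else 1 - a
        return s_next, (1.0 if s_next == 1 else 0.0)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def sample_trajectory(theta):
        s, traj = 0, []
        for _ in range(HORIZON):
            a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
            s_next, r = step(s, a)
            traj.append((s, a, r))
            s = s_next
        return traj

    def policy_gradient(theta, n_traj=64):
        """Monte-Carlo estimate of grad_theta E[sum_t r_t]."""
        grad = np.zeros_like(theta)
        for _ in range(n_traj):
            traj = sample_trajectory(theta)
            rewards = [r for _, _, r in traj]
            for t, (s, a, _) in enumerate(traj):
                ret = sum(rewards[t:])          # reward-to-go
                score = -softmax(theta[s])      # d log pi(a|s) / d theta[s]
                score[a] += 1.0
                grad[s] += score * ret
        return grad / n_traj

    theta = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(200):
        theta += 0.05 * policy_gradient(theta)  # plain gradient ascent

    print(np.apply_along_axis(softmax, 1, theta).round(3))

Plain gradient ascent on this estimator is the vanilla policy gradient baseline; TRPO later constrains each such update with a trust region on how far the policy may change.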
General derivative-free stochastic optimization methods such as CEM and CMA are preferred on many problems, because they achieve good results while being simple to understand and implement. For continuous control problems, methods like CMA have been successful at learning control policies for challenging tasks like locomotion when provided with hand-engineered policy classes with low-dimensional parameterizations. The inability of ADP and gradient-based methods to consistently beat gradient-free random search is unsatisfying, since gradient-based optimization algorithms enjoy much better sample complexity guarantees than gradient-free methods (Nemirovski, 2005). Continuous gradient-based optimization has been very successful at learning function approximators for supervised learning tasks with huge numbers of parameters, and extending their success to reinforcement learning would allow for efficient training of complex and powerful policies.
/* General derivative-free stochastic optimization methods such as CEM and CMA are preferred on many problems because they are simple to understand and implement and give good results. For continuous control problems, such methods have successfully learned control policies when provided with hand-engineered policy classes with low-dimensional parameterizations. However, gradient-based methods enjoy much better sample complexity guarantees, and continuous gradient-based optimization has been very successful at learning function approximators with huge numbers of parameters for supervised learning tasks; extending that success to reinforcement learning would allow complex and powerful policies to be trained efficiently. */
=> Derivative-free optimization can also learn continuous control policies successfully when a hand-engineered policy class is provided, but gradient-based methods can efficiently learn function approximators with far more parameters and therefore train more powerful policies. This is why TRPO takes the policy gradient approach.
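For contrast, here is a minimal CEM sketch as derivative-free policy search, under stated assumptions: the environment rollout is replaced by a hypothetical noisy quadratic episode_return so the example stays self-contained; the only point is the structure of treating the return as a black-box function of the policy parameters.

    # Illustrative sketch (my own toy example): the Cross-Entropy Method as
    # derivative-free policy search -- no gradient information is used.
    import numpy as np

    rng = np.random.default_rng(0)

    def episode_return(theta):
        """Stand-in black box: peak return at theta = [1, -2, 3] plus noise."""
        target = np.array([1.0, -2.0, 3.0])
        return -np.sum((theta - target) ** 2) + 0.1 * rng.standard_normal()

    def cem(dim=3, pop_size=50, elite_frac=0.2, iters=30):
        mean, std = np.zeros(dim), 2.0 * np.ones(dim)
        n_elite = int(pop_size * elite_frac)
        for _ in range(iters):
            # Sample a population of candidate policy parameters.
            population = mean + std * rng.standard_normal((pop_size, dim))
            returns = np.array([episode_return(p) for p in population])
            # Keep the top performers and refit the sampling distribution.
            elite = population[np.argsort(returns)[-n_elite:]]
            mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
        return mean

    print("CEM estimate of the best parameters:", cem().round(2))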
Preliminaries