Base on-policy Algorithms#
- The Policy Gradient algorithm.
- The Natural Policy Gradient algorithm.
- The Trust Region Policy Optimization (TRPO) algorithm.
- The Proximal Policy Optimization (PPO) algorithm.
Policy Gradient#
Documentation
- class omnisafe.algorithms.on_policy.PolicyGradient(env_id, cfgs)[source]#
The Policy Gradient algorithm.
References
Title: Policy Gradient Methods for Reinforcement Learning with Function Approximation
Authors: Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour.
URL: PG
- __init__(env_id, cfgs)#
- _compute_adv_surrogate(adv_r, adv_c)[source]#
Compute the surrogate loss.
Policy Gradient uses only the reward advantage.
- Parameters:
adv_r (torch.Tensor) – reward advantage
adv_c (torch.Tensor) – cost advantage
- Return type:
Tensor
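The behaviour described above is a one-liner; here is a hedged sketch (the standalone function form and name are illustrative, the real method lives on the class):

```python
import torch

# Sketch of the surrogate described above: the base Policy Gradient
# ignores the cost advantage and returns the reward advantage unchanged.
def compute_adv_surrogate(adv_r: torch.Tensor, adv_c: torch.Tensor) -> torch.Tensor:
    return adv_r
```

Subclasses (e.g. Lagrangian-based algorithms) override this hook to mix the cost advantage into the surrogate.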
- _init()[source]#
Initialize the algorithm.
Users can customize the initialization by overriding this function.
- Return type:
None
Example
>>> def _init(self) -> None:
...     super()._init()
...     self._buffer = CustomBuffer()
...     self._model = CustomModel()
- _init_env()[source]#
Initialize the environment.
OmniSafe uses omnisafe.adapter.OnPolicyAdapter to adapt the environment to the algorithm. Users can customize the environment by overriding this function.
- Return type:
None
Example
>>> def _init_env(self) -> None:
...     self._env = CustomAdapter()
- _init_log()[source]#
Log info about epoch.
Things to log
Description
Train/Epoch
Current epoch.
Metrics/EpCost
Average cost of the epoch.
Metrics/EpRet
Average return of the epoch.
Metrics/EpLen
Average length of the epoch.
Values/reward
Average reward value in roll_out() (from the critic network) of the epoch.
Values/cost
Average cost value in roll_out() (from the critic network) of the epoch.
Values/Adv
Average advantage in roll_out() of the epoch.
Loss/Loss_pi
Loss of the policy network.
Loss/Delta_loss_pi
Delta loss of the policy network.
Loss/Loss_reward_critic
Loss of the value network.
Loss/Delta_loss_reward_critic
Delta loss of the value network.
Loss/Loss_cost_critic
Loss of the cost network.
Loss/Delta_loss_cost_critic
Delta loss of the cost network.
Train/Entropy
Entropy of the policy network.
Train/KL
KL divergence of the policy network.
Train/StopIters
Number of update iterations of the policy network.
Train/PolicyRatio
Ratio of the policy network.
Train/LR
Learning rate of the policy network.
Misc/Seed
Seed of the experiment.
Misc/TotalEnvSteps
Total steps of the experiment.
Time
Total time.
FPS
Frames per second of the epoch.
- Parameters:
epoch (int) – current epoch.
- Return type:
None
- _init_model()[source]#
Initialize the model.
OmniSafe uses omnisafe.models.actor_critic.constraint_actor_critic.ConstraintActorCritic as the default model. Users can customize the model by overriding this function.
- Return type:
None
Example
>>> def _init_model(self) -> None:
...     self._actor_critic = CustomActorCritic()
- _loss_pi(obs, act, logp, adv)[source]#
Compute the pi/actor loss.
In Policy Gradient, the loss is defined as:
(4)#\[L = -\mathbb{E}_{s_t \sim \rho_\theta} [ \sum_{t=0}^T ( \frac{\pi^{'}_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)} ) A^{R}_{\pi_{\theta}}(s_t, a_t) ]\]where \(\pi_\theta\) is the policy network, \(\pi^{'}_\theta\) is the new policy network, \(A^{R}_{\pi_{\theta}}(s_t, a_t)\) is the advantage.
- Parameters:
obs (torch.Tensor) – observation stored in buffer.
act (torch.Tensor) – action stored in buffer.
logp (torch.Tensor) – log probability of action stored in buffer.
adv (torch.Tensor) – advantage stored in buffer.
- Return type:
Tuple[Tensor, Dict[str, float]]
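The loss above can be sketched in a few lines; the policy ratio is recovered from the stored log-probabilities. Names here are illustrative, not the library's exact internals:

```python
import torch

# -E[ (pi'/pi) * A ]: negated so gradient descent maximizes the objective.
def pg_loss(new_logp: torch.Tensor, old_logp: torch.Tensor,
            adv: torch.Tensor) -> torch.Tensor:
    ratio = torch.exp(new_logp - old_logp)  # pi'_theta(a|s) / pi_theta(a|s)
    return -(ratio * adv).mean()
```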
- _update()[source]#
Update the actor and critics, following these steps:

Get the data from the buffer.

Hint
obs – observation stored in buffer.
act – action stored in buffer.
target_value_r – target value stored in buffer.
target_value_c – target cost stored in buffer.
logp – log probability stored in buffer.
adv – estimated advantage (e.g. GAE) stored in buffer.
cost_adv – estimated cost advantage (e.g. GAE) stored in buffer.

Update the value net by _update_reward_critic().
Update the cost net by _update_cost_critic().
Update the policy net by _update_actor().
The basic process of each update is as follows:
Get the data from buffer.
Shuffle the data and split it into mini-batch data.
Get the loss of network.
Update the network by loss.
Repeat steps 2 and 3 until all mini-batches are used.
Repeat steps 2, 3, and 4 until the KL divergence exceeds the limit.
- Parameters:
self (object) – object of the class.
- Return type:
None
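The shuffle/mini-batch loop above can be sketched as follows, shown for a single value network for brevity; all names are hypothetical stand-ins for the buffer contents, and the real code also checks the KL limit between iterations:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hedged sketch of the update loop described above.
def update(model, optimizer, obs, target_value_r, num_iters=4, batch_size=64):
    loader = DataLoader(TensorDataset(obs, target_value_r),
                        batch_size=batch_size, shuffle=True)
    for _ in range(num_iters):            # steps 5-6: repeat the passes
        for mb_obs, mb_target in loader:  # step 2: shuffled mini-batches
            loss = torch.nn.functional.mse_loss(   # step 3: loss of network
                model(mb_obs).squeeze(-1), mb_target)
            optimizer.zero_grad()                  # step 4: update by loss
            loss.backward()
            optimizer.step()
```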
- _update_actor(obs, act, logp, adv_r, adv_c)[source]#
Update the policy network under a double for loop.

Compute the loss function.
Clip the gradient if use_max_grad_norm is True.
Update the network by the loss function.

Warning
For some KL-divergence-based algorithms (e.g. TRPO, CPO), the KL divergence between the old policy and the new policy is calculated and used to determine whether the update is successful. If the KL divergence is too large, the update is terminated.
- Parameters:
obs (torch.Tensor) – observation stored in buffer.
act (torch.Tensor) – action stored in buffer.
logp (torch.Tensor) – log probability stored in buffer.
adv_r (torch.Tensor) – reward advantage stored in buffer.
adv_c (torch.Tensor) – cost advantage stored in buffer.
- Return type:
None
- _update_cost_critic(obs, target_value_c)[source]#
Update the cost critic network under a double for loop.

The loss function is MSE loss, defined in torch.nn.MSELoss. Specifically, the loss function is defined as:

(5)#\[L = \frac{1}{N} \sum_{i=1}^N (\hat{V} - V)^2\]

where \(\hat{V}\) is the predicted cost value and \(V\) is the target cost value.

Compute the loss function.
Add the critic norm to the loss function if use_critic_norm is True.
Clip the gradient if use_max_grad_norm is True.
Update the network by the loss function.
- Parameters:
obs (torch.Tensor) – observation stored in buffer.
target_value_c (torch.Tensor) – target cost value stored in buffer.
- Return type:
None
- _update_reward_critic(obs, target_value_r)[source]#
Update the reward critic network under a double for loop.

The loss function is MSE loss, defined in torch.nn.MSELoss. Specifically, the loss function is defined as:

(6)#\[L = \frac{1}{N} \sum_{i=1}^N (\hat{V} - V)^2\]

where \(\hat{V}\) is the predicted value and \(V\) is the target value.

Compute the loss function.
Add the critic norm to the loss function if use_critic_norm is True.
Clip the gradient if use_max_grad_norm is True.
Update the network by the loss function.
- Parameters:
obs (torch.Tensor) – observation stored in buffer.
target_value_r (torch.Tensor) – target value stored in buffer.
- Return type:
None
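The four critic-update steps above can be sketched as follows; the flag and coefficient names (`use_critic_norm`, `norm_coef`, `max_grad_norm`) are illustrative assumptions, not the library's exact config keys:

```python
import torch

# Hedged sketch of the critic update above: MSE loss, an optional L2
# "critic norm" regularizer, and optional gradient clipping.
def update_reward_critic(critic, optimizer, obs, target_value_r,
                         use_critic_norm=True, norm_coef=0.001,
                         use_max_grad_norm=True, max_grad_norm=40.0):
    loss = torch.nn.functional.mse_loss(critic(obs).squeeze(-1), target_value_r)
    if use_critic_norm:  # add the critic norm to the loss
        loss = loss + norm_coef * sum(p.pow(2).sum() for p in critic.parameters())
    optimizer.zero_grad()
    loss.backward()
    if use_max_grad_norm:  # clip the gradient
        torch.nn.utils.clip_grad_norm_(critic.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```

The cost critic update is identical except that it regresses on target_value_c.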
- learn()[source]#
This is the main function for the algorithm update, divided into the following steps:

rollout(): collect interactive data from the environment.
update(): perform actor/critic updates.
log(): log epoch/update information for visualization and terminal printing.
- Parameters:
self (object) – object of the class.
- Return type:
Tuple[Union[int, float], ...]
Natural Policy Gradient#
Documentation
- class omnisafe.algorithms.on_policy.NaturalPG(env_id, cfgs)[source]#
The Natural Policy Gradient algorithm.
The Natural Policy Gradient algorithm is a policy gradient algorithm that uses the Fisher information matrix to approximate the Hessian matrix. The Fisher information matrix is the second-order derivative of the KL-divergence.
References
Title: A Natural Policy Gradient
Author: Sham Kakade.
URL: Natural PG
- _fvp(params)[source]#
Build the Hessian-vector product based on an approximation of the KL-divergence.
The Hessian-vector product is approximated by the Fisher information matrix, which is the second-order derivative of the KL-divergence.
For details, see John Schulman's PhD thesis (p. 40): http://joschu.net/docs/thesis.pdf
- Parameters:
params (torch.Tensor) – The parameters of the actor network.
- Return type:
Tensor
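The approximation can be sketched with two passes of autograd; the `kl_fn` callable and the damping term are illustrative assumptions:

```python
import torch

# Hedged sketch of the Fisher/Hessian-vector product above: differentiate
# the KL once with create_graph, dot with v, and differentiate again.
def fvp(kl_fn, params, v, damping=0.1):
    kl = kl_fn()
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v = (flat_grad * v).sum()          # (dKL/dtheta) . v
    hvp = torch.autograd.grad(grad_v, params)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * v           # H v + damping * v
```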
- _init_log()[source]#
Log the Natural Policy Gradient specific information.
Things to log
Description
Misc/AcceptanceStep
The acceptance step size.
Misc/Alpha
\(\frac{\delta_{KL}}{xHx}\) in the original paper, where \(x\) is the step direction, \(H\) is the Hessian matrix, and \(\delta_{KL}\) is the target KL divergence.
Misc/FinalStepNorm
The final step norm.
Misc/gradient_norm
The gradient norm.
Misc/xHx
\(xHx\) in original paper.
Misc/H_inv_g
\(H^{-1}g\) in original paper.
- Parameters:
epoch (int) – current epoch.
- Return type:
None
- _update()[source]#
Update actor, critic.
Hint
Here are some differences between NPG and Policy Gradient (PG): in PG, the actor network and the critic network are updated together; when the KL divergence between the old policy and the new policy exceeds the threshold, the whole update is rejected.
In NPG, the actor network and the critic network are updated separately; when the KL divergence between the old policy and the new policy exceeds the threshold, the update of the actor network is rejected, but the update of the critic network is still accepted.
- Parameters:
self (object) – object of the class.
- Return type:
None
- _update_actor(obs, act, logp, adv_r, adv_c)[source]#
Update policy network.
Natural Policy Gradient (NPG) updates the policy network using the conjugate gradient algorithm, following these steps:
Calculate the gradient of the policy network.
Use the conjugate gradient algorithm to calculate the step direction.
Update the policy network by taking a step in the step direction.
- Parameters:
obs (torch.Tensor) – The observation tensor.
act (torch.Tensor) – The action tensor.
logp (torch.Tensor) – The log probability of the action.
adv_r (torch.Tensor) – The reward advantage tensor.
adv_c (torch.Tensor) – The cost advantage tensor.
- Return type:
None
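The conjugate gradient solve in the steps above can be sketched as follows; `Avp` is any Hessian-vector-product callable (e.g. a Fisher-vector product), and the names are illustrative:

```python
import torch

# Hedged sketch of conjugate gradient: find x with H x = g given only a
# Hessian-vector-product callable `Avp`. After solving, the real update
# scales x so the KL step stays near the target delta, roughly:
#   step = sqrt(2 * delta_KL / x.dot(Avp(x))) * x
def conjugate_gradient(Avp, b, num_iters=10, eps=1e-8):
    x = torch.zeros_like(b)
    r = b.clone()       # residual b - A x (x starts at 0)
    p = b.clone()       # search direction
    rdotr = r.dot(r)
    for _ in range(num_iters):
        Ap = Avp(p)
        alpha = rdotr / (p.dot(Ap) + eps)
        x = x + alpha * p
        r = r - alpha * Ap
        new_rdotr = r.dot(r)
        if new_rdotr < eps:   # residual small enough: converged
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x
```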
Trust Region Policy Optimization#
Documentation
- class omnisafe.algorithms.on_policy.TRPO(env_id, cfgs)[source]#
The Trust Region Policy Optimization (TRPO) algorithm.
References
Title: Trust Region Policy Optimization
Authors: John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel.
URL: TRPO
- __init__(env_id, cfgs)#
- _init_log()[source]#
Log the Trust Region Policy Optimization specific information.
Things to log
Description
Misc/AcceptanceStep
The acceptance step size.
Misc/Alpha
\(\frac{\delta_{KL}}{xHx}\) in the original paper, where \(x\) is the step direction, \(H\) is the Hessian matrix, and \(\delta_{KL}\) is the target KL divergence.
Misc/FinalStepNorm
The final step norm.
Misc/gradient_norm
The gradient norm.
Misc/xHx
\(xHx\) in original paper.
Misc/H_inv_g
\(H^{-1}g\) in original paper.
- Parameters:
epoch (int) – current epoch.
- Return type:
None
- _search_step_size(step_direction, grad, p_dist, obs, act, logp, adv, loss_before, total_steps=15, decay=0.8)[source]#
TRPO performs line-search until constraint satisfaction.
Hint
TRPO searches for a policy-update step that improves the loss and reward performance while satisfying the constraint. The search is done by line search, which finds a step size that satisfies the constraint: the KL divergence between the old policy and the new policy must stay within the limit.
- Parameters:
step_direction (torch.Tensor) – The step direction.
grad (torch.Tensor) – The gradient of the policy.
p_dist (torch.distributions.Distribution) – The old policy distribution.
obs (torch.Tensor) – The observation.
act (torch.Tensor) – The action.
logp (torch.Tensor) – The log probability of the action.
adv (torch.Tensor) – The advantage.
loss_before (float) – The loss of the policy before the update.
total_steps (int, optional) – The total steps to search. Defaults to 15.
decay (float, optional) – The decay rate of the step size. Defaults to 0.8.
- Return type:
Tuple
[Tensor
,int
]
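The backtracking search described above can be sketched as follows; `set_params`, `loss_fn`, `kl_fn`, and `target_kl` are illustrative stand-ins for the real actor machinery, and only the search logic is the point here:

```python
import torch

# Hedged sketch of TRPO's backtracking line search: start from the full
# step and decay until the loss improves and the KL constraint holds.
def search_step_size(step_direction, theta_old, set_params, loss_fn, kl_fn,
                     loss_before, target_kl=0.01, total_steps=15, decay=0.8):
    step_frac = 1.0
    for num_steps in range(total_steps):
        set_params(theta_old + step_frac * step_direction)  # trial point
        # accept when the loss improves and the KL constraint holds
        if loss_fn() < loss_before and kl_fn() <= target_kl:
            return step_frac * step_direction, num_steps + 1
        step_frac *= decay  # otherwise shrink the step and retry
    set_params(theta_old)   # no acceptable step: restore the old policy
    return torch.zeros_like(step_direction), total_steps
```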
- _update_actor(obs, act, logp, adv_r, adv_c)[source]#
Update policy network.
Trust Region Policy Optimization updates the policy network using the conjugate gradient algorithm, following these steps:
Compute the gradient of the policy.
Compute the step direction.
Search for a step size that satisfies the constraint.
Update the policy network.
- Parameters:
obs (torch.Tensor) – The observation tensor.
act (torch.Tensor) – The action tensor.
logp (torch.Tensor) – The log probability of the action.
adv_r (torch.Tensor) – The advantage tensor.
adv_c (torch.Tensor) – The cost advantage tensor.
- Return type:
None
Proximal Policy Optimization#
Documentation
- class omnisafe.algorithms.on_policy.PPO(env_id, cfgs)[source]#
The Proximal Policy Optimization (PPO) algorithm.
References
Title: Proximal Policy Optimization Algorithms
Authors: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov.
URL: PPO
- __init__(env_id, cfgs)#
- _loss_pi(obs, act, logp, adv)[source]#
Compute the pi/actor loss.
In Proximal Policy Optimization, the loss is defined as:
(8)#\[L^{CLIP} = \mathbb{E}_{s_t \sim \rho_{\theta}} \left[ \min \left( r_t A^{R}_{\pi_{\theta}}(s_t, a_t), \text{clip}(r_t, 1-\epsilon, 1+\epsilon) A^{R}_{\pi_{\theta}}(s_t, a_t) \right) \right]\]where \(r_t = \frac{\pi_\theta^{'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\), \(\epsilon\) is the clip parameter, and \(A^{R}_{\pi_{\theta}}(s_t, a_t)\) is the advantage.
- Parameters:
obs (torch.Tensor) – observation stored in buffer.
act (torch.Tensor) – action stored in buffer.
logp (torch.Tensor) – log probability of action stored in buffer.
adv (torch.Tensor) – advantage stored in buffer.
- Return type:
Tuple[Tensor, Dict[str, float]]
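The clipped surrogate above can be sketched in a few lines, negated so it can be minimized with gradient descent; names are illustrative, not the library's internals:

```python
import torch

# Hedged sketch of PPO's clipped surrogate loss.
def ppo_loss(new_logp: torch.Tensor, old_logp: torch.Tensor,
             adv: torch.Tensor, clip: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(new_logp - old_logp)      # r_t
    surr = torch.min(ratio * adv,
                     torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv)
    return -surr.mean()
```

The min with the clipped term removes the incentive to move the ratio outside \([1-\epsilon, 1+\epsilon]\), which is what lets PPO take multiple mini-batch passes over the same data.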