The Lagrange Algorithms
- PPOLag: The Lagrange version of the PPO algorithm.
- TRPOLag: The Lagrange version of the TRPO algorithm.
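For orientation, a minimal usage sketch, assuming omnisafe's high-level Agent interface and a Safety-Gymnasium task id; both may differ across versions:

```python
import omnisafe

# Train PPOLag on a safety task; swap in 'TRPOLag' or 'OnCRPO' for the
# other algorithms documented below.
agent = omnisafe.Agent('PPOLag', 'SafetyPointGoal1-v0')
agent.learn()
```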
PPOLag
Documentation
- class omnisafe.algorithms.on_policy.PPOLag(env_id, cfgs)[source]
The Lagrange version of the PPO algorithm.
A simple combination of the Lagrange method and the Proximal Policy Optimization algorithm.
Initialize an instance of BaseAlgo.
- _compute_adv_surrogate(adv_r, adv_c)[source]
Compute surrogate loss.
PPOLag uses the following surrogate loss:
(3)
\[L = \frac{1}{1 + \lambda} \left[ A^{R}_{\pi_{\theta}}(s, a) - \lambda A^{C}_{\pi_{\theta}}(s, a) \right]\]
- Parameters:
adv_r (torch.Tensor) – The reward_advantage sampled from the buffer.
adv_c (torch.Tensor) – The cost_advantage sampled from the buffer.
- Returns:
The advantage combining reward_advantage and cost_advantage.
- Return type:
Tensor
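For concreteness, a minimal sketch of the combination in Eq. (3), assuming `penalty` holds the current (detached) value of \(\lambda\); the function name and signature are illustrative, not omnisafe's exact internals:

```python
import torch

def compute_adv_surrogate(
    adv_r: torch.Tensor, adv_c: torch.Tensor, penalty: float
) -> torch.Tensor:
    """Combine reward and cost advantages as in Eq. (3) (illustrative)."""
    # (A^R - lambda * A^C) / (1 + lambda); the 1 / (1 + lambda) factor keeps
    # the magnitude of the combined advantage stable as lambda grows.
    return (adv_r - penalty * adv_c) / (1.0 + penalty)
```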
- _init()[source]
Initialize the PPOLag specific model.
The PPOLag algorithm uses a Lagrange multiplier to balance the cost and reward.
- Return type:
None
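The multiplier set up here can be held by a small helper. Below is a minimal sketch, assuming a torch parameter for \(\lambda\) and an Adam optimizer; the class and attribute names (Lagrange, lagrangian_multiplier, lambda_optimizer) are illustrative, not omnisafe's exact API:

```python
import torch

class Lagrange:
    """Minimal holder for a learnable Lagrange multiplier (illustrative)."""

    def __init__(self, cost_limit: float, lambda_lr: float,
                 lambda_init: float = 0.0) -> None:
        self.cost_limit = cost_limit
        # Parameterize lambda directly; it is projected back onto
        # lambda >= 0 after every optimizer step.
        self.lagrangian_multiplier = torch.nn.Parameter(
            torch.tensor(lambda_init), requires_grad=True
        )
        self.lambda_optimizer = torch.optim.Adam(
            [self.lagrangian_multiplier], lr=lambda_lr
        )
```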
- _init_log()[source]
Log the PPOLag-specific information.

Things to log | Description
---|---
Metrics/LagrangeMultiplier | The Lagrange multiplier.
- Return type:
None
- _update()[source]
Update the actor and critic, as in the PolicyGradient algorithm. Additionally, update the Lagrange multiplier parameter by calling the update_lagrange_multiplier() method.
- Return type:
None
Note
The _loss_pi() method is defined in the PolicyGradient algorithm. When a Lagrange multiplier is used, _loss_pi() returns the policy loss:
(4)
\[L_{\pi} = \mathbb{E}_{s_t \sim \rho_{\pi}} \left[ \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta}^{old}(a_t|s_t)} \left[ A^{R}_{\pi_{\theta}}(s_t, a_t) - \lambda A^{C}_{\pi_{\theta}}(s_t, a_t) \right] \right]\]
where \(\lambda\) is the Lagrange multiplier parameter.
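A hedged sketch of the ratio-based loss in Eq. (4); PPO's clipping is omitted here to match the equation as written, and the argument names are illustrative:

```python
import torch

def loss_pi(
    log_p: torch.Tensor,      # log pi_theta(a_t | s_t) for sampled actions
    log_p_old: torch.Tensor,  # log pi_theta_old(a_t | s_t), held fixed
    adv: torch.Tensor,        # combined advantage from _compute_adv_surrogate
) -> torch.Tensor:
    """Surrogate policy loss of Eq. (4), negated for gradient descent."""
    ratio = torch.exp(log_p - log_p_old)
    return -(ratio * adv).mean()
```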
TRPOLag
Documentation
- class omnisafe.algorithms.on_policy.TRPOLag(env_id, cfgs)[source]
The Lagrange version of the TRPO algorithm.
A simple combination of the Lagrange method and the Trust Region Policy Optimization algorithm.
Initialize an instance of BaseAlgo.
- _compute_adv_surrogate(adv_r, adv_c)[source]
Compute surrogate loss.
TRPOLag uses the following surrogate loss:
(7)
\[L = \frac{1}{1 + \lambda} \left[ A^{R}_{\pi_{\theta}}(s, a) - \lambda A^{C}_{\pi_{\theta}}(s, a) \right]\]
- Parameters:
adv_r (torch.Tensor) – The reward_advantage sampled from the buffer.
adv_c (torch.Tensor) – The cost_advantage sampled from the buffer.
- Returns:
The advantage combining reward_advantage and cost_advantage.
- Return type:
Tensor
- _init()[source]
Initialize the TRPOLag specific model.
The TRPOLag algorithm uses a Lagrange multiplier to balance the cost and reward.
- Return type:
None
- _init_log()[source]
Log the TRPOLag-specific information.

Things to log | Description
---|---
Metrics/LagrangeMultiplier | The Lagrange multiplier.
- Return type:
None
- _update()[source]
Update the actor and critic, as in the PolicyGradient algorithm. Additionally, update the Lagrange multiplier parameter by calling the update_lagrange_multiplier() method.
- Return type:
None
Note
The _loss_pi() method is defined in the PolicyGradient algorithm. When a Lagrange multiplier is used, _loss_pi() returns the policy loss:
(8)
\[L_{\pi} = \mathbb{E}_{s_t \sim \rho_{\pi}} \left[ \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta}^{old}(a_t|s_t)} \left[ A^{R}_{\pi_{\theta}}(s_t, a_t) - \lambda A^{C}_{\pi_{\theta}}(s_t, a_t) \right] \right]\]
where \(\lambda\) is the Lagrange multiplier parameter.
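For concreteness, a hedged sketch of what a multiplier update such as update_lagrange_multiplier() can look like, reusing the illustrative Lagrange helper sketched in the PPOLag section: projected gradient ascent on \(\lambda\), so the multiplier grows while the average episode cost exceeds the limit and shrinks toward zero otherwise:

```python
import torch

def update_lagrange_multiplier(lagrange: "Lagrange", mean_ep_cost: float) -> None:
    """One projected gradient-ascent step on lambda (illustrative)."""
    lagrange.lambda_optimizer.zero_grad()
    # Ascend on lambda * (Jc - d) by descending on its negation.
    lambda_loss = -lagrange.lagrangian_multiplier * (
        mean_ep_cost - lagrange.cost_limit
    )
    lambda_loss.backward()
    lagrange.lambda_optimizer.step()
    # Project back onto the feasible set lambda >= 0.
    with torch.no_grad():
        lagrange.lagrangian_multiplier.clamp_(min=0.0)
```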
CRPO
Documentation
- class omnisafe.algorithms.on_policy.OnCRPO(env_id, cfgs)[source]
The on-policy CRPO algorithm.
References
Title: CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee.
Authors: Tengyu Xu, Yingbin Liang, Guanghui Lan.
URL: https://arxiv.org/abs/2011.05869
Initialize an instance of OnCRPO.
- _compute_adv_surrogate(adv_r, adv_c)[source]
Compute the advantage surrogate.
In the CRPO algorithm, we first check whether the cost is within the limit. If it is, we optimize the reward advantage; otherwise, we optimize the cost advantage.
- Parameters:
adv_r (torch.Tensor) – The reward_advantage sampled from the buffer.
adv_c (torch.Tensor) – The cost_advantage sampled from the buffer.
- Returns:
The advantage chosen between reward_advantage and cost_advantage.
- Return type:
Tensor
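A hedged sketch of the selection rule described above; mean_ep_cost and cost_limit are illustrative inputs, and negating the cost advantage (so that maximizing the returned advantage pushes the cost down) is an assumption of this sketch:

```python
import torch

def compute_adv_surrogate_crpo(
    adv_r: torch.Tensor,
    adv_c: torch.Tensor,
    mean_ep_cost: float,
    cost_limit: float,
) -> torch.Tensor:
    """Pick which advantage the policy update should optimize (illustrative)."""
    if mean_ep_cost <= cost_limit:
        return adv_r   # constraint satisfied: improve the reward
    return -adv_c      # constraint violated: descend on the cost
```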