The Lagrange Algorithms#
| PPOLag(env_id, cfgs) | The Lagrange version of the PPO algorithm. |
| TRPOLag(env_id, cfgs) | The Lagrange version of the TRPO algorithm. |
PPOLag#
Documentation
- class omnisafe.algorithms.on_policy.PPOLag(env_id, cfgs)[source]#
The Lagrange version of the PPO algorithm.
A simple combination of the Lagrange method and the Proximal Policy Optimization algorithm.
- __init__(env_id, cfgs)#
- _compute_adv_surrogate(adv_r, adv_c)[source]#
Compute surrogate loss.
PPOLag uses the following surrogate loss:
\[L = \frac{1}{1 + \lambda} \left[ A^{R}_{\pi_{\theta}}(s, a) - \lambda A^{C}_{\pi_{\theta}}(s, a) \right] \tag{3}\]
- Parameters:
adv_r (torch.Tensor) – The reward advantage.
adv_c (torch.Tensor) – The cost advantage.
- Return type:
Tensor
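The surrogate above can be sketched in a few lines. This is a minimal illustration of the formula only, not OmniSafe's source: the function name `lagrange_adv_surrogate` and the explicit `lagrange_multiplier` argument are assumptions made for the example. The arithmetic is written so that it works on plain floats and would also apply elementwise to tensors.

```python
def lagrange_adv_surrogate(adv_r, adv_c, lagrange_multiplier):
    """Combine reward and cost advantages as (adv_r - lambda * adv_c) / (1 + lambda).

    Hypothetical helper for illustration; the multiplier is passed in
    explicitly instead of being read from the algorithm object.
    """
    penalty = lagrange_multiplier
    # Dividing by (1 + lambda) keeps the surrogate on a comparable scale
    # as the multiplier grows.
    return (adv_r - penalty * adv_c) / (1.0 + penalty)


# With lambda = 0 the surrogate reduces to the reward advantage.
no_penalty = lagrange_adv_surrogate(1.0, 0.5, 0.0)
# With lambda = 1 reward and cost advantages are weighted equally.
equal_weight = lagrange_adv_surrogate(1.0, 0.5, 1.0)
```

A larger multiplier shifts the weight from the reward advantage toward the (negated) cost advantage, which is how the cost constraint enters the policy update.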
- _init()[source]#
Initialize the PPOLag specific model.
The PPOLag algorithm uses a Lagrange multiplier to balance the cost and reward.
- Return type:
None
- _init_log()[source]#
Log the PPOLag specific information.
| Things to log | Description |
| Metrics/LagrangeMultiplier | The Lagrange multiplier. |
- Return type:
None
- _update()[source]#
Update the actor, the critic, and the running statistics, as in the PolicyGradient algorithm.
Additionally, update the Lagrange multiplier parameter by calling the update_lagrange_multiplier() method.
Note
The _loss_pi() method is defined in the PolicyGradient algorithm. When a Lagrange multiplier is used, _loss_pi() returns the policy loss as:
\[L_{\pi} = \mathbb{E}_{s_t \sim \rho_{\pi}} \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_\theta^{old}(a_t|s_t)} \left[ A^{R}_{\pi_{\theta}}(s_t, a_t) - \lambda A^{C}_{\pi_{\theta}}(s_t, a_t) \right] \right] \tag{4}\]
where \(\lambda\) is the Lagrange multiplier parameter.
- Parameters:
self (object) – object of the class.
- Return type:
None
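As a rough picture of what a Lagrange multiplier update does, the sketch below applies plain projected gradient ascent: the multiplier grows when the mean episode cost exceeds the limit and shrinks (never below zero) otherwise. This is an assumption made for illustration; the actual update_lagrange_multiplier() method may use a different rule (for example, an optimizer with its own state), and the names `update_multiplier`, `mean_ep_cost`, `cost_limit`, and `lr` are hypothetical.

```python
def update_multiplier(lagrange_multiplier, mean_ep_cost, cost_limit, lr=0.05):
    """One projected gradient-ascent step on the Lagrange multiplier.

    Illustrative only: all argument names are assumptions, and the rule
    shown here is a textbook update, not necessarily the library's.
    """
    # The constraint violation is positive when the policy is too costly.
    violation = mean_ep_cost - cost_limit
    # Ascend on the multiplier, then project back onto lambda >= 0.
    return max(0.0, lagrange_multiplier + lr * violation)


# Cost above the limit: the multiplier increases.
raised = update_multiplier(0.0, mean_ep_cost=30.0, cost_limit=25.0, lr=0.1)
# Cost well below the limit: the multiplier is clipped at zero.
clipped = update_multiplier(0.1, mean_ep_cost=10.0, cost_limit=25.0, lr=0.1)
```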
TRPOLag#
Documentation
- class omnisafe.algorithms.on_policy.TRPOLag(env_id, cfgs)[source]#
The Lagrange version of the TRPO algorithm.
A simple combination of the Lagrange method and the Trust Region Policy Optimization algorithm.
- __init__(env_id, cfgs)#
- _compute_adv_surrogate(adv_r, adv_c)[source]#
Compute surrogate loss.
TRPOLag uses the following surrogate loss:
\[L = \frac{1}{1 + \lambda} \left[ A^{R}_{\pi_{\theta}}(s, a) - \lambda A^{C}_{\pi_{\theta}}(s, a) \right] \tag{7}\]
- Parameters:
adv_r (torch.Tensor) – The reward advantage.
adv_c (torch.Tensor) – The cost advantage.
- Return type:
Tensor
- _init()[source]#
Initialize the TRPOLag specific model.
The TRPOLag algorithm uses a Lagrange multiplier to balance the cost and reward.
- Return type:
None
- _init_log()[source]#
Log the TRPOLag specific information.
| Things to log | Description |
| Metrics/LagrangeMultiplier | The Lagrange multiplier. |
- Return type:
None
- _update()[source]#
Update the actor, the critic, and the running statistics, as in the PolicyGradient algorithm.
Additionally, update the Lagrange multiplier parameter by calling the update_lagrange_multiplier() method.
Note
The _loss_pi() method is defined in the PolicyGradient algorithm. When a Lagrange multiplier is used, _loss_pi() returns the policy loss as:
\[L_{\pi} = \mathbb{E}_{s_t \sim \rho_{\pi}} \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_\theta^{old}(a_t|s_t)} \left[ A^{R}_{\pi_{\theta}}(s_t, a_t) - \lambda A^{C}_{\pi_{\theta}}(s_t, a_t) \right] \right] \tag{8}\]
where \(\lambda\) is the Lagrange multiplier parameter.
- Parameters:
self (object) – object of the class.
- Return type:
None
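The expectation in the note above can be estimated from a batch of samples by averaging the importance ratio times the penalized advantage. The sketch below does this with plain Python lists; it is an illustration of the formula, not the library's _loss_pi(), and the function name and arguments are assumptions. It returns the quantity exactly as written in the equation; whether the library negates it before passing it to a minimizer is not stated here.

```python
import math


def lagrange_policy_objective(logp, logp_old, adv_r, adv_c, lagrange_multiplier):
    """Sample mean of ratio * (adv_r - lambda * adv_c) over a batch.

    Hypothetical helper: inputs are equal-length lists of per-sample
    log-probabilities and advantages.
    """
    total = 0.0
    for lp, lp_old, a_r, a_c in zip(logp, logp_old, adv_r, adv_c):
        # Importance ratio pi_theta(a|s) / pi_theta_old(a|s), from log-probs.
        ratio = math.exp(lp - lp_old)
        total += ratio * (a_r - lagrange_multiplier * a_c)
    return total / len(logp)


# When the new and old policies agree, every ratio equals 1.
objective = lagrange_policy_objective(
    logp=[-1.0, -2.0],
    logp_old=[-1.0, -2.0],
    adv_r=[1.0, 3.0],
    adv_c=[0.0, 2.0],
    lagrange_multiplier=0.5,
)
```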
CRPO#
Documentation
- class omnisafe.algorithms.on_policy.OnCRPO(env_id, cfgs)[source]#
The on-policy CRPO algorithm.
References
Title: CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee.
Authors: Tengyu Xu, Yingbin Liang, Guanghui Lan.
URL: CRPO.
- _compute_adv_surrogate(adv_r, adv_c)[source]#
Compute the advantage surrogate.
In the CRPO algorithm, we first check whether the cost is within the limit. If it is, the surrogate is the reward advantage; otherwise, the surrogate is the cost advantage.
- Parameters:
adv_r (torch.Tensor) – The reward advantage.
adv_c (torch.Tensor) – The cost advantage.
- Return type:
Tensor
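The switching rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the names `crpo_adv_surrogate`, `ep_cost`, `cost_limit`, and `tolerance` are hypothetical, and the cost advantage is negated on the assumption that the surrogate is maximized, so that the update lowers cost when the limit is exceeded.

```python
def crpo_adv_surrogate(adv_r, adv_c, ep_cost, cost_limit, tolerance=0.0):
    """Pick the reward or the cost advantage depending on constraint status.

    Illustrative helper; `tolerance` is an assumed slack on the limit.
    """
    if ep_cost <= cost_limit + tolerance:
        # Constraint satisfied: improve reward.
        return adv_r
    # Constraint violated: switch to reducing cost (sign is an assumption).
    return -adv_c


# Cost within the limit: the reward advantage is used.
within = crpo_adv_surrogate(adv_r=1.0, adv_c=2.0, ep_cost=10.0, cost_limit=25.0)
# Cost above the limit: the (negated) cost advantage is used.
violated = crpo_adv_surrogate(adv_r=1.0, adv_c=2.0, ep_cost=30.0, cost_limit=25.0)
```

Unlike the Lagrange variants, this rule uses no multiplier: each update optimizes either reward or cost, never a weighted mix.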