The Lagrange Algorithms

PPOLag(env_id, cfgs)

The Lagrange version of the PPO algorithm.

TRPOLag(env_id, cfgs)

The Lagrange version of the TRPO algorithm.

PPOLag

Documentation

class omnisafe.algorithms.on_policy.PPOLag(env_id, cfgs)

The Lagrange version of the PPO algorithm.

A simple combination of the Lagrange method and the Proximal Policy Optimization algorithm.

__init__(env_id, cfgs)

_compute_adv_surrogate(adv_r, adv_c)

Compute surrogate loss.

PPOLag uses the following surrogate loss:

\[L = \frac{1}{1 + \lambda} \left[ A^{R}_{\pi_{\theta}}(s, a) - \lambda A^{C}_{\pi_{\theta}}(s, a) \right] \tag{3}\]

where \(\lambda\) is the Lagrange multiplier.

Parameters:
  • adv_r (torch.Tensor) – The reward advantage.

  • adv_c (torch.Tensor) – The cost advantage.

Return type:

Tensor
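
As a rough illustration of equation (3), the combination can be sketched in a few lines. This is a minimal sketch, not OmniSafe's implementation: the function name `combine_advantages` and the argument `lagrange_multiplier` are hypothetical, and plain floats stand in for `torch.Tensor` inputs (the same arithmetic applies elementwise to tensors).

```python
def combine_advantages(adv_r, adv_c, lagrange_multiplier):
    """Blend reward and cost advantages as in the surrogate above.

    Hypothetical helper: works on floats, and on any array type that
    supports elementwise arithmetic.
    """
    if lagrange_multiplier < 0:
        raise ValueError("the Lagrange multiplier must be non-negative")
    # Subtract the penalised cost advantage, then divide by (1 + lambda).
    # The weights 1 / (1 + lambda) and lambda / (1 + lambda) sum to one.
    return (adv_r - lagrange_multiplier * adv_c) / (1.0 + lagrange_multiplier)


# With lambda = 0 the reward advantage passes through unchanged.
print(combine_advantages(2.0, 1.0, 0.0))
# With lambda = 1 the reward and cost advantages are weighted equally.
print(combine_advantages(2.0, 1.0, 1.0))
```

Because the two weights sum to one, a large multiplier shifts the emphasis toward the cost advantage without inflating the overall scale of the result.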

_init()

Initialize the PPOLag specific model.

The PPOLag algorithm uses a Lagrange multiplier to balance the cost and reward.

Return type:

None

_init_log()

Log the PPOLag specific information.

Things to log:
  • Metrics/LagrangeMultiplier – The Lagrange multiplier.

Return type:

None

_update()

Update the actor, the critic, and the running statistics, as in the PolicyGradient algorithm.

In addition, update the Lagrange multiplier by calling the update_lagrange_multiplier() method.

Note

The _loss_pi() method is defined in the PolicyGradient algorithm. When a Lagrange multiplier is used, _loss_pi() returns the policy loss as:

\[L_{\pi} = \mathbb{E}_{s_t \sim \rho_{\pi}} \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_\theta^{old}(a_t|s_t)} \left[ A^{R}_{\pi_{\theta}}(s_t, a_t) - \lambda A^{C}_{\pi_{\theta}}(s_t, a_t) \right] \right] \tag{4}\]

where \(\lambda\) is the Lagrange multiplier parameter.

Parameters:

self (object) – object of the class.

Return type:

None
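
The multiplier update itself is not spelled out here. A common form for Lagrangian methods is projected gradient ascent on the constraint violation, sketched below. This is an assumption for illustration, not the behavior of OmniSafe's `update_lagrange_multiplier()`: the function name `ascend_multiplier`, the parameters `mean_ep_cost`, `cost_limit` and `lr`, and the plain gradient step are all hypothetical (a library may well use an optimizer instead).

```python
def ascend_multiplier(lagrange_multiplier, mean_ep_cost, cost_limit, lr=0.05):
    """One projected gradient-ascent step on the multiplier (hypothetical).

    The multiplier grows while the mean episode cost exceeds the limit and
    shrinks otherwise; it is clipped at zero so it stays non-negative.
    """
    violation = mean_ep_cost - cost_limit
    return max(0.0, lagrange_multiplier + lr * violation)


lam = 0.0
for cost in [30.0, 28.0, 26.0]:  # episode costs above an assumed limit of 25
    lam = ascend_multiplier(lam, cost, cost_limit=25.0)
print(lam)  # the multiplier has grown because every cost exceeded the limit
```

Under this scheme the multiplier acts as an adaptive penalty weight: it rises while the constraint is violated and decays toward zero once the policy becomes safe.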

TRPOLag

Documentation

class omnisafe.algorithms.on_policy.TRPOLag(env_id, cfgs)

The Lagrange version of the TRPO algorithm.

A simple combination of the Lagrange method and the Trust Region Policy Optimization algorithm.

__init__(env_id, cfgs)

_compute_adv_surrogate(adv_r, adv_c)

Compute surrogate loss.

TRPOLag uses the following surrogate loss:

\[L = \frac{1}{1 + \lambda} \left[ A^{R}_{\pi_{\theta}}(s, a) - \lambda A^{C}_{\pi_{\theta}}(s, a) \right] \tag{7}\]

where \(\lambda\) is the Lagrange multiplier.

Parameters:
  • adv_r (torch.Tensor) – The reward advantage.

  • adv_c (torch.Tensor) – The cost advantage.

Return type:

Tensor

_init()

Initialize the TRPOLag specific model.

The TRPOLag algorithm uses a Lagrange multiplier to balance the cost and reward.

Return type:

None

_init_log()

Log the TRPOLag specific information.

Things to log:
  • Metrics/LagrangeMultiplier – The Lagrange multiplier.

Return type:

None

_update()

Update the actor, the critic, and the running statistics, as in the PolicyGradient algorithm.

In addition, update the Lagrange multiplier by calling the update_lagrange_multiplier() method.

Note

The _loss_pi() method is defined in the PolicyGradient algorithm. When a Lagrange multiplier is used, _loss_pi() returns the policy loss as:

\[L_{\pi} = \mathbb{E}_{s_t \sim \rho_{\pi}} \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_\theta^{old}(a_t|s_t)} \left[ A^{R}_{\pi_{\theta}}(s_t, a_t) - \lambda A^{C}_{\pi_{\theta}}(s_t, a_t) \right] \right] \tag{8}\]

where \(\lambda\) is the Lagrange multiplier parameter.

Parameters:

self (object) – object of the class.

Return type:

None
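
A sample-based estimate of the expectation in equation (8) can be sketched as follows. This is only an illustration under stated assumptions: the function name `penalized_surrogate`, its arguments, and the use of per-sample log-probabilities to form the ratio are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def penalized_surrogate(logp_new, logp_old, adv_r, adv_c, lagrange_multiplier):
    """Average the ratio-weighted penalised advantage over a batch.

    Hypothetical helper: every argument except the multiplier is a list
    with one entry per sampled (state, action) pair.
    """
    total = 0.0
    for lp_new, lp_old, a_r, a_c in zip(logp_new, logp_old, adv_r, adv_c):
        ratio = math.exp(lp_new - lp_old)  # pi_theta / pi_theta_old
        total += ratio * (a_r - lagrange_multiplier * a_c)
    return total / len(adv_r)


# When the new and old policies coincide, every ratio equals 1.
value = penalized_surrogate(
    logp_new=[-1.0, -2.0],
    logp_old=[-1.0, -2.0],
    adv_r=[1.0, 3.0],
    adv_c=[0.5, 1.0],
    lagrange_multiplier=2.0,
)
print(value)
```

Working with log-probabilities and exponentiating their difference is a common way to form the probability ratio without dividing two small numbers.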

CRPO

Documentation

class omnisafe.algorithms.on_policy.OnCRPO(env_id, cfgs)

The on-policy CRPO algorithm.

References

  • Title: CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee.

  • Authors: Tengyu Xu, Yingbin Liang, Guanghui Lan.

  • URL: CRPO.

__init__(env_id, cfgs)

_compute_adv_surrogate(adv_r, adv_c)

Compute the advantage surrogate.

In the CRPO algorithm, we first check whether the cost is within the limit. If it is, we use the reward advantage; otherwise, we use the cost advantage.

Parameters:
  • adv_r (torch.Tensor) – The reward advantage.

  • adv_c (torch.Tensor) – The cost advantage.

Return type:

Tensor
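
The selection rule described above can be sketched as follows. This is a minimal sketch under assumptions, not OmniSafe's code: the function name `crpo_advantage`, the arguments `mean_ep_cost`, `cost_limit` and `tolerance`, and the sign flip on the cost advantage (so that maximizing the returned value reduces cost) are illustrative choices that the text here does not confirm.

```python
def crpo_advantage(adv_r, adv_c, mean_ep_cost, cost_limit, tolerance=0.0):
    """Pick which advantage drives the policy update (hypothetical helper).

    Returns the chosen advantage and a label naming the branch taken.
    """
    if mean_ep_cost <= cost_limit + tolerance:
        # Constraint satisfied: improve the reward.
        return adv_r, "reward"
    # Constraint violated: negate the cost advantage so that maximising the
    # returned value pushes the cost down (an assumed sign convention).
    return -adv_c, "cost"


counts = {"reward": 0, "cost": 0}
for ep_cost in [10.0, 30.0, 20.0]:  # compared against an assumed limit of 25
    _, branch = crpo_advantage(1.0, 0.5, ep_cost, cost_limit=25.0)
    counts[branch] += 1
print(counts)
```

The two counters in the usage loop play a role similar to the Misc/RewUpdate and Misc/CostUpdate entries that _init_log() registers.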

_init_log()

Log the CRPO specific information.

Things to log:
  • Misc/RewUpdate – The number of times the reward is updated.

  • Misc/CostUpdate – The number of times the cost is updated.

Return type:

None