The Lagrange Algorithms#

PPOLag(env_id, cfgs)

The Lagrange version of the PPO algorithm.

TRPOLag(env_id, cfgs)

The Lagrange version of the TRPO algorithm.
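
These algorithm classes are normally constructed and trained through OmniSafe's high-level training entry point rather than instantiated directly. Below is a minimal sketch of training PPOLag, assuming the standard omnisafe.Agent quickstart interface and a Safety-Gymnasium environment id (the environment id is only an example):

```python
import omnisafe

# Example environment id; any supported Safety-Gymnasium task can be substituted.
env_id = 'SafetyPointGoal1-v0'

agent = omnisafe.Agent('PPOLag', env_id)
agent.learn()
```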

PPOLag#

Documentation

class omnisafe.algorithms.on_policy.PPOLag(env_id, cfgs)[source]#

The Lagrange version of the PPO algorithm.

A simple combination of the Lagrange method and the Proximal Policy Optimization algorithm.

Initialize an instance of BaseAlgo.

_compute_adv_surrogate(adv_r, adv_c)[source]#

Compute surrogate loss.

PPOLag uses the following surrogate loss:

(3)#\[L = \frac{1}{1 + \lambda} [ A^{R}_{\pi_{\theta}} (s, a) - \lambda A^C_{\pi_{\theta}} (s, a) ]\]
Parameters:
  • adv_r (torch.Tensor) – The reward_advantage sampled from the buffer.

  • adv_c (torch.Tensor) – The cost_advantage sampled from the buffer.

Returns:

The ``advantage`` computed by combining ``reward_advantage`` and ``cost_advantage``.

Return type:

Tensor
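
A minimal sketch of this combination, assuming the Lagrange multiplier is available as a non-negative scalar; the function name and signature below are illustrative, not the library's internal API:

```python
import torch

def lagrangian_adv_surrogate(
    adv_r: torch.Tensor,
    adv_c: torch.Tensor,
    lagrangian_multiplier: float,
) -> torch.Tensor:
    """Combine reward and cost advantages as (adv_r - lambda * adv_c) / (1 + lambda)."""
    penalty = lagrangian_multiplier
    # The 1 / (1 + lambda) factor rescales the combined advantage so its
    # magnitude stays comparable as the multiplier grows.
    return (adv_r - penalty * adv_c) / (1.0 + penalty)
```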

_init()[source]#

Initialize the PPOLag specific model.

The PPOLag algorithm uses a Lagrange multiplier to balance the cost and reward.

Return type:

None
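
A hypothetical sketch of what such a multiplier might look like as a learnable, non-negative scalar with its own optimizer. This is not OmniSafe's internal Lagrange class; the class name, defaults, and optimizer choice are illustrative assumptions:

```python
import torch

class SimpleLagrange:
    """Hypothetical Lagrange multiplier holder used to balance cost and reward."""

    def __init__(self, cost_limit: float, init_value: float = 0.001, lr: float = 0.035) -> None:
        self.cost_limit = cost_limit
        self.lagrangian_multiplier = torch.nn.Parameter(
            torch.as_tensor(init_value), requires_grad=True
        )
        self.optimizer = torch.optim.Adam([self.lagrangian_multiplier], lr=lr)

    def update(self, mean_ep_cost: float) -> None:
        # Gradient ascent on lambda * (Jc - d): the loss below is its negation,
        # so minimizing it raises lambda whenever the cost exceeds the limit.
        self.optimizer.zero_grad()
        loss = -self.lagrangian_multiplier * (mean_ep_cost - self.cost_limit)
        loss.backward()
        self.optimizer.step()
        # Project back onto the feasible set: the multiplier must stay non-negative.
        self.lagrangian_multiplier.data.clamp_(min=0.0)
```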

_init_log()[source]#

Log the PPOLag specific information.

Things to log:

  • Metrics/LagrangeMultiplier – The Lagrange multiplier.

Return type:

None

_update()[source]#

Update the actor and critic, as in the PolicyGradient algorithm.

Additionally, we update the Lagrange multiplier by calling the update_lagrange_multiplier() method.

Return type:

None

Note

The _loss_pi() method is defined in the PolicyGradient algorithm. When a Lagrange multiplier is used, _loss_pi() returns the policy loss as:

(4)#\[L_{\pi} = \mathbb{E}_{s_t \sim \rho_{\pi}} \left[ \frac{\pi_{\theta} (a_t|s_t)}{\pi_{\theta}^{old}(a_t|s_t)} [ A^{R}_{\pi_{\theta}} (s_t, a_t) - \lambda A^{C}_{\pi_{\theta}} (s_t, a_t) ] \right]\]

where \(\lambda\) is the Lagrange multiplier parameter.
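
A minimal sketch of a clipped variant of this loss, assuming log-probabilities from the current and old policies and a scalar multiplier. The ratio clipping and all names below are illustrative assumptions, not the library's exact implementation:

```python
import torch

def lagrangian_policy_loss(
    logp: torch.Tensor,       # log prob of a_t under the current policy
    logp_old: torch.Tensor,   # log prob of a_t under the old (sampling) policy
    adv_r: torch.Tensor,
    adv_c: torch.Tensor,
    lagrangian_multiplier: float,
    clip: float = 0.2,
) -> torch.Tensor:
    """Clipped version of the loss in equation (4); negated because optimizers minimize."""
    ratio = torch.exp(logp - logp_old)
    adv = adv_r - lagrangian_multiplier * adv_c
    # Standard PPO ratio clipping; the 1 / (1 + lambda) normalization from
    # equation (3) can be folded into adv beforehand.
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv)
    return -surrogate.mean()
```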

TRPOLag#

Documentation

class omnisafe.algorithms.on_policy.TRPOLag(env_id, cfgs)[source]#

The Lagrange version of the TRPO algorithm.

A simple combination of the Lagrange method and the Trust Region Policy Optimization algorithm.

Initialize an instance of BaseAlgo.

_compute_adv_surrogate(adv_r, adv_c)[source]#

Compute surrogate loss.

TRPOLag uses the following surrogate loss:

(7)#\[L = \frac{1}{1 + \lambda} [ A^{R}_{\pi_{\theta}} (s, a) - \lambda A^C_{\pi_{\theta}} (s, a) ]\]
Parameters:
  • adv_r (torch.Tensor) – The reward_advantage sampled from the buffer.

  • adv_c (torch.Tensor) – The cost_advantage sampled from the buffer.

Returns:

The ``advantage`` computed by combining ``reward_advantage`` and ``cost_advantage``.

Return type:

Tensor

_init()[source]#

Initialize the TRPOLag specific model.

The TRPOLag algorithm uses a Lagrange multiplier to balance the cost and reward.

Return type:

None

_init_log()[source]#

Log the TRPOLag specific information.

Things to log:

  • Metrics/LagrangeMultiplier – The Lagrange multiplier.

Return type:

None

_update()[source]#

Update the actor and critic, as in the PolicyGradient algorithm.

Additionally, we update the Lagrange multiplier by calling the update_lagrange_multiplier() method.

Return type:

None

Note

The _loss_pi() method is defined in the PolicyGradient algorithm. When a Lagrange multiplier is used, _loss_pi() returns the policy loss as:

(8)#\[L_{\pi} = \mathbb{E}_{s_t \sim \rho_{\pi}} \left[ \frac{\pi_{\theta} (a_t|s_t)}{\pi_{\theta}^{old} (a_t|s_t)} [ A^{R}_{\pi_{\theta}} (s_t, a_t) - \lambda A^{C}_{\pi_{\theta}} (s_t, a_t) ] \right]\]

where \(\lambda\) is the Lagrange multiplier parameter.
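
Conceptually, update_lagrange_multiplier() performs dual ascent on the constraint violation. The simplified rule below is a hedged sketch of that idea (the function name, learning rate, and example values are illustrative, not OmniSafe's internals):

```python
def dual_ascent_step(
    lam: float,
    mean_ep_cost: float,
    cost_limit: float,
    lambda_lr: float = 0.01,
) -> float:
    """Projected dual ascent: raise lambda when the constraint is violated, never below zero."""
    return max(0.0, lam + lambda_lr * (mean_ep_cost - cost_limit))


# Example: a cost overshoot of 5 with lambda_lr = 0.01 raises lambda by 0.05.
lam = dual_ascent_step(lam=0.0, mean_ep_cost=30.0, cost_limit=25.0)
```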

CRPO#

Documentation

class omnisafe.algorithms.on_policy.OnCRPO(env_id, cfgs)[source]#

The on-policy CRPO algorithm.

References

  • Title: CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee.

  • Authors: Tengyu Xu, Yingbin Liang, Guanghui Lan.

  • URL: https://arxiv.org/abs/2011.05869

Initialize an instance of OnCRPO.

_compute_adv_surrogate(adv_r, adv_c)[source]#

Compute the advantage surrogate.

In the CRPO algorithm, we first check whether the cost is within the limit. If it is, we use the reward advantage; otherwise, we use the cost advantage (see the sketch after this method's documentation).

Parameters:
  • adv_r (torch.Tensor) – The reward_advantage sampled from the buffer.

  • adv_c (torch.Tensor) – The cost_advantage sampled from the buffer.

Returns:

The ``advantage`` chosen from ``reward_advantage`` and ``cost_advantage``.

Return type:

Tensor
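
A hedged sketch of this selection rule, assuming a per-epoch mean episode cost is available; the tolerance argument and the negation of the cost advantage (so that maximizing the surrogate pushes the cost down) are assumptions about the sign convention, and all names are illustrative:

```python
import torch

def crpo_adv_surrogate(
    adv_r: torch.Tensor,
    adv_c: torch.Tensor,
    mean_ep_cost: float,
    cost_limit: float,
    tolerance: float = 0.0,
) -> torch.Tensor:
    """Select which advantage drives the policy update, as described above."""
    if mean_ep_cost <= cost_limit + tolerance:
        # Constraint satisfied: optimize the reward objective
        # (a Misc/RewUpdate counter would be incremented here).
        return adv_r
    # Constraint violated: switch to the cost objective
    # (a Misc/CostUpdate counter would be incremented here).
    return -adv_c
```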

_init_log()[source]#

Log the CRPO specific information.

Things to log:

  • Misc/RewUpdate – The number of policy updates driven by the reward advantage.

  • Misc/CostUpdate – The number of policy updates driven by the cost advantage.

Return type:

None