First Order Algorithms#

FOCOPS(env_id, cfgs)

The First Order Constrained Optimization in Policy Space (FOCOPS) algorithm.

CUP(env_id, cfgs)

The Constrained Update Projection (CUP) Approach to Safe Policy Optimization.

FOCOPS#

Documentation

class omnisafe.algorithms.on_policy.FOCOPS(env_id, cfgs)[source]#

The First Order Constrained Optimization in Policy Space (FOCOPS) algorithm.

References

  • Title: First Order Constrained Optimization in Policy Space

  • Authors: Yiming Zhang, Quan Vuong, Keith W. Ross.

  • URL: FOCOPS

__init__(env_id, cfgs)[source]#
_compute_adv_surrogate(adv_r, adv_c)[source]#

Compute surrogate loss.

FOCOPS uses the following surrogate loss:

(4)#\[L = \frac{1}{1 + \lambda} [A^{R}_{\pi_{\theta}}(s, a) - \lambda A^C_{\pi_{\theta}}(s, a)]\]
Parameters:
  • adv_r (torch.Tensor) – reward advantage.

  • adv_c (torch.Tensor) – cost advantage.

Return type:

Tensor
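
A minimal sketch of how this surrogate might be computed, assuming \(\lambda\) is passed in as a plain float; the standalone function below is illustrative rather than the library's exact interface:

import torch


def compute_adv_surrogate(adv_r: torch.Tensor, adv_c: torch.Tensor, lam: float) -> torch.Tensor:
    """Blend reward and cost advantages as in Eq. (4): (A^R - lam * A^C) / (1 + lam)."""
    return (adv_r - lam * adv_c) / (1.0 + lam)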

_init()[source]#

Initialize the FOCOPS-specific model.

The FOCOPS algorithm uses a Lagrange multiplier to balance the cost and reward.

Return type:

None
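
A minimal sketch of the state such an initialization might set up; the attribute names and hyperparameter values below are illustrative, not the library's actual configuration:

import torch

# Illustrative dual variable and hyperparameters; in practice these come from the algorithm's cfgs.
lagrangian_multiplier = torch.tensor(0.0)  # lambda >= 0, starts at zero
cost_limit = 25.0                          # C, the episodic cost budget
lambda_lr = 0.035                          # eta, the dual-ascent learning rate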

_init_log()[source]#

Log the FOCOPS-specific information.

Things to log                 Description
Metrics/LagrangeMultiplier    The Lagrange multiplier.

Return type:

None

_loss_pi(obs, act, logp, adv)[source]#

Compute pi/actor loss.

In FOCOPS, the loss is defined as:

(5)#\[L = \nabla_\theta D_{KL}\left(\pi_\theta^{'} \| \pi_{\theta}\right)[s] - \frac{1}{\eta} \underset{a \sim \pi_{\theta}}{\mathbb{E}}\left[\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_{\theta}(a \mid s)}\left(A^{R}_{\pi_{\theta}}(s, a) - \lambda A^C_{\pi_{\theta}}(s, a)\right)\right]\]

where \(\eta\) is a hyperparameter, \(\lambda\) is the Lagrange multiplier, \(A^{R}_{\pi_{\theta}}(s, a)\) is the reward advantage function, \(A^C_{\pi_{\theta}}(s, a)\) is the cost advantage function, \(\pi_\theta^{'}\) is the policy being updated, and \(\pi_{\theta}\) is the current policy.

Parameters:
  • obs (torch.Tensor) – observation stored in buffer.

  • act (torch.Tensor) – action stored in buffer.

  • logp (torch.Tensor) – log probability of action stored in buffer.

  • adv (torch.Tensor) – advantage stored in buffer.

Return type:

Tuple[Tensor, Dict[str, float]]
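
A minimal sketch of this loss for a Gaussian policy, assuming adv already holds the blended advantage from Eq. (4) and that old_dist caches the pre-update policy distribution; the function name, the default values of eta and kl_bound, and the per-state KL indicator are illustrative choices in the spirit of the formula above, not the library's exact implementation:

import torch
from torch.distributions import Normal, kl_divergence


def focops_policy_loss(
    new_dist: Normal,
    old_dist: Normal,
    act: torch.Tensor,
    logp_old: torch.Tensor,
    adv: torch.Tensor,
    eta: float = 0.02,
    kl_bound: float = 0.02,
) -> torch.Tensor:
    """Per-state KL-regularized surrogate in the spirit of the FOCOPS loss."""
    logp_new = new_dist.log_prob(act).sum(-1)
    ratio = torch.exp(logp_new - logp_old)          # pi_theta'(a|s) / pi_theta(a|s)
    kl = kl_divergence(new_dist, old_dist).sum(-1)  # D_KL(pi_theta' || pi_theta)[s]
    # Only keep states whose KL is still inside the trust region.
    mask = (kl.detach() <= kl_bound).float()
    return ((kl - (1.0 / eta) * ratio * adv) * mask).mean()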

_update()[source]#

Update actor, critic, and Lagrange multiplier parameters.

In FOCOPS, the Lagrange multiplier is updated with the naive Lagrange multiplier update rule:

(6)#\[\lambda_{k+1} = \lambda_k + \eta (J^{C}_{\pi_\theta} - C)\]

where \(\lambda_k\) is the Lagrange multiplier at iteration \(k\), \(\eta\) is the Lagrange multiplier learning rate, \(J^{C}_{\pi_\theta}\) is the cost of the current policy, and \(C\) is the cost limit.

Then, in each iteration of the policy update, FOCOPS computes the current policy’s distribution, which is used to calculate the policy loss.

Parameters:

self (object) – object of the class.

Return type:

None
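
A minimal sketch of this update rule; the function and the clamp at zero are illustrative (keeping the multiplier non-negative is standard practice), not the library's exact implementation:

def update_lagrange_multiplier(lam: float, mean_ep_cost: float,
                               cost_limit: float, lambda_lr: float) -> float:
    """One naive dual-ascent step as in Eq. (6), projected back onto lam >= 0."""
    lam += lambda_lr * (mean_ep_cost - cost_limit)
    return max(lam, 0.0)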

CUP#

Documentation

class omnisafe.algorithms.on_policy.CUP(env_id, cfgs)[source]#

The Constrained Update Projection (CUP) Approach to Safe Policy Optimization.

References

  • Title: Constrained Update Projection Approach to Safe Policy Optimization

  • Authors: Long Yang, Jiaming Ji, Juntao Dai, Linrui Zhang, Binbin Zhou, Pengfei Li, Yaodong Yang, Gang Pan.

  • URL: CUP

__init__(env_id, cfgs)[source]#
_init()[source]#

The initialization of the algorithm.

Users can customize the initialization of the algorithm by overriding this method.

Return type:

None

Example

>>> def _init(self) -> None:
...     super()._init()
...     self._buffer = CustomBuffer()
...     self._model = CustomModel()
_init_log()[source]#

Log the CUP-specific information.

Things to log                 Description
Metrics/LagrangeMultiplier    The Lagrange multiplier.
Train/MaxRatio                The maximum ratio between the current policy and the old policy.
Train/MinRatio                The minimum ratio between the current policy and the old policy.
Loss/Loss_pi_c                The loss of the cost performance.
Train/SecondStepStopIter      The number of iterations after which the second step stops.
Train/SecondStepEntropy       The entropy of the current policy.
Train/SecondStepPolicyRatio   The ratio between the current policy and the old policy.

Return type:

None

_loss_pi_cost(obs, act, logp, adv_c)[source]#

Compute the cost performance of the current policy.

Specifically, we compute the KL divergence between the current policy and the old policy, the entropy of the current policy, and the ratio between the current policy and the old policy.

The loss of the cost performance is defined as:

(8)#\[L = \underset{a \sim \pi_{\theta}}{\mathbb{E}}[\lambda \frac{1 - \gamma \nu}{1 - \gamma} \frac{\pi_\theta^{'}(a|s)}{\pi_\theta(a|s)} A^{C}_{\pi_{\theta}} + KL(\pi_\theta^{'}(a|s)||\pi_\theta(a|s))]\]

where \(\lambda\) is the Lagrange multiplier, \(\frac{1 - \gamma \nu}{1 - \gamma}\) is the coefficient value, \(\pi_\theta^{'}(a|s)\) is the current policy, \(\pi_\theta(a|s)\) is the old policy, \(A^{C}_{\pi_{\theta}}\) is the cost advantage, and \(KL(\pi_\theta^{'}(a|s)||\pi_\theta(a|s))\) is the KL divergence between the current policy and the old policy.

Parameters:
  • obs (torch.Tensor) – Observation.

  • act (torch.Tensor) – Action.

  • logp (torch.Tensor) – Log probability.

  • adv_c (torch.Tensor) – Cost advantage.
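
A minimal sketch of this cost loss for a Gaussian policy; the function name and arguments are illustrative, and coef stands for the coefficient \(\frac{1 - \gamma \nu}{1 - \gamma}\) from Eq. (8):

import torch
from torch.distributions import Normal, kl_divergence


def cup_cost_loss(
    new_dist: Normal,
    old_dist: Normal,
    act: torch.Tensor,
    logp_old: torch.Tensor,
    adv_c: torch.Tensor,
    lagrangian: float,
    coef: float,
) -> torch.Tensor:
    """Projection-step cost loss in the spirit of Eq. (8)."""
    logp_new = new_dist.log_prob(act).sum(-1)
    ratio = torch.exp(logp_new - logp_old)          # pi_theta'(a|s) / pi_theta(a|s)
    kl = kl_divergence(new_dist, old_dist).sum(-1)  # KL(pi_theta' || pi_theta)
    return (lagrangian * coef * ratio * adv_c + kl).mean()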

_update()[source]#

Update actor, critic, and Lagrange multiplier parameters.

In CUP, the Lagrange multiplier is updated with the naive Lagrange multiplier update rule:

(9)#\[\lambda_{k+1} = \lambda_k + \eta (J^{C}_{\pi_\theta} - C)\]

where \(\lambda_k\) is the Lagrange multiplier at iteration \(k\), \(\eta\) is the Lagrange multiplier learning rate, \(J^{C}_{\pi_\theta}\) is the cost of the current policy, and \(C\) is the cost limit.

Then, in each iteration of the policy update, CUP computes the current policy’s distribution, which is used to calculate the policy loss.

Parameters:

self (object) – object of the class.

Return type:

None
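
As a quick worked example with illustrative numbers: if \(\lambda_k = 0.1\), \(\eta = 0.05\), \(J^{C}_{\pi_\theta} = 28\), and \(C = 25\), then \(\lambda_{k+1} = 0.1 + 0.05 \times (28 - 25) = 0.25\); if the right-hand side were negative, the multiplier would typically be clamped at zero so that it stays non-negative.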