First Order Algorithms
FOCOPS(env_id, cfgs) | The First Order Constrained Optimization in Policy Space (FOCOPS) algorithm.
CUP(env_id, cfgs) | The Constrained Update Projection (CUP) Approach to Safe Policy Optimization.
FOCOPS
Documentation
- class omnisafe.algorithms.on_policy.FOCOPS(env_id, cfgs)
The First Order Constrained Optimization in Policy Space (FOCOPS) algorithm.
References
Title: First Order Constrained Optimization in Policy Space
Authors: Yiming Zhang, Quan Vuong, Keith W. Ross.
URL: FOCOPS
- _compute_adv_surrogate(adv_r, adv_c)
Compute surrogate loss.
FOCOPS uses the following surrogate loss:
\[L = \frac{1}{1 + \lambda} [A^{R}_{\pi_{\theta}}(s, a) - \lambda A^C_{\pi_{\theta}}(s, a)] \tag{4}\]
- Parameters:
adv_r (torch.Tensor) – reward advantage
adv_c (torch.Tensor) – cost advantage
- Return type:
Tensor
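The surrogate in equation (4) can be checked by hand with a few numbers. The following is a minimal sketch in plain Python rather than the library's tensor code; `combine_advantages` is a hypothetical helper name, and the non-negativity check on the multiplier is an assumption made for illustration.

```python
from typing import List


def combine_advantages(adv_r: List[float], adv_c: List[float], lam: float) -> List[float]:
    """Element-wise (A^R - lam * A^C) / (1 + lam); hypothetical helper, not the OmniSafe API."""
    if lam < 0:
        # Assumption: the Lagrange multiplier is kept non-negative.
        raise ValueError("lam is assumed to be non-negative")
    return [(r - lam * c) / (1.0 + lam) for r, c in zip(adv_r, adv_c)]


# Three samples: reward advantages, cost advantages, and a multiplier of 0.5.
surrogate = combine_advantages([1.0, 0.5, -0.2], [0.4, 0.0, 1.0], lam=0.5)
print(surrogate)
```

With `lam=0` the surrogate reduces to the reward advantage; as `lam` grows, the cost advantage dominates and the `1 / (1 + lam)` factor keeps the overall scale bounded.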
- _init()
Initialize the FOCOPS specific model.
The FOCOPS algorithm uses a Lagrange multiplier to balance the cost and reward.
- Return type:
None
- _init_log()
Log the FOCOPS specific information.
Things to log | Description
Metrics/LagrangeMultiplier | The Lagrange multiplier.
- Return type:
None
- _loss_pi(obs, act, logp, adv)
Compute pi/actor loss.
In FOCOPS, the loss is defined as:
\begin{eqnarray} L = \nabla_\theta D_{K L}\left(\pi_\theta^{'} \| \pi_{\theta}\right)[s] -\frac{1}{\eta} \underset{a \sim \pi_{\theta}} {\mathbb{E}}\left[\frac{\nabla_\theta \pi_\theta(a \mid s)} {\pi_{\theta}(a \mid s)}\left(A^{R}_{\pi_{\theta}}(s, a) -\lambda A^C_{\pi_{\theta}}(s, a)\right)\right] \end{eqnarray}
where \(\eta\) is a hyperparameter, \(\lambda\) is the Lagrange multiplier, \(A^{R}_{\pi_{\theta}}(s, a)\) is the reward advantage function, \(A^C_{\pi_{\theta}}(s, a)\) is the cost advantage function, \(\pi_\theta^{'}\) is the updated policy, and \(\pi_{\theta}\) is the current policy.
- Parameters:
obs (torch.Tensor) – observation stored in buffer.
act (torch.Tensor) – action stored in buffer.
logp (torch.Tensor) – log probability of action stored in buffer.
adv (torch.Tensor) – advantage stored in buffer.
- Return type:
Tuple[Tensor,Dict[str,float]]
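One way to read the expression above is as the gradient of a scalar objective: a KL term minus the importance-weighted combined advantage scaled by \(1/\eta\). The sketch below evaluates such an objective on a small batch in plain Python. It is an illustration under stated assumptions, not the library's implementation: `focops_loss_sketch` is a hypothetical name, the per-sample KL values are taken as given inputs, and samples are assumed to come from the old policy so that `exp(logp_new - logp_old)` acts as the importance ratio.

```python
import math
from typing import Sequence


def focops_loss_sketch(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    kl: Sequence[float],
    adv_r: Sequence[float],
    adv_c: Sequence[float],
    lam: float,
    eta: float,
) -> float:
    """Mean of KL - (1 / eta) * ratio * (A^R - lam * A^C); hypothetical helper."""
    total = 0.0
    for lp_new, lp_old, d, a_r, a_c in zip(logp_new, logp_old, kl, adv_r, adv_c):
        ratio = math.exp(lp_new - lp_old)  # importance ratio between new and old policy
        total += d - (1.0 / eta) * ratio * (a_r - lam * a_c)
    return total / len(kl)


# Two samples with identical old and new log-probabilities, so the ratio is 1.
loss = focops_loss_sketch(
    logp_new=[-1.0, -2.0],
    logp_old=[-1.0, -2.0],
    kl=[0.0, 0.02],
    adv_r=[1.0, -1.0],
    adv_c=[0.5, 0.5],
    lam=2.0,
    eta=0.5,
)
print(loss)
```

A smaller \(\eta\) weights the advantage term more heavily relative to the KL penalty, which allows larger policy steps.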
- _update()
Update actor, critic, and Lagrange multiplier parameters.
In FOCOPS, the Lagrange multiplier is updated with the naive Lagrange multiplier update:
\[\lambda_{k+1} = \lambda_k + \eta (J^{C}_{\pi_\theta} - C) \tag{6}\]
where \(\lambda_k\) is the Lagrange multiplier at iteration \(k\), \(\eta\) is the Lagrange multiplier learning rate, \(J^{C}_{\pi_\theta}\) is the cost of the current policy, and \(C\) is the cost limit.
Then, in each iteration of the policy update, FOCOPS computes the current policy's distribution, which is used to calculate the policy loss.
- Parameters:
self (object) – object of the class.
- Return type:
None
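The multiplier update in equation (6) is a single gradient-ascent step on the constraint violation. A minimal sketch with worked numbers follows; `update_lagrange_multiplier` is a hypothetical helper, and clamping the result at zero is an assumption added here (the equation itself does not state a projection).

```python
def update_lagrange_multiplier(
    lam: float, mean_cost: float, cost_limit: float, lr: float
) -> float:
    """One step of lam + lr * (J^C - C); hypothetical helper, not the OmniSafe API."""
    new_lam = lam + lr * (mean_cost - cost_limit)
    # Assumption: project back onto lam >= 0 so the penalty never becomes a bonus.
    return max(0.0, new_lam)


# Cost above the limit (30 > 25): the multiplier grows from 0.1 to about 0.15.
lam_up = update_lagrange_multiplier(lam=0.1, mean_cost=30.0, cost_limit=25.0, lr=0.01)

# Cost well below the limit (10 < 25): the raw step is negative and is clamped to 0.
lam_down = update_lagrange_multiplier(lam=0.1, mean_cost=10.0, cost_limit=25.0, lr=0.01)
print(lam_up, lam_down)
```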
CUP
Documentation
- class omnisafe.algorithms.on_policy.CUP(env_id, cfgs)
The Constrained Update Projection (CUP) Approach to Safe Policy Optimization.
References
Title: Constrained Update Projection Approach to Safe Policy Optimization
Authors: Long Yang, Jiaming Ji, Juntao Dai, Linrui Zhang, Binbin Zhou, Pengfei Li, Yaodong Yang, Gang Pan.
URL: CUP
- _init()
The initialization of the algorithm.
User can define the initialization of the algorithm by inheriting this function.
- Return type:
None
Example
>>> def _init(self) -> None:
...     super()._init()
...     self._buffer = CustomBuffer()
...     self._model = CustomModel()
- _init_log()
Log the CUP specific information.
Things to log | Description
Metrics/LagrangeMultiplier | The Lagrange multiplier.
Train/MaxRatio | The maximum ratio between the current policy and the old policy.
Train/MinRatio | The minimum ratio between the current policy and the old policy.
Loss/Loss_pi_c | The loss of the cost performance.
Train/SecondStepStopIter | The number of iterations to stop the second step.
Train/SecondStepEntropy | The entropy of the current policy.
Train/SecondStepPolicyRatio | The ratio between the current policy and the old policy.
- Return type:
None
- _loss_pi_cost(obs, act, logp, adv_c)
Compute the cost performance at this moment.
Specifically, we compute the KL divergence between the current policy and the old policy, the entropy of the current policy, and the ratio between the current policy and the old policy.
The loss of the cost performance is defined as:
\[L = \underset{a \sim \pi_{\theta}}{\mathbb{E}}[\lambda \frac{1 - \gamma \nu}{1 - \gamma} \frac{\pi_\theta^{'}(a|s)}{\pi_\theta(a|s)} A^{C}_{\pi_{\theta}} + KL(\pi_\theta^{'}(a|s)||\pi_\theta(a|s))] \tag{8}\]
where \(\lambda\) is the Lagrange multiplier, \(\frac{1 - \gamma \nu}{1 - \gamma}\) is the coefficient value, \(\pi_\theta^{'}(a|s)\) is the current policy, \(\pi_\theta(a|s)\) is the old policy, \(A^{C}_{\pi_{\theta}}\) is the cost advantage, and \(KL(\pi_\theta^{'}(a|s)||\pi_\theta(a|s))\) is the KL divergence between the current policy and the old policy.
- Parameters:
obs (torch.Tensor) – Observation.
act (torch.Tensor) – Action.
logp (torch.Tensor) – Log probability.
adv_c (torch.Tensor) – Cost advantage.
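The cost loss in equation (8) can be evaluated on a small batch to see how the coefficient \(\frac{1 - \gamma \nu}{1 - \gamma}\) scales the cost term against the KL term. This is a plain-Python sketch under stated assumptions, not the library's tensor code: `cup_cost_loss_sketch` is a hypothetical name, per-sample KL values are passed in directly, and the policy ratio is computed as `exp(logp_new - logp_old)`.

```python
import math
from typing import Sequence


def cup_cost_loss_sketch(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    kl: Sequence[float],
    adv_c: Sequence[float],
    lam: float,
    gamma: float,
    nu: float,
) -> float:
    """Mean of lam * coef * ratio * A^C + KL; hypothetical helper."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma is assumed to lie in [0, 1)")
    coef = (1.0 - gamma * nu) / (1.0 - gamma)  # the coefficient from equation (8)
    total = 0.0
    for lp_new, lp_old, d, a_c in zip(logp_new, logp_old, kl, adv_c):
        ratio = math.exp(lp_new - lp_old)  # current policy over old policy
        total += lam * coef * ratio * a_c + d
    return total / len(kl)


# gamma = 0.5 and nu = 0 give coef = 2; equal log-probabilities give ratio = 1.
loss_c = cup_cost_loss_sketch(
    logp_new=[-0.7, -1.3],
    logp_old=[-0.7, -1.3],
    kl=[0.1, 0.3],
    adv_c=[1.0, 3.0],
    lam=1.0,
    gamma=0.5,
    nu=0.0,
)
print(loss_c)
```

When `lam` is zero only the KL term remains, so the step reduces to staying close to the old policy.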
- _update()
Update actor, critic, and Lagrange multiplier parameters.
In CUP, the Lagrange multiplier is updated with the naive Lagrange multiplier update:
\[\lambda_{k+1} = \lambda_k + \eta (J^{C}_{\pi_\theta} - C) \tag{9}\]
where \(\lambda_k\) is the Lagrange multiplier at iteration \(k\), \(\eta\) is the Lagrange multiplier learning rate, \(J^{C}_{\pi_\theta}\) is the cost of the current policy, and \(C\) is the cost limit.
Then, in each iteration of the policy update, CUP computes the current policy's distribution, which is used to calculate the policy loss.
- Parameters:
self (object) – object of the class.
- Return type:
None