First Order Algorithms#
- FOCOPS: The First Order Constrained Optimization in Policy Space (FOCOPS) algorithm.
- CUP: The Constrained Update Projection (CUP) Approach to Safe Policy Optimization.
FOCOPS#
Documentation
- class omnisafe.algorithms.on_policy.FOCOPS(env_id, cfgs)[source]#
The First Order Constrained Optimization in Policy Space (FOCOPS) algorithm.
References
Title: First Order Constrained Optimization in Policy Space
Authors: Yiming Zhang, Quan Vuong, Keith W. Ross.
URL: https://arxiv.org/abs/2002.06506
- _compute_adv_surrogate(adv_r, adv_c)[source]#
Compute surrogate loss.
FOCOPS uses the following surrogate loss:
(4)#
\[L = \frac{1}{1 + \lambda} [A^{R}_{\pi_{\theta}}(s, a) - \lambda A^C_{\pi_{\theta}}(s, a)]\]
- Parameters:
adv_r (torch.Tensor) – reward advantage.
adv_c (torch.Tensor) – cost advantage.
- Return type:
Tensor
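Since equation (4) is linear in both advantages, the method reduces to a single weighted combination. Below is a minimal sketch of that computation, assuming the multiplier is passed in as a plain float; the real method presumably reads it from the algorithm's Lagrange state, so this is illustrative rather than the omnisafe implementation:
```python
import torch

def compute_adv_surrogate(adv_r: torch.Tensor, adv_c: torch.Tensor,
                          lagrangian_multiplier: float) -> torch.Tensor:
    """Combine reward and cost advantages as in Eq. (4)."""
    # (A^R - lambda * A^C) / (1 + lambda); dividing by (1 + lambda) keeps
    # the surrogate's scale stable as the multiplier grows.
    return (adv_r - lagrangian_multiplier * adv_c) / (1 + lagrangian_multiplier)
```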
- _init()[source]#
Initialize the FOCOPS-specific model.
The FOCOPS algorithm uses a Lagrange multiplier to balance the cost and reward.
- Return type:
None
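A rough sketch of what this initialization might look like, assuming a `Lagrange` helper class and a `lagrange_cfgs` config section; both names are assumptions here, not verified omnisafe API:
```python
def _init(self) -> None:
    super()._init()
    # Assumed: a Lagrange helper that owns the multiplier and its
    # learning rate, built from the algorithm's configuration.
    self._lagrange = Lagrange(**self._cfgs.lagrange_cfgs)
```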
- _init_log()[source]#
Log the FOCOPS-specific information.
Things to log:
- Metrics/LagrangeMultiplier: The Lagrange multiplier.
- Return type:
None
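A hedged sketch of how this key might be registered, assuming the logger exposes a `register_key` method (an assumption about the logging API, not a confirmed signature):
```python
def _init_log(self) -> None:
    super()._init_log()
    # Register the FOCOPS-specific metric so it shows up in training logs.
    self._logger.register_key('Metrics/LagrangeMultiplier')
```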
- _loss_pi(obs, act, logp, adv)[source]#
Compute pi/actor loss.
In FOCOPS, the loss is defined as:
\begin{eqnarray}
L = \nabla_\theta D_{KL}\left(\pi_\theta^{'} \| \pi_{\theta}\right)[s]
- \frac{1}{\eta} \underset{a \sim \pi_{\theta}}{\mathbb{E}}
\left[\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_{\theta}(a \mid s)}
\left(A^{R}_{\pi_{\theta}}(s, a) - \lambda A^C_{\pi_{\theta}}(s, a)\right)\right]
\end{eqnarray}
where \(\eta\) is a hyperparameter, \(\lambda\) is the Lagrange multiplier, \(A^{R}_{\pi_{\theta}}(s, a)\) is the reward advantage, \(A^C_{\pi_{\theta}}(s, a)\) is the cost advantage, \(\pi_\theta^{'}\) is the optimal policy found in the non-parameterized policy space, and \(\pi_{\theta}\) is the current policy.
- Parameters:
obs (torch.Tensor) – observation stored in buffer.
act (torch.Tensor) – action stored in buffer.
logp (torch.Tensor) – log probability of action stored in buffer.
adv (torch.Tensor) – advantage stored in buffer.
- Return type:
Tuple[Tensor, Dict[str, float]]
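A standalone sketch of this loss under the definition above. The helper is hypothetical (it is not the omnisafe method itself); `adv` is assumed to already be the combined surrogate from `_compute_adv_surrogate`, and, following the FOCOPS paper, states whose KL divergence already exceeds a trust-region threshold are masked out of the loss:
```python
import torch
from torch.distributions import Distribution, kl_divergence

def focops_pi_loss(dist_old: Distribution, dist_new: Distribution,
                   logp_old: torch.Tensor, logp_new: torch.Tensor,
                   adv: torch.Tensor, eta: float = 0.02,
                   kl_threshold: float = 0.02) -> torch.Tensor:
    # Importance ratio pi_theta'(a|s) / pi_theta(a|s) from log probabilities.
    ratio = torch.exp(logp_new - logp_old)
    # Per-state KL divergence between the old and new policy distributions.
    kl = kl_divergence(dist_old, dist_new).sum(-1)
    # KL term minus the scaled importance-weighted advantage, per state.
    per_state = kl - (1.0 / eta) * ratio * adv
    # Indicator from the paper: stop updating states outside the trust region.
    mask = (kl.detach() <= kl_threshold).float()
    return (per_state * mask).mean()
```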
- _update()[source]#
Update actor, critic, and Lagrange multiplier parameters.
In FOCOPS, the Lagrange multiplier is updated via the naive Lagrange multiplier update:
(6)#
\[\lambda_{k+1} = \lambda_k + \eta (J^{C}_{\pi_\theta} - C)\]
where \(\lambda_k\) is the Lagrange multiplier at iteration \(k\), \(\eta\) is the Lagrange multiplier learning rate, \(J^{C}_{\pi_\theta}\) is the cost of the current policy, and \(C\) is the cost limit.
Then, in each iteration of the policy update, FOCOPS computes the current policy's distribution, which is used to calculate the policy loss.
- Parameters:
self (object) – object of the class.
- Return type:
None
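A worked sketch of the multiplier update in equation (6); the function below is illustrative, not the library's own Lagrange class:
```python
def update_lagrange_multiplier(lagrangian_multiplier: float,
                               mean_ep_cost: float,
                               cost_limit: float,
                               lambda_lr: float) -> float:
    # lambda_{k+1} = lambda_k + eta * (J^C - C), clipped at zero so the
    # cost penalty can never turn into a reward bonus.
    return max(lagrangian_multiplier + lambda_lr * (mean_ep_cost - cost_limit), 0.0)
```
For example, with a multiplier of 0.5, a measured episode cost of 30, a cost limit of 25, and a learning rate of 0.01, the update yields 0.5 + 0.01 * (30 - 25) = 0.55.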
CUP#
Documentation
- class omnisafe.algorithms.on_policy.CUP(env_id, cfgs)[source]#
The Constrained Update Projection (CUP) Approach to Safe Policy Optimization.
References
Title: Constrained Update Projection Approach to Safe Policy Optimization
Authors: Long Yang, Jiaming Ji, Juntao Dai, Linrui Zhang, Binbin Zhou, Pengfei Li, Yaodong Yang, Gang Pan.
URL: https://arxiv.org/abs/2209.07089
- _init()[source]#
The initialization of the algorithm.
Users can customize the initialization of the algorithm by overriding this function in a subclass.
- Return type:
None
Example
>>> def _init(self) -> None:
>>>     super()._init()
>>>     self._buffer = CustomBuffer()
>>>     self._model = CustomModel()
- _init_log()[source]#
Log the CUP-specific information.
Things to log:
- Metrics/LagrangeMultiplier: The Lagrange multiplier.
- Train/MaxRatio: The maximum ratio between the current policy and the old policy.
- Train/MinRatio: The minimum ratio between the current policy and the old policy.
- Loss/Loss_pi_c: The loss of the cost performance.
- Train/SecondStepStopIter: The number of iterations at which the second (projection) step stopped.
- Train/SecondStepEntropy: The entropy of the current policy during the second step.
- Train/SecondStepPolicyRatio: The ratio between the current policy and the old policy during the second step.
- Return type:
None
- _loss_pi_cost(obs, act, logp, adv_c)[source]#
Compute the cost performance loss at the current update step.
Specifically, we compute the KL divergence between the current policy and the old policy, the entropy of the current policy, and the ratio between the current policy and the old policy.
The loss of the cost performance is defined as:
(8)#
\[L = \underset{a \sim \pi_{\theta}}{\mathbb{E}}\left[\lambda \frac{1 - \gamma \nu}{1 - \gamma} \frac{\pi_\theta^{'}(a|s)}{\pi_\theta(a|s)} A^{C}_{\pi_{\theta}} + KL(\pi_\theta^{'}(a|s) \| \pi_\theta(a|s))\right]\]
where \(\lambda\) is the Lagrange multiplier, \(\frac{1 - \gamma \nu}{1 - \gamma}\) is the coefficient value, \(\pi_\theta^{'}(a|s)\) is the current policy, \(\pi_\theta(a|s)\) is the old policy, \(A^{C}_{\pi_{\theta}}\) is the cost advantage, and \(KL(\pi_\theta^{'}(a|s) \| \pi_\theta(a|s))\) is the KL divergence between the current policy and the old policy.
- Parameters:
obs (torch.Tensor) – Observation.
act (torch.Tensor) – Action.
logp (torch.Tensor) – Log probability.
adv_c (torch.Tensor) – Cost advantage.
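A minimal sketch of the projection-step loss in equation (8), assuming a Gaussian actor; the function and parameter names are illustrative, not the omnisafe API:
```python
import torch
from torch.distributions import Distribution, kl_divergence

def cup_cost_loss(dist_old: Distribution, dist_new: Distribution,
                  logp_old: torch.Tensor, logp_new: torch.Tensor,
                  adv_c: torch.Tensor, lam: float, coef: float) -> torch.Tensor:
    # coef is the scalar (1 - gamma * nu) / (1 - gamma) from Eq. (8).
    ratio = torch.exp(logp_new - logp_old)          # pi'(a|s) / pi(a|s)
    kl = kl_divergence(dist_old, dist_new).sum(-1)  # per-state KL divergence
    # Penalized cost surrogate plus the KL projection term.
    return (lam * coef * ratio * adv_c + kl).mean()
```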
- _update()[source]#
Update actor, critic, and Lagrange multiplier parameters.
In CUP, the Lagrange multiplier is updated via the naive Lagrange multiplier update:
(9)#
\[\lambda_{k+1} = \lambda_k + \eta (J^{C}_{\pi_\theta} - C)\]
where \(\lambda_k\) is the Lagrange multiplier at iteration \(k\), \(\eta\) is the Lagrange multiplier learning rate, \(J^{C}_{\pi_\theta}\) is the cost of the current policy, and \(C\) is the cost limit.
Then, in each iteration of the policy update, CUP computes the current policy's distribution, which is used to calculate the policy loss.
- Parameters:
self (object) – object of the class.
- Return type:
None
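Putting the pieces together, CUP's update splits into a reward-improvement step followed by a cost-projection step after the multiplier update. The outline below is a hypothetical sketch of that control flow; all helper method and attribute names are placeholders, not the actual implementation:
```python
def update(self) -> None:
    # Naive Lagrange step, Eq. (9): move lambda toward enforcing J^C <= C.
    ep_cost = self._current_mean_episode_cost()          # placeholder helper
    self._lagrange_multiplier = max(
        self._lagrange_multiplier
        + self._lambda_lr * (ep_cost - self._cost_limit),
        0.0,
    )
    # Step 1: improve the reward objective (e.g., a clipped PPO-style update).
    for _ in range(self._update_iters):
        self._policy_improvement_step()                  # placeholder helper
    # Step 2: project the policy back toward the feasible region by
    # minimizing the cost loss of Eq. (8).
    for _ in range(self._update_iters):
        self._projection_step()                          # placeholder helper
```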