Second Order Algorithms#
- CPO – The Constrained Policy Optimization (CPO) algorithm.
- PCPO – The Projection-Based Constrained Policy Optimization (PCPO) algorithm.
Constrained Policy Optimization#
Documentation
- class omnisafe.algorithms.on_policy.CPO(env_id, cfgs)[source]#
The Constrained Policy Optimization (CPO) algorithm.
CPO is a derivative of TRPO.
References
Title: Constrained Policy Optimization
Authors: Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel.
URL: CPO
- __init__(env_id, cfgs)#
- _cpo_search_step(step_direction, grad, p_dist, obs, act, logp, adv_r, adv_c, loss_reward_before, loss_cost_before, total_steps=15, decay=0.8, violation_c=0, optim_case=0)[source]#
Use line-search to find the step size that satisfies the constraint.
CPO uses line-search to find the step size that satisfies the constraint. The constraint is defined as:
(3)#\[\begin{split}J^C(\theta + \alpha \delta) - J^C(\theta) \leq \max \{0, c\}\\ D_{KL}(\pi_{\theta}(\cdot|s) || \pi_{\theta + \alpha \delta}(\cdot|s)) \leq \delta_{KL}\end{split}\]
where \(\delta_{KL}\) is the KL-divergence constraint, \(\alpha\) is the step size, and \(c\) is the constraint violation.
- Parameters:
step_direction (torch.Tensor) – The step direction.
grad (torch.Tensor) – The flattened gradient of the policy loss.
p_dist (torch.distributions.Distribution) – The old policy distribution.
obs (torch.Tensor) – The observation.
act (torch.Tensor) – The action.
logp (torch.Tensor) – The log probability of the action.
adv_r (torch.Tensor) – The reward advantage.
adv_c (torch.Tensor) – The cost advantage.
loss_reward_before (float) – The reward loss before the update.
loss_cost_before (float) – The cost loss before the update.
total_steps (int, optional) – The total number of search steps. Defaults to 15.
decay (float, optional) – The decay rate of the step size. Defaults to 0.8.
violation_c (int, optional) – The constraint violation. Defaults to 0.
optim_case (int, optional) – The optimization case. Defaults to 0.
- Return type:
Tuple[torch.Tensor, int]
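The backtracking line search can be sketched in plain Python. This is a hedged illustration, not the OmniSafe implementation: the function and argument names are ours, and the surrogate/KL callables stand in for torch computations. A candidate step is accepted only if the reward surrogate improves, the cost surrogate increase stays within \(\max\{0, c\}\), and the KL trust region holds.

```python
def backtracking_search(theta, step_direction, surrogate_loss, cost_surrogate,
                        kl_divergence, loss_before, violation_c, target_kl,
                        total_steps=15, decay=0.8):
    """Backtrack over step fractions alpha = decay**i (illustrative sketch).

    Accepts the first candidate theta + alpha * step_direction that
    (1) improves the reward surrogate loss,
    (2) keeps the cost surrogate increase below max(0, c), and
    (3) stays inside the KL trust region.
    """
    for i in range(total_steps):
        alpha = decay ** i
        theta_new = theta + alpha * step_direction
        loss_new = surrogate_loss(theta_new)
        cost_diff = cost_surrogate(theta_new) - cost_surrogate(theta)
        kl = kl_divergence(theta, theta_new)
        if (loss_new < loss_before                       # (1) reward improved
                and cost_diff <= max(0.0, violation_c)   # (2) cost constraint
                and kl <= target_kl):                    # (3) trust region
            return theta_new, alpha
    return theta, 0.0  # no acceptable step found: keep the old parameters
```

If no candidate passes within `total_steps` attempts, the update is rejected entirely, which is what makes the search safe: a step that violates the cost constraint is never taken.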
- _init_log()[source]#
Log the CPO-specific information.
Things to log:
- Misc/AcceptanceStep – The acceptance step size.
- Misc/Alpha – \(\frac{\delta_{KL}}{xHx}\) in the original paper, where \(x\) is the step direction, \(H\) is the Hessian matrix, and \(\delta_{KL}\) is the target KL divergence.
- Misc/FinalStepNorm – The final step norm.
- Misc/gradient_norm – The gradient norm.
- Misc/xHx – \(xHx\) in the original paper.
- Misc/H_inv_g – \(H^{-1}g\) in the original paper.
- Return type:
None
- _loss_pi_cost(obs, act, logp, adv_c)[source]#
Compute the cost performance of the current policy.
Specifically, we compute the cost surrogate loss of the current policy with respect to the old policy:
(4)#\[L = \mathbb{E}_{\pi} \left[ \frac{\pi^{'}(a|s)}{\pi(a|s)} A^C(s, a) \right]\]
where \(A^C(s, a)\) is the cost advantage, \(\pi(a|s)\) is the old policy, and \(\pi^{'}(a|s)\) is the current policy.
- Parameters:
obs (torch.Tensor) – Observation.
act (torch.Tensor) – Action.
logp (torch.Tensor) – Log probability of action.
adv_c (torch.Tensor) – Cost advantage.
- Returns:
torch.Tensor – The cost surrogate loss of the current policy.
- Return type:
Tensor
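The surrogate in equation (4) is simply a mean of cost advantages weighted by the importance ratio \(\pi^{'}(a|s)/\pi(a|s)\), which is computed from log probabilities. A minimal NumPy sketch (illustrative only; the OmniSafe implementation operates on torch tensors and the function name is ours):

```python
import numpy as np

def cost_surrogate_loss(logp_old, logp_new, adv_c):
    """L = E[ (pi'(a|s) / pi(a|s)) * A^C(s, a) ], estimated over a batch.

    The importance ratio is exp(logp_new - logp_old), which is numerically
    safer than dividing raw probabilities.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return np.mean(ratio * np.asarray(adv_c))
```

For example, if the new policy halves the probability of one action and doubles that of another, their cost advantages are reweighted by 0.5 and 2.0 respectively before averaging.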
- _update_actor(obs, act, logp, adv_r, adv_c)[source]#
Update policy network.
CPO, as a derivative of Trust Region Policy Optimization, updates the policy network using the conjugate gradient algorithm, following these steps:
- Compute the gradient of the policy.
- Compute the step direction.
- Search for a step size that satisfies the constraint.
- Update the policy network.
- Parameters:
obs (torch.Tensor) – The observation tensor.
act (torch.Tensor) – The action tensor.
logp (torch.Tensor) – The log probability of the action.
adv_r (torch.Tensor) – The advantage tensor.
adv_c (torch.Tensor) – The cost advantage tensor.
- Return type:
None
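The step-direction computation above solves \(Hx = g\) (Hessian of the KL divergence against the policy gradient) without ever materializing \(H\), using only Hessian-vector products. A minimal NumPy sketch of that conjugate gradient solver (an illustration under the assumption of a symmetric positive-definite \(H\); not the OmniSafe code, which works on flattened torch parameter vectors):

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Solve H x = g given only a Hessian-vector-product callable `hvp`."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - H x (x starts at zero)
    p = r.copy()          # current search direction
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)      # exact step length along p
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:           # residual small enough: converged
            break
        p = r + (rs_new / rs) * p  # new direction, H-conjugate to the old
        rs = rs_new
    return x
```

In the policy-optimization setting, `hvp` is typically implemented with double backpropagation through the mean KL divergence, so the full Hessian is never formed.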
Projection-Based Constrained Policy Optimization#
Documentation
- class omnisafe.algorithms.on_policy.PCPO(env_id, cfgs)[source]#
The Projection-Based Constrained Policy Optimization (PCPO) algorithm.
References
Title: Projection-Based Constrained Policy Optimization
Authors: Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, Peter J. Ramadge.
URL: PCPO
- __init__(env_id, cfgs)#
- _update_actor(obs, act, logp, adv_r, adv_c)[source]#
Update policy network.
PCPO updates the policy network using the conjugate gradient algorithm, following these steps:
- Compute the gradient of the policy.
- Compute the step direction.
- Search for a step size that satisfies the constraint (both the KL divergence and the cost limit).
- Update the policy network.
- Parameters:
obs (torch.Tensor) – The observation tensor.
act (torch.Tensor) – The action tensor.
logp (torch.Tensor) – The log probability of the action.
adv_r (torch.Tensor) – The advantage tensor.
adv_c (torch.Tensor) – The cost advantage tensor.
- Return type:
None
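What distinguishes PCPO from CPO is its two-stage update: first an unconstrained trust-region step that improves reward, then a projection of the iterate back onto the cost-constraint set. The PCPO paper describes both KL-divergence and L2-norm projections; the sketch below uses the simpler L2 variant with a linearized constraint \(b^\top(\theta - \theta^k) + c \leq 0\). Names and simplifications are ours, not OmniSafe's.

```python
import numpy as np

def pcpo_l2_update(theta, step_reward, b, c):
    """Two-stage PCPO update with L2-norm projection (illustrative sketch).

    theta       -- current flattened policy parameters
    step_reward -- the reward-improving trust-region step (stage 1)
    b           -- gradient of the cost constraint at theta
    c           -- current constraint value (positive means violated)
    """
    theta_mid = theta + step_reward                # stage 1: reward step
    violation = b @ (theta_mid - theta) + c        # linearized constraint value
    coef = max(0.0, violation) / (b @ b)           # project only if violated
    return theta_mid - coef * b                    # stage 2: L2 projection
```

When the intermediate point already satisfies the linearized constraint, `coef` is zero and the projection is a no-op, so PCPO behaves like a plain trust-region update in the feasible region.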