Base On-policy Algorithms#

PolicyGradient(env_id, cfgs)

The Policy Gradient algorithm.

NaturalPG(env_id, cfgs)

The Natural Policy Gradient algorithm.

TRPO(env_id, cfgs)

The Trust Region Policy Optimization (TRPO) algorithm.

PPO(env_id, cfgs)

The Proximal Policy Optimization (PPO) algorithm.

Policy Gradient#

Documentation

class omnisafe.algorithms.on_policy.PolicyGradient(env_id, cfgs)[source]#

The Policy Gradient algorithm.

References

  • Title: Policy Gradient Methods for Reinforcement Learning with Function Approximation

  • Authors: Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour.

  • URL: PG

Initialize an instance of the algorithm.

_compute_adv_surrogate(adv_r, adv_c)[source]#

Compute surrogate loss.

Policy Gradient only uses the reward advantage.

Parameters:
  • adv_r (torch.Tensor) – The reward_advantage sampled from buffer.

  • adv_c (torch.Tensor) – The cost_advantage sampled from buffer.

Returns:

The reward advantage used to update the policy network.

Return type:

Tensor
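
PolicyGradient itself returns adv_r unchanged here. Derived safe-RL algorithms typically override this hook to fold the cost advantage into the surrogate. The following is a hypothetical sketch; the subclass name and the fixed penalty coefficient are illustrative, not OmniSafe defaults.

>>> import torch
>>> from omnisafe.algorithms.on_policy import PolicyGradient
>>> class PenalizedPG(PolicyGradient):  # hypothetical subclass for illustration
...     def _compute_adv_surrogate(
...         self, adv_r: torch.Tensor, adv_c: torch.Tensor
...     ) -> torch.Tensor:
...         # Trade the reward advantage off against the cost advantage.
...         penalty = 0.1  # illustrative constant, not an OmniSafe config value
...         return (adv_r - penalty * adv_c) / (1 + penalty)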

_init()[source]#

The initialization of the algorithm.

Users can customize the initialization of the algorithm by overriding this method.

Return type:

None

Examples

>>> def _init(self) -> None:
...     super()._init()
...     self._buffer = CustomBuffer()
...     self._model = CustomModel()
_init_env()[source]#

Initialize the environment.

OmniSafe uses omnisafe.adapter.OnPolicyAdapter to adapt the environment to the algorithm.

Users can customize the environment by overriding this method.

Return type:

None

Examples

>>> def _init_env(self) -> None:
...     self._env = CustomAdapter()
Raises:

AssertionError – If the number of steps per epoch is not divisible by the number of environments.

_init_log()[source]#

Log information about the current epoch.

Things to log:

  • Train/Epoch – Current epoch.

  • Metrics/EpCost – Average cost of the epoch.

  • Metrics/EpRet – Average return of the epoch.

  • Metrics/EpLen – Average episode length of the epoch.

  • Values/reward – Average reward value (from the reward critic network) in rollout() during the epoch.

  • Values/cost – Average cost value (from the cost critic network) in rollout() during the epoch.

  • Values/Adv – Average reward advantage of the epoch.

  • Loss/Loss_pi – Loss of the policy network.

  • Loss/Loss_cost_critic – Loss of the cost critic network.

  • Train/Entropy – Entropy of the policy network.

  • Train/StopIters – Number of update iterations performed for the policy network.

  • Train/PolicyRatio – Ratio between the new policy and the old policy.

  • Train/LR – Learning rate of the policy network.

  • Misc/Seed – Seed of the experiment.

  • Misc/TotalEnvSteps – Total environment steps of the experiment.

  • Time – Total time.

  • FPS – Frames per second of the epoch.

Return type:

None

_init_model()[source]#

Initialize the model.

OmniSafe uses omnisafe.models.actor_critic.constraint_actor_critic.ConstraintActorCritic as the default model.

Users can customize the model by overriding this method.

Return type:

None

Examples

>>> def _init_model(self) -> None:
...     self._actor_critic = CustomActorCritic()
_loss_pi(obs, act, logp, adv)[source]#

Compute the pi/actor loss.

In Policy Gradient, the loss is defined as:

(4)#\[L = -\underset{s_t \sim \rho_{\theta}}{\mathbb{E}} [ \sum_{t=0}^T ( \frac{\pi^{'}_{\theta}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)} ) A^{R}_{\pi_{\theta}}(s_t, a_t) ]\]

where \(\pi_{\theta}\) is the old policy network, \(\pi^{'}_{\theta}\) is the new policy network, and \(A^{R}_{\pi_{\theta}}(s_t, a_t)\) is the reward advantage.
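
A minimal sketch of this loss as a standalone function, assuming the new and old log-probabilities are available; the ratio is computed in log space for numerical stability.

>>> import torch
>>> def pg_surrogate_loss(
...     new_logp: torch.Tensor, old_logp: torch.Tensor, adv: torch.Tensor
... ) -> torch.Tensor:
...     # ratio = pi'_theta(a|s) / pi_theta(a|s)
...     ratio = torch.exp(new_logp - old_logp)
...     # Negative sign: minimizing the loss maximizes the surrogate objective.
...     return -(ratio * adv).mean()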

Parameters:
  • obs (torch.Tensor) – The observation sampled from buffer.

  • act (torch.Tensor) – The action sampled from buffer.

  • logp (torch.Tensor) – The log probability of action sampled from buffer.

  • adv (torch.Tensor) – The processed advantage (reward_advantage here).

Returns:

The loss of pi/actor.

Return type:

Tensor

_update()[source]#

Update the actor and critic networks.

Return type:

None

  • Get the data from buffer

Hint

  • obs – observation sampled from buffer.

  • act – action sampled from buffer.

  • target_value_r – target reward value sampled from buffer.

  • target_value_c – target cost value sampled from buffer.

  • logp – log probability sampled from buffer.

  • adv_r – estimated advantage (e.g. GAE) sampled from buffer.

  • adv_c – estimated cost advantage (e.g. GAE) sampled from buffer.

The basic process of each update is as follows (a sketch follows this list):

  1. Get the data from the buffer.

  2. Shuffle the data and split it into mini-batches.

  3. Compute the loss of the networks on a mini-batch.

  4. Update the networks with the loss.

  5. Repeat steps 3 and 4 until all mini-batches have been used.

  6. Repeat steps 2–5 for the configured number of update iterations, stopping early if the KL divergence between the old and new policy exceeds its limit.
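
The sketch below illustrates this loop. It is an assumption-laden outline rather than the actual implementation: the buffer attribute self._buf, the config fields update_iters, batch_size, kl_early_stop and target_kl, and the helper _compute_kl are all illustrative names.

>>> from torch.utils.data import DataLoader, TensorDataset
>>> def _update(self) -> None:
...     data = self._buf.get()  # step 1: get the data from the buffer
...     loader = DataLoader(    # step 2: shuffle and split into mini-batches
...         TensorDataset(
...             data['obs'], data['act'], data['logp'],
...             data['target_value_r'], data['target_value_c'],
...             data['adv_r'], data['adv_c'],
...         ),
...         batch_size=self._cfgs.algo_cfgs.batch_size,
...         shuffle=True,
...     )
...     for _ in range(self._cfgs.algo_cfgs.update_iters):
...         for obs, act, logp, tv_r, tv_c, adv_r, adv_c in loader:
...             self._update_reward_critic(obs, tv_r)  # steps 3-4 per mini-batch
...             self._update_cost_critic(obs, tv_c)
...             self._update_actor(obs, act, logp, adv_r, adv_c)
...         kl = self._compute_kl(data['obs'], data['logp'])  # hypothetical helper
...         if self._cfgs.algo_cfgs.kl_early_stop and kl > self._cfgs.algo_cfgs.target_kl:
...             break  # step 6: stop early when the KL divergence exceeds its limit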

_update_actor(obs, act, logp, adv_r, adv_c)[source]#

Update policy network under a double for loop.

  1. Compute the loss function.

  2. Clip the gradient if use_max_grad_norm is True.

  3. Update the network with the loss function.

Warning

For some KL-divergence-based algorithms (e.g. TRPO, CPO), the KL divergence between the old policy and the new policy is calculated and used to determine whether the update is accepted. If the KL divergence is too large, the update is terminated.

Parameters:
  • obs (torch.Tensor) – The observation sampled from buffer.

  • act (torch.Tensor) – The action sampled from buffer.

  • logp (torch.Tensor) – The log probability of the action sampled from buffer.

  • adv_r (torch.Tensor) – The reward_advantage sampled from buffer.

  • adv_c (torch.Tensor) – The cost_advantage sampled from buffer.

Return type:

None
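
A sketch of these three steps, assuming the actor and its optimizer are reachable as self._actor_critic.actor and self._actor_critic.actor_optimizer; the attribute names may differ in the actual code.

>>> import torch
>>> def _update_actor(self, obs, act, logp, adv_r, adv_c) -> None:
...     adv = self._compute_adv_surrogate(adv_r, adv_c)
...     loss = self._loss_pi(obs, act, logp, adv)        # step 1: compute the loss
...     self._actor_critic.actor_optimizer.zero_grad()
...     loss.backward()
...     if self._cfgs.algo_cfgs.use_max_grad_norm:       # step 2: clip the gradient
...         torch.nn.utils.clip_grad_norm_(
...             self._actor_critic.actor.parameters(),
...             self._cfgs.algo_cfgs.max_grad_norm,
...         )
...     self._actor_critic.actor_optimizer.step()        # step 3: apply the update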

_update_cost_critic(obs, target_value_c)[source]#

Update value network under a double for loop.

The loss function is MSE loss, which is defined in torch.nn.MSELoss. Specifically, the loss function is defined as:

(5)#\[L = \frac{1}{N} \sum_{i=1}^N (\hat{V} - V)^2\]

where \(\hat{V}\) is the predicted cost and \(V\) is the target cost.

  1. Compute the loss function.

  2. Add the critic norm to the loss function if use_critic_norm is True (see the sketch after this list).

  3. Clip the gradient if use_max_grad_norm is True.

  4. Update the network with the loss function.
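
The critic norm in step 2 is an L2 penalty on the critic parameters added to the MSE loss. Below is a sketch of the loss construction only, with assumed attribute names (cost_critic, critic_norm_coef); the backward pass, optional gradient clipping and optimizer step then follow as in the actor sketch above.

>>> import torch.nn.functional as F
>>> def cost_critic_loss(self, obs, target_value_c):
...     value_c = self._actor_critic.cost_critic(obs)  # the real critic may return a list
...     loss = F.mse_loss(value_c, target_value_c)
...     if self._cfgs.algo_cfgs.use_critic_norm:
...         # L2 regularization over the cost critic parameters.
...         for param in self._actor_critic.cost_critic.parameters():
...             loss = loss + self._cfgs.algo_cfgs.critic_norm_coef * param.pow(2).sum()
...     return loss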

Parameters:
  • obs (torch.Tensor) – The observation sampled from buffer.

  • target_value_c (torch.Tensor) – The target_value_c sampled from buffer.

Return type:

None

_update_reward_critic(obs, target_value_r)[source]#

Update value network under a double for loop.

The loss function is MSE loss, which is defined in torch.nn.MSELoss. Specifically, the loss function is defined as:

(6)#\[L = \frac{1}{N} \sum_{i=1}^N (\hat{V} - V)^2\]

where \(\hat{V}\) is the predicted reward value and \(V\) is the target reward value.

  1. Compute the loss function.

  2. Add the critic norm to the loss function if use_critic_norm is True.

  3. Clip the gradient if use_max_grad_norm is True.

  4. Update the network with the loss function.

Parameters:
  • obs (torch.Tensor) – The observation sampled from buffer.

  • target_value_r (torch.Tensor) – The target_value_r sampled from buffer.

Return type:

None

learn()[source]#

This is the main function for the algorithm update.

It is divided into the following steps:

  • rollout(): collect interaction data from the environment.

  • update(): perform actor/critic updates.

  • log(): log epoch/update information for visualization and terminal output.

Returns:
  • ep_ret – Average episode return in final epoch.

  • ep_cost – Average episode cost in final epoch.

  • ep_len – Average episode length in final epoch.

Return type:

tuple[float, float, float]
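
learn() is usually driven through the high-level wrapper rather than called directly. A minimal usage sketch, assuming the omnisafe.Agent wrapper and a Safety-Gymnasium environment id commonly used in the OmniSafe examples:

>>> import omnisafe
>>> agent = omnisafe.Agent('PolicyGradient', 'SafetyPointGoal1-v0')
>>> agent.learn()  # rollout, update and log until the configured total steps are used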

Natural Policy Gradient#

Documentation

class omnisafe.algorithms.on_policy.NaturalPG(env_id, cfgs)[source]#

The Natural Policy Gradient algorithm.

The Natural Policy Gradient algorithm is a policy gradient algorithm that uses the Fisher information matrix to approximate the Hessian matrix. The Fisher information matrix is the second-order derivative of the KL-divergence.

References

  • Title: A Natural Policy Gradient

  • Author: Sham Kakade.

  • URL: Natural PG

Initialize an instance of the algorithm.

_fvp(params)[source]#

Build the Hessian-vector product.

The Hessian here is the second-order derivative of the KL-divergence, and it is approximated by the Fisher information matrix, so the result is the Fisher-vector product.

For details, see John Schulman's PhD thesis (p. 40).

Parameters:

params (torch.Tensor) – The parameters of the actor network.

Returns:

The Fisher vector product.

Return type:

Tensor
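
The standard way to form this product without materializing the Fisher matrix is double backpropagation through the KL divergence. The following is a generic sketch of the technique rather than the OmniSafe implementation; the damping term is a common addition for numerical stability.

>>> import torch
>>> def fisher_vector_product(kl, params, vector, damping=0.1):
...     # First derivative of the KL divergence w.r.t. the policy parameters.
...     grads = torch.autograd.grad(kl, params, create_graph=True)
...     flat_grad = torch.cat([g.reshape(-1) for g in grads])
...     # Second derivative contracted with the vector: H v = d(g^T v)/d(theta).
...     grad_vector_product = (flat_grad * vector).sum()
...     hvp = torch.autograd.grad(grad_vector_product, params, retain_graph=True)
...     flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
...     return flat_hvp + damping * vector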

_init_log()[source]#

Log the Natural Policy Gradient specific information.

Things to log:

  • Misc/AcceptanceStep – The acceptance step size.

  • Misc/Alpha – \(\frac{\delta_{KL}}{xHx}\) in the original paper.

  • Misc/FinalStepNorm – The final step norm.

  • Misc/gradient_norm – The gradient norm.

  • Misc/xHx – \(x H x\) in the original paper.

  • Misc/H_inv_g – \(H^{-1} g\) in the original paper.

Return type:

None

_update()[source]#

Update the actor and critic networks.

Return type:

None

Hint

Here are some differences between NPG and Policy Gradient (PG). In PG, the actor and critic networks are updated together; when the KL divergence between the old policy and the new policy is larger than a threshold, the whole update is rejected.

In NPG, the actor and critic networks are updated separately; when the KL divergence between the old policy and the new policy is larger than a threshold, the update of the actor network is rejected, but the update of the critic network is still accepted.

_update_actor(obs, act, logp, adv_r, adv_c)[source]#

Update policy network.

Natural Policy Gradient (NPG) updates the policy network using the conjugate gradient algorithm, following these steps:

  • Calculate the gradient of the policy network.

  • Use the conjugate gradient algorithm to calculate the step direction.

  • Update the policy network by taking a step in the step direction.

Parameters:
  • obs (torch.Tensor) – The observation tensor.

  • act (torch.Tensor) – The action tensor.

  • logp (torch.Tensor) – The log probability of the action.

  • adv_r (torch.Tensor) – The reward advantage tensor.

  • adv_c (torch.Tensor) – The cost advantage tensor.

Raises:
  • AssertionError – If \(x\) is not finite.

  • AssertionError – If \(x H x\) is not positive.

  • AssertionError – If \(\alpha\) is not finite.

Return type:

None
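
The conjugate gradient solve at the heart of these steps can be sketched generically as below; it computes \(x \approx H^{-1} g\) using only Fisher-vector products (see _fvp above). In the standard derivation, scaling \(x\) by \(\sqrt{2\delta_{KL} / (x^\top H x)}\) then gives a step whose quadratic KL approximation matches the trust-region bound \(\delta_{KL}\).

>>> import torch
>>> def conjugate_gradient(fvp, g, iters=10, eps=1e-8):
...     # Solve H x = g, where fvp(v) returns the product H v.
...     x = torch.zeros_like(g)
...     r = g.clone()
...     p = g.clone()
...     rs_old = torch.dot(r, r)
...     for _ in range(iters):
...         hp = fvp(p)
...         alpha = rs_old / (torch.dot(p, hp) + eps)
...         x = x + alpha * p
...         r = r - alpha * hp
...         rs_new = torch.dot(r, r)
...         p = r + (rs_new / rs_old) * p
...         rs_old = rs_new
...     return x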

Trust Region Policy Optimization#

Documentation

class omnisafe.algorithms.on_policy.TRPO(env_id, cfgs)[source]#

The Trust Region Policy Optimization (TRPO) algorithm.

References

  • Title: Trust Region Policy Optimization

  • Authors: John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel.

  • URL: TRPO

Initialize an instance of the algorithm.

_init_log()[source]#

Log the Trust Region Policy Optimization specific information.

Things to log:

  • Misc/AcceptanceStep – The acceptance step size.

Return type:

None

_search_step_size(step_direction, grads, p_dist, obs, act, logp, adv, loss_before, total_steps=15, decay=0.8)[source]#

TRPO performs a line search until the constraint is satisfied.

Hint

TRPO searches for a policy update step that improves the loss and reward performance while remaining feasible. The search is a line search over the step size, looking for a step that satisfies the constraint: the KL divergence between the old policy and the new policy must stay within its bound.

Parameters:
  • step_direction (torch.Tensor) – The step direction.

  • grads (torch.Tensor) – The gradient of the policy.

  • p_dist (torch.distributions.Distribution) – The old policy distribution.

  • obs (torch.Tensor) – The observation.

  • act (torch.Tensor) – The action.

  • logp (torch.Tensor) – The log probability of the action.

  • adv (torch.Tensor) – The advantage.

  • loss_before (float) – The loss of the policy before the update.

  • total_steps (int, optional) – The total steps to search. Defaults to 15.

  • decay (float, optional) – The decay rate of the step size. Defaults to 0.8.

Returns:

The tuple of final update direction and acceptance step size.

Return type:

tuple[Tensor, int]
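
The backtracking behaviour can be sketched generically as follows. The helpers set_params, surrogate_loss and kl_divergence are assumptions standing in for the actual parameter handling; the return value mirrors the (direction, acceptance step) tuple documented above.

>>> import torch
>>> def backtracking_line_search(step_direction, theta_old, loss_before, target_kl,
...                              set_params, surrogate_loss, kl_divergence,
...                              total_steps=15, decay=0.8):
...     step_frac = 1.0
...     for step in range(total_steps):
...         set_params(theta_old + step_frac * step_direction)
...         improved = surrogate_loss() < loss_before           # the loss must improve
...         within_trust_region = kl_divergence() <= target_kl  # the KL constraint must hold
...         if improved and within_trust_region:
...             return step_frac * step_direction, step + 1     # accepted step
...         step_frac *= decay                                  # shrink the step and retry
...     set_params(theta_old)  # no acceptable step: restore the old parameters
...     return torch.zeros_like(step_direction), 0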

_update_actor(obs, act, logp, adv_r, adv_c)[source]#

Update policy network.

Trust Region Policy Optimization updates the policy network using the conjugate gradient algorithm, following these steps:

  • Compute the gradient of the policy.

  • Compute the step direction.

  • Search for a step size that satisfies the constraint.

  • Update the policy network.

Parameters:
  • obs (torch.Tensor) – The observation tensor.

  • act (torch.Tensor) – The action tensor.

  • logp (torch.Tensor) – The log probability of the action.

  • adv_r (torch.Tensor) – The reward advantage tensor.

  • adv_c (torch.Tensor) – The cost advantage tensor.

Return type:

None

Proximal Policy Optimization#

Documentation

class omnisafe.algorithms.on_policy.PPO(env_id, cfgs)[source]#

The Proximal Policy Optimization (PPO) algorithm.

References

  • Title: Proximal Policy Optimization Algorithms

  • Authors: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov.

  • URL: PPO

Initialize an instance of the algorithm.

_loss_pi(obs, act, logp, adv)[source]#

Compute the pi/actor loss.

In Proximal Policy Optimization, the loss is defined as:

(8)#\[L^{CLIP} = \underset{s_t \sim \rho_{\theta}}{\mathbb{E}} \left[ \min \left( r_t A^{R}_{\pi_{\theta}} (s_t, a_t) , \text{clip} (r_t, 1 - \epsilon, 1 + \epsilon) A^{R}_{\pi_{\theta}} (s_t, a_t) \right) \right]\]

where \(r_t = \frac{\pi_{\theta}^{'} (a_t|s_t)}{\pi_{\theta} (a_t|s_t)}\), \(\epsilon\) is the clip parameter, and \(A^{R}_{\pi_{\theta}} (s_t, a_t)\) is the advantage.
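
A minimal sketch of the clipped surrogate written as a loss to minimize (standalone, with the clip range epsilon passed in explicitly):

>>> import torch
>>> def ppo_clip_loss(new_logp, old_logp, adv, epsilon=0.2):
...     ratio = torch.exp(new_logp - old_logp)            # r_t
...     clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
...     # Negative sign: minimizing the loss maximizes the clipped objective.
...     return -torch.min(ratio * adv, clipped_ratio * adv).mean()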

Parameters:
  • obs (torch.Tensor) – The observation sampled from buffer.

  • act (torch.Tensor) – The action sampled from buffer.

  • logp (torch.Tensor) – The log probability of action sampled from buffer.

  • adv (torch.Tensor) – The processed advantage (reward_advantage here).

Returns:

The loss of pi/actor.

Return type:

Tensor