Lagrangian Methods#

Quick Facts#

  1. Lagrangian Method can be applied to almost any RL algorithm.

  2. Lagrangian Method turns unsafe algorithm to a safe one.

  3. The OmniSafe implementation of Lagrangian Methods covers up to 6 kinds of on policy and off policy algorithm.

Lagrangian Methods Theorem#

Background#

In the previous introduction of algorithms, we know that SafeRL mainly solves the constraint optimization problem of CMDP.

Hint

Constrained optimization problems tend to be more challenging than unconstrained optimization problems.

Therefore, the natural idea is to convert a constrained optimization problem into an unconstrained optimization problem. Then solve it using classical optimization algorithms, such as stochastic gradient descent. Lagrangian Methods is a kind of method solving constraint problems that are widely used in machine learning. By using adaptive penalty coefficients to enforce constraints, Lagrangian methods convert the solution of a constrained optimization problem to the solution of an unconstrained optimization problem. In the section, we will briefly introduce Lagrangian methods, and give corresponding implementations in TRPO and PPO. TRPO and PPO are the algorithms we introduced earlier, if you lack understanding of it, it doesn’t matter. Please read the TRPO tutorial and PPO tutorial we wrote, you will soon understand how it works.

Advantages of Lagrangian Methods

  • Relatively simple to implement.

  • The principle is straightforward to understand.

  • Can be applied to a variety of algorithms.

  • Highly scalable.

Problems of Lagrangian Methods

  • Different hyper-parameters need to be set for different tasks.

  • Not necessarily valid for all tasks.

  • Problems of overshoot.

  • Difficult to handle multiple cost tasks directly.


Optimization Objective#

As we mentioned in the previous chapters, the optimization problem of CMDPs can be expressed as follows:

(1)#\[\begin{split}\max_{\pi \in \Pi_\theta} &J^R(\pi) \\ \text {s.t.}~~& J^{\mathcal{C}}(\pi) \leq d\end{split}\]

where \(\Pi_\theta \subseteq \Pi\) denotes the set of parametrized policies with parameters \(\theta\). In local policy search for CMDPs, we additionally require policy iterates to be feasible for the CMDP, so instead of optimizing over \(\Pi_\theta\), algorithm should optimize over \(\Pi_\theta \cap \Pi_C\). Specifically, for the TRPO and PPO algorithms, constraints on the differences between old and new policies should also be added. To solve this constrained problem, please read the TRPO tutorial. The final optimization goals are as follows:

(2)#\[\begin{split}&\pi_{k+1}=\arg \max _{\pi \in \Pi_\theta} J^R(\pi) \\ \text { s.t. } ~~ &J^{\mathcal{C}}(\pi) \leq d \\ &D\left(\pi, \pi_k\right) \leq \delta\end{split}\]

where \(D\) is some distance measure and \(\delta\) is the step size.


Lagrangian Method Theorem#

Lagrangian methods#

Constrained MDP’s are often solved using the Lagrange methods. In Lagrange methods, the CMDP is converted into an equivalent unconstrained problem. In addition to the objective, a penalty term is added for infeasibility, thus making infeasible solutions sub-optimal.

Theorem 1

Given a CMDP, the unconstrained problem can be written as:

(3)#\[\min _{\lambda \geq 0} \max _\theta G(\lambda, \theta)=\min _{\lambda \geq 0} \max _\theta [J^R(\pi)-\lambda J^C(\pi)]\]

where \(G\) is the Lagrangian and \(\lambda \geq 0\) is the Lagrange multiplier (a penalty coefficient). Notice, as \(\lambda\) increases, the solution to the Problem (1) converges to that of the Problem (3).

Hint

The Lagrangian method is a two-step process.

  1. First, we solve the unconstrained problem (3) to find a feasible solution \(\theta^*\)

  2. Then, we increase the penalty coefficient \(\lambda\) until the constraint is satisfied.

The final solution is \(\left(\theta^*, \lambda^*\right)\). The goal is to find a saddle point \(\left(\theta^*\left(\lambda^*\right), \lambda^*\right)\) of the Problem (1), which is a feasible solution. (A feasible solution of the CMDP is a solution which satisfies \(J^C(\pi) \leq d\) )


Practical Implementation#

intuitively, we train the agent to maximize the reward in the classical strategy gradient descent algorithm. If a particular action \(a\) in state \(s\) can bring a relatively higher reward, we increase the probability that the agent will choose action \(a\) under \(s\), and conversely, we will reduce this probability.

Hint

Lagrangian methods add two extra steps to the above process.

  • One is to adjust the reward function, and if the agent’s actions violate the constraint, the reward will reduce accordingly.

  • The second is a slow update of the penalty factor. If the agent violates fewer constraints, the penalty coefficient will gradually decrease, and conversely, it will gradually increase.

Next we will introduce the specific implementation of the Lagrange method in the TRPO and PPO algorithms.

Policy update#

Surrogate function update

Previously, in TRPO and PPO, we used to have the agent sample a series of data from the environment, and at the end of the episode, use this data to update the agent several times, as described in Problem (2). With the addition of the Lagrange method, we need to make a change to the original surrogate function, as it is shown below:

(4)#\[\begin{split}\max _{\pi \in \prod_\theta}[J^R(\pi)-\lambda J^C(\pi)] \\ \text { s.t. } D\left(\pi, \pi_k\right) \leq \delta\end{split}\]

In a word, we only need to punish the agent with its reward by \(\lambda\) with each step of updates. In fact, this is just a minor change made on TRPO and PPO.

Lagrange multiplier update

After all rounds of policy updates to the agent are complete, We will perform an update on the Lagrange multiplier that is:

(5)#\[\begin{split}\min _\lambda(1-\lambda) [J^R(\pi)-\lambda J^C(\pi)] \\ \text { s.t. } \lambda \geq 0\end{split}\]

Specifically, on the \(k^{t h}\) update, the above align is often written as below in the actual calculation process:

(6)#\[\lambda_{k+1}=\max \left(\lambda_k+ \eta_\lambda\left(J^C(\pi)-d\right), 0\right)\]

where \(\eta_\lambda\) is the learning rate of \(\lambda\).

Ultimately, we only need to add the above two steps to the TRPO and PPO; then we will get the TRPO-lag and the PPO-lag.

Attention

In practice, We often need to manually set the initial value of as well as the learning rate. Unfortunately, Lagrange algorithms are algorithms that are sensitive to hyperparameter selection.

  • If the initial value of \(\lambda\) or learning rate is chosen to be large, the agent may suffer from a low reward.

  • Else, it may violate the constraints.

So we often struggle to choose a compromise hyperparameter to balance reward and constraints.


Code with OmniSafe#

Safe RL algorithms for TRPO, PPO, NPG, DDPG, SAC and TD3 are currently implemented in omnisafe using Lagrangian methods. This section will explain how to deploy Lagrangian methods on PPO algorithms at the code level using PPOLag as an example. OmniSafe has Lagrange as a separate module and you can easily deploy it on most RL algorithms.

Quick start#

Run PPOLag in Omnisafe

Here are 3 ways to run PPOLag in OmniSafe:

  • Run Agent from preset yaml file

  • Run Agent from custom config dict

  • Run Agent from custom terminal config

1import omnisafe
2
3
4env_id = 'SafetyPointGoal1-v0'
5
6agent = omnisafe.Agent('PPOLag', env_id)
7agent.learn()
 1import omnisafe
 2
 3
 4env_id = 'SafetyPointGoal1-v0'
 5custom_cfgs = {
 6    'train_cfgs': {
 7        'total_steps': 1024000,
 8        'vector_env_nums': 1,
 9        'parallel': 1,
10    },
11    'algo_cfgs': {
12        'update_cycle': 2048,
13        'update_iters': 1,
14    },
15    'logger_cfgs': {
16        'use_wandb': False,
17    },
18}
19
20agent = omnisafe.Agent('PPOLag', env_id, custom_cfgs=custom_cfgs)
21agent.learn()

We use train_policy.py as the entrance file. You can train the agent with PPOLag simply using train_policy.py, with arguments about PPOLag and environments does the training. For example, to run PPOLag in SafetyPointGoal1-v0 , with 4 cpu cores and seed 0, you can use the following command:

1cd examples
2python train_policy.py --algo PPOLag --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 1024000 --device cpu --vector-env-nums 1 --torch-threads 1

Architecture of functions#

  • PPOLag.learn()

    • env.roll_out()

    • PPOLag.update()

      • PPOLag.buf.get()

      • PPOLag.pre_process_data(raw_data)

      • PPOLag.update_lagrange_multiplier(ep_costs)

      • PPOLag.update_policy_net()

      • PPOLag.update_cost_net()

      • PPOLag.update_value_net()

  • PPOLag.log()


Documentation of new functions#

PPOLag.compute_loss_pi(data: dict)

Compute the loss of policy network, flowing the next steps:

  1. Compute the clip surrogate function.

1dist, _log_p = self.ac.pi(data['obs'], data['act'])
2ratio = torch.exp(_log_p - data['log_p'])
3ratio_clip = torch.clamp(ratio, 1 - self.clip, 1 + self.clip)
4loss_pi = -(torch.min(ratio * data['adv'], ratio_clip * data['adv'])).mean()
5loss_pi -= self.entropy_coef * dist.entropy().mean()
  1. Punish the actor for violating the constraint.

1penalty = self.lambda_range_projection(self.lagrangian_multiplier).item()
2loss_pi += penalty * ((ratio * data['cost_adv']).mean())
3loss_pi /= 1 + penalty

Lagrange.update_lagrange_multiplier(ep_costs: float)

Update Lagrange multiplier (\(\lambda\))

Hint

ep_costs obtained from: self.logger.get_stats('EpCost')[0] are already averaged across MPI processes.

1self.lambda_optimizer.zero_grad()
2lambda_loss = self.compute_lambda_loss(ep_costs)
3lambda_loss.backward()
4self.lambda_optimizer.step()
5self.lagrangian_multiplier.data.clamp_(0)

Hint

self.lagrangian_multiplier.data.clamp_(0) is used to avoid negative values of \(\lambda\)


Parameters#

Specific Parameters

  • target_kl(float): Constraint for KL-distance to avoid too far gap

  • cg_damping(float): parameter plays a role in building Hessian-vector

  • cg_iters(int): Number of iterations of conjugate gradient to perform.

  • cost_limit(float): Constraint for agent to avoid too much cost

Basic parameters

  • algo (string): The name of algorithm corresponding to current class, it does not actually affect any things which happen in the following.

  • actor (string): The type of network in actor, discrete or continuous.

  • model_cfgs (dictionary) : Actor and critic’s net work configuration, it originates from algo.yaml file to describe hidden layers , activation function, shared_weights and weight_initialization_mode.

    • shared_weights (bool) : Use shared weights between actor and critic network or not.

    • weight_initialization_mode (string) : The type of weight initialization method.

      • pi (dictionary) : parameters for actor network pi

        • hidden_sizes:

          • 64

          • 64

        • activations: tanh

      • val (dictionary) parameters for critic network v

        • hidden_sizes:

          • 64

          • 64

          Hint

          Name

          Type

          Description

          v

          nn.Module

          Gives the current estimate of V for states in s.

          pi

          nn.Module

          Deterministically or continuously computes an action from the agent, conditioned on states in s.

      • activations: tanh

      • env_id (string): The name of environment we want to roll out.

      • seed (int): Define the seed of experiments.

      • parallel (int): Define the seed of experiments.

      • epochs (int): The number of epochs we want to roll out.

      • steps_per_epoch (int):The number of time steps per epoch.

      • pi_iters (int): The number of iteration when we update actor network per mini batch.

      • critic_iters (int): The number of iteration when we update critic network per mini batch.

Optional parameters

  • use_cost_critic (bool): Use cost value function or not.

  • linear_lr_decay (bool): Use linear learning rate decay or not.

  • exploration_noise_anneal (bool): Use exploration noise anneal or not.

  • reward_penalty (bool): Use cost to penalize reward or not.

  • kl_early_stopping (bool): Use KL early stopping or not.

  • max_grad_norm (float): Use maximum gradient normalization or not.

  • scale_rewards (bool): Use reward scaling or not.

Buffer parameters

Hint

Name

Description

Buffer

A buffer for storing trajectories experienced by an agent interacting with the environment, and using Generalized Advantage Estimation (GAE) for calculating the advantages of state-action pairs.

Warning

Buffer collects only raw data received from environment.

  • gamma (float): The gamma for GAE.

  • lam (float): The lambda for reward GAE.

  • adv_estimation_method (float):Roughly what KL divergence we think is appropriate between new and old policies after an update. This will get used for early stopping. (Usually small, 0.01 or 0.05.)

  • standardized_reward (int): Use standardized reward or not.

  • standardized_cost (bool): Use standardized cost or not.


References#