Constrained Policy Optimization#

Quick Facts#

  1. CPO is an on-policy algorithm.

  2. CPO can be used for environments with both discrete and continuous action spaces.

  3. CPO can be thought of as being TRPO in SafeRL areas .

  4. The OmniSafe implementation of CPO support parallelization.

CPO Theorem#

Background#

Constrained policy optimization (CPO) is a policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration. Motivated by TRPO( Trust Region Policy Optimization). CPO develops surrogate functions to be good local approximations for objectives and constraints and easy to estimate using samples from current policy. Moreover, it provides tighter bounds for policy search using trust regions.

Hint

CPO is the first general-purpose policy search algorithm for safe reinforcement learning with guarantees for near-constraint satisfaction at each iteration.

CPO trains neural network policies for high-dimensional control while making guarantees about policy behavior throughout training. CPO aims to provide an approach for policy search in continuous CMDP. It uses the result from TRPO and NPG to derive a policy improvement step that guarantees both an increase in reward and satisfaction of constraints. Although CPO is slightly inferior in performance, it provides a solid theoretical foundation for solving constrained optimization problems in the field of safe reinforcement learning.

Hint

CPO is very complex in terms of implementation, but omnisafe provides a highly readable code implementation to help you get up to speed quickly.


Optimization Objective#

In the previous chapters, we introduced that TRPO solves the following optimization problems:

(1)#\[\begin{split}&\pi_{k+1}=\arg\max_{\pi \in \Pi_{\boldsymbol{\theta}}}J^R(\pi)\\ \text{s.t.}\quad &D(\pi,\pi_k)\le\delta\end{split}\]

where \(\Pi_{\boldsymbol{\theta}} \subseteq \Pi\) denotes the set of parametrized policies with parameters \(\boldsymbol{\theta}\), and \(D\) is some distance measure. In local policy search, we additionally require policy iterates to be feasible for the CMDP, so instead of optimizing over \(\Pi_{\boldsymbol{\theta}}\), CPO optimizes over \(\Pi_{\boldsymbol{\theta}} \cap \Pi_{C}\).

(2)#\[\begin{split}& \pi_{k+1} = \arg\max_{\pi \in \Pi_{\boldsymbol{\theta}}}J^R(\pi)\\ \text{s.t.} \quad & D(\pi,\pi_k)\le\delta\\ & J^{C_i}(\pi)\le d_i\quad i=1,...m\end{split}\]

Hint

This update is difficult to implement because it requires evaluating the constraint functions to determine whether a proposed policy \(\pi\) is feasible.

CPO develops a principled approximation with a particular choice of \(D\), where the objective and constraints are replaced with surrogate functions. CPO proposes that with those surrogates, the update’s worst-case performance and worst-case constraint violation can be bounded with values that depend on a hyperparameter of the algorithm.


Policy Performance Bounds#

CPO presents the theoretical foundation for its approach, a new bound on the difference in returns between two arbitrary policies. The following Theorem 1 connects the difference in returns (or constraint costs) between two arbitrary policies to an average divergence between them.

Theorem 1 (Difference between two arbitrary policies)

For any function \(f : S \rightarrow \mathbb{R}\) and any policies \(\pi\) and \(\pi'\), define \(\delta_f(s,a,s') \doteq R(s,a,s') + \gamma f(s')-f(s)\),

(3)#\[\begin{split}\epsilon_f^{\pi'} &\doteq \max_s \left|\mathbb{E}_{a\sim\pi'~,s'\sim P }\left[\delta_f(s,a,s')\right] \right|\\ L_{\pi, f}\left(\pi'\right) &\doteq \mathbb{E}_{\tau \sim \pi}\left[\left(\frac{\pi'(a | s)}{\pi(a|s)}-1\right)\delta_f\left(s, a, s'\right)\right] \\ D_{\pi, f}^{\pm}\left(\pi^{\prime}\right) &\doteq \frac{L_{\pi, f}\left(\pi' \right)}{1-\gamma} \pm \frac{2 \gamma \epsilon_f^{\pi'}}{(1-\gamma)^2} \mathbb{E}_{s \sim d^\pi}\left[D_{T V}\left(\pi^{\prime} \| \pi\right)[s]\right]\end{split}\]

where \(D_{T V}\left(\pi'|| \pi\right)[s]=\frac{1}{2} \sum_a\left|\pi'(a|s)-\pi(a|s)\right|\) is the total variational divergence between action distributions at \(s\). The conclusion is as follows:

(4)#\[D_{\pi, f}^{+}\left(\pi'\right) \geq J\left(\pi'\right)-J(\pi) \geq D_{\pi, f}^{-}\left(\pi'\right)\]

Furthermore, the bounds are tight (when \(\pi=\pi^{\prime}\), all three expressions are identically zero).

By picking \(f=V_\pi\), we obtain a Corollary 1, Corollary 2, Corollary 3 below:

Corollary 1

For any policies \(\pi'\), \(\pi\), with \(\epsilon_{\pi'}=\max _s|\mathbb{E}_{a \sim \pi'}[A_\pi(s, a)]|\), the following bound holds:

(5)#\[J^R\left(\pi^{\prime}\right)-J^R(\pi) \geq \frac{1}{1-\gamma} \mathbb{E}_{s \sim d^\pi\,a \sim \pi'}\left[A^R_\pi(s, a)-\frac{2 \gamma \epsilon_{\pi'}}{1-\gamma} D_{T V}\left(\pi' \| \pi\right)[s]\right]\]

Corollary 2

For any policies \(\pi'\) and \(\pi\), with \(\epsilon^{C_i}_{\pi'}=\max _s|E_{a \sim \pi^{\prime}}[A^{C_i}_\pi(s, a)]|\)

the following bound holds:

(6)#\[J^{C_i}\left(\pi^{\prime}\right)-J^{C_i}(\pi) \geq \frac{1}{1-\gamma} \mathbb{E}_{s \sim d^\pi a \sim \pi'}\left[A^{C_i}_\pi(s, a)-\frac{2 \gamma \epsilon^{C_i}_{\pi'}}{1-\gamma} D_{T V}\left(\pi' \| \pi\right)[s]\right]\]

Corollary 3

Trust region methods prefer to constrain the KL-divergence between policies, so CPO use Pinsker’s inequality to connect the \(D_{TV}\) with \(D_{KL}\)

(7)#\[D_{TV}(p \| q) \leq \sqrt{D_{KL}(p \| q) / 2}\]

Combining this with Jensen’s inequality, we obtain our final Corollary 3 : In bound Theorem 1 , Corollary 1, Corollary 2, make the substitution:

(8)#\[\mathbb{E}_{s \sim d^\pi}\left[D_{T V}\left(\pi'|| \pi\right)[s]\right] \rightarrow \sqrt{\frac{1}{2} \mathbb{E}_{s \sim d^\pi}\left[D_{K L}\left(\pi^{\prime} \| \pi\right)[s]\right]}\]

Trust Region Methods#

For parameterized stationary policy, trust region algorithms for reinforcement learning have policy updates of the following form:

(9)#\[\begin{split}&\boldsymbol{\theta}_{k+1}=\arg\max_{\pi \in \Pi_{\boldsymbol{\theta}}} \mathbb{E}_{\substack{s \sim d_{\pi_k}\\a \sim \pi}}[A^R_{\boldsymbol{\theta}_k}(s, a)]\\ \text{s.t.}\quad &\bar{D}_{K L}\left(\pi \| \pi_k\right) \le \delta\end{split}\]

where \(\bar{D}_{K L}(\pi \| \pi_k)=\mathbb{E}_{s \sim \pi_k}[D_{K L}(\pi \| \pi_k)[s]]\) and \(\delta \ge 0\) is the step size. The set \(\left\{\pi_{\boldsymbol{\theta}} \in \Pi_{\boldsymbol{\theta}}: \bar{D}_{K L}\left(\pi \| \pi'\right) \leq \delta\right\}\) is called trust region. The success motivation for this update is that, it approximates optimizing the lower bound on policy performance given in Corollary 1, which would guarantee monotonic performance improvements.

(10)#\[\begin{split}&\pi_{k+1}=\arg \max _{\pi \in \Pi_{\boldsymbol{\theta}}} \mathbb{E}_{\substack{s \sim d_{\pi_k}\\a \sim \pi}}[A^R_{\pi_k}(s, a)]\\ \text{s.t.} \quad &J^{C_i}\left(\pi_k\right) \leq d_i-\frac{1}{1-\gamma} \mathbb{E}_{\substack{s \sim d_{\pi_k} \\ a \sim \pi}}\left[A^{C_i}_{\pi_k}(s, a)\right] \quad \forall i \\ &\bar{D}_{K L}\left(\pi \| \pi_k\right) \leq \delta\end{split}\]

Hint

In a word, CPO proposes the final optimization problem, which uses a trust region instead of penalties on policy divergence to enable larger step sizes.


Worst Performance of CPO Update#

Here we will introduce the propositions proposed by the CPO, one describes the worst-case performance degradation guarantee that depends on \(\delta\), and the other discusses the worst-case constraint violation in the CPO update.

Trust Region Update Performance

Suppose \(\pi_k, \pi_{k+1}\) are related by Eq.9, and that \(\pi_k \in \Pi_{\boldsymbol{\theta}}\). A lower bound on the policy performance difference between \(\pi_k\) and \(\pi_{k+1}\) is:

(11)#\[J^{R}\left(\pi_{k+1}\right)-J^{R}(\pi_{k}) \geq \frac{-\sqrt{2 \delta} \gamma \epsilon^R_{\pi_{k+1}}}{(1-\gamma)^2}\]

where \(\epsilon^R_{\pi_{k+1}}=\max_s\left|\mathbb{E}_{a \sim \pi_{k+1}}\left[A^R_{\pi_k}(s, a)\right]\right|\).

CPO Update Worst-Case Constraint Violation

Suppose \(\pi_k, \pi_{k+1}\) are related by Eq.9, and that \(\pi_k \in \Pi_{\boldsymbol{\theta}}\). An upper bound on the \(C_i\)-return of \(\pi_{k+1}\) is:

(12)#\[ J^{C_i}\left(\pi_{k+1}\right) \leq d_i+\frac{\sqrt{2 \delta} \gamma \epsilon^{C_i}_{\pi_{k+1}}}{(1-\gamma)^2}\]

where \(\epsilon^{C_i}_{\pi_{k+1}}=\max _s\left|\mathbb{E}_{a \sim \pi_{k+1}}\left[A^{C_i}_{\pi_k}(s, a)\right]\right|\)


Summary#

We mainly introduce the essential inequalities in CPO. Based on those inequalities, CPO presents optimization problems that ultimately need to be solved and propose two proposition about the worst case in the CPO update. Next section, we will discuss how to solve this problem practically. It is expected that you may be confused when you first read these theoretical derivation processes, and we have given detailed proof of the above formulas in the appendix, which we believe you can understand by reading them a few times.


Practical Implementation#

Overview

In this section, we show how CPO implements an approximation to the update Eq.10 that can be efficiently computed, even when optimizing policies with thousands of parameters. To address the issue of approximation and sampling errors that arise in practice and the potential violations described by Proposition 2, CPO proposes to tighten the constraints by constraining the upper bounds of the extra costs instead of the extra costs themselves.

Navigation

Approximately Solving the CPO Update

Click here

Feasibility

Click here

Tightening Constraints via Cost Shaping

Click here

Code With OmniSafe

Click here


Approximately Solving the CPO Update#

For policies with high-dimensional parameter spaces like neural networks, Eq.10 can be impractical to solve directly because of the computational cost.

Hint

However, for small step sizes \(\delta\), the objective and cost constraints are well-approximated by linearizing around \(\pi_k\), and the KL-Divergence constraint is well-approximated by second-order expansion.

Denoting the gradient of the objective as \(g\), the gradient of constraint \(i\) as \(b_i\), the Hessian of the KL-divergence as \(H\), and defining \(c_i=J^{C_i}\left(\pi_k\right)-d_i\), the approximation to Eq.10 is:

(13)#\[\begin{split}&\boldsymbol{\theta}_{k+1}=\arg \max _{\boldsymbol{\theta}} g^T\left(\boldsymbol{\theta}-\boldsymbol{\theta}_k\right)\\ \text{s.t.}\quad &c_i+b_i^T\left(\boldsymbol{\theta}-\boldsymbol{\theta}_k\right) \leq 0 ~~~ i=1, \ldots m \\ &\frac{1}{2}\left(\boldsymbol{\theta}-\boldsymbol{\theta}_k\right)^T H\left(\boldsymbol{\theta}-\boldsymbol{\theta}_k\right) \leq \delta\end{split}\]

With \(B=\left[b_1, \ldots, b_m\right]\) and \(c=\left[c_1, \ldots, c_m\right]^T\), a dual to Eq.13 can be express as:

(14)#\[\max_{\lambda \geq 0, \nu \geq 0} \frac{-1}{2 \lambda}\left(g^T H^{-1} g-2 r^T v+v^T S v\right)+v^T c-\frac{\lambda \delta}{2}\]

where \(r=g^T H^{-1} B, S=B^T H^{-1} B\). If \(\lambda^*, v^*\) are a solution to the dual, the solution to the primal is

(15)#\[{\boldsymbol{\theta}}^*={\boldsymbol{\theta}}_k+\frac{1}{\lambda^*} H^{-1}\left(g-B v^*\right)\]

In a word, CPO solves the dual for \(\lambda^*, \nu^*\) and uses it to propose the policy update Eq.15, thus solving Eq.10 in a particular way. In the experiment, CPO also uses two tricks to promise the update’s performance.

Warning

Because of the approximation error, the proposed update may not satisfy the constraints in Eq.10; a backtracking line search is used to ensure surrogate constraint satisfaction.

For high-dimensional policies, it is impractically expensive to invert the FIM. This poses a challenge for computing \(\mathrm{H}^{-1} \mathrm{~g}\) and \(H^{-1} b\), which appear in the dual. Like TRPO, CPO computes them approximately using the conjugate gradient method.


Feasibility#

Due to approximation errors, CPO may take a bad step and produce an infeasible iterate \(\pi_k\). CPO recovers the update from an infeasible case by proposing an update to decrease the constraint value purely:

(16)#\[\boldsymbol{\theta}^*=\boldsymbol{\theta}_k-\sqrt{\frac{2 \delta}{b^T H^{-1} b}} H^{-1} b\]

As before, this is followed by a line search. This approach is principled, because it uses the limiting search direction as the intersection of the trust region and the constraint region shrinks to zero.


Tightening Constraints via Cost Shaping#

To build a factor of safety into the algorithm to minimize the chance of constraint violations, CPO chooses to constrain upper bounds on the original constraints, \(C_i^{+}\), instead of the original constraints themselves. CPO does this by cost shaping:

(17)#\[C_i^{+}\left(s, a, s^{\prime}\right)=C_i\left(s, a, s^{\prime}\right)+\triangle_i\left(s, a, s^{\prime}\right)\]

where \(\delta_i: S \times A \times S \rightarrow R_{+}\) correlates in some useful way with \(C_i\). Because CPO has only one constraint, it partitions states into safe and unsafe states, and the agent suffers a safety cost of 1 for being in an unsafe state. CPO chooses \(\triangle\) to be the probability of entering an unsafe state within a fixed time horizon, according to a learned model that is updated at each iteration. This choice confers the additional benefit of smoothing out sparse constraints.


Code with OmniSafe#

Quick start#

Run CPO in Omnisafe

Here are 3 ways to run CPO in OmniSafe:

  • Run Agent from preset yaml file

  • Run Agent from custom config dict

  • Run Agent from custom terminal config

1import omnisafe
2
3
4env_id = 'SafetyPointGoal1-v0'
5
6agent = omnisafe.Agent('CPO', env_id)
7agent.learn()
 1import omnisafe
 2
 3
 4env_id = 'SafetyPointGoal1-v0'
 5custom_cfgs = {
 6    'train_cfgs': {
 7        'total_steps': 1024000,
 8        'vector_env_nums': 1,
 9        'parallel': 1,
10    },
11    'algo_cfgs': {
12        'update_cycle': 2048,
13        'update_iters': 1,
14    },
15    'logger_cfgs': {
16        'use_wandb': False,
17    },
18}
19
20agent = omnisafe.Agent('CPO', env_id, custom_cfgs=custom_cfgs)
21agent.learn()

We use train_policy.py as the entrance file. You can train the agent with CPO simply using train_policy.py, with arguments about CPO and environments does the training. For example, to run CPO in SafetyPointGoal1-v0 , with 4 cpu cores and seed 0, you can use the following command:

1cd examples
2python train_policy.py --algo CPO --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 1024000 --device cpu --vector-env-nums 1 --torch-threads 1

Here are the documentation of CPO in PyTorch version.

Architecture of functions#

  • cpo.learn()

    • env.roll_out()

    • cpo.update()

      • cpo.buf.get()

      • cpo.update_policy_net()

        • Fvp()

        • conjugate_gradients()

        • search_step_size()

      • cpo.update_cost_net()

      • cpo.update_value_net()

  • cpo.log()


Documentation of new functions#

cpo.update_policy_net()

Update the policy network, flowing the next steps:

  1. Get the policy reward performance gradient g (flat as vector)

1self.pi_optimizer.zero_grad()
2loss_pi, pi_info = self.compute_loss_pi(data=data)
3loss_pi.backward()
4g_flat = get_flat_gradients_from(self.ac.pi.net)
5g_flat *= -1
  1. Get the policy cost performance gradient b (flat as vector)

1self.pi_optimizer.zero_grad()
2loss_cost, _ = self.compute_loss_cost_performance(data=data)
3loss_cost.backward()
4b_flat = get_flat_gradients_from(self.ac.pi.net)
  1. Build the Hessian-vector product based on an approximation of the KL-divergence, using conjugate_gradients

1p = conjugate_gradients(self.Fvp, b_flat, self.cg_iters)
2q = xHx
3r = g_flat.dot(p)  # g^T H^{-1} b
4s = b_flat.dot(p)  # b^T H^{-1} b
  1. Divide the optimization case into 5 kinds to compute.

  2. Determine step direction and apply SGD step after grads where set (By search_step_size())

1final_step_dir, accept_step = self.search_step_size(
2    step_dir,
3    g_flat,
4    c=c,
5    optim_case=optim_case,
6    p_dist=p_dist,
7    data=data,
8    total_steps=20,
9)
  1. Update actor network parameters

1new_theta = theta_old + final_step_dir
2set_param_values_to_model(self.ac.pi.net, new_theta)

cpo.search_step_size()

CPO algorithm performs line-search to ensure constraint satisfaction for rewards and costs, flowing the next steps:

  1. Calculate the expected reward improvement.

1expected_rew_improve = g_flat.dot(step_dir)
  1. Performs line-search to find a step improve the surrogate while not violating trust region.

  • Search acceptance step ranging from 0 to total step

1for j in range(total_steps):
2   new_theta = _theta_old + step_frac * step_dir
3   set_param_values_to_model(self.ac.pi.net, new_theta)
4   acceptance_step = j + 1
  • In each step of for loop, calculate the policy performance and KL divergence.

1with torch.no_grad():
2    loss_pi_rew, _ = self.compute_loss_pi(data=data)
3    loss_pi_cost, _ = self.compute_loss_cost_performance(data=data)
4    q_dist = self.ac.pi.dist(data['obs'])
5    torch_kl = torch.distributions.kl.kl_divergence(p_dist, q_dist).mean().item()
6loss_rew_improve = self.loss_pi_before - loss_pi_rew.item()
7cost_diff = loss_pi_cost.item() - self.loss_pi_cost_before
  • Step only if surrogate is improved and within the trust region.

 1if not torch.isfinite(loss_pi_rew) and not torch.isfinite(loss_pi_cost):
 2    self.logger.log('WARNING: loss_pi not finite')
 3elif loss_rew_improve < 0 if optim_case > 1 else False:
 4    self.logger.log('INFO: did not improve improve <0')
 5
 6elif cost_diff > max(-c, 0):
 7    self.logger.log(f'INFO: no improve {cost_diff} > {max(-c, 0)}')
 8elif torch_kl > self.target_kl * 1.5:
 9    self.logger.log(f'INFO: violated KL constraint {torch_kl} at step {j + 1}.')
10else:
11    self.logger.log(f'Accept step at i={j + 1}')
12    break
  1. Return appropriate step direction and acceptance step.


Parameters#

Specific Parameters

  • target_kl(float): Constraint for KL-distance to avoid too far gap

  • cg_damping(float): parameter plays a role in building Hessian-vector

  • cg_iters(int): Number of iterations of conjugate gradient to perform.

  • cost_limit(float): Constraint for agent to avoid too much cost

Basic parameters

  • algo (string): The name of algorithm corresponding to current class, it does not actually affect any things which happen in the following.

  • actor (string): The type of network in actor, discrete or continuous.

  • model_cfgs (dictionary) : Actor and critic’s net work configuration, it originates from algo.yaml file to describe hidden layers , activation function, shared_weights and weight_initialization_mode.

    • shared_weights (bool) : Use shared weights between actor and critic network or not.

    • weight_initialization_mode (string) : The type of weight initialization method.

      • pi (dictionary) : parameters for actor network pi

        • hidden_sizes:

          • 64

          • 64

        • activations: tanh

      • val (dictionary) parameters for critic network v

        • hidden_sizes:

          • 64

          • 64

          Hint

          Name

          Type

          Description

          v

          nn.Module

          Gives the current estimate of V for states in s.

          pi

          nn.Module

          Deterministically or continuously computes an action from the agent, conditioned on states in s.

      • activations: tanh

      • env_id (string): The name of environment we want to roll out.

      • seed (int): Define the seed of experiments.

      • parallel (int): Define the seed of experiments.

      • epochs (int): The number of epochs we want to roll out.

      • steps_per_epoch (int):The number of time steps per epoch.

      • pi_iters (int): The number of iteration when we update actor network per mini batch.

      • critic_iters (int): The number of iteration when we update critic network per mini batch.

Optional parameters

  • use_cost_critic (bool): Use cost value function or not.

  • linear_lr_decay (bool): Use linear learning rate decay or not.

  • exploration_noise_anneal (bool): Use exploration noise anneal or not.

  • reward_penalty (bool): Use cost to penalize reward or not.

  • kl_early_stopping (bool): Use KL early stopping or not.

  • max_grad_norm (float): Use maximum gradient normalization or not.

  • scale_rewards (bool): Use reward scaling or not.

Buffer parameters

Hint

Name

Description

Buffer

A buffer for storing trajectories experienced by an agent interacting with the environment, and using Generalized Advantage Estimation (GAE) for calculating the advantages of state-action pairs.

Warning

Buffer collects only raw data received from environment.

  • gamma (float): The gamma for GAE.

  • lam (float): The lambda for reward GAE.

  • adv_estimation_method (float):Roughly what KL divergence we think is appropriate between new and old policies after an update. This will get used for early stopping. (Usually small, 0.01 or 0.05.)

  • standardized_reward (int): Use standardized reward or not.

  • standardized_cost (bool): Use standardized cost or not.


References#

Appendix#

Click here to jump to CPO Theorem Click here to jump to Code with OmniSafe

Proof of theorem 1 (Difference between two arbitrarily policies)#

Our analysis will begin with the discounted future future state distribution, \(d_\pi\), which is defined as:

(18)#\[d_\pi(s)=(1-\gamma) \sum_{t=0}^{\infty} \gamma^t P\left(s_t=s|\pi\right)\]

Let \(p_\pi^t \in R^{|S|}\) denote the vector with components \(p_\pi^t(s)=P\left(s_t=s \mid \pi\right)\), and let \(P_\pi \in R^{|S| \times|S|}\) denote the transition matrix with components \(P_\pi\left(s^{\prime} \mid s\right)=\int d a P\left(s^{\prime} \mid s, a\right) \pi(a \mid s)\), which shown as below:

(19)#\[\begin{split}&\left[\begin{array}{c} p_\pi^t\left(s_1\right) \\ p_\pi^t\left(s_2\right) \\ \vdots\nonumber \\ p_\pi^t\left(s_n\right) \end{array}\right] =\left[\begin{array}{cccc} P_\pi\left(s_1 \mid s_1\right) & P_\pi\left(s_1 \mid s_2\right) & \cdots & P_\pi\left(s_1 \mid s_n\right) \\ P_\pi\left(s_2 \mid s_1\right) & P_\pi\left(s_2 \mid s_2\right) & \cdots & P_\pi\left(s_2 \mid s_n\right) \\ \vdots & \vdots & \ddots & \vdots \\ P_\pi\left(s_n \mid s_1\right) & P_\pi\left(s_n \mid s_2\right) & \cdots & P_\pi\left(s_n \mid s_n\right) \end{array}\right]\left[\begin{array}{c} p_\pi^{t-1}\left(s_1\right) \\ p_\pi^{t-1}\left(s_2\right) \\ \vdots \\ p_\pi^{t-1}\left(s_n\right) \end{array}\right]\end{split}\]

then \(p_\pi^t=P_\pi p_\pi^{t-1}=P_\pi^2 p_\pi^{t-2}=\ldots=P_\pi^t \mu\), where \(\mu\) represents the state distribution of the system at the moment. That is, the initial state distribution, then \(d_\pi\) can then be rewritten as:

(20)#\[\begin{split}d_\pi&=\left[\begin{array}{c} d_\pi\left(s_1\right) \\ d_\pi\left(s_2\right) \\ \vdots \\ d_\pi\left(s_n\right) \end{array}\right] =(1-\gamma)\left[\begin{array}{c} \gamma^0 p_\pi^0\left(s_1\right)+\gamma^1 p_\pi^1\left(s_1\right)+\gamma^2 p_\pi^2\left(s_1\right)+\ldots \\ \gamma^0 p_\pi^0\left(s_2\right)+\gamma^1 p_\pi^1\left(s_2\right)+\gamma^2 p_\pi^2\left(s_2\right)+\ldots \\ \vdots \\ \gamma^0 p_\pi^0\left(s_3\right)+\gamma^1 p_\pi^1\left(s_3\right)+\gamma^2 p_\pi^2\left(s_3\right)+\ldots \end{array}\right]\end{split}\]
(21)#\[d_\pi=(1-\gamma) \sum_{t=0}^{\infty} \gamma^t p_\pi^t=(1-\gamma)\left(1-\gamma P_\pi\right)^{-1} \mu\]

Lemma 1

For any function \(f: S \rightarrow \mathbb{R}\) and any policy \(\pi\) :

(22)#\[(1-\gamma) E_{s \sim \mu}[f(s)]+E_{\tau \sim \pi}\left[\gamma f\left(s'\right)\right]-E_{s \sim d_\pi}[f(s)]=0\]

where \(\tau \sim \pi\) denotes \(s \sim d_\pi, a \sim \pi\) and \(s^{\prime} \sim P\).

Lemma 2

For any function \(f: S \rightarrow \mathbb{R}\) and any policies \(\pi\) and \(\pi'\), define

(23)#\[L_{\pi, f}\left(\pi'\right)\doteq \mathbb{E}_{\tau \sim \pi}\left[\left(\frac{\pi^{\prime}(a \mid s)}{\pi(a \mid s)}-1\right)\left(R\left(s, a, s^{\prime}\right)+\gamma f\left(s^{\prime}\right)-f(s)\right)\right]\]

and \(\epsilon_f^{\pi^{\prime}}\doteq \max_s \left|\mathbb{E}_{\substack{a \sim \pi , s'\sim P}} \left[R\left(s, a, s^{\prime}\right)+\gamma f\left(s^{\prime}\right)-f(s)\right]\right|\). Then the following bounds hold:

(24)#\[ \begin{align}\begin{aligned}:label: cpo-eq-24\\\begin{split}&J\left(\pi'\right)-J(\pi) \geq \frac{1}{1-\gamma}\left(L_{\pi, f}\left(\pi'\right)-2 \epsilon_f^{\pi'} D_{T V}\left(d_\pi \| d_{\pi^{\prime}}\right)\right) \\ &J\left(\pi^{\prime}\right)-J(\pi) \leq \frac{1}{1-\gamma}\left(L_{\pi, f}\left(\pi'\right)+2 \epsilon_f^{\pi'} D_{T V}\left(d_\pi \| d_{\pi'}\right)\right)\end{split}\end{aligned}\end{align} \]

where \(D_{T V}\) is the total variational divergence. Furthermore, the bounds are tight when \(\pi^{\prime}=\pi\), and the LHS and RHS are identically zero.

Lemma 3

The divergence between discounted future state visitation distributions, \(\Vert d_{\pi'}-d_\pi\Vert_1\), is bounded by an average divergence of the policies \(\pi\) and \(\pi\) :

(25)#\[\Vert d_{\pi'}-d_\pi\Vert_1 \leq \frac{2 \gamma}{1-\gamma} \mathbb{E}_{s \sim d_\pi}\left[D_{T V}\left(\pi^{\prime} \| \pi\right)[s]\right]\]

where \(D_{\mathrm{TV}}(\pi' \| \pi)[s] = \frac{1}{2}\sum_a \Vert\pi'(a|s) - \pi(a|s)\Vert\)

Corollary 4

Define the matrices \(G \doteq\left(I-\gamma P_\pi\right)^{-1}, \bar{G} \doteq\left(I-\gamma P_{\pi^{\prime}}\right)^{-1}\), and \(\Delta=P_{\pi^{\prime}}-P_\pi\). Then:

(26)#\[ \begin{align}\begin{aligned}:label: cpo-eq-26\\\begin{split}G^{-1}-\bar{G}^{-1} &=\left(I-\gamma P_\pi\right)-\left(I-\gamma P_{\pi^{\prime}}\right) \\ G^{-1}-\bar{G}^{-1} &=\gamma \Delta \\ \bar{G}\left(G^{-1}-\bar{G}^{-1}\right) G &=\gamma \bar{G} \Delta G \\ \bar{G}-G &=\gamma \bar{G} \Delta G\end{split}\end{aligned}\end{align} \]

Thus, with Eq.21

(27)#\[\begin{split}d^{\pi^{\prime}}-d^\pi &=(1-\gamma)(\bar{G}-G) \mu \\ &=\gamma(1-\gamma) \bar{G} \Delta G \mu\\ &=\gamma \bar{G} \Delta d^\pi\end{split}\]

Corollary 5

(28)#\[\left\|P_{\pi^{\prime}}\right\|_1=\max _{s \in \mathcal{S}}\left\{\sum_{s^{\prime} \in \mathcal{S}} P_\pi\left(s^{\prime} \mid s\right)\right\}=1\]

Begin with the bounds from Lemma 2 and bound the divergence by Lemma 3, Theorem 1 can be finally proved.

Proof

Multiply both sides of Eq.21 by \(\left(I-\gamma P_\pi\right)\), we get:

(29)#\[\left(I-\gamma P_\pi\right) d_\pi=(1-\gamma) \mu\]

Then take the inner product with the vector \(f \in \mathbb{R}^{|S|}\) and notice that the vector \(f\) can be arbitrarily picked.

(30)#\[<f,\left(I-\gamma P_\pi\right) d_\pi>=<f,(1-\gamma) \mu>\]

Both sides of the above equation can be rewritten separately by:

(31)#\[\begin{split}&<f,\left(I-\gamma P_\pi\right) d_\pi>=\left[\sum_s f(s) d_\pi(s)\right]-\\ &\left[\sum_{s^{\prime}} f\left(s^{\prime}\right) \gamma \sum_s \sum_a \pi(a \mid s) P\left(s^{\prime} \mid s, a\right) d_\pi(s)\right] \\ &=\mathbb{E}_{s \sim d_\pi}[f(s)]-\mathbb{E}_{\tau \sim \pi}\left[\gamma f\left(s^{\prime}\right)\right]\end{split}\]
(32)#\[<f,(1-\gamma) \mu>=\sum_s f(s)(1-\gamma) \mu(s)=(1-\gamma) \mathbb{E}_{s \sim \mu}[f(s)]\]

Finally, we obtain:

(33)#\[(1-\gamma) \mathbb{E}_{s \sim \mu}[f(s)]+\mathbb{E}_{\tau \sim \pi}\left[\gamma f\left(s^{\prime}\right)\right]-\mathbb{E}_{s \sim d_\pi}[f(s)] = 0\]

Hint

Supplementary details

(34)#\[\begin{split}d^\pi &=(1-\gamma)\left(I-\gamma P_\pi\right)^{-1} \mu \\\left(I-\gamma P_\pi\right) d^\pi &=(1-\gamma) \mu \\ \int_{s \in \mathcal{S}}\left(I-\gamma P_\pi\right) d^\pi f(s) d s &=\int_{s \in \mathcal{S}} (1-\gamma) \mu f(s) d s \\ \int_{s \in \mathcal{S}} d^\pi f(s) d s-\int_{s \in \mathcal{S}} \gamma P_\pi d^\pi f(s) d s &=\int_{s \in \mathcal{S}}(1-\gamma) \mu f(s) d s \\ \mathbb{E}_{s \sim d^\pi}[f(s)] -\mathbb{E}_{s \sim d^\pi, a \sim \pi, s^{\prime} \sim P}\left[\gamma f\left(s^{\prime}\right)\right] &= (1-\gamma) \mathbb{E}_{s \sim \mu}[f(s)]\end{split}\]

Proof

note that the objective function can be represented as:

(35)#\[\begin{split}J(\pi)&=\frac{1}{1-\gamma} \mathbb{E}_{\tau \sim \pi}[R(s, a, s^{\prime})] \\ &=\mathbb{E}_{s \sim \mu}[f(s)]+\frac{1}{1-\gamma} \mathbb{E}_{\tau \sim \pi}[R(s, a, s^{\prime})+\gamma f(s^{\prime})-f(s)]\end{split}\]

Let \(\delta_f(s, a, s^{\prime})\doteq R(s, a, s^{\prime})+\gamma f(s^{\prime})-f(s)\), then by Eq.29, we easily obtain that:

(36)#\[J\left(\pi'\right)-J(\pi)=\frac{1}{1-\gamma}\left\{\mathbb{E}_{\tau \sim \pi^{\prime}}\left[\delta_f\left(s, a, s^{\prime}\right)\right]-\mathbb{E}_{\tau \sim \pi}\left[\delta_f\left(s, a, s^{\prime}\right]\right\}\right.\]

For the first term of the equation, let \(\bar{\delta}_f^{\pi'} \in \mathbb{R}^{|S|}\) denote the vector of components \(\bar{\delta}_f^{\pi'}(s)=\mathbb{E}_{a \sim \pi', s' \sim P}\left[\delta_f\left(s, a, s'|s\right)\right]\), then

(37)#\[\mathbb{E}_{\tau \sim \pi'}\left[d_f\left(s, a, s'\right)\right]=<d_{\pi'}, \bar{\delta}^f_{\pi'}>=<d_\pi,\bar{\delta}^f_{\pi'}>+<d_{\pi'}-d_\pi, \hat{d}^f_{\pi'}>\]

By using Holder’s inequality, for any \(p, q \in[1, \infty]\), such that \(\frac{1}{p}+\frac{1}{q}=1\). We have

(38)#\[\begin{split}& \mathbb{E}_{\tau \sim \pi^{\prime}}\left[\delta_f\left(s, a, s^{\prime}\right)\right] \leq \langle d_\pi, \bar{\delta}_f^{\pi^{\prime}} \rangle+\Vert d_{\pi'}-d_\pi \Vert_p \Vert \bar{\delta}_f^{\pi'}\Vert_q \\ &\mathbb{E}_{\tau \sim \pi'}\left[\delta_f\left(s, a, s'\right)\right] \geq \langle d_\pi, \bar{\delta}_f^{\pi'}\rangle-\Vert d_{\pi'}-d_\pi \Vert_p \Vert \bar{\delta}_f^{\pi'}\Vert_q\end{split}\]

Hint

Hölder’s inequality:

Let \((\mathcal{S}, \sum, \mu)\) be a measure space and let \(p, q \in [1, \infty]\) with \(\frac{1}{p} + \frac{1}{q} = 1\). Then for all measurable real- or complex-valued function \(f\) and \(g\) on \(s\), \(\|f g\|_1 \leq\|f\|_p\|g\|_q\).

If, in addition, \(p, q \in(1, \infty)\) and \(f \in L^p(\mu)\) and \(g \in L^q(\mu)\), then Hölder’s inequality becomes an equality if and only if \(|f|^p\) and \(|g|^q\) are linearly dependent in \(L^1(\mu)\), meaning that there exist real numbers \(\alpha, \beta \geq 0\), not both of them zero, such that \(\alpha|f|^p=\beta|g|^q \mu\)-almost everywhere.

The last step is to observe that, by the importance of sampling identity,

(39)#\[\begin{split}\left\langle d^\pi, \bar{\delta}_f^{\pi^{\prime}}\right\rangle &=\underset{s \sim d^\pi, a \sim \pi^{\prime}, s^{\prime} \sim P}{\mathbb{E}}\left[\delta_f\left(s, a, s^{\prime}\right)\right] \\ &=\underset{s \sim d^\pi, a \sim \pi, s^{\prime} \sim P}{\mathbb{E}}\left[\left(\frac{\pi^{\prime}(a \mid s)}{\pi(a \mid s)}\right) \delta_f\left(s, a, s^{\prime}\right)\right]\end{split}\]

After grouping terms, the bounds are obtained.

(40)#\[\begin{split}&\left\langle d^\pi, \bar{\delta}_f^{\pi^{\prime}}\right\rangle \pm\Vert d^{\pi^{\prime}}-d^\pi\Vert_p\Vert\bar{\delta}_f^{\pi^{\prime}}\Vert_q\\ &=\mathbb{E}_{\substack{s \sim d^\pi\\ a \sim \pi\\ s^{\prime} \sim P}}\left[\left(\frac{\pi'(a|s)}{\pi(a|s)}\right) \delta_f\left(s, a, s^{\prime}\right)\right] \pm 2 \epsilon_f^{\pi^{\prime}} D_{T V}\left(d_{\pi'} \| d_\pi\right)\end{split}\]
(41)#\[\begin{split}&J(\pi')-J(\pi)\\ &\leq \frac{1}{1-\gamma}\mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi \\ s' \sim P}}[(\frac{\pi^{\prime}(a|s)}{\pi(a|s)}) \delta_f(s, a, s^{\prime})]+2 \epsilon_f^{\pi^{\prime}} D_{T V}(d^{\pi^{\prime}} \| d^\pi)-\mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi \\ s' \sim P}}[\delta_f(s, a, s^{\prime})]\\ &=\frac{1}{1-\gamma}(\mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi \\ s' \sim P}}[(\frac{\pi^{\prime}(a|s)}{\pi(a|s)}) \delta_f(s, a, s^{\prime})]-\mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi \\ s' \sim P}}[\delta_f(s, a, s^{\prime})]+2 \epsilon_f^{\pi^{\prime}} D_{T V}(d^{\pi^{\prime}} \| d^\pi))\\ &=\frac{1}{1-\gamma}(\mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi \\ s' \sim P}}[(\frac{\pi^{\prime}(a \mid s)}{\pi(a \mid s)}-1) \delta_f(s, a, s^{\prime})]+2 \epsilon_f^{\pi^{\prime}} D_{T V}(d^{\pi^{\prime}} \| d^\pi))\end{split}\]

The lower bound is the same.

(42)#\[\begin{split}J\left(\pi^{\prime}\right)-J(\pi) \geq \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi \\ s' \sim P}}\left[\left(\frac{\pi^{\prime}(a|s)}{\pi(a|s)}-1\right) \delta_f\left(s, a, s^{\prime}\right)\right]-2 \epsilon_f^{\pi^{\prime}} D_{T V}\left(d^{\pi^{\prime}} \| d^\pi\right)\end{split}\]

Proof

First, using Corollary 4, we obtain

(43)#\[\begin{split}\left\|d^{\pi^{\prime}}-d^\pi\right\|_1 &=\gamma\left\|\bar{G} \Delta d^\pi\right\|_1 \\ & \leq \gamma\|\bar{G}\|_1\left\|\Delta d^\pi\right\|_1\end{split}\]

Meanwhile,

(44)#\[\begin{split}\|\bar{G}\|_1 &=\left\|\left(I-\gamma P_{\pi^{\prime}}\right)^{-1}\right\|_1 \\ &=\left\|\sum_{t=0}^{\infty} \gamma^t P_{\pi^{\prime}}^t\right\|_1 \\ & \leq \sum_{t=0}^{\infty} \gamma^t\left\|P_{\pi^{\prime}}\right\|_1^t \\ &=\left(1-\gamma\left\|P_{\pi^{\prime}}\right\|_1\right)^{-1} \\ &=(1-\gamma)^{-1}\end{split}\]

And, using Corollary 5, we have,

(45)#\[\begin{split}\Delta d^\pi\left[s^{\prime}\right] &= \sum_s \Delta\left(s^{\prime} \mid s\right) d^\pi(s) \\ &=\sum_s \left\{ P_{\pi^{\prime}}\left(s^{\prime} \mid s\right)-P_\pi\left(s^{\prime} \mid s\right) \right\} d_{\pi}(s) \\ &=\sum_s \left\{ P\left(s^{\prime} \mid s, a\right) \pi^{\prime}(a \mid s)-P\left(s^{\prime} \mid s, a\right) \pi(a \mid s) \right\} d_{\pi}(s)\\ &=\sum_s \left\{ P\left(s^{\prime} \mid s, a\right)\left[\pi^{\prime}(a \mid s)-\pi(a \mid s)\right] \right\} d_{\pi}(s)\end{split}\]

Hint

Total variation distance of probability measures

\(\Vert d_{\pi'}-d_\pi \Vert_1=\sum_{a \in \mathcal{A}}\left|d_{\pi_{{\boldsymbol{\theta}}^{\prime}}}(a|s)-d_{\pi_{\boldsymbol{\theta}}}(a|s)\right|=2 D_{\mathrm{TV}}\left(d_{\pi_{{\boldsymbol{\theta}}'}}, d_\pi\right)[s]\)

Finally, using (20), we obtain,

(46)#\[\begin{split}\left\|\Delta d^\pi\right\|_1 &=\sum_{s^{\prime}}\left|\sum_s \Delta\left(s^{\prime} \mid s\right) d^\pi(s)\right| \\ & \leq \sum_{s, s^{\prime}}\left|\Delta\left(s^{\prime} \mid s\right)\right| d^\pi(s) \\ &=\sum_{s, s^{\prime}}\left|\sum_a P\left(s^{\prime} \mid s, a\right)\left(\pi^{\prime}(a \mid s)-\pi(a \mid s)\right)\right| d^\pi(s) \\ & \leq \sum_{s, a, s^{\prime}} P\left(s^{\prime} \mid s, a\right)\left|\pi^{\prime}(a \mid s)-\pi(a \mid s)\right| d^\pi(s) \\ &=\sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) \sum_{s, a}\left|\pi^{\prime}(a \mid s)-\pi(a \mid s)\right| d^\pi(s) \\ &=\sum_{s, a}\left|\pi^{\prime}(a \mid s)-\pi(a \mid s)\right| d^\pi(s) \\ &=\sum_a \underset{s \sim d^\pi}{ } \mathbb{E}^{\prime}|(a \mid s)-\pi(a \mid s)| \\ &=2 \underset{s \sim d^\pi}{\mathbb{E}}\left[D_{T V}\left(\pi^{\prime}|| \pi\right)[s]\right]\end{split}\]

Proof of Analytical Solution to LQCLP#

Theorem 2 (Optimizing Linear Objective with Linear, Quadratic Constraints)

Consider the problem

(47)#\[\begin{split}p^*&=\min_x g^T x \\ \text { s.t. }\quad & b^T x+c \leq 0 \\ & x^T H x \leq \delta\end{split}\]

where \(g, b, x \in \mathbb{R}^n, c, \delta \in \mathbb{R}, \delta>0, H \in \mathbb{S}^n\), and \(H \succ 0\). When there is at least one strictly feasible point, the optimal point \(x^*\) satisfies

(48)#\[x^*=-\frac{1}{\lambda^*} H^{-1}\left(g+\nu^* b\right)\]

where \(\lambda^*\) and \(\nu^*\) are defined by

(49)#\[\begin{split}&\nu^*=\left(\frac{\lambda^* c-r}{s}\right)_{+}, \\ &\lambda^*=\arg \max _{\lambda \geq 0} \begin{cases}f_a(\lambda) \doteq \frac{1}{2 \lambda}\left(\frac{r^2}{s}-q\right)+\frac{\lambda}{2}\left(\frac{c^2}{s}-\delta\right)-\frac{r c}{s} & \text { if } \lambda c-r>0 \\ f_b(\lambda) \doteq-\frac{1}{2}\left(\frac{q}{\lambda}+\lambda \delta\right) & \text { otherwise }\end{cases}\end{split}\]

with \(q=g^T H^{-1} g, r=g^T H^{-1} b\), and \(s=b^T H^{-1} b\).

Furthermore, let \(\Lambda_a \doteq\{\lambda \mid \lambda c-r>0, \lambda \geq 0\}\), and \(\Lambda_b \doteq\{\lambda \mid \lambda c-r \leq 0, \lambda \geq 0\}\). The value of \(\lambda^*\) satisfies

(50)#\[\lambda^* \in\left\{\lambda_a^* \doteq \operatorname{Proj}\left(\sqrt{\frac{q-r^2 / s}{\delta-c^2 / s}}, \Lambda_a\right), \lambda_b^* \doteq \operatorname{Proj}\left(\sqrt{\frac{q}{\delta}}, \Lambda_b\right)\right\}\]

with \(\lambda^*=\lambda_a^*\) if \(f_a\left(\lambda_a^*\right)>f_b\left(\lambda_b^*\right)\) and \(\lambda = \lambda_b^*\) otherwise, and \(\operatorname{Proj}(a, S)\) is the projection of a point \(x\) on to a set \(S\). hint: the projection of a point \(x \in \mathbb{R}\) onto a convex segment of \(\mathbb{R},[a, b]\), has value \(\operatorname{Proj}(x,[a, b])=\max (a, \min (b, x))\).

Proof for Theorem 2 (Click here)

This is a convex optimization problem. When there is at least one strictly feasible point, strong duality holds by Slater’s theorem. We exploit strong duality to solve the problem analytically. First using the method of Lagrange multipliers, \(\exists \lambda, \mu \geq 0\)

(51)#\[\mathcal{L}(x, \lambda, \nu)=g^T x+\frac{\lambda}{2}\left(x^T H x-\delta\right)+\nu\left(b^T x+c\right)\]

Because of strong duality,

\(p^*=\min_x\max_{\lambda \geq 0, \nu \geq 0} \mathcal{L}(x, \lambda, \nu)\)

(52)#\[\nabla_x \mathcal{L}(x, \lambda, \nu)=\lambda H x+(g+\nu b)\]

Plug in \(x^*\),

\(H \in \mathbb{S}^n \Rightarrow H^T=H \Rightarrow\left(H^{-1}\right)^T=H^{-1}\)

(53)#\[\begin{split}x^T H x &=\left(-\frac{1}{\lambda} H^{-1}(g+\nu b)\right)^T H\left(-\frac{1}{\lambda} H^{-1}(g+\nu b)\right)\\ &=\frac{1}{\lambda^2}(g+\nu b)^T H^{-1}(g+\nu b) -\frac{1}{2 \lambda}(g+\nu b)^T H^{-1}(g+\nu b)\\ &=-\frac{1}{2 \lambda}\left(g^T H^{-1} g+\nu g^T H^{-1} b+\nu b^T H^{-1} g+\nu^2 b^T H^{-1} b\right)\\ &=-\frac{1}{2 \lambda}\left(q+2 \nu r+\nu^2 s\right)\end{split}\]
(54)#\[\begin{split}p^* &=\min_x \underset{\begin{subarray}{c} \lambda \geq 0 \\ \nu \geq 0\end{subarray}}{\max} \; g^T x + \frac{\lambda}{2} \left( x^T H x - \delta \right) + \nu \left(b^Tx +c \right) \\ &\xlongequal[duality]{strong} \underset{\begin{subarray}{c} \lambda \geq 0 \\ \nu \geq 0\end{subarray}}{\max} \min_x \; \frac{\lambda}{2} x^T H x + \left(g + \nu b\right)^T x + \left( \nu c - \frac{1}{2} \lambda \delta \right)\\ & \;\;\; \implies x^* = -\frac{1}{\lambda} H^{-1} \left(g + \nu b \right) ~~~ \nabla_x \mathcal L(x,\lambda, \nu) =0\\ &\xlongequal{\text{Plug in } x^*} \underset{\begin{subarray}{c} \lambda \geq 0 \\ \nu \geq 0\end{subarray}}{\max} \; -\frac{1}{2\lambda} \left(g + \nu b \right)^T H^{-1} \left(g + \nu b \right) + \left( \nu c - \frac{1}{2} \lambda \delta \right)\\ &\xlongequal[s \doteq b^T H^{-1} b]{ q \doteq g^T H^{-1} g, r \doteq g^T H^{-1} b } \underset{\begin{subarray}{c} \lambda \geq 0 \\ \nu \geq 0\end{subarray}}{\max} \; -\frac{1}{2\lambda} \left(q + 2 \nu r + \nu^2 s\right) + \left( \nu c - \frac{1}{2} \lambda \delta \right)\\ & \;\;\; \implies \frac {\partial\mathcal L}{\partial\nu} = -\frac{1}{2\lambda}\left( 2r + 2 \nu s \right) + c \\ &~~ \text{Optimizing single-variable convex quadratic function over } \mathbb R_+ \\ & \;\;\; \implies \nu = \left(\frac{\lambda c - r}{s} \right)_+ \\ &= \max_{\lambda \geq 0} \; \left\{ \begin{array}{ll} \frac{1}{2\lambda} \left(\frac{r^2}{s} -q\right) + \frac{\lambda}{2}\left(\frac{c^2}{s} - \delta\right) - \frac{rc}{s} & \text{if } \lambda \in \Lambda_a \\ -\frac{1}{2} \left(\frac{q}{\lambda} + \lambda \delta\right) & \text{if } \lambda \in \Lambda_b \end{array}\right.\\ &~~~~ \text{where} \begin{array}{ll} \Lambda_a \doteq \{\lambda | \lambda c - r > 0, \;\; \lambda \geq 0\}, \\ \Lambda_b \doteq \{\lambda | \lambda c - r \leq 0, \;\; \lambda \geq 0\} \end{array}\end{split}\]

\(\lambda \in \Lambda_a \Rightarrow \nu>0\), then plug in \(\nu=\frac{\lambda c-r}{s} ; \lambda \in \Lambda_a \Rightarrow \nu \leq 0\), then plug in \(\nu=0\)