First Order Constrained Optimization in Policy Space#

Quick Facts#

  1. FOCOPS is an on-policy algorithm.

  2. FOCOPS can be used for environments with both discrete and continuous action spaces.

  3. FOCOPS is an algorithm that uses only first-order methods.

  4. The OmniSafe implementation of FOCOPS supports parallelization.

FOCOPS Theorem#

Background#

First Order Constrained Optimization in Policy Space (FOCOPS) is a CPO-based method that maximizes an agent’s overall reward while ensuring that the agent satisfies a set of cost constraints. FOCOPS argues that CPO suffers from the following disadvantages:

Problems of CPO

  • Sampling error resulting from taking sample trajectories from the current policy.

  • Approximation errors resulting from Taylor approximations.

  • Approximation errors resulting from using the conjugate gradient method to compute the inverse of the Fisher information matrix.

Advantages of FOCOPS

  • Extremely simple to implement, since it only utilizes first-order approximations.

  • Being a simple first-order method, it avoids the last two sources of error, which are caused by the Taylor approximation and the conjugate gradient method.

  • Outperforms CPO in experiments.

  • No recovery steps are required.

FOCOPS mainly includes the following contributions:

First, it provides a two-stage policy update to optimize the current policy. Next, it provides a practical implementation for solving the two-stage policy update. Finally, FOCOPS provides rigorous derivations of the above results, detailed in the Appendix to this tutorial. One suggested reading order is CPO (Constrained Policy Optimization), PCPO (Projection-Based Constrained Policy Optimization), then FOCOPS. If you have not read PCPO, it does not matter; it will not affect your reading experience much. However, be sure to read this article after the CPO tutorial so that you can fully understand the following passage.


Optimization Objective#

In the previous chapters, you learned that CPO solves the following optimization problem:

(1)#\[\begin{split}&\pi_{k+1}=\arg \max _{\pi \in \Pi_{\boldsymbol{\theta}}} \mathbb{E}_{\substack{s \sim d_{\pi_k}\\a \sim \pi}}[A^R_{\pi_k}(s, a)]\\ \text{s.t.} \quad &J^{C_i}\left(\pi_k\right) \leq d_i-\frac{1}{1-\gamma} \mathbb{E}_{\substack{s \sim d_{\pi_k} \\ a \sim \pi}}\left[A^{C_i}_{\pi_k}(s, a)\right] \quad \forall i \\ &\bar{D}_{K L}\left(\pi \| \pi_k\right) \leq \delta\end{split}\]

where \(\Pi_{\theta}\subseteq\Pi\) denotes the set of parametrized policies with parameters \(\theta\), and \(\bar{D}_{K L}\) is the average KL divergence between two policies. In local policy search for CMDPs, we additionally require policy iterates to be feasible, so instead of optimizing over \(\Pi_{\theta}\), CPO optimizes over \(\Pi_{\theta}\cap\Pi_{C}\). Next, we will introduce how FOCOPS solves the above optimization problem. To gain a clearer understanding, we suggest reading the next section with the following questions in mind:

Questions

  • What is a two-stage policy update, and how does it work?

  • How to practically implement FOCOPS?

  • How do parameters impact the performance of the algorithm?


Two-stage Policy Update#

Instead of solving Problem Eq.1 directly, FOCOPS uses a two-stage approach, summarized below:

Two-stage Policy Update

  • Given policy \(\pi_{\theta_k}\), find an optimal update policy \(\pi^*\) by solving the optimization problem from Problem Eq.1 in the non-parameterized policy space.

  • Project the policy found in the previous step back into the parameterized policy space \(\Pi_{\theta}\) by solving for the closest policy \(\pi_{\theta}\in\Pi_{\theta}\) to \(\pi^*\), in order to obtain \(\pi_{\theta_{k+1}}\).


Finding the Optimal Update Policy#

In the first stage, FOCOPS rewrites Problem Eq.1 as below:

(2)#\[\begin{split}&\pi^*=\arg \max _{\pi \in \Pi} \mathbb{E}_{\substack{s \sim d_{\pi_k}\\a \sim \pi}}[A^R_{\pi_k}(s, a)]\\ \text{s.t.} \quad & J^{C}\left(\pi_k\right) \leq d-\frac{1}{1-\gamma} \mathbb{E}_{\substack{s \sim d_{\pi_k} \\ a \sim \pi}}\left[A^{C}_{\pi_k}(s, a)\right] \quad \\ & \bar{D}_{K L}\left(\pi \| \pi_k\right) \leq \delta\end{split}\]

This problem differs from Problem Eq.1 only in that the variable of interest is now the non-parameterized policy \(\pi\) rather than the policy parameter \(\theta\). FOCOPS provides a solution as follows:

Theorem 1

Let \(\tilde{b}=(1-\gamma)\left(d-J^C\left(\pi_{\theta_k}\right)\right)\). If \(\pi_{\theta_k}\) is a feasible solution, the optimal policy for Problem Eq.2 takes the form

(3)#\[\pi^*(a \mid s)=\frac{\pi_{\theta_k}(a \mid s)}{Z_{\lambda, \nu}(s)} \exp \left(\frac{1}{\lambda}\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)\right)\]

where \(Z_{\lambda,\nu}(s)\) is the partition function which ensures that Eq.3 is a valid probability distribution, and \(\lambda\) and \(\nu\) are the solutions to the optimization problem:

(4)#\[\begin{split}\min _{\lambda, \nu \geq 0} \lambda \delta+\nu \tilde{b}+\lambda \underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi^*}}{\mathbb{E}}[\log Z_{\lambda, \nu}(s)]\end{split}\]

The form of the optimal policy is intuitive: it places high probability mass on regions of the state-action space with high reward advantage, offset by a penalty term (\(\nu\) times the cost advantage). We will refer to the optimal solution of Problem Eq.2 as the optimal update policy. If the meaning of the above equations is not yet clear, you can simply keep in mind that FOCOPS ultimately solves Problem Eq.2 by combining Eq.3 and Eq.4; that is, Theorem 1 provides a viable solution.
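To make Eq.3 concrete, below is a minimal, self-contained sketch for a single state with a discrete action space. The advantage values, \(\lambda\), and \(\nu\) are made-up numbers for illustration; this is not part of the OmniSafe implementation.

import numpy as np

def optimal_update_policy(pi_old, adv_r, adv_c, lam, nu):
    """Reweight pi_old according to Eq.3; Z is the per-state partition function."""
    unnormalized = pi_old * np.exp((adv_r - nu * adv_c) / lam)
    z = unnormalized.sum()                        # partition function Z_{lambda, nu}(s)
    return unnormalized / z

pi_old = np.array([0.25, 0.25, 0.25, 0.25])       # current policy pi_{theta_k}(.|s)
adv_r = np.array([1.0, 0.2, -0.5, 0.0])           # reward advantages A(s, a)
adv_c = np.array([2.0, 0.0, 0.1, 0.0])            # cost advantages A^C(s, a)

pi_star = optimal_update_policy(pi_old, adv_r, adv_c, lam=1.0, nu=0.5)
print(pi_star.round(3))                           # high-reward, low-cost actions gain probability mass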

Question

What is FOCOPS's worst-case guarantee on cost-constraint violation?

Question

Can FOCOPS solve the multi-constraint problem and how?

Answer

FOCOPS shows that the optimal update policy \(\pi^*\) satisfies the following worst-case cost-constraint bound, analogous to the one in CPO:

(5)#\[J^C\left(\pi^*\right) \leq d+\frac{\sqrt{2 \delta} \gamma \epsilon_C^{\pi^*}}{(1-\gamma)^2}\]

where \(\epsilon_C^{\pi^*}=\max _s\left|\mathbb{E}_{a \sim \pi^*}\left[A^C_{\pi_{\theta_k}}(s, a)\right]\right|\).

Answer

By introducing Lagrange multipliers \(\nu_1,\nu_2,\ldots,\nu_m\ge0\), one for each cost constraint, and applying a similar duality argument, FOCOPS extends its results to accommodate multiple constraints.
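For intuition, with multipliers \(\nu_1,\ldots,\nu_m\), one would expect the optimal update policy to take the analogous form below. This display is our reconstruction from Eq.3 and the duality argument above, not an equation reproduced in this section:

\[\pi^*(a \mid s)=\frac{\pi_{\theta_k}(a \mid s)}{Z_{\lambda, \boldsymbol{\nu}}(s)} \exp \left(\frac{1}{\lambda}\left(A_{\pi_{\theta_k}}(s, a)-\sum_{i=1}^m \nu_i A^{C_i}_{\pi_{\theta_k}}(s, a)\right)\right)\]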


Approximating the Optimal Update Policy#

The optimal update policy \(\pi^*\) was obtained in the previous section; however, it is not a parameterized policy. In this section, we show how FOCOPS projects the optimal update policy back into the parameterized policy space by minimizing the loss function:

(6)#\[\mathcal{L}(\theta)=\underset{s \sim d^{\pi_{\theta_k}}}{\mathbb{E}}\left[D_{\mathrm{KL}}\left(\pi_\theta \| \pi^*\right)[s]\right]\]

Here \(\pi_{\theta}\in \Pi_{\theta}\) is the projected policy that FOCOPS uses to approximate the optimal update policy. First-order methods can be used to minimize this loss function:

Corollary 1

The gradient of \(\mathcal{L}(\theta)\) takes the form

(7)#\[\nabla_\theta \mathcal{L}(\theta)=\underset{s \sim d^{\pi_{\theta_k}}}{\mathbb{E}}\left[\nabla_\theta D_{K L}\left(\pi_\theta \| \pi^*\right)[s]\right]\]

where

(8)#\[\begin{split}\nabla_\theta D_{K L}\left(\pi_\theta \| \pi^*\right)[s] &=\nabla_\theta D_{K L}\left(\pi_\theta \| \pi_{\theta_k}\right)[s] \\ & -\frac{1}{\lambda} \underset{a \sim \pi_{\theta_k}}{\mathbb{E}}\left[\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)\right]\end{split}\]

Note that Equation Eq.7 can be estimated by sampling from trajectories generated by the policy \(\pi_{\theta_k}\), so the policy can be trained using stochastic gradient descent.

Corollary 1 outlines the FOCOPS algorithm:

At every iteration, we begin with a policy \(\pi_{\theta_k}\), which we use to run trajectories and gather data. We first use that data and Equation Eq.4 to estimate \(\lambda\) and \(\nu\). We then draw a mini-batch from the data to estimate the gradient \(\nabla_\theta \mathcal{L}(\theta)\) given in Corollary 1. After taking a gradient step using Equation Eq.7, we draw another mini-batch and repeat the process.
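The following is a minimal, runnable sketch of one such iteration on synthetic data, using a toy tabular categorical policy. The hyperparameter values and the synthetic "rollout" statistics are illustrative assumptions, and this is not the OmniSafe implementation; for simplicity, \(\nu\) is adapted with the projected-gradient heuristic introduced in the next section rather than by solving Eq.4 exactly.

import torch

n_states, n_actions = 8, 4
lam, nu, nu_lr, cost_limit, delta = 1.5, 0.1, 0.01, 25.0, 0.02

logits = torch.zeros(n_states, n_actions, requires_grad=True)   # parameters of pi_theta
old_logits = logits.detach().clone()                            # frozen copy: pi_{theta_k}
optimizer = torch.optim.Adam([logits], lr=3e-4)

# Synthetic "rollout" statistics standing in for real trajectory data.
obs = torch.randint(0, n_states, (256,))
act = torch.randint(0, n_actions, (256,))
adv_r = torch.randn(256)   # estimates of A(s, a)
adv_c = torch.randn(256)   # estimates of A^C(s, a)
ep_cost = 30.0             # Monte Carlo estimate of J^C(pi_{theta_k})

# Step 1: one projected-gradient step on nu (cf. Eq.10), clipped to [0, 2].
nu = min(max(nu + nu_lr * (ep_cost - cost_limit), 0.0), 2.0)

# Step 2: mini-batch gradient steps on the projection loss whose gradient is Eq.7-Eq.8.
for _ in range(10):
    idx = torch.randperm(256)[:64]
    s, a = obs[idx], act[idx]
    dist = torch.distributions.Categorical(logits=logits[s])
    old_dist = torch.distributions.Categorical(logits=old_logits[s])
    kl = torch.distributions.kl_divergence(dist, old_dist)        # per-state KL(pi_theta || pi_theta_k)
    ratio = torch.exp(dist.log_prob(a) - old_dist.log_prob(a))    # pi_theta(a|s) / pi_theta_k(a|s)
    loss = kl - (1.0 / lam) * ratio * (adv_r[idx] - nu * adv_c[idx])
    loss = (loss * (kl.detach() <= delta).float()).mean()         # per-state indicator I(s_j)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()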


Practical Implementation#

Hint

Solving Problem Eq.4 is computationally impractical for large state or action spaces as it requires calculating the partition function \(Z_{\lambda,\nu}(s)\), which often involves evaluating a high-dimensional integral or sum. Furthermore, \(\lambda\) and \(\nu\) depend on \(k\) and should be adapted at every iteration.

So in this section, we introduce how FOCOPS implements its algorithm in practice. Through hyperparameter sweeps, the FOCOPS authors found that a fixed \(\lambda\) provides good results, which means the value of \(\lambda\) does not have to be updated. However, \(\nu\) needs to be continuously adapted during training to ensure cost-constraint satisfaction. FOCOPS appeals to an intuitive heuristic for determining \(\nu\), based on primal-dual gradient methods. By strong duality, the optimal \(\lambda^*\) and \(\nu^*\) minimize the dual function Eq.4, which we denote as \(L(\pi^*,\lambda,\nu)\). Applying gradient descent w.r.t. \(\nu\) to minimize \(L(\pi^*,\lambda,\nu)\), we obtain:

Corollary 2

The derivative of \(L(\pi^*,\lambda,\nu)\) w.r.t \(\nu\) is

(9)#\[\begin{split}\frac{\partial L\left(\pi^*, \lambda, \nu\right)}{\partial \nu}=\tilde{b}-\underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[A^C_{\pi_{\theta_k}}(s, a)\right]\end{split}\]

The last term in the gradient expression in Equation Eq.9 cannot be evaluated, since we do not have access to \(\pi^*\). However, since \(\pi_{\theta_k}\) and \(\pi^*\) are ‘close’, it is reasonable to assume that \(\mathbb{E}_{s \sim d^{\pi_{\theta_k}}, a \sim \pi^*}\left[A^C_{\pi_{\theta_k}}(s, a)\right] \approx \mathbb{E}_{s \sim d^{\pi_{\theta_k}}, a \sim \pi_{\theta_k}}\left[A^C_{\pi_{\theta_k}}(s, a)\right]=0\). In practice, this term is set to zero, which gives the update rule:

(10)#\[\nu \leftarrow \underset{\nu}{\operatorname{proj}}\left[\nu-\alpha\left(d-J^C\left(\pi_{\theta_k}\right)\right)\right]\]

where \(\alpha\) is the step size. Note that we have incorporated the discount factor \((1-\gamma)\) from \(\tilde{b}\) into the step size. The projection operator \(\operatorname{proj}_{\nu}\) projects \(\nu\) back into the interval \([0,\nu_{max}]\), where \(\nu_{max}\) is chosen so that \(\nu\) does not become too large; in fact, the FOCOPS authors report that even setting \(\nu_{max}=+\infty\) does not appear to reduce performance greatly. In practice, \(J^C(\pi_{\theta_k})\) is estimated via Monte Carlo methods using trajectories collected from \(\pi_{\theta_k}\). Using the update rule Eq.10, FOCOPS performs one update step on \(\nu\) before updating the policy parameters \(\theta\). A per-state acceptance indicator function \(I\left(s_j\right):=\mathbf{1}_{D_{\mathrm{KL}}\left(\pi_\theta \| \pi_{\theta_k}\right)\left[s_j\right] \leq \delta}\) is also added to the gradient estimate Eq.7, to better enforce the trust-region constraint of the first-order method.
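Putting these pieces together, the resulting sample-based gradient estimator can be written as follows. This display is our reconstruction from Eq.7, Eq.8, and the indicator \(I(s_j)\) above; see the original paper for the exact statement:

\[\nabla_\theta \mathcal{L}(\theta) \approx \frac{1}{N} \sum_{j=1}^N\left[\nabla_\theta D_{\mathrm{KL}}\left(\pi_\theta \| \pi_{\theta_k}\right)\left[s_j\right]-\frac{1}{\lambda} \frac{\nabla_\theta \pi_\theta\left(a_j \mid s_j\right)}{\pi_{\theta_k}\left(a_j \mid s_j\right)}\left(\hat{A}\left(s_j, a_j\right)-\nu \hat{A}^C\left(s_j, a_j\right)\right)\right] I\left(s_j\right)\]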

Hint

Here \(N\) is the number of samples collected by the policy \(\pi_{\theta_k}\), and \(\hat A\) and \(\hat A^C\) are estimates of the advantage functions (for the reward and cost), obtained from the critic networks using the Generalized Advantage Estimator (GAE). Note that FOCOPS only requires first-order methods (gradient descent) and is thus extremely simple to implement.


Variables Analysis#

In this section, we explain the meaning of the parameters \(\lambda\) and \(\nu\) in FOCOPS and their impact on the algorithm’s performance in experiments.

Analysis of \(\lambda\)

In Equation Eq.3, note that as \(\lambda \rightarrow 0\), \(\pi^*\) approaches a greedy policy; as \(\lambda\) increases, the policy becomes more exploratory. Therefore \(\lambda\) is similar to the temperature term used in maximum entropy reinforcement learning, which has been shown to produce good results when fixed during training. In practice, FOCOPS reaches its best performance when \(\lambda\) is fixed.

Analysis of \(\nu\)

We recall that in Equation Eq.3, \(\nu\) acts as a cost-penalty term: increasing \(\nu\) makes it less likely for state-action pairs with higher costs to be sampled by \(\pi^*\). In this regard, the update rule in Eq.10 is intuitive: it increases \(\nu\) if \(J^C(\pi_{\theta_k})>d\) (i.e., the agent violates the cost constraint) and decreases \(\nu\) otherwise.


Code with OmniSafe#

Quick start#

Run FOCOPS in OmniSafe

Here are 3 ways to run FOCOPS in OmniSafe:

  • Run Agent from preset yaml file

  • Run Agent from custom config dict

  • Run Agent from custom terminal config

import omnisafe


env_id = 'SafetyPointGoal1-v0'

agent = omnisafe.Agent('FOCOPS', env_id)
agent.learn()
import omnisafe


env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    'train_cfgs': {
        'total_steps': 1024000,
        'vector_env_nums': 1,
        'parallel': 1,
    },
    'algo_cfgs': {
        'update_cycle': 2048,
        'update_iters': 1,
    },
    'logger_cfgs': {
        'use_wandb': False,
    },
}

agent = omnisafe.Agent('FOCOPS', env_id, custom_cfgs=custom_cfgs)
agent.learn()

We use train_policy.py as the entry file. You can train the agent with FOCOPS simply by running train_policy.py with arguments specifying the algorithm and the environment. For example, to run FOCOPS in SafetyPointGoal1-v0 on the CPU with a single vectorized environment, you can use the following command:

cd examples
python train_policy.py --algo FOCOPS --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 1024000 --device cpu --vector-env-nums 1 --torch-threads 1

Architecture of functions#

  • focops.learn()

    • env.roll_out()

    • focops.update()

      • focops.buf.get()

      • focops.pre_process_data(raw_data)

      • focops.update_policy_net()

      • focops.update_cost_net()

      • focops.update_value_net()

  • focops.log()


Documentation of new functions#

focops._loss_pi(data: dict)

Compute the loss of the policy network, following these steps:

  1. Calculate the KL divergence between the new policy and the old policy

# Forward pass: new action distribution and log-probabilities of the taken actions
dist, _log_p = self.ac.pi(data['obs'], data['act'])
# Importance-sampling ratio pi_theta(a|s) / pi_theta_k(a|s)
ratio = torch.exp(_log_p - data['log_p'])
# Per-sample KL divergence between the new and the old action distributions
kl_new_old = torch.distributions.kl.kl_divergence(dist, self.p_dist).sum(-1, keepdim=True)
  2. Compute the loss of the policy network based on the FOCOPS method, where self.lagrangian_multiplier is \(\nu\) and self.lam is \(\lambda\) in the FOCOPS paper.

# FOCOPS policy loss: per-state KL term minus the scaled advantage term,
# masked by the per-state indicator 1{KL <= eta}
loss_pi = (
    kl_new_old
    - (1 / self.lam) * ratio * (data['adv'] - self.lagrangian_multiplier * data['cost_adv'])
) * (kl_new_old.detach() <= self.eta).type(torch.float32)
loss_pi = loss_pi.mean()
# Entropy regularization term
loss_pi -= self.entropy_coef * dist.entropy().mean()

focops.update_lagrange_multiplier(ep_costs: float)

The FOCOPS algorithm updates self.lagrangian_multiplier, which is \(\nu\) in the FOCOPS paper, by projection:

# Gradient step on nu (cf. Eq.10): increase nu when the episode cost exceeds the limit
self.lagrangian_multiplier += self.lambda_lr * (ep_costs - self.cost_limit)
# Project nu back into [0, nu_max] (here nu_max = 2.0)
if self.lagrangian_multiplier < 0.0:
    self.lagrangian_multiplier = 0.0
elif self.lagrangian_multiplier > 2.0:
    self.lagrangian_multiplier = 2.0

Parameters#

Specific Parameters

  • target_kl (float): The target KL divergence, constraining how far the new policy may move from the old one.

  • cg_damping (float): Damping coefficient used when building Hessian-vector products.

  • cg_iters (int): Number of conjugate gradient iterations to perform.

  • cost_limit (float): The upper bound on cost that the agent should not exceed (the constraint threshold \(d\)).

Basic parameters

  • algo (string): The name of the algorithm corresponding to the current class; it does not affect anything that follows.

  • actor (string): The type of actor network, discrete or continuous.

  • model_cfgs (dictionary): Actor and critic network configuration. It originates from the algo.yaml file and describes the hidden layers, activation function, shared_weights, and weight_initialization_mode.

    • shared_weights (bool): Whether to share weights between the actor and critic networks.

    • weight_initialization_mode (string): The type of weight initialization method.

      • pi (dictionary): Parameters for the actor network pi.

        • hidden_sizes:

          • 64

          • 64

        • activations: tanh

      • val (dictionary): Parameters for the critic network v.

        • hidden_sizes:

          • 64

          • 64

        • activations: tanh

Hint

    • v (nn.Module): Gives the current estimate of V for states in s.

    • pi (nn.Module): Deterministically or continuously computes an action from the agent, conditioned on states in s.

  • env_id (string): The name of the environment we want to roll out.

  • seed (int): The random seed of the experiment.

  • parallel (int): The number of parallel processes used for training.

  • epochs (int): The number of epochs we want to roll out.

  • steps_per_epoch (int): The number of time steps per epoch.

  • pi_iters (int): The number of iterations for updating the actor network per mini-batch.

  • critic_iters (int): The number of iterations for updating the critic network per mini-batch.

Optional parameters

  • use_cost_critic (bool): Use cost value function or not.

  • linear_lr_decay (bool): Use linear learning rate decay or not.

  • exploration_noise_anneal (bool): Use exploration noise anneal or not.

  • reward_penalty (bool): Use cost to penalize reward or not.

  • kl_early_stopping (bool): Use KL early stopping or not.

  • max_grad_norm (float): The maximum gradient norm used for gradient clipping.

  • scale_rewards (bool): Use reward scaling or not.

Buffer parameters

Hint

  • Buffer: A buffer for storing trajectories experienced by an agent interacting with the environment, and using Generalized Advantage Estimation (GAE) for calculating the advantages of state-action pairs.

Warning

Buffer collects only raw data received from the environment.

  • gamma (float): The gamma for GAE.

  • lam (float): The lambda for reward GAE.

  • adv_estimation_method (string): The method used to estimate advantages (e.g., GAE).

  • standardized_reward (bool): Use standardized reward or not.

  • standardized_cost (bool): Use standardized cost or not.
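For reference, the sketch below shows how the gamma and lam parameters interact in a generic GAE computation. It mirrors the standard GAE recursion rather than OmniSafe's exact buffer code, and the trajectory values are made up for illustration.

import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory segment."""
    values = np.append(values, last_value)
    advantages = np.zeros_like(rewards, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD residual
        gae = delta + gamma * lam * gae                          # discounted sum of residuals
        advantages[t] = gae
    return advantages

rewards = np.array([1.0, 0.0, 0.5])
values = np.array([0.9, 0.8, 0.4])
print(gae_advantages(rewards, values, last_value=0.0))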


References#

Appendix#

Proof for Theorem 1#

Lemma 1

Problem Eq.2 is convex w.r.t. \(\pi=\{\pi(a|s):s\in \mathcal{S},a\in\mathcal{A}\}\).

Proof of Lemma 1

First, note that the objective function is linear w.r.t. \(\pi\). Since \(J^{C}(\pi_{\theta_k})\) is a constant w.r.t. \(\pi\), the cost constraint in Problem Eq.2 is linear. The KL constraint in Problem Eq.2 can be rewritten as \(\sum_s d^{\pi_{\theta_k}}(s) D_{\mathrm{KL}}\left(\pi \| \pi_{\theta_k}\right)[s] \leq \delta\). The KL divergence is convex w.r.t. its first argument, hence this constraint, a non-negative linear combination of convex functions, is also convex. Since \(\pi_{\theta_k}\) is feasible, it satisfies the cost constraint and strictly satisfies the KL constraint (its KL divergence to itself is zero). Therefore Slater’s constraint qualification holds, and strong duality holds.

Proof of Theorem 1

Based on Lemma 1 the optimal value of the Problem Eq.2 \(p^*\) can be solved by solving the corresponding dual problem. Let

(11)#\[L(\pi, \lambda, \nu)=\lambda \delta+\nu \tilde{b}+\underset{s \sim d^{\pi_{\theta_k}}}{\mathbb{E}}\left[A^{lag}-\lambda D_{\mathrm{KL}}\left(\pi \| \pi_{\theta_k}\right)[s]\right]\nonumber\]

where \(A^{lag}=\underset{a \sim \pi(\cdot \mid s)}{\mathbb{E}}\left[A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right]\). Therefore,

(12)#\[p^*=\max _{\pi \in \Pi} \min _{\lambda, \nu \geq 0} L(\pi, \lambda, \nu)=\min _{\lambda, \nu \geq 0} \max _{\pi \in \Pi} L(\pi, \lambda, \nu)\]

Note that if \(\pi^*\), \(\lambda^*\), \(\nu^*\) are optimal for Problem Eq.12, then \(\pi^*\) is also optimal for Problem Eq.2 because of strong duality.

Consider the inner maximization problem in Problem Eq.12. We separate it from the original problem and try to solve it first:

(13)#\[\begin{split}&\underset{\pi}{\operatorname{max}} A^{lag}-\underset{a \sim \pi(\cdot \mid s)}{\mathbb{E}}\left[\lambda\left(\log \pi(a \mid s)-\log \pi_{\theta_k}(a \mid s)\right)\right] \\ \text { s.t. } & \sum_a \pi(a \mid s)=1 \\ & \pi(a \mid s) \geq 0 \quad \forall a \in \mathcal{A}\end{split}\]

For each state \(s\), this is equivalent to the inner maximization problem in Eq.12. We can solve this convex optimization problem using a simple Lagrangian argument, writing its Lagrangian as:

(14)#\[G(\pi)=\sum_a \pi(a \mid s)\left[A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a) -\lambda\left(\log \pi(a \mid s)-\log \pi_{\theta_k}(a \mid s)\right)\right]+\zeta\left(\sum_a \pi(a \mid s)-1\right)\]

where \(\zeta\) is the Lagrange multiplier associated with the constraint \(\sum_a \pi(a \mid s)=1\). Differentiating \(G(\pi)\) w.r.t. \(\pi(a \mid s)\) for some \(a\):

(15)#\[\frac{\partial G}{\partial \pi(a \mid s)}=A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)-\lambda\left(\log \pi(a \mid s)+1-\log \pi_{\theta_k}(a \mid s)\right)+\zeta\]

Setting Equation Eq.15 to zero and rearranging terms, we obtain:

(16)#\[\pi(a \mid s)=\pi_{\theta_k}(a \mid s) \exp \left(\frac{1}{\lambda}\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)+\frac{\zeta}{\lambda}-1\right)\]

We choose \(\zeta\) so that \(\sum_a \pi(a \mid s)=1\) and write \(\exp \left(1-\zeta / \lambda\right)\) as \(Z_{\lambda, \nu}(s)\). We find that the optimal solution \(\pi^*\) to Problem Eq.13 takes the form

(17)#\[\pi^*(a \mid s)=\frac{\pi_{\theta_k}(a \mid s)}{Z_{\lambda, \nu}(s)} \exp \left(\frac{1}{\lambda}\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)\right)\]

Then we obtain:

(18)#\[\begin{split}&\underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)-\lambda\left(\log \pi^*(a \mid s)-\log \pi_{\theta_k}(a \mid s)\right)\right] \\ = &\underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)-\lambda\left(\log \pi_{\theta_k}(a \mid s)-\log Z_{\lambda, \nu}(s)\right.\right. \\ &\left.\left. + \frac{1}{\lambda}\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)-\log \pi_{\theta_k}(a \mid s)\right)\right]\\ = &\lambda\underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[\log Z_{\lambda,\nu}(s)\right]\nonumber\end{split}\]

Plugging this result back into Equation Eq.12, we obtain:

(19)#\[\begin{split}p^*=\min_{\lambda,\nu\ge0}\lambda\delta+\nu\tilde{b}+\lambda\underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[\log Z_{\lambda,\nu}(s)\right]\end{split}\]

Proofs of Corollaries#

Proof of Corollary 1

We only need to calculate the gradient of the loss function for a single sampled state \(s\). We first note that

(20)#\[\begin{split}&D_{\mathrm{KL}}\left(\pi_\theta \| \pi^*\right)[s]\\ =&-\sum_a \pi_\theta(a \mid s) \log \pi^*(a \mid s)+\sum_a \pi_\theta(a \mid s) \log \pi_\theta(a \mid s) \\ =&H\left(\pi_\theta, \pi^*\right)[s]-H\left(\pi_\theta\right)[s]\end{split}\]

where \(H\left(\pi_\theta\right)[s]\) is the entropy and \(H\left(\pi_\theta, \pi^*\right)[s]\) is the cross-entropy under state \(s\). This identity is standard in information theory and can be found in any information theory textbook. Expanding the cross-entropy term gives us the following:

(21)#\[\begin{split}&H\left(\pi_\theta, \pi^*\right)[s]\\ &=-\sum_a \pi_\theta(a \mid s) \log \pi^*(a \mid s) \\ &=-\sum_a \pi_\theta(a \mid s) \log \left(\frac{\pi_{\theta_k}(a \mid s)}{Z_{\lambda, \nu}(s)} \exp \left[\frac{1}{\lambda}\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)\right]\right) \\ &=-\sum_a \pi_\theta(a \mid s) \log \pi_{\theta_k}(a \mid s)+\log Z_{\lambda, \nu}(s)-\frac{1}{\lambda} \sum_a \pi_\theta(a \mid s)\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)\end{split}\]

We then subtract the entropy term to recover the KL divergence:

(22)#\[\begin{split}&D_{\mathrm{KL}}\left(\pi_\theta \| \pi^*\right)[s]=D_{\mathrm{KL}}\left(\pi_\theta \| \pi_{\theta_k}\right)[s]+\log Z_{\lambda, \nu}(s)-\\&\frac{1}{\lambda} \underset{a \sim \pi_{\theta_k}(\cdot \mid s)}{\mathbb{E}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)\right]\nonumber\end{split}\]

In the last equality, we applied importance sampling to rewrite the expectation w.r.t. \(\pi_{\theta_k}\). Finally, taking the gradient on both sides gives us the following:

(23)#\[\begin{split}&\nabla_\theta D_{\mathrm{KL}}\left(\pi_\theta \| \pi^*\right)[s]=\nabla_\theta D_{\mathrm{KL}}\left(\pi_\theta \| \pi_{\theta_k}\right)[s]\\&-\frac{1}{\lambda} \underset{a \sim \pi_{\theta_k}(\cdot \mid s)}{\mathbb{E}}\left[\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)\right]\nonumber\end{split}\]

Proof of Corollary 2

From Theorem 1, we have:

(24)#\[\begin{split}L\left(\pi^*, \lambda, \nu\right)=\lambda \delta+\nu \tilde{b}+\lambda \underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[\log Z_{\lambda, \nu}(s)\right]\end{split}\]

The first two terms form an affine function w.r.t. \(\nu\); therefore, their derivative w.r.t. \(\nu\) is \(\tilde{b}\). We then focus on the expectation in the last term. To simplify the derivation, we first calculate the derivative of \(\pi^*\) w.r.t. \(\nu\):

(25)#\[\begin{split}\frac{\partial \pi^*(a \mid s)}{\partial \nu} &=\frac{\pi_{\theta_k}(a \mid s)}{Z_{\lambda, \nu}^2(s)}\left[Z_{\lambda, \nu}(s) \frac{\partial}{\partial \nu} \exp \left(\frac{1}{\lambda}\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)\right)\right.\\ &\left.-\exp \left(\frac{1}{\lambda}\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)\right) \frac{\partial Z_{\lambda, \nu}(s)}{\partial \nu}\right] \\ &=-\frac{A^C_{\pi_{\theta_k}}(s, a)}{\lambda} \pi^*(a \mid s)-\pi^*(a \mid s) \frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}\nonumber\end{split}\]

Therefore the derivative of the expectation in the last term of \(L(\pi^*,\lambda,\nu)\) can be written as:

(26)#\[\begin{split}\frac{\partial}{\partial \nu} \underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[\log Z_{\lambda, \nu}(s)\right] &= \underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi_{\theta_k}}}{\mathbb{E}}\left[\frac{\partial}{\partial \nu}\left(\frac{\pi^*(a \mid s)}{\pi_{\theta_k}(a \mid s)} \log Z_{\lambda, \nu}(s)\right)\right] \\ &= \underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi_{\theta_k}}}{\mathbb{E}}\left[\frac{1}{\pi_{\theta_k}(a \mid s)}\left(\frac{\partial \pi^*(a \mid s)}{\partial \nu} \log Z_{\lambda, \nu}(s)+\pi^*(a \mid s) \frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}\right)\right] \\ &= \underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[-\left(\frac{A^C_{\pi_{\theta_k}}(s, a)}{\lambda}+\frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}\right) \log Z_{\lambda, \nu}(s)+\frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}\right]\end{split}\]

Also:

(27)#\[\begin{split}\frac{\partial Z_{\lambda, \nu}(s)}{\partial \nu} &=\frac{\partial}{\partial \nu} \sum_a \pi_{\theta_k}(a \mid s) \exp \left(\frac{1}{\lambda}\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)\right) \\ &=\sum_a-\pi_{\theta_k}(a \mid s) \frac{A^C_{\pi_{\theta_k}}(s, a)}{\lambda} \exp \left(\frac{1}{\lambda}\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)\right) \\ &=\sum_a-\frac{A^C_{\pi_{\theta_k}}(s, a)}{\lambda} \frac{\pi_{\theta_k}(a \mid s)}{Z_{\lambda, \nu}(s)} \exp \left(\frac{1}{\lambda}\left(A_{\pi_{\theta_k}}(s, a)-\nu A^C_{\pi_{\theta_k}}(s, a)\right)\right) Z_{\lambda, \nu}(s) \\ &=-\frac{Z_{\lambda, \nu}(s)}{\lambda} \underset{a \sim \pi^*(\cdot \mid s)}{\mathbb{E}}\left[A^C_{\pi_{\theta_k}}(s, a)\right]\end{split}\]

Therefore:

(28)#\[\frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}=\frac{\partial Z_{\lambda, \nu}(s)}{\partial \nu} \frac{1}{Z_{\lambda, \nu}(s)}=-\frac{1}{\lambda} \underset{a \sim \pi^*(\cdot \mid s)}{\mathbb{E}}\left[A^C_{\pi_{\theta_k}}(s, a)\right]\]

Plugging Eq.28 into the last equality in Eq.26 gives us:

(29)#\[\begin{split}\frac{\partial}{\partial \nu} \underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[\log Z_{\lambda, \nu}(s)\right] &=\underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[-\frac{A^C_{\pi_{\theta_k}}(s, a)}{\lambda} \log Z_{\lambda, \nu}(s)+\frac{A^C_{\pi_{\theta_k}}(s, a)}{\lambda} \log Z_{\lambda, \nu}(s)-\frac{1}{\lambda} A^C_{\pi_{\theta_k}}(s, a)\right] \\ &=-\frac{1}{\lambda} \underset{\substack{s \sim d^{\pi_{\theta_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[A^C_{\pi_{\theta_k}}(s, a)\right]\end{split}\]

Combining Eq.29 with the derivative of the affine term gives us the final desired result.