Notations#

Introduction#

In this section, we introduce the notation used in this tutorial. Reinforcement Learning relies on two basic families of notation: Linear Algebra and Constrained Markov Decision Processes. Make sure you are familiar with this section before you start, and feel free to return to it whenever you are puzzled by a notation in the following chapters.

Linear Algebra#

Vector

A vector is an ordered finite list of numbers. Typically, vectors are written as vertical arrays surrounded by square brackets, as in:

(1)#\[\begin{split}\boldsymbol{a} = \left[\begin{array}{r} a_1 \\ a_2 \\ \vdots \\ a_n \end{array}\right] \in \mathbb{R}^n\end{split}\]

Usually, we use a bold lowercase letter to denote a vector (e.g. \(\boldsymbol{a}=(a_1,a_2,\cdots,a_n)\in\mathbb{R}^{n}\)), with its \(i^{\text{th}}\) element written as \(\boldsymbol{a}[i]=:a_{i},~~1\leq i\leq n.\)

Matrix

In mathematics, a matrix is a rectangular array of real or complex numbers.

Similarly, we use a bold capital letter to denote a matrix, e.g., \(\mathbf{A}=(a_{i,j})\in\mathbb{R}^{m\times n}\), with its \((i,j)^{\text{th}}\) element denoted as

(2)#\[\mathbf{A}[i,j]=:a_{i,j},\]

where \(1\leq i\leq m,1\leq j\leq n\).
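As a quick, purely illustrative example of this indexing notation, the NumPy snippet below indexes a vector and a matrix; note that the mathematical notation above is 1-indexed, while NumPy arrays are 0-indexed.

```python
import numpy as np

# Vector and matrix indexing in NumPy (0-indexed, unlike the math above).
a = np.array([1.0, 2.0, 3.0])          # a vector in R^3
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # a 2 x 3 matrix

print(a[0])     # a_1 in the notation above
print(A[1, 2])  # a_{2,3} in the notation above
print(A.shape)  # (m, n) = (2, 3)
```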

Constrained Markov Decision Processes#

For convenience of reference, we list the key notations used throughout this tutorial.

A Reinforcement Learning (RL) problem is often formulated as an infinite-horizon discounted Markov Decision Process (MDP). It is a tuple \(\mathcal{M}=\{\mathcal{S}, \mathcal{A}, \mathbb{P}, r, \mu, \gamma\}\), where

Key Notations

  • \(\mathcal{S}\) is a finite set of states;

  • \(\mathcal{A}\) is a finite set of actions;

  • \(\mathbb{P}(\cdot|\cdot,\cdot)\) is the transition probability distribution, \(\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]\);

  • \(\mu\) is the distribution of the initial state \(s_0\), \(\mathcal{S} \rightarrow [0,1]\);

  • \(r\) is the reward function, \(\mathcal{S} \rightarrow \mathbb{R}\);

  • \(\gamma\in(0,1)\) is the discount factor.

A stationary parameterized policy \(\pi_{\theta}\) is a probability distribution defined on \(\mathcal{S}\times\mathcal{A}\), where \(\pi_{\theta}(a|s)\) denotes the probability of playing \(a\) in state \(s\). To reduce clutter, we drop the explicit policy notation where possible and simply write \(\boldsymbol{\theta}\) for \(\pi_{\theta}\).
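To make these objects concrete, the following sketch instantiates a toy tabular MDP \(\{\mathcal{S}, \mathcal{A}, \mathbb{P}, r, \mu, \gamma\}\) together with a tabular softmax policy \(\pi_{\theta}\) in NumPy. All sizes and numbers are illustrative placeholders, not part of any particular environment.

```python
import numpy as np

# A toy tabular MDP {S, A, P, r, mu, gamma} with 3 states and 2 actions,
# plus a tabular softmax policy pi_theta.  All numbers are placeholders.
n_states, n_actions = 3, 2

P = np.array([                       # P[s, a, s'] = P(s' | s, a)
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.6, 0.4], [0.5, 0.5, 0.0]],
    [[0.0, 0.0, 1.0], [0.3, 0.0, 0.7]],
])
r = np.array([0.0, 1.0, 2.0])        # reward r(s), a function of the state
mu = np.array([1.0, 0.0, 0.0])       # initial-state distribution
gamma = 0.9                          # discount factor

rng = np.random.default_rng(0)
theta = rng.normal(size=(n_states, n_actions))   # one logit per (s, a) pair

def pi(s: int) -> np.ndarray:
    """Action distribution pi_theta(. | s) as a softmax over theta[s]."""
    logits = theta[s] - theta[s].max()           # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

s = rng.choice(n_states, p=mu)       # s_0 ~ mu
a = rng.choice(n_actions, p=pi(s))   # a_0 ~ pi_theta(. | s_0)
print(s, a, pi(s))
```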

Markov Decision Processes

Let \(J(\boldsymbol{\theta})\) denote the expected discounted reward of policy \(\pi_{\theta}\),

(3)#\[J(\boldsymbol{\theta}) \doteq \mathbb{E}_{\tau \sim \boldsymbol{\theta}}\left[\sum_{t=0}^{\infty} \gamma^t r\left(s_t\right)\right],\]

Here \(\tau\) denotes a trajectory \((s_0, a_0, s_1, \ldots)\), and \(\tau \sim \boldsymbol{\theta}\) is shorthand for indicating that the distribution over trajectories depends on the stationary parameterized policy \(\pi_{\theta}\): \(s_0 \sim \mu\), \(a_t \sim \boldsymbol{\theta}(\cdot|s_t)\), \(s_{t+1} \sim \mathbb{P}(\cdot | s_t, a_t)\). Meanwhile, let \(R(\tau)=\sum_{t=0}^{\infty} \gamma^t r(s_t)\) denote the discounted return of a trajectory.
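For concreteness, here is a minimal sketch of how \(R(\tau)\) can be computed for a finite, truncated trajectory. The reward sequence and discount factor below are placeholders, and the `rollout()` mentioned in the final comment is a hypothetical sampling routine, not part of any library.

```python
import numpy as np

# Discounted return R(tau) of a finite (truncated) trajectory, given the
# reward sequence r(s_0), r(s_1), ...; the rewards here are placeholders.
def discounted_return(rewards: np.ndarray, gamma: float) -> float:
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

rewards = np.array([1.0, 0.0, 0.5, 1.0])
print(discounted_return(rewards, gamma=0.9))

# J(theta) is then the average of R(tau) over many trajectories tau ~ theta,
# e.g. np.mean([discounted_return(rollout(), 0.9) for _ in range(1000)])
# for a hypothetical rollout() that samples s_0 ~ mu, a_t ~ pi_theta(.|s_t),
# s_{t+1} ~ P(.|s_t, a_t).
```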

The state-action value function

(4)#\[Q^R_{\boldsymbol{\theta}} \left(s, a\right) \doteq \mathbb{E}_{\tau \sim \boldsymbol{\theta}}\left[ R(\tau) | s_0 = s, a_0 = a \right].\]

The value function

(5)#\[V^R_{\boldsymbol{\theta}}\left(s\right) \doteq \mathbb{E}_{\tau \sim \boldsymbol{\theta}}\left[R(\tau) | s_0 = s\right].\]

And the advantage function

(6)#\[A^R_{\boldsymbol{\theta}}(s, a) \doteq Q^R_{\boldsymbol{\theta}}(s, a)-V^R_{\boldsymbol{\theta}}(s).\]
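Since \(V^R_{\boldsymbol{\theta}}(s)=\mathbb{E}_{a\sim\pi_{\theta}(\cdot|s)}\left[Q^R_{\boldsymbol{\theta}}(s,a)\right]\), the advantage can be computed from a state-action value table and a policy table, as in the minimal sketch below; both tables are random placeholders.

```python
import numpy as np

# Given a state-action value table Q[s, a] and a policy table pi[s, a],
# the value function is V(s) = sum_a pi(a|s) Q(s, a) and the advantage is
# A(s, a) = Q(s, a) - V(s).  Q and pi below are arbitrary placeholders.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_states, n_actions))            # placeholder Q^R_theta
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform policy

V = (pi * Q).sum(axis=1)          # V^R_theta(s)
A = Q - V[:, None]                # A^R_theta(s, a)

# Sanity check: the advantage has zero mean under the policy at every state.
assert np.allclose((pi * A).sum(axis=1), 0.0)
```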

Let \(\mathbb{P}_{\pi}\left(s'\mid s\right)\) denote the one-step state transition probability from \(s\) to \(s'\) under policy \(\pi\),

(7)#\[\mathbb{P}_{\pi}\left(s'\mid s\right)=\sum_{a\in\mathcal{A}}\pi\left(a\mid s\right) \mathbb{P}\left(s'\mid s,a\right).\]

Then for any initial state \(s_0 \sim \mu\), we have

(8)#\[\mathbb{P}_{\pi}\left(s_t=s\mid s_0\right)=\sum_{s'\in\mathcal{S}} \mathbb{P}_{\pi}\left(s_t=s\mid s_{t-1}=s'\right)\mathbb{P}_{\pi}\left(s_{t-1}=s'\mid s_0\right),\]

where the actions are chosen according to \(\pi\).

Let \(d_{\boldsymbol{\pi}}\) be the (unnormalized) discounted visitation frequencies of \(\pi\), defined in terms of the probabilities \(\mathbb{P}_{\pi}\left(s_t=s\mid s_0\right)\) above:

(9)#\[\begin{split}\begin{aligned} d_{\boldsymbol{\pi}}(s)&=\sum_{t=0}^{\infty} \gamma^t \mathbb{P}_{\pi}\left(s_t=s \mid s_0\right)\\ &=\mathbb{P}_{\pi}\left(s_0=s\right)+\gamma \mathbb{P}_{\pi}\left(s_1=s\mid s_0\right)+\gamma^2 \mathbb{P}_{\pi}\left(s_2=s\mid s_0\right)+\cdots. \end{aligned}\end{split}\]
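In the tabular case, stacking the one-step probabilities \(\mathbb{P}_{\pi}(s'\mid s)\) into a matrix \(P_{\pi}\) gives \(d_{\boldsymbol{\pi}} = (I - \gamma P_{\pi}^{\top})^{-1}\mu\) as an unnormalized vector over states. The sketch below computes this for a made-up two-state, two-action MDP; all numbers are placeholders.

```python
import numpy as np

# Discounted visitation frequencies d_pi for a tiny 2-state, 2-action MDP.
# P, pi, mu and gamma below are illustrative placeholders.
P = np.array([                       # P[s, a, s'] = P(s' | s, a)
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
pi = np.array([[0.5, 0.5],           # pi[s, a] = pi(a | s)
               [0.3, 0.7]])
mu = np.array([1.0, 0.0])            # initial-state distribution
gamma = 0.9

# One-step transition under pi:  P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
P_pi = np.einsum('sa,sax->sx', pi, P)

# d_pi = sum_t gamma^t (P_pi^T)^t mu = (I - gamma P_pi^T)^{-1} mu.
d_pi = np.linalg.solve(np.eye(2) - gamma * P_pi.T, mu)
print(d_pi, d_pi.sum())  # the total mass equals 1 / (1 - gamma)
```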

Constrained Markov Decision Processes

A Constrained Markov Decision Process (CMDP) extends the MDP framework by augmenting it with constraints that restrict the set of feasible policies. Specifically, we introduce a set \(C\) of auxiliary cost functions \(C_1, \cdots, C_m\) and cost limits \(d_1, \cdots, d_m\), where each \(C_i\): \(\mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}\) maps transition tuples to costs.

Let \(J^{C_i}(\boldsymbol{\theta})\) denote the expected discounted return of policy \(\boldsymbol{\theta}\) with respect to the cost function \(C_i\),

(10)#\[\begin{aligned} J^{C_i}(\boldsymbol{\theta}) = \mathbb{E}_{\tau \sim \boldsymbol{\theta}}[\sum_{t=0}^{\infty} \gamma^t C_i(s_t, a_t, s_{t+1})]. \end{aligned}\]

Writing \(\Pi\) for the set of all stationary parameterized policies, the feasible set of policies for a CMDP is then

(11)#\[\begin{aligned} \Pi_{C} \doteq \{ \pi_{\theta} \in \Pi~:~\forall~i, ~ J^{C_i}(\boldsymbol{\theta}) \leq d_i \} \end{aligned}\]

The goal of a CMDP is to find the optimal policy \(\pi^{*}\):

(12)#\[\begin{aligned} \pi^{*}=\arg\max_{\pi_\theta \in\Pi_{C}} J(\pi_{\theta}). \end{aligned}\]
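As a toy, hypothetical illustration of (11) and (12), the sketch below selects the best feasible policy among a finite set of candidates whose returns \(J(\boldsymbol{\theta})\) and costs \(J^{C_i}(\boldsymbol{\theta})\) are assumed to have been estimated already; all numbers are placeholders.

```python
import numpy as np

# Selecting the best feasible policy among candidates, given estimates of
# their returns J(theta_k) and costs J^{C_i}(theta_k).  Placeholders only.
J_R = np.array([1.0, 1.5, 2.0])          # J(theta_k) for 3 candidate policies
J_C = np.array([[0.2, 0.1],              # J^{C_i}(theta_k) for m = 2 costs
                [0.4, 0.3],
                [0.9, 0.2]])
d = np.array([0.5, 0.35])                # cost limits d_1, d_2

feasible = np.all(J_C <= d, axis=1)      # theta_k in Pi_C  <=>  all costs within limits
best = np.argmax(np.where(feasible, J_R, -np.inf))
print(feasible, best)                    # here policy 1 is the best feasible one
```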

Analogously to the reward case, letting \(C(\tau)\) denote the discounted cost return of a trajectory, we have:

The state-action value function

(13)#\[Q^{C}_{\boldsymbol{\theta}} \left(s, a\right) \doteq \mathbb{E}_{\tau \sim \boldsymbol{\theta}}\left[ C(\tau) | s_0 = s, a_0 = a \right].\]

The value function

(14)#\[V^{C}_{\boldsymbol{\theta}}\left(s\right) \doteq \mathbb{E}_{\tau \sim \boldsymbol{\theta}}\left[C(\tau) | s_0 = s\right].\]

And the advantage function

(15)#\[A^{C}_{\boldsymbol{\theta}}(s, a) \doteq Q^{C}_{\boldsymbol{\theta}}(s, a)-V^{C}_{\boldsymbol{\theta}}(s).\]

To summarize all of the above notation, we provide the following list:

  • \(\tau\) is a trajectory that consists of \(\left(s_0, a_0, s_1, a_1, \cdots\right)\).

  • \(\pi_{\theta}, \theta\) is a stationary parameterized policy, i.e., a probability distribution defined on \(\mathcal{S}\times\mathcal{A}\), where \(\pi_{\theta}(a|s)\) denotes the probability of playing \(a\) in state \(s\).

  • \(J^R(\pi_{\theta}),~ J^R(\theta)\) is the expected discounted reward over trajectories generated by the stationary parameterized policy \(\pi_{\theta}\); the two notations refer to the same quantity.

  • \(J^{\mathcal{C}}(\pi_{\theta}),~ J^{\mathcal{C}}(\theta)\) is the expected discounted cost over trajectories generated by the stationary parameterized policy \(\pi_{\theta}\); the two notations refer to the same quantity.

  • \(Q_{\pi_{\theta}}^{R}, Q_{\theta}^{R}\) is the state-action value function for reward.

  • \(Q_{\pi_{\theta}}^{\mathcal{C}_i}, Q_{\theta}^{\mathcal{C}_i}\) is the state-action value function for cost.

  • \(V_{\pi_{\theta}}^{R}, V_{\theta}^{R}\) is the value function for reward.

  • \(V_{\pi_{\theta}}^{\mathcal{C}_i}, V_{\theta}^{\mathcal{C}_i}\) is the value function for cost.

  • \(A_{\pi_{\theta}}^{R}, A_{\theta}^{R}\) is the advantage function for reward.

  • \(A_{\pi_{\theta}}^{\mathcal{C}_i}, A_{\theta}^{\mathcal{C}_i}\) is the advantage function for cost.
