Vector and Matrix#

Introduction#

Reinforcement Learning is one of the branches of machine learning most closely tied to mathematics, and Safe Reinforcement Learning in particular relies heavily on mathematical theory, especially Optimization Theory.

This section introduces the basic mathematical theory behind Safe Reinforcement Learning. We will briefly cover the following topics: Linear Algebra and Optimization Theory.

If you encounter unfamiliar mathematical concepts in subsequent chapters, please refer back to this article. If it still does not resolve your confusion, please consult a more detailed introduction to the relevant mathematical theory.

Knowledge of Vector and Matrix#

Vector Projection#

The projection of a vector \(\boldsymbol{y} \in \mathbb{R}^m\) onto the span of \(\left\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\right\}\) (here we assume \(\boldsymbol{x}_i \in \mathbb{R}^m\)) is the vector \(\boldsymbol{v} \in \operatorname{span}\left(\left\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\right\}\right)\) that is as close as possible to \(\boldsymbol{y}\), as measured by the Euclidean norm \(\|\boldsymbol{v}-\boldsymbol{y}\|_2\). We denote the projection as \(\operatorname{Proj}\left(\boldsymbol{y} ;\left\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\right\}\right)\) and define it formally as

(1)#\[\operatorname{Proj}\left(\boldsymbol{y} ;\left\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\right\}\right)=\mathop{\arg\min}\limits_{\boldsymbol{v} \in \operatorname{span}\left(\left\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\right\}\right)}\|\boldsymbol{y}-\boldsymbol{v}\|_2 .\]

Given a full rank matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) with \(m \geq n\), we can define the projection of a vector \(\boldsymbol{y} \in \mathbb{R}^m\) onto the range of \(\mathbf{A}\) as follows:

(2)#\[\operatorname{Proj}(\boldsymbol{y} ; \mathbf{A})=\mathop{\arg\min}\limits_{\boldsymbol{v} \in \mathcal{R}(\mathbf{A})}\|\boldsymbol{v}-\boldsymbol{y}\|_2=\mathbf{A}\left(\mathbf{A}^{\top} \mathbf{A}\right)^{-1} \mathbf{A}^{\top} \boldsymbol{y} .\]
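
A quick numerical check may help make this concrete. The sketch below is a minimal example assuming NumPy is available (all variable names are illustrative): it stacks the vectors \(\boldsymbol{x}_i\) as the columns of \(\mathbf{A}\) and verifies that the closed-form projection \(\mathbf{A}\left(\mathbf{A}^{\top} \mathbf{A}\right)^{-1} \mathbf{A}^{\top} \boldsymbol{y}\) agrees with solving the corresponding least-squares problem directly.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3                      # illustrative sizes with m >= n
A = rng.standard_normal((m, n))  # columns play the role of x_1, ..., x_n
y = rng.standard_normal(m)

# Closed-form projection onto the range of A: A (A^T A)^{-1} A^T y
proj_closed_form = A @ np.linalg.inv(A.T @ A) @ A.T @ y

# Equivalent view: find w minimizing ||A w - y||_2, then map w back through A
w, *_ = np.linalg.lstsq(A, y, rcond=None)
proj_lstsq = A @ w

print(np.allclose(proj_closed_form, proj_lstsq))  # True
```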

Norms#

Introduction to Vector Norms

A norm of a vector \(\Vert\boldsymbol{x}\Vert\) is a measure of the “length” of the vector. More formally, a norm is any function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\) that satisfies four properties:

  1. For all \(\boldsymbol{x} \in \mathbb{R}^n, f(\boldsymbol{x}) \geq 0\) (non-negativity).

  2. \(f(\boldsymbol{x})=0\) if and only if \(\boldsymbol{x}=\mathbf{0}\) (definiteness).

  3. For all \(\boldsymbol{x} \in \mathbb{R}^n, t \in \mathbb{R}, f(t \boldsymbol{x})=|t| f(\boldsymbol{x})\) (absolute value homogeneity).

  4. For all \(\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^n, f(\boldsymbol{x}+\boldsymbol{y}) \leq f(\boldsymbol{x})+f(\boldsymbol{y})\) (triangle inequality).

Consider the following common examples:

  • p-norm: \(\|\boldsymbol{x}\|_p=\left(\sum_{i=1}^n\left|x_i\right|^p\right)^{1 / p}\), for \(p \geq 1\).

  • 2-norm: \(\|\boldsymbol{x}\|_2=\sqrt{\sum_{i=1}^n x_i^2}\), also called Euclidean norm. Note that \(\|\boldsymbol{x}\|_2^2=\boldsymbol{x}^{\top} \boldsymbol{x}\).

  • 1-norm: \(\|\boldsymbol{x}\|_1=\sum_{i=1}^n\left|x_i\right|\).

  • Max-norm: \(\|\boldsymbol{x}\|_{\infty}=\max _i\left|x_i\right|\).

  • 0-norm: \(\|\boldsymbol{x}\|_0=\sum_{i=1}^n \mathbb{I}\left(\left|x_i\right|>0\right)\). This is a pseudo-norm since it does not satisfy homogeneity. It counts the number of non-zero elements in \(\boldsymbol{x}\). If we define \(0^0=0\), we can write this as \(\|\boldsymbol{x}\|_0=\sum_{i=1}^n x_i^0\).
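
The following sketch (assuming NumPy; the vectors and the choice \(p=3\) are illustrative) evaluates each of these norms on a small example and checks the triangle inequality for the 2-norm.

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

p = 3
p_norm    = np.sum(np.abs(x) ** p) ** (1 / p)  # p-norm, p >= 1
two_norm  = np.linalg.norm(x, 2)               # Euclidean norm, sqrt(x^T x)
one_norm  = np.linalg.norm(x, 1)               # sum of absolute values
max_norm  = np.linalg.norm(x, np.inf)          # largest absolute entry
zero_norm = np.count_nonzero(x)                # pseudo-norm: number of non-zeros

print(p_norm, two_norm, one_norm, max_norm, zero_norm)  # 4.497..., 5.0, 7.0, 4.0, 2

# Triangle inequality for the 2-norm: ||x + y||_2 <= ||x||_2 + ||y||_2
y = np.array([1.0, 2.0, -2.0])
assert np.linalg.norm(x + y, 2) <= np.linalg.norm(x, 2) + np.linalg.norm(y, 2)
```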

Introduction to Matrix Norms

Suppose we think of a matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) as defining a linear function \(f(\boldsymbol{x})=\mathbf{A} \boldsymbol{x}\). We define the induced norm of \(\mathbf{A}\) as the maximum amount by which \(f\) can lengthen any unit-norm input:

(3)#\[\|\mathbf{A}\|_p=\max _{\boldsymbol{x} \neq 0} \frac{\|\mathbf{A} \boldsymbol{x}\|_p}{\|\boldsymbol{x}\|_p}=\max _{\|\boldsymbol{x}\|_p=1}\|\mathbf{A} \boldsymbol{x}\|_p\]

Typically \(p=2\), in which case

(4)#\[\|\mathbf{A}\|_2=\sqrt{\lambda_{\max }\left(\mathbf{A}^{\top} \mathbf{A}\right)}=\max _i \sigma_i\]

where \(\sigma_i\) is the \(i^{th}\) singular value of \(\mathbf{A}\). The nuclear norm, also called the trace norm, is defined as

(5)#\[\|\mathbf{A}\|_*=\operatorname{tr}\left(\sqrt{\mathbf{A}^{\top} \mathbf{A}}\right)=\sum_i \sigma_i\]

where \(\sqrt{\mathbf{A}^{\top} \mathbf{A}}\) is the matrix square root. Since the singular values are always non-negative, we have

(6)#\[\|\mathbf{A}\|_*=\sum_i\left|\sigma_i\right|=\|\boldsymbol{\sigma}\|_1\]

Using this as a regularizer encourages many singular values to become zero, resulting in a low-rank matrix. More generally, we can define the Schatten \(p\)-norm as

(7)#\[\|\mathbf{A}\|_p=\left(\sum_i \sigma_i^p(\mathbf{A})\right)^{1 / p}\]
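
As a small sketch (assuming NumPy; the matrix and the choice \(p=4\) are illustrative), the spectral, nuclear, and Schatten \(p\)-norms can all be computed from the singular values returned by an SVD, and the first two can be cross-checked against NumPy's built-in matrix norms.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

sigma = np.linalg.svd(A, compute_uv=False)  # singular values, all non-negative

spectral = sigma.max()                      # ||A||_2  = max_i sigma_i
nuclear  = sigma.sum()                      # ||A||_*  = sum_i sigma_i
p = 4
schatten = np.sum(sigma ** p) ** (1 / p)    # Schatten p-norm

print(np.isclose(spectral, np.linalg.norm(A, 2)))      # True
print(np.isclose(nuclear,  np.linalg.norm(A, 'nuc')))  # True
```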

If we think of a matrix as a vector, we can define the matrix norm in terms of a vector norm, \(\|\mathbf{A}\|=\|\operatorname{vec}(\mathbf{A})\|\). If the vector norm is the 2-norm, the corresponding matrix norm is the Frobenius norm:

(8)#\[\|\mathbf{A}\|_F=\sqrt{\sum_{i=1}^m \sum_{j=1}^n a_{i j}^2}=\sqrt{\operatorname{tr}\left(\mathbf{A}^{\top} \mathbf{A}\right)}=\|\operatorname{vec}(\mathbf{A})\|_2\]
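
A minimal sketch (assuming NumPy; the matrix is illustrative) confirming that the three expressions above for the Frobenius norm agree:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))

entrywise = np.sqrt(np.sum(A ** 2))           # sqrt of the sum of squared entries
via_trace = np.sqrt(np.trace(A.T @ A))        # sqrt(tr(A^T A))
via_vec   = np.linalg.norm(A.reshape(-1), 2)  # 2-norm of vec(A)

print(np.allclose([entrywise, via_trace], via_vec))  # True
```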

If \(\mathbf{A}\) is expensive to evaluate, but \(\mathbf{A} \boldsymbol{v}\) is cheap (for a random vector \(\boldsymbol{v}\)), we can create a stochastic approximation to the Frobenius norm by using the Hutchinson trace estimator as follows:

(9)#\[\|\mathbf{A}\|_F^2=\operatorname{tr}\left(\mathbf{A}^{\top} \mathbf{A}\right)=\mathbb{E}\left[\boldsymbol{v}^{\top} \mathbf{A}^{\top} \mathbf{A} \boldsymbol{v}\right]=\mathbb{E}\left[\|\mathbf{A} \boldsymbol{v}\|_2^2\right]\]

where \(\boldsymbol{v} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\).
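
The sketch below (assuming NumPy; the matrix size, seed, and sample count are illustrative) approximates \(\|\mathbf{A}\|_F^2\) by averaging \(\|\mathbf{A} \boldsymbol{v}\|_2^2\) over standard-normal probe vectors, so only matrix-vector products with \(\mathbf{A}\) are needed.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 30
A = rng.standard_normal((m, n))

exact = np.sum(A ** 2)  # ||A||_F^2 computed directly from the entries

# Hutchinson-style estimate: average ||A v||_2^2 over probes v ~ N(0, I)
num_samples = 10_000
V = rng.standard_normal((n, num_samples))      # each column is one probe vector
estimate = np.mean(np.sum((A @ V) ** 2, axis=0))

print(exact, estimate)  # the estimate is close to the exact value
```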