Introduction#

Welcome To OmniSafe Tutorial#


Welcome to OmniSafe! OmniSafe is a comprehensive and reliable benchmark for safe reinforcement learning, encompassing more than 20 different algorithms that cover a multitude of SafeRL domains, and delivering a new suite of testing environments.

Hint

For beginners, we first introduce Safe RL (Safe Reinforcement Learning). Safe Reinforcement Learning can be defined as learning policies that maximize the expected return while ensuring reasonable system performance and respecting safety constraints during both the learning and deployment processes.
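To make this concrete, Safe RL problems are commonly cast as a constrained Markov decision process (CMDP). A standard formulation, written here in illustrative notation (the tutorial's unified notation system is introduced later), is:

\[\max_{\pi} \; J^{R}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\right] \quad \text{s.t.} \quad J^{C_i}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t C_i(s_t, a_t, s_{t+1})\right] \leq d_i, \quad i = 1, \dots, m\]

Here \(R\) is the reward function, each \(C_i\) is a cost signal encoding a safety requirement, and \(d_i\) is the corresponding constraint threshold.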

This tutorial is useful for reinforcement learning practitioners at many levels.

For Beginners

If you are a beginner in machine learning with only some simple knowledge of linear algebra and probability theory, you can start with the mathematical fundamentals section of this tutorial.

For Average Users

If you have a general understanding of RL algorithms but are not yet familiar with Safe RL, this tutorial introduces it so that you can get started quickly.

For Experts

If you are already an expert in the field of RL, you can still gain new insights from our systematic introduction to Safe RL algorithms. This tutorial will also allow you to quickly design your own algorithms using OmniSafe.

Why We Built This#

In recent years, RL (Reinforcement Learning) algorithms, especially Deep RL algorithms, have performed well in many tasks. Examples include:

  • Achieving high scores on Atari games with only visual input.

  • Completing complex control tasks in high dimensions.

  • Beating human grandmasters at Go tournaments.

However, while updating their policies, RL agents often learn cheating or even dangerous behaviors to improve their performance. An agent that achieves high scores in this way differs from our desired result. Safe RL algorithms are therefore dedicated to training agents that achieve the desired training goal without violating safety constraints.

However

Even experienced RL researchers have difficulty understanding Safe RL algorithms in a short time and quickly implementing them.

Therefore, OmniSafe facilitates the study of Safe RL by providing both a detailed, systematic introduction to the algorithms and streamlined, robust code.

Puzzling Math

Safe RL algorithms are a class of algorithms built on a rigorous mathematical foundation. These algorithms have detailed theoretical derivations, but they lack a unified notation system, which makes it difficult for beginners to learn them systematically and comprehensively.

Hard-to-find Codes

Most existing Safe RL algorithms do not have open-source code, making it difficult for beginners to grasp the ideas of the algorithms at the code level; as a result, researchers suffer from incorrect implementations, unfair comparisons, and misleading conclusions.

Friendly Math

The OmniSafe tutorial provides a unified and standardized notation system that allows beginners to learn the theory of Safe RL algorithms completely and systematically.

Robust Code

The OmniSafe tutorial gives a code-level introduction for each algorithm, allowing learners who are new to Safe RL theory to relate algorithmic ideas to code, and giving experts in the field of Safe RL new insights into algorithm implementation.

Code Design Principles#

Consistent and Inherited

Our code has a complete logical structure that allows you to understand the connections between algorithms, along with their similarities and differences. For example, if you understand the Policy Gradient algorithm, you can learn the PPO algorithm by simply reading one new function and immediately grasp its code implementation.
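As a rough illustration of this inheritance pattern (the class and method names below are hypothetical and simplified, not OmniSafe's actual API), PPO can be expressed as a Policy Gradient subclass that overrides only the surrogate-loss function:

# Hypothetical sketch of the inheritance pattern described above;
# the class and method names are illustrative, not OmniSafe's actual API.
import torch


class PolicyGradient:
    """Vanilla policy gradient: loss is the log-probability weighted by the advantage."""

    def compute_loss(self, log_prob, old_log_prob, adv):
        return -(log_prob * adv).mean()


class PPO(PolicyGradient):
    """PPO only replaces the surrogate loss with a clipped importance-ratio version."""

    clip = 0.2

    def compute_loss(self, log_prob, old_log_prob, adv):
        ratio = torch.exp(log_prob - old_log_prob)
        clipped_ratio = torch.clamp(ratio, 1.0 - self.clip, 1.0 + self.clip)
        return -torch.min(ratio * adv, clipped_ratio * adv).mean()

Reading only the overridden loss function is enough to see how PPO differs from Policy Gradient at the code level.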

Robust and Readable

Our code serves both as a tutorial and as a tool. If you are not yet familiar with algorithm implementations in Safe RL, the highly readable code in OmniSafe helps you get started quickly and see how each algorithm performs. If you want to build your own algorithms, OmniSafe's highly robust code can also be an excellent tool!

Long-lived

Unlike other code that relies on a large number of external libraries, OmniSafe minimizes its dependency on third-party libraries. This avoids shortening the life of the project due to iterative changes in third-party code, and it also improves the experience of installing and using OmniSafe, because users do not have to install many dependencies to run it.

Before Reading#

Before you start having fun reading the OmniSafe tutorial, we want you to understand how colors are used in this tutorial. In general, light blue boxes indicate mathematically relevant derivations, including but not limited to Theorems, Lemmas, Propositions, Corollaries, and their proofs, while green boxes indicate implementation details, both theoretical and code-based. We give an example below:

Example of OmniSafe color usage styles

Theorem I (Difference between two arbitrary policies)

For any function \(f : S \rightarrow \mathbb{R}\) and any policies \(\pi\) and \(\pi'\), define \(\delta_f(s,a,s') \doteq R(s,a,s') + \gamma f(s')-f(s)\),

\begin{eqnarray}
\epsilon_f^{\pi'} &\doteq& \max_s \left|\mathbb{E}_{a\sim\pi', s'\sim P}\left[\delta_f(s,a,s')\right]\right| \tag{3}\\
L_{\pi, f}\left(\pi'\right) &\doteq& \mathbb{E}_{\tau \sim \pi}\left[\left(\frac{\pi'(a|s)}{\pi(a|s)}-1\right)\delta_f\left(s, a, s'\right)\right] \tag{4}\\
D_{\pi, f}^{\pm}\left(\pi'\right) &\doteq& \frac{L_{\pi, f}\left(\pi'\right)}{1-\gamma} \pm \frac{2\gamma \epsilon_f^{\pi'}}{(1-\gamma)^2} \mathbb{E}_{s \sim d^\pi}\left[D_{TV}\left(\pi' \| \pi\right)[s]\right] \tag{5}
\end{eqnarray}

where \(D_{T V}\left(\pi'|| \pi\right)[s]=\frac{1}{2} \sum_a\left|\pi'(a|s)-\pi(a|s)\right|\) is the total variational divergence between action distributions at \(s\). The conclusion is as follows:

\[D_{\pi, f}^{+}\left(\pi'\right) \geq J\left(\pi'\right)-J(\pi) \geq D_{\pi, f}^{-}\left(\pi'\right)\tag{6}\]

Furthermore, the bounds are tight (when \(\pi=\pi^{\prime}\), all three expressions are identically zero).
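To see why the bounds are tight, note that when \(\pi' = \pi\) the importance ratio \(\frac{\pi'(a|s)}{\pi(a|s)}\) equals one and the total variational divergence vanishes:

\[L_{\pi, f}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\left(1 - 1\right)\delta_f(s, a, s')\right] = 0, \qquad D_{TV}(\pi \| \pi)[s] = 0,\]

hence \(D_{\pi, f}^{\pm}(\pi) = 0\), which agrees with \(J(\pi) - J(\pi) = 0\).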

The proof of Theorem 1 can be found in the Appendix; click on this card to jump to it.

Run CPO in OmniSafe

Here are 3 ways to run CPO in OmniSafe:

  • Run Agent from preset yaml file

  • Run Agent from custom config dict

  • Run Agent from custom terminal config

# Run an agent from the preset yaml configuration file.
import omnisafe

env_id = 'SafetyPointGoal1-v0'

agent = omnisafe.Agent('CPO', env_id)
agent.learn()
# Run an agent from a custom config dict.
import omnisafe

env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    'train_cfgs': {
        'total_steps': 1024000,
        'vector_env_nums': 1,
        'parallel': 1,
    },
    'algo_cfgs': {
        'steps_per_epoch': 2048,
        'update_iters': 1,
    },
    'logger_cfgs': {
        'use_wandb': False,
    },
}
agent = omnisafe.Agent('CPO', env_id, custom_cfgs=custom_cfgs)
agent.learn()

We use train_policy.py as the entry file. You can train an agent with CPO simply by running train_policy.py with arguments specifying the algorithm and the environment. For example, to run CPO in SafetyPointGoal1-v0, with 1 torch thread and seed 0, you can use the following command:

cd examples
python train_policy.py --algo CPO --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 1024000 --device cpu --vector-env-nums 1 --torch-threads 1

You may not yet understand the above theory or the specific meaning of the code, but do not worry: we will give a detailed introduction later in the Constrained Policy Optimization tutorial.

Long-Term Support and Support History#

OmniSafe is mainly developed by the SafeRL research team directed by Prof. Yaodong Yang. Our SafeRL research team members include Borong Zhang, Jiayi Zhou, Juntao Dai, Weidong Huang, Ruiyang Sun, Xuehai Pan, and Jiaming Ji. If you have any questions while using OmniSafe, or if you are willing to contribute to this project, don't hesitate to raise an issue on the GitHub issue page; we will reply to you within 2-3 working days.

