Introduction#

Welcome To OmniSafe Tutorial#


Welcome to OmniSafe! OmniSafe is an infrastructural framework designed to accelerate safe reinforcement learning (RL) research. It provides a comprehensive and reliable benchmark for safe RL algorithms, as well as an out-of-the-box modular toolkit for researchers. Safe RL aims to develop algorithms that minimize the risk of unintended harm or unsafe behavior.

Hint

Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes.
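A common way to formalize this (a sketch only; the precise notation is introduced later in this tutorial) is as a constrained optimization problem: maximize the expected return \(J(\pi)\) subject to a bound on the expected cost \(J_c(\pi)\),

\[\max_{\pi} J(\pi) \quad \text{s.t.} \quad J_c(\pi) \leq d,\]

where \(d\) is a predefined cost threshold.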

This tutorial is aimed at RL learners at roughly three levels.

For Beginners

If you have only basic knowledge of linear algebra and probability theory and are new to machine learning, we recommend starting with the mathematical fundamentals section of this tutorial.

For Average Users

If you have a general understanding of RL algorithms but are not yet familiar with Safe RL, this tutorial introduces some classic Safe RL algorithms so that you can get started quickly.

For Experts

If you are already an expert in the field of RL, our tutorial can still offer you new insights with its systematic introduction to Safe RL algorithms. Furthermore, it will enable you to quickly design your own algorithms.

Why We Built This#

In recent years, RL algorithms, especially deep RL algorithms, have demonstrated remarkable performance across a variety of tasks. Notable examples include:

Hint

  • Achieving high scores on Atari using only visual input.

  • Completing complex control tasks in high dimensions.

  • Beating human grandmasters at Go tournaments.

However, during policy updates, RL agents sometimes learn to cheat or even behave dangerously in order to improve their performance. While such agents may achieve high scores rapidly, their behavior may not align with the desired outcome.

Therefore, Safe RL algorithms aim to train agents that achieve their goals while adhering to safety constraints, addressing the challenge of maintaining agent safety throughout the policy-update process: agents must avoid behaviors that lead to negative consequences or violate predefined constraints.

However

Even experienced researchers in RL may face challenges when it comes to quickly grasping the intricacies of Safe RL algorithms and efficiently programming their implementations.

To address this issue, OmniSafe aims to provide a comprehensive and systematic introduction to Safe RL algorithms, along with streamlined and robust code, making it easier for researchers to delve into Safe RL.

Puzzling Math

Safe RL algorithms are rigorously grounded in a strong mathematical foundation. While they come with detailed theoretical derivations, the lack of a unified notation system can make systematic, comprehensive learning difficult for beginners.

Hard-to-find Codes

Most of the existing Safe RL algorithms do not have open-source code available, which makes it difficult for beginners to understand the algorithms at the code level. Furthermore, researchers may encounter issues such as incorrect implementations, unfair comparisons, and misleading conclusions, which could have been avoided with open-source code.

Friendly Math

The OmniSafe tutorial offers a standardized notation system that enables beginners to acquire a complete and systematic understanding of the theory behind Safe RL algorithms.

Robust Code

The OmniSafe tutorial provides a comprehensive introduction to each algorithm, including a detailed explanation of the code implementation. Beginners can easily understand the connection between the algorithmic concepts and the code, while experts can gain valuable insights into Safe RL by studying the code-level details of each algorithm.

Code Design Principles#

Consistent and Inherited

Our code follows a comprehensive and logical structure, enabling users to understand how each algorithm builds on the others. For instance, once you understand the Policy Gradient algorithm, you can grasp the code implementation of the PPO algorithm by reading only the newly added function, as in the sketch below.
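The following minimal sketch illustrates this inheritance pattern; the class and method names are simplified placeholders rather than OmniSafe's actual classes:

import torch


class PolicyGradient:
    """Base algorithm: vanilla policy-gradient surrogate loss."""

    def compute_loss_pi(self, log_prob, adv, old_log_prob):
        # Vanilla policy-gradient loss: maximize log pi(a|s) * advantage.
        return -(log_prob * adv).mean()


class PPO(PolicyGradient):
    """PPO reuses everything from the base class and overrides one method."""

    clip = 0.2

    def compute_loss_pi(self, log_prob, adv, old_log_prob):
        # Clipped surrogate objective: the only new piece relative to the base class.
        ratio = torch.exp(log_prob - old_log_prob)
        clipped = torch.clamp(ratio, 1 - self.clip, 1 + self.clip)
        return -torch.min(ratio * adv, clipped * adv).mean()

Understanding PolicyGradient.compute_loss_pi is then enough to read PPO.compute_loss_pi; the rest of the training logic is shared.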

Robust and Readable

Our code not only serves as a tutorial but also as a practical tool. For those who want to learn how Safe RL algorithms are implemented, the highly readable code in OmniSafe provides an easy and quick way to get started. For those who want to develop their own algorithms, OmniSafe’s highly modular and reusable code is an excellent resource.

Long-lived

Unlike codebases that rely heavily on external libraries, OmniSafe minimizes its dependence on third-party libraries. This design prevents the project from becoming obsolete due to changes in third-party code, and it improves the user experience by reducing the number of dependencies that must be installed to run OmniSafe.

Before Reading#

Before you start reading the OmniSafe tutorial, we would like to explain how colors are used in it.

In general, light blue boxes contain mathematical derivations, including but not limited to Theorems, Lemmas, Propositions, Corollaries, and their proofs, while green boxes contain implementation details, both theoretical and code-level. An example is given below:

Example of OmniSafe color usage styles

Theorem I (Difference between two arbitrary policies)

For any function \(f : S \rightarrow \mathbb{R}\) and any policies \(\pi\) and \(\pi'\), define \(\delta_f(s,a,s') \doteq R(s,a,s') + \gamma f(s')-f(s)\),

\begin{eqnarray}
\epsilon_f^{\pi'} &\doteq& \max_s \left|\mathbb{E}_{a\sim\pi',\, s'\sim P}\left[\delta_f(s,a,s')\right]\right| \tag{3}\\
L_{\pi, f}\left(\pi'\right) &\doteq& \mathbb{E}_{\tau \sim \pi}\left[\left(\frac{\pi'(a|s)}{\pi(a|s)}-1\right)\delta_f\left(s, a, s'\right)\right] \tag{4}\\
D_{\pi, f}^{\pm}\left(\pi'\right) &\doteq& \frac{L_{\pi, f}\left(\pi'\right)}{1-\gamma} \pm \frac{2 \gamma \epsilon_f^{\pi'}}{(1-\gamma)^2} \mathbb{E}_{s \sim d^\pi}\left[D_{TV}\left(\pi' \| \pi\right)[s]\right] \tag{5}
\end{eqnarray}

where \(D_{T V}\left(\pi'|| \pi\right)[s]=\frac{1}{2} \sum_a\left|\pi'(a|s)-\pi(a|s)\right|\) is the total variational divergence between action distributions at \(s\). The conclusion is as follows:

\[D_{\pi, f}^{+}\left(\pi'\right) \geq J\left(\pi'\right)-J(\pi) \geq D_{\pi, f}^{-}\left(\pi'\right)\tag{6}\]

Furthermore, the bounds are tight (when \(\pi=\pi^{\prime}\), all three expressions are identically zero).

The proof of Theorem 1 can be found in the Appendix; click on this card to jump to it.
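To make the total variational divergence term above concrete, here is a small stand-alone computation (illustrative only, not OmniSafe code) for two discrete action distributions at a single state:

# D_TV(pi' || pi)[s] = 0.5 * sum_a |pi'(a|s) - pi(a|s)|
pi_old = [0.5, 0.3, 0.2]  # pi(a|s) over three discrete actions
pi_new = [0.4, 0.4, 0.2]  # pi'(a|s) over the same actions

d_tv = 0.5 * sum(abs(p_new - p_old) for p_new, p_old in zip(pi_new, pi_old))
print(d_tv)  # 0.1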

Run CPO in OmniSafe

Here are three ways to run CPO in OmniSafe:

  • Run Agent from preset yaml file

  • Run Agent from custom config dict

  • Run Agent from custom terminal config

Run Agent from preset yaml file:

import omnisafe

env_id = 'SafetyPointGoal1-v0'

agent = omnisafe.Agent('CPO', env_id)
agent.learn()
Run Agent from custom config dict:

import omnisafe

env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    'train_cfgs': {
        'total_steps': 10000000,
        'vector_env_nums': 1,
        'parallel': 1,
    },
    'algo_cfgs': {
        'steps_per_epoch': 20000,
    },
    'logger_cfgs': {
        'use_wandb': False,
        'use_tensorboard': True,
    },
}

agent = omnisafe.Agent('CPO', env_id, custom_cfgs=custom_cfgs)
agent.learn()

The third option, running the agent from a custom terminal config, uses train_policy.py as the entry file. You can train an agent with CPO simply by running train_policy.py with arguments specifying the algorithm and the environment. For example, to run CPO in SafetyPointGoal1-v0 with 1 torch thread, you can use the following command:

cd examples
python train_policy.py --algo CPO --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 10000000 --device cpu --vector-env-nums 1 --torch-threads 1

You may not yet fully understand the theory above or the exact meaning of the code, but do not worry; we will introduce both in detail later in the Constrained Policy Optimization tutorial.

Citing OmniSafe#

If you find OmniSafe useful or use OmniSafe in your research, please cite it in your publications.

@article{omnisafe,
  title   = {OmniSafe: An Infrastructure for Accelerating Safe Reinforcement Learning Research},
  author  = {Jiaming Ji and Jiayi Zhou and Borong Zhang and Juntao Dai and Xuehai Pan and Ruiyang Sun and Weidong Huang and Yiran Geng and Mickel Liu and Yaodong Yang},
  journal = {arXiv preprint arXiv:2305.09304},
  year    = {2023}
}

Long-Term Support and Support History#

OmniSafe is mainly developed by the Safe RL research team directed by Prof. Yaodong Yang. Our Safe RL research team members include Borong Zhang, Jiayi Zhou, Juntao Dai, Weidong Huang, Ruiyang Sun, Xuehai Pan, and Jiaming Ji. If you have any questions while using OmniSafe, or if you would like to contribute to this project, don’t hesitate to raise an issue on the GitHub issue page; we will reply within 2-3 working days.

