OmniSafe Buffer#

`BaseBuffer`(obs_space, act_space, size[, device])	Abstract base class for buffer.
`OnPolicyBuffer`(obs_space, act_space, size, ...)	A buffer for storing trajectories experienced by an agent interacting with the environment.
`OffPolicyBuffer`(obs_space, act_space, size, ...)	A replay buffer for off-policy algorithms.
`VectorOffPolicyBuffer`(obs_space, act_space, ...)	A vectorized replay buffer for off-policy algorithms.
`VectorOnPolicyBuffer`(obs_space, act_space, ...)	Vectorized on-policy buffer.

Base Buffer#

Documentation

class omnisafe.common.buffer.BaseBuffer(obs_space, act_space, size, device=torch.device('cpu'))[source]#

Abstract base class for buffer.

Initialize the buffer.

Warning

The buffer only supports Box spaces.

In base buffer, we store the following data:

Name	Shape	Dtype	Description
obs	(size, obs_space.shape)	torch.float32	The observation.
act	(size, act_space.shape)	torch.float32	The action.
reward	(size, )	torch.float32	Single step reward.
cost	(size, )	torch.float32	Single step cost.
done	(size, )	torch.float32	Whether the episode is done.

Parameters:

obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
device (torch.device) – The device of the buffer.

__init__(obs_space, act_space, size, device=torch.device('cpu'))[source]#

Initialize the buffer.

add_field(name, shape, dtype)[source]#

Add a field to the buffer.

Example

>>> buffer = BaseBuffer(...)
>>> buffer.add_field('new_field', (2, 3), torch.float32)
>>> buffer.data['new_field'].shape == (buffer.size, 2, 3)
True

Parameters:

name (str) – The name of the field.
shape (tuple) – The shape of the field.
dtype (torch.dtype) – The dtype of the field.
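
The shape rule in the example above (a new field gets a leading dimension of `size`) can be pictured with a small dependency-free sketch. Nested Python lists stand in for tensors, and `ToyBuffer`, `zeros`, and `shape_of` are hypothetical names used only for illustration; this is not the OmniSafe implementation.

```python
from typing import Dict, Tuple


def zeros(shape: Tuple[int, ...]) -> list:
    """Build a nested list of zeros with the given shape (stand-in for a tensor)."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [zeros(shape[1:]) for _ in range(shape[0])]


def shape_of(nested: list) -> Tuple[int, ...]:
    """Recover the shape of a rectangular, non-empty nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


class ToyBuffer:
    """Hypothetical stand-in for a buffer; not the OmniSafe class."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.data: Dict[str, list] = {}

    def add_field(self, name: str, shape: Tuple[int, ...]) -> None:
        # Every field gets a leading dimension of `size`: one slot per stored step.
        self.data[name] = zeros((self.size, *shape))


buffer = ToyBuffer(size=4)
buffer.add_field('new_field', (2, 3))
field_shape = shape_of(buffer.data['new_field'])
```

Scalar fields such as `reward` correspond to an empty `shape`, which yields a field of shape `(size,)`.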

property device: device#: Return the device of the buffer.

property size: int#: Return the size of the buffer.

abstract store(**data)[source]#

Store a transition in the buffer.

Warning

This is an abstract method.

Example

>>> buffer = BaseBuffer(...)
>>> buffer.store(obs=obs, act=act, reward=reward, cost=cost, done=done)

Parameters:: data (torch.Tensor) – The data to store.

On Policy Buffer#

Documentation

class omnisafe.common.buffer.OnPolicyBuffer(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient=0, standardized_adv_r=False, standardized_adv_c=False, device=torch.device('cpu'))[source]#

A buffer for storing trajectories experienced by an agent interacting with the environment.

The buffer also calculates the advantages of state-action pairs, using one of the following estimators: GAE, GAE-RTG, V-trace, or Plain.

Initialize the on-policy buffer.

Warning

The buffer only supports Box spaces.

Compared to the base buffer, the on-policy buffer stores extra data:

Name	Shape	Dtype	Description
discounted_ret	(size, )	torch.float32	The discounted return.
target_value_r	(size, )	torch.float32	The target value of the reward critic.
adv_r	(size, )	torch.float32	The advantage of the reward.
value_r	(size, )	torch.float32	The value estimated by reward critic.
target_value_c	(size, )	torch.float32	The target value of the cost critic.
adv_c	(size, )	torch.float32	The advantage of the cost.
value_c	(size, )	torch.float32	The value estimated by cost critic.
logp	(size, )	torch.float32	The log probability of the action.

Parameters:

obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
gamma (float) – The discount factor.
lam (float) – The lambda factor for calculating the advantages.
lam_c (float) – The lambda factor for calculating the advantages of the cost.
advantage_estimator (AdvatageEstimator) – The advantage estimator.
penalty_coefficient (float, optional) – The penalty coefficient. Defaults to 0.
standardized_adv_r (bool, optional) – Whether to standardize the advantages of the reward. Defaults to False.
standardized_adv_c (bool, optional) – Whether to standardize the advantages of the cost. Defaults to False.
device (torch.device, optional) – The device to store the data. Defaults to torch.device('cpu').

__init__(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient=0, standardized_adv_r=False, standardized_adv_c=False, device=torch.device('cpu'))[source]#

Initialize the on-policy buffer.

_calculate_adv_and_value_targets(values, rewards, lam)[source]#

Compute the estimated advantage.

Three methods are supported:

GAE (Generalized Advantage Estimation)

GAE is a variance reduction method for actor-critic algorithms. It was proposed in the paper High-Dimensional Continuous Control Using Generalized Advantage Estimation.

GAE calculates the advantage using the following formula:

\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k} \tag{4}\]

where \(\delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})\). When \(\lambda = 1\), GAE reduces to the Monte Carlo estimate, which is unbiased but has high variance. When \(\lambda = 0\), GAE reduces to the one-step TD residual (the TD(0) method), which is biased but has low variance.
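
The GAE sum above is usually evaluated with a backward recursion, \(A_t = \delta_t + \gamma \lambda A_{t+1}\). The following is a minimal sketch under that assumption, using plain Python lists instead of tensors; `gae_advantages` is a hypothetical helper, not an OmniSafe function.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float, lam: float) -> List[float]:
    """Compute GAE advantages for one path.

    `values` has one more entry than `rewards`: the last entry is the
    bootstrap value of the state reached after the final step.
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
# lam = 0 keeps only the one-step TD residuals.
adv_td = gae_advantages(rewards, values, gamma=0.9, lam=0.0)
# lam = 1 gives the discounted return minus V(s_t).
adv_mc = gae_advantages(rewards, values, gamma=0.9, lam=1.0)
```

With these inputs, `adv_td` equals the per-step residuals, while `adv_mc[0]` equals the discounted return `1 + 0.9 + 0.81` minus `values[0]`.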

V-trace

V-trace is a variance reduction method for actor-critic algorithms. It was proposed in the paper IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures.

V-trace calculates the advantage using the following formula:

\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k} + (\lambda \gamma)^n \rho_{t+n} (1 - d_{t+n}) (V(x_{t+n}) - b_{t+n}) \tag{5}\]

where \(\delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})\), \(\rho_{t+k} = \frac{\pi(a_{t+k}|s_{t+k})}{b_{t+k}}\), \(b_{t+k}\) is the behavior policy, and \(d_{t+k}\) is the done flag.

Plain

Plain method is the original actor-critic algorithm. It is unbiased but has high variance.

Parameters:

values (torch.Tensor) – The values of the states.
rewards (torch.Tensor) – The rewards of the states.
lam (float) – The lambda factor for GAE.

Return type:

Tuple[Tensor, Tensor]

static _calculate_v_trace(policy_action_probs, values, rewards, behavior_action_probs, gamma=0.99, rho_bar=1.0, c_bar=1.0)[source]#

This function is used to calculate V-trace targets.

\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k} + (\lambda \gamma)^n \rho_{t+n} (1 - d_{t+n}) (V(x_{t+n}) - b_{t+n}) \tag{6}\]

Calculate V-trace targets for off-policy actor-critic learning recursively. For more details, please refer to the paper: Espeholt et al. 2018, IMPALA.

Parameters:

policy_action_probs (torch.Tensor) – action probabilities of policy network, shape=(sequence_length,)
values (torch.Tensor) – state values, shape=(sequence_length+1,)
rewards (torch.Tensor) – rewards, shape=(sequence_length+1,)
behavior_action_probs (torch.Tensor) – action probabilities of behavior network, shape=(sequence_length,)
gamma (float) – discount factor
rho_bar (float) – clip rho
c_bar (float) – clip c

Returns:

tuple – V-trace targets, shape=(batch_size, sequence_length)

Return type:

Tuple[Tensor, Tensor, Tensor]
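
The recursive computation of V-trace targets can be sketched as follows. This is a minimal illustration of the recursion from Espeholt et al. 2018, \(v_s = V(x_s) + \delta_s V + \gamma c_s (v_{s+1} - V(x_{s+1}))\) with truncated weights \(\rho_s\) and \(c_s\), written with plain Python lists. The function name, the list lengths (probabilities and rewards of length T, values of length T + 1), and the single return value are assumptions of this sketch; the documented `_calculate_v_trace` returns three tensors and may differ in detail.

```python
from typing import List


def v_trace_targets(policy_probs: List[float], behavior_probs: List[float],
                    values: List[float], rewards: List[float],
                    gamma: float = 0.99, rho_bar: float = 1.0,
                    c_bar: float = 1.0) -> List[float]:
    """Return V-trace value targets v_s for one path.

    Assumed lengths: probabilities and rewards have length T,
    `values` has length T + 1 (last entry is the bootstrap value).
    """
    horizon = len(rewards)
    assert len(values) == horizon + 1
    targets = [0.0] * horizon
    next_correction = 0.0  # v_{s+1} - V(x_{s+1}); zero past the end of the path
    for s in reversed(range(horizon)):
        ratio = policy_probs[s] / behavior_probs[s]
        rho = min(rho_bar, ratio)  # truncated weight on the TD error
        c = min(c_bar, ratio)      # truncated weight on the trace
        delta = rho * (rewards[s] + gamma * values[s + 1] - values[s])
        correction = delta + gamma * c * next_correction
        targets[s] = values[s] + correction
        next_correction = correction
    return targets


# On-policy case: both policies agree, so all importance ratios are 1.
on_policy = v_trace_targets([0.5, 0.5], [0.5, 0.5],
                            values=[0.0, 0.0, 0.0], rewards=[1.0, 1.0], gamma=0.5)
# The target policy is less likely than the behavior policy: ratios of 0.5.
down_weighted = v_trace_targets([0.25, 0.25], [0.5, 0.5],
                                values=[0.0, 0.0, 0.0], rewards=[1.0, 1.0], gamma=0.5)
```

When the two policies agree and `rho_bar = c_bar = 1`, the targets reduce to the ordinary n-step bootstrapped return.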

finish_path(last_value_r=torch.zeros(1), last_value_c=torch.zeros(1))[source]#

Finish the current path and calculate the advantages of state-action pairs.

On-policy algorithms need to calculate the advantages of state-action pairs after the path is finished. This function calculates the advantages of state-action pairs and stores them in the buffer, following the steps:

Hint

Calculate the discounted return.
Calculate the advantages of the reward.
Calculate the advantages of the cost.

Parameters:

last_value_r (torch.Tensor, optional) – The reward value of the last state of the current path. Defaults to torch.zeros(1).
last_value_c (torch.Tensor, optional) – The cost value of the last state of the current path. Defaults to torch.zeros(1).

Return type:

None
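
The first step listed for `finish_path`, the discounted return, can be sketched as a backward cumulative sum. The sketch assumes that the bootstrap value of the last state is appended to the rewards before discounting and dropped afterwards; `discounted_cumsum` is a hypothetical helper written with plain Python lists, not the OmniSafe code.

```python
from typing import List


def discounted_cumsum(xs: List[float], discount: float) -> List[float]:
    """Return y with y[t] = sum over k of discount**k * xs[t + k]."""
    out = [0.0] * len(xs)
    running = 0.0
    for t in reversed(range(len(xs))):
        running = xs[t] + discount * running
        out[t] = running
    return out


# Rewards of one finished path, followed by the bootstrap value of the
# last state (zero when the episode terminated).
rewards = [1.0, 1.0, 1.0]
last_value_r = 0.0
returns = discounted_cumsum(rewards + [last_value_r], discount=0.5)[:-1]
```

The same helper applied to the costs, with the cost bootstrap value, would give the discounted cost return.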

get()[source]#

Get the data in the buffer.

Return type:: Dict[str, Tensor]

Hint

The buffer can standardize the advantages of state-action pairs: it calculates their mean and standard deviation and then standardizes the advantages with them. Turn this on by setting standardized_adv_r to True. The same trick is applied to the advantages of the cost.
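
The standardization described in the hint can be sketched as below. This is an illustrative version with plain Python lists; the `eps` term and the use of the population standard deviation are assumptions of the sketch, and `standardize` is a hypothetical helper.

```python
import statistics
from typing import List


def standardize(advantages: List[float], eps: float = 1e-8) -> List[float]:
    """Shift to zero mean and scale to (roughly) unit standard deviation."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    # eps guards against division by zero when all advantages are equal.
    return [(a - mean) / (std + eps) for a in advantages]


adv_r = standardize([1.0, 2.0, 3.0, 4.0])
```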

property standardized_adv_c: bool#: Get the standardized_adv_c.

property standardized_adv_r: bool#: Get the standardized_adv_r.

store(**data)[source]#

Store data into the buffer.

Warning

The total size of the data must be less than the buffer size.

Parameters:: data (torch.Tensor) – The data to store.
Return type:: None

Off Policy Buffer#

Documentation

class omnisafe.common.buffer.OffPolicyBuffer(obs_space, act_space, size, batch_size, device=torch.device('cpu'))[source]#

A replay buffer for off-policy algorithms.

Initialize the off policy buffer.

Warning

The buffer only supports Box spaces.

Compared to the base buffer, the off-policy buffer stores extra data:

Name	Shape	Dtype	Description
next_obs	(size, obs_space.shape)	torch.float32	The next observation.

Parameters:

obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
batch_size (int) – The batch size of the buffer.
device (torch.device, optional) – The device of the buffer. Defaults to torch.device('cpu').

__init__(obs_space, act_space, size, batch_size, device=torch.device('cpu'))[source]#

Initialize the off policy buffer.

property batch_size: int#: Return the batch size of the buffer.

property max_size: int#: Return the maximum size of the buffer.

sample_batch()[source]#

Sample a batch of data from the buffer.

Return type:: Dict[str, Tensor]

store(**data)[source]#

Store data into the buffer.

Hint

The ReplayBuffer is a circular buffer. When the buffer is full, the oldest data will be overwritten.

Parameters:: data (torch.Tensor) – The data to be stored.
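
The circular behavior described in the hint can be sketched with a write pointer that wraps around modulo the capacity. `RingBuffer` is a hypothetical stand-in that stores one dict per transition; the uniform sampling with replacement in `sample_batch` is an assumption of this sketch, not a statement about how OmniSafe samples.

```python
import random
from typing import Dict, List


class RingBuffer:
    """Hypothetical fixed-capacity replay store; not the OmniSafe class."""

    def __init__(self, max_size: int, batch_size: int) -> None:
        self.max_size = max_size
        self.batch_size = batch_size
        self.storage: List[Dict[str, float]] = [{} for _ in range(max_size)]
        self.ptr = 0    # next slot to write
        self.count = 0  # number of valid transitions

    def store(self, **data: float) -> None:
        self.storage[self.ptr] = data
        # Wrap around: once full, the write pointer returns to the oldest slot.
        self.ptr = (self.ptr + 1) % self.max_size
        self.count = min(self.count + 1, self.max_size)

    def sample_batch(self) -> List[Dict[str, float]]:
        # Sample uniformly with replacement among the valid slots only.
        idxs = [random.randrange(self.count) for _ in range(self.batch_size)]
        return [self.storage[i] for i in idxs]


buf = RingBuffer(max_size=3, batch_size=2)
for step in range(5):
    buf.store(obs=float(step), reward=1.0)
# Steps 0 and 1 were overwritten by steps 3 and 4.
kept = sorted(item['obs'] for item in buf.storage)
```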

Vector On Policy Buffer#

Documentation

class omnisafe.common.buffer.VectorOnPolicyBuffer(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient, standardized_adv_r, standardized_adv_c, num_envs=1, device=torch.device('cpu'))[source]#

Vectorized on-policy buffer.

Initialize the vector-on-policy buffer.

The vector-on-policy buffer is used to store the data from vector environments. The data is stored in a list of on-policy buffers, each of which corresponds to one environment.

Warning

The buffer only supports Box spaces.

Parameters:

obs_space (OmnisafeSpace) – Observation space.
act_space (OmnisafeSpace) – Action space.
size (int) – Size of the buffer.
gamma (float) – Discount factor.
lam (float) – Lambda for GAE.
lam_c (float) – Lambda for GAE for cost.
advantage_estimator (AdvatageEstimator) – Advantage estimator.
penalty_coefficient (float) – Penalty coefficient.
standardized_adv_r (bool) – Whether to standardize the advantage for reward.
standardized_adv_c (bool) – Whether to standardize the advantage for cost.
num_envs (int, optional) – Number of environments. Defaults to 1.
device (torch.device, optional) – Device to store the data. Defaults to torch.device('cpu').

__init__(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient, standardized_adv_r, standardized_adv_c, num_envs=1, device=torch.device('cpu'))[source]#

Initialize the vector-on-policy buffer.

finish_path(last_value_r=torch.zeros(1), last_value_c=torch.zeros(1), idx=0)[source]#

Finish the current path of the on-policy buffer at index idx and calculate the advantages of its state-action pairs.

Return type:: None

get()[source]#

Get the data from the buffer.

In the vector-on-policy buffer, we get the data from each buffer and then concatenate them.

Hint

The buffer can standardize the advantages of state-action pairs: it calculates their mean and standard deviation and then standardizes the advantages with them. Turn this on by setting standardized_adv_r to True. The same trick is applied to the advantages of the cost.

Return type:: Dict[str, Tensor]

property num_buffers: int#: Get the number of buffers.

store(**data)[source]#

Store data into the buffer.

Return type:: None

Hint

The data should be a list of tensors, each of which corresponds to one environment. Then the data will be stored into the corresponding buffer.
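
The per-environment dispatch described in the hint, together with the concatenation performed when reading the data back, can be sketched as follows. `ToyPathBuffer` and `ToyVectorBuffer` are hypothetical classes that use plain Python lists; they only illustrate the one-buffer-per-environment layout and are not the OmniSafe implementation.

```python
from typing import Dict, List


class ToyPathBuffer:
    """Hypothetical per-environment buffer that just appends values."""

    def __init__(self) -> None:
        self.data: Dict[str, List[float]] = {}

    def store(self, **data: float) -> None:
        for key, value in data.items():
            self.data.setdefault(key, []).append(value)


class ToyVectorBuffer:
    """Keeps one ToyPathBuffer per environment."""

    def __init__(self, num_envs: int) -> None:
        self.buffers = [ToyPathBuffer() for _ in range(num_envs)]

    def store(self, **data: List[float]) -> None:
        # Entry i of every field goes to the buffer of environment i.
        for i, buffer in enumerate(self.buffers):
            buffer.store(**{key: values[i] for key, values in data.items()})

    def get(self) -> Dict[str, List[float]]:
        # Concatenate the per-environment data field by field.
        merged: Dict[str, List[float]] = {}
        for buffer in self.buffers:
            for key, values in buffer.data.items():
                merged.setdefault(key, []).extend(values)
        return merged


vec = ToyVectorBuffer(num_envs=2)
vec.store(reward=[1.0, 10.0], cost=[0.0, 0.5])
vec.store(reward=[2.0, 20.0], cost=[0.1, 0.6])
merged = vec.get()
```

In the merged result, all steps of environment 0 come first, followed by all steps of environment 1.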

Vector Off Policy Buffer#

Documentation

class omnisafe.common.buffer.VectorOffPolicyBuffer(obs_space, act_space, size, batch_size, num_envs, device=torch.device('cpu'))[source]#

A vectorized replay buffer for off-policy algorithms.

Initialize the vector off-policy buffer.

The vector-off-policy buffer is a vectorized version of the off-policy buffer. It stores the data in a single tensor, and the data of each environment is stored in a separate column.

Warning

The buffer only supports Box spaces.

Parameters:

obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
batch_size (int) – The batch size of the buffer.
num_envs (int) – The number of environments.
device (torch.device, optional) – The device of the buffer. Defaults to torch.device('cpu').

__init__(obs_space, act_space, size, batch_size, num_envs, device=torch.device('cpu'))[source]#

Initialize the vector off-policy buffer.

add_field(name, shape, dtype)[source]#

Add a field to the buffer.

Example

>>> buffer = BaseBuffer(...)
>>> buffer.add_field('new_field', (2, 3), torch.float32)
>>> buffer.data['new_field'].shape == (buffer.size, 2, 3)
True

Parameters:

name (str) – The name of the field.
shape (tuple) – The shape of the field.
dtype (torch.dtype) – The dtype of the field.

property num_envs: int#: Return the number of environments.

sample_batch()[source]#

Sample a batch from the buffer.

Return type:: Dict[str, Tensor]
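
The column-per-environment layout described for the vector off-policy buffer can be pictured as a table with one row per time step and one column per environment. The sketch below stores only rewards and uses nested Python lists; `ToyVectorReplay` and its sampling scheme (uniform row and column indices) are hypothetical and not taken from OmniSafe.

```python
import random
from typing import List


class ToyVectorReplay:
    """Hypothetical (size, num_envs) reward store; not the OmniSafe class."""

    def __init__(self, size: int, num_envs: int, batch_size: int) -> None:
        self.size = size
        self.num_envs = num_envs
        self.batch_size = batch_size
        # One row per time step, one column per environment.
        self.reward = [[0.0] * num_envs for _ in range(size)]
        self.ptr = 0
        self.count = 0

    def store(self, reward: List[float]) -> None:
        assert len(reward) == self.num_envs
        self.reward[self.ptr] = list(reward)
        self.ptr = (self.ptr + 1) % self.size
        self.count = min(self.count + 1, self.size)

    def sample_batch(self) -> List[float]:
        # Draw (row, column) pairs uniformly over the filled part of the table.
        picks = [
            (random.randrange(self.count), random.randrange(self.num_envs))
            for _ in range(self.batch_size)
        ]
        return [self.reward[row][col] for row, col in picks]


replay = ToyVectorReplay(size=4, num_envs=2, batch_size=3)
replay.store([1.0, 10.0])
replay.store([2.0, 20.0])
env1_column = [row[1] for row in replay.reward[:replay.count]]
batch = replay.sample_batch()
```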