OmniSafe Buffer#
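| Class | Description |
|---|---|
| BaseBuffer | Abstract base class for buffer. |
| OnPolicyBuffer | A buffer for storing trajectories experienced by an agent interacting with the environment. |
| OffPolicyBuffer | A replay buffer for off-policy algorithms. |
| VectorOffPolicyBuffer | A vectorized replay buffer for off-policy algorithms. |
| VectorOnPolicyBuffer | Vectorized on-policy buffer. |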
Base Buffer#
Documentation
- class omnisafe.common.buffer.BaseBuffer(obs_space, act_space, size, device=torch.device('cpu'))[source]#
Abstract base class for buffer.
Initialize the buffer.
Warning
The buffer only supports Box spaces.
In the base buffer, we store the following data:
| Name | Shape | Dtype | Description |
|---|---|---|---|
| obs | (size, *obs_space.shape) | torch.float32 | The observation. |
| act | (size, *act_space.shape) | torch.float32 | The action. |
| reward | (size, ) | torch.float32 | Single step reward. |
| cost | (size, ) | torch.float32 | Single step cost. |
| done | (size, ) | torch.float32 | Whether the episode is done. |
- Parameters:
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
device (torch.device) – The device of the buffer.
- __init__(obs_space, act_space, size, device=torch.device('cpu'))[source]#
Initialize the buffer.
- add_field(name, shape, dtype)[source]#
Add a field to the buffer.
Example
>>> buffer = BaseBuffer(...)
>>> buffer.add_field('new_field', (2, 3), torch.float32)
>>> buffer.data['new_field'].shape
(buffer.size, 2, 3)
- Parameters:
name (str) – The name of the field.
shape (tuple) – The shape of the field.
dtype (torch.dtype) – The dtype of the field.
- property device: device#
Return the device of the buffer.
- property size: int#
Return the size of the buffer.
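The field layout and add_field behaviour described above can be sketched without PyTorch. This is only an illustration under stated assumptions: ToyBuffer and its shapes attribute are hypothetical names, flat Python lists stand in for torch.float32 tensors, and the dtype argument is omitted.

```python
from math import prod


class ToyBuffer:
    """Hypothetical stand-in for a buffer; flat Python lists replace tensors."""

    def __init__(self, obs_shape, act_shape, size):
        self._size = size
        self.shapes = {}  # field name -> full shape, including the leading size axis
        self.data = {}    # field name -> zero-initialised flat storage
        self.add_field('obs', obs_shape)
        self.add_field('act', act_shape)
        for name in ('reward', 'cost', 'done'):
            self.add_field(name, ())  # one scalar per step -> shape (size,)

    def add_field(self, name, shape):
        """Register a field whose per-step shape is `shape`."""
        full_shape = (self._size, *shape)
        self.shapes[name] = full_shape
        self.data[name] = [0.0] * prod(full_shape)

    @property
    def size(self):
        return self._size


buffer = ToyBuffer(obs_shape=(4,), act_shape=(2,), size=8)
buffer.add_field('new_field', (2, 3))
print(buffer.shapes['obs'])        # (8, 4)
print(buffer.shapes['new_field'])  # (8, 2, 3)
```

Every field gets the buffer size as its leading axis, which is why a field added with shape (2, 3) ends up with shape (size, 2, 3).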
On Policy Buffer#
Documentation
- class omnisafe.common.buffer.OnPolicyBuffer(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient=0, standardized_adv_r=False, standardized_adv_c=False, device=torch.device('cpu'))[source]#
A buffer for storing trajectories experienced by an agent interacting with the environment.
The buffer also calculates the advantages of state-action pairs, with estimators ranging from GAE, GAE-RTG, and V-trace to the Plain method.
Initialize the on-policy buffer.
Warning
The buffer only supports Box spaces.
Compared to the base buffer, the on-policy buffer stores extra data:
| Name | Shape | Dtype | Description |
|---|---|---|---|
| discounted_ret | (size, ) | torch.float32 | The discounted return. |
| target_value_r | (size, ) | torch.float32 | The target value of the reward critic. |
| adv_r | (size, ) | torch.float32 | The advantage of the reward. |
| value_r | (size, ) | torch.float32 | The value estimated by the reward critic. |
| target_value_c | (size, ) | torch.float32 | The target value of the cost critic. |
| adv_c | (size, ) | torch.float32 | The advantage of the cost. |
| value_c | (size, ) | torch.float32 | The value estimated by the cost critic. |
| logp | (size, ) | torch.float32 | The log probability of the action. |
- Parameters:
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
gamma (float) – The discount factor.
lam (float) – The lambda factor for calculating the advantages.
lam_c (float) – The lambda factor for calculating the advantages of the cost.
advantage_estimator (AdvatageEstimator) – The advantage estimator.
penalty_coefficient (float, optional) – The penalty coefficient. Defaults to 0.
standardized_adv_r (bool, optional) – Whether to standardize the advantages of the reward. Defaults to False.
standardized_adv_c (bool, optional) – Whether to standardize the advantages of the cost. Defaults to False.
device (torch.device, optional) – The device to store the data. Defaults to torch.device('cpu').
- __init__(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient=0, standardized_adv_r=False, standardized_adv_c=False, device=torch.device('cpu'))[source]#
Initialize the on-policy buffer.
- _calculate_adv_and_value_targets(values, rewards, lam)[source]#
Compute the estimated advantage.
Three methods are supported:
- GAE (Generalized Advantage Estimation)
GAE is a variance reduction method for the actor-critic algorithm. It was proposed in the paper High-Dimensional Continuous Control Using Generalized Advantage Estimation.
GAE calculates the advantage using the following formula:
\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k} \tag{4}\]
where \(\delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})\). When \(\lambda = 1\), GAE reduces to the Monte Carlo method, which is unbiased but has high variance. When \(\lambda = 0\), GAE reduces to the one-step TD(0) method, which is biased but has low variance.
- V-trace
V-trace is a variance reduction method for the actor-critic algorithm. It was proposed in the paper IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures.
V-trace calculates the advantage using the following formula:
\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k} + (\lambda \gamma)^n \rho_{t+n} (1 - d_{t+n}) (V(x_{t+n}) - b_{t+n}) \tag{5}\]
where \(\delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})\), \(\rho_{t+k} = \frac{\pi(a_{t+k}|s_{t+k})}{b_{t+k}}\), \(b_{t+k}\) is the behavior policy, and \(d_{t+k}\) is the done flag.
- Plain
The Plain method is the original actor-critic algorithm. It is unbiased but has high variance.
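The GAE sum can be evaluated with a single backward pass, since \(A_t = \delta_t + \gamma \lambda A_{t+1}\). The following is a minimal plain-Python sketch of that recursion, not the library's implementation: the function name gae_advantages, the default values of gamma and lam, and the convention that values carries one extra bootstrap entry are all assumptions made for illustration.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Return GAE advantages for one path.

    `values` has one more entry than `rewards`: the last entry is the
    bootstrap value of the state that follows the final step.
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual delta_t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]
print(gae_advantages(rewards, values, gamma=0.5, lam=1.0))  # [1.75, 1.5, 1.0]
print(gae_advantages(rewards, values, gamma=0.5, lam=0.0))  # [1.0, 1.0, 1.0]
```

With lam=1.0 the result is the discounted Monte Carlo return minus the value baseline; with lam=0.0 each advantage is just the one-step TD residual.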
- Parameters:
values (torch.Tensor) – The values of the states.
rewards (torch.Tensor) – The rewards of the states.
lam (float) – The lambda factor for GAE.
- Return type:
Tuple[Tensor, Tensor]
- static _calculate_v_trace(policy_action_probs, values, rewards, behavior_action_probs, gamma=0.99, rho_bar=1.0, c_bar=1.0)[source]#
This function is used to calculate V-trace targets.
\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k} + (\lambda \gamma)^n \rho_{t+n} (1 - d_{t+n}) (V(x_{t+n}) - b_{t+n}) \tag{6}\]
Calculate V-trace targets for off-policy actor-critic learning recursively. For more details, please refer to the paper: Espeholt et al. 2018, IMPALA.
- Parameters:
policy_action_probs (torch.Tensor) – action probabilities of policy network, shape=(sequence_length,)
values (torch.Tensor) – state values, shape=(sequence_length+1,)
rewards (torch.Tensor) – rewards, shape=(sequence_length+1,)
behavior_action_probs (torch.Tensor) – action probabilities of behavior network, shape=(sequence_length,)
gamma (float) – discount factor
rho_bar (float) – clip rho
c_bar (float) – clip c
- Returns:
tuple – V-trace targets, shape=(batch_size, sequence_length)
- Return type:
Tuple[Tensor, Tensor, Tensor]
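For orientation, here is a plain-Python sketch of the recursive V-trace computation in the form given by Espeholt et al. (2018), with importance ratios clipped by rho_bar and c_bar. It is not a transcription of this method: the function name v_trace, the two-element return value, and the assumption that rewards has T entries while values has T + 1 entries are choices made for this illustration.

```python
def v_trace(policy_probs, behavior_probs, rewards, values,
            gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Return (targets, advantages) for one path of length T.

    `values` has T + 1 entries; the last one bootstraps the tail.
    """
    T = len(rewards)
    assert len(values) == T + 1
    ratios = [p / b for p, b in zip(policy_probs, behavior_probs)]
    rhos = [min(rho_bar, r) for r in ratios]  # clipped rho_s
    cs = [min(c_bar, r) for r in ratios]      # clipped c_s

    # Backward recursion:
    # v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    targets = [0.0] * (T + 1)
    targets[T] = values[T]
    for s in reversed(range(T)):
        delta = rhos[s] * (rewards[s] + gamma * values[s + 1] - values[s])
        targets[s] = values[s] + delta + gamma * cs[s] * (targets[s + 1] - values[s + 1])

    # Policy-gradient advantages: rho_s * (r_s + gamma * v_{s+1} - V(x_s))
    advantages = [
        rhos[s] * (rewards[s] + gamma * targets[s + 1] - values[s])
        for s in range(T)
    ]
    return targets[:T], advantages


probs = [0.5, 0.5, 0.5]
targets, advantages = v_trace(probs, probs, [1.0, 1.0, 1.0], [0.0] * 4, gamma=0.5)
print(targets)  # [1.75, 1.5, 1.0]
```

When the policy and behavior probabilities coincide, every ratio equals 1 and the targets reduce to the bootstrapped discounted return, which is what the usage example shows.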
- finish_path(last_value_r=torch.zeros(1), last_value_c=torch.zeros(1))[source]#
Finish the current path and calculate the advantages of state-action pairs.
On-policy algorithms need to calculate the advantages of state-action pairs after the path is finished. This function calculates the advantages of state-action pairs and stores them in the buffer, following these steps:
Hint
1. Calculate the discounted return.
2. Calculate the advantages of the reward.
3. Calculate the advantages of the cost.
- Parameters:
last_value_r (torch.Tensor, optional) – The reward value of the last state of the current path. Defaults to torch.zeros(1).
last_value_c (torch.Tensor, optional) – The cost value of the last state of the current path. Defaults to torch.zeros(1).
- Return type:
None
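The first step, the discounted return, can be sketched as a backward cumulative sum. This is an illustrative plain-Python version under assumptions: the helper names discount_cumsum and discounted_returns are hypothetical, and the way the last value is appended as a bootstrap term and then dropped is one common convention, not a confirmed detail of this method.

```python
def discount_cumsum(xs, discount):
    """Return [sum_k discount**k * xs[t + k]] for every index t."""
    out = [0.0] * len(xs)
    running = 0.0
    for t in reversed(range(len(xs))):
        running = xs[t] + discount * running
        out[t] = running
    return out


def discounted_returns(rewards, last_value=0.0, gamma=0.99):
    """Discounted return of one path, bootstrapped with `last_value`."""
    # Append the bootstrap value, then drop its own slot from the result.
    return discount_cumsum(list(rewards) + [last_value], gamma)[:-1]


print(discounted_returns([1.0, 1.0, 1.0], last_value=0.0, gamma=0.5))  # [1.75, 1.5, 1.0]
print(discounted_returns([1.0, 1.0, 1.0], last_value=4.0, gamma=0.5))  # [2.25, 2.5, 3.0]
```

A zero last value corresponds to a path that ended in a terminal state, while a non-zero last value corresponds to a path that was cut off and is bootstrapped from a value estimate.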
- get()[source]#
Get the data in the buffer.
- Return type:
Dict[str, Tensor]
Hint
We provide a trick to standardize the advantages of state-action pairs: we calculate their mean and standard deviation and then standardize them. You can turn on this trick by setting standardized_adv_r to True. The same trick is applied to the advantages of the cost, controlled by standardized_adv_c.
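The standardization trick can be sketched in a few lines of plain Python. The function name standardize, the use of the population standard deviation, and the small eps added to the denominator are assumptions of this sketch; the exact variant used by the buffer may differ.

```python
from statistics import fmean, pstdev


def standardize(advantages, eps=1e-8):
    """Shift advantages to zero mean and scale them to roughly unit spread."""
    mean = fmean(advantages)
    std = pstdev(advantages)
    # eps guards against division by zero when all advantages are equal.
    return [(a - mean) / (std + eps) for a in advantages]


scaled = standardize([1.0, 2.0, 3.0])
print(scaled)
```

Standardizing keeps the sign of each advantage relative to the batch mean while making the gradient scale less sensitive to the magnitude of the rewards or costs.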
- property standardized_adv_c: bool#
Get the standardized_adv_c.
- property standardized_adv_r: bool#
Get the standardized_adv_r.
Off Policy Buffer#
Documentation
- class omnisafe.common.buffer.OffPolicyBuffer(obs_space, act_space, size, batch_size, device=torch.device('cpu'))[source]#
A replay buffer for off-policy algorithms.
Initialize the off-policy buffer.
Warning
The buffer only supports Box spaces.
Compared to the base buffer, the off-policy buffer stores extra data:
| Name | Shape | Dtype | Description |
|---|---|---|---|
| next_obs | (batch_size, *obs_space.shape) | torch.float32 | The next observation. |
- Parameters:
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
batch_size (int) – The batch size of the buffer.
device (torch.device, optional) – The device of the buffer. Defaults to torch.device('cpu').
- __init__(obs_space, act_space, size, batch_size, device=torch.device('cpu'))[source]#
Initialize the off-policy buffer.
- property batch_size: int#
Return the batch size of the buffer.
- property max_size: int#
Return the maximum size of the buffer.
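The roles of size (capacity) and batch_size (number of samples drawn per call) can be illustrated with a small ring-buffer sketch. Everything here is hypothetical and written in plain Python: the class ToyReplayBuffer, its store and sample_batch methods, and the seed argument are illustrative names, not the documented interface of OffPolicyBuffer.

```python
import random


class ToyReplayBuffer:
    """Hypothetical fixed-capacity replay buffer with uniform sampling."""

    def __init__(self, size, batch_size, seed=0):
        self._max_size = size
        self._batch_size = batch_size
        self._storage = [None] * size
        self._ptr = 0   # next slot to write
        self._len = 0   # number of valid transitions
        self._rng = random.Random(seed)

    def store(self, obs, act, reward, cost, done, next_obs):
        self._storage[self._ptr] = {
            'obs': obs, 'act': act, 'reward': reward,
            'cost': cost, 'done': done, 'next_obs': next_obs,
        }
        # Wrap around so the oldest transition is overwritten once full.
        self._ptr = (self._ptr + 1) % self._max_size
        self._len = min(self._len + 1, self._max_size)

    def sample_batch(self):
        idxs = [self._rng.randrange(self._len) for _ in range(self._batch_size)]
        return [self._storage[i] for i in idxs]

    def __len__(self):
        return self._len


replay = ToyReplayBuffer(size=3, batch_size=2)
for step in range(5):
    replay.store(obs=step, act=0, reward=1.0, cost=0.0, done=0.0, next_obs=step + 1)
print(len(replay))  # 3
```

After five stores into a buffer of capacity three, only the three most recent transitions remain, and each call to sample_batch draws batch_size of them with replacement.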
Vector On Policy Buffer#
Documentation
- class omnisafe.common.buffer.VectorOnPolicyBuffer(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient, standardized_adv_r, standardized_adv_c, num_envs=1, device=torch.device('cpu'))[source]#
Vectorized on-policy buffer.
Initialize the vector-on-policy buffer.
The vector-on-policy buffer is used to store the data from vector environments. The data is stored in a list of on-policy buffers, each of which corresponds to one environment.
Warning
The buffer only supports Box spaces.
- Parameters:
obs_space (OmnisafeSpace) – Observation space.
act_space (OmnisafeSpace) – Action space.
size (int) – Size of the buffer.
gamma (float) – Discount factor.
lam (float) – Lambda for GAE.
lam_c (float) – Lambda for GAE for cost.
advantage_estimator (AdvatageEstimator) – Advantage estimator.
penalty_coefficient (float) – Penalty coefficient.
standardized_adv_r (bool) – Whether to standardize the advantage for reward.
standardized_adv_c (bool) – Whether to standardize the advantage for cost.
num_envs (int, optional) – Number of environments. Defaults to 1.
device (torch.device, optional) – Device to store the data. Defaults to torch.device('cpu').
- __init__(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient, standardized_adv_r, standardized_adv_c, num_envs=1, device=torch.device('cpu'))[source]#
Initialize the vector-on-policy buffer.
- finish_path(last_value_r=torch.zeros(1), last_value_c=torch.zeros(1), idx=0)[source]#
Finish the current path of the buffer at index idx and calculate the advantages of its state-action pairs.
When the data is retrieved, the vector-on-policy buffer gets the data from each buffer and then concatenates them.
- Return type:
None
Hint
We provide a trick to standardize the advantages of state-action pairs: we calculate their mean and standard deviation and then standardize them. You can turn on this trick by setting standardized_adv_r to True. The same trick is applied to the advantages of the cost.
- property num_buffers: int#
Get the number of buffers.
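The "list of on-policy buffers, one per environment" design can be sketched as follows. The classes ToyPathBuffer and ToyVectorBuffer and their store and get methods are hypothetical, and only a single reward field is kept, so this shows the dispatch-and-concatenate idea rather than the real interface.

```python
class ToyPathBuffer:
    """Hypothetical per-environment buffer holding only rewards."""

    def __init__(self):
        self.rewards = []

    def store(self, reward):
        self.rewards.append(reward)

    def get(self):
        # Return the collected data and reset the buffer.
        data, self.rewards = {'reward': self.rewards}, []
        return data


class ToyVectorBuffer:
    """Hypothetical wrapper: one ToyPathBuffer per environment."""

    def __init__(self, num_envs):
        self.buffers = [ToyPathBuffer() for _ in range(num_envs)]

    @property
    def num_buffers(self):
        return len(self.buffers)

    def store(self, rewards):
        # rewards[i] belongs to environment i.
        for buf, reward in zip(self.buffers, rewards):
            buf.store(reward)

    def get(self):
        # Concatenate the data of all sub-buffers, environment by environment.
        merged = {}
        for buf in self.buffers:
            for key, value in buf.get().items():
                merged.setdefault(key, []).extend(value)
        return merged


vec = ToyVectorBuffer(num_envs=2)
vec.store([1.0, 10.0])
vec.store([2.0, 20.0])
print(vec.get())  # {'reward': [1.0, 2.0, 10.0, 20.0]}
```

Keeping one buffer per environment means each path can be finished independently, which matches the idx argument of finish_path.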
Vector Off Policy Buffer#
Documentation
- class omnisafe.common.buffer.VectorOffPolicyBuffer(obs_space, act_space, size, batch_size, num_envs, device=torch.device('cpu'))[source]#
A vectorized replay buffer for off-policy algorithms.
Initialize the vector-off-policy buffer.
The vector-off-policy buffer is a vectorized version of the off-policy buffer. It stores the data in a single tensor, and the data of each environment is stored in a separate column.
Warning
The buffer only supports Box spaces.
- Parameters:
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
batch_size (int) – The batch size of the buffer.
num_envs (int) – The number of environments.
device (torch.device, optional) – The device of the buffer. Defaults to torch.device('cpu').
- __init__(obs_space, act_space, size, batch_size, num_envs, device=torch.device('cpu'))[source]#
Initialize the vector-off-policy buffer.
- add_field(name, shape, dtype)[source]#
Add a field to the buffer.
Example
>>> buffer = BaseBuffer(...)
>>> buffer.add_field('new_field', (2, 3), torch.float32)
>>> buffer.data['new_field'].shape
(buffer.size, 2, 3)
- Parameters:
name (str) – The name of the field.
shape (tuple) – The shape of the field.
dtype (torch.dtype) – The dtype of the field.
- property num_envs: int#
Return the number of environments.
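The "single tensor, one column per environment" layout can be sketched with a nested list of shape (size, num_envs). This is a plain-Python illustration under assumptions: ToyVectorReplay, its methods, and the strategy of sampling a random row and a random column for each draw are hypothetical and may not match how the library samples.

```python
import random


class ToyVectorReplay:
    """Hypothetical replay storage with one column per environment."""

    def __init__(self, size, batch_size, num_envs, seed=0):
        self._max_size = size
        self._batch_size = batch_size
        self._num_envs = num_envs
        # Shape (size, num_envs): row = time slot, column = environment.
        self.rewards = [[0.0] * num_envs for _ in range(size)]
        self._ptr = 0
        self._len = 0
        self._rng = random.Random(seed)

    def store(self, rewards):
        """Write one reward per environment into the current row."""
        assert len(rewards) == self._num_envs
        self.rewards[self._ptr] = list(rewards)
        self._ptr = (self._ptr + 1) % self._max_size
        self._len = min(self._len + 1, self._max_size)

    def sample_batch(self):
        rows = [self._rng.randrange(self._len) for _ in range(self._batch_size)]
        cols = [self._rng.randrange(self._num_envs) for _ in range(self._batch_size)]
        return [self.rewards[r][c] for r, c in zip(rows, cols)]


vec_replay = ToyVectorReplay(size=4, batch_size=5, num_envs=2)
vec_replay.store([0.0, 100.0])
vec_replay.store([1.0, 101.0])
vec_replay.store([2.0, 102.0])
print(len(vec_replay.sample_batch()))  # 5
```

Because all environments share one row per step, a single write stores the transitions of every environment at once, and sampling only needs a row index and a column index.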