renom_rl.continuous.ddpg

class renom_rl.continuous.ddpg.DDPG ( env , actor_network , critic_network , loss_func=None , actor_optimizer=None , critic_optimizer=None , gamma=0.99 , tau=0.001 , buffer_size=1000000.0 )

Bases: renom_rl.AgentBase

DDPG class

This class provides a reinforcement learning agent, including training and testing methods. Only objects of the BaseEnv class are accepted as the environment.

Parameters:
  • env ( BaseEnv ) – An instance of the environment to be learned.
  • actor_network ( Model ) – Actor network.
  • critic_network ( Model ) – Critic network. Basically this is a Q(s,a) function network.
  • loss_func – Loss function for the critic network. Default is MeanSquaredError().
  • actor_optimizer – Optimizer object for training the actor network. Default is Adam(lr=0.0001).
  • critic_optimizer – Optimizer object for training the critic network. Default is Adam(lr=0.001).
  • gamma ( float ) – Discount rate.
  • tau ( float ) – Target-network soft-update rate. If this is 0, the weight parameters will be copied.
  • buffer_size ( float, int ) – The size of the replay buffer.

Example

>>> import renom as rm
>>> from renom_rl.continuous.ddpg import DDPG
>>> from renom_rl.environ.openai import Pendulum
>>>
>>> class Critic(rm.Model):
...
...     def __init__(self):
...         self.l1 = rm.Dense(2)
...         self.l2 = rm.Dense(1)
...
...     def forward(self, state, action):
...         h = rm.concat(self.l1(state), action)
...         return self.l2(rm.relu(h))
...
>>> actor = rm.Sequential(...)
>>> critic = Critic()
>>> agent = DDPG(
...       Pendulum(),
...       actor,
...       critic,
...       loss_func=rm.ClippedMeanSquaredError(),
...       buffer_size=1e6
...   )
>>> agent.fit(episode=10000)
episode 001 avg_loss: 0.004 total_reward [train:2.000 test:-] e-greedy:0.000: : 190it [00:03, 48.42it/s]
episode 002 avg_loss: 0.003 total_reward [train:0.000 test:-] e-greedy:0.000: : 126it [00:02, 50.59it/s]
episode 003 avg_loss: 0.003 total_reward [train:3.000 test:-] e-greedy:0.001: : 250it [00:04, 51.31it/s]
Reference:
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel,
Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra,
Continuous control with deep reinforcement learning

action ( state )

This method returns an action according to the given state.

Parameters: state – A state of the environment.
Returns: Action.
Return type: (int, ndarray)
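
A minimal usage sketch, assuming the agent and Pendulum environment from the example above (that BaseEnv.reset() returns the initial state directly is an assumption):

>>> env = Pendulum()
>>> state = env.reset()       # assumed: reset() returns the initial state
>>> a = agent.action(state)   # continuous action predicted by the actor network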
fit ( epoch=1000 , epoch_step=2000 , batch_size=64 , random_step=5000 , test_step=2000 , train_frequency=1 , min_greedy=0.01 , max_greedy=1.0 , greedy_step=10000 , noise=<renom_rl.noise.OU object> )

This method trains the actor and critic networks. The target actor and critic network weights are softly updated with self.tau after every actor and critic update.

This method also provides an end_epoch event hook (see the callback example below), whose callback receives the following arguments:

  • epoch ( int ) – The number of the current epoch.
  • model ( DDPG ) – The DDPG object that is being trained.
  • summed_train_reward_in_current_epoch ( float ) – Sum of train rewards earned in the current epoch.
  • summed_test_reward_in_current_epoch ( float ) – Sum of test rewards.
  • average_train_loss_in_current_epoch ( float ) – Average train loss in the current epoch.

Parameters:
  • epoch ( int ) – Number of training epochs.
  • epoch_step ( int ) – Number of steps in one epoch.
  • batch_size ( int ) – Batch size.
  • random_step ( int ) – Number of random steps executed before training starts.
  • test_step ( int ) – Number of test steps.
  • train_frequency ( int ) – Training is performed once every this many steps.
  • min_greedy ( float ) – Minimum greedy value.
  • max_greedy ( float ) – Maximum greedy value.
  • greedy_step ( int ) – Number of steps over which the greedy value is annealed from min_greedy to max_greedy.
  • noise ( OU, GP ) – Ornstein-Uhlenbeck noise or Gaussian noise (see the short sketch after this list).
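
The exploration noise object can also be passed explicitly. A minimal sketch, assuming OU can be constructed with its default arguments (its constructor arguments are not documented here):

>>> from renom_rl.noise import OU
>>> agent.fit(epoch=5, epoch_step=2000, noise=OU())   # Ornstein-Uhlenbeck exploration noise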

Example

>>> import renom as rm
>>> from renom_rl.continuous.ddpg import DDPG
>>> from renom_rl.environ.openai import Pendulum
>>>
>>> class Critic(rm.Model):
...
...     def __init__(self):
...         self.l1 = rm.Dense(2)
...         self.l2 = rm.Dense(1)
...
...     def forward(self, state, action):
...         h = rm.concat(self.l1(state), action)
...         return self.l2(rm.relu(h))
...
>>> actor = rm.Sequential(...)
>>> critic = Critic()
>>>
>>> agent = DDPG(
...       Pendulum(),
...       actor,
...       critic,
...       loss_func=rm.ClippedMeanSquaredError(),
...       buffer_size=1e6
...   )
>>> @agent.event.end_epoch
... def callback(epoch, ddpg_model, train_rew, test_rew, avg_loss):
...     # This function will be called at the end of each epoch.
...     pass
...
>>>
>>> agent.fit()
epoch 001 avg_loss:0.0031 total reward in epoch: [train:109.000 test: 3.0] avg reward in episode:[train:0.235 test:0.039] e-greedy:0.900: 100%|██████████| 10000/10000 [05:48<00:00, 28.73it/s]
epoch 002 avg_loss:0.0006 total reward in epoch: [train:116.000 test:14.0] avg reward in episode:[train:0.284 test:0.163] e-greedy:0.900: 100%|██████████| 10000/10000 [05:53<00:00, 25.70it/s]
...
value_function ( state )

Value of the predict network, Q_predict(s, a), evaluated at the given state.

Parameters: state – Input state.
Returns: Q(s,a) value.
Return type: value
target_value_function ( state )

Value of the target network, Q_target(s, a), evaluated at the given state.

Parameters: state – Input state.
Returns: Q(s,a) value.
Return type: value
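
For orientation, these two methods correspond to the critic values used in the standard DDPG temporal-difference target y = r + gamma * Q_target(s', a'). A minimal, illustrative sketch under that standard formulation (reward, terminal and next_state stand for a sampled minibatch and are not part of this API):

>>> gamma = 0.99   # the discount rate passed to the constructor
>>> y = reward + gamma * (1.0 - terminal) * agent.target_value_function(next_state)
>>> # the critic is trained so that agent.value_function(state) approaches y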
initalize ( )

Initializes the target actor and critic networks with the same weights as the actor and critic networks.

update ( )

Updates the target networks using the soft-update rate tau.
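
For reference, the cited paper defines the soft update as theta_target <- tau * theta + (1 - tau) * theta_target. A minimal, illustrative sketch of that rule, with plain dictionaries standing in for the weight containers (an assumption, not the library's internals):

>>> def soft_update(target_weights, source_weights, tau):
...     # soft-update rule from Lillicrap et al.
...     for name, w in source_weights.items():
...         target_weights[name] = tau * w + (1.0 - tau) * target_weights[name]
...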

test ( test_step=None , render=False )

Test the trained agent.

Parameters:
  • test_step ( int, None ) – Number of steps for the test. If None is given, this method tests just one episode.
  • render ( bool ) – If True is given, the BaseEnv.render() method will be called.
Returns: Sum of rewards.
Return type: (int)
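
A minimal usage sketch:

>>> reward_sum = agent.test(test_step=2000)   # run 2000 evaluation steps
>>> reward_sum = agent.test(render=True)      # test one episode, rendering each step if the environment supports it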