renom_rl.discrete.double_dqn

class renom_rl.discrete.double_dqn.DoubleDQN(env, q_network, loss_func=None, optimizer=None, gamma=0.99, buffer_size=1000000.0)

Bases: renom_rl.AgentBase

Double DQN class. This class provides a reinforcement learning agent, including a training method.

Parameters:
  • env ( BaseEnv ) – Environment. This must be a child class of BaseEnv.
  • q_network ( Model ) – Q-Network.
  • loss_func ( function ) – Loss function for training the q-network. Default is ClippedMeanSquaredError().
  • optimizer – Optimizer for training the q-network. Default is Rmsprop(lr=0.00025, g=0.95).
  • gamma ( float ) – Discount rate.
  • buffer_size ( float, int ) – Size of the replay buffer.

Example

>>> import renom as rm
>>> from renom_rl.discrete.double_dqn import DoubleDQN
>>> from renom_rl.environ.openai import Breakout
>>> model = rm.Sequential(...)
>>> agent = DoubleDQN(
...       Breakout(),
...       model,
...       loss_func=rm.ClippedMeanSquaredError(),
...       buffer_size=1e6
...   )
>>> agent.fit(episode=10000)
episode 001 avg_loss: 0.004 total_reward [train:2.000 test:-] e-greedy:0.000: : 190it [00:03, 48.42it/s]
episode 002 avg_loss: 0.003 total_reward [train:0.000 test:-] e-greedy:0.000: : 126it [00:02, 50.59it/s]
episode 003 avg_loss: 0.003 total_reward [train:3.000 test:-] e-greedy:0.001: : 250it [00:04, 51.31it/s]
...
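
Double DQN differs from plain DQN in how the training target is computed: the online q-network selects the greedy next action and the target network evaluates it (see the reference below). The NumPy sketch below is illustrative only; double_dqn_target is not part of the library API.

>>> import numpy as np
>>> def double_dqn_target(reward, terminal, q_online_next, q_target_next, gamma=0.99):
...     # q_online_next / q_target_next: (batch, n_actions) Q-values for the next states.
...     best = np.argmax(q_online_next, axis=1)            # action selection by the online network
...     value = q_target_next[np.arange(len(best)), best]  # evaluation by the target network
...     return reward + gamma * (1.0 - terminal) * value   # no bootstrapping on terminal states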

References

Hado van Hasselt, Arthur Guez, David Silver.
Deep Reinforcement Learning with Double Q-learning.

initialize()

The target q-network is initialized with the same weights as the q-network.
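
A minimal usage sketch, where agent is the DoubleDQN instance from the example above:

>>> agent.initialize()  # target q-network now holds the same weights as the q-network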

action(state)

This method returns an action according to the given state.

Parameters: state ( ndarray ) – A state of an environment.
Returns: Action.
Return type: (int, ndarray)
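
A minimal usage sketch, assuming env is the BaseEnv instance passed to the agent and that env.reset() returns an initial state (an assumption about the environment interface):

>>> state = env.reset()        # assumption: reset() returns the initial observation
>>> act = agent.action(state)  # action chosen by the current q-network
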
update()

This method updates the target network.

update_best_q_network()

This method updates the best network in each epoch.

fit(epoch=500, epoch_step=250000, batch_size=32, random_step=50000, test_step=2000, update_period=10000, train_frequency=4, min_greedy=0.0, max_greedy=0.9, greedy_step=1000000, test_greedy=0.95, render=False, callback_end_epoch=None)

This method executes training of the q-network. Training is performed with the epsilon-greedy method.

You can define the following callback functions.

- end_epoch
  Args:
    epoch (int): The number of the current epoch.
    model (DoubleDQN): The DoubleDQN object being trained.
    summed_train_reward_in_current_epoch (float): Sum of train rewards earned in the current epoch.
    summed_test_reward_in_current_epoch (float): Sum of test rewards.
    average_train_loss_in_current_epoch (float): Average train loss in the current epoch.

Parameters:
  • epoch ( int ) – Number of epochs for training.
  • epoch_step ( int ) – Number of steps in one epoch.
  • batch_size ( int ) – Batch size.
  • random_step ( int ) – Number of random steps executed before training starts.
  • test_step ( int ) – Number of test steps.
  • update_period ( int ) – Period (in steps) of updating the target network.
  • train_frequency ( int ) – Training is performed once every this many steps.
  • min_greedy ( float ) – Minimum greedy value.
  • max_greedy ( float ) – Maximum greedy value.
  • greedy_step ( int ) – Number of steps over which the greedy value moves from min_greedy to max_greedy (see the sketch after this list).
  • test_greedy ( float ) – Greedy ratio of actions during test.
  • render ( bool ) – If True is given, the BaseEnv.render() method will be called at test time.
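
The min_greedy, max_greedy, and greedy_step parameters suggest that the greedy value is annealed from min_greedy to max_greedy over greedy_step steps. The helper below is only a sketch under that assumption; the linear schedule and the name greedy_at are illustrative, not part of the library.

>>> # Hypothetical linear schedule for the greedy value (assumption, for illustration only).
>>> def greedy_at(step, min_greedy=0.0, max_greedy=0.9, greedy_step=1000000):
...     return min(max_greedy, min_greedy + (max_greedy - min_greedy) * step / greedy_step)
>>> greedy_at(0), greedy_at(500000), greedy_at(2000000)
(0.0, 0.45, 0.9)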

Example

>>> import renom as rm
>>> from renom_rl.discrete.double_dqn import DoubleDQN
>>> from renom_rl.environ.openai import Breakout
>>>
>>> q_network = rm.Sequential([
...    # Define network here.
... ])
>>> model = DoubleDQN(Breakout(), q_network)
>>>
>>> @model.event.end_epoch
... def callback(epoch, ddqn, train_rew, test_rew, avg_loss):
...     # This function will be called at the end of each epoch.
...     pass
>>>
>>> model.fit()
epoch 001 avg_loss:0.0031 total reward in epoch: [train:109.000 test: 3.0] avg reward in episode:[train:0.235 test:0.039] e-greedy:0.900: 100%|██████████| 10000/10000 [05:48<00:00, 28.73it/s]
epoch 002 avg_loss:0.0006 total reward in epoch: [train:116.000 test:14.0] avg reward in episode:[train:0.284 test:0.163] e-greedy:0.900: 100%|██████████| 10000/10000 [05:53<00:00, 25.70it/s]
...
test(test_step=None, test_greedy=0.95, render=False)

Test the trained agent.

Parameters:
  • test_step ( int, None ) – Number of steps for the test. If None is given, this method tests just one episode.
  • test_greedy ( float ) – Greedy ratio of actions.
  • render ( bool ) – If True is given, the BaseEnv.render() method will be called.
Returns: Sum of rewards.
Return type: (int)
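
A minimal usage sketch, where agent is a trained DoubleDQN instance:

>>> reward_sum = agent.test(test_step=2000, render=True)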