renom_rl.discrete.a2c

class A2C ( env , network , num_worker=8 , logger=None , loss_func=None , optimizer=None , gamma=0.99 , advantage=5 , value_coef=0.5 , entropy_coef=0.01 , node_selector=None , test_node_selector=None , gradient_clipping=None , initialize=True )

Bases: renom_rl.AgentBase

This class provides a reinforcement learning agent, including its training method. A2C is an Actor-Critic model that uses multiple agents to train a network that computes the optimal policy and the state value V using advantage learning. This class runs on a single thread. This is the discrete-action version.

For the env argument, if a list (or tuple) of size 2 is given, the environment is interpreted as [training_env, test_env]. training_env can itself be a list (or tuple), but its length must equal num_worker and each element must be an object inheriting from BaseEnv. If env is a single BaseEnv-inherited object, it is deep-copied to create num_worker training environments, and the object passed as env is used as test_env.
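For example, the following forms are all accepted (a minimal sketch; model is assumed to be an actor-critic Model such as the one defined in the Example below):

>>> from renom_rl.environ.openai import CartPole00
>>>
>>> # Single BaseEnv object: deep-copied into num_worker training
>>> # environments; the object itself serves as test_env.
>>> a2c = A2C(CartPole00(), model)
>>>
>>> # [training_env, test_env], where training_env is a list whose
>>> # length equals num_worker.
>>> train_envs = [CartPole00() for _ in range(8)]
>>> a2c = A2C([train_envs, CartPole00()], model, num_worker=8)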

For the optimizer argument, if a list (or tuple) of size 2 is given, the first optimizer is used for the actor and the second for the critic. If a single rm.Optimizer is given, it is applied to the sum of the actor and critic losses.
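For example, continuing with env and model as in the Example below (a sketch; the learning rates here are arbitrary choices, not library defaults):

>>> import renom as rm
>>>
>>> # Two optimizers: the first trains the actor, the second the critic.
>>> a2c = A2C(env, model, optimizer=[rm.Rmsprop(0.0007), rm.Rmsprop(0.001)])
>>>
>>> # A single optimizer is applied to the summed actor and critic loss.
>>> a2c = A2C(env, model, optimizer=rm.Rmsprop(0.001))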

Pseudo Code:
../../_images/A2C_pseudo1.png

For more explanation on the Pseudo Code please refer to Efficient Parallel Methods for Deep Reinforcement Learning.
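The gamma, advantage, value_coef, and entropy_coef parameters correspond to the n-step targets and combined loss used in this scheme. As a rough, library-agnostic sketch (not the class's internal implementation; n_step_targets is a hypothetical helper):

>>> import numpy as np
>>>
>>> def n_step_targets(rewards, bootstrap_value, gamma=0.99):
...     # Hypothetical helper: discounted targets over the advantage steps,
...     # bootstrapped with V(s) at the last step (termination ignored for brevity).
...     targets = np.zeros(len(rewards))
...     running = bootstrap_value
...     for t in reversed(range(len(rewards))):
...         running = rewards[t] + gamma * running
...         targets[t] = running
...     return targets
...
>>> targets = n_step_targets([1.0, 1.0, 1.0], bootstrap_value=0.5)
>>> # advantage_t = targets[t] - V(s_t)
>>> # total loss ~ policy_loss + value_coef * value_loss - entropy_coef * entropy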

Network Structure:
  • Input Size: BaseEnv.state_shape
  • Output Size: [action, value], where action has shape BaseEnv.action_shape and value has size 1 (the estimated state value), as in the example below.
Parameters:
  • env ( BaseEnv, list, tuple ) – Environment.
  • network ( Model ) – Actor Critic Model.
  • num_worker ( int ) – Number of actor/environment workers.
  • logger ( Logger ) – Logs session data.
  • loss_func ( object ) – Loss function used to train the network. Default is MeanSquaredError() .
  • optimizer ( renom.Optimizer, list, tuple ) – Optimizer for training the network. Default is Rmsprop(0.001, g=0.99, epsilon=1e-10) .
  • gamma ( float ) – Discount rate.
  • advantage ( int ) – Advantage steps.
  • value_coef ( float ) – Coefficient of value loss.
  • entropy_coef ( float ) – Coefficient of actor’s output entropy.
  • node_selector ( DiscreteNodeChooser ) – Selects which action the agent will take during training. Default is ProbNodeChooser().
  • test_node_selector ( DiscreteNodeChooser ) – Selects which action the agent will take during testing. Default is MaxNodeChooser().
  • gradient_clipping ( GradientClipping ) – Optional gradient clipping. Default is None.
  • initialize ( bool ) – Whether to initialize the network or not.

Example

>>> import numpy as np
>>> import renom as rm
>>> from renom_rl.discrete.a2c import A2C
>>> from renom_rl.environ.openai import CartPole00
>>>
>>> class ActorCritic(rm.Model):
...     def __init__(self):
...         self.l1 = rm.Dense(32)
...         self.l2 = rm.Dense(32)
...         self.l3 = rm.Dense(2)   # actor head: one output per action
...         self.l4 = rm.Dense(1)   # critic head: scalar state value
...
...     def forward(self, x):
...         h1 = self.l1(x)
...         h2 = self.l2(h1)
...         act = rm.softmax(self.l3(h2))
...         val = self.l4(h2)
...         return act, val
...
>>> model = ActorCritic()
>>> env = CartPole00()
>>> a2c = A2C(env, model)
>>> a2c.fit(epoch=1, epoch_step=10000)

References

A. V. Clemente, H. N. Castejon, and A. Chandra. Efficient Parallel Methods for Deep Reinforcement Learning.

fit ( epoch=1 , epoch_step=250000 , test_step=None , loss_func=None , optimizer=None , gamma=None , advantage=None , value_coef=None , entropy_coef=None , node_selector=None , test_node_selector=None , gradient_clipping=None , initialize=None )

This method executes training of the actor-critic model. A test run is performed after training finishes (see the usage sketch below the parameter list).

Refer to A2C for other argument descriptions.

Parameters:
  • epoch ( int ) – Number of epochs for training.
  • epoch_step ( int ) – Number of steps of one epoch.
  • test_step ( int ) – Number of test steps.
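For example, hyperparameters given at construction can be overridden for a single call (a sketch; the values below are arbitrary):

>>> a2c.fit(epoch=3, epoch_step=5000, gamma=0.995, entropy_coef=0.005)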
test ( test_step=None , node_selector=None )

Test the trained actor agent. Refer to A2C for other argument descriptions.

Parameters: test_step ( int, None ) – Number of steps (not episodes) for the test. If None is given, only 1 episode is executed.
Returns: Sum of rewards (float).
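For example, continuing the Example above (a sketch):

>>> total_reward = a2c.test(test_step=1000)   # run 1000 test steps
>>> print(total_reward)                       # sum of rewards over the test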