# renom_rl.discrete.a2c

class A2C(env, network, num_worker=8, logger=None, loss_func=None, optimizer=None, gamma=0.99, advantage=5, value_coef=0.5, entropy_coef=0.01, node_selector=None, test_node_selector=None, gradient_clipping=None, initialize=True)

Bases: `renom_rl.AgentBase`

This class provides a reinforcement learning agent, including its training method. A2C (Advantage Actor-Critic) is an actor-critic model that uses multiple workers to train a single network, computing the optimal policy and the state value V via advantage learning. This class runs on a single thread. This is the discrete-action version.

For the `env` argument, if a list (or tuple) of size 2 is given, it is interpreted as [training_env, test_env]. training_env may itself be a list (or tuple), in which case its length must equal `num_worker` and each element must be a `BaseEnv`-inherited object. If `env` is a single `BaseEnv`-inherited object, it is deep-copied to provide `num_worker` training environments, and the object passed as `env` also serves as test_env.
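The single-environment case above (one `BaseEnv` object deep-copied per worker) can be sketched as follows. This is an illustration of the documented behavior, not renom_rl's internal code; `DummyEnv` is a hypothetical stand-in for a `BaseEnv` subclass:

```python
import copy

class DummyEnv:
    """Stand-in for a BaseEnv-inherited environment (illustrative only)."""
    def __init__(self):
        self.steps = 0

num_worker = 8
env = DummyEnv()

# A single env is deep-copied so each worker trains on its own instance.
training_envs = [copy.deepcopy(env) for _ in range(num_worker)]

# The object passed as `env` is also used as the test environment.
test_env = env
```

Each element of `training_envs` is an independent object, so workers do not share mutable environment state.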

For the `optimizer` argument, if a list (or tuple) of size 2 is given, the first optimizer is used for the actor and the second for the critic. If a single `rm.Optimizer` is given, it is applied to the sum of the actor and critic losses.

Pseudo Code:

For the pseudo code and a full explanation of it, please refer to Efficient Parallel Methods for Deep Reinforcement Learning (see References below).
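The n-step bootstrapping behind the `advantage` parameter can be sketched in plain numpy. This is an illustrative reimplementation of the standard A2C target computation, not the library's internal code; the function and variable names are assumptions:

```python
import numpy as np

def n_step_targets(rewards, values, bootstrap_value, gamma=0.99):
    """Compute n-step return targets and advantages for one worker's rollout.

    rewards:         rewards r_t collected over `advantage` steps
    values:          critic estimates V(s_t) for the same steps
    bootstrap_value: V(s_{t+n}) used to bootstrap beyond the rollout
    Returns (targets, advantages), where the targets follow the
    backward recursion target_t = r_t + gamma * target_{t+1}
    and advantage_t = target_t - V(s_t).
    """
    targets = np.zeros_like(rewards, dtype=float)
    running = float(bootstrap_value)
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    advantages = targets - values
    return targets, advantages
```

The critic is regressed toward `targets`, while `advantages` weight the actor's policy-gradient term.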

Network Structure:
• Input Size:  BaseEnv.state_shape 
• Output Size: [action, value], where action matches `BaseEnv.action_shape` and value is the scalar state-value estimate (size 1).
Parameters:
• env (BaseEnv, list, tuple) – Environment.
• network (Model) – Actor-critic model.
• num_worker (int) – Number of parallel actors/environments.
• logger (Logger) – Logs session data.
• loss_func (object) – Loss function used to train the network. Default is `MeanSquaredError()`.
• optimizer (renom.Optimizer, list, tuple) – Optimizer for training the network. Default is `Rmsprop(0.001, g=0.99, epsilon=1e-10)`.
• gamma (float) – Discount rate.
• advantage (int) – Number of advantage (n-step) steps.
• value_coef (float) – Coefficient of the value loss.
• entropy_coef (float) – Coefficient of the actor's output entropy.
• node_selector (DiscreteNodeChooser) – Selects which action the agent takes during training. Default is `ProbNodeChooser()`.
• test_node_selector (DiscreteNodeChooser) – Selects which action the agent takes during testing. Default is `MaxNodeChooser()`.
• gradient_clipping (GradientClipping) – Optional gradient clipping. Default is None.
• initialize (bool) – Whether to initialize the network.
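The roles of `value_coef` and `entropy_coef` can be sketched with the usual A2C objective: policy loss plus weighted value loss, minus a weighted entropy bonus. This numpy sketch follows the standard formulation; the library's actual loss code may differ in detail:

```python
import numpy as np

def a2c_loss(action_probs, taken, advantages, values, targets,
             value_coef=0.5, entropy_coef=0.01):
    """Scalar A2C training loss for a batch (illustrative).

    action_probs:    (N, num_actions) softmax policy outputs
    taken:           (N,) indices of the actions actually taken
    advantages:      (N,) advantage estimates (treated as constants)
    values, targets: (N,) critic outputs and n-step return targets
    """
    eps = 1e-10  # guard against log(0)
    picked = action_probs[np.arange(len(taken)), taken]
    policy_loss = -np.mean(np.log(picked + eps) * advantages)
    value_loss = np.mean((targets - values) ** 2)  # critic MSE
    entropy = -np.mean(np.sum(action_probs * np.log(action_probs + eps), axis=1))
    # The entropy bonus is subtracted: a higher-entropy (more exploratory)
    # policy lowers the total loss.
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```

With a zero advantage and a perfect critic, only the entropy term remains, so raising `entropy_coef` directly strengthens the pressure toward exploration.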

Example

>>> import numpy as np
>>> import renom as rm
>>> from renom_rl.discrete.a2c import A2C
>>> from renom_rl.environ.openai import CartPole00
>>>
>>> class ActorCritic(rm.Model):
...     def __init__(self):
...         self.l1 = rm.Dense(32)
...         self.l2 = rm.Dense(32)
...         self.l3 = rm.Dense(2)
...         self.l4 = rm.Dense(1)
...
...     def forward(self, x):
...         h1 = self.l1(x)
...         h2 = self.l2(h1)
...         act = rm.softmax(self.l3(h2))
...         val = self.l4(h2)
...         return act, val
...
>>> model = ActorCritic()
>>> env = CartPole00()
>>> a2c = A2C(env, model)
>>> a2c.fit(epoch=1, epoch_step=10000)


References

A. V. Clemente, H. N. Castejon, and A. Chandra. Efficient Parallel Methods for Deep Reinforcement Learning.

fit(epoch=1, epoch_step=250000, test_step=None, loss_func=None, optimizer=None, gamma=None, advantage=None, value_coef=None, entropy_coef=None, node_selector=None, test_node_selector=None, gradient_clipping=None, initialize=None)

This method trains the actor-critic model. A test is run after training completes.

Refer to  A2C  for other argument descriptions.

Parameters:
• epoch (int) – Number of training epochs.
• epoch_step (int) – Number of steps per epoch.
• test_step (int) – Number of test steps.
test(test_step=None, node_selector=None)

Tests the trained actor agent. Refer to A2C for other argument descriptions.

Parameters:
• test_step (int, None) – Number of steps (not episodes) for the test. If None, only one episode is executed.

Returns: Sum of rewards (float).