renom_rl.discrete.a2c

class A2C ( env , network , loss_func=None , optimizer=None , gamma=0.99 , num_worker=8 , advantage=5 , value_coef=0.5 , entropy_coef=0.01 , node_selector=None , test_node_selector=None , gradient_clipping=None , logger=None )

Bases: renom_rl.AgentBase

A2C class

This class provides a reinforcement learning agent, including its training method. It runs on a single thread. This is the discrete-action version.

For the env argument, if a list (or tuple) of size 2 is given, it is interpreted as [training_env, test_env]. training_env can itself be a list (or tuple), in which case its length must equal num_worker and each element must be a BaseEnv-inherited object. If env is a single BaseEnv-inherited object, it is deepcopied to match num_worker, and the object passed as env is also used as test_env.
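
The accepted env formats are illustrated in the following minimal sketch. It reuses CartPole00 from the example further down; any BaseEnv-inherited environment can be substituted.

>>> from renom_rl.environ.openai import CartPole00
>>> # Single environment: deepcopied num_worker times for training,
>>> # and the object itself is used as test_env.
>>> env = CartPole00()
>>> # Size-2 list: interpreted as [training_env, test_env].
>>> env = [CartPole00(), CartPole00()]
>>> # training_env may itself be a list with one environment per worker
>>> # (its length must equal num_worker, assumed here to be the default 8).
>>> env = [[CartPole00() for _ in range(8)], CartPole00()]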

For the optimizer argument, if a list (or tuple) of size 2 is given, the first optimizer is used for the actor and the second for the critic. If a single rm.Optimizer is given, it is applied to the sum of the actor and critic losses.
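
The two accepted optimizer formats are sketched below; the Rmsprop arguments shown simply repeat the documented default.

>>> import renom as rm
>>> # Single optimizer: applied to the sum of actor and critic losses.
>>> optimizer = rm.Rmsprop(0.001, g=0.99, epsilon=1e-10)
>>> # Size-2 list: [actor_optimizer, critic_optimizer].
>>> optimizer = [rm.Rmsprop(0.001, g=0.99, epsilon=1e-10), rm.Rmsprop(0.001, g=0.99, epsilon=1e-10)]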

Parameters:
  • env ( BaseEnv, list, tuple ) – Environment.
  • network ( Model ) – Actor-critic model.
  • num_worker ( int ) – Number of actor/environment workers.
  • advantage ( int ) – Number of advantage steps.
  • node_selector ( DiscreteNodeChooser ) – Node selector.
  • test_node_selector ( DiscreteNodeChooser ) – Node selector used during test.
  • loss_func ( function ) – Loss function for training the network. Default is MeanSquaredError() .
  • optimizer ( rm.Optimizer, list, tuple ) – Optimizer for training the network. Default is Rmsprop(0.001, g=0.99, epsilon=1e-10) .
  • entropy_coef ( float ) – Coefficient of the actor’s output entropy.
  • value_coef ( float ) – Coefficient of the value loss.
  • gamma ( float ) – Discount rate.
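
The way value_coef and entropy_coef typically enter an A2C objective is sketched below with plain scalars. The variable names are hypothetical placeholders; the internal computation of this class may differ in detail.

>>> # Hypothetical scalar losses, for illustration only.
>>> actor_loss, value_loss, entropy = 0.8, 0.3, 1.2
>>> value_coef, entropy_coef = 0.5, 0.01
>>> total_loss = actor_loss + value_coef * value_loss - entropy_coef * entropy
>>> round(total_loss, 3)
0.938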

Example

>>> import numpy as np
>>> import renom as rm
>>> from renom_rl.discrete.a2c import A2C
>>> from renom_rl.environ.openai import CartPole00
>>>
>>> class ActorCritic(rm.Model):
...     def __init__(self):
...         self.l1 = rm.Dense(32)
...         self.l2 = rm.Dense(32)
...         self.l3 = rm.Dense(2)
...         self.l4 = rm.Dense(1)
...
...     def forward(self, x):
...         h1 = self.l1(x)
...         h2 = self.l2(h1)
...         act = rm.softmax(self.l3(h2))
...         val = self.l4(h2)
...         return act, val
...
>>> model = ActorCritic()
>>> env = CartPole00()
>>> a2c = A2C(env, model)
>>> a2c.fit(epoch=1, epoch_step=10000)

References

A. V. Clemente, H. N. Castejon, and A. Chandra. Efficient Parallel Methods for Deep Reinforcement Learning.

fit ( epoch=1 , epoch_step=250000 , test_step=None )

This method executes training of the actor-critic network. A test run is performed after each epoch; see the usage sketch below.

Parameters:
  • epoch ( int ) – Number of epochs for training.
  • epoch_step ( int ) – Number of steps in one epoch.
  • test_step ( int ) – Number of steps during test.
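
A minimal usage sketch, continuing the a2c object from the example above; the step counts are illustrative.

>>> a2c.fit(epoch=2, epoch_step=5000, test_step=1000)
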
test ( test_step=None , **kwargs )

Test the trained actor agent (see the usage sketch below).

Parameters: test_step ( int, None ) – Number of steps (not episodes) for the test. If None is given, this method executes only 1 episode.
Returns: Sum of rewards.
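
A minimal usage sketch, again continuing the a2c object from the example above; the step count is illustrative.

>>> total_reward = a2c.test(test_step=1000)  # run a fixed number of steps
>>> total_reward = a2c.test()                # run a single episode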