How to Use - Detail
This section lists details that users should know before using the module.
1- Env Argument Structure
For the env argument, if a list (or tuple) of size 2 is set, the environment will be interpreted as [training_env, test_env]. training_env can also be a list (or tuple) when using algorithms that run multiple agents, such as A2C; in that case the length of training_env must equal num_worker and each element must be a BaseEnv-inherited object.
If env is set to a single BaseEnv-inherited object, the environment will be deep-copied to match num_worker. The environment specified as env will also be used as test_env.
Good Examples:
from renom_rl.environ.openai import CartPole00
from renom_rl.discrete.a2c import A2C

custom_env = CartPole00()
test_custom_env = CartPole00()
custom_env_list = [CartPole00() for _ in range(8)]

# q_network is assumed to be a network defined beforehand (see section 2)
_ = A2C(custom_env, q_network)
_ = A2C([custom_env, test_custom_env], q_network, num_worker=8)
_ = A2C([custom_env_list, test_custom_env], q_network, num_worker=8)
Bad Examples:
custom_env_list = [CartPole00() for _ in range(8)]

# num_worker=9 does not match the length of custom_env_list (8), so this configuration is invalid
_ = A2C([custom_env_list, test_custom_env], q_network, num_worker=9)
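Conceptually, the env argument is resolved along the lines of the sketch below. This is only a simplified illustration of the rules above, not the module's actual implementation, and resolve_env is a hypothetical helper name.

from copy import deepcopy

def resolve_env(env, num_worker):
    """Hypothetical sketch of how env is interpreted, per the rules above."""
    if isinstance(env, (list, tuple)) and len(env) == 2:
        training_env, test_env = env
        if isinstance(training_env, (list, tuple)):
            # length must equal num_worker and each element must inherit BaseEnv
            assert len(training_env) == num_worker
        return training_env, test_env
    # a single BaseEnv object: deep-copied to match num_worker; env itself becomes test_env
    return [deepcopy(env) for _ in range(num_worker)], env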
2- Network (Agent) Structure
Each algorithm requires a specific network structure. For example, the discrete version of A2C needs a network that outputs both actor and critic values. Follow the documentation of each algorithm for the required structure.
Example:
# For DQN: a single output of action values
import renom as rm

class DQN_Model(rm.Model):
    def __init__(self, a=2):
        self.d1 = rm.Dense(30)
        self.r1 = rm.Relu()
        self.d2 = rm.Dense(a)
        self.act = rm.Softmax()

    def forward(self, x):
        h = self.d1(x)
        h = self.r1(h)
        h = self.d2(h)
        act = self.act(h)
        return act

# For A2C (discrete): the forward pass returns both actor and critic outputs
class A2C_Discrete(rm.Model):
    def __init__(self, m=2, v=1):
        self.d1 = rm.Dense(200)
        self.r1 = rm.Relu()
        self.d2 = rm.Dense(m)
        self.act = rm.Softmax()
        self.val = rm.Dense(v)

    def forward(self, x):
        h = self.d1(x)
        h = self.r1(h)
        h = self.d2(h)
        act = self.act(h)
        val = self.val(h)
        return act, val
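As a quick sanity check (not from the original documentation), the A2C network above can be called directly to confirm that its forward pass returns two outputs; the 4-dimensional input assumes a CartPole-like observation.

import numpy as np

# hypothetical check with a CartPole-like 4-dimensional observation
x = np.random.rand(1, 4).astype(np.float32)
act, val = A2C_Discrete(m=2, v=1)(x)
print(act.shape, val.shape)   # expected: (1, 2) (1, 1)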
3- Network Weights Initialization
Unless initialize is set to False, the network weights will be re-initialized at __init__ and at fit. To stop the module from initializing the network, set initialize to False.
Example:
from renom_rl.discrete.dqn import DQN
algorithm = DQN(custom_env, q_network, initialize = False)
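A common reason to set initialize to False is to keep weights that were trained or loaded beforehand. The sketch below assumes the weights were previously saved with ReNom's model save/load methods; the file name is hypothetical.

# hypothetical workflow: reuse previously saved weights instead of re-initializing
q_network.load("dqn_weights.h5")   # assumes q_network.save("dqn_weights.h5") was called earlier
algorithm = DQN(custom_env, q_network, initialize=False)
algorithm.fit(epoch=1, epoch_step=1000)   # training continues from the loaded weights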
4- Logger
Users can log data during training and testing using renom_rl.utility.logger. For details, refer to renom_rl.utility.logger.Logger.
import renom as rm
from renom_rl.environ.openai import CartPole00
from renom_rl.discrete.dqn import DQN
from renom_rl.utility.logger import Logger

class Original(Logger):
    def __init__(self, log_key):
        super(Original, self).__init__(log_key, record_episode_base=False)
        self.reward_previous = 0
        self.reward = 0
        self.total_list = []
        self.state = 0
        self.total = 0

    def logger(self, **log):
        self.state = log["state"]
        self.reward = log["reward"]
        self.total += log["reward"]
        return "state----{}/reward---{}/total----{}".format(self.state, self.reward, self.total)

network = rm.Sequential([rm.Dense(32), rm.Relu(), rm.Dense(32), rm.Relu(), rm.Dense(2)])
logger = Original(["reward"])
dqn = DQN(env=CartPole00(), q_network=network, logger=logger)

# result (shown on the progress bar while fit is running):
# state----[-0.00528582  0.76312646 -0.00763515 -1.1157825 ]/reward---0/total-----39: 100%|██████████████████████████████████████| 500/500 [00:01<00:00, 438.39it/s]
5- Init and Fit Arguments
__init__ and fit have nearly the same arguments.
The arguments that can only be set at __init__ are the environment, network, and logger related arguments. We believe these are the central objects for reinforcement learning.

The arguments that can only be set at fit are the arguments that affect the length of learning, such as epoch, epoch_step, etc.
For common arguments, the following diagram explains which values are used during fit:
[Diagram: how common argument values (default, __init__, fit) are resolved during fit]
Test-related arguments are resolved in the same way as shown in the diagram above.
Example:
# loss: uses the default value, gamma: specified at __init__, action_filter: changed at fit
from renom_rl.discrete.dqn import DQN
from renom_rl.utility.filter import EpsilonCFilter

dqn = DQN(custom_env, q_network, gamma=0.99)

print("---before---")
info = dqn.info_init()
print("loss(id):", id(info["loss_func"]),
      "\ngamma:", info["gamma"],
      "\nActionFilter(id):", id(info["action_filter"]))
print()

dqn.fit(random_step=0, epoch=1, epoch_step=10, action_filter=EpsilonCFilter(epsilon=0.1))

print()
print("---after---")
info = dqn.info_fit()
print("loss(id):", id(info["loss_func"]),
      "\ngamma:", info["gamma"],
      "\nActionFilter(id):", id(info["action_filter"]))

### The result will show as follows:
###
### ---before---
### loss(id): 112130625152
### gamma: 0.99
### ActionFilter(id): 112130625040
###
### Run random 0 step for storing experiences
###
### ---after---
### loss(id): 112130625152
### gamma: 0.99
### ActionFilter(id): 112130625264
6- Filter
Filters are applied to the network output in order to convert the output values into action(s). There are several filter objects, and which of them are available depends on the algorithm the user chooses.
The following diagram shows a simplified view of DQN and where the filters operate:
[Diagram: DQN data flow showing where node_selector and action_filter are applied]
DQN has node_selector and action_filter as arguments. DQN takes a renom_rl.utility.DiscreteNodeChooser object as the node_selector value, and a renom_rl.utility.EpsilonGreedyFilter object as the action_filter value. By default, node_selector is MaxNodeChooser() and action_filter is EpsilonSLFilter().
Users can instead set node_selector to ProbNodeChooser() (note that ProbNodeChooser() chooses a node based on the output values between 0 and 1). The same applies to action_filter.
Other algorithms (such as A2C) also have filters, similar to the diagram above. Read the documentation of each algorithm for more information.
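As a rough sketch of overriding the defaults described above (the import path for the node chooser is an assumption based on the renom_rl.utility module name referenced in this section; the filter import path is the one used in section 5):

from renom_rl.utility import ProbNodeChooser        # assumed path, per renom_rl.utility.DiscreteNodeChooser above
from renom_rl.utility.filter import EpsilonCFilter  # path shown in section 5's example
from renom_rl.discrete.dqn import DQN

dqn = DQN(custom_env, q_network,
          node_selector=ProbNodeChooser(),            # choose actions based on the output values (0-1)
          action_filter=EpsilonCFilter(epsilon=0.1))  # constant-epsilon filter, as in section 5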
7- Other BaseEnv Methods
start() ~ close() methods
In renom_rl.environ.env.BaseEnv, there are methods from start() to close(). These run at certain points during the training/testing phase. The timing at which each method runs is as follows:
[Diagram: when each BaseEnv method, from start() to close(), is called during training/testing]
terminate() ~ stop_epoch() methods
In renom_rl.environ.env.BaseEnv, there are also terminate(), terminate_epoch(), and stop_epoch() methods. These stop or terminate the training/testing phase. Each of them stops or terminates as follows:
[Diagram: what terminate(), terminate_epoch(), and stop_epoch() each stop or terminate]
Note: stop_epoch() (or terminate_epoch()) is checked on every epoch run, so unless its return value is reset to False at the beginning of the iteration, it will keep stopping or terminating the epoch run. Always make sure the return value of stop_epoch() (or terminate_epoch()) is False at the beginning of an epoch run.
For more details, view renom_rl.environ.env.BaseEnv.
Example:
from renom_rl.environ.openai import CartPole00
from renom_rl.discrete.dqn import DQN

class CartPole(CartPole00):
    def __init__(self):
        self.i = 0
        self.t = 0
        CartPole00.__init__(self)

    # overriding start, epoch, epoch_step, test_epoch_step, terminate_epoch
    def start(self):
        self.i = 0
        self.t = 0

    def epoch(self):
        self.i = 0

    def epoch_step(self):
        self.i += 1

    def test_epoch_step(self):
        """If not overridden, epoch_step will run instead."""
        pass

    def terminate_epoch(self):
        if not self.i < 5:
            self.t += 1
        return False if self.i < 5 else True

    def result(self):
        print("epoch_step counts: ", self.i)
        print("terminate counts: ", self.t)

env = CartPole()
# model is assumed to be a Q-network defined beforehand (e.g. the rm.Sequential network above)
dqn = DQN(env, model)
dqn.fit(random_step=0, epoch=2, epoch_step=1000)
env.result()

## Results:
## epoch_step counts:  5
## terminate counts:  2