How to Use - Detail -

This section lists details that users should know before using the module.

1- Env Argument Structure

For the env argument, if a list (or tuple) of size 2 is given, the environments are interpreted as [training_env, test_env]. training_env can itself be a list (or tuple) when the algorithm uses multiple agents, such as A2C; in that case the length of training_env must equal num_worker and every element must be a BaseEnv-inherited object. If env is set to a single BaseEnv-inherited object, the environment is deep-copied to match num_worker, and the environment specified as env is used as test_env.

Good Examples:

from renom_rl.discrete.a2c import A2C
from renom_rl.environ.openai import CartPole00

# q_network: a model with the structure required by A2C (see section 2)
custom_env = CartPole00()
test_custom_env = CartPole00()
custom_env_list = [CartPole00() for _ in range(8)]

# single env: deep-copied for each worker and also used as test_env
_ = A2C(custom_env, q_network)

# [training_env, test_env]: training_env is deep-copied for each worker
_ = A2C([custom_env, test_custom_env], q_network, num_worker=8)

# [training_env_list, test_env]: len(custom_env_list) == num_worker == 8
_ = A2C([custom_env_list, test_custom_env], q_network, num_worker=8)

Bad Examples:

custom_env_list = [CartPole00() for _ in range(8)]

# len(custom_env_list) is 8, which does not match num_worker=9
_ = A2C([custom_env_list, test_custom_env], q_network, num_worker=9)

2- Network(Agent) Structure

Each algorithm requires a specific network structure. For example, the discrete version of A2C needs a network that outputs both the actor and the critic. Follow the documentation of each algorithm for its required structure.

Example:

# For DQN: the network outputs one node per action (a = number of actions)
class DQN_Model(rm.Model):

    def __init__(self, a=2):
        self.d1 = rm.Dense(30)
        self.r1 = rm.Relu()
        self.d2 = rm.Dense(a)
        self.act = rm.Softmax()

    def forward(self, x):
        h = self.d1(x)
        h = self.r1(h)
        h = self.d2(h)
        act = self.act(h)

        return act

# For A2C (discrete): the network returns both the actor output and the critic output
class A2C_Discrete(rm.Model):
    def __init__(self, m=2, v=1):
        self.d1 = rm.Dense(200)
        self.r1 = rm.Relu()
        self.d2 = rm.Dense(m)
        self.act = rm.Softmax()
        self.val = rm.Dense(v)

    def forward(self, x):
        h = self.d1(x)
        h = self.r1(h)
        act = self.act(self.d2(h))   # actor head: action probabilities
        val = self.val(h)            # critic head: state value, from the shared hidden layer

        return act, val

3- Network Weights Initialization

Unless initialize is set to False, the network weights are reinitialized when __init__ and fit are called.

To keep the module from reinitializing the network, set initialize to False.

Example:

from renom_rl.discrete.dqn import DQN

algorithm = DQN(custom_env, q_network, initialize=False)
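
A minimal sketch of a typical use case, assuming custom_env and q_network are defined as in section 1: train an agent, then build a second agent with initialize=False so that it reuses the trained weights instead of resetting them.

from renom_rl.discrete.dqn import DQN

dqn = DQN(custom_env, q_network)                  # weights are reinitialized here
dqn.fit(random_step=0, epoch=1, epoch_step=100)   # trains q_network

# initialize=False keeps the weights trained above instead of resetting them
dqn_continued = DQN(custom_env, q_network, initialize=False)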

4- Logger

Users can log data with renom_rl.utility.logger during training and testing. For details, see renom_rl.utility.logger.Logger.

from renom_rl.utility.logger import Logger

class Original(Logger):

    def __init__(self, log_key):
        super(Original, self).__init__(log_key, record_episode_base=False)
        self.reward_previous = 0
        self.reward = 0
        self.total_list = []
        self.state = 0
        self.total = 0

    def logger(self, **log):
        self.state = log["state"]
        self.reward = log["reward"]
        self.total += log["reward"]
        return "state----{}/reward---{}/total----{}".format(self.state, self.reward, self.total)


import renom as rm
from renom_rl.environ.openai import CartPole00
from renom_rl.discrete.dqn import DQN

network = rm.Sequential([rm.Dense(32), rm.Relu(), rm.Dense(32), rm.Relu(), rm.Dense(2)])

logger = Original(["reward"])

dqn = DQN(env=CartPole00(), q_network=network, logger=logger)
dqn.fit(epoch=1, epoch_step=500)

# result
# state----[-0.00528582  0.76312646 -0.00763515 -1.1157825 ]/reward---0/total-----39: 100%|██████████████████████████████████████| 500/500 [00:01<00:00, 438.39it/s]

5- Init and Fit Arguments

__init__ and fit have nearly the same arguments. The arguments that can only be set at __init__ are the environment, network, and logger related arguments. We believe these are the essential objects for reinforcement learning.

../_images/02-image_of_rl.png

The arguments that can only be set at fit are those that affect the length of learning, such as epoch, epoch_step, etc.

For common arguments, the following diagram explains what values will be used during fit:

../_images/02-fit_value_ref.png

Test arguments are also set as shown in the diagram above.

Example:

# loss: default value, gamma: specified at __init__, action_filter: changed at fit
# (custom_env and q_network are defined as in section 1)
from renom_rl.utility.filter import EpsilonCFilter
dqn = DQN(custom_env, q_network, gamma=0.99)

print("---before---")
info = dqn.info_init()
print("loss(id):", id(info["loss_func"]),
      "\ngamma(id):", info["gamma"],
      "\nActionFilter(id):", id(info["action_filter"]))
print()
dqn.fit(random_step=0, epoch=1, epoch_step=10, action_filter=EpsilonCFilter(epsilon=0.1))
print()
print("---after---")
info = dqn.info_fit()
print("loss(id):", id(info["loss_func"]),
      "\ngamma(id):", info["gamma"],
      "\nActionFilter(id):", id(info["action_filter"]))

### The result will show as follows:
###
###    ---before---
###    loss(id): 112130625152
###    gamma(id): 0.99
###    ActionFilter(id): 112130625040
###
###    Run random 0 step for storing experiences
###
###    ---after---
###    loss(id): 112130625152
###    gamma(id): 0.99
###    ActionFilter(id): 112130625264

6- Filter

Filters are applied to the network output in order to feed its values as action(s). There are several filter objects, and which objects can be set depends on the algorithm the user chooses.

The following diagram gives a simplified view of DQN and shows where the filter operates:

../_images/05-filter.png

DQN has node_selector and action_filter as arguments. DQN takes a renom_rl.utility.DiscreteNodeChooser object as the node_selector value and a renom_rl.utility.EpsilonGreedyFilter object as the action_filter value. By default, node_selector is MaxNodeChooser() and action_filter is EpsilonSLFilter(). Users can instead set node_selector to ProbNodeChooser() (note that ProbNodeChooser() chooses values based on outputs between 0 and 1). The same goes for action_filter.
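
A minimal sketch of overriding these defaults, assuming custom_env and q_network are defined as in section 1; the import path of ProbNodeChooser is an assumption based on the module names above, and EpsilonCFilter is the filter used in section 5.

from renom_rl.discrete.dqn import DQN
from renom_rl.utility import ProbNodeChooser             # assumed import path
from renom_rl.utility.filter import EpsilonCFilter       # as used in section 5

dqn = DQN(custom_env, q_network,
          node_selector=ProbNodeChooser(),                # sample actions from the output probabilities
          action_filter=EpsilonCFilter(epsilon=0.1))      # epsilon-based exploration filter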

Other algorithms (such as A2C) also have filters, arranged as shown above. Read the documentation for more information.

7- Other BaseEnv methods

start() ~ close() methods

In renom_rl.environ.env.BaseEnv, there are start() ~ close() methods.

These run at certain points during the training/testing phase. The timing at which each method runs is as follows:

../_images/03-env_start_close.png
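
A minimal sketch of overriding these hooks, assuming CartPole00 as in the examples above and that start() and close() run at the beginning and end of a fit/test run as the diagram indicates; the print statements are only illustrative.

from renom_rl.environ.openai import CartPole00

class CartPoleWithHooks(CartPole00):

    def start(self):
        # assumed to run once when the fit/test run starts
        print("run started")

    def close(self):
        # assumed to run once when the fit/test run finishes
        print("run finished")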

terminate() ~ stop_epoch() methods

In renom_rl.environ.env.BaseEnv, there are terminate(), terminate_epoch(), and stop_epoch() methods.

These stop or terminate the training/testing phase. Each behaves as follows:

../_images/04-terminate.png

Note: stop_epoch() (or terminate_epoch()) runs at every epoch step, so unless its return value is reset to False at the beginning of an iteration, it will keep stopping or terminating the epoch run. Always make sure that stop_epoch() (or terminate_epoch()) returns False at the beginning of an epoch run.

For more details, see renom_rl.environ.env.BaseEnv.

Example:

class CartPole(CartPole00):
    def __init__(self):
        self.i = 0
        self.t = 0

        CartPole00.__init__(self)

    # overriding start, epoch, epoch_step, test_epoch_step, terminate_epoch
    def start(self):
        self.i = 0
        self.t = 0

    def epoch(self):
        self.i=0

    def epoch_step(self):
        self.i +=1

    def test_epoch_step(self):
        """if not overridden, epoch_step will run"""
        pass

    def terminate_epoch(self):
        if not self.i < 5:
            self.t += 1
        return False if self.i < 5 else True

    def result(self):
        print("epoch_step counts: ",self.i)
        print("terminate counts: ",self.t)

env = CartPole()

dqn = DQN(env, q_network)   # q_network: the Q-network model defined earlier
dqn.fit(random_step=0, epoch=2, epoch_step=1000)
env.result()


## Results:
## epoch_step counts:  5
## terminate counts: 2