Control Mechanism for Hierarchical Reinforcement Learning Agent


  • Filip Pavlove, Comenius University in Bratislava



Reinforcement learning is a branch of machine learning focused on developing artificially intelligent agents that solve sequential decision-making problems in an environment.

Thanks to reinforcement learning (RL) models, the roles of dopamine and the cortico-basal ganglia-thalamo-cortical (CBGTC) circuits have become better understood in recent years. One theory of the CBGTC circuits suggests that the basal ganglia (BG) play a crucial role in action selection, while the action candidates are initially generated in the cortex [1].

One of the biggest challenges associated with the majority of reinforcement learning models is their inability to properly represent multiple levels of temporal abstraction, which are critical for extended courses of action over a broad range of time scales.

To tackle such challenges, the field of hierarchical reinforcement learning (HRL) has been studied closely over the past two decades. In particular, the pivotal work of Sutton et al. [2] extends the usual notion of primitive actions to the more general framework of options, which admit temporally extended courses of action.
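To give an intuition for the options framework of [2], the following is a minimal sketch in Python; the class and field names (and the toy `go_right` option) are our illustration, not notation from the paper:

```python
from dataclasses import dataclass
from typing import Callable, Set

# An option in the sense of Sutton et al. [2] is a triple:
# an initiation set I, an intra-option policy pi, and a
# termination condition beta (a probability per state).
@dataclass
class Option:
    initiation_set: Set[int]             # states where the option may start
    policy: Callable[[int], int]         # maps state -> primitive action
    termination: Callable[[int], float]  # probability of terminating in state

    def can_start(self, state: int) -> bool:
        return state in self.initiation_set

# Toy example: an option that always takes action 1 ("move right")
# and terminates with certainty once the agent reaches state 3.
go_right = Option(
    initiation_set={0, 1, 2},
    policy=lambda s: 1,
    termination=lambda s: 1.0 if s == 3 else 0.0,
)
```

A primitive action is then just the special case of an option that terminates after one step, which is what makes the framework a strict generalization.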

The main objective of this project is to study a hierarchical reinforcement learning system inspired by the CBGTC circuits. We hypothesize that we will observe a division of tasks between actors at the lower levels, which could be interpreted as the emergence of skills.


We will test and compare the performance of the agent inspired by the CBGTC circuits with and without a hierarchical control mechanism in the Atari game (Pac-Man) environment. Both implementations will be trained with the proximal policy optimization (PPO) algorithm [3], an on-policy policy-gradient method for training RL agents.
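The core of PPO [3] is its clipped surrogate objective; a self-contained NumPy sketch of that objective (the function name and batch layout are ours for illustration) is:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective from PPO [3]:
    L = E[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)],
    where r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t) and A_t is the
    estimated advantage. Both inputs are arrays over a batch.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

# Once the probability ratio drifts outside [1 - eps, 1 + eps],
# the clipped term removes the incentive to move the policy
# further in that direction, keeping updates conservative.
loss = ppo_clip_objective(np.array([1.5, 0.9]), np.array([1.0, -1.0]))
```

In practice this objective is maximized by gradient ascent on the policy parameters, alongside a value-function loss and an entropy bonus, as described in [3].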

Agents will be quantitatively compared on metrics such as the average number of steps and the average collected reward per episode. Moreover, we will qualitatively examine the agents' behavior, looking for signs of a potential task division between the actors at the lower level in the agent with hierarchical control.
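The quantitative comparison reduces to simple per-episode averages; a sketch of the computation, assuming a hypothetical episode log of `(n_steps, total_reward)` pairs, is:

```python
def summarize(episodes):
    """Average steps and collected reward per episode.

    `episodes` is a hypothetical log: a list of
    (n_steps, total_reward) tuples, one per episode.
    """
    n = len(episodes)
    steps = [s for s, _ in episodes]
    rewards = [r for _, r in episodes]
    return {
        "avg_steps": sum(steps) / n,
        "avg_reward": sum(rewards) / n,
    }

# Two toy episodes: 120 steps with reward 10, 80 steps with reward 6.
stats = summarize([(120, 10.0), (80, 6.0)])
```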


[1] T. V. Maia and M. J. Frank, "From reinforcement learning models to psychiatric and neurological disorders," Nature Neuroscience, vol. 14, no. 2, pp. 154–162, 2011.

[2] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1–2, pp. 181–211, 1999.

[3] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.