Empowerment-Guided Safety for Robotic Reinforcement Learning
Abstract
Introduction
Reinforcement learning agents often lack mechanisms to ensure safe behavior during training or deployment, which hinders real-world applications such as robotics. We propose using information-theoretic intrinsic motivation measures as runtime safety filters. Specifically, we focus on metrics based on the mutual information between actions and future states, which quantify an agent's control capacity and recoverability. Empowerment, the mutual information between action sequences and the observations they induce, is one such measure; by penalizing transitions that lead to low-empowerment states, agents can avoid entering regions from which recovery is unlikely, without resorting to domain-specific heuristics.
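For concreteness, a standard way to write the empowerment of a state s, and the form we intend to estimate variationally, is as the channel capacity between an n-step action sequence and the resulting state; the shaped reward below, with threshold \varepsilon and weight \beta, is a sketch of the penalty we have in mind rather than a finalized design:

\mathcal{E}(s) = \max_{\omega(a_{1:n} \mid s)} I\big(A_{1:n};\, S_n \mid S_0 = s\big),
\qquad
\tilde{r}(s, a, s') = r_{\text{ext}}(s, a) - \beta \, \max\!\big(0,\; \varepsilon - \hat{\mathcal{E}}(s')\big),

where \hat{\mathcal{E}} is a variational estimate of \mathcal{E} and the hinge term penalizes only transitions into states whose estimated empowerment falls below the threshold.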
Methodology
We explore three research questions. First, which intrinsic reward formulations capture recoverability, robustness to disturbances, and the avoidance of irreversible transitions? We evaluate multiple mutual-information estimators to assess their effects on safe behavior. Second, how can intrinsic safety rewards be balanced against extrinsic task objectives to modulate conservativeness? By weighting intrinsic mutual-information costs against extrinsic rewards, we aim to chart a trade-off between safety (measured by reduced violations) and task efficiency. Third, does emphasizing safety through intrinsic rewards during early learning improve sample efficiency in high-risk environments? We hypothesize that starting training with high intrinsic penalties steers agents away from dangerous regions, reducing costly failures and accelerating convergence; once a basic safe skill set is established, gradually relaxing the intrinsic penalties should allow task-specific behaviors to be refined without excessive conservatism.
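One way to operationalize the second and third questions together is to anneal the intrinsic weight over training; the linear schedule below is an illustrative assumption (initial weight \beta_0, floor \beta_{\min}, annealing horizon T), not a committed choice:

\beta_t = \beta_{\min} + (\beta_0 - \beta_{\min}) \max\big(0,\, 1 - t/T\big),
\qquad
r_t = r_{\text{ext}, t} - \beta_t \, c_{\text{int}, t},

where c_{\text{int}, t} \ge 0 is the intrinsic safety cost at step t (for example, the low-empowerment penalty above); a large \beta_t early in training emphasizes safety, and the decay toward \beta_{\min} relaxes conservatism as competence grows.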
Experiments
Building on inverse reward design [1], latent empowerment estimation [2], and dynamics-aware skill discovery [3], our methodology has two phases. In the first, preliminary phase, we analyze an inverted pendulum without handcrafted heuristics, implementing intrinsic rewards (empowerment via variational information bounds, predictive information between successive observations, and explicit barrier penalties near boundaries) and visualizing learning curves under different weightings. In the second phase, we scale to a robotic pick-and-place task in a workspace with designated no-go zones. We systematically vary intrinsic reward coefficients, balancing empowerment-based penalties against extrinsic placement-accuracy rewards, and measure collision rates, recovery occurrences (the robot retreating from low-control states), task success, and samples to convergence, thereby quantifying risk-performance trade-offs across reward configurations.
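As a minimal sketch of how the empowerment-based intrinsic penalty could be estimated in the pendulum phase, the Python snippet below uses the standard variational lower bound on the mutual information between actions and next states under the current policy, in the spirit of [2]; the class name, network sizes, and one-step horizon are illustrative assumptions rather than our final implementation:

import torch
import torch.nn as nn

class EmpowermentPenalty(nn.Module):
    """Variational lower bound on I(A; S' | s), used as a proxy for empowerment.

    For any inverse model q(a | s, s'), E[log q(a | s, s') - log pi(a | s)] lower-bounds
    the mutual information between actions and next states under the behavior policy;
    training q to maximize log q(a | s, s') tightens the bound (illustrative sketch).
    """

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim),  # mean and log-std of q(a | s, s')
        )

    def log_q(self, s, s_next, a):
        mean, log_std = self.inverse_model(torch.cat([s, s_next], dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp()).log_prob(a).sum(dim=-1)

    def penalty(self, s, s_next, a, log_pi_a, eps=1.0):
        # One-step empowerment proxy at the transition (s, a, s'); values below the
        # threshold eps are penalized, matching the hinge term in the shaped reward.
        emp_hat = self.log_q(s, s_next, a) - log_pi_a
        return torch.clamp(eps - emp_hat, min=0.0)

The resulting penalty would be subtracted from the extrinsic reward with the annealed weight \beta_t; longer empowerment horizons or the latent-space estimator of [2] could be substituted without changing this interface.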
Expected Results
We expect penalizing low-empowerment transitions to significantly reduce safety violations, demonstrating empowerment’s effectiveness as an intuitive risk indicator. When combined with predictive information and barrier penalties, intrinsic rewards should yield smoother, more robust trajectories than any single measure. Adjusting intrinsic safety weights will likely reveal a clear trade-off: higher weights produce safer but slower learning, while lower weights improve task performance at the expense of increased risk.
References
[1] D. Hadfield-Menell, S. Milli, P. Abbeel, S. Russell, and A. Dragan, “Inverse Reward Design,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[2] R. Zhao, K. Lu, P. Abbeel, and S. Tiomkin, “Efficient Empowerment Estimation for Unsupervised Stabilization,” May 2021. [Online]. Available: https://arxiv.org/abs/2007.07356.
[3] A. Sharma, S. Gu, S. Levine, V. Kumar, and K. Hausman, “Dynamics-Aware Unsupervised Discovery of Skills,” Feb. 2020. [Online]. Available: https://arxiv.org/abs/1907.01657.
License
Copyright (c) 2025 Anže Rifel

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.