On the role of time horizons in reinforcement learning
Abstract: The paradigm of Reinforcement Learning has achieved exceptional results in recent years, largely through the use of Deep Neural Networks and their ability to cope with large and complex state and action spaces. The most fundamental characteristic of Reinforcement Learning, which sets it apart from other machine learning paradigms, is the notion of time horizons and the accompanying difficulties, which intensify on longer time scales. This holds in particular for one of the most prominent and important representatives of the field: Q-learning. This thesis therefore examines four limiting challenges of Q-learning from the perspective of time, namely its sample complexity, the difficulty of incorporating domain knowledge, the difficulty of reward design and its inherent long-term dependency on many consecutive decision steps, and attempts to address them through auxiliary estimates on shorter time scales. Specifically, this thesis investigates how single-step dynamics models, fixed-horizon auxiliary costs, immediate rewards estimated from expert demonstrations and action-values of different horizons can be exploited to alleviate the aforementioned limitations of Q-learning and thereby enhance its real-world applicability.
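For reference, and not quoted from the thesis itself, the standard one-step Q-learning update that the following contributions build upon can be written as

\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big],
\]

where the discount factor \(\gamma\) ties the estimate to an effectively unbounded time horizon; the auxiliary estimates discussed below all operate on shorter or explicitly bounded horizons.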
In a first step, model-free Q-learning is combined with a model-based imagination module that augments collected transitions with artificial rollouts. To account for the inferior asymptotic performance of current model-based approaches, an uncertainty measure of the value function is derived that restricts artificial data to a targeted use, resulting in the Model-assisted Bootstrapped Deep Deterministic Policy Gradient algorithm. While this imagined exploration reduces the amount of real exploration needed, the common update step of Q-learning still has to reason over, and therefore approximate values for, all actions in the action space, which can lead the system to select actions as interim maxima that are known to be undesired or non-optimal.

In a second step, this thesis thus proposes Deep Q-networks with Long-term Action-space Shaping to account for a priori known desiderata directly in the Q-update. A consecutive bootstrapping scheme of auxiliary costs is presented that estimates the expected costs over a fixed and limited horizon for a rollout of the target policy. This allows discounting to be omitted within the auxiliary cost estimation, which makes the formalization of rules and desires more accessible and intuitive. However, some subjective desires may still be hard to express as a single scalar, be it a threshold in a constraint or a handcrafted reward, especially for end users without a Reinforcement Learning background.

In a third step, this thesis therefore establishes a novel class of Inverse Reinforcement Learning algorithms that estimate the immediate reward, i.e. the ground-truth feedback acting on the shortest time scale, from given expert trajectories. This class spans the full spectrum from model-based tabular Inverse Action-value Iteration (IAVI) to sampling-based Deep Inverse Q-learning with function approximation. In contrast to current state-of-the-art approaches, which commonly require solving the Markov Decision Process (MDP) at hand in a costly inner loop, the proposed algorithms need to solve it only once. It is shown that IAVI thereby reduces training time by multiple orders of magnitude while being more consistent with respect to the true reward of a task. Combined with action-space shaping, the proposed framework furthermore provides the option to restrict the behavior to be as close as possible to the expert demonstrations while remaining within a valid range.

Estimated dynamics and immediate rewards, along with other peculiarities common in real-world scenarios such as incorrect sensor measurements or noisy action execution, increase the variance of this important feedback signal. As a last step, this thesis therefore introduces two novel Temporal-difference formulations corresponding to different segments of the long-term action-value: (1) Truncated Q-functions, representing the first n steps of a rollout of the target policy of the full action-value, and (2) Shifted Q-functions, representing the expected return after this truncated rollout. This decomposition, which leads to the Composite Q-learning algorithm, offers the possibility of independent hyperparameter optimization for different segments of the long-term value estimate. It is shown that such a composition of the Q-value enhances data-efficiency and furthermore reduces the sensitivity of vanilla Q-learning to stochastic rewards.
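As an illustration of the fixed-horizon auxiliary costs of the second step, a bounded cost estimate can be bootstrapped consecutively over the horizon; the following sketch uses notation chosen here for illustration (cost function \(c\), horizon \(H\), target policy \(\pi\)) and is not necessarily the exact formulation of the thesis:

\[
C^{1}(s, a) = c(s, a), \qquad
C^{h}(s, a) = c(s, a) + \mathbb{E}_{s'}\!\left[ C^{h-1}\big(s', \pi(s')\big) \right] \quad \text{for } h = 2, \dots, H.
\]

Since the horizon \(H\) is fixed and finite, no discounting is required, and a constraint threshold can be stated directly as accumulated cost over the next \(H\) steps.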
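Similarly, the decomposition of the last step can be sketched by splitting the action-value of the target policy \(\pi\) into a truncated and a shifted part; again, this is a schematic identity in notation chosen here rather than the thesis's exact definitions:

\[
Q^{\pi}(s_t, a_t)
= \underbrace{\mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} \right]}_{\text{Truncated Q-function}}
\; + \;
\underbrace{\mathbb{E}_{\pi}\!\left[ \gamma^{n} \, Q^{\pi}\big(s_{t+n}, \pi(s_{t+n})\big) \right]}_{\text{Shifted Q-function}},
\]

where both summands can be learned with separate temporal-difference targets and, as noted above, tuned with separate hyperparameters.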
The different approaches are evaluated in various control domains: robot simulation tasks, a high-level decision-making task for autonomous driving on highways, a neural decoding setting and the Objectworld benchmark. It is shown that all methods presented in this thesis push the state of the art by alleviating the four aforementioned limitations of vanilla Q-learning.
- Location: Deutsche Nationalbibliothek Frankfurt am Main
- Extent: Online resource
- Language: English
- Notes: Universität Freiburg, Dissertation, 2022
- Keywords: Reinforcement learning; Machine learning; Neural network; Artificial intelligence; Robotics; Neuroscience
- Event: Publication
  - (where): Freiburg
  - (who): Universität
  - (when): 2022
- Creator
- Involved persons and organizations
- DOI: 10.6094/UNIFR/232102
- URN: urn:nbn:de:bsz:25-freidok-2321027
- Rights information: Open Access; access to the object is unrestricted.
- Last updated: 15.08.2025, 07:24 CEST
Data partner: Deutsche Nationalbibliothek