Reinforcement Learning for Improved Utility of Simulation-Based Training

Abstract: Team training in complex domains often requires substantial resources, e.g. vehicles, machines, and role-players. For this reason, it may be difficult to realise efficient and effective training scenarios in a real-world setting. Instead, part of the training can be conducted in synthetic, computer-generated environments. In these environments, trainees can operate simulators instead of real vehicles, while synthetic actors can replace human role-players to increase the complexity of the simulated scenario at low operating cost. However, constructing behaviour models for synthetic actors is challenging, especially for end users, who typically do not have expertise in artificial intelligence. In this dissertation, we study how machine learning can be used to simplify the construction of intelligent agents for simulation-based training. A simulation-based air combat training system is used as a case study.

The contributions of the dissertation are divided into two parts. The first part aims at improving the understanding of reinforcement learning in the domain of simulation-based training. First, a user study is conducted to identify important capabilities and characteristics of learning agents intended to support the training of fighter pilots. The study identifies that one of the most important capabilities of learning agents in this context is that their behaviour can be adapted to different phases of training, as well as to the training needs of individual human trainees. Second, methods for learning how to coordinate with other agents are studied in simplified training scenarios, to investigate how the design of the agent's observation space, action space, and reward signal affects learning performance. The experiments show that temporal abstractions and hierarchical reinforcement learning can improve the efficiency of learning, while also supporting the modelling of doctrinal behaviour. In more complex settings, curriculum learning and related methods are expected to help find novel tactics even when sparse, abstract reward signals are used. Third, based on the results of the user study and the practical experiments, a system concept for a user-adaptive training system is developed to support further research.

The second part of the contributions focuses on methods for utility-based multi-objective reinforcement learning, which incorporates knowledge of the user's utility function in the search for policies that balance multiple conflicting objectives. Two new agents for multi-objective reinforcement learning are proposed: the Tunable Actor (T-Actor) and the Multi-Objective Dreamer (MO-Dreamer). T-Actor provides decision support to instructors by learning a set of Pareto-optimal policies, represented by a single neural network conditioned on objective preferences (sketched below). This enables tuning of the agent's behaviour to fit trainees' current training needs. Experimental evaluations in gridworlds and in the target system show that T-Actor reduces the number of training steps required for learning. MO-Dreamer adapts online to changes in users' utility, e.g. changes in training needs. It does so by learning a model of the environment, which it uses for anticipatory rollouts with a diverse set of utility functions, exploring which policy to follow to optimise the return for the current objective preferences (also sketched below). An experimental evaluation shows that MO-Dreamer outperforms prior model-free approaches in terms of experienced regret, for frequent as well as sparse changes in utility.

Overall, the research conducted in this dissertation contributes to improved knowledge of how to apply machine learning methods to the construction of simulation-based training environments. While our focus was on air combat training, the results are general enough to be applicable in other domains.
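To make the idea of a single policy network conditioned on objective preferences more concrete, the following is a minimal sketch in PyTorch. All names, layer sizes, and interfaces are illustrative assumptions made for this page, not the architecture used in the dissertation.

import torch
import torch.nn as nn

class PreferenceConditionedPolicy(nn.Module):
    """One network for a whole family of trade-offs: the preference
    weights are part of the input, so behaviour can be tuned at run
    time without retraining (hypothetical T-Actor-style sketch)."""

    def __init__(self, obs_dim, n_objectives, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, weights):
        # Condition the policy on the objective preferences.
        return self.net(torch.cat([obs, weights], dim=-1))

def linear_utility(vector_return, weights):
    # A common utility model in utility-based multi-objective RL:
    # u(w, R) = w . R, with one weight per objective.
    return (weights * vector_return).sum(dim=-1)

# An instructor could, e.g., shift weight towards one objective for a
# novice trainee (the objective split is hypothetical):
policy = PreferenceConditionedPolicy(obs_dim=8, n_objectives=2, n_actions=4)
obs = torch.zeros(1, 8)
weights = torch.tensor([[0.8, 0.2]])
action_logits = policy(obs, weights)
utility = linear_utility(torch.tensor([[1.0, -0.5]]), weights)

Because the preferences are inputs rather than fixed hyperparameters, a single set of parameters can represent an approximation of a whole set of Pareto-optimal policies, which is what makes run-time tuning to a trainee's current needs possible.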
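Similarly, the anticipatory rollouts attributed to MO-Dreamer can be illustrated with a small, self-contained sketch: imagine trajectories in a learned world model under several candidate preference vectors, score each imagined vector return with the user's current utility, and act accordingly. The world model and policy below are stand-in stubs, and all interfaces are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)

def world_model_step(state, action):
    # Stand-in for a learned dynamics model: returns the next state
    # and a two-objective reward vector.
    next_state = state + 0.1 * action + 0.01 * rng.standard_normal(state.shape)
    reward_vec = np.array([-np.abs(next_state).sum(), -np.abs(action).sum()])
    return next_state, reward_vec

def policy(state, weights):
    # Stand-in preference-conditioned policy.
    return np.tanh(weights[0] - weights[1]) * np.ones_like(state)

def imagined_return(state, weights, horizon=10, gamma=0.99):
    # Roll the model forward under one candidate preference vector and
    # accumulate the discounted vector return.
    ret, s = np.zeros(2), state.copy()
    for t in range(horizon):
        a = policy(s, weights)
        s, r = world_model_step(s, a)
        ret += (gamma ** t) * r
    return ret

def select_preference(state, utility_weights, candidates):
    # Anticipate: score each candidate by the utility of its imagined
    # return, then follow the best one for the current utility.
    scores = [utility_weights @ imagined_return(state, w) for w in candidates]
    return candidates[int(np.argmax(scores))]

state = np.zeros(3)
candidates = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
print(select_preference(state, np.array([0.7, 0.3]), candidates))

When the user's utility changes, e.g. because a trainee's needs shift, only utility_weights changes; the anticipatory step re-selects the conditioning vector from imagined experience rather than new environment interaction, which is the intuition behind the regret comparison mentioned in the abstract.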
