Unsupervised construction of 4D semantic maps in a long-term autonomy scenario

Abstract: Robots are operating for longer times and collecting much more data than just a few years ago. In this setting we are interested in exploring ways of modeling the environment, segmenting out areas of interest and keeping track of the segmentations over time, with the purpose of building 4D models (i.e. space and time) of the relevant parts of the environment.Our approach relies on repeatedly observing the environment and creating local maps at specific locations. The first question we address is how to choose where to build these local maps. Traditionally, an operator defines a set of waypoints on a pre-built map of the environment which the robot visits autonomously. Instead, we propose a method to automatically extract semantically meaningful regions from a point cloud representation of the environment. The resulting segmentation is purely geometric, and in the context of mobile robots operating in human environments, the semantic label associated with each segment (i.e. kitchen, office) can be of interest for a variety of applications. We therefore also look at how to obtain per-pixel semantic labels given the geometric segmentation, by fusing probabilistic distributions over scene and object types in a Conditional Random Field.For most robotic systems, the elements of interest in the environment are the ones which exhibit some dynamic properties (such as people, chairs, cups, etc.), and the ability to detect and segment such elements provides a very useful initial segmentation of the scene. We propose a method to iteratively build a static map from observations of the same scene acquired at different points in time. Dynamic elements are obtained by computing the difference between the static map and new observations. We address the problem of clustering together dynamic elements which correspond to the same physical object, observed at different points in time and in significantly different circumstances. To address some of the inherent limitations in the sensors used, we autonomously plan, navigate around and obtain additional views of the segmented dynamic elements. We look at methods of fusing the additional data and we show that both a combined point cloud model and a fused mesh representation can be used to more robustly recognize the dynamic object in future observations. In the case of the mesh representation, we also show how a Convolutional Neural Network can be trained for recognition by using mesh renderings.Finally, we present a number of methods to analyse the data acquired by the mobile robot autonomously and over extended time periods. First, we look at how the dynamic segmentations can be used to derive a probabilistic prior which can be used in the mapping process to further improve and reinforce the segmentation accuracy. We also investigate how to leverage spatial-temporal constraints in order to cluster dynamic elements observed at different points in time and under different circumstances. We show that by making a few simple assumptions we can increase the clustering accuracy even when the object appearance varies significantly between observations. The result of the clustering is a spatial-temporal footprint of the dynamic object, defining an area where the object is likely to be observed spatially as well as a set of time stamps corresponding to when the object was previously observed. Using this data, predictive models can be created and used to infer future times when the object is more likely to be observed. In an object search scenario, this model can be used to decrease the search time when looking for specific objects.