Data-Efficient Learning of Semantic Segmentation

Abstract: Semantic segmentation is a fundamental problem in visual perception, with applications ranging from robotics to autonomous vehicles, and recent approaches based on deep learning have achieved excellent performance. However, training such systems generally requires very large datasets of annotated images. In this thesis we investigate and propose methods and setups that use unlabelled data to increase performance, or that use limited application-specific data, in order to reduce the need for large annotated datasets when learning semantic segmentation.

In the first paper we study semantic video segmentation. We present a deep end-to-end trainable model that uses labelling information propagated to unlabelled frames, in addition to sparsely labelled frames, to predict semantic segmentation. Extensive experiments on the Cityscapes and CamVid datasets show that the model can improve accuracy and temporal consistency by using extra unlabelled video frames during training and testing.

In the second, third and fourth papers we study active learning for semantic segmentation in an embodied context, where navigation is part of the problem: an agent should explore a building and query for the labelling of informative views that improve its visual perception. In the second paper we introduce the embodied visual active learning problem, and propose and evaluate a range of methods, from heuristic baselines to a fully trainable agent using reinforcement learning (RL), on the Matterport3D dataset. We show that the learned agent outperforms several comparable pre-specified baselines.

In the third paper we study the embodied visual active learning problem in a lifelong setup, where visual learning spans the exploration of multiple buildings and the learning in one scene should influence the active learning in the next, e.g. by not requesting annotations for object classes that are already segmented accurately. We introduce new methodology to encourage global exploration of scenes, via an RL formulation that combines local navigation with frontier-based global exploration. We show that the RL agent can learn adaptive behaviour, such as annotating less frequently once it has already explored a number of buildings.

Finally, in the fourth paper we study the embodied visual active learning problem with region-based active learning. Instead of querying for the annotation of a whole image, the agent can query for annotations of image regions, and we show that annotating regions is significantly more labelling efficient than annotating full images.
