Multi-Modal Scene Understanding for Robotic Grasping

University dissertation from Stockholm : KTH Royal Institute of Technology

Abstract: Current robotics research is largely driven by the vision of creatingan intelligent being that can perform dangerous, difficult orunpopular tasks. These can for example be exploring the surface of planet mars or the bottomof the ocean, maintaining a furnace or assembling a car.   They can also be more mundane such as cleaning an apartment or fetching groceries. This vision has been pursued since the 1960s when the first robots were built. Some of the tasks mentioned above, especially those in industrial manufacturing, arealready frequently performed by robots. Others are still completelyout of reach. Especially, household robots are far away from beingdeployable as general purpose devices. Although advancements have beenmade in this research area, robots are not yet able to performhousehold chores robustly in unstructured and open-ended environments givenunexpected events and uncertainty in perception and execution.In this thesis, we are analyzing which perceptual andmotor capabilities are necessaryfor the robot to perform common tasks in a household scenario. In that context, an essential capability is tounderstand the scene that the robot has to interact with. This involvesseparating objects from the background but also from each other.Once this is achieved, many other tasks becomemuch easier. Configuration of objectscan be determined; they can be identified or categorized; their pose can be estimated; free and occupied space in the environment can be outlined.This kind of scene model can then inform grasp planning algorithms to finally pick up objects.However, scene understanding is not a trivial problem and evenstate-of-the-art methods may fail. Given an incomplete, noisy andpotentially erroneously segmented scene model, the questions remain howsuitable grasps can be planned and how they can be executed robustly.In this thesis, we propose to equip the robot with a set of predictionmechanisms that allow it to hypothesize about parts of the sceneit has not yet observed. Additionally, the robot can alsoquantify how uncertain it is about this prediction allowing it toplan actions for exploring the scene at specifically uncertainplaces. We consider multiple modalities includingmonocular and stereo vision, haptic sensing and information obtainedthrough a human-robot dialog system. We also study several scene representations of different complexity and their applicability to a grasping scenario. Given an improved scene model from this multi-modalexploration, grasps can be inferred for each objecthypothesis. Dependent on whether the objects are known, familiar orunknown, different methodologies for grasp inference apply. In thisthesis, we propose novel methods for each of these cases. Furthermore,we demonstrate the execution of these grasp both in a closed andopen-loop manner showing the effectiveness of the proposed methods inreal-world scenarios.