Active Attention for Target Detection and Recognition in Robot Vision
Baras, John S
In this thesis, we address problems in building an efficient and reliable target detection and recognition system for robot applications, where the vision module is only one component of the overall system executing the task. The different modules interact with each other to achieve the goal, and in this interaction the role of vision is not only to recognize but also to select what and where to process. In other words, attention is an essential process for efficient task execution. We introduce attention mechanisms into the recognition system that serve the overall system at different levels of integration, and formulate four problems as follows. At the most basic level of integration, attention interacts with vision only. We consider the problem of detecting a target in an input image using a trained binary classifier of the target and formulate target detection as a sampling process. The goal is to localize the windows containing targets in the image, and attention controls which part of the image to process next. We observe that the detector's response scores of sampled windows fade gradually from the peak-response window within the detection area, and we approximate this scoring pattern with an exponential decay function. Exploiting this property, we propose an active sampling procedure that detects the target efficiently while avoiding an exhaustive and expensive search over all possible window locations. With more knowledge about the target, we describe it as template graphs over segmented surfaces. Constraint functions are defined to find node and edge correspondences between an input scene graph and the target's template graph. We propose to introduce recognition early in the traditional candidate-proposal process to achieve fast and reliable detection. Target detection then becomes finding subgraphs of the segmented input scene graph that match the template graphs.
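The score-guided sampling idea above can be illustrated with a minimal sketch. This is not the thesis's actual algorithm; it assumes a 1-D grid of window positions, a hypothetical `score_fn` (the classifier response), and a known exponential decay rate, and uses the decay model to derive an upper bound on unseen scores so that sampling concentrates near promising windows instead of scanning exhaustively.

```python
import numpy as np

def active_sampling_detect(score_fn, grid_size, n_init=20, n_iters=120,
                           decay=0.1, rng=None):
    """Illustrative sketch of score-guided active sampling on a 1-D grid.

    score_fn(x) returns the classifier's (positive) response at window x.
    Assumes responses decay roughly exponentially with distance from the
    target window: if s_o = s_peak * exp(-decay * |o - peak|) at a probed
    position o, then any position x satisfies
    score(x) <= s_o * exp(decay * |x - o|), giving an upper bound that
    guides where to probe next.
    """
    rng = rng or np.random.default_rng(0)
    positions = np.arange(grid_size)
    scores = np.full(grid_size, -np.inf)

    # Seed with a few uniformly random probes.
    for x in rng.choice(grid_size, size=n_init, replace=False):
        scores[x] = score_fn(x)

    for _ in range(n_iters):
        observed = np.where(np.isfinite(scores))[0]
        # Tightest exponential-decay upper bound from all observed probes.
        dist = np.abs(positions[:, None] - observed[None, :])
        upper = (scores[observed][None, :] * np.exp(decay * dist)).min(axis=1)
        upper[observed] = -np.inf          # never re-probe a window
        x = int(np.argmax(upper))          # most promising unseen window
        scores[x] = score_fn(x)

    # Best-scoring window probed so far.
    return int(np.argmax(np.where(np.isfinite(scores), scores, -np.inf)))
```

In practice the probes would be 2-D window locations and the decay rate would be estimated from detector responses, but the early-rejection logic is the same: windows whose upper bound falls below the current best need never be evaluated.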
In this problem, attention provides the order in which constraints are checked during graph matching, and a well-chosen sequence helps filter out negatives early, reducing computation time. We put forward a sub-optimal checking order and prove that its time cost is bounded relative to the optimal checking sequence, which is not obtainable in polynomial time. Experiments on rigid and non-rigid object detection validate our pipeline. With more freedom in control, we allow the robot to actively choose another viewpoint when the current view cannot deliver a reliable detection and recognition result. We develop a practical viewpoint control system and apply it to two human-robot interaction applications, where the detection task becomes more challenging due to the additional randomness introduced by the human. Here, attention represents an active process of deciding the location of the camera. Our viewpoint selection module not only considers the viewing-condition constraints of the vision algorithms but also incorporates low-level robot kinematics to guarantee the reachability of the desired viewpoint. By selecting viewpoints quickly with a linear-time score function, the system delivers a smooth user interaction experience. Additionally, we provide a learning-from-human-demonstration method to obtain score-function parameters that better serve the task's preferences. Finally, when recognition results from multiple sources under different environmental factors are available, attention determines how to fuse the observations into a reliable output. We consider the problem of object recognition in 3D using an ensemble of attribute-based classifiers, propose two new concepts to improve classification in practical situations, and demonstrate them in an approach for recognition from point-cloud data.
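The benefit of a good constraint-checking order can be sketched as follows. This is a simplified stand-in for the thesis's ordering result, not its actual method: the hypothetical `order_constraints` greedily ranks each constraint by its estimated rejection rate per unit cost on sample candidates, so that cheap, highly selective checks run first and most negatives are discarded before the expensive checks ever execute.

```python
def order_constraints(constraints, candidates):
    """Greedy ordering heuristic (illustrative, not the thesis's method).

    Each constraint is a (check_fn, cost) pair.  We estimate how often
    each check rejects a candidate on a sample set, then sort by
    rejection rate per unit cost, descending, so that the most
    cost-effective filters come first.
    """
    stats = []
    for check, cost in constraints:
        rejected = sum(1 for c in candidates if not check(c))
        reject_rate = rejected / max(len(candidates), 1)
        stats.append((reject_rate / cost, check, cost))
    stats.sort(key=lambda t: -t[0])
    return [(check, cost) for _, check, cost in stats]

def match(candidate, ordered_constraints):
    """Short-circuit evaluation: stop at the first failed constraint."""
    return all(check(candidate) for check, _ in ordered_constraints)
```

With short-circuiting, the expected cost of rejecting a negative candidate depends heavily on where the selective checks sit in the sequence, which is exactly why the checking order matters.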
First, we study the impact of the distance between the camera and the object on the classifier's accuracy and propose an approach that incorporates this distance into the decision making. Second, to avoid the difficulties arising from the lack of representative training examples when learning an optimal threshold, our attribute classifier uses two threshold values, distinguishing a positive, a negative, and an uncertainty class instead of relying on a single cutoff. We prove the theoretical correctness of this approach for an active agent that can observe the object multiple times.
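The two-threshold decision rule and its interaction with multiple observations can be sketched in a few lines. The threshold values and the averaging of scores below are illustrative assumptions, not the thesis's exact formulation: scores inside the uncertainty band defer the decision, prompting the active agent to take another observation.

```python
def classify_with_rejection(score, t_low, t_high):
    """Two-threshold attribute decision (illustrative thresholds).

    Instead of a single cutoff, scores at or above t_high are positive,
    scores at or below t_low are negative, and anything in between is
    'uncertain' and deferred.
    """
    if score >= t_high:
        return "positive"
    if score <= t_low:
        return "negative"
    return "uncertain"

def active_recognize(observe, t_low=0.3, t_high=0.7, max_views=10):
    """Observe repeatedly until the running mean score leaves the
    uncertainty band or the view budget runs out.  Returns the label and
    the number of observations used."""
    scores = []
    for _ in range(max_views):
        scores.append(observe())
        mean = sum(scores) / len(scores)
        label = classify_with_rejection(mean, t_low, t_high)
        if label != "uncertain":
            return label, len(scores)
    return "uncertain", len(scores)
```

The point of the rule is that an agent able to gather more views never has to commit on a borderline score, which is what makes the two-threshold scheme well suited to an active observer.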