CN106920250A - Robot target identification and localization method and system based on RGB-D videos
- Publication number
- CN106920250A CN106920250A CN201710078328.9A CN201710078328A CN106920250A CN 106920250 A CN106920250 A CN 106920250A CN 201710078328 A CN201710078328 A CN 201710078328A CN 106920250 A CN106920250 A CN 106920250A
- Authority: CN (China)
- Prior art keywords: target, frame, positioning, video, candidate area
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
Abstract
The invention discloses a robot target recognition and localization method and system based on RGB-D video. Through the steps of target candidate extraction, recognition, confidence estimation based on temporal consistency, target segmentation optimization, and position estimation, the target category is determined in the scene and an accurate spatial position is obtained. The invention uses scene depth information to enhance the spatial-hierarchy perception of the recognition and localization algorithm, and adopts keyframe-based long- and short-term spatio-temporal consistency constraints, which improve video processing efficiency while guaranteeing the identity and association of targets in long-sequence recognition and localization tasks. During localization, cooperative target localization across multiple information modalities is achieved by precisely segmenting the target in the image plane and evaluating the positional consistency of the same target in the depth space. The method has a small computational load, good real-time performance, and high recognition and localization accuracy, and can be applied to robot tasks based on online visual-information parsing and understanding.
Description
Technical Field
The invention belongs to the technical field of computer vision, and more particularly relates to a robot target recognition and localization method and system based on RGB-D video.
Background Art
In recent years, with the rapid development of robotics, machine vision for robot tasks has also received extensive attention from researchers. Recognition and precise localization of targets is an important part of the robot vision problem and a prerequisite for executing subsequent tasks.

Existing target recognition methods generally comprise two steps: extracting a representation of the target to be recognized as the basis for recognition, and matching that representation against the scene to be recognized. Traditional target representations are generally built from geometric shape, target appearance, or extracted local features; such methods often suffer from poor generality, insufficient stability, and weak ability to abstract the target. These representational defects in turn bring difficulties that are hard to overcome in the subsequent matching process.

After the representation of the target is obtained, target matching compares it with features of the scene to identify the target. Broadly, existing methods fall into two classes: region-based matching and feature-based matching. Region-based matching extracts and compares information from local sub-regions of the image, with computational cost proportional to the number of sub-regions to be matched; feature-based methods match typical features in the image, with matching accuracy closely tied to the effectiveness of the feature representation. Both classes place high demands on candidate-region acquisition and feature representation, but owing to the limitations of two-dimensional image information and hand-designed features, they often perform poorly in the complex-environment recognition tasks faced by robots.

Target localization is ubiquitous in industry and daily life, e.g., GPS in outdoor sports, military radar surveillance, and shipborne sonar. Such equipment localizes accurately over a wide operating range but is expensive. Vision-based localization has become a new research hotspot in recent years. Depending on the vision sensor, methods can be roughly divided into those based on monocular sensors, binocular and depth sensors, and panoramic sensors. Monocular sensors are cheap, structurally simple, and easy to calibrate, but their localization accuracy is often poor; panoramic sensors capture complete scene information with high accuracy, but at large computational cost, poor real-time performance, and complex, expensive hardware; depth estimation from binocular vision or direct depth acquisition perceives scene distance well, the system is comparatively simple, real-time operation is easy to achieve, and this approach has attracted increasing attention in recent years. However, research in this area is still in its infancy, and efficient target localization methods that can process RGB-Depth video in real time are still lacking.

Because depth perception is in high demand, most existing robot systems capture RGB-Depth video as their source of visual information; depth provides rich cues for stereoscopic scene perception and for the hierarchical partitioning and localization of complex targets. However, owing to the complexity of robot working scenes and the high computational complexity and cost involved, there is as yet no systematic, fast, and convenient method for RGB-Depth video target recognition and precise localization. Research on indoor robot target recognition and precise localization algorithms based on RGB-Depth video therefore has both strong research value and very broad application prospects.
Summary of the Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a robot target recognition and localization method and system based on RGB-D video. By processing RGB-Depth video captured from the robot's first-person viewpoint, it achieves real-time, accurate target recognition and precise localization of the target in the robot's working environment, thereby assisting complex robot tasks such as target grasping. This solves the technical problem that efficient target localization methods capable of processing RGB-Depth video in real time are currently lacking.

To achieve the above object, according to one aspect of the present invention, a robot target recognition and localization method based on RGB-D video is provided, comprising:

(1) acquiring an RGB-D video frame sequence of the scene containing the target to be recognized and localized;

(2) extracting key video frames from the RGB-D video frame sequence, extracting target candidate regions from the key video frames, and filtering the candidate regions according to the depth information corresponding to each key video frame;

(3) recognizing the filtered candidate regions with a deep network, and ranking the target recognition results by confidence through long-term spatio-temporal association constraints and multi-frame recognition consistency estimation;

(4) performing fast local segmentation of the filtered candidate regions, selecting principal key frames from the key video frames according to the confidence of the recognition results and the temporal intervals between key frames, and extending and cooperatively optimizing the segmented regions across the preceding and following neighboring frames;

(5) determining key feature points in the scene as localization reference points, estimating the camera view angle and camera motion, and, by imposing target-feature consistency and target-position consistency constraints on the recognition and segmentation results of the principal key frames, estimating the cooperative confidence of the target to be localized and determining its precise spatial position.
Preferably, step (2) specifically comprises:

(2.1) determining, by interval sampling or a keyframe selection method, the key video frames used for recognizing the target to be localized;

(2.2) obtaining the target candidate regions in the key video frames with a confidence ranking method based on an objectness prior to form a candidate-region set, and using the depth information corresponding to each key frame to obtain hierarchical attributes inside each candidate region and in its neighborhood, so as to filter, optimize, and re-rank the candidate-region set.
Preferably, step (3) specifically comprises:

(3.1) feeding the candidate regions filtered in step (2) into a trained deep recognition network to obtain, for the key frame corresponding to each filtered region, target recognition predictions and a first confidence for each prediction;

(3.2) evaluating the feature consistency of the predictions across key frames under long-term spatio-temporal association constraints to obtain a second confidence for each prediction, ranking the cumulative confidence formed from the first and second confidences, and further filtering out candidate regions whose cumulative confidence falls below a preset threshold.
Preferably, step (4) specifically comprises:

(4.1) performing a fast target segmentation of the candidate regions obtained in step (3.2) and their extended neighborhoods to obtain an initial segmentation of the target and determine the target boundary;

(4.2) selecting principal key frames from the key video frames, constrained by short-term spatio-temporal consistency and based on the cumulative confidence ranking of step (3.2);

(4.3) under a long-term spatio-temporal consistency constraint and based on the initial segmentation of step (4.1), modeling the appearance of the target, constructing three-dimensional graphs over the principal key frames and their neighboring frames, designing a maximum a posteriori-Markov random field energy function, optimizing the initial segmentation with a graph-cut algorithm, and extending and optimizing the single-frame segmentation result in the frames adjacent to that frame.
Preferably, step (5) specifically comprises:

(5.1) for the principal key frames obtained in step (4.2), extracting multiple pairs of corresponding points as localization reference points according to the adjacency and field-of-view overlap between the principal key frames;

(5.2) estimating the change in camera view angle from the principal key frames whose fields of view overlap, and then, through geometric relations, estimating the camera's motion from the depth information of the reference point pairs;

(5.3) evaluating the spatial-position consistency of the target across the principal key frames according to the measured depth of the target, the camera view angle, and the camera motion;

(5.4) evaluating, from the result of step (4.3), the feature consistency of the two-dimensional segmented region of the target;

(5.5) determining the spatial position of the target by jointly evaluating the feature consistency of its two-dimensional segmented region and its spatial-position consistency.
According to another aspect of the present invention, a robot target recognition and localization system based on RGB-D video is provided, comprising:

an acquisition module for acquiring an RGB-D video frame sequence of the scene containing the target to be recognized and localized;

a filtering module for extracting key video frames from the RGB-D video frame sequence, extracting target candidate regions from the key video frames, and filtering the candidate regions according to the depth information corresponding to each key video frame;

a confidence ranking module for recognizing the filtered candidate regions with a deep network and ranking the target recognition results by confidence through long-term spatio-temporal association constraints and multi-frame recognition consistency estimation;

an optimization module for performing fast local segmentation of the filtered candidate regions, selecting principal key frames from the key video frames according to the confidence of the recognition results and the temporal intervals between key frames, and extending and cooperatively optimizing the segmented regions across neighboring frames;

a localization module for determining key feature points in the scene as localization reference points, estimating the camera view angle and camera motion, and, by imposing target-feature and target-position consistency constraints on the recognition and segmentation results of the principal key frames, estimating the cooperative confidence of the target and determining its precise spatial position.
In general, compared with the prior art, the above technical solution conceived by the present invention has the following main technical advantages: scene depth information is used to enhance the spatial-hierarchy perception of the recognition and localization algorithm; keyframe-based long- and short-term spatio-temporal consistency constraints improve video processing efficiency while guaranteeing the identity and association of targets in long-sequence recognition and localization tasks; and during localization, cooperative target localization across multiple information modalities is achieved by precisely segmenting the target in the image plane and evaluating the positional consistency of the same target in the depth space. The method has a small computational load, good real-time performance, and high recognition and localization accuracy, and can be applied to robot tasks based on online visual-information parsing and understanding.
Brief Description of the Drawings

Fig. 1 is an overall flow diagram of the method of an embodiment of the present invention;

Fig. 2 is a flow diagram of target recognition in an embodiment of the present invention;

Fig. 3 is a flow diagram of precise target localization in an embodiment of the present invention.

Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another provided they do not conflict.

The disclosed method involves keyframe selection, deep-network-based target recognition, segmentation, propagation of labels between frames, position estimation under consistency constraints, and cooperative optimization. It can be applied directly in robot systems that take RGB-D video as visual input, assisting the robot in target recognition and precise target localization.

Fig. 1 shows the overall flow of the method of an embodiment of the present invention. As can be seen from Fig. 1, the method comprises two major stages, target recognition and precise target localization, with target recognition being a prerequisite for precise localization. The specific implementation is as follows:
(1) acquiring an RGB-D video frame sequence of the scene containing the target to be recognized and localized;

Preferably, in one embodiment of the present invention, the RGB-D video sequence of the scene can be captured with a depth vision sensor such as a Kinect; alternatively, RGB image pairs can be captured with a binocular imaging device, scene depth estimated from the computed disparity to serve as the depth channel, and the synthesized RGB-D video used as input.

(2) extracting key video frames from the RGB-D video frame sequence, extracting target candidate regions from the key video frames, and filtering the candidate regions according to the depth information corresponding to each key video frame;

(3) recognizing the filtered candidate regions with a deep network, and ranking the target recognition results by confidence through long-term spatio-temporal association constraints and multi-frame recognition consistency estimation;

(4) performing fast local segmentation of the filtered candidate regions, selecting principal key frames from the key video frames according to the confidence of the recognition results and the temporal intervals between key frames, and extending and cooperatively optimizing the segmented regions across neighboring frames;

(5) determining key feature points in the scene as localization reference points, estimating the camera view angle and camera motion, and, by imposing target-feature and target-position consistency constraints on the recognition and segmentation results of the principal key frames, estimating the cooperative confidence of the target and determining its precise spatial position.
Preferably, in one embodiment of the present invention, step (1) above specifically comprises:

(1.1) capturing the RGB-D video sequence of the scene with a Kinect, filling holes in the depth image by neighborhood-sampling smoothing, correcting the result according to the Kinect parameters and converting it to actual depth values, which together with the RGB data form the input;
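The neighborhood-sampling hole filling of step (1.1) can be illustrated as follows. This is a minimal sketch, assuming zero-valued pixels mark missing depth and using a local median as the smoothing rule; the kernel size and the median rule are illustrative choices, not fixed by the embodiment:

```python
import cv2
import numpy as np

def fill_depth_holes(depth, ksize=5):
    """Fill zero-valued (missing) depth pixels from their valid neighborhood."""
    filled = depth.astype(np.float32)
    holes = filled == 0
    # The local median approximates the surrounding surface depth.
    median = cv2.medianBlur(filled, ksize)
    filled[holes] = median[holes]
    return filled
```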
(1.2) when a binocular device is used to capture image pairs, performing camera calibration and then stereo matching (feature extraction on the image pair, extraction of points corresponding to the same physical structure, disparity computation) in sequence, and finally estimating depth through the projection model as the input to the video's depth channel.
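For a rectified pair, the projection model referred to in step (1.2) reduces to the standard triangulation relation (stated here for clarity; the embodiment itself only names the model): with focal length $f$ in pixels, stereo baseline $B$, and disparity $d$ of a corresponding point pair,

$$ Z = \frac{f \, B}{d}, $$

so large disparities correspond to near structures and small disparities to far ones.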
Preferably, in one embodiment of the present invention, step (2) above specifically comprises:

(2.1) determining, by interval sampling or a keyframe selection method, the key video frames used for recognizing the target to be localized;
Step (2.1) specifically comprises: using fast scale-invariant feature transform (SIFT) point matching to obtain the scene overlap ratio of adjacent frames and thereby estimate the current rate of scene change; the sampling frequency is raised for video frames where the scene switches quickly and lowered where it switches slowly. In addition, when the application demands high algorithmic efficiency, interval sampling can be used directly in place of this step.
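A minimal sketch of this adaptive sampling, assuming OpenCV's SIFT implementation and using the fraction of ratio-test matches as a proxy for scene overlap (both the overlap measure and the interval-update rule below are illustrative choices, not prescribed by the embodiment):

```python
import cv2

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def scene_overlap(gray_a, gray_b):
    """Estimate the overlap ratio of two grayscale frames from SIFT matches."""
    kp_a, des_a = sift.detectAndCompute(gray_a, None)
    kp_b, des_b = sift.detectAndCompute(gray_b, None)
    if des_a is None or des_b is None:
        return 0.0
    matches = matcher.knnMatch(des_a, des_b, k=2)
    # Lowe ratio test keeps only unambiguous correspondences.
    good = [m for m in matches if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
    return len(good) / max(len(kp_a), 1)

def next_interval(gray_a, gray_b, base=10, lo=2, hi=30):
    """Shrink the sampling interval when the scene changes fast, grow it when slow."""
    interval = int(base * scene_overlap(gray_a, gray_b) / 0.5)  # 0.5 overlap keeps base
    return max(lo, min(hi, interval))
```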
(2.2) obtaining the target candidate regions in the key video frames with a confidence ranking method based on an objectness prior to form a candidate-region set, and using the depth information corresponding to each key frame to obtain hierarchical attributes inside each candidate region and in its neighborhood, so as to filter, optimize, and re-rank the candidate-region set.

The objectness-prior confidence ranking method may be the BING algorithm or the Edge Boxes algorithm. As shown in Fig. 2, the depth information of the corresponding frame is then used to obtain the hierarchical attributes inside each candidate region and in its neighborhood, and the candidate-region set is filtered, optimized, and re-ranked according to the principle that a high-confidence candidate box should enclose smooth depth while exhibiting a large depth gradient across its boundary.
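The depth-based re-ranking rule, smooth depth inside a good box and a strong depth step across its boundary, can be sketched as a simple score; the ring width and the way the two terms are combined are illustrative assumptions:

```python
import numpy as np

def depth_box_score(depth, box, band=4):
    """Score a candidate box (x, y, w, h) on a depth map:
    low interior variance and a large depth jump to the surrounding ring."""
    x, y, w, h = box
    inner = depth[y:y + h, x:x + w].astype(np.float64)
    y0, y1 = max(0, y - band), min(depth.shape[0], y + h + band)
    x0, x1 = max(0, x - band), min(depth.shape[1], x + w + band)
    ring = depth[y0:y1, x0:x1].astype(np.float64)
    ring[y - y0:y - y0 + h, x - x0:x - x0 + w] = np.nan  # mask out the interior
    smoothness = 1.0 / (1.0 + inner.var())                # smooth interior -> high
    boundary_jump = abs(np.nanmean(ring) - inner.mean())  # depth step at the border
    return smoothness * boundary_jump

# Re-ranking can then combine this with the objectness confidence,
# e.g. score = objectness_conf * depth_box_score(depth, box).
```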
Preferably, in one embodiment of the present invention, step (3) above specifically comprises:

(3.1) as shown in Fig. 2, feeding the candidate regions filtered in step (2) into a trained deep recognition network to obtain, for the key frame corresponding to each filtered region, target recognition predictions and a first confidence for each prediction;

The trained recognition network may be a deep recognition network such as SPP-Net, R-CNN, or Fast R-CNN, and may be replaced by other deep recognition networks.
(3.2) evaluating the feature consistency of the predictions across key frames under long-term spatio-temporal association constraints to obtain a second confidence for each prediction, ranking the cumulative confidence formed from the first and second confidences, and further filtering out candidate regions whose cumulative confidence falls below a preset threshold.
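One plausible way to accumulate and threshold the two confidences, sketched under the assumption of multiplicative fusion (the embodiment only specifies ranking a cumulative confidence and filtering by a preset threshold):

```python
def filter_by_cumulative_confidence(candidates, threshold=0.3):
    """candidates: dicts with 'conf1' (single-frame network score) and
    'conf2' (cross-frame feature-consistency score)."""
    for c in candidates:
        c["conf"] = c["conf1"] * c["conf2"]  # one plausible accumulation rule
    kept = [c for c in candidates if c["conf"] >= threshold]
    return sorted(kept, key=lambda c: c["conf"], reverse=True)
```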
Optionally, in one embodiment of the present invention, a recognition instruction can be issued to the algorithm to obtain detection and recognition results for the specified target, and algorithm efficiency can be improved by filtering out low-confidence recognition results.

Optionally, in one embodiment of the present invention, step (4) above specifically comprises:

(4.1) as shown in Fig. 3, performing a fast target segmentation of the candidate regions obtained in step (3.2) and their extended neighborhoods to obtain an initial segmentation of the target and determine the target boundary;
As an optional implementation, a GrabCut segmentation algorithm based on RGB-D information can be used for this fast segmentation to obtain the initial segmentation of the target, thereby yielding a two-dimensional localization of the target in the current video frame.
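A minimal sketch of this initialization using OpenCV's stock GrabCut, seeded with the candidate box enlarged by its extended neighborhood. Note that plain OpenCV GrabCut models RGB only; the RGB-D variant the embodiment refers to would additionally fold depth into the appearance model:

```python
import cv2
import numpy as np

def initial_segmentation(frame_bgr, box, margin=15, iters=5):
    """GrabCut seeded from a candidate box; returns a binary foreground mask."""
    H, W = frame_bgr.shape[:2]
    x, y, w, h = box
    rx, ry = max(0, x - margin), max(0, y - margin)
    rect = (rx, ry, min(W - rx, w + 2 * margin), min(H - ry, h + 2 * margin))
    mask = np.zeros((H, W), np.uint8)
    bgd = np.zeros((1, 65), np.float64)  # background GMM parameters
    fgd = np.zeros((1, 65), np.float64)  # foreground GMM parameters
    cv2.grabCut(frame_bgr, mask, rect, bgd, fgd, iters, cv2.GC_INIT_WITH_RECT)
    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    return fg.astype(np.uint8)
```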
(4.2) to further improve the efficiency of video target localization, as shown in Fig. 3, selecting principal key frames from the key video frames, constrained by short-term spatio-temporal consistency and based on the cumulative confidence ranking of step (3.2), with high single-frame recognition confidence and strong spatio-temporal consistency between adjacent frames as the criteria;

(4.3) under a long-term spatio-temporal consistency constraint and based on the initial segmentation of step (4.1), modeling the appearance of the target, constructing three-dimensional graphs over the principal key frames and their neighboring frames, designing a maximum a posteriori-Markov random field energy function, optimizing the initial segmentation with a graph-cut algorithm, and extending the single-frame segmentation result into the frames adjacent to that frame, thereby achieving two-dimensional target segmentation and localization optimization based on long- and short-term spatio-temporal consistency.
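The embodiment does not write out the MAP-MRF energy; a standard form consistent with the description, over binary pixel labels $l_p \in \{0, 1\}$ on a spatio-temporal graph linking each principal key frame to its neighboring frames, would be

$$ E(L) = \sum_{p} U_p(l_p) + \lambda \sum_{(p,q) \in \mathcal{N}} V_{pq}(l_p, l_q), $$

where the unary term $U_p$ scores pixel $p$ against the foreground and background appearance models, the pairwise term $V_{pq}$ penalizes label disagreement between spatially and temporally neighboring pixels (the temporal edges are what carry a single-frame segmentation into the adjacent frames), and $\lambda$ balances the two. Minimizing $E$ with a graph cut yields the maximum a posteriori labeling.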
Optionally, in one embodiment of the present invention, step (5) above specifically comprises:

(5.1) as shown in Fig. 3, for the principal key frames obtained in step (4.2), extracting multiple pairs of corresponding points as localization reference points according to the adjacency and field-of-view overlap between the principal key frames;

(5.2) estimating the change in camera view angle from the principal key frames whose fields of view overlap, and then, through geometric relations, estimating the camera's motion from the depth information of the reference point pairs;
The camera motion information includes the camera's displacement and its trajectory.
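A sketch of step (5.2) under simplifying assumptions: corresponding points in two overlapping principal key frames are back-projected to 3D using their measured depths and known camera intrinsics, and the camera motion (R, t) is recovered as the rigid transform aligning the two point sets via the SVD-based Kabsch method. In practice a RANSAC loop would wrap this to reject bad correspondences:

```python
import numpy as np

def backproject(pts_uv, depths, fx, fy, cx, cy):
    """Lift pixel coordinates with measured depth into 3D camera coordinates."""
    u, v = pts_uv[:, 0], pts_uv[:, 1]
    return np.stack([(u - cx) * depths / fx, (v - cy) * depths / fy, depths], axis=1)

def rigid_transform(P, Q):
    """Kabsch: find R, t minimizing ||R p_i + t - q_i|| over matched 3D points."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t  # displacement is ||t||; the trajectory accumulates (R, t) over frames
```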
(5.3) as shown in Fig. 3, evaluating the spatial-position consistency of the target across the principal key frames according to the measured depth of the target, the camera view angle, and the camera motion;

(5.4) evaluating, from the result of step (4.3), the feature consistency of the two-dimensional segmented region of the target; generally, a region-based deep network is used to extract regional deep features for feature-distance measurement and feature-consistency evaluation;
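Feature consistency in step (5.4) reduces to a distance between region descriptors; a minimal sketch using the mean pairwise cosine similarity of features extracted by any region-based deep network (the embodiment leaves the specific network open):

```python
import numpy as np

def feature_consistency(feats):
    """Mean pairwise cosine similarity of per-frame region features (rows of feats)."""
    f = np.asarray(feats, dtype=np.float64)
    n = len(f)
    if n < 2:
        return 1.0
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    sim = f @ f.T
    return (sim.sum() - n) / (n * (n - 1))  # average of the off-diagonal entries
```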
(5.5) determining the spatial position of the target by jointly evaluating the feature consistency of its two-dimensional segmented region and its spatial-position consistency.
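How the two consistency evaluations might be combined, sketched under the assumption that the cooperative confidence is their product and that the final position is the mean of the per-key-frame estimates (the embodiment specifies only a joint evaluation):

```python
import numpy as np

def localize(positions, feat_consistency, pos_consistency):
    """positions: (N, 3) per-key-frame estimates of the target's 3D position."""
    coop_conf = feat_consistency * pos_consistency  # one plausible fusion rule
    return coop_conf, np.asarray(positions, dtype=np.float64).mean(axis=0)
```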
In one embodiment of the present invention, a robot target recognition and localization system based on RGB-D video is disclosed, the system comprising:

an acquisition module for acquiring an RGB-D video frame sequence of the scene containing the target to be recognized and localized;

a filtering module for extracting key video frames from the RGB-D video frame sequence, extracting target candidate regions from the key video frames, and filtering the candidate regions according to the depth information corresponding to each key video frame;

a confidence ranking module for recognizing the filtered candidate regions with a deep network and ranking the target recognition results by confidence through long-term spatio-temporal association constraints and multi-frame recognition consistency estimation;

an optimization module for performing fast local segmentation of the filtered candidate regions, selecting principal key frames from the key video frames according to the confidence of the recognition results and the temporal intervals between key frames, and extending and cooperatively optimizing the segmented regions across neighboring frames;

a localization module for determining key feature points in the scene as localization reference points, estimating the camera view angle and camera motion, and, by imposing target-feature and target-position consistency constraints on the recognition and segmentation results of the principal key frames, estimating the cooperative confidence of the target and determining its precise spatial position.
For the specific implementation of each module, reference may be made to the description of the method embodiment, which is not repeated here.

Those skilled in the art will readily understand that the above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.
Claims (6)
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710078328.9A | 2017-02-14 | 2017-02-14 | Robot target identification and localization method and system based on RGB-D video |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106920250A true CN106920250A (en) | 2017-07-04 |
CN106920250B CN106920250B (en) | 2019-08-13 |
Family
ID=59453597

Family Applications (1)

Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710078328.9A (granted as CN106920250B, Active) | Robot target identification and localization method and system based on RGB-D video | 2017-02-14 | 2017-02-14 |

Country Status (1)

Country | Link |
---|---|
CN | CN106920250B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110013807A1 (en) * | 2009-07-17 | 2011-01-20 | Samsung Electronics Co., Ltd. | Apparatus and method for recognizing subject motion using a camera |
US20160132754A1 (en) * | 2012-05-25 | 2016-05-12 | The Johns Hopkins University | Integrated real-time tracking system for normal and anomaly tracking and the methods therefor |
CN104598890A (en) * | 2015-01-30 | 2015-05-06 | 南京邮电大学 | Human body behavior recognizing method based on RGB-D video |
CN104867161A (en) * | 2015-05-14 | 2015-08-26 | 国家电网公司 | Video-processing method and device |
CN105589974A (en) * | 2016-02-04 | 2016-05-18 | 通号通信信息集团有限公司 | Surveillance video retrieval method and system based on Hadoop platform |
CN105931270A (en) * | 2016-04-27 | 2016-09-07 | 石家庄铁道大学 | Video keyframe extraction method based on movement trajectory analysis |
Non-Patent Citations (2)

Title |
---|
ZHANG ZHIGUO, LIU LIMAN, TAO WENBING et al., "Confidence-driven infrared target detection", Infrared Physics & Technology * |
ZHONGWEI GUO et al., "Battlefield Video Target Mining", International Congress on Image & Signal Processing * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108214487A (en) * | 2017-12-16 | 2018-06-29 | 广西电网有限责任公司电力科学研究院 | Based on the positioning of the robot target of binocular vision and laser radar and grasping means |
CN109977981B (en) * | 2017-12-27 | 2020-11-24 | 深圳市优必选科技有限公司 | Scene analysis method based on binocular vision, robot and storage device |
CN109977981A (en) * | 2017-12-27 | 2019-07-05 | 深圳市优必选科技有限公司 | Scene analysis method based on binocular vision, robot and storage device |
CN108304808A (en) * | 2018-02-06 | 2018-07-20 | 广东顺德西安交通大学研究院 | A kind of monitor video method for checking object based on space time information Yu depth network |
CN108304808B (en) * | 2018-02-06 | 2021-08-17 | 广东顺德西安交通大学研究院 | Monitoring video object detection method based on temporal-spatial information and deep network |
CN108627816A (en) * | 2018-02-28 | 2018-10-09 | 沈阳上博智像科技有限公司 | Image distance measuring method, device, storage medium and electronic equipment |
CN108460790A (en) * | 2018-03-29 | 2018-08-28 | 西南科技大学 | A kind of visual tracking method based on consistency fallout predictor model |
CN108981698A (en) * | 2018-05-29 | 2018-12-11 | 杭州视氪科技有限公司 | A kind of vision positioning method based on multi-modal data |
CN108981698B (en) * | 2018-05-29 | 2020-07-14 | 杭州视氪科技有限公司 | Visual positioning method based on multi-mode data |
CN110675421A (en) * | 2019-08-30 | 2020-01-10 | 电子科技大学 | Depth image collaborative segmentation method based on few labeling frames |
CN110675421B (en) * | 2019-08-30 | 2022-03-15 | 电子科技大学 | Cooperative segmentation method of depth image based on few annotation boxes |
CN113573009A (en) * | 2021-02-19 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Video processing method, video processing device, computer equipment and storage medium |
CN113573009B (en) * | 2021-02-19 | 2025-02-28 | 腾讯科技(深圳)有限公司 | Video processing method, device, computer equipment and storage medium |
CN114627236A (en) * | 2022-02-11 | 2022-06-14 | 厦门聚视智创科技有限公司 | Method for surveying and mapping three-dimensional motion trail in video |
CN114648716A (en) * | 2022-03-28 | 2022-06-21 | 天津大学 | Focal target detection method based on space-ground cooperative view angle |
CN114648716B (en) * | 2022-03-28 | 2025-02-07 | 天津大学 | A focal target detection method based on air-ground collaborative perspective |
CN115091472A (en) * | 2022-08-26 | 2022-09-23 | 珠海市南特金属科技股份有限公司 | Target positioning method based on artificial intelligence and clamping manipulator control system |
Also Published As
Publication number | Publication date |
---|---|
CN106920250B (en) | 2019-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106920250B (en) | Robot target identification and localization method and system based on RGB-D video | |
CN104715471B (en) | Target locating method and its device | |
CN103268480B (en) | A kind of Visual Tracking System and method | |
CN102622769B (en) | Multi-target tracking method by taking depth as leading clue under dynamic scene | |
CN115937810A (en) | A sensor fusion method based on binocular camera guidance | |
CN107657644B (en) | Sparse scene flows detection method and device under a kind of mobile environment | |
Wedel et al. | Detection and segmentation of independently moving objects from dense scene flow | |
Jerripothula et al. | Efficient video object co-localization with co-saliency activated tracklets | |
CN110570457B (en) | A 3D Object Detection and Tracking Method Based on Streaming Data | |
CN103164858A (en) | Adhered crowd segmenting and tracking methods based on superpixel and graph model | |
CN109523528B (en) | A transmission line extraction method based on UAV binocular vision SGC algorithm | |
US11645777B2 (en) | Multi-view positioning using reflections | |
CN109961461B (en) | A Multi-moving Object Tracking Method Based on 3D Hierarchical Graph Model | |
JP2014235743A (en) | Method and equipment for determining position of hand on the basis of depth image | |
Fan et al. | Human-m3: A multi-view multi-modal dataset for 3d human pose estimation in outdoor scenes | |
CN108399630B (en) | Method for quickly measuring distance of target in region of interest in complex scene | |
CN117788730B (en) | Semantic point cloud map construction method | |
Sinha et al. | Image retrieval using landmark indexing for indoor navigation | |
Wu et al. | Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation | |
CN116229096A (en) | Large-scene multi-view dynamic point cloud space-time synchronization method and system | |
Ma et al. | Interactive stereo image segmentation with RGB-D hybrid constraints | |
CN111191524A (en) | Sports people counting method | |
SrirangamSridharan et al. | Object localization and size estimation from RGB-D images | |
KR20160072477A (en) | Surrounding environment modeling method and apparatus performing the same | |
Jahanshahi et al. | Multi-view tracking using Kalman filter and graph cut |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |