CN110706285A - Object pose prediction method based on CAD model - Google Patents

Object pose prediction method based on CAD model

Info

Publication number
CN110706285A
Authority
CN
China
Prior art keywords
camera
rotation
pose
cad model
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910947809.8A
Other languages
Chinese (zh)
Inventor
许状男
王广龙
刁俊岐
庞健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN201910947809.8A priority Critical patent/CN110706285A/en
Publication of CN110706285A publication Critical patent/CN110706285A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01B MEASURING LENGTH, THICKNESS OR SIMILAR LINEAR DIMENSIONS; MEASURING ANGLES; MEASURING AREAS; MEASURING IRREGULARITIES OF SURFACES OR CONTOURS
    • G01B11/00 Measuring arrangements characterised by the use of optical techniques
    • G01B11/002 Measuring arrangements characterised by the use of optical techniques for measuring two or more coordinates
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01B MEASURING LENGTH, THICKNESS OR SIMILAR LINEAR DIMENSIONS; MEASURING ANGLES; MEASURING AREAS; MEASURING IRREGULARITIES OF SURFACES OR CONTOURS
    • G01B11/00 Measuring arrangements characterised by the use of optical techniques
    • G01B11/24 Measuring arrangements characterised by the use of optical techniques for measuring contours or curvatures
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C11/00 Photogrammetry or videogrammetry, e.g. stereogrammetry; Photographic surveying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/13 Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Remote Sensing (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an object pose prediction method based on a CAD model, and relates to the technical field of image processing methods. The method comprises the following steps: obtaining the relevant parameters of a monocular camera through calibration, and generating the data required for coarse matching from the CAD model; detecting and identifying an object in the image, outputting a mask of the object, and obtaining the contour information of the object from the mask; combining the contour information of the object with the coarse-matching data to obtain a coarse-matched pose of the object, and then obtaining the accurate pose of the object through an iterative algorithm. The method can be used as an algorithm for detecting the pose of an object when real-time requirements are not high, and it offers high detection precision and strong anti-interference performance.

Description

Object pose prediction method based on CAD model
Technical Field
The invention relates to the technical field of image processing methods, in particular to an object pose prediction method based on a computer-aided design (CAD) model.
Background
Augmented Reality (AR) builds on computer graphics and visualization technology to place positioned virtual objects in three-dimensional space; it can fuse information from real and virtual scenes and offers real-time interactivity. Since the concept of augmented-reality-guided maintenance was proposed, research on AR in the maintenance field has steadily deepened. For example, when a robot using augmented reality technology performs tasks such as grasping and welding, it must first acquire accurate three-dimensional pose information of the object from the visual information captured by a camera; likewise, in unmanned driving, aerospace, deep-sea operations, weapon guidance and similar areas, the three-dimensional pose of an object must be estimated in advance from visual sensor information. Current augmented-reality sensors mainly rely on cameras, laser radar, ultrasonic radar and the like. Cameras are divided into monocular and binocular cameras; binocular cameras are bulky, heavy, expensive and fragile, while ultrasonic radar has low precision, poor real-time performance, cannot handle occlusion and is easily affected by noise.
Disclosure of Invention
The technical problem to be solved by the invention is how to provide a low-cost recognition method that can accurately obtain the pose of an object.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: an object pose prediction method based on a CAD model is characterized by comprising the following steps:
obtaining the relevant parameters of the monocular camera through calibration, and generating the data required for coarse matching from a CAD model;
detecting and identifying an object in the image, outputting a mask of the object, and obtaining the contour information of the object from the mask;
combining the contour information of the object with the coarse-matching data to obtain a coarse-matched pose of the object, and then obtaining the accurate pose of the object through an iterative algorithm.
A further technical solution is that the method for obtaining the relevant parameters of the monocular camera through calibration comprises the following steps:
constructing the camera imaging model:
M is a point in three-dimensional space and m is its projected image point on the image plane; from the relationships between the coordinate systems involved in the camera, the projection from the world coordinate system to pixel coordinates is

$$Z_C\begin{bmatrix}u\\v\\1\end{bmatrix}=\begin{bmatrix}a_x&0&u_0&0\\0&a_y&v_0&0\\0&0&1&0\end{bmatrix}\begin{bmatrix}\mathbf{R}&\mathbf{t}\\\mathbf{0}^T&1\end{bmatrix}\begin{bmatrix}X_W\\Y_W\\Z_W\\1\end{bmatrix}\qquad(1)$$

which can be written as

$$Z_C\begin{bmatrix}u\\v\\1\end{bmatrix}=\mathbf{K}\,\mathbf{M}_1\begin{bmatrix}X_W\\Y_W\\Z_W\\1\end{bmatrix}=\mathbf{M}\begin{bmatrix}X_W\\Y_W\\Z_W\\1\end{bmatrix}\qquad(2)$$

where a_x and a_y are the scale factors of the horizontal and vertical image axes; K is the camera intrinsic parameter matrix; M_1 contains a rotation matrix and a translation vector, and since its parameters are determined by the position of the camera coordinate system relative to the world coordinate system, M_1 is called the camera extrinsic parameter matrix; the product M of the intrinsic and extrinsic parameter matrices is the projection matrix; X_W, Y_W and Z_W are the x-, y- and z-axis coordinates of the object center W in the world coordinate system;
the focal length of the camera is f, the optical axis is the positive z direction, the x and y axes lie in the plane of the optical center O, and the optical center O is taken as the origin of the camera coordinate system; the position of the object center in the camera coordinate system is denoted by W, where:
W = (W_x, W_y, W_z)    (3)
If P = (u, v) are the coordinates of the pixel corresponding to the object in the image and K is the camera intrinsic parameter matrix, then:

$$\begin{bmatrix}u\\v\\1\end{bmatrix}=\frac{1}{W_z}\,\mathbf{K}\begin{bmatrix}W_x\\W_y\\W_z\end{bmatrix}\qquad(4)$$

This equation gives the two-dimensional coordinate P onto which the actual object center W in the camera coordinate system is projected in the image through the camera intrinsic parameters K.
A further technical solution is that the method for generating coarse-matching data from the CAD model comprises the following steps:
first rendering a mask of the object at a specified pose from the object CAD model, obtaining a bounding box of the object from the mask, and then sampling the contour of the object at intervals along the bounding box according to the requirements;
taking the length L of the left border as a reference, dividing L into n equal parts, taking every L/n as a sampling point, traversing the points on the contour and, for each point whose coordinate along the border equals a sampling point, computing its distance to the left border;
normalizing the sampled values, i.e. scaling the length of the left border of the bounding box to a unified unit;
sampling the contour of the object at different rotation angles around the center of the object CAD model at a specified distance, and storing the contour samples together with the corresponding pose information to obtain the coarse-matching template data of the object.
A further technical solution is that the method for detecting and identifying an object in an image and outputting its mask comprises the following step:
performing image recognition with a Mask-RCNN neural network and outputting the class of the object and the mask of the object.
A further technical solution is that, when training the Mask-RCNN neural network, a data set is generated automatically with Blender and OpenCV software.
A further technical solution is that the coarse pose-matching method is as follows:
the pose of a rigid body comprises a rotation part R and a displacement part T, and the matching process for the rotation part is:
first normalizing the output contour information and unifying it to the same scale for comparison;
letting S_in be the sampled data of the actual object mask and S_i the i-th group of data in the template data, each group containing n sample values, and computing the L1 distance between each group of actual mask samples and the template; the L1 distance L_i of the i-th group is:

$$L_i=\sum_{j=1}^{n}\left|S_{in}(j)-S_i(j)\right|\qquad(5)$$

ideally, when the poses are identical the sample values coincide, i.e. the rotation angle in the template data whose distance is 0 is the rotation angle corresponding to the contour; therefore the rotation angle corresponding to the minimum value among all results that satisfy a threshold is taken as the currently matched rotation angle, and if the threshold is not satisfied the matching is considered to have failed;
in the coarse matching, the error on each degree of freedom of the Euler angles is kept below 12°; the Euler-angle information is then converted into a rotation matrix R, giving the rotation information of the object;
the algorithm for the translation part is as follows:
when the template data are generated, the object is sampled at a specified distance and the CAD model size is known, so the size of the object's bounding box is inversely proportional to its distance, i.e. the smaller the bounding box, the greater the distance, which matches human visual perception; the distance between the model center point and the camera optical center can therefore be obtained from (6):
D = (w_in / w_i) · D_i    (6)
where w_in is the bounding-box width output by object recognition, w_i is the bounding-box width of the template data matched to the rotation, D_i is the specified distance at which the template data were collected, and D is the distance between the model center point and the camera optical center;
since the prior information on the CAD model size is known, the actual physical distance represented by each pixel in the template can be calculated, and the displacement components of the object can then be calculated:
$$t_x=\frac{(u-u_0)\,D}{a_x}\qquad(7)\qquad\qquad t_y=\frac{(v-v_0)\,D}{a_y}\qquad(8)$$
where t_x is the displacement of the object along the x-axis and t_y is the displacement along the y-axis; after the rotation R and the displacement T of the object are obtained, the world coordinates of the object are obtained by combining the intrinsic and extrinsic parameters of the camera.
A further technical solution is that the method for obtaining the accurate pose of the object through the iterative algorithm is as follows:
if the coarse-matched object rotation is A = (ψ, θ, φ), an angle Δε is added to and subtracted from each coordinate axis on this basis, with Δε set to half the coarse-matching interval, which yields the angles of the cells neighbouring the coarse-matched rotation in the coarse-matching rotation space; the object contours are obtained from the CAD model at these neighbouring angles, the contour-sampling method is applied together with formula (5), and the rotation A_1 = (ψ_1, θ_1, φ_1) that minimizes L_i in formula (5) is the rotation angle of the object after one iteration;
the angle Δε is then halved repeatedly to obtain angle values over a smaller range for further iterations, until a rotation angle is finally obtained for which the distance in formula (5) is 0;
combined with the translation information of the object obtained in the coarse matching, the accurate pose of the object is obtained.
The beneficial effects produced by the above technical solution are as follows: the method first obtains the relevant camera parameters through calibration and generates the data required for coarse matching from the CAD model; a deep neural network or another algorithm then detects and identifies the object in the image and outputs its mask; the contour information obtained from the mask is combined with the coarse-matching data to obtain the coarse-matched pose of the object, and the accurate pose of the object is then obtained through an iterative algorithm.
Drawings
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flow chart of a method according to an embodiment of the invention;
FIG. 2 is a diagram of a coordinate system in an embodiment of the present invention;
FIG. 3 is a diagram illustrating a pinhole model of a camera according to an embodiment of the present invention;
FIG. 4 is a graph of the results of sampling an object contour in an embodiment of the present invention;
FIG. 5 is a diagram illustrating the segmentation effect of Mask-RCNN images according to an embodiment of the present invention;
FIG. 6 is a diagram showing the results of Mask-RCNN image recognition in an embodiment of the present invention;
FIG. 7 is a comparison graph of coarse matching and post-iteration poses in an embodiment of the present invention;
FIG. 8 is a diagram of the pose accuracy of an object under occlusion in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
As shown in fig. 1, an embodiment of the present invention discloses a method for predicting the pose of an object based on a CAD model, which includes the following steps:
The method first obtains the relevant parameters of the camera through calibration and generates the data required for coarse matching from the CAD model; a deep neural network or another algorithm detects and identifies the object in the image and outputs its mask; the contour information obtained from the mask is combined with the coarse-matching data to obtain the coarse-matched pose of the object, and the accurate pose of the object is then obtained through an iterative algorithm.
The above method is explained in detail below:
A camera imaging model:
M is a point in three-dimensional space and m is its projected image point on the image plane. From the relationships between the coordinate systems involved in the camera (the coordinate-system relationships are shown in fig. 2), the projection from the world coordinate system to pixel coordinates is

$$Z_C\begin{bmatrix}u\\v\\1\end{bmatrix}=\begin{bmatrix}a_x&0&u_0&0\\0&a_y&v_0&0\\0&0&1&0\end{bmatrix}\begin{bmatrix}\mathbf{R}&\mathbf{t}\\\mathbf{0}^T&1\end{bmatrix}\begin{bmatrix}X_W\\Y_W\\Z_W\\1\end{bmatrix}\qquad(1)$$

which can be written as

$$Z_C\begin{bmatrix}u\\v\\1\end{bmatrix}=\mathbf{K}\,\mathbf{M}_1\begin{bmatrix}X_W\\Y_W\\Z_W\\1\end{bmatrix}=\mathbf{M}\begin{bmatrix}X_W\\Y_W\\Z_W\\1\end{bmatrix}\qquad(2)$$

where a_x and a_y are the scale factors of the horizontal and vertical image axes; K comprises the internal parameters such as the focal length and the principal-point coordinates and is called the intrinsic parameter matrix; M_1 contains a rotation matrix and a translation vector, and since its parameters are determined by the position of the camera coordinate system relative to the world coordinate system, it is called the camera extrinsic parameter matrix; the product M of the intrinsic and extrinsic parameter matrices is called the projection matrix. By comparing formula (1) with formula (2), the concrete matrix expressions of the camera's intrinsic and extrinsic parameters are easily determined; calibrating the camera means determining its intrinsic and extrinsic parameters.
Calibrating a camera:
Assuming that the camera follows the pinhole model, as shown in fig. 3; the pose of an object is the general term for its position and orientation.
The focal length of the camera is f, the axis is the positive z direction, the x and y axes are on the plane of the optical center O, the optical center O is used as the origin of the camera coordinate system, and the position of the center of the object can be represented by W in the camera coordinate system, wherein:
W=(Wx,Wy,Wz) (3)
the specified object center is the position of the object CAD model center, and is generally the volume center. If P is (u, v) the coordinates of the corresponding pixel of the object on the image and K is the camera reference matrix, then this equation can be obtained:
Figure BDA0002224735330000071
this equation represents the two-dimensional coordinate position P of the actual object center position W after passing through the camera intrinsic parameters K and being projected onto the image in the camera coordinate system.
Therefore, to obtain the three-dimensional coordinate position of the object, the camera internal reference K must be calibrated, where there are many calibration methods, and in the present application, K is obtained by using the camera calibration method provided by Opencv software.
Generating coarse matching data by using a CAD model:
The core algorithm in generating the template data from the object CAD model is a sampling algorithm based on the object contour. First, a mask of the object is rendered at a designated pose from the object CAD model, and a bounding box of the object is obtained from the mask; the contour of the object is then sampled at regular intervals along the bounding box according to the requirements. As shown in fig. 4, the contour of the object is sampled at regular intervals along the left border.
Taking the length L of the left border as a reference (the other borders are handled similarly), L is divided into n equal parts and every L/n is taken as a sampling point; the points on the contour are traversed and, for each point whose coordinate along the border equals a sampling point, its distance to the left border is computed. This turns the contour information into a set of sample values.
Because the contour can vary in size, the sample values need to be normalized, i.e. the length of the left border of the bounding box is unified to one unit; in the experiments it is unified to 128 px, which ensures both sampling precision and sampling speed.
The advantage of this sampling scheme is that it yields a set of contour features (the sample values) that are invariant to contour scaling but sensitive to object rotation, and that have a fixed dimensionality, which makes comparison convenient.
The contour of the object is sampled at different rotation angles around the center of the object CAD model at a specified distance, and the contour samples are stored together with the corresponding pose information to obtain the coarse-matching template data of the object.
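A minimal sketch of this contour-sampling idea under one reading of the description: step down the left edge of the bounding box in n equal intervals and record, per row, the normalised distance from the left edge to the object contour (the function name, the value of n, and the use of the leftmost mask pixel per row are assumptions):

```python
import numpy as np

def sample_contour(mask: np.ndarray, n: int = 128) -> np.ndarray:
    """mask: binary object mask (H x W). Returns n normalised contour samples."""
    ys, xs = np.nonzero(mask)
    top, bottom, left = ys.min(), ys.max(), xs.min()
    height = max(bottom - top, 1)
    samples = np.zeros(n, dtype=np.float32)
    for k in range(n):
        row = int(top + k * height / n)
        cols = np.nonzero(mask[row])[0]
        if cols.size:
            # distance from the left border to the contour, scaled by the box height
            samples[k] = (cols.min() - left) / height
    return samples
```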
Image recognition and object mask output:
The currently most effective approach to image recognition is the deep neural network, and Mask-RCNN is one of the better-performing deep-neural-network models for image recognition; its effect is shown in fig. 5. After training, the model can output the object class and the object mask in real time and with high precision, so it is used as the image-detection processing module in this method.
As neural networks continue to develop, other algorithms and deep-neural-network models will surpass Mask-RCNN in performance; the present algorithm is applicable to any algorithm or deep-neural-network model that outputs a mask or contour, i.e. it can serve as a general solution. When training the Mask-RCNN neural network, the data set is generated automatically with Blender and OpenCV software, and the recognition precision is high.
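As a stand-in for the trained network, the sketch below uses torchvision's off-the-shelf Mask R-CNN purely to illustrate the "image in, class + mask out" interface; the patent's own model is trained on a Blender/OpenCV-generated data set, and the image path and score threshold here are assumptions:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("scene.jpg").convert("RGB"))
with torch.no_grad():
    out = model([img])[0]               # dict with 'boxes', 'labels', 'scores', 'masks'

keep = out["scores"] > 0.7              # assumed confidence threshold
masks = out["masks"][keep, 0] > 0.5     # one boolean H x W mask per detection
labels = out["labels"][keep]            # object classes for the kept detections
```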
Coarse matching algorithm:
the pose of the rigid body comprises a rotation part R and a displacement part T, and the matching process of the rotation part is as follows:
Because the resolution of the object mask output differs from frame to frame, too low a resolution degrades the quality of the data acquired by the sampling algorithm, while too high a resolution slows sampling down; therefore, as in the sampling algorithm, the output contour information is first normalized and unified to the same scale for comparison.
Let S_in be the sampled data of the actual object mask and S_i the i-th group of data in the template data, each group containing n sample values. The L1 distance between the actual mask samples and each template group is computed; the L1 distance L_i of the i-th group is:

$$L_i=\sum_{j=1}^{n}\left|S_{in}(j)-S_i(j)\right|\qquad(5)$$

Ideally, when the poses are identical the sample values coincide, i.e. the rotation angle in the template data whose distance is 0 is the rotation angle corresponding to the contour. In practice, dividing the angles too finely produces a large amount of data and makes matching too slow, so the rotation angle corresponding to the minimum value among all results that satisfy a threshold is taken as the currently matched rotation angle; if the threshold is not satisfied, the matching is considered to have failed. To guarantee matching speed, the coarse-matching error on each degree of freedom of the Euler angles is kept below 12° (i.e. 360° is divided into 30 equal parts for sampling to generate the coarse-matching template). The Euler-angle information is then converted into a rotation matrix R, giving the rotation information of the object.
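A sketch of this coarse rotation match under stated assumptions (the template arrays, threshold value and the xyz Euler convention are illustrative; the templates would come from the CAD sampling step above):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def match_rotation(s_in, template_samples, template_eulers, threshold=5.0):
    """s_in: (n,) samples of the detected mask.
    template_samples: (m, n) template samples; template_eulers: (m, 3) Euler angles in degrees."""
    l1 = np.abs(template_samples - s_in).sum(axis=1)    # eq. (5): L_i for every template
    i = int(np.argmin(l1))
    if l1[i] > threshold:
        return None                                     # matching failed
    R = Rotation.from_euler("xyz", template_eulers[i], degrees=True).as_matrix()
    return R, template_eulers[i]
```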
The algorithm of the translation part is as follows:
When the template data are generated, the object is sampled at a specified distance and the CAD model size is known, so the size of the object's bounding box is inversely proportional to its distance: the smaller the bounding box, the greater the distance, which matches human visual perception. The distance between the model center point and the camera optical center can therefore be obtained from (6):
D = (w_in / w_i) · D_i    (6)
where w_in is the bounding-box width output by object recognition (the bounding-box length can be used instead), w_i is the bounding-box width of the template data matched to the rotation, D_i is the specified distance at which the template data were collected, and D is the distance between the model center point and the camera optical center.
Similarly, since the prior information on the CAD model size is known, the actual physical distance represented by each pixel in the template can be calculated, and the displacement components of the object can then be calculated:

$$t_x=\frac{(u-u_0)\,D}{a_x}\qquad(7)\qquad\qquad t_y=\frac{(v-v_0)\,D}{a_y}\qquad(8)$$
In actual experiments, because sub-pixel information cannot be acquired, the displacement-vector error is large when the computation uses pixels only; this can be mitigated by raising the resolution of the camera image, i.e. the higher the camera resolution, the more accurate the object position. Once the rotation R and the displacement T of the object are obtained, the world coordinates of the object are obtained by combining the camera intrinsic and extrinsic parameters.
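A sketch of the translation estimate under these assumptions: eq. (6) gives the distance D from the bounding-box ratio, and the lateral components follow from the pinhole model using the principal point and scale factors in K (the exact form of eqs. (7)-(8) is an image in the original, so this is an inferred reconstruction, and D is treated as the depth):

```python
import numpy as np

def estimate_translation(w_in, w_i, D_i, u, v, K):
    """w_in: detected box width (px); w_i: matched template box width (px);
    D_i: template capture distance; (u, v): detected object-center pixel; K: intrinsics."""
    D = (w_in / w_i) * D_i                  # eq. (6): distance to the optical center
    ax, ay = K[0, 0], K[1, 1]               # scale factors a_x, a_y
    u0, v0 = K[0, 2], K[1, 2]               # principal point
    tx = (u - u0) * D / ax                  # lateral offsets from the pinhole model
    ty = (v - v0) * D / ay
    return np.array([tx, ty, D])
```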
An iterative algorithm:
After the coarse-matched pose of the object is obtained, the rotation of the object theoretically has an error of less than 12°. To eliminate this error an iterative algorithm is introduced; it computes on the rotation information obtained from coarse matching and finally yields rotation information whose error is zero (with floating-point numbers set to 8 decimal places).
If the coarse-matched object rotation is A = (ψ, θ, φ), a small angle Δε is added to and subtracted from each axis on this basis. Since the coarse-matching interval was previously set to 12°, Δε is set to 6°, half of that interval, so that the 26 angles of the neighbouring cells are obtained in the coarse-matching rotation space. The object contour is rendered from the CAD model at each of these 26 angles, the contour-sampling method is applied together with formula (5), and the rotation A_1 = (ψ_1, θ_1, φ_1) that minimizes L_i in formula (5) is the rotation angle of the object after one iteration.
Δε is then halved repeatedly to obtain angle values over a smaller range for further iterations, until a rotation angle is finally obtained for which the distance in formula (5) is 0 (with floating-point numbers set to 8 decimal places); more accurate rotation information can be obtained by increasing the computer's floating-point precision.
Combined with the translation information of the object obtained in the coarse matching, the accurate 6DoF pose of the object is obtained.
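A sketch of this refinement loop under the stated assumptions; render_samples() is a hypothetical stand-in for "render the CAD mask at this Euler rotation and run the contour sampler", and the starting delta and iteration count mirror the 6° halving scheme described above:

```python
import itertools
import numpy as np

def refine_rotation(euler, s_in, render_samples, delta=6.0, iters=6):
    """euler: coarse-matched Euler angles (deg); s_in: samples of the detected mask."""
    best = np.asarray(euler, dtype=float)
    best_err = np.abs(render_samples(best) - s_in).sum()
    for _ in range(iters):
        # test the 26 neighbours of the current rotation at +/- delta on each axis
        for step in itertools.product((-delta, 0.0, delta), repeat=3):
            if step == (0.0, 0.0, 0.0):
                continue
            cand = best + np.array(step)
            err = np.abs(render_samples(cand) - s_in).sum()
            if err < best_err:
                best, best_err = cand, err
        delta /= 2.0                        # shrink the search range each iteration
    return best
```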
Experimental data:
The experimental environment was configured as follows: a Lenovo Y7000 notebook running Ubuntu 16.04, with Python 3.6 as the programming language.
Object recognition accuracy: the mask data input to the method are the masks output by the Mask-RCNN neural network. Because Mask-RCNN is a powerful network, training on the self-built data set achieves the desired recognition precision and meets the requirements of the method, as shown in fig. 6.
Rotation accuracy: the coarse-matching rotation precision is determined by the generated coarse-matching data template. In the experiment each degree of freedom of the Euler angles is divided into 30 equal parts for coarse matching, so the coarse error does not exceed 12° (360°/30), and the highest precision of 8-digit floating-point numbers is reached after an average of 6 iterations. Fig. 7 compares the results after coarse matching and after iteration, showing a random-pose mask of the object (i.e. the input), the rendered image of the object at the matched pose, and the difference image of the two.
Experiments show that the rotation error after iteration is 0 at 8-digit floating-point precision; the green error visible in part of fig. 5 is caused by the position error.
Position accuracy: since the position in this method is computed from the object's bounding box, its accuracy is limited by pixel accuracy. In extreme cases, for example a distant object that appears small, the bounding box shrinks proportionally, so a difference of one pixel in the bounding box causes a large increase in position error. The position accuracy of the object therefore depends on the camera resolution: the higher the camera resolution, the smaller the bounding-box error and the higher the position accuracy of the object.
Experiments show that, with a camera resolution of 512x512 pixels, the position accuracy of the method for a 4x4x3 cm object varies with the object's distance from the camera as shown in Table 1:
TABLE 1 position error versus distance
Distance between object and camera (mm) Error (mm)
500 0-5
1000 2-12
2000 10-100
5000 >100
Comparison with other related pose methods:
Compared with typical neural-network methods such as SSD-6D and BB8, when the evaluation standard is the commonly used 2D-projection, 5 cm-5° or 6D-pose metric, the rotation accuracy of this method is close to 100%, far better than the other algorithms. The main error source is the position error; since the cause of the position error depends on the resolution of the camera image, no comparison with other methods was made for it.
The difference in real-time performance is large: running in the personal-notebook environment, the method takes about 0.6 s for the coarse matching of one image, and the average time after iteration is about 40-60 s. Neural-network-based 6DoF pose methods can generally run in real time (>20 fps), and among algorithm-based 6DoF pose methods the representative Linemod can reach roughly 15-18 fps.
The method's anti-interference capability is outstanding: as long as the contour sampling of the object is essentially correct, missing parts of the object mask have little influence on the pose computed by the method, as shown in fig. 8.
In conclusion, the method can serve as a general algorithm for detecting the pose of an object when real-time requirements are not high; its detection precision is high and its anti-interference performance is strong. In practical applications, real-time performance can be improved with C++ code and parallel computing to meet usage requirements.

Claims (7)

1. An object pose prediction method based on a CAD model is characterized by comprising the following steps:
obtaining relevant parameters of the monocular camera through calibration, and generating the data required for coarse matching from a CAD model;
detecting and identifying an object in the image, outputting a mask of the object, and obtaining the contour information of the object through the mask of the object;
and combining the contour information of the object with the coarse-matching data to obtain a coarse-matched pose of the object, and then obtaining the accurate pose of the object through an iterative algorithm.
2. The CAD model-based object pose prediction method of claim 1, wherein the method for obtaining monocular camera related parameters by calibration comprises the steps of:
constructing a camera imaging model:
M is a point in three-dimensional space and m is its projected image point on the image plane; from the relationships between the coordinate systems involved in the camera, the projection from the world coordinate system to pixel coordinates is

$$Z_C\begin{bmatrix}u\\v\\1\end{bmatrix}=\begin{bmatrix}a_x&0&u_0&0\\0&a_y&v_0&0\\0&0&1&0\end{bmatrix}\begin{bmatrix}\mathbf{R}&\mathbf{t}\\\mathbf{0}^T&1\end{bmatrix}\begin{bmatrix}X_W\\Y_W\\Z_W\\1\end{bmatrix}\qquad(1)$$

which can be written as

$$Z_C\begin{bmatrix}u\\v\\1\end{bmatrix}=\mathbf{K}\,\mathbf{M}_1\begin{bmatrix}X_W\\Y_W\\Z_W\\1\end{bmatrix}=\mathbf{M}\begin{bmatrix}X_W\\Y_W\\Z_W\\1\end{bmatrix}\qquad(2)$$

where a_x and a_y are the scale factors of the horizontal and vertical image axes; K is the camera intrinsic parameter matrix; M_1 contains a rotation matrix and a translation vector, and since its parameters are determined by the position of the camera coordinate system relative to the world coordinate system, M_1 is the camera extrinsic parameter matrix; the product M of the intrinsic and extrinsic parameter matrices is the projection matrix; X_W, Y_W and Z_W are the x-, y- and z-axis coordinates of the object center W in the world coordinate system;
the focal length of the camera is f, the optical axis is the positive z direction, the x and y axes lie in the plane of the optical center O, and the optical center O is taken as the origin of the camera coordinate system; the position of the object center in the camera coordinate system is denoted by W, where:
W = (W_x, W_y, W_z)    (3)
the specified object center is the position of the center of the object CAD model, and if P = (u, v) are the coordinates of the pixel corresponding to the object in the image and K is the camera intrinsic parameter matrix, then:

$$\begin{bmatrix}u\\v\\1\end{bmatrix}=\frac{1}{W_z}\,\mathbf{K}\begin{bmatrix}W_x\\W_y\\W_z\end{bmatrix}\qquad(4)$$

this equation gives the two-dimensional coordinate P onto which the actual object center W in the camera coordinate system is projected in the image through the camera intrinsic parameters K.
3. The CAD model-based object pose prediction method of claim 1, wherein the method of generating coarse match data using a CAD model is as follows:
firstly rendering a mask of the object at a specified pose from the object CAD model, obtaining a bounding box of the object from the mask, and then sampling the contour of the object at intervals along the bounding box according to the requirements;
taking the length L of the left border as a reference, dividing L into n equal parts, taking every L/n as a sampling point, traversing the points on the contour and, for each point whose coordinate along the border equals a sampling point, computing its distance to the left border;
normalizing the sampled values, i.e. scaling the length of the left border of the bounding box to a unified unit;
and sampling the contour of the object at different rotation angles around the center of the object CAD model at a specified distance, and storing the contour samples together with the corresponding pose information to obtain the coarse-matching template data of the object.
4. The CAD model-based object pose prediction method of claim 1, wherein the method for detecting and identifying objects in an image and outputting their masks is as follows:
performing image recognition with a Mask-RCNN neural network and outputting the class of the object and the mask of the object.
5. The CAD model-based object pose prediction method of claim 4, wherein, during training of the Mask-RCNN neural network, a data set is automatically generated with Blender and OpenCV software for training.
6. The CAD model-based object pose prediction method of claim 1, wherein the coarse matching pose method is as follows:
the pose of the rigid body comprises a rotation part R and a displacement part T, and the matching process of the rotation part is as follows:
firstly, normalizing the output contour information, unifying the output contour information to the same scale for comparison;
letting S_in be the sampled data of the actual object mask and S_i the i-th group of data in the template data, each group containing n sample values, and computing the L1 distance between each group of actual mask samples and the template; the L1 distance L_i of the i-th group is:

$$L_i=\sum_{j=1}^{n}\left|S_{in}(j)-S_i(j)\right|\qquad(5)$$

ideally, when the poses are identical the sample values coincide, i.e. the rotation angle in the template data whose distance is 0 is the rotation angle corresponding to the contour; therefore the rotation angle corresponding to the minimum value among all results that satisfy a threshold is taken as the currently matched rotation angle, and if the threshold is not satisfied the matching is considered to have failed;
in coarse matching, the error is controlled to be not more than 12 degrees in each degree of freedom of the Euler angle; then, the Euler angle information is converted into a rotation matrix R, and the rotation information of the object is obtained;
the algorithm of the translation part is as follows:
when the template data are generated, the object is sampled at a specified distance and the CAD model size is known, so the size of the object's bounding box is inversely proportional to its distance, i.e. the smaller the bounding box, the greater the distance, which matches human visual perception; the distance between the model center point and the camera optical center can therefore be obtained from (6):
D = (w_in / w_i) · D_i    (6)
where w_in is the bounding-box width output by object recognition, w_i is the bounding-box width of the template data matched to the rotation, D_i is the specified distance at which the template data were collected, and D is the distance between the model center point and the camera optical center;
the prior information on the CAD model size is known, so the actual physical distance represented by each pixel in the template can be calculated, and the displacement components of the object can then be calculated:

$$t_x=\frac{(u-u_0)\,D}{a_x}\qquad(7)\qquad\qquad t_y=\frac{(v-v_0)\,D}{a_y}\qquad(8)$$

where t_x is the displacement of the object along the x-axis and t_y is the displacement along the y-axis; after the rotation R and the displacement T of the object are obtained, the world coordinates of the object are obtained by combining the intrinsic and extrinsic parameters of the camera.
7. The CAD model-based object pose prediction method of claim 6, wherein the exact pose of the object is obtained by an iterative algorithm as follows:
if the coarse-matched object rotation is A = (ψ, θ, φ), an angle Δε is added to and subtracted from each coordinate axis on this basis, with Δε set to half the coarse-matching interval, which yields the angles of the cells neighbouring the coarse-matched rotation in the coarse-matching rotation space; the object contours are obtained from the CAD model at these neighbouring angles, the contour-sampling method is applied together with formula (5), and the rotation A_1 = (ψ_1, θ_1, φ_1) that minimizes L_i in formula (5) is the rotation angle of the object after one iteration;
the angle Δε is then halved repeatedly to obtain angle values over a smaller range for further iterations, until a rotation angle is finally obtained for which the distance in formula (5) is 0;
and combining the translation information of the object obtained in the coarse matching to obtain the accurate pose of the object.
CN201910947809.8A 2019-10-08 2019-10-08 Object pose prediction method based on CAD model Pending CN110706285A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910947809.8A CN110706285A (en) 2019-10-08 2019-10-08 Object pose prediction method based on CAD model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910947809.8A CN110706285A (en) 2019-10-08 2019-10-08 Object pose prediction method based on CAD model

Publications (1)

Publication Number Publication Date
CN110706285A true CN110706285A (en) 2020-01-17

Family

ID=69196741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910947809.8A Pending CN110706285A (en) 2019-10-08 2019-10-08 Object pose prediction method based on CAD model

Country Status (1)

Country Link
CN (1) CN110706285A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110157178A1 (en) * 2009-12-28 2011-06-30 Cuneyt Oncel Tuzel Method and System for Determining Poses of Objects
CN103365249A (en) * 2013-07-10 2013-10-23 西安电子科技大学 Rapid solving method for failure workspace of six-degree-of-freedom parallel robot
CN104596502A (en) * 2015-01-23 2015-05-06 浙江大学 Object posture measuring method based on CAD model and monocular vision
CN106845515A (en) * 2016-12-06 2017-06-13 上海交通大学 Robot target identification and pose reconstructing method based on virtual sample deep learning
CN106600639A (en) * 2016-12-09 2017-04-26 江南大学 Genetic algorithm and adaptive threshold constraint-combined ICP (iterative closest point) pose positioning technology
CN106845354A (en) * 2016-12-23 2017-06-13 中国科学院自动化研究所 Partial view base construction method, part positioning grasping means and device
CN107818577A (en) * 2017-10-26 2018-03-20 滁州学院 A kind of Parts Recognition and localization method based on mixed model
CN108010082A (en) * 2017-12-28 2018-05-08 上海觉感视觉科技有限公司 A kind of method of geometric match
CN108555908A (en) * 2018-04-12 2018-09-21 同济大学 A kind of identification of stacking workpiece posture and pick-up method based on RGBD cameras
CN109087323A (en) * 2018-07-25 2018-12-25 武汉大学 A kind of image three-dimensional vehicle Attitude estimation method based on fine CAD model
CN109801337A (en) * 2019-01-21 2019-05-24 同济大学 A kind of 6D position and orientation estimation method of Case-based Reasoning segmentation network and iteration optimization
CN110097598A (en) * 2019-04-11 2019-08-06 暨南大学 A kind of three-dimension object position and orientation estimation method based on PVFH feature
CN110298854A (en) * 2019-05-17 2019-10-01 同济大学 The snakelike arm co-located method of flight based on online adaptive and monocular vision

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YIBO CUI 等: "Estimation of 6Dof Pose Using Image Mask and Bounding Box", 《IGTA 2019: IMAGE AND GRAPHICS TECHNOLOGIES AND APPLICATIONS》 *
ZHUANGNAN XU 等: "A Monocular Object Pose Recognition Algorithm Based on CAD Model and Object Contour", 《JOURNAL OF COMPUTING AND ELECTRONIC INFORMATION MANAGEMENT》 *
崔毅博 等 (Cui Yibo et al.): "利用RGB图像和DNN进行物体6DOf位姿推算" [6DoF Object Pose Estimation Using RGB Images and a DNN], 《计算机仿真》 [Computer Simulation] *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4166281A4 (en) * 2020-07-29 2024-03-13 Siemens Ltd. China Method and apparatus for robot to grab three-dimensional object
CN112465898A (en) * 2020-11-20 2021-03-09 上海交通大学 Object 3D pose tag acquisition method based on checkerboard calibration plate
CN112630639A (en) * 2020-12-01 2021-04-09 国网江苏省电力有限公司检修分公司 System and method for online detection of meshing state of handcart contact of high-voltage switch cabinet
CN112630639B (en) * 2020-12-01 2022-12-23 国网江苏省电力有限公司检修分公司 System and method for online detection of meshing state of handcart contact of high-voltage switch cabinet
WO2022252487A1 (en) * 2021-06-04 2022-12-08 浙江商汤科技开发有限公司 Pose acquisition method, apparatus, electronic device, storage medium, and program
CN115033998A (en) * 2022-07-13 2022-09-09 北京航空航天大学 Personalized 2D data set construction method for mechanical parts

Similar Documents

Publication Publication Date Title
CN110706285A (en) Object pose prediction method based on CAD model
Yang et al. Monocular object and plane slam in structured environments
CN109345588B (en) Tag-based six-degree-of-freedom attitude estimation method
CN109544677B (en) Indoor scene main structure reconstruction method and system based on depth image key frame
CN105021124B (en) A kind of planar part three-dimensional position and normal vector computational methods based on depth map
CN111897349B (en) Autonomous obstacle avoidance method for underwater robot based on binocular vision
CN108122256B (en) A method of it approaches under state and rotates object pose measurement
CN110688947B (en) Method for synchronously realizing human face three-dimensional point cloud feature point positioning and human face segmentation
EP3159125A1 (en) Device for recognizing position of mobile robot by using direct tracking, and method therefor
CN111401266B (en) Method, equipment, computer equipment and readable storage medium for positioning picture corner points
EP3751517A1 (en) Fast articulated motion tracking
KR100874817B1 (en) Facial feature detection method, media and apparatus using stereo combining mechanism
US20050265604A1 (en) Image processing apparatus and method thereof
EP3159122A1 (en) Device and method for recognizing location of mobile robot by means of search-based correlation matching
CN113393524B (en) Target pose estimation method combining deep learning and contour point cloud reconstruction
CN114022542A (en) Three-dimensional reconstruction-based 3D database manufacturing method
EP3185212B1 (en) Dynamic particle filter parameterization
CN108335325A (en) A kind of cube method for fast measuring based on depth camera data
CN113439289A (en) Image processing for determining the thickness of an object
Sun et al. A fast underwater calibration method based on vanishing point optimization of two orthogonal parallel lines
CN111709269B (en) Human hand segmentation method and device based on two-dimensional joint information in depth image
CN108694348B (en) Tracking registration method and device based on natural features
CN105339981A (en) Method for registering data using set of primitives
CN111915632B (en) Machine learning-based method for constructing truth database of lean texture target object
CN117218205B (en) Camera external parameter correction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200117)