CN118212268A - Class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation


Info

Publication number: CN118212268A
Application number: CN202410437068.XA
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 秦学英, 曹昕, 李贾, 赵盼盼, 李佳宸
Original and Current Assignee: Shandong University
Application filed by Shandong University
Legal status: Pending


Abstract

The invention relates to a class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation, belonging to the field of computer vision. In a data preprocessing module, pose estimation is performed on the target object in the first frame of a video stream to obtain its initial pose, after which continuous six-dimensional pose tracking of the object in the depth video stream begins. In a correspondence estimation module, geometric features of point cloud data from different frames are extracted and point-to-point correspondences between the point clouds are estimated. In a correspondence expansion module, the estimated correspondences are refined so that each point is associated with as few points as possible. In a reference frame module, the reference frame is updated over time according to a set criterion to reduce accumulated error during tracking. The invention achieves six-dimensional pose tracking of objects that belong to a known category but have never been seen before, with only the depth video stream and the detection and segmentation result of the initial frame as inputs.

Description

Class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation
Technical Field
The invention relates to a class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation, and belongs to the technical field of computer vision.
Background
In augmented reality/mixed reality applications, the training time, tracking accuracy, and runtime fluency of a three-dimensional object pose tracking model are the main factors affecting the convenience of an application. Pose tracking methods based on deep neural networks show excellent performance, but their need to be trained in advance for a specific object model increases their application cost.
Most existing methods use shape priors to replace the instance-level models of traditional tracking algorithms and attempt to deform the shape prior during matching to approximate the shape of the current instance. Prior-based methods can partially capture intra-class variation, but they require a large number of 3D models during the training phase and may find it difficult to fully account for shape variation within every class; since a shape prior cannot fully reach the accuracy of an instance-level model, such methods may be less accurate in application than instance-level tracking.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation, which achieves six-dimensional pose tracking of objects that belong to a known category but have never been seen before, with only the depth video stream and the detection and segmentation result of the initial frame as inputs. No data collection or retraining of model parameters is needed for a particular instance.
The technical scheme of the invention is as follows:
A class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation comprises the following steps:
S1: In the data preprocessing module, the camera intrinsic parameters are obtained in advance by an existing camera calibration method; with the intrinsic parameters and the depth video stream as input, the three-dimensional point cloud corresponding to each frame is computed by back projection, and the two input point clouds are mean-centered so that the network can effectively estimate the relative position and pose change between video frames during training;
Meanwhile, the first frame of the depth video stream is selected, the region where the target object is located is detected with an existing target detection method to obtain a segmentation mask M_init (also called a mask) covering the region, and the six-dimensional pose of the target object is estimated with an existing class-level single-frame pose estimation method as the initial pose, which completes the initialization process;
S2: In the inter-frame pose estimation module, geometric features of the three-dimensional point cloud data obtained in S1 are extracted with a three-layer perceptron (MLP); the point clouds of the two frames yield two feature matrices, whose matrix product gives the similarity between the features of different points; these similarities are collected in a matrix, called the correspondence matrix, which represents the correspondence between points;
soft correspondence constraints are designed during training: a ground-truth correspondence matrix is generated from the known correspondences, and the parameters of the feature extraction network are trained under its constraint to estimate a correct correspondence matrix;
S3: In the correspondence expansion module, the dispersion of the correspondences in the correspondence matrix is controlled with k-bidirectional correspondence, and the value of k is gradually reduced as the number of iterations increases, so that the source point cloud mapped through the correspondence matrix gradually approaches the shape of the target point cloud.
Preferably, the mean-centering cannot simply use the mean point of a single point cloud as its center: when the segmentation result contains errors, background points may be erroneously included in the observed point cloud, affecting the normalization of the point cloud and the determination of its center position, and thereby the effectiveness of the network during feature learning. To solve this problem, in S1 all points from the two frame point clouds are taken as a whole and their joint mean point is computed as the center point, effectively avoiding interference of background points with the center localization of the point clouds;
In S1, after initialization of the initial pose, the two depth images currently to be matched are selected from the input depth video stream, and the depth value z of each target pixel is obtained from the depth image and the segmentation mask. Since the camera is calibrated in advance, its intrinsic matrix K is known, and the three-dimensional point (x, y, z) obtained by back-projecting the target two-dimensional pixel (u, v) into the camera coordinate system is given by the following formula:

(x, y, z)^T = z · K^(-1) · (u, v, 1)^T
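A minimal sketch of this back-projection and joint mean-centering, assuming a standard pinhole model; the helper names backproject and joint_center are illustrative, not from the patent:

```python
import numpy as np

def backproject(depth, K, mask):
    """Back-project masked depth pixels to an (N, 3) camera-space point cloud."""
    v, u = np.nonzero(mask)                  # pixel rows (v) and columns (u)
    z = depth[v, u]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * z / fx                    # from x/z = (u - cx) / fx
    y = (v - cy) * z / fy                    # from y/z = (v - cy) / fy
    return np.stack([x, y, z], axis=-1)

def joint_center(P_i, P_j):
    """Mean-centering as in S1: one shared mean point computed from both frames."""
    center = np.concatenate([P_i, P_j], axis=0).mean(axis=0)
    return P_i - center, P_j - center
```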
Preferably, each layer of the perceptron in S2 consists of two linear layers and one ReLU layer; after the matrix product of the two feature matrices is taken, row and column normalization is performed iteratively with the existing Sinkhorn algorithm, preferably for 5 iterations, each iteration normalizing the rows and then the columns of the matrix in sequence, finally yielding a correspondence matrix that represents the correspondences and reflects the feature similarity between points.
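A minimal sketch of this matching step, assuming the raw similarity is the matrix product of the per-point feature matrices and that scores are exponentiated before the Sinkhorn sweeps (a common choice the patent does not state):

```python
import numpy as np

def similarity(F_i, F_j):
    return F_i @ F_j.T                   # (n_i, c) @ (c, n_j) -> pairwise scores

def sinkhorn(S, n_iters=5, eps=1e-8):
    """Alternate row/column normalization of exp(S), 5 iterations as stated above."""
    A = np.exp(S - S.max())              # stabilized exponentiation (assumption)
    for _ in range(n_iters):
        A = A / (A.sum(axis=1, keepdims=True) + eps)   # rows first
        A = A / (A.sum(axis=0, keepdims=True) + eps)   # then columns
    return A
```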
Using the known position of the object in the previous video frame and the continuity of object and camera motion, the invention innovatively proposes a method for accurate depth image segmentation of the current video frame. Since the motion of the object and the camera is continuous, it is assumed in S1 that the change in the object's position relative to the camera between consecutive video frames is tiny and can be ignored. First, the pixel coordinates of the target object on the depth image D_{t-1} of the previous video frame are obtained using the segmentation mask M_i of the previous frame, i.e. the i-th frame (the initial segmentation mask M_init in the initial case). A series of two-dimensional rectangular bounding boxes B_{j,k} of different sizes is then generated on the depth image of the current frame, i.e. the j-th frame, by gradually expanding the rectangular bounding box by a preset step (generally 1 pixel); each box is applied to the image processing of the current frame, all generated segmentation results are integrated, and the one with the highest confidence score is selected as the optimal segmentation result B_j on the current frame j, ensuring that the bounding box accurately covers the target object. Finally, the selected rectangular bounding box is used as input data and, in cooperation with the target detection method, yields the segmentation mask M_j finally used for the current frame; a sketch of this procedure follows below.
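The following sketch illustrates one plausible reading of this mask propagation; segment() is a hypothetical stand-in for the detector/segmenter pipeline, returning a mask and a confidence score, and max_grow (an assumption) bounds the box expansion:

```python
import numpy as np

def propagate_mask(mask_prev, depth_cur, segment, max_grow=10, step=1):
    """Propagate the previous frame's mask to the current frame via box expansion."""
    v, u = np.nonzero(mask_prev)
    box = [u.min(), v.min(), u.max(), v.max()]          # box around previous mask
    h, w = depth_cur.shape
    best_mask, best_score = None, -np.inf
    for g in range(0, max_grow + 1, step):              # expand by `step` pixels
        grown = (max(box[0] - g, 0), max(box[1] - g, 0),
                 min(box[2] + g, w - 1), min(box[3] + g, h - 1))
        mask, score = segment(depth_cur, grown)
        if score > best_score:                          # keep most confident result
            best_mask, best_score = mask, score
    return best_mask
```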
Preferably, the geometric features in S2 are obtained by a feature extraction module, using point-pair features (PPF) to describe the neighborhood around each point in the point cloud and thereby represent each point's local geometric information in a rotation-invariant manner; the feature extraction module builds on PointNet++, takes the point-pair features as input, and outputs the local geometric features F_i and F_j of each point cloud, which are then fed into the feature matching module to estimate the soft correspondences between the point clouds.
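For reference, a sketch of the classical point-pair feature four-tuple (a distance and three angles, per Drost et al.), which matches the rotation-invariant description above; the patent does not spell out the exact form:

```python
import numpy as np

def angle(a, b):
    cos = np.clip(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)), -1.0, 1.0)
    return np.arccos(cos)

def ppf(p1, n1, p2, n2):
    """Rotation-invariant 4D feature of an oriented point pair (points + normals)."""
    d = p2 - p1
    return np.array([np.linalg.norm(d),   # distance between the two points
                     angle(n1, d),        # angle between normal n1 and d
                     angle(n2, d),        # angle between normal n2 and d
                     angle(n1, n2)])      # angle between the two normals
```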
Preferably, in S2 three-layer perceptrons are used to extract the features F_i and F_j of the point clouds P_i and P_j in the camera coordinate systems of different frames; the similarity scores between the two sets of features are obtained by taking the matrix product of the two feature matrices and normalizing it, finally expressed as a correspondence matrix A. Through the correspondence matrix A_ij, a soft mapping is established from the points of the point cloud P_j of the j-th frame to the points of the point cloud P_i of the i-th frame, and vice versa; A_ij^T is the transpose of the correspondence matrix A_ij, as shown in the formula:
P_i = A_ij · P_j
during the training stage, the rows and columns of the correspondence matrix A are normalized, establishing one-to-one, one-to-many, and many-to-one soft correspondences between three-dimensional space points rather than a strict one-to-one hard correspondence;
during the training phase, the known information includes: the point clouds N_i and N_j corresponding to each frame's point cloud in the Normalized Object Coordinate Space (NOCS), and the relative pose transformations T_i and T_j from the normalized object coordinate space to the camera coordinate space, so that:
P_i = T_i · N_i
P_j = T_j · N_j
Through the pose transformations T_i and T_j, the relation between the normalized object coordinate space and the camera coordinate space is established: the point clouds N_i and N_j in the normalized object coordinate space correspond one-to-one with the points of the observed point clouds P_i and P_j in the camera coordinate space. Since N_i and N_j come from the same object and have the same orientation and position in the normalized space, the point correspondences between N_i and N_j can be obtained, and from them the point correspondences between the observed point clouds P_i and P_j.
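A minimal sketch of the two relations above (P = T · N and the soft mapping P_c = A_ij · P_j), assuming 4×4 homogeneous poses and row-stochastic correspondence matrices; the helper names are illustrative:

```python
import numpy as np

def apply_pose(T, N):
    """P = T · N: map a NOCS cloud of shape (n, 3) into camera space, T is 4x4."""
    return (T[:3, :3] @ N.T).T + T[:3, 3]

def corresponding_cloud(A_ij, P_j):
    """P_i ≈ A_ij · P_j: each row of A_ij mixes points of P_j into one point."""
    return A_ij @ P_j          # (n_i, n_j) @ (n_j, 3) -> (n_i, 3)
```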
When the correspondence matrix is estimated, its soft correspondence property means that the weight each point of one point cloud places on the other point cloud is typically not concentrated on a single point but dispersed over multiple points. Accordingly, the correspondence matrix A_ij generates, in the coordinate space of point cloud P_j, a corresponding point cloud P_c with respect to P_i. Each element of the correspondence matrix is a weight with value between 0 and 1, and the weights of a row sum to 1; hence every corresponding point cloud P_c generated by any weight combination is contained in the space of the point cloud P_j, being formed as weighted combinations of its points, and this space can be regarded as the set of deformations of the shape of P_j.
Based on the fact that the soft correspondence property spreads the weight over multiple points, the invention proposes a new strategy, "correspondence expansion", which aims to progressively shrink the region of the other point cloud with which each point is associated. By reducing the number of points involved in computing each corresponding point, the distribution of associated points is concentrated; this also gradually shrinks the space of clouds representable through the matrix A_ij and the point cloud P_j, so that the generated corresponding point cloud fits the shape of the point cloud more closely, effectively solving the "point cloud shrinkage" problem.
Preferably, in S3 the scale of the correspondences in the correspondence matrix A is controlled with k-bidirectional correspondence. Exploiting the fact that the point clouds N_i and N_j lie in the same coordinate space and belong to different parts of the same object, the k points p_{a,k}, k = 1, 2, …, K, within radius r of each point p_a are found with the k-nearest-neighbor (kNN) algorithm as its associated points. When a point p_a from point cloud N_i is an associated point of a point p_b from point cloud N_j, while p_b is also an associated point of p_a, the pair (p_a, p_b) is considered a pair of mutually associated points;
A matrix A_gt of the same size as the correspondence matrix A is generated; the positions corresponding to mutually associated point pairs are set to 1 and all other positions to 0. The KL divergence between the matrix A_gt and the estimated correspondence matrix A_ij is used as a constraint, so that the inter-frame pose estimation module learns to estimate a soft correspondence matrix representing the soft correspondences of point clouds between frames.
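A minimal sketch of constructing A_gt from mutually associated pairs (mutual k-nearest neighbors within radius r in NOCS space); row-normalizing A_gt so its rows can be compared by KL divergence is an assumption:

```python
import numpy as np

def mutual_knn_matrix(N_i, N_j, k=4, r=0.1):
    """A_gt[a, b] > 0 iff points a and b are mutual k-nearest neighbors within r."""
    d = np.linalg.norm(N_i[:, None, :] - N_j[None, :, :], axis=-1)  # (n_i, n_j)
    knn_i = np.argsort(d, axis=1)[:, :k]       # k nearest N_j points per N_i point
    knn_j = np.argsort(d, axis=0)[:k, :].T     # k nearest N_i points per N_j point
    A_gt = np.zeros_like(d)
    for a in range(len(N_i)):
        for b in knn_i[a]:
            if d[a, b] <= r and a in knn_j[b]:  # mutually associated pair
                A_gt[a, b] = 1.0
    rows = A_gt.sum(axis=1, keepdims=True)
    return A_gt / np.maximum(rows, 1e-8)        # rows as weight distributions
```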
Preferably, a corresponding confidence score is further estimated using the Bidirectional Correspondence Distance (BCD). By transposing the feature similarity matrix S of the two point clouds, the bidirectional correspondences expressed as the correspondence matrices A_ij and A_ji can both be estimated, realizing estimation of the relative pose between the two point clouds. When the correspondence is a hard correspondence, A_ij and A_ji are mutually inverse. However, when the correspondence is soft and expressed as a soft correspondence matrix, A_ij and A_ji generally do not form an inverse pair, due to weight dispersion and row-column normalization.
Using the bidirectional correspondences, points whose weight distribution is excessively dispersed can be identified. As described above, a point of the point cloud P_i corresponds to multiple points P_{c,k} in the point cloud P_j; when the P_{c,k} are spatially dispersed, the weights in A_ij and A_ji relating to that point spread over points of P_j that lie far apart. Because of this, when the correspondence matrices A_ij and A_ji are applied in sequence in a point-wise weighted computation, the regenerated point P_c also lies far, in spatial distance, from the original position of P_c in P_i:
BCD_i = ||P_i − A_ij · A_ji · P_i||
where BCD_i denotes the bidirectional correspondence distance values of the i-th frame point cloud.
All mutually associated point pairs correspond to points whose weights are concentrated in a specific region, and their bidirectional correspondence distance BCD is small. This is because such points have a correspondence close to a hard correspondence, so the linear combinations defined by their correspondence matrices are hardly affected by the number of points involved: composing the two correspondence matrices barely changes their spatial position.
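A minimal sketch of the BCD computation under the conventions above, where A_ji maps P_i toward the j-th frame's cloud and A_ij maps back:

```python
import numpy as np

def bcd(P_i, A_ij, A_ji):
    """Per-point bidirectional correspondence distance of the i-th frame cloud."""
    round_trip = A_ij @ (A_ji @ P_i)   # P_i -> frame-j space -> back to frame i
    return np.linalg.norm(P_i - round_trip, axis=-1)
```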
Although the relative pose between successive frames can be accurately estimated via soft correspondence matrix estimation, the accumulation of errors during continuous matching is unavoidable. These accumulated errors gradually drive the tracking result in the wrong direction and eventually cause tracking failure; this is one reason why point cloud registration methods cannot be applied directly to the tracking task. To overcome the accumulated-error problem, the invention proposes a reference frame strategy: at the start of tracking, the data of the initial frame is stored as the reference frame; when computing the relative pose between two consecutive frames, the relative transformation between adjacent frames is applied to the reference frame; this transformation is applied to the point cloud of the reference frame, which is thereby transformed into the coordinate system of the current frame; the point clouds of the reference frame and the current frame are then treated as adjacent frames, and a soft correspondence matrix is estimated to determine their relative pose; finally, the pose estimates of the several stages are accumulated to obtain the final estimate for the target object in the current image frame. The reference frame strategy matches frames across a time interval, thereby reducing error accumulation and providing more accurate pose tracking results.
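A minimal sketch of one tracking step under this strategy; estimate_relative_pose is a hypothetical stand-in for the soft-correspondence matcher and is assumed to return a 4×4 transform from its first argument's cloud to its second's:

```python
import numpy as np

def track_step(T_ref, P_ref, P_cur, estimate_relative_pose):
    """Compose the accumulated reference pose with a fresh reference-to-current match."""
    # Warp the reference cloud into (approximately) the current frame's coordinates
    P_warp = (T_ref[:3, :3] @ P_ref.T).T + T_ref[:3, 3]
    # Treat the warped reference and the current frame as adjacent frames, re-match
    T_delta = estimate_relative_pose(P_warp, P_cur)
    return T_delta @ T_ref                 # accumulate both stages
```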
Preferably, during the training of S2 the correspondences between the points of different point clouds are obtained from the annotation information of the normalized object coordinate space and the camera coordinate space, a ground-truth correspondence matrix A_gt is generated, and the training of the network is constrained with a correspondence loss function L_corr:
L_corr = KL(A_gt, A_pred)
where KL(·) denotes the KL divergence calculation function, which measures the similarity of two distributions: the larger its value, the larger the difference between the distributions represented by the two input matrices; A_pred is the estimated correspondence matrix;
Meanwhile, for the correspondence matrices A_ij and A_ji estimated from the two frame point clouds via the known correspondences (which can be realized by the correspondence estimation module), their bidirectional loss is computed, encouraging the network to aggregate the weights in the estimated correspondence matrices and exploiting the bidirectional correspondence property during training:
L_bi = L(A_ij · A_ji · P_i, P_i) + L(A_ji · A_ij · P_j, P_j)
where P_i and P_j are the point clouds of the i-th and j-th frames respectively, A_ij is the correspondence matrix representing the correspondence from the j-th frame to the i-th frame, A_ji is the correspondence matrix representing the correspondence from the i-th frame to the j-th frame, and L(·) denotes the L1 loss function;
In the final stage, the pose transformation loss function L_transform is divided into a rotation loss L_rot and a translation loss L_t, evaluating the network's performance in estimating rotation and translation separately rather than combining them into a single unified loss function. The transformation loss uses the l1 distance between the point cloud transformed by the predicted transformation [R_pred, t_pred] and by the ground-truth transformation [R_gt, t_gt], as shown in the following equation:
L_transform = λ_rot · L_rot + λ_t · L_t = λ_rot · ||R_gt · P_i − R_pred · P_i||_1 + λ_t · ||t_gt − t_pred||_1
where λ_rot and λ_t denote the weights of the rotation loss L_rot and the translation loss L_t within the transformation loss; T_pred is the finally estimated six-dimensional pose, comprising a three-dimensional rotation and a three-dimensional translation; T_gt is the true six-dimensional pose of the target object; R_pred is the rotation matrix of pose T_pred and t_pred is its translation vector; R_gt is the rotation matrix of pose T_gt and t_gt is its translation vector.
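A minimal sketch of the three training losses, assuming PyTorch tensors, row-stochastic correspondence matrices, and unit loss weights; the reductions and weightings are illustrative:

```python
import torch
import torch.nn.functional as F

def correspondence_loss(A_gt, A_pred, eps=1e-8):
    """L_corr = KL(A_gt, A_pred), rows treated as distributions."""
    return F.kl_div((A_pred + eps).log(), A_gt, reduction="batchmean")

def bidirectional_loss(P_i, P_j, A_ij, A_ji):
    """Smooth L1 between round-tripped clouds and the originals."""
    return (F.smooth_l1_loss(A_ij @ (A_ji @ P_i), P_i) +
            F.smooth_l1_loss(A_ji @ (A_ij @ P_j), P_j))

def transform_loss(R_pred, t_pred, R_gt, t_gt, P_i, lam_rot=1.0, lam_t=1.0):
    """lam_rot * ||R_gt P_i - R_pred P_i||_1 + lam_t * ||t_gt - t_pred||_1."""
    L_rot = F.l1_loss(P_i @ R_pred.T, P_i @ R_gt.T)   # compare the rotated clouds
    L_t = F.l1_loss(t_pred, t_gt)
    return lam_rot * L_rot + lam_t * L_t
```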
The method takes the depth video stream as input and, in the data preprocessing module, performs pose estimation on the target object in the first frame of the video stream to obtain its initial pose. After initialization is completed, continuous six-dimensional pose tracking of the object in the depth video stream begins. In the correspondence estimation module, geometric features of point cloud data from different frames are extracted, feature matching is performed, and the point-to-point correspondences between the point clouds are estimated. In the correspondence expansion module, correspondence expansion is applied to the estimated correspondences so that each point is associated with as few points as possible and a more accurate shape is obtained after mapping through the correspondence matrix. In the reference frame module, the reference frame is updated over time according to a set criterion to reduce accumulated error during tracking. The method solves the difficulty of six-dimensional pose tracking of class-level objects from a depth video stream input, and can track the six-dimensional pose of a class-level object with the same geometric characteristics even if the model has never seen the object before.
Details not exhaustively described herein can be found in the prior art.
The beneficial effects of the invention are as follows:
1. Three constraints are added to the six-dimensional pose tracking process for class-level three-dimensional objects: the soft correspondence constraint, the transformation constraint (common in existing similar methods: after the whole tracking process, the estimated pose is compared with the true pose to check whether they agree), and the bidirectional correspondence constraint. In particular, for three-dimensional tracking based on a depth video stream, where accurate keypoints are difficult to estimate on depth images and keypoint offsets lead to inaccurate tracking and localization, the soft correspondence constraint designed by the invention effectively solves the dispersion of corresponding points when matching dense point clouds and maps the point cloud to a correct shape closer to the target point cloud, making the six-dimensional pose tracking of three-dimensional objects more accurate.
2. Given only the depth video stream and the segmentation result and six-dimensional pose of the target object in the first frame, the method can complete six-dimensional pose tracking of an object even when no similar object appeared in the training data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application.
FIG. 1 is a flow chart of the collaboration of the modules of the class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation;
FIG. 2 shows tracking results of the invention from different viewing angles for different categories in the same scene of the normalized object coordinate space dataset (NOCS-REAL275); (a) through (f) are the tracking result graphs for viewing angles one through six;
FIG. 3 illustrates the soft correspondence estimation flow, the point cloud shrinkage problem, and correspondence expansion in the invention.
Detailed Description
In order to better understand the technical solutions of the present specification, the technical solutions in the embodiments of the invention are described below clearly and completely with reference to the drawings; the embodiments are illustrative rather than limiting, and anything not fully described follows the conventional technology in the art.
Example 1
A class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation, as shown in FIG. 1, comprises the following steps:
S1: In the data preprocessing module, the camera intrinsic parameters are obtained in advance by an existing camera calibration method; with the intrinsic parameters and the depth video stream as input, the three-dimensional point cloud corresponding to each frame is computed by back projection, and the two input point clouds are mean-centered so that the network can effectively estimate the relative position and pose change between video frames during training;
Meanwhile, the first frame of the depth video stream is selected, the region where the target object is located is detected with an existing target detection method (yolov), the target object within the detected two-dimensional bounding box is then segmented with a FAST-SAM model to obtain a segmentation mask M_init (also called a mask) covering the region, and the six-dimensional pose of the target object is estimated with an existing class-level single-frame pose estimation method (such as genelse or IST-Net) as the initial pose; after the initialization process is finished, pose tracking of the three-dimensional object begins;
S2: In the inter-frame pose estimation module, geometric features of the three-dimensional point cloud data obtained in S1 are extracted with a three-layer perceptron (MLP); the point clouds of the two frames yield two feature matrices, whose matrix product gives the similarity between the features of different points; these similarities are collected in a matrix, called the correspondence matrix, which represents the correspondence between points, as shown in FIG. 3;
soft correspondence constraints are designed during training: a ground-truth correspondence matrix is generated from the known correspondences, and the parameters of the feature extraction network are trained under its constraint to estimate a correct correspondence matrix;
S3: In the correspondence expansion module, the dispersion of the correspondences in the correspondence matrix is controlled with k-bidirectional correspondence. In the correspondence matrix obtained in S2, each point may be associated with many points of the other point cloud, which makes the corresponding point obtained by weighting the associated points approach the spatial mean of all associated points and gives the finally weighted point cloud a wrong shape, as illustrated by the point cloud shrinkage problem in FIG. 3. k-bidirectional correspondence limits the number of points that can be associated: the k largest weights in the correspondence matrix are kept, the other weights are reset to 0, and row and column normalization is performed again. The excluded points no longer participate in normalization, cannot be assigned new weights, and do not take part in the subsequent weighting. The number of points each point can associate with in the other point cloud is thus bounded by k. As the number of iterations increases, the value of k is gradually reduced, by a preset step, down to 3, so that the source point cloud mapped through the correspondence matrix gradually approaches the shape of the target point cloud.
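A minimal sketch of this pruning and renormalization; the number of renormalization sweeps and the linear schedule for k are illustrative assumptions:

```python
import numpy as np

def expand_correspondence(A, k):
    """Keep the k largest weights per row, zero the rest, then renormalize."""
    drop = np.argsort(A, axis=1)[:, :-k]       # indices of all but the top-k per row
    A = A.copy()
    np.put_along_axis(A, drop, 0.0, axis=1)    # excluded points get weight 0
    for _ in range(5):                         # rows then columns, as in Sinkhorn
        A = A / (A.sum(axis=1, keepdims=True) + 1e-8)
        A = A / (A.sum(axis=0, keepdims=True) + 1e-8)
    return A

def schedule_k(k_init, iteration, step=1, k_min=3):
    """k shrinks with the iteration count down to 3, as described above."""
    return max(k_init - iteration * step, k_min)
```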
Example 2
A class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation, as described in embodiment 1, except that the mean-centering cannot simply use the mean point of a single point cloud as its center: when the segmentation result contains errors, background points may be erroneously included in the observed point cloud, affecting the normalization of the point cloud and the determination of its center position, and thereby the effectiveness of the network during feature learning. To solve this problem, in S1 all points from the two frame point clouds are taken as a whole and their joint mean point is computed as the center point, effectively avoiding interference of background points with the center localization of the point clouds;
In S1, after initialization of the initial pose, the two depth images currently to be matched are selected from the input depth video stream, and the depth value z of each target pixel is obtained from the depth image and the segmentation mask. Since the camera is calibrated in advance, its intrinsic matrix K is known, and the three-dimensional point (x, y, z) obtained by back-projecting the target two-dimensional pixel (u, v) into the camera coordinate system is given by the following formula:

(x, y, z)^T = z · K^(-1) · (u, v, 1)^T
Example 3
A class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation, as described in embodiment 2, except that each layer of the perceptron in S2 consists of two linear layers and one ReLU layer; after the matrix product of the two feature matrices is taken, row and column normalization is performed iteratively with the existing Sinkhorn algorithm, preferably for 5 iterations, each iteration normalizing the rows and then the columns of the matrix in sequence, finally yielding a correspondence matrix that represents the correspondences and reflects the feature similarity between points.
Using the known position of the object in the previous video frame and the continuity of object and camera motion, the invention innovatively proposes a method for accurate depth image segmentation of the current video frame. Since the motion of the object and the camera is continuous, it is assumed in S1 that the change in the object's position relative to the camera between consecutive video frames is tiny and can be ignored. First, the pixel coordinates of the target object on the depth image D_{t-1} of the previous video frame are obtained using the segmentation mask M_i of the previous frame, i.e. the i-th frame (the initial segmentation mask M_init in the initial case). A series of two-dimensional rectangular bounding boxes B_{j,k} of different sizes is then generated on the depth image of the current frame, i.e. the j-th frame, by gradually expanding the rectangular bounding box by a preset step (generally 1 pixel); each box is applied to the image processing of the current frame, all generated segmentation results are integrated, and the one with the highest confidence score is selected as the optimal segmentation result B_j on the current frame j, ensuring that the bounding box accurately covers the target object. Finally, the selected rectangular bounding box is used as input data and, in cooperation with the target detection method, yields the segmentation mask M_j finally used for the current frame.
The geometric features in S2 are obtained by a feature extraction module, using point-pair features (PPF) to describe the neighborhood around each point in the point cloud and thereby represent each point's local geometric information in a rotation-invariant manner; the feature extraction module builds on PointNet++, takes the point-pair features as input, and outputs the local geometric features F_i and F_j of each point cloud, which are then fed into the feature matching module to estimate the soft correspondences between the point clouds.
Example 4
In S2, three-layer perceptrons are used to extract the features F_i and F_j of the observed point clouds P_i and P_j in the camera coordinate systems of different frames; the similarity scores between the two sets of features are obtained by taking the matrix product of the two feature matrices and normalizing it, finally expressed as a correspondence matrix A. Through the correspondence matrix A_ij, a soft mapping is established from the points of the point cloud P_j of the j-th frame to the points of the point cloud P_i of the i-th frame, and vice versa; A_ij^T is the transpose of the correspondence matrix A_ij, as shown in the formula:
P_i = A_ij · P_j
during the training stage, the rows and columns of the correspondence matrix A are normalized, establishing one-to-one, one-to-many, and many-to-one soft correspondences between three-dimensional space points rather than a strict one-to-one hard correspondence;
during the training phase, the known information includes: the point clouds N_i and N_j corresponding to each frame's point cloud in the Normalized Object Coordinate Space (NOCS), and the relative pose transformations T_i and T_j from the normalized object coordinate space to the camera coordinate space, so that:
P_i = T_i · N_i
P_j = T_j · N_j
Through the pose transformations T_i and T_j, the relation between the normalized object coordinate space and the camera coordinate space is established: the point clouds N_i and N_j in the normalized object coordinate space correspond one-to-one with the points of the observed point clouds P_i and P_j in the camera coordinate space. Since N_i and N_j come from the same object and have the same orientation and position in the normalized space, the point correspondences between N_i and N_j can be obtained, and from them the point correspondences between the observed point clouds P_i and P_j.
Example 5
A class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation, as in embodiment 4, except that, when the correspondence matrix is estimated, its soft correspondence property means that the weight each point of one point cloud places on the other point cloud is typically not concentrated on a single point but dispersed over multiple points. Accordingly, the correspondence matrix A_ij generates, in the coordinate space of point cloud P_j, a corresponding point cloud P_c with respect to P_i. Each element of the correspondence matrix is a weight with value between 0 and 1, and the weights of a row sum to 1; hence every corresponding point cloud P_c generated by any weight combination is contained in the space of the point cloud P_j, being formed as weighted combinations of its points, and this space can be regarded as the set of deformations of the shape of P_j.
Based on the fact that the soft correspondence property spreads the weight over multiple points, the invention proposes a new strategy, "correspondence expansion", which aims to progressively shrink the region of the other point cloud with which each point is associated. By reducing the number of points involved in computing each corresponding point, the distribution of associated points is concentrated; this also gradually shrinks the space of clouds representable through the matrix A_ij and the point cloud P_j, so that the generated corresponding point cloud fits the shape of the point cloud more closely, effectively solving the "point cloud shrinkage" problem.
In S3, the scale of the correspondences in the correspondence matrix A is controlled with k-bidirectional correspondence. Exploiting the fact that the point clouds N_i and N_j lie in the same coordinate space and belong to different parts of the same object, the k points p_{a,k}, k = 1, 2, …, K, within radius r of each point p_a are found with the k-nearest-neighbor (kNN) algorithm as its associated points. When a point p_a from point cloud N_i is an associated point of a point p_b from point cloud N_j, while p_b is also an associated point of p_a, the pair (p_a, p_b) is considered a pair of mutually associated points;
A matrix A_gt of the same size as the correspondence matrix A is generated; the positions corresponding to mutually associated point pairs are set to 1 and all other positions to 0. The KL divergence between the matrix A_gt and the estimated correspondence matrix A_ij is used as a constraint, so that the inter-frame pose estimation module learns to estimate a soft correspondence matrix representing the soft correspondences of point clouds between frames.
Example 6
A class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation, as described in embodiment 5, except that a corresponding confidence score is further estimated using the Bidirectional Correspondence Distance (BCD). By transposing the feature similarity matrix S of the two point clouds, the bidirectional correspondences expressed as the correspondence matrices A_ij and A_ji can both be estimated, realizing estimation of the relative pose between the two point clouds. When the correspondence is a hard correspondence, A_ij and A_ji are mutually inverse. However, when the correspondence is soft and expressed as a soft correspondence matrix, A_ij and A_ji generally do not form an inverse pair, due to weight dispersion and row-column normalization.
Using the bidirectional correspondences, points whose weight distribution is excessively dispersed can be identified. As described above, a point of the point cloud P_i corresponds to multiple points P_{c,k} in the point cloud P_j; when the P_{c,k} are spatially dispersed, the weights in A_ij and A_ji relating to that point spread over points of P_j that lie far apart. Because of this, when the correspondence matrices A_ij and A_ji are applied in sequence in a point-wise weighted computation, the regenerated point P_c also lies far, in spatial distance, from the original position of P_c in P_i:
BCD_i = ||P_i − A_ij · A_ji · P_i||
where BCD_i denotes the bidirectional correspondence distance values of the i-th frame point cloud.
All mutually associated point pairs correspond to points whose weights are concentrated in a specific region, and their bidirectional correspondence distance BCD is small. This is because such points have a correspondence close to a hard correspondence, so the linear combinations defined by their correspondence matrices are hardly affected by the number of points involved: composing the two correspondence matrices barely changes their spatial position.
Although the relative pose between successive frames can be accurately estimated via soft correspondence matrix estimation, the accumulation of errors during continuous matching is unavoidable. These accumulated errors gradually drive the tracking result in the wrong direction and eventually cause tracking failure; this is one reason why point cloud registration methods cannot be applied directly to the tracking task. To overcome the accumulated-error problem, the invention proposes a reference frame strategy: at the start of tracking, the data of the initial frame is stored as the reference frame; when computing the relative pose between two consecutive frames, the relative transformation between adjacent frames is applied to the reference frame; this transformation is applied to the point cloud of the reference frame, which is thereby transformed into the coordinate system of the current frame; the point clouds of the reference frame and the current frame are then treated as adjacent frames, and a soft correspondence matrix is estimated to determine their relative pose; finally, the pose estimates of the several stages are accumulated to obtain the final estimate for the target object in the current image frame. The reference frame strategy matches frames across a time interval, thereby reducing error accumulation and providing more accurate pose tracking results.
Example 7
A class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation, as described in embodiment 6, except that during the training of S2 the correspondences between the points of different point clouds are obtained from the annotation information of the normalized object coordinate space and the camera coordinate space, a ground-truth correspondence matrix A_gt is generated, and the training of the network is constrained with a correspondence loss function L_corr:
L_corr = KL(A_gt, A_pred)
where KL(·) denotes the KL divergence calculation function, which measures the similarity of two distributions: the larger its value, the larger the difference between the distributions represented by the two input matrices; A_pred is the estimated correspondence matrix;
Meanwhile, for the correspondence matrices A_ij and A_ji estimated from the two frame point clouds via the known correspondences (which can be realized by the correspondence estimation module), their bidirectional loss is computed, encouraging the network to aggregate the weights in the estimated correspondence matrices and exploiting the bidirectional correspondence property during training:
L_bi = L(A_ij · A_ji · P_i, P_i) + L(A_ji · A_ij · P_j, P_j)
where P_i and P_j are the point clouds of the i-th and j-th frames respectively, A_ij is the correspondence matrix representing the correspondence from the j-th frame to the i-th frame, A_ji is the correspondence matrix representing the correspondence from the i-th frame to the j-th frame, and L(·) denotes the L1 loss function; the invention applies the two correspondence-matrix transformations to the point clouds and constrains the result relative to the original point cloud with a smooth L1 loss, which helps concentrate the weights of the correspondence matrices on as few local points as possible.
In the final stage, the pose transformation loss function L_transform is divided into a rotation loss L_rot and a translation loss L_t, evaluating the network's performance in estimating rotation and translation separately rather than combining them into a single unified loss function. The transformation loss uses the l1 distance between the point cloud transformed by the predicted transformation [R_pred, t_pred] and by the ground-truth transformation [R_gt, t_gt], as shown in the following equation:
L_transform = λ_rot · L_rot + λ_t · L_t = λ_rot · ||R_gt · P_i − R_pred · P_i||_1 + λ_t · ||t_gt − t_pred||_1
where λ_rot and λ_t denote the weights of the rotation loss L_rot and the translation loss L_t within the transformation loss; T_pred is the finally estimated six-dimensional pose, comprising a three-dimensional rotation and a three-dimensional translation; T_gt is the true six-dimensional pose of the target object; R_pred is the rotation matrix of pose T_pred and t_pred is its translation vector; R_gt is the rotation matrix of pose T_gt and t_gt is its translation vector.
The invention was compared with existing similar methods (6-PACK, CAPTRA) under various metrics; the results are shown in Table 1:
Table 1: Accuracy comparison of the proposed three-dimensional object six-dimensional pose tracking method with existing methods in real scenes
where Metric denotes the different evaluation metrics; 5°5cm and 10°10cm denote the percentage of estimates whose rotation and translation errors both fall within the stated limits (e.g. 5°5cm requires a rotation error within 5° and a translation error within 5 cm); R_err denotes the rotation error of tracking and t_err the translation error; mIoU denotes the mean intersection-over-union between the bounding box corresponding to the estimated pose and the ground-truth bounding box; IoU25 and IoU50 denote the percentage of frames whose intersection-over-union between the estimated and ground-truth bounding boxes exceeds the stated threshold (e.g. IoU25 is the percentage of frames in the sequence whose IoU exceeds 25%); bottle, bowl, camera, can, laptop, and mug denote six categories, each evaluated over three or more sequences; bold indicates the best error among the existing similar algorithms.
Table 2: Accuracy of the proposed class-level three-dimensional object six-dimensional pose tracking method on the real-scene Wild6D dataset
Table 2 shows the test results of the proposed method on different categories of the real-scene Wild6D dataset. From Tables 1 and 2, the method markedly improves the rotation and translation accuracy of class-level three-dimensional object six-dimensional pose tracking and obtains the best class-level tracking results, demonstrating the effectiveness of the proposed method.
FIG. 2 shows tracking results from different viewing angles for five kinds of objects in the same scene: a bottle, a laptop, a camera, a mug, and a pop can. None of the five objects appearing in the images occurred in the training data, demonstrating the generality of the method when facing new objects.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims (9)

1. A class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation, characterized by comprising the following steps:
S1: In the data preprocessing module, the camera intrinsic parameters are obtained in advance by a camera calibration method; with the intrinsic parameters and the depth video stream as input, the three-dimensional point cloud corresponding to each frame is computed by back projection, and the two input point clouds are mean-centered so that the network can effectively estimate the relative position and pose change between video frames during training;
Meanwhile, the first frame of the depth video stream is selected, the region where the target object is located is detected with a target detection method to obtain a segmentation mask covering the region, and the six-dimensional pose of the target object is estimated with a class-level single-frame pose estimation method as the initial pose, which completes the initialization process;
S2: In the inter-frame pose estimation module, geometric features of the three-dimensional point cloud data obtained in S1 are extracted with a multi-layer perceptron; the point clouds of the two frames yield two feature matrices, whose matrix product gives the similarity between the features of different points; these similarities are collected in a matrix, called the correspondence matrix, which represents the correspondence between points;
soft correspondence constraints are designed during training: a ground-truth correspondence matrix is generated from the known correspondences, and the parameters of the feature extraction network are trained under its constraint to estimate a correct correspondence matrix;
S3: In the correspondence expansion module, the dispersion of the correspondences in the correspondence matrix is controlled with k-bidirectional correspondence, and the value of k is gradually reduced as the number of iterations increases, so that the source point cloud mapped through the correspondence matrix gradually approaches the shape of the target point cloud.
2. The class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation according to claim 1, wherein, during the mean-centering in S1, all points from the two frame point clouds are taken as a whole and their joint mean point is computed as the center point;
In S1, after initialization of the initial pose, the two depth images currently to be matched are selected from the input depth video stream, and the depth value z of each target pixel is obtained from the depth image and the segmentation mask; since the camera is calibrated in advance, its intrinsic matrix K is known, and the three-dimensional point (x, y, z) obtained by back-projecting the target two-dimensional pixel (u, v) into the camera coordinate system is given by the following formula:

(x, y, z)^T = z · K^(-1) · (u, v, 1)^T
3. The class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation according to claim 1, wherein each layer of the perceptron in S2 consists of two linear layers and one ReLU layer; after the matrix product of the two feature matrices is taken, row and column normalization is performed iteratively with the Sinkhorn algorithm for 5 iterations, each iteration normalizing the rows and then the columns of the matrix in sequence, finally yielding a correspondence matrix that represents the correspondences and reflects the feature similarity between points.
4. The class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation according to claim 3, wherein in S1, assuming that the change in the object's position relative to the camera between consecutive video frames is negligible, first, the pixel coordinates of the target object on the previous frame's depth image D_{t-1} are obtained using the segmentation mask M_i of the previous frame, i.e. the i-th frame; a series of two-dimensional rectangular bounding boxes B_{j,k} of different sizes is generated on the depth image of the current frame, i.e. the j-th frame, by gradually expanding the rectangular bounding box by a preset step; each box is applied to the image processing of the current frame, all generated segmentation results are integrated, and the one with the highest confidence score is selected as the optimal segmentation result B_j on the current frame j; finally, the selected rectangular bounding box is used as input data and, in cooperation with the target detection method, yields the segmentation mask M_j finally used for the current frame.
5. The class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation according to claim 4, wherein the geometric features in S2 are obtained by a feature extraction module, using point-pair features to describe the neighborhood around each point in the point cloud and thereby represent each point's local geometric information in a rotation-invariant manner; the feature extraction module builds on PointNet++, takes the point-pair features as input, and outputs the local geometric features F_i and F_j of each point cloud, which are then fed into the feature matching module to estimate the soft correspondences between the point clouds.
6. The class-level three-dimensional object six-dimensional pose tracking method based on soft correspondence estimation according to claim 5, wherein in S2 three-layer perceptrons are used to extract the features F_i and F_j of the observed point clouds P_i and P_j in the camera coordinate systems of different frames; the similarity scores between the two sets of features are obtained by taking the matrix product of the two feature matrices and normalizing it, finally expressed as a correspondence matrix A; through the correspondence matrix A_ij, a soft mapping is established from the points of the point cloud P_j of the j-th frame to the points of the point cloud P_i of the i-th frame, and vice versa; A_ij^T is the transpose of the correspondence matrix A_ij, as shown in the formula:
P_i = A_ij · P_j
during the training stage, the rows and columns of the correspondence matrix A are normalized, establishing one-to-one, one-to-many, and many-to-one soft correspondences between three-dimensional space points;
during the training phase, the known information includes: the point clouds N_i and N_j corresponding to each frame's point cloud in the normalized object coordinate space, and the relative pose transformations T_i and T_j from the normalized object coordinate space to the camera coordinate space, so that:
P_i = T_i · N_i
P_j = T_j · N_j
Through the pose transformations T_i and T_j, the relation between the normalized object coordinate space and the camera coordinate space is established; the point clouds N_i and N_j in the normalized object coordinate space correspond one-to-one with the points of the observed point clouds P_i and P_j in the camera coordinate space; since N_i and N_j come from the same object and have the same orientation and position in the normalized coordinate space, the point correspondences between N_i and N_j can be obtained, and from them the point correspondences between the observed point clouds P_i and P_j.
7. The class-level three-dimensional object six-dimensional attitude tracking method based on soft correspondence estimation according to claim 6, wherein in S3, the number of correspondences in the correspondence matrix A is controlled by adopting k bidirectional correspondences: exploiting the fact that the point clouds N_i and N_j lie in the same coordinate space and belong to different parts of the same object, a nearest neighbor algorithm is used to find, for each point p_a, the k points p_{a,m}, m = 1, 2, …, k, within radius r, as its associated points; when point p_a from point cloud N_i is an associated point of point p_b from point cloud N_j, and p_b is likewise an associated point of p_a, the pair (p_a, p_b) is regarded as a pair of mutually associated points;

A matrix A_gt of the same size as the correspondence matrix A is generated, with the positions corresponding to mutually associated point pairs set to 1 and all other positions set to 0; a KL divergence constrains the target matrix A_gt against the estimated correspondence matrix A_ij, so that the frame pose estimation module can estimate a soft correspondence matrix representing the soft correspondence of the point clouds between frames.
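A small sketch of building such a mutual-association target matrix over the canonical-space clouds N_i and N_j with a k-d tree follows (the helper name and the k = 4 and radius defaults are illustrative assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

def mutual_correspondence_gt(n_i, n_j, k=4, radius=0.05):
    """Binary target matrix A_gt from canonical-space clouds n_i (Ni, 3)
    and n_j (Nj, 3): entry (a, b) = 1 iff point a of n_i and point b of n_j
    lie within `radius` and each is among the other's k nearest neighbours
    (a mutually associated pair)."""
    tree_i, tree_j = cKDTree(n_i), cKDTree(n_j)
    # k nearest neighbours of every n_i point inside n_j, and vice versa;
    # missing neighbours come back with infinite distance.
    d_ij, idx_ij = tree_j.query(n_i, k=k, distance_upper_bound=radius)
    d_ji, idx_ji = tree_i.query(n_j, k=k, distance_upper_bound=radius)
    a_gt = np.zeros((len(n_i), len(n_j)))
    for a in range(len(n_i)):
        for b, dist in zip(idx_ij[a], d_ij[a]):
            if np.isfinite(dist) and a in idx_ji[b]:   # mutual association
                a_gt[a, b] = 1.0
    return a_gt
```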
8. The class-level three-dimensional object six-dimensional attitude tracking method based on soft correspondence estimation according to claim 7, wherein in the soft correspondence matrix estimation process, at the start of tracking, the data of the initial frame is saved as the reference frame; the relative pose between every two consecutive frames is calculated, and the accumulated relative transformation between adjacent frames is applied to the point cloud of the reference frame, transforming the reference frame into the coordinate system of the current frame; the point clouds of the reference frame and the current frame are then treated as adjacent frames, and a soft correspondence matrix is estimated between them to determine their relative pose; finally, the pose estimation results of the multiple stages are accumulated to obtain the final pose estimate of the target object in the current image frame.
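One possible shape of this reference-frame update loop is sketched below (the function names, the 4x4 homogeneous-matrix convention, and the estimate_relative_pose placeholder standing in for the network-based pose stage are all assumptions):

```python
import numpy as np

def track_frame(T_acc, P_ref, P_cur, estimate_relative_pose):
    """One tracking step: T_acc is the accumulated 4x4 transform taking the
    reference frame into the previous frame's coordinate system; P_ref and
    P_cur are (N, 3) point clouds; estimate_relative_pose is the
    soft-correspondence pose stage (assumed given), returning a 4x4 matrix."""
    # Transform the reference cloud into the current frame's coordinates
    # so the pair can be treated as adjacent frames.
    P_h = np.hstack([P_ref, np.ones((len(P_ref), 1))])   # homogeneous coords
    P_ref_in_cur = (T_acc @ P_h.T).T[:, :3]
    # Estimate the residual relative pose and accumulate it.
    T_delta = estimate_relative_pose(P_ref_in_cur, P_cur)
    return T_delta @ T_acc
```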
9. The class-level three-dimensional object six-dimensional attitude tracking method based on soft correspondence estimation according to claim 8, wherein in the training process of S2, the correspondences between the points of different point clouds are obtained from the annotation information of the standard object coordinate space and the camera coordinate space, a ground-truth correspondence matrix A_gt is generated, and the training of the network is constrained with a correspondence loss function L_corr:

L_corr = KL(A_gt, A_pred)

where KL(·) denotes the KL divergence, which measures the similarity of two distributions, and A_pred is the estimated correspondence matrix;
Meanwhile, for the correspondence matrices A_ij and A_ji estimated from the two frames of point clouds, a bidirectional loss is calculated from the known correspondences, encouraging the network to aggregate the weights in the estimated correspondence matrices and to exploit the bidirectional correspondence property during the training stage:

L_bi = ||P_i - A_ij P_j||_1 + ||P_j - A_ji P_i||_1

where P_i and P_j are the point clouds of the i-th and j-th frames, respectively, A_ij is the correspondence matrix representing the correspondence from the j-th frame to the i-th frame, A_ji is the correspondence matrix representing the correspondence from the i-th frame to the j-th frame, and ||·||_1 denotes the L1 loss function;
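For concreteness, a sketch of these two training losses in PyTorch follows (the row-normalization of the binary target, the use of torch.nn.functional.kl_div, and the exact form of the bidirectional term are implementation assumptions consistent with the formulas above):

```python
import torch
import torch.nn.functional as F

def correspondence_losses(a_pred, a_gt, a_ij, a_ji, p_i, p_j):
    """a_pred/a_gt: estimated and binary target correspondence matrices
    (rows of a_pred already normalized); a_ij (Ni, Nj), a_ji (Nj, Ni): the
    two directional estimates; p_i (Ni, 3), p_j (Nj, 3): observed clouds."""
    # KL divergence between the row distributions of A_gt and A_pred; the
    # binary target rows are normalized so each is a valid distribution.
    tgt = a_gt / a_gt.sum(dim=1, keepdim=True).clamp_min(1e-8)
    l_corr = F.kl_div(a_pred.clamp_min(1e-8).log(), tgt,
                      reduction="batchmean")
    # Bidirectional L1 consistency between the two mapping directions.
    l_bi = F.l1_loss(a_ij @ p_j, p_i) + F.l1_loss(a_ji @ p_i, p_j)
    return l_corr, l_bi
```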
In the final stage, the pose transformation loss function L_transform is divided into a rotation loss L_rot and a translation loss L_t:

L_transform = λ_rot L_rot + λ_t L_t
            = λ_rot ||R_gt P_i - R_pred P_i||_1 + λ_t ||t_gt - t_pred||_1

where λ_rot and λ_t are the weights of the rotation loss L_rot and the translation loss L_t in the transformation loss, respectively; T_pred is the finally estimated six-dimensional pose, comprising a three-dimensional rotation and a three-dimensional translation; T_gt is the ground-truth six-dimensional pose of the target object; R_pred is the rotation matrix and t_pred the translation vector of pose T_pred; R_gt is the rotation matrix and t_gt the translation vector of pose T_gt.
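A corresponding sketch of the pose transformation loss (the L1 norm, the per-point averaging, and the default lambda weights are assumptions; the claim does not fix the norm):

```python
import torch

def transform_loss(R_pred, t_pred, R_gt, t_gt, p_i,
                   lambda_rot=1.0, lambda_t=1.0):
    """Rotation error is measured on the cloud itself, comparing p_i rotated
    by the predicted vs. the ground-truth rotation; translation error is the
    distance between the two translation vectors."""
    # p_i @ R.t() applies rotation R to row-vector points of shape (N, 3).
    l_rot = (p_i @ R_gt.t() - p_i @ R_pred.t()).abs().sum(dim=1).mean()
    l_t = (t_gt - t_pred).abs().sum()
    return lambda_rot * l_rot + lambda_t * l_t
```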

Priority Applications (1)

Application Number: CN202410437068.XA
Priority Date / Filing Date: 2024-04-12 / 2024-04-12
Title: Class-level three-dimensional object six-dimensional attitude tracking method based on soft correspondence estimation

Publications (1)

Publication Number: CN118212268A
Publication Date: 2024-06-18

Family ID: 91445943

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination