CN110634160B

CN110634160B - Method for constructing target three-dimensional key point extraction model and recognizing posture in two-dimensional graph

Info

Publication number: CN110634160B
Application number: CN201910738138.4A
Authority: CN
Inventors: 彭进业; 张少博; 赵万青; 祝轩; 李斌; 张薇; 乐明楠; 李展; 罗迒哉; 王珺
Original assignee: Northwest University
Current assignee: Northwest University
Priority date: 2019-08-12
Filing date: 2019-08-12
Publication date: 2022-11-18
Anticipated expiration: 2039-08-12
Also published as: CN110634160A

Abstract

The invention discloses a method for constructing a target three-dimensional key point extraction model and identifying a posture in a two-dimensional graph, which can accurately and directly output the coordinates of a target three-dimensional key point by designing a network structure of the three-dimensional key point extraction model; by means of the designed key point loss function, the network can independently learn and extract key points with semantic consistency and geometric consistency in an unsupervised mode, and the accuracy of extracting the three-dimensional key points is improved.

Description

Method for constructing target three-dimensional key point extraction model and recognizing posture in two-dimensional graph

Technical Field

The invention relates to a target three-dimensional gesture recognition method, in particular to a target three-dimensional key point extraction model construction and gesture recognition method in a two-dimensional graph.

Background

Target three-dimensional gesture recognition refers to recognizing the three-dimensional position and direction of a target object, and is a key module in many computer vision applications such as augmented reality, robot control, and unmanned tasks. However, the three-dimensional gesture recognition of the target is based on the need of extracting three-dimensional key points of the target object, finding the two-dimensional position of the object on the image and extracting some key points such as the projection of the 3D frame of the object on the image, and these methods are very effective by using a large amount of supervision information, but the workload of labeling three-dimensional information on the image is huge, and extremely high professional knowledge and complicated preparation work are required, and these methods cannot process images with occlusion and complex backgrounds.

In addition, even after the three-dimensional key point of the target is obtained, the three-dimensional posture of the target cannot be accurately identified, so that the method for acquiring the three-dimensional posture of the target object in the two-dimensional image in the prior art has the problems of low posture acquisition accuracy, large workload, low real-time performance and low robustness.

Disclosure of Invention

The invention aims to provide a method for constructing a target three-dimensional key point extraction model and identifying a gesture in a two-dimensional image, which is used for solving the problems that the accuracy of the method for identifying the three-dimensional key points of a target object in the two-dimensional image is low, the gesture identification accuracy is low and the like in the prior art.

In order to realize the task, the invention adopts the following technical scheme:

a method for constructing a target three-dimensional key point extraction model in a two-dimensional graph is implemented according to the following steps:

step 1, acquiring a plurality of two-dimensional image groups containing targets to be recognized, wherein two-dimensional images in the two-dimensional image groups are different in image acquisition angle;

obtaining a training image set;

step 2, inputting the training image set into a neural network for training;

the neural network comprises a feature extraction sub-network, and the feature extraction sub-network is respectively connected with a key point extraction sub-network and a target detection sub-network;

the feature extraction sub-network comprises a feature map extraction module and a sensitive interest region extraction module which are sequentially arranged;

the target detection sub-network comprises a target classification module and a bounding box detection module which are connected in parallel;

the key point extraction sub-network comprises a key point probability obtaining module and a key point output module which are connected in series;

the key point probability obtaining module is used for obtaining the probability that each pixel point is a three-dimensional key point;

the key point output module obtains the coordinates of each three-dimensional key point by using a formula I:

wherein [ x ] _i ,y _i ]Coordinates representing the ith three-dimensional key point, I =1,2, …, I is a positive integer, P _i (u, v) represents the probability that the (u, v) th pixel point in the two-dimensional image output by the key point probability calculation sub-network is the ith three-dimensional key point, wherein (u, v) is the coordinate of the two-dimensional image, and u and v are positive integers;

and obtaining a three-dimensional key point extraction model.

Furthermore, the characteristic map extraction module comprises a characteristic pyramid network and a residual error network which are sequentially arranged; the region of interest extraction module comprises a region generation network.

Further, the key point probability obtaining module comprises a plurality of convolution blocks, an up-sampling layer and a softmax layer which are sequentially connected in series;

the convolution block comprises a convolution layer and a ReLU active layer which are connected in sequence.

Further, the loss function L of the three-dimensional key point extraction model is as follows:

wherein,

represents the sum of the classification loss functions of all negative examples,

target classification loss function L representing all positive samples _class Bounding box detection loss function L _box And a keypoint detection loss function L _keypoints The sum of beta and gamma is more than 0;

the negative sample is a two-dimensional image which is extracted by the interesting region extraction module and does not contain a target in the interesting region; the positive sample is a two-dimensional image of a target contained in the interesting region extracted by the interesting region extraction module;

wherein the key point detects a loss function

Wherein L is _dis To be a significant loss function, L _dep For depth prediction loss function, L _con As a three-dimensional consistency loss function, L _sep As a function of separation loss, L _pose Estimate the loss function for the relative attitude, tau, epsilon, mu,

Are both greater than 0.

Further, the beta, gamma, tau, epsilon, mu,

both are 1, and δ is 0.08.

A method for extracting a target three-dimensional key point in a two-dimensional graph is implemented according to the following steps:

step A, collecting a two-dimensional image containing a target to be identified to obtain an image to be identified;

and B, inputting the image to be recognized into a three-dimensional key point extraction model constructed by the construction method of the target three-dimensional key point extraction model in the two-dimensional graph to obtain a three-dimensional key point set of the target to be recognized, wherein the three-dimensional key point set comprises Q three-dimensional key points, and Q is a positive integer.

A method for recognizing a three-dimensional posture of a target in a two-dimensional image is used for obtaining a three-dimensional posture matrix of the target in the two-dimensional image and is executed according to the following steps:

step I, acquiring a two-dimensional image containing a target to be identified, and acquiring an image to be identified;

step II, obtaining a three-dimensional key point set of the target to be recognized in the image to be recognized by adopting the method for extracting the three-dimensional key points of the target in the two-dimensional graph as claimed in claim 6;

step III, calculating the distance between the three-dimensional key point set of the target to be recognized in the image to be recognized and the three-dimensional key point set of each image in the reference image library;

the reference image library comprises a plurality of reference images and information of each reference image, wherein the information of each reference image comprises a three-dimensional key point set of each reference image and a three-dimensional attitude matrix of a target in each reference image, which are obtained by adopting the method for extracting the target three-dimensional key point in the two-dimensional graph of claim 6 for each reference image;

taking the image corresponding to the three-dimensional key point set with the minimum distance as a comparison image, and obtaining the three-dimensional key point set of the comparison image and a three-dimensional attitude matrix of a target in the comparison image;

step IV, subtracting the coordinates of the mass center of the three-dimensional key point set of the target to be recognized from the coordinates of each three-dimensional key point in the three-dimensional key point set of the target to be recognized to obtain a new three-dimensional key point set of the target to be recognized;

subtracting the coordinate of the mass center of the three-dimensional key point set of the contrast image from the coordinate of each three-dimensional key point in the three-dimensional key point set of the contrast image to obtain a new three-dimensional key point set of the contrast image;

v, using a singular value decomposition method to carry out the step

Decomposing to obtain a rotation matrix R;

wherein, X' _n The coordinate of the nth point in the three-dimensional key point set of the new target to be identified is P' _n For the coordinates of the nth point in the three-dimensional set of keypoints for the new comparison image, N _P As a new three-dimensional set of key points of the object to be identified orThe total number of the three-dimensional key points in the three-dimensional key point set of the new contrast image;

VI, obtaining a posture matrix T = [ R | T =]Wherein t = μ _X -Rμ _P ，μ _X Mean coordinate, mu, of a three-dimensional set of key points for a new object to be recognized _P The average coordinates of the three-dimensional key point set of the new contrast image;

step VII, obtaining a three-dimensional attitude matrix T of the target to be recognized in the image to be recognized by adopting a formula III _input ：

T _input ＝T·T _ref Formula III

Wherein T is _ref Is a three-dimensional pose matrix of objects in the contrast image.

Compared with the prior art, the invention has the following technical effects:

1. according to the method for constructing and extracting the target three-dimensional key point extraction model in the two-dimensional graph, the coordinates of the target three-dimensional key point can be accurately and directly output by designing the network structure of the three-dimensional key point extraction model; by the aid of the designed key point loss function, the network can independently learn and extract key points with semantic consistency and geometric consistency in an unsupervised mode, and accuracy of three-dimensional key point extraction is improved;

2. according to the method for constructing and extracting the target three-dimensional key point extraction model in the two-dimensional graph, in the network training stage, no three-dimensional model of any object or three-dimensional marking on the image is needed, compared with the existing method, the workload of marking can be greatly reduced, and the efficiency of the extraction method is improved;

3. according to the method for recognizing the three-dimensional posture of the target in the two-dimensional graph, the three-dimensional space coordinate system is established by setting the comparison image, and the recognition precision is improved.

Drawings

FIG. 1 is an internal structure diagram of a three-dimensional key point extraction model provided by the present invention;

FIG. 2 is a diagram of an internal structure of a keypoint probability acquisition module provided in an embodiment of the present invention;

FIG. 3 is an image to be recognized provided in an embodiment of the present invention;

fig. 4 is an image representation of a three-dimensional key point set obtained by performing three-dimensional key point extraction on the image to be recognized shown in fig. 3 according to an embodiment of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the drawings and examples. So that those skilled in the art can better understand the present invention. It is to be expressly noted that in the following description, a detailed description of known functions and designs will be omitted when it may obscure the subject matter of the present invention.

The following definitions or conceptual connotations relating to the present invention are provided for illustration:

three-dimensional key points: located on a structure where the object is more prominent, represents a local feature of the surface of the object that is rotationally invariant with respect to the object.

A bounding box: for marking the position of an object in the image.

Significance loss function: and (5) using the light and shade characteristics to enable the key points to fall on the object significance positions.

Depth prediction loss function: the epipolar geometry principle is used to train the network so that the depth of the key points can be accurately predicted.

Three-dimensional consistency loss function: for ensuring that the same area can be stably tracked under different viewing angles.

Separation loss function: and a certain distance is reserved between every two key points, so that the key points are prevented from being overlapped.

Relative attitude estimation loss function: is a penalty term for the angle, i.e. the difference between the true value of the angle of the relative pose of the camera between the pair of input images and the relative angle estimated from the detected keypoints, this term loss contributes to the generation of a meaningful and natural set of 3D keypoints.

Rotating the matrix: the rotation matrix is used to describe the rotation of the object around the x, y, z axes, and is a 3x3 orthogonal matrix with a determinant of 1.

An attitude matrix: the method comprises the following steps of [ R | T ], wherein R is a rotation matrix, and T is a translation matrix, and rotation information and translation information of an object in a three-dimensional space are described.

Example one

The embodiment discloses a method for constructing a three-dimensional key point extraction model of a target in a two-dimensional graph. In the present embodiment, the three-dimensional keypoint extraction model is used to extract three-dimensional keypoints with geometric consistency and semantic consistency on a target object in an image, and in the present embodiment, the three-dimensional keypoints on the object are directly predicted by using an autonomously designed CNN network.

The method comprises the following steps:

the method comprises the following steps that 1, a plurality of two-dimensional image groups containing targets to be identified are obtained, and two-dimensional images in the two-dimensional image groups are different in image acquisition angle;

obtaining a training image set;

in this embodiment, because the depth information of the corresponding key points in the two images can be calculated by epipolar geometry using two different angles, a multitask loss function is trained by using two pictures of the same object taken from different viewpoints and the relative posture change between the viewpoints during training, so that the network can predict the key points with geometric consistency and semantic consistency on the object.

Step 2, inputting the training image set into a neural network for training;

the neural network comprises a feature extraction sub-network, and the feature extraction sub-network is respectively connected with the key point extraction sub-network and the target detection sub-network;

the feature extraction sub-network comprises a feature map extraction module and an interested region extraction module which are sequentially arranged;

wherein [ x ] _i ,y _i ]Coordinates representing the ith three-dimensional keypoint, I =1,2, …, I being a positive integer, P _i (u, v) represents the probability that the (u, v) th pixel point in the two-dimensional image output by the key point probability computation sub-network is the ith three-dimensional key point, wherein (u, v) is the coordinate of the two-dimensional image, and u and v are both positive integers;

and obtaining a three-dimensional key point extraction model.

In this embodiment, as shown in fig. 1, the image is first input into the feature extraction sub-network, the region-of-interest sub-image of the image is output, the region-of-interest sub-image is input into the key point extraction sub-network and the target detection sub-network, respectively, the key point coordinates of the target in the region-of-interest sub-image are output by the key point detection sub-network, and the classification of the target in the region-of-interest sub-image and the coordinates of the bounding box are output by the target detection sub-network, wherein the bounding box is used to frame out the target in the two-dimensional image.

In this embodiment, the feature extraction sub-networks are spliced by using the existing CNN network, and specifically, the feature map extraction module includes a feature pyramid network and a residual error network that are sequentially arranged; the region of interest extraction module includes a region generation network.

In this embodiment, the function of implementing target classification detection and bounding box coordinate output in the target detection sub-network can be implemented by using the CNN network in the prior art.

In this embodiment, different from the prior art, the key point extraction sub-network first obtains the probability of the three-dimensional key point of each pixel point, and obtains the coordinate of each three-dimensional key point through the probability accumulation.

Optionally, the key point probability obtaining module includes a plurality of volume blocks, an upper sampling layer and a softmax layer connected in series in sequence;

the convolution block includes a convolution layer and a ReLU active layer connected in sequence.

In this embodiment, as shown in fig. 2, the key point probability obtaining module includes 4 volume blocks, an upsampling layer, and a softmax layer, which are connected in series in sequence.

Where the convolution kernel size in the convolution block is 3x3.

Optionally, the loss function L of the three-dimensional key point extraction model is:

wherein,

the negative sample is a two-dimensional image which is extracted by the interesting region extraction module and does not contain a target; the positive sample is a two-dimensional image of the region of interest containing the target, which is extracted by the region of interest extraction module;

wherein the key point detects the loss function

Wherein L is _dis As a function of significant loss, L _dep For depth prediction loss function, L _con Is a three-dimensional uniform loss function, L _sep As a function of separation loss, L _pose Estimate the loss function for the relative attitude, tau, epsilon, mu,

Are all greater than 0.

Alternatively, the values of β, γ, τ,ε，μ，

is 1, and delta is 0.08.

In this step, a saliency loss function is used to ensure that the three-dimensional keypoints fall within the saliency areas of the object,

wherein l (x) _i ,y _i ) X, y-axis coordinates, P, of salient region representing the ith three-dimensional keypoint _i (x _i ,y _i ) X, y-axis coordinates representing the ith three-dimensional keypoint, i =1,2, … …, N being the total number of three-dimensional keypoints, N being a positive integer, N =10 in this embodiment, where l (x = 10) _i ,y _i ) = l (u, v) was obtained using the following procedure:

step (1), carrying out Gaussian filtering on the image and obtaining a Hessian matrix of each pixel,

the determinant of each hessian matrix is calculated.

And finding out the point with the maximum determinant value in the range of 3*3 as the point on the salient region by using a non-maximum suppression algorithm.

With the generation of a profile of the salient region:

in this step, the depth prediction penalty function reduces the error of the predicted depth from the depth calculated by the epipolar geometry,

wherein z is _i Is the Z-axis coordinate, Z, of the ith three-dimensional key point in one two-dimensional image in the two-dimensional image group _i ' is the ith three-dimensional relation in another two-dimensional image in the two-dimensional image groupZ-axis coordinate of key point, d _i The depth of the ith three-dimensional key point in one two-dimensional image in the two-dimensional image group, d _i ' is the depth of the ith three-dimensional key point in the other two-dimensional image in the two-dimensional image group.

And calculating d 'by using a formula d' e ^ Re '+ e ^ t =0 and using a least square method, and calculating the depth d of the other point by using a formula d' e ^ Re '+ e ^ t =0, wherein e and e' are matched key points on the two images in the image group during training.

Wherein the three-dimensional consistency loss function maintains the position of the three-dimensional keypoints with respect to the object,

m _i the coordinates of one of the images in the two-dimensional image set for the three-dimensional keypoint,

the coordinates of the three-dimensional keypoint in another image in the two-dimensional image set.

Wherein the separation loss function ensures that a certain distance is maintained between three-dimensional key points and does not fall on the same point,

(x _i ,y _i ,z _i ) Is the ith key coordinate, (x) _j ,y _j ,z _j ) For the coordinates of the jth key, i ≠ j, δ is the distance between the set key points.

Wherein the relative pose estimation loss function makes the obtained three-dimensional key points more suitable for the pose estimation task,

and R' is the posture change of the object in the two images calculated by utilizing the key points, and R is group-Truth.

Example two

A method for extracting three-dimensional key points of a target in a two-dimensional graph is implemented according to the following steps:

and step B, inputting the image to be recognized into the three-dimensional key point extraction model constructed by the construction method of the target three-dimensional key point extraction model in the two-dimensional graph of the embodiment, and obtaining the three-dimensional key point set of the target to be recognized.

In this embodiment, the extraction of three-dimensional keypoints is performed on the image to be recognized as shown in fig. 3, and an image representation of a three-dimensional keypoint set as shown in fig. 4 is obtained.

EXAMPLE III

step I, acquiring a two-dimensional image containing a target to be identified, and acquiring the image to be identified;

step II, obtaining a three-dimensional key point set of the target to be recognized in the image to be recognized by adopting the second embodiment method;

the reference image library comprises a plurality of reference images and information of each reference image, wherein the information of each reference image comprises a three-dimensional key point set of each reference image and a three-dimensional attitude matrix of a target in each reference image, which are obtained by each reference image by adopting the method of the second embodiment;

subtracting the coordinate of the mass center of the three-dimensional key point set of the comparison image from the coordinate of each three-dimensional key point in the three-dimensional key point set of the comparison image to obtain a new three-dimensional key point set of the comparison image;

v, using a singular value decomposition method to carry out the step

Decomposing to obtain a rotation matrix R;

wherein, X' _n The coordinate of the nth point in the three-dimensional key point set of the new target to be identified is P' _n Coordinates of the nth point in the three-dimensional set of keypoints for the new comparison image, N _P The total number of the three-dimensional key points in the three-dimensional key point set of the new target to be identified or the three-dimensional key point set of the new contrast image is set;

VI, obtaining a posture matrix T = [ R | T =]Where t = μ _x -Rμ _p ，μ _x Mean coordinate, mu, of a three-dimensional set of key points for a new object to be recognized _p Average coordinates of the three-dimensional key point set of the new contrast image;

T _input ＝T·T _ref Formula III

In this embodiment, a three-dimensional key point set X of the target to be recognized in the image to be recognized and a three-dimensional key point set P of the contrast image, N in this embodiment, are obtained through step III _P ＝10。

IV, obtaining the coordinates of the three-dimensional key point set centroid of the target to be identified

Obtaining coordinates of centroids of three-dimensional keypoint sets of contrast images

Subtracting the coordinate of each three-dimensional key point in the three-dimensional key point set X of the target to be recognized from the coordinate of the mass center of the three-dimensional key point set X of the target to be recognized

Obtaining a new three-dimensional key point set X' of the target to be identified;

a new three-dimensional keypoint set P' of the contrast image is also obtained.

V, processing each pair of three-dimensional key points in the new three-dimensional key point set X 'of the target to be identified and the three-dimensional key point set P' of the new comparison image to obtain a total matrix W,

and then carrying out singular value decomposition on the total matrix W to obtain a rotation matrix R:

obtaining an attitude matrix through the step VI:

T＝[R|t]＝[-0.05903081 -0.02168849 -0.01671735]

where | represents the integer division of the matrix.

Then, obtaining a three-dimensional attitude matrix T of the target to be recognized in the image to be recognized by adopting a formula III _input ：

T _input ＝[-0.05903081 -0.02168849 -0.01671735]·T _ref

Where · denotes the multiplication of the matrix.

Claims

1. A method for constructing a three-dimensional key point extraction model of a target in a two-dimensional graph is characterized by comprising the following steps of:

the method comprises the following steps that 1, a plurality of two-dimensional image groups containing targets to be identified are obtained, and the two-dimensional images in the two-dimensional image groups are different in image acquisition angle;

obtaining a training image set;

step 2, inputting the training image set into a neural network for training;

wherein [ x ] _i ,y _i ]Coordinates representing the ith three-dimensional keypoint, I =1,2, …, I being a positive integer, P _i (u, v) represents the probability that the (u, v) th pixel point in the two-dimensional image output by the key point probability calculation sub-network is the ith three-dimensional key point, wherein (u, v) is the coordinate of the two-dimensional image, and both u and v are positive integers;

and obtaining a three-dimensional key point extraction model.

2. The method for constructing the extraction model of the target three-dimensional key points in the two-dimensional graph as claimed in claim 1, wherein the feature map extraction module comprises a feature pyramid network and a residual error network which are sequentially arranged; the region of interest extraction module comprises a region generation network.

3. The method for constructing the three-dimensional key point extraction model of the target in the two-dimensional graph, as claimed in claim 1, wherein the key point probability obtaining module comprises a plurality of volume blocks, an upsampling layer and a softmax layer which are connected in series in sequence;

4. The method for constructing a three-dimensional key point extraction model of an object in a two-dimensional graph according to claim 1, wherein the loss function L of the three-dimensional key point extraction model is as follows:

wherein,

target classification penalty function L representing all positive samples _class Bounding box detection loss function L _box And a keypoint detection loss function L _keypoints The sum of beta and gamma is more than 0;

the negative sample is a two-dimensional image which is extracted by the interesting region extraction module and does not contain a target in the interesting region; the positive sample is a two-dimensional image of an interesting region extracted by the interesting region extraction module and containing a target;

wherein the key point detects the loss function

Wherein L is _dis As a function of significant loss, L _dep For depth prediction loss function, L _con As a three-dimensional consistency loss function, L _sep As a function of separation loss, L _pose Estimate the loss function for the relative attitude, tau, epsilon, mu,

Are all greater than 0.

5. The method for constructing a model for extracting three-dimensional key points of an object from a two-dimensional figure as claimed in claim 4, wherein said values β, γ, τ, ε, μ,

both are 1, and δ is 0.08.

6. A method for extracting a target three-dimensional key point in a two-dimensional graph is characterized by comprising the following steps:

and step B, inputting the image to be recognized into the three-dimensional key point extraction model constructed by the method for constructing the three-dimensional key point extraction model of the target in the two-dimensional graph according to any one of claims 1 to 5, and obtaining a three-dimensional key point set of the target to be recognized, wherein the three-dimensional key point set comprises Q three-dimensional key points, and Q is a positive integer.

7. A method for recognizing a three-dimensional posture of an object in a two-dimensional image is used for obtaining a three-dimensional posture matrix of the object in the two-dimensional image, and is characterized by comprising the following steps:

v, using singular value decomposition method to

Decomposing to obtain a rotation matrix R;

wherein, X' _n The coordinate of the nth point in the three-dimensional key point set of the new target to be identified is P' _n Coordinates of the nth point in the three-dimensional set of keypoints for the new comparison image, N _P The total number of the three-dimensional key points in the three-dimensional key point set of the new target to be recognized or the three-dimensional key point set of the new contrast image is obtained;

VI, obtaining a posture matrix T = [ R | T =]Wherein t = μ _X -Rμ _P ，μ _X Is the mean coordinate, mu, of the three-dimensional set of key points of the new object to be identified _P Average coordinates of the three-dimensional keypoint set of the new contrast image;

VII, obtaining the image to be identified by adopting a formula IIIThree-dimensional attitude matrix T for recognizing target _input ：

T _input ＝T·T _ref Formula III