CN112183506A - Human body posture generation method and system - Google Patents

Human body posture generation method and system

Info

Publication number
CN112183506A
CN112183506A CN202011369283.9A
Authority
CN
China
Prior art keywords
key point
coordinate
human body
neural network
body posture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011369283.9A
Other languages
Chinese (zh)
Inventor
唐浩
范宇航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Tishi Technology Co ltd
Original Assignee
Chengdu Tishi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Tishi Technology Co ltd filed Critical Chengdu Tishi Technology Co ltd
Priority to CN202011369283.9A priority Critical patent/CN112183506A/en
Publication of CN112183506A publication Critical patent/CN112183506A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computing Systems (AREA)
  • Mathematical Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body posture generation method and system, comprising the following steps: acquiring multi-channel synchronous video stream data; grabbing, frame by frame, the set of same-frame pedestrian images contained in the synchronous video stream data; performing 2D skeleton key point detection on the pedestrian image set to generate the pedestrians' 2D skeleton key point coordinates and confidences; and constructing a coordinate transformation matrix from the projection matrices of the camera units, the 2D skeleton key point coordinates and their confidences, and generating human body posture information with a triangulation algorithm. By calculating both the coordinates and the confidences of the 2D skeleton key points and, while building the initial coordinate transformation matrix, updating the weight of each coordinate in it according to its confidence, the method solves the problem that the computed spatial 3D coordinates deviate severely from their true values when the 2D skeleton key points in the images acquired by the several camera units are occluded to different degrees.

Description

Human body posture generation method and system
Technical Field
The invention relates to the technical field of human body posture recognition, and in particular to a human body posture generation method and system.
Background
3D human body posture estimation can serve as the basis of tasks such as posture recognition, behavior recognition and human body tracking, and has high application value in fields such as medical treatment, surveillance and human-computer interaction. Current 3D human body posture estimation methods can be divided into single-camera and multi-camera methods. Single-camera 3D posture estimation infers the depth of the human body in an image from foreground-background differences and, combined with a 2D joint-point estimation algorithm, restores the position of the human posture in 3D space. Multi-camera 3D posture estimation first estimates the 2D joint-point coordinates of the human body in each camera separately, and then computes the 3D spatial coordinates of the joint points by triangulation.
With the development of deep learning centered on convolutional neural networks and the marked growth of computing power in recent years, real-time multi-camera 3D human body posture estimation has become the preferred choice for many applications. The spatial positions of the cameras can be determined through extrinsic calibration between them, so that 3D spatial positions can be obtained by triangulation.
However, traditional multi-camera spatial 3D coordinate calculation still suffers from low detection precision and poor adaptability.
Disclosure of Invention
In view of the above, the invention provides a human body posture generation method and system that, by improving the image detection method, solve the problems of low detection precision and poor adaptability in existing multi-camera spatial 3D coordinate calculation.
To solve the above problems, the invention adopts a human body posture generation method comprising the following steps: S1: acquiring multi-channel synchronous video stream data from a plurality of camera units; S2: grabbing, frame by frame, the set of same-frame pedestrian images contained in the synchronous video stream data; S3: performing 2D skeleton key point detection on the pedestrian image set and generating the pedestrians' 2D skeleton key point coordinates and confidences; S4: constructing a coordinate transformation matrix from the projection matrices of the camera units, the 2D skeleton key point coordinates and their confidences, and generating human body posture information with a triangulation algorithm.
Optionally, S4 includes: constructing an initial transformation matrix from the projection matrix and the 2D skeleton key point coordinates; updating the weight of each 2D skeleton key point coordinate in the initial transformation matrix according to the confidence corresponding to that coordinate, thereby generating the coordinate transformation matrix; and calculating the spatial 3D coordinate with the formula (A × W) × Y = 0 and generating the human body posture information, where A is the initial transformation matrix, W is the confidence corresponding to the 2D skeleton key point coordinate, Y is the spatial 3D coordinate, and (A × W) is the coordinate transformation matrix.
Optionally, S3 includes: constructing a neural network model for generating the 2D skeleton key point coordinates; constructing a regression model from a training sample set and the neural network model, and training the neural network model; and resizing the several pedestrian images contained in the pedestrian image set to a uniform size, inputting them into the trained neural network model, and obtaining the pedestrians' 2D skeleton key point coordinates and confidences.
Optionally, S1 includes: performing intrinsic calibration on the plurality of camera units to obtain their intrinsic and distortion parameters; selecting a main camera, performing extrinsic calibration on the remaining camera units to obtain their extrinsic rotation and translation vectors, and calculating the projection matrices; and acquiring the distortion-corrected multi-channel synchronous video stream data with an image processing function based on the intrinsic and distortion parameters.
Correspondingly, the invention provides a human body posture generation system comprising a camera unit and a data processing unit, the data processing unit comprising a camera driving module, an image capture module, a neural network module and a coordinate transformation module. The camera unit is used to acquire multi-channel synchronous video stream data; the camera driving module drives the camera unit and receives the synchronous video stream data; the image capture module grabs, frame by frame, the set of same-frame pedestrian images contained in the synchronous video stream data; the neural network module performs 2D skeleton key point detection on the pedestrian image set and generates the pedestrians' 2D skeleton key point coordinates and confidences; and the coordinate transformation module constructs a coordinate transformation matrix from the 2D skeleton key point coordinates and their confidences and generates human body posture information with a triangulation algorithm.
Optionally, the human body posture generation system further comprises a cross-platform computer vision library unit for providing the image processing functions required by the camera driving module and the neural network module.
Optionally, the human body posture generation system further comprises a data storage unit for storing the training sample set required by the neural network module.
Optionally, the neural network module constructs a neural network model for generating the 2D skeleton key point coordinates, constructs a regression model from the training sample set and the neural network model, trains the neural network model, resizes the several pedestrian images contained in the pedestrian image set to a uniform size, and inputs them into the trained neural network model to obtain the pedestrians' 2D skeleton key point coordinates and confidences.
Optionally, the coordinate transformation module constructs an initial transformation matrix from the 2D skeleton key point coordinates, updates the weight of each 2D skeleton key point coordinate in the initial transformation matrix according to the confidence corresponding to that coordinate to generate the coordinate transformation matrix, and calculates the spatial 3D coordinate with the formula (A × W) × Y = 0 to generate the human body posture information, where A is the initial transformation matrix, W is the confidence corresponding to the 2D skeleton key point coordinate, Y is the spatial 3D coordinate, and (A × W) is the coordinate transformation matrix.
The primary improvement of the human body posture generation method is that the coordinates and confidences of the 2D skeleton key points are calculated by the neural network module, and the weight of each coordinate in the initial coordinate transformation matrix is updated according to its confidence while that matrix is established. This solves the problem that the computed spatial 3D coordinates deviate severely from their true values when the 2D skeleton key points in the images acquired by the several camera units are occluded to different degrees, effectively improves the accuracy of the output spatial 3D coordinates, and improves the anti-interference capability and detection precision of the human body posture generation system.
Drawings
FIG. 1 is a simplified flow diagram of a human body pose generation method of the present invention;
FIG. 2 is a simplified block diagram of the human body posture generation system of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, a human body posture generation method includes: S1: acquiring multi-channel synchronous video stream data from a plurality of camera units; S2: grabbing, frame by frame, the set of same-frame pedestrian images contained in the synchronous video stream data; S3: performing 2D skeleton key point detection on the pedestrian image set and generating the pedestrians' 2D skeleton key point coordinates and confidences; S4: constructing a coordinate transformation matrix from the projection matrices of the camera units, the 2D skeleton key point coordinates and their confidences, and generating human body posture information with a triangulation algorithm.
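As an illustrative sketch only, steps S2 through S4 can be wired together per frame as follows; all names are placeholders, and the detector and triangulation are stand-ins for the neural network module and coordinate transformation module described later:

```python
def process_frame(frames, projections, detect, triangulate):
    """One iteration of S2-S4 over the same-frame images of all camera units.

    frames      -- one image per camera unit (the S2 grab for this frame)
    projections -- one projection matrix per camera unit (from S1 calibration)
    detect      -- S3 stand-in: image -> list of (x, y, confidence) per key point
    triangulate -- S4 stand-in: (projections, points, confidences) -> 3D coordinate
    """
    detections = [detect(image) for image in frames]      # S3, per camera
    num_keypoints = len(detections[0])
    pose3d = []
    for k in range(num_keypoints):                        # S4, per key point
        points = [(d[k][0], d[k][1]) for d in detections]
        weights = [d[k][2] for d in detections]
        pose3d.append(triangulate(projections, points, weights))
    return pose3d
```

The loop groups per-camera detections by key point before triangulation, which is the data flow the method relies on.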
The inventor has found that when a neural network module of some neural network frameworks is used to calculate the 2D skeleton key point coordinates of a human body in an image, the module predicts the coordinates by computing a confidence for each 2D skeleton key point, and outputs both the coordinates and their confidences, the confidence reflecting the predicted degree of occlusion of the key point. The traditional method for calculating spatial 3D coordinates, however, treats every 2D skeleton key point as completely reliable. Since in most cases the 2D skeleton key point coordinates acquired by the multiple cameras are occluded to different degrees (that is, their confidences differ), the accuracy of the finally output spatial 3D coordinates is low, and the confidence information becomes redundant, wasting the computing power of the neural network units involved.
To solve this problem, the invention calculates the coordinates and confidences of the 2D skeleton key points with the neural network module and, while the initial coordinate transformation matrix is established, updates the weight of each coordinate in it according to its confidence. This solves the problem that the computed spatial 3D coordinates deviate severely from their true values when the 2D skeleton key points in the images acquired by the several camera units are occluded to different degrees, effectively improves the accuracy of the output spatial 3D coordinates, improves the anti-interference capability and detection precision of the human body posture generation system, and makes full use of information that would otherwise be redundant. Specifically, an initial transformation matrix is constructed from the projection matrix and the 2D skeleton key point coordinates; the weight of each 2D skeleton key point coordinate in the initial transformation matrix is updated according to the confidence corresponding to that coordinate, generating the coordinate transformation matrix; and the spatial 3D coordinate is calculated with the formula (A × W) × Y = 0, generating the human body posture information, where A is the initial transformation matrix corresponding to the 2D skeleton key point coordinate, W is the confidence corresponding to that coordinate, Y is the spatial 3D coordinate, and (A × W) is the coordinate transformation matrix. Multiplying the matrix A by the scalar W scales the row vectors of the initial transformation matrix A by the confidence W, generating the coordinate transformation matrix.
Specifically, constructing the initial transformation matrix from the 2D skeleton key point coordinates comprises constructing a transformation matrix from the projection matrices P of the plurality of camera units and the corresponding 2D skeleton key point coordinates (xi, yi) acquired by each camera unit. Taking the calculation of the first skeleton point as an example: the first 2D skeleton key point coordinates (x, y) and the projection matrix P of the corresponding first camera unit give the first initial transformation matrix (the standard direct-linear-transform rows)

A1 = [ x·p3 − p1 ; y·p3 − p2 ],

where p1, p2, p3 denote the three rows of P. Based on the confidence W of the first 2D skeleton key point coordinate, the first initial transformation matrix is updated to the first coordinate transformation matrix A1′ = A1 × W. This is repeated until coordinate transformation matrices A1′, A2′, ..., An′ of the first 2D skeleton key point have been generated under the projection matrices and confidences of all the camera units, and they are stacked into the complete coordinate transformation matrix

A = [ A1′ ; A2′ ; ... ; An′ ],

where 1, 2, 3, ..., n numbers the camera units. SVD is performed on the coordinate transformation matrix to obtain the spatial 3D coordinate of the skeleton key point corresponding to the first 2D skeleton key point; repeating these steps for all the 2D skeleton key point coordinates generates the complete human body posture information.
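A minimal NumPy sketch of the confidence-weighted triangulation described above; the function and variable names are illustrative, and each camera unit is assumed to contribute a 3×4 projection matrix, one detected 2D key point, and a scalar confidence:

```python
import numpy as np

def triangulate_weighted(projections, points2d, confidences):
    """Solve (A x W) x Y = 0 for one skeleton key point.

    projections -- 3x4 projection matrix per camera unit
    points2d    -- (x, y) 2D skeleton key point coordinate per camera unit
    confidences -- scalar confidence w per camera unit
    Returns the spatial 3D coordinate as a length-3 array.
    """
    rows = []
    for P, (x, y), w in zip(projections, points2d, confidences):
        # Direct-linear-transform rows x*p3 - p1 and y*p3 - p2, each
        # scaled by the confidence so occluded detections weigh less.
        rows.append(w * (x * P[2] - P[0]))
        rows.append(w * (y * P[2] - P[1]))
    A = np.stack(rows)         # the complete coordinate transformation matrix
    _, _, Vt = np.linalg.svd(A)
    Y = Vt[-1]                 # null-space vector, homogeneous coordinates
    return Y[:3] / Y[3]
```

With two cameras whose projection matrices differ by a pure translation, the function recovers the original 3D point even when one view is down-weighted.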
Further, the S3 includes:
constructing a neural network model for generating the 2D skeleton key point coordinates, where the neural network model can be a deep convolutional neural network with 153 layers; the input of the model is an RGB three-channel image, and the output is (key points, confidence), where the key points are the 2D skeleton key point coordinates (xi, yi) and the confidence is the confidence that the skeleton key point is not occluded;
constructing a regression model from a training sample set (I, key points, confidence) and the neural network model, and training the neural network model, where I is a pedestrian image, the key points are the coordinates of the 2D skeleton key points in the image, and the confidence is the confidence that the skeleton key points are not occluded. The regression model uses stochastic gradient descent: a difference L1 between the output key point coordinates and the coordinates in the ground-truth labels and a difference L2 between the predicted and labelled confidences are calculated, the total loss Loss = L1 + L2 is computed, and the network parameters are modified through gradient back-propagation, training the neural network model;
and resizing the several pedestrian images contained in the pedestrian image set to a uniform size and inputting them into the trained neural network model to obtain the pedestrians' 2D skeleton key point coordinates and confidences. Each input pedestrian image of uniform size passes through convolution, max-pooling, deconvolution and mean-pooling operations, after which the model outputs, for each 2D skeleton key point, a heat map predicting its coordinates together with the highest confidence corresponding to those coordinates. The coordinate with the highest confidence in the heat map is extracted as the predicted coordinate of the 2D skeleton key point, so the 2D skeleton key point coordinates and the corresponding confidences can be output; repeating these steps yields all the pedestrians' 2D skeleton key point coordinates and confidences.
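The heat-map decoding step can be sketched as follows; this is a simplified illustration assuming one heat map per key point, and `decode_heatmaps` is a hypothetical helper name (real models often also refine the peak to sub-pixel precision):

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Extract (x, y, confidence) for each key point from a stack of heat maps.

    heatmaps -- array of shape (num_keypoints, H, W); each cell holds the
                predicted confidence that the key point lies at that cell.
    """
    results = []
    for hm in heatmaps:
        row, col = np.unravel_index(np.argmax(hm), hm.shape)  # peak cell
        results.append((float(col), float(row), float(hm[row, col])))
    return results
```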
Further, S2 includes: grabbing, frame by frame, the set of same-frame pedestrian images contained in the synchronous video stream data. The pedestrian image set can be formed directly from the several pedestrian images acquired by the plurality of camera units, or from several second pedestrian images generated by performing pedestrian detection on those images; the user can choose the generation mode of the pedestrian image set according to the actual application scene and the requirements of the camera units.
Specifically, forming the pedestrian image set from the several second pedestrian images generated by performing pedestrian detection on the pedestrian images acquired by the plurality of camera units includes:
constructing a second neural network model for pedestrian detection, which can be a pedestrian detection model based on the YOLO detection framework; the input of the model is a 224 × 224 × 3 RGB three-channel image and the output is (x, y, w, h, confidence), where x and y are the coordinates of the upper-left corner of the target frame of a detected pedestrian, w is the width of the target frame, h is its height, and confidence represents the confidence that a pedestrian is present in the target frame;
constructing a regression model from a training sample set and the second neural network model, and training the second neural network model; the training sample set can be a picture set and label set represented as (I, x, y, w, h), where I is a complex background image containing pedestrians and (x, y, w, h) is the position of the pedestrian's target frame in the image. Specifically, after the regression model is constructed, an RGB image containing pedestrians is input and (x, y, w, h, confidence) is output; by computing the difference between the output coordinates and the coordinates in the ground-truth labels and using stochastic gradient descent, the network parameters are modified through gradient back-propagation;
inputting a plurality of pedestrian images into the trained second neural network model, and acquiring pedestrian coordinates of the plurality of pedestrian images;
and extracting, based on the several sets of pedestrian coordinates, the second pedestrian images contained in the corresponding pedestrian images, and generating the pedestrian image set. Further, this comprises: constructing an extraction frame from the pedestrian coordinates output by the second neural network model; extracting the image of the corresponding region of the pedestrian image with the extraction frame to form a second pedestrian image; and repeating these steps until every second pedestrian image containing pedestrian coordinates has been traversed, forming the pedestrian image set. When at least two second pedestrian images exist among the several pedestrian images, the several second pedestrian images are extracted to form the pedestrian image set; when only one second pedestrian image exists, the next frame of the multi-channel acquired images is obtained and pedestrian detection is performed again.
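The extraction-frame cropping can be sketched as follows, with (x, y, w, h) interpreted as the top-left corner plus width and height as defined above; the clamping to image boundaries is a practical assumption not stated in the text:

```python
import numpy as np

def crop_pedestrians(image, boxes):
    """Cut a second pedestrian image out of `image` for each (x, y, w, h) box.

    image -- a NumPy array indexed as image[row, col]
    boxes -- target frames as (x, y, w, h): top-left corner, width, height
    """
    H, W = image.shape[:2]
    crops = []
    for (x, y, w, h) in boxes:
        x0, y0 = max(0, int(x)), max(0, int(y))          # clamp to the image
        x1, y1 = min(W, int(x + w)), min(H, int(y + h))
        if x1 > x0 and y1 > y0:                          # skip empty regions
            crops.append(image[y0:y1, x0:x1])
    return crops
```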
Further, S1 includes: performing intrinsic calibration on the plurality of camera units to obtain the intrinsic parameters K and the distortion parameters, where K = (fx, fy, u0, v0), fx and fy are the focal lengths of the camera and u0 and v0 are the coordinates of the principal point, and the distortion parameters of the camera unit are (k1, k2, k3, p1, p2), where k1, k2 and k3 are the radial distortion coefficients of the camera and p1 and p2 are its tangential distortion coefficients; selecting a main camera and performing extrinsic calibration on the remaining camera units to obtain the translation vector T = (Tx, Ty, Tz), where Tx, Ty and Tz are the X-, Y- and Z-axis components of the translation when converting coordinates in a camera coordinate system to coordinates in the world coordinate system, and calculating the rotation matrix R = R(α, β, γ), where γ is the rotation angle around the Z axis of the camera coordinate system, β the rotation angle around the Y axis and α the rotation angle around the X axis, from which the projection matrix is obtained; and acquiring the distortion-corrected current-frame images of the multiple channels with an image processing function based on the intrinsic and distortion parameters.
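The distortion parameters (k1, k2, k3, p1, p2) enter the standard radial/tangential (Brown-Conrady) distortion model; a small sketch of applying it to a normalized image point is shown below (the distortion-correction step computes the inverse of this mapping, typically via a library routine; the function name is illustrative):

```python
def distort_normalized(xn, yn, k1, k2, k3, p1, p2):
    """Apply radial (k1, k2, k3) and tangential (p1, p2) distortion to a
    normalized image point (xn, yn) = ((u - u0)/fx, (v - v0)/fy)."""
    r2 = xn * xn + yn * yn
    radial = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    xd = xn * radial + 2 * p1 * xn * yn + p2 * (r2 + 2 * xn * xn)
    yd = yn * radial + p1 * (r2 + 2 * yn * yn) + 2 * p2 * xn * yn
    return xd, yd
```

With all five coefficients zero the mapping is the identity, which is a quick sanity check on a calibration.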
Correspondingly, as shown in fig. 2, the invention provides a human body posture generation system comprising a camera unit and a data processing unit, the data processing unit comprising a camera driving module, an image capture module, a neural network module and a coordinate transformation module. The camera unit is used to acquire multi-channel synchronous video stream data; the camera driving module drives the camera unit and receives the synchronous video stream data; the image capture module grabs, frame by frame, the set of same-frame pedestrian images contained in the synchronous video stream data; the neural network module performs 2D skeleton key point detection on the pedestrian image set and generates the pedestrians' 2D skeleton key point coordinates and confidences; and the coordinate transformation module constructs a coordinate transformation matrix from the 2D skeleton key point coordinates and their confidences and generates human body posture information with a triangulation algorithm. The image capture module is further used to perform image rectification, filtering and similar processing on the multi-frame image data contained in the synchronous video stream data acquired by the camera unit.
Further, the human body posture generation system further comprises a cross-platform computer vision library unit and a data storage unit, wherein the cross-platform computer vision library unit is used for providing image processing functions required by the camera driving module and the neural network module, and the data storage unit is used for storing a training sample set required by the neural network module.
Further, the neural network module constructs a neural network model for generating the 2D skeleton key point coordinates, constructs a regression model from the training sample set and the neural network model, trains the neural network model, resizes the several pedestrian images contained in the pedestrian image set to a uniform size, and inputs them into the trained neural network model to obtain the pedestrians' 2D skeleton key point coordinates and confidences.
Further, the coordinate transformation module constructs an initial transformation matrix from the 2D skeleton key point coordinates, updates the weight of each 2D skeleton key point coordinate in the initial transformation matrix according to the confidence corresponding to that coordinate to generate the coordinate transformation matrix, and calculates the spatial 3D coordinate with the formula (A × W) × Y = 0 to generate the human body posture information, where A is the initial transformation matrix, W is the confidence corresponding to the 2D skeleton key point coordinate, Y is the spatial 3D coordinate, and (A × W) is the coordinate transformation matrix.
The above is only a preferred embodiment of the present invention, and it should be noted that the above preferred embodiment should not be considered as limiting the present invention, and the protection scope of the present invention should be subject to the scope defined by the claims. It will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the spirit and scope of the invention, and these modifications and adaptations should be considered within the scope of the invention.

Claims (9)

1. A human body posture generation method is characterized by comprising the following steps:
S1: acquiring multi-channel synchronous video stream data from a plurality of camera units;
S2: grabbing, frame by frame, the set of same-frame pedestrian images contained in the synchronous video stream data;
S3: performing 2D skeleton key point detection on the pedestrian image set and generating the pedestrians' 2D skeleton key point coordinates and confidences;
S4: constructing a coordinate transformation matrix from the projection matrices of the camera units, the 2D skeleton key point coordinates and their confidences, and generating human body posture information with a triangulation algorithm.
2. The human body posture generation method according to claim 1, wherein S4 comprises:
constructing an initial transformation matrix based on the projection matrices and the 2D skeleton key point coordinates;
updating the weight of each 2D skeleton key point coordinate in the initial transformation matrix based on the confidence corresponding to that coordinate and generating the coordinate transformation matrix;
and calculating a spatial 3D coordinate using the formula (A × W) × Y = 0 and generating the human body posture information, wherein A is the initial transformation matrix, W is the confidence corresponding to the 2D skeleton key point coordinates, Y is the spatial 3D coordinate, and (A × W) is the coordinate transformation matrix.
3. The human body posture generation method according to claim 2, wherein S3 comprises:
constructing a neural network model for generating the 2D skeleton key point coordinates;
constructing a regression model based on a training sample set and the neural network model, and training the neural network model;
and inputting the pedestrian images contained in the pedestrian image set, after unifying their sizes, into the trained neural network model to obtain the 2D skeleton key point coordinates of the pedestrians and their confidences.
4. The human body posture generation method according to claim 3, wherein S1 comprises:
performing intrinsic calibration on the plurality of camera units to obtain intrinsic parameters and distortion parameters;
selecting a main camera, performing extrinsic calibration on the remaining camera units with respect to the main camera to obtain extrinsic rotation and translation vectors, and calculating the projection matrices;
and acquiring the distortion-corrected multi-channel synchronous video stream data using an image processing function based on the intrinsic parameters and the distortion parameters.
5. A human body posture generation system, characterized by comprising a camera unit and a data processing unit, wherein the data processing unit comprises a camera driving module, an image capture module, a neural network module, and a coordinate transformation module;
the camera unit is used for collecting multi-channel synchronous video stream data;
the camera driving module is used for driving the camera unit and receiving the synchronous video stream data;
the image capture module is used for capturing, frame by frame, the pedestrian image set of the same frame contained in the synchronous video stream data;
the neural network module is used for performing 2D skeleton key point detection on the pedestrian image set and generating 2D skeleton key point coordinates of the pedestrians and their confidences;
and the coordinate transformation module is used for constructing a coordinate transformation matrix from the 2D skeleton key point coordinates and their confidences and generating human body posture information using a triangulation algorithm.
6. The human body posture generation system according to claim 5, further comprising a cross-platform computer vision library unit for providing the image processing functions required by the camera driving module and the neural network module.
7. The human body posture generation system according to claim 6, further comprising a data storage unit for storing the training sample set required by the neural network module.
8. The human body posture generation system according to claim 7, wherein the neural network module obtains the 2D skeleton key point coordinates of the pedestrians and their confidences by constructing a neural network model for generating the 2D skeleton key point coordinates, constructing a regression model based on the training sample set and the neural network model, training the neural network model, and inputting the pedestrian images contained in the pedestrian image set, after unifying their sizes, into the trained neural network model.
9. The human body posture generation system according to claim 8, wherein the coordinate transformation module constructs an initial transformation matrix based on the 2D skeleton key point coordinates, updates the weight of each 2D skeleton key point coordinate in the initial transformation matrix based on the confidence corresponding to that coordinate to generate the coordinate transformation matrix, and calculates a spatial 3D coordinate using the formula (A × W) × Y = 0 to generate the human body posture information, wherein A is the initial transformation matrix, W is the confidence corresponding to the 2D skeleton key point coordinates, Y is the spatial 3D coordinate, and (A × W) is the coordinate transformation matrix.
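The calibration steps in claim 4 ultimately produce one projection matrix per camera, conventionally composed as P = K [R | t] from the intrinsic matrix K and the extrinsic rotation R and translation t relative to the main camera. A minimal sketch of that composition step, with purely hypothetical intrinsics (in practice K, R, t would come from calibration routines such as OpenCV's calibrateCamera and stereoCalibrate; the patent names neither):

```python
import numpy as np

def projection_matrix(K, R, t):
    """Compose a 3x4 projection matrix P = K [R | t] from intrinsic
    matrix K and extrinsic rotation R / translation t obtained by
    calibrating a camera against the selected main camera."""
    return K @ np.hstack([R, t.reshape(3, 1)])

# hypothetical intrinsics: focal length 800 px, principal point (320, 240)
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)        # the main camera defines the world frame,
t = np.zeros(3)      # so its rotation is identity and translation zero
P_main = projection_matrix(K, R, t)

# projecting a homogeneous 3D point on the optical axis
X = np.array([0.0, 0.0, 2.0, 1.0])
x = P_main @ X
u, v = x[:2] / x[2]  # lands on the principal point (320, 240)
```

These per-camera matrices are exactly the projections consumed by the triangulation step of claim 2.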
CN202011369283.9A 2020-11-30 2020-11-30 Human body posture generation method and system Pending CN112183506A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011369283.9A CN112183506A (en) 2020-11-30 2020-11-30 Human body posture generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011369283.9A CN112183506A (en) 2020-11-30 2020-11-30 Human body posture generation method and system

Publications (1)

Publication Number Publication Date
CN112183506A true CN112183506A (en) 2021-01-05

Family

ID=73918191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011369283.9A Pending CN112183506A (en) 2020-11-30 2020-11-30 Human body posture generation method and system

Country Status (1)

Country Link
CN (1) CN112183506A (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140355825A1 (en) * 2013-06-03 2014-12-04 Samsung Electronics Co., Ltd. Method and apparatus for estimating pose
CN107577451A (en) * 2017-08-03 2018-01-12 中国科学院自动化研究所 Multi-Kinect human skeleton coordinate transformation method, processing device, and readable storage medium
WO2019222383A1 (en) * 2018-05-15 2019-11-21 Northeastern University Multi-person pose estimation using skeleton prediction
WO2020115579A1 (en) * 2018-12-03 2020-06-11 Everseen Limited System and method to detect articulate body pose
CN110378871A (en) * 2019-06-06 2019-10-25 绍兴聚量数据技术有限公司 Game character original artwork copy detection method based on posture features
CN111291687A (en) * 2020-02-11 2020-06-16 青岛联合创智科技有限公司 3D human body action standard identification method
CN111709296A (en) * 2020-05-18 2020-09-25 北京奇艺世纪科技有限公司 Scene identification method and device, electronic equipment and readable storage medium
CN111797753A (en) * 2020-06-29 2020-10-20 北京灵汐科技有限公司 Training method, device, equipment and medium of image driving model, and image generation method, device and medium
CN111881887A (en) * 2020-08-21 2020-11-03 董秀园 Multi-camera-based motion attitude monitoring and guiding method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KARIM ISKAKOV et al.: "Learnable Triangulation of Human Pose", 2019 IEEE/CVF International Conference on Computer Vision (ICCV) *
RICHARD HARTLEY et al.: "Multiple View Geometry in Computer Vision", Cambridge University Press *
PaoPao Robot SLAM: "Human Pose Estimation Based on Learnable Triangulation" (in Chinese), https://www.sohu.com/a/359644158_715754 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344850A (en) * 2021-04-27 2021-09-03 广东工业大学 Hinge plate weld joint edge detection method
CN113762177A (en) * 2021-09-13 2021-12-07 成都市谛视科技有限公司 Real-time human body 3D posture estimation method and device, computer equipment and storage medium
CN115955603A (en) * 2022-12-06 2023-04-11 广州紫为云科技有限公司 Intelligent camera device based on somatosensory interaction of intelligent screen and implementation method
CN115955603B (en) * 2022-12-06 2024-05-03 广州紫为云科技有限公司 Intelligent camera device based on intelligent screen somatosensory interaction and implementation method
CN117314976A (en) * 2023-10-08 2023-12-29 玩出梦想(上海)科技有限公司 Target tracking method and data processing equipment
CN117314976B (en) * 2023-10-08 2024-05-31 玩出梦想(上海)科技有限公司 Target tracking method and data processing equipment
CN117557700A (en) * 2024-01-12 2024-02-13 杭州优链时代科技有限公司 Method and equipment for modeling characters
CN117557700B (en) * 2024-01-12 2024-03-22 杭州优链时代科技有限公司 Method and equipment for modeling characters

Similar Documents

Publication Publication Date Title
CN108537876B (en) Three-dimensional reconstruction method, device, equipment and storage medium
US11285613B2 (en) Robot vision image feature extraction method and apparatus and robot using the same
WO2020001168A1 (en) Three-dimensional reconstruction method, apparatus, and device, and storage medium
CN112183506A (en) Human body posture generation method and system
CN108898676B (en) Method and system for detecting collision and shielding between virtual and real objects
US20190141247A1 (en) Threshold determination in a ransac algorithm
CN112200157A (en) Human body 3D posture recognition method and system for reducing image background interference
WO2021004416A1 (en) Method and apparatus for establishing beacon map on basis of visual beacons
EP3186787A1 (en) Method and device for registering an image to a model
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN108537844B (en) Visual SLAM loop detection method fusing geometric information
Liao et al. Model-free distortion rectification framework bridged by distortion distribution map
WO2021098802A1 (en) Object detection device, method, and systerm
CN109313805A (en) Image processing apparatus, image processing system, image processing method and program
US11562489B2 (en) Pixel-wise hand segmentation of multi-modal hand activity video dataset
EP3185212B1 (en) Dynamic particle filter parameterization
CN114898407A (en) Tooth target instance segmentation and intelligent preview method based on deep learning
EP2800055A1 (en) Method and system for generating a 3D model
CN116580169B (en) Digital man driving method and device, electronic equipment and storage medium
CN113255429B (en) Method and system for estimating and tracking human body posture in video
WO2022018811A1 (en) Three-dimensional posture of subject estimation device, three-dimensional posture estimation method, and program
CN116912467A (en) Image stitching method, device, equipment and storage medium
CN115841602A (en) Construction method and device of three-dimensional attitude estimation data set based on multiple visual angles
CN107993247A (en) Tracking positioning method, system, medium and computing device
CN112818965B (en) Multi-scale image target detection method and system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210105