
CN114757822B - Binocular-based human body three-dimensional key point detection method and system

Info

Publication number: CN114757822B
Application number: CN202210663896.6A
Authority: CN
Inventors: 祝敏航; 李玲; 曹卫强; 徐晓刚; 王军
Original assignee: Zhejiang Gongshang University; Zhejiang Lab
Current assignee: Zhejiang Gongshang University; Zhejiang Lab
Priority date: 2022-06-14
Filing date: 2022-06-14
Publication date: 2022-11-04
Anticipated expiration: 2042-06-14
Also published as: CN114757822A

Abstract

The invention discloses a binocular-based human body three-dimensional key point detection method and system. The method comprises the following steps. Step one: obtaining the human body bounding boxes in the left and right view images, respectively, using a target detection module based on the YOLOv5 target detection algorithm. Step two: cropping the human body image corresponding to each bounding box, and extracting a two-dimensional heatmap of each human body key point in the left and right view images through a human body two-dimensional key point recognition algorithm. Step three: back-projecting the left and right two-dimensional key point heatmaps into three-dimensional space to obtain a three-dimensional back-projection heatmap. Step four: inputting the three-dimensional back-projection heatmap into a three-dimensional convolutional encoder-decoder network, and obtaining a three-dimensional key point heatmap through encoding and decoding operations. Step five: obtaining, for each channel of the three-dimensional key point heatmap, the coordinates of the three-dimensional key point represented by that channel through a soft-argmax operation, and finally obtaining the coordinates of all three-dimensional key points of the human body. The invention has high feasibility and practicability.

Description

Binocular-based human body three-dimensional key point detection method and system

Technical Field

The invention relates to the field of computer vision, in particular to a binocular-based human body three-dimensional key point detection method and system.

Background

In the field of intelligent monitoring, human behavior analysis is a basic requirement. Behavior analysis builds on human body key point detection, and commonly used key points are detected in two-dimensional images and contain only xy coordinates. However, human behavior takes place in space, and behavior analysis algorithms built on three-dimensional human body key points generally achieve higher accuracy. The input for three-dimensional human body key point detection can be monocular or multi-view (binocular and above). Monocular input can detect three-dimensional key points relative to a central point of the human body, usually located at the center of the pelvis, but it has difficulty detecting three-dimensional key points in a global coordinate system; that is, the movement of the human body over the ground cannot be modeled in space. Multi-view input can detect three-dimensional key points in a global coordinate system and can model both the relative motion and the global motion of the human body, but multi-view systems are costly, complex to deploy, and rely on time-consuming processing algorithms.

Therefore, a method and a system that detect three-dimensional key points in a global coordinate system and are convenient to deploy are needed, so that such detection can be applied more easily in the field of intelligent monitoring.

Disclosure of Invention

In order to solve the technical problems in the prior art, the invention provides a binocular-based human body three-dimensional key point detection method and system, which are used for detecting human body three-dimensional key points based on a global coordinate system and are convenient to deploy, and the specific technical scheme is as follows:

a binocular-based human body three-dimensional key point detection method comprises the following steps:

step one: obtaining the human body bounding boxes in the left and right view images, respectively, using a target detection module based on the YOLOv5 target detection algorithm;

step two: cropping the human body image corresponding to each human body bounding box, and extracting a two-dimensional heatmap of each human body key point in the left and right view images through a human body two-dimensional key point recognition algorithm;

step three: back-projecting the left and right two-dimensional key point heatmaps into three-dimensional space to obtain a three-dimensional back-projection heatmap;

step four: inputting the three-dimensional back-projection heatmap into a three-dimensional convolutional encoder-decoder network, and obtaining a three-dimensional key point heatmap through encoding and decoding operations;

step five: obtaining, for each channel of the three-dimensional key point heatmap, the coordinates of the three-dimensional key point represented by that channel through a soft-argmax operation, and finally obtaining the coordinates of all three-dimensional key points of the human body.

Further, step two specifically comprises: cropping the human body image corresponding to the human body bounding box and processing it with the human body two-dimensional key point recognition algorithm, in which a feature extraction network extracts an appearance feature map of the human body image in the encoding stage and deconvolution layers generate a two-dimensional heatmap of each key point in the decoding stage, yielding a left two-dimensional heatmap and a right two-dimensional heatmap.

Further, the third step specifically includes the following substeps:

step 3.1, applying the argmax operation to the two-dimensional heatmap of the pelvis key point to obtain the two-dimensional coordinates of the pelvis key point in the left and right view images, and obtaining the three-dimensional coordinates of the pelvis through triangulation

$P_{\text{pelvis}} = \operatorname{Triangulate}(x_i, K_i, T_i)$

where $x_i$ is the two-dimensional coordinates of the pelvis in the i-th image, $K_i$ is the intrinsic parameters of the i-th camera, and $T_i$ is the extrinsic parameters of the i-th camera;

step 3.2, taking the three-dimensional coordinates of the pelvis as the center, creating a space cube with equal length, width and height, and dividing the space cube into a grid of x × y × z cells, where x = y = z is a positive integer;

step 3.3, according to the pinhole model of the camera

$s\,x = K\,X$

where $x$ is the two-dimensional image coordinates (in homogeneous form), $s$ is a scale factor, $K$ is the intrinsic parameters of the camera, and $X$ is the three-dimensional space coordinates, projecting the grid center coordinates onto the left two-dimensional heatmap and the right two-dimensional heatmap, respectively, obtaining the heat value at each projection point by bilinear interpolation, and adding the left and right heat values at the grid center to obtain a three-dimensional back-projection heatmap of dimension C × x × y × z, where C is the number of key points.
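The argmax and triangulation of step 3.1 can be illustrated with a minimal NumPy sketch. The text does not say which triangulation method is used; the sketch below assumes the standard linear (DLT) formulation solved by SVD. The function names `heatmap_argmax`, `triangulate`, and `project`, as well as all camera values, are hypothetical and chosen only for illustration.

```python
import numpy as np

def heatmap_argmax(heatmap):
    """Return (u, v) pixel coordinates of the maximum of an H x W heatmap."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return np.array([col, row], dtype=float)

def triangulate(points_2d, intrinsics, extrinsics):
    """Linear (DLT) triangulation of one 3D point from two or more views.

    points_2d:  list of (u, v) pixel coordinates, one per camera
    intrinsics: list of 3x3 matrices K_i
    extrinsics: list of 3x4 matrices [R_i | t_i] (world -> camera), an assumed convention
    """
    rows = []
    for (u, v), K, Rt in zip(points_2d, intrinsics, extrinsics):
        P = K @ Rt                      # 3x4 projection matrix
        rows.append(u * P[2] - P[0])    # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The homogeneous solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(K, Rt, X):
    """Pinhole projection of a 3D world point to pixel coordinates."""
    x = K @ Rt @ np.append(X, 1.0)
    return x[:2] / x[2]

# Toy stereo rig; every value here is made up for illustration.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
Rt_left = np.hstack([np.eye(3), np.zeros((3, 1))])
Rt_right = np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])  # 20 cm baseline

pelvis_true = np.array([0.1, -0.05, 3.0])
uv_left = project(K, Rt_left, pelvis_true)
uv_right = project(K, Rt_right, pelvis_true)
pelvis_est = triangulate([uv_left, uv_right], [K, K], [Rt_left, Rt_right])
```

In practice the argmax is taken on a heatmap computed from a cropped and resized image, so its result would first have to be mapped back to full-image pixel coordinates before triangulation; that mapping is omitted here.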

Further, the fourth step specifically includes the following substeps:

step 4.1, inputting the three-dimensional back-projection heatmap into a three-dimensional convolutional encoding network to extract three-dimensional heatmap features;

step 4.2, inputting the three-dimensional heatmap features into a three-dimensional convolutional decoding network to parse the three-dimensional heatmap features;

step 4.3, outputting, by the decoding network, a three-dimensional key point heatmap of dimension C × x × y × z, where C is the number of key points.

Furthermore, the three-dimensional convolutional encoding network is formed by connecting a plurality of encoding modules in sequence, and the three-dimensional convolutional decoding network is formed by connecting a plurality of decoding modules in sequence; each encoding module is formed by connecting a three-dimensional convolution block and a skip layer, and each decoding module is formed by connecting a three-dimensional deconvolution block and a skip layer.

A binocular-based human body three-dimensional key point detection system comprises a sensing unit, a processing unit, a three-dimensional key point detection algorithm module and a display module;

the sensing unit consists of two cameras, arranged left and right, whose intrinsic and extrinsic parameters have been calibrated;

the processing unit is a desktop computer and an end-side AI module and is used for encoding and decoding the videos acquired by the cameras, running the algorithm library, and displaying the videos;

the three-dimensional key point detection algorithm module is used for detecting the positions of the human body in the left and right camera views, obtaining two-dimensional key point heatmaps from the left and right human body images through a human body two-dimensional key point recognition algorithm, back-projecting the two-dimensional heatmaps into three-dimensional space to obtain a three-dimensional back-projection heatmap, and inputting the three-dimensional back-projection heatmap into a three-dimensional convolutional encoder-decoder network to obtain the final three-dimensional human body key point coordinates;

the display module is mainly used for three-dimensional display and displays the generated three-dimensional human body key points in real time.

A binocular-based human body three-dimensional key point detection device comprises one or more processors and is used for achieving the binocular-based human body three-dimensional key point detection method.

A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements a binocular-based human three-dimensional keypoint detection method.

The invention has the advantages and beneficial effects that:

the method and the system for detecting the three-dimensional key points based on the global coordinate system are convenient to deploy, can be conveniently applied to the field of intelligent monitoring, and have high feasibility and practicability.

Drawings

FIG. 1 is a schematic flow chart of a binocular-based human body three-dimensional key point detection method of the present invention;

FIG. 2 is a schematic diagram of a detailed detection process of human body three-dimensional key points in left and right views of the method of the present invention;

FIG. 3 is a schematic diagram of a three-dimensional convolutional encoding module employed in the present invention;

FIG. 4 is a schematic diagram of a three-dimensional convolutional decoding module employed by the present invention;

FIG. 5 is a block diagram of a binocular-based human body three-dimensional keypoint detection system of the present invention;

FIG. 6 is a schematic structural diagram of a binocular-based human body three-dimensional key point detection device of the present invention.

Detailed Description

In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.

As shown in fig. 1 and 2, a binocular-based human body three-dimensional key point detection method includes the following steps:

Step one: obtaining the human body bounding boxes in the left and right view images, respectively, using a target detection module based on the YOLOv5 target detection algorithm.

Step two: cropping the human body image corresponding to each human body bounding box, and extracting a two-dimensional heatmap of each human body key point in the left and right view images through a human body two-dimensional key point recognition algorithm.

In the human body two-dimensional key point recognition algorithm, a feature extraction network extracts an appearance feature map of the human body in the encoding stage, and deconvolution layers generate a two-dimensional heatmap of each key point in the decoding stage, yielding the left and right two-dimensional heatmaps.

The human body two-dimensional key point recognition algorithm is the Simple Baseline method for key point detection and tracking. The two-dimensional heatmap has dimension C × H × W, where C, H and W are the number of key points, the heatmap height and the heatmap width, respectively.
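The cropping at the start of step two can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `crop_person`, the box format `(x1, y1, x2, y2)`, and the clamping behavior are assumptions.

```python
import numpy as np

def crop_person(image, box):
    """Crop an H x W x 3 image to a bounding box (x1, y1, x2, y2) given in pixels.

    The box is clamped to the image borders. The returned offset (x1, y1)
    lets key points found in the crop be mapped back to full-image coordinates.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    x1, y1 = max(0, int(x1)), max(0, int(y1))
    x2, y2 = min(w, int(x2)), min(h, int(y2))
    return image[y1:y2, x1:x2], (x1, y1)

# A detection that partly leaves the frame (values are made up).
image = np.zeros((480, 640, 3), dtype=np.uint8)
crop, offset = crop_person(image, (600, -10, 700, 200))
```

Keeping the crop offset matters later: a key point located in the crop's heatmap has to be shifted (and rescaled, if the crop was resized) before it can be related to the calibrated camera.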

Step three: back-projecting the left and right two-dimensional key point heatmaps into three-dimensional space to obtain a three-dimensional back-projection heatmap. The specific substeps are as follows:

step 3.1, applying the argmax operation to the two-dimensional heatmap of the pelvis key point to obtain the two-dimensional coordinates of the pelvis key point in the left and right view images, and obtaining the three-dimensional coordinates of the pelvis through triangulation

$P_{\text{pelvis}} = \operatorname{Triangulate}(x_i, K_i, T_i)$

where $x_i$ is the two-dimensional coordinates of the pelvis in the i-th image, $K_i$ is the intrinsic parameters of the i-th camera, and $T_i$ is the extrinsic parameters of the i-th camera;

step 3.2, taking the three-dimensional coordinates of the pelvis as the center, creating a space cube with equal length, width and height, and dividing the space cube into a grid of x × y × z cells, where x = y = z is a positive integer;

in this embodiment, the length, width and height of the space cube are all 2.5 meters, and the cube is divided into a 64 × 64 × 64 grid;

step 3.3, according to the pinhole model of the camera

$s\,x = K\,X$

where $x$ is the two-dimensional image coordinates (in homogeneous form), $s$ is a scale factor, $K$ is the intrinsic parameters of the camera, and $X$ is the three-dimensional space coordinates, projecting the grid center coordinates onto the left two-dimensional heatmap and the right two-dimensional heatmap, respectively, obtaining the heat value at each projection point by bilinear interpolation, and adding the left and right heat values at the grid center to obtain a three-dimensional back-projection heatmap of dimension C × x × y × z, where C is the number of key points.
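Steps 3.2 and 3.3 can be sketched in NumPy as below. This is an illustration under stated assumptions, not the patent's code: the helper names are hypothetical, the heatmaps are assumed to be expressed in full-image pixel coordinates, the extrinsics are assumed to be world-to-camera `[R | t]` matrices, samples that fall outside a heatmap are assumed to contribute zero, and the cube is assumed to lie in front of both cameras.

```python
import numpy as np

def make_grid(center, side=2.5, n=64):
    """Centers of an n x n x n voxel grid filling a cube of edge `side` around `center`."""
    step = side / n
    offsets = (np.arange(n) + 0.5) * step - side / 2.0
    gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1) + center        # shape (n, n, n, 3)

def project(points, K, Rt):
    """Pinhole projection of (..., 3) world points to (..., 2) pixel coordinates."""
    cam = points @ Rt[:, :3].T + Rt[:, 3]
    pix = cam @ K.T
    return pix[..., :2] / pix[..., 2:3]

def bilinear_sample(heatmaps, uv):
    """Sample C x H x W heatmaps at float pixel positions uv (..., 2); zero outside the map."""
    _, H, W = heatmaps.shape
    u, v = uv[..., 0], uv[..., 1]
    valid = (u >= 0) & (u <= W - 1) & (v >= 0) & (v <= H - 1)
    u = np.clip(u, 0, W - 1)
    v = np.clip(v, 0, H - 1)
    u0 = np.clip(np.floor(u).astype(int), 0, W - 2)
    v0 = np.clip(np.floor(v).astype(int), 0, H - 2)
    du, dv = u - u0, v - v0
    top = heatmaps[:, v0, u0] * (1 - du) + heatmaps[:, v0, u0 + 1] * du
    bottom = heatmaps[:, v0 + 1, u0] * (1 - du) + heatmaps[:, v0 + 1, u0 + 1] * du
    return (top * (1 - dv) + bottom * dv) * valid

def back_project(hm_left, hm_right, cams, center, side=2.5, n=64):
    """Sum the left and right heat values sampled at every projected grid center."""
    grid = make_grid(center, side, n)
    volume = 0.0
    for hm, (K, Rt) in zip((hm_left, hm_right), cams):
        volume = volume + bilinear_sample(hm, project(grid, K, Rt))
    return volume                                          # shape (C, n, n, n)

# Toy setup with made-up cameras and constant heatmaps (C = 2), small grid for speed.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
Rt_left = np.hstack([np.eye(3), np.zeros((3, 1))])
Rt_right = np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])
ones = np.ones((2, 480, 640))
volume = back_project(ones, ones, [(K, Rt_left), (K, Rt_right)],
                      center=np.array([0.0, 0.0, 3.0]), side=1.0, n=8)
```

With the embodiment's values (`side=2.5`, `n=64`), each voxel has an edge of about 3.9 cm, which bounds the resolution of the back-projected volume before the network refines it.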

Step four: inputting the three-dimensional back-projection heatmap into a three-dimensional convolutional encoder-decoder network, and obtaining the three-dimensional key point heatmap through encoding and decoding operations, which specifically comprises the following substeps:

step 4.1, inputting the three-dimensional back-projection heatmap into a three-dimensional convolutional encoding network, which is formed by sequentially connecting 5 encoding modules as shown in FIG. 3 and is used for extracting the features of the three-dimensional heatmap;

step 4.2, inputting the feature map generated by the encoding network into a three-dimensional convolutional decoding network, which is formed by sequentially connecting 5 decoding modules as shown in FIG. 4 and is used for parsing the features of the three-dimensional heatmap;

step 4.3, outputting, by the decoding network, a three-dimensional key point heatmap of dimension C × 64 × 64 × 64, where C is the number of key points.

The three-dimensional convolutional encoder-decoder network is formed by sequentially connecting a plurality of encoding modules and a plurality of decoding modules, wherein each encoding module is formed by connecting a three-dimensional convolution block and a skip layer, and each decoding module is formed by connecting a three-dimensional deconvolution block and a skip layer.
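The text does not detail the inside of a module (kernel size, channel counts, normalization, or how the skip layer is wired). As a rough illustration only, the NumPy sketch below assumes that a "convolution block connected with a skip layer" means a residual connection: a 3 × 3 × 3 convolution followed by ReLU, added to the block's input. The names `conv3d_same` and `residual_block` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d_same(x, weight):
    """3x3x3 cross-correlation with zero padding (output keeps the input grid size).

    x:      (Cin, X, Y, Z) feature volume
    weight: (Cout, Cin, 3, 3, 3) kernel
    """
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1), (1, 1)))
    # windows has shape (Cin, X, Y, Z, 3, 3, 3)
    windows = sliding_window_view(padded, (3, 3, 3), axis=(1, 2, 3))
    return np.einsum("cxyzijk,ocijk->oxyz", windows, weight)

def residual_block(x, weight):
    """ReLU(conv(x)) plus an identity skip path; assumes Cout == Cin."""
    return np.maximum(conv3d_same(x, weight), 0.0) + x

# Random toy input and weights, only to show that shapes are preserved.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8, 8))
w = rng.standard_normal((4, 4, 3, 3, 3)) * 0.1
y = residual_block(x, w)
```

A full network would stack several such modules and would normally be written in a deep learning framework with learned weights; this sketch only shows how a skip path lets the input pass around the convolution.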

Step five: obtaining, for each channel of the three-dimensional key point heatmap, the coordinates of the three-dimensional key point represented by that channel through the soft-argmax operation, and finally obtaining the coordinates of all three-dimensional key points of the human body.
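The soft-argmax of step five can be sketched as a softmax-weighted average of the voxel-center coordinates, computed per channel. This is a minimal NumPy illustration; the function name `soft_argmax_3d` and the temperature parameter `beta` are assumptions and do not come from the text.

```python
import numpy as np

def soft_argmax_3d(volume, grid, beta=100.0):
    """Expected 3D coordinate for each channel of a heatmap volume.

    volume: (C, x, y, z) key point heatmaps
    grid:   (x, y, z, 3) coordinates of the voxel centers
    beta:   hypothetical temperature; larger values approach a hard argmax
    """
    C = volume.shape[0]
    flat = volume.reshape(C, -1) * beta
    flat = flat - flat.max(axis=1, keepdims=True)    # subtract the max for numerical stability
    weights = np.exp(flat)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over all voxels of a channel
    return weights @ grid.reshape(-1, 3)             # (C, 3)

# Toy 4 x 4 x 4 grid whose coordinates equal the voxel indices.
idx = np.arange(4.0)
grid = np.stack(np.meshgrid(idx, idx, idx, indexing="ij"), axis=-1)
vol = np.zeros((1, 4, 4, 4))
vol[0, 1, 2, 3] = 1.0
coords = soft_argmax_3d(vol, grid)
```

Unlike a hard argmax, this weighted average is differentiable and can return positions between voxel centers, so the result is not limited to the grid resolution.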

As shown in fig. 5, a binocular-based human body three-dimensional key point detection system comprises a sensing unit, a processing unit, a three-dimensional key point detection algorithm module and a display module;

the processing unit is a desktop computer, an end-side AI module and the like and is used for coding and decoding videos acquired by the camera, running an algorithm library and displaying the videos;

Corresponding to the embodiment of the binocular-based human body three-dimensional key point detection method, the invention also provides an embodiment of a binocular-based human body three-dimensional key point detection device.

Referring to fig. 6, the binocular-based human body three-dimensional key point detection apparatus provided in the embodiment of the present invention includes one or more processors, and is configured to implement the binocular-based human body three-dimensional key point detection method in the above embodiment.

The binocular-based human body three-dimensional key point detection device of the embodiment of the invention can be applied to any equipment with data processing capability, such as a computer or a similar device or apparatus. The device embodiment may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the device, as a logical device, is formed when the processor of the equipment on which it resides reads the corresponding computer program instructions from the nonvolatile memory into the memory and runs them. In terms of hardware, FIG. 6 shows a hardware structure diagram of the equipment with data processing capability on which the binocular-based human body three-dimensional key point detection device resides. In addition to the processor, memory, network interface, and nonvolatile memory shown in FIG. 6, the equipment on which the device resides may also include other hardware according to its actual function, which is not described again here.

The specific details of the implementation process of the functions and actions of each unit in the above device are the implementation processes of the corresponding steps in the above method, and are not described herein again.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.

The embodiment of the invention also provides a computer readable storage medium, wherein a program is stored on the computer readable storage medium, and when the program is executed by a processor, the binocular-based human body three-dimensional key point detection method in the embodiment is realized.

The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in any of the foregoing embodiments. The computer readable storage medium may also be an external storage device of that device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, or a Flash memory Card (Flash Card) provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any device with data processing capability. The computer readable storage medium is used for storing the computer program and other programs and data required by the device with data processing capability, and may also be used for temporarily storing data that has been output or is to be output.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the practice of the invention as described in the foregoing examples, or that certain features may be substituted in the practice of the invention. All changes, equivalents and the like which come within the spirit and principles of the invention are desired to be protected.

Claims

1. A binocular-based human body three-dimensional key point detection method is characterized by comprising the following steps:

step two: cropping the human body image corresponding to the human body bounding box, and extracting a two-dimensional heatmap of each human body key point in the left and right view images through a human body two-dimensional key point recognition algorithm; which specifically comprises: cropping the human body image corresponding to the human body bounding box and processing it with the human body two-dimensional key point recognition algorithm, in which a feature extraction network extracts an appearance feature map of the human body image in the encoding stage and deconvolution layers generate a two-dimensional heatmap of each key point in the decoding stage, yielding a left two-dimensional heatmap and a right two-dimensional heatmap;

step three: back-projecting the left and right two-dimensional key point heatmaps into three-dimensional space to obtain a three-dimensional back-projection heatmap; which specifically comprises the following substeps:

step 3.1, applying the argmax operation to the two-dimensional heatmap of the pelvis key point to obtain the two-dimensional coordinates of the pelvis key point in the left and right view images, and obtaining the three-dimensional coordinates of the pelvis through triangulation

$P_{\text{pelvis}} = \operatorname{Triangulate}(x_i, K_i, T_i)$

where $x_i$ is the two-dimensional coordinates of the pelvis in the i-th image, $K_i$ is the intrinsic parameters of the i-th camera, and $T_i$ is the extrinsic parameters of the i-th camera;

step 3.3, according to the pinhole model of the camera

$s\,x = K\,X$

where $x$ is the two-dimensional image coordinates (in homogeneous form), $s$ is a scale factor, $K$ is the intrinsic parameters of the camera, and $X$ is the three-dimensional space coordinates, projecting the grid center coordinates onto the left two-dimensional heatmap and the right two-dimensional heatmap, respectively, obtaining the heat value at each projection point by bilinear interpolation, and adding the left and right heat values at the grid center to obtain a three-dimensional back-projection heatmap of dimension C × x × y × z, where C is the number of key points;

2. The binocular-based human body three-dimensional key point detection method as claimed in claim 1, wherein the fourth step specifically comprises the following substeps:

3. The binocular-based human body three-dimensional key point detection method as claimed in claim 2, wherein the three-dimensional convolution encoding network is formed by sequentially connecting a plurality of encoding modules, the three-dimensional convolution decoding network is formed by sequentially connecting a plurality of decoding modules, the encoding modules are formed by connecting three-dimensional convolution blocks and skip layers, and the decoding modules are formed by connecting three-dimensional deconvolution blocks and skip layers.

4. A binocular-based human body three-dimensional key point detection device is characterized by comprising one or more processors and being used for realizing the binocular-based human body three-dimensional key point detection method of any one of claims 1 to 3.

5. A computer-readable storage medium, having stored thereon a program which, when executed by a processor, implements a binocular-based human three-dimensional keypoint detection method of any one of claims 1 to 3.