CN115035551A - Three-dimensional human body posture estimation method, device, equipment and storage medium - Google Patents


Publication number
CN115035551A
CN115035551A (application CN202210956640.4A); granted as CN115035551B
Authority
CN
China
Prior art keywords
human body
heat map
dimensional
information
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210956640.4A
Other languages
Chinese (zh)
Other versions
CN115035551B
Inventor
胡波 (Hu Bo)
胡世卓 (Hu Shizhuo)
周斌 (Zhou Bin)
沈振冈 (Shen Zhengang)
李艳红 (Li Yanhong)
Current Assignee
Wuhan Etah Information Technology Co ltd
Original Assignee
Wuhan Etah Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Wuhan Etah Information Technology Co ltd
Priority to CN202210956640.4A
Publication of CN115035551A
Application granted
Publication of CN115035551B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention discloses a three-dimensional human body posture estimation method, a device, equipment and a storage medium. The method comprises: generating a target heat map corresponding to an input image with a multi-view fusion network, and matching and fusing the human body center point heat map information of the target heat map to obtain fusion information; projecting the fusion information into 3D space to obtain a three-dimensional feature volume; and estimating the three-dimensional human body posture according to the three-dimensional feature volume. Anchoring the multi-view match on the human body center point reduces the inference search space for the remaining human body key points, which lowers the estimation error, improves the posture reconstruction quality, reduces the computation cost and avoids the influence of quantization errors, thereby improving the accuracy of three-dimensional human body posture estimation. The scheme is simple and reliable to implement, is applicable to three-dimensional human body posture estimation in most scenes, and improves the speed and efficiency of the estimation.

Description

Three-dimensional human body posture estimation method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of multi-view fusion, in particular to a three-dimensional human body posture estimation method, device, equipment and storage medium.
Background
In recent years, three-dimensional human posture estimation through multi-view matching has mainly fallen into two categories: multi-stage methods that lift two-dimensional estimates to three dimensions, and direct regression methods. Two-dimensional-to-three-dimensional methods estimate the 2D keypoints of the same person in each view and then lift the matched 2D single-view poses into 3D space; some extend the 2D pictorial structure model to a 3D pictorial structure model to encode the pairwise relations between body joint positions, while others first solve multi-person 2D pose detection, associate the detections across camera views, and then recover the 3D poses by triangulation. These methods are effective in specific scenes, but they depend heavily on the 2D detection results: inaccurate two-dimensional posture estimation, especially under occlusion, greatly degrades the reconstruction quality of the 3D posture.
Direct regression methods, also called end-to-end methods, exploit the ability of deep neural networks to fit complex functions: they usually need no auxiliary algorithms or intermediate data, and predict the three-dimensional posture coordinates directly with a regression network. The VoxelPose model, for example, builds a discretized 3D feature volume from multi-view features; instead of estimating the 2D pose in each view independently, it projects the obtained 2D heat maps directly into 3D space for inference. However, the computational cost of searching for key points over the whole space grows geometrically as the space is divided more finely, and the result is also affected by the quantization error introduced by the spatial discretization.
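The trade-off in the voxel-based approach can be sketched numerically. This is an illustration only, not code from the patent: the 8 m capture-space extent and the grid sizes are assumed example values. Doubling the grid resolution halves the worst-case quantization error but multiplies the number of voxels to search by eight.

```python
# Illustrative sketch of the cost/accuracy trade-off of a discretized 3D
# feature volume; the 8 m space and grid sizes are assumed example values.

def voxel_cost(space_mm, voxels_per_axis):
    """Voxel count to search and worst-case quantization error (mm)."""
    voxel_size = space_mm / voxels_per_axis
    n_voxels = voxels_per_axis ** 3          # cubic growth with grid resolution
    max_quant_err = voxel_size / 2.0         # a point can lie half a voxel off
    return n_voxels, max_quant_err

for n in (16, 32, 64):
    print(n, voxel_cost(8000.0, n))          # doubling n: 8x voxels, 1/2 error
```

This is why a coarse-to-fine scheme that first localizes a center point, then searches a small sub-volume, pays off.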
Disclosure of Invention
The invention mainly aims to provide a three-dimensional human body posture estimation method, device, equipment and storage medium, so as to solve the technical problems in the prior art that dependence on 2D detection results lets inaccurate two-dimensional posture estimation greatly degrade the reconstruction quality of the 3D posture, and that direct regression incurs high computation cost and large errors.
In a first aspect, the present invention provides a three-dimensional human body posture estimation method, including the following steps:
generating a target heat map corresponding to an input image by adopting a multi-view fusion network, and matching and fusing human body central point heat map information of the target heat map to obtain fusion information;
projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume;
and estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
Optionally, the generating a target heat map corresponding to the input image by using a multi-view fusion network, matching and fusing the human body central point heat map information of the target heat map to obtain fusion information includes:
inputting an input image into a high-resolution network of a multi-view fusion network to acquire high-resolution characteristic information;
constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module;
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map;
matching and fusing the human body center point heat map information of the target heat map to obtain fused information.
Optionally, the fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain the target heat map includes:
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a fused feature map;
inputting the fused feature map into a deconvolution module, performing convolution and channel conversion to obtain an output result, and concatenating the output result with the fused feature map along the channel dimension to obtain a concatenated feature;
and raising the resolution of the concatenated feature with a deconvolution layer, extracting target feature information from the resolution-raised concatenated feature through the residual units, and generating the target heat map from the target feature information.
Optionally, the matching and fusing the human body central point heat map information of the target heat map to obtain fused information includes:
taking a preset key point between the hip joints of the human body in the target heat map as the human body center point;
and matching and fusing the human body center point heat map information of the target heat map according to the human body center points of the multiple views to obtain fused information.
Optionally, the matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information, including:
sampling the epipolar lines corresponding to the center points of the respective views in the target heat map according to the human body center points of the multiple views to obtain a corresponding point set;
generating a probability region of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set;
fusing the values of all points on the epipolar line within the probability region through a fully connected layer to obtain the finally fused center point coordinates;
and carrying out coordinate matching fusion on the human body central point heat map information of the target heat map according to the central point coordinate to obtain fusion information.
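The steps above can be sketched as follows. This is a hedged illustration, not the patent's implementation: the toy fundamental matrix F, the sample positions and the Gaussian sigma are made-up values, and a fixed Gaussian-weighted average stands in for the learned fully connected fusion layer.

```python
import math

def epipolar_line(F, x):
    """l = F @ x: epipolar line (a, b, c) of homogeneous point x = (u, v, 1)."""
    return [sum(F[i][j] * x[j] for j in range(3)) for i in range(3)]

def sample_line(l, us):
    """Candidate points (u, v) on the line a*u + b*v + c = 0."""
    a, b, c = l
    return [(u, -(a * u + c) / b) for u in us]

def gaussian_weight(p, mean, sigma):
    """Response of a 2D isotropic Gaussian heat map at point p."""
    d2 = (p[0] - mean[0]) ** 2 + (p[1] - mean[1]) ** 2
    return math.exp(-d2 / (2 * sigma ** 2))

def fuse_center(points, heat_mean, sigma):
    """Weighted average of candidates; stands in for the learned FC fusion."""
    w = [gaussian_weight(p, heat_mean, sigma) for p in points]
    s = sum(w)
    return (sum(wi * p[0] for wi, p in zip(w, points)) / s,
            sum(wi * p[1] for wi, p in zip(w, points)) / s)

# A toy fundamental matrix whose epipolar line of (30, 50) is simply v = 50:
F = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
line = epipolar_line(F, [30.0, 50.0, 1.0])
candidates = sample_line(line, [10.0, 20.0, 30.0, 40.0])
fused = fuse_center(candidates, (25.0, 50.0), 10.0)
print(fused)                                 # center pulled to the heat-map peak
```

The epipolar constraint restricts the search for the matching center to a single line per view, which is what shrinks the inference search space for the remaining key points.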
Optionally, the projecting the fusion information to a 3D space to obtain a three-dimensional feature volume includes:
acquiring camera calibration data of a video camera, and projecting each voxel center in the fusion information to a camera view by using the camera calibration data to obtain camera view projection data;
and constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
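The voxel-projection step can be illustrated as below. This is a hedged sketch, not the patent's code: the 3 × 4 projection matrix models an identity-rotation camera with an assumed focal length of 1000 and principal point (320, 240), not real calibration data.

```python
# Hedged sketch: project a voxel center into a camera view with a 3x4
# projection matrix P = K[R|t] built from camera calibration data.

def project_voxel(P, X):
    """Pinhole projection of 3D point X = (x, y, z) to pixel (u, v)."""
    x, y, z = X
    h = [P[i][0] * x + P[i][1] * y + P[i][2] * z + P[i][3] for i in range(3)]
    return h[0] / h[2], h[1] / h[2]          # perspective divide by depth

P = [[1000.0, 0.0, 320.0, 0.0],              # fx = 1000, cx = 320 (assumed)
     [0.0, 1000.0, 240.0, 0.0],              # fy = 1000, cy = 240 (assumed)
     [0.0, 0.0, 1.0, 0.0]]
print(project_voxel(P, (0.5, -0.25, 2.0)))   # a voxel center 2 m in front
```

Each voxel's projected pixel is where the heat-map value for that voxel is sampled in each camera view before the 3D CNN aggregates them.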
Optionally, the estimating a three-dimensional human body posture according to the three-dimensional feature volume includes:
dividing the three-dimensional feature volume into a plurality of discrete grids;
and acquiring the 3D heat map space coordinate of each key point in each discrete grid, and regressing the 3D heat map space coordinate to obtain the three-dimensional human body posture.
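The regression over the discrete grid can be sketched with a soft-argmax (shown in 1-D for brevity; the patent regresses per-keypoint 3D heat maps): a softmax-weighted average of grid coordinates yields a continuous coordinate, avoiding the half-voxel error of a hard argmax. The scores and grid below are made-up values.

```python
import math

def soft_argmax(scores, coords):
    """Softmax-weighted average of grid coordinates (integral regression)."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]    # subtract max for stability
    z = sum(e)
    return sum(w * c for w, c in zip(e, coords)) / z

# The heat-map peak lies between grid nodes 2 and 3; a hard argmax must pick
# one of them, while the soft-argmax lands between them.
coords = [0.0, 1.0, 2.0, 3.0, 4.0]
scores = [0.0, 1.0, 4.0, 4.0, 1.0]
print(soft_argmax(scores, coords))
```

This is how regressing on the 3D heat map yields sub-voxel key-point positions even on a coarse grid.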
In a second aspect, to achieve the above object, the present invention further provides a three-dimensional human body posture estimation device, including:
the fusion module is used for generating a target heat map corresponding to the input image by adopting a multi-view fusion network, and matching and fusing the human body center point heat map information of the target heat map to obtain fusion information;
the projection module is used for projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume;
and the posture estimation module is used for estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
In a third aspect, to achieve the above object, the present invention further provides a three-dimensional human body posture estimation device, including: a memory, a processor and a three-dimensional human body posture estimation program stored on the memory and executable on the processor, the program being configured to implement the steps of the three-dimensional human body posture estimation method as described above.
In a fourth aspect, to achieve the above object, the present invention further provides a storage medium having a three-dimensional human body posture estimation program stored thereon, wherein the program, when executed by a processor, implements the steps of the three-dimensional human body posture estimation method as described above.
The invention provides a three-dimensional human body posture estimation method which generates a target heat map corresponding to an input image with a multi-view fusion network, matches and fuses the human body center point heat map information of the target heat map to obtain fusion information, projects the fusion information into 3D space to obtain a three-dimensional feature volume, and estimates the three-dimensional human body posture from that volume. Anchoring the multi-view match on the human body center point reduces the inference search space for the remaining human body key points, which lowers the estimation error, improves the posture reconstruction quality, reduces the computation cost and avoids the influence of quantization errors, thereby improving the accuracy of three-dimensional human body posture estimation. The scheme is simple and reliable to implement, is applicable to most scenes, and improves the speed and efficiency of three-dimensional human body posture estimation.
Drawings
FIG. 1 is a schematic diagram of an apparatus architecture of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for estimating a three-dimensional human body pose according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for estimating a three-dimensional human body pose according to a second embodiment of the present invention;
FIG. 4 is a network structure of key point detection in the three-dimensional human body posture estimation method of the present invention;
FIG. 5 is a flowchart illustrating a method for estimating a three-dimensional human body pose according to a third embodiment of the present invention;
FIG. 6 is a schematic epipolar geometry diagram in the three-dimensional human body pose estimation method of the present invention;
FIG. 7 is a schematic diagram of a multi-view epipolar constraint model in the three-dimensional human body pose estimation method of the present invention;
FIG. 8 is a flowchart illustrating a method for estimating a three-dimensional human body pose according to a fourth embodiment of the present invention;
FIG. 9 is a schematic diagram of a 3D CNN network structure in the three-dimensional human body posture estimation method according to the present invention;
FIG. 10 is a flowchart illustrating a fifth embodiment of a three-dimensional human body posture estimation method according to the present invention;
FIG. 11 is a functional block diagram of a three-dimensional human body posture estimation device according to a first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
The solution of the embodiment of the invention is mainly as follows: generating a target heat map corresponding to an input image with a multi-view fusion network, and matching and fusing the human body center point heat map information of the target heat map to obtain fusion information; projecting the fusion information into 3D space to obtain a three-dimensional feature volume; and estimating the three-dimensional human body posture from the three-dimensional feature volume. This reduces the inference search space for the remaining human body key points, lowers the estimation error, improves the posture reconstruction quality, reduces the computation cost, avoids the influence of quantization errors and improves the estimation accuracy; the scheme is simple and reliable to implement, suits most scenes, and improves the speed and efficiency of estimation, thereby solving the prior-art problems that inaccurate two-dimensional posture estimation greatly degrades the reconstruction quality of the 3D posture and that direct regression has high computation cost and large errors.
Referring to fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the apparatus may include: a processor 1001 (e.g. a CPU), a communication bus 1002, a user interface 1003, a network interface 1004 and a memory 1005. The communication bus 1002 implements connection communication among these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration of the apparatus shown in fig. 1 is not intended to be limiting of the apparatus and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 1005, as a storage medium, may include an operating system, a network communication module, a user interface module and a three-dimensional human body posture estimation program.
The apparatus of the present invention calls a three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001 and performs the following operations:
generating a target heat map corresponding to an input image by adopting a multi-view fusion network, and matching and fusing human body center point heat map information of the target heat map to obtain fusion information;
projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume;
and estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
inputting an input image into a high-resolution network of a multi-view fusion network to acquire high-resolution characteristic information;
constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module;
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map;
matching and fusing the human body central point heat map information of the target heat map to obtain fused information.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a fused feature map;
inputting the fused feature map into a deconvolution module, performing convolution and channel conversion to obtain an output result, and concatenating the output result with the fused feature map along the channel dimension to obtain a concatenated feature;
and raising the resolution of the concatenated feature with a deconvolution layer, extracting target feature information from the resolution-raised concatenated feature through the residual units, and generating the target heat map from the target feature information.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
taking preset key points between hip joints of the human body in the target heat map as central points of the human body;
matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
sampling the epipolar lines corresponding to the center points of the respective views in the target heat map according to the human body center points of the multiple views to obtain a corresponding point set;
generating a probability region of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set;
fusing the values of all points on the epipolar line within the probability region through a fully connected layer to obtain the finally fused center point coordinates;
and carrying out coordinate matching fusion on the human body central point heat map information of the target heat map according to the central point coordinate to obtain fusion information.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
acquiring camera calibration data of a video camera, and projecting each voxel center in the fusion information into a camera view by using the camera calibration data to obtain camera view projection data;
and constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 by the processor 1001, and also performs the following operations:
dividing the three-dimensional feature volume into a plurality of discrete grids;
and acquiring the 3D heat map space coordinate of each key point in each discrete grid, and regressing the 3D heat map space coordinate to obtain the three-dimensional human body posture.
According to the scheme, a multi-view fusion network generates a target heat map corresponding to the input image, and the human body center point heat map information of the target heat map is matched and fused to obtain fusion information; the fusion information is projected into 3D space to obtain a three-dimensional feature volume; and the three-dimensional human body posture is estimated from the three-dimensional feature volume. This improves the estimation accuracy, reduces the inference search space for the remaining human body key points, lowers the estimation error, improves the posture reconstruction quality, reduces the computation cost and avoids the influence of quantization errors; the scheme is simple and reliable to implement, suits most scenes, and improves the speed and efficiency of three-dimensional human body posture estimation.
Based on the hardware structure, the embodiment of the three-dimensional human body posture estimation method is provided.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for estimating a three-dimensional human body pose according to a first embodiment of the present invention.
In a first embodiment, the three-dimensional human body posture estimation method comprises the following steps:
and S10, generating a target heat map corresponding to the input image by adopting a multi-view fusion network, and matching and fusing the human body central point heat map information of the target heat map to obtain fusion information.
It should be noted that, through a Multi-View Fusion Network (MVFNet) built on the high-resolution network HRNet, a target heat map corresponding to the input image can be obtained, and the human body center point heat map information of the target heat map can be matched and fused to obtain the fused heat map information.
And step S20, projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume.
It will be appreciated that projecting the fusion information into 3D space enables a three-dimensional feature volume to be constructed from coarse to fine.
And step S30, estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
It should be appreciated that an accurate three-dimensional body pose can be estimated from the three-dimensional feature volumes.
According to the scheme, a multi-view fusion network generates a target heat map corresponding to the input image, and the human body center point heat map information of the target heat map is matched and fused to obtain fusion information; the fusion information is projected into 3D space to obtain a three-dimensional feature volume; and the three-dimensional human body posture is estimated from the three-dimensional feature volume. This improves the estimation accuracy, reduces the inference search space for the remaining human body key points, lowers the estimation error, improves the posture reconstruction quality, reduces the computation cost and avoids the influence of quantization errors; the scheme is simple and reliable to implement, suits most scenes, and improves the speed and efficiency of three-dimensional human body posture estimation.
Further, fig. 3 is a schematic flow chart of a second embodiment of the three-dimensional human body posture estimation method of the present invention, and as shown in fig. 3, the second embodiment of the three-dimensional human body posture estimation method of the present invention is proposed based on the first embodiment, and in this embodiment, the step S10 specifically includes the following steps:
and step S11, inputting the input image into a high-resolution network of the multi-view fusion network, and acquiring high-resolution characteristic information.
It should be noted that high-resolution feature information can be obtained by feeding the input image into the high-resolution network of the multi-view fusion network.
In a specific implementation, earlier networks obtained high-resolution features by first downsampling a high-resolution feature map to low resolution and then restoring the high resolution to realize multi-scale feature extraction, e.g. U-Net, SegNet and Hourglass. In such architectures the high-resolution features come from two sources: first, the original high-resolution features, which provide only low-level semantics because they pass through few convolutions; second, features recovered by downsampling and then upsampling, where repeated up- and down-sampling loses a large amount of effective feature information. HRNet instead keeps a high-resolution branch throughout while gradually introducing lower-resolution convolutions in parallel branches, and connects the different-resolution convolutions in parallel for information exchange, so that every feature from high to low resolution repeatedly receives information from the other parallel sub-networks, yielding both strong semantic information and accurate position information.
And step S12, constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module.
It can be understood that a residual unit of the high-resolution network can be constructed through the high-resolution feature information, and a multi-resolution module can be obtained by performing convolution sampling on the residual unit.
In a specific implementation, the multi-view fusion network MVFNet of this embodiment uses HRNet as the basic framework and adds a deconvolution module to obtain a heat map with higher resolution and richer semantic information. As shown in fig. 4, the key point detection network structure of the three-dimensional human body posture estimation method of the invention is divided into four stages whose main body is four parallel sub-networks: the high-resolution sub-network forms the first stage, sub-networks from high to low resolution are gradually added, and the multi-resolution sub-networks are connected in parallel. The first stage comprises 4 residual units, each, as in ResNet-50, composed of a Bottleneck with 64 channels; it is then downsampled to the second stage by a 3 × 3 convolution with stride 2. The second, third and fourth stages respectively comprise 1, 4 and 3 multi-resolution blocks, so that the network keeps sufficient depth to fully extract feature information; each multi-resolution block has 4 residual units, adopting the BasicBlock of ResNet, i.e. two 3 × 3 convolutions.
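The stage layout above implies some simple bookkeeping, sketched below. The stem output size and the branch channel widths (doubling as resolution halves) follow common HRNet conventions and are assumptions not stated in the text; only the unit counts per stage (4 bottleneck units, then 1, 4 and 3 multi-resolution blocks of 4 basic units) come from the description.

```python
# Hedged bookkeeping sketch for the four-branch, four-stage layout.

def branch_shapes(stem_hw, base_channels, n_branches):
    """(height, width, channels) per parallel branch; each new branch halves
    the spatial resolution and (by assumption) doubles the channel width."""
    h, w = stem_hw
    return [(h >> i, w >> i, base_channels << i) for i in range(n_branches)]

print(branch_shapes((64, 64), 32, 4))        # assumed 64x64 stem output

# Residual units per stage: 4 bottleneck units in stage 1, then 1, 4 and 3
# multi-resolution blocks of 4 basic units each in stages 2-4.
units_per_stage = [4] + [blocks * 4 for blocks in (1, 4, 3)]
print(units_per_stage)
```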
And step S13, fusing the feature maps with different resolutions in each stage in the multi-resolution module to obtain the target heat map.
It should be understood that the target heat map can be obtained by fusing the feature maps of different resolutions at different stages in the multi-resolution module.
Further, the step S13 specifically includes the following steps:
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a fused feature map;
inputting the fusion characteristic graph into a deconvolution module, obtaining an output result through convolution and channel conversion, and performing dimensionality splicing on the output result and the fusion characteristic graph to obtain a splicing characteristic;
and improving the resolution ratio of the splicing features according to the deconvolution layer, extracting target feature information of the splicing features with improved resolution ratio through the residual error unit, and generating a target heat map according to the target feature information.
It can be understood that a fused feature map can be obtained by fusing feature maps of different resolutions at each stage in the multi-resolution module, channel conversion is performed in the deconvolution module, and after dimension splicing is performed on an output result and the fused feature map, spliced features can be obtained, so that the resolution of the spliced features can be improved according to the deconvolution layer, and target feature information of the spliced features after resolution improvement is extracted through the residual error unit, thereby generating a target heat map.
In the specific implementation, feature maps with different resolutions at each stage are fused at the end of a network, the fused feature maps are used as the input of a deconvolution module, channel conversion is carried out through convolution, the result is subjected to dimensional splicing with the input features, the resolution of the feature maps is improved to be 2 times of the original resolution by deconvolution with a convolution kernel of 4 × 4, feature information is further extracted through 4 residual blocks, and finally heatmap is predicted through convolution of 1 × 1; the higher resolution is beneficial to obtaining richer key point information, and further accurate three-dimensional human body posture estimation is realized.
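As a quick sanity check of the resolution arithmetic above, the sketch below assumes padding 1 for the 4 × 4 stride-2 deconvolution (the text states only the kernel size and the 2× result) and also counts the channels produced by the dimensional splicing; the concrete sizes are hypothetical:

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    # output size of a transposed convolution: (in - 1) * stride - 2 * padding + kernel
    return (size - 1) * stride - 2 * padding + kernel

# fused feature map at, e.g., 64 x 48 spatial resolution (hypothetical)
h, w = 64, 48
h2, w2 = deconv_out(h), deconv_out(w)   # doubled by the 4x4, stride-2 deconvolution

# dimensional splicing: channels after concatenating the conv output
# (c_conv channels) with the c_in-channel input feature map (both hypothetical)
c_in, c_conv = 32, 32
c_cat = c_in + c_conv
```

With kernel 4, stride 2 and padding 1 the formula gives exactly twice the input size, matching the "2 times of the original resolution" stated above.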
And step S14, matching and fusing the human body central point heat map information of the target heat map to obtain fused information.
It can be understood that the fused information can be obtained by matching and fusing the human body central point heat map information of the target heat map.
According to the scheme, the high-resolution characteristic information is acquired by inputting the input image into the high-resolution network of the multi-view fusion network; constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module; fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map; matching and fusing the human body central point heat map information of the target heat map to obtain fused information, so that the fused information can be accurately obtained, the accuracy of three-dimensional human body posture estimation is improved, reasoning and searching spaces of other human body key points are reduced, and the error of three-dimensional human body posture estimation is reduced.
Further, fig. 5 is a schematic flow chart of a third embodiment of the three-dimensional human body posture estimation method of the present invention, and as shown in fig. 5, the third embodiment of the three-dimensional human body posture estimation method of the present invention is proposed based on the second embodiment, and in this embodiment, the step S14 specifically includes the following steps:
and step S141, taking preset key points between hip joints of the human body in the target heat map as central points of the human body.
It should be noted that the preset key points between the hip joints of the human body in the target heat map may be used as the center points of the human body.
It will be appreciated that an epipolar geometry exists between the multi-view images; it describes the intrinsic projective relationship between two views, is independent of the external scene, and depends only on the camera intrinsic parameters and the relative pose between views. Fully exploiting the epipolar geometric relationship helps the network acquire more position information, eliminates irrelevant noise during training, and improves the accuracy of network prediction. The principle is shown in fig. 6; fig. 6 is an epipolar geometry diagram in the three-dimensional human body posture estimation method of the invention. Referring to fig. 6, $O_1$ and $O_2$ are the optical centers of the two cameras, $I_1$ and $I_2$ are the image planes, and $e_1$ and $e_2$, the projections of each camera's optical center onto the other image plane, are called the epipoles; if the two cameras cannot capture each other owing to their viewing angles, the epipole does not appear on the imaging plane. The projections of an observed point $P$ on $I_1$ and $I_2$ are $P_1$ and $P_2$; since the depth is unknown, $P$ may lie at any point on the ray $O_1 P_1$, which projects onto a line $L_2$ in the right image called the epipolar line corresponding to $P_1$, so the point $P_2$ corresponding to $P_1$ in the right image must lie on the epipolar line $L_2$. The relative positions of matching points are thus constrained by the geometry of the image plane space; this constraint can be expressed by the fundamental matrix, and according to the literature the epipolar constraint is shown as formula (1):

$$P_2^{\top} F P_1 = 0 \qquad (1)$$

where $F$ is the fundamental matrix, whose calculation formula is shown in (2):

$$F = K_2^{-\top} E K_1^{-1} \qquad (2)$$

where $K_1$ and $K_2$ are the intrinsic parameter matrices of the two cameras, and $E$ is the essential matrix, composed of the camera's extrinsic translation and rotation matrices, $E = [t]_{\times} R$. The network should therefore take full advantage of this geometric constraint relationship between views.
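The epipolar constraint of formulas (1) and (2) can be checked numerically. The sketch below builds a fundamental matrix from assumed intrinsics $K$ and an assumed relative pose $(R, t)$, projects one 3D point into both views, and verifies that the constraint residual vanishes; all camera parameters here are hypothetical:

```python
import numpy as np

def skew(t):
    # cross-product matrix [t]x, so that skew(t) @ v == np.cross(t, v)
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# hypothetical intrinsics, shared by both cameras
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# hypothetical relative pose: small rotation about y, baseline along x
theta = 0.1
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([1.0, 0.0, 0.0])

E = skew(t) @ R                                  # essential matrix
F = np.linalg.inv(K).T @ E @ np.linalg.inv(K)    # fundamental matrix, formula (2)

# project one 3D point into both views (camera 1 at the origin)
X = np.array([0.2, -0.1, 5.0])
p1 = K @ X
p1 = p1 / p1[2]
p2 = K @ (R @ X + t)
p2 = p2 / p2[2]

residual = p2 @ F @ p1                           # epipolar constraint, formula (1)
```

The residual is zero up to floating-point error for any 3D point, which is what lets the network restrict the search for a matching point to the epipolar line.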
In a specific implementation, a multi-view epipolar constraint model is introduced into the MVFNet network provided by this embodiment: the key point between the human hip joints is taken as the central point, and heatmap matching fusion of the multi-view human central points is performed. The high-resolution heatmap is input, the epipolar line corresponding to the central point of each view is solved from the epipolar geometric constraint relation, and it is sampled to obtain a set of corresponding points. According to the characteristics of the heatmap, a probability area of Gaussian distribution is generated at the corresponding coordinate, with a high response only near the corresponding point and values close to 0 everywhere else, so the values of all points on the epipolar line can be fused with a fully-connected layer to improve the accuracy of central point detection. Finally, the difference between the final fused center point coordinates and the labeled center point coordinates is penalized with an L2 loss to impose the training constraint.
And S142, matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information.
It can be understood that the human body center point heat map information of the target heat map can be matched and fused through the human body center points of the multiple views, so as to obtain the fused heatmap information.
Further, the step S142 specifically includes the following steps:
sampling polar lines corresponding to the central points of the graphs in the target heat map according to the central points of the human body of the multiple views to obtain a corresponding point set;
generating probability regions of Gaussian distribution of the target heat map at corresponding coordinates according to the corresponding point sets;
fusing values of all points on the epipolar line in the probability region through a full-connection layer to obtain a finally fused central point coordinate;
and carrying out coordinate matching fusion on the human body central point heat map information of the target heat map according to the central point coordinate to obtain fusion information.
It can be understood that polar lines corresponding to central points of the respective graphs in the target heat map are sampled, after corresponding point sets are obtained, probability regions with gaussian distribution can be generated, and then values of all points on the polar lines in the probability regions are fused through the full connection layer, so that finally fused central point coordinates are obtained, and heat map coordinate matching is performed to obtain fusion information.
In a specific implementation, as shown in fig. 7, fig. 7 is a schematic diagram of a multi-view epipolar constraint model in the three-dimensional human body posture estimation method of the present invention, see fig. 7, a high-resolution heatmap is input, epipolar lines corresponding to central points of each graph are solved by an epipolar geometric constraint relationship, and sampling is performed to obtain a set of corresponding points; according to the characteristics of the heatmap, a probability area of Gaussian distribution is generated at a corresponding coordinate, only high response exists near a corresponding point, and other places are close to 0, so that the values of all points on the epipolar line can be fused by using a full-connection layer, and the accuracy of central point detection is improved; finally, the differences between the final fused centroid coordinates and the labeled centroid coordinates are compared using L2 loss to perform training constraints.
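A minimal sketch of this sampling-and-fusion step: points are sampled along a hypothetical epipolar line through a Gaussian center-point heatmap, and a softmax weighting over the sampled responses stands in for the learned fully-connected fusion layer; all coordinates and sizes are illustrative:

```python
import numpy as np

H, W = 96, 128
true_pt = np.array([80.0, 40.0])          # ground-truth center point in this view

# Gaussian center-point heatmap: high response near true_pt, near zero elsewhere
ys, xs = np.mgrid[0:H, 0:W]
heatmap = np.exp(-((xs - true_pt[0])**2 + (ys - true_pt[1])**2) / (2 * 2.0**2))

# hypothetical epipolar line a*x + b*y + c = 0 passing through the true point
a, b = 0.5, -1.0
c = -(a * true_pt[0] + b * true_pt[1])

# sample candidate points along the epipolar line inside the image
xs_s = np.linspace(0, W - 1, 64)
ys_s = (-c - a * xs_s) / b
keep = (ys_s >= 0) & (ys_s <= H - 1)
pts = np.stack([xs_s[keep], ys_s[keep]], axis=1)

# read the heatmap response at each sample (nearest pixel for brevity)
vals = heatmap[np.round(pts[:, 1]).astype(int), np.round(pts[:, 0]).astype(int)]

# fuse the sampled points: softmax weighting stands in for the learned FC layer
w_soft = np.exp(vals * 20.0)
w_soft /= w_soft.sum()
fused = (w_soft[:, None] * pts).sum(axis=0)   # fused center point estimate
```

Because the response is high only near the true corresponding point, the fused coordinate lands near it, which is the behaviour the fully-connected fusion layer is trained to reproduce.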
According to the scheme, the preset key points between hip joints of the human body in the target heat map are used as the central points of the human body; matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fusion information, so that the fusion information can be accurately obtained, the accuracy of three-dimensional human body posture estimation is improved, reasoning and searching spaces of other human body key points are reduced, and the error of three-dimensional human body posture estimation is reduced.
Further, fig. 8 is a schematic flowchart of a fourth embodiment of the three-dimensional human body posture estimation method of the present invention, and as shown in fig. 8, the fourth embodiment of the three-dimensional human body posture estimation method of the present invention is proposed based on the first embodiment, in this embodiment, the step S20 specifically includes the following steps:
and step S21, acquiring camera calibration data of the video camera, and projecting each voxel center in the fusion information into a camera view by using the camera calibration data to obtain camera view projection data.
It should be noted that, after camera calibration data of the video camera is obtained, each voxel center in the fusion information may be projected into a camera view by using the camera calibration data, so as to obtain camera view projection data.
It can be understood that the features of all the obtained views are aggregated into a 3D voxel volume by inverse image projection: a voxel grid containing the whole space observed by the cameras is initialized, the center of each voxel is projected into each camera view using the camera calibration data, and the feature volume is then constructed from coarse to fine by the 3D CNN network, centered on these projections, to estimate the positions of all the key points.
In specific implementation, referring to fig. 9, fig. 9 is a schematic diagram of the 3D CNN network structure in the three-dimensional human body posture estimation method of the present invention. As shown in fig. 9, the input of the 3D CNN network is a 3D feature volume constructed by projecting the 2D heatmaps of all camera views into a common 3D space; because the heatmaps encode the position information of the central point, the resulting 3D feature volume also carries rich information for detecting the 3D posture, and the search area of the other key points in 3D space can be reduced according to human body prior information. Black open arrows represent standard 3D convolutional layers, black solid arrows represent residual blocks of two 3D convolutional layers, linear arrows denote pooling, and dashed arrows denote deconvolution. The three-dimensional space is discretized into $X \times Y \times Z$ discrete locations $\{G^{x,y,z}\}$, and each location can be regarded as an anchor of a detected person; to reduce the quantization error, $X$, $Y$ and $Z$ are adjusted so that the distance between adjacent anchors decreases. On common data sets the space is typically $8\,\mathrm{m} \times 8\,\mathrm{m} \times 2\,\mathrm{m}$, so $X$, $Y$ and $Z$ are set to 80, 80 and 20.
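The discretization and projection described above can be sketched as follows, using the stated 8 m × 8 m × 2 m space and 80 × 80 × 20 anchors; the calibration data K, R, t are hypothetical stand-ins:

```python
import numpy as np

# discretize an 8 m x 8 m x 2 m space into X x Y x Z = 80 x 80 x 20 anchors
space = np.array([8000.0, 8000.0, 2000.0])        # millimetres
bins = np.array([80, 80, 20])
step = space / bins                               # spacing between adjacent anchors

# anchor (voxel) centers along each axis, then the full grid of centers
axes = [np.arange(b) * s + s / 2 for b, s in zip(bins, step)]
gx, gy, gz = np.meshgrid(*axes, indexing="ij")
centers = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)

# project every voxel center into one camera view with hypothetical
# calibration data: K (intrinsics) and R, t (extrinsics)
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([-4000.0, -4000.0, 6000.0])          # camera placed to see the space
cam = centers @ R.T + t                           # camera-frame coordinates
proj = cam @ K.T
proj = proj[:, :2] / proj[:, 2:3]                 # pixel coordinates of each center
```

Each of the 80 × 80 × 20 = 128 000 anchors thus gets a pixel location in every view, from which its heatmap values can be read.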
And S22, constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
It will be appreciated that by constructing feature volumes from coarse to fine centered by the 3D CNN network to estimate the location of all keypoints, three-dimensional feature volumes can be constructed from the camera view projection data.
In specific implementation, the 2D heatmap values at the projection position of each anchor in all camera views are fused to calculate the feature vector of each anchor. Let the 2D heatmap in view $a$ be denoted $M_a \in \mathbb{R}^{K \times H \times W}$, where $K$ is the number of body keypoints. For each anchor position $G^{x,y,z}$, its projected position in view $a$ is $P_a^{x,y,z}$, and the heatmap value there is expressed as $M_a(P_a^{x,y,z})$. Then the feature vector of the anchor is calculated as the average heatmap value over all camera views, as shown in formula (3):

$$F^{x,y,z} = \frac{1}{V} \sum_{a=1}^{V} M_a\bigl(P_a^{x,y,z}\bigr) \qquad (3)$$

where $V$ is the number of cameras; it can be seen that $F^{x,y,z}$ actually encodes the likelihood of the $K$ key points lying at $G^{x,y,z}$. Then a 3D bounding box is used to represent the position of the key points of the detected human body, with the size and orientation of the bounding box fixed in the experiment; this is a reasonable simplification because human variation in 3D space is limited. A small network slides over the feature volume $F$; each sliding window centered at an anchor is mapped to a low-dimensional feature that is fed to a fully-connected layer, which regresses a confidence as the output of the 3D CNN network, indicating the likelihood of a person appearing at that location. The ground-truth (GT) heatmap value of each anchor is calculated according to the distance from the anchor to the GT pose: for each pair of GT and anchor, a Gaussian score is computed from the distance between them, decreasing exponentially as the distance increases. If there are $N$ people in the scene, an anchor may have multiple scores, and the $N$ largest, representing the positions of the $N$ people, are retained through non-maximum suppression (NMS).
In this embodiment, by using the above scheme, camera calibration data of a video camera is acquired, and the camera calibration data is used to project each voxel center in the fusion information into a camera view, so as to obtain camera view projection data; the three-dimensional characteristic volume is constructed by utilizing the 3D CNN network according to the camera view projection data, the accuracy of three-dimensional human body posture estimation can be improved, reasoning search spaces of other human body key points are reduced, errors of the three-dimensional human body posture estimation are reduced, the posture reconstruction quality is improved, the calculation cost is reduced, the influence of quantization errors is avoided, the accuracy of the three-dimensional human body posture estimation is improved, and the scheme is simple and reliable to implement.
Further, fig. 10 is a schematic flowchart of a fifth embodiment of the three-dimensional human body posture estimation method according to the present invention, and as shown in fig. 10, the fifth embodiment of the three-dimensional human body posture estimation method according to the present invention is proposed based on the first embodiment, in this embodiment, the step S30 specifically includes the following steps:
and step S31, dividing the three-dimensional characteristic volume into a plurality of discrete grids.
It should be noted that, after the three-dimensional feature volume is divided, a plurality of discrete grids can be obtained.
In specific implementation, the first 3D CNN network cannot accurately estimate the 3D positions of all the key points, so a finer-grained feature volume is constructed in the second 3D CNN network, with its size set to $2000\,\mathrm{mm} \times 2000\,\mathrm{mm} \times 2000\,\mathrm{mm}$, much smaller than $8\,\mathrm{m} \times 8\,\mathrm{m} \times 2\,\mathrm{m}$ but sufficient to cover any pose of a person; the volume is divided into $X_0 = Y_0 = Z_0 = 64$ discrete grids, and its network body structure is the same as that of the first 3D CNN.
And S32, acquiring the 3D heat map space coordinates of each key point in each discrete grid, and performing regression on the 3D heat map space coordinates to obtain the three-dimensional human body posture.
It should be understood that, further, the 3D heat map space coordinates of each key point in each discrete grid are obtained, and then the 3D heat map space coordinates may be regressed, so that the three-dimensional human body posture may be obtained.
It will be appreciated that, based on the constructed feature volume, a 3D heatmap $H_k \in \mathbb{R}^{X_0 \times Y_0 \times Z_0}$ is estimated for each keypoint $k$, from which the accurate three-dimensional human body posture is finally regressed. Taking $H_k$ as a normalized weight over the grid, the centroid of each keypoint $J_k$ is calculated according to equation (4):

$$J_k = \sum_{x,y,z} H_k(x, y, z) \cdot (x, y, z) \qquad (4)$$

The estimated joint position is compared with the true position $J_k^{*}$ to train the network; the loss function $L$ is represented by formula (5):

$$L = \sum_{k=1}^{K} \bigl\| J_k - J_k^{*} \bigr\|_1 \qquad (5)$$
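A small sketch of equation (4) and formula (5), assuming the 3D heatmap is normalized to sum to one; the grid is shrunk to 8³ for illustration and the Gaussian heatmap is synthetic:

```python
import numpy as np

X0 = Y0 = Z0 = 8                          # small grid for illustration (64 in the text)
true_joint = np.array([5.0, 2.0, 6.0])    # ground-truth J_k* in grid coordinates

# synthetic 3D heatmap H_k: Gaussian around the true joint, normalized to sum to 1
gx, gy, gz = np.meshgrid(*[np.arange(n) for n in (X0, Y0, Z0)], indexing="ij")
grid = np.stack([gx, gy, gz], axis=-1).astype(float)
d2 = ((grid - true_joint) ** 2).sum(axis=-1)
Hk = np.exp(-d2 / 2.0)
Hk /= Hk.sum()

# equation (4): joint estimate as the heatmap-weighted centroid of grid coordinates
Jk = (Hk[..., None] * grid).sum(axis=(0, 1, 2))

# formula (5), for this single keypoint: L1 distance to the ground truth
loss = np.abs(Jk - true_joint).sum()
```

Because the centroid is a differentiable function of the heatmap, the loss can be backpropagated through the whole 3D CNN.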
In a specific implementation, the accuracy of the 3D pose on the Campus and Shelf datasets is evaluated using the Percentage of Correct Parts in 3D (PCP3D), i.e., the percentage of correctly estimated joint positions; a detection is considered correct if the distance between the predicted joint position and the true joint position is less than half the limb length. For each frame $f$ and human skeleton $S$, the MPJPE is calculated by equation (6):

$$E_{\mathrm{MPJPE}}(f, S) = \frac{1}{N_S} \sum_{i=1}^{N_S} \bigl\| J_i(f, S) - J_i^{*}(f, S) \bigr\|_2 \qquad (6)$$

where $N_S$ is the number of joints in the skeleton $S$; for a set of frames, the error is the average of the MPJPE over all frames. Meanwhile, Average Precision (AP) and Recall at MPJPE thresholds from 25 mm to 150 mm, with a step of 25 mm, are taken as performance indexes for comprehensively evaluating 3D human body center detection and human body posture estimation. The AP is the area under the PR curve spanned by Recall on the horizontal axis and Precision on the vertical axis; the larger the AP value, the better the overall performance of the detection model.
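Equation (6) can be sketched on synthetic joints; the 15-joint skeleton and the constant 10 mm per-axis offset are illustrative:

```python
import numpy as np

def mpjpe(pred, gt):
    # equation (6): mean Euclidean distance over the N_S joints of a skeleton
    return np.linalg.norm(pred - gt, axis=-1).mean()

rng = np.random.default_rng(1)
gt = rng.random((15, 3)) * 1000.0         # hypothetical 15-joint skeleton, in mm
pred = gt + 10.0                          # every joint off by (10, 10, 10) mm

err = mpjpe(pred, gt)                     # per-joint error is sqrt(300) mm here
```

Averaging this value over all frames gives the dataset-level MPJPE the text describes.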
According to the scheme, the three-dimensional characteristic volume is divided into a plurality of discrete grids; the method comprises the steps of obtaining the 3D heat map space coordinates of each key point in each discrete grid, and regressing the 3D heat map space coordinates to obtain the three-dimensional human body posture, so that the accuracy of three-dimensional human body posture estimation can be improved, reasoning search spaces of other human body key points are reduced, errors of the three-dimensional human body posture estimation are reduced, the reconstruction quality of the posture is improved, the calculation cost is reduced, the influence of quantization errors is avoided, the accuracy of the three-dimensional human body posture estimation is improved, and the scheme is simple and reliable to implement.
Correspondingly, the invention further provides a three-dimensional human body posture estimation device.
Referring to fig. 11, fig. 11 is a functional block diagram of a three-dimensional human body posture estimation device according to a first embodiment of the present invention.
In a first embodiment of the three-dimensional body posture estimation device of the present invention, the three-dimensional body posture estimation device includes:
the fusion module 10 is configured to generate a target heat map corresponding to the input image by using a multi-view fusion network, and perform matching fusion on the human body center point heat map information of the target heat map to obtain fusion information.
And a projection module 20, configured to project the fusion information to a 3D space to obtain a three-dimensional feature volume.
And the posture estimation module 30 is used for estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
The fusion module 10 is further configured to input the input image into a high-resolution network of the multiview fusion network, and acquire high-resolution feature information; constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module; fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map; matching and fusing the human body central point heat map information of the target heat map to obtain fused information.
The fusion module 10 is further configured to fuse feature maps of different resolutions at different stages in the multi-resolution module to obtain a fusion feature map; inputting the fusion characteristic graph into a deconvolution module, obtaining an output result through convolution and channel conversion, and performing dimensionality splicing on the output result and the fusion characteristic graph to obtain a splicing characteristic; and improving the resolution ratio of the splicing features according to the deconvolution layer, extracting target feature information of the splicing features with improved resolution ratio through the residual error unit, and generating a target heat map according to the target feature information.
The fusion module 10 is further configured to use a preset key point between hip joints of the human body in the target heat map as a central point of the human body; and matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information.
The fusion module 10 is further configured to sample epipolar lines corresponding to the central points of the graphs in the target heat map according to the human central points of the multiple views, so as to obtain a corresponding point set; generating a probability region of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set; fusing values of all points on the epipolar line in the probability region through a full-connection layer to obtain a finally fused central point coordinate; and carrying out coordinate matching fusion on the human body central point heat map information of the target heat map according to the central point coordinate to obtain fusion information.
The projection module 20 is further configured to acquire camera calibration data of a video camera, and project each voxel center in the fusion information into a camera view by using the camera calibration data to obtain camera view projection data; and constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
The pose estimation module 30 is further configured to divide the three-dimensional feature volume into a plurality of discrete grids; and acquiring the 3D heat map space coordinate of each key point in each discrete grid, and regressing the 3D heat map space coordinate to obtain the three-dimensional human body posture.
The steps implemented by the functional modules of the three-dimensional human body posture estimation device can refer to the embodiments of the three-dimensional human body posture estimation method of the present invention, and are not described herein again.
In addition, an embodiment of the present invention further provides a storage medium, where a three-dimensional body posture estimation program is stored on the storage medium, and when executed by a processor, the three-dimensional body posture estimation program implements the following operations:
generating a target heat map corresponding to an input image by adopting a multi-view fusion network, and matching and fusing human body central point heat map information of the target heat map to obtain fusion information;
projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume;
and estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
Further, the three-dimensional human body posture estimation program when executed by the processor further realizes the following operations:
inputting an input image into a high-resolution network of a multi-view fusion network to acquire high-resolution characteristic information;
constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module;
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map;
matching and fusing the human body central point heat map information of the target heat map to obtain fused information.
Further, the three-dimensional human body posture estimation program when executed by the processor further realizes the following operations:
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a fused feature map;
inputting the fusion characteristic graph into a deconvolution module, obtaining an output result through convolution and channel conversion, and performing dimensionality splicing on the output result and the fusion characteristic graph to obtain a splicing characteristic;
and improving the resolution ratio of the splicing features according to the deconvolution layer, extracting target feature information of the splicing features with improved resolution ratio through the residual error unit, and generating a target heat map according to the target feature information.
Further, the three-dimensional human body posture estimation program further realizes the following operations when being executed by the processor:
taking preset key points between hip joints of the human body in the target heat map as central points of the human body;
and matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information.
Further, the three-dimensional human body posture estimation program when executed by the processor further realizes the following operations:
sampling polar lines corresponding to the central points of the graphs in the target heat map according to the central points of the human body of the multiple views to obtain a corresponding point set;
generating a probability region of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set;
fusing values of all points on the epipolar line in the probability region through a full-connection layer to obtain a finally fused central point coordinate;
and carrying out coordinate matching fusion on the human body central point heat map information of the target heat map according to the central point coordinate to obtain fusion information.
Further, the three-dimensional human body posture estimation program further realizes the following operations when being executed by the processor:
acquiring camera calibration data of a video camera, and projecting each voxel center in the fusion information into a camera view by using the camera calibration data to obtain camera view projection data;
and constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
Further, the three-dimensional human body posture estimation program further realizes the following operations when being executed by the processor:
dividing the three-dimensional feature volume into a plurality of discrete grids;
and acquiring the 3D heat map space coordinate of each key point in each discrete grid, and regressing the 3D heat map space coordinate to obtain the three-dimensional human body posture.
According to the scheme, a multi-view fusion network is adopted to generate a target heat map corresponding to an input image, and matching fusion is carried out on human body central point heat map information of the target heat map to obtain fusion information; projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume; the three-dimensional human body posture is estimated according to the three-dimensional characteristic volume, the accuracy of three-dimensional human body posture estimation can be improved, inference search spaces of other human body key points are reduced, errors of three-dimensional human body posture estimation are reduced, posture reconstruction quality is improved, calculation cost is reduced, quantization error influence is avoided, the accuracy of three-dimensional human body posture estimation is improved, the scheme is simple and reliable to implement, the method can be suitable for three-dimensional human body posture estimation of most scenes, and the speed and the efficiency of three-dimensional human body posture estimation are improved.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element identified by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A three-dimensional human body posture estimation method is characterized by comprising the following steps:
generating a target heat map corresponding to an input image by adopting a multi-view fusion network, and matching and fusing human body center point heat map information of the target heat map to obtain fusion information;
projecting the fusion information into a 3D space to obtain a three-dimensional feature volume;
and estimating the three-dimensional human body posture according to the three-dimensional feature volume.
2. The method for estimating the three-dimensional human body pose according to claim 1, wherein the generating a target heat map corresponding to the input image by using a multi-view fusion network, and performing matching fusion on the human body center point heat map information of the target heat map to obtain fusion information comprises:
inputting an input image into a high-resolution network of the multi-view fusion network to acquire high-resolution feature information;
constructing a residual unit of the high-resolution network according to the high-resolution feature information, and performing convolution sampling on the residual unit to obtain a multi-resolution module;
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map;
matching and fusing the human body center point heat map information of the target heat map to obtain fusion information.
3. The method for estimating the three-dimensional human body pose according to claim 2, wherein the fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain the target heat map comprises:
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a fused feature map;
inputting the fused feature map into a deconvolution module, obtaining an output result through convolution and channel conversion, and concatenating the output result with the fused feature map along the channel dimension to obtain a concatenated feature;
and increasing the resolution of the concatenated feature through the deconvolution layer, extracting target feature information from the higher-resolution concatenated feature through the residual unit, and generating the target heat map according to the target feature information.
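The concatenate-then-upsample head of claim 3 can be sketched at the shape level. Nearest-neighbour upsampling stands in for the learned deconvolution layer, a random matrix stands in for convolution plus channel conversion, and no refinement is learned; all names and sizes are hypothetical:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour upsampling standing in for a learned deconvolution layer.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def deconv_head(fused_feat, num_joints=17):
    """Shape-level sketch of the deconvolution module: channel conversion,
    channel-wise concatenation with the input, then resolution doubling."""
    C = fused_feat.shape[-1]
    w = np.random.rand(C, num_joints)                   # stands in for conv + channel conversion
    out = fused_feat @ w                                # H x W x num_joints
    concat = np.concatenate([out, fused_feat], axis=-1) # concatenation along channels
    up = upsample2x(concat)                             # resolution increased
    return up[..., :num_joints]                         # a residual unit would refine this

feat = np.random.rand(32, 32, 48)   # hypothetical fused feature map
hm = deconv_head(feat)              # 64 x 64 x 17 target heat map
```

The point of the concatenation step is that the converted output keeps access to the original fused features before the resolution is raised.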
4. The method for estimating the three-dimensional human body pose according to claim 2, wherein the matching and fusing the human body center point heat map information of the target heat map to obtain fused information comprises:
taking a preset key point between the hip joints of the human body in the target heat map as the human body center point;
and matching and fusing the human body center point heat map information of the target heat maps according to the human body center points of the multiple views to obtain fusion information.
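The center point of claim 4 reduces to a midpoint computation. The key-point indices below are hypothetical (COCO-style hips at 11 and 12); the patent only requires a preset key point between the hip joints:

```python
import numpy as np

def body_center(keypoints_2d, left_hip=11, right_hip=12):
    """Human body center point as the midpoint of the two hip-joint
    key points (indices are an assumption, not fixed by the patent)."""
    return (keypoints_2d[left_hip] + keypoints_2d[right_hip]) / 2.0

kps = np.zeros((17, 2))
kps[11] = [100.0, 200.0]
kps[12] = [140.0, 208.0]
center = body_center(kps)   # -> [120., 204.]
```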
5. The method according to claim 4, wherein said matching and fusing the human body center point heat map information of the target heat map according to the multi-view human body center point to obtain fused information comprises:
sampling the epipolar lines corresponding to the center points in the target heat maps according to the human body center points of the multiple views to obtain a corresponding point set;
generating a Gaussian-distributed probability region of the target heat map at the corresponding coordinates according to the corresponding point set;
fusing the values of all points on the epipolar line within the probability region through a fully connected layer to obtain final fused center point coordinates;
and performing coordinate matching and fusion on the human body center point heat map information of the target heat maps according to the center point coordinates to obtain fusion information.
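The epipolar fusion of claim 5 can be sketched with a fixed Gaussian weighting in place of the learned fully connected layer; the line sampling, sigma, and coordinates below are illustrative assumptions:

```python
import numpy as np

def fuse_center_on_epipolar(heatmap, line_pts, mu, sigma=2.0):
    """Sample heat-map values at points on an epipolar line, weight them by a
    Gaussian probability region centred at the expected coordinate mu, and
    return the weighted mean as the fused center point. A learned fully
    connected layer would replace this fixed weighting in the claimed method."""
    vals = np.array([heatmap[int(y), int(x)] for x, y in line_pts])
    d2 = np.sum((line_pts - mu) ** 2, axis=1)        # squared distance to mu
    w = vals * np.exp(-d2 / (2.0 * sigma ** 2))      # Gaussian-weighted responses
    w = w / (w.sum() + 1e-9)
    return (w[:, None] * line_pts).sum(axis=0)       # fused (x, y) center point

hm = np.ones((64, 64))                               # uniform stand-in heat map
line = np.stack([np.linspace(10, 50, 41),
                 np.full(41, 20.0)], axis=1)         # horizontal epipolar line
center = fuse_center_on_epipolar(hm, line, mu=np.array([30.0, 20.0]))
```

With a uniform heat map and a line symmetric about mu, the fused center falls at mu itself; on a real heat map the response values shift it toward the strongest evidence on the line.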
6. The three-dimensional human pose estimation method of claim 1, wherein said projecting the fusion information into a 3D space to obtain a three-dimensional feature volume comprises:
acquiring camera calibration data of each camera, and projecting each voxel center in the fusion information into the camera views using the camera calibration data to obtain camera view projection data;
and constructing the three-dimensional feature volume from the camera view projection data using a 3D CNN.
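The voxel-center projection of claim 6 is the standard pinhole model. The intrinsics and extrinsics below are made-up calibration values for illustration; the 3D CNN that consumes the projected features is omitted:

```python
import numpy as np

def project_voxels(voxel_centers, K, R, t):
    """Project voxel centers (world coordinates, N x 3) into a camera view
    using calibration data: intrinsics K and extrinsics R, t."""
    cam = voxel_centers @ R.T + t      # world -> camera coordinates
    uvw = cam @ K.T                    # camera -> homogeneous image plane
    return uvw[:, :2] / uvw[:, 2:3]    # pixel coordinates (u, v)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])        # hypothetical intrinsics
R, t = np.eye(3), np.zeros(3)          # identity extrinsics for the sketch
centers = np.array([[0.0, 0.0, 2.0]])  # one voxel center 2 m in front
uv = project_voxels(centers, K, R, t)  # -> [[320., 240.]]
```

A point on the optical axis projects to the principal point, which is a quick sanity check on any calibration setup.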
7. The method of estimating a three-dimensional body pose according to claim 1, wherein said estimating a three-dimensional body pose from said three-dimensional feature volume comprises:
dividing the three-dimensional feature volume into a plurality of discrete grids;
and acquiring the 3D heat map spatial coordinates of each key point in each discrete grid, and regressing the 3D heat map spatial coordinates to obtain the three-dimensional human body posture.
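The regression of claim 7 can be sketched as an integral ("soft-argmax") over the discrete grid: normalize the per-joint 3D heat map into a probability distribution and take the expectation of the grid-cell coordinates. The 4 x 4 x 4 grid and its metric layout are illustrative assumptions:

```python
import numpy as np

def soft_argmax_3d(heat3d, grid_coords):
    """Turn a per-joint 3D heat map over discrete grid cells into a
    continuous spatial coordinate via a softmax-weighted expectation."""
    p = np.exp(heat3d - heat3d.max())            # numerically stable softmax
    p = p / p.sum()
    return (p[..., None] * grid_coords).reshape(-1, 3).sum(axis=0)

# Cell-centre coordinates of a hypothetical 4 x 4 x 4 grid, in metres.
xs = np.linspace(0, 3, 4)
gx, gy, gz = np.meshgrid(xs, xs, xs, indexing="ij")
coords = np.stack([gx, gy, gz], axis=-1)

hm3d = np.full((4, 4, 4), -1e9)
hm3d[1, 2, 3] = 0.0                              # sharp peak at cell (1, 2, 3)
joint = soft_argmax_3d(hm3d, coords)             # -> approx [1., 2., 3.]
```

Because the expectation interpolates between cell centres, this style of regression avoids the hard quantization error of picking a single voxel, which matches the effect the description attributes to the scheme.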
8. A three-dimensional body posture estimation device, characterized by comprising:
the fusion module is used for generating a target heat map corresponding to the input image by adopting a multi-view fusion network, and matching and fusing the human body center point heat map information of the target heat map to obtain fusion information;
the projection module is used for projecting the fusion information into a 3D space to obtain a three-dimensional feature volume;
and the posture estimation module is used for estimating the three-dimensional human body posture according to the three-dimensional feature volume.
9. A three-dimensional body posture estimation device characterized by comprising: a memory, a processor, and a three-dimensional body pose estimation program stored on the memory and executable on the processor, the three-dimensional body pose estimation program configured to implement the steps of the three-dimensional body pose estimation method of any of claims 1 to 7.
10. A storage medium having stored thereon a three-dimensional body pose estimation program, which when executed by a processor, implements the steps of the three-dimensional body pose estimation method of any of claims 1 to 7.
CN202210956640.4A 2022-08-10 2022-08-10 Three-dimensional human body posture estimation method, device, equipment and storage medium Active CN115035551B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210956640.4A CN115035551B (en) 2022-08-10 2022-08-10 Three-dimensional human body posture estimation method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115035551A true CN115035551A (en) 2022-09-09
CN115035551B CN115035551B (en) 2022-12-02

Family

ID=83130421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210956640.4A Active CN115035551B (en) 2022-08-10 2022-08-10 Three-dimensional human body posture estimation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115035551B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113643366A (en) * 2021-07-12 2021-11-12 中国科学院自动化研究所 Multi-view three-dimensional object attitude estimation method and device
CN114548224A (en) * 2022-01-19 2022-05-27 南京邮电大学 2D human body pose generation method and device for strong interaction human body motion
CN114613001A (en) * 2022-01-28 2022-06-10 厦门理工学院 3D human body posture estimation method based on high-quality heat map in multiple views
CN114627491A (en) * 2021-12-28 2022-06-14 浙江工商大学 Single three-dimensional attitude estimation method based on polar line convergence
CN114758205A (en) * 2022-04-24 2022-07-15 湖南大学 Multi-view feature fusion method and system for 3D human body posture estimation


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BOWEN CHENG et al.: "HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation", CVPR 2020 *
HAIBO QIU et al.: "Cross View Fusion for 3D Human Pose Estimation", arXiv:1909.01203v1 [cs.CV] *
HANYUE TU et al.: "VoxelPose: Towards Multi-Camera 3D Human Pose Estimation in Wild Environment", arXiv:2004.06239v4 [cs.CV] *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665309A (en) * 2023-07-26 2023-08-29 山东睿芯半导体科技有限公司 Method, device, chip and terminal for identifying walking gesture features
CN116665309B (en) * 2023-07-26 2023-11-14 山东睿芯半导体科技有限公司 Method, device, chip and terminal for identifying walking gesture features

Also Published As

Publication number Publication date
CN115035551B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
US11145078B2 (en) Depth information determining method and related apparatus
Walch et al. Image-based localization using lstms for structured feature correlation
JP7177062B2 (en) Depth Prediction from Image Data Using Statistical Model
US10334168B2 (en) Threshold determination in a RANSAC algorithm
CN110705574B (en) Positioning method and device, equipment and storage medium
JP2018520425A (en) 3D space modeling
WO2017110836A1 (en) Method and system for fusing sensed measurements
KR20210025942A (en) Method for stereo matching usiing end-to-end convolutional neural network
CN113762147B (en) Facial expression migration method and device, electronic equipment and storage medium
CN109063549B (en) High-resolution aerial video moving target detection method based on deep neural network
CN113643366B (en) Multi-view three-dimensional object attitude estimation method and device
CN115035235A (en) Three-dimensional reconstruction method and device
CN111709984B (en) Pose depth prediction method, visual odometer device, pose depth prediction equipment and visual odometer medium
CN115035551B (en) Three-dimensional human body posture estimation method, device, equipment and storage medium
Zhou et al. PADENet: An efficient and robust panoramic monocular depth estimation network for outdoor scenes
CN112288813B (en) Pose estimation method based on multi-view vision measurement and laser point cloud map matching
CN113902802A (en) Visual positioning method and related device, electronic equipment and storage medium
KR20220014678A (en) Method and apparatus for estimating depth of images
CN111696167A (en) Single image super-resolution reconstruction method guided by self-example learning
CN109741245B (en) Plane information insertion method and device
Harisankar et al. Unsupervised depth estimation from monocular images for autonomous vehicles
CN114494612A (en) Method, device and equipment for constructing point cloud map
JP4675368B2 (en) Object position estimation apparatus, object position estimation method, object position estimation program, and recording medium recording the program
Ikehata et al. Depth map inpainting and super-resolution based on internal statistics of geometry and appearance
Cho et al. Depth map up-sampling using cost-volume filtering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Hu Bo

Inventor after: Hu Shizhuo

Inventor after: Zhou Bin

Inventor after: Shen Zhengang

Inventor after: Li Yanhong

Inventor before: Hu Bo

Inventor before: Hu Shizhuo

Inventor before: Zhou Bin

Inventor before: Shen Zhengang

Inventor before: Li Yanhong

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method, device, equipment, and storage medium for three-dimensional human posture estimation

Granted publication date: 20221202

Pledgee: Guanggu Branch of Wuhan Rural Commercial Bank Co.,Ltd.

Pledgor: WUHAN ETAH INFORMATION TECHNOLOGY Co.,Ltd.

Registration number: Y2024980009498
