Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The solution of the embodiments of the invention is mainly as follows: a multi-view fusion network generates a target heat map corresponding to an input image, and the human body central point heat map information of the target heat map is matched and fused to obtain fusion information; the fusion information is projected into 3D space to obtain a three-dimensional feature volume; and the three-dimensional human body posture is estimated from the three-dimensional feature volume. This improves the accuracy of three-dimensional human body posture estimation, narrows the inference search space for the remaining human body key points, reduces estimation error, improves pose reconstruction quality, lowers computational cost, and avoids the influence of quantization error. The scheme is simple and reliable to implement, is applicable to three-dimensional human body posture estimation in most scenes, and improves the speed and efficiency of estimation, thereby solving the prior-art problems that inaccurate two-dimensional pose estimation strongly degrades the reconstruction quality of the 3D pose, and that direct regression incurs high computational cost and large error.
Referring to fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the apparatus may include: a processor 1001 such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 enables communication between these components. The user interface 1003 may include a display screen and an input unit such as a keyboard, and may optionally also include a standard wired interface or a wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001 described previously.
Those skilled in the art will appreciate that the configuration of the device shown in fig. 1 is not intended to be limiting of the device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 1005, which is a kind of storage medium, may include therein an operating system, a network communication module, a user interface module, and a three-dimensional human body posture estimation program.
The apparatus of the present invention calls a three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001 and performs the following operations:
generating a target heat map corresponding to an input image by adopting a multi-view fusion network, and matching and fusing human body central point heat map information of the target heat map to obtain fusion information;
projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume;
and estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
inputting an input image into a high-resolution network of a multi-view fusion network to acquire high-resolution characteristic information;
constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module;
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map;
matching and fusing the human body central point heat map information of the target heat map to obtain fused information.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a fused feature map;
inputting the fusion characteristic graph into a deconvolution module, obtaining an output result through convolution and channel conversion, and performing dimensionality splicing on the output result and the fusion characteristic graph to obtain a splicing characteristic;
and improving the resolution ratio of the splicing features according to the deconvolution layer, extracting target feature information of the splicing features with improved resolution ratio through the residual error unit, and generating a target heat map according to the target feature information.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
taking a preset key point between hip joints of the human body in the target heat map as a central point of the human body;
and matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
sampling polar lines corresponding to central points of all the graphs in the target heat map according to the central points of the human body in multiple views to obtain a corresponding point set;
generating a probability region of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set;
fusing values of all points on the epipolar line in the probability region through a full-connection layer to obtain a finally fused central point coordinate;
and carrying out coordinate matching fusion on the human body central point heat map information of the target heat map according to the central point coordinate to obtain fusion information.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
acquiring camera calibration data of a video camera, and projecting each voxel center in the fusion information into a camera view by using the camera calibration data to obtain camera view projection data;
and constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 by the processor 1001, and also performs the following operations:
dividing the three-dimensional feature volume into a plurality of discrete grids;
and acquiring the 3D heat map space coordinate of each key point in each discrete grid, and regressing the 3D heat map space coordinate to obtain the three-dimensional human body posture.
According to the scheme, a multi-view fusion network generates a target heat map corresponding to the input image, and the human body central point heat map information of the target heat map is matched and fused to obtain fusion information; the fusion information is projected into 3D space to obtain a three-dimensional feature volume; and the three-dimensional human body posture is estimated from the three-dimensional feature volume. This improves the accuracy of three-dimensional human body posture estimation, narrows the inference search space for the remaining human body key points, reduces estimation error, improves pose reconstruction quality, lowers computational cost, and avoids the influence of quantization error; the scheme is simple and reliable to implement, is applicable to most scenes, and improves the speed and efficiency of three-dimensional human body posture estimation.
Based on the hardware structure, the embodiment of the three-dimensional human body posture estimation method is provided.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for estimating a three-dimensional human body pose according to a first embodiment of the present invention.
In a first embodiment, the three-dimensional human body posture estimation method comprises the following steps:
and S10, generating a target heat map corresponding to the input image by adopting a multi-view fusion network, and matching and fusing the human body central point heat map information of the target heat map to obtain fusion information.
It should be noted that a multi-view fusion network (MVFNet), built on the high-resolution network HRNet, obtains the target heat map corresponding to the input image, and matches and fuses the human body central point heat map information of the target heat map to obtain the fused heat map information.
And S20, projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume.
It will be appreciated that, by projecting the fusion information into 3D space, a three-dimensional feature volume can be constructed from coarse to fine.
And S30, estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
It should be appreciated that an accurate three-dimensional body pose can be estimated from the three-dimensional feature volumes.
According to the scheme, a multi-view fusion network generates a target heat map corresponding to the input image, and the human body central point heat map information of the target heat map is matched and fused to obtain fusion information; the fusion information is projected into 3D space to obtain a three-dimensional feature volume; and the three-dimensional human body posture is estimated from the three-dimensional feature volume. This improves the accuracy of three-dimensional human body posture estimation, narrows the inference search space for the remaining human body key points, reduces estimation error, improves pose reconstruction quality, lowers computational cost, and avoids the influence of quantization error; the scheme is simple and reliable to implement, is applicable to most scenes, and improves the speed and efficiency of three-dimensional human body posture estimation.
Further, fig. 3 is a schematic flow chart of a second embodiment of the three-dimensional human body posture estimation method of the present invention, and as shown in fig. 3, the second embodiment of the three-dimensional human body posture estimation method of the present invention is proposed based on the first embodiment, and in this embodiment, the step S10 specifically includes the following steps:
and S11, inputting the input image into a high-resolution network of the multi-view fusion network to acquire high-resolution characteristic information.
It should be noted that, when the input image is input into the high-resolution network of the multi-view fusion network, the high-resolution feature information can be obtained.
In a specific implementation, earlier networks such as U-Net, SegNet, and Hourglass obtain high-resolution feature information by downsampling a high-resolution feature map to low resolution and then restoring the high resolution, thereby realizing multi-scale feature extraction. In such architectures the high-resolution features come mainly from two sources: first, the original high-resolution features, which undergo only a small number of convolution operations and therefore provide only low-level semantic expression; second, high-resolution features recovered through downsampling and upsampling, where repeated up- and down-sampling loses a large amount of effective feature information. HRNet instead keeps a high-resolution branch throughout while gradually introducing parallel lower-resolution convolution branches, and connects the convolutions of different resolutions in parallel for information interaction, so that every feature from high to low resolution repeatedly receives information from the other parallel sub-networks. In this way strong semantic information and accurate position information are obtained simultaneously.
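The parallel-branch information exchange described above can be illustrated with a minimal NumPy sketch, in which nearest-neighbour upsampling and strided subsampling stand in for HRNet's learned exchange convolutions (all shapes and values are illustrative assumptions):

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Nearest-neighbour upsampling used when fusing a low-resolution
    branch into a higher-resolution one."""
    return x.repeat(factor, axis=-2).repeat(factor, axis=-1)

def downsample_stride(x, factor=2):
    """Strided subsampling standing in for a stride-2 convolution when
    fusing a high-resolution branch into a lower-resolution one."""
    return x[..., ::factor, ::factor]

def exchange(high, low):
    """One HRNet-style exchange: each parallel branch receives
    information from the other branch at its own resolution."""
    new_high = high + upsample_nearest(low)
    new_low = low + downsample_stride(high)
    return new_high, new_low

high = np.ones((1, 32, 32))          # high-resolution feature map
low = np.full((1, 16, 16), 2.0)      # low-resolution feature map
h, l = exchange(high, low)
print(h.shape, l.shape)  # (1, 32, 32) (1, 16, 16)
```

After the exchange, every position in each branch has accumulated information from both resolutions, which is the mechanism the text credits for HRNet's combination of strong semantics and accurate positions.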
And S12, constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module.
It can be understood that a residual unit of the high-resolution network can be constructed through the high-resolution feature information, and a multi-resolution module can be obtained by performing convolution sampling on the residual unit.
In a specific implementation, the multi-view fusion network MVFNet of this embodiment may use HRNet as the basic framework and add a deconvolution module to obtain a heat map with higher resolution and richer semantic information. As shown in fig. 4, which depicts the key point detection network structure in the three-dimensional human body posture estimation method of the present invention, the network is divided into four stages whose main body is four parallel sub-networks: the high-resolution sub-network forms the first stage, sub-networks from high to low resolution are gradually added, and the multi-resolution sub-networks are connected in parallel. The first stage comprises 4 residual units, each, as in ResNet-50, composed of a Bottleneck block with 64 channels; it is then downsampled to the second stage by a 3 × 3 convolution with stride 2. The second, third, and fourth stages contain 1, 4, and 3 multi-resolution blocks respectively, so that the network keeps a certain depth and fully extracts feature information. Each multi-resolution block has 4 residual units, adopting the BasicBlock of ResNet, i.e., two 3 × 3 convolutions.
And S13, fusing the feature maps with different resolutions in each stage in the multi-resolution module to obtain a target heat map.
It should be understood that the target heat map can be obtained by fusing the feature maps of different resolutions at different stages in the multi-resolution module.
Further, the step S13 specifically includes the following steps:
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a fused feature map;
inputting the fusion characteristic graph into a deconvolution module, obtaining an output result through convolution and channel conversion, and performing dimensionality splicing on the output result and the fusion characteristic graph to obtain a splicing characteristic;
and improving the resolution ratio of the splicing features according to the deconvolution layer, extracting target feature information of the splicing features with improved resolution ratio through the residual error unit, and generating a target heat map according to the target feature information.
It can be understood that a fused feature map is obtained by fusing the feature maps of different resolutions at each stage in the multi-resolution module; channel conversion is performed in the deconvolution module, and after the output result is dimensionally spliced with the fused feature map, the spliced features are obtained. The resolution of the spliced features is then raised by the deconvolution layer, and the target feature information of the higher-resolution spliced features is extracted through the residual unit, thereby generating the target heat map.
In a specific implementation, the feature maps of different resolutions at each stage are fused at the end of the network, and the fused feature map serves as the input of the deconvolution module. Channel conversion is carried out by convolution, the result is dimensionally spliced with the input features, a deconvolution with a 4 × 4 kernel doubles the resolution of the feature map, feature information is further extracted by 4 residual blocks, and finally the heat map is predicted by a 1 × 1 convolution. The higher resolution helps obtain richer key point information, enabling accurate three-dimensional human body posture estimation.
And S14, matching and fusing the human body central point heat map information of the target heat map to obtain fused information.
It can be understood that the fused information can be obtained by matching and fusing the human body central point heat map information of the target heat map.
According to the scheme, the high-resolution characteristic information is acquired by inputting the input image into the high-resolution network of the multi-view fusion network; constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module; fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map; matching and fusing the human body central point heat map information of the target heat map to obtain fused information, so that the fused information can be accurately obtained, the accuracy of three-dimensional human body posture estimation is improved, reasoning and searching spaces of other human body key points are reduced, and the error of three-dimensional human body posture estimation is reduced.
Further, fig. 5 is a schematic flow chart of a third embodiment of the three-dimensional human body posture estimation method of the present invention, and as shown in fig. 5, the third embodiment of the three-dimensional human body posture estimation method of the present invention is proposed based on the second embodiment, and in this embodiment, the step S14 specifically includes the following steps:
and step S141, taking preset key points between hip joints of the human body in the target heat map as central points of the human body.
It should be noted that the preset key point between the hip joints of the human body in the target heat map may be taken as the central point of the human body.
It can be understood that an epipolar geometric relationship exists between the multi-view images: it describes the intrinsic projective relationship between two views, is independent of the external scene, and depends only on the camera intrinsic parameters and the relative pose between the views. Making full use of the epipolar geometric relationship helps the network acquire more position information, eliminates irrelevant noise during training, and improves the accuracy of network prediction. The principle is shown in fig. 6, a schematic view of the epipolar geometry in the three-dimensional human body posture estimation method of the present invention. Referring to fig. 6, O1 and O2 are the optical centers of the two cameras, I1 and I2 are the image planes, and e1 and e2, the projections of each camera's optical center on the other image plane, are called the epipoles; if the two cameras cannot see each other owing to their viewing angles, the epipole does not appear on the imaging plane. The observed point P projects to P1 on I1 and P2 on I2. Since the depth information is unknown, P may lie at different points along the ray O1P1; this ray projects onto the right image as a line L2, called the epipolar line corresponding to the point P1, so the point P2 corresponding to P1 in the right image must lie on the epipolar line L2. The relative position of the matching points is thus constrained by the geometric relationship of the image plane space, and this constraint can be expressed by the fundamental matrix. The epipolar constraint is shown in formula (1):

P2^T F P1 = 0    (1)

where F is the fundamental matrix, whose calculation formula is shown in (2):

F = K2^(-T) E K1^(-1),  E = [t]x R    (2)

where K1 and K2 are the intrinsic parameter matrices of the two cameras, and E is the essential matrix composed of the camera extrinsics: the translation (as the cross-product matrix [t]x) and the rotation matrix R. The inter-view epipolar geometric constraint relationship can therefore be fully exploited.
In a specific implementation, a multi-view epipolar constraint model is introduced into the MVFNet network provided by this embodiment. The key point between the hip joints of the human body is taken as the central point, and heat map matching fusion of the multi-view human body central points is performed: the high-resolution heat maps are input, the epipolar line corresponding to the central point of each view is solved from the epipolar geometric constraint relationship, and sampling along it yields a set of corresponding points. According to the characteristics of the heat map, a Gaussian-distributed probability region is generated at the corresponding coordinates; only the neighbourhood of the corresponding point has high response while other places are close to 0, so a fully-connected layer can fuse the values of all points on the epipolar line and improve the accuracy of central point detection. Finally, an L2 loss compares the difference between the fused central point coordinate and the annotated central point coordinate to constrain training.
And S142, matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information.
It can be understood that the human body central point heat map information of the target heat map can be matched and fused through the human body central points of the multiple views to obtain the fused heat map information.
Further, the step S142 specifically includes the following steps:
sampling polar lines corresponding to the central points of the graphs in the target heat map according to the central points of the human body of the multiple views to obtain a corresponding point set;
generating a probability region of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set;
fusing values of all points on the epipolar line in the probability region through a full connection layer to obtain a final fused central point coordinate;
and carrying out coordinate matching fusion on the human body central point heat map information of the target heat map according to the central point coordinate to obtain fusion information.
It can be understood that the epipolar lines corresponding to the central points of the views in the target heat map are sampled to obtain the corresponding point sets; Gaussian-distributed probability regions are then generated, and the values of all points on the epipolar line within the probability region are fused through the fully-connected layer to obtain the finally fused central point coordinates, with which heat map coordinate matching is performed to obtain the fusion information.
In a specific implementation, as shown in fig. 7, which is a schematic diagram of the multi-view epipolar constraint model in the three-dimensional human body posture estimation method of the present invention, the high-resolution heat maps are input, the epipolar line corresponding to the central point of each view is solved from the epipolar geometric constraint relationship, and sampling along it yields a set of corresponding points. According to the characteristics of the heat map, a Gaussian-distributed probability region is generated at the corresponding coordinates; only the neighbourhood of the corresponding point has high response while other places are close to 0, so a fully-connected layer can fuse the values of all points on the epipolar line and improve the accuracy of central point detection. Finally, an L2 loss compares the difference between the fused central point coordinate and the annotated central point coordinate to constrain training.
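The sampling-and-fusion step just described can be sketched in NumPy. Here a heat-map-weighted average stands in for the learned fully-connected fusion layer, and the synthetic heat map and line coefficients are illustrative assumptions:

```python
import numpy as np

H, W = 64, 64
# synthetic central-point heat map: Gaussian peak at (x=40, y=22)
gy, gx = np.mgrid[0:H, 0:W]
heatmap = np.exp(-((gx - 40) ** 2 + (gy - 22) ** 2) / (2 * 2.0 ** 2))

def sample_epipolar(heatmap, line, n=64):
    """Sample heat map values at n points along the epipolar line
    a*x + b*y + c = 0 (b != 0 assumed for this sketch)."""
    a, b, c = line
    xs = np.linspace(0, heatmap.shape[1] - 1, n)
    ys = -(a * xs + c) / b
    pts, vals = [], []
    for x, y in zip(xs, ys):
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < heatmap.shape[0]:
            pts.append((xi, yi))
            vals.append(heatmap[yi, xi])
    return np.array(pts, dtype=float), np.array(vals)

# a horizontal epipolar line passing through the true centre row y = 22
pts, vals = sample_epipolar(heatmap, line=(0.0, 1.0, -22.0))
# fuse: heat-map-weighted average stands in for the learned FC fusion
centre = (vals[:, None] * pts).sum(axis=0) / vals.sum()
print(centre)  # close to the true centre point (40, 22)
```

Because the heat map is near zero everywhere except around the true correspondence, the fusion over the epipolar line recovers the centre point while ignoring off-line noise, which is the effect the fully-connected fusion layer is trained to achieve.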
According to the scheme, the preset key points between hip joints of the human body in the target heat map are used as the central points of the human body; matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multi-view to obtain fused information, so that the fused information can be accurately obtained, the accuracy of three-dimensional human body posture estimation is improved, reasoning and searching spaces of other human body key points are reduced, and the error of three-dimensional human body posture estimation is reduced.
Further, fig. 8 is a schematic flowchart of a fourth embodiment of the three-dimensional human body posture estimation method of the present invention, and as shown in fig. 8, the fourth embodiment of the three-dimensional human body posture estimation method of the present invention is proposed based on the first embodiment, in this embodiment, the step S20 specifically includes the following steps:
and S21, acquiring camera calibration data of a video camera, and projecting each voxel center in the fusion information to a camera view by using the camera calibration data to obtain camera view projection data.
It should be noted that after the camera calibration data of the video camera is obtained, the camera calibration data may be used to project each voxel center in the fusion information into the camera view, so as to obtain the camera view projection data.
It can be understood that the features of all the views are aggregated into a 3D voxel volume by inverse image projection: a voxel grid covering the whole space observed by the cameras is initialized, the centre of each voxel is projected into every camera view using the camera calibration data, and the 3D CNN network then builds the feature volume from coarse to fine around these centres to estimate the positions of all key points.
In a specific implementation, referring to fig. 9, which shows the 3D CNN network structure in the three-dimensional human body posture estimation method of the present invention, the input of the 3D CNN network is a 3D feature volume constructed by projecting the 2D heat maps of all camera views into a common 3D space. Because the heat maps encode the position information of the central point, the resulting 3D feature volume also carries rich information for detecting the 3D pose, and the search region of the other key points in 3D space can be reduced according to human body prior information. In fig. 9, black open arrows represent standard 3D convolutional layers, black solid arrows represent residual blocks of two 3D convolutional layers, linear arrows are pooling, and dashed arrows are deconvolution. The three-dimensional space is discretized into an X × Y × Z grid of discrete locations, each of which can be regarded as an anchor of a detected person. To reduce the quantization error, X, Y, and Z are adjusted so that the distance between adjacent anchors shrinks; on common datasets the space is typically 8 m × 8 m × 2 m, so X, Y, and Z are set to 80, 80, and 20 respectively.
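The discretization just described can be sketched as follows; the space origin at (0, 0, 0), millimetre units, and function names are illustrative assumptions:

```python
import numpy as np

def voxel_centers(space_size=(8000.0, 8000.0, 2000.0),
                  grid=(80, 80, 20), origin=(0.0, 0.0, 0.0)):
    """Discretize the capture space (mm) into an X x Y x Z grid of
    anchors and return the centre of every voxel, shape (X*Y*Z, 3)."""
    axes = [o + (np.arange(n) + 0.5) * s / n
            for s, n, o in zip(space_size, grid, origin)]
    ax, ay, az = np.meshgrid(*axes, indexing="ij")
    return np.stack([ax, ay, az], axis=-1).reshape(-1, 3)

centers = voxel_centers()
print(centers.shape)             # (128000, 3) anchors
print(centers[0], centers[-1])   # first and last voxel centres
```

With an 8 m × 8 m × 2 m space and an 80 × 80 × 20 grid, adjacent anchors are 100 mm apart along every axis, which bounds the quantization error of the coarse person localization.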
And S22, constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
It will be appreciated that, by having the 3D CNN network build feature volumes from coarse to fine around these centres to estimate the locations of all key points, a three-dimensional feature volume can be constructed from the camera view projection data.
In a specific implementation, the 2D heat map values at the projection position of each anchor in every camera view are fused to compute the feature vector of that anchor. Let the 2D heat map in view a be denoted H_a, where K is the number of body key points. For each anchor position g, its projected position in view a is p_a, and the heat map value there is H_a(p_a). The feature vector of the anchor is then computed as the average heat map value over all camera views, as shown in formula (3):

F(g) = (1/V) * sum_{a=1}^{V} H_a(p_a)    (3)

where V is the number of cameras; it can be seen that F(g) actually encodes the likelihood of the K key points appearing at g. A 3D bounding box is then used to represent the position of the detected human body key points; the size and orientation of the bounding box are fixed in the experiments, a reasonable simplification because human size variation in 3D space is limited. A small network slides over the feature volume F; each sliding window, centred at an anchor, is mapped to a low-dimensional feature that is fed to a fully-connected layer, which regresses a confidence as the output of the 3D CNN network, indicating the likelihood of a person appearing at that location. The ground-truth (GT) heat map value of each anchor is computed from the distance between the anchor and the GT pose: for each pair of GT and anchor, a Gaussian score is computed that decreases exponentially as the distance increases. If there are N people in the scene, an anchor may have multiple scores; the N largest, representing the N person positions, are retained through non-maximum suppression (NMS).
According to the scheme, the camera calibration data of the video camera is acquired, and each voxel centre in the fusion information is projected into the camera views using the calibration data to obtain the camera view projection data; the three-dimensional feature volume is then constructed from the camera view projection data using the 3D CNN network. This improves the accuracy of three-dimensional human body posture estimation, narrows the inference search space for the other human body key points, reduces estimation error, improves pose reconstruction quality, lowers computational cost, and avoids the influence of quantization error; the scheme is simple and reliable to implement.
Further, fig. 10 is a schematic flowchart of a fifth embodiment of the three-dimensional human body posture estimation method according to the present invention, and as shown in fig. 10, the fifth embodiment of the three-dimensional human body posture estimation method according to the present invention is proposed based on the first embodiment, in this embodiment, the step S30 specifically includes the following steps:
and S31, dividing the three-dimensional characteristic volume into a plurality of discrete grids.
It should be noted that after the three-dimensional feature volume is divided, a plurality of discrete grids can be obtained.
In a specific implementation, the first 3D CNN network cannot accurately estimate the 3D positions of all key points, so a finer-grained feature volume is constructed in the second 3D CNN network. The size of this feature volume is set to 2000mm × 2000mm × 2000mm, which is much smaller than the 8m × 8m × 2m motion space but is sufficient to cover any pose of a single person. The volume is divided into X0 = Y0 = Z0 = 64 discrete grids, and the network body structure is the same as that of the first 3D CNN.
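The construction of this finer-grained 64³ grid around a detected person center can be sketched as follows; this is a NumPy sketch and the helper name `fine_grid` is ours, not from the scheme.

```python
import numpy as np

def fine_grid(center, size=2000.0, bins=64):
    """Build the centers of a bins x bins x bins voxel grid of side
    `size` (mm) around a detected person center, as in the
    second-stage feature volume described in the text."""
    step = size / bins
    axes = [center[d] - size / 2 + (np.arange(bins) + 0.5) * step
            for d in range(3)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1)  # (bins, bins, bins, 3)
```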
And S32, acquiring the 3D heat map space coordinates of each key point in each discrete grid, and performing regression on the 3D heat map space coordinates to obtain the three-dimensional human body posture.
It should be understood that, further, the 3D heat map space coordinates of each key point in each discrete grid are obtained, and then the 3D heat map space coordinates may be regressed, so that the three-dimensional human body posture may be obtained.
It will be appreciated that the 3D heat map of each key point k is estimated based on the constructed feature volume, and the accurate three-dimensional human body posture is finally regressed. The centroid of each key point is computed according to equation (4) as the heat-map-weighted sum of the grid-cell centers:

J_k = Σ_i H_k(p_i) · p_i, (4)

where p_i is the center of the i-th discrete grid cell and H_k is the normalized 3D heat map of key point k. The estimated joint positions J_k are compared with the true positions J_k* to train the network; the loss function L is represented by formula (5):

L = Σ_k || J_k − J_k* ||_1. (5)
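A minimal sketch of the centroid regression of equation (4) and an L1-style training loss of the kind described by formula (5) is given below; the softmax normalization of the heat map is one common choice and is an assumption here, not necessarily the exact normalization used by the scheme.

```python
import numpy as np

def soft_argmax_3d(heatmap, grid_centers):
    """Regress a 3D joint position as the heat-map-weighted centroid
    of the grid-cell centers (the computation of equation (4)).

    heatmap: (X, Y, Z) raw scores; grid_centers: (X, Y, Z, 3) in mm.
    """
    w = np.exp(heatmap - heatmap.max())
    w /= w.sum()                                 # softmax normalization (assumed)
    return (w[..., None] * grid_centers).reshape(-1, 3).sum(0)

def l1_pose_loss(pred, gt):
    """L1 loss over all key points, in the spirit of formula (5)."""
    return np.abs(pred - gt).sum()
```

Because the centroid is a differentiable function of the heat map, the loss can be back-propagated through the feature volume, avoiding the quantization error of a hard arg-max over discrete grids.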
In a specific implementation, the accuracy of the 3D pose on the Campus and Shelf datasets is evaluated using PCP3D (Percentage of Correct Parts 3D): a detection is considered correct if the distance between the predicted joint position and the true joint position is less than half of the limb length. For the CMU-Panoptic dataset, MPJPE (Mean Per Joint Position Error) is taken as the main evaluation index; it measures the positioning accuracy of the 3D joints in millimeters as the distance between the GT and the predicted joint positions. For each frame f and human skeleton S, MPJPE is computed according to equation (6):

MPJPE = (1 / N_S) · Σ_{i=1..N_S} || p_i − p̂_i ||, (6)

where N_S is the number of joints in the skeleton S, p_i is the true position of joint i, and p̂_i is its prediction; for a set of frames, the error is the average of the MPJPE over all frames. Meanwhile, Average Precision (AP) and Recall are taken as performance indexes for comprehensively evaluating 3D human body center detection and human body posture estimation over MPJPE thresholds from 25mm to 150mm with a step of 25mm. AP is defined as the area under the PR curve formed by Recall (abscissa) and Precision (ordinate); the larger the value of AP, the better the overall performance of the detection model.
According to the scheme, the three-dimensional characteristic volume is divided into a plurality of discrete grids; the method comprises the steps of obtaining the 3D heat map space coordinates of each key point in each discrete grid, and regressing the 3D heat map space coordinates to obtain the three-dimensional human body posture, so that the accuracy of three-dimensional human body posture estimation can be improved, reasoning search spaces of other human body key points are reduced, errors of the three-dimensional human body posture estimation are reduced, the reconstruction quality of the posture is improved, the calculation cost is reduced, the influence of quantization errors is avoided, the accuracy of the three-dimensional human body posture estimation is improved, and the scheme is simple and reliable to implement.
Correspondingly, the invention further provides a three-dimensional human body posture estimation device.
Referring to fig. 11, fig. 11 is a functional block diagram of a three-dimensional human body posture estimation device according to a first embodiment of the present invention.
In a first embodiment of the three-dimensional human body posture estimation device of the present invention, the three-dimensional human body posture estimation device includes:
the fusion module 10 is configured to generate a target heat map corresponding to the input image by using a multi-view fusion network, and perform matching fusion on the human body center point heat map information of the target heat map to obtain fusion information.
And the projection module 20 is configured to project the fusion information to a 3D space to obtain a three-dimensional feature volume.
And the posture estimation module 30 is used for estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
The fusion module 10 is further configured to input the input image into a high-resolution network of the multiview fusion network, and acquire high-resolution feature information; constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module; fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map; matching and fusing the human body central point heat map information of the target heat map to obtain fused information.
The fusion module 10 is further configured to fuse feature maps of different resolutions at different stages in the multi-resolution module to obtain a fusion feature map; inputting the fusion characteristic graph into a deconvolution module, obtaining an output result through convolution and channel conversion, and performing dimensionality splicing on the output result and the fusion characteristic graph to obtain a splicing characteristic; and improving the resolution ratio of the splicing features according to the deconvolution layer, extracting target feature information of the splicing features with improved resolution ratio through the residual error unit, and generating a target heat map according to the target feature information.
The fusion module 10 is further configured to use a preset key point between hip joints of the human body in the target heat map as a central point of the human body; and matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information.
The fusion module 10 is further configured to sample epipolar lines corresponding to the central points of the graphs in the target heat map according to the human central points of the multiple views, so as to obtain a corresponding point set; generating a probability region of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set; fusing values of all points on the epipolar line in the probability region through a full-connection layer to obtain a finally fused central point coordinate; and carrying out coordinate matching fusion on the human body central point heat map information of the target heat map according to the central point coordinate to obtain fusion information.
The projection module 20 is further configured to acquire camera calibration data of a video camera, and project each voxel center in the fusion information into a camera view by using the camera calibration data to obtain camera view projection data; and constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
The pose estimation module 30 is further configured to divide the three-dimensional feature volume into a plurality of discrete grids; and acquiring the 3D heat map space coordinate of each key point in each discrete grid, and regressing the 3D heat map space coordinate to obtain the three-dimensional human body posture.
The steps implemented by each functional module of the three-dimensional human body posture estimation device can refer to each embodiment of the three-dimensional human body posture estimation method, and are not described herein again.
In addition, an embodiment of the present invention further provides a storage medium, where a three-dimensional human body posture estimation program is stored on the storage medium, and when executed by a processor, the three-dimensional human body posture estimation program implements the following operations:
generating a target heat map corresponding to an input image by adopting a multi-view fusion network, and matching and fusing human body center point heat map information of the target heat map to obtain fusion information;
projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume;
and estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
Further, the three-dimensional human body posture estimation program further realizes the following operations when being executed by the processor:
inputting an input image into a high-resolution network of a multi-view fusion network to acquire high-resolution characteristic information;
constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module;
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map;
matching and fusing the human body central point heat map information of the target heat map to obtain fused information.
Further, the three-dimensional human body posture estimation program when executed by the processor further realizes the following operations:
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a fused feature map;
inputting the fusion characteristic graph into a deconvolution module, obtaining an output result through convolution and channel conversion, and performing dimensionality splicing on the output result and the fusion characteristic graph to obtain a splicing characteristic;
and improving the resolution ratio of the splicing features according to the deconvolution layer, extracting target feature information of the splicing features with improved resolution ratio through the residual error unit, and generating a target heat map according to the target feature information.
Further, the three-dimensional human body posture estimation program when executed by the processor further realizes the following operations:
taking preset key points between hip joints of the human body in the target heat map as central points of the human body;
and matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information.
Further, the three-dimensional human body posture estimation program when executed by the processor further realizes the following operations:
sampling polar lines corresponding to the central points of the graphs in the target heat map according to the central points of the human body of the multiple views to obtain a corresponding point set;
generating a probability region of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set;
fusing values of all points on the epipolar line in the probability region through a full-connection layer to obtain a finally fused central point coordinate;
and carrying out coordinate matching fusion on the human body central point heat map information of the target heat map according to the central point coordinate to obtain fusion information.
Further, the three-dimensional human body posture estimation program when executed by the processor further realizes the following operations:
acquiring camera calibration data of a video camera, and projecting each voxel center in the fusion information into a camera view by using the camera calibration data to obtain camera view projection data;
and constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
Further, the three-dimensional human body posture estimation program when executed by the processor further realizes the following operations:
dividing the three-dimensional feature volume into a plurality of discrete grids;
and acquiring the 3D heat map space coordinate of each key point in each discrete grid, and regressing the 3D heat map space coordinate to obtain the three-dimensional human body posture.
According to the scheme, a multi-view fusion network is adopted to generate a target heat map corresponding to an input image, and matching fusion is carried out on human body central point heat map information of the target heat map to obtain fusion information; projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume; the three-dimensional human body posture is estimated according to the three-dimensional characteristic volume, the accuracy of three-dimensional human body posture estimation can be improved, reasoning search spaces of other human body key points are reduced, errors of three-dimensional human body posture estimation are reduced, the posture reconstruction quality is improved, the calculation cost is reduced, the influence of quantization errors is avoided, the accuracy of three-dimensional human body posture estimation is improved, the scheme is simple and reliable to implement, the method can be suitable for three-dimensional human body posture estimation of most scenes, and the speed and the efficiency of three-dimensional human body posture estimation are improved.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.