CN115035551B - Three-dimensional human body posture estimation method, device, equipment and storage medium - Google Patents


Publication number
CN115035551B
CN115035551B
Authority
CN
China
Prior art keywords
human body
heat map
dimensional
information
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210956640.4A
Other languages
Chinese (zh)
Other versions
CN115035551A (en)
Inventor
胡波
胡世卓
周斌
沈振冈
李艳红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Etah Information Technology Co ltd
Original Assignee
Wuhan Etah Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Etah Information Technology Co ltd filed Critical Wuhan Etah Information Technology Co ltd
Priority to CN202210956640.4A priority Critical patent/CN115035551B/en
Publication of CN115035551A publication Critical patent/CN115035551A/en
Application granted granted Critical
Publication of CN115035551B publication Critical patent/CN115035551B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a three-dimensional human body posture estimation method, device, equipment and storage medium. The method comprises: generating a target heat map corresponding to an input image by adopting a multi-view fusion network, and matching and fusing the human body center point heat map information of the target heat map to obtain fusion information; projecting the fusion information into 3D space to obtain a three-dimensional feature volume; and estimating the three-dimensional human body posture from the three-dimensional feature volume. The method improves the accuracy of three-dimensional human body posture estimation, narrows the inference search space for the remaining human body key points, reduces estimation error, improves posture reconstruction quality, lowers computational cost, and avoids the influence of quantization error. The scheme is simple and reliable to implement, is applicable to three-dimensional human body posture estimation in most scenes, and improves the speed and efficiency of estimation.

Description

Three-dimensional human body posture estimation method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of multi-view fusion, in particular to a three-dimensional human body posture estimation method, device, equipment and storage medium.
Background
In recent years, three-dimensional human posture estimation research based on multi-view matching has fallen mainly into two categories: multi-stage methods that lift two-dimensional estimates to three dimensions, and direct-regression methods. Two-dimensional-to-three-dimensional approaches first estimate the 2D keypoints of the same person in each view and then lift the matched 2D single-view poses into 3D space; for example, some extend the 2D pictorial structure model to a 3D pictorial structure model to encode the pairwise relations between body joint positions, while others first solve multi-person 2D pose detection, perform association across the camera views, and then recover the 3D pose by triangulation. These methods are effective in specific scenes, but they depend on the 2D detection results: inaccurate two-dimensional pose estimation strongly degrades the reconstruction quality of the 3D pose, especially when occlusion is present.
Direct-regression methods are also called end-to-end methods: because a deep neural network can fit complex functions, such methods usually need no auxiliary algorithms or intermediate data and can directly predict three-dimensional pose coordinates with a regression network. The VoxelPose model, for example, constructs a discretized 3D feature volume from multi-view features; instead of estimating the 2D pose in each view independently, it directly projects the obtained 2D heat maps into 3D space for inference. However, the computational cost of searching for key points over the whole space grows rapidly as the space is divided more finely, and the result is also affected by the quantization error introduced by spatial discretization.
Disclosure of Invention
The invention mainly aims to provide a three-dimensional human body posture estimation method, device, equipment and storage medium, so as to solve the technical problems in the prior art that inaccurate two-dimensional posture estimation strongly degrades the reconstruction quality of the 3D posture, and that direct-regression approaches incur high computational cost and large error.
In a first aspect, the present invention provides a three-dimensional human body posture estimation method, including the following steps:
generating a target heat map corresponding to an input image by adopting a multi-view fusion network, and matching and fusing human body center point heat map information of the target heat map to obtain fusion information;
projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume;
and estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
Optionally, the generating a target heat map corresponding to the input image by using a multi-view fusion network, and performing matching fusion on the human body center point heat map information of the target heat map to obtain fusion information includes:
inputting an input image into a high-resolution network of a multi-view fusion network to acquire high-resolution characteristic information;
constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module;
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map;
matching and fusing the human body central point heat map information of the target heat map to obtain fused information.
Optionally, the fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain the target heat map includes:
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a fused feature map;
inputting the fused feature map into a deconvolution module, obtaining an output result through convolution and channel conversion, and concatenating the output result with the fused feature map along the channel dimension to obtain a concatenated feature;
and increasing the resolution of the concatenated feature with a deconvolution layer, extracting target feature information from the higher-resolution concatenated feature through the residual unit, and generating a target heat map from the target feature information.
Optionally, the matching and fusing the human body central point heat map information of the target heat map to obtain fused information includes:
taking a preset key point between the hip joints of the human body in the target heat map as the human body center point;
matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information.
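As a minimal illustration of this center-point convention (a sketch assuming a hypothetical COCO-style joint ordering in which indices 11 and 12 are the left and right hips; the patent itself does not fix a particular index order), the center point can be computed as the midpoint of the two hip keypoints:

```python
def body_center(keypoints, l_hip=11, r_hip=12):
    """Return the midpoint between the two hip keypoints.

    keypoints: list of (x, y) pixel coordinates, one per detected joint.
    l_hip / r_hip: hypothetical indices of the left/right hip joints.
    """
    lx, ly = keypoints[l_hip]
    rx, ry = keypoints[r_hip]
    return ((lx + rx) / 2.0, (ly + ry) / 2.0)
```

Any joint indexing can be passed in explicitly; only the midpoint rule comes from the description above.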
Optionally, the matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information includes:
sampling, according to the human body center points of the multiple views, the epipolar line corresponding to each view's center point in the target heat map to obtain a corresponding point set;
generating probability regions of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set;
fusing the values of all points on the epipolar line within the probability region through a fully connected layer to obtain the finally fused center point coordinates;
and carrying out coordinate matching fusion on the human body center point heat map information of the target heat map according to the center point coordinates to obtain fusion information.
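The epipolar sampling step above can be sketched as follows. Assuming a known fundamental matrix F between two calibrated views (how F is obtained is outside this passage), the center point x in one view maps to the epipolar line l = F·x in the other view; candidate points are sampled along that line, and each candidate can be scored by a Gaussian heat map value around a detected coordinate. The function names and the unnormalized Gaussian are illustrative choices, not the patent's exact formulation:

```python
import math

def epipolar_line(F, pt):
    """Epipolar line l = F @ x for a homogeneous 2D point x = (u, v, 1).

    F: 3x3 fundamental matrix (nested lists) relating view 1 to view 2.
    Returns (a, b, c) such that a*u' + b*v' + c = 0 for matches in view 2.
    """
    u, v = pt
    x = (u, v, 1.0)
    return tuple(sum(F[i][j] * x[j] for j in range(3)) for i in range(3))

def sample_line(line, us):
    """Sample points (u, v) on the line a*u + b*v + c = 0 at the given u values."""
    a, b, c = line
    return [(u, -(a * u + c) / b) for u in us]

def gaussian_weight(pt, center, sigma=2.0):
    """Value of an (unnormalized) 2D Gaussian heat map at pt, centered at center."""
    du, dv = pt[0] - center[0], pt[1] - center[1]
    return math.exp(-(du * du + dv * dv) / (2.0 * sigma * sigma))
```

In the patent's pipeline the sampled values are then combined by a fully connected layer; here they could simply be weighted and averaged for illustration.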
Optionally, the projecting the fusion information to a 3D space to obtain a three-dimensional feature volume includes:
acquiring camera calibration data of a video camera, and projecting each voxel center in the fusion information into a camera view by using the camera calibration data to obtain camera view projection data;
and constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
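The voxel-center projection can be illustrated with the standard pinhole model p ~ K(RX + t), where the intrinsic matrix K, rotation R and translation t come from the camera calibration data; this is a minimal sketch of the geometry, not the patent's implementation:

```python
def project_voxel(K, R, t, X):
    """Project a 3D voxel center X = (x, y, z) into pixel coordinates.

    K: 3x3 intrinsic matrix; R: 3x3 rotation; t: translation (length 3).
    Returns (u, v) from the pinhole model p ~ K (R X + t).
    """
    # Camera-frame coordinates: Xc = R X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Homogeneous image coordinates: p = K Xc, then divide by depth
    p = [sum(K[i][j] * Xc[j] for j in range(3)) for i in range(3)]
    return (p[0] / p[2], p[1] / p[2])
```

Repeating this for every voxel center and every calibrated camera yields the per-view sampling locations from which the 3D feature volume is filled.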
Optionally, the estimating a three-dimensional human body posture according to the three-dimensional feature volume includes:
dividing the three-dimensional feature volume into a plurality of discrete grids;
and acquiring the 3D heat map space coordinate of each key point in each discrete grid, and regressing the 3D heat map space coordinate to obtain the three-dimensional human body posture.
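One common way to regress a coordinate from a discretized 3D heat map, consistent with the step above but with the normalization (a softmax over voxels) chosen here purely for illustration, is a soft-argmax: the probability-weighted mean of the voxel-center coordinates.

```python
import math

def soft_argmax_3d(heat, coords):
    """Expected 3D coordinate of one key point from a voxelized heat map.

    heat:   nested [X][Y][Z] list of raw scores.
    coords: parallel nested list of (x, y, z) coordinates of voxel centers.
    Applies a softmax over all voxels, then takes the weighted mean coordinate.
    """
    flat = [(heat[i][j][k], coords[i][j][k])
            for i in range(len(heat))
            for j in range(len(heat[i]))
            for k in range(len(heat[i][j]))]
    m = max(s for s, _ in flat)               # subtract max for numerical stability
    ws = [math.exp(s - m) for s, _ in flat]
    z = sum(ws)
    return tuple(sum(w * pt[1][d] for w, pt in zip(ws, flat)) / z
                 for d in range(3))
```

Because the result is an expectation rather than a hard grid-cell index, it is not restricted to the discrete grid, which is one way such regression mitigates quantization error.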
In a second aspect, to achieve the above object, the present invention further provides a three-dimensional body posture estimation device, including:
the fusion module is used for generating a target heat map corresponding to the input image by adopting a multi-view fusion network, and matching and fusing the human body central point heat map information of the target heat map to obtain fusion information;
the projection module is used for projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume;
and the posture estimation module is used for estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
In a third aspect, to achieve the above object, the present invention further provides a three-dimensional human body posture estimation device, including: a memory, a processor and a three-dimensional body pose estimation program stored on the memory and executable on the processor, the three-dimensional body pose estimation program configured to implement the steps of the three-dimensional body pose estimation method as described above.
In a fourth aspect, to achieve the above object, the present invention further provides a storage medium having a three-dimensional body posture estimation program stored thereon, wherein the three-dimensional body posture estimation program, when executed by a processor, implements the steps of the three-dimensional body posture estimation method as described above.
The invention provides a three-dimensional human body posture estimation method, which comprises: generating a target heat map corresponding to an input image by adopting a multi-view fusion network, and matching and fusing the human body center point heat map information of the target heat map to obtain fusion information; projecting the fusion information into 3D space to obtain a three-dimensional feature volume; and estimating the three-dimensional human body posture from the three-dimensional feature volume. The method improves the accuracy of three-dimensional human body posture estimation, narrows the inference search space for the remaining human body key points, reduces estimation error, improves posture reconstruction quality, lowers computational cost, and avoids the influence of quantization error; the scheme is simple and reliable to implement, is applicable to three-dimensional human body posture estimation in most scenes, and improves the speed and efficiency of estimation.
Drawings
FIG. 1 is a schematic diagram of an apparatus architecture of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for estimating a three-dimensional human body pose according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for estimating a three-dimensional human body pose according to a second embodiment of the present invention;
FIG. 4 is a schematic diagram of the key point detection network structure in the three-dimensional human body pose estimation method of the present invention;
FIG. 5 is a flowchart illustrating a method for estimating a three-dimensional human body pose according to a third embodiment of the present invention;
FIG. 6 is a schematic view of the epipolar geometry in the three-dimensional human body posture estimation method of the present invention;
FIG. 7 is a schematic diagram of a multi-view epipolar constraint model in the three-dimensional human body pose estimation method of the present invention;
FIG. 8 is a flowchart illustrating a method for estimating a three-dimensional human body pose according to a fourth embodiment of the present invention;
FIG. 9 is a schematic diagram of a 3D CNN network structure in the three-dimensional human body posture estimation method according to the present invention;
FIG. 10 is a flowchart illustrating a fifth embodiment of a three-dimensional human body posture estimation method according to the present invention;
FIG. 11 is a functional block diagram of a three-dimensional human body posture estimation device according to a first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The solution of the embodiment of the invention is mainly as follows: a multi-view fusion network generates a target heat map corresponding to an input image, and the human body center point heat map information of the target heat map is matched and fused to obtain fusion information; the fusion information is projected into 3D space to obtain a three-dimensional feature volume; and the three-dimensional human body posture is estimated from the three-dimensional feature volume. This improves the accuracy of three-dimensional human body posture estimation, narrows the inference search space for the remaining human body key points, reduces estimation error, improves posture reconstruction quality, lowers computational cost, and avoids the influence of quantization error; the scheme is simple and reliable to implement, is applicable to three-dimensional human body posture estimation in most scenes, and improves the speed and efficiency of estimation. It thereby solves the technical problems in the prior art that inaccurate two-dimensional posture estimation strongly degrades the reconstruction quality of the 3D posture, and that direct regression incurs high computational cost and large error.
Referring to fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the apparatus may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001 described previously.
Those skilled in the art will appreciate that the configuration of the device shown in fig. 1 is not intended to be limiting of the device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 1005, which is a kind of storage medium, may include therein an operating system, a network communication module, a user interface module, and a three-dimensional human body posture estimation program.
The apparatus of the present invention calls a three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001 and performs the following operations:
generating a target heat map corresponding to an input image by adopting a multi-view fusion network, and matching and fusing human body central point heat map information of the target heat map to obtain fusion information;
projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume;
and estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
inputting an input image into a high-resolution network of a multi-view fusion network to acquire high-resolution characteristic information;
constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module;
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map;
matching and fusing the human body central point heat map information of the target heat map to obtain fused information.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a fused feature map;
inputting the fused feature map into a deconvolution module, obtaining an output result through convolution and channel conversion, and concatenating the output result with the fused feature map along the channel dimension to obtain a concatenated feature;
and increasing the resolution of the concatenated feature with a deconvolution layer, extracting target feature information from the higher-resolution concatenated feature through the residual unit, and generating a target heat map from the target feature information.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
taking a preset key point between hip joints of the human body in the target heat map as a central point of the human body;
and matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
sampling, according to the human body center points of the multiple views, the epipolar line corresponding to each view's center point in the target heat map to obtain a corresponding point set;
generating probability regions of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set;
fusing the values of all points on the epipolar line within the probability region through a fully connected layer to obtain the finally fused center point coordinates;
and carrying out coordinate matching fusion on the human body center point heat map information of the target heat map according to the center point coordinates to obtain fusion information.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 through the processor 1001, and also performs the following operations:
acquiring camera calibration data of a video camera, and projecting each voxel center in the fusion information into a camera view by using the camera calibration data to obtain camera view projection data;
and constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
The apparatus of the present invention calls the three-dimensional human body posture estimation program stored in the memory 1005 by the processor 1001, and also performs the following operations:
dividing the three-dimensional feature volume into a plurality of discrete grids;
and acquiring the 3D heat map space coordinate of each key point in each discrete grid, and regressing the 3D heat map space coordinate to obtain the three-dimensional human body posture.
According to the scheme, a multi-view fusion network generates a target heat map corresponding to the input image, and the human body center point heat map information of the target heat map is matched and fused to obtain fusion information; the fusion information is projected into 3D space to obtain a three-dimensional feature volume; and the three-dimensional human body posture is estimated from the three-dimensional feature volume. This improves the accuracy of three-dimensional human body posture estimation, narrows the inference search space for the remaining human body key points, reduces estimation error, improves posture reconstruction quality, lowers computational cost, and avoids the influence of quantization error; the scheme is simple and reliable to implement, is applicable to three-dimensional human body posture estimation in most scenes, and improves the speed and efficiency of estimation.
Based on the hardware structure, the embodiment of the three-dimensional human body posture estimation method is provided.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for estimating a three-dimensional human body pose according to a first embodiment of the present invention.
In a first embodiment, the three-dimensional human body posture estimation method comprises the following steps:
and S10, generating a target heat map corresponding to the input image by adopting a multi-view fusion network, and matching and fusing the human body central point heat map information of the target heat map to obtain fusion information.
It should be noted that, through a Multi-View Fusion Network (MVFNet) built on the high-resolution network HRNet, a target heat map corresponding to the input image can be obtained, and the human body center point heat map information of the target heat map is matched and fused to obtain the fused heat map information.
And S20, projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume.
It will be appreciated that projecting the fusion information into 3D space enables a three-dimensional feature volume to be obtained by building the volume from coarse to fine.
And S30, estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
It should be appreciated that an accurate three-dimensional body pose can be estimated from the three-dimensional feature volumes.
According to the scheme, a multi-view fusion network generates a target heat map corresponding to the input image, and the human body center point heat map information of the target heat map is matched and fused to obtain fusion information; the fusion information is projected into 3D space to obtain a three-dimensional feature volume; and the three-dimensional human body posture is estimated from the three-dimensional feature volume. This improves the accuracy of three-dimensional human body posture estimation, narrows the inference search space for the remaining human body key points, reduces estimation error, improves posture reconstruction quality, lowers computational cost, and avoids the influence of quantization error; the scheme is simple and reliable to implement, is applicable to three-dimensional human body posture estimation in most scenes, and improves the speed and efficiency of estimation.
Further, fig. 3 is a schematic flow chart of a second embodiment of the three-dimensional human body posture estimation method of the present invention, and as shown in fig. 3, the second embodiment of the three-dimensional human body posture estimation method of the present invention is proposed based on the first embodiment, and in this embodiment, the step S10 specifically includes the following steps:
and S11, inputting the input image into a high-resolution network of the multi-view fusion network to acquire high-resolution characteristic information.
It should be noted that, when the input image is input into the high-resolution network of the multiview fusion network, the high-resolution feature information can be obtained.
In a specific implementation, to obtain high-resolution feature information, earlier networks such as U-Net, SegNet and Hourglass downsample a high-resolution feature map to low resolution and then restore the high resolution to realize multi-scale feature extraction. In such architectures the high-resolution features come mainly from two sources: first, the original high-resolution features, which pass through only a small number of convolutions and therefore provide only low-level semantic expression; second, high-resolution features recovered by downsampling and then upsampling, although repeated up- and down-sampling loses a large amount of effective feature information. HRNet instead keeps a high-resolution branch throughout while gradually introducing lower-resolution convolutions in parallel branches, and connects the convolutions of different resolutions in parallel for information interaction, so that every feature map from high to low resolution repeatedly receives information from the other parallel sub-networks; in this way it obtains both strong semantic information and accurate position information.
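The cross-resolution information exchange described above can be illustrated with a toy fusion step: a low-resolution map is upsampled by nearest-neighbor repetition and added element-wise to the high-resolution map. HRNet actually performs this exchange with learned convolutions; the code below is a deliberately simplified stand-in:

```python
def upsample_nearest(lowres, factor):
    """Nearest-neighbor upsample a 2D map (list of lists) by an integer factor."""
    out = []
    for row in lowres:
        wide = [v for v in row for _ in range(factor)]  # repeat each column
        out.extend([list(wide) for _ in range(factor)])  # repeat each row
    return out

def fuse(highres, lowres):
    """Element-wise sum of a high-res map with an upsampled low-res map."""
    factor = len(highres) // len(lowres)
    up = upsample_nearest(lowres, factor)
    return [[h + u for h, u in zip(hr, ur)] for hr, ur in zip(highres, up)]
```

The same pattern, applied in both directions (upsampling low-resolution branches, strided downsampling of high-resolution ones), is what lets every branch repeatedly receive information from the others.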
And S12, constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module.
It can be understood that a residual unit of the high-resolution network can be constructed through the high-resolution feature information, and a multi-resolution module can be obtained by performing convolution sampling on the residual unit.
In a specific implementation, this embodiment may use HRNet as the basic framework of the multi-view fusion network MVFNet and add a deconvolution module to obtain a heatmap with higher resolution and richer semantic information. As shown in fig. 4, which is the key point detection network structure in the three-dimensional human body posture estimation method of the present invention, the network is divided into four stages, the main body being four parallel sub-networks: the high-resolution sub-network serves as the first stage, sub-networks from high resolution to low resolution are gradually added, and the multi-resolution sub-networks are connected in parallel. The first stage comprises 4 residual units; each residual unit, the same as in ResNet-50, is composed of a Bottleneck with 64 channels, and is then downsampled to the second stage by a 3×3 convolution with stride 2. The second, third and fourth stages comprise 1, 4 and 3 multi-resolution blocks respectively, so that the network keeps a certain depth and fully extracts the feature information; each multi-resolution block has 4 residual units, adopting the BasicBlock of ResNet, i.e., two 3×3 convolutions.
And S13, fusing the feature maps with different resolutions in each stage in the multi-resolution module to obtain a target heat map.
It should be understood that the target heat map can be obtained by fusing the feature maps of different resolutions at different stages in the multi-resolution module.
Further, the step S13 specifically includes the following steps:
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a fused feature map;
inputting the fusion characteristic graph into a deconvolution module, obtaining an output result through convolution and channel conversion, and performing dimensionality splicing on the output result and the fusion characteristic graph to obtain a splicing characteristic;
and improving the resolution ratio of the splicing features according to the deconvolution layer, extracting target feature information of the splicing features with improved resolution ratio through the residual error unit, and generating a target heat map according to the target feature information.
It can be understood that a fused feature map can be obtained by fusing feature maps of different resolutions at each stage in the multi-resolution module, channel conversion is performed in the deconvolution module, and after dimension splicing is performed on an output result and the fused feature map, spliced features can be obtained, so that the resolution of the spliced features can be improved according to the deconvolution layer, and target feature information of the spliced features after resolution improvement is extracted through the residual error unit, thereby generating a target heat map.
In the specific implementation, the feature maps of different resolutions from each stage are fused at the end of the network, and the fused feature map is used as the input of the deconvolution module: channel conversion is performed by convolution, the result is concatenated with the input features along the channel dimension, the resolution of the feature map is raised to twice the original by a deconvolution with a 4×4 kernel, feature information is further extracted by 4 residual blocks, and finally the heatmap is predicted by a 1×1 convolution. The higher resolution is beneficial for obtaining richer key point information, and thus for accurate three-dimensional human body posture estimation.
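The final 1×1 convolution that predicts the heatmap is exactly a per-pixel linear map over channels, which the following sketch makes explicit; the channel count C, keypoint count K and spatial size are assumptions for illustration, not values from the network:

```python
import numpy as np

C, K, H, W = 48, 17, 64, 64          # hypothetical channel/keypoint/spatial sizes
rng = np.random.default_rng(0)
features = rng.random((C, H, W))     # fused feature map entering the 1x1 conv
w = rng.random((K, C)) / C           # weights of the 1x1 convolution

# A 1x1 convolution mixes channels independently at every pixel, producing
# one heatmap per keypoint.
heatmaps = np.einsum('kc,chw->khw', w, features)
print(heatmaps.shape)  # → (17, 64, 64)

# Each keypoint coordinate can then be read off as the argmax of its heatmap.
y, x = np.unravel_index(np.argmax(heatmaps[0]), (H, W))
```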
And S14, matching and fusing the human body central point heat map information of the target heat map to obtain fused information.
It can be understood that the fused information can be obtained by matching and fusing the human body central point heat map information of the target heat map.
According to the scheme, the high-resolution characteristic information is acquired by inputting the input image into the high-resolution network of the multi-view fusion network; constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module; fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map; matching and fusing the human body central point heat map information of the target heat map to obtain fused information, so that the fused information can be accurately obtained, the accuracy of three-dimensional human body posture estimation is improved, reasoning and searching spaces of other human body key points are reduced, and the error of three-dimensional human body posture estimation is reduced.
Further, fig. 5 is a schematic flow chart of a third embodiment of the three-dimensional human body posture estimation method of the present invention, and as shown in fig. 5, the third embodiment of the three-dimensional human body posture estimation method of the present invention is proposed based on the second embodiment, and in this embodiment, the step S14 specifically includes the following steps:
and step S141, taking preset key points between hip joints of the human body in the target heat map as central points of the human body.
It should be noted that the preset key point between the hip joints of the human body in the target heat map may be used as the central point of the human body.
It can be understood that an epipolar geometric relationship exists between the multi-view images; it describes the intrinsic projective relationship between two views, independent of the external scene and depending only on the camera intrinsic parameters and the relative pose between the views. Making full use of the epipolar geometric relationship helps the network acquire more position information, eliminates irrelevant noise in the training process, and improves the accuracy of network prediction. The principle is shown in fig. 6, which is a schematic view of epipolar geometry in the three-dimensional human body posture estimation method of the present invention. Referring to fig. 6, $O_1$ and $O_2$ are the optical centers of the two cameras, $I_1$ and $I_2$ are the image planes, and $e_1$ and $e_2$, the projections of each camera's optical center onto the opposite image plane, are called epipoles; if the two cameras cannot image each other because of the viewing angle, the epipole does not appear on the imaging plane. The observed point $P$ projects onto $I_1$ and $I_2$ at $P_1$ and $P_2$. Since the depth information is unknown, $P$ may lie at any point on the ray $O_1 P_1$; this ray projects onto a line $L_2$ on the right image, called the epipolar line corresponding to point $P_1$, so the point $P_2$ corresponding to $P_1$ in the right image must lie on the epipolar line $L_2$. The relative position of the matching points is thus constrained by the geometric relationship of the image plane space, and this constraint can be expressed by the fundamental matrix; according to the literature, the epipolar constraint is shown in formula (1):

$$P_2^{\top} F P_1 = 0 \qquad (1)$$

where $F$ is the fundamental matrix, whose calculation formula is shown in (2):

$$F = K_2^{-\top} E K_1^{-1} \qquad (2)$$

where $K_1$ and $K_2$ are the intrinsic parameter matrices of the two cameras, and $E$ is the essential matrix, comprising the extrinsic translation matrix and rotation matrix of the cameras. Therefore, the inter-view geometric constraint relationship can be fully utilized.
In a specific implementation, a multi-view epipolar constraint model is introduced into the MVFNet network provided by this embodiment: the key point between the hip joints of the human body is taken as the central point, and heatmap matching and fusion of the multi-view human body central points is performed. A high-resolution heatmap is input, the epipolar lines corresponding to the central points of each view are solved from the epipolar geometric constraint relationship, and sampling is performed to obtain a set of corresponding points. According to the characteristics of the heatmap, a probability area with a Gaussian distribution is generated at the corresponding coordinates: there is a high response only near the corresponding point and values close to 0 elsewhere, so the values of all points on the epipolar line can be fused using a fully-connected layer, improving the accuracy of central point detection. Finally, an L2 loss comparing the fused center point coordinates with the annotated center point coordinates is used for the training constraint.
And S142, matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information.
It can be understood that the human body center point heat map information of the target heat map can be matched and fused through the human body center points of the multiple views, obtaining the matched and fused heatmap information.
Further, the step S142 specifically includes the following steps:
sampling polar lines corresponding to the central points of the graphs in the target heat map according to the central points of the human body of the multiple views to obtain a corresponding point set;
generating a probability region of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set;
fusing values of all points on the epipolar line in the probability region through a full connection layer to obtain a final fused central point coordinate;
and carrying out coordinate matching fusion on the human body central point heat map information of the target heat map according to the central point coordinate to obtain fusion information.
It can be understood that the epipolar lines corresponding to the central points of the respective views in the target heat map are sampled to obtain the corresponding point sets, from which probability regions with a Gaussian distribution can be generated; the values of all points on the epipolar lines within the probability regions are then fused through the fully-connected layer, yielding the finally fused central point coordinates, and heat map coordinate matching is performed to obtain the fusion information.
In a specific implementation, as shown in fig. 7, which is a schematic diagram of the multi-view epipolar constraint model in the three-dimensional human body posture estimation method of the present invention, a high-resolution heatmap is input; the epipolar lines corresponding to the central points of each view are solved from the epipolar geometric constraint relationship and sampled to obtain a set of corresponding points. According to the characteristics of the heatmap, a probability area with a Gaussian distribution is generated at the corresponding coordinates: there is a high response only near the corresponding point and values close to 0 elsewhere, so the values of all points on the epipolar line can be fused using a fully-connected layer, improving the accuracy of central point detection. Finally, an L2 loss comparing the fused center point coordinates with the annotated center point coordinates is used for the training constraint.
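A toy numeric sketch of this sampling-and-fusion step is given below. It assumes identity intrinsics and a rectified pair, so that the fundamental matrix reduces to [t]_x, and replaces the learned fully-connected fusion with a simple weighted average of the sampled epipolar points; all coordinates are synthetic:

```python
import numpy as np

# Rectified stereo pair in normalized coordinates: R = I, baseline t along x,
# so (with identity intrinsics) F = [t]_x. Values are illustrative only.
t = np.array([1.0, 0.0, 0.0])
F = np.array([[0.0, -t[2], t[1]],
              [t[2], 0.0, -t[0]],
              [-t[1], t[0], 0.0]])

p1 = np.array([0.125, -0.075, 1.0])       # centre-point detection in view 1
p2_true = np.array([0.375, -0.075, 1.0])  # its (unknown to the model) match in view 2

# Epipolar line l2 = F @ p1, with l2 = (a, b, c) meaning a*x + b*y + c = 0.
a, b, c = F @ p1
xs = np.linspace(-1.0, 2.0, 301)
ys = -(a * xs + c) / b                    # sample candidate points along the line

# The heatmap has a Gaussian response only near the true corresponding point.
sigma = 0.1
w = np.exp(-((xs - p2_true[0])**2 + (ys - p2_true[1])**2) / (2 * sigma**2))

# Stand-in for the learned fully-connected fusion: a weighted average of the
# sampled points, which collapses onto the high-response region of the line.
fused = np.array([np.sum(w * xs), np.sum(w * ys)]) / np.sum(w)
print(fused)  # close to the true corresponding point (0.375, -0.075)
```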
According to the scheme, the preset key points between hip joints of the human body in the target heat map are used as the central points of the human body; matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multi-view to obtain fused information, so that the fused information can be accurately obtained, the accuracy of three-dimensional human body posture estimation is improved, reasoning and searching spaces of other human body key points are reduced, and the error of three-dimensional human body posture estimation is reduced.
Further, fig. 8 is a schematic flowchart of a fourth embodiment of the three-dimensional human body posture estimation method of the present invention, and as shown in fig. 8, the fourth embodiment of the three-dimensional human body posture estimation method of the present invention is proposed based on the first embodiment, in this embodiment, the step S20 specifically includes the following steps:
and S21, acquiring camera calibration data of a video camera, and projecting each voxel center in the fusion information to a camera view by using the camera calibration data to obtain camera view projection data.
It should be noted that after the camera calibration data of the video camera is obtained, the camera calibration data may be used to project each voxel center in the fusion information into the camera view, so as to obtain the camera view projection data.
It can be understood that the features of all the obtained views are aggregated into a 3D voxel volume by an inverse image projection method: a voxel grid is initialized that contains the whole space observed by the cameras, the center of each voxel is projected into the camera views using the camera calibration data, and a feature volume is then constructed from coarse to fine by the 3D CNN network to estimate the positions of all the key points.
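The projection of a voxel center into a camera view can be sketched with a pinhole model; the calibration values below are illustrative assumptions, not real camera data:

```python
import numpy as np

# Hypothetical calibration: intrinsics K, extrinsics (R, t).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)

def project(K, R, t, X):
    """Pinhole projection of a world-space voxel centre X to pixel coordinates."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

# A voxel centre at (0.5, -0.3, 4.0) in camera coordinates (metres).
p = project(K, R, t, np.array([0.5, -0.3, 4.0]))
print(p)  # → [420. 180.]
```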
In specific implementation, referring to fig. 9, which is a schematic diagram of the 3D CNN network structure in the three-dimensional human body posture estimation method of the present invention, the network input of the 3D CNN is a 3D feature volume constructed by projecting the 2D heatmaps of all camera views into a common 3D space. Because the heatmaps encode the position information of the central point, the resulting 3D feature volume also carries rich information for detecting the 3D posture, and the search area of the other key points in 3D space can be reduced according to human body prior information. In fig. 9, black open arrows represent standard 3D convolutional layers, black solid arrows represent residual blocks of two 3D convolutional layers, linear arrows are pooling, and dashed arrows are deconvolution. The three-dimensional space is discretized into $X \times Y \times Z$ discrete locations $\{G^{x,y,z}\}$, each of which can be regarded as an anchor of a detected person. To reduce the quantization error, $X$, $Y$ and $Z$ are adjusted so that the distance between adjacent anchors is reduced; on common datasets the space is typically $8\,\mathrm{m} \times 8\,\mathrm{m} \times 2\,\mathrm{m}$, so $X$, $Y$ and $Z$ are set to 80, 80 and 20 respectively.
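With the figures above (an 8 m × 8 m × 2 m space and 80 × 80 × 20 anchors, both taken from the text), the anchor spacing follows directly; the centre coordinates below are a sketch of how such a grid could be laid out:

```python
import numpy as np

space_mm = np.array([8000.0, 8000.0, 2000.0])  # observed space, in millimetres
bins = np.array([80, 80, 20])                  # anchors per axis

spacing = space_mm / bins                      # distance between adjacent anchors
print(spacing)  # → [100. 100. 100.]

# Anchor centres along x, placed at the middle of each cell.
centres_x = (np.arange(bins[0]) + 0.5) * spacing[0]
```

The uniform 100 mm spacing on every axis is what keeps the quantization error of the coarse detection stage bounded.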
And S22, constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
It will be appreciated that, by using the 3D CNN network to construct feature volumes from coarse to fine and estimate the locations of all keypoints, a three-dimensional feature volume can be constructed from the camera view projection data.
In specific implementation, the 2D heatmap values at the projected positions of each anchor in the camera views are fused to calculate a feature vector for each anchor. Let the 2D heatmaps in view $v$ be denoted $\mathbf{M}_v \in \mathbb{R}^{K \times H \times W}$, where $K$ is the number of body keypoints. For each anchor position $G^{x,y,z}$, denote its projected position in view $v$ by $P_v^{x,y,z}$ and the heatmap values there by $\mathbf{M}_v(P_v^{x,y,z})$. The feature vector of the anchor is then calculated as the average heatmap value over all camera views, as shown in formula (3):

$$\mathbf{F}^{x,y,z} = \frac{1}{V} \sum_{v=1}^{V} \mathbf{M}_v\big(P_v^{x,y,z}\big) \qquad (3)$$

where $V$ is the number of cameras. It can be seen that $\mathbf{F}^{x,y,z}$ actually encodes the likelihood of the $K$ keypoints appearing at $G^{x,y,z}$. A 3D bounding box is then used to represent the position of each detected person; the size and orientation of the bounding box are fixed in the experiments, which is a reasonable simplification because the variation of human size in 3D space is limited. A small network is slid over the feature volume $\mathbf{F}$; each sliding window centered at an anchor is mapped to a low-dimensional feature, which is fed to a fully-connected layer that regresses a confidence score as the output of the 3D CNN network, indicating the likelihood of a person appearing at that location. The GT heatmap value of each anchor is calculated according to the distance from the anchor to the GT pose: for each pair of GT and anchor, a Gaussian score is computed from their distance, decreasing exponentially as the distance increases. If there are N people in the scene, an anchor may have multiple scores; the N largest, representing the N person positions, are retained through non-maximum suppression (NMS).
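The view-averaged feature volume of formula (3) can be illustrated with a toy example; the two orthographic "views" below are stand-ins for calibrated cameras, and all coordinates and the Gaussian width are synthetic:

```python
import numpy as np

# A person centre at (0.5, 0.3, 0.7) observed by two orthographic "views".
centre = np.array([0.5, 0.3, 0.7])
sigma = 0.05
g = np.linspace(0.0, 1.0, 21)                  # voxel-centre coordinates per axis
X, Y, Z = np.meshgrid(g, g, g, indexing='ij')

def heatmap(u, v, cu, cv):
    """Gaussian 2D heatmap response at projected coordinates (u, v)."""
    return np.exp(-((u - cu)**2 + (v - cv)**2) / (2 * sigma**2))

# View 1 sees the (x, y) projection, view 2 the (x, z) projection.
M1 = heatmap(X, Y, centre[0], centre[1])
M2 = heatmap(X, Z, centre[0], centre[2])

# Formula (3): each anchor's feature is the average heatmap value over views.
F = (M1 + M2) / 2.0

# The anchor with the highest fused response recovers the person centre.
i, j, k = np.unravel_index(np.argmax(F), F.shape)
print(np.round([g[i], g[j], g[k]], 2))  # → [0.5 0.3 0.7]
```

Neither view alone localizes the centre in 3D; only the fused volume does, which is the point of aggregating all views before the 3D CNN.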
According to the scheme, by acquiring camera calibration data of a video camera, each voxel center in the fusion information is projected to a camera view by using the camera calibration data, so as to obtain camera view projection data; the three-dimensional characteristic volume is constructed by utilizing the 3D CNN network according to the camera view projection data, the accuracy of three-dimensional human body posture estimation can be improved, reasoning search spaces of other human body key points are reduced, errors of the three-dimensional human body posture estimation are reduced, the posture reconstruction quality is improved, the calculation cost is reduced, the influence of quantization errors is avoided, the accuracy of the three-dimensional human body posture estimation is improved, and the scheme is simple and reliable to implement.
Further, fig. 10 is a schematic flowchart of a fifth embodiment of the three-dimensional human body posture estimation method according to the present invention, and as shown in fig. 10, the fifth embodiment of the three-dimensional human body posture estimation method according to the present invention is proposed based on the first embodiment, in this embodiment, the step S30 specifically includes the following steps:
and S31, dividing the three-dimensional characteristic volume into a plurality of discrete grids.
It should be noted that after the three-dimensional feature volume is divided, a plurality of discrete grids can be obtained.
In a specific implementation, the first 3D CNN network cannot accurately estimate the 3D positions of all key points, so a finer-grained feature volume is constructed in the second 3D CNN network. The size of this feature volume is set to $2000\,\mathrm{mm} \times 2000\,\mathrm{mm} \times 2000\,\mathrm{mm}$, much smaller than $8\,\mathrm{m} \times 8\,\mathrm{m} \times 2\,\mathrm{m}$ but sufficient to cover any pose of a person; the volume is divided into $X_0 = Y_0 = Z_0 = 64$ discrete meshes, and the network body structure is the same as that of the first 3D CNN.
And S32, acquiring the 3D heat map space coordinates of each key point in each discrete grid, and performing regression on the 3D heat map space coordinates to obtain the three-dimensional human body posture.
It should be understood that, further, the 3D heat map space coordinates of each key point in each discrete grid are obtained, and then the 3D heat map space coordinates may be regressed, so that the three-dimensional human body posture may be obtained.
It will be appreciated that the 3D heatmap $\mathbf{H}_k \in \mathbb{R}^{X_0 \times Y_0 \times Z_0}$ of each keypoint $k$ is estimated based on the constructed feature volume, and finally the accurate three-dimensional human body posture is regressed. The centroid $\hat{J}_k$ of each keypoint is calculated according to formula (4):

$$\hat{J}_k = \sum_{x,y,z} \mathbf{H}_k(x,y,z)\, G^{x,y,z} \qquad (4)$$

where $G^{x,y,z}$ is the 3D position of voxel $(x,y,z)$ and $\mathbf{H}_k$ is normalized so that its values sum to one. The estimated joint positions are compared with the true positions $J_k^{*}$ to train the network; the loss function $\mathcal{L}$ is given by formula (5):

$$\mathcal{L} = \sum_{k=1}^{K} \left\| \hat{J}_k - J_k^{*} \right\|_1 \qquad (5)$$
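The centroid regression of equation (4) and the loss of formula (5) can be sketched as follows; the grid resolution and the Gaussian width of the synthetic heatmap are illustrative assumptions:

```python
import numpy as np

g = np.linspace(0.0, 1.0, 41)                 # voxel-centre coordinates per axis
X, Y, Z = np.meshgrid(g, g, g, indexing='ij')
J_gt = np.array([0.5, 0.25, 0.75])            # ground-truth joint position

# Synthetic 3D heatmap peaked at the ground-truth joint, then normalized.
H = np.exp(-((X - J_gt[0])**2 + (Y - J_gt[1])**2 + (Z - J_gt[2])**2)
           / (2 * 0.05**2))
H /= H.sum()

# Formula (4): the joint is regressed as the centroid of the heatmap over
# the voxel positions.
J_hat = np.array([np.sum(H * X), np.sum(H * Y), np.sum(H * Z)])

# Formula (5): L1 distance between estimated and ground-truth joint position.
loss = np.abs(J_hat - J_gt).sum()
print(loss < 1e-3)  # the centroid recovers the joint position
```

Because the centroid is a differentiable function of the heatmap, this regression lets the loss be back-propagated through the 3D CNN.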
in a specific implementation, the accuracy of the 3D pose of the Campus and Shelf datasets is evaluated using the Percentage PCP3D (percent of Correct Part 3D) that correctly estimates the Joint position, and if the distance between the predicted Joint position and the true Joint position is less than half of the limb length, the detection is considered Correct for the CMU-Panoptic dataset, the Mean MPJPE (Mean Per Joint Point Error) of the Joint position Error is taken as an important evaluation index, and the positioning accuracy of the 3D Joint is evaluated in millimeters, representing the distance between GT and the predicted Joint position; for each frame f and the human skeleton S, the calculation of MPJPE is given by equation (6):
Figure 740114DEST_PATH_IMAGE024
(6)
wherein
Figure 332770DEST_PATH_IMAGE025
Is the number of joints in the skeleton S, and for a set of frames, the error is the average of the MPJPE of all frames; meanwhile, average Precision (Average Precision) and Recall rate (Recall) are taken as performance indexes for comprehensively evaluating 3D human body center detection and human body posture estimation on the threshold value (from 25mm to 150mm, the step length is 25 mm) of MPJPE; AP is defined by abscissa Recall and ordinate essenceThe area under a PR curve formed by two dimensions of accuracy (Precision) is larger, and the larger the value of AP is, the better the comprehensive performance of the detection model is.
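MPJPE as in formula (6) reduces to a mean of per-joint Euclidean distances; below is a minimal check with a hypothetical two-joint skeleton whose coordinates are chosen purely for illustration:

```python
import numpy as np

def mpjpe(pred, gt):
    """Formula (6): mean Euclidean distance over the N_S joints of skeleton S."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

# Hypothetical two-joint skeleton, millimetre coordinates.
gt = np.array([[0.0, 0.0, 0.0],
               [100.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 30.0],      # 30 mm error on joint 1
                 [100.0, 40.0, 0.0]])   # 40 mm error on joint 2
print(mpjpe(pred, gt))  # → 35.0
```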
According to the scheme, the three-dimensional characteristic volume is divided into a plurality of discrete grids; the method comprises the steps of obtaining the 3D heat map space coordinates of each key point in each discrete grid, and regressing the 3D heat map space coordinates to obtain the three-dimensional human body posture, so that the accuracy of three-dimensional human body posture estimation can be improved, reasoning search spaces of other human body key points are reduced, errors of the three-dimensional human body posture estimation are reduced, the reconstruction quality of the posture is improved, the calculation cost is reduced, the influence of quantization errors is avoided, the accuracy of the three-dimensional human body posture estimation is improved, and the scheme is simple and reliable to implement.
Correspondingly, the invention further provides a three-dimensional human body posture estimation device.
Referring to fig. 11, fig. 11 is a functional block diagram of a three-dimensional human body posture estimation device according to a first embodiment of the present invention.
In a first embodiment of the three-dimensional body posture estimation device of the present invention, the three-dimensional body posture estimation device includes:
the fusion module 10 is configured to generate a target heat map corresponding to the input image by using a multi-view fusion network, and perform matching fusion on the human body center point heat map information of the target heat map to obtain fusion information.
And the projection module 20 is configured to project the fusion information to a 3D space to obtain a three-dimensional feature volume.
And the posture estimation module 30 is used for estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
The fusion module 10 is further configured to input the input image into a high-resolution network of the multiview fusion network, and acquire high-resolution feature information; constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module; fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map; matching and fusing the human body central point heat map information of the target heat map to obtain fused information.
The fusion module 10 is further configured to fuse feature maps of different resolutions at different stages in the multi-resolution module to obtain a fusion feature map; inputting the fusion characteristic graph into a deconvolution module, obtaining an output result through convolution and channel conversion, and performing dimensionality splicing on the output result and the fusion characteristic graph to obtain a splicing characteristic; and improving the resolution ratio of the splicing features according to the deconvolution layer, extracting target feature information of the splicing features with improved resolution ratio through the residual error unit, and generating a target heat map according to the target feature information.
The fusion module 10 is further configured to use a preset key point between hip joints of the human body in the target heat map as a central point of the human body; and matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information.
The fusion module 10 is further configured to sample epipolar lines corresponding to the central points of the graphs in the target heat map according to the human central points of the multiple views, so as to obtain a corresponding point set; generating a probability region of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set; fusing values of all points on the epipolar line in the probability region through a full-connection layer to obtain a finally fused central point coordinate; and carrying out coordinate matching fusion on the human body central point heat map information of the target heat map according to the central point coordinate to obtain fusion information.
The projection module 20 is further configured to acquire camera calibration data of a video camera, and project each voxel center in the fusion information into a camera view by using the camera calibration data to obtain camera view projection data; and constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
The pose estimation module 30 is further configured to divide the three-dimensional feature volume into a plurality of discrete grids; and acquiring the 3D heat map space coordinate of each key point in each discrete grid, and regressing the 3D heat map space coordinate to obtain the three-dimensional human body posture.
The steps implemented by each functional module of the three-dimensional human body posture estimation device can refer to each embodiment of the three-dimensional human body posture estimation method, and are not described herein again.
In addition, an embodiment of the present invention further provides a storage medium, where a three-dimensional human body posture estimation program is stored on the storage medium, and when executed by a processor, the three-dimensional human body posture estimation program implements the following operations:
generating a target heat map corresponding to an input image by adopting a multi-view fusion network, and matching and fusing human body center point heat map information of the target heat map to obtain fusion information;
projecting the fusion information to a 3D space to obtain a three-dimensional characteristic volume;
and estimating the three-dimensional human body posture according to the three-dimensional characteristic volume.
Further, the three-dimensional human body posture estimation program further realizes the following operations when being executed by the processor:
inputting an input image into a high-resolution network of a multi-view fusion network to acquire high-resolution characteristic information;
constructing a residual error unit of the high-resolution network according to the high-resolution characteristic information, and performing convolution sampling on the residual error unit to obtain a multi-resolution module;
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a target heat map;
matching and fusing the human body central point heat map information of the target heat map to obtain fused information.
Further, the three-dimensional human body posture estimation program when executed by the processor further realizes the following operations:
fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain a fused feature map;
inputting the fusion characteristic graph into a deconvolution module, obtaining an output result through convolution and channel conversion, and performing dimensionality splicing on the output result and the fusion characteristic graph to obtain a splicing characteristic;
and improving the resolution ratio of the splicing features according to the deconvolution layer, extracting target feature information of the splicing features with improved resolution ratio through the residual error unit, and generating a target heat map according to the target feature information.
Further, the three-dimensional human body posture estimation program when executed by the processor further realizes the following operations:
taking preset key points between hip joints of the human body in the target heat map as central points of the human body;
and matching and fusing the human body central point heat map information of the target heat map according to the human body central point of the multiple views to obtain fused information.
Further, the three-dimensional human body posture estimation program when executed by the processor further realizes the following operations:
sampling polar lines corresponding to the central points of the graphs in the target heat map according to the central points of the human body of the multiple views to obtain a corresponding point set;
generating a probability region of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set;
fusing values of all points on the epipolar line in the probability region through a full-connection layer to obtain a finally fused central point coordinate;
and carrying out coordinate matching fusion on the human body central point heat map information of the target heat map according to the central point coordinate to obtain fusion information.
Further, the three-dimensional human body posture estimation program when executed by the processor further realizes the following operations:
acquiring camera calibration data of a video camera, and projecting each voxel center in the fusion information into a camera view by using the camera calibration data to obtain camera view projection data;
and constructing a three-dimensional characteristic volume according to the camera view projection data by using a 3D CNN network.
Further, when executed by the processor, the three-dimensional human body posture estimation program implements the following operations:
dividing the three-dimensional feature volume into a plurality of discrete grids;
and acquiring the 3D heat map spatial coordinates of each key point in each discrete grid, and regressing the 3D heat map spatial coordinates to obtain the three-dimensional human body posture.
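One common way to regress continuous joint coordinates from a discretized 3D heat map without quantization error is a soft-argmax (integral regression) over the grid. Treating the regression step this way is an assumption about the embodiment; a minimal sketch:

```python
import numpy as np

def soft_argmax_3d(heatmap, grid):
    """Continuous 3D coordinate as the softmax-weighted expectation over a
    (D, H, W) heat map; grid holds the (D, H, W, 3) voxel-center coordinates."""
    p = np.exp(heatmap - heatmap.max())   # numerically stable softmax
    p /= p.sum()
    return (grid * p[..., None]).sum(axis=(0, 1, 2))

# Toy 8x8x8 grid whose voxel coordinates are just the indices.
d, h, w = np.mgrid[0:8, 0:8, 0:8]
grid = np.stack([d, h, w], axis=-1).astype(float)
hm3d = np.zeros((8, 8, 8))
hm3d[2, 3, 4] = 50.0                      # sharp peak at voxel (2, 3, 4)
joint = soft_argmax_3d(hm3d, grid)
print(joint)   # approximately [2. 3. 4.]
```

Because the expectation is taken over the full distribution rather than a hard argmax over grid cells, the recovered coordinate is not snapped to voxel centers, which matches the summary's claim of avoiding quantization error.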
According to the scheme, a multi-view fusion network generates the target heat map corresponding to each input image, and the human body center point heat map information of the target heat maps is matched and fused to obtain fusion information; the fusion information is projected into 3D space to obtain a three-dimensional feature volume; and the three-dimensional human body posture is estimated from the three-dimensional feature volume. This improves the accuracy of three-dimensional human body posture estimation, reduces the inference search space for the remaining human body key points, lowers estimation error and computational cost, improves posture reconstruction quality, and avoids the influence of quantization error. The scheme is simple and reliable to implement, is applicable to three-dimensional human body posture estimation in most scenes, and improves the speed and efficiency of estimation.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of another identical element in the process, method, article, or apparatus that comprises that element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention; all equivalent structural or process modifications made using the contents of the present specification and accompanying drawings, or applied directly or indirectly in other related technical fields, are likewise included within the scope of the present invention.

Claims (8)

1. A three-dimensional human body posture estimation method is characterized by comprising the following steps:
generating a target heat map corresponding to an input image by adopting a multi-view fusion network, and matching and fusing human body center point heat map information of the target heat map to obtain fusion information;
projecting the fusion information into 3D space to obtain a three-dimensional feature volume;
estimating a three-dimensional human body posture from the three-dimensional feature volume;
wherein the generating a target heat map corresponding to an input image by adopting a multi-view fusion network, and matching and fusing human body center point heat map information of the target heat map to obtain fusion information comprises the following steps:
inputting the input image into a high-resolution network of the multi-view fusion network to acquire high-resolution feature information;
constructing residual units of the high-resolution network from the high-resolution feature information, and performing convolutional sampling on the residual units to obtain a multi-resolution module;
fusing the feature maps of different resolutions at each stage in the multi-resolution module to obtain the target heat map;
and matching and fusing the human body center point heat map information of the target heat map to obtain the fusion information;
wherein the matching and fusing of the human body center point heat map information of the target heat map to obtain the fusion information comprises the following steps:
taking a preset key point between the hip joints of the human body in the target heat map as the human body center point;
and matching and fusing the human body center point heat map information of the target heat map according to the human body center points of the multiple views to obtain the fusion information.
2. The method for estimating the posture of the three-dimensional human body according to claim 1, wherein the fusing the feature maps with different resolutions at each stage in the multi-resolution module to obtain the target heat map comprises:
fusing the feature maps of different resolutions at each stage in the multi-resolution module to obtain a fused feature map;
inputting the fused feature map into a deconvolution module, obtaining an output result through convolution and channel conversion, and concatenating the output result with the fused feature map along the channel dimension to obtain concatenated features;
and increasing the resolution of the concatenated features through the deconvolution layer, extracting target feature information from the higher-resolution concatenated features through the residual units, and generating the target heat map from the target feature information.
3. The method according to claim 1, wherein the matching and fusing of the human body center point heat map information of the target heat map according to the human body center points of the multiple views to obtain the fusion information comprises:
sampling the epipolar lines corresponding to the center points across the target heat maps according to the human body center points of the multiple views to obtain a corresponding point set;
generating a probability region of the Gaussian distribution of the target heat map at the corresponding coordinates according to the corresponding point set;
fusing the values of all points on the epipolar line within the probability region through a fully connected layer to obtain the finally fused center point coordinates;
and performing coordinate matching and fusion on the human body center point heat map information of the target heat map according to the center point coordinates to obtain the fusion information.
4. The three-dimensional human body posture estimation method of claim 1, wherein the projecting the fusion information into 3D space to obtain a three-dimensional feature volume comprises:
acquiring camera calibration data of the cameras, and projecting each voxel center in the fusion information into each camera view using the camera calibration data to obtain camera view projection data;
and constructing the three-dimensional feature volume from the camera view projection data using a 3D CNN.
5. The three-dimensional human body posture estimation method of claim 1, wherein the estimating the three-dimensional human body posture from the three-dimensional feature volume comprises:
dividing the three-dimensional feature volume into a plurality of discrete grids;
and acquiring the 3D heat map spatial coordinates of each key point in each discrete grid, and regressing the 3D heat map spatial coordinates to obtain the three-dimensional human body posture.
6. A three-dimensional human body posture estimation device, characterized by comprising:
a fusion module, configured to generate a target heat map corresponding to an input image by adopting a multi-view fusion network, and to match and fuse the human body center point heat map information of the target heat map to obtain fusion information;
a projection module, configured to project the fusion information into 3D space to obtain a three-dimensional feature volume;
and a posture estimation module, configured to estimate a three-dimensional human body posture from the three-dimensional feature volume;
wherein the fusion module is further configured to input the input image into a high-resolution network of the multi-view fusion network to acquire high-resolution feature information; construct residual units of the high-resolution network from the high-resolution feature information, and perform convolutional sampling on the residual units to obtain a multi-resolution module; fuse the feature maps of different resolutions at each stage in the multi-resolution module to obtain the target heat map; and match and fuse the human body center point heat map information of the target heat map to obtain the fusion information;
and the fusion module is further configured to take a preset key point between the hip joints of the human body in the target heat map as the human body center point, and to match and fuse the human body center point heat map information of the target heat map according to the human body center points of the multiple views to obtain the fusion information.
7. A three-dimensional human body posture estimation device characterized by comprising: a memory, a processor, and a three-dimensional body pose estimation program stored on the memory and executable on the processor, the three-dimensional body pose estimation program configured to implement the steps of the three-dimensional body pose estimation method of any of claims 1 to 5.
8. A storage medium, characterized in that the storage medium has stored thereon a three-dimensional human body posture estimation program which, when executed by a processor, realizes the steps of the three-dimensional human body posture estimation method according to any one of claims 1 to 5.
CN202210956640.4A 2022-08-10 2022-08-10 Three-dimensional human body posture estimation method, device, equipment and storage medium Active CN115035551B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210956640.4A CN115035551B (en) 2022-08-10 2022-08-10 Three-dimensional human body posture estimation method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115035551A CN115035551A (en) 2022-09-09
CN115035551B true CN115035551B (en) 2022-12-02

Family

ID=83130421



Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880774B (en) * 2022-12-01 2024-08-16 湖南工商大学 Body-building action recognition method and device based on human body posture estimation and related equipment
CN116665309B (en) * 2023-07-26 2023-11-14 山东睿芯半导体科技有限公司 Method, device, chip and terminal for identifying walking gesture features
CN118334755B (en) * 2024-06-14 2024-09-03 中国地质大学(武汉) Semi-supervised animal three-dimensional attitude estimation method, equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN113643366A (en) * 2021-07-12 2021-11-12 中国科学院自动化研究所 Multi-view three-dimensional object attitude estimation method and device
CN114548224A (en) * 2022-01-19 2022-05-27 南京邮电大学 2D human body pose generation method and device for strong interaction human body motion
CN114613001A (en) * 2022-01-28 2022-06-10 厦门理工学院 3D human body posture estimation method based on high-quality heat map in multiple views
CN114627491A (en) * 2021-12-28 2022-06-14 浙江工商大学 Single three-dimensional attitude estimation method based on polar line convergence
CN114758205A (en) * 2022-04-24 2022-07-15 湖南大学 Multi-view feature fusion method and system for 3D human body posture estimation


Non-Patent Citations (3)

Title
Cross View Fusion for 3D Human Pose Estimation; Haibo Qiu et al.; arXiv:1909.01203v1 [cs.CV]; 2019-09-03; pp. 1-10 *
HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation; Bowen Cheng et al.; CVPR 2020 *
VoxelPose: Towards Multi-Camera 3D Human Pose Estimation in Wild Environment; Hanyue Tu et al.; arXiv:2004.06239v4 [cs.CV]; 2020-08-24; pp. 1-17 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information
Inventor after: Hu Bo; Hu Shizhuo; Zhou Bin; Shen Zhengang; Li Yanhong
Inventor before: Hu Bo; Hu Shizhuo; Zhou Bin; Shen Zhengang; Li Yanhong
PE01 Entry into force of the registration of the contract for pledge of patent right
Denomination of invention: Three-dimensional human body posture estimation method, device, equipment and storage medium
Granted publication date: 20221202
Pledgee: Guanggu Branch of Wuhan Rural Commercial Bank Co.,Ltd.
Pledgor: WUHAN ETAH INFORMATION TECHNOLOGY Co.,Ltd.
Registration number: Y2024980009498