CN117558066A - Model training method, joint point prediction method, device, equipment and storage medium

Info

Publication number: CN117558066A
Application number: CN202311823987.2A
Authority: CN (China)
Prior art keywords: joint point, depth, gesture, module, joint
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 张戈 (Zhang Ge), 陈德聪 (Chen Decong), 赵飞 (Zhao Fei)
Original Assignee: Agricultural Bank of China
Current Assignee: Agricultural Bank of China
Application filed by Agricultural Bank of China

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/23 Recognition of whole body movements, e.g. for sport training
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/467 Encoded features or binary features, e.g. local binary patterns [LBP]
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a model training method, a joint point prediction method, a device, equipment and a storage medium. A training data set for 3D human body posture estimation is acquired, and a joint point prediction model is trained based on the training data set. The joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module: the joint point depth estimation module extracts joint point depth features of an input image to form a depth map, the 2D gesture estimation module extracts 2D gesture features of the input image to form a thermodynamic diagram, the input of the 3D gesture fusion module comprises the depth map and the thermodynamic diagram, and the output of the 3D gesture fusion module is the joint point prediction result. The method addresses the problem that existing 3D human body posture estimation is affected by depth blurring, multi-person occlusion and the like, which lowers the accuracy of joint point prediction and in turn degrades 3D human body posture estimation precision, and thereby improves both joint point prediction accuracy and 3D human body posture estimation precision.

Description

Model training method, joint point prediction method, device, equipment and storage medium
Technical Field
Embodiments of the invention relate to the technical field of computer vision, and in particular to a model training method, a joint point prediction method, a device, equipment and a storage medium.
Background
Human body posture estimation is one of the hot research directions in the fields of computer vision and pattern recognition, is a key step for a computer to understand human behavior and actions, and has been widely applied in human activity analysis, intelligent video monitoring, advanced human-computer interaction and other fields. Human body posture estimation can be subdivided into 2D human body posture estimation and 3D human body posture estimation; 3D human body posture estimation aims at estimating the coordinates of the human body joints in three-dimensional space from a given image or video, so accurate prediction of the joint points is an effective means of improving the accuracy of 3D human body posture estimation.
Existing joint point prediction methods suffer from low joint point prediction accuracy under the influence of depth blurring, multi-person occlusion and the like, which in turn degrades the accuracy of 3D human body posture estimation.
Disclosure of Invention
The invention provides a model training method, a joint point prediction method, a device, equipment and a storage medium, which are used to solve the problem that, in existing joint point prediction methods, the accuracy of joint point prediction is low due to the influence of depth blurring, multi-person occlusion and the like, and the precision of 3D human body posture estimation is thereby affected.
According to an aspect of the present invention, there is provided a joint prediction model training method, including:
Acquiring a training data set of 3D human body posture estimation;
training a joint point prediction model based on the training data set, wherein the joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module, the joint point depth estimation module is used for extracting joint point depth features of an input image to form a depth map, the 2D gesture estimation module is used for extracting 2D gesture features of the input image to form a thermodynamic diagram, the input of the 3D gesture fusion module comprises the depth map and the thermodynamic diagram, and the output of the 3D gesture fusion module is the joint point prediction result.
According to another aspect of the present invention, there is provided a joint point prediction method, including:
acquiring a prediction set of 3D human body posture estimation;
inputting a single frame image in the prediction set into a joint point prediction model to obtain a joint point prediction result;
the joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module, wherein the joint point depth estimation module is used for extracting joint point depth characteristics, the 2D gesture estimation module is used for extracting 2D gesture characteristics, the input of the 3D gesture fusion module comprises a depth map and a thermodynamic diagram, and the output of the 3D gesture fusion module is a joint point prediction result.
According to another aspect of the present invention, there is provided a joint prediction model training apparatus including:
the training set acquisition module is used for acquiring a training data set of 3D human body posture estimation;
the model training module is used for training the joint point prediction model based on the training data set, the joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module, the joint point depth estimation module is used for extracting joint point depth characteristics of an input image to form a depth map, the 2D gesture estimation module is used for extracting 2D gesture characteristics of the input image to form a thermodynamic diagram, the input of the 3D gesture fusion module comprises the depth map and the thermodynamic diagram, and the output of the 3D gesture fusion module is a joint point prediction result.
According to another aspect of the present invention, there is provided a joint point predicting apparatus including:
the prediction set acquisition module is used for acquiring a prediction set of 3D human body posture estimation;
the prediction result determining module is used for inputting a single frame image in the prediction set into the joint point prediction model to obtain a joint point prediction result;
the joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module, wherein the joint point depth estimation module is used for extracting joint point depth characteristics, the 2D gesture estimation module is used for extracting 2D gesture characteristics, the input of the 3D gesture fusion module comprises a depth map and a thermodynamic diagram, and the output of the 3D gesture fusion module is a joint point prediction result.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of training the joint point prediction model of any of the embodiments of the present invention or to perform the method of predicting the joint point of any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the joint point prediction model training method of any of the embodiments of the present invention or the joint point prediction method of any of the embodiments of the present invention when executed.
According to the technical scheme provided by the embodiment of the invention, a training data set of 3D human body posture estimation is acquired, and a joint point prediction model is trained based on the training data set. The joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module: the joint point depth estimation module extracts joint point depth features of an input image to form a depth map, the 2D gesture estimation module extracts 2D gesture features of the input image to form a thermodynamic diagram, the input of the 3D gesture fusion module comprises the depth map and the thermodynamic diagram, and the output of the 3D gesture fusion module is the joint point prediction result. In this technical scheme, the acquired training data set of 3D human body posture estimation is input into the joint point prediction model; the model uses the depth map formed by the joint point depth estimation module and the thermodynamic diagram formed by the 2D gesture estimation module, the depth map and the thermodynamic diagram are then input into the 3D gesture fusion module to output the joint point prediction result, and the training of the joint point prediction model is thereby realized. The scheme addresses the problem that existing 3D human body posture estimation is affected by depth blurring, multi-person occlusion and the like, which lowers the accuracy of joint point prediction and in turn degrades 3D human body posture estimation precision, and thereby improves both joint point prediction accuracy and 3D human body posture estimation precision.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for training a joint prediction model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a joint prediction method according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a joint prediction framework according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating an implementation procedure of a 2D pose estimation module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating an implementation procedure of a joint depth estimation module according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a 3D gesture fusion module according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a training device for a joint prediction model according to a third embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a joint prediction device according to a fourth embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a method for training a joint prediction model according to an embodiment of the present invention, where the method may be performed by a joint prediction model training device, and the joint prediction model training device may be implemented in hardware and/or software, and the joint prediction model training device may be configured in a terminal or a server with a data processing function. As shown in fig. 1, the method includes:
s110, acquiring a training data set of 3D human body posture estimation.
In this embodiment, the training data set of 3D human body pose estimation is used to train the joint point prediction model.
Specifically, images containing a single person or multiple persons are acquired by an image acquisition device; the images can be acquired indoors or outdoors. A training data set for 3D human body posture estimation is then formed based on the acquired images.
S120, training a joint point prediction model based on a training data set, wherein the joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module, the joint point depth estimation module is used for extracting joint point depth characteristics of an input image to form a depth map, the 2D gesture estimation module is used for extracting 2D gesture characteristics of the input image to form a thermodynamic diagram, the input of the 3D gesture fusion module comprises the depth map and the thermodynamic diagram, and the output of the 3D gesture fusion module is a joint point prediction result.
In this embodiment, the joint point prediction model is used to predict the joint point coordinates from a single frame image. The input image is an image in the input training data set. The joint point depth features include the perpendicular distance from a joint point in the scene to the imaging plane of the image acquisition device, and the depth map is a single-channel two-dimensional image composed of the perpendicular distances from the joint points in the scene to the imaging plane of the image acquisition device; the joint point depth estimation module is used for extracting the joint point depth features of the input image to form the depth map. The 2D gesture features include the probability that each pixel in the input image is a joint point, and the thermodynamic diagram (i.e., a heatmap) is generated from those probabilities; the 2D gesture estimation module is used for extracting the 2D gesture features of the input image to form the thermodynamic diagram. The joint point prediction result comprises the joint point coordinates predicted by the 3D gesture fusion module, and the inputs of the 3D gesture fusion module comprise the depth map and the thermodynamic diagram.
Specifically, the training data set is input into both the joint point depth estimation module and the 2D gesture estimation module. The joint point depth estimation module extracts the joint point depth features of a single frame image in the training data set to form a depth map, and the 2D gesture estimation module extracts the 2D gesture features of the same single frame image to form a thermodynamic diagram. The depth map and the thermodynamic diagram are then input simultaneously into the 3D gesture fusion module, which outputs the predicted joint point coordinates. The whole joint point prediction model is then adjusted based on the predicted joint point coordinates and the real joint point coordinates, thereby realizing the training of the joint point prediction model.
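To make the data flow concrete, the following PyTorch-style sketch shows how the two parallel branches and the fusion module could be wired together; the class names, the number of joint classes and the tensor shapes are illustrative assumptions rather than details taken from the patent.

```python
# Minimal sketch of the joint point prediction model, assuming PyTorch and
# hypothetical submodule implementations (depth_estimator, pose2d_estimator,
# fusion3d) that follow the module descriptions above.
import torch.nn as nn

class JointPredictionModel(nn.Module):
    def __init__(self, depth_estimator, pose2d_estimator, fusion3d):
        super().__init__()
        self.depth_estimator = depth_estimator    # joint point depth estimation module
        self.pose2d_estimator = pose2d_estimator  # 2D gesture estimation module
        self.fusion3d = fusion3d                  # 3D gesture fusion module

    def forward(self, image):
        depth_map = self.depth_estimator(image)        # (B, K, H/4, W/4) depth map
        heatmap = self.pose2d_estimator(image)         # (B, K, H/4, W/4) thermodynamic diagram (heatmap)
        joints_3d = self.fusion3d(depth_map, heatmap)  # (B, K, 3) predicted joint point coordinates
        return depth_map, heatmap, joints_3d
```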
According to the technical scheme provided by the embodiment of the invention, a training data set of 3D human body posture estimation is acquired, and a joint point prediction model is trained based on the training data set. The joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module: the joint point depth estimation module extracts joint point depth features of an input image to form a depth map, the 2D gesture estimation module extracts 2D gesture features of the input image to form a thermodynamic diagram, the input of the 3D gesture fusion module comprises the depth map and the thermodynamic diagram, and the output of the 3D gesture fusion module is the joint point prediction result. In this technical scheme, the acquired training data set of 3D human body posture estimation is input into the joint point prediction model; the model uses the depth map formed by the joint point depth estimation module and the thermodynamic diagram formed by the 2D gesture estimation module, the depth map and the thermodynamic diagram are then input into the 3D gesture fusion module to output the joint point prediction result, and the training of the joint point prediction model is thereby realized. The scheme addresses the problem that existing 3D human body posture estimation is affected by depth blurring, multi-person occlusion and the like, which lowers the accuracy of joint point prediction and in turn degrades 3D human body posture estimation precision, and thereby improves both joint point prediction accuracy and 3D human body posture estimation precision.
In some embodiments, the loss of the joint point prediction model includes a thermodynamic diagram loss, a depth information loss, a joint point 3D pose information loss, and a multi-view consistency projection loss. This scheme effectively avoids large deviations between the two-dimensional coordinates and the depth coordinates of the joint points, further reduces the noise caused by depth blurring and occlusion, and effectively improves the accuracy of model prediction.
In this embodiment, the thermodynamic diagram loss is used to constrain the thermodynamic diagram, the depth information loss is used to constrain the depth map, the joint point 3D pose information loss is used to optimize the 3D gesture fusion module, and the multi-view consistency projection loss is used to reduce the problems of depth blurring and human body occlusion.
Specifically, because the joint point prediction model would otherwise supervise only the three-dimensional coordinates of the joint points during training, the two-dimensional coordinates and the depth coordinates of the joint points can easily deviate greatly. A thermodynamic diagram loss and a depth information loss are therefore constructed to constrain the thermodynamic diagram and the depth map, respectively. The thermodynamic diagram loss $l_{2d}$, the depth information loss $l_{D}$ and the joint point 3D pose information loss $l_{3d}$ are computed as follows:

$l_{2d} = \mathrm{L2Loss}(H, H')$

$l_{D} = \mathrm{L2Loss}(D, D')$

$l_{3d} = \mathrm{smoothL1Loss}(P, P')$

where L2Loss denotes the L2-norm loss function, $H$ denotes the 2D joint point thermodynamic diagram label, $H'$ denotes the joint point thermodynamic diagram predicted by the 2D gesture estimation module, $D$ denotes the joint point depth coordinate label, $D'$ denotes the joint point depth coordinate predicted by the depth estimation module, smoothL1Loss denotes the smooth L1-norm loss function, $P$ denotes the joint point three-dimensional coordinate label, and $P'$ denotes the joint point three-dimensional coordinates predicted by the 3D gesture fusion module.
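As a minimal sketch of these three supervised terms, assuming PyTorch tensors whose shapes match the labels described above (the function and variable names are illustrative):

```python
import torch.nn.functional as F

def supervised_losses(heatmap_pred, heatmap_gt, depth_pred, depth_gt,
                      joints3d_pred, joints3d_gt):
    # Thermodynamic diagram (heatmap) loss: L2 between prediction H' and label H.
    l_2d = F.mse_loss(heatmap_pred, heatmap_gt)
    # Depth information loss: L2 between predicted depth map D' and label D.
    l_depth = F.mse_loss(depth_pred, depth_gt)
    # Joint point 3D pose information loss: smooth L1 between P' and P.
    l_3d = F.smooth_l1_loss(joints3d_pred, joints3d_gt)
    return l_2d, l_depth, l_3d
```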
Meanwhile, in order to further alleviate depth blurring and the noise caused by human body occlusion, a geometric prior constraint is introduced to construct a multi-view consistency projection loss. The thermodynamic diagram and the depth map correspond one-to-one on the pixel plane, so for a joint point at pixel coordinates $(x_k, y_k)$ the corresponding depth label $d_k$ can be read from the depth map. According to the camera model of the image acquisition device, the projection of a joint point onto the pixel plane is determined by the camera intrinsic matrix $K$: the joint point coordinates $P(X, Y, Z)$ in the camera coordinate system and the joint point coordinates $p(x', y')$ in the pixel plane coordinate system satisfy the pinhole mapping

$Z \cdot [x', y', 1]^{T} = K \cdot [X, Y, Z]^{T}$

The multi-view consistency projection loss can then be expressed as

$l_{p} = \mathrm{smoothL1Loss}((x, y), (x', y')) + \mathrm{smoothL1Loss}(Z, d)$

Accordingly, the loss function of the joint point prediction model is

$L = l_{2d} + l_{D} + l_{3d} + l_{p}$
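The projection consistency term and the total loss can be sketched as follows, assuming the standard pinhole camera model discussed above; the tensor layout and the way the intrinsic matrix is passed in are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def projection_loss(joints3d_pred, joints2d_heatmap, depth_labels, K_intrinsics):
    # joints3d_pred:    (B, N, 3) predicted camera-space joint coordinates P = (X, Y, Z)
    # joints2d_heatmap: (B, N, 2) pixel coordinates (x, y) decoded from the thermodynamic diagram
    # depth_labels:     (B, N) depth labels d read from the depth map at (x, y)
    # K_intrinsics:     (3, 3) camera intrinsic matrix K
    X, Y, Z = joints3d_pred.unbind(dim=-1)
    fx, fy = K_intrinsics[0, 0], K_intrinsics[1, 1]
    cx, cy = K_intrinsics[0, 2], K_intrinsics[1, 2]
    z_safe = Z.clamp(min=1e-6)
    # Pinhole projection of the predicted 3D joints onto the pixel plane.
    proj_2d = torch.stack([fx * X / z_safe + cx, fy * Y / z_safe + cy], dim=-1)
    # Consistency between projected and heatmap-decoded 2D coordinates,
    # plus consistency between the predicted depth Z and the depth label d.
    return F.smooth_l1_loss(proj_2d, joints2d_heatmap) + F.smooth_l1_loss(Z, depth_labels)

def total_loss(l_2d, l_depth, l_3d, l_p):
    # L = l_2d + l_D + l_3d + l_p
    return l_2d + l_depth + l_3d + l_p
```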
in some embodiments, obtaining a training dataset of 3D human body pose estimates includes: acquiring 3D human body posture data in an indoor scene by using a depth camera and generating calibrated image information and depth information according to the 3D human body posture data; and marking the joint points of the human body target according to the image information and the depth information to obtain a three-channel RGB image and a label with the joint point information and the camera parameters, and using the three-channel RGB image and the label with the joint point information and the camera parameters as a training data set for 3D human body posture estimation. By the technical scheme, a high-quality training data set can be obtained, and a foundation is laid for training of the joint point prediction model.
In this embodiment, a depth camera may be understood as a camera capable of acquiring the physical distance of each point in a scene from the camera. The image information includes coordinates of each pixel in the image. The depth information includes the physical distance of points in the scene corresponding to each pixel from the camera.
Specifically, in this embodiment the image acquisition device used to obtain the training data set of 3D human body posture estimation is a depth camera, for example the RealSense Depth Camera D435i. The depth camera is used to acquire 3D human body posture data in an indoor scene, and calibrated image information and depth information are generated from the 3D human body posture data. The joint points of the human body target are then marked based on the image information and the depth information, yielding a three-channel RGB image and a label carrying the joint point information and the camera parameters. The marked 3D human body posture data are then used as the indoor-scene 3D human body posture estimation training data set.
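A minimal capture sketch with the pyrealsense2 SDK for a D435i is shown below; the stream resolution, frame rate and the way samples are assembled are assumptions, and the joint point annotation step itself is not shown.

```python
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)  # align depth frames to the color frames

try:
    frames = pipeline.wait_for_frames()
    aligned = align.process(frames)
    color = np.asanyarray(aligned.get_color_frame().get_data())  # three-channel RGB image
    depth = np.asanyarray(aligned.get_depth_frame().get_data())  # per-pixel depth information
    # Joint point labels and camera parameters would then be attached to
    # (color, depth) to form one training sample.
finally:
    pipeline.stop()
```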
Optionally, in order to improve the generalization capability of the joint point prediction model and to solve the problem of difficult network convergence caused by sparse labels, the 3D human body posture labels are expanded, where a 3D human body posture label is the label obtained by marking the joint points of the human body target based on the image information and the depth information and carries the joint point information and the camera parameters.
When training the 2D gesture estimation module, it is difficult to directly regress the offset of a joint point with respect to the original image coordinates, because the sparse offset supervision limits the convergence speed of the module and makes it difficult for the module to converge. The two-dimensional joint point coordinates of the image are therefore expressed in the form of a thermodynamic diagram to form a target thermodynamic diagram. The width and height of the target thermodynamic diagram are each 1/4 of those of the RGB image, each pixel is assigned a corresponding probability value, and the values form a Gaussian centered on the joint point coordinates, with the probability value at the joint point itself equal to 1. For the joint point $\mu_i = (x_i, y_i)$, the corresponding target thermodynamic diagram $h_i$ at pixel $(x, y)$ is expressed as

$h_i(x, y) = \exp\left(-\dfrac{(x - x_i)^2 + (y - y_i)^2}{2\delta^2}\right)$

where $\delta$ is typically 1/32 of the target thermodynamic diagram size.
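A Gaussian target thermodynamic diagram of this form can be generated as in the sketch below; the use of NumPy and the exact heatmap dimensions are assumptions.

```python
import numpy as np

def make_target_heatmap(joint_xy, heatmap_size):
    # joint_xy: joint point coordinates (x_i, y_i), already scaled to heatmap resolution.
    # heatmap_size: (height, width), each 1/4 of the corresponding RGB image dimension.
    h, w = heatmap_size
    delta = max(h, w) / 32.0  # delta is typically 1/32 of the target heatmap size
    ys, xs = np.mgrid[0:h, 0:w]
    x_i, y_i = joint_xy
    # Gaussian centred on the joint point; the value at the joint point itself is 1.
    return np.exp(-((xs - x_i) ** 2 + (ys - y_i) ** 2) / (2.0 * delta ** 2))
```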
Similarly, sparse joint point depth value supervision signals make the depth estimation module difficult to train, and the joint point depth values, the joint point pixel coordinates and the original image are not effectively aligned. The K joint point depth value labels are therefore expanded based on the target thermodynamic diagram representation. Specifically, the depth value of the area near a joint point of the human body is set consistent with the depth value of the joint point, and the depth value region has the same extent as the thermodynamic diagram region. For the joint point $\mu_i = (x_i, y_i)$ with depth value $d_i$, the expanded target depth map $D_i$ assigns the depth value $d_i$ to the pixels in the neighborhood of the joint point covered by the corresponding target thermodynamic diagram.
the joint point supervision signals are enriched through the target joint point depth map, and the problem of difficult convergence of the depth estimation network caused by sparse labels is effectively solved.
In some embodiments, the backbone network of the 2D gesture estimation module is HRNet. HRNet generates 2D gesture feature maps at different resolutions through a plurality of parallel branches; the number of channels is K, where K is a positive integer, and a single-channel feature map represents the 2D gesture features of one class of joint points. Each branch uses an independent attention module, and the 2D gesture feature maps at the different resolutions are sampled to a specified size and summed feature point by feature point to form the thermodynamic diagram;
the joint point depth estimation module comprises an encoder and a decoder. The backbone network of the encoder is ResNet18, and the decoder comprises successive convolution and up-sampling layers. The joint point depth estimation module predicts depth feature maps at different resolutions relative to the RGB image; the number of channels is K, and a single-channel depth feature map represents the depth features of one class of joint points. The depth feature maps at the different resolutions are sampled to a specified size and summed feature point by feature point to form the depth map. Through this scheme, the 2D gesture estimation module and the joint point depth estimation module can be effectively modeled, laying a foundation for extracting the three-dimensional coordinates from the input image.
In the present embodiment, the High-Resolution Network (HRNet) is a neural network for human body posture estimation that is capable of generating high-resolution feature representations. ResNet18 is a classical deep convolutional neural network model with good feature extraction and classification capabilities. The up-sampling layer is used to sample a low-resolution feature map to a higher resolution. The specified size is a size set in advance.
Specifically, an RGB image with 256×256 resolution is input into the 2D gesture estimation module, whose backbone network is HRNet. HRNet generates 2D gesture feature maps at 1/4, 1/8 and 1/16 resolution through three parallel resolution branches; the number of channels is K, each channel feature map represents the 2D gesture features of one class of joint points, and information is exchanged between the branches by down-sampling and up-sampling. In order to make the 2D gesture estimation module focus more on the joint points, each branch uses an independent attention module to obtain the 2D gesture feature maps at the different resolutions. In this embodiment, the attention module is a high-dimensional attention module. Taking the 1/4-resolution 2D gesture feature map $x$ as an example, convolutions $f$ with 3×3 kernels and stride 1 are applied to $x$ to generate new feature maps $q$, $k$ and $v$. The similarity $s$ between the new feature maps $q$ and $k$ is obtained with a Gaussian kernel function, and $s$ also represents how much of the gesture features is retained. The high-dimensional attention module is computed as

$x' = s \cdot f_v(x) + x$

where $x'$ denotes the gesture feature map obtained with the high-dimensional attention module; $f_v(x)$ denotes the feature value map $v$ obtained by linearly transforming the gesture feature map $x$ with the convolution $f_v$; $f_q(x)$ is a convolution $f_q$ of the same size but different parameters, yielding the transformed feature query value $q$; similarly, $f_k(x)$ is a convolution $f_k$ with different parameters, yielding the transformed feature key value $k$; $s$ is the similarity between the query value $q$ and the key value $k$ computed with a Gaussian kernel function and is used to select the most similar feature values; and $\delta$ controls the range of action of the Gaussian kernel.
Features with high similarity in the new feature maps are largely retained, and irrelevant features are removed by the high-dimensional attention module, so that gesture features at the different resolutions are obtained and the 2D gesture estimation module focuses more on the joint point information. Finally, the 2D gesture feature maps at the different resolutions are sampled to the specified size relative to the input image and summed feature point by feature point to form the target 2D gesture thermodynamic diagram H.
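The per-branch attention step can be sketched as follows; the exact form of the Gaussian-kernel similarity and the per-position reduction over channels are assumptions based on the description above, not a verbatim reproduction of the patent's module.

```python
import torch
import torch.nn as nn

class HighDimAttention(nn.Module):
    # Sketch of the high-dimensional attention module: 3x3, stride-1 convolutions
    # produce query/key/value maps q, k, v; a Gaussian kernel of q and k gives the
    # similarity s; the output is s * f_v(x) + x (residual connection).
    def __init__(self, channels, delta=1.0):
        super().__init__()
        self.f_q = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.f_k = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.f_v = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.delta = delta  # controls the range of action of the Gaussian kernel

    def forward(self, x):
        q, k, v = self.f_q(x), self.f_k(x), self.f_v(x)
        # Per-position Gaussian-kernel similarity between q and k (assumed form).
        s = torch.exp(-((q - k) ** 2).sum(dim=1, keepdim=True) / (2.0 * self.delta ** 2))
        return s * v + x
```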
The joint point depth estimation module includes an encoder that uses ResNet18 as its backbone network and a decoder that predicts the depth map using successive convolution and up-sampling layers. The joint point depth estimation module adopts three parallel resolution branches to predict depth feature maps at 1/4, 1/8 and 1/16 of the RGB image resolution; the number of channels is K, and each channel depth feature map represents the depth features of one class of joint points. Finally, the multi-resolution depth feature maps are sampled to 1/4 of the input image size and summed feature point by feature point to form the target depth map D.
Optionally, in order to make the joint point depth estimation module pay more attention to the joint point information, after the depth feature maps are acquired, the depth features $d'_i$ at the different resolutions are further refined by the high-dimensional attention module.
In some embodiments, the 3D gesture fusion module includes a fully connected layer and three bottleneck layers, where each bottleneck layer consists of 1×1, 3×3 and 1×1 convolutions. The depth map and the thermodynamic diagram are stacked in the channel direction to form a 3D gesture feature map, and the fully connected layer is used to convert the 3D gesture feature map into the three-dimensional coordinates of the K joint points. Through this scheme, the accuracy of joint point coordinate prediction can be effectively improved.
In this embodiment, the bottleneck layer is a special residual structure, which is used to reduce parameters and computation.
Specifically, a camera model can be used to describe the projection relationship from a three-dimensional body to the imaging plane, and the 3D gesture fusion module aims to model the inverse mapping from the pixel plane coordinate system to the camera coordinate system. Given the 2D gesture features and the joint point depth features, the 3D gesture fusion module implicitly learns the camera intrinsic parameters and models the conversion from the 2D gesture features and the depth features to the 3D gesture coordinates. The 3D gesture fusion module consists of three bottleneck layers and one fully connected layer, where each bottleneck layer consists of 1×1, 3×3 and 1×1 convolutions. The 3D gesture fusion module stacks the depth map and the thermodynamic diagram in the channel direction to form a 3D gesture feature map of size H×W×2K, and then uses the fully connected layer to convert the 3D gesture feature map into the three-dimensional coordinates of the K joint points (3K values in total).
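A sketch of this fusion step is given below; the channel widths inside the bottleneck layers and the fixed feature-map size fed to the fully connected layer are assumptions made so the example is self-contained.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # Residual bottleneck: 1x1 reduce, 3x3, 1x1 expand.
    def __init__(self, channels, mid_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid_channels, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)

class Fusion3D(nn.Module):
    # 3D gesture fusion: stack depth map and thermodynamic diagram along channels
    # (2K channels), pass through three bottleneck layers, then map to 3K coordinates.
    def __init__(self, num_joints, feat_h=64, feat_w=64):
        super().__init__()
        channels = 2 * num_joints
        self.bottlenecks = nn.Sequential(
            *[Bottleneck(channels, max(channels // 2, 1)) for _ in range(3)]
        )
        self.fc = nn.Linear(channels * feat_h * feat_w, 3 * num_joints)
        self.num_joints = num_joints

    def forward(self, depth_map, heatmap):
        x = torch.cat([depth_map, heatmap], dim=1)  # stack in the channel direction
        x = self.bottlenecks(x)
        coords = self.fc(x.flatten(start_dim=1))    # 3K joint point coordinate values
        return coords.view(-1, self.num_joints, 3)
```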
Optionally, in order to save training time, the training of the joint point prediction model can be divided into two stages: the joint point prediction model is first pre-trained on a publicly available training data set for 3D human body posture estimation, and once the model tends to be stable, it is fine-tuned on the indoor data set collected by the 3D data acquisition module.
Example 2
Fig. 2 is a flowchart of a joint point prediction method according to a second embodiment of the present invention. The method may be performed by a joint point prediction device, which may be implemented in hardware and/or software and may be configured in a computer or server having a joint point prediction function. As shown in fig. 2, the method includes:
s210, obtaining a prediction set of 3D human body posture estimation.
In this embodiment, the prediction set of 3D human body posture estimation is used by the joint point prediction model to predict the joint point coordinates.
Specifically, an image or video containing a human body is acquired by an image acquisition device, wherein the image or video can be acquired indoors or outdoors, and a 3D human body posture estimated prediction set is formed based on the acquired images or videos.
By way of example, the prediction set of 3D human body posture estimation can be obtained through an image acquisition device such as a video monitoring camera. Compared with a depth camera, an ordinary video monitoring camera is inexpensive and widely deployed, which reduces the cost of joint point prediction and at the same time improves the generalization capability of joint point prediction. The video monitoring module can adopt 12 acquisition cameras (one device every 30 degrees) to ensure that multi-view video data can be obtained for the same user behavior. The cameras generate mp4-format video stream data, the video stream data are cut into multi-frame images with OpenCV, and the multi-frame images are then used as the prediction set for 3D human body posture estimation. OpenCV is a cross-platform computer vision and machine learning software library.
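Cutting the mp4 video stream into single-frame images with OpenCV can be done as in the sketch below; the file path and the sampling stride are illustrative.

```python
import cv2

def video_to_frames(video_path, stride=1):
    # Read an mp4 video stream and keep every `stride`-th frame as an image
    # for the 3D human body posture estimation prediction set.
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

# Hypothetical usage: frames = video_to_frames("camera_01.mp4", stride=5)
```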
S220, inputting a single frame image in the prediction set into a joint point prediction model to obtain a joint point prediction result; the joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module, wherein the joint point depth estimation module is used for extracting joint point depth characteristics, the 2D gesture estimation module is used for extracting 2D gesture characteristics, the input of the 3D gesture fusion module comprises a depth map and a thermodynamic diagram, and the output of the 3D gesture fusion module is a joint point prediction result.
Specifically, a single frame image in the prediction set is input into both the joint point depth estimation module and the 2D gesture estimation module of the joint point prediction model. The joint point depth estimation module extracts the joint point depth features, and the 2D gesture estimation module extracts the 2D gesture features. The depth features and the 2D gesture features are then stacked at the channel level and input into the 3D gesture fusion module, which finally outputs the joint point prediction result, i.e., the prediction of the joint point coordinates.
According to the technical scheme provided by the second embodiment of the invention, the joint point prediction result is obtained by acquiring the prediction set of the 3D human body posture estimation and inputting a single frame image in the prediction set into the joint point prediction model. By the technical scheme, the problem of low prediction precision in the existing joint point prediction method is solved, and the accuracy of joint point prediction is effectively improved.
Illustratively, the node prediction model is trained based on a training data set generated in an indoor scenario. Fig. 3 is a schematic diagram of a joint prediction framework according to an embodiment of the present invention. The schematic of the joint prediction framework may represent a training process of the joint prediction model and a prediction process of the joint prediction model. As can be seen from fig. 3, the joint prediction framework mainly comprises a data acquisition module, a data processing module, a training module and an estimation module, wherein the data processing module and the training module are applied to the training process of the joint prediction model, and the estimation module is applied to the joint prediction process. The data acquisition module comprises a 3D data acquisition module and a video monitoring module, wherein the video monitoring module is used in the joint point prediction process; the data processing module comprises a label alignment module and a data enhancement module; the training module comprises a parallel reasoning module, a 3D gesture fusion module and a loss construction module, wherein the parallel reasoning module comprises a 2D gesture estimation module and a joint point depth estimation module; the estimation module comprises a parallel reasoning module and a 3D gesture fusion module.
When the joint point prediction model is trained, a training data set is acquired through the 3D data acquisition module in the data acquisition module, and the training data set is then processed by the label alignment module and the data enhancement module in the data processing module. The processed training data set is input into the training module, feature learning is carried out by the 2D gesture estimation module and the joint point depth estimation module in the parallel reasoning module, the learned features are input into the 3D gesture fusion module to obtain the prediction result on the training data set, and the joint point prediction model is then trained with the loss function constructed by the loss construction module based on the prediction result and the label data from the 3D data acquisition module.
When the trained joint point prediction model is utilized for prediction, firstly, a video monitoring module in a data acquisition module is used for acquiring a prediction set, the acquired prediction set is preprocessed and then is input into an estimation module, a 2D gesture estimation module and a joint point depth estimation module in a parallel reasoning module in the estimation module are utilized for feature extraction, and the extracted features are input into a 3D gesture fusion module to realize joint point prediction.
Fig. 4 is a schematic diagram of the implementation process of the 2D pose estimation module according to an embodiment of the present invention. As shown in fig. 4, after a video frame from the prediction set is input into the 2D pose estimation module, features are learned by the pose estimation module and then further refined by 3 high-dimensional attention modules; after sampling, the 2D gesture thermodynamic diagram is obtained.
Fig. 5 is a schematic diagram of the implementation process of the joint point depth estimation module according to an embodiment of the present invention. As shown in fig. 5, after a video frame from the prediction set is input into the joint point depth estimation module, features are learned by the depth estimation module and then further refined by 3 high-dimensional attention modules; after sampling, the depth map is obtained.
Fig. 6 is a schematic diagram of an implementation process of a 3D gesture fusion module according to an embodiment of the present invention, where, as shown in fig. 6, the obtained depth map and the 2D gesture thermodynamic diagram are input into the 3D gesture fusion module, so as to obtain three-dimensional coordinates of a joint point.
Example 3
Fig. 7 is a schematic structural diagram of a training device for a joint prediction model according to a third embodiment of the present invention. As shown in fig. 7, the apparatus includes:
a training set acquisition module 31, configured to acquire a training data set of 3D human body posture estimation;
The model training module 32 is configured to train a joint prediction model based on a training data set, where the joint prediction model includes a joint depth estimation module, a 2D pose estimation module, and a 3D pose fusion module, the joint depth estimation module is configured to extract a joint depth feature of an input image to form a depth map, the 2D pose estimation module is configured to extract a 2D pose feature of the input image to form a thermodynamic diagram, the input of the 3D pose fusion module includes the depth map and the thermodynamic diagram, and the output of the 3D pose fusion module is a joint prediction result.
The technical scheme provided by the third embodiment of the invention addresses the problem that existing 3D human body posture estimation is affected by depth blurring, multi-person occlusion and the like, which lowers the accuracy of joint point prediction and in turn degrades 3D human body posture estimation precision, and thereby improves both joint point prediction accuracy and 3D human body posture estimation precision.
Optionally, the loss of the joint point prediction model includes thermodynamic diagram loss, depth information loss, joint point 3D pose information loss, and multiview consistency projection loss.
Optionally, the training set obtaining module 31 includes:
the information generating unit is used for acquiring 3D human body posture data in an indoor scene by using the depth camera and generating calibrated image information and depth information according to the 3D human body posture data;
The training set determining unit is used for marking the joint points of the human body target according to the image information and the depth information to obtain a three-channel RGB image and a label with the joint point information and the camera parameters, and the three-channel RGB image and the label with the joint point information and the camera parameters are used as a training data set for 3D human body posture estimation.
Optionally, the backbone network of the 2D gesture estimation module is HRNet. HRNet generates 2D gesture feature maps at different resolutions through a plurality of parallel branches; the number of channels is K, where K is a positive integer, and a single-channel feature map represents the 2D gesture features of one class of joint points. Each branch uses an independent attention module, and the 2D gesture feature maps at the different resolutions are sampled to a specified size and summed feature point by feature point to form the thermodynamic diagram;
the joint point depth estimation module comprises an encoder and a decoder. The backbone network of the encoder is ResNet18, and the decoder comprises successive convolution and up-sampling layers. The joint point depth estimation module predicts depth feature maps at different resolutions relative to the RGB image; the number of channels is K, and a single-channel depth feature map represents the depth features of one class of joint points. The depth feature maps at the different resolutions are sampled to a specified size and summed feature point by feature point to form the depth map.
Optionally, the 3D gesture fusion module includes a fully connected layer and three bottleneck layers, where each bottleneck layer consists of 1×1, 3×3 and 1×1 convolutions. The depth map and the thermodynamic diagram are stacked in the channel direction to form a 3D gesture feature map, and the fully connected layer is used to convert the 3D gesture feature map into the three-dimensional coordinates of the K joint points.
The joint point prediction model training device provided by the embodiment of the invention can execute the joint point prediction model training method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example 4
Fig. 8 is a schematic structural diagram of a joint prediction device according to a fourth embodiment of the present invention. As shown in fig. 8, the apparatus includes:
a prediction set acquisition module 41, configured to acquire a prediction set of 3D human body posture estimation;
the prediction result determining module 42 is configured to input a single frame image in the prediction set into the joint point prediction model to obtain a joint point prediction result;
the joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module, wherein the joint point depth estimation module is used for extracting joint point depth characteristics, the 2D gesture estimation module is used for extracting 2D gesture characteristics, the input of the 3D gesture fusion module comprises a depth map and a thermodynamic diagram, and the output of the 3D gesture fusion module is a joint point prediction result.
The joint point prediction device provided by the embodiment of the invention can execute the joint point prediction method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example 5
Fig. 9 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 9, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the joint point prediction model training method, or the joint point prediction method.
In some embodiments, the joint point prediction model training method and the joint point prediction method may be implemented as computer programs tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the above-described joint prediction model training method, or joint prediction method, may be performed. Alternatively, in other embodiments, processor 11 may be configured to perform the joint prediction model training method, or the joint prediction method, in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS services.
It should be appreciated that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solution of the present invention can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A joint point prediction model training method, characterized by comprising the following steps:
acquiring a training data set of 3D human body posture estimation;
based on the training data set, training a joint point prediction model, wherein the joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module, the joint point depth estimation module is used for extracting joint point depth features of an input image to form a depth map, the 2D gesture estimation module is used for extracting 2D gesture features of the input image to form a thermodynamic diagram, the input of the 3D gesture fusion module comprises the depth map and the thermodynamic diagram, and the output of the 3D gesture fusion module is a joint point prediction result.
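By way of illustration only, the module wiring recited in claim 1 can be sketched in PyTorch as follows. The class and parameter names (JointPointPredictionModel, depth_module, pose2d_module, fusion_module) are assumptions introduced for the sketch, not taken from the patent; the internals of the three modules are left abstract, and only the claimed data flow (input image to depth map and thermodynamic diagram, i.e. joint point heatmap, to joint point prediction result) is shown.

import torch
import torch.nn as nn

class JointPointPredictionModel(nn.Module):
    # Wires the three modules of claim 1: joint point depth estimation,
    # 2D gesture (pose) estimation, and 3D gesture fusion.
    def __init__(self, depth_module: nn.Module, pose2d_module: nn.Module,
                 fusion_module: nn.Module):
        super().__init__()
        self.depth_module = depth_module    # RGB image -> K-channel depth map
        self.pose2d_module = pose2d_module  # RGB image -> K-channel thermodynamic diagram (heatmap)
        self.fusion_module = fusion_module  # (depth map, heatmap) -> K x 3 joint coordinates

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        depth_map = self.depth_module(image)   # (B, K, H, W)
        heatmap = self.pose2d_module(image)    # (B, K, H, W)
        return self.fusion_module(depth_map, heatmap)  # (B, K, 3) joint point prediction result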
2. The method of claim 1, wherein the loss of the joint point prediction model comprises thermodynamic diagram loss, depth information loss, joint point 3D pose information loss, and multi-view consistency projection loss.
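By way of illustration only, the composite loss named in claim 2 can be written as a weighted sum of its four terms. The weights and the use of mean-squared error for each term are assumptions; the patent only names the loss components.

import torch.nn.functional as F

def total_loss(pred_heatmap, gt_heatmap,
               pred_depth, gt_depth,
               pred_joints_3d, gt_joints_3d,
               proj_view_a, proj_view_b,
               weights=(1.0, 1.0, 1.0, 0.1)):
    l_heatmap = F.mse_loss(pred_heatmap, gt_heatmap)     # thermodynamic diagram (heatmap) loss
    l_depth = F.mse_loss(pred_depth, gt_depth)           # depth information loss
    l_pose3d = F.mse_loss(pred_joints_3d, gt_joints_3d)  # joint point 3D pose information loss
    l_mvc = F.mse_loss(proj_view_a, proj_view_b)         # multi-view consistency projection loss
    return (weights[0] * l_heatmap + weights[1] * l_depth
            + weights[2] * l_pose3d + weights[3] * l_mvc)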
3. The method of claim 1, wherein the acquiring a training data set of 3D human body posture estimation comprises:
acquiring 3D human body posture data in an indoor scene by using a depth camera and generating calibrated image information and depth information according to the 3D human body posture data;
and marking the joint points of the human body target according to the image information and the depth information to obtain a three-channel RGB image and a label carrying the joint point information and the camera parameters, and using the three-channel RGB image and the label as the training data set for 3D human body posture estimation.
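By way of illustration only, one training sample produced by the procedure of claim 3 could be represented as follows. All field names and array shapes are assumptions introduced for the sketch; the patent only requires a three-channel RGB image together with a label carrying joint point information and camera parameters.

from dataclasses import dataclass
import numpy as np

@dataclass
class PoseTrainingSample:
    image: np.ndarray         # (H, W, 3) calibrated three-channel RGB image
    joints_2d: np.ndarray     # (K, 2) annotated joint point pixel coordinates
    joints_depth: np.ndarray  # (K,) per-joint depth from the depth camera
    intrinsics: np.ndarray    # (3, 3) camera intrinsic matrix
    extrinsics: np.ndarray    # (4, 4) camera extrinsic (pose) matrix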
4. The method according to claim 1, wherein the backbone network of the 2D gesture estimation module is HRNet; the HRNet generates 2D gesture feature maps of different resolutions through a plurality of parallel branches, the number of channels is K, K being a positive integer, and a single-channel feature map represents the 2D gesture features of one class of joint points; each branch uses an independent attention module; and the 2D gesture feature maps of different resolutions are resampled to a specified size and summed at corresponding feature points to form the thermodynamic diagram;
The joint point depth estimation module comprises an encoder and a decoder, wherein the backbone network of the encoder is ResNet18 and the decoder comprises successive convolution layers and up-sampling layers; the joint point depth estimation module is used for predicting depth feature maps of different resolutions relative to the RGB image, the number of channels is K, and a single-channel depth feature map represents the depth features of one class of joint points; and the depth feature maps of different resolutions are resampled to a specified size and summed at corresponding feature points to form the depth map.
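By way of illustration only, the aggregation step shared by the two branches of claim 4, in which K-channel feature maps predicted at different resolutions are brought to a specified size and summed at corresponding feature points, can be sketched as below. The target size and the bilinear interpolation mode are assumptions.

import torch
import torch.nn.functional as F

def aggregate_multiscale(feature_maps, out_size=(64, 64)):
    # feature_maps: list of tensors of shape (B, K, h_i, w_i) at different resolutions.
    resized = [F.interpolate(fm, size=out_size, mode="bilinear", align_corners=False)
               for fm in feature_maps]
    return torch.stack(resized, dim=0).sum(dim=0)  # (B, K, out_h, out_w) heatmap or depth map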
5. The method of claim 1, wherein the 3D gesture fusion module comprises a fully connected layer and three bottleneck layers, each bottleneck layer comprising three convolutions of sizes 1×1, 3×3 and 1×1; the depth map and the thermodynamic diagram are stacked in the channel direction to form a 3D gesture feature map, and the fully connected layer is used for converting the 3D gesture feature map into the three-dimensional coordinates of the K joint points.
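By way of illustration only, the fusion head of claim 5 can be sketched as below: the depth map and the thermodynamic diagram are stacked along the channel axis, passed through three bottleneck layers of 1×1, 3×3 and 1×1 convolutions, and a fully connected layer maps the result to the three-dimensional coordinates of the K joint points. The channel widths, residual connections, activation functions and the pooling step before the fully connected layer are assumptions added so the sketch runs; they are not specified by the claim.

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels: int, mid: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
        )

    def forward(self, x):
        return torch.relu(x + self.block(x))  # residual connection assumed

class GestureFusion3D(nn.Module):
    def __init__(self, num_joints: int):
        super().__init__()
        channels = 2 * num_joints  # depth map (K channels) + thermodynamic diagram (K channels)
        self.bottlenecks = nn.Sequential(*[Bottleneck(channels, channels // 2) for _ in range(3)])
        self.pool = nn.AdaptiveAvgPool2d(1)            # assumed spatial reduction before the FC layer
        self.fc = nn.Linear(channels, num_joints * 3)  # fully connected layer -> K x 3 coordinates
        self.num_joints = num_joints

    def forward(self, depth_map, heatmap):
        x = torch.cat([depth_map, heatmap], dim=1)     # stack in the channel direction
        x = self.bottlenecks(x)
        x = self.pool(x).flatten(1)
        return self.fc(x).view(-1, self.num_joints, 3)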
6. A joint point prediction method, comprising:
acquiring a prediction set of 3D human body posture estimation;
inputting a single-frame image from the prediction set into a joint point prediction model to obtain a joint point prediction result;
wherein the joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module, the joint point depth estimation module is used for extracting joint point depth features of the single-frame image to form a depth map, the 2D gesture estimation module is used for extracting 2D gesture features of the single-frame image to form a thermodynamic diagram, the input of the 3D gesture fusion module comprises the depth map and the thermodynamic diagram, and the output of the 3D gesture fusion module is the joint point prediction result.
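By way of illustration only, applying the prediction method of claim 6 to a single frame could look as follows. The preprocessing (scaling to [0, 1] and the channel ordering) and the tensor layout are assumptions; model stands for any trained joint point prediction model with the structure described above.

import numpy as np
import torch

def predict_joint_points(model, frame: np.ndarray) -> torch.Tensor:
    # frame: (H, W, 3) uint8 RGB image taken from the prediction set.
    model.eval()
    with torch.no_grad():
        x = torch.from_numpy(frame).float().permute(2, 0, 1).unsqueeze(0) / 255.0
        joints_3d = model(x)           # (1, K, 3)
    return joints_3d.squeeze(0)        # (K, 3) joint point prediction result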
7. A joint prediction model training device, comprising:
the training set acquisition module is used for acquiring a training data set of 3D human body posture estimation;
the model training module is used for training a joint point prediction model based on the training data set, wherein the joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module, the joint point depth estimation module is used for extracting joint point depth features of an input image to form a depth map, the 2D gesture estimation module is used for extracting 2D gesture features of the input image to form a thermodynamic diagram, the input of the 3D gesture fusion module comprises the depth map and the thermodynamic diagram, and the output of the 3D gesture fusion module is a joint point prediction result.
8. A joint prediction apparatus, comprising:
the prediction set acquisition module is used for acquiring a prediction set of 3D human body posture estimation;
the prediction result determining module is used for inputting a single-frame image from the prediction set into the joint point prediction model to obtain a joint point prediction result;
wherein the joint point prediction model comprises a joint point depth estimation module, a 2D gesture estimation module and a 3D gesture fusion module, the joint point depth estimation module is used for extracting joint point depth features of the single-frame image to form a depth map, the 2D gesture estimation module is used for extracting 2D gesture features of the single-frame image to form a thermodynamic diagram, the input of the 3D gesture fusion module comprises the depth map and the thermodynamic diagram, and the output of the 3D gesture fusion module is the joint point prediction result.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the joint point prediction model training method of any one of claims 1-5 or the joint point prediction method of claim 6.
10. A computer-readable storage medium storing computer instructions which, when executed, cause a processor to implement the joint point prediction model training method of any one of claims 1-5 or the joint point prediction method of claim 6.
CN202311823987.2A 2023-12-27 2023-12-27 Model training method, joint point prediction method, device, equipment and storage medium Pending CN117558066A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311823987.2A CN117558066A (en) 2023-12-27 2023-12-27 Model training method, joint point prediction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311823987.2A CN117558066A (en) 2023-12-27 2023-12-27 Model training method, joint point prediction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117558066A true CN117558066A (en) 2024-02-13

Family

ID=89823307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311823987.2A Pending CN117558066A (en) 2023-12-27 2023-12-27 Model training method, joint point prediction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117558066A (en)

Similar Documents

Publication Publication Date Title
CN108764048B (en) Face key point detection method and device
CN112862874B (en) Point cloud data matching method and device, electronic equipment and computer storage medium
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
CN113393522A (en) 6D pose estimation method based on monocular RGB camera regression depth information
CN111914756A (en) Video data processing method and device
CN115953468A (en) Method, device and equipment for estimating depth and self-movement track and storage medium
CN114119371B (en) Video super-resolution model training method and device and video super-resolution processing method and device
CN117132649A (en) Ship video positioning method and device for artificial intelligent Beidou satellite navigation fusion
CN102663369A (en) Human motion tracking method on basis of SURF (Speed Up Robust Feature) high efficiency matching kernel
CN117558066A (en) Model training method, joint point prediction method, device, equipment and storage medium
CN115330851A (en) Monocular depth estimation method and device, electronic equipment, storage medium and vehicle
CN113610856A (en) Method and device for training image segmentation model and image segmentation
CN103632357A (en) Image super-resolution enhancing method based on illumination separation
CN112819874A (en) Depth information processing method, device, apparatus, storage medium, and program product
CN112561995A (en) Real-time efficient 6D attitude estimation network, construction method and estimation method
CN115909255B (en) Image generation and image segmentation methods, devices, equipment, vehicle-mounted terminal and medium
Li et al. Automatic layered rgb‐d scene flow estimation with optical flow field constraint
CN116229209B (en) Training method of target model, target detection method and device
CN116168132B (en) Street view reconstruction model acquisition method, device, equipment and medium
CN113643348B (en) Face attribute analysis method and device
CN112927291B (en) Pose determining method and device of three-dimensional object, electronic equipment and storage medium
CN113240796B (en) Visual task processing method and device, computer readable medium and electronic equipment
CN116912645A (en) Three-dimensional target detection method and device integrating texture and geometric features
CN116342883A (en) Method for combining end-to-end unsupervised laser odometer with semantic segmentation
CN115170914A (en) Pose estimation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination