CN117830564B - Three-dimensional virtual human model reconstruction method based on gesture distribution guidance - Google Patents

Three-dimensional virtual human model reconstruction method based on gesture distribution guidance

Info

Publication number
CN117830564B
CN117830564B (application CN202410250378.0A)
Authority
CN
China
Prior art keywords
module
result
video
sub
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410250378.0A
Other languages
Chinese (zh)
Other versions
CN117830564A (en)
Inventor
王宏升
林峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202410250378.0A
Publication of CN117830564A
Application granted
Publication of CN117830564B
Legal status: Active
Anticipated expiration

Abstract

The method is suitable for generating a three-dimensional human body mesh sequence from video. Video image features, semantic features and video sequence features corresponding to the video data are extracted; the video image features and the semantic features are input into a U-shaped neural network in a reconstruction model to obtain an intermediate layer output result; the semantic features, the video sequence features and the intermediate layer output result of the U-shaped neural network are input into a stable diffusion network in the reconstruction model to obtain human body posture features; an original distribution corresponding to the video data is determined, and the original distribution of the human body in the video data is converted by a flow-based method to obtain a posture distribution; the human body posture features are reinforced according to the posture distribution to obtain reinforced human body posture features; and three-dimensional human body model reconstruction is performed according to the reinforced human body posture features, so that the accuracy of three-dimensional human body model reconstruction is improved.

Description

Three-dimensional virtual human model reconstruction method based on gesture distribution guidance
Technical Field
The specification relates to the field of three-dimensional reconstruction, in particular to a three-dimensional virtual human model reconstruction method guided by gesture distribution.
Background
In recent years, computer vision technology has made significant progress in the field of three-dimensional virtual human reconstruction. As an important representative in the three-dimensional digital world, three-dimensional virtual people have irreplaceable roles in enhancing man-machine interaction, improving the fidelity of the virtual world and enriching the presentation mode of digital content. The three-dimensional virtual person reconstruction has important application value in the fields of virtual reality, augmented reality, medical treatment, rehabilitation, animation production and the like.
In conventional three-dimensional virtual human reconstruction methods, surface reconstruction or segmentation is generally performed based on image data, and a three-dimensional human body model is then obtained through modeling, deformation and other techniques. The main problem with these methods is that they cannot effectively process timing information and cannot capture the dynamic changes of the human body in motion, resulting in a reconstructed human body mesh that is inaccurate and unstable.
Therefore, how to improve the accuracy of the reconstructed three-dimensional model of the human body is a problem to be solved.
Disclosure of Invention
The present disclosure provides a three-dimensional virtual human model reconstruction method guided by pose distribution, so as to partially solve the above-mentioned problems in the prior art.
The technical scheme adopted in the specification is as follows:
the specification provides a gesture-distribution-guided three-dimensional virtual human model reconstruction method, which comprises the following steps:
Acquiring video data;
inputting video data into a reconstruction model, and extracting video image features, semantic features and video sequence features corresponding to the video data;
Inputting the video image features and the semantic features into a U-shaped neural network in the reconstruction model to obtain an intermediate layer output result, inputting the semantic features, the video sequence features and the intermediate layer output result of the U-shaped neural network into a stable diffusion network in the reconstruction model to obtain human body posture features, determining original distribution corresponding to the video data, and inputting the original distribution into a flow module in the reconstruction model to obtain posture distribution;
strengthening the human body posture features according to the posture distribution to obtain the strengthened human body posture features;
And reconstructing a dynamic human body three-dimensional model corresponding to the video data according to the reinforced human body posture characteristics.
Optionally, inputting the video data into a reconstruction model, and extracting to obtain video image features corresponding to the video data, which specifically includes:
Decoding video in video data into images frame by frame, sampling a plurality of frame images from the video data at equal intervals, and obtaining a video frame sequence containing the plurality of frame images;
Processing the video frame sequence to obtain a processed video frame sequence, wherein the processing comprises at least one of cutting and scaling the video frames in the video frame sequence;
inputting the processed video frame sequence into a multi-resolution convolutional neural network, and extracting to obtain video image features corresponding to the video data;
Inputting video data into a reconstruction model, and extracting semantic features corresponding to the video data, wherein the method specifically comprises the following steps:
Inputting the video data into a semantic extraction network in the reconstruction model, and extracting to obtain semantic features corresponding to the video data;
inputting video data into a reconstruction model, and extracting video sequence features corresponding to the video data, wherein the video sequence features specifically comprise:
and inputting the video data into a video sequence encoder in the reconstruction model, and extracting and obtaining video sequence characteristics corresponding to the video data.
Optionally, the U-shaped neural network comprises a plurality of sub-modules connected in series, and for each sub-module, the sub-module comprises a spatial convolution module, a spatial attention module and a cross attention module;
Inputting the video image features and the semantic features into a U-shaped neural network in the reconstruction model to obtain an intermediate layer output result, wherein the method specifically comprises the following steps of:
Inputting the video image features and the semantic features into a U-shaped neural network in the reconstruction model, aiming at each sub-module connected in series in the U-shaped neural network, inputting a spatial convolution result output by a spatial convolution module of the sub-module into a spatial attention module of the sub-module to obtain a weighted convolution result for carrying out attention weighting on the spatial convolution result, and inputting the weighted convolution result and the semantic features into a cross attention module of the sub-module to obtain an output result of the sub-module;
And inputting the output result of the sub-module into the next sub-module until the output result of the last sub-module is obtained.
Optionally, the stable diffusion network includes several sub-modules connected in series, and for each sub-module, the sub-module includes: the system comprises a space convolution module, a space attention module, a cross attention module, a time convolution module and a time attention module, wherein each sub-module in the stable diffusion network corresponds to each sub-module of the U-shaped neural network one by one;
Inputting the semantic features, the video sequence features and the middle layer output result of the U-shaped neural network into a stable diffusion network in the reconstruction model to obtain human body posture features, wherein the method specifically comprises the following steps of:
Inputting the semantic features and the video sequence features into the stable diffusion network, and inputting the spatial convolution results output by the spatial convolution module in the submodule corresponding to the submodule of the U-shaped neural network for each submodule in the stable diffusion network into the submodule;
The space convolution module in the submodule is used for carrying out space convolution on input data to obtain a space convolution result, the space convolution result is spliced with the space convolution result of the submodule corresponding to the submodule through the U-shaped neural network, and the spliced result is input into the space attention module to obtain a space attention result;
Removing a part belonging to the U-shaped neural network in the spatial attention result to obtain a residual space-time attention result, and inputting the residual space-time attention result and the semantic features into a cross attention module to obtain a cross attention result, wherein the data input into a first sub-module is the video sequence features;
And inputting the cross attention result into the time convolution module, carrying out time convolution on the cross attention result according to a sliding window preset in time, inputting the time convolution result into the time attention module to obtain an output result of the submodule, and inputting the output result of the submodule into a next submodule until the output result of the last submodule is obtained to be used as the human body posture characteristic output by the stable diffusion network.
Optionally, inputting the original distribution into a flow module in the reconstruction model to obtain a gesture distribution, which specifically includes:
Inputting the original distribution into a flow module in the reconstruction model to obtain k diffeomorphic mapping results, wherein the i-th diffeomorphic mapping result is determined by the (i-1)-th diffeomorphic mapping result and a transformation parameter, and the 1st diffeomorphic mapping result is obtained by applying a diffeomorphic mapping with the transformation parameter to the original distribution;
And obtaining the gesture distribution according to a preset probability density transformation method and the k diffeomorphic mapping results.
Optionally, training the reconstruction model specifically includes:
acquiring a video sample and marking information corresponding to the video sample;
Inputting a video sample into a reconstruction model, and extracting video image features, semantic features and video sequence features corresponding to the video sample;
Inputting video image features and semantic features corresponding to the video samples into a U-shaped neural network in the reconstruction model to obtain an intermediate layer output result, inputting the semantic features corresponding to the video samples, the video sequence features and the intermediate layer output result of the U-shaped neural network into a stable diffusion network in the reconstruction model to obtain human body posture features, determining original distribution corresponding to the video samples, and inputting the original distribution corresponding to the video samples into a flow module in the reconstruction model to obtain posture distribution;
strengthening the human body posture features according to the posture distribution to obtain strengthened human body posture features, and determining a reconstruction result corresponding to the video sample according to the strengthened human body posture features;
And training the reconstruction model by taking the difference between the minimized labeling information and the reconstruction result as an optimization target.
Optionally, training the reconstruction model with the minimized difference between the labeling information and the reconstruction result as an optimization target specifically includes:
Sampling the gesture distribution according to the labeling information to obtain sampled gesture data and probability density values corresponding to the sampled gesture data;
And training the reconstruction model by taking the minimized difference between the labeling information and the reconstruction result and the maximized probability density value as an optimization target.
The specification provides a three-dimensional virtual human model reconstruction device for gesture distribution guidance, which comprises:
the acquisition module is used for acquiring video data;
The first input module is used for inputting video data into the reconstruction model and extracting video image features, semantic features and video sequence features corresponding to the video data;
The second input module is used for inputting the video image characteristics and the semantic characteristics into a U-shaped neural network in the reconstruction model to obtain an intermediate layer output result, inputting the semantic characteristics, the video sequence characteristics and the intermediate layer output result of the U-shaped neural network into a stable diffusion network in the reconstruction model to obtain human body posture characteristics, determining original distribution corresponding to the video data, and inputting the original distribution into a flow module in the reconstruction model to obtain posture distribution;
The strengthening module is used for strengthening the human body posture characteristics according to the posture distribution to obtain the strengthened human body posture characteristics;
And the reconstruction module is used for reconstructing a dynamic human body three-dimensional model corresponding to the video data according to the reinforced human body posture characteristics.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the above-described pose distribution guided three-dimensional virtual human model reconstruction method.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-described pose distribution guided three-dimensional virtual human model reconstruction method when executing the program.
The above-mentioned at least one technical scheme that this specification adopted can reach following beneficial effect:
According to the gesture-distribution-guided three-dimensional virtual human model reconstruction method described above, video data can be acquired and input into a reconstruction model, and the video image features, semantic features and video sequence features corresponding to the video data are extracted. The video image features and the semantic features can then be input into a U-shaped neural network in the reconstruction model to obtain an intermediate layer output result, and the semantic features, the video sequence features and the intermediate layer output result of the U-shaped neural network are input into a stable diffusion network in the reconstruction model to obtain human body posture features. The original distribution corresponding to the video data is determined and input into a flow module in the reconstruction model to obtain a posture distribution; the human body posture features are reinforced according to the posture distribution to obtain reinforced human body posture features, and the dynamic three-dimensional human body model corresponding to the video data is reconstructed according to the reinforced human body posture features.
Compared with other human body mesh reconstruction methods, the main advantages of the method are as follows: 1. Compared with traditional non-generative human body mesh reconstruction methods, a diffusion model is introduced, which gives the method stronger reconstruction capability; meanwhile, owing to the strong reasoning capability of the diffusion model, the inference process has stronger anti-interference performance and robustness.
2. Compared with other space feature fusion methods, the method can avoid unnecessary noise interference in the process of fusing the space features, so that the fusion process is more stable, and the space features can be fused into the human body posture features more efficiently.
3. Compared with the diffusion model with a traditional structure, the method provided by the invention can better generate the spatial characteristics describing the distribution of the human body in space, and the cross attention mechanism can ensure the accurate and effective fusion of the attention information containing different information.
4. Compared with a method without using a semantic extraction model, the method can better guide the generation process of the diffusion model through the semantic information, so that the geometric consistency and the time continuity of the generated structure are better maintained.
5. Compared with the diffusion model with a traditional structure, the method provided by the invention can better acquire the time information contained in the video frame sequence and can keep the time continuity among frames in the process of generating the human body posture characteristics.
6. Compared with human body mesh reconstruction methods that do not use a flow model, the human body posture features can be enhanced by the human body posture distribution learned by the flow module, so that the human body posture distribution information contained in the features better conforms to the real distribution of human body postures.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate and explain the exemplary embodiments of the present specification and their description, are not intended to limit the specification unduly. In the drawings:
FIG. 1 is a schematic flow chart of a gesture-distribution-guided three-dimensional virtual human model reconstruction method provided in the present specification;
FIG. 2 is a schematic structural diagram of a reconstruction model provided in the present specification;
Fig. 3 is a schematic diagram of a correspondence between a sub-module in a U-shaped neural network and a sub-module in a stable diffusion model provided in the present specification;
FIG. 4 is a schematic diagram of a gesture-distribution-guided three-dimensional virtual human model reconstruction device provided in the present specification;
fig. 5 is a schematic view of the electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a gesture-distribution-guided three-dimensional virtual human model reconstruction method provided in the present specification, which specifically includes the following steps:
S100: video data is acquired.
S102: inputting the video data into a reconstruction model, and extracting and obtaining video image characteristics, semantic characteristics and video sequence characteristics corresponding to the video data.
In the present specification, a corresponding dynamic three-dimensional human body model can be reconstructed from the human motion represented by the video data, where the video data may be a video containing a segment of human body motion captured for a person; this three-dimensional human body model is then reconstructed by the reconstruction model described in the present specification.
Based on this, the video data can be input into the reconstruction model, and the video image features, semantic features and video sequence features corresponding to the video data are extracted. The difference between the video image features and the video sequence features is that the video image features are obtained by splitting the video into a plurality of images and then performing feature extraction, so they are biased toward image features, whereas the video sequence features are extracted directly from the complete video, so they retain more of the continuity characteristics of the video; the semantic features are features of the overall semantics of the video extracted through a semantic feature extraction network (such as a CLIP network).
Specifically, the video in the video data can be decoded into images frame by frame, a plurality of frame images are sampled from the video data at equal intervals to obtain a video frame sequence containing a plurality of frame images, the video frame sequence is processed to obtain a processed video frame sequence, the processed video frame sequence is input into a multi-resolution convolutional neural network, and the video image characteristics corresponding to the video data are extracted.
The processing of the video frame sequence may include: cropping each video frame to a 1:1 aspect ratio, and scaling each cropped video frame to a resolution of 244×244.
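For illustration only, the following is a minimal Python (PyTorch) sketch of the equal-interval sampling, cropping and scaling described above; the function name, the default of 16 sampled frames and the use of a center crop are assumptions and are not taken from the patent.

```python
import torch
import torch.nn.functional as F

def preprocess_frames(frames: torch.Tensor, num_samples: int = 16, size: int = 244) -> torch.Tensor:
    """Sample frames at equal intervals, crop to a 1:1 aspect ratio, and scale to size x size.

    frames: (T, 3, H, W) tensor of decoded video frames.
    Returns: (num_samples, 3, size, size) video frame sequence.
    """
    t = frames.shape[0]
    idx = torch.linspace(0, t - 1, num_samples).long()       # equal-interval sampling
    sampled = frames[idx]

    _, _, h, w = sampled.shape
    side = min(h, w)                                          # square center crop -> 1:1 aspect ratio
    top, left = (h - side) // 2, (w - side) // 2
    cropped = sampled[:, :, top:top + side, left:left + side]

    # Scale every cropped frame to the target resolution (244 x 244 in the description).
    return F.interpolate(cropped, size=(size, size), mode="bilinear", align_corners=False)
```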
After the processed video frame sequence is input into the multi-resolution convolutional neural network, the multi-resolution convolutional neural network is used for carrying out feature extraction on the video frame sequence frame by frame, and the video frame sequence is assumed to contain 16 sampled video frames:
In a first step, a sequence of preprocessed video frames of size (16, 3, 244, 244) is split into 16 video frames and input into a multi-resolution convolutional neural network frame by frame.
And secondly, three different resolution branches are included in the multi-resolution convolutional neural network, and feature extraction is carried out on each video frame by using a convolutional layer in parallel respectively, so that the spatial features of the video frames under different resolutions are extracted.
And thirdly, splicing the spatial characteristics of a video frame under three different resolutions and the video frame serving as the original input in the channel dimension to obtain a new characteristic diagram.
Fourth, global averaging pooling is used to convert the obtained feature map into feature vectors of size (256,56,56) that contain spatial features of the input image at different resolutions.
Each video frame goes through the second to fourth steps to obtain a feature vector of size (256, 56, 56), and the per-frame feature vectors are concatenated in the time dimension to obtain the video image features, as sketched below.
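The following is a sketch of steps two to four, assuming the 244×244 input and the (256, 56, 56) per-frame output mentioned above; the branch widths, kernel sizes, scale factors and the use of adaptive average pooling in place of the global averaging step are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionEncoder(nn.Module):
    """Per-frame feature extractor with three parallel resolution branches.

    Each frame is processed at three scales, the branch outputs are brought back
    to a common 56x56 grid, concatenated with the (resized) input frame along the
    channel dimension, and projected to a (256, 56, 56) feature map per frame.
    """

    def __init__(self, out_channels: int = 256):
        super().__init__()
        # One small convolutional stack per resolution branch (hypothetical widths).
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
                          nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(3)
        ])
        self.scales = (1.0, 0.5, 0.25)                        # three different resolutions
        # 3 branches x 64 channels + 3 channels of the original frame.
        self.project = nn.Conv2d(3 * 64 + 3, out_channels, kernel_size=1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (T, 3, 244, 244) -> video image features of shape (T, 256, 56, 56)."""
        feats = []
        for scale, branch in zip(self.scales, self.branches):
            x = F.interpolate(frames, scale_factor=scale, mode="bilinear", align_corners=False)
            x = branch(x)                                     # spatial features at this resolution
            feats.append(F.adaptive_avg_pool2d(x, (56, 56)))  # pool to the common 56x56 grid
        feats.append(F.adaptive_avg_pool2d(frames, (56, 56))) # splice in the original frame
        fused = torch.cat(feats, dim=1)                       # concatenate along the channel dim
        return self.project(fused)                            # (T, 256, 56, 56) per-frame features
```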
For the semantic features and the video sequence features, inputting the video data into a semantic extraction network (CLIP network) in a reconstruction model, extracting feature vectors describing the spatial consistency and the time continuity of the whole video from the video sequence, obtaining the semantic features corresponding to the video data, inputting the video data into a video sequence encoder in the reconstruction model, and extracting the video sequence features corresponding to the video data.
The three features above constitute the preparatory feature extraction that the reconstruction model performs on the video data, and they serve as inputs to the U-shaped neural network and the stable diffusion network in the reconstruction model described below.
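The semantic extraction network (the CLIP network) can be treated as an off-the-shelf frozen encoder, so only the video sequence encoder side is sketched here; the per-frame embedding, layer count and feature dimension are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class VideoSequenceEncoder(nn.Module):
    """Minimal video sequence encoder: per-frame embedding followed by a temporal
    Transformer encoder, so that the output keeps the continuity of the whole clip."""

    def __init__(self, dim: int = 512, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        self.frame_embed = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.LazyLinear(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, C, H, W) -> (1, T, dim) sequence features spanning the whole video.
        tokens = self.frame_embed(frames).unsqueeze(0)
        return self.temporal(tokens)
```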
S104: inputting the video image features and the semantic features into a U-shaped neural network in the reconstruction model to obtain an output result, inputting the semantic features, the video sequence features and the output result of the U-shaped neural network into a stable diffusion network in the reconstruction model to obtain human body posture features, determining original distribution corresponding to the video data, and inputting the original distribution into a flow module in the reconstruction model to obtain posture distribution.
The reconstruction model includes three structures that mainly perform feature extraction on the video data: a U-shaped neural network, a stable diffusion network and a flow module. An overall structural schematic diagram of the reconstruction model is shown in fig. 2.
Fig. 2 is a schematic structural diagram of a reconstruction model provided in the present specification.
As can be seen from fig. 2, after the reconstruction model determines the video image features, the semantic features and the video sequence features, the video image features and the semantic features can be input into the U-shaped neural network in the reconstruction model, so as to obtain an intermediate layer output result.
The stable diffusion model needs to combine the middle layer output result of the U-shaped neural network when extracting the characteristics, so that the semantic characteristics, the video sequence characteristics and the middle layer output result of the U-shaped neural network need to be input into the stable diffusion network in the reconstruction model to obtain the human body posture characteristics.
The original distribution can also be determined according to the video data, and the original distribution is input into a flow module in the reconstruction model to obtain the gesture distribution.
The following is a detailed description of the three parts.
For a U-shaped neural network in a reconstruction model, the U-shaped neural network includes a number of sub-modules in series, the sub-modules including, for each sub-module, a spatial convolution module, a spatial attention module, and a cross attention module.
The function of the U-shaped neural network is mainly to provide its intermediate layer output results to the stable diffusion model. Specifically, the video image features and the semantic features mentioned above can be input into the U-shaped neural network in the reconstruction model. For each sub-module connected in series in the U-shaped neural network, the spatial convolution result output by the spatial convolution module of the sub-module is input into the spatial attention module of the sub-module to obtain a weighted convolution result in which the spatial convolution result is attention-weighted; the weighted convolution result and the semantic features are then input into the cross attention module of the sub-module to obtain the output result of the sub-module, and the output result of the sub-module is input into the next sub-module until the output result of the last sub-module is obtained. The video image features are input to the spatial convolution module of the first sub-module.
Specifically, first, a U-shaped neural network with the same structure and pre-training weights as the stable diffusion model can be constructed.
Second, in each sub-module of the U-shaped neural network, spatial convolution, spatial attention mechanisms, and cross attention mechanisms are used instead of the original self attention mechanism layer.
Thirdly, the spatial convolution structure is implemented through depthwise separable convolution: the feature map on each channel is first convolved channel by channel, so that no information interaction occurs between channels, and then convolved point by point, so that no information interaction occurs between the points of each feature map.
Fourth, in conventional convolutional neural networks, all input features are processed equally, without considering the importance differences between different locations and channels. The spatial attention mechanism calculates an attention weight at each position of each feature map by introducing the attention weights at the positions and the channels, and is used for controlling the importance degree of the position, so that the model can pay more attention to important information areas, and the precision and generalization capability of the model are improved.
Fifth, the information from the spatial attention mechanism is fused with the semantic information from the semantic extraction model through the cross attention mechanism; the cross attention mechanism can help the model build global attention relationships between multiple inputs and extract important information.
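A minimal sketch of one such sub-module (depthwise-separable spatial convolution, a spatial attention gate, then cross attention against the semantic features) follows; the channel width, head count and the single sigmoid-gated attention map are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UNetSubModule(nn.Module):
    """One sub-module of the U-shaped network: spatial convolution, spatial attention,
    then cross attention with the semantic features. Sizes are assumptions."""

    def __init__(self, channels: int = 256, sem_dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Depthwise-separable spatial convolution: channel-by-channel, then point-by-point.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise
            nn.Conv2d(channels, channels, 1),                              # pointwise
        )
        # Spatial attention: one weight per position, controlling its importance.
        self.spatial_attn = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.cross_attn = nn.MultiheadAttention(channels, num_heads, kdim=sem_dim,
                                                vdim=sem_dim, batch_first=True)

    def forward(self, x: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # x: (T, C, H, W) feature maps; semantic: (1, L, sem_dim) semantic features.
        conv = self.spatial_conv(x)                          # spatial convolution result
        weighted = conv * self.spatial_attn(conv)            # attention-weighted convolution result
        t, c, h, w = weighted.shape
        tokens = weighted.flatten(2).transpose(1, 2)         # (T, H*W, C) queries
        sem = semantic.expand(t, -1, -1)                     # broadcast semantics to every frame
        fused, _ = self.cross_attn(tokens, sem, sem)         # cross attention with semantics
        return fused.transpose(1, 2).reshape(t, c, h, w)     # output result of this sub-module
```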
In addition, the human body space features output by the space convolution layer in each sub-module in the U-shaped neural network can be input into the sub-module corresponding to the stable diffusion model, and fused with the output of the space convolution of the sub-module corresponding to the stable diffusion model, and the following description is given by fig. 3, and fig. 3 mainly shows the corresponding relationship between the sub-module in the U-shaped neural network and the sub-module in the stable diffusion model.
Fig. 3 is a schematic diagram of a correspondence between a sub-module in a U-shaped neural network and a sub-module in a stable diffusion model provided in the present specification.
In fig. 3, 4 sub-modules are taken as an example in the U-shaped neural network and the stable diffusion network, wherein the interaction process of the U-shaped neural network and the stable diffusion network is described by using the sub-module 4, it can be seen that each sub-module of the U-shaped neural network corresponds to each sub-module of the stable diffusion network one by one, and the spatial convolution result of the spatial convolution module of each sub-module of the U-shaped neural network needs to be input into the corresponding sub-module in the stable diffusion model.
Specifically, in the stable diffusion network, for each sub-module, the sub-module includes: a spatial convolution module, a spatial attention module, a cross attention module, a temporal convolution module, and a temporal attention module.
The semantic features and the video sequence features can be input into the stable diffusion network, and for each sub-module in the stable diffusion network, the spatial convolution result output by the spatial convolution module in the sub-module corresponding to the sub-module in the U-shaped neural network is also required to be input into the sub-module.
In one sub-module, the spatial convolution can be performed on the input data (if the sub-module is the first sub-module, the input data is the video sequence feature, if the sub-module is not the first sub-module, the input data is the output result of the last sub-module) through the spatial convolution module in the sub-module, so as to obtain a spatial convolution result, the spatial convolution result is spliced with the spatial convolution result of the sub-module corresponding to the U-shaped neural network and the sub-module, and the obtained spliced result is input into the spatial attention module, so as to obtain the spatial attention result.
As can be seen from fig. 3, the spatial attention result includes a part of the features corresponding to the spatial convolution result of the U-shaped neural network as well as a part corresponding to the spatial convolution result of the stable diffusion network. Since, after spatial attention, the stable diffusion network has already absorbed the features extracted by the U-shaped neural network (the features extracted by the U-shaped neural network are biased toward image features, while the features extracted by the stable diffusion network carry the continuity of the video), the part belonging to the U-shaped neural network in the spatial attention result can be removed to obtain a residual space-time attention result, and the residual space-time attention result and the semantic features are input into the cross attention module to obtain the cross attention result.
Intuitively, the splicing may be described as follows: the size of the feature vector corresponding to the spatial convolution result in the stable diffusion network changes from (t, c, H, W) to (t, c, H, 2W) after splicing. The spliced result is input into the spatial attention module, which extracts the key spatial information in the spliced features; the key features are considered to be preserved in the left half (columns 1 to W) of the feature vector, so only the left half of the feature vector is taken and input into the cross attention layer, and the right half (columns W+1 to 2W) is discarded.
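A sketch of this splice-then-truncate operation, assuming both inputs share the same (t, c, H, W) shape and that `spatial_attn` is the sub-module's spatial attention module:

```python
import torch

def splice_and_truncate(sd_feat: torch.Tensor, unet_feat: torch.Tensor, spatial_attn) -> torch.Tensor:
    """Splice the stable-diffusion spatial convolution result with the corresponding
    U-shaped network result along the width axis ((t, c, H, W) -> (t, c, H, 2W)),
    apply spatial attention, then keep only the left half and discard the right half."""
    w = sd_feat.shape[-1]
    spliced = torch.cat([sd_feat, unet_feat], dim=-1)   # (t, c, H, 2W)
    attended = spatial_attn(spliced)                     # spatial attention over the spliced features
    return attended[..., :w]                             # residual result: left half (columns 1..W) only
```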
And then, the cross attention result can be input into a time convolution module, the time convolution is carried out on the cross attention result according to a sliding window preset in time to obtain a time convolution result, the time convolution result is input into the time attention module to obtain an output result of the submodule, the output result of the submodule is input into a next submodule until the output result of the last submodule is obtained, and the output result is used as the human body posture characteristic output by the stable diffusion network.
It should be noted that the above-mentioned time convolution module is used to make the features extracted by the stable diffusion network have stronger time continuity, and the time convolution is a convolution operation applied on the time sequence data. It is similar to a conventional two-dimensional convolution operation, but operates in the time dimension. The time convolution performs a sliding operation on time-series data by defining a sliding window, and it is necessary to define the size of the sliding window and the distance over which the sliding operation is performed once.
Specifically, the time series data is a feature vector with a time dimension and composed of features of a plurality of video frames, the size of a sliding window determines a time range focused by convolution operation, a convolution kernel is used for extracting features in the range, and the time convolution can capture the time features of the time series video frames and can better keep the time continuity of the human body posture features. The temporal attention mechanism achieves weighting of the time series data by calculating an attention weight for each video frame feature. These attention weights represent the degree of attention of the model to the different video frames.
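A sketch of the temporal convolution and temporal attention described above, with an assumed sliding-window size of 3 and stride 1; the frame-scoring layer used for the temporal attention weights is also an assumption.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Temporal convolution with a sliding window over the time axis, followed by a
    temporal attention weighting of the video frames."""

    def __init__(self, channels: int = 256, window: int = 3):
        super().__init__()
        # Conv1d over the time dimension; `window` is the sliding-window size,
        # stride 1 is the distance moved per sliding step.
        self.time_conv = nn.Conv1d(channels, channels, kernel_size=window,
                                   stride=1, padding=window // 2)
        self.frame_score = nn.Linear(channels, 1)  # one attention weight per video frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, C, H, W) cross attention result.
        t, c, h, w = x.shape
        seq = x.permute(2, 3, 1, 0).reshape(h * w, c, t)           # time on the last axis
        seq = self.time_conv(seq)                                  # time convolution result
        frames = seq.mean(dim=0).transpose(0, 1)                   # (T, C) summary per frame
        weights = torch.softmax(self.frame_score(frames), dim=0)   # attention over frames
        seq = seq * weights.transpose(0, 1).unsqueeze(0)           # weight each frame's features
        return seq.reshape(h, w, c, t).permute(3, 2, 0, 1)         # back to (T, C, H, W)
```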
In the flow module, the original distribution corresponding to the video data needs to be converted into a gesture distribution capable of representing the human gestures in the video data (i.e., a probability distribution over human gestures). It can also be understood that the probability density function corresponding to the original distribution needs to be converted into the probability density function corresponding to the gesture distribution, so that the gesture distribution is represented by this determined probability density function.
The above-mentioned original distribution may be an index Gaussian distribution; alternatively, a standard Gaussian distribution may be transformed using the video data (for example, transformed by BAE) to obtain a probability distribution closer to the video data itself, which is then used as the original distribution.
The original distribution is input into the flow module in the reconstruction model to obtain k diffeomorphic mapping results, where the i-th diffeomorphic mapping result is determined by the (i-1)-th diffeomorphic mapping result and the transformation parameters, the 1st diffeomorphic mapping result is obtained by applying a diffeomorphic mapping with the transformation parameters to the original distribution, and the gesture distribution is obtained according to a preset probability density transformation method and the k diffeomorphic mapping results. The transformation parameters mentioned here need to be learned through the training process of the reconstruction model, i.e., the transformation parameters are the network parameters of the flow module in the reconstruction model.
Here, a flow-based method is used to convert the distribution $\pi(u)$ over the video frame sequence into the human body posture spatial distribution $p(x)$.
The underlying probability density (change-of-variables) transformation formula is:
$$p(x) = \pi(u)\left|\det\frac{\partial f(u)}{\partial u}\right|^{-1}, \quad x = f(u)$$
In this specification, the complex diffeomorphic mapping $f$ is synthesized from $k$ successive simpler diffeomorphic transforms $f_i$, namely:
$$x = f(u) = f_k \circ f_{k-1} \circ \cdots \circ f_1(u)$$
The initial variable is set as $z_0 = u$, i.e. the variable of the original distribution, and the final transformed variable is $z_k = x$, whose distribution is the gesture distribution mentioned later. The relationship between two adjacent diffeomorphic mapping results is given by the following formula, where $\theta$ is the transformation parameter:
$$z_i = f_i(z_{i-1}; \theta), \quad i = 1, \dots, k$$
The above-mentioned preset probability density transformation method can be expressed by the following formula:
$$p(x) = \pi(z_0)\prod_{i=1}^{k}\left|\det\frac{\partial f_i(z_{i-1}; \theta)}{\partial z_{i-1}}\right|^{-1}$$
From the above formula, it can be seen that the gesture distribution $p(x)$ is obtained by substituting the $k$ successively transformed diffeomorphic mapping results of the original distribution $\pi(z_0)$ into the probability density transformation formula. For convenience of calculation, the logarithm can be taken on both sides of the formula:
$$\log p(x) = \log \pi(z_0) - \sum_{i=1}^{k}\log\left|\det\frac{\partial f_i(z_{i-1}; \theta)}{\partial z_{i-1}}\right|$$
The reason why the actual pose distribution can be determined by the above method is that the reconstructed model is trained by the loss associated with the flow module in the model training, which will be described in step S108.
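The patent does not specify the concrete form of each diffeomorphic transform $f_i$; the sketch below uses a simple planar flow purely as an illustrative stand-in and implements the log-density formula above, $\log p(x) = \log \pi(z_0) - \sum_i \log|\det \partial f_i/\partial z_{i-1}|$.

```python
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """One simple invertible transform f_i with parameters theta_i (illustrative only).
    Composing k of these maps the original distribution pi(u) into the pose distribution p(x)."""

    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.u = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z: torch.Tensor):
        # z_i = z_{i-1} + u * tanh(w . z_{i-1} + b); returns z_i and log|det df_i/dz_{i-1}|.
        lin = z @ self.w + self.b
        z_new = z + self.u * torch.tanh(lin).unsqueeze(-1)
        psi = (1 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w
        log_det = torch.log(torch.abs(1 + psi @ self.u) + 1e-8)
        return z_new, log_det

def log_pose_density(u: torch.Tensor, base_log_prob, flows) -> torch.Tensor:
    """log p(x) = log pi(z_0) - sum_i log|det df_i/dz_{i-1}|, with z_0 = u and x = z_k."""
    log_p = base_log_prob(u)
    z = u
    for flow in flows:
        z, log_det = flow(z)
        log_p = log_p - log_det
    return log_p
```

For example, `flows = [PlanarFlow(dim) for _ in range(k)]` gives the k successive transforms, and `base_log_prob` can be the log-density of the original (e.g. Gaussian) distribution.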
S106: and strengthening the human body posture features according to the posture distribution to obtain the strengthened human body posture features.
S108: and reconstructing a dynamic human body three-dimensional model corresponding to the video data according to the reinforced human body posture characteristics.
After the human body posture characteristics are determined through the stable diffusion network and the posture distribution is determined through the flow module, the human body posture characteristics can be enhanced according to the posture distribution, the enhanced human body posture characteristics are obtained, and the dynamic human body three-dimensional model corresponding to the video data is reconstructed according to the enhanced human body posture characteristics. Before strengthening, human body posture features can be decoded from the latent space through a latent feature decoder, and then the decoded human body posture features are strengthened through posture distribution.
Strengthening the human body posture features according to the posture distribution may refer to sampling the posture distribution and strengthening the human body posture features according to the sampling result; specifically, the probability density value given by the above-mentioned probability density function $p(x)$ can be multiplied with the human body posture features to obtain the reinforced human body posture features.
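A sketch of this reinforcement step, where `log_pose_density` is the flow module's log-density function (for example, as sketched earlier); the pairing of one density value per frame feature is an assumption.

```python
import torch

def strengthen_pose_features(pose_features: torch.Tensor,
                             log_pose_density,
                             pose_samples: torch.Tensor) -> torch.Tensor:
    """Evaluate the probability density p(x) at the sampled poses and multiply it
    into the decoded pose features, one scalar density per frame feature (assumed)."""
    density = torch.exp(log_pose_density(pose_samples))      # p(x) for each sampled pose
    # Broadcast one scalar density per frame over that frame's feature tensor.
    return pose_features * density.view(-1, *([1] * (pose_features.dim() - 1)))
```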
After the reinforced human body posture features are obtained, three-dimensional model reconstruction can be directly carried out through the reinforced human body posture features, and three-dimensional model reconstruction can also be carried out by combining the reinforced human body posture features and the human body posture features before reinforcement.
In the reconstruction process, a deformer can be used to acquire the pose information. A deformer network with an encoder-decoder architecture is adopted, and the strengthened human body posture features are encoded by the encoder. In the decoder, in addition to the encoded information, 14 keypoints and 431 vertices from the SMPL model are input as priors; these 431 vertices are sampled from the 6890 vertices of the SMPL model. Finally, the information output by the decoder passes through a three-dimensional coordinate regression head, and the deformer network outputs the three-dimensional coordinates of the 14 keypoints and the 431 calibration vertices.
Then, the human body sequence information can be regressed. In the first step, the keypoint coordinates output by the deformer are taken as the keypoint coordinates of the human body mesh. In the second step, the 431 calibration vertex coordinates output by the deformer are up-sampled to 6890 vertex coordinates through a multi-layer perceptron, as sketched below. In the third step, a human body mesh sequence of the three-dimensional virtual human is finally obtained from the keypoint coordinates and the 6890 human body surface vertex coordinates, and is used as the reconstructed three-dimensional human body model.
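A sketch of the second step, the multi-layer perceptron that up-samples the 431 calibration vertices to the 6890 SMPL surface vertices; the hidden width and the flattened-coordinate formulation are assumptions.

```python
import torch
import torch.nn as nn

class VertexUpsampler(nn.Module):
    """MLP that up-samples the 431 calibration vertex coordinates output by the
    decoder to the 6890 SMPL surface vertices."""

    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(431 * 3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 6890 * 3),
        )

    def forward(self, sparse_vertices: torch.Tensor) -> torch.Tensor:
        # sparse_vertices: (T, 431, 3) -> dense SMPL mesh vertices (T, 6890, 3).
        t = sparse_vertices.shape[0]
        return self.mlp(sparse_vertices.reshape(t, -1)).reshape(t, 6890, 3)
```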
It should be noted that the above reconstruction model needs to undergo supervised training in advance, and the training samples may come from existing open-source data; that is, video samples and the labeling information corresponding to the video samples may be obtained, where the labeling information is the actual dynamic three-dimensional human body model corresponding to the video sample.
In the training process, a video sample and label information corresponding to the video sample can be acquired, the video sample is input into a reconstruction model, video image features, semantic features and video sequence features corresponding to the video sample are extracted, the video image features and the semantic features corresponding to the video sample are input into a U-shaped neural network in the reconstruction model, an intermediate layer output result is obtained, and the semantic features, the video sequence features and the intermediate layer output result of the U-shaped neural network corresponding to the video sample are input into a stable diffusion network in the reconstruction model, so that human body posture features are obtained.
And determining original distribution corresponding to the video samples, and inputting the original distribution corresponding to the video samples into a flow module in the reconstruction model to obtain gesture distribution. Strengthening the human body posture features according to the posture distribution to obtain the strengthened human body posture features, and determining a reconstruction result corresponding to the video sample according to the strengthened human body posture features. And training the reconstruction model by taking the difference between the minimized labeling information and the reconstruction result as an optimization target.
In addition to the basic supervised training loss described above, there may be a separate loss for the flow module. Specifically, the gesture distribution may be sampled according to the labeling information to obtain sampled gesture data and the probability density values corresponding to the sampled gesture data, and the reconstruction model is trained with minimizing the difference between the labeling information and the reconstruction result and maximizing the probability density values as optimization targets.
The above-mentioned manner of maximizing the probability density value corresponding to the sampled pose data may be implemented by using a log-likelihood manner, where the pose data may refer to data for representing the pose of each joint of the human body (for example, the joint pose may be represented by a rotation matrix, or of course, the joint pose may be represented by another form), and the purpose of sampling is to sample the actual pose of the human body in the video sample represented by the labeling information.
Because the probability density values are evaluated at the actual human gestures in the video sample given by the labeling information, maximizing the probability density values corresponding to the sampled gesture data makes the gesture distribution determined by the flow module in the reconstruction model closer to the actual human gestures in the video sample; the purpose of adding this loss is therefore to make the gesture distribution determined by the flow module more accurate.
Note that the number of samples to be taken for the joints is not limited, and for example, probability density values corresponding to the posture data of all the joints indicated in the labeling information may be taken from the posture distribution, or probability density values corresponding to the posture data of some joints may be taken.
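A sketch of the combined optimization target described above: minimizing the difference between the labeling information and the reconstruction result while maximizing the probability density values of the sampled gesture data (implemented as a negative log-likelihood term); the L2 form of the reconstruction term and the weighting factor are assumptions.

```python
import torch

def training_loss(reconstruction: torch.Tensor,
                  labels: torch.Tensor,
                  log_density_at_labeled_poses: torch.Tensor,
                  nll_weight: float = 1.0) -> torch.Tensor:
    """Reconstruction loss plus a flow-module term that maximizes the probability
    density of poses sampled from the labeling information."""
    recon_loss = torch.mean((reconstruction - labels) ** 2)   # difference between labels and result
    nll = -log_density_at_labeled_poses.mean()                # maximizing density = minimizing NLL
    return recon_loss + nll_weight * nll
```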
The main body of execution of the present method may be a computer, a controller, a server, or the like, and is not limited thereto. The features of the following examples and embodiments may be combined with each other without any conflict.
In addition, it should be noted that all actions for obtaining signals, information or data in this specification are performed under the condition of conforming to the corresponding policy of data protection regulations where the corresponding device owner is located and obtaining the authorization given by the corresponding device owner.
Based on the same idea as the one or more gesture-distribution-guided three-dimensional virtual human model reconstruction methods above, the present specification also provides a gesture-distribution-guided three-dimensional virtual human model reconstruction device, as shown in fig. 4.
Fig. 4 is a schematic diagram of a gesture-distribution-guided three-dimensional virtual human model reconstruction device provided in the present specification, including:
An acquisition module 401, configured to acquire video data;
the first input module 402 is configured to input video data into a reconstruction model, and extract video image features, semantic features, and video sequence features corresponding to the video data;
The second input module 403 is configured to input the video image features and semantic features into a U-shaped neural network in the reconstruction model to obtain an intermediate layer output result, input the semantic features, the video sequence features and the intermediate layer output result of the U-shaped neural network into a stable diffusion network in the reconstruction model to obtain a human body posture feature, determine an original distribution corresponding to the video data, and input the original distribution into a flow module in the reconstruction model to obtain a posture distribution;
the strengthening module 404 is configured to strengthen the human body posture feature according to the posture distribution, so as to obtain a strengthened human body posture feature;
And the reconstruction module 405 is configured to reconstruct a dynamic three-dimensional model of the human body corresponding to the video data according to the reinforced human body posture feature.
Optionally, the first input module 402 is specifically configured to decode video in video data into images frame by frame, and sample a plurality of frame images from the video data at equal intervals to obtain a video frame sequence including the plurality of frame images; processing the video frame sequence to obtain a processed video frame sequence, wherein the processing comprises at least one of cutting and scaling the video frames in the video frame sequence; inputting the processed video frame sequence into a multi-resolution convolutional neural network, and extracting to obtain video image features corresponding to the video data; the first input module 402 is specifically configured to input the video data into a semantic extraction network in the reconstruction model, and extract semantic features corresponding to the video data; the first input module 402 is specifically configured to input the video data into a video sequence encoder in the reconstruction model, and extract a video sequence feature corresponding to the video data.
Optionally, the U-shaped neural network comprises a plurality of sub-modules connected in series, and for each sub-module, the sub-module comprises a spatial convolution module, a spatial attention module and a cross attention module;
The second input module 403 is specifically configured to input the video image feature and the semantic feature to a U-shaped neural network in the reconstruction model, input, for each sub-module connected in series in the U-shaped neural network, a spatial convolution result output by a spatial convolution module of the sub-module to a spatial attention module of the sub-module to obtain a weighted convolution result weighted for attention of the spatial convolution result, and input the weighted convolution result and the semantic feature to a cross attention module of the sub-module to obtain an output result of the sub-module; and inputting the output result of the sub-module into the next sub-module until the output result of the last sub-module is obtained.
Optionally, the stable diffusion network includes several sub-modules connected in series, and for each sub-module, the sub-module includes: the system comprises a space convolution module, a space attention module, a cross attention module, a time convolution module and a time attention module, wherein each sub-module in the stable diffusion network corresponds to each sub-module of the U-shaped neural network one by one;
the second input module 403 is specifically configured to input the semantic feature and the video sequence feature into the stable diffusion network, and for each sub-module in the stable diffusion network, input a spatial convolution result output by a spatial convolution module in a sub-module corresponding to the sub-module in the U-shaped neural network to the sub-module; the space convolution module in the submodule is used for carrying out space convolution on input data to obtain a space convolution result, the space convolution result is spliced with the space convolution result of the submodule corresponding to the submodule through the U-shaped neural network, and the spliced result is input into the space attention module to obtain a space attention result; removing a part belonging to the U-shaped neural network in the spatial attention result to obtain a residual space-time attention result, and inputting the residual space-time attention result and the semantic features into a cross attention module to obtain a cross attention result, wherein the data input into a first sub-module is the video sequence features; and inputting the cross attention result into the time convolution module, carrying out time convolution on the cross attention result according to a sliding window preset in time, inputting the time convolution result into the time attention module to obtain an output result of the submodule, and inputting the output result of the submodule into a next submodule until the output result of the last submodule is obtained to be used as the human body posture characteristic output by the stable diffusion network.
Optionally, the second input module 403 is specifically configured to input the original distribution into the flow module in the reconstruction model to obtain k diffeomorphic mapping results, wherein the i-th diffeomorphic mapping result is determined by the (i-1)-th diffeomorphic mapping result and the transformation parameters, and the 1st diffeomorphic mapping result is obtained by applying a diffeomorphic mapping with the transformation parameters to the original distribution;
And obtain the gesture distribution according to a preset probability density transformation method and the k diffeomorphic mapping results.
Optionally, the apparatus further comprises:
The training module 406 is configured to obtain a video sample and annotation information corresponding to the video sample; inputting a video sample into a reconstruction model, and extracting video image features, semantic features and video sequence features corresponding to the video sample; inputting video image features and semantic features corresponding to the video samples into a U-shaped neural network in the reconstruction model to obtain an intermediate layer output result, inputting the semantic features corresponding to the video samples, the video sequence features and the intermediate layer output result of the U-shaped neural network into a stable diffusion network in the reconstruction model to obtain human body posture features, determining original distribution corresponding to the video samples, and inputting the original distribution corresponding to the video samples into a flow module in the reconstruction model to obtain posture distribution; strengthening the human body posture features according to the posture distribution to obtain strengthened human body posture features, and determining a reconstruction result corresponding to the video sample according to the strengthened human body posture features; and training the reconstruction model by taking the difference between the minimized labeling information and the reconstruction result as an optimization target.
Optionally, the training module 406 is specifically configured to sample the gesture distribution according to the labeling information, to obtain sampled gesture data and a probability density value corresponding to the sampled gesture data; and training the reconstruction model by taking the minimized difference between the labeling information and the reconstruction result and the maximized probability density value as an optimization target.
The present specification also provides a computer readable storage medium storing a computer program operable to perform the above-described pose distribution guided three-dimensional virtual human model reconstruction method.
The present specification also provides a schematic structural diagram of an electronic device, as shown in fig. 5. At the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and runs it to implement the gesture-distribution-guided three-dimensional virtual human model reconstruction method described above.
Of course, the present specification does not exclude other implementations, such as logic devices or combinations of software and hardware; that is, the execution subject of the above processing flow is not limited to logic units and may also be hardware or logic devices.
In the 1990s, an improvement of a technology could be clearly distinguished as an improvement of hardware (for example, an improvement of a circuit structure such as a diode, a transistor, or a switch) or an improvement of software (an improvement of a method flow). However, as technology develops, many improvements of method flows today can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD) (for example, a field programmable gate array (FPGA)) is an integrated circuit whose logic functions are determined by the user programming the device. A designer programs to "integrate" a digital system onto a single PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, this programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled is written in a specific programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, or embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for implementing various functions may also be regarded as structures within the hardware component. Alternatively, the means for implementing various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module, or unit set forth in the above embodiments may specifically be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described by dividing its functions into various units. Of course, when implementing the present specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer-readable medium, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims (8)

1. A gesture-distribution-guided three-dimensional virtual human model reconstruction method, characterized by comprising the following steps:
Acquiring video data;
inputting video data into a reconstruction model, and extracting video image features, semantic features and video sequence features corresponding to the video data;
Inputting the video image features and the semantic features into a U-shaped neural network in the reconstruction model; for each serial sub-module contained in the U-shaped neural network, inputting the spatial convolution result output by the spatial convolution module contained in the sub-module into the spatial attention module contained in the sub-module to obtain a weighted convolution result obtained by applying attention weighting to the spatial convolution result, inputting the weighted convolution result and the semantic features into the cross attention module contained in the sub-module to obtain an output result of the sub-module, and inputting the output result of the sub-module into the next sub-module until the output result of the last sub-module is obtained as an output result; inputting the semantic features and the video sequence features into a stable diffusion network, wherein the stable diffusion network comprises a plurality of sub-modules connected in series, and each sub-module in the stable diffusion network corresponds one-to-one to a sub-module of the U-shaped neural network; for each sub-module contained in the stable diffusion network, inputting the spatial convolution result output by the spatial convolution module in the sub-module of the U-shaped neural network corresponding to the sub-module into the sub-module, performing spatial convolution on the input data through the spatial convolution module contained in the sub-module to obtain a spatial convolution result, splicing the spatial convolution result with the spatial convolution result of the sub-module of the U-shaped neural network corresponding to the sub-module to obtain a spliced result, inputting the spliced result into the spatial attention module contained in the sub-module to obtain a spatial attention result, removing the part belonging to the U-shaped neural network from the spatial attention result to obtain a residual spatial attention result, and inputting the residual spatial attention result and the semantic features into the cross attention module contained in the sub-module to obtain a cross attention result, wherein the data input into the first sub-module is the video sequence features; inputting the cross attention result into the time convolution module contained in the sub-module, performing time convolution on the cross attention result according to a sliding window preset in time, inputting the time convolution result into the time attention module contained in the sub-module to obtain the output result of the sub-module, and inputting the output result of the sub-module into the next sub-module until the output result of the last sub-module is obtained as the human body posture features output by the stable diffusion network; determining an original distribution corresponding to the video data, and inputting the original distribution into a flow module in the reconstruction model to obtain a posture distribution;
strengthening the human body posture features according to the posture distribution to obtain the strengthened human body posture features;
And reconstructing a dynamic three-dimensional human body model corresponding to the video data according to the strengthened human body posture features.
2. The method of claim 1, wherein video data is input into a reconstruction model, and video image features corresponding to the video data are extracted, specifically comprising:
Decoding video in video data into images frame by frame, sampling a plurality of frame images from the video data at equal intervals, and obtaining a video frame sequence containing the plurality of frame images;
Processing the video frame sequence to obtain a processed video frame sequence, wherein the processing comprises at least one of cropping and scaling the video frames in the video frame sequence;
inputting the processed video frame sequence into a multi-resolution convolutional neural network, and extracting to obtain video image features corresponding to the video data;
Inputting video data into a reconstruction model, and extracting semantic features corresponding to the video data, wherein the method specifically comprises the following steps:
Inputting the video data into a semantic extraction network in the reconstruction model, and extracting to obtain semantic features corresponding to the video data;
inputting video data into a reconstruction model, and extracting video sequence features corresponding to the video data, wherein the video sequence features specifically comprise:
and inputting the video data into a video sequence encoder in the reconstruction model, and extracting and obtaining video sequence characteristics corresponding to the video data.
3. The method according to claim 1, wherein inputting the original distribution into the flow module in the reconstruction model to obtain the posture distribution specifically comprises:
Inputting the original distribution into the flow module in the reconstruction model to obtain k diffeomorphic mapping results, wherein the i-th diffeomorphic mapping result is determined by the (i-1)-th diffeomorphic mapping result and a transformation parameter, and the 1st diffeomorphic mapping result is obtained by performing a diffeomorphic mapping on the original distribution with the transformation parameter;
And obtaining the posture distribution according to a preset probability density transformation method and the k diffeomorphic mapping results.
4. The method of claim 1, wherein training the reconstruction model comprises:
acquiring a video sample and labeling information corresponding to the video sample;
Inputting a video sample into a reconstruction model, and extracting video image features, semantic features and video sequence features corresponding to the video sample;
Inputting the video image features and semantic features corresponding to the video sample into the U-shaped neural network in the reconstruction model to obtain an intermediate layer output result, inputting the semantic features corresponding to the video sample, the video sequence features and the intermediate layer output result of the U-shaped neural network into the stable diffusion network in the reconstruction model to obtain human body posture features, determining an original distribution corresponding to the video sample, and inputting the original distribution corresponding to the video sample into the flow module in the reconstruction model to obtain a posture distribution;
strengthening the human body posture features according to the posture distribution to obtain strengthened human body posture features, and determining a reconstruction result corresponding to the video sample according to the strengthened human body posture features;
And training the reconstruction model by taking the difference between the minimized labeling information and the reconstruction result as an optimization target.
5. The method of claim 4, wherein training the reconstruction model with minimizing the difference between the labeling information and the reconstruction result as the optimization target specifically comprises:
Sampling the posture distribution according to the labeling information to obtain sampled posture data and a probability density value corresponding to the sampled posture data;
And training the reconstruction model by taking the minimized difference between the labeling information and the reconstruction result and the maximized probability density value as an optimization target.
6. A gesture-distribution-guided three-dimensional virtual human model reconstruction device, characterized by comprising:
the acquisition module is used for acquiring video data;
The first input module is used for inputting video data into the reconstruction model and extracting video image features, semantic features and video sequence features corresponding to the video data;
The second input module is used for inputting the video image features and the semantic features into the U-shaped neural network in the reconstruction model; for each serial sub-module contained in the U-shaped neural network, the spatial convolution result output by the spatial convolution module contained in the sub-module is input into the spatial attention module contained in the sub-module to obtain a weighted convolution result obtained by applying attention weighting to the spatial convolution result, the weighted convolution result and the semantic features are input into the cross attention module contained in the sub-module to obtain the output result of the sub-module, and the output result of the sub-module is input into the next sub-module until the output result of the last sub-module is obtained as an output result; the semantic features and the video sequence features are input into a stable diffusion network, wherein the stable diffusion network comprises a plurality of sub-modules connected in series, and each sub-module in the stable diffusion network corresponds one-to-one to a sub-module of the U-shaped neural network; for each sub-module contained in the stable diffusion network, the spatial convolution result output by the spatial convolution module in the sub-module of the U-shaped neural network corresponding to the sub-module is input into the sub-module, spatial convolution is performed on the input data through the spatial convolution module contained in the sub-module to obtain a spatial convolution result, the spatial convolution result is spliced with the spatial convolution result of the sub-module of the U-shaped neural network corresponding to the sub-module to obtain a spliced result, the spliced result is input into the spatial attention module contained in the sub-module to obtain a spatial attention result, the part belonging to the U-shaped neural network is removed from the spatial attention result to obtain a residual spatial attention result, and the residual spatial attention result and the semantic features are input into the cross attention module contained in the sub-module to obtain a cross attention result, wherein the data input into the first sub-module is the video sequence features; the cross attention result is input into the time convolution module contained in the sub-module, time convolution is performed on the cross attention result according to a sliding window preset in time, the time convolution result is input into the time attention module contained in the sub-module to obtain the output result of the sub-module, and the output result of the sub-module is input into the next sub-module until the output result of the last sub-module is obtained as the human body posture features output by the stable diffusion network; the original distribution corresponding to the video data is determined, and the original distribution is input into the flow module in the reconstruction model to obtain the posture distribution;
The strengthening module is used for strengthening the human body posture features according to the posture distribution to obtain the strengthened human body posture features;
And the reconstruction module is used for reconstructing a dynamic three-dimensional human body model corresponding to the video data according to the strengthened human body posture features.
7. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-5.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-5 when executing the program.
CN202410250378.0A 2024-03-05 2024-03-05 Three-dimensional virtual human model reconstruction method based on gesture distribution guidance Active CN117830564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410250378.0A CN117830564B (en) 2024-03-05 2024-03-05 Three-dimensional virtual human model reconstruction method based on gesture distribution guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410250378.0A CN117830564B (en) 2024-03-05 2024-03-05 Three-dimensional virtual human model reconstruction method based on gesture distribution guidance

Publications (2)

Publication Number Publication Date
CN117830564A CN117830564A (en) 2024-04-05
CN117830564B true CN117830564B (en) 2024-06-11

Family

ID=90508135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410250378.0A Active CN117830564B (en) 2024-03-05 2024-03-05 Three-dimensional virtual human model reconstruction method based on gesture distribution guidance

Country Status (1)

Country Link
CN (1) CN117830564B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298332A (en) * 2019-07-05 2019-10-01 海南大学 Method, system, computer equipment and the storage medium of Activity recognition
WO2022156533A1 (en) * 2021-01-21 2022-07-28 魔珐(上海)信息科技有限公司 Three-dimensional human body model reconstruction method and apparatus, electronic device, and storage medium
CN115761117A (en) * 2022-11-04 2023-03-07 中国电子科技集团公司第十研究所 Three-dimensional human body reconstruction method and system based on STAR model
CN115761885A (en) * 2022-11-16 2023-03-07 之江实验室 Behavior identification method for synchronous and cross-domain asynchronous fusion drive
WO2023082784A1 (en) * 2022-06-23 2023-05-19 之江实验室 Person re-identification method and apparatus based on local feature attention
CN117351328A (en) * 2023-12-04 2024-01-05 杭州灵西机器人智能科技有限公司 Method, system, equipment and medium for generating annotation image

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230325975A1 (en) * 2023-04-25 2023-10-12 Lemon Inc. Augmentation and layer freezing for neural network model training

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298332A (en) * 2019-07-05 2019-10-01 海南大学 Method, system, computer equipment and the storage medium of Activity recognition
WO2022156533A1 (en) * 2021-01-21 2022-07-28 魔珐(上海)信息科技有限公司 Three-dimensional human body model reconstruction method and apparatus, electronic device, and storage medium
WO2023082784A1 (en) * 2022-06-23 2023-05-19 之江实验室 Person re-identification method and apparatus based on local feature attention
CN115761117A (en) * 2022-11-04 2023-03-07 中国电子科技集团公司第十研究所 Three-dimensional human body reconstruction method and system based on STAR model
CN115761885A (en) * 2022-11-16 2023-03-07 之江实验室 Behavior identification method for synchronous and cross-domain asynchronous fusion drive
CN117351328A (en) * 2023-12-04 2024-01-05 杭州灵西机器人智能科技有限公司 Method, system, equipment and medium for generating annotation image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sicheng Yang. UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons. Proceedings of the 31st ACM International Conference on Multimedia. 2023, full text. *
Yang Bin; Li Heping; Zeng Hui. Video-based three-dimensional human pose estimation. Journal of Beijing University of Aeronautics and Astronautics. 2019, (12), full text. *

Also Published As

Publication number Publication date
CN117830564A (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN117372631B (en) Training method and application method of multi-view image generation model
CN116977525B (en) Image rendering method and device, storage medium and electronic equipment
Tu et al. Consistent 3d hand reconstruction in video via self-supervised learning
US20220245910A1 (en) Mixture of volumetric primitives for efficient neural rendering
CN113592913A (en) Method for eliminating uncertainty of self-supervision three-dimensional reconstruction
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
CN117095132B (en) Three-dimensional reconstruction method and system based on implicit function
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
CN117392293A (en) Image processing method, device, electronic equipment and storage medium
CN117830564B (en) Three-dimensional virtual human model reconstruction method based on gesture distribution guidance
CN115809696B (en) Virtual image model training method and device
CN117079777A (en) Medical image complement method and device, storage medium and electronic equipment
CN116978057A (en) Human body posture migration method and device in image, computer equipment and storage medium
CN117011156A (en) Image processing method, device, equipment and storage medium
CN116883524A (en) Image generation model training, image generation method and device and computer equipment
CN116543246A (en) Training method of image denoising model, image denoising method, device and equipment
CN117726907B (en) Training method of modeling model, three-dimensional human modeling method and device
CN117726760B (en) Training method and device for three-dimensional human body reconstruction model of video
CN117808976B (en) Three-dimensional model construction method and device, storage medium and electronic equipment
CN117689822B (en) Three-dimensional model construction method and device, storage medium and electronic equipment
CN117880444B (en) Human body rehabilitation exercise video data generation method guided by long-short time features
CN117893696B (en) Three-dimensional human body data generation method and device, storage medium and electronic equipment
CN117994470B (en) Multi-mode hierarchical self-adaptive digital grid reconstruction method and device
CN117911630B (en) Three-dimensional human modeling method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant