CN117726760B - Training method and device for three-dimensional human body reconstruction model of video - Google Patents


Info

Publication number
CN117726760B
CN117726760B (application CN202410175200.4A)
Authority
CN
China
Prior art keywords
feature
enhancement
sequence
dimensional
image
Prior art date
Legal status
Active
Application number
CN202410175200.4A
Other languages
Chinese (zh)
Other versions
CN117726760A (en)
Inventor
王宏升
林峰
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202410175200.4A
Publication of CN117726760A
Application granted
Publication of CN117726760B

Landscapes

  • Image Analysis (AREA)

Abstract

The present specification discloses a training method and apparatus for a three-dimensional human body reconstruction model for video. The reconstruction model includes at least a feature extraction layer, a motion enhancement layer and a regression layer. For each image sequence, the sequence feature elements of the initial feature are determined according to the frame number axis, the height axis and the width axis of the first tensor corresponding to the initial feature of the image sequence; a reconstructed video of the predicted three-dimensional human body in the sample video is obtained according to the motion enhancement feature of each sequence feature element; and the reconstruction model is trained according to the speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video. After the initial feature of each image sequence is obtained, the features of every frame image of the same image sequence in the same channel are enhanced with the sequence feature element as the unit, which strengthens the connection among the frame images within one image sequence, and the speed loss supervises the reconstruction model in enhancing inter-frame continuity.

Description

Training method and device for three-dimensional human body reconstruction model of video
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a training method and apparatus for a three-dimensional human body reconstruction model of video.
Background
At present, the main idea of reconstructing a three-dimensional human body in a video is to determine a three-dimensional human body model contained in images of each frame in the video by using a method for reconstructing the three-dimensional human body from the images, and finally obtain the reconstructed video according to the obtained three-dimensional human body model of each frame.
However, because video is continuous, reconstructing a three-dimensional human body in video differs from reconstructing one in a single image: the existing methods can only achieve intra-frame accuracy of the three-dimensional human body model and can hardly guarantee inter-frame continuity between the three-dimensional human body models of consecutive frames, so the video picture jitters. Based on this, the present specification provides a training method for a three-dimensional human body reconstruction model of video.
Disclosure of Invention
The present specification provides a training method, apparatus, storage medium and electronic device for a three-dimensional human reconstruction model of video, to at least partially solve the above-mentioned problems of the prior art.
The technical scheme adopted in the specification is as follows:
The present specification provides a training method for a three-dimensional human body reconstruction model of a video, the reconstruction model at least comprises a feature extraction layer, a motion enhancement layer and a regression layer, comprising:
acquiring a sample video, and determining a plurality of image sequences corresponding to the sample video;
inputting the image sequence into the feature extraction layer for each image sequence to obtain initial features of the image sequence, determining a first tensor corresponding to the initial features, determining sequence feature elements of the initial features according to a frame number axis, a height axis and a width axis of the first tensor, and determining the number of the sequence feature elements of the initial features according to the sequence axis and a channel axis of the first tensor;
inputting the sequence feature elements into the motion enhancement layer for each sequence feature element, determining the motion enhancement characteristics of the sequence feature elements, and determining the motion enhancement characteristics of the sample video according to the motion enhancement characteristics of each sequence feature element;
Inputting the motion enhancement features of the sample video into the regression layer to obtain a reconstructed video of the predicted three-dimensional human body in the sample video;
Determining a predicted three-dimensional joint point of the predicted three-dimensional human body in each frame image in the reconstructed video, and, for each image sequence, determining the average speed of the predicted three-dimensional joint points of the image sequence as the predicted speed of the image sequence according to the position change of the predicted three-dimensional joint points in each frame image of the image sequence;
and determining the speed loss of the image sequence according to the difference between the predicted speed of the image sequence and the labeling speed of the image sequence, and training the reconstruction model according to the speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video.
Optionally, the motion enhancing layer comprises a first enhancer layer and a second enhancer layer;
Inputting the sequence feature elements into the motion enhancement layer, and determining the motion enhancement characteristics of the sequence feature elements, wherein the method specifically comprises the following steps:
Inputting the sequence feature element into the first enhancer layer to obtain a first enhancement feature of the sequence feature element, wherein the first enhancement feature is a result of comprehensively carrying out feature enhancement on the sequence feature element in the horizontal and vertical directions;
And inputting the first enhancement features into the second enhancement sublayer to obtain the motion enhancement features of the sequence feature elements, wherein the motion enhancement features are the results of comprehensively carrying out feature enhancement on the sequence feature elements under different receptive fields.
Optionally, inputting the sequence feature element into the first enhancer layer to obtain a first enhancement feature of the sequence feature element, which specifically includes:
Inputting the sequence feature element into the first enhancer layer, respectively carrying out one-dimensional pooling on the sequence feature element in the horizontal direction and the vertical direction, and determining a first horizontal feature and a first vertical feature of the sequence feature element;
determining a first enhancement weight of the sequence feature element according to the first horizontal feature and the first vertical feature at least through a preset first convolution kernel;
and weighting the sequence feature elements through the first enhancement weights to obtain first enhancement features of the sequence feature elements.
Optionally, inputting the first enhancement feature into the second enhancement sub-layer to obtain a motion enhancement feature of the sequence feature element, which specifically includes:
Inputting the first enhancement features into the second enhancement sub-layer, carrying out convolution operation on the first enhancement features through a preset second convolution kernel, and determining the second enhancement features of the sequence feature elements, wherein the size of the second convolution kernel is different from that of the first convolution kernel;
Two-dimensional pooling is carried out on the first enhancement features to obtain first average weights, and two-dimensional pooling is carried out on the second enhancement features to obtain second average weights;
Weighting the second enhancement feature through the first average weight to obtain a first local weight;
Weighting the first enhancement feature through the second average weight to obtain a second local weight;
adding the first local weight and the second local weight to obtain a motion enhancement weight of the sequence feature element;
and weighting the sequence feature elements through the motion enhancement weights to obtain the motion enhancement features of the sequence feature elements.
Optionally, the reconstruction model further comprises a spatial enhancement layer comprising a third enhancer layer and a fourth enhancer layer;
The method further comprises the steps of:
grouping channels of the initial characteristics according to the preset number of sub-channels for each image sequence to obtain a plurality of channel groups;
According to the number of the sub-channels and the number of the groups of the channels, carrying out space recombination on the initial characteristics, determining a second tensor corresponding to the initial characteristics, determining channel characteristic elements of the initial characteristics according to a sub-channel axis, a height axis and a width axis of the second tensor, and determining the number of the channel characteristic elements of the initial characteristics according to a sequence axis, a frame number axis and a group number axis of the second tensor;
Inputting the channel characteristic element into the third enhancer layer for each channel characteristic element to obtain a third enhancement characteristic of the channel characteristic element, wherein the third enhancement characteristic is a result of comprehensively enhancing the characteristics of the channel characteristic element in the horizontal and vertical directions;
Inputting the third enhancement feature into the fourth enhancement sub-layer to obtain the spatial enhancement feature of the channel feature element, wherein the spatial enhancement feature is the result of comprehensively enhancing the channel feature element under different receptive fields;
and determining the spatial enhancement characteristic of the sample video according to the spatial enhancement characteristic of each channel characteristic element.
Optionally, inputting the channel feature element into the third enhancer layer to obtain a third enhancement feature of the channel feature element, which specifically includes:
Inputting the channel feature element into the third enhancer layer, respectively carrying out one-dimensional pooling on the channel feature element in the horizontal direction and the vertical direction, and determining a second horizontal feature and a second vertical feature of the channel feature element;
Determining a third enhancement weight of the channel feature element according to the second horizontal feature and the second vertical feature at least through a preset third convolution kernel;
And weighting the channel feature element through the third enhancement weight to obtain the third enhancement feature of the channel feature element.
Optionally, inputting the third enhancement feature into the fourth enhancement sub-layer to obtain a spatial enhancement feature of the channel feature element, which specifically includes:
Inputting the third enhancement feature into the fourth enhancement sub-layer, carrying out convolution operation on the third enhancement feature through a preset fourth convolution kernel, and determining the fourth enhancement feature of the channel feature element, wherein the size of the fourth convolution kernel is different from that of the third convolution kernel;
Performing two-dimensional pooling on the third enhancement feature to obtain a third average weight, and performing two-dimensional pooling on the fourth enhancement feature to obtain a fourth average weight;
weighting the fourth enhancement feature by the third average weight to obtain a third local weight;
Weighting the third enhancement feature by the fourth average weight to obtain a fourth local weight;
adding the third local weight and the fourth local weight to obtain the spatial enhancement weight of the channel feature element;
And weighting the channel characteristic elements through the space enhancement weights to obtain the space enhancement characteristics of the channel characteristic elements.
Optionally, inputting the motion enhancement feature of the sample video into the regression layer to obtain a reconstructed video of the predicted three-dimensional human body in the sample video, which specifically includes:
Fusing the motion enhancement features of the sample video and the space enhancement features of the sample video to obtain dual enhancement features of the sample video;
And inputting the dual enhancement features into the regression layer to obtain a reconstructed video of the predicted three-dimensional human body in the sample video.
Optionally, the motion enhancement feature of the sample video and the spatial enhancement feature of the sample video are fused to obtain a dual enhancement feature of the sample video, which specifically includes:
Adding the motion enhancement feature of the sample video and the space enhancement feature of the sample video to obtain a primary fusion feature of the sample video;
carrying out two-dimensional pooling on the primary fusion characteristics, and determining a fifth average weight of the sample video;
weighting the motion enhancement features of the sample video through the fifth average weight to obtain a first fusion feature;
weighting the spatial enhancement features of the sample video through the fifth average weight to obtain a second fusion feature;
and adding the first fusion feature and the second fusion feature to obtain the dual enhancement feature of the sample video.
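As an illustration only, a minimal sketch of this fusion is given below, assuming PyTorch and feature tensors whose last two axes are the height and width axes; the function name dual_enhancement and the tensor layout are illustrative assumptions rather than part of the specification:

```python
import torch

def dual_enhancement(motion_feat: torch.Tensor, spatial_feat: torch.Tensor) -> torch.Tensor:
    # motion_feat / spatial_feat: enhancement features of the sample video, shape (..., H, W).
    primary = motion_feat + spatial_feat                     # primary fusion feature
    fifth_avg = primary.mean(dim=(-2, -1), keepdim=True)     # two-dimensional pooling -> fifth average weight
    first_fusion = motion_feat * fifth_avg                   # weighted motion enhancement feature
    second_fusion = spatial_feat * fifth_avg                 # weighted spatial enhancement feature
    return first_fusion + second_fusion                      # dual enhancement feature
```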
Optionally, determining the speed loss of the image sequence according to the difference between the predicted speed of the image sequence and the labeling speed of the image sequence specifically includes:
taking standard three-dimensional joint points of a standard three-dimensional human body corresponding to each frame of image of the sample video as annotation data;
for each image sequence, determining the average speed of the standard three-dimensional joint points of the image sequence according to the position change of the standard three-dimensional joint points in each frame image of the image sequence, and taking the average speed as the labeling speed of the image sequence;
And determining the speed loss of the image sequence according to the difference between the predicted speed of the image sequence and the labeling speed of the image sequence.
Optionally, training the reconstruction model according to the speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video, specifically including:
taking standard three-dimensional joint points of a standard three-dimensional human body corresponding to each frame of image of the sample video as annotation data;
For each frame of image, determining the three-dimensional reconstruction loss of the frame of image according to the difference between the predicted three-dimensional articulation point corresponding to the frame of image and the standard three-dimensional articulation point;
Determining the three-dimensional reconstruction loss of the sample video according to the three-dimensional reconstruction loss of each frame of image;
and training the reconstruction model according to the speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video.
Optionally, training the reconstruction model according to the speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video, specifically including:
taking standard three-dimensional joint points of a standard three-dimensional human body corresponding to each frame of image of the sample video as annotation data;
determining camera parameters of each frame of image of the sample video, and determining a projection direction corresponding to each frame of image according to the camera parameters of the frame of image;
Projecting a predicted three-dimensional human body corresponding to the frame image according to the projection direction to obtain a predicted two-dimensional human body of the frame image, and determining a predicted two-dimensional articulation point of the predicted two-dimensional human body;
Projecting a standard three-dimensional human body corresponding to the frame image according to the projection direction to obtain a standard two-dimensional human body of the frame image, and determining a standard two-dimensional articulation point of the standard two-dimensional human body;
determining the two-dimensional reconstruction loss of the frame image according to the difference between the predicted two-dimensional articulation point and the standard two-dimensional articulation point;
determining the two-dimensional reconstruction loss of the sample video according to the two-dimensional reconstruction loss of each frame of image;
And training the reconstruction model according to the speed loss of each image sequence, the three-dimensional reconstruction loss of the sample video and the two-dimensional reconstruction loss of the sample video.
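As an illustration only, a minimal sketch of this two-dimensional reconstruction loss is given below, assuming PyTorch and a weak-perspective camera (a per-frame scale and translation); the camera model and all names are assumptions for the sketch, since the specification only requires projecting along the projection direction given by the camera parameters of each frame:

```python
import torch

def project(joints_3d: torch.Tensor, scale: torch.Tensor, trans: torch.Tensor) -> torch.Tensor:
    # joints_3d: (J, 3) joint positions of one frame; keep x, y and apply the assumed camera.
    return scale * joints_3d[:, :2] + trans                  # (J, 2) projected two-dimensional joints

def loss_2d(pred_3d, std_3d, cams):
    # pred_3d / std_3d: lists of (J, 3) tensors, one per frame; cams: list of (scale, trans) pairs.
    per_frame = [
        (project(p, s, t) - project(g, s, t)).norm(dim=-1).mean()   # 2D loss of one frame
        for p, g, (s, t) in zip(pred_3d, std_3d, cams)
    ]
    return torch.stack(per_frame).mean()                     # 2D reconstruction loss of the sample video
```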
The present specification provides a training apparatus for a three-dimensional human reconstruction model of video, the reconstruction model comprising at least a feature extraction layer, a motion enhancement layer and a regression layer, the apparatus comprising:
the acquisition module acquires a sample video and determines a plurality of image sequences corresponding to the sample video;
The sequence feature element determining module inputs the image sequence into the feature extraction layer for each image sequence to obtain initial features of the image sequence, determines a first tensor corresponding to the initial features, determines sequence feature elements of the initial features according to a frame number axis, a height axis and a width axis of the first tensor, and determines the number of sequence feature elements of the initial features according to the sequence axis and a channel axis of the first tensor;
The motion enhancement module is used for inputting the sequence feature elements into the motion enhancement layer for each sequence feature element, determining the motion enhancement characteristics of the sequence feature elements, and determining the motion enhancement characteristics of the sample video according to the motion enhancement characteristics of each sequence feature element;
The reconstructed video determining module inputs the motion enhancement features of the sample video into the regression layer to obtain a reconstructed video of the predicted three-dimensional human body in the sample video;
the prediction speed determining module determines a prediction three-dimensional joint point of a prediction three-dimensional human body in each frame image in the reconstructed video, and determines the average speed of the prediction three-dimensional joint point of the image sequence as the prediction speed of the image sequence according to the position change of the prediction three-dimensional joint point in each frame image of the image sequence for each image sequence;
And the loss determination module is used for determining the speed loss of the image sequence according to the difference between the predicted speed of the image sequence and the labeling speed of the image sequence, and training the reconstruction model according to the speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the training method for a three-dimensional human reconstruction model for video described above.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the training method for a three-dimensional human reconstruction model for video described above when executing the program.
At least one of the above technical schemes adopted in the present specification can achieve the following beneficial effects:
In the training method for a three-dimensional human body reconstruction model of video provided in the present specification, the reconstruction model includes at least a feature extraction layer, a motion enhancement layer and a regression layer. For each image sequence, the sequence feature elements of the initial feature are determined according to the frame number axis, the height axis and the width axis of the first tensor corresponding to the initial feature of the image sequence; a reconstructed video of the predicted three-dimensional human body in the sample video is obtained according to the motion enhancement feature of each sequence feature element; and the reconstruction model is trained according to the speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video. After the initial feature of each image sequence is obtained, the features of every frame image of the same image sequence in the same channel are enhanced with the sequence feature element as the unit, which strengthens the connection among the frame images within one image sequence, and the speed loss supervises the reconstruction model in enhancing inter-frame continuity.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate exemplary embodiments of the present specification and, together with the description, serve to explain the specification without unduly limiting it. In the drawings:
FIG. 1 is a schematic flow chart of a training method for a three-dimensional human body reconstruction model for video in the present specification;
FIG. 2 is a schematic structural diagram of a reconstruction model according to the present disclosure;
FIG. 3 is a schematic view of a motion enhancing layer according to the present disclosure;
FIG. 4 is a schematic structural diagram of another reconstruction model provided in the present specification;
FIG. 5 is a schematic view of a motion enhancing layer provided in the present specification;
FIG. 6 is a schematic diagram of a three-dimensional human reconstruction model training apparatus for video provided herein;
fig. 7 is a schematic view of the electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and the corresponding drawings. It is apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without creative effort shall fall within the scope of protection of the present application.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a training method for a three-dimensional human body reconstruction model for video in the present specification, specifically including the following steps:
s100: and acquiring a sample video, and determining a plurality of image sequences corresponding to the sample video.
All the steps in the three-dimensional human body reconstruction model training method for video provided in the present specification can be implemented by any electronic device having a computing function, such as a terminal, a server, and the like. For convenience of description, the three-dimensional human body reconstruction model training method for video provided in the present specification will be described below with only a server as an execution subject.
When reconstructing a video in three dimensions, it is necessary to convert the video into images for processing. Therefore, after the server acquires the sample video, it first parses the sample video, determines the frame rate of the sample video, namely the number of frames transmitted per second, and then converts the sample video into frame-by-frame images according to the frame rate, so as to obtain a sample image set.
Then, the sample image set is divided into a plurality of image sequences according to a preset unit frame number. The unit frame number is the number of image frames contained in one image sequence. In the image sequence, the sample images are arranged in time order.
For example, the duration of the sample video is 10 seconds, and the frame rate is 24 frames/second, so that the sample video contains 240 frames in total, and the preset unit frame number is determined to be 8 frames, that is, the server divides the sample video into 30 image sequences according to the mode that 8 frames are divided into one image sequence.
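As an illustration, a minimal sketch of this division is given below, assuming Python with OpenCV; the function name split_into_sequences and all variable names are illustrative:

```python
import cv2

def split_into_sequences(video_path: str, unit_frames: int = 8):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)           # frame rate of the sample video
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)                   # frame-by-frame sample image set
    cap.release()
    # Group consecutive frames, kept in time order, into sequences of unit_frames each.
    sequences = [frames[i:i + unit_frames]
                 for i in range(0, len(frames) - unit_frames + 1, unit_frames)]
    return fps, sequences

# For the 10-second, 24 frame/second example above with unit_frames = 8,
# this yields the 30 image sequences described.
```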
In the description, a mode of dividing an image sequence is adopted, one image sequence is considered as a whole, the motion characteristic is enhanced in the dimension of the sequence, and the time sequence characteristic of the image sequence is fully utilized, so that the inter-frame continuity of the reconstructed video is enhanced.
S102: inputting the image sequence into the feature extraction layer for each image sequence to obtain initial features of the image sequence, determining a first tensor corresponding to the initial features, determining sequence feature elements of the initial features according to a frame number axis, a height axis and a width axis of the first tensor, and determining the number of the sequence feature elements of the initial features according to the sequence axis and a channel axis of the first tensor.
Fig. 2 is a schematic structural diagram of a reconstruction model provided in the present specification, and as shown in fig. 2, the reconstruction model at least includes a feature extraction layer, a motion enhancement layer, and a regression layer.
For each image sequence, the server inputs the image sequence into a feature extraction layer to obtain initial features of the image sequence. The data dimension of the first tensor corresponding to the initial feature is five, and the axes of the five dimensions are a sequence axis, a channel axis, a frame number axis, a height axis and a width axis respectively. The axis length of the sequence axis is the number of image sequences contained in the sample video, the axis length of the channel axis is the number of characteristic channels of the initial feature, the axis length of the frame axis is the frame number of images contained in one image sequence, namely, the unit frame number, the axis length of the height axis is the height value of a characteristic map corresponding to the initial feature, and the axis length of the width axis is the width value of the characteristic map corresponding to the initial feature.
The server determines the sequence feature elements of the initial feature according to the frame number axis, the height axis and the width axis of the first tensor, and determines the number of the sequence feature elements of the initial feature according to the sequence axis and the channel axis of the first tensor.
In the data processing of tensors, the order of the axes in the shape of a tensor represents the hierarchy of the data processing: the server processes the data along the axes from right to left, so the rightmost few axes can be regarded as the basic unit of data processing, while the remaining axes characterize the number of such basic units and can be used as coordinates to determine the position of each basic unit in the tensor space.
Thus, in the embodiment of the present specification, in the shape of the first tensor corresponding to the initial feature, it is necessary to ensure that the last three axes are the frame number axis, the height axis, and the width axis, and the first two axes are the sequence axis and the channel axis, so that the sequence feature element can be determined from the three axes from right to left according to the first tensor. The sequence feature element consists of the features of each frame of image contained in the image sequence in the same feature channel.
The order of the last three axes of the shape of the first tensor is not limited in this specification, as long as the last three axes are composed of the frame number axis, the height axis and the width axis. Likewise, the order of the first two axes of the shape of the first tensor is not limited in this specification, as long as the first two axes are composed of the sequence axis and the channel axis. That is, with B representing the sequence axis, C the channel axis, S the frame number axis, H the height axis and W the width axis, the first tensor may be of the shape (B, C, S, H, W), (C, B, S, H, W), etc.
Taking the shape of the first tensor as (B, C, S, H, W) for example, the tensor shape of a sequence feature element is (S, H, W), and the value of B×C represents the number of sequence feature elements. The coordinates (b, c) characterize the position of a sequence feature element in the five-dimensional tensor space, where b is an integer within the range [1, B] and c is an integer within the range [1, C]; that is, they identify which image sequence and which feature channel the sequence feature element belongs to. After the subsequent feature enhancement operation is carried out, the position in the tensor space of the feature enhancement result obtained from each sequence feature element can therefore be determined, so that the feature enhancement result of the sample video is obtained according to the feature enhancement result of each sequence feature element.
For example, a section of sample video is divided into 20 image sequences, each image sequence contains 8 frames of images, the number of feature channels of the initial feature obtained through the feature extraction layer is 50, and the feature map corresponding to the initial feature has a height of 56 and a width of 56. Then the shape of the first tensor corresponding to the initial feature may be represented as (20, 50, 8, 56, 56), the tensor shape of a sequence feature element is (8, 56, 56), and the number of sequence feature elements is 20×50=1000.
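To make the indexing concrete, a small sketch of this example is given below, assuming the initial feature is a PyTorch tensor; all variable names are illustrative:

```python
import torch

B, C, S, H, W = 20, 50, 8, 56, 56               # the example above
initial_feature = torch.randn(B, C, S, H, W)     # first tensor

# The last three axes form one sequence feature element; the first two axes index the elements.
elements = initial_feature.reshape(B * C, S, H, W)
assert elements.shape == (1000, 8, 56, 56)       # 20 x 50 = 1000 sequence feature elements
# The element at coordinates (b, c) is initial_feature[b, c]: an (S, H, W) block made of the
# features of every frame of the b-th image sequence in the c-th feature channel.
```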
S104: and inputting the sequence feature elements into the motion enhancement layer for each sequence feature element, determining the motion enhancement characteristics of the sequence feature elements, and determining the motion enhancement characteristics of the sample video according to the motion enhancement characteristics of each sequence feature element.
For each sequence feature element, the server inputs the sequence feature element into the motion enhancement layer to determine the motion enhancement feature of the sequence feature element.
FIG. 3 is a schematic diagram of a motion enhancement layer provided in the present specification, wherein "W_pooling" represents one-dimensional pooling in the horizontal direction, "H_pooling" represents one-dimensional pooling in the vertical direction, "HW_pooling" represents two-dimensional pooling, the multiplication sign represents matrix multiplication, and the addition sign represents matrix addition.
As shown in fig. 3, the motion-enhancing layer comprises a first enhancer layer and a second enhancer layer. The server inputs the sequence feature element into a first enhancer layer to obtain a first enhancement feature of the sequence feature element, and then inputs the first enhancement feature into a second enhancer layer to obtain a motion enhancement feature of the sequence feature element. The first enhancement feature is the result of feature enhancement of the sequence feature elements in the horizontal and vertical directions, and the motion enhancement feature is the result of feature enhancement of the sequence feature elements in the different receptive fields.
As shown in fig. 3, the specific algorithm flow of the sequence feature element in the motion enhancement layer is as follows:
Firstly, the server inputs the sequence feature elements into a first enhancer layer, respectively carries out one-dimensional pooling on the sequence feature elements in the horizontal direction and the vertical direction, and determines a first horizontal feature and a first vertical feature of the sequence feature elements.
And secondly, the server determines the first enhancement weight of the sequence feature element at least through a preset first convolution kernel according to the first horizontal feature and the first vertical feature.
Specifically, the server transposes the first vertical feature to obtain a first transposed vertical feature, and then splices the first horizontal feature with the first transposed vertical feature to obtain a first spliced feature. And carrying out convolution operation on the first splicing feature through a preset first convolution kernel to obtain a second splicing feature, then dividing the second splicing feature to obtain a first horizontal weight and a first vertical weight, and carrying out matrix multiplication on the first horizontal weight and the first vertical weight to obtain a first enhancement weight of the sequence feature element.
When the convolution operation is carried out according to the first convolution kernel, the first horizontal feature and the first vertical feature are spliced, so that the first horizontal feature and the first vertical feature can share one convolution layer, the parameter number of the reconstruction model is reduced, and the feature enhancement effect of the first horizontal feature and the first vertical feature is more balanced.
And thirdly, the server weights the sequence feature elements through the first enhancement weights to obtain first enhancement features of the sequence feature elements.
In the process of obtaining the first enhancement weight, the features of the sequence feature elements in the horizontal direction and the vertical direction are extracted through one-dimensional pooling in the horizontal direction and one-dimensional pooling in the vertical direction, and the first enhancement weight reflects the importance degree of the features in two different directions, so that after the sequence feature elements are weighted through the first enhancement weight, the first enhancement feature for comprehensively enhancing the features of the sequence feature elements in the horizontal direction and the vertical direction can be obtained.
Then, the server inputs the first enhancement feature into the second enhancement sub-layer, carries out a convolution operation on the first enhancement feature through a preset second convolution kernel, and determines the second enhancement feature of the sequence feature element, wherein the second convolution kernel is different in size from the first convolution kernel. For example, the server may set the first convolution kernel to one size and the second convolution kernel to another (say a 3×3 kernel and a 5×5 kernel, or the reverse); as long as the first convolution kernel is different from the second convolution kernel, the specific sizes of the first convolution kernel and the second convolution kernel are not limited in this specification.
And then, the server carries out two-dimensional pooling on the first enhancement features to obtain first average weights, and carries out two-dimensional pooling on the second enhancement features to obtain second average weights. And weighting the second enhancement features through the first average weights to obtain first local weights, and weighting the first enhancement features through the second average weights to obtain second local weights. And adding the first local weight and the second local weight to obtain the motion enhancement weight of the sequence feature element.
And then, weighting the sequence feature elements through the motion enhancement weights to obtain the motion enhancement features of the sequence feature elements.
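Putting the two sub-layers together, a minimal sketch of the motion enhancement layer described above is given below, assuming PyTorch; the 3×3 first convolution kernel, the 5×5 second convolution kernel, the class name MotionEnhancementLayer and the exact pooling directions are illustrative assumptions, and each sequence feature element is treated as an (S, H, W) block whose frame number axis plays the role of the convolution channels:

```python
import torch
import torch.nn as nn

class MotionEnhancementLayer(nn.Module):
    def __init__(self, frames: int):
        super().__init__()
        # First enhancer sub-layer: one shared convolution (first convolution kernel)
        # over the spliced horizontal / vertical pooling results.
        self.first_conv = nn.Conv2d(frames, frames, kernel_size=3, padding=1)
        # Second enhancer sub-layer: a convolution with a different kernel size
        # (second convolution kernel), giving a different receptive field.
        self.second_conv = nn.Conv2d(frames, frames, kernel_size=5, padding=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: one sequence feature element of shape (1, S, H, W) -- the features of
        # every frame of one image sequence in one feature channel.
        _, _, h, w = x.shape
        # ----- first enhancer sub-layer -----
        horiz = x.mean(dim=2, keepdim=True)              # pool along the height axis  -> (1, S, 1, W)
        vert = x.mean(dim=3, keepdim=True)               # pool along the width axis   -> (1, S, H, 1)
        vert_t = vert.permute(0, 1, 3, 2)                # transposed vertical feature -> (1, S, 1, H)
        spliced = torch.cat([horiz, vert_t], dim=3)      # first spliced feature
        spliced = self.first_conv(spliced)               # second spliced feature (shared convolution)
        horiz_w, vert_w = torch.split(spliced, [w, h], dim=3)   # first horizontal / vertical weights
        vert_w = vert_w.permute(0, 1, 3, 2)              # back to (1, S, H, 1)
        first_weight = vert_w @ horiz_w                  # matrix multiplication -> (1, S, H, W)
        first_feat = x * first_weight                    # first enhancement feature
        # ----- second enhancer sub-layer -----
        second_feat = self.second_conv(first_feat)       # second enhancement feature
        first_avg = first_feat.mean(dim=(2, 3), keepdim=True)    # two-dimensional pooling -> first average weight
        second_avg = second_feat.mean(dim=(2, 3), keepdim=True)  # two-dimensional pooling -> second average weight
        local_1 = second_feat * first_avg                # first local weight
        local_2 = first_feat * second_avg                # second local weight
        motion_weight = local_1 + local_2                # motion enhancement weight
        return x * motion_weight                         # motion enhancement feature

# Usage on one sequence feature element of the earlier example:
# layer = MotionEnhancementLayer(frames=8)
# enhanced = layer(torch.randn(1, 8, 56, 56))            # same shape as the input element
```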
And finally, the server determines the motion enhancement characteristics of the sample video according to the motion enhancement characteristics of each sequence characteristic element.
The server can determine the position of the sequence feature element in the tensor space according to the position of the sequence axis corresponding to each sequence feature element and the position of the channel axis, and determine the position of the motion enhancement feature of the sequence feature element according to the position of the sequence feature element, so that the motion enhancement feature of the sample video is obtained according to the feature enhancement result of each sequence feature element.
In the process of obtaining the motion enhancement weight, as the first enhancement feature and the first average weight are obtained based on the convolution operation of the first convolution kernel, and the second enhancement feature and the second average weight are obtained based on the convolution operation of the second convolution kernel, the fusion between the features of different receptive fields obtained by two different convolution kernels is realized by weighting the second enhancement feature through the first average weight and the first enhancement feature through the second average weight. And the feature level of the motion enhancement feature obtained by finally weighting the sequence feature elements through the motion enhancement weight is richer.
Meanwhile, the sequence feature elements are determined according to the frame number axis, the height axis and the width axis of tensor of the initial feature, so that feature enhancement is carried out on the features of each frame image of the same feature channel in the sequence feature elements in the motion enhancement layer, time sequence information among continuous frame images is utilized, and the inter-frame continuity of a predicted three-dimensional human body in the continuous frame images can be enhanced.
In addition, because the time sequence information is considered, feature supplementation can be carried out for individual frames whose feature information is inaccurate according to the features of adjacent frames. Therefore, in the case that the human body in individual frame images is occluded, the complete three-dimensional human body in the occluded frames can still be obtained according to the unoccluded feature information in adjacent frames, which avoids defects of the three-dimensional human body in individual frames caused by image occlusion and ensures the inter-frame continuity of the three-dimensional human body between frames.
S106: and inputting the motion enhancement features of the sample video into the regression layer to obtain a reconstructed video of the predicted three-dimensional human body in the sample video.
As shown in fig. 2, the server inputs the motion enhancement features of the sample video into the regression layer to obtain a reconstructed video of the predicted three-dimensional human body in the sample video.
The regression layer can reconstruct the three-dimensional human body skin according to the characteristics of the three-dimensional joint points represented by the motion enhancement characteristics to obtain the predicted three-dimensional human body in each frame of image, so that the predicted three-dimensional human body in each frame of image can be continuously restored to obtain a reconstructed video of the predicted three-dimensional human body in the sample video.
The network structure of the regression layer is not limited in this specification, and may be a graph convolutional neural network, an encoder-decoder structure, or the like.
The initial features extracted through the convolution layer are the features of the three-dimensional joint points in each image sequence, and the continuity of the three-dimensional joint point features in each frame image of each image sequence is enhanced through the motion enhancement layer, so that the inter-frame continuity of the three-dimensional human body in the obtained reconstructed video between each frame image is enhanced, and the video appearance is smoother.
S108: and determining a predicted three-dimensional joint point of a predicted three-dimensional human body in each frame image in the reconstructed video, and determining the average speed of the predicted three-dimensional joint point of the image sequence according to the position change of the predicted three-dimensional joint point in each frame image of the image sequence for each image sequence as the predicted speed of the image sequence.
S110: and determining the speed loss of the image sequence according to the difference between the predicted speed of the image sequence and the labeling speed of the image sequence, and training the reconstruction model according to the speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video.
The server determines the predicted three-dimensional joint points of the predicted three-dimensional human body in each frame image in the reconstructed video, and determines the average speed of the predicted three-dimensional joint points of the image sequence according to the position change of the predicted three-dimensional joint points in each frame image of the image sequence for each image sequence as the predicted speed of the image sequence.
And the server takes the standard three-dimensional joint points of the standard three-dimensional human body corresponding to each frame of image of the sample video as annotation data. And for each image sequence, determining the average speed of the standard three-dimensional joint points of the image sequence according to the position change of the standard three-dimensional joint points in each frame image of the image sequence, and taking the average speed as the labeling speed of the image sequence.
And determining the speed loss of the image sequence according to the difference between the predicted speed of the image sequence and the labeling speed of the image sequence, and determining the speed loss of the sample video according to the speed loss of each image sequence.
Specifically, the speed loss of the sample video may be determined according to the following formula:
$$\hat{v}_{i}=\frac{1}{J\,\Delta t}\sum_{j=1}^{J}\sum_{s}\left\|\hat{J}_{i,s+1,j}-\hat{J}_{i,s,j}\right\|,\qquad v_{i}=\frac{1}{J\,\Delta t}\sum_{j=1}^{J}\sum_{s}\left\|J_{i,s+1,j}-J_{i,s,j}\right\|,\qquad L_{vel}=\frac{1}{N}\sum_{i=1}^{N}\left\|\hat{v}_{i}-v_{i}\right\|$$

Wherein $\hat{v}_{i}$ represents the prediction speed of the $i$-th image sequence, $v_{i}$ represents the labeling speed of the $i$-th image sequence, $L_{vel}$ represents the speed loss of the sample video, $J$ represents the number of three-dimensional joint points contained in a three-dimensional human body, $\Delta t$ represents the time interval corresponding to one image sequence, $t_{s}$ represents the time corresponding to the $s$-th frame image in the sample video (the frame times determine $\Delta t$), $N$ represents the number of image sequences contained in the sample video, $\hat{J}_{i,s,j}$ represents the position of the $j$-th predicted three-dimensional joint point in the $s$-th frame image of the $i$-th image sequence, and $J_{i,s,j}$ represents the position of the $j$-th standard three-dimensional joint point in the $s$-th frame image of the $i$-th image sequence.
For each frame of image, the server determines the three-dimensional reconstruction loss of the frame of image according to the difference between the predicted three-dimensional joint point corresponding to the frame of image and the standard three-dimensional joint point, and determines the three-dimensional reconstruction loss of the sample video according to the three-dimensional reconstruction loss of each frame of image.
Specifically, the three-dimensional reconstruction loss of the sample video can be determined according to the following formula:

$$L_{3D}=\frac{1}{N\,S\,J}\sum_{i=1}^{N}\sum_{s=1}^{S}\sum_{j=1}^{J}\left\|\hat{J}_{i,s,j}-J_{i,s,j}\right\|$$

wherein $S$ represents the number of frame images contained in one image sequence, and the remaining symbols have the same meanings as in the speed loss formula.
and finally, the server determines total loss according to the speed loss and the three-dimensional reconstruction loss of the sample video, and trains the reconstruction model with the minimum total loss as a target.
Specifically, the total loss can be determined according to the following formula:

$$L_{total}=L_{3D}+L_{vel}$$
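As an illustration only, a hedged sketch of these losses is given below, assuming PyTorch, per-sequence joint tensors of shape (S, J, 3) and an L2 distance between joint positions; the exact norms and any weighting coefficients are assumptions:

```python
import torch

def average_speed(joints: torch.Tensor, dt: float) -> torch.Tensor:
    # joints: (S, J, 3) positions of the three-dimensional joint points of one image sequence.
    # Average speed over frames and joints, from the frame-to-frame position changes.
    disp = joints[1:] - joints[:-1]                          # (S - 1, J, 3)
    return disp.norm(dim=-1).mean() / dt

def training_losses(pred_joints, std_joints, dt):
    # pred_joints / std_joints: lists of (S, J, 3) tensors, one per image sequence.
    v_pred = torch.stack([average_speed(p, dt) for p in pred_joints])   # prediction speeds
    v_std = torch.stack([average_speed(g, dt) for g in std_joints])     # labeling speeds
    speed_loss = (v_pred - v_std).abs().mean()                          # speed loss of the sample video
    rec_3d = torch.stack([(p - g).norm(dim=-1).mean()                   # per-sequence joint differences
                          for p, g in zip(pred_joints, std_joints)]).mean()
    return rec_3d + speed_loss, speed_loss, rec_3d                      # total loss, L_vel, L_3D
```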
The speed loss of the sample video is calculated from the speed loss of each image sequence, so its determination considers the speed change among the frame images contained in each image sequence rather than directly calculating the speed change among the frame images of all image sequences of the whole sample video; the speed change between image sequences can therefore be distinguished accurately. Because the motion changes of the three-dimensional human body in the frames of one image sequence are close to each other in time, whereas its motion in the images of other image sequences is farther away in time and changes with a larger amplitude, and because inter-frame continuity is actually determined only by adjacent frames, considering too large a time range would make the predicted speed inaccurate.
In addition, owing to the application of the speed loss, the server can accurately identify the prediction speed of each image sequence. Since the motion enhancement layer performs feature enhancement with the features of every frame image of one sequence as a sequence feature element, the obtained motion enhancement feature is a feature that is enhanced continuously over the feature maps of the frames of one sequence; the speed loss can therefore strengthen, within the motion enhancement feature, the motion features between the image sequences, namely the features of joint point changes, thereby ensuring the continuity between frames.
According to the three-dimensional reconstruction loss, the reconstruction model is trained, so that the intra-frame accuracy of the predicted three-dimensional human body in the obtained reconstruction video can be ensured, and according to the speed loss, the reconstruction model is trained, so that the inter-frame continuity of the predicted three-dimensional human body in the reconstruction video can be ensured.
Therefore, according to the training method of the reconstruction model provided by the specification, after the reconstruction model is trained, the target video can be input into the trained reconstruction model, and the reconstruction video corresponding to the target video can be obtained through the reconstruction model feature extraction layer, the motion enhancement layer and the regression layer. In the reconstructed video, not only is the intra-frame accuracy of the predicted three-dimensional human body corresponding to each frame image of the reconstructed video realized, but also the inter-frame continuity of the predicted three-dimensional human body corresponding to each frame image of the reconstructed video is ensured.
In the above step S102, the shape of the first tensor of the initial feature input into the motion enhancement layer needs to satisfy the condition that the last three axes are the frame number axis, the height axis and the width axis, and the first two axes are the sequence axis and the channel axis. However, due to different network settings of the feature extraction layer, the shape of the initial tensor of the initial feature obtained from the feature extraction layer may not be consistent with the shape of the first tensor.
Therefore, before inputting the initial feature into the motion enhancement layer, the server needs to confirm whether the shape of the initial tensor of the initial feature meets the condition that the shape of the first tensor needs to meet, if not, the initial feature of the initial tensor shape needs to be spatially recombined to make the initial feature meet the condition that the shape of the first tensor needs to meet, if so, the initial feature of the initial tensor shape can be directly input into the motion enhancement layer.
For example, if the initial tensor has the shape (B, S, C, H, W), the server needs to spatially reorganize the initial feature of that shape. One spatial reorganization method is to interchange the positions of the frame number axis and the channel axis, so that the shape of the first tensor obtained after reorganization is (B, C, S, H, W). In this example, other spatial reorganization methods may also be used, as long as the conditions to be satisfied by the shape of the first tensor are met.
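A minimal sketch of this reorganization, assuming PyTorch and the illustrative initial shape (B, S, C, H, W), is given below:

```python
import torch

B, S, C, H, W = 20, 8, 50, 56, 56
initial = torch.randn(B, S, C, H, W)                         # initial tensor from the feature extraction layer
first_tensor = initial.permute(0, 2, 1, 3, 4).contiguous()   # swap the frame number and channel axes
assert first_tensor.shape == (B, C, S, H, W)                 # shape expected by the motion enhancement layer
```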
In the above step S104, the structure of the motion enhancing layer shown in fig. 3 is just one embodiment provided in the present specification. The server only needs to ensure that the first enhancement feature and the motion enhancement feature are obtained in the motion enhancement layer, and the structure of the motion enhancement layer is not particularly limited in the specification.
In one or more embodiments of the present description, in determining the first enhancement weight according to the first horizontal feature and the first vertical feature at the first enhancer layer, the server may set two different convolution layers, and the convolution kernels of the two convolution layers are both the first convolution kernel.
Specifically, the server performs convolution operation on the first horizontal feature through one convolution layer to obtain a first horizontal weight corresponding to the first horizontal feature, and performs convolution operation on the first vertical feature through the other convolution layer to obtain a first vertical weight corresponding to the first vertical feature. And then, carrying out matrix multiplication on the first horizontal weight and the first vertical weight to obtain a first enhancement weight of the sequence feature element.
In this way, the added splicing and dividing steps caused by sharing one convolution layer can be avoided, the operation flow is simplified, and meanwhile, the parameters of the two convolution layers can be respectively adjusted according to the characteristics of the first horizontal characteristic and the first vertical characteristic, so that the characteristic enhancement effect of the sequence characteristic elements in two different directions is differentiated, and the extracted characteristics are more abundant.
In one or more embodiments of the present description, at the first enhancer layer, the server may obtain the first enhancement feature by performing a one-dimensional convolution of the sequence feature elements in a horizontal direction and in a vertical direction.
Specifically, the server performs one-dimensional convolution on the sequence feature element through a preset first horizontal convolution kernel to obtain a first horizontal weight, and performs one-dimensional convolution on the sequence feature element through a preset first vertical convolution kernel to obtain a first vertical weight. The first horizontal weight and the first vertical weight are then matrix-multiplied to obtain the first enhancement weight of the sequence feature element, and the sequence feature element is weighted through the first enhancement weight to obtain the first enhancement feature.
In one or more embodiments of the present description, at the second enhancement sub-layer, the server may convolve the first enhancement feature with a second convolution kernel to obtain a second enhancement feature. And then, the server normalizes the first enhancement feature and the second enhancement feature respectively, and determines a first local weight corresponding to the first enhancement feature and a second local weight corresponding to the second enhancement feature. And adding the first local weight and the second local weight, and determining the motion enhancement weight of the sequence feature element. And weighting the sequence feature elements through the motion enhancement weights to determine the motion enhancement features of the sequence feature elements.
In one or more embodiments of the present disclosure, after the second enhancement sub-layer obtains the first local weight and the second local weight, the server may weight the sequence feature element by the first local weight to obtain a first local feature, weight the sequence feature element by the second local weight to obtain a second local feature, and sum the first local feature and the second local feature to determine the motion enhancement feature of the sequence feature element.
In one or more embodiments of the present disclosure, at the second enhancer layer, the server may perform a convolution operation on the sequence feature element through a preset second convolution kernel to obtain a second enhancement feature.
In short, in the first enhancer layer, as long as the first enhancement feature for enhancing the characteristics of the sequence feature elements in the horizontal and vertical directions is obtained, in the second enhancer layer, the first enhancement feature or the sequence feature elements are subjected to convolution operation through a second convolution kernel different from the first convolution kernel to obtain the second enhancement feature, and then the first enhancement feature and the second enhancement feature are fused to obtain the motion enhancement feature for enhancing the characteristics of the sequence feature elements under different receptive fields comprehensively.
In the step S104, the motion enhancement layer is set, and in the sequence dimension, the initial feature is enhanced by using the sequence feature element as the basic unit of feature enhancement, so as to enhance the inter-frame continuity between the frame images. The server can also be provided with a space enhancement layer, the characteristics of different characteristic channels of the same frame of image are subjected to characteristic enhancement in the channel dimension to obtain the space enhancement characteristics of the sample video, then the motion enhancement characteristics and the space enhancement characteristics are fused, and the three-dimensional skin is reconstructed through the result of characteristic enhancement in two different dimensions to obtain a more accurate three-dimensional human body.
Fig. 4 is a schematic structural diagram of another reconstruction model provided in the present specification, and as shown in fig. 4, the reconstruction model further includes a spatial enhancement layer. The server inputs the initial features into a space enhancement layer, determines the space enhancement features of the sample video, and inputs the dual enhancement features obtained by fusing the motion enhancement features and the space enhancement features into a regression layer to obtain the reconstructed video.
At the spatial enhancement layer, the server performs feature enhancement on the initial features in the channel dimension. Therefore, for each image sequence, the server groups the channels of the initial feature according to the preset number of sub-channels to obtain a plurality of channel groups.
After spatial recombination, the second tensor has six data dimensions, and the axes of the six dimensions are a sequence axis, a group number axis, a sub-channel axis, a frame number axis, a height axis and a width axis respectively. The sequence axis, the frame number axis, the height axis and the width axis of the second tensor have the same meanings as those of the first tensor, the axial length of the group number axis is the number of channel groups, and the axial length of the sub-channel axis is the number of channels contained in one channel group.
For each image sequence, the server performs space recombination on the initial feature according to the number of sub-channels and the number of groups of channel groups, determines a second tensor corresponding to the initial feature, determines channel feature elements of the initial feature according to the sub-channel axis, the height axis and the width axis of the second tensor, and determines the number of the channel feature elements of the initial feature according to the sequence axis, the frame number axis and the group number axis of the second tensor.
In the shape of the second tensor corresponding to the initial feature, the last three axes are a sub-channel axis, a height axis and a width axis, and the first three axes are a sequence axis, a frame number axis and a group number axis, so that the channel feature element can be determined according to the second tensor from the right to the left. The channel characteristic element consists of the characteristics of each channel in a channel group corresponding to one frame of image of the image sequence.
The order of the last three axes of the shape of the second tensor is not limited in this specification, as long as the last three axes are composed of a sub-channel axis, a height axis and a width axis. Likewise, the order of the first three axes of the second tensor is not limited, as long as the first three axes are composed of a sequence axis, a frame number axis and a group number axis. That is, with $C$ denoting the channel axis and $G$ denoting both the number of channel groups and the group number axis, the sub-channel axis can be denoted as $C/G$. Still with $B$ representing the sequence axis, $S$ representing the frame number axis, $H$ representing the height axis and $W$ representing the width axis, the second tensor may be of the shape $(B, S, G, C/G, H, W)$, $(S, B, G, C/G, H, W)$, $(B, G, S, C/G, H, W)$, etc.

Taking the shape of the second tensor as $(B, S, G, C/G, H, W)$ for example, the tensor shape of the channel feature element is $(B \times S \times G,\; C/G,\; H,\; W)$, where the value of $B \times S \times G$ represents the number of channel feature elements. The coordinates $(b, s, g)$ can characterize the position of a channel feature element in the six-dimensional tensor space, wherein $b$ is an integer within the range $[0, B)$, $s$ is an integer within the range $[0, S)$, and $g$ is an integer within the range $[0, G)$. That is, the coordinates indicate which image sequence, which frame image and which channel group the channel feature element belongs to.

For example, a section of sample video is divided into 20 image sequences, each image sequence includes 8 frames of images, the height of the feature map corresponding to the obtained initial feature is 56, the width of the feature map is 56, and the number of feature channels of the initial feature is 50. The 50 channels are divided into 5 channel groups, so that each channel group includes 10 channels, namely the axial length of the group number axis is 5 and the axial length of the sub-channel axis is 10. Then the shape of the second tensor corresponding to the initial feature may be $(20, 8, 5, 10, 56, 56)$, the tensor shape of the channel feature element is $(10, 56, 56)$, and the number of channel feature elements is $20 \times 8 \times 5 = 800$.
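The spatial recombination can be illustrated with a short PyTorch sketch using the numbers of this example; the axis order (B, S, G, C/G, H, W) and the variable names are assumptions made for illustration.

```python
import torch

B, S, C, H, W = 20, 8, 50, 56, 56            # sequences, frames, channels, height, width
G = 5                                         # number of channel groups
sub_channels = C // G                         # channels per group: 10

initial_feature = torch.randn(B, S, C, H, W)                        # initial features of all image sequences
second_tensor = initial_feature.view(B, S, G, sub_channels, H, W)   # spatial recombination into six axes

# Channel feature elements are read off the last three axes; their count
# comes from the first three axes (sequence, frame number, group number).
channel_elements = second_tensor.reshape(B * S * G, sub_channels, H, W)
print(channel_elements.shape)                 # torch.Size([800, 10, 56, 56])
```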
Each feature channel corresponds to the detection of one characteristic of the three-dimensional human body in the image sequence, and the feature map of a channel reflects the intensity with which the characteristic detected by that channel exists in each frame image of the image sequence. Therefore, after the channels are grouped, each channel group detects different characteristics and its feature maps reflect different detection results. By taking the channel feature element, composed of the features of the channels in one channel group, as the basic unit of feature enhancement, the different characteristics of each channel group can be enhanced to different degrees in a targeted manner, so that characteristics highly relevant to three-dimensional human body reconstruction in the sample video are strengthened, while characteristics with low relevance are weakened.
Fig. 5 is a schematic structural diagram of a spatial enhancement layer provided in the present specification. As shown in fig. 5, the spatial enhancement layer includes a third enhancer layer and a fourth enhancer layer. For each channel feature element, the server inputs the channel feature element into the third enhancer layer to obtain a third enhancement feature of the channel feature element, and then inputs the third enhancement feature into the fourth enhancer layer to obtain the spatial enhancement feature of the channel feature element. The third enhancement feature is the result of comprehensively enhancing the channel feature element in the horizontal and vertical directions, and the spatial enhancement feature is the result of comprehensively enhancing the channel feature element under different receptive fields.
As shown in fig. 5, the specific algorithm flow of a channel feature element in the spatial enhancement layer is as follows:
Firstly, the server inputs the channel characteristic element into a third enhancer layer, and respectively carries out one-dimensional pooling on the channel characteristic element in the horizontal direction and the vertical direction to determine a second horizontal characteristic and a second vertical characteristic of the channel characteristic element.
And secondly, the server determines a third enhancement weight of the channel characteristic element at least through a preset third convolution kernel according to the second horizontal characteristic and the second vertical characteristic.
Specifically, the server transposes the second vertical feature to obtain a second transposed vertical feature, and then splices the second horizontal feature with the second transposed vertical feature to obtain a third spliced feature. And carrying out convolution operation on the third splicing feature through a preset third convolution kernel to obtain a fourth splicing feature, then dividing the fourth splicing feature to obtain a second horizontal weight and a second vertical weight, and carrying out matrix multiplication on the second horizontal weight and the second vertical weight to obtain a third enhancement weight of the channel feature element.
And thirdly, the server weights the channel feature element through the third enhancement weight to obtain the third enhancement feature of the channel feature element.
And then, the server inputs the third enhancement feature into a fourth enhancement sub-layer, and performs a convolution operation on the third enhancement feature through a preset fourth convolution kernel to determine the fourth enhancement feature of the channel feature element. Wherein the fourth convolution kernel is different in size from the third convolution kernel.
And then, the server carries out two-dimensional pooling on the third enhancement feature to obtain a third average weight, and carries out two-dimensional pooling on the fourth enhancement feature to obtain a fourth average weight. And weighting the fourth enhancement feature through the third average weight to obtain a third local weight, and weighting the third enhancement feature through the fourth average weight to obtain a fourth local weight. And adding the third local weight and the fourth local weight to obtain the spatial enhancement weight of the channel feature element.
And then, weighting the channel feature elements through the space enhancement weights to obtain the space enhancement features of the channel feature elements.
And finally, the server determines the spatial enhancement characteristics of the sample video according to the spatial enhancement characteristics of the characteristic elements of each channel.
The server can determine the position of the channel feature element in the tensor space according to the positions of the sequence axis, the frame number axis and the group number axis corresponding to each channel feature element, and determine the position of the spatial enhancement feature of the channel feature element according to the position of the channel feature element, so that the spatial enhancement feature of the sample video is obtained according to the feature enhancement result of each channel feature element.
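As an illustration only, the following PyTorch sketch assembles the third and fourth enhancer layer steps described above into one module; the kernel sizes, the average pooling used for the horizontal and vertical features, and the module name are assumptions rather than requirements of the spatial enhancement layer.

```python
import torch
import torch.nn as nn

class SpatialEnhancementLayer(nn.Module):
    """Processes one channel feature element, treated as a (1, c, H, W) tensor
    with c sub-channels."""
    def __init__(self, c: int, k3: int = 3, k4: int = 5):
        super().__init__()
        self.conv3 = nn.Conv2d(c, c, (1, k3), padding=(0, k3 // 2))  # third convolution kernel
        self.conv4 = nn.Conv2d(c, c, k4, padding=k4 // 2)            # fourth convolution kernel
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, elem: torch.Tensor) -> torch.Tensor:
        _, c, h, w = elem.shape
        horiz = elem.mean(dim=2, keepdim=True)                  # second horizontal feature (1, c, 1, W)
        vert = elem.mean(dim=3, keepdim=True)                   # second vertical feature   (1, c, H, 1)
        spliced = torch.cat([horiz, vert.transpose(2, 3)], 3)   # third spliced feature
        spliced = self.conv3(spliced)                           # fourth spliced feature
        w_h, w_v = spliced.split([w, h], dim=3)                 # second horizontal / vertical weights
        third_weight = torch.matmul(w_v.transpose(2, 3), w_h)   # third enhancement weight (1, c, H, W)
        third_enh = elem * third_weight                         # third enhancement feature

        fourth_enh = self.conv4(third_enh)                      # fourth enhancement feature
        third_avg = self.pool(third_enh)                        # third average weight
        fourth_avg = self.pool(fourth_enh)                      # fourth average weight
        spatial_weight = fourth_enh * third_avg + third_enh * fourth_avg  # sum of local weights
        return elem * spatial_weight                            # spatial enhancement feature
```

The motion enhancement layer of fig. 3 can be sketched in the same way by feeding a sequence feature element instead of a channel feature element.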
Similarly to the motion enhancement layer in S104, in the process of obtaining the spatial enhancement weight, the third enhancement feature and the third average weight are both obtained based on the convolution operation of the third convolution kernel, and the fourth enhancement feature and the fourth average weight are both obtained based on the convolution operation of the fourth convolution kernel. Weighting the fourth enhancement feature by the third average weight and weighting the third enhancement feature by the fourth average weight therefore fuses the features of different receptive fields obtained by the two different convolution kernels, and the spatial enhancement feature finally obtained by weighting the channel feature element through the spatial enhancement weight carries richer feature information.

The channel feature elements are determined according to the sub-channel axis, the height axis and the width axis of the second tensor corresponding to the initial feature. Therefore, in the spatial enhancement layer, the features of each channel group are enhanced to different degrees in the channel dimension, so that different weights can be given to the different characteristics of each channel group and the characteristics highly relevant to three-dimensional human body reconstruction in the sample video are strengthened.
The structure of the spatial enhancement layer in fig. 5 is similar to that of the motion enhancement layer in fig. 3, and the detailed actions of the steps corresponding to fig. 5 are referred to the corresponding description of the motion enhancement layer in fig. 3 in S104, and are not repeated here.
Note that the structure of the spatial enhancement layer shown in fig. 5 is just one embodiment provided in the present specification. The server only needs to ensure that the third enhancement feature and the spatial enhancement feature are obtained in the spatial enhancement layer, and the structure of the spatial enhancement layer is not particularly limited in the specification. For a description of other embodiments of the spatial enhancement layer, reference may be made to the description of other embodiments of the motion enhancement layer of fig. 3, which will not be repeated here.
It should be noted that the structures of the motion enhancement layer and the spatial enhancement layer may be the same or different in specific applications. The specific structures of the motion enhancement layer and the spatial enhancement layer may be arranged and combined with reference to other embodiments corresponding to fig. 3 provided in the present specification, and the combination manner of the motion enhancement layer and the spatial enhancement layer is not limited in the present specification.
The server can fuse the motion enhancement features and the space enhancement features of the sample video to obtain dual enhancement features of the sample video, wherein the dual enhancement features are the result of comprehensive feature enhancement of the initial features from the sequence dimension and the channel dimension.
Specifically, as shown in fig. 4, the server sums the motion enhancement feature and the spatial enhancement feature of the sample video to obtain a primary fusion feature of the sample video, performs two-dimensional pooling on the primary fusion feature, determines a fifth average weight of the sample video, weights the motion enhancement feature of the sample video through the fifth average weight to obtain a first fusion feature, and weights the spatial enhancement feature of the sample video through the fifth average weight to obtain a second fusion feature. And adding the first fusion characteristic and the second fusion characteristic to obtain the dual enhancement characteristic of the sample video.
It should be noted that, as shown in fig. 4, only one way of fusing the motion enhancement feature and the spatial enhancement feature is shown. In practice, the primary fusion feature, the first fusion feature and the second fusion feature of the sample video are fusion results of the motion enhancement feature and the space enhancement feature, so that the server can directly use any one of the primary fusion feature, the first fusion feature or the second fusion feature as the dual enhancement feature.
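For illustration, the fusion path described above (and shown in fig. 4) can be sketched as follows; the tensor shapes and the function name are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def fuse_dual_enhancement(motion_feat: torch.Tensor,
                          spatial_feat: torch.Tensor) -> torch.Tensor:
    """motion_feat, spatial_feat: (N, C, H, W) enhancement features of the sample video."""
    primary = motion_feat + spatial_feat              # primary fusion feature
    fifth_avg = F.adaptive_avg_pool2d(primary, 1)     # fifth average weight
    first_fused = motion_feat * fifth_avg             # first fusion feature
    second_fused = spatial_feat * fifth_avg           # second fusion feature
    return first_fused + second_fused                 # dual enhancement feature
```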
After the dual enhancement features are determined, the server inputs the dual enhancement features into a regression layer to obtain a reconstructed video of the predicted three-dimensional human body in the sample video.
Because the dual enhancement feature is the result of the comprehensive feature enhancement of the initial feature from the sequence dimension and the channel dimension, not only the continuity between frames is enhanced, but also the accuracy within frames is further enhanced by the three-dimensional human body in the reconstructed video obtained by the dual enhancement feature.
In the step S110, the server may further improve the accuracy of three-dimensional human body reconstruction according to the two-dimensional reconstruction loss of the predicted three-dimensional human body.
The server can determine camera parameters of each frame of image in the sample video, and for each frame of image, determine a projection direction corresponding to the frame of image according to the camera parameters of the frame of image. And then, the server projects the predicted three-dimensional human body corresponding to the frame image according to the projection direction to obtain a predicted two-dimensional human body of the frame image, and determines a predicted two-dimensional articulation point of the predicted two-dimensional human body. And projecting the standard three-dimensional human body corresponding to the frame image according to the projection direction to obtain a standard two-dimensional human body of the frame image, and determining a standard two-dimensional articulation point of the standard two-dimensional human body.
And the server determines the two-dimensional reconstruction loss of the frame image according to the difference between the predicted two-dimensional articulation point and the standard two-dimensional articulation point corresponding to the frame image. And determining the two-dimensional reconstruction loss of the sample video according to the two-dimensional reconstruction loss of each frame image, and training a reconstruction model by taking the minimum two-dimensional reconstruction loss as a target.
Specifically, the two-dimensional reconstruction loss of the sample video can be determined according to the following formula:

$$L_{2D}=\sum_{b}\sum_{s}\sum_{j=1}^{J}\left\|\hat{x}_{b,s,j}-x_{b,s,j}\right\|$$

Wherein, $\hat{x}_{b,s,j}$ represents the position of the $j$-th predicted two-dimensional articulation point in the $s$-th frame image of the $b$-th image sequence, $x_{b,s,j}$ represents the position of the $j$-th standard two-dimensional articulation point in the $s$-th frame image of the $b$-th image sequence, and $J$ represents the number of two-dimensional articulation points contained in the two-dimensional human body, which is the same as the number of three-dimensional articulation points.
Then, the server determines the total loss according to the speed loss, the three-dimensional reconstruction loss and the two-dimensional reconstruction loss of the sample video, and trains the reconstruction model with the minimum total loss as a target.
Specifically, the total loss can be determined, for example, as a weighted combination of the three losses:

$$L_{total}=\lambda_{1}L_{vel}+\lambda_{2}L_{3D}+\lambda_{3}L_{2D}$$

Wherein $L_{vel}$ is the speed loss of each image sequence, $L_{3D}$ is the three-dimensional reconstruction loss of the sample video, $L_{2D}$ is the two-dimensional reconstruction loss of the sample video, and $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$ are weighting coefficients.
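As an illustrative sketch of the losses described above, the two-dimensional reconstruction loss and the total loss may be written as follows; the joint tensor layout and the weighting coefficients are assumptions made for this sketch.

```python
import torch

def two_d_reconstruction_loss(pred_2d: torch.Tensor, std_2d: torch.Tensor) -> torch.Tensor:
    """pred_2d, std_2d: (B, S, J, 2) predicted / standard two-dimensional articulation points
    obtained by projecting the predicted and standard three-dimensional human bodies."""
    return (pred_2d - std_2d).norm(dim=-1).sum()

def total_loss(speed_loss: torch.Tensor, loss_3d: torch.Tensor, loss_2d: torch.Tensor,
               lam_v: float = 1.0, lam_3d: float = 1.0, lam_2d: float = 1.0) -> torch.Tensor:
    # Weighted combination of the speed loss, 3D reconstruction loss and 2D reconstruction loss.
    return lam_v * speed_loss + lam_3d * loss_3d + lam_2d * loss_2d
```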
Based on the same idea as the training method for the three-dimensional human body reconstruction model for video provided above, the present specification also provides a corresponding training apparatus for the three-dimensional human body reconstruction model for video, as shown in fig. 6.
Fig. 6 is a schematic diagram of a training apparatus for a three-dimensional human body reconstruction model for video provided in the present specification, where the reconstruction model at least includes a feature extraction layer, a motion enhancement layer, and a regression layer, and the apparatus includes:
the acquisition module 200 acquires a sample video and determines a plurality of image sequences corresponding to the sample video;
The sequence feature element determining module 202 inputs each image sequence into the feature extraction layer to obtain an initial feature of the image sequence, determines a first tensor corresponding to the initial feature, determines sequence feature elements of the initial feature according to a frame number axis, a height axis and a width axis of the first tensor, and determines the number of sequence feature elements of the initial feature according to a sequence axis and a channel axis of the first tensor;
the motion enhancement module 204 inputs the sequence feature elements into the motion enhancement layer for each sequence feature element, determines motion enhancement features of the sequence feature elements, and determines motion enhancement features of the sample video according to the motion enhancement features of each sequence feature element;
the reconstructed video determining module 206 inputs the motion enhancement features of the sample video into the regression layer to obtain a reconstructed video of the predicted three-dimensional human body in the sample video;
The prediction speed determining module 208 determines a predicted three-dimensional node of a predicted three-dimensional human body in each frame image in the reconstructed video, and determines, for each image sequence, an average speed of the predicted three-dimensional node of the image sequence according to a position change of the predicted three-dimensional node in each frame image of the image sequence as a prediction speed of the image sequence;
The loss determination module 210 determines a loss of speed of the image sequence according to a difference between the predicted speed of the image sequence and the labeling speed of the image sequence, and trains the reconstruction model according to the loss of speed of each image sequence and the three-dimensional reconstruction loss of the sample video.
Optionally, the motion enhancing module 204 includes a first enhancer layer and a second enhancer layer, and is specifically configured to input the sequence feature element into the first enhancer layer to obtain a first enhancement feature of the sequence feature element, where the first enhancement feature is a result of feature enhancement on the sequence feature element in a combination of horizontal and vertical directions, and input the first enhancement feature into the second enhancer layer to obtain a motion enhancement feature of the sequence feature element, and the motion enhancement feature is a result of feature enhancement on the sequence feature element in a combination of different receptive fields.
Optionally, the motion enhancing module 204 is specifically configured to input the sequence feature element into the first enhancer layer, respectively perform one-dimensional pooling in a horizontal direction and a vertical direction on the sequence feature element, determine a first horizontal feature and a first vertical feature of the sequence feature element, determine, according to the first horizontal feature and the first vertical feature, a first enhancement weight of the sequence feature element at least through a preset first convolution kernel, and weight the sequence feature element through the first enhancement weight to obtain the first enhancement feature of the sequence feature element.
Optionally, the motion enhancing module 204 is specifically configured to input the first enhancing feature into the second enhancing sub-layer, perform convolution operation on the first enhancing feature through a preset second convolution kernel, determine a second enhancing feature of the sequence feature element, two-dimensionally pool the first enhancing feature to obtain a first average weight, two-dimensionally pool the second enhancing feature to obtain a second average weight, weight the second enhancing feature to obtain a first local weight through the first average weight, weight the first enhancing feature to obtain a second local weight through the second average weight, sum the first local weight and the second local weight to obtain a motion enhancing weight of the sequence feature element, and weight the sequence feature element through the motion enhancing weight to obtain the motion enhancing feature of the sequence feature element.
Optionally, the apparatus further includes a spatial enhancement module 212, specifically configured to group, for each image sequence, channels of the initial feature according to a preset number of sub-channels to obtain a plurality of channel groups, spatially recombine the initial feature according to the number of sub-channels and the number of channel groups, determine a second tensor corresponding to the initial feature, determine channel feature elements of the initial feature according to a sub-channel axis, a height axis and a width axis of the second tensor, determine the number of channel feature elements of the initial feature according to a sequence axis, a frame number axis and a group number axis of the second tensor, input, for each channel feature element, the channel feature element into the third enhancer layer to obtain a third enhancement feature of the channel feature element, where the third enhancement feature is a result of comprehensively performing feature enhancement on the channel feature element in the horizontal and vertical directions, input the third enhancement feature into the fourth enhancer layer to obtain a spatial enhancement feature of the channel feature element, where the spatial enhancement feature is a result of comprehensively performing feature enhancement on the channel feature element under different receptive fields, and determine the spatial enhancement feature of the sample video according to the spatial enhancement feature of each channel feature element.

Optionally, the spatial enhancement module 212 is specifically configured to input the channel feature element into the third enhancement sub-layer, respectively perform one-dimensional pooling in a horizontal direction and a vertical direction on the channel feature element, determine a second horizontal feature and a second vertical feature of the channel feature element, determine, according to the second horizontal feature and the second vertical feature, a third enhancement weight of the channel feature element at least through a preset third convolution kernel, and weight the channel feature element through the third enhancement weight to obtain a third enhancement feature of the channel feature element.

Optionally, the spatial enhancement module 212 is specifically configured to input the third enhancement feature into the fourth enhancement sub-layer, perform a convolution operation on the third enhancement feature through a preset fourth convolution kernel, determine a fourth enhancement feature of the channel feature element, two-dimensionally pool the third enhancement feature to obtain a third average weight, two-dimensionally pool the fourth enhancement feature to obtain a fourth average weight, weight the fourth enhancement feature through the third average weight to obtain a third local weight, weight the third enhancement feature through the fourth average weight to obtain a fourth local weight, sum the third local weight and the fourth local weight to obtain a spatial enhancement weight of the channel feature element, and weight the channel feature element through the spatial enhancement weight to obtain the spatial enhancement feature of the channel feature element.
Optionally, the reconstructed video determining module 206 is specifically configured to fuse the motion enhancement feature of the sample video and the spatial enhancement feature of the sample video to obtain a dual enhancement feature of the sample video, and input the dual enhancement feature into the regression layer to obtain a reconstructed video of the predicted three-dimensional human body in the sample video.
Optionally, the reconstructed video determining module 206 is specifically configured to sum the motion enhancement feature of the sample video and the spatial enhancement feature of the sample video to obtain a preliminary fusion feature of the sample video, two-dimensionally pool the preliminary fusion feature, determine a fifth average weight of the sample video, weight the motion enhancement feature of the sample video through the fifth average weight to obtain a first fusion feature, weight the spatial enhancement feature of the sample video through the fifth average weight to obtain a second fusion feature, and sum the first fusion feature and the second fusion feature to obtain a dual enhancement feature of the sample video.
Optionally, the loss determining module 210 is specifically configured to determine, as the labeling data, for each image sequence, an average speed of the standard three-dimensional node of the image sequence according to a position change of the standard three-dimensional node in each frame image of the image sequence, as the labeling speed of the image sequence, and determine a speed loss of the image sequence according to a difference between the predicted speed of the image sequence and the labeling speed of the image sequence.
Optionally, the loss determination module 210 is specifically configured to use a standard three-dimensional node of a standard three-dimensional human body corresponding to each frame image of the sample video as the labeling data, determine, for each frame image, a three-dimensional reconstruction loss of the frame image according to a difference between a predicted three-dimensional node corresponding to the frame image and the standard three-dimensional node, determine, according to the three-dimensional reconstruction loss of each frame image, a three-dimensional reconstruction loss of the sample video, and train the reconstruction model according to a speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video.
Optionally, the loss determination module 210 is specifically configured to use a standard three-dimensional node of a standard three-dimensional human body corresponding to each frame image of the sample video as the labeling data;
Determining camera parameters of each frame image of the sample video, determining a projection direction corresponding to each frame image according to the camera parameters of the frame image, projecting a predicted three-dimensional human body corresponding to the frame image according to the projection direction to obtain a predicted two-dimensional human body of the frame image, determining a predicted two-dimensional joint point of the predicted two-dimensional human body, projecting a standard three-dimensional human body corresponding to the frame image according to the projection direction to obtain a standard two-dimensional human body of the frame image, determining a standard two-dimensional joint point of the standard two-dimensional human body, determining a two-dimensional reconstruction loss of the frame image according to the difference between the predicted two-dimensional joint point and the standard two-dimensional joint point, determining a two-dimensional reconstruction loss of the sample video according to the two-dimensional reconstruction loss of each frame image, and training the reconstruction model according to the speed loss of each image sequence, the three-dimensional reconstruction loss of the sample video and the two-dimensional reconstruction loss of the sample video.
The present specification also provides a computer readable storage medium storing a computer program operable to perform the training method for the three-dimensional human body reconstruction model for video described above with respect to fig. 1.
The present specification also provides a schematic structural diagram of the electronic device shown in fig. 7. At the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, as shown in fig. 7, and of course may also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it, so as to implement the training method for the three-dimensional human body reconstruction model for video described above with respect to fig. 1.

Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded in this specification; that is, the execution subject of the above processing flow is not limited to each logic unit, but may also be hardware or a logic device.
Improvements to a technology could once be clearly distinguished as improvements in hardware (e.g., improvements to circuit structures such as diodes, transistors, switches, etc.) or improvements in software (improvements to a method flow). However, with the development of technology, many improvements of method flows today can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a Field Programmable Gate Array, FPGA) is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD, without requiring the chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code before compilation is also written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor and a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller; examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely in computer readable program code, it is entirely possible to implement the same functionality by logically programming the method steps such that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. Such a controller may thus be regarded as a kind of hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even the means for performing the various functions may be regarded as both software modules implementing the method and structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, Random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present application.

Claims (14)

1. A training method for a three-dimensional human reconstruction model for video, wherein the reconstruction model comprises at least a feature extraction layer, a motion enhancement layer and a regression layer, the motion enhancement layer comprising a first enhancer layer and a second enhancer layer, the method comprising:
acquiring a sample video, and determining a plurality of image sequences corresponding to the sample video;
inputting the image sequence into the feature extraction layer for each image sequence to obtain initial features of the image sequence, determining a first tensor corresponding to the initial features, determining sequence feature elements of the initial features according to a frame number axis, a height axis and a width axis of the first tensor, and determining the number of the sequence feature elements of the initial features according to the sequence axis and a channel axis of the first tensor;
Inputting the sequence feature element into the first enhancer layer aiming at each sequence feature element to obtain a first enhancement feature of the sequence feature element, wherein the first enhancement feature is a result of comprehensively carrying out feature enhancement on the sequence feature element in the horizontal and vertical directions, and inputting the first enhancement feature into the second enhancer layer to obtain a motion enhancement feature of the sequence feature element, and the motion enhancement feature is a result of comprehensively carrying out feature enhancement on the sequence feature element under different receptive fields;
Inputting the motion enhancement features of the sample video into the regression layer to obtain a reconstructed video of the predicted three-dimensional human body in the sample video;
Determining a predicted three-dimensional joint point of a predicted three-dimensional human body in each frame image in the reconstructed video, and determining the average speed of the predicted three-dimensional joint point of the image sequence as the predicted speed of the image sequence according to the position change of the predicted three-dimensional joint point in each frame image of the image sequence aiming at each image sequence;
and determining the speed loss of the image sequence according to the difference between the predicted speed of the image sequence and the labeling speed of the image sequence, and training the reconstruction model according to the speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video.
2. The method according to claim 1, wherein inputting the sequence feature element into the first enhancer layer, to obtain the first enhancement feature of the sequence feature element, specifically comprises:
Inputting the sequence feature element into the first enhancer layer, respectively carrying out one-dimensional pooling on the sequence feature element in the horizontal direction and the vertical direction, and determining a first horizontal feature and a first vertical feature of the sequence feature element;
determining a first enhancement weight of the sequence feature element according to the first horizontal feature and the first vertical feature at least through a preset first convolution kernel;
and weighting the sequence feature elements through the first enhancement weights to obtain first enhancement features of the sequence feature elements.
3. The method according to claim 2, wherein inputting the first enhancement feature into the second enhancement sub-layer results in a motion enhancement feature for the sequence of feature elements, comprising:
Inputting the first enhancement features into the second enhancement sub-layer, carrying out convolution operation on the first enhancement features through a preset second convolution kernel, and determining the second enhancement features of the sequence feature elements, wherein the size of the second convolution kernel is different from that of the first convolution kernel;
Two-dimensional pooling is carried out on the first enhancement features to obtain first average weights, and two-dimensional pooling is carried out on the second enhancement features to obtain second average weights;
Weighting the second enhancement feature through the first average weight to obtain a first local weight;
Weighting the first enhancement feature through the second average weight to obtain a second local weight;
adding the first local weight and the second local weight to obtain a motion enhancement weight of the sequence feature element;
and weighting the sequence feature elements through the motion enhancement weights to obtain the motion enhancement features of the sequence feature elements.
4. The method of claim 1, wherein the reconstruction model further comprises a spatial enhancement layer comprising a third enhancer layer and a fourth enhancer layer;
The method further comprises the steps of:
grouping channels of the initial characteristics according to the preset number of sub-channels for each image sequence to obtain a plurality of channel groups;
According to the number of the sub-channels and the number of the groups of the channels, carrying out space recombination on the initial characteristics, determining a second tensor corresponding to the initial characteristics, determining channel characteristic elements of the initial characteristics according to a sub-channel axis, a height axis and a width axis of the second tensor, and determining the number of the channel characteristic elements of the initial characteristics according to a sequence axis, a frame number axis and a group number axis of the second tensor;
Inputting the channel characteristic element into the third enhancer layer for each channel characteristic element to obtain a third enhancement characteristic of the channel characteristic element, wherein the third enhancement characteristic is a result of comprehensively enhancing the characteristics of the channel characteristic element in the horizontal and vertical directions;
Inputting the third enhancement feature into the fourth enhancement sub-layer to obtain the spatial enhancement feature of the channel feature element, wherein the spatial enhancement feature is the result of comprehensively enhancing the channel feature element under different receptive fields;
and determining the spatial enhancement characteristic of the sample video according to the spatial enhancement characteristic of each channel characteristic element.
5. The method of claim 4, wherein inputting the channel feature element into the third enhancer layer to obtain a third enhancement feature of the channel feature element, specifically comprises:
Inputting the channel characteristic elements into the third enhancer layer, respectively carrying out one-dimensional pooling on the channel characteristic elements in the horizontal direction and the vertical direction, and determining a second horizontal characteristic and a second vertical characteristic of the channel characteristic elements;
Determining a third enhancement weight of the channel feature element according to the second horizontal feature and the second vertical feature at least through a preset third convolution kernel;
And weighting the channel feature element through the third enhancement weight to obtain the third enhancement feature of the channel feature element.
6. The method of claim 4, wherein inputting the third enhancement feature into the fourth enhancement sub-layer results in a spatial enhancement feature for the channel feature element, comprising:
Inputting the third enhancement feature into the fourth enhancement sub-layer, carrying out convolution operation on the third enhancement feature through a preset fourth convolution kernel, and determining the fourth enhancement feature of the channel feature element, wherein the size of the fourth convolution kernel is different from that of the third convolution kernel;
Performing two-dimensional pooling on the third enhancement feature to obtain a third average weight, and performing two-dimensional pooling on the fourth enhancement feature to obtain a fourth average weight;
weighting the fourth enhancement feature by the third average weight to obtain a third local weight;
Weighting the third enhancement feature by the fourth average weight to obtain a fourth local weight;
adding the third local weight and the fourth local weight to obtain the space enhancement weight of the channel feature element;
And weighting the channel characteristic elements through the space enhancement weights to obtain the space enhancement characteristics of the channel characteristic elements.
7. The method of claim 4, wherein inputting the motion enhancement features of the sample video into the regression layer yields a reconstructed video of the predicted three-dimensional human body in the sample video, comprising:
Fusing the motion enhancement features of the sample video and the space enhancement features of the sample video to obtain dual enhancement features of the sample video;
And inputting the dual enhancement features into the regression layer to obtain a reconstructed video of the predicted three-dimensional human body in the sample video.
8. The method according to claim 7, wherein the motion enhancement feature of the sample video and the spatial enhancement feature of the sample video are fused to obtain a dual enhancement feature of the sample video, specifically comprising:
Adding the motion enhancement feature of the sample video and the space enhancement feature of the sample video to obtain a primary fusion feature of the sample video;
carrying out two-dimensional pooling on the primary fusion characteristics, and determining a fifth average weight of the sample video;
weighting the motion enhancement features of the sample video through the fifth average weight to obtain a first fusion feature;
weighting the spatial enhancement features of the sample video through the fifth average weight to obtain a second fusion feature;
and adding the first fusion feature and the second fusion feature to obtain the dual enhancement feature of the sample video.
9. The method according to claim 1, wherein determining the velocity loss of the image sequence based on the difference between the predicted velocity of the image sequence and the labeling velocity of the image sequence, comprises:
taking standard three-dimensional joint points of a standard three-dimensional human body corresponding to each frame of image of the sample video as annotation data;
for each image sequence, determining the average speed of the standard three-dimensional joint points of the image sequence according to the position change of the standard three-dimensional joint points in each frame image of the image sequence, and taking the average speed as the labeling speed of the image sequence;
And determining the speed loss of the image sequence according to the difference between the predicted speed of the image sequence and the labeling speed of the image sequence.
10. The method according to claim 1, wherein training the reconstruction model based on the speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video, in particular comprises:
taking standard three-dimensional joint points of a standard three-dimensional human body corresponding to each frame of image of the sample video as annotation data;
For each frame of image, determining the three-dimensional reconstruction loss of the frame of image according to the difference between the predicted three-dimensional articulation point corresponding to the frame of image and the standard three-dimensional articulation point;
Determining the three-dimensional reconstruction loss of the sample video according to the three-dimensional reconstruction loss of each frame of image;
and training the reconstruction model according to the speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video.
11. The method according to claim 1, wherein training the reconstruction model based on the speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video, in particular comprises:
taking standard three-dimensional joint points of a standard three-dimensional human body corresponding to each frame of image of the sample video as annotation data;
determining camera parameters of each frame of image of the sample video, and determining a projection direction corresponding to each frame of image according to the camera parameters of the frame of image;
Projecting a predicted three-dimensional human body corresponding to the frame image according to the projection direction to obtain a predicted two-dimensional human body of the frame image, and determining a predicted two-dimensional articulation point of the predicted two-dimensional human body;
Projecting a standard three-dimensional human body corresponding to the frame image according to the projection direction to obtain a standard two-dimensional human body of the frame image, and determining a standard two-dimensional articulation point of the standard two-dimensional human body;
determining the two-dimensional reconstruction loss of the frame image according to the difference between the predicted two-dimensional articulation point and the standard two-dimensional articulation point;
determining the two-dimensional reconstruction loss of the sample video according to the two-dimensional reconstruction loss of each frame of image;
And training the reconstruction model according to the speed loss of each image sequence, the three-dimensional reconstruction loss of the sample video and the two-dimensional reconstruction loss of the sample video.
12. A training device for a three-dimensional human body reconstruction model of a video, the reconstruction model comprising at least a feature extraction layer, a motion enhancement layer and a regression layer, the motion enhancement layer comprising a first enhancement sub-layer and a second enhancement sub-layer, the device comprising:
an acquisition module, configured to acquire a sample video and determine a plurality of image sequences corresponding to the sample video;
a sequence feature element determining module, configured to, for each image sequence, input the image sequence into the feature extraction layer to obtain an initial feature of the image sequence, determine a first tensor corresponding to the initial feature, determine sequence feature elements of the initial feature according to a frame number axis, a height axis and a width axis of the first tensor, and determine the number of sequence feature elements of the initial feature according to a sequence axis and a channel axis of the first tensor;
a motion enhancement module, configured to, for each sequence feature element, input the sequence feature element into the first enhancement sub-layer to obtain a first enhancement feature of the sequence feature element, the first enhancement feature being a result of comprehensively performing feature enhancement on the sequence feature element in the horizontal and vertical directions, and input the first enhancement feature into the second enhancement sub-layer to obtain a motion enhancement feature of the sequence feature element, the motion enhancement feature being a result of comprehensively performing feature enhancement on the sequence feature element under different receptive fields;
a reconstructed video determining module, configured to input the motion enhancement features of the sample video into the regression layer to obtain a reconstructed video of a predicted three-dimensional human body in the sample video;
a prediction speed determining module, configured to determine predicted three-dimensional joint points of the predicted three-dimensional human body in each frame image of the reconstructed video, and, for each image sequence, determine the average speed of the predicted three-dimensional joint points of the image sequence as the prediction speed of the image sequence according to the position changes of the predicted three-dimensional joint points across the frame images of the image sequence;
a loss determination module, configured to determine the speed loss of each image sequence according to the difference between the prediction speed of the image sequence and the annotated speed of the image sequence, and to train the reconstruction model according to the speed loss of each image sequence and the three-dimensional reconstruction loss of the sample video.
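As an illustrative aid only (not part of the claims), the following sketch outlines two operations handled by the modules above: splitting the first tensor of an initial feature into sequence feature elements along its sequence and channel axes, and computing the speed loss from the predicted three-dimensional joint points of each image sequence. The tensor axis order, the function names and the exact definition of the prediction speed (mean joint displacement between consecutive frames) are assumptions made for illustration.

```python
import torch

def sequence_feature_elements(initial_feature):
    # initial_feature: first tensor with assumed axes
    # (sequence, channel, frame number, height, width).
    # Each (sequence, channel) slice is one sequence feature element,
    # so the number of elements is N * C.
    n, c, t, h, w = initial_feature.shape
    return initial_feature.reshape(n * c, t, h, w)

def sequence_pred_speed(joints_3d_seq):
    # joints_3d_seq: (T, J, 3) predicted 3D joint points of one image
    # sequence; the prediction speed is taken as the mean displacement of
    # the joint points between consecutive frames (assumed definition).
    displacements = joints_3d_seq[1:] - joints_3d_seq[:-1]   # (T-1, J, 3)
    return displacements.norm(dim=-1).mean()

def speed_loss(pred_joints_per_seq, annotated_speed_per_seq):
    # One term per image sequence: difference between the predicted average
    # joint speed and the annotated speed of that sequence.
    losses = [(sequence_pred_speed(j) - s).abs()
              for j, s in zip(pred_joints_per_seq, annotated_speed_per_seq)]
    return torch.stack(losses).sum()
```

Whether the per-sequence terms are summed or averaged is a further assumption not fixed above.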
13. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 11.
14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 11 when executing the program.
CN202410175200.4A 2024-02-07 2024-02-07 Training method and device for three-dimensional human body reconstruction model of video Active CN117726760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410175200.4A CN117726760B (en) 2024-02-07 2024-02-07 Training method and device for three-dimensional human body reconstruction model of video

Publications (2)

Publication Number Publication Date
CN117726760A CN117726760A (en) 2024-03-19
CN117726760B true CN117726760B (en) 2024-05-07

Family

ID=90211027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410175200.4A Active CN117726760B (en) 2024-02-07 2024-02-07 Training method and device for three-dimensional human body reconstruction model of video

Country Status (1)

Country Link
CN (1) CN117726760B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553968A (en) * 2020-05-11 2020-08-18 青岛联合创智科技有限公司 Method for reconstructing animation by three-dimensional human body
CN114373056A (en) * 2021-12-17 2022-04-19 云南联合视觉科技有限公司 Three-dimensional reconstruction method and device, terminal equipment and storage medium
WO2022227292A1 (en) * 2021-04-29 2022-11-03 苏州大学 Action recognition method
CN115346239A (en) * 2022-07-28 2022-11-15 四川弘和通讯集团有限公司 Human body posture estimation method and device, electronic equipment and storage medium
CN116188684A (en) * 2023-01-03 2023-05-30 中国电信股份有限公司 Three-dimensional human body reconstruction method based on video sequence and related equipment
CN116309158A (en) * 2023-03-10 2023-06-23 北京世纪好未来教育科技有限公司 Training method, three-dimensional reconstruction method, device, equipment and medium of network model
CN116524121A (en) * 2023-04-18 2023-08-01 中国科学院深圳先进技术研究院 Monocular video three-dimensional human body reconstruction method, system, equipment and medium
CN117372631A (en) * 2023-12-07 2024-01-09 之江实验室 Training method and application method of multi-view image generation model
CN117409466A (en) * 2023-11-02 2024-01-16 之江实验室 Three-dimensional dynamic expression generation method and device based on multi-label control

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Automatic foreground segmentation from multi-view images with planar background; Yi-Hsien Lee et al.; IEEE; 2014-12-31; entire document *
Video image super-resolution reconstruction method based on convolutional neural networks; 刘村 et al.; 计算机应用研究 (Application Research of Computers); 2018-02-28 (No. 04); entire document *

Also Published As

Publication number Publication date
CN117726760A (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN117372631B (en) Training method and application method of multi-view image generation model
CN112036236B (en) Image detection method, device and medium based on GhostNet
CN116977525B (en) Image rendering method and device, storage medium and electronic equipment
CN112784857B (en) Model training and image processing method and device
CN115981870B (en) Data processing method and device, storage medium and electronic equipment
CN116343314B (en) Expression recognition method and device, storage medium and electronic equipment
CN117635822A (en) Model training method and device, storage medium and electronic equipment
CN116030247B (en) Medical image sample generation method and device, storage medium and electronic equipment
CN116091895B (en) Model training method and device oriented to multitask knowledge fusion
CN117726760B (en) Training method and device for three-dimensional human body reconstruction model of video
CN117409466A (en) Three-dimensional dynamic expression generation method and device based on multi-label control
CN116342888A (en) Method and device for training segmentation model based on sparse labeling
CN117893696B (en) Three-dimensional human body data generation method and device, storage medium and electronic equipment
CN117726907B (en) Training method of modeling model, three-dimensional human modeling method and device
CN117880444B (en) Human body rehabilitation exercise video data generation method guided by long-short time features
CN117911630B (en) Three-dimensional human modeling method and device, storage medium and electronic equipment
CN117830564B (en) Three-dimensional virtual human model reconstruction method based on gesture distribution guidance
CN117808976B (en) Three-dimensional model construction method and device, storage medium and electronic equipment
CN113887326B (en) Face image processing method and device
CN117745956A (en) Pose guidance-based image generation method, device, medium and equipment
CN116229218B (en) Model training and image registration method and device
CN117173321B (en) Method and device for selecting three-dimensional reconstruction texture view
CN117689822B (en) Three-dimensional model construction method and device, storage medium and electronic equipment
CN116245773A (en) Face synthesis model training method and device, storage medium and electronic equipment
CN116206357A (en) Training method of pet gesture recognition model and pet gesture recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant