CN115761801A - Three-dimensional human body posture migration method based on video time sequence information - Google Patents


Info

Publication number: CN115761801A
Application number: CN202211470729.6A
Authority: CN (China)
Prior art keywords: image, human body, dimensional, features, body posture
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 邓若愚, 胡尚薇
Current and original assignee: Tongji Institute Of Artificial Intelligence Suzhou Co ltd (the listed assignees may be inaccurate)
Application filed by Tongji Institute Of Artificial Intelligence Suzhou Co ltd

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a three-dimensional human body posture migration method based on video time sequence information, which comprises the following steps: extracting spatial and time sequence features from a source image and a reference posture image; rendering and outputting SMPL (Skinned Multi-Person Linear) parameters through the SMPL model, and projecting the SMPL parameters onto a two-dimensional plane to obtain the corresponding maps of the images; obtaining the vertices in two-dimensional space, calculating barycentric coordinates, and obtaining a transformation matrix and a transformed image by matching the corresponding maps; masking the image to obtain foreground and background images; obtaining the mask of the corresponding map, and connecting the background image and the mask of the corresponding map along the color channels to generate a background; obtaining an attention map and a color map, reconstructing an image, and synthesizing the posture migration result. The invention has the advantages of high quality of the generated images and better retention of the identity characteristics of people.

Description

Three-dimensional human body posture migration method based on video time sequence information
Technical Field
The invention relates to the technical field of image processing, in particular to a three-dimensional human body posture migration method based on video time sequence information.
Background
The human body posture migration task at least comprises two inputs, wherein one input is a target character, the other input is a reference posture, and the model is used for changing the posture of the target character into the reference posture on the premise of keeping the identity of the target character unchanged so as to synthesize new images of the target character under different postures or visual angles. Three-dimensional human body posture migration synthesis needs to be established on the basis of three-dimensional human body posture estimation, needs to express the position and the angle of a real three-dimensional human body, and compared with two-dimensional human body posture migration, the three-dimensional situation is more complex, but the generated image quality is higher. This algorithm is mainly applied in the following scenarios: (1) The method helps the video production field to synthesize a new visual angle to enhance the reality of the work; (2) Synthesizing new images of pedestrians at different viewing angles, expanding a data set for pedestrian re-identification, and improving the generalization performance of a pedestrian re-identification model; (3) The online fitting effect is realized, and the e-commerce platform and the user are helped to complete transaction smoothly. Therefore, the human body posture migration has wide application range and higher application value.
Human pose migration synthesis aims at transforming the pose of a target person according to a given target pose and then generating a target image, and the main challenge at present is that there are many differences between a source image and a reference pose image. Existing methods can be broadly divided into three categories: top-down methods, bottom-up methods, and hybrid methods.
Specifically, a top-down method directly learns an input-to-output mapping relationship, and most of the top-down method adopts a GAN network; the bottom-up approach will get the final output through combining the intermediate processes; the hybrid approach combines the advantages of both approaches. The top-down human body posture migration method directly connects a source image and a reference posture as input, and adds a fine adjustment module to adjust the definition of a generated image so as to synthesize a final target image, but the image quality is not high. At present, most human body image synthesis needs to rely on human body key point detection technology, and represents human body posture by using a two-dimensional human body key point heat map, however, the human body key point position only comprises joints and bones, and has no human appearance information, such as skin, clothes and the like. In addition, because the two-dimensional key points cannot accurately represent the posture of the human body under the condition of occlusion, the appearance of the target person cannot be completely matched with the reference posture when the target image is synthesized. For example, when a short person mimics the action of a tall person, using two-dimensional body key points inevitably changes the height of the short person.
In patent publication No. CN114638744A, a human body posture migration method based on two-dimensional bone key points is disclosed, which describes human bodies in original and reference postures by using multiple attributes, and performs shape self-adaptation on the human body in the reference posture, which is helpful for improving visual effect of generated images. However, the human skeleton key points cannot accurately represent the postures of the human body under the shielding condition, so that the appearance of the target character cannot be completely matched with the reference posture.
The patent with publication number CN113223124A discloses a posture transfer method based on a three-dimensional human parametric model, in which different parts of the three-dimensional human parametric model of a target character are bound with different colors for rendering to obtain a posture image, thereby reducing the required amount of training data while ensuring the quality of results. However, the temporal information of the pose in the video is not utilized, resulting in insufficient consistency of the pose after transfer.
Disclosure of Invention
The invention aims to provide a three-dimensional human body posture migration method based on video time sequence information.
In order to achieve the purpose, the invention adopts the technical scheme that:
a three-dimensional human body posture migration method based on video time sequence information comprises the following steps:
s1: extracting spatial features and time sequence features through the source character image and the reference attitude image,
s2: extracting image features, rendering and outputting SMPL parameters through an SMPL model: human body posture, shape parameters and three-dimensional human body grids,
s3: projecting the SMPL parameters to a two-dimensional plane to obtain a corresponding map of a source character image and a reference attitude image; projecting the three-dimensional human body mesh to a two-dimensional image space to obtain a vertex of the two-dimensional space,
s4a: calculating the barycentric coordinate of each grid surface according to the vertex of the two-dimensional space, obtaining a conversion matrix by matching the corresponding graph, and transforming the image by using the conversion matrix to obtain a transformed image;
s4b: masking the image based on the corresponding map of the image to obtain a foreground image and a background image; binarizing the corresponding map of the image to obtain a mask of the corresponding map, and connecting the background image and the mask of the corresponding map along the color channels to generate a background O_bg;
S5: extracting the features of the source information, obtaining an attention map A and a color map P by performing convolution operations on the extracted features, reconstructing an image O_s using the attention map A and the color map P, and synthesizing the posture migration result O_t.
Preferably, in the above technical solution, in S1: the method comprises the steps of segmenting a reference attitude video frame by frame to obtain a reference attitude image, extracting spatial features of frames in the video by using a Convolutional Neural Network (CNN), and then learning time sequence features by using a gated recurrent neural network (GRU).
Further preferably, in S2: for each key frame in the video, extracting image features by using ResNet50, coding the features into vectors with 2048 dimensions, and rendering output SMPL parameters by using an SMPL model.
Preferably, in the above technical solution, in S2: SMPL parameters: { K, θ, β, M }, where K is the root node and M (θ, β) is a differentiable function used to parameterize the mesh of vertices and triangular faces.
Preferably, in the above technical solution, in S3: projecting the three-dimensional human body mesh of the image to a two-dimensional image space through weak perspective projection to obtain the vertex of the two-dimensional space:
v_s = Proj(V_s, K_s).
Preferably, in the above technical solution, in S4a: the transformation matrix T ∈ R^{H×W×2}, where H×W is the resolution of the input image.
Preferably, in S4b: and generating a background by using a background generator, wherein the background generator comprises an encoder and a decoder, the down-sampling is carried out by using multilayer convolution at the stage of the encoder, and then the up-sampling is carried out by using transposed convolution at the stage of the decoder, so that the image is restored to the original size.
Preferably, in the above technical solution, in S5: attention is drawn to the position information for a convolution kernel using one channel; the color map uses a three-channel convolution kernel, focusing on color information.
Preferably, in S5, the following steps: the generation of the final result can be summarized as the following formula:
O s =P s ⊙A s +O bg ⊙(1-A s )
O t =P t ⊙A t +O bg ⊙(1-A t )。
preferably, in S5, the following steps: in the synthesis process, the global similarity among the multi-source features is firstly learned, then the learned similarity and the multi-source features are linearly combined in a feature space, and the fused features are transmitted to the global features through a spatial adaptive normalization algorithm.
Due to the application of the technical scheme, compared with the prior art, the invention has the following advantages:
1. Detecting two-dimensional human body key points can neither describe a person's stature nor simulate joint rotation; representing the posture and stature of the portrait by a three-dimensional human body mesh can simulate both the position and the rotation of the joints, and can represent various body shapes;
2. by using the human body posture estimation of the time sequence information in the video, compared with the method of estimating the human body posture in the video frame by frame, the estimation result is more accurate;
3. the local features of the source images are synthesized into the global features, and the target image is synthesized according to the reference posture, so that the identity information of the portrait is better kept;
4. by using a small-sample adversarial learning strategy and adaptive steps, the network pays attention to the differing parts of each individual, and finally generates a high-quality target image on the premise that details such as the face, clothes, and stature of the figure in the source image are unchanged, improving the generalization capability of the model.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a block diagram of the SMPL model of the method of the present invention;
FIG. 3 is a schematic diagram of a background generator of the method of the present invention;
FIG. 4 is a schematic diagram of a source image reconstruction generator of the method of the present invention;
FIG. 5 is a schematic diagram of a migration image generator according to the method of the present invention;
FIG. 6 is a schematic diagram of the human body posture migration result of the method of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in figs. 1, 2, and 6, the implementation process of the three-dimensional human body posture migration method based on video time sequence information is explained as follows:
Input a source character image I_s (hereinafter referred to as the source image) and a reference posture image I_t, the latter obtained by segmenting the reference posture video frame by frame. First, a convolutional neural network (CNN) extracts the spatial features of each key frame in the video, and then a gated recurrent neural network (GRU) learns the time sequence features.
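The CNN-then-GRU step above can be illustrated with a minimal NumPy sketch of a single GRU cell aggregating per-frame features over time. This is not the patent's actual network; the dimensions, weight layout, and the stand-in random "CNN features" are all illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    z = sigmoid(x @ Wz + h @ Uz)
    r = sigmoid(x @ Wr + h @ Ur)
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
feat_dim, hid_dim, n_frames = 8, 4, 5
# Stand-in for the per-frame spatial features a CNN would produce.
frame_feats = rng.standard_normal((n_frames, feat_dim))
# Input-to-hidden weights (feat_dim x hid_dim) alternate with
# hidden-to-hidden weights (hid_dim x hid_dim).
params = [rng.standard_normal((feat_dim, hid_dim)) * 0.1 if i % 2 == 0
          else rng.standard_normal((hid_dim, hid_dim)) * 0.1
          for i in range(6)]
h = np.zeros(hid_dim)
for x in frame_feats:           # temporal aggregation over the video frames
    h = gru_step(h, x, *params)
print(h.shape)
```

The final hidden state `h` plays the role of the learned time sequence feature for the clip.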
Image features are extracted from each key frame using ResNet50 and encoded into a 2048-dimensional vector, and the corresponding SMPL parameters are rendered and output through the SMPL model: {K_s, θ_s, β_s, M_s} for the source image and {K_r, θ_r, β_r, M_r} for the reference image. Here K is the root node and M(θ, β) is a differentiable function used to parameterize the mesh of vertices and triangular faces. The method only needs to keep the human body posture θ (pose), the shape parameter β (shape), and the camera parameters, and learns the mapping between the image and the SMPL parameters by combining regression and optimization.
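The regression from the 2048-dimensional encoding to SMPL-style parameters can be sketched as a linear head. The split into 72 pose values (24 joints × 3 axis-angle), 10 shape values, and 3 camera values follows the common SMPL convention and is an assumption here, as is the single linear layer standing in for the patent's regressor:

```python
import numpy as np

rng = np.random.default_rng(1)
feat = rng.standard_normal(2048)   # stand-in for a ResNet50 image encoding

# Hypothetical linear regression head mapping the encoding to SMPL-style
# parameters: pose theta (24 joints x 3), shape beta (10), camera (3).
W = rng.standard_normal((2048, 85)) * 0.01
params = feat @ W
theta, beta, cam = params[:72], params[72:82], params[82:]
print(theta.shape, beta.shape, cam.shape)
```

In practice such a head is trained jointly with the encoder; the split indices above only illustrate how one flat vector carries all three parameter groups.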
In order to effectively combine the SMPL parametric model of human body information with the various features extracted by the network, the SMPL parameters need to be converted into the form of a two-dimensional corresponding map as the representation of the human body posture. In the discriminator stage, a motion discriminator combines the predicted posture sequence with posture sequences in the dataset to discriminate true from false; the training objective is to minimize the difference between the predicted posture and the real motion posture. The motion discriminator introduces a self-attention mechanism, which can amplify the contribution of important frames.
The method specifically comprises the following steps: based on the preceding human body posture estimation, the SMPL parameters are projected onto a two-dimensional plane to obtain the corresponding map C_s of the source image I_s and the corresponding map C_t of the reference posture image I_t (human body parsing maps). A corresponding map divides the portrait captured in an image into several semantically consistent regions, such as limbs and clothes, and belongs to a fine-grained semantic segmentation task.
V_s is the three-dimensional human body mesh of the source image I_s output by the SMPL model. Through weak perspective projection, V_s is projected into the two-dimensional image space; the projection method Proj refers to the differentiable neural 3D mesh renderer (NMR), yielding the vertices of the two-dimensional space, v_s = Proj(V_s, K_s).
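A weak perspective projection can be sketched in a few lines. The camera convention K = (s, t_x, t_y), with scale s and image-plane translation (t_x, t_y), is an assumption for illustration and is not taken from the patent:

```python
import numpy as np

def weak_perspective_project(V, K):
    """Project 3-D mesh vertices V (N, 3) to 2-D under a weak perspective
    camera K = (s, tx, ty): v = s * (x, y) + (tx, ty). Depth z only
    matters through the global scale s, which is what makes it 'weak'."""
    s, tx, ty = K
    return s * V[:, :2] + np.array([tx, ty])

V_s = np.array([[0.0, 0.0, 1.0],
                [1.0, 2.0, 1.5]])   # toy mesh vertices
K_s = (2.0, 0.5, -0.5)
v_s = weak_perspective_project(V_s, K_s)
print(v_s)                          # [[0.5, -0.5], [2.5, 3.5]]
```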
From the vertices v_s of the two-dimensional space, the barycentric coordinates f_s of each mesh face are calculated. By matching the barycentric coordinates f_s of the source image I_s with the corresponding map C_s, a transformation matrix T ∈ R^{H×W×2} is obtained, where H×W is the size of the input image. The transformation matrix of the reference posture image I_t is calculated in the same manner. Finally, the source image I_s is transformed by the transformation matrix to obtain the transformed image I_syn.
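Computing the barycenter of each mesh face from the projected vertices is a one-liner once faces are stored as vertex-index triples. This sketch shows only that per-face step; how the barycenters are then matched against the corresponding map to fill the H×W×2 matrix T is left out, and the toy mesh is an assumption:

```python
import numpy as np

verts2d = np.array([[0.0, 0.0],
                    [2.0, 0.0],
                    [0.0, 2.0],
                    [2.0, 2.0]])          # projected 2-D vertices v_s
faces = np.array([[0, 1, 2],
                  [1, 3, 2]])            # triangular faces as vertex indices

# Barycenter f of each triangular face: mean of its three 2-D vertices.
f = verts2d[faces].mean(axis=1)
print(f)                                  # [[2/3, 2/3], [4/3, 4/3]]
```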
Simultaneously: based on the corresponding map C_s of the source image I_s, the image is masked to obtain the foreground image I_ft and the background image I_bg. The corresponding map C_s is binarized to obtain its mask, and the background image I_bg and the mask of C_s are concatenated along the color channels to generate a realistic background O_bg. As shown in fig. 3: the background O_bg is generated by a background generator consisting of an encoder and a decoder. The encoder is a denoising convolutional autoencoder that downsamples with multilayer convolutions; the decoder then upsamples with transposed convolutions to restore the image to its original size. A neck module is placed between the two stages, with a residual block added to improve the generalization performance of the model.
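The encoder-decoder size bookkeeping can be illustrated without any learned weights: stride-2 average pooling stands in for the strided convolutions of the encoder, and nearest-neighbour 2× upsampling stands in for the transposed convolutions of the decoder. Both stand-ins are assumptions made purely to show that two downsampling stages followed by two upsampling stages restore the original resolution:

```python
import numpy as np

def downsample(x):
    """Stride-2 average pooling as a stand-in for a strided convolution."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour 2x upsampling as a crude stand-in for a
    transposed convolution."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

img = np.arange(64, dtype=float).reshape(8, 8)   # toy single-channel image
code = downsample(downsample(img))               # encoder: 8x8 -> 4x4 -> 2x2
restored = upsample(upsample(code))              # decoder: 2x2 -> 4x4 -> 8x8
print(img.shape, code.shape, restored.shape)
```

In the real generator the bottleneck `code` would pass through the neck module's residual block before decoding.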
The features of the source information are extracted, and convolution operations on the extracted features yield an attention map A and a color map P: the attention map uses a one-channel convolution kernel, producing a single-channel map that mainly focuses on position information, while the color map uses a three-channel (RGB) convolution kernel, producing a color map that mainly focuses on color information.
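The two heads differ only in their output channel count, which a 1×1 convolution makes easy to see. The channel-first layout, the 1×1 kernels, and the sigmoid/tanh activations are illustrative assumptions, not details given in the patent:

```python
import numpy as np

rng = np.random.default_rng(2)
feat = rng.standard_normal((16, 8, 8))   # C=16 feature map, H=W=8

def conv1x1(x, W):
    """1x1 convolution: a per-pixel linear map over the channel axis."""
    return np.einsum('chw,co->ohw', x, W)

# Attention head: one output channel, squashed to (0, 1) as a soft mask.
A = 1.0 / (1.0 + np.exp(-conv1x1(feat, rng.standard_normal((16, 1)))))
# Color head: three output channels (RGB).
P = np.tanh(conv1x1(feat, rng.standard_normal((16, 3))))
print(A.shape, P.shape)
```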
The image O_s is reconstructed using the attention map A and the color map P, and the posture migration result O_t is synthesized. The generation of the final result can be summarized by the following formulas:
O_s = P_s ⊙ A_s + O_bg ⊙ (1 − A_s),
O_t = P_t ⊙ A_t + O_bg ⊙ (1 − A_t).
Here, ⊙ denotes element-wise multiplication, i.e., the multiplication of corresponding elements of two matrices.
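The compositing formulas above translate directly into array operations, with the attention map acting as a soft alpha mask blending the color map over the generated background. The toy shapes and random inputs are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
H = W = 4
A_s = rng.random((1, H, W))     # attention map in [0, 1], broadcast over RGB
P_s = rng.random((3, H, W))     # color map (foreground person)
O_bg = rng.random((3, H, W))    # generated background

# O_s = P_s ⊙ A_s + O_bg ⊙ (1 - A_s): attention blends foreground
# and background element-wise.
O_s = P_s * A_s + O_bg * (1 - A_s)
print(O_s.shape)
```

O_t is produced by the same blend using P_t and A_t in place of P_s and A_s.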
In the generation process, the global similarity among multi-source features is first learned, and then the learned similarity and the multi-source features are linearly combined in the feature space, i.e., the multi-source features are fused. To better transfer the style, source features such as color and texture are passed into the global feature, and the fused features are propagated through a spatially adaptive normalization algorithm. As shown in fig. 4: X_si^l denotes the l-th layer features extracted from the i-th source image by the generator, and T(X_si^l) denotes the features obtained by warping X_si^l with the transformation matrix, the warping being implemented by a bilinear sampler (BS). X_t^l denotes the l-th layer features output by the generator, which aggregates the warped multi-source features T(X_si^l) into the final generated image.
because warped multi-source features are added to the global features, the area of overlap is enlarged. The method introduces a mechanism of attention that helps solve the problem of artifacts in the generated images. Specifically, query embedding (Q), key embedding (K), and value embedding (V) in the attention mechanism are all generated by a convolutional neural network, the output of the generator is used as the input of Q, K, and V, and the weight formula of attention is:
Attention(Q, K, V) = Softmax(QK^T / √d_k) V.
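Scaled dot-product attention as written above is a short function; the sketch below uses toy random Q, K, V rather than the convolutional embeddings the patent describes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # each row sums to 1
    return weights @ V

rng = np.random.default_rng(4)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)
```

Because each weight row sums to 1, attention over a constant V returns that constant, which is a handy sanity check.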
The SMPL model replaces two-dimensional human body key points as the representation of human body posture information, which overcomes their shortcomings: instead of relying on joint positions, the model calculates a transformation matrix by matching the correspondences of the human body mesh and then maps it into the generated image, so the posture can be transformed more accurately. In addition, since the human body posture in a video is coherent, introducing time sequence information makes the processing of video more reasonable. Traditional methods estimate human body posture in a video by feeding the frames into the network as mutually independent inputs; but the human body posture is temporally continuous, so the added time sequence encoder predicts postures in the video more accurately, can predict the posture of an occluded human body, and improves the performance of human body posture estimation on video data.
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (10)

1. A three-dimensional human body posture migration method based on video time sequence information is characterized in that: the method comprises the following steps:
s1: extracting spatial features and time sequence features through the source character image and the reference attitude image,
s2: extracting image features, rendering and outputting SMPL parameters through an SMPL model: human body posture, shape parameters and three-dimensional human body grids,
s3: projecting the SMPL parameters to a two-dimensional plane to obtain a corresponding graph of a source character image and a reference posture image; projecting the three-dimensional human body mesh to a two-dimensional image space to obtain a vertex of the two-dimensional space,
s4a: calculating the barycentric coordinate of each grid surface according to the vertex of the two-dimensional space, obtaining a conversion matrix by matching the corresponding graph, and converting the image by using the conversion matrix to obtain a converted image;
s4b: masking the image based on the corresponding map of the image to obtain a foreground image and a background image; binarizing the corresponding map of the image to obtain a mask of the corresponding map, and connecting the background image and the mask of the corresponding map along the color channels to generate a background O_bg; and S5: extracting the features of the source information, obtaining an attention map A and a color map P by performing convolution operations on the extracted features, reconstructing an image O_s using the attention map A and the color map P, and synthesizing the posture migration result O_t.
2. The three-dimensional human body posture migration method based on video time series information as claimed in claim 1, characterized in that: in S1: the method comprises the steps of segmenting a reference attitude video frame by frame to obtain a reference attitude image, extracting spatial features from frames in the video by using a Convolutional Neural Network (CNN), and then learning time sequence features by using a gated cyclic neural network (GRU).
3. The three-dimensional human body posture migration method based on video time sequence information, according to claim 2, is characterized in that: in S2: for each key frame in the video, extracting image features by using ResNet50, coding the features into vectors with 2048 dimensions, and rendering output SMPL parameters by using an SMPL model.
4. The three-dimensional human body posture migration method based on video time sequence information, according to claim 1, is characterized in that: in S2: SMPL parameters: { K, θ, β, M }, where K is the root node and M (θ, β) is a differentiable function used to parameterize the mesh of vertices and triangular faces.
5. The three-dimensional human body posture migration method based on video time sequence information, according to claim 1, is characterized in that: in S3: projecting the three-dimensional human body mesh of the image to a two-dimensional image space through weak perspective projection to obtain the vertex of the two-dimensional space:
v_s = Proj(V_s, K_s).
6. The three-dimensional human body posture migration method based on video time sequence information according to claim 1, characterized in that: in S4a: the transformation matrix T ∈ R^{H×W×2}, where H×W is the resolution of the input image.
7. The three-dimensional human body posture migration method based on video time sequence information, according to claim 1, is characterized in that: in S4b: and generating a background by using a background generator, wherein the background generator comprises an encoder and a decoder, the down-sampling is carried out by using multilayer convolution at the stage of the encoder, and then the up-sampling is carried out by using transposed convolution at the stage of the decoder, so that the image is restored to the original size.
8. The three-dimensional human body posture migration method based on video time series information as claimed in claim 1, characterized in that: in S5: attention is paid to a convolution kernel using one channel, and position information is concerned; the color map uses a three-channel convolution kernel, focusing on color information.
9. The three-dimensional human body posture migration method based on video time series information as claimed in claim 1, characterized in that: in S5: the generation of the final result can be summarized as the following formula:
O_s = P_s ⊙ A_s + O_bg ⊙ (1 − A_s),
O_t = P_t ⊙ A_t + O_bg ⊙ (1 − A_t).
10. the three-dimensional human body posture migration method based on video time sequence information, according to claim 1, is characterized in that: in S5: in the synthesis process, global similarity among multi-source features is learned, the learned similarity and the multi-source features are linearly combined in a feature space, and the fused features are transmitted to the global features through a spatial adaptive normalization algorithm.
CN202211470729.6A 2022-11-23 2022-11-23 Three-dimensional human body posture migration method based on video time sequence information Pending CN115761801A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211470729.6A CN115761801A (en) 2022-11-23 2022-11-23 Three-dimensional human body posture migration method based on video time sequence information


Publications (1)

Publication Number Publication Date
CN115761801A true CN115761801A (en) 2023-03-07

Family

ID=85335489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211470729.6A Pending CN115761801A (en) 2022-11-23 2022-11-23 Three-dimensional human body posture migration method based on video time sequence information

Country Status (1)

Country Link
CN (1) CN115761801A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474785A * 2023-12-27 2024-01-30 江苏奥斯汀光电科技股份有限公司 Method for synthesizing one image by multiple character pose images
CN117474785B * 2023-12-27 2024-05-10 江苏奥斯汀光电科技股份有限公司 Method for synthesizing one image by multiple character pose images


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination