CN115330950A - Three-dimensional human body reconstruction method based on time sequence context clues
- Publication number
- CN115330950A (application CN202210985402.6A)
- Authority
- CN
- China
- Prior art keywords
- human body
- dimensional
- frame
- sequence
- error
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The invention discloses a three-dimensional human body reconstruction method based on temporal context cues, relating to the field of artificial intelligence. The method mainly comprises the following steps: pre-training a three-dimensional human body reconstruction neural network based on temporal context cues; extracting spatial features of each frame with a convolutional neural network; adding the external body contour and the optical flow information of each frame to the input features with a motion encoder; capturing the temporal correlation of the multi-frame input with a Transformer network serving as the temporal encoder; regressing the parameters of the human body template and of the camera with a trained regressor; judging whether the body motion poses are real and natural with a discriminator; and training a differentiable renderer with the regressed parameters of the parameterized human body template. After training is finished, given any motion sequence, the pose and shape of the human body model can be reconstructed. The technique can be used for motion analysis, virtual and augmented reality, games, animation, and other scenes requiring three-dimensional human body reconstruction.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a three-dimensional human body reconstruction method based on temporal context cues.
Background
Reconstructing the pose and shape of a human body model from monocular video is an important problem in computer vision and artificial intelligence. Methods that generate accurate and smooth three-dimensional human body reconstructions have broad application prospects and significant application value in virtual and augmented reality.
In recent years, with the fusion of deep learning and computer vision, three-dimensional reconstruction methods based on deep neural networks have appeared. However, because in-the-wild datasets with three-dimensional human body labels are scarce, existing temporal models of human motion fail to capture the complexity and variability of real human movement. Moreover, because the mean per-joint position error only penalizes spatial error without considering temporal consistency, pose estimates exhibit "jitter", making the results hard to bring close to the true poses. Finally, indoor three-dimensional human body datasets are limited in the number of subjects, the range of motion, and image complexity.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a three-dimensional human body reconstruction method based on temporal context cues, which introduces a deep-neural-network three-dimensional reconstruction method into monocular video. On one hand, a convolutional neural network extracts the spatial information of the video sequence; on the other hand, a Transformer network captures the temporal correlation of the multi-frame input, finally yielding features that contain the spatio-temporal information of the whole input. At the same time, by combining temporal context cues such as optical flow and contours, the quality and accuracy of the reconstructed body-model pose and shape are further improved, achieving smooth, natural and realistic pose and shape reconstruction and thus solving the problems described in the background.
In order to achieve the purpose, the invention provides the following technical scheme:
the three-dimensional human body reconstruction method based on the time sequence context clue comprises the following steps:
step S1, inputting a section of single human video frame sequence and recording the sequence asWhereinThe number of images to be processed is the length of the sequence, i.e. the number of image sequences,is shown asAn image, i.e. frame i;
Step S2, using a convolutional neural network to extract, for each frame $I_i$ of the image sequence, a spatial feature $f_i$, where each feature is a 2048-dimensional vector;
Step S3, adding the external body contour and the optical flow information of each frame to the input features through a motion encoder, obtaining the feature $\tilde{f}_i = [f_i; c_i; o_i]$, where $f_i$ is the spatial feature of the video frame, $c_i$ is the external body contour feature, and $o_i$ is the optical flow feature;
Step S4, training a Transformer network as the temporal encoder to extract contextual temporal information, outputting for each frame a hidden variable $g_i$ that contains information from the preceding and following frames;
Step S5, using the hidden variables $g_i$ to regress the parameterized human body template parameters $\theta$, $\beta$ and the camera parameters; the regressor is initialized with the mean pose $\bar{\theta}$, and afterwards the pose result $\theta_{i-1}$ of the previous frame is used to initialize the next frame, fitting the corresponding dynamic three-dimensional human body template sequence to the whole input sequence;
Step S6, integrating the feature vectors of all time steps with self-attention and judging, with a discriminator, whether the body motion poses are real and natural;
Step S7, performing differentiable rendering with the regressed parameters of the parameterized human body template to obtain two-dimensional joint information, the body contour and optical flow information, and comparing these with the values estimated by the network to compute a re-projection error;
Step S8, constructing a loss function $L$ from the human body template pose sequence and all video frames, and training the network model;
Step S9, after the training in step S8 is finished, given any video frame sequence, the pose and shape of the three-dimensional human body model are reconstructed by the trained model.
In a preferred embodiment, in step S5 the fitted three-dimensional human body template is a linear function $M$. The input of $M$ is the body pose parameter $\theta$ and its output is the vertex coordinates $V$ of the three-dimensional template, i.e. $V = M(\theta)$, where $N$ is the total number of template vertices. From the output template vertices, the joint coordinates of the template can be regressed as $J = WV$, where $W$ is a regression matrix.
where $\beta$ is the body shape parameter and $\theta$ is the body pose parameter; $c_i$ is the external body contour feature, $o_i$ is the optical flow feature, and $f_i$ is the spatial feature of the video frame.
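As an illustration of the linear template and the joint regression $J = WV$ described above, here is a minimal numpy sketch; the vertex count, joint count and the one-hot regression matrix are placeholder assumptions, not values fixed by the patent:

```python
import numpy as np

# Hypothetical sizes: N template vertices, K joints (the patent leaves both open;
# the values below are typical of SMPL-style templates and are assumed here).
N, K = 6890, 24

rng = np.random.default_rng(0)
V = rng.standard_normal((N, 3))       # vertex coordinates output by the template M(theta)

# Regression matrix W: each joint is a weighted combination of vertices.
# For brevity each row is one-hot here; a real W would use sparse weights.
idx = rng.integers(0, N, K)
W = np.zeros((K, N))
W[np.arange(K), idx] = 1.0

J = W @ V                             # joint coordinates regressed from vertices
assert J.shape == (K, 3)
```

With one-hot rows, each regressed joint coincides with the selected vertex; a learned $W$ would instead average several neighbouring vertices per joint.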
The invention has the technical effects and advantages that:
the invention discloses a three-dimensional reconstruction method based on time sequence context clues, which introduces a depth neural network three-dimensional reconstruction method into a monocular video, on one hand, a convolutional neural network is utilized to extract spatial information of a video sequence, on the other hand, a converter network is utilized to capture the time correlation of multi-frame input, finally the characteristics containing the whole input space-time information are obtained, and simultaneously, the quality and the precision of reconstruction of human body model posture and shape are further improved by combining light stream, contour and other time sequence context clues, so that smooth, natural and real reconstruction of the human body model posture and shape is achieved.
Drawings
FIG. 1 is a network structure diagram of a three-dimensional human body reconstruction method based on temporal context cues according to the present invention.
FIG. 2 is a flowchart of a three-dimensional human body reconstruction method based on temporal context cues according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention relates to a three-dimensional human body reconstruction method based on temporal context cues, which introduces a deep-neural-network three-dimensional reconstruction method into monocular video. On one hand, a convolutional neural network extracts the spatial information of the video sequence; on the other hand, a Transformer network captures the temporal correlation of the multi-frame input, finally yielding features that contain the spatio-temporal information of the whole input. By additionally combining temporal context cues such as optical flow and contours, the quality and accuracy of the reconstructed body-model pose and shape are further improved, achieving smooth, natural and realistic reconstruction of the human body model's pose and shape.
Therefore, the invention uses a spatio-temporal encoder and a motion encoder to extract human motion features and captures the temporal correlation of multi-frame information. Unlike existing methods, during spatial feature extraction the method captures the internal associations of a long temporal input with a Transformer network, finally obtaining features that contain the spatio-temporal information of the whole input, while fusing motion and contour information to predict the parameters of the human body model.
Specifically, when applied to reconstructing the pose and shape of the human body model in video, the method comprises the following steps:
as shown in fig. 1, auto-supervision, which is the core of the converter network, relates motion features to embedded features of an input picture frame sequence. Our converter network consists of multiple auto-supervision and multi-layer perceptrons. The normalization layer is applied before each module and the residual connection is applied after each module.
The attention module can be described as a mapping function that maps a query matrix $Q$, a key matrix $K$ and a value matrix $V$ to an output attention matrix, with $Q, K, V \in \mathbb{R}^{n \times d}$, where $n$ is the number of vectors in the sequence and $d$ is their dimension. The output of the attention module can be expressed as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right)V$
in this context,。Andis made up of embedded featuresBy linear transformationAndcalculated as follows:
multipath auto-supervision utilizes multipath branches to model information in a representation subspace of different locations. Each branch applies the attention module in parallel. MSA output willThe outputs of the multiple self-supervision are connected:
input embedding featuresIs provided withThe converter network structure of a layer can be expressed as:
where $\mathrm{LN}(\cdot)$ is the layer normalization operation. The output of the Transformer network keeps the same size as its input. For prediction, the encoder output is compressed into a vector by averaging over the frame dimension; finally, the result is regressed through a multi-layer perceptron layer.
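The scaled dot-product attention summarized above can be sketched in a few lines of numpy; the sequence length, embedding dimension, weight initialization, and the frame-dimension averaging below are illustrative assumptions:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # rows of weights sum to 1
    return weights @ V

n, d = 16, 64                       # n frames in the sequence, embedding dimension d (assumed)
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))     # embedded per-frame features
Wq, Wk, Wv = (0.02 * rng.standard_normal((d, d)) for _ in range(3))

out = attention(X @ Wq, X @ Wk, X @ Wv)
assert out.shape == (n, d)          # output keeps the input size, as stated above

pooled = out.mean(axis=0)           # average over the frame dimension before the MLP head
assert pooled.shape == (d,)
```

Because each row of the attention weights sums to one, attending over a constant value matrix returns that constant, which is a convenient sanity check for the implementation.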
As used herein, the three-dimensional human body template is a linear function $M$. The input of the function is the body pose parameters, i.e. the rotations of the skeletal joints, and the output is the vertex coordinates $V$ of the three-dimensional template, i.e. $V = M(\theta)$, where $N$ is the total number of template vertices. From the output template vertices, the joint coordinates of the template can be regressed as $J = WV$, where $W$ is a regression matrix.
The shape parameter is computed as $\bar{\beta} = \frac{1}{T}\sum_{t=1}^{T}\beta_t$ (here $\beta$ is the ground-truth value), where $\beta_t$ is the predicted shape parameter of the single subject (i.e. height, weight, etc.) at the current time $t$; that is, the final shape parameter averages the shape parameters of every frame.
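The per-frame shape averaging just described can be sketched directly; the number of frames and the 10-dimensional shape vector are assumptions (the patent does not fix either):

```python
import numpy as np

T = 16                                  # number of frames (assumed)
rng = np.random.default_rng(0)
betas = rng.standard_normal((T, 10))    # per-frame predicted shape parameters (10-D assumed)

# A single subject has one body shape, so the per-frame predictions are averaged
# over the time dimension to obtain the final shape estimate.
beta_bar = betas.mean(axis=0)
assert beta_bar.shape == (10,)
```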
where $\beta$ is the body shape parameter and $\theta$ is the body pose parameter; $c_i$ is the external body contour feature, $o_i$ is the optical flow feature, and $f_i$ is the spatial feature of the video frame.
Specifically, as shown in fig. 2, the three-dimensional human body reconstruction method based on temporal context cues comprises the following specific steps:
step S101, pre-training a three-dimensional human body reconstruction neural network based on time sequence context clues, mainly comprising a space encoder, a time sequence encoder, a motion encoder, a regressor and a differentiable renderer, wherein a data set comprises a mixed two-dimensional and three-dimensional data set, 5000 sections of video data sets with two-dimensional truth values, 8000 sections of pseudo label data sets obtained by a two-dimensional key point detector, and 2000 sections of video data with parameterized human body template truth values are used for calculating for the three-dimensional data sets.
Step S102, extracting spatial features for each frame of the image sequence with a convolutional neural network, where each feature is a 2048-dimensional vector. The specific network is a 50-layer residual network (ResNet-50); the final output feature size is 2048 dimensions, the input is a sequence of length $T$, and the batch size is 32.
Step S103, adding the external body contour and the optical flow information of each frame to the input features through the motion encoder, obtaining the feature $\tilde{f}_i = [f_i; c_i; o_i]$, where $f_i$ is the spatial feature of the video frame, $c_i$ is the external body contour feature, and $o_i$ is the optical flow feature.
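The feature fusion performed by the motion encoder can be sketched as a per-frame concatenation; the contour and optical-flow feature dimensions below are assumptions, since the patent only fixes the 2048-dimensional spatial feature:

```python
import numpy as np

T = 16                                   # sequence length (assumed)
rng = np.random.default_rng(0)
f = rng.standard_normal((T, 2048))       # spatial features from the ResNet-50 backbone
c = rng.standard_normal((T, 128))        # external body-contour features (dimension assumed)
o = rng.standard_normal((T, 128))        # optical-flow features (dimension assumed)

# The motion encoder fuses the three cues per frame by concatenation.
fused = np.concatenate([f, c, o], axis=1)
assert fused.shape == (T, 2048 + 128 + 128)
```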
Step S104, the updated features are fed to the Transformer encoder layers. The architecture comprises a self-attention mechanism and a shallow fully-connected feed-forward network, with the output of each part passing through a residual connection and layer normalization. The temporal encoder formed by the Transformer network extracts contextual temporal information and outputs, for each frame, hidden variables containing information from the preceding and following frames.
Step S105, using the hidden variables to regress the parameterized human body model parameters and the camera parameters. The regressor is initialized with the mean pose; afterwards the pose result of the previous frame initializes the next frame, and the corresponding dynamic three-dimensional human body template sequence is fitted to the whole input. The regressor of the parameterized template consists of 2 fully-connected layers with 1024 neurons each, and its final layer outputs an 85-dimensional vector containing the pose, shape and camera parameters.
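The patent states only that the regressor's final layer is 85-dimensional and contains pose, shape and camera parameters; the 72 + 10 + 3 partition sketched below follows the common SMPL/HMR convention and is an assumption, not something the patent specifies:

```python
import numpy as np

def split_params(theta_85):
    """Split the regressor's 85-D output into pose, shape and camera parameters.

    The 72 + 10 + 3 partition follows the common SMPL/HMR convention (assumed):
    24 joints x 3 axis-angle values, 10 shape coefficients, and a 3-D
    weak-perspective camera (scale + 2-D translation).
    """
    pose = theta_85[:72]
    shape = theta_85[72:82]
    cam = theta_85[82:85]
    return pose, shape, cam

out = np.arange(85, dtype=float)         # stand-in for one regressor output
pose, shape, cam = split_params(out)
assert pose.shape == (72,) and shape.shape == (10,) and cam.shape == (3,)
```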
Step S106, integrating the feature vectors of all time steps with multi-head self-attention to judge whether the human motion poses are real and natural. The discriminator uses 2 multi-layer perceptron layers with 1024 neurons each and sine activations to learn attention weights; a final linear layer predicts whether each sample is a real, plausible human motion pose.
Step S107, performing differentiable rendering with the regressed parameters of the parameterized human body model, comparing the obtained two-dimensional joint information, body contour and optical flow information with the values estimated by the network, and computing the reprojection error.
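A minimal sketch of the joint-reprojection part of this error, assuming a weak-perspective camera (a common choice with 85-dimensional HMR-style outputs; the patent does not state the camera model, so both the projection and the joint count are assumptions):

```python
import numpy as np

def project(J3d, s, t):
    """Weak-perspective projection of 3-D joints: x = s * J_xy + t (convention assumed)."""
    return s * J3d[:, :2] + t

def reprojection_error(J3d, kp2d, s, t):
    """Mean L2 distance between projected joints and detected 2-D keypoints."""
    return np.linalg.norm(project(J3d, s, t) - kp2d, axis=1).mean()

K = 24                                    # number of joints (assumed)
rng = np.random.default_rng(0)
J3d = rng.standard_normal((K, 3))
s, t = 1.0, np.zeros(2)

kp2d = project(J3d, s, t)                 # perfect detections for a sanity check
assert np.isclose(reprojection_error(J3d, kp2d, s, t), 0.0)
```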
Step S108, constructing a loss function from the human body template pose sequence and all images, and training the network model.
Step S109, during training an adaptive moment estimation (Adam) optimizer is used with the learning rate fixed at 0.0001, training for 120 epochs. The evaluation metrics include the mean per-joint position error, the percentage of correct keypoints, the per-vertex error, and the acceleration error. The acceleration error is computed from the difference between the ground-truth and predicted accelerations of the three-dimensional coordinates of each joint; it is the main smoothness index of the estimated motion sequence, and a better acceleration error indicates a smooth and natural human motion estimate.
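The acceleration error described above can be sketched as a second finite difference of the joint trajectories over time; the array shapes are assumptions:

```python
import numpy as np

def accel_error(J_gt, J_pred):
    """Mean difference between ground-truth and predicted joint accelerations.

    J_gt, J_pred: (T, K, 3) arrays of 3-D joint trajectories. Acceleration is
    taken as the second finite difference over the time dimension, matching the
    description of the acceleration metric above.
    """
    a_gt = J_gt[2:] - 2 * J_gt[1:-1] + J_gt[:-2]
    a_pred = J_pred[2:] - 2 * J_pred[1:-1] + J_pred[:-2]
    return np.linalg.norm(a_gt - a_pred, axis=2).mean()

T, K = 8, 24                              # frames and joints (assumed)
rng = np.random.default_rng(0)
J = rng.standard_normal((T, K, 3))
assert np.isclose(accel_error(J, J), 0.0)  # identical motion gives zero error
```

Note that a prediction shifted by any motion that is linear in time has the same acceleration as the ground truth, which is why this metric measures smoothness rather than absolute position.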
Step S110, after training is finished, given any video frame sequence, the pose and shape of the three-dimensional human body model are reconstructed by the trained model.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that are within the spirit and principle of the present invention are intended to be included in the scope of the present invention.
Claims (3)
1. A three-dimensional human body reconstruction method based on temporal context cues, characterized by comprising the following steps:
step S1, inputting a single-person video frame sequence, recorded as $\{I_i\}_{i=1}^{T}$, where $T$ is the length of the sequence, i.e. the number of images, and $I_i$ denotes the $i$-th image, i.e. frame $i$;
step S2, using a convolutional neural network to extract, for each frame $I_i$ of the image sequence, a spatial feature $f_i$, where each feature is a 2048-dimensional vector;
step S3, adding the external body contour and the optical flow information of each frame to the input features through a motion encoder, obtaining the feature $\tilde{f}_i = [f_i; c_i; o_i]$, where $f_i$ is the spatial feature of the video frame, $c_i$ is the external body contour feature, and $o_i$ is the optical flow feature;
step S4, training a Transformer network as the temporal encoder to extract contextual temporal information, outputting for each frame a hidden variable $g_i$ that contains information from the preceding and following frames;
step S5, using the hidden variables $g_i$ to regress the parameterized human body template parameters $\theta$, $\beta$ and the camera parameters; the regressor is initialized with the mean pose $\bar{\theta}$, and afterwards the pose result $\theta_{i-1}$ of the previous frame is used to initialize the next frame, fitting the corresponding dynamic three-dimensional human body template sequence to the whole input sequence;
step S6, integrating the feature vectors of all time steps with self-attention and judging, with a discriminator, whether the body motion poses are real and natural;
step S7, performing differentiable rendering with the regressed parameters of the parameterized human body template to obtain two-dimensional joint information, the body contour and optical flow information, and comparing these with the values estimated by the network to compute a reprojection error;
step S8, constructing a loss function $L$ from the human body template pose sequence and all video frames, and training the network model;
step S9, after the training in step S8 is finished, given any video frame sequence, the pose and shape of the three-dimensional human body model are reconstructed by the trained model.
2. The three-dimensional human body reconstruction method based on temporal context cues as claimed in claim 1, wherein: in step S5, the fitted three-dimensional human body template is a linear function $M$; the input of $M$ is the body pose parameter $\theta$ and its output is the vertex coordinates $V$ of the three-dimensional template, i.e. $V = M(\theta)$, where $N$ is the total number of template vertices; the joint coordinates of the template are regressed from the output template vertices as $J = WV$, where $W$ is a regression matrix.
3. The three-dimensional human body reconstruction method based on temporal context cues as claimed in claim 1, wherein: in step S8, the loss function $L$ is constructed from the reprojection errors and the human body template parameters, where $\beta$ is the body shape parameter and $\theta$ is the body pose parameter.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210985402.6A CN115330950A (en) | 2022-08-17 | 2022-08-17 | Three-dimensional human body reconstruction method based on time sequence context clues |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115330950A true CN115330950A (en) | 2022-11-11 |
Family
ID=83923878
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116309698A (en) * | 2023-01-11 | 2023-06-23 | 中国科学院上海微系统与信息技术研究所 | Multi-frame optical flow estimation method based on motion feature compensation guidance |
CN116385666A (en) * | 2023-06-02 | 2023-07-04 | 杭州倚澜科技有限公司 | Human body model redirection method and device based on feedback type cyclic neural network |
CN116385666B (en) * | 2023-06-02 | 2024-02-27 | 杭州倚澜科技有限公司 | Human body model redirection method and device based on feedback type cyclic neural network |
CN117218297A (en) * | 2023-09-29 | 2023-12-12 | 北京百度网讯科技有限公司 | Human body reconstruction parameter generation method, device, equipment and medium |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |