CN115330950A - Three-dimensional human body reconstruction method based on time sequence context clues
- Publication number
- CN115330950A (application CN202210985402.6A)
- Authority
- CN
- China
- Prior art keywords
- human body
- dimensional
- frame
- sequence
- error
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The invention discloses a three-dimensional human body reconstruction method based on temporal context cues, relating to the field of artificial intelligence. The method mainly comprises the following steps: pre-training a three-dimensional human body reconstruction neural network based on temporal context cues; extracting spatial features of each frame with a convolutional neural network; adding the external body contour and the optical flow information of each frame to the input features with a motion encoder; capturing the temporal correlation of the multi-frame input with a Transformer network serving as the temporal encoder; regressing the parameters of the human body template and of the camera with a trained regressor; judging whether the body motion poses are real and natural with a discriminator; and training a differentiable renderer with the regressed parameters of the parameterized human body template. After training is finished, given any motion sequence, the pose and shape of the human body model can be reconstructed. The technique can be used for motion analysis, virtual and augmented reality, games, animation, and other scenes requiring three-dimensional human body reconstruction.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a three-dimensional human body reconstruction method based on temporal context cues.
Background
Reconstructing the pose and shape of a human body model from monocular video is an important problem in computer vision and artificial intelligence. Methods that generate accurate and smooth three-dimensional human body reconstructions have broad application prospects and significant application value in virtual and augmented reality.
In recent years, with the fusion of deep learning and computer vision, three-dimensional reconstruction methods based on deep neural networks have appeared. However, because in-the-wild datasets with three-dimensional human body labels are scarce, existing temporal models of human motion fail to capture the complexity and variability of real human movement. Moreover, because the mean per-joint position error only penalizes spatial error without considering temporal consistency, pose estimates exhibit "jitter", making the results hard to bring close to the true poses. Finally, indoor three-dimensional human body datasets are limited in the number of subjects, the range of motion, and image complexity.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a three-dimensional human body reconstruction method based on temporal context cues, which introduces a deep-neural-network three-dimensional reconstruction method into monocular video. On one hand, a convolutional neural network extracts the spatial information of the video sequence; on the other hand, a Transformer network captures the temporal correlation of the multi-frame input, finally yielding features that contain the spatio-temporal information of the whole input. At the same time, by combining temporal context cues such as optical flow and contours, the quality and accuracy of the reconstructed body-model pose and shape are further improved, achieving smooth, natural and realistic pose and shape reconstruction and thus solving the problems described in the background.
In order to achieve the purpose, the invention provides the following technical scheme:
the three-dimensional human body reconstruction method based on the time sequence context clue comprises the following steps:
step S1, inputting a section of single human video frame sequence and recording the sequence asWhereinThe number of images to be processed is the length of the sequence, i.e. the number of image sequences,is shown asAn image, i.e. frame i;
Step S2, using a convolutional neural network to extract, for each frame $I_i$ of the image sequence, a spatial feature $f_i$, where each feature is a 2048-dimensional vector;
Step S3, adding the external body contour and the optical flow information of each frame to the input features through a motion encoder, obtaining the feature $\tilde{f}_i = [f_i; c_i; o_i]$, where $f_i$ is the spatial feature of the video frame, $c_i$ is the external body contour feature, and $o_i$ is the optical flow feature;
Step S4, training a Transformer network as the temporal encoder to extract contextual temporal information, outputting for each frame a hidden variable $g_i$ that contains information from the preceding and following frames;
Step S5, using the hidden variables $g_i$ to regress the parameterized human body template parameters $\theta$, $\beta$ and the camera parameters; the regressor is initialized with the mean pose $\bar{\theta}$, and afterwards the pose result $\theta_{i-1}$ of the previous frame is used to initialize the next frame, fitting the corresponding dynamic three-dimensional human body template sequence to the whole input sequence;
Step S6, integrating the feature vectors of all time steps with self-attention and judging, with a discriminator, whether the body motion poses are real and natural;
Step S7, performing differentiable rendering with the regressed parameters of the parameterized human body template to obtain two-dimensional joint information, the body contour and optical flow information, and comparing these with the values estimated by the network to compute a re-projection error;
Step S8, constructing a loss function $L$ from the human body template pose sequence and all video frames, and training the network model;
Step S9, after the training in step S8 is finished, given any video frame sequence, the pose and shape of the three-dimensional human body model are reconstructed by the trained model.
In a preferred embodiment, in step S5 the fitted three-dimensional human body template is a linear function $M$. The input of $M$ is the body pose parameter $\theta$ and its output is the vertex coordinates $V$ of the three-dimensional template, i.e. $V = M(\theta)$, where $N$ is the total number of template vertices. From the output template vertices, the joint coordinates of the template can be regressed as $J = WV$, where $W$ is a regression matrix.
where $\beta$ is the body shape parameter and $\theta$ is the body pose parameter; $c_i$ is the external body contour feature, $o_i$ is the optical flow feature, and $f_i$ is the spatial feature of the video frame.
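As an illustration of the linear template and the joint regression $J = WV$ described above, here is a minimal numpy sketch; the vertex count, joint count and the one-hot regression matrix are placeholder assumptions, not values fixed by the patent:

```python
import numpy as np

# Hypothetical sizes: N template vertices, K joints (the patent leaves both open;
# the values below are typical of SMPL-style templates and are assumed here).
N, K = 6890, 24

rng = np.random.default_rng(0)
V = rng.standard_normal((N, 3))       # vertex coordinates output by the template M(theta)

# Regression matrix W: each joint is a weighted combination of vertices.
# For brevity each row is one-hot here; a real W would use sparse weights.
idx = rng.integers(0, N, K)
W = np.zeros((K, N))
W[np.arange(K), idx] = 1.0

J = W @ V                             # joint coordinates regressed from vertices
assert J.shape == (K, 3)
```

With one-hot rows, each regressed joint coincides with the selected vertex; a learned $W$ would instead average several neighbouring vertices per joint.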
The invention has the technical effects and advantages that:
the invention discloses a three-dimensional reconstruction method based on time sequence context clues, which introduces a depth neural network three-dimensional reconstruction method into a monocular video, on one hand, a convolutional neural network is utilized to extract spatial information of a video sequence, on the other hand, a converter network is utilized to capture the time correlation of multi-frame input, finally the characteristics containing the whole input space-time information are obtained, and simultaneously, the quality and the precision of reconstruction of human body model posture and shape are further improved by combining light stream, contour and other time sequence context clues, so that smooth, natural and real reconstruction of the human body model posture and shape is achieved.
Drawings
FIG. 1 is a network structure diagram of a three-dimensional human body reconstruction method based on temporal context cues according to the present invention.
FIG. 2 is a flowchart of a three-dimensional human body reconstruction method based on temporal context cues according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention relates to a three-dimensional human body reconstruction method based on temporal context cues, which introduces a deep-neural-network three-dimensional reconstruction method into monocular video. On one hand, a convolutional neural network extracts the spatial information of the video sequence; on the other hand, a Transformer network captures the temporal correlation of the multi-frame input, finally yielding features that contain the spatio-temporal information of the whole input. By additionally combining temporal context cues such as optical flow and contours, the quality and accuracy of the reconstructed body-model pose and shape are further improved, achieving smooth, natural and realistic reconstruction of the human body model's pose and shape.
Therefore, the invention uses a spatio-temporal encoder and a motion encoder to extract human motion features and captures the temporal correlation of multi-frame information. Unlike existing methods, during spatial feature extraction the method captures the internal associations of a long temporal input with a Transformer network, finally obtaining features that contain the spatio-temporal information of the whole input, while fusing motion and contour information to predict the parameters of the human body model.
Specifically, when applied to reconstructing the pose and shape of the human body model in video, the method comprises the following steps:
as shown in fig. 1, auto-supervision, which is the core of the converter network, relates motion features to embedded features of an input picture frame sequence. Our converter network consists of multiple auto-supervision and multi-layer perceptrons. The normalization layer is applied before each module and the residual connection is applied after each module.
The attention module can be described as a mapping function that maps a query matrix $Q$, a key matrix $K$ and a value matrix $V$ to an output attention matrix, with $Q, K, V \in \mathbb{R}^{n \times d}$, where $n$ is the number of vectors in the sequence and $d$ is their dimension. The output of the attention module can be expressed as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right)V$
in this context,。Andis made up of embedded featuresBy linear transformationAndcalculated as follows:
multipath auto-supervision utilizes multipath branches to model information in a representation subspace of different locations. Each branch applies the attention module in parallel. MSA output willThe outputs of the multiple self-supervision are connected:
input embedding featuresIs provided withThe converter network structure of a layer can be expressed as:
where $\mathrm{LN}(\cdot)$ is the layer normalization operation. The output of the Transformer network keeps the same size as its input. For prediction, the encoder output is compressed into a vector by averaging over the frame dimension; finally, the result is regressed through a multi-layer perceptron layer.
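The scaled dot-product attention summarized above can be sketched in a few lines of numpy; the sequence length, embedding dimension, weight initialization, and the frame-dimension averaging below are illustrative assumptions:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # rows of weights sum to 1
    return weights @ V

n, d = 16, 64                       # n frames in the sequence, embedding dimension d (assumed)
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))     # embedded per-frame features
Wq, Wk, Wv = (0.02 * rng.standard_normal((d, d)) for _ in range(3))

out = attention(X @ Wq, X @ Wk, X @ Wv)
assert out.shape == (n, d)          # output keeps the input size, as stated above

pooled = out.mean(axis=0)           # average over the frame dimension before the MLP head
assert pooled.shape == (d,)
```

Because each row of the attention weights sums to one, attending over a constant value matrix returns that constant, which is a convenient sanity check for the implementation.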
As used herein, the three-dimensional human body template is a linear function $M$. The input of the function is the body pose parameters, i.e. the rotations of the skeletal joints, and the output is the vertex coordinates $V$ of the three-dimensional template, i.e. $V = M(\theta)$, where $N$ is the total number of template vertices. From the output template vertices, the joint coordinates of the template can be regressed as $J = WV$, where $W$ is a regression matrix.
The shape parameter is computed as $\bar{\beta} = \frac{1}{T}\sum_{t=1}^{T}\beta_t$ (here $\beta$ is the ground-truth value), where $\beta_t$ is the predicted shape parameter of the single subject (i.e. height, weight, etc.) at the current time $t$; that is, the final shape parameter averages the shape parameters of every frame.
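The per-frame shape averaging just described can be sketched directly; the number of frames and the 10-dimensional shape vector are assumptions (the patent does not fix either):

```python
import numpy as np

T = 16                                  # number of frames (assumed)
rng = np.random.default_rng(0)
betas = rng.standard_normal((T, 10))    # per-frame predicted shape parameters (10-D assumed)

# A single subject has one body shape, so the per-frame predictions are averaged
# over the time dimension to obtain the final shape estimate.
beta_bar = betas.mean(axis=0)
assert beta_bar.shape == (10,)
```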
where $\beta$ is the body shape parameter and $\theta$ is the body pose parameter; $c_i$ is the external body contour feature, $o_i$ is the optical flow feature, and $f_i$ is the spatial feature of the video frame.
Specifically, as shown in fig. 2, the three-dimensional human body reconstruction method based on temporal context cues comprises the following specific steps:
step S101, pre-training a three-dimensional human body reconstruction neural network based on time sequence context clues, mainly comprising a space encoder, a time sequence encoder, a motion encoder, a regressor and a differentiable renderer, wherein a data set comprises a mixed two-dimensional and three-dimensional data set, 5000 sections of video data sets with two-dimensional truth values, 8000 sections of pseudo label data sets obtained by a two-dimensional key point detector, and 2000 sections of video data with parameterized human body template truth values are used for calculating for the three-dimensional data sets.
Step S102, extracting spatial features for each frame of the image sequence with a convolutional neural network, where each feature is a 2048-dimensional vector. The specific network is a 50-layer residual network (ResNet-50); the final output feature size is 2048 dimensions, the input is a sequence of length $T$, and the batch size is 32.
Step S103, adding the external body contour and the optical flow information of each frame to the input features through the motion encoder, obtaining the feature $\tilde{f}_i = [f_i; c_i; o_i]$, where $f_i$ is the spatial feature of the video frame, $c_i$ is the external body contour feature, and $o_i$ is the optical flow feature.
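The feature fusion performed by the motion encoder can be sketched as a per-frame concatenation; the contour and optical-flow feature dimensions below are assumptions, since the patent only fixes the 2048-dimensional spatial feature:

```python
import numpy as np

T = 16                                   # sequence length (assumed)
rng = np.random.default_rng(0)
f = rng.standard_normal((T, 2048))       # spatial features from the ResNet-50 backbone
c = rng.standard_normal((T, 128))        # external body-contour features (dimension assumed)
o = rng.standard_normal((T, 128))        # optical-flow features (dimension assumed)

# The motion encoder fuses the three cues per frame by concatenation.
fused = np.concatenate([f, c, o], axis=1)
assert fused.shape == (T, 2048 + 128 + 128)
```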
Step S104, the updated features are fed to the Transformer encoder layers. The architecture comprises a self-attention mechanism and a shallow fully-connected feed-forward network, with the output of each part passing through a residual connection and layer normalization. The temporal encoder formed by the Transformer network extracts contextual temporal information and outputs, for each frame, hidden variables containing information from the preceding and following frames.
Step S105, using the hidden variables to regress the parameterized human body model parameters and the camera parameters. The regressor is initialized with the mean pose; afterwards the pose result of the previous frame initializes the next frame, and the corresponding dynamic three-dimensional human body template sequence is fitted to the whole input. The regressor of the parameterized template consists of 2 fully-connected layers with 1024 neurons each, and its final layer outputs an 85-dimensional vector containing the pose, shape and camera parameters.
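The patent states only that the regressor's final layer is 85-dimensional and contains pose, shape and camera parameters; the 72 + 10 + 3 partition sketched below follows the common SMPL/HMR convention and is an assumption, not something the patent specifies:

```python
import numpy as np

def split_params(theta_85):
    """Split the regressor's 85-D output into pose, shape and camera parameters.

    The 72 + 10 + 3 partition follows the common SMPL/HMR convention (assumed):
    24 joints x 3 axis-angle values, 10 shape coefficients, and a 3-D
    weak-perspective camera (scale + 2-D translation).
    """
    pose = theta_85[:72]
    shape = theta_85[72:82]
    cam = theta_85[82:85]
    return pose, shape, cam

out = np.arange(85, dtype=float)         # stand-in for one regressor output
pose, shape, cam = split_params(out)
assert pose.shape == (72,) and shape.shape == (10,) and cam.shape == (3,)
```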
Step S106, integrating the feature vectors of all time steps with multi-head self-attention to judge whether the human motion poses are real and natural. The discriminator uses 2 multi-layer perceptron layers with 1024 neurons each and sine activations to learn attention weights; a final linear layer predicts whether each sample is a real, plausible human motion pose.
Step S107, performing differentiable rendering with the regressed parameters of the parameterized human body model, comparing the obtained two-dimensional joint information, body contour and optical flow information with the values estimated by the network, and computing the reprojection error.
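A minimal sketch of the joint-reprojection part of this error, assuming a weak-perspective camera (a common choice with 85-dimensional HMR-style outputs; the patent does not state the camera model, so both the projection and the joint count are assumptions):

```python
import numpy as np

def project(J3d, s, t):
    """Weak-perspective projection of 3-D joints: x = s * J_xy + t (convention assumed)."""
    return s * J3d[:, :2] + t

def reprojection_error(J3d, kp2d, s, t):
    """Mean L2 distance between projected joints and detected 2-D keypoints."""
    return np.linalg.norm(project(J3d, s, t) - kp2d, axis=1).mean()

K = 24                                    # number of joints (assumed)
rng = np.random.default_rng(0)
J3d = rng.standard_normal((K, 3))
s, t = 1.0, np.zeros(2)

kp2d = project(J3d, s, t)                 # perfect detections for a sanity check
assert np.isclose(reprojection_error(J3d, kp2d, s, t), 0.0)
```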
Step S108, constructing a loss function from the human body template pose sequence and all images, and training the network model.
Step S109, during training an adaptive moment estimation (Adam) optimizer is used with the learning rate fixed at 0.0001, training for 120 epochs. The evaluation metrics include the mean per-joint position error, the percentage of correct keypoints, the per-vertex error, and the acceleration error. The acceleration error is computed from the difference between the ground-truth and predicted accelerations of the three-dimensional coordinates of each joint; it is the main smoothness index of the estimated motion sequence, and a better acceleration error indicates a smooth and natural human motion estimate.
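The acceleration error described above can be sketched as a second finite difference of the joint trajectories over time; the array shapes are assumptions:

```python
import numpy as np

def accel_error(J_gt, J_pred):
    """Mean difference between ground-truth and predicted joint accelerations.

    J_gt, J_pred: (T, K, 3) arrays of 3-D joint trajectories. Acceleration is
    taken as the second finite difference over the time dimension, matching the
    description of the acceleration metric above.
    """
    a_gt = J_gt[2:] - 2 * J_gt[1:-1] + J_gt[:-2]
    a_pred = J_pred[2:] - 2 * J_pred[1:-1] + J_pred[:-2]
    return np.linalg.norm(a_gt - a_pred, axis=2).mean()

T, K = 8, 24                              # frames and joints (assumed)
rng = np.random.default_rng(0)
J = rng.standard_normal((T, K, 3))
assert np.isclose(accel_error(J, J), 0.0)  # identical motion gives zero error
```

Note that a prediction shifted by any motion that is linear in time has the same acceleration as the ground truth, which is why this metric measures smoothness rather than absolute position.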
Step S110, after training is finished, given any video frame sequence, the pose and shape of the three-dimensional human body model are reconstructed by the trained model.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that are within the spirit and principle of the present invention are intended to be included in the scope of the present invention.
Claims (3)
1. A three-dimensional human body reconstruction method based on temporal context cues, characterized by comprising the following steps:
step S1, inputting a single-person video frame sequence, recorded as $\{I_i\}_{i=1}^{T}$, where $T$ is the length of the sequence, i.e. the number of images, and $I_i$ denotes the $i$-th image, i.e. frame $i$;
step S2, using a convolutional neural network to extract, for each frame $I_i$ of the image sequence, a spatial feature $f_i$, where each feature is a 2048-dimensional vector;
step S3, adding the external body contour and the optical flow information of each frame to the input features through a motion encoder, obtaining the feature $\tilde{f}_i = [f_i; c_i; o_i]$, where $f_i$ is the spatial feature of the video frame, $c_i$ is the external body contour feature, and $o_i$ is the optical flow feature;
step S4, training a Transformer network as the temporal encoder to extract contextual temporal information, outputting for each frame a hidden variable $g_i$ that contains information from the preceding and following frames;
step S5, using the hidden variables $g_i$ to regress the parameterized human body template parameters $\theta$, $\beta$ and the camera parameters; the regressor is initialized with the mean pose $\bar{\theta}$, and afterwards the pose result $\theta_{i-1}$ of the previous frame is used to initialize the next frame, fitting the corresponding dynamic three-dimensional human body template sequence to the whole input sequence;
step S6, integrating the feature vectors of all time steps with self-attention and judging, with a discriminator, whether the body motion poses are real and natural;
step S7, performing differentiable rendering with the regressed parameters of the parameterized human body template to obtain two-dimensional joint information, the body contour and optical flow information, and comparing these with the values estimated by the network to compute a reprojection error;
step S8, constructing a loss function $L$ from the human body template pose sequence and all video frames, and training the network model;
step S9, after the training in step S8 is finished, given any video frame sequence, the pose and shape of the three-dimensional human body model are reconstructed by the trained model.
2. The three-dimensional human body reconstruction method based on temporal context cues as claimed in claim 1, wherein: in step S5, the fitted three-dimensional human body template is a linear function $M$; the input of $M$ is the body pose parameter $\theta$ and its output is the vertex coordinates $V$ of the three-dimensional template, i.e. $V = M(\theta)$, where $N$ is the total number of template vertices; the joint coordinates of the template are regressed from the output template vertices as $J = WV$, where $W$ is a regression matrix.
3. The three-dimensional human body reconstruction method based on temporal context cues as claimed in claim 1, wherein: in step S8, the loss function $L$ is constructed from the reprojection errors and the human body template parameters, where $\beta$ is the body shape parameter and $\theta$ is the body pose parameter.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210985402.6A CN115330950A (en) | 2022-08-17 | 2022-08-17 | Three-dimensional human body reconstruction method based on time sequence context clues |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115330950A true CN115330950A (en) | 2022-11-11 |
Family
ID=83923878
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116309698A (en) * | 2023-01-11 | 2023-06-23 | 中国科学院上海微系统与信息技术研究所 | Multi-frame optical flow estimation method based on motion feature compensation guidance |
CN116385666A (en) * | 2023-06-02 | 2023-07-04 | 杭州倚澜科技有限公司 | Human body model redirection method and device based on feedback type cyclic neural network |
CN116385666B (en) * | 2023-06-02 | 2024-02-27 | 杭州倚澜科技有限公司 | Human body model redirection method and device based on feedback type cyclic neural network |
CN117218297A (en) * | 2023-09-29 | 2023-12-12 | 北京百度网讯科技有限公司 | Human body reconstruction parameter generation method, device, equipment and medium |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |