CN115330950A - Three-dimensional human body reconstruction method based on time sequence context clues - Google Patents

Three-dimensional human body reconstruction method based on time sequence context clues

Info

Publication number
CN115330950A
CN115330950A (application CN202210985402.6A)
Authority
CN
China
Prior art keywords
human body
dimensional
frame
sequence
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210985402.6A
Other languages
Chinese (zh)
Inventor
戴翘楚 (Dai Qiaochu)
吴翼天 (Wu Yitian)
曹静萍 (Cao Jingping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yilan Technology Co ltd
Original Assignee
Hangzhou Yilan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yilan Technology Co ltd filed Critical Hangzhou Yilan Technology Co ltd
Priority to CN202210985402.6A priority Critical patent/CN115330950A/en
Publication of CN115330950A publication Critical patent/CN115330950A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a three-dimensional human body reconstruction method based on time sequence context clues, relating in particular to the field of artificial intelligence, and mainly comprising the following steps: pre-training a three-dimensional human body reconstruction neural network based on time sequence context clues; extracting the spatial features of each frame image with a convolutional neural network; adding the external human body contour and optical flow information of each frame to the input features with a motion encoder; capturing the temporal correlation of the multi-frame input with a Transformer network serving as the temporal encoder; regressing the parameters of the human body template and the camera parameters through a trained regressor; judging whether the human action pose is real and natural with a discriminator; and training a differentiable renderer with the parameterized human body template parameters obtained by regression. After training is finished, given any action sequence, the pose and shape reconstruction of the human body model can be completed. The technology can be used for motion analysis, virtual and augmented reality, games, animation, and other scenarios requiring three-dimensional human body reconstruction.

Description

Three-dimensional human body reconstruction method based on time sequence context clues
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a three-dimensional human body reconstruction method based on time sequence context clues.
Background
The problem of reconstructing the pose and shape of a human body model from monocular video is an important problem in the fields of computer vision and artificial intelligence. Methods that generate accurate and smooth three-dimensional human body reconstruction results have broad application prospects and important application value in the virtual and augmented reality fields.
In recent years, with the fusion and development of deep learning and computer vision techniques, three-dimensional reconstruction methods based on deep neural networks have appeared. However, owing to the lack of in-the-wild data sets with three-dimensional human body labels, existing temporal models of human motion fail to capture the complexity and variability of real human motion. Secondly, because the mean per-joint position error penalizes only spatial error without considering temporal consistency, the estimated poses exhibit a "jitter" phenomenon, making the results hard to bring close to the real poses. Finally, indoor three-dimensional human body data sets are limited in the number of subjects, the range of motion, and image complexity.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a three-dimensional human body reconstruction method based on time sequence context clues, which introduces a deep-neural-network three-dimensional reconstruction method into monocular video. On one hand, a convolutional neural network is used to extract the spatial information of the video sequence; on the other hand, a Transformer network is used to capture the temporal correlation of the multi-frame input, finally obtaining features that contain the spatio-temporal information of the whole input. At the same time, by combining time sequence context clues such as optical flow and contour, the quality and precision of the reconstruction of the pose and shape of the human body model are further improved, achieving smooth, natural, and real reconstruction of the pose and shape of the human body model, so as to solve the problems in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
the three-dimensional human body reconstruction method based on the time sequence context clue comprises the following steps:
Step S1, inputting a single-person video frame sequence, recorded as $\{I_1, I_2, \ldots, I_T\}$, where $T$ is the length of the sequence, i.e. the number of images, and $I_i$ denotes the $i$-th image, i.e. frame $i$;
Step S2, utilizing a convolutional neural network to extract, for each frame $I_i$ of the image sequence, a spatial feature $f_i$, where each feature is a 2048-dimensional vector;
Step S3, adding the external human body contour and optical flow information of each frame to the input features through a motion encoder, obtaining the fused feature $\hat{f}_i = [\, f_i,\ f_i^{s},\ f_i^{o} \,]$, where $f_i$ is the spatial feature of the video frame, $f_i^{s}$ is the external contour feature of the human body, and $f_i^{o}$ is the optical flow feature;
Step S4, training a Transformer network as a temporal encoder to extract contextual temporal information, outputting for each frame a hidden variable $g_i$ that contains information from the preceding and following frames;
Step S5, utilizing $g_i$ to regress the parameterized human body template parameters $\Theta$ and the camera parameters $\pi$; the regressor is initialized with the mean pose $\bar{\Theta}$, and afterwards the pose result $\Theta_{i-1}$ of the previous frame is used as the initialization of the next frame, fitting the corresponding three-dimensional human body template motion sequence $\{\hat{\Theta}_1, \ldots, \hat{\Theta}_T\}$ over the whole sequence;
Step S6, adopting self-attention to integrate the feature vectors of all time steps and using a discriminator to judge whether the human action pose is real and natural;
Step S7, performing differentiable rendering with the parameterized human body template parameters obtained by regression to obtain two-dimensional joint information, the human body contour, and optical flow information, and comparing them with the values estimated by the network to calculate a reprojection error;
Step S8, constructing a loss function $L$ from the human body template pose sequence and all image video frame sequences, and training the network model;
Step S9, after the training in step S8 is finished, given any video frame sequence, completing the reconstruction of the pose and shape of the three-dimensional human body model through the trained model.
In a preferred embodiment, in step S5, the fitted three-dimensional human body template is a linear function $M(\theta)$; the input of the linear function $M(\theta)$ is the human pose parameter $\theta$, and the output is the vertex coordinates $V \in \mathbb{R}^{3 \times N}$ of the three-dimensional human body template, i.e. $V = M(\theta)$, where $N$ is the total number of vertices of the three-dimensional human body template; from the output vertex coordinates, the joint coordinates of the human body template can be regressed as $J = WV$, where $W$ is a regression matrix.
In a preferred embodiment, in step S8, the loss function $L$ is:

$L = L_{3D} + L_{2D} + L_{para} + L_{adv} + L_{disc} + L_{motion}$

where hatted quantities denote the corresponding predicted values. The three-dimensional error $L_{3D}$ uses the L2-norm loss function: $L_{3D} = \sum_{t} \| X_t - \hat{X}_t \|_2^2$, where $X_t$ are the three-dimensional joint parameters;

the two-dimensional error $L_{2D}$ uses the L2-norm loss function: $L_{2D} = \sum_{t} \| x_t - \hat{x}_t \|_2^2$, where $x_t$ are the two-dimensional joint parameters;

the parameterized human body template error $L_{para}$ uses the L2-norm loss function: $L_{para} = \| \beta - \hat{\beta} \|_2^2 + \sum_{t} \| \theta_t - \hat{\theta}_t \|_2^2$, where $\beta$ is the human shape parameter and $\theta$ is the human pose parameter;

the adversarial error $L_{adv}$ uses the L2-norm loss function: $L_{adv} = \mathbb{E}\big[ (D(\hat{\Theta}) - 1)^2 \big]$, the adversarial loss on the motion parameters;

the discriminator error $L_{disc}$ uses the L2-norm loss function: $L_{disc} = \mathbb{E}\big[ (D(\Theta) - 1)^2 \big] + \mathbb{E}\big[ D(\hat{\Theta})^2 \big]$;

the motion encoder error $L_{motion}$ uses the L2-norm loss function: $L_{motion} = \| f^{s} - \hat{f}^{s} \|_2^2 + \| f^{o} - \hat{f}^{o} \|_2^2 + \| f - \hat{f} \|_2^2$, where $f^{s}$ is the external contour feature of the human body, $f^{o}$ is the optical flow feature, and $f$ is the spatial feature of the video frame.
The invention has the following technical effects and advantages:
The invention discloses a three-dimensional reconstruction method based on time sequence context clues, which introduces a deep-neural-network three-dimensional reconstruction method into monocular video. On one hand, a convolutional neural network is used to extract the spatial information of the video sequence; on the other hand, a Transformer network is used to capture the temporal correlation of the multi-frame input, finally obtaining features that contain the spatio-temporal information of the whole input. At the same time, by combining time sequence context clues such as optical flow and contour, the quality and precision of the reconstruction of the pose and shape of the human body model are further improved, achieving smooth, natural, and real reconstruction of the pose and shape of the human body model.
Drawings
FIG. 1 is a network structure diagram of a three-dimensional human body reconstruction method based on temporal context cues according to the present invention.
FIG. 2 is a flowchart of a three-dimensional human body reconstruction method based on temporal context cues according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention relates to a three-dimensional human body reconstruction method based on time sequence context clues, which introduces a deep-neural-network three-dimensional reconstruction method into monocular video. On one hand, a convolutional neural network is used to extract the spatial information of the video sequence; on the other hand, a Transformer network is used to capture the temporal correlation of the multi-frame input, finally obtaining features that contain the spatio-temporal information of the whole input. At the same time, by combining time sequence context clues such as optical flow and contour, the quality and precision of the reconstruction of the pose and shape of the human body model are further improved, achieving smooth, natural, and real reconstruction of the pose and shape of the human body model.
Therefore, the invention uses a spatio-temporal encoder and a motion encoder to extract human motion features and captures the temporal correlation of multi-frame information. Different from existing methods, during spatial feature extraction the method captures the internal associations of long temporal inputs with a Transformer network, finally obtaining features that contain the spatio-temporal information of the whole input, while fusing motion information and contour information to predict the human model parameters.
Specifically, when the method is applied to reconstruction of the posture and the shape of the human body model in the video, the method comprises the following steps:
as shown in fig. 1, auto-supervision, which is the core of the converter network, relates motion features to embedded features of an input picture frame sequence. Our converter network consists of multiple auto-supervision and multi-layer perceptrons. The normalization layer is applied before each module and the residual connection is applied after each module.
The attention module can be described as a mapping function that maps a query matrix $Q$, a key matrix $K$, and a value matrix $V$ to an output attention matrix, with

$Q, K, V \in \mathbb{R}^{n \times d}$

where $n$ is the number of vectors in the sequence and $d$ is their dimension. The output of the attention module can be expressed as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d}} \right) V$
Here $Q$, $K$, and $V$ are computed from the embedded features $z$ by the linear transformations $W_Q$, $W_K$, and $W_V$ as follows:

$Q = z W_Q, \quad K = z W_K, \quad V = z W_V$

Multi-head self-attention uses multiple parallel branches (heads) to model information in the representation subspaces of different positions; each branch applies the attention module in parallel. The MSA output concatenates the outputs of the $h$ self-attention heads:

$\mathrm{MSA}(z) = [\, \mathrm{SA}_1(z);\ \mathrm{SA}_2(z);\ \cdots;\ \mathrm{SA}_h(z) \,]\, W_O$

where $W_O$ is a linear output projection. Given input embedding features $z_0$, the Transformer network structure with $L$ layers can be expressed as:

$z'_{\ell} = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1, \ldots, L$

$z_{\ell} = \mathrm{MLP}(\mathrm{LN}(z'_{\ell})) + z'_{\ell}, \quad \ell = 1, \ldots, L$

where $\mathrm{LN}(\cdot)$ is the layer normalization operation. The output $z_L$ of the Transformer network keeps the same size as the input. For prediction, the encoder output $z_L$ is compressed into a vector $g$ and averaged in the frame dimension. Finally, the output is regressed through a multi-layer perceptron layer.
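For concreteness, the following is a minimal sketch of such a temporal encoder in PyTorch; this is an assumption, since the patent names no framework, and the module name TemporalEncoder and the hyper-parameters n_heads, n_layers, and d_ff are illustrative rather than values fixed by the patent (norm_first, available in recent PyTorch versions, gives the pre-module layer normalization described above):

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Pre-norm Transformer encoder over a sequence of per-frame features."""

    def __init__(self, d_model=2048, n_heads=8, n_layers=4, d_ff=1024):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            batch_first=True, norm_first=True)  # LN before each block, residual after
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, z):        # z: (batch, T, d_model) embedded frame features
        z_out = self.encoder(z)  # output keeps the same size as the input
        g = z_out.mean(dim=1)    # compress by averaging over the frame dimension
        return z_out, g

# toy usage: two sequences of 16 frames of 2048-d features
z = torch.randn(2, 16, 2048)
hidden, pooled = TemporalEncoder()(z)   # hidden: (2, 16, 2048), pooled: (2, 2048)
```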
The three-dimensional human body template used here is a linear function $M(\theta)$; the input of the function is the human pose parameter $\theta$, i.e. the rotation of the skeleton joints, and the output is the vertex coordinates $V \in \mathbb{R}^{3 \times N}$ of the three-dimensional human body template, i.e. $V = M(\theta)$, where $N$ is the total number of vertices of the three-dimensional human body template. From the output vertex coordinates, the joint coordinates of the human body template can be regressed as $J = WV$, where $W$ is a regression matrix.
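To make the joint regression $J = WV$ concrete, here is a small sketch; the vertex and joint counts and the random stand-in matrices are illustrative assumptions, not values fixed by the patent:

```python
import torch

# illustrative sizes: N template vertices, K joints (the patent fixes neither)
N, K = 6890, 24
W = torch.rand(K, N)
W = W / W.sum(dim=1, keepdim=True)   # stand-in regression matrix; rows sum to 1
V = torch.randn(N, 3)                # vertex coordinates output by M(theta)
J = W @ V                            # regressed joint coordinates, shape (K, 3)
```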
Here $\beta = \frac{1}{T} \sum_{t=1}^{T} \beta_t$ (here $\beta$ is the value treated as the true shape), where $\beta_t$ is the prediction of the single person's body shape (i.e. height, weight, etc.) at the current time $t$; that is, the shape parameter is obtained by averaging the shape parameters of every frame.
The loss function $L$ of the entire model is:

$L = L_{3D} + L_{2D} + L_{para} + L_{adv} + L_{disc} + L_{motion}$

where hatted quantities denote the corresponding predicted values. The three-dimensional error $L_{3D}$ uses the L2-norm loss function: $L_{3D} = \sum_{t} \| X_t - \hat{X}_t \|_2^2$, where $X_t$ are the three-dimensional joint parameters;

the two-dimensional error $L_{2D}$ uses the L2-norm loss function: $L_{2D} = \sum_{t} \| x_t - \hat{x}_t \|_2^2$, where $x_t$ are the two-dimensional joint parameters;

the parameterized human body template error $L_{para}$ uses the L2-norm loss function: $L_{para} = \| \beta - \hat{\beta} \|_2^2 + \sum_{t} \| \theta_t - \hat{\theta}_t \|_2^2$, where $\beta$ is the human shape parameter and $\theta$ is the human pose parameter;

the adversarial error $L_{adv}$ uses the L2-norm loss function: $L_{adv} = \mathbb{E}\big[ (D(\hat{\Theta}) - 1)^2 \big]$, the adversarial loss on the motion parameters;

the discriminator error $L_{disc}$ uses the L2-norm loss function: $L_{disc} = \mathbb{E}\big[ (D(\Theta) - 1)^2 \big] + \mathbb{E}\big[ D(\hat{\Theta})^2 \big]$;

the motion encoder error $L_{motion}$ uses the L2-norm loss function: $L_{motion} = \| f^{s} - \hat{f}^{s} \|_2^2 + \| f^{o} - \hat{f}^{o} \|_2^2 + \| f - \hat{f} \|_2^2$, where $f^{s}$ is the external contour feature of the human body, $f^{o}$ is the optical flow feature, and $f$ is the spatial feature of the video frame.
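Assuming PyTorch tensors holding the quantities defined above, the combined loss could be sketched as below; the equal weighting of the terms is an assumption, since the patent states no weights:

```python
import torch
import torch.nn.functional as F

def total_loss(X3d, X3d_gt, x2d, x2d_gt, beta, beta_gt, theta, theta_gt,
               d_fake, fs, fs_hat, fo, fo_hat, f, f_hat):
    """Unweighted sum of the L2 error terms described above (generator side)."""
    L3d = F.mse_loss(X3d, X3d_gt)                                    # 3D joint error
    L2d = F.mse_loss(x2d, x2d_gt)                                    # 2D reprojection error
    Lpara = F.mse_loss(beta, beta_gt) + F.mse_loss(theta, theta_gt)  # template parameters
    Ladv = ((d_fake - 1.0) ** 2).mean()                              # adversarial term on motion
    Lmotion = (F.mse_loss(fs, fs_hat) + F.mse_loss(fo, fo_hat)
               + F.mse_loss(f, f_hat))                               # motion-encoder error
    return L3d + L2d + Lpara + Ladv + Lmotion

# the discriminator itself would be trained separately with
# L_disc = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
```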
Specifically, as shown in fig. 2, the three-dimensional human body reconstruction method based on the temporal context clue includes the following specific steps:
step S101, pre-training a three-dimensional human body reconstruction neural network based on time sequence context clues, mainly comprising a space encoder, a time sequence encoder, a motion encoder, a regressor and a differentiable renderer, wherein a data set comprises a mixed two-dimensional and three-dimensional data set, 5000 sections of video data sets with two-dimensional truth values, 8000 sections of pseudo label data sets obtained by a two-dimensional key point detector, and 2000 sections of video data with parameterized human body template truth values are used for calculating for the three-dimensional data sets.
Step S102, extracting spatial features for each frame of the image sequence with a convolutional neural network, each feature being a 2048-dimensional vector; the specific network is a 50-layer residual network, the final output feature size is 2048 dimensions, the input sequence length is fixed, and the batch size is 32.
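For example, with torchvision (an assumption; the patent only specifies a 50-layer residual network and a recent torchvision API is assumed), the 2048-dimensional per-frame feature can be taken from the globally pooled ResNet-50 trunk:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)   # a 50-layer residual network
backbone.fc = nn.Identity()         # drop the classifier; keep the pooled 2048-d feature

frames = torch.randn(16, 3, 224, 224)   # one 16-frame clip (length illustrative)
with torch.no_grad():
    f = backbone(frames)                # (16, 2048): one spatial feature per frame
```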
Step S103, adding the external human body contour and optical flow information of each frame to the input features through a motion encoder, obtaining the fused feature $\hat{f}_i = [\, f_i,\ f_i^{s},\ f_i^{o} \,]$, where $f_i$ is the spatial feature of the video frame, $f_i^{s}$ is the external contour feature of the human body, and $f_i^{o}$ is the optical flow feature.
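A minimal sketch of this fusion, assuming the contour and optical flow features have already been computed (their dimensionalities here are invented for illustration):

```python
import torch

T = 16
f = torch.randn(T, 2048)    # spatial features from the CNN
f_s = torch.randn(T, 128)   # external-contour (silhouette) features; size assumed
f_o = torch.randn(T, 128)   # optical-flow features; size assumed
f_fused = torch.cat([f, f_s, f_o], dim=-1)   # fused per-frame feature, (T, 2304)
```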
Step S104, the updated features are passed to the Transformer encoding layers; the model architecture comprises a self-attention mechanism and a shallow fully connected feed-forward network, and the output of each part is processed by a residual connection and layer normalization; the temporal encoder formed by the Transformer network extracts contextual temporal information, and for each frame outputs hidden variables that include information from the preceding and following frames.
Step S105, regressing the parameterized human body model parameters and camera parameters; the regressor is initialized with the average pose, and then the pose result of the previous frame is used as the initialization of the next frame, fitting the corresponding three-dimensional human body template motion sequence over the whole sequence; the regressor of the parameterized human body template consists of 2 fully connected layers, each with 1024 neurons, and the final layer outputs an 85-dimensional vector containing pose, shape, and camera parameter information.
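A sketch of such a regressor following the stated sizes (two fully connected layers of 1024 neurons and an 85-dimensional output); the input dimension and the pose/shape/camera split of the 85 values are assumptions, and the mean-pose initialization is only indicated in a comment:

```python
import torch
import torch.nn as nn

class ParamRegressor(nn.Module):
    """Two 1024-neuron fully connected layers, then an 85-d output."""

    def __init__(self, in_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 85))   # pose + shape + camera; a 72/10/3 split is assumed

    def forward(self, g):          # g: per-frame hidden variable from the temporal encoder
        # in the patent, the first frame starts from the mean pose and each
        # later frame is initialized from the previous frame's pose result
        return self.net(g)
```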
Step S106, adopting multi-head self-attention to integrate the feature vectors of all time steps and judge whether the human motion pose is real and natural; 2 multi-layer perceptron layers, each with 1024 neurons and a sine activation, learn the attention weights, and finally a linear layer predicts whether each sample belongs to a real and plausible human motion pose.
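A sketch of such a discriminator head under the stated sizes; the input dimension, the use of softmax to normalize the learned attention weights, and the pooling scheme are assumptions:

```python
import torch
import torch.nn as nn

class SineAct(nn.Module):
    def forward(self, x):
        return torch.sin(x)

class MotionDiscriminator(nn.Module):
    """Attention-pooled discriminator over a sequence of predicted pose parameters."""

    def __init__(self, in_dim=85):
        super().__init__()
        self.attn = nn.Sequential(            # two 1024-neuron MLP layers, sine activation
            nn.Linear(in_dim, 1024), SineAct(),
            nn.Linear(1024, 1024), SineAct(),
            nn.Linear(1024, 1))
        self.score = nn.Linear(in_dim, 1)     # final linear layer: real/plausible score

    def forward(self, seq):                   # seq: (batch, T, in_dim)
        w = torch.softmax(self.attn(seq), dim=1)  # per-frame attention weights
        pooled = (w * seq).sum(dim=1)             # attention-weighted temporal pooling
        return self.score(pooled)
```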
Step S107, performing differentiable rendering with the parameterized human body model parameters obtained by regression, and comparing the resulting two-dimensional joint information, human body contour, and optical flow information with the values estimated by the network to calculate the reprojection error.
Step S108, constructing a loss function from the human body template pose sequence and all images, and training the network model.
Step S109, during training, an adaptive moment estimation (Adam) optimizer is used with the learning rate fixed at 0.0001 for 120 training epochs; the evaluation indexes comprise the mean per-joint position error, the percentage of correct keypoints, the per-vertex error, and the acceleration error; the acceleration error is calculated from the difference between the ground-truth and predicted accelerations of the three-dimensional coordinates of each joint (in mm/s^2) and is the main smoothness index of the estimated motion sequence, a better acceleration error marking smooth and natural human motion estimation.
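The acceleration error can be sketched as the mean difference between the ground-truth and predicted second-order finite differences of the three-dimensional joint positions; the frame rate and tensor layout below are assumptions:

```python
import torch

def accel_error(j3d_pred, j3d_gt, fps=30.0):
    """Mean difference between predicted and ground-truth joint accelerations.

    j3d_*: (T, K, 3) per-frame 3D joint positions; with positions in mm and
    time in seconds the result is in mm/s^2.
    """
    dt = 1.0 / fps
    a_pred = (j3d_pred[2:] - 2 * j3d_pred[1:-1] + j3d_pred[:-2]) / dt ** 2
    a_gt = (j3d_gt[2:] - 2 * j3d_gt[1:-1] + j3d_gt[:-2]) / dt ** 2
    return (a_pred - a_gt).norm(dim=-1).mean()
```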
Step S110, after the training is finished, given any video frame sequence, the pose and shape reconstruction of the three-dimensional human body model is completed through the trained model.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that are within the spirit and principle of the present invention are intended to be included in the scope of the present invention.

Claims (3)

1. A three-dimensional human body reconstruction method based on time sequence context clues, characterized by comprising the following steps:
Step S1, inputting a single-person video frame sequence, recorded as $\{I_1, I_2, \ldots, I_T\}$, where $T$ is the length of the sequence, i.e. the number of images, and $I_i$ denotes the $i$-th image, i.e. frame $i$;
Step S2, utilizing a convolutional neural network to extract, for each frame $I_i$ of the image sequence, a spatial feature $f_i$, where each feature is a 2048-dimensional vector;
Step S3, adding the external human body contour and optical flow information of each frame to the input features through a motion encoder, obtaining the fused feature $\hat{f}_i = [\, f_i,\ f_i^{s},\ f_i^{o} \,]$, where $f_i$ is the spatial feature of the video frame, $f_i^{s}$ is the external contour feature of the human body, and $f_i^{o}$ is the optical flow feature;
Step S4, training a Transformer network as a temporal encoder to extract contextual temporal information, outputting for each frame a hidden variable $g_i$ that contains information from the preceding and following frames;
Step S5, utilizing $g_i$ to regress the parameterized human body template parameters $\Theta$ and the camera parameters $\pi$; the regressor is initialized with the mean pose $\bar{\Theta}$, and afterwards the pose result $\Theta_{i-1}$ of the previous frame is used as the initialization of the next frame, fitting the corresponding three-dimensional human body template motion sequence $\{\hat{\Theta}_1, \ldots, \hat{\Theta}_T\}$ over the whole sequence;
Step S6, adopting self-attention to integrate the feature vectors of all time steps and judging whether the human action pose is real and natural;
Step S7, performing differentiable rendering with the parameterized human body template parameters obtained by regression to obtain two-dimensional joint information, the human body contour, and optical flow information, and comparing them with the values estimated by the network to calculate a reprojection error;
Step S8, constructing a loss function $L$ from the human body template pose sequence and all image video frame sequences, and training the network model;
Step S9, after the training in step S8 is finished, given any video frame sequence, completing the reconstruction of the pose and shape of the three-dimensional human body model through the trained model.
2. The three-dimensional human body reconstruction method based on time sequence context clues according to claim 1, characterized in that: in step S5, the fitted three-dimensional human body template is a linear function $M(\theta)$; the input of the linear function $M(\theta)$ is the human pose parameter $\theta$, and the output is the vertex coordinates $V \in \mathbb{R}^{3 \times N}$ of the three-dimensional human body template, i.e. $V = M(\theta)$, where $N$ is the total number of vertices of the three-dimensional human body template; the joint coordinates of the human body template are regressed from the output vertex coordinates as $J = WV$, where $W$ is a regression matrix.
3. The three-dimensional human body reconstruction method based on time sequence context clues according to claim 1, characterized in that: in step S8, the loss function $L$ is:

$L = L_{3D} + L_{2D} + L_{para} + L_{adv} + L_{disc} + L_{motion}$

where hatted quantities denote the corresponding predicted values; the three-dimensional error $L_{3D}$ uses the L2-norm loss function: $L_{3D} = \sum_{t} \| X_t - \hat{X}_t \|_2^2$, where $X_t$ are the three-dimensional joint parameters;

the two-dimensional error $L_{2D}$ uses the L2-norm loss function: $L_{2D} = \sum_{t} \| x_t - \hat{x}_t \|_2^2$, where $x_t$ are the two-dimensional joint parameters;

the parameterized human body template error $L_{para}$ uses the L2-norm loss function: $L_{para} = \| \beta - \hat{\beta} \|_2^2 + \sum_{t} \| \theta_t - \hat{\theta}_t \|_2^2$, where $\beta$ is the human shape parameter and $\theta$ is the human pose parameter;

the adversarial error $L_{adv}$ uses the L2-norm loss function: $L_{adv} = \mathbb{E}\big[ (D(\hat{\Theta}) - 1)^2 \big]$, the adversarial loss on the motion parameters;

the discriminator error $L_{disc}$ uses the L2-norm loss function: $L_{disc} = \mathbb{E}\big[ (D(\Theta) - 1)^2 \big] + \mathbb{E}\big[ D(\hat{\Theta})^2 \big]$;

the motion encoder error $L_{motion}$ uses the L2-norm loss function: $L_{motion} = \| f^{s} - \hat{f}^{s} \|_2^2 + \| f^{o} - \hat{f}^{o} \|_2^2 + \| f - \hat{f} \|_2^2$, where $f^{s}$ is the external contour feature of the human body, $f^{o}$ is the optical flow feature, and $f$ is the spatial feature of the video frame.
CN202210985402.6A 2022-08-17 2022-08-17 Three-dimensional human body reconstruction method based on time sequence context clues Pending CN115330950A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210985402.6A CN115330950A (en) 2022-08-17 2022-08-17 Three-dimensional human body reconstruction method based on time sequence context clues

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210985402.6A CN115330950A (en) 2022-08-17 2022-08-17 Three-dimensional human body reconstruction method based on time sequence context clues

Publications (1)

Publication Number Publication Date
CN115330950A true CN115330950A (en) 2022-11-11

Family

ID=83923878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210985402.6A Pending CN115330950A (en) 2022-08-17 2022-08-17 Three-dimensional human body reconstruction method based on time sequence context clues

Country Status (1)

Country Link
CN (1) CN115330950A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116309698A (en) * 2023-01-11 2023-06-23 中国科学院上海微系统与信息技术研究所 Multi-frame optical flow estimation method based on motion feature compensation guidance
CN116385666A (en) * 2023-06-02 2023-07-04 杭州倚澜科技有限公司 Human body model redirection method and device based on feedback type cyclic neural network
CN116385666B (en) * 2023-06-02 2024-02-27 杭州倚澜科技有限公司 Human body model redirection method and device based on feedback type cyclic neural network
CN117218297A (en) * 2023-09-29 2023-12-12 北京百度网讯科技有限公司 Human body reconstruction parameter generation method, device, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination