CN114783039B - Motion migration method driven by 3D human body model - Google Patents

Motion migration method driven by 3D human body model

Info

Publication number
CN114783039B
CN114783039B (application CN202210708260.9A)
Authority
CN
China
Prior art keywords
human body
motion
image
posture
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210708260.9A
Other languages
Chinese (zh)
Other versions
CN114783039A (en)
Inventor
罗冬
夏贵羽
张泽远
马芙蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202210708260.9A priority Critical patent/CN114783039B/en
Publication of CN114783039A publication Critical patent/CN114783039A/en
Application granted granted Critical
Publication of CN114783039B publication Critical patent/CN114783039B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a 3D human body model driven motion migration method. Training data are converted into UV space, and a 3D human body model is constructed and optimized using complementary information between adjacent video frames. The optimized 3D human body model is then projected onto a 2D plane so that the 3D information of the original motion is retained, and the optimized model can be driven with the target pose. The 2D projection and the pose of the training data serve as the input of a pre-trained model, and the trained model is saved. The pose of the target person is then normalized. Finally, the 2D projection of the optimized 3D human body model driven by the target person's pose, together with the normalized target pose, is fed into the trained motion image generation model for the final motion migration. This overcomes problems such as blurring and shape distortion in 2D plane image generation and ensures that the generated motion images have reliable depth information, accurate shape and a clear face.

Description

Motion migration method driven by 3D human body model
Technical Field
The invention belongs to the technical field of motion migration, and particularly relates to a motion migration method driven by a 3D human body model.
Background
Human motion migration aims to synthesize human motion images that combine the human texture of the training images with a target pose. It is currently used in film production, game design and medical rehabilitation. With human motion migration, the character in the training images can be animated freely to perform user-defined actions. Traditional motion migration methods based on computer graphics require complicated rendering operations to generate appearance texture; they are time-consuming and computationally heavy, so ordinary users and small organizations cannot afford the extremely high computation and time cost.
Human motion is a complex natural phenomenon: all real motion occurs in 3D space, and real motion images look natural because they are 2D projections of the original 3D motion and therefore naturally inherit its 3D information. Existing motion migration studies are mostly based on 2D motion data such as images and video, which are 2D projections of the true motion. The motion images generated by such studies commonly suffer from problems such as blurring and shape distortion.
Disclosure of Invention
In order to solve the defects in the prior art, the invention provides a motion migration method driven by a 3D human body model, which not only overcomes the problems of blurring, shape distortion and the like in the generation of a 2D plane image, but also ensures that the generated motion image has reliable depth information, accurate shape and clear human face.
In order to achieve this purpose, the technical scheme adopted by the invention is as follows. A 3D human body model driven motion migration method comprises: constructing a training data set from pre-recorded video frames and extracting the pose of the training data; converting the training data into UV space to generate UV maps, and constructing and optimizing a 3D human body model using complementary information between adjacent video frames; projecting the optimized 3D human body model onto a 2D plane to obtain a 2D projection that retains the 3D information of the original motion, and driving the optimized 3D human body model with the pose of the target person; using the 2D projection retaining the 3D information of the original motion and the pose of the training data as the input of a motion image generation model, and saving the trained motion image generation model; normalizing the pose of the target person; and finally performing the motion migration by feeding the 2D projection of the optimized 3D human body model driven by the target person's pose, together with the normalized target pose, into the trained motion image generation model.
Further, extracting the posture of the training data by adopting a posture estimation algorithm OpenPose.
Further, pixels of the images in the training data are converted into UV space using DensePose and corresponding UV maps are generated, and the 3D human body model is constructed and optimized using complementary information between adjacent video frames, including: taking from the training data a set of images of different poses spaced several frames apart, {I_i} (i = 1, ..., n), together with the corresponding UV maps generated by DensePose; generating a set of local texture maps {T_i} (i = 1, ..., m) by UV conversion; inputting the generated local texture maps {T_i} into a texture filling network to generate a texture map T with multi-pose texture information; and performing, through a loss function, a loss calculation between the set of "original images" {Î_i} restored from the texture map T and the set of real images {I_i}, thereby optimizing the 3D human body model.
Further, the loss function is expressed as:

L_{tex} = \frac{1}{n} \sum_{i=1}^{n} \left\| \hat{I}_i - I_i \right\|_1

where each \hat{I}_i is restored from the texture map T and n denotes the number of restored "original images". The texture map T is obtained from:

T = \sum_{i=1}^{m} P_i \odot T_i

where m denotes the total number of local texture maps {T_i} and P denotes the probability map generated by the texture filling network, which predicts the probability that a pixel of T comes from the pixel at the corresponding position of T_i. P is obtained from:

P_i^{(j,k)} = \frac{\exp\left(\alpha\, O_i^{(j,k)}\right)}{\sum_{c=1}^{C} \exp\left(\alpha\, O_c^{(j,k)}\right)}

where P_i^{(j,k)} denotes the element in row j and column k of P_i, O_i^{(j,k)} denotes the element value in row j and column k of O_i, O denotes the output of the decoder, C denotes the number of channels output by the decoder, and α denotes the amplification factor of the amplification module. In particular, the number n of restored "original images", the total number m of local texture maps, and the number C of channels output by the decoder are all equal.
Further, the projecting the optimized 3D human body model to a 2D plane to obtain a 2D projection retaining 3D information of the original motion, and driving the optimized 3D human body model in the pose of the target person includes: and predicting the posture of the 3D human body model through the HMR, and transmitting the predicted posture to the 3D human body model, so that the 3D human body model is driven.
Further, the motion image generation model is defined as a Face-Attention GAN model. The Face-Attention GAN model is based on the GAN model; it matches an elliptical face region using a Gaussian distribution, configures a face enhancement loss function, and introduces an attention mechanism, wherein:

Matching the elliptical face region using a Gaussian distribution is realized by designing the mean and covariance matrix, and comprises the following steps. The position of the face region in the image is determined by the pose estimation algorithm OpenPose, which provides the locations of the nose, eyes and ears. The center of the ellipse is set to the nose position f_nose. The two axes of the ellipse are the eigenvectors of the covariance matrix, and the axis lengths correspond to its eigenvalues. Let a and b be the two axes of the ellipse; a and b are both unit vectors and satisfy:

a = (b_2, -b_1)^T, \quad \|a\| = \|b\| = 1

where b_1 and b_2 are the two elements of b. The relationship between the eigenvectors a and b and the covariance matrix Σ is:

Σ = [a \; b] \, \mathrm{diag}(\lambda_a, \lambda_b) \, [a \; b]^{-1}

where λ_a is the eigenvalue corresponding to a, λ_b is the eigenvalue corresponding to b, and the eigenvalues are determined by the axis length l of the ellipse and the scaling factor σ. Since a and b are orthogonal, the matrix [a b] is necessarily invertible. Under the Gaussian distribution with f_nose as the mean and Σ as the covariance, the face-enhanced Gaussian weight W is obtained by uniformly sampling at a distance interval of 1 within the rectangular region constructed by the four points (1, 1), (1, 512), (512, 1) and (512, 512), and the generated Gaussian weight W is used to define the face enhancement loss function. The face enhancement loss function is:

\mathcal{L}_{face}(G) = \mathbb{E}_{(p, x, y)} \left[ \left\| W \odot \left( y - G(p, x) \right) \right\|_1 \right]

where p denotes the pose, x denotes the 2D projection of the 3D human body model, y denotes the real image, G(p, x) denotes the image generated by the generator G from the input p and x, and W denotes the Gaussian weight generated by matching the elliptical face with the Gaussian distribution.

The introduced attention mechanism includes channel attention and spatial attention. The final objective function is:

\min_G \max_D \; \mathcal{L}_{GAN}(G, D) + \lambda_{face}\,\mathcal{L}_{face}(G) + \lambda_{FM}\,\mathcal{L}_{FM}(G, D) + \lambda_{VGG}\,\mathcal{L}_{VGG}(G)

where G denotes the generator and D denotes the discriminator. \mathcal{L}_{GAN} denotes the loss function of the GAN model, and min_G max_D expresses the mutual game in which the discriminator learns to accurately judge whether a sample is real and to distinguish the samples produced by the generator, while the generator learns to fool it. \mathcal{L}_{face} denotes the face enhancement loss function used to enhance the face region of the image. \mathcal{L}_{FM} denotes the feature matching loss used to ensure global consistency of the image content. \mathcal{L}_{VGG} denotes the perceptual reconstruction loss, also used to ensure global consistency of the image content. The parameters λ_face, λ_FM and λ_VGG are used to balance these losses.
Further, together with the introduced attention mechanism, a feature matching loss based on the discriminator D is employed, as follows:

\mathcal{L}_{FM}(G, D) = \sum_{i=1}^{T} \frac{1}{N_i} \left\| D^{(i)}(p, y) - D^{(i)}(p, G(p, x)) \right\|_1

where D^{(i)} is the i-th layer feature extractor of the discriminator D, N_i denotes the number of elements of the i-th layer, and T is the total number of layers of the discriminator D. The generated image and the real image are then input into a pre-trained VGG network and their features at different layers are compared; the perceptual reconstruction loss is:

\mathcal{L}_{VGG}(G) = \sum_{i=1}^{N} \frac{1}{M_i} \left\| \phi^{(i)}(y) - \phi^{(i)}(G(p, x)) \right\|_1

where \phi^{(i)} denotes the i-th layer feature extractor of the VGG network, M_i denotes the number of elements of the i-th layer, and N is the total number of layers of the VGG network.
Further, normalizing the pose of the target person specifically comprises: approximating the real length of each bone segment by the maximum bone segment length in the training set, and approximating the real bone segment length of the new pose in the same way; then adjusting the bone segment length displayed in the image according to the proportion between the standard skeleton and the new skeleton. Let J_i denote the i-th joint coordinate of the new pose and J_{p(i)} denote the coordinate of its parent joint. J_i is adjusted by:

J_i \leftarrow J_{p(i)} + \frac{l_i^{train}}{l_i^{tgt}} \left( J_i - J_{p(i)} \right)

where l_i^{tgt} and l_i^{train} respectively denote the maximum bone segment length between the i-th joint and its parent joint in the target person images and in the training images.
Compared with the prior art, the invention has the following beneficial effects:
(1) The method converts training data into UV space to generate UV maps and constructs and optimizes a 3D human body model using complementary information between adjacent video frames; it then projects the optimized 3D human body model onto a 2D plane to obtain a 2D projection retaining the 3D information of the original motion, and drives the optimized 3D human body model with the pose of the target person; the 2D projection retaining the 3D information of the original motion and the pose of the training data are used as the input of a motion image generation model, and the trained motion image generation model is saved; the pose of the target person is normalized; finally, the 2D projection of the optimized 3D human body model driven by the target person's pose and the normalized target pose are used as the input of the trained motion image generation model for the final motion migration. This overcomes problems such as blurring and shape distortion in 2D plane image generation and ensures that the generated motion images have reliable depth information, accurate shape and a clear face;
(2) The method has a small computational burden and short running time, and can be applied mainly in three fields: 1) in the film and television industry, it can be used to make simulated real characters perform visually appealing and highly difficult actions; 2) in game design, it can be used for the action design of virtual characters; 3) in medical rehabilitation, it can be used to synthesize the normal movement postures of patients with dyskinesia.
Drawings
FIG. 1 is a model framework for optimizing a 3D human body model in an embodiment of the invention;
FIG. 2 is a diagram of a texture filling network in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of the pose drive of a 3D human body model according to an embodiment of the invention;
FIG. 4 is a Face-Attention GAN model framework constructed in an embodiment of the present invention;
FIG. 5 is a diagram illustrating matching elliptical faces using Gaussian distributions in an embodiment of the present invention;
FIG. 6 is a schematic illustration of a CBAM attention mechanism in an embodiment of the present invention;
fig. 7 is a schematic diagram of a motion transfer process in an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
A 3D human body model driven motion migration method comprises: constructing a training data set from pre-recorded video frames and extracting the pose of the training data; converting the training data into UV space to generate UV maps, and constructing and optimizing a 3D human body model using complementary information between adjacent video frames; projecting the optimized 3D human body model onto a 2D plane to obtain a 2D projection that retains the 3D information of the original motion, and driving the optimized 3D human body model with the pose of the target person; using the 2D projection retaining the 3D information of the original motion and the pose of the training data as the input of a motion image generation model, and saving the trained motion image generation model; normalizing the pose of the target person; and finally performing the motion migration by feeding the 2D projection of the optimized 3D human body model driven by the target person's pose, together with the normalized target pose, into the trained motion image generation model.
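Read as an implementation outline, the whole procedure can be sketched in Python as below. This is only a schematic sketch: every helper name (extract_pose, build_uv_map, optimize_body_model, drive_model, project_to_2d, train_generation_model, normalize_pose) is a hypothetical placeholder for a step described in this specification, not a function it defines.

# Illustrative end-to-end sketch of the described pipeline (all helpers are hypothetical).
def motion_transfer_pipeline(training_frames, target_frames):
    # Step 1: extract 2D poses of the training frames (e.g., with OpenPose).
    train_poses = [extract_pose(f) for f in training_frames]

    # Step 2: convert frames to UV space (e.g., with DensePose) and optimize the 3D body model.
    uv_maps = [build_uv_map(f) for f in training_frames]
    body_model = optimize_body_model(training_frames, uv_maps)

    # Step 3: drive the optimized model with each training pose and project it to 2D.
    train_projections = [project_to_2d(drive_model(body_model, p)) for p in train_poses]

    # Step 4: train the motion image generation model (Face-Attention GAN) and keep it.
    generator = train_generation_model(train_projections, train_poses, training_frames)

    # Steps 5-6: normalize the target-person poses, drive and project the model, and generate.
    target_poses = [normalize_pose(extract_pose(f)) for f in target_frames]
    return [generator(project_to_2d(drive_model(body_model, p)), p) for p in target_poses]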
Step 1, constructing a training data set by taking a video frame shot in advance as training data, and extracting the posture of the training data.
A motion video of about 3 minutes on average is recorded for each person at 30 frames per second; the training data are the video frames of each person, each with a resolution of 512 x 512. The videos are captured with mobile phones from fixed positions at a shooting distance of about 5 meters. After the training data set is prepared, its poses are extracted with the pose estimation algorithm OpenPose.
Step 2, converting pixels of the images in the training data into UV space using DensePose to generate corresponding UV maps, and constructing and optimizing the 3D human body model using complementary information between adjacent video frames.
This embodiment uses a human body model optimization method based on sequential images; its framework is shown in FIG. 1. A set of images of different poses spaced several frames apart, {I_i} (i = 1, ..., n), is taken from the training data together with the corresponding UV maps generated by DensePose, a group of local texture maps {T_i} (i = 1, ..., m) is then generated by UV conversion, and the local texture maps are input into the texture filling network.

The texture filling network is shown in FIG. 2; it finally generates a complete texture map T with multi-pose texture information. An L1 loss is computed between the set of "original images" {Î_i} restored from T and the set of real images {I_i}, which drives the network to generate a more detailed texture map; the texture map is ultimately used to generate the 3D human body model, enabling its optimization. The corresponding loss function is expressed as:

L_{tex} = \frac{1}{n} \sum_{i=1}^{n} \left\| \hat{I}_i - I_i \right\|_1

where each \hat{I}_i is restored from the texture map T and n denotes the number of restored "original images". The texture map T is obtained from:

T = \sum_{i=1}^{m} P_i \odot T_i

where m denotes the total number of local texture maps {T_i} and P denotes the probability map generated by the texture filling network, which predicts the probability that a pixel of T comes from the pixel at the corresponding position of T_i. P is obtained from:

P_i^{(j,k)} = \frac{\exp\left(\alpha\, O_i^{(j,k)}\right)}{\sum_{c=1}^{C} \exp\left(\alpha\, O_c^{(j,k)}\right)}

where P_i^{(j,k)} denotes the element in row j and column k of P_i, O_i^{(j,k)} denotes the element value in row j and column k of O_i, O denotes the output of the decoder, C denotes the number of channels output by the decoder, and α denotes the amplification factor of the amplification module. In particular, the number n of restored "original images", the total number m of local texture maps, and the number C of channels output by the decoder are all equal.
The optimization of the 3D human body model is realized according to the method.
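A minimal PyTorch sketch of the fusion and loss computation described above follows. It assumes the decoder output O has one channel per local texture map and that the probability map is a sharpened softmax over those channels; the tensor shapes and the amplification factor value are illustrative.

import torch
import torch.nn.functional as F

def fuse_textures(decoder_out, local_textures, amplification=10.0):
    # decoder_out:    (m, H, W)    raw decoder output O
    # local_textures: (m, 3, H, W) partial texture maps T_i
    # Probability map P: softmax over the m channels, sharpened by the amplification factor.
    prob = F.softmax(amplification * decoder_out, dim=0)           # (m, H, W)
    # Fused texture map T: probability-weighted sum of the local texture maps.
    return (prob.unsqueeze(1) * local_textures).sum(dim=0)         # (3, H, W)

def texture_l1_loss(restored_images, real_images):
    # L1 loss between images restored from the fused texture and the real images.
    losses = [(r - y).abs().mean() for r, y in zip(restored_images, real_images)]
    return torch.stack(losses).mean()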
Step 3, projecting the optimized 3D human body model onto a 2D plane so that the 3D information of the original motion is retained, and designing a pose driving method for the 3D human body model. The method predicts the pose of the 3D human body model through HMR and transfers the predicted pose to the 3D human body model, thereby driving it, as shown in FIG. 3. The visualized skeleton map represents the pose of the 3D human body model intuitively and is easy to interpret.
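A schematic sketch of this driving step is given below; predict_smpl_pose and render are hypothetical placeholders for the HMR pose regressor and the projection/rendering step, which this embodiment does not spell out.

def drive_and_project(body_model, frame):
    # Predict pose parameters for the frame with an HMR-style regressor (placeholder call).
    pose_params = predict_smpl_pose(frame)
    # Transfer the predicted pose to the optimized, person-specific 3D body model.
    posed_mesh = body_model.pose(pose_params)
    # Project the posed mesh onto the 2D image plane to obtain the conditioning image.
    return render(posed_mesh, image_size=512)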
Step 4, taking the 2D projection and the pose of the training data as the input of the motion image generation model, and saving the trained model.
This embodiment provides a motion image generation model for the final motion migration; the motion image generation model is defined as a Face-Attention GAN model. The Face-Attention GAN model is based on the GAN model; it matches an elliptical face region using a Gaussian distribution, configures a face enhancement loss function, and introduces an attention mechanism. The model takes the 2D projection obtained in step 3 and the pose extracted in step 1 as input; the model framework is shown in FIG. 4. The adversarial loss of the GAN is:

\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{(p, y)} \left[ \log D(p, y) \right] + \mathbb{E}_{(p, x)} \left[ \log \left( 1 - D(p, G(p, x)) \right) \right]

where G denotes the generator, D denotes the discriminator, p denotes the pose, x denotes the 2D projection of the 3D human body model, y denotes the real image, and G(p, x) denotes the image generated by the generator G from the input p and x. The first term ensures the basic judgment ability of the discriminator: the larger it is, the larger D(p, y) is, i.e., the more accurately the discriminator identifies a real sample as real. The second term ensures that the discriminator can recognize fake samples: the larger it is, the smaller D(p, G(p, x)) is, i.e., the more correctly the discriminator rejects generated samples.
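Under the usual conditional-GAN formulation, and assuming the discriminator outputs a probability, the adversarial terms can be sketched in PyTorch as follows; the network definitions themselves are assumed to exist elsewhere.

import torch

def gan_losses(G, D, pose, projection, real_image, eps=1e-8):
    # Generator output conditioned on the pose and the 2D projection of the 3D model.
    fake_image = G(pose, projection)

    # Discriminator: identify real samples as real and generated samples as fake.
    d_real = D(pose, real_image)
    d_fake = D(pose, fake_image.detach())
    loss_d = -(torch.log(d_real + eps).mean() + torch.log(1.0 - d_fake + eps).mean())

    # Generator: fool the discriminator on generated samples.
    loss_g = -torch.log(D(pose, fake_image) + eps).mean()
    return loss_d, loss_g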
Matching the elliptical face region using a Gaussian distribution is realized by designing the mean and covariance matrix of the Gaussian distribution, and comprises the following steps. The position of the face region in the image is determined by the pose estimation algorithm OpenPose, which provides the locations of the nose, eyes and ears; the center of the ellipse is set to the nose position f_nose. The two axes of the ellipse are the eigenvectors of the covariance matrix, and the axis lengths correspond to its eigenvalues. As shown in FIG. 5, let a and b be the two axes of the ellipse; a and b are both unit vectors and satisfy:

a = (b_2, -b_1)^T, \quad \|a\| = \|b\| = 1

where b_1 and b_2 are the two elements of b. The relationship between the eigenvectors a and b and the covariance matrix Σ is:

Σ = [a \; b] \, \mathrm{diag}(\lambda_a, \lambda_b) \, [a \; b]^{-1}

where λ_a is the eigenvalue corresponding to a, λ_b is the eigenvalue corresponding to b, and the eigenvalues are determined by the axis length l of the ellipse and the scaling factor σ. Since a and b are orthogonal, the matrix [a b] is necessarily invertible. Under the Gaussian distribution with f_nose as the mean and Σ as the covariance, the face-enhanced Gaussian weight W is obtained by uniformly sampling at a distance interval of 1 within the rectangular region constructed by the four points (1, 1), (1, 512), (512, 1) and (512, 512), and the generated Gaussian weight W is used to define the face enhancement loss function.
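A sketch of the weight-map construction is given below, assuming a 512 x 512 grid; how the eigenvalues are derived from the ellipse axis lengths and the scaling factor σ (here simply the scaled squared lengths) is an assumption made for illustration, not the literal rule of the embodiment.

import numpy as np

def face_gaussian_weight(nose_xy, axis_a, axis_b, len_a, len_b, sigma=1.0, size=512):
    # axis_a, axis_b: orthogonal unit vectors (ellipse axes = eigenvectors of the covariance).
    # len_a, len_b:   ellipse axis lengths; their scaled squares serve as eigenvalues here.
    U = np.stack([axis_a, axis_b], axis=1)                       # eigenvector matrix [a b]
    lam = np.diag([(sigma * len_a) ** 2, (sigma * len_b) ** 2])  # eigenvalues (illustrative)
    cov = U @ lam @ np.linalg.inv(U)                             # covariance matrix Sigma
    cov_inv = np.linalg.inv(cov)

    # Uniformly sample the rectangle (1,1)-(512,512) at a distance interval of 1.
    xs, ys = np.meshgrid(np.arange(1, size + 1), np.arange(1, size + 1))
    d = np.stack([xs - nose_xy[0], ys - nose_xy[1]], axis=-1)    # offsets from the nose
    mahalanobis = np.einsum('...i,ij,...j->...', d, cov_inv, d)
    return np.exp(-0.5 * mahalanobis)                            # unnormalised Gaussian weight W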
The designed face enhancement loss function is:

\mathcal{L}_{face}(G) = \mathbb{E}_{(p, x, y)} \left[ \left\| W \odot \left( y - G(p, x) \right) \right\|_1 \right]

where p denotes the pose, x denotes the 2D projection of the 3D human body model, y denotes the real image, G(p, x) denotes the image generated by the generator G from the input p and x, and W denotes the Gaussian weight generated by matching the elliptical face with the Gaussian distribution.
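Assuming the face enhancement term is an L1 difference weighted by the Gaussian map W (the exact norm is not reproduced in the text above and is an assumption here), it might be computed as:

def face_enhancement_loss(generated, real, face_weight):
    # generated, real: (3, H, W) torch tensors; face_weight: (H, W) Gaussian weights W.
    # The weight is broadcast over the colour channels so only the face region is emphasised.
    return (face_weight.unsqueeze(0) * (generated - real).abs()).mean()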
An attention mechanism is also introduced into the model; its structure is shown in FIG. 6 and combines channel attention with spatial attention.
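A compact PyTorch sketch of a CBAM-style block (channel attention followed by spatial attention) is given below as a generic reference; the reduction ratio, kernel size and exact layer configuration used by the embodiment are not specified in the text and are assumed here.

import torch
import torch.nn as nn

class CBAM(nn.Module):
    # Channel attention followed by spatial attention, as in CBAM.
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over average- and max-pooled channel descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: convolution over channel-wise average and max maps.
        sa_in = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(sa_in))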
To further refine the details, a feature matching loss based on the discriminator D is employed, as follows:

\mathcal{L}_{FM}(G, D) = \sum_{i=1}^{T} \frac{1}{N_i} \left\| D^{(i)}(p, y) - D^{(i)}(p, G(p, x)) \right\|_1

where D^{(i)} is the i-th layer feature extractor of the discriminator D, N_i denotes the number of elements of the i-th layer, and T is the total number of layers of the discriminator D.

The generated image and the real image are then input into a pre-trained VGG network and their features at different layers are compared. The perceptual reconstruction loss is:

\mathcal{L}_{VGG}(G) = \sum_{i=1}^{N} \frac{1}{M_i} \left\| \phi^{(i)}(y) - \phi^{(i)}(G(p, x)) \right\|_1

where \phi^{(i)} denotes the i-th layer feature extractor of the VGG network, M_i denotes the number of elements of the i-th layer, and N is the total number of layers of the VGG network.
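Both consistency terms can be sketched as follows, assuming the discriminator and a pre-trained (frozen) VGG expose their per-layer feature maps as lists; the feature-extraction interfaces are placeholders.

def feature_matching_loss(disc_feats_real, disc_feats_fake):
    # L1 distance between discriminator features of real and generated images, layer by layer.
    terms = [(fr.detach() - ff).abs().mean()
             for fr, ff in zip(disc_feats_real, disc_feats_fake)]
    return sum(terms) / len(terms)

def perceptual_loss(vgg_feats_real, vgg_feats_fake):
    # Same comparison on features taken from a pre-trained VGG network.
    terms = [(fr - ff).abs().mean()
             for fr, ff in zip(vgg_feats_real, vgg_feats_fake)]
    return sum(terms) / len(terms)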
The final objective function is:

\min_G \max_D \; \mathcal{L}_{GAN}(G, D) + \lambda_{face}\,\mathcal{L}_{face}(G) + \lambda_{FM}\,\mathcal{L}_{FM}(G, D) + \lambda_{VGG}\,\mathcal{L}_{VGG}(G)

where the parameters λ_face, λ_FM and λ_VGG are used to balance these losses, G denotes the generator and D denotes the discriminator. \mathcal{L}_{GAN} denotes the GAN loss; min_G max_D expresses the mutual game in which the discriminator learns to accurately judge whether a sample is real and to distinguish the samples produced by the generator, while the generator learns to fool it. \mathcal{L}_{face} denotes the face enhancement loss function used to enhance the face region of the image. \mathcal{L}_{FM} denotes the feature matching loss used to ensure global consistency of the image content. \mathcal{L}_{VGG} denotes the perceptual reconstruction loss, also used to ensure global consistency of the image content.
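The generator-side objective can then be assembled roughly as below; the weights are hyperparameters whose values are not given in the text, so the defaults here are placeholders.

def generator_objective(loss_gan_g, loss_face, loss_fm, loss_vgg,
                        lam_face=1.0, lam_fm=1.0, lam_vgg=1.0):
    # Weighted sum of the adversarial, face enhancement, feature matching and perceptual terms.
    return loss_gan_g + lam_face * loss_face + lam_fm * loss_fm + lam_vgg * loss_vgg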
Step 5, normalizing the pose of the target person. The real length of each bone segment is approximated by the maximum bone segment length in the training set, and the real bone segment length of the new pose is approximated in the same way; the bone segment length displayed in the image is then adjusted according to the proportion between the standard skeleton and the new skeleton. Let J_i denote the i-th joint coordinate of the new pose and J_{p(i)} denote the coordinate of its parent joint. J_i is adjusted by:

J_i \leftarrow J_{p(i)} + \frac{l_i^{train}}{l_i^{tgt}} \left( J_i - J_{p(i)} \right)

where l_i^{tgt} and l_i^{train} respectively denote the maximum bone segment length between the i-th joint and its parent joint in the target person images and in the training images.
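A sketch of this normalization is given below; it assumes joints are indexed from the root outward so each parent is adjusted before its children, and it reads the adjustment as scaling each bone of the new pose by the ratio of maximum training to maximum target bone length, which is an interpretation of the description rather than the literal patent formula.

import numpy as np

def normalize_pose(joints, parents, max_len_train, max_len_target):
    # joints:         (J, 2) joint coordinates of the new pose
    # parents:        parent index per joint (root has parent -1)
    # max_len_train:  per-bone maximum length observed in the training images
    # max_len_target: per-bone maximum length observed in the target-person images
    adjusted = joints.astype(float).copy()
    for i in range(len(joints)):
        p = parents[i]
        if p < 0:
            continue
        bone = joints[i] - joints[p]
        scale = max_len_train[i] / max_len_target[i]
        adjusted[i] = adjusted[p] + scale * bone
    return adjusted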
Step 6, inputting the 2D projection of the optimized 3D human body model driven by the target person's pose and the normalized target person's pose into the trained motion image generation model to perform the final motion migration. The motion migration process comprises pose normalization of the new skeleton and generation of the target person image, and is shown in FIG. 7.
In summary, the training data are converted into UV space to generate UV maps, and a 3D human body model is constructed and optimized using complementary information between adjacent video frames; the optimized 3D human body model is then projected onto a 2D plane so that the 3D information of the original motion is retained and the optimized model can be driven with the target pose; the 2D projection and the pose of the training data serve as the input of the pre-trained model, and the trained model is saved; the pose of the target person is then normalized; finally, the 2D projection of the optimized 3D human body model driven by the target person's pose and the normalized target pose are used as the input of the trained motion image generation model for the final motion migration. This overcomes problems such as blurring and shape distortion in 2D plane image generation and ensures that the generated motion images have reliable depth information, accurate shape and a clear face. The method has a small computational burden and short running time, and can be applied mainly in three fields: (1) in the film and television industry, it can be used to make simulated real characters perform visually appealing and highly difficult actions; (2) in game design, it can be used for the action design of virtual characters; (3) in medical rehabilitation, it can be used to synthesize the normal movement postures of patients with dyskinesia.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, it is possible to make various improvements and modifications without departing from the technical principle of the present invention, and those improvements and modifications should be considered as the protection scope of the present invention.

Claims (7)

1. A 3D human body model driven motion migration method, comprising:
constructing a training data set by taking a video frame shot in advance as training data, and extracting the posture of the training data;
converting the training data into a UV space, generating a UV map, and constructing and optimizing a 3D human body model by using complementary information between adjacent video frames;
then projecting the optimized 3D human body model to a 2D plane to obtain a 2D projection retaining 3D information of the original motion, and driving the optimized 3D human body model in the posture of the target person;
using the 2D projection retaining the 3D information of the original motion and the pose of the training data as the input of a motion image generation model, and saving the trained motion image generation model;
normalizing the posture of the target person;
finally, the 2D projection of the optimized 3D human body model driven by the posture of the target person and the normalized posture of the target person are used as the input of the trained motion image generation model for final motion migration;
the motion image generation model is defined as a Face-Attention GAN model; the Face-Attention GAN model is based on the GAN model, uses Gaussian distribution to match an elliptical human Face region, configures a human Face enhancement loss function, and introduces an Attention mechanism, wherein:
the method for matching the elliptical face area by using Gaussian distribution is realized by designing a mean value and a covariance matrix, and comprises the following steps: the position of the image face region is determined by the pose estimation algorithm openposition,
Figure 294515DEST_PATH_IMAGE001
is the location of the nose, eyes and ears; the center of the ellipse is set as a nose
Figure 261334DEST_PATH_IMAGE002
The position of (a); two axes of the ellipse are eigenvectors of the covariance matrix, and the length of the axes is an eigenvalue of the covariance matrix; let a and b be the two axes of the ellipse, a and b both being unit vectors, and satisfy the following formula:
Figure DEST_PATH_IMAGE003
wherein the content of the first and second substances,
Figure 236243DEST_PATH_IMAGE004
is two elements of b, the relationship between the eigenvectors a and b and the covariance matrix Σ is as follows:
Figure 921302DEST_PATH_IMAGE005
wherein the content of the first and second substances,
Figure DEST_PATH_IMAGE006
Figure 256338DEST_PATH_IMAGE007
is the characteristic value corresponding to the a, and is,
Figure 343242DEST_PATH_IMAGE008
Figure 20211DEST_PATH_IMAGE010
is the characteristic value corresponding to the b-number,
Figure 926987DEST_PATH_IMAGE011
is the axial length of the ellipse, σ is the scaling factor, a and b are orthogonal,
Figure 82025DEST_PATH_IMAGE012
are necessarily reversible; in a manner that
Figure 272704DEST_PATH_IMAGE002
In a Gaussian distribution where Σ is covariance as a mean, face-enhanced Gaussian weights are obtained by uniformly sampling at a distance interval of 1 within a rectangular region constructed by four points of (1, 1), (1, 512), (512, 1), (512 ), and obtaining a face-enhanced Gaussian weight
Figure 120574DEST_PATH_IMAGE013
And with the generated Gaussian weight
Figure 780226DEST_PATH_IMAGE013
To define a face enhancement loss function;
the face enhancement loss function is as follows:
Figure 473375DEST_PATH_IMAGE014
wherein the content of the first and second substances,
Figure DEST_PATH_IMAGE015
the gesture is represented by a gesture that is,
Figure 738135DEST_PATH_IMAGE016
representing a 2D projection of a 3D phantom,ya real image is represented by a real image,
Figure 6174DEST_PATH_IMAGE017
represent
Figure 887542DEST_PATH_IMAGE015
And
Figure 649962DEST_PATH_IMAGE016
is input to the image generated by the generator G,
Figure 300386DEST_PATH_IMAGE013
representing a Gaussian weight generated by matching the face of the ellipse with a Gaussian distribution;
the attention mechanism introduced includes channel attention and spatial attention; the final objective function is:
Figure 224480DEST_PATH_IMAGE018
wherein G denotes a generator, D denotes a discriminator,
Figure 858723DEST_PATH_IMAGE019
a loss function representing the GAN model,
Figure DEST_PATH_IMAGE020
the fact that the discriminator can accurately judge the authenticity of the sample through minG and maxD and the sample generated by the generator can be distinguished through the discriminator is a mutual game process;
Figure 877364DEST_PATH_IMAGE021
representing a face enhancement loss function for enhancing a face region of an image;
Figure 913453DEST_PATH_IMAGE022
representing feature matching loss for ensuring global consistency of image content;
Figure 742869DEST_PATH_IMAGE023
representing perceptual reconstruction loss for ensuring global consistency of image content; parameter(s)
Figure 864408DEST_PATH_IMAGE024
For adjustment to balance these losses.
2. The 3D human body model driven motion migration method according to claim 1, wherein the pose estimation algorithm OpenPose is used to extract the pose of the training data.
3. The 3D human body model driven motion migration method according to claim 1, wherein converting pixels of the images in the training data into UV space using DensePose, generating corresponding UV maps, and constructing and optimizing the 3D human body model with complementary information between adjacent video frames comprises:

taking from the training data a set of images of different poses spaced several frames apart, {I_i} (i = 1, ..., n), together with the corresponding UV maps generated by DensePose; generating a set of local texture maps {T_i} (i = 1, ..., m) by UV conversion; inputting the generated local texture maps {T_i} into a texture filling network to generate a texture map T with multi-pose texture information; and performing, through a loss function, a loss calculation between the set of "original images" {Î_i} restored from the texture map T and the set of real images {I_i}, thereby optimizing the 3D human body model.
4. The 3D human body model driven motion migration method according to claim 3, wherein the loss function is expressed as:

L_{tex} = \frac{1}{n} \sum_{i=1}^{n} \left\| \hat{I}_i - I_i \right\|_1

where each \hat{I}_i is restored from the texture map T and n denotes the number of restored "original images"; the texture map T is obtained from:

T = \sum_{i=1}^{m} P_i \odot T_i

where m denotes the total number of local texture maps {T_i} and P denotes the probability map generated by the texture filling network, which predicts the probability that a pixel of T comes from the pixel at the corresponding position of T_i; P is obtained from:

P_i^{(j,k)} = \frac{\exp\left(\alpha\, O_i^{(j,k)}\right)}{\sum_{c=1}^{C} \exp\left(\alpha\, O_c^{(j,k)}\right)}

where P_i^{(j,k)} denotes the element in row j and column k of P_i, O_i^{(j,k)} denotes the element value in row j and column k of O_i, O denotes the output of the decoder, C denotes the number of channels output by the decoder, and α denotes the amplification factor of the amplification module; the number n of restored "original images", the total number m of local texture maps, and the number C of channels output by the decoder are all equal.
5. The 3D human body model driven motion migration method according to claim 1, wherein projecting the optimized 3D human body model onto a 2D plane to obtain a 2D projection retaining the 3D information of the original motion, and driving the optimized 3D human body model with the pose of the target person, comprises: predicting the pose of the 3D human body model through HMR, and transferring the predicted pose to the 3D human body model, thereby driving the 3D human body model.
6. The 3D human body model driven motion migration method according to claim 1, wherein, together with the introduced attention mechanism, a feature matching loss based on the discriminator D is employed as follows:

\mathcal{L}_{FM}(G, D) = \sum_{i=1}^{T} \frac{1}{N_i} \left\| D^{(i)}(p, y) - D^{(i)}(p, G(p, x)) \right\|_1

where D^{(i)} is the i-th layer feature extractor of the discriminator D, N_i denotes the number of elements of the i-th layer, and T is the total number of layers of the discriminator D;

the generated image and the real image are then input into a pre-trained VGG network and their features at different layers are compared; the perceptual reconstruction loss is:

\mathcal{L}_{VGG}(G) = \sum_{i=1}^{N} \frac{1}{M_i} \left\| \phi^{(i)}(y) - \phi^{(i)}(G(p, x)) \right\|_1

where \phi^{(i)} denotes the i-th layer feature extractor of the VGG network, M_i denotes the number of elements of the i-th layer, and N is the total number of layers of the VGG network.
7. The 3D human body model driven motion migration method according to claim 1, wherein normalizing the pose of the target person specifically comprises: approximating the real length of each bone segment by the maximum bone segment length in the training set, and approximating the real bone segment length of the new pose in the same way; then adjusting the bone segment length displayed in the image according to the proportion between the standard skeleton and the new skeleton; letting J_i denote the i-th joint coordinate of the new pose and J_{p(i)} denote the coordinate of its parent joint, J_i is adjusted by:

J_i \leftarrow J_{p(i)} + \frac{l_i^{train}}{l_i^{tgt}} \left( J_i - J_{p(i)} \right)

where l_i^{tgt} and l_i^{train} respectively denote the maximum bone segment length between the i-th joint and its parent joint in the target person images and in the training images.
CN202210708260.9A 2022-06-22 2022-06-22 Motion migration method driven by 3D human body model Active CN114783039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210708260.9A CN114783039B (en) 2022-06-22 2022-06-22 Motion migration method driven by 3D human body model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210708260.9A CN114783039B (en) 2022-06-22 2022-06-22 Motion migration method driven by 3D human body model

Publications (2)

Publication Number Publication Date
CN114783039A (en) 2022-07-22
CN114783039B (en) 2022-09-16

Family

ID=82422416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210708260.9A Active CN114783039B (en) 2022-06-22 2022-06-22 Motion migration method driven by 3D human body model

Country Status (1)

Country Link
CN (1) CN114783039B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071831B (en) * 2023-03-20 2023-06-20 南京信息工程大学 Human body image generation method based on UV space transformation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111640172A (en) * 2020-05-08 2020-09-08 大连理工大学 Attitude migration method based on generation of countermeasure network
CN111724414A (en) * 2020-06-23 2020-09-29 宁夏大学 Basketball movement analysis method based on 3D attitude estimation
CN111797753A (en) * 2020-06-29 2020-10-20 北京灵汐科技有限公司 Training method, device, equipment and medium of image driving model, and image generation method, device and medium
CN112215116A (en) * 2020-09-30 2021-01-12 江苏大学 Mobile 2D image-oriented 3D river crab real-time detection method
CN112651316A (en) * 2020-12-18 2021-04-13 上海交通大学 Two-dimensional and three-dimensional multi-person attitude estimation system and method
CN114612614A (en) * 2022-03-09 2022-06-10 北京大甜绵白糖科技有限公司 Human body model reconstruction method and device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161200A (en) * 2019-12-22 2020-05-15 天津大学 Human body posture migration method based on attention mechanism
CN114049652A (en) * 2021-11-05 2022-02-15 成都艾特能电气科技有限责任公司 Human body posture migration method and system based on action driving

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111640172A (en) * 2020-05-08 2020-09-08 大连理工大学 Attitude migration method based on generation of countermeasure network
CN111724414A (en) * 2020-06-23 2020-09-29 宁夏大学 Basketball movement analysis method based on 3D attitude estimation
CN111797753A (en) * 2020-06-29 2020-10-20 北京灵汐科技有限公司 Training method, device, equipment and medium of image driving model, and image generation method, device and medium
CN112215116A (en) * 2020-09-30 2021-01-12 江苏大学 Mobile 2D image-oriented 3D river crab real-time detection method
CN112651316A (en) * 2020-12-18 2021-04-13 上海交通大学 Two-dimensional and three-dimensional multi-person attitude estimation system and method
CN114612614A (en) * 2022-03-09 2022-06-10 北京大甜绵白糖科技有限公司 Human body model reconstruction method and device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Real-time facial expression migration method combining 3DMM and GAN; Gao Xiang et al.; Computer Applications and Software; 2020-04-12 (No. 04); full text *
VIBE: Video Inference for Human Body Pose and Shape Estimation;Muhammed Kocabas 等;《arXiv:1912.05656 [cs.CV]》;20200615;全文 *

Also Published As

Publication number Publication date
CN114783039A (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN112887698B (en) High-quality face voice driving method based on nerve radiation field
CN109711413A (en) Image, semantic dividing method based on deep learning
CN110827193B (en) Panoramic video significance detection method based on multichannel characteristics
CN108776983A (en) Based on the facial reconstruction method and device, equipment, medium, product for rebuilding network
CN108830913B (en) Semantic level line draft coloring method based on user color guidance
CN111161200A (en) Human body posture migration method based on attention mechanism
CN110175986A (en) A kind of stereo-picture vision significance detection method based on convolutional neural networks
WO2020177214A1 (en) Double-stream video generation method based on different feature spaces of text
CN110796593A (en) Image processing method, device, medium and electronic equipment based on artificial intelligence
CN108363973A (en) A kind of unconfined 3D expressions moving method
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
CN110363770A (en) A kind of training method and device of the infrared semantic segmentation model of margin guide formula
CN114783039B (en) Motion migration method driven by 3D human body model
CN113888399B (en) Face age synthesis method based on style fusion and domain selection structure
CN114399829A (en) Posture migration method based on generative countermeasure network, electronic device and medium
CN116704084B (en) Training method of facial animation generation network, facial animation generation method and device
CN115914505B (en) Video generation method and system based on voice-driven digital human model
CN116863069A (en) Three-dimensional light field face content generation method, electronic equipment and storage medium
CN113076918B (en) Video-based facial expression cloning method
Kang et al. Image-to-image translation method for game-character face generation
Cao et al. Guided cascaded super-resolution network for face image
CN113947520A (en) Method for realizing face makeup conversion based on generation of confrontation network
CN117496072B (en) Three-dimensional digital person generation and interaction method and system
CN117152825B (en) Face reconstruction method and system based on single picture
CN117036893B (en) Image fusion method based on local cross-stage and rapid downsampling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant