CN113887429A - Digital human video generation method and apparatus, and electronic device - Google Patents

Digital human video generation method and apparatus, and electronic device

Info

Publication number
CN113887429A
Authority
CN
China
Prior art keywords
face
parameters
reconstruction
video
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111165371.1A
Other languages
Chinese (zh)
Inventor
王鑫宇
刘炫鹏
杨国基
刘致远
常向月
刘云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202111165371.1A
Publication of CN113887429A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a digital human video generation method and apparatus, and an electronic device. The digital human video generation method comprises the following steps: acquiring a first video containing a first object and a target image containing a second object; extracting expression parameters of the first object from image frames in the first video; extracting a first 3D face reconstruction parameter of the second object and a background contour line of the second object from the target image, wherein the first 3D face reconstruction parameter comprises face pose information; generating a 3D face mesh containing the face pose information by using the expression parameters, the first 3D face reconstruction parameters and a preset first 3D face generation model; and generating a digital human video based on the 3D face mesh, the background contour line and a preset digital human generation model. The embodiment of the invention enables the generated 3D face mesh to take the face pose information into account, so that the expression of the digital human is more vivid and natural.

Description

Digital human video generation method and apparatus, and electronic device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a digital human video generation method and apparatus, and an electronic device.
Background
A virtual digital human is a "human" that exists in the digital world: a three-dimensional "human" that is faithfully restored and displayed in the digital world by techniques such as motion capture, three-dimensional modeling, and speech synthesis.
For digital human video generation technology to support real-time interaction, at least two requirements must be met: the generation effect must be good and the inference speed must be high. A good generation effect is a necessary prerequisite, and high speed is a commercial requirement.
At present, schemes have appeared that generate digital human pictures by audio-driven inference, in which an audio inference model infers key points from sound. However, the pose of a digital human generated from such audio-inferred key points is generally unnatural, and the effect is poor.
Disclosure of Invention
In order to solve the technical problems or at least partially solve the technical problems, the application provides a digital human video generation method, a digital human video generation device and an electronic device.
In a first aspect, the present application provides a method for generating a digital human video, including:
acquiring a first video containing a first object and a target image containing a second object;
extracting expression parameters of the first object in image frames in the first video;
extracting a first 3D face reconstruction parameter of the second object and a background contour line of the second object from the target image, wherein the first 3D face reconstruction parameter comprises face posture information;
generating a 3D face grid containing the face posture information by using the expression parameters, the first 3D face reconstruction parameters and a preset first 3D face generation model;
and generating a digital human video based on the 3D human face mesh, the background contour line and a preset digital human generation model.
Optionally, the generating a 3D face mesh including the face pose information by using the expression parameter, the first 3D face reconstruction parameter, and a preset first 3D face generation model includes:
extracting local expression parameters associated with a preset target reconstruction region from the expression parameters, wherein the target reconstruction region is a region to be reconstructed in the face of the second object;
searching local face parameters associated with the target reconstruction region in the first 3D face reconstruction parameters;
replacing the local face parameters with the local expression parameters to obtain second 3D face reconstruction parameters;
and inputting the second 3D face reconstruction parameters into the preset first 3D face generation model, so that the preset first 3D face generation model outputs a 3D face grid containing the face posture information.
Optionally, the extracting, from the expression parameters, a local expression parameter associated with a preset target reconstruction region includes:
if the target reconstruction area is a mouth area, extracting local expression parameters corresponding to the mouth area from a corresponding relation between a preset reconstruction area and the local expression parameters;
or, if the target reconstruction region is a complete face region, extracting a local expression parameter corresponding to the complete face region from a preset corresponding relationship between the reconstruction region and the local expression parameter.
Optionally, the replacing the local face parameter with the local expression parameter to obtain a second 3D face reconstruction parameter includes:
if a local face parameter and a local expression parameter are extracted, directly replacing the local face parameter by using the local expression parameter;
or, if a plurality of local face parameters and a plurality of local expression parameters are extracted, for each local face parameter and local expression parameter corresponding to the same part identifier, replacing that local face parameter with the corresponding local expression parameter.
Optionally, the extracting expression parameters of the first object in the image frames in the first video includes:
inputting key points of the face image in the image frame of the first video into a preset second 3D face generation model to obtain expression parameters of the first object;
or, the face mesh of the face image in the image frame of the first video is input into a preset first 3D face reconstruction model, so as to obtain the expression parameters of the first object.
Optionally, the extracting, in the target image, the first 3D face reconstruction parameter of the second object includes:
inputting the target image into a preset second 3D face reconstruction model to obtain a second face mesh;
acquiring mesh information of the second face mesh or acquiring key point information of each face key point in the second face mesh;
and determining the first 3D face reconstruction parameter based on the mesh information or the key point information.
Optionally, the generating a digital human video based on the 3D face mesh, the background contour line, and a preset digital human generation model includes:
inputting the 3D face mesh and the background contour line into the preset digital human generation model to obtain a digital human picture;
and sequencing the digital human pictures obtained based on the image frames according to the arrangement sequence of the image frames in the first video to obtain the digital human video.
Optionally, before the obtaining the first video containing the first object and the target image containing the second object, the method further includes:
acquiring a plurality of groups of first training videos containing a first training object, training images containing a second training object and digital human label pictures;
extracting expression parameters of the first training object in image frames in the first training video;
extracting a training 3D face reconstruction parameter of the second training object and a background contour line of the second object from the training image, wherein the training 3D face reconstruction parameter comprises face posture information;
generating a 3D face grid containing the face posture information by using the expression parameters, the training 3D face reconstruction parameters and a preset first 3D face generation model;
training an initial generative adversarial network model by using the 3D face mesh and the background contour line until the similarity between the digital human picture output by the model and the digital human label picture exceeds a preset threshold value, to obtain the digital human generation model.
In a second aspect, the present application provides a digital human video generating apparatus, comprising:
a first acquisition module, used for acquiring a first video containing a first object and a target image containing a second object, wherein the target image contains the face posture information of the second object;
the first extraction module is used for extracting expression parameters of the first object from image frames in the first video;
the second extraction module is used for extracting a first 3D face reconstruction parameter of the second object and a background contour line of the second object from the target image, wherein the first 3D face reconstruction parameter comprises face posture information;
the first generation module is used for generating a 3D face grid containing the face posture information by using the expression parameters, the first 3D face reconstruction parameters and a preset first 3D face generation model;
and the second generation module is used for generating a digital human video based on the 3D face mesh, the background contour line and a preset digital human generation model.
Optionally, the first generating module includes:
an extracting unit, configured to extract a local expression parameter associated with a preset target reconstruction region from the expression parameters, where the target reconstruction region is a region to be reconstructed in the face of the second object;
a searching unit, configured to search, in the first 3D face reconstruction parameter, a local face parameter associated with the target reconstruction region;
the replacing unit is used for replacing the local face parameters with the local expression parameters to obtain second 3D face reconstruction parameters;
and the first input unit is used for inputting the second 3D face reconstruction parameters into the preset first 3D face generation model so as to enable the preset first 3D face generation model to output a 3D face grid containing the face posture information.
Optionally, the extracting unit is further configured to:
if the target reconstruction area is a mouth area, extracting local expression parameters corresponding to the mouth area from a corresponding relation between a preset reconstruction area and the local expression parameters;
or, if the target reconstruction region is a complete face region, extracting a local expression parameter corresponding to the complete face region from a preset corresponding relationship between the reconstruction region and the local expression parameter.
Optionally, the replacing unit is further configured to:
if a local face parameter and a local expression parameter are extracted, directly replacing the local face parameter by using the local expression parameter;
or, if a plurality of local face parameters and a plurality of local expression parameters are extracted, for each local face parameter and local expression parameter corresponding to the same part identifier, replacing that local face parameter with the corresponding local expression parameter.
Optionally, the first extracting module is further configured to:
inputting key points of the face image in the image frame of the first video into a preset second 3D face generation model to obtain expression parameters of the first object;
or, the face mesh of the face image in the image frame of the first video is input into a preset first 3D face reconstruction model, so as to obtain the expression parameters of the first object.
Optionally, the second extraction module includes:
the second input unit is used for inputting the target image into a preset second 3D human face reconstruction model to obtain a second human face mesh;
an obtaining unit, configured to obtain mesh information of the second face mesh or obtain key point information of each face key point in the second face mesh;
and the determining unit is used for determining the first 3D face reconstruction parameter based on the mesh information or the key point information.
Optionally, the second generating module includes:
a third input unit, configured to input the 3D face mesh and the background contour line into the preset digital human generation model, so as to obtain a digital human picture;
and the sequencing unit is used for sequencing the digital human pictures obtained based on the image frames according to the arrangement sequence of the image frames in the first video to obtain the digital human video.
Optionally, before the first obtaining module, the apparatus further includes:
the second acquisition module is used for acquiring a plurality of groups of first training videos containing a first training object, training images containing a second training object and digital human label images;
a third extraction module, configured to extract expression parameters of the first training object from image frames in the first training video;
a fourth extraction module, configured to extract, from the training image, a training 3D face reconstruction parameter of the second training object and a background contour line of the second object, where the training 3D face reconstruction parameter includes face pose information;
the third generation module is used for generating a 3D face grid containing the face posture information by using the expression parameters, the training 3D face reconstruction parameters and a preset first 3D face generation model;
and the fourth generation module is used for training an initial generative adversarial network model by using the 3D face mesh and the background contour line until the similarity between the digital human picture output by the model and the digital human label picture exceeds a preset threshold value, so as to obtain the digital human generation model.
In a third aspect, the present application provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
and a processor configured to implement the method for generating a digital human video according to any one of the first aspect when executing the program stored in the memory.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
According to the embodiment of the invention, the expression parameters of the first object in the first video are combined with the first 3D face reconstruction parameters of the target image to generate a 3D face mesh. Because the target image contains the face pose information of the second object, that pose information is reflected in the generated 3D face mesh; the digital human video is then generated based on the 3D face mesh and the background contour line. The generated 3D face mesh thus takes the face pose information into account, so that the expression of the digital human is more vivid and natural.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a flowchart of a method for generating a digital human video according to an embodiment of the present disclosure;
fig. 2 is a block diagram of a digital human generation apparatus according to an embodiment of the present application;
fig. 3 is a structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, schemes have appeared that generate digital human pictures by audio-driven inference, in which an audio inference model infers key points from sound. However, the pose of a digital human generated from such audio-inferred key points is generally unnatural, and the effect is poor. Therefore, the embodiment of the application provides a digital human video generation method, which can be applied to a computer.
As shown in fig. 1, the digital human video generation method may include the steps of:
step S101, acquiring a first video containing a first object and a target image containing a second object;
in this embodiment of the present invention, the first video may be a video of the first object recorded in real time, or a non-real-time (i.e. pre-recorded) video; the first object may be a real person; the target image may be an image containing the second object; and the second object may be a pre-created virtual character, for example a cartoon figure such as Big Head Son, a black cat, or the like.
Step S102, extracting expression parameters of the first object from image frames in the first video;
in the embodiment of the present invention, the first video includes a plurality of image frames; for example, the image frames may be extracted from the first video at a frame rate of 25 frames per second. The expression parameters may include mouth opening and closing parameters, which may be represented as blendshape (blend shape) coefficients.
In this step, key points of the face image in the image frame of the first video are input into a preset second 3D face generation model to obtain the expression parameters of the first object; or, the face mesh of the face image in the image frame of the first video is input into a preset first 3D face reconstruction model to obtain the expression parameters of the first object.
A plurality of image frames can be extracted from the first video; a face is detected in each extracted image frame to obtain a face image; a plurality of face key points are detected in the face image; and the face key points are input into the 3D face generation model or the 3D face reconstruction model, where a 3DMM-style model converts the face key points into expression parameters. For example, the picture is passed through the 3D face reconstruction model (Deep3DFaceReconstruction) to extract the expression parameters; in another example, the picture is passed through the 3D face generation model (DECA) to extract the expression parameters.
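By way of illustration, the following Python sketch shows one possible realization of this step. The face_detector helpers and the expr_model.infer_expression call are assumptions standing in for a concrete landmark detector and a Deep3DFaceReconstruction/DECA-style model; only the OpenCV calls are real APIs.

```python
# Sketch of step S102 (expression-parameter extraction). The face_detector and
# expr_model objects are hypothetical wrappers; only the OpenCV calls are real.
import cv2

def extract_expression_params(video_path, face_detector, expr_model, fps=25):
    """Sample image frames at roughly `fps`, detect a face in each, and infer
    per-frame expression parameters (e.g. blendshape coefficients)."""
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or fps
    step = max(int(round(src_fps / fps)), 1)
    params, frame_id = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_id % step == 0:
            face = face_detector.crop_face(frame)            # hypothetical helper
            if face is not None:
                keypoints = face_detector.landmarks(face)    # hypothetical helper
                # Deep3DFaceReconstruction / DECA-style model: key points -> params
                params.append((frame_id, expr_model.infer_expression(keypoints)))
        frame_id += 1
    cap.release()
    return params  # list of (frame identifier, expression parameters)
```

Retaining the frame identifier alongside each parameter set also supports the ordering of digital human pictures described later in step 302.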
Step S103, extracting a first 3D face reconstruction parameter of the second object and a background contour line of the second object from the target image.
In this embodiment of the present invention, the first 3D face reconstruction parameter may refer to a 3DMM (3D Morphable Model, a statistical model of 3D face deformation) parameter; the 3DMM parameters include face pose information, reflectance parameters, face shape, illumination parameters, and transformation parameters. The background contour line of the second object may include the outline of clothes, the outline of hair, and the like;
since the target image is a picture containing the second object, the second object may present a different face pose (orientation) in the target image, such as different yaw, pitch, and roll angles. If the same parameters were used to control the expression under different face poses, the expression would not look true and natural; therefore, the embodiment of the invention extracts the face pose information from the target image.
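For concreteness, the snippet below shows one common way such yaw/pitch/roll angles can be turned into a rotation matrix; the Euler-angle convention used is an assumption, since the patent does not fix a particular parameterization of the face pose information.

```python
# Illustrative only: converting yaw/pitch/roll face-pose angles into a rotation
# matrix. The 'zyx' intrinsic Euler convention is an assumption.
from scipy.spatial.transform import Rotation

def pose_to_rotation_matrix(yaw_deg, pitch_deg, roll_deg):
    return Rotation.from_euler("zyx", [yaw_deg, pitch_deg, roll_deg],
                               degrees=True).as_matrix()  # 3x3 rotation matrix
```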
In this step, the target image may be input into a preset second 3D face reconstruction model (e.g., a 3DMM/Deep3DFaceReconstruction/DECA model, etc.), a second face mesh is obtained, mesh information of the second face mesh is obtained or key point information of each face key point in the second face mesh is obtained, and the first 3D face reconstruction parameter is determined based on the mesh information or the key point information.
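A minimal sketch of this step follows, assuming a hypothetical recon_model wrapping a 3DMM / Deep3DFaceReconstruction / DECA-style reconstructor and a hypothetical segmenter producing a foreground mask for the second object; the OpenCV contour extraction is one plausible way to obtain the background contour line.

```python
# Sketch of step S103. `recon_model` and `segmenter` are assumptions; only the
# OpenCV calls are real APIs.
import cv2

def extract_target_params(image_path, recon_model, segmenter):
    image = cv2.imread(image_path)
    mesh = recon_model.reconstruct(image)              # second face mesh (hypothetical)
    # Per the patent, the parameters can be determined either from the mesh
    # information or from the key point information of the mesh.
    recon_params = recon_model.params_from_mesh(mesh)  # hypothetical helper
    # Background contour line (e.g. hair/clothes outline) from a foreground mask.
    mask = segmenter.foreground_mask(image)            # hypothetical, uint8 mask
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea) if contours else None
    return recon_params, contour
```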
Step S104, generating a 3D face grid containing the face posture information by using the expression parameters, the first 3D face reconstruction parameters and a preset first 3D face generation model;
in the embodiment of the invention, the 3D face grid may also be referred to as a 3D face mesh.
In order to make the expression of the avatar as vivid and natural as that of the first object, the expression parameters may be combined with the corresponding parameters in the first 3D face reconstruction parameters, and the combined face reconstruction parameters are then input into the first 3D face generation model to generate a 3D face mesh, where the 3D face mesh includes the face pose information of the second object.
And step S105, generating a digital human video based on the 3D human face mesh, the background contour line and a preset digital human generation model.
In the embodiment of the invention, the digital human video can be generated by combining the 3D human face mesh and the background contour line and utilizing the preset digital human generation model.
According to the embodiment of the invention, the expression parameters of the first object in the first video are combined with the first 3D face reconstruction parameters of the target image to generate a 3D face mesh. Because the target image contains the face pose information of the second object, that pose information is reflected in the generated 3D face mesh; the digital human video is then generated based on the 3D face mesh and the background contour line. The generated 3D face mesh thus takes the face pose information into account, so that the expression of the digital human is more vivid and natural.
In another embodiment of the present invention, step S104, generating a 3D face mesh including the face pose information by using the expression parameter, the first 3D face reconstruction parameter, and a preset first 3D face generation model, includes:
step 201, extracting local expression parameters associated with a preset target reconstruction area from the expression parameters;
in the embodiment of the present invention, the target reconstruction region is a region to be reconstructed in the face of the second object. For example, the target reconstruction region may be the mouth region of the second object, or the complete face region of the second object; a user may configure the region to be reconstructed in advance to obtain the target reconstruction region;
in order to improve the extraction efficiency, an association relationship between the reconstruction region and the local expression parameters may be preset, for example, the association relationship between the reconstruction region and the local expression parameters may be as shown in table 1 below:
Table 1 (provided as an image in the original publication) maps each reconstruction region to its associated local expression parameters.
Therefore, after the target reconstruction region is determined, the local expression parameters corresponding to the target reconstruction region can be extracted from the expression parameters according to the association relationship between the reconstruction regions and the local expression parameters.
In this step, if the target reconstruction region is a mouth region, extracting a local expression parameter corresponding to the mouth region from a preset correspondence between the reconstruction region and the local expression parameter; or, if the target reconstruction region is a complete face region, extracting a local expression parameter corresponding to the complete face region from a preset corresponding relationship between the reconstruction region and the local expression parameter.
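The lookup described above amounts to a simple mapping from reconstruction regions to slots of the expression-parameter vector. The sketch below illustrates this with an invented mapping that plays the role of Table 1; the index sets are illustrative, not taken from the patent.

```python
# Illustrative stand-in for Table 1: a preset mapping from reconstruction
# regions to expression-parameter slots. The index sets are assumed.
REGION_TO_EXPR_SLOTS = {
    "mouth": [0, 1, 2, 3],          # e.g. jaw-open / lip blendshape slots (assumed)
    "full_face": list(range(64)),   # e.g. every expression coefficient (assumed)
}

def extract_local_expression(expr_params, target_region):
    """Step 201: pick out the local expression parameters for the region."""
    return {slot: expr_params[slot] for slot in REGION_TO_EXPR_SLOTS[target_region]}
```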
Step 202, searching local face parameters associated with the target reconstruction region in the first 3D face reconstruction parameters;
in the embodiment of the present invention, in order to improve the search efficiency, a corresponding relationship between the reconstruction region and the local face parameters may also be preset, and the corresponding relationship between the reconstruction region and the local face parameters may be as shown in table 2 below:
Table 2 (provided as an image in the original publication) maps each reconstruction region to its associated local face parameters.
Thus, after the target reconstruction region is determined, the local face parameters corresponding to the target reconstruction region can be searched in the first 3D face reconstruction parameters according to the corresponding relation between the reconstruction region and the local face parameters.
Step 203, replacing the local face parameters with the local expression parameters to obtain second 3D face reconstruction parameters;
because the first object in the first video is a real person, its expression and posture are natural, whereas the second object in the target image is a virtual person whose expression and posture are not. Therefore, local expression parameters can be extracted from the expression parameters of the first object, and local face parameters from the first 3D face reconstruction parameters. In this step, the local face parameters are replaced by the local expression parameters, while the other parameters in the first 3D face reconstruction parameters (such as the face pose information) are retained, so that the second 3D face reconstruction parameters contain local expression parameters that make the expression of the digital human more vivid and natural.
In this step, if one local face parameter and one local expression parameter are extracted, the local face parameter is directly replaced by the local expression parameter; or, if a plurality of local face parameters and a plurality of local expression parameters are extracted, then for each local face parameter and local expression parameter corresponding to the same part identifier, that local face parameter is replaced by the corresponding local expression parameter.
Step 204, inputting the second 3D face reconstruction parameter into the preset first 3D face generation model, so that the preset first 3D face generation model outputs a 3D face mesh including the face pose information.
In the embodiment of the invention, the first 3D face generation model is used for generating 3D face grids based on 3D face reconstruction parameters, and the first 3D face generation model is a model which is trained by using a plurality of groups of 3D face reconstruction parameters as training data and a plurality of groups of corresponding 3D face grids as label data in advance.
The second 3D face reconstruction parameters comprise the local expression parameters together with the parameters other than the replaced local face parameters (such as the face pose information), so the 3D face mesh output by the first 3D face generation model takes the face pose information into account.
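Steps 202 to 204 can be summarized by the following sketch, in which region_to_face_slots plays the role of Table 2 and parameters on both sides are assumed to be keyed by a shared part identifier; all names are illustrative, not the patent's API.

```python
# Sketch of steps 202-204. `region_to_face_slots` stands in for Table 2;
# the part-identifier keying is an assumption.
def build_second_recon_params(first_recon_params, local_expr, target_region,
                              region_to_face_slots):
    second = dict(first_recon_params)   # pose, shape, illumination, ... retained
    face_slots = region_to_face_slots[target_region]   # step 202: look up
    if len(face_slots) == 1 and len(local_expr) == 1:
        # One parameter on each side: replace directly (step 203, first case).
        second[face_slots[0]] = next(iter(local_expr.values()))
    else:
        # Several parameters: replace those sharing a part identifier.
        for part_id, expr_value in local_expr.items():
            if part_id in face_slots:
                second[part_id] = expr_value
    return second

# Step 204 (hypothetical call): mesh = first_3d_face_model(second_recon_params)
```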
In another embodiment of the present invention, step S105, generating a digital human video based on the 3D face mesh, the background contour line and a preset digital human generation model, includes:
step 301, inputting the 3D face mesh and the background contour line into the preset digital human generation model to obtain a digital human picture;
in the embodiment of the invention, the preset digital human generation model is used for generating a digital human picture based on the 3D face mesh and the background contour line. The model may be trained in advance using 3D face meshes and background contour lines as training data and the corresponding digital human pictures as label data; after training is finished, the model can be used to generate digital human pictures.
Step 302, according to the arrangement sequence of each image frame in the first video, sequencing the digital human pictures obtained based on each image frame to obtain a digital human video.
In the embodiment of the invention, an image frame identifier can be retained for each image frame when the image frames are extracted; when the digital human pictures are generated based on the image frames, the pictures can be arranged in the order of their image frame identifiers to obtain the digital human video.
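A minimal sketch of steps 301 and 302 follows, assuming each digital human picture is paired with its source frame identifier; cv2.VideoWriter is a real OpenCV API, while the input pairing is an assumption.

```python
# Sketch of steps 301-302: sort digital human pictures by frame identifier
# and write them out as a video.
import cv2

def pictures_to_video(pictures, out_path, fps=25):
    """`pictures` is a list of (frame identifier, BGR image) pairs."""
    pictures = sorted(pictures, key=lambda p: p[0])
    height, width = pictures[0][1].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    for _, picture in pictures:
        writer.write(picture)
    writer.release()
```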
In a further embodiment of the present invention, before the step S101 of acquiring the first video including the first object and the target image including the second object, the method further includes:
step 401, acquiring a plurality of groups of first training videos including a first training object, training images including a second training object, and digital people label images.
In this embodiment of the present invention, the first training video may be a video recording the first object or another first training object; the training image may be an image containing the second object or another second training object; the first training object may be a real person; and the second training object may be a virtual character, etc.;
step 402, extracting expression parameters of the first training object from image frames in the first training video;
in the embodiment of the present invention, the first training video includes a plurality of image frames, for example, the image frames may be extracted in the first training video at a frame rate of 25 frames per second, and the expression parameters may include mouth opening and closing parameters, and the like.
Step 403, extracting a training 3D face reconstruction parameter of the second training object and a background contour line of the second object from the training image, where the training 3D face reconstruction parameter includes face pose information;
step 404, generating a 3D face mesh containing the face pose information by using the expression parameters, the training 3D face reconstruction parameters and a preset first 3D face generation model;
the flow of steps 403, 404 is similar to that of steps S103 and S104, and the flow of steps S103 and S104 can be referred to.
Step 405, training an initial generative adversarial network model by using the 3D face mesh and the background contour line until the similarity between the digital human picture output by the model and the digital human label picture exceeds a preset threshold value, to obtain the digital human generation model.
When the initial generative adversarial network model is trained using the 3D face mesh and the background contour line, the model is optimized using the loss functions of the generator and the discriminator in the model. The generator and the discriminator are trained adversarially: the generator tries to generate a digital human picture identical to the digital human label picture, while the discriminator tries to determine whether an input digital human picture was generated by the generator or is a digital human label picture. With this training mode, the digital human pictures output by the generator approach the digital human label pictures more and more closely, until the similarity between the output picture and the label picture exceeds the preset threshold and the digital human generation model is obtained.
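The following PyTorch sketch compresses this training procedure into a single loop. The data loader contract (mesh, contour, label picture per batch) and the use of L1 similarity as the stopping metric are assumptions; the patent specifies only that training stops once the similarity exceeds a preset threshold.

```python
# Compressed sketch of step 405's adversarial training, under the stated
# assumptions about the loader and the similarity metric.
import torch
import torch.nn.functional as F

def train_digital_human_gan(generator, discriminator, loader, g_opt, d_opt,
                            sim_threshold=0.95, epochs=100):
    bce = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for mesh, contour, label_pic in loader:
            fake = generator(mesh, contour)
            # Discriminator step: separate label pictures from generated ones.
            real_logits = discriminator(label_pic)
            fake_logits = discriminator(fake.detach())
            d_loss = (bce(real_logits, torch.ones_like(real_logits)) +
                      bce(fake_logits, torch.zeros_like(fake_logits)))
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()
            # Generator step: try to make generated pictures pass as labels.
            gen_logits = discriminator(fake)
            g_loss = bce(gen_logits, torch.ones_like(gen_logits))
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()
            # Stop once the generated picture is similar enough to the label.
            if 1.0 - F.l1_loss(fake, label_pic).item() > sim_threshold:
                return generator
    return generator
```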
In still another embodiment of the present invention, there is also provided a digital human video generating apparatus, as shown in fig. 2, including:
an obtaining module 11, configured to obtain a first video including a first object and a target image including a second object, where the target image includes face pose information of the second object;
a first extraction module 12, configured to extract expression parameters of the first object in image frames in the first video;
a second extraction module 13, configured to extract, from the target image, a first 3D face reconstruction parameter of the second object and a background contour line of the second object, where the first 3D face reconstruction parameter includes face pose information;
a first generating module 14, configured to generate a 3D face mesh including the face pose information by using the expression parameter, the first 3D face reconstruction parameter, and a preset first 3D face generation model;
and the second generation module 15 is configured to generate a digital human video based on the 3D face mesh, the background contour line, and a preset digital human generation model.
Optionally, the first generating module includes:
an extracting unit, configured to extract a local expression parameter associated with a preset target reconstruction region from the expression parameters, where the target reconstruction region is a region to be reconstructed in the face of the second object;
a searching unit, configured to search, in the first 3D face reconstruction parameter, a local face parameter associated with the target reconstruction region;
the replacing unit is used for replacing the local face parameters with the local expression parameters to obtain second 3D face reconstruction parameters;
and the first input unit is used for inputting the second 3D face reconstruction parameters into the preset first 3D face generation model so as to enable the preset first 3D face generation model to output a 3D face grid containing the face posture information.
Optionally, the extracting unit is further configured to:
if the target reconstruction area is a mouth area, extracting local expression parameters corresponding to the mouth area from a corresponding relation between a preset reconstruction area and the local expression parameters;
or, if the target reconstruction region is a complete face region, extracting a local expression parameter corresponding to the complete face region from a preset corresponding relationship between the reconstruction region and the local expression parameter.
Optionally, the replacing unit is further configured to:
if a local face parameter and a local expression parameter are extracted, directly replacing the local face parameter by using the local expression parameter;
or, if a plurality of local face parameters and a plurality of local expression parameters are extracted, for each local face parameter and local expression parameter corresponding to the same part identifier, replacing that local face parameter with the corresponding local expression parameter.
Optionally, the first extracting module is further configured to:
inputting key points of the face image in the image frame of the first video into a preset second 3D face generation model to obtain expression parameters of the first object;
or, the face mesh of the face image in the image frame of the first video is input into a preset first 3D face reconstruction model, so as to obtain the expression parameters of the first object.
Optionally, the second extraction module includes:
the second input unit is used for inputting the target image into a preset second 3D human face reconstruction model to obtain a second human face mesh;
an obtaining unit, configured to obtain mesh information of the second face mesh or obtain key point information of each face key point in the second face mesh;
and the determining unit is used for determining the first 3D face reconstruction parameter based on the mesh information or the key point information.
Optionally, the second generating module includes:
a third input unit, configured to input the 3D face mesh and the background contour line into the preset digital human generation model, so as to obtain a digital human picture;
and the sequencing unit is used for sequencing the digital human pictures obtained based on the image frames according to the arrangement sequence of the image frames in the first video to obtain the digital human video.
Optionally, before the first obtaining module, the apparatus further includes:
the second acquisition module is used for acquiring a plurality of groups of first training videos containing a first training object, training images containing a second training object and digital human label images;
a third extraction module, configured to extract expression parameters of the first training object from image frames in the first training video;
a fourth extraction module, configured to extract, from the training image, a training 3D face reconstruction parameter of the second training object and a background contour line of the second object, where the training 3D face reconstruction parameter includes face pose information;
the third generation module is used for generating a 3D face grid containing the face posture information by using the expression parameters, the training 3D face reconstruction parameters and a preset first 3D face generation model;
and the fourth generation module is used for training an initial generative adversarial network model by using the 3D face mesh and the background contour line until the similarity between the digital human picture output by the model and the digital human label picture exceeds a preset threshold value, so as to obtain the digital human generation model.
In another embodiment of the present invention, an electronic device is further provided, which includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method of generating digital human video according to any of the above method embodiments when executing the program stored in the memory.
In the electronic device provided by the embodiment of the present invention, by executing the program stored in the memory, the processor acquires a first video containing a first object and a target image containing a second object; extracts expression parameters of the first object from image frames in the first video; extracts a first 3D face reconstruction parameter of the second object and a background contour line of the second object from the target image, wherein the first 3D face reconstruction parameter comprises face pose information; generates a 3D face mesh containing the face pose information by using the expression parameters, the first 3D face reconstruction parameters and a preset first 3D face generation model; and generates a digital human video based on the 3D face mesh, the background contour line and a preset digital human generation model. According to the embodiment of the invention, the expression parameters of the first object in the first video are combined with the first 3D face reconstruction parameters of the target image to generate a 3D face mesh. Because the target image contains the face pose information of the second object, that pose information is reflected in the generated 3D face mesh; the digital human video is then generated based on the 3D face mesh and the background contour line. The generated 3D face mesh thus takes the face pose information into account, so that the expression of the digital human is more vivid and natural.
The communication bus 1140 mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 1140 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 3, but this does not mean only one bus or one type of bus.
The communication interface 1120 is used for communication between the electronic device and other devices.
The memory 1130 may include a Random Access Memory (RAM), and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor 1110 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for generating a digital human video, comprising:
acquiring a first video containing a first object and a target image containing a second object;
extracting expression parameters of the first object in image frames in the first video;
extracting a first 3D face reconstruction parameter of the second object and a background contour line of the second object from the target image, wherein the first 3D face reconstruction parameter comprises face posture information;
generating a 3D face grid containing the face posture information by using the expression parameters, the first 3D face reconstruction parameters and a preset first 3D face generation model;
and generating a digital human video based on the 3D human face mesh, the background contour line and a preset digital human generation model.
2. The method of claim 1, wherein the generating a 3D face mesh including the face pose information using the expression parameters, the first 3D face reconstruction parameters, and a preset first 3D face generation model comprises:
extracting local expression parameters associated with a preset target reconstruction region from the expression parameters, wherein the target reconstruction region is a region to be reconstructed in the face of the second object;
searching local face parameters associated with the target reconstruction region in the first 3D face reconstruction parameters;
replacing the local face parameters with the local expression parameters to obtain second 3D face reconstruction parameters;
and inputting the second 3D face reconstruction parameters into the preset first 3D face generation model, so that the preset first 3D face generation model outputs a 3D face grid containing the face posture information.
3. The method for generating digital human video according to claim 2, wherein the extracting of the local expression parameters associated with the preset target reconstruction region from the expression parameters comprises:
if the target reconstruction area is a mouth area, extracting local expression parameters corresponding to the mouth area from a corresponding relation between a preset reconstruction area and the local expression parameters;
or, if the target reconstruction region is a complete face region, extracting a local expression parameter corresponding to the complete face region from a preset corresponding relationship between the reconstruction region and the local expression parameter.
4. The method of claim 2, wherein the replacing the local face parameters with the local expression parameters to obtain second 3D face reconstruction parameters comprises:
if a local face parameter and a local expression parameter are extracted, directly replacing the local face parameter by using the local expression parameter;
or, if a plurality of local face parameters and a plurality of local expression parameters are extracted, for each local face parameter and local expression parameter corresponding to the same part identifier, replacing that local face parameter with the corresponding local expression parameter.
5. The method of claim 1, wherein the extracting expression parameters of the first object in image frames in the first video comprises:
inputting key points of the face image in the image frame of the first video into a preset second 3D face generation model to obtain expression parameters of the first object;
or, inputting the face mesh of the face image in the image frame of the first video into a preset first 3D face reconstruction model to obtain the expression parameters of the first object.
6. The method of claim 1, wherein the extracting the first 3D face reconstruction parameters of the second object in the target image comprises:
inputting the target image into a preset second 3D face reconstruction model to obtain a second face mesh;
acquiring mesh information of the second face mesh or acquiring key point information of each face key point in the second face mesh;
and determining the first 3D face reconstruction parameter based on the mesh information or the key point information.
7. The method for generating digital human video according to claim 1, wherein the generating digital human video based on the 3D face mesh, the background contour line and a preset digital human generation model comprises:
inputting the 3D face mesh and the background contour line into the preset digital human generation model to obtain a digital human picture;
and sequencing the digital human pictures obtained based on the image frames according to the arrangement sequence of the image frames in the first video to obtain the digital human video.
8. The method of claim 1, wherein prior to obtaining the first video containing the first object and the target image containing the second object, the method further comprises:
acquiring a plurality of groups of first training videos containing a first training object, training images containing a second training object and digital human label pictures;
extracting expression parameters of the first training object in image frames in the first training video;
extracting a training 3D face reconstruction parameter of the second training object and a background contour line of the second object from the training image, wherein the training 3D face reconstruction parameter comprises face posture information;
generating a 3D face grid containing the face posture information by using the expression parameters, the training 3D face reconstruction parameters and a preset first 3D face generation model;
training an initial generative adversarial network model by using the 3D face mesh and the background contour line until the similarity between the digital human picture output by the model and the digital human label picture exceeds a preset threshold value, to obtain the digital human generation model.
9. A digital human video generating apparatus, comprising:
an acquisition module, used for acquiring a first video containing a first object and a target image containing a second object, wherein the target image contains the face posture information of the second object;
the first extraction module is used for extracting expression parameters of the first object from image frames in the first video;
the second extraction module is used for extracting a first 3D face reconstruction parameter of the second object and a background contour line of the second object from the target image, wherein the first 3D face reconstruction parameter comprises face posture information;
the first generation module is used for generating a 3D face grid containing the face posture information by using the expression parameters, the first 3D face reconstruction parameters and a preset first 3D face generation model;
and the second generation module is used for generating a digital human video based on the 3D face mesh, the background contour line and a preset digital human generation model.
10. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the digital human video generation method of any one of claims 1 to 7 when executing the program stored in the memory.
CN202111165371.1A 2021-09-30 2021-09-30 Digital human video generation method and apparatus, and electronic device Pending CN113887429A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111165371.1A CN (en) 2021-09-30 Digital human video generation method and apparatus, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111165371.1A CN (en) 2021-09-30 Digital human video generation method and apparatus, and electronic device

Publications (1)

Publication Number Publication Date
CN113887429A true CN113887429A (en) 2022-01-04

Family

ID=79005094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111165371.1A Pending CN113887429A (en) 2021-09-30 2021-09-30 Digital human video generation method and apparatus, and electronic device

Country Status (1)

Country Link
CN (1) CN113887429A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023201996A1 (en) * 2022-04-19 2023-10-26 奥丁信息科技有限公司 Digital person expression generation method and apparatus, digital person expression model generation method, and plug-in system for vr device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination