CN116863043A - Face dynamic capture driving method and device, electronic equipment and readable storage medium

Info

Publication number
CN116863043A
Authority
CN
China
Prior art keywords
input
facial
key
face
dynamic capture
Prior art date
Legal status
Pending
Application number
CN202310602158.5A
Other languages
Chinese (zh)
Inventor
王远强
杨青
Current Assignee
Du Xiaoman Technology Beijing Co Ltd
Original Assignee
Du Xiaoman Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Du Xiaoman Technology Beijing Co Ltd filed Critical Du Xiaoman Technology Beijing Co Ltd
Priority to CN202310602158.5A priority Critical patent/CN116863043A/en
Publication of CN116863043A publication Critical patent/CN116863043A/en
Pending legal-status Critical Current

Classifications

    • G06T 13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06N 3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N 3/08: Neural networks; learning methods
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 40/168: Human faces; feature extraction; face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a face dynamic capture driving method, a device, an electronic device and a readable storage medium, comprising the following steps: receiving a first input, the first input comprising facial morphology key information and facial binding key bone information; generating a facial dynamic capture driving model in response to the first input; receiving a second input, the second input being image information for a number of frames including facial features; and in response to the second input, importing the image information containing the facial features into the facial dynamic capture driving model, and generating and displaying a facial dynamic capture driving result.

Description

Face dynamic capture driving method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of facial motion capture technologies, and in particular, to a method and apparatus for driving facial motion capture, an electronic device, and a readable storage medium.
Background
In the technical field of digital human face driving, one prior-art approach trains a ResNet50 network to predict the face shape parameters, expression parameters and the like of a 3DMM model and uses them to drive the digital human face. However, unless a large amount of resources is spent on building a dedicated 3DMM model, the accuracy of this approach is only moderate, fine-grained expressions are poorly depicted, and the predicted expression parameters cannot be transferred directly to a newly built digital human model or to other software, so the method is difficult to apply widely. Another approach maps divided facial regions to controllers in the UE, predicts the controller parameters with a trained network, and feeds the parameters into the UE to produce local mesh changes that realize the driving; however, dividing the facial regions is cumbersome, the controllers must act precisely, and the dependence on the UE is strong, which greatly limits the extensibility and universality of the method.
Disclosure of Invention
In view of the above, the embodiment of the application provides a method for driving facial dynamic capture, which aims to solve the problems of poor accuracy and poor universality of the driving of facial dynamic capture.
According to a first aspect of the present application, there is provided a face dynamic capture driving method, comprising:
receiving a first input, the first input comprising facial morphology key information and facial binding key bone information;
generating a facial dynamic capture driving model in response to the first input;
receiving a second input, the second input being image information for a number of frames including facial features;
and in response to the second input, importing the image information containing the facial features into the facial dynamic capture driving model, and generating and displaying a facial dynamic capture driving result.
Optionally, the importing the image information including the facial features into the facial dynamic capture driving model in response to the second input, generating and displaying a facial dynamic capture driving result includes:
receiving the facial dynamic capture model to analyze morphological key values and key skeleton control parameters of the image information containing facial features;
and generating and displaying a facial dynamic capture driving result in response to the morphological key value and the key skeleton control parameter.
Optionally, the generating and displaying a facial motion capture driving result in response to the morphological key value and the key bone control parameter includes:
constructing a digital human model according to the first input;
and controlling to drive the display of the digital human model in response to the morphological key value and the key skeleton control parameter.
Optionally, before receiving the second input, the method further includes:
receiving a plurality of image frame data;
and inputting a plurality of image frame data into the face dynamic capture driving model, and training the face dynamic capture driving model.
Optionally, after the inputting the plurality of image frame data into the face dynamic capture driving model and training the face dynamic capture driving model, the method further includes:
calculating loss data between the facial dynamic capture driving model and the second input, the loss data including, but not limited to, 2D loss, 3D loss, and micro-expression loss;
in response to the loss data, errors between the morphology key values, the key bone control parameters, and the facial motion capture driving model are eliminated.
Optionally, before receiving the second input, the method further includes:
and acquiring the second input by adopting an image acquisition device, wherein the second input is 2D image information or 3D image information containing facial features.
Optionally, after the second input is acquired by using the image acquisition device, the method further includes:
and detecting the face key points and the face frames of the second input by adopting a face detector, and correcting the second input according to the detection result.
According to a second aspect of the present application, there is provided a face dynamic capture driving apparatus comprising:
the first receiving module is used for receiving a first input, and the first input comprises facial morphology key information and facial binding key skeleton information;
a generation module that generates a facial dynamic capture driving model in response to the first input;
the second receiving module is used for receiving a second input, wherein the second input is image information of a plurality of frames containing facial features;
and the display module is used for responding to the second input, importing the image information containing the facial features into the facial dynamic capture driving model, and generating and displaying a facial dynamic capture driving result.
According to a third aspect of the present application, there is provided an electronic device comprising:
a processor; and
a memory in which a program is stored,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method according to any of the first aspects of the application.
According to a fourth aspect of the present application there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method according to any one of the first aspects of the present application.
One or more technical solutions provided in the embodiments of the present application at least have the following technical effects or advantages:
the face dynamic capture driving model is generated by adopting the face morphological key information and the face binding key skeleton information, so that a digital person is precisely matched with image frame data, the image frame data is used as the input of the face dynamic capture driving model, morphological key values and key skeleton parameters are obtained through analysis, the face matching with the digital person and the face driving of the digital person are precisely realized, the driving mode is precise and reliable, the driving display can be carried out on different digital person models without deviation, the digital person dynamic capture driving model has stronger universality, and the influence of complex scene environments can be overcome.
The foregoing summary is merely an overview of the technical solutions of the present application. In order to understand the technical means of the present application more clearly so that they can be implemented in accordance with the contents of the description, and to make the above and other objects, features and advantages of the present application more apparent and comprehensible, specific embodiments are set forth below.
Drawings
Further details, features and advantages of the application are disclosed in the following description of exemplary embodiments with reference to the following drawings, in which:
FIG. 1 illustrates a schematic diagram of an example system in which various methods described herein may be implemented, according to an example embodiment of the application;
FIG. 2 illustrates a flowchart of a facial motion capture driving method according to an exemplary embodiment of the present application;
FIG. 3 illustrates a schematic diagram of face morphology key information and face binding key skeleton information according to an exemplary embodiment of the present application;
FIG. 4 illustrates facial bone bindings and weight examples thereof in accordance with an exemplary embodiment of the application;
FIG. 5 illustrates a model and training framework diagram according to an exemplary embodiment of the present application;
FIG. 6 illustrates three types of motion data collection graphs according to an exemplary embodiment of the present application;
FIG. 7 shows a schematic block diagram of a facial motion capture driving device in accordance with an exemplary embodiment of the present application;
fig. 8 shows a block diagram of an exemplary electronic device that can be used to implement an embodiment of the application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the application are shown in the drawings, it should be understood that the application may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the application. It should be understood that the drawings and embodiments of the application are for illustration purposes only and are not intended to limit the scope of the present application.
It should be understood that the various steps recited in the method embodiments of the present application may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the application is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below. It should be noted that the terms "first," "second," and the like herein are merely used for distinguishing between different devices, modules, or units and not for limiting the order or interdependence of the functions performed by such devices, modules, or units.
It should be noted that references to "one", "a plurality" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will appreciate that "one or more" is intended to be construed as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the devices in the embodiments of the present application are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The following describes the scheme of the present application with reference to the drawings, and the technical scheme provided by the embodiment of the present application is described in detail through specific embodiments and application scenarios thereof.
Interest in the metaverse has been rising continuously worldwide, and it is widely believed that much of people's life will eventually take place in the metaverse; digital humans (Digital Human / Meta Human), as an important link of the metaverse, are therefore receiving great attention. A digital human is a human-like figure created with digital technology that acts in place of a real person to complete production and life activities in the virtual world. Since the digital human must act according to the intention of the human, driving becomes the soul of the digital human, and real-time driving by a real person is the problem to be solved. Through millions of years of human evolution, facial expressions and actions have become complex and meaningful, so driving a digital human face accurately, vividly and in real time has become a subject of both research and practical significance.
As shown in fig. 1, fig. 1 is a schematic diagram of a system for implementing the facial dynamic capture driving method according to an embodiment of the present application. The system 100 includes a data acquisition module 101, a digital human model design module 102, a server 103, and a display terminal 104. The data acquisition module 101 is configured to acquire a plurality of image frames as the second input or as a training set for the facial dynamic capture driving model; it may be an image-capturing electronic device such as a mobile phone, or any other device capable of acquiring and outputting a video stream. The digital human model design module 102 is used for entering the facial morphology key information and the facial binding key skeleton information, so as to generate the facial dynamic capture driving model; it uses 3D design software such as DAZ Studio or ZBrush to design a digital human model that meets the required appearance in terms of gender, body shape, hairstyle and so on. The server 103 includes an AI processor 1031 and a rendering module 1032. The AI processor 1031 deploys the trained network model and is configured to process the image frames acquired by the data acquisition module 101 together with the digital human model from the design module 102, calculating the morphology key values and key skeleton control parameters; the rendering module 1032 takes the target digital person, the morphology key values and the key skeleton control parameters as input and renders the image in combination with lighting, materials, scenes and the like. The display terminal 104 is used for displaying the rendered digital human model, i.e. the facial dynamic capture driving result.
In an alternative manner of this embodiment, as shown in fig. 3, to provide the face morphology key list and the key skeleton list required in this embodiment, the system 100 further includes a face binding module 105 for adding key control skeletons to the face of the digital person and binding with the face of the digital person, so that each skeleton controls its corresponding face region.
In an alternative of this embodiment, the facial morphology keys can be regarded as a high-level decomposition of facial expressions and actions and are used together with the key bones bound to the face. For example, the closing and opening of the jaw can be achieved by controlling the value of the morphology key Jaw Open in [0,1], where 0 represents fully closed, 1 represents fully open, and values between 0 and 1 represent intermediate states. The key bones are mainly concentrated in the eye and mouth regions and are an important means of controlling facial expressions and actions; their granularity of control is coarser than that of the morphology keys, but they can highlight details such as squeezing of the eye corners and curling of the lips.
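For illustration only (no code is given in the original disclosure), a minimal Blender Python sketch of setting a morphology key value and posing a key bone could look as follows; the object and bone names ("DigitalHuman", "DigitalHuman_rig", "jaw_bone") are hypothetical.

```python
import bpy

# Hypothetical object/bone names, used purely for illustration.
obj = bpy.data.objects["DigitalHuman"]        # the digital human face mesh
rig = bpy.data.objects["DigitalHuman_rig"]    # the armature bound to it

# Drive a morphology (shape) key: 0 = jaw fully closed, 1 = fully open.
obj.data.shape_keys.key_blocks["Jaw Open"].value = 0.35

# Drive a key bone: each bone is controlled by Location and Rotation (x, y, z).
jaw = rig.pose.bones["jaw_bone"]
jaw.location = (0.0, -0.002, 0.001)
jaw.rotation_mode = 'XYZ'
jaw.rotation_euler = (0.08, 0.0, 0.0)
```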
The image frame data is used as the input of the face dynamic capture model, the depth network of the designed and trained face dynamic capture model is used for calculating the input to obtain the morphological key value and the key skeleton parameter of the second input, so that the digital human face driving is realized, the whole driving process is accurate and reliable, the digital human face driving method can be used for transferring different digital human models without deviation, has strong universality, can be used for driving in real time, and can overcome the influence caused by the complex scene environment.
As shown in fig. 2, fig. 2 is a schematic flow chart of a method for driving facial dynamic capture according to an embodiment of the present application, and the method may include steps S201 to S204 as follows:
s201, receiving a first input, where the first input includes facial morphology key information and facial binding key skeleton information.
In an alternative manner of this embodiment, the facial morphology key information may be created manually or with a tool such as Faceit, or by multi-software collaboration, for example by importing the model used to create the facial morphology key information into Omniverse Audio2Face, aligning it with the marker model in the software, and then exporting the facial morphology keys. Omniverse Audio2Face is an AI application in NVIDIA Omniverse that can produce 3D facial animation instantly from just an audio track.
Taking the Blender software system as an example for the facial binding key skeleton information, the key skeleton can be bound using Blender's built-in add-on Rigify, or with Auto-Rig Pro or Faceit; the key skeleton can be the bones of the eye and mouth regions of the Human (Meta-Rig). Other software systems such as UE and Maya can obtain and bind the key skeleton of the digital human face with corresponding methods. Each bone is controlled in two dimensions, Location and Rotation, each with the three parameters x, y and z. Blender is open-source three-dimensional graphics software that provides a complete solution for short animation production, covering modeling, animation, texturing, rendering, audio processing and video editing. Blender offers user interfaces suited to different kinds of work and has built-in advanced film and video solutions such as green-screen keying, camera motion tracking, mask processing and post-production node compositing. Blender ships with the Cycles renderer and the real-time rendering engine EEVEE, and also supports a variety of third-party renderers. Blender can be used for three-dimensional visualization and can also create broadcast- and film-quality video; in addition, its built-in real-time three-dimensional game engine enables the production of standalone playable three-dimensional interactive content.
In an alternative manner of this embodiment, before the first input is received, the facial morphology key information is entered, the key control skeleton of the face is defined, and the face is then bound to the digital person. If the number of key bones is n, each bone is controlled in the two dimensions Location and Rotation, each with the three parameters x, y and z, so the number of key skeleton control parameters is 6n.
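As a small illustrative sketch (not from the original text), the output parameter layout implied by this description, using the 61 morphology keys and n = 34 key bones of the embodiment in fig. 3, could be split as follows:

```python
import numpy as np

# Illustrative output layout: 61 morphology keys plus n = 34 key bones,
# each bone contributing Location (x, y, z) and Rotation (x, y, z).
NUM_MORPH_KEYS = 61
NUM_KEY_BONES = 34
PARAMS_PER_BONE = 6

OUTPUT_DIM = NUM_MORPH_KEYS + NUM_KEY_BONES * PARAMS_PER_BONE  # 61 + 204 = 265

def split_prediction(pred: np.ndarray):
    """Split a flat 265-dim prediction into morphology key values and per-bone parameters."""
    morph_values = pred[:NUM_MORPH_KEYS]                                         # (61,)
    bone_params = pred[NUM_MORPH_KEYS:].reshape(NUM_KEY_BONES, PARAMS_PER_BONE)  # (34, 6)
    return morph_values, bone_params
```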
S202, responding to the first input, and generating a face dynamic capture driving model.
In an alternative manner of this embodiment, the facial dynamic capture driving model is a ResNet34 network followed by 3 fully connected layers, which outputs the morphology key values and key bone control parameters corresponding to the facial morphology key information and the facial binding key skeleton information.
S203, receiving a second input, wherein the second input is image information containing facial features for a plurality of frames.
In this embodiment, images containing the facial features of the target object are acquired as video frames by an image acquisition device.
S204, responding to the second input, importing the image information containing the facial features into the facial dynamic capture driving model, and generating and displaying a facial dynamic capture driving result.
In an alternative manner of this embodiment, the acquired video frames are input into the face dynamic capture driving model, and the final face dynamic capture driving result is output by the face dynamic capture driving model.
The embodiment of the application adopts the facial morphology key information and the facial binding key skeleton information to generate the facial dynamic capture driving model, so that the digital person is precisely matched with the image frame data, the image frame data is used as the input of the facial dynamic capture driving model, the morphology key value and the key skeleton parameters are obtained through analysis, the facial matching with the digital person and the facial driving of the digital person are precisely realized, the driving mode is precise and reliable, the driving display can be carried out on different digital person models without deviation, the driving display has stronger universality, and the real-time driving can be realized, and the influence of complex scene environments can be overcome.
In an alternative manner of this embodiment, step S204 includes:
s2041, receiving a morphological key value and key skeleton control parameters of the facial feature-containing image information analyzed by the facial dynamic capture model;
s2042, responding to the morphological key value and the key skeleton control parameter, and generating and displaying a facial dynamic capture driving result.
In this embodiment, the acquired video frames are input into the facial dynamic capture driving model, and the morphology key values and the key skeleton control parameters of the target object are obtained from the model. The facial dynamic capture driving result is then generated and displayed according to the morphology key values and the key skeleton control parameters of the target object.
In an alternative manner of the present embodiment, step S2042 includes:
s2042a, constructing a digital human model according to the first input;
and S2042b, responding to the morphological key value and the key skeleton control parameter, and controlling and driving to display the digital human model.
In this embodiment, a digital human model close to the human face is constructed according to the facial morphology key information and the facial binding key skeleton information in the first input; the constructed digital human model is imported into Blender, a human skeleton is obtained with the Human (Meta-Rig) under the Armature menu, and only the head bones are retained.
As shown in fig. 4, according to the facial binding key skeleton information, the positions of the skeleton and of the digital human model's face are adjusted and automatic weight binding is performed. After binding is completed, attention is paid only to the 34 key bones proposed in fig. 3 and the other bones are ignored; morphology keys, such as the 61 morphology keys in fig. 3, are then created on the basis of the digital human model.
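A minimal Blender Python sketch of the automatic weight binding step described above might look like this; the object names ("DigitalHumanHead", "metarig") are assumptions, and the operator call requires a valid 3D Viewport context.

```python
import bpy

# Hypothetical object names; binding the head armature to the digital human
# mesh with automatic weights, as described for fig. 4.
mesh_obj = bpy.data.objects["DigitalHumanHead"]
rig_obj = bpy.data.objects["metarig"]   # Human (Meta-Rig) with only head bones kept

bpy.ops.object.select_all(action='DESELECT')
mesh_obj.select_set(True)
rig_obj.select_set(True)
bpy.context.view_layer.objects.active = rig_obj   # the armature must be the active object
bpy.ops.object.parent_set(type='ARMATURE_AUTO')   # parent with automatic weights

# After binding, only the 34 key bones of the eye and mouth regions are driven;
# the remaining bones are left at their rest pose.
```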
In an optional manner of this embodiment, before step S203, the method further includes:
s203a, receiving a plurality of image frame data;
s203b, inputting a plurality of image frame data into the face dynamic capture driving model, and training the face dynamic capture driving model.
In this embodiment, machine learning generally divides the sample into an independent three-part training set (train set), a validation set (validation set) and a test set (test set). Wherein the training set is used to build the model. The training set is used in supervised learning, which is a process of adjusting parameters of a classifier by using a set of samples of known classes to achieve required performance, and is also called supervised training or teacher learning.
In this embodiment, the model is a ResNet34 network that extracts the facial features of the input picture frame, followed by 3 fully connected layers that compute the morphology key values and key skeleton parameters.
In an alternative manner of this embodiment, step S203b further includes:
s203c, calculating loss data between the digital human model and the second input, wherein the loss data comprises, but is not limited to, 2D loss, 3D loss and micro-expression loss;
and S203d, responding to the loss data, and eliminating errors among the morphological key value, the key skeleton control parameter and the digital human model.
In this embodiment, the loss used to train the facial dynamic capture driving model is calculated from four aspects:
(1) Morphology key level: the MSE loss $\mathcal{L}_{mse}$ between the predicted morphology key values and their ground-truth values (defined below).
(2) Key point level: at the 2D level, the picture frame is fed to DBFace key point detection to obtain 98 key points, and the loss $\mathcal{L}_{2d}$ against the 98 key points of the driven model is calculated; at the 3D level, the picture frame is fed to MediaPipe face mesh detection, and the loss $\mathcal{L}_{3d}$ against the corresponding face vertex group of the driven model is calculated.
(3) Micro-expression level: the picture frame is fed to a micro-expression recognition module, and the loss $\mathcal{L}_{micro\text{-}exp}$ against the micro-expression recognition result of the driven model is calculated.
(4) Stability level, i.e. inter-frame continuity learning: the direction vectors between the mesh points of two consecutive input frames should be consistent with the direction vectors between the corresponding mesh points of the driven model, giving $\mathcal{L}_{sta}$.
Further, the overall training loss is
$$\mathcal{L}=\lambda_{mse}\mathcal{L}_{mse}+\lambda_{2d}\mathcal{L}_{2d}+\lambda_{3d}\mathcal{L}_{3d}+\lambda_{micro\text{-}exp}\mathcal{L}_{micro\text{-}exp}+\lambda_{sta}\mathcal{L}_{sta}$$
The loss is back-propagated so that the trained facial dynamic capture driving model converges. After training is completed, the model is deployed on an AI processing server with a GPU; the image frame data acquired in real time, i.e. the second input, is fed in, the morphology key values and key skeleton parameter values are calculated to drive the digital human face, and after rendering by the rendering module the final digital human and the rendered scene are displayed on the client.
In an alternative manner of this embodiment, as shown in fig. 5, ResNet34 is used as the backbone and is followed by 3 fully connected layers. The input is 4-channel data formed by concatenating the preprocessed collected picture frame with the 98-key-point detection result; the last fully connected layer has 265 neurons (61 + 34 × 6), corresponding to the morphology key values and key skeleton control parameters, and the activation function is Tanh.
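As a non-authoritative sketch of this architecture (the patent publishes no code), the network could be written in PyTorch as below; the hidden widths of the first two fully connected layers and the way the 98 detected key points are rasterized into the fourth input channel are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class FaceCaptureNet(nn.Module):
    """ResNet34 backbone plus 3 fully connected layers; 4-channel input,
    265-dim output (61 morphology keys + 34 bones x 6 parameters), Tanh at the end."""
    def __init__(self, out_dim: int = 265):
        super().__init__()
        backbone = resnet34(weights=None)
        # Replace the stem so the network accepts 4 channels (RGB + key-point channel).
        backbone.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()              # keep the 512-dim pooled feature
        self.backbone = backbone
        self.head = nn.Sequential(               # 3 fully connected layers (widths assumed)
            nn.Linear(512, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 512), nn.ReLU(inplace=True),
            nn.Linear(512, out_dim), nn.Tanh(),  # Tanh activation on the output
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 4, 256, 256)
        return self.head(self.backbone(x))

# Example: a batch of two preprocessed inputs yields a (2, 265) prediction.
model = FaceCaptureNet()
pred = model(torch.randn(2, 4, 256, 256))
```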
Model training requires that, after driving and rendering, the digital human face is consistent with the input in the states of the facial features (such as the opening/closing of the eyes and mouth) and in the micro-expression, so the following losses need to be calculated.
Calculation of $\mathcal{L}_{2d}$: at the 2D level, the 98 key points of the digital human face and the 98 key points of the input human face are detected, and
$$\mathcal{L}_{2d}=\frac{1}{98}\sum_{k=1}^{98}\left\|y_{2d}^{(k)}-y_{2d}^{\prime(k)}\right\|_{2}^{2}$$
is calculated to judge the key-point similarity between the digital human face and the input human face, where $k$ is the point index over the 98 key points, and $y_{2d}$ and $y'_{2d}$ denote the input face key points and the corresponding digital human face key points, respectively.
Calculation of $\mathcal{L}_{3d}$: at the 3D level, the face mesh detection in Google's MediaPipe is adopted (MediaPipe is an open-source project of Google supporting cross-platform, general ML solutions) to detect the 468 3D face key points of the input image. Since the digital human model has a large number of face vertices, the 468 3D key points are aligned with the digital human face, the vertex closest to each of the 468 key points is taken as its corresponding point, and the position indices are recorded to facilitate subsequent calculation:
$$\mathcal{L}_{3d}=\frac{1}{468}\sum_{k=1}^{468}\left\|y_{3d}^{(k)}-y_{3d}^{\prime(k)}\right\|_{2}^{2}$$
where $k$ is the point index over the 468 key points, and $y_{3d}$ and $y'_{3d}$ denote the input face 3D key points and the corresponding digital human face key points, respectively.
Calculation of $\mathcal{L}_{micro\text{-}exp}$: the micro-expressions of the input face and the digital human face are detected to obtain their distribution values over each label, and the KL divergence between the two distributions is calculated:
$$\mathcal{L}_{micro\text{-}exp}=\sum_{c}p_{input}(c)\log\frac{p_{input}(c)}{p_{digital}(c)}$$
where $p_{input}$ and $p_{digital}$ are the micro-expression label distributions of the input face and the digital human face, respectively.
In this embodiment, because the driving comes from a data stream, a stability-level measure is added: the direction vectors of the 3D points between two consecutive input frames should be consistent with the direction vectors between the corresponding points of the driven digital human face, recorded as
$$\mathcal{L}_{sta}=\frac{1}{468}\sum_{k=1}^{468}\left\|\left(y_{3d\_now}^{(k)}-y_{3d\_last}^{(k)}\right)-\left(y_{3d\_now}^{\prime(k)}-y_{3d\_last}^{\prime(k)}\right)\right\|_{2}^{2}$$
where $k$ is the point index over the 468 key points, $y_{3d\_now}$ and $y_{3d\_last}$ denote the 3D key points of the current and previous frames of the input face, and $y'_{3d\_now}$ and $y'_{3d\_last}$ are the corresponding 3D key points of the current and previous frames of the digital human face.
When the training set is constructed, LiveLinkFace is used for data acquisition, so that not only the video but also the 61 morphology key values of each frame are captured; these serve as the ground truth for the MSE loss, an important part of the network training, recorded as
$$\mathcal{L}_{mse}=\frac{1}{61}\sum_{k=1}^{61}\left(y_{bs}^{(k)}-y_{bs}^{\prime(k)}\right)^{2}$$
where $k$ is the index over the 61 morphology keys, and $y_{bs}$ and $y'_{bs}$ denote the ground-truth morphology key values and the morphology key values predicted by the network model, respectively.
The loss function of the whole model is then:
$$\mathcal{L}=\lambda_{mse}\mathcal{L}_{mse}+\lambda_{2d}\mathcal{L}_{2d}+\lambda_{3d}\mathcal{L}_{3d}+\lambda_{micro\text{-}exp}\mathcal{L}_{micro\text{-}exp}+\lambda_{sta}\mathcal{L}_{sta}$$
where $\mathcal{L}_{mse}$ is the mean squared error of the morphology keys, $\mathcal{L}_{2d}$ the 2D key point loss, $\mathcal{L}_{3d}$ the 3D key point loss, $\mathcal{L}_{micro\text{-}exp}$ the micro-expression loss, $\mathcal{L}_{sta}$ the stability loss, and $\lambda_{mse}$, $\lambda_{2d}$, $\lambda_{3d}$, $\lambda_{micro\text{-}exp}$, $\lambda_{sta}$ the corresponding weight coefficients.
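A minimal PyTorch sketch of combining these losses is given below for illustration; the tensor shapes, the KL-divergence direction and the per-term normalizations are assumptions, and the micro-expression distributions are assumed to already be probabilities.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_bs, gt_bs,             # (B, 61) morphology key values
               pred_2d, gt_2d,             # (B, 98, 2) 2D key points
               pred_3d, gt_3d,             # (B, 468, 3) 3D mesh key points
               pred_exp, gt_exp,           # (B, C) micro-expression label distributions
               pred_3d_last, gt_3d_last,   # previous-frame mesh points (stability term)
               w):                         # dict of weight coefficients (the lambdas)
    l_mse = F.mse_loss(pred_bs, gt_bs)
    l_2d = F.mse_loss(pred_2d, gt_2d)
    l_3d = F.mse_loss(pred_3d, gt_3d)
    # KL divergence between the micro-expression label distributions.
    l_micro = F.kl_div(pred_exp.log(), gt_exp, reduction='batchmean')
    # Stability: inter-frame point direction vectors should match.
    l_sta = F.mse_loss(pred_3d - pred_3d_last, gt_3d - gt_3d_last)
    return (w['mse'] * l_mse + w['2d'] * l_2d + w['3d'] * l_3d
            + w['micro'] * l_micro + w['sta'] * l_sta)
```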
In an alternative manner of this embodiment, the weight coefficients are set and changed during training as follows:
First stage: morphology key fitting. $\lambda_{2d}$, $\lambda_{3d}$, $\lambda_{micro\text{-}exp}$ and $\lambda_{sta}$ are set to 0, $\lambda_{mse}$ is set to 1.0, and 20 epochs are trained.
Second stage: $\lambda_{mse}$ is set to 0.6; $\lambda_{2d}$ starts at 1.0 and eventually decreases to 0.2 as the epochs increase; $\lambda_{3d}$ starts at 0.6 and eventually rises to 1.2 as the epochs increase; $\lambda_{micro\text{-}exp}$ is set to 0.3 and $\lambda_{sta}$ to 0.7 ($\lambda_{sta}$ is set to 0 for non-consecutive frame input). This further refines the details of the driving, and 20 epochs are trained.
In addition, to ensure the completeness of the training set, besides 1000 groups of data collected in a normal communication state, a further 632 groups of data covering three types of motion are collected indoors/outdoors and under different illumination conditions, as shown in fig. 6.
The input data is augmented during training, including adding random noise, adjusting brightness, random flipping and other operations; the optimizer is Adam, the initial learning rate is set to 0.002, and the batch size is set to 32. After training, the model is serialized into the ONNX format and run on the TensorRT acceleration framework; the whole inference process takes about 19.8 ms on average. During rendering, the digital human model, the computed morphology key values and the key skeleton parameters are input to Blender's EEVEE engine; all Blender operations can be executed by the back end, and the rendering results are finally output and presented to the client.
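A sketch of the corresponding training setup and ONNX export, under the stated hyperparameters (Adam, learning rate 0.002, batch size 32), might look as follows; FaceCaptureNet refers to the illustrative network sketched earlier, and the file and tensor names are hypothetical.

```python
import torch

# Assumed training setup matching the description, followed by ONNX export
# so the model can be deployed under TensorRT.
model = FaceCaptureNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)

# ... training loop over augmented batches of size 32, combining total_loss()
#     with loss_weights(epoch) from the sketches above ...

dummy = torch.randn(1, 4, 256, 256)      # one preprocessed 4-channel input
torch.onnx.export(model, dummy, "face_capture.onnx",
                  input_names=["frame_with_keypoints"],
                  output_names=["morph_and_bone_params"],
                  opset_version=13)
```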
In an optional manner of this embodiment, before step S203, the method further includes:
s203e, acquiring the second input by using an image acquisition device, where the second input is 2D image information or 3D image information including facial features.
In this embodiment, monocular markerless motion capture is adopted: with only one image acquisition device collecting video frame images, the facial actions and micro-expressions of the user can be depicted vividly and in real time on the target digital human figure. The video frame images may be 2D image information or 3D image information containing facial features.
In an optional manner of this embodiment, after step S203e, the method further includes:
and S203f, detecting the face key points and the face frames of the second input by using a face detector, and correcting the second input according to the detection result.
In this embodiment, the second input image is preprocessed: the existing face detector DBFace is used to detect the 98 facial key points and the face bounding box, the face region is corrected to a square without stretching the image, the 98 key points are corrected correspondingly, and the result is finally resized to 256×256 as the network input.
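A minimal sketch of this preprocessing step is shown below, assuming the face detector has already returned the face box and the 98 key points; border padding for crops that fall outside the image is omitted for brevity.

```python
import cv2
import numpy as np

def preprocess(frame, box, keypoints, out_size=256):
    """Square-crop the detected face without stretching, remap the 98 key points,
    and resize to out_size x out_size as network input.
    box = (x, y, w, h) from the face detector; keypoints = (98, 2) array."""
    x, y, w, h = box
    side = int(round(max(w, h)))                      # enlarge the box to a square
    cx, cy = x + w / 2.0, y + h / 2.0
    x0, y0 = int(cx - side / 2), int(cy - side / 2)
    crop = frame[y0:y0 + side, x0:x0 + side]          # border padding omitted for brevity
    scale = out_size / float(side)
    kpts = (np.asarray(keypoints, dtype=np.float32) - [x0, y0]) * scale  # corrected key points
    img = cv2.resize(crop, (out_size, out_size))
    return img, kpts
```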
The embodiment of the application provides a facial dynamic capture driving method that combines morphology keys with key skeleton control and realizes real-time driving of the digital human face through a trained network model; RGB frame data is used as input, the overall computation is small, and the accuracy is excellent. The embodiment belongs to monocular markerless motion capture: with only one camera acquisition device, the facial actions and micro-expressions of the user can be depicted vividly and in real time on the target digital human figure. The deep network model ResNet34 is trained with a custom unlabeled training set, so the cost is lower, the model parameters are smaller, and the real-time performance is better. Micro-expression changes and stability are taken into account during model training, so the motion of the driven digital human face is more natural and fine. The technical solution of the application can effectively drive any digital human model that conforms to the corresponding morphology key definition and facial skeleton binding; for example, a model made with DAZ Studio can be driven seamlessly after simple adjustment once it is imported into Blender, UE5 or other general 3D software.
Corresponding to the above embodiment, referring to fig. 7, an embodiment of the present application further provides a facial dynamic capture driving device 700, including:
a first receiving module 701, configured to receive a first input, where the first input includes facial morphology key information and facial binding key skeleton information;
a generation module 702, responsive to the first input, that generates a facial dynamic capture driving model;
a second receiving module 703, configured to receive a second input, where the second input is image information including facial features for a plurality of frames;
and a display module 704, responsive to the second input, for importing the image information containing the facial features into the facial dynamic capture driving model, and generating and displaying a facial dynamic capture driving result.
Optionally, the display module 704 further includes:
an analysis module 7041 for receiving the facial dynamic capture model to analyze morphological key values and key bone control parameters of the image information containing facial features;
a display sub-module 7042, responsive to the morphology key values and the key bone control parameters, generates and displays facial motion capture driving results.
Optionally, the display submodule 7042 further includes:
a building module 70421 for building a digital person model from the first input;
a control module 70422, responsive to the morphology key value and the key bone control parameter, controls driving the display of the digital mannequin.
Optionally, the facial dynamic capture driving apparatus 700 further includes:
a third receiving module 705, configured to receive a plurality of image frame data;
and the training module 706 is configured to input a plurality of image frame data into the face dynamic capture driving model, and train the face dynamic capture driving model.
Optionally, the facial dynamic capture driving apparatus 700 further includes:
a calculation module 707 for calculating loss data between the facial dynamic capture driving model and the second input, the loss data including, but not limited to, 2D loss, 3D loss, and micro-expression loss;
an elimination module 708 eliminates errors between the morphology key values, the key bone control parameters, and the facial motion capture driving model in response to the loss data.
Optionally, the facial dynamic capture driving apparatus 700 further includes:
the acquiring module 709 acquires the second input with an image acquisition device, where the second input is 2D image information or 3D image information including facial features.
Optionally, the facial dynamic capture driving apparatus 700 further includes:
and the correction module 710 is configured to detect the face key points and the face frames of the second input by using a face detector, and correct the second input according to the detection result.
The embodiment of the application adopts the facial morphology key information and the facial binding key skeleton information to generate the facial dynamic capture driving model, so that the digital person is precisely matched with the image frame data, the image frame data is used as the input of the facial dynamic capture driving model, the morphology key value and the key skeleton parameters are obtained through analysis, the facial matching with the digital person and the facial driving of the digital person are precisely realized, the driving mode is precise and reliable, the driving display can be carried out on different digital person models without deviation, the driving display has stronger universality, and the real-time driving can be realized, and the influence of complex scene environments can be overcome.
The exemplary embodiment of the application also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor for causing the electronic device to perform a method according to an embodiment of the application when executed by the at least one processor.
The exemplary embodiments of the present application also provide a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the present application.
The exemplary embodiments of the application also provide a computer program product comprising a computer program, wherein the computer program, when being executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the application.
With reference to fig. 8, a block diagram of an electronic device 800 that may be a server or a client of the present application will now be described, which is an example of a hardware device that may be applied to aspects of the present application. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the electronic device 800; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device. The output unit 807 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 808 may include, but is not limited to, magnetic disks and optical disks. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices over computer networks such as the internet and/or various telecommunication networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, WiFi devices, WiMax devices, cellular communication devices, and the like.
The computing unit 801 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above. For example, in some embodiments, the facial dynamic capture driving method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. In some embodiments, the computing unit 801 may be configured to perform the facial dynamic capture driving method by any other suitable means (e.g., by means of firmware).
Program code for carrying out methods of the present application may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (10)

1. A face dynamic capture driving method, comprising:
receiving a first input, the first input comprising facial morphology key information and facial binding key bone information;
generating a facial dynamic capture driving model in response to the first input;
receiving a second input, the second input being image information for a number of frames including facial features;
and in response to the second input, importing the image information containing the facial features into the facial dynamic capture driving model, and generating and displaying a facial dynamic capture driving result.
2. The method of claim 1, wherein the importing the image information including the facial features into the facial motion capture driving model in response to the second input, generating and displaying a facial motion capture driving result, comprises:
receiving the facial dynamic capture model to analyze morphological key values and key skeleton control parameters of the image information containing facial features;
and generating and displaying a facial dynamic capture driving result in response to the morphological key value and the key skeleton control parameter.
3. The facial motion capture driving method according to claim 2, wherein said generating and displaying a facial motion capture driving result in response to said morphology key value and said key bone control parameter comprises:
constructing a digital human model according to the first input;
and controlling to drive the display of the digital human model in response to the morphological key value and the key skeleton control parameter.
4. The method of claim 1, further comprising, prior to receiving the second input:
receiving a plurality of image frame data;
and inputting a plurality of image frame data into the face dynamic capture driving model, and training the face dynamic capture driving model.
5. The method according to claim 4, wherein the step of inputting the plurality of image frame data into the face dynamic capture driving model and training the face dynamic capture driving model further comprises:
calculating loss data between the facial dynamic capture driving model and the second input, the loss data including, but not limited to, 2D loss, 3D loss, and micro-expression loss;
in response to the loss data, errors between the morphology key values, the key bone control parameters, and the facial motion capture driving model are eliminated.
6. The method of claim 1, further comprising, prior to receiving the second input:
and acquiring the second input by adopting an image acquisition device, wherein the second input is 2D image information or 3D image information containing facial features.
7. The method of driving facial motion capture according to claim 6, further comprising, after the acquiring the second input with the image capturing device:
and detecting the face key points and the face frames of the second input by adopting a face detector, and correcting the second input according to the detection result.
8. A facial motion capture driving device, comprising:
the first receiving module is used for receiving a first input, and the first input comprises facial morphology key information and facial binding key skeleton information;
a generation module that generates a facial dynamic capture driving model in response to the first input;
the second receiving module is used for receiving a second input, wherein the second input is image information of a plurality of frames containing facial features;
and the display module is used for responding to the second input, importing the image information containing the facial features into the facial dynamic capture driving model, and generating and displaying a facial dynamic capture driving result.
9. An electronic device, comprising:
a processor; and
a memory in which a program is stored,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method according to any of claims 1-7.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7.
CN202310602158.5A 2023-05-25 2023-05-25 Face dynamic capture driving method and device, electronic equipment and readable storage medium Pending CN116863043A (en)

Priority Applications (1)

Application Number: CN202310602158.5A; Priority Date: 2023-05-25; Filing Date: 2023-05-25; Title: Face dynamic capture driving method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number: CN116863043A; Publication Date: 2023-10-10

Family ID: 88231067


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination