CN116977547A - Three-dimensional face reconstruction method and device, electronic equipment and storage medium - Google Patents

Three-dimensional face reconstruction method and device, electronic equipment and storage medium

Info

Publication number
CN116977547A
Authority
CN
China
Prior art keywords
parameter
predicted
face
model
prediction
Prior art date
Legal status
Pending
Application number
CN202310413286.5A
Other languages
Chinese (zh)
Inventor
王福东
葛志鹏
曹玮剑
丁中干
陈人望
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310413286.5A
Publication of CN116977547A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G06V 40/168 Feature extraction; Face representation

Abstract

The present application relates to the technical field of artificial intelligence, and in particular to a three-dimensional face reconstruction method and apparatus, an electronic device, and a storage medium, which are used to improve the accuracy of a 3D face model. The method includes: acquiring a video to be reconstructed that contains a target object and an initial reconstruction model; performing a preset number of loop iterations of training on the initial reconstruction model based on the video to be reconstructed to obtain a target reconstruction model; in one loop iteration, performing feature recognition on each video frame to obtain a corresponding prediction parameter set, and, for each prediction parameter set, obtaining a prediction parameter difference based on the predicted appearance parameter contained in the set and a reference appearance parameter determined from the current iteration number, then adjusting the parameters of the initial reconstruction model based on the prediction parameter difference. Because the initial reconstruction model is trained with the video to be reconstructed that contains the target object, the resulting target reconstruction model can accurately identify the features of the target object, so the constructed target face model is more accurate.

Description

Three-dimensional face reconstruction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a three-dimensional face reconstruction method, apparatus, electronic device, and storage medium.
Background
Three-dimensional face reconstruction refers to reconstructing a 3D (three-dimensional) model of a face from one or more 2D (two-dimensional) images. In the field of computer vision, three-dimensional face reconstruction is a direction with research value, and the high-quality reconstruction of the three-dimensional face has important significance in the fields of face recognition, anti-counterfeiting, game entertainment, movie animation, beauty and medical treatment and the like.
In the related art, to reconstruct a three-dimensional face, 2D key points of a face are first extracted from an input image. In the model training stage, a deep learning model predicts the face parameters of the object contained in the input image, a 3D face model of the object is generated based on these parameters, and the parameters of the deep learning model are continuously adjusted so that the error between the projected coordinates of the 3D key points of the generated 3D face model and the coordinates of the corresponding 2D key points in the input image reaches a minimum.
However, the face features of different objects differ greatly, while the scene distribution of the training data used during model training is limited. If a target image is not used during model training, the 3D face model generated by the deep learning model differs to some extent from the face actually contained in the target image, so the accuracy of the generated 3D face model is low, and the desired effect is difficult to achieve when the 3D face model is used for digital technology development.
Disclosure of Invention
The embodiment of the application provides a three-dimensional face reconstruction method, a three-dimensional face reconstruction device, electronic equipment and a storage medium, which are used for improving the accuracy of three-dimensional face reconstruction.
The first three-dimensional face reconstruction method provided by the embodiment of the application comprises the following steps:
acquiring a video to be reconstructed comprising a target object, and acquiring an initial reconstruction model, wherein the initial reconstruction model is obtained by pre-training a reconstruction model to be trained by using a constructed training data set;
performing loop iteration training for the initial reconstruction model for a preset number of times based on the video to be reconstructed to obtain a target reconstruction model, so as to respectively construct a target face model of the target object in each video frame contained in the video to be reconstructed based on the target reconstruction model; wherein, in one loop iteration, the following operations are performed:
performing feature recognition on each video frame to obtain a corresponding prediction parameter set, wherein the prediction parameter set contains at least a predicted appearance parameter;
for each prediction parameter set, performing the following operations: obtaining a prediction parameter difference based on the predicted appearance parameter contained in one prediction parameter set and a reference appearance parameter determined based on the current iteration number, and adjusting the parameters of the initial reconstruction model based on the prediction parameter difference.
The second three-dimensional face reconstruction method provided by the embodiment of the application comprises the following steps:
inputting each video frame contained in a video to be reconstructed into a target reconstruction model to obtain an output predicted appearance parameter and a predicted expression parameter and predicted position parameter corresponding to each video frame, wherein the video to be reconstructed contains a target object, and the target reconstruction model is obtained by performing a preset number of loop iterations of training on an initial reconstruction model based on the video to be reconstructed;
obtaining a target appearance parameter based on the predicted appearance parameter and a first reference appearance parameter, wherein the first reference appearance parameter is obtained by adjusting, based on each prediction parameter difference obtained in the last loop iteration of training, a second reference appearance parameter determined in the last loop iteration of training, and each prediction parameter difference is obtained based on each predicted appearance parameter obtained in the last loop iteration of training and the second reference appearance parameter;
and respectively constructing a target face model of the target object in the corresponding video frame based on the target appearance parameter and the predicted expression parameter and predicted position parameter corresponding to each video frame.
The first three-dimensional face reconstruction device provided by the embodiment of the application comprises:
the acquisition unit is used for acquiring a video to be reconstructed containing a target object and acquiring an initial reconstruction model, wherein the initial reconstruction model is obtained by pre-training a reconstruction model to be trained by using a constructed training data set;
the first training unit is used for executing cyclic iterative training for the initial reconstruction model for a preset number of times based on the video to be reconstructed to obtain a target reconstruction model, so as to respectively construct a target face model of the target object in each video frame contained in the video to be reconstructed based on the target reconstruction model; wherein, in one loop iteration, the following operations are performed:
performing feature recognition on each video frame to obtain a corresponding prediction parameter set, wherein the prediction parameter set contains at least a predicted appearance parameter;
for each prediction parameter set, performing the following operations: obtaining a prediction parameter difference based on the predicted appearance parameter contained in one prediction parameter set and a reference appearance parameter determined based on the current iteration number, and adjusting the parameters of the initial reconstruction model based on the prediction parameter difference.
In an alternative embodiment, the set of predicted parameters further comprises predicted expression parameters; the prediction parameter differences comprise semantic differences and position differences;
the first training unit is specifically configured to:
fusing the reference appearance parameter with the predicted appearance parameter to obtain a corresponding fused appearance parameter;
constructing a predicted face model of the target object in the corresponding video frame based on the fused appearance parameter and the predicted expression parameter;
based on each predicted two-dimensional key point contained in the predicted two-dimensional image corresponding to the predicted face model and each reference two-dimensional key point contained in the video frame, obtaining the semantic difference and the position difference;
and carrying out parameter adjustment on the initial reconstruction model based on the semantic difference and the position difference.
In an alternative embodiment, the first training unit is specifically configured to:
based on the predicted two-dimensional image, obtaining respective predicted semantic information and predicted position information of each predicted two-dimensional key point;
based on the video frame, obtaining respective reference semantic information and reference position information of each reference two-dimensional key point;
Determining the semantic difference based on the difference between each predicted semantic information and each reference semantic information;
the position difference is determined based on the difference between each predicted position information and each reference position information.
In an alternative embodiment, the set of predicted parameters further comprises predicted location parameters; the first training unit is specifically configured to:
performing face analysis on the predicted two-dimensional image to obtain respective predicted semantic information of each predicted two-dimensional key point;
and projecting each predicted two-dimensional key point to the video frame based on the predicted position parameter, and obtaining the predicted position information of each predicted two-dimensional key point on the video frame.
In an alternative embodiment, the first training unit is specifically configured to:
performing face detection on the video frame to obtain a face region containing the target object;
performing face analysis on the face region to obtain the respective reference semantic information of each reference two-dimensional key point;
and detecting the face key points of the face area to obtain the respective reference position information of each reference two-dimensional key point.
In an alternative embodiment, the first training unit is specifically configured to determine the reference appearance parameter in the following way:
if the current iteration number is 1, determining the reference appearance parameter based on the average value of the predicted appearance parameters contained in the prediction parameter sets; otherwise,
adjusting the reference appearance parameter determined in the previous loop iteration based on the semantic difference and the position difference obtained in the previous loop iteration, so as to determine the reference appearance parameter, as sketched below.
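A minimal sketch of the two branches above; the gradient-based adjustment and the learning rate are illustrative assumptions, since the text only states that the previous reference parameter is adjusted using the previous iteration's semantic and position differences:

```python
import numpy as np

def reference_appearance_param(iteration, predicted_params, prev_reference=None, grad=None, lr=0.01):
    """Determine the reference appearance parameter for the current loop iteration.

    iteration        -- 1-based index of the current loop iteration
    predicted_params -- list of predicted appearance parameter vectors, one per video frame
    prev_reference   -- reference appearance parameter from the previous iteration
    grad             -- assumed gradient of the previous iteration's semantic + position
                        differences with respect to the reference parameter
    """
    if iteration == 1:
        # First iteration: average the predicted appearance parameters of all frames.
        return np.mean(np.stack(predicted_params), axis=0)
    # Later iterations: adjust the previous reference parameter using the
    # differences (represented here by a gradient) from the previous iteration.
    return prev_reference - lr * grad
```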
In an alternative embodiment, the prediction parameter set further comprises a prediction texture parameter;
the first training unit is specifically configured to:
constructing a first loss function based on the semantic difference and the location difference;
constructing a second loss function based on the difference between the fused appearance parameter and the predicted appearance parameter;
constructing a first regular function based on the predicted appearance parameter, the predicted expression parameter and the predicted texture parameter;
and constructing a target loss function based on the first loss function, the second loss function and the first regular function, and carrying out parameter adjustment on the initial reconstruction model based on the target loss function.
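To make the combination of the terms above concrete, here is a minimal PyTorch-style sketch; the squared-error forms, the loss weights and the treatment of semantics as soft label maps are illustrative assumptions rather than values taken from the patent:

```python
import torch

def target_loss(pred_2d, ref_2d, pred_sem, ref_sem,
                fused_app, pred_app, pred_expr, pred_tex,
                w_fuse=1.0, w_reg=1e-3):
    # First loss: position difference + semantic difference between
    # predicted 2D key points and reference 2D key points.
    position_loss = torch.mean((pred_2d - ref_2d) ** 2)
    semantic_loss = torch.mean((pred_sem - ref_sem) ** 2)
    first_loss = position_loss + semantic_loss

    # Second loss: difference between the fused and predicted appearance parameters.
    second_loss = torch.mean((fused_app - pred_app) ** 2)

    # First regular (regularization) term: keeps appearance, expression and
    # texture coefficients compact to reduce over-fitting.
    reg = (pred_app ** 2).sum() + (pred_expr ** 2).sum() + (pred_tex ** 2).sum()

    return first_loss + w_fuse * second_loss + w_reg * reg
```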
In an alternative embodiment, the apparatus further comprises a second training unit for:
Based on the training samples in the training data set, performing loop iteration training on the reconstruction model to be trained to obtain the initial reconstruction model; in one loop iteration, the following operations are performed:
performing feature recognition on a sample image contained in a training sample to obtain a sample parameter set, wherein the sample parameter set at least contains sample appearance parameters, sample expression parameters and sample texture parameters;
constructing a sample face model based on the sample appearance parameters and the sample expression parameters, and constructing a second regular function based on the sample appearance parameters, the sample expression parameters and the sample texture parameters;
based on each sample two-dimensional key point contained in the sample two-dimensional image corresponding to the sample face model and each reference two-dimensional key point contained in the sample image, obtaining sample semantic difference and sample position difference between the sample two-dimensional key points;
and constructing a third loss function by using the sample semantic difference and the sample position difference, and carrying out parameter adjustment on the reconstruction model to be trained based on the second regular function and the third loss function.
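A hedged sketch of one pre-training iteration under these definitions; `model`, `render_fn` and the weight `w_reg` are assumed placeholders for the reconstruction model to be trained, a differentiable projection of the sample face model, and a regularization weight:

```python
import torch

def pretrain_step(model, optimizer, sample_image, ref_keypoints_2d, ref_semantics,
                  render_fn, w_reg=1e-3):
    """One loop iteration of pre-training the reconstruction model (sketch)."""
    # Feature recognition -> sample parameter set (appearance, expression, texture).
    app, expr, tex = model(sample_image)
    # Assumed differentiable function: sample face model -> projected 2D key points + semantics.
    pred_kp_2d, pred_sem = render_fn(app, expr, tex)

    # Third loss: sample position difference + sample semantic difference.
    sample_position_diff = torch.mean((pred_kp_2d - ref_keypoints_2d) ** 2)
    sample_semantic_diff = torch.mean((pred_sem - ref_semantics) ** 2)
    third_loss = sample_position_diff + sample_semantic_diff

    # Second regular term over the sample parameter set.
    second_reg = (app ** 2).sum() + (expr ** 2).sum() + (tex ** 2).sum()

    loss = third_loss + w_reg * second_reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```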
The second three-dimensional face reconstruction device provided by the embodiment of the application comprises:
The prediction unit is used for inputting each video frame contained in the video to be reconstructed into a target reconstruction model respectively to obtain output prediction looks parameters and prediction expression parameters and prediction position parameters corresponding to each video frame, wherein the video to be reconstructed contains a target object, and the target reconstruction model is obtained by executing cyclic iteration training for the initial reconstruction model for preset times based on the video to be reconstructed;
the acquisition unit is used for acquiring a target phase parameter based on the predicted phase parameter and a first reference phase parameter, wherein the first reference phase parameter is acquired by adjusting a second reference phase parameter determined by the last cycle iteration training based on each predicted parameter difference acquired by the last cycle iteration training, and each predicted parameter difference is acquired based on each predicted phase parameter acquired by the last cycle iteration training and the second reference phase parameter;
the construction unit is used for respectively constructing a target face model of the target object in the corresponding video frame based on the target looks parameter, the predicted expression parameter and the predicted position parameter corresponding to each video frame.
In an alternative embodiment, the construction unit is specifically configured to:
for each video frame, the following operations are respectively executed:
inputting the target appearance parameter and the predicted expression parameter and predicted position parameter corresponding to one video frame into a preset basic face model to obtain each predicted three-dimensional key point;
and connecting the predicted three-dimensional key points according to a preset topological structure to obtain the target face model corresponding to the one video frame.
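A minimal sketch of these two steps, assuming the preset basic face model is available as a callable and the preset topology is a fixed face-index array:

```python
import numpy as np

def build_target_face_mesh(appearance, expression, position, base_model, faces):
    """Sketch: turn one frame's parameters into a triangle mesh.

    base_model -- assumed callable returning the predicted 3D key points (vertices)
                  for the given appearance / expression / position parameters
    faces      -- fixed (F, 3) array of vertex indices (the preset topology)
    """
    vertices = base_model(appearance, expression, position)  # (N, 3) predicted 3D key points
    # Connecting the predicted 3D key points according to the preset topology simply
    # means pairing the vertex array with the fixed face-index array.
    return {"vertices": np.asarray(vertices), "faces": np.asarray(faces)}
```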
The electronic device provided by the embodiment of the application comprises a processor and a memory, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to execute the steps of any three-dimensional face reconstruction method.
An embodiment of the present application provides a computer readable storage medium including a computer program for causing an electronic device to execute the steps of any one of the three-dimensional face reconstruction methods described above when the computer program is run on the electronic device.
Embodiments of the present application provide a computer program product comprising a computer program stored in a computer readable storage medium; when the processor of the electronic device reads the computer program from the computer readable storage medium, the processor executes the computer program, so that the electronic device executes the steps of any one of the three-dimensional face reconstruction methods described above.
The application has the following beneficial effects:
The embodiments of the present application provide a three-dimensional face reconstruction method and apparatus, an electronic device and a storage medium. A target reconstruction model is obtained by performing a preset number of loop iterations of training on an initial reconstruction model based on a video to be reconstructed that contains a target object. On the one hand, the initial reconstruction model has already been pre-trained, so performing the loop iteration training starting from the initial reconstruction model accelerates model convergence and reduces the loop iteration training time. On the other hand, model training uses the video to be reconstructed that contains the target object, so the model parameters can be adjusted for the features of this specific target object; the resulting target reconstruction model can identify the features of the target object more accurately, the target face model constructed based on the target reconstruction model fits the target object more closely, and the accuracy is effectively improved. In addition, in one loop iteration, the parameters of the initial reconstruction model are adjusted through the prediction parameter difference obtained from the predicted appearance parameter and the reference appearance parameter, so that the target reconstruction model outputs a unified appearance parameter for the target object in different video frames; this strengthens the association between the target face models in different video frames, improves the stability of the target face models, and avoids large appearance differences between the constructed target face models.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1A is a flow chart of a conventional optimization method in the related art;
FIG. 1B is a flow chart of a deep learning method in the related art;
FIG. 2 is an alternative schematic diagram of an application scenario in an embodiment of the present application;
FIG. 3 is a flowchart illustrating a first face reconstruction method according to an embodiment of the present application;
FIG. 4A is a schematic diagram of a data preprocessing method according to an embodiment of the present application;
fig. 4B is a schematic diagram of a 2D face key point in an embodiment of the present application;
FIG. 5A is a schematic diagram of a 3D face model according to an embodiment of the present application;
FIG. 5B is a schematic diagram of a model pre-training method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a parameter adjustment method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of another parameter adjustment method according to an embodiment of the application;
FIG. 8 is a schematic diagram of a method for obtaining a reference appearance parameter according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a video frame preprocessing method according to an embodiment of the present application;
fig. 10A is a flowchart of a second face reconstruction method according to an embodiment of the present application;
fig. 10B is a logic diagram of a face reconstruction method according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a face reconstruction device according to an embodiment of the present application;
fig. 12A is a schematic structural diagram of another face reconstruction device according to an embodiment of the present application;
fig. 12B is a schematic diagram of a hardware component of an electronic device to which the embodiment of the present application is applied;
fig. 13 is a schematic diagram of a hardware composition structure of another electronic device to which the embodiment of the present application is applied.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the technical solutions of the present application, but not all embodiments. All other embodiments, based on the embodiments described in the present document, which can be obtained by a person skilled in the art without any creative effort, are within the scope of protection of the technical solutions of the present application.
Some of the concepts involved in the embodiments of the present application are described below.
Prediction parameter set: contains the prediction parameters from which a 3D face model can be constructed. The 3D face model constructed from a prediction parameter set is called a predicted face model, and the two-dimensional image corresponding to the predicted face model is called a predicted two-dimensional image; during model training, the difference between the predicted two-dimensional image and the face region in the video frame is gradually reduced.
Reference appearance parameter: used to constrain the model so that, for the same target object, the appearance of the 3D face models output for different video frames is consistent. When performing three-dimensional face reconstruction on a video to be reconstructed that contains a target object, each video frame is input into the target reconstruction model for reconstruction; no matter what expression the target object shows in different video frames, its appearance should remain consistent. Therefore, during model training, the reference appearance parameter is used to make the predicted appearance parameters output by the model for different video frames approach the reference appearance parameter, so that the 3D face models corresponding to different video frames have a consistent appearance.
Predicted appearance parameter: characterizes the appearance (looks) attribute of each three-dimensional face key point of the target object. Taking a particular person as the target object as an example, the constraint of the predicted appearance parameter makes the appearance of the constructed 3D face model close to that person and different from other objects.
Predicted expression parameter: characterizes the expression attribute of each three-dimensional face key point of the target object. For the target object, the predicted appearance parameter determines the appearance of its 3D face model, while the predicted expression parameter determines what expression is presented on the face of the target object.
Embodiments of the present application relate to Artificial Intelligence (AI), Natural Language Processing (NLP) and Machine Learning (ML) techniques, and are designed based on computer vision technology and machine learning in artificial intelligence.
Artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence.
Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision. The artificial intelligence technology mainly comprises a computer vision technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions. With research and progress of artificial intelligence technology, artificial intelligence is developed in various fields such as common smart home, intelligent customer service, virtual assistant, smart speaker, smart marketing, unmanned, automatic driving, robot, smart medical, etc., and it is believed that with the development of technology, artificial intelligence will be applied in more fields and become more and more important value.
The computer vision is a science for researching how to make a machine "see", and more specifically, a camera and a computer are used to replace human eyes to identify and measure targets, and the like, and further, graphic processing is performed, so that the computer is processed into images which are more suitable for human eyes to observe or transmit to an instrument to detect. As a scientific discipline, computer vision research-related theory and technology has attempted to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, synchronous positioning, and map construction, among others, as well as common biometric recognition techniques such as face recognition, fingerprint recognition, and others.
Machine learning is a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer can simulate human learning behaviour to acquire new knowledge or skills and reorganize its existing knowledge structure so as to continuously improve its own performance.
Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various fields of artificial intelligence. The core of machine learning is deep learning, which is a technique for realizing machine learning. Machine learning typically includes deep learning, reinforcement learning, transfer learning, inductive learning, artificial neural networks, teaching learning, etc.; deep learning includes CNNs (Convolutional Neural Networks), deep belief networks, recurrent neural networks, auto-encoders, generative adversarial networks and the like. The target reconstruction model in the embodiments of the present application is trained using machine learning or deep learning techniques. Performing three-dimensional face reconstruction based on the target reconstruction model of the embodiments of the present application can improve the accuracy of the constructed 3D face model.
It should be noted that, in the embodiment of the present application, related data such as face data is related, and when the above embodiment of the present application is applied to a specific product or technology, each time the data is acquired, a user license or consent is required to be obtained, and the collection, use and processing of the related data are required to comply with related laws and regulations and standards of related countries and regions.
The following briefly describes the design concept of the embodiment of the present application:
three-dimensional face reconstruction refers to reconstructing a 3D model of a face from one or more 2D images. In the field of computer vision, three-dimensional face reconstruction is a direction with research value, and the high-quality reconstruction of the three-dimensional face has important significance in the fields of face recognition, anti-counterfeiting, game entertainment, movie animation, beauty and medical treatment and the like.
In the related art, three-dimensional face reconstruction is mainly performed based on a parameterized 3D face template by the following two methods:
method one, the traditional optimization method:
In the conventional optimization method, 2D face key points are first extracted from an input image (or video) to obtain their 2D coordinates. Then various optimization algorithms are designed to iteratively optimize the prediction parameters of a parameterized 3D face template; referring to fig. 1A, which is a flow diagram of the conventional optimization method in the related art, the prediction parameters are substituted into the parameterized 3D face template to generate a 3D face model of the object, and the optimization target is mainly that the geometric error between the 2D re-projection of the 3D face model and the extracted 2D face key points reaches a minimum.
However, in the conventional optimization method, because the parameters are solved frame by frame or over a small number of frames of the input video, the stability of the optimization result cannot be well maintained: the finally reconstructed 3D face model tends to show slight jitter and insufficient inter-frame stability. In addition, each optimization requires many optimization steps, so the overall time consumption is long when the number of video frames is large.
Method two, a deep learning method:
Referring to fig. 1B, which is a flow chart of the deep learning method in the related art, 2D face key points are first extracted from an input image (or video); then, in the model training stage, the parameters of the 3D face model are predicted by a deep learning model (such as the ResNet series), the predicted parameters are substituted into a parameterized 3D face template to generate a 3D face model of the object, and the parameters of the deep learning model are continuously adjusted so that the error between the coordinates of the 2D re-projection in the input image and the coordinates of the corresponding 2D key points reaches a minimum; model training is performed with a deep learning framework (such as PyTorch or TensorFlow).
However, in the deep learning method, the face features of different objects differ greatly, while the scene distribution of the training data used during model training is limited. If the target image is not used during model training, the 3D face model generated by the deep learning model differs to some extent from the face actually contained in the target image, the accuracy of the generated 3D face model is low, and the desired effect is therefore difficult to achieve when the 3D face model is used for digital technology development.
In addition, in the field of 3D face reconstruction, besides the above-mentioned methods based on a parameterized 3D face template, there are various methods using discrete 3D models, such as MVS (Multi-View Stereo), which has been developed for decades, and methods such as Neural Rendering and Distance Function Field, which have developed rapidly in recent years; the problems of the deep learning method described above still remain.
In view of this, the embodiments of the present application provide a three-dimensional face reconstruction method and apparatus, an electronic device and a storage medium, in which a target reconstruction model is obtained by performing a preset number of loop iterations of training on an initial reconstruction model based on a video to be reconstructed that contains a target object. On the one hand, the initial reconstruction model has already been pre-trained, so performing the loop iteration training starting from the initial reconstruction model accelerates model convergence and reduces the loop iteration training time. On the other hand, model training uses the video to be reconstructed that contains the target object, so the model parameters can be adjusted for the features of this specific target object; the resulting target reconstruction model can identify the features of the target object more accurately, the target face model constructed based on the target reconstruction model fits the target object more closely, and the accuracy is effectively improved. In addition, in one loop iteration, the parameters of the initial reconstruction model are adjusted through the prediction parameter difference obtained from the predicted appearance parameter and the reference appearance parameter, so that the target reconstruction model outputs a unified appearance parameter for the target object in different video frames; this strengthens the association between the target face models in different video frames, improves the stability of the target face models, and avoids large appearance differences between the constructed target face models.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and not for limitation of the present application, and embodiments of the present application and features of the embodiments may be combined with each other without conflict.
Fig. 2 is a schematic diagram of an application scenario according to an embodiment of the present application. The application scenario diagram includes two terminal devices 210 and a server 220.
In the embodiments of the present application, the terminal device includes, but is not limited to, mobile phones, tablet computers, notebook computers, desktop computers, e-book readers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals and other devices; the terminal device may be provided with a client related to three-dimensional face reconstruction, where the client may be software (such as a browser or animation software), or a web page, an applet, etc., and the server may be the background server corresponding to the software, web page or applet, or a server specifically used for three-dimensional face reconstruction, which is not specifically limited in the present application. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data and artificial intelligence platforms.
It should be noted that, the three-dimensional face reconstruction method in the embodiment of the present application may be executed by an electronic device, and the electronic device may be a server or a terminal device, that is, the method may be executed by the server or the terminal device alone or may be executed by the server and the terminal device together. For example, when the server and the terminal equipment execute together, the server uses the constructed training data set to pretrain the reconstruction model to be trained to obtain an initial reconstruction model, and sends the initial reconstruction model to the terminal equipment, the terminal equipment obtains the video to be reconstructed containing the target object, and obtains the initial reconstruction model, and based on the video to be reconstructed, the initial reconstruction model is subjected to cyclic iterative training for a preset number of times to obtain the target reconstruction model.
In the embodiment of the application, partial segments can be selected from the video input by the user to optimize the initial reconstruction model, and the optimized initial reconstruction model is directly used for predicting parameters of each video frame instead of lengthy iterative optimization, so that the time for predicting the parameters is greatly reduced, and the overall time consumption is reduced.
In an alternative embodiment, the communication between the terminal device and the server may be via a communication network.
In an alternative embodiment, the communication network is a wired network or a wireless network.
It should be noted that, the number of terminal devices and servers shown in fig. 2 is merely illustrative, and the number of terminal devices and servers is not limited in practice, and is not particularly limited in the embodiment of the present application.
In the embodiment of the application, when the number of the servers is multiple, the multiple servers can be formed into a blockchain, and the servers are nodes on the blockchain; according to the three-dimensional face reconstruction method disclosed by the embodiment of the application, the related training data set and the target face model of the target object can be stored on the blockchain.
In addition, the embodiment of the application can be applied to various scenes, including not only three-dimensional face reconstruction scenes, but also cloud technology, artificial intelligence, intelligent traffic, auxiliary driving and other scenes. When the three-dimensional face reconstruction method in the embodiment of the application is applied to an artificial intelligent scene, a 3D face model of the character can be constructed according to the character contained in the video to be reconstructed and input by a user, the virtual image generated based on the constructed 3D face model can interact with the user, and the character in the video can be subjected to face changing based on the constructed 3D face model, so that the personalized requirement of the user is met.
The three-dimensional face reconstruction method in the embodiments of the present application can also be applied to digital-intelligence related products, such as virtual anchors, film and animation character production and game character face driving. Using the three-dimensional face reconstruction method in the embodiments of the present application, the 2D face captured in a video can be made into a 3D face model with a regular topology, and various given expressions can be applied to the 3D face model, so as to achieve purposes such as driving and simulation.
For example, when the three-dimensional face reconstruction method in the embodiment of the present application is applied to film and animation character production, a video to be reconstructed containing the target object Xiaobai and an initial reconstruction model are first obtained, and a preset number of loop iterations of training are performed on the initial reconstruction model based on the video to be reconstructed to obtain a target reconstruction model. Then 3 video frames of the video to be reconstructed are respectively input into the target reconstruction model to obtain the output predicted appearance parameter s, as well as the predicted expression parameter e1 and predicted position parameter l1 of video frame 1, the predicted expression parameter e2 and predicted position parameter l2 of video frame 2, and the predicted expression parameter e3 and predicted position parameter l3 of video frame 3. Finally, target face model 1 in video frame 1 is constructed based on s, e1 and l1, target face model 2 in video frame 2 is constructed based on s, e2 and l2, and target face model 3 in video frame 3 is constructed based on s, e3 and l3, and, according to s and the first reference appearance parameter sc, target face models 1-3 can be made into the corresponding cartoon figure of Xiaobai.
By comparison, making a topologically regular and drivable 3D face model purely by hand with professional 3D software (such as Blender or Maya) usually takes a technician several days to a week, while the three-dimensional face reconstruction method in the embodiment of the present application takes only about half an hour of computation; combined with a small amount of post-processing fine-tuning to meet certain specific requirements, the whole process can be shortened to within one day.
The three-dimensional face reconstruction method provided by the exemplary embodiment of the present application will be described below with reference to the accompanying drawings in conjunction with the above-described application scenario, and it should be noted that the above-described application scenario is only shown for the convenience of understanding the spirit and principle of the present application, and the embodiment of the present application is not limited in any way in this respect.
Referring to fig. 3, a flowchart of an implementation of a method for reconstructing a three-dimensional face according to an embodiment of the present application is shown by taking an execution subject as a server, where the implementation flow of the method includes steps S31 to S33 as follows:
S31: the server obtains a video to be reconstructed containing a target object and an initial reconstruction model;
the video to be reconstructed may include only the target object, or may include the target object and other objects, and if the video to be reconstructed further includes other objects, face detection needs to be performed on a video frame in the video to be reconstructed, and a face region including the target object is detected, so as to perform three-dimensional face reconstruction on the target object.
The initial reconstruction model is obtained by pre-training a reconstruction model to be trained by using a constructed training data set, and after pre-training, the initial reconstruction model can predict parameters of three-dimensional key points of a face of an object in an image for each input image.
In the present application, in order to improve the accuracy of the 3D face model constructed for the target object, loop iterative training is also performed on the initial reconstructed model through step S32.
S32: the method comprises the steps that a server executes loop iteration training for a preset number of times on an initial reconstruction model based on a video to be reconstructed to obtain a target reconstruction model, so that a target face model of a target object in each video frame contained in the video to be reconstructed is respectively constructed based on the target reconstruction model, wherein in one loop iteration, the following operations are executed:
s321: respectively carrying out feature recognition on each video frame to obtain a corresponding prediction parameter set;
The prediction parameter set can be used to construct a predicted face model of the target object; each prediction parameter in the set characterizes one attribute of the three-dimensional face key points of the target object in the corresponding video frame. The prediction parameter set contains at least a predicted appearance parameter, which characterizes the appearance attribute of each three-dimensional face key point and constrains the appearance of the constructed 3D face model to be the same as the target object and different from other objects. Another example is the predicted expression parameter, which characterizes the expression attribute of each three-dimensional face key point and constrains the expression of the constructed 3D face model; the predicted expression parameters corresponding to different expressions of the target object are different.
The initial reconstruction model performs feature recognition on one video frame at a time, outputs a prediction parameter set corresponding to the video frame, and takes the case that the video to be reconstructed comprises 3 frames of video frames as an example, outputs a prediction parameter set 1 for the video frame 1, outputs a prediction parameter set 2 for the video frame 2, and outputs a prediction parameter set 3 for the video frame 3.
S322: for each prediction parameter set, the following operations are performed: obtaining a prediction parameter difference based on the predicted appearance parameter contained in one prediction parameter set and a reference appearance parameter determined based on the current iteration number, and adjusting the parameters of the initial reconstruction model based on the prediction parameter difference;
Specifically, the reference appearance parameter is related to the current iteration number, and the reference appearance parameters determined at different iteration numbers differ; for example, the reference appearance parameter determined in the first loop iteration is s1 and the one determined in the second loop iteration is s2, which are not listed here one by one. When adjusting the parameters of the initial reconstruction model, one parameter adjustment is performed for each prediction parameter set in combination with the reference appearance parameter; for example, one parameter adjustment is performed by combining prediction parameter set 1 with the reference appearance parameter, another by combining prediction parameter set 2 with the reference appearance parameter, and a third by combining prediction parameter set 3 with the reference appearance parameter, as sketched below.
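For illustration, a compact PyTorch-style sketch of one loop iteration covering S321 and S322; the model interface, the simplified appearance-only difference and the omission of the reference-parameter adjustment step in later iterations are assumptions made for brevity, not the patent's exact implementation:

```python
import torch

def finetune_iteration(model, optimizer, video_frames, iteration, prev_ref_app):
    """One loop iteration of S321-S322 (sketch; the loss is a simplified placeholder)."""
    # Reference appearance parameter: mean of the predicted appearance parameters on
    # iteration 1, otherwise the parameter carried over from the previous iteration.
    if iteration == 1:
        with torch.no_grad():
            apps = torch.stack([model(frame)["appearance"] for frame in video_frames])
        ref_app = apps.mean(dim=0)
    else:
        ref_app = prev_ref_app

    # S321 + S322: feature recognition per frame, then one parameter adjustment
    # per prediction parameter set.
    for frame in video_frames:
        params = model(frame)  # assumed: dict with 'appearance', 'expression', 'position', ...
        # Prediction parameter difference based on predicted vs. reference appearance
        # (the full method also uses semantic and position differences).
        diff = torch.mean((params["appearance"] - ref_app) ** 2)
        optimizer.zero_grad()
        diff.backward()
        optimizer.step()
    return ref_app
```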
In the embodiments of the present application, the target reconstruction model is obtained by performing a preset number of loop iterations of training on the initial reconstruction model based on the video to be reconstructed that contains the target object. On the one hand, the initial reconstruction model has already been pre-trained, so performing the loop iteration training starting from the initial reconstruction model accelerates model convergence and reduces the loop iteration training time. On the other hand, model training uses the video to be reconstructed that contains the target object, so the model parameters can be adjusted for the features of this specific target object; the resulting target reconstruction model can identify the features of the target object more accurately, the target face model constructed based on the target reconstruction model fits the target object more closely, and the accuracy is effectively improved. In addition, in one loop iteration, the parameters of the initial reconstruction model are adjusted through the prediction parameter difference obtained from the predicted appearance parameter and the reference appearance parameter, so that the target reconstruction model outputs a unified appearance parameter for the target object in different video frames; this strengthens the association between the target face models in different video frames, improves the stability of the target face models, and avoids large appearance differences between the constructed target face models.
After the target reconstruction model is obtained, each video frame contained in the video to be reconstructed is input into the target reconstruction model, and the target face model of the target object in each video frame is respectively constructed.
Specifically, the target reconstruction model outputs a prediction parameter set for each video frame, and each prediction parameter set is used to construct one target face model; for example, target face model 1 of the target object in the first video frame is constructed according to prediction parameter set 1, target face model 2 of the target object in the second video frame is constructed according to prediction parameter set 2, and target face model 3 of the target object in the third video frame is constructed according to prediction parameter set 3.
In order to obtain the initial reconstruction model, a training data set is required to be constructed, firstly, large-scale image data are collected and arranged, then, the collected image data are preprocessed, referring to fig. 4A, a schematic diagram of a data preprocessing method in an embodiment of the present application is shown, which includes the following three steps:
s41: performing face detection on the video to obtain a face area;
For a given video, it is necessary to ensure that only one face exists in each video frame. If a video frame contains multiple faces, a face detector (such as Face Box) can be used to detect the region where each face is located and separate the regions in the form of rectangular boxes. Through this step, new videos can be obtained, each of which contains only the face region I of one face.
S42: 2D face key points are extracted from the face area, and position information and semantic information of the 2D face key points are obtained;
For the video obtained in step S41, a face alignment extractor is used to extract the 2D key points on the face in each video frame. The information contained in the 2D key points is defined as $\{(X_i, l_i)\}_{i=1}^{M}$, where $M$ is the number of 2D key points (for example 68 or 256), $X_i \in \mathbb{R}^2$ is the pixel coordinates of key point $i$ in the video frame, and $l_i$ is the semantic information of that point, generally one of five categories: eyebrow, eye, nose, mouth and contour.
S43: and carrying out face analysis on the face area to obtain semantic information of each pixel point.
For the video obtained in step S41, a face parsing tool is used to extract, for each video frame, a mask over the face range, keeping the positions of the face skin, eyebrows, eyes, nose, mouth, etc., so as to obtain the semantic information of each pixel, denoted $M_I$.
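A sketch of the S41-S43 preprocessing for one frame; the three callables stand in for the face detection, key-point extraction and face parsing tools mentioned above, and their interfaces are illustrative assumptions only:

```python
def preprocess_frame(frame, face_detector, keypoint_extractor, face_parser):
    """Sketch of the S41-S43 preprocessing for one video frame (frame assumed to be an image array)."""
    # S41: face detection -> rectangular face region I (one face per region).
    x, y, w, h = face_detector(frame)               # assumed to return (x, y, w, h)
    face_region = frame[y:y + h, x:x + w]

    # S42: 2D face key points {(X_i, l_i)}: pixel coordinates plus semantic labels
    # (eyebrow, eye, nose, mouth, contour).
    keypoints = keypoint_extractor(face_region)     # assumed: list of ((u, v), label)

    # S43: face parsing -> per-pixel semantic mask M_I over the face region.
    mask = face_parser(face_region)

    return {"face_region": face_region, "keypoints": keypoints, "mask": mask}
```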
Referring to fig. 4B, which is a schematic diagram of 2D face key points in an embodiment of the present application, 2D face key points are extracted from an image to obtain the key points shown in fig. 4B, which include five kinds of key points: eyebrows, eyes, nose, mouth and contour.
It should be noted that the above data preprocessing steps S41 to S43 are only illustrative; in fact, part of the data preprocessing may be done with auxiliary design work and manual labeling, or with other open-source 2D face key point detection algorithms, which is not limited herein.
After data preprocessing, the information contained in each video frame is obtained: the face region I, the face region mask $M_I$, and the 2D key points $\{(X_i, l_i)\}_{i=1}^{M}$. Taking the video frame as a sample image, the semantic information of each pixel in the sample image and the position information and semantic information of each 2D key point are obtained. Each training sample in the constructed training data set therefore includes a sample image, the semantic information and position information of each 2D key point in the sample image, and the semantic information of the pixels other than the 2D key points.
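For clarity, the per-frame information can be grouped into a training-sample record such as the following sketch; the field names are illustrative and not taken from the patent:

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class TrainingSample:
    """One training sample as described above (field names are illustrative)."""
    sample_image: np.ndarray        # face region I of the video frame
    pixel_semantics: np.ndarray     # per-pixel semantic mask M_I
    keypoint_positions: np.ndarray  # (M, 2) pixel coordinates X_i of the 2D key points
    keypoint_semantics: List[str]   # semantic label l_i of each 2D key point
```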
After obtaining the training data set, in an alternative embodiment, based on training samples in the training data set, a loop iterative training is performed on the reconstruction model to be trained, obtaining an initial reconstruction model. Taking a cyclic iteration as an example, the following steps are performed:
firstly, carrying out feature recognition on a sample image contained in a training sample to obtain a sample parameter set, constructing a sample face model based on sample appearance parameters and sample expression parameters in the sample parameter set, and constructing a second regular function based on the sample appearance parameters, the sample expression parameters and sample texture parameters in the sample parameter set; then, based on each sample two-dimensional key point contained in the sample two-dimensional image corresponding to the sample face model and each reference two-dimensional key point contained in the sample image, sample semantic difference and sample position difference between the sample two-dimensional key points and the reference two-dimensional key points are obtained; and finally, constructing a third loss function by using the sample semantic difference and the sample position difference, and carrying out parameter adjustment on the reconstruction model to be trained based on the second regular function and the third loss function.
Specifically, the sample appearance parameters and the sample expression parameters are substituted into the parameterized 3D face template, so that a sample three-dimensional model can be obtained. The parameterized 3D face template Θ may be defined as follows:
The template comprises $k_1$ basis vectors characterizing the identity (looks) of the face, $k_2$ basis vectors characterizing the facial expression, and $k_3$ basis vectors characterizing the face texture; $s \in \mathbb{R}^{k_1}$ is the face identity coefficient (looks parameter), $e \in \mathbb{R}^{k_2}$ is the facial expression coefficient (expression parameter), and $g \in \mathbb{R}^{k_3}$ is the texture coefficient (texture parameter). Each time a group $(s, e)$ is given, a corresponding geometric 3D face model $\{V_j\}_j$ can be calculated; meanwhile, when $g$ is additionally given, the texture color value $C_j \in [0,1]^3$ corresponding to each 3D point $V_j$ can be obtained. In addition, the vertices $\{V_j\}_j$ of the parameterized 3D face template are connected using a defined triangular patch connection relation, which is a fixed topology. Currently, there are a number of different parameterized 3D face templates, and the embodiments of the present application use a self-developed parameterized 3D face template.
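A minimal sketch of a standard linear parametric face model of this kind is given below; the function name, the explicit mean terms and the basis matrices are illustrative assumptions and do not describe the specific template used by the application.

```python
import numpy as np

def parametric_face(s, e, g, mean_shape, basis_id, basis_exp, mean_tex, basis_tex):
    """Linear parametric face sketch: geometry from (s, e), per-vertex color from g.

    mean_shape, mean_tex: (3N,); basis_id: (3N, k1); basis_exp: (3N, k2); basis_tex: (3N, k3).
    Returns N x 3 vertices {V_j} and N x 3 colors {C_j} clipped to [0, 1].
    """
    vertices = (mean_shape + basis_id @ s + basis_exp @ e).reshape(-1, 3)
    colors = np.clip((mean_tex + basis_tex @ g).reshape(-1, 3), 0.0, 1.0)
    return vertices, colors
```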
Referring to fig. 5A, a schematic diagram of a 3D face model in an embodiment of the present application is shown, where the 3D face model is composed of a plurality of 3D vertices and a plurality of faces, and the 3D face model shown in fig. 5A can be obtained through a looks parameter and an expression parameter.
Because the model training process is prone to over-fitting, a second regular function is constructed based on the sample appearance parameters, sample expression parameters and sample texture parameters, and is used for enhancing the compactness of the predicted sample appearance parameters, sample expression parameters and sample texture parameters. For example, for the sample appearance parameter, the second regular function may constrain its 100th dimension to be close to 0.
The sample two-dimensional key points may refer to all pixel points in the sample two-dimensional image, or may refer to the two-dimensional key points, among all the pixel points, that are preset for face key point detection. Therefore, the sample semantic difference may be the difference between the semantic information of the pixel points in the sample two-dimensional image and the semantic information of the pixel points in the sample image, or may be the difference between the semantic information of the two-dimensional key points in the sample two-dimensional image and the semantic information of the two-dimensional key points in the sample image.
For example, the sample two-dimensional image 1 includes a pixel 1, a pixel 2, and a pixel 3, where the pixel 2 is a two-dimensional key point, the sample image includes a pixel 4, a pixel 5, and a pixel 6, and the sample semantic difference includes a semantic difference between the pixel 1 and the pixel 4, a semantic difference between the pixel 2 and the pixel 5, a semantic difference between the pixel 3 and the pixel 6, or the sample semantic difference includes only a semantic difference between the pixel 2 and the pixel 5.
Since face extraction can only obtain the semantic information of the pixel points in each face region, only the two-dimensional key points have position information. Therefore, the sample position difference refers to the difference between the position information of the two-dimensional key points in the sample two-dimensional image and the position information of the corresponding pixel points in the sample image. Taking sample two-dimensional image 1 as an example, the sample position difference includes the position difference between pixel point 2 and pixel point 5.
After the sample semantic difference and the sample position difference are obtained, a third loss function is constructed and weight coefficients are preset; the third loss function is then fused with the second regular function based on the set of weight coefficients to obtain a fourth loss function, and parameter adjustment is performed on the reconstruction model to be trained based on the fourth loss function.
For example, if the weight coefficient of the third loss function a is preset to be 0.6 and the weight coefficient of the second regular function b is preset to be 0.4, the fourth loss function=0.6a+0.4b.
In a specific implementation, ResNet50 is first selected as the basic backbone network (the reconstruction model to be trained), the input image size is 256×256, and the last regression layer outputs $k_1 + k_2 + k_3 + 6$ values, whose predicted value is $(s, e, g, r, t)$, where $s$ and $e$ are the identity coefficient (predicted looks coefficient) and expression coefficient (predicted expression parameter) respectively, $g$ is the texture coefficient (predicted texture parameter), and $r \in \mathbb{R}^3$, $t \in \mathbb{R}^3$ are the rotation and translation in the camera parameters (predicted position parameters). The network is then trained on the training data set to obtain a pre-trained model $f_0(\theta \mid I)$ (the initial reconstruction model) capable of predicting the corresponding $(s, e, g, r, t)$ for each input image, where $\theta$ represents the learnable parameters of the model.
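A minimal sketch of such a regression backbone is shown below, assuming a PyTorch/torchvision setup; the class name and the exact split of the output vector are assumptions for illustration rather than the application's actual network definition.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ReconstructionBackbone(nn.Module):
    """ResNet50 with a single regression layer predicting (s, e, g, r, t)."""
    def __init__(self, k1, k2, k3):
        super().__init__()
        net = resnet50(weights=None)                       # 256x256 face crops as input
        net.fc = nn.Linear(net.fc.in_features, k1 + k2 + k3 + 6)
        self.net = net
        self.sizes = [k1, k2, k3, 3, 3]                    # looks, expression, texture, r, t

    def forward(self, image):                              # image: (B, 3, 256, 256)
        s, e, g, r, t = torch.split(self.net(image), self.sizes, dim=1)
        return s, e, g, r, t
```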
For the $(s, e, g, r, t)$ learned in each iteration, the corresponding 3D points $\{V_j\}_j$ can be calculated from the parameterized 3D face template $\Theta$. The $M$ points $\{V_{j_i}\}$ that semantically correspond to the 2D key points $\{X_i\}$ of the input image are taken out and projected onto the input image using the camera projection model $T_{j_i} = \mathrm{Proj}(r, t \mid V_{j_i})$, obtaining 2D points $T_{j_i} \in \mathbb{R}^2$ on the input image. Furthermore, the color value $C_j$ of each vertex $V_j$ is calculated using $g$, and a differentiable renderer is used to render $\{V_j, C_j\}$ according to its topology to generate an image $I'$. The loss function of this pre-training process is as follows:
$$L_1 = w_1 L_{lm} + w_2 L_r + w_3 L_{reg}.$$
where the weights $w_1, w_2, w_3$ take the values 10, 1 and 0.001 respectively. Each loss term is: $L_{lm} = \mathrm{MSE}(X_i, \mathrm{Proj}(V_{j_i}))$, which constrains the reconstructed 3D key points $V_{j_i}$, after being projected onto the input image, to stay as consistent as possible with the 2D key points $X_i$ of the original input image; $L_r = \mathrm{MSE}(M_I \odot I, M_I \odot I')$, which constrains the rendered image $I'$ and the input image $I$ to stay as consistent as possible within the face mask region $M_I$; and $L_{reg} = L_2(s, e, g)$ (the second regular function), an $L_2$ regularization of the three coefficients, used to enhance the compactness of the predicted identity parameters, expression parameters and texture parameters.
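The pre-training loss can be sketched as follows; the tensors are assumed to be produced by the projection and rendering steps above, and the use of mean-based $L_2$ terms is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def pretrain_loss(proj_keypoints, gt_keypoints, rendered, image, face_mask, s, e, g,
                  weights=(10.0, 1.0, 0.001)):
    """L1 = w1*L_lm + w2*L_r + w3*L_reg, as described above (a sketch, not the exact code)."""
    w1, w2, w3 = weights
    l_lm = F.mse_loss(proj_keypoints, gt_keypoints)             # keypoint reprojection term
    l_r = F.mse_loss(rendered * face_mask, image * face_mask)   # photometric term inside the face mask
    l_reg = s.square().mean() + e.square().mean() + g.square().mean()  # L2 regularization of (s, e, g)
    return w1 * l_lm + w2 * l_r + w3 * l_reg
```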
Based on the above manner, the reconstruction model to be trained is pre-trained. The compactness of the predicted parameters is enhanced through the second regular function, making the constructed 3D face model smoother, and the accuracy of the predicted parameters is improved through the third loss function, making the constructed 3D face model more accurate. Therefore, when loop iterative training is subsequently performed on the initial reconstruction model, model convergence is accelerated and the accuracy of the target face model is improved.
Referring to fig. 5B, which is a schematic diagram of a model pre-training method in an embodiment of the present application: in one loop iteration, a sample image is input into the reconstruction model to be trained, and the reconstruction model outputs a predicted sample looks parameter s1 and a sample expression parameter e1; s1 and e1 are substituted into the parameterized 3D face template to obtain a sample face model; among the 3D vertices of the sample face model, the 3D key points semantically corresponding to the reference two-dimensional key points in the sample image are projected into the sample image to obtain the sample two-dimensional key points; the position difference is determined based on the position information of the sample two-dimensional key points and the position information of the reference two-dimensional key points, the semantic difference is determined based on the semantic information of the sample two-dimensional key points and the semantic information of the reference two-dimensional key points, and parameter adjustment is performed based on the semantic difference and the position difference. The initial reconstruction model is obtained through multiple rounds of loop iteration.
In an alternative embodiment, as shown in fig. 6, the initial reconstruction model is parameter-adjusted in step S322 through the following steps:
S61: fusing the reference phase parameters with the predicted phase parameters to obtain corresponding fused phase parameters;
S62: based on the fusion looks parameter and the predicted expression parameter, constructing a predicted face model of the target object in the corresponding video frame;
S63: based on each predicted two-dimensional key point contained in the predicted two-dimensional image corresponding to the predicted face model and each reference two-dimensional key point contained in the video frame, semantic difference and position difference between the predicted two-dimensional key points and the reference two-dimensional key points are obtained;
S64: and carrying out parameter adjustment on the initial reconstruction model based on the semantic difference and the position difference.
The prediction parameter set further comprises a predicted expression parameter, and the prediction parameter difference comprises a semantic difference and a position difference. The reference phase parameter represents the expected value of the phase (looks) attribute of the three-dimensional face key points, and is used to constrain the model so that, for the same target object, the phases of the 3D face models in different video frames are consistent. When three-dimensional face reconstruction is performed on a video to be reconstructed containing the target object, each video frame needs to be input into the target reconstruction model separately for reconstruction, and no matter what expression the target object has in different video frames, its phase should be consistent. Therefore, in the model training process, the reference phase parameter is used to make the predicted phase parameters output by the model for different video frames approach the reference phase parameter, so that the phases of the 3D face models corresponding to different video frames are consistent.
Specifically, the reference phase parameter $s_c$ is fused with the predicted phase parameter $s$ to obtain a combined identity coefficient $s^* = w_0 s_c + (1 - w_0)\,s$; the full combined parameters $(s^*, e, g, r, t)$ are then passed to the subsequent parameterized 3D face template $\Theta$, projection model $\mathrm{Proj}$ and renderer, the loss function is calculated, gradients are back-propagated and the parameters are updated. Here $w_0$ is the weight of the successive approximation, determined by the iteration step $d = 1, 2, \ldots, D, D+1, \ldots, 2D$: one value is used when $d \le D$, and another when $D < d \le 2D$.
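A minimal sketch of this fusion step is given below; the schedule of $w_0$ over the iteration steps is supplied by the caller, since the specific values used by the application are not reproduced here.

```python
def fuse_identity(s_ref, s_pred, w0):
    """Combined identity coefficient s* = w0 * s_c + (1 - w0) * s.

    w0 follows the step-dependent schedule described above (one value for steps d <= D,
    another for D < d <= 2D); the fused result, together with (e, g, r, t), feeds the
    parameterized template, projection model and renderer.
    """
    return w0 * s_ref + (1.0 - w0) * s_pred
```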
In the model training process, it is expected that the predicted looks parameter output by the model gradually approaches the corresponding reference looks parameter. Therefore, the reference looks parameter and the predicted looks parameter are fused, a predicted face model is constructed based on the fused looks parameter and the predicted expression parameter, and parameter adjustment is performed.
The predicted two-dimensional key points may refer to all pixel points in the predicted two-dimensional image, or may refer to the two-dimensional key points, among all the pixel points, that are preset for face key point detection. Therefore, the predicted semantic difference may be the difference between the semantic information of each pixel point in the predicted two-dimensional image and the semantic information of each pixel point in the video frame, or may be the difference between the semantic information of each two-dimensional key point in the predicted two-dimensional image and the semantic information of each two-dimensional key point in the video frame.
For example, the predicted two-dimensional image 1 includes pixel point 7, pixel point 8 and pixel point 9, where pixel point 8 is a two-dimensional key point, and the video frame includes pixel point 10, pixel point 11 and pixel point 12. The predicted semantic difference then includes the semantic difference between pixel point 7 and pixel point 10, the semantic difference between pixel point 8 and pixel point 11, and the semantic difference between pixel point 9 and pixel point 12; alternatively, the predicted semantic difference includes only the semantic difference between pixel point 8 and pixel point 11.
Since face extraction can only obtain the semantic information of the pixel points in each face region, only the two-dimensional key points have position information. Therefore, the predicted position difference refers to the difference between the position information of the two-dimensional key points in the predicted two-dimensional image and the position information of the corresponding pixel points in the video frame. Still taking predicted two-dimensional image 1 as an example, the predicted position difference includes the position difference between pixel point 8 and pixel point 11.
In the embodiment of the application, the prediction semantic difference refers to the difference between the semantic information of each pixel point in the predicted two-dimensional image and the semantic information of each pixel point in the video frame.
Based on the above manner, the method uses the idea of joint optimization: the target variables of traditional optimization (the identity parameters and expression parameters) and the network parameters of the deep learning model are all treated as optimization variables, and the identity and expression parameters obtained as optimization variables and those obtained through deep learning prediction constrain and supervise each other, thereby improving the accuracy of the parameters predicted by the model.
Referring to fig. 7, a schematic diagram of another parameter adjustment method in an embodiment of the present application is shown, where a predicted parameter set includes a predicted looks parameter s2 and a predicted expression parameter e2, s2 is fused with a current reference looks parameter sc1 to obtain a fused looks parameter scs, scs and e2 are substituted into a parameterized 3D face template to obtain a predicted face model, a semantic difference and a position difference are obtained based on a predicted two-dimensional key point corresponding to the predicted face model and a reference two-dimensional key point corresponding to a video frame, and finally parameter adjustment is performed based on the semantic difference and the position difference.
In an alternative embodiment, the reference phase parameters are determined by:
if the current iteration number is 1, determining a reference phase parameter based on the average value of the prediction phase parameters contained in each prediction parameter set; otherwise, based on the semantic difference and the position difference obtained in the previous loop iteration, the reference phase parameters determined in the previous loop iteration are adjusted, and the reference phase parameters are determined.
Specifically, if the current iteration number is 1, it indicates the first iteration, and the reference phase parameter is determined based on the average value of the predicted phase parameters. For example, if prediction parameter set 1 contains a predicted phase parameter of 11, prediction parameter set 2 contains a predicted phase parameter of 12, and prediction parameter set 3 contains a predicted phase parameter of 13, the reference phase parameter is 12. In subsequent loop iterations, the reference phase parameter determined in the previous loop iteration is adjusted using a gradient descent algorithm to obtain the current reference phase parameter.
It should be noted that the above description takes adjusting the reference phase parameter with a gradient descent algorithm as an example. In fact, other (non-)convex optimization algorithms, such as the Gauss-Newton algorithm or quasi-Newton methods, may also be used; the computational complexity of these methods is higher while the precision gain is limited, and the present application is not limited in this respect.
Based on the above manner, the reference phase parameter is iteratively updated so that the predicted phase parameter output by the initial reconstruction model gradually approaches the reference phase parameter. In this way, the final target reconstruction model can solve for a common phase parameter for the entire video and for a face shape that is as realistic as possible for the input video, improving the stability of the inter-frame phase and the accuracy of the constructed target face model.
Referring to fig. 8, a schematic diagram of a method for obtaining a reference phase parameter according to an embodiment of the present application includes the following steps:
S801: judging whether the current iteration number is 1, if yes, executing step S802, and if not, executing step S804;
S802: obtaining each predicted phase parameter output by an initial reconstruction model;
S803: taking the average value of each predicted phase parameter as a reference phase parameter;
S804: acquiring a reference phase parameter determined by the previous loop iteration;
S805: and adjusting the acquired reference phase parameters by using a gradient descent algorithm to obtain the current reference phase parameters.
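A minimal sketch of steps S801 to S805 is given below; the learning rate and the source of the gradient are illustrative assumptions, since the text only states that a gradient descent update is applied.

```python
import torch

def update_reference_looks(iteration, predicted_looks, prev_ref=None, grad=None, lr=1e-2):
    """Mean of the predicted looks parameters on the first iteration (S802-S803),
    otherwise a gradient-descent update of the previous reference value (S804-S805)."""
    if iteration == 1:
        return torch.stack(list(predicted_looks), dim=0).mean(dim=0)
    return prev_ref - lr * grad
```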
In an alternative embodiment, step S63 may be implemented as steps S631-S634:
S631: based on the predicted two-dimensional image, obtaining respective predicted semantic information and predicted position information of each predicted two-dimensional key point;
S632: based on the video frame, obtaining respective reference semantic information and reference position information of each reference two-dimensional key point;
S633: determining semantic differences based on differences between each piece of predicted semantic information and each piece of reference semantic information;
S634: based on the differences between each predicted position information and each reference position information, a position difference is determined.
Specifically, when determining the semantic difference, it is determined based on the difference between the predicted semantic information of each predicted two-dimensional key point and the reference semantic information of the corresponding reference two-dimensional key point. For example, the predicted two-dimensional image contains predicted two-dimensional key point 1, predicted two-dimensional key point 2 and predicted two-dimensional key point 3, and the video frame contains reference two-dimensional key point 4, reference two-dimensional key point 5 and reference two-dimensional key point 6, where predicted two-dimensional key point 1 corresponds to reference two-dimensional key point 4, predicted two-dimensional key point 2 corresponds to reference two-dimensional key point 5, and predicted two-dimensional key point 3 corresponds to reference two-dimensional key point 6. The semantic difference is then composed of the difference between the predicted semantic information of predicted two-dimensional key point 1 and the reference semantic information of reference two-dimensional key point 4, the difference between the predicted semantic information of predicted two-dimensional key point 2 and the reference semantic information of reference two-dimensional key point 5, and the difference between the predicted semantic information of predicted two-dimensional key point 3 and the reference semantic information of reference two-dimensional key point 6. Accordingly, when determining the position difference, it is determined based on the difference between the predicted position information of each predicted two-dimensional key point and the reference position information of the corresponding reference two-dimensional key point.
In an alternative embodiment, step S631 may be implemented as:
face analysis is carried out on the predicted two-dimensional image, and respective predicted semantic information of each predicted two-dimensional key point is obtained; and projecting each predicted two-dimensional key point to the video frame based on the predicted position parameter, and obtaining the predicted position information of each predicted two-dimensional key point on the video frame.
The prediction parameter set further comprises a predicted position parameter. The predicted two-dimensional image is obtained by calculating the color value $C_j$ of each vertex $V_j$ in the predicted face model using the predicted texture coefficient, and rendering $\{V_j, C_j\}$ according to its topology with a differentiable renderer. Face mask extraction is then performed on the predicted two-dimensional image, so that the predicted semantic information of each predicted two-dimensional key point can be obtained; and based on the predicted position parameter, each predicted two-dimensional key point is projected onto the video frame using the camera projection model to obtain the predicted position information of each predicted two-dimensional key point.
In addition, after the predicted face model is obtained, the M vertices of the predicted face model that semantically correspond to the reference two-dimensional key points can be taken out and projected onto the video frame using the camera projection model to obtain the position information of the predicted two-dimensional key points on the video frame; the predicted two-dimensional image is then composed of the predicted two-dimensional key points obtained in this way.
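A minimal stand-in for the projection step $\mathrm{Proj}(r, t \mid V)$ is sketched below; the actual camera model, rotation parameterization and focal length used by the method are not specified in the text, so the choices here are assumptions for illustration only.

```python
import numpy as np

def project_points(vertices, r_vec, t_vec, focal=1.0):
    """Project (M, 3) vertices to 2D image coordinates with an axis-angle rotation r_vec,
    translation t_vec and a simple pinhole perspective divide."""
    theta = np.linalg.norm(r_vec)
    if theta < 1e-8:
        rot = np.eye(3)
    else:  # Rodrigues formula for an axis-angle rotation vector
        k = r_vec / theta
        K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        rot = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    cam = vertices @ rot.T + t_vec            # points in camera coordinates
    return focal * cam[:, :2] / cam[:, 2:3]   # perspective divide to 2D coordinates
```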
In an alternative embodiment, step S632 may be implemented as:
performing face detection on the video frame to obtain a face region containing a target object; carrying out face analysis on the face region to obtain respective reference semantic information of each reference two-dimensional key point; and detecting the face key points of the face area to obtain the respective reference position information of each reference two-dimensional key point.
Specifically, it is necessary to ensure that only one face exists in each video frame. If multiple faces exist in a video frame, the area where each face is located can be detected using a Face Box and separated in the form of rectangular boxes to obtain the face region containing the target object. A face parsing tool is then used to extract, from the face region, a mask of the area where the face is located, retaining the positions of the facial skin, eyebrows, eyes, nose, mouth, etc., so as to obtain the reference semantic information of each reference two-dimensional key point; and a face alignment extractor is used to extract the 2D key points of the face region to obtain the reference position information of each reference two-dimensional key point.
Referring to fig. 9, a schematic diagram of a video frame preprocessing method in an embodiment of the present application is shown, a video frame is input, face detection is performed to obtain a face region 1, face analysis is performed to the face region 1 to obtain reference semantic information 1, reference semantic information 2, reference semantic information 3, face key point detection is performed to the face region 1 to obtain reference position information 1, reference position information 2, and reference position information 3.
In an alternative embodiment, step S64 may be implemented as:
constructing a first loss function based on the semantic difference and the position difference; constructing a second loss function based on the difference between the fused and predicted looks parameters; constructing a first regular function based on the predicted looks parameter, the predicted expression parameter and the predicted texture parameter; and constructing a target loss function based on the first loss function, the second loss function and the first regular function, and carrying out parameter adjustment on the initial reconstruction model based on the target loss function.
Specifically, the first loss function is used to constrain the 3D vertices in the reconstructed predicted face model so that, after the 3D key points semantically corresponding to the reference two-dimensional key points in the video frame are projected onto the video frame, the obtained projection points (the predicted two-dimensional key points) stay as consistent as possible with the reference two-dimensional key points; it is also used to constrain the image (the predicted two-dimensional image) obtained by rendering the 3D vertices of the predicted face model based on the predicted texture parameter to stay as consistent as possible with the video frame within the face mask area. The second loss function is used to keep the predicted phase parameter output by the initial reconstruction model as consistent as possible with the reference phase parameter.
Because the target reconstruction model in the embodiment of the application is mainly used to solve, for the input video, a face shape (i.e., face looks parameter) that is as realistic as possible and expression coefficients and camera parameters that are as stable as possible between frames, the general traditional-optimization and deep-learning solution models need to be modified. The most important point is that a common identity coefficient $s_c$ (i.e., the reference phase parameter) is solved for the entire video; in the joint traditional-optimization method, while $s_c$ is being solved, the $s$ predicted by the initial reconstruction model in the previous step is required to gradually approach $s_c$. That is, a new optimization variable $s_c$ is added to the whole optimization model. In each iteration, the gradient with respect to $s_c$ is calculated, and $s_c$ is weighted with the $s$ predicted by the deep learning model to obtain the combined identity coefficient $s^* = w_0 s_c + (1 - w_0)\,s$; the full combined parameters $(s^*, e, g, r, t)$ are then passed to the subsequent parameterized 3D face model $\Theta$, projection model $\mathrm{Proj}$ and renderer, the target loss function is calculated, gradients are back-propagated and the parameters are updated.
When constructing the target loss function, the first loss function, the second loss function and the first regular function can be fused by weighted summation, and the fused target loss function $L_2$ is:
$$L_2 = w_1 L_{lm} + w_2 L_r + w_3 L_{reg} + w_4 \lVert s_c - s \rVert^2.$$
where $L_{lm}$ is constructed based on the position difference, $L_r$ is constructed based on the semantic difference, $L_{reg}$ represents the first regular function, and the newly added term $\lVert s_c - s \rVert^2$ is used to keep the predicted phase parameter $s$ and the reference phase parameter $s_c$ as consistent as possible. The solving process uses a gradient descent algorithm, solving the gradient and updating the parameters with respect to $s_c$ and the model parameters $\theta$ in turn. In a specific implementation, the parameters $\theta_0$ of the pre-trained initial reconstruction model are used as the initialization of $\theta$, and $s_{c0} = \frac{1}{N}\sum_{I} f_0(\theta_0 \mid I)$ (with $N$ the number of video frames), i.e. the average of the phase parameters predicted by the pre-trained initial reconstruction model over the whole video in the first loop iteration, is used as the initialization of $s_c$.
For the weights $(w_1, w_2, w_3, w_4)$, taking a preset number of iterations of $2D$ as an example, in the 1st to $D$-th loop iterations the values are $(10, 1, 0.001, 1)$, and in the $(D+1)$-th to $2D$-th iterations the values are $(1, 10, 0.001, 0.1)$. Since the continuity of video frames is stronger than that of the sample images, increasing $w_2$ and decreasing $w_1$ in the second half helps improve the stability of the final optimization result.
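A minimal sketch of the target loss with this two-phase weight schedule follows; the individual loss terms are assumed to be computed elsewhere, as in the pre-training sketch above.

```python
import torch

def target_loss(l_lm, l_r, l_reg, s_ref, s_pred, step, half_iters):
    """L2 = w1*L_lm + w2*L_r + w3*L_reg + w4*||s_c - s||^2 with the weights switching
    between steps 1..D and D+1..2D (D = half_iters), as described above."""
    w1, w2, w3, w4 = (10.0, 1.0, 0.001, 1.0) if step <= half_iters else (1.0, 10.0, 0.001, 0.1)
    return w1 * l_lm + w2 * l_r + w3 * l_reg + w4 * torch.sum((s_ref - s_pred) ** 2)
```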
The process of training the initial reconstruction model by using the video to be reconstructed to obtain the target reconstruction model is described above, and the process of constructing the target face model by using the target reconstruction model is described below.
Referring to fig. 10A, which is a flowchart of an implementation of a second three-dimensional face reconstruction method according to an embodiment of the present application, taking the server as the execution subject as an example, the specific implementation flow of the method includes the following steps S1001 to S1003:
S1001: the server inputs each video frame contained in the video to be reconstructed into the target reconstruction model respectively, to obtain an output predicted looks parameter, and a predicted expression parameter and a predicted position parameter corresponding to each video frame respectively;
The video to be reconstructed contains the target object, and the target reconstruction model is obtained by performing the preset number of loop iteration trainings on the initial reconstruction model based on the video to be reconstructed. After the preset number of loop iteration trainings, the target reconstruction model outputs one predicted looks parameter for the video to be reconstructed as a whole, and outputs a predicted expression parameter and a predicted position parameter for each video frame respectively. For example, the video to be reconstructed includes video frame 1, video frame 2 and video frame 3; the target reconstruction model outputs the predicted looks parameter S, and outputs the predicted expression parameter E1 and predicted position parameter L1 for video frame 1, the predicted expression parameter E2 and predicted position parameter L2 for video frame 2, and the predicted expression parameter E3 and predicted position parameter L3 for video frame 3.
S1002: the server obtains a target phase parameter based on the predicted phase parameter and the first reference phase parameter;
In the loop iteration training process, while the target reconstruction model is obtained, a first reference phase parameter can also be obtained. The first reference phase parameter is obtained by adjusting the second reference phase parameter determined in the last loop iteration training based on the prediction parameter differences obtained in the last loop iteration training, where each prediction parameter difference is obtained based on a predicted phase parameter obtained in the last loop iteration training and the second reference phase parameter. In each loop iteration training, the reference phase parameter determined based on the current iteration number is adjusted based on the prediction parameter differences; the specific adjustment process is described in the above embodiment and is not repeated herein.
Taking the predicted phase parameter $s^*$ and the first reference phase parameter as an example, the target looks parameter is obtained by fusing the two with a weight determined by $D$, where $D$ represents half of the preset number of iterations. Fusing the predicted phase parameter with the first reference phase parameter yields a more accurate target phase parameter, so that the target face model can be constructed more accurately based on the target phase parameter.
S1003: the server respectively builds a target face model of the target object in the corresponding video frame based on the target looks parameter, the prediction expression parameter and the prediction position parameter corresponding to each video frame.
For example, the video frame 1 corresponds to the predicted expression parameter 1 and the predicted position parameter 1, then the target facial model in the video frame 1 is constructed by using the target looks parameter, the predicted expression parameter 1 and the predicted position parameter 1, the video frame 2 corresponds to the predicted expression parameter 2 and the predicted position parameter 2, and then the target facial model in the video frame 2 is constructed by using the target looks parameter, the predicted expression parameter 2 and the predicted position parameter 2.
In an alternative embodiment, in step S1003, for one video frame, a corresponding target face model is constructed by:
inputting the target looks parameter, the predicted expression parameter and the predicted position parameter corresponding to one video frame into a preset basic face model to obtain each predicted three-dimensional key point; and connecting all the predicted three-dimensional key points according to a preset topological structure to obtain a target face model corresponding to the predicted parameter set.
Specifically, the preset basic face model is the parameterized 3D face template. The target looks parameter and the predicted expression parameter and predicted position parameter corresponding to a video frame are input into the parameterized 3D face template to obtain each predicted three-dimensional key point of the target object in that video frame, and the predicted three-dimensional key points are connected according to the preset topological structure to obtain the corresponding target face model. The preset topological structure can be the triangular patch connection relation defined for the parameterized 3D face template.
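A minimal sketch of this construction step is given below; `template` is assumed to be a callable like the parametric face sketch above, and the predicted position parameter is only needed later for projection rather than for the mesh itself.

```python
def build_target_face_model(s_target, e_frame, template, faces):
    """Substitute the shared target looks parameter and this frame's predicted expression
    parameter into the parameterized template, then attach the fixed triangle topology."""
    vertices = template(s_target, e_frame)     # predicted 3D key points {V_j}
    return vertices, faces                     # mesh = vertices + preset topology
```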
Referring to fig. 10B, which is a logic schematic diagram of a face reconstruction method in an embodiment of the present application: the video to be reconstructed includes video frame a, video frame b and video frame c. In one loop iteration, video frame a, video frame b and video frame c are respectively input into the initial reconstruction model to obtain prediction parameter set a1 of video frame a, prediction parameter set b1 of video frame b and prediction parameter set c1 of video frame c, and parameter adjustment is performed based on a1 and the reference looks parameter x, b1 and the reference looks parameter x, and c1 and the reference looks parameter x respectively. After 1000 loop iterations, the target reconstruction model and the first reference looks parameter x0 are obtained. The video to be reconstructed is then input into the target reconstruction model to obtain prediction parameter set an of video frame a, prediction parameter set bn of video frame b and prediction parameter set cn of video frame c, where the predicted looks parameter contained in each prediction parameter set is x1. x1 and x0 are fused to obtain the target looks parameter x2, the predicted looks parameter x1 in each prediction parameter set is replaced with x2, and the target face models corresponding to video frames a, b and c are constructed based on the replaced prediction parameter sets.
In the embodiment of the application, a video containing a face is input, and the identity coefficient of the face, the expression coefficient corresponding to each frame of image, and the camera parameters are calculated by combining a traditional optimization method with deep-learning regularization; the 3D face template corresponding to each frame of image is driven by these three kinds of coefficients, and finally the 3D face reconstruction result corresponding to each frame of the input video is obtained. Specifically, first, an identity coefficient, an expression coefficient and camera parameters are calculated as initial values for each frame of the input video using a pre-trained deep learning model; then, a traditional optimization algorithm is designed to perform iterative optimization with these three kinds of coefficients as variables, where the initial values of the coefficients are used to accelerate convergence and the deep learning module is added as a regularization term to increase stability; finally, a unified identity coefficient is calculated for the whole video, and an expression coefficient and camera parameters are calculated for each frame of the video, which are input into the parameterized 3D face template to generate the final 3D face reconstruction result. The 3D face model obtained through this technical process can be used for applications such as face swapping, digital human creation, and driving.
Based on the same inventive concept, the embodiment of the application also provides a three-dimensional face reconstruction device. Referring to fig. 11, a schematic structural diagram of a three-dimensional face reconstruction device 1100 may include:
An obtaining unit 1101, configured to obtain a video to be reconstructed including a target object, and obtain an initial reconstruction model, where the initial reconstruction model is obtained by pre-training a reconstruction model to be trained using a constructed training data set;
the first training unit 1102 is configured to perform, based on a video to be reconstructed, loop iteration training for a preset number of times on an initial reconstruction model, to obtain a target reconstruction model, so as to respectively construct, based on the target reconstruction model, a target face model of a target object in each video frame included in the video to be reconstructed; wherein, in one loop iteration, the following operations are performed:
respectively carrying out feature recognition on each video frame to obtain a corresponding prediction parameter set, wherein the prediction parameter set at least comprises prediction looks parameters;
for each prediction parameter set, the following operations are performed: and obtaining a prediction parameter difference based on the prediction phase parameters contained in one prediction parameter set and the reference phase parameters determined based on the current iteration times, and carrying out parameter adjustment on the initial reconstruction model based on the prediction parameter difference.
In an alternative embodiment, the set of predicted parameters further comprises predicted expression parameters; the prediction parameter differences comprise semantic differences and position differences;
The first training unit 1102 is specifically configured to:
fusing the reference phase parameters with the predicted phase parameters to obtain corresponding fused phase parameters;
based on the fusion looks parameter and the predicted expression parameter, constructing a predicted face model of the target object in the corresponding video frame;
based on each predicted two-dimensional key point contained in the predicted two-dimensional image corresponding to the predicted face model and each reference two-dimensional key point contained in the video frame, semantic difference and position difference are obtained;
and carrying out parameter adjustment on the initial reconstruction model based on the semantic difference and the position difference.
In an alternative embodiment, first training unit 1102 is specifically configured to:
based on the predicted two-dimensional image, obtaining respective predicted semantic information and predicted position information of each predicted two-dimensional key point;
based on the video frame, obtaining respective reference semantic information and reference position information of each reference two-dimensional key point;
determining semantic differences based on differences between each piece of predicted semantic information and each piece of reference semantic information;
based on the differences between each predicted position information and each reference position information, a position difference is determined.
In an alternative embodiment, the set of predicted parameters further comprises a predicted location parameter; the first training unit 1102 is specifically configured to:
Face analysis is carried out on the predicted two-dimensional image, and respective predicted semantic information of each predicted two-dimensional key point is obtained;
and projecting each predicted two-dimensional key point to the video frame based on the predicted position parameter, and obtaining the predicted position information of each predicted two-dimensional key point on the video frame.
In an alternative embodiment, first training unit 1102 is specifically configured to:
performing face detection on the video frame to obtain a face region containing a target object;
face analysis is carried out on the face area to obtain the respective reference semantic information of each reference two-dimensional key point;
And detecting the face key points of the face area to obtain the respective reference position information of each reference two-dimensional key point.
In an alternative embodiment, first training unit 1102 is specifically configured to determine the reference phase parameters by:
if the current iteration number is 1, determining a reference phase parameter based on the average value of the prediction phase parameters contained in each prediction parameter set; otherwise the first set of parameters is selected,
and adjusting the reference looks parameter determined in the previous loop iteration based on the semantic difference and the position difference obtained in the previous loop iteration to determine the reference looks parameter.
In an alternative embodiment, one prediction parameter set further comprises a prediction texture parameter;
The first training unit 1102 is specifically configured to:
constructing a first loss function based on the semantic difference and the position difference;
constructing a second loss function based on the difference between the fused and predicted looks parameters;
constructing a first regular function based on the predicted looks parameter, the predicted expression parameter and the predicted texture parameter;
and constructing a target loss function based on the first loss function, the second loss function and the first regular function, and carrying out parameter adjustment on the initial reconstruction model based on the target loss function.
In an alternative embodiment, the apparatus further comprises a second training unit 1103 for:
based on training samples in the training data set, performing loop iterative training on the reconstruction model to be trained to obtain an initial reconstruction model; in one loop iteration, the following operations are performed:
carrying out feature recognition on a sample image contained in a training sample to obtain a sample parameter set, wherein the sample parameter set at least contains sample appearance parameters, sample expression parameters and sample texture parameters;
constructing a sample face model based on the sample appearance parameters and the sample expression parameters, and constructing a second regular function based on the sample appearance parameters, the sample expression parameters and the sample texture parameters;
Based on each sample two-dimensional key point contained in the sample two-dimensional image corresponding to the sample face model and each reference two-dimensional key point contained in the sample image, obtaining the sample semantic difference and sample position difference between the sample two-dimensional key points and the reference two-dimensional key points;
and constructing a third loss function based on the sample semantic difference and the sample position difference, and carrying out parameter adjustment on the reconstruction model to be trained based on the second regular function and the third loss function.
In an alternative embodiment, the obtaining unit 1101 is specifically configured to:
for each obtained prediction parameter set, the following operations are performed respectively:
based on a prediction parameter set and a preset basic face model, obtaining each prediction three-dimensional key point;
and connecting all the predicted three-dimensional key points according to a preset topological structure to obtain a target face model corresponding to the predicted parameter set.
Based on the same inventive concept, the embodiment of the application also provides another three-dimensional face reconstruction device. Referring to fig. 12A, a schematic structural diagram of a three-dimensional face reconstruction device 1200 may include:
the prediction unit 1201 is configured to input each video frame included in the video to be reconstructed into a target reconstruction model, obtain an output predicted looks parameter and a predicted expression parameter and a predicted position parameter corresponding to each video frame, where the video to be reconstructed includes a target object, and the target reconstruction model is obtained by performing a loop iteration training for a preset number of times on an initial reconstruction model based on the video to be reconstructed;
An obtaining unit 1202, configured to obtain a target feature parameter based on a predicted feature parameter and a first reference feature parameter, where the first reference feature parameter is obtained by adjusting a second reference feature parameter determined by a last iteration training of a loop based on each predicted feature parameter obtained by the last iteration training of the loop and the second reference feature parameter;
the construction unit 1203 is configured to respectively construct a target face model of the target object in the corresponding video frame based on the target looks parameter, and the predicted expression parameter and the predicted position parameter corresponding to each video frame.
In an alternative embodiment, the construction unit 1203 is specifically configured to:
for each video frame, the following operations are performed:
inputting the target looks parameter, the predicted expression parameter and the predicted position parameter corresponding to one video frame into a preset basic face model to obtain each predicted three-dimensional key point;
and connecting all the predicted three-dimensional key points according to a preset topological structure to obtain a target face model corresponding to the predicted parameter set.
For convenience of description, the above parts are described as being functionally divided into modules (or units) respectively. Of course, the functions of each module (or unit) may be implemented in the same piece or pieces of software or hardware when implementing the present application.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the application may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.) or an embodiment combining hardware and software aspects may be referred to herein as a "circuit," module "or" system.
The embodiment of the application also provides electronic equipment based on the same conception as the embodiment of the method. In one embodiment, the electronic device may be a server, such as the server shown in FIG. 2. In this embodiment, the electronic device may be configured as shown in fig. 12B, including a memory 1201, a communication module 1203, and one or more processors 1202.
A memory 1201 for storing a computer program for execution by the processor 1202. The memory 1201 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a program required for running an instant communication function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 1201 may be a volatile memory, such as a random-access memory (RAM); the memory 1201 may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1201 may be any other medium that can be used to carry or store a desired computer program in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1201 may also be a combination of the above memories.
The processor 1202 may include one or more central processing units (central processing unit, CPU) or digital processing units, or the like. A processor 1202 for implementing the above-mentioned three-dimensional face reconstruction method when calling a computer program stored in the memory 1201.
The communication module 1203 is configured to communicate with a terminal device and other servers.
The specific connection medium between the memory 1201, the communication module 1203 and the processor 1202 is not limited in the embodiment of the present application. In fig. 12B, the memory 1201 and the processor 1202 are connected by a bus 1204, which is depicted with a bold line in fig. 12B; the connection manner between other components is merely illustrative and not limiting. The bus 1204 may be classified into an address bus, a data bus, a control bus, etc. For ease of description, only one thick line is depicted in fig. 12B, but this does not mean that there is only one bus or only one type of bus.
The memory 1201 stores a computer storage medium in which computer executable instructions for implementing the three-dimensional face reconstruction method according to the embodiment of the present application are stored. The processor 1202 is configured to perform the three-dimensional face reconstruction method described above, as shown in fig. 3.
In another embodiment, the electronic device may also be other electronic devices, such as the terminal device shown in fig. 2. In this embodiment, the structure of the electronic device may include, as shown in fig. 13: communication component 1310, memory 1320, display unit 1330, camera 1340, sensor 1350, audio circuit 1360, bluetooth module 1370, processor 1380, and the like.
The communication component 1310 is used for communicating with a server. In some embodiments, a wireless fidelity (Wireless Fidelity, WiFi) module may be included. The WiFi module belongs to short-range wireless transmission technologies, and the electronic device may help the user send and receive information through the WiFi module.
Memory 1320 may be used to store software programs and data. The processor 1380 performs various functions of the terminal device and data processing by executing software programs or data stored in the memory 1320. Memory 1320 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Memory 1320 stores an operating system that enables the terminal device to operate. The memory 1320 of the present application may store an operating system and various application programs, and may also store a computer program for executing the three-dimensional face reconstruction method according to the embodiment of the present application.
The display unit 1330 may also be used to display information input by a user or information provided to the user and a graphical user interface (graphical user interface, GUI) of various menus of the terminal device. In particular, the display unit 1330 may include a display 1332 disposed on a front side of the terminal device. The display 1332 may be configured in the form of a liquid crystal display, light emitting diodes, or the like. The display unit 1330 may be used to display a three-dimensional face reconstruction user interface or the like in an embodiment of the present application.
The display unit 1330 may also be used to receive input numeric or character information, generate signal inputs related to user settings and function control of the terminal device, and in particular, the display unit 1330 may include a touch screen 1331 provided on the front surface of the terminal device, and may collect touch operations on or near the user, such as clicking buttons, dragging scroll boxes, and the like.
The touch screen 1331 may be covered on the display screen 1332, or the touch screen 1331 may be integrated with the display screen 1332 to implement input and output functions of the terminal device, and after integration, the touch screen may be abbreviated as touch screen. The display unit 1330 may display an application program and a corresponding operation procedure in the present application.
The camera 1340 can be used to capture still images, and a user can comment on the image captured by the camera 1340 through an application. The camera 1340 may be one or more. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive elements convert the optical signals to electrical signals, which are then passed to a processor 1380 for conversion to digital image signals.
The terminal device may further comprise at least one sensor 1350, such as an acceleration sensor 1351, a distance sensor 1352, a fingerprint sensor 1353, a temperature sensor 1354. The terminal device may also be configured with other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, light sensors, motion sensors, and the like.
Audio circuitry 1360, speakers 1361, microphones 1362 may provide an audio interface between the user and the terminal equipment. The audio circuit 1360 may transmit the received electrical signal after conversion of the audio data to the speaker 1361, where the electrical signal is converted to a sound signal by the speaker 1361 for output. The terminal device may also be configured with a volume button for adjusting the volume of the sound signal. On the other hand, the microphone 1362 converts the collected sound signals into electrical signals, which are received by the audio circuit 1360 and converted into audio data, which are output to the communication component 1310 for transmission to, for example, another terminal device, or to the memory 1320 for further processing.
The bluetooth module 1370 is used for exchanging information with other bluetooth devices having bluetooth modules through a bluetooth protocol. For example, the terminal device may establish a bluetooth connection with a wearable electronic device (e.g., a smart watch) that also has a bluetooth module through the bluetooth module 1370, so as to perform data interaction.
The processor 1380 is a control center of the terminal device, connects various parts of the entire terminal using various interfaces and lines, performs various functions of the terminal device and processes data by running or executing software programs stored in the memory 1320, and calling data stored in the memory 1320. In some embodiments, processor 1380 may include one or more processing units; processor 1380 may also integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a baseband processor that primarily handles wireless communications. It will be appreciated that the baseband processor described above may not be integrated into the processor 1380. The processor 1380 of the present application may run an operating system, an application, a user interface display, and a touch response, as well as a three-dimensional face reconstruction method according to an embodiment of the present application. In addition, a processor 1380 is coupled with the display unit 1330.
In some possible embodiments, aspects of the three-dimensional face reconstruction method provided by the present application may also be implemented in the form of a program product comprising a computer program for causing an electronic device to perform the steps of the three-dimensional face reconstruction method according to the various exemplary embodiments of the present application described herein above when the program product is run on the electronic device, e.g. the electronic device may perform the steps as shown in fig. 3.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of the embodiments of the present application may take the form of a portable compact disc read-only memory (CD-ROM), include a computer program, and be run on an electronic device. However, the program product of the present application is not limited thereto. In this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which a readable computer program is embodied. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
A computer program embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer programs for performing the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer program may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device and partly on a remote electronic device, or entirely on the remote electronic device or server. In the case involving a remote electronic device, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external electronic device (e.g., through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having a computer-usable computer program embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus produce means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (14)

1. A three-dimensional face reconstruction method, the method comprising:
acquiring a video to be reconstructed comprising a target object, and acquiring an initial reconstruction model, wherein the initial reconstruction model is obtained by pre-training a reconstruction model to be trained by using a constructed training data set;
performing loop iteration training on the initial reconstruction model a preset number of times based on the video to be reconstructed to obtain a target reconstruction model, so as to construct, based on the target reconstruction model, a target face model of the target object in each video frame contained in the video to be reconstructed; wherein, in one loop iteration, the following operations are performed:
performing feature recognition on each video frame to obtain a corresponding prediction parameter set, wherein the prediction parameter set at least comprises predicted appearance parameters;
for each prediction parameter set, the following operations are performed: obtaining a prediction parameter difference based on the predicted appearance parameters contained in one prediction parameter set and the reference appearance parameters determined based on the current iteration number, and performing parameter adjustment on the initial reconstruction model based on the prediction parameter difference.
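By way of illustration only (and not as part of the claims), the following is a minimal PyTorch-style sketch of the loop iteration training of claim 1. The model interface, the "appearance" dictionary key, and param_difference_loss are hypothetical names, and the prediction parameter difference is reduced here to a simple L2 distance; the full difference of claims 2 to 7 would replace it.

```python
# Hypothetical sketch of the per-video loop iteration training in claim 1.
import torch

def param_difference_loss(param_set, reference_appearance):
    # Stand-in for the prediction parameter difference of claims 2-7,
    # reduced to an L2 distance between predicted and reference appearance parameters.
    return ((param_set["appearance"] - reference_appearance) ** 2).sum()

def adapt_to_video(initial_model, video_frames, num_iterations, lr=1e-4):
    model = initial_model                                    # pre-trained initial reconstruction model
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    reference_appearance = None                              # reference appearance parameters (claim 6)

    for _ in range(num_iterations):                          # preset number of loop iterations
        # Feature recognition: one prediction parameter set per video frame.
        param_sets = [model(frame) for frame in video_frames]

        if reference_appearance is None:
            # First iteration: reference = mean of the predicted appearance parameters.
            reference_appearance = torch.stack(
                [p["appearance"] for p in param_sets]).mean(dim=0).detach()

        # Prediction parameter difference for every parameter set, then parameter adjustment.
        total_loss = sum(param_difference_loss(p, reference_appearance) for p in param_sets)
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

    return model                                             # target reconstruction model
```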
2. The method of claim 1, wherein the prediction parameter set further comprises predicted expression parameters; the prediction parameter difference comprises a semantic difference and a position difference;
the obtaining a prediction parameter difference based on the predicted appearance parameters contained in the prediction parameter set and the reference appearance parameters determined based on the current iteration number comprises:
fusing the reference appearance parameters with the predicted appearance parameters to obtain corresponding fused appearance parameters;
constructing a predicted face model of the target object in the corresponding video frame based on the fused appearance parameters and the predicted expression parameters;
based on each predicted two-dimensional key point contained in the predicted two-dimensional image corresponding to the predicted face model and each reference two-dimensional key point contained in the video frame, obtaining the semantic difference and the position difference;
and carrying out parameter adjustment on the initial reconstruction model based on the semantic difference and the position difference.
3. The method according to claim 2, wherein the obtaining the semantic difference and the position difference based on each predicted two-dimensional key point included in the predicted two-dimensional image corresponding to the predicted face model and each reference two-dimensional key point included in the video frame comprises:
based on the predicted two-dimensional image, obtaining respective predicted semantic information and predicted position information of each predicted two-dimensional key point;
based on the video frame, obtaining respective reference semantic information and reference position information of each reference two-dimensional key point;
determining the semantic difference based on the difference between each predicted semantic information and each reference semantic information;
and determining the position difference based on the difference between each predicted position information and each reference position information.
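For illustration only, a possible computation of the two differences in claim 3 is sketched below, assuming per-key-point semantic vectors (e.g. face-parsing class scores) and two-dimensional coordinates; the mean-squared-error distance is an assumption, as the claim does not fix the distance measure.

```python
# Hypothetical sketch of the semantic and position differences of claim 3.
import torch.nn.functional as F

def keypoint_differences(pred_semantics, pred_positions, ref_semantics, ref_positions):
    """pred_/ref_semantics: (K, C) semantic information per two-dimensional key point;
    pred_/ref_positions: (K, 2) position information in image coordinates."""
    semantic_difference = F.mse_loss(pred_semantics, ref_semantics)   # semantic difference
    position_difference = F.mse_loss(pred_positions, ref_positions)   # position difference
    return semantic_difference, position_difference
```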
4. The method of claim 3, wherein the prediction parameter set further comprises a predicted position parameter;
the obtaining, based on the predicted two-dimensional image, respective predicted semantic information and predicted position information of each predicted two-dimensional key point comprises:
performing face analysis on the predicted two-dimensional image to obtain the respective predicted semantic information of each predicted two-dimensional key point;
and projecting each predicted two-dimensional key point onto the video frame based on the predicted position parameter, to obtain the predicted position information of each predicted two-dimensional key point on the video frame.
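For illustration only, one way the predicted position parameter of claim 4 could drive the projection onto the video frame is sketched below, assuming a weak-perspective pose (scale, rotation, two-dimensional translation) applied to the model's key points; the actual camera model is not specified by the claim.

```python
# Hypothetical weak-perspective projection onto the video frame (claim 4).
import torch

def project_to_frame(keypoints_3d, scale, rotation, translation):
    """keypoints_3d: (K, 3) model key points; rotation: (3, 3); translation: (2,);
    returns (K, 2) predicted position information on the video frame."""
    rotated = keypoints_3d @ rotation.T              # rotate into camera coordinates
    return scale * rotated[:, :2] + translation      # drop depth, then scale and shift
```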
5. The method of claim 3, wherein the obtaining, based on the video frame, the respective reference semantic information and reference position information of each reference two-dimensional key point comprises:
performing face detection on the video frame to obtain a face region containing the target object;
performing face analysis on the face region to obtain the respective reference semantic information of each reference two-dimensional key point;
and detecting the face key points of the face area to obtain the respective reference position information of each reference two-dimensional key point.
6. The method of any one of claims 2 to 5, wherein the reference appearance parameters are determined by:
if the current iteration number is 1, determining the reference appearance parameters based on the average of the predicted appearance parameters contained in each prediction parameter set; otherwise,
adjusting the reference appearance parameters determined in the previous loop iteration based on the semantic difference and the position difference obtained in the previous loop iteration, to determine the reference appearance parameters.
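For illustration only, one possible realisation of claim 6 is sketched below, treating the reference appearance parameters as a learnable tensor so that the semantic and position differences of the previous loop iteration adjust them by gradient descent; the exact update rule is an assumption.

```python
# Hypothetical initialisation of the reference appearance parameters (claim 6).
import torch

def init_reference_appearance(predicted_appearances):
    # Current iteration number is 1: reference = mean of all predicted appearance parameters.
    mean_appearance = torch.stack(predicted_appearances).mean(dim=0)
    return torch.nn.Parameter(mean_appearance)

# Usage sketch: register the returned parameter with the optimiser alongside the model
# weights; each subsequent backward pass over the semantic and position differences then
# adjusts the reference appearance parameters determined in the previous loop iteration.
```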
7. The method of any of claims 2 to 5, wherein the prediction parameter set further comprises a predicted texture parameter;
the performing parameter adjustment on the initial reconstruction model based on the prediction parameter difference comprises:
constructing a first loss function based on the semantic difference and the position difference;
constructing a second loss function based on the difference between the fused appearance parameters and the predicted appearance parameters;
constructing a first regularization function based on the predicted appearance parameters, the predicted expression parameters and the predicted texture parameter;
and constructing a target loss function based on the first loss function, the second loss function and the first regularization function, and performing parameter adjustment on the initial reconstruction model based on the target loss function.
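For illustration only, a possible composition of the target loss function of claim 7 is sketched below. The weights w1, w2, w3 and the squared-L2 form of the first regularization function are assumptions; the claim only fixes which terms are combined.

```python
# Hypothetical target loss composition (claim 7); all inputs are tensors.
def target_loss(semantic_diff, position_diff, fused_app, pred_app, pred_expr, pred_tex,
                w1=1.0, w2=1.0, w3=1e-3):
    first_loss = semantic_diff + position_diff                 # first loss function
    second_loss = ((fused_app - pred_app) ** 2).sum()          # second loss function
    first_regularization = ((pred_app ** 2).sum()
                            + (pred_expr ** 2).sum()
                            + (pred_tex ** 2).sum())           # first regularization function
    return w1 * first_loss + w2 * second_loss + w3 * first_regularization
```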
8. The method according to any one of claims 1 to 5, wherein the initial reconstruction model is trained by:
based on the training samples in the training data set, performing loop iteration training on the reconstruction model to be trained to obtain the initial reconstruction model; in one loop iteration, the following operations are performed:
performing feature recognition on a sample image contained in a training sample to obtain a sample parameter set, wherein the sample parameter set at least contains sample appearance parameters, sample expression parameters and sample texture parameters;
constructing a sample face model based on the sample appearance parameters and the sample expression parameters, and constructing a second regularization function based on the sample appearance parameters, the sample expression parameters and the sample texture parameters;
based on each sample two-dimensional key point contained in the sample two-dimensional image corresponding to the sample face model and each reference two-dimensional key point contained in the sample image, obtaining a sample semantic difference and a sample position difference;
and constructing a third loss function by using the sample semantic difference and the sample position difference, and performing parameter adjustment on the reconstruction model to be trained based on the second regularization function and the third loss function.
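For illustration only, one pre-training iteration in the spirit of claim 8 is sketched below. The helper callables build_face_model and extract_sample_keypoints, the dictionary keys, and the squared-L2 form of the second regularization function are assumptions.

```python
# Hypothetical pre-training step of the reconstruction model to be trained (claim 8).
import torch.nn.functional as F

def pretrain_step(model, optimizer, sample_image, build_face_model, extract_sample_keypoints,
                  ref_semantics, ref_positions, w_reg=1e-3):
    params = model(sample_image)                                  # sample parameter set
    sample_face = build_face_model(params["appearance"], params["expression"])
    pred_semantics, pred_positions = extract_sample_keypoints(sample_face, params)
    third_loss = (F.mse_loss(pred_semantics, ref_semantics)      # sample semantic difference
                  + F.mse_loss(pred_positions, ref_positions))   # sample position difference
    second_regularization = sum((params[k] ** 2).sum()
                                for k in ("appearance", "expression", "texture"))
    loss = third_loss + w_reg * second_regularization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```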
9. A three-dimensional face reconstruction method, the method comprising:
inputting each video frame contained in the video to be reconstructed into a target reconstruction model to obtain output predicted appearance parameters, and predicted expression parameters and predicted position parameters corresponding to each video frame, wherein the video to be reconstructed contains a target object, and the target reconstruction model is obtained by performing loop iteration training on an initial reconstruction model a preset number of times based on the video to be reconstructed;
obtaining target appearance parameters based on the predicted appearance parameters and first reference appearance parameters, wherein the first reference appearance parameters are obtained by adjusting, based on each prediction parameter difference obtained in the last loop iteration of the training, second reference appearance parameters determined in the last loop iteration, and each prediction parameter difference is obtained based on each predicted appearance parameter obtained in the last loop iteration and the second reference appearance parameters;
and constructing a target face model of the target object in the corresponding video frame based on the target appearance parameters, and the predicted expression parameters and the predicted position parameters corresponding to each video frame.
10. The method of claim 9, wherein the constructing a target face model of the target object in the corresponding video frame based on the target appearance parameters, and the predicted expression parameters and the predicted position parameters corresponding to each video frame comprises:
for each video frame, the following operations are respectively performed:
inputting the target appearance parameters, and the predicted expression parameters and the predicted position parameters corresponding to one video frame into a preset basic face model to obtain each predicted three-dimensional key point;
and connecting the predicted three-dimensional key points according to a preset topological structure to obtain a target face model corresponding to the one video frame.
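For illustration only, claim 10 is sketched below assuming the preset basic face model is a linear morphable model (mean shape plus appearance and expression bases); the predicted position (pose) parameter, which the claim also feeds into the basic face model, is omitted here for brevity.

```python
# Hypothetical construction of the target face model from parameters (claim 10).
import numpy as np

def build_target_face(mean_shape, app_basis, expr_basis, target_app, pred_expr, topology):
    """mean_shape: (K, 3); app_basis: (K, 3, A); expr_basis: (K, 3, E);
    target_app: (A,); pred_expr: (E,); topology: (T, 3) triangles over key-point indices."""
    # Predicted three-dimensional key points from the basic face model and the parameters.
    keypoints_3d = mean_shape + app_basis @ target_app + expr_basis @ pred_expr
    # Connect the key points according to the preset topological structure.
    triangles = keypoints_3d[topology]          # (T, 3, 3) faces of the target face model
    return keypoints_3d, triangles
```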
11. A three-dimensional face reconstruction device, comprising:
the acquisition unit is used for acquiring a video to be reconstructed containing a target object and acquiring an initial reconstruction model, wherein the initial reconstruction model is obtained by pre-training a reconstruction model to be trained by using a constructed training data set;
the first training unit is used for performing loop iteration training on the initial reconstruction model a preset number of times based on the video to be reconstructed to obtain a target reconstruction model, so as to construct, based on the target reconstruction model, a target face model of the target object in each video frame contained in the video to be reconstructed; wherein, in one loop iteration, the following operations are performed:
performing feature recognition on each video frame to obtain a corresponding prediction parameter set, wherein the prediction parameter set at least comprises predicted appearance parameters;
for each prediction parameter set, the following operations are performed: obtaining a prediction parameter difference based on the predicted appearance parameters contained in one prediction parameter set and the reference appearance parameters determined based on the current iteration number, and performing parameter adjustment on the initial reconstruction model based on the prediction parameter difference.
12. A three-dimensional face reconstruction device, comprising:
the prediction unit is used for inputting each video frame contained in the video to be reconstructed into a target reconstruction model to obtain output predicted appearance parameters, and predicted expression parameters and predicted position parameters corresponding to each video frame, wherein the video to be reconstructed contains a target object, and the target reconstruction model is obtained by performing loop iteration training on an initial reconstruction model a preset number of times based on the video to be reconstructed;
the acquisition unit is used for obtaining target appearance parameters based on the predicted appearance parameters and first reference appearance parameters, wherein the first reference appearance parameters are obtained by adjusting, based on each prediction parameter difference obtained in the last loop iteration of the training, second reference appearance parameters determined in the last loop iteration, and each prediction parameter difference is obtained based on each predicted appearance parameter obtained in the last loop iteration and the second reference appearance parameters;
the construction unit is used for constructing a target face model of the target object in the corresponding video frame based on the target appearance parameters, and the predicted expression parameters and the predicted position parameters corresponding to each video frame.
13. An electronic device comprising a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 10.
14. A computer readable storage medium, characterized in that it comprises a computer program for causing an electronic device to perform the steps of the method according to any one of claims 1-10 when said computer program is run on the electronic device.
CN202310413286.5A 2023-04-11 2023-04-11 Three-dimensional face reconstruction method and device, electronic equipment and storage medium Pending CN116977547A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310413286.5A CN116977547A (en) 2023-04-11 2023-04-11 Three-dimensional face reconstruction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310413286.5A CN116977547A (en) 2023-04-11 2023-04-11 Three-dimensional face reconstruction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116977547A true CN116977547A (en) 2023-10-31

Family

ID=88478500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310413286.5A Pending CN116977547A (en) 2023-04-11 2023-04-11 Three-dimensional face reconstruction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116977547A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496019A (en) * 2023-12-29 2024-02-02 南昌市小核桃科技有限公司 Image animation processing method and system for driving static image
CN117496019B (en) * 2023-12-29 2024-04-05 南昌市小核桃科技有限公司 Image animation processing method and system for driving static image

Similar Documents

Publication Publication Date Title
Zeng et al. Srnet: Improving generalization in 3d human pose estimation with a split-and-recombine approach
Li et al. Monocular real-time volumetric performance capture
JP7373554B2 (en) Cross-domain image transformation
US11232286B2 (en) Method and apparatus for generating face rotation image
Zou et al. 3D human shape reconstruction from a polarization image
CN111583399B (en) Image processing method, device, equipment, medium and electronic equipment
CN113496507A (en) Human body three-dimensional model reconstruction method
JP2022553252A (en) IMAGE PROCESSING METHOD, IMAGE PROCESSING APPARATUS, SERVER, AND COMPUTER PROGRAM
CN114359775A (en) Key frame detection method, device, equipment, storage medium and program product
CN116977547A (en) Three-dimensional face reconstruction method and device, electronic equipment and storage medium
CN115661336A (en) Three-dimensional reconstruction method and related device
Lu et al. 3d real-time human reconstruction with a single rgbd camera
WO2024041235A1 (en) Image processing method and apparatus, device, storage medium and program product
CN110197226B (en) Unsupervised image translation method and system
Saint et al. 3dbooster: 3d body shape and texture recovery
CN117011415A (en) Method and device for generating special effect text, electronic equipment and storage medium
JP7452698B2 (en) Reinforcement learning model for labeling spatial relationships between images
CN115222917A (en) Training method, device and equipment for three-dimensional reconstruction model and storage medium
CN117011449A (en) Reconstruction method and device of three-dimensional face model, storage medium and electronic equipment
Li et al. Inductive Guided Filter: Real-Time Deep Matting with Weakly Annotated Masks on Mobile Devices
Cai et al. Automatic generation of Labanotation based on human pose estimation in folk dance videos
Liu Light image enhancement based on embedded image system application in animated character images
CN117496036A (en) Method and device for generating texture map, electronic equipment and storage medium
CN113240796B (en) Visual task processing method and device, computer readable medium and electronic equipment
CN116978079A (en) Image recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication