CN110956691B - Three-dimensional face reconstruction method, device, equipment and storage medium - Google Patents

Three-dimensional face reconstruction method, device, equipment and storage medium

Info

Publication number
CN110956691B
Authority
CN
China
Prior art keywords
face
dimensional
target
picture
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911148553.0A
Other languages
Chinese (zh)
Other versions
CN110956691A (en)
Inventor
王多民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201911148553.0A
Publication of CN110956691A
Application granted
Publication of CN110956691B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Graphics (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Geometry (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiments of the present application disclose a three-dimensional face reconstruction method, device, equipment, and storage medium, wherein the method includes the following steps: when a face picture acquisition instruction is detected, acquiring a two-dimensional picture containing a face; identifying a target face in the two-dimensional picture based on a preset face identification strategy, and acquiring position information of the target face; cutting out the target face in the two-dimensional picture based on the position information of the target face to obtain a cut-out target face picture; inputting the cut target face picture into a target neural network model, and outputting three-dimensional model parameters of the target face; and driving the target three-dimensional model to carry out three-dimensional reconstruction based on the three-dimensional model parameters of the target face to obtain a three-dimensional face model of the target face. In this way, a high-precision, high-quality face reconstruction result is obtained quickly, and the operation is simple and convenient.

Description

Three-dimensional face reconstruction method, device, equipment and storage medium
Technical Field
The present disclosure relates to image processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for reconstructing a three-dimensional face.
Background
Three-dimensional face reconstruction technology has been widely used in various fields and is best known to users through applications such as three-dimensional facial expression packs. In the prior art, methods for generating a three-dimensional facial expression pack mostly generate it directly from pictures or videos, for example, directly generating animated pictures of interesting clips from video segments, or analyzing expression-related keywords in a user's message text and generating expressions from the keywords and expression templates. Some of these methods do not interact with the user when generating the expression pack, and so lack realism and interest; others analyze facial expressions in two-dimensional pictures to drive the generation of animated expressions, but two-dimensional pictures cannot drive the animated expressions of a three-dimensional model well, so the three-dimensional face reconstruction effect is poor.
Disclosure of Invention
In order to solve the above technical problems, an embodiment of the present application is expected to provide a three-dimensional face reconstruction method, apparatus, device, and storage medium.
The technical scheme of the application is realized as follows:
in a first aspect, a three-dimensional face reconstruction method is provided, the method including:
when a face picture acquisition instruction is detected, acquiring a two-dimensional picture containing a face;
identifying a target face in the two-dimensional picture based on a preset face identification strategy, and acquiring position information of the target face;
cutting out the target face in the two-dimensional picture based on the position information of the target face to obtain a cut-out target face picture;
inputting the cut target face picture into a target neural network model, and outputting three-dimensional model parameters of the target face;
and driving the target three-dimensional model to carry out three-dimensional reconstruction based on the three-dimensional model parameters of the target face to obtain a three-dimensional face model of the target face.
In a second aspect, a three-dimensional face reconstruction apparatus is provided, the apparatus comprising:
the acquisition unit is used for acquiring a two-dimensional picture containing the human face when detecting a human face acquisition instruction;
the detection unit is used for identifying a target face in the two-dimensional picture based on a preset face identification strategy and acquiring the position information of the target face;
the clipping unit is used for clipping the target face in the two-dimensional picture based on the position information of the target face to obtain a clipped target face picture;
the reconstruction unit is used for inputting the cut target face picture into the target neural network model and outputting three-dimensional model parameters of the target face; and driving the target three-dimensional model to carry out three-dimensional reconstruction based on the three-dimensional model parameters of the target face to obtain a three-dimensional face model of the target face.
In a third aspect, a three-dimensional face reconstruction apparatus is provided, including: a processor and a memory configured to store a computer program capable of running on the processor, wherein the processor is configured to perform the steps of the aforementioned method when the computer program is run.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, wherein the computer program, when being executed by a processor, carries out the steps of the aforementioned method.
The embodiments of the present application disclose a three-dimensional face reconstruction method, device, equipment, and storage medium, wherein the method includes the following steps: when a face picture acquisition instruction is detected, acquiring a two-dimensional picture containing a face; identifying a target face in the two-dimensional picture based on a preset face identification strategy, and acquiring position information of the target face; cutting out the target face in the two-dimensional picture based on the position information of the target face to obtain a cut-out target face picture; inputting the cut target face picture into a target neural network model, and outputting three-dimensional model parameters of the target face; and driving the target three-dimensional model to carry out three-dimensional reconstruction based on the three-dimensional model parameters of the target face to obtain a three-dimensional face model of the target face. In this way, a high-precision, high-quality face reconstruction result is obtained quickly, and the operation is simple and convenient.
Drawings
Fig. 1 is a schematic flow chart of a three-dimensional face reconstruction method in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a model training method in an embodiment of the present application;
fig. 3 is a schematic diagram of a composition structure of a three-dimensional face reconstruction device in an embodiment of the present application;
fig. 4 is a schematic structural diagram of a three-dimensional face reconstruction device in an embodiment of the present application.
Detailed Description
For a more complete understanding of the features and technical content of the embodiments of the present application, reference should be made to the following detailed description of the embodiments, taken in conjunction with the accompanying drawings, which are provided for illustration only and are not intended to limit the embodiments of the present application.
An embodiment of the present application provides a three-dimensional face reconstruction method, and fig. 1 is a schematic flow chart of the three-dimensional face reconstruction method in the embodiment of the present application, as shown in fig. 1, where the method specifically may include:
step 101: when a face picture acquisition instruction is detected, acquiring a two-dimensional picture containing a face;
step 102: identifying a target face in the two-dimensional picture based on a preset face identification strategy, and acquiring position information of the target face;
step 103: cutting out the target face in the two-dimensional picture based on the position information of the target face to obtain a cut-out target face picture;
step 104: inputting the cut target face picture into a target neural network model, and outputting three-dimensional model parameters of the target face;
step 105: and driving the target three-dimensional model to carry out three-dimensional reconstruction based on the three-dimensional model parameters of the target face to obtain a three-dimensional face model of the target face.
The execution subject for building the three-dimensional face model may be a mobile terminal or a fixed terminal. After the terminal acquires the two-dimensional picture, it uses the neural network model to acquire the three-dimensional model parameters of the face in the two-dimensional picture, and thereby drives the standard three-dimensional model to perform three-dimensional reconstruction to obtain the three-dimensional face model.
Here, the face picture acquisition instruction may be an instruction to start building a three-dimensional face model, or a photographing instruction. The two-dimensional picture containing the face is obtained through a camera, which may be any camera capable of collecting two-dimensional pictures, for example: a monocular camera, a color camera, a black-and-white camera, etc. The face picture may be a black-and-white picture or a color picture. For example, face pictures are collected through the cameras of mobile phones, stand-alone cameras, and wearable devices.
In some embodiments, the identifying the target face in the two-dimensional picture based on the preset face recognition policy, and obtaining the location information of the target face includes: identifying at least one face in the two-dimensional picture based on a preset face recognition strategy, and acquiring position information of the at least one face; and screening the target face from the at least one face based on a preset screening strategy, and acquiring the position information of the target face.
Specifically, the screening strategy includes: determining the number of pixels occupied by the at least one face based on the position information of the at least one face; and screening out the faces with the number of the occupied pixels being larger than a number threshold value as target faces.
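As an illustration only, the screening strategy described above can be sketched in a few lines of Python; the (x, y, w, h) box format and the pixel threshold value are assumptions made for the example, not values given by the patent.

```python
# Minimal sketch of the screening strategy: keep only the faces whose
# bounding box covers more pixels than a threshold. The (x, y, w, h)
# box format and the default threshold are illustrative assumptions.
def screen_target_faces(boxes, min_pixels=64 * 64):
    targets = []
    for (x, y, w, h) in boxes:
        if w * h > min_pixels:  # number of pixels occupied by the face
            targets.append((x, y, w, h))
    return targets
```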
The cropped target face picture is input into the target neural network model; a face keypoint detector is set in the target neural network model, and the detector identifies N face keypoints in each picture to obtain the three-dimensional model parameters of the face. Here, the neural network model is a lightweight neural network, so a high-precision, high-quality face reconstruction result is obtained quickly even under limited computing resources.
That is, a face can be recognized accurately only when the area of the face image is larger than the minimum recognition area; otherwise, the face information cannot be recognized accurately and three-dimensional reconstruction cannot be performed.
Further, the face picture is cropped according to the recognized face position information, and the background in the face picture is removed so that only the face part is kept. Specifically, a face recognizer is provided in the three-dimensional face reconstruction apparatus to recognize the face position in the face picture and crop the face, obtaining the cropped face picture. The cropped shape may be a square, an oval, etc.
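A cropping step along these lines can be sketched with plain NumPy array slicing; the slight box expansion and the rectangular crop are assumptions made for illustration (the patent also allows oval crops), not the patented implementation.

```python
import numpy as np

# Sketch of face cropping from the detected position information.
# The small padding around the box is an illustrative assumption.
def crop_face(image: np.ndarray, box, pad: float = 0.1) -> np.ndarray:
    x, y, w, h = box
    dx, dy = int(w * pad), int(h * pad)
    x0, y0 = max(x - dx, 0), max(y - dy, 0)
    x1 = min(x + w + dx, image.shape[1])
    y1 = min(y + h + dy, image.shape[0])
    return image[y0:y1, x0:x1]  # background outside the box is discarded
```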
Specifically, the preset three-dimensional expression template may be a general template stored in an expression library, for example a character template, an animal template, an animation template, and the like, or a template made by the user.
For example, the animation template is driven using the face pose parameters and the facial expression parameters. Because the three-dimensional animation template and the three-dimensional data used to train the neural network have the same spatial topological structure and node semantic information, the face pose parameters can drive the three-dimensional animation to the same pose as the user's current head, and the facial expression parameters can drive the three-dimensional animation to the same expression as the user's current face.
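Because the template shares the topology and blendshape semantics of the training data, driving it reduces to a linear blendshape combination followed by a rigid transform. The following is a minimal sketch of that idea; the variable names and the orthographic treatment mirror the fitting formula given later and are illustrative, not the patent's code.

```python
import numpy as np

# Sketch: drive a 3D template with face pose (s, R, t) and expression
# parameters beta. The bases must share the template's vertex topology.
# mean_shape: (V, 3); expr_bases: (K, V, 3); beta: (K,). All names are
# illustrative assumptions.
def drive_template(mean_shape, expr_bases, beta, s, R, t):
    shape = mean_shape + np.tensordot(beta, expr_bases, axes=1)  # expression
    return s * shape @ R.T + t  # rigid head pose
```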
In some embodiments, the method further comprises: acquiring voice information of the target face acquired by a voice acquisition unit; acquiring audio features corresponding to the target three-dimensional model; adjusting the voice information by utilizing the audio features corresponding to the target three-dimensional model to obtain target audio corresponding to the target face; and storing the three-dimensional face model of the target face and the corresponding target audio.
That is, in the three-dimensional face reconstruction process, the user can not only generate a three-dimensional model containing the user's face information from the user's face and a preset three-dimensional model, but can also combine the user's sound characteristics with an audio template to generate audio with the user's sound characteristics. In this way, the desired effect can be achieved both visually and aurally, and the user can also choose to change only one of the sound and the face.
For example, when the user selects the record button, the mobile phone terminal starts to record the expression display interface in real time, meanwhile, the microphone of the mobile phone is called, the sound of the user is saved, when the user selects to stop, the expression recording is finished, and the three-dimensional face with the sound is saved in the expression library.
By adopting the above technical solution, the neural network is trained on two-dimensional pictures representing different facial expressions together with a three-dimensional standard model with expression capability, so three-dimensional face models with different expressions can be generated by fitting, which increases the realism of three-dimensional face model reconstruction; moreover, a lightweight neural network is used, so a high-precision, high-quality face reconstruction result is obtained quickly even under limited computing resources, and the operation is simple and convenient.
On the basis of the above embodiment, a model training method is further provided, and fig. 2 is a schematic flow chart of the model training method in the embodiment of the present application, as shown in fig. 2, where the method includes:
step 201: acquiring a training sample set; the training sample set comprises at least one two-dimensional picture of facial expression;
in practical applications, the method for obtaining the training sample set may include: controlling a camera to acquire at least one two-dimensional picture of the facial expression; and building a training sample set by using all the acquired two-dimensional pictures. Here, the camera may be any camera capable of capturing two-dimensional pictures, for example: monocular cameras, color cameras, black and white cameras, etc. The two-dimensional picture may be a black-and-white picture or a color picture. The training sample set may be downloaded directly from a library of avatars in the network.
In practical application, when a training sample set is established, as many facial expression samples as possible need to be collected, so that the trained neural network model can simulate more expressions.
Specifically, the types of facial expressions in the training sample set include at least one of the following: smiling, lip tucking, frowning, eyebrow raising, anger, jaw left, jaw right, jaw forward, mouth left, mouth right, chin raising, opening the mouth wide, cheek puffing, eye closing, and sadness.
In some embodiments, the face types in the training sample set include at least one of: race, age, sex, angle, face shape.
That is, in building the training sample set, in addition to the facial expression, other factors that affect three-dimensional reconstruction of the face, such as race, age, sex, weight, height, face shape, photographing angle, etc., should be considered.
Two-dimensional pictures of faces with different ages, sexes, angles and complexions can be acquired and stored at different angles in different scenes through electronic equipment with cameras such as mobile phones, cameras and wearable equipment; a training sample set is established by utilizing two-dimensional pictures acquired by a plurality of electronic devices; and sending the training sample set to a three-dimensional face reconstruction device, so that the three-dimensional face reconstruction device trains the neural network model by using the training sample set.
Step 202: detecting key points of a human face on the two-dimensional pictures in the training sample set, and determining N two-dimensional key points of the human face in the two-dimensional pictures;
here, in order to generate a three-dimensional face image that matches a real face in a two-dimensional picture, it is necessary to identify the face in the two-dimensional picture and detect N pieces of key point information that are capable of characterizing features of the face, such as a facial expression, a facial pose, a facial identity, and the like.
In practical applications, the more keypoints there are, the more comprehensive the face information is; but more keypoints also place higher demands on the processor and increase cost. Therefore, to balance cost and effect, the number of keypoints N in the embodiments of the present application is an integer greater than 68, for example 90, 106, or 240. Compared with the conventional 68 keypoints or fewer, this provides more face information and improves the accuracy of three-dimensional face reconstruction.
In some embodiments, the detecting the key points of the face for the two-dimensional images in the training sample set, and determining N two-dimensional key points of the face in the two-dimensional images includes: performing face detection and face cutting on the two-dimensional picture to obtain a cut two-dimensional picture; and detecting key points of the faces of the cut two-dimensional pictures, and determining N two-dimensional key points of the faces in the two-dimensional pictures.
Specifically, a face recognition device and a face key point detector can be set, one or more face positions in a two-dimensional picture are recognized by the face recognition device, and the face is cut to obtain a picture only containing the face; and then the face key points of each picture are identified by using a face key point detector.
Step 203: based on the corresponding relation between the N two-dimensional key points and N three-dimensional key points in the face three-dimensional standard model, performing iterative fitting on the face three-dimensional standard model through a preset optimization algorithm to obtain standard three-dimensional model parameters of the two-dimensional picture;
Illustratively, the two-dimensional keypoints contain only x-axis and y-axis information, while the three-dimensional keypoints contain x-axis, y-axis, and z-axis information; based on the index of a two-dimensional keypoint (x1, y1) in the two-dimensional picture, the point (x1, y1, z1) in the three-dimensional standard model with the same x-axis and y-axis information is obtained as the corresponding three-dimensional keypoint.
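Under this index-based reading, the 2D-to-3D correspondence is simply a fixed lookup table from landmark index to vertex index on the standard model, established once by semantics. A minimal sketch follows; the table values and array shapes are illustrative assumptions, not data from the patent.

```python
import numpy as np

# Sketch: each 2D landmark index maps to a fixed, semantically
# corresponding vertex index on the 3D standard model. The three table
# entries shown are made-up placeholders; in practice there are 106.
LANDMARK_TO_VERTEX = {0: 1278, 1: 4051, 2: 887}

def keypoints_3d(model_vertices: np.ndarray, landmark_ids) -> np.ndarray:
    # model_vertices: (V, 3) array of the standard model's vertices
    return model_vertices[[LANDMARK_TO_VERTEX[i] for i in landmark_ids]]
```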
By way of example, using the correspondence between the 106 face keypoints of each face picture and the 106 semantically corresponding keypoints in the face three-dimensional standard model, iteration is carried out continuously through an optimization algorithm until the face three-dimensional standard template is deformed into the shape of the face in the picture. The optimization algorithm takes the following formula as the optimization target:
$$\min_{s,R,t,\alpha,\beta}\ \sum_{n=1}^{106}\left\| x_n^{2d} - \Pi\!\left(sR\,X_n^{3d} + t\right)\right\|_2^2,$$

$$X^{3d} = \overline{S} + \sum_i \alpha_i S_i + \sum_i \beta_i B_i,$$

until the algorithm converges. Here $s$ is a scaling factor, $R$ is a rotation angle parameter, and $t$ is a translation parameter; these three parameters form the face pose parameters. $X_n^{3d}$ is the three-dimensional keypoint coordinate on the three-dimensional model corresponding to two-dimensional keypoint $n$ in the iterative optimization process, $\Pi(sR\,X_n^{3d}+t)$ is the two-dimensional keypoint coordinate after parallel projection, $x_n^{2d}$ is the coordinate of two-dimensional keypoint $n$, $\overline{S}$ is the average face, $\alpha_i$ is a face identity parameter, $S_i$ is a face identity basis, $\beta_i$ is a facial expression parameter, and $B_i$ is a facial expression basis.
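One way to run this iteration is with a generic least-squares solver. The sketch below fits the pose and coefficient parameters to the detected 2D keypoints under an orthographic projection; the Euler-angle parameterization of R and the use of SciPy are assumptions made for illustration, not the patent's optimizer.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

# Sketch of the keypoint-fitting objective above. kp2d: (N, 2) detected
# 2D keypoints; mean, S, B are the average face and identity/expression
# bases restricted to the N corresponding 3D keypoints. Illustrative only.
def fit_3dmm(kp2d, mean, S, B):
    ni, ne = S.shape[0], B.shape[0]

    def residuals(p):
        s, euler, t = p[0], p[1:4], p[4:6]
        alpha, beta = p[6:6 + ni], p[6 + ni:]
        R = Rotation.from_euler("xyz", euler).as_matrix()
        X = mean + np.tensordot(alpha, S, axes=1) + np.tensordot(beta, B, axes=1)
        proj = s * (X @ R.T)[:, :2] + t  # parallel (orthographic) projection
        return (proj - kp2d).ravel()

    p0 = np.zeros(6 + ni + ne)
    p0[0] = 1.0  # start from unit scale, zero rotation and offsets
    return least_squares(residuals, p0).x
```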
In some embodiments, the face three-dimensional standard model is a three-dimensional deformation model (3D Morphable Model, 3DMM). Training the neural network with a 3DMM standard model that has expression capability makes it possible to fit and generate three-dimensional face models with different expressions, which increases the realism of the reconstructed three-dimensional face models.
Through steps 201 to 203, the training pictures and the corresponding standard three-dimensional model parameters required for training the neural network model are generated. Here, the number of pictures in the training sample set is on the order of millions, and the standard three-dimensional model parameters serve as the ground-truth values for neural network training.
Step 204: and taking the two-dimensional pictures in the training sample set as input, taking the standard three-dimensional model parameters of the two-dimensional pictures in the training sample set as target output, and training a neural network model to obtain the target neural network model.
Specifically, the three-dimensional model parameters include: face pose parameters, face identity parameters, and facial expression parameters. Accordingly, the process of training the neural network model includes: taking the two-dimensional pictures in the training sample set as the input of the neural network model and outputting predicted three-dimensional model parameters; and calculating a loss function for each predicted parameter group against the standard three-dimensional model parameters, and adjusting the neural network model to obtain the trained neural network model. Here, when the model loss is calculated, the three groups of parameters are evaluated separately; when one group is being evaluated, the other two groups use the true values (i.e. the standard three-dimensional model parameters), which helps the model converge better.
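One plausible reading of this group-wise scheme is sketched below: when one predicted group is scored, the other two groups are clamped to their ground-truth values, so each group's error is measured in isolation. The `reconstruct` callable (for example, projected keypoints), the squared error, and the equal weighting are all assumptions for illustration.

```python
import tensorflow as tf

# Sketch of the three-group loss: score one predicted parameter group at
# a time while the other two use ground-truth (standard) values. The
# reconstruct() function and equal weighting are illustrative assumptions.
def group_loss(pred, gt, reconstruct):
    # pred / gt: dicts with keys "pose", "identity", "expression"
    loss = 0.0
    for key in ("pose", "identity", "expression"):
        mixed = dict(gt)        # ground truth for the other two groups
        mixed[key] = pred[key]  # predicted values for the group under test
        loss += tf.reduce_mean(tf.square(reconstruct(**mixed) - reconstruct(**gt)))
    return loss
```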
In practical application, before the two-dimensional picture is input into the neural network model, the two-dimensional picture can be subjected to face recognition and face cutting, and the cut two-dimensional picture is input into the neural network model.
Here, the trained neural network model can be deployed on any terminal; the terminal obtains a two-dimensional picture of the user's face, inputs it into the trained neural network model, and directly outputs the user's three-dimensional reconstructed model. The expression of the three-dimensional reconstructed model changes in step with the changes in the user's expression.
Based on the three-dimensional face reconstruction method, a specific implementation scene is provided in the embodiment of the application as follows.
The process for obtaining the parameters of the face standard three-dimensional model is as follows:
step 1: collecting a two-dimensional picture of a human face;
specifically, the acquisition requirements are: age distribution is broad and average, ranging from 5 years to 80 years as much as possible; the sex ratio is balanced, and the male and female ratio is kept at about 1; the species are uniformly distributed, the species of east Asian, middle Asian, caucasian, black, and the like are uniformly distributed, and other species also have partial pictures; the people with various facial forms are covered when the face pictures are collected. For each person, 73 face poses and 15 expressions are required to be acquired.
The face poses include: frontal face; rotated left 30 degrees (roll), left 60 degrees, left 90 degrees, right 30 degrees, right 60 degrees, and right 90 degrees (6 poses); head up 30 degrees (pitch), head up 60 degrees, head down 30 degrees, and head down 60 degrees (4 poses); offset left 30 degrees (yaw), offset left 60 degrees, offset right 30 degrees, and offset right 60 degrees (4 poses); the 3 left-rotation cases combined with the 2 head-up cases (6 roll + pitch combinations), the 3 right-rotation cases combined with the 2 head-up cases (6 combinations), the 3 left-rotation cases combined with the 2 head-down cases (6 combinations), and the 3 right-rotation cases combined with the 2 head-down cases (6 combinations); the 3 left-rotation cases combined with the 2 left-offset cases (6 roll + yaw combinations), the 3 right-rotation cases combined with the 2 right-offset cases (6 combinations), the 3 left-rotation cases combined with a 30-degree right offset (3 combinations), and the 3 right-rotation cases combined with a 30-degree left offset (3 combinations); and the 2 left-offset cases combined with the 2 head-up cases (4 yaw + pitch combinations), the 2 right-offset cases combined with the 2 head-up cases (4 combinations), the 2 left-offset cases combined with the 2 head-down cases (4 combinations), and the 2 right-offset cases combined with the 2 head-down cases (4 combinations); 73 poses in total.
The facial expressions include: smiling, lip tucking, frowning, eyebrow raising, anger, jaw left, jaw right, jaw forward, mouth left, mouth right, chin raising, opening the mouth wide, cheek puffing, eye closing, and sadness, 15 expressions in total.
Step 2: face recognition and cutting;
specifically, a face recognition device is arranged, one or more face positions in a two-dimensional picture are recognized by the face recognition device, and the faces are cut to obtain a picture only containing the faces.
Step 3: detecting two-dimensional key points of a human face;
specifically, a face key point detector is set, and 106 face key points in each picture are identified by using the face key point detector.
Step 4: and (5) performing iterative optimization to obtain standard three-dimensional model parameters.
Specifically, using the correspondence between the 106 face keypoints of each face picture and the 106 semantically corresponding keypoints in the face three-dimensional standard model, iteration is carried out continuously through an optimization algorithm until the face three-dimensional standard template is deformed into the shape of the face in the picture. The optimization algorithm takes the following formula as the optimization target:
$$\min_{s,R,t,\alpha,\beta}\ \sum_{n=1}^{106}\left\| x_n^{2d} - \Pi\!\left(sR\,X_n^{3d} + t\right)\right\|_2^2,$$

$$X^{3d} = \overline{S} + \sum_i \alpha_i S_i + \sum_i \beta_i B_i,$$

until the algorithm converges. Here $s$ is a scaling factor, $R$ is a rotation angle parameter, and $t$ is a translation parameter; these three parameters form the face pose parameters. $X_n^{3d}$ is the three-dimensional keypoint coordinate on the three-dimensional model corresponding to two-dimensional keypoint $n$ in the iterative optimization process, $\Pi(sR\,X_n^{3d}+t)$ is the two-dimensional keypoint coordinate after parallel projection, $x_n^{2d}$ is the coordinate of two-dimensional keypoint $n$, $\overline{S}$ is the average face, $\alpha_i$ is a face identity parameter, $S_i$ is a face identity basis, $\beta_i$ is a facial expression parameter, and $B_i$ is a facial expression basis.
The neural network model is obtained as follows:
step 1: and constructing a network model by using tensorflw, generating tfrecords from training data in a training sample set, and constructing a network training process, so that an input- > neural network model- > output- > loss function forms a complete chain.
Step 2: train the model for 50 rounds, where one round means running through all the data in the dataset once.
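A skeleton of the training chain from Steps 1 and 2 might look like the following; the record schema, image size, parameter dimension (62), and the tiny network body are placeholder assumptions, since the patent only states that a lightweight model is used.

```python
import tensorflow as tf

# Skeleton of the chain: tfrecords -> dataset -> network -> loss.
# Feature names, image size, and the 62-dimensional parameter vector
# (pose + identity + expression) are placeholder assumptions.
def parse_example(record):
    feats = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "params": tf.io.FixedLenFeature([62], tf.float32),
    })
    img = tf.image.resize(tf.io.decode_jpeg(feats["image"], channels=3), (128, 128))
    return img / 255.0, feats["params"]

dataset = (tf.data.TFRecordDataset("train.tfrecords")  # hypothetical file name
           .map(parse_example).shuffle(1024).batch(32))

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, 2, activation="relu", input_shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(32, 3, 2, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(62),  # predicted 3D model parameters
])
model.compile(optimizer="adam", loss="mse")
model.fit(dataset, epochs=50)  # 50 rounds, as in Step 2
```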
Step 3: the output of the model is face pose parameters { s, R, t }, face identity parameters and face expression parameters.
Step 4: when the model loss parameters are calculated, three groups of parameters are calculated respectively, and when one group of parameters is calculated, the other two groups of parameters use true values (namely standard three-dimensional model parameters), so that the model converges better.
After the deep neural network model training is finished, the cut face picture can be directly used as input to generate a face three-dimensional model corresponding to the picture. It should be noted that the face three-dimensional model can be generated by directly using a single face picture, so that the use is convenient, and the complex operation of a user is not required; meanwhile, the used deep neural network adopts a lightweight and rapid model, and can run on a mobile phone terminal in real time.
The three-dimensional face reconstruction process by using the trained neural network model comprises the following steps:
step 1: the expression making interface is built in the system input method, when the user selects the expression interface, a "+" sign is arranged at the lower right corner of the expression making interface, and the expression can be added by clicking the button. When the user clicks the button to add the expression, the pop-up interface has an option for making the 3D animation expression for the user to select.
Here, a 3D animation expression template can be made for the user by using the trained neural network model, and the made template is added into an expression library of the system input method, so that the user can use the homemade 3D expression in the chat process.
Step 2: and the user selects to manufacture the 3D animation expression, displays a 3D animation expression manufacturing interface, and simultaneously starts a front monocular camera of the mobile phone.
Step 3: the user selects a three-dimensional animated expression template from the cartoon standard model library, and templates can also be downloaded from an application store; the user can also generate a three-dimensional model from a selfie using the deep neural network, select various stickers to paste on the generated three-dimensional model, construct a three-dimensional animation model with the shape of the user's face and the selected stickers as textures, and store it in the expression library.
Step 4: after the user selects the three-dimensional animation expression template, the interface displays the selected three-dimensional animation expression in real time and is driven by the facial expression and head action of the user.
Step 5: the process of driving the three-dimensional animation expression by the user is as follows: the user makes actions, expressions, talks and the like at will, the front camera captures face images of the user in real time, the face areas are detected through the face detector, face cutting is carried out, the cut face images are sent into the deep neural network for generating parameters, and the face posture parameters, the face identity parameters and the face expression parameters are output through the neural network. The facial pose parameters and facial expression parameters are used here to drive the animated expression. Because the three-dimensional animation template and the three-dimensional data used for training the neural network have the same spatial topological structure and node semantic information, the three-dimensional animation can be driven to the same gesture as the current head of the user by using gesture parameters, and the three-dimensional animation can be driven to the same expression as the current face of the user by using facial expression parameters.
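Putting the pieces together, the per-frame driving loop of Step 5 can be sketched as follows; `detector`, `param_net`, and `render_template` are stand-ins for the components described above (not real library APIs), and `crop_face` is the cropping sketch given earlier.

```python
import cv2

# Sketch of the real-time driving loop from Step 5. The detector,
# parameter network, and renderer objects are illustrative stand-ins.
def drive_loop(detector, param_net, render_template):
    cap = cv2.VideoCapture(0)  # front camera
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        box = detector.detect(frame)  # face region, or None if no face
        if box is None:
            continue
        face = crop_face(frame, box)  # crop as sketched earlier
        pose, identity, expr = param_net.predict(face)
        render_template(pose, expr)  # identity is not needed for driving
```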
Step 6: when the user selects the record button, the mobile phone terminal starts to record the expression display interface in real time; meanwhile, the microphone of the mobile phone is invoked and the user's voice is saved.
Step 7: when the user selects to stop, the expression recording is finished, and the expression with sound is stored in a built-in expression library of the system input method.
Step 8: the stored three-dimensional animation expression can be selected by a system input method and sent to the terminal of the contact in the chat software.
The embodiment of the application also provides a three-dimensional face reconstruction device, as shown in fig. 3, which comprises:
an acquiring unit 301, configured to acquire a two-dimensional picture including a face when a face acquiring instruction is detected;
the detection unit 302 is configured to identify a target face in the two-dimensional picture based on a preset face recognition policy, and obtain location information of the target face;
a clipping unit 303, configured to clip the target face in the two-dimensional picture based on the position information of the target face, so as to obtain a clipped target face picture;
the reconstruction unit 304 is configured to input the clipped target face picture into a target neural network model, and output three-dimensional model parameters of the target face; and driving the target three-dimensional model to carry out three-dimensional reconstruction based on the three-dimensional model parameters of the target face to obtain a three-dimensional face model of the target face.
In some embodiments, the detecting unit 302 is specifically configured to identify at least one face in the two-dimensional picture based on a preset face recognition policy, and obtain location information of the at least one face; and screening the target face from the at least one face based on a preset screening strategy, and acquiring the position information of the target face.
In some embodiments, the screening strategy comprises: determining the number of pixels occupied by the at least one face based on the position information of the at least one face; and screening out the faces with the number of the occupied pixels being larger than a number threshold value as target faces.
In some embodiments, the apparatus further comprises: the voice acquisition unit is used for acquiring the voice information of the target face acquired by the voice acquisition unit;
the voice processing unit is also used for acquiring the audio characteristics corresponding to the target three-dimensional model; adjusting the voice information by utilizing the audio features corresponding to the target three-dimensional model to obtain target audio corresponding to the target face;
the storage unit is used for storing the three-dimensional face model of the target face and the corresponding target audio.
In some embodiments, the obtaining unit is further configured to obtain a training sample set; the training sample set comprises at least one two-dimensional picture of facial expression;
the apparatus further comprises: the training unit is also used for detecting key points of the face of the two-dimensional pictures in the training sample set and determining N two-dimensional key points of the face in the two-dimensional pictures; based on the corresponding relation between the N two-dimensional key points and N three-dimensional key points in the face three-dimensional standard model, performing iterative fitting on the face three-dimensional standard model through a preset optimization algorithm to obtain standard three-dimensional model parameters of the two-dimensional picture; and taking the two-dimensional pictures in the training sample set as input, taking the standard three-dimensional model parameters of the two-dimensional pictures in the training sample set as target output, and training a neural network model to obtain the target neural network model.
In some embodiments, the categories of facial expressions in the training sample set include at least one of the following: smiling, lip tucking, frowning, eyebrow raising, anger, jaw left, jaw right, jaw forward, mouth left, mouth right, chin raising, opening the mouth wide, cheek puffing, eye closing, and sadness.
In some embodiments, the face types in the training sample set include at least one of: race, age, sex, angle, face shape.
The embodiment of the application also provides a three-dimensional face reconstruction device, as shown in fig. 4, which includes: a processor 401 and a memory 402 configured to store a computer program capable of running on the processor; the processor 401 when running a computer program in the memory 402 implements the steps of:
when a face picture acquisition instruction is detected, acquiring a two-dimensional picture containing a face;
identifying a target face in the two-dimensional picture based on a preset face identification strategy, and acquiring position information of the target face;
cutting out the target face in the two-dimensional picture based on the position information of the target face to obtain a cut-out target face picture;
inputting the cut target face picture into a target neural network model, and outputting three-dimensional model parameters of the target face;
and driving the target three-dimensional model to carry out three-dimensional reconstruction based on the three-dimensional model parameters of the target face to obtain a three-dimensional face model of the target face.
In some embodiments, the processor 401, when running a computer program in the memory 402, implements the following steps: identifying at least one face in the two-dimensional picture based on a preset face recognition strategy, and acquiring position information of the at least one face; and screening the target face from the at least one face based on a preset screening strategy, and acquiring the position information of the target face.
In some embodiments, the screening strategy comprises: determining the number of pixels occupied by the at least one face based on the position information of the at least one face; and screening out the faces with the number of the occupied pixels being larger than a number threshold value as target faces.
In some embodiments, the processor 401 when running the computer program in the memory 402 also implements the following steps: acquiring voice information of the target face acquired by a voice acquisition unit; acquiring audio features corresponding to the target three-dimensional model; adjusting the voice information by utilizing the audio features corresponding to the target three-dimensional model to obtain target audio corresponding to the target face; and storing the three-dimensional face model of the target face and the corresponding target audio.
In some embodiments, the processor 401 when running the computer program in the memory 402 also implements the following steps: acquiring a training sample set; the training sample set comprises at least one two-dimensional picture of facial expression; detecting key points of a human face on the two-dimensional pictures in the training sample set, and determining N two-dimensional key points of the human face in the two-dimensional pictures; based on the corresponding relation between the N two-dimensional key points and N three-dimensional key points in the face three-dimensional standard model, performing iterative fitting on the face three-dimensional standard model through a preset optimization algorithm to obtain standard three-dimensional model parameters of the two-dimensional picture; and taking the two-dimensional pictures in the training sample set as input, taking the standard three-dimensional model parameters of the two-dimensional pictures in the training sample set as target output, and training a neural network model to obtain the target neural network model.
In some embodiments, the categories of facial expressions in the training sample set include at least one of the following: smiling, lip tucking, frowning, eyebrow raising, anger, jaw left, jaw right, jaw forward, mouth left, mouth right, chin raising, opening the mouth wide, cheek puffing, eye closing, and sadness.
In some embodiments, the face types in the training sample set include at least one of: race, age, sex, angle, face shape.
Of course, in actual practice, the various components of the device are coupled together through a bus system 403, as shown in Fig. 4. It can be understood that the bus system 403 is used to enable connection and communication between these components. In addition to the data bus, the bus system 403 includes a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are all labeled as bus system 403 in Fig. 4.
In practical applications, the processor may be at least one of an application-specific integrated circuit (ASIC), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a controller, a microcontroller, and a microprocessor. It can be understood that the electronic device implementing the above processor function may differ between apparatuses, which is not specifically limited in the embodiments of the present application.
The memory may be a volatile memory, such as random-access memory (RAM); or a non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memory, and it provides instructions and data to the processor.
Embodiments of the present application also provide a computer-readable storage medium for storing a computer program.
Optionally, the computer readable storage medium may be applied to any three-dimensional face reconstruction device in the embodiments of the present application, and the computer program causes a computer to execute corresponding processes implemented by a processor in each method in the embodiments of the present application, which is not described herein for brevity.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in practice, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the components shown or discussed may be coupled, directly coupled, or communicatively connected to each other through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated in one processing module, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units. Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware related to program instructions; the foregoing program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The methods disclosed in the several method embodiments provided in the present application may be arbitrarily combined without collision to obtain a new method embodiment.
The features disclosed in the several product embodiments provided in the present application may be combined arbitrarily without conflict to obtain new product embodiments.
The features disclosed in the several method or apparatus embodiments provided in the present application may be arbitrarily combined without conflict to obtain new method embodiments or apparatus embodiments.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A three-dimensional face reconstruction method, the method comprising:
acquiring a training sample set; the training sample set comprises at least one two-dimensional picture of facial expression;
detecting key points of a human face on the two-dimensional pictures in the training sample set, and determining N two-dimensional key points of the human face in the two-dimensional pictures;
based on the corresponding relation between the N two-dimensional key points and N three-dimensional key points in the face three-dimensional standard model, performing iterative fitting on the face three-dimensional standard model through a preset optimization algorithm to obtain standard three-dimensional model parameters of the two-dimensional picture;
taking the two-dimensional pictures in the training sample set as input, taking standard three-dimensional model parameters of the two-dimensional pictures in the training sample set as target output, and training a neural network model to obtain a target neural network model;
when a face picture acquisition instruction is detected, acquiring a two-dimensional picture containing a face;
identifying a target face in the two-dimensional picture based on a preset face identification strategy, and acquiring position information of the target face;
cutting out the target face in the two-dimensional picture based on the position information of the target face to obtain a cut-out target face picture;
inputting the cut target face picture into the target neural network model, and outputting three-dimensional model parameters of the target face;
and driving the target three-dimensional model to carry out three-dimensional reconstruction based on the three-dimensional model parameters of the target face to obtain a three-dimensional face model of the target face.
2. The method according to claim 1, wherein the identifying the target face in the two-dimensional picture based on the preset face recognition policy, and obtaining the location information of the target face, includes:
identifying at least one face in the two-dimensional picture based on a preset face recognition strategy, and acquiring position information of the at least one face;
and screening the target face from the at least one face based on a preset screening strategy, and acquiring the position information of the target face.
3. The method of claim 2, wherein the screening strategy comprises:
determining the number of pixels occupied by the at least one face based on the position information of the at least one face;
and screening out the faces with the number of the occupied pixels being larger than a number threshold value as target faces.
4. The method according to claim 1, wherein the method further comprises:
acquiring voice information of the target face acquired by a voice acquisition unit;
acquiring audio features corresponding to the target three-dimensional model;
adjusting the voice information by utilizing the audio features corresponding to the target three-dimensional model to obtain target audio corresponding to the target face;
and storing the three-dimensional face model of the target face and the corresponding target audio.
5. The method of claim 1, wherein the categories of facial expressions in the training sample set include at least one of the following: smiling, lip tucking, frowning, eyebrow raising, anger, jaw left, jaw right, jaw forward, mouth left, mouth right, chin raising, opening the mouth wide, cheek puffing, eye closing, and sadness.
6. The method of claim 1, wherein the face types in the training sample set comprise at least one of: race, age, sex, angle, face shape.
7. A three-dimensional face reconstruction apparatus, the apparatus comprising:
the training unit is used for carrying out face key point detection on the two-dimensional pictures in the training sample set and determining N two-dimensional key points of faces in the two-dimensional pictures; based on the corresponding relation between the N two-dimensional key points and N three-dimensional key points in the face three-dimensional standard model, performing iterative fitting on the face three-dimensional standard model through a preset optimization algorithm to obtain standard three-dimensional model parameters of the two-dimensional picture; taking the two-dimensional pictures in the training sample set as input, taking standard three-dimensional model parameters of the two-dimensional pictures in the training sample set as target output, and training a neural network model to obtain a target neural network model; wherein the training sample set comprises at least one two-dimensional picture of facial expression;
the acquisition unit is used for acquiring a two-dimensional picture containing the human face when detecting a human face acquisition instruction;
the detection unit is used for identifying a target face in the two-dimensional picture based on a preset face identification strategy and acquiring the position information of the target face;
the clipping unit is used for clipping the target face in the two-dimensional picture based on the position information of the target face to obtain a clipped target face picture;
the reconstruction unit is used for inputting the cut target face picture into the target neural network model and outputting three-dimensional model parameters of the target face; and driving the target three-dimensional model to carry out three-dimensional reconstruction based on the three-dimensional model parameters of the target face to obtain a three-dimensional face model of the target face.
8. A three-dimensional face reconstruction apparatus, the apparatus comprising: a processor and a memory configured to store a computer program capable of running on the processor,
wherein the processor is configured to perform the steps of the method of any of claims 1 to 6 when the computer program is run.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN201911148553.0A 2019-11-21 2019-11-21 Three-dimensional face reconstruction method, device, equipment and storage medium Active CN110956691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911148553.0A CN110956691B (en) 2019-11-21 2019-11-21 Three-dimensional face reconstruction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911148553.0A CN110956691B (en) 2019-11-21 2019-11-21 Three-dimensional face reconstruction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110956691A CN110956691A (en) 2020-04-03
CN110956691B true CN110956691B (en) 2023-06-06

Family

ID=69977961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911148553.0A Active CN110956691B (en) 2019-11-21 2019-11-21 Three-dimensional face reconstruction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110956691B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111696179A (en) * 2020-05-06 2020-09-22 广东康云科技有限公司 Method and device for generating cartoon three-dimensional model and virtual simulator and storage medium
CN113689538B (en) * 2020-05-18 2024-05-21 北京达佳互联信息技术有限公司 Video generation method and device, electronic equipment and storage medium
CN112509144A (en) * 2020-12-09 2021-03-16 深圳云天励飞技术股份有限公司 Face image processing method and device, electronic equipment and storage medium
CN112633191A (en) * 2020-12-28 2021-04-09 百果园技术(新加坡)有限公司 Method, device and equipment for reconstructing three-dimensional face and storage medium
CN112884881B (en) * 2021-01-21 2022-09-27 魔珐(上海)信息科技有限公司 Three-dimensional face model reconstruction method and device, electronic equipment and storage medium
CN112581591A (en) * 2021-01-29 2021-03-30 秒影工场(北京)科技有限公司 Adjustable human face picture generation method based on GAN and three-dimensional model parameters
CN113538221A (en) * 2021-07-21 2021-10-22 Oppo广东移动通信有限公司 Three-dimensional face processing method, training method, generating method, device and equipment
CN115187705B (en) * 2022-09-13 2023-01-24 之江实验室 Voice-driven face key point sequence generation method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163953A (en) * 2019-03-11 2019-08-23 腾讯科技(深圳)有限公司 Three-dimensional facial reconstruction method, device, storage medium and electronic device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999942B (en) * 2012-12-13 2015-07-15 清华大学 Three-dimensional face reconstruction method
CN105844276A (en) * 2015-01-15 2016-08-10 北京三星通信技术研究有限公司 Face posture correction method and face posture correction device
CN105390133A (en) * 2015-10-09 2016-03-09 西北师范大学 Tibetan TTVS system realization method
CN105551071B (en) * 2015-12-02 2018-08-10 中国科学院计算技术研究所 A kind of the human face animation generation method and system of text voice driving
CN106503671B (en) * 2016-11-03 2019-07-12 厦门中控智慧信息技术有限公司 The method and apparatus for determining human face posture
CN109960986A (en) * 2017-12-25 2019-07-02 北京市商汤科技开发有限公司 Human face posture analysis method, device, equipment, storage medium and program
CN108805977A (en) * 2018-06-06 2018-11-13 浙江大学 A kind of face three-dimensional rebuilding method based on end-to-end convolutional neural networks
CN110188590B (en) * 2019-04-09 2021-05-11 浙江工业大学 Face shape distinguishing method based on three-dimensional face model
CN110414370B (en) * 2019-07-05 2021-09-14 深圳云天励飞技术有限公司 Face shape recognition method and device, electronic equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163953A (en) * 2019-03-11 2019-08-23 腾讯科技(深圳)有限公司 Three-dimensional facial reconstruction method, device, storage medium and electronic device

Also Published As

Publication number Publication date
CN110956691A (en) 2020-04-03

Similar Documents

Publication Publication Date Title
CN110956691B (en) Three-dimensional face reconstruction method, device, equipment and storage medium
CN109325437B (en) Image processing method, device and system
US11354825B2 (en) Method, apparatus for generating special effect based on face, and electronic device
US10832039B2 (en) Facial expression detection method, device and system, facial expression driving method, device and system, and storage medium
JP6636154B2 (en) Face image processing method and apparatus, and storage medium
CN107886032B (en) Terminal device, smart phone, authentication method and system based on face recognition
CN111028330B (en) Three-dimensional expression base generation method, device, equipment and storage medium
WO2020103700A1 (en) Image recognition method based on micro facial expressions, apparatus and related device
CN106372629B (en) Living body detection method and device
CN110738595B (en) Picture processing method, device and equipment and computer storage medium
CN113287118A (en) System and method for face reproduction
CN108830892B (en) Face image processing method and device, electronic equipment and computer readable storage medium
JP7387202B2 (en) 3D face model generation method, apparatus, computer device and computer program
CN111369428B (en) Virtual head portrait generation method and device
CN113362263B (en) Method, apparatus, medium and program product for transforming an image of a virtual idol
CN108875539B (en) Expression matching method, device and system and storage medium
CN112308977B (en) Video processing method, video processing device, and storage medium
CN112102468B (en) Model training method, virtual character image generation device, and storage medium
CN114445562A (en) Three-dimensional reconstruction method and device, electronic device and storage medium
CN111028318A (en) Virtual face synthesis method, system, device and storage medium
CN114373044A (en) Method, device, computing equipment and storage medium for generating three-dimensional face model
US20160140748A1 (en) Automated animation for presentation of images
CN110766631A (en) Face image modification method and device, electronic equipment and computer readable medium
CN113176827B (en) AR interaction method and system based on expressions, electronic device and storage medium
CN113952738A (en) Virtual character head portrait generation method and device, electronic equipment and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant