CN117011449A - Reconstruction method and device of three-dimensional face model, storage medium and electronic equipment


Info

Publication number
CN117011449A
Authority
CN
China
Prior art keywords
image
expression
facial
target
face
Prior art date
Legal status
Pending
Application number
CN202210641294.0A
Other languages
Chinese (zh)
Inventor
张昕昳
朱俊伟
贺珂珂
朱飞达
邰颖
汪铖杰
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210641294.0A
Publication of CN117011449A


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G06V 40/171 - Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 - Facial expression recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/18 - Eye characteristics, e.g. of the iris
    • G06V 40/193 - Preprocessing; Feature extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Ophthalmology & Optometry (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a reconstruction method and device for a three-dimensional face model, a storage medium, and electronic equipment. The method comprises the following steps: acquiring a target face image of a target object; performing feature extraction on the target face image to obtain expression features of the target object; determining three-dimensional reconstruction parameters of the target face image by using an expression characterization vector matched with the expression features and a reconstruction reference characterization vector corresponding to the target face image, wherein the reconstruction reference characterization vector comprises: a pixel characterization vector matched with pixel features of the target face image, and a face characterization vector matched with facial features of the target object; and performing three-dimensional reconstruction on the target face image according to the three-dimensional reconstruction parameters to obtain a three-dimensional face model matched with the target object. The invention solves the technical problem that face models obtained by existing three-dimensional face model reconstruction methods have low accuracy.

Description

Reconstruction method and device of three-dimensional face model, storage medium and electronic equipment
Technical Field
The present invention relates to the field of computers, and in particular, to a method and apparatus for reconstructing a three-dimensional face model, a storage medium, and an electronic device.
Background
In many applications today, a three-dimensional face model may be reconstructed from a user's real face image in order to enrich the user's product experience. For example, in the metaverse, the reconstructed three-dimensional face model may be used to generate drivable avatar heads; as another example, in some 3D games, face reconstruction may be used to build the face of a virtual game character that approximates the facial features of a real user.
A common reconstruction approach at present is as follows: a facial image of the user is acquired, key point features are extracted from the facial image, and corresponding three-dimensional face model parameters are generated based on the key point features. The real facial features are then migrated onto the face of the corresponding virtual character according to the definition of those three-dimensional face model parameters. Because this reconstruction method is limited by a fixed parameter definition standard, the migrated expression of the virtual character can hardly reproduce fine expressions, resulting in low accuracy of the three-dimensional face model reconstruction result.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the invention provides a reconstruction method and device of a three-dimensional face model, a storage medium and electronic equipment, and aims to at least solve the technical problem that the accuracy of the face model obtained by the existing reconstruction method of the three-dimensional face model is low.
According to an aspect of an embodiment of the present invention, there is provided a reconstruction method of a three-dimensional face model, including: acquiring a target face image of a target object; extracting features of the target facial image to obtain expression features of the target object; determining three-dimensional reconstruction parameters of the target facial image by using an expression characterization vector matched with the expression characteristics and a reconstruction reference characterization vector corresponding to the target facial image, wherein the reconstruction reference characterization vector comprises: a pixel characterization vector matching the pixel characteristics of the target face image, a face characterization vector matching the face characteristics of the target object; and carrying out three-dimensional reconstruction on the target face image according to the three-dimensional reconstruction parameters to obtain a three-dimensional face model matched with the target object.
According to another aspect of the embodiment of the present invention, there is also provided a reconstruction apparatus of a three-dimensional face model, including: an acquisition unit configured to acquire a target face image of a target object; an extracting unit, configured to perform feature extraction on the target face image to obtain expression features of the target object; a determining unit, configured to determine a three-dimensional reconstruction parameter of the target face image by using an expression token vector matched with the expression feature and a reconstruction reference token vector corresponding to the target face image, where the reconstruction reference token vector includes: a pixel characterization vector matching the pixel characteristics of the target face image, a face characterization vector matching the face characteristics of the target object; and the reconstruction unit is used for carrying out three-dimensional reconstruction on the target face image according to the three-dimensional reconstruction parameters to obtain a three-dimensional face model matched with the target object.
According to still another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having stored therein a computer program, wherein the computer program is configured to perform the above-described reconstruction method of a three-dimensional face model when run.
According to yet another aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs a reconstruction method of the three-dimensional face model as above.
According to still another aspect of the embodiments of the present application, there is also provided an electronic device including a memory in which a computer program is stored, and a processor configured to execute the above-described reconstruction method of a three-dimensional face model by the computer program.
In the embodiment of the application, a target face image of a target object is acquired; feature extraction is performed on the target face image to obtain expression features of the target object; three-dimensional reconstruction parameters of the target face image are determined by using the expression characterization vector matched with the expression features and the reconstruction reference characterization vector corresponding to the target face image, wherein the reconstruction reference characterization vector comprises: a pixel characterization vector matched with pixel features of the target face image, and a face characterization vector matched with facial features of the target object; three-dimensional reconstruction is then performed on the target face image according to the three-dimensional reconstruction parameters to obtain a three-dimensional face model matched with the target object. In this embodiment, the three-dimensional reconstruction parameters are determined based on both the expression characterization vector and the reconstruction reference characterization vector of the target object, so that the expression features of the target object are given particular weight when the three-dimensional reconstruction parameters are obtained. This improves how accurately the generated three-dimensional face model reproduces those expression features and thereby solves the technical problem that face models obtained by existing three-dimensional face model reconstruction methods have low accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic diagram of a hardware environment of an alternative method of reconstructing a three-dimensional facial model according to an embodiment of the present application;
FIG. 2 is a flow chart of an alternative method of reconstructing a three-dimensional facial model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative method of reconstructing a three-dimensional facial model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another alternative method of reconstructing a three-dimensional facial model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of yet another alternative method of reconstructing a three-dimensional facial model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of yet another alternative method of reconstructing a three-dimensional facial model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of yet another alternative method of reconstructing a three-dimensional facial model in accordance with an embodiment of the present application;
FIG. 8 is a flow chart of another alternative method of reconstructing a three-dimensional facial model according to an embodiment of the present application;
FIG. 9 is a schematic structural view of an alternative three-dimensional facial model reconstruction apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural view of an alternative electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without making any inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The terms used in the present application will be described below:
deep learning: the method is a branch of machine learning, is based on a neural network architecture, is divided into unsupervised, semi-supervised and fully supervised learning, and has been widely applied to the fields of computer vision, voice recognition, natural language processing and the like;
3D face reconstruction: the task of reconstructing a 3D face from a single picture or from multiple pictures.
According to an aspect of the embodiments of the present application, a method for reconstructing a three-dimensional face model is provided. As an alternative implementation, the above reconstruction method may be applied to, but is not limited to, a three-dimensional face model reconstruction system composed of a server 102 and a terminal device 104 as shown in fig. 1. As shown in fig. 1, the server 102 is connected to the terminal device 104 via a network 110, which may include, but is not limited to, a wired network or a wireless network, wherein the wired network includes local area networks, metropolitan area networks, and wide area networks, and the wireless network includes Bluetooth, Wi-Fi, and other networks that enable wireless communication. The terminal device may include, but is not limited to, at least one of: a mobile phone (e.g., an Android phone, an iOS phone, etc.), a notebook computer, a tablet computer, a palmtop computer, a MID (Mobile Internet Device), a PAD, a desktop computer, a smart television, a vehicle-mounted device, etc. A client, such as a three-dimensional face model generation client or a game client, may be installed on the terminal device. The terminal device is further provided with a display, a processor, and a memory: the display can be used to show the program interfaces of the three-dimensional face model generation client and the game client and to display the facial expression images of the target object uploaded to the server; the processor can be used to preprocess the files to be uploaded before transmission, for example by compressing the acquired image files; and the memory is used to store the image files to be uploaded. It can be understood that, after the face image of the target object to be uploaded is acquired by the terminal device 104, the terminal device 104 may send the face image to the server 102 through the network 110; upon receiving the face image, the server 102 generates three-dimensional reconstruction parameters matched with the target object according to the uploaded face image and generates a corresponding three-dimensional face model based on those parameters. The terminal device 104 may then receive the three-dimensional face model returned by the server 102 via the network 110. The server 102 may be a single server, a server cluster composed of multiple servers, or a cloud server. The server includes a database and a processing engine, wherein the database may contain basic face models used to reconstruct three-dimensional face models for user objects, and the processing engine is used to reconstruct the corresponding three-dimensional face model from the three-dimensional reconstruction parameters.
According to an aspect of the embodiment of the present invention, the above-described reconstruction system of a three-dimensional face model may further perform the steps of: the terminal device 104 executes step S102 to acquire a face image of the target object; next, step S104 is executed, and the terminal device 104 transmits the face image to the server 102 via the network 110; the server 102 executes steps S106 to S112 to acquire a target face image of the target object; extracting features of the target facial image to obtain expression features of the target object; determining three-dimensional reconstruction parameters of the target facial image by using the expression characterization vector matched with the expression characteristics and the reconstruction reference characterization vector corresponding to the target facial image, wherein the reconstruction reference characterization vector comprises: a pixel characterization vector that matches a pixel feature of the target facial image, a face characterization vector that matches a facial feature of the target object; performing three-dimensional reconstruction on the target face image according to the three-dimensional reconstruction parameters to obtain a three-dimensional face model matched with the target object; next, step S114 is executed, and the server 102 transmits the three-dimensional face model to the terminal device 104 via the network 110; finally, step S116 is performed, and the three-dimensional face model may be displayed on the terminal device 104. It will be appreciated that the above steps S106 to S112 may also be performed in the terminal device 104 in case the terminal device 104 is a device having sufficient computational processing power.
In the embodiment of the invention, a target face image of a target object is acquired; feature extraction is performed on the target face image to obtain expression features of the target object; three-dimensional reconstruction parameters of the target face image are determined by using the expression characterization vector matched with the expression features and the reconstruction reference characterization vector corresponding to the target face image, wherein the reconstruction reference characterization vector comprises: a pixel characterization vector matched with pixel features of the target face image, and a face characterization vector matched with facial features of the target object; three-dimensional reconstruction is then performed on the target face image according to the three-dimensional reconstruction parameters to obtain a three-dimensional face model matched with the target object. In this embodiment, the three-dimensional reconstruction parameters are determined based on both the expression characterization vector and the reconstruction reference characterization vector of the target object, so that the expression features of the target object are given particular weight when the three-dimensional reconstruction parameters are obtained. This improves how accurately the generated three-dimensional face model reproduces those expression features and thereby solves the technical problem that face models obtained by existing three-dimensional face model reconstruction methods have low accuracy.
The above is merely an example, and is not limited in any way in the present embodiment.
As an alternative embodiment, as shown in fig. 2, the method for reconstructing the three-dimensional face model includes the following steps:
s202, acquiring a target face image of a target object;
it can be understood that the manner of acquiring the face image of the target object may be to acquire an image file stored in the mobile terminal and containing the face of the target object, or may call a shooting function in the mobile terminal to acquire the face image of the target object in real time. The target face image may be acquired one or more.
Alternatively, text or voice prompts in the terminal interface may be used to instruct the target object to activate the terminal camera to acquire the target face image, or to instruct the target object to upload the target face image through a touch operation. When the target object is instructed through text or voice prompts to activate the terminal camera to acquire the target face image, the prompts can further instruct the target object to make different expressions, so that expression features can be conveniently extracted from the target face image. For example, voice prompts such as "please open your mouth as wide as possible", "please open your eyes as wide as possible", "please close your eyes", and "please smile" may be played to instruct the target object to make different expressions.
S204, extracting features of the target facial image to obtain expression features of the target object;
the above expression features are explained below. In existing methods, regression learning is typically performed using only multiple images of the same object to obtain reconstruction parameters for generating a three-dimensional face model. For example, a corresponding reconstructed image is obtained based on one face image of the same subject, and when regression learning is performed based on the face image and the reconstructed image, the above learning process is repeated a plurality of times when there are a plurality of face images of the same subject. However, although there are different expressions on different images of the same person, there is usually a strong correlation between reconstructed images of different expressions, i.e. there are also related features indicating the same object for the expression features between reconstructed images of different facial expressions of the same object. By analyzing the expression characteristics among different images of the object, more accurate facial characteristics of the same object are obtained, and further, more accurate three-dimensional facial models are obtained through reconstruction.
In one optional interpretation, the expression features may be used to indicate fine features of the eyes of the same subject. For example, when the target object is in a closed-eye state in the input face image, the closed-eye state of the target object can be accurately restored in the three-dimensional face model based on the eye expression features.
S206, determining three-dimensional reconstruction parameters of the target facial image by using the expression characterization vector matched with the expression characteristics and the reconstruction reference characterization vector corresponding to the target facial image, wherein the reconstruction reference characterization vector comprises: a pixel characterization vector that matches a pixel feature of the target facial image, a facial characterization vector that matches a facial feature of the target object.
It can be understood that, in addition to obtaining the expression characterization vector matched with the expression features, a reconstruction reference characterization vector corresponding to the face image needs to be obtained in order to accurately determine the reconstruction parameters for reconstructing the three-dimensional face model. As an alternative, the above reconstruction reference characterization vector may include a pixel characterization vector for indicating fine-grained similarity between the original image and the reconstructed image, a facial contour characterization vector for indicating perceptual-level similarity between the original image and the reconstructed image, and a facial key point characterization vector for indicating the key point features shared between the original image and the reconstructed image. It can be understood that the face characterization vector may include the facial contour characterization vector and the facial key point characterization vector.
And S208, performing three-dimensional reconstruction on the target face image according to the three-dimensional reconstruction parameters to obtain a three-dimensional face model matched with the target object.
The three-dimensional reconstruction parameters are described below. In this embodiment, the above three-dimensional reconstruction parameters may be linearly combined with an average three-dimensional face model to obtain a three-dimensional face model that matches the target object. For example, the three-dimensional reconstruction parameters may include five types of reconstruction parameters, namely identity (α), expression (β), texture (σ), pose (p), and illumination (γ), which are linearly combined with the average three-dimensional face model to obtain the reconstructed target three-dimensional face model.
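As a rough illustration of this linear combination, the following sketch shows how identity and expression coefficients could deform an average face model; the vertex count and basis dimensions (N_VERTS, id_basis, exp_basis) are assumptions, since the patent does not specify them:

```python
import numpy as np

N_VERTS = 35709  # assumed vertex count; the patent does not specify one

mean_shape = np.zeros((N_VERTS, 3))      # average 3D face model S_mean
id_basis = np.zeros((N_VERTS * 3, 80))   # identity basis, assumed 80-dim alpha
exp_basis = np.zeros((N_VERTS * 3, 64))  # expression basis, assumed 64-dim beta

def reconstruct_shape(alpha: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Linear combination S = S_mean + B_id @ alpha + B_exp @ beta."""
    offset = id_basis @ alpha + exp_basis @ beta
    return mean_shape + offset.reshape(N_VERTS, 3)

# Zero coefficients return the neutral average face; texture, pose, and
# illumination parameters would drive rendering analogously.
shape = reconstruct_shape(np.zeros(80), np.zeros(64))
```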
In the embodiment of the invention, a target face image of a target object is acquired; feature extraction is performed on the target face image to obtain expression features of the target object; three-dimensional reconstruction parameters of the target face image are determined by using the expression characterization vector matched with the expression features and the reconstruction reference characterization vector corresponding to the target face image, wherein the reconstruction reference characterization vector comprises: a pixel characterization vector matched with pixel features of the target face image, and a face characterization vector matched with facial features of the target object; three-dimensional reconstruction is then performed on the target face image according to the three-dimensional reconstruction parameters to obtain a three-dimensional face model matched with the target object. In this embodiment, the three-dimensional reconstruction parameters are determined based on both the expression characterization vector and the reconstruction reference characterization vector of the target object, so that the expression features of the target object are given particular weight when the three-dimensional reconstruction parameters are obtained. This improves how accurately the generated three-dimensional face model reproduces those expression features and thereby solves the technical problem that face models obtained by existing three-dimensional face model reconstruction methods have low accuracy.
As an optional implementation manner, determining the three-dimensional reconstruction parameter of the target facial image by using the expression characterization vector matched with the expression feature and the reconstruction reference characterization vector corresponding to the target facial image includes:
s1, in a three-dimensional reconstruction parameter prediction network, a representation vector, a pixel representation vector and a facial contour representation vector and a facial key point representation vector in the representation vector are fused to obtain a multi-mode representation vector, wherein the three-dimensional reconstruction parameter prediction network is a deep neural network obtained after deep learning of sample facial features and sample expression features of sample objects in sample facial images, and the sample facial images comprise different expression images matched with the same sample object;
s2, predicting three-dimensional reconstruction parameters of the target face image based on the multi-mode characterization vector.
It can be understood that the sample image for deep learning training to obtain the deep neural network may include different expression images matched with the same sample object, and may further include a plurality of different expression images of a plurality of different sample objects, so as to improve the learning effect of the deep learning process.
As an alternative, suppose the expression characterization vector is denoted $L_{per\text{-}emotion}$, the pixel characterization vector $L_{photo}$, the facial contour characterization vector within the face characterization vectors $L_{per}$, and the facial key point characterization vector $L_{land}$. The fusing operation may be performed by obtaining the sum vector $L_{(x)}$ of the expression characterization vector $L_{per\text{-}emotion}$, the pixel characterization vector $L_{photo}$, the facial contour characterization vector $L_{per}$, and the key point characterization vector $L_{land}$. In another alternative, the multimode characterization vector may also be a weighted sum of the characterization vectors, that is:

$$L_{(x)} = w_{photo} L_{photo} + w_{per\text{-}emotion} L_{per\text{-}emotion} + w_{per} L_{per} + w_{land} L_{land}$$

where $w_{photo}$, $w_{per\text{-}emotion}$, $w_{per}$, and $w_{land}$ are the weight coefficients of the pixel characterization vector, the expression characterization vector, the facial contour characterization vector, and the key point characterization vector, respectively. In a preferred embodiment, $w_{photo} = 1.9$, $w_{per} = 0.2$, $w_{land} = 1.6\times 10^{-3}$, and $w_{per\text{-}emotion} = 0.2$.
In an alternative manner, the expression characterization vector $L_{per\text{-}emotion}$, the pixel characterization vector $L_{photo}$, the facial contour characterization vector $L_{per}$, and the key point characterization vector $L_{land}$ can each be used to represent a loss function generated during feature extraction in training, and the multimode characterization vector can be used to represent the overall loss function obtained during deep learning training by combining the above loss functions. It will be appreciated that, in an alternative manner, when the value of the overall loss function falls below the target threshold, the performance of the parameter prediction network trained with the overall loss function is indicated to meet the requirement, and the parameter prediction network may then be used to reconstruct a three-dimensional face model based on the face image of the target object.
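A minimal sketch of this weighted combination, using the preferred weights above, could look as follows; the function name and the placeholder loss values are illustrative, not the patent's implementation:

```python
import torch

# Preferred weights from the text (note w_land = 1.6e-3)
WEIGHTS = {"photo": 1.9, "per_emotion": 0.2, "per": 0.2, "land": 1.6e-3}

def multimode_loss(l_photo, l_per_emotion, l_per, l_land):
    """L_(x) = w_photo*L_photo + w_per_emotion*L_per_emotion + w_per*L_per + w_land*L_land."""
    return (WEIGHTS["photo"] * l_photo
            + WEIGHTS["per_emotion"] * l_per_emotion
            + WEIGHTS["per"] * l_per
            + WEIGHTS["land"] * l_land)

# Placeholder values standing in for the four computed loss terms
total = multimode_loss(torch.tensor(0.05), torch.tensor(0.12),
                       torch.tensor(0.10), torch.tensor(30.0))
```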
According to the embodiment of the application, in the three-dimensional reconstruction parameter prediction network, the expression characterization vector, the pixel characterization vector, and the facial contour and facial key point characterization vectors within the face characterization vector are fused to obtain a multimode characterization vector, and the three-dimensional reconstruction parameters of the target face image are predicted based on that multimode characterization vector. Because the three-dimensional reconstruction parameters are predicted from the expression characterization vector, which represents the expression features of the target object, together with the characterization vectors of several other feature dimensions, the three-dimensional face model obtained from these parameters more closely resembles the face image of the target object, which solves the technical problem that face models obtained by existing reconstruction methods have low accuracy.
As an alternative, before the target face image of the target object is acquired, the method further includes:
s1, acquiring a plurality of sample face images;
s2, inputting a plurality of sample face images into an initialized three-dimensional reconstruction parameter prediction network for training until reaching a convergence condition, wherein the convergence condition is used for indicating that a plurality of continuous multimode loss values output by training are smaller than a target threshold value, and the multimode loss values are obtained by carrying out weighted summation on pixel loss values, face loss values, key point loss values and expression loss values.
In this embodiment, the initialized three-dimensional reconstruction parameter prediction network may be trained using a plurality of sample face images. It will be appreciated that the plurality of sample facial images described above include facial images of different expressions of the same subject to extract accurate facial expression features.
A specific mode of acquiring the above pixel loss value and the face loss value will be described below.
As shown in fig. 3, the face image I is first input into an initialized R-net network to obtain initialized three-dimensional reconstruction parameters. In this embodiment, the three-dimensional reconstruction parameters may include six parameters: identity (α), expression (β), texture (σ), pose (p), illumination (γ), and pupil (δ); differentiable rendering is performed based on these six parameters to obtain a reconstructed image I′. As an alternative, the reconstructed image I′ may be obtained by rendering a three-dimensional face model based on the six three-dimensional reconstruction parameters identity (α), expression (β), texture (σ), pose (p), illumination (γ), and pupil (δ), and projection-mapping it along the same projection direction as the face image I, yielding the reconstructed image I′ corresponding to the reconstructed model.
Meanwhile, a face mask A is obtained for the face image through a seg network. In particular, in order to make the network more robust to face occlusion and other facial variations such as beards or heavy makeup, the seg network may be a simple Bayesian classifier based on a Gaussian mixture model that predicts a skin color probability P_i for each pixel i; the face mask A may then be generated as follows:
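The formula itself is not preserved in this text. A plausible reconstruction, following common practice for skin-attention masks in weakly-supervised face reconstruction (an assumption, not the patent's verbatim definition), is:

$$A_i = \begin{cases} 1, & P_i > 0.5 \\ P_i, & \text{otherwise} \end{cases}$$

so that confident skin pixels contribute fully to the photometric loss while uncertain pixels are down-weighted.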
it should be noted that the above-mentioned method of generating the face mask can be regarded as a two-class problem, and each pixel of the input image is classified into a two-class (background or person), so that the portrait is scratched out from the background. As in fig. 3, the white area is an area that participates in the reconstruction that needs to be learned.
When the face image I, the reconstructed image I′, and the face mask A are obtained, the pixel loss value $L_{photo}$ can be obtained as follows:
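The equation is missing from this text; based on the symbol definitions that follow, a plausible reconstruction (an assumption consistent with standard mask-weighted photometric losses) is:

$$L_{photo} = \frac{\sum_{i \in M} A_i \, \| I_i - I'_i \|_2}{\sum_{i \in M} A_i}$$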
where I represents the index of the pixel, M is the projected face region, ||I i -I` i The l represents the distance between corresponding pixel points between the face image I and the reconstructed image I'.
The face loss value $L_{per}$ is calculated as follows:
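The equation is missing from this text; based on the inner-product description below, a plausible reconstruction (an assumption, in the form of a cosine distance between deep features) is:

$$L_{per} = 1 - \frac{\langle f(I), f(I') \rangle}{\|f(I)\|_2 \, \|f(I')\|_2}$$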
as an alternative, f (I) above may be used to identify depth feature encoding of image I, < f (I), f (I') represents the inner product of the results of depth feature encoding of the facial image and the reconstructed image. In this embodiment, the training process is further guided by computing the face feature distance between the source and reconstructed pictures with the aid of signals from a pre-trained face recognition network as weak supervision, thus losing the perception hierarchy.
After the pixel loss value $L_{photo}$ and the face loss value $L_{per}$ are obtained, the multimode loss value $L_{(x)}$ can be obtained by combining them with the expression loss value $L_{per\text{-}emotion}$ and the key point loss value $L_{land}$. The multimode loss value may also be a weighted sum of the loss values, that is:

$$L_{(x)} = w_{photo} L_{photo} + w_{per\text{-}emotion} L_{per\text{-}emotion} + w_{per} L_{per} + w_{land} L_{land}$$

where $w_{photo}$, $w_{per\text{-}emotion}$, $w_{per}$, and $w_{land}$ are the weight coefficients of the pixel loss value, the expression loss value, the face loss value, and the key point loss value, respectively. In a preferred embodiment, $w_{photo} = 1.9$, $w_{per} = 0.2$, $w_{land} = 1.6\times 10^{-3}$, and $w_{per\text{-}emotion} = 0.2$.
Through the above embodiments of the present application, a plurality of sample face images are acquired and input into the initialized three-dimensional reconstruction parameter prediction network for training until the convergence condition is reached, so that the initial network is trained based on the total loss value $L_{(x)}$, which further improves the accuracy of the three-dimensional reconstruction parameters output by the three-dimensional reconstruction parameter prediction network.
As an alternative, the training of the input of the plurality of sample face images into the initialized three-dimensional reconstruction parameter prediction network includes:
acquiring an expression image subset corresponding to the same sample object from a plurality of sample facial images, wherein the expression image subset comprises different expression images of the sample object; sequentially taking each expression image subset as a current expression image subset, and executing the following operations:
S1, acquiring a first expression image and a second expression image from a current expression image subset;
s2, replacing a first expression parameter of the first expression image with a second expression parameter of the second expression image to obtain a first reference expression image, and obtaining a first distance between the first expression image and the first reference expression image;
s3, replacing the second expression parameter of the second expression image with the first expression parameter of the first expression image to obtain a second reference expression image, and obtaining a second distance between the second expression image and the second reference expression image;
s4, determining an expression loss value based on the first distance and the second distance.
It should be noted that obtaining the first expression image and the second expression image from the current expression image subset may mean obtaining two or more images of different expressions of the same object. Different expressions may be reflected in different facial angles as well as in different details of the facial features.
In this embodiment, an inter-picture perceptual loss function constraint is imposed on the reconstruction of different expressions of the same person, which improves the expression fitting capability of the reconstruction model. Specifically, as shown in fig. 4, two portrait pictures ($A_1$, $A_2$) of an object whose ID is A are passed through the initialized R-net to output the corresponding reconstruction parameters ($\alpha_1, \beta_1, \sigma_1, p_1, \gamma_1, \delta_1$ and $\alpha_2, \beta_2, \sigma_2, p_2, \gamma_2, \delta_2$). The expression parameters $\beta_1$ and $\beta_2$ of the two pictures are then swapped; after rendering, this yields a new image $A_1'$ with the expression of $A_2$ (its texture, pose, and illumination consistent with $A_1$). For the new expression image $A_1'$ and $A_1$ of the same person, a face recognition loss function is used as a constraint at the perceptual level, and the expression loss value is calculated as follows:
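The equation is missing from this text; a plausible reconstruction (an assumption mirroring the cosine-distance form of $L_{per}$, applied to both swap directions as in steps S2 and S3 above) is:

$$L_{per\text{-}emotion} = \left(1 - \frac{\langle f(A_1), f(A_1') \rangle}{\|f(A_1)\|_2 \, \|f(A_1')\|_2}\right) + \left(1 - \frac{\langle f(A_2), f(A_2') \rangle}{\|f(A_2)\|_2 \, \|f(A_2')\|_2}\right)$$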
it will be appreciated that <, as described above, represents the vector inner product.
In this embodiment, the f function may use ArcFace as the face recognition network. The main ideas of ArcFace include the following. ArcFace loss: the Additive Angular Margin Loss normalizes the feature vectors and the weights and adds an angular margin m to the angle θ; an angular margin acts on the angle more directly than a cosine margin and, geometrically, corresponds to a constant linear angular margin. ArcFace maximizes the classification boundary directly in the angle space θ, while CosFace maximizes it in the cosine space cos(θ). Preprocessing (face alignment): facial key points are detected by MTCNN, and the cropped, aligned face is then obtained through a similarity transformation. Training (face classifier): ResNet50 + ArcFace loss. Testing: 512-dimensional embedding features are extracted from the output of the face classifier's FC1 layer, the cosine distance between the two input features is computed, and face verification and face recognition are then performed. In actual code, training is divided into a ResNet model + ArcHead + softmax loss: the ResNet model outputs features; after the angular margin is added between the features and the weights, the predicted label is output, and the accuracy (ACC) is obtained from the output label; the softmax loss measures the error between the predicted label and the actual label.
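A minimal sketch of the additive angular margin described above follows; the scale s = 64 and margin m = 0.5 are the common ArcFace defaults and are assumptions here, as the patent does not specify them:

```python
import torch
import torch.nn.functional as F

def arcface_logits(features, weight, labels, s=64.0, m=0.5):
    # Normalize embeddings and class weights so logits equal cos(theta)
    cos = F.linear(F.normalize(features), F.normalize(weight))
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    # Add the angular margin m only to each sample's target-class angle
    target = F.one_hot(labels, weight.size(0)).bool()
    cos_m = torch.where(target, torch.cos(theta + m), cos)
    return s * cos_m  # scaled logits fed to softmax cross-entropy

features = torch.randn(8, 512)   # 512-dim embeddings from the backbone
weight = torch.randn(1000, 512)  # one weight row per identity class
labels = torch.randint(0, 1000, (8,))
loss = F.cross_entropy(arcface_logits(features, weight, labels), labels)
```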
Through the above embodiment of the present application, a first expression image and a second expression image are acquired from the current expression image subset; the first expression parameter of the first expression image is replaced with the second expression parameter of the second expression image to obtain a first reference expression image, and a first distance between the first expression image and the first reference expression image is obtained; the second expression parameter of the second expression image is replaced with the first expression parameter of the first expression image to obtain a second reference expression image, and a second distance between the second expression image and the second reference expression image is obtained; and the expression loss value is determined based on the first distance and the second distance. A multi-image reconstruction mode is thus adopted, the relationship between different expressions of the same object is further exploited, and the reconstructed images of different expressions of the same person are constrained with a perceptual loss function, which improves the expression fitting capability of the reconstruction model and solves the technical problem of the low reconstruction accuracy of existing face model reconstruction methods.
As an alternative, the training of the input of the plurality of sample face images into the initialized three-dimensional reconstruction parameter prediction network includes:
Sequentially taking the plurality of sample face images as a current sample face image, and performing the following operations:
s1, determining the position of a facial key point of a sample object in a current sample facial image, wherein the position of the facial key point comprises the position of a pupil of the sample object in the sample facial image;
s2, obtaining the corresponding reconstruction reference key point positions based on the prediction of the key point positions of each face;
and S3, determining a key point loss value based on the facial key point position and the reconstructed reference key point position.
In an alternative manner, the above-mentioned facial key point set may be a two-dimensional point set; that is, the face of the target object is regarded as a plane, and the position coordinates of each point on that plane are determined. For example, if the region where the face of the target object is located is assumed to be a 500px × 800px rectangle, the two-dimensional plane coordinates of the left-eye pupil might be (200px, 600px) and those of the right-eye pupil (400px, 600px). In this way, the position coordinates of each key point of the face of the target object are determined, i.e., the facial key point set of the target object is extracted.
In another alternative, the above-mentioned facial key point set may be a three-dimensional point set; that is, the facial spatial structure of the target object is restored from the three-dimensional coordinates of different points. The three-dimensional spatial position coordinates of each key point of the face of the target object are determined, i.e., the facial key point set of the target object is extracted.
As a specific example, as shown in fig. 5, the face key points in this embodiment follow a 70-key-point definition. As in fig. 5, points 1 to 17 indicate the contour of the face; points 18 to 22 indicate the position of the left eyebrow; points 23 to 27 indicate the position of the right eyebrow; points 28 to 31 indicate the position of the bridge of the nose; points 32 to 36 indicate the position of the nose; points 37 to 42 and points 43 to 48 indicate the positions of the eyes; points 49 to 68 indicate the position of the mouth; key point 69 indicates the position of the left-eye pupil, and key point 70 indicates the position of the right-eye pupil.
As an optional manner, determining the keypoint loss value based on the facial keypoint location and the reconstructed reference keypoint location includes:
s1, acquiring first coordinates corresponding to each face key point position, and obtaining a first coordinate set;
s2, obtaining second coordinates corresponding to each reconstructed reference key point position respectively to obtain a second coordinate set;
and S3, determining the mean square error value of the first coordinate set and the second coordinate set as a key point loss value.
Let the first 68 key points of the face image of the target object be denoted $q_n$, the left-eye pupil key point $q_{69}$, and the right-eye pupil key point $q_{70}$. The key point loss value is calculated as follows:
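The equation is missing from this text; based on the symbol definitions below, a plausible reconstruction (an assumption, in the form of a weighted mean squared error over the 68 facial key points plus the two pupil key points) is:

$$L_{land} = \frac{1}{N} \left( \sum_{n=1}^{68} w_n \, \| q_n - q'_n \|_2^2 + \sum_{n=69}^{70} m_n \, \| q_n - q'_n \|_2^2 \right), \quad N = 70$$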
it will be appreciated that q is as described above n Andfor the position parameters of 68 key points and pupil key points on the face image of the target object, q' n And +.>68 keypoints and position parameters of pupil keypoints are obtained for projecting the 3D keypoints of the reconstructed face into image space. w (w) n And m n The weighting coefficients corresponding to the other keypoints of the face and pupil keypoints, respectively.
It will be appreciated that the positional relationship between the pupil key points and the periocular key points can represent the eye details of the target subject; for example, the positional relationship between key point 69 and key points 37 to 42 can represent eye details such as open-eye and closed-eye states. Because these eye details in turn convey the emotional characteristics of the target object, referencing the pupil key points allows the reconstructed three-dimensional model to express those characteristics accurately. As shown in fig. 6, the upward arrow shows the prediction result of a network trained without pupil key points: because pupil key points were not used in training, the reconstruction cannot recover an accurate eye-closing feature. The downward arrow shows the prediction result of a network trained with pupil key points: because pupil key point features are used as a training constraint, the reconstructed image correctly places the pupil key points on the same line as the other periocular key points, accurately showing the eye-closing characteristics of the target object.
By the above embodiment of the present application, the position of the facial key point of the sample object in the current sample facial image is determined, wherein the position of the facial key point comprises the pupil position of the sample object in the sample facial image; obtaining respective corresponding reconstructed reference key point positions based on the face key point position predictions; based on the position of the facial key point and the position of the reconstruction reference key point, the key point loss value is determined, so that the facial reconstruction model is constrained by key point information comprising the key position of the pupil, and then the fine expression characteristics around eyes can be embodied through the facial reconstruction model, and further the technical problem of lower accuracy of the existing facial reconstruction method is solved.
As an optional embodiment, the acquiring a plurality of sample face images includes:
s1, acquiring a real face image set;
s2, acquiring a real face image from the real face image set as a current real face image, and executing the following operations:
s3, acquiring face angle information of a current real face image;
s4, obtaining a reference shelter from the candidate shelter set;
s5, determining the adding position information of the reference occlusion object on the current real face image based on the face angle information and the type information of the reference occlusion object;
S6, adding a reference shielding object on the current real face image according to the position indicated by the added position information so as to obtain a sample face image corresponding to the current real face image.
It can be appreciated that in this embodiment, by artificially constructing occluded images and using them to train the face reconstruction parameter prediction network, the network's ability to restore occluded face images can be improved. As shown in fig. 7, the upward arrow shows the reconstruction result of a prediction network trained without occlusion data: the eye features cannot be recovered, and the eye area still shows severe shadows. The downward arrow shows the reconstruction result of a prediction network trained with occlusion data: the reconstructed image shows the eye features well, i.e., the influence of the occlusion on the reconstruction of the target object is eliminated.
It will be appreciated that in this embodiment, the artificially constructed occlusion data still needs to be plausible in order to improve the training results. Where the reference occlusions include hands and sunglasses, the specific form and location of the occlusion may be determined based on the face angle information of the acquired face image. For example, if the acquired reference occlusion is a pair of sunglasses and the face angle of the acquired image is determined to be frontal, both lenses of the sunglasses are overlaid normally on the eyes of the current real face image; if the acquired reference occlusion is a pair of sunglasses and the face angle is determined to be a side face, only one lens of the sunglasses is overlaid on the eye of the visible side of the current real face image; and if the acquired reference occlusion is a hand, it may be placed at a random position on the face.
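A minimal sketch of this occlusion construction follows; the yaw threshold, occluder sizes, and flat-color stand-ins are illustrative assumptions, whereas a real pipeline would paste textured sunglasses or hand images:

```python
import random
from PIL import Image

def add_occlusion(face, yaw_deg, left_eye, right_eye, occluder):
    out = face.copy()
    if occluder == "sunglasses":
        w = int(2.2 * abs(right_eye[0] - left_eye[0]))  # span both eyes
        if abs(yaw_deg) >= 20:  # side face: occlude only the visible lens
            w //= 2
        lens = Image.new("RGBA", (max(w, 1), max(w // 3, 1)), (20, 20, 20, 255))
        cx = (left_eye[0] + right_eye[0]) // 2
        cy = (left_eye[1] + right_eye[1]) // 2
        out.paste(lens, (cx - w // 2, cy - w // 6), lens)
    elif occluder == "hand":  # hands may be placed anywhere on the face
        hand = Image.new("RGBA", (80, 120), (224, 172, 105, 255))
        x = random.randint(0, max(0, out.width - 80))
        y = random.randint(0, max(0, out.height - 120))
        out.paste(hand, (x, y), hand)
    return out

sample = add_occlusion(Image.new("RGB", (256, 256)), yaw_deg=0.0,
                       left_eye=(96, 110), right_eye=(160, 110),
                       occluder="sunglasses")
```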
Through the above embodiment of the present application, the face angle information of the current real face image is obtained; a reference occlusion is obtained from the candidate occlusion set; the position at which the reference occlusion should be added to the current real face image is determined based on the face angle information and the type information of the reference occlusion; and the reference occlusion is added to the current real face image at the position indicated by the placement information to obtain a sample face image corresponding to the current real face image. The artificially occluded images are then used to train the prediction network. Because the occlusion is artificial, each occluded image has a reconstruction ground truth that can supervise and constrain the reconstructed image, so the network learns to make reasonable predictions for occluded regions. This improves the accuracy of the prediction network's face model reconstruction results and solves the technical problem that face models obtained by existing three-dimensional face model reconstruction methods have low accuracy.
A complete embodiment of the present application is described below in conjunction with fig. 8.
S802, constructing shielding data;
in particular, where the reference occlusion comprises a hand, a sunglasses, the specific form and location of the occlusion may be determined based on facial angle information of the acquired facial image. For example, in the case that the acquired reference mask is a sunglasses, determining the face angle of the acquired face image as the front face at the same time, and then normally masking two lenses of the sunglasses to eyes of the current real face image; under the condition that the acquired reference shielding object is a sunglasses, determining the face angle of the acquired face image as a side face at the same time, and further normally shielding one lens of the sunglasses to eyes of the side face of the current real face image; under the condition that the acquired reference shielding object is a hand, the shielding object can be shielded on the face randomly
S804, training the initial network with the occlusion data;
in the training process, the following loss function is adopted to constrain the initial prediction network:
$L(x) = w_{\text{photo}} L_{\text{photo}} + w_{\text{per\_emotion}} L_{\text{per\_emotion}} + w_{\text{per}} L_{\text{per}} + w_{\text{land}} L_{\text{land}}$
The above $w_{\text{photo}}$, $w_{\text{per\_emotion}}$, $w_{\text{per}}$ and $w_{\text{land}}$ are the weight coefficients corresponding to the pixel loss value, the expression loss value, the face contour loss value and the key point loss value, respectively. In a preferred embodiment, $w_{\text{photo}} = 1.9$, $w_{\text{per}} = 0.2$, $w_{\text{land}} = 1.6 \times 10^{-3}$ and $w_{\text{per\_emotion}} = 0.2$.
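Before the individual terms are defined below, a minimal sketch of this weighted combination, assuming the four component losses have already been computed (e.g. as PyTorch tensors):

```python
# Weighted multimodal training loss; the component losses are assumed precomputed.
W_PHOTO, W_PER_EMOTION, W_PER, W_LAND = 1.9, 0.2, 0.2, 1.6e-3

def total_loss(l_photo, l_per_emotion, l_per, l_land):
    """Combine pixel, expression, face contour (perceptual) and keypoint losses."""
    return (W_PHOTO * l_photo
            + W_PER_EMOTION * l_per_emotion
            + W_PER * l_per
            + W_LAND * l_land)
```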
Wherein:
in the case of obtaining the face image I, the reconstructed image I' and the face mask a, I represents a pixel index, M represents a projected face region, and I is i -I` i The l represents the distance between corresponding pixel points between the face image I and the reconstructed image I'.
For the expression loss, an inter-picture perceptual loss function constrains the reconstruction of different expressions of the same person, which improves the expression-fitting capability of the reconstruction model. Specifically, as shown in fig. 4, two portrait pictures ($A_1$, $A_2$) of the same person are passed through the initialized R-Net network to obtain the corresponding reconstruction parameters ($\alpha_1, \beta_1, \sigma_1, p_1, \gamma_1, \delta_1$) and ($\alpha_2, \beta_2, \sigma_2, p_2, \gamma_2, \delta_2$). The expression parameters $\beta_1$ and $\beta_2$ of the two pictures are then swapped, and after rendering this yields a new image $A_1'$ that carries the expression of $A_2$ while its texture, pose and illumination remain consistent with $A_1$. For the new expression image $A_1'$ and $A_1$ of the same person, a face recognition loss function is used as a constraint at the perceptual level.
The above $f(I)$ denotes the deep feature encoding of the image $I$ by a face recognition network, and $\langle f(I), f(I') \rangle$ denotes the inner product of the deep feature encodings of the face image and the reconstructed image. In this embodiment, the training process is further guided by computing the face feature distance between the source picture and the reconstructed picture, using signals from a pre-trained face recognition network as weak supervision, so that the loss acts at the perceptual level.
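As a hedged sketch of this expression-swap constraint: `rnet`, `render` and `face_id` below stand in for the initialized R-Net, the differentiable renderer and the pre-trained face recognition network; their interfaces are assumptions, not interfaces specified in this application.

```python
import torch
import torch.nn.functional as F

def expression_swap_loss(A1, A2, rnet, render, face_id):
    """Swap expression parameters between two images of the same person and
    penalize identity drift with a face-recognition perceptual loss."""
    p1 = rnet(A1)  # dict of parameters: alpha, beta, sigma, p, gamma, delta
    p2 = rnet(A2)
    # A1' keeps A1's identity/texture/pose/illumination but takes A2's expression.
    A1_new = render({**p1, "beta": p2["beta"]})
    A2_new = render({**p2, "beta": p1["beta"]})

    def perc(x, y):
        # 1 - cosine similarity of deep identity features, i.e. based on <f(I), f(I')>.
        fx = F.normalize(face_id(x), dim=-1)
        fy = F.normalize(face_id(y), dim=-1)
        return 1.0 - (fx * fy).sum(dim=-1).mean()

    # Both swap directions are constrained, matching the two distances of claim 4.
    return perc(A1_new, A1) + perc(A2_new, A2)
```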
It will be appreciated that the above $q_n$ and its pupil counterpart are the position parameters of the 68 facial key points and of the pupil key points on the face image of the target object, while their reconstructed counterparts are the positions of the 68 key points and pupil key points obtained by projecting the 3D key points of the reconstructed face into image space. $w_n$ and $m_n$ are the weighting coefficients corresponding to the ordinary facial key points and to the pupil key points, respectively.
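Assuming the key point loss is a weighted mean-squared distance between detected and projected landmarks (consistent with the mean square deviation of claim 6), a sketch; the specific weight values are illustrative only:

```python
import torch

def landmark_loss(q, q_rec, pupil, pupil_rec, w=1.0, m=20.0):
    """Weighted keypoint loss between detected and projected landmarks.

    q, q_rec:         (B, 68, 2) detected / reconstructed facial key points.
    pupil, pupil_rec: (B, 2, 2)  detected / reconstructed pupil key points.
    w, m:             weights for ordinary and pupil key points; the larger
                      pupil weight is an assumption, not a value from the text.
    """
    l_face = ((q - q_rec) ** 2).sum(dim=-1).mean()
    l_pupil = ((pupil - pupil_rec) ** 2).sum(dim=-1).mean()
    return w * l_face + m * l_pupil
```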
S806, acquiring a face image of the target object;
S808, inputting the facial image into the trained prediction network;
and S810, generating a three-dimensional face model based on the reconstruction parameters output by the prediction network.
The three-dimensional reconstruction parameters described above may be linearly combined with the average three-dimensional face model to obtain a three-dimensional face model matching the target object. For example, the three-dimensional reconstruction parameters may include identity ($\alpha$), expression ($\beta$), texture ($\sigma$), pose ($p$), illumination ($\gamma$) and pupil ($\delta$) parameters; these six types of reconstruction parameters are linearly combined with the average three-dimensional face model to obtain the reconstructed target three-dimensional face model.
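A minimal sketch of this linear combination for the shape component, assuming precomputed identity and expression bases and a mean shape, none of which are enumerated in this application:

```python
import numpy as np

def reconstruct_shape(mean_shape, B_id, B_exp, alpha, beta):
    """Linear combination of an average 3D face with identity/expression bases.

    mean_shape: (3N,) average face vertices, flattened.
    B_id:       (3N, K_id) identity basis;   alpha: (K_id,) identity coefficients.
    B_exp:      (3N, K_exp) expression basis; beta:  (K_exp,) expression coefficients.
    Texture, pose, illumination and pupil parameters would be applied analogously
    in their own spaces (texture basis, rigid transform, lighting model).
    """
    return mean_shape + B_id @ alpha + B_exp @ beta
```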
According to the embodiment of the present application, in the three-dimensional reconstruction parameter prediction network, the expression characterization vector, the pixel characterization vector, and the facial contour characterization vector and facial key point characterization vector within the facial characterization vector are fused to obtain a multimodal characterization vector, and the three-dimensional reconstruction parameters of the target facial image are predicted based on this multimodal characterization vector. Because the prediction draws on both the expression characterization vector, which represents the expression features of the target object, and the characterization vectors of several other feature dimensions, the three-dimensional facial model obtained from these reconstruction parameters is closer to the facial image of the target object, which solves the technical problem that the facial models obtained by existing reconstruction methods have low accuracy.
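A hedged sketch of this fusion step follows; simple concatenation followed by a fully connected regression head is assumed here, since the application does not fix the fusion operator:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Fuse expression, pixel, contour and keypoint characterization vectors
    and regress the three-dimensional reconstruction parameters."""
    def __init__(self, dims, n_params):
        super().__init__()
        self.fc = nn.Linear(sum(dims), n_params)

    def forward(self, expr_vec, pixel_vec, contour_vec, keypoint_vec):
        # Concatenate the per-modality vectors into one multimodal vector.
        fused = torch.cat([expr_vec, pixel_vec, contour_vec, keypoint_vec], dim=-1)
        return self.fc(fused)  # multimodal vector -> reconstruction parameters
```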
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
According to another aspect of the embodiment of the present invention, there is also provided a reconstruction apparatus of a three-dimensional face model for implementing the above reconstruction method of a three-dimensional face model. As shown in fig. 9, the apparatus includes:
an acquisition unit 902 for acquiring a target face image of a target object;
an extracting unit 904, configured to perform feature extraction on the target facial image to obtain expression features of the target object;
a determining unit 906, configured to determine three-dimensional reconstruction parameters of the target facial image by using the expression token vector matched with the expression feature and the reconstruction reference token vector corresponding to the target facial image, where the reconstruction reference token vector includes: a pixel characterization vector that matches a pixel feature of the target facial image, a face characterization vector that matches a facial feature of the target object;
and a reconstruction unit 908, configured to perform three-dimensional reconstruction on the target face image according to the three-dimensional reconstruction parameters, so as to obtain a three-dimensional face model matched with the target object.
Optionally, in this embodiment, for the operations implemented by each unit module, reference may be made to the corresponding method embodiments described above, which are not repeated here.
According to still another aspect of the embodiment of the present invention, there is also provided an electronic device for implementing the above-described method for reconstructing a three-dimensional face model, which may be a terminal device or a server as shown in fig. 10. The present embodiment is described taking the electronic device as a terminal device as an example. As shown in fig. 10, the electronic device comprises a memory 1002 and a processor 1004, the memory 1002 having stored therein a computer program, the processor 1004 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.
Alternatively, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
S1, acquiring a target face image of a target object;
S2, extracting features of the target facial image to obtain expression features of the target object;
S3, determining three-dimensional reconstruction parameters of the target facial image by using the expression characterization vector matched with the expression features and the reconstruction reference characterization vector corresponding to the target facial image, wherein the reconstruction reference characterization vector comprises: a pixel characterization vector that matches a pixel feature of the target facial image, and a face characterization vector that matches a facial feature of the target object;
and S4, carrying out three-dimensional reconstruction on the target face image according to the three-dimensional reconstruction parameters to obtain a three-dimensional face model matched with the target object.
Alternatively, it will be understood by those skilled in the art that the structure shown in fig. 10 is merely schematic, and the electronic device may also be a vehicle-mounted terminal, a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palm computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, or the like. The structure shown in fig. 10 does not limit the structure of the electronic device described above. For example, the electronic device may also include more or fewer components (such as network interfaces) than shown in fig. 10, or have a different configuration from that shown in fig. 10.
The memory 1002 may be configured to store software programs and modules, such as the program instructions/modules corresponding to the method and apparatus for reconstructing a three-dimensional facial model in the embodiments of the present invention. The processor 1004 executes the software programs and modules stored in the memory 1002 to perform various functional applications and data processing, that is, to implement the method for reconstructing a three-dimensional facial model described above. The memory 1002 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1002 may further include memory located remotely from the processor 1004, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1002 may be used to store information such as elements in a view angle screen and reconstruction information of the three-dimensional face model. As an example, as shown in fig. 10, the memory 1002 may include, but is not limited to, the acquisition unit 902, the extraction unit 904, the determination unit 906 and the reconstruction unit 908 of the reconstruction apparatus of the three-dimensional face model described above. In addition, it may further include, but is not limited to, other module units in the reconstruction apparatus of the three-dimensional face model, which are not described in detail in this example.
Optionally, the transmission device 1006 is configured to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission means 1006 includes a network adapter (Network Interface Controller, NIC) that can be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 1006 is a Radio Frequency (RF) module for communicating with the internet wirelessly.
In addition, the electronic device further includes: a display 1008, and a connection bus 1010 for connecting the various module components in the electronic device described above.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting the plurality of nodes through a network communication. Among them, the nodes may form a Peer-To-Peer (P2P) network, and any type of computing device, such as a server, a terminal, etc., may become a node in the blockchain system by joining the Peer-To-Peer network.
According to one aspect of the present application, there is provided a computer program product comprising a computer program/instructions containing program code for executing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via a communication portion, and/or installed from a removable medium. When the computer program is executed by a central processing unit, it performs the various functions provided by the embodiments of the present application.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
According to an aspect of the present application, there is provided a computer-readable storage medium storing computer instructions. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the above-described method of reconstructing a three-dimensional face model.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for performing the steps of:
S1, acquiring a target face image of a target object;
S2, extracting features of the target facial image to obtain expression features of the target object;
S3, determining three-dimensional reconstruction parameters of the target facial image by using the expression characterization vector matched with the expression features and the reconstruction reference characterization vector corresponding to the target facial image, wherein the reconstruction reference characterization vector comprises: a pixel characterization vector that matches a pixel feature of the target facial image, and a face characterization vector that matches a facial feature of the target object;
and S4, carrying out three-dimensional reconstruction on the target face image according to the three-dimensional reconstruction parameters to obtain a three-dimensional face model matched with the target object.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program instructing relevant hardware of a terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the above-described method of the various embodiments of the present invention.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division of the units described above is merely a logical function division, and another division manner may be adopted in actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed between the components may be through some interfaces, units or modules, and may be electrical or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention.

Claims (11)

1. A method of reconstructing a three-dimensional facial model, comprising:
acquiring a target face image of a target object;
extracting features of the target facial image to obtain expression features of the target object;
determining three-dimensional reconstruction parameters of the target facial image by using an expression characterization vector matched with the expression characteristics and a reconstruction reference characterization vector corresponding to the target facial image, wherein the reconstruction reference characterization vector comprises: a pixel characterization vector that matches a pixel feature of the target facial image, a facial characterization vector that matches a facial feature of the target object;
And carrying out three-dimensional reconstruction on the target face image according to the three-dimensional reconstruction parameters to obtain a three-dimensional face model matched with the target object.
2. The method of claim 1, wherein determining three-dimensional reconstruction parameters for the target facial image using an expression characterization vector that matches the expression features and a reconstruction reference characterization vector that corresponds to the target facial image comprises:
in a three-dimensional reconstruction parameter prediction network, fusing the expression representation vector, the pixel representation vector and a facial outline representation vector and a facial key point representation vector in the facial representation vector to obtain a multi-mode representation vector, wherein the three-dimensional reconstruction parameter prediction network is a deep neural network obtained after deep learning of sample pixel characteristics of a sample facial image, sample facial characteristics of a sample object in the sample facial image and sample expression characteristics, and the sample facial image comprises different expression images matched with the same sample object;
the three-dimensional reconstruction parameters of the target facial image are predicted based on the multi-mode characterization vector.
3. The method of claim 2, further comprising, prior to the acquiring the target facial image of the target object:
Acquiring a plurality of the sample face images;
and inputting the plurality of sample facial images into an initialized three-dimensional reconstruction parameter prediction network for training until reaching a convergence condition, wherein the convergence condition is used for indicating that a plurality of continuous multimode loss values output by training are smaller than a target threshold value, and the multimode loss values are obtained by carrying out weighted summation on pixel loss values, facial loss values, key point loss values and expression loss values.
4. The method of claim 3, wherein said inputting a plurality of said sample facial images into an initialized three-dimensional reconstruction parameter prediction network for training comprises:
acquiring an expression image subset corresponding to the same sample object from a plurality of sample facial images, wherein the expression image subset comprises different expression images of the sample object;
sequentially taking each expression image subset as a current expression image subset, and executing the following operations:
acquiring a first expression image and a second expression image from the current expression image subset;
replacing a first expression parameter of the first expression image with a second expression parameter of the second expression image to obtain a first reference expression image, and obtaining a first distance between the first expression image and the first reference expression image;
Replacing a second expression parameter of the second expression image with the first expression parameter of the first expression image to obtain a second reference expression image, and obtaining a second distance between the second expression image and the second reference expression image;
the expression loss value is determined based on the first distance and the second distance.
5. The method of claim 3, wherein said inputting a plurality of said sample facial images into an initialized three-dimensional reconstruction parameter prediction network for training comprises:
sequentially taking a plurality of the sample face images as a current sample face image, and performing the following operations:
determining a facial keypoint location of the sample object in the current sample facial image, wherein the facial keypoint location comprises a pupil location of the sample object in the sample facial image;
obtaining the corresponding reconstruction reference key point positions based on the face key point position predictions;
the keypoint loss value is determined based on the facial keypoint location and the reconstructed reference keypoint location.
6. The method of claim 5, wherein the determining the keypoint loss value based on the facial keypoint location and the reconstructed reference keypoint location comprises:
Acquiring first coordinates corresponding to each facial key point position to obtain a first coordinate set;
obtaining second coordinates corresponding to each reconstructed reference key point position respectively to obtain a second coordinate set;
and determining the mean square deviation value of the first coordinate set and the second coordinate set as the key point loss value.
7. The method of claim 3, wherein the acquiring a plurality of the sample facial images comprises:
acquiring a real face image set;
acquiring a real face image from the real face image set as a current real face image, performing the following operations:
acquiring face angle information of the current real face image;
obtaining a reference occlusion from a candidate occlusion set;
determining the adding position information of the reference occlusion object on the current real face image based on the face angle information and the type information of the reference occlusion object;
and adding the reference occlusion object on the current real face image according to the position indicated by the adding position information so as to obtain a sample face image corresponding to the current real face image.
8. A reconstruction apparatus of a three-dimensional face model, comprising:
an acquisition unit configured to acquire a target face image of a target object;
the extraction unit is used for extracting the characteristics of the target facial image to obtain the expression characteristics of the target object;
a determining unit, configured to determine a three-dimensional reconstruction parameter of the target facial image by using an expression token vector matched with the expression feature and a reconstruction reference token vector corresponding to the target facial image, where the reconstruction reference token vector includes: a pixel characterization vector that matches a pixel feature of the target facial image, a facial characterization vector that matches a facial feature of the target object;
and the reconstruction unit is used for carrying out three-dimensional reconstruction on the target face image according to the three-dimensional reconstruction parameters to obtain a three-dimensional face model matched with the target object.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program, when run, performs the method of any one of claims 1 to 7.
10. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 7.
11. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 7 by means of the computer program.
CN202210641294.0A 2022-06-08 2022-06-08 Reconstruction method and device of three-dimensional face model, storage medium and electronic equipment Pending CN117011449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210641294.0A CN117011449A (en) 2022-06-08 2022-06-08 Reconstruction method and device of three-dimensional face model, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210641294.0A CN117011449A (en) 2022-06-08 2022-06-08 Reconstruction method and device of three-dimensional face model, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117011449A true CN117011449A (en) 2023-11-07

Family

ID=88562426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210641294.0A Pending CN117011449A (en) 2022-06-08 2022-06-08 Reconstruction method and device of three-dimensional face model, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117011449A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117542102A (en) * 2023-12-08 2024-02-09 广州紫为云科技有限公司 Multi-task facial expression recognition method based on monocular RGB image
CN117542102B (en) * 2023-12-08 2024-09-27 广州紫为云科技有限公司 Multi-task facial expression recognition method based on monocular RGB image

Similar Documents

Publication Publication Date Title
Zeng et al. Srnet: Improving generalization in 3d human pose estimation with a split-and-recombine approach
CN111354079B (en) Three-dimensional face reconstruction network training and virtual face image generation method and device
CN111401216B (en) Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
WO2020103700A1 (en) Image recognition method based on micro facial expressions, apparatus and related device
WO2021184933A1 (en) Three-dimensional human body model reconstruction method
CN111553267B (en) Image processing method, image processing model training method and device
JP2023548921A (en) Image line-of-sight correction method, device, electronic device, computer-readable storage medium, and computer program
CN115565238B (en) Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product
US20230095182A1 (en) Method and apparatus for extracting biological features, device, medium, and program product
CN112085835B (en) Three-dimensional cartoon face generation method and device, electronic equipment and storage medium
CN113705290A (en) Image processing method, image processing device, computer equipment and storage medium
CN111814620A (en) Face image quality evaluation model establishing method, optimization method, medium and device
CN113850168A (en) Fusion method, device and equipment of face pictures and storage medium
CN110796593A (en) Image processing method, device, medium and electronic equipment based on artificial intelligence
CN110728319B (en) Image generation method and device and computer storage medium
CN112116684A (en) Image processing method, device, equipment and computer readable storage medium
CN113570684A (en) Image processing method, image processing device, computer equipment and storage medium
WO2023184817A1 (en) Image processing method and apparatus, computer device, computer-readable storage medium, and computer program product
US20230100427A1 (en) Face image processing method, face image processing model training method, apparatus, device, storage medium, and program product
CN113822790B (en) Image processing method, device, equipment and computer readable storage medium
CN113011387A (en) Network training and human face living body detection method, device, equipment and storage medium
CN114187624A (en) Image generation method, image generation device, electronic equipment and storage medium
CN117011449A (en) Reconstruction method and device of three-dimensional face model, storage medium and electronic equipment
CN113569809A (en) Image processing method, device and computer readable storage medium
CN114677476B (en) Face processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination