CN113763535A - Characteristic latent code extraction method, computer equipment and storage medium - Google Patents

Characteristic latent code extraction method, computer equipment and storage medium

Info

Publication number
CN113763535A
Authority
CN
China
Prior art keywords
face image
feature
feature map
image
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111027269.5A
Other languages
Chinese (zh)
Inventor
陈仿雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Original Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority to CN202111027269.5A
Publication of CN113763535A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application relate to the technical field of image processing and disclose a feature latent code extraction method, a computer device and a storage medium. In addition, in the process of generating the second face image with the style confrontation generation network, the three-dimensional reconstruction image reflecting the three-dimensional spatial features of the face in the first face image is input into the network together, so that the second face image is fused with the three-dimensional spatial features and is closer to the original first face image, which effectively reduces the error introduced into the second face image by the style confrontation generation network.

Description

Characteristic latent code extraction method, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of image processing, in particular to a feature latent code extraction method, computer equipment and a storage medium.
Background
A style-based generative adversarial network (StyleGAN), referred to in this application as a style confrontation generation network, is a network structure recently used in the field of artificial intelligence for face image generation: a mapping network is added to the generator network to introduce feature latent codes and noise related to face attributes into the feature maps, from which a face image can be generated.
In the process of implementing the embodiments of the present application, the inventors found the following. At present, randomly generated feature latent codes are input into the style confrontation generation network to generate an image. Because the feature latent code of an image from a real scene cannot be obtained, the application of the style confrontation generation network is limited; for example, it cannot be used to generate images in which the face of an actual user shows expression changes, aging changes, and the like. Therefore, how to extract the feature latent code from a face image is an urgent problem to be solved.
Disclosure of Invention
The technical problem mainly solved by the embodiments of the present application is to provide a feature latent code extraction method, a computer device and a storage medium, by which a feature latent code reflecting a face image can be effectively extracted from the face image. That is, the extracted feature latent code is accurate and can effectively restore the face image, which is beneficial to expanding the applications of the style confrontation generation network.
In order to solve the foregoing technical problem, in a first aspect, an embodiment of the present application provides a method for extracting a latent feature code, including:
acquiring a first face image of a characteristic latent code to be extracted and a three-dimensional reconstruction image corresponding to the first face image, wherein the three-dimensional reconstruction image comprises three-dimensional space characteristics of a face in the first face image;
inputting the first face image into a preset feature latent code extraction network to obtain a feature latent code output by the feature latent code extraction network;
inputting the feature latent code and the three-dimensional reconstruction image into a style confrontation generation network to generate a second face image fused with the three-dimensional spatial features;
iteratively adjusting parameters of the feature latent code extraction network according to the difference between the first face image and the second face image until the feature latent code extraction network is converged;
and taking the feature latent code output by the converged feature latent code extraction network as the feature latent code of the first face image.
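For illustration only (this sketch is not part of the original disclosure), the overall iterative procedure of the first aspect can be written as follows, assuming Python/PyTorch and hypothetical callables for the feature latent code extraction network, the style confrontation generation network and the image difference measure:

```python
import torch

def fit_latent_code(extractor, generator, first_face, recon_3d, image_difference,
                    steps=1000, lr=1e-3):
    """Fit the feature latent code of `first_face` by iteratively adjusting `extractor`.

    `extractor`, `generator` and `image_difference` are assumed callables standing in
    for the latent code extraction network, the (frozen) style confrontation generation
    network and the image difference measure; they are not defined in this sketch.
    """
    optimizer = torch.optim.Adam(extractor.parameters(), lr=lr)
    for _ in range(steps):
        latent = extractor(first_face)                     # extract feature latent code
        second_face = generator(latent, recon_3d)          # generate second face image fused with 3D features
        loss = image_difference(first_face, second_face)   # difference between first and second face image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return extractor(first_face)                       # latent code after convergence
```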
In some embodiments, the acquiring a three-dimensional reconstructed image corresponding to the first face image includes:
performing three-dimensional face reconstruction on the face in the first face image to obtain a first three-dimensional reconstruction parameter;
and performing two-dimensional rendering on the first three-dimensional reconstruction parameter to obtain a three-dimensional reconstruction image corresponding to the first face image.
In some embodiments, the style confrontation generation network includes a mapping network and a plurality of sequentially arranged generation networks, the generation networks respectively outputting feature maps of different sizes. A target generation network is used for generating and outputting a second feature map according to an input first feature map, the size of the second feature map being larger than that of the first feature map, where the target generation network is any one of the sequentially arranged generation networks;
the step of inputting the feature latent codes and the three-dimensional reconstruction images into a style confrontation generation network to generate a second face image fused with the three-dimensional spatial features comprises the following steps:
inputting the characteristic latent codes into the mapping network to perform characteristic decoupling on the characteristic latent codes, and generating intermediate vectors;
according to the target size, performing feature extraction on the three-dimensional reconstruction image to obtain a three-dimensional reconstruction feature map, wherein the target size is the size of the second feature map, and the size of the three-dimensional reconstruction feature map is the target size;
inputting the intermediate vector, the first feature map, the three-dimensional reconstruction feature map and random noise into the target generation network for fusion so as to output the second feature map;
and determining a second feature map output by the last generation network in the plurality of sequentially arranged generation networks as the second face image.
In some embodiments, the target generation network includes at least one convolutional layer and a first fusion layer, each convolutional layer being followed by a second fusion layer,
inputting the intermediate vector, the first feature map, the three-dimensional reconstruction feature map and random noise into the target generation network for fusion to output the second feature map, including:
inputting the intermediate vector into each second fusion layer, and performing affine transformation on the intermediate vector by each second fusion layer according to the target size to obtain a characteristic factor, wherein the characteristic factor is matched with the target size;
the target second fusion layer fuses the first intermediate feature map input into the target second fusion layer with the random noise by using the feature factor to obtain a second intermediate feature map output by the target second fusion layer, wherein the target second fusion layer is any one of the second fusion layers, and the first intermediate feature map is a feature map output by a convolutional layer located at a layer before the target second fusion layer;
and inputting the second intermediate feature map output by the last second fusion layer and the three-dimensional reconstruction feature map into the first fusion layer for fusion to obtain the second feature map.
In some embodiments, the characteristic factors include a scaling factor and a bias factor;
the target second fusion layer fuses the first intermediate feature map input into the target second fusion layer with the random noise by using the feature factor to obtain a second intermediate feature map output by the target second fusion layer, and the method includes:
calculating the second intermediate feature map output by the target second fusion layer by using the following formula:

y_ij = y(s,i) * (T_ij + B_i) + y(b,i)

wherein i is the index of the target generation network, 1 ≤ i ≤ N, and N is the number of generation networks; j is the index of the target second fusion layer, 1 ≤ j ≤ M, and M is the number of second fusion layers in the target generation network; T_ij is the first intermediate feature map input into the target second fusion layer, B_i is the random noise, y(s,i) is the scaling factor, and y(b,i) is the bias factor;
inputting the second intermediate feature map output by the last second fusion layer and the three-dimensional reconstruction feature map into the first fusion layer for fusion to obtain the second feature map, wherein the fusion process comprises the following steps:
calculating the second feature map by using the following formula:

X_{i+1} = y_ij * h_i, where j = M

wherein y_ij is the second intermediate feature map output by the last second fusion layer, h_i is the three-dimensional reconstruction feature map, and X_{i+1} is the second feature map.
In some embodiments, the difference between the first facial image and the second facial image comprises a structural difference between the first facial image and the second facial image;
before the iterative parameter adjustment is performed on the feature latent code extraction network according to the difference between the first face image and the second face image, the method further includes:
calculating brightness similarity values, contrast similarity values and structure similarity values of the first face image and the second face image;
taking a product of the brightness similarity value, the contrast similarity value, and the structure similarity value as the structure difference.
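As an illustration of the structural difference, the following is a minimal Python sketch of the product of the luminance, contrast and structure similarity values, computed SSIM-style over whole images; the stabilising constants and the window-free form are assumptions made for brevity, not values given in this application:

```python
import torch

def structural_similarity(img1: torch.Tensor, img2: torch.Tensor) -> torch.Tensor:
    """Product of luminance, contrast and structure similarity over the whole image."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2                     # assumed stabilising constants
    c3 = c2 / 2
    mu1, mu2 = img1.mean(), img2.mean()
    var1, var2 = img1.var(), img2.var()
    cov = ((img1 - mu1) * (img2 - mu2)).mean()

    luminance = (2 * mu1 * mu2 + c1) / (mu1 ** 2 + mu2 ** 2 + c1)
    contrast = (2 * var1.sqrt() * var2.sqrt() + c2) / (var1 + var2 + c2)
    structure = (cov + c3) / (var1.sqrt() * var2.sqrt() + c3)

    # The product of the three similarity values, which the text above takes as the
    # structural difference term between the first and second face images.
    return luminance * contrast * structure
```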
In some embodiments, the difference between the first facial image and the second facial image comprises a pixel difference between the first facial image and the second facial image;
before the iterative parameter adjustment is performed on the feature latent code extraction network according to the difference between the first face image and the second face image, the method further includes:
calculating a pixel difference value between a pixel point at a target position in the first face image and a pixel point at the target position in the second face image to obtain a pixel difference value corresponding to the target position, wherein the target position is any one position in the first face image or the second face image;
and taking the sum of pixel difference values corresponding to each position in the first face image and the second face image as the pixel difference.
In some embodiments, the difference between the first facial image and the second facial image comprises a three-dimensional reconstruction parameter difference between the first facial image and the second facial image;
before the iterative parameter adjustment is performed on the feature latent code extraction network according to the difference between the first face image and the second face image, the method further includes:
calculating a parameter difference between a first three-dimensional reconstruction parameter and a second three-dimensional reconstruction parameter, and taking the parameter difference as the three-dimensional reconstruction parameter difference, wherein the first three-dimensional reconstruction parameter is a three-dimensional reconstruction parameter obtained by performing three-dimensional face reconstruction on a face in the first face image, and the second three-dimensional reconstruction parameter is a three-dimensional reconstruction parameter obtained by performing three-dimensional face reconstruction on a face in the second face image.
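For illustration, the three kinds of difference described above (pixel difference, structural difference, and three-dimensional reconstruction parameter difference) can be combined into one training objective. In the sketch below the equal weights and the use of 1 minus the similarity product as the structural loss term are assumptions, not choices specified in this application; `structural_similarity` refers to the earlier sketch:

```python
import torch

def total_difference(img1, img2, params1, params2,
                     w_pixel=1.0, w_struct=1.0, w_3d=1.0):
    """Combine pixel, structural and 3D reconstruction parameter differences (weights assumed)."""
    pixel_diff = (img1 - img2).abs().sum()                  # sum of per-position pixel differences
    struct_diff = 1.0 - structural_similarity(img1, img2)   # assumed loss form of the structural term
    param_diff = (params1 - params2).abs().sum()            # difference of the two 3DMM parameter sets
    return w_pixel * pixel_diff + w_struct * struct_diff + w_3d * param_diff
```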
In order to solve the above technical problem, in a second aspect, an embodiment of the present application provides a computer device, including a memory and one or more processors, where the one or more processors are configured to execute one or more computer programs stored in the memory, and the one or more processors, when executing the one or more computer programs, cause the computer device to implement the method according to the first aspect.
In order to solve the above technical problem, in a third aspect, the present application provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to the first aspect.
The beneficial effects of the embodiments of the present application are as follows. Different from the prior art, the feature latent code extraction method provided in the embodiments of the present application obtains a first face image whose feature latent code is to be extracted and a three-dimensional reconstruction image corresponding to the first face image. The feature latent code of the first face image, extracted by a feature latent code extraction network, and the three-dimensional reconstruction image are input into the style confrontation generation network to generate a second face image fused with the three-dimensional spatial features of the first face image. Then, iterative parameter adjustment is performed on the feature latent code extraction network according to the difference between the first face image and the second face image until the feature latent code extraction network converges. Finally, the feature latent code output by the converged feature latent code extraction network is taken as the feature latent code of the first face image.
Because the feature latent code output by the feature latent code extraction network is taken as the feature latent code of the first face image only after the network converges, the second face image and the first face image are sufficiently similar. That is, the finally output feature latent code accurately reflects the feature attributes of the first face image and can restore the first face image. Therefore, a controllable change of the attributes of the first face image can be realized by changing the feature latent code in the style confrontation generation network.
In addition, in the process of generating the second face image with the style confrontation generation network, the three-dimensional reconstruction image reflecting the three-dimensional spatial features of the face in the first face image is input into the network together. In this way, while the second face image is generated, the three-dimensional reconstruction image supervises the position of the whole face and the distribution of the facial features, so that the feature maps output by the style confrontation generation network have targeted responses at the positions of different geometric information. The second face image is thus fused with the three-dimensional spatial features and is closer to the original first face image, which effectively reduces the error introduced into the second face image by the style confrontation generation network. In other words, the second face image can accurately express the feature latent code, which helps the converged feature latent code extraction network output a more accurate feature latent code.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements; the figures are not drawn to scale unless otherwise specified.
Fig. 1 is a schematic partial structural diagram of a style countermeasure generation network according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a feature latent code extraction method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a stylistic countermeasure generating network according to another embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating a sub-process of step S23 in the method of FIG. 2;
FIG. 5 is a schematic view of a sub-flow chart of step S233 of the method shown in FIG. 4;
fig. 6 is a schematic flowchart of a method for extracting latent feature codes according to another embodiment of the present application;
fig. 7 is a schematic flowchart of a feature latent code extraction method according to another embodiment of the present application;
fig. 8 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present application, but are not intended to limit it in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the application, and all such changes and modifications fall within the scope of protection of the present application.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that, unless they conflict, the various features of the embodiments of the present application may be combined with each other within the scope of protection of the present application. In addition, although functional modules are divided in the device schematics and logical sequences are shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the module division in the devices or the order in the flowcharts. Further, the terms "first", "second", "third" and the like used herein do not limit the data or the execution order, but merely distinguish identical or similar items having substantially the same functions and effects.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In addition, the technical features mentioned in the embodiments of the present application described below may be combined with each other as long as they do not conflict with each other.
For the convenience of understanding the technical solution of the present application, the related principles of the style of the present application against generating network and feature latent codes will be described.
Referring to fig. 1, fig. 1 is a schematic partial structural diagram of a style confrontation generation network provided in an embodiment of the present application. As shown in fig. 1, the style confrontation generation network includes a mapping network S1 and an image generator S2. The mapping network S1 is configured to perform feature decoupling on the composite features contained in the feature latent code, mapping the feature latent code into multiple sets of feature control vectors, and to input these control vectors into the image generator S2 so as to perform style control (i.e., face attribute control) on the image generator. The image generator S2 is configured to perform style control and processing on a constant tensor based on the control vectors input from the mapping network S1, thereby generating an image. When the feature latent code reflects face features, the image generated by the image generator S2 is a face image.
As shown in fig. 1, the mapping network S1 includes 8 fully-connected layers, which are connected in sequence and perform a nonlinear mapping on the feature latent code to obtain an intermediate vector w. The intermediate vector w can reflect various facial features, such as eye features, mouth features, nose features, and the like.
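A minimal sketch of such a mapping network, assuming PyTorch, a 512-dimensional latent input and LeakyReLU activations between the fully-connected layers (the dimensions and activation are illustrative, not prescribed here):

```python
import torch.nn as nn

class MappingNetwork(nn.Module):
    """8 sequentially connected fully-connected layers mapping the latent code to w."""
    def __init__(self, latent_dim: int = 512):
        super().__init__()
        layers = []
        for _ in range(8):
            layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)   # intermediate vector w
```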
The image generator S2 includes N sequentially arranged generation networks. The first generation network includes a constant tensor const and a convolution layer, each followed by an adaptive instance normalization layer; it can be understood that the constant tensor const serves as the initial data for generating an image. Each of the remaining generation networks includes two convolution layers, each followed by an adaptive instance normalization layer. In the image generator S2, each generation network outputs a feature map that serves as the input of the next generation network; as the generation networks recur, the size of the output feature map becomes larger, and the feature map output by the last generation network is the generated face image. It can be understood that the target size of the feature map output by each generation network is preset: for example, in fig. 1 the feature map output by the 1st generation network is 4 × 4 and the feature map output by the 2nd generation network is 8 × 8, and if the size of the finally generated face image is 1024 × 1024, the feature map output by the last generation network is 1024 × 1024.
In a generation network, convolution layers and adaptive instance normalization layers are interleaved, with the output of the previous layer being the input of the next layer. As shown in fig. 1, a convolution layer includes a 3 × 3 convolution kernel used to perform a deconvolution operation on the input feature map and output a feature map of increased size. The feature map output by the convolution layer is then input into the instance normalization layer, into which the intermediate vector w and random noise B are also input. The processing procedure of the instance normalization layer (i.e., AdaIN) is shown in fig. 1: the intermediate vector w is expanded through a learnable affine transformation (i.e., a fully-connected layer) into a scaling factor y(s,i) and a bias factor y(b,i), which are combined by weighted summation with the normalized feature map output by the previous convolution layer and the random noise B, so that the intermediate vector w influences the style of the feature map output by the convolution layer, that is, style control is realized. In addition, the random noise B is used to enrich the details of the feature map.
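The AdaIN step described above can be sketched as follows, assuming PyTorch; applying the scale and bias per channel and drawing the noise from a standard normal distribution are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class AdaINFusion(nn.Module):
    """Adaptive instance normalization: w -> (scale, bias) via a learnable affine layer,
    applied to the normalized feature map plus random noise, i.e. y_s * (T + B) + y_b."""
    def __init__(self, w_dim: int, channels: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.affine = nn.Linear(w_dim, 2 * channels)     # learnable affine transform of w

    def forward(self, feat, w, noise=None):
        scale, bias = self.affine(w).chunk(2, dim=1)     # y(s,i) and y(b,i)
        scale = scale[:, :, None, None]
        bias = bias[:, :, None, None]
        if noise is None:
            noise = torch.randn_like(feat)               # random noise B enriches details
        return scale * (self.norm(feat) + noise) + bias
```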
The above describes the process by which the style confrontation generation network generates an image; the style confrontation generation network can be applied in various scenarios, for example image generation scenarios.
The feature latent code is a feature vector, which may also be regarded as a feature map. It may be a multi-dimensional vector, for example of size 18 × 512, with each value in the vector lying in the range [-1, 1]. It can be understood that, by inputting the feature latent code into the style confrontation generation network, an image corresponding to the feature latent code can be generated. The feature latent code may also be understood as a feature of an image extracted from the image by a neural network: it can represent the image, and once the feature latent code is determined, the image generated from it is also determined. From another perspective, the feature latent code may be understood as the vector output after the image passes through the convolution layers of a neural network.
However, at present, randomly generated feature latent codes are input into the style confrontation generation network to generate images. Because the feature latent code of an image from a real scene cannot be obtained, the application of the style confrontation generation network is limited; for example, it cannot be used to generate images in which the face of an actual user shows expression changes, aging changes, and the like.
In view of the above, an embodiment of the present application provides a method for extracting a feature latent code, please refer to fig. 2, which includes, but is not limited to, the following steps:
S21: acquiring a first face image of which the feature latent code is to be extracted and a three-dimensional reconstruction image corresponding to the first face image, wherein the three-dimensional reconstruction image includes the three-dimensional spatial features of the face in the first face image.
S22: and inputting the first face image into a preset characteristic latent code extraction network to obtain a characteristic latent code output by the characteristic latent code extraction network.
S23: and inputting the characteristic latent codes and the three-dimensional reconstruction images into a style confrontation generation network to generate a second face image fused with three-dimensional space characteristics.
S24: and iteratively adjusting parameters of the feature latent code extraction network according to the difference between the first face image and the second face image until the feature latent code extraction network is converged.
S25: and extracting the characteristic latent codes output by the network from the converged characteristic latent codes to be used as the characteristic latent codes of the first face image.
In step S21, the first face image is an image including a human face, for example, the first face image may be an identification photograph, and in this embodiment, it is necessary to extract a feature latent code for the first face image, that is, a one-dimensional vector capable of reflecting features of the first face image is obtained from the first face image, and the feature latent code is equivalently a vector expression form of the first face image.
The three-dimensional reconstruction image corresponding to the first face image is another expression form of the first face image, the three-dimensional reconstruction image comprises three-dimensional space characteristics of the face in the first face image, and the three-dimensional space characteristics can be understood as solid geometric characteristics. In some embodiments, the first face image is converted into a three-dimensional face (formed by three-dimensional point cloud data), and the three-dimensional face is subjected to two-dimensional rendering to obtain a two-dimensional rendered image, that is, a three-dimensional reconstructed image is obtained.
In some embodiments, the step of obtaining the three-dimensional reconstructed image corresponding to the first face image specifically includes:
A) performing three-dimensional face reconstruction on the face in the first face image to obtain a first three-dimensional reconstruction parameter.
B) performing two-dimensional rendering on the first three-dimensional reconstruction parameter to obtain the three-dimensional reconstruction image corresponding to the first face image.
In this embodiment, the 3DMM model may be used to calculate the first three-dimensional reconstruction parameter of the first face image, that is, to implement three-dimensional face reconstruction on the face in the first face image, so as to obtain the first three-dimensional reconstruction parameter.
In the 3DMM model, all three-dimensional faces can be represented by the same number of point clouds (spatial coordinate positions), and the points with the same sequence number represent the same semantics, for example, the kth point cloud is the left outer eye corner point for each face. The 3DMM model counts the face laser scanning data of 200 face samples (100 young men and 100 young women) to obtain a face statistical model, and the three-dimensional reconstruction parameters of the face can be calculated according to the following formula:
S = S̄ + Σ_{i=1}^{199} α_i · S_i

T = T̄ + Σ_{i=1}^{199} β_i · T_i

wherein S̄ represents the average shape of the 200 face samples, i.e., the average of the spatial coordinate positions of the point clouds of the 200 face samples, and T̄ represents the average texture of the 200 face samples, i.e., the average texture of the point clouds of the 200 face samples; S_i are the orthogonal position feature basis vectors obtained by performing principal component analysis (PCA) on the 200 face samples, and T_i are the color feature basis vectors obtained by PCA on the same samples; α_i is the coefficient of S_i and β_i is the coefficient of T_i, where α = (α_1, α_2, ..., α_199) and β = (β_1, β_2, ..., β_199). It should be noted that i is the index of the basis-vector dimension and indicates the position in that dimension. Thus, the three-dimensional reconstruction parameters of any face can be obtained by adjusting the coefficients α and β; for example, for the first face image, the corresponding first three-dimensional reconstruction parameters are obtained by adjusting the coefficients α and β.
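A small numerical sketch of this linear 3DMM combination, assuming NumPy arrays whose shapes are illustrative (199 basis vectors, point coordinates flattened into one vector):

```python
import numpy as np

def reconstruct_3dmm(mean_shape, shape_basis, alpha, mean_texture, texture_basis, beta):
    """S = S_mean + sum_i alpha_i * S_i ;  T = T_mean + sum_i beta_i * T_i.

    Assumed shapes: mean_shape/mean_texture are (3 * n_points,) vectors,
    shape_basis/texture_basis are (199, 3 * n_points) PCA basis matrices,
    alpha/beta are (199,) coefficient vectors.
    """
    shape = mean_shape + shape_basis.T @ alpha       # reconstructed 3D point positions
    texture = mean_texture + texture_basis.T @ beta  # reconstructed per-point colours
    return shape, texture
```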
Then, the first three-dimensional reconstruction parameters are input into a differentiable renderer to generate a two-dimensional rendering map, which is the three-dimensional reconstruction image corresponding to the first face image. The differentiable renderer may be mesh_renderer (a differentiable 3D mesh renderer) in the TensorFlow framework, which belongs to the prior art and is not described in detail here. It can be understood that, when rendering, the size of the generated two-dimensional rendering map may be set to 256 × 256 to facilitate subsequent processing.
In this embodiment, the three-dimensional reconstructed image corresponding to the first face image is obtained in the above manner, so that the three-dimensional reconstructed image includes the three-dimensional spatial feature of the face in the first face image.
In step S22, the latent feature code extracting network is used to extract features of the first face image, where the latent feature code extracting network includes a convolution layer, an activation function layer, and a normalization layer, so as to implement dimension reduction on the input first face image and output a latent feature code (one-dimensional vector). It can be understood that the mathematical expression of the feature latent code extraction network is as follows:
f_n^{l+1} = IN(σ(W * f_m^l + B))

wherein f_m^l is the m-th feature map of the l-th layer, f_n^{l+1} is the n-th feature map of the (l+1)-th layer, W represents the convolution kernel, B represents the bias term, σ(·) represents the LeakyReLU activation function, and IN represents instance normalization. By setting the parameters of each convolution layer (such as the size and number of convolution kernels and the corresponding stride), the first face image is reduced in dimension to generate feature maps; the feature map e output by the last convolution layer is input into a fully-connected layer, which converts it into a one-dimensional feature vector (for example, a 9216 × 1 feature vector), and this one-dimensional feature vector is the feature latent code. In some embodiments, the one-dimensional feature vector is further converted into a multi-dimensional vector, which is then the feature latent code.
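A network of the kind just described can be sketched as follows, assuming PyTorch; the channel widths, strides and the 18 × 512 latent shape are illustrative choices, not the exact configuration of this application:

```python
import torch
import torch.nn as nn

class LatentCodeExtractor(nn.Module):
    """Conv + LeakyReLU + instance norm blocks that downsample the face image,
    followed by a fully-connected layer producing the feature latent code."""
    def __init__(self, latent_shape=(18, 512)):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 256, 256]            # illustrative channel widths
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.LeakyReLU(0.2),
                       nn.InstanceNorm2d(c_out)]
        self.features = nn.Sequential(*blocks)
        self.flatten = nn.Flatten()
        self.fc = nn.LazyLinear(latent_shape[0] * latent_shape[1])  # one-dimensional latent vector
        self.latent_shape = latent_shape

    def forward(self, x):
        e = self.features(x)                                # feature map e from the last conv layer
        z = self.fc(self.flatten(e))                        # one-dimensional feature vector
        return z.view(-1, *self.latent_shape)               # optionally reshaped multi-dimensional code
```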
It can be understood that the feature latent code extraction network may be an existing MobileNet or VGG network; the specific structure of the feature latent code extraction network is not limited here, as long as feature extraction and dimension reduction are realized.
In step S23, the feature latent code and the three-dimensional reconstruction image are input into the style confrontation generation network to generate a second face image fused with the three-dimensional spatial features. From the principle of the style confrontation generation network described above, the feature latent code is deconvolved to generate a series of feature maps of gradually increasing size; the feature map output by the last layer of the network has the largest size and is the second face image. In the process of generating this series of feature maps, the three-dimensional reconstruction image may be fused with at least one feature map, for example by linear fusion or nonlinear fusion, so that the second face image is fused with the three-dimensional spatial features contained in the three-dimensional reconstruction image.
That is, in the process of generating the second face image with the style confrontation generation network, the three-dimensional reconstruction image reflecting the three-dimensional spatial features of the face in the first face image is input into the network together. In this way, while the second face image is generated, the three-dimensional reconstruction image supervises the position of the whole face and the distribution of the facial features, so that the feature maps output by the style confrontation generation network have targeted responses at the positions of different geometric information. The second face image is thus fused with the three-dimensional spatial features and is closer to the original first face image, which effectively reduces the error introduced into the second face image by the style confrontation generation network. In other words, the second face image can accurately express the feature latent code, which helps the subsequently converged feature latent code extraction network output a more accurate feature latent code.
It can be understood that the more accurate the feature latent code is, that is, the more truly it reflects the features of the first face image, the more similar the second face image generated from it will be to the first face image. Therefore, the feature latent code extraction network can be iteratively adjusted according to the difference between the first face image and the second face image until the network converges. Convergence here may mean that, under certain parameters, the difference between the first face image and the second face image is smaller than a preset threshold or fluctuates within a certain range. After the feature latent code extraction network converges, the similarity between the first face image and the second face image is high, that is, the feature latent code output by the converged network can restore the first face image well. Therefore, taking the feature latent code output by the converged feature latent code extraction network as the feature latent code of the first face image yields a feature latent code of high accuracy, which is more reasonable.
In some embodiments, the Adam algorithm is used to optimize the model parameters; for example, the number of iterations is set to 1000, the initial learning rate is set to 0.001, the weight decay is set to 0.0005, and the learning rate is decayed to 1/10 of its value every 10 iterations. The learning rate and the difference between the first face image and the second face image are input into the Adam algorithm to obtain the adjusted parameters output by the Adam algorithm, and the adjusted parameters are used for the next round of training; after training is completed, the model parameters of the converged feature latent code extraction network are output.
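These hyperparameters map onto a standard optimizer and step scheduler as sketched below, assuming PyTorch and interpreting the 0.0005 value as the optimizer's weight decay (an assumption made for illustration):

```python
import torch

def make_optimizer(extractor):
    """Adam with the hyperparameters listed above: lr 0.001, weight decay 0.0005,
    and the learning rate reduced to 1/10 of its value every 10 iterations."""
    optimizer = torch.optim.Adam(extractor.parameters(), lr=0.001, weight_decay=0.0005)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    return optimizer, scheduler

# Possible usage inside the 1000-iteration fitting loop:
#   for step in range(1000):
#       loss = image_difference(first_face, generator(extractor(first_face), recon_3d))
#       optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
```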
It should be noted that, in the embodiments of the present application, the first face image is a single image, and the feature latent code extraction network is trained on this image; this yields a feature latent code extraction model (the converged feature latent code extraction network) corresponding to the first face image. This model is dedicated to obtaining the feature latent code of the first face image, that is, it corresponds only to the first face image and is not suitable for other face images. In other words, the feature latent code extraction network is not trained to obtain a general-purpose feature latent code extraction model, but to extract the feature latent code from the first face image. In the training process, as the feature latent code extraction network gradually converges, the difference between the first face image and the second face image restored from the feature latent code becomes smaller and smaller, so that a final feature latent code of high accuracy is obtained.
It can be understood that, by the above method, the feature latent code can be effectively extracted from a face image, which makes feature latent code extraction feasible. Once the feature latent code has been extracted from a face image, the features of the face in the image can be changed, or an age change of the face can be realized, by modifying the feature latent code, thereby providing richer technical solutions to the user. For example, the feature latent code a2 of face image a1 of user A is extracted in the above manner, where the expression of user A in face image a1 is serious; the feature latent code a2 is modified to obtain a modified feature latent code a3, and the feature latent code a3 is input into the style confrontation generation network to obtain a new face image a4 in which the expression of user A is a smile. Thus, by modifying the feature latent code, more and richer images belonging to the same user can be obtained.
In the embodiments of the present application, the feature latent code extraction network is trained using the style confrontation generation network, the first face image and the three-dimensional reconstruction image containing the three-dimensional spatial features of the face in the first face image, so that the feature latent code extracted from the first face image is continuously optimized during training until the feature latent code extraction network converges, and the feature latent code output by the converged network is taken as the feature latent code of the first face image. Because the feature latent code output by the network is taken as the feature latent code of the first face image only after the network converges, the second face image and the first face image are sufficiently similar; that is, the finally output feature latent code accurately reflects the feature attributes of the first face image and can restore it. Therefore, a controllable change of the attributes of the first face image can be realized by changing the feature latent code in the style confrontation generation network.
In addition, in the process of generating the second face image with the style confrontation generation network, the three-dimensional reconstruction image reflecting the three-dimensional spatial features of the face in the first face image is input into the network together. In this way, while the second face image is generated, the three-dimensional reconstruction image supervises the position of the whole face and the distribution of the facial features, so that the feature maps output by the style confrontation generation network have targeted responses at the positions of different geometric information. The second face image is thus fused with the three-dimensional spatial features and is closer to the original first face image, which effectively reduces the error introduced into the second face image by the style confrontation generation network; that is, the second face image can accurately express the feature latent code, which helps the converged feature latent code extraction network output a more accurate feature latent code.
In some embodiments, referring to fig. 3, the style confrontation generation network includes a mapping network and a plurality of sequentially arranged generation networks. The generation networks respectively output feature maps of different sizes; the feature map output by a former generation network is input into the latter generation network, and as the generation networks progress, the larger feature maps reflect finer features.
The structure and processing procedure of a generation network are described below by taking the target generation network as an example; it can be understood that the structure and processing procedure of each generation network are the same. The target generation network is any one of the plurality of sequentially arranged generation networks; for convenience of description, its index is denoted as i, that is, it is the i-th generation network, and the total number of generation networks is N. The target generation network generates and outputs a second feature map according to the input first feature map, where the first feature map is the feature map output by the generation network preceding the target generation network (the (i-1)-th generation network), and the size of the second feature map is larger than that of the first feature map.
That is, as shown in fig. 3, each generation network outputs a feature map, and for two adjacent generation networks, the output of the former is the input of the latter. As i increases, the size of the output second feature map becomes larger. For example, if N = 9, there are 9 generation networks: the second feature map output by the 1st generation network is 4 × 4, so the first feature map input to the 2nd generation network is 4 × 4 and the second feature map output by the 2nd generation network is 8 × 8, and so on, until the second feature map output by the 9th generation network is 1024 × 1024. Each generation network performs a deconvolution operation on the input first feature map to increase its size.
It can be understood that the earlier the target generation network i is (i.e., the smaller i is), the smaller the size of the output second feature map is and the coarser the features it can affect. When the size of the output second feature map does not exceed 8 × 8, it mainly affects the pose, hairstyle, face shape and the like; when the size is larger than 16 × 16 and smaller than 32 × 32, it affects finer facial features such as the hairstyle and the opening or closing of the eyes; and when the size is larger than 64 × 64 and smaller than or equal to 1024 × 1024, it affects the color of the eyes, hair and skin, and other microscopic features.
It should be noted that the plurality of sequentially arranged generation networks corresponds to the image generator S2 in the embodiment shown in fig. 1, and the mapping network is the mapping network in the embodiment shown in fig. 1; the relevant principles of the style confrontation generation network are not repeated here.
In this embodiment, referring to fig. 4, step S23 specifically includes:
S231: inputting the feature latent code into the mapping network to perform feature decoupling on the feature latent code and generate an intermediate vector.
S232: and according to the target size, performing feature extraction on the three-dimensional reconstruction image to obtain a three-dimensional reconstruction feature map, wherein the target size is the size of the second feature map, and the size of the three-dimensional reconstruction feature map is the target size.
S233: and fusing the intermediate vector, the first feature map, the three-dimensional reconstruction feature map and the random noise input target generation network to output a second feature map.
S234: and determining the second feature map output by the last generation network in the plurality of sequentially arranged generation networks as a second face image.
First, the feature latent code is input into the mapping network for feature decoupling, generating an intermediate vector w. Since the mapping network consists of several fully-connected layers, the feature decoupling process is a down-sampling operation similar to convolution, that is, the feature latent code is reduced to a one-dimensional intermediate vector w. The intermediate vector w is input to each generation network, and each generation network obtains 2 control vectors from the intermediate vector w, so that different elements of the control vectors can control different visual features; when one control vector is adjusted, the other control vectors are not changed, that is, other features are not affected and no feature entanglement occurs.
In order to fuse the features in the three-dimensional reconstruction image into the second feature map and make the fused features fit with the second feature map, feature extraction is performed on the three-dimensional reconstruction image according to the target size (i.e., the size of the second feature map), and the three-dimensional reconstruction feature map is acquired. The specific way of feature extraction may be a convolution operation of downsampling. Since the size of the three-dimensional reconstruction feature map is the target size, that is, the size of the three-dimensional reconstruction feature map is consistent with the size of the second feature map, the roughness of the features reflected by the three-dimensional reconstruction feature map is consistent with the roughness of the features reflected by the second feature map.
As shown in fig. 3, the features of the 1 st three-dimensional reconstructed feature map (4 × 512) are coarser than those of the 2 nd three-dimensional reconstructed feature map (8 × 512), and the features of the 7 th three-dimensional reconstructed feature map (256 × 512) are finest.
It can be understood that the number of the three-dimensional reconstruction feature maps is the same as the number of the generation networks, and the three-dimensional reconstruction feature maps correspond to the generation networks one by one, that is, a three-dimensional reconstruction feature map with a matched size is generated as the input of the generation networks.
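One simple way to obtain a three-dimensional reconstruction feature map matching each generation network's target size is sketched below, assuming PyTorch; the shared 3 × 3 projection and adaptive average pooling stand in for the down-sampling convolution described above, and the channel width is assumed:

```python
import torch.nn as nn
import torch.nn.functional as F

class ReconFeaturePyramid(nn.Module):
    """One feature map of the 3D reconstruction image per generation network,
    each resized to that network's target size."""
    def __init__(self, sizes=(4, 8, 16, 32, 64, 128, 256), channels=512):
        super().__init__()
        self.sizes = sizes
        self.proj = nn.Conv2d(3, channels, 3, padding=1)   # shared projection, assumed

    def forward(self, recon_image):
        feats = self.proj(recon_image)
        # Adaptive pooling stands in for the strided down-sampling convolution in the text.
        return {s: F.adaptive_avg_pool2d(feats, s) for s in self.sizes}
```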
And then, inputting the intermediate vector, the first feature map, the three-dimensional reconstruction feature map and the random noise into a target generation network for fusion so as to output a second feature map.
The process of fusing the intermediate vector, the first feature map and the random noise is the same as the fusion performed by a generation network in the embodiment shown in fig. 1. That is, in this embodiment, the modules of the target generation network that fuse the intermediate vector, the first feature map and the random noise have the same layer structure as the corresponding modules in the embodiment shown in fig. 1.
In this embodiment, the size of the three-dimensional reconstruction feature map is the same as the size of the second feature map output by the target generation network, so the three-dimensional reconstruction feature map can be linearly or nonlinearly fused with the second feature map before it is output. Because the size of the three-dimensional reconstruction feature map is adapted to the target generation network, the features of the three-dimensional reconstruction image can be merged, batch by batch according to their coarseness, into the second feature maps output by the generation networks. At each level of the generated second feature maps, this supervises the position of the whole face and the distribution of the facial features, so that the second feature maps output by the network have targeted responses at the positions of different geometric information.
It is understood that linear fusion is to perform linear operation, such as addition and subtraction, on the two feature maps through a linear function, and nonlinear fusion is to perform nonlinear operation, such as multiplication and division or logarithm, on the two feature maps through a nonlinear function.
Based on the principle of the style confrontation generation network, the second feature map output by the last of the sequentially arranged generation networks is determined as the second face image. The second face image is fused with the three-dimensional spatial features and is closer to the original first face image, which effectively reduces the error introduced into the second face image by the style confrontation generation network; that is, the second face image can accurately express the feature latent code, which helps the converged feature latent code extraction network output a more accurate feature latent code.
In this embodiment, the features of the three-dimensional reconstruction image can be merged into the second feature maps output by the generation networks according to their coarseness. At each level of the generated second feature maps this supervises the position of the whole face and the distribution of the facial features, so that the second feature map output by each generation network has targeted responses at the positions of different geometric information. That is, by fusing according to feature coarseness, the second face image can express the feature latent code more accurately, which helps the converged feature latent code extraction network output a more accurate feature latent code.
In some embodiments, the target generation network includes at least one convolution layer and a first fusion layer, and a second fusion layer is arranged after each convolution layer, where the second fusion layer is the adaptive instance normalization layer in the embodiment shown in fig. 1. It can be understood that the first fusion layer is used to fuse the three-dimensional reconstruction feature map.
Based on the above network structure, in this embodiment, referring to fig. 5, step S233 specifically includes:
S2331: inputting the intermediate vector into each second fusion layer, each second fusion layer performing an affine transformation on the intermediate vector according to the target size to obtain a feature factor, wherein the feature factor matches the target size.
S2332: and the target second fusion layer fuses the first intermediate feature map input into the target second fusion layer with the random noise by adopting the feature factors to obtain a second intermediate feature map output by the target second fusion layer, wherein the target second fusion layer is any one of the second fusion layers, and the first intermediate feature map is the feature map output by the convolution layer positioned in the previous layer of the target second fusion layer.
S2333: and inputting the second intermediate characteristic diagram output by the last second fusion layer and the three-dimensional reconstruction characteristic diagram into the first fusion layer for fusion to obtain a second characteristic diagram.
It can be understood that the thickness degrees of the features represented by the second feature maps output based on different generation networks are different, so that each second fusion layer in the target generation network performs affine transformation on the intermediate vector according to the thickness degree of the corresponding feature corresponding to the target generation network (i.e., the size of the second feature map, the target size), and obtains the feature factor. It will be appreciated that each generated network corresponds to a feature factor, the feature factor i corresponding to the target generated network i, the feature factor being adapted to the size of the second feature map. That is, the degree of thickness of the feature that can be controlled by the feature factor is substantially the same as the degree of thickness of the feature represented in the second feature map.
Since the target generation network has at least one second fusion layer and the processing procedure of each second fusion layer is the same, any one of the second fusion layers (i.e., the target second fusion layer) is taken here as an example. The target second fusion layer may be an adaptive instance normalization (AdaIN) layer; it uses the feature factor to fuse the input first intermediate feature map with the random noise to obtain the output second intermediate feature map, so that the second intermediate feature map is fused with the features reflected by the random noise. It can be understood that the first intermediate feature map is the feature map output by the convolutional layer located at the layer preceding the target second fusion layer. The fusion performed by the second fusion layer is a linear or nonlinear calculation among the feature factor, the first intermediate feature map and the random noise. It can be understood that the features reflected by the random noise may be hair, color spots, beard and the like, which do not affect the identity of the human face.
The second intermediate feature map output by the last second fusion layer and the three-dimensional reconstruction feature map are then input into the first fusion layer for fusion to obtain the second feature map. The fusion performed by the first fusion layer is a linear or nonlinear calculation between the second intermediate feature map output by the last second fusion layer and the three-dimensional reconstruction feature map.
In this embodiment, the three-dimensional reconstruction feature map, which includes the three-dimensional spatial features of the corresponding scale, is fused with the last second intermediate feature map in the target generation network. That is, the three-dimensional reconstruction feature map is fused directly, without undergoing a convolution operation in the target generation network, so that it can better supervise the position of the whole face and the distribution of the five sense organs. As a result, the second feature map output by the generation network has a targeted response at positions with different geometric information and is fused with the corresponding three-dimensional spatial features.
In some embodiments, the feature factor includes a scaling factor and a bias factor.
In this embodiment, step S2332 specifically includes:
calculating the second intermediate feature map output by the target second fusion layer by using the following formula,
yij = y(s,i)*(Tij + Bi) + y(b,i)
wherein i is the label of the target generation network, 1 ≤ i ≤ N, and N is the number of generation networks; j is the label of the target second fusion layer, 1 ≤ j ≤ M, and M is the number of second fusion layers in the target generation network; Tij is the first intermediate feature map input to the target second fusion layer, Bi is the random noise, y(s,i) is the scaling factor, and y(b,i) is the bias factor.
That is, the first intermediate feature map Tij output by each convolutional layer is fused with the random noise by the above formula, and the details of the first intermediate feature map are thereby refined.
In this embodiment, step S2333 specifically includes:
calculating the second feature map by using the following formula;
Xi+1 = yij * hi, with j = M
wherein yij is the second intermediate feature map output by the last second fusion layer (j = M), hi is the three-dimensional reconstruction feature map, and Xi+1 is the second feature map.
In this embodiment, the three-dimensional reconstruction feature maps reflecting different feature scales are merged into the second feature maps in the form of a product, so that the resulting second feature maps have a targeted response at positions with different geometric information.
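For illustration only, the following Python sketch shows one way the target generation network described above could be wired together. It assumes a PyTorch-style interface; the class name, the 2× upsampling of the first feature map, the use of a learned affine (linear) layer to map the intermediate vector to the scaling and bias factors, and the channel sizes are assumptions made for this sketch, not details taken from this application.

```python
import torch
import torch.nn as nn

class TargetGenerationBlock(nn.Module):
    """Hypothetical sketch of one target generation network in the style confrontation generation network."""

    def __init__(self, in_channels, out_channels, w_dim, num_convs=2):
        super().__init__()
        # Bring the first feature map up to the target size of this block (assumed 2x).
        self.upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_channels if k == 0 else out_channels, out_channels, 3, padding=1)
             for k in range(num_convs)]
        )
        # One "second fusion layer" per convolution: an affine transformation of the
        # intermediate vector into per-channel feature factors y(s,i) and y(b,i).
        self.affines = nn.ModuleList(
            [nn.Linear(w_dim, 2 * out_channels) for _ in range(num_convs)]
        )

    def forward(self, first_feature_map, w, recon_feature_map, noise):
        x = self.upsample(first_feature_map)
        for conv, affine in zip(self.convs, self.affines):
            t_ij = conv(x)                              # first intermediate feature map Tij
            y_s, y_b = affine(w).chunk(2, dim=1)        # scaling factor and bias factor
            y_s = y_s[:, :, None, None]
            y_b = y_b[:, :, None, None]
            x = y_s * (t_ij + noise) + y_b              # yij = y(s,i)*(Tij + Bi) + y(b,i)
        # First fusion layer: merge the 3D reconstruction feature map hi by product.
        return x * recon_feature_map                    # Xi+1 = yij * hi
```

Here each scale-and-bias step after a convolution plays the role of a second fusion layer, and the final product with the three-dimensional reconstruction feature map plays the role of the first fusion layer; the random noise and the reconstruction feature map are assumed to already have the target size so that they broadcast against the intermediate feature maps.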
In some embodiments, the difference between the first facial image and the second facial image comprises a structural difference between the first facial image and the second facial image.
Referring to fig. 6, before step S24, the method further includes:
S31: And calculating the brightness similarity value, the contrast similarity value and the structure similarity value of the first face image and the second face image.
S32: and taking the product of the brightness similarity value, the contrast similarity value and the structure similarity value as the structure difference.
Wherein, the brightness similarity value is an index reflecting the degree of similarity between the brightness of the two images, and the contrast similarity value is an index reflecting the degree of similarity between the contrasts of the two images. The structure similarity value may be a Structural Similarity (SSIM) index, which is an index for measuring the similarity between two images. The brightness similarity value r(Y, Y'), the contrast similarity value c(Y, Y'), and the structure similarity value s(Y, Y') of the first face image and the second face image may be calculated as follows.
r(Y,Y') = (2*μY*μY' + c1) / (μY^2 + μY'^2 + c1)
c(Y,Y') = (2*σY*σY' + c2) / (σY^2 + σY'^2 + c2)
s(Y,Y') = (σYY' + c3) / (σY*σY' + c3), where c3 may be taken as c2/2
wherein Y represents the first face image, Y' represents the second face image, μY represents the mean of the pixel values of the first face image, μY' represents the mean of the pixel values of the second face image, σY represents the pixel value variance of the first face image, and σY' represents the pixel value variance of the second face image;
wherein σYY' represents the covariance of the first face image and the second face image, c1 = (k1*LO)^2 and c2 = (k2*LO)^2, where k1 and k2 are preset constants (for example, k1 may be 0.01 and k2 may be 0.03), and LO is the range of pixel values, which may generally be 255.
Further, the above structural difference Ls (Y, Y') is obtained according to the following formula:
Ls(Y,Y')=r(Y,Y')*c(Y,Y')*s(Y,Y')
In this embodiment, in the above manner, the difference between the first face image and the second face image includes the structural difference between the first face image and the second face image, so that the second face image is closer to the first face image in structure (brightness, contrast, and structural similarity). That is, the feature latent code can better restore the first face image in structure, and the feature latent code output by the converged feature latent code extraction network can accurately reflect the structural features (brightness, contrast, and structural similarity) of the first face image. The face image can therefore be better restored in terms of global features, and the restoration degree of the face image is improved.
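For illustration, the structural difference could be computed roughly as in the Python sketch below, which follows the decomposition above over whole images; the constant c3 = c2/2 and the use of the standard deviation inside the contrast and structure terms are common SSIM conventions assumed here, not details taken from this application.

```python
import numpy as np

def structural_difference(y, y_prime, k1=0.01, k2=0.03, lo=255.0):
    """Ls(Y, Y') = r(Y, Y') * c(Y, Y') * s(Y, Y') computed over whole images (sketch)."""
    y = np.asarray(y, dtype=np.float64)
    y_prime = np.asarray(y_prime, dtype=np.float64)
    mu_y, mu_yp = y.mean(), y_prime.mean()
    sigma_y, sigma_yp = y.std(), y_prime.std()
    cov = ((y - mu_y) * (y_prime - mu_yp)).mean()        # covariance of the two images
    c1, c2 = (k1 * lo) ** 2, (k2 * lo) ** 2
    c3 = c2 / 2.0                                         # assumed convention for the structure term
    r = (2 * mu_y * mu_yp + c1) / (mu_y**2 + mu_yp**2 + c1)              # brightness similarity
    c = (2 * sigma_y * sigma_yp + c2) / (sigma_y**2 + sigma_yp**2 + c2)  # contrast similarity
    s = (cov + c3) / (sigma_y * sigma_yp + c3)                           # structure similarity
    return r * c * s
```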
In some embodiments, the difference between the first facial image and the second facial image comprises a pixel difference between the first facial image and the second facial image.
Referring to fig. 7, before step S24, the method further includes:
S33: And calculating a pixel difference value between a pixel point at the target position in the first face image and a pixel point at the target position in the second face image to obtain a pixel difference value corresponding to the target position, wherein the target position is any one position in the first face image or the second face image.
S34: and taking the sum of pixel difference values corresponding to each position in the first face image and the second face image as pixel difference.
Specifically, the calculation formula of the pixel difference Lp (Y, Y') between the first face image and the second face image is as follows:
Lp(Y,Y') = Σ(i=1 to P) Σ(j=1 to Q) |Yi,j − Y'i,j|
wherein Yi,j represents the pixel value of the pixel point with coordinates (i, j) in the first face image Y, Y'i,j represents the pixel value of the pixel point with coordinates (i, j) in the second face image Y', and P × Q represents the resolution of the first face image and the second face image; for example, P may be 1024 and Q may be 1024, so that the resolution of the first face image and the second face image is 1024 × 1024.
In this embodiment, in the above manner, the difference between the first face image and the second face image includes the pixel difference between the first face image and the second face image, so that the second face image is closer to the first face image in terms of pixel values. That is, the feature latent code can better restore the first face image in terms of pixel values, and the feature latent code output by the converged feature latent code extraction network can reflect the pixel features of the first face image. The face image can therefore be better restored in terms of local features, and the restoration degree of the face image is improved.
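A minimal sketch of the pixel difference, assuming the per-position term is the absolute difference of pixel values as in the formula above:

```python
import numpy as np

def pixel_difference(y, y_prime):
    """Lp(Y, Y'): sum of per-position pixel differences between two images (sketch)."""
    y = np.asarray(y, dtype=np.float64)
    y_prime = np.asarray(y_prime, dtype=np.float64)
    return np.abs(y - y_prime).sum()
```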
In some embodiments, the difference between the first facial image and the second facial image comprises a three-dimensional reconstruction parameter difference between the first facial image and the second facial image. I.e. the difference between the first three-dimensional reconstruction parameters of the first facial image and the second three-dimensional reconstruction parameters of the second facial image.
Before step S24, the method further includes:
S35: And calculating a parameter difference between the first three-dimensional reconstruction parameter and the second three-dimensional reconstruction parameter, and taking the parameter difference as a three-dimensional reconstruction parameter difference.
The first three-dimensional reconstruction parameter is a three-dimensional reconstruction parameter obtained by performing three-dimensional face reconstruction on a face in the first face image, and the second three-dimensional reconstruction parameter is a three-dimensional reconstruction parameter obtained by performing three-dimensional face reconstruction on a face in the second face image.
Specifically, the calculation formula of the three-dimensional reconstruction parameter difference Lv (Y, Y') between the first face image and the second face image is as follows:
Lv(Y,Y') = Σ(i=1 to G) |Vi − Vi'|
wherein Vi denotes the ith parameter of the three-dimensional reconstruction parameters corresponding to the first face image, Vi' denotes the ith parameter of the three-dimensional reconstruction parameters corresponding to the second face image, and G denotes the dimension of the three-dimensional reconstruction parameters of the first face image and the second face image; for example, G may be 398.
In this embodiment, in the above manner, the difference between the first face image and the second face image includes the three-dimensional reconstruction parameter difference between the first face image and the second face image, so that the second face image is closer to the first face image in terms of three-dimensional geometric features. That is, the feature latent code can better restore the first face image in terms of three-dimensional geometric features, and the feature latent code output by the converged feature latent code extraction network can also reflect the three-dimensional geometric features of the first face image. The face image can therefore be better restored in terms of depth features, and the restoration degree of the face image is improved.
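Correspondingly, the three-dimensional reconstruction parameter difference could be sketched as below, again assuming absolute per-parameter differences:

```python
import numpy as np

def reconstruction_parameter_difference(v, v_prime):
    """Lv(Y, Y'): difference between the two sets of 3D reconstruction parameters (sketch)."""
    v = np.asarray(v, dtype=np.float64)          # e.g. G = 398 parameters per image
    v_prime = np.asarray(v_prime, dtype=np.float64)
    return np.abs(v - v_prime).sum()
```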
In some embodiments, the total difference between the first facial image and the second facial image may be calculated according to the following formula:
L=αLs(Y,Y')+βLp(Y,Y')+γLv(Y,Y')
wherein α, β and γ are the weights of the structural difference Ls, the pixel difference Lp and the three-dimensional reconstruction parameter difference Lv, respectively.
It can be understood that this is equivalent to constructing a multi-dimensional similarity loss function, which includes the structural difference representing the global difference, the pixel difference representing the local difference, and the three-dimensional reconstruction parameter difference representing the depth. As a result, the texture features (texture depth and texture color) of the second face image are closer to those of the first face image, the feature latent code output by the converged feature latent code extraction network can reflect the texture features of the first face image, and the first face image can be better restored in terms of global features, local features and depth features, so that the restoration degree is improved.
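Reusing the sketch functions defined above, the total difference could then be formed as a weighted sum; the weight values and the function names are illustrative assumptions:

```python
def total_difference(y, y_prime, v, v_prime, alpha=1.0, beta=1.0, gamma=1.0):
    """Total difference L = alpha*Ls + beta*Lp + gamma*Lv (weights are illustrative)."""
    ls = structural_difference(y, y_prime)
    lp = pixel_difference(y, y_prime)
    lv = reconstruction_parameter_difference(v, v_prime)
    return alpha * ls + beta * lp + gamma * lv
```

In practice the three terms live on very different numeric scales (a similarity product near 1 versus sums over pixels and parameters), so the weights α, β and γ effectively balance their contributions.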
Referring to fig. 8, a hardware structure diagram of a computer device according to an embodiment of the present application is provided. Specifically, as shown in fig. 8, the computer device 10 includes at least one processor 11 and a memory 12 that are communicatively connected (fig. 8 takes a bus connection and one processor as an example).
The processor 11 is configured to provide computing and control capabilities to control the computer device 10 to perform corresponding tasks, for example, to control the computer device 10 to perform any one of the feature latent code extraction methods described above.
It is understood that the processor 11 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The memory 12, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs and modules, such as the program instructions/modules corresponding to the feature latent code extraction method in the embodiments of the present application. The processor 11 may implement the feature latent code extraction method in any of the method embodiments described above by running the non-transitory software programs, instructions and modules stored in the memory 12. In particular, the memory 12 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 12 may also include memory located remotely from the processor, and such remote memory may be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions, which, when executed by a processor, cause the processor to perform the steps of the above-described method embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a computer readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Within the concept of the present application, the technical features in the above embodiments or in different embodiments may also be combined, the steps may be implemented in any order, and many other variations of the different aspects of the present application as described above exist, which are not provided in detail for the sake of brevity. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some technical features, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method for extracting a feature latent code, comprising:
acquiring a first face image of a characteristic latent code to be extracted and a three-dimensional reconstruction image corresponding to the first face image, wherein the three-dimensional reconstruction image comprises three-dimensional space characteristics of a face in the first face image;
inputting the first face image into a preset feature latent code extraction network to obtain a feature latent code output by the feature latent code extraction network;
inputting the feature latent codes and the three-dimensional reconstruction images into a style confrontation generation network to generate second face images fused with the three-dimensional space features;
iteratively adjusting parameters of the feature latent code extraction network according to the difference between the first face image and the second face image until the feature latent code extraction network is converged;
and taking the feature latent code output by the converged feature latent code extraction network as the feature latent code of the first face image.
2. The method according to claim 1, wherein the obtaining of the three-dimensional reconstruction image corresponding to the first face image comprises:
performing three-dimensional face reconstruction on the face in the first face image to obtain a first three-dimensional reconstruction parameter;
and performing two-dimensional rendering on the first three-dimensional reconstruction parameter to obtain a three-dimensional reconstruction image corresponding to the first face image.
3. The method according to claim 1, wherein the style confrontation generating network comprises a mapping network and a plurality of sequentially arranged generating networks, the generating networks are respectively used for outputting feature maps with different sizes, wherein a target generating network is used for generating and outputting a second feature map according to an input first feature map, the size of the second feature map is larger than that of the first feature map, and the target generating network is any one generating network in the sequentially arranged generating networks;
the step of inputting the feature latent codes and the three-dimensional reconstruction images into a style confrontation generation network to generate a second face image fused with the three-dimensional spatial features comprises the following steps:
inputting the characteristic latent codes into the mapping network to perform characteristic decoupling on the characteristic latent codes, and generating intermediate vectors;
according to the target size, performing feature extraction on the three-dimensional reconstruction image to obtain a three-dimensional reconstruction feature map, wherein the target size is the size of the second feature map, and the size of the three-dimensional reconstruction feature map is the target size;
inputting the intermediate vector, the first feature map, the three-dimensional reconstruction feature map and random noise into the target generation network for fusion so as to output the second feature map;
and determining a second feature map output by the last generation network in the plurality of sequentially arranged generation networks as the second face image.
4. The method of claim 3, wherein the target generation network comprises at least one convolutional layer and a first fusion layer, each convolutional layer followed by a second fusion layer,
inputting the intermediate vector, the first feature map, the three-dimensional reconstruction feature map and random noise into the target generation network for fusion to output the second feature map, including:
inputting the intermediate vector into each second fusion layer, and performing affine transformation on the intermediate vector by each second fusion layer according to the target size to obtain a characteristic factor, wherein the characteristic factor is matched with the target size;
the target second fusion layer fuses the first intermediate feature map input into the target second fusion layer with the random noise by using the feature factor to obtain a second intermediate feature map output by the target second fusion layer, wherein the target second fusion layer is any one of the second fusion layers, and the first intermediate feature map is a feature map output by a convolutional layer located at a layer before the target second fusion layer;
and inputting the second intermediate feature map output by the last second fusion layer and the three-dimensional reconstruction feature map into the first fusion layer for fusion to obtain the second feature map.
5. The method of claim 4, wherein the characteristic factors include a scaling factor and a bias factor;
the target second fusion layer fuses the first intermediate feature map input into the target second fusion layer with the random noise by using the feature factor to obtain a second intermediate feature map output by the target second fusion layer, and the method includes:
calculating a second intermediate feature map output by the target second fusion layer by using the following formula,
yij=y(s,i)*(Tij+Bi)+y(b,i)
wherein i is the label of the target generation network, 1 ≤ i ≤ N, and N is the number of the generation networks; j is the label of the target second fusion layer, 1 ≤ j ≤ M, and M is the number of the second fusion layers in the target generation network; Tij is the first intermediate feature map input to the target second fusion layer, Bi is the random noise, y(s,i) is the scaling factor, and y(b,i) is the bias factor;
inputting the second intermediate feature map output by the last second fusion layer and the three-dimensional reconstruction feature map into the first fusion layer for fusion to obtain the second feature map, wherein the fusion process comprises the following steps:
calculating the second feature map by using the following formula;
Xi+1 = yij * hi, with j = M
wherein yij is the second intermediate feature map output by the last second fusion layer (j = M), hi is the three-dimensional reconstruction feature map, and Xi+1 is the second feature map.
6. The method according to any of claims 1-5, wherein the difference between the first facial image and the second facial image comprises a structural difference between the first facial image and the second facial image;
before the iterative parameter adjustment is performed on the feature latent code extraction network according to the difference between the first face image and the second face image, the method further includes:
calculating brightness similarity values, contrast similarity values and structure similarity values of the first face image and the second face image;
taking a product of the brightness similarity value, the contrast similarity value, and the structure similarity value as the structure difference.
7. The method according to any one of claims 1-5, wherein the difference between the first facial image and the second facial image comprises a pixel difference between the first facial image and the second facial image;
before the iterative parameter adjustment is performed on the feature latent code extraction network according to the difference between the first face image and the second face image, the method further includes:
calculating a pixel difference value between a pixel point at a target position in the first face image and a pixel point at the target position in the second face image to obtain a pixel difference value corresponding to the target position, wherein the target position is any one position in the first face image or the second face image;
and taking the sum of pixel difference values corresponding to each position in the first face image and the second face image as the pixel difference.
8. The method according to any one of claims 1-5, wherein the difference between the first facial image and the second facial image comprises a three-dimensional reconstruction parameter difference between the first facial image and the second facial image;
before the iterative parameter adjustment is performed on the feature latent code extraction network according to the difference between the first face image and the second face image, the method further includes:
calculating a parameter difference between a first three-dimensional reconstruction parameter and a second three-dimensional reconstruction parameter, and taking the parameter difference as the three-dimensional reconstruction parameter difference, wherein the first three-dimensional reconstruction parameter is a three-dimensional reconstruction parameter obtained by performing three-dimensional face reconstruction on a face in the first face image, and the second three-dimensional reconstruction parameter is a three-dimensional reconstruction parameter obtained by performing three-dimensional face reconstruction on a face in the second face image.
9. A computer device comprising memory and one or more processors to execute one or more computer programs stored in the memory, the one or more processors, when executing the one or more computer programs, causing the computer device to implement the method of any of claims 1-8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-8.
CN202111027269.5A 2021-09-02 2021-09-02 Characteristic latent code extraction method, computer equipment and storage medium Pending CN113763535A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111027269.5A CN113763535A (en) 2021-09-02 2021-09-02 Characteristic latent code extraction method, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111027269.5A CN113763535A (en) 2021-09-02 2021-09-02 Characteristic latent code extraction method, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113763535A true CN113763535A (en) 2021-12-07

Family

ID=78792690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111027269.5A Pending CN113763535A (en) 2021-09-02 2021-09-02 Characteristic latent code extraction method, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113763535A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114418919A (en) * 2022-03-25 2022-04-29 北京大甜绵白糖科技有限公司 Image fusion method and device, electronic equipment and storage medium
WO2023151529A1 (en) * 2022-02-11 2023-08-17 华为技术有限公司 Facial image processing method and related device

Similar Documents

Publication Publication Date Title
CN111325851B (en) Image processing method and device, electronic equipment and computer readable storage medium
Duan et al. 3D point cloud denoising via deep neural network based local surface estimation
WO2021027759A1 (en) Facial image processing
EP3963516B1 (en) Teaching gan (generative adversarial networks) to generate per-pixel annotation
CN111754396B (en) Face image processing method, device, computer equipment and storage medium
CN112215050A (en) Nonlinear 3DMM face reconstruction and posture normalization method, device, medium and equipment
Shen et al. Exploiting semantics for face image deblurring
JP2024501986A (en) 3D face reconstruction method, 3D face reconstruction apparatus, device, and storage medium
CN113763535A (en) Characteristic latent code extraction method, computer equipment and storage medium
CN112862807B (en) Hair image-based data processing method and device
JP2019016114A (en) Image processing device, learning device, focus controlling device, exposure controlling device, image processing method, learning method and program
CN111783779A (en) Image processing method, apparatus and computer-readable storage medium
CN112561926A (en) Three-dimensional image segmentation method, system, storage medium and electronic device
CN111488810A (en) Face recognition method and device, terminal equipment and computer readable medium
CN114648787A (en) Face image processing method and related equipment
Zhou et al. Personalized and occupational-aware age progression by generative adversarial networks
CN113762117A (en) Training method of image processing model, image processing model and computer equipment
CN114078149A (en) Image estimation method, electronic equipment and storage medium
CN113012030A (en) Image splicing method, device and equipment
CN114332119A (en) Method and related device for generating face attribute change image
KR20230083212A (en) Apparatus and method for estimating object posture
CN116030181A (en) 3D virtual image generation method and device
CN114724004A (en) Method for training fitting model, method for generating fitting image and related device
CN112884640A (en) Model training method, related device and readable storage medium
CN113674383A (en) Method and device for generating text image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination