CN113989441B - Automatic three-dimensional cartoon model generation method and system based on single face image - Google Patents

Automatic three-dimensional cartoon model generation method and system based on a single face image

Info

Publication number
CN113989441B
CN113989441B
Authority
CN
China
Prior art keywords
dimensional
cartoon model
dimensional cartoon
texture
facial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111355290.8A
Other languages
Chinese (zh)
Other versions
CN113989441A (en)
Inventor
潘俊君
黄美佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202111355290.8A
Publication of CN113989441A
Application granted
Publication of CN113989441B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/005General purpose rendering architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Graphics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and system for automatically generating a three-dimensional cartoon model from a single face image, wherein the method comprises the following steps. S1: construct corresponding cartoon images and textured three-dimensional cartoon models based on face images with the same identity, forming a cartoon data set. S2: input the textured three-dimensional cartoon model into a graph-convolution-based autoencoder to obtain a geometry and texture representation of the three-dimensional cartoon model. S3: input the face image into a ResNet network to obtain a facial pose vector and the vector of the geometry and texture representation of the three-dimensional cartoon model. S4: decode the geometry and texture vector with the decoder to obtain the generated three-dimensional cartoon model, transform the generated model to the same facial pose as the face image using the facial pose vector, and render it with a differentiable renderer. The method can efficiently and quickly generate three-dimensional cartoon models with exaggerated geometry and rich texture styles.

Description

Automatic three-dimensional cartoon model generation method and system based on single face image
Technical Field
The invention relates to the fields of computer graphics and image processing, and in particular to a method and system for automatically generating a three-dimensional cartoon model from a single face image.
Background
A caricature is a rendered image that conveys a person's most distinctive features through exaggeration, simplification, and abstraction. Caricatures are also used to express irony and humor about political and social issues. Artist-drawn caricatures are 2D images. Although widely used, they are insufficient for many applications, such as animation, virtual reality, and 3D printing, where 3D information is indispensable. 3D caricatures suit these applications, but they can only be created by artists with 3D modeling skills, and producing them is cumbersome and time consuming; automatically generating 3D exaggerated-style faces is therefore a meaningful and under-explored research direction. Generating a three-dimensional caricature from a two-dimensional caricature or an ordinary photo is similar to face reconstruction, but little work has addressed the automatic generation of three-dimensional caricatures from face pictures.
Disclosure of Invention
To solve the above technical problems, the invention provides a method and system for automatically generating a three-dimensional cartoon model from a single face image.
The technical scheme of the invention is as follows: a method for automatically generating a three-dimensional cartoon model from a single face image comprises the following steps:
Step S1: constructing a corresponding cartoon image and a three-dimensional cartoon model containing textures based on face images with the same identity to form a cartoon data set, wherein the face images and the cartoon images have the same facial pose;
Step S2: inputting the texture-containing three-dimensional cartoon model into a graph-convolution-based autoencoder, wherein the autoencoder comprises an encoder and a decoder with mesh convolution and mesh sampling operations, and encoding the texture-containing three-dimensional cartoon model with the graph-convolution autoencoder to obtain a geometry and texture representation of the three-dimensional cartoon model;
Step S3: inputting the face image into a ResNet network to obtain a facial pose vector and the vector of the geometry and texture representation of the three-dimensional cartoon model;
Step S4: decoding the vector of the geometry and texture representation with the decoder to obtain the generated three-dimensional cartoon model, converting the generated three-dimensional cartoon model to the same facial pose as the face image using the facial pose vector, and rendering the generated three-dimensional cartoon model with a differentiable renderer; meanwhile, constructing a pixel loss function and a facial feature point loss function to constrain the end-to-end network training.
Compared with the prior art, the invention has the following advantages:
1. The invention discloses a method for automatically generating a three-dimensional cartoon model from a single face image. By constructing a three-dimensional cartoon model representation based on a graph-convolution autoencoder, it can generate exaggerated geometry and stylized textures, solving the problems that traditional linear open-source face models have insufficient extrapolation ability and that PCA-based parameterized three-dimensional face models cannot generate exaggerated faces.
2. Compared with the multi-step methods of the prior art, the disclosed method learns directly from a single face image to generate a three-dimensional cartoon model with personalized features, and can efficiently and quickly produce three-dimensional cartoon models with exaggerated geometry and rich texture styles.
3. By introducing a differentiable renderer, the disclosed method enables end-to-end training of the neural network and solves the non-differentiability problem of traditional physically based rendering.
Drawings
FIG. 1 is a flowchart of the method for automatically generating a three-dimensional cartoon model from a single face image in an embodiment of the invention;
FIG. 2 is a schematic diagram of a cartoon dataset construction process in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for obtaining a three-dimensional cartoon containing textures in accordance with an embodiment of the present invention;
FIG. 4 is a flow chart of constructing a three-dimensional caricature model representation based on a graph convolution self-encoder in accordance with an embodiment of the present invention;
FIG. 5 is a schematic flow diagram of generating a three-dimensional cartoon model based on renderable end-to-end in an embodiment of the invention;
fig. 6 is a block diagram of the system for automatically generating a three-dimensional cartoon model from a single face image in an embodiment of the invention.
Detailed Description
The invention provides a method for automatically generating a three-dimensional cartoon model from a single face image, which can efficiently and quickly generate three-dimensional cartoon models with exaggerated geometry and rich texture styles.
The present invention will be further described in detail below with reference to the accompanying drawings by way of specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent.
Example 1
As shown in fig. 1, the method for automatically generating a three-dimensional cartoon model from a single face image provided by the embodiment of the invention comprises the following steps:
Step S1: constructing a corresponding cartoon image and a three-dimensional cartoon model containing textures based on the face images with the same identity to form a cartoon data set, wherein the face images and the cartoon images have the same facial pose;
Step S2: inputting the textured three-dimensional cartoon model into a graph-convolution-based autoencoder, wherein the autoencoder comprises an encoder and a decoder with mesh convolution and mesh sampling operations, and encoding the textured three-dimensional cartoon model with the graph-convolution autoencoder to obtain a geometry and texture representation of the three-dimensional cartoon model;
step S3: inputting the face image into a ResNet network to obtain a facial pose vector and the vector of the geometry and texture representation of the three-dimensional cartoon model;
Step S4: decoding the vector of the geometry and texture representation with the decoder to obtain the generated three-dimensional cartoon model, converting the generated three-dimensional cartoon model to the same facial pose as the face image using the facial pose vector, and rendering the generated three-dimensional cartoon model with a differentiable renderer; meanwhile, constructing a pixel loss function and a facial feature point loss function to constrain the end-to-end network training.
In one embodiment, the above step S1, constructing corresponding cartoon images and a textured three-dimensional cartoon model based on face images with the same identity to form a cartoon data set, wherein the face images and the cartoon images have the same facial pose, specifically comprises the following steps:
Step S11: face detection is carried out based on the face image set, and clipping is carried out to obtain a face image;
The face images in the embodiment of the invention are obtained from the open-source dataset CelebA. CelebA is a large-scale face attribute dataset comprising about 200k face images, each with 40 attribute annotations, covering large pose variations and diverse backgrounds. The embodiment first performs face detection on each image to obtain the face region, and finally crops the face image to 224×224 for the subsequent steps.
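The patent does not name a specific detector; as an illustration, here is a minimal preprocessing sketch assuming OpenCV's stock Haar cascade (any face detector would do) that crops and resizes an image to 224×224:

```python
import cv2

def crop_face(image_path, out_size=224):
    """Detect the largest face in an image and crop/resize it to out_size x out_size."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face found; skip this image
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest detection
    return cv2.resize(img[y:y + h, x:x + w], (out_size, out_size))
```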
Step S12: based on the face image with the same identity, generating a corresponding cartoon image by using a neural network cartoon automatic generation method;
Corresponding cartoon images are generated from face images with the same identity. A cartoon image is obtained by exaggerating and stylizing a face image: an automatic neural-network caricature generation method stylizes the face along the two dimensions of shape and texture, automatically predicting a set of control points for the face image that warp the photo into a caricature while preserving the same identity. An identity-preserving adversarial loss helps the discriminator distinguish between different subjects. In addition, the created caricature can be customized by controlling the degree of exaggeration and the visual style. Since images generated by a neural network may be blurred or lose detail, a super-resolution method is used to refine the cartoon images.
FIG. 2 illustrates a schematic diagram of a caricature dataset construction process.
Step S13: perform three-dimensional reconstruction on the cartoon image to obtain a three-dimensional mesh, and for each mesh vertex obtain, by projection, the pixel value of the corresponding image as its texture, yielding a textured three-dimensional cartoon model.
FIG. 3 illustrates a schematic diagram of a process for obtaining a three-dimensional caricature containing textures.
The three-dimensional cartoon model is obtained by three-dimensional reconstruction of the cartoon image. A parametric three-dimensional caricature model built on FaceWareHouse is used to reconstruct the cartoon image into a three-dimensional mesh. Because this model contains only geometry and no texture, during three-dimensional reconstruction the embodiment obtains, by projection, the pixel value on the corresponding image for each mesh vertex, yielding a coarse set of mesh colors. Due to the limited accuracy of the three-dimensional reconstruction, the mesh may deviate from the face after projection onto the 2D image, so some mesh colors inevitably pick up background or hair. Meshes with very large deviations are deleted by manual screening, yielding the textured three-dimensional cartoon models.
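As an illustration of this projection-based texturing step, a minimal sketch that samples one RGB value per mesh vertex; the (3, 4) projection matrix and the function name are assumptions, since the patent only specifies that vertex colors are fetched by projecting vertices onto the image:

```python
import numpy as np

def sample_vertex_colors(vertices, proj, image):
    """Project mesh vertices into the image and sample per-vertex RGB colors.

    vertices: (n, 3) reconstructed mesh; proj: (3, 4) camera projection matrix
    from the reconstruction (assumed given); image: (H, W, 3) uint8 RGB array.
    """
    h, w = image.shape[:2]
    homo = np.hstack([vertices, np.ones((len(vertices), 1))])  # homogeneous coords
    uvw = homo @ proj.T                                        # (n, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]                              # perspective divide
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)      # clamp to bounds,
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)      # nearest-pixel sample
    return image[v, u].astype(np.float32) / 255.0              # (n, 3) colors in [0, 1]
```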
In one embodiment, the above step S2, inputting the textured three-dimensional cartoon model into a graph-convolution-based autoencoder, the autoencoder comprising an encoder and a decoder with mesh convolution and mesh sampling operations, and encoding the textured three-dimensional cartoon model with the graph-convolution autoencoder to obtain the geometry and texture representation of the three-dimensional cartoon model, specifically comprises the following steps:
Step S21: define the attribute information of each mesh vertex of the textured three-dimensional cartoon model as {x, y, z, r, g, b}, where x, y, z are the three-dimensional coordinates of the mesh vertex and r, g, b are its corresponding RGB color values;
Each mesh in the embodiment of the invention has 6144 vertices, and the texture is represented by vertex coloring, so the vertex data of a textured three-dimensional cartoon mesh has size n×6, each mesh vertex carrying the 6-dimensional attribute information {x, y, z, r, g, b};
Step S22: define the textured three-dimensional cartoon model as F = (V, A), where V is the set of n mesh vertices and A is the sparse adjacency matrix representing the edges, i.e. the connectivity between mesh vertices, with $A \in \{0,1\}^{n \times n}$; the element values of A indicate edge connectivity: $A_{ij} = 1$ when mesh vertices i and j are connected, otherwise $A_{ij} = 0$;
Step S23: compute the Laplacian matrix $L = D - A$ from the sparse adjacency matrix A, where L is the graph Laplacian and D is the diagonal degree matrix of the mesh vertices, $D_{ii} = \sum_j A_{ij}$;
The Laplacian L is a real symmetric matrix and is diagonalized as $L = U \Lambda U^T$ with the Fourier basis $U \in \mathbb{R}^{n \times n}$, whose column vectors are orthogonal; $\Lambda$ is a diagonal matrix with non-negative real eigenvalues;
The graph Fourier transform of the mesh vertices x is $x_w = U^T x$, and the inverse Fourier transform gives $x = U x_w$;
Because the mesh has a large number of vertices and the matrix U is not sparse, this computation is expensive; to address this, the embodiment of the invention uses, in the following steps, a mesh filter $g_\theta$ built on recursive Chebyshev polynomials;
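A minimal sketch of step S23's Laplacian and the spectrum rescaling used by the Chebyshev filter below, built with SciPy sparse matrices (the helper name is illustrative):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def scaled_laplacian(adj):
    """Build L = D - A and rescale its spectrum into [-1, 1] for Chebyshev filtering.

    adj: (n, n) sparse 0/1 adjacency matrix A of the mesh.
    """
    degrees = np.asarray(adj.sum(axis=1), dtype=np.float64).ravel()  # D_ii = sum_j A_ij
    lap = sp.csr_matrix(sp.diags(degrees) - adj)                     # L = D - A
    lmax = eigsh(lap, k=1, which="LM",                               # largest eigenvalue
                 return_eigenvectors=False)[0]
    return (2.0 / lmax) * lap - sp.identity(adj.shape[0], format="csr")
```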
Step S24: define the convolution operator in Fourier space as $x * y = U((U^T x) \odot (U^T y))$, where $\odot$ denotes the Hadamard (element-wise) product;
A filter $g_\theta$ parameterized by a Chebyshev polynomial of order K is designed as shown in equation (1):

$$g_\theta(\tilde{L}) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L}) \qquad (1)$$

where $\tilde{L} = 2L/\lambda_{max} - I_n$ is the scaled Laplacian matrix; the parameter $\theta \in \mathbb{R}^K$ is the vector of Chebyshev coefficients; $T_k \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order k, computed by the recurrence of equation (2):

$$T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x) \qquad (2)$$

initialized with $T_0 = 1$ and $T_1 = x$;
The mesh convolution is defined as the following equation (3):

$$y_j = \sum_{i=1}^{F_{in}} g_{\theta_{i,j}}(L) x_i \qquad (3)$$

where the input $x \in \mathbb{R}^{n \times 6}$ holds the mesh vertices with their 6 features, $x_i$ denoting its i-th feature; each convolution layer has $F_{in} \times F_{out}$ Chebyshev coefficient vectors, with $\theta_{i,j} \in \mathbb{R}^K$ the parameters to be trained; the output $y_j$ gives the mesh vertices with their 6 reconstructed features;
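Equation (3), together with the recurrence of equation (2), maps directly onto a small layer; a sketch in PyTorch, assuming a dense scaled Laplacian for brevity and K >= 2:

```python
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    """Chebyshev mesh convolution: y_j = sum_i g_{theta_{i,j}}(L) x_i, equation (3)."""

    def __init__(self, f_in, f_out, K):
        super().__init__()
        # F_in x F_out Chebyshev coefficient vectors theta_{i,j} in R^K, to be trained.
        self.theta = nn.Parameter(0.1 * torch.randn(K, f_in, f_out))

    def forward(self, x, lap):
        # x: (n, F_in) vertex features; lap: (n, n) scaled Laplacian L~ (dense here).
        tk_prev, tk = x, lap @ x                       # T_0 x = x, T_1 x = L~ x
        out = tk_prev @ self.theta[0] + tk @ self.theta[1]
        for k in range(2, self.theta.shape[0]):
            tk_next = 2 * (lap @ tk) - tk_prev         # T_k = 2 L~ T_{k-1} - T_{k-2}
            out = out + tk_next @ self.theta[k]
            tk_prev, tk = tk, tk_next
        return out                                     # (n, F_out)
```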
Step S25: downsample the mesh: the mesh with many vertices is downsampled to a mesh with a small number of vertices, and the indices of the removed mesh vertices are recorded in a matrix $Q_d \in \{0,1\}^{m \times n}$, which records the removed and retained vertex indices: $Q_d(p, d) = 1$ for a retained vertex $v_d$ and $Q_d(p, q) = 0$ for a removed vertex $v_q$;
The mesh is then upsampled to restore the format of the originally input mesh vertices, and the corresponding matrix $Q_u$ is obtained from $Q_d$: for each removed vertex $v_q$, the three nearest vertices $\{v_i, v_j, v_k\} \in V_d$ in the downsampled mesh are found and the barycentric coordinates $\tilde{v}_q = w_i v_i + w_j v_j + w_k v_k$, with $w_i + w_j + w_k = 1$, are computed; this gives $Q_u(q,i) = w_i$, $Q_u(q,j) = w_j$, $Q_u(q,k) = w_k$, and the removed vertex $v_q$ is recovered through the matrix $Q_u$;
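One row of the upsampling matrix can be recovered by solving the barycentric system above; a small sketch (the helper name and the least-squares solve are illustrative):

```python
import numpy as np

def upsampling_weights(v_q, v_down, tri):
    """Barycentric weights (w_i, w_j, w_k) recovering removed vertex v_q, i.e. one
    row of Q_u.

    v_q: (3,) removed vertex; v_down: (m, 3) retained vertices of the downsampled
    mesh; tri: index triple (i, j, k) of the nearest triangle in that mesh.
    """
    vi, vj, vk = v_down[list(tri)]
    # Solve v_q ~ w_i*vi + w_j*vj + w_k*vk subject to w_i + w_j + w_k = 1 via
    # least squares over the three coordinate equations plus the sum constraint.
    A = np.vstack([np.column_stack([vi, vj, vk]), np.ones(3)])   # (4, 3)
    b = np.append(v_q, 1.0)                                      # (4,)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w
```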
step S26: combine mesh convolution and mesh sampling into convolution-sampling operations: the encoder comprises several convolution-downsampling operations and the decoder of the graph-convolution autoencoder comprises several convolution-upsampling operations; the textured three-dimensional cartoon model is input into the encoder to obtain a latent vector serving as the geometry and texture representation of the three-dimensional cartoon model, and the latent vector is fed into the decoder to be restored into the textured three-dimensional cartoon model; meanwhile, the loss function is the L1 loss between the input mesh vertices and the output reconstructed mesh vertices.
The autoencoder in the embodiment of the invention comprises an encoder and a decoder. The encoder comprises four mesh convolutions, each followed by a downsampling operation that reduces the number of mesh vertices to a quarter; finally a fully connected layer maps the data to a one-dimensional latent space vector of dimension 256. The latent vector is input to the decoder, which comprises four mesh convolutions, each preceded by an upsampling operation that quadruples the number of mesh vertices. The decoder first restores the latent space vector to the mesh data format with a fully connected layer, then performs the mesh convolution and upsampling operations to restore the input data format: the encoder takes vertex-colored mesh data of size n×6 as input, and the output format is likewise n×6.
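Assembled, the architecture might look like the following sketch, reusing the ChebConv layer above; the per-level Laplacians and sampling matrices are assumed precomputed, the hidden feature widths are illustrative, and batching is omitted for brevity:

```python
import torch
import torch.nn as nn

class MeshAutoencoder(nn.Module):
    """4 x (ChebConv + 1/4 downsample) -> 256-d latent -> 4 x (upsample + ChebConv)."""

    def __init__(self, laps, downs, ups, n_vertices=6144, feats=(6, 16, 32, 64, 128)):
        super().__init__()
        self.laps, self.downs, self.ups = laps, downs, ups   # one per resolution level
        self.enc = nn.ModuleList(ChebConv(feats[i], feats[i + 1], K=6) for i in range(4))
        self.dec = nn.ModuleList(ChebConv(feats[4 - i], feats[3 - i], K=6) for i in range(4))
        n_small = n_vertices // 4 ** 4                       # 6144 -> 24 vertices
        self.fc_enc = nn.Linear(n_small * feats[4], 256)
        self.fc_dec = nn.Linear(256, n_small * feats[4])

    def encode(self, x):                                     # x: (6144, 6) vertices
        for conv, lap, down in zip(self.enc, self.laps, self.downs):
            x = down @ torch.relu(conv(x, lap))              # Q_d keeps 1/4 of vertices
        return self.fc_enc(x.flatten())                      # 256-d latent vector

    def decode(self, z):
        x = self.fc_dec(z).reshape(-1, 128)                  # back to mesh data format
        for i, (conv, lap, up) in enumerate(zip(self.dec, self.laps[::-1], self.ups)):
            x = conv(up @ x, lap)                            # Q_u restores vertices
            if i < 3:
                x = torch.relu(x)                            # no activation on final n x 6
        return x
```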
FIG. 4 illustrates a flow chart for constructing a three-dimensional caricature model representation based on a graph convolution self-encoder.
In one embodiment, the above step S3, inputting a face image into a ResNet network to obtain a facial pose vector and the vector of the geometry and texture representation of the three-dimensional cartoon model, specifically comprises:
inputting the face image into the ResNet network and learning high-level features of the face to obtain camera parameters as the facial pose vector, together with the vector representing the geometry and texture parameters.
The embodiment of the invention uses a ResNet architecture as the face encoder to learn high-level features of the face, obtaining a 256-dimensional vector representing the geometry and texture parameters and a 6-dimensional camera parameter vector representing the facial pose information; the facial pose vector is used to rotate the three-dimensional cartoon object so that it can be projected into the 2D image space, serving as a constraint for the subsequent steps.
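A minimal sketch of this face encoder; torchvision's ResNet-50 is an assumption (the patent only says ResNet), with the final fully connected layer resized to emit the 256-d code plus the 6-d pose:

```python
import torch
import torchvision

def build_face_encoder():
    """ResNet backbone with a 262-d head: 256-d geometry/texture code + 6-d pose."""
    net = torchvision.models.resnet50(weights=None)          # depth is an assumption
    net.fc = torch.nn.Linear(net.fc.in_features, 256 + 6)
    return net

encoder = build_face_encoder()
img = torch.randn(1, 3, 224, 224)                            # cropped face image
out = encoder(img)
latent, pose = out[:, :256], out[:, 256:]                    # split code and pose
```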
In one embodiment, the above step S4, decoding the vector of the geometry and texture representation with the decoder to obtain the generated three-dimensional cartoon model, converting the generated three-dimensional cartoon model to the same facial pose as the face image using the facial pose vector, and rendering the generated three-dimensional cartoon model with a differentiable renderer, while constructing a pixel loss function and a facial feature point loss function to constrain the end-to-end network training, specifically comprises the following steps:
Step S41: inputting the vector represented by the geometry and texture of the three-dimensional cartoon model into a decoder to obtain a generated three-dimensional cartoon model;
step S42: convert the generated three-dimensional cartoon model to the same facial pose as the face image using the facial pose vector, and render the generated three-dimensional cartoon model with a differentiable renderer to obtain the rendered image;
Because conventional graphics rendering is non-differentiable, the embodiment of the invention introduces a differentiable renderer, a rasterizer-based deferred rendering model that generates barycentric coordinates and a corresponding triangle ID for each pixel on the image plane. Because the normals and color attributes of the mesh vertices are interpolated at the corresponding pixels, gradients can easily be back-propagated through the renderer to the underlying parameters, enabling end-to-end training of the architecture;
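Before rasterization, the pose vector is applied to the mesh; a sketch under the assumption that the 6-d camera vector is three Euler angles plus a translation (the patent does not fix the parameterization), kept in torch ops so gradients flow through the pose:

```python
import torch

def euler_to_rotmat(angles):
    """Differentiable rotation matrix from (pitch, yaw, roll) Euler angles."""
    x, y, z = angles
    cx, sx = torch.cos(x), torch.sin(x)
    cy, sy = torch.cos(y), torch.sin(y)
    cz, sz = torch.cos(z), torch.sin(z)
    one, zero = torch.ones_like(x), torch.zeros_like(x)
    Rx = torch.stack([one, zero, zero, zero, cx, -sx, zero, sx, cx]).reshape(3, 3)
    Ry = torch.stack([cy, zero, sy, zero, one, zero, -sy, zero, cy]).reshape(3, 3)
    Rz = torch.stack([cz, -sz, zero, sz, cz, zero, zero, zero, one]).reshape(3, 3)
    return Rz @ Ry @ Rx

def apply_pose(vertices, pose):
    """vertices: (n, 3) mesh; pose: (6,) = 3 Euler angles followed by 3 translations."""
    return vertices @ euler_to_rotmat(pose[:3]).T + pose[3:]
```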
step S43: construct a pixel loss function measuring the difference between the input caricature image and the rendered image, and introduce an attention-based skin mask to filter out non-facial content, as shown in equation (4):

$$L_{pixel} = \frac{\sum_{i \in M_{proj}} M_i \cdot \| I_i - I'_i \|_2}{\sum_{i \in M_{proj}} M_i} \qquad (4)$$

where $I'_i$ is the rendered image, $I_i$ is the input caricature image, $M_i$ is the face mask, $M_{proj}$ is the image region corresponding to the projected three-dimensional mesh vertices, and i indexes the i-th pixel of the image;
The difference between the input image and the rendered image is measured in this step by computing the pixel-level error between the two images under the L2 norm. Because the image may contain occlusions, an attention-based skin mask is introduced so that only facial regions are considered and non-facial content is filtered out;
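A sketch of equation (4) as a training loss, assuming the mask already combines the skin-attention mask with the projected-mesh region M_proj:

```python
import torch

def pixel_loss(rendered, target, mask):
    """Masked photometric loss of equation (4).

    rendered, target: (H, W, 3) images; mask: (H, W) weights in [0, 1] covering
    the face region (skin mask restricted to the projected mesh, by assumption).
    """
    diff = torch.norm(rendered - target, dim=-1)              # per-pixel L2 distance
    return (mask * diff).sum() / mask.sum().clamp(min=1.0)    # mask-weighted average
```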
Step S44: construct a facial feature point loss function: detect the n 2D facial feature points in the caricature image, compute the vertex indices of the corresponding 3D facial feature points on the generated three-dimensional cartoon model, obtain the corresponding 2D projected facial feature points through the projection matrix, and construct an L2 loss between the detected and projected facial feature points, as shown in equation (5):

$$L_{lm} = \frac{1}{n} \sum_{i=1}^{n} \| q_i - q'_i \|_2 \qquad (5)$$

where $q_i$ is a 2D facial feature point in the caricature image and $q'_i$ is the corresponding 2D projected facial feature point obtained by projecting the 3D feature point of the generated three-dimensional model.
To bridge 2D and 3D, better constrain the network, and accelerate convergence, the facial feature point loss is introduced. The embodiment of the invention applies a caricature feature point detection method to the cartoon images in the cartoon data set to compute 68 2D facial feature points. Then, for the generated three-dimensional cartoon model, the vertex indices of the corresponding 68 3D facial feature points are computed, the corresponding 2D feature points are obtained through the projection matrix, and an L2 loss function is constructed between the two sets of 2D feature points.
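And a sketch of equation (5), projecting the landmark vertices with the same projection matrix and comparing them against the detected 2D points (names are illustrative):

```python
import torch

def landmark_loss(vertices, landmark_idx, proj, landmarks_2d):
    """Facial feature point loss of equation (5).

    vertices: (n, 3) posed mesh; landmark_idx: (68,) vertex indices of the 3D
    landmarks; proj: (3, 4) projection matrix; landmarks_2d: (68, 2) detections.
    """
    pts = vertices[landmark_idx]                               # (68, 3)
    homo = torch.cat([pts, torch.ones(len(pts), 1)], dim=1)    # homogeneous coords
    uvw = homo @ proj.T
    q_proj = uvw[:, :2] / uvw[:, 2:3]                          # projected 2D points
    return torch.norm(q_proj - landmarks_2d, dim=1).mean()     # mean L2 distance
```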
Fig. 5 illustrates a flow diagram for generating a renderable end-to-end three-dimensional caricature model.
The invention discloses a method for automatically generating a three-dimensional cartoon model from a single face image. By constructing a three-dimensional cartoon model representation based on a graph-convolution autoencoder, it can generate exaggerated geometry and stylized textures, solving the problems that traditional linear open-source face models have insufficient extrapolation ability and that PCA-based parameterized three-dimensional face models cannot generate exaggerated faces. Compared with the multi-step methods of the prior art, the method learns directly from a single face image to generate a three-dimensional cartoon model with personalized features, and can efficiently and quickly produce three-dimensional cartoon models with exaggerated geometry and rich texture styles. By introducing a differentiable renderer, the method enables end-to-end training of the neural network and solves the non-differentiability problem of traditional physically based rendering.
Example 2
As shown in fig. 6, an embodiment of the present invention provides a system for automatically generating a three-dimensional cartoon model from a single face image, which comprises the following modules:
the cartoon data set construction module 51, configured to construct corresponding cartoon images and textured three-dimensional cartoon models based on face images with the same identity to form a cartoon data set, wherein the face images and the cartoon images have the same facial pose;
the geometry and texture representation module 52, configured to input the textured three-dimensional cartoon model into a graph-convolution-based autoencoder, the autoencoder comprising an encoder and a decoder with mesh convolution and mesh sampling operations, and to encode the textured three-dimensional cartoon model with the graph-convolution autoencoder to obtain the geometry and texture representation of the three-dimensional cartoon model;
the facial pose and representation vector module 53, configured to input the face image into a ResNet network to obtain the facial pose vector and the geometry and texture representation vector of the three-dimensional cartoon model;
the three-dimensional cartoon model generation module 54, configured to decode the geometry and texture representation vector with the decoder to obtain the generated three-dimensional cartoon model, convert the generated three-dimensional cartoon model to the same facial pose as the face image using the facial pose vector, and render the generated three-dimensional cartoon model with a differentiable renderer; meanwhile, a pixel loss function and a facial feature point loss function are constructed to constrain the end-to-end network training.
The above examples are provided for the purpose of describing the present invention only and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalents and modifications that do not depart from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (5)

1. A method for automatically generating a three-dimensional cartoon model from a single face image, characterized by comprising the following steps:
Step S1: constructing a corresponding cartoon image and a three-dimensional cartoon model containing textures based on face images with the same identity to form a cartoon data set, wherein the face images and the cartoon images have the same facial pose;
step S2: inputting the texture-containing three-dimensional cartoon model into a graph-convolution-based autoencoder, wherein the autoencoder comprises an encoder and a decoder with mesh convolution and mesh sampling operations, and encoding the texture-containing three-dimensional cartoon model with the graph-convolution autoencoder to obtain the geometry and texture representation of the three-dimensional cartoon model, specifically comprising the following steps:
Step S21: defining the attribute information of each mesh vertex as {x, y, z, r, g, b}, wherein x, y, z are the three-dimensional coordinates of the mesh vertex and r, g, b are the RGB color values corresponding to the mesh vertex;
step S22: defining the texture-containing three-dimensional cartoon model as F = (V, A), wherein V is the set of n mesh vertices, and A is the sparse adjacency matrix representing the edges, i.e. the connectivity between the mesh vertices, with $A \in \{0,1\}^{n \times n}$; the element values of A indicate edge connectivity: $A_{ij} = 1$ when mesh vertices i and j are connected, otherwise $A_{ij} = 0$;
Step S23: calculating the Laplacian matrix $L = D - A$ from the sparse adjacency matrix A, wherein L is the graph Laplacian and D is the diagonal degree matrix of the mesh vertices, $D_{ii} = \sum_j A_{ij}$;
wherein the Laplacian L is a real symmetric matrix diagonalized as $L = U \Lambda U^T$ through the Fourier basis $U \in \mathbb{R}^{n \times n}$, the column vectors of U being orthogonal and $\Lambda$ being a diagonal matrix with non-negative real eigenvalues;
performing the graph Fourier transform $x_w = U^T x$ on the mesh vertices x, with $x = U x_w$ obtained by the inverse Fourier transform;
Step S24: defining the convolution operator in Fourier space as the Hadamard product $x * y = U((U^T x) \odot (U^T y))$;
designing a filter $g_\theta$ parameterized by a Chebyshev polynomial of order K, as shown in formula (1):

$$g_\theta(\tilde{L}) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L}) \qquad (1)$$

wherein $\tilde{L} = 2L/\lambda_{max} - I_n$ is the scaled Laplacian matrix; the parameter $\theta \in \mathbb{R}^K$ is the vector of Chebyshev coefficients; $T_k \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order k, calculated from formula (2):

$$T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x) \qquad (2)$$

initialized to $T_0 = 1$, $T_1 = x$;
the mesh convolution is defined as the following formula (3):

$$y_j = \sum_{i=1}^{F_{in}} g_{\theta_{i,j}}(L) x_i \qquad (3)$$

wherein the input $x \in \mathbb{R}^{n \times 6}$ holds the mesh vertices with their 6 features, $x_i$ denoting its i-th feature; each convolution layer has $F_{in} \times F_{out}$ Chebyshev coefficient vectors, $\theta_{i,j} \in \mathbb{R}^K$ being the parameters to be trained; the output $y_j$ gives the mesh vertices with their 6 reconstructed features;
Step S25: downsampling the mesh, wherein the mesh with many vertices is downsampled to a mesh with a small number of vertices and the indices of the removed mesh vertices are recorded in a matrix $Q_d \in \{0,1\}^{m \times n}$, which records the removed and retained vertex indices: $Q_d(p, d) = 1$ for a retained vertex $v_d$ and $Q_d(p, q) = 0$ for a removed vertex $v_q$;
then upsampling the mesh to restore the format of the initially input mesh vertices, wherein the corresponding matrix $Q_u$ is obtained from the matrix $Q_d$: the three vertices $\{v_i, v_j, v_k\} \in V_d$ nearest to $v_q$ are found and the barycentric coordinates $\tilde{v}_q = w_i v_i + w_j v_j + w_k v_k$ are calculated, wherein $w_i + w_j + w_k = 1$, giving $Q_u(q,i) = w_i$, $Q_u(q,j) = w_j$, $Q_u(q,k) = w_k$; the removed vertex $v_q$ is recovered through the matrix $Q_u$;
Step S26: combining the mesh convolution and the mesh sampling as convolution-sampling operations, wherein the encoder comprises a plurality of convolution-downsampling operations and the decoder of the graph-convolution autoencoder comprises a plurality of convolution-upsampling operations; the texture-containing three-dimensional cartoon model is input into the encoder to obtain a latent vector serving as the geometry and texture representation of the three-dimensional cartoon model, and the latent vector is fed into the decoder to be restored into the texture-containing three-dimensional cartoon model; meanwhile, the loss function is the L1 loss between the input mesh vertices and the output reconstructed mesh vertices;
Step S3: inputting the face image into a ResNet network to obtain a facial pose vector and the vector of the geometry and texture representation of the three-dimensional cartoon model;
Step S4: decoding the vector of the geometry and texture representation of the three-dimensional cartoon model with the decoder to obtain the generated three-dimensional cartoon model, converting the generated three-dimensional cartoon model to the same facial pose as the face image with the facial pose vector, and rendering the generated three-dimensional cartoon model with a differentiable renderer; meanwhile, constructing a pixel loss function and a facial feature point loss function to constrain the end-to-end network training.
2. The method for automatically generating a three-dimensional cartoon model from a single face image according to claim 1, wherein said step S1, constructing corresponding cartoon images and texture-containing three-dimensional cartoon models based on face images with the same identity to form a cartoon data set, the face images and the cartoon images having the same facial pose, specifically comprises the following steps:
Step S11: face detection is carried out based on the face image set, and clipping is carried out to obtain a face image;
step S12: based on the face image with the same identity, generating a corresponding cartoon image by using a neural network cartoon automatic generation method;
step S13: performing three-dimensional reconstruction based on the cartoon image to obtain a three-dimensional mesh, and obtaining by projection, for each mesh vertex, the pixel value of the corresponding image as its texture, yielding the texture-containing three-dimensional cartoon model.
3. The method for automatically generating a three-dimensional cartoon model from a single face image according to claim 1, wherein said step S3, inputting the face image into a ResNet network to obtain a facial pose vector and the vector of the geometry and texture representation of the three-dimensional cartoon model, specifically comprises:
inputting the face image into the ResNet network and learning high-level features of the face to obtain camera parameters as the facial pose vector and the vector representing the geometry and texture parameters.
4. The method for automatically generating a three-dimensional cartoon model from a single face image according to claim 1, wherein said step S4, decoding the vector of the geometry and texture representation with the decoder to obtain the generated three-dimensional cartoon model, converting the generated three-dimensional cartoon model to the same facial pose as the face image with the facial pose vector, and rendering the generated three-dimensional cartoon model with a differentiable renderer, while constructing a pixel loss function and a facial feature point loss function to constrain the end-to-end network training, specifically comprises the following steps:
Step S41: inputting vectors represented by the geometry and the texture of the three-dimensional cartoon model into the decoder to obtain a generated three-dimensional cartoon model;
Step S42: converting the generated three-dimensional cartoon model to the same facial pose as the face image with the facial pose vector, and rendering the generated three-dimensional cartoon model with a differentiable renderer to obtain the rendered image;
Step S43: constructing a pixel loss function for measuring the difference between the input caricature image and the rendered image, and introducing an attention-based skin mask to filter out non-facial content, as shown in formula (4):

$$L_{pixel} = \frac{\sum_{i \in M_{proj}} M_i \cdot \| I_i - I'_i \|_2}{\sum_{i \in M_{proj}} M_i} \qquad (4)$$

wherein $I'_i$ is the rendered image, $I_i$ is the input caricature image, $M_i$ is the face mask, $M_{proj}$ is the image region corresponding to the projected three-dimensional mesh vertices, and i is the i-th pixel of the image;
step S44: constructing a facial feature point loss function: calculating the n 2D facial feature points in the caricature image, calculating the vertex indices of the corresponding 3D facial feature points of the generated three-dimensional cartoon model, obtaining the corresponding 2D projected facial feature points through the projection matrix, and constructing an L2 loss for the facial feature points and the projected facial feature points, as shown in formula (5):

$$L_{lm} = \frac{1}{n} \sum_{i=1}^{n} \| q_i - q'_i \|_2 \qquad (5)$$

wherein $q_i$ is a 2D facial feature point in the caricature image and $q'_i$ is the corresponding 2D projected facial feature point obtained by projecting the 3D feature point of the generated three-dimensional model.
5. A system for automatically generating a three-dimensional cartoon model from a single face image, based on the method for automatically generating a three-dimensional cartoon model from a single face image according to any one of claims 1-4, characterized by comprising the following modules:
a cartoon data set construction module, configured to construct corresponding cartoon images and texture-containing three-dimensional cartoon models based on face images with the same identity to form a cartoon data set, wherein the face images and the cartoon images have the same facial pose;
a geometry and texture representation module, configured to input the texture-containing three-dimensional cartoon model into a graph-convolution-based autoencoder, the autoencoder comprising an encoder and a decoder with mesh convolution and mesh sampling operations, and to encode the texture-containing three-dimensional cartoon model with the autoencoder to obtain the geometry and texture representation of the three-dimensional cartoon model;
a facial pose and representation vector module, configured to input the face image into a ResNet network to obtain the facial pose vector and the geometry and texture representation vector of the three-dimensional cartoon model;
a three-dimensional cartoon model generation module, configured to decode the geometry and texture representation vector of the three-dimensional cartoon model with the decoder to obtain the generated three-dimensional cartoon model, convert the generated three-dimensional cartoon model to the same facial pose as the face image with the facial pose vector, and render the generated three-dimensional cartoon model with a differentiable renderer; meanwhile, a pixel loss function and a facial feature point loss function are constructed to constrain the end-to-end network training.
CN202111355290.8A 2021-11-16 2021-11-16 Automatic three-dimensional cartoon model generation method and system based on single face image Active CN113989441B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111355290.8A CN113989441B (en) 2021-11-16 2021-11-16 Automatic three-dimensional cartoon model generation method and system based on single face image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111355290.8A CN113989441B (en) 2021-11-16 2021-11-16 Automatic three-dimensional cartoon model generation method and system based on single face image

Publications (2)

Publication Number Publication Date
CN113989441A CN113989441A (en) 2022-01-28
CN113989441B true CN113989441B (en) 2024-05-24

Family

ID=79748793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111355290.8A Active CN113989441B (en) 2021-11-16 2021-11-16 Automatic three-dimensional cartoon model generation method and system based on single face image

Country Status (1)

Country Link
CN (1) CN113989441B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115482353A (en) * 2022-09-01 2022-12-16 北京百度网讯科技有限公司 Training method, reconstruction method, device, equipment and medium for reconstructing network
CN116958451B (en) * 2023-09-15 2023-12-26 腾讯科技(深圳)有限公司 Model processing, image generating method, image generating device, computer device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20030086063A * 2002-05-03 2003-11-07 Samsung Electronics Co., Ltd. Apparatus and method for producing three-dimensional caricature
CN111402394A (en) * 2020-02-13 2020-07-10 清华大学 Three-dimensional exaggerated cartoon face generation method and device
CN111508048A (en) * 2020-05-22 2020-08-07 南京大学 Automatic generation method for human face cartoon with interactive arbitrary deformation style
WO2020199478A1 (en) * 2019-04-03 2020-10-08 平安科技(深圳)有限公司 Method for training image generation model, image generation method, device and apparatus, and storage medium
CN112258387A (en) * 2020-10-30 2021-01-22 北京航空航天大学 Image conversion system and method for generating cartoon portrait based on face photo
CN112837210A (en) * 2021-01-28 2021-05-25 南京大学 Multi-form-style face cartoon automatic generation method based on feature image blocks
CN112883826A (en) * 2021-01-28 2021-06-01 南京大学 Face cartoon generation method based on learning geometry and texture style migration

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20030086063A * 2002-05-03 2003-11-07 Samsung Electronics Co., Ltd. Apparatus and method for producing three-dimensional caricature
WO2020199478A1 (en) * 2019-04-03 2020-10-08 平安科技(深圳)有限公司 Method for training image generation model, image generation method, device and apparatus, and storage medium
CN111402394A (en) * 2020-02-13 2020-07-10 清华大学 Three-dimensional exaggerated cartoon face generation method and device
CN111508048A (en) * 2020-05-22 2020-08-07 南京大学 Automatic generation method for human face cartoon with interactive arbitrary deformation style
CN112258387A (en) * 2020-10-30 2021-01-22 北京航空航天大学 Image conversion system and method for generating cartoon portrait based on face photo
CN112837210A (en) * 2021-01-28 2021-05-25 南京大学 Multi-form-style face cartoon automatic generation method based on feature image blocks
CN112883826A (en) * 2021-01-28 2021-06-01 南京大学 Face cartoon generation method based on learning geometry and texture style migration

Also Published As

Publication number Publication date
CN113989441A (en) 2022-01-28

Similar Documents

Publication Publication Date Title
Liu et al. Semantic-aware implicit neural audio-driven video portrait generation
Insafutdinov et al. Unsupervised learning of shape and pose with differentiable point clouds
Khakhulin et al. Realistic one-shot mesh-based head avatars
Ji et al. Deep view morphing
Petersen et al. Pix2vex: Image-to-geometry reconstruction using a smooth differentiable renderer
JP2019003615A (en) Learning autoencoder
CN113989441B (en) Automatic three-dimensional cartoon model generation method and system based on single face image
CN111768477A (en) Three-dimensional facial expression base establishment method and device, storage medium and electronic equipment
CN113313828B (en) Three-dimensional reconstruction method and system based on single-picture intrinsic image decomposition
WO2020174215A1 (en) Joint shape and texture decoders for three-dimensional rendering
CN113762147A (en) Facial expression migration method and device, electronic equipment and storage medium
CN113781659A (en) Three-dimensional reconstruction method and device, electronic equipment and readable storage medium
Ye et al. 3d morphable face model for face animation
CN113808272B (en) Texture mapping method in three-dimensional virtual human head and face modeling
Monnier et al. Differentiable blocks world: Qualitative 3d decomposition by rendering primitives
Maxim et al. A survey on the current state of the art on deep learning 3D reconstruction
Kim et al. Latent transformations neural network for object view synthesis
Zhang et al. A fast solution for Chinese calligraphy relief modeling from 2D handwriting image
CN116612251A (en) Multi-view reconstruction method based on grid nerve rendering and hexagonal constraint
Wu et al. Photogrammetric reconstruction of free-form objects with curvilinear structures
CN116452715A (en) Dynamic human hand rendering method, device and storage medium
CN114219900B (en) Three-dimensional scene reconstruction method, reconstruction system and application based on mixed reality glasses
CN115984510A (en) Stylized face texture modeling method, system, equipment and storage medium
Kókai et al. Example-based conceptual styling framework for automotive shapes
CN114742954A (en) Method for constructing large-scale diversified human face image and model data pairs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant