CN109741247B - Portrait cartoon generating method based on neural network - Google Patents


Info

Publication number
CN109741247B
CN109741247B
Authority
CN
China
Prior art keywords
sequence
vector
point
face
cartoon
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811631295.7A
Other languages
Chinese (zh)
Other versions
CN109741247A (en)
Inventor
吕建成
汤臣薇
徐坤
贺喆南
李婵娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN201811631295.7A priority Critical patent/CN109741247B/en
Publication of CN109741247A publication Critical patent/CN109741247A/en
Application granted granted Critical
Publication of CN109741247B publication Critical patent/CN109741247B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a portrait cartoon generating method based on a neural network, which comprises the following steps: S1, extracting the structural features of the face in a real face image and converting them into sequence feature data; S2, inputting the sequence feature data into a trained Seq2Seq VAE model to generate the corresponding exaggerated structure sequence points; S3, applying the generated exaggerated structure sequence points to the real face image to exaggeratedly deform it; and S4, applying a cartoon style to the exaggeratedly deformed face image to generate the portrait cartoon. The invention creatively proposes representing the structural features of a human face as sequence features, and applies this to cartoon generation by using a Seq2Seq VAE model to generate an exaggerated sequence. The limitations of existing image translation methods are overcome, and the generated exaggerated portrait cartoon is humorously exaggerated without losing the recognizability of the subject, while also reflecting the drawing styles of different cartoon artists.

Description

Portrait cartoon generating method based on neural network
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a portrait cartoon generating method based on a neural network.
Background
Portrait drawing remains a very popular form of artistic expression in modern times. With the continuous development of machine-vision technologies, portrait drawing is widely applied in virtual reality, augmented reality, robot portrait-drawing systems, multimedia, personalized entertainment, the Internet, and the like. In order to enhance the artistic expressiveness of a portrait, various artistic portraits such as sketches and cartoons are generated based on different artistic characteristics, and the cartoon, as a common artistic form, has attracted the attention and research of many scholars.
With the development of artificial intelligence, more and more scholars have begun to study the combination of artificial intelligence and art, i.e., computational art. By means of mathematics and statistics, the rules involved in art can be quantified as mathematical relationships; for example, the golden section, with its strict proportionality, artistry and harmony, has rich aesthetic value. These mathematical relationships in turn become part of the theoretical basis of computational art. When painting turns to the depiction of the human figure, there are many different forms of painting art.
As shown in fig. 1, portrait paintings include the exaggerated portrait caricature, the sketch, the cartoon, the simple line drawing, and the like. An exaggerated portrait caricature, as the name implies, expresses what clearly distinguishes a person from the "common face" through exaggeration and deformation of the facial organs. Compared with a sketch, the exaggerated caricature adds humorous elements on the basis of sketching. Unlike cartoons and simple line drawings, the exaggerated caricature can deliver the fun of a cartoon while preserving the recognizability of the person. However, most research work has focused on simpler artistic forms such as sketches, simple line drawings and cartoons; in contrast, only a few research efforts have focused on the generation of exaggerated portrait caricatures.
The generation of an exaggerated portrait caricature can be viewed as a style transformation from a real face image to a caricature image. Image-to-image translation is a popular class of vision problems whose goal is to learn the style characteristics of the target image and the mapping between the input and output images. Among these approaches, the generative adversarial network (GAN) based on convolutional neural networks (CNN) is considered one of the most popular image translation methods. However, existing methods can only convert the texture and color of an image; when the task involves changes in the image content and geometric structure, the effect of CNN-based adversarial generation methods is not ideal, and the generation of an exaggerated portrait caricature involves exactly such an exaggerated deformation of the image content, namely the face structure.
In order to directly convert a portrait picture into the corresponding portrait caricature, one prior-art method is sample-based: given a portrait picture of a face, each face is decomposed into different parts (such as the nose, the mouth and the like); for each part, feature matching is applied to retrieve the corresponding caricature components in a data set, and the components are then combined to construct a caricature face. The other method is based on facial features: it first defines the feature points of an active shape model, then generates an exaggerated portrait from the real face image based on the face and the mutual relations of its parts, introducing a "contrast principle" while obtaining the exaggerated shape of the face from the exaggeration of the face shape and of the five sense organs; finally, an exaggerated image of the face image is generated in combination with an image deformation method.
In the prior art, the sample-based method needs a large number of caricature components to be drawn and collected to build a database covering different local facial characteristics, so the workload is huge and the technical requirements are extremely high; the assembled faces are relatively fixed and lack diversity; and the final effect only cartoonizes the original face without deforming its distinctive features, i.e., it does not meet the definition of an exaggerated portrait caricature. The other method, based on facial features, can exaggerate the original face to a certain degree, but the effect is not obvious, the recognizability of the person is poor, and the resulting caricature has only a single style.
Disclosure of Invention
Aiming at the above defects in the prior art, the neural-network-based portrait cartoon generation method provided by the invention solves the problems that the image translation methods used in existing portrait cartoon generation are limited and that the obtained portrait cartoons have a single style.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that: a portrait cartoon generating method based on a neural network comprises the following steps:
s1, extracting the structural features of the face in the real face image, and converting the extracted structural feature data into sequence feature data;
s2, inputting the sequence feature data into the trained Seq2Seq VAE model to generate an exaggerated structure sequence point corresponding to the face image;
s3, applying the generated exaggerated structure sequence points to the real face image by utilizing a thin plate spline interpolation technology to realize the exaggerated deformation of the real face image;
and S4, applying the cartoon style to the face image after the exaggerated deformation by using the CycleGAN technology to generate the portrait cartoon.
Further, the structural features of the human face in the step S1 include the contour structural features and the facial features of the human face;
the step S1 specifically includes:
extracting 68 sequence points of a real face image to serve as structural feature data of a face, and obtaining an offset coordinate sequence of each sequence point relative to a previous sequence point according to an absolute coordinate value of each sequence point, wherein the offset coordinate sequence is sequence feature data;
wherein each sequence point is a sequence point to which a state value is added, represented as Q(x, y, p1, p2, p3);
wherein x and y represent the offset distances of the sequence point relative to the previous sequence point in the x and y directions;
(p1, p2, p3) is a binary one-hot vector representing three facial states: p1 indicates that the sequence point is the start of the facial contour or of one of the five sense organs, p2 indicates that the sequence point belongs to the same organ as the previous sequence point, and p3 indicates that the sequence point is the last of the 68 sequence points.
Further, the method for training the Seq2Seq VAE model in step S2 specifically includes:
A1, inputting the forward-order sequence and the reverse-order sequence of the structural features of the face in a real face image into an encoder to obtain a forward feature vector h→ and a reverse feature vector h←, and concatenating the two vectors into a final feature vector h;
A2, mapping the final feature vector h into a mean vector μ and a standard deviation vector σ through two fully connected networks, respectively, and sampling a random vector z that follows the normal distribution parameterized by the mean vector μ and the standard deviation vector σ;
a3, inputting a random vector z into a decoder to obtain a preliminary training Seq2Seq VAE network;
a4, inputting a plurality of real face images into a Seq2Seq VAE network trained at the previous time, and repeating the steps A1-A3 until the Seq2Seq VAE network converges to obtain a trained Seq2Seq VAE model.
Further, the encoder in step a1 includes a bidirectional LSTM network module, where the bidirectional LSTM network module includes two LSTM networks with 68 layers.
Further, the step A1 is specifically:
the forward-order sequence (S_0, S_1, ..., S_67) of the structural features of the face in a real face image is input, point by point, into one LSTM network to obtain the forward feature vector h→; simultaneously, the reverse-order sequence (S_67, S_66, ..., S_0) of the structural features of the face in the real face image is input into the other LSTM network to obtain the reverse feature vector h←; the forward feature vector h→ and the reverse feature vector h← are concatenated to form the final feature vector h;
wherein S_i denotes the i-th sequence point, i = 0, 1, 2, ..., 67.
Further, in the step A2:
the random vector z is:
z = μ + σ ⊙ N(0, 1)
wherein ⊙ represents element-wise (Hadamard) multiplication;
N(0, 1) is an IID Gaussian vector.
Further, in the step A3, the decoder is an LSTM network with a time length of 68;
the input elements of the LSTM network at each moment further comprise the vector T_t obtained at the previous moment and the source point S_t;
at each moment the LSTM network outputs a vector O_t, and the output vector O_t at the current moment t is sampled through a Gaussian mixture model to obtain the vector T_t, which is input to the LSTM network at the next moment;
wherein t denotes the moment, t = 0, 1, 2, ..., 67;
the vector T_0 and the source point S_0 input to the LSTM network at the initial moment are both initialized to (0, 0, 1, 0, 0).
Further, the output vector O_t at the current moment t is sampled through the Gaussian mixture model to obtain the vector T_t by the following method:
B1, determining the number N of normal distributions in the Gaussian mixture model, setting the dimension of the output vector O_t to 6N, and decomposing O_t as:
O_t = {(w_n, μ_(x,n), μ_(y,n), σ_(x,n), σ_(y,n), ρ_(xy,n))}, n = 1, 2, ..., N
wherein n denotes the n-th Gaussian component of the mixture model;
x denotes the abscissa;
y denotes the ordinate;
w_n denotes the weight of the n-th Gaussian component, with Σ_{n=1}^{N} w_n = 1;
μ_(x,n) denotes the expectation of the abscissa x;
μ_(y,n) denotes the expectation of the ordinate y;
σ_(x,n) denotes the standard deviation of the abscissa x;
σ_(y,n) denotes the standard deviation of the ordinate y;
ρ_(xy,n) denotes the correlation coefficient;
B2, inputting the decomposed O_t into the Gaussian mixture model to obtain the probability p(x, y; t) with which T_t is sampled;
wherein the probability p(x, y; t) is:
p(x, y; t) = Σ_{n=1}^{N} w(n, t) · N(x, y | μ(x, n, t), μ(y, n, t), σ(x, n, t), σ(y, n, t), ρ(xy, n, t))
wherein w(n, t) denotes the weight of the n-th Gaussian component at moment t;
N(x, y) denotes that the coordinates (x, y) follow a bivariate normal distribution with parameters μ, σ and ρ;
μ(x, n, t) denotes the expectation of the abscissa of the n-th Gaussian component at moment t;
μ(y, n, t) denotes the expectation of the ordinate of the n-th Gaussian component at moment t;
σ(x, n, t) denotes the standard deviation of the abscissa of the n-th Gaussian component at moment t;
σ(y, n, t) denotes the standard deviation of the ordinate of the n-th Gaussian component at moment t;
ρ(xy, n, t) denotes the correlation coefficient;
B3, substituting the probability p(x, y; t) into the reconstruction error function to obtain the reconstruction error, and maximizing it so that the Gaussian mixture model outputs the target vector T_t;
wherein the reconstruction error function is:
L_R = (1/68) Σ_{t=0}^{67} log p(x_t, y_t; t)
wherein L_R is the reconstruction error;
(x_t, y_t) are the horizontal and vertical coordinates of the feature point at moment t.
The invention has the following beneficial effects: the neural-network-based portrait cartoon generation method provided by the invention creatively proposes storing the face structural features as sequence features, so that a Seq2Seq VAE model can be used to generate an exaggerated sequence and thus be applied to cartoon generation; the limitations of existing image translation methods are overcome, and the generated exaggerated portrait cartoon is humorously exaggerated without losing the recognizability of the subject, while also reflecting the drawing styles of different cartoon artists.
Drawings
Fig. 1 is a schematic diagram of the kinds of portrait paintings in the background art of the present invention.
Fig. 2 is a flowchart of an implementation of a neural network-based portrait caricature generation method according to the present invention.
Fig. 3 is a schematic representation diagram illustrating conversion of a face structure feature into a sequence feature by using a face alignment technique according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating an implementation of the Seq2Seq VAE model training method according to the present invention.
FIG. 5 is a schematic diagram comparing a complete target L with several variations thereof in an embodiment of the present invention.
Fig. 6 is a comparison result of different real face images in the embodiment provided by the present invention.
Fig. 7 is a schematic diagram showing a comparison of features between an input face and a corresponding "common face" in an embodiment provided by the present invention.
Fig. 8 is a diagram illustrating a partially exaggerated result of an original image according to an embodiment of the present invention.
FIG. 9 is a graph illustrating the comparison of the results of using different artistic styles on an exaggerated face according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the present invention by those skilled in the art, but it should be understood that the present invention is not limited to the scope of the embodiments; to those skilled in the art, various changes are apparent as long as they do not depart from the spirit and scope of the invention as defined in the appended claims, and everything produced using the inventive concept of the present invention is protected.
As shown in fig. 2, a portrait caricature generation method based on a neural network includes the following steps:
s1, extracting the structural features of the face in the real face image, and converting the extracted structural feature data into sequence feature data;
s2, inputting the sequence feature data into the trained Seq2Seq VAE model to generate an exaggerated structure sequence point corresponding to the face image;
s3, applying the generated exaggerated structure sequence points to the real face image by utilizing a thin plate spline interpolation technology to realize the exaggerated deformation of the real face image;
When the thin-plate spline interpolation technique is adopted, a Cartesian coordinate system is established on a thin plate, and the independent variable x and the function value y are points distributed on this coordinate system. After bending deformation, the plate passes through all the corresponding y-value points while minimizing the bending energy. This interpolation function is defined as f; its specific form is the standard thin-plate spline interpolant
f(P) = a_0 + a_x·x + a_y·y + Σ_{i=1}^{n} w_i·U(‖P − P_i‖), with U(r) = r²·log r,
where P = (x, y), the P_i are the control points (here the facial sequence points), and the coefficients a_0, a_x, a_y and w_i are determined by the interpolation conditions.
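As an illustrative sketch only (the patent itself provides no code), the thin-plate spline warp of step S3 can be realized with a radial basis interpolator; the use of SciPy's Rbf with function='thin_plate', and the names image, src_pts and dst_pts, are assumptions of this example rather than details fixed by the patent.

# Hedged sketch: thin-plate spline warp driven by the 68 control points.
# src_pts are assumed to be the original landmarks and dst_pts the exaggerated
# ones produced by the Seq2Seq VAE model.
import numpy as np
from scipy.interpolate import Rbf
from scipy.ndimage import map_coordinates

def tps_warp(image, src_pts, dst_pts):
    """Warp image so that src_pts move onto dst_pts (backward mapping)."""
    h, w = image.shape[:2]
    # Backward mapping: for every output pixel, find where to sample the input.
    fx = Rbf(dst_pts[:, 0], dst_pts[:, 1], src_pts[:, 0], function='thin_plate')
    fy = Rbf(dst_pts[:, 0], dst_pts[:, 1], src_pts[:, 1], function='thin_plate')
    grid_y, grid_x = np.mgrid[0:h, 0:w]
    map_x = fx(grid_x.ravel(), grid_y.ravel()).reshape(h, w)
    map_y = fy(grid_x.ravel(), grid_y.ravel()).reshape(h, w)
    # Bilinear resampling of each channel at the warped coordinates.
    warped = np.stack([
        map_coordinates(image[..., c], [map_y, map_x], order=1)
        for c in range(image.shape[2])
    ], axis=-1)
    return warped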
and S4, applying the cartoon style to the face image after the exaggerated deformation by using the CycleGAN technology to generate the portrait cartoon.
CycleGAN can transform images from a source domain X to a target domain Y by learning, without paired data. The goal is to learn a mapping G: X → Y such that, under an adversarial loss, the distribution of the images G(X) approaches the distribution Y. Since this mapping is highly under-constrained, it is coupled with an inverse mapping F: Y → X, and a cycle-consistency loss is introduced to enforce F(G(X)) ≈ X.
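As a further hedged illustration rather than the patent's own implementation, the adversarial and cycle-consistency terms described above can be sketched in PyTorch as follows; the module names G, F, D_X, D_Y, the least-squares adversarial form and the weight lambda_cyc are assumptions of this sketch.

# Hedged sketch of the generator-side CycleGAN objective: adversarial losses for
# G: X -> Y and F: Y -> X plus the cycle constraints F(G(x)) ~ x and G(F(y)) ~ y.
import torch
import torch.nn.functional as F_nn

def cycle_gan_generator_loss(G, F, D_X, D_Y, real_x, real_y, lambda_cyc=10.0):
    fake_y = G(real_x)                      # exaggerated face -> cartoon style
    fake_x = F(real_y)                      # cartoon -> photo domain
    # Least-squares adversarial losses (one common CycleGAN choice).
    adv_G = F_nn.mse_loss(D_Y(fake_y), torch.ones_like(D_Y(fake_y)))
    adv_F = F_nn.mse_loss(D_X(fake_x), torch.ones_like(D_X(fake_x)))
    # Cycle consistency: F(G(x)) should reconstruct x, and G(F(y)) should reconstruct y.
    cyc = F_nn.l1_loss(F(fake_y), real_x) + F_nn.l1_loss(G(fake_x), real_y)
    return adv_G + adv_F + lambda_cyc * cyc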
In the step S1, a face alignment technique is used to extract the face contour and facial feature points of all real faces in the MMC data set and of the faces in the corresponding exaggerated portrait caricatures, wherein the structural features of the face include the contour structural features and the five-sense-organ features; 68 sequence points are extracted to represent the structural features of the face, and each sequence point is represented by absolute coordinates (x', y');
the MMC data set is a Harvard cartoon data set, and comprises a large number of collected real face images and corresponding exaggeration portrait cartoon face data.
Therefore, the step S1 is specifically:
extracting 68 sequence points of a real face image to serve as structural feature data of a face, and obtaining an offset coordinate sequence of each sequence point relative to a previous sequence point according to an absolute coordinate value of each sequence point, wherein the offset coordinate sequence is sequence feature data;
wherein, in order to distinguish the facial organs, each sequence point is a sequence point to which a state value is added, represented as Q(x, y, p1, p2, p3); for a conventional rectangular image, taking the lower-left corner of the image as the origin of the coordinate system, the extracted sequence points can be regarded as points distributed in the first quadrant;
wherein x and y represent the offset distances of the sequence point relative to the previous sequence point in the x and y directions;
(p1, p2, p3) is a binary one-hot vector representing three facial states: p1 indicates that the sequence point is the start of the facial contour or of one of the five sense organs, p2 indicates that the sequence point belongs to the same organ as the previous sequence point, and p3 indicates that the sequence point is the last of the 68 sequence points. By means of the state values, the facial structure can be divided into five parts, namely the contour, the eyebrows, the eyes, the nose and the mouth. Fig. 3 shows several samples from the MMC data set and the character sketches drawn from the 68 sequence points obtained by the face alignment method.
Since the faces of the real face images in the MMC data set do not completely match the corresponding portrait caricatures, in particular in the proportion of the face within the whole picture and in the angle of a person's profile, in order to reduce errors caused by data extraction, before the offset coordinate sequence is obtained, the first point S'_0 = (x'_0, y'_0) and the last point S'_16 = (x'_16, y'_16) of the face contour of the real face image are taken as reference points, and the extracted 68 sequence points are corrected in turn by rotating and scaling the corresponding caricature until it is aligned with the 1st and 16th points of the real face; the offset coordinate sequence of each sequence point relative to the previous sequence point is then obtained from the absolute coordinate value of each sequence point.
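A minimal sketch of this landmark-to-sequence conversion is given below; the organ boundaries assume the common 68-point (dlib/iBUG) landmark ordering, and the function and constant names are illustrative only.

# Hedged sketch: convert 68 absolute landmarks into the offset sequence
# Q(x, y, p1, p2, p3) described above.  ORGAN_STARTS assumes the usual 68-point
# ordering (contour, brows, nose, eyes, mouth) and is not fixed by the patent.
import numpy as np

ORGAN_STARTS = {0, 17, 22, 27, 36, 42, 48}

def landmarks_to_sequence(points):
    """points: (68, 2) array of absolute (x', y') landmark coordinates."""
    seq = []
    prev = points[0]
    for i, (x_abs, y_abs) in enumerate(points):
        dx, dy = x_abs - prev[0], y_abs - prev[1]   # offset from the previous point
        if i == 67:
            state = (0, 0, 1)          # p3: last of the 68 points
        elif i in ORGAN_STARTS:
            state = (1, 0, 0)          # p1: start of the contour or of an organ
        else:
            state = (0, 1, 0)          # p2: same organ as the previous point
        seq.append((dx, dy) + state)
        prev = (x_abs, y_abs)
    return np.asarray(seq, dtype=np.float32)   # shape (68, 5)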
In step S2, the Seq2Seq VAE network is trained using the paired exaggerated sequence data, and when the network converges (i.e. the corresponding exaggerated sequence points of the cartoon can be reconstructed from the sequence feature data of the real face), the network model is used to generate the exaggerated sequence points of the cartoon for the tested real face feature sequence data;
as shown in fig. 4, the method for training the Seq2Seq VAE model in step S2 specifically includes:
A1, inputting the forward-order sequence and the reverse-order sequence of the structural features of the face in a real face image into an encoder to obtain a forward feature vector h→ and a reverse feature vector h←, and concatenating the two vectors into a final feature vector h;
the encoder comprises a bidirectional LSTM network module, wherein the bidirectional LSTM network module comprises two LSTM networks with 68 layers;
Therefore, the step A1 specifically includes:
the forward-order sequence (S_0, S_1, ..., S_67) of the structural features of the face in a real face image is input, point by point, into one LSTM network to obtain the forward feature vector h→; simultaneously, the reverse-order sequence (S_67, S_66, ..., S_0) of the structural features of the face in the real face image is input into the other LSTM network to obtain the reverse feature vector h←; the forward feature vector h→ and the reverse feature vector h← are concatenated to form the final feature vector h;
wherein S_i denotes the i-th sequence point, i = 0, 1, 2, ..., 67.
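Step A1 can be sketched as a bidirectional LSTM in PyTorch; the hidden size of 256 matches the 256-dimensional forward and reverse feature vectors used in the experiments below, while the class and variable names are assumptions of this sketch.

# Hedged sketch of the step-A1 encoder: one LSTM reads the 68-point sequence
# forwards and one reads it backwards; their final hidden states are concatenated into h.
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    def __init__(self, input_dim=5, hidden_dim=256):
        super().__init__()
        # bidirectional=True runs one LSTM over the forward sequence and one
        # over the reversed sequence, as described for the two 68-step LSTMs.
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, seq):               # seq: (batch, 68, 5)
        _, (h_n, _) = self.lstm(seq)      # h_n: (2, batch, hidden_dim)
        h_fwd, h_bwd = h_n[0], h_n[1]     # forward and reverse feature vectors
        return torch.cat([h_fwd, h_bwd], dim=-1)   # final feature vector h (512-d)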
A2, mapping the final feature vector h into a mean vector μ and a standard deviation vector σ, and sampling a random vector z that follows the normal distribution parameterized by the mean vector μ and the standard deviation vector σ;
wherein the random vector z is:
z = μ + σ ⊙ N(0, 1)
wherein ⊙ represents element-wise (Hadamard) multiplication;
N(0, 1) is an IID Gaussian vector.
There is a divergence loss L_KL between the random vector z and the distribution of the IID Gaussian vector N(0, 1):
L_KL = KL(N(μ, σ) || N(0, 1))
wherein KL(·) denotes the KL distance;
N(μ, σ) denotes the normal distribution with mean μ and standard deviation σ;
KL(A || B) denotes the KL distance between distribution A and distribution B;
once the random vector z is determined, the divergence loss L_KL can be calculated; this divergence loss is automatically back-propagated to train the LSTM network structure, so that the difference between the z obtained for subsequent inputs and the distribution of the Gaussian vector N(0, 1) becomes smaller and smaller.
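The reparameterization and KL loss of step A2 can be sketched as follows; predicting log σ rather than σ, and the layer and class names, are assumptions made for this illustration.

# Hedged sketch of step A2: map h to (mu, sigma) with two fully connected layers,
# sample z = mu + sigma ⊙ N(0, 1), and compute the KL loss against N(0, 1).
import torch
import torch.nn as nn

class LatentSampler(nn.Module):
    def __init__(self, h_dim=512, z_dim=128):
        super().__init__()
        self.fc_mu = nn.Linear(h_dim, z_dim)
        self.fc_logsigma = nn.Linear(h_dim, z_dim)   # predict log(sigma) for stability

    def forward(self, h):
        mu = self.fc_mu(h)
        sigma = torch.exp(self.fc_logsigma(h))
        z = mu + sigma * torch.randn_like(sigma)     # z = mu + sigma * N(0, 1)
        # KL(N(mu, sigma^2) || N(0, 1)), averaged over the batch and dimensions.
        kl = -0.5 * torch.mean(1 + 2 * torch.log(sigma) - mu ** 2 - sigma ** 2)
        return z, kl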
A3, inputting a random vector z into a decoder to obtain a preliminary training Seq2Seq VAE network;
wherein, the decoder is an LSTM network with the time length of 68;
the input elements of the LSTM network at each moment further comprise the vector T_t obtained at the previous moment and the source point S_t;
at each moment the LSTM network outputs a vector O_t, and the output vector O_t at the current moment t is sampled through a Gaussian mixture model to obtain the vector T_t, which is input to the LSTM network at the next moment;
wherein t denotes the moment, t = 0, 1, 2, ..., 67;
the vector T_0 and the source point S_0 input to the LSTM network at the initial moment are both initialized to (0, 0, 1, 0, 0);
In the above process, the LSTM network outputs O_t at each moment; O_t cannot be directly input into the LSTM network at the next moment, so O_t must first be decomposed into the parameters needed by the Gaussian mixture model, from which T_t is then obtained. A bivariate normal distribution is determined by five elements (μ_x, μ_y, σ_x, σ_y, ρ_xy), where μ_x and μ_y denote the means, σ_x and σ_y denote the standard deviations, and ρ_xy denotes the correlation coefficient. For each bivariate normal distribution there is also a weight w; therefore, a GMM with N normal distributions requires (5+1)·N parameters. For each sequence point of the face, the state values (p1, p2, p3) are fixed and therefore do not need to be generated.
Wherein, the output vector O_t at the current moment t is sampled through the Gaussian mixture model to obtain the vector T_t by the following method:
B1, determining the number N of normal distributions in the Gaussian mixture model, setting the dimension of the output vector O_t to 6N, and decomposing O_t as:
O_t = {(w_n, μ_(x,n), μ_(y,n), σ_(x,n), σ_(y,n), ρ_(xy,n))}, n = 1, 2, ..., N
wherein n denotes the n-th Gaussian component of the mixture model;
x denotes the abscissa;
y denotes the ordinate;
w_n denotes the weight of the n-th Gaussian component, with Σ_{n=1}^{N} w_n = 1;
μ_(x,n) denotes the expectation of the abscissa x;
μ_(y,n) denotes the expectation of the ordinate y;
σ_(x,n) denotes the standard deviation of the abscissa x;
σ_(y,n) denotes the standard deviation of the ordinate y;
ρ_(xy,n) denotes the correlation coefficient;
B2, inputting the decomposed O_t into the Gaussian mixture model to obtain the probability p(x, y; t) with which T_t is sampled;
wherein the probability p(x, y; t) is:
p(x, y; t) = Σ_{n=1}^{N} w(n, t) · N(x, y | μ(x, n, t), μ(y, n, t), σ(x, n, t), σ(y, n, t), ρ(xy, n, t))
wherein w(n, t) denotes the weight of the n-th Gaussian component at moment t;
N(x, y) denotes that the coordinates (x, y) follow a bivariate normal distribution with parameters μ, σ and ρ;
μ(x, n, t) denotes the expectation of the abscissa of the n-th Gaussian component at moment t;
μ(y, n, t) denotes the expectation of the ordinate of the n-th Gaussian component at moment t;
σ(x, n, t) denotes the standard deviation of the abscissa of the n-th Gaussian component at moment t;
σ(y, n, t) denotes the standard deviation of the ordinate of the n-th Gaussian component at moment t;
ρ(xy, n, t) denotes the correlation coefficient;
B3, substituting the probability p(x, y; t) into the reconstruction error function to obtain the reconstruction error, and maximizing it so that the Gaussian mixture model outputs the target vector T_t;
wherein the reconstruction error function is:
L_R = (1/68) Σ_{t=0}^{67} log p(x_t, y_t; t)
wherein L_R is the reconstruction error;
(x_t, y_t) are the horizontal and vertical coordinates of the feature point at moment t.
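Steps B1-B3 can be sketched as follows; the ordering of the 6N outputs and the softmax/exp/tanh squashing used to keep the mixture parameters valid are assumptions of this illustration, not details fixed by the patent.

# Hedged sketch of B1-B3: split O_t into N bivariate-normal components,
# evaluate p(x, y; t) and form the reconstruction objective L_R.
import math
import torch
import torch.nn.functional as F_nn

def gmm_params(o_t):
    # o_t: (batch, 6*N); the layout [w | mu_x | mu_y | sigma_x | sigma_y | rho] is assumed.
    w, mu_x, mu_y, sigma_x, sigma_y, rho = torch.chunk(o_t, 6, dim=-1)
    w = F_nn.softmax(w, dim=-1)                                  # weights sum to 1
    sigma_x, sigma_y = torch.exp(sigma_x), torch.exp(sigma_y)    # keep > 0
    rho = torch.tanh(rho)                                        # correlation in (-1, 1)
    return w, mu_x, mu_y, sigma_x, sigma_y, rho

def mixture_log_prob(x, y, params):
    """log p(x, y; t) under the bivariate-normal mixture."""
    w, mu_x, mu_y, sigma_x, sigma_y, rho = params
    zx = (x.unsqueeze(-1) - mu_x) / sigma_x
    zy = (y.unsqueeze(-1) - mu_y) / sigma_y
    z = zx ** 2 + zy ** 2 - 2 * rho * zx * zy
    one_minus_rho2 = 1 - rho ** 2
    log_n = (-z / (2 * one_minus_rho2)
             - torch.log(2 * math.pi * sigma_x * sigma_y * torch.sqrt(one_minus_rho2)))
    return torch.logsumexp(torch.log(w) + log_n, dim=-1)         # (batch,)

def reconstruction_loss(xy_true, o_seq):
    # xy_true: (batch, 68, 2) target offsets; o_seq: (batch, 68, 6*N) decoder outputs.
    log_p = torch.stack([
        mixture_log_prob(xy_true[:, t, 0], xy_true[:, t, 1], gmm_params(o_seq[:, t]))
        for t in range(xy_true.shape[1])
    ], dim=1)
    return -log_p.mean()        # minimizing this maximizes the log-likelihood L_R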
A4, inputting a plurality of real face images into a Seq2Seq VAE network trained at the previous time, and repeating the steps A1-A3 until the Seq2Seq VAE network converges to obtain a trained Seq2Seq VAE model.
There is also a consistency loss L_C in the Seq2Seq VAE network process; the consistency loss uses the log-likelihood of the produced probability distribution to explain the source points S, and L_C is related to maintaining the basic structure of the face;
the consistency loss L_C is:
L_C = (1/68) Σ_{t=0}^{67} log p(x^S_t, y^S_t; t), where (x^S_t, y^S_t) are the coordinates of the source point S_t;
since there is a source point S_t at each step of the LSTM network, the consistency loss produced at each step automatically adjusts the network structure of the decoder through back-propagation, thereby generating the exaggerated structure sequence points.
In one embodiment of the present invention, experimental results of the inventive method on MMC data sets are provided:
1. dividing the images in the MMC data set into 500 training pairs and 47 testing pairs, and adding 100 extra real face images;
2. For the encoder, 256-dimensional feature vectors h→ and h← are extracted from the source data in forward and reverse order, respectively; the 512-dimensional vector h obtained by concatenation is then used as the input of the VAE, and the dimension of the vector z is 128. The GMM (Gaussian mixture model) is set to 20 normal distributions, and the output dimension of the decoder LSTM is set to 120;
3. The importance of the Kullback-Leibler divergence loss L_KL, the reconstruction error L_R and the consistency loss L_C is studied. The complete method is then compared with several variants, after which the effect of the batch size on the generated exaggerated sketch is analyzed and the effectiveness of the system is further improved by local exaggeration. Finally, the exaggerated portrait is transferred to various artistic styles.
Experimental results and analysis:
A. analysis of the loss function:
In Fig. 5, the complete objective L is compared with several variants thereof: one uses only the Kullback-Leibler divergence loss L_KL and the reconstruction error L_R, and the other uses only the reconstruction error L_R and the consistency loss L_C. All Seq2Seq VAE models were trained with a batch size of 64 samples, using the Adam optimization algorithm with a learning rate of 0.01 and gradient clipping at 1.0.
(1) At LKLAnd LRIn the experiment, α ═ 0.8 and β ═ 0 were set;
(2) at LRAnd LCIn the experiment, α ═ 0 and β ═ 2 were set;
(3) at LKL+LR+LCIn the experiment, α -0.5 and β -0.5 were set.
The experimental results show that all three losses play a crucial role in obtaining high-quality results. From the original sketches in the second row and the exaggerated sketches in the third row it can be seen that L_KL makes the generated sketch more exaggerated, while the preservation of the original facial structure is mainly achieved by minimizing the L_C loss. The complete model simultaneously minimizes L_KL, L_R and L_C so as to maintain the basic structure of the original image while exaggerating the original sketch. The complete Seq2Seq VAE model not only exaggerates the facial features but also preserves the recognizability of the subject. However, the complete model still has some shortcomings: it only slightly exaggerates certain identifying features of the original image and does not reach the ideal degree of exaggeration.
B. Analysis of batch size:
Variation in the batch size may cause the network to oscillate between randomness and determinism; in this Seq2Seq VAE, the most obvious manifestation of randomness versus determinism is whether the generated sketch is distorted or faithfully restored. The comparison of different real face images in Fig. 6 shows that the batch size directly affects the stability of the generated sequence.
As shown in Fig. 6, when the batch size is equal to 16, the degree of randomness in the generated network greatly exceeds its stability, resulting in severe distortion of the generated image. From the simple sketches in the second and fourth rows it is also clear that the deformation of the facial structure is very severe when the batch is small. As the batch size increases, the stability of the network increases accordingly, and the generated sequence is more consistent with the sequence of the source image. For example, when the batch size is equal to 128, the degree of exaggeration of the source image is very mild. In order for the model both to maintain the basic structure of the facial features and to strongly exaggerate their distinctive characteristics when exaggerating the source image, the batch size was set to 64 in the other experiments.
C. Local exaggeration:
In general, artists often exaggerate the distinctive features that distinguish a face from the "common face" when drawing portrait caricatures. Therefore, in the proposed system, the method of the invention adopts a local exaggeration approach. By comparing the proportional distributions of the input face and the "common face", using data sets of the "common faces" of males and females in different countries, the distinctive features of the subject can be obtained; Fig. 7 shows a feature comparison example between an input face and the corresponding "common face".
The influence of the system on caricature generation can be further enhanced by inverting the x-axis and y-axis values of the corresponding local coordinate points and making corresponding changes to exaggerate individual facial organs. One limitation is that the hairstyle, forehead, ears and cheeks of a person cannot be extracted in the face alignment step, so these local features cannot be compared and exaggerated. The experimental results show that the method can reasonably exaggerate the extracted features. Fig. 8 shows local exaggeration results on original images: the first column is the original image; the second column shows the distribution of the structural features, where the blue dots are the structure of the original face and the yellow dots are the structure after local adjustment; the third column shows the deformation result obtained when the local changes are applied to the original face; and the fourth column shows the corresponding target caricature. Although the result does not reach the effect of the target output, a certain exaggerated, humorous effect is obtained.
The final result is:
the generated exaggeration portrait cartoon not only has humorous exaggeration without damaging the identification degree of the role, but also is reflected in the painting styles of different cartoon artists; different styles, such as cartoon style, oil painting style, sketch style, cartoon style and the like, are trained through the CycleGAN. Fig. 9 shows the result of using different artistic styles on an exaggerated face.
The invention has the following beneficial effects: the neural-network-based portrait cartoon generation method provided by the invention creatively proposes storing the face structural features as sequence features, so that a Seq2Seq VAE model can be used to generate an exaggerated sequence and thus be applied to cartoon generation. The limitations of existing image translation methods are overcome, and the generated exaggerated portrait cartoon is humorously exaggerated without losing the recognizability of the subject, while also reflecting the drawing styles of different cartoon artists.

Claims (2)

1. A portrait cartoon generating method based on a neural network is characterized by comprising the following steps:
s1, extracting the structural features of the face in the real face image, and converting the extracted structural feature data into sequence feature data;
s2, inputting the sequence feature data into the trained Seq2Seq VAE model to generate an exaggerated structure sequence point corresponding to the face image;
s3, applying the generated exaggerated structure sequence points to the real face image by utilizing a thin plate spline interpolation technology to realize the exaggerated deformation of the real face image;
s4, applying the cartoon style to the face image after the exaggerated deformation by using a CycleGAN technology to generate a portrait cartoon;
the structural features of the face in the step S1 include contour structural features and facial features of the face;
the step S1 specifically includes:
extracting 68 sequence points of a real face image to serve as structural feature data of a face, and obtaining an offset coordinate sequence of each sequence point relative to a previous sequence point according to an absolute coordinate value of each sequence point, wherein the offset coordinate sequence is sequence feature data;
wherein each sequence point is a sequence point to which a state value is added, represented as Q(x, y, p1, p2, p3);
wherein x and y represent the offset distances of the sequence point relative to the previous sequence point in the x and y directions;
(p1, p2, p3) is a binary one-hot vector representing three facial states: p1 indicates that the sequence point is the start of the facial contour or of one of the five sense organs, p2 indicates that the sequence point belongs to the same organ as the previous sequence point, and p3 indicates that the sequence point is the last of the 68 sequence points;
the method for training the Seq2Seq VAE model in the step S2 specifically comprises the following steps:
A1, inputting the forward-order sequence and the reverse-order sequence of the structural features of the face in a real face image into an encoder to obtain a forward feature vector h→ and a reverse feature vector h←, and concatenating the two vectors into a final feature vector h;
A2, mapping the final feature vector h into a mean vector μ and a standard deviation vector σ through two fully connected networks, respectively, and sampling a random vector z that follows the normal distribution parameterized by the mean vector μ and the standard deviation vector σ;
a3, inputting a random vector z into a decoder to obtain a preliminary training Seq2Seq VAE network;
a4, sequentially inputting a plurality of real face images into a Seq2Seq VAE network trained at the previous time, and repeating the steps A1-A3 until the Seq2Seq VAE network converges to obtain a trained Seq2Seq VAE model;
the encoder in the step a1 includes a bidirectional LSTM network module, where the bidirectional LSTM network module includes two LSTM networks with 68 layers;
the step A1 specifically includes:
the forward-order sequence (S_0, S_1, ..., S_67) of the structural features of the face in a real face image is input, point by point, into one LSTM network to obtain the forward feature vector h→; simultaneously, the reverse-order sequence (S_67, S_66, ..., S_0) of the structural features of the face in the real face image is input into the other LSTM network to obtain the reverse feature vector h←; the forward feature vector h→ and the reverse feature vector h← are concatenated to form the final feature vector h;
wherein S_i denotes the i-th sequence point, i = 0, 1, 2, ..., 67;
in said step A2:
the random vector z is:
z = μ + σ ⊙ N(0, 1)
wherein ⊙ represents element-wise (Hadamard) multiplication;
N(0, 1) is an IID Gaussian vector;
in step A3, the decoder is an LSTM network with a time length of 68;
the input elements of the LSTM network at each moment further comprise the vector T_t obtained at the previous moment and the source point S_t;
at each moment the LSTM network outputs a vector O_t, and the output vector O_t at the current moment t is sampled through a Gaussian mixture model to obtain the vector T_t, which is input to the LSTM network at the next moment;
wherein t denotes the moment, t = 0, 1, 2, ..., 67;
the vector T_0 and the source point S_0 input to the LSTM network at the initial moment are both initialized to (0, 0, 1, 0, 0).
2. The neural network-based portrait caricature generation method of claim 1, wherein the output vector O_t at the current moment t is sampled through the Gaussian mixture model to obtain the vector T_t by the following method:
B1, determining the number N of normal distributions in the Gaussian mixture model, setting the dimension of the output vector O_t to 6N, and decomposing O_t as:
O_t = {(w_n, μ_(x,n), μ_(y,n), σ_(x,n), σ_(y,n), ρ_(xy,n))}, n = 1, 2, ..., N
wherein n denotes the n-th Gaussian component of the mixture model;
x denotes the abscissa;
y denotes the ordinate;
w_n denotes the weight of the n-th Gaussian component, with Σ_{n=1}^{N} w_n = 1;
μ_(x,n) denotes the expectation of the abscissa x;
μ_(y,n) denotes the expectation of the ordinate y;
σ_(x,n) denotes the standard deviation of the abscissa x;
σ_(y,n) denotes the standard deviation of the ordinate y;
ρ_(xy,n) denotes the correlation coefficient;
B2, inputting the decomposed O_t into the Gaussian mixture model to obtain the probability p(x, y; t) with which T_t is sampled;
wherein the probability p(x, y; t) is:
p(x, y; t) = Σ_{n=1}^{N} w(n, t) · N(x, y | μ(x, n, t), μ(y, n, t), σ(x, n, t), σ(y, n, t), ρ(xy, n, t))
wherein w(n, t) denotes the weight of the n-th Gaussian component at moment t;
N(x, y) denotes that the coordinates (x, y) follow a bivariate normal distribution with parameters μ, σ and ρ;
μ(x, n, t) denotes the expectation of the abscissa of the n-th Gaussian component at moment t;
μ(y, n, t) denotes the expectation of the ordinate of the n-th Gaussian component at moment t;
σ(x, n, t) denotes the standard deviation of the abscissa of the n-th Gaussian component at moment t;
σ(y, n, t) denotes the standard deviation of the ordinate of the n-th Gaussian component at moment t;
ρ(xy, n, t) denotes the correlation coefficient;
B3, substituting the probability p(x, y; t) into the reconstruction error function to obtain the reconstruction error, and maximizing it so that the Gaussian mixture model outputs the target vector T_t;
wherein the reconstruction error function is:
L_R = (1/68) Σ_{t=0}^{67} log p(x_t, y_t; t)
wherein L_R is the reconstruction error;
(x_t, y_t) are the horizontal and vertical coordinates of the feature point at moment t.
CN201811631295.7A 2018-12-29 2018-12-29 Portrait cartoon generating method based on neural network Active CN109741247B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811631295.7A CN109741247B (en) 2018-12-29 2018-12-29 Portrait cartoon generating method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811631295.7A CN109741247B (en) 2018-12-29 2018-12-29 Portrait cartoon generating method based on neural network

Publications (2)

Publication Number Publication Date
CN109741247A CN109741247A (en) 2019-05-10
CN109741247B true CN109741247B (en) 2020-04-21

Family

ID=66362127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811631295.7A Active CN109741247B (en) 2018-12-29 2018-12-29 Portrait cartoon generating method based on neural network

Country Status (1)

Country Link
CN (1) CN109741247B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110197226B (en) * 2019-05-30 2021-02-09 厦门大学 Unsupervised image translation method and system
CN111127309B (en) * 2019-12-12 2023-08-11 杭州格像科技有限公司 Portrait style migration model training method, portrait style migration method and device
CN111161137B (en) * 2019-12-31 2023-04-11 四川大学 Multi-style Chinese painting flower generation method based on neural network
CN111243051B (en) * 2020-01-08 2023-08-18 杭州未名信科科技有限公司 Portrait photo-based simple drawing generation method, system and storage medium
CN111243050B (en) * 2020-01-08 2024-02-27 杭州未名信科科技有限公司 Portrait simple drawing figure generation method and system and painting robot
CN111402394B (en) * 2020-02-13 2022-09-20 清华大学 Three-dimensional exaggerated cartoon face generation method and device
CN111508048B (en) * 2020-05-22 2023-06-20 南京大学 Automatic generation method of interactive arbitrary deformation style face cartoon
CN112241704B (en) * 2020-10-16 2024-05-31 百度(中国)有限公司 Portrait infringement judging method and device, electronic equipment and storage medium
CN112463912A (en) * 2020-11-23 2021-03-09 浙江大学 Raspberry pie and recurrent neural network-based simple stroke identification and generation method
CN112396693B (en) * 2020-11-25 2024-09-13 上海商汤智能科技有限公司 Face information processing method and device, electronic equipment and storage medium
CN112818118B (en) * 2021-01-22 2024-05-21 大连民族大学 Reverse translation-based Chinese humor classification model construction method
CN113158948B (en) * 2021-04-29 2024-08-02 宜宾中星技术智能系统有限公司 Information generation method, device and terminal equipment
CN113743520A (en) * 2021-09-09 2021-12-03 广州梦映动漫网络科技有限公司 Cartoon generation method, system, medium and electronic terminal
CN117291138B (en) * 2023-11-22 2024-02-13 全芯智造技术有限公司 Method, apparatus and medium for generating layout elements

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7486296B2 (en) * 2004-10-18 2009-02-03 Reallusion Inc. Caricature generating system and method
CN103116902A (en) * 2011-11-16 2013-05-22 华为软件技术有限公司 Three-dimensional virtual human head image generation method, and method and device of human head image motion tracking
CN103218842A (en) * 2013-03-12 2013-07-24 西南交通大学 Voice synchronous-drive three-dimensional face mouth shape and face posture animation method
CN104200505A (en) * 2014-08-27 2014-12-10 西安理工大学 Cartoon-type animation generation method for human face video image
CN104463779A (en) * 2014-12-18 2015-03-25 北京奇虎科技有限公司 Portrait caricature generating method and device
CN107730573A (en) * 2017-09-22 2018-02-23 西安交通大学 A kind of personal portrait cartoon style generation method of feature based extraction
CN108596024A (en) * 2018-03-13 2018-09-28 杭州电子科技大学 A kind of illustration generation method based on human face structure information
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1313979C (en) * 2002-05-03 2007-05-02 三星电子株式会社 Apparatus and method for generating 3-D cartoon
KR20070096621A (en) * 2006-03-27 2007-10-02 (주)제이디에프 The system and method for making a caricature using a shadow plate
CN101477696B (en) * 2009-01-09 2011-04-13 苏州华漫信息服务有限公司 Human character cartoon image generating method and apparatus
CN101551911B (en) * 2009-05-07 2011-04-06 上海交通大学 Human face sketch portrait picture automatic generating method
KR20130120175A (en) * 2012-04-25 2013-11-04 양재건 Apparatus, method and computer readable recording medium for generating a caricature automatically

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7486296B2 (en) * 2004-10-18 2009-02-03 Reallusion Inc. Caricature generating system and method
CN103116902A (en) * 2011-11-16 2013-05-22 华为软件技术有限公司 Three-dimensional virtual human head image generation method, and method and device of human head image motion tracking
CN103218842A (en) * 2013-03-12 2013-07-24 西南交通大学 Voice synchronous-drive three-dimensional face mouth shape and face posture animation method
CN104200505A (en) * 2014-08-27 2014-12-10 西安理工大学 Cartoon-type animation generation method for human face video image
CN104463779A (en) * 2014-12-18 2015-03-25 北京奇虎科技有限公司 Portrait caricature generating method and device
CN107730573A (en) * 2017-09-22 2018-02-23 西安交通大学 A kind of personal portrait cartoon style generation method of feature based extraction
CN108596024A (en) * 2018-03-13 2018-09-28 杭州电子科技大学 A kind of illustration generation method based on human face structure information
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"Cartoon-to-Photo Facial Translation with Generative Adversarial Networks";Junhong Huang等;《Proceedings of Machine Learning Research 95》;20181123;第566页到第581页 *
"利用人脸特征及其关系的漫画夸张与合成";陈文娟等;《计算机辅助设计与图形学学报》;20100131;第22卷(第1期);第121页到第128页 *
"利用图像变形生成个性化人脸卡通";邓维等;《计算机工程与应用》;20111231;第47卷(第24期);第132页到第135页 *
"基于图像变形的人体动画和人脸夸张";陈威华;《中国优秀硕士学位论文全文数据库•信息科技辑》;20121115;第2012年卷(第11期);I138-239 *
"漫画风格的人脸肖像生成算法";阎芳等;《计算机辅助设计与图形学学报》;20070430;第19卷(第4期);第442页到第447页 *

Also Published As

Publication number Publication date
CN109741247A (en) 2019-05-10

Similar Documents

Publication Publication Date Title
CN109741247B (en) Portrait cartoon generating method based on neural network
KR102286037B1 (en) Learning data set generating apparatus and method for machine learning
CN112887698B (en) High-quality face voice driving method based on nerve radiation field
Nguyen et al. Lipstick ain't enough: beyond color matching for in-the-wild makeup transfer
CN106548208B (en) A kind of quick, intelligent stylizing method of photograph image
CN109978930A (en) A kind of stylized human face three-dimensional model automatic generation method based on single image
CN112258387A (en) Image conversion system and method for generating cartoon portrait based on face photo
CN101826217A (en) Rapid generation method for facial animation
CN110188667B (en) Face rectification method based on three-party confrontation generation network
Chen et al. Face sketch synthesis with style transfer using pyramid column feature
US20240029345A1 (en) Methods and system for generating 3d virtual objects
CN111950432A (en) Makeup style migration method and system based on regional style consistency
CN111950430A (en) Color texture based multi-scale makeup style difference measurement and migration method and system
CN112883826A (en) Face cartoon generation method based on learning geometry and texture style migration
Macêdo et al. Expression transfer between photographs through multilinear AAM's
CN111563944B (en) Three-dimensional facial expression migration method and system
Jia et al. Face aging with improved invertible conditional GANs
US20220101145A1 (en) Training energy-based variational autoencoders
Lian et al. Anime style transfer with spatially-adaptive normalization
CN111611997B (en) Cartoon customized image motion video generation method based on human body action migration
Huang et al. Patch-based painting style transfer
CN106097373B (en) A kind of smiling face's synthetic method based on branch's formula sparse component analysis model
Bagwari et al. An edge filter based approach of neural style transfer to the image stylization
Do et al. Anime sketch colorization by component-based matching using deep appearance features and graph representation
Park et al. StyleBoost: A Study of Personalizing Text-to-Image Generation in Any Style using DreamBooth

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant