CN113240115B - Training method for generating face change image model and related device


Info

Publication number: CN113240115B
Application number: CN202110636448.2A
Authority: CN (China)
Prior art keywords: text, description, network, training, face
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN113240115A
Inventor: 陈仿雄
Current assignee: Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Original assignee: Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Application filed by Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Publication of application: CN113240115A
Publication of grant: CN113240115B

Classifications

    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06N3/045: Neural network architectures; combinations of networks
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V40/168: Recognition of human faces; feature extraction; face representation
    • Y02T10/40: Engine management systems


Abstract

The embodiment of the invention relates to the technical field of machine learning and discloses a training method and a related device for generating a face change image model. First, the text descriptions in the training set are encoded into text feature codes with a preset text encoding network; each text feature code is fused with the random latent code of the corresponding face image into a fusion latent code, which is input into an adversarial generation network to generate a training change map. Then, to constrain the accuracy of the text feature codes, a preset image description network is constructed to output predicted text descriptions, and the parameters of the preset image description network and of the preset text encoding network are adjusted in reverse according to each predicted text description and each text description, so that the predicted text descriptions continuously approach the text descriptions, iterating until the preset image description network and the preset text encoding network converge and the face change image model is obtained. The trained face change image model can therefore modify face features under personalized control, according to text descriptions that reflect the user's wishes.

Description

Training method for generating face change image model and related device
Technical Field
The embodiment of the invention relates to the technical field of machine learning, in particular to a training method for generating a face change image model and a related device.
Background
With the rise of photography and short video, users have ever higher requirements for the quality of face photographs and hope to control face features in a personalized way, that is, to modify and adjust face features according to their own wishes during shooting, increasing the fun and interactivity of the shooting process.
At present, smart devices only offer functions such as one-key beautification or adding animated scenes during shooting, and face features cannot be controlled in a personalized way.
Disclosure of Invention
The technical problem to be solved by the embodiments of the invention is to provide a training method and a related device for generating a face change image model, so that the trained generated face change image model can be controlled and modified in a personalized way according to text descriptions reflecting the user's wishes, and the generated face image conforms to the user's modification wishes.
In order to solve the above technical problem, in a first aspect, an embodiment of the present invention provides a training method for generating a face change image model, where the face change image model includes a text encoding network, a fusion module, and an adversarial generation network, and the method includes:
obtaining a training set, where the training set comprises several groups of training data, each group comprising a random latent code and a text description, the random latent code being a vector used to generate a face image;
performing feature encoding on each text description with the text encoding network to obtain each text feature code, where the text feature codes reflect the semantic features of the text descriptions;
inputting each text feature code and the random latent code corresponding to it into the fusion module for fusion to obtain each fusion latent code;
inputting each fusion latent code into the adversarial generation network to generate each training change map, where a training change map is a face image generated by the adversarial generation network based on one fusion latent code, and the face attributes in a training change map match one text description;
taking each training change map and its corresponding text description as a sample pair and inputting it into a preset image description network to obtain the predicted text description corresponding to each training change map;
calculating the error sum between the text descriptions and the predicted text descriptions according to a preset loss function;
and reversely adjusting the model parameters of the text encoding network and the preset image description network according to the error sum, and returning to the step of encoding each text description with the text encoding network to obtain each text feature code, until the preset image description network and the preset text encoding network converge.
In some embodiments, the text encoding network includes a sequential forgetting encoding module and a first recurrent neural network module, and the step of encoding each text description with the text encoding network to obtain each text feature code includes:
inputting each text description into the sequential forgetting encoding module for encoding to obtain each text vector, where the length of each text vector is fixed;
and inputting each text vector into the first recurrent neural network module for context association to obtain each text feature code.
In some embodiments, the step of inputting each text feature code and its corresponding random latent code into the fusion module for fusion to obtain each fusion latent code includes:
inputting a target text feature code and the random latent code corresponding to it into the fusion module for nonlinear calculation to obtain the fusion latent code corresponding to the target text feature code, where the target text feature code is any one of the text feature codes.
In some embodiments, the preset image description network includes a feature extraction module and a second recurrent neural network module, and the step of taking each training change map and its corresponding text description as a sample pair and inputting it into the preset image description network to obtain the predicted text description corresponding to each training change map includes:
inputting the training change map in a target sample pair into the feature extraction module for feature extraction to obtain the feature vector corresponding to the target sample pair, where the target sample pair is any one of the sample pairs;
inputting the feature vector and the text description in the target sample pair into the second recurrent neural network module, decoding them through the second recurrent neural network module, and outputting the predicted text description corresponding to the training change map in the target sample pair.
In some embodiments, the step of calculating the error sum between each text description and each predicted text description according to a preset loss function includes:
calculating the error sum between each text description and each predicted text description according to the following formula:
L(θ) = -Σ_{i=1}^{N} θ*_i + μ‖θ‖₂²
where N is the number of sample pairs, L(θ) is the negative of the sum of the maximum probabilities, ‖θ‖₂² is an L2 regularization term, μ is a weight value, and θ*_i is the error between the text description and the predicted text description of the i-th sample pair;
where the error between the text description and the predicted text description is the maximum probability that the predicted text description is the text description, calculated according to the following formula:
θ* = arg max_θ Σ_{(I, y)} log p(y | I; θ)
where θ is the model parameter, I is the training change map, y is the text description, θ* is the maximum probability that the predicted text description is the text description under model parameter θ, and p(y | I; θ) denotes the probability that the predicted text description output for the training change map I by the image description network to be trained under model parameter θ is the text description y.
In some embodiments, the second recurrent neural network module is a long short-term memory neural network, and the step of calculating log p(y | I; θ) includes:
calculating the joint probability of each word in the text description as the probability that the predicted text description output for the training change map I by the image description network to be trained under model parameter θ is the text description y, according to the following formulas:
log p(y | I; θ) = Σ_{t=1}^{T} log p(y_t | y_1, y_2, ..., y_{t-1}; I; θ)
log p(y_t | y_1, y_2, ..., y_{t-1}; I; θ) = f(h_t, c_t)
h_t = LSTM(x_t, h_{t-1}, m_{t-1})
where y_t is the word generated at time t following the previous word y_{t-1}, f is a nonlinear function that outputs the probability of y_t, c_t is the visual context vector extracted from the training change map at time t, h_t is the state of the long short-term memory layer at time t, x_t is the feature vector, and m_{t-1} is the memory cell at time t-1.
In order to solve the above technical problem, in a second aspect, an embodiment of the present invention provides a method for generating a face change image, including:
acquiring a test text description and a test latent code, where the test latent code is a vector generated from a test face image and used to reflect the face features of the test face;
inputting the test text description and the test latent code into the generated face change image model according to the first aspect, so as to encode the test text description through the text encoding network in the generated face change image model and generate a test text feature code; inputting the test text feature code and the test latent code into the fusion module in the generated face change image model for fusion, so as to output a test fusion latent code; and inputting the test fusion latent code into the adversarial generation network in the generated face change image model and outputting a test face change image, where the face in the test face change image and the face in the test face image both reflect the same target face, and the face attributes in the test face change image match the test text description.
In some embodiments, the step of acquiring the text description includes:
acquiring voice information;
and acquiring text information corresponding to the voice information by adopting a voice recognition algorithm, and taking the text information as the text description.
In order to solve the above technical problem, in a third aspect, an embodiment of the present invention provides an electronic device, including:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above in the first aspect.
To solve the above technical problem, in a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium storing computer executable instructions for causing an electronic device to perform the method according to the above first aspect.
The embodiments of the invention have the following beneficial effects. Compared with the prior art, in the training method and related device for generating a face change image model provided by the embodiments of the invention, the face change image model comprises a text encoding network, a fusion module, and an adversarial generation network. First, the text descriptions in the training set are encoded with the preset text encoding network, converting them into text feature codes that the adversarial generation network can recognize; each text feature code is then fused with the random latent code of the corresponding face image to form a fusion latent code, which is input into the adversarial generation network to generate a training change map. That is, the training change map is generated after the face features of the face image are changed according to the text feature code corresponding to the text description, so the change direction of the face image is controlled by the text feature code. Second, to constrain the accuracy of the text feature codes, improve the accuracy of the trained preset text encoding network, and make the changed map conform to the text description, a preset image description network is constructed; each training change map and its corresponding text description serve as a sample pair for training the preset image description network, so that the trained image description network can perform image description and output predicted text descriptions. The error sum between the text descriptions and the predicted text descriptions is then calculated with a preset loss function, and the parameters of the preset image description network and of the preset text encoding network are adjusted in reverse based on this error sum, so that the predicted text descriptions continuously approach the text descriptions; iteration continues until the preset image description network and the preset text encoding network converge, and the face change image model is obtained. In other words, the preset image description network judges whether the training change map generated by the adversarial generation network from the fusion latent code conforms to the text description, and thereby how accurate the preset text encoding network is, constraining the parameters of the preset text encoding network to keep adjusting in the direction of higher accuracy. The text feature code corresponding to a text description can then accurately control the change direction of the face image, so the trained face change image model can be controlled and modified in a personalized way according to text descriptions reflecting the user's wishes, and the generated changed face image conforms to the user's wishes.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures of the drawings are not to be taken in a limiting sense, unless otherwise indicated.
Fig. 1 is a schematic diagram of an application scenario of a generated face change image model obtained by training by using a training method for generating a face change image model in an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 3 is a flowchart of a training method for generating a face change image model according to an embodiment of the present application;
fig. 4 is a flowchart of a method for generating a face change image according to an embodiment of the present application.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that, if not conflicting, the various features of the embodiments of the present invention may be combined with each other, which are all within the protection scope of the present application. In addition, while functional block division is performed in a device diagram and logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. Moreover, the words "first," "second," "third," and the like as used herein do not limit the data and order of execution, but merely distinguish between identical or similar items that have substantially the same function and effect.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items.
In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The training method for generating a face change image model in the embodiments of the present application can be used in various applications such as photographing, video recording, or photo retouching, where the original face features are changed in a personalized way according to the user's wishes.
In an optional scenario, a user taking a selfie wants to change face features through instruction information to modify the selfie. As shown in fig. 1, the instruction information may be voice information or text information. For example, while taking a selfie the user issues the voice information "big eyes and long hair"; after the shooting device (a smartphone in the example of fig. 1) acquires the voice information, it converts it into text information and provides the text information to the generated face change image model obtained through the training method of this application, together with a latent code reflecting the feature vector of the user's original face image. Through the generated face change image model, the user's original face image is modified according to the text information ("big eyes and long hair"), and a modified selfie is obtained.
In another optional scenario, a user hopes to achieve intelligent retouching while editing photos. Specifically, an intelligent retouching application (app) is provided, and the face image to be retouched and a text description are input into the app. First, a latent-code module in the app obtains the latent code of the image to be retouched (i.e., the feature vector of the image to be retouched); the latent code and the text description are then input into the trained generated face change image model inside the app, which modifies the face image to be retouched according to the indications in the text information and outputs a corresponding modified face change image conforming to the features indicated in the text information.
The training method for generating the face change image model and the method for generating the face change image in the embodiment of the application can be applied to electronic equipment such as terminal equipment, a computer system and a server, and can be operated together with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with the terminal device, computer system, server, or other electronic device include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputers systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.
An embodiment of the present application provides an electronic device. Please refer to fig. 2, which is a schematic diagram of the hardware structure of the electronic device provided in the embodiment of the present application. Specifically, as shown in fig. 2, the electronic device 10 includes at least one processor 11 and a memory 12 (in fig. 2, one processor connected via a bus is taken as an example).
The processor 11 is configured to provide computing and control capabilities to control the electronic device 10 to perform corresponding tasks, for example, control the electronic device 10 to perform any one of the training methods for generating a face change image model or any one of the methods for generating a face change image provided in the following embodiments.
It is understood that the processor 11 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processing, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
The memory 12 is used as a non-transitory computer readable storage medium, and may be used to store a non-transitory software program, a non-transitory computer executable program, and a module, such as a program instruction/module corresponding to a training method for generating a face change image model in an embodiment of the present invention, or a program instruction/module corresponding to a method for generating a face change image in an embodiment of the present invention. The processor 11 may implement the training method for generating the face change image model in any of the method embodiments described below, and may implement the method for generating the face change image in any of the method embodiments described below, by executing the non-transitory software programs, instructions, and modules stored in the memory 12. In particular, the memory 12 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 12 may also include memory located remotely from the processor, which may be connected to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The training method for generating a face change image model provided by the embodiments of the present application is described in detail below. The face change image model includes a text encoding network, a fusion module, and an adversarial generation network: the text encoding network converts text into codes, digitizing the text data; the fusion module operates on at least two codes so as to fuse them; and the adversarial generation network generates images.
Referring to fig. 3, the training method S20 includes, but is not limited to, the following steps:
s21: the method comprises the steps of obtaining a training set, wherein the training set comprises a plurality of groups of training data, the training data comprises random latent codes and text description, and the random latent codes are vectors used for generating face images.
S22: encoding each text description by adopting the preset text encoding network to obtain each text feature code, wherein the text feature code is used for reflecting semantic features of the text description;
s23: and inputting the random latent codes corresponding to the text feature codes into the fusion module for fusion so as to obtain the fusion latent codes.
S24: and respectively inputting the fusion latent codes into the countermeasure generation network to generate training change graphs, wherein one training change graph is a face image generated by the countermeasure generation network based on one fusion latent code, and the face attribute in one training change graph is matched with one text description.
S25: and taking each training change chart and text descriptions corresponding to each training change chart as each sample pair, and inputting a preset image description network to obtain predicted text descriptions corresponding to each training change chart.
S26: and calculating the error sum between each text description and each predicted text description according to a preset loss function.
S27: and reversely adjusting model parameters of the text coding network and the preset image description network according to the error sum until the preset image description network and the preset text coding network are converged.
In this embodiment, the training set includes several groups of training data, each comprising a random latent code and a text description, with the random latent codes in one-to-one correspondence with the text descriptions. The random latent code is a vector for generating a face image; that is, it is equivalent to a vector representation of the face image. The text description indicates the face features required after the change, for example, the text description vec = [A girl with long hair and big eyes].
In step S22, to enable the text descriptions to be learned by the model, each text description in the training set is first encoded with the preset text encoding network to obtain a corresponding text feature code, where the text feature code reflects the semantic features of the text description; that is, the text feature code is a numerical representation of the text description. Specifically, each word in the text description may be converted into a vector, so that the text description becomes a multi-column vector matrix. For example, the text description vec = [A girl with long hair and big eyes] is converted into an m×n vector matrix, where m is the dimension of a single word and n = 8 is the number of words. In this way, the text description can be converted into a vector matrix that the model can learn. In some embodiments, the preset text encoding network may be a recurrent neural network (Recurrent Neural Network, RNN) or the like.
In some embodiments, the preset text encoding network includes a sequential forgetting encoding module and a first recurrent neural network module. The sequential forgetting encoding module converts a text sequence of variable length into a vector of fixed length, meeting the model's requirement for fixed-length input vectors. The first recurrent neural network module establishes connections between words, facilitating the learning of the context information of the text description.
In this embodiment, the step S22 specifically includes:
s221: and respectively inputting each text description into the sequence forgetting coding module to carry out coding processing so as to obtain each text vector, wherein the length of each text vector is fixed.
S222: and respectively inputting the text vectors into the first recurrent neural network module for context association so as to acquire the text feature codes.
The sequential forgetting encoding module (Fixed-size Ordinally Forgetting Encoding, FOFE) treats all texts as built from a vocabulary of size K, with each word in the vocabulary expressed as a K-dimensional one-hot vector e ∈ R^K. Then, for a given text sequence y = {w_1, w_2, ..., w_T}, each word w_t is represented by a K-dimensional one-hot vector e_t. Each partial sequence is then encoded according to the following recursion:
z_t = α · z_{t-1} + e_t  (with z_0 = 0)
where z_t is the code of the sequence from position 1 to position t of the text sequence. For example, for the text sequence "A girl with long hair and big eyes", when t = 5, z_5 is the code of the partial sequence "A girl with long hair". α (0 < α < 1) is a forgetting factor describing the influence of the preceding sequence on the current word; its exponents in the expansion also reflect the order information of the words in the sequence.
Based on the principle of the sequential forgetting encoding module above, each text description in the training set, after being input into the module, is output as a text vector [z_1, z_2, ..., z_T]. Each vector z_t in the text vector has the same dimension as the vocabulary (K dimensions in each case), so the length of the text vector is fixed, and under the influence of the recursion above each z_t carries an association with the preceding words, so the context information of the text is preliminarily retained.
That is, by employing the sequential forgetting encoding module, since each word in the text description is represented as a K-dimensional one-hot vector, the variable-length text description is encoded into a text vector of fixed size, ensuring the uniqueness of text descriptions of any length.
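As a concrete illustration of the recursion above, the following is a minimal sketch of FOFE; the forgetting factor α = 0.7 and the function name are illustrative assumptions, not values fixed by this document.

```python
import torch

def fofe_encode(one_hot_seq: torch.Tensor, alpha: float = 0.7) -> torch.Tensor:
    """one_hot_seq: (T, K) one-hot vectors e_1..e_T; returns codes z_1..z_T, each of size K."""
    T, K = one_hot_seq.shape
    z = torch.zeros(T, K)
    prev = torch.zeros(K)                      # z_0 = 0
    for t in range(T):
        prev = alpha * prev + one_hot_seq[t]   # z_t = alpha * z_{t-1} + e_t
        z[t] = prev
    return z
```

Because 0 < α < 1, the powers of α in the expansion of z_T preserve the order information of the words, which is why FOFE codes of different sequences remain unique.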
Then, each text vector is input into a first recurrent neural network module, the first recurrent neural network learns the text vector by an initial parameter, and then each text feature code is output. The recurrent neural network module can further learn context information, establish word-to-word associations, and enable text feature encoding to reflect the context information of the text description.
In some embodiments, the first recurrent neural network module may be an existing bidirectional long short-term memory neural network (Bi-directional Long Short-Term Memory, Bi-LSTM). For example, after the i-th text vector [z_{i1}, z_{i2}, ..., z_{iT}] is input into the bidirectional long short-term memory neural network, the network learns the text vector with initial parameters and then outputs a text feature code carrying contextual features: f_i = Bi-LSTM(z_{i1}, z_{i2}, ..., z_{iT}).
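A minimal sketch of such a first recurrent neural network module, assuming PyTorch's built-in bidirectional LSTM; the dimensions and the choice of the last time step as the sentence code f_i are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=vocab_size, hidden_size=hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, fofe_codes: torch.Tensor) -> torch.Tensor:
        # fofe_codes: (B, T, K) text vectors [z_1, ..., z_T] from the FOFE module
        out, _ = self.bilstm(fofe_codes)   # context association in both directions
        return out[:, -1, :]               # text feature code f_i = Bi-LSTM(z_1, ..., z_T)
```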
In this embodiment, the preset text encoding network uses a sequential forgetting encoding module and a first recurrent neural network module to encode the input text descriptions into text feature codes. The sequential forgetting encoding module can losslessly encode a variable-length text description into a text vector of fixed size, ensuring the uniqueness of text descriptions of any length, while the first recurrent neural network module can better learn context information, so that the text feature codes can reflect the context information of the text descriptions.
After the text feature codes corresponding to the text descriptions in the training set are obtained, namely in the step S23, the random latent codes corresponding to the text feature codes are respectively input to the fusion module for fusion, so as to obtain the fusion latent codes.
The fusion module can comprise at least one function, so that the random latent codes corresponding to the text feature codes are mapped through a series of functions to obtain fusion latent codes, and therefore the fusion latent codes have the face features reflected by the text description on the basis of the face features of the original face image.
In some embodiments, the step of step S23 specifically includes:
and inputting the target text feature codes and random latent codes corresponding to the target text feature codes into the fusion module to perform nonlinear calculation to obtain fusion latent codes corresponding to the target text feature codes, wherein the target text feature codes are any text feature codes.
That is, a nonlinear function in the fusion module performs a nonlinear calculation on the text feature code of any text description in the training data and the random latent code corresponding to it, i.e., on the target text feature code and its corresponding random latent code, so as to fuse them and obtain the fusion latent code corresponding to the target text feature code. It can be understood that in this embodiment, because the fusion latent code is obtained from the text feature code and the random latent code through a nonlinear mapping, the facial features reflected by the random latent code are better preserved, while the fused-in text feature code can accurately control the direction of facial change.
For example, the nonlinear function in the fusion module may be a hyperbolic tangent function. The text feature code f_i and the random latent code W_i are mapped with the hyperbolic tangent function to obtain a fusion latent code w_i':
w_i' = tanh(W_i * f_i + b_i)
where W_i is the random latent code of the i-th group of training data, f_i is the text feature code corresponding to the i-th group of training data, and b_i is the bias value of the i-th group of training data; the initial value of b_i is 0, and during model training it varies as a random number.
In the embodiment, the fusion latent code is obtained by carrying out nonlinear calculation on the text feature code and the random latent code, so that the fusion latent code can better retain facial features reflected by the random latent code and can accurately control the direction of facial changes at the same time.
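A minimal sketch of this fusion, assuming the text feature code is first projected to the same dimension as the random latent code; the projection layer and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self, text_dim: int = 512, latent_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(text_dim, latent_dim)        # aligns f with the latent space
        self.bias = nn.Parameter(torch.zeros(latent_dim))  # b, initialized to 0

    def forward(self, w: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        # w' = tanh(W * f + b): nonlinear fusion of random latent code and text code
        return torch.tanh(w * self.proj(f) + self.bias)
```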
In step S24, each fusion latent code is input into the adversarial generation network to generate a training change map. A training change map is a face image generated by the adversarial generation network based on one fusion latent code, i.e., the fusion latent codes correspond one-to-one with the training change maps, and the face attributes in a training change map match the corresponding text description; in other words, the training change map is generated by adding the face features reflected by the text description to the original face image.
It will be appreciated that the adversarial generation network may be a StyleGAN network, whose structure includes a mapping network and a synthesis network. The mapping network has 8 fully connected layers and encodes the input fusion latent code into a one-dimensional intermediate vector that can reflect facial features, such as eye features, mouth features, or nose features. The intermediate vector and random noise are then input into each sub-network layer of the synthesis network; each sub-network layer performs deconvolution operations, mapping the intermediate vector and the random noise into image data at p×p resolution. As the sub-network layers progress, the resolution p×p grows larger and larger, until a training change map of the required size is finally generated.
The training change maps generated with StyleGAN conform to real human faces. In addition, since the trained generated face change image model modifies real faces according to text descriptions, the realism of the training change maps, which serve as training data, helps increase the accuracy of the generated face change image model.
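For intuition, a minimal sketch of the StyleGAN-style mapping network described above (8 fully connected layers turning the fusion latent code into an intermediate vector); the 512-dimensional width and the LeakyReLU slope are illustrative assumptions, and the synthesis network is elided.

```python
import torch
import torch.nn as nn

# 8 fully connected layers, as in a StyleGAN mapping network
mapping = nn.Sequential(*[m for _ in range(8)
                          for m in (nn.Linear(512, 512), nn.LeakyReLU(0.2))])

w_fused = torch.randn(1, 512)        # a fusion latent code (illustrative)
w_intermediate = mapping(w_fused)    # one-dimensional intermediate vector reflecting
                                     # facial features; the synthesis network would then
                                     # upsample it stage by stage into the change map
```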
The training change map is obtained by letting the text feature code control the change direction of the face image. To constrain the accuracy of the text feature codes, improve the accuracy of the trained preset text encoding network, and make the changed image conform to the text description, a preset image description network is constructed to detect the accuracy of the text feature codes and, in reverse, improve the accuracy of the preset text encoding network. Specifically, in step S25, each training change map and its corresponding text description are input as a sample pair into the preset image description network, which learns the relationship between each training change map and its corresponding text description with initial parameters and outputs the corresponding predicted text description. It will be appreciated that the training change map is obtained by the adversarial generation network modifying face features according to a text description, so the text description corresponds to the ground-truth label of the training change map.
The error sum between each text description and each predicted text description is then calculated with a preset loss function; the model parameters of the text encoding network and of the preset image description network are adjusted in reverse according to this error sum, and training iterates until the preset image description network and the preset text encoding network converge, yielding a face change image model with high accuracy. It can be understood that, because the model parameters of the two networks are adjusted in reverse simultaneously, the preset text encoding network and the preset image description network are trained together, end to end. During training, the preset image description network monitors the accuracy of the text feature codes output by the preset text encoding network, so that the features reflected by these codes continuously approach the real description texts and the change direction of the face image is accurately controlled by the text description. The adversarial generation network in the trained face change image model can then generate faces conforming to the corresponding text descriptions from the text feature codes output by the preset text encoding network.
In some embodiments, the model parameters may be optimized with the Adam algorithm; the number of iterations may be set to 500, the initial learning rate to 0.001, and the weight decay to 0.0005, with the learning rate decayed to 1/10 of its value every 50 iterations. After training, the model parameters are output to obtain the generated face change image model.
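These hyperparameters map directly onto a standard PyTorch Adam setup; a minimal sketch, assuming `params` collects the parameters of the preset text encoding network and the preset image description network.

```python
import torch.optim as optim

optimizer = optim.Adam(params, lr=0.001, weight_decay=0.0005)
# decay the learning rate to 1/10 of its value every 50 iterations
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

for iteration in range(500):
    # ... one training step over S22-S27 goes here ...
    scheduler.step()
```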
In this embodiment, the model parameters of the text encoding network and of the preset image description network are adjusted together. First, the text descriptions in the training set are encoded with the preset text encoding network into text feature codes that the adversarial generation network can recognize; the text feature codes are fused with the random latent codes of the corresponding face images into fusion latent codes, which are input into the adversarial generation network to generate training change maps, i.e., the change direction of the face images is controlled by the text feature codes. Second, to constrain the accuracy of the text feature codes, each training change map and its corresponding text description serve as a sample pair for training the preset image description network, so that the trained image description network can perform image description and output predicted text descriptions; the error sum between the text descriptions and the predicted text descriptions is calculated with the preset loss function, and the parameters of the preset image description network and of the preset text encoding network are adjusted in reverse based on this error sum, so that the predicted text descriptions continuously approach the text descriptions, iterating until both networks converge and the face change image model is obtained. That is, the trained preset image description network judges whether the training change map generated by the adversarial generation network from the fusion latent code conforms to the text description, and thereby how accurate the preset text encoding network is, constraining its parameters to keep adjusting in the direction of higher accuracy, so that the text feature code corresponding to a text description can accurately control the change direction of the face image. The trained face change image model can therefore be controlled and modified in a personalized way according to text descriptions reflecting the user's wishes, and the generated changed face image conforms to those wishes.
In some embodiments, the preset image description network includes a feature extraction module and a second recurrent neural network module. The feature extraction module extracts image features, converting image data into vectors that reflect those features; the second recurrent neural network module learns the image features and the text descriptions.
In this embodiment, step S25 specifically includes:
S251: inputting the training change map in a target sample pair into the feature extraction module for feature extraction to obtain the feature vector corresponding to the target sample pair, where the target sample pair is any one of the sample pairs.
S252: inputting the feature vector and the text description in the target sample pair into the second recurrent neural network module, decoding them through the second recurrent neural network module, and outputting the predicted text description corresponding to the training change map in the target sample pair.
A sample pair comprises a training change map and its corresponding text description. For any sample pair, i.e., a target sample pair, the training change map in the target sample pair is input into the feature extraction module for feature extraction to obtain a feature vector; that is, the image data is converted into a vector. In some embodiments, the feature extraction module may include a convolution layer group, an activation function layer, and a normalization layer.
The convolution layer group comprises a plurality of convolution layers, each containing at least one convolution kernel for the convolution operation. A convolution layer outputs a feature map: the more convolution kernels, the stronger the feature extraction capability, the more features in the corresponding feature map, and the further those features abstract away from the original training change map. To avoid losing feature information during sampling, a uniform stride is adopted, so that the convolution kernel moves a uniform length during the convolution operation. The activation function layer may be a nonlinear activation function such as LeakyReLU or softmax; it increases the nonlinearity of the model so that the neural network can be applied to nonlinear problems. The normalization layer maps the data into the range 0 to 1, which is convenient for processing and makes computation faster.
It will be appreciated that the corresponding mathematical expression for generating the feature vectors is as follows:
X_n^{l+1} = σ(IN(Σ_m X_m^l * k_{m,n}^{l+1} + b_n^{l+1}))
where X_m^l denotes the m-th feature map of layer l, X_n^{l+1} denotes the n-th feature map of layer l+1, k_{m,n}^{l+1} denotes a convolution kernel of layer l+1, b_n^{l+1} denotes the bias term of layer l+1, σ(·) denotes the LeakyReLU activation function, and IN denotes normalization.
After feature extraction by the feature extraction module, the training change map yields a 1024×1 feature vector. It will be appreciated that in some embodiments, existing MobileNet, ResNet, or VGG network architectures may also be used to extract features; there is no limitation on the feature extraction module, as long as it performs convolutional feature extraction.
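A minimal sketch of such a feature extraction module, following the σ(IN(conv(·))) pattern of the formula above and ending in the 1024-dimensional feature vector; the channel widths, strides, and pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            # uniform stride, so the kernel moves a consistent length
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm2d(64), nn.LeakyReLU(0.2),       # sigma(IN(conv(x)))
            nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm2d(256), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(256, 1024)   # 1024 x 1 feature vector, as in the text

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.fc(self.body(img).flatten(1))
```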
The feature vector and the text description corresponding to the sample pair are then input into the second recurrent neural network module, which decodes the feature vector and the text description in the target sample pair and outputs the predicted text description corresponding to the training change map.
Specifically, for an input training change map I and text description y, the training change map I is processed by the feature extraction module to generate a one-dimensional feature vector V, and each word in the text description y is mapped into a one-hot vector containing only 0s and 1s. A vocabulary is built, and each word in the vocabulary is represented as a K-dimensional one-hot vector; for example, "girl" is mapped to the vector [0,1,0,0,0,0]. The text description starts with a default start word <start> and ends with a default end word <end>, so the text description "A girl with long hair and big eyes" is mapped to the vector group [W0, W1, W2, ..., W9], where vector W0 represents <start>, vector W9 represents <end>, and vector W2 represents "girl". It will be appreciated that the mapping may be obtained with the sequential forgetting encoding module described above.
The feature vector and the vector group are input into the second recurrent neural network module, which decodes them according to its own mechanism; its softmax layer outputs a set of probability values, the probability values are converted into one-hot vectors, and the words corresponding to those one-hot vectors are looked up in the vocabulary, thereby outputting the predicted text description, i.e., the word sequence [y1', y2', ..., yt']. The second recurrent neural network module may be an existing long short-term memory network (Long Short-Term Memory, LSTM).
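A minimal sketch of the second recurrent neural network module as an LSTM decoder; initializing the hidden state from the image feature vector and embedding the one-hot words with a linear layer are illustrative design assumptions.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size: int, feat_dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.embed = nn.Linear(vocab_size, hidden)   # one-hot word -> dense embedding
        self.init_h = nn.Linear(feat_dim, hidden)    # image feature -> initial state
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)     # scores over the vocabulary

    def forward(self, feat: torch.Tensor, onehot_words: torch.Tensor) -> torch.Tensor:
        # feat: (B, 1024) feature vector V; onehot_words: (B, T, K) vector group [W0, ..., WT]
        h0 = torch.tanh(self.init_h(feat)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        seq, _ = self.lstm(self.embed(onehot_words), (h0, c0))
        return self.out(seq).log_softmax(dim=-1)     # per-step log p(y_t | y_<t, I)
```

At inference time, decoding starts from <start> and feeds each predicted word back in until <end> is produced, yielding the word sequence [y1', y2', ...].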
In this embodiment, the feature extraction module vectorizes the training change maps, and the second recurrent neural network module learns the relationship between feature vectors and text descriptions, so that the preset image description network can perform image description. Whether an image generated by the adversarial generation network conforms to the input text description can therefore be judged through the preset image description network, which in reverse optimizes the preset text encoding network, so that the generated face change image model can be obtained through training.
In some embodiments, the step S26 specifically includes:
calculating the error sum between each text description and each predicted text description according to the following formula:
L(θ) = -Σ_{i=1}^{N} θ*_i + μ‖θ‖₂²
where N is the number of sample pairs, L(θ) is the negative of the sum of the maximum probabilities, ‖θ‖₂² is an L2 regularization term, μ is a weight value, and θ*_i is the error between the text description and the predicted text description of the i-th sample pair.
The L2 regularization term prevents the model from overfitting and improves its generalization performance. It can be understood that when L(θ) fluctuates within a certain range over several adjacent iterations, the model has converged and training stops.
Wherein an error between the text description and the predicted text description is a maximum probability that the predicted text description is the text description, the error between the text description and the predicted text description being calculated according to the following formula;
Figure BDA0003105924200000181
And (y|I; theta) represents the probability that the predicted text description of the training change diagram I output by the image description network to be trained under the model parameter theta is the text description y.
It can be understood that, based on the principle by which the second recurrent neural network module decodes the feature vector and the text description, in the process of calculating the error between the text description and the predicted text description, multiple sets of probability values are obtained by adjusting the model parameters θ of the text encoding network and the preset image description network. This is equivalent to taking the training change map I and the sets of probability values as known sample results and inferring the model parameter θ that most probably (with maximum probability) makes the sample result the real sample pair (I, y). That is, the maximum probability θ* reflects how close the predicted text description is to the real text description and can therefore reflect the error between the two: the larger the maximum probability θ*, the smaller the error.
In this embodiment, by adjusting the second recurrent neural network module to output multiple sets of probability values and then taking the training change map I and these probability values as known sample results, the model parameter θ that most probably yields the real sample pair (I, y) is inferred. The maximum probability θ* thus reflects the closeness between the predicted text description and the real text description, so the error between them can be accurately reflected, making the preset loss function accurate and helping to train an accurate generated face change image model.
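A minimal sketch of this preset loss function, computing each θ*_i as the chain-rule log-probability of the ground-truth description and adding the L2 regularization term; the weight value μ = 1e-4 is an illustrative assumption.

```python
import torch

def caption_loss(log_probs, target_ids, params, mu: float = 1e-4):
    # log_probs:  (B, T, K) per-step log-probabilities from the decoder
    # target_ids: (B, T) indices of the true words of the text description
    per_step = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    log_p_y = per_step.sum(dim=1)            # log p(y | I; theta) via the chain rule
    l2 = sum((p ** 2).sum() for p in params)
    return -log_p_y.sum() + mu * l2          # L(theta) = -sum_i theta*_i + mu * ||theta||_2^2
```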
In some embodiments, the second recurrent neural network module is a long short-term memory (LSTM) neural network. The LSTM uses a forget gate, an input gate and an output gate to control the state of the memory cell: the forget gate controls whether the value of the current memory cell is forgotten, the input gate controls whether the input is read into the memory cell, and the output gate controls whether the value of the new memory cell is output. This gating mechanism alleviates the gradient explosion and gradient vanishing problems to a certain extent.
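For reference, a single LSTM step written out with the three gates named above; this is the textbook gate formulation, assumed here to match the LSTM the embodiment refers to, with illustrative weight names.

```python
import torch

def lstm_step(x_t, h_prev, m_prev, W):
    # W: dict of weight matrices and bias vectors, assumed pre-initialized.
    z = torch.cat([h_prev, x_t])
    f_t = torch.sigmoid(W["Wf"] @ z + W["bf"])   # forget gate: keep or forget m_prev
    i_t = torch.sigmoid(W["Wi"] @ z + W["bi"])   # input gate: admit new input
    o_t = torch.sigmoid(W["Wo"] @ z + W["bo"])   # output gate: expose the new cell
    m_cand = torch.tanh(W["Wc"] @ z + W["bc"])   # candidate memory content
    m_t = f_t * m_prev + i_t * m_cand            # new memory cell value
    h_t = o_t * torch.tanh(m_t)                  # new layer state
    return h_t, m_t
```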
The step of calculating log p(y|I; θ) includes:
calculating the joint probability of each word in the text description as the probability that the predicted text description output by the image description network to be trained for the training change graph I under the model parameter θ is the text description y, according to the following formulas;
\log p(y \mid I; \theta) = \sum_{t=1}^{T} \log p(y_t \mid y_1, \ldots, y_{t-1}; I; \theta)

\log p(y_t \mid y_1, \ldots, y_{t-1}; I; \theta) = f(h_t, c_t)

h_t = \mathrm{LSTM}(x_t, h_{t-1}, m_{t-1})

where y_t is the word at time t relative to the current word y_{t-1}, f is a nonlinear function that outputs the probability of y_t, c_t is the visual context vector extracted from the training change graph at time t, h_t is the state of the long short-term memory neural network layer at time t, x_t is the feature vector, m_{t-1} is the memory cell at time t-1, and h_{t-1} is the state of the long short-term memory neural network layer at time t-1.
In this embodiment, since y represents a sentence of arbitrary length, the chain rule is used to model the joint probability over the words y_1, ..., y_T of y, i.e.

\log p(y \mid I) = \sum_{t=1}^{T} \log p(y_t \mid y_1, \ldots, y_{t-1}, I)
The joint probability combines the probabilities that each predicted word is the corresponding word in the text description, and is therefore the total probability that the predicted text description is the true text description. In the long short-term memory neural network, the memory cell m_{t-1} and the layer state at time t-1 are fed into the memory cell at time t, so the network can generate the next word using the words generated before it. This long-term memory makes the joint probability more accurate, and the gate mechanism alleviates gradient explosion and gradient vanishing to a certain extent, so the face change image model obtained by training is more accurate.
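A sketch of accumulating this chain-rule joint log-probability with the decoder from the earlier sketch; the tensor layout and the helper name `caption_log_prob` are assumptions.

```python
import torch
import torch.nn.functional as F

def caption_log_prob(model, image, caption_tokens):
    # Teacher-forced pass: feed tokens y_1..y_{T-1}, get one vocabulary
    # distribution per position predicting y_1..y_T.
    logits = model(image, caption_tokens[:, :-1])        # (B, T, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    # log p(y | I; theta) = sum_t log p(y_t | y_1..y_{t-1}, I; theta)
    token_lp = log_probs.gather(-1, caption_tokens.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum(dim=1)                           # (B,) joint log-probability
```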
In summary, in the training method and related device for generating a face change image model provided by the embodiments of the present invention, the model includes a text encoding network, a fusion module and a countermeasure generation network. First, the text descriptions in the training set are encoded by the preset text encoding network into text feature codes that the countermeasure generation network can recognize; each text feature code is then fused with the random latent code of the corresponding face image to form a fusion latent code, which is input into the countermeasure generation network to generate a training change graph. A training change graph is the image obtained after the face features of a face image are changed according to the text feature code of the corresponding text description, so the change direction of the face image is controlled by the text feature code. Secondly, to constrain the accuracy of the text feature codes, improve the accuracy of the trained preset text encoding network, and ensure the changed image conforms to the text description, a preset image description network is constructed. Each training change graph and its corresponding text description are used as a sample pair to train the preset image description network, so that the trained network performs image description and outputs predicted text descriptions. The error sum between each text description and each predicted text description is then calculated through the preset loss function, and the parameters of the preset image description network and of the preset text encoding network are adjusted in reverse based on this error sum, so that the predicted text descriptions continuously approach the text descriptions; iteration continues until the preset image description network and the preset text encoding network converge, yielding the generated face change image model. In other words, the preset image description network judges whether the training change graph generated by the countermeasure generation network from the fusion latent code conforms to the text description, which in turn judges the accuracy of the preset text encoding network and constrains its parameters to be adjusted continuously toward higher accuracy. The text feature code corresponding to a text description can therefore accurately control the change direction of the face image, so that the trained face change image model can be controlled and modified according to a personalized text description reflecting the user's intention, and the generated changed face image conforms to the user's intention.
Referring to fig. 4, the method S30 for generating a face change image according to the embodiment of the present invention includes, but is not limited to, the following steps:
S31: acquiring a test text description and a test latent code, wherein the test latent code is a vector generated based on a test face image and used to reflect the face features of the test face.
S32: inputting the test text description and the test latent code into the generated face change image model in any embodiment, so as to encode the test text description through a text encoding model in the generated face change image model, and generate a test text feature code; inputting the test text feature codes and the test latent codes into a fusion module in the generated face change image model for fusion so as to output test fusion latent codes; inputting the test fusion latent code into a countermeasure generation network in the generated face change image model, and outputting a test face change image, wherein the faces in the test face change image and the faces in the test face image both reflect the same target face, and the face attribute in the test face change image is matched with the test text description.
It will be appreciated that the test face image is an image that includes a face before its features are changed, for example an unprocessed face photograph or a frame image in an unprocessed recorded video. If the test face image is an unprocessed face photograph, this embodiment can be applied to scenarios such as photographing or photo retouching to change the features of the test face image; if it is a frame image in an unprocessed recorded video, this embodiment can be applied to scenarios such as video recording.
The test latent code is a vector reflecting the face features of the test face; that is, the test face image is image data, and the latent code is that image data in vector form. For example, the test face image can be generated from the test latent code through a countermeasure generation network (StyleGAN). In some embodiments, the test latent code is generated based on the test face image, for example by converting the test face image into the test latent code using a visual geometry group model (Visual Geometry Group, VGG). Further, in some embodiments, the test latent code may be optimized: the converted test latent code is input into the countermeasure generation network (StyleGAN) to generate a new face image, the similarity between the new face image and the test face image is determined, and the test latent code is adjusted according to that similarity, thereby obtaining an optimized, accurate test latent code.
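A hedged sketch of that optimization loop; `vgg_encoder` and `stylegan_generator` stand for the pre-trained VGG converter and the countermeasure generation network, and plain pixel-wise L2 stands in for the unspecified similarity measure.

```python
import torch

def optimize_test_latent(test_image, vgg_encoder, stylegan_generator,
                         steps=200, lr=0.01):
    # Initial conversion: face image -> latent code via the VGG model.
    latent = vgg_encoder(test_image).detach().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        recon = stylegan_generator(latent)           # new face image
        # Judge similarity between the new image and the test image,
        # then adjust the latent code accordingly.
        loss = torch.mean((recon - test_image) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latent.detach()                           # optimized test latent code
```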
The text description is output by the user to indicate the face features to be changed; for example, the text description may be "A girl with long hair and big eyes".
After the test text description and the test latent code are obtained, they are input into the generated face change image model of any of the above embodiments. Specifically, the test text description is encoded by the text encoding model in the generated face change image model to generate a test text feature code; the test text feature code and the test latent code are input into the fusion module of the generated face change image model for fusion to output a test fusion latent code; and the test fusion latent code is input into the countermeasure generation network of the generated face change image model to output a test face change image. The test face change image is the image obtained after the test face image is changed according to the text description, so the face in the test face change image and the face in the test face image both reflect the same target face, that is, the face of the same person, while the face attributes in the test face change image match the test text description. For example, if the test text description is "short hair and big eyes", the face attributes in the test face change image also reflect "short hair and big eyes".
It can be understood that the generated face change image model is obtained by training the training method for generating the face change image model in the above embodiment, and has the same structure and function as the generated face change image model in the above embodiment, and will not be described in detail herein.
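Putting steps S31 and S32 together, a minimal inference sketch; the three callables are assumptions standing in for the trained sub-networks of the generated face change image model.

```python
def generate_face_change(test_text, test_latent,
                         text_encoder, fusion_module, generator):
    text_code = text_encoder(test_text)             # test text feature code
    fused = fusion_module(text_code, test_latent)   # test fusion latent code
    return generator(fused)                         # test face change image
```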
In some embodiments, the step of obtaining the text description includes:
s311: and acquiring voice information.
S312: and acquiring text information corresponding to the voice information by adopting a voice recognition algorithm, and taking the text information as the text description.
It will be appreciated that in this embodiment the text description is converted from voice information. Specifically, voice information may be collected through a microphone in the electronic device; the voice information may be uttered by a user, for example, when the user says "long hair and big eyes" to the electronic device, the device collects an audio signal reflecting the voice information. An existing voice recognition algorithm, such as a Gaussian mixture model or a hidden Markov model, is then used to recognize the voice information and obtain the corresponding text information; that is, the audio signal is converted into text information, which is used as the text description. In this way, voice control of face feature changes can be realized.
The present embodiment can be applied to photographing or video recording. For example, when a user takes photographs or records video with an electronic device (e.g., a smartphone), the user can control the face features in the resulting photograph or in the frame images of the video by voice, for example making the eyes larger or the hair longer. Voice control of face feature changes can thus be realized, adding interest to the photographing or video recording process.
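A sketch of the voice path (S311-S312) using the third-party SpeechRecognition package as a stand-in; the embodiment itself names Gaussian mixture and hidden Markov models rather than this library, so this is purely illustrative.

```python
import speech_recognition as sr

def voice_to_text_description():
    recognizer = sr.Recognizer()
    # Collect the voice information through the device microphone.
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    # Convert the audio signal into text information, which is then
    # used as the text description driving the face change.
    return recognizer.recognize_google(audio)
```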
Another embodiment of the present invention further provides a non-transitory computer readable storage medium storing computer executable instructions for causing an electronic device to perform the above-described training method for generating a face change image model, or a method for generating a face change image.
It should be noted that the above-described apparatus embodiments are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Those skilled in the art will appreciate that all or part of the processes implementing the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and where the program may include processes implementing the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Within the idea of the invention, the technical features of the above embodiments or of different embodiments may be combined and the steps may be implemented in any order; many other variations of the different aspects of the invention exist as described above and are not provided in detail for the sake of brevity. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical schemes described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit of the invention.

Claims (10)

1. A training method for generating a face change image model, wherein the generated face change image model comprises a text coding network, a fusion module and a countermeasure generation network, and the method comprises the following steps:
acquiring a training set, wherein the training set comprises a plurality of groups of training data, the training data comprise random latent codes and text descriptions, the random latent codes are vectors used for generating face images, and the text descriptions indicate the face characteristics obtained after the face changes;
performing feature coding on each text description by adopting the text coding network to obtain each text feature code, wherein the text feature codes are used for reflecting semantic features of the text description, and the text coding network comprises a sequential forgetting coding module and a first recurrent neural network module;
inputting the random latent codes corresponding to the text feature codes into the fusion module for fusion to obtain the fusion latent codes;
inputting the fusion latent codes into the countermeasure generation network respectively to generate training change graphs, wherein one training change graph is a face image generated by the countermeasure generation network based on one fusion latent code, and the face attribute in one training change graph is matched with one text description;
taking each training change chart and the text descriptions corresponding to each training change chart as respective sample pairs, and inputting them into a preset image description network to obtain predicted text descriptions corresponding to each training change chart, wherein the preset image description network comprises a feature extraction module and a second recurrent neural network module;
calculating error sums between the text descriptions and the predicted text descriptions according to a preset loss function;
and reversely adjusting model parameters of the text coding network and the preset image description network according to the error sum, and returning to execute the step of coding each text description by adopting the text coding network to obtain each text feature code until the preset image description network and the preset text coding network are converged.
2. The training method of claim 1, wherein the step of encoding each text description using the text encoding network to obtain each text feature encoding comprises:
inputting each text description into the sequence forgetting coding module for coding processing to obtain each text vector, wherein the length of each text vector is fixed;
and respectively inputting the text vectors into the first recurrent neural network module for context association so as to acquire the text feature codes.
3. The training method according to claim 1, wherein the step of inputting the random latent codes corresponding to the text feature codes to the fusion module for fusion to obtain the fusion latent codes includes:
and inputting the target text feature codes and random latent codes corresponding to the target text feature codes into the fusion module to perform nonlinear calculation to obtain fusion latent codes corresponding to the target text feature codes, wherein the target text feature codes are any text feature codes.
4. The training method according to claim 1, wherein the step of inputting the training change charts and the text descriptions corresponding to the training change charts as respective sample pairs into the preset image description network to obtain the predicted text descriptions corresponding to the training change charts comprises:
inputting a training change chart in a target sample pair into the feature extraction module to perform feature extraction to obtain a feature vector corresponding to the target sample, wherein the target sample pair is any sample pair;
inputting the feature vector and the text description in the target sample pair into the second recurrent neural network module, decoding the feature vector and the text description in the target sample pair through the second recurrent neural network module, and outputting the predicted text description corresponding to the training change chart in the target sample pair.
5. The training method of claim 1, wherein the step of calculating a sum of errors between each of the text descriptions and each of the predicted text descriptions based on a preset loss function comprises:
calculating a sum of errors between each of the text descriptions and each of the predicted text descriptions according to the following formula;
L(\theta) = -\sum_{i=1}^{N} \theta_i^{*} + \mu \|\theta\|_2^2

where N is the number of the sample pairs, L(θ) is the negative sum of the maximum probabilities, \|\theta\|_2^2 represents the L2 regularization term, μ represents the weight value, and θ*_i is the error between the corresponding text description and the predicted text description for the i-th sample pair;
wherein an error between the text description and the predicted text description is a maximum probability that the predicted text description is the text description, the error between the text description and the predicted text description being calculated according to the following formula;
\theta^{*} = \arg\max_{\theta} \log p(y \mid I; \theta)

wherein θ is the model parameter of the text encoding network and the preset image description network, I is the training change chart, y is the text description, θ* is the maximum probability that the predicted text description is the text description under the model parameter θ, and p(y|I; θ) represents the probability that the predicted text description of the training change chart I output by the image description network to be trained under the model parameter θ is the text description y.
6. The training method of claim 5, wherein the second recurrent neural network module is a long short-term memory neural network, and the step of calculating log p(y|I; θ) comprises:
calculating the joint probability of each word in the text description as the probability that the predicted text description output by the image description network to be trained for the training change chart I under the model parameter θ is the text description y, according to the following formulas;
\log p(y \mid I; \theta) = \sum_{t=1}^{T} \log p(y_t \mid y_1, \ldots, y_{t-1}; I; \theta)

\log p(y_t \mid y_1, \ldots, y_{t-1}; I; \theta) = f(h_t, c_t)

h_t = \mathrm{LSTM}(x_t, h_{t-1}, m_{t-1})

wherein y_t is the word at time t relative to the current word y_{t-1}, f is a nonlinear function that outputs the probability of y_t, c_t is the visual context vector extracted from the training change chart at time t, h_t is the state of the long short-term memory neural network layer at time t, x_t is the feature vector, and m_{t-1} is the memory cell at time t-1.
7. A method of generating a face change image, comprising:
acquiring a test text description and a test latent code, wherein the test latent code is a vector generated based on a test face image and used for reflecting the face characteristics of the face in the test face image;
inputting the test text description and the test latent code into a generated face change image model according to any one of claims 1-6, so as to encode the test text description through a text encoding model in the generated face change image model to generate a test text feature code, wherein the text encoding network comprises a sequential forgetting encoding module and a first recurrent neural network module; inputting the test text feature code and the test latent code into a fusion module in the generated face change image model for fusion so as to output a test fusion latent code; and inputting the test fusion latent code into a countermeasure generation network in the generated face change image model, and outputting a test face change image, wherein the face in the test face change image and the face in the test face image both reflect the same target face, and the face attribute in the test face change image is matched with the test text description.
8. The method of claim 7, wherein the obtaining a text description comprises:
acquiring voice information;
and acquiring text information corresponding to the voice information by adopting a voice recognition algorithm, and taking the text information as the text description.
9. An electronic device, comprising:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A non-transitory computer-readable storage medium storing computer-executable instructions for causing an electronic device to perform the method of any one of claims 1-8.
CN202110636448.2A 2021-06-08 2021-06-08 Training method for generating face change image model and related device Active CN113240115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110636448.2A CN113240115B (en) 2021-06-08 2021-06-08 Training method for generating face change image model and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110636448.2A CN113240115B (en) 2021-06-08 2021-06-08 Training method for generating face change image model and related device

Publications (2)

Publication Number Publication Date
CN113240115A CN113240115A (en) 2021-08-10
CN113240115B true CN113240115B (en) 2023-06-06

Family

ID=77137189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110636448.2A Active CN113240115B (en) 2021-06-08 2021-06-08 Training method for generating face change image model and related device

Country Status (1)

Country Link
CN (1) CN113240115B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114005160B (en) * 2021-10-28 2022-05-17 建湖县公安局 Access control system and method based on identity two-dimensional code and artificial intelligence
CN115810215A (en) * 2023-02-08 2023-03-17 科大讯飞股份有限公司 Face image generation method, device, equipment and storage medium
CN116721334B (en) * 2023-08-11 2023-11-21 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of image generation model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046551A (en) * 2019-03-18 2019-07-23 中国科学院深圳先进技术研究院 A kind of generation method and equipment of human face recognition model
CN112364946A (en) * 2021-01-13 2021-02-12 长沙海信智能系统研究院有限公司 Training method of image determination model, and method, device and equipment for image determination
CN112765316A (en) * 2021-01-19 2021-05-07 东南大学 Method and device for generating image by introducing text of capsule network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147010B (en) * 2018-08-22 2023-07-25 广东工业大学 Method, device and system for generating face image with attribute and readable storage medium
CN111339340A (en) * 2018-12-18 2020-06-26 顺丰科技有限公司 Training method of image description model, image searching method and device
CN110111399B (en) * 2019-04-24 2023-06-30 上海理工大学 Image text generation method based on visual attention
CN110610271B (en) * 2019-09-17 2022-05-13 北京理工大学 Multi-vehicle track prediction method based on long and short memory network
CN112200031A (en) * 2020-09-27 2021-01-08 上海眼控科技股份有限公司 Network model training method and equipment for generating image corresponding word description

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046551A (en) * 2019-03-18 2019-07-23 中国科学院深圳先进技术研究院 A kind of generation method and equipment of human face recognition model
CN112364946A (en) * 2021-01-13 2021-02-12 长沙海信智能系统研究院有限公司 Training method of image determination model, and method, device and equipment for image determination
CN112765316A (en) * 2021-01-19 2021-05-07 东南大学 Method and device for generating image by introducing text of capsule network

Also Published As

Publication number Publication date
CN113240115A (en) 2021-08-10

Similar Documents

Publication Publication Date Title
US20220014807A1 (en) Method, apparatus, device and medium for generating captioning information of multimedia data
CN113240115B (en) Training method for generating face change image model and related device
US20220180202A1 (en) Text processing model training method, and text processing method and apparatus
CN106910497B (en) Chinese word pronunciation prediction method and device
WO2020140487A1 (en) Speech recognition method for human-machine interaction of smart apparatus, and system
US20220351487A1 (en) Image Description Method and Apparatus, Computing Device, and Storage Medium
CN110326002B (en) Sequence processing using online attention
US20170150235A1 (en) Jointly Modeling Embedding and Translation to Bridge Video and Language
JP2023539532A (en) Text classification model training method, text classification method, device, equipment, storage medium and computer program
CN116720004B (en) Recommendation reason generation method, device, equipment and storage medium
CN111133453A (en) Artificial neural network
CN114495129B (en) Character detection model pre-training method and device
CN113837370A (en) Method and apparatus for training a model based on contrast learning
CN111310464A (en) Word vector acquisition model generation method and device and word vector acquisition method and device
CN116861995A (en) Training of multi-mode pre-training model and multi-mode data processing method and device
CN115731552A (en) Stamp character recognition method and device, processor and electronic equipment
EP4302234A1 (en) Cross-modal processing for vision and language
CN115964638A (en) Multi-mode social data emotion classification method, system, terminal, equipment and application
CN111339256A (en) Method and device for text processing
CN113591490B (en) Information processing method and device and electronic equipment
WO2019138897A1 (en) Learning device and method, and program
Chowdhury et al. A cascaded long short-term memory (LSTM) driven generic visual question answering (VQA)
US20220208179A1 (en) Customization of recurrent neural network transducers for speech recognition
CN111797220A (en) Dialog generation method and device, computer equipment and storage medium
CN116975347A (en) Image generation model training method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant