CN114897670A - Stylized picture generation method, stylized picture generation device, stylized picture generation equipment and storage medium


Info

Publication number
CN114897670A
Authority
CN
China
Prior art keywords
style
picture
attribute
feature
target
Prior art date
Legal status
Pending
Application number
CN202210508195.5A
Other languages
Chinese (zh)
Inventor
谢中流
钟凯宇
郑曌琼
丁保阳
张术
Current Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Culture Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202210508195.5A priority Critical patent/CN114897670A/en
Publication of CN114897670A publication Critical patent/CN114897670A/en
Pending legal-status Critical Current

Classifications

    • G06T3/04
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a stylized picture generation method, a stylized picture generation device, stylized picture generation equipment and a storage medium. The method comprises the following steps: determining a target style type and an attribute feature vector corresponding to a target picture to be processed, and determining a style intensity feature of the target picture according to the attribute feature vector; extracting a global attribute feature of the target picture; searching a preset style feature library according to the target style type to obtain a style feature matched with the target style type; and splicing the global attribute feature, the style feature and the style intensity feature to obtain a fusion feature, and generating the stylized picture corresponding to the target picture according to the fusion feature. The method and the device achieve diversity in stylized picture generation, so that the target object in the target picture and the style of the stylized picture are better coordinated.

Description

Stylized picture generation method, stylized picture generation device, stylized picture generation equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a stylized image generation method, apparatus, device, and storage medium.
Background
Stylization is a technique that processes a real picture with an algorithm to convert it from its source style into a target style (for example, a Disney style, a Japanese-anime style, a Chinese-anime style, and the like). In existing stylization schemes, a trained network model is used directly to generate the stylized picture. In actual use, the style of the stylized picture output by such a network model is fixed regardless of the input picture, and the style is not adapted to the target object (such as a person) in the picture, so that the image of the target object and the applied style are not coordinated.
Disclosure of Invention
The embodiments of the present invention provide a stylized picture generation method, device, equipment and storage medium, and aim to solve the technical problem that, because the style used by the existing stylized picture processing technology is fixed, the style applied in the stylized picture is inconsistent with the image of the target object in the stylized picture.
The embodiment of the invention provides a stylized picture generation method, which comprises the following steps:
determining a target style type and an attribute feature vector corresponding to a target picture to be processed, and determining style intensity features corresponding to the target picture according to the attribute feature vector;
extracting global attribute features of the target picture;
searching a preset style feature library according to the target style type to obtain style features matched with the target style type;
and splicing the global attribute feature, the style feature and the style intensity feature to obtain a fusion feature, and generating a stylized picture corresponding to the target picture according to the fusion feature.
In an embodiment, the step of determining a target style type and an attribute feature vector corresponding to a target picture to be processed, and determining a style intensity feature corresponding to the target picture according to the attribute feature vector includes:
acquiring a pre-trained attribute decoupling encoder;
and identifying the target picture by adopting the attribute decoupling encoder to obtain the target style type and style intensity characteristic corresponding to the target picture.
In an embodiment, the training of the attribute decoupling encoder comprises:
acquiring sample object data with attribute labels, wherein the attribute labels comprise attribute labels and style labels of sample objects;
iteratively training an original attribute decoupling encoder by adopting the sample object data, wherein the original attribute decoupling encoder comprises a U-Net network and a plurality of three-layer fully-connected networks which are sequentially connected;
acquiring the cross-entropy loss of each fully-connected network after each training round, and determining the weighted sum of the cross-entropy losses of the fully-connected networks;
and when the weighted sum is less than or equal to a preset cross-entropy loss, stopping the training of the original attribute decoupling encoder, and storing the original attribute decoupling encoder whose training has stopped as the attribute decoupling encoder.
In an embodiment, the step of extracting the global attribute feature of the target picture includes:
acquiring a pre-trained global information encoder;
and identifying the target picture by adopting the global information encoder to obtain the global attribute characteristics of the target picture.
In an embodiment, the step of splicing the global attribute feature, the style feature, and the style intensity feature to obtain a fusion feature, and generating a stylized picture corresponding to the target picture according to the fusion feature includes:
based on the channel dimensions of the global attribute feature, the style feature and the style intensity feature, overlapping the global attribute feature, the style feature and the style intensity feature to obtain the fusion feature;
obtaining a pre-trained style decoder;
and identifying the fusion characteristics by adopting the style decoder, and outputting the stylized picture corresponding to the target picture.
In one embodiment, the training of the global information encoder and the style decoder comprises:
acquiring a training style data picture with a style label, training style intensity data and a pre-training attribute decoupling encoder;
training an original global information encoder by adopting the training style data picture, so that the original global information encoder outputs the training global attribute characteristics of the training style data picture;
acquiring training style characteristics matched with the style labels from a preset style characteristic library;
inputting the training style data picture into the attribute decoupling encoder to obtain a training style intensity characteristic corresponding to the training style data picture;
splicing the training global attribute features, the training style features and the training style strength features to obtain fusion training features;
training an original style decoder by adopting the fusion training characteristics so as to output a training stylized picture corresponding to the training style data picture by the original style decoder;
acquiring a preset style picture matched with the training stylized picture according to the style label, and inputting the training stylized picture and the preset style picture into a loss function discrimination network to obtain a loss value;
and when the loss value is smaller than or equal to a preset value, stopping training of the original global information encoder and the original style decoder, and storing the original global information encoder and the original style decoder which are stopped from training as the global information encoder and the style decoder.
In an embodiment, the stylized picture generating method further includes:
acquiring a target picture to be processed;
and when the target picture does not contain a preset trigger action for extracting the target style type, executing the steps of determining the target style type and the attribute feature vector corresponding to the target picture to be processed, and determining the style intensity feature corresponding to the target picture according to the attribute feature vector.
In addition, to achieve the above object, the present invention further provides a stylized picture generating apparatus, including:
the first acquisition module is used for determining a target style type and an attribute feature vector corresponding to a target picture to be processed and determining style intensity features corresponding to the target picture according to the attribute feature vector;
the second acquisition module is used for extracting the global attribute characteristics of the target picture;
the third acquisition module is used for searching a preset style feature library according to the target style type to obtain style features matched with the target style type;
and the picture generation module is used for splicing the global attribute feature, the style feature and the style intensity feature to obtain a fusion feature, and generating a stylized picture corresponding to the target picture according to the fusion feature.
In addition, to achieve the above object, the present invention further provides a terminal device, including: a memory, a processor, and a stylized picture generation program stored in the memory and executable on the processor, wherein the stylized picture generation program, when executed by the processor, implements the steps of the stylized picture generation method described above.
In addition, to achieve the above object, the present invention also provides a storage medium having a stylized picture generation program stored thereon, which when executed by a processor, implements the steps of the stylized picture generation method described above.
The technical scheme of the stylized picture generation method, the stylized picture generation device, the stylized picture generation equipment and the storage medium provided by the embodiment of the invention at least has the following technical effects or advantages:
the technical scheme that the target style type and the attribute feature vector corresponding to the target picture to be processed are determined, the style intensity feature corresponding to the target picture is determined according to the attribute feature vector, the global attribute feature of the target picture is extracted, the preset style feature library is searched according to the target style type, the style feature matched with the target style type is obtained, the global attribute feature, the style feature and the style intensity feature are spliced to obtain the fusion feature, and the stylized picture corresponding to the target picture is generated according to the fusion feature is adopted, so that the technical problem that the style of the target object in the stylized picture is inconsistent with the style due to the fact that the style used by the existing stylized picture processing technology is fixed is solved. According to the method and the device, the stylized picture of the target object in the target picture is generated according to the multi-dimensional picture characteristic drive, so that the diversity of the stylized picture generation is realized, the harmony degree of the style is improved, and the style of the target object and the stylized picture is more harmonious and more appropriate.
Drawings
FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a stylized image generation method according to an embodiment of the present invention;
FIG. 3 is a schematic illustration of the stitching of the fusion features of the present invention;
FIG. 4 is a flowchart illustrating the detailed process of step S210 in the stylized image generating method of the present invention;
FIG. 5 is a schematic diagram of a network architecture of an attribute decoupling encoder according to the present invention;
FIG. 6 is a flowchart illustrating the detailed step S220 of the stylized drawing generation method of the present invention;
FIG. 7 is a flowchart illustrating the detailed step S240 of the stylized drawing generation method of the present invention;
FIG. 8 is a schematic diagram of a global information encoder and a style decoder in accordance with the present invention;
FIG. 9 is a functional block diagram of the stylized picture generating apparatus of the present invention.
Detailed Description
In order to better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a hardware operating environment according to an embodiment of the present invention.
It should be noted that fig. 1 is a schematic structural diagram of a hardware operating environment of the terminal device.
As an implementation manner, as shown in fig. 1, an embodiment of the present invention relates to a terminal device, where the terminal device includes: a processor 1001, such as a CPU, a memory 1002, and a communication bus 1003. The communication bus 1003 is used to implement connection communication among these components.
The memory 1002 may be a high-speed RAM or a non-volatile memory, such as a disk memory. As shown in fig. 1, the memory 1002, as a storage medium, may include a stylized picture generation program; and the processor 1001 may be configured to call the stylized picture generation program stored in the memory 1002 and perform at least the following operations:
determining a target style type and an attribute feature vector corresponding to a target picture to be processed, and determining style intensity features corresponding to the target picture according to the attribute feature vector;
extracting global attribute features of the target picture;
searching a preset style feature library according to the target style type to obtain style features matched with the target style type;
and splicing the global attribute feature, the style feature and the style strength feature to obtain a fusion feature, and generating a stylized picture corresponding to the target picture according to the fusion feature.
While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in a different order than here.
As shown in fig. 2, in a first embodiment of the present invention, a stylized picture creating method of the present invention includes the steps of:
step S210: determining a target style type and an attribute feature vector corresponding to a target picture to be processed, and determining style intensity features corresponding to the target picture according to the attribute feature vector.
In this embodiment, the target picture to be processed refers to a picture that needs to be stylized, and the target picture may be an object picture from different scenes, for example a picture containing a person (the picture needs to include a face), a picture containing an animal (such as a puppy), a picture containing a plant (such as a pine tree), and the like. This embodiment and the embodiments described below are explained with the target picture being a person picture, that is, the target object contained in the person picture is a person, and the person needs to include a face. The target style type can be understood as the style of the face, that is, the face target style type, such as a Disney style, a Japanese-anime style, a Chinese-anime style, a hot-blood anime style, a cute (moe) anime style, and the like. The attribute feature vector is a face attribute feature vector, that is, a vector representing the face attribute features of the person, where the face attribute features are, for example, the face shape, expression, eyebrow shape, hairstyle, skin color, age, and so on of the person.
After the face attribute feature vector is obtained, the face style intensity feature of the person in the person picture is determined according to the face attribute feature vector. The face attribute feature vector carries concrete values, such as the width of the eyebrows, the age, and the darkness of the skin color, and the style intensity feature is derived comprehensively from these values.
Step S220: and extracting the global attribute features of the target picture.
After or while the face target style type and the face style intensity feature corresponding to the person picture are determined, the global attribute feature of the person picture is extracted. The global attribute feature is the attribute feature of all the information in the person picture, including the face attribute features and other information; for example, the global attribute feature covers the texture, color, edges, content, face and other attributes of the person picture.
Step S230: and searching a preset style feature library according to the target style type to obtain style features matched with the target style type.
The preset style feature library is established in advance and stores preset style types and the style features associated with them. After the face target style type is obtained, it is used as the search condition in the preset style feature library; the preset style type matching the face target style type is found, and the style feature associated with that preset style type is the style feature matched to the face target style type. For example, if the face target style type is the Disney style, the style feature associated with the Disney style in the preset style feature library is finally found.
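For illustration, the lookup step can be sketched as a simple mapping from style labels to pre-computed style feature tensors; the labels and the tensor shape below are assumptions of this sketch, not values from the disclosure.

```python
import torch

# Hypothetical style feature library: style label -> pre-computed style feature tensor
# (the labels and the (256, 8, 8) shape are illustrative assumptions).
style_feature_library = {
    "disney":    torch.randn(256, 8, 8),
    "hot_blood": torch.randn(256, 8, 8),
    "moe":       torch.randn(256, 8, 8),
}

def lookup_style_feature(target_style_type: str) -> torch.Tensor:
    """Return the style feature matched to the target style type."""
    return style_feature_library[target_style_type]
```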
Further, before performing step S210, the method further includes:
acquiring a target picture to be processed;
and when the target picture does not contain the preset trigger action for extracting the target style type, executing the step S210 of determining the target style type and the attribute feature vector corresponding to the target picture to be processed, and determining the style intensity feature corresponding to the target picture according to the attribute feature vector.
In order to improve the efficiency of determining the target style type corresponding to the target picture, a preset trigger action for extracting the target style type is set in advance, and the preset trigger action and the style feature associated with it are stored in the preset style feature library. That is, after the target picture to be processed is obtained, whether the target picture contains a preset trigger action for extracting the target style type is detected; if not, step S210 is executed; if yes, the detected trigger action is matched against the preset trigger actions stored in the preset style feature library, and once the matching preset trigger action is found, the target style type corresponding to the target picture can be obtained from the preset style feature library.
For example, after a person picture is acquired, the human skeleton joint points are obtained with a human pose estimation method, such as but not limited to OpenPose, DeepPose, MSPN, and the like; whether the person picture contains a preset trigger action for extracting the face target style type is then judged according to trigger rules defined on the skeleton key points. For example, the person picture contains a heart gesture made with both hands above the head, which matches the cute (moe) style stored in the preset style feature library; as another example, the person picture contains the arms crossed in front of the chest, which matches the Ultraman style. If the person picture contains a preset trigger action for extracting the face target style type, the matched face target style type can be found in the preset style feature library according to the trigger action; if it does not, step S210 is executed. Trigger-action recognition first normalizes the human skeleton points to the backbone skeleton and then judges whether a trigger action is performed according to the positional relations of the skeleton points.
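A minimal sketch of such a rule, assuming a hypothetical keypoint layout; the indices, the normalization and the pose-to-style mapping below are illustrative assumptions, not part of the disclosure.

```python
import numpy as np
from typing import Optional

# Hypothetical keypoint indices; real indices depend on the pose estimator used.
NOSE, NECK, R_WRIST, L_WRIST = 0, 1, 4, 7

def detect_trigger_action(keypoints: np.ndarray) -> Optional[str]:
    """keypoints: (N, 2) array of (x, y) skeleton joints from a pose estimator.

    Normalizes the joints to the backbone skeleton, then applies simple positional
    rules; returns a style label if a preset trigger pose is matched, else None.
    """
    scale = np.linalg.norm(keypoints[NECK] - keypoints[NOSE]) + 1e-6
    pts = (keypoints - keypoints[NECK]) / scale            # normalize to the backbone skeleton

    # In image coordinates, a smaller y means higher up in the picture.
    both_hands_above_head = pts[L_WRIST, 1] < pts[NOSE, 1] and pts[R_WRIST, 1] < pts[NOSE, 1]
    if both_hands_above_head:
        return "moe"                                       # hands-above-head pose -> cute style
    return None
```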
Step S240: and splicing the global attribute feature, the style feature and the style intensity feature to obtain a fusion feature, and generating a stylized picture corresponding to the target picture according to the fusion feature.
After the global attribute feature, the face style feature and the face style intensity feature are obtained, the three features are spliced to obtain the fusion feature; the splicing process can be understood as directly adding the global attribute feature, the face style feature and the face style intensity feature together, as shown in fig. 3. Since the fusion feature contains the global attribute feature, the face style feature and the face style intensity feature, a preset style picture matched with the face style feature can be found according to the fusion feature, and the attribute information in the preset style picture is then adjusted by using the fusion feature, so that the stylized picture corresponding to the person picture, i.e., the stylized picture the user wants, is generated.
According to the technical scheme, the stylized picture of the target object in the target picture is driven and generated according to the multi-dimensional picture characteristics, the diversity of the stylized picture generation is achieved, the harmony degree of the style is improved, and the style of the target object and the stylized picture is more harmonious and more appropriate.
As shown in fig. 4, based on the above embodiment, step S210 includes the following steps:
step S211: acquiring a pre-trained attribute decoupling encoder;
step S212: and identifying the target picture by adopting the attribute decoupling encoder to obtain the target style type and style intensity characteristic corresponding to the target picture.
The attribute decoupling encoder is trained in advance and comprises a U-Net network and a plurality of three-layer fully-connected networks connected in sequence, where the U-Net network is connected to the first fully-connected layer of the fully-connected networks, as shown in fig. 5. The person picture is processed by the U-Net network and the plurality of three-layer fully-connected networks: the penultimate fully-connected layer of each sub-network outputs an attribute feature vector of the target object; the output attribute feature vectors are spliced to obtain a vector matrix; the vector matrix is input into the last fully-connected network to obtain the target style type corresponding to the person picture and the feature vector output by the last fully-connected layer; the probability distribution of the attribute information of the target object is determined according to the vector matrix and that feature vector; and the style intensity feature is determined according to the vector matrix and the probability distribution.
The training process of the attribute decoupling encoder is explained below based on the person picture:
Because the stylization category is strongly related to the image of the face, the invention provides a face attribute decoupling encoder, which establishes the strong correlation between style classification and face attributes by expanding the encoder's ability to identify various attributes of the face. The face attribute decoupling encoder can be understood as working in two stages, although end-to-end training is actually adopted: the first stage decouples the attribute features of the face, and the second stage performs style classification and computes the style intensity feature based on the decoupled attribute features.
In the first stage, the face attributes are decoupled by performing multi-attribute classification with one multi-task network: each sub-network corresponds to one attribute classification task, so each attribute can be represented by the features extracted by its sub-network, and the multi-task network thereby decouples the attribute features. Specifically:
(1) A classification network can be used to extract features related to an attribute. First, a neural network is used to extract features from the picture, and the deeper the network layer, the more abstract the extracted features. Second, a neural network is trained for certain tasks, such as classification or detection, so the features extracted by the deep layers of the network are the features required to complete those tasks. Therefore, if a neural network model is used to classify a certain attribute of a human face, the features extracted by the neural network are related to that attribute. For example, for an age classification task, the features extracted by the deep layers of the neural network are correlated with age and can be used to characterize age.
(2) A multi-task network structure can be used to decouple the face attribute features; the multi-task network adopts a U-Net network as the backbone network and a plurality of three-layer fully-connected networks as sub-networks. The idea is that general features are extracted by the backbone network and attribute-specific features are extracted by the sub-networks, which makes multi-tasking possible. The general features are the face features extracted by the backbone network, such as shallow information of the face like contour and color; every subtask requires such information, so the backbone network can be shared across tasks. The attribute-specific features are extracted by the sub-networks: because each sub-network only serves its own subtask, its features represent only the face attribute corresponding to that subtask.
Based on the above two points, a multi-task network can be used to decouple the attribute features of the person picture. The feature of each face attribute is the feature vector output by the penultimate layer of the corresponding sub-network.
Stage two performs the style classification. Because the style category is strongly related to the face attributes, after the decoupled face attribute features are obtained in stage one, these features are spliced, fused and fed into the classification module, a multi-layer fully-connected network, to obtain the style category; at the same time, the feature vector of the last layer of this fully-connected network is taken as a feature vector containing both the face image and the style, referred to as the style feature for short. Existing stylization also has the problem that the stylization strength cannot be controlled; for example, a cuter character should receive a sweet, cute stylization effect, while a more serious middle-aged person should be rendered more realistically and only slightly stylized. To solve this problem, the invention proposes a method for constructing a style intensity feature vector based on a feedback attention mechanism, which is used to control the strength of the generated style. Unlike the general attention computation, the attention distribution is computed in a feedback manner, and the style type is fully taken into account when weighting the information.
The general attention mechanism strengthens effective information and weakens irrelevant information, alleviates model information overload, and can improve the information processing capability of a neural network. The essence of the attention mechanism is an addressing process: given a task-related Query vector q, the attention distribution over the Key vector k is computed and applied to the Value vector v, so that an Attention Value feature vector is obtained. In the general mechanism, this feature vector is computed while the model is in the training state and also participates in model training.
In the method of constructing a feature vector based on a feedback attention mechanism of the present invention, the Query is taken from the feature vector of the last layer of the network, while the Key and Value vectors are taken from the feature vectors of a non-last layer. The purpose is to compute, according to the classification result, the feature vector related to that result; this feature vector is obtained while the model is in the deployment state, and does not need to participate in model training but is used as the style intensity feature.
Therefore, the attention method of the present invention differs from existing attention methods in the following ways:
(1) The states in which they are used are different: the attention of this method is used in the deployment state of the model, and the Attention Value feature vector serves as the style intensity feature and does not participate in model training; existing attention is used in the training state of the model, and the Attention Value feature vector participates in model training.
(2) The effects are different: the attention method of the invention spans network layers and works in a feedback manner based on the output result of the model, so the obtained Attention Value feature vector is strongly related only to the model's output result; existing attention is computed within the same network layer to strengthen effective information related to the loss function and weaken irrelevant information, so the obtained Attention Value feature vector is strongly related to all the targets represented by the loss function, not just one of them.
(3) The sources of query, key and value are different: in the attention method of the invention, the query, key and value come from different network layers; in the general attention method, constrained by model training, they generally all come from the same network layer.
The feature vector construction based on the feedback attention mechanism of the invention is computed as follows. Step 1, information input: X = [x1, x2, ..., xn], where xi denotes the i-th face attribute feature vector;
Step 2, attention distribution computation: q is the feature vector output by the last fully-connected layer of the fully-connected networks, and x is the feature vector output by the penultimate fully-connected layer. This means the attention supervision mechanism works in a feedback manner: the style type is used for supervision to obtain a face attribute feature vector more closely related to the style type, and this feature vector can be used as the style intensity feature to better drive the strength of the style. The computation is:
a = softmax(x^T·q);
where a = [a1, a2, ..., an] is the attention distribution, also referred to as a probability distribution;
Step 3, information weighting:
s = a·x;
where s denotes the style intensity feature, and a is the value obtained at the softmax output of the style classifier, representing the type probability of the style type.
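A minimal sketch of this computation, assuming x is stacked as an (n, d) matrix of attribute feature vectors and q is the d-dimensional output of the last fully-connected layer; the shapes are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def style_intensity_feature(x: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Feedback attention: a = softmax(x^T · q), s = a · x.

    x: (n, d) matrix whose rows are the n face-attribute feature vectors
       output by the penultimate fully-connected layers.
    q: (d,) feature vector output by the last fully-connected (style) layer.
    No gradients are needed because this runs only at deployment time.
    """
    with torch.no_grad():
        a = F.softmax(x @ q, dim=0)      # attention / probability distribution over attributes
        s = a @ x                        # weighted sum of attribute features -> style intensity
    return s
```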
Training an attribute decoupling encoder, wherein the whole process is end-to-end training, and the training process is as follows:
acquiring sample object data with attribute labels, wherein the attribute labels comprise attribute labels and style labels of the sample objects;
the method comprises the steps that sample object data are adopted to conduct iterative training on an original attribute decoupling encoder, wherein the original attribute decoupling encoder comprises a U-Net network and a plurality of three-layer full-connection networks which are connected in sequence;
acquiring the cross-entropy loss of each fully-connected network after each training round, and determining the weighted sum of the cross-entropy losses of the fully-connected networks;
and when the weighted sum is less than or equal to the preset cross-entropy loss, stopping the training of the original attribute decoupling encoder, and storing the original attribute decoupling encoder whose training has stopped as the attribute decoupling encoder.
(1) Data set preparation: sample object data with attribute labels, where the sample object data is a face data set.
1) Attribute labels for sample objects, such as labels for attributes of human faces, include: attribute classification labels such as face shape, expression, hairstyle, eyebrow shape, age and the like;
2) Style labels of the sample objects, such as style labels of human faces: each face is labeled with the style that best matches its image, and the styles include a Disney style, a Japanese-anime style, a Chinese-anime style, a hot-blood anime style, a cute (moe) anime style, and the like.
(2) The attribute decoupling encoder is based on the idea of multi-task learning. Its model structure is shown in fig. 5 and can be divided into three parts: a backbone network, attribute-decoupling sub-networks, and a classification network for style classification. Part A is a module built from a convolutional neural network, and part B is a module built from fully-connected layers.
1) First, the picture is input into the backbone network, which adopts a U-Net model;
2) then several three-layer fully-connected sub-networks follow, each of which predicts one attribute classification. For example, if there are 5 face attribute labels, there are also 5 fully-connected sub-networks, and so on;
3) finally, the feature vectors output by the penultimate layer of each fully-connected sub-network are spliced and fed into another three-layer fully-connected network for style classification.
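For illustration, a minimal PyTorch sketch of this structure is given below; it assumes the U-Net backbone is reduced to a module that outputs a flattened feature vector, and the layer widths, attribute class counts and style count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    """Three-layer fully-connected sub-network for one face-attribute classification task."""
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)     # penultimate layer: decoupled attribute feature
        self.fc3 = nn.Linear(hidden_dim, num_classes)

    def forward(self, h):
        h = torch.relu(self.fc1(h))
        feat = torch.relu(self.fc2(h))                   # attribute feature vector
        return feat, self.fc3(feat)                      # feature + attribute logits

class AttributeDecouplingEncoder(nn.Module):
    """Sketch: backbone + per-attribute FC heads + three-layer FC style classifier."""
    def __init__(self, backbone, feat_dim=512, hidden_dim=128,
                 attr_class_counts=(5, 4, 6, 3, 8), num_styles=5):
        super().__init__()
        self.backbone = backbone                         # assumed to return a (B, feat_dim) vector
        self.heads = nn.ModuleList(
            [AttributeHead(feat_dim, hidden_dim, c) for c in attr_class_counts])
        fused_dim = hidden_dim * len(attr_class_counts)
        self.style_fc = nn.Sequential(                   # three-layer style classifier
            nn.Linear(fused_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU())              # output of this part plays the role of q
        self.style_out = nn.Linear(128, num_styles)

    def forward(self, image):
        h = self.backbone(image)
        feats, attr_logits = zip(*[head(h) for head in self.heads])
        x = torch.cat(feats, dim=1)                      # spliced attribute feature matrix
        q = self.style_fc(x)
        return attr_logits, self.style_out(q), feats, q  # attribute logits, style logits, features, q
```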
(3) The input of the model is a person picture processed by conventional preprocessing means, including RGB channel conversion, scaling, rotation, brightness adjustment, addition of different types of noise, pixel normalization, and the like.
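A possible preprocessing pipeline covering the operations named above is sketched below with torchvision; the concrete size, angle, noise level and normalization statistics are assumptions, not values from the disclosure.

```python
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Lambda(lambda img: img.convert("RGB")),                # RGB channel conversion
    transforms.Resize((512, 512)),                                    # scaling
    transforms.RandomRotation(degrees=15),                            # rotation
    transforms.ColorJitter(brightness=0.2),                           # brightness adjustment
    transforms.ToTensor(),                                            # pixels to [0, 1]
    transforms.Lambda(lambda t: t + 0.01 * torch.randn_like(t)),      # one example type of added noise
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # pixel normalization
])
```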
(4) Because all the tasks are classification tasks, the loss function adopts the cross-entropy loss. In the multi-task training, each task uses one cross-entropy loss, i.e., one fully-connected sub-network corresponds to one cross-entropy loss, and the final loss is the weighted sum of all the cross-entropy losses, where the loss weight of each face attribute classification is 0.1 and that of the style classification is 1. When the weighted sum is less than or equal to the preset cross-entropy loss, the error of the original attribute decoupling encoder is small, so the training of the original attribute decoupling encoder is stopped, and the original attribute decoupling encoder whose training has stopped is stored as the attribute decoupling encoder.
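A minimal sketch of this weighted multi-task loss; the weights follow the values stated above (0.1 per face-attribute task, 1 for style), while the tensor layout is an assumption of the sketch.

```python
import torch.nn.functional as F

def multitask_loss(attr_logits, attr_labels, style_logits, style_label,
                   attr_weight=0.1, style_weight=1.0):
    """Weighted sum of cross-entropy losses: one per attribute head plus the style head."""
    loss = style_weight * F.cross_entropy(style_logits, style_label)
    for logits, labels in zip(attr_logits, attr_labels):
        loss = loss + attr_weight * F.cross_entropy(logits, labels)
    return loss
```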
(5) After training is completed, style types and style characteristics are obtained.
Specifically, after the person picture to be processed is acquired, it is used as the input of the attribute decoupling encoder and processed by the U-Net network and the plurality of three-layer fully-connected networks. The face attribute feature vectors are output by the penultimate fully-connected layers of the fully-connected sub-networks (B1 in fig. 5), and the output face attribute feature vectors are spliced to obtain the corresponding vector matrix; the vector matrix is then input into the last fully-connected network (B1 in fig. 5), which outputs the target style type corresponding to the person picture as well as a feature vector.
After the vector matrix and the feature vector are obtained, the probability distribution of the face attribute information is calculated based on a = softmax(x^T·q) and s = a·x described above, and the face style intensity feature of the person in the person picture is calculated from the vector matrix and the probability distribution.
The person picture may contain one person, that is, one face, or may contain several people, that is, several faces; in other words, it has to be identified whether the target picture contains a single target object or multiple target objects. If the target picture contains a single target object, the last fully-connected layer outputs one style type, and that style type is taken as the target style type. If the target picture contains multiple target objects, the last fully-connected layer outputs multiple style types, the number of which equals the number of target objects, and each style type corresponds to a type probability. In that case, the type probability of each style type output by the last fully-connected layer is obtained, and the style type corresponding to the maximum type probability is determined as the target style type.
As shown in fig. 6, based on the above embodiment, step S220 includes the following steps:
step S221: acquiring a pre-trained global information encoder;
step S222: and identifying the target picture by adopting the global information encoder to obtain the global attribute characteristics of the target picture.
The global information encoder is trained in advance and comprises convolutional layers, normalization layers and activation function layers. The person picture is input into the global information encoder and down-sampled by its convolutional layers, i.e., the person picture is compressed, to obtain a feature map of the person picture; the feature map is then processed by the normalization layers and the activation function layers of the global information encoder in sequence to obtain the global attribute feature of the target picture, which completes the extraction of the global attribute feature of the target picture.
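A minimal sketch of such a down-sampling encoder is shown below; the number of blocks, the channel widths and the choice of InstanceNorm are assumptions (the disclosure only specifies convolution, normalization and activation layers).

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Downsampling block: convolution + normalization + activation (layer types assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),  # downsample / compress
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True))

# Illustrative global information encoder: 512x512x3 input -> 8x8 feature map.
global_info_encoder = nn.Sequential(
    conv_block(3, 64), conv_block(64, 128), conv_block(128, 256),
    conv_block(256, 256), conv_block(256, 256), conv_block(256, 256))
```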
The global information encoder encodes the picture content. Its purpose is to retain all kinds of detailed information from the whole picture, including global information such as the texture, color, edges and content of the picture, as well as the detailed information of the person object. Because the global information encoder retains the global information of the picture, the picture can be restored from this global information.
As shown in fig. 7, based on the above embodiment, step S240 includes the following steps:
step S241: based on the channel dimensions of the global attribute feature, the style feature and the style intensity feature, overlapping the global attribute feature, the style feature and the style intensity feature to obtain the fusion feature;
step S242: obtaining a pre-trained style decoder;
step S243: and identifying the fusion characteristics by adopting the style decoder, and outputting the stylized picture corresponding to the target picture.
Specifically, the channel dimensions of the global attribute feature, the style feature and the style intensity feature can be determined from the matrix dimensions of their corresponding vector matrices. For example, the matrix dimensions of the vector matrices corresponding to the global attribute feature, the style feature and the style intensity feature are all 8 × 8, i.e., 8 channel dimensions; the features are then matched one-to-one according to the channel dimensions and superimposed to obtain the fusion feature.
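A minimal sketch of the fusion step, reading the channel-based superposition as channel-wise concatenation (the channel superposition used for fig. 8); the text also describes the splicing as direct addition, which would instead require equal channel counts. Shapes are assumptions of the sketch.

```python
import torch

def fuse_features(global_feat, style_feat, intensity_feat):
    """Fuse three (C_i, 8, 8) feature maps along the channel dimension."""
    return torch.cat([global_feat, style_feat, intensity_feat], dim=0)  # use dim=1 for batched tensors
```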
After the fusion feature is obtained, it is input into the style decoder. The style decoder identifies the fusion feature so as to restore the global attribute feature contained in it and obtain the original global attribute information, and it finds the preset style picture matched with the face style feature according to the face style feature in the fusion feature. The attribute information of the preset style picture is then adjusted according to the original global attribute information and the face style intensity feature in the fusion feature, so that the stylized person in the adjusted preset style picture is more harmonious and the face attribute information of the stylized person fits the face attribute information of the original person more closely, which improves the coordination of the stylized person.
Like the global information encoder, the style decoder is also trained in advance. The style decoder decodes the fused global information, style feature and style intensity feature: it restores the content of the picture by using the global information, and it drives the generation of the target style by using the style information.
As shown in fig. 8, the training process of the global information encoder and the style decoder is as follows:
acquiring a training style data picture with a style label, training style intensity data and a pre-training attribute decoupling encoder;
training an original global information encoder by adopting the training style data picture, and outputting training global attribute characteristics of the training style data picture by the original global information encoder;
acquiring training style characteristics matched with style labels from a preset style characteristic library;
inputting the training style data picture into an attribute decoupling encoder to obtain a training style intensity characteristic corresponding to the training style data picture;
splicing the training global attribute features, the training style features and the training style intensity features to obtain fusion training features;
training an original style decoder by adopting the fusion training characteristics, and outputting a training stylized picture corresponding to the training style data picture by the original style decoder;
acquiring a preset style picture matched with the training stylized picture according to the style label, and inputting the training stylized picture and the preset style picture into a loss function discrimination network to obtain a loss value;
and when the loss value is less than or equal to the preset value, stopping training of the original global information encoder and the original style decoder, and storing the original global information encoder and the original style decoder which stop training as the global information encoder and the style decoder.
(1) Training set preparation: the training style data pictures serve as the source style data, the training style intensity data serve as the target style data, and a preset style feature library is prepared.
1) Source style data: real-scene picture data containing human faces; a picture may contain several people, but it must contain a face.
2) Target style data: picture data of the Disney style, the Japanese-anime style, the Chinese-anime style, the hot-blood anime style, the cute (moe) anime style, and the like. Each style covers different style strengths, the amount of picture data is evenly distributed across the styles, and a certain amount of data is taken for each style, for example 5,000 pictures per style.
(2) As shown in fig. 8, it is a system flowchart of a global information encoder and a style decoder, where the global information encoder implements extraction of picture global information; the style decoder generates stylized pictures by fusing global information, style characteristics and style strength characteristics.
(3) The global information encoder down-samples the picture to extract a feature map, which amounts to compressing the original image information; its main components are convolutional layers, normalization layers and activation functions.
(4) Style decoder: a structure that up-samples the feature map back to the original-size picture, mainly composed of transposed convolution layers (or up-sampling layers), normalization layers and activation functions.
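A minimal sketch of such an up-sampling decoder; the number of blocks, the channel widths (including the channel count of the fused input) and the Tanh output are assumptions of the sketch.

```python
import torch.nn as nn

def deconv_block(in_ch, out_ch):
    """Upsampling block: transposed convolution + normalization + activation (types assumed)."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True))

# Illustrative style decoder: fused 8x8 feature map -> 512x512x3 stylized picture.
style_decoder = nn.Sequential(
    deconv_block(512, 256), deconv_block(256, 256), deconv_block(256, 128),
    deconv_block(128, 64), deconv_block(64, 32),
    nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
    nn.Tanh())                                           # RGB output in [-1, 1]
```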
(5) Input of the network: a three-channel RGB picture with a length and width of 512 × 512, subjected to conventional preprocessing including RGB channel conversion, scaling, rotation, brightness adjustment, addition of different types of noise, pixel normalization, and the like.
(6) Output of the network: a three-channel RGB picture with a length and width of 512 × 512.
(7) Style feature library: stores style labels and feature vectors containing the corresponding styles. In the training process, such a feature vector is fused with the feature vector from the global information encoder by channel superposition and then sent to the style decoder.
(8) Training process: a batch of pictures, for example 12, is randomly drawn from the source style data as input. Each picture is passed through the global information encoder to obtain the global information feature, and through the attribute decoupling encoder, from which only the style intensity feature is taken; the style feature is taken from the style feature library according to the style type of the label. The three features are fused and the style decoder outputs a batch of pictures; for each output picture, a picture of the same style is randomly drawn from the target data according to the input style type, and finally both are sent to the discrimination network to compute the loss.
(9) Loss function:
A. Adversarial loss: a conventional discrimination network is adopted, composed of 6 convolutional layers followed by 3 fully-connected layers, and the loss is:
L_GAN(G, D) = E_y[log D(y)] + E_x[log(1 − D(G(x)))]
where G is the generator formed by the global information encoder and the style decoder; D is the discriminator; x->y denotes the mapping from the source domain to the target style domain;
B. Cycle-consistency loss: Cycle-GAN and Disco-GAN first proposed the cycle-consistency loss, which forces the two generators to be able to map each other's outputs back to the opposite domain and recover the original images.
L_cyc = E_x[||G_y->x(G_x->y(x)) − x||_1] + E_y[||G_x->y(G_y->x(y)) − y||_1]
where x->y denotes the mapping from the source domain to the target style domain, y->x denotes the mapping from the target style domain to the source domain, and ||·||_1 denotes the L1 loss.
C. Reconstruction loss: unlike loss functions based on domain similarity, the reconstruction loss helps the images generated by the two domain generators remain consistent in the hidden vector space.
[Reconstruction loss L_rec: the formula appears only as an image in the original publication.]
All losses are a weighted sum of three losses:
L = λ1·L_GAN + λ2·L_cyc + λ3·L_rec
where λ1, λ2 and λ3 are preset values, for example λ1 = 1, λ2 = 10, λ3 = 10. When the loss value is less than or equal to the preset value, indicating that the errors of the original global information encoder and the original style decoder are very small, the training of the original global information encoder and the original style decoder is stopped, and the original global information encoder and the original style decoder whose training has stopped are stored as the global information encoder and the style decoder.
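The individual loss formulas are published as images; as a hedged illustration, the sketch below uses the standard adversarial and L1 cycle-consistency forms the text refers to, combined with the stated weights λ1 = 1, λ2 = 10, λ3 = 10, and treats the hidden-space reconstruction term as an assumed L1 distance between hidden features.

```python
import torch
import torch.nn.functional as F

def generator_total_loss(real_src, rec_src, d_fake,
                         hidden_real=None, hidden_fake=None,
                         lam1=1.0, lam2=10.0, lam3=10.0):
    """Weighted sum L = lam1*L_GAN + lam2*L_cyc + lam3*L_rec (generator side, sketch only).

    L_GAN and L_cyc follow the conventional GAN / Cycle-GAN forms cited in the text;
    L_rec is sketched as an L1 distance in a hidden feature space, which is an assumption.
    """
    # Adversarial term: push the discriminator to rate stylized outputs as real.
    l_gan = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    # Cycle-consistency term: the picture mapped to the style domain and back should match the source.
    l_cyc = F.l1_loss(rec_src, real_src)
    # Hidden-space reconstruction term (assumed form).
    l_rec = F.l1_loss(hidden_fake, hidden_real) if hidden_real is not None else real_src.new_zeros(())
    return lam1 * l_gan + lam2 * l_cyc + lam3 * l_rec
```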
As shown in fig. 9, a stylized picture generating apparatus according to the present invention includes:
a first obtaining module 310, configured to determine a target style type and an attribute feature vector corresponding to a target picture to be processed, and determine a style intensity feature corresponding to the target picture according to the attribute feature vector;
a second obtaining module 320, configured to extract a global attribute feature of the target picture;
a third obtaining module 330, configured to search a preset style feature library according to the target style type, so as to obtain style features matched with the target style type;
the picture generating module 340 is configured to splice the global attribute feature, the style feature, and the style intensity feature to obtain a fusion feature, and generate a stylized picture corresponding to the target picture according to the fusion feature.
It should be noted that the stylized picture generating apparatus may further include other optional functional modules, so that it may perform other steps involved in the above embodiments. The embodiments of the stylized image generating apparatus of the present invention are substantially the same as those of the embodiments of the stylized image generating method described above, and are not described herein again.
In addition, to achieve the above object, the present invention also provides a terminal device, including: a memory, a processor, and a stylized picture generation program stored in the memory and executable on the processor, wherein the stylized picture generation program, when executed by the processor, implements the steps of the stylized picture generation method described above.
In addition, to achieve the above object, the present invention also provides a storage medium having a stylized picture generation program stored thereon, which when executed by a processor, implements the steps of the stylized picture generation method described above.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A stylized picture generation method, comprising:
determining a target style type and an attribute feature vector corresponding to a target picture to be processed, and determining style intensity features corresponding to the target picture according to the attribute feature vector;
extracting global attribute features of the target picture;
searching a preset style feature library according to the target style type to obtain style features matched with the target style type;
and splicing the global attribute feature, the style feature and the style intensity feature to obtain a fusion feature, and generating a stylized picture corresponding to the target picture according to the fusion feature.
2. The method of claim 1, wherein the step of determining the target style type and attribute feature vector corresponding to the target picture to be processed, and determining the style intensity feature corresponding to the target picture according to the attribute feature vector comprises:
acquiring a pre-trained attribute decoupling encoder;
and identifying the target picture by adopting the attribute decoupling encoder to obtain the target style type and style intensity characteristic corresponding to the target picture.
3. The method of claim 2, wherein the training of the attribute decoupling encoder comprises:
acquiring sample object data with attribute labels, wherein the attribute labels comprise attribute labels and style labels of sample objects;
iteratively training an original attribute decoupling encoder by adopting the sample object data, wherein the original attribute decoupling encoder comprises a U-Net network and a plurality of three-layer fully-connected networks which are sequentially connected;
acquiring the cross-entropy loss of each fully-connected network after each training round, and determining the weighted sum of the cross-entropy losses of the fully-connected networks;
and when the weighted sum is less than or equal to a preset cross-entropy loss threshold, stopping the training of the original attribute decoupling encoder, and storing the original attribute decoupling encoder whose training has stopped as the attribute decoupling encoder.
4. The method of claim 1, wherein the step of extracting the global attribute feature of the target picture comprises:
acquiring a pre-trained global information encoder;
and identifying the target picture by adopting the global information encoder to obtain the global attribute feature of the target picture.
5. The method according to claim 4, wherein the step of splicing the global attribute feature, the style feature and the style intensity feature to obtain a fusion feature, and generating the stylized picture corresponding to the target picture according to the fusion feature comprises:
splicing the global attribute feature, the style feature and the style intensity feature along their channel dimension to obtain the fusion feature;
obtaining a pre-trained style decoder;
and identifying the fusion feature by adopting the style decoder, and outputting the stylized picture corresponding to the target picture.
6. The method of claim 5, wherein the training of the global information encoder and the style decoder comprises:
acquiring a training style data picture with a style label, training style intensity data, and a pre-trained attribute decoupling encoder;
training an original global information encoder by adopting the training style data picture, so that the original global information encoder outputs a training global attribute feature of the training style data picture;
acquiring a training style feature matched with the style label from a preset style feature library;
inputting the training style data picture into the attribute decoupling encoder to obtain a training style intensity feature corresponding to the training style data picture;
splicing the training global attribute feature, the training style feature and the training style intensity feature to obtain a fusion training feature;
training an original style decoder by adopting the fusion training feature, so that the original style decoder outputs a training stylized picture corresponding to the training style data picture;
acquiring a preset style picture matched with the training stylized picture according to the style label, and inputting the training stylized picture and the preset style picture into a loss function discrimination network to obtain a loss value;
and when the loss value is less than or equal to a preset value, stopping the training of the original global information encoder and the original style decoder, and storing the original global information encoder and the original style decoder whose training has stopped as the global information encoder and the style decoder, respectively.
7. The method of claim 1, wherein the stylized picture generation method further comprises:
acquiring a target picture to be processed;
and when the target picture does not contain a preset trigger action for extracting the target style type, executing the steps of determining the target style type and the attribute feature vector corresponding to the target picture to be processed, and determining the style intensity feature corresponding to the target picture according to the attribute feature vector.
8. A stylized picture generating apparatus, comprising:
the first acquisition module is used for determining a target style type and an attribute feature vector corresponding to a target picture to be processed and determining a style intensity feature corresponding to the target picture according to the attribute feature vector;
the second acquisition module is used for extracting the global attribute feature of the target picture;
the third acquisition module is used for searching a preset style feature library according to the target style type to obtain style features matched with the target style type;
and the picture generation module is used for splicing the global attribute feature, the style feature and the style intensity feature to obtain a fusion feature, and generating a stylized picture corresponding to the target picture according to the fusion feature.
9. A terminal device, characterized in that the terminal device comprises: a memory, a processor, and a stylized picture generation program stored on the memory and executable on the processor, wherein the stylized picture generation program, when executed by the processor, implements the steps of the stylized picture generation method as claimed in any one of claims 1 to 7.
10. A storage medium having stored thereon a stylized picture generation program which, when executed by a processor, implements the steps of the stylized picture generation method of any one of claims 1-7.
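Claims 3, 5 and 6 describe the training side of the system in prose. As a hedged illustration of claim 3 only, the sketch below trains an attribute decoupling encoder built from a U-Net-style backbone followed by several three-layer fully-connected heads, and stops once a weighted sum of the per-head cross-entropy losses falls to a preset threshold. The backbone class, the head widths, the loss weights and the optimiser are assumptions for illustration, not taken from the claims.

```python
import torch
import torch.nn as nn


class AttributeDecouplingEncoder(nn.Module):
    """Assumed structure for claim 3: a backbone plus one three-layer FC head per attribute."""

    def __init__(self, backbone: nn.Module, feat_dim: int, head_classes: list):
        super().__init__()
        self.backbone = backbone  # e.g. a U-Net encoder pooled to a (B, feat_dim) embedding
        self.heads = nn.ModuleList([
            nn.Sequential(                       # one three-layer fully-connected network
                nn.Linear(feat_dim, 256), nn.ReLU(),
                nn.Linear(256, 128), nn.ReLU(),
                nn.Linear(128, n_classes),
            )
            for n_classes in head_classes
        ])

    def forward(self, x: torch.Tensor):
        feat = self.backbone(x)
        return [head(feat) for head in self.heads]


def train_attribute_encoder(model, loader, head_weights, loss_threshold,
                            lr=1e-4, max_epochs=100):
    """Iterate until the weighted sum of per-head cross-entropy losses is <= the preset value."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        for images, label_list in loader:        # label_list: one class-index tensor per head
            logits = model(images)
            losses = [criterion(lg, y) for lg, y in zip(logits, label_list)]
            weighted = sum(w * l for w, l in zip(head_weights, losses))
            optimiser.zero_grad()
            weighted.backward()
            optimiser.step()
        if weighted.item() <= loss_threshold:    # stopping rule of claim 3
            break
    return model
```

Claim 6's joint training of the global information encoder and the style decoder against a discriminator-based loss would follow the same loop structure, with the fusion training feature assembled as in the inference sketch given earlier; it is omitted here to keep the example short.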
CN202210508195.5A 2022-05-11 2022-05-11 Stylized picture generation method, stylized picture generation device, stylized picture generation equipment and storage medium Pending CN114897670A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210508195.5A CN114897670A (en) 2022-05-11 2022-05-11 Stylized picture generation method, stylized picture generation device, stylized picture generation equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210508195.5A CN114897670A (en) 2022-05-11 2022-05-11 Stylized picture generation method, stylized picture generation device, stylized picture generation equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114897670A true CN114897670A (en) 2022-08-12

Family

ID=82720947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210508195.5A Pending CN114897670A (en) 2022-05-11 2022-05-11 Stylized picture generation method, stylized picture generation device, stylized picture generation equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114897670A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501217A (en) * 2023-06-26 2023-07-28 瀚博半导体(上海)有限公司 Visual data processing method, visual data processing device, computer equipment and readable storage medium
CN116501217B (en) * 2023-06-26 2023-09-05 瀚博半导体(上海)有限公司 Visual data processing method, visual data processing device, computer equipment and readable storage medium
CN117036203A (en) * 2023-10-08 2023-11-10 杭州黑岩网络科技有限公司 Intelligent drawing method and system
CN117036203B (en) * 2023-10-08 2024-01-26 杭州黑岩网络科技有限公司 Intelligent drawing method and system

Similar Documents

Publication Publication Date Title
Wang et al. Imaginator: Conditional spatio-temporal gan for video generation
Xiong et al. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks
Nguyen et al. Plug & play generative networks: Conditional iterative generation of images in latent space
US10417526B2 (en) Object recognition method and device
Pang et al. Visual haze removal by a unified generative adversarial network
CN114897670A (en) Stylized picture generation method, stylized picture generation device, stylized picture generation equipment and storage medium
Kazemi et al. Unsupervised image-to-image translation using domain-specific variational information bound
CN111489287A (en) Image conversion method, image conversion device, computer equipment and storage medium
Li et al. Face sketch synthesis using regularized broad learning system
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN113051420B (en) Robot vision man-machine interaction method and system based on text generation video
Tang et al. Attribute-guided sketch generation
CN109345604B (en) Picture processing method, computer device and storage medium
CN115249062B (en) Network model, method and device for generating video by text
Guo et al. Image completion using structure and texture GAN network
CN115131849A (en) Image generation method and related device
CN113378949A (en) Dual-generation confrontation learning method based on capsule network and mixed attention
Daihong et al. Facial expression recognition based on attention mechanism
Chiu et al. A style controller for generating virtual human behaviors
CN114494543A (en) Action generation method and related device, electronic equipment and storage medium
CN111242216A (en) Image generation method for generating anti-convolution neural network based on conditions
CN110826563A (en) Finger vein segmentation method and device based on neural network and probability map model
CN115294353A (en) Crowd scene image subtitle description method based on multi-layer attribute guidance
Talafha et al. Attentional adversarial variational video generation via decomposing motion and content
CN112084371B (en) Movie multi-label classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination