CN113255813B - Multi-style image generation method based on feature fusion - Google Patents

Multi-style image generation method based on feature fusion

Info

Publication number
CN113255813B
CN113255813B
Authority
CN
China
Prior art keywords
style
feature
network
content
semantic
Prior art date
Legal status
Active
Application number
CN202110635370.2A
Other languages
Chinese (zh)
Other versions
CN113255813A (en)
Inventor
余月
李本源
李能力
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202110635370.2A priority Critical patent/CN113255813B/en
Publication of CN113255813A publication Critical patent/CN113255813A/en
Application granted granted Critical
Publication of CN113255813B publication Critical patent/CN113255813B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

The invention discloses a multi-style image generation method based on feature fusion, and belongs to the field of computer vision. The implementation method of the invention comprises the following steps: inputting the semantic segmentation graph into a content feature extraction network and extracting the content feature vector of the semantic graph; inputting the style graph into a style feature extraction network and extracting the style feature vector of the style graph; inputting the extracted content feature vector f_c and style feature vector f_s into a content-style feature fusion network for feature fusion to obtain a fused feature vector; constructing a generative adversarial network consisting of a generator and a discriminator, and training it on a data set with a designed loss function; and using the trained generator with the minimized loss function to generate multi-style images that have the content of the semantic graph and the style of the style graph. The generated multi-style images can be applied to attention-attracting scenarios, solving the related engineering technical problems.

Description

Multi-style image generation method based on feature fusion
Technical Field
The invention relates to an image generation method for generating multi-style images from a semantic segmentation graph, and in particular to a method capable of realizing fast, end-to-end generation from a semantic graph to multi-style images, belonging to the field of computer vision.
Background
At present, most models for multi-style image generation produce stylized results from real images guided by a style image; the few models that generate stylized images from semantic graphs only use images from the same data set as the input style, so fast transfer of arbitrary styles cannot be realized.
Generating images of arbitrary style from a semantic graph end to end is of great significance for art design and for the generation of virtual-reality education resources. In the field of art design, an art creator or designer only needs to specify the position and rough shape of each object in the semantic graph and the desired style to quickly generate style images satisfying both the semantic and the style constraints, greatly reducing the time cost of creation and design. In the direction of multimedia education resource generation, a teacher can use simple semantic graph information to generate multi-style teaching scene images, which greatly enriches teaching resources; teaching scenes in a variety of styles can better attract students' attention and raise their interest in learning. Meanwhile, quickly generating teaching scene images from semantic graphs greatly reduces the time spent producing new image resources.
Disclosure of Invention
Aiming at the problem described in the background art that generating multi-style images from a semantic graph is severely limited, the multi-style image generation method based on feature fusion disclosed by the invention aims to solve the following technical problem: providing a network framework, composed of a content feature extraction network, a style feature extraction network and a content-style feature fusion network, that generates style images from semantic graphs. The content features and style features are extracted by the content feature extraction network and the style feature extraction network respectively, and the content-style feature fusion network fuses the features extracted by the first two networks to generate multi-style images that have the content of the semantic graph and the style of the style graph. The invention has the advantages of being fast, convenient, widely applicable and producing good generation results. The generated multi-style images with semantic-graph content and style-graph style can be applied to attention-attracting scenarios, solving the related engineering technical problems.
In order to achieve the above purpose, the invention adopts the following technical scheme.
The invention discloses a multi-style image generation method based on feature fusion. A semantic segmentation graph is input into a content feature extraction network to extract the content feature vector of the semantic graph. The style graph is input into a style feature extraction network to extract the style feature vector of the style graph. The extracted content feature vector f_c and style feature vector f_s are input into a content-style feature fusion network for feature fusion to obtain a fused feature vector. A generative adversarial network consisting of a generator and a discriminator is constructed and trained on a data set with a designed loss function. The generator with the minimized loss function obtained by training is then used to generate multi-style images that have the content of the semantic graph and the style of the style graph. The generated multi-style images can be applied to attention-attracting scenarios, solving the related engineering technical problems.
The invention discloses a multi-style image generation method based on feature fusion, which comprises the following steps:
step 1: and inputting the semantic segmentation graph into a content feature extraction network, and extracting a content feature vector in the semantic graph.
The content feature extraction network in step 1 is a multi-path feature extraction network mainly composed of three branch paths, namely a Global Space Path (GSP), a Classification Space Path (CSP) and a Classification Semantic Path (CCP). The global space path GSP is used to extract global spatial features, the classification space path CSP is used to extract the classification spatial features of the semantic graph, and the classification semantic path CCP is used to extract classification semantic features.
The input of the global space path is the whole semantic graph, and a feature map containing global spatial information is obtained through a convolutional network.
The structure of the classification space path is the same as that of the global space path; the only difference is the input. The input of the classification space path is not the whole semantic graph: the semantic graph is first split according to category so that each channel contains only one category, and the channels are then concatenated into a multi-channel classified semantic graph. A convolution operation is applied to each category of the classified semantic graph separately to compute the spatial features of each category.
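For illustration, the per-category splitting described above can be sketched as follows in PyTorch; the framework choice, the function name and the number of categories are assumptions of this sketch, not part of the patented design:

    import torch
    import torch.nn.functional as F

    def split_by_category(label_map: torch.Tensor, num_classes: int) -> torch.Tensor:
        """label_map: [B, H, W] integer class ids -> [B, num_classes, H, W], one category per channel."""
        one_hot = F.one_hot(label_map.long(), num_classes)    # [B, H, W, C]
        return one_hot.permute(0, 3, 1, 2).float()             # [B, C, H, W]

    # Example: a 256x512 label map with an assumed 35 categories
    labels = torch.randint(0, 35, (1, 256, 512))
    class_channels = split_by_category(labels, num_classes=35)  # [1, 35, 256, 512]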
The classification semantic path adopts a lightweight ResNet network model and global average pooling to enlarge the receptive field; global average pooling is added at the end of the ResNet model so that the maximum receptive field and global context information of each category can be provided. In addition, an Attention Extraction Module (AEM) is used in the classification semantic path. The attention extraction module captures the global semantic information of the feature map with an attention mechanism and computes attention vectors that assign different weights to different positions, thereby guiding network learning.
After the three branch paths of the multi-path feature extraction network have extracted the global spatial information, classification spatial information and classification semantic information respectively, the features output by the three branch paths are fused by a Feature Fusion Module (FFM). After feature fusion, a Conditional Normalization Block (CNB) takes the processed classified semantic graph as an additional conditional input and assigns different normalization parameters to semantic regions of different categories, so that the information in the semantic graph is fully retained and the content feature vector f_c is obtained.
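A minimal sketch of a conditional normalization block in this spirit is given below: the classified semantic graph conditions per-pixel scale and shift parameters applied after normalization. The layer layout, channel widths and the use of instance normalization are assumptions of the sketch, not the patented CNB design:

    import torch
    import torch.nn as nn

    class ConditionalNormBlock(nn.Module):
        def __init__(self, feat_channels: int, label_channels: int, hidden: int = 128):
            super().__init__()
            self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
            self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
            self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)  # category-conditioned scale
            self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)   # category-conditioned shift

        def forward(self, feat, seg):
            # resize the classified semantic graph to the feature resolution, then modulate
            seg = nn.functional.interpolate(seg, size=feat.shape[-2:], mode="nearest")
            h = self.shared(seg)
            return self.norm(feat) * (1 + self.gamma(h)) + self.beta(h)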
In order to balance the number of network parameters against the effect of spatial information extraction, preferably, the convolutional network in step 1 is a three-layer convolutional network; each layer comprises a convolutional layer, a normalization layer and an activation function layer, and the feature map output after the three convolutions is 1/8 the size of the original image.
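A minimal sketch of such a three-layer convolutional path is shown below; each layer is convolution + normalization + activation, and three stride-2 convolutions reduce the spatial size to 1/8 of the input (the channel widths are assumptions of the sketch):

    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),  # halves H and W
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    spatial_path = nn.Sequential(conv_block(3, 64), conv_block(64, 128), conv_block(128, 512))

    x = torch.randn(1, 3, 256, 512)   # a semantic graph of size [3, 256, 512]
    print(spatial_path(x).shape)      # torch.Size([1, 512, 32, 64]), i.e. 1/8 resolution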
Step 2: inputting the style graph into a style feature extraction network, and extracting the style feature vector of the style graph.
The style feature extraction network in step 2 uses a pre-trained VGG16 network. The features of the input style graph t before the activation layers are extracted with the VGG16 network and used as the original features for feature fusion. Because these features come from different levels, the feature fusion module FFM fuses them in sequence from deep to shallow. The fused features then pass through the attention extraction module AEM, in which a self-attention model applies attention weighting to the different channels to obtain the style feature vector f_s.
Preferably, the features of the input style graph t before the activation layers relu1_2, relu2_2, relu3_3 and relu4_3 of the VGG16 network, i.e. f_relu1_2(t), f_relu2_2(t), f_relu3_3(t) and f_relu4_3(t), are extracted and used as the original features for feature fusion. Because these features come from different levels, the feature fusion module FFM fuses them in sequence from deep to shallow. The fused features then pass through the attention extraction module AEM, in which a self-attention model applies attention weighting to the different channels to obtain the style feature vector f_s.
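A hedged sketch of this multi-level extraction with torchvision's pre-trained VGG16 follows; the slice indices select the convolution outputs just before the relu1_2, relu2_2, relu3_3 and relu4_3 activations in torchvision's layer ordering, and the use of torchvision itself is an assumption of the sketch:

    import torch
    from torchvision.models import vgg16

    feats = vgg16(weights="IMAGENET1K_V1").features.eval()
    for p in feats.parameters():
        p.requires_grad_(False)

    # pre-activation endpoints for relu1_2, relu2_2, relu3_3, relu4_3
    slices = [feats[:3], feats[3:8], feats[8:15], feats[15:22]]

    def extract_style_features(t: torch.Tensor):
        out, x = [], t
        for s in slices:
            x = s(x)
            out.append(x)   # f_relu1_2(t), f_relu2_2(t), f_relu3_3(t), f_relu4_3(t)
        return out

    style_feats = extract_style_features(torch.randn(1, 3, 256, 512))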
Step 3: inputting the extracted content feature vector f_c and style feature vector f_s into a content-style feature fusion network for feature fusion to obtain the fused feature vector f_cs.
The content-style feature fusion network in step 3 performs feature fusion by means of a WCT (Whiten-Color Transform) matrix transformation. The WCT matrix transformation applies a Whiten transform followed by a Color transform to the content-image feature f_c and the style-graph feature f_s, yielding a fused feature f_cs that carries the content features of the content graph and the style features of the style graph. The WCT transform is therefore divided into two parts, the Whiten transform and the Color transform.
The Whiten transform takes the feature f_c of the content image in the VGG16 feature space, computes its covariance matrix, performs SVD on the covariance matrix, and whitens the feature with the matrices obtained from the decomposition, stripping the color characteristics from the content image so that the transformed feature retains only the content contour. The Whiten transform is implemented as

\hat{f}_c = E_c D_c^{-1/2} E_c^T f_c

where f_c is the content-image feature extracted by VGG16, \hat{f}_c is the whitened content feature, D_c is a diagonal matrix whose elements are the eigenvalues of the covariance matrix f_c f_c^T, E_c is an orthogonal matrix satisfying f_c f_c^T = E_c D_c E_c^T, and D_c and E_c are obtained from the SVD of the covariance matrix.
The Color transform takes the feature f_s of the style image in the VGG16 feature space, first computes its covariance matrix and performs SVD on it, and then applies to the whitened content feature the inverse of the Whiten transform, i.e. the Color transform, transferring the whitened content feature onto the feature distribution of the style graph to obtain the WCT-transformed feature vector. The Color transform is implemented as

f_{cs} = E_s D_s^{1/2} E_s^T \hat{f}_c

where D_s and E_s are obtained from the SVD of the covariance matrix f_s f_s^T of the style feature. After the WCT matrix transformation, a feature fusion module FFM fuses the content feature vector with the WCT-transformed feature vector, strengthening the content constraint of the semantic graph in the fused vector and producing the final style-content feature fusion vector f_cs.
Step 4: constructing a generative adversarial network consisting of a generator and a discriminator, and training the generative adversarial network on the data set with a designed loss function, i.e. training to obtain the generative adversarial network with the minimized loss function.
The method uses a network framework composed of three sub-networks, i.e. the content feature extraction network, the style feature extraction network and the content-style feature fusion network, to generate style images from semantic graphs: the content feature extraction network and the style feature extraction network extract the content features and the style features respectively, and the content-style feature fusion network fuses the features extracted by the first two networks to generate multi-style images that have the content of the semantic graph and the style of the style graph.
The generator in step 4 is the network composed of the content feature extraction network, the style feature extraction network and the content-style feature fusion network that generates style images from semantic graphs. The discriminator is a multi-scale discriminator composed of a global discriminator D_1 and a local discriminator D_2; the two discriminators have the same network structure but operate at different image scales.
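A minimal sketch of such a two-scale discriminator pair is given below: two discriminators with identical structure, one applied to the full-resolution image and one to a downsampled copy. The PatchGAN-style layer layout and channel widths are assumptions of the sketch, not the patented architecture:

    import torch
    import torch.nn as nn

    def make_discriminator(in_ch: int = 3) -> nn.Sequential:
        layers, ch = [], 64
        layers += [nn.Conv2d(in_ch, ch, 4, 2, 1), nn.LeakyReLU(0.2, True)]
        for _ in range(3):
            layers += [nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.InstanceNorm2d(ch * 2), nn.LeakyReLU(0.2, True)]
            ch *= 2
        layers += [nn.Conv2d(ch, 1, 4, 1, 1)]   # patch-level real/fake scores
        return nn.Sequential(*layers)

    d_a = make_discriminator()   # same structure ...
    d_b = make_discriminator()   # ... applied at two different image scales

    img = torch.randn(1, 3, 256, 512)
    score_full = d_a(img)
    score_half = d_b(nn.functional.avg_pool2d(img, 2))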
The loss function designed in step 4 is:
L(G, D_1, D_2) = λ_1 L_perc + λ_2 L_adv + λ_3 L_FM + λ_4 L_CX + λ_5 L_TV

where λ_1, λ_2, λ_3, λ_4 and λ_5 are settable weighting parameters, G is the generator, D_1 is the local discriminator, D_2 is the global discriminator, x is the input semantic graph, t is the input style graph, and y is the generated multi-style image.

L_perc is the perceptual loss that measures the content difference. It is computed as a weighted sum over the VGG16 feature levels, where F^(i) denotes the feature extractor formed by the layers of the VGG16 network before the i-th activation layer and w_i is the adaptive weight of the i-th layer; the deeper the feature layer, the larger the weight.

L_adv is the adversarial loss between the generator G and the discriminators D_1 and D_2.

L_FM is the feature matching loss, which matches the intermediate discriminator features of real and generated images, where T denotes the number of network layers of the discriminator D_k and N_i denotes the number of elements in each layer.

L_CX is the contextual loss that measures the style difference:

L_CX = -log( CX(φ_l(x), φ_l(t)) )

where CX(φ_l(x), φ_l(t)) is the cosine similarity between the l-th level VGG16 features of the semantic graph x and the style graph t.

L_TV is the total variation loss:

L_TV = Σ_{(i,j) ∈ N} [ (y_{i,j+1} - y_{i,j})^2 + (y_{i+1,j} - y_{i,j})^2 ]

where i and j are the coordinate values of pixels in the image and N is the pixel range size of the image.
In order to fully account for the influence of features of different depths on the loss calculation, preferably, five levels of features extracted by the VGG16 network are used in step 4, i.e. N = 5, and the weights w_i are 1/32, 1/16, 1/8, 1/4 and 1 in turn; the deeper the feature level, the larger the weight.
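As a sketch, the layer weighting described above (five VGG16 feature levels with weights 1/32, 1/16, 1/8, 1/4 and 1) can be applied as a weighted feature comparison; the use of an L1 distance and of a reference image are assumptions about the exact form of the perceptual term, not statements of the patented formula:

    import torch

    layer_weights = [1/32, 1/16, 1/8, 1/4, 1.0]   # deeper layers receive larger weights (N = 5)

    def weighted_feature_loss(feats_generated, feats_reference):
        """Both arguments are lists of 5 feature tensors taken from the same VGG16 layers."""
        loss = torch.zeros(())
        for w, fg, fr in zip(layer_weights, feats_generated, feats_reference):
            loss = loss + w * torch.mean(torch.abs(fg - fr))
        return loss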
Step 5: using the generator with the minimized loss function obtained by the training in step 4, the style-content feature fusion vector f_cs obtained in step 3 becomes the multi-style image y that has the content of the semantic graph and the style of the style graph, i.e. multi-style image generation based on feature fusion is realized.
The method further comprises step 6: applying the multi-style images generated in step 5, which have the content of the semantic graph and the style of the style graph, to attention-attracting scenarios, thereby solving the related engineering technical problems.
The related engineering technical problems in step 6 include practical problems such as creative advertisement design, game scene design and teaching scene image design.
Beneficial effects:
1. The multi-style image generation method based on feature fusion disclosed by the invention provides a network framework, composed of a content feature extraction network, a style feature extraction network and a content-style feature fusion network, that generates style images from semantic graphs end to end.
2. The multi-style image generation method based on feature fusion disclosed by the invention places no restriction on the input images: after training is finished, multi-style images with the content of any semantic graph and the style of any style graph can be generated, satisfying the generation requirements of different tasks. The method therefore has the advantage of wide applicability.
3. None of the existing multi-style image generation frameworks can generate multi-style images from a semantic graph end to end: a real image conforming to the semantic graph must be generated first, and its style is then transferred. The disclosed method generates multi-style images from the semantic graph and the style graph directly, end to end, and therefore has the advantages of being fast and convenient.
4. The multi-style image generation method based on feature fusion disclosed by the invention applies the generated multi-style images, which have the content of the semantic graph and the style of the style graph, to attention-attracting scenarios and solves the related engineering technical problems, such as practical problems in creative advertisement design, game scene design and teaching scene image design.
Drawings
FIG. 1 is a flow chart of an implementation of a multi-style image generation method based on feature fusion of the present invention;
fig. 2 is a structural diagram of the content feature extraction network of the present invention, in which fig. 2 (a) is a structural diagram of the entire content feature extraction network, fig. 2 (b) is a structural diagram of the Attention Extraction Module (AEM), fig. 2 (c) is a structural diagram of the Feature Fusion Module (FFM), and fig. 2 (d) is a structural diagram of the Conditional Normalization Block (CNB);
FIG. 3 is a block diagram of a style feature extraction network in accordance with the present invention;
FIG. 4 is a block diagram of a content style feature fusion network in accordance with the present invention;
FIG. 5 is a block diagram of a generator in the present invention;
FIG. 6 is a structural diagram of an arbiter in the present invention;
FIG. 7 is a graph of the effect of the invention on the Cityscapes dataset.
Detailed Description
The following describes embodiments of the present invention with reference to the drawings.
As shown in fig. 1, the multi-style image generation method based on feature fusion disclosed in this embodiment can be applied to the Cityscapes data set for entertainment-related applications. For example, in the creation of movies, animations and games it can perform style rendering on street views, turning the same street view into different styles and creating the desired movie, animation or game style. It can also reduce the cost of creation, save production time, and increase interaction with the audience or players. The training and image generation flow of this embodiment is shown in fig. 1.
Step 1: the semantic segmentation graph is input into a content feature extraction network, content feature vectors in the semantic graph are extracted, and the structure diagram of the content feature extraction network is shown in fig. 2 (a).
The size of the semantic graph input in step 1 is [3,256,512]. The feature maps obtained from the classification space path and the global space path have size [512,32,64], and the feature map obtained from the classification semantic path has size [256,128,256]. The network structure of the Attention Extraction Module (AEM) used in the classification semantic path is shown in fig. 2 (b). After the three features are obtained, they are fused by the Feature Fusion Module (FFM), whose structure is shown in fig. 2 (c), into a fused feature of size [512,128,256]. Finally, the fused feature is up-sampled by the Conditional Normalization Block (CNB), shown in fig. 2 (d), to obtain the final content feature vector f_c, whose size is [256,128,256].
Step 2: the style graph is input into the style feature extraction network and the style feature vector of the style graph is extracted; the structure of the style feature extraction network is shown in fig. 3.
The style feature extraction network in step 2 uses a pre-trained VGG16 network. The size of the style graph t input to the network is [3,256,512]. The features before the activation layers relu1_2, relu2_2, relu3_3 and relu4_3 of VGG16, i.e. f_relu1_2(t), f_relu2_2(t), f_relu3_3(t) and f_relu4_3(t), are extracted; the sizes of the extracted features are [128,256,512], [256,128,256], [512,64,128] and [512,64,128] respectively. Since these features come from different levels, they are fused in sequence from deep to shallow by the feature fusion module FFM, whose structure is shown in fig. 2 (c). Finally, the fused features pass through the attention extraction module AEM, in which a self-attention model applies attention weighting to the different channels to obtain the final style feature vector f_s. The network structure of the AEM is shown in fig. 2 (b), and the resulting style feature vector f_s has size [256,128,256].
Step 3: the extracted content feature vector f_c and style feature vector f_s are input into the content-style feature fusion network for feature fusion to obtain the content-style fusion feature f_cs; the structure of the content-style feature fusion network is shown in fig. 4.
The content feature vector f_c and the style feature vector f_s input into the content-style feature fusion network in step 3 both have size [256,128,256]. The WCT matrix transformation does not change the size of the feature, but the transformed feature vector already carries the content information of the content graph and the style information of the style graph. After the WCT matrix transformation, a feature fusion module FFM fuses the content feature vector with the WCT-transformed feature vector, strengthening the content constraint of the semantic graph in the fused vector; the fused vector has size [256,128,256]. Up-sampling with a deconvolution operation then yields the final style-content fusion feature f_cs of size [3,256,512], which better satisfies the content constraint of the semantic graph while carrying the artistic style of the input style graph.
Step 4: a generative adversarial network consisting of a generator and a discriminator is constructed and trained on the data set with the designed loss function, i.e. the generative adversarial network with the minimized loss function is obtained by training.
The network structure of the generator in step 4 is shown in fig. 5, and the network structure of the discriminator is shown in fig. 6. The generator is the generator network consisting of the content feature extraction network, the style feature extraction network and the content-style feature fusion network of steps 1 to 3, and the discriminator is a multi-scale discriminator composed of a global discriminator D_1 and a local discriminator D_2. The loss function used during training is:
L(G, D_1, D_2) = λ_1 L_perc + λ_2 L_adv + λ_3 L_FM + λ_4 L_CX + λ_5 L_TV

where λ_1, λ_2, λ_3, λ_4 and λ_5 are settable weighting parameters, G is the generator, D_1 is the local discriminator, D_2 is the global discriminator, x is the input semantic graph, t is the input style graph, and y is the generated multi-style image.

L_perc is the perceptual loss that measures the content difference. It is computed as a weighted sum over the VGG16 feature levels, where F^(i) denotes the feature extractor formed by the layers of the VGG16 network before the i-th activation layer and w_i is the adaptive weight of the i-th layer. In the experiments, five levels of VGG16 features are extracted, i.e. N = 5, and w_i are 1/32, 1/16, 1/8, 1/4 and 1 in turn; the deeper the feature level, the larger the weight.

L_adv is the adversarial loss between the generator G and the discriminators D_1 and D_2.

L_FM is the feature matching loss, which matches the intermediate discriminator features of real and generated images, where T denotes the number of network layers of the discriminator D_k and N_i denotes the number of elements in each layer.

L_CX is the contextual loss that measures the style difference:

L_CX = -log( CX(φ_l(x), φ_l(t)) )

where CX(φ_l(x), φ_l(t)) is the cosine similarity between the l-th level VGG16 features of the semantic graph x and the style graph t.

L_TV is the total variation loss:

L_TV = Σ_{(i,j) ∈ N} [ (y_{i,j+1} - y_{i,j})^2 + (y_{i+1,j} - y_{i,j})^2 ]

where i and j are the coordinate values of pixels in the image and N is the pixel range size of the image.
During the training of the invention, the number of training epochs is 300, with λ_1 = 10, λ_2 = 1, λ_3 = 1 and λ_5 = 0.00001. The coefficient λ_4 of the contextual loss, which controls the style difference, is kept small at 0.1 during the first 150 epochs; during the last 150 epochs λ_4 is gradually increased until it reaches a maximum value of 20.
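A sketch of this weight schedule is shown below; the linear ramp is an assumption, since the patent only states that λ_4 is gradually increased to a maximum of 20 in the last 150 epochs:

    TOTAL_EPOCHS = 300
    LAMBDA_1, LAMBDA_2, LAMBDA_3, LAMBDA_5 = 10.0, 1.0, 1.0, 1e-5

    def lambda_4(epoch: int) -> float:
        """Contextual-loss weight: 0.1 for the first 150 epochs, then ramped toward 20."""
        if epoch < 150:
            return 0.1
        progress = (epoch - 150) / (TOTAL_EPOCHS - 150)
        return 0.1 + progress * (20.0 - 0.1)

    for epoch in range(TOTAL_EPOCHS):
        l4 = lambda_4(epoch)
        # total = LAMBDA_1*L_perc + LAMBDA_2*L_adv + LAMBDA_3*L_FM + l4*L_CX + LAMBDA_5*L_TV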
Step 5: using the generator with the minimized loss function trained in step 4, the style-content feature fusion vector f_cs obtained in step 3 becomes a multi-style image y that has the content of the semantic graph and the style of the style graph.
In step 5, this embodiment achieves good generation results on the public Cityscapes data set. The Cityscapes data set is a large-scale data set containing multiple stereoscopic video sequences recorded in the street scenes of 50 different cities. It can be applied to the creation of movies, animations and games, performing style rendering on street views, turning the same street view into different styles and creating the desired movie, animation or game style. The generation results are shown in fig. 7.
In summary, in this embodiment the semantic graph and the style graph are input into the generative adversarial network, and the generative adversarial network model is trained to obtain a well-trained generator; this generator can then produce images that satisfy both the content constraint of the semantic graph and the style constraint of the style graph. The embodiment overcomes the problems of the traditional approach, in which the time and labour costs of generation are high and the result cannot be guaranteed.
The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A multi-style image generation method based on feature fusion, characterized by comprising the following steps:
step 1: inputting the semantic segmentation graph into a content feature extraction network, and extracting a content feature vector in the semantic graph;
step 2: inputting the style diagram into a style feature extraction network, and extracting style feature vectors in the style diagram;
and 3, step 3: extracting the content feature vector f c And style feature vector f s Inputting a content style feature fusion network for feature fusion to obtain a fusion feature vector f after feature fusion cs
step 4: constructing a generative adversarial network consisting of a generator and a discriminator, and training the generative adversarial network on a data set with a designed loss function, i.e. training to obtain the generative adversarial network with the minimized loss function;
the system comprises a network framework, a content characteristic extraction network, a style characteristic extraction network and a content style characteristic fusion network, wherein the network framework is formed by three parts of networks, namely a semantic graph generating style image, the content characteristic extraction network and the style characteristic extraction network are used for respectively extracting content characteristics and style characteristics, and the content style characteristic extraction network and the style characteristic extraction network are used for fusing the characteristics extracted by the two networks to generate a multi-style image with semantic graph contents and style;
the generator in the step 4 is a network which is composed of a content characteristic extraction network, a style characteristic extraction network and a content style characteristic fusion network and is used for generating style images from semantic graphs; the arbiter is composed of a global arbiter D 1 And a local discriminator D 2 The multi-stage discriminators are formed, have the same network structure and operate on different image scales;
the loss function designed in step 4 is:
L(G, D_1, D_2) = λ_1 L_perc + λ_2 L_adv + λ_3 L_FM + λ_4 L_CX + λ_5 L_TV

wherein λ_1, λ_2, λ_3, λ_4 and λ_5 are settable parameters, G is the generator, D_1 is the local discriminator, D_2 is the global discriminator, x is the input semantic graph, t is the input style graph, and y is the generated multi-style image;

L_perc is the perceptual loss that measures the content difference, computed as a weighted sum over the VGG16 feature levels, wherein F^(i) represents the feature extractor formed by the layers of the VGG16 network before the i-th activation layer and w_i is the adaptive weight of the i-th layer; the deeper the feature layer, the larger the weight;

L_adv is the adversarial loss between the generator G and the discriminators D_1 and D_2;

L_FM is the feature matching loss, which matches the intermediate discriminator features of real and generated images, wherein T represents the number of network layers of the discriminator D_k and N_i represents the number of elements of each layer;

L_CX is the contextual loss that measures the style difference:

L_CX = -log( CX(φ_l(x), φ_l(t)) )

wherein CX(φ_l(x), φ_l(t)) is the cosine similarity between the l-th level VGG16 features of the semantic graph x and the style graph t;

L_TV is the total variation loss:

L_TV = Σ_{(i,j) ∈ N} [ (y_{i,j+1} - y_{i,j})^2 + (y_{i+1,j} - y_{i,j})^2 ]

wherein i and j are the coordinate values of pixels in the image and N is the pixel range size of the image;
and 5: training by using the step 4 to obtain a generator with minimized loss function, wherein the style content feature fusion vector f obtained in the step 3 cs The multi-style image t with semantic graph content and style of the style is formed, namely, the multi-style image generation is realized based on the feature fusion.
2. The multi-style image generation method based on feature fusion as claimed in claim 1, characterized by further comprising step 6: applying the multi-style images generated in step 5, which have the content of the semantic graph and the style of the style graph, to attention-attracting scenarios, thereby solving the related engineering technical problems.
3. The multi-style image generation method based on feature fusion as claimed in claim 2, characterized in that the related engineering technical problems in step 6 include practical problems such as creative advertisement design, game scene design and teaching scene image design.
4. The multi-style image generation method based on feature fusion as claimed in claim 1, 2 or 3, characterized in that: the content feature extraction network in step 1 is a multi-path feature extraction network mainly composed of three branch paths, namely a Global Space Path (GSP), a Classification Space Path (CSP) and a Classification Semantic Path (CCP); the global space path GSP is used to extract global spatial features, the classification space path CSP is used to extract the classification spatial features of the semantic graph, and the classification semantic path CCP is used to extract classification semantic features;
the input of the global space path is a whole semantic graph, and a feature graph containing global space information is obtained through convolution network processing;
the structure of the classification space path is the same as that of the global space path, and the only difference is that the input is different; the input of the semantic space path is not a whole semantic graph, but the semantic graph is firstly divided according to different categories, each channel has only one category, then the semantic graphs are spliced together to form a multi-channel classification semantic graph, each category of the classification semantic graph is respectively subjected to convolution operation, and the spatial feature of each category is calculated;
the classification semantic path adopts a lightweight ResNet network model and global average pooling to expand the receptive field, and global average pooling is added at the tail of the ResNet network model, so that the receptive field and global context information of each category can be provided to the maximum extent; in addition, an Attention Extraction Module AEM (Attention Extraction Module) is also used in the classification semantic path; the attention extraction module captures global semantic information of the feature map by using an attention mechanism, and calculates attention vectors to give different weights to different positions so as to achieve the purpose of guiding network learning;
after global space information, classification space information and classification semantic information are respectively extracted from three branch paths in a multi-path generation network, the features output by the three branch paths are fused by a Feature Fusion Module (FFM); after feature fusion, a Conditional Normalization module CNB (Conditional Normalization Block) is used for taking the processed classified semantic graphs as additional condition input, different Normalization parameters are given to the semantic graphs with different categories, information in the semantic graphs is fully reserved, and content feature vectors f are obtained c
5. The multi-style image generation method based on feature fusion as claimed in claim 4, wherein: the style feature extraction network in step 2 uses a pre-trained VGG16 network; the features of the input style graph t before the activation layers are extracted with the VGG16 network and used as the original features for feature fusion; because these features come from different levels, the feature fusion module FFM fuses them in sequence from deep to shallow; the fused features then pass through the attention extraction module AEM, in which a self-attention model applies attention weighting to the different channels to obtain the style feature vector f_s.
6. The multi-style image generation method based on feature fusion as claimed in claim 5, wherein: the content-style feature fusion network in step 3 performs feature fusion by means of a WCT (Whiten-Color Transform) matrix transformation; the WCT matrix transformation applies a Whiten transform followed by a Color transform to the content-image feature f_c and the style-graph feature f_s, yielding a fused feature that carries the content features of the content graph and the style features of the style graph; the WCT transform is divided into two parts, the Whiten transform and the Color transform;

the Whiten transform takes the feature f_c of the content image in the VGG16 feature space, computes its covariance matrix, performs SVD on the covariance matrix, and whitens the feature with the matrices obtained from the decomposition, stripping the color characteristics from the content image so that the transformed feature retains only the content contour; the Whiten transform is implemented as

\hat{f}_c = E_c D_c^{-1/2} E_c^T f_c

wherein f_c is the content-image feature extracted by VGG16, \hat{f}_c is the whitened content feature, D_c is a diagonal matrix whose elements are the eigenvalues of the covariance matrix f_c f_c^T, E_c is an orthogonal matrix satisfying f_c f_c^T = E_c D_c E_c^T, and D_c and E_c are obtained from the SVD of the covariance matrix;

the Color transform takes the feature f_s of the style image in the VGG16 feature space, first computes its covariance matrix and performs SVD on it, and then applies to the whitened content feature the inverse of the Whiten transform, i.e. the Color transform, transferring the whitened content feature onto the feature distribution of the style graph to obtain the WCT-transformed feature vector; the Color transform is implemented as

f_{cs} = E_s D_s^{1/2} E_s^T \hat{f}_c

wherein D_s and E_s are obtained from the SVD of the covariance matrix f_s f_s^T of the style feature; after the WCT matrix transformation, a feature fusion module FFM fuses the content feature vector with the WCT-transformed feature vector, strengthening the content constraint of the semantic graph in the fused vector and producing the final style-content feature fusion vector f_cs.
7. The multi-style image generation method based on feature fusion as claimed in claim 1, wherein: in order to balance the number of network parameters against the effect of spatial information extraction, the convolutional network in step 1 is a three-layer convolutional network; each layer comprises a convolutional layer, a normalization layer and an activation function layer, and the feature map output after the three convolutions is 1/8 the size of the original image.
8. The multi-style image generation method based on feature fusion as claimed in claim 1, wherein: the features of the input style graph t before the activation layers relu1_2, relu2_2, relu3_3 and relu4_3 of the VGG16 network, i.e. f_relu1_2(t), f_relu2_2(t), f_relu3_3(t) and f_relu4_3(t), are extracted and used as the original features for feature fusion; because these features come from different levels, the feature fusion module FFM fuses them in sequence from deep to shallow; the fused features then pass through the attention extraction module AEM, in which a self-attention model applies attention weighting to the different channels to obtain the style feature vector f_s.
9. The multi-style image generation method based on feature fusion as claimed in claim 1, wherein: in order to fully account for the influence of features of different depths on the loss calculation, five levels of features extracted by the VGG16 network are used in step 4, i.e. N = 5, and the weights w_i are 1/32, 1/16, 1/8, 1/4 and 1 in turn; the deeper the feature level, the larger the weight.
CN202110635370.2A 2021-06-02 2021-06-02 Multi-style image generation method based on feature fusion Active CN113255813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110635370.2A CN113255813B (en) 2021-06-02 2021-06-02 Multi-style image generation method based on feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110635370.2A CN113255813B (en) 2021-06-02 2021-06-02 Multi-style image generation method based on feature fusion

Publications (2)

Publication Number Publication Date
CN113255813A (en) 2021-08-13
CN113255813B true CN113255813B (en) 2022-12-02

Family

ID=77186962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110635370.2A Active CN113255813B (en) 2021-06-02 2021-06-02 Multi-style image generation method based on feature fusion

Country Status (1)

Country Link
CN (1) CN113255813B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919998A (en) * 2021-10-14 2022-01-11 天翼数字生活科技有限公司 Image anonymization method based on semantic and attitude map guidance
CN113642566B (en) * 2021-10-15 2021-12-21 南通宝田包装科技有限公司 Medicine package design method based on artificial intelligence and big data
CN113642262B (en) * 2021-10-15 2021-12-21 南通宝田包装科技有限公司 Toothpaste package appearance auxiliary design method based on artificial intelligence
CN114782590A (en) * 2022-03-17 2022-07-22 山东大学 Multi-object content joint image generation method and system
CN115272687B (en) * 2022-07-11 2023-05-05 哈尔滨工业大学 Single sample adaptive domain generator migration method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523463A (en) * 2018-11-20 2019-03-26 中山大学 A kind of face aging method generating confrontation network based on condition
CN109829353A (en) * 2018-11-21 2019-05-31 东南大学 A kind of facial image stylizing method based on space constraint
CN111325664A (en) * 2020-02-27 2020-06-23 Oppo广东移动通信有限公司 Style migration method and device, storage medium and electronic equipment
CN112017301A (en) * 2020-07-24 2020-12-01 武汉纺织大学 Style migration model and method for specific relevant area of clothing image
CN112132167A (en) * 2019-06-24 2020-12-25 商汤集团有限公司 Image generation and neural network training method, apparatus, device, and medium
CN112766079A (en) * 2020-12-31 2021-05-07 北京航空航天大学 Unsupervised image-to-image translation method based on content style separation
CN112861805A (en) * 2021-03-17 2021-05-28 中山大学 Face image generation method based on content features and style features

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112419328B (en) * 2019-08-22 2023-08-04 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523463A (en) * 2018-11-20 2019-03-26 中山大学 A kind of face aging method generating confrontation network based on condition
CN109829353A (en) * 2018-11-21 2019-05-31 东南大学 A kind of facial image stylizing method based on space constraint
CN112132167A (en) * 2019-06-24 2020-12-25 商汤集团有限公司 Image generation and neural network training method, apparatus, device, and medium
WO2020258902A1 (en) * 2019-06-24 2020-12-30 商汤集团有限公司 Image generating and neural network training method, apparatus, device, and medium
CN111325664A (en) * 2020-02-27 2020-06-23 Oppo广东移动通信有限公司 Style migration method and device, storage medium and electronic equipment
CN112017301A (en) * 2020-07-24 2020-12-01 武汉纺织大学 Style migration model and method for specific relevant area of clothing image
CN112766079A (en) * 2020-12-31 2021-05-07 北京航空航天大学 Unsupervised image-to-image translation method based on content style separation
CN112861805A (en) * 2021-03-17 2021-05-28 中山大学 Face image generation method based on content features and style features

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
The Contextual Loss for Image Transformation with Non-Aligned Data; Roey Mechrez et al.; arXiv; 20180718; pp. 1-16 *
Universal Style Transfer via Feature Transforms; Yijun Li et al.; arXiv; 20171117; pp. 1-11 *
Semantic segmentation algorithm based on a global bilateral network; 任天赐 et al.; Computer Science; 20200615; pp. 171-175 *
Research on image style transfer technology based on semantic segmentation; 李美丽 et al.; Computer Engineering and Applications; 20200409; pp. 207-213 *

Also Published As

Publication number Publication date
CN113255813A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN113255813B (en) Multi-style image generation method based on feature fusion
Li et al. A closed-form solution to photorealistic image stylization
CN108830912B (en) Interactive gray image coloring method for depth feature-based antagonistic learning
CN110378985B (en) Animation drawing auxiliary creation method based on GAN
CN108830913B (en) Semantic level line draft coloring method based on user color guidance
CN111862294B (en) Hand-painted 3D building automatic coloring network device and method based on ArcGAN network
CN110120049B (en) Method for jointly estimating scene depth and semantics by single image
CN105374007A (en) Generation method and generation device of pencil drawing fusing skeleton strokes and textural features
CN110020681A (en) Point cloud feature extracting method based on spatial attention mechanism
Zhao et al. Computer-aided graphic design for virtual reality-oriented 3D animation scenes
Li et al. High-resolution network for photorealistic style transfer
Ye et al. Multi-style transfer and fusion of image’s regions based on attention mechanism and instance segmentation
CN110097615B (en) Stylized and de-stylized artistic word editing method and system
CN111489405A (en) Face sketch synthesis system for generating confrontation network based on condition enhancement
CN116485892A (en) Six-degree-of-freedom pose estimation method for weak texture object
CN115690487A (en) Small sample image generation method
CN111064905A (en) Video scene conversion method for automatic driving
CN115512100A (en) Point cloud segmentation method, device and medium based on multi-scale feature extraction and fusion
CN115018729A (en) White box image enhancement method for content
Togo et al. Text-guided style transfer-based image manipulation using multimodal generative models
Bagwari et al. An edge filter based approach of neural style transfer to the image stylization
Li et al. FreePIH: Training-Free Painterly Image Harmonization with Diffusion Model
Shen et al. Overview of Cartoon Face Generation
Guo Design and Development of an Intelligent Rendering System for New Year's Paintings Color Based on B/S Architecture
Bagwari et al. A review: The study and analysis of neural style transfer in image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant