CN113255813B - Multi-style image generation method based on feature fusion - Google Patents

Multi-style image generation method based on feature fusion

Info

Publication number
CN113255813B
CN113255813B
Authority
CN
China
Prior art keywords
style
feature
network
content
semantic
Prior art date
Legal status
Active
Application number
CN202110635370.2A
Other languages
Chinese (zh)
Other versions
CN113255813A (en)
Inventor
余月
李本源
李能力
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202110635370.2A priority Critical patent/CN113255813B/en
Publication of CN113255813A publication Critical patent/CN113255813A/en
Application granted granted Critical
Publication of CN113255813B publication Critical patent/CN113255813B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

The invention discloses a multi-style image generation method based on feature fusion, and belongs to the field of computer vision. The implementation method of the invention comprises the following steps: inputting the semantic segmentation graph into a content feature extraction network and extracting the content feature vector of the semantic graph; inputting the style graph into a style feature extraction network and extracting the style feature vector of the style graph; inputting the extracted content feature vector f_c and style feature vector f_s into a content-style feature fusion network for feature fusion to obtain a fused feature vector; constructing a generative adversarial network consisting of a generator and a discriminator, and training it on a data set with a designed loss function; and using the trained generator with the minimized loss function to generate multi-style images that have the content of the semantic graph and the style of the style graph. The generated multi-style images can be applied to attention-attracting scenarios, solving the related engineering technical problems.

Description

Multi-style image generation method based on feature fusion
Technical Field
The invention relates to an image generation method for generating multi-style images from a semantic segmentation graph, and in particular to a method capable of realizing fast, end-to-end generation from a semantic graph to multi-style images, belonging to the field of computer vision.
Background
At present, most models for multi-style image generation produce stylized results from real images guided by a style image; the few models that generate stylized images from semantic graphs only use images from the same data set as the input style, so fast transfer of arbitrary styles cannot be realized.
Generating images of arbitrary style from a semantic graph end to end is of great significance for art design and for the generation of virtual-reality education resources. In the field of art design, an art creator or designer only needs to specify the position and rough shape of each object in the semantic graph and the desired style to quickly generate style images satisfying both the semantic and the style constraints, greatly reducing the time cost of creation and design. In the direction of multimedia education resource generation, a teacher can use simple semantic graph information to generate multi-style teaching scene images, which greatly enriches teaching resources; teaching scenes in a variety of styles can better attract students' attention and raise their interest in learning. Meanwhile, quickly generating teaching scene images from semantic graphs greatly reduces the time spent producing new image resources.
Disclosure of Invention
Aiming at the problem described in the background art that generating multi-style images from a semantic graph is severely limited, the multi-style image generation method based on feature fusion disclosed by the invention aims to solve the following technical problem: providing a network framework, composed of a content feature extraction network, a style feature extraction network and a content-style feature fusion network, that generates style images from semantic graphs. The content features and style features are extracted by the content feature extraction network and the style feature extraction network respectively, and the content-style feature fusion network fuses the features extracted by the first two networks to generate multi-style images that have the content of the semantic graph and the style of the style graph. The invention has the advantages of being fast, convenient, widely applicable and producing good generation results. The generated multi-style images with semantic-graph content and style-graph style can be applied to attention-attracting scenarios, solving the related engineering technical problems.
In order to achieve the above purpose, the invention adopts the following technical scheme.
The invention discloses a multi-style image generation method based on feature fusion. A semantic segmentation graph is input into a content feature extraction network to extract the content feature vector of the semantic graph. The style graph is input into a style feature extraction network to extract the style feature vector of the style graph. The extracted content feature vector f_c and style feature vector f_s are input into a content-style feature fusion network for feature fusion to obtain a fused feature vector. A generative adversarial network consisting of a generator and a discriminator is constructed and trained on a data set with a designed loss function. The generator with the minimized loss function obtained by training is then used to generate multi-style images that have the content of the semantic graph and the style of the style graph. The generated multi-style images can be applied to attention-attracting scenarios, solving the related engineering technical problems.
The invention discloses a multi-style image generation method based on feature fusion, which comprises the following steps:
step 1: and inputting the semantic segmentation graph into a content feature extraction network, and extracting a content feature vector in the semantic graph.
The content feature extraction network in step 1 is a multi-path feature extraction network mainly composed of three branch paths, namely a Global Space Path (GSP), a Classification Space Path (CSP) and a Classification Semantic Path (CCP). The global space path GSP is used to extract global spatial features, the classification space path CSP is used to extract the classification spatial features of the semantic graph, and the classification semantic path CCP is used to extract classification semantic features.
The input of the global space path is the whole semantic graph, and a feature map containing global spatial information is obtained through a convolutional network.
The structure of the classification space path is the same as that of the global space path; the only difference is the input. The input of the classification space path is not the whole semantic graph: the semantic graph is first split according to category so that each channel contains only one category, and the channels are then concatenated into a multi-channel classified semantic graph. A convolution operation is applied to each category of the classified semantic graph separately to compute the spatial features of each category.
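For illustration, the per-category splitting described above can be sketched as follows in PyTorch; the framework choice, the function name and the number of categories are assumptions of this sketch, not part of the patented design:

    import torch
    import torch.nn.functional as F

    def split_by_category(label_map: torch.Tensor, num_classes: int) -> torch.Tensor:
        """label_map: [B, H, W] integer class ids -> [B, num_classes, H, W], one category per channel."""
        one_hot = F.one_hot(label_map.long(), num_classes)    # [B, H, W, C]
        return one_hot.permute(0, 3, 1, 2).float()             # [B, C, H, W]

    # Example: a 256x512 label map with an assumed 35 categories
    labels = torch.randint(0, 35, (1, 256, 512))
    class_channels = split_by_category(labels, num_classes=35)  # [1, 35, 256, 512]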
The classification semantic path adopts a lightweight ResNet network model and global average pooling to enlarge the receptive field; global average pooling is added at the end of the ResNet model so that the maximum receptive field and global context information of each category can be provided. In addition, an Attention Extraction Module (AEM) is used in the classification semantic path. The attention extraction module captures the global semantic information of the feature map with an attention mechanism and computes attention vectors that assign different weights to different positions, thereby guiding network learning.
After the three branch paths of the multi-path feature extraction network have extracted the global spatial information, classification spatial information and classification semantic information respectively, the features output by the three branch paths are fused by a Feature Fusion Module (FFM). After feature fusion, a Conditional Normalization Block (CNB) takes the processed classified semantic graph as an additional conditional input and assigns different normalization parameters to semantic regions of different categories, so that the information in the semantic graph is fully retained and the content feature vector f_c is obtained.
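A minimal sketch of a conditional normalization block in this spirit is given below: the classified semantic graph conditions per-pixel scale and shift parameters applied after normalization. The layer layout, channel widths and the use of instance normalization are assumptions of the sketch, not the patented CNB design:

    import torch
    import torch.nn as nn

    class ConditionalNormBlock(nn.Module):
        def __init__(self, feat_channels: int, label_channels: int, hidden: int = 128):
            super().__init__()
            self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
            self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
            self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)  # category-conditioned scale
            self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)   # category-conditioned shift

        def forward(self, feat, seg):
            # resize the classified semantic graph to the feature resolution, then modulate
            seg = nn.functional.interpolate(seg, size=feat.shape[-2:], mode="nearest")
            h = self.shared(seg)
            return self.norm(feat) * (1 + self.gamma(h)) + self.beta(h)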
In order to balance the number of network parameters against the effect of spatial information extraction, preferably, the convolutional network in step 1 is a three-layer convolutional network; each layer comprises a convolutional layer, a normalization layer and an activation function layer, and the feature map output after the three convolutions is 1/8 the size of the original image.
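A minimal sketch of such a three-layer convolutional path is shown below; each layer is convolution + normalization + activation, and three stride-2 convolutions reduce the spatial size to 1/8 of the input (the channel widths are assumptions of the sketch):

    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),  # halves H and W
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    spatial_path = nn.Sequential(conv_block(3, 64), conv_block(64, 128), conv_block(128, 512))

    x = torch.randn(1, 3, 256, 512)   # a semantic graph of size [3, 256, 512]
    print(spatial_path(x).shape)      # torch.Size([1, 512, 32, 64]), i.e. 1/8 resolution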
Step 2: inputting the style graph into a style feature extraction network, and extracting the style feature vector of the style graph.
The style feature extraction network in step 2 uses a pre-trained VGG16 network. The features of the input style graph t before the activation layers are extracted with the VGG16 network and used as the original features for feature fusion. Because these features come from different levels, the feature fusion module FFM fuses them in sequence from deep to shallow. The fused features then pass through the attention extraction module AEM, in which a self-attention model applies attention weighting to the different channels to obtain the style feature vector f_s.
Preferably, the features of the input style graph t before the activation layers relu1_2, relu2_2, relu3_3 and relu4_3 of the VGG16 network, i.e. f_relu1_2(t), f_relu2_2(t), f_relu3_3(t) and f_relu4_3(t), are extracted and used as the original features for feature fusion. Because these features come from different levels, the feature fusion module FFM fuses them in sequence from deep to shallow. The fused features then pass through the attention extraction module AEM, in which a self-attention model applies attention weighting to the different channels to obtain the style feature vector f_s.
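A hedged sketch of this multi-level extraction with torchvision's pre-trained VGG16 follows; the slice indices select the convolution outputs just before the relu1_2, relu2_2, relu3_3 and relu4_3 activations in torchvision's layer ordering, and the use of torchvision itself is an assumption of the sketch:

    import torch
    from torchvision.models import vgg16

    feats = vgg16(weights="IMAGENET1K_V1").features.eval()
    for p in feats.parameters():
        p.requires_grad_(False)

    # pre-activation endpoints for relu1_2, relu2_2, relu3_3, relu4_3
    slices = [feats[:3], feats[3:8], feats[8:15], feats[15:22]]

    def extract_style_features(t: torch.Tensor):
        out, x = [], t
        for s in slices:
            x = s(x)
            out.append(x)   # f_relu1_2(t), f_relu2_2(t), f_relu3_3(t), f_relu4_3(t)
        return out

    style_feats = extract_style_features(torch.randn(1, 3, 256, 512))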
Step 3: inputting the extracted content feature vector f_c and style feature vector f_s into a content-style feature fusion network for feature fusion to obtain the fused feature vector f_cs.
The content-style feature fusion network in step 3 performs feature fusion by means of a WCT (Whiten-Color Transform) matrix transformation. The WCT matrix transformation applies a Whiten transform followed by a Color transform to the content-image feature f_c and the style-graph feature f_s, yielding a fused feature f_cs that carries the content features of the content graph and the style features of the style graph. The WCT transform is therefore divided into two parts, the Whiten transform and the Color transform.
The Whiten transform takes the feature f_c of the content image in the VGG16 feature space, computes its covariance matrix, performs SVD on the covariance matrix, and whitens the feature with the matrices obtained from the decomposition, stripping the color characteristics from the content image so that the transformed feature retains only the content contour. The Whiten transform is implemented as

\hat{f}_c = E_c D_c^{-1/2} E_c^T f_c

where f_c is the content-image feature extracted by VGG16, \hat{f}_c is the whitened content feature, D_c is a diagonal matrix whose elements are the eigenvalues of the covariance matrix f_c f_c^T, E_c is an orthogonal matrix satisfying f_c f_c^T = E_c D_c E_c^T, and D_c and E_c are obtained from the SVD of the covariance matrix.
The Color transform takes the feature f_s of the style image in the VGG16 feature space, first computes its covariance matrix and performs SVD on it, and then applies to the whitened content feature the inverse of the Whiten transform, i.e. the Color transform, transferring the whitened content feature onto the feature distribution of the style graph to obtain the WCT-transformed feature vector. The Color transform is implemented as

f_{cs} = E_s D_s^{1/2} E_s^T \hat{f}_c

where D_s and E_s are obtained from the SVD of the covariance matrix f_s f_s^T of the style feature. After the WCT matrix transformation, a feature fusion module FFM fuses the content feature vector with the WCT-transformed feature vector, strengthening the content constraint of the semantic graph in the fused vector and producing the final style-content feature fusion vector f_cs.
Step 4: constructing a generative adversarial network consisting of a generator and a discriminator, and training the generative adversarial network on the data set with a designed loss function, i.e. training to obtain the generative adversarial network with the minimized loss function.
The method uses a network framework composed of three sub-networks, i.e. the content feature extraction network, the style feature extraction network and the content-style feature fusion network, to generate style images from semantic graphs: the content feature extraction network and the style feature extraction network extract the content features and the style features respectively, and the content-style feature fusion network fuses the features extracted by the first two networks to generate multi-style images that have the content of the semantic graph and the style of the style graph.
The generator in step 4 is the network composed of the content feature extraction network, the style feature extraction network and the content-style feature fusion network that generates style images from semantic graphs. The discriminator is a multi-scale discriminator composed of a global discriminator D_1 and a local discriminator D_2; the two discriminators have the same network structure but operate at different image scales.
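A minimal sketch of such a two-scale discriminator pair is given below: two discriminators with identical structure, one applied to the full-resolution image and one to a downsampled copy. The PatchGAN-style layer layout and channel widths are assumptions of the sketch, not the patented architecture:

    import torch
    import torch.nn as nn

    def make_discriminator(in_ch: int = 3) -> nn.Sequential:
        layers, ch = [], 64
        layers += [nn.Conv2d(in_ch, ch, 4, 2, 1), nn.LeakyReLU(0.2, True)]
        for _ in range(3):
            layers += [nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.InstanceNorm2d(ch * 2), nn.LeakyReLU(0.2, True)]
            ch *= 2
        layers += [nn.Conv2d(ch, 1, 4, 1, 1)]   # patch-level real/fake scores
        return nn.Sequential(*layers)

    d_a = make_discriminator()   # same structure ...
    d_b = make_discriminator()   # ... applied at two different image scales

    img = torch.randn(1, 3, 256, 512)
    score_full = d_a(img)
    score_half = d_b(nn.functional.avg_pool2d(img, 2))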
The loss function designed in step 4 is:
L(G, D_1, D_2) = λ_1 L_perc + λ_2 L_adv + λ_3 L_FM + λ_4 L_CX + λ_5 L_TV

where λ_1, λ_2, λ_3, λ_4 and λ_5 are settable weighting parameters, G is the generator, D_1 is the local discriminator, D_2 is the global discriminator, x is the input semantic graph, t is the input style graph, and y is the generated multi-style image.

L_perc is the perceptual loss that measures the content difference. It is computed as a weighted sum over the VGG16 feature levels, where F^(i) denotes the feature extractor formed by the layers of the VGG16 network before the i-th activation layer and w_i is the adaptive weight of the i-th layer; the deeper the feature layer, the larger the weight.

L_adv is the adversarial loss between the generator G and the discriminators D_1 and D_2.

L_FM is the feature matching loss, which matches the intermediate discriminator features of real and generated images, where T denotes the number of network layers of the discriminator D_k and N_i denotes the number of elements in each layer.

L_CX is the contextual loss that measures the style difference:

L_CX = -log( CX(φ_l(x), φ_l(t)) )

where CX(φ_l(x), φ_l(t)) is the cosine similarity between the l-th level VGG16 features of the semantic graph x and the style graph t.

L_TV is the total variation loss:

L_TV = Σ_{(i,j) ∈ N} [ (y_{i,j+1} - y_{i,j})^2 + (y_{i+1,j} - y_{i,j})^2 ]

where i and j are the coordinate values of pixels in the image and N is the pixel range size of the image.
In order to fully account for the influence of features of different depths on the loss calculation, preferably, five levels of features extracted by the VGG16 network are used in step 4, i.e. N = 5, and the weights w_i are 1/32, 1/16, 1/8, 1/4 and 1 in turn; the deeper the feature level, the larger the weight.
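As a sketch, the layer weighting described above (five VGG16 feature levels with weights 1/32, 1/16, 1/8, 1/4 and 1) can be applied as a weighted feature comparison; the use of an L1 distance and of a reference image are assumptions about the exact form of the perceptual term, not statements of the patented formula:

    import torch

    layer_weights = [1/32, 1/16, 1/8, 1/4, 1.0]   # deeper layers receive larger weights (N = 5)

    def weighted_feature_loss(feats_generated, feats_reference):
        """Both arguments are lists of 5 feature tensors taken from the same VGG16 layers."""
        loss = torch.zeros(())
        for w, fg, fr in zip(layer_weights, feats_generated, feats_reference):
            loss = loss + w * torch.mean(torch.abs(fg - fr))
        return loss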
Step 5: using the generator with the minimized loss function obtained by the training in step 4, the style-content feature fusion vector f_cs obtained in step 3 becomes the multi-style image y that has the content of the semantic graph and the style of the style graph, i.e. multi-style image generation based on feature fusion is realized.
The method further comprises step 6: applying the multi-style images generated in step 5, which have the content of the semantic graph and the style of the style graph, to attention-attracting scenarios, thereby solving the related engineering technical problems.
The related engineering technical problems in step 6 include practical problems such as creative advertisement design, game scene design and teaching scene image design.
Beneficial effects:
1. The multi-style image generation method based on feature fusion disclosed by the invention provides a network framework, composed of a content feature extraction network, a style feature extraction network and a content-style feature fusion network, that generates style images from semantic graphs end to end.
2. The multi-style image generation method based on feature fusion disclosed by the invention places no restriction on the input images: after training is finished, multi-style images with the content of any semantic graph and the style of any style graph can be generated, satisfying the generation requirements of different tasks. The method therefore has the advantage of wide applicability.
3. None of the existing multi-style image generation frameworks can generate multi-style images from a semantic graph end to end: a real image conforming to the semantic graph must be generated first, and its style is then transferred. The disclosed method generates multi-style images from the semantic graph and the style graph directly, end to end, and therefore has the advantages of being fast and convenient.
4. The multi-style image generation method based on feature fusion disclosed by the invention applies the generated multi-style images, which have the content of the semantic graph and the style of the style graph, to attention-attracting scenarios and solves the related engineering technical problems, such as practical problems in creative advertisement design, game scene design and teaching scene image design.
Drawings
FIG. 1 is a flow chart of an implementation of a multi-style image generation method based on feature fusion of the present invention;
fig. 2 is a structural diagram of the content feature extraction network of the present invention, in which fig. 2 (a) is a structural diagram of the entire content feature extraction network, fig. 2 (b) is a structural diagram of the Attention Extraction Module (AEM), fig. 2 (c) is a structural diagram of the Feature Fusion Module (FFM), and fig. 2 (d) is a structural diagram of the Conditional Normalization Block (CNB);
FIG. 3 is a block diagram of a style feature extraction network in accordance with the present invention;
FIG. 4 is a block diagram of a content style feature fusion network in accordance with the present invention;
FIG. 5 is a block diagram of a generator in the present invention;
FIG. 6 is a structural diagram of an arbiter in the present invention;
FIG. 7 is a graph of the effect of the invention on the Cityscapes dataset.
Detailed Description
The following describes embodiments of the present invention with reference to the drawings.
As shown in fig. 1, the multi-style image generation method based on feature fusion disclosed in this embodiment can be applied to the Cityscapes data set for entertainment-related applications. For example, in the creation of movies, animations and games it can perform style rendering on street views, turning the same street view into different styles and creating the desired movie, animation or game style. It can also reduce the cost of creation, save production time, and increase interaction with the audience or players. The training and image generation flow of this embodiment is shown in fig. 1.
Step 1: the semantic segmentation graph is input into a content feature extraction network, content feature vectors in the semantic graph are extracted, and the structure diagram of the content feature extraction network is shown in fig. 2 (a).
The size of the semantic graph input in step 1 is [3,256,512]. The feature maps obtained from the classification space path and the global space path have size [512,32,64], and the feature map obtained from the classification semantic path has size [256,128,256]. The network structure of the Attention Extraction Module (AEM) used in the classification semantic path is shown in fig. 2 (b). After the three features are obtained, they are fused by the Feature Fusion Module (FFM), whose structure is shown in fig. 2 (c), into a fused feature of size [512,128,256]. Finally, the fused feature is up-sampled by the Conditional Normalization Block (CNB), shown in fig. 2 (d), to obtain the final content feature vector f_c, whose size is [256,128,256].
Step 2: the style graph is input into the style feature extraction network and the style feature vector of the style graph is extracted; the structure of the style feature extraction network is shown in fig. 3.
The style feature extraction network in step 2 uses a pre-trained VGG16 network. The size of the style graph t input to the network is [3,256,512]. The features before the activation layers relu1_2, relu2_2, relu3_3 and relu4_3 of VGG16, i.e. f_relu1_2(t), f_relu2_2(t), f_relu3_3(t) and f_relu4_3(t), are extracted; the sizes of the extracted features are [128,256,512], [256,128,256], [512,64,128] and [512,64,128] respectively. Since these features come from different levels, they are fused in sequence from deep to shallow by the feature fusion module FFM, whose structure is shown in fig. 2 (c). Finally, the fused features pass through the attention extraction module AEM, in which a self-attention model applies attention weighting to the different channels to obtain the final style feature vector f_s. The network structure of the AEM is shown in fig. 2 (b), and the resulting style feature vector f_s has size [256,128,256].
Step 3: the extracted content feature vector f_c and style feature vector f_s are input into the content-style feature fusion network for feature fusion to obtain the content-style fusion feature f_cs; the structure of the content-style feature fusion network is shown in fig. 4.
The content feature vector f_c and the style feature vector f_s input into the content-style feature fusion network in step 3 both have size [256,128,256]. The WCT matrix transformation does not change the size of the feature, but the transformed feature vector already carries the content information of the content graph and the style information of the style graph. After the WCT matrix transformation, a feature fusion module FFM fuses the content feature vector with the WCT-transformed feature vector, strengthening the content constraint of the semantic graph in the fused vector; the fused vector has size [256,128,256]. Up-sampling with a deconvolution operation then yields the final style-content fusion feature f_cs of size [3,256,512], which better satisfies the content constraint of the semantic graph while carrying the artistic style of the input style graph.
Step 4: a generative adversarial network consisting of a generator and a discriminator is constructed and trained on the data set with the designed loss function, i.e. the generative adversarial network with the minimized loss function is obtained by training.
The network structure of the generator in step 4 is shown in fig. 5, and the network structure of the discriminator is shown in fig. 6. The generator is the generator network consisting of the content feature extraction network, the style feature extraction network and the content-style feature fusion network of steps 1 to 3, and the discriminator is a multi-scale discriminator composed of a global discriminator D_1 and a local discriminator D_2. The loss function used during training is:
L(G, D_1, D_2) = λ_1 L_perc + λ_2 L_adv + λ_3 L_FM + λ_4 L_CX + λ_5 L_TV

where λ_1, λ_2, λ_3, λ_4 and λ_5 are settable weighting parameters, G is the generator, D_1 is the local discriminator, D_2 is the global discriminator, x is the input semantic graph, t is the input style graph, and y is the generated multi-style image.

L_perc is the perceptual loss that measures the content difference. It is computed as a weighted sum over the VGG16 feature levels, where F^(i) denotes the feature extractor formed by the layers of the VGG16 network before the i-th activation layer and w_i is the adaptive weight of the i-th layer. In the experiments, five levels of VGG16 features are extracted, i.e. N = 5, and w_i are 1/32, 1/16, 1/8, 1/4 and 1 in turn; the deeper the feature level, the larger the weight.

L_adv is the adversarial loss between the generator G and the discriminators D_1 and D_2.

L_FM is the feature matching loss, which matches the intermediate discriminator features of real and generated images, where T denotes the number of network layers of the discriminator D_k and N_i denotes the number of elements in each layer.

L_CX is the contextual loss that measures the style difference:

L_CX = -log( CX(φ_l(x), φ_l(t)) )

where CX(φ_l(x), φ_l(t)) is the cosine similarity between the l-th level VGG16 features of the semantic graph x and the style graph t.

L_TV is the total variation loss:

L_TV = Σ_{(i,j) ∈ N} [ (y_{i,j+1} - y_{i,j})^2 + (y_{i+1,j} - y_{i,j})^2 ]

where i and j are the coordinate values of pixels in the image and N is the pixel range size of the image.
During the training of the invention, the number of training epochs is 300, with λ_1 = 10, λ_2 = 1, λ_3 = 1 and λ_5 = 0.00001. The coefficient λ_4 of the contextual loss, which controls the style difference, is kept small at 0.1 during the first 150 epochs; during the last 150 epochs λ_4 is gradually increased until it reaches a maximum value of 20.
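A sketch of this weight schedule is shown below; the linear ramp is an assumption, since the patent only states that λ_4 is gradually increased to a maximum of 20 in the last 150 epochs:

    TOTAL_EPOCHS = 300
    LAMBDA_1, LAMBDA_2, LAMBDA_3, LAMBDA_5 = 10.0, 1.0, 1.0, 1e-5

    def lambda_4(epoch: int) -> float:
        """Contextual-loss weight: 0.1 for the first 150 epochs, then ramped toward 20."""
        if epoch < 150:
            return 0.1
        progress = (epoch - 150) / (TOTAL_EPOCHS - 150)
        return 0.1 + progress * (20.0 - 0.1)

    for epoch in range(TOTAL_EPOCHS):
        l4 = lambda_4(epoch)
        # total = LAMBDA_1*L_perc + LAMBDA_2*L_adv + LAMBDA_3*L_FM + l4*L_CX + LAMBDA_5*L_TV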
Step 5: using the generator with the minimized loss function trained in step 4, the style-content feature fusion vector f_cs obtained in step 3 becomes a multi-style image y that has the content of the semantic graph and the style of the style graph.
In step 5, this embodiment achieves good generation results on the public Cityscapes data set. The Cityscapes data set is a large-scale data set containing multiple stereoscopic video sequences recorded in the street scenes of 50 different cities. It can be applied to the creation of movies, animations and games, performing style rendering on street views, turning the same street view into different styles and creating the desired movie, animation or game style. The generation results are shown in fig. 7.
In summary, in this embodiment the semantic graph and the style graph are input into the generative adversarial network, and the generative adversarial network model is trained to obtain a well-trained generator; this generator can then produce images that satisfy both the content constraint of the semantic graph and the style constraint of the style graph. The embodiment overcomes the problems of the traditional approach, in which the time and labour costs of generation are high and the result cannot be guaranteed.
The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A multi-style image generation method based on feature fusion, characterized by comprising the following steps:
step 1: inputting the semantic segmentation graph into a content feature extraction network, and extracting a content feature vector in the semantic graph;
step 2: inputting the style diagram into a style feature extraction network, and extracting style feature vectors in the style diagram;
and 3, step 3: extracting the content feature vector f c And style feature vector f s Inputting a content style feature fusion network for feature fusion to obtain a fusion feature vector f after feature fusion cs
step 4: constructing a generative adversarial network consisting of a generator and a discriminator, and training the generative adversarial network on a data set with a designed loss function, i.e. training to obtain the generative adversarial network with the minimized loss function;
the system comprises a network framework, a content characteristic extraction network, a style characteristic extraction network and a content style characteristic fusion network, wherein the network framework is formed by three parts of networks, namely a semantic graph generating style image, the content characteristic extraction network and the style characteristic extraction network are used for respectively extracting content characteristics and style characteristics, and the content style characteristic extraction network and the style characteristic extraction network are used for fusing the characteristics extracted by the two networks to generate a multi-style image with semantic graph contents and style;
the generator in the step 4 is a network which is composed of a content characteristic extraction network, a style characteristic extraction network and a content style characteristic fusion network and is used for generating style images from semantic graphs; the arbiter is composed of a global arbiter D 1 And a local discriminator D 2 The multi-stage discriminators are formed, have the same network structure and operate on different image scales;
the loss function designed in step 4 is:
L(G, D_1, D_2) = λ_1 L_perc + λ_2 L_adv + λ_3 L_FM + λ_4 L_CX + λ_5 L_TV

wherein λ_1, λ_2, λ_3, λ_4 and λ_5 are settable parameters, G is the generator, D_1 is the local discriminator, D_2 is the global discriminator, x is the input semantic graph, t is the input style graph, and y is the generated multi-style image;

L_perc is the perceptual loss that measures the content difference, computed as a weighted sum over the VGG16 feature levels, wherein F^(i) represents the feature extractor formed by the layers of the VGG16 network before the i-th activation layer and w_i is the adaptive weight of the i-th layer; the deeper the feature layer, the larger the weight;

L_adv is the adversarial loss between the generator G and the discriminators D_1 and D_2;

L_FM is the feature matching loss, which matches the intermediate discriminator features of real and generated images, wherein T represents the number of network layers of the discriminator D_k and N_i represents the number of elements of each layer;

L_CX is the contextual loss that measures the style difference:

L_CX = -log( CX(φ_l(x), φ_l(t)) )

wherein CX(φ_l(x), φ_l(t)) is the cosine similarity between the l-th level VGG16 features of the semantic graph x and the style graph t;

L_TV is the total variation loss:

L_TV = Σ_{(i,j) ∈ N} [ (y_{i,j+1} - y_{i,j})^2 + (y_{i+1,j} - y_{i,j})^2 ]

wherein i and j are the coordinate values of pixels in the image and N is the pixel range size of the image;
and 5: training by using the step 4 to obtain a generator with minimized loss function, wherein the style content feature fusion vector f obtained in the step 3 cs The multi-style image t with semantic graph content and style of the style is formed, namely, the multi-style image generation is realized based on the feature fusion.
2. The multi-style image generation method based on feature fusion as claimed in claim 1, characterized by further comprising step 6: applying the multi-style images generated in step 5, which have the content of the semantic graph and the style of the style graph, to attention-attracting scenarios, thereby solving the related engineering technical problems.
3. The multi-style image generation method based on feature fusion as claimed in claim 2, characterized in that the related engineering technical problems in step 6 include practical problems such as creative advertisement design, game scene design and teaching scene image design.
4. The multi-style image generation method based on feature fusion as claimed in claim 1, 2 or 3, characterized in that: the content feature extraction network in step 1 is a multi-path feature extraction network mainly composed of three branch paths, namely a Global Space Path (GSP), a Classification Space Path (CSP) and a Classification Semantic Path (CCP); the global space path GSP is used to extract global spatial features, the classification space path CSP is used to extract the classification spatial features of the semantic graph, and the classification semantic path CCP is used to extract classification semantic features;
the input of the global space path is a whole semantic graph, and a feature graph containing global space information is obtained through convolution network processing;
the structure of the classification space path is the same as that of the global space path, and the only difference is that the input is different; the input of the semantic space path is not a whole semantic graph, but the semantic graph is firstly divided according to different categories, each channel has only one category, then the semantic graphs are spliced together to form a multi-channel classification semantic graph, each category of the classification semantic graph is respectively subjected to convolution operation, and the spatial feature of each category is calculated;
the classification semantic path adopts a lightweight ResNet network model and global average pooling to expand the receptive field, and global average pooling is added at the tail of the ResNet network model, so that the receptive field and global context information of each category can be provided to the maximum extent; in addition, an Attention Extraction Module AEM (Attention Extraction Module) is also used in the classification semantic path; the attention extraction module captures global semantic information of the feature map by using an attention mechanism, and calculates attention vectors to give different weights to different positions so as to achieve the purpose of guiding network learning;
after global space information, classification space information and classification semantic information are respectively extracted from three branch paths in a multi-path generation network, the features output by the three branch paths are fused by a Feature Fusion Module (FFM); after feature fusion, a Conditional Normalization module CNB (Conditional Normalization Block) is used for taking the processed classified semantic graphs as additional condition input, different Normalization parameters are given to the semantic graphs with different categories, information in the semantic graphs is fully reserved, and content feature vectors f are obtained c
5. The multi-style image generation method based on feature fusion as claimed in claim 4, wherein: the style feature extraction network in step 2 uses a pre-trained VGG16 network; the features of the input style graph t before the activation layers are extracted with the VGG16 network and used as the original features for feature fusion; because these features come from different levels, the feature fusion module FFM fuses them in sequence from deep to shallow; the fused features then pass through the attention extraction module AEM, in which a self-attention model applies attention weighting to the different channels to obtain the style feature vector f_s.
6. The multi-style image generation method based on feature fusion as claimed in claim 5, wherein: the content-style feature fusion network in step 3 performs feature fusion by means of a WCT (Whiten-Color Transform) matrix transformation; the WCT matrix transformation applies a Whiten transform followed by a Color transform to the content-image feature f_c and the style-graph feature f_s, yielding a fused feature that carries the content features of the content graph and the style features of the style graph; the WCT transform is divided into two parts, the Whiten transform and the Color transform;

the Whiten transform takes the feature f_c of the content image in the VGG16 feature space, computes its covariance matrix, performs SVD on the covariance matrix, and whitens the feature with the matrices obtained from the decomposition, stripping the color characteristics from the content image so that the transformed feature retains only the content contour; the Whiten transform is implemented as

\hat{f}_c = E_c D_c^{-1/2} E_c^T f_c

wherein f_c is the content-image feature extracted by VGG16, \hat{f}_c is the whitened content feature, D_c is a diagonal matrix whose elements are the eigenvalues of the covariance matrix f_c f_c^T, E_c is an orthogonal matrix satisfying f_c f_c^T = E_c D_c E_c^T, and D_c and E_c are obtained from the SVD of the covariance matrix;

the Color transform takes the feature f_s of the style image in the VGG16 feature space, first computes its covariance matrix and performs SVD on it, and then applies to the whitened content feature the inverse of the Whiten transform, i.e. the Color transform, transferring the whitened content feature onto the feature distribution of the style graph to obtain the WCT-transformed feature vector; the Color transform is implemented as

f_{cs} = E_s D_s^{1/2} E_s^T \hat{f}_c

wherein D_s and E_s are obtained from the SVD of the covariance matrix f_s f_s^T of the style feature; after the WCT matrix transformation, a feature fusion module FFM fuses the content feature vector with the WCT-transformed feature vector, strengthening the content constraint of the semantic graph in the fused vector and producing the final style-content feature fusion vector f_cs.
7. The multi-style image generation method based on feature fusion as claimed in claim 1, wherein: in order to balance the number of network parameters against the effect of spatial information extraction, the convolutional network in step 1 is a three-layer convolutional network; each layer comprises a convolutional layer, a normalization layer and an activation function layer, and the feature map output after the three convolutions is 1/8 the size of the original image.
8. The multi-style image generation method based on feature fusion as claimed in claim 1, wherein: the features of the input style graph t before the activation layers relu1_2, relu2_2, relu3_3 and relu4_3 of the VGG16 network, i.e. f_relu1_2(t), f_relu2_2(t), f_relu3_3(t) and f_relu4_3(t), are extracted and used as the original features for feature fusion; because these features come from different levels, the feature fusion module FFM fuses them in sequence from deep to shallow; the fused features then pass through the attention extraction module AEM, in which a self-attention model applies attention weighting to the different channels to obtain the style feature vector f_s.
9. The multi-style image generation method based on feature fusion as claimed in claim 1, wherein: in order to fully account for the influence of features of different depths on the loss calculation, five levels of features extracted by the VGG16 network are used in step 4, i.e. N = 5, and the weights w_i are 1/32, 1/16, 1/8, 1/4 and 1 in turn; the deeper the feature level, the larger the weight.
CN202110635370.2A 2021-06-02 2021-06-02 Multi-style image generation method based on feature fusion Active CN113255813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110635370.2A CN113255813B (en) 2021-06-02 2021-06-02 Multi-style image generation method based on feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110635370.2A CN113255813B (en) 2021-06-02 2021-06-02 Multi-style image generation method based on feature fusion

Publications (2)

Publication Number Publication Date
CN113255813A (en) 2021-08-13
CN113255813B true CN113255813B (en) 2022-12-02

Family

ID=77186962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110635370.2A Active CN113255813B (en) 2021-06-02 2021-06-02 Multi-style image generation method based on feature fusion

Country Status (1)

Country Link
CN (1) CN113255813B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919998A (en) * 2021-10-14 2022-01-11 天翼数字生活科技有限公司 Image anonymization method based on semantic and attitude map guidance
CN113642566B (en) * 2021-10-15 2021-12-21 南通宝田包装科技有限公司 Medicine package design method based on artificial intelligence and big data
CN113642262B (en) * 2021-10-15 2021-12-21 南通宝田包装科技有限公司 Toothpaste package appearance auxiliary design method based on artificial intelligence
CN114782590A (en) * 2022-03-17 2022-07-22 山东大学 Multi-object content joint image generation method and system
CN115272687B (en) * 2022-07-11 2023-05-05 哈尔滨工业大学 Single sample adaptive domain generator migration method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523463A (en) * 2018-11-20 2019-03-26 中山大学 A kind of face aging method generating confrontation network based on condition
CN109829353A (en) * 2018-11-21 2019-05-31 东南大学 A kind of facial image stylizing method based on space constraint
CN111325664A (en) * 2020-02-27 2020-06-23 Oppo广东移动通信有限公司 Style migration method and device, storage medium and electronic equipment
CN112017301A (en) * 2020-07-24 2020-12-01 武汉纺织大学 Style migration model and method for specific relevant area of clothing image
CN112132167A (en) * 2019-06-24 2020-12-25 商汤集团有限公司 Image generation and neural network training method, apparatus, device, and medium
CN112766079A (en) * 2020-12-31 2021-05-07 北京航空航天大学 Unsupervised image-to-image translation method based on content style separation
CN112861805A (en) * 2021-03-17 2021-05-28 中山大学 Face image generation method based on content features and style features

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112419328B (en) * 2019-08-22 2023-08-04 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523463A (en) * 2018-11-20 2019-03-26 中山大学 A kind of face aging method generating confrontation network based on condition
CN109829353A (en) * 2018-11-21 2019-05-31 东南大学 A kind of facial image stylizing method based on space constraint
CN112132167A (en) * 2019-06-24 2020-12-25 商汤集团有限公司 Image generation and neural network training method, apparatus, device, and medium
WO2020258902A1 (en) * 2019-06-24 2020-12-30 商汤集团有限公司 Image generating and neural network training method, apparatus, device, and medium
CN111325664A (en) * 2020-02-27 2020-06-23 Oppo广东移动通信有限公司 Style migration method and device, storage medium and electronic equipment
CN112017301A (en) * 2020-07-24 2020-12-01 武汉纺织大学 Style migration model and method for specific relevant area of clothing image
CN112766079A (en) * 2020-12-31 2021-05-07 北京航空航天大学 Unsupervised image-to-image translation method based on content style separation
CN112861805A (en) * 2021-03-17 2021-05-28 中山大学 Face image generation method based on content features and style features

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
The Contextual Loss for Image Transformation with Non-Aligned Data; Roey Mechrez et al.; arXiv; 20180718; pp. 1-16 *
Universal Style Transfer via Feature Transforms; Yijun Li et al.; arXiv; 20171117; pp. 1-11 *
Semantic segmentation algorithm based on a global bilateral network; 任天赐 et al.; Computer Science; 20200615; pp. 171-175 *
Research on image style transfer technology based on semantic segmentation; 李美丽 et al.; Computer Engineering and Applications; 20200409; pp. 207-213 *

Also Published As

Publication number Publication date
CN113255813A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN113255813B (en) Multi-style image generation method based on feature fusion
Li et al. A closed-form solution to photorealistic image stylization
CN108830912B (en) Interactive gray image coloring method for depth feature-based antagonistic learning
CN110378985B (en) Animation drawing auxiliary creation method based on GAN
CN108830913B (en) Semantic level line draft coloring method based on user color guidance
CN111862294B (en) Hand-painted 3D building automatic coloring network device and method based on ArcGAN network
CN110120049B (en) Method for jointly estimating scene depth and semantics by single image
CN105374007A (en) Generation method and generation device of pencil drawing fusing skeleton strokes and textural features
CN110020681A (en) Point cloud feature extracting method based on spatial attention mechanism
Zhao et al. Computer-aided graphic design for virtual reality-oriented 3D animation scenes
Li et al. High-resolution network for photorealistic style transfer
Ye et al. Multi-style transfer and fusion of image’s regions based on attention mechanism and instance segmentation
CN110097615B (en) Stylized and de-stylized artistic word editing method and system
CN111489405A (en) Face sketch synthesis system for generating confrontation network based on condition enhancement
CN116485892A (en) Six-degree-of-freedom pose estimation method for weak texture object
CN115690487A (en) Small sample image generation method
CN111064905A (en) Video scene conversion method for automatic driving
CN115512100A (en) Point cloud segmentation method, device and medium based on multi-scale feature extraction and fusion
CN115018729A (en) White box image enhancement method for content
Togo et al. Text-guided style transfer-based image manipulation using multimodal generative models
Bagwari et al. An edge filter based approach of neural style transfer to the image stylization
Li et al. FreePIH: Training-Free Painterly Image Harmonization with Diffusion Model
Shen et al. Overview of Cartoon Face Generation
Guo Design and Development of an Intelligent Rendering System for New Year's Paintings Color Based on B/S Architecture
Bagwari et al. A review: The study and analysis of neural style transfer in image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant