CN112581360A - Multi-style image aesthetic quality enhancement method based on structural constraint - Google Patents
- Publication number
- CN112581360A (application number CN202011609567.0A)
- Authority
- CN
- China
- Prior art keywords
- network
- image
- feature
- aesthetic quality
- graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T3/04
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/90—Determination of colour characteristics
Abstract
The invention discloses a structure-constrained multi-style image aesthetic quality enhancement method comprising the following steps: (1) converting the input image data into vectors in the LAB space; (2) inputting the LAB-space vectors into an enhancement network, which comprises two sub-networks, a structure adjustment network and a pixel adjustment network: the structure adjustment network improves the aesthetics of the composition, while the pixel adjustment network further adjusts the color and the light-and-shadow effect of the image by adjusting the value of each pixel; (3) refining the extracted features: the features output by the enhancement network are input into a refinement network to obtain the final aesthetic-quality-enhanced image; (4) a multi-scale, multi-distribution-constraint discrimination network is adopted to optimize the enhancement network and the refinement network, improving the quality of the final output. The structure adjustment network of the invention can automatically extract the optimal n beautified regions without human intervention.
Description
Technical Field
The invention provides a novel multi-style image aesthetic quality enhancement method based on structural constraints. It mainly relates to training a convolutional neural network that reconstructs partial regions of an image, captures deep feature information, and mixes in a specific style, yielding a model capable of multi-style aesthetic quality optimization of images.
Background
The image aesthetic quality enhancement process typically involves adjusting factors such as hue, saturation, and composition. Existing methods generally adopt two modes, cropping and pixel adjustment, and usually impose various rule constraints on the adjustment process based on expert knowledge, limiting the diversity of the enhancement effect. In addition, conventional methods do not consider the internal correlations of the image when adjusting its pixels, so the plausibility of the enhanced image's light, shadow, and color cannot be guaranteed. Finally, image aesthetics come in a variety of styles, with large differences in composition, color, and shading between styles; existing methods do not take the style factor into account, usually produce an enhancement with only a single style, and thus struggle to meet different user requirements.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a structure-constrained multi-style image aesthetic quality enhancement method. It breaks away from the cropping-based framework for adjusting image structure, introduces structural constraints derived from the image content, and realizes a multi-style framework for image aesthetic quality enhancement.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step (1) feature space transformation
The input image data is converted into vectors in the LAB space, so that the image's expression of color, light, and shadow is consistent with subjective human perception.
Step (2) feature extraction
The LAB-space vectors are input into an enhancement network comprising two sub-networks, a structure adjustment network and a pixel adjustment network: the structure adjustment network improves the aesthetics of the composition, while the pixel adjustment network further adjusts the color and light-and-shadow effects of the image by adjusting the value of each pixel.
Step (3) refining the extracted features
The features output by the enhancement network are input into the refinement network to obtain the final aesthetic-quality-enhanced image.
Step (4) multi-scale multi-distribution constraint discrimination network
A multi-scale, multi-distribution-constraint discrimination network is adopted to optimize the enhancement network and the refinement network, thereby improving the quality of the final aesthetic-quality-enhanced image.
Further, the feature space transformation in step (1):
1-1, preprocessing the input image by cropping, flipping, and similar operations;
1-2 convert the preprocessed image as input into a vector in LAB space.
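The feature space transformation of step (1) can be sketched as a per-pixel sRGB-to-CIELAB conversion. The following is a minimal pure-Python version assuming the standard sRGB matrix and D65 white point; in practice a library routine (e.g. an OpenCV or scikit-image color conversion) would be applied to the whole image.

```python
def srgb_to_lab(r, g, b):
    """Convert one sRGB pixel (0-255 integers) to CIELAB (D65 white point)."""
    def to_linear(c):                       # inverse sRGB gamma
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)
    # linear RGB -> CIE XYZ (sRGB/D65 matrix, 4-digit coefficients)
    x = 0.4124 * rl + 0.3576 * gl + 0.1805 * bl
    y = 0.2126 * rl + 0.7152 * gl + 0.0722 * bl
    z = 0.0193 * rl + 0.1192 * gl + 0.9505 * bl
    # normalize by the D65 reference white
    xn, yn, zn = x / 0.95047, y / 1.0, z / 1.08883

    def f(t):                               # CIELAB cube-root nonlinearity
        eps = (6 / 29) ** 3
        return t ** (1 / 3) if t > eps else t / (3 * (6 / 29) ** 2) + 4 / 29

    L = 116 * f(yn) - 16
    a = 500 * (f(xn) - f(yn))
    b_lab = 200 * (f(yn) - f(zn))
    return L, a, b_lab
```

A sanity check: pure white maps to L ≈ 100 with a and b near zero (not exactly zero, because the 4-digit matrix rows do not sum exactly to the white point), and pure black maps to L = 0.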
Further, the feature extraction in step (2) is implemented as follows:
2-1 structural adjustment network:
the method comprises the steps of training a pre-trained target detection reference network by combining a composition marking data set and an aesthetic quality evaluation data set, carrying out image aesthetic task fine adjustment on the network in the training process, adopting a graph evaluation model to score candidate regions by a fine adjustment strategy, and then selecting the optimal top n candidate regions based on a sequencing result. The pre-trained target detection reference network has better composition evaluation and aesthetic quality prediction capabilities, and therefore reliable feedback is provided for generation of the candidate region.
The output of the trained target detection reference network is taken as the input of the graph attention network (GAT): the reference network extracts target features, relation features, and region features from the input image and constructs a graph. The constructed graph is fed into a multi-layer graph attention network, which outputs a beautification map and a feature matrix corresponding to the beautified input image. In the iterative process of the GAT, the features of each graph attention layer express the progressive transformation of the image structure and the semantic expression of the corresponding content, so the predicted beautification maps and the feature matrices {X(1), X(2), ..., X(L)} of all layers in the GAT are input into the refinement network for synthesizing the enhanced image.
2-2 pixel adjustment network: it adaptively adjusts the shading and color of the image for different styles. The Lab three-channel data of the input image are fed into a content encoder to extract high-level semantic features. In addition, since different aesthetic styles call for different pixel adjustment rules, a one-hot style label vector is simultaneously fed into a style encoder to extract high-level style features. The content and style features are then concatenated and input into a decoder, which predicts a pixel adjustment factor matrix T for each position of the Lab three channels using an adjustable Sigmoid activation function k·σ(·), where k is an adjustment factor and σ(·) denotes the Sigmoid function. Finally, the pixel adjustment factor matrix T is multiplied element-wise with the Lab matrix X of the original input image to obtain the brightness- and color-adjusted image T⊙X.
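The adjustable activation k·σ(·) and the final element-wise product T⊙X can be sketched as follows. This is a toy 2×2, single-channel example; in the method itself T is predicted by the decoder, and the logit values here are made up for illustration.

```python
import math

def adjustable_sigmoid(v, k=2.0):
    """Scaled sigmoid k*sigma(v): maps a decoder logit to a factor in (0, k)."""
    return k / (1.0 + math.exp(-v))

def apply_adjustment(T, X):
    """Element-wise (Hadamard) product of adjustment factors and image values."""
    return [[t * x for t, x in zip(trow, xrow)] for trow, xrow in zip(T, X)]

logits = [[0.0, 1.5], [-1.5, 0.0]]      # hypothetical decoder outputs
T = [[adjustable_sigmoid(v) for v in row] for row in logits]
X = [[50.0, 80.0], [80.0, 50.0]]        # illustrative L-channel values
adjusted = apply_adjustment(T, X)
```

With k = 2, a zero logit yields a factor of exactly 1 (pixel unchanged), positive logits brighten, and negative logits darken.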
Furthermore, the content encoder, the style encoder, and the decoder are connected in a U-Net manner. Since similar regions in the input image should receive similar adjustment factors k, a Guided Attention (GA) mechanism is adopted to reconstruct the decoder output features. In the guided attention computation between a content encoder feature map y (e.g., from layer l) and the corresponding decoding-layer feature map x (layer n−l), α(·) represents an attention calculation function, typically a feature-similarity measure between different locations; f(·), g(·), and h(·) are mappings of the feature map x. Since α(·) describes the correlation between all locations in the input image, it serves as a structural description of the input image and is used to reconstruct the decoder output features. The reconstructed feature map z is concatenated with the content encoder feature map y and the style encoder feature map s, then fed into the subsequent decoding layer to produce the output. This ensures that similar positions in the input image receive similar entries in the output pixel adjustment factor matrix T, encouraging the output image to preserve a structure similar to the input.
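A minimal sketch of the guided-attention reconstruction described above, assuming the common dot-product form in which f and g produce queries and keys, h produces values, and α(·) is a softmax over pairwise position similarities. The exact projections are not specified in the text, so plain linear maps Wf, Wg, Wh are assumed here.

```python
import numpy as np

def guided_attention(y, x, Wf, Wg, Wh):
    """Reconstruct decoder features x guided by encoder features y.

    y, x: (N, C) feature maps flattened to N spatial positions, C channels.
    Wf, Wg, Wh: (C, D) linear projections standing in for f, g, h.
    """
    q = y @ Wf                         # queries from the encoder feature map
    k = x @ Wg                         # keys from the decoder feature map
    v = x @ Wh                         # values from the decoder feature map
    sim = q @ k.T                      # pairwise position similarities
    sim -= sim.max(axis=1, keepdims=True)          # numerical stability
    alpha = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    return alpha @ v                   # z: structure-aware reconstruction

rng = np.random.default_rng(0)
N, C, D = 16, 8, 8
y, x = rng.standard_normal((N, C)), rng.standard_normal((N, C))
Wf, Wg, Wh = (rng.standard_normal((C, D)) for _ in range(3))
z = guided_attention(y, x, Wf, Wg, Wh)
```

Setting y = x in this sketch gives ordinary self-attention, which is the SA mechanism used later in the refinement network.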
Further, the refining process in step (3):
3-1 refinement network: based on the beautification map and feature matrix output by the structure adjustment network, together with the image output by the pixel adjustment network, the beautified image is synthesized.
The refinement network adopts an encoding-decoding structure, in which the convolutional layers use residual network blocks and a Self-Attention (SA) mechanism is introduced in the decoding layers to reconstruct the decoder output features. The SA computation on the decoded-layer feature map x is defined as follows:
where α(·) represents an attention calculation function; f(·), g(·), and h(·) are mappings of the feature map x; and α(·) describes the correlation between all locations in the input image. The basic idea: each location is reconstructed using the features of all locations, so similar contents in the output image obtain similar feature expressions and hence similar appearance, ensuring the plausibility of the aesthetic-quality-enhanced image.
Further, the multi-scale multi-distribution constraint discriminating network in the step (4):
4-1 aesthetic quality discrimination: multi-distribution constraints are adopted to improve the quality of the aesthetic-quality-enhanced image. To strengthen the discrimination network, a pre-trained image aesthetic quality evaluation model is first used as the aesthetic feature extraction module; then the feature maps of three network layers at different depths of that model are used as inputs of the discrimination network, and a discrimination sub-network is constructed for each. Different discrimination sub-networks correspond to expressions of image aesthetic quality at different scales. To improve the discrimination capability of the model, a multi-task, multi-label learning scheme is adopted: each discrimination sub-network simultaneously predicts the image style type, the aesthetic quality (G/B), and the authenticity (R/F), using cross-entropy loss, Triplet Loss, and L2 loss respectively. The triplet loss is used because the true aesthetic image Y, the enhanced image Ŷ, and the original image X should satisfy an aesthetic-quality ordering Y ≽ Ŷ ≽ X; Triplet Loss is therefore introduced as an objective function, namely:
where α is a regulating (margin) factor and [·]+ means that only terms greater than 0 are taken.
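The triplet objective above, pull the enhanced image's features toward the true aesthetic image Y and away from the original X by at least the margin α, can be sketched as a hinge over feature distances. Squared Euclidean distance and the small feature vectors are assumptions for illustration; in the method the features come from the discriminator's aesthetic feature extractor.

```python
def triplet_loss(feat_enh, feat_real, feat_orig, alpha=0.2):
    """Hinge triplet loss: d(enh, real) should beat d(enh, orig) by margin alpha."""
    def dist(a, b):                    # squared Euclidean distance
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return max(0.0, dist(feat_enh, feat_real) - dist(feat_enh, feat_orig) + alpha)

# illustrative 3-D feature vectors
real, enh, orig = [1.0, 1.0, 1.0], [0.9, 1.0, 1.1], [0.0, 0.0, 0.0]
loss = triplet_loss(enh, real, orig)   # enhanced features near Y, far from X
```

When the enhanced features already sit close to Y and far from X, the hinge is inactive and the loss is zero; a violating triple yields a positive penalty.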
4-2 content discrimination network: a pre-trained Faster R-CNN network is adopted to compare the feature maps extracted from corresponding regions of the aesthetic-quality-enhanced image and the input image, and the L2 distance between the feature maps is computed as the content loss. In the training phase, the generator and the discriminator are optimized end-to-end; in the testing phase, a given image and a style label are input into the generator, which outputs the enhanced image.
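The content loss of step 4-2 is an L2 distance between feature maps of corresponding regions. In the sketch below, plain flattened lists stand in for Faster R-CNN region features, which this example does not compute.

```python
def l2_content_loss(feats_a, feats_b):
    """Mean squared L2 distance between two flattened feature maps."""
    assert len(feats_a) == len(feats_b)
    return sum((a - b) ** 2 for a, b in zip(feats_a, feats_b)) / len(feats_a)

enhanced_feats = [0.5, 0.8, 0.3, 0.9]   # illustrative region features
input_feats = [0.5, 0.6, 0.3, 1.1]
loss = l2_content_loss(enhanced_feats, input_feats)
```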
The invention has the following beneficial effects:
the invention aims to overcome the defects of the prior art and provides a method for enhancing the aesthetic quality of a multi-format image based on structural constraint. In order to break through the frame of adjusting the image structure based on clipping and introduce the structural constraint of the image content, a multi-style frame of enhancing the image aesthetic quality is realized. The invention has the advantage that the structure adjusting network can automatically extract the optimal n beautifying areas without human intervention. Further attention is drawn to the extent to which the network can further refine the area. And then, an optimization strategy is automatically provided by the pixel adjustment network through extracting image characteristics, and the optimization strategy is input into the refinement network, so that the model can beautify the input image with high efficiency and high quality.
Drawings
FIG. 1 is a schematic diagram of an aesthetic quality assessment framework using composition blending with global features;
FIG. 2 is an architectural diagram of a global feature and composition feature extraction network;
FIG. 1 (a) is an overall architecture diagram of a structurally constrained multi-format image aesthetic quality enhancement model;
FIG. 1 (b) is a schematic diagram of the structure adjustment network architecture;
FIG. 2 (c) is a schematic diagram of the pixel adjustment network architecture;
FIG. 2 (d) is a schematic diagram of the multi-scale, multi-distribution-constraint discrimination network architecture.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1 and fig. 2, a structure-constrained multi-style image aesthetic quality enhancement method specifically includes the following steps:
step (1) feature space transformation
The input image data is converted into vectors in the LAB space, so that the image's expression of color, light, and shadow is consistent with subjective human perception.
Step (2) feature extraction
The LAB-space vectors are input into an enhancement network comprising two sub-networks, a structure adjustment network and a pixel adjustment network: the structure adjustment network improves the aesthetics of the composition, while the pixel adjustment network further adjusts the color and light-and-shadow effects of the image by adjusting the value of each pixel.
Step (3) refining the extracted features
The features output by the enhancement network are input into the refinement network to obtain the final aesthetic-quality-enhanced image.
Step (4) multi-scale multi-distribution constraint discrimination network
A multi-scale, multi-distribution-constraint discrimination network is adopted to optimize the enhancement network and the refinement network, thereby improving the quality of the final aesthetic-quality-enhanced image.
Further, the feature space transformation in step (1):
1-1, preprocessing the input image by cropping, flipping, and similar operations;
1-2 convert the preprocessed image as input into a vector in LAB space.
Further, the feature extraction in step (2) is implemented as follows:
2-1 structural adjustment network:
the method comprises the steps of training a pre-trained target detection reference network by combining a composition marking data set and an aesthetic quality evaluation data set, carrying out image aesthetic task fine adjustment on the network in the training process, adopting a graph evaluation model to score candidate regions by a fine adjustment strategy, and then selecting the optimal top n candidate regions based on a sequencing result. The pre-trained target detection reference network has better composition evaluation and aesthetic quality prediction capabilities, and therefore reliable feedback is provided for generation of the candidate region.
The output of the trained target detection reference network is taken as the input of the graph attention network (GAT): the reference network extracts target features, relation features, and region features from the input image and constructs a graph. The constructed graph is fed into a multi-layer graph attention network, which outputs a beautification map and a feature matrix corresponding to the beautified input image. In the iterative process of the GAT, the features of each graph attention layer express the progressive transformation of the image structure and the semantic expression of the corresponding content, so the predicted beautification maps and the feature matrices {X(1), X(2), ..., X(L)} of all layers in the GAT are input into the refinement network for synthesizing the enhanced image.
2-2 pixel adjustment network: it adaptively adjusts the shading and color of the image for different styles. The Lab three-channel data of the input image are fed into a content encoder to extract high-level semantic features. In addition, since different aesthetic styles call for different pixel adjustment rules, a one-hot style label vector is simultaneously fed into a style encoder to extract high-level style features. The content and style features are then concatenated and input into a decoder, which predicts a pixel adjustment factor matrix T for each position of the Lab three channels using an adjustable Sigmoid activation function k·σ(·), where k is an adjustment factor and σ(·) denotes the Sigmoid function. Finally, the pixel adjustment factor matrix T is multiplied element-wise with the Lab matrix X of the original input image to obtain the brightness- and color-adjusted image T⊙X.
Furthermore, the content encoder, the style encoder, and the decoder are connected in a U-Net manner. Since similar regions in the input image should receive similar adjustment factors k, a Guided Attention (GA) mechanism is adopted to reconstruct the decoder output features. In the guided attention computation between a content encoder feature map y (e.g., from layer l) and the corresponding decoding-layer feature map x (layer n−l), α(·) represents an attention calculation function, typically a feature-similarity measure between different locations; f(·), g(·), and h(·) are mappings of the feature map x. Since α(·) describes the correlation between all locations in the input image, it serves as a structural description of the input image and is used to reconstruct the decoder output features. The reconstructed feature map z is concatenated with the content encoder feature map y and the style encoder feature map s, then fed into the subsequent decoding layer to produce the output. This ensures that similar positions in the input image receive similar entries in the output pixel adjustment factor matrix T, encouraging the output image to preserve a structure similar to the input.
Further, the refining process in step (3):
3-1 refinement network: based on the beautification map and feature matrix output by the structure adjustment network, together with the image output by the pixel adjustment network, the beautified image is synthesized.
The refinement network adopts an encoding-decoding structure, in which the convolutional layers use residual network blocks and a Self-Attention (SA) mechanism is introduced in the decoding layers to reconstruct the decoder output features. The SA computation on the decoded-layer feature map x is defined as follows:
where α(·) represents an attention calculation function; f(·), g(·), and h(·) are mappings of the feature map x; and α(·) describes the correlation between all locations in the input image. The basic idea: each location is reconstructed using the features of all locations, so similar contents in the output image obtain similar feature expressions and hence similar appearance, ensuring the plausibility of the aesthetic-quality-enhanced image.
Further, the multi-scale multi-distribution constraint discriminating network in the step (4):
4-1 aesthetic quality discrimination: multi-distribution constraints are adopted to improve the quality of the aesthetic-quality-enhanced image. To strengthen the discrimination network, a pre-trained image aesthetic quality evaluation model is first used as the aesthetic feature extraction module; then the feature maps of three network layers at different depths of that model are used as inputs of the discrimination network, and a discrimination sub-network is constructed for each. Different discrimination sub-networks correspond to expressions of image aesthetic quality at different scales. To improve the discrimination capability of the model, a multi-task, multi-label learning scheme is adopted: each discrimination sub-network simultaneously predicts the image style type, the aesthetic quality (G/B), and the authenticity (R/F), using cross-entropy loss, Triplet Loss, and L2 loss respectively. The triplet loss is used because the true aesthetic image Y, the enhanced image Ŷ, and the original image X should satisfy an aesthetic-quality ordering Y ≽ Ŷ ≽ X; Triplet Loss is therefore introduced as an objective function, namely:
where α is a regulating (margin) factor and [·]+ means that only terms greater than 0 are taken.
4-2 content discrimination network: a pre-trained Faster R-CNN network is adopted to compare the feature maps extracted from corresponding regions of the aesthetic-quality-enhanced image and the input image, and the L2 distance between the feature maps is computed as the content loss. In the training phase, the generator and the discriminator are optimized end-to-end; in the testing phase, a given image and a style label are input into the generator, which outputs the enhanced image.
Claims (5)
1. A method for enhancing aesthetic quality of a multi-style image with structural constraint is characterized by comprising the following steps:
step (1) feature space conversion; converting input image data into vectors of an LAB space;
step (2) feature extraction; inputting the vectors converted into the LAB space into an enhancement network, wherein the enhancement network comprises two sub-networks, a structure adjustment network and a pixel adjustment network; the structure adjustment network is used for improving the aesthetics of the composition; the pixel adjustment network further adjusts the color and the light-and-shadow effect of the image by adjusting the value of each pixel;
step (3) refining the extracted features; inputting the features output by the enhancement network into the refinement network to obtain the final aesthetic-quality-enhanced image;
step (4), multi-scale multi-distribution constraint discrimination network; and optimizing an enhancement network and a refinement network by adopting a multi-scale multi-distribution constraint discrimination network, thereby improving the quality of the final output aesthetic quality enhanced image.
2. A method for enhancing the aesthetic quality of a structurally-constrained multi-style image according to claim 1, wherein the feature space transformation of step (1):
1-1, performing cropping and flipping preprocessing on the input image;
1-2 convert the preprocessed image as input into a vector in LAB space.
3. A method for enhancing the aesthetic quality of a structurally-constrained multi-style image according to claim 1 or 2, characterized in that the feature extraction in step (2) is implemented as follows:
2-1 structural adjustment network:
the method comprises the steps that a pre-trained target detection reference network is adopted, a composition marking data set and an aesthetic quality evaluation data set are combined to train the pre-trained target detection reference network, image aesthetic task fine adjustment is conducted on the network in the training process, a fine adjustment strategy is to adopt a graph evaluation model to score candidate regions, and then the optimal top n candidate regions are selected based on a sequencing result; the pre-trained target detection reference network has better composition evaluation and aesthetic quality prediction capabilities, so that reliable feedback is provided for generation of candidate areas;
taking the output of the trained target detection reference network as the input of the graph attention network; extracting target features, relation features, and region features from the input image by the target detection reference network, and constructing a graph; then inputting the constructed graph into a multi-layer graph attention network, which outputs a beautification map and a feature matrix corresponding to the beautified input image; in the iterative process of the GAT, the features of each graph attention layer express the progressive transformation of the image structure and the semantic expression of the corresponding content, so the predicted beautification maps and the feature matrices {X(1), X(2), ..., X(L)} of all layers in the GAT are input into the refinement network for synthesizing the enhanced image;
2-2 pixel adjustment network: aiming at adaptively adjusting the shading and color of the image for different styles; inputting the Lab three-channel data of the input image into a content encoder and extracting high-level semantic features; meanwhile, inputting the one-hot style label vector into a style encoder and extracting high-level style features; then concatenating the content and style features and inputting them into a decoder, which predicts a pixel adjustment factor matrix T for each position of the Lab three channels using an adjustable Sigmoid activation function k·σ(·), where k is an adjustment factor and σ(·) denotes the Sigmoid function; finally, multiplying the pixel adjustment factor matrix T element-wise with the Lab matrix X of the original input image to obtain the brightness- and color-adjusted image T⊙X;
the content encoder, the style encoder, and the decoder are connected as a whole in a U-Net manner; since similar regions in the input image should receive similar adjustment factors k, a guided attention mechanism is adopted to reconstruct the decoder output features; in the guided attention computation between the content encoder feature map y and the corresponding decoding-layer feature map x, α(·) represents an attention calculation function; f(·), g(·), and h(·) are mappings of the feature map x; α(·) describes the correlation between all locations in the input image and is therefore taken as a structural description of the input image, used to reconstruct the decoder output features; the reconstructed feature map z is concatenated with the content encoder feature map y and the style encoder feature map s, then fed into the subsequent decoding layer to produce the output; this ensures that similar positions in the input image receive similar entries in the output pixel adjustment factor matrix T, encouraging the output image to preserve a structure similar to the input.
4. The method for enhancing the aesthetic quality of multi-style images based on structural constraint according to claim 3, wherein the refinement network of step (3) comprises:
3-1 refinement network: synthesizes the final result from the beautified image and feature matrix output by the structure adjustment network together with the image output by the pixel adjustment network;
the refinement network adopts an encoding-decoding structure in which the convolution layers use residual network blocks, and a self-attention (SA) mechanism is introduced into the decoding layers to reconstruct the decoder output features; the SA calculation for the decoding-layer feature map x is represented as

α_{j,i} = softmax_i( f(x_i)^T g(x_j) ),   SA(x)_j = Σ_i α_{j,i} h(x_i)

where α(·) denotes the attention calculation function and f(·), g(·), h(·) are learned mappings of the feature map x; α(·) describes the correlation between all positions in the input; the basic idea is to reconstruct each specific position from the features of all positions, so that similar content in the output image obtains similar feature expressions and hence a similar appearance, which guarantees the plausibility of the aesthetic-quality-enhanced image.
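A numpy sketch of the SA step in the decoding layer; the residual term and the scale gamma follow common self-attention practice and are an assumption, not taken from the patent text, as are all sizes:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wf, Wg, Wh, gamma=1.0):
    """SA over a flattened decoding-layer feature map x (N, C):
    each position is rebuilt as an attention-weighted sum of all positions."""
    f, g, h = x @ Wf, x @ Wg, x @ Wh
    alpha = softmax(g @ f.T, axis=-1)  # (N, N) correlations between positions
    return x + gamma * (alpha @ h)     # reconstructed decoder features

rng = np.random.default_rng(1)
N, C, Ck = 16, 8, 4                    # illustrative sizes
x = rng.standard_normal((N, C))
Wf = rng.standard_normal((C, Ck))
Wg = rng.standard_normal((C, Ck))
Wh = rng.standard_normal((C, C))
out = self_attention(x, Wf, Wg, Wh, gamma=0.5)
```

Unlike the guided attention of the pixel adjustment network, the correlations here are computed from the decoder features themselves, so coherence is enforced within the output rather than copied from the input.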
5. The method according to claim 4, wherein the multi-scale multi-distribution constraint discrimination network of step (4) comprises:
4-1, firstly, a pre-trained image aesthetic quality evaluation model serves as the aesthetic feature extraction module; then, feature maps from three network layers of different depths of this model are used as the inputs of the discrimination network, and a discrimination sub-network is built for each, the different sub-networks corresponding to the expression of image aesthetic quality at different scales; a multi-task, multi-label learning scheme improves the discrimination capability of the model: each discrimination sub-network simultaneously predicts the image style category, the aesthetic quality (good/bad) and the authenticity (real/fake), adopting cross-entropy loss, triplet loss and L2 loss respectively; for a real high-aesthetic image Y, the enhanced image Ŷ and the original image X should satisfy the aesthetic ordering that Y is at least as aesthetic as Ŷ, which in turn is at least as aesthetic as X; the triplet loss is therefore introduced as an objective function, namely:

L_tri = [ ||φ(Ŷ) − φ(Y)||² − ||φ(Ŷ) − φ(X)||² + α ]_+

wherein φ(·) denotes features from the aesthetic feature extraction module, α is an adjustment factor (margin), and [·]_+ means that only terms greater than 0 are taken;
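A minimal sketch of the triplet objective on feature vectors (the squared-L2 distance and the margin value are illustrative assumptions; in the patent's setting the features would come from the aesthetic feature extraction module):

```python
import numpy as np

def triplet_loss(feat_enh, feat_real, feat_orig, alpha=0.3):
    """[ d(enh, real) - d(enh, orig) + alpha ]_+ with squared L2 distances:
    pull the enhanced image's features toward a real high-quality image Y
    and push them away from the original input X."""
    d_pos = float(np.sum((feat_enh - feat_real) ** 2))
    d_neg = float(np.sum((feat_enh - feat_orig) ** 2))
    return max(d_pos - d_neg + alpha, 0.0)

# toy feature vectors: enhanced result still far from the "real" anchor
loss = triplet_loss(np.array([0.0, 0.0]),   # enhanced image features
                    np.array([2.0, 0.0]),   # real high-aesthetic features
                    np.array([1.0, 0.0]))   # original image features
# d_pos = 4.0, d_neg = 1.0  ->  4.0 - 1.0 + 0.3 = 3.3
```

Once the enhanced features are closer to the real image than to the original by more than the margin α, the hinge clamps the loss to zero and the gradient vanishes.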
4-2, a pre-trained Faster-RCNN network extracts feature maps from corresponding regions of the aesthetic-quality-enhanced image and of the input image, and the L2 distance between these feature maps is calculated as the content loss; in the training stage, the generator and the discriminator are optimized in an end-to-end manner; in the testing stage, a given image and a style label are input into the generator, which outputs the enhanced image.
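The content-loss term reduces to a distance between two feature maps; a sketch with mean-squared (L2) distance, assuming the features were already extracted by a pretrained detector such as Faster-RCNN (the extraction itself is not reproduced here):

```python
import numpy as np

def content_loss(feat_a, feat_b):
    """Mean squared (L2) distance between feature maps of corresponding
    regions of the enhanced image and the input image."""
    return float(np.mean((feat_a - feat_b) ** 2))

# toy feature maps for two corresponding regions
loss = content_loss(np.zeros((2, 2)), np.ones((2, 2)))  # 1.0
```

Comparing detector features of corresponding regions rather than raw pixels lets the enhancement change colors and tones freely while still penalizing changes to the depicted content.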
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011609567.0A CN112581360B (en) | 2020-12-30 | 2020-12-30 | Method for enhancing aesthetic quality of multi-style image based on structural constraint |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112581360A true CN112581360A (en) | 2021-03-30 |
CN112581360B CN112581360B (en) | 2024-04-09 |
Family
ID=75144595
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011609567.0A Active CN112581360B (en) | 2020-12-30 | 2020-12-30 | Method for enhancing aesthetic quality of multi-style image based on structural constraint |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112581360B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110458750A (en) * | 2019-05-31 | 2019-11-15 | 北京理工大学 | A kind of unsupervised image Style Transfer method based on paired-associate learning |
CN110782448A (en) * | 2019-10-25 | 2020-02-11 | 广东三维家信息科技有限公司 | Rendered image evaluation method and device |
Non-Patent Citations (1)
Title |
---|
LAN Hong; LIU Qinyi: "Scene graph to image generation model with graph attention network", Journal of Image and Graphics (中国图象图形学报), no. 08 *
Also Published As
Publication number | Publication date |
---|---|
CN112581360B (en) | 2024-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110223359B (en) | Dual-stage multi-color-matching-line draft coloring model based on generation countermeasure network and construction method and application thereof | |
CN110443864B (en) | Automatic artistic font generation method based on single-stage small-amount sample learning | |
Cheng et al. | Light-guided and cross-fusion U-Net for anti-illumination image super-resolution | |
CN113313644B (en) | Underwater image enhancement method based on residual double-attention network | |
CN111145290B (en) | Image colorization method, system and computer readable storage medium | |
CN109766822B (en) | Gesture recognition method and system based on neural network | |
CN113222875B (en) | Image harmonious synthesis method based on color constancy | |
CN112950661A (en) | Method for generating antithetical network human face cartoon based on attention generation | |
Li et al. | Globally and locally semantic colorization via exemplar-based broad-GAN | |
CN111275613A (en) | Editing method for generating confrontation network face attribute by introducing attention mechanism | |
CN111160138A (en) | Fast face exchange method based on convolutional neural network | |
CN113392711A (en) | Smoke semantic segmentation method and system based on high-level semantics and noise suppression | |
CN112767286A (en) | Dark light image self-adaptive enhancement method based on intensive deep learning | |
CN111696136A (en) | Target tracking method based on coding and decoding structure | |
CN113610732A (en) | Full-focus image generation method based on interactive counterstudy | |
CN115222581A (en) | Image generation method, model training method, related device and electronic equipment | |
CN115984323A (en) | Two-stage fusion RGBT tracking algorithm based on space-frequency domain equalization | |
CN114639002A (en) | Infrared and visible light image fusion method based on multi-mode characteristics | |
CN114359626A (en) | Visible light-thermal infrared obvious target detection method based on condition generation countermeasure network | |
CN117351340A (en) | Underwater image enhancement algorithm based on double-color space | |
CN112837212A (en) | Image arbitrary style migration method based on manifold alignment | |
CN117151990B (en) | Image defogging method based on self-attention coding and decoding | |
CN112581360B (en) | Method for enhancing aesthetic quality of multi-style image based on structural constraint | |
CN109522918B (en) | Hyperspectral image feature extraction method based on improved local singular spectrum analysis | |
CN116503502A (en) | Unpaired infrared image colorization method based on contrast learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||