CN112598662B - Image aesthetic description generation method based on hidden information learning


Info

Publication number
CN112598662B
Authority
CN
China
Prior art keywords
text
aesthetic
features
scale
factor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011609603.3A
Other languages
Chinese (zh)
Other versions
CN112598662A (en)
Inventor
俞俊
李相�
高飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202011609603.3A
Publication of CN112598662A
Application granted
Publication of CN112598662B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • G06T 7/0002 — Image analysis; inspection of images, e.g. flaw detection
    • G06F 18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 40/30 — Handling natural language data; semantic analysis
    • G06N 3/08 — Neural networks; learning methods
    • G06V 10/40 — Image or video recognition; extraction of image or video features
    • G06T 2207/20081 — Image analysis indexing scheme; training; learning
    • G06T 2207/20084 — Image analysis indexing scheme; artificial neural networks [ANN]
    • G06T 2207/30168 — Image analysis indexing scheme; image quality inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for generating an aesthetic description of an image based on hidden information learning. The method comprises the following steps: (1) model preprocessing: an object detection network Enc_v and a Transformer network Enc_t extract multi-scale feature representations from the image and the text comment, respectively; (2) cross-modal consistency feature extraction based on adversarial learning: a feature modality discriminator is constructed using the idea of adversarial learning; (3) multi-factor-controlled aesthetic comment generation: with aesthetic factor labels as auxiliary information, an aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into a comment decoder to generate text comments; (4) a multi-task-constrained discrimination network enforces the effectiveness of the multi-scale image and text features and the rationality of the generated text comments; (5) an adversarial loss based on hidden information learning. The invention generates text matched to the aesthetic quality of the input image, improving the robustness and accuracy of the model.

Description

Image aesthetic description generation method based on hidden information learning
Technical Field
The invention provides a method for generating image aesthetic descriptions based on hidden information learning. It mainly relates to a generative adversarial learning framework and targets the problems of small-scale, noisy annotation data: using the idea of learning using privileged information (Learning Using Privileged Information, LUPI), it estimates the reliability of noisy data and uses this estimate as a relaxation term in the adversarial loss function, improving model training efficiency and performance.
Background
Image aesthetic quality assessment (Photo Quality Assessment) aims to computationally evaluate the aesthetic quality of a picture based on an artistic understanding of the image. Related research tasks can be broadly divided into five categories: binary quality classification (professional/amateur, beautiful/ugly, good/bad), quality score prediction (e.g., describing the aesthetics with a score of 0-10), quality score distribution prediction (the probability distribution of subjective scores given by different observers for the same image), aesthetic factor prediction (the quality level of factors such as composition, lighting, and color matching), and aesthetic description (text comments on the aesthetics of an image, discussing why the image is good or bad). Current research on image aesthetic quality focuses mainly on the first three tasks, for which the corresponding aesthetic databases have high-quality, large-scale annotations. In contrast, aesthetic factor prediction and aesthetic description are significant for understanding image aesthetics, but related research is still in its infancy, and the annotation data is low in quality and small in scale, making it difficult to meet the training-sample requirements of large-scale deep networks.
Most existing methods extract features from the image alone and focus on aesthetic quality classification or score prediction tasks. In recent years, a small number of works have studied image aesthetic factor analysis and text comment/description generation. For example, Chang et al. combine a convolutional neural network with a long short-term memory network and build an aesthetic-factor guidance and fusion mechanism for image aesthetic description, but the generated text lacks reliable guidance. Text comment information is significant for understanding the aesthetic mechanism of an image. However, existing image aesthetic comment data is noisy and small in scale, and can hardly meet the training requirements of a deep network. Therefore, how to learn the association between text and image from limited, noisy data and explore the causal reasoning mechanism of image aesthetic quality assessment is a current research hotspot and difficulty.
The image aesthetic description task presents two technical difficulties. The first is model learning from small samples: given that existing image description models require large-scale labeled samples, an effective learning strategy must be designed for the small-sample setting. The second is that the labeled samples contain substantial noise: the conventional discrimination mechanism in adversarial learning makes a hard split between real and generated samples, which inevitably introduces erroneous information, so an asymmetric joint learning method must be designed to acquire effective information while avoiding noisy information.
Disclosure of Invention
It is an object of the present invention to address the deficiencies of the prior art and to provide a method of image aesthetic description generation based on hidden information learning.
The technical solution adopted by the invention to solve these problems comprises the following steps:
Step (1): model preprocessing
The model adopts a pre-trained object detection network Enc_v and a Transformer network Enc_t as backbones: the object detection network Enc_v extracts multi-scale image features from the input image, and the Transformer network Enc_t extracts multi-scale text features from real text comments.
Step (2): cross-modal consistency feature extraction based on adversarial learning
A feature modality discriminator is constructed using the idea of adversarial learning, and the multi-scale image features and multi-scale text features extracted in step (1) are fed into it, so that the image and text features it receives become as similar as possible.
Step (3): generation of multi-factor-controlled aesthetic text comments
With aesthetic factor labels as auxiliary information, an aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into a comment decoder to generate text comments.
Step (4): a discrimination network based on multi-task constraints enforces the accuracy of the features with respect to aesthetic factor labels and text quality.
The multi-task-constrained discrimination network employs a text quality prediction loss and an aesthetic factor prediction loss. In the form of multi-task learning, the text quality prediction and the aesthetic factor prediction enforce the effectiveness of the multi-scale image and text features and the rationality of the generated text comments. The two losses are weighted and summed to guide the training of the model.
Step (5): adversarial loss based on hidden information learning
Based on the idea of hidden information learning, a learnable relaxation factor is introduced into the adversarial loss function according to the strength of the correlation between the real text comments and the aesthetic quality, guiding the training of the model.
Further, the model preprocessing in step (1) is as follows:
1-1. The object detection network Enc_v and the Transformer network Enc_t are pre-trained: Enc_v on a large-scale image object detection dataset, and Enc_t on a natural language processing dataset.
1-2. The pre-trained object detection network Enc_v and Transformer network Enc_t are fine-tuned on the aesthetic quality assessment dataset to obtain better feature extraction capability. The fine-tuning stage takes the form of semi-supervised learning. In the "aesthetic factor encoder Enc_f - visual encoder Enc_v - text decoder Dec_t - multiple discrimination networks" branch, the object detection network Enc_v is trained following the standard adversarial generative learning paradigm. In the "aesthetic factor encoder Enc_f - text encoder Enc_t - text decoder Dec_t - multiple discrimination networks" branch, the Transformer network Enc_t adopts the idea of cycle-consistent generative adversarial networks, adding a reconstruction consistency constraint on text generation.
1-3. The input image is fed into the fine-tuned object detection network Enc_v to extract multi-scale image features; the real text comments are fed into the Transformer network Enc_t to extract multi-scale text features.
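As a concrete illustration only, the following is a minimal PyTorch sketch of the two feature extractors. The class names, dimensions, and layer choices are assumptions made for this sketch; the patent's actual Enc_v is a pre-trained, fine-tuned object detection network and Enc_t a pre-trained Transformer, not the toy modules shown here.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Stand-in for the object detection network Enc_v: returns feature
    maps at several spatial scales, each flattened to a set of
    region-like feature vectors of shape (B, N_scale, D)."""
    def __init__(self, dim=256):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, img):                          # img: (B, 3, H, W)
        c1 = self.stem(img)
        c2 = self.stage2(c1)
        c3 = self.stage3(c2)
        return [c.flatten(2).transpose(1, 2) for c in (c1, c2, c3)]

class TextEncoder(nn.Module):
    """Stand-in for the Transformer network Enc_t over real text comments."""
    def __init__(self, vocab=30000, dim=256, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens):                       # tokens: (B, T) int64
        h = self.encoder(self.embed(tokens))
        # two "scales" of text features: token-level and sentence-level
        return [h, h.mean(dim=1, keepdim=True)]
```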
Further, the cross-modal consistency feature extraction based on adversarial learning in step (2) is as follows:
2-1. A feature modality discriminator D_m is constructed using the idea of adversarial learning. D_m must judge the modality of an input feature. The multi-scale image features and multi-scale text features extracted in step (1) are fed into D_m, and the encoders are trained so that their image and text features become as similar as possible, thereby deceiving D_m.
2-2. The extracted multi-scale image features and multi-scale text features must precisely characterize aesthetic quality. Therefore, a modality discrimination loss L_m of the standard adversarial form is adopted:
L_m = E[log D_m(f_v)] + E[log(1 − D_m(f_t))]
where D_m(·) is the probability that a feature comes from the image modality, f_v denotes the multi-scale image features, and f_t denotes the multi-scale text features; D_m maximizes L_m while the encoders minimize it.
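A sketch of the feature modality discriminator and this loss follows; the two-layer architecture and the pooling to a single vector per sample are assumptions of this sketch, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class ModalityDiscriminator(nn.Module):
    """D_m: probability that a pooled feature vector came from the image modality."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, f):                            # f: (B, dim), pooled features
        return self.net(f).squeeze(-1)

def modality_discrimination_loss(d_m, f_v, f_t, eps=1e-8):
    """Negative log-likelihood form of L_m, minimized w.r.t. D_m's parameters
    (equivalent to D_m maximizing L_m). The encoders are updated with the
    opposite sign so the image and text feature distributions become
    indistinguishable, thereby deceiving D_m."""
    loss_img = -torch.log(d_m(f_v) + eps).mean()        # image features -> label 1
    loss_txt = -torch.log(1.0 - d_m(f_t) + eps).mean()  # text features  -> label 0
    return loss_img + loss_txt
```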
Further, the multi-factor-controlled aesthetic comment generation in step (3) is as follows:
3-1. With aesthetic factor labels as auxiliary information, the aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into the comment decoder Dec_t to generate text comments.
3-2. Inside the comment decoder Dec_t, a cooperative attention module mines the association between the multi-scale image features and the multi-scale text features and outputs aggregated text features used to generate the text comments.
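The patent does not spell out the internals of this module; as one plausible realization, the sketch below lets the text states attend over the image features and fuses the attended context back into the text stream.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Sketch of the cooperative attention step inside Dec_t: text hidden
    states query the multi-scale image features, and the attended visual
    context is fused back to produce aggregated text features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.t2v = nn.MultiheadAttention(dim, num_heads=heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_h, img_feats):            # (B, T, D), (B, N, D)
        ctx, _ = self.t2v(query=text_h, key=img_feats, value=img_feats)
        return self.fuse(torch.cat([text_h, ctx], dim=-1))  # aggregated text features
```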
Further, the multi-task-constrained discrimination network in step (4), which enforces feature accuracy with respect to aesthetic factor labels and text quality, is as follows:
4-1. Quality prediction loss L_a: the quality prediction loss covers both the multi-scale image features and the multi-scale text features, and adopts an L2 loss to enforce the effectiveness of both feature sets.
4-2. Aesthetic factor prediction loss L_fact: the aesthetic factor prediction loss covers both the real and the generated text comments, and adopts a cross-entropy loss to constrain the rationality of the generated comments.
4-3. The text quality prediction loss and the aesthetic factor prediction loss are weighted and summed to guide the training of the model.
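A sketch of this weighted sum, assuming scalar quality predictions from both feature streams and one factor classification head (the loss weights and head shapes are not fixed by the patent):

```python
import torch.nn.functional as F

def multitask_discriminator_loss(q_pred_img, q_pred_txt, q_true,
                                 factor_logits, factor_labels,
                                 w_quality=1.0, w_factor=1.0):
    """Weighted sum of the two discriminator-side losses of step (4).
    L_a: L2 quality-prediction loss on image- and text-based predictions,
         tying the multi-scale features to the ground-truth aesthetic score.
    L_fact: cross-entropy over aesthetic factors of the generated comment,
         constraining the comment to express the intended factors."""
    l_a = F.mse_loss(q_pred_img, q_true) + F.mse_loss(q_pred_txt, q_true)
    l_fact = F.cross_entropy(factor_logits, factor_labels)
    return w_quality * l_a + w_factor * l_fact
```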
Further, the adversarial loss based on hidden information learning in step (5) is as follows:
Based on the idea of hidden information learning, a learnable relaxation factor is introduced into the loss function according to the strength of the correlation between the real text comments and the aesthetic quality, guiding the training of the model. Specifically, two sets of parameters w and w* are introduced in the discrimination network, and the adversarial loss takes the form of a hinge loss, which requires solving the following problem:
min_{w, b, w*, b*}  (1/2)||w||^2 + (γ/2)||w*||^2 + C Σ_i (⟨w*, x*_i⟩ + b*)
s.t.  y_i(⟨w, x_i⟩ + b) ≥ 1 − (⟨w*, x*_i⟩ + b*),  ⟨w*, x*_i⟩ + b* ≥ 0,  i = 1, …, n
wherein w and w* are network weight parameters, b and b* are network biases, γ and C are weight coefficients, y_i is the label corresponding to sample x_i, x_i ∈ R^d are the features extracted by the Transformer discrimination network, x*_i are the features extracted by the pre-trained aesthetic quality assessment model, and the relaxation factor ⟨w*, x*_i⟩ + b*, introduced for the text feature, is output by two fully connected layers. When the text noise is larger, the quality prediction error based on the text is larger and the corresponding relaxation factor should also be larger, i.e., the generated text comment need not be too similar to the real text comment; when the text noise is smaller, the relaxation factor is smaller and the generated text comment should approximate the real text comment. Here, w and w* can be solved using a modified SMO algorithm and iteratively optimized together with the whole network.
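The patent solves this constrained problem with a modified SMO algorithm; the sketch below is only a gradient-friendly approximation of the same idea, in which two fully connected layers over the privileged features x* produce a non-negative slack that relaxes the hinge margin for noisy samples.

```python
import torch
import torch.nn as nn

class RelaxedHingeLoss(nn.Module):
    """SVM+-style hinge loss with a learnable relaxation factor (LUPI).
    x : features from the Transformer discrimination network
    xs: privileged features x* from a pre-trained aesthetic quality model
    y : labels in {-1, +1}"""
    def __init__(self, dim, dim_star, gamma=0.1, C=1.0):
        super().__init__()
        self.w = nn.Linear(dim, 1)                   # realizes (w, b)
        self.w_star = nn.Sequential(                 # two FC layers -> slack xi
            nn.Linear(dim_star, dim_star), nn.ReLU(),
            nn.Linear(dim_star, 1), nn.Softplus())   # Softplus keeps xi >= 0
        self.gamma, self.C = gamma, C

    def forward(self, x, xs, y):
        xi = self.w_star(xs).squeeze(-1)             # learnable relaxation factor
        margin = y * self.w(x).squeeze(-1)
        # penalize violations of y(w.x + b) >= 1 - xi, plus the slack itself,
        # mirroring the C * sum(xi) term of the constrained objective
        hinge = torch.clamp(1.0 - xi - margin, min=0.0)
        reg = 0.5 * self.w.weight.pow(2).sum()
        reg_star = 0.5 * self.gamma * sum(p.pow(2).sum() for p in self.w_star.parameters())
        return reg + reg_star + self.C * (hinge + xi).mean()
```

Noisy comments thus buy themselves a larger slack and a softer margin, while clean comments keep the slack near zero and are held close to the real data, which is exactly the asymmetry described above.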
In the test phase, only the test image and the vector of one or more aesthetic factor labels to be generated need to be input into the trained model to obtain the corresponding aesthetic description.
The invention has the following beneficial effects:
Aimed at the learning problems of small-scale, noisy annotations, the invention studies the image aesthetic description generation task under the adversarial learning paradigm: the real text annotation data serve as hidden (privileged) information, and according to how well they express the aesthetic quality of the image, a relaxation term in the discriminator loss function is learned automatically. That is, when the real text annotation is highly correlated with the aesthetic quality of the image, the relaxation amount is small, so the generated description must stay close to it; conversely, the relaxation amount is large, and the generated description may differ significantly from the real annotation. In addition, to constrain the rationality of the generated text, a text-based quality prediction loss and a factor classification loss are introduced, so that the generated text matches the aesthetic quality of the input image, improving the robustness and accuracy of the model.
Drawings
FIG. 1 is the basic framework diagram of image aesthetic description generation based on hidden information learning.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the invention is based on the idea of adversarial learning and comprises three encoders, one decoder, and multiple discrimination networks. The encoders are the aesthetic factor encoder Enc_f, the visual encoder Enc_v, and the text encoder Enc_t, which extract high-level semantic features from the aesthetic factor control vector, the input image, and the real text comment, respectively. The aesthetic factor features and the visual features are then input together into the decoder Dec_t to generate text comments.
For example, as shown in FIG. 1, an input image showing a boat traveling on a lake at sunset is fed into the object detection network Enc_v to extract multi-scale image features, while the real text comment corresponding to the image, "excellent composition, five factors happening all at once.", is fed into the Transformer network Enc_t to extract multi-scale text features. With the aesthetic factor labels as auxiliary information, the aesthetic factor encoder Enc_f extracts their corresponding semantic features, which are input together with the multi-scale image features into the comment decoder to generate the text comment "excellent composition, five factors happening all at once." for the image. The modality discrimination loss makes the multi-scale image and text features as similar as possible, while the text quality prediction loss and the aesthetic factor prediction loss make the features more accurate and the generated comments more reasonable. According to the strength of the correlation between the real text comment and the aesthetic quality, a learnable relaxation factor is introduced into the loss function, so that the generated comment matches the aesthetic factors and quality of the input image, improving the robustness of sample generation.
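To make the figure's data flow concrete, the following hedged sketch adds the remaining two modules, the aesthetic factor encoder Enc_f and the comment decoder Dec_t; the vocabulary size, factor count, and the use of a standard Transformer decoder with the factor and visual features as cross-attention memory are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class FactorEncoder(nn.Module):
    """Stand-in for Enc_f: embeds aesthetic-factor labels (composition,
    lighting, color, ...) into semantic control features."""
    def __init__(self, n_factors=7, dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_factors, dim)

    def forward(self, factor_ids):                   # (B, K) factor indices
        return self.embed(factor_ids)                # (B, K, D)

class CommentDecoder(nn.Module):
    """Stand-in for Dec_t: a Transformer decoder whose cross-attention
    memory is the concatenation of factor and visual features."""
    def __init__(self, vocab=30000, dim=256, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens, factor_feats, img_feats):
        memory = torch.cat([factor_feats, img_feats], dim=1)  # conditioning set
        h = self.decoder(self.embed(tokens), memory)
        return self.out(h)                           # next-token logits (B, T, vocab)
```

In this sketch the factor features simply join the visual features as cross-attention memory, which is one plausible way to realize the factor-controlled conditioning described above.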
The method specifically comprises the following steps:
Step (1): model preprocessing
The model adopts a pre-trained object detection network Enc_v and a Transformer network Enc_t as backbones: the object detection network Enc_v extracts multi-scale image features from the input image, and the Transformer network Enc_t extracts multi-scale text features from real text comments.
Step (2): cross-modal consistency feature extraction based on adversarial learning
A feature modality discriminator is constructed using the idea of adversarial learning, and the multi-scale image features and multi-scale text features extracted in step (1) are fed into it, so that the image and text features it receives become as similar as possible.
Step (3): generation of multi-factor-controlled aesthetic text comments
With aesthetic factor labels as auxiliary information, an aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into a comment decoder to generate text comments.
Step (4): a discrimination network based on multi-task constraints enforces the accuracy of the features with respect to aesthetic factor labels and text quality.
The multi-task-constrained discrimination network employs a text quality prediction loss and an aesthetic factor prediction loss. In the form of multi-task learning, the text quality prediction and the aesthetic factor prediction enforce the effectiveness of the multi-scale image and text features and the rationality of the generated text comments. The two losses are weighted and summed to guide the training of the model.
Step (5): adversarial loss based on hidden information learning
Based on the idea of hidden information learning, a learnable relaxation factor is introduced into the adversarial loss function according to the strength of the correlation between the real text comments and the aesthetic quality, guiding the training of the model.
Further, the model preprocessing in step (1) is as follows:
1-1. The object detection network Enc_v and the Transformer network Enc_t are pre-trained: Enc_v on a large-scale image object detection dataset, and Enc_t on a natural language processing dataset.
1-2. The pre-trained object detection network Enc_v and Transformer network Enc_t are fine-tuned on the aesthetic quality assessment dataset to obtain better feature extraction capability. The fine-tuning stage takes the form of semi-supervised learning. In the "aesthetic factor encoder Enc_f - visual encoder Enc_v - text decoder Dec_t - multiple discrimination networks" branch, the object detection network Enc_v is trained following the standard adversarial generative learning paradigm. In the "aesthetic factor encoder Enc_f - text encoder Enc_t - text decoder Dec_t - multiple discrimination networks" branch, the Transformer network Enc_t adopts the idea of cycle-consistent generative adversarial networks, adding a reconstruction consistency constraint on text generation.
1-3. The input image is fed into the fine-tuned object detection network Enc_v to extract multi-scale image features; the real text comments are fed into the Transformer network Enc_t to extract multi-scale text features.
Further, the cross-modal consistency feature extraction based on adversarial learning in step (2) is as follows:
2-1. A feature modality discriminator D_m is constructed using the idea of adversarial learning. D_m must judge the modality of an input feature. The multi-scale image features and multi-scale text features extracted in step (1) are fed into D_m, and the encoders are trained so that their image and text features become as similar as possible, thereby deceiving D_m.
2-2. The extracted multi-scale image features and multi-scale text features must precisely characterize aesthetic quality. Therefore, a modality discrimination loss L_m of the standard adversarial form is adopted:
L_m = E[log D_m(f_v)] + E[log(1 − D_m(f_t))]
where D_m(·) is the probability that a feature comes from the image modality, f_v denotes the multi-scale image features, and f_t denotes the multi-scale text features; D_m maximizes L_m while the encoders minimize it.
Further, the multi-factor-controlled aesthetic comment generation in step (3) is as follows:
3-1. With aesthetic factor labels as auxiliary information, the aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into the comment decoder Dec_t to generate text comments.
3-2. Inside the comment decoder Dec_t, a cooperative attention module mines the association between the multi-scale image features and the multi-scale text features and outputs aggregated text features used to generate the text comments.
Further, the multi-task-constrained discrimination network in step (4), which enforces feature accuracy with respect to aesthetic factor labels and text quality, is as follows:
4-1. Quality prediction loss L_a: the quality prediction loss covers both the multi-scale image features and the multi-scale text features, and adopts an L2 loss to enforce the effectiveness of both feature sets.
4-2. Aesthetic factor prediction loss L_fact: the aesthetic factor prediction loss covers both the real and the generated text comments, and adopts a cross-entropy loss to constrain the rationality of the generated comments.
4-3. The text quality prediction loss and the aesthetic factor prediction loss are weighted and summed to guide the training of the model.
Further, the adversarial loss based on hidden information learning in step (5) is as follows:
Based on the idea of hidden information learning, a learnable relaxation factor is introduced into the loss function according to the strength of the correlation between the real text comments and the aesthetic quality, guiding the training of the model. Specifically, two sets of parameters w and w* are introduced in the discrimination network, and the adversarial loss takes the form of a hinge loss, which requires solving the following problem:
min_{w, b, w*, b*}  (1/2)||w||^2 + (γ/2)||w*||^2 + C Σ_i (⟨w*, x*_i⟩ + b*)
s.t.  y_i(⟨w, x_i⟩ + b) ≥ 1 − (⟨w*, x*_i⟩ + b*),  ⟨w*, x*_i⟩ + b* ≥ 0,  i = 1, …, n
wherein w and w* are network weight parameters, b and b* are network biases, γ and C are weight coefficients, y_i is the label corresponding to sample x_i, x_i ∈ R^d are the features extracted by the Transformer discrimination network, x*_i are the features extracted by the pre-trained aesthetic quality assessment model, and the relaxation factor ⟨w*, x*_i⟩ + b*, introduced for the text feature, is output by two fully connected layers. When the text noise is larger, the quality prediction error based on the text is larger and the corresponding relaxation factor should also be larger, i.e., the generated text comment need not be too similar to the real text comment; when the text noise is smaller, the relaxation factor is smaller and the generated text comment should approximate the real text comment. Here, w and w* can be solved using a modified SMO algorithm and iteratively optimized together with the whole network.
In the test phase, only the test image and the vector of one or more aesthetic factor labels to be generated need to be input into the trained model to obtain the corresponding aesthetic description.
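As an illustration of this test-time usage, here is a greedy decoding sketch built on the modules defined in the earlier sketches; the special-token ids and the greedy decoding scheme are assumptions.

```python
import torch

@torch.no_grad()
def generate_description(img, factor_ids, enc_v, enc_f, dec_t,
                         bos_id=1, eos_id=2, max_len=30):
    """Greedy decoding at test time: only the image and the desired
    aesthetic-factor labels are needed (no real comment)."""
    img_feats = enc_v(img)[-1]                       # coarsest visual scale
    factor_feats = enc_f(factor_ids)
    tokens = torch.full((img.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = dec_t(tokens, factor_feats, img_feats)
        nxt = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, nxt], dim=1)
        if (nxt == eos_id).all():                    # stop once every sample ends
            break
    return tokens                                    # token ids of the description
```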

Claims (5)

1. A method of generating an aesthetic description of an image based on hidden information learning, comprising the steps of:
step (1): model preprocessing
the model adopts a pre-trained object detection network Enc_v and a Transformer network Enc_t as backbones: the object detection network Enc_v extracts multi-scale image features from the input image, and the Transformer network Enc_t extracts multi-scale text features from real text comments;
step (2): cross-modal consistency feature extraction based on adversarial learning
a feature modality discriminator is constructed using the idea of adversarial learning, and the multi-scale image features and multi-scale text features extracted in step (1) are fed into the feature modality discriminator, so that the image and text features it receives become as similar as possible;
step (3): generation of multi-factor-controlled aesthetic text comments
with aesthetic factor labels as auxiliary information, an aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into a comment decoder to generate text comments;
step (4): a discrimination network based on multi-task constraints enforces the effectiveness of the multi-scale image and text features and the rationality of the generated text comments;
the multi-task-constrained discrimination network employs a text quality prediction loss and an aesthetic factor prediction loss; in the form of multi-task learning based on text quality prediction and aesthetic factor prediction, the two losses are weighted and summed to guide the training of the model;
step (5): adversarial loss based on hidden information learning
based on the idea of hidden information learning, a learnable relaxation factor is introduced into the adversarial loss function according to the strength of the correlation between the real text comments and the aesthetic quality, guiding the training of the model;
the countermeasure loss based on hidden information learning in the step (5) is specifically implemented as follows:
based on the idea of hidden information learning, according to the correlation strength between real text comments and aesthetic quality, introducing a learnable relaxation factor to a loss function to guide training of a model; specifically, two sets of parameters w and w are introduced in the discriminant network * The countering loss is to be in the form of a hangeloss, which requires solving the following problems:
s.t.
wherein w and w * Is netParameters of the complex weights, b and b * For the network bias, gamma and C are weight coefficients, y i Is x i Labels, x, corresponding to samples i ∈R d For the transducer to discriminate the network extracted features,features extracted for pre-trained aesthetic quality assessment model, < >>Outputting a relaxation factor introduced for text features for two full-connection layers; when the text noise is larger, the quality error is larger based on the text prediction, and the corresponding relaxation factor is also required to be larger, namely the generated text comment is not required to be too similar to the real text comment; when the text noise is smaller, the relaxation factor is smaller, and the generated text comment is close to the real text comment; wherein w and w * As the network weight parameters, the improved SMO algorithm can be utilized for solving, and the iterative optimization is carried out together with the whole network;
in the test stage, the corresponding aesthetic description can be obtained by only inputting the test image and the aesthetic factor mark to be generated into the trained model.
2. The method for generating an aesthetic description of an image based on hidden information learning according to claim 1, wherein the model preprocessing in step (1) is implemented as follows:
1-1. the object detection network Enc_v and the Transformer network Enc_t are pre-trained: Enc_v on a large-scale image object detection dataset, and Enc_t on a natural language processing dataset;
1-2. the pre-trained object detection network Enc_v and Transformer network Enc_t are fine-tuned on the aesthetic quality assessment dataset to obtain better feature extraction capability; the fine-tuning stage adopts semi-supervised learning; in the "aesthetic factor encoder Enc_f - visual encoder Enc_v - text decoder Dec_t - multiple discrimination networks" branch, the object detection network Enc_v is trained following the standard adversarial generative learning paradigm; in the "aesthetic factor encoder Enc_f - text encoder Enc_t - text decoder Dec_t - multiple discrimination networks" branch, the Transformer network Enc_t adopts the idea of cycle-consistent generative adversarial networks, adding a reconstruction consistency constraint on text generation;
1-3. the input image is fed into the fine-tuned object detection network Enc_v to extract multi-scale image features; the real text comments are fed into the Transformer network Enc_t to extract multi-scale text features.
3. The method for generating an aesthetic description of an image based on hidden information learning according to claim 2, wherein the cross-modal consistency feature extraction based on adversarial learning in step (2) is implemented as follows:
2-1. a feature modality discriminator D_m is constructed using the idea of adversarial learning; D_m must judge the modality of an input feature; the multi-scale image features and multi-scale text features extracted in step (1) are fed into D_m, so that the image and text features become as similar as possible;
2-2. the extracted multi-scale image features and multi-scale text features must precisely characterize aesthetic quality; therefore, a modality discrimination loss L_m of the standard adversarial form is adopted:
L_m = E[log D_m(f_v)] + E[log(1 − D_m(f_t))]
wherein D_m(·) is the probability that a feature comes from the image modality, f_v denotes the multi-scale image features, and f_t denotes the multi-scale text features.
4. The method for generating an aesthetic description of an image based on hidden information learning according to claim 3, wherein the multi-factor-controlled aesthetic comment generation in step (3) is implemented as follows:
3-1. with aesthetic factor labels as auxiliary information, the aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into the comment decoder Dec_t to generate text comments;
3-2. inside the comment decoder Dec_t, a cooperative attention module mines the association between the multi-scale image features and the multi-scale text features and outputs aggregated text features used to generate the text comments.
5. The method for generating an aesthetic description of an image based on hidden information learning according to claim 4, wherein step (4) is implemented as follows:
4-1. quality prediction loss L_a: the quality prediction loss covers both the multi-scale image features and the multi-scale text features, and adopts an L2 loss to enforce the effectiveness of both feature sets;
4-2. aesthetic factor prediction loss L_fact: the aesthetic factor prediction loss covers both the real and the generated text comments, and adopts a cross-entropy loss to constrain the rationality of the generated comments;
4-3. the text quality prediction loss and the aesthetic factor prediction loss are weighted and summed to guide the training of the model.
CN202011609603.3A 2020-12-30 2020-12-30 Image aesthetic description generation method based on hidden information learning Active CN112598662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011609603.3A CN112598662B (en) 2020-12-30 2020-12-30 Image aesthetic description generation method based on hidden information learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011609603.3A CN112598662B (en) 2020-12-30 2020-12-30 Image aesthetic description generation method based on hidden information learning

Publications (2)

Publication Number Publication Date
CN112598662A CN112598662A (en) 2021-04-02
CN112598662B 2024-02-13

Family

ID=75206485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011609603.3A Active CN112598662B (en) 2020-12-30 2020-12-30 Image aesthetic description generation method based on hidden information learning

Country Status (1)

Country Link
CN (1) CN112598662B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114510924B (en) * 2022-02-14 2022-09-20 哈尔滨工业大学 Text generation method based on pre-training language model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109544524A (en) * 2018-11-15 2019-03-29 中共中央办公厅电子科技学院 A kind of more attribute image aesthetic evaluation systems based on attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10685434B2 (en) * 2016-03-30 2020-06-16 Institute Of Automation, Chinese Academy Of Sciences Method for assessing aesthetic quality of natural image based on multi-task deep learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109544524A (en) * 2018-11-15 2019-03-29 中共中央办公厅电子科技学院 A kind of more attribute image aesthetic evaluation systems based on attention mechanism

Also Published As

Publication number Publication date
CN112598662A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
Li et al. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection
Li et al. Generalized focal loss: Towards efficient representation learning for dense object detection
CN108256561B (en) Multi-source domain adaptive migration method and system based on counterstudy
Kastaniotis et al. Attention-aware generative adversarial networks (ATA-GANs)
Nartey et al. Semi-supervised learning for fine-grained classification with self-training
CN110796199A (en) Image processing method and device and electronic medical equipment
CN113255822B (en) Double knowledge distillation method for image retrieval
Dering et al. Generative adversarial networks for increasing the veracity of big data
He et al. Image captioning with text-based visual attention
CN112580636A (en) Image aesthetic quality evaluation method based on cross-modal collaborative reasoning
CN113505855A (en) Training method for anti-attack model
Chu et al. Adversarial alignment for source free object detection
CN112598662B (en) Image aesthetic description generation method based on hidden information learning
CN114399661A (en) Instance awareness backbone network training method
CN113343123B (en) Training method and detection method for generating confrontation multiple relation graph network
Chen et al. Refinement of Boundary Regression Using Uncertainty in Temporal Action Localization.
Luvembe et al. CAF-ODNN: Complementary attention fusion with optimized deep neural network for multimodal fake news detection
CN111753684B (en) Pedestrian re-recognition method using target posture for generation
CN113222002A (en) Zero sample classification method based on generative discriminative contrast optimization
CN117371511A (en) Training method, device, equipment and storage medium for image classification model
CN114912549B (en) Training method of risk transaction identification model, and risk transaction identification method and device
Xu et al. Bootstrap your object detector via mixed training
Yang et al. How to use extra training data for better edge detection?
Wang et al. Saliency Regularization for Self-Training with Partial Annotations
CN113011446A (en) Intelligent target identification method based on multi-source heterogeneous data learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant