CN112598662B - Image aesthetic description generation method based on hidden information learning


Info

Publication number
CN112598662B
Authority
CN
China
Prior art keywords
text
aesthetic
features
scale
factor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011609603.3A
Other languages
Chinese (zh)
Other versions
CN112598662A (en)
Inventor
俞俊
李相�
高飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202011609603.3A
Publication of CN112598662A
Application granted
Publication of CN112598662B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • G06T 7/0002 — Image analysis; inspection of images, e.g. flaw detection
    • G06F 18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 40/30 — Handling natural language data; semantic analysis
    • G06N 3/08 — Neural networks; learning methods
    • G06V 10/40 — Image or video recognition; extraction of image or video features
    • G06T 2207/20081 — Image analysis indexing scheme; training; learning
    • G06T 2207/20084 — Image analysis indexing scheme; artificial neural networks [ANN]
    • G06T 2207/30168 — Image analysis indexing scheme; image quality inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for generating an aesthetic description of an image based on hidden information learning. The method comprises the following steps: (1) model preprocessing: an object detection network Enc_v and a Transformer network Enc_t extract multi-scale feature representations from the image and the text comment, respectively; (2) cross-modal consistency feature extraction based on adversarial learning: a feature modality discriminator is constructed using the idea of adversarial learning; (3) multi-factor-controlled aesthetic comment generation: with aesthetic factor labels as auxiliary information, an aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into a comment decoder to generate text comments; (4) a multi-task-constrained discrimination network enforces the effectiveness of the multi-scale image and text features and the rationality of the generated text comments; (5) an adversarial loss based on hidden information learning. The invention generates text matched to the aesthetic quality of the input image, improving the robustness and accuracy of the model.

Description

Image aesthetic description generation method based on hidden information learning
Technical Field
The invention provides a method for generating image aesthetic descriptions based on hidden information learning. It mainly relates to a generative adversarial learning framework and targets the problems of small-scale, noisy annotation data: using the idea of learning using privileged information (Learning Using Privileged Information, LUPI), it estimates the reliability of noisy data and uses this estimate as a relaxation term in the adversarial loss function, improving model training efficiency and performance.
Background
Image aesthetic quality assessment (Photo Quality Assessment) aims to computationally evaluate the aesthetic quality of a picture based on an artistic understanding of the image. Related research tasks can be broadly divided into five categories: binary quality classification (professional/amateur, beautiful/ugly, good/bad), quality score prediction (e.g., describing the aesthetics with a score of 0-10), quality score distribution prediction (the probability distribution of subjective scores given by different observers for the same image), aesthetic factor prediction (the quality level of factors such as composition, lighting, and color matching), and aesthetic description (text comments on the aesthetics of an image, discussing why the image is good or bad). Current research on image aesthetic quality focuses mainly on the first three tasks, for which the corresponding aesthetic databases have high-quality, large-scale annotations. In contrast, aesthetic factor prediction and aesthetic description are significant for understanding image aesthetics, but related research is still in its infancy, and the annotation data is low in quality and small in scale, making it difficult to meet the training-sample requirements of large-scale deep networks.
Most existing methods extract features from the image alone and focus on aesthetic quality classification or score prediction tasks. In recent years, a small number of works have studied image aesthetic factor analysis and text comment/description generation. For example, Chang et al. combine a convolutional neural network with a long short-term memory network and build an aesthetic-factor guidance and fusion mechanism for image aesthetic description, but the generated text lacks reliable guidance. Text comment information is significant for understanding the aesthetic mechanism of an image. However, existing image aesthetic comment data is noisy and small in scale, and can hardly meet the training requirements of a deep network. Therefore, how to learn the association between text and image from limited, noisy data and explore the causal reasoning mechanism of image aesthetic quality assessment is a current research hotspot and difficulty.
The image aesthetic description task presents two technical difficulties. The first is model learning from small samples: given that existing image description models require large-scale labeled samples, an effective learning strategy must be designed for the small-sample setting. The second is that the labeled samples contain substantial noise: the conventional discrimination mechanism in adversarial learning makes a hard split between real and generated samples, which inevitably introduces erroneous information, so an asymmetric joint learning method must be designed to acquire effective information while avoiding noisy information.
Disclosure of Invention
It is an object of the present invention to address the deficiencies of the prior art and to provide a method of image aesthetic description generation based on hidden information learning.
The technical solution adopted by the invention to solve these problems comprises the following steps:
Step (1): model preprocessing
The model adopts a pre-trained object detection network Enc_v and a Transformer network Enc_t as backbones: the object detection network Enc_v extracts multi-scale image features from the input image, and the Transformer network Enc_t extracts multi-scale text features from real text comments.
Step (2): cross-modal consistency feature extraction based on adversarial learning
A feature modality discriminator is constructed using the idea of adversarial learning, and the multi-scale image features and multi-scale text features extracted in step (1) are fed into it, so that the image and text features it receives become as similar as possible.
Step (3): generation of multi-factor-controlled aesthetic text comments
With aesthetic factor labels as auxiliary information, an aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into a comment decoder to generate text comments.
Step (4): a discrimination network based on multi-task constraints enforces the accuracy of the features with respect to aesthetic factor labels and text quality.
The multi-task-constrained discrimination network employs a text quality prediction loss and an aesthetic factor prediction loss. In the form of multi-task learning, the text quality prediction and the aesthetic factor prediction enforce the effectiveness of the multi-scale image and text features and the rationality of the generated text comments. The two losses are weighted and summed to guide the training of the model.
Step (5): adversarial loss based on hidden information learning
Based on the idea of hidden information learning, a learnable relaxation factor is introduced into the adversarial loss function according to the strength of the correlation between the real text comments and the aesthetic quality, guiding the training of the model.
Further, the model preprocessing in step (1) is as follows:
1-1. The object detection network Enc_v and the Transformer network Enc_t are pre-trained: Enc_v on a large-scale image object detection dataset, and Enc_t on a natural language processing dataset.
1-2. The pre-trained object detection network Enc_v and Transformer network Enc_t are fine-tuned on the aesthetic quality assessment dataset to obtain better feature extraction capability. The fine-tuning stage takes the form of semi-supervised learning. In the "aesthetic factor encoder Enc_f - visual encoder Enc_v - text decoder Dec_t - multiple discrimination networks" branch, the object detection network Enc_v is trained following the standard adversarial generative learning paradigm. In the "aesthetic factor encoder Enc_f - text encoder Enc_t - text decoder Dec_t - multiple discrimination networks" branch, the Transformer network Enc_t adopts the idea of cycle-consistent generative adversarial networks, adding a reconstruction consistency constraint on text generation.
1-3. The input image is fed into the fine-tuned object detection network Enc_v to extract multi-scale image features; the real text comments are fed into the Transformer network Enc_t to extract multi-scale text features.
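As a concrete illustration only, the following is a minimal PyTorch sketch of the two feature extractors. The class names, dimensions, and layer choices are assumptions made for this sketch; the patent's actual Enc_v is a pre-trained, fine-tuned object detection network and Enc_t a pre-trained Transformer, not the toy modules shown here.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Stand-in for the object detection network Enc_v: returns feature
    maps at several spatial scales, each flattened to a set of
    region-like feature vectors of shape (B, N_scale, D)."""
    def __init__(self, dim=256):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, img):                          # img: (B, 3, H, W)
        c1 = self.stem(img)
        c2 = self.stage2(c1)
        c3 = self.stage3(c2)
        return [c.flatten(2).transpose(1, 2) for c in (c1, c2, c3)]

class TextEncoder(nn.Module):
    """Stand-in for the Transformer network Enc_t over real text comments."""
    def __init__(self, vocab=30000, dim=256, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens):                       # tokens: (B, T) int64
        h = self.encoder(self.embed(tokens))
        # two "scales" of text features: token-level and sentence-level
        return [h, h.mean(dim=1, keepdim=True)]
```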
Further, the cross-modal consistency feature extraction based on adversarial learning in step (2) is as follows:
2-1. A feature modality discriminator D_m is constructed using the idea of adversarial learning. D_m must judge the modality of an input feature. The multi-scale image features and multi-scale text features extracted in step (1) are fed into D_m, and the encoders are trained so that their image and text features become as similar as possible, thereby deceiving D_m.
2-2. The extracted multi-scale image features and multi-scale text features must precisely characterize aesthetic quality. Therefore, a modality discrimination loss L_m of the standard adversarial form is adopted:
L_m = E[log D_m(f_v)] + E[log(1 − D_m(f_t))]
where D_m(·) is the probability that a feature comes from the image modality, f_v denotes the multi-scale image features, and f_t denotes the multi-scale text features; D_m maximizes L_m while the encoders minimize it.
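A sketch of the feature modality discriminator and this loss follows; the two-layer architecture and the pooling to a single vector per sample are assumptions of this sketch, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class ModalityDiscriminator(nn.Module):
    """D_m: probability that a pooled feature vector came from the image modality."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, f):                            # f: (B, dim), pooled features
        return self.net(f).squeeze(-1)

def modality_discrimination_loss(d_m, f_v, f_t, eps=1e-8):
    """Negative log-likelihood form of L_m, minimized w.r.t. D_m's parameters
    (equivalent to D_m maximizing L_m). The encoders are updated with the
    opposite sign so the image and text feature distributions become
    indistinguishable, thereby deceiving D_m."""
    loss_img = -torch.log(d_m(f_v) + eps).mean()        # image features -> label 1
    loss_txt = -torch.log(1.0 - d_m(f_t) + eps).mean()  # text features  -> label 0
    return loss_img + loss_txt
```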
Further, the multi-factor-controlled aesthetic comment generation in step (3) is as follows:
3-1. With aesthetic factor labels as auxiliary information, the aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into the comment decoder Dec_t to generate text comments.
3-2. Inside the comment decoder Dec_t, a cooperative attention module mines the association between the multi-scale image features and the multi-scale text features and outputs aggregated text features used to generate the text comments.
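The patent does not spell out the internals of this module; as one plausible realization, the sketch below lets the text states attend over the image features and fuses the attended context back into the text stream.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Sketch of the cooperative attention step inside Dec_t: text hidden
    states query the multi-scale image features, and the attended visual
    context is fused back to produce aggregated text features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.t2v = nn.MultiheadAttention(dim, num_heads=heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_h, img_feats):            # (B, T, D), (B, N, D)
        ctx, _ = self.t2v(query=text_h, key=img_feats, value=img_feats)
        return self.fuse(torch.cat([text_h, ctx], dim=-1))  # aggregated text features
```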
Further, the multi-task-constrained discrimination network in step (4), which enforces feature accuracy with respect to aesthetic factor labels and text quality, is as follows:
4-1. Quality prediction loss L_a: the quality prediction loss covers both the multi-scale image features and the multi-scale text features, and adopts an L2 loss to enforce the effectiveness of both feature sets.
4-2. Aesthetic factor prediction loss L_fact: the aesthetic factor prediction loss covers both the real and the generated text comments, and adopts a cross-entropy loss to constrain the rationality of the generated comments.
4-3. The text quality prediction loss and the aesthetic factor prediction loss are weighted and summed to guide the training of the model.
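A sketch of this weighted sum, assuming scalar quality predictions from both feature streams and one factor classification head (the loss weights and head shapes are not fixed by the patent):

```python
import torch.nn.functional as F

def multitask_discriminator_loss(q_pred_img, q_pred_txt, q_true,
                                 factor_logits, factor_labels,
                                 w_quality=1.0, w_factor=1.0):
    """Weighted sum of the two discriminator-side losses of step (4).
    L_a: L2 quality-prediction loss on image- and text-based predictions,
         tying the multi-scale features to the ground-truth aesthetic score.
    L_fact: cross-entropy over aesthetic factors of the generated comment,
         constraining the comment to express the intended factors."""
    l_a = F.mse_loss(q_pred_img, q_true) + F.mse_loss(q_pred_txt, q_true)
    l_fact = F.cross_entropy(factor_logits, factor_labels)
    return w_quality * l_a + w_factor * l_fact
```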
Further, the adversarial loss based on hidden information learning in step (5) is as follows:
Based on the idea of hidden information learning, a learnable relaxation factor is introduced into the loss function according to the strength of the correlation between the real text comments and the aesthetic quality, guiding the training of the model. Specifically, two sets of parameters w and w* are introduced in the discrimination network, and the adversarial loss takes the form of a hinge loss, which requires solving the following problem:
min_{w, b, w*, b*}  (1/2)||w||^2 + (γ/2)||w*||^2 + C Σ_i (⟨w*, x*_i⟩ + b*)
s.t.  y_i(⟨w, x_i⟩ + b) ≥ 1 − (⟨w*, x*_i⟩ + b*),  ⟨w*, x*_i⟩ + b* ≥ 0,  i = 1, …, n
wherein w and w* are network weight parameters, b and b* are network biases, γ and C are weight coefficients, y_i is the label corresponding to sample x_i, x_i ∈ R^d are the features extracted by the Transformer discrimination network, x*_i are the features extracted by the pre-trained aesthetic quality assessment model, and the relaxation factor ⟨w*, x*_i⟩ + b*, introduced for the text feature, is output by two fully connected layers. When the text noise is larger, the quality prediction error based on the text is larger and the corresponding relaxation factor should also be larger, i.e., the generated text comment need not be too similar to the real text comment; when the text noise is smaller, the relaxation factor is smaller and the generated text comment should approximate the real text comment. Here, w and w* can be solved using a modified SMO algorithm and iteratively optimized together with the whole network.
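The patent solves this constrained problem with a modified SMO algorithm; the sketch below is only a gradient-friendly approximation of the same idea, in which two fully connected layers over the privileged features x* produce a non-negative slack that relaxes the hinge margin for noisy samples.

```python
import torch
import torch.nn as nn

class RelaxedHingeLoss(nn.Module):
    """SVM+-style hinge loss with a learnable relaxation factor (LUPI).
    x : features from the Transformer discrimination network
    xs: privileged features x* from a pre-trained aesthetic quality model
    y : labels in {-1, +1}"""
    def __init__(self, dim, dim_star, gamma=0.1, C=1.0):
        super().__init__()
        self.w = nn.Linear(dim, 1)                   # realizes (w, b)
        self.w_star = nn.Sequential(                 # two FC layers -> slack xi
            nn.Linear(dim_star, dim_star), nn.ReLU(),
            nn.Linear(dim_star, 1), nn.Softplus())   # Softplus keeps xi >= 0
        self.gamma, self.C = gamma, C

    def forward(self, x, xs, y):
        xi = self.w_star(xs).squeeze(-1)             # learnable relaxation factor
        margin = y * self.w(x).squeeze(-1)
        # penalize violations of y(w.x + b) >= 1 - xi, plus the slack itself,
        # mirroring the C * sum(xi) term of the constrained objective
        hinge = torch.clamp(1.0 - xi - margin, min=0.0)
        reg = 0.5 * self.w.weight.pow(2).sum()
        reg_star = 0.5 * self.gamma * sum(p.pow(2).sum() for p in self.w_star.parameters())
        return reg + reg_star + self.C * (hinge + xi).mean()
```

Noisy comments thus buy themselves a larger slack and a softer margin, while clean comments keep the slack near zero and are held close to the real data, which is exactly the asymmetry described above.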
In the test phase, only the test image and the vector of one or more aesthetic factor labels to be generated need to be input into the trained model to obtain the corresponding aesthetic description.
The invention has the following beneficial effects:
Aimed at the learning problems of small-scale, noisy annotations, the invention studies the image aesthetic description generation task under the adversarial learning paradigm: the real text annotation data serve as hidden (privileged) information, and according to how well they express the aesthetic quality of the image, a relaxation term in the discriminator loss function is learned automatically. That is, when the real text annotation is highly correlated with the aesthetic quality of the image, the relaxation amount is small, so the generated description must stay close to it; conversely, the relaxation amount is large, and the generated description may differ significantly from the real annotation. In addition, to constrain the rationality of the generated text, a text-based quality prediction loss and a factor classification loss are introduced, so that the generated text matches the aesthetic quality of the input image, improving the robustness and accuracy of the model.
Drawings
FIG. 1 is the basic framework diagram of image aesthetic description generation based on hidden information learning.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the invention is based on the idea of adversarial learning and comprises three encoders, one decoder, and multiple discrimination networks. The encoders are the aesthetic factor encoder Enc_f, the visual encoder Enc_v, and the text encoder Enc_t, which extract high-level semantic features from the aesthetic factor control vector, the input image, and the real text comment, respectively. The aesthetic factor features and the visual features are then input together into the decoder Dec_t to generate text comments.
For example, as shown in FIG. 1, an input image showing a boat traveling on a lake at sunset is fed into the object detection network Enc_v to extract multi-scale image features, while the real text comment corresponding to the image, "excellent composition, five factors happening all at once.", is fed into the Transformer network Enc_t to extract multi-scale text features. With the aesthetic factor labels as auxiliary information, the aesthetic factor encoder Enc_f extracts their corresponding semantic features, which are input together with the multi-scale image features into the comment decoder to generate the text comment "excellent composition, five factors happening all at once." for the image. The modality discrimination loss makes the multi-scale image and text features as similar as possible, while the text quality prediction loss and the aesthetic factor prediction loss make the features more accurate and the generated comments more reasonable. According to the strength of the correlation between the real text comment and the aesthetic quality, a learnable relaxation factor is introduced into the loss function, so that the generated comment matches the aesthetic factors and quality of the input image, improving the robustness of sample generation.
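To make the figure's data flow concrete, the following hedged sketch adds the remaining two modules, the aesthetic factor encoder Enc_f and the comment decoder Dec_t; the vocabulary size, factor count, and the use of a standard Transformer decoder with the factor and visual features as cross-attention memory are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class FactorEncoder(nn.Module):
    """Stand-in for Enc_f: embeds aesthetic-factor labels (composition,
    lighting, color, ...) into semantic control features."""
    def __init__(self, n_factors=7, dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_factors, dim)

    def forward(self, factor_ids):                   # (B, K) factor indices
        return self.embed(factor_ids)                # (B, K, D)

class CommentDecoder(nn.Module):
    """Stand-in for Dec_t: a Transformer decoder whose cross-attention
    memory is the concatenation of factor and visual features."""
    def __init__(self, vocab=30000, dim=256, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens, factor_feats, img_feats):
        memory = torch.cat([factor_feats, img_feats], dim=1)  # conditioning set
        h = self.decoder(self.embed(tokens), memory)
        return self.out(h)                           # next-token logits (B, T, vocab)
```

In this sketch the factor features simply join the visual features as cross-attention memory, which is one plausible way to realize the factor-controlled conditioning described above.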
The method specifically comprises the following steps:
Step (1): model preprocessing
The model adopts a pre-trained object detection network Enc_v and a Transformer network Enc_t as backbones: the object detection network Enc_v extracts multi-scale image features from the input image, and the Transformer network Enc_t extracts multi-scale text features from real text comments.
Step (2): cross-modal consistency feature extraction based on adversarial learning
A feature modality discriminator is constructed using the idea of adversarial learning, and the multi-scale image features and multi-scale text features extracted in step (1) are fed into it, so that the image and text features it receives become as similar as possible.
Step (3): generation of multi-factor-controlled aesthetic text comments
With aesthetic factor labels as auxiliary information, an aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into a comment decoder to generate text comments.
Step (4): a discrimination network based on multi-task constraints enforces the accuracy of the features with respect to aesthetic factor labels and text quality.
The multi-task-constrained discrimination network employs a text quality prediction loss and an aesthetic factor prediction loss. In the form of multi-task learning, the text quality prediction and the aesthetic factor prediction enforce the effectiveness of the multi-scale image and text features and the rationality of the generated text comments. The two losses are weighted and summed to guide the training of the model.
Step (5): adversarial loss based on hidden information learning
Based on the idea of hidden information learning, a learnable relaxation factor is introduced into the adversarial loss function according to the strength of the correlation between the real text comments and the aesthetic quality, guiding the training of the model.
Further, the model preprocessing in step (1) is as follows:
1-1. The object detection network Enc_v and the Transformer network Enc_t are pre-trained: Enc_v on a large-scale image object detection dataset, and Enc_t on a natural language processing dataset.
1-2. The pre-trained object detection network Enc_v and Transformer network Enc_t are fine-tuned on the aesthetic quality assessment dataset to obtain better feature extraction capability. The fine-tuning stage takes the form of semi-supervised learning. In the "aesthetic factor encoder Enc_f - visual encoder Enc_v - text decoder Dec_t - multiple discrimination networks" branch, the object detection network Enc_v is trained following the standard adversarial generative learning paradigm. In the "aesthetic factor encoder Enc_f - text encoder Enc_t - text decoder Dec_t - multiple discrimination networks" branch, the Transformer network Enc_t adopts the idea of cycle-consistent generative adversarial networks, adding a reconstruction consistency constraint on text generation.
1-3. The input image is fed into the fine-tuned object detection network Enc_v to extract multi-scale image features; the real text comments are fed into the Transformer network Enc_t to extract multi-scale text features.
Further, the cross-modal consistency feature extraction based on adversarial learning in step (2) is as follows:
2-1. A feature modality discriminator D_m is constructed using the idea of adversarial learning. D_m must judge the modality of an input feature. The multi-scale image features and multi-scale text features extracted in step (1) are fed into D_m, and the encoders are trained so that their image and text features become as similar as possible, thereby deceiving D_m.
2-2. The extracted multi-scale image features and multi-scale text features must precisely characterize aesthetic quality. Therefore, a modality discrimination loss L_m of the standard adversarial form is adopted:
L_m = E[log D_m(f_v)] + E[log(1 − D_m(f_t))]
where D_m(·) is the probability that a feature comes from the image modality, f_v denotes the multi-scale image features, and f_t denotes the multi-scale text features; D_m maximizes L_m while the encoders minimize it.
Further, the multi-factor-controlled aesthetic comment generation in step (3) is as follows:
3-1. With aesthetic factor labels as auxiliary information, the aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into the comment decoder Dec_t to generate text comments.
3-2. Inside the comment decoder Dec_t, a cooperative attention module mines the association between the multi-scale image features and the multi-scale text features and outputs aggregated text features used to generate the text comments.
Further, the multi-task-constrained discrimination network in step (4), which enforces feature accuracy with respect to aesthetic factor labels and text quality, is as follows:
4-1. Quality prediction loss L_a: the quality prediction loss covers both the multi-scale image features and the multi-scale text features, and adopts an L2 loss to enforce the effectiveness of both feature sets.
4-2. Aesthetic factor prediction loss L_fact: the aesthetic factor prediction loss covers both the real and the generated text comments, and adopts a cross-entropy loss to constrain the rationality of the generated comments.
4-3. The text quality prediction loss and the aesthetic factor prediction loss are weighted and summed to guide the training of the model.
Further, the adversarial loss based on hidden information learning in step (5) is as follows:
Based on the idea of hidden information learning, a learnable relaxation factor is introduced into the loss function according to the strength of the correlation between the real text comments and the aesthetic quality, guiding the training of the model. Specifically, two sets of parameters w and w* are introduced in the discrimination network, and the adversarial loss takes the form of a hinge loss, which requires solving the following problem:
min_{w, b, w*, b*}  (1/2)||w||^2 + (γ/2)||w*||^2 + C Σ_i (⟨w*, x*_i⟩ + b*)
s.t.  y_i(⟨w, x_i⟩ + b) ≥ 1 − (⟨w*, x*_i⟩ + b*),  ⟨w*, x*_i⟩ + b* ≥ 0,  i = 1, …, n
wherein w and w* are network weight parameters, b and b* are network biases, γ and C are weight coefficients, y_i is the label corresponding to sample x_i, x_i ∈ R^d are the features extracted by the Transformer discrimination network, x*_i are the features extracted by the pre-trained aesthetic quality assessment model, and the relaxation factor ⟨w*, x*_i⟩ + b*, introduced for the text feature, is output by two fully connected layers. When the text noise is larger, the quality prediction error based on the text is larger and the corresponding relaxation factor should also be larger, i.e., the generated text comment need not be too similar to the real text comment; when the text noise is smaller, the relaxation factor is smaller and the generated text comment should approximate the real text comment. Here, w and w* can be solved using a modified SMO algorithm and iteratively optimized together with the whole network.
In the test phase, only the test image and the vector of one or more aesthetic factor labels to be generated need to be input into the trained model to obtain the corresponding aesthetic description.
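As an illustration of this test-time usage, here is a greedy decoding sketch built on the modules defined in the earlier sketches; the special-token ids and the greedy decoding scheme are assumptions.

```python
import torch

@torch.no_grad()
def generate_description(img, factor_ids, enc_v, enc_f, dec_t,
                         bos_id=1, eos_id=2, max_len=30):
    """Greedy decoding at test time: only the image and the desired
    aesthetic-factor labels are needed (no real comment)."""
    img_feats = enc_v(img)[-1]                       # coarsest visual scale
    factor_feats = enc_f(factor_ids)
    tokens = torch.full((img.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = dec_t(tokens, factor_feats, img_feats)
        nxt = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, nxt], dim=1)
        if (nxt == eos_id).all():                    # stop once every sample ends
            break
    return tokens                                    # token ids of the description
```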

Claims (5)

1. A method of generating an aesthetic description of an image based on hidden information learning, comprising the steps of:
step (1): model preprocessing
the model adopts a pre-trained object detection network Enc_v and a Transformer network Enc_t as backbones: the object detection network Enc_v extracts multi-scale image features from the input image, and the Transformer network Enc_t extracts multi-scale text features from real text comments;
step (2): cross-modal consistency feature extraction based on adversarial learning
a feature modality discriminator is constructed using the idea of adversarial learning, and the multi-scale image features and multi-scale text features extracted in step (1) are fed into the feature modality discriminator, so that the image and text features it receives become as similar as possible;
step (3): generation of multi-factor-controlled aesthetic text comments
with aesthetic factor labels as auxiliary information, an aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into a comment decoder to generate text comments;
step (4): a discrimination network based on multi-task constraints enforces the effectiveness of the multi-scale image and text features and the rationality of the generated text comments;
the multi-task-constrained discrimination network employs a text quality prediction loss and an aesthetic factor prediction loss; in the form of multi-task learning based on text quality prediction and aesthetic factor prediction, the two losses are weighted and summed to guide the training of the model;
step (5): adversarial loss based on hidden information learning
based on the idea of hidden information learning, a learnable relaxation factor is introduced into the adversarial loss function according to the strength of the correlation between the real text comments and the aesthetic quality, guiding the training of the model;
the countermeasure loss based on hidden information learning in the step (5) is specifically implemented as follows:
based on the idea of hidden information learning, according to the correlation strength between real text comments and aesthetic quality, introducing a learnable relaxation factor to a loss function to guide training of a model; specifically, two sets of parameters w and w are introduced in the discriminant network * The countering loss is to be in the form of a hangeloss, which requires solving the following problems:
s.t.
wherein w and w * Is netParameters of the complex weights, b and b * For the network bias, gamma and C are weight coefficients, y i Is x i Labels, x, corresponding to samples i ∈R d For the transducer to discriminate the network extracted features,features extracted for pre-trained aesthetic quality assessment model, < >>Outputting a relaxation factor introduced for text features for two full-connection layers; when the text noise is larger, the quality error is larger based on the text prediction, and the corresponding relaxation factor is also required to be larger, namely the generated text comment is not required to be too similar to the real text comment; when the text noise is smaller, the relaxation factor is smaller, and the generated text comment is close to the real text comment; wherein w and w * As the network weight parameters, the improved SMO algorithm can be utilized for solving, and the iterative optimization is carried out together with the whole network;
in the test stage, the corresponding aesthetic description can be obtained by only inputting the test image and the aesthetic factor mark to be generated into the trained model.
2. The method for generating an aesthetic description of an image based on hidden information learning according to claim 1, wherein the model preprocessing in step (1) is implemented as follows:
1-1. the object detection network Enc_v and the Transformer network Enc_t are pre-trained: Enc_v on a large-scale image object detection dataset, and Enc_t on a natural language processing dataset;
1-2. the pre-trained object detection network Enc_v and Transformer network Enc_t are fine-tuned on the aesthetic quality assessment dataset to obtain better feature extraction capability; the fine-tuning stage adopts semi-supervised learning; in the "aesthetic factor encoder Enc_f - visual encoder Enc_v - text decoder Dec_t - multiple discrimination networks" branch, the object detection network Enc_v is trained following the standard adversarial generative learning paradigm; in the "aesthetic factor encoder Enc_f - text encoder Enc_t - text decoder Dec_t - multiple discrimination networks" branch, the Transformer network Enc_t adopts the idea of cycle-consistent generative adversarial networks, adding a reconstruction consistency constraint on text generation;
1-3. the input image is fed into the fine-tuned object detection network Enc_v to extract multi-scale image features; the real text comments are fed into the Transformer network Enc_t to extract multi-scale text features.
3. The method for generating an aesthetic description of an image based on hidden information learning according to claim 2, wherein the cross-modal consistency feature extraction based on adversarial learning in step (2) is implemented as follows:
2-1. a feature modality discriminator D_m is constructed using the idea of adversarial learning; D_m must judge the modality of an input feature; the multi-scale image features and multi-scale text features extracted in step (1) are fed into D_m, so that the image and text features become as similar as possible;
2-2. the extracted multi-scale image features and multi-scale text features must precisely characterize aesthetic quality; therefore, a modality discrimination loss L_m of the standard adversarial form is adopted:
L_m = E[log D_m(f_v)] + E[log(1 − D_m(f_t))]
wherein D_m(·) is the probability that a feature comes from the image modality, f_v denotes the multi-scale image features, and f_t denotes the multi-scale text features.
4. The method for generating an aesthetic description of an image based on hidden information learning according to claim 3, wherein the multi-factor-controlled aesthetic comment generation in step (3) is implemented as follows:
3-1. with aesthetic factor labels as auxiliary information, the aesthetic factor encoder Enc_f extracts the semantic features corresponding to the labels, which are input into the comment decoder Dec_t to generate text comments;
3-2. inside the comment decoder Dec_t, a cooperative attention module mines the association between the multi-scale image features and the multi-scale text features and outputs aggregated text features used to generate the text comments.
5. The method for generating an aesthetic description of an image based on hidden information learning according to claim 4, wherein step (4) is implemented as follows:
4-1. quality prediction loss L_a: the quality prediction loss covers both the multi-scale image features and the multi-scale text features, and adopts an L2 loss to enforce the effectiveness of both feature sets;
4-2. aesthetic factor prediction loss L_fact: the aesthetic factor prediction loss covers both the real and the generated text comments, and adopts a cross-entropy loss to constrain the rationality of the generated comments;
4-3. the text quality prediction loss and the aesthetic factor prediction loss are weighted and summed to guide the training of the model.
CN202011609603.3A 2020-12-30 2020-12-30 Image aesthetic description generation method based on hidden information learning Active CN112598662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011609603.3A CN112598662B (en) 2020-12-30 2020-12-30 Image aesthetic description generation method based on hidden information learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011609603.3A CN112598662B (en) 2020-12-30 2020-12-30 Image aesthetic description generation method based on hidden information learning

Publications (2)

Publication Number Publication Date
CN112598662A CN112598662A (en) 2021-04-02
CN112598662B 2024-02-13

Family

ID=75206485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011609603.3A Active CN112598662B (en) 2020-12-30 2020-12-30 Image aesthetic description generation method based on hidden information learning

Country Status (1)

Country Link
CN (1) CN112598662B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114510924B (en) * 2022-02-14 2022-09-20 哈尔滨工业大学 Text generation method based on pre-training language model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109544524A (en) * 2018-11-15 2019-03-29 中共中央办公厅电子科技学院 A kind of more attribute image aesthetic evaluation systems based on attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10685434B2 (en) * 2016-03-30 2020-06-16 Institute Of Automation, Chinese Academy Of Sciences Method for assessing aesthetic quality of natural image based on multi-task deep learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109544524A (en) * 2018-11-15 2019-03-29 中共中央办公厅电子科技学院 A kind of more attribute image aesthetic evaluation systems based on attention mechanism

Also Published As

Publication number Publication date
CN112598662A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
Li et al. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection
Li et al. Generalized focal loss: Towards efficient representation learning for dense object detection
CN108256561B (en) Multi-source domain adaptive migration method and system based on counterstudy
Kastaniotis et al. Attention-aware generative adversarial networks (ATA-GANs)
Nartey et al. Semi-supervised learning for fine-grained classification with self-training
CN110796199A (en) Image processing method and device and electronic medical equipment
CN113255822B (en) Double knowledge distillation method for image retrieval
Dering et al. Generative adversarial networks for increasing the veracity of big data
He et al. Image captioning with text-based visual attention
CN112580636A (en) Image aesthetic quality evaluation method based on cross-modal collaborative reasoning
CN113505855A (en) Training method for anti-attack model
Chu et al. Adversarial alignment for source free object detection
CN112598662B (en) Image aesthetic description generation method based on hidden information learning
CN114399661A (en) Instance awareness backbone network training method
CN113343123B (en) Training method and detection method for generating confrontation multiple relation graph network
Chen et al. Refinement of Boundary Regression Using Uncertainty in Temporal Action Localization.
Luvembe et al. CAF-ODNN: Complementary attention fusion with optimized deep neural network for multimodal fake news detection
CN111753684B (en) Pedestrian re-recognition method using target posture for generation
CN113222002A (en) Zero sample classification method based on generative discriminative contrast optimization
CN117371511A (en) Training method, device, equipment and storage medium for image classification model
CN114912549B (en) Training method of risk transaction identification model, and risk transaction identification method and device
Xu et al. Bootstrap your object detector via mixed training
Yang et al. How to use extra training data for better edge detection?
Wang et al. Saliency Regularization for Self-Training with Partial Annotations
CN113011446A (en) Intelligent target identification method based on multi-source heterogeneous data learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant