CN117635964A - Multimodal aesthetic quality evaluation method based on Transformer - Google Patents

Multimodal aesthetic quality evaluation method based on Transformer

Info

Publication number
CN117635964A
CN117635964A CN202310175150.5A CN202310175150A CN117635964A CN 117635964 A CN117635964 A CN 117635964A CN 202310175150 A CN202310175150 A CN 202310175150A CN 117635964 A CN117635964 A CN 117635964A
Authority
CN
China
Prior art keywords
features
text
visual
feature
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310175150.5A
Other languages
Chinese (zh)
Inventor
林裕皓
高飞
徐岗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202310175150.5A priority Critical patent/CN117635964A/en
Publication of CN117635964A publication Critical patent/CN117635964A/en
Pending legal-status Critical Current

Links

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a Transformer-based multi-modal aesthetic quality evaluation method, which comprises the following steps: S1, inputting an image into a visual feature encoder to extract visual features; S2, inputting the corresponding user comment data into a text feature encoder to extract text features containing semantic information; S3, inputting the extracted visual features and text features into a cross-modal encoder, and outputting the fused visual and text features; and S4, concatenating the visual features and text features output in S3, outputting a 10-dimensional aesthetic score distribution through a Linear layer, and performing training optimization with the EMD as the loss function. The method designs a well-constructed cross-modal encoder so that each modality fully learns the information of the other, fully models the inherent correlation between visual features and text features, and effectively improves the performance of aesthetic quality evaluation.

Description

Multimodal aesthetic quality evaluation method based on Transformer
Technical Field
The invention relates to the technical field of deep learning and image processing, in particular to a Transformer-based multi-modal aesthetic quality evaluation method.
Background
The objective of aesthetic quality assessment is to evaluate images objectively according to human aesthetic perception. The aesthetic quality assessment task is more challenging than traditional image quality assessment because it must consider not only image quality but also high-level aesthetic factors such as composition, color matching and content. Multi-modal aesthetic quality evaluation takes an image and its corresponding comments as input and jointly learns visual and text features, so that the accuracy of aesthetic prediction can be greatly improved with the assistance of semantic information.
The Transformer is an encoder-decoder model based on the self-attention mechanism that has transformed the natural language processing (NLP) field: it is not only faster but also achieves better performance than recurrent neural networks (RNNs). Subsequently, a large number of researchers built Transformer-based neural network models and applied them widely to fields such as natural language processing, computer vision and multi-modal learning. With the help of the self-attention mechanism, Transformer-based models have strong global modeling capability and show excellent performance on various tasks.
The task of multi-modal aesthetic quality assessment mainly faces two problems:
1) Composition is an important factor affecting the aesthetic quality evaluation of an image, and studies have shown that extracting global and local features of an image simultaneously helps capture its composition. However, most current methods extract visual features with convolutional neural network (CNN) models, whose limited receptive field restricts the ability to extract global features. As a result, previous methods extract the global features of an image insufficiently and cannot capture its composition effectively, so the performance of image aesthetic quality evaluation remains low.
2) The comments corresponding to an image contain rich semantic information that intuitively reflects users' opinions and attitudes toward the image and can effectively assist aesthetic quality evaluation. However, this comment information has not been well utilized: existing methods lack a good cross-modal encoder to adequately interact and align the features of the visual and linguistic modalities, and therefore fail to adequately model the inherent correlation between visual and textual features.
Disclosure of Invention
The invention aims to provide a Transformer-based multi-modal aesthetic quality evaluation method that designs a well-constructed cross-modal encoder so that each modality fully learns the information of the other, fully models the inherent correlation between visual features and text features, and effectively improves the performance of aesthetic quality evaluation.
In order to solve the technical problems, the technical scheme of the invention is as follows:
A Transformer-based multi-modal aesthetic quality evaluation method comprises the following steps:
s1, inputting an image into a visual feature encoder which takes a Swin Transformer as a backbone network to extract visual features;
s2, inputting corresponding user comment data into a text feature encoder which takes BERT as a backbone network to extract text features containing semantic information;
s3, respectively extracting the visual features and the text features from the S1 and the S2, inputting the visual features and the text features into a cross-modal encoder, learning the information of the other mode by the features of each mode, aligning the feature spaces of the two modes, fully modeling the internal relevance between the visual features and the text features, and outputting the visual features and the text features after feature fusion;
and S4, connecting the visual features and the text features output in the S3, outputting 10-dimensional aesthetic score distribution through a Linear layer, and performing training optimization by using the EMD as a loss function.
Preferably, in step S1, the procedure for obtaining the advanced visual features is as follows:
1-1 Divide the input image into pixel blocks through a patch partition module, each pixel block containing 4*4 pixels; then concatenate along the channel dimension and adjust to input features of 56 x 56 x 48;
1-2 Feed the obtained input features into a 1-layer Linear layer for linear mapping to obtain input features of 56 x 56 x 96;
1-3 Divide the feature map into windows, each window containing 7*7 pixel blocks, and reshape the features to 64 x 49 x 96;
1-4 Compute standard self-attention (the window self-attention mechanism W-MSA) separately in each of the 64 independent windows. For the features $X \in \mathbb{R}^{49 \times 96}$ in each window, obtain the corresponding queries, keys and values matrices Q, K and V as
$Q = XW_Q,\quad K = XW_K,\quad V = XW_V,$
where $W_Q$, $W_K$ and $W_V$ are projection matrices shared across windows;
1-5 From the obtained K, Q and V matrices, compute the attention matrix according to the self-attention formula
$\mathrm{Attention}(Q,K,V)=\mathrm{SoftMax}\left(QK^{T}/\sqrt{d}+B\right)V,$
where B is a learnable relative position code and d is the dimension of each attention head;
1-6 Use a feed-forward network (FFN) for further feature conversion; the FFN module consists of two Linear layers with a GELU nonlinear activation between them;
1-7 Using the shifted window self-attention (SW-MSA) mechanism, shift the features by half of the window size and then repeat the computations of steps 1-4, 1-5 and 1-6;
1-8 To obtain multi-scale features, reduce the number of pixel blocks through a patch merging layer. The patch merging layer concatenates the features of each group of 2 x 2 adjacent pixel blocks along the channel dimension and then adjusts the channel dimension with a Linear layer. This downsamples the image by a factor of 2 and sets the channel dimension to 2 times the original;
1-9 Steps 1-4 to 1-8 constitute one stage; after four such stages, the output dimension of the final visual feature is 1 x 49 x 768. An illustrative sketch of the window self-attention computation is given below.
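For illustration only, and not part of the patented disclosure, the following PyTorch sketch shows how window self-attention (W-MSA) over 64 windows of 7*7 tokens can be computed as in steps 1-4 and 1-5; the head count (3), the module name WindowSelfAttention and the simplified relative-position-bias indexing are assumptions of this sketch, not details given in the patent.

import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    def __init__(self, dim=96, window_size=7, num_heads=3):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)     # W_Q, W_K, W_V shared across windows
        self.proj = nn.Linear(dim, dim)
        # learnable relative position bias B (one value per relative offset and head);
        # the full pairwise index lookup of Swin Transformer is omitted for brevity
        self.rel_bias = nn.Parameter(torch.zeros((2 * window_size - 1) ** 2, num_heads))

    def forward(self, x):                       # x: (num_windows, 49, 96)
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)    # each: (num_windows, heads, 49, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale                 # QK^T / sqrt(d)
        attn = attn + self.rel_bias[0].view(1, self.num_heads, 1, 1)  # add (simplified) B
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        return self.proj(out)

feat = torch.randn(64, 49, 96)                  # 64 windows of 7*7 = 49 tokens, dim 96
print(WindowSelfAttention()(feat).shape)        # torch.Size([64, 49, 96])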
Preferably, in step S2, the procedure of extracting the advanced semantic information in the comment data is as follows:
2-1 Divide the input text into a word sequence {w1, ..., wn} of length n with the WordPiece Tokenizer;
2-2 Obtain Token codes from the segmented word sequence through an embedding sublayer;
2-3 Add a cls Token at the front of the obtained Token codes and a sep Token at the back;
2-4 Obtain position codes through an embedding sublayer from the indices of the segmented word sequence;
2-5 Add the obtained Token codes and position codes to obtain the input feature Hi;
2-6 After the 12-layer Transformer module, obtain the final contextual feature T, whose dimension is (n+2) x 768. An illustrative encoding sketch using a BERT implementation is given below.
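For illustration only, the following sketch shows how steps 2-1 to 2-6 map onto the Hugging Face transformers library; the bert-base-uncased checkpoint and the example comment are assumptions of this sketch, since the patent only specifies BERT as the backbone.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # WordPiece tokenizer (2-1)
model = BertModel.from_pretrained("bert-base-uncased")           # 12-layer Transformer (2-6)

comment = "Beautiful composition and great use of color."
# Tokenization adds a cls Token at the front and a sep Token at the back (2-2, 2-3);
# Token codes and position codes are added inside the embedding layer (2-4, 2-5).
inputs = tokenizer(comment, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

T = outputs.last_hidden_state     # final contextual feature T, shape (1, n+2, 768)
print(T.shape)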
Preferably, in step S3, the process of obtaining the visual feature and the text feature after the modal feature fusion includes the following steps:
3-1 After the visual features and text features are obtained, first apply layer normalization and a self-attention sub-layer to each of them separately, establishing internal connections within each modality's features;
3-2 The layer-normalized visual features provide the queries matrix, and the text features provide the keys and values matrices as the information of the language modality. Cross-attention is computed from the queries, keys and values matrices, so that the visual features learn rich semantic information from the text features;
3-3 The layer-normalized text features provide the queries matrix, and the visual features provide the keys and values matrices as the information of the visual modality. Cross-attention is computed from the queries, keys and values matrices, so that the text features perceive the characteristics of the image;
3-4 After layer normalization, the visual features and text features are each further transformed by an FFN sub-layer;
3-5 Repeat the above four processes to fully fuse and align the modality features; a single-layer sketch of this cross-modal block is given below.
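As illustration of steps 3-1 to 3-5 only, the following single-layer sketch builds the cross-modal block from standard PyTorch attention modules; the head count, the residual connections and the 4x FFN expansion are assumptions of this sketch rather than details disclosed by the patent.

import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.norm_v1, self.norm_t1 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.self_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v2, self.norm_t2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.cross_v = nn.MultiheadAttention(dim, heads, batch_first=True)  # visual queries text
        self.cross_t = nn.MultiheadAttention(dim, heads, batch_first=True)  # text queries visual
        self.norm_v3, self.norm_t3 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn_v = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_t = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, v, t):                     # v: (B, 49, 768), t: (B, n+2, 768)
        # 3-1: layer normalization + intra-modal self-attention
        vn, tn = self.norm_v1(v), self.norm_t1(t)
        v = v + self.self_v(vn, vn, vn)[0]
        t = t + self.self_t(tn, tn, tn)[0]
        # 3-2 / 3-3: cross-attention, each modality provides queries to attend to the other
        vq, tq = self.norm_v2(v), self.norm_t2(t)
        v = v + self.cross_v(vq, tq, tq)[0]      # visual queries, text keys/values
        t = t + self.cross_t(tq, vq, vq)[0]      # text queries, visual keys/values
        # 3-4: layer normalization + FFN conversion
        v = v + self.ffn_v(self.norm_v3(v))
        t = t + self.ffn_t(self.norm_t3(t))
        return v, t

v, t = torch.randn(2, 49, 768), torch.randn(2, 12, 768)
v_out, t_out = CrossModalLayer()(v, t)
print(v_out.shape, t_out.shape)                  # (2, 49, 768) (2, 12, 768)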
Preferably, in step S4, the process of obtaining the final aesthetic score of the multi-modal input is as follows:
4-1 Apply a global average pooling (GAP) operation to the visual features to obtain the final visual feature representation of 1 x 768;
4-2 Take the cls token at position 0 of the text features as the final text feature representation; the vector corresponding to the cls token has dimension 1 x 768;
4-3 Concatenate the final visual features and text features to obtain a visual-text joint feature representation of 1 x 1536;
4-4 Feed the final visual-text joint feature representation into a Linear layer and output, after normalization by a Softmax function, the probability distribution of the multi-modal input over aesthetic scores 1-10; a minimal sketch of this prediction head is given below.
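For illustration only, the following sketch traces steps 4-1 to 4-4 on dummy tensors; the comment length (10 tokens plus cls and sep) and the stand-alone Linear head are assumptions of this sketch.

import torch
import torch.nn as nn

head = nn.Linear(1536, 10)                    # Linear layer producing the 10 score bins

v = torch.randn(1, 49, 768)                   # fused visual features from the cross-modal encoder
t = torch.randn(1, 12, 768)                   # fused text features, cls token at index 0

v_feat = v.mean(dim=1)                        # 4-1: global average pooling -> (1, 768)
t_feat = t[:, 0, :]                           # 4-2: cls token as text representation -> (1, 768)
joint = torch.cat([v_feat, t_feat], dim=-1)   # 4-3: visual-text joint feature -> (1, 1536)
scores = torch.softmax(head(joint), dim=-1)   # 4-4: probability distribution over scores 1-10
print(scores.shape, scores.sum().item())      # torch.Size([1, 10]), sums to 1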
The invention has the following characteristics and beneficial effects:
(1) The invention provides a visual feature encoder with both global and local feature extraction capability. Traditional aesthetic quality evaluation feature extractors adopt convolutional neural network (CNN) models and cannot extract the global features of an image well. The visual feature encoder used in the invention is based on the window self-attention module and has good local feature extraction capability; with the help of the shifted window self-attention module it also has strong global feature extraction capability, so image composition features can be better attended to;
(2) The invention solves the problem that visual features and text features cannot be fully fused. A 3-layer cross-modal encoder is adopted so that the visual features learn the rich semantic information of the text features and the text features perceive the visual information of the image, thereby fully modeling the inherent correlation between visual features and text features.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a schematic diagram of the Transformer-based multimodal aesthetic quality assessment network;
FIG. 2 is a schematic diagram of the visual feature encoder Swin Transformer network architecture;
FIG. 3 is a schematic diagram of a text feature encoder BERT network input representation construction;
FIG. 4 is a schematic diagram of a text feature encoder BERT network architecture;
FIG. 5 is a schematic diagram of a cross-modal feature encoder network architecture;
FIG. 6 is an attention map in the cross-modal encoder, showing the image regions attended to by different words.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art in a specific case.
The invention provides a Transformer-based multi-modal aesthetic quality evaluation method, which, as shown in FIG. 1, comprises the following steps:
s1, inputting an image into a visual feature encoder which takes a Swin Transformer as a backbone network to extract advanced visual features;
s2, inputting corresponding user comment data into a language feature encoder which takes BERT as a backbone network to extract text features containing high-level semantic information;
s3, extracting the visual features and the text features from the S1 and the S2, inputting the visual features and the text features into a cross-modal encoder, learning information of another mode by the features of each mode, aligning the feature spaces of the two modes, fully modeling the internal relevance between the visual features and the text features, and outputting the visual features and the text features after feature fusion;
and S4, connecting the visual features and the text features output in the S3, outputting 10-dimensional aesthetic score distribution through a Linear layer, and performing training optimization by using the EMD as a loss function.
It should be noted that, in this embodiment, the overall model structure for implementing the Transformer-based multi-modal aesthetic quality evaluation method, as shown in FIG. 1, includes:
a visual feature encoder module for extracting global features and local features of the input image;
the text feature encoder module is used for extracting high-level semantic information of user comments corresponding to the input image;
the cross-modal encoder module is used for fusing and aligning the feature spaces between the modalities and establishing internal connections between the modalities; an illustrative composition of these modules is sketched below.
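Purely as an illustrative composition (the class and attribute names are placeholders, not identifiers from the patent), the three modules above can be wired together as follows, with the prediction head of step S4 attached at the end:

import torch
import torch.nn as nn

class MultimodalAestheticModel(nn.Module):
    def __init__(self, visual_encoder, text_encoder, cross_modal_encoder):
        super().__init__()
        self.visual_encoder = visual_encoder            # Swin Transformer backbone (S1)
        self.text_encoder = text_encoder                # BERT backbone (S2)
        self.cross_modal_encoder = cross_modal_encoder  # cross-modal fusion module (S3)
        self.head = nn.Linear(768 * 2, 10)              # 10-dimensional score distribution (S4)

    def forward(self, image, token_ids):
        v = self.visual_encoder(image)                  # visual features, e.g. (B, 49, 768)
        t = self.text_encoder(token_ids)                # text features, e.g. (B, n+2, 768)
        v, t = self.cross_modal_encoder(v, t)           # fused and aligned features
        joint = torch.cat([v.mean(dim=1), t[:, 0, :]], dim=-1)
        return torch.softmax(self.head(joint), dim=-1)  # distribution over scores 1-10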
in this embodiment, specific:
s1, inputting the image into a visual feature encoder to extract global features and local features.
In the embodiment, in step S1, the input 224 x 224 x 3 image is divided into pixel blocks by a patch partition module, each pixel block containing 4*4 pixels. The blocks are then concatenated along the channel dimension, giving input features of 56 x 56 x 48, which are passed through a 1-layer Linear layer for linear mapping to obtain input features of 56 x 56 x 96. Windows are divided in the feature map, where each window contains 7*7 pixel blocks, and the features are reshaped to 64 x 49 x 96. Standard self-attention (the window self-attention mechanism W-MSA) is computed separately in the 64 independent windows. For the features $X \in \mathbb{R}^{49 \times 96}$ in each window, the corresponding queries, keys and values matrices Q, K and V are obtained as $Q = XW_Q$, $K = XW_K$, $V = XW_V$, where $W_Q$, $W_K$ and $W_V$ are projection matrices shared across windows. From the obtained Q, K and V matrices, the attention matrix is computed according to the self-attention formula $\mathrm{Attention}(Q,K,V)=\mathrm{SoftMax}(QK^{T}/\sqrt{d}+B)V$, where B is a learnable relative position code. A feed-forward network (FFN) is used for further feature conversion; the FFN module consists of two Linear layers with a GELU nonlinear activation between them. Using the shifted window self-attention (SW-MSA) mechanism, the features are shifted by half of the window size, after which the window self-attention computation and the FFN computation are carried out again. Two consecutive Swin Transformer modules are shown in FIG. 2b. To obtain multi-scale features, the number of pixel blocks is reduced by a patch merging layer, which concatenates the features of each group of 2 x 2 adjacent pixel blocks along the channel dimension and then adjusts the channel dimension with a Linear layer; this downsamples the image by a factor of 2 and sets the channel dimension to 2 times the original. The overall visual feature encoder is divided into 4 stages, as shown in FIG. 2a, and the output dimension of the final visual feature is 1 x 49 x 768. An illustrative sketch of the patch merging operation is given below.
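For illustration only, the patch merging operation described above can be sketched as follows (the function name and the stage-1 feature shape are assumptions of this sketch):

import torch
import torch.nn as nn

def patch_merging(x, reduction):
    # x: (B, H, W, C) grid of patch tokens; reduction: Linear layer mapping 4C -> 2C
    x0 = x[:, 0::2, 0::2, :]                  # top-left patch of each 2 x 2 group
    x1 = x[:, 1::2, 0::2, :]                  # bottom-left
    x2 = x[:, 0::2, 1::2, :]                  # top-right
    x3 = x[:, 1::2, 1::2, :]                  # bottom-right
    x = torch.cat([x0, x1, x2, x3], dim=-1)   # concatenate along channels: (B, H/2, W/2, 4C)
    return reduction(x)                       # adjust channel dimension with a Linear layer

x = torch.randn(1, 56, 56, 96)                # stage-1 features
reduction = nn.Linear(4 * 96, 2 * 96)
print(patch_merging(x, reduction).shape)      # torch.Size([1, 28, 28, 192])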
S2, inputting the user comment information corresponding to the image input in S1 into a text feature encoder to extract the semantic information of the text.
In this embodiment, the input text data is divided into a word sequence {w_1, ..., w_n} of length n by the WordPiece Tokenizer. Token codes are obtained from the segmented word sequence through an embedding sublayer; a cls Token is added at the front of the sequence and a sep Token at the back. Position codes are obtained through an embedding sublayer from the indices of the segmented word sequence. The obtained Token codes and position codes are added to obtain the input feature $H_i$, as shown in FIG. 3. As shown in FIG. 4, from the input representation corresponding to the comment, the final contextual feature T is obtained after the 12-layer Transformer module, and its dimension is (n+2) x 768.
S3, taking the visual features and text features extracted in S1 and S2 as input, and using a 3-layer cross-modal encoder so that the two modalities learn information from each other, obtaining the fused visual features and text features.
In this embodiment, as shown in FIG. 5, the visual features and text features are first each passed through layer normalization and a self-attention sub-layer, establishing internal connections within each modality's features. Then each modality begins to learn information from the other: the layer-normalized visual features provide the queries matrix, while the text features provide the keys and values matrices as the language-modality information; cross-attention is computed from the queries, keys and values matrices so that the visual features learn rich semantic information from the text features. Likewise, the layer-normalized text features provide the queries matrix, while the visual features provide the keys and values matrices as the visual-modality information; cross-attention is computed from the queries, keys and values matrices so that the text features perceive the characteristics of the image. Finally, after layer normalization, the visual features and text features are each further transformed by an FFN sub-layer. Visual analysis of the attention values in the cross-attention layers shows that the cross-modal encoder establishes the connection between the information of the two modalities well.
And S4, connecting the visual features and the text features output in the step S3, and outputting 10-dimensional aesthetic score distribution through a Linear layer.
In this embodiment, a global average pooling (GAP) operation is applied to the visual features to obtain a final visual feature representation of 1 x 768, and the cls token at position 0 of the text features is taken as the final text feature representation, the vector corresponding to the cls token having dimension 1 x 768. The final visual features and text features are concatenated to obtain a visual-text joint feature representation of 1 x 1536, which is then fed into a Linear layer and normalized by a Softmax function to output the probability distribution of the multi-modal input over aesthetic scores 1-10.
Further, in this embodiment, in the training stage, an AdamW optimizer is used to optimize the network, the initial learning rate is set to 1e-5, a cosine annealing algorithm is used to adjust the learning rate, and the minimum learning rate is set to 1e-6. The loss function is the Earth Mover's Distance (EMD). In addition, an early-stopping method is used to avoid over-training: it automatically monitors metrics during network training, and if the metrics do not improve within a certain period, the network model is considered to have converged and training is stopped, which effectively avoids the overfitting problem caused by over-training. An illustrative training sketch is given below.
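For illustration only, the training setup described above can be sketched as follows; the r = 2 formulation of the EMD loss, the T_max value and the stand-in model are assumptions of this sketch (the patent names the Earth Mover's Distance but does not spell out its exact form):

import torch
import torch.nn as nn

def emd_loss(pred, target, r=2):
    # pred, target: (B, 10) probability distributions over aesthetic scores 1..10
    cdf_pred = torch.cumsum(pred, dim=-1)
    cdf_target = torch.cumsum(target, dim=-1)
    return ((cdf_pred - cdf_target).abs() ** r).mean(dim=-1).pow(1.0 / r).mean()

model = nn.Linear(1536, 10)                   # stand-in for the full multimodal network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)                    # initial lr 1e-5
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-6)

joint = torch.randn(4, 1536)                          # dummy fused visual-text features
target = torch.softmax(torch.randn(4, 10), dim=-1)    # dummy ground-truth score distributions
pred = torch.softmax(model(joint), dim=-1)
loss = emd_loss(pred, target)
loss.backward()
optimizer.step()
scheduler.step()
print(loss.item())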
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments, including the components, without departing from the principles and spirit of the invention, yet fall within the scope of the invention.

Claims (9)

1. A Transformer-based multi-modal aesthetic quality evaluation method, characterized by comprising the following steps:
s1, inputting an image into a visual feature encoder to extract visual features;
s2, inputting user comment data corresponding to the image into a text feature encoder to extract text features containing semantic information;
s3, inputting the extracted visual features and text features into a cross-modal encoder, and outputting the visual features and text features after feature fusion;
and S4, connecting the fused visual features and text features output in the S3, outputting 10-dimensional aesthetic score distribution through a Linear layer, and performing training optimization by using the EMD as a loss function.
2. The method of claim 1, wherein the visual feature encoder uses Swin Transformer as a backbone network.
3. The method of claim 1, wherein the text feature encoder uses BERT as a backbone network.
4. The method for multi-modal aesthetic quality assessment based on Transformer according to claim 1, wherein the method for extracting visual features in step S1 is as follows:
s1-1, dividing an input image into pixel blocks through a patch part module, wherein each pixel block comprises 4*4 pixels, then connecting the pixel blocks in a channel dimension, and adjusting the pixel blocks to be 56 x 48 input features;
s1-2, inputting the obtained input features into a 1-layer Linear layer for Linear mapping to obtain 56 x 96 input features;
s1-3, dividing windows in the image, wherein each window comprises a pixel block of 7*7, and readjusting the pixel block to 64 x 49 x 96;
s1-4, respectively calculating standard self-attentiveness according to a window self-attentiveness mechanism W-MSA in 64 independent windows, and regarding the characteristic X epsilon R in each window 49×96 Corresponding matrixes Q, K and V of the queries, the keys and the values are obtained, and the calculation formula is as follows:
Q=XW Q ,K=XW K ,V=XW V ,
wherein W is Q ,W K And W is V Is a projection matrix shared across windows;
s1-5, obtaining an attention matrix as the obtained K, Q and V matrixes according to a self-attention calculation formula
Wherein B is a learnable relative position code;
s1-6, using a feed-forward network to further perform feature conversion, wherein the FFN module consists of two Linear layers, and a GELU nonlinear function is used for activating the FFN module;
s1-7 shifting features again using a moving window self-attention mechanismAfter the window sizes are respectively calculated in the steps S1-4, S1-5 and S1-6;
s1-8, in order to obtain multi-scale features, reducing the number of pixel blocks through a patch raising layer, connecting the features of each group of 2 x 2 adjacent pixel blocks in a channel dimension, and then adjusting the channel dimension by using a Linear layer;
s1-9, taking the steps S1-4 to S1-8 as a stage, and outputting the visual characteristics with the dimension of 1 x 49 x 768 through calculation of four stages.
5. The method for multi-modal aesthetic quality assessment based on Transformer according to claim 1, wherein the feed-forward network in step S1-6 consists of two Linear layers activated with a GELU nonlinear function between them.
6. The method for multi-modal aesthetic quality assessment based on Transformer according to claim 1, wherein in the step S2, the method for extracting text features containing semantic information is as follows:
s2-1, dividing input user comment data into word sequences { w1, & gt, wn } with the length of n through WordPiece Tokenizer;
s2-2, obtaining Token codes through an embedding sublayer according to the word sequence after word segmentation;
s2-3, adding a cls Token at the forefront of the obtained Token code, and adding a sep Token at the rearmost of the obtained Token code;
s2-4, obtaining position codes through an embedding sublayer according to indexes of word sequences after word segmentation;
s2-5, adding the obtained Token codes and the position codes to obtain input characteristics Hi;
s2-6, obtaining corresponding input characterization of comments, and obtaining a final context feature T through a 12-layer transducer module, wherein the dimension of the final context feature T is (n+2) 768.
7. The method for multi-modal aesthetic quality assessment based on Transformer according to claim 1, wherein the method for fusing visual features and text features by the cross-modal encoder is as follows: the visual features and the text features learn feature information from each other and align their feature spaces, fully modeling the inherent correlation between the visual features and the text features.
8. The method for multi-modal aesthetic quality assessment based on Transformer according to claim 1, wherein the step S3 specifically comprises the steps of:
s3-1, after visual features and text features are obtained, firstly, respectively carrying out layer normalization and a self-attention sub-layer on the visual features and the text features, and establishing internal connection for the respective modal features;
s3-2, providing a query matrix after layer normalization of visual features, and providing a key matrix and a value matrix as information of a language mode by text features; performing cross-attribute calculation according to the queries, keys and values matrix, and learning semantic information from visual features to text features;
providing a query matrix after the text features are subjected to layer normalization, providing a key matrix and a value matrix as information of a visual mode, and performing cross-attribute calculation according to the query matrix, the key matrix and the value matrix to sense the features of the image by the text features;
s3-3, respectively carrying out layer normalization on visual features after semantic information learning and text features after image feature perception, and then carrying out further conversion through a layer of FFN sub-layer;
s3-4, repeating the steps S3-1 to S3-3 to fully fuse and align the modal characteristics to obtain the visual characteristics and the text characteristics after the characteristic fusion.
9. The method for multi-modal aesthetic quality assessment based on Transformer according to claim 1, wherein in the step S4, the method for outputting 10-dimensional aesthetic score distribution is as follows:
s4-1, performing global pooling operation on the visual features after feature fusion to obtain a final visual feature representation of 1 x 768;
s4-2, taking a cls token of the 0 th dimension of the text feature after feature fusion as a final text feature representation, wherein the vector dimension corresponding to the cls token is 1 x 768;
s4-3, connecting the final visual features and the text features to obtain a visual text joint feature representation of 1 x 1536;
s4-4, sending the final visual text joint characteristic representation into a Linear layer, and carrying out normalized output through a Softmax function to carry out probability distribution of 1-10 minutes on the multi-mode input.
CN202310175150.5A 2023-02-28 2023-02-28 Multimodal aesthetic quality evaluation method based on Transformer Pending CN117635964A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310175150.5A CN117635964A (en) 2023-02-28 2023-02-28 Multimode aesthetic quality evaluation method based on transducer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310175150.5A CN117635964A (en) 2023-02-28 2023-02-28 Multimode aesthetic quality evaluation method based on transducer

Publications (1)

Publication Number Publication Date
CN117635964A true CN117635964A (en) 2024-03-01

Family

ID=90032731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310175150.5A Pending CN117635964A (en) 2023-02-28 2023-02-28 Multimode aesthetic quality evaluation method based on transducer

Country Status (1)

Country Link
CN (1) CN117635964A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117829098A (en) * 2024-03-06 2024-04-05 天津创意星球网络科技股份有限公司 Multi-mode work review method, device, medium and equipment
CN117829098B (en) * 2024-03-06 2024-05-28 天津创意星球网络科技股份有限公司 Multi-mode work review method, device, medium and equipment
CN117876651A (en) * 2024-03-13 2024-04-12 浪潮电子信息产业股份有限公司 Visual positioning method, device, equipment and medium
CN117876651B (en) * 2024-03-13 2024-05-24 浪潮电子信息产业股份有限公司 Visual positioning method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN117635964A (en) Multimode aesthetic quality evaluation method based on transducer
CN113158875A (en) Image-text emotion analysis method and system based on multi-mode interactive fusion network
CN108334487A (en) Lack semantics information complementing method, device, computer equipment and storage medium
CN111915619A (en) Full convolution network semantic segmentation method for dual-feature extraction and fusion
CN112115253B (en) Depth text ordering method based on multi-view attention mechanism
CN112633364A (en) Multi-modal emotion recognition method based on Transformer-ESIM attention mechanism
CN113486190B (en) Multi-mode knowledge representation method integrating entity image information and entity category information
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN110647632B (en) Image and text mapping technology based on machine learning
CN114549850B (en) Multi-mode image aesthetic quality evaluation method for solving modal missing problem
CN113010656A (en) Visual question-answering method based on multi-mode fusion and structural control
CN113806554B (en) Knowledge graph construction method for massive conference texts
CN113516133A (en) Multi-modal image classification method and system
CN114077673A (en) Knowledge graph construction method based on BTBC model
CN115481679A (en) Multi-modal emotion analysis method and system
CN117746078B (en) Object detection method and system based on user-defined category
CN114625849A (en) Context-aware progressive attention video question-answering method and system
CN116704198A (en) Knowledge enhancement visual question-answering method based on multi-mode information guidance
CN117539999A (en) Cross-modal joint coding-based multi-modal emotion analysis method
CN114219014A (en) Electroencephalogram-based attention-seeking pooling depressive disorder identification and classification method
Jiang et al. Hadamard product perceptron attention for image captioning
CN116758558A (en) Cross-modal generation countermeasure network-based image-text emotion classification method and system
CN115098646B (en) Multistage relation analysis and mining method for graphic data
CN116822513A (en) Named entity identification method integrating entity types and keyword features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination