CN114170460A - Multi-mode fusion-based artwork classification method and system - Google Patents

Multi-mode fusion-based artwork classification method and system

Info

Publication number
CN114170460A
Authority
CN
China
Prior art keywords
artwork
classification
image
classifier
prediction result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111411858.3A
Other languages
Chinese (zh)
Inventor
蒋蕊
司思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Chemical Technology
Original Assignee
Beijing University of Chemical Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Chemical Technology filed Critical Beijing University of Chemical Technology
Priority to CN202111411858.3A priority Critical patent/CN114170460A/en
Publication of CN114170460A publication Critical patent/CN114170460A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/254 - Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 - Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 - Distances to prototypes
    • G06F18/24137 - Distances to cluster centroïds
    • G06F18/2414 - Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and a system for classifying artworks based on multi-modal fusion, belonging to the fields of natural language processing and computer vision. The method preprocesses data of the two modalities of an artwork, obtains a text-modality prediction result and an image prediction result respectively, and obtains the artwork classification prediction result using a learning weight vector. The method overcomes the defect that predefined decision rules cannot be learned from data: weights are learned over the multi-modal classifier predictions and serve as the decision rule for decision-level fusion. This overcomes the limitation of single-modality artwork classification and effectively improves the accuracy and robustness of multi-modal artwork classification.

Description

Multi-mode fusion-based artwork classification method and system
Technical Field
The invention belongs to the field of natural language processing and computer vision, and particularly relates to an artwork classification method and system based on multi-mode fusion.
Background
With the development of China's art sector, a large volume of artwork resources has been produced. At an art exhibition it is difficult for the curator to quickly find, among massive artwork data, the exhibits that will present the desired effect. Classifying artwork data serves, on one hand, as the basis for content recommendation: collected artworks under a given category can be recommended to the curator, saving time when preparing an exhibition. On the other hand, the artwork collection system needs to update its curation library frequently. When the curation library is updated, traditional classification and filing is usually done manually and consumes a great deal of human time; automatic classification of artworks through artificial intelligence saves the time and effort of manual screening, classification and filing, removes the need to read the explanation text and inspect the picture content of every work, and ensures that the curation library is updated in a timely manner. Dividing artworks into different categories according to their image and text features, and thereby realizing artwork classification, has become an important topic combining deep learning methods with the field of art curation.
Artwork classification can be addressed as separate text and image classification tasks. For the text modality, the classic TextCNN text classification model can effectively capture a text sequence and achieve good results on text classification without knowledge of the syntactic or semantic structure of a sentence. For the image modality, classical image classification usually adopts a CNN-based model, which captures the local texture features of an image but lacks modelling of global information, whereas Transformer-based approaches exploit the attention mechanism to capture both local and global information.
At present, artwork data is no longer limited to a single image (painting) modality; rather, data of multiple modalities is combined, such as the author's explanation and description of the work together with its image presentation. How to effectively fuse the information of different modalities is the key to multi-modal artwork classification.
In research on multi-modal classification, the commonly used fusion methods are mainly feature-level fusion and decision-level fusion. Feature-level fusion considers the complementarity of the features of different modalities, but does not consider their different contributions to classification, and simply fuses them by feature concatenation. Decision-level fusion, by contrast, usually starts from the prediction result of each modality's classifier and then makes a decision according to relevant rules to produce the final classification result.
Decision-level fusion therefore accounts for the differences between modalities according to the different contributions of their information. Of course, the performance of multi-modal artwork classification based on decision-level fusion depends not only on the performance of each single-modality classifier but also on the performance of the decision-level fusion method.
In view of the above problems, the present invention provides an artwork classification method based on multi-modal fusion, which uses data of two modalities, the artwork's image and its explanation text, for joint classification to enhance the classification result.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides an artwork classification method and system based on multi-mode fusion.
The invention is realized by the following technical scheme:
the method comprises the steps of preprocessing data of two modes of an artwork, respectively obtaining a text mode prediction result and an image prediction result, and obtaining an artwork classification prediction result by using a learning weight vector.
The invention is further improved in that:
the method comprises the following steps:
s1, acquiring an original data set of the multi-modal artwork, and respectively preprocessing data of two modes to obtain an original image and a word embedding matrix after normalization processing;
s2, constructing a work of art classification model in a text mode, and obtaining a classifier prediction result of the text mode by using a word embedding matrix;
s3, constructing an artwork classification model of an image mode, and obtaining a classifier prediction result of the image by using the original image after normalization processing;
and S4, weighting the prediction result of the classifier in the text mode and the prediction result of the classifier in the image respectively, and further obtaining the classification prediction result of the artwork.
The invention is further improved in that:
the operation of step S1 includes:
s11, performing format processing on the artwork image to obtain an original image after normalization processing;
and S12, performing word segmentation on the artwork explanation text by using a word segmentation tool to obtain segmented words, and mapping the segmented words into a word embedding matrix.
The invention is further improved in that:
in the step S2, a multi-layer self-attention mechanism is integrated into the text convolutional neural network, so as to construct an artwork classification model in a text mode.
The invention is further improved in that:
the operation of step S2 includes:
s21, performing self-attention calculation on the word embedding matrix to obtain a feature map after self-attention calculation;
s22, performing convolution operation on the feature graph after the self-attention calculation by using three groups of convolution kernels with different sizes respectively to obtain each group of finally activated results;
s23, respectively carrying out self-attention calculation on each group of finally activated results to obtain a feature map;
and S24, extracting the maximum value of each channel in the feature map through the pooling layer to perform down-sampling, and obtaining the prediction result of the classifier of the text mode.
The invention is further improved in that:
the operation of step S3 includes:
s31, dividing the original picture after normalization into a plurality of picture blocks to obtain the serialized picture blocks;
s32, adding position coding information into the serialized picture block to obtain a vector sequence added with the position information;
s33, sending the vector sequence added with the position information into a ViT model encoder module for feature extraction;
and S34, performing linear transformation on the features extracted by the ViT model encoder module, and activating to obtain a classifier prediction result of the image.
The invention is further improved in that:
the operation of step S31 includes:
s311, taking half of the side length of the picture block as a convolution sliding step length, and dividing the original picture after the normalization processing into l picture blocks by utilizing convolution transformation;
s312, flattening each picture block into a one-dimensional vector to obtain the serialized picture blocks.
The invention is further improved in that:
the operation of step S4 includes:
assigning a learning weight vector to the prediction result of the classifier of the text mode to obtain the calculation result of the classifier of the text mode;
taking a value obtained by subtracting the learning weight vector of the prediction result of the classifier in the text mode from 1 as the learning weight vector of the prediction result of the classifier of the image, and assigning the learning weight vector to the prediction result of the classifier of the image to obtain the calculation result of the classifier of the image; the learning weight vector is a numerical value between 0 and 1;
and adding the calculation result of the classifier of the text mode and the calculation result of the classifier of the image, and inputting the sum into an MLP (multi-layer perceptron) containing a fully connected layer and a SoftMax activation layer to obtain the artwork classification prediction result.
In a second aspect of the present invention, there is provided a system for classifying works of art based on multi-modal fusion, the system comprising:
the data acquisition and processing unit: used for acquiring an original data set of the multi-modal artwork and preprocessing the data of the two modalities respectively to obtain the normalized original image and the word embedding matrix;
a text prediction unit: connected with the data acquisition and processing unit, and used for constructing the artwork classification model of the text modality and obtaining the classifier prediction result of the text modality by using the word embedding matrix;
an image prediction unit: connected with the data acquisition and processing unit, and used for constructing the artwork classification model of the image modality and obtaining the classifier prediction result of the image by using the normalized original image;
a classification prediction unit: connected with the text prediction unit and the image prediction unit respectively, and used for weighting the classifier prediction result of the text modality and the classifier prediction result of the image respectively, thereby obtaining the artwork classification prediction result.
In a third aspect of the present invention, there is provided a computer-readable storage medium storing at least one program executable by a computer, the at least one program, when executed by the computer, causing the computer to perform the steps of the above-mentioned multi-modal fusion-based artwork classification method.
Compared with the prior art, the invention has the beneficial effects that:
firstly, multi-layer self-attention is integrated into the TextCNN model, so that key features in the artwork's text explanation are attended to and given larger weights while the influence of other, non-key features on the classification result is reduced, improving the prediction result of the text classifier;
secondly, the invention improves the way ViT segments the original picture: the edge regions of picture blocks are segmented with overlap, so that the local information of adjacent pixels in neighbouring picture blocks is preserved, the segmented picture-block vectors are more strongly correlated, and the classification accuracy of the image modality is improved;
thirdly, a learning weight vector is assigned to each of the two improved classifiers' prediction results and used as a learned decision rule for decision-level fusion; the fused result is fed into a multi-layer perceptron consisting of a linear layer and a softmax activation layer, whose output is the final prediction of the multi-modal network.
The method overcomes the defect that predefined decision rules cannot be learned from data: weights are learned over the multi-modal classifier predictions and serve as the decision rule for decision-level fusion. This overcomes the limitation of single-modality artwork classification and effectively improves the accuracy and robustness of multi-modal artwork classification.
Drawings
FIG. 1 is a block diagram of the steps of an art classification method based on multi-modal fusion according to the present invention;
FIG. 2 is a schematic diagram of an art classification method based on multi-modal fusion according to the present invention;
FIG. 3 is a flow chart of model classification in the multi-modal fusion-based artwork classification method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings:
the invention provides an artwork classification method based on multi-mode fusion, which comprises the following steps of:
acquiring an original data set of the multi-modal artwork, and respectively preprocessing data of two modes;
integrating a multilayer self-attention mechanism into the existing text convolutional neural network, and constructing a text modal artwork classification model (also called a text modal classifier and a text classifier);
an existing ViT (Vision Transformer) network is improved, and an artwork classification model (also called an image classifier) of an image modality is constructed;
assigning a learning weight vector between 0 and 1 to the prediction result of the classifier of the text modality, using the value obtained by subtracting this learning weight vector from 1 as the learning weight vector of the image classifier's prediction result, adding the calculation results of the two classifiers, and feeding the sum into an MLP (multi-layer perceptron) containing a fully connected layer and a SoftMax activation layer to obtain the final artwork classification prediction result of the decision-level fusion network.
The method comprises the following steps:
as shown in fig. 1 and 3, the art classification method based on multi-modal fusion of the present invention includes:
s1, acquiring an original data set of the multi-modal artwork, and respectively preprocessing data of two modes to obtain an original image and a word embedding matrix after normalization processing; the data of the two modalities are referred to as an art image and an art explanation text.
Preferably, the operation of acquiring the original data set in the multi-modal art field and respectively preprocessing the data of the two modalities includes:
s11, performing format processing on the artwork image to obtain an original image after normalization processing: the sketch type works represented by the gray level diagram in the artwork data set and other artwork images represented by the gray level diagram are all converted into RGB formats, so that the formats are unified;
normalizing the image features: for image pixel values of 0-255 in RGB format, all pixel values are converted to the range 0-1 using a normalization function (e.g. transforms.Normalize); after normalization a 3-dimensional tensor image representation with values between 0 and 1 is obtained, i.e. the normalized original picture.
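A minimal preprocessing sketch is given below, assuming PIL and torchvision are available; the function name and the 224 x 224 resize are illustrative assumptions (the resize only mirrors the image dimensions used in the embodiment). transforms.ToTensor performs the 0-255 to 0-1 conversion described above; transforms.Normalize, named in the description, could additionally be applied with dataset statistics.

```python
from PIL import Image
from torchvision import transforms

def preprocess_artwork_image(path: str):
    # Convert grayscale sketches (and any other single-channel images) to RGB
    img = Image.open(path).convert("RGB")
    pipeline = transforms.Compose([
        transforms.Resize((224, 224)),   # fixed input size; 224 x 224 matches the embodiment
        transforms.ToTensor(),           # scales 0-255 pixel values into the range [0, 1]
    ])
    return pipeline(img)                 # 3-dimensional tensor with values between 0 and 1
```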
S12, performing Word segmentation on the work of art explanation text by using a Word segmentation tool to obtain segmented words, and mapping the segmented words into a Word vector by using the existing Word2Vec pre-training Word vector model to obtain a Word embedding matrix as the input of the text classifier.
The word segmentation tool can be the existing jieba tool. Furthermore, stop words can be removed with a custom function: each word is compared against a stop-word list and is discarded if it is a stop word. These are mature existing techniques and are not described here.
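A sketch of this text preprocessing, assuming jieba and gensim are used; the function name, the zero-padding to a fixed length and the stop-word handling are illustrative assumptions (the 2000-word length follows the embodiment described later).

```python
import jieba
import numpy as np
from gensim.models import KeyedVectors

def text_to_embedding_matrix(sentence, word_vectors: KeyedVectors, stopwords, max_len=2000):
    # Segment the explanation text with jieba and drop stop words
    words = [w for w in jieba.lcut(sentence) if w.strip() and w not in stopwords]
    matrix = np.zeros((max_len, word_vectors.vector_size), dtype=np.float32)  # zero-padding up to max_len
    for i, w in enumerate(words[:max_len]):
        if w in word_vectors:                      # look up the pre-trained Word2Vec vector
            matrix[i] = word_vectors[w]
    return matrix                                  # [max_len x embedding_dim] word embedding matrix
```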
S2, a multilayer self-attention mechanism is integrated into the text convolution neural network, a text modal artwork classification model is constructed, and a text modal classifier prediction result is obtained;
preferably, the operation of integrating the multilayer self-attention mechanism into the text convolutional neural network and constructing the artwork classification model of the text modality includes:
s21, performing self-attention calculation on the word embedding matrix to obtain a feature map after self-attention calculation, wherein the calculation method comprises the following steps:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V
wherein Q denotes the query, K denotes the key, and V denotes the value corresponding to the key; each is a matrix. Specifically, Q is obtained by multiplying the word embedding matrix with a randomly initialized weight matrix W_q, K is obtained by multiplying the word embedding matrix with a randomly initialized weight matrix W_k, and V is obtained by multiplying the word embedding matrix with a randomly initialized weight matrix W_v. d_k denotes the dimension of a word vector (one row of the word embedding matrix).
The self-attention calculation realizes that a larger weight is given to the key features, and the influence of the rest non-key features on the classification result is reduced for the following reasons:
by passing
Figure BDA0003371251630000082
And calculating to obtain a correlation degree matrix between every two words in the word embedding matrix, namely a weight matrix needing to multiply and assign the V matrix, wherein the correlation degree of the weight matrix with the rest words is higher, the sentence is more critical, and the weight value calculation result corresponding to the key characteristic is higher.
The calculated weight matrix
Figure BDA0003371251630000083
And multiplying the V by a matrix V which represents the text explanation sentence and needs to be assigned to realize the weight assignment of the V, and obtaining the feature diagram after the self-attention calculation.
Step S21 is actually to perform self-attention calculation on the word embedding matrix to obtain a weight matrix, and to assign a weight to the matrix V by using the weight matrix to obtain a feature map after self-attention calculation.
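The following PyTorch sketch illustrates this single-head self-attention step; the class name and the use of nn.Linear layers for the randomly initialized W_q, W_k and W_v are assumptions, and the projections keep the input dimension so the feature-map size is unchanged.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    def __init__(self, dim: int):
        super().__init__()
        # Randomly initialized weight matrices W_q, W_k, W_v, kept at the input dimension
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                           # x: [batch, seq_len, dim]
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        weights = F.softmax(scores, dim=-1)         # pairwise correlation (weight) matrix
        return weights @ v                          # feature map after self-attention
```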
S22, performing convolution operation on the feature graph after the self-attention calculation by using three groups of convolution kernels with different sizes to realize feature extraction;
preferably, the operation of performing convolution operations on the feature map after the self-attention calculation by using three groups of convolution kernels with different sizes to extract the features includes:
s221, performing linear calculation by using the convolution kernel and the feature graph after the self-attention calculation to obtain a feature result, wherein a calculation formula can be represented by the following formula:
y = Σ w_c × x_i' + b_c
wherein w_c denotes the convolution kernel, x_i' denotes the several word vectors currently scanned by the convolution kernel, and b_c is a bias term (x_i' and b_c are initialized randomly and their values are adjusted by gradient descent, an existing technique not described in detail here); y is the output feature value at the position of the convolution kernel;
s222, activating the characteristic result by using a Relu activation function to obtain a final activated result, wherein the activation function can be represented by the following formula:
y' = max(0, y)
where y' is the result after final activation.
Down-sampling and feature extraction are realized through this convolution process.
S23, respectively carrying out self-attention calculation on each group of finally activated results (three groups of convolution kernels with different sizes are used, so three groups of finally activated results are obtained) to obtain a feature map:
In the same manner as step S21, a matrix W_q is first randomly initialized; its length equals the dimension of the feature map after the convolution operation and its width equals the self-attention size. W_q is matrix-multiplied with the feature-map matrix obtained by the convolution calculation to obtain Q. A matrix W_k, with length equal to the dimension of the convolved feature map and width equal to the self-attention size, is randomly initialized and matrix-multiplied with the feature-map matrix to obtain K. A matrix W_v, with the same length and width, is randomly initialized and matrix-multiplied with the feature-map matrix to obtain V. According to the formula
Attention(Q, K, V) = softmax(QK^T / √d_k) · V
the self-attention calculation is realized, and after it is completed the weights of the feature map are adjusted once more.
S24, extracting the maximum value of each channel in the feature map through the pooling layer to perform down-sampling, and obtaining the prediction result of the classifier of the text mode, wherein the prediction result is as follows:
ŷ = max(y')
wherein ŷ denotes the pooled (maximum) feature value of a channel. The maximum values of all channels are concatenated to form a global high-level feature, which is fed into a fully connected layer; finally, Softmax activation of the fully connected layer's output yields the prediction probability of each category (a mature technique, not repeated here), i.e. the prediction result of the text-modality classifier.
In this way, in step S2, an art classification model in the text mode is constructed, and the output result of the model is the classifier prediction result in the text mode.
S22 and S24 in the above steps are both steps in the conventional text convolutional neural network, and the present invention performs a first self-attention calculation before using the text convolutional neural network (i.e., step S21), and then performs a second self-attention calculation in step S23, so that the following effects are achieved:
the method has the advantages that the explanation information of each artwork sample is excessively redundant, the information contained in the text sentence of a single sample is excessive, the traditional text convolution classification method lacks the attention to the key information, the key features are endowed with larger weight values by integrating the text convolution twice through self-attention calculation, the influence of other non-key features on the classification result is reduced, and the classification precision is further improved.
S3, improving the ViT network, constructing an artwork classification model of an image mode, and obtaining a classifier prediction result of an image;
preferably, the ViT network is improved, and the operation of constructing the image-mode artwork classification model comprises the following steps:
s31, dividing the original picture after normalization into a plurality of picture blocks, converting each picture block into a picture block vector, forming a vector sequence by the picture block vectors corresponding to all the picture blocks, overlapping and dividing the pixels at the adjacent positions of two adjacent picture blocks, using the pixels as the pixels selected by the adjacent picture blocks, and keeping the information of local textures, lines and the like at the edge part, so that the picture block vectors after serialization have stronger correlation;
preferably, the operation of dividing the original picture after normalization into a plurality of picture blocks, converting each picture block into an image block vector, forming a vector sequence by the image block vectors corresponding to all the picture blocks, overlapping and dividing the pixels at the adjacent positions of two adjacent picture blocks, and using the pixels as the selected pixels of the adjacent picture blocks includes:
s311, the difference between the model of the invention and the model of ViT is that: in the convolution conversion process of the original picture after the normalization processing, the sliding step of the convolution is reduced to make the step be half of the side length of the picture block, so as to retain the adjacent position information of the adjacent picture block, and the number l of vectors in the picture block vector sequence after the division and the output (the "number" is the number of image blocks included in the vector sequence formed by the image block vectors in the above, i.e. n in the formula of step S312) is as follows:
Figure BDA0003371251630000111
the size of a picture block is k × k, the overlapping width of adjacent picture blocks is s, p is the filling size of the original picture after normalization processing, h represents the height of the original picture after normalization processing, w represents the width of the original picture after normalization processing, k-s is the step size of convolution operation, the step size is preferably equal to half of the side length of the picture block, namely the size of the picture block is k × k, the convolution step size is half of the side length of the picture block, namely k/2, namely s is also half of the side length.
In step S311, the sliding step of the convolution is reduced to half the side length of a picture block, thereby implementing the overlapping segmentation.
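A PyTorch sketch of this overlapping segmentation, assuming the projection E is implemented as a strided convolution; the class name and defaults are assumptions, though a 16 x 16 patch with stride 8 on a 224 x 224 image yields 27 x 27 = 729 blocks plus the class token, matching the 730-token sequence of the embodiment.

```python
import torch
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    """Overlapping segmentation of step S311: a convolution whose stride is half the patch
    side length, followed by flattening and prepending the x_cls token (step S312)."""
    def __init__(self, patch_size=16, dim=768, in_channels=3):
        super().__init__()
        # stride k - s = k / 2 makes adjacent picture blocks share their edge pixels;
        # the convolution also realizes the linear projection E to 768 dimensions
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size // 2)
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))   # randomly initialized x_cls

    def forward(self, img):                                     # img: [batch, 3, h, w]
        z = self.proj(img)                                      # [batch, dim, h', w']
        z = z.flatten(2).transpose(1, 2)                        # l = h' * w' serialized picture blocks
        cls = self.cls_token.expand(z.size(0), -1, -1)
        return torch.cat([cls, z], dim=1)                       # Z = [x_cls; x_1 E; ...; x_n E]
```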
S312, performing a trainable linear projection to obtain the serialized picture blocks: each picture block is flattened into a 1-dimensional vector, and a trainable weight vector x_cls (representing the learned classification-category information, generated by random initialization) is spliced to the initial position of the vector sequence to obtain the serialized picture blocks; the calculation process is:
Z = [x_cls; x_1E; x_2E; ...; x_nE]
n in the above equation represents the number of picture blocks in the picture-block sequence, i.e. the l obtained in step S311.
wherein x_cls is the vector representing the classification category, x_1, x_2, ..., x_n is the picture-block vector sequence obtained by segmenting the normalized original picture, and E is the weight of the linear projection (since x_1, x_2, ..., x_n are the n picture blocks of one picture, different picture blocks are projected by linear calculation with the same set of weights E; E is a 768-channel convolution kernel, each picture block x_i likewise has three RGB channels, and each picture block yields a 768-dimensional vector through linear computation with the same E).
Step S312 is an existing algorithm, and is not described herein again.
S32, adding position encoding information to the serialized picture block Z obtained in step S312, the calculation formula is:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos represents the position of the current image-block vector in the image, i represents the index of each value within the vector, and d_model represents the dimension of the image-block vector.
Adding the calculated position code into a vector sequence, wherein the calculation formula is as follows:
Z_0 = Z + E_pos
wherein Z is the linearly projected value of the original image (i.e. the Z obtained in step S312), E_pos is a position-information vector that determines the position of the current image block within the sequence of image-block vectors, and Z_0 is the vector sequence with position information added;
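The sinusoidal position encoding above can be sketched as follows; the function name is an assumption, and d_model is assumed even (768 in the embodiment).

```python
import torch

def sinusoidal_position_encoding(length: int, d_model: int) -> torch.Tensor:
    """E_pos with PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)      # [length, 1]
    i = torch.arange(0, d_model, 2, dtype=torch.float32)              # even dimension indices
    angles = pos / torch.pow(torch.tensor(10000.0), i / d_model)      # [length, d_model / 2]
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Z_0 = Z + E_pos, broadcast over the batch dimension:
# z0 = z + sinusoidal_position_encoding(z.size(1), z.size(2))
```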
s33, sending the vector sequence added with the position information into an encoder module (an encoder module of the existing ViT model) for feature extraction;
preferably, the sending the vector sequence added with the position information to the encoder module for feature extraction includes:
s331, carrying out layer normalization adjustment on the features sent into the encoder, wherein the calculation mode is as follows:
LN(Z) = γ · (Z - μ) / √(σ² + ε) + β
where Z represents the input vector of the encoder, μ and σ² represent the mean and variance respectively, and ε is a very small number set to prevent the denominator from being zero, which may take the value 10^-8. γ and β are the scale and shift parameter vectors, whose dimension is consistent with that of the image-block representation vectors fed into the encoder;
s332, performing multi-head self-attention calculation on the features after the layer normalization feature adjustment, wherein the calculation step is as follows:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
where Q, K and V are the queries, keys and values of the attention calculation, and d_k denotes the dimension of the key and acts as a scaling factor. An advantage of the multi-head self-attention linear calculation is that parameters are not shared between heads: the matrices W_i^Q, W_i^K and W_i^V apply separate linear transformations to the Q, K and V feature vectors of head i. The results of the h scaled dot-product attention computations are concatenated, and the final multi-head self-attention result is obtained through one further linear transformation, where W^O is the weight matrix of that linear transformation;
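A from-scratch PyTorch sketch of these multi-head self-attention formulas; the class name and the 768-dimension, 12-head defaults are assumptions.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Per-head projections W_i^Q, W_i^K, W_i^V are not shared; W^O is the final linear map."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.d_k = num_heads, dim // num_heads
        self.w_q = nn.Linear(dim, dim)    # all heads' W_i^Q stacked into one matrix
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        self.w_o = nn.Linear(dim, dim)    # W^O

    def forward(self, x):                                               # x: [batch, n, dim]
        b, n, _ = x.shape
        def split_heads(t):                                             # -> [batch, heads, n, d_k]
            return t.view(b, n, self.num_heads, self.d_k).transpose(1, 2)
        q, k, v = split_heads(self.w_q(x)), split_heads(self.w_k(x)), split_heads(self.w_v(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        heads = (attn @ v).transpose(1, 2).reshape(b, n, -1)            # Concat(head_1, ..., head_h)
        return self.w_o(heads)
```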
s333, residual error connection is carried out on the output result of the multi-head self-attention module and the image block vector sequence sent to the encoder;
and S334, adjusting the result of residual error connection through the layer normalization feature again, and sending the result into an MLP module to extract features.
Preferably, the step of sending the result of residual error connection to the MLP module after adjusting the layer normalization feature again includes:
s3341, sending the data into a Linear layer (Linear) for Linear transformation;
s3342, activating by using a GELU activation function, sending an activation result to a Dropout layer, and then performing linear transformation, wherein in order to prevent overfitting, Dropout is used for neglecting a certain proportion of characteristic values;
and S3343, performing residual error connection on the output result of the Dropout layer and the calculation result of the first residual error connection in the encoder again to obtain the final output result of the MLP module.
And S34, performing linear transformation on the features extracted by the encoder, and activating to obtain the prediction result of the image classifier.
In this way, in step S3, an art classification model in the image modality is constructed, and the output of the model is the classifier prediction result of the image.
Step S31 above is the improvement the present invention makes to ViT; steps S32, S33 and S34 are existing steps of the existing ViT model and are therefore only described briefly.
S4, a learning weight vector is assigned to the prediction result of the text-modality classifier to obtain the calculation result of the text-modality classifier; the value obtained by subtracting this learning weight vector from 1 is used as the learning weight vector of the image classifier's prediction result and is assigned to that prediction result to obtain the calculation result of the image classifier. The learning weight vector is a numerical value between 0 and 1;
the calculation result of the text-modality classifier and the calculation result of the image classifier are added (the calculation results of the two classifiers are vectors of dimension M, where M equals the number of classes, e.g. M = 10 for ten-class classification; corresponding dimensions are added directly, an existing technique not repeated here) and fed into an MLP (multi-layer perceptron) containing a fully connected layer and a SoftMax activation layer, obtaining the final artwork classification prediction result of the decision-level fusion network. The artwork classification prediction result gives the probability of every class, and the class with the maximum probability is the prediction result for the artwork, i.e. its classification.
Furthermore, accuracy, recall and the F1 score can be used to evaluate the model's predictions; in general, the larger the accuracy value and the relatively higher the recall and F1 score, the better the classification effect of the model. These methods are mature techniques and are not described again here.
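A small evaluation sketch, assuming scikit-learn is used; the function name and the macro averaging over classes are assumptions.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

def evaluate_predictions(y_true, y_pred):
    # y_true and y_pred are lists (or arrays) of ground-truth and predicted class indices
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, average="macro"),   # macro average over artwork classes
        "f1": f1_score(y_true, y_pred, average="macro"),
    }
```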
An embodiment utilizing the present invention is given below:
[ EXAMPLES one ]
As shown in fig. 2, for the multi-modal fusion-based artwork classification method of the present invention, the specific dimension transformations during model building and training on the public artwork data set SemArt are as follows:
For the input text-modality data, a matrix of dimension [2000 x 300] is obtained through the word embedding layer, where 2000 is the set number of words per text sentence (padded with zeros when there are not enough words) and 300 is the word-vector size. The self-attention transformation does not change the dimension of the text data. After convolution with three groups of 16 convolution kernels of sizes 2, 3 and 4 respectively, the output text feature maps have dimensions [16 x 1998], [16 x 1997] and [16 x 1996]; after self-attention computation on each of the three feature maps, the dimensions remain [16 x 1998], [16 x 1997] and [16 x 1996]. The feature obtained by max pooling is [1 x 48], and after the fully connected layer and activation the dimension is [1 x 10].
The original data dimension of the image modality is [3 × 224]; through the soft (overlapping) segmentation serialization process the image data is transformed to dimension [730 × 768]; the feature-map dimension obtained through the encoder's MSA self-attention calculation and the multi-layer linear processing module is [730 × 768]; the class vector [1 × 768] is taken, and a prediction vector of dimension [1 × 10] is obtained through a fully connected layer with 10 outputs.
For the [1 x 10] prediction vector of the text modality, a [1 x 10]-dimensional vector w_t with values between 0 and 1 is initialized and multiplied element-wise with the prediction vector to obtain t of dimension [1 x 10]; 1 - w_t is used as the weight w_p of the image modality and multiplied element-wise with the image-modality prediction vector to obtain p of dimension [1 x 10]; the two vectors t and p are added to obtain the final prediction vector of dimension [1 x 10].
The "ten-degree-result" in the embodiment shown in fig. 3 means that the works of art are classified into 10 types, and is illustrated by model training based on the disclosed work of art data set SemArt. In practice, the number of the art categories can be set according to actual needs, for example, more than ten categories or less than ten categories.
The usual fusion approach is to make a decision on the classification prediction vectors of the two modalities through some decision rule, performing rule-based operations on the prediction results of several classifiers to obtain the final prediction. Such a decision rule is usually predefined and cannot be learned from the data. The present invention learns weights over the multi-modal classifier predictions and uses them as the decision rule for decision-level fusion. Compared with simple fusion by feature concatenation, the decision-fusion method of the invention gives larger weights to the better-performing sub-classifiers (i.e. the single-modality classifiers with high classification accuracy) and reduces the influence of the poorer-performing sub-classifiers on the decision-fusion result, thereby greatly improving classification accuracy.
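A PyTorch sketch of this decision-level fusion; the class name is an assumption, and constraining the learnable weight vector to (0, 1) with a sigmoid is one possible way to realize the 0-1 weight described in step S4, not the invention's prescribed implementation.

```python
import torch
import torch.nn as nn

class DecisionLevelFusion(nn.Module):
    """Step S4 sketch: a learnable weight vector w_t weights the text prediction,
    1 - w_t weights the image prediction, and the sum passes through a fully
    connected layer followed by Softmax."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(num_classes))     # learned jointly with the network
        self.fc = nn.Linear(num_classes, num_classes)

    def forward(self, text_pred, image_pred):                    # both: [batch, num_classes]
        w_t = torch.sigmoid(self.weight)                         # keeps w_t between 0 and 1
        fused = w_t * text_pred + (1.0 - w_t) * image_pred       # t + p, added element-wise
        return torch.softmax(self.fc(fused), dim=-1)             # final multi-modal prediction
```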
The invention also provides an artwork classification system based on multi-mode fusion, and the embodiment of the system is as follows:
[ example two ]
The system comprises:
the data acquisition and processing unit: used for acquiring an original data set of the multi-modal artwork and preprocessing the data of the two modalities respectively to obtain the normalized original image and the word embedding matrix;
a text prediction unit: connected with the data acquisition and processing unit, and used for constructing the artwork classification model of the text modality and obtaining the classifier prediction result of the text modality by using the word embedding matrix;
an image prediction unit: connected with the data acquisition and processing unit, and used for constructing the artwork classification model of the image modality and obtaining the classifier prediction result of the image by using the normalized original image;
a classification prediction unit: connected with the text prediction unit and the image prediction unit respectively, and used for weighting the classifier prediction result of the text modality and the classifier prediction result of the image respectively, thereby obtaining the artwork classification prediction result.
The invention also provides a computer-readable storage medium, which comprises the following embodiments:
[ EXAMPLE III ]
The computer-readable storage medium stores at least one program executable by a computer, the at least one program causing the computer to perform the steps of the above-described multi-modal fusion-based art classification method.
The multi-modal fusion-based artwork classification method is implemented on a computer. It combines the TextCNN convolutional network with an attention mechanism to model the artwork's text data, overcoming the problem that a traditional text convolutional network ignores keywords when classifying text data; it improves the serialization of the image in the Vision Transformer, retaining the local features of edge regions so that the serialized image blocks are more strongly correlated; it uses data of multiple modalities, making the feature information richer; and it forms the decision rule by learning weights for decision-level fusion, taking into account the differences between the information of different modalities.
In conclusion, the multi-modal fusion-based artwork classification method has high accuracy and robustness, and has remarkable application value in the field of intelligent art curation.
Finally, it should be noted that the above-mentioned technical solution is only one embodiment of the present invention, and it will be apparent to those skilled in the art that various modifications and variations can be easily made based on the application method and principle of the present invention disclosed, and the method is not limited to the above-mentioned specific embodiment of the present invention, so that the above-mentioned embodiment is only preferred, and not restrictive.

Claims (10)

1. A multi-mode fusion-based artwork classification method is characterized by comprising the following steps: the method comprises the steps of preprocessing data of two modes of an artwork, then respectively obtaining a prediction result of a text mode and a prediction result of an image, and obtaining an artwork classification prediction result by utilizing a learning weight vector.
2. The method of multi-modal fusion based work of art classification as claimed in claim 1 wherein: the method comprises the following steps:
s1, acquiring an original data set of the multi-modal artwork, and respectively preprocessing data of two modes to obtain an original image and a word embedding matrix after normalization processing;
s2, constructing a work of art classification model in a text mode, and obtaining a classifier prediction result of the text mode by using a word embedding matrix;
s3, constructing an artwork classification model of an image mode, and obtaining a classifier prediction result of the image by using the original image after normalization processing;
and S4, weighting the prediction result of the classifier in the text mode and the prediction result of the classifier in the image respectively, and further obtaining the classification prediction result of the artwork.
3. The method of multi-modal fusion based artwork classification of claim 2, wherein: the operation of step S1 includes:
s11, performing format processing on the artwork image to obtain an original image after normalization processing;
and S12, performing word segmentation on the artwork explanation text by using a word segmentation tool to obtain segmented words, and mapping the segmented words into a word embedding matrix.
4. The method of multi-modal fusion based artwork classification of claim 2, wherein: in the step S2, a multi-layer self-attention mechanism is integrated into the text convolutional neural network, so as to construct an artwork classification model in a text mode.
5. The method of multi-modal fusion based artwork classification of claim 4, wherein: the operation of step S2 includes:
s21, performing self-attention calculation on the word embedding matrix to obtain a feature map after self-attention calculation;
s22, performing convolution operation on the feature graph after the self-attention calculation by using three groups of convolution kernels with different sizes respectively to obtain each group of finally activated results;
s23, respectively carrying out self-attention calculation on each group of finally activated results to obtain a feature map;
and S24, extracting the maximum value of each channel in the feature map through the pooling layer to perform down-sampling, and obtaining the prediction result of the classifier of the text mode.
6. The method of multi-modal fusion based artwork classification of claim 2, wherein: the operation of step S3 includes:
s31, dividing the original picture after normalization into a plurality of picture blocks to obtain the serialized picture blocks;
s32, adding position coding information into the serialized picture block to obtain a vector sequence added with the position information;
s33, sending the vector sequence added with the position information into a ViT model encoder module for feature extraction;
and S34, performing linear transformation on the features extracted by the ViT model encoder module, and activating to obtain a classifier prediction result of the image.
7. The method of multi-modal fusion based artwork classification of claim 6, wherein: the operation of step S31 includes:
s311, taking half of the side length of the picture block as a convolution sliding step length, and dividing the original picture after the normalization processing into l picture blocks by utilizing convolution transformation;
s312, flattening each picture block into a one-dimensional vector to obtain the serialized picture blocks.
8. The method of multi-modal fusion based artwork classification of claim 2, wherein: the operation of step S4 includes:
assigning a learning weight vector to the prediction result of the classifier of the text mode to obtain the calculation result of the classifier of the text mode;
taking a value obtained by subtracting the learning weight vector of the prediction result of the classifier in the text mode from 1 as the learning weight vector of the prediction result of the classifier of the image, and assigning the learning weight vector to the prediction result of the classifier of the image to obtain the calculation result of the classifier of the image; the learning weight vector is a numerical value between 0 and 1;
and adding the calculation result of the classifier of the text mode and the calculation result of the classifier of the image, and inputting the sum into an MLP (multi-layer perceptron) containing a fully connected layer and a SoftMax activation layer to obtain an artwork classification prediction result.
9. An artwork classification system based on multi-mode fusion is characterized in that: the system comprises:
the data acquisition and processing unit: used for acquiring an original data set of the multi-modal artwork and preprocessing the data of the two modalities respectively to obtain the normalized original image and the word embedding matrix;
a text prediction unit: connected with the data acquisition and processing unit, and used for constructing the artwork classification model of the text modality and obtaining the classifier prediction result of the text modality by using the word embedding matrix;
an image prediction unit: connected with the data acquisition and processing unit, and used for constructing the artwork classification model of the image modality and obtaining the classifier prediction result of the image by using the normalized original image;
a classification prediction unit: connected with the text prediction unit and the image prediction unit respectively, and used for weighting the classifier prediction result of the text modality and the classifier prediction result of the image respectively, thereby obtaining the artwork classification prediction result.
10. A computer-readable storage medium characterized by: the computer-readable storage medium stores at least one program executable by a computer, the at least one program causing the computer to perform the steps of the multi-modal fusion based artwork classification method of any one of claims 1-8.
CN202111411858.3A 2021-11-24 2021-11-24 Multi-mode fusion-based artwork classification method and system Pending CN114170460A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111411858.3A CN114170460A (en) 2021-11-24 2021-11-24 Multi-mode fusion-based artwork classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111411858.3A CN114170460A (en) 2021-11-24 2021-11-24 Multi-mode fusion-based artwork classification method and system

Publications (1)

Publication Number Publication Date
CN114170460A true CN114170460A (en) 2022-03-11

Family

ID=80481026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111411858.3A Pending CN114170460A (en) 2021-11-24 2021-11-24 Multi-mode fusion-based artwork classification method and system

Country Status (1)

Country Link
CN (1) CN114170460A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496977A (en) * 2022-09-14 2022-12-20 北京化工大学 Target detection method and device based on multi-mode sequence data fusion
CN115496977B (en) * 2022-09-14 2023-04-25 北京化工大学 Target detection method and device based on multi-mode sequence data fusion

Similar Documents

Publication Publication Date Title
CN111858954B (en) Task-oriented text-generated image network model
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
CN109783666B (en) Image scene graph generation method based on iterative refinement
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
US11288324B2 (en) Chart question answering
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN110826609B (en) Double-current feature fusion image identification method based on reinforcement learning
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN115222998B (en) Image classification method
CN111652273A (en) Deep learning-based RGB-D image classification method
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN114170460A (en) Multi-mode fusion-based artwork classification method and system
CN117237704A (en) Multi-label image classification method based on two-dimensional dependence
CN112612900A (en) Knowledge graph guided multi-scene image generation method
CN117011943A (en) Multi-scale self-attention mechanism-based decoupled 3D network action recognition method
CN116958700A (en) Image classification method based on prompt engineering and contrast learning
Adnan et al. Automated image annotation with novel features based on deep ResNet50-SLT
CN115797795A (en) Remote sensing image question-answering type retrieval system and method based on reinforcement learning
CN111768214A (en) Product attribute prediction method, system, device and storage medium
CN115908697A (en) Generation model based on point cloud probability distribution learning and method thereof
CN115080699A (en) Cross-modal retrieval method based on modal specific adaptive scaling and attention network
CN114972851A (en) Remote sensing image-based ship target intelligent detection method
CN113011163A (en) Compound text multi-classification method and system based on deep learning model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination