CN111985369B - Course field multi-modal document classification method based on cross-modal attention convolution neural network - Google Patents
- Publication number: CN111985369B
- Application number: CN202010791032.3A
- Authority
- CN
- China
- Prior art keywords: text, modal, attention, image, features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V30/413—Classification of content, e.g. text, photographs or tables
- G06F18/2415—Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/253—Fusion techniques of extracted features
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
Abstract
The invention relates to a course-field multi-modal document classification method based on a cross-modal attention convolutional neural network. Multi-modal document data in the course field are first preprocessed. An attention mechanism is combined with a dense convolutional network into a cross-modal-attention convolutional neural network that constructs sparse image features more effectively. A bidirectional long short-term memory network built with an attention mechanism over the text features efficiently constructs the text features locally associated with the image semantics. An attention-based grouped cross-modal fusion is designed that learns the local associations between the images and text of a document more accurately and improves the accuracy of cross-modal feature fusion. Compared with existing multi-modal document classification models, the method performs better and improves the accuracy of multi-modal document classification on the same course-field data set.
Description
Technical Field
The invention belongs to the fields of computer applications, multi-modal data classification, education data classification, image processing, and text processing, and in particular relates to a course-field multi-modal document classification method based on a cross-modal attention convolutional neural network.
Background
With the development of science and technology, the data that computers must process in many fields have evolved from single images into multi-modal data such as images, text, and audio, richer in both form and content. Multi-modal document classification has applications in video classification, visual question answering, entity matching in social networks, and more. Its accuracy depends on whether the computer can correctly understand the semantics and content of the images and text contained in a document. However, the images in image-text mixed multi-modal documents in the course field generally consist of lines and characters and are highly sparse in visual features such as color and texture, while the text in such documents is only locally associated with the semantics of the images. Existing multi-modal document classification models therefore struggle to construct accurate semantic feature vectors for the images and text in a document, which lowers the accuracy of multi-modal document feature representation and hinders classification performance.
To solve these problems, the invention extends the model architecture and proposes a course-field multi-modal document classification method based on a cross-modal attention convolutional neural network. The method extracts the sparse image features of the course field well, efficiently constructs text features associated with local fine-grained image semantics, and more accurately learns the associations between image and text features of a specific object, thereby improving multi-modal document classification performance.
Disclosure of Invention
Technical problem to be solved
The visual image features in image-text mixed multi-modal document data in the course field are sparse, and only local semantic associations exist between texts and images, so existing multi-modal document classification models struggle to accurately understand the semantics and content of the texts and images in a document, which greatly degrades classification performance. To solve these problems, the invention provides a course-field multi-modal document classification method based on a cross-modal attention convolutional neural network, which learns the semantic features of sparse course-field images more efficiently, better captures the local fine-grained semantic associations between images and text in a multi-modal document, represents the multi-modal document features accurately, and improves the performance of course-field multi-modal document classification.
Technical scheme
A course-field multi-modal document classification method based on a cross-modal attention convolutional neural network, characterized by comprising the following steps:
step 1: preprocessing of multimodal document data
Step 1.1: each multi-modal document comprises an image and a text description and carries several semantic tags; build a dictionary from the text descriptions and the document label set; delete tags that occur fewer than 13 times, and delete a multi-modal document when its number of semantic tags drops to 0;
Step 1.2: data preprocessing: randomly crop the image data to 224 × 224 and apply random horizontal flipping; truncate or pad all text descriptions to length l, and learn the vector representations of the words in the texts with a word-vector model;
step 2: depth cross-modal feature extraction based on attention mechanism
Step 2.1: construct the representation of the image features with a dense convolutional neural network (DenseNet) equipped with the spatial and channel attention module CBAM; the obtained image features are recorded as x, where m denotes the number of image feature maps;
Step 2.2: construct the text features with a bidirectional long short-term memory network (BiLSTM) and a text attention mechanism, where the text attention mechanism consists of two convolutional layers and a softmax classifier; the computed weights are recorded as α and the weighted text-feature representation as y, with dimension n = 4 × hidden_size, where hidden_size is the feature dimension of the BiLSTM hidden state;
Step 3: grouped cross-modal fusion based on the attention mechanism
Step 3.1: divide the image features x obtained in step 2 into r groups; map each group of image features x'_i together with the text features y into the same one-dimensional space, and obtain the fused features {Z_0, Z_1, …, Z_r} by multi-modal factorized bilinear pooling fusion;
Step 3.2: for each group of fused features Z_i, compute the weight of the feature map on each channel with a channel attention mechanism; the weighted features are recorded as Z'_i;
Step 3.3: pass each weighted fusion feature Z'_i through a fully connected layer; combine the output vectors of the several fully connected layers by adding corresponding vector elements, then compute the probability distribution P of the multi-modal document over each label with a sigmoid classifier; finally, compute the error between the prediction P and the ground truth with maximum entropy as the loss function and train the model parameters with the back-propagation algorithm.
In step 1.2, the course-field multi-modal document data are processed according to the characteristics of the images and texts; for the i-th preprocessed multi-modal document, this finally yields (R_i, T_i, L_i):
(1) Randomly sampling from image-text mixed multi-mode document data in the field of courses;
(2) for each image of the multimodal document:
(a) scale the image with the aspect ratio preserved so that the shortest side is 256; then randomly crop it to 224 × 224; apply a random horizontal flip; finally normalize the channel values to obtain R_i, where C = 3 and H = W = 224;
(3) for each text description in the multimodal document:
(a) count the lengths of all text descriptions and select the length l = 484; 92% of the texts are shorter than this length;
(b) truncate or pad all data to the same length l;
(c) use word vectors, also known as word embeddings, to map word indices to vectors in the real-number domain and train their weights;
(d) the word indices, whose dimension equals the dictionary size, are embedded into a 256-dimensional continuous space to obtain T_i;
(4) For each set of tags in the multimodal document:
(a) for a total of N classes, set up an N-dimensional vector and map the semantic tags of the corresponding document to a 0-1 vector through the tag dictionary, obtaining L_i.
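The 0-1 label mapping of step (4)(a) can be sketched as follows; `label2ix` and the tag names below are made-up illustrations, not taken from the patent's actual tag dictionary.

```python
# Hedged sketch of step 1 (4)(a): map a document's semantic tags to an
# N-dimensional 0-1 (multi-hot) vector through a tag dictionary.
def encode_labels(tags, label2ix, num_classes):
    vec = [0.0] * num_classes          # N-dimensional vector of zeros
    for tag in tags:
        vec[label2ix[tag]] = 1.0       # set a 1 at each tag's index
    return vec

# illustrative tag dictionary
label2ix = {"algebra": 0, "geometry": 1, "calculus": 2}
L_i = encode_labels(["algebra", "calculus"], label2ix, num_classes=3)
print(L_i)  # [1.0, 0.0, 1.0]
```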
The dense convolutional neural network based on the spatial and channel attention mechanism described in step 2.1 constructs the representation of the image features as follows. The processed image data R_i first pass a convolutional layer with a 7 × 7 kernel and stride 2 and a max-pooling layer with a 3 × 3 kernel and stride 2; next comes a CBAM module, followed by alternating DenseBlock, CBAM, and Transition modules that extract the features of the sparse course-field images; finally, average pooling with a 7 × 7 kernel yields the image features x. The CBAM module consists of a channel submodule and a spatial submodule, so that the attention weight maps are multiplied with the input feature map for adaptive feature refinement; the channel submodule computes the weight of each feature map, and the spatial submodule yields the weight of each location within a feature map. Given an intermediate feature map F as input, CBAM successively computes a one-dimensional channel attention map M_c(F) and a two-dimensional spatial attention map M_s(F'); the whole attention mechanism is computed as

F' = M_c(F) ⊗ F,  F'' = M_s(F') ⊗ F',

where ⊗ denotes elementwise multiplication. The channel attention weights M_c(F) give the weighted features F', and the spatial attention weights M_s(F') give the weighted features F''. A DenseBlock module consists of multiple DenseLayer layers; each DenseLayer consists of two groups of a batch-normalization layer, a ReLU activation function, and a convolutional layer, and outputs k feature maps, so that the number of input features of the i-th DenseLayer is k_0 + k × (i - 1), where k_0 is the initial number of input features and i the layer index. The Transition module serves as a buffer layer for down-sampling and consists of a batch-normalization layer, a ReLU activation function, a 1 × 1 convolutional layer, and a 2 × 2 average-pooling layer.

The bidirectional long short-term memory network based on the text attention mechanism in step 2.2 specifically comprises: for the input data T_i, the BiLSTM extracts the text-sequence information to obtain the text features output; the text attention mechanism yields the attention weights α of output, and the product of output and α gives the weighted text features y. The text attention mechanism consists of two convolutional layers and a softmax classifier; output comprises the set of hidden-state vectors of every time step in the last layer, with l = seq_len and h = 2 × hidden_size.
The BiLSTM model in step 2.2 specifically comprises:
For the input data T_i, the BiLSTM extracts the text-sequence information to obtain the text features output; the specific computation is:
i_t = σ(W_ii x_t + b_ii + W_hi h_(t-1) + b_hi),  (3)
f_t = σ(W_if x_t + b_if + W_hf h_(t-1) + b_hf),  (4)
g_t = tanh(W_ig x_t + b_ig + W_hg h_(t-1) + b_hg),  (5)
o_t = σ(W_io x_t + b_io + W_ho h_(t-1) + b_ho),  (6)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t,  (7)
h_t = o_t ⊙ tanh(c_t),  (8)
where x_t is the vector of the input word, h_t the hidden-state vector at time t, c_t the cell-state vector at time t, and h_(t-1) the hidden-state (or initial-state) vector at time t - 1; the W and b are trainable parameters; i_t, f_t, g_t, and o_t denote the input, forget, cell, and output gates; σ is the sigmoid function, tanh the activation function, and ⊙ the Hadamard product. This yields the set of hidden-state vectors [h_0, h_1, …, h_(l-1)] with the same length as the sentence, where l = 484 is the sentence length.
The text attention mechanism module in the step 2.2 specifically comprises:
for text features obtained by BilsTMComputing attention weights for outputs using a textual attention mechanismObtaining text characteristics after weightingWherein the text attention mechanism is composed of two 1 × 1 convolutional layers and a softmax classifier.
The attention-based grouped cross-modal fusion module in step 3 specifically comprises:
In step 3.1, the grouped fusion of the image features and the text features is as follows:
For the image features x obtained by the DenseNet-CBAM model and the text features y extracted by the BiLSTM+Att model, the image features are divided into r groups {x'_0, …, x'_(r-1)}, each fused with the text features y; the specific fusion step is:
Zi=x′TWiy, (9)
where x'^T W_i y denotes the bilinear interaction of the two vectors; W_i is a connection matrix; Z_i is the output of the multi-modal factorized bilinear pooling. W_i is realized by two fully connected layers that map x' and y into the same dimensional space, followed by sum pooling (SumPool) with window size k over that dimension, giving Z_i; i indexes the fusion of the i-th group of image features with the text features y.
In step 3.2, for each group of fused features Z_i, a channel attention mechanism computes the weight of the feature map on each channel; the weighted features are recorded as Z'_i, as follows:
For the obtained fused features {Z_0, Z_1, …, Z_r}, the channel attention mechanism computes the per-channel weights as

M_c(Z_i) = σ(W_1(W_0(AvgPool(Z_i))) + W_1(W_0(MaxPool(Z_i)))),

where σ is the sigmoid function, AvgPool(Z_i) and MaxPool(Z_i) denote average and max pooling of the features Z_i, the weights W_0 and W_1 are learned during training, and multiplication by the weights yields the weighted fusion features Z'_i.
In step 3.3, each weighted fusion feature Z'_i is passed through a fully connected layer; the output vectors of the several fully connected layers are then combined by adding corresponding vector elements, and a sigmoid classifier computes the probability distribution P of the multi-modal document over each label:

P_i = Z'_i A^T + b,  (13)

where A^T and b are trainable parameters.
Advantageous effects
The invention provides a course-field multi-modal document classification method based on a cross-modal attention convolutional neural network. Multi-modal document data in the course field are first preprocessed. An attention mechanism is combined with a dense convolutional network into a cross-modal-attention convolutional neural network that constructs sparse image features more effectively. A bidirectional long short-term memory network built with an attention mechanism over the text features efficiently constructs the text features locally associated with the image semantics. An attention-based grouped cross-modal fusion is designed that learns the local associations between the images and text of a document more accurately and improves the accuracy of cross-modal feature fusion. Compared with existing multi-modal document classification models, the method performs better and improves the accuracy of multi-modal document classification on the same course-field data set.
Drawings
FIG. 1 is a diagram of a model of the process described in the examples of the invention.
FIG. 2 is a diagram of the input data construction of the method in an example of the invention.
FIG. 3 is a model diagram of image feature extraction according to an embodiment of the present invention.
FIG. 4 is a model diagram of text feature extraction according to an embodiment of the present invention.
FIG. 5 is a model diagram of the grouped cross-modal fusion described in the examples of the invention.
FIG. 6a compares the accuracy of image multi-label classification of the different models on the same data set.
FIG. 6b compares the loss of image multi-label classification of the different models on the same data set.
FIG. 6c compares the top-3 scores of image multi-label classification of the different models on the same data set.
FIG. 6d compares the top-5 scores of image multi-label classification of the different models on the same data set.
FIG. 7 is a sample diagram of a data set employed in an example of the present invention.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
The invention provides a course-field multi-modal document classification method based on a cross-modal attention convolutional neural network, which consists mainly of five modules: 1) preprocessing of the multi-modal document data; 2) attention-based dense-convolutional-network image-feature construction; 3) attention-based bidirectional long short-term memory network text-feature construction; 4) attention-based grouped cross-modal fusion; 5) multi-label classification of the multi-modal documents. The model diagram of the whole method is shown in fig. 1, described in detail as follows:
1. preprocessing of multimodal document data
An image-text mixed multi-modal document is represented by its d-th datum, consisting of image, text, and label parts. I_d identifies the image data available in the multi-modal document data list, from which a qualifying image R_d is extracted. The textual description is locally associated with the image; J_d is the index set in word2ix that points out the description text appearing in the d-th multi-modal document datum, where the n-th entry gives the index of the n-th word in word2ix. Likewise, K_d is the index set in label2ix of the words of the semantic-tag set L_d of the d-th multi-modal document datum, where the o-th entry gives the index of the o-th word in label2ix.
Figure 2 shows the multi-modal document data preprocessing model of the course-field multi-modal document classification method based on the cross-modal attention convolutional neural network: the images are randomly cropped and flipped; based on the lengths of the text descriptions over the whole data set, all texts are truncated or padded to the same length l; a word-vector model learns the vector representations of the words in the texts; this finally yields all the inputs to the network.
For the ith preprocessed multi-modal document, finally obtaining (R)i,Ti,Li):
(1) Randomly sampling from the text-text mixed multi-modal document data in the field of courses.
(2) For each image in the multimodal document:
Scale the image with the aspect ratio preserved so that the shortest side is 256; then randomly crop it to 224 × 224; apply a random horizontal flip; finally normalize the channel values to obtain R_i, where C = 3 and H = W = 224.
(3) For each text description in the multimodal document:
(a) count the lengths of all text descriptions and select the length l = 484; 92% of the texts are shorter than l;
(b) all data are cut off and filled to be the same length l;
(c) use word vectors (Word Embedding) to map word indices to vectors in the real-number domain and train their weights;
(d) the word indices, whose dimension equals the dictionary size, are embedded into a 256-dimensional continuous space to obtain T_i.
(4) For each set of tags in the multimodal document:
For a total of N classes, set up an N-dimensional vector and map the semantic tags of the corresponding document to a 0-1 vector through the tag dictionary, obtaining L_i.
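The truncate-and-pad operation of step (3)(b) above can be sketched as follows; `PAD_ID` and the sample index sequences are illustrative assumptions, not values from the patent.

```python
# Hedged sketch of step (3)(b): truncate or pad every word-index sequence
# to the fixed length l. PAD_ID is an assumed padding index.
PAD_ID = 0

def pad_or_truncate(ids, l):
    # keep at most l indices, then pad with PAD_ID up to length l
    return ids[:l] + [PAD_ID] * max(0, l - len(ids))

print(pad_or_truncate([5, 9, 3], 5))     # [5, 9, 3, 0, 0]
print(pad_or_truncate([1, 2, 3, 4], 2))  # [1, 2]
```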
2. Depth cross-modal feature construction based on attention mechanism
The attention-based deep cross-modal feature extraction comprises two parts: 1) attention-based dense-convolutional-network image-feature construction, and 2) attention-based bidirectional long short-term memory network text-feature construction. The attention mechanism is combined with a dense convolutional network to extract the image features, and a bidirectional long short-term memory network with an attention mechanism over the text features is proposed; this yields the weighted image features x and text features y. The model diagram of the whole network is shown in fig. 3:
2.1 attention-based dense convolutional neural network
For the processed image data R_i, first apply a convolutional layer with a 7 × 7 kernel and stride 2 and a max-pooling layer with a 3 × 3 kernel and stride 2; next a CBAM module, then alternating DenseBlock, CBAM, and Transition modules that extract the features of the sparse images; finally, average pooling with a 7 × 7 kernel reduces the data dimensionality, avoids overfitting, and yields the image features x.
For the image input of the multi-modal document, each layer of the network structure is set as in the following table:
table 1: DenseNet-CBAM model structure table
where k represents the growth rate and _layer the number of layers in a DenseBlock, i.e., the number of DenseLayers. The output of the i-th DenseLayer is recorded as H_i; each H_i generates k feature maps. The input x_i of the i-th DenseLayer, i ∈ _layer, is

x_i = H_i([x_0, x_1, …, x_(i-1)]),  (1)

so that the number of input features of each DenseLayer can be expressed as k_0 + k × (i - 1), with k_0 the initial number of input features. The Transition module serves as a buffer layer for down-sampling and consists of a batch-normalization layer, a ReLU activation function, a 1 × 1 convolutional layer, and a 2 × 2 average-pooling layer.
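Eq. (1) and the channel count k_0 + k × (i - 1) can be sketched in PyTorch as follows. This is a simplified single BN-ReLU-conv DenseLayer (the patent describes two such groups per layer), and all sizes are illustrative, not the patent's configuration.

```python
# Hedged sketch of Eq. (1): each DenseLayer H_i receives the concatenation of
# all earlier feature maps and emits k new ones (k = growth rate).
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_channels, k):
        super().__init__()
        # one BN -> ReLU -> Conv group (the patent uses two per DenseLayer)
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, k, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        return self.body(x)

class DenseBlock(nn.Module):
    def __init__(self, k0, k, num_layers):
        super().__init__()
        # i-th layer (0-indexed) sees k0 + k*i input channels
        self.layers = nn.ModuleList(
            DenseLayer(k0 + k * i, k) for i in range(num_layers))

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # Eq. (1)
        return torch.cat(feats, dim=1)

block = DenseBlock(k0=16, k=8, num_layers=4)
out = block(torch.randn(1, 16, 8, 8))
print(out.shape)  # torch.Size([1, 48, 8, 8]) = 16 + 4*8 channels
```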
A CBAM module is added between each DenseBlock and Transition layer; the CBAM module computes the weights of the feature maps: its channel submodule computes the weight of each feature map, and its spatial submodule yields the weight of each location within a feature map.
The CBAM module consists of a channel submodule and a spatial submodule, so that the attention weight maps are multiplied with the input feature map for adaptive feature refinement. Given an intermediate feature map F as input, CBAM successively computes a one-dimensional channel attention map M_c(F) and a two-dimensional spatial attention map M_s(F'); the whole attention mechanism is computed as follows.
For a given feature map F, the channel attention mechanism computes the weight of the feature map on each channel, recorded as

M_c(F) = σ(W_1(W_0(AvgPool(F))) + W_1(W_0(MaxPool(F)))),

where σ is the sigmoid function and AvgPool(F) and MaxPool(F) denote average and max pooling of the feature map F. For the weighted feature map F', the spatial attention mechanism is applied in turn, computing the weights over the spatial region, recorded as

M_s(F') = σ(f^(7×7)([AvgPool(F'); MaxPool(F')])),

where σ is the sigmoid function and f^(7×7) a convolution with kernel size 7 × 7; average pooling and max pooling reduce the channel dimension of F' to 1, the two maps are concatenated, and M_s(F') is obtained through the convolutional layer and the sigmoid function.
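The channel-then-spatial attention described above can be sketched in PyTorch as follows; the reduction ratio and tensor sizes are illustrative assumptions, not the patent's settings.

```python
# Hedged CBAM sketch: channel attention Mc from pooled descriptors through a
# shared MLP (W0, W1), then spatial attention Ms from a 7x7 convolution over
# the channel-pooled maps of the already channel-weighted features.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        # shared MLP (W0, W1) for channel attention
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        # 7x7 conv over concatenated [avg; max] channel-pooled maps
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f):
        b, c, _, _ = f.shape
        avg = self.mlp(f.mean(dim=(2, 3)))                 # MLP(AvgPool(F))
        mx = self.mlp(f.amax(dim=(2, 3)))                  # MLP(MaxPool(F))
        mc = torch.sigmoid(avg + mx).view(b, c, 1, 1)      # Mc(F)
        f = f * mc                                         # F'
        pooled = torch.cat([f.mean(dim=1, keepdim=True),
                            f.amax(dim=1, keepdim=True)], dim=1)
        ms = torch.sigmoid(self.conv(pooled))              # Ms(F')
        return f * ms                                      # F''

x = torch.randn(2, 8, 16, 16)
y = CBAM(8)(x)
print(y.shape)  # torch.Size([2, 8, 16, 16])
```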
2.2 two-way Long-short term memory network based on attention mechanism
For the input data T_i, the BiLSTM extracts the text-sequence information to obtain the text features output; the text attention mechanism yields the attention weights α of output, and the product of output and α gives the weighted text features y. The text attention mechanism consists of two convolutional layers and a softmax classifier; output comprises the set of hidden-state vectors of every time step in the last layer, with l = seq_len and h = 2 × hidden_size.
The BiLSTM (Bidirectional Long Short-Term Memory) combines a forward LSTM and a backward LSTM. For each element of the input sequence, each LSTM layer computes:
i_t = σ(W_ii x_t + b_ii + W_hi h_(t-1) + b_hi),  (6)
f_t = σ(W_if x_t + b_if + W_hf h_(t-1) + b_hf),  (7)
g_t = tanh(W_ig x_t + b_ig + W_hg h_(t-1) + b_hg),  (8)
o_t = σ(W_io x_t + b_io + W_ho h_(t-1) + b_ho),  (9)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t,  (10)
h_t = o_t ⊙ tanh(c_t),  (11)
where x_t is the vector of the input word, h_t the hidden-state vector at time t, c_t the cell-state vector at time t, and h_(t-1) the hidden-state (or initial-state) vector at time t - 1; the W and b are trainable parameters; i_t, f_t, g_t, and o_t denote the input, forget, cell, and output gates; σ is the sigmoid function, tanh the activation function, and ⊙ the Hadamard product. This yields the set of hidden-state vectors [h_0, h_1, …, h_(l-1)] with the same length as the sentence, where l = 484 is the sentence length.
The text features output obtained from the BiLSTM comprise the set of hidden-state vectors of every time step in the last layer. The text attention mechanism computes the attention weights α of output, giving the weighted text features y; the specific computation is:

u = W_w output + b_w,  (12)

The weights α_i of the text features are computed through a softmax function,

α_i = exp(u_i) / Σ_j exp(u_j),  (13)

and the weighted text features are recorded as

y = Σ_i α_i × output_i,  (14)
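Eqs. (12)-(14) can be sketched in PyTorch as follows: the BiLSTM outputs are scored by a small attention head (two 1 × 1 convolutions and a softmax, as the patent describes) and summed with the resulting weights. The hidden sizes and sequence length are illustrative.

```python
# Hedged sketch of Eqs. (12)-(14): attention-weighted sum of BiLSTM outputs.
import torch
import torch.nn as nn

class AttentiveBiLSTM(nn.Module):
    def __init__(self, embed_dim, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_size,
                            batch_first=True, bidirectional=True)
        h = 2 * hidden_size
        # two 1x1 convolutions acting along the sequence, then softmax
        self.att = nn.Sequential(
            nn.Conv1d(h, h // 2, kernel_size=1), nn.Tanh(),
            nn.Conv1d(h // 2, 1, kernel_size=1))

    def forward(self, x):                          # x: (batch, l, embed_dim)
        output, _ = self.lstm(x)                   # (batch, l, 2*hidden_size)
        scores = self.att(output.transpose(1, 2))  # (batch, 1, l), Eq. (12)
        alpha = torch.softmax(scores, dim=-1)      # Eq. (13)
        y = torch.bmm(alpha, output).squeeze(1)    # Eq. (14), weighted sum
        return y                                   # (batch, 2*hidden_size)

m = AttentiveBiLSTM(embed_dim=32, hidden_size=16)
y = m(torch.randn(4, 10, 32))
print(y.shape)  # torch.Size([4, 32])
```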
3. Grouped cross-modal fusion based on attention mechanism
For the image features x and text features y obtained from the attention-based deep cross-modal feature construction, the image features x are first divided into r groups; each group of image features x'_i is mapped together with the text features y into the same one-dimensional space, and feature fusion yields the fused features {Z_0, Z_1, …, Z_r}. For each group of fused features Z_i, a channel attention mechanism computes the weight of the feature map on each channel, giving the weighted features Z'_i. Each weighted fusion feature Z'_i passes through a fully connected layer; the output vectors of the several fully connected layers are combined by adding corresponding vector elements, and a sigmoid classifier computes the probability distribution P of the multi-modal document over each label. The detailed model diagram is shown in fig. 5.
3.1 Grouped fusion of image features and text features
For the image features x trained by DenseNet-CBAM and the text features y constructed by BiLSTM+Att, the image features are first divided into r groups {x'_0, …, x'_(r-1)}, each fused with the text features y; the specific calculation formula is:
Zi=x′TWiy, (15)
where x'^T W_i y denotes the bilinear interaction of the two vectors, W_i a connection matrix, and Z_i the output of the multi-modal factorized bilinear pooling. W_i is realized by two fully connected layers that map x' and y into the same dimensional space, followed by sum pooling (SumPool) with window size k over that dimension, giving Z_i; i indexes the fusion of the i-th group of image features with the text features y.
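The factorized bilinear fusion of Eq. (15) for one image-feature group can be sketched as follows: two fully connected projections, an elementwise product, and sum pooling with window k, as described above. All dimensions and the pooling window are illustrative assumptions.

```python
# Hedged sketch of Eq. (15): multi-modal factorized bilinear pooling of one
# image-feature group x' with the text feature y.
import torch
import torch.nn as nn

class MFBFusion(nn.Module):
    def __init__(self, img_dim, txt_dim, out_dim, k=5):
        super().__init__()
        self.k, self.out_dim = k, out_dim
        self.fc_img = nn.Linear(img_dim, out_dim * k)  # projects x'
        self.fc_txt = nn.Linear(txt_dim, out_dim * k)  # projects y

    def forward(self, x, y):
        joint = self.fc_img(x) * self.fc_txt(y)            # elementwise product
        z = joint.view(-1, self.out_dim, self.k).sum(2)    # SumPool, window k
        return z

fuse = MFBFusion(img_dim=64, txt_dim=32, out_dim=16)
z = fuse(torch.randn(3, 64), torch.randn(3, 32))
print(z.shape)  # torch.Size([3, 16])
```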
3.2 attention mechanism
For each group of fused features Zi, the weight Mc(Zi) of the feature map on each channel is calculated with a channel attention mechanism. The specific calculation process is as follows:

Mc(Zi) = σ(W1(W0(AvgPool(Zi))) + W1(W0(MaxPool(Zi)))), (16)

Zi′ = Mc(Zi) ⊗ Zi, (17)

where σ denotes the sigmoid function, AvgPool(Zi) and MaxPool(Zi) represent average pooling and maximum pooling of the features Zi, and the weights W0 and W1 are learned during training; the weighted fused features Zi′ are obtained through element-wise multiplication.
Finally, each weighted fused feature Zi′ passes through a fully connected layer; the output vectors of the several fully connected layers are combined by element-wise addition, and a sigmoid classifier then computes the probability distribution P of the multi-modal document over each label. The specific calculation process is as follows:
Pi = Zi′ Aᵀ + b, (19)

where Aᵀ and b are trainable parameters.
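The per-group classification and combination (Eq. (19) followed by element-wise addition and a sigmoid) might be sketched as follows; the number of groups r and the dimensions are assumptions:

```python
import torch
import torch.nn as nn

class GroupClassifierHead(nn.Module):
    """Sketch of the final stage: each weighted fused feature Zi'
    passes through its own fully connected layer (Eq. 19,
    P_i = Zi' A^T + b), the r group outputs are combined by
    element-wise addition, and a sigmoid yields per-label
    probabilities."""
    def __init__(self, feat_dim, num_labels, r):
        super().__init__()
        self.fcs = nn.ModuleList(
            [nn.Linear(feat_dim, num_labels) for _ in range(r)]
        )

    def forward(self, groups):
        # groups: list of r tensors, each (batch, feat_dim)
        logits = sum(fc(z) for fc, z in zip(self.fcs, groups))  # element-wise sum
        return torch.sigmoid(logits)  # independent probability per label
```

Using a sigmoid rather than a softmax lets several labels fire at once, which matches the multi-label setting of the patent.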
4. Multi-label classification experiments on course-field multi-modal document data
The data set we use consists of image-text mixed multi-modal document data from the course field; each multi-modal document comprises one image, one text description, and several semantic tags. The data set contains 871 multi-modal documents for training, and the number of tags is 41. The labels include those shown in Table 2:
table 2: set of labels in a data set
80% of the course-field multi-modal document data set is used for training and 20% for testing. We compared the classification performance of other pre-trained models, such as VGG16, ResNet34, DenseNet121, and BiLSTM, on the same data set. The whole model is built with the PyTorch deep learning framework and runs on a GPU; the CUDA version is 10.1.120.
4.1 loss function
We use a criterion based on maximum entropy that optimizes a multi-label one-versus-all loss between the final classification result x and the target result y. For each batch of data:

loss(x, y) = −(1/C) Σi [ y[i]·log σ(x[i]) + (1 − y[i])·log(1 − σ(x[i])) ],

where y[i] ∈ {0, 1} and C represents the total number of tags.
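This max-entropy multi-label one-versus-all criterion corresponds to PyTorch's `MultiLabelSoftMarginLoss`; a minimal sketch (the logit and target values are illustrative):

```python
import torch
import torch.nn as nn

# Max-entropy multi-label one-versus-all loss, as provided by
# nn.MultiLabelSoftMarginLoss:
#   loss(x, y) = -1/C * sum_i [ y[i]*log(sigmoid(x[i]))
#                               + (1 - y[i])*log(1 - sigmoid(x[i])) ]
criterion = nn.MultiLabelSoftMarginLoss()

logits = torch.tensor([[2.0, -1.0, 0.5]])  # raw scores for C = 3 labels
target = torch.tensor([[1.0, 0.0, 1.0]])   # multi-hot ground-truth tags
loss = criterion(logits, target)
```

Because the loss is applied to raw scores, the sigmoid is folded into the criterion and the network's final layer can output unbounded logits during training.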
4.2 evaluation and results
Table 3: and the model accuracy comparison table shows that the precision and recall ratio of labels generated by different models under the same data set are used as evaluation indexes. For each multimodal document data, Top-3 and Top-5 are taken as one of the evaluation criteria, the meaning of Top-k is: if the top-k labels with the highest probability contain all real labels, the prediction is considered to be correct. And calculating Hamming loss (Hamming loss) which represents the proportion of error samples in all the labels, wherein the smaller the value, the stronger the classification capability of the network. In the experiment, the mAP is mainly used as a main evaluation standard, and the intrinsic average AP (average precision) value of the mAP is the average accuracy.
Table 3: model accuracy comparison table
Claims (7)
1. A course-field multi-modal document classification method based on a cross-modal attention convolutional neural network, characterized by comprising the following steps:
step 1: preprocessing of multimodal document data
step 1.1: each multi-modal document comprises an image and a text description, with several semantic tags attached; a dictionary is constructed from the text descriptions and the semantic tag set in the documents; tags appearing fewer than 13 times are deleted, and a multi-modal document is deleted when its number of semantic tags becomes 0;
step 1.2: data preprocessing: image data are randomly cropped to 224 × 224 in height and width and randomly flipped horizontally; text descriptions are all truncated or padded to length l, and a word vector model is used to learn the vector representation of the words in the text;
step 2: depth cross-modal feature extraction based on attention mechanism
step 2.1: the representation of the image features is constructed with a dense convolutional neural network DenseNet based on the spatial and channel attention mechanism CBAM; the obtained image features are recorded as x, where m represents the number of feature maps of the image;
step 2.2: constructing text features by adopting a bidirectional long-short term memory network (BilSTM) and a text attention mechanism, wherein the text attention mechanism consists of two convolution layers and a softmax classifier; the calculated weight is recorded asThe text feature representation after weighting is recorded asn is 4 hidden _ size, and the hidden _ size is the characteristic dimension of the hidden state of the BilSTM;
step 3: grouped cross-modal fusion based on the attention mechanism
step 3.1: the image features x obtained in step 2 are divided into r groups; each group of image features is mapped, together with the text features y, into the same one-dimensional space, and multi-modal split bilinear pooling fusion yields the fused features {Z0, Z1, …, Zr};
step 3.2: for each group of fused features Zi, a channel attention mechanism computes the weight of the feature map on each channel, and the weighted features are recorded as Zi′;
step 3.3: each weighted fused feature Zi′ passes through a fully connected layer; the output vectors of the several fully connected layers are combined by element-wise addition, and a sigmoid classifier then computes the probability distribution P of the multi-modal document over each label; finally, the error between the predicted value P and the true value is computed with the maximum-entropy loss function, and a back-propagation algorithm trains the parameters of the attention-based deep cross-modal feature extraction of step 2 and the attention-based grouped cross-modal fusion of step 3.
2. The method as claimed in claim 1, characterized in that in step 1.2, based on the characteristics of the images and texts themselves, the course-field multi-modal document data are processed, and for the i-th preprocessed multi-modal document, (Ri, Ti, Li) is finally obtained:
(1) Randomly sampling from image-text mixed multi-mode document data in the field of courses;
(2) for each image of the multimodal document:
(a) the image is scaled with the aspect ratio unchanged so that the shortest side is 256; the picture is then randomly cropped to 224 × 224 in height and width and randomly flipped horizontally; finally, channel-value normalization is performed to obtain Ri, where C = 3 and H = W = 224;
(3) for each text description in the multimodal document:
(a) the lengths of all text descriptions are counted and the length l = 484 is selected, such that 92% of the text lengths are smaller than this value;
(b) all data are truncated or padded to the same length l;
(c) word vectors, also known as word embeddings, map word sequence numbers to vectors in the real-number domain, and their weights are trained;
(d) through word embedding, the word sequence numbers, whose dimension is as high as the dictionary size, are embedded into a 256-dimensional continuous space to obtain Ti;
(4) For each set of tags in the multimodal document:
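As an illustrative sketch (not part of the claims) of the preprocessing in steps (2)-(3) above — random 224 × 224 crop with horizontal flip, and truncation/padding of token sequences to the fixed length l — assuming tensor images and integer token ids:

```python
import random
import torch

def random_crop_flip(img, size=224):
    """Random size x size crop plus random horizontal flip, as in
    step 1.2. img: (C, H, W) tensor with H, W >= size."""
    c, h, w = img.shape
    top = random.randint(0, h - size)
    left = random.randint(0, w - size)
    patch = img[:, top:top + size, left:left + size]
    if random.random() < 0.5:
        patch = patch.flip(-1)  # flip along the width axis
    return patch

def pad_or_truncate(token_ids, l, pad_id=0):
    """Cut off or pad every token sequence to the fixed length l."""
    return token_ids[:l] + [pad_id] * max(0, l - len(token_ids))
```

In practice the image side would usually be handled by a torchvision transform pipeline (Resize, RandomCrop, RandomHorizontalFlip, Normalize); the sketch above only mirrors the operations named in the claim.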
3. The course-field multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 1, characterized in that the dense convolutional neural network DenseNet based on the spatial and channel attention mechanism CBAM described in step 2.1 constructs the representation of the image features, specifically: the processed image data Ri first pass through a convolutional layer with a 7 × 7 kernel and stride 2 and a maximum pooling layer with a 3 × 3 kernel and stride 2; a CBAM module is then applied, followed by alternating DenseBlock, CBAM, and Transition modules that extract features from the sparse images of the course field; finally, average pooling with a 7 × 7 kernel yields the image features x. The CBAM module consists of a channel sub-module and a spatial sub-module, so that the attention weight map is multiplied with the input feature map for adaptive feature refinement; the channel sub-module computes the weight of each feature map, and the spatial sub-module computes the weight of each location within the feature map. For an intermediate feature map F taken as input, CBAM successively computes a one-dimensional channel attention map Mc and a two-dimensional spatial attention map Ms. The whole attention mechanism is calculated as follows:
F′ = Mc(F) ⊗ F, (1)

F″ = Ms(F′) ⊗ F′, (2)

where ⊗ represents element-wise multiplication; the channel attention weight Mc(F) is computed to obtain the weighted features F′, and the spatial attention weight Ms(F′) is then computed to obtain the weighted features F″.
The DenseBlock module is composed of multiple DenseLayer layers, each consisting of two groups of a batch normalization layer, a ReLU activation function, and a convolutional layer; each DenseLayer outputs k feature vectors, and the number of input features of the i-th DenseLayer is k0 + k × (i − 1), where k0 represents the initial number of input features and i represents the layer index. The Transition module, equivalent to a buffer layer used for down-sampling, is composed of a batch normalization layer, a ReLU activation function, a 1 × 1 convolutional layer, and a 2 × 2 average pooling layer.
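An illustrative PyTorch sketch (not part of the claims) of the CBAM channel and spatial sub-modules described above; the reduction ratio and kernel size are assumptions:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of CBAM channel attention: average- and max-pooled
    channel descriptors pass through a shared MLP (W0, W1), are
    summed, and a sigmoid gives one weight per channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W0
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),  # W1
        )

    def forward(self, f):
        # f: (batch, C, H, W)
        avg = self.mlp(f.mean(dim=(2, 3)))  # AvgPool branch
        mx = self.mlp(f.amax(dim=(2, 3)))   # MaxPool branch
        return torch.sigmoid(avg + mx)[:, :, None, None]  # Mc(F)

class SpatialAttention(nn.Module):
    """Sketch of CBAM spatial attention: channel-wise average and max
    maps are concatenated and convolved into one 2-D attention map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f):
        avg = f.mean(dim=1, keepdim=True)
        mx = f.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Ms(F')

# Refinement: F' = Mc(F) * F, then F'' = Ms(F') * F'  (element-wise)
```

The channel map broadcasts over the spatial axes and the spatial map over the channel axis, so both refinements are cheap element-wise products.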
4. The course-field multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 1, characterized in that constructing the text features in step 2.2 with a bidirectional long short-term memory network BiLSTM and a text attention mechanism is specifically:
for the input data Ti, text sequence information is extracted with the BiLSTM to obtain the text features output; the text attention mechanism computes the text attention weights α of output, and the weighted text features y are obtained from output and α. The text attention mechanism consists of two convolutional layers and a softmax classifier, and output comprises the set of hidden state vectors at each moment in the last-layer sequence, where l = seq_len and h = 2 × hidden_size.
5. The method for course-field multi-modal document classification based on a cross-modal attention convolutional neural network as claimed in claim 4, wherein the BiLSTM in step 2.2 is specifically characterized in that:
for the input data Ti, text sequence information is extracted with the BiLSTM to obtain the text features output; the specific calculation formulas are as follows:
it = σ(Wii·xt + bii + Whi·h(t−1) + bhi), (3)

ft = σ(Wif·xt + bif + Whf·h(t−1) + bhf), (4)

gt = tanh(Wig·xt + big + Whg·h(t−1) + bhg), (5)

ot = σ(Wio·xt + bio + Who·h(t−1) + bho), (6)

ct = ft ⊙ c(t−1) + it ⊙ gt, (7)

ht = ot ⊙ tanh(ct), (8)
where xt represents the input word vector, ht the hidden state vector at time t, ct the cell state vector at time t, and h(t−1) the hidden state vector at time t − 1 (or the initial state vector); the Wi, bi, and bh are trainable parameters; it, ft, gt, ot respectively represent the input, forget, cell, and output gates; σ represents the sigmoid function and tanh the hyperbolic tangent activation function. A set of hidden state vectors [h0, h1, …, hl−1] with the same length as the sentence is thus obtained, where l = 484 represents the sentence length.
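An illustrative sketch (not part of the claims) of one time step of the LSTM cell equations (3)-(8), with parameters named as in the formulas:

```python
import torch

def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; `params` maps each gate name to its
    (W_input, b_input, W_hidden, b_hidden) tuple."""
    W_ii, b_ii, W_hi, b_hi = params["i"]  # input gate
    W_if, b_if, W_hf, b_hf = params["f"]  # forget gate
    W_ig, b_ig, W_hg, b_hg = params["g"]  # cell candidate
    W_io, b_io, W_ho, b_ho = params["o"]  # output gate
    i_t = torch.sigmoid(x_t @ W_ii.T + b_ii + h_prev @ W_hi.T + b_hi)  # Eq. (3)
    f_t = torch.sigmoid(x_t @ W_if.T + b_if + h_prev @ W_hf.T + b_hf)  # Eq. (4)
    g_t = torch.tanh(x_t @ W_ig.T + b_ig + h_prev @ W_hg.T + b_hg)     # Eq. (5)
    o_t = torch.sigmoid(x_t @ W_io.T + b_io + h_prev @ W_ho.T + b_ho)  # Eq. (6)
    c_t = f_t * c_prev + i_t * g_t                                     # Eq. (7)
    h_t = o_t * torch.tanh(c_t)                                        # Eq. (8)
    return h_t, c_t
```

Running the step forward over the sequence and backward over its reversal, then concatenating the two hidden states, gives the 2 × hidden_size output dimension referred to in claim 4.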
6. The method for lesson domain multimodal document classification based on cross-modal attention convolutional neural network as claimed in claim 4, wherein said text attention mechanism in step 2.2 is specifically characterized in that:
7. The method for course-field multi-modal document classification based on a cross-modal attention convolutional neural network as claimed in claim 1, wherein the attention-mechanism-based grouped cross-modal fusion in step 3 is specifically characterized in that:
in step 3.1, the grouped fusion of the image features and the text features is as follows:
for the image features x obtained from the attention-based deep cross-modal feature extraction of step 2 and the text features y, the image features are divided into r groups, each of which is fused with the text features y; the specific fusion steps are as follows:
where ∘ represents the outer product of two vectors; Wi is a connection matrix; Zi is the output of the multi-modal split bilinear pooling; Wi maps x′ and y into the same-dimensional space through two fully connected layers, and sum pooling with window size k is then performed over one dimension to obtain Zi; i denotes the fusion of the i-th group of image features with the text features y;
in step 3.2, for each group of fused features Zi, a channel attention mechanism computes the weight of the feature map on each channel, and the weighted features are recorded as Zi′; this is characterized as follows:
for the resulting fused features {Z0, Z1, …, Zr}, the weight of the feature map on each channel is calculated with a channel attention mechanism; the specific calculation process is as follows:
Mc(Zi) = σ(W1(W0(AvgPool(Zi))) + W1(W0(MaxPool(Zi)))),

Zi′ = Mc(Zi) ⊗ Zi,

where σ denotes the sigmoid function, AvgPool(Zi) and MaxPool(Zi) represent average pooling and maximum pooling of the features Zi, and the weights W0 and W1 are learned during training; the weighted fused features Zi′ are obtained through element-wise multiplication;
in step 3.3, each weighted fused feature Zi′ passes through a fully connected layer; the output vectors of the several fully connected layers are combined by element-wise addition, and a sigmoid classifier then computes the probability distribution P of the multi-modal document over all semantic tags:
Pi = Zi′ Aᵀ + b, (12)

where Aᵀ and b are trainable parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010791032.3A CN111985369B (en) | 2020-08-07 | 2020-08-07 | Course field multi-modal document classification method based on cross-modal attention convolution neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111985369A CN111985369A (en) | 2020-11-24 |
CN111985369B true CN111985369B (en) | 2021-09-17 |
Family
ID=73444539
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010791032.3A Active CN111985369B (en) | 2020-08-07 | 2020-08-07 | Course field multi-modal document classification method based on cross-modal attention convolution neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111985369B (en) |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112487187B (en) * | 2020-12-02 | 2022-06-10 | 杭州电子科技大学 | News text classification method based on graph network pooling |
CN112508077B (en) * | 2020-12-02 | 2023-01-03 | 齐鲁工业大学 | Social media emotion analysis method and system based on multi-modal feature fusion |
CN112507898B (en) * | 2020-12-14 | 2022-07-01 | 重庆邮电大学 | Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN |
CN112650886B (en) * | 2020-12-28 | 2022-08-02 | 电子科技大学 | Cross-modal video time retrieval method based on cross-modal dynamic convolution network |
CN112685565B (en) * | 2020-12-29 | 2023-07-21 | 平安科技(深圳)有限公司 | Text classification method based on multi-mode information fusion and related equipment thereof |
CN112686345B (en) * | 2020-12-31 | 2024-03-15 | 江南大学 | Offline English handwriting recognition method based on attention mechanism |
CN112863081A (en) * | 2021-01-04 | 2021-05-28 | 西安建筑科技大学 | Device and method for automatic weighing, classifying and settling vegetables and fruits |
CN112819052B (en) * | 2021-01-25 | 2021-12-24 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-modal fine-grained mixing method, system, device and storage medium |
CN112817604B (en) * | 2021-02-18 | 2022-08-05 | 北京邮电大学 | Android system control intention identification method and device, electronic equipment and storage medium |
US20230145535A1 (en) * | 2021-03-01 | 2023-05-11 | Nvidia Corporation | Neural network training technique |
CN112925935B (en) * | 2021-04-13 | 2022-05-06 | 电子科技大学 | Image menu retrieval method based on intra-modality and inter-modality mixed fusion |
CN113052159A (en) * | 2021-04-14 | 2021-06-29 | 中国移动通信集团陕西有限公司 | Image identification method, device, equipment and computer storage medium |
CN113140023B (en) * | 2021-04-29 | 2023-09-15 | 南京邮电大学 | Text-to-image generation method and system based on spatial attention |
CN113221882B (en) * | 2021-05-11 | 2022-12-09 | 西安交通大学 | Image text aggregation method and system for curriculum field |
CN113342933B (en) * | 2021-05-31 | 2022-11-08 | 淮阴工学院 | Multi-feature interactive network recruitment text classification method similar to double-tower model |
CN113221181B (en) * | 2021-06-09 | 2022-08-09 | 上海交通大学 | Table type information extraction system and method with privacy protection function |
CN113255821B (en) * | 2021-06-15 | 2021-10-29 | 中国人民解放军国防科技大学 | Attention-based image recognition method, attention-based image recognition system, electronic device and storage medium |
CN113378989B (en) * | 2021-07-06 | 2022-05-17 | 武汉大学 | Multi-mode data fusion method based on compound cooperative structure characteristic recombination network |
CN113469094B (en) * | 2021-07-13 | 2023-12-26 | 上海中科辰新卫星技术有限公司 | Surface coverage classification method based on multi-mode remote sensing data depth fusion |
CN113792617B (en) * | 2021-08-26 | 2023-04-18 | 电子科技大学 | Image interpretation method combining image information and text information |
CN113807340B (en) * | 2021-09-07 | 2024-03-15 | 南京信息工程大学 | Attention mechanism-based irregular natural scene text recognition method |
CN113806564B (en) * | 2021-09-22 | 2024-05-10 | 齐鲁工业大学 | Multi-mode informative text detection method and system |
CN115858826A (en) * | 2021-09-22 | 2023-03-28 | 腾讯科技(深圳)有限公司 | Data processing method and device, computer equipment and storage medium |
CN113961710B (en) * | 2021-12-21 | 2022-03-08 | 北京邮电大学 | Fine-grained thesis classification method and device based on multi-mode layered fusion network |
GB2616316A (en) * | 2022-02-28 | 2023-09-06 | Nvidia Corp | Neural network training technique |
CN116704537B (en) * | 2022-12-02 | 2023-11-03 | 大连理工大学 | Lightweight pharmacopoeia picture and text extraction method |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107066583A (en) * | 2017-04-14 | 2017-08-18 | 华侨大学 | A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity |
CN108595632A (en) * | 2018-04-24 | 2018-09-28 | 福州大学 | A kind of hybrid neural networks file classification method of fusion abstract and body feature |
CN109740148A (en) * | 2018-12-16 | 2019-05-10 | 北京工业大学 | A kind of text emotion analysis method of BiLSTM combination Attention mechanism |
CN110019812A (en) * | 2018-02-27 | 2019-07-16 | 中国科学院计算技术研究所 | A kind of user is from production content detection algorithm and system |
CN110209789A (en) * | 2019-05-29 | 2019-09-06 | 山东大学 | A kind of multi-modal dialog system and method for user's attention guidance |
WO2019204186A1 (en) * | 2018-04-18 | 2019-10-24 | Sony Interactive Entertainment Inc. | Integrated understanding of user characteristics by multimodal processing |
CN111079444A (en) * | 2019-12-25 | 2020-04-28 | 北京中科研究院 | Network rumor detection method based on multi-modal relationship |
CN111325155A (en) * | 2020-02-21 | 2020-06-23 | 重庆邮电大学 | Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy |
CN111461174A (en) * | 2020-03-06 | 2020-07-28 | 西北大学 | Multi-mode label recommendation model construction method and device based on multi-level attention mechanism |
Non-Patent Citations (2)
Title |
---|
Image sentiment analysis using latent correlations among visual, textual, and sentiment views;Marie Katsurai et al;《2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)》;20160325;第2837-2841页 * |
基于层次化深度关联融合网络的社交媒体情感分类;蔡国永等;《计算机研究与发展》;20190615;第1312-1324页 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||