CN111985369B - Course field multi-modal document classification method based on cross-modal attention convolution neural network - Google Patents
- Publication number: CN111985369B
- Application number: CN202010791032.3A
- Authority
- CN
- China
- Prior art keywords: text, modal, attention, image, features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V30/413—Classification of content, e.g. text, photographs or tables
- G06F18/2415—Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/253—Fusion techniques of extracted features
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
Abstract
The invention relates to a course-field multi-modal document classification method based on a cross-modal attention convolutional neural network. Multi-modal document data in the course field are first preprocessed. An attention mechanism is combined with a dense convolutional network into a cross-modal-attention convolutional neural network that constructs sparse image features more effectively. A bidirectional long short-term memory network built with an attention mechanism over the text features efficiently constructs the text features locally associated with the image semantics. An attention-based grouped cross-modal fusion is designed that learns the local associations between the images and text of a document more accurately and improves the accuracy of cross-modal feature fusion. Compared with existing multi-modal document classification models, the method performs better and improves the accuracy of multi-modal document classification on the same course-field data set.
Description
Technical Field
The invention belongs to the fields of computer applications, multi-modal data classification, education data classification, image processing, and text processing, and in particular relates to a course-field multi-modal document classification method based on a cross-modal attention convolutional neural network.
Background
With the development of science and technology, the data that computers must process in many fields have evolved from single images into multi-modal data such as images, text, and audio, richer in both form and content. Multi-modal document classification has applications in video classification, visual question answering, entity matching in social networks, and more. Its accuracy depends on whether the computer can correctly understand the semantics and content of the images and text contained in a document. However, the images in image-text mixed multi-modal documents in the course field generally consist of lines and characters and are highly sparse in visual features such as color and texture, while the text in such documents is only locally associated with the semantics of the images. Existing multi-modal document classification models therefore struggle to construct accurate semantic feature vectors for the images and text in a document, which lowers the accuracy of multi-modal document feature representation and hinders classification performance.
To solve these problems, the invention extends the model architecture and proposes a course-field multi-modal document classification method based on a cross-modal attention convolutional neural network. The method extracts the sparse image features of the course field well, efficiently constructs text features associated with local fine-grained image semantics, and more accurately learns the associations between image and text features of a specific object, thereby improving multi-modal document classification performance.
Disclosure of Invention
Technical problem to be solved
The visual image features in image-text mixed multi-modal document data in the course field are sparse, and only local semantic associations exist between texts and images, so existing multi-modal document classification models struggle to accurately understand the semantics and content of the texts and images in a document, which greatly degrades classification performance. To solve these problems, the invention provides a course-field multi-modal document classification method based on a cross-modal attention convolutional neural network, which learns the semantic features of sparse course-field images more efficiently, better captures the local fine-grained semantic associations between images and text in a multi-modal document, represents the multi-modal document features accurately, and improves the performance of course-field multi-modal document classification.
Technical scheme
A course-field multi-modal document classification method based on a cross-modal attention convolutional neural network, characterized by comprising the following steps:
step 1: preprocessing of multimodal document data
Step 1.1: each multi-modal document comprises an image and a text description and carries several semantic tags; build a dictionary from the text descriptions and the document label set; delete tags that occur fewer than 13 times, and delete a multi-modal document when its number of semantic tags drops to 0;
Step 1.2: data preprocessing: randomly crop the image data to 224 × 224 and apply random horizontal flipping; truncate or pad all text descriptions to length l, and learn the vector representations of the words in the texts with a word-vector model;
step 2: depth cross-modal feature extraction based on attention mechanism
Step 2.1: construct the representation of the image features with a dense convolutional neural network (DenseNet) equipped with the spatial and channel attention module CBAM; the obtained image features are recorded as x, where m denotes the number of image feature maps;
Step 2.2: construct the text features with a bidirectional long short-term memory network (BiLSTM) and a text attention mechanism, where the text attention mechanism consists of two convolutional layers and a softmax classifier; the computed weights are recorded as α and the weighted text-feature representation as y, with dimension n = 4 × hidden_size, where hidden_size is the feature dimension of the BiLSTM hidden state;
Step 3: grouped cross-modal fusion based on the attention mechanism
Step 3.1: divide the image features x obtained in step 2 into r groups; map each group of image features x'_i together with the text features y into the same one-dimensional space, and obtain the fused features {Z_0, Z_1, …, Z_r} by multi-modal factorized bilinear pooling fusion;
Step 3.2: for each group of fused features Z_i, compute the weight of the feature map on each channel with a channel attention mechanism; the weighted features are recorded as Z'_i;
Step 3.3: pass each weighted fusion feature Z'_i through a fully connected layer; combine the output vectors of the several fully connected layers by adding corresponding vector elements, then compute the probability distribution P of the multi-modal document over each label with a sigmoid classifier; finally, compute the error between the prediction P and the ground truth with maximum entropy as the loss function and train the model parameters with the back-propagation algorithm.
In step 1.2, the course-field multi-modal document data are processed according to the characteristics of the images and texts; for the i-th preprocessed multi-modal document, this finally yields (R_i, T_i, L_i):
(1) Randomly sampling from image-text mixed multi-mode document data in the field of courses;
(2) for each image of the multimodal document:
(a) scale the image with the aspect ratio preserved so that the shortest side is 256; then randomly crop it to 224 × 224; apply a random horizontal flip; finally normalize the channel values to obtain R_i, where C = 3 and H = W = 224;
(3) for each text description in the multimodal document:
(a) count the lengths of all text descriptions and select the length l = 484; 92% of the texts are shorter than this length;
(b) truncate or pad all data to the same length l;
(c) use word vectors, also known as word embeddings, to map word indices to vectors in the real-number domain and train their weights;
(d) the word indices, whose dimension equals the dictionary size, are embedded into a 256-dimensional continuous space to obtain T_i;
(4) For each set of tags in the multimodal document:
(a) for a total of N classes, set up an N-dimensional vector and map the semantic tags of the corresponding document to a 0-1 vector through the tag dictionary, obtaining L_i.
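The 0-1 label mapping of step (4)(a) can be sketched as follows; `label2ix` and the tag names below are made-up illustrations, not taken from the patent's actual tag dictionary.

```python
# Hedged sketch of step 1 (4)(a): map a document's semantic tags to an
# N-dimensional 0-1 (multi-hot) vector through a tag dictionary.
def encode_labels(tags, label2ix, num_classes):
    vec = [0.0] * num_classes          # N-dimensional vector of zeros
    for tag in tags:
        vec[label2ix[tag]] = 1.0       # set a 1 at each tag's index
    return vec

# illustrative tag dictionary
label2ix = {"algebra": 0, "geometry": 1, "calculus": 2}
L_i = encode_labels(["algebra", "calculus"], label2ix, num_classes=3)
print(L_i)  # [1.0, 0.0, 1.0]
```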
The dense convolutional neural network based on the spatial and channel attention mechanism described in step 2.1 constructs the representation of the image features as follows. The processed image data R_i first pass a convolutional layer with a 7 × 7 kernel and stride 2 and a max-pooling layer with a 3 × 3 kernel and stride 2; next comes a CBAM module, followed by alternating DenseBlock, CBAM, and Transition modules that extract the features of the sparse course-field images; finally, average pooling with a 7 × 7 kernel yields the image features x. The CBAM module consists of a channel submodule and a spatial submodule, so that the attention weight maps are multiplied with the input feature map for adaptive feature refinement; the channel submodule computes the weight of each feature map, and the spatial submodule yields the weight of each location within a feature map. Given an intermediate feature map F as input, CBAM successively computes a one-dimensional channel attention map M_c(F) and a two-dimensional spatial attention map M_s(F'); the whole attention mechanism is computed as

F' = M_c(F) ⊗ F,  F'' = M_s(F') ⊗ F',

where ⊗ denotes elementwise multiplication. The channel attention weights M_c(F) give the weighted features F', and the spatial attention weights M_s(F') give the weighted features F''. A DenseBlock module consists of multiple DenseLayer layers; each DenseLayer consists of two groups of a batch-normalization layer, a ReLU activation function, and a convolutional layer, and outputs k feature maps, so that the number of input features of the i-th DenseLayer is k_0 + k × (i - 1), where k_0 is the initial number of input features and i the layer index. The Transition module serves as a buffer layer for down-sampling and consists of a batch-normalization layer, a ReLU activation function, a 1 × 1 convolutional layer, and a 2 × 2 average-pooling layer.

The bidirectional long short-term memory network based on the text attention mechanism in step 2.2 specifically comprises: for the input data T_i, the BiLSTM extracts the text-sequence information to obtain the text features output; the text attention mechanism yields the attention weights α of output, and the product of output and α gives the weighted text features y. The text attention mechanism consists of two convolutional layers and a softmax classifier; output comprises the set of hidden-state vectors of every time step in the last layer, with l = seq_len and h = 2 × hidden_size.
The BiLSTM model in step 2.2 specifically comprises:
For the input data T_i, the BiLSTM extracts the text-sequence information to obtain the text features output; the specific computation is:
i_t = σ(W_ii x_t + b_ii + W_hi h_(t-1) + b_hi),  (3)
f_t = σ(W_if x_t + b_if + W_hf h_(t-1) + b_hf),  (4)
g_t = tanh(W_ig x_t + b_ig + W_hg h_(t-1) + b_hg),  (5)
o_t = σ(W_io x_t + b_io + W_ho h_(t-1) + b_ho),  (6)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t,  (7)
h_t = o_t ⊙ tanh(c_t),  (8)
where x_t is the vector of the input word, h_t the hidden-state vector at time t, c_t the cell-state vector at time t, and h_(t-1) the hidden-state (or initial-state) vector at time t - 1; the W and b are trainable parameters; i_t, f_t, g_t, and o_t denote the input, forget, cell, and output gates; σ is the sigmoid function, tanh the activation function, and ⊙ the Hadamard product. This yields the set of hidden-state vectors [h_0, h_1, …, h_(l-1)] with the same length as the sentence, where l = 484 is the sentence length.
The text attention mechanism module in the step 2.2 specifically comprises:
for text features obtained by BilsTMComputing attention weights for outputs using a textual attention mechanismObtaining text characteristics after weightingWherein the text attention mechanism is composed of two 1 × 1 convolutional layers and a softmax classifier.
The attention-based grouped cross-modal fusion module in step 3 specifically comprises:
In step 3.1, the grouped fusion of the image features and the text features is as follows:
For the image features x obtained by the DenseNet-CBAM model and the text features y extracted by the BiLSTM+Att model, the image features are divided into r groups {x'_0, …, x'_(r-1)}, each fused with the text features y; the specific fusion step is:
Zi=x′TWiy, (9)
where x'^T W_i y denotes the bilinear interaction of the two vectors; W_i is a connection matrix; Z_i is the output of the multi-modal factorized bilinear pooling. W_i is realized by two fully connected layers that map x' and y into the same dimensional space, followed by sum pooling (SumPool) with window size k over that dimension, giving Z_i; i indexes the fusion of the i-th group of image features with the text features y.
In step 3.2, for each group of fused features Z_i, a channel attention mechanism computes the weight of the feature map on each channel; the weighted features are recorded as Z'_i, as follows:
For the obtained fused features {Z_0, Z_1, …, Z_r}, the channel attention mechanism computes the per-channel weights as

M_c(Z_i) = σ(W_1(W_0(AvgPool(Z_i))) + W_1(W_0(MaxPool(Z_i)))),

where σ is the sigmoid function, AvgPool(Z_i) and MaxPool(Z_i) denote average and max pooling of the features Z_i, the weights W_0 and W_1 are learned during training, and multiplication by the weights yields the weighted fusion features Z'_i.
In step 3.3, each weighted fusion feature Z'_i is passed through a fully connected layer; the output vectors of the several fully connected layers are then combined by adding corresponding vector elements, and a sigmoid classifier computes the probability distribution P of the multi-modal document over each label:

P_i = Z'_i A^T + b,  (13)

where A^T and b are trainable parameters.
Advantageous effects
The invention provides a course-field multi-modal document classification method based on a cross-modal attention convolutional neural network. Multi-modal document data in the course field are first preprocessed. An attention mechanism is combined with a dense convolutional network into a cross-modal-attention convolutional neural network that constructs sparse image features more effectively. A bidirectional long short-term memory network built with an attention mechanism over the text features efficiently constructs the text features locally associated with the image semantics. An attention-based grouped cross-modal fusion is designed that learns the local associations between the images and text of a document more accurately and improves the accuracy of cross-modal feature fusion. Compared with existing multi-modal document classification models, the method performs better and improves the accuracy of multi-modal document classification on the same course-field data set.
Drawings
FIG. 1 is a diagram of a model of the process described in the examples of the invention.
FIG. 2 is a diagram of the input data construction of the method in an example of the invention.
FIG. 3 is a model diagram of image feature extraction according to an embodiment of the present invention.
FIG. 4 is a model diagram of text feature extraction according to an embodiment of the present invention.
FIG. 5 is a model diagram of the grouped cross-modal fusion described in the examples of the invention.
FIG. 6a compares the accuracy of image multi-label classification of the different models on the same data set.
FIG. 6b compares the loss of image multi-label classification of the different models on the same data set.
FIG. 6c compares the top-3 scores of image multi-label classification of the different models on the same data set.
FIG. 6d compares the top-5 scores of image multi-label classification of the different models on the same data set.
FIG. 7 is a sample diagram of a data set employed in an example of the present invention.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
The invention provides a course-field multi-modal document classification method based on a cross-modal attention convolutional neural network, which consists mainly of five modules: 1) preprocessing of the multi-modal document data; 2) attention-based dense-convolutional-network image-feature construction; 3) attention-based bidirectional long short-term memory network text-feature construction; 4) attention-based grouped cross-modal fusion; 5) multi-label classification of the multi-modal documents. The model diagram of the whole method is shown in fig. 1, described in detail as follows:
1. preprocessing of multimodal document data
An image-text mixed multi-modal document is represented by its d-th datum, consisting of image, text, and label parts. I_d identifies the image data available in the multi-modal document data list, from which a qualifying image R_d is extracted. The textual description is locally associated with the image; J_d is the index set in word2ix that points out the description text appearing in the d-th multi-modal document datum, where the n-th entry gives the index of the n-th word in word2ix. Likewise, K_d is the index set in label2ix of the words of the semantic-tag set L_d of the d-th multi-modal document datum, where the o-th entry gives the index of the o-th word in label2ix.
Figure 2 shows the multi-modal document data preprocessing model of the course-field multi-modal document classification method based on the cross-modal attention convolutional neural network: the images are randomly cropped and flipped; based on the lengths of the text descriptions over the whole data set, all texts are truncated or padded to the same length l; a word-vector model learns the vector representations of the words in the texts; this finally yields all the inputs to the network.
For the ith preprocessed multi-modal document, finally obtaining (R)i,Ti,Li):
(1) Randomly sampling from the text-text mixed multi-modal document data in the field of courses.
(2) For each image in the multimodal document:
Scale the image with the aspect ratio preserved so that the shortest side is 256; then randomly crop it to 224 × 224; apply a random horizontal flip; finally normalize the channel values to obtain R_i, where C = 3 and H = W = 224.
(3) For each text description in the multimodal document:
(a) count the lengths of all text descriptions and select the length l = 484; 92% of the texts are shorter than l;
(b) all data are cut off and filled to be the same length l;
(c) use word vectors (Word Embedding) to map word indices to vectors in the real-number domain and train their weights;
(d) the word indices, whose dimension equals the dictionary size, are embedded into a 256-dimensional continuous space to obtain T_i.
(4) For each set of tags in the multimodal document:
For a total of N classes, set up an N-dimensional vector and map the semantic tags of the corresponding document to a 0-1 vector through the tag dictionary, obtaining L_i.
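The truncate-and-pad operation of step (3)(b) above can be sketched as follows; `PAD_ID` and the sample index sequences are illustrative assumptions, not values from the patent.

```python
# Hedged sketch of step (3)(b): truncate or pad every word-index sequence
# to the fixed length l. PAD_ID is an assumed padding index.
PAD_ID = 0

def pad_or_truncate(ids, l):
    # keep at most l indices, then pad with PAD_ID up to length l
    return ids[:l] + [PAD_ID] * max(0, l - len(ids))

print(pad_or_truncate([5, 9, 3], 5))     # [5, 9, 3, 0, 0]
print(pad_or_truncate([1, 2, 3, 4], 2))  # [1, 2]
```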
2. Depth cross-modal feature construction based on attention mechanism
The attention-based deep cross-modal feature extraction comprises two parts: 1) attention-based dense-convolutional-network image-feature construction, and 2) attention-based bidirectional long short-term memory network text-feature construction. The attention mechanism is combined with a dense convolutional network to extract the image features, and a bidirectional long short-term memory network with an attention mechanism over the text features is proposed; this yields the weighted image features x and text features y. The model diagram of the whole network is shown in fig. 3:
2.1 attention-based dense convolutional neural network
For the processed image data R_i, first apply a convolutional layer with a 7 × 7 kernel and stride 2 and a max-pooling layer with a 3 × 3 kernel and stride 2; next a CBAM module, then alternating DenseBlock, CBAM, and Transition modules that extract the features of the sparse images; finally, average pooling with a 7 × 7 kernel reduces the data dimensionality, avoids overfitting, and yields the image features x.
For the image input of the multi-modal document, each layer of the network structure is set as in the following table:
table 1: DenseNet-CBAM model structure table
where k represents the growth rate and _layer the number of layers in a DenseBlock, i.e., the number of DenseLayers. The output of the i-th DenseLayer is recorded as H_i; each H_i generates k feature maps. The input x_i of the i-th DenseLayer, i ∈ _layer, is

x_i = H_i([x_0, x_1, …, x_(i-1)]),  (1)

so that the number of input features of each DenseLayer can be expressed as k_0 + k × (i - 1), with k_0 the initial number of input features. The Transition module serves as a buffer layer for down-sampling and consists of a batch-normalization layer, a ReLU activation function, a 1 × 1 convolutional layer, and a 2 × 2 average-pooling layer.
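Eq. (1) and the channel count k_0 + k × (i - 1) can be sketched in PyTorch as follows. This is a simplified single BN-ReLU-conv DenseLayer (the patent describes two such groups per layer), and all sizes are illustrative, not the patent's configuration.

```python
# Hedged sketch of Eq. (1): each DenseLayer H_i receives the concatenation of
# all earlier feature maps and emits k new ones (k = growth rate).
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_channels, k):
        super().__init__()
        # one BN -> ReLU -> Conv group (the patent uses two per DenseLayer)
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, k, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        return self.body(x)

class DenseBlock(nn.Module):
    def __init__(self, k0, k, num_layers):
        super().__init__()
        # i-th layer (0-indexed) sees k0 + k*i input channels
        self.layers = nn.ModuleList(
            DenseLayer(k0 + k * i, k) for i in range(num_layers))

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # Eq. (1)
        return torch.cat(feats, dim=1)

block = DenseBlock(k0=16, k=8, num_layers=4)
out = block(torch.randn(1, 16, 8, 8))
print(out.shape)  # torch.Size([1, 48, 8, 8]) = 16 + 4*8 channels
```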
A CBAM module is added between each DenseBlock and Transition layer; the CBAM module computes the weights of the feature maps: its channel submodule computes the weight of each feature map, and its spatial submodule yields the weight of each location within a feature map.
The CBAM module consists of a channel submodule and a spatial submodule, so that the attention weight maps are multiplied with the input feature map for adaptive feature refinement. Given an intermediate feature map F as input, CBAM successively computes a one-dimensional channel attention map M_c(F) and a two-dimensional spatial attention map M_s(F'); the whole attention mechanism is computed as follows.
For a given feature map F, the channel attention mechanism computes the weight of the feature map on each channel, recorded as

M_c(F) = σ(W_1(W_0(AvgPool(F))) + W_1(W_0(MaxPool(F)))),

where σ is the sigmoid function and AvgPool(F) and MaxPool(F) denote average and max pooling of the feature map F. For the weighted feature map F', the spatial attention mechanism is applied in turn, computing the weights over the spatial region, recorded as

M_s(F') = σ(f^(7×7)([AvgPool(F'); MaxPool(F')])),

where σ is the sigmoid function and f^(7×7) a convolution with kernel size 7 × 7; average pooling and max pooling reduce the channel dimension of F' to 1, the two maps are concatenated, and M_s(F') is obtained through the convolutional layer and the sigmoid function.
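The channel-then-spatial attention described above can be sketched in PyTorch as follows; the reduction ratio and tensor sizes are illustrative assumptions, not the patent's settings.

```python
# Hedged CBAM sketch: channel attention Mc from pooled descriptors through a
# shared MLP (W0, W1), then spatial attention Ms from a 7x7 convolution over
# the channel-pooled maps of the already channel-weighted features.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        # shared MLP (W0, W1) for channel attention
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        # 7x7 conv over concatenated [avg; max] channel-pooled maps
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f):
        b, c, _, _ = f.shape
        avg = self.mlp(f.mean(dim=(2, 3)))                 # MLP(AvgPool(F))
        mx = self.mlp(f.amax(dim=(2, 3)))                  # MLP(MaxPool(F))
        mc = torch.sigmoid(avg + mx).view(b, c, 1, 1)      # Mc(F)
        f = f * mc                                         # F'
        pooled = torch.cat([f.mean(dim=1, keepdim=True),
                            f.amax(dim=1, keepdim=True)], dim=1)
        ms = torch.sigmoid(self.conv(pooled))              # Ms(F')
        return f * ms                                      # F''

x = torch.randn(2, 8, 16, 16)
y = CBAM(8)(x)
print(y.shape)  # torch.Size([2, 8, 16, 16])
```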
2.2 two-way Long-short term memory network based on attention mechanism
For the input data T_i, the BiLSTM extracts the text-sequence information to obtain the text features output; the text attention mechanism yields the attention weights α of output, and the product of output and α gives the weighted text features y. The text attention mechanism consists of two convolutional layers and a softmax classifier; output comprises the set of hidden-state vectors of every time step in the last layer, with l = seq_len and h = 2 × hidden_size.
The BiLSTM (Bidirectional Long Short-Term Memory) combines a forward LSTM and a backward LSTM. For each element of the input sequence, each LSTM layer computes:
i_t = σ(W_ii x_t + b_ii + W_hi h_(t-1) + b_hi),  (6)
f_t = σ(W_if x_t + b_if + W_hf h_(t-1) + b_hf),  (7)
g_t = tanh(W_ig x_t + b_ig + W_hg h_(t-1) + b_hg),  (8)
o_t = σ(W_io x_t + b_io + W_ho h_(t-1) + b_ho),  (9)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t,  (10)
h_t = o_t ⊙ tanh(c_t),  (11)
where x_t is the vector of the input word, h_t the hidden-state vector at time t, c_t the cell-state vector at time t, and h_(t-1) the hidden-state (or initial-state) vector at time t - 1; the W and b are trainable parameters; i_t, f_t, g_t, and o_t denote the input, forget, cell, and output gates; σ is the sigmoid function, tanh the activation function, and ⊙ the Hadamard product. This yields the set of hidden-state vectors [h_0, h_1, …, h_(l-1)] with the same length as the sentence, where l = 484 is the sentence length.
The text features output obtained from the BiLSTM comprise the set of hidden-state vectors of every time step in the last layer. The text attention mechanism computes the attention weights α of output, giving the weighted text features y; the specific computation is:

u = W_w output + b_w,  (12)

The weights α_i of the text features are computed through a softmax function,

α_i = exp(u_i) / Σ_j exp(u_j),  (13)

and the weighted text features are recorded as

y = Σ_i α_i × output_i,  (14)
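Eqs. (12)-(14) can be sketched in PyTorch as follows: the BiLSTM outputs are scored by a small attention head (two 1 × 1 convolutions and a softmax, as the patent describes) and summed with the resulting weights. The hidden sizes and sequence length are illustrative.

```python
# Hedged sketch of Eqs. (12)-(14): attention-weighted sum of BiLSTM outputs.
import torch
import torch.nn as nn

class AttentiveBiLSTM(nn.Module):
    def __init__(self, embed_dim, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_size,
                            batch_first=True, bidirectional=True)
        h = 2 * hidden_size
        # two 1x1 convolutions acting along the sequence, then softmax
        self.att = nn.Sequential(
            nn.Conv1d(h, h // 2, kernel_size=1), nn.Tanh(),
            nn.Conv1d(h // 2, 1, kernel_size=1))

    def forward(self, x):                          # x: (batch, l, embed_dim)
        output, _ = self.lstm(x)                   # (batch, l, 2*hidden_size)
        scores = self.att(output.transpose(1, 2))  # (batch, 1, l), Eq. (12)
        alpha = torch.softmax(scores, dim=-1)      # Eq. (13)
        y = torch.bmm(alpha, output).squeeze(1)    # Eq. (14), weighted sum
        return y                                   # (batch, 2*hidden_size)

m = AttentiveBiLSTM(embed_dim=32, hidden_size=16)
y = m(torch.randn(4, 10, 32))
print(y.shape)  # torch.Size([4, 32])
```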
3. Grouped cross-modal fusion based on attention mechanism
For the image features x and text features y obtained from the attention-based deep cross-modal feature construction, the image features x are first divided into r groups; each group of image features x'_i is mapped together with the text features y into the same one-dimensional space, and feature fusion yields the fused features {Z_0, Z_1, …, Z_r}. For each group of fused features Z_i, a channel attention mechanism computes the weight of the feature map on each channel, giving the weighted features Z'_i. Each weighted fusion feature Z'_i passes through a fully connected layer; the output vectors of the several fully connected layers are combined by adding corresponding vector elements, and a sigmoid classifier computes the probability distribution P of the multi-modal document over each label. The detailed model diagram is shown in fig. 5.
3.1 Grouped fusion of image features and text features
For the image features x trained by DenseNet-CBAM and the text features y constructed by BiLSTM+Att, the image features are first divided into r groups {x'_0, …, x'_(r-1)}, each fused with the text features y; the specific calculation formula is:
Zi=x′TWiy, (15)
where x'^T W_i y denotes the bilinear interaction of the two vectors, W_i a connection matrix, and Z_i the output of the multi-modal factorized bilinear pooling. W_i is realized by two fully connected layers that map x' and y into the same dimensional space, followed by sum pooling (SumPool) with window size k over that dimension, giving Z_i; i indexes the fusion of the i-th group of image features with the text features y.
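The factorized bilinear fusion of Eq. (15) for one image-feature group can be sketched as follows: two fully connected projections, an elementwise product, and sum pooling with window k, as described above. All dimensions and the pooling window are illustrative assumptions.

```python
# Hedged sketch of Eq. (15): multi-modal factorized bilinear pooling of one
# image-feature group x' with the text feature y.
import torch
import torch.nn as nn

class MFBFusion(nn.Module):
    def __init__(self, img_dim, txt_dim, out_dim, k=5):
        super().__init__()
        self.k, self.out_dim = k, out_dim
        self.fc_img = nn.Linear(img_dim, out_dim * k)  # projects x'
        self.fc_txt = nn.Linear(txt_dim, out_dim * k)  # projects y

    def forward(self, x, y):
        joint = self.fc_img(x) * self.fc_txt(y)            # elementwise product
        z = joint.view(-1, self.out_dim, self.k).sum(2)    # SumPool, window k
        return z

fuse = MFBFusion(img_dim=64, txt_dim=32, out_dim=16)
z = fuse(torch.randn(3, 64), torch.randn(3, 32))
print(z.shape)  # torch.Size([3, 16])
```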
3.2 attention mechanism
For each group of fused features Zi, the weight Mc(Zi) of the feature map on each channel is calculated with a channel attention mechanism. The specific calculation process is as follows:

Mc(Zi) = σ(W1(W0(AvgPool(Zi))) + W1(W0(MaxPool(Zi)))), (16)

Zi′ = Mc(Zi) ⊗ Zi, (17)

where σ denotes the sigmoid function, AvgPool(Zi) and MaxPool(Zi) represent average pooling and maximum pooling of the features Zi, and the weights W0 and W1 are learned during training; the weighted fused features Zi′ are obtained through element-wise multiplication.
Finally, each weighted fused feature Zi′ passes through a fully connected layer; the output vectors of the several fully connected layers are combined by element-wise addition, and a sigmoid classifier then computes the probability distribution P of the multi-modal document over each label. The specific calculation process is as follows:
Pi = Zi′ Aᵀ + b, (19)

where Aᵀ and b are trainable parameters.
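The per-group classification and combination (Eq. (19) followed by element-wise addition and a sigmoid) might be sketched as follows; the number of groups r and the dimensions are assumptions:

```python
import torch
import torch.nn as nn

class GroupClassifierHead(nn.Module):
    """Sketch of the final stage: each weighted fused feature Zi'
    passes through its own fully connected layer (Eq. 19,
    P_i = Zi' A^T + b), the r group outputs are combined by
    element-wise addition, and a sigmoid yields per-label
    probabilities."""
    def __init__(self, feat_dim, num_labels, r):
        super().__init__()
        self.fcs = nn.ModuleList(
            [nn.Linear(feat_dim, num_labels) for _ in range(r)]
        )

    def forward(self, groups):
        # groups: list of r tensors, each (batch, feat_dim)
        logits = sum(fc(z) for fc, z in zip(self.fcs, groups))  # element-wise sum
        return torch.sigmoid(logits)  # independent probability per label
```

Using a sigmoid rather than a softmax lets several labels fire at once, which matches the multi-label setting of the patent.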
4. Multi-label classification experiments on course-field multi-modal document data
The data set we use consists of image-text mixed multi-modal document data from the course field; each multi-modal document comprises one image, one text description, and several semantic tags. The data set contains 871 multi-modal documents for training, and the number of tags is 41. The labels include those shown in Table 2:
table 2: set of labels in a data set
80% of the course-field multi-modal document data set is used for training and 20% for testing. We compared the classification performance of other pre-trained models, such as VGG16, ResNet34, DenseNet121, and BiLSTM, on the same data set. The whole model is built with the PyTorch deep learning framework and runs on a GPU; the CUDA version is 10.1.120.
4.1 loss function
We use a criterion based on maximum entropy that optimizes a multi-label one-versus-all loss between the final classification result x and the target result y. For each batch of data:

loss(x, y) = −(1/C) Σi [ y[i]·log σ(x[i]) + (1 − y[i])·log(1 − σ(x[i])) ],

where y[i] ∈ {0, 1} and C represents the total number of tags.
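This max-entropy multi-label one-versus-all criterion corresponds to PyTorch's `MultiLabelSoftMarginLoss`; a minimal sketch (the logit and target values are illustrative):

```python
import torch
import torch.nn as nn

# Max-entropy multi-label one-versus-all loss, as provided by
# nn.MultiLabelSoftMarginLoss:
#   loss(x, y) = -1/C * sum_i [ y[i]*log(sigmoid(x[i]))
#                               + (1 - y[i])*log(1 - sigmoid(x[i])) ]
criterion = nn.MultiLabelSoftMarginLoss()

logits = torch.tensor([[2.0, -1.0, 0.5]])  # raw scores for C = 3 labels
target = torch.tensor([[1.0, 0.0, 1.0]])   # multi-hot ground-truth tags
loss = criterion(logits, target)
```

Because the loss is applied to raw scores, the sigmoid is folded into the criterion and the network's final layer can output unbounded logits during training.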
4.2 evaluation and results
Table 3: and the model accuracy comparison table shows that the precision and recall ratio of labels generated by different models under the same data set are used as evaluation indexes. For each multimodal document data, Top-3 and Top-5 are taken as one of the evaluation criteria, the meaning of Top-k is: if the top-k labels with the highest probability contain all real labels, the prediction is considered to be correct. And calculating Hamming loss (Hamming loss) which represents the proportion of error samples in all the labels, wherein the smaller the value, the stronger the classification capability of the network. In the experiment, the mAP is mainly used as a main evaluation standard, and the intrinsic average AP (average precision) value of the mAP is the average accuracy.
Table 3: model accuracy comparison table
Claims (7)
1. A course-field multi-modal document classification method based on a cross-modal attention convolutional neural network, characterized by comprising the following steps:
step 1: preprocessing of multimodal document data
step 1.1: each multi-modal document comprises an image and a text description, with several semantic tags attached; a dictionary is constructed from the text descriptions and the semantic tag set in the documents; tags appearing fewer than 13 times are deleted, and a multi-modal document is deleted when its number of semantic tags becomes 0;
step 1.2: data preprocessing: image data are randomly cropped to 224 × 224 in height and width and randomly flipped horizontally; text descriptions are all truncated or padded to length l, and a word vector model is used to learn the vector representation of the words in the text;
step 2: depth cross-modal feature extraction based on attention mechanism
step 2.1: the representation of the image features is constructed with a dense convolutional neural network DenseNet based on the spatial and channel attention mechanism CBAM; the obtained image features are recorded as x, where m represents the number of feature maps of the image;
step 2.2: constructing text features by adopting a bidirectional long-short term memory network (BilSTM) and a text attention mechanism, wherein the text attention mechanism consists of two convolution layers and a softmax classifier; the calculated weight is recorded asThe text feature representation after weighting is recorded asn is 4 hidden _ size, and the hidden _ size is the characteristic dimension of the hidden state of the BilSTM;
step 3: grouped cross-modal fusion based on the attention mechanism
step 3.1: the image features x obtained in step 2 are divided into r groups; each group of image features is mapped, together with the text features y, into the same one-dimensional space, and multi-modal split bilinear pooling fusion yields the fused features {Z0, Z1, …, Zr};
step 3.2: for each group of fused features Zi, a channel attention mechanism computes the weight of the feature map on each channel, and the weighted features are recorded as Zi′;
step 3.3: each weighted fused feature Zi′ passes through a fully connected layer; the output vectors of the several fully connected layers are combined by element-wise addition, and a sigmoid classifier then computes the probability distribution P of the multi-modal document over each label; finally, the error between the predicted value P and the true value is computed with the maximum-entropy loss function, and a back-propagation algorithm trains the parameters of the attention-based deep cross-modal feature extraction of step 2 and the attention-based grouped cross-modal fusion of step 3.
2. The method as claimed in claim 1, characterized in that in step 1.2, based on the characteristics of the images and texts themselves, the course-field multi-modal document data are processed, and for the i-th preprocessed multi-modal document, (Ri, Ti, Li) is finally obtained:
(1) Randomly sampling from image-text mixed multi-mode document data in the field of courses;
(2) for each image of the multimodal document:
(a) the image is scaled with the aspect ratio unchanged so that the shortest side is 256; the picture is then randomly cropped to 224 × 224 in height and width and randomly flipped horizontally; finally, channel-value normalization is performed to obtain Ri, where C = 3 and H = W = 224;
(3) for each text description in the multimodal document:
(a) the lengths of all text descriptions are counted and the length l = 484 is selected, such that 92% of the text lengths are smaller than this value;
(b) all data are truncated or padded to the same length l;
(c) word vectors, also known as word embeddings, map word sequence numbers to vectors in the real-number domain, and their weights are trained;
(d) through word embedding, the word sequence numbers, whose dimension is as high as the dictionary size, are embedded into a 256-dimensional continuous space to obtain Ti;
(4) For each set of tags in the multimodal document:
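As an illustrative sketch (not part of the claims) of the preprocessing in steps (2)-(3) above — random 224 × 224 crop with horizontal flip, and truncation/padding of token sequences to the fixed length l — assuming tensor images and integer token ids:

```python
import random
import torch

def random_crop_flip(img, size=224):
    """Random size x size crop plus random horizontal flip, as in
    step 1.2. img: (C, H, W) tensor with H, W >= size."""
    c, h, w = img.shape
    top = random.randint(0, h - size)
    left = random.randint(0, w - size)
    patch = img[:, top:top + size, left:left + size]
    if random.random() < 0.5:
        patch = patch.flip(-1)  # flip along the width axis
    return patch

def pad_or_truncate(token_ids, l, pad_id=0):
    """Cut off or pad every token sequence to the fixed length l."""
    return token_ids[:l] + [pad_id] * max(0, l - len(token_ids))
```

In practice the image side would usually be handled by a torchvision transform pipeline (Resize, RandomCrop, RandomHorizontalFlip, Normalize); the sketch above only mirrors the operations named in the claim.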
3. The course-field multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 1, characterized in that the dense convolutional neural network DenseNet based on the spatial and channel attention mechanism CBAM described in step 2.1 constructs the representation of the image features, specifically: the processed image data Ri first pass through a convolutional layer with a 7 × 7 kernel and stride 2 and a maximum pooling layer with a 3 × 3 kernel and stride 2; a CBAM module is then applied, followed by alternating DenseBlock, CBAM, and Transition modules that extract features from the sparse images of the course field; finally, average pooling with a 7 × 7 kernel yields the image features x. The CBAM module consists of a channel sub-module and a spatial sub-module, so that the attention weight map is multiplied with the input feature map for adaptive feature refinement; the channel sub-module computes the weight of each feature map, and the spatial sub-module computes the weight of each location within the feature map. For an intermediate feature map F taken as input, CBAM successively computes a one-dimensional channel attention map Mc and a two-dimensional spatial attention map Ms. The whole attention mechanism is calculated as follows:
F′ = Mc(F) ⊗ F, (1)

F″ = Ms(F′) ⊗ F′, (2)

where ⊗ represents element-wise multiplication; the channel attention weight Mc(F) is computed to obtain the weighted features F′, and the spatial attention weight Ms(F′) is then computed to obtain the weighted features F″.
The DenseBlock module is composed of multiple DenseLayer layers, each consisting of two groups of a batch normalization layer, a ReLU activation function, and a convolutional layer; each DenseLayer outputs k feature vectors, and the number of input features of the i-th DenseLayer is k0 + k × (i − 1), where k0 represents the initial number of input features and i represents the layer index. The Transition module, equivalent to a buffer layer used for down-sampling, is composed of a batch normalization layer, a ReLU activation function, a 1 × 1 convolutional layer, and a 2 × 2 average pooling layer.
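An illustrative PyTorch sketch (not part of the claims) of the CBAM channel and spatial sub-modules described above; the reduction ratio and kernel size are assumptions:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of CBAM channel attention: average- and max-pooled
    channel descriptors pass through a shared MLP (W0, W1), are
    summed, and a sigmoid gives one weight per channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W0
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),  # W1
        )

    def forward(self, f):
        # f: (batch, C, H, W)
        avg = self.mlp(f.mean(dim=(2, 3)))  # AvgPool branch
        mx = self.mlp(f.amax(dim=(2, 3)))   # MaxPool branch
        return torch.sigmoid(avg + mx)[:, :, None, None]  # Mc(F)

class SpatialAttention(nn.Module):
    """Sketch of CBAM spatial attention: channel-wise average and max
    maps are concatenated and convolved into one 2-D attention map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f):
        avg = f.mean(dim=1, keepdim=True)
        mx = f.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Ms(F')

# Refinement: F' = Mc(F) * F, then F'' = Ms(F') * F'  (element-wise)
```

The channel map broadcasts over the spatial axes and the spatial map over the channel axis, so both refinements are cheap element-wise products.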
4. The course-field multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 1, characterized in that constructing the text features in step 2.2 with a bidirectional long short-term memory network BiLSTM and a text attention mechanism is specifically:
for the input data Ti, text sequence information is extracted with the BiLSTM to obtain the text features output; the text attention mechanism computes the text attention weights α of output, and the weighted text features y are obtained from output and α. The text attention mechanism consists of two convolutional layers and a softmax classifier, and output comprises the set of hidden state vectors at each moment in the last-layer sequence, where l = seq_len and h = 2 × hidden_size.
5. The method for course-field multi-modal document classification based on a cross-modal attention convolutional neural network as claimed in claim 4, wherein the BiLSTM in step 2.2 is specifically characterized in that:
for the input data Ti, text sequence information is extracted with the BiLSTM to obtain the text features output; the specific calculation formulas are as follows:
it = σ(Wii·xt + bii + Whi·h(t−1) + bhi), (3)

ft = σ(Wif·xt + bif + Whf·h(t−1) + bhf), (4)

gt = tanh(Wig·xt + big + Whg·h(t−1) + bhg), (5)

ot = σ(Wio·xt + bio + Who·h(t−1) + bho), (6)

ct = ft ⊙ c(t−1) + it ⊙ gt, (7)

ht = ot ⊙ tanh(ct), (8)
where xt represents the input word vector, ht the hidden state vector at time t, ct the cell state vector at time t, and h(t−1) the hidden state vector at time t − 1 (or the initial state vector); the Wi, bi, and bh are trainable parameters; it, ft, gt, ot respectively represent the input, forget, cell, and output gates; σ represents the sigmoid function and tanh the hyperbolic tangent activation function. A set of hidden state vectors [h0, h1, …, hl−1] with the same length as the sentence is thus obtained, where l = 484 represents the sentence length.
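An illustrative sketch (not part of the claims) of one time step of the LSTM cell equations (3)-(8), with parameters named as in the formulas:

```python
import torch

def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; `params` maps each gate name to its
    (W_input, b_input, W_hidden, b_hidden) tuple."""
    W_ii, b_ii, W_hi, b_hi = params["i"]  # input gate
    W_if, b_if, W_hf, b_hf = params["f"]  # forget gate
    W_ig, b_ig, W_hg, b_hg = params["g"]  # cell candidate
    W_io, b_io, W_ho, b_ho = params["o"]  # output gate
    i_t = torch.sigmoid(x_t @ W_ii.T + b_ii + h_prev @ W_hi.T + b_hi)  # Eq. (3)
    f_t = torch.sigmoid(x_t @ W_if.T + b_if + h_prev @ W_hf.T + b_hf)  # Eq. (4)
    g_t = torch.tanh(x_t @ W_ig.T + b_ig + h_prev @ W_hg.T + b_hg)     # Eq. (5)
    o_t = torch.sigmoid(x_t @ W_io.T + b_io + h_prev @ W_ho.T + b_ho)  # Eq. (6)
    c_t = f_t * c_prev + i_t * g_t                                     # Eq. (7)
    h_t = o_t * torch.tanh(c_t)                                        # Eq. (8)
    return h_t, c_t
```

Running the step forward over the sequence and backward over its reversal, then concatenating the two hidden states, gives the 2 × hidden_size output dimension referred to in claim 4.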
6. The method for lesson domain multimodal document classification based on cross-modal attention convolutional neural network as claimed in claim 4, wherein said text attention mechanism in step 2.2 is specifically characterized in that:
7. The method for course-field multi-modal document classification based on a cross-modal attention convolutional neural network as claimed in claim 1, wherein the attention-mechanism-based grouped cross-modal fusion in step 3 is specifically characterized in that:
in step 3.1, the grouped fusion of the image features and the text features is as follows:
for the image features x obtained from the attention-based deep cross-modal feature extraction of step 2 and the text features y, the image features are divided into r groups, each of which is fused with the text features y; the specific fusion steps are as follows:
where ∘ represents the outer product of two vectors; Wi is a connection matrix; Zi is the output of the multi-modal split bilinear pooling; Wi maps x′ and y into the same-dimensional space through two fully connected layers, and sum pooling with window size k is then performed over one dimension to obtain Zi; i denotes the fusion of the i-th group of image features with the text features y;
in step 3.2, for each group of fused features Zi, a channel attention mechanism computes the weight of the feature map on each channel, and the weighted features are recorded as Zi′; this is characterized as follows:
for the resulting fused features {Z0, Z1, …, Zr}, the weight of the feature map on each channel is calculated with a channel attention mechanism; the specific calculation process is as follows:
Mc(Zi) = σ(W1(W0(AvgPool(Zi))) + W1(W0(MaxPool(Zi)))),

Zi′ = Mc(Zi) ⊗ Zi,

where σ denotes the sigmoid function, AvgPool(Zi) and MaxPool(Zi) represent average pooling and maximum pooling of the features Zi, and the weights W0 and W1 are learned during training; the weighted fused features Zi′ are obtained through element-wise multiplication;
in step 3.3, each weighted fused feature Zi′ passes through a fully connected layer; the output vectors of the several fully connected layers are combined by element-wise addition, and a sigmoid classifier then computes the probability distribution P of the multi-modal document over all semantic tags:
Pi = Zi′ Aᵀ + b, (12)

where Aᵀ and b are trainable parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010791032.3A CN111985369B (en) | 2020-08-07 | 2020-08-07 | Course field multi-modal document classification method based on cross-modal attention convolution neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111985369A CN111985369A (en) | 2020-11-24 |
CN111985369B true CN111985369B (en) | 2021-09-17 |
Family
ID=73444539
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010791032.3A Active CN111985369B (en) | 2020-08-07 | 2020-08-07 | Course field multi-modal document classification method based on cross-modal attention convolution neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111985369B (en) |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112487187B (en) * | 2020-12-02 | 2022-06-10 | 杭州电子科技大学 | News text classification method based on graph network pooling |
CN112508077B (en) * | 2020-12-02 | 2023-01-03 | 齐鲁工业大学 | Social media emotion analysis method and system based on multi-modal feature fusion |
CN112507898B (en) * | 2020-12-14 | 2022-07-01 | 重庆邮电大学 | Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN |
CN112650886B (en) * | 2020-12-28 | 2022-08-02 | 电子科技大学 | Cross-modal video time retrieval method based on cross-modal dynamic convolution network |
CN112685565B (en) * | 2020-12-29 | 2023-07-21 | 平安科技(深圳)有限公司 | Text classification method based on multi-mode information fusion and related equipment thereof |
CN112686345B (en) * | 2020-12-31 | 2024-03-15 | 江南大学 | Offline English handwriting recognition method based on attention mechanism |
CN112863081A (en) * | 2021-01-04 | 2021-05-28 | 西安建筑科技大学 | Device and method for automatic weighing, classifying and settling vegetables and fruits |
CN112819052B (en) * | 2021-01-25 | 2021-12-24 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-modal fine-grained mixing method, system, device and storage medium |
CN112817604B (en) * | 2021-02-18 | 2022-08-05 | 北京邮电大学 | Android system control intention identification method and device, electronic equipment and storage medium |
US20230145535A1 (en) * | 2021-03-01 | 2023-05-11 | Nvidia Corporation | Neural network training technique |
CN112925935B (en) * | 2021-04-13 | 2022-05-06 | 电子科技大学 | Image menu retrieval method based on intra-modality and inter-modality mixed fusion |
CN113052159A (en) * | 2021-04-14 | 2021-06-29 | 中国移动通信集团陕西有限公司 | Image identification method, device, equipment and computer storage medium |
CN113140023B (en) * | 2021-04-29 | 2023-09-15 | 南京邮电大学 | Text-to-image generation method and system based on spatial attention |
CN113221882B (en) * | 2021-05-11 | 2022-12-09 | 西安交通大学 | Image text aggregation method and system for curriculum field |
CN113342933B (en) * | 2021-05-31 | 2022-11-08 | 淮阴工学院 | Multi-feature interactive network recruitment text classification method similar to double-tower model |
CN113221181B (en) * | 2021-06-09 | 2022-08-09 | 上海交通大学 | Table type information extraction system and method with privacy protection function |
CN113255821B (en) * | 2021-06-15 | 2021-10-29 | 中国人民解放军国防科技大学 | Attention-based image recognition method, attention-based image recognition system, electronic device and storage medium |
CN113378989B (en) * | 2021-07-06 | 2022-05-17 | 武汉大学 | Multi-mode data fusion method based on compound cooperative structure characteristic recombination network |
CN113469094B (en) * | 2021-07-13 | 2023-12-26 | 上海中科辰新卫星技术有限公司 | Surface coverage classification method based on multi-mode remote sensing data depth fusion |
CN113792617B (en) * | 2021-08-26 | 2023-04-18 | 电子科技大学 | Image interpretation method combining image information and text information |
CN113807340B (en) * | 2021-09-07 | 2024-03-15 | 南京信息工程大学 | Attention mechanism-based irregular natural scene text recognition method |
CN113806564B (en) * | 2021-09-22 | 2024-05-10 | 齐鲁工业大学 | Multi-mode informative text detection method and system |
CN115858826A (en) * | 2021-09-22 | 2023-03-28 | 腾讯科技(深圳)有限公司 | Data processing method and device, computer equipment and storage medium |
CN113961710B (en) * | 2021-12-21 | 2022-03-08 | 北京邮电大学 | Fine-grained thesis classification method and device based on multi-mode layered fusion network |
GB2616316A (en) * | 2022-02-28 | 2023-09-06 | Nvidia Corp | Neural network training technique |
CN116704537B (en) * | 2022-12-02 | 2023-11-03 | 大连理工大学 | Lightweight pharmacopoeia picture and text extraction method |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107066583A (en) * | 2017-04-14 | 2017-08-18 | 华侨大学 | A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity |
CN108595632A (en) * | 2018-04-24 | 2018-09-28 | 福州大学 | A kind of hybrid neural networks file classification method of fusion abstract and body feature |
CN109740148A (en) * | 2018-12-16 | 2019-05-10 | 北京工业大学 | A kind of text emotion analysis method of BiLSTM combination Attention mechanism |
CN110019812A (en) * | 2018-02-27 | 2019-07-16 | 中国科学院计算技术研究所 | A kind of user is from production content detection algorithm and system |
CN110209789A (en) * | 2019-05-29 | 2019-09-06 | 山东大学 | A kind of multi-modal dialog system and method for user's attention guidance |
WO2019204186A1 (en) * | 2018-04-18 | 2019-10-24 | Sony Interactive Entertainment Inc. | Integrated understanding of user characteristics by multimodal processing |
CN111079444A (en) * | 2019-12-25 | 2020-04-28 | 北京中科研究院 | Network rumor detection method based on multi-modal relationship |
CN111325155A (en) * | 2020-02-21 | 2020-06-23 | 重庆邮电大学 | Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy |
CN111461174A (en) * | 2020-03-06 | 2020-07-28 | 西北大学 | Multi-mode label recommendation model construction method and device based on multi-level attention mechanism |
Non-Patent Citations (2)
Title |
---|
Image sentiment analysis using latent correlations among visual, textual, and sentiment views;Marie Katsurai et al;《2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)》;20160325;第2837-2841页 * |
基于层次化深度关联融合网络的社交媒体情感分类;蔡国永等;《计算机研究与发展》;20190615;第1312-1324页 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||