CN111985369B - Course field multi-modal document classification method based on cross-modal attention convolution neural network - Google Patents

Course field multi-modal document classification method based on cross-modal attention convolution neural network Download PDF

Info

Publication number
CN111985369B
CN111985369B
Authority
CN
China
Prior art keywords
text
modal
attention
image
features
Prior art date
Legal status
Active
Application number
CN202010791032.3A
Other languages
Chinese (zh)
Other versions
CN111985369A (en)
Inventor
宋凌云
俞梦真
尚学群
李建鳌
彭杨柳
李伟
李战怀
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010791032.3A priority Critical patent/CN111985369B/en
Publication of CN111985369A publication Critical patent/CN111985369A/en
Application granted granted Critical
Publication of CN111985369B publication Critical patent/CN111985369B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network. Multi-modal document data in the course domain are preprocessed; an attention mechanism is combined with a dense convolutional network to obtain a convolutional neural network based on cross-modal attention that constructs sparse image features more effectively; a bidirectional long short-term memory network built on an attention mechanism and oriented to text features is proposed, which efficiently constructs text features locally associated with the image semantics; and an attention-based grouped cross-modal fusion is designed, which learns the local association between the image and the text in a document more accurately and improves the accuracy of cross-modal feature fusion. Compared with existing multi-modal document classification models, the method performs better and improves the accuracy of multi-modal document classification on the same course-domain data set.

Description

Course field multi-modal document classification method based on cross-modal attention convolution neural network
Technical Field
The invention belongs to the fields of computer applications, multi-modal data classification, educational data classification, image processing, and text processing, and in particular relates to a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network.
Background
With the development of science and technology, the data that computers must process in many fields has shifted from single images to multi-modal data with richer forms and content, such as images, text, and audio. Multi-modal document classification has applications in video classification, visual question answering, entity matching in social networks, and so on. Its accuracy depends on whether the computer can accurately understand the semantics and content of the images and text contained in a document. However, the images in image–text mixed multi-modal documents in the course domain are generally composed of lines and characters and are highly sparse in visual features such as color and texture, and the text in such documents is only locally associated with the semantics of the image. As a result, existing multi-modal document classification models have difficulty constructing accurate semantic feature vectors for the images and text in a document, which lowers the quality of the multi-modal feature representation and hinders the performance of the classification task.
To solve these problems, the invention extends the model architecture and proposes a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network. The method extracts sparse course-domain image features well, efficiently constructs text features associated with local fine-grained image semantics, and learns more accurately the association between image and text features related to a specific object, thereby improving multi-modal document classification performance.
Disclosure of Invention
Technical problem to be solved
The visual features of images in image–text mixed multi-modal document data in the course domain are sparse, and the text is only locally semantically associated with the images, so existing multi-modal document classification models have difficulty accurately understanding the semantics and content of the text and images in such documents, which greatly affects classification performance. To solve these problems, the invention provides a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network. The method learns the semantic features of sparse course-domain images more efficiently, better captures the local fine-grained semantic association between the image and the text in a multi-modal document, represents multi-modal document features accurately, and improves the performance of course-domain multi-modal document classification.
Technical scheme
A course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network, characterized by comprising the following steps:
Step 1: preprocessing of the multi-modal document data
Step 1.1: each multi-modal document comprises an image and a text description and is annotated with several semantic tags; a dictionary is constructed from the text descriptions and the document label set; tags that appear fewer than 13 times are deleted, and a multi-modal document is deleted when its number of semantic tags becomes 0;
Step 1.2: data preprocessing: the image data are randomly cropped to 224 × 224 and randomly flipped horizontally; the text descriptions are truncated or padded to length l, and a word vector model is used to learn the vector representation of the words in the text;
Step 2: attention-based deep cross-modal feature extraction
Step 2.1: image features are constructed with the dense convolutional neural network DenseNet equipped with the spatial and feature attention mechanism CBAM; the obtained image features are recorded as $x \in \mathbb{R}^{m}$, where m is the number of image feature maps;
Step 2.2: text features are constructed with a bidirectional long short-term memory network BiLSTM and a text attention mechanism, where the text attention mechanism consists of two convolution layers and a softmax classifier; the computed attention weights are recorded as $\alpha \in \mathbb{R}^{l}$, and the weighted text feature representation is recorded as $y \in \mathbb{R}^{n}$, where n = 4·hidden_size and hidden_size is the feature dimension of the BiLSTM hidden state;
Step 3: attention-based grouped cross-modal fusion
Step 3.1: the image features x obtained in step 2 are divided into r groups; each group of image features $x'_i \in \mathbb{R}^{m/r}$ and the text feature y are mapped into the same dimensional space, and multi-modal factorized bilinear pooling is used to obtain the fused features $\{Z_0, Z_1, \ldots, Z_r\}$;
Step 3.2: for each group of fused features $Z_i$, a channel attention mechanism is used to compute the weight of the feature map on each channel, and the weighted features are recorded as $Z_i'$;
Step 3.3: each group of weighted fused features $Z_i'$ is passed through a fully-connected layer; the output vectors of the several fully-connected layers are then combined by adding corresponding elements, and a sigmoid classifier yields the probability distribution of the multi-modal document over the labels, $P \in \mathbb{R}^{N}$;
finally, the maximum-entropy criterion is used as the loss function to compute the error between the prediction P and the ground truth, and the model parameters are trained with the back-propagation algorithm.
In step 1.2, the course-domain multi-modal document data are processed according to the characteristics of the images and texts themselves, and for the i-th preprocessed multi-modal document, $(R_i, T_i, L_i)$ is finally obtained:
(1) randomly sample from the image–text mixed multi-modal document data in the course domain;
(2) for each image of the multi-modal document:
(a) scale the image with its aspect ratio unchanged so that the shortest side is 256; then randomly crop it to 224 × 224 and randomly flip it horizontally; finally apply channel-wise normalization to obtain $R_i \in \mathbb{R}^{C \times H \times W}$, where C = 3 and H = W = 224;
(3) for each text description in the multi-modal document:
(a) count the lengths of all text descriptions and select the length l = 484, which is longer than 92% of the texts;
(b) truncate or pad all texts to the same length l;
(c) use word vectors, also known as word embeddings, to map word indices to vectors in the real domain, and train their weights;
(d) the word indices, whose dimension equals the dictionary size, are embedded into a 256-dimensional continuous space, yielding $T_i \in \mathbb{R}^{l \times 256}$;
(4) for each set of tags in the multi-modal document:
(a) set up an N-dimensional vector for the total number of classes N, and map the semantic tags of the document through the tag dictionary into a 0–1 vector, obtaining $L_i \in \{0,1\}^{N}$.
The dense convolutional neural network with the spatial and feature attention mechanism described in step 2.1 constructs the image feature representation as follows: the preprocessed image data $R_i \in \mathbb{R}^{3 \times 224 \times 224}$ first pass through a convolution layer with a 7 × 7 kernel and stride 2 and a max pooling layer with a 3 × 3 kernel and stride 2; a CBAM module follows, after which DenseBlock, CBAM, and Transition modules alternate to extract the features of the sparse course-domain images; finally, average pooling with a 7 × 7 kernel yields the image features $x \in \mathbb{R}^{m}$.
The CBAM module consists of a channel submodule and a spatial submodule, so that the attention weight maps are multiplied with the input feature map for adaptive feature refinement; the channel submodule computes the weight of each feature map, and the spatial submodule computes the weight of each location in the feature map. Given an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$ as input, CBAM successively computes a one-dimensional channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a two-dimensional spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$. The whole attention mechanism is computed as
$F' = M_c(F) \otimes F$,  (1)
$F'' = M_s(F') \otimes F'$,  (2)
where ⊗ denotes the outer product; the channel attention weight $M_c(F)$ gives the weighted features F', and the spatial attention weight $M_s(F')$ gives the weighted features F''.
The DenseBlock module consists of multiple DenseLayer layers; each DenseLayer consists of two groups of batch normalization, ReLU activation, and convolution layers and outputs k feature maps, so the number of input feature maps of the i-th DenseLayer is $k_0 + k \times (i-1)$, where $k_0$ is the initial number of input features and i is the layer index. The Transition module acts as a buffer layer for down-sampling and consists of a batch normalization layer, a ReLU activation function, a 1 × 1 convolution layer, and a 2 × 2 average pooling layer.
The bidirectional long short-term memory network with the text attention mechanism in step 2.2 is as follows: for the input data $T_i$, a BiLSTM extracts the text sequence information to obtain the text features $\mathrm{output} \in \mathbb{R}^{l \times h}$; the text attention mechanism yields the attention weights $\alpha \in \mathbb{R}^{l}$ of output, and the outer product of output and α gives the weighted text features $y \in \mathbb{R}^{n}$. The text attention mechanism consists of two convolution layers and a softmax classifier, and output comprises the set of hidden state vectors at every time step of the last layer, $[h_0, h_1, \ldots, h_{l-1}] \in \mathbb{R}^{l \times h}$, where l = seq_len and h = 2·hidden_size.
The BiLSTM model in step 2.2 is specified as follows:
for the input data $T_i$, the BiLSTM extracts the text sequence information to obtain the text features $\mathrm{output} \in \mathbb{R}^{l \times h}$, computed as
$i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})$,  (3)
$f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf})$,  (4)
$g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg})$,  (5)
$o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})$,  (6)
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$,  (7)
$h_t = o_t \odot \tanh(c_t)$,  (8)
where $x_t$ is the vector of the input word at time t, $h_t$ the hidden state vector at time t, $c_t$ the cell state vector at time t, $h_{t-1}$ the hidden state vector of the previous time step (or the initial state vector), $W_i$, $b_i$ and $b_h$ are trainable parameters, $i_t$, $f_t$, $g_t$, $o_t$ are the input, forget, cell, and output gates respectively, σ is the sigmoid function, tanh the activation function, and ⊙ the Hadamard product; a set of hidden state vectors $[h_0, h_1, \ldots, h_{l-1}]$ with the same length as the sentence is obtained, where l = 484 is the sentence length.
The text attention mechanism module in step 2.2 is specified as follows:
for the text features $\mathrm{output} \in \mathbb{R}^{l \times h}$ obtained by the BiLSTM, the text attention mechanism computes the attention weights $\alpha \in \mathbb{R}^{l}$ of output, giving the weighted text features $y \in \mathbb{R}^{n}$, where the text attention mechanism consists of two 1 × 1 convolution layers and a softmax classifier.
The attention-based grouped cross-modal fusion module in step 3 is specified as follows:
in step 3.1, the grouped fusion of the image and text features is: for the image features $x \in \mathbb{R}^{m}$ produced by the DenseNet-CBAM model and the text features $y \in \mathbb{R}^{n}$ extracted by the BiLSTM+Att model, the image features are divided into r groups $x'_i \in \mathbb{R}^{m/r}$ and each group is fused with the text feature y; the fusion is
$Z_i = x_i'^{T} W_i y$,  (9)
where $x_i' \in \mathbb{R}^{m/r}$, ⊗ denotes the outer product of two vectors, and $W_i$ is the connection matrix; $Z_i$ is the output of the multi-modal factorized bilinear pooling; $W_i$ maps $x_i'$ and y into the same dimensional space with two fully-connected layers, after which sum pooling with window size k is applied along one dimension to obtain $Z_i$, where the subscript i denotes the fusion of the i-th group of image features with the text feature y;
in step 3.2, for each group of fused features $Z_i$, a channel attention mechanism computes the weight of the feature map on each channel, and the weighted features are recorded as $Z_i'$, as follows: for the fused features $\{Z_0, Z_1, \ldots, Z_r\}$, the channel attention mechanism computes
$M_c(Z_i) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(Z_i)) + \mathrm{MLP}(\mathrm{MaxPool}(Z_i))) = \sigma(W_1(W_0(Z_i^{avg})) + W_1(W_0(Z_i^{max})))$,  (11)
$Z_i' = M_c(Z_i) \otimes Z_i$,  (12)
where σ is the sigmoid function, $Z_i^{avg}$ and $Z_i^{max}$ denote average pooling and max pooling of the features $Z_i$, the weights $W_0$ and $W_1$ are learned during training, and the weighted fused features $Z_i'$ are obtained through the outer product operation;
in step 3.3, each group of weighted fused features $Z_i'$ is passed through a fully-connected layer; the output vectors of the several fully-connected layers are then combined by adding corresponding elements, and a sigmoid classifier yields the probability distribution of the multi-modal document over the labels, $P \in \mathbb{R}^{N}$:
$P_i = Z_i' A^{T} + b$,  (13)
$P = \mathrm{sigmoid}\big(\sum_{i} P_i\big)$,  (14)
where $A^{T}$ and b are trainable parameters.
Advantageous effects
The invention provides a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network. Multi-modal document data in the course domain are preprocessed; an attention mechanism is combined with a dense convolutional network to obtain a convolutional neural network based on cross-modal attention that constructs sparse image features more effectively; a bidirectional long short-term memory network built on an attention mechanism and oriented to text features is proposed, which efficiently constructs text features locally associated with the image semantics; and an attention-based grouped cross-modal fusion is designed, which learns the local association between the image and the text in a document more accurately and improves the accuracy of cross-modal feature fusion. Compared with existing multi-modal document classification models, the method performs better and improves the accuracy of multi-modal document classification on the same course-domain data set.
Drawings
FIG. 1 is a diagram of a model of the process described in the examples of the invention.
FIG. 2 is a diagram of the input data construction of the method in an example of the invention.
FIG. 3 is a model diagram of image feature extraction according to an embodiment of the present invention.
FIG. 4 is a model diagram of text feature extraction according to an embodiment of the present invention.
FIG. 5 is a model diagram of packet cross-modality fusion as described in the examples of the present invention.
FIG. 6a is a graph comparing the accuracy of multi-label image classification of different models on the same data set.
FIG. 6b is a graph comparing the loss of multi-label image classification of different models on the same data set.
FIG. 6c is a graph comparing the top-3 metric of multi-label image classification of different models on the same data set.
FIG. 6d is a graph comparing the top-5 metric of multi-label image classification of different models on the same data set.
FIG. 7 is a sample diagram of a data set employed in an example of the present invention.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the invention provides a course field multi-mode document classification method based on a cross-modal attention convolutional neural network, which is mainly divided into five modules: 1) pre-processing of multimodal document data. 2) And constructing a dense convolutional neural network image feature based on an attention mechanism. 3) And constructing a bidirectional long-short term memory network text feature based on an attention mechanism. 4) Packet cross-modality fusion based on attention mechanism. 5) Multi-label classification of multi-modal documents. The model diagram of the whole method is shown in fig. 1, and is specifically described as follows:
1. Preprocessing of multi-modal document data
The d-th image–text mixed multi-modal document is represented as $(I_d, T_d, L_d)$. Here $I_d$ denotes the image data extracted from the multi-modal document, i.e. a qualifying image; $T_d$ denotes the textual description locally associated with the image, $J_d$ is the index set into word2ix of the descriptive words appearing in the d-th multi-modal document, and each element of $J_d$ is the index of the n-th word in word2ix; likewise, $L_d$ denotes the semantic tag set, $K_d$ is the index set into label2ix of the tags appearing in the d-th multi-modal document, and each element of $K_d$ is the index of the o-th tag in label2ix.
The preprocessing model of the course-domain multi-modal document classification method based on the cross-modal attention convolutional neural network is shown in FIG. 2: the image is randomly cropped and flipped; based on the text-description lengths of the whole data set, all texts are truncated or padded to the same length l; a word vector model learns the vector representation of the words in the text; and finally all network inputs $(R_i, T_i, L_i)$ are obtained.
For the i-th preprocessed multi-modal document, $(R_i, T_i, L_i)$ is finally obtained:
(1) Randomly sample from the image–text mixed multi-modal document data in the course domain.
(2) For each image in the multi-modal document:
scale the image with its aspect ratio unchanged so that the shortest side is 256; then randomly crop it to 224 × 224 and randomly flip it horizontally; finally apply channel-wise normalization to obtain $R_i \in \mathbb{R}^{C \times H \times W}$, where C = 3 and H = W = 224.
(3) For each text description in the multi-modal document:
(a) count the lengths of all text descriptions and select the length l = 484, which is longer than 92% of the texts;
(b) truncate or pad all texts to the same length l;
(c) use word vectors (word embeddings) to map word indices to vectors in the real domain, and train their weights;
(d) the word indices, whose dimension equals the dictionary size, are embedded into a 256-dimensional continuous space, yielding $T_i \in \mathbb{R}^{l \times 256}$.
(4) For each set of tags in the multi-modal document:
set up an N-dimensional vector for the total number of classes N, and map the semantic tags of the document through the tag dictionary into a 0–1 vector, obtaining $L_i \in \{0,1\}^{N}$.
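As a concrete illustration of this preprocessing, the following is a minimal PyTorch/torchvision sketch of the image pipeline (scale the shortest side to 256, random 224 × 224 crop, random horizontal flip, channel normalization), of the truncation/padding of the word-index sequence to l = 484, and of the 0–1 label vector. The normalization statistics, padding index, and helper names are illustrative assumptions rather than values given in the patent.

```python
import torch
from torchvision import transforms
from PIL import Image

# Image pipeline: shortest side 256, random 224x224 crop, random horizontal flip,
# channel-wise normalization (the mean/std values below are the usual ImageNet
# statistics and are an assumption, not specified in the patent).
image_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def preprocess_text(word_ids, l=484, pad_id=0):
    """Truncate or pad a list of word indices to length l."""
    word_ids = word_ids[:l]
    word_ids = word_ids + [pad_id] * (l - len(word_ids))
    return torch.tensor(word_ids, dtype=torch.long)        # T_i indices, shape (l,)

def encode_labels(tag_ids, num_classes):
    """Map the semantic tags of a document to a 0-1 vector L_i of length N."""
    label_vec = torch.zeros(num_classes)
    label_vec[torch.tensor(tag_ids, dtype=torch.long)] = 1.0
    return label_vec

# Example usage (paths and ids are placeholders):
# R_i = image_transform(Image.open("doc_image.png").convert("RGB"))   # (3, 224, 224)
# T_i = preprocess_text([5, 42, 7, 99])                               # (484,)
# L_i = encode_labels([3, 17], num_classes=41)                        # (41,)
```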
2. Attention-based deep cross-modal feature construction
The attention-based deep cross-modal feature extraction comprises two parts: 1) attention-based dense convolutional neural network image feature construction, and 2) attention-based bidirectional long short-term memory network text feature construction. An attention mechanism is combined with a dense convolutional network to extract the image features, and an attention-based bidirectional long short-term memory network oriented to text features is constructed, yielding the weighted image features $x \in \mathbb{R}^{m}$ and text features $y \in \mathbb{R}^{n}$. The model diagram of the whole network is shown in FIG. 3:
2.1 attention-based dense convolutional neural network
For the preprocessed image data $R_i \in \mathbb{R}^{3 \times 224 \times 224}$, the network first applies a convolution layer with a 7 × 7 kernel and stride 2 and a max pooling layer with a 3 × 3 kernel and stride 2; a CBAM module follows, after which DenseBlock, CBAM, and Transition modules alternate to extract the features of the sparse images; finally, average pooling with a 7 × 7 kernel reduces the data dimensionality and avoids overfitting, yielding the image features $x \in \mathbb{R}^{m}$ for the multi-modal document input $R_i$. Each layer of the network structure is set as in the following table:
Table 1: DenseNet-CBAM model structure table
where k is the growth rate and _layer is the number of layers in the DenseBlock, i.e. the number of DenseLayer layers. The output of the i-th DenseLayer is denoted $H_i$, and each $H_i$ produces k feature maps; the input of the i-th DenseLayer, $x_i$ with $i \in$ _layer, satisfies
$x_i = H_i([x_0, x_1, \ldots, x_{i-1}])$,  (1)
so the number of input feature maps of each DenseLayer can be expressed as $k_0 + k \times (i-1)$, with $k_0$ the initial number of input features. The Transition module acts as a buffer layer for down-sampling and consists of a batch normalization layer, a ReLU activation function, a 1 × 1 convolution layer, and a 2 × 2 average pooling layer.
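The following is a minimal PyTorch sketch of the DenseLayer, DenseBlock, and Transition structure described above: each DenseLayer applies two BN–ReLU–Conv groups, produces k = growth_rate new feature maps, and concatenates them with all previous feature maps; the Transition applies BN, ReLU, a 1 × 1 convolution, and 2 × 2 average pooling. The bottleneck factor and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Two (BN, ReLU, Conv) groups; outputs k (growth_rate) new feature maps."""
    def __init__(self, in_channels, growth_rate, bottleneck=4):
        super().__init__()
        inter = bottleneck * growth_rate
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inter, kernel_size=1, bias=False),
            nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Concatenate the k new feature maps with everything seen so far,
        # i.e. x_i = H_i([x_0, x_1, ..., x_{i-1}]).
        return torch.cat([x, self.block(x)], dim=1)

class DenseBlock(nn.Module):
    def __init__(self, num_layers, in_channels, growth_rate):
        super().__init__()
        layers = []
        for i in range(num_layers):
            # input feature maps of the i-th layer: k0 + k * i
            layers.append(DenseLayer(in_channels + i * growth_rate, growth_rate))
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

class Transition(nn.Module):
    """Buffer layer for down-sampling: BN, ReLU, 1x1 Conv, 2x2 AvgPool."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.block(x)

# e.g. a block of 6 layers starting from 64 maps with growth rate k = 32:
# block = DenseBlock(num_layers=6, in_channels=64, growth_rate=32)   # outputs 64 + 6*32 maps
```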
A CBAM module is added between each DenseBlock and Transition layer; the CBAM module computes the weights of the feature maps: its channel submodule computes the weight of each feature map, and its spatial submodule computes the weight of each location in the feature map.
The CBAM module consists of a channel submodule and a spatial submodule, so that the attention weight maps are multiplied with the input feature map for adaptive feature refinement. Given an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$ as input, CBAM successively computes a one-dimensional channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a two-dimensional spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$. The whole attention mechanism is computed as follows:
for a given feature map $F \in \mathbb{R}^{C \times H \times W}$, the channel attention mechanism computes the weight of the feature map on each channel, recorded as $M_c(F) \in \mathbb{R}^{C \times 1 \times 1}$:
$M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))) = \sigma(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max})))$,  (2)
$F' = M_c(F) \otimes F$,  (3)
where σ is the sigmoid function, $F^{c}_{avg}$ and $F^{c}_{max}$ denote the average-pooled and max-pooled descriptors of the feature F, and $W_0$, $W_1$ are the weights of the shared MLP. For the feature map F', a spatial attention mechanism is then applied; the weight over the spatial region is computed and recorded as $M_s(F') \in \mathbb{R}^{1 \times H \times W}$:
$M_s(F') = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')])) = \sigma(f^{7 \times 7}([F'^{s}_{avg}; F'^{s}_{max}]))$,  (4)
$F'' = M_s(F') \otimes F'$,  (5)
where σ is the sigmoid function, $f^{7 \times 7}$ denotes a convolution with a 7 × 7 kernel, and $F'^{s}_{avg}$ and $F'^{s}_{max}$ denote average pooling and max pooling that reduce the channel dimension of F' to 1; the two maps are concatenated and passed through the convolution layer and the sigmoid function to obtain $M_s(F')$.
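A minimal PyTorch sketch of the CBAM channel and spatial submodules as computed by equations (2)–(5) is given below; the attention maps are applied here by element-wise multiplication with the features, which is how CBAM is usually implemented, and the reduction ratio of the shared MLP is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))), Eq. (2)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(               # shared MLP with weights W0, W1
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f):
        b, c, _, _ = f.shape
        avg = self.mlp(f.mean(dim=(2, 3)))                       # F_avg^c descriptor
        mx = self.mlp(f.amax(dim=(2, 3)))                        # F_max^c descriptor
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)          # M_c in R^{C x 1 x 1}

class SpatialAttention(nn.Module):
    """M_s(F') = sigmoid(f^{7x7}([AvgPool(F'); MaxPool(F')])), Eq. (4)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, f):
        avg = f.mean(dim=1, keepdim=True)                        # channel dim reduced to 1
        mx = f.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s in R^{1 x H x W}

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_att = ChannelAttention(channels, reduction)
        self.spatial_att = SpatialAttention()

    def forward(self, f):
        f = self.channel_att(f) * f      # F'  = M_c(F)  applied to F,  Eq. (3)
        f = self.spatial_att(f) * f      # F'' = M_s(F') applied to F', Eq. (5)
        return f

# cbam = CBAM(channels=256)
# refined = cbam(torch.randn(2, 256, 28, 28))   # same shape as the input
```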
2.2 Bidirectional long short-term memory network based on an attention mechanism
For the input data $T_i$, a BiLSTM extracts the text sequence information to obtain the text features $\mathrm{output} \in \mathbb{R}^{l \times h}$; the text attention mechanism yields the attention weights $\alpha \in \mathbb{R}^{l}$ of output, and the outer product of output and α gives the weighted text features $y \in \mathbb{R}^{n}$. The text attention mechanism consists of two convolution layers and a softmax classifier, and output comprises the set of hidden state vectors at every time step of the last layer, $[h_0, h_1, \ldots, h_{l-1}] \in \mathbb{R}^{l \times h}$, where l = seq_len and h = 2·hidden_size.
The BiLSTM (bidirectional long short-term memory network) combines a forward LSTM and a backward LSTM. For each element of the input sequence, each LSTM layer computes:
$i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})$,  (6)
$f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf})$,  (7)
$g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg})$,  (8)
$o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})$,  (9)
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$,  (10)
$h_t = o_t \odot \tanh(c_t)$,  (11)
where $x_t$ is the vector of the input word at time t, $h_t$ the hidden state vector at time t, $c_t$ the cell state vector at time t, $h_{t-1}$ the hidden state vector of the previous time step (or the initial state vector), $W_i$, $b_i$ and $b_h$ are trainable parameters, $i_t$, $f_t$, $g_t$, $o_t$ are the input, forget, cell, and output gates respectively, σ is the sigmoid function, tanh the activation function, and ⊙ the Hadamard product. A set of hidden state vectors $[h_0, h_1, \ldots, h_{l-1}]$ with the same length as the sentence is obtained, where l = 484 is the sentence length.
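A minimal PyTorch sketch of this text encoder is shown below: an embedding layer followed by nn.LSTM with bidirectional=True, which implements the gate equations (6)–(11) and returns the last-layer hidden states at every time step (shape l × 2·hidden_size). The vocabulary size, hidden size, and batch size are illustrative assumptions; the 256-dimensional embedding and l = 484 follow the preprocessing above.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_size=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # nn.LSTM implements the gate equations (6)-(11); bidirectional=True
        # concatenates forward and backward states, giving h = 2 * hidden_size.
        self.bilstm = nn.LSTM(embed_dim, hidden_size, batch_first=True,
                              bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, l) word indices T_i
        embedded = self.embedding(token_ids)            # (batch, l, 256)
        output, _ = self.bilstm(embedded)               # (batch, l, 2*hidden_size)
        return output                                   # [h_0, ..., h_{l-1}]

# encoder = TextEncoder(vocab_size=10000)
# states = encoder(torch.randint(1, 10000, (4, 484)))   # (4, 484, 1024)
```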
The text features $\mathrm{output} \in \mathbb{R}^{l \times h}$ obtained by the BiLSTM comprise the set of hidden state vectors at every time step of the last layer, $[h_0, h_1, \ldots, h_{l-1}]$. The text attention mechanism computes the attention weights $\alpha \in \mathbb{R}^{l}$ of output, giving the weighted text features $y \in \mathbb{R}^{n}$. The computation is as follows:
output is transformed by a convolution operation into u, with $W_w$, $b_w$, $u_w$ learnable parameters:
$u = W_w \,\mathrm{output} + b_w$,  (12)
for the u obtained by the convolution operations, the attention weights $\alpha \in \mathbb{R}^{l}$ are obtained with a softmax function:
$\alpha_i = \dfrac{\exp(u_w^{\top} u_i)}{\sum_{j} \exp(u_w^{\top} u_j)}$,  (13)
and with the weights $\alpha_i$ of the text features computed by the softmax function, the weighted text features $y \in \mathbb{R}^{n}$ are
$y = \sum_{i=1}^{l} \alpha_i \cdot \mathrm{output}_i$.  (14)
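A minimal sketch of this text attention is given below: two 1 × 1 convolutions score every time step of the BiLSTM output, a softmax along the sequence dimension gives the weights α, and the weighted sum of the hidden states gives y. The intermediate width and the tanh between the two convolutions are assumptions, and here y simply keeps the dimension of the hidden states.

```python
import torch
import torch.nn as nn

class TextAttention(nn.Module):
    """Scores every time step of the BiLSTM output and returns the weighted sum y."""
    def __init__(self, hidden_dim, attn_dim=128):
        super().__init__()
        # two 1x1 convolutions play the role of u = W_w * output + b_w
        self.score = nn.Sequential(
            nn.Conv1d(hidden_dim, attn_dim, kernel_size=1), nn.Tanh(),
            nn.Conv1d(attn_dim, 1, kernel_size=1),
        )

    def forward(self, output):
        # output: (batch, l, h) hidden states from the BiLSTM
        scores = self.score(output.transpose(1, 2))       # (batch, 1, l)
        alpha = torch.softmax(scores, dim=-1)             # attention weights, Eq. (13)
        y = torch.bmm(alpha, output).squeeze(1)           # y = sum_i alpha_i * output_i, Eq. (14)
        return y, alpha.squeeze(1)

# attn = TextAttention(hidden_dim=1024)
# y, alpha = attn(torch.randn(4, 484, 1024))    # y: (4, 1024), alpha: (4, 484)
```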
3. Grouped cross-modal fusion based on an attention mechanism
For the image features $x \in \mathbb{R}^{m}$ and the text features $y \in \mathbb{R}^{n}$ obtained from the attention-based deep cross-modal feature construction, the image features x are first divided into r groups; each group of image features $x'_i \in \mathbb{R}^{m/r}$ and the text feature y are mapped into the same dimensional space and fused to obtain the fused features $\{Z_0, Z_1, \ldots, Z_r\}$. For each group of fused features $Z_i$, a channel attention mechanism computes the weight of the feature map on each channel, giving the weighted features $Z_i'$; each group of weighted fused features $Z_i'$ is passed through a fully-connected layer, the output vectors of the several fully-connected layers are combined by adding corresponding elements, and a sigmoid classifier yields the probability distribution of the multi-modal document over the labels, $P \in \mathbb{R}^{N}$. The detailed model diagram is shown in FIG. 5.
3.1 group fusion of image features and text features
For the image features $x \in \mathbb{R}^{m}$ trained with DenseNet-CBAM and the text features $y \in \mathbb{R}^{n}$ constructed by BiLSTM+Att, the image features are first divided into r groups $x'_i \in \mathbb{R}^{m/r}$ and each group is fused with the text feature y, computed as
$Z_i = x_i'^{T} W_i y$,  (15)
where $x_i' \in \mathbb{R}^{m/r}$, ⊗ denotes the outer product of two vectors, and $W_i$ is the connection matrix; $Z_i$ is the output of the multi-modal factorized bilinear pooling. $W_i$ maps $x_i'$ and y into the same dimensional space with two fully-connected layers, after which sum pooling with window size k is applied along one dimension to obtain $Z_i$, where the subscript i denotes the fusion of the i-th group of image features with the text feature y.
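The sketch below illustrates the grouped bilinear fusion under the factorized reading described above: the image feature x is split into r groups, each group and the text feature y are projected by two fully-connected layers, combined multiplicatively, and sum-pooled with window k. The projection width, the element-wise combination, and the absence of dropout are assumptions in the spirit of factorized bilinear pooling, not details confirmed by the patent.

```python
import torch
import torch.nn as nn

class GroupedBilinearFusion(nn.Module):
    """Fuses r groups of image features with one text feature vector."""
    def __init__(self, image_dim, text_dim, r=8, k=5, out_dim=256):
        super().__init__()
        assert image_dim % r == 0
        self.r, self.k, self.out_dim = r, k, out_dim
        group_dim = image_dim // r
        # two fully-connected layers per group: one for x'_i, one for y
        self.img_proj = nn.ModuleList([nn.Linear(group_dim, k * out_dim) for _ in range(r)])
        self.txt_proj = nn.ModuleList([nn.Linear(text_dim, k * out_dim) for _ in range(r)])

    def forward(self, x, y):
        # x: (batch, image_dim) image features, y: (batch, text_dim) text features
        groups = torch.chunk(x, self.r, dim=1)                            # r groups x'_i
        fused = []
        for i in range(self.r):
            joint = self.img_proj[i](groups[i]) * self.txt_proj[i](y)     # (batch, k*out_dim)
            joint = joint.view(-1, self.out_dim, self.k).sum(dim=2)       # sum pooling, window k
            fused.append(joint)                                           # Z_i: (batch, out_dim)
        return fused                                                      # [Z_0, ..., Z_{r-1}]

# fusion = GroupedBilinearFusion(image_dim=1024, text_dim=1024)
# Z = fusion(torch.randn(4, 1024), torch.randn(4, 1024))   # list of 8 tensors of shape (4, 256)
```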
3.2 attention mechanism
For each group of fused features $Z_i$, the channel attention mechanism computes the weight $M_c(Z_i)$ of the feature map on each channel as follows:
$M_c(Z_i) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(Z_i)) + \mathrm{MLP}(\mathrm{MaxPool}(Z_i))) = \sigma(W_1(W_0(Z_i^{avg})) + W_1(W_0(Z_i^{max})))$,  (17)
$Z_i' = M_c(Z_i) \otimes Z_i$,  (18)
where σ is the sigmoid function, $Z_i^{avg}$ and $Z_i^{max}$ denote average pooling and max pooling of the features $Z_i$, the weights $W_0$ and $W_1$ are learned during training, and the weighted fused features $Z_i'$ are obtained through the outer product operation.
Finally, each group of weighted fused features $Z_i'$ is passed through a fully-connected layer; the output vectors of the several fully-connected layers are combined by adding corresponding elements, and a sigmoid classifier yields the probability distribution of the multi-modal document over the labels, $P \in \mathbb{R}^{N}$:
$P_i = Z_i' A^{T} + b$,  (19)
$P = \mathrm{sigmoid}\big(\sum_{i} P_i\big)$,  (20)
where $A^{T}$ and b are trainable parameters.
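A minimal sketch of this classification head follows: each fused feature Z_i is re-weighted by an attention gate, passed through its own fully-connected layer, the group outputs are added element-wise, and a sigmoid gives the label probabilities P as in equations (19)–(20). Since Z_i is shaped here as a vector, a squeeze-and-excitation-style gate stands in for the average/max-pooled MLP of equation (17); this shaping and the reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Attention re-weighting of each fused feature, per-group FC, element-wise sum, sigmoid."""
    def __init__(self, fused_dim, num_labels, r=8, reduction=16):
        super().__init__()
        # one gate and one fully-connected layer per group
        self.gates = nn.ModuleList([
            nn.Sequential(nn.Linear(fused_dim, fused_dim // reduction), nn.ReLU(inplace=True),
                          nn.Linear(fused_dim // reduction, fused_dim), nn.Sigmoid())
            for _ in range(r)])
        self.fcs = nn.ModuleList([nn.Linear(fused_dim, num_labels) for _ in range(r)])

    def forward(self, fused):
        # fused: list of r tensors Z_i with shape (batch, fused_dim)
        logits = 0
        for z, gate, fc in zip(fused, self.gates, self.fcs):
            z_weighted = gate(z) * z          # Z'_i = M_c(Z_i) applied to Z_i, cf. Eq. (18)
            logits = logits + fc(z_weighted)  # P_i = Z'_i A^T + b, Eq. (19), summed element-wise
        return torch.sigmoid(logits)          # P = sigmoid(sum_i P_i), Eq. (20)

# head = FusionClassifier(fused_dim=256, num_labels=41)
# P = head([torch.randn(4, 256) for _ in range(8)])   # (4, 41) label probabilities
```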
4. Multi-label classification experiments on course-domain multi-modal document data
The data set we use consists of image–text mixed multi-modal document data in the course domain; each multi-modal document consists of one image, one text description, and several semantic tags. The data set contains 871 multi-modal documents for training, and the number of tags is 41. The labels are listed in Table 2:
table 2: set of labels in a data set
Figure GDA0003160703830000149
Figure GDA0003160703830000151
80% of the course-domain multi-modal document data set is used for training and 20% for testing. We compare the classification performance of other pre-trained models such as VGG16, ResNet34, DenseNet121, and BiLSTM on the same data set. The whole model is built with the PyTorch deep learning framework and runs on a GPU; the CUDA version is 10.1.120.
4.1 loss function
We use a maximum-entropy-based criterion that optimizes a multi-label one-versus-all loss between the final classification result x and the target y. For each batch of data:
$\mathrm{loss}(x, y) = -\dfrac{1}{C}\sum_{i}\Big[\,y[i] \cdot \log\dfrac{1}{1+\exp(-x[i])} + (1-y[i]) \cdot \log\dfrac{\exp(-x[i])}{1+\exp(-x[i])}\,\Big]$,
where $y[i] \in \{0, 1\}$ and C is the total number of tags.
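This criterion matches the description of PyTorch's nn.MultiLabelSoftMarginLoss (a max-entropy, multi-label one-versus-all loss); a minimal sketch, applied to the pre-sigmoid scores since the criterion applies the sigmoid internally:

```python
import torch
import torch.nn as nn

criterion = nn.MultiLabelSoftMarginLoss()        # max-entropy multi-label one-versus-all loss

scores = torch.randn(4, 41, requires_grad=True)  # pre-sigmoid scores x for a batch, C = 41 tags
targets = torch.randint(0, 2, (4, 41)).float()   # 0-1 label vectors y
loss = criterion(scores, targets)
loss.backward()                                  # gradients for the back-propagation training step
```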
4.2 evaluation and results
The precision and recall of the labels generated by the different models on the same data set are used as evaluation indexes. For each multi-modal document, Top-3 and Top-5 are also used as evaluation criteria; Top-k means that a prediction is considered correct if the k labels with the highest probability contain all of the true labels. The Hamming loss, which is the proportion of wrongly predicted labels among all labels, is also computed; the smaller its value, the stronger the classification ability of the network. In the experiments, mAP is the main evaluation criterion; its underlying AP (average precision) value is the average accuracy.
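A small sketch of the Top-k correctness and Hamming loss computations described above (mAP would additionally require per-label average precision and is omitted); thresholding the probabilities at 0.5 for the Hamming loss is an assumption.

```python
import torch

def topk_correct(probs, targets, k=3):
    """A document counts as correct if the k highest-probability labels
    contain all of its true labels."""
    topk = probs.topk(k, dim=1).indices                       # (batch, k)
    hits = torch.zeros_like(targets).scatter_(1, topk, 1.0)   # 1 where a label is in the top-k
    return ((targets * hits).sum(dim=1) == targets.sum(dim=1)).float()

def hamming_loss(probs, targets, threshold=0.5):
    """Fraction of wrongly predicted labels over all labels (lower is better)."""
    preds = (probs >= threshold).float()
    return (preds != targets).float().mean()

# probs = torch.rand(4, 41); targets = torch.randint(0, 2, (4, 41)).float()
# acc_top3 = topk_correct(probs, targets, k=3).mean()
# hl = hamming_loss(probs, targets)
```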
Table 3: model accuracy comparison table

Claims (7)

1. A course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network, characterized by comprising the following steps:
Step 1: preprocessing of the multi-modal document data
Step 1.1: each multi-modal document comprises an image and a text description and is annotated with several semantic tags; a dictionary is constructed from the text descriptions and the semantic label set of the documents; tags that appear fewer than 13 times are deleted, and a multi-modal document is deleted when its number of semantic tags becomes 0;
Step 1.2: data preprocessing: the image data are randomly cropped to 224 × 224 and randomly flipped horizontally; the text descriptions are truncated or padded to length l, and a word vector model is used to learn the vector representation of the words in the text;
Step 2: attention-based deep cross-modal feature extraction
Step 2.1: image features are constructed with the dense convolutional neural network DenseNet equipped with the spatial and feature attention mechanism CBAM; the obtained image features are recorded as $x \in \mathbb{R}^{m}$, where m is the number of image feature maps;
Step 2.2: text features are constructed with a bidirectional long short-term memory network BiLSTM and a text attention mechanism, where the text attention mechanism consists of two convolution layers and a softmax classifier; the computed attention weights are recorded as $\alpha \in \mathbb{R}^{l}$, and the weighted text feature representation is recorded as $y \in \mathbb{R}^{n}$, where n = 4·hidden_size and hidden_size is the feature dimension of the BiLSTM hidden state;
Step 3: attention-based grouped cross-modal fusion
Step 3.1: the image features x obtained in step 2 are divided into r groups; each group of image features $x'_i \in \mathbb{R}^{m/r}$ and the text feature y are mapped into the same dimensional space, and multi-modal factorized bilinear pooling is used to obtain the fused features $\{Z_0, Z_1, \ldots, Z_r\}$;
Step 3.2: for each group of fused features $Z_i$, a channel attention mechanism is used to compute the weight of the feature map on each channel, and the weighted features are recorded as $Z_i'$;
Step 3.3: each group of weighted fused features $Z_i'$ is passed through a fully-connected layer; the output vectors of the several fully-connected layers are then combined by adding corresponding elements, and a sigmoid classifier yields the probability distribution of the multi-modal document over the labels, $P \in \mathbb{R}^{N}$;
finally, the maximum-entropy criterion is used as the loss function to compute the error between the prediction P and the ground truth, and the parameters of the attention-based deep cross-modal feature extraction in step 2 and of the attention-based grouped cross-modal fusion in step 3 are trained with the back-propagation algorithm.
2. The method as claimed in claim 1, wherein in step 1.2 the course-domain multi-modal document data are processed according to the characteristics of the images and texts themselves, and for the i-th preprocessed multi-modal document, $(R_i, T_i, L_i)$ is finally obtained:
(1) randomly sample from the image–text mixed multi-modal document data in the course domain;
(2) for each image of the multi-modal document:
(a) scale the image with its aspect ratio unchanged so that the shortest side is 256; then randomly crop it to 224 × 224 and randomly flip it horizontally; finally apply channel-wise normalization to obtain $R_i \in \mathbb{R}^{C \times H \times W}$, where C = 3 and H = W = 224;
(3) for each text description in the multi-modal document:
(a) count the lengths of all text descriptions and select the length l = 484, which is longer than 92% of the texts;
(b) truncate or pad all texts to the same length l;
(c) use word vectors, also known as word embeddings, to map word indices to vectors in the real domain, and train their weights;
(d) the word indices, whose dimension equals the dictionary size, are embedded into a 256-dimensional continuous space, yielding $T_i \in \mathbb{R}^{l \times 256}$;
(4) for each set of tags in the multi-modal document:
(a) set up an N-dimensional vector for the total number of classes N, and map the semantic tags of the document through the tag dictionary into a 0–1 vector, obtaining $L_i \in \{0,1\}^{N}$.
3. The course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network according to claim 1, characterized in that: the dense convolutional neural network DenseNet with the spatial and feature attention mechanism CBAM described in step 2.1 constructs the image feature representation as follows: the preprocessed image data $R_i \in \mathbb{R}^{3 \times 224 \times 224}$ first pass through a convolution layer with a 7 × 7 kernel and stride 2 and a max pooling layer with a 3 × 3 kernel and stride 2; a CBAM module follows, after which DenseBlock, CBAM, and Transition modules alternate to extract the features of the sparse course-domain images; finally, average pooling with a 7 × 7 kernel yields the image features $x \in \mathbb{R}^{m}$;
the CBAM module consists of a channel submodule and a spatial submodule, so that the attention weight maps are multiplied with the input feature map for adaptive feature refinement; the channel submodule computes the weight of each feature map, and the spatial submodule computes the weight of each location in the feature map; given an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$ as input, CBAM successively computes a one-dimensional channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a two-dimensional spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$; the whole attention mechanism is computed as
$F' = M_c(F) \otimes F$,  (1)
$F'' = M_s(F') \otimes F'$,  (2)
where ⊗ denotes the outer product; the channel attention weight $M_c(F)$ gives the weighted features F', and the spatial attention weight $M_s(F')$ gives the weighted features F'';
the DenseBlock module consists of multiple DenseLayer layers; each DenseLayer consists of two groups of batch normalization, ReLU activation, and convolution layers and outputs k feature maps, so the number of input feature maps of the i-th DenseLayer is $k_0 + k \times (i-1)$, where $k_0$ is the initial number of input features and i is the layer index; the Transition module acts as a buffer layer for down-sampling and consists of a batch normalization layer, a ReLU activation function, a 1 × 1 convolution layer, and a 2 × 2 average pooling layer.
4. The course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network according to claim 1, characterized in that: the construction of text features with the bidirectional long short-term memory network BiLSTM and the text attention mechanism in step 2.2 is as follows:
for the input data $T_i$, the BiLSTM extracts the text sequence information to obtain the text features $\mathrm{output} \in \mathbb{R}^{l \times h}$; the text attention mechanism yields the attention weights $\alpha \in \mathbb{R}^{l}$ of output, and the outer product of output and α gives the weighted text features $y \in \mathbb{R}^{n}$; the text attention mechanism consists of two convolution layers and a softmax classifier, and output comprises the set of hidden state vectors at every time step of the last layer, $[h_0, h_1, \ldots, h_{l-1}] \in \mathbb{R}^{l \times h}$, where l = seq_len and h = 2·hidden_size.
5. The method for course-domain multi-modal document classification based on a cross-modal attention convolutional neural network as claimed in claim 4, wherein the BiLSTM in step 2.2 is specified as follows:
for the input data $T_i$, the BiLSTM extracts the text sequence information to obtain the text features $\mathrm{output} \in \mathbb{R}^{l \times h}$, computed as
$i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})$,  (3)
$f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf})$,  (4)
$g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg})$,  (5)
$o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})$,  (6)
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$,  (7)
$h_t = o_t \odot \tanh(c_t)$,  (8)
where $x_t$ is the vector of the input word at time t, $h_t$ the hidden state vector at time t, $c_t$ the cell state vector at time t, $h_{t-1}$ the hidden state vector of the previous time step (or the initial state vector), $W_i$, $b_i$ and $b_h$ are trainable parameters, $i_t$, $f_t$, $g_t$, $o_t$ are the input, forget, cell, and output gates respectively, σ is the sigmoid function, tanh the activation function, and ⊙ the Hadamard product; a set of hidden state vectors $[h_0, h_1, \ldots, h_{l-1}]$ with the same length as the sentence is obtained, where l = 484 is the sentence length.
6. The method for course-domain multi-modal document classification based on a cross-modal attention convolutional neural network as claimed in claim 4, wherein the text attention mechanism in step 2.2 is specified as follows:
for the text features $\mathrm{output} \in \mathbb{R}^{l \times h}$ obtained by the BiLSTM, the text attention mechanism computes the attention weights $\alpha \in \mathbb{R}^{l}$ of output, giving the weighted text features $y \in \mathbb{R}^{n}$, where the text attention mechanism consists of two 1 × 1 convolution layers and a softmax classifier.
7. The method for course-domain multi-modal document classification based on a cross-modal attention convolutional neural network as claimed in claim 1, wherein the attention-based grouped cross-modal fusion in step 3 is specified as follows:
in step 3.1, the grouped fusion of the image and text features is: for the image features $x \in \mathbb{R}^{m}$ and the text features $y \in \mathbb{R}^{n}$ obtained from the attention-based deep cross-modal feature extraction in step 2, the image features are divided into r groups $x'_i \in \mathbb{R}^{m/r}$ and each group is fused with the text feature y; the fusion is
$Z_i = x_i'^{T} W_i y$,  (9)
where $x_i' \in \mathbb{R}^{m/r}$, ⊗ denotes the outer product of two vectors, and $W_i$ is the connection matrix; $Z_i$ is the output of the multi-modal factorized bilinear pooling; $W_i$ maps $x_i'$ and y into the same dimensional space with two fully-connected layers, after which sum pooling with window size k is applied along one dimension to obtain $Z_i$, where the subscript i denotes the fusion of the i-th group of image features with the text feature y;
in step 3.2, for each group of fused features $Z_i$, a channel attention mechanism computes the weight of the feature map on each channel, and the weighted features are recorded as $Z_i'$, as follows: for the fused features $\{Z_0, Z_1, \ldots, Z_r\}$, the channel attention mechanism computes
$M_c(Z_i) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(Z_i)) + \mathrm{MLP}(\mathrm{MaxPool}(Z_i))) = \sigma(W_1(W_0(Z_i^{avg})) + W_1(W_0(Z_i^{max})))$,  (10)
$Z_i' = M_c(Z_i) \otimes Z_i$,  (11)
where σ is the sigmoid function, $Z_i^{avg}$ and $Z_i^{max}$ denote average pooling and max pooling of the features $Z_i$, the weights $W_0$ and $W_1$ are learned during training, and the weighted fused features $Z_i'$ are obtained through the outer product operation;
in step 3.3, each group of weighted fused features $Z_i'$ is passed through a fully-connected layer; the output vectors of the fully-connected layers are then combined by adding corresponding elements, and a sigmoid classifier yields the probability distribution of the multi-modal document over all semantic tags, $P \in \mathbb{R}^{N}$:
$P_i = Z_i' A^{T} + b$,  (12)
$P = \mathrm{sigmoid}\big(\sum_{i} P_i\big)$,  (13)
where $A^{T}$ and b are trainable parameters.
CN202010791032.3A 2020-08-07 2020-08-07 Course field multi-modal document classification method based on cross-modal attention convolution neural network Active CN111985369B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010791032.3A CN111985369B (en) 2020-08-07 2020-08-07 Course field multi-modal document classification method based on cross-modal attention convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010791032.3A CN111985369B (en) 2020-08-07 2020-08-07 Course field multi-modal document classification method based on cross-modal attention convolution neural network

Publications (2)

Publication Number Publication Date
CN111985369A CN111985369A (en) 2020-11-24
CN111985369B true CN111985369B (en) 2021-09-17

Family

ID=73444539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010791032.3A Active CN111985369B (en) 2020-08-07 2020-08-07 Course field multi-modal document classification method based on cross-modal attention convolution neural network

Country Status (1)

Country Link
CN (1) CN111985369B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487187B (en) * 2020-12-02 2022-06-10 杭州电子科技大学 News text classification method based on graph network pooling
CN112508077B (en) * 2020-12-02 2023-01-03 齐鲁工业大学 Social media emotion analysis method and system based on multi-modal feature fusion
CN112507898B (en) * 2020-12-14 2022-07-01 重庆邮电大学 Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN
CN112650886B (en) * 2020-12-28 2022-08-02 电子科技大学 Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN112685565B (en) * 2020-12-29 2023-07-21 平安科技(深圳)有限公司 Text classification method based on multi-mode information fusion and related equipment thereof
CN112686345B (en) * 2020-12-31 2024-03-15 江南大学 Offline English handwriting recognition method based on attention mechanism
CN112863081A (en) * 2021-01-04 2021-05-28 西安建筑科技大学 Device and method for automatic weighing, classifying and settling vegetables and fruits
CN112819052B (en) * 2021-01-25 2021-12-24 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal fine-grained mixing method, system, device and storage medium
CN112817604B (en) * 2021-02-18 2022-08-05 北京邮电大学 Android system control intention identification method and device, electronic equipment and storage medium
US20230145535A1 (en) * 2021-03-01 2023-05-11 Nvidia Corporation Neural network training technique
CN112925935B (en) * 2021-04-13 2022-05-06 电子科技大学 Image menu retrieval method based on intra-modality and inter-modality mixed fusion
CN113052159A (en) * 2021-04-14 2021-06-29 中国移动通信集团陕西有限公司 Image identification method, device, equipment and computer storage medium
CN113140023B (en) * 2021-04-29 2023-09-15 南京邮电大学 Text-to-image generation method and system based on spatial attention
CN113221882B (en) * 2021-05-11 2022-12-09 西安交通大学 Image text aggregation method and system for curriculum field
CN113342933B (en) * 2021-05-31 2022-11-08 淮阴工学院 Multi-feature interactive network recruitment text classification method similar to double-tower model
CN113221181B (en) * 2021-06-09 2022-08-09 上海交通大学 Table type information extraction system and method with privacy protection function
CN113255821B (en) * 2021-06-15 2021-10-29 中国人民解放军国防科技大学 Attention-based image recognition method, attention-based image recognition system, electronic device and storage medium
CN113378989B (en) * 2021-07-06 2022-05-17 武汉大学 Multi-mode data fusion method based on compound cooperative structure characteristic recombination network
CN113469094B (en) * 2021-07-13 2023-12-26 上海中科辰新卫星技术有限公司 Land cover classification method based on deep fusion of multi-modal remote sensing data
CN113792617B (en) * 2021-08-26 2023-04-18 电子科技大学 Image interpretation method combining image information and text information
CN113807340B (en) * 2021-09-07 2024-03-15 南京信息工程大学 Attention mechanism-based irregular natural scene text recognition method
CN113806564B (en) * 2021-09-22 2024-05-10 齐鲁工业大学 Multi-mode informative text detection method and system
CN115858826A (en) * 2021-09-22 2023-03-28 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN113961710B (en) * 2021-12-21 2022-03-08 北京邮电大学 Fine-grained thesis classification method and device based on multi-mode layered fusion network
GB2616316A (en) * 2022-02-28 2023-09-06 Nvidia Corp Neural network training technique
CN116704537B (en) * 2022-12-02 2023-11-03 大连理工大学 Lightweight pharmacopoeia picture and text extraction method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066583A (en) * 2017-04-14 2017-08-18 华侨大学 Image-text cross-modal sentiment classification method based on compact bilinear fusion
CN110019812A (en) * 2018-02-27 2019-07-16 中国科学院计算技术研究所 User-generated content detection algorithm and system
WO2019204186A1 (en) * 2018-04-18 2019-10-24 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 Hybrid neural network text classification method fusing abstract and body features
CN109740148A (en) * 2018-12-16 2019-05-10 北京工业大学 Text sentiment analysis method combining BiLSTM with an attention mechanism
CN110209789A (en) * 2019-05-29 2019-09-06 山东大学 Multi-modal dialogue system and method for guiding user attention
CN111079444A (en) * 2019-12-25 2020-04-28 北京中科研究院 Network rumor detection method based on multi-modal relationship
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video action recognition method based on a residual 3D CNN and a multi-modal feature fusion strategy
CN111461174A (en) * 2020-03-06 2020-07-28 西北大学 Multi-modal label recommendation model construction method and device based on a multi-level attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Image sentiment analysis using latent correlations among visual, textual, and sentiment views; Marie Katsurai et al; 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2016-03-25; pp. 2837-2841 *
Social media sentiment classification based on a hierarchical deep correlation fusion network; Cai Guoyong et al; Journal of Computer Research and Development; 2019-06-15; pp. 1312-1324 *

Also Published As

Publication number Publication date
CN111985369A (en) 2020-11-24

Similar Documents

Publication Publication Date Title
CN111985369B (en) Course field multi-modal document classification method based on cross-modal attention convolution neural network
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN107480261B (en) Fine-grained face image fast retrieval method based on deep learning
CN109711463B (en) Attention-based important object detection method
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN110866140A (en) Image feature extraction model training method, image searching method and computer equipment
CN112818861A (en) Emotion classification method and system based on multi-mode context semantic features
CN111881262A (en) Text emotion analysis method based on multi-channel neural network
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN115438215B (en) Image-text bidirectional search and matching model training method, device, equipment and medium
CN112434628B (en) Small sample image classification method based on active learning and collaborative representation
Sharma et al. Deep eigen space based ASL recognition system
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN115455171B (en) Text-video mutual retrieval and model training method, device, equipment and medium
CN111738169A (en) Handwriting formula recognition method based on end-to-end network model
CN113948217A (en) Medical nested named entity recognition method based on local feature integration
CN110188827A (en) Scene recognition method based on a convolutional neural network and a recursive autoencoder model
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN115512096A (en) CNN and Transformer-based low-resolution image classification method and system
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN110704665A (en) Image feature expression method and system based on visual attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant