CN111985369A - Course field multi-modal document classification method based on cross-modal attention convolution neural network

Course field multi-modal document classification method based on cross-modal attention convolution neural network

Info

Publication number
CN111985369A
CN111985369A (application CN202010791032.3A)
Authority
CN
China
Prior art keywords
text
modal
image
attention
feature
Prior art date
Legal status
Granted
Application number
CN202010791032.3A
Other languages
Chinese (zh)
Other versions
CN111985369B (en)
Inventor
宋凌云
俞梦真
尚学群
李建鳌
彭杨柳
李伟
李战怀
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010791032.3A priority Critical patent/CN111985369B/en
Publication of CN111985369A publication Critical patent/CN111985369A/en
Application granted granted Critical
Publication of CN111985369B publication Critical patent/CN111985369B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/413Classification of content, e.g. text, photographs or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention relates to a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network. The method preprocesses multi-modal document data in the course domain; combines an attention mechanism with a dense convolutional network to obtain a cross-modal-attention convolutional neural network that constructs sparse image features more effectively; proposes a bidirectional long short-term memory network built around a text-oriented attention mechanism that efficiently constructs text features locally associated with the image semantics; and designs attention-based grouped cross-modal fusion that learns the local association between the images and text in a document more accurately, improving the accuracy of cross-modal feature fusion. Compared with existing multi-modal document classification models on the same course-domain data set, the method performs better and improves the accuracy of multi-modal document classification.

Description

Course field multi-modal document classification method based on cross-modal attention convolution neural network
Technical Field
The invention belongs to the fields of computer applications, multi-modal data classification, educational data classification, image processing and text processing, and particularly relates to a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network.
Background
With the development of science and technology, the data that computers must process in many fields has shifted from single images to multi-modal data such as images, text and audio, which are richer in both form and content. Multi-modal document classification has applications in video classification, visual question answering, entity matching in social networks, and so on. Its accuracy depends on whether the computer can accurately understand the semantics and content of the images and text contained in a document. However, the images in image-text mixed multi-modal documents in the course domain are generally composed of lines and characters and are highly sparse in visual features such as color and texture, and the text in such documents is only locally associated with the image semantics. Existing multi-modal document classification models therefore have difficulty constructing accurate semantic feature vectors for the images and text in a document, which reduces the accuracy of multi-modal document feature representation and limits the performance of the classification task.
To solve these problems, the invention extends the model architecture and proposes a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network. The method extracts sparse course-domain image features well, efficiently constructs text features associated with the local fine-grained image semantics, and learns the association between image and text features of a specific object more accurately, thereby improving multi-modal document classification performance.
Disclosure of Invention
Technical problem to be solved
The visual features of images in image-text mixed multi-modal documents in the course domain are sparse, and only local semantic associations exist between the text and the images, so existing multi-modal document classification models have difficulty accurately understanding the semantics and content of the text and images in such documents, which greatly affects classification performance. To solve these problems, the invention provides a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network. The method learns the semantic features of sparse course-domain images more efficiently, better captures the local fine-grained semantic associations between images and text in a multi-modal document, represents multi-modal document features accurately, and improves the performance of course-domain multi-modal document classification.
Technical scheme
A course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network, characterized by comprising the following steps:
Step 1: preprocessing of the multi-modal document data
Step 1.1: each multi-modal document contains an image and a text description and is annotated with several semantic tags; a dictionary is built from the text descriptions and the document tag set; tags that appear fewer than 13 times are deleted, and a multi-modal document is deleted when its number of semantic tags becomes 0;
Step 1.2: data preprocessing: the image data are randomly cropped to 224 × 224 and randomly flipped horizontally; the text descriptions are all truncated or padded to length l, and a word vector model learns the vector representation of the words in the text;
Step 2: deep cross-modal feature extraction based on the attention mechanism
Step 2.1: construct the image feature representation with a dense convolutional neural network (DenseNet) combined with the spatial and channel attention mechanism CBAM; the resulting image features are denoted x, where m is the number of feature maps of the image;
Step 2.2: construct text features with a bidirectional long short-term memory network (BiLSTM) and a text attention mechanism, where the text attention mechanism consists of two convolutional layers and a softmax classifier; the computed attention weights are denoted α, and the weighted text feature representation is denoted y, an n-dimensional vector with n = 4·hidden_size, where hidden_size is the feature dimension of the BiLSTM hidden state;
Step 3: grouped cross-modal fusion based on the attention mechanism
Step 3.1: divide the image feature x obtained in step 2 into r groups x' = {x'_0, x'_1, …, x'_r}; map each group together with the text feature y into the same one-dimensional space and fuse them with multi-modal factorized bilinear pooling to obtain the fused features {Z_0, Z_1, …, Z_r};
Step 3.2: for each fused feature Z_i, compute the weight of the feature map on each channel with a channel attention mechanism, and denote the weighted feature Z'_i;
Step 3.3: pass each weighted fused feature Z'_i through a fully connected layer; combine the output vectors of the fully connected layers by adding the corresponding elements, and obtain the probability distribution P of the multi-modal document over the labels with a sigmoid classifier; finally, compute the error between the prediction P and the ground truth with a maximum-entropy loss function, and train the model parameters by back-propagation.
In step 1.2, the course-domain multi-modal document data are processed according to the characteristics of the images and texts themselves, and for the i-th preprocessed multi-modal document the triple (R_i, T_i, L_i) is finally obtained:
(1) randomly sample from the image-text mixed multi-modal document data in the course domain;
(2) for each image of the multi-modal document:
(a) scale the image with the aspect ratio unchanged so that the shortest side is 256; randomly crop it to 224 × 224; apply a random horizontal flip; finally normalize the channel values to obtain R_i, with C = 3 channels and H = W = 224;
(3) for each text description in the multi-modal document:
(a) count the lengths of all text descriptions and select the length l = 484, which is longer than 92% of the texts;
(b) truncate or pad all texts to the same length l;
(c) map word indices to vectors in the real domain using word vectors (also called word embeddings) and train their weights;
(d) embed the word indices, whose range is the dictionary size, into a 256-dimensional continuous space to obtain T_i;
(4) for each set of tags in the multi-modal document:
(a) set up an N-dimensional vector for the N classes and map the semantic tags of the document to a 0-1 vector L_i through the tag dictionary.
The dense convolutional neural network based on the spatial and channel attention mechanism described in step 2.1 constructs the image feature representation as follows. The processed image data R_i first pass through a convolutional layer with a 7 × 7 kernel and stride 2 and a max-pooling layer with a 3 × 3 kernel and stride 2; a CBAM module follows, and then DenseBlock, CBAM and Transition modules alternate to extract features from the sparse course-domain images; finally, average pooling with a 7 × 7 kernel yields the image feature x.
The CBAM module consists of a channel sub-module and a spatial sub-module, and multiplies the attention weight maps with the input feature map for adaptive feature refinement; the channel sub-module computes the weight of each feature map, and the spatial sub-module computes the weight of each position in the feature map. Given an intermediate feature map F as input, CBAM sequentially computes a one-dimensional channel attention map M_c(F) and a two-dimensional spatial attention map M_s(F'). The whole attention mechanism is computed as:
F' = M_c(F) ⊗ F, (1)
F'' = M_s(F') ⊗ F', (2)
where ⊗ denotes element-wise multiplication of the attention map and the feature map; the channel attention weights M_c(F) give the weighted feature F', and the spatial attention weights M_s(F') give the weighted feature F''.
The DenseBlock module is composed of multiple DenseLayer layers; each DenseLayer consists of two groups of batch normalization, ReLU activation and convolution layers and outputs k feature maps, so the number of input feature maps of the i-th DenseLayer is k_0 + k·(i − 1), where k_0 is the initial number of input feature maps and i is the layer index. The Transition module acts as a buffer layer for down-sampling and consists of a batch normalization layer, a ReLU activation, a 1 × 1 convolution layer and a 2 × 2 average pooling layer. The bidirectional long short-term memory network based on the text attention mechanism in step 2.2 operates as follows:
For the input data T_i, the BiLSTM extracts the text sequence information to obtain the text feature output; the text attention mechanism then computes the attention weights α of output, and the weighted combination of output and α gives the weighted text feature y. The text attention mechanism consists of two convolutional layers and a softmax classifier, and output contains the set of hidden state vectors at every time step of the last layer, with shape l × h, where l = seq_len and h = 2·hidden_size.
The BiLSTM model in step 2.2 is as follows. For the input data T_i, the BiLSTM extracts the text sequence information to obtain the text feature output, computed as:
i_t = σ(W_ii x_t + b_ii + W_hi h_(t−1) + b_hi), (3)
f_t = σ(W_if x_t + b_if + W_hf h_(t−1) + b_hf), (4)
g_t = tanh(W_ig x_t + b_ig + W_hg h_(t−1) + b_hg), (5)
o_t = σ(W_io x_t + b_io + W_ho h_(t−1) + b_ho), (6)
c_t = f_t ⊙ c_(t−1) + i_t ⊙ g_t, (7)
h_t = o_t ⊙ tanh(c_t), (8)
where x_t is the input word vector at time t, h_t is the hidden state vector at time t, c_t is the cell state vector at time t, h_(t−1) is the hidden state vector of the previous time step (or the initial state), W_i, b_i and b_h are trainable parameters, i_t, f_t, g_t, o_t are the input, forget, cell and output gates respectively, σ is the sigmoid function, tanh is the activation function, and ⊙ is the Hadamard product. This yields a set of hidden state vectors [h_0, h_1, …, h_(l−1)] with the same length as the sentence, where l = 484 is the sentence length.
The text attention mechanism module in step 2.2 is as follows. For the text feature output obtained by the BiLSTM, the text attention mechanism computes the attention weights α of output and produces the weighted text feature y, where the text attention mechanism consists of two 1 × 1 convolutional layers and a softmax classifier.
The attention-based grouped cross-modal fusion module in step 3 is as follows.
In step 3.1, the grouped fusion of image features and text features is: for the image feature x' produced by the DenseNet-CBAM model and the text feature y extracted by the BiLSTM+Att model, the image feature is divided into r groups x' = {x'_0, x'_1, …, x'_r}, and each group is fused with the text feature y as:
Z_i = x'^T W_i y, (9)
which is computed in the factorized form (10): W_i, the association (projection) matrix, maps x' and y into the same dimensional space with two fully connected layers, their outputs are multiplied element-wise, and sum pooling with window size k is performed along one dimension to obtain Z_i, the output of the multi-modal factorized bilinear pooling; i indicates the fusion of the i-th group of image features with the text feature y;
in step 3.2, for each fused feature Z_i, the channel attention mechanism computes the weight of the feature map on each channel, and the weighted feature is denoted Z'_i; for the fused features {Z_0, Z_1, …, Z_r}, the channel attention weights are computed as:
M_c(Z_i) = σ(W_1(W_0(AvgPool(Z_i))) + W_1(W_0(MaxPool(Z_i)))), (11)
Z'_i = M_c(Z_i) ⊗ Z_i, (12)
where σ is the sigmoid function, AvgPool(Z_i) and MaxPool(Z_i) denote average pooling and max pooling of the feature Z_i, the weights W_0 and W_1 are learned during training, and the weighted fused feature Z'_i is obtained by multiplying the attention weights with Z_i;
in step 3.3, each weighted fused feature Z'_i passes through a fully connected layer; the output vectors of the fully connected layers are combined by adding the corresponding elements, and a sigmoid classifier gives the probability distribution P of the multi-modal document over the labels:
P_i = Z'_i A^T + b, (13)
P = σ(Σ_i P_i), (14)
where A^T and b are trainable parameters.
Advantageous effects
The invention provides a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network. The method preprocesses multi-modal document data in the course domain; combines an attention mechanism with a dense convolutional network to obtain a cross-modal-attention convolutional neural network that constructs sparse image features more effectively; proposes a bidirectional long short-term memory network built around a text-oriented attention mechanism that efficiently constructs text features locally associated with the image semantics; and designs attention-based grouped cross-modal fusion that learns the local association between the images and text in a document more accurately, improving the accuracy of cross-modal feature fusion. Compared with existing multi-modal document classification models on the same course-domain data set, the method performs better and improves the accuracy of multi-modal document classification.
Drawings
FIG. 1 is a diagram of a model of the process described in the examples of the invention.
FIG. 2 is a diagram of the input data construction of the method in an example of the invention.
FIG. 3 is a model diagram of image feature extraction according to an embodiment of the present invention.
FIG. 4 is a model diagram of text feature extraction according to an embodiment of the present invention.
FIG. 5 is a model diagram of packet cross-modality fusion as described in the examples of the present invention.
FIG. 6a compares the multi-label classification accuracy of different models on the same data set.
FIG. 6b compares the multi-label classification loss of different models on the same data set.
FIG. 6c compares the top-3 multi-label classification results of different models on the same data set.
FIG. 6d compares the top-5 multi-label classification results of different models on the same data set.
FIG. 7 is a sample diagram of a data set employed in an example of the present invention.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
The invention provides a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network, which is mainly divided into five modules: 1) preprocessing of the multi-modal document data; 2) attention-based dense convolutional neural network image feature construction; 3) attention-based bidirectional long short-term memory network text feature construction; 4) attention-based grouped cross-modal fusion; 5) multi-label classification of multi-modal documents. The model diagram of the whole method is shown in FIG. 1 and is described in detail as follows:
1. preprocessing of multimodal document data
An image-text mixed multi-modal document (the d-th multi-modal document data) consists of: an image identifier I_d, from which the image data are extracted from the multi-modal document data list; a textual description locally associated with the image, whose words are given by J_d, the set of word2ix indices of the descriptive text in the d-th document (the n-th element being the index of the n-th word in word2ix); and likewise a semantic tag set L_d, whose tags are given by K_d, the set of label2ix indices of the tag words in the d-th document (the o-th element being the index of the o-th tag word in label2ix).
The preprocessing model of the course-domain multi-modal document classification method based on the cross-modal attention convolutional neural network is shown in FIG. 2: the image is randomly cropped and flipped, all texts are truncated or padded to the same length l, chosen from the text-length statistics of the whole data set, and a word vector model learns the vector representation of the words in the text, finally giving the network inputs (R_i, T_i, L_i). For the i-th preprocessed multi-modal document, (R_i, T_i, L_i) is obtained as follows (a code sketch follows the steps below):
(1) Randomly sample from the image-text mixed multi-modal document data in the course domain.
(2) For each image in the multi-modal document: scale the image with the aspect ratio unchanged so that the shortest side is 256; randomly crop it to 224 × 224; apply a random horizontal flip; finally normalize the channel values to obtain R_i, with C = 3 channels and H = W = 224.
(3) For each text description in the multi-modal document:
(a) count the lengths of all text descriptions and select the length l = 484, which is longer than 92% of the texts;
(b) truncate or pad all texts to the same length l;
(c) map word indices to vectors in the real domain using word vectors (also called word embeddings) and train their weights;
(d) embed the word indices, whose range is the dictionary size, into a 256-dimensional continuous space to obtain T_i.
(4) For each set of tags in the multi-modal document: set up an N-dimensional vector for the N classes and map the semantic tags of the document to a 0-1 vector L_i through the tag dictionary.
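The preprocessing above maps directly onto standard PyTorch/torchvision utilities. The following is a minimal sketch under that assumption; the helper names, the vocabulary size and the normalization statistics are placeholders rather than values taken from the patent.

```python
import torch
from torchvision import transforms

# Image pipeline: shortest side 256 with aspect ratio kept, random 224x224 crop,
# random horizontal flip, then per-channel normalization (mean/std values are assumptions).
image_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def pad_or_truncate(word_ids, length=484, pad_id=0):
    """Cut off or pad a list of word indices to the fixed length l = 484."""
    return (word_ids + [pad_id] * length)[:length]

# Word indices are embedded into a 256-dimensional continuous space
# (the vocabulary size is a placeholder).
embedding = torch.nn.Embedding(num_embeddings=20000, embedding_dim=256)

def encode_labels(label_ids, num_classes):
    """Map a document's semantic tags to a 0-1 vector of length N."""
    target = torch.zeros(num_classes)
    target[label_ids] = 1.0
    return target
```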
2. Deep cross-modal feature construction based on the attention mechanism
The attention-based deep cross-modal feature extraction has two parts: 1) attention-based dense convolutional neural network image feature construction, and 2) attention-based bidirectional long short-term memory network text feature construction. The attention mechanism is combined with a dense convolutional network to extract image features, and a bidirectional long short-term memory network built around a text-oriented attention mechanism is proposed for the text, giving the weighted image feature x and the weighted text feature y. The model diagram of the whole network is shown in FIG. 3.
2.1 attention-based dense convolutional neural network
For the processed image data R_i, the network first applies a convolutional layer with a 7 × 7 kernel and stride 2 and a max-pooling layer with a 3 × 3 kernel and stride 2; a CBAM module follows, and then DenseBlock, CBAM and Transition modules alternate to extract features from the sparse images; finally, average pooling with a 7 × 7 kernel reduces the data dimensionality and avoids over-fitting, yielding the image feature x for the multi-modal document input. The per-layer settings of the network are listed in Table 1.
Table 1: DenseNet-CBAM model structure (the table itself is reproduced as an image in the original publication)
Here k is the growth rate and _layer is the number of DenseLayer layers in a DenseBlock. The output of each DenseLayer is denoted H_i; each H_i produces k feature maps, and the input of the i-th DenseLayer, x_i with i ∈ _layer, is
x_i = H_i([x_0, x_1, …, x_(i−1)]), (1)
so the number of input feature maps of each DenseLayer is k_0 + k·(i − 1), where k_0 is the initial number of input feature maps. The Transition module acts as a buffer layer for down-sampling and consists of a batch normalization layer, a ReLU activation, a 1 × 1 convolution layer and a 2 × 2 average pooling layer.
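As a rough PyTorch sketch of the DenseLayer concatenation in equation (1) and the Transition layer described above; the growth rate, bottleneck width and layer counts are assumptions, since the patent's exact configuration is only given in Table 1.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Two (BatchNorm, ReLU, Conv) groups; maps all earlier feature maps to k new ones."""
    def __init__(self, in_channels, growth_rate=32, bottleneck=4):
        super().__init__()
        self.layer = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, bottleneck * growth_rate, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck * growth_rate, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, features):                        # features: list [x_0, ..., x_{i-1}]
        return self.layer(torch.cat(features, dim=1))   # x_i = H_i([x_0, x_1, ..., x_{i-1}])

class DenseBlock(nn.Module):
    def __init__(self, num_layers, in_channels, growth_rate=32):
        super().__init__()
        # the i-th DenseLayer sees k_0 + k*(i-1) input feature maps
        self.layers = nn.ModuleList(
            [DenseLayer(in_channels + growth_rate * i, growth_rate) for i in range(num_layers)])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(features))
        return torch.cat(features, dim=1)

class Transition(nn.Module):
    """Buffer layer for down-sampling: BatchNorm, ReLU, 1x1 conv, 2x2 average pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.layer = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.layer(x)
```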
A CBAM module is inserted between each DenseBlock and Transition layer; the CBAM module computes the weights of the feature maps, using its channel sub-module to weight each feature map and its spatial sub-module to weight each position within the feature maps.
The CBAM module consists of a channel sub-module and a spatial sub-module, and multiplies the attention weight maps with the input feature map for adaptive feature refinement. Given an intermediate feature map F as input, CBAM sequentially computes a one-dimensional channel attention map M_c(F) and a two-dimensional spatial attention map M_s(F'), as follows.
For a given feature map F, the channel attention mechanism computes the weight of each channel, denoted M_c(F):
M_c(F) = σ(W_1(W_0(AvgPool(F))) + W_1(W_0(MaxPool(F)))), (2)
F' = M_c(F) ⊗ F, (3)
where σ is the sigmoid function and AvgPool(F) and MaxPool(F) denote average pooling and max pooling of F. For the weighted feature map F', the spatial attention mechanism then computes the weight of each spatial position, denoted M_s(F'):
M_s(F') = σ(f^(7×7)([AvgPool(F'); MaxPool(F')])), (4)
F'' = M_s(F') ⊗ F', (5)
where σ is the sigmoid function, f^(7×7) is a convolution with a 7 × 7 kernel, and AvgPool and MaxPool reduce the channel dimension of F' to 1; the two maps are concatenated and passed through the convolution layer and the sigmoid function to obtain M_s(F').
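A condensed PyTorch sketch of the CBAM channel and spatial sub-modules as formulated in equations (2)–(5); the reduction ratio of the shared MLP is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """M_c(F) = sigmoid(W1(W0(AvgPool(F))) + W1(W0(MaxPool(F))))."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared W0, W1
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, f):
        avg = self.mlp(F.adaptive_avg_pool2d(f, 1))
        mx = self.mlp(F.adaptive_max_pool2d(f, 1))
        return torch.sigmoid(avg + mx)                 # shape (B, C, 1, 1)

class SpatialAttention(nn.Module):
    """M_s(F') = sigmoid(f_7x7([AvgPool(F'); MaxPool(F')]))."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, f):
        pooled = torch.cat([f.mean(dim=1, keepdim=True),
                            f.max(dim=1, keepdim=True).values], dim=1)
        return torch.sigmoid(self.conv(pooled))        # shape (B, 1, H, W)

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, f):
        f = self.channel(f) * f     # F'  = M_c(F)  applied to F
        return self.spatial(f) * f  # F'' = M_s(F') applied to F'
```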
2.2 Attention-based bidirectional long short-term memory network
For the input data T_i, the BiLSTM extracts the text sequence information to obtain the text feature output; the text attention mechanism then computes the attention weights α of output, and the weighted combination of output and α gives the weighted text feature y. The text attention mechanism consists of two convolutional layers and a softmax classifier, and output contains the set of hidden state vectors at every time step of the last layer, with shape l × h, where l = seq_len and h = 2·hidden_size.
The BiLSTM (bidirectional long short-term memory) network is the combination of a forward LSTM and a backward LSTM. For each element of the input sequence, each LSTM layer computes:
i_t = σ(W_ii x_t + b_ii + W_hi h_(t−1) + b_hi), (6)
f_t = σ(W_if x_t + b_if + W_hf h_(t−1) + b_hf), (7)
g_t = tanh(W_ig x_t + b_ig + W_hg h_(t−1) + b_hg), (8)
o_t = σ(W_io x_t + b_io + W_ho h_(t−1) + b_ho), (9)
c_t = f_t ⊙ c_(t−1) + i_t ⊙ g_t, (10)
h_t = o_t ⊙ tanh(c_t), (11)
where x_t is the input word vector at time t, h_t is the hidden state vector at time t, c_t is the cell state vector at time t, h_(t−1) is the hidden state vector of the previous time step (or the initial state), W_i, b_i and b_h are trainable parameters, i_t, f_t, g_t, o_t are the input, forget, cell and output gates respectively, σ is the sigmoid function, tanh is the activation function, and ⊙ is the Hadamard product. This yields a set of hidden state vectors [h_0, h_1, …, h_(l−1)] with the same length as the sentence, where l = 484 is the sentence length.
The text feature output obtained by the BiLSTM contains the set of hidden state vectors at each time step of the last layer. The text attention mechanism computes the attention weights α of output and gives the weighted text feature y, as follows.
A convolution operation first transforms output into u, with learnable parameters W_w, b_w and u_w:
u = W_w · output + b_w, (12)
For the u obtained by the convolution operation, the softmax function gives the attention weights α over the l time steps:
α_i = exp(u_i^T u_w) / Σ_j exp(u_j^T u_w), (13)
and the weighted text feature is the weighted sum of the hidden states:
y = Σ_i α_i · output_i, (14)
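A sketch of the BiLSTM text encoder with the attention pooling of equations (12)–(14), assuming the two attention layers are implemented as 1×1 convolutions over the sequence; hidden_size and the intermediate attention width are placeholders.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """BiLSTM over word embeddings followed by two-convolution attention pooling."""
    def __init__(self, embed_dim=256, hidden_size=128, attn_dim=64):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_size, batch_first=True, bidirectional=True)
        # text attention: two 1x1 convolutions followed by a softmax over the time steps
        self.attn = nn.Sequential(
            nn.Conv1d(2 * hidden_size, attn_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(attn_dim, 1, kernel_size=1),
        )

    def forward(self, embedded):                    # embedded: (batch, l, embed_dim)
        output, _ = self.bilstm(embedded)           # output:   (batch, l, 2*hidden_size)
        scores = self.attn(output.transpose(1, 2))  # scores:   (batch, 1, l)
        alpha = torch.softmax(scores, dim=-1)       # attention weights over the l time steps
        y = torch.bmm(alpha, output).squeeze(1)     # y = sum_i alpha_i * output_i
        return y                                    # (batch, 2*hidden_size)
```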
3. Attention-based grouped cross-modal fusion
For the image feature x and the text feature y obtained from the attention-based deep cross-modal feature construction, the image feature x is first divided into r groups x' = {x'_0, x'_1, …, x'_r}; each group is mapped together with the text feature y into the same one-dimensional space and fused to obtain the fused features {Z_0, Z_1, …, Z_r}. For each fused feature Z_i, a channel attention mechanism computes the weight of the feature map on each channel to obtain the weighted feature Z'_i; each weighted fused feature Z'_i then passes through a fully connected layer, the output vectors of the fully connected layers are combined by adding the corresponding elements, and a sigmoid classifier gives the probability distribution P of the multi-modal document over the labels. The detailed model diagram is shown in FIG. 5.
3.1 Grouped fusion of image features and text features
For the image feature x' trained by DenseNet-CBAM and the text feature y constructed by BiLSTM+Att, the image feature is first divided into r groups x' = {x'_0, x'_1, …, x'_r}, and each group is fused with the text feature y:
Z_i = x'^T W_i y, (15)
which is computed in the factorized form (16): W_i, the association (projection) matrix, maps x' and y into the same dimensional space with two fully connected layers, their outputs are multiplied element-wise, and sum pooling with window size k is performed along one dimension to obtain Z_i, the output of the multi-modal factorized bilinear pooling; i indicates the fusion of the i-th group of image features with the text feature y.
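A sketch of the grouped fusion of equations (15)–(16), assuming each W_i is realized by two fully connected projections whose element-wise product is sum-pooled with window size k; the joint dimension, k and the number of groups are placeholders.

```python
import torch
import torch.nn as nn

class GroupedMFBFusion(nn.Module):
    """Fuse each of the r image-feature groups with the text feature y by factorized bilinear pooling."""
    def __init__(self, image_dim, text_dim, joint_dim=1000, k=5, groups=4):
        super().__init__()
        self.k = k
        self.image_proj = nn.ModuleList([nn.Linear(image_dim, joint_dim * k) for _ in range(groups)])
        self.text_proj = nn.ModuleList([nn.Linear(text_dim, joint_dim * k) for _ in range(groups)])

    def forward(self, image_groups, y):
        # image_groups: list of r tensors of shape (batch, image_dim); y: (batch, text_dim)
        fused = []
        for proj_x, proj_y, x_i in zip(self.image_proj, self.text_proj, image_groups):
            z = proj_x(x_i) * proj_y(y)               # element-wise product in the joint space
            z = z.view(z.size(0), -1, self.k).sum(2)  # sum pooling with window size k -> Z_i
            fused.append(z)                           # Z_i: (batch, joint_dim)
        return fused
```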
3.2 Attention mechanism
For each fused feature Z_i, the channel attention mechanism computes the weight M_c(Z_i) of the feature map on each channel:
M_c(Z_i) = σ(W_1(W_0(AvgPool(Z_i))) + W_1(W_0(MaxPool(Z_i)))), (17)
Z'_i = M_c(Z_i) ⊗ Z_i, (18)
where σ is the sigmoid function, AvgPool(Z_i) and MaxPool(Z_i) denote average pooling and max pooling of the feature Z_i, the weights W_0 and W_1 are learned during training, and the weighted fused feature Z'_i is obtained by multiplying the attention weights with Z_i.
Finally, each weighted fused feature Z'_i passes through a fully connected layer; the output vectors of the fully connected layers are fused by adding the corresponding elements, and a sigmoid classifier gives the probability distribution P of the multi-modal document over the labels:
P_i = Z'_i A^T + b, (19)
P = σ(Σ_i P_i), (20)
where A^T and b are trainable parameters.
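A sketch of equations (17)–(20): a channel-attention gate on each fused feature Z_i, a per-group fully connected layer, element-wise summation of the group outputs, and a sigmoid classifier. The fused features are treated here as flat channel vectors, so the pooling step of the channel attention reduces to the identity; that simplification, and all dimensions, are assumptions.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Channel attention on each fused feature, per-group FC, element-wise sum, sigmoid output."""
    def __init__(self, joint_dim=1000, num_labels=41, groups=4, reduction=16):
        super().__init__()
        self.gate = nn.ModuleList([
            nn.Sequential(nn.Linear(joint_dim, joint_dim // reduction), nn.ReLU(inplace=True),
                          nn.Linear(joint_dim // reduction, joint_dim), nn.Sigmoid())
            for _ in range(groups)])
        self.fc = nn.ModuleList([nn.Linear(joint_dim, num_labels) for _ in range(groups)])

    def forward(self, fused):                    # fused: list of r tensors (batch, joint_dim)
        logits = 0
        for gate, fc, z in zip(self.gate, self.fc, fused):
            z_weighted = gate(z) * z             # Z'_i = M_c(Z_i) applied to Z_i
            logits = logits + fc(z_weighted)     # P_i = Z'_i A^T + b, summed element-wise
        return torch.sigmoid(logits)             # P: per-label probabilities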
4. Multi-label classification experiments on course-domain multi-modal document data
The data set we use consists of image-text mixed multi-modal document data from the course domain; each multi-modal document is composed of one image, one text description and several semantic tags. The data set contains 871 multi-modal documents for training, and the number of tags is 41. The tags are listed in Table 2.
Table 2: Set of labels in the data set (the table itself is reproduced as an image in the original publication)
80% of the course-domain multi-modal document data set is used for training and 20% for testing. We compare the classification performance of other pre-trained models, such as VGG16, ResNet34, DenseNet121 and BiLSTM, on the same data set. The whole model is built with the PyTorch deep learning framework and runs on a GPU with CUDA version 10.1.120.
4.1 Loss function
We use a maximum-entropy-based criterion that optimizes a multi-label one-versus-all loss between the final classification result x and the target result y. For each batch of data:
loss(x, y) = −(1/C) · Σ_i [ y[i] · log(1/(1+exp(−x[i]))) + (1 − y[i]) · log(exp(−x[i])/(1+exp(−x[i]))) ],
where y[i] ∈ {0, 1} and C is the total number of labels.
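This criterion matches PyTorch's multi-label one-versus-all soft-margin loss; a minimal usage sketch follows (note that the library applies the sigmoid internally, so it is fed raw scores rather than the post-sigmoid probabilities).

```python
import torch
import torch.nn as nn

criterion = nn.MultiLabelSoftMarginLoss()        # max-entropy based multi-label one-vs-all loss

logits = torch.randn(8, 41, requires_grad=True)  # raw scores for a batch of 8 documents, 41 labels
targets = torch.randint(0, 2, (8, 41)).float()   # 0-1 label vectors
loss = criterion(logits, targets)                # -1/C * sum_i [y_i*log s(x_i) + (1-y_i)*log(1-s(x_i))]
loss.backward()
```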
4.2 Evaluation and results
The precision and recall of the labels generated by the different models on the same data set are used as evaluation indices. For each multi-modal document, top-3 and top-5 are also used as evaluation criteria; top-k means that a prediction is considered correct if the k labels with the highest probability contain all the true labels. We also compute the Hamming loss, which is the proportion of wrongly predicted entries among all labels; the smaller the value, the stronger the classification ability of the network. In the experiments, mAP (mean average precision), the mean of the per-class average precision (AP) values, is used as the main evaluation criterion. The model accuracy comparison is given in Table 3 (a metrics sketch follows).
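A sketch of the top-k rule and the Hamming loss described above; the helper names are hypothetical.

```python
import torch

def top_k_accuracy(probs, targets, k=3):
    """A document is counted as correct if the k highest-probability labels contain all true labels."""
    topk = probs.topk(k, dim=1).indices                     # (batch, k)
    hits = torch.zeros_like(probs).scatter_(1, topk, 1.0)   # 1 at the top-k positions
    missed = (targets * (1.0 - hits)).sum(dim=1)            # true labels outside the top-k
    return (missed == 0).float().mean()

def hamming_loss(probs, targets, threshold=0.5):
    """Proportion of wrongly predicted label entries; smaller is better."""
    preds = (probs >= threshold).float()
    return (preds != targets).float().mean()
```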
Table 3: model accuracy comparison table
(The table itself is reproduced as an image in the original publication.)

Claims (7)

1. A course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network, characterized by comprising the following steps:
Step 1: preprocessing of the multi-modal document data
Step 1.1: each multi-modal document contains an image and a text description and is annotated with several semantic tags; a dictionary is built from the text descriptions and the document tag set; tags that appear fewer than 13 times are deleted, and a multi-modal document is deleted when its number of semantic tags becomes 0;
Step 1.2: data preprocessing: the image data are randomly cropped to 224 × 224 and randomly flipped horizontally; the text descriptions are all truncated or padded to length l, and a word vector model learns the vector representation of the words in the text;
Step 2: deep cross-modal feature extraction based on the attention mechanism
Step 2.1: construct the image feature representation with a dense convolutional neural network (DenseNet) combined with the spatial and channel attention mechanism CBAM; the resulting image features are denoted x, where m is the number of feature maps of the image;
Step 2.2: construct text features with a bidirectional long short-term memory network (BiLSTM) and a text attention mechanism, where the text attention mechanism consists of two convolutional layers and a softmax classifier; the computed attention weights are denoted α, and the weighted text feature representation is denoted y, an n-dimensional vector with n = 4·hidden_size, where hidden_size is the feature dimension of the BiLSTM hidden state;
Step 3: grouped cross-modal fusion based on the attention mechanism
Step 3.1: divide the image feature x obtained in step 2 into r groups x' = {x'_0, x'_1, …, x'_r}; map each group together with the text feature y into the same one-dimensional space and fuse them with multi-modal factorized bilinear pooling to obtain the fused features {Z_0, Z_1, …, Z_r};
Step 3.2: for each fused feature Z_i, compute the weight of the feature map on each channel with a channel attention mechanism, and denote the weighted feature Z'_i;
Step 3.3: pass each weighted fused feature Z'_i through a fully connected layer; combine the output vectors of the fully connected layers by adding the corresponding elements, and obtain the probability distribution P of the multi-modal document over the labels with a sigmoid classifier;
And finally, calculating the error between the predicted value P and the true value by adopting the maximum entropy as a loss function, and training the parameters of the model by utilizing a back propagation algorithm.
2. The image classification model based on the cross-modal attention convolutional neural network as claimed in claim 1, wherein in step 1.2 the course-domain multi-modal document data are processed according to the characteristics of the images and texts themselves, and for the i-th preprocessed multi-modal document the triple (R_i, T_i, L_i) is finally obtained:
(1) randomly sample from the image-text mixed multi-modal document data in the course domain;
(2) for each image of the multi-modal document:
(a) scale the image with the aspect ratio unchanged so that the shortest side is 256; randomly crop it to 224 × 224; apply a random horizontal flip; finally normalize the channel values to obtain R_i, with C = 3 channels and H = W = 224;
(3) for each text description in the multi-modal document:
(a) count the lengths of all text descriptions and select the length l = 484, which is longer than 92% of the texts;
(b) truncate or pad all texts to the same length l;
(c) map word indices to vectors in the real domain using word vectors (also called word embeddings) and train their weights;
(d) embed the word indices, whose range is the dictionary size, into a 256-dimensional continuous space to obtain T_i;
(4) for each set of tags in the multi-modal document:
(a) set up an N-dimensional vector for the N classes and map the semantic tags of the document to a 0-1 vector L_i through the tag dictionary.
3. The course-domain multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 1, characterized in that the dense convolutional neural network based on the spatial and channel attention mechanism described in step 2.1 constructs the image feature representation as follows: the processed image data R_i first pass through a convolutional layer with a 7 × 7 kernel and stride 2 and a max-pooling layer with a 3 × 3 kernel and stride 2; a CBAM module follows, and then DenseBlock, CBAM and Transition modules alternate to extract features from the sparse course-domain images; finally, average pooling with a 7 × 7 kernel yields the image feature x;
the CBAM module consists of a channel sub-module and a spatial sub-module, and multiplies the attention weight maps with the input feature map for adaptive feature refinement; the channel sub-module computes the weight of each feature map, and the spatial sub-module computes the weight of each position in the feature map; given an intermediate feature map F as input, CBAM sequentially computes a one-dimensional channel attention map M_c(F) and a two-dimensional spatial attention map M_s(F'), and the whole attention mechanism is computed as:
F' = M_c(F) ⊗ F, (1)
F'' = M_s(F') ⊗ F', (2)
where ⊗ denotes element-wise multiplication of the attention map and the feature map; the channel attention weights M_c(F) give the weighted feature F', and the spatial attention weights M_s(F') give the weighted feature F'';
the DenseBlock module is composed of multiple DenseLayer layers; each DenseLayer consists of two groups of batch normalization, ReLU activation and convolution layers and outputs k feature maps, so the number of input feature maps of the i-th DenseLayer is k_0 + k·(i − 1), where k_0 is the initial number of input feature maps and i is the layer index; the Transition module acts as a buffer layer for down-sampling and consists of a batch normalization layer, a ReLU activation, a 1 × 1 convolution layer and a 2 × 2 average pooling layer.
4. The course-domain multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 1, characterized in that the bidirectional long short-term memory network based on the text attention mechanism in step 2.2 operates as follows:
for the input data T_i, the BiLSTM extracts the text sequence information to obtain the text feature output; the text attention mechanism computes the attention weights α of output, and the weighted combination of output and α gives the weighted text feature y; the text attention mechanism consists of two convolutional layers and a softmax classifier, and output contains the set of hidden state vectors at every time step of the last layer, with shape l × h, where l = seq_len and h = 2·hidden_size.
5. The course-domain multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 4, wherein the BiLSTM model in step 2.2 is as follows:
for the input data T_i, the BiLSTM extracts the text sequence information to obtain the text feature output, computed as:
i_t = σ(W_ii x_t + b_ii + W_hi h_(t−1) + b_hi), (3)
f_t = σ(W_if x_t + b_if + W_hf h_(t−1) + b_hf), (4)
g_t = tanh(W_ig x_t + b_ig + W_hg h_(t−1) + b_hg), (5)
o_t = σ(W_io x_t + b_io + W_ho h_(t−1) + b_ho), (6)
c_t = f_t ⊙ c_(t−1) + i_t ⊙ g_t, (7)
h_t = o_t ⊙ tanh(c_t), (8)
where x_t is the input word vector at time t, h_t is the hidden state vector at time t, c_t is the cell state vector at time t, h_(t−1) is the hidden state vector of the previous time step (or the initial state), W_i, b_i and b_h are trainable parameters, i_t, f_t, g_t, o_t are the input, forget, cell and output gates respectively, σ is the sigmoid function, tanh is the activation function, and ⊙ is the Hadamard product; this yields a set of hidden state vectors [h_0, h_1, …, h_(l−1)] with the same length as the sentence, where l = 484 is the sentence length.
6. The course-domain multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 4, wherein the text attention mechanism module in step 2.2 is as follows:
for the text feature output obtained by the BiLSTM, the text attention mechanism computes the attention weights α of output and gives the weighted text feature y, where the text attention mechanism consists of two 1 × 1 convolutional layers and a softmax classifier.
7. The course-domain multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 1, wherein the attention-based grouped cross-modal fusion module in step 3 is as follows:
in step 3.1, the grouped fusion of image features and text features is: for the image feature x' produced by the DenseNet-CBAM model and the text feature y extracted by the BiLSTM+Att model, the image feature is divided into r groups x' = {x'_0, x'_1, …, x'_r}, and each group is fused with the text feature y as:
Z_i = x'^T W_i y, (9)
which is computed in the factorized form (10): W_i, the association (projection) matrix, maps x' and y into the same dimensional space with two fully connected layers, their outputs are multiplied element-wise, and sum pooling with window size k is performed along one dimension to obtain Z_i, the output of the multi-modal factorized bilinear pooling; i indicates the fusion of the i-th group of image features with the text feature y;
in step 3.2, for each fused feature Z_i, the channel attention mechanism computes the weight of the feature map on each channel, and the weighted feature is denoted Z'_i; for the fused features {Z_0, Z_1, …, Z_r}, the channel attention weights are computed as:
M_c(Z_i) = σ(W_1(W_0(AvgPool(Z_i))) + W_1(W_0(MaxPool(Z_i)))), (11)
Z'_i = M_c(Z_i) ⊗ Z_i, (12)
where σ is the sigmoid function, AvgPool(Z_i) and MaxPool(Z_i) denote average pooling and max pooling of the feature Z_i, the weights W_0 and W_1 are learned during training, and the weighted fused feature Z'_i is obtained by multiplying the attention weights with Z_i;
in step 3.3, each weighted fused feature Z'_i passes through a fully connected layer; the output vectors of the fully connected layers are combined by adding the corresponding elements, and a sigmoid classifier gives the probability distribution P of the multi-modal document over the labels:
P_i = Z'_i A^T + b, (13)
P = σ(Σ_i P_i), (14)
where A^T and b are trainable parameters.
CN202010791032.3A 2020-08-07 2020-08-07 Course field multi-modal document classification method based on cross-modal attention convolution neural network Active CN111985369B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010791032.3A CN111985369B (en) 2020-08-07 2020-08-07 Course field multi-modal document classification method based on cross-modal attention convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010791032.3A CN111985369B (en) 2020-08-07 2020-08-07 Course field multi-modal document classification method based on cross-modal attention convolution neural network

Publications (2)

Publication Number Publication Date
CN111985369A true CN111985369A (en) 2020-11-24
CN111985369B CN111985369B (en) 2021-09-17

Family

ID=73444539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010791032.3A Active CN111985369B (en) 2020-08-07 2020-08-07 Course field multi-modal document classification method based on cross-modal attention convolution neural network

Country Status (1)

Country Link
CN (1) CN111985369B (en)

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487187A (en) * 2020-12-02 2021-03-12 杭州电子科技大学 News text classification method based on graph network pooling
CN112508077A (en) * 2020-12-02 2021-03-16 齐鲁工业大学 Social media emotion analysis method and system based on multi-modal feature fusion
CN112507898A (en) * 2020-12-14 2021-03-16 重庆邮电大学 Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN
CN112650886A (en) * 2020-12-28 2021-04-13 电子科技大学 Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN112686345A (en) * 2020-12-31 2021-04-20 江南大学 Off-line English handwriting recognition method based on attention mechanism
CN112685565A (en) * 2020-12-29 2021-04-20 平安科技(深圳)有限公司 Text classification method based on multi-mode information fusion and related equipment thereof
CN112817604A (en) * 2021-02-18 2021-05-18 北京邮电大学 Android system control intention identification method and device, electronic equipment and storage medium
CN112819052A (en) * 2021-01-25 2021-05-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal fine-grained mixing method, system, device and storage medium
CN112863081A (en) * 2021-01-04 2021-05-28 西安建筑科技大学 Device and method for automatic weighing, classifying and settling vegetables and fruits
CN112925935A (en) * 2021-04-13 2021-06-08 电子科技大学 Image menu retrieval method based on intra-modality and inter-modality mixed fusion
CN113052159A (en) * 2021-04-14 2021-06-29 中国移动通信集团陕西有限公司 Image identification method, device, equipment and computer storage medium
CN113140023A (en) * 2021-04-29 2021-07-20 南京邮电大学 Text-to-image generation method and system based on space attention
CN113221882A (en) * 2021-05-11 2021-08-06 西安交通大学 Image text aggregation method and system for curriculum field
CN113221181A (en) * 2021-06-09 2021-08-06 上海交通大学 Table type information extraction system and method with privacy protection function
CN113255821A (en) * 2021-06-15 2021-08-13 中国人民解放军国防科技大学 Attention-based image recognition method, attention-based image recognition system, electronic device and storage medium
CN113342933A (en) * 2021-05-31 2021-09-03 淮阴工学院 Multi-feature interactive network recruitment text classification method similar to double-tower model
CN113378989A (en) * 2021-07-06 2021-09-10 武汉大学 Multi-mode data fusion method based on compound cooperative structure characteristic recombination network
CN113469094A (en) * 2021-07-13 2021-10-01 上海中科辰新卫星技术有限公司 Multi-mode remote sensing data depth fusion-based earth surface coverage classification method
CN113792617A (en) * 2021-08-26 2021-12-14 电子科技大学 Image interpretation method combining image information and text information
CN113806564A (en) * 2021-09-22 2021-12-17 齐鲁工业大学 Multi-modal informative tweet detection method and system
CN113807340A (en) * 2021-09-07 2021-12-17 南京信息工程大学 Method for recognizing irregular natural scene text based on attention mechanism
CN113961710A (en) * 2021-12-21 2022-01-21 北京邮电大学 Fine-grained paper classification method and device based on a multi-modal hierarchical fusion network
WO2022187167A1 (en) * 2021-03-01 2022-09-09 Nvidia Corporation Neural network training technique
WO2023045605A1 (en) * 2021-09-22 2023-03-30 腾讯科技(深圳)有限公司 Data processing method and apparatus, computer device, and storage medium
CN116704537A (en) * 2022-12-02 2023-09-05 大连理工大学 Lightweight pharmacopoeia picture and text extraction method
GB2616316A (en) * 2022-02-28 2023-09-06 Nvidia Corp Neural network training technique
CN113806564B (en) * 2021-09-22 2024-05-10 齐鲁工业大学 Multi-modal informative text detection method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066583A (en) * 2017-04-14 2017-08-18 华侨大学 Image-text cross-modal sentiment classification method based on compact bilinear fusion
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 Hybrid neural network text classification method fusing abstract and body features
CN109740148A (en) * 2018-12-16 2019-05-10 北京工业大学 Text sentiment analysis method combining BiLSTM with an attention mechanism
CN110019812A (en) * 2018-02-27 2019-07-16 中国科学院计算技术研究所 User-generated content detection algorithm and system
CN110209789A (en) * 2019-05-29 2019-09-06 山东大学 Multi-modal dialog system and method for user attention guidance
WO2019204186A1 (en) * 2018-04-18 2019-10-24 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN111079444A (en) * 2019-12-25 2020-04-28 北京中科研究院 Network rumor detection method based on multi-modal relationship
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual 3D CNN and a multi-modal feature fusion strategy
CN111461174A (en) * 2020-03-06 2020-07-28 西北大学 Multi-modal label recommendation model construction method and device based on a multi-level attention mechanism

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066583A (en) * 2017-04-14 2017-08-18 华侨大学 Image-text cross-modal sentiment classification method based on compact bilinear fusion
CN110019812A (en) * 2018-02-27 2019-07-16 中国科学院计算技术研究所 User-generated content detection algorithm and system
WO2019204186A1 (en) * 2018-04-18 2019-10-24 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 Hybrid neural network text classification method fusing abstract and body features
CN109740148A (en) * 2018-12-16 2019-05-10 北京工业大学 Text sentiment analysis method combining BiLSTM with an attention mechanism
CN110209789A (en) * 2019-05-29 2019-09-06 山东大学 Multi-modal dialog system and method for user attention guidance
CN111079444A (en) * 2019-12-25 2020-04-28 北京中科研究院 Network rumor detection method based on multi-modal relationship
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual 3D CNN and a multi-modal feature fusion strategy
CN111461174A (en) * 2020-03-06 2020-07-28 西北大学 Multi-modal label recommendation model construction method and device based on a multi-level attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MARIE KATSURAI ET AL: "Image sentiment analysis using latent correlations among visual, textual, and sentiment views", 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
蔡国永 et al.: "Social media sentiment classification based on a hierarchical deep correlation fusion network", Journal of Computer Research and Development *

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112508077A (en) * 2020-12-02 2021-03-16 齐鲁工业大学 Social media emotion analysis method and system based on multi-modal feature fusion
CN112508077B (en) * 2020-12-02 2023-01-03 齐鲁工业大学 Social media emotion analysis method and system based on multi-modal feature fusion
CN112487187A (en) * 2020-12-02 2021-03-12 杭州电子科技大学 News text classification method based on graph network pooling
CN112507898A (en) * 2020-12-14 2021-03-16 重庆邮电大学 Multi-modal dynamic gesture recognition method based on lightweight 3D residual network and TCN
CN112650886A (en) * 2020-12-28 2021-04-13 电子科技大学 Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN112685565A (en) * 2020-12-29 2021-04-20 平安科技(深圳)有限公司 Text classification method based on multi-modal information fusion and related equipment thereof
WO2022142014A1 (en) * 2020-12-29 2022-07-07 平安科技(深圳)有限公司 Multi-modal information fusion-based text classification method, and related device thereof
CN112685565B (en) * 2020-12-29 2023-07-21 平安科技(深圳)有限公司 Text classification method based on multi-modal information fusion and related equipment thereof
CN112686345B (en) * 2020-12-31 2024-03-15 江南大学 Offline English handwriting recognition method based on attention mechanism
CN112686345A (en) * 2020-12-31 2021-04-20 江南大学 Off-line English handwriting recognition method based on attention mechanism
CN112863081A (en) * 2021-01-04 2021-05-28 西安建筑科技大学 Device and method for automatic weighing, classification and settlement of vegetables and fruits
CN112819052A (en) * 2021-01-25 2021-05-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal fine-grained mixing method, system, device and storage medium
CN112817604A (en) * 2021-02-18 2021-05-18 北京邮电大学 Android system control intention identification method and device, electronic equipment and storage medium
WO2022187167A1 (en) * 2021-03-01 2022-09-09 Nvidia Corporation Neural network training technique
CN112925935A (en) * 2021-04-13 2021-06-08 电子科技大学 Image menu retrieval method based on intra-modality and inter-modality mixed fusion
CN112925935B (en) * 2021-04-13 2022-05-06 电子科技大学 Image menu retrieval method based on intra-modality and inter-modality mixed fusion
CN113052159A (en) * 2021-04-14 2021-06-29 中国移动通信集团陕西有限公司 Image identification method, device, equipment and computer storage medium
CN113140023B (en) * 2021-04-29 2023-09-15 南京邮电大学 Text-to-image generation method and system based on spatial attention
CN113140023A (en) * 2021-04-29 2021-07-20 南京邮电大学 Text-to-image generation method and system based on spatial attention
CN113221882A (en) * 2021-05-11 2021-08-06 西安交通大学 Image text aggregation method and system for curriculum field
CN113342933A (en) * 2021-05-31 2021-09-03 淮阴工学院 Multi-feature interactive online recruitment text classification method based on a dual-tower-like model
CN113342933B (en) * 2021-05-31 2022-11-08 淮阴工学院 Multi-feature interactive online recruitment text classification method based on a dual-tower-like model
CN113221181B (en) * 2021-06-09 2022-08-09 上海交通大学 Table type information extraction system and method with privacy protection function
CN113221181A (en) * 2021-06-09 2021-08-06 上海交通大学 Table type information extraction system and method with privacy protection function
CN113255821A (en) * 2021-06-15 2021-08-13 中国人民解放军国防科技大学 Attention-based image recognition method, attention-based image recognition system, electronic device and storage medium
CN113378989A (en) * 2021-07-06 2021-09-10 武汉大学 Multi-modal data fusion method based on a composite collaborative structure feature recombination network
CN113378989B (en) * 2021-07-06 2022-05-17 武汉大学 Multi-modal data fusion method based on a composite collaborative structure feature recombination network
CN113469094B (en) * 2021-07-13 2023-12-26 上海中科辰新卫星技术有限公司 Surface coverage classification method based on deep fusion of multi-modal remote sensing data
CN113469094A (en) * 2021-07-13 2021-10-01 上海中科辰新卫星技术有限公司 Surface coverage classification method based on deep fusion of multi-modal remote sensing data
CN113792617B (en) * 2021-08-26 2023-04-18 电子科技大学 Image interpretation method combining image information and text information
CN113792617A (en) * 2021-08-26 2021-12-14 电子科技大学 Image interpretation method combining image information and text information
CN113807340A (en) * 2021-09-07 2021-12-17 南京信息工程大学 Method for recognizing irregular natural scene text based on attention mechanism
CN113807340B (en) * 2021-09-07 2024-03-15 南京信息工程大学 Attention mechanism-based irregular natural scene text recognition method
WO2023045605A1 (en) * 2021-09-22 2023-03-30 腾讯科技(深圳)有限公司 Data processing method and apparatus, computer device, and storage medium
CN113806564A (en) * 2021-09-22 2021-12-17 齐鲁工业大学 Multi-modal informative tweet detection method and system
CN113806564B (en) * 2021-09-22 2024-05-10 齐鲁工业大学 Multi-modal informative text detection method and system
CN113961710A (en) * 2021-12-21 2022-01-21 北京邮电大学 Fine-grained paper classification method and device based on a multi-modal hierarchical fusion network
GB2616316A (en) * 2022-02-28 2023-09-06 Nvidia Corp Neural network training technique
CN116704537A (en) * 2022-12-02 2023-09-05 大连理工大学 Lightweight pharmacopoeia picture and text extraction method
CN116704537B (en) * 2022-12-02 2023-11-03 大连理工大学 Lightweight pharmacopoeia picture and text extraction method

Also Published As

Publication number Publication date
CN111985369B (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN111985369B (en) Course field multi-modal document classification method based on cross-modal attention convolution neural network
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
US11270225B1 (en) Methods and apparatus for asynchronous and interactive machine learning using word embedding within text-based documents and multimodal documents
CN109389091B (en) Character recognition system and method based on combination of neural network and attention mechanism
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN107480261B (en) Fine-grained face image fast retrieval method based on deep learning
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN109711463B (en) Attention-based important object detection method
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN111881262A (en) Text emotion analysis method based on multi-channel neural network
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN112818861A (en) Emotion classification method and system based on multi-mode context semantic features
CN112733866A (en) Network construction method for improving text description correctness of controllable image
Sharma et al. Deep eigen space based ASL recognition system
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN115438215B (en) Image-text bidirectional search and matching model training method, device, equipment and medium
CN112257449A (en) Named entity recognition method and device, computer equipment and storage medium
CN113948217A (en) Medical nested named entity recognition method based on local feature integration
CN114239585A (en) Biomedical nested named entity recognition method
CN112749274A (en) Chinese text classification method based on attention mechanism and interference word deletion
CN110188827A (en) Scene recognition method based on convolutional neural network and recursive autoencoder model
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN115512096A (en) CNN and Transformer-based low-resolution image classification method and system
CN112131345A (en) Text quality identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant