CN111985369A - Course field multi-modal document classification method based on cross-modal attention convolution neural network - Google Patents
- Publication number
- CN111985369A (application number CN202010791032.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- modal
- image
- attention
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V30/413 — Classification of document content, e.g. text, photographs or tables
- G06F18/2415 — Classification techniques based on parametric or probabilistic models
- G06F18/253 — Fusion techniques of extracted features
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08 — Learning methods
Abstract
The invention relates to a multi-modal document classification method for the course domain based on a cross-modal attention convolutional neural network. The method preprocesses course-domain multi-modal document data; combines an attention mechanism with a dense convolutional network into a cross-modal-attention convolutional neural network that constructs sparse image features more effectively; proposes a bidirectional long short-term memory network with a text-oriented attention mechanism that efficiently constructs text features locally associated with image semantics; and designs attention-based grouped cross-modal fusion that learns the local association between the images and text in a document more accurately, improving the accuracy of cross-modal feature fusion. Compared with existing multi-modal document classification models on the same course-domain data set, the method performs better and improves the accuracy of multi-modal document classification.
Description
Technical Field
The invention belongs to the fields of computer applications, multi-modal data classification, educational data classification, image processing and text processing, and in particular relates to a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network.
Background
With the development of science and technology, the data that computers must process in many fields has shifted from single images to multi-modal data, such as images, text and audio, that is richer in both form and content. Multi-modal document classification has applications in video classification, visual question answering, entity matching in social networks, and so on. Its accuracy depends on whether the computer can correctly understand the semantics and content of the images and text contained in a document. However, images in image-text multi-modal documents in the course domain typically consist of lines and characters and are highly sparse in visual features such as color and texture, while the text is only locally associated with the image semantics. Existing multi-modal document classification models therefore struggle to construct accurate semantic feature vectors for the images and text in such documents, which degrades the quality of multi-modal document feature representation and hinders classification performance.
To address these problems, the invention extends the model architecture and proposes a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network. The method extracts sparse course-domain image features well, efficiently constructs text features associated with local fine-grained image semantics, and learns the association between image and text features of a specific object more accurately, thereby improving multi-modal document classification performance.
Disclosure of Invention
Technical problem to be solved
Image visual features in course-domain image-text multi-modal documents are sparse, and only local semantic association exists between the text and the images, so existing multi-modal document classification models have difficulty understanding the semantics and content of the text and images accurately, which greatly limits classification performance. To solve these problems, the invention provides a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network. The method learns the semantic features of sparse course-domain images more efficiently, better captures the local fine-grained semantic association between image and text in a multi-modal document, represents multi-modal document features accurately, and improves the performance of course-domain multi-modal document classification.
Technical scheme
A course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network, characterized by comprising the following steps:
step 1: preprocessing of multimodal document data
Step 1.1: each multi-modal document comprises an image and a text description and carries several semantic tags. Build a dictionary from the text descriptions and the document tag set; delete tags that appear fewer than 13 times, and delete a multi-modal document when its number of semantic tags drops to 0;
step 1.2: data preprocessing. Image data are randomly cropped to 224 × 224 and randomly flipped horizontally; text descriptions are truncated or padded to a fixed length l, and a word-vector model is used to learn vector representations of the words in the text;
step 2: attention-based deep cross-modal feature extraction
Step 2.1: image features are constructed with a dense convolutional neural network (DenseNet) equipped with the spatial- and channel-attention module CBAM; the resulting image feature is denoted x, where m is the number of feature maps of the image;
step 2.2: constructing text features by adopting a bidirectional long-short term memory network (BilSTM) and a text attention mechanism, wherein the text attention mechanism consists of two convolution layers and a softmax classifier; the calculated weight is recorded asThe text feature representation after weighting is recorded asn is 4 hidden _ size, and the hidden _ size is the characteristic dimension of the hidden state of the BilSTM;
step 3: attention-based grouped cross-modal fusion
Step 3.1: divide the image feature x obtained in step 2 into r groups; map each group, together with the text feature y, into the same one-dimensional space, and apply multi-modal factorized (split) bilinear pooling to obtain the fused features {Z0, Z1, …, Zr};
Step 3.2: for each group of fused features Zi, compute the weight of the feature map on each channel with a channel attention mechanism; the weighted feature is denoted Zi′;
step 3.3: each weighted fused feature Zi′ passes through a fully connected layer; the output vectors of the fully connected layers are combined by adding corresponding elements, and a sigmoid classifier then yields the probability distribution P of the multi-modal document over the labels. Finally, a maximum-entropy loss function measures the gap between the predicted value P and the true value, and a back-propagation algorithm trains the parameters of the model.
The classification method based on the cross-modal attention convolutional neural network as claimed in claim 1, wherein in step 1.2 the course-domain multi-modal document data are processed according to the characteristics of the images and text themselves; for the i-th preprocessed multi-modal document, the result is (Ri, Ti, Li):
(1) Randomly sample from the image-text multi-modal document data in the course domain;
(2) for each image of the multimodal document:
(a) scale the image with its aspect ratio unchanged so that the shortest side is 256; randomly crop it to 224 × 224; randomly flip it horizontally; and finally normalize the channel values to obtain Ri, where C = 3 and H = W = 224;
(3) for each text description in the multimodal document:
(a) count the lengths of all text descriptions and select the length l = 484, which is longer than 92% of the texts;
(b) truncate or pad all texts to the same length l;
(c) use word vectors, also known as word embeddings, to map word indices to vectors in the real domain and train their weights;
(d) word indices, whose range is the dictionary size, are embedded by the word embedding into a 256-dimensional continuous space;
(4) For each set of tags in the multimodal document:
(a) for a total of N classes, set up an N-dimensional vector and map the semantic tags of each document to a 0-1 vector Li through the tag dictionary.
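The preprocessing in steps (2)-(4) above can be sketched as follows. This is a hedged illustration, not the patent's implementation (which uses PyTorch/torchvision): the function names are invented for this sketch, images are assumed to be H × W × C arrays, and texts are assumed to be lists of word indices.

```python
import numpy as np

def random_crop(img, size=224, rng=None):
    """Randomly crop an image of shape (H, W, C) to size x size, as in step (2)(a)."""
    rng = rng or np.random.default_rng(0)
    h, w, _ = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size, :]

def pad_or_truncate(ids, l=484, pad_id=0):
    """Truncate or pad a word-index sequence to the fixed length l (step (3)(b))."""
    ids = ids[:l]
    return ids + [pad_id] * (l - len(ids))

def labels_to_multihot(label_ids, n_classes):
    """Map a set of semantic-tag indices to a 0-1 vector (step (4))."""
    v = np.zeros(n_classes, dtype=np.float32)
    v[list(label_ids)] = 1.0
    return v
```

In the patent's setting the crop size is 224, l = 484 and n_classes = 41; horizontal flipping and channel normalization would follow the crop.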
The dense convolutional neural network with spatial and channel attention described in step 2.1 constructs the image feature representation as follows. The processed image data Ri first pass through a convolutional layer with a 7 × 7 kernel and stride 2 and a max-pooling layer with a 3 × 3 kernel and stride 2; then CBAM, DenseBlock and Transition modules alternate to extract features from the sparse course-domain images; finally, average pooling with a 7 × 7 kernel yields the image feature x. The CBAM module consists of a channel sub-module and a spatial sub-module; the attention weight map is multiplied element-wise with the input feature map for adaptive feature refinement. The channel sub-module computes the weight of each feature map, and the spatial sub-module computes the weight of each location within a feature map. For an intermediate feature map F taken as input, CBAM sequentially computes a one-dimensional channel attention map Mc(F) and a two-dimensional spatial attention map Ms(F′); the whole attention mechanism is computed as

F′ = Mc(F) ⊗ F,
F″ = Ms(F′) ⊗ F′,

where ⊗ denotes element-wise multiplication: the channel attention weight Mc(F) is computed first and gives the weighted feature F′; the spatial attention weight Ms(F′) is then computed and gives the weighted feature F″.
The DenseBlock module consists of multiple DenseLayer layers; each DenseLayer consists of two groups of a batch-normalization layer, a ReLU activation and a convolutional layer, and outputs k feature maps, so the number of input feature maps of the i-th DenseLayer is k0 + k × (i − 1), where k0 is the initial number of input features and i is the layer index. The Transition module acts as a buffer layer for down-sampling and consists of a batch-normalization layer, a ReLU activation, a 1 × 1 convolutional layer and a 2 × 2 average-pooling layer.

The bidirectional long short-term memory network with text attention described in step 2.2 is specified as follows:
For the input data Ti, a BiLSTM extracts the text sequence information and yields the text features output; the text attention mechanism computes the attention weights α of output, and the weighted combination of output and α gives the weighted text feature y. The text attention mechanism consists of two convolutional layers and a softmax classifier, and output contains the hidden-state vectors of every time step in the last layer of the sequence, with l = seq_len and h = 2 × hidden_size.
The BiLSTM model of step 2.2 is specified as follows:

For the input data Ti, the BiLSTM extracts the text sequence information to obtain the text features. The specific calculation formulas are:

i_t = σ(W_ii x_t + b_ii + W_hi h_(t−1) + b_hi), (3)
f_t = σ(W_if x_t + b_if + W_hf h_(t−1) + b_hf), (4)
g_t = tanh(W_ig x_t + b_ig + W_hg h_(t−1) + b_hg), (5)
o_t = σ(W_io x_t + b_io + W_ho h_(t−1) + b_ho), (6)
c_t = f_t ⊙ c_(t−1) + i_t ⊙ g_t, (7)
h_t = o_t ⊙ tanh(c_t), (8)

where x_t is the input word vector at time t, h_t is the hidden-state vector at time t, c_t is the cell-state vector at time t, h_(t−1) is the hidden-state vector at time t − 1 (or the initial state), the W and b terms are trainable parameters, and i_t, f_t, g_t, o_t are the input, forget, cell and output gates respectively; σ is the sigmoid function, tanh is the activation function, and ⊙ is the Hadamard product. This yields a set of hidden-state vectors [h_0, h_1, …, h_(l−1)] of the same length as the sentence, where l = 484 is the sentence length.
The text attention module of step 2.2 is specified as follows:

For the text features output obtained by the BiLSTM, the text attention mechanism computes the attention weights α of output and gives the weighted text feature y, where the text attention mechanism consists of two 1 × 1 convolutional layers and a softmax classifier.
The attention-based grouped cross-modal fusion module of step 3 is specified as follows:

In step 3.1, the grouped fusion of image features and text features proceeds as follows:

For the image features x obtained from the DenseNet-CBAM model and the text features y extracted by the BiLSTM + Att model, the image features are divided into r groups, and each group x′ is fused with the text feature y. The specific fusion step is:
Zi=x′TWiy, (9)
where Wi is a connection (projection) matrix and Zi is the output of multi-modal split bilinear pooling. In practice Wi is factorized: two fully connected layers map x′ and y into the same-dimensional space, their element-wise product is taken, and sum pooling with window size k is applied along that dimension to obtain Zi; the subscript i indicates the fusion of the i-th group of image features with the text feature y;
in step 3.2, for each group of fused features Zi, a channel attention mechanism computes the weight of the feature map on each channel; the weighted feature is denoted Zi′. Specifically, for the obtained fused features {Z0, Z1, …, Zr}, the channel weights are computed as

Mc(Zi) = σ(W1(W0(AvgPool(Zi))) + W1(W0(MaxPool(Zi)))),

where σ is the sigmoid function, AvgPool and MaxPool denote average and maximum pooling of the feature Zi, and the weights W0, W1 are learned during training; the weighted fused feature Zi′ is obtained by element-wise multiplication;
in step 3.3, each weighted fused feature Zi′ passes through a fully connected layer; the output vectors of the several fully connected layers are combined by adding corresponding elements, and a sigmoid classifier then yields the probability distribution P of the multi-modal document over the labels:
Pi=Zi′AT+b, (13)
where A^T and b are trainable parameters.
Advantageous effects
The invention provides a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network. The method preprocesses course-domain multi-modal document data; combines an attention mechanism with a dense convolutional network into a cross-modal-attention convolutional neural network that constructs sparse image features more effectively; proposes a bidirectional long short-term memory network with a text-oriented attention mechanism that efficiently constructs text features locally associated with image semantics; and designs attention-based grouped cross-modal fusion that learns the local association between the images and text in a document more accurately, improving the accuracy of cross-modal feature fusion. Compared with existing multi-modal document classification models on the same course-domain data set, the method performs better and improves the accuracy of multi-modal document classification.
Drawings
FIG. 1 is a diagram of a model of the process described in the examples of the invention.
FIG. 2 is a diagram of the input data construction of the method in an example of the invention.
FIG. 3 is a model diagram of image feature extraction according to an embodiment of the present invention.
FIG. 4 is a model diagram of text feature extraction according to an embodiment of the present invention.
FIG. 5 is a model diagram of packet cross-modality fusion as described in the examples of the present invention.
FIG. 6a compares the accuracy of image multi-label classification for different models on the same data set.
FIG. 6b compares the loss of image multi-label classification for different models on the same data set.
FIG. 6c compares the top-3 accuracy of image multi-label classification for different models on the same data set.
FIG. 6d compares the top-5 accuracy of image multi-label classification for different models on the same data set.
FIG. 7 is a sample diagram of a data set employed in an example of the present invention.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
The invention provides a course-domain multi-modal document classification method based on a cross-modal attention convolutional neural network, divided into five modules: 1) preprocessing of multi-modal document data; 2) attention-based dense convolutional neural network image feature construction; 3) attention-based bidirectional long short-term memory network text feature construction; 4) attention-based grouped cross-modal fusion; 5) multi-label classification of multi-modal documents. The model diagram of the whole method is shown in FIG. 1 and is described in detail as follows:
1. preprocessing of multimodal document data
An image-text multi-modal document is represented by its d-th datum, which comprises: Id, the entry in the multi-modal document data list from which the image data are extracted, yielding a qualifying image; the textual description locally associated with the image, whose word indices form the set Jd of indices in word2ix for the descriptive text of the d-th document, each element being the index of the n-th word in word2ix; and, likewise, the semantic tag set Ld, whose tag indices form the set Kd of indices in label2ix for the d-th document, each element being the index of the o-th word in label2ix.
The preprocessing model of the course-domain multi-modal document classification method based on the cross-modal attention convolutional neural network is shown in FIG. 2: images are randomly cropped and flipped; based on the lengths of the text descriptions over the whole data set, all texts are truncated or padded to the same length l; a word-vector model learns the vector representations of the words in the text; and the results finally enter the network as input parameters.
For the ith preprocessed multi-modal document, finally obtaining (R)i,Ti,Li):
(1) Randomly sample from the image-text multi-modal document data in the course domain.
(2) For each image in the multimodal document:
Scale the image with its aspect ratio unchanged so that the shortest side is 256; randomly crop it to 224 × 224; randomly flip it horizontally; and finally normalize the channel values to obtain Ri, where C = 3 and H = W = 224.
(3) For each text description in the multimodal document:
(a) count the lengths of all text descriptions and select the length l = 484, which is longer than 92% of the texts;
(b) truncate or pad all texts to the same length l;
(c) use Word Vectors, also known as Word Embeddings, to map word indices to vectors in the real domain and train their weights;
(d) word indices, whose range is the dictionary size, are embedded by the word embedding into a 256-dimensional continuous space.
(4) For each set of tags in the multimodal document:
For a total of N classes, set up an N-dimensional vector and map the semantic tags of each document to a 0-1 vector Li through the tag dictionary.
2. Attention-based deep cross-modal feature construction
Attention-based deep cross-modal feature extraction comprises two parts: 1) attention-based dense convolutional neural network image feature construction, and 2) attention-based bidirectional long short-term memory network text feature construction. An attention mechanism is combined with a dense convolutional network to extract image features, and a bidirectional long short-term memory network with a text-oriented attention mechanism is proposed; this yields the weighted image features x and text features y. The model diagram of the whole network is shown in FIG. 3.
2.1 attention-based dense convolutional neural network
For the processed image data Ri, first apply a convolutional layer with a 7 × 7 kernel and stride 2 and a max-pooling layer with a 3 × 3 kernel and stride 2; then alternate CBAM, DenseBlock and Transition modules to extract features from the sparse images; finally apply average pooling with a 7 × 7 kernel to reduce the data dimensionality and avoid overfitting, obtaining the image feature x.
Each layer of the network structure is configured as in the following table:
table 1: DenseNet-CBAM model structure table
where k is the growth rate and _layer is the number of layers in a DenseBlock, i.e., the number of DenseLayers. The output of the i-th DenseLayer is denoted Hi; each Hi produces k feature maps, and the input of the i-th DenseLayer, for i ∈ _layer, is

x_i = H_i([x_0, x_1, …, x_(i−1)]), (1)

so the number of input feature maps of each DenseLayer can be expressed as k0 + k × (i − 1), where k0 is the initial number of input features. The Transition module acts as a buffer layer for down-sampling and consists of a batch-normalization layer, a ReLU activation, a 1 × 1 convolutional layer and a 2 × 2 average-pooling layer.
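The dense connectivity of Eq. (1) can be sketched numerically. This is an illustrative stand-in, not the patent's DenseNet: each H_i is replaced by a random linear map, whereas a real DenseLayer is BN-ReLU-Conv; the point is only that layer i receives the concatenation of all earlier outputs, so its input width is k0 + k × (i − 1).

```python
import numpy as np

def dense_block(x0, num_layers, k, rng=None):
    """Toy DenseBlock: layer i maps the concatenation of all previous
    outputs (width k0 + k*(i-1)) to k new feature channels, per Eq. (1)."""
    rng = rng or np.random.default_rng(0)
    feats = [x0]                      # running list [x_0, x_1, ..., x_{i-1}]
    for _ in range(num_layers):
        inp = np.concatenate(feats)   # input width grows by k each layer
        H = rng.standard_normal((k, inp.shape[0]))  # stand-in for BN-ReLU-Conv
        feats.append(H @ inp)         # each layer emits k new channels
    return np.concatenate(feats)

k0, k, num_layers = 64, 32, 4
out = dense_block(np.zeros(k0), num_layers, k)
```

The concatenated output width is k0 + k × num_layers, which is what the Transition module then down-samples.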
A CBAM module is inserted between each DenseBlock and Transition. CBAM consists of a channel sub-module, which computes the weight of each feature map, and a spatial sub-module, which computes the weight of each location within a feature map; the attention weight map is multiplied element-wise with the input feature map for adaptive feature refinement. For an intermediate feature map F taken as input, CBAM sequentially computes a one-dimensional channel attention map Mc(F) and a two-dimensional spatial attention map Ms(F′); the whole attention mechanism is computed as

F′ = Mc(F) ⊗ F,
F″ = Ms(F′) ⊗ F′.

For a given feature map F, the channel attention mechanism computes the weight on each channel as

Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))),

where σ is the sigmoid function and AvgPool and MaxPool denote average and maximum pooling of F. For the feature map F′, the spatial attention mechanism then computes the weight over the spatial region as

Ms(F′) = σ(f^(7×7)([AvgPool(F′); MaxPool(F′)])),

where f^(7×7) denotes a convolution with kernel size 7 × 7; average pooling and maximum pooling reduce the channel dimension of F′ to 1, the two maps are concatenated, and the convolutional layer followed by the sigmoid function yields Ms(F′).
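The CBAM computation described above can be sketched in a few lines. This is a hedged illustration with invented shapes and weights: the shared MLP of the channel sub-module is two small random matrices, and the spatial sub-module's 7 × 7 convolution is simplified to a sum of the pooled maps for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W0, W1):
    """M_c(F): pool F (C, H, W) over space, pass both descriptors through a
    shared two-layer MLP (W0 reduces, W1 expands), add, and squash."""
    avg = F.mean(axis=(1, 2))
    mx = F.max(axis=(1, 2))
    return sigmoid(W1 @ np.maximum(W0 @ avg, 0) + W1 @ np.maximum(W0 @ mx, 0))

def spatial_attention(F):
    """M_s(F'): pool over channels and squash; a real CBAM applies a 7x7
    conv to the concatenated [AvgPool; MaxPool] maps here."""
    return sigmoid(F.mean(axis=0) + F.max(axis=0))

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 6, 6))      # toy feature map, C=8
W0 = rng.standard_normal((2, 8))        # reduction MLP (C -> C/r)
W1 = rng.standard_normal((8, 2))        # expansion MLP (C/r -> C)
Fp = channel_attention(F, W0, W1)[:, None, None] * F   # F' = M_c(F) ⊗ F
Fpp = spatial_attention(Fp)[None, :, :] * Fp           # F'' = M_s(F') ⊗ F'
```

Note that both attention maps lie in (0, 1) because of the sigmoid, so CBAM only rescales the feature map, matching the "adaptive feature refinement" described in the text.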
2.2 two-way Long-short term memory network based on attention mechanism
For the input data Ti, the BiLSTM extracts the text sequence information and yields the text features output; the text attention mechanism computes the attention weights α of output, and the weighted combination of output and α gives the weighted text feature y. The text attention mechanism consists of two convolutional layers and a softmax classifier, and output contains the hidden-state vectors of every time step in the last layer of the sequence, with l = seq_len and h = 2 × hidden_size.
The BiLSTM (Bi-directional Long Short-Term Memory) combines a forward LSTM and a backward LSTM. For each element of the input sequence, each LSTM layer computes:
i_t = σ(W_ii x_t + b_ii + W_hi h_(t−1) + b_hi), (6)
f_t = σ(W_if x_t + b_if + W_hf h_(t−1) + b_hf), (7)
g_t = tanh(W_ig x_t + b_ig + W_hg h_(t−1) + b_hg), (8)
o_t = σ(W_io x_t + b_io + W_ho h_(t−1) + b_ho), (9)
c_t = f_t ⊙ c_(t−1) + i_t ⊙ g_t, (10)
h_t = o_t ⊙ tanh(c_t), (11)

where x_t is the input word vector at time t, h_t is the hidden-state vector at time t, c_t is the cell-state vector at time t, h_(t−1) is the hidden-state vector at time t − 1 (or the initial state), the W and b terms are trainable parameters, and i_t, f_t, g_t, o_t are the input, forget, cell and output gates respectively; σ is the sigmoid function, tanh is the activation function, and ⊙ is the Hadamard product. This yields a set of hidden-state vectors [h_0, h_1, …, h_(l−1)] of the same length as the sentence, where l = 484 is the sentence length.
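The gate equations above translate directly into code. This is a minimal single-step sketch with illustrative weight shapes; the BiLSTM runs such a step forward and backward over the sequence and concatenates the two hidden states.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: W, U, b hold the input, recurrent and bias parameters
    of the four gates, keyed 'i', 'f', 'g', 'o'."""
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # cell candidate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    c = f * c_prev + i * g            # cell state; * is the Hadamard product
    h = o * np.tanh(c)                # hidden state
    return h, c

rng = np.random.default_rng(0)
d, hdim = 4, 3                        # toy input and hidden sizes
W = {k: rng.standard_normal((hdim, d)) for k in 'ifgo'}
U = {k: rng.standard_normal((hdim, hdim)) for k in 'ifgo'}
b = {k: np.zeros(hdim) for k in 'ifgo'}
h, c = lstm_step(rng.standard_normal(d), np.zeros(hdim), np.zeros(hdim), W, U, b)
```

Since h = o ⊙ tanh(c) with o ∈ (0, 1), every component of the hidden state is bounded by 1 in magnitude.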
The text features obtained by the BiLSTM contain the hidden-state vectors output of every time step in the last layer of the sequence. The text attention mechanism computes the attention weights α of output and gives the weighted text feature y. The specific calculation is:

u = W_w output + b_w, (12)

the weight α_i of each text feature is then computed through a softmax function,

α = softmax(u), (13)

and the weighted text feature is

y = Σ_l α_i · output_i. (14)
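The attention pooling of Eqs. (12)-(14) can be sketched as below. W_w and b_w here are illustrative stand-ins for the mechanism's two 1 × 1 convolutional layers, which act on this shape like a learned linear scoring of each time step.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())           # shift for numerical stability
    return e / e.sum()

def attention_pool(output, W_w, b_w):
    """output: (l, h) BiLSTM hidden states. Score each step (Eq. 12),
    softmax the scores (Eq. 13), and return the weighted sum (Eq. 14)."""
    u = output @ W_w + b_w            # one scalar score per time step
    alpha = softmax(u)                # attention weights, sum to 1
    return alpha, alpha @ output      # y = sum_i alpha_i * output_i

rng = np.random.default_rng(0)
out = rng.standard_normal((10, 6))    # toy sequence: l=10 steps, h=6 dims
alpha, y = attention_pool(out, rng.standard_normal(6), 0.0)
```

The weights α form a probability distribution over time steps, so y is a convex combination of the hidden states, emphasizing the steps most associated with the image semantics.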
3. Grouped cross-modal fusion based on attention mechanism
For the image features x and text features y obtained from the attention-based deep cross-modal feature construction, the image feature x is first divided into r groups; each group is mapped, together with the text feature y, into the same one-dimensional space, and feature fusion yields the fused features {Z0, Z1, …, Zr}. For each group of fused features Zi, a channel attention mechanism computes the weight on each channel, giving the weighted feature Zi′; each Zi′ passes through a fully connected layer; the output vectors of the fully connected layers are combined by adding corresponding elements, and a sigmoid classifier yields the probability distribution P of the multi-modal document over the labels. The detailed model diagram is shown in FIG. 5.
3.1 group fusion of image features and text features
For the image features x trained by DenseNet-CBAM and the text features y constructed by BiLSTM + Att, the image features are first divided into r groups, and each group x′ is fused with the text feature y. The specific calculation formula is:
Zi=x′TWiy, (15)
where Wi is a connection (projection) matrix and Zi is the output of multi-modal split bilinear pooling. Wi is factorized: two fully connected layers map x′ and y into the same-dimensional space, their element-wise product is taken, and sum pooling with window size k is applied along that dimension to obtain Zi; the subscript i indicates the fusion of the i-th group of image features with the text feature y.
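The factorized form of Eq. (15) can be sketched as follows. This is a hedged illustration of multi-modal factorized bilinear pooling with invented shapes: U_i and V_i stand in for the two fully connected layers that realize the low-rank factorization of W_i.

```python
import numpy as np

def mfb_fuse(x_group, y, U, V, k):
    """Fuse one image-feature group with the text feature: project both into
    a shared k*d space, take the element-wise product, and sum-pool with
    window size k, giving a d-dimensional fused feature Z_i."""
    joint = (U @ x_group) * (V @ y)           # element-wise product, (k*d,)
    return joint.reshape(-1, k).sum(axis=1)   # sum pooling over windows of k

rng = np.random.default_rng(0)
k, d = 5, 8
x_group = rng.standard_normal(16)    # one group of image features
y = rng.standard_normal(32)          # text feature
U = rng.standard_normal((k * d, 16)) # stand-in for the image FC layer
V = rng.standard_normal((k * d, 32)) # stand-in for the text FC layer
Z_i = mfb_fuse(x_group, y, U, V, k)
```

Running this once per group i = 0 … r produces the fused feature set {Z_0, Z_1, …, Z_r} that step 3.2 then reweights.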
3.2 Channel attention mechanism

For each fused feature Z_i, the channel attention mechanism computes the weight M_c(Z_i) of the feature map on each channel. The specific calculation process is:

M_c(Z_i) = σ(W_1(W_0(AvgPool(Z_i))) + W_1(W_0(MaxPool(Z_i)))), (16)

where σ denotes the sigmoid function, AvgPool(Z_i) and MaxPool(Z_i) denote average pooling and max pooling of the feature Z_i, and the weights W_0 and W_1 are learned during training. The weighted fused feature Z_i' is obtained by multiplying Z_i element-wise with the channel weights:

Z_i' = M_c(Z_i) ⊗ Z_i. (17)
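A minimal NumPy sketch of this CBAM-style channel attention follows; the shared two-layer MLP (W_0, W_1), the channel-reduction ratio, and all tensor sizes are illustrative assumptions rather than the patent's trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(Z, W0, W1):
    """Channel attention over a fused feature map Z of shape (C, H, W):
    a shared MLP (W0, W1) is applied to the average-pooled and max-pooled
    channel descriptors, the results are summed and passed through a
    sigmoid, and the resulting per-channel weights rescale Z.
    """
    avg = Z.mean(axis=(1, 2))                       # (C,) average-pooled descriptor
    mx = Z.max(axis=(1, 2))                         # (C,) max-pooled descriptor
    M_c = sigmoid(W1 @ (W0 @ avg) + W1 @ (W0 @ mx)) # per-channel weights in (0, 1)
    return M_c[:, None, None] * Z                   # weighted feature Z'

rng = np.random.default_rng(2)
C = 6
Z = rng.standard_normal((C, 4, 4))
W0 = rng.standard_normal((C // 2, C))               # reduction layer of the MLP
W1 = rng.standard_normal((C, C // 2))               # expansion layer of the MLP
Zp = channel_attention(Z, W0, W1)
```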
Finally, each weighted fused feature Z_i' passes through a fully connected layer:

P_i = Z_i' A^T + b, (19)

where A^T and b are trainable parameters. The output vectors of the groups of fully connected layers are fused by adding corresponding elements, and a sigmoid classifier then computes the probability distribution P of the multi-modal document over the labels.
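The final classification step, Eq. (19) followed by element-wise addition across groups and a sigmoid, can be sketched as follows; a shared fully connected layer (A, b) across groups and all dimensions are illustrative assumptions.

```python
import numpy as np

def classify(groups, A, b):
    """Pass each weighted fused feature Z_i' through a fully connected
    layer (Eq. (19)), combine the group outputs by element-wise addition,
    and apply a sigmoid to obtain per-label probabilities P.
    """
    logits = sum(Z @ A.T + b for Z in groups)  # element-wise sum of the P_i
    return 1.0 / (1.0 + np.exp(-logits))       # sigmoid classifier

rng = np.random.default_rng(3)
r, d, n_labels = 4, 8, 41                      # 41 labels as in the experiments
groups = [rng.standard_normal(d) for _ in range(r)]
A = rng.standard_normal((n_labels, d))
b = rng.standard_normal(n_labels)
P = classify(groups, A, b)                     # probability per label
```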
4. Multi-label classification experiments on course-domain multi-modal document data

The dataset we use is image-text mixed multi-modal document data in the course domain; each multi-modal document consists of one image, one text description and several semantic labels. The dataset contains 871 multi-modal documents for training, and the number of labels is 41. The labels are shown in Table 2:

Table 2: Set of labels in the dataset

80% of the course-domain multi-modal document dataset is used for training and 20% for testing. We compare the classification performance of other pre-trained models, such as VGG16, ResNet34, DenseNet121 and BiLSTM, on the same dataset. The whole model is built with the PyTorch deep learning framework and runs on a GPU; the CUDA version is 10.1.120.
4.1 Loss function

We use a max-entropy-based criterion to optimize a multi-label one-versus-all loss between the final classification result x and the target result y. For each batch of data:

loss(x, y) = −(1/C) Σ_i [ y[i] log(σ(x[i])) + (1 − y[i]) log(1 − σ(x[i])) ],

where y[i] ∈ {0, 1}, σ is the sigmoid function, and C represents the total number of labels.
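This criterion matches the per-label binary cross-entropy averaged over the C labels (the form used by PyTorch's MultiLabelSoftMarginLoss). A NumPy sketch with a numerically stable log-sigmoid, using a made-up three-label example:

```python
import numpy as np

def multilabel_softmargin_loss(x, y):
    """Max-entropy multi-label one-versus-all loss: mean over the C labels
    of the binary cross-entropy between sigmoid(x[i]) and target y[i].
    """
    C = x.shape[-1]
    log_sig = -np.logaddexp(0.0, -x)       # log(sigmoid(x)), numerically stable
    log_one_minus = -np.logaddexp(0.0, x)  # log(1 - sigmoid(x))
    return -(y * log_sig + (1 - y) * log_one_minus).sum() / C

x = np.array([2.0, -1.0, 0.5])             # raw scores for C = 3 labels
y = np.array([1.0, 0.0, 1.0])              # multi-hot target vector
loss = multilabel_softmargin_loss(x, y)    # ≈ 0.3048
```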
4.2 Evaluation and results

The precision and recall of the labels generated by different models on the same dataset are used as evaluation metrics; the comparison is given in Table 3. For each multi-modal document, Top-3 and Top-5 serve as one of the evaluation criteria, where Top-k means a prediction is considered correct if the k labels with the highest probability contain all the true labels. We also compute the Hamming loss, the proportion of wrongly predicted label slots among all labels; the smaller the value, the stronger the classification ability of the network. In the experiments, mAP, whose underlying AP (average precision) values are averaged over all classes, is the main evaluation criterion.
Table 3: model accuracy comparison table
Claims (7)
1. A course field multi-modal document classification method based on a cross-modal attention convolutional neural network, characterized by comprising the following steps:
step 1: preprocessing of multimodal document data
Step 1.1: each multi-modal document comprises an image and a text description and is attached with a plurality of semantic tags; constructing a dictionary by using the text description and the document label set in the document; deleting the tags with the frequency of appearance less than 13, and deleting the multi-modal document when the number of semantic tags of the document is 0;
step 1.2: data preprocessing, namely randomly cutting image data into the size of 224 × 224 in length and width, and performing random horizontal turning; for text description, all text lengths are cut off and filled into length l, and a word vector model is used for learning vector representation of words in the text;
step 2: depth cross-modal feature extraction based on attention mechanism
Step 2.1: carrying out representation construction on image features by adopting dense convolutional neural network DenseNet based on space and feature attention mechanism CBAM, and recording the obtained image features asm represents the number of feature maps of the image;
step 2.2: constructing text features by adopting a bidirectional long-short term memory network (BilSTM) and a text attention mechanism, wherein the text attention mechanism consists of two convolution layers and a softmax classifier; the calculated weight is recorded asThe text feature representation after weighting is recorded asn is 4 hidden _ size, and the hidden _ size is the characteristic dimension of the hidden state of the BilSTM;
and step 3: packet cross-modal fusion based on attention mechanism
Step 3.1: dividing the image characteristics x obtained in the step 2 into r groups, and dividing each group of image characteristics into r groupsRespectively mapping the feature vector with the text feature y to the same one-dimensional space, and obtaining the fused feature { Z by adopting multi-mode split bilinear pooling fusion0,Z1,…,Zr};
Step 3.2: for each group of fused features ZiCalculating each using the channel attention mechanismWeighting the feature graph on the channel, and recording the weighted feature as Z';
step 3.3: each group is taken as a fusion feature Z with weighti' by a full tie layer; then, combining a plurality of groups of output vectors of the fully-connected layers in a mode of adding corresponding elements in the vectors, and then calculating to obtain the probability distribution of the multi-modal document on each label through a sigmoid classifierAnd finally, calculating the error between the predicted value P and the true value by adopting the maximum entropy as a loss function, and training the parameters of the model by utilizing a back propagation algorithm.
2. The course field multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 1, wherein in step 1.2 the multi-modal document data in the course field are processed according to the characteristics of the images and texts themselves, and for the i-th preprocessed multi-modal document (R_i, T_i, L_i) is finally obtained:
(1) randomly sample from the image-text mixed multi-modal document data in the course field;
(2) for each image of the multi-modal document:
(a) scale the image with the aspect ratio unchanged so that the shortest side is 256; then randomly crop the picture to 224 × 224 in height and width; randomly flip it horizontally; finally normalize the channel values to obtain R_i, where C = 3 and H = W = 224;
(3) for each text description in the multi-modal document:
(a) count the lengths of all text descriptions and select the length l = 484, below which 92% of the text lengths fall;
(b) truncate or pad all data to the same length l;
(c) use word vectors, also known as word embeddings, to map word indices to vectors in the real-number domain, and train their weights;
(d) through word embedding, embed the word indices, whose dimensionality equals the dictionary size, into a 256-dimensional continuous space;
(4) for each set of labels in the multi-modal document:
3. The course field multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 1, characterized in that the dense convolutional neural network based on the spatial and channel attention mechanism in step 2.1 constructs the image feature representation as follows: the processed image data R_i first pass through a convolution layer with a 7 × 7 kernel and stride 2 and a max pooling layer with a 3 × 3 kernel and stride 2; CBAM modules then alternate with DenseBlock and Transition modules to extract features from the sparse course-domain images; finally, average pooling with a 7 × 7 kernel yields the image feature x. The CBAM module consists of a channel sub-module and a spatial sub-module, so that the attention weight map is multiplied with the input feature map for adaptive feature refinement; the channel sub-module computes the weight of each feature map, and the spatial sub-module computes a weight for each location in a feature map. For an intermediate feature map F as input, CBAM sequentially computes a one-dimensional channel attention map M_c and a two-dimensional spatial attention map M_s; the whole attention mechanism is calculated as follows:

F' = M_c(F) ⊗ F, (1)

F'' = M_s(F') ⊗ F', (2)

where ⊗ denotes element-wise multiplication; the channel attention weight M_c(F) yields the weighted feature F', and the spatial attention weight M_s(F') yields the weighted feature F''.

The DenseBlock module consists of multiple DenseLayer layers; each DenseLayer consists of two groups of a batch normalization layer, a ReLU activation function and a convolution layer, and outputs k feature maps, so the number of input features of the i-th DenseLayer is k_0 + k * (i − 1), where k_0 is the initial number of input features and i is the layer index. The Transition module, equivalent to a buffer layer used for downsampling, consists of a batch normalization layer, a ReLU activation function, a 1 × 1 convolution layer and a 2 × 2 average pooling layer.
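The channel growth k_0 + k * (i − 1) inside a DenseBlock can be illustrated with a short sketch; the concrete values below follow DenseNet-121's first block (64 initial features, growth rate 32, 6 layers) and are given only as an example.

```python
def denselayer_input_sizes(k0, k, num_layers):
    """Input feature count of the i-th DenseLayer in a DenseBlock: each
    layer emits k feature maps that are concatenated onto its input, so
    layer i (1-indexed) receives k0 + k * (i - 1) features."""
    return [k0 + k * (i - 1) for i in range(1, num_layers + 1)]

# e.g. DenseNet-121's first block: k0 = 64 input features, growth rate k = 32
print(denselayer_input_sizes(64, 32, 6))  # [64, 96, 128, 160, 192, 224]
```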
4. The course field multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 1, characterized in that the bidirectional long short-term memory network with the text attention mechanism in step 2.2 specifically comprises:
for input data T_i, the BiLSTM extracts the text sequence information to obtain the text feature output; the text attention mechanism, composed of two convolution layers and a softmax classifier, computes the text attention weight α of output; the weighted text feature y is obtained by weighting output with α; output comprises the set of hidden state vectors at each time step of the last layer of the sequence, where l = seq_len and h = 2 × hidden_size.
5. The course field multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 4, wherein the BiLSTM model in step 2.2 specifically comprises:
for input data T_i, the BiLSTM extracts the text sequence information to obtain the text feature output; the specific calculation formulas are as follows:
i_t = σ(W_ii x_t + b_ii + W_hi h_(t-1) + b_hi), (3)

f_t = σ(W_if x_t + b_if + W_hf h_(t-1) + b_hf), (4)

g_t = tanh(W_ig x_t + b_ig + W_hg h_(t-1) + b_hg), (5)

o_t = σ(W_io x_t + b_io + W_ho h_(t-1) + b_ho), (6)

c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t, (7)

h_t = o_t ⊙ tanh(c_t), (8)
where x_t is the input word vector at time t, h_t the hidden state vector at time t, c_t the cell state vector at time t, and h_(t-1) the hidden state vector of the previous time step (or the initial state vector); W_i, b_i and b_h are trainable parameters; i_t, f_t, g_t and o_t denote the input, forget, cell and output gates respectively; σ denotes the sigmoid function, tanh is the activation function, and ⊙ denotes element-wise multiplication. This yields the hidden state vector set [h_0, h_1, ..., h_(l-1)] with the same length as the sentence, where l = 484 is the sentence length.
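A single time step of Eqs. (3)–(8) can be sketched in NumPy as below. The parameter packing (one matrix pair and one merged bias per gate, instead of the separate b_i*/b_h* biases above) and all dimensions are illustrative simplifications.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step following Eqs. (3)-(8). W, U, b hold the stacked
    parameters for the input (i), forget (f), cell (g) and output (o) gates."""
    Wi, Wf, Wg, Wo = W
    Ui, Uf, Ug, Uo = U
    bi, bf, bg, bo = b
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)  # Eq. (3): input gate
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)  # Eq. (4): forget gate
    g_t = np.tanh(Wg @ x_t + Ug @ h_prev + bg)  # Eq. (5): cell candidate
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)  # Eq. (6): output gate
    c_t = f_t * c_prev + i_t * g_t              # Eq. (7): cell state update
    h_t = o_t * np.tanh(c_t)                    # Eq. (8): hidden state
    return h_t, c_t

rng = np.random.default_rng(4)
d_in, d_h = 5, 7                                # toy input and hidden sizes
W = [rng.standard_normal((d_h, d_in)) for _ in range(4)]
U = [rng.standard_normal((d_h, d_h)) for _ in range(4)]
b = [rng.standard_normal(d_h) for _ in range(4)]
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```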
6. The course field multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 4, wherein the text attention mechanism module in step 2.2 is specifically characterized in that:
7. The course field multi-modal document classification method based on the cross-modal attention convolutional neural network as claimed in claim 1, wherein the attention-based grouped cross-modal fusion module in step 3 is specifically characterized in that:
in step 3.1, the grouped fusion of the image features and the text features is as follows:
for the image feature x obtained from the DenseNet-CBAM model and the text feature y extracted by the BiLSTM+Att model, the image features are divided into r groups {x_0', x_1', …, x_r'}, each fused with the text feature y; the specific fusion step is:

Z_i = x'^T W_i y, (9)

which is a bilinear form over the two vectors, where W_i is a connection matrix and Z_i is the output of multi-modal factorized bilinear pooling; in practice, W_i is realized by two fully connected layers that map x' and y into a space of the same dimension, and sum pooling with window size k is then applied along one dimension to obtain Z_i, where i indexes the fusion of the i-th image-feature group with the text feature y;
in step 3.2, for each fused feature Z_i, the channel attention mechanism computes the weight of the feature map on each channel, and the weighted feature is denoted Z_i'; its characteristics are as follows:
for the obtained fused features {Z_0, Z_1, …, Z_r}, the channel attention mechanism computes the weight of the feature map on each channel; the specific calculation process is:

M_c(Z_i) = σ(W_1(W_0(AvgPool(Z_i))) + W_1(W_0(MaxPool(Z_i)))), (10)

where σ denotes the sigmoid function, AvgPool(Z_i) and MaxPool(Z_i) denote average pooling and max pooling of the feature Z_i, the weights W_0 and W_1 are learned during training, and the weighted fused feature Z_i' is obtained by multiplying Z_i element-wise with M_c(Z_i);
in step 3.3, each weighted fused feature Z_i' passes through a fully connected layer:

P_i = Z_i' A^T + b, (13)

where A^T and b are trainable parameters; the output vectors of the groups of fully connected layers are then combined by adding corresponding elements, and a sigmoid classifier computes the probability distribution P of the multi-modal document over the labels.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010791032.3A CN111985369B (en) | 2020-08-07 | 2020-08-07 | Course field multi-modal document classification method based on cross-modal attention convolution neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111985369A true CN111985369A (en) | 2020-11-24 |
CN111985369B CN111985369B (en) | 2021-09-17 |
Family
ID=73444539
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112487187A (en) * | 2020-12-02 | 2021-03-12 | 杭州电子科技大学 | News text classification method based on graph network pooling |
CN112508077A (en) * | 2020-12-02 | 2021-03-16 | 齐鲁工业大学 | Social media emotion analysis method and system based on multi-modal feature fusion |
CN112507898A (en) * | 2020-12-14 | 2021-03-16 | 重庆邮电大学 | Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN |
CN112650886A (en) * | 2020-12-28 | 2021-04-13 | 电子科技大学 | Cross-modal video time retrieval method based on cross-modal dynamic convolution network |
CN112686345A (en) * | 2020-12-31 | 2021-04-20 | 江南大学 | Off-line English handwriting recognition method based on attention mechanism |
CN112685565A (en) * | 2020-12-29 | 2021-04-20 | 平安科技(深圳)有限公司 | Text classification method based on multi-mode information fusion and related equipment thereof |
CN112817604A (en) * | 2021-02-18 | 2021-05-18 | 北京邮电大学 | Android system control intention identification method and device, electronic equipment and storage medium |
CN112819052A (en) * | 2021-01-25 | 2021-05-18 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-modal fine-grained mixing method, system, device and storage medium |
CN112863081A (en) * | 2021-01-04 | 2021-05-28 | 西安建筑科技大学 | Device and method for automatic weighing, classifying and settling vegetables and fruits |
CN112925935A (en) * | 2021-04-13 | 2021-06-08 | 电子科技大学 | Image menu retrieval method based on intra-modality and inter-modality mixed fusion |
CN113052159A (en) * | 2021-04-14 | 2021-06-29 | 中国移动通信集团陕西有限公司 | Image identification method, device, equipment and computer storage medium |
CN113140023A (en) * | 2021-04-29 | 2021-07-20 | 南京邮电大学 | Text-to-image generation method and system based on space attention |
CN113221882A (en) * | 2021-05-11 | 2021-08-06 | 西安交通大学 | Image text aggregation method and system for curriculum field |
CN113221181A (en) * | 2021-06-09 | 2021-08-06 | 上海交通大学 | Table type information extraction system and method with privacy protection function |
CN113255821A (en) * | 2021-06-15 | 2021-08-13 | 中国人民解放军国防科技大学 | Attention-based image recognition method, attention-based image recognition system, electronic device and storage medium |
CN113342933A (en) * | 2021-05-31 | 2021-09-03 | 淮阴工学院 | Multi-feature interactive network recruitment text classification method similar to double-tower model |
CN113378989A (en) * | 2021-07-06 | 2021-09-10 | 武汉大学 | Multi-mode data fusion method based on compound cooperative structure characteristic recombination network |
CN113469094A (en) * | 2021-07-13 | 2021-10-01 | 上海中科辰新卫星技术有限公司 | Multi-mode remote sensing data depth fusion-based earth surface coverage classification method |
CN113792617A (en) * | 2021-08-26 | 2021-12-14 | 电子科技大学 | Image interpretation method combining image information and text information |
CN113806564A (en) * | 2021-09-22 | 2021-12-17 | 齐鲁工业大学 | Multi-mode informativeness tweet detection method and system |
CN113807340A (en) * | 2021-09-07 | 2021-12-17 | 南京信息工程大学 | Method for recognizing irregular natural scene text based on attention mechanism |
CN113961710A (en) * | 2021-12-21 | 2022-01-21 | 北京邮电大学 | Fine-grained thesis classification method and device based on multi-mode layered fusion network |
WO2022187167A1 (en) * | 2021-03-01 | 2022-09-09 | Nvidia Corporation | Neural network training technique |
WO2023045605A1 (en) * | 2021-09-22 | 2023-03-30 | 腾讯科技(深圳)有限公司 | Data processing method and apparatus, computer device, and storage medium |
CN116704537A (en) * | 2022-12-02 | 2023-09-05 | 大连理工大学 | Lightweight pharmacopoeia picture and text extraction method |
GB2616316A (en) * | 2022-02-28 | 2023-09-06 | Nvidia Corp | Neural network training technique |
CN113806564B (en) * | 2021-09-22 | 2024-05-10 | 齐鲁工业大学 | Multi-mode informative text detection method and system |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107066583A (en) * | 2017-04-14 | 2017-08-18 | 华侨大学 | A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity |
CN108595632A (en) * | 2018-04-24 | 2018-09-28 | 福州大学 | A kind of hybrid neural networks file classification method of fusion abstract and body feature |
CN109740148A (en) * | 2018-12-16 | 2019-05-10 | 北京工业大学 | A kind of text emotion analysis method of BiLSTM combination Attention mechanism |
CN110019812A (en) * | 2018-02-27 | 2019-07-16 | 中国科学院计算技术研究所 | A kind of user is from production content detection algorithm and system |
CN110209789A (en) * | 2019-05-29 | 2019-09-06 | 山东大学 | A kind of multi-modal dialog system and method for user's attention guidance |
WO2019204186A1 (en) * | 2018-04-18 | 2019-10-24 | Sony Interactive Entertainment Inc. | Integrated understanding of user characteristics by multimodal processing |
CN111079444A (en) * | 2019-12-25 | 2020-04-28 | 北京中科研究院 | Network rumor detection method based on multi-modal relationship |
CN111325155A (en) * | 2020-02-21 | 2020-06-23 | 重庆邮电大学 | Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy |
CN111461174A (en) * | 2020-03-06 | 2020-07-28 | 西北大学 | Multi-mode label recommendation model construction method and device based on multi-level attention mechanism |
Non-Patent Citations (2)
Title |
---|
MARIE KATSURAI ET AL: "Image sentiment analysis using latent correlations among visual, textual, and sentiment views", 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) * |
CAI Guoyong et al.: "Sentiment classification of social media based on a hierarchical deep correlation fusion network", Journal of Computer Research and Development * |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||