CN113535949A - Multi-modal joint event detection method based on pictures and sentences - Google Patents

Multi-modal joint event detection method based on pictures and sentences

Info

Publication number
CN113535949A
CN113535949A (application CN202110660692.2A)
Authority
CN
China
Prior art keywords
picture
event
sentence
word
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110660692.2A
Other languages
Chinese (zh)
Other versions
CN113535949B (en)
Inventor
张旻
曹祥彪
汤景凡
姜明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110660692.2A
Publication of CN113535949A
Application granted
Publication of CN113535949B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G06F 16/50 - Information retrieval of still image data
    • G06F 16/55 - Clustering; Classification
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/285 - Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal joint event detection method based on pictures and sentences that identifies events from pictures and sentences simultaneously. On one hand, the method uses existing single-modal data sets to learn the image and text event classifiers separately; on the other hand, it uses existing picture-caption pairs to train a picture-sentence matching module that finds the picture and sentence with the highest semantic similarity in a multi-modal article, thereby obtaining feature representations of picture entities and words in a common space. These features allow the picture and text event classifiers to share parameters, yielding a shared event classifier. Finally, the model is tested with a small amount of multi-modal annotated data, and the shared event classifier is used to obtain the events, and their types, described by the pictures and sentences respectively. Because the invention identifies events from both pictures and sentences and exploits the complementarity of visual and textual features, it not only improves the performance of single-modal event classification but also finds more complete event information in articles.

Description

Multi-modal joint event detection method based on pictures and sentences
Technical Field
The invention relates to an event detection method, in particular to a multi-modal joint event detection method based on pictures and sentences, and belongs to the field of multi-modal information extraction.
Background
With the gradual introduction of modern technologies such as computers and mobile phones into ordinary households, interacting on social platforms and browsing news websites have become the main ways in which people obtain information from the network, which has greatly simplified the process of acquiring information. As a result, the number of network users consuming information keeps increasing: according to the 47th Statistical Report on China's Internet Development issued by the China Internet Network Information Center, the number of Internet users in China reached 989 million by December 2020, an increase of 85.4 million over March 2020. Consequently, a large amount of new information floods into the network every day and spreads among the public in various forms such as text, pictures and audio. Faced with this massive and disordered network information, information extraction technology can process the data and present structured information to users, thereby providing them accurately with valuable and interesting information.
Information extraction extracts structured information from pictures, text or audio for storage and display, and is also an important technical means for constructing knowledge graphs. It generally comprises three subtasks: named entity recognition, relation extraction and event extraction. Taking text as an example, the named entity recognition task discovers entities that describe geopolitical entities, facilities and person names. The purpose of the relation extraction task is to determine the binary semantic relation between two entities. The event extraction task comprises two stages: event detection (finding the trigger words in a sentence and determining the event types they trigger) and argument identification (assigning an argument role to each entity participating in the event). Compared with relation extraction, event extraction can extract the interrelations among multiple entities at the same time and thus obtains finer-grained structured information; the event extraction task is therefore more challenging.
Event detection is an important link of the event extraction task: it identifies the picture actions and text trigger words that mark the occurrence of events and classifies them into predefined event types. It is widely applied in fields such as online public opinion analysis and information collection.
Disclosure of Invention
The invention addresses the problem that the information provided by single-modal data such as pictures or sentences alone is often insufficient for correct event classification and usually requires feature information from the other modality. A multi-modal joint event detection method based on pictures and sentences is therefore proposed, which identifies events from pictures and sentences simultaneously.
The multi-modal joint event detection method based on pictures and sentences comprises the following steps:
Step 1, a text event detection module first encodes the text features to obtain the feature representation sequence C^T = {c_1^T, c_2^T, ..., c_n^T} of the words in a sentence. For the j-th candidate trigger word, its feature vector c_j^T is fed into the text event classifier Softmax_T to obtain the probability distribution over the event types triggered by the j-th candidate trigger word. The loss function of the text event classifier is defined as L_T.
Step 2, a picture event detection module encodes the picture features to obtain the feature representation sequence H^I of the action and the entities described in the picture. The picture entity feature vector is then fed into the picture event classifier Softmax_I to obtain the probability distribution over the event types described by the current picture. The loss function of the picture event classifier is defined as L_I.
Step 3, a picture-sentence matching module first uses a cross-modal attention mechanism (CMAM) to compute the association weight between each pair of a picture entity and a word. For the j-th word, the CMAM locates the important picture entities and assigns them weights, and the feature representation h_j^{T←I} of the word in the picture modality is obtained by aggregating the visual features related to the word through a weighted average. Conversely, for the i-th entity in the picture, the related words are first found in the sentence to be matched and assigned weights, and the semantic information related to the picture entity is captured through a weighted average, yielding the feature representation h_i^{I←T} of the picture entity in the text modality. Then the Euclidean distance D_{T←I} between each sentence and its feature representation sequence in the picture modality and the Euclidean distance D_{I←T} between all entities in the picture and their feature representation sequence in the text modality are added together as the similarity between the picture and the sentence. The loss function of the picture-sentence matching module is defined as L_m.
Step 4, a shared event classifier is obtained by jointly optimizing the text event detection module, the picture event detection module and the picture-sentence matching module.
Step 5, in the testing stage, for a multi-modal article, the picture-sentence matching module is first used to find the picture and the sentence with the highest similarity, and the feature representation h_i^{I←T} of the i-th picture entity in the text modality and the feature representation h_j^{T←I} of the j-th word in the picture modality are obtained. A gated attention mechanism then assigns weights to the picture entity feature vector h_i^I and to h_i^{I←T}, the multi-modal feature vector corresponding to the i-th picture entity is obtained through a weighted average, and the shared event classifier is used to obtain the event type described by the picture. Likewise, another gated attention mechanism assigns weights to c_j^T and h_j^{T←I}, the multi-modal feature representation of the j-th word is obtained through a weighted average, and the shared event classifier is then used to obtain the event type triggered by the j-th word.
Further, step 1 is specifically implemented as follows:
1-1. The text event classifier is trained on the KBP2017 English data set. The annotated data are first preprocessed to obtain the entity types, the event trigger words and the event types corresponding to the trigger words; the data set covers 5 entity types and 18 event types. Stanford CoreNLP is then used to split the original text into sentences and words and to obtain the part of speech of each word and the syntactic dependency structure of each sentence. A part-of-speech vector table and an entity-type vector table are created, and each table contains an initialization vector for the type 'null'.
1-2. The pre-trained GloVe word vector matrix is queried to obtain the word vector w_emd of each word in the sentence, the part-of-speech vector table is queried to obtain the part-of-speech vector w_pos, and the entity-type vector table is queried to obtain the entity-type vector w_entity. The real-valued vector of each word is x = {w_emd, w_pos, w_entity}, so the real-valued vector sequence of the sentence is denoted W = {x_1, x_2, ..., x_{n-1}, x_n}, where n is the length of the sentence.
1-3. The sentence real-valued vector sequence W = {x_1, x_2, ..., x_{n-1}, x_n} is used as the input of Bi-LSTMs to obtain the hidden-state vector sequence H^L of the sentence. A graph convolutional network is constructed from the syntactic dependency structure of the sentence, and H^L is fed into the GCNs to obtain the convolution vector sequence H^T of the sentence. Finally, attention is used to compute the influence weight of each element of the sequence H^T on the candidate trigger word, giving the encoded sequence C^T of the sentence. C^T is also taken as the feature representation sequence of the words in the common space.
1-4. Each word in the sentence is regarded as a candidate trigger word. For the j-th candidate trigger word (j ≤ n), its feature vector c_j^T is fed into the text event classifier:
P(y_j | S) = Softmax_T(W_T · c_j^T + b_T)
type_{w,j} = argmax(P(y_j | S))
where W_T and b_T are the weight matrix and the bias term of the text event classifier Softmax_T, P(y_j | S) is the probability distribution over the event types triggered by the j-th candidate trigger word w_j in sentence S, and type_{w,j} is the event type triggered by w_j. Meanwhile, the loss function of the text event classifier is defined as:
L_T = - Σ_{i=1}^{T} Σ_{j=1}^{n} log P(y_j = y_j* | S_i)
where T is the number of annotated sentences in the KBP2017 English data set, y_j* is the annotated event type of the word w_j, and S_i is the i-th sentence in the data set, with sentence length n.
Further, step 2 is specifically implemented as follows:
2-1. The picture event classifier is trained on the imSitu picture data set, in which 504 verbs are defined to record the actions described by pictures and 11538 entity types describe the entities appearing in pictures. First, VGG16_v is used to extract the action features of the picture, and a multi-layer perceptron MLP_v converts the verb features into a verb vector. At the same time, another network VGG16_o extracts the entity set O = {o_1, o_2, ..., o_{m-1}, o_m} of the picture, and a multi-layer perceptron MLP_o converts all entities into the corresponding noun vector sequence. Each picture is then represented by a mesh structure built from the action and the entities it describes: the action described by the picture serves as the central node of the mesh, and each entity node is connected to the action node. A graph convolutional network is then used to encode the word vector sequence corresponding to the picture features, so that the vector of the action node after convolution stores the entity feature information. The encoded picture feature vector sequence is denoted H^I, in which c_v^I is the convolution vector of the picture action node (for convenience of computation, the picture action is regarded as a picture entity in the invention); H^I is likewise the feature representation sequence of the action and the entity set of the picture in the common space.
2-2. The convolution vector c_v^I of the action node in picture I is used as the input of the picture event classifier, and the probability distribution over the event types described by the picture is obtained as:
P(y_I | I) = Softmax_I(W_I · c_v^I + b_I)
type_I = argmax(P(y_I | I))
where W_I and b_I are the weight matrix and the bias term of the picture event classifier Softmax_I, P(y_I | I) is the probability distribution over the event types triggered by picture I, and type_I is the event type described by picture I. Meanwhile, the loss function of the picture event classifier is defined as:
L_I = - Σ_{i=1}^{N} log P(y_I = y_I* | I_i)
where N is the number of annotated picture event samples in imSitu, y_I* is the annotated event type of picture I_i, and I_i is the i-th picture sample in the picture data set.
Further, step 3 is specifically implemented as follows:
3-1. The picture-sentence matching module is used to find the picture and the sentence with the highest semantic similarity in a multi-modal document containing several pictures and sentences. A cross-modal attention mechanism first computes the association weight between each pair of a picture entity and a word, and learns a word-based picture entity feature representation and a picture-entity-based word feature representation. More specifically, for each word the CMAM locates the important picture entities and assigns them weights, and the feature representation of the word in the picture modality is obtained by aggregating the visual features related to the word through a weighted average. Conversely, for each entity in the picture, the related words are first found in the sentence to be matched and assigned weights, and the semantic information related to the picture entity is captured through a weighted average, giving the feature representation of the picture entity in the text modality. Given the entity feature vector sequence H^I corresponding to picture I and the word feature vector sequence C^T of sentence S, the cross-modal attention mechanism is first used to obtain the representations of the words and the picture entities in the other modality.
3-2. To obtain the word-based picture entity feature representation, the cross-modal attention mechanism first computes the degree of association Score_ij between the i-th entity in the picture and the j-th word, i.e. the cosine similarity between the feature vector h_i^I of the i-th entity in the picture and the feature vector c_j^T of the j-th word, with value range [0, 1]. Then, according to Score_ij, the influence weight A_ij of the i-th picture entity on the j-th word is computed by normalizing the scores over all entities of the picture. Finally, the picture entity feature representation h_j^{T←I} based on the j-th word is aggregated through a weighted average, and H^{T←I} = {h_1^{T←I}, ..., h_n^{T←I}} denotes the feature representation sequence of the whole sentence in the picture modality.
3-3. To obtain the picture-entity-based word feature representation, the same computation as for h_j^{T←I} is used: for the i-th entity in the picture, an attention weight is assigned to the j-th word according to the relevance between the j-th word and the current picture entity. The word feature representation h_i^{I←T} based on the i-th picture entity is then captured through a weighted average, and H^{I←T} = {h_1^{I←T}, ..., h_m^{I←T}} likewise denotes the representation of all entities in the picture in the text modality.
3-4. To obtain the semantic similarity between the picture and the sentence, a weak-alignment scheme is adopted: the similarity between a picture and a sentence is defined as the sum of the Euclidean distance between all entities in the picture and their feature representation sequence in the text modality and the Euclidean distance between the sentence and its feature representation sequence in the picture modality.
First, the Euclidean distance between each sentence and its feature representation sequence in the picture modality is computed:
D_{T←I} = Σ_{j=1}^{n} ||c_j^T - h_j^{T←I}||_2
Then the Euclidean distance between all entities in the picture and their feature representation sequence in the text modality is:
D_{I←T} = Σ_{i=1}^{m} ||h_i^I - h_i^{I←T}||_2
Thus the semantic similarity between picture I and sentence S is defined as <I, S> = D_{T←I} + D_{I←T}. Finally, to obtain the picture-sentence pair with the highest semantic similarity <I, S>, the picture-sentence matching module is optimized with a triplet loss. For each correctly matched picture-sentence pair, a picture I^- that does not match sentence S and a sentence S^- that does not match picture I are additionally sampled to form two negative pairs <I, S^-> and <I^-, S>. Finally, the loss function of the picture-sentence matching module is defined as:
L_m = max(0, 1 + <I, S> - <I, S^->) + max(0, 1 + <I, S> - <I^-, S>)
further, step 4 is specifically implemented as follows:
4-1. To obtain event classifiers that share the weight matrix and the bias term, the feature representations of the words and of the picture action in the common space are used as the inputs of the text event classifier and of the picture event classifier respectively, and the model is jointly optimized by minimizing the objective function L = L_T + L_I + L_m, so that the text event classifier Softmax_T and the picture event classifier Softmax_I can share their weight matrix and bias term. In the testing stage, the shared event classifier is therefore used to predict the event types described by the picture and by the sentence simultaneously.
Further, step 5 is specifically implemented as follows:
5-1. The trained model is tested with the M2E2 multi-modal annotated data. For an article containing k sentences S_1, S_2, ..., S_{k-1}, S_k and l pictures I_1, I_2, ..., I_{l-1}, I_l, the picture-sentence matching module is first used to find the picture-sentence pair with the highest semantic similarity <I, S>, and at the same time the picture-entity-based word feature representation sequence H^{I←T} and the word-based picture entity feature representation sequence H^{T←I} are obtained.
5-2. In the feature fusion, for the word w_j it is considered that c_j^T and h_j^{T←I} contribute feature information of different importance to the trigger word w_j. A gated attention mechanism is therefore used to assign weights to the different feature information. The weight g_j assigned to h_j^{T←I} is computed from the cosine similarity between the j-th word feature vector c_j^T and its feature representation h_j^{T←I} in the picture modality, whose value range is [-1, 1]. Then the picture feature information related to w_j is fused through a weighted average to obtain the multi-modal feature representation vector corresponding to w_j. The gate value g_j usually lies between 0 and 1 and controls how strongly h_j^{T←I} influences the fused multi-modal feature: when g_j is small, the fused feature preserves more textual information, and when g_j is large, the picture features contribute more information to the event classification of the word w_j.
Finally, the multi-modal feature corresponding to the candidate trigger word w_j is fed into the shared event classifier to obtain the event type triggered by w_j.
5-3. Likewise, for picture I another gated attention mechanism is used to control the influence of the word features on the picture event classification. The gated attention mechanism first assigns weights to the original feature of the picture action and to its feature representation in the text modality, with the gate value computed in the same way as in step 5-2. Then the original feature of the i-th picture entity and its feature representation in the text modality are fused through a weighted average to obtain the updated multi-modal feature vector. Finally, the shared event classifier classifies the fused multi-modal feature to obtain the event type argmax(P(y_I | I)) to which the action described by picture I belongs, where I ranges over the pictures I_1, ..., I_l.
The invention has the following beneficial effects:
Aiming at the defects of the prior art, a multi-modal joint event detection method based on pictures and sentences is provided, which identifies events from pictures and sentences simultaneously. Because sufficient multi-modal annotated data are lacking, the invention adopts a joint optimization scheme: on one hand, the existing single-modal data sets (the imSitu picture data set and the KBP2017 English data set) are used to learn the picture and text event classifiers respectively; on the other hand, existing picture-caption pairs are used to train the picture-sentence matching module, which finds the picture and sentence with the highest semantic similarity in a multi-modal article and thereby obtains the feature representations of the picture entities and the words in the common space. These features help the picture and text event classifiers share parameters, resulting in a shared event classifier. Finally, a small amount of multi-modal annotated data (the M2E2 multi-modal data set) is used to test the model, and the shared event classifier is used to obtain the events, and their types, described by the pictures and sentences respectively. Because the invention identifies events from both pictures and sentences and exploits the complementarity of visual and textual features, it not only improves the performance of single-modal event classification but also finds more complete event information in articles.
Drawings
FIG. 1 is a flow chart of the overall implementation of the present invention.
FIG. 2 is a block diagram of the model training phase of the present invention.
Detailed Description
The attached drawings disclose a flow chart of a preferred embodiment of the invention in a non-limiting way; the technical solution of the present invention will be described in detail below with reference to the accompanying drawings.
Event detection is an important link of the event extraction task: it identifies the picture actions and text trigger words that mark the occurrence of events and classifies them into predefined event types, and it is widely applied in fields such as online public opinion analysis and information collection. With the diversification of the carriers that spread network information, researchers are paying increasing attention to event detection in different settings, that is, how to automatically acquire events of interest from different information carriers such as unstructured pictures and text. Moreover, the same event may appear in different forms in pictures and sentences. However, existing models either perform only single-modal event detection based on sentences or pictures, or consider only the influence of picture features on text event detection while ignoring the influence of the text context on picture event classification. To solve these problems, the invention provides a multi-modal joint event detection method based on pictures and sentences.
As shown in FIGS. 1-2, the multi-modal joint event detection method based on pictures and sentences comprises the following steps:
Step 1, a text event detection module first encodes the text features to obtain the feature representation sequence C^T = {c_1^T, c_2^T, ..., c_n^T} of the words in a sentence. For the j-th candidate trigger word, its feature vector c_j^T is fed into the text event classifier Softmax_T to obtain the probability distribution over the event types triggered by the j-th candidate trigger word. The loss function of the text event classifier is defined as L_T.
Step 2, a picture event detection module encodes the picture features to obtain the feature representation sequence H^I of the action and the entities described in the picture. The picture entity feature vector is then fed into the picture event classifier Softmax_I to obtain the probability distribution over the event types described by the current picture. The loss function of the picture event classifier is defined as L_I.
Step 3, a picture-sentence matching module first uses a cross-modal attention mechanism (CMAM) to compute the association weight between each pair of a picture entity and a word. For the j-th word, the CMAM locates the important picture entities and assigns them weights, and the feature representation h_j^{T←I} of the word in the picture modality is obtained by aggregating the visual features related to the word through a weighted average. Conversely, for the i-th entity in the picture, the related words are first found in the sentence to be matched and assigned weights, and the semantic information related to the picture entity is captured through a weighted average, yielding the feature representation h_i^{I←T} of the picture entity in the text modality. Then the Euclidean distance D_{T←I} between each sentence and its feature representation sequence in the picture modality and the Euclidean distance D_{I←T} between all entities in the picture and their feature representation sequence in the text modality are added together as the similarity between the picture and the sentence. The loss function of the picture-sentence matching module is defined as L_m.
Step 4, a shared event classifier is obtained by jointly optimizing the text event detection module, the picture event detection module and the picture-sentence matching module.
Step 5, in the testing stage, for a multi-modal article, the picture-sentence matching module is first used to find the picture and the sentence with the highest similarity, and the feature representation h_i^{I←T} of the i-th picture entity in the text modality and the feature representation h_j^{T←I} of the j-th word in the picture modality are obtained. A gated attention mechanism then assigns weights to the picture entity feature vector h_i^I and to h_i^{I←T}, the multi-modal feature vector corresponding to the i-th picture entity is obtained through a weighted average, and the shared event classifier is used to obtain the event type described by the picture. Likewise, another gated attention mechanism assigns weights to c_j^T and h_j^{T←I}, the multi-modal feature representation of the j-th word is obtained through a weighted average, and the shared event classifier is then used to obtain the event type triggered by the j-th word.
Further, step 1 is specifically implemented as follows:
1-1. The text event classifier is trained on the KBP2017 English data set. The annotated data are first preprocessed to obtain the entity types, the event trigger words and the event types corresponding to the trigger words; the data set covers 5 entity types and 18 event types. Stanford CoreNLP is then used to split the original text into sentences and words and to obtain the part of speech of each word and the syntactic dependency structure of each sentence. A part-of-speech vector table and an entity-type vector table are created, and each table contains an initialization vector for the type 'null'.
1-2. The pre-trained GloVe word vector matrix is queried to obtain the word vector w_emd of each word in the sentence, the part-of-speech vector table is queried to obtain the part-of-speech vector w_pos, and the entity-type vector table is queried to obtain the entity-type vector w_entity. The real-valued vector of each word is x = {w_emd, w_pos, w_entity}, so the real-valued vector sequence of the sentence is denoted W = {x_1, x_2, ..., x_{n-1}, x_n}, where n is the length of the sentence.
1-3. The sentence real-valued vector sequence W = {x_1, x_2, ..., x_{n-1}, x_n} is used as the input of Bi-LSTMs to obtain the hidden-state vector sequence H^L of the sentence. A graph convolutional network is constructed from the syntactic dependency structure of the sentence, and H^L is fed into the GCNs to obtain the convolution vector sequence H^T of the sentence. Finally, attention is used to compute the influence weight of each element of the sequence H^T on the candidate trigger word, giving the encoded sequence C^T of the sentence. C^T is also taken as the feature representation sequence of the words in the common space.
1-4. Each word in the sentence is regarded as a candidate trigger word. For the j-th candidate trigger word (j ≤ n), its feature vector c_j^T is fed into the text event classifier:
P(y_j | S) = Softmax_T(W_T · c_j^T + b_T)
type_{w,j} = argmax(P(y_j | S))
where W_T and b_T are the weight matrix and the bias term of the text event classifier Softmax_T, P(y_j | S) is the probability distribution over the event types triggered by the j-th candidate trigger word w_j in sentence S, and type_{w,j} is the event type triggered by w_j. Meanwhile, the loss function of the text event classifier is defined as:
L_T = - Σ_{i=1}^{T} Σ_{j=1}^{n} log P(y_j = y_j* | S_i)
where T is the number of annotated sentences in the KBP2017 English data set, y_j* is the annotated event type of the word w_j, and S_i is the i-th sentence in the data set, with sentence length n.
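As an illustration of steps 1-1 to 1-4, the text branch can be sketched in PyTorch roughly as below. This is a minimal sketch under my own assumptions: the class name TextEventEncoder, the single dense graph-convolution layer standing in for the GCNs, the dimensions and the dummy inputs are not taken from the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEventEncoder(nn.Module):
    def __init__(self, emb_dim, hidden_dim, num_event_types):
        super().__init__()
        # Bi-LSTM over the real-valued word vectors x_j = {w_emd, w_pos, w_entity}
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # one dense graph-convolution layer over the dependency graph (stand-in for the GCNs)
        self.gcn = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        # Softmax_T with weight matrix W_T and bias b_T
        self.classifier = nn.Linear(2 * hidden_dim, num_event_types)

    def forward(self, x, adj):
        # x: (batch, n, emb_dim) sentence vectors W; adj: (batch, n, n) dependency adjacency
        h_l, _ = self.bilstm(x)                                # H^L
        h_t = F.relu(self.gcn(torch.bmm(adj, h_l)))            # H^T
        att = torch.softmax(torch.bmm(h_t, h_t.transpose(1, 2)), dim=-1)
        c_t = torch.bmm(att, h_t)                              # C^T, word features in the common space
        return c_t, self.classifier(c_t)                       # logits per candidate trigger word

model = TextEventEncoder(emb_dim=350, hidden_dim=128, num_event_types=19)   # 18 types + 'null'
x = torch.randn(2, 10, 350)                                    # two toy sentences of length 10
adj = torch.eye(10).unsqueeze(0).repeat(2, 1, 1)               # toy dependency graphs (self-loops only)
c_t, logits = model(x, adj)
gold = torch.zeros(2, 10, dtype=torch.long)                    # toy trigger labels
loss_T = F.cross_entropy(logits.reshape(-1, 19), gold.reshape(-1))   # L_T as a cross-entropy over triggers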
Further, step 2 is specifically implemented as follows:
2-1. The picture event classifier is trained on the imSitu picture data set, in which 504 verbs are defined to record the actions described by pictures and 11538 entity types describe the entities appearing in pictures. First, VGG16_v is used to extract the action features of the picture, and a multi-layer perceptron MLP_v converts the verb features into a verb vector. At the same time, another network VGG16_o extracts the entity set O = {o_1, o_2, ..., o_{m-1}, o_m} of the picture, and a multi-layer perceptron MLP_o converts all entities into the corresponding noun vector sequence. Each picture is then represented by a mesh structure built from the action and the entities it describes: the action described by the picture serves as the central node of the mesh, and each entity node is connected to the action node. A graph convolutional network is then used to encode the word vector sequence corresponding to the picture features, so that the vector of the action node after convolution stores the entity feature information. The encoded picture feature vector sequence is denoted H^I, in which c_v^I is the convolution vector of the picture action node (for convenience of computation, the picture action is regarded as a picture entity in the invention); H^I is likewise the feature representation sequence of the action and the entity set of the picture in the common space.
2-2. The convolution vector c_v^I of the action node in picture I is used as the input of the picture event classifier, and the probability distribution over the event types described by the picture is obtained as:
P(y_I | I) = Softmax_I(W_I · c_v^I + b_I)
type_I = argmax(P(y_I | I))
where W_I and b_I are the weight matrix and the bias term of the picture event classifier Softmax_I, P(y_I | I) is the probability distribution over the event types triggered by picture I, and type_I is the event type described by picture I. Meanwhile, the loss function of the picture event classifier is defined as:
L_I = - Σ_{i=1}^{N} log P(y_I = y_I* | I_i)
where N is the number of annotated picture event samples in imSitu, y_I* is the annotated event type of picture I_i, and I_i is the i-th picture sample in the picture data set.
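A corresponding sketch of the picture branch of step 2 is given below. It is illustrative only: the class name ImageEventEncoder, the layer sizes and the use of a few VGG16 feature-map positions as stand-ins for the m detected entities are my assumptions; in the patent the verb and the entities come from the imSitu annotations.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16   # torchvision >= 0.13 assumed for the weights argument

class ImageEventEncoder(nn.Module):
    def __init__(self, common_dim, num_event_types, num_entities=8):
        super().__init__()
        self.backbone_v = vgg16(weights=None).features   # VGG16_v for the action/verb
        self.backbone_o = vgg16(weights=None).features   # VGG16_o for the entities
        self.mlp_v = nn.Sequential(nn.Linear(512, common_dim), nn.ReLU())   # MLP_v
        self.mlp_o = nn.Sequential(nn.Linear(512, common_dim), nn.ReLU())   # MLP_o
        self.gcn = nn.Linear(common_dim, common_dim)
        self.classifier = nn.Linear(common_dim, num_event_types)            # Softmax_I
        self.num_entities = num_entities

    def forward(self, image):
        # image: (batch, 3, 224, 224)
        fv = self.backbone_v(image).mean(dim=(2, 3))         # pooled action features
        fo = self.backbone_o(image).flatten(2).transpose(1, 2)
        fo = fo[:, : self.num_entities]                      # crude stand-in for the m entities
        verb = self.mlp_v(fv).unsqueeze(1)                   # verb vector (central node)
        nouns = self.mlp_o(fo)                               # noun vector sequence
        nodes = torch.cat([verb, nouns], dim=1)              # star graph: action + entities
        m = nouns.size(1)
        adj = torch.zeros(1 + m, 1 + m, device=image.device)
        adj[0, :] = 1.0                                      # action node connected to every entity
        adj[:, 0] = 1.0
        adj = (adj + torch.eye(1 + m, device=image.device))
        adj = adj / adj.sum(dim=1, keepdim=True)
        h_i = F.relu(self.gcn(adj @ nodes))                  # H^I in the common space
        return h_i, self.classifier(h_i[:, 0])               # classify the action-node vector c_v^I

model = ImageEventEncoder(common_dim=256, num_event_types=9)
h_i, logits = model(torch.randn(2, 3, 224, 224))
loss_I = F.cross_entropy(logits, torch.zeros(2, dtype=torch.long))   # L_I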
Further, step 3 is specifically implemented as follows:
3-1. The picture-sentence matching module is used to find the picture and the sentence with the highest semantic similarity in a multi-modal document containing several pictures and sentences. A cross-modal attention mechanism first computes the association weight between each pair of a picture entity and a word, and learns a word-based picture entity feature representation and a picture-entity-based word feature representation. More specifically, for each word the CMAM locates the important picture entities and assigns them weights, and the feature representation of the word in the picture modality is obtained by aggregating the visual features related to the word through a weighted average. Conversely, for each entity in the picture, the related words are first found in the sentence to be matched and assigned weights, and the semantic information related to the picture entity is captured through a weighted average, giving the feature representation of the picture entity in the text modality. Given the entity feature vector sequence H^I corresponding to picture I and the word feature vector sequence C^T of sentence S, the cross-modal attention mechanism is first used to obtain the representations of the words and the picture entities in the other modality.
3-2. To obtain the word-based picture entity feature representation, the cross-modal attention mechanism first computes the degree of association Score_ij between the i-th entity in the picture and the j-th word, i.e. the cosine similarity between the feature vector h_i^I of the i-th entity in the picture and the feature vector c_j^T of the j-th word, with value range [0, 1]. Then, according to Score_ij, the influence weight A_ij of the i-th picture entity on the j-th word is computed by normalizing the scores over all entities of the picture. Finally, the picture entity feature representation h_j^{T←I} based on the j-th word is aggregated through a weighted average, and H^{T←I} = {h_1^{T←I}, ..., h_n^{T←I}} denotes the feature representation sequence of the whole sentence in the picture modality.
3-3. To obtain the picture-entity-based word feature representation, the same computation as for h_j^{T←I} is used: for the i-th entity in the picture, an attention weight is assigned to the j-th word according to the relevance between the j-th word and the current picture entity. The word feature representation h_i^{I←T} based on the i-th picture entity is then captured through a weighted average, and H^{I←T} = {h_1^{I←T}, ..., h_m^{I←T}} likewise denotes the representation of all entities in the picture in the text modality.
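A compact sketch of this cross-modal attention follows. The softmax normalisation used to turn the scores Score_ij into the weights A_ij, and the clamping of the cosine scores into [0, 1], are my assumptions where the exact formulas are not reproduced in the text above; the function name cross_modal_attention is likewise mine.

import torch
import torch.nn.functional as F

def cross_modal_attention(h_img, c_txt):
    # h_img: (m, d) picture entity/action features H^I
    # c_txt: (n, d) word features C^T in the common space
    score = F.cosine_similarity(h_img.unsqueeze(1), c_txt.unsqueeze(0), dim=-1)
    score = score.clamp(min=0.0)                      # keep the [0, 1] range stated in step 3-2
    a_t_from_i = torch.softmax(score, dim=0)          # weight of entity i for word j
    a_i_from_t = torch.softmax(score, dim=1)          # weight of word j for entity i
    h_t_from_i = a_t_from_i.t() @ h_img               # (n, d): H^{T<-I}, words in the picture modality
    h_i_from_t = a_i_from_t @ c_txt                   # (m, d): H^{I<-T}, entities in the text modality
    return h_t_from_i, h_i_from_t

h_img, c_txt = torch.randn(5, 256), torch.randn(12, 256)   # toy picture (m=5) and sentence (n=12)
h_t_from_i, h_i_from_t = cross_modal_attention(h_img, c_txt)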
3-4. To obtain the semantic similarity between the picture and the sentence, a weak-alignment scheme is adopted: the similarity between a picture and a sentence is defined as the sum of the Euclidean distance between all entities in the picture and their feature representation sequence in the text modality and the Euclidean distance between the sentence and its feature representation sequence in the picture modality.
First, the Euclidean distance between each sentence and its feature representation sequence in the picture modality is computed:
D_{T←I} = Σ_{j=1}^{n} ||c_j^T - h_j^{T←I}||_2
Then the Euclidean distance between all entities in the picture and their feature representation sequence in the text modality is:
D_{I←T} = Σ_{i=1}^{m} ||h_i^I - h_i^{I←T}||_2
Thus the semantic similarity between picture I and sentence S is defined as <I, S> = D_{T←I} + D_{I←T}. Finally, to obtain the picture-sentence pair with the highest semantic similarity <I, S>, the picture-sentence matching module is optimized with a triplet loss. For each correctly matched picture-sentence pair, a picture I^- that does not match sentence S and a sentence S^- that does not match picture I are additionally sampled to form two negative pairs <I, S^-> and <I^-, S>. Finally, the loss function of the picture-sentence matching module is defined as:
L_m = max(0, 1 + <I, S> - <I, S^->) + max(0, 1 + <I, S> - <I^-, S>)
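The weak-alignment similarity and the matching loss L_m then reduce to a few lines. Summing per-element Euclidean distances is my reading of D_{T←I} and D_{I←T}; the margin form of L_m follows the formula stated above, and the helper names are illustrative.

import torch

def pair_distance(c_txt, h_t_from_i, h_img, h_i_from_t):
    d_t_from_i = (c_txt - h_t_from_i).norm(dim=-1).sum()   # D_{T<-I}: sentence vs. its picture-modality view
    d_i_from_t = (h_img - h_i_from_t).norm(dim=-1).sum()   # D_{I<-T}: entities vs. their text-modality view
    return d_t_from_i + d_i_from_t                          # <I, S>, smaller means a better match

def matching_loss(pos, neg_sentence, neg_image):
    # L_m = max(0, 1 + <I,S> - <I,S^->) + max(0, 1 + <I,S> - <I^-,S>)
    zero = torch.zeros(())
    return torch.maximum(zero, 1 + pos - neg_sentence) + torch.maximum(zero, 1 + pos - neg_image)

# toy usage with random feature sequences (n = 12 words, m = 5 entities, d = 256)
pos = pair_distance(torch.randn(12, 256), torch.randn(12, 256),
                    torch.randn(5, 256), torch.randn(5, 256))
loss_m = matching_loss(pos, pos + 2.0, pos + 2.0)           # both negative pairs 2 units farther away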
Further, step 4 is specifically implemented as follows:
4-1. To obtain event classifiers that share the weight matrix and the bias term, the feature representations of the words and of the picture action in the common space are used as the inputs of the text event classifier and of the picture event classifier respectively, and the model is jointly optimized by minimizing the objective function L = L_T + L_I + L_m, so that the text event classifier Softmax_T and the picture event classifier Softmax_I can share their weight matrix and bias term. In the testing stage, the shared event classifier is therefore used to predict the event types described by the picture and by the sentence simultaneously.
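In sketch form, step 4-1 amounts to tying the two classifiers to one set of parameters and minimising the summed objective. The single nn.Linear used as the shared Softmax_T / Softmax_I, the placeholder value for L_m, the optimiser choice and the dimensions are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

shared_classifier = nn.Linear(256, 9)        # one weight matrix / bias serves both Softmax_T and Softmax_I

text_logits = shared_classifier(torch.randn(4, 256))   # word features c_j^T in the common space
img_logits = shared_classifier(torch.randn(4, 256))    # picture action features in the common space

loss_T = F.cross_entropy(text_logits, torch.zeros(4, dtype=torch.long))
loss_I = F.cross_entropy(img_logits, torch.zeros(4, dtype=torch.long))
loss_m = torch.tensor(0.7)                   # placeholder for the matching loss computed above
loss = loss_T + loss_I + loss_m              # joint objective L = L_T + L_I + L_m

optimizer = torch.optim.Adam(shared_classifier.parameters(), lr=1e-4)
optimizer.zero_grad()
loss.backward()
optimizer.step()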
Further, step 5 is specifically implemented as follows:
5-1. The trained model is tested with the M2E2 multi-modal annotated data. For an article containing k sentences S_1, S_2, ..., S_{k-1}, S_k and l pictures I_1, I_2, ..., I_{l-1}, I_l, the picture-sentence matching module is first used to find the picture-sentence pair with the highest semantic similarity <I, S>, and at the same time the picture-entity-based word feature representation sequence H^{I←T} and the word-based picture entity feature representation sequence H^{T←I} are obtained.
5-2. In the feature fusion, for the word w_j it is considered that c_j^T and h_j^{T←I} contribute feature information of different importance to the trigger word w_j. A gated attention mechanism is therefore used to assign weights to the different feature information. The weight g_j assigned to h_j^{T←I} is computed from the cosine similarity between the j-th word feature vector c_j^T and its feature representation h_j^{T←I} in the picture modality, whose value range is [-1, 1]. Then the picture feature information related to w_j is fused through a weighted average to obtain the multi-modal feature representation vector corresponding to w_j. The gate value g_j usually lies between 0 and 1 and controls how strongly h_j^{T←I} influences the fused multi-modal feature: when g_j is small, the fused feature preserves more textual information, and when g_j is large, the picture features contribute more information to the event classification of the word w_j.
Finally, the multi-modal feature corresponding to the candidate trigger word w_j is fed into the shared event classifier to obtain the event type triggered by w_j.
5-3. Likewise, for picture I another gated attention mechanism is used to control the influence of the word features on the picture event classification. The gated attention mechanism first assigns weights to the original feature of the picture action and to its feature representation in the text modality, with the gate value computed in the same way as in step 5-2. Then the original feature of the i-th picture entity and its feature representation in the text modality are fused through a weighted average to obtain the updated multi-modal feature vector. Finally, the shared event classifier classifies the fused multi-modal feature to obtain the event type argmax(P(y_I | I)) to which the action described by picture I belongs, where I ranges over the pictures I_1, ..., I_l.
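Steps 5-2 and 5-3 apply the same gated fusion, once to the words and once to the picture entities. In the sketch below the gate is taken to be a sigmoid of the cosine similarity between the two views; the exact gate formula is not reproduced in the text above, so this choice, like the randomly initialised stand-in for the shared classifier and the function name gated_fusion, is an assumption that merely matches the described behaviour (a small gate keeps the original-modality feature, a large gate lets the other modality contribute more).

import torch
import torch.nn.functional as F

def gated_fusion(own_feat, other_feat):
    # own_feat:   feature in the original modality (c_j^T for a word, the action vector for a picture)
    # other_feat: its representation in the other modality (h_j^{T<-I} or h^{I<-T})
    gate = torch.sigmoid(F.cosine_similarity(own_feat, other_feat, dim=-1)).unsqueeze(-1)
    return (1.0 - gate) * own_feat + gate * other_feat       # fused multi-modal feature

shared_classifier = torch.nn.Linear(256, 9)                  # stand-in for the shared event classifier

# words of the best-matching sentence and their picture-modality view
fused_words = gated_fusion(torch.randn(12, 256), torch.randn(12, 256))
word_event_types = shared_classifier(fused_words).argmax(dim=-1)        # event type per candidate trigger

# picture action/entity features and their text-modality view (action node first)
fused_picture = gated_fusion(torch.randn(5, 256), torch.randn(5, 256))
picture_event_type = shared_classifier(fused_picture[0]).argmax(dim=-1)  # event type of the action node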

Claims (6)

1. A multi-modal joint event detection method based on pictures and sentences, characterized by comprising the following steps:
step 1, a text event detection module first encodes the text features to obtain the feature vector representation sequence C^T = {c_1^T, c_2^T, ..., c_n^T} of the words in a sentence; for the j-th candidate trigger word, the feature vector c_j^T of the candidate trigger word is fed into the text event classifier Softmax_T to obtain the probability distribution over the event types triggered by the j-th candidate trigger word, wherein the loss function of the text event classifier is defined as L_T;
step 2, a picture event detection module encodes the picture features to obtain the picture entity feature vector representation sequence H^I of the action and the entities described in the picture; the picture entity feature vector is then fed into the picture event classifier Softmax_I to obtain the probability distribution over the event types described by the current picture, wherein the loss function of the picture event classifier is defined as L_I;
step 3, a picture-sentence matching module first computes the association weight between each pair of a picture entity and a word by using a cross-modal attention mechanism CMAM;
for the j-th word, the CMAM locates the important picture entities and assigns them weights, and the feature representation h_j^{T←I} of the word in the picture modality is obtained by a weighted average of the picture entity features related to the word; meanwhile, for the i-th entity in the picture, the related words are found in the sentence to be matched and assigned weights, and the semantic information related to the picture entity is captured through a weighted average, so that the feature representation h_i^{I←T} of the picture entity in the text modality is obtained;
then the Euclidean distance D_{T←I} between each sentence to be matched and its feature representation sequence in the picture modality and the Euclidean distance D_{I←T} between all entities in the picture and their feature representation sequence in the text modality are added to obtain the similarity between the picture and the sentence, wherein the loss function of the picture-sentence matching module is defined as L_m;
step 4, a shared event classifier is obtained by jointly optimizing the text event detection module, the picture event detection module and the picture-sentence matching module;
step 5, in the testing stage, for a multi-modal article, the picture-sentence matching module is first used to find the picture and the sentence with the highest similarity, and the feature representation h_i^{I←T} of the i-th picture entity in the text modality and the feature representation h_j^{T←I} of the j-th word in the picture modality are obtained; a gated attention mechanism then assigns weights to the picture entity feature vector h_i^I and to the feature representation h_i^{I←T}, and the multi-modal feature vector corresponding to the i-th picture entity is obtained through a weighted average; the shared event classifier is then used to obtain the event type described by the picture; similarly, another gated attention mechanism assigns weights to the feature vector c_j^T of the candidate trigger word and to the feature representation h_j^{T←I}, the multi-modal feature representation of the j-th word is obtained through a weighted average, and the shared event classifier is then used to obtain the event type triggered by the j-th word.
2. The multi-modal joint event detection method based on pictures and sentences according to claim 1, characterized in that step 1 is implemented as follows:
1-1. training the text event classifier Softmax_T on the KBP2017 English data set: the annotated data are first preprocessed to obtain the entity types, the event trigger words and the event types corresponding to the event trigger words, covering 5 entity types and 18 event types; Stanford CoreNLP is then used to split the original text into sentences and words and to obtain the part of speech of each word and the syntactic dependency structure of each sentence; a part-of-speech vector table and an entity-type vector table are created, and each vector table contains an initialization vector for the type 'null';
1-2. querying the pre-trained GloVe word vector matrix to obtain the word vector w_emd of each word in the sentence, querying the part-of-speech vector table to obtain the part-of-speech vector w_pos, and querying the entity-type vector table to obtain the entity-type vector w_entity; the real-valued vector of each word is x = {w_emd, w_pos, w_entity}, so the real-valued vector sequence of the sentence is denoted W = {x_1, x_2, ..., x_{n-1}, x_n}, where n is the length of the sentence;
1-3. using the sentence real-valued vector sequence W = {x_1, x_2, ..., x_{n-1}, x_n} as the input of Bi-LSTMs to obtain the hidden-state vector sequence H^L of the sentence; constructing a graph convolutional network from the syntactic dependency structure of the sentence, and feeding H^L into the GCNs to obtain the convolution vector sequence H^T of the sentence; finally, using attention to compute the influence weight of each element of the sequence H^T on the candidate trigger word, obtaining the encoded sequence C^T of the sentence; meanwhile, taking C^T as the feature representation sequence of the word sequence in the common space;
1-4. regarding each word in the sentence as a candidate trigger word, and for the j-th candidate trigger word (j ≤ n), feeding its feature vector c_j^T into the text event classifier:
P(y_j | S) = Softmax_T(W_T · c_j^T + b_T)
type_{w,j} = argmax(P(y_j | S))
wherein W_T and b_T are the weight matrix and the bias term of the text event classifier Softmax_T, P(y_j | S) is the probability distribution over the event types triggered by the j-th candidate trigger word w_j in sentence S, and type_{w,j} is the event type triggered by w_j; meanwhile, the loss function of the text event classifier is defined as:
L_T = - Σ_{i=1}^{T} Σ_{j=1}^{n} log P(y_j = y_j* | S_i)
wherein T is the number of annotated sentences in the KBP2017 English data set, y_j* is the annotated event type of the word w_j, and S_i is the i-th sentence in the data set, with sentence length n.
3. The multi-modal joint event detection method based on pictures and sentences according to claim 2, characterized in that step 2 is implemented as follows:
2-1. training the picture event classifier on the imSitu picture data set, in which 504 verbs are defined to record the actions described by pictures and 11538 entity types describe the entities appearing in pictures; first using VGG16_v to extract the action features of the picture and using a multi-layer perceptron MLP_v to convert the verb features into a verb vector; at the same time, using another network VGG16_o to extract the entity set O = {o_1, o_2, ..., o_{m-1}, o_m} of the picture and then using a multi-layer perceptron MLP_o to convert all entities into the corresponding noun vector sequence; then representing each picture by a mesh structure built from the action and the entities it describes, the action described by the picture serving as the central node of the mesh structure and the entities being connected to the action node; then using a graph convolutional network to encode the word vector sequence corresponding to the picture features, so that the vector of the action node after convolution stores the entity feature information; the encoded picture entity feature vector sequence is H^I, in which c_v^I denotes the convolution vector of the picture action node; likewise, H^I is the feature representation sequence of the action and the entity set of the picture in the common space;
2-2. using the convolution vector c_v^I of the action node in picture I as the input of the picture event classifier to obtain the probability distribution over the event types described by the picture:
P(y_I | I) = Softmax_I(W_I · c_v^I + b_I)
type_I = argmax(P(y_I | I))
wherein W_I and b_I are the weight matrix and the bias term of the picture event classifier Softmax_I, P(y_I | I) is the probability distribution over the event types triggered by picture I_i, and type_I is the event type described by picture I; meanwhile, the loss function of the picture event classifier is defined as:
L_I = - Σ_{i=1}^{N} log P(y_I = y_I* | I_i)
wherein N is the number of annotated picture event samples in imSitu, y_I* is the annotated event type of picture I_i, and I_i is the i-th picture sample in the picture data set.
4. Step 3 of the picture and sentence based multimodal combined event detection method according to claim 3 is implemented as follows:
3-1, given the entity feature vector sequence H^I = {h_v^I, h_{o_1}^I, ..., h_{o_m}^I} corresponding to picture I and the word feature vector sequence H^T = {h_1^T, h_2^T, ..., h_n^T} of sentence S, a cross-modal attention mechanism is first used to obtain the feature representations of the words and picture entities in the other modality;
3-2, to obtain the word-based picture entity feature representation, the association degree Score_ij between the ith entity in the picture and the jth word in the sentence is first computed with the cross-modal attention mechanism:
Score_ij = cos(h_{o_i}^I, h_j^T) = (h_{o_i}^I · h_j^T) / (||h_{o_i}^I|| ||h_j^T||)
where cos(h_{o_i}^I, h_j^T) is the cosine similarity between the feature vector h_{o_i}^I of the ith entity in the picture and the feature vector h_j^T of the jth word in the sentence, with value range [0,1]; then, according to Score_ij, the influence weight A_ij of the ith picture entity on the jth word is computed as:
A_ij = exp(Score_ij) / Σ_{k=1}^{m} exp(Score_kj)
finally, the picture entity feature representation based on the jth word is aggregated by weighted average:
h_j^{T←I} = Σ_{i=1}^{m} A_ij · h_{o_i}^I
and H^{T←I} = {h_1^{T←I}, ..., h_n^{T←I}} denotes the feature representation sequence of the whole sentence in the picture modality;
3-3, to obtain the word feature representation based on the picture entities, the same calculation procedure as for h_j^{T←I} is adopted; for the ith entity in the picture, an attention weight is assigned to the jth word according to the relevance between the jth word and the current picture entity:
B_ij = exp(Score_ij) / Σ_{k=1}^{n} exp(Score_ik)
then the word feature representation based on the ith picture entity is captured by weighted average:
h_i^{I←T} = Σ_{j=1}^{n} B_ij · h_j^T
likewise, the representation of all entities in the picture in the text modality is H^{I←T} = {h_1^{I←T}, ..., h_m^{I←T}};
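The two attention directions of steps 3-2 and 3-3 can be sketched together as follows; normalizing the cosine scores with a softmax to obtain A_ij and B_ij is an assumption consistent with the weighted-average aggregation described above:

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(H_I, H_T):
    """Sketch of steps 3-2 / 3-3.
    H_I: (m, d) picture entity vectors h_{o_i}^I;  H_T: (n, d) word vectors h_j^T.
    Returns H_TfromI (n, d): words represented in the picture modality, and
            H_IfromT (m, d): picture entities represented in the text modality."""
    score = F.cosine_similarity(H_I.unsqueeze(1), H_T.unsqueeze(0), dim=-1)  # (m, n) Score_ij
    A = F.softmax(score, dim=0)          # weight of entity i for word j (normalized over entities)
    B = F.softmax(score, dim=1)          # weight of word j for entity i (normalized over words)
    H_TfromI = A.t() @ H_I               # h_j^{T<-I} = sum_i A_ij * h_{o_i}^I
    H_IfromT = B @ H_T                   # h_i^{I<-T} = sum_j B_ij * h_j^T
    return H_TfromI, H_IfromT
```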
3-4, adopting a weak-consistency alignment scheme, the similarity between the picture and the sentence is defined as the sum of the Euclidean distance between all entities in the picture and their feature representation sequence in the text modality, and the Euclidean distance between the sentence and its feature representation sequence in the picture modality;
first, the Euclidean distance between each sentence and its feature representation sequence in the picture modality is computed:
D_{T←I} = Σ_{j=1}^{n} ||h_j^T - h_j^{T←I}||_2
then the Euclidean distance between all entities in the picture and their feature representation sequence in the text modality is computed:
D_{I←T} = Σ_{i=1}^{m} ||h_{o_i}^I - h_i^{I←T}||_2
thus, the semantic similarity between picture I and sentence S is defined as <I, S> = D_{T←I} + D_{I←T};
To obtain the picture-sentence pair <I, S> with the highest semantic similarity, a triplet loss is used to optimize the picture-sentence matching module; for each correctly matched picture-sentence pair, a picture I^- that does not match sentence S and a sentence S^- that does not match picture I are additionally sampled to form two negative pairs <I, S^-> and <I^-, S>;
Finally, the loss function of the picture-sentence matching module is defined as:
L_m = max(0, 1 + <I, S> - <I, S^->) + max(0, 1 + <I, S> - <I^-, S>).
5. step 4 of the multi-modal combined event detection method based on pictures and sentences according to claim 4 is implemented as follows:
4-1. in order to obtain event classifiers sharing weight and bias term, respectively using the feature representation of words and picture actions in a common space as the input of the text and picture event classifiers, and finally, minimizing an objective function L ═ LT+LI+LmPerforming combined optimization on the models; enabling text event classifiersSoftmaxTAnd picture event classifier SoftmaxIThe weight matrix and bias terms can be shared; thus, in the testing phase, the shared event classifier is used to predict the event types described by the picture and sentence simultaneously.
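One simple way to realize the weight and bias sharing between Softmax_T and Softmax_I is to route both modalities through the same linear layer, as in the sketch below; the equal weighting of the three loss terms follows L = L_T + L_I + L_m, while the layer sizes are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class SharedEventClassifier(nn.Module):
    """Sketch: one linear layer (W, b) serves as both Softmax_T and Softmax_I."""
    def __init__(self, hidden=300, num_event_types=8):
        super().__init__()
        self.linear = nn.Linear(hidden, num_event_types)

    def forward(self, h):                        # h: word vector h_j^T or action vector h_v^I
        return F.log_softmax(self.linear(h), dim=-1)

def joint_loss(L_T, L_I, L_m):
    # L = L_T + L_I + L_m, minimized jointly over the text, picture and matching modules
    return L_T + L_I + L_m
```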
6. Step 5 of the multi-modal combined event detection method based on pictures and sentences according to claim 5 is implemented as follows:
5-1. The trained model is tested with the M2E2 multi-modal annotated data; for an article containing k sentences S_1, S_2, ..., S_{k-1}, S_k and l pictures I_1, I_2, ..., I_{l-1}, I_l, the picture-sentence matching module is first used to find the picture-sentence pair <I, S> with the highest semantic similarity, and at the same time the picture-entity-based word feature representation sequence H^{I←T} and the word-based picture entity feature representation sequence H^{T←I} are obtained.
5-2, in the feature fusion stage, for a candidate trigger word w_j, the original word feature h_j^T and its picture-modality representation h_j^{T←I} are considered to contribute different amounts of feature information to the event type prediction of w_j; a gated attention mechanism is therefore used to assign weights to the different feature information, and the weight g_j of h_j^{T←I} is calculated as follows:
cos(h_j^T, h_j^{T←I}) = (h_j^T · h_j^{T←I}) / (||h_j^T|| ||h_j^{T←I}||)
g_j = σ(cos(h_j^T, h_j^{T←I}))
where σ(·) denotes the sigmoid function and cos(h_j^T, h_j^{T←I}) is the cosine similarity between the feature vector h_j^T of the jth candidate trigger word and its feature representation h_j^{T←I} in the picture modality, with value range [-1,1]; then the picture feature information related to w_j is fused by weighted average, giving the multi-modal feature representation vector h_j^M corresponding to w_j:
h_j^M = (1 - g_j) · h_j^T + g_j · h_j^{T←I}
where g_j usually takes a value between 0 and 1 and controls the degree of influence of h_j^{T←I} on the fused multi-modal feature h_j^M; when g_j is small, the fused feature preserves more textual information, and when g_j is large, the picture features contribute more information to the event classification of word w_j;
finally, the multi-modal feature h_j^M corresponding to the candidate trigger word w_j is input to the shared event classifier to obtain the event type triggered by word w_j, type_{w_j} = argmax(P(y_{w_j} | S)).
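A sketch of the gated fusion of step 5-2; computing the gate g_j as a sigmoid of the cosine similarity is an assumption consistent with the [-1,1] similarity range and the (0,1) gate range described above:

```python
import torch
import torch.nn.functional as F

def gated_fuse(h_orig, h_cross):
    """Fuse a feature with its cross-modal counterpart.
    h_orig:  original feature (e.g. word vector h_j^T)
    h_cross: its representation in the other modality (e.g. h_j^{T<-I})."""
    g = torch.sigmoid(F.cosine_similarity(h_orig, h_cross, dim=-1)).unsqueeze(-1)
    return (1.0 - g) * h_orig + g * h_cross   # small g keeps text info, large g adds picture info

# usage sketch: classify the fused trigger-word feature with the shared classifier
# log_p = shared_classifier(gated_fuse(h_T[j], H_TfromI[j]))
# event_type = log_p.argmax(-1)
```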
5-3, similarly, for picture I another gated attention is used to control the influence of the word features on the picture event classification; first, the gated attention mechanism assigns the weights 1 - g_i and g_i respectively to the original feature h_{o_i}^I of the picture action and entity nodes and to its feature representation h_i^{I←T} in the text modality, where g_i is calculated as follows:
g_i = σ(cos(h_{o_i}^I, h_i^{I←T}))
then the original feature h_{o_i}^I of the ith picture node and its feature representation h_i^{I←T} in the text modality are fused by weighted average, giving the updated multi-modal feature vector:
h_{o_i}^M = (1 - g_i) · h_{o_i}^I + g_i · h_i^{I←T}
finally, the shared event classifier classifies the fused multi-modal feature of the action node to obtain the event type argmax(P(y_I | I)) to which the action described by the picture belongs, where i = 1, 2, ..., m.
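For the picture side of step 5-3, the same gate can be reused on the action node; the snippet below is only a usage sketch relying on the illustrative gated_fuse and SharedEventClassifier helpers defined in the earlier sketches, with random vectors standing in for real features:

```python
import torch

# Usage sketch for step 5-3: fuse the picture action node with its text-modality view
# and classify it with the shared classifier.
h_v_I = torch.randn(300)          # original action-node vector h_v^I (illustrative stand-in)
h_v_IfromT = torch.randn(300)     # its representation in the text modality (illustrative stand-in)
clf = SharedEventClassifier()
picture_event_type = clf(gated_fuse(h_v_I, h_v_IfromT)).argmax(-1)   # argmax(P(y_I | I))
```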
CN202110660692.2A 2021-06-15 2021-06-15 Multi-modal combined event detection method based on pictures and sentences Active CN113535949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110660692.2A CN113535949B (en) 2021-06-15 2021-06-15 Multi-modal combined event detection method based on pictures and sentences


Publications (2)

Publication Number Publication Date
CN113535949A true CN113535949A (en) 2021-10-22
CN113535949B CN113535949B (en) 2022-09-13

Family

ID=78124947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110660692.2A Active CN113535949B (en) 2021-06-15 2021-06-15 Multi-modal combined event detection method based on pictures and sentences

Country Status (1)

Country Link
CN (1) CN113535949B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114418038A (en) * 2022-03-29 2022-04-29 北京道达天际科技有限公司 Space-based information classification method and device based on multi-mode fusion and electronic equipment
WO2023093574A1 (en) * 2021-11-25 2023-06-01 北京邮电大学 News event search method and system based on multi-level image-text semantic alignment model


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017139764A1 (en) * 2016-02-12 2017-08-17 Sri International Zero-shot event detection using semantic embedding
CN111259851A (en) * 2020-01-23 2020-06-09 清华大学 Multi-mode event detection method and device
CN112163416A (en) * 2020-10-09 2021-01-01 北京理工大学 Event joint extraction method for merging syntactic and entity relation graph convolution network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JINGLI ZHANG et al.: "Interactive learning for joint event and relation extraction", Springer *
QIAN Shengsheng: "A Survey of Multimedia Social Event Analysis" (多媒体社会事件分析综述), Computer Science (计算机科学) *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant