CN111259851A - Multi-modal event detection method and device - Google Patents

Multi-modal event detection method and device

Info

Publication number
CN111259851A
Authority
CN
China
Legal status
Granted
Application number
CN202010076960.1A
Other languages
Chinese (zh)
Other versions
CN111259851B (en)
Inventor
许斌
仝美涵
李涓子
侯磊
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202010076960.1A
Publication of CN111259851A
Application granted
Publication of CN111259851B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 - Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition

Abstract

The embodiment of the invention provides a multi-modal event detection method and device. The method comprises: acquiring a set of images to be detected matched with a sentence to be detected; obtaining an initial sentence expression of the sentence to be detected and an image expression of each image to be detected; updating the sentence expression according to each image expression in sequence by adopting an alternating dual attention mechanism to obtain an updated sentence expression; and obtaining a fused sentence expression by adopting a residual fusion device. By acquiring the images to be detected matched with the sentence to be detected, encoding the sentence and the images respectively to obtain a sentence expression and image expressions, updating the sentence expression according to the image expression of each image to be detected in sequence with the alternating dual attention mechanism and applying residual fusion, and then passing the fused sentence expression through an event prediction model to obtain the event detection result, the efficiency and quality of event detection on the text to be detected are improved.

Description

Multi-modal event detection method and device
Technical Field
The invention relates to the technical field of natural language processing, in particular to a multi-modal event detection method and device.
Background
Event detection aims at detecting event trigger words (usually verbs or nouns) from unstructured news stories and determining their event type. For example, given the sentence "the attendees from Ford and Toronto held a meeting", the event detection task needs to recognize the word "meeting" as an event trigger and decide that it triggers a "meet" event. At present, event detection, as a basic core technology in the field of artificial intelligence, is widely applied in tasks such as information retrieval, question answering systems, recommendation systems and knowledge base construction. The high-quality structured knowledge produced by event detection can, to a certain extent, give an intelligent model deeper object understanding and more accurate task querying and logical reasoning ability, and therefore plays a vital role in analyzing massive information.
Event detection is a challenging task because the trigger words that reflect the core type of an event are often ambiguous. This ambiguity is particularly manifested in that the same trigger word may trigger different events in different sentence contexts, and the surrounding context is usually not sufficient to disambiguate them. For example, "laid down" in "Michael has laid down the president's burden" may mean that Michael is no longer holding an important post, in which case a "leave job" event has occurred, or that Michael has literally put down a heavy object, in which case an "item transfer" event has occurred. Existing event detection methods analyze the text through a series of natural language annotation tools such as part-of-speech tagging and syntactic parsing, and then classify and extract events using the parsed text features.
However, the existing analysis methods cannot judge the type of event expressed by a news report when sentence information is missing or word semantics are ambiguous, so the obtained results are not accurate enough.
Disclosure of Invention
Because the existing method has the above problems, embodiments of the present invention provide a method and an apparatus for multi-modal event detection.
In a first aspect, an embodiment of the present invention provides a multi-modal event detection method, including:
acquiring a to-be-detected image set matched with a to-be-detected sentence in a to-be-detected text according to a preset matching rule;
obtaining an initial sentence expression of the sentence to be detected according to a preset sentence coding module, and obtaining an image expression of each image to be detected in the image set to be detected according to a preset image coding module;
updating the sentence expression of the sentence to be detected in sequence according to each image expression by adopting a preset alternating dual attention mechanism to obtain an updated sentence expression;
obtaining a fused sentence expression by adopting a preset residual fusion device according to the initial sentence expression and the updated sentence expression;
and inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
Further, the updating the sentence expression of the sentence to be detected according to the image expressions in sequence by using a preset alternating dual attention mechanism to obtain an updated sentence expression specifically includes:
sequentially acquiring an image expression m_i corresponding to the i-th image to be detected in the image set to be detected;
according to a preset multi-head attention mechanism, updating the image expression m_i by using the current sentence expression H_{i-1} of the sentence to be detected, to obtain an updated image expression m'_i according to the multi-head attention distribution;
according to the preset multi-head attention mechanism, updating the current sentence expression H_{i-1} by using the updated image expression m'_i, to obtain an updated current sentence expression H_i.
Further, the acquiring a set of images to be detected matched with sentences to be detected in a text to be detected according to a preset matching rule specifically includes:
extracting event characteristic information of a title contained in the text to be detected;
acquiring a historical text matched with the text to be detected from a preset text database according to the event characteristic information; wherein the title of the history text contains the event characteristic information;
and taking the image contained in the historical text as a to-be-detected image matched with the to-be-detected sentence, and storing the to-be-detected image into the to-be-detected image set.
Further, the obtaining of the initial sentence expression of the sentence to be detected according to the preset sentence coding module specifically includes:
and inputting the sentence to be detected into a pre-training converter BERT model with preset depth bidirectional representation to obtain an initial sentence expression of the sentence to be detected.
Further, the obtaining, according to a preset image coding module, an image expression of each image to be detected in the image set to be detected specifically includes:
and inputting each image to be detected into a preset residual error network ResNet model, and generating an image expression of each image to be detected by adopting a preset Sigmoid function to the output result of the residual error network ResNet model.
In a second aspect, an embodiment of the present invention provides a multi-modal event detection apparatus, including:
the data collection module is used for acquiring a to-be-detected image set matched with to-be-detected sentences in the to-be-detected text according to a preset matching rule;
the sentence coding module is used for obtaining an initial sentence expression of the sentence to be detected;
the image coding module is used for obtaining an image expression of each image to be detected in the image set to be detected;
the multi-picture encoder module is used for sequentially updating the sentence expression of the sentence to be detected according to each image expression by adopting a preset alternating dual attention mechanism to obtain an updated sentence expression;
the residual fusion module is used for obtaining a fused sentence expression by adopting a preset residual fusion device according to the initial sentence expression and the updated sentence expression;
and the event prediction module is used for inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
Further, the multi-picture encoder module specifically includes: an information acquisition sub-module, a first attention sub-module and a second attention sub-module; wherein,
the information acquisition sub-module is used for sequentially acquiring the image expression m_i corresponding to the i-th image to be detected in the image set to be detected;
the first attention sub-module is used for updating the image expression m_i by using the current sentence expression H_{i-1} of the sentence to be detected according to a preset multi-head attention mechanism, to obtain an updated image expression m'_i according to the multi-head attention distribution;
the second attention sub-module is configured to update the current sentence expression H_{i-1} by using the updated image expression m'_i according to the preset multi-head attention mechanism, to obtain an updated current sentence expression H_i.
Further, the data collection module is specifically configured to:
extracting event characteristic information of a title contained in the text to be detected;
acquiring a historical text matched with the text to be detected from a preset text database according to the event characteristic information; wherein the title of the history text contains the event characteristic information;
and taking the image contained in the historical text as a to-be-detected image matched with the to-be-detected sentence, and storing the to-be-detected image into the to-be-detected image set.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
a processor, a memory, a communication interface, and a communication bus; wherein,
the processor, the memory and the communication interface complete mutual communication through the communication bus;
the communication interface is used for information transmission between communication devices of the electronic equipment;
the memory stores computer program instructions executable by the processor, the processor invoking the program instructions to perform a method comprising:
acquiring a to-be-detected image set matched with a to-be-detected sentence in a to-be-detected text according to a preset matching rule;
obtaining an initial sentence expression of the sentence to be detected according to a preset sentence coding module, and obtaining an image expression of each image to be detected in the image set to be detected according to a preset image coding module;
updating the sentence expression of the sentence to be detected in sequence according to each image expression by adopting a preset alternating dual attention mechanism to obtain an updated sentence expression;
obtaining a fused sentence expression by adopting a preset residual fusion device according to the initial sentence expression and the updated sentence expression;
and inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following method:
acquiring a to-be-detected image set matched with a to-be-detected sentence in a to-be-detected text according to a preset matching rule;
obtaining an initial sentence expression of the sentence to be detected according to a preset sentence coding module, and obtaining an image expression of each image to be detected in the image set to be detected according to a preset image coding module;
updating the sentence expression of the sentence to be detected in sequence according to each image expression by adopting a preset alternating dual attention mechanism to obtain an updated sentence expression;
obtaining a fused sentence expression by adopting a preset residual fusion device according to the initial sentence expression and the updated sentence expression;
and inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
According to the multi-modal event detection method and device provided by the embodiments of the invention, the images to be detected matched with the sentence to be detected are acquired and encoded, together with the sentence, to obtain a sentence expression and image expressions; the sentence expression is updated according to the image expression of each image to be detected in sequence by adopting a preset alternating dual attention mechanism and fused through a residual connection; and the fused sentence expression is then passed through a pre-trained event prediction model to obtain the event detection result, so that the efficiency and quality of event detection on the text to be detected are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart of a multi-modal event detection method according to an embodiment of the present invention;
FIG. 2 is a flowchart of yet another multi-modal event detection method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another multi-modal event detection method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an alternating dual attention module according to an embodiment of the present invention;
FIG. 5 is a flowchart of yet another multi-modal event detection method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a multi-modal event detection apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of another multi-modal event detection apparatus according to an embodiment of the present invention;
FIG. 8 is a physical structure diagram of an electronic device.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Existing event detection methods are limited to a single modality, the text modality, and do not utilize multi-modal information, i.e., both the image modality and the text modality. In fact, fusing multi-modal information such as images is very effective for handling the ambiguity of event trigger words. Taking news as an example, the images accompanying a news report can reflect the core event of the report, so the event has a rough directionality when being disambiguated. For example, in news about terrorist attacks, the accompanying images will reflect scenes such as "refugees", "weapons" and "military", and the event types will tend to be "die", "injure" and "attack" rather than "travel", "trial" or "celebrate". With this tendency toward the type of the core event, a large amount of background knowledge is available during event detection, so the accuracy of event detection can be effectively improved. Meanwhile, the image modality can supplement more detailed information and complement the information of the text modality, thereby improving the capability of event detection. Some information that is difficult to express in the text modality can be easily reflected in the image modality, such as dressing style, facial expressions or body gestures, and such details can determine the occasion and form of an event, which is beneficial for inferring the event type.
Fig. 1 and fig. 2 are flowcharts of a multi-modal event detection method according to an embodiment of the present invention. As shown in fig. 1, the method includes:
and step S01, acquiring a to-be-detected image set matched with the to-be-detected sentence in the to-be-detected text according to a preset matching rule.
A text to be detected is obtained in a preset manner, for example a news article is obtained from a website. The news article comprises a title, news content and news images, which respectively correspond to the title, the text content and the images to be detected of the text to be detected, and the text content may be composed of a plurality of sentences to be detected.
Event detection is performed on the text to be detected; specifically, event detection is performed respectively on each sentence to be detected contained therein to obtain an event detection result corresponding to each sentence to be detected, for example, the event triggered by a keyword in the sentence to be detected.
Because the text to be detected often contains few or even no images, images matched with the text to be detected are acquired from other channels or historical data through a preset matching rule, so that a large number of images to be detected matched with the sentences to be detected in the text to be detected are obtained to form the image set to be detected.
Step S02, obtaining an initial sentence expression of the sentence to be detected according to a preset sentence coding module, obtaining an image expression of each image to be detected in the image set to be detected according to a preset image coding module, and combining to obtain an image expression set.
As shown in fig. 2, a sentence coding module and an image coding module are preset, and a preset coding model is adopted to code the sentence to be detected and each image to be detected in the image set to be detected, so as to obtain a sentence expression of the sentence to be detected and an image expression of each image to be detected.
Further, the obtaining of the initial sentence expression of the sentence to be detected according to the preset sentence coding module specifically includes:
and inputting the sentence to be detected into a pre-training converter BERT model with preset depth bidirectional representation to obtain an initial sentence expression of the sentence to be detected.
The encoding model adopted in the sentence coding module can be set according to actual needs; the embodiment of the invention takes only a preset Bidirectional Encoder Representations from Transformers (BERT) model as an example for illustration. The BERT model is a pre-trained language representation model comprising multiple stacked multi-head attention layers; it can understand sentence semantic information deeply and from multiple angles, is powerful, and is well suited to event detection on text.
The BERT model is adopted as the text feature extractor. The sentence to be detected S = <w_1, w_2, …, w_n> is input into the BERT model, and its sequence output is taken as the initial sentence expression H_0 = <h_1, h_2, …, h_n>, where w_i is a token obtained by segmenting the sentence to be detected into words or characters, and h_i is the token vector in one-to-one correspondence with w_i. The encoding process of the BERT model may be represented as:

H_0 = BERT(S)
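As an illustration, this sentence-encoding step can be sketched as follows, assuming the HuggingFace transformers implementation of BERT; the checkpoint name is an illustrative assumption, since the embodiment does not fix one.

```python
# Sketch of the sentence coding module; the checkpoint name is illustrative.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def encode_sentence(sentence: str) -> torch.Tensor:
    """Return H_0 = <h_1, ..., h_n>: one vector per token w_i of S."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():  # drop no_grad when fine-tuning BERT end to end
        outputs = bert(**inputs)
    return outputs.last_hidden_state  # shape (1, n, 768)
```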
further, the obtaining, according to a preset image coding module, an image expression of each image to be detected in the image set to be detected specifically includes:
and inputting each image to be detected into a preset residual error network ResNet model, and generating an image expression of each image to be detected by adopting a preset Sigmoid function to the output result of the residual error network ResNet model.
The coding model adopted in the image coding module can be set according to actual needs, and the embodiment of the invention only gives an illustration of one of the coding models.
A preset Residual Network (ResNet) model is adopted to perform feature extraction on each image to be detected p_i in the image set to be detected P = {p_1, p_2, …, p_k}. Each image to be detected p_i is input into the ResNet model to obtain a hidden representation u_i of the image to be detected. The encoding process of the ResNet model can be expressed as:

u_i = ResNet(p_i)

In order to map the image expression into the same dimensional space as the text, a Sigmoid function is adopted to convert the hidden representation u_i of the image to be detected into the image expression m_i:

m_i = σ(W_u u_i + b_u)

where σ(·) is the Sigmoid function, and W_u and b_u are a trainable weight matrix and bias.
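The image coding step can be sketched in the same spirit, assuming a torchvision ResNet backbone with its classification head removed; the ResNet depth and the output dimension are illustrative assumptions, not choices fixed by the embodiment.

```python
# Sketch of the image coding module: u_i = ResNet(p_i), m_i = sigmoid(W_u u_i + b_u).
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    def __init__(self, out_dim: int = 768):  # out_dim matches the text space
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Keep everything up to global average pooling; drop the classifier.
        self.resnet = nn.Sequential(*list(backbone.children())[:-1])
        self.proj = nn.Linear(backbone.fc.in_features, out_dim)  # W_u, b_u

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        u = self.resnet(image).flatten(1)   # hidden representation u_i
        return torch.sigmoid(self.proj(u))  # image expression m_i
```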
And step S03, updating the sentence expression of the sentence to be detected in sequence according to each image expression by adopting a preset alternating dual attention mechanism to obtain the updated sentence expression.
The acquired images to be detected that match the sentence to be detected tend to describe the event from different angles; for example, for a news report on an earthquake, the images to be detected may include an image of a collapsed road reporting the damage, an image of workers carrying reconstruction materials reporting the rebuilding, and so on. Therefore, the embodiment of the invention disambiguates the detected event by dynamically aggregating the information of a plurality of images to be detected.
According to the image expressions corresponding to the images to be detected in the image set to be detected, in sequence, an Alternating Dual Attention (ADA) mechanism is adopted to recursively update the sentence expression of the sentence to be detected. Specifically, the recursive update can be realized through a preset ADA sub-module: the image expression m_i corresponding to the i-th image to be detected p_i is sequentially input into the preset ADA sub-module, which updates the sentence expression H_{i-1} obtained from the previous update to obtain the i-th updated sentence expression H_i. After recursively updating over the image expressions corresponding to the k images to be detected in the image set to be detected P = {p_1, p_2, …, p_k}, the sentence expression of the sentence to be detected goes from the initial sentence expression H_0 to the sentence expression H_k after k updates.
And step S04, obtaining a fused sentence expression by adopting a preset residual fusion device according to the initial sentence expression and the updated sentence expression.
A preset residual fusion device is adopted to fuse the initial sentence expression H_0 and the sentence expression H_k after k updates, so as to obtain a fused sentence expression, which is used as the final sentence expression of the sentence to be detected.
There are many possible methods for the fusion process; the embodiment of the present invention illustrates only one of them. A residual block is used to integrate the initial sentence expression H_0 back into the updated sentence expression H_k, so as to obtain the fused sentence expression R = H_0 + H_k.
The fused sentence expression R retains the original semantics of the sentence to be detected as much as possible, and prevents the gradients of the BERT parameters from vanishing through the update process.
And step S05, inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
The event prediction module is preset, and divides all event detection results into a preset number of event types. The event prediction model is trained on a pre-acquired corpus

D = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}

which comprises a plurality of training samples x_i = <S, P> with labels y_i ∈ Y, where S, P and y_i respectively represent a training sentence in a training text, the training image set matched with that training text, and the pre-labeled event detection result corresponding to the training sentence. Inputting a training sample x_i yields an output result vector O composed of conditional probabilities O_{ijc}, where O_{ijc} represents the probability that the j-th token of the sentence x_i belongs to event category c, normalized through a softmax function to obtain the following result:
O_i = softmax(f(x_i; θ))
where f(·) denotes the composition of all the modules described above, and θ represents all the trainable parameters defined in those modules.
The optimization function adopted in the training process of the event prediction model is defined as follows:
J(θ) = − Σ_{(x_i, y_i) ∈ D} Σ_j log O_{ij y_{ij}}
Adam is used as the gradient descent optimizer.
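As a sketch of this training setup, the objective can be read as token-level softmax cross-entropy (the negative log-likelihood of the labeled event types) minimized with Adam. The event-type count, hidden size, and learning rate below are illustrative assumptions, and in practice the optimizer would cover the parameters of all the modules above, not only the prediction head.

```python
import torch
import torch.nn as nn

NUM_EVENT_TYPES = 34  # illustrative; the preset number of event types
HIDDEN = 768          # must match the dimension of the fused expression R

event_head = nn.Linear(HIDDEN, NUM_EVENT_TYPES)  # fully-connected prediction layer
criterion = nn.CrossEntropyLoss()                # softmax + negative log-likelihood
optimizer = torch.optim.Adam(event_head.parameters(), lr=1e-5)

def train_step(R: torch.Tensor, labels: torch.Tensor) -> float:
    """R: (n, HIDDEN) fused sentence expression; labels: (n,) gold type per token."""
    logits = event_head(R)            # scores per token and event category
    loss = criterion(logits, labels)  # J(theta) for this training sample
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```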
The fused sentence expression R of the sentence to be detected is input into the trained event prediction module, and the obtained event detection result gives, for each token in the sentence to be detected, the conditional probability of each event type. The obtained conditional probabilities can be further analyzed subsequently to determine the precise event type corresponding to the sentence to be detected, and even to the text to be detected.
Compared with the prior art, the multi-modal event detection method based on the alternating dual attention mechanism provided by the embodiment of the invention covers both the text modality of the sentence to be detected and the image modality of the images to be detected. The modality-specific information of each modality and the cross-modal consistent information contained in the multi-modal input are encoded into different semantic spaces; the alternating dual attention mechanism then extracts cross-modal semantic information more deeply and constructs a unified semantic space, thereby improving the performance of the event detection task in multi-modal scenarios in practical applications, improving the efficiency and quality of event analysis on texts to be detected such as news texts, and having broad application prospects.
According to the embodiment of the invention, the images to be detected matched with the sentence to be detected are obtained and encoded, together with the sentence, into a sentence expression and image expressions; the sentence expression is updated according to the image expression of each image to be detected in sequence by adopting a preset alternating dual attention mechanism and fused through a residual connection; and the fused sentence expression is then passed through the pre-trained event prediction model to obtain the event detection result, so that the efficiency and quality of event detection on the text to be detected are improved.
Fig. 3 is a flowchart of another multi-modal event detection method according to an embodiment of the present invention, and fig. 4 is a schematic structural diagram of an alternating dual attention module according to an embodiment of the present invention. As shown in fig. 3, the step S03 specifically includes:
Step S031, sequentially acquiring the image expression m_i corresponding to the i-th image to be detected in the image set to be detected.

The image coding module sequentially encodes each image to be detected p_i in the image set to be detected to obtain the corresponding image expression m_i, and the preset alternating dual attention module then updates the current sentence expression H_{i-1} to obtain the i-th updated sentence expression H_i.
Step S032, according to a preset multi-head attention mechanism, utilizing the current sentence expression H of the sentence to be detectedi-1Updating the image expression miObtaining an updated image expression m 'according to the multi-head attention distribution'i
The alternating dual attention mechanism consists of two parts: first, the text information, i.e., the sentence expression, guides the multi-head attention distribution over the image and updates the image expression; then the image information, i.e., the updated image expression, guides the multi-head attention distribution over the text, so as to update the sentence expression. Since the image information and the text information affect each other, a dual structure is adopted. Specifically, against different text backgrounds, the salient region of the same image differs; likewise, in different picture contexts, the same word may trigger different events.
Updating the image expression with the multi-head attention distribution guided by the sentence expression is referred to as the first round of updating, and updating the sentence expression with the multi-head attention distribution guided by the image expression is referred to as the second round of updating.
In the first round of updating, as shown in fig. 4, two fully-connected (Linear) layers map the current sentence expression H_{i-1} into the first two inputs of the Scaled Dot-Product Attention module, and the mapped implicit representations are labeled as the keys k and the values v. A third fully-connected layer then maps the image expression m_i into the third input of the scaled dot-product attention module, and the mapped implicit representation is labeled as the query q.
Next, the attention distribution α is learned from the query and the keys, and its dot product with the values v gives the weighted image representation z. The above process is formulated as follows:

e_i = q s_i^T / √(d_k)

α_i = exp(e_i) / Σ_{j=1}^{L} exp(e_j)

z = α v^T

where d_k represents the dimension of the hidden representations, s_i represents the embedded representation of the i-th word after interaction with the other modality, and L represents the number of words in the sentence.
The above process is repeated u times (once per attention head), and a linear transformation is applied to obtain the modified image representation h, formulated as follows:

Z = [z_1; z_2; …; z_u]

h = W_h Z + b_h

where ";" denotes concatenation in the last dimension.

A residual module sends the implicit representation q directly to the output to obtain the final updated expression:

m'_i = h + q
With the operations of the above equations uniformly labeled as Ω, the first round of the update process can be summarized as:

m'_i = Ω(m_i, H_{i-1})

where m'_i is the updated image expression.
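A compact sketch of the Ω operation is given below, assuming PyTorch's built-in multi-head attention, which internally bundles the per-head linear maps, the scaled dot-product attention, and the output transform W_h Z + b_h; the dimension and head count u are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Omega(nn.Module):
    """Omega(query_seq, context): multi-head scaled dot-product attention
    where q comes from query_seq and k, v come from context, followed by
    the residual connection m'_i = h + q described above."""
    def __init__(self, dim: int = 768, heads: int = 8):  # heads = u
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_seq: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        h, _ = self.attn(query_seq, context, context)  # q from query_seq; k, v from context
        return h + query_seq                           # send q directly to the output
```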
Step 033, utilizing the updated image expression m 'according to the preset multi-head attention mechanism'iUpdating the current sentence expression Hi-1Obtaining updated current sentence expression Hi
In the second round of update, as shown in fig. 4, the updated image expression m is similar to the first round of updatet' mapping into the first two inputs of the scaled dot product attention Module, the current sentence expression Hi-1Mapping into a third input, similar to the first round of updates, the second round of update processes can also be summarized as:
Hi=Ω(Hi-1,mt′)
wherein, the HiIs the sentence expression after i times of updating.
The image expressions corresponding to the k images to be detected in the image set to be detected P = {p_1, p_2, …, p_k} are used in sequence to update the sentence expression, and the finally updated sentence expression is H_k.
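Reusing the Ω module sketched above, the alternating recursion over the image set, together with the residual fusion R = H_0 + H_k of step S04, might be organized as follows; whether the two rounds share one Ω instance or use two separate instances is not fixed by the text.

```python
import torch

def ada_encode(H0: torch.Tensor, image_exprs: list, omega: "Omega") -> torch.Tensor:
    """H0: (1, n, dim) initial sentence expression; each m: (1, 1, dim)."""
    H = H0
    for m in image_exprs:        # i = 1 .. k, in order
        m_prime = omega(m, H)    # first round:  m'_i = Omega(m_i, H_{i-1})
        H = omega(H, m_prime)    # second round: H_i  = Omega(H_{i-1}, m'_i)
    return H0 + H                # residual fusion: R = H_0 + H_k
```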
The embodiment of the invention updates the multi-head attention distribution of the image expression by using the sentence expression, and then updates the multi-head attention distribution of the sentence expression by using the image expression, thereby realizing an alternating dual attention mechanism which is used for updating the sentence expression of the sentence to be detected, and improving the efficiency and quality of event detection of the text to be detected.
Fig. 5 is a flowchart of another multi-modal event detection method according to an embodiment of the present invention, and as shown in fig. 5, the step S01 specifically includes:
and S011, extracting event characteristic information of a title contained in the text to be detected.
When the text to be detected is obtained, a preset information extraction model is first applied to the title of the text to be detected to obtain the event feature information of the title. Specifically, the title may be structurally parsed with an Abstract Meaning Representation (AMR) parser, and event roles are extracted from the parse as the feature information, where the event roles include actors, locations, and the like.
Step S012, acquiring a historical text matched with the text to be detected from a preset text database according to the event characteristic information; wherein the title of the history text contains the event feature information.
The obtained event feature information is matched against each historical text in the pre-acquired text database to judge whether a historical text and the text to be detected correspond to the same event. Specifically, the event feature information of the title of the historical text is compared with the event feature information of the title of the text to be detected; if the former includes the latter, it is judged that the historical text and the text to be detected correspond to the same event; otherwise, it is judged that they correspond to different events. For example, if the title of the text to be detected is "wildfires raging across California" and the title of a historical text is "large-scale wildfires in California", the event feature information extracted from the two titles is determined to be "large-scale wildfires" and "California", so it can be judged that the text to be detected and the historical text correspond to the same event.
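A minimal sketch of this containment-based matching rule is given below; the feature extractor is a hypothetical stand-in (a naive keyword bag) for the AMR-based role extraction, which would return roles such as actors and locations.

```python
def extract_event_features(title: str) -> set[str]:
    """Hypothetical stand-in for AMR-based role extraction: a naive keyword bag."""
    stopwords = {"the", "a", "an", "in", "of", "across"}
    return {w.lower() for w in title.split() if w.lower() not in stopwords}

def matches_same_event(history_title: str, query_title: str) -> bool:
    """True when the historical title's event features contain those of the
    text to be detected, e.g.:
    >>> matches_same_event("large-scale wildfires hit California",
    ...                    "wildfires in California")
    True
    """
    return extract_event_features(query_title) <= extract_event_features(history_title)
```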
Each historical text in the text database can be obtained by crawling preset websites; for example, for news reports, historical texts can be obtained by searching several authoritative and influential news websites.
And S013, taking the image contained in the historical text as a to-be-detected image matched with the to-be-detected sentence, and storing the to-be-detected image into the to-be-detected image set.
The images contained in the historical texts that correspond to the same event as the text to be detected are extracted and taken as images to be detected matched with the text to be detected, which is equivalent to taking them as images to be detected matched with the sentences to be detected in the text to be detected, and they are stored into the image set to be detected.
According to the embodiment of the invention, the event feature information of the title of the text to be detected is obtained and matched against the historical texts in the text database, and the images contained in the historical texts corresponding to the same event are taken as the images to be detected matched with the sentences to be detected in the text to be detected, so as to obtain the image set to be detected, thereby improving the efficiency and quality of event detection on the text to be detected.
In summary, the embodiment of the present invention designs a multi-modal neural network event detection model based on an alternating dual attention mechanism. First, image modality information related to the text is collected, and its diversity is ensured by linking different historical texts of the same event. Then the expressions of the text and image modalities are obtained through a pre-trained language encoding model and an image encoding model respectively. Next, a recurrent neural network model with the alternating dual attention unit as its basic cell is designed to deeply fuse the image modality with the text. Finally, a fully-connected neural network judges the event type of each word. The alternating dual attention mechanism not only filters the salient regions of the image modality according to the context of the text modality, but also in turn screens the salient regions of the text modality according to the context of the image modality. This deep multi-modal fusion mechanism can discard irrelevant semantic information and retain the information matched with the event semantics, so the effect of multi-modal event detection can be improved.
Fig. 6 is a schematic structural diagram of a multi-modal event detection apparatus according to an embodiment of the present invention. As shown in fig. 6, the multi-modal event detection apparatus includes: a data collection module 10, a sentence coding module 11, an image coding module 12, a multi-picture encoder module 13, a residual fusion module 14 and an event prediction module 15; wherein,
the data collection module 10 is configured to obtain a set of images to be detected that are matched with sentences to be detected in a text to be detected according to a preset matching rule; the sentence coding module 11 is configured to obtain an initial sentence expression of the sentence to be detected; the image coding module 12 is configured to obtain an image expression of each image to be detected in the image set to be detected; the multi-picture encoder module 13 is configured to update the sentence expression of the sentence to be detected in sequence according to each image expression by using a preset alternating dual attention mechanism, so as to obtain an updated sentence expression; the residual error fusion device module 14 is configured to obtain a fused sentence expression by using a preset residual error fusion device according to the initial sentence expression and the updated sentence expression; the event prediction module 15 is configured to input the fused sentence expression into a pre-trained event prediction module, so as to obtain an event detection result corresponding to the sentence to be detected. Specifically, the method comprises the following steps:
the data collection module 10 obtains a text to be tested in a preset manner, where the text to be tested includes a title, text contents and an image to be tested, and the text contents may be composed of a plurality of sentences to be tested.
And performing event detection on the text to be detected, specifically performing event detection on each contained sentence to be detected respectively to obtain an event detection result corresponding to each sentence to be detected.
Because the text to be detected often contains few or even no images, the data collection module 10 obtains images matched with the text to be detected from other channels or historical data through a preset matching rule, so as to obtain a large number of images to be detected matched with each sentence to be detected in the text to be detected and form the image set to be detected.
The data collection module 10 sends the sentence to be detected and each image to be detected to the sentence coding module 11 and the image coding module 12, and codes the sentence to be detected and each image to be detected in the image set to be detected by using a preset coding model, so as to obtain a sentence expression of the sentence to be detected and an image expression of each image to be detected.
Further, the preset sentence coding module 11 is specifically configured to:
and inputting the sentence to be detected into a pre-training converter BERT model with preset depth bidirectional representation to obtain an initial sentence expression of the sentence to be detected.
The coding model adopted in the sentence coder module 11 can be set according to actual needs, and the embodiment of the present invention is exemplified by only a preset BERT model.
The BERT model is adopted as the text feature extractor. The sentence to be detected S = <w_1, w_2, …, w_n> is input into the BERT model, and its sequence output is taken as the initial sentence expression H_0 = <h_1, h_2, …, h_n>, where w_i is a token obtained by segmenting the sentence to be detected, and h_i is the token vector in one-to-one correspondence with w_i. The encoding process of the BERT model may be represented as:

H_0 = BERT(S)
further, the image encoding module 12 according to preset is specifically configured to:
and inputting each image to be detected into a preset residual error network ResNet model, and generating an image expression of each image to be detected by adopting a preset Sigmoid function to the output result of the residual error network ResNet model.
The coding model adopted in the image coding module 12 can be set according to actual needs, and the embodiment of the present invention only provides an example.
A preset Residual Network (ResNet) model is adopted to perform feature extraction on each image to be detected p_i in the image set to be detected P = {p_1, p_2, …, p_k}. Each image to be detected p_i is input into the ResNet model to obtain the hidden representation u_i of the image to be detected. The encoding process of the ResNet model can be expressed as:

u_i = ResNet(p_i)

In order to map the image expression into the same dimensional space as the text, a Sigmoid function is adopted to convert the hidden representation u_i of the image to be detected into the image expression m_i:

m_i = σ(W_u u_i + b_u)

where σ(·) is the Sigmoid function, and W_u and b_u are a trainable weight matrix and bias.
The multi-picture encoder module 13 sequentially obtains from the image coding module 12 the image expressions corresponding to the images to be detected in the image set to be detected, and recursively updates the sentence expression of the sentence to be detected obtained from the sentence coding module 11 by adopting the alternating dual attention mechanism. Specifically, the recursive update can be realized through a preset ADA sub-module: the multi-picture encoder module 13 sequentially inputs the image expression m_i corresponding to the i-th image to be detected p_i into the preset ADA sub-module, which updates the sentence expression H_{i-1} obtained from the previous update to obtain the i-th updated sentence expression H_i. After recursively updating over the image expressions corresponding to the k images to be detected in the image set to be detected P = {p_1, p_2, …, p_k}, the sentence expression of the sentence to be detected goes from the initial sentence expression H_0 to the sentence expression H_k after k updates, and the sentence expressions H_0 and H_k are sent to the residual fusion module 14.
The residual fusion module 14 fuses the initial sentence expression H_0 and the sentence expression H_k after k updates to obtain the fused sentence expression, which is used as the final sentence expression of the sentence to be detected.
There are many possible methods for the fusion process; the embodiment of the present invention illustrates only one of them. The residual fusion module 14 uses a residual block to integrate the initial sentence expression H_0 back into the updated sentence expression H_k, so as to obtain the fused sentence expression R = H_0 + H_k, which is sent to the event prediction module 15.
The event prediction module 15 divides all event detection results into a preset number of event types. The event prediction model is trained on a pre-acquired corpus

D = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}

which comprises a plurality of training samples x_i = <S, P> with labels y_i ∈ Y, where S, P and y_i respectively represent a training sentence in a training text, the training image set matched with that training text, and the pre-labeled event detection result corresponding to the training sentence. Inputting a training sample x_i yields an output result vector O composed of conditional probabilities O_{ijc}, where O_{ijc} represents the probability that the j-th token of the sentence x_i belongs to event category c, normalized through a softmax function to obtain the following result:
O_i = softmax(f(x_i; θ))
where f(·) denotes the composition of all the modules described above, and θ represents all the trainable parameters defined in those modules.
The optimization function adopted in the training process of the event prediction model is defined as follows:
J(θ) = − Σ_{(x_i, y_i) ∈ D} Σ_j log O_{ij y_{ij}}
Adam is used as the gradient descent optimizer.
The fused sentence expression R of the sentence to be detected is input into the trained event prediction module, and the obtained event detection result gives the conditional probability of each event type for each token in the sentence to be detected. The obtained conditional probabilities can be further analyzed subsequently to determine the precise event type corresponding to the sentence to be detected, and even to the text to be detected.
The apparatus provided in the embodiment of the present invention is configured to execute the method, and the functions of the apparatus refer to the method embodiment specifically, and detailed method flows thereof are not described herein again.
According to the embodiment of the invention, the images to be detected matched with the sentence to be detected are obtained and encoded, together with the sentence, into a sentence expression and image expressions; the sentence expression is updated according to the image expression of each image to be detected in sequence by adopting a preset alternating dual attention mechanism and fused through a residual connection; and the fused sentence expression is then passed through the pre-trained event prediction model to obtain the event detection result, so that the efficiency and quality of event detection on the text to be detected are improved.
Fig. 7 is a schematic structural diagram of another multi-modal event detection apparatus according to an embodiment of the present invention. As shown in fig. 7, the apparatus includes: a data collection module 10, a sentence coding module 11, an image coding module 12, a multi-picture encoder module 13, a residual fusion module 14, and an event prediction module 15, wherein the multi-picture encoder module 13 includes: an information acquisition sub-module 131, a first attention sub-module 132, and a second attention sub-module 133; wherein,
the information obtaining sub-module 131 is configured to sequentially obtain an image expression m corresponding to the ith image to be tested in the image set to be testedi(ii) a The first attention sub-module 132 is configured to utilize the current sentence expression H of the sentence to be tested according to a preset multi-head attention mechanismi-1Updating the image expression miGet the updated image expression mi(ii) a The second attention submodule 133 is configured to utilize the updated image expression m 'according to the preset multi-head attention mechanism'iUpdating the current sentence expression Hi-1Obtaining updated current sentence expression Hi. Specifically, the method comprises the following steps:
the multi-picture encoder module 13 may adopt a structure of a plurality of ADA sub-modules as described in the above embodiment, where each ADA sub-module corresponds to an image expression of an image to be detected, and then sequentially updates a sentence expression of a sentence to be detected according to the ADA sub-modules. The multi-picture encoder module 13 may also adopt a configuration that, as described in the embodiment of the present invention, the multi-picture encoder module is composed of an information obtaining sub-module 131, a first attention sub-module 132, and a second attention sub-module 133, which is equivalent to only including one ADA sub-module, and the ADA sub-module is cyclically used to update the sentence expression.
The image coding module 12 sequentially encodes each image to be detected p_i in the image set to be detected to obtain the corresponding image expression m_i, which is then sent to the information acquisition sub-module 131.
The first attention sub-module 132 uses two fully-connected layers to map the current sentence expression H_{i-1} into the first two inputs of the scaled dot-product attention module, and the mapped implicit representations are labeled as the keys k and the values v. A third fully-connected layer then maps the image expression m_i received by the information acquisition sub-module 131 into the third input of the scaled dot-product attention module, and the mapped implicit representation is labeled as the query q.
Next, the attention distribution α is learned from the query and the keys, and its dot product with the values v gives the weighted image representation z. The above process is formulated as follows:

e_i = q s_i^T / √(d_k)

α_i = exp(e_i) / Σ_{j=1}^{L} exp(e_j)

z = α v^T
the above process is repeated u times and a linear transformation is applied to obtain a modified image representation h, which is formulated as follows:
Z=[z1;z2;…;zu]
h=WhZ+bh
wherein, "; "denotes the splice in the last dimension.
And directly sending the implicit representation q to an output end by adopting a residual error module to obtain a final updated representation.
mi′=h+q
With the operations of the above equations uniformly labeled as Ω, the update process of the first attention sub-module 132 can be summarized as:

m'_i = Ω(m_i, H_{i-1})

where m'_i is the updated image expression.
Similarly, the second attention sub-module 133 maps the updated image expression m'_i output by the first attention sub-module 132 into the first two inputs of the scaled dot-product attention module, and maps the current sentence expression H_{i-1} into the third input. Like the first attention sub-module 132, the update process of the second attention sub-module 133 can be summarized as:

H_i = Ω(H_{i-1}, m'_i)

where H_i is the sentence expression after i updates.
The information acquisition sub-module 131 sequentially obtains the image expressions corresponding to the k images to be detected in the image set to be detected P = {p_1, p_2, …, p_k} for updating the sentence expression, and the finally updated sentence expression is H_k.
The apparatus provided in the embodiment of the present invention is configured to execute the method, and the functions of the apparatus refer to the method embodiment specifically, and detailed method flows thereof are not described herein again.
The embodiment of the invention updates the multi-head attention distribution of the image expression by using the sentence expression, and then updates the multi-head attention distribution of the sentence expression by using the image expression, thereby realizing an alternating dual attention mechanism which is used for updating the sentence expression of the sentence to be detected, and improving the efficiency and quality of event detection of the text to be detected.
Based on the foregoing embodiment, further, the data collection module is specifically configured to:
extracting event characteristic information of a title contained in the text to be detected;
acquiring a historical text matched with the text to be detected from a preset text database according to the event characteristic information; wherein the title of the history text contains the event characteristic information;
and taking the image contained in the historical text as a to-be-detected image matched with the to-be-detected sentence, and storing the to-be-detected image into the to-be-detected image set.
When the data collection module obtains a text to be detected, a preset information extraction model is applied to the title of the text to be detected to obtain the event feature information of the title. Specifically, the title may be structurally parsed with an Abstract Meaning Representation (AMR) parser, and event roles are extracted as the feature information, where the event roles include actors, locations, and the like.
The data collection module matches the obtained event feature information against each historical text in the pre-acquired text database to judge whether a historical text and the text to be detected correspond to the same event. Specifically, the event feature information of the title of the historical text is compared with the event feature information of the title of the text to be detected; if the former includes the latter, the data collection module judges that the historical text and the text to be detected correspond to the same event; otherwise, it judges that they correspond to different events.
The data collection module extracts the images contained in the historical texts corresponding to the same event as the text to be detected, takes them as images to be detected matched with the text to be detected, which is equivalent to taking them as images to be detected matched with the sentences to be detected in the text to be detected, and stores them into the image set to be detected.
The apparatus provided in the embodiment of the present invention is configured to execute the method, and the functions of the apparatus refer to the method embodiment specifically, and detailed method flows thereof are not described herein again.
According to the method and the device, the event characteristic information of the title of the text to be detected is matched against the historical texts in the text database, and the images contained in the historical texts corresponding to the same event are taken as the images to be detected matched with the sentences to be detected in the text to be detected, so as to obtain the image set to be detected, thereby improving the efficiency and quality of event detection on the text to be detected.
Fig. 8 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 8, the electronic device may include: a processor 801, a communications interface 803, a memory 802, and a communication bus 804, where the processor 801, the communications interface 803, and the memory 802 communicate with one another through the communication bus 804. The processor 801 may call logic instructions in the memory 802 to perform the above-described method.
Further, embodiments of the present invention disclose a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, which, when executed by a computer, enable the computer to perform the methods provided by the above-mentioned method embodiments.
Further, the present invention provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the methods provided by the above method embodiments.
Those of ordinary skill in the art will understand that the logic instructions in the memory 802 may be implemented as software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for multi-modal event detection, comprising:
acquiring a to-be-detected image set matched with a to-be-detected sentence in a to-be-detected text according to a preset matching rule;
obtaining an initial sentence expression of the sentence to be detected according to a preset sentence coding module, and obtaining an image expression of each image to be detected in the image set to be detected according to a preset image coding module;
updating the sentence expression of the sentence to be detected in sequence according to each image expression by adopting a preset alternating dual attention mechanism to obtain an updated sentence expression;
obtaining a fused sentence expression by adopting a preset residual fusion device according to the initial sentence expression and the updated sentence expression;
and inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
2. The multi-modal event detection method according to claim 1, wherein the updating the sentence expression of the sentence to be detected according to the image expressions in sequence by using a preset alternating dual attention mechanism to obtain an updated sentence expression specifically comprises:
sequentially acquiring an image expression m_i corresponding to the i-th image to be detected in the image set to be detected;
according to a preset multi-head attention mechanism, updating the multi-head attention distribution of the image expression m_i by using the current sentence expression H_{i-1} of the sentence to be detected, to obtain an updated image expression m'_i;
and according to the preset multi-head attention mechanism, updating the multi-head attention distribution of the current sentence expression H_{i-1} by using the updated image expression m'_i, to obtain an updated current sentence expression H_i.
3. The multi-modal event detection method according to claim 1, wherein the obtaining of the set of images to be detected matched with the sentences to be detected in the text to be detected according to the preset matching rule specifically comprises:
extracting event characteristic information of a title contained in the text to be detected;
acquiring a historical text matched with the text to be detected from a preset text database according to the event characteristic information; wherein the title of the history text contains the event characteristic information;
and taking the image contained in the historical text as a to-be-detected image matched with the to-be-detected sentence, and storing the to-be-detected image into the to-be-detected image set.
4. The multi-modal event detection method according to claim 1, wherein obtaining the initial sentence expression of the sentence to be detected according to a preset sentence encoding module specifically comprises:
and inputting the sentence to be detected into a preset deep bidirectional representation pre-trained Transformer (BERT) model to obtain the initial sentence expression of the sentence to be detected.
5. The method according to claim 1, wherein obtaining the image expression of each image to be detected in the image set to be detected according to a preset image coding module specifically comprises:
and inputting each image to be detected into a preset residual network (ResNet) model, and applying a preset Sigmoid function to the output of the ResNet model to generate the image expression of each image to be detected.
6. A multimodal event detection apparatus, comprising:
the data collection module is used for acquiring a to-be-detected image set matched with to-be-detected sentences in the to-be-detected text according to a preset matching rule;
the sentence coding module is used for obtaining an initial sentence expression of the sentence to be detected;
the image coding module is used for obtaining an image expression of each image to be detected in the image set to be detected;
the multi-picture encoder module is used for sequentially updating the sentence expression of the sentence to be detected according to each image expression by adopting a preset alternating dual attention mechanism to obtain an updated sentence expression;
the residual error fusion device module is used for obtaining a fused sentence expression by adopting a preset residual error fusion device according to the initial sentence expression and the updated sentence expression;
and the event prediction module is used for inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
7. The multi-modal event detection apparatus according to claim 6, wherein the multi-picture encoder module specifically comprises: an information acquisition sub-module, a first attention sub-module, and a second attention sub-module; wherein:
the information acquisition submodule is used for sequentially acquiring the image expression m corresponding to the ith image to be detected in the image set to be detectedi
The first attention submodule is used for utilizing the current sentence expression H of the sentence to be detected according to a preset multi-head attention mechanismi-1Updating the image expression miObtaining an updated image expression m 'according to the multi-head attention distribution'i
The second attention submodule is configured to utilize the updated image expression m 'according to the preset multi-head attention mechanism'iUpdating the current sentence expression Hi-1Obtaining updated current sentence expression Hi
8. The multi-modal event detection apparatus of claim 6, wherein the data collection module is specifically configured to:
extracting event characteristic information of a title contained in the text to be detected;
acquiring a historical text matched with the text to be detected from a preset text database according to the event characteristic information; wherein the title of the history text contains the event characteristic information;
and taking the image contained in the historical text as a to-be-detected image matched with the to-be-detected sentence, and storing the to-be-detected image into the to-be-detected image set.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the multimodal event detection method as claimed in any of claims 1 to 5 are implemented when the program is executed by the processor.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps of the multimodal event detection method as claimed in any of claims 1 to 5.
CN202010076960.1A 2020-01-23 2020-01-23 Multi-mode event detection method and device Active CN111259851B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010076960.1A CN111259851B (en) 2020-01-23 2020-01-23 Multi-mode event detection method and device

Publications (2)

Publication Number Publication Date
CN111259851A true CN111259851A (en) 2020-06-09
CN111259851B CN111259851B (en) 2021-04-23

Family

ID=70951033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010076960.1A Active CN111259851B (en) 2020-01-23 2020-01-23 Multi-mode event detection method and device

Country Status (1)

Country Link
CN (1) CN111259851B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199933A (en) * 2014-09-04 2014-12-10 华中科技大学 Multi-modal information fusion football video event detection and semantic annotation method
CN105183849A (en) * 2015-09-06 2015-12-23 华中科技大学 Event detection and semantic annotation method for snooker game videos
CN106202281A (en) * 2016-06-28 2016-12-07 广东工业大学 A kind of multi-modal data represents learning method and system
CN106529492A (en) * 2016-11-17 2017-03-22 天津大学 Video topic classification and description method based on multi-image fusion in view of network query
CN108319686A (en) * 2018-02-01 2018-07-24 北京大学深圳研究生院 Antagonism cross-media retrieval method based on limited text space
CN110134757A (en) * 2019-04-19 2019-08-16 杭州电子科技大学 A kind of event argument roles abstracting method based on bull attention mechanism
CN110232158A (en) * 2019-05-06 2019-09-13 重庆大学 Burst occurred events of public safety detection method based on multi-modal data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HYEONSEOB NAM et al.: "Dual Attention Networks for Multimodal Reasoning and Matching", arXiv *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783475A (en) * 2020-07-28 2020-10-16 北京深睿博联科技有限责任公司 Semantic visual positioning method and device based on phrase relation propagation
CN112836043A (en) * 2020-10-13 2021-05-25 讯飞智元信息科技有限公司 Long text clustering method and device based on pre-training language model
CN112328859A (en) * 2020-11-05 2021-02-05 南开大学 False news detection method based on knowledge-aware attention network
CN112328859B (en) * 2020-11-05 2022-09-20 南开大学 False news detection method based on knowledge-aware attention network
CN112685565A (en) * 2020-12-29 2021-04-20 平安科技(深圳)有限公司 Text classification method based on multi-mode information fusion and related equipment thereof
CN112685565B (en) * 2020-12-29 2023-07-21 平安科技(深圳)有限公司 Text classification method based on multi-mode information fusion and related equipment thereof
CN113535949A (en) * 2021-06-15 2021-10-22 杭州电子科技大学 Multi-mode combined event detection method based on pictures and sentences
CN113535949B (en) * 2021-06-15 2022-09-13 杭州电子科技大学 Multi-modal combined event detection method based on pictures and sentences
CN113642603A (en) * 2021-07-05 2021-11-12 北京三快在线科技有限公司 Data matching method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN111259851B (en) 2021-04-23

Similar Documents

Publication Publication Date Title
CN111259851B (en) Multi-mode event detection method and device
CN110263324B (en) Text processing method, model training method and device
CN111026842B (en) Natural language processing method, natural language processing device and intelligent question-answering system
CN111783474B (en) Comment text viewpoint information processing method and device and storage medium
CN111046132A (en) Customer service question and answer processing method and system for retrieving multiple rounds of conversations
CN113268609B (en) Knowledge graph-based dialogue content recommendation method, device, equipment and medium
WO2023134083A1 (en) Text-based sentiment classification method and apparatus, and computer device and storage medium
CN114330966A (en) Risk prediction method, device, equipment and readable storage medium
CN116578688A (en) Text processing method, device, equipment and storage medium based on multiple rounds of questions and answers
CN113868459A (en) Model training method, cross-modal characterization method, unsupervised image text matching method and unsupervised image text matching device
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN115017879A (en) Text comparison method, computer device and computer storage medium
CN110659392A (en) Retrieval method and device, and storage medium
CN113240033A (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN117574898A (en) Domain knowledge graph updating method and system based on power grid equipment
CN110852066B (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
US20240037335A1 (en) Methods, systems, and media for bi-modal generation of natural languages and neural architectures
CN115964497A (en) Event extraction method integrating attention mechanism and convolutional neural network
CN115906863A (en) Emotion analysis method, device and equipment based on comparative learning and storage medium
CN115544212A (en) Document-level event element extraction method, apparatus and medium
CN115269781A (en) Modal association degree prediction method, device, equipment, storage medium and program product
CN114579876A (en) False information detection method, device, equipment and medium
CN110633363B (en) Text entity recommendation method based on NLP and fuzzy multi-criterion decision
CN113761874A (en) Event reality prediction method and device, electronic equipment and storage medium
CN113704460B (en) Text classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant