CN111259851A - Multi-modal event detection method and device - Google Patents

Multi-modal event detection method and device

Info

Publication number
CN111259851A
Authority
CN
China
Legal status
Granted
Application number
CN202010076960.1A
Other languages
Chinese (zh)
Other versions
CN111259851B (en)
Inventor
许斌
仝美涵
李涓子
侯磊
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202010076960.1A
Publication of CN111259851A
Application granted
Publication of CN111259851B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 - Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition

Abstract

The embodiment of the invention provides a multi-modal event detection method and device. The method comprises: acquiring a set of images to be detected matched with a sentence to be detected; obtaining an initial sentence expression of the sentence to be detected and an image expression of each image to be detected; updating the sentence expression according to each image expression in sequence by adopting an alternating dual attention mechanism to obtain an updated sentence expression; and obtaining a fused sentence expression by adopting a residual fusion device. By acquiring the images to be detected matched with the sentence to be detected, encoding the sentence and the images respectively to obtain a sentence expression and image expressions, updating the sentence expression according to the image expression of each image to be detected in sequence with the alternating dual attention mechanism and applying residual fusion, and then passing the fused sentence expression through an event prediction model to obtain the event detection result, the efficiency and quality of event detection on the text to be detected are improved.

Description

Multi-modal event detection method and device
Technical Field
The invention relates to the technical field of natural language processing, in particular to a multi-modal event detection method and device.
Background
Event detection aims at detecting event trigger words (usually verbs or nouns) from unstructured news stories and determining their event type. For example, given the sentence "the attendees from Ford and Toronto held a meeting", the event detection task needs to recognize the word "meeting" as an event trigger and decide that it triggers a "meet" event. At present, event detection, as a basic core technology in the field of artificial intelligence, is widely applied in tasks such as information retrieval, question answering systems, recommendation systems and knowledge base construction. The high-quality structured knowledge produced by event detection can, to a certain extent, give an intelligent model deeper object understanding and more accurate task querying and logical reasoning ability, and therefore plays a vital role in analyzing massive information.
Event detection is a challenging task because the trigger words that reflect the core type of an event are often ambiguous. This ambiguity is particularly manifested in that the same trigger word may trigger different events in different sentence contexts, and the surrounding context is usually not sufficient to disambiguate them. For example, "laid down" in "Michael has laid down the president's burden" may mean that Michael is no longer holding an important post, in which case a "leave job" event has occurred, or that Michael has literally put down a heavy object, in which case an "item transfer" event has occurred. Existing event detection methods analyze the text through a series of natural language annotation tools such as part-of-speech tagging and syntactic parsing, and then classify and extract events using the parsed text features.
However, the existing analysis methods cannot judge the type of event expressed by a news report when sentence information is missing or word semantics are ambiguous, so the obtained results are not accurate enough.
Disclosure of Invention
Because the existing method has the above problems, embodiments of the present invention provide a method and an apparatus for multi-modal event detection.
In a first aspect, an embodiment of the present invention provides a multi-modal event detection method, including:
acquiring a to-be-detected image set matched with a to-be-detected sentence in a to-be-detected text according to a preset matching rule;
obtaining an initial sentence expression of the sentence to be detected according to a preset sentence coding module, and obtaining an image expression of each image to be detected in the image set to be detected according to a preset image coding module;
updating the sentence expression of the sentence to be detected in sequence according to each image expression by adopting a preset alternating dual attention mechanism to obtain an updated sentence expression;
obtaining a fused sentence expression by adopting a preset residual fusion device according to the initial sentence expression and the updated sentence expression;
and inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
Further, the updating the sentence expression of the sentence to be detected according to the image expressions in sequence by using a preset alternating dual attention mechanism to obtain an updated sentence expression specifically includes:
sequentially acquiring an image expression m_i corresponding to the i-th image to be detected in the image set to be detected;
according to a preset multi-head attention mechanism, updating the image expression m_i by using the current sentence expression H_{i-1} of the sentence to be detected, to obtain an updated image expression m'_i according to the multi-head attention distribution;
according to the preset multi-head attention mechanism, updating the current sentence expression H_{i-1} by using the updated image expression m'_i, to obtain an updated current sentence expression H_i.
Further, the acquiring a set of images to be detected matched with sentences to be detected in a text to be detected according to a preset matching rule specifically includes:
extracting event characteristic information of a title contained in the text to be detected;
acquiring a historical text matched with the text to be detected from a preset text database according to the event characteristic information; wherein the title of the history text contains the event characteristic information;
and taking the image contained in the historical text as a to-be-detected image matched with the to-be-detected sentence, and storing the to-be-detected image into the to-be-detected image set.
Further, the obtaining of the initial sentence expression of the sentence to be detected according to the preset sentence coding module specifically includes:
and inputting the sentence to be detected into a pre-training converter BERT model with preset depth bidirectional representation to obtain an initial sentence expression of the sentence to be detected.
Further, the obtaining, according to a preset image coding module, an image expression of each image to be detected in the image set to be detected specifically includes:
and inputting each image to be detected into a preset residual error network ResNet model, and generating an image expression of each image to be detected by adopting a preset Sigmoid function to the output result of the residual error network ResNet model.
In a second aspect, an embodiment of the present invention provides a multi-modal event detection apparatus, including:
the data collection module is used for acquiring a to-be-detected image set matched with to-be-detected sentences in the to-be-detected text according to a preset matching rule;
the sentence coding module is used for obtaining an initial sentence expression of the sentence to be detected;
the image coding module is used for obtaining an image expression of each image to be detected in the image set to be detected;
the multi-picture encoder module is used for sequentially updating the sentence expression of the sentence to be detected according to each image expression by adopting a preset alternating dual attention mechanism to obtain an updated sentence expression;
the residual fusion module is used for obtaining a fused sentence expression by adopting a preset residual fusion device according to the initial sentence expression and the updated sentence expression;
and the event prediction module is used for inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
Further, the multi-picture encoder module specifically includes: an information acquisition sub-module, a first attention sub-module and a second attention sub-module; wherein,
the information acquisition sub-module is used for sequentially acquiring the image expression m_i corresponding to the i-th image to be detected in the image set to be detected;
the first attention sub-module is used for updating the image expression m_i by using the current sentence expression H_{i-1} of the sentence to be detected according to a preset multi-head attention mechanism, to obtain an updated image expression m'_i according to the multi-head attention distribution;
the second attention sub-module is configured to update the current sentence expression H_{i-1} by using the updated image expression m'_i according to the preset multi-head attention mechanism, to obtain an updated current sentence expression H_i.
Further, the data collection module is specifically configured to:
extracting event characteristic information of a title contained in the text to be detected;
acquiring a historical text matched with the text to be detected from a preset text database according to the event characteristic information; wherein the title of the history text contains the event characteristic information;
and taking the image contained in the historical text as a to-be-detected image matched with the to-be-detected sentence, and storing the to-be-detected image into the to-be-detected image set.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
a processor, a memory, a communication interface, and a communication bus; wherein,
the processor, the memory and the communication interface complete mutual communication through the communication bus;
the communication interface is used for information transmission between communication devices of the electronic equipment;
the memory stores computer program instructions executable by the processor, the processor invoking the program instructions to perform a method comprising:
acquiring a to-be-detected image set matched with a to-be-detected sentence in a to-be-detected text according to a preset matching rule;
obtaining an initial sentence expression of the sentence to be detected according to a preset sentence coding module, and obtaining an image expression of each image to be detected in the image set to be detected according to a preset image coding module;
updating the sentence expression of the sentence to be detected in sequence according to each image expression by adopting a preset alternating dual attention mechanism to obtain an updated sentence expression;
obtaining a fused sentence expression by adopting a preset residual fusion device according to the initial sentence expression and the updated sentence expression;
and inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following method:
acquiring a to-be-detected image set matched with a to-be-detected sentence in a to-be-detected text according to a preset matching rule;
obtaining an initial sentence expression of the sentence to be detected according to a preset sentence coding module, and obtaining an image expression of each image to be detected in the image set to be detected according to a preset image coding module;
updating the sentence expression of the sentence to be detected in sequence according to each image expression by adopting a preset alternating dual attention mechanism to obtain an updated sentence expression;
obtaining a fused sentence expression by adopting a preset residual fusion device according to the initial sentence expression and the updated sentence expression;
and inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
According to the multi-modal event detection method and device provided by the embodiments of the invention, the images to be detected matched with the sentence to be detected are acquired and encoded, together with the sentence, to obtain a sentence expression and image expressions; the sentence expression is updated according to the image expression of each image to be detected in sequence by adopting a preset alternating dual attention mechanism and fused through a residual connection; and the fused sentence expression is then passed through a pre-trained event prediction model to obtain the event detection result, so that the efficiency and quality of event detection on the text to be detected are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart of a multi-modal event detection method according to an embodiment of the present invention;
FIG. 2 is a flowchart of yet another multi-modal event detection method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another multi-modal event detection method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an alternating dual attention module according to an embodiment of the present invention;
FIG. 5 is a flowchart of yet another multi-modal event detection method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a multi-modal event detection apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of another multi-modal event detection apparatus according to an embodiment of the present invention;
FIG. 8 is a physical structure diagram of an electronic device.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Existing event detection methods are limited to a single modality, the text modality, and do not utilize multi-modal information, i.e., both the image modality and the text modality. In fact, fusing multi-modal information such as images is very effective for handling the ambiguity of event trigger words. Taking news as an example, the images accompanying a news report can reflect the core event of the report, so the event has a rough directionality when being disambiguated. For example, in news about terrorist attacks, the accompanying images will reflect scenes such as "refugees", "weapons" and "military", and the event types will tend to be "die", "injure" and "attack" rather than "travel", "trial" or "celebrate". With this tendency toward the type of the core event, a large amount of background knowledge is available during event detection, so the accuracy of event detection can be effectively improved. Meanwhile, the image modality can supplement more detailed information and complement the information of the text modality, thereby improving the capability of event detection. Some information that is difficult to express in the text modality can be easily reflected in the image modality, such as dressing style, facial expressions or body gestures, and such details can determine the occasion and form of an event, which is beneficial for inferring the event type.
Fig. 1 and fig. 2 are flowcharts of a multi-modal event detection method according to an embodiment of the present invention. As shown in fig. 1, the method includes:
and step S01, acquiring a to-be-detected image set matched with the to-be-detected sentence in the to-be-detected text according to a preset matching rule.
A text to be detected is obtained in a preset manner, for example a news article is obtained from a website. The news article comprises a title, news content and news images, which respectively correspond to the title, the text content and the images to be detected of the text to be detected, and the text content may be composed of a plurality of sentences to be detected.
Event detection is performed on the text to be detected; specifically, event detection is performed respectively on each sentence to be detected contained therein to obtain an event detection result corresponding to each sentence to be detected, for example, the event triggered by a keyword in the sentence to be detected.
Because the text to be detected often contains few or even no images, images matched with the text to be detected are acquired from other channels or historical data through a preset matching rule, so that a large number of images to be detected matched with the sentences to be detected in the text to be detected are obtained to form the image set to be detected.
Step S02, obtaining an initial sentence expression of the sentence to be detected according to a preset sentence coding module, obtaining an image expression of each image to be detected in the image set to be detected according to a preset image coding module, and combining to obtain an image expression set.
As shown in fig. 2, a sentence coding module and an image coding module are preset, and a preset coding model is adopted to code the sentence to be detected and each image to be detected in the image set to be detected, so as to obtain a sentence expression of the sentence to be detected and an image expression of each image to be detected.
Further, the obtaining of the initial sentence expression of the sentence to be detected according to the preset sentence coding module specifically includes:
and inputting the sentence to be detected into a pre-training converter BERT model with preset depth bidirectional representation to obtain an initial sentence expression of the sentence to be detected.
The encoding model adopted in the sentence coding module can be set according to actual needs; the embodiment of the invention takes only a preset Bidirectional Encoder Representations from Transformers (BERT) model as an example for illustration. The BERT model is a pre-trained language representation model comprising multiple stacked multi-head attention layers; it can understand sentence semantic information deeply and from multiple angles, is powerful, and is well suited to event detection on text.
The BERT model is adopted as the text feature extractor. The sentence to be detected S = <w_1, w_2, …, w_n> is input into the BERT model, and its sequence output is taken as the initial sentence expression H_0 = <h_1, h_2, …, h_n>, where w_i is a token obtained by segmenting the sentence to be detected into words or characters, and h_i is the token vector in one-to-one correspondence with w_i. The encoding process of the BERT model may be represented as:

H_0 = BERT(S)
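As an illustration, this sentence-encoding step can be sketched as follows, assuming the HuggingFace transformers implementation of BERT; the checkpoint name is an illustrative assumption, since the embodiment does not fix one.

```python
# Sketch of the sentence coding module; the checkpoint name is illustrative.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def encode_sentence(sentence: str) -> torch.Tensor:
    """Return H_0 = <h_1, ..., h_n>: one vector per token w_i of S."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():  # drop no_grad when fine-tuning BERT end to end
        outputs = bert(**inputs)
    return outputs.last_hidden_state  # shape (1, n, 768)
```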
further, the obtaining, according to a preset image coding module, an image expression of each image to be detected in the image set to be detected specifically includes:
and inputting each image to be detected into a preset residual error network ResNet model, and generating an image expression of each image to be detected by adopting a preset Sigmoid function to the output result of the residual error network ResNet model.
The coding model adopted in the image coding module can be set according to actual needs, and the embodiment of the invention only gives an illustration of one of the coding models.
A preset Residual Network (ResNet) model is adopted to perform feature extraction on each image to be detected p_i in the image set to be detected P = {p_1, p_2, …, p_k}. Each image to be detected p_i is input into the ResNet model to obtain a hidden representation u_i of the image to be detected. The encoding process of the ResNet model can be expressed as:

u_i = ResNet(p_i)

In order to map the image expression into the same dimensional space as the text, a Sigmoid function is adopted to convert the hidden representation u_i of the image to be detected into the image expression m_i:

m_i = σ(W_u u_i + b_u)

where σ(·) is the Sigmoid function, and W_u and b_u are a trainable weight matrix and bias.
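The image coding step can be sketched in the same spirit, assuming a torchvision ResNet backbone with its classification head removed; the ResNet depth and the output dimension are illustrative assumptions, not choices fixed by the embodiment.

```python
# Sketch of the image coding module: u_i = ResNet(p_i), m_i = sigmoid(W_u u_i + b_u).
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    def __init__(self, out_dim: int = 768):  # out_dim matches the text space
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Keep everything up to global average pooling; drop the classifier.
        self.resnet = nn.Sequential(*list(backbone.children())[:-1])
        self.proj = nn.Linear(backbone.fc.in_features, out_dim)  # W_u, b_u

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        u = self.resnet(image).flatten(1)   # hidden representation u_i
        return torch.sigmoid(self.proj(u))  # image expression m_i
```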
And step S03, updating the sentence expression of the sentence to be detected in sequence according to each image expression by adopting a preset alternating dual attention mechanism to obtain the updated sentence expression.
The acquired images to be detected that match the sentence to be detected tend to describe the event from different angles; for example, for a news report on an earthquake, the images to be detected may include an image of a collapsed road reporting the damage, an image of workers carrying reconstruction materials reporting the rebuilding, and so on. Therefore, the embodiment of the invention disambiguates the detected event by dynamically aggregating the information of a plurality of images to be detected.
According to the image expressions corresponding to the images to be detected in the image set to be detected, in sequence, an Alternating Dual Attention (ADA) mechanism is adopted to recursively update the sentence expression of the sentence to be detected. Specifically, the recursive update can be realized through a preset ADA sub-module: the image expression m_i corresponding to the i-th image to be detected p_i is sequentially input into the preset ADA sub-module, which updates the sentence expression H_{i-1} obtained from the previous update to obtain the i-th updated sentence expression H_i. After recursively updating over the image expressions corresponding to the k images to be detected in the image set to be detected P = {p_1, p_2, …, p_k}, the sentence expression of the sentence to be detected goes from the initial sentence expression H_0 to the sentence expression H_k after k updates.
And step S04, obtaining a fused sentence expression by adopting a preset residual fusion device according to the initial sentence expression and the updated sentence expression.
A preset residual fusion device is adopted to fuse the initial sentence expression H_0 and the sentence expression H_k after k updates, so as to obtain a fused sentence expression, which is used as the final sentence expression of the sentence to be detected.
There are many possible methods for the fusion process; the embodiment of the present invention illustrates only one of them. A residual block is used to integrate the initial sentence expression H_0 back into the updated sentence expression H_k, so as to obtain the fused sentence expression R = H_0 + H_k.
The fused sentence expression R retains the original semantics of the sentence to be detected as much as possible, and prevents the gradients of the BERT parameters from vanishing through the update process.
And step S05, inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
The event prediction module is preset, and divides all event detection results into a preset number of event types. The event prediction model is trained on a pre-acquired corpus

D = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}

which comprises a plurality of training samples x_i = <S, P> with labels y_i ∈ Y, where S, P and y_i respectively represent a training sentence in a training text, the training image set matched with that training text, and the pre-labeled event detection result corresponding to the training sentence. Inputting a training sample x_i yields an output result vector O composed of conditional probabilities O_{ijc}, where O_{ijc} represents the probability that the j-th token of the sentence x_i belongs to event category c, normalized through a softmax function to obtain the following result:
O_i = softmax(f(x_i; θ))
where f(·) denotes the composition of all the modules described above, and θ represents all the trainable parameters defined in those modules.
The optimization function adopted in the training process of the event prediction model is defined as follows:
J(θ) = − Σ_{(x_i, y_i) ∈ D} Σ_j log O_{ij y_{ij}}
Adam is used as the gradient descent optimizer.
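As a sketch of this training setup, the objective can be read as token-level softmax cross-entropy (the negative log-likelihood of the labeled event types) minimized with Adam. The event-type count, hidden size, and learning rate below are illustrative assumptions, and in practice the optimizer would cover the parameters of all the modules above, not only the prediction head.

```python
import torch
import torch.nn as nn

NUM_EVENT_TYPES = 34  # illustrative; the preset number of event types
HIDDEN = 768          # must match the dimension of the fused expression R

event_head = nn.Linear(HIDDEN, NUM_EVENT_TYPES)  # fully-connected prediction layer
criterion = nn.CrossEntropyLoss()                # softmax + negative log-likelihood
optimizer = torch.optim.Adam(event_head.parameters(), lr=1e-5)

def train_step(R: torch.Tensor, labels: torch.Tensor) -> float:
    """R: (n, HIDDEN) fused sentence expression; labels: (n,) gold type per token."""
    logits = event_head(R)            # scores per token and event category
    loss = criterion(logits, labels)  # J(theta) for this training sample
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```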
The fused sentence expression R of the sentence to be detected is input into the trained event prediction module, and the obtained event detection result gives, for each token in the sentence to be detected, the conditional probability of each event type. The obtained conditional probabilities can be further analyzed subsequently to determine the precise event type corresponding to the sentence to be detected, and even to the text to be detected.
Compared with the prior art, the multi-modal event detection method based on the alternating dual attention mechanism provided by the embodiment of the invention covers both the text modality of the sentence to be detected and the image modality of the images to be detected. The modality-specific information of each modality and the cross-modal consistent information contained in the multi-modal input are encoded into different semantic spaces; the alternating dual attention mechanism then extracts cross-modal semantic information more deeply and constructs a unified semantic space, thereby improving the performance of the event detection task in multi-modal scenarios in practical applications, improving the efficiency and quality of event analysis on texts to be detected such as news texts, and having broad application prospects.
According to the embodiment of the invention, the images to be detected matched with the sentence to be detected are obtained and encoded, together with the sentence, into a sentence expression and image expressions; the sentence expression is updated according to the image expression of each image to be detected in sequence by adopting a preset alternating dual attention mechanism and fused through a residual connection; and the fused sentence expression is then passed through the pre-trained event prediction model to obtain the event detection result, so that the efficiency and quality of event detection on the text to be detected are improved.
Fig. 3 is a flowchart of another multi-modal event detection method according to an embodiment of the present invention, and fig. 4 is a schematic structural diagram of an alternating dual attention module according to an embodiment of the present invention. As shown in fig. 3, the step S03 specifically includes:
Step S031, sequentially acquiring the image expression m_i corresponding to the i-th image to be detected in the image set to be detected.

The image coding module sequentially encodes each image to be detected p_i in the image set to be detected to obtain the corresponding image expression m_i, and the preset alternating dual attention module then updates the current sentence expression H_{i-1} to obtain the i-th updated sentence expression H_i.
Step S032, according to a preset multi-head attention mechanism, utilizing the current sentence expression H of the sentence to be detectedi-1Updating the image expression miObtaining an updated image expression m 'according to the multi-head attention distribution'i
The alternating dual attention mechanism consists of two parts: first, the text information, i.e., the sentence expression, guides the multi-head attention distribution over the image and updates the image expression; then the image information, i.e., the updated image expression, guides the multi-head attention distribution over the text, so as to update the sentence expression. Since the image information and the text information affect each other, a dual structure is adopted. Specifically, against different text backgrounds, the salient region of the same image differs; likewise, in different picture contexts, the same word may trigger different events.
Updating the image expression with the multi-head attention distribution guided by the sentence expression is referred to as the first round of updating, and updating the sentence expression with the multi-head attention distribution guided by the image expression is referred to as the second round of updating.
In the first round of updating, as shown in fig. 4, two fully-connected (Linear) layers map the current sentence expression H_{i-1} into the first two inputs of the Scaled Dot-Product Attention module, and the mapped implicit representations are labeled as the keys k and the values v. A third fully-connected layer then maps the image expression m_i into the third input of the scaled dot-product attention module, and the mapped implicit representation is labeled as the query q.
Next, the attention distribution α is learned from the query and the keys, and its dot product with the values v gives the weighted image representation z. The above process is formulated as follows:

e_i = q s_i^T / √(d_k)

α_i = exp(e_i) / Σ_{j=1}^{L} exp(e_j)

z = α v^T

where d_k represents the dimension of the hidden representations, s_i represents the embedded representation of the i-th word after interaction with the other modality, and L represents the number of words in the sentence.
The above process is repeated u times (once per attention head), and a linear transformation is applied to obtain the modified image representation h, formulated as follows:

Z = [z_1; z_2; …; z_u]

h = W_h Z + b_h

where ";" denotes concatenation in the last dimension.

A residual module sends the implicit representation q directly to the output to obtain the final updated expression:

m'_i = h + q
With the operations of the above equations uniformly labeled as Ω, the first round of the update process can be summarized as:

m'_i = Ω(m_i, H_{i-1})

where m'_i is the updated image expression.
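A compact sketch of the Ω operation is given below, assuming PyTorch's built-in multi-head attention, which internally bundles the per-head linear maps, the scaled dot-product attention, and the output transform W_h Z + b_h; the dimension and head count u are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Omega(nn.Module):
    """Omega(query_seq, context): multi-head scaled dot-product attention
    where q comes from query_seq and k, v come from context, followed by
    the residual connection m'_i = h + q described above."""
    def __init__(self, dim: int = 768, heads: int = 8):  # heads = u
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_seq: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        h, _ = self.attn(query_seq, context, context)  # q from query_seq; k, v from context
        return h + query_seq                           # send q directly to the output
```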
Step 033, utilizing the updated image expression m 'according to the preset multi-head attention mechanism'iUpdating the current sentence expression Hi-1Obtaining updated current sentence expression Hi
In the second round of update, as shown in fig. 4, the updated image expression m is similar to the first round of updatet' mapping into the first two inputs of the scaled dot product attention Module, the current sentence expression Hi-1Mapping into a third input, similar to the first round of updates, the second round of update processes can also be summarized as:
Hi=Ω(Hi-1,mt′)
wherein, the HiIs the sentence expression after i times of updating.
The image expressions corresponding to the k images to be detected in the image set to be detected P = {p_1, p_2, …, p_k} are used in sequence to update the sentence expression, and the finally updated sentence expression is H_k.
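Reusing the Ω module sketched above, the alternating recursion over the image set, together with the residual fusion R = H_0 + H_k of step S04, might be organized as follows; whether the two rounds share one Ω instance or use two separate instances is not fixed by the text.

```python
import torch

def ada_encode(H0: torch.Tensor, image_exprs: list, omega: "Omega") -> torch.Tensor:
    """H0: (1, n, dim) initial sentence expression; each m: (1, 1, dim)."""
    H = H0
    for m in image_exprs:        # i = 1 .. k, in order
        m_prime = omega(m, H)    # first round:  m'_i = Omega(m_i, H_{i-1})
        H = omega(H, m_prime)    # second round: H_i  = Omega(H_{i-1}, m'_i)
    return H0 + H                # residual fusion: R = H_0 + H_k
```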
The embodiment of the invention updates the multi-head attention distribution of the image expression by using the sentence expression, and then updates the multi-head attention distribution of the sentence expression by using the image expression, thereby realizing an alternating dual attention mechanism which is used for updating the sentence expression of the sentence to be detected, and improving the efficiency and quality of event detection of the text to be detected.
Fig. 5 is a flowchart of another multi-modal event detection method according to an embodiment of the present invention, and as shown in fig. 5, the step S01 specifically includes:
and S011, extracting event characteristic information of a title contained in the text to be detected.
When the text to be detected is obtained, a preset information extraction model is first applied to the title of the text to be detected to obtain the event feature information of the title. Specifically, the title may be structurally parsed with an Abstract Meaning Representation (AMR) parser, and event roles are extracted from the parse as the feature information, where the event roles include actors, locations, and the like.
Step S012, acquiring a historical text matched with the text to be detected from a preset text database according to the event characteristic information; wherein the title of the history text contains the event feature information.
The obtained event feature information is matched against each historical text in the pre-acquired text database to judge whether a historical text and the text to be detected correspond to the same event. Specifically, the event feature information of the title of the historical text is compared with the event feature information of the title of the text to be detected; if the former includes the latter, it is judged that the historical text and the text to be detected correspond to the same event; otherwise, it is judged that they correspond to different events. For example, if the title of the text to be detected is "wildfires raging across California" and the title of a historical text is "large-scale wildfires in California", the event feature information extracted from the two titles is determined to be "large-scale wildfires" and "California", so it can be judged that the text to be detected and the historical text correspond to the same event.
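A minimal sketch of this containment-based matching rule is given below; the feature extractor is a hypothetical stand-in (a naive keyword bag) for the AMR-based role extraction, which would return roles such as actors and locations.

```python
def extract_event_features(title: str) -> set[str]:
    """Hypothetical stand-in for AMR-based role extraction: a naive keyword bag."""
    stopwords = {"the", "a", "an", "in", "of", "across"}
    return {w.lower() for w in title.split() if w.lower() not in stopwords}

def matches_same_event(history_title: str, query_title: str) -> bool:
    """True when the historical title's event features contain those of the
    text to be detected, e.g.:
    >>> matches_same_event("large-scale wildfires hit California",
    ...                    "wildfires in California")
    True
    """
    return extract_event_features(query_title) <= extract_event_features(history_title)
```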
Each historical text in the text database can be obtained by crawling preset websites; for example, for news reports, historical texts can be obtained by searching several authoritative and influential news websites.
And S013, taking the image contained in the historical text as a to-be-detected image matched with the to-be-detected sentence, and storing the to-be-detected image into the to-be-detected image set.
The images contained in the historical texts that correspond to the same event as the text to be detected are extracted and taken as images to be detected matched with the text to be detected, which is equivalent to taking them as images to be detected matched with the sentences to be detected in the text to be detected, and they are stored into the image set to be detected.
According to the embodiment of the invention, the event feature information of the title of the text to be detected is obtained and matched against the historical texts in the text database, and the images contained in the historical texts corresponding to the same event are taken as the images to be detected matched with the sentences to be detected in the text to be detected, so as to obtain the image set to be detected, thereby improving the efficiency and quality of event detection on the text to be detected.
In summary, the embodiment of the present invention designs a multi-modal neural network event detection model based on an alternating dual attention mechanism. First, image modality information related to the text is collected, and its diversity is ensured by linking different historical texts of the same event. Then the expressions of the text and image modalities are obtained through a pre-trained language encoding model and an image encoding model respectively. Next, a recurrent neural network model with the alternating dual attention unit as its basic cell is designed to deeply fuse the image modality with the text. Finally, a fully-connected neural network judges the event type of each word. The alternating dual attention mechanism not only filters the salient regions of the image modality according to the context of the text modality, but also in turn screens the salient regions of the text modality according to the context of the image modality. This deep multi-modal fusion mechanism can discard irrelevant semantic information and retain the information matched with the event semantics, so the effect of multi-modal event detection can be improved.
Fig. 6 is a schematic structural diagram of a multi-modal event detection apparatus according to an embodiment of the present invention. As shown in fig. 6, the multi-modal event detection apparatus includes: a data collection module 10, a sentence coding module 11, an image coding module 12, a multi-picture encoder module 13, a residual fusion module 14 and an event prediction module 15; wherein,
the data collection module 10 is configured to obtain a set of images to be detected that are matched with sentences to be detected in a text to be detected according to a preset matching rule; the sentence coding module 11 is configured to obtain an initial sentence expression of the sentence to be detected; the image coding module 12 is configured to obtain an image expression of each image to be detected in the image set to be detected; the multi-picture encoder module 13 is configured to update the sentence expression of the sentence to be detected in sequence according to each image expression by using a preset alternating dual attention mechanism, so as to obtain an updated sentence expression; the residual error fusion device module 14 is configured to obtain a fused sentence expression by using a preset residual error fusion device according to the initial sentence expression and the updated sentence expression; the event prediction module 15 is configured to input the fused sentence expression into a pre-trained event prediction module, so as to obtain an event detection result corresponding to the sentence to be detected. Specifically, the method comprises the following steps:
the data collection module 10 obtains a text to be tested in a preset manner, where the text to be tested includes a title, text contents and an image to be tested, and the text contents may be composed of a plurality of sentences to be tested.
And performing event detection on the text to be detected, specifically performing event detection on each contained sentence to be detected respectively to obtain an event detection result corresponding to each sentence to be detected.
Because the text to be detected often contains few or even no images, the data collection module 10 obtains images matched with the text to be detected from other channels or historical data through a preset matching rule, so as to obtain a large number of images to be detected matched with each sentence to be detected in the text to be detected and form the image set to be detected.
The data collection module 10 sends the sentence to be detected and each image to be detected to the sentence coding module 11 and the image coding module 12, and codes the sentence to be detected and each image to be detected in the image set to be detected by using a preset coding model, so as to obtain a sentence expression of the sentence to be detected and an image expression of each image to be detected.
Further, the preset sentence coding module 11 is specifically configured to:
and inputting the sentence to be detected into a pre-training converter BERT model with preset depth bidirectional representation to obtain an initial sentence expression of the sentence to be detected.
The coding model adopted in the sentence coder module 11 can be set according to actual needs, and the embodiment of the present invention is exemplified by only a preset BERT model.
The BERT model is adopted as the text feature extractor. The sentence to be detected S = <w_1, w_2, …, w_n> is input into the BERT model, and its sequence output is taken as the initial sentence expression H_0 = <h_1, h_2, …, h_n>, where w_i is a token obtained by segmenting the sentence to be detected, and h_i is the token vector in one-to-one correspondence with w_i. The encoding process of the BERT model may be represented as:

H_0 = BERT(S)
further, the image encoding module 12 according to preset is specifically configured to:
and inputting each image to be detected into a preset residual error network ResNet model, and generating an image expression of each image to be detected by adopting a preset Sigmoid function to the output result of the residual error network ResNet model.
The coding model adopted in the image coding module 12 can be set according to actual needs, and the embodiment of the present invention only provides an example.
A preset Residual Network (ResNet) model is adopted to perform feature extraction on each image to be detected p_i in the image set to be detected P = {p_1, p_2, …, p_k}. Each image to be detected p_i is input into the ResNet model to obtain the hidden representation u_i of the image to be detected. The encoding process of the ResNet model can be expressed as:

u_i = ResNet(p_i)

In order to map the image expression into the same dimensional space as the text, a Sigmoid function is adopted to convert the hidden representation u_i of the image to be detected into the image expression m_i:

m_i = σ(W_u u_i + b_u)

where σ(·) is the Sigmoid function, and W_u and b_u are a trainable weight matrix and bias.
The multi-picture encoder module 13 sequentially obtains from the image coding module 12 the image expressions corresponding to the images to be detected in the image set to be detected, and recursively updates the sentence expression of the sentence to be detected obtained from the sentence coding module 11 by adopting the alternating dual attention mechanism. Specifically, the recursive update can be realized through a preset ADA sub-module: the multi-picture encoder module 13 sequentially inputs the image expression m_i corresponding to the i-th image to be detected p_i into the preset ADA sub-module, which updates the sentence expression H_{i-1} obtained from the previous update to obtain the i-th updated sentence expression H_i. After recursively updating over the image expressions corresponding to the k images to be detected in the image set to be detected P = {p_1, p_2, …, p_k}, the sentence expression of the sentence to be detected goes from the initial sentence expression H_0 to the sentence expression H_k after k updates, and the sentence expressions H_0 and H_k are sent to the residual fusion module 14.
The residual fusion module 14 fuses the initial sentence expression H_0 and the sentence expression H_k after k updates to obtain the fused sentence expression, which is used as the final sentence expression of the sentence to be detected.
There are many possible methods for the fusion process; the embodiment of the present invention illustrates only one of them. The residual fusion module 14 uses a residual block to integrate the initial sentence expression H_0 back into the updated sentence expression H_k, so as to obtain the fused sentence expression R = H_0 + H_k, which is sent to the event prediction module 15.
The event prediction module 15 divides all event detection results into a preset number of event types. The event prediction model is trained on a pre-acquired corpus

D = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}

which comprises a plurality of training samples x_i = <S, P> with labels y_i ∈ Y, where S, P and y_i respectively represent a training sentence in a training text, the training image set matched with that training text, and the pre-labeled event detection result corresponding to the training sentence. Inputting a training sample x_i yields an output result vector O composed of conditional probabilities O_{ijc}, where O_{ijc} represents the probability that the j-th token of the sentence x_i belongs to event category c, normalized through a softmax function to obtain the following result:
O_i = softmax(f(x_i; θ))
where f(·) denotes the composition of all the modules described above, and θ represents all the trainable parameters defined in those modules.
The optimization function adopted in the training process of the event prediction model is defined as follows:
J(θ) = − Σ_{(x_i, y_i) ∈ D} Σ_j log O_{ij y_{ij}}
Adam is used as the gradient descent optimizer.
The fused sentence expression R of the sentence to be detected is input into the trained event prediction module, and the obtained event detection result gives the conditional probability of each event type for each token in the sentence to be detected. The obtained conditional probabilities can be further analyzed subsequently to determine the precise event type corresponding to the sentence to be detected, and even to the text to be detected.
The apparatus provided in the embodiment of the present invention is configured to execute the method, and the functions of the apparatus refer to the method embodiment specifically, and detailed method flows thereof are not described herein again.
According to the embodiment of the invention, the images to be detected matched with the sentence to be detected are obtained and encoded, together with the sentence, into a sentence expression and image expressions; the sentence expression is updated according to the image expression of each image to be detected in sequence by adopting a preset alternating dual attention mechanism and fused through a residual connection; and the fused sentence expression is then passed through the pre-trained event prediction model to obtain the event detection result, so that the efficiency and quality of event detection on the text to be detected are improved.
Fig. 7 is a schematic structural diagram of another multi-modal event detection apparatus according to an embodiment of the present invention. As shown in fig. 7, the apparatus includes: a data collection module 10, a sentence coding module 11, an image coding module 12, a multi-picture encoder module 13, a residual fusion module 14, and an event prediction module 15, wherein the multi-picture encoder module 13 includes: an information acquisition sub-module 131, a first attention sub-module 132, and a second attention sub-module 133; wherein,
the information obtaining sub-module 131 is configured to sequentially obtain an image expression m corresponding to the ith image to be tested in the image set to be testedi(ii) a The first attention sub-module 132 is configured to utilize the current sentence expression H of the sentence to be tested according to a preset multi-head attention mechanismi-1Updating the image expression miGet the updated image expression mi(ii) a The second attention submodule 133 is configured to utilize the updated image expression m 'according to the preset multi-head attention mechanism'iUpdating the current sentence expression Hi-1Obtaining updated current sentence expression Hi. Specifically, the method comprises the following steps:
the multi-picture encoder module 13 may adopt a structure of a plurality of ADA sub-modules as described in the above embodiment, where each ADA sub-module corresponds to an image expression of an image to be detected, and then sequentially updates a sentence expression of a sentence to be detected according to the ADA sub-modules. The multi-picture encoder module 13 may also adopt a configuration that, as described in the embodiment of the present invention, the multi-picture encoder module is composed of an information obtaining sub-module 131, a first attention sub-module 132, and a second attention sub-module 133, which is equivalent to only including one ADA sub-module, and the ADA sub-module is cyclically used to update the sentence expression.
The image coding module 12 sequentially encodes each image to be detected p_i in the image set to be detected to obtain the corresponding image expression m_i, which is then sent to the information acquisition sub-module 131.
The first attention sub-module 132 uses two fully-connected layers to map the current sentence expression H_{i-1} into the first two inputs of the scaled dot-product attention module, and the mapped implicit representations are labeled as the keys k and the values v. A third fully-connected layer then maps the image expression m_i received by the information acquisition sub-module 131 into the third input of the scaled dot-product attention module, and the mapped implicit representation is labeled as the query q.
Next, the attention distribution α is learned from the query and the keys, and its dot product with the values v gives the weighted image representation z. The above process is formulated as follows:

e_i = q s_i^T / √(d_k)

α_i = exp(e_i) / Σ_{j=1}^{L} exp(e_j)

z = α v^T
the above process is repeated u times and a linear transformation is applied to obtain a modified image representation h, which is formulated as follows:
Z=[z1;z2;…;zu]
h=WhZ+bh
wherein, "; "denotes the splice in the last dimension.
And directly sending the implicit representation q to an output end by adopting a residual error module to obtain a final updated representation.
mi′=h+q
With the operations of the above equations uniformly labeled as Ω, the update process of the first attention sub-module 132 can be summarized as:

m'_i = Ω(m_i, H_{i-1})

where m'_i is the updated image expression.
Similarly, the second attention sub-module 133 maps the updated image expression m'_i output by the first attention sub-module 132 into the first two inputs of the scaled dot-product attention module, and maps the current sentence expression H_{i-1} into the third input. Like the first attention sub-module 132, the update process of the second attention sub-module 133 can be summarized as:

H_i = Ω(H_{i-1}, m'_i)

where H_i is the sentence expression after i updates.
The information acquisition sub-module 131 sequentially obtains the image expressions corresponding to the k images to be detected in the image set to be detected P = {p_1, p_2, …, p_k} for updating the sentence expression, and the finally updated sentence expression is H_k.
The apparatus provided in the embodiment of the present invention is configured to execute the method, and the functions of the apparatus refer to the method embodiment specifically, and detailed method flows thereof are not described herein again.
The embodiment of the invention updates the multi-head attention distribution of the image expression by using the sentence expression, and then updates the multi-head attention distribution of the sentence expression by using the image expression, thereby realizing an alternating dual attention mechanism which is used for updating the sentence expression of the sentence to be detected, and improving the efficiency and quality of event detection of the text to be detected.
Based on the foregoing embodiment, further, the data collection module is specifically configured to:
extracting event characteristic information of a title contained in the text to be detected;
acquiring a historical text matched with the text to be detected from a preset text database according to the event characteristic information; wherein the title of the history text contains the event characteristic information;
and taking the image contained in the historical text as a to-be-detected image matched with the to-be-detected sentence, and storing the to-be-detected image into the to-be-detected image set.
When the data collection module obtains a text to be detected, a preset information extraction model is applied to the title of the text to be detected to obtain the event feature information of the title. Specifically, the title may be structurally parsed with an Abstract Meaning Representation (AMR) parser, and event roles are extracted as the feature information, where the event roles include actors, locations, and the like.
The data collection module matches the obtained event feature information against each historical text in the pre-acquired text database to judge whether a historical text and the text to be detected correspond to the same event. Specifically, the event feature information of the title of the historical text is compared with the event feature information of the title of the text to be detected; if the former includes the latter, the data collection module judges that the historical text and the text to be detected correspond to the same event; otherwise, it judges that they correspond to different events.
The data collection module extracts the images contained in the historical texts corresponding to the same event as the text to be detected, takes them as images to be detected matched with the text to be detected, which is equivalent to taking them as images to be detected matched with the sentences to be detected in the text to be detected, and stores them into the image set to be detected.
The apparatus provided in the embodiment of the present invention is configured to execute the method, and the functions of the apparatus refer to the method embodiment specifically, and detailed method flows thereof are not described herein again.
According to the method and the device, the event characteristic information of the title of the text to be detected is matched against the historical texts in the text database, and the images contained in the historical texts corresponding to the same event are taken as the images to be detected matched with the sentences to be detected in the text to be detected, so as to obtain the image set to be detected, thereby improving the efficiency and quality of event detection on the text to be detected.
Fig. 8 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 8, the electronic device may include: a processor 801, a communications interface 803, a memory 802, and a communication bus 804, where the processor 801, the communications interface 803, and the memory 802 communicate with one another through the communication bus 804. The processor 801 may call logic instructions in the memory 802 to perform the above-described method.
Further, embodiments of the present invention disclose a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, which, when executed by a computer, enable the computer to perform the methods provided by the above-mentioned method embodiments.
Further, the present invention provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the methods provided by the above method embodiments.
Those of ordinary skill in the art will understand that the logic instructions in the memory 802 may be implemented as software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for multi-modal event detection, comprising:
acquiring a to-be-detected image set matched with a to-be-detected sentence in a to-be-detected text according to a preset matching rule;
obtaining an initial sentence expression of the sentence to be detected according to a preset sentence coding module, and obtaining an image expression of each image to be detected in the image set to be detected according to a preset image coding module;
updating the sentence expression of the sentence to be detected in sequence according to each image expression by adopting a preset alternating dual attention mechanism to obtain an updated sentence expression;
obtaining a fused sentence expression by adopting a preset residual fusion device according to the initial sentence expression and the updated sentence expression;
and inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
2. The multi-modal event detection method according to claim 1, wherein the updating the sentence expression of the sentence to be detected according to the image expressions in sequence by using a preset alternating dual attention mechanism to obtain an updated sentence expression specifically comprises:
sequentially acquiring an image expression m_i corresponding to the i-th image to be detected in the image set to be detected;
according to a preset multi-head attention mechanism, updating the multi-head attention distribution of the image expression m_i by using the current sentence expression H_{i-1} of the sentence to be detected, to obtain an updated image expression m'_i;
and according to the preset multi-head attention mechanism, updating the multi-head attention distribution of the current sentence expression H_{i-1} by using the updated image expression m'_i, to obtain an updated current sentence expression H_i.
3. The multi-modal event detection method according to claim 1, wherein the obtaining of the set of images to be detected matched with the sentences to be detected in the text to be detected according to the preset matching rule specifically comprises:
extracting event characteristic information of a title contained in the text to be detected;
acquiring a historical text matched with the text to be detected from a preset text database according to the event characteristic information; wherein the title of the history text contains the event characteristic information;
and taking the image contained in the historical text as a to-be-detected image matched with the to-be-detected sentence, and storing the to-be-detected image into the to-be-detected image set.
4. The multi-modal event detection method according to claim 1, wherein obtaining the initial sentence expression of the sentence to be detected according to a preset sentence encoding module specifically comprises:
and inputting the sentence to be detected into a preset deep bidirectional representation pre-trained Transformer (BERT) model to obtain the initial sentence expression of the sentence to be detected.
5. The method according to claim 1, wherein obtaining the image expression of each image to be detected in the image set to be detected according to a preset image coding module specifically comprises:
and inputting each image to be detected into a preset residual network (ResNet) model, and applying a preset Sigmoid function to the output of the ResNet model to generate the image expression of each image to be detected.
6. A multimodal event detection apparatus, comprising:
the data collection module is used for acquiring a to-be-detected image set matched with to-be-detected sentences in the to-be-detected text according to a preset matching rule;
the sentence coding module is used for obtaining an initial sentence expression of the sentence to be detected;
the image coding module is used for obtaining an image expression of each image to be detected in the image set to be detected;
the multi-picture encoder module is used for sequentially updating the sentence expression of the sentence to be detected according to each image expression by adopting a preset alternating dual attention mechanism to obtain an updated sentence expression;
the residual error fusion device module is used for obtaining a fused sentence expression by adopting a preset residual error fusion device according to the initial sentence expression and the updated sentence expression;
and the event prediction module is used for inputting the fused sentence expression into a pre-trained event prediction module to obtain an event detection result corresponding to the sentence to be detected.
7. The multi-modal event detection apparatus according to claim 6, wherein the multi-picture encoder module specifically comprises: an information acquisition sub-module, a first attention sub-module, and a second attention sub-module; wherein:
the information acquisition submodule is used for sequentially acquiring the image expression m corresponding to the ith image to be detected in the image set to be detectedi
The first attention submodule is used for utilizing the current sentence expression H of the sentence to be detected according to a preset multi-head attention mechanismi-1Updating the image expression miObtaining an updated image expression m 'according to the multi-head attention distribution'i
The second attention submodule is configured to utilize the updated image expression m 'according to the preset multi-head attention mechanism'iUpdating the current sentence expression Hi-1Obtaining updated current sentence expression Hi
8. The multi-modal event detection apparatus of claim 6, wherein the data collection module is specifically configured to:
extracting event characteristic information of a title contained in the text to be detected;
acquiring a historical text matched with the text to be detected from a preset text database according to the event characteristic information; wherein the title of the history text contains the event characteristic information;
and taking the image contained in the historical text as a to-be-detected image matched with the to-be-detected sentence, and storing the to-be-detected image into the to-be-detected image set.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the multimodal event detection method as claimed in any of claims 1 to 5 are implemented when the program is executed by the processor.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps of the multimodal event detection method as claimed in any of claims 1 to 5.
CN202010076960.1A 2020-01-23 2020-01-23 Multi-mode event detection method and device Active CN111259851B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010076960.1A CN111259851B (en) 2020-01-23 2020-01-23 Multi-mode event detection method and device

Publications (2)

Publication Number Publication Date
CN111259851A true CN111259851A (en) 2020-06-09
CN111259851B CN111259851B (en) 2021-04-23

Family

ID=70951033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010076960.1A Active CN111259851B (en) 2020-01-23 2020-01-23 Multi-mode event detection method and device

Country Status (1)

Country Link
CN (1) CN111259851B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199933A (en) * 2014-09-04 2014-12-10 华中科技大学 Multi-modal information fusion football video event detection and semantic annotation method
CN105183849A (en) * 2015-09-06 2015-12-23 华中科技大学 Event detection and semantic annotation method for snooker game videos
CN106202281A (en) * 2016-06-28 2016-12-07 广东工业大学 A kind of multi-modal data represents learning method and system
CN106529492A (en) * 2016-11-17 2017-03-22 天津大学 Video topic classification and description method based on multi-image fusion in view of network query
CN108319686A (en) * 2018-02-01 2018-07-24 北京大学深圳研究生院 Antagonism cross-media retrieval method based on limited text space
CN110134757A (en) * 2019-04-19 2019-08-16 杭州电子科技大学 A kind of event argument roles abstracting method based on bull attention mechanism
CN110232158A (en) * 2019-05-06 2019-09-13 重庆大学 Burst occurred events of public safety detection method based on multi-modal data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HYEONSEOB NAM et al.: "Dual Attention Networks for Multimodal Reasoning and Matching", arXiv *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783475A (en) * 2020-07-28 2020-10-16 北京深睿博联科技有限责任公司 Semantic visual positioning method and device based on phrase relation propagation
CN112836043A (en) * 2020-10-13 2021-05-25 讯飞智元信息科技有限公司 Long text clustering method and device based on pre-training language model
CN112328859A (en) * 2020-11-05 2021-02-05 南开大学 False news detection method based on knowledge-aware attention network
CN112328859B (en) * 2020-11-05 2022-09-20 南开大学 False news detection method based on knowledge-aware attention network
CN112685565A (en) * 2020-12-29 2021-04-20 平安科技(深圳)有限公司 Text classification method based on multi-mode information fusion and related equipment thereof
CN112685565B (en) * 2020-12-29 2023-07-21 平安科技(深圳)有限公司 Text classification method based on multi-mode information fusion and related equipment thereof
CN113535949A (en) * 2021-06-15 2021-10-22 杭州电子科技大学 Multi-mode combined event detection method based on pictures and sentences
CN113535949B (en) * 2021-06-15 2022-09-13 杭州电子科技大学 Multi-modal combined event detection method based on pictures and sentences
CN113642603A (en) * 2021-07-05 2021-11-12 北京三快在线科技有限公司 Data matching method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN111259851B (en) 2021-04-23

Similar Documents

Publication Publication Date Title
CN111259851B (en) Multi-mode event detection method and device
CN110263324B (en) Text processing method, model training method and device
CN111026842B (en) Natural language processing method, natural language processing device and intelligent question-answering system
CN111783474B (en) Comment text viewpoint information processing method and device and storage medium
CN111046132A (en) Customer service question and answer processing method and system for retrieving multiple rounds of conversations
CN113268609B (en) Knowledge graph-based dialogue content recommendation method, device, equipment and medium
WO2023134083A1 (en) Text-based sentiment classification method and apparatus, and computer device and storage medium
CN114330966A (en) Risk prediction method, device, equipment and readable storage medium
CN116578688A (en) Text processing method, device, equipment and storage medium based on multiple rounds of questions and answers
CN113868459A (en) Model training method, cross-modal characterization method, unsupervised image text matching method and unsupervised image text matching device
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN115017879A (en) Text comparison method, computer device and computer storage medium
CN110659392A (en) Retrieval method and device, and storage medium
CN113240033A (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN117574898A (en) Domain knowledge graph updating method and system based on power grid equipment
CN110852066B (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
US20240037335A1 (en) Methods, systems, and media for bi-modal generation of natural languages and neural architectures
CN115964497A (en) Event extraction method integrating attention mechanism and convolutional neural network
CN115906863A (en) Emotion analysis method, device and equipment based on comparative learning and storage medium
CN115544212A (en) Document-level event element extraction method, apparatus and medium
CN115269781A (en) Modal association degree prediction method, device, equipment, storage medium and program product
CN114579876A (en) False information detection method, device, equipment and medium
CN110633363B (en) Text entity recommendation method based on NLP and fuzzy multi-criterion decision
CN113761874A (en) Event reality prediction method and device, electronic equipment and storage medium
CN113704460B (en) Text classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant