CN113609259A - Multi-mode reasoning method and system for videos and natural languages - Google Patents
Multi-mode reasoning method and system for videos and natural languages
- Publication number
- CN113609259A (application CN202110935190.6A)
- Authority
- CN
- China
- Prior art keywords
- video
- events
- causal
- event
- natural language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
  - G06F16/3344—Query execution using natural language analysis
  - G06F16/35—Clustering; Classification
  - G06F16/7844—Retrieval of video data characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
  - G06F16/7867—Retrieval of video data characterised by using metadata manually generated, e.g. tags, keywords, comments, title and artist information
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
  - G06N5/04—Inference or reasoning models
Abstract
The invention discloses a multi-modal reasoning method and system for video and natural language, belonging to the technical field of computer vision. The technical problem to be solved is how to discover intrinsic causal relationships among objects in events in a video sequence, supplemented by natural language segments, so as to obtain common-sense knowledge and improve the robustness of downstream tasks. The technical scheme is as follows: locate events in the video, perform object detection on the events in the video, and learn contextual causal relationships using the video and its text subtitles, thereby realizing multi-modal reasoning over video and natural language. The specific steps are: feeding the video to the canonical frame module and outputting a pair of images; acquiring all possible causal relationships between the earlier and later frames in the corresponding events; outputting a causal-relationship score prediction between events, i.e., a probability measure; and rationalizing the causal relationship. The system comprises a canonical frame detection module, a text encoder module, a cross attention module, a classifier and a causal rationalization module.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a multi-modal reasoning method and system for videos and natural languages.
Background
Causal relationships help determine how events relate to one another: they enable a deeper understanding of incidental connections between events and of the environment in which events occur, and they improve the understanding of real-world happenings described by video or natural language fragments. This benefits not only humans but also deep learning models.
Downstream tasks such as video captioning, video question answering, and a host of other NLP tasks suffer from a lack of causal understanding; causal relationships can improve their robustness, which makes it valuable to convey the concept of causality to machines.
Reasoning about causal relationships to support machine intelligence has attracted recent interest, since it is an important step toward artificial intelligence. Common-sense knowledge can be acquired in verbal and/or visual (image or video) form. Current methods that extract causal knowledge from natural language mostly mine text segments (e.g., titles), textual introductions, and large-scale knowledge bases (e.g., Wikipedia). All of these reason in purely textual form and can obtain common-sense knowledge only with large amounts of manually labeled data; in addition, purely textual models make some counter-intuitive errors in causal reasoning. Some causal models in visual settings require less manual labeling, but they cannot accept video as input, and reasoning about causal relations from a single image limits their effectiveness and generality.
Therefore, how to discover the intrinsic causal relationships between objects in events in a video sequence, supplemented with natural language segments, so as to obtain common-sense knowledge and improve the robustness of downstream tasks, is a technical problem urgently needing a solution.
Disclosure of Invention
The technical task of the invention is to provide a multi-modal reasoning method and system for video and natural language, so as to solve the problem of how to discover intrinsic causal relationships among objects in events in a video sequence, supplemented by natural language segments, thereby obtaining common-sense knowledge and improving the robustness of downstream tasks.
The technical task of the invention is realized in the following way: the method locates events in the video, performs object detection on the events in the video, and learns contextual causal relationships using the video and its text subtitles, thereby realizing multi-modal reasoning over video and natural language; the specific steps are as follows:
feeding the video to the canonical frame module and outputting a pair of images;
acquiring all possible causal relationships between the earlier and later frames in the corresponding events;
outputting a causal-relationship score prediction between events, i.e., a probability measure;
rationalizing the causal relationship.
Preferably, the video input is fed to the canonical frame module and a pair of images is output, as follows:
inputting a video V and feeding it to the canonical frame module for identification;
outputting a pair of images p ∈ P, where P is the set of image pairs for each event e ∈ E, and E is the set of all events in the video;
p consists of two frames I1 and I2, both sampled from the video V in temporal order, i.e., I1 appears before I2; I1 and I2 are the cause frame and the effect frame, respectively.
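The frame-pair sampling just described can be sketched as follows. The function name `sample_pairs` and the list-of-strings stand-in for decoded frames are illustrative assumptions, not part of the disclosed method:

```python
# Toy sketch of the frame-pair sampling step: given a video as an ordered
# list of frames and the frame indices at which events were detected,
# emit pairs (I1, I2) with I1 strictly before I2 in time.

from itertools import combinations

def sample_pairs(frames, event_indices):
    """Return all temporally ordered frame pairs (I1, I2) for the given
    event frame indices; I1 always appears before I2 in the video."""
    ordered = sorted(set(event_indices))
    return [(frames[i], frames[j]) for i, j in combinations(ordered, 2)]

video = ["frame0", "frame1", "frame2", "frame3"]   # stand-in for decoded frames
pairs = sample_pairs(video, [3, 0, 2])
# Every emitted pair respects temporal order, e.g. ("frame0", "frame2").
```

Because `combinations` walks the sorted indices, the cause-frame candidate always precedes the effect-frame candidate, matching the ordering constraint stated above.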
Preferably, all possible causal relationships between the earlier and later frames in the corresponding event are acquired as follows:
for each pair p, determining all possible causal relationships between the two frames I1 and I2 comprises two subtasks:
(i) identifying the events in them;
(ii) identifying causal relationships between the events.
More preferably, causal relationships between events are identified as follows:
denote the set of events contained in I1 as E1, and the set of events contained in all frames sampled from the video V as Ev;
for each event e1 ∈ E1, the goal is to find all events e2 ∈ Ev such that e1 causes e2, i.e., e1 → e2.
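The candidate-enumeration step above can be illustrated with a minimal sketch; the event strings below come from the frisbee example later in this document, and `candidate_relations` is an illustrative name, not from the patent:

```python
# Minimal sketch of candidate enumeration: pair every event e1 detected
# in frame I1 with every event e2 in the video-wide event set Ev to form
# candidate relations e1 -> e2 for subsequent scoring.

def candidate_relations(E1, Ev):
    """All candidate (cause, effect) pairs, excluding trivial self-pairs."""
    return [(e1, e2) for e1 in E1 for e2 in Ev if e1 != e2]

E1 = ["girl throws frisbee"]
Ev = ["girl throws frisbee", "dog jumps", "dog catches frisbee"]
cands = candidate_relations(E1, Ev)
# Each candidate pair is then scored by the classifier described below.
```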
Preferably, the causal score prediction infers, according to whether the output c meets a threshold (e.g., 0.5), positive (e1 → e2) or negative (e1 ↛ e2) causal relationships, where c is the predicted score and c ∈ [0, 1];
rationalization means receiving the outputs c, e1, and e2 and producing a string output that explains the reasoning behind the prediction.
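The thresholding of the score c can be sketched as follows; the 0.5 default mirrors the example threshold in the text, and whether the boundary case counts as positive (here, `>=`) is an assumption:

```python
# Sketch of the score-thresholding step: the classifier emits c in [0, 1];
# c at or above the threshold is read as a positive causal relation
# e1 -> e2, otherwise negative.

def causal_decision(c, threshold=0.5):
    """Map a causal score c in [0, 1] to a positive/negative decision."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("causal score must lie in [0, 1]")
    return "positive" if c >= threshold else "negative"

decision_hi = causal_decision(0.82)   # confident positive relation
decision_lo = causal_decision(0.31)   # below threshold, negative
```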
A multi-modal inference system for video and natural language, the system comprising,
a canonical frame detection module, which accepts the video input, locates events in the video, and identifies representative frames corresponding to a pair of events;
a text encoder module, which encodes the labels of the objects detected in a pair of context frames into vectors;
a cross attention module, which fuses the features; the fused features are interleaved via attention to obtain context and event representations;
a classifier, which outputs a causal probability prediction and passes the encoded event captions to the causal rationalization module;
and a causal rationalization module, which rationalizes the causal relationship.
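As a rough intuition for the cross attention module's role, the following is a minimal single-head attention sketch in pure Python. The absence of learned projection matrices and the tiny dimensions are simplifications for illustration, not the patent's implementation:

```python
# Minimal cross-attention sketch: text features act as queries over
# visual-context features (keys/values); each output is a softmax-weighted
# combination of the values, i.e. a fused context/event representation.

import math

def cross_attention(queries, keys, values):
    """For each query, softmax-weight the values by scaled query-key dot products."""
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
                  for k in keys]
        m = max(scores)                                 # numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

text_feats = [[1.0, 0.0]]                  # e.g. an encoded event caption
vis_feats = [[1.0, 0.0], [0.0, 1.0]]       # e.g. detected-object features
fused = cross_attention(text_feats, vis_feats, vis_feats)
# fused[0] is a convex combination of the visual features, biased toward
# the visual feature most aligned with the text query.
```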
Preferably, the working process of the text encoder module is as follows:
(1) the text representations of events e1 and e2 are encoded using BERT, with w1 and w2 denoting the encoded text;
(2) a Faster R-CNN model pre-trained on MS-COCO is used to perform object detection on the canonical frames, thereby establishing the visual context.
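The object labels produced by step (2) must be turned into vectors before fusion. A real system would use BERT embeddings over Faster R-CNN detections; the tiny fixed vocabulary and bag-of-words encoding below are stand-in assumptions purely for illustration:

```python
# Toy sketch: encode the detected-object labels from the two canonical
# frames into fixed-length context vectors for the cross attention module.

VOCAB = ["girl", "frisbee", "dog", "air"]  # illustrative label vocabulary

def encode_labels(labels):
    """Bag-of-words count vector over VOCAB for detected-object labels."""
    vec = [0] * len(VOCAB)
    for label in labels:
        if label in VOCAB:
            vec[VOCAB.index(label)] += 1
    return vec

ctx_I1 = encode_labels(["girl", "frisbee"])   # objects detected in frame I1
ctx_I2 = encode_labels(["dog", "frisbee"])    # objects detected in frame I2
```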
Preferably, the classifier is a binary classifier.
An electronic device, comprising: a memory and at least one processor;
wherein the memory has stored thereon a computer program;
the at least one processor executes the memory-stored computer program causing the at least one processor to perform the multi-modal inference method of video and natural language as described above.
A computer-readable storage medium having stored therein a computer program executable by a processor to implement the method of multimodal inference of video and natural language as described above.
The multi-modal reasoning method and system for videos and natural languages of the present invention have the following advantages:
(1) better performance can be obtained by fusing the causal relationships and input features into existing models and then executing visual cognition tasks (such as scene understanding, video captioning, video question answering, etc.);
(2) the invention is realized by locating events in the video, using a canonical frame detection module to represent the events, and learning contextual causal relationships from the video and its text subtitles;
(3) the method infers causal relationships from knowledge in two forms, video and text, to generate common-sense knowledge; the model can find intrinsic causal relationships between objects in events in a video sequence, supplemented by natural language segments (descriptions of the events) to obtain knowledge;
(4) the method helps determine the causal relationships among events, enabling a deeper understanding of incidental connections between events as well as of the environment in which events occur, and improving the understanding of real-world happenings described by video or natural language fragments; this benefits not only humans but also deep learning models. It can also improve the robustness of downstream tasks that suffer from a lack of causal understanding, such as video captioning, video question answering, and a host of other NLP tasks, which makes it valuable to convey the concept of causality to machines.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a block flow diagram of a multimodal inference system for video and natural language.
Detailed Description
The video and natural language multimodal inference method and system of the present invention are described in detail below with reference to the drawings and specific embodiments of the specification.
Example 1:
the invention relates to a multi-modal reasoning method for videos and natural languages, which comprises the steps of positioning an event in a video, carrying out target detection on the event in the video, and learning context cause and effect relationship by using the video and text subtitles so as to further realize the multi-modal reasoning of the video and the natural languages; the method comprises the following specific steps:
s1, inputting the video into a regular frame and outputting a pair of images;
s2, acquiring all possible causal relationships of the front frame and the rear frame in the corresponding event;
s3, outputting causal relationship score prediction between events, namely probability measurement;
s4, rationalizing the causal relationship.
In this embodiment, step S1 of feeding the video to the canonical frame module and outputting a pair of images is specifically as follows:
S101, inputting a video V and feeding it to the canonical frame module for identification;
S102, outputting a pair of images p ∈ P, where P is the set of image pairs for each event e ∈ E, and E is the set of all events in the video;
S103, p consists of two frames I1 and I2, both sampled from the video V in temporal order, i.e., I1 appears before I2; I1 and I2 are the cause frame and the effect frame, respectively.
In this embodiment, step S2 of acquiring all possible causal relationships between the earlier and later frames in the corresponding event is specifically as follows:
for each pair p, determining all possible causal relationships between the two frames I1 and I2 comprises two subtasks:
(i) identifying the events in them;
(ii) identifying causal relationships between the events.
In this embodiment, step (ii) of identifying causal relationships between events is specifically as follows:
① denote the set of events contained in I1 as E1, and the set of events contained in all frames sampled from the video V as Ev;
② for each event e1 ∈ E1, the goal is to find all events e2 ∈ Ev such that e1 causes e2, i.e., e1 → e2.
The causal score prediction of step S3 in this embodiment infers, according to whether the output c meets a threshold (e.g., 0.5), positive (e1 → e2) or negative (e1 ↛ e2) causal relationships, where c is the predicted score and c ∈ [0, 1].
Rationalization in this embodiment means receiving the outputs c, e1, and e2 and producing a string output that explains the reasoning behind the prediction.
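Steps S1–S4 can be illustrated end to end with stand-in components. The hand-listed event set, the dummy scorer `dummy_score`, and the template rationalization string are all hypothetical simplifications; the disclosed modules (canonical frame detector, BERT encoder, cross attention, classifier, rationalizer) are far richer:

```python
# End-to-end toy walk-through of S2-S4: enumerate candidate event pairs,
# score each, threshold the score, and rationalize the positives.

def dummy_score(e1, e2):
    """Stand-in for the classifier: favors the frisbee-example pair."""
    return 0.9 if (e1, e2) == ("girl throws frisbee", "dog jumps") else 0.2

def run_pipeline(events_I1, events_video, threshold=0.5):
    results = []
    for e1 in events_I1:                      # S2: enumerate candidates
        for e2 in events_video:
            if e1 == e2:
                continue
            c = dummy_score(e1, e2)           # S3: causal score in [0, 1]
            if c >= threshold:                # S4: keep and rationalize positives
                results.append((e1, e2, c,
                                f"'{e1}' likely caused '{e2}' (score {c})"))
    return results

out = run_pipeline(["girl throws frisbee"], ["girl throws frisbee", "dog jumps"])
# Only the above-threshold pair survives, along with its explanation string.
```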
Example 2:
as shown in fig. 1, the multi-modal inference system for videos and natural languages of the present invention comprises,
a canonical frame detection module, which accepts the video input, locates events in the video, and identifies representative frames corresponding to a pair of events;
a text encoder module, which encodes the labels of the objects detected in a pair of context frames into vectors;
a cross attention module, which fuses the features; the fused features are interleaved via attention to obtain context and event representations;
a classifier, which outputs a causal probability prediction and passes the encoded event captions to the causal rationalization module;
and a causal rationalization module, which rationalizes the causal relationship.
The working process of the text encoder module in this embodiment is specifically as follows:
(1) the text representations of events e1 and e2 are encoded using BERT, with w1 and w2 denoting the encoded text;
(2) a Faster R-CNN model pre-trained on MS-COCO is used to perform object detection on the canonical frames, thereby establishing the visual context.
The classifier in this embodiment is a binary classifier that outputs causal probability predictions, which are finally input (together with the encoded event captions) to the causal rationalization module to facilitate model interpretability. Interpretability and robustness are enhanced by adopting the Commonsense Auto-Generated Explanations (CAGE) framework proposed by Rajani et al.
Common-sense knowledge is generated by inferring causal relationships using knowledge in two modalities (video and text). This enables the model to find intrinsic causal relationships between objects in events in a video sequence and to supplement the knowledge thus gained with natural language fragments (i.e., the descriptions of the events above).
Regarding the use of causal relationships in visual and textual form: video generally contains common sense that cannot easily be inferred from text alone, since this information is usually not explicitly stated in textual form. For example, consider a video of a girl throwing a frisbee into the air and a dog jumping to catch it. Here there is a causal relationship between the two events. Although a text caption of the whole sequence helps in understanding the events, it typically does not state this relationship explicitly. From the video, however, it is evident that the girl throwing the frisbee caused the dog to jump. Thus both modalities have their own importance for the task of learning causal relationships, and the visual and textual forms are fused to mine common sense, exploiting the temporal ordering of the video and the object context within it.
Example 3:
an embodiment of the present invention further provides an electronic device, including: a memory and a processor;
wherein the memory stores computer execution instructions;
the processor executes the computer-executable instructions stored by the memory to cause the processor to perform the method for multimodal inference of video and natural language in any of the embodiments of the present invention.
Example 4:
embodiments of the present invention further provide a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions are loaded by a processor, so that the processor executes the multi-modal inference method for videos and natural languages in any embodiment of the present invention. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A multi-modal reasoning method for video and natural language, characterized in that the method locates events in the video, performs object detection on the events in the video, and learns contextual causal relationships using the video and its text subtitles, thereby realizing multi-modal reasoning over video and natural language; the method comprises the following specific steps:
feeding the video to the canonical frame module and outputting a pair of images;
acquiring all possible causal relationships between the earlier and later frames in the corresponding events;
outputting a causal-relationship score prediction between events, i.e., a probability measure;
rationalizing the causal relationship.
2. The multi-modal reasoning method for video and natural language according to claim 1, characterized in that the video input is fed to the canonical frame module and a pair of images is output, as follows:
inputting a video V and feeding it to the canonical frame module for identification;
outputting a pair of images p ∈ P, where P is the set of image pairs for each event e ∈ E, and E is the set of all events in the video;
p consists of two frames I1 and I2, both sampled from the video V in temporal order, i.e., I1 appears before I2; I1 and I2 are the cause frame and the effect frame, respectively.
3. The multi-modal reasoning method for video and natural language according to claim 1, characterized in that all possible causal relationships between the earlier and later frames in the corresponding event are acquired as follows:
for each pair p, determining all possible causal relationships between the two frames I1 and I2 comprises two subtasks:
(i) identifying the events in them;
(ii) identifying causal relationships between the events.
4. The multi-modal reasoning method for video and natural language according to claim 3, characterized in that causal relationships between events are identified as follows:
denoting the set of events contained in I1 as E1, and the set of events contained in all frames sampled from the video V as Ev;
for each event e1 ∈ E1, the goal is to find all events e2 ∈ Ev such that e1 causes e2, i.e., e1 → e2.
5. The multi-modal reasoning method for video and natural language according to claim 1, characterized in that the causal score prediction infers, according to whether the output c meets a threshold, positive (e1 → e2) or negative (e1 ↛ e2) causal relationships, where c is the predicted score and c ∈ [0, 1];
rationalization means receiving the outputs c, e1, and e2 and producing a string output that explains the reasoning behind the prediction.
6. A multimodal inference system of video and natural language, the system comprising,
a canonical frame detection module, which accepts the video input, locates events in the video, and identifies representative frames corresponding to a pair of events;
a text encoder module, which encodes the labels of the objects detected in a pair of context frames into vectors;
a cross attention module, which fuses the features; the fused features are interleaved via attention to obtain context and event representations;
a classifier, which outputs a causal probability prediction and passes the encoded event captions to the causal rationalization module;
and a causal rationalization module, which rationalizes the causal relationship.
7. The system according to claim 6, wherein the text encoder module operates as follows:
(1) the text representations of events e1 and e2 are encoded using BERT, with w1 and w2 denoting the encoded text;
(2) a Faster R-CNN model pre-trained on MS-COCO is used to perform object detection on the canonical frames, thereby establishing the visual context.
8. The system of claim 6, wherein the classifier is a binary classifier.
9. An electronic device, comprising: a memory and at least one processor;
wherein the memory has stored thereon a computer program;
the at least one processor executing the memory-stored computer program causes the at least one processor to perform the method of multimodal inference of video and natural language as claimed in any of claims 1 to 5.
10. A computer-readable storage medium, in which a computer program is stored, the computer program being executable by a processor to implement the method for multimodal inference of video and natural language according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110935190.6A CN113609259A (en) | 2021-08-16 | 2021-08-16 | Multi-mode reasoning method and system for videos and natural languages |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110935190.6A CN113609259A (en) | 2021-08-16 | 2021-08-16 | Multi-mode reasoning method and system for videos and natural languages |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113609259A true CN113609259A (en) | 2021-11-05 |
Family
ID=78340764
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110935190.6A Pending CN113609259A (en) | 2021-08-16 | 2021-08-16 | Multi-mode reasoning method and system for videos and natural languages |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113609259A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108229175A (en) * | 2017-12-28 | 2018-06-29 | 中国科学院信息工程研究所 | A kind of correlation analysis system and method for multidimensional isomery forensic information |
CN108960063A (en) * | 2018-06-01 | 2018-12-07 | 清华大学深圳研究生院 | It is a kind of towards event relation coding video in multiple affair natural language description algorithm |
WO2020152843A1 (en) * | 2019-01-25 | 2020-07-30 | 日本電気株式会社 | Processing device, processing method and program |
CN113159034A (en) * | 2021-04-23 | 2021-07-23 | 杭州电子科技大学 | Method and system for automatically generating subtitles by using short video |
- 2021-08-16 — CN application CN202110935190.6A, patent CN113609259A (status: Pending)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108229175A (en) * | 2017-12-28 | 2018-06-29 | 中国科学院信息工程研究所 | A kind of correlation analysis system and method for multidimensional isomery forensic information |
CN108960063A (en) * | 2018-06-01 | 2018-12-07 | 清华大学深圳研究生院 | It is a kind of towards event relation coding video in multiple affair natural language description algorithm |
WO2020152843A1 (en) * | 2019-01-25 | 2020-07-30 | 日本電気株式会社 | Processing device, processing method and program |
CN113159034A (en) * | 2021-04-23 | 2021-07-23 | 杭州电子科技大学 | Method and system for automatically generating subtitles by using short video |
Non-Patent Citations (2)
Title |
---|
AMAN CHADHA et al.: "iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability", arXiv, pages 1-12 *
JINGZHOU LIU et al.: "Violin: A large-scale dataset for video-and-language inference", arXiv, pages 1-18 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106534548B (en) | Voice error correction method and device | |
CN112015949B (en) | Video generation method and device, storage medium and electronic equipment | |
KR102169925B1 (en) | Method and System for Automatic Image Caption Generation | |
CN111582241A (en) | Video subtitle recognition method, device, equipment and storage medium | |
CN112364810A (en) | Video classification method and device, computer readable storage medium and electronic equipment | |
CN102207954A (en) | Electronic apparatus, content recommendation method and program therefor | |
CN112016573B (en) | Bullet screen generation method and device, electronic equipment and computer storage medium | |
CN113766314B (en) | Video segmentation method, device, equipment, system and storage medium | |
CN110555136A (en) | Video tag generation method and device and computer storage medium | |
CN111723784A (en) | Risk video identification method and device and electronic equipment | |
CN113515997B (en) | Video data processing method and device and readable storage medium | |
KR20230133059A (en) | Ai-based digital contents automated production method, apparatus and system | |
CN105847752B (en) | Information encoding-decoding method, equipment and video monitoring system | |
CN101772950A (en) | Method of processing moving picture and apparatus thereof | |
Liu et al. | Controlllm: Augment language models with tools by searching on graphs | |
CN110991175A (en) | Text generation method, system, device and storage medium under multiple modes | |
CN116168119A (en) | Image editing method, image editing device, electronic device, storage medium, and program product | |
CN113515998A (en) | Video data processing method and device and readable storage medium | |
KR20210097314A (en) | Artificial intelligence based image generation system | |
CN113609259A (en) | Multi-mode reasoning method and system for videos and natural languages | |
Raj et al. | Deep learning based video captioning in bengali | |
CN103984699A (en) | Pushing method and pushing device for promotion information | |
CN110750669A (en) | Method and system for generating image captions | |
CN111259197A (en) | Video description generation method based on pre-coding semantic features | |
CN113010635B (en) | Text error correction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||