CN113609259A - Multi-mode reasoning method and system for videos and natural languages - Google Patents
Multi-mode reasoning method and system for videos and natural languages
- Publication number
- CN113609259A (application CN202110935190.6A)
- Authority
- CN
- China
- Prior art keywords
- video
- events
- causal
- event
- natural language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
  - G06F16/3344—Query execution using natural language analysis
  - G06F16/35—Clustering; Classification
  - G06F16/7844—Retrieval of video data characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
  - G06F16/7867—Retrieval of video data characterised by using metadata manually generated, e.g. tags, keywords, comments, title and artist information
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
  - G06N5/04—Inference or reasoning models
Abstract
The invention discloses a multi-modal reasoning method and system for video and natural language, belonging to the technical field of computer vision. The technical problem to be solved is how to discover intrinsic causal relationships among objects in events in a video sequence, supplemented by natural language segments, so as to obtain common-sense knowledge and improve the robustness of downstream tasks. The technical scheme is as follows: locate events in the video, perform object detection on the events in the video, and learn contextual causal relationships using the video and its text subtitles, thereby realizing multi-modal reasoning over video and natural language. The specific steps are: feeding the video to the canonical frame module and outputting a pair of images; acquiring all possible causal relationships between the earlier and later frames in the corresponding events; outputting a causal-relationship score prediction between events, i.e., a probability measure; and rationalizing the causal relationship. The system comprises a canonical frame detection module, a text encoder module, a cross attention module, a classifier and a causal rationalization module.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a multi-modal reasoning method and system for videos and natural languages.
Background
Causal relationships help determine how events relate to one another: they enable a deeper understanding of incidental connections between events and of the environment in which events occur, and they improve the understanding of real-world happenings described by video or natural language fragments. This benefits not only humans but also deep learning models.
Downstream tasks such as video captioning, video question answering, and a host of other NLP tasks suffer from a lack of causal understanding; causal relationships can improve their robustness, which makes it valuable to convey the concept of causality to machines.
Reasoning about causal relationships to support machine intelligence has attracted recent interest, since it is an important step toward artificial intelligence. Common-sense knowledge can be acquired in verbal and/or visual (image or video) form. Current methods that extract causal knowledge from natural language mostly mine text segments (e.g., titles), textual introductions, and large-scale knowledge bases (e.g., Wikipedia). All of these reason in purely textual form and can obtain common-sense knowledge only with large amounts of manually labeled data; in addition, purely textual models make some counter-intuitive errors in causal reasoning. Some causal models in visual settings require less manual labeling, but they cannot accept video as input, and reasoning about causal relations from a single image limits their effectiveness and generality.
Therefore, how to discover the intrinsic causal relationships between objects in events in a video sequence, supplemented with natural language segments, so as to obtain common-sense knowledge and improve the robustness of downstream tasks, is a technical problem urgently needing a solution.
Disclosure of Invention
The technical task of the invention is to provide a multi-modal reasoning method and system for video and natural language, so as to solve the problem of how to discover intrinsic causal relationships among objects in events in a video sequence, supplemented by natural language segments, thereby obtaining common-sense knowledge and improving the robustness of downstream tasks.
The technical task of the invention is realized in the following way: the method locates events in the video, performs object detection on the events in the video, and learns contextual causal relationships using the video and its text subtitles, thereby realizing multi-modal reasoning over video and natural language; the specific steps are as follows:
feeding the video to the canonical frame module and outputting a pair of images;
acquiring all possible causal relationships between the earlier and later frames in the corresponding events;
outputting a causal-relationship score prediction between events, i.e., a probability measure;
rationalizing the causal relationship.
Preferably, the video input is fed to the canonical frame module and a pair of images is output, as follows:
inputting a video V and feeding it to the canonical frame module for identification;
outputting a pair of images p ∈ P, where P is the set of image pairs for each event e ∈ E, and E is the set of all events in the video;
p consists of two frames I1 and I2, both sampled from the video V in temporal order, i.e., I1 appears before I2; I1 and I2 are the cause frame and the effect frame, respectively.
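The frame-pair sampling just described can be sketched as follows. The function name `sample_pairs` and the list-of-strings stand-in for decoded frames are illustrative assumptions, not part of the disclosed method:

```python
# Toy sketch of the frame-pair sampling step: given a video as an ordered
# list of frames and the frame indices at which events were detected,
# emit pairs (I1, I2) with I1 strictly before I2 in time.

from itertools import combinations

def sample_pairs(frames, event_indices):
    """Return all temporally ordered frame pairs (I1, I2) for the given
    event frame indices; I1 always appears before I2 in the video."""
    ordered = sorted(set(event_indices))
    return [(frames[i], frames[j]) for i, j in combinations(ordered, 2)]

video = ["frame0", "frame1", "frame2", "frame3"]   # stand-in for decoded frames
pairs = sample_pairs(video, [3, 0, 2])
# Every emitted pair respects temporal order, e.g. ("frame0", "frame2").
```

Because `combinations` walks the sorted indices, the cause-frame candidate always precedes the effect-frame candidate, matching the ordering constraint stated above.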
Preferably, all possible causal relationships between the earlier and later frames in the corresponding event are acquired as follows:
for each pair p, determining all possible causal relationships between the two frames I1 and I2 comprises two subtasks:
(i) identifying the events in them;
(ii) identifying causal relationships between the events.
More preferably, causal relationships between events are identified as follows:
denote the set of events contained in I1 as E1, and the set of events contained in all frames sampled from the video V as Ev;
for each event e1 ∈ E1, the goal is to find all events e2 ∈ Ev such that e1 causes e2, i.e., e1 → e2.
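The candidate-enumeration step above can be illustrated with a minimal sketch; the event strings below come from the frisbee example later in this document, and `candidate_relations` is an illustrative name, not from the patent:

```python
# Minimal sketch of candidate enumeration: pair every event e1 detected
# in frame I1 with every event e2 in the video-wide event set Ev to form
# candidate relations e1 -> e2 for subsequent scoring.

def candidate_relations(E1, Ev):
    """All candidate (cause, effect) pairs, excluding trivial self-pairs."""
    return [(e1, e2) for e1 in E1 for e2 in Ev if e1 != e2]

E1 = ["girl throws frisbee"]
Ev = ["girl throws frisbee", "dog jumps", "dog catches frisbee"]
cands = candidate_relations(E1, Ev)
# Each candidate pair is then scored by the classifier described below.
```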
Preferably, the causal score prediction infers, according to whether the output c meets a threshold (e.g., 0.5), positive (e1 → e2) or negative (e1 ↛ e2) causal relationships, where c is the predicted score and c ∈ [0, 1];
rationalization means receiving the outputs c, e1, and e2 and producing a string output that explains the reasoning behind the prediction.
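The thresholding of the score c can be sketched as follows; the 0.5 default mirrors the example threshold in the text, and whether the boundary case counts as positive (here, `>=`) is an assumption:

```python
# Sketch of the score-thresholding step: the classifier emits c in [0, 1];
# c at or above the threshold is read as a positive causal relation
# e1 -> e2, otherwise negative.

def causal_decision(c, threshold=0.5):
    """Map a causal score c in [0, 1] to a positive/negative decision."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("causal score must lie in [0, 1]")
    return "positive" if c >= threshold else "negative"

decision_hi = causal_decision(0.82)   # confident positive relation
decision_lo = causal_decision(0.31)   # below threshold, negative
```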
A multi-modal inference system for video and natural language, the system comprising,
a canonical frame detection module, which accepts the video input, locates events in the video, and identifies representative frames corresponding to a pair of events;
a text encoder module, which encodes the labels of the objects detected in a pair of context frames into vectors;
a cross attention module, which fuses the features; the fused features are interleaved via attention to obtain context and event representations;
a classifier, which outputs a causal probability prediction and passes the encoded event captions to the causal rationalization module;
and a causal rationalization module, which rationalizes the causal relationship.
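As a rough intuition for the cross attention module's role, the following is a minimal single-head attention sketch in pure Python. The absence of learned projection matrices and the tiny dimensions are simplifications for illustration, not the patent's implementation:

```python
# Minimal cross-attention sketch: text features act as queries over
# visual-context features (keys/values); each output is a softmax-weighted
# combination of the values, i.e. a fused context/event representation.

import math

def cross_attention(queries, keys, values):
    """For each query, softmax-weight the values by scaled query-key dot products."""
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
                  for k in keys]
        m = max(scores)                                 # numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

text_feats = [[1.0, 0.0]]                  # e.g. an encoded event caption
vis_feats = [[1.0, 0.0], [0.0, 1.0]]       # e.g. detected-object features
fused = cross_attention(text_feats, vis_feats, vis_feats)
# fused[0] is a convex combination of the visual features, biased toward
# the visual feature most aligned with the text query.
```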
Preferably, the working process of the text encoder module is as follows:
(1) the text representations of events e1 and e2 are encoded using BERT, with w1 and w2 denoting the encoded text;
(2) a Faster R-CNN model pre-trained on MS-COCO is used to perform object detection on the canonical frames, thereby establishing the visual context.
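The object labels produced by step (2) must be turned into vectors before fusion. A real system would use BERT embeddings over Faster R-CNN detections; the tiny fixed vocabulary and bag-of-words encoding below are stand-in assumptions purely for illustration:

```python
# Toy sketch: encode the detected-object labels from the two canonical
# frames into fixed-length context vectors for the cross attention module.

VOCAB = ["girl", "frisbee", "dog", "air"]  # illustrative label vocabulary

def encode_labels(labels):
    """Bag-of-words count vector over VOCAB for detected-object labels."""
    vec = [0] * len(VOCAB)
    for label in labels:
        if label in VOCAB:
            vec[VOCAB.index(label)] += 1
    return vec

ctx_I1 = encode_labels(["girl", "frisbee"])   # objects detected in frame I1
ctx_I2 = encode_labels(["dog", "frisbee"])    # objects detected in frame I2
```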
Preferably, the classifier is a binary classifier.
An electronic device, comprising: a memory and at least one processor;
wherein the memory has stored thereon a computer program;
the at least one processor executes the memory-stored computer program causing the at least one processor to perform the multi-modal inference method of video and natural language as described above.
A computer-readable storage medium having stored therein a computer program executable by a processor to implement the method of multimodal inference of video and natural language as described above.
The multi-modal reasoning method and system for videos and natural languages of the present invention have the following advantages:
(1) better performance can be obtained by fusing the causal relationships and input features into existing models and then executing visual cognition tasks (such as scene understanding, video captioning, video question answering, etc.);
(2) the invention is realized by locating events in the video, using a canonical frame detection module to represent the events, and learning contextual causal relationships from the video and its text subtitles;
(3) the method infers causal relationships from knowledge in two forms, video and text, to generate common-sense knowledge; the model can find intrinsic causal relationships between objects in events in a video sequence, supplemented by natural language segments (descriptions of the events) to obtain knowledge;
(4) the method helps determine the causal relationships among events, enabling a deeper understanding of incidental connections between events as well as of the environment in which events occur, and improving the understanding of real-world happenings described by video or natural language fragments; this benefits not only humans but also deep learning models. It can also improve the robustness of downstream tasks that suffer from a lack of causal understanding, such as video captioning, video question answering, and a host of other NLP tasks, which makes it valuable to convey the concept of causality to machines.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a block flow diagram of a multimodal inference system for video and natural language.
Detailed Description
The video and natural language multimodal inference method and system of the present invention are described in detail below with reference to the drawings and specific embodiments of the specification.
Example 1:
the invention relates to a multi-modal reasoning method for videos and natural languages, which comprises the steps of positioning an event in a video, carrying out target detection on the event in the video, and learning context cause and effect relationship by using the video and text subtitles so as to further realize the multi-modal reasoning of the video and the natural languages; the method comprises the following specific steps:
s1, inputting the video into a regular frame and outputting a pair of images;
s2, acquiring all possible causal relationships of the front frame and the rear frame in the corresponding event;
s3, outputting causal relationship score prediction between events, namely probability measurement;
s4, rationalizing the causal relationship.
In this embodiment, step S1 of feeding the video to the canonical frame module and outputting a pair of images is specifically as follows:
S101, inputting a video V and feeding it to the canonical frame module for identification;
S102, outputting a pair of images p ∈ P, where P is the set of image pairs for each event e ∈ E, and E is the set of all events in the video;
S103, p consists of two frames I1 and I2, both sampled from the video V in temporal order, i.e., I1 appears before I2; I1 and I2 are the cause frame and the effect frame, respectively.
In this embodiment, step S2 of acquiring all possible causal relationships between the earlier and later frames in the corresponding event is specifically as follows:
for each pair p, determining all possible causal relationships between the two frames I1 and I2 comprises two subtasks:
(i) identifying the events in them;
(ii) identifying causal relationships between the events.
In this embodiment, step (ii) of identifying causal relationships between events is specifically as follows:
① denote the set of events contained in I1 as E1, and the set of events contained in all frames sampled from the video V as Ev;
② for each event e1 ∈ E1, the goal is to find all events e2 ∈ Ev such that e1 causes e2, i.e., e1 → e2.
The causal score prediction of step S3 in this embodiment infers, according to whether the output c meets a threshold (e.g., 0.5), positive (e1 → e2) or negative (e1 ↛ e2) causal relationships, where c is the predicted score and c ∈ [0, 1].
Rationalization in this embodiment means receiving the outputs c, e1, and e2 and producing a string output that explains the reasoning behind the prediction.
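Steps S1–S4 can be illustrated end to end with stand-in components. The hand-listed event set, the dummy scorer `dummy_score`, and the template rationalization string are all hypothetical simplifications; the disclosed modules (canonical frame detector, BERT encoder, cross attention, classifier, rationalizer) are far richer:

```python
# End-to-end toy walk-through of S2-S4: enumerate candidate event pairs,
# score each, threshold the score, and rationalize the positives.

def dummy_score(e1, e2):
    """Stand-in for the classifier: favors the frisbee-example pair."""
    return 0.9 if (e1, e2) == ("girl throws frisbee", "dog jumps") else 0.2

def run_pipeline(events_I1, events_video, threshold=0.5):
    results = []
    for e1 in events_I1:                      # S2: enumerate candidates
        for e2 in events_video:
            if e1 == e2:
                continue
            c = dummy_score(e1, e2)           # S3: causal score in [0, 1]
            if c >= threshold:                # S4: keep and rationalize positives
                results.append((e1, e2, c,
                                f"'{e1}' likely caused '{e2}' (score {c})"))
    return results

out = run_pipeline(["girl throws frisbee"], ["girl throws frisbee", "dog jumps"])
# Only the above-threshold pair survives, along with its explanation string.
```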
Example 2:
as shown in fig. 1, the multi-modal inference system for videos and natural languages of the present invention comprises,
a canonical frame detection module, which accepts the video input, locates events in the video, and identifies representative frames corresponding to a pair of events;
a text encoder module, which encodes the labels of the objects detected in a pair of context frames into vectors;
a cross attention module, which fuses the features; the fused features are interleaved via attention to obtain context and event representations;
a classifier, which outputs a causal probability prediction and passes the encoded event captions to the causal rationalization module;
and a causal rationalization module, which rationalizes the causal relationship.
The working process of the text encoder module in this embodiment is specifically as follows:
(1) the text representations of events e1 and e2 are encoded using BERT, with w1 and w2 denoting the encoded text;
(2) a Faster R-CNN model pre-trained on MS-COCO is used to perform object detection on the canonical frames, thereby establishing the visual context.
The classifier in this embodiment is a binary classifier that outputs causal probability predictions, which are finally input (together with the encoded event captions) to the causal rationalization module to facilitate model interpretability. Interpretability and robustness are enhanced by adopting the Commonsense Auto-Generated Explanations (CAGE) framework proposed by Rajani et al.
Common-sense knowledge is generated by inferring causal relationships using knowledge in two modalities (video and text). This enables the model to find intrinsic causal relationships between objects in events in a video sequence and to supplement the knowledge thus gained with natural language fragments (i.e., the descriptions of the events above).
Regarding the use of causal relationships in visual and textual form: video generally contains common sense that cannot easily be inferred from text alone, since this information is usually not explicitly stated in textual form. For example, consider a video of a girl throwing a frisbee into the air and a dog jumping to catch it. Here there is a causal relationship between the two events. Although a text caption of the whole sequence helps in understanding the events, it typically does not state this relationship explicitly. From the video, however, it is evident that the girl throwing the frisbee caused the dog to jump. Thus both modalities have their own importance for the task of learning causal relationships, and the visual and textual forms are fused to mine common sense, exploiting the temporal ordering of the video and the object context within it.
Example 3:
an embodiment of the present invention further provides an electronic device, including: a memory and a processor;
wherein the memory stores computer execution instructions;
the processor executes the computer-executable instructions stored by the memory to cause the processor to perform the method for multimodal inference of video and natural language in any of the embodiments of the present invention.
Example 4:
embodiments of the present invention further provide a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions are loaded by a processor, so that the processor executes the multi-modal inference method for videos and natural languages in any embodiment of the present invention. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A multi-modal reasoning method for video and natural language, characterized in that the method locates events in the video, performs object detection on the events in the video, and learns contextual causal relationships using the video and its text subtitles, thereby realizing multi-modal reasoning over video and natural language; the method comprises the following specific steps:
feeding the video to the canonical frame module and outputting a pair of images;
acquiring all possible causal relationships between the earlier and later frames in the corresponding events;
outputting a causal-relationship score prediction between events, i.e., a probability measure;
rationalizing the causal relationship.
2. The multi-modal reasoning method for video and natural language according to claim 1, characterized in that the video input is fed to the canonical frame module and a pair of images is output, as follows:
inputting a video V and feeding it to the canonical frame module for identification;
outputting a pair of images p ∈ P, where P is the set of image pairs for each event e ∈ E, and E is the set of all events in the video;
p consists of two frames I1 and I2, both sampled from the video V in temporal order, i.e., I1 appears before I2; I1 and I2 are the cause frame and the effect frame, respectively.
3. The multi-modal reasoning method for video and natural language according to claim 1, characterized in that all possible causal relationships between the earlier and later frames in the corresponding event are acquired as follows:
for each pair p, determining all possible causal relationships between the two frames I1 and I2 comprises two subtasks:
(i) identifying the events in them;
(ii) identifying causal relationships between the events.
4. The multi-modal reasoning method for video and natural language according to claim 3, characterized in that causal relationships between events are identified as follows:
denoting the set of events contained in I1 as E1, and the set of events contained in all frames sampled from the video V as Ev;
for each event e1 ∈ E1, the goal is to find all events e2 ∈ Ev such that e1 causes e2, i.e., e1 → e2.
5. The multi-modal reasoning method for video and natural language according to claim 1, characterized in that the causal score prediction infers, according to whether the output c meets a threshold, positive (e1 → e2) or negative (e1 ↛ e2) causal relationships, where c is the predicted score and c ∈ [0, 1];
rationalization means receiving the outputs c, e1, and e2 and producing a string output that explains the reasoning behind the prediction.
6. A multimodal inference system of video and natural language, the system comprising,
a canonical frame detection module, which accepts the video input, locates events in the video, and identifies representative frames corresponding to a pair of events;
a text encoder module, which encodes the labels of the objects detected in a pair of context frames into vectors;
a cross attention module, which fuses the features; the fused features are interleaved via attention to obtain context and event representations;
a classifier, which outputs a causal probability prediction and passes the encoded event captions to the causal rationalization module;
and a causal rationalization module, which rationalizes the causal relationship.
7. The system according to claim 6, wherein the text encoder module operates as follows:
(1) the text representations of events e1 and e2 are encoded using BERT, with w1 and w2 denoting the encoded text;
(2) a Faster R-CNN model pre-trained on MS-COCO is used to perform object detection on the canonical frames, thereby establishing the visual context.
8. The system of claim 6, wherein the classifier is a binary classifier.
9. An electronic device, comprising: a memory and at least one processor;
wherein the memory has stored thereon a computer program;
the at least one processor executing the memory-stored computer program causes the at least one processor to perform the method of multimodal inference of video and natural language as claimed in any of claims 1 to 5.
10. A computer-readable storage medium, in which a computer program is stored, the computer program being executable by a processor to implement the method for multimodal inference of video and natural language according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110935190.6A CN113609259A (en) | 2021-08-16 | 2021-08-16 | Multi-mode reasoning method and system for videos and natural languages |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110935190.6A CN113609259A (en) | 2021-08-16 | 2021-08-16 | Multi-mode reasoning method and system for videos and natural languages |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113609259A true CN113609259A (en) | 2021-11-05 |
Family
ID=78340764
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110935190.6A Pending CN113609259A (en) | 2021-08-16 | 2021-08-16 | Multi-mode reasoning method and system for videos and natural languages |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113609259A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108229175A (en) * | 2017-12-28 | 2018-06-29 | 中国科学院信息工程研究所 | A kind of correlation analysis system and method for multidimensional isomery forensic information |
CN108960063A (en) * | 2018-06-01 | 2018-12-07 | 清华大学深圳研究生院 | It is a kind of towards event relation coding video in multiple affair natural language description algorithm |
WO2020152843A1 (en) * | 2019-01-25 | 2020-07-30 | 日本電気株式会社 | Processing device, processing method and program |
CN113159034A (en) * | 2021-04-23 | 2021-07-23 | 杭州电子科技大学 | Method and system for automatically generating subtitles by using short video |
- 2021-08-16 — CN application CN202110935190.6A, patent CN113609259A (status: Pending)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108229175A (en) * | 2017-12-28 | 2018-06-29 | 中国科学院信息工程研究所 | A kind of correlation analysis system and method for multidimensional isomery forensic information |
CN108960063A (en) * | 2018-06-01 | 2018-12-07 | 清华大学深圳研究生院 | It is a kind of towards event relation coding video in multiple affair natural language description algorithm |
WO2020152843A1 (en) * | 2019-01-25 | 2020-07-30 | 日本電気株式会社 | Processing device, processing method and program |
CN113159034A (en) * | 2021-04-23 | 2021-07-23 | 杭州电子科技大学 | Method and system for automatically generating subtitles by using short video |
Non-Patent Citations (2)
Title |
---|
AMAN CHADHA et al.: "iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability", arXiv, pages 1-12 *
JINGZHOU LIU et al.: "Violin: A large-scale dataset for video-and-language inference", arXiv, pages 1-18 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106534548B (en) | Voice error correction method and device | |
CN112015949B (en) | Video generation method and device, storage medium and electronic equipment | |
KR102169925B1 (en) | Method and System for Automatic Image Caption Generation | |
CN111582241A (en) | Video subtitle recognition method, device, equipment and storage medium | |
CN112364810A (en) | Video classification method and device, computer readable storage medium and electronic equipment | |
CN102207954A (en) | Electronic apparatus, content recommendation method and program therefor | |
CN112016573B (en) | Bullet screen generation method and device, electronic equipment and computer storage medium | |
CN113766314B (en) | Video segmentation method, device, equipment, system and storage medium | |
CN110555136A (en) | Video tag generation method and device and computer storage medium | |
CN111723784A (en) | Risk video identification method and device and electronic equipment | |
CN113515997B (en) | Video data processing method and device and readable storage medium | |
KR20230133059A (en) | Ai-based digital contents automated production method, apparatus and system | |
CN105847752B (en) | Information encoding-decoding method, equipment and video monitoring system | |
CN101772950A (en) | Method of processing moving picture and apparatus thereof | |
Liu et al. | Controlllm: Augment language models with tools by searching on graphs | |
CN110991175A (en) | Text generation method, system, device and storage medium under multiple modes | |
CN116168119A (en) | Image editing method, image editing device, electronic device, storage medium, and program product | |
CN113515998A (en) | Video data processing method and device and readable storage medium | |
KR20210097314A (en) | Artificial intelligence based image generation system | |
CN113609259A (en) | Multi-mode reasoning method and system for videos and natural languages | |
Raj et al. | Deep learning based video captioning in bengali | |
CN103984699A (en) | Pushing method and pushing device for promotion information | |
CN110750669A (en) | Method and system for generating image captions | |
CN111259197A (en) | Video description generation method based on pre-coding semantic features | |
CN113010635B (en) | Text error correction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||