CN117557946A - Video event description and attribution generation method, system, equipment and storage medium - Google Patents

Video event description and attribution generation method, system, equipment and storage medium

Info

Publication number
CN117557946A
Authority
CN
China
Prior art keywords
event
text
information
representation
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410034631.9A
Other languages
Chinese (zh)
Other versions
CN117557946B (en)
Inventor
徐童
陈恩红
吕元杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202410034631.9A priority Critical patent/CN117557946B/en
Publication of CN117557946A publication Critical patent/CN117557946A/en
Application granted granted Critical
Publication of CN117557946B publication Critical patent/CN117557946B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/44 Event detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/02 Knowledge representation; Symbolic representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a video event description and attribution generation method, system, device and storage medium, which are schemes corresponding to one another, wherein: event understanding in videos is explored at a higher semantic level, irrelevant information contained in the videos is effectively filtered out to obtain more valuable multi-modal cues, and thereby more accurate event text descriptions are generated; a knowledge graph is introduced to enhance the logical correlation between events, and an event-aware attention mechanism is combined to generate the cause of each event, which effectively overcomes the difficulty of capturing the associations among multiple events in a video and yields more accurate event attributions.

Description

Video event description and attribution generation method, system, equipment and storage medium
Technical Field
The present invention relates to the field of computer vision and natural language processing, and in particular, to a method, system, device and storage medium for video event description and attribution generation.
Background
With the rise of social media platforms, watching long videos online has become an important part of daily entertainment. At the same time, this creates an urgent need for intelligent services that automatically understand the events occurring in long videos. For example, a user may first want to read a storyline summary to select a video of interest, and then refer to the sequence of events in the video to follow the story line. In this context, event description and attribution in long videos are very important. However, despite the great progress made in video understanding tasks, even the most advanced large-model-based techniques still struggle to abstract high-level semantic event descriptions and attribution information from long videos. As a result, service providers have to rely heavily on manual annotation, which not only leads to high cost and low efficiency, but also severely compromises the user experience.
As technology advances, existing techniques are already able to automatically identify social interactions between characters, which can be regarded as the core elements of an event. However, two major challenges remain in handling this task. First, the amount of information in visual cues is large, and most of it may be task-irrelevant noise that needs to be filtered out to avoid interference; to make matters more complex, different social interactions often overlap in time. Second, when performing the event attribution task, it is often difficult to infer the abstract associations between segments from the similarity of low-level multi-modal cues, and such associations cannot be captured by simply measuring the similarity between two sentences. In this case, conventional attention mechanisms based on similarity metrics may not achieve the desired effect. In summary, both the description and the attribution tasks require more comprehensive and thorough solutions, which existing approaches do not provide.
In view of this, the present invention has been made.
Disclosure of Invention
The invention aims to provide a video event description and attribution generation method, a system, equipment and a storage medium, which adopt a knowledge-enhanced multi-aspect attention mechanism and can generate more accurate event text description and reason description.
The invention aims at realizing the following technical scheme:
a video event description and attribution generation method, comprising:
for a single event, correspondingly extracting visual mode characteristics and text mode characteristics from corresponding video information and dialogue text information, and extracting corresponding global dialogue text representations from the text mode characteristics; measuring the association degree between given social interaction and visual mode characteristics and global dialogue text representation through an interaction perception attention mechanism, and generating text description of an event by combining multi-mode clues formed by the visual mode characteristics and the text mode characteristics;
and generating common knowledge of each event based on the knowledge graph, combining the common knowledge with text description of the corresponding event, extracting feature representation of each event through a feature extractor, capturing the association degree between the events through an event perception attention mechanism based on the feature representation of the event, and generating the occurrence reason description of each event.
A video event description and attribution generation system, comprising:
the event text description generating unit is used for correspondingly extracting visual mode characteristics and text mode characteristics from corresponding video information and dialogue text information for a single event, and extracting corresponding global dialogue text representations from the text mode characteristics; measuring the association degree between given social interaction and visual mode characteristics and global dialogue text representation through an interaction perception attention mechanism, and generating text description of an event by combining multi-mode clues formed by the visual mode characteristics and the text mode characteristics;
the event attribution generating unit is used for generating common knowledge of each event based on the knowledge graph, extracting the characteristic representation of each event through the characteristic extractor after being combined with the text description of the corresponding event, and generating the occurrence reason description of each event based on the association degree of the event captured through the event perception attention mechanism of the characteristic representation of the event.
A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium storing a computer program which, when executed by a processor, implements the method described above.
According to the technical scheme provided by the invention, event understanding in the video on a higher semantic level is explored, irrelevant information contained in the video is effectively screened out, a more valuable multi-modal clue is obtained, and more accurate event text description is generated; the knowledge graph is introduced to enhance the logic correlation between the events, and the event awareness attention mechanism is combined to generate the reason of occurrence of the events, so that the difficulty that the association of a plurality of events in the video is difficult to capture is effectively solved, and more accurate event attribution is generated.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is evident that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of a video event description and attribution generation method provided by an embodiment of the present invention;
FIG. 2 is a diagram of a model framework for generating a textual description of an event provided by an embodiment of the present invention;
FIG. 3 is a diagram of a model framework for generating event attributions provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a video event description and attribution generation system according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
the terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.
The term "consisting of … …" is meant to exclude any technical feature element not explicitly listed. If such term is used in a claim, the term will cause the claim to be closed, such that it does not include technical features other than those specifically listed, except for conventional impurities associated therewith. If the term is intended to appear in only a clause of a claim, it is intended to limit only the elements explicitly recited in that clause, and the elements recited in other clauses are not excluded from the overall claim.
The following describes in detail a video event description and attribution generation method, system, device and storage medium. What is not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art. Where specific conditions are not noted in the examples of the present invention, they are carried out according to conditions conventional in the art or suggested by the manufacturer.
Example 1
The embodiment of the invention provides a video event description and attribution generation method, as shown in fig. 1, which mainly comprises the following steps:
step 1, an event text description generation stage.
In the embodiment of the invention, for a single event, corresponding visual mode characteristics and text mode characteristics are extracted from corresponding video information and dialogue text information, and corresponding global dialogue text representations are extracted from the text mode characteristics; and measuring the association degree between given social interaction and the visual mode characteristics and the overall dialogue text representation respectively through an interaction perception attention mechanism, and generating the text description of the event by combining multi-mode clues formed by the visual mode characteristics and the text mode characteristics.
In the event text description generation stage, the association degrees between the given social interaction and the visual modality features and the global dialogue text representations are measured through an interaction-aware attention mechanism, and the multi-modal cues formed by the visual modality features and the text modality features are combined to generate the text description of the event. A preferred implementation is as follows:
(1) The given social interaction includes a social interaction text T_int and a social interaction image I_int. A global social interaction text representation g_int is extracted from the social interaction text T_int, and a social-interaction visual modality feature v_int is extracted from the social interaction image I_int.
(2) Through an attention mechanism, the association degree between the social-interaction visual modality feature v_int and the frame visual modality features v_1, ..., v_t is computed, and the association degree between the global social interaction text representation g_int and the global dialogue text representations g_1, ..., g_t is computed; the two computed association degrees are then combined to obtain the association degree between the multi-modal information and the given social interaction, where the multi-modal information refers to the video information and dialogue text information corresponding to the event. Here, v_i is the visual modality feature of the i-th frame f_i, i = 1, 2, ..., t, and t is the number of frames of the video information; g_i is the global dialogue text representation of the i-th dialogue text s_i, which corresponds to the i-th frame f_i, and the number of dialogue texts in the dialogue text information is equal to the number of frames of the video information. The global social interaction text representation g_int and the global dialogue text representations g_i are hidden states obtained from classification head tokens, i.e., a classification head token is prepended to the social interaction text and to each dialogue text, and the hidden state of that token output by the text transformer is used, as described in detail later.
(3) The association degree between the multi-modal information and the given social interaction is combined with the visual modality features and with the global dialogue text representations, respectively, to obtain the interaction-aware context information, which includes an interaction-aware visual context c_v and an interaction-aware text context c_t.
(4) Query transformation is applied to the visual modality features to obtain the visual query features; all visual query features are combined with all text modality features to obtain the multi-modal cues, denoted M.
(5) In the event text description generation stage, generating the text description of an event based on the combined multi-modal cues includes: sequentially generating the word at each time step and connecting the words in temporal order to form the text description of the event. At the j-th time step, a decoding feature is generated by a decoder and classified, and the word at the j-th time step is determined according to the resulting probability distribution. Specifically, at the j-th time step, a cross-attention mechanism is performed between the interaction-aware context information and the multi-modal cues M to capture the most valuable multi-modal cues, and the decoding feature d_j is generated in combination with the embedded representations of the previously generated words, expressed as:
d_j = Decoder(E_{1:j-1}, c_v ⊕ c_t ⊕ M)
where E_{1:j-1} is the embedded representation of all words generated from time step 1 to time step j-1, Decoder(·) denotes the decoder, and the symbol ⊕ denotes concatenation of the corresponding information along the first dimension.
The text description generation process of a single event is introduced above, and all events in the video data generate corresponding text descriptions in the above manner.
And 2, event attribution generation stage.
In the embodiment of the invention, the common sense knowledge of each event is generated based on the knowledge graph, and after the common sense knowledge is combined with the text description of the corresponding event, the characteristic representation of each event is extracted through the characteristic extractor, and the association degree between the events is captured through the event awareness attention mechanism based on the characteristic representation of the event, so that the description of the occurrence reason of each event is generated. The preferred embodiment is as follows:
(1) The pre-trained language model is fine-tuned on the knowledge graph, and the fine-tuned language model is used to generate the common sense knowledge of each event.
(2) The common sense knowledge of each event is added to the textual description of the respective event, and a feature representation of each event is extracted by a feature extractor.
(3) For each event, the association degrees between the event and the preceding events are calculated from their feature representations, and the description of the cause of the event is generated in combination with the feature representations of the preceding events.
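As a rough illustration of steps (1)-(3), the following Python sketch wires the three operations together with small placeholder components; the stand-in feature extractor, the hard-coded inference string, the feature dimension and the scaled dot-product form of the attention are assumptions made for readability, not the configuration actually used by the invention.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 256                                                            # assumed event feature dimension
feature_extractor = nn.Sequential(nn.Linear(768, d), nn.Tanh())    # stand-in for a pretrained language encoder head

def commonsense_inference(event_text: str) -> str:
    # Placeholder for the knowledge-graph fine-tuned language model (step 1).
    return event_text + " [inference] The other person may feel nervous."

def encode(text: str) -> torch.Tensor:
    # Placeholder text encoding (step 2): a real system would feed `text` through the language model.
    fake_hidden = torch.randn(768)                                  # stands in for the encoder hidden state
    return feature_extractor(fake_hidden)                           # (d,)

event_descriptions = [
    "PersonX encourages PersonY to join a baking race.",
    "PersonY practises baking at home.",
    "PersonY wins the baking race.",
]

# Steps (1)+(2): append the common sense inference, then extract one feature per event.
event_feats = torch.stack([encode(commonsense_inference(t)) for t in event_descriptions])  # (K, d)

# Step (3): for the k-th event, attend over the features of events 1..k-1.
k = 2                                                               # attribute the third event (0-based index 2)
query, keys = event_feats[k], event_feats[:k]
scores = torch.softmax(keys @ query / d ** 0.5, dim=0)              # association degrees with previous events
event_context = scores @ keys                                       # event-aware context passed on to the reason decoder
print(scores, event_context.shape)
```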
In order to more clearly demonstrate the technical scheme and the technical effects provided by the invention, the method provided by the embodiment of the invention is described in detail below by using specific embodiments.
1. Scheme overview.
The invention provides an event description and attribution generation method combining visual information and corresponding text information, which can automatically generate text descriptions of a plurality of events in a long video and reasons of occurrence of each event.
Compared with the traditional method for identifying low semantic information in videos such as actions, objects and the like, the method provided by the invention adopts a novel technical framework, and adopts knowledge-enhanced multi-aspect attention mechanisms for generating descriptions and attributions of story events in long videos.
Specifically, in the event text description generation stage, the invention designs an interaction-aware attention mechanism that estimates the semantic relatedness (association degree) between the visual and textual features and the social interaction, which is the core element of a story event; according to the estimated association degrees, valuable multi-modal cues are obtained and used to generate the text description of the corresponding event. In the event attribution generation stage, the invention further provides an event-aware attention mechanism, combined with a knowledge graph to enhance the logical correlation between events, and then generates the cause of each event. The proposed scheme helps to advance video semantic understanding technology and meets the audience's demand for higher-level video understanding.
2. Detailed description of the scheme.
1. The event text description generation phase.
The present invention contemplates an interactive awareness attention mechanism that utilizes a hierarchical attention structure to capture multi-modal cues associated with a given social interaction, which is equivalent to extracting cues associated with events from multiple modalities. Specifically, the present invention introduces a pre-training model to measure the semantic relevance of visual and textual features to a given interaction and screens valuable cues from video frames and subtitle sequences through semantic relevance. The present invention then fuses the filtered text and visual cues to the decoder input, while the cross-attention mechanism will further adaptively capture the most relevant and semantically consistent multi-modal cues, which will be used to generate a complete and accurate event description.
(1) And a pretreatment stage.
Considering the length of the video, the invention first divides the long video into several video clips by means of relevant tools, each clip containing one or more complete events. The video clip is the basic input unit, and a description is generated for each event.
The video clip is uniformly sampled at a designated frequency to obtain the corresponding video information, and the dialogue text information aligned with the video information is extracted from the video clip. By way of example, t frames may be uniformly sampled at a frequency of 1 frame per second, giving the video information F = {f_1, f_2, ..., f_t}, and the corresponding dialogue text information is denoted S = {s_1, s_2, ..., s_t}, where f_i is the i-th frame, s_i is the dialogue text corresponding to the i-th frame, i = 1, 2, ..., t, and t is the number of frames of the video information; the number of dialogue texts is equal to the number of frames of the video information.
In the embodiment of the invention, feature extraction is performed with a pre-training model. By way of example, the multi-modal pre-training model BLIP-2 may be adopted, which contains three components: a pre-trained image transformer (vision transformer) for visual feature extraction, denoted E_img; a text transformer that can serve as both a text encoder and a text decoder, denoted E_txt; and a lightweight query transformer (Q-Former) that effectively bridges the modality gap, denoted E_q. The video information F is input to the pre-trained image transformer to obtain the visual modality features; taking the i-th frame as an example, v_i = E_img(f_i), where v_i is the visual modality feature of the i-th frame f_i. Subsequently, the visual modality features are query-transformed by the lightweight query transformer to bridge the modality gap, producing the visual query features; taking the i-th frame as an example, q_i = E_q(v_i), where q_i is the visual query feature of the i-th frame f_i. Likewise, the dialogue text information is input to the text transformer E_txt; however, in order to obtain a global representation, a classification head token CLS is prepended to each dialogue text, and the hidden state of the CLS token is used as the global dialogue text representation. Specifically, after the classification head token CLS is prepended to each dialogue text, the text is input to the text transformer to obtain the text modality features; taking the i-th dialogue text s_i as an example, the CLS-prepended text is denoted [CLS; s_i] and is input to the text transformer to obtain the text modality feature H_i = E_txt([CLS; s_i]). The text modality feature H_i is a sequence of hidden states whose first element is the hidden state of the classification head token, i.e., the global dialogue text representation g_i.
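As a minimal, illustrative sketch of this preprocessing step in Python (PyTorch), the three BLIP-2 components are replaced below by small stand-in modules; the frame resolution, token ids, query-token count and feature dimension are assumptions for readability, not the actual BLIP-2 configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
t, d = 3, 256                                        # assumed: t sampled frames, feature dimension d

image_transformer = nn.Linear(3 * 64 * 64, d)        # stand-in for the pretrained vision transformer E_img
q_former = nn.MultiheadAttention(d, num_heads=4, batch_first=True)  # stand-in for the lightweight Q-Former E_q
text_transformer = nn.Embedding(1000, d)             # stand-in for the text transformer E_txt
query_tokens = nn.Parameter(torch.randn(8, d))       # learnable query tokens used by the Q-Former

frames = torch.rand(t, 3, 64, 64)                    # frames uniformly sampled at 1 frame per second
dialogues = [torch.randint(1, 1000, (12,)) for _ in range(t)]   # token ids of the aligned dialogue lines

# Visual modality features v_i and visual query features q_i
v = image_transformer(frames.flatten(1))             # (t, d) visual modality features
q, _ = q_former(query_tokens.unsqueeze(0).expand(t, -1, -1),     # queries: learnable tokens
                v.unsqueeze(1), v.unsqueeze(1))                  # keys/values: one frame feature each
# q: (t, 8, d) visual query features

# Text modality features H_i with a prepended CLS-style classification token (id 0 assumed)
H = [text_transformer(torch.cat([torch.tensor([0]), s])) for s in dialogues]   # each (length + 1, d)
g = torch.stack([h[0] for h in H])                   # global dialogue text representations g_i: (t, d)
print(v.shape, q.shape, g.shape)
```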
Then, with the features extracted above, one straightforward way to generate a textual description of the event is to feed all of these features to the multi-modal pre-training model, i.e., to decode the text with the text transformer. However, since video information and dialogue text information typically contain a large amount of event-irrelevant information, such irrelevant information would interfere with the generation of the event text description.
(2) A textual description of the event is generated based on the interactive awareness mechanism.
As described above, a large amount of event-irrelevant multi-modal information exists in the video information and the dialogue text information; therefore, the irrelevant information (including irrelevant text information and visual information) needs to be excluded first. Since social interactions between characters are central to events and can be detected using existing methods, the present invention evaluates the relevance of the various features to the social interaction (given as text and an image). This is equivalent to measuring the relevance of the various types of features to the event.
In the text part, if a dialogue text s_i is relevant to the social interaction text T_int, the two should exhibit a high degree of semantic similarity. Based on this observation, the present invention constructs the text part of the interaction-aware attention mechanism. Likewise, the text transformer described above is used to extract features from the social interaction text T_int, and the invention freezes it during training. Meanwhile, in order to obtain a global representation of the social interaction, a classification head token is prepended to the social interaction text, and the hidden state of that token is used. The social interaction text T_int with the classification head token prepended is denoted [CLS; T_int] and is passed through the text transformer E_txt to obtain the social interaction text modality features, expressed as:
H_int = E_txt([CLS; T_int])
where H_int is the social interaction text modality feature, whose first element is the hidden state of the classification head token, denoted g_int, i.e., the global social interaction text representation.
The global social interaction text representation g_int is used as the query and the global dialogue text representations G = [g_1; g_2; ...; g_t] are used as the keys, and the association degree A_t between the global social interaction text representation and the global dialogue text representations is calculated by a dot-product attention mechanism, expressed as:
A_t = softmax( (g_int W_q^t)(G W_k^t)^T / sqrt(d) )
where T is the transpose symbol, d is the dimension of the global dialogue text representation, W_q^t and W_k^t are learnable weight parameters, and softmax(·) is the normalized exponential function.
Similar to the text part, the present invention also needs to filter out visual information that is not related to the event. The image describing the social interaction is input to the image transformer E_img to extract the social-interaction visual modality feature v_int. With the social-interaction visual modality feature v_int as the query and the frame visual modality features V = [v_1; v_2; ...; v_t] as the keys, the association degree A_v between v_int and the frame visual modality features is computed by a dot-product attention mechanism, expressed as:
A_v = softmax( (v_int W_q^v)(V W_k^v)^T / sqrt(d) )
where T is the transpose symbol, d denotes the dimension of the visual modality features, which is the same as the dimension of the global dialogue text representation (i.e., both are of dimension d); W_q^v and W_k^v are learnable weight parameters; and softmax(·) is the normalized exponential function.
Since the dialogue text information is temporally aligned with the video information, the association degrees A_v and A_t are aggregated and normalized to obtain the association degree A between the multi-modal information and the given social interaction, expressed as:
A = Norm(A_v + A_t)
where Norm(·) is a normalization function. The association degree A is an attention score matrix that reflects the degree of association between the multi-modal information and the social interaction, i.e., between the multi-modal information and the event.
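For concreteness, the two dot-product attention computations and their aggregation can be sketched in Python (PyTorch) as follows; the dimensions, the use of bias-free linear layers for the weight parameters, and the choice of averaging followed by renormalization as the aggregation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
t, d = 3, 256                                   # assumed: t aligned frame/dialogue pairs, dimension d

Wq_t, Wk_t = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)   # learnable weights, text part
Wq_v, Wk_v = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)   # learnable weights, visual part

g_int = torch.randn(d)       # global social interaction text representation
v_int = torch.randn(d)       # visual modality feature of the social interaction image
G = torch.randn(t, d)        # global dialogue text representations g_1..g_t
V = torch.randn(t, d)        # frame visual modality features v_1..v_t

A_t = torch.softmax(Wk_t(G) @ Wq_t(g_int) / d ** 0.5, dim=0)   # text-side association degrees, (t,)
A_v = torch.softmax(Wk_v(V) @ Wq_v(v_int) / d ** 0.5, dim=0)   # visual-side association degrees, (t,)

# Frames and dialogue lines are temporally aligned, so the two score vectors can be aggregated
# per time step; averaging followed by renormalization is assumed here as the Norm(.) step.
A = (A_t + A_v) / 2
A = A / A.sum()
print(A)
```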
The present invention then uses the association degree A to weight the various features, which yields the interaction-aware context information; in essence, this highlights the relevant visual and textual information that is helpful for understanding the event. The interaction-aware context information includes an interaction-aware visual context and an interaction-aware text context, wherein:
The association degree between the multi-modal information and the given social interaction is combined with the visual modality features to obtain the interaction-aware visual context, and it is combined with the global dialogue text representations to obtain the interaction-aware text context, expressed as:
c_v = A V W_c^v,  c_t = A G W_c^t
where W_c^v and W_c^t are learnable weight parameters, V = [v_1; v_2; ...; v_t] and G = [g_1; g_2; ...; g_t], c_v is the interaction-aware visual context, and c_t is the interaction-aware text context.
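Continuing in the same illustrative style, the two contexts can be formed by weighting the frame features and the dialogue representations with the association degree A; the bias-free linear projections below again stand in for the learnable weight parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
t, d = 3, 256
A = torch.softmax(torch.randn(t), dim=0)        # association degrees from the previous sketch, (t,)
V, G = torch.randn(t, d), torch.randn(t, d)     # frame features v_1..v_t and dialogue representations g_1..g_t

Wc_v = nn.Linear(d, d, bias=False)              # assumed learnable projection for the visual context
Wc_t = nn.Linear(d, d, bias=False)              # assumed learnable projection for the text context

c_v = A @ Wc_v(V)                               # interaction-aware visual context, (d,)
c_t = A @ Wc_t(G)                               # interaction-aware text context, (d,)
print(c_v.shape, c_t.shape)
```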
While the interaction-aware context information can aggregate the information most relevant to an event at the time-step level, multi-modal information typically spans multiple time steps, and using only the interaction-aware context information would inevitably miss important cues. Thus, the present invention further performs an attention mechanism between the interaction-aware context information and the complete multi-modal features (i.e., the complete visual query features and text modality features); such a hierarchy can better capture the most relevant multi-modal cues across multiple time steps. Then, the word at each time step is generated sequentially, and the words are connected in temporal order to form the text description of the event.
Specifically, when generating the decoding feature of the j-th word of the description, the text modality features and visual query features of all time steps from 1 to t are concatenated as the multi-modal cues, a cross-attention mechanism is performed between them and the interaction-aware context information to obtain the most valuable cues, and together with the embedded representations of the words at steps 1 to j-1 they are provided to the decoder to obtain the decoding feature d_j at step j, expressed as:
d_j = Decoder(E_{1:j-1}, c_v ⊕ c_t ⊕ M)
where M = H ⊕ Q is the multi-modal cue, H = [H_1; H_2; ...; H_t] is the complete set of text modality features, and Q = [q_1; q_2; ...; q_t] is the complete set of visual query features.
after that, linear layer sum can be usedSoftmaxLayer-to-decoder output characteristicsSorting and calculating the firstjProbability distribution of individual words, the word corresponding to the highest probability value being the firstjAnd (5) personal words. And so on until a complete text description is generated.
FIG. 2 illustrates the model framework for generating the textual description of an event. In FIG. 2, t_1 to t_3 are the time steps corresponding to the video frames; the encoder in the left part, consisting of a self-attention mechanism and a feed-forward neural network, is the text transformer, where N is the number of stacked self-attention and feed-forward layers; the right part is the structure of the decoder, in which the self-attention, cross-attention and feed-forward layers are likewise stacked N times. The principles of these networks follow conventional technology and are therefore not described in detail. The learnable query tokens are a structure unique to the lightweight query transformer (Q-Former) and can be understood as learnable query parameters; "Linear" denotes a linear layer, a basic structure in neural networks.
2. Event attribution generation phase.
The text descriptions of events generated in the previous stage are critical to understanding the events in the video story. However, a coherent story is made up of a number of interrelated events. Thus, the present invention needs to explore the deeper causes behind each event and understand the story at the video level as a whole, not just at the level of a single clip. Accordingly, the present invention further generates the cause of each event from the textual descriptions of the events. The current event is affected by previous events, but not all previous events are related to the current event. Worse still, the description of the current event often differs semantically from the descriptions of the previously related events, which prevents a plain attention mechanism from eliminating irrelevant information.
To this end, in the event attribution generation phase, the present invention designs an event aware attention mechanism that not only utilizes a similar hierarchical attention structure, but also introduces an external knowledge graph for predicting possible outcomes from previous events. In this way, the link between related events may be semantically enhanced by additional cues. In addition, in order to supplement the missing events in the external knowledge graph, the invention performs fine tuning on a large pre-training model, and converts the implicit knowledge into explicit knowledge triples in the external knowledge graph, so that the external knowledge graph can predict the possible consequences of any given event. Finally, the present invention employs a similar attention mechanism as the previous stage, capturing attributions between events from the descriptions of the events and the rich logical reasoning provided by the refined knowledge graph.
(1) And introducing an inference knowledge graph.
In order to address the challenge of semantic difference, the invention introduces an inference knowledge graph. The goal is to use common sense to infer what may happen after an activity is completed, including the reasons following the completion of the activity, its impact on one character, and its impact on other characters.
In particular, the present invention introduces a common sense knowledge graph to establish the logical relationships linking the video clips. In this embodiment, a large-scale common sense knowledge graph is used (e.g., ATOMIC; other common sense knowledge graphs may be used instead). However, a common sense knowledge graph cannot cover all events. To solve this problem, the present invention expands the implicit knowledge in a pre-trained language model into explicit knowledge in the common sense knowledge graph. Specifically, the present invention fine-tunes a pre-trained language model (e.g., the Flan-T5 model) on the common sense knowledge graph. When applying the common sense knowledge graph, given an event that is not in the graph, the present invention anonymizes the named entities and adjusts the event to the format of the knowledge graph. Taking the event "Frank's friend encourages Frank to participate in a baking race" as an example, it is adjusted to the format "PersonX encourages PersonY to participate in a baking race"; then a desired relation is provided, such as oReact (Other React), which represents the possible reaction of the other person (e.g., PersonY); finally, the fine-tuned language model generates the sentence "PersonY may feel nervous." In this way, the present invention can generate knowledge (inferences) for any given event.
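The anonymize-then-query procedure can be sketched with an off-the-shelf seq2seq checkpoint as below; note that the public Flan-T5 weights used here have not been fine-tuned on ATOMIC, and the prompt format and relation tag are assumptions, so the output is only indicative of the interface, not of the invention's fine-tuned model.

```python
# Illustrative only: queries a public Flan-T5 checkpoint rather than the knowledge-graph fine-tuned model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Named entities are anonymized into the PersonX / PersonY format used by the knowledge graph.
event = "PersonX encourages PersonY to participate in a baking race."
relation = "oReact"                     # assumed relation tag: how the other person may react
prompt = f"{event} {relation}:"         # assumed prompt format

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```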
(2) The cause of the event is generated based on the event-aware attention mechanism.
The invention uses the event-aware attention mechanism to further capture the associations between events and generate the reasons behind the current event. As shown in FIG. 3, the present invention first appends the extracted common sense knowledge to the text description of the corresponding event, obtaining the event text description together with its related inferences; then the related events are coarsely screened by the language model to obtain the most likely related events and their inferences (i.e., events before the current event together with the related inferences), from which the feature representations of those events are extracted. The feature representation of the current event is also extracted by the language model, and these two feature representations are used as the inputs for event attribution.
For the k-th event, let its feature representation be e_k. The feature representations of the events preceding the k-th event, E_{<k} = [e_1; e_2; ...; e_{k-1}], are used as the keys, and the feature representation e_k of the k-th event is used as the query; the association degree A_e between them is calculated by a dot-product attention mechanism, expressed as:
A_e = softmax( (e_k W_q^e)(E_{<k} W_k^e)^T / sqrt(d) )
where T is the transpose symbol, d denotes the dimension of the event feature representation, which is the same as the dimension of the visual modality features and the global dialogue text representation; W_q^e and W_k^e are learnable weight parameters; softmax(·) is the normalized exponential function; and e_1, ..., e_{k-1} are the feature representations of the 1st to (k-1)-th events, respectively.
The association degree A_e is used to weight the feature representations of the events preceding the k-th event, yielding the event-aware context, expressed as:
c_e = A_e E_{<k} W_c^e
where W_c^e is a learnable weight parameter and c_e is the event-aware context.
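The event-aware attention step can be sketched analogously to the interaction-aware attention above; the event count, dimension and bias-free linear layers are again assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
k, d = 4, 256                                    # assumed: the current event is the 4th one, dimension d

Wq_e, Wk_e = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)   # learnable attention weights
Wc_e = nn.Linear(d, d, bias=False)                                      # learnable projection for the context

e_k = torch.randn(d)                             # feature representation of the k-th event
E_prev = torch.randn(k - 1, d)                   # feature representations of events 1 .. k-1

A_e = torch.softmax(Wk_e(E_prev) @ Wq_e(e_k) / d ** 0.5, dim=0)   # association degrees with previous events
c_e = A_e @ Wc_e(E_prev)                                          # event-aware context, (d,)
print(A_e, c_e.shape)
```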
The event-aware context is then used to decode the description of the cause of the k-th event; this part is similar to the previous stage.
Preferably, considering that the cause of the k-th event can also be inferred from the characters' dialogue in the video segment corresponding to the k-th event (i.e., the dialogue under the current segment), the text modality features of the corresponding dialogue text information are also taken as an input for event attribution. That is, the description of the cause of the k-th event is decoded from the event-aware context and the text modality features H_k extracted from the dialogue text information of the k-th event. The decoding feature at the r-th time step is:
d_r = Decoder(E_{1:r-1}, c_e ⊕ H_k)
where the text modality features H_k = [H_{k,1}; H_{k,2}; ...; H_{k,t}], in which H_{k,1}, H_{k,2}, ..., H_{k,t} denote the text modality features of the 1st, 2nd, ..., t-th dialogue texts of the k-th event, and t is the number of dialogue texts.
After that, a linear layer and a softmax layer can be used to classify the decoder output feature d_r and compute the probability distribution of the r-th word; the word with the highest probability is taken as the r-th word. This is repeated until a complete description of the cause of the event is generated.
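The decoder memory for this attribution step can be assembled as in the sketch below, mirroring the description-generation stage; the shapes are assumptions, and the subsequent decoding and linear-plus-softmax classification proceed exactly as in the earlier decoding sketch.

```python
import torch

d, t = 256, 3
c_e = torch.randn(1, d)                          # event-aware context for the k-th event
H_k = torch.randn(t * 12, d)                     # text modality features of the k-th event's dialogue lines

# First-dimension concatenation of the event-aware context and the dialogue text features; the
# decoder cross-attends to this memory and a linear + softmax head then picks the r-th word.
memory = torch.cat([c_e, H_k], dim=0).unsqueeze(0)
print(memory.shape)
```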
3. Training protocols.
In the embodiment of the invention, the two stages can be implemented with corresponding models (the frameworks of the corresponding models are shown in FIGS. 2-3). The models are trained separately; cross entropy can be adopted as the training loss, and the learnable weight parameters W are the main parameters optimized during training.
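A minimal sketch of the cross-entropy training step is given below; the optimizer, learning rate, batch shape and padding convention are assumptions, and only the decoding head is shown as a proxy for the learnable weight parameters W.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d = 5000, 256
classifier = nn.Linear(d, vocab)                       # proxy for the learnable weight parameters being optimized
criterion = nn.CrossEntropyLoss(ignore_index=0)        # assumed padding id 0
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-4)   # assumed optimizer and learning rate

decoding_feats = torch.randn(8, 12, d)                 # (batch, sequence length, d) decoder outputs
target_ids = torch.randint(1, vocab, (8, 12))          # word ids of the reference description

logits = classifier(decoding_feats)                    # (8, 12, vocab)
loss = criterion(logits.reshape(-1, vocab), target_ids.reshape(-1))
loss.backward()
optimizer.step()
print(float(loss))
```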
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
Example two
The present invention also provides a video event description and attribution generation system, which is mainly used for implementing the method provided in the foregoing embodiment, as shown in fig. 4, and the system mainly includes:
the event text description generating unit is used for correspondingly extracting visual mode characteristics and text mode characteristics from corresponding video information and dialogue text information for a single event, and extracting corresponding global dialogue text representations from the text mode characteristics; measuring the association degree between given social interaction and visual mode characteristics and global dialogue text representation through an interaction perception attention mechanism, and generating text description of an event by combining multi-mode clues formed by the visual mode characteristics and the text mode characteristics;
the event attribution generating unit is used for generating common knowledge of each event based on the knowledge graph, extracting the characteristic representation of each event through the characteristic extractor after being combined with the text description of the corresponding event, and generating the occurrence reason description of each event based on the association degree of the event captured through the event perception attention mechanism of the characteristic representation of the event.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the system is divided into different functional modules to perform all or part of the functions described above.
Example III
The present invention also provides a processing apparatus, as shown in fig. 5, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, the processor, the memory, the input device and the output device are connected through buses.
In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the memory may be random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as disk memory.
Example IV
The invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium according to the embodiment of the present invention may be provided as a computer readable storage medium in the aforementioned processing apparatus, for example, as a memory in the processing apparatus. The readable storage medium may be any of various media capable of storing a program code, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, and an optical disk.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (10)

1. A method for video event description and attribution generation, comprising:
for a single event, correspondingly extracting visual mode characteristics and text mode characteristics from corresponding video information and dialogue text information, and extracting corresponding global dialogue text representations from the text mode characteristics; measuring the association degree between given social interaction and visual mode characteristics and global dialogue text representation through an interaction perception attention mechanism, and generating text description of an event by combining multi-mode clues formed by the visual mode characteristics and the text mode characteristics;
and generating common knowledge of each event based on the knowledge graph, combining the common knowledge with text description of the corresponding event, extracting feature representation of each event through a feature extractor, capturing the association degree between the events through an event perception attention mechanism based on the feature representation of the event, and generating the occurrence reason description of each event.
2. The method of claim 1, wherein the measuring, by the interactive awareness mechanism, the degree of association between the given social interaction and the visual modality feature and the global dialog text representation, respectively, comprises:
the given social interactions include: social interaction textImage>The method comprises the steps of carrying out a first treatment on the surface of the Text from social interactions->Extracting global social interaction text representation +.>From social interaction image->Extracting visual mode characteristics of social interaction>
Computing visual modality features of social interactions through an attention mechanismAnd visual modality characteristics->The degree of association between each other and the global social interaction text representation +.>Text representation +.>The degree of association between the two; wherein (1)>Is the firstiFrame->Is characterized by the visual modality of (c),ti=1, 2, …,t,/>is the firstiStrip dialogue text->Global dialog text representation of (a), item (b)iStrip dialogue text->Corresponds to the firstiFrame->The number of dialogue texts in the dialogue text information is equal to the number of frames of the video information; global social interactive text representationText representation +.>Are hidden states obtained by sorting the header marks.
3. The video event description and attribution generation method as claimed in claim 2, further comprising: the calculated association degrees of the two parts are combined to obtain the association degree of the multi-modal information and the given social interaction, wherein the multi-modal information refers to the association degree of the video information and the dialogue text information corresponding to the event, the association degree of the multi-modal information and the given social interaction is respectively combined with the visual modal feature and the global dialogue text representation to obtain the interactive perception context information, and then the multi-modal clues formed by the visual modal feature and the text modal feature are combined to generate the text description of the event;
wherein, combining the calculated association degrees of the two parts, obtaining the association degree of the multi-modal information and the given social interaction comprises:
visual modality characterization of social interactionsAs query information, the visual modality feature +.>As key information, computing visual modality features of social interactions by dot product attention mechanism +.>Degree of association with visual modality characteristics +.>Expressed as:
wherein T is the transposed symbol,and->Are all weight parameters which can be learned; />Is a normalized exponential function;
representing global social interaction textAs query information, the global dialog text is represented +.>As key information, calculating the degree of association between the global social interaction text representation and the global dialog text representation by means of a dot product attention mechanism +.>Expressed as:
wherein the dimensionality of the visual mode characteristic and the dimensionality of the global dialogue text representation are respectivelydAnd->Are all weight parameters which can be learned;
degree of aggregate relevanceAnd->And performing standardized processing to obtain the association degree A of the multi-modal information and the given social interaction.
4. A video event description and attribution generation method according to claim 3, wherein the interactive perception context information comprises: an interactively perceived visual context and an interactively perceived text context; wherein:
the association degree of the multi-modal information and the given social interaction is combined with the visual modal characteristics to obtain an interactive perception visual context, the association degree of the multi-modal information and the given social interaction is combined with the global dialogue text to obtain an interactive perception text context, and the interactive perception text context is expressed as:
wherein A is the association degree of the multi-modal information with the given social interaction,and->Are all weight parameters which can be learned; />For the visual context of interactive perception, +.>Is an interactively perceived text context.
5. A method of video event description and attribution generation as claimed in claim 3 or 4, wherein said generating a textual description of an event comprises:
sequentially generating words at all times, and connecting the words at all times according to the time sequence to form text description of the event;
wherein, at the j-th time step, a decoding feature is generated by a decoder and classified, and the word at the j-th time step is determined according to the probability distribution; the interaction-aware context information includes the interaction-aware visual context c_v and the interaction-aware text context c_t; the multi-modal cues formed by the visual modality features and the text modality features mean that the visual modality features are query-transformed to obtain the visual query features, and the visual query features are combined with the text modality features to obtain the multi-modal cues M; at the j-th time step, a cross-attention mechanism is performed between the interaction-aware context information and the multi-modal cues M, and the embedded representations of the previously generated words are further combined to generate the decoding feature d_j, expressed as:
d_j = Decoder(E_{1:j-1}, c_v ⊕ c_t ⊕ M)
where E_{1:j-1} is the embedded representation of all words generated from time step 1 to time step j-1, Decoder(·) denotes the decoder, and the symbol ⊕ denotes concatenation of the corresponding information along the first dimension.
6. The method for generating a description and attribution of video event according to claim 1, wherein the generating the common knowledge of each event based on the knowledge graph, and the combining with the text description of the corresponding event, extracting the feature representation of each event through the feature extractor, and capturing the association degree between the events through the event perception attention mechanism based on the feature representation of each event, generating the occurrence cause description of each event comprises:
performing fine adjustment on the pre-trained language model on the knowledge graph, and generating common sense knowledge of each event by utilizing the fine-adjusted language model;
adding knowledge of the common sense of each event to the text description of the corresponding event, extracting a feature representation of each event by a feature extractor;
for each event, the association degree between the events is calculated by combining the characteristic representation of the previous event, and the occurrence reason description of each event is generated by combining the characteristic representation of the previous event.
7. The method of claim 6, wherein for each event, calculating the degree of association between events in combination with the characteristic representation of the previous event, and generating the occurrence cause description of each event in combination with the characteristic representation of the previous event comprises:
for the firstkAn event, characterized byThe method comprises the steps of carrying out a first treatment on the surface of the Will be the firstkCharacterization of events preceding an eventAs key information, the firstkCharacteristic representation of individual events->As query information, the degree of relevance between them is calculated by means of the dot product attention mechanism>Expressed as:
wherein T is a transposed symbol, and the event feature representation dimension isdAnd->Are all weight parameters which can be learned;is a normalized exponential function; />、/>The characteristic representation of the 1 st event and the k-1 st event respectively;
by using degree of associationWeighting the firstkThe feature representation of the event preceding the event, the event-aware context is obtained, expressed as:
wherein,is a weight parameter which can be learned; />Perceiving context for an event;
using event awareness context, and (ii)kText modal characteristics extracted from dialogue text information of each event, and generating the firstkThe occurrence cause description of each event.
8. A video event description and attribution generation system, comprising:
the event text description generating unit is used for correspondingly extracting visual mode characteristics and text mode characteristics from corresponding video information and dialogue text information for a single event, and extracting corresponding global dialogue text representations from the text mode characteristics; measuring the association degree between given social interaction and visual mode characteristics and global dialogue text representation through an interaction perception attention mechanism, and generating text description of an event by combining multi-mode clues formed by the visual mode characteristics and the text mode characteristics;
the event attribution generating unit is used for generating common knowledge of each event based on the knowledge graph, extracting the characteristic representation of each event through the characteristic extractor after being combined with the text description of the corresponding event, and generating the occurrence reason description of each event based on the association degree of the event captured through the event perception attention mechanism of the characteristic representation of the event.
9. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-7.
CN202410034631.9A 2024-01-10 2024-01-10 Video event description and attribution generation method, system, equipment and storage medium Active CN117557946B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410034631.9A CN117557946B (en) 2024-01-10 2024-01-10 Video event description and attribution generation method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410034631.9A CN117557946B (en) 2024-01-10 2024-01-10 Video event description and attribution generation method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117557946A true CN117557946A (en) 2024-02-13
CN117557946B CN117557946B (en) 2024-05-17

Family

ID=89823499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410034631.9A Active CN117557946B (en) 2024-01-10 2024-01-10 Video event description and attribution generation method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117557946B (en)

Citations (6)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090319482A1 (en) * 2008-06-18 2009-12-24 Microsoft Corporation Auto-generation of events with annotation and indexing
US20200251006A1 (en) * 2019-01-31 2020-08-06 Dell Products L.P. Dynamic evaluation of event participants using a smart context-based quiz system
CN114528396A (en) * 2021-12-29 2022-05-24 北京辰安科技股份有限公司 Method and device for monitoring emergency, electronic equipment and storage medium
CN116341519A (en) * 2023-03-16 2023-06-27 上海大学 Event causal relation extraction method, device and storage medium based on background knowledge
CN116431789A (en) * 2023-04-19 2023-07-14 哈尔滨工业大学 Causal event extraction method based on causal event extraction model
CN116611021A (en) * 2023-04-19 2023-08-18 齐鲁工业大学(山东省科学院) Multi-mode event detection method and system based on double-transducer fusion model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WEIHAO ZHAO et al.: "Hierarchical Multi-modal Attention Network for Time-sync Comment Video Recommendation", IEEE, 29 August 2023 (2023-08-29) *
ZHENYU ZHANG et al.: "DialogueBERT: A Self-Supervised Learning based Dialogue Pre-training Encoder", https://doi.org/10.1145/3459637.3482085, 22 September 2021 (2021-09-22) *
CHEN Zhuo et al.: "Cross-modal video moment retrieval based on vision-text relationship alignment" (in Chinese), SCIENTIA SINICA Informationis (中国科学:信息科学), vol. 50, no. 6, 10 June 2020 (2020-06-10) *

Also Published As

Publication number Publication date
CN117557946B (en) 2024-05-17

Similar Documents

Publication Publication Date Title
CN111079444A (en) Network rumor detection method based on multi-modal relationship
CN109657054A (en) Abstraction generating method, device, server and storage medium
CN115329127A (en) Multi-mode short video tag recommendation method integrating emotional information
CN111680190B (en) Video thumbnail recommendation method integrating visual semantic information
CN114339450B (en) Video comment generation method, system, device and storage medium
Nian et al. Learning explicit video attributes from mid-level representation for video captioning
Jain et al. Video captioning: a review of theory, techniques and practices
CN116861258B (en) Model processing method, device, equipment and storage medium
CN109635303B (en) Method for recognizing meaning-changing words in specific field
Zhang Voice keyword retrieval method using attention mechanism and multimodal information fusion
EP3340073A1 (en) Systems and methods for processing of user content interaction
CN118132803B (en) Zero sample video moment retrieval method, system, equipment and medium
CN113987274A (en) Video semantic representation method and device, electronic equipment and storage medium
CN116796251A (en) Poor website classification method, system and equipment based on image-text multi-mode
WO2024188044A1 (en) Video tag generation method and apparatus, electronic device, and storage medium
CN111488813A (en) Video emotion marking method and device, electronic equipment and storage medium
CN117540007B (en) Multi-mode emotion analysis method, system and equipment based on similar mode completion
CN118051630A (en) Image-text retrieval system and method based on multi-mode consensus perception and momentum contrast
Yu et al. CgT-GAN: CLIP-guided Text GAN for Image Captioning
CN117765450A (en) Video language understanding method, device, equipment and readable storage medium
CN113204670A (en) Attention model-based video abstract description generation method and device
CN110516086B (en) Method for automatically acquiring movie label based on deep neural network
CN117557946B (en) Video event description and attribution generation method, system, equipment and storage medium
Weng et al. A survey of artificial intelligence techniques on MOOC of legal education
CN117648504A (en) Method, device, computer equipment and storage medium for generating media resource sequence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant