CN114357124B - Video paragraph positioning method based on language reconstruction and graph mechanism - Google Patents


Info

Publication number
CN114357124B
Authority
CN
China
Prior art keywords
video
graph
modal
text
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210270425.9A
Other languages
Chinese (zh)
Other versions
CN114357124A (en)
Inventor
徐行
蒋寻
沈复民
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Koala Youran Technology Co ltd
Original Assignee
Chengdu Koala Youran Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Koala Youran Technology Co ltd
Priority to CN202210270425.9A
Publication of CN114357124A
Application granted
Publication of CN114357124B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of cross-modal content retrieval in multi-modal video understanding, and discloses a video paragraph positioning method based on language reconstruction and a graph mechanism, which comprises the following steps: selecting a data set, constructing a video paragraph positioning model, training the model with a loss function, and finally verifying the effect of the model. The method improves the information interaction capability between fine-grained heterogeneous data, enhances the understanding of video content, and improves the model's cross-modal understanding of video-text content. The method can be used in a variety of multi-modal video understanding scenarios, such as online video apps, intelligent security systems and large-scale video content retrieval; it can improve the user experience of software and the working efficiency of practitioners in related fields such as video, security and social governance.

Description

Video paragraph positioning method based on language reconstruction and graph mechanism
Technical Field
The invention relates to the technical field of cross-modal content retrieval in multi-modal video understanding, and in particular to a video paragraph positioning method based on language reconstruction and a graph mechanism, which is used to improve the information interaction capability between fine-grained heterogeneous data, enhance the understanding of video content, and improve the model's cross-modal understanding of video-text content.
Background
As a multimedia technology hotspot of the internet era, multi-modal video understanding has attracted wide attention from industry and academia in recent years. Time sequence language positioning, one of the most challenging tasks in multi-modal video understanding, aims to perform segment-level retrieval from an uncropped long video according to a given query text: the computer must locate, within the long video, the segment corresponding to the event described by the query text. Time sequence language positioning has broad application scenarios: with the arrival of the big-media era, internet video auditing has become a heavy workload; by applying time sequence language positioning, fine-grained cross-modal video content retrieval can be realized, freeing manpower from tedious video auditing and searching. The technology can also be deployed in fields such as intelligent security, social governance and human-computer interaction, effectively improving user experience and working efficiency.
According to the form of the query text, current time sequence language positioning technology can be divided into two types. The first is video sentence positioning: the query text is a single sentence describing a single event, and the algorithm model retrieves the target segment from a long video containing multiple events in a one-to-many manner. The second is video paragraph positioning: the query text is a paragraph containing multiple sentences describing multiple events, and the algorithm model retrieves each event segment in a many-to-many manner. Over the last decade, video sentence positioning has been the focus of research and has developed greatly, but with the growth of multi-modal data, the drawbacks of this single-event positioning mechanism have gradually been exposed. For example, when multiple similar events occur in a video, video sentence positioning easily confuses their logical relationships and produces erroneous localization, because it performs event-level context modeling only with the video and ignores context modeling of the text modality. This leads to insufficient understanding of the video content by the model; in practical use, when the same or similar events occur repeatedly, the absence of event-level text context causes erroneous retrieval of event segments. Video paragraph positioning, by taking the description sentences of multiple events as the query text, can mine more event-level context information from the text modality, thereby reducing the possibility of misalignment.
However, positioning multiple events in video paragraph positioning also brings new challenges. First, adopting paragraphs as query text introduces more complexity and makes modal fusion more difficult; because of the many-to-many positioning mode, every sentence is visible to every event in the video during modal fusion, which brings a higher possibility of misalignment. Second, although keeping the chronological relationship among the sentences can provide sufficient temporal information, this also requires the model to have stronger long-range context modeling capability as the number of sentences grows.
Therefore, in order to solve the above technical problems of existing video paragraph positioning, the invention provides a video paragraph positioning method based on language reconstruction and a graph mechanism. The information interaction capability between fine-grained heterogeneous data is improved by introducing a multi-modal graph mechanism into the Transformer; understanding of video content is enhanced by context modeling among multiple events in an event feature decoder; meanwhile, a language reconstructor is designed to reconstruct the query text, further improving the model's cross-modal understanding of video-text content.
Disclosure of Invention
The invention aims to provide a video paragraph positioning method based on language reconstruction and a graph mechanism, which improves the information interaction capability between fine-grained heterogeneous data by introducing a multi-modal graph mechanism into the Transformer, enhances the understanding of video content by context modeling among multiple events in an event feature decoder, and further improves the model's cross-modal understanding of video-text content by designing a language reconstructor to reconstruct the query text.
The invention is realized by the following technical scheme: a video paragraph positioning method based on language reconstruction and graph mechanism comprises the following steps:
S1, selecting a training data set, and extracting video-paragraph pairs as the input of a positioning algorithm model;
S2, loading the model parameters of a pre-trained 3D convolutional neural network, and extracting the video modality in the video-paragraph pair to obtain segment-level video features;
S3, extracting the text modality in the video-paragraph pair, and using GloVe encoding to represent each word in the text modality as a word vector of fixed dimension, serving as the query text encoding;
S4, processing the query text encoding with a projection layer and regularization to obtain word-level text features, splitting the word-level text features by sentence, inputting each resulting sentence in turn into a bidirectional gated recurrent unit, and extracting sentence-level text features;
S5, concatenating the segment-level video features and the word-level text features, taking each feature point as a graph node and setting the strength of each edge as a learnable parameter, initializing a multi-modal fully-connected graph composed of video feature nodes and text feature nodes, and inputting the graph into a multi-modal graph encoder for multi-modal feature fusion, so that each node can selectively acquire information from its neighbouring nodes and fine-grained feature interaction is realized;
S6, extracting the video feature nodes from the multi-modal graph processed by the multi-modal graph encoder, inputting them together with the sentence-level text features extracted in step S4 into an event feature decoder to obtain the multi-modal features of the target events, and using a multi-layer perceptron to predict the relative position of each event in the complete video;
S7, using the multi-modal features of each target event obtained in step S6, extracting the text feature nodes from the multi-modal graph processed by the multi-modal graph encoder, inputting them into a language reconstructor, and re-translating them into the paragraph query text to realize query text reconstruction;
S8, calculating the time sequence position information loss according to the results predicted in step S6;
S9, extracting the attention weight matrix in the event feature decoder, and calculating the attention guide loss;
S10, calculating the language reconstruction loss according to the text reconstruction results of step S7;
S11, adopting an Adam optimizer and training the positioning algorithm model with a constant learning rate strategy.
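By way of illustration, steps S6 to S11 can be wired together as in the following minimal PyTorch-style sketch; the function and dictionary names (train_one_epoch, "location", "attention", "reconstruction") and the learning rate are illustrative assumptions and do not reproduce the exact patented implementation.

```python
import torch

def train_one_epoch(model, loader, optimizer, losses, alpha=1.0, beta=1.0, gamma=1.0):
    """One pass over video-paragraph pairs, covering steps S6-S11.

    `model` is assumed to bundle the multi-modal graph encoder, the event
    feature decoder and the language reconstructor, and to return the
    predicted timestamps, the cross-modal attention weights and the
    per-word probability distributions of the reconstructed paragraph.
    `losses` is a dict holding the three loss functions.
    """
    model.train()
    for video_feats, word_feats, sent_feats, gt_spans, gt_mask, gt_words in loader:
        timestamps, attn, word_probs = model(video_feats, word_feats, sent_feats)
        loss = (alpha * losses["location"](timestamps, gt_spans)            # step S8
                + beta * losses["attention"](attn, gt_mask)                 # step S9
                + gamma * losses["reconstruction"](word_probs, gt_words))   # step S10
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Step S11: Adam with a constant learning rate (no scheduler is attached), e.g.
# optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
```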
Unlike work in the traditional video-and-audio field, the invention operates in the video-and-text field. The invention belongs to time sequence language positioning / video segment retrieval: given one or more natural-language description texts (describing one or more segments of a video), the position of the corresponding video segment is retrieved in the video according to the text; the naturally unaligned states of the two modalities must be matched according to the semantics of the text modality before retrieval can be realized. The neural network model of the invention has a more complex structure and larger structural differences from prior models, including but not limited to: 1) multi-modal information interaction is realized by means of a multi-modal graph; 2) an event feature decoder is designed, which performs context modeling at the event level (rather than the conventional word level or video segment level) using the text modality, so as to better understand long video content; 3) a language reconstructor is designed to reconstruct the query text, so as to improve the model's understanding of deep semantics.
The invention also differs from traditional video temporal localization. Traditional video temporal localization focuses on single-sentence localization, i.e. only one query sentence is given at a time to locate one event, whereas the invention focuses on paragraph temporal localization, i.e. a paragraph consisting of multiple sentences is given to locate multiple events. In traditional video temporal localization, video context modeling is completed by pre-dividing video segments, and key text words in the query sentence are then selectively screened according to the video content. The invention instead establishes a multi-modal graph and uses a graph modeling layer, so that each node (video segment node or text word node) acquires information from its neighbouring nodes.
In order to better implement the present invention, a verification method for the positioning algorithm model is further included: in the testing stage, language reconstruction is not needed, so the language reconstructor is removed from the trained video paragraph positioning method based on language reconstruction and the graph mechanism to improve the inference speed; the remaining part without the language reconstructor is used as the evaluation model to perform multi-segment video retrieval on video and paragraph text pairs, so as to verify the effect of the positioning algorithm model.
To better implement the present invention, the inference process of the multi-modal graph encoder in step S5 further includes:
S5.1, connecting the video nodes and the text nodes, setting the edge weights as learnable values, and initializing the multi-modal graph;
S5.2, passing the multi-modal graph into the multi-modal graph encoder for multi-modal graph modeling, where a single layer of multi-modal graph modeling consists of a graph modeling layer and a Transformer encoder with video and text position encodings;
S5.3, the multi-modal graph encoder being composed of multiple layers of the single-layer multi-modal graph modeling structure of step S5.2, with the multi-modal graph iteratively updated layer by layer.
In order to better implement the present invention, further, step S5.2 comprises:
performing multi-modal graph reasoning in the graph modeling layer GM(·), so that each node acquires information from its neighbouring nodes and updates its own value and the edge weights.
In order to better implement the present invention, step S6 further includes:
extracting video feature nodes in a multi-modal graph processed by a multi-modal graph encoder, inputting the video feature nodes as encoded signals of an event feature decoder, inputting sentence-level text features as query signals of the event feature decoder, mining context relations among multiple events through a self-attention mechanism, obtaining multi-modal features of a target event through a cross-modal attention mechanism, and finally predicting the relative position of each event in a complete video by using a multi-layered perceptron.
In order to better implement the present invention, step S7 further includes:
inputting the multi-modal features of the target events obtained in step S6 as the encoded signal of the language reconstructor, extracting the text nodes from the multi-modal graph processed by the multi-modal graph encoder as the query signal input of the language reconstructor, calculating the probability distribution of each text node over the encoded vocabulary, and selecting the word with the maximum probability as the reconstruction result.
In order to better implement the present invention, step S8 further includes:
using the prediction result of each event in step S6, the positional information loss is calculated from the prediction result of the event, the total number of events, the actual annotation, and the G-IOU loss function.
In order to better implement the present invention, step S9 further includes:
attention weights in a cross-modal attention mechanism in an event feature decoder are extracted, and an attention-directed loss is calculated.
In order to better implement the present invention, step S10 further includes:
calculating the reconstruction loss according to the prediction result of the language reconstructor.
In order to better implement the present invention, step S10 further includes:
the position loss, the attention guide loss and the reconstruction loss are weighted and summed to form the final training target.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) the method introduces a graph modeling mechanism into video paragraph positioning, and promotes information interaction between fine-grained heterogeneous data by establishing a multi-modal graph;
(2) the invention designs an event characteristic decoder, reduces the alignment error of a plurality of event positioning in video paragraph positioning by exploring the context relation between events, and effectively improves the reliability of event positioning;
(3) the invention designs a language reconstructor, which assists the model to improve the cross-modal comprehension capability of the video-text, excavates deep semantics among heterogeneous data and simultaneously improves the interpretability of the model by reconstructing the query text;
(4) the invention effectively improves the precision of time sequence language positioning through testing, and has more obvious advantages in multi-event positioning compared with the prior art;
(5) the invention can be used in a variety of multi-modal video understanding scenarios, such as online video apps, intelligent security systems and large-scale video content retrieval; it can improve the user experience of software and the working efficiency of practitioners in related fields such as video, security and social governance.
Drawings
The invention is further described with reference to the following figures and examples, all of which are intended to be covered by the present disclosure and the scope of the invention.
Fig. 1 is a flowchart of a video paragraph locating method based on language reconstruction and graph mechanism according to the present invention.
Fig. 2 is a schematic structural diagram of a video paragraph locating method based on language reconstruction and graph mechanism according to the present invention.
Fig. 3 is a schematic diagram of the video paragraph positioning method based on language reconstruction and graph mechanism on the Charades-STA data set according to the present invention.
Fig. 4 is a schematic diagram of the video paragraph positioning method based on language reconstruction and graph mechanism on the ActivityNet-Captions data set according to the present invention.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments, and therefore should not be considered as limiting the scope of protection. All other embodiments, which can be obtained by a worker skilled in the art based on the embodiments of the present invention without making creative efforts, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that, unless explicitly stated or limited otherwise, the terms "disposed," "connected" or "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through an intermediary, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in a specific case to those of ordinary skill in the art.
Example 1:
A video paragraph positioning method based on language reconstruction and a graph mechanism in this embodiment, as shown in fig. 1 and fig. 2, includes the following steps:
S1, selecting a training data set, and extracting video-paragraph pairs as the input of a positioning algorithm model;
S2, loading the model parameters of a pre-trained 3D convolutional neural network, and extracting the video modality in the video-paragraph pair to obtain segment-level video features;
S3, extracting the text modality in the video-paragraph pair, and using GloVe encoding to represent each word in the text modality as a word vector of fixed dimension, serving as the query text encoding;
S4, processing the query text encoding with a projection layer and regularization to obtain word-level text features, splitting the word-level text features by sentence, inputting each resulting sentence in turn into a bidirectional gated recurrent unit, and extracting sentence-level text features;
S5, concatenating the segment-level video features and the word-level text features, taking each feature point as a graph node and setting the strength of each edge as a learnable parameter, initializing a multi-modal fully-connected graph composed of video feature nodes and text feature nodes, and inputting the graph into a multi-modal graph encoder for multi-modal feature fusion, so that each node can selectively acquire information from its neighbouring nodes and fine-grained feature interaction is realized;
S6, extracting the video feature nodes from the multi-modal graph processed by the multi-modal graph encoder, inputting them together with the sentence-level text features extracted in step S4 into an event feature decoder to obtain the multi-modal features of the target events, and using a multi-layer perceptron to predict the relative position of each event in the complete video;
S7, using the multi-modal features of each target event obtained in step S6, extracting the text feature nodes from the multi-modal graph processed by the multi-modal graph encoder, inputting them into a language reconstructor, and re-translating them into the paragraph query text to realize query text reconstruction;
S8, calculating the time sequence position information loss according to the results predicted in step S6;
S9, extracting the attention weight matrix in the event feature decoder, and calculating the attention guide loss;
S10, calculating the language reconstruction loss according to the text reconstruction results of step S7;
S11, adopting an Adam optimizer and training the positioning algorithm model with a constant learning rate strategy.
The positioning algorithm model in this embodiment is a video paragraph positioning model based on language reconstruction and a graph mechanism. In this embodiment, the video modality in the video-paragraph pair is extracted to obtain the segment-level video features, that is, one feature vector is extracted from each fixed-length segment of the video. GloVe encoding is used to represent each word in the text modality as a word vector of fixed dimension, commonly set to 300 dimensions.
The working principle/process of the invention is as follows: a single word corresponds to a feature of length 1, and a paragraph of l_W words corresponds to a word-level feature sequence of length l_W. A pre-trained convolutional neural network is used to extract the segment-level video features, and GloVe, a projection layer with regularization and a bidirectional gated recurrent unit (biGRU) are used to extract the word-level and sentence-level text features. Fine-grained multi-modal feature modeling is performed with the multi-modal graph encoder, the relative positions of multiple events in the uncropped long video are predicted with the event feature decoder and a multi-layer perceptron, and the query text is reconstructed with the language reconstructor during the training stage.
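By way of illustration, a minimal PyTorch sketch of the text feature extraction of steps S3-S4 follows; the class name, dimensions and the per-sentence use of nn.GRU are illustrative assumptions rather than the exact patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextFeatureExtractor(nn.Module):
    """Sketch of steps S3-S4: GloVe word vectors -> projection + normalization
    -> word-level features, then a biGRU per sentence -> sentence-level features."""

    def __init__(self, glove_dim=300, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(glove_dim, hidden_dim)          # projection layer
        self.bigru = nn.GRU(hidden_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, glove_vectors, sentence_lengths):
        # glove_vectors: (num_words, glove_dim) for one query paragraph
        words = F.normalize(self.proj(glove_vectors), dim=-1)  # word-level features
        sentences, start = [], 0
        for length in sentence_lengths:                        # split by sentence
            sent = words[start:start + length].unsqueeze(0)    # (1, length, hidden)
            _, h = self.bigru(sent)                            # h: (2, 1, hidden // 2)
            sentences.append(torch.cat([h[0], h[1]], dim=-1))  # (1, hidden)
            start += length
        return words, torch.cat(sentences, dim=0)              # word- and sentence-level

# Segment-level video features (step S2) would come from a pre-trained 3D CNN
# such as C3D applied to fixed-length clips; here they are assumed to be given.
```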
Example 2:
This embodiment further optimizes the positioning algorithm model on the basis of embodiment 1, and provides a verification method for the positioning algorithm model: in the testing stage, language reconstruction is not needed, so the language reconstructor is removed from the trained video paragraph positioning method based on language reconstruction and the graph mechanism to improve the inference speed; the remaining part without the language reconstructor is used as the evaluation model to perform multi-segment video retrieval on video and paragraph text pairs, so as to verify the effect of the positioning algorithm model.
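By way of illustration, a minimal sketch of this verification setup follows; the attribute name language_reconstructor is an assumption for illustration only.

```python
import copy

def build_evaluation_model(trained_model):
    """Sketch of the verification setup: the trained model is copied and its
    language reconstructor is dropped, since reconstruction is only needed
    during training; the remainder performs multi-segment video retrieval."""
    eval_model = copy.deepcopy(trained_model)
    eval_model.language_reconstructor = None   # assumed attribute name
    eval_model.eval()
    return eval_model
```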
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
This embodiment is further optimized on the basis of the foregoing embodiment 1 or 2; the inference process of the multi-modal graph encoder provided in this embodiment includes:
S5.1, connecting the video nodes and the text nodes, setting the edge weights as learnable values, and initializing the multi-modal graph;
S5.2, passing the multi-modal graph into the multi-modal graph encoder for multi-modal graph modeling, where the single-layer multi-modal graph modeling process is expressed as:
(formula presented as an image in the original publication)
where GM(·) denotes the graph modeling layer, Enc(·) denotes the Transformer encoder, $G^i$ denotes the multi-modal graph after the i-th graph modeling, $P_V$ and $P_T$ are the position encodings of the video and the text respectively, and [;] is the concatenation operator;
S5.3, the multi-modal graph encoder being composed of multiple layers of the single-layer multi-modal graph modeling structure of step S5.2, with the multi-modal graph iteratively updated layer by layer.
The rest of this embodiment is the same as embodiment 1 or 2, and therefore, the description thereof is omitted.
Example 4:
This embodiment is further optimized on the basis of any one of the above embodiments 1-3. In this embodiment, multi-modal graph reasoning is performed in the graph modeling layer GM(·), so that each node acquires information from its neighbouring nodes and updates its own value and the edge weights; the single-layer graph modeling layer is expressed as:
(formula presented as an image in the original publication)
where $v_j^i$ denotes the value of the j-th node at the i-th layer, $\mathcal{N}_j$ is the set of neighbouring nodes of the j-th node, $e_{jk}^i$ is the edge weight between the j-th node and the k-th node in the i-th iteration, $W^i$ is the learnable parameter matrix of the i-th graph modeling layer, and $\sigma(\cdot)$ and $\mathrm{LN}(\cdot)$ are the activation function layer and the linear mapping layer, respectively.
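By way of illustration, a minimal PyTorch sketch of a graph modeling layer and the stacked multi-modal graph encoder of steps S5.1-S5.3 follows; since the exact update formula is given only as an image in the original, the aggregation rule, module names, dimensions and the residual connection below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GraphModelingLayer(nn.Module):
    """Sketch of one GM(.) layer: every node gathers information from its
    neighbours, weighted by learnable edge weights, passed through a linear
    mapping and an activation (an illustrative variant of the patented rule)."""

    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)   # learnable parameter matrix W
        self.act = nn.ReLU()                            # activation sigma(.)

    def forward(self, nodes, edge_weights):
        # nodes: (N, dim) video + text nodes; edge_weights: (N, N) learnable values
        adjacency = torch.softmax(edge_weights, dim=-1)     # neighbour weighting
        gathered = adjacency @ self.weight(nodes)           # aggregate neighbour information
        return self.act(gathered) + nodes                   # residual update of node values

class MultiModalGraphEncoder(nn.Module):
    """Stack of (graph modeling layer + Transformer encoder layer) blocks that
    iteratively updates the fully-connected multi-modal graph (steps S5.1-S5.3)."""

    def __init__(self, dim=256, num_layers=3, max_nodes=512):
        super().__init__()
        self.edge_weights = nn.Parameter(torch.zeros(max_nodes, max_nodes))
        self.gm_layers = nn.ModuleList(GraphModelingLayer(dim) for _ in range(num_layers))
        self.enc_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_layers))

    def forward(self, graph_nodes, pos_encoding):
        # graph_nodes, pos_encoding: (N, dim), N <= max_nodes (step S5.1)
        x = graph_nodes
        for gm, enc in zip(self.gm_layers, self.enc_layers):
            x = gm(x, self.edge_weights[: x.size(0), : x.size(0)])
            x = enc((x + pos_encoding).unsqueeze(0)).squeeze(0)  # position-aware encoding
        return x
```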
Other parts of this embodiment are the same as any of embodiments 1 to 3, and thus are not described again.
Example 5:
In this embodiment, the video feature nodes in the multi-modal graph processed by the multi-modal graph encoder are extracted and used as the encoded signal input of the event feature decoder, the sentence-level text features are used as the query signal input of the event feature decoder, the context relations among multiple events are mined through a self-attention mechanism, the multi-modal features of the target events are obtained through a cross-modal attention mechanism, and finally a multi-layer perceptron is used to predict the relative position of each event in the complete video, expressed as:
(formula presented as an image in the original publication)
where $F_E^i$ is the multi-modal feature of the i-th event, $s_i$ is the sentence-level text feature of the i-th event, $N_V$ are the video nodes in the multi-modal graph, $P_S$ is the sentence-level position encoding, $\mathrm{Dec}_T(\cdot)$ is the event feature decoder, $\mathrm{MLP}(\cdot)$ is the multi-layer perceptron, and $\hat{t}_i$ is the prediction result of the i-th event, i.e. its normalized timestamp.
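By way of illustration, a minimal PyTorch sketch of such an event feature decoder follows; the use of nn.TransformerDecoder, the head dimensions and the sigmoid output are illustrative assumptions rather than the exact patented implementation.

```python
import torch
import torch.nn as nn

class EventFeatureDecoder(nn.Module):
    """Sketch of step S6: sentence-level features (plus sentence position
    encoding) act as queries, video nodes from the multi-modal graph act as
    memory; self-attention models the context among events, cross-modal
    attention extracts each event's multi-modal feature, and an MLP predicts
    its normalized (start, end) position."""

    def __init__(self, dim=256, num_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, sentence_feats, sentence_pos, video_nodes):
        # sentence_feats, sentence_pos: (K, dim); video_nodes: (Nv, dim)
        queries = (sentence_feats + sentence_pos).unsqueeze(0)   # (1, K, dim)
        memory = video_nodes.unsqueeze(0)                        # (1, Nv, dim)
        event_feats = self.decoder(queries, memory)              # multi-modal event features
        timestamps = self.mlp(event_feats).sigmoid()             # normalized positions
        return event_feats.squeeze(0), timestamps.squeeze(0)
```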
Other parts of this embodiment are the same as any of embodiments 1 to 4, and thus are not described again.
Example 6:
this embodiment is further optimized on the basis of any one of the above embodiments 1 to 5, and discloses:
inputting the multi-modal features of the target event obtained in the step S6 as an encoded signal of the language reconstructor, extracting text nodes in the multi-modal graph processed by the multi-modal graph encoder as a query signal input of the language reconstructor, calculating a probability distribution of each text node in an encoded vocabulary, selecting a word with the highest probability as a reconstruction result, and expressing as:
(formula presented as an image in the original publication)
where $F_E$ are the multi-modal features of the target events, $N_W$ are the text nodes, $P_W$ is the word-level position encoding, $\mathrm{Dec}_{LR}(\cdot)$ is the language reconstructor, and $P$ is the probability distribution of each reconstructed word.
Other parts of this embodiment are the same as any of embodiments 1 to 5, and thus are not described again.
Example 7:
This embodiment is further optimized on the basis of any one of the above embodiments 1 to 6, and discloses calculating the time sequence position information loss $\mathcal{L}_{loc}$ using the prediction result of each event in step S6, expressed as:
$\mathcal{L}_{loc} = \frac{1}{K}\sum_{i=1}^{K}\mathcal{L}_{GIoU}(\hat{t}_i, t_i)$
where $\hat{t}_i$ and $t_i$ are respectively the prediction result and the ground-truth annotation of the i-th event, $\mathcal{L}_{GIoU}(\cdot)$ is the G-IoU loss function, and $K$ is the total number of events.
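By way of illustration, a minimal sketch of a 1-D generalized IoU loss over normalized (start, end) spans follows; the function name and the averaging over events are illustrative assumptions consistent with the description above.

```python
import torch

def giou_loss_1d(pred, target):
    """Sketch of the time sequence position loss: 1-D generalized IoU between
    predicted and annotated spans, averaged over the K events.
    pred, target: tensors of shape (K, 2) holding normalized start/end times."""
    inter = (torch.min(pred[:, 1], target[:, 1])
             - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (target[:, 1] - target[:, 0]) - inter
    iou = inter / union.clamp(min=1e-6)
    # smallest enclosing span, as in generalized IoU
    enclose = (torch.max(pred[:, 1], target[:, 1])
               - torch.min(pred[:, 0], target[:, 0])).clamp(min=1e-6)
    giou = iou - (enclose - union) / enclose
    return (1.0 - giou).mean()

# example: giou_loss_1d(torch.tensor([[0.1, 0.4]]), torch.tensor([[0.2, 0.5]]))
```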
Other parts of this embodiment are the same as any of embodiments 1 to 6, and thus are not described again.
Example 8:
this embodiment is further optimized on the basis of any one of the above embodiments 1 to 7, and discloses:
extracting attention weight in cross-modal attention mechanism in event feature decoder, calculating attention-guiding loss
$\mathcal{L}_{att}$:
(formula presented as an image in the original publication)
where $m_i$ and $a_i$ are respectively the ground-truth annotation and the attention weight of the i-th segment-level video feature, and $l_V$ is the total length of the segment-level video features.
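By way of illustration, a minimal sketch of an attention guide loss follows; the original formula is given only as an image, so the normalized negative-log form used here is an assumption, chosen as a common way of pushing attention mass onto the annotated segments.

```python
import torch

def attention_guide_loss(attn, mask, eps=1e-6):
    """Sketch of the attention guide loss: cross-modal attention weights over
    the segment-level video features are encouraged to concentrate on the
    ground-truth annotated segments.

    attn: (K, Lv) attention weights for K events; mask: (K, Lv) binary annotation."""
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=eps)
    inside = (attn * mask).sum(dim=-1)          # attention mass on annotated segments
    return -(inside + eps).log().mean()

# example: attention_guide_loss(torch.rand(3, 20).softmax(-1), torch.ones(3, 20))
```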
Other parts of this embodiment are the same as any of embodiments 1 to 7, and thus are not described again.
Example 9:
This embodiment is further optimized on the basis of any one of the above embodiments 1 to 8, and discloses: calculating the reconstruction loss from the prediction of the language reconstructor
$\mathcal{L}_{rec}$, expressed as:
$\mathcal{L}_{rec} = -\frac{1}{l_W}\sum_{i=1}^{l_W}\log\left(P(w_i) + \epsilon\right)$
where $w_i$ is the i-th word, $l_W$ is the total number of words, and $\epsilon$ is the first hyper-parameter, used to stabilize training.
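By way of illustration, a minimal sketch of such a reconstruction loss follows; the function name and the exact placement of the stabilizing epsilon are illustrative assumptions.

```python
import torch

def reconstruction_loss(word_probs, target_words, eps=1e-6):
    """Sketch of the language reconstruction loss: mean negative log-probability
    of the ground-truth words under the reconstructor's distributions, with a
    small epsilon (the 'first hyper-parameter') added for numerical stability.

    word_probs: (Lw, vocab_size) probability distribution per word position;
    target_words: (Lw,) indices of the ground-truth words."""
    picked = word_probs[torch.arange(target_words.numel()), target_words]
    return -(picked + eps).log().mean()

# example: reconstruction_loss(torch.rand(5, 100).softmax(-1), torch.randint(0, 100, (5,)))
```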
Other parts of this embodiment are the same as any of embodiments 1 to 8, and thus are not described again.
Example 10:
this embodiment is further optimized on the basis of any one of the above embodiments 1 to 9, and discloses: the position loss, the attention guiding loss and the reconstruction loss are subjected to weighted summation to be used as a final training target
$\mathcal{L}$, expressed as:
$\mathcal{L} = \alpha\,\mathcal{L}_{loc} + \beta\,\mathcal{L}_{att} + \gamma\,\mathcal{L}_{rec}$
where $\alpha$, $\beta$ and $\gamma$ are respectively the second, third and fourth hyper-parameters balancing the loss terms.
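By way of illustration, a minimal sketch of this weighted combination follows; the default weight values are illustrative, not those of the patent.

```python
def total_loss(loc_loss, att_loss, rec_loss, alpha=1.0, beta=1.0, gamma=1.0):
    """Sketch of the final training objective: a weighted sum of the position
    loss, the attention guide loss and the reconstruction loss, balanced by
    the hyper-parameters alpha, beta and gamma."""
    return alpha * loc_loss + beta * att_loss + gamma * rec_loss
```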
Other parts of this embodiment are the same as any of embodiments 1 to 9, and thus are not described again.
The invention is used in specific scenes for example:
The first scenario case: effect evaluation was performed on the Charades-STA data set, which contains 6,672 videos of daily life. Each video has approximately 2.4 annotated moments, with an average duration of 8.2 seconds. The data set involves 6,670/16,124 videos/sentences, divided into a training part and a testing part of 5,336/12,404 and 1,334/3,720 videos/sentences respectively. In this embodiment, the invention applies C3D as the original video feature extractor to obtain the RGB features of the videos. Based on these features, the comparison between the invention and other methods on this data set is shown in fig. 3.
The second scenario case: effect evaluation was performed on the ActivityNet-Captions data set. This is the largest data set in the time sequence language positioning task, containing approximately 20,000 open-domain videos. On average, each video contains 3.65 queries, with an average of 13.48 words per query. The data set is divided into a training set, validation set 1 and validation set 2, containing 10,009/37,421, 4,917/17,505 and 4,885/17,031 videos/sentences respectively; the invention performs validation on validation set 1 and testing on validation set 2. The comparison between the invention and other prior methods is shown in fig. 4.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.

Claims (9)

1. A video paragraph positioning method based on language reconstruction and a graph mechanism, characterized by comprising the following steps:
S1, selecting a training data set, and extracting video-paragraph pairs as the input of a positioning algorithm model;
S2, loading the model parameters of a pre-trained 3D convolutional neural network, and extracting the video modality in the video-paragraph pair to obtain segment-level video features;
S3, extracting the text modality in the video-paragraph pair, and using GloVe encoding to represent each word in the text modality as a word vector of fixed dimension, serving as the query text encoding;
S4, processing the query text encoding with a projection layer and regularization to obtain word-level text features, splitting the word-level text features by sentence, inputting each resulting sentence in turn into a bidirectional gated recurrent unit, and extracting sentence-level text features;
S5, concatenating the segment-level video features and the word-level text features, taking each feature point as a graph node and setting the strength of each edge as a learnable parameter, initializing a multi-modal fully-connected graph composed of video feature nodes and text feature nodes, and inputting the graph into a multi-modal graph encoder for multi-modal feature fusion, so that each node can selectively acquire information from its neighbouring nodes and fine-grained feature interaction is realized;
S5.1, connecting the video nodes and the text nodes, setting the edge weights as learnable values, and initializing the multi-modal graph;
S5.2, passing the multi-modal graph into the multi-modal graph encoder for multi-modal graph modeling, where a single layer of multi-modal graph modeling consists of a graph modeling layer and a Transformer encoder with video and text position encodings;
S5.3, the multi-modal graph encoder being composed of multiple layers of the single-layer multi-modal graph modeling structure of step S5.2, with the multi-modal graph iteratively updated layer by layer;
S6, extracting the video feature nodes from the multi-modal graph processed by the multi-modal graph encoder, inputting them together with the sentence-level text features extracted in step S4 into an event feature decoder to obtain the multi-modal features of the target events, and using a multi-layer perceptron to predict the relative position of each event in the complete video;
S7, using the multi-modal features of each target event obtained in step S6, extracting the text feature nodes from the multi-modal graph processed by the multi-modal graph encoder, inputting them into a language reconstructor, and re-translating them into the paragraph query text to realize query text reconstruction;
S8, calculating the time sequence position information loss according to the results predicted in step S6;
S9, extracting the attention weight matrix in the event feature decoder, and calculating the attention guide loss;
S10, calculating the language reconstruction loss according to the text reconstruction results of step S7;
S11, adopting an Adam optimizer and training the positioning algorithm model with a constant learning rate strategy.
2. The method of claim 1, further comprising a verification mode for the positioning algorithm model, wherein the verification mode comprises:
in the testing stage, language reconstruction is not needed, so the language reconstructor is removed from the trained video paragraph positioning method based on language reconstruction and the graph mechanism to improve the inference speed; the remaining part without the language reconstructor is used as the evaluation model to perform multi-segment video retrieval on video and paragraph text pairs, so as to verify the effect of the positioning algorithm model.
3. A method for locating video segments based on language reconstruction and graph mechanism according to claim 1, wherein said step S5.2 comprises:
performing multi-modal graph reasoning in the graph modeling layer GM(·), so that each node acquires information from its neighbouring nodes and updates its own value and the edge weights.
4. The method for locating video paragraphs according to claim 1, wherein said step S6 comprises:
video feature nodes in a multi-modal graph processed by a multi-modal graph encoder are extracted and used as encoded signals of an event feature decoder to be input, sentence-level text features are used as query signals of the event feature decoder to be input, context relations among multiple events are mined through a self-attention mechanism, multi-modal features of a target event are obtained through a cross-modal attention mechanism, and finally a multi-layer perceptron is used for predicting the relative position of each event in a complete video.
5. The method for locating a video paragraph based on language reconstruction and graph mechanism as claimed in claim 1, wherein said step S7 includes:
inputting the multi-modal features of the target events obtained in step S6 as the encoded signal of the language reconstructor, extracting the text nodes from the multi-modal graph processed by the multi-modal graph encoder as the query signal input of the language reconstructor, calculating the probability distribution of each text node over the encoded vocabulary, and selecting the word with the maximum probability as the reconstruction result.
6. The method for locating a video paragraph based on language reconstruction and graph mechanism as claimed in claim 1, wherein said step S8 includes:
using the prediction result of each event in step S6, the positional information loss is calculated from the prediction result of the event, the total number of events, the actual annotation, and the G-IOU loss function.
7. The method for locating video paragraphs according to claim 1, wherein said step S9 comprises:
attention weights in a cross-modal attention mechanism in an event feature decoder are extracted, and attention guidance loss is calculated.
8. The method for locating video paragraphs according to claim 1, wherein said step S10 comprises: calculating the reconstruction loss according to the prediction result of the language reconstructor.
9. The method for locating video paragraphs according to any one of claims 6, 7 or 8, wherein said step S10 further comprises:
performing a weighted summation of the position loss, the attention guide loss and the reconstruction loss as the final training target.
CN202210270425.9A 2022-03-18 2022-03-18 Video paragraph positioning method based on language reconstruction and graph mechanism Active CN114357124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210270425.9A CN114357124B (en) 2022-03-18 2022-03-18 Video paragraph positioning method based on language reconstruction and graph mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210270425.9A CN114357124B (en) 2022-03-18 2022-03-18 Video paragraph positioning method based on language reconstruction and graph mechanism

Publications (2)

Publication Number Publication Date
CN114357124A CN114357124A (en) 2022-04-15
CN114357124B true CN114357124B (en) 2022-06-14

Family

ID=81095153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210270425.9A Active CN114357124B (en) 2022-03-18 2022-03-18 Video paragraph positioning method based on language reconstruction and graph mechanism

Country Status (1)

Country Link
CN (1) CN114357124B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438225B (en) * 2022-11-08 2023-03-24 苏州浪潮智能科技有限公司 Video text mutual inspection method and model training method, device, equipment and medium thereof
CN116226443B (en) * 2023-05-11 2023-07-21 山东建筑大学 Weak supervision video clip positioning method and system based on large-scale video corpus

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7881505B2 (en) * 2006-09-29 2011-02-01 Pittsburgh Pattern Recognition, Inc. Video retrieval system for human face content
US9208776B2 (en) * 2009-10-05 2015-12-08 At&T Intellectual Property I, L.P. System and method for speech-enabled access to media content by a ranked normalized weighted graph
CN108932304B (en) * 2018-06-12 2019-06-18 山东大学 Video moment localization method, system and storage medium based on cross-module state
US11244167B2 (en) * 2020-02-06 2022-02-08 Adobe Inc. Generating a response to a user query utilizing visual features of a video segment and a query-response-neural network
CN112380385B (en) * 2020-11-18 2023-12-29 湖南大学 Video time positioning method and device based on multi-mode relation diagram
CN113204674B (en) * 2021-07-05 2021-09-17 杭州一知智能科技有限公司 Video-paragraph retrieval method and system based on local-overall graph inference network
CN113204675B (en) * 2021-07-07 2021-09-21 成都考拉悠然科技有限公司 Cross-modal video time retrieval method based on cross-modal object inference network
CN114048350A (en) * 2021-11-08 2022-02-15 湖南大学 Text-video retrieval method based on fine-grained cross-modal alignment model
CN113934887B (en) * 2021-12-20 2022-03-15 成都考拉悠然科技有限公司 No-proposal time sequence language positioning method based on semantic decoupling
CN114064967B (en) * 2022-01-18 2022-05-06 之江实验室 Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network
CN114155477B (en) * 2022-02-08 2022-04-29 成都考拉悠然科技有限公司 Semi-supervised video paragraph positioning method based on average teacher model

Also Published As

Publication number Publication date
CN114357124A (en) 2022-04-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant