CN114357124B - Video paragraph positioning method based on language reconstruction and graph mechanism - Google Patents


Info

Publication number
CN114357124B
Authority
CN
China
Prior art keywords
video
graph
modal
text
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210270425.9A
Other languages
Chinese (zh)
Other versions
CN114357124A (en)
Inventor
徐行
蒋寻
沈复民
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Koala Youran Technology Co ltd
Original Assignee
Chengdu Koala Youran Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Koala Youran Technology Co ltd
Priority to CN202210270425.9A
Publication of CN114357124A
Application granted
Publication of CN114357124B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of cross-modal content retrieval in multi-modal video understanding, and discloses a video paragraph positioning method based on language reconstruction and a graph mechanism, which comprises the following steps: selecting a data set, constructing a video paragraph positioning model, training the model with a loss function, and finally verifying the effect of the model. The method improves the information interaction capability between fine-grained heterogeneous data, enhances the understanding of video content, and improves the model's cross-modal understanding of video-text content. The method can be used in a variety of multi-modal video understanding scenarios, such as online video apps, intelligent security systems and large-scale video content retrieval; it can improve the user experience of software and the working efficiency of practitioners in related fields such as video, security and social governance.

Description

Video paragraph positioning method based on language reconstruction and graph mechanism
Technical Field
The invention relates to the technical field of cross-modal content retrieval in multi-modal video understanding, and in particular to a video paragraph positioning method based on language reconstruction and a graph mechanism, which is used to improve the information interaction capability between fine-grained heterogeneous data, enhance the understanding of video content, and improve the model's cross-modal understanding of video-text content.
Background
As a multimedia technology hotspot of the internet era, multi-modal video understanding has attracted wide attention from industry and academia in recent years. Time sequence language positioning, one of the most challenging tasks in multi-modal video understanding, aims to perform segment-level retrieval from an uncropped long video according to a given query text: the computer must locate, within the long video, the segment corresponding to the event described by the query text. Time sequence language positioning has broad application scenarios: with the arrival of the big-media era, internet video auditing has become a heavy workload; by applying time sequence language positioning, fine-grained cross-modal video content retrieval can be realized, freeing manpower from tedious video auditing and searching. The technology can also be deployed in fields such as intelligent security, social governance and human-computer interaction, effectively improving user experience and working efficiency.
According to the form of the query text, current time sequence language positioning technology can be divided into two types. The first is video sentence positioning: the query text is a single sentence describing a single event, and the algorithm model retrieves the target segment from a long video containing multiple events in a one-to-many manner. The second is video paragraph positioning: the query text is a paragraph containing multiple sentences describing multiple events, and the algorithm model retrieves each event segment in a many-to-many manner. Over the last decade, video sentence positioning has been the focus of research and has developed greatly, but with the growth of multi-modal data, the drawbacks of this single-event positioning mechanism have gradually been exposed. For example, when multiple similar events occur in a video, video sentence positioning easily confuses their logical relationships and produces erroneous localization, because it performs event-level context modeling only with the video and ignores context modeling of the text modality. This leads to insufficient understanding of the video content by the model; in practical use, when the same or similar events occur repeatedly, the absence of event-level text context causes erroneous retrieval of event segments. Video paragraph positioning, by taking the description sentences of multiple events as the query text, can mine more event-level context information from the text modality, thereby reducing the possibility of misalignment.
However, positioning multiple events in video paragraph positioning also brings new challenges. First, adopting paragraphs as query text introduces more complexity and makes modal fusion more difficult; because of the many-to-many positioning mode, every sentence is visible to every event in the video during modal fusion, which brings a higher possibility of misalignment. Second, although keeping the chronological relationship among the sentences can provide sufficient temporal information, this also requires the model to have stronger long-range context modeling capability as the number of sentences grows.
Therefore, in order to solve the above technical problems of existing video paragraph positioning, the invention provides a video paragraph positioning method based on language reconstruction and a graph mechanism. The information interaction capability between fine-grained heterogeneous data is improved by introducing a multi-modal graph mechanism into the Transformer; understanding of video content is enhanced by context modeling among multiple events in an event feature decoder; meanwhile, a language reconstructor is designed to reconstruct the query text, further improving the model's cross-modal understanding of video-text content.
Disclosure of Invention
The invention aims to provide a video paragraph positioning method based on language reconstruction and a graph mechanism, which improves the information interaction capability between fine-grained heterogeneous data by introducing a multi-modal graph mechanism into the Transformer, enhances the understanding of video content by context modeling among multiple events in an event feature decoder, and further improves the model's cross-modal understanding of video-text content by designing a language reconstructor to reconstruct the query text.
The invention is realized by the following technical scheme: a video paragraph positioning method based on language reconstruction and graph mechanism comprises the following steps:
S1, selecting a training data set, and extracting video-paragraph pairs as the input of a positioning algorithm model;
S2, loading the model parameters of a pre-trained 3D convolutional neural network, and extracting the video modality in the video-paragraph pair to obtain segment-level video features;
S3, extracting the text modality in the video-paragraph pair, and using GloVe encoding to represent each word in the text modality as a word vector of fixed dimension, serving as the query text encoding;
S4, processing the query text encoding with a projection layer and regularization to obtain word-level text features, splitting the word-level text features by sentence, inputting each resulting sentence in turn into a bidirectional gated recurrent unit, and extracting sentence-level text features;
S5, concatenating the segment-level video features and the word-level text features, taking each feature point as a graph node and setting the strength of each edge as a learnable parameter, initializing a multi-modal fully-connected graph composed of video feature nodes and text feature nodes, and inputting the graph into a multi-modal graph encoder for multi-modal feature fusion, so that each node can selectively acquire information from its neighbouring nodes and fine-grained feature interaction is realized;
S6, extracting the video feature nodes from the multi-modal graph processed by the multi-modal graph encoder, inputting them together with the sentence-level text features extracted in step S4 into an event feature decoder to obtain the multi-modal features of the target events, and using a multi-layer perceptron to predict the relative position of each event in the complete video;
S7, using the multi-modal features of each target event obtained in step S6, extracting the text feature nodes from the multi-modal graph processed by the multi-modal graph encoder, inputting them into a language reconstructor, and re-translating them into the paragraph query text to realize query text reconstruction;
S8, calculating the time sequence position information loss according to the results predicted in step S6;
S9, extracting the attention weight matrix in the event feature decoder, and calculating the attention guide loss;
S10, calculating the language reconstruction loss according to the text reconstruction results of step S7;
S11, adopting an Adam optimizer and training the positioning algorithm model with a constant learning rate strategy.
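By way of illustration, steps S6 to S11 can be wired together as in the following minimal PyTorch-style sketch; the function and dictionary names (train_one_epoch, "location", "attention", "reconstruction") and the learning rate are illustrative assumptions and do not reproduce the exact patented implementation.

```python
import torch

def train_one_epoch(model, loader, optimizer, losses, alpha=1.0, beta=1.0, gamma=1.0):
    """One pass over video-paragraph pairs, covering steps S6-S11.

    `model` is assumed to bundle the multi-modal graph encoder, the event
    feature decoder and the language reconstructor, and to return the
    predicted timestamps, the cross-modal attention weights and the
    per-word probability distributions of the reconstructed paragraph.
    `losses` is a dict holding the three loss functions.
    """
    model.train()
    for video_feats, word_feats, sent_feats, gt_spans, gt_mask, gt_words in loader:
        timestamps, attn, word_probs = model(video_feats, word_feats, sent_feats)
        loss = (alpha * losses["location"](timestamps, gt_spans)            # step S8
                + beta * losses["attention"](attn, gt_mask)                 # step S9
                + gamma * losses["reconstruction"](word_probs, gt_words))   # step S10
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Step S11: Adam with a constant learning rate (no scheduler is attached), e.g.
# optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
```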
Unlike work in the traditional video-and-audio field, the invention operates in the video-and-text field. The invention belongs to time sequence language positioning / video segment retrieval: given one or more natural-language description texts (describing one or more segments of a video), the position of the corresponding video segment is retrieved in the video according to the text; the naturally unaligned states of the two modalities must be matched according to the semantics of the text modality before retrieval can be realized. The neural network model of the invention has a more complex structure and larger structural differences from prior models, including but not limited to: 1) multi-modal information interaction is realized by means of a multi-modal graph; 2) an event feature decoder is designed, which performs context modeling at the event level (rather than the conventional word level or video segment level) using the text modality, so as to better understand long video content; 3) a language reconstructor is designed to reconstruct the query text, so as to improve the model's understanding of deep semantics.
The invention also differs from traditional video temporal localization. Traditional video temporal localization focuses on single-sentence localization, i.e. only one query sentence is given at a time to locate one event, whereas the invention focuses on paragraph temporal localization, i.e. a paragraph consisting of multiple sentences is given to locate multiple events. In traditional video temporal localization, video context modeling is completed by pre-dividing video segments, and key text words in the query sentence are then selectively screened according to the video content. The invention instead establishes a multi-modal graph and uses a graph modeling layer, so that each node (video segment node or text word node) acquires information from its neighbouring nodes.
In order to better implement the present invention, a verification method for the positioning algorithm model is further included: in the testing stage, language reconstruction is not needed, so the language reconstructor is removed from the trained video paragraph positioning method based on language reconstruction and the graph mechanism to improve the inference speed; the remaining part without the language reconstructor is used as the evaluation model to perform multi-segment video retrieval on video and paragraph text pairs, so as to verify the effect of the positioning algorithm model.
To better implement the present invention, the inference process of the multi-modal graph encoder in step S5 further includes:
S5.1, connecting the video nodes and the text nodes, setting the edge weights as learnable values, and initializing the multi-modal graph;
S5.2, passing the multi-modal graph into the multi-modal graph encoder for multi-modal graph modeling, where a single layer of multi-modal graph modeling consists of a graph modeling layer and a Transformer encoder with video and text position encodings;
S5.3, the multi-modal graph encoder being composed of multiple layers of the single-layer multi-modal graph modeling structure of step S5.2, with the multi-modal graph iteratively updated layer by layer.
In order to better implement the present invention, further, step S5.2 comprises:
performing multi-modal graph reasoning in the graph modeling layer GM(·), so that each node acquires information from its neighbouring nodes and updates its own value and the edge weights.
In order to better implement the present invention, step S6 further includes:
extracting video feature nodes in a multi-modal graph processed by a multi-modal graph encoder, inputting the video feature nodes as encoded signals of an event feature decoder, inputting sentence-level text features as query signals of the event feature decoder, mining context relations among multiple events through a self-attention mechanism, obtaining multi-modal features of a target event through a cross-modal attention mechanism, and finally predicting the relative position of each event in a complete video by using a multi-layered perceptron.
In order to better implement the present invention, step S7 further includes:
inputting the multi-modal features of the target events obtained in step S6 as the encoded signal of the language reconstructor, extracting the text nodes from the multi-modal graph processed by the multi-modal graph encoder as the query signal input of the language reconstructor, calculating the probability distribution of each text node over the encoded vocabulary, and selecting the word with the maximum probability as the reconstruction result.
In order to better implement the present invention, step S8 further includes:
using the prediction result of each event in step S6, the positional information loss is calculated from the prediction result of the event, the total number of events, the actual annotation, and the G-IOU loss function.
In order to better implement the present invention, step S9 further includes:
attention weights in a cross-modal attention mechanism in an event feature decoder are extracted, and an attention-directed loss is calculated.
In order to better implement the present invention, step S10 further includes:
calculating the reconstruction loss according to the prediction result of the language reconstructor.
In order to better implement the present invention, step S10 further includes:
the position loss, the attention guide loss and the reconstruction loss are weighted and summed to form the final training target.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) the method introduces a graph modeling mechanism into video paragraph positioning, and promotes information interaction between fine-grained heterogeneous data by establishing a multi-modal graph;
(2) the invention designs an event characteristic decoder, reduces the alignment error of a plurality of event positioning in video paragraph positioning by exploring the context relation between events, and effectively improves the reliability of event positioning;
(3) the invention designs a language reconstructor, which assists the model to improve the cross-modal comprehension capability of the video-text, excavates deep semantics among heterogeneous data and simultaneously improves the interpretability of the model by reconstructing the query text;
(4) the invention effectively improves the precision of time sequence language positioning through testing, and has more obvious advantages in multi-event positioning compared with the prior art;
(5) the invention can be used in a variety of multi-modal video understanding scenarios, such as online video apps, intelligent security systems and large-scale video content retrieval; it can improve the user experience of software and the working efficiency of practitioners in related fields such as video, security and social governance.
Drawings
The invention is further described with reference to the following figures and examples, all of which are intended to be covered by the present disclosure and the scope of the invention.
Fig. 1 is a flowchart of a video paragraph locating method based on language reconstruction and graph mechanism according to the present invention.
Fig. 2 is a schematic structural diagram of a video paragraph locating method based on language reconstruction and graph mechanism according to the present invention.
Fig. 3 is a schematic diagram of the video paragraph positioning method based on language reconstruction and graph mechanism on the Charades-STA data set according to the present invention.
Fig. 4 is a schematic diagram of the video paragraph positioning method based on language reconstruction and graph mechanism on the ActivityNet-Captions data set according to the present invention.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments, and therefore should not be considered as limiting the scope of protection. All other embodiments, which can be obtained by a worker skilled in the art based on the embodiments of the present invention without making creative efforts, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that, unless explicitly stated or limited otherwise, the terms "disposed," "connected" or "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through an intermediary, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in a specific case to those of ordinary skill in the art.
Example 1:
A video paragraph positioning method based on language reconstruction and a graph mechanism in this embodiment, as shown in fig. 1 and fig. 2, includes the following steps:
S1, selecting a training data set, and extracting video-paragraph pairs as the input of a positioning algorithm model;
S2, loading the model parameters of a pre-trained 3D convolutional neural network, and extracting the video modality in the video-paragraph pair to obtain segment-level video features;
S3, extracting the text modality in the video-paragraph pair, and using GloVe encoding to represent each word in the text modality as a word vector of fixed dimension, serving as the query text encoding;
S4, processing the query text encoding with a projection layer and regularization to obtain word-level text features, splitting the word-level text features by sentence, inputting each resulting sentence in turn into a bidirectional gated recurrent unit, and extracting sentence-level text features;
S5, concatenating the segment-level video features and the word-level text features, taking each feature point as a graph node and setting the strength of each edge as a learnable parameter, initializing a multi-modal fully-connected graph composed of video feature nodes and text feature nodes, and inputting the graph into a multi-modal graph encoder for multi-modal feature fusion, so that each node can selectively acquire information from its neighbouring nodes and fine-grained feature interaction is realized;
S6, extracting the video feature nodes from the multi-modal graph processed by the multi-modal graph encoder, inputting them together with the sentence-level text features extracted in step S4 into an event feature decoder to obtain the multi-modal features of the target events, and using a multi-layer perceptron to predict the relative position of each event in the complete video;
S7, using the multi-modal features of each target event obtained in step S6, extracting the text feature nodes from the multi-modal graph processed by the multi-modal graph encoder, inputting them into a language reconstructor, and re-translating them into the paragraph query text to realize query text reconstruction;
S8, calculating the time sequence position information loss according to the results predicted in step S6;
S9, extracting the attention weight matrix in the event feature decoder, and calculating the attention guide loss;
S10, calculating the language reconstruction loss according to the text reconstruction results of step S7;
S11, adopting an Adam optimizer and training the positioning algorithm model with a constant learning rate strategy.
The positioning algorithm model in this embodiment is a video paragraph positioning model based on language reconstruction and a graph mechanism. In this embodiment, the video modality in the video-paragraph pair is extracted to obtain the segment-level video features, that is, one feature vector is extracted from each fixed-length segment of the video. GloVe encoding is used to represent each word in the text modality as a word vector of fixed dimension, commonly set to 300 dimensions.
The working principle/process of the invention is as follows: a single word corresponds to a feature of length 1, and a paragraph of l_W words corresponds to a word-level feature sequence of length l_W. A pre-trained convolutional neural network is used to extract the segment-level video features, and GloVe, a projection layer with regularization and a bidirectional gated recurrent unit (biGRU) are used to extract the word-level and sentence-level text features. Fine-grained multi-modal feature modeling is performed with the multi-modal graph encoder, the relative positions of multiple events in the uncropped long video are predicted with the event feature decoder and a multi-layer perceptron, and the query text is reconstructed with the language reconstructor during the training stage.
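By way of illustration, a minimal PyTorch sketch of the text feature extraction of steps S3-S4 follows; the class name, dimensions and the per-sentence use of nn.GRU are illustrative assumptions rather than the exact patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextFeatureExtractor(nn.Module):
    """Sketch of steps S3-S4: GloVe word vectors -> projection + normalization
    -> word-level features, then a biGRU per sentence -> sentence-level features."""

    def __init__(self, glove_dim=300, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(glove_dim, hidden_dim)          # projection layer
        self.bigru = nn.GRU(hidden_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, glove_vectors, sentence_lengths):
        # glove_vectors: (num_words, glove_dim) for one query paragraph
        words = F.normalize(self.proj(glove_vectors), dim=-1)  # word-level features
        sentences, start = [], 0
        for length in sentence_lengths:                        # split by sentence
            sent = words[start:start + length].unsqueeze(0)    # (1, length, hidden)
            _, h = self.bigru(sent)                            # h: (2, 1, hidden // 2)
            sentences.append(torch.cat([h[0], h[1]], dim=-1))  # (1, hidden)
            start += length
        return words, torch.cat(sentences, dim=0)              # word- and sentence-level

# Segment-level video features (step S2) would come from a pre-trained 3D CNN
# such as C3D applied to fixed-length clips; here they are assumed to be given.
```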
Example 2:
This embodiment further optimizes the positioning algorithm model on the basis of embodiment 1, and provides a verification method for the positioning algorithm model: in the testing stage, language reconstruction is not needed, so the language reconstructor is removed from the trained video paragraph positioning method based on language reconstruction and the graph mechanism to improve the inference speed; the remaining part without the language reconstructor is used as the evaluation model to perform multi-segment video retrieval on video and paragraph text pairs, so as to verify the effect of the positioning algorithm model.
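By way of illustration, a minimal sketch of this verification setup follows; the attribute name language_reconstructor is an assumption for illustration only.

```python
import copy

def build_evaluation_model(trained_model):
    """Sketch of the verification setup: the trained model is copied and its
    language reconstructor is dropped, since reconstruction is only needed
    during training; the remainder performs multi-segment video retrieval."""
    eval_model = copy.deepcopy(trained_model)
    eval_model.language_reconstructor = None   # assumed attribute name
    eval_model.eval()
    return eval_model
```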
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
This embodiment is further optimized on the basis of the foregoing embodiment 1 or 2; the inference process of the multi-modal graph encoder provided in this embodiment includes:
S5.1, connecting the video nodes and the text nodes, setting the edge weights as learnable values, and initializing the multi-modal graph;
S5.2, passing the multi-modal graph into the multi-modal graph encoder for multi-modal graph modeling, where the single-layer multi-modal graph modeling process is expressed as:
(formula presented as an image in the original publication)
where GM(·) denotes the graph modeling layer, Enc(·) denotes the Transformer encoder, $G^i$ denotes the multi-modal graph after the i-th graph modeling, $P_V$ and $P_T$ are the position encodings of the video and the text respectively, and [;] is the concatenation operator;
S5.3, the multi-modal graph encoder being composed of multiple layers of the single-layer multi-modal graph modeling structure of step S5.2, with the multi-modal graph iteratively updated layer by layer.
The rest of this embodiment is the same as embodiment 1 or 2, and therefore, the description thereof is omitted.
Example 4:
This embodiment is further optimized on the basis of any one of the above embodiments 1-3. In this embodiment, multi-modal graph reasoning is performed in the graph modeling layer GM(·), so that each node acquires information from its neighbouring nodes and updates its own value and the edge weights; the single-layer graph modeling layer is expressed as:
(formula presented as an image in the original publication)
where $v_j^i$ denotes the value of the j-th node at the i-th layer, $\mathcal{N}_j$ is the set of neighbouring nodes of the j-th node, $e_{jk}^i$ is the edge weight between the j-th node and the k-th node in the i-th iteration, $W^i$ is the learnable parameter matrix of the i-th graph modeling layer, and $\sigma(\cdot)$ and $\mathrm{LN}(\cdot)$ are the activation function layer and the linear mapping layer, respectively.
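By way of illustration, a minimal PyTorch sketch of a graph modeling layer and the stacked multi-modal graph encoder of steps S5.1-S5.3 follows; since the exact update formula is given only as an image in the original, the aggregation rule, module names, dimensions and the residual connection below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GraphModelingLayer(nn.Module):
    """Sketch of one GM(.) layer: every node gathers information from its
    neighbours, weighted by learnable edge weights, passed through a linear
    mapping and an activation (an illustrative variant of the patented rule)."""

    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)   # learnable parameter matrix W
        self.act = nn.ReLU()                            # activation sigma(.)

    def forward(self, nodes, edge_weights):
        # nodes: (N, dim) video + text nodes; edge_weights: (N, N) learnable values
        adjacency = torch.softmax(edge_weights, dim=-1)     # neighbour weighting
        gathered = adjacency @ self.weight(nodes)           # aggregate neighbour information
        return self.act(gathered) + nodes                   # residual update of node values

class MultiModalGraphEncoder(nn.Module):
    """Stack of (graph modeling layer + Transformer encoder layer) blocks that
    iteratively updates the fully-connected multi-modal graph (steps S5.1-S5.3)."""

    def __init__(self, dim=256, num_layers=3, max_nodes=512):
        super().__init__()
        self.edge_weights = nn.Parameter(torch.zeros(max_nodes, max_nodes))
        self.gm_layers = nn.ModuleList(GraphModelingLayer(dim) for _ in range(num_layers))
        self.enc_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_layers))

    def forward(self, graph_nodes, pos_encoding):
        # graph_nodes, pos_encoding: (N, dim), N <= max_nodes (step S5.1)
        x = graph_nodes
        for gm, enc in zip(self.gm_layers, self.enc_layers):
            x = gm(x, self.edge_weights[: x.size(0), : x.size(0)])
            x = enc((x + pos_encoding).unsqueeze(0)).squeeze(0)  # position-aware encoding
        return x
```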
Other parts of this embodiment are the same as any of embodiments 1 to 3, and thus are not described again.
Example 5:
In this embodiment, the video feature nodes in the multi-modal graph processed by the multi-modal graph encoder are extracted and used as the encoded signal input of the event feature decoder, the sentence-level text features are used as the query signal input of the event feature decoder, the context relations among multiple events are mined through a self-attention mechanism, the multi-modal features of the target events are obtained through a cross-modal attention mechanism, and finally a multi-layer perceptron is used to predict the relative position of each event in the complete video, expressed as:
(formula presented as an image in the original publication)
where $F_E^i$ is the multi-modal feature of the i-th event, $s_i$ is the sentence-level text feature of the i-th event, $N_V$ are the video nodes in the multi-modal graph, $P_S$ is the sentence-level position encoding, $\mathrm{Dec}_T(\cdot)$ is the event feature decoder, $\mathrm{MLP}(\cdot)$ is the multi-layer perceptron, and $\hat{t}_i$ is the prediction result of the i-th event, i.e. its normalized timestamp.
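By way of illustration, a minimal PyTorch sketch of such an event feature decoder follows; the use of nn.TransformerDecoder, the head dimensions and the sigmoid output are illustrative assumptions rather than the exact patented implementation.

```python
import torch
import torch.nn as nn

class EventFeatureDecoder(nn.Module):
    """Sketch of step S6: sentence-level features (plus sentence position
    encoding) act as queries, video nodes from the multi-modal graph act as
    memory; self-attention models the context among events, cross-modal
    attention extracts each event's multi-modal feature, and an MLP predicts
    its normalized (start, end) position."""

    def __init__(self, dim=256, num_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, sentence_feats, sentence_pos, video_nodes):
        # sentence_feats, sentence_pos: (K, dim); video_nodes: (Nv, dim)
        queries = (sentence_feats + sentence_pos).unsqueeze(0)   # (1, K, dim)
        memory = video_nodes.unsqueeze(0)                        # (1, Nv, dim)
        event_feats = self.decoder(queries, memory)              # multi-modal event features
        timestamps = self.mlp(event_feats).sigmoid()             # normalized positions
        return event_feats.squeeze(0), timestamps.squeeze(0)
```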
Other parts of this embodiment are the same as any of embodiments 1 to 4, and thus are not described again.
Example 6:
this embodiment is further optimized on the basis of any one of the above embodiments 1 to 5, and discloses:
inputting the multi-modal features of the target event obtained in the step S6 as an encoded signal of the language reconstructor, extracting text nodes in the multi-modal graph processed by the multi-modal graph encoder as a query signal input of the language reconstructor, calculating a probability distribution of each text node in an encoded vocabulary, selecting a word with the highest probability as a reconstruction result, and expressing as:
(formula presented as an image in the original publication)
where $F_E$ are the multi-modal features of the target events, $N_W$ are the text nodes, $P_W$ is the word-level position encoding, $\mathrm{Dec}_{LR}(\cdot)$ is the language reconstructor, and $P$ is the probability distribution of each reconstructed word.
Other parts of this embodiment are the same as any of embodiments 1 to 5, and thus are not described again.
Example 7:
This embodiment is further optimized on the basis of any one of the above embodiments 1 to 6, and discloses calculating the time sequence position information loss $\mathcal{L}_{loc}$ using the prediction result of each event in step S6, expressed as:
$\mathcal{L}_{loc} = \frac{1}{K}\sum_{i=1}^{K}\mathcal{L}_{GIoU}(\hat{t}_i, t_i)$
where $\hat{t}_i$ and $t_i$ are respectively the prediction result and the ground-truth annotation of the i-th event, $\mathcal{L}_{GIoU}(\cdot)$ is the G-IoU loss function, and $K$ is the total number of events.
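By way of illustration, a minimal sketch of a 1-D generalized IoU loss over normalized (start, end) spans follows; the function name and the averaging over events are illustrative assumptions consistent with the description above.

```python
import torch

def giou_loss_1d(pred, target):
    """Sketch of the time sequence position loss: 1-D generalized IoU between
    predicted and annotated spans, averaged over the K events.
    pred, target: tensors of shape (K, 2) holding normalized start/end times."""
    inter = (torch.min(pred[:, 1], target[:, 1])
             - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (target[:, 1] - target[:, 0]) - inter
    iou = inter / union.clamp(min=1e-6)
    # smallest enclosing span, as in generalized IoU
    enclose = (torch.max(pred[:, 1], target[:, 1])
               - torch.min(pred[:, 0], target[:, 0])).clamp(min=1e-6)
    giou = iou - (enclose - union) / enclose
    return (1.0 - giou).mean()

# example: giou_loss_1d(torch.tensor([[0.1, 0.4]]), torch.tensor([[0.2, 0.5]]))
```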
Other parts of this embodiment are the same as any of embodiments 1 to 6, and thus are not described again.
Example 8:
this embodiment is further optimized on the basis of any one of the above embodiments 1 to 7, and discloses:
extracting attention weight in cross-modal attention mechanism in event feature decoder, calculating attention-guiding loss
$\mathcal{L}_{att}$:
(formula presented as an image in the original publication)
where $m_i$ and $a_i$ are respectively the ground-truth annotation and the attention weight of the i-th segment-level video feature, and $l_V$ is the total length of the segment-level video features.
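By way of illustration, a minimal sketch of an attention guide loss follows; the original formula is given only as an image, so the normalized negative-log form used here is an assumption, chosen as a common way of pushing attention mass onto the annotated segments.

```python
import torch

def attention_guide_loss(attn, mask, eps=1e-6):
    """Sketch of the attention guide loss: cross-modal attention weights over
    the segment-level video features are encouraged to concentrate on the
    ground-truth annotated segments.

    attn: (K, Lv) attention weights for K events; mask: (K, Lv) binary annotation."""
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=eps)
    inside = (attn * mask).sum(dim=-1)          # attention mass on annotated segments
    return -(inside + eps).log().mean()

# example: attention_guide_loss(torch.rand(3, 20).softmax(-1), torch.ones(3, 20))
```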
Other parts of this embodiment are the same as any of embodiments 1 to 7, and thus are not described again.
Example 9:
This embodiment is further optimized on the basis of any one of the above embodiments 1 to 8, and discloses: calculating the reconstruction loss from the prediction of the language reconstructor
$\mathcal{L}_{rec}$, expressed as:
$\mathcal{L}_{rec} = -\frac{1}{l_W}\sum_{i=1}^{l_W}\log\left(P(w_i) + \epsilon\right)$
where $w_i$ is the i-th word, $l_W$ is the total number of words, and $\epsilon$ is the first hyper-parameter, used to stabilize training.
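By way of illustration, a minimal sketch of such a reconstruction loss follows; the function name and the exact placement of the stabilizing epsilon are illustrative assumptions.

```python
import torch

def reconstruction_loss(word_probs, target_words, eps=1e-6):
    """Sketch of the language reconstruction loss: mean negative log-probability
    of the ground-truth words under the reconstructor's distributions, with a
    small epsilon (the 'first hyper-parameter') added for numerical stability.

    word_probs: (Lw, vocab_size) probability distribution per word position;
    target_words: (Lw,) indices of the ground-truth words."""
    picked = word_probs[torch.arange(target_words.numel()), target_words]
    return -(picked + eps).log().mean()

# example: reconstruction_loss(torch.rand(5, 100).softmax(-1), torch.randint(0, 100, (5,)))
```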
Other parts of this embodiment are the same as any of embodiments 1 to 8, and thus are not described again.
Example 10:
this embodiment is further optimized on the basis of any one of the above embodiments 1 to 9, and discloses: the position loss, the attention guiding loss and the reconstruction loss are subjected to weighted summation to be used as a final training target
$\mathcal{L}$, expressed as:
$\mathcal{L} = \alpha\,\mathcal{L}_{loc} + \beta\,\mathcal{L}_{att} + \gamma\,\mathcal{L}_{rec}$
where $\alpha$, $\beta$ and $\gamma$ are respectively the second, third and fourth hyper-parameters balancing the loss terms.
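By way of illustration, a minimal sketch of this weighted combination follows; the default weight values are illustrative, not those of the patent.

```python
def total_loss(loc_loss, att_loss, rec_loss, alpha=1.0, beta=1.0, gamma=1.0):
    """Sketch of the final training objective: a weighted sum of the position
    loss, the attention guide loss and the reconstruction loss, balanced by
    the hyper-parameters alpha, beta and gamma."""
    return alpha * loc_loss + beta * att_loss + gamma * rec_loss
```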
Other parts of this embodiment are the same as any of embodiments 1 to 9, and thus are not described again.
The invention is used in specific scenes for example:
The first scenario case: effect evaluation was performed on the Charades-STA data set, which contains 6,672 videos of daily life. Each video has approximately 2.4 annotated moments, with an average duration of 8.2 seconds. The data set involves 6,670/16,124 videos/sentences, divided into a training part and a testing part of 5,336/12,404 and 1,334/3,720 videos/sentences respectively. In this embodiment, the invention applies C3D as the original video feature extractor to obtain the RGB features of the videos. Based on these features, the comparison between the invention and other methods on this data set is shown in fig. 3.
The second scenario case: effect evaluation was performed on the ActivityNet-Captions data set. This is the largest data set in the time sequence language positioning task, containing approximately 20,000 open-domain videos. On average, each video contains 3.65 queries, with an average of 13.48 words per query. The data set is divided into a training set, validation set 1 and validation set 2, containing 10,009/37,421, 4,917/17,505 and 4,885/17,031 videos/sentences respectively; the invention performs validation on validation set 1 and testing on validation set 2. The comparison between the invention and other prior methods is shown in fig. 4.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.

Claims (9)

1. A video paragraph positioning method based on language reconstruction and a graph mechanism, characterized by comprising the following steps:
S1, selecting a training data set, and extracting video-paragraph pairs as the input of a positioning algorithm model;
S2, loading the model parameters of a pre-trained 3D convolutional neural network, and extracting the video modality in the video-paragraph pair to obtain segment-level video features;
S3, extracting the text modality in the video-paragraph pair, and using GloVe encoding to represent each word in the text modality as a word vector of fixed dimension, serving as the query text encoding;
S4, processing the query text encoding with a projection layer and regularization to obtain word-level text features, splitting the word-level text features by sentence, inputting each resulting sentence in turn into a bidirectional gated recurrent unit, and extracting sentence-level text features;
S5, concatenating the segment-level video features and the word-level text features, taking each feature point as a graph node and setting the strength of each edge as a learnable parameter, initializing a multi-modal fully-connected graph composed of video feature nodes and text feature nodes, and inputting the graph into a multi-modal graph encoder for multi-modal feature fusion, so that each node can selectively acquire information from its neighbouring nodes and fine-grained feature interaction is realized;
S5.1, connecting the video nodes and the text nodes, setting the edge weights as learnable values, and initializing the multi-modal graph;
S5.2, passing the multi-modal graph into the multi-modal graph encoder for multi-modal graph modeling, where a single layer of multi-modal graph modeling consists of a graph modeling layer and a Transformer encoder with video and text position encodings;
S5.3, the multi-modal graph encoder being composed of multiple layers of the single-layer multi-modal graph modeling structure of step S5.2, with the multi-modal graph iteratively updated layer by layer;
S6, extracting the video feature nodes from the multi-modal graph processed by the multi-modal graph encoder, inputting them together with the sentence-level text features extracted in step S4 into an event feature decoder to obtain the multi-modal features of the target events, and using a multi-layer perceptron to predict the relative position of each event in the complete video;
S7, using the multi-modal features of each target event obtained in step S6, extracting the text feature nodes from the multi-modal graph processed by the multi-modal graph encoder, inputting them into a language reconstructor, and re-translating them into the paragraph query text to realize query text reconstruction;
S8, calculating the time sequence position information loss according to the results predicted in step S6;
S9, extracting the attention weight matrix in the event feature decoder, and calculating the attention guide loss;
S10, calculating the language reconstruction loss according to the text reconstruction results of step S7;
S11, adopting an Adam optimizer and training the positioning algorithm model with a constant learning rate strategy.
2. The method of claim 1, further comprising a verification mode for the positioning algorithm model, wherein the verification mode comprises:
in the testing stage, language reconstruction is not needed, so the language reconstructor is removed from the trained video paragraph positioning method based on language reconstruction and the graph mechanism to improve the inference speed; the remaining part without the language reconstructor is used as the evaluation model to perform multi-segment video retrieval on video and paragraph text pairs, so as to verify the effect of the positioning algorithm model.
3. A method for locating video segments based on language reconstruction and graph mechanism according to claim 1, wherein said step S5.2 comprises:
performing multi-modal graph reasoning in the graph modeling layer GM(·), so that each node acquires information from its neighbouring nodes and updates its own value and the edge weights.
4. The method for locating video paragraphs according to claim 1, wherein said step S6 comprises:
video feature nodes in a multi-modal graph processed by a multi-modal graph encoder are extracted and used as encoded signals of an event feature decoder to be input, sentence-level text features are used as query signals of the event feature decoder to be input, context relations among multiple events are mined through a self-attention mechanism, multi-modal features of a target event are obtained through a cross-modal attention mechanism, and finally a multi-layer perceptron is used for predicting the relative position of each event in a complete video.
5. The method for locating a video paragraph based on language reconstruction and graph mechanism as claimed in claim 1, wherein said step S7 includes:
inputting the multi-modal features of the target events obtained in step S6 as the encoded signal of the language reconstructor, extracting the text nodes from the multi-modal graph processed by the multi-modal graph encoder as the query signal input of the language reconstructor, calculating the probability distribution of each text node over the encoded vocabulary, and selecting the word with the maximum probability as the reconstruction result.
6. The method for locating a video paragraph based on language reconstruction and graph mechanism as claimed in claim 1, wherein said step S8 includes:
using the prediction result of each event in step S6, the positional information loss is calculated from the prediction result of the event, the total number of events, the actual annotation, and the G-IOU loss function.
7. The method for locating video paragraphs according to claim 1, wherein said step S9 comprises:
attention weights in a cross-modal attention mechanism in an event feature decoder are extracted, and attention guidance loss is calculated.
8. The method for locating video paragraphs according to claim 1, wherein said step S10 comprises: calculating the reconstruction loss according to the prediction result of the language reconstructor.
9. The method for locating video paragraphs according to any one of claims 6, 7 or 8, wherein said step S10 further comprises:
performing a weighted summation of the position loss, the attention guide loss and the reconstruction loss as the final training target.
CN202210270425.9A 2022-03-18 2022-03-18 Video paragraph positioning method based on language reconstruction and graph mechanism Active CN114357124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210270425.9A CN114357124B (en) 2022-03-18 2022-03-18 Video paragraph positioning method based on language reconstruction and graph mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210270425.9A CN114357124B (en) 2022-03-18 2022-03-18 Video paragraph positioning method based on language reconstruction and graph mechanism

Publications (2)

Publication Number Publication Date
CN114357124A CN114357124A (en) 2022-04-15
CN114357124B true CN114357124B (en) 2022-06-14

Family

ID=81095153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210270425.9A Active CN114357124B (en) 2022-03-18 2022-03-18 Video paragraph positioning method based on language reconstruction and graph mechanism

Country Status (1)

Country Link
CN (1) CN114357124B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438225B (en) * 2022-11-08 2023-03-24 苏州浪潮智能科技有限公司 Video text mutual inspection method and model training method, device, equipment and medium thereof
CN116226443B (en) * 2023-05-11 2023-07-21 山东建筑大学 Weak supervision video clip positioning method and system based on large-scale video corpus

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7881505B2 (en) * 2006-09-29 2011-02-01 Pittsburgh Pattern Recognition, Inc. Video retrieval system for human face content
US9208776B2 (en) * 2009-10-05 2015-12-08 At&T Intellectual Property I, L.P. System and method for speech-enabled access to media content by a ranked normalized weighted graph
CN108932304B (en) * 2018-06-12 2019-06-18 山东大学 Video moment localization method, system and storage medium based on cross-module state
US11244167B2 (en) * 2020-02-06 2022-02-08 Adobe Inc. Generating a response to a user query utilizing visual features of a video segment and a query-response-neural network
CN112380385B (en) * 2020-11-18 2023-12-29 湖南大学 Video time positioning method and device based on multi-mode relation diagram
CN113204674B (en) * 2021-07-05 2021-09-17 杭州一知智能科技有限公司 Video-paragraph retrieval method and system based on local-overall graph inference network
CN113204675B (en) * 2021-07-07 2021-09-21 成都考拉悠然科技有限公司 Cross-modal video time retrieval method based on cross-modal object inference network
CN114048350A (en) * 2021-11-08 2022-02-15 湖南大学 Text-video retrieval method based on fine-grained cross-modal alignment model
CN113934887B (en) * 2021-12-20 2022-03-15 成都考拉悠然科技有限公司 No-proposal time sequence language positioning method based on semantic decoupling
CN114064967B (en) * 2022-01-18 2022-05-06 之江实验室 Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network
CN114155477B (en) * 2022-02-08 2022-04-29 成都考拉悠然科技有限公司 Semi-supervised video paragraph positioning method based on average teacher model

Also Published As

Publication number Publication date
CN114357124A (en) 2022-04-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant