CN112488063B - Video statement positioning method based on multi-stage aggregation Transformer model
- Publication number: CN112488063B (application CN202011508292.1A)
- Authority: CN (China)
- Prior art keywords: video, stage, sentence, sequence, word
- Prior art date: 2020-12-18
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/22—Matching criteria, e.g. proximity measures
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/08—Learning methods
Abstract
The invention discloses a video sentence positioning method based on a multi-stage aggregation Transformer model. The model consists of a video-sentence Transformer and a multi-stage aggregation module built on top of it. In the video-sentence Transformer, each video slice or word adaptively aggregates and aligns information from all other video slices and words across the two modalities. Through the stacking of multiple layers, the finally obtained joint video-sentence representation captures rich visual-linguistic cues and enables finer matching. In the multi-stage aggregation module, the stage feature of the start stage, the stage feature of the intermediate stage and the stage feature of the end stage are concatenated to form the feature representation of a candidate segment. Because the obtained feature representation captures the specific information of the different stages, it is well suited to accurately locating the start and end positions of a video segment. The two modules are integrated into an effective and efficient network that improves the accuracy of video sentence localization.
Description
Technical Field
The invention belongs to the technical field of video sentence localization and retrieval, and particularly relates to a video sentence positioning method based on a multi-stage aggregation Transformer model.
Background
Video localization is a fundamental problem in computer vision and has wide application. Over the past decade, there has been a great deal of research into video action localization and related application systems. In recent years, with the rise of multimedia data and the diversification of user needs, the problem of locating sentences in video (video sentence localization) has become increasingly important. The purpose of video sentence localization is to locate, within a long video, the video segment that corresponds to a query sentence. Compared with video action localization, sentence localization poses greater challenges and has broader application prospects, such as video retrieval, automatic video captioning and intelligent human-computer interaction.
Video sentence localization is a challenging task. In addition to understanding the content of the video, it also requires matching the semantics between the video and the sentence.
Existing video sentence localization methods can generally be divided into two categories: one-stage and two-stage methods. A one-stage method takes a video and a query sentence as input, directly predicts the start and end points of the queried video segment, and thus directly generates the video segment associated with the query sentence. One-stage methods can be trained end to end, but they easily miss some correct video segments. In contrast, two-stage methods follow a candidate-segment generation and candidate-segment ranking pipeline: they typically first generate a series of candidate segments from the video and then rank these candidates according to their degree of match with the query sentence. Many approaches follow this route. Although two-stage methods can recall many possibly correct candidate video segments, several key problems remain poorly solved:
1) How to efficiently perform fine-grained semantic matching between videos and sentences?
2) How to accurately locate the start and end points of a video segment matching a sentence in the original long video?
For the first problem, most existing methods process the video and sentence sequences separately and then match them. However, processing the video and the sentence separately, for example encoding the sentence into a single vector and then matching, inevitably loses some of the detailed semantic content of the sentence, so fine-grained matching cannot be achieved;
for the second problem, existing methods typically use full-convolution, average-pooling or RoI-pooling operations to obtain a feature representation of a candidate segment. However, the features obtained by these operations are not sufficiently temporally discriminative. For example, a video segment usually contains several different stages, such as a start stage, an intermediate stage and an end stage. The information of these stages is very important for precisely locating the start and end points of the segment. The average-pooling operation completely discards this stage information and therefore cannot match the different stages to achieve accurate localization. Although full convolution or RoI pooling can delineate the different stages to some extent, they do not rely on explicit stage-specific features and are therefore also limited in localization accuracy.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provide a video sentence positioning method based on a multi-stage aggregation Transformer model, so as to improve the accuracy of video sentence localization.
In order to achieve the above object, the video sentence positioning method based on a multi-stage aggregation Transformer model according to the present invention comprises the following steps:
(1) video slice feature and word feature extraction
uniformly dividing a video into N time points according to time, collecting a video slice (consisting of consecutive frames, e.g. 50 frames) at each time point, performing feature extraction on each video slice to obtain a slice feature (N slice features in total), and arranging the N slice features in temporal order to form a video feature sequence;
converting each word of the sentence into a word vector (Doc2Vec) to obtain word features, and then arranging the word features according to their order in the sentence to form a sentence feature sequence;
mapping the slice features in the video feature sequence and the word features in the sentence feature sequence to the same dimensionality, so as to obtain a video feature sequence F^V = {f_1^V, f_2^V, ..., f_N^V} and a sentence feature sequence F^L = {f_1^L, f_2^L, ..., f_M^L}, wherein f_i^V represents the slice feature of the i-th slice of the video and f_j^L represents the word feature of the j-th word of the sentence;
(2) Constructing a video-sentence Transformer model and computing the video feature sequence and the sentence feature sequence
constructing a D-layer video-sentence Transformer model, wherein the output of the d-th layer, d = 1, 2, ..., D, is computed by an attention layer whose parameters are decoupled between the two modalities,
wherein V and L denote the video and sentence modalities respectively, Q, K and W are learnable parameters (different subscripts denote different parameters), and Att(·) is the attention calculation function;
taking the video feature sequence F^V and the sentence feature sequence F^L as the input of the video-sentence Transformer model, and computing layer by layer to obtain the D-th-layer output video feature sequence H^V = {h_1^V, ..., h_N^V} and sentence feature sequence H^L = {h_1^L, ..., h_M^L};
(3) Constructing a multi-stage aggregation module and computing the stage feature sequences and prediction score sequences of the three stages
calculating the stage feature sequences r^sta, r^mid and r^end of the start stage, the intermediate stage and the end stage:
r_i^sta = MLP_1^sta(h_i^V), r_i^mid = MLP_1^mid(h_i^V), r_i^end = MLP_1^end(h_i^V), i = 1, 2, ..., N
wherein the start-stage feature sequence r^sta consists of the stage features r_i^sta of the N slices, the intermediate-stage feature sequence r^mid consists of the stage features r_i^mid of the N slices, and the end-stage feature sequence r^end consists of the stage features r_i^end of the N slices; MLP_1^sta, MLP_1^mid and MLP_1^end are multi-layer perceptrons (MLP) that compute the stage feature sequences of the three stages respectively;
computing the prediction score sequences p^sta, p^mid and p^end of the start stage, the intermediate stage and the end stage:
p_i^sta = MLP_2^sta(r_i^sta), p_i^mid = MLP_2^mid(r_i^mid), p_i^end = MLP_2^end(r_i^end), i = 1, 2, ..., N
wherein the start-stage prediction score sequence p^sta consists of the prediction scores p_i^sta of the N slices, the intermediate-stage prediction score sequence p^mid consists of the prediction scores p_i^mid of the N slices, and the end-stage prediction score sequence p^end consists of the prediction scores p_i^end of the N slices; MLP_2^sta, MLP_2^mid and MLP_2^end are multi-layer perceptrons that compute the prediction score sequences of the three stages respectively;
(4) Training the multi-stage aggregation Transformer model
the video-sentence Transformer model and the multi-stage aggregation module together form the multi-stage aggregation Transformer model;
constructing a video-sentence training data set, wherein each piece of data comprises a video, a sentence, and the video-slice start position s^gt and end position e^gt of the video segment located by the sentence;
extracting a piece of data from the video-sentence training data set, randomly masking one word in the sentence and replacing it with the token "[MASK]", processing the video and the sentence according to steps (1) to (3), and computing the true scores y_i^sta, y_i^mid and y_i^end of the start stage, the intermediate stage and the end stage of each video slice,
wherein σ^sta, σ^mid and σ^end are the standard deviations of unnormalized two-dimensional Gaussian distributions, and α^sta, α^mid and α^end are positive scalars that control the values of the standard deviations;
4.1) calculating the weighted cross-entropy loss L_stage between the prediction score sequences p^sta, p^mid, p^end and the true scores y^sta, y^mid, y^end on the prediction layer;
4.2) for the z-th candidate segment, concatenating the start-stage feature at its start position, the intermediate-stage feature at its middle position and the end-stage feature at its end position, and calculating from the concatenated feature the predicted video-slice start and end positions of the z-th candidate segment and its matching score prediction;
4.3) calculating the boundary regression loss L_regress from the predicted start and end positions of all candidate segments,
wherein Z is the total number of candidate segments, and s_z, e_z are respectively the video-slice start position and end position of the z-th candidate segment;
4.4) calculating the matching-score weighted cross-entropy loss L_match,
wherein y_z is the degree of overlap between the z-th candidate segment and the video segment located by the sentence (the video from the start position s^gt to the end position e^gt);
4.5) calculating the cross-entropy loss L_word of the masked-word prediction:
L_word = -log p_mask
wherein p_mask is the probability of predicting the masked word based on the sentence feature sequence H^L;
4.6) calculating the loss L_total of the whole network for training the multi-stage aggregation Transformer model:
L_total = L_stage + L_regress + L_match + L_word
4.7) updating the parameters of the whole network
sequentially taking a piece of data from the video-sentence training data set and updating the parameters of the whole network according to the loss L_total, until the video-sentence training data set is exhausted, thereby obtaining the trained multi-stage aggregation Transformer model;
(5) Video sentence localization
inputting a video and a complete query sentence without any masked word, processing them according to steps (1) to (3), calculating the matching score prediction of each candidate segment and its predicted video-slice start and end positions (which form a new candidate segment) according to step 4.2), sorting the new candidate segments by matching score from high to low, removing new candidate segments whose overlap exceeds 70% by non-maximum suppression (NMS), and returning the top 1 or top 5 new candidate segments as the finally located video segments.
The object of the invention is thus achieved.
Aiming at the problems of existing methods, the invention constructs a multi-stage aggregation Transformer model for the video sentence localization network. The multi-stage aggregation Transformer model consists of two parts: a video-sentence Transformer model and a multi-stage aggregation module located on top of it. In the video-sentence Transformer model, the single-BERT framework is retained, but the BERT parameters are decoupled into different groups to process the video and sentence information separately. The video-sentence Transformer model therefore models the two modalities, video and sentence, more effectively while maintaining the compactness and efficiency of a single BERT structure. In the video-sentence Transformer model, each video slice or word can adaptively aggregate and align information from all other video slices and words in both modalities according to its own semantics. Through the stacking of multiple layers, the finally obtained joint video-sentence representation captures rich visual-linguistic cues, so that finer matching can be realized. In the multi-stage aggregation module, three stage features corresponding to different stages, namely the stage features of the start stage, the intermediate stage and the end stage, are computed for each video slice. Then, for a candidate segment, the stage feature of the start stage, the stage feature of the intermediate stage and the stage feature of the end stage are concatenated to form the feature representation of the candidate segment. Because the obtained feature representation captures the specific information of the different stages, it is well suited to accurately locating the start and end positions of the video segment. The two modules are integrated into an effective and efficient network that improves the accuracy of video sentence localization.
Drawings
FIG. 1 is a flowchart of an embodiment of the video sentence positioning method based on a multi-stage aggregation Transformer model according to the present invention;
FIG. 2 is a schematic view of a video slice;
FIG. 3 is a schematic diagram of the video sentence positioning method based on a multi-stage aggregation Transformer model according to an embodiment of the present invention;
FIG. 4 is a flow diagram of one embodiment of video statement location.
Detailed Description
The following describes embodiments of the present invention with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It should be expressly noted that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the present invention.
FIG. 1 is a flowchart of an embodiment of the video sentence positioning method based on a multi-stage aggregation Transformer model according to the present invention.
In this embodiment, as shown in FIG. 1, the video sentence positioning method based on the multi-stage aggregation Transformer model comprises the following steps:
step S1: video slice feature and word feature extraction
In this embodiment, as shown in FIG. 2, a video is uniformly divided into N time points according to time; at each time point a video slice (consisting of consecutive frames, e.g. 50 frames) is collected; feature extraction is performed on each video slice to obtain a slice feature (N slice features in total); and the N slice features are arranged in temporal order to form a video feature sequence.
Each word of the sentence is converted into a word vector (Doc2Vec) to obtain word features, and the word features are then arranged according to their order in the sentence to form a sentence feature sequence.
The slice features in the video feature sequence and the word features in the sentence feature sequence are mapped to the same dimensionality, giving a video feature sequence F^V = {f_1^V, f_2^V, ..., f_N^V} and a sentence feature sequence F^L = {f_1^L, f_2^L, ..., f_M^L}, where f_i^V represents the slice feature of the i-th slice of the video and f_j^L represents the word feature of the j-th word of the sentence.
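Below is a minimal sketch of this step. The C3D slice features and Doc2Vec word vectors used in the embodiment are replaced here by random tensors, and the projection layers, dimensions and variable names are illustrative assumptions rather than the exact configuration of the invention.

```python
import torch
import torch.nn as nn

N, M = 32, 12                   # number of video slices / number of words in the sentence
d_v, d_l, d = 4096, 300, 512    # raw slice-feature dim, raw word-vector dim, common dim

# Stand-ins for the raw features that C3D and Doc2Vec would produce.
raw_slice_feats = torch.randn(N, d_v)   # one feature per video slice
raw_word_feats = torch.randn(M, d_l)    # one feature per word

# Step S1: map both modalities to the same dimensionality d.
video_proj = nn.Linear(d_v, d)
word_proj = nn.Linear(d_l, d)

F_V = video_proj(raw_slice_feats)   # video feature sequence, shape (N, d)
F_L = word_proj(raw_word_feats)     # sentence feature sequence, shape (M, d)
print(F_V.shape, F_L.shape)
```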
Step S2: constructing a video sentence Transformer model, and calculating a video characteristic sequence and a sentence characteristic sequence
Most of the existing methods usually process the video and sentence sequences separately and then match them. However, processing these two sequences separately, such as encoding the sentence into a vector and then matching, will inevitably lose some detailed semantic content in the sentence, and thus a detailed matching cannot be achieved. To solve this problem, as shown in fig. 3, the present invention constructs a completely new transform model of video sentences as a backbone network, where the output of the D-th layer, D is 1, 2.
Wherein V, L denotes video and sentences, respectively, Q, K, W is a learnable parameter, wherein different subscripts denote different parameters, Att (-) is an attention calculation function;
video feature sequenceSentence feature sequencesCalculating as the input of a video statement Transformer model, and calculating layer by layer according to a formula (1) to obtain a D-th layer output video feature sequenceSentence feature sequences
Compared with the single BERT model widely used in the prior art, the video-sentence Transformer model does not change the structure and introduces no additional computation, but uses different parameters to process the contents of different modalities, so that the compactness and efficiency of the model are maintained while its multi-modal modelling capability is improved. Meanwhile, the multi-stage aggregation Transformer model of the invention also differs from other multi-modal BERT models, which use two BERT streams to process the contents of the different modalities. Models based on two BERT streams introduce an additional cross-modal layer to realize multi-modal interaction, whereas the model of the invention keeps the same structure as the original BERT model and is therefore more compact and efficient.
The multi-stage aggregation Transformer model consists of multiple layers of the computation in formula (1). After stacking multiple layers, the obtained joint video-sentence representation has a rich ability to aggregate and align visual-linguistic cues. Each slice in the video can interact with each word in the query sentence, thereby realizing more detailed and accurate video-sentence matching, which is very important for accurate localization.
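The exact form of formula (1) is not reproduced in this text, so the following is only a minimal sketch of one such layer under an assumption: standard scaled dot-product attention is applied over the concatenated video and sentence tokens, while the Q, K and value projections are selected according to each token's modality. All class, variable and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDecoupledAttention(nn.Module):
    """One attention layer over concatenated video + sentence tokens, with the
    Q/K/value projections decoupled per modality (a sketch in the spirit of formula (1))."""
    def __init__(self, d):
        super().__init__()
        self.q = nn.ModuleDict({"V": nn.Linear(d, d), "L": nn.Linear(d, d)})
        self.k = nn.ModuleDict({"V": nn.Linear(d, d), "L": nn.Linear(d, d)})
        self.w = nn.ModuleDict({"V": nn.Linear(d, d), "L": nn.Linear(d, d)})
        self.d = d

    def forward(self, h_v, h_l):
        # Project every token with the parameters of its own modality.
        q = torch.cat([self.q["V"](h_v), self.q["L"](h_l)], dim=0)
        k = torch.cat([self.k["V"](h_v), self.k["L"](h_l)], dim=0)
        v = torch.cat([self.w["V"](h_v), self.w["L"](h_l)], dim=0)
        att = F.softmax(q @ k.t() / self.d ** 0.5, dim=-1)   # every slice/word attends to all tokens
        out = att @ v
        n = h_v.shape[0]
        return out[:n], out[n:]                              # split back into video / sentence parts

layer = ModalityDecoupledAttention(512)
h_v, h_l = layer(torch.randn(32, 512), torch.randn(12, 512))
print(h_v.shape, h_l.shape)   # torch.Size([32, 512]) torch.Size([12, 512])
```

Stacking D such layers (with the usual feed-forward sublayers and residual connections of a BERT block) would give the D-layer backbone described above.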
Step S3: constructing a multi-stage aggregation module, and calculating a stage feature sequence and a prediction score sequence of three stages
After the video-sentence Transformer model, the obtained joint representation, i.e. the output video feature sequence H^V and sentence feature sequence H^L, carries richer information and supports finer matching. However, in order to overcome the problem that existing methods ignore the different stages contained in a video segment, the invention builds a multi-stage aggregation module on top of the video-sentence Transformer model to accurately locate the start and end positions of the queried video segment. In the multi-stage aggregation module, the invention computes, for each video slice in the video sequence, the stage features of three stages corresponding to different temporal phases, namely the stage features of the start stage, the intermediate stage and the end stage. To improve the discriminability of the different stage features, the invention adds a prediction layer on top of the stage features to predict the scores of the start stage, the intermediate stage and the end stage respectively.
In this embodiment, as shown in FIG. 3, the multi-stage aggregation module is used to calculate the stage feature sequences r^sta, r^mid and r^end of the start stage, the intermediate stage and the end stage:
r_i^sta = MLP_1^sta(h_i^V), r_i^mid = MLP_1^mid(h_i^V), r_i^end = MLP_1^end(h_i^V), i = 1, 2, ..., N
where the start-stage feature sequence r^sta consists of the stage features r_i^sta of the N slices, the intermediate-stage feature sequence r^mid consists of the stage features r_i^mid of the N slices, and the end-stage feature sequence r^end consists of the stage features r_i^end of the N slices; MLP_1^sta, MLP_1^mid and MLP_1^end are multi-layer perceptrons (MLP) that compute the stage feature sequences of the three stages respectively.
The prediction score sequences p^sta, p^mid and p^end of the start stage, the intermediate stage and the end stage are then computed:
p_i^sta = MLP_2^sta(r_i^sta), p_i^mid = MLP_2^mid(r_i^mid), p_i^end = MLP_2^end(r_i^end), i = 1, 2, ..., N
where the start-stage prediction score sequence p^sta consists of the prediction scores p_i^sta of the N slices, the intermediate-stage prediction score sequence p^mid consists of the prediction scores p_i^mid of the N slices, and the end-stage prediction score sequence p^end consists of the prediction scores p_i^end of the N slices; MLP_2^sta, MLP_2^mid and MLP_2^end are multi-layer perceptrons that compute the prediction score sequences of the three stages respectively.
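A minimal sketch of this module follows, assuming each stage feature is produced by its own MLP applied to the Transformer's output slice feature and each stage score by a second MLP applied to that stage feature; the two-layer MLP structure and all names are illustrative.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, d_in), nn.ReLU(), nn.Linear(d_in, d_out))

class MultiStageAggregation(nn.Module):
    """Per-slice stage features and stage scores for the start / middle / end stages."""
    def __init__(self, d):
        super().__init__()
        self.stage_mlps = nn.ModuleDict({s: mlp(d, d) for s in ("sta", "mid", "end")})  # MLP_1
        self.score_mlps = nn.ModuleDict({s: mlp(d, 1) for s in ("sta", "mid", "end")})  # MLP_2

    def forward(self, h_v):                      # h_v: (N, d) Transformer video output
        r = {s: m(h_v) for s, m in self.stage_mlps.items()}        # stage feature sequences
        p = {s: self.score_mlps[s](r[s]).squeeze(-1) for s in r}   # stage score sequences, each (N,)
        return r, p

agg = MultiStageAggregation(512)
r, p = agg(torch.randn(32, 512))
print(r["sta"].shape, p["sta"].shape)   # torch.Size([32, 512]) torch.Size([32])
```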
Step S4: training a multistage polymerization Transformer model
The video sentence Transformer model and the multi-stage aggregation module form a multi-stage aggregation Transformer model;
A video-sentence training data set is constructed, in which each piece of data comprises a video, a sentence, and the video-slice start position s^gt and end position e^gt of the video segment located by the sentence.
A piece of data is extracted from the video-sentence training data set, one word in the sentence is randomly masked and replaced with the token "[MASK]", the video and the sentence are processed according to steps (1) to (3), and the true scores y_i^sta, y_i^mid and y_i^end of the start stage, the intermediate stage and the end stage of each video slice are computed,
where σ^sta, σ^mid and σ^end are the standard deviations of unnormalized two-dimensional Gaussian distributions, and α^sta, α^mid and α^end are positive scalars that control the values of the standard deviations; the larger the values of α^sta, α^mid and α^end, the higher the true scores assigned to video slices near the start / middle / end position of the ground-truth video segment.
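The exact formula for these true scores is not reproduced in this text; the sketch below assumes each stage label is an unnormalized Gaussian centred on the ground-truth start / middle / end slice, with a standard deviation equal to α times the segment length. The α values and the functional form are therefore illustrative assumptions.

```python
import numpy as np

def stage_true_scores(N, s_gt, e_gt, alpha_sta=0.25, alpha_mid=0.21, alpha_end=0.25):
    """Unnormalized Gaussian 'true scores' for the start / middle / end stages (assumed form)."""
    idx = np.arange(N, dtype=np.float32)
    length = max(e_gt - s_gt, 1)
    centres = {"sta": s_gt, "mid": (s_gt + e_gt) / 2.0, "end": e_gt}
    alphas = {"sta": alpha_sta, "mid": alpha_mid, "end": alpha_end}
    scores = {}
    for stage, centre in centres.items():
        sigma = alphas[stage] * length          # alpha controls the standard deviation
        scores[stage] = np.exp(-((idx - centre) ** 2) / (2.0 * sigma ** 2))
    return scores

y = stage_true_scores(N=32, s_gt=8, e_gt=20)
print(np.round(y["sta"][:12], 2))   # peaks at the ground-truth start slice
```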
Step S4.1: computing a weighted cross-entropy loss L on a prediction layerstage:
Step S4.2: calculating predicted values of the start position and the end position of the video slice of the z-th candidate segment And matching score prediction values
Since all three phases are specific to the start, intermediate and end phases, this tandem feature is very discriminative for accurate video segment positioning.
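A sketch of step S4.2 follows, under the assumption that the concatenated stage features of a candidate (start, end) pair are fed to two small linear heads producing the boundary predictions and the matching score; the head structure and names are illustrative.

```python
import torch
import torch.nn as nn

class CandidateHeads(nn.Module):
    """Boundary-regression and matching-score heads on concatenated stage features."""
    def __init__(self, d):
        super().__init__()
        self.boundary = nn.Linear(3 * d, 2)   # predicted (start, end) positions
        self.matching = nn.Linear(3 * d, 1)   # predicted matching score

    def forward(self, r_sta, r_mid, r_end, candidates):
        feats = []
        for s, e in candidates:               # candidate = (start slice, end slice)
            m = (s + e) // 2
            feats.append(torch.cat([r_sta[s], r_mid[m], r_end[e]], dim=-1))
        feats = torch.stack(feats)            # (Z, 3d)
        boundaries = self.boundary(feats)     # (Z, 2)
        match = torch.sigmoid(self.matching(feats)).squeeze(-1)   # (Z,)
        return boundaries, match

heads = CandidateHeads(512)
r = torch.randn(3, 32, 512)                   # stand-ins for r_sta, r_mid, r_end with N = 32 slices
b, m = heads(r[0], r[1], r[2], [(2, 10), (5, 20), (12, 31)])
print(b.shape, m.shape)                       # torch.Size([3, 2]) torch.Size([3])
```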
Step S4.3: calculating boundary regression loss Lregress:
Wherein Z is the total number of candidate segments,the starting position and the ending position of the video slice of the z-th candidate segment are respectively.
Step S4.4: computing matching score weighted cross entropy loss Lmatch:
Wherein, yzVideo segment (start position) for locating the z-th candidate segment and sentenceTo the end positionVideo) of the video.
Unlike previous methods that predict IoU without boundary regression, the present invention predicts the IoU between the regressed candidate segment and the ground truth, which makes it possible to measure the quality of the boundary regression.
For generating candidate segments, any candidate-segment generation method may be applied within the framework of the present invention. For simplicity, the present invention first enumerates all possible video segments that consist of consecutive video slices. Then, shorter segments are selected densely as candidates, while for longer segments the sampling interval is gradually increased so that they are selected sparsely. The main idea of this strategy is to remove redundant candidate segments with large overlap.
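The following is a sketch of such a candidate-generation scheme, assuming the stride between enumerated segments grows once the segment length passes a threshold; the threshold and stride values are illustrative, not the configuration of the invention.

```python
def generate_candidates(num_slices, base_stride=1, long_threshold=16, long_stride=4):
    """Enumerate (start, end) slice pairs, sampling long segments more sparsely."""
    candidates = []
    for length in range(1, num_slices + 1):
        stride = base_stride if length <= long_threshold else long_stride
        for start in range(0, num_slices - length + 1, stride):
            candidates.append((start, start + length - 1))
    return candidates

cands = generate_candidates(32)
print(len(cands), cands[:3])   # number of candidates and the first few (start, end) pairs
```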
Step S4.5: calculating cross entropy loss L for masking word predictionword
Lword=-logpmask
Wherein p ismaskIs based on sentence characteristic sequenceThe probability of predicting a word as masked.
During training, the invention takes a video-sentence pair as the input of the network. Similar to the original BERT model, a word in the sentence sequence is randomly masked. The masked word is replaced with the special token "[MASK]". The model then predicts the masked word from the unmasked words and from the information of the video sequence. It is worth noting that predicting some important words, such as nouns corresponding to objects and verbs corresponding to actions, requires information from the video sequence. Thus, masked-word prediction not only lets the Transformer model learn the language, but also better aligns the video and sentence modalities. The loss function for the masked-word prediction is the standard cross-entropy loss.
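A toy sketch of the masked-word prediction loss follows; the tiny vocabulary, the linear vocabulary head and the random stand-in feature are illustrative assumptions (in the invention, the prediction is made from the Transformer's sentence feature at the masked position).

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = ["[MASK]", "a", "person", "opens", "the", "door", "slowly"]
word_to_id = {w: i for i, w in enumerate(vocab)}

sentence = ["a", "person", "opens", "the", "door"]
pos = random.randrange(len(sentence))
target = word_to_id[sentence[pos]]            # the word to be recovered
sentence[pos] = "[MASK]"                      # replace the chosen word with the mask token

h_mask = torch.randn(1, 512)                  # stand-in for the feature at the masked position
vocab_head = nn.Linear(512, len(vocab))       # predicts the masked word over the vocabulary
logits = vocab_head(h_mask)

loss_word = F.cross_entropy(logits, torch.tensor([target]))   # equals -log p_mask
print(sentence, float(loss_word))
```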
Step S4.6: calculating the loss L of the whole network for training the multi-stage aggregation Transformer modeltotal
Ltotal=Lstage+Lregress+Lmatch+Lword
Step S4.7: updating parameters of an entire network
The invention does not pre-train the video sentence Transformer model on any other data set, and all parameters are initialized randomly.
A piece of data is sequentially taken from the video-sentence training data set and the parameters of the whole network are updated according to the loss L_total, until the video-sentence training data set is exhausted, thereby obtaining the trained multi-stage aggregation Transformer model.
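One illustrative parameter-update step is sketched below. The dummy model, random batches and loss values are placeholders; only the loss combination and the Adam optimizer (with the learning rate and batch size given in the experiments below) follow the description.

```python
import torch
import torch.nn as nn

class DummyMSATModel(nn.Module):
    """Stand-in for the multi-stage aggregation Transformer; returns the four training losses."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(512, 4)

    def forward(self, batch):
        out = self.proj(batch).mean(dim=0)
        return {"stage": out[0].abs(), "regress": out[1].abs(),
                "match": out[2].abs(), "word": out[3].abs()}

model = DummyMSATModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # settings as in the experiments

for step in range(3):                          # iterate over the training data
    batch = torch.randn(16, 512)               # batch size 16
    losses = model(batch)
    loss_total = sum(losses.values())          # L_total = L_stage + L_regress + L_match + L_word
    optimizer.zero_grad()
    loss_total.backward()
    optimizer.step()
```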
step S5: video sentence localization
A video and a complete query sentence without any masked word are input; as shown in FIG. 4, they are processed according to steps (1) to (3); the matching score prediction of each candidate segment and its predicted video-slice start and end positions (which form a new candidate segment) are calculated according to step 4.2); the new candidate segments are sorted by matching score from high to low; new candidate segments whose overlap exceeds 70% are removed using NMS (non-maximum suppression); and the top 1 or top 5 new candidate segments are returned as the finally located video segments.
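The post-processing of this step can be sketched as follows; the temporal-IoU definition and the simple greedy NMS are standard, and only the 0.7 overlap threshold and the top-1/top-5 return follow the description above.

```python
def iou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def nms(segments, scores, iou_threshold=0.7, keep_top=5):
    """Sort candidates by matching score and drop any candidate overlapping a kept one by more than the threshold."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(segments[i], segments[j]) <= iou_threshold for j in kept):
            kept.append(i)
        if len(kept) == keep_top:
            break
    return [segments[i] for i in kept]

print(nms([(2, 10), (3, 11), (20, 30)], [0.9, 0.8, 0.7]))   # -> [(2, 10), (20, 30)]
```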
Performance evaluation
Experiments are performed on two large public datasets, ActivityNet_Captions [14] and TACoS [24]. ActivityNet_Captions contains 20K videos and 100K query sentences, and the average video length is 2 minutes. TACoS contains 127 videos related to cooking activities with an average duration of 4.79 minutes and 18,818 segment-sentence pairs, i.e. about 148 query sentences per video on average. TACoS is a very challenging dataset: its query sentences describe activities at multiple levels, with different levels of detail.
The invention is evaluated using Rank n @ IoU = m. It represents the percentage of correct localizations over all queries, where a localization is counted as correct if at least one segment among the top-n output results matches the ground truth, and a segment matches the ground truth if the IoU between the segment and the ground truth is greater than m.
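A small sketch of this metric, assuming each query has a list of predicted segments ordered by score and a single ground-truth segment; the function name and data layout are illustrative.

```python
def rank_n_at_iou(predictions, ground_truths, n=1, m=0.5):
    """Fraction of queries whose top-n predictions contain at least one segment with temporal IoU > m."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    hits = sum(
        any(iou(seg, gt) > m for seg in preds[:n])
        for preds, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

preds = [[(2.0, 10.0), (12.0, 20.0)], [(5.0, 9.0)]]   # per-query ranked predictions
gts = [(3.0, 11.0), (30.0, 40.0)]                     # per-query ground truth
print(rank_n_at_iou(preds, gts, n=1, m=0.5))          # 0.5
```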
Adam is used to optimize the network. The batch size is set to 16 and the learning rate to 0.0001. The number of Transformer layers is set to 6, and the feature dimension of all layers to 512. The scalars α controlling the standard deviations are set to 0.25 and 0.21. The number of Transformer attention heads is set to 16 for ActivityNet_Captions and 32 for TACoS. Video slice features are extracted with a C3D network. The sampled video slice length is set to 32 for the ActivityNet_Captions dataset and to 128 for the TACoS dataset.
The multi-stage aggregation Transformer network proposed by the present invention is compared with various current advanced methods, and the comparison results are shown in tables 1-2.
TABLE 1
Table 1 shows the comparison with other methods on the ActivityNet_Captions dataset.
TABLE 2
Table 2 shows the comparison with other methods on the TACoS dataset.
From the experimental results, it can be seen that the present invention achieves a significant improvement over previous methods. Although the invention is 1.09 points lower than CSMGAN [18] on Rank1@IoU=0.5 on the ActivityNet_Captions dataset, it outperforms CSMGAN on all other criteria. In particular, on Rank1@IoU=0.7 and Rank5@IoU=0.7 it is respectively 2.63 and 3.55 points higher than CSMGAN. Note that IoU=0.7 is a more stringent criterion for deciding whether a segment is correct, indicating that the present invention achieves higher-quality localization. Furthermore, on the TACoS dataset the present invention is more than 10 percentage points higher than CSMGAN on all evaluation metrics, which indicates its superiority over the CSMGAN method. The invention also achieves overwhelming advantages over the other methods, and these results fully demonstrate its effectiveness. In the video-sentence Transformer model of the invention, each video slice can interact with each word in the query sentence, thereby obtaining more detailed and accurate video-sentence alignment. Thanks to the multi-stage aggregation module, the computed video segment representations can match the activities of the different stages. The two modules are tightly combined to form a very effective and efficient segment localization network.
Although illustrative embodiments of the present invention have been described above to facilitate understanding by those skilled in the art, it should be understood that the invention is not limited to the scope of these embodiments. Various changes that are apparent to those skilled in the art remain within the protection of the invention as long as they fall within the spirit and scope of the invention as defined by the appended claims, and all inventions utilizing the inventive concept are protected.
Claims (1)
1. A video statement positioning method based on a multi-stage aggregation Transformer model is characterized by comprising the following steps:
(1) video slice feature and word feature extraction
uniformly dividing a video into N time points according to time, collecting a video slice at each time point, wherein the video slice consists of consecutive frames, performing feature extraction on each video slice to obtain N slice features in total, and arranging the N slice features in temporal order to form a video feature sequence;
converting each word of the sentence into a word vector to obtain word features, and then arranging the word features according to their order in the sentence to form a sentence feature sequence;
mapping the slice features in the video feature sequence and the word features in the sentence feature sequence to the same dimensionality, so as to obtain a video feature sequence F^V = {f_1^V, f_2^V, ..., f_N^V} and a sentence feature sequence F^L = {f_1^L, f_2^L, ..., f_M^L}, wherein f_i^V represents the slice feature of the i-th slice of the video and f_j^L represents the word feature of the j-th word of the sentence;
(2) constructing a video-sentence Transformer model and computing the video feature sequence and the sentence feature sequence
constructing a D-layer video-sentence Transformer model, wherein the output of the d-th layer, d = 1, 2, ..., D, is:
wherein V and L denote the video and sentence modalities respectively, Q, K and W are learnable parameters, different subscripts denoting different parameters, and Att(·) is an attention calculation function;
taking the video feature sequence F^V and the sentence feature sequence F^L as the input of the video-sentence Transformer model to obtain the D-th-layer output video feature sequence H^V = {h_1^V, ..., h_N^V} and sentence feature sequence H^L = {h_1^L, ..., h_M^L};
(3) constructing a multi-stage aggregation module and computing the stage feature sequences and prediction score sequences of the three stages
calculating the stage feature sequences r^sta, r^mid and r^end of the start stage, the intermediate stage and the end stage:
wherein the start-stage feature sequence r^sta consists of the stage features r_i^sta, i = 1, 2, ..., N, of the N slices, the intermediate-stage feature sequence r^mid consists of the stage features r_i^mid, i = 1, 2, ..., N, and the end-stage feature sequence r^end consists of the stage features r_i^end, i = 1, 2, ..., N; MLP_1^sta, MLP_1^mid and MLP_1^end are multi-layer perceptrons (MLP) that compute the stage feature sequences of the three stages respectively;
computing the prediction score sequences p^sta, p^mid and p^end of the start stage, the intermediate stage and the end stage:
wherein the start-stage prediction score sequence p^sta consists of the prediction scores p_i^sta of the N slices, the intermediate-stage prediction score sequence p^mid consists of the prediction scores p_i^mid of the N slices, and the end-stage prediction score sequence p^end consists of the prediction scores p_i^end of the N slices, the multi-layer perceptrons being used for calculating the prediction score sequences of the three stages;
(4) training the multi-stage aggregation Transformer model
the video-sentence Transformer model and the multi-stage aggregation module form the multi-stage aggregation Transformer model;
constructing a video-sentence training data set, wherein each piece of data comprises a video, a sentence, and the video-slice start position s^gt and end position e^gt of the video segment located by the sentence;
extracting a piece of data from the video-sentence training data set, randomly masking one word in the sentence and replacing it with the token "[MASK]", processing the video and the sentence according to steps (1) to (3), and calculating the true scores y_i^sta, y_i^mid and y_i^end of the start stage, the intermediate stage and the end stage of each video slice,
wherein σ^sta, σ^mid and σ^end are the standard deviations of unnormalized two-dimensional Gaussian distributions, and α^sta, α^mid and α^end are positive scalars that control the values of the standard deviations;
4.1) calculating the weighted cross-entropy loss L_stage between the prediction score sequences p^sta, p^mid, p^end and the true scores y^sta, y^mid, y^end on the prediction layer;
4.2) calculating the predicted video-slice start and end positions of the z-th candidate segment and its matching score prediction,
wherein the video-slice start position, middle position and end position of the z-th candidate segment are used to index the stage feature sequences r^sta, r^mid and r^end obtained in step (3), the stage features at those positions being concatenated for the calculation;
4.3) calculating the boundary regression loss L_regress from the predicted start and end positions of the candidate segments,
wherein Z is the total number of candidate segments;
4.4) calculating the matching-score weighted cross-entropy loss L_match,
wherein y_z is the degree of overlap between the z-th candidate segment and the video segment located by the sentence, i.e. the video from the start position s^gt to the end position e^gt;
4.5) calculating the cross-entropy loss L_word of the masked-word prediction:
L_word = -log p_mask
wherein p_mask is the probability of predicting the masked word based on the sentence feature sequence H^L;
4.6) calculating the loss L_total of the whole network for training the multi-stage aggregation Transformer model:
L_total = L_stage + L_regress + L_match + L_word
4.7) updating the parameters of the whole network
sequentially taking a piece of data from the video-sentence training data set and updating the parameters of the whole network according to the loss L_total, until the video-sentence training data set is exhausted, thereby obtaining the trained multi-stage aggregation Transformer model;
(5) video sentence localization
inputting a video and a complete query sentence without any masked word, processing them according to steps (1) to (3), calculating the matching score prediction of each candidate segment and its predicted video-slice start and end positions according to step 4.2) to form a new candidate segment, sorting the new candidate segments by matching score from high to low, removing new candidate segments whose overlap exceeds 70% using non-maximum suppression, and returning the top 1 or top 5 new candidate segments as the finally located video segments.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011508292.1A CN112488063B (en) | 2020-12-18 | 2020-12-18 | Video statement positioning method based on multi-stage aggregation Transformer model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011508292.1A CN112488063B (en) | 2020-12-18 | 2020-12-18 | Video statement positioning method based on multi-stage aggregation Transformer model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112488063A CN112488063A (en) | 2021-03-12 |
CN112488063B true CN112488063B (en) | 2022-06-14 |
Family
ID=74914591
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011508292.1A Active CN112488063B (en) | 2020-12-18 | 2020-12-18 | Video statement positioning method based on multi-stage aggregation Transformer model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112488063B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115708359B (en) * | 2021-08-20 | 2024-09-03 | 小米科技(武汉)有限公司 | Video clip interception method, device and storage medium |
CN116740067B (en) * | 2023-08-14 | 2023-10-20 | 苏州凌影云诺医疗科技有限公司 | Infiltration depth judging method and system for esophageal lesions |
2020-12-18: CN application CN202011508292.1A filed; patent CN112488063B (en); status: Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1144588A (en) * | 1994-03-14 | 1997-03-05 | 美国赛特公司 | A system for implanting image into video stream |
CN110377792A (en) * | 2019-06-14 | 2019-10-25 | 浙江大学 | A method of task is extracted using the video clip that the cross-module type Internet solves Problem based learning |
CN110225368A (en) * | 2019-06-27 | 2019-09-10 | 腾讯科技(深圳)有限公司 | A kind of video locating method, device and electronic equipment |
CN110781347A (en) * | 2019-10-23 | 2020-02-11 | 腾讯科技(深圳)有限公司 | Video processing method, device, equipment and readable storage medium |
CN111814489A (en) * | 2020-07-23 | 2020-10-23 | 苏州思必驰信息科技有限公司 | Spoken language semantic understanding method and system |
CN111931736A (en) * | 2020-09-27 | 2020-11-13 | 浙江大学 | Lip language identification method and system using non-autoregressive model and integrated discharge technology |
Non-Patent Citations (2)
Title |
---|
Video Action Transformer Network; Rohit Girdhar et al.; 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020-01-09; pp. 244-253 *
Research on localization, tracking and feature extraction of lip information in lip-reading applications; Yang Yang; China Masters' Theses Full-text Database (Information Science and Technology); 2009-09-15; I138-563 *
Also Published As
Publication number | Publication date |
---|---|
CN112488063A (en) | 2021-03-12 |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |