CN112488063B - Video statement positioning method based on multi-stage aggregation Transformer model - Google Patents

Video statement positioning method based on multi-stage aggregation Transformer model

Info

Publication number
CN112488063B
CN112488063B (application CN202011508292.1A)
Authority
CN
China
Prior art keywords
video
stage
sentence
sequence
word
Prior art date
Legal status
Active
Application number
CN202011508292.1A
Other languages
Chinese (zh)
Other versions
CN112488063A (en)
Inventor
杨阳
张明星
Current Assignee
Guizhou University
Original Assignee
Guizhou University
Priority date
Filing date
Publication date
Application filed by Guizhou University
Priority to CN202011508292.1A
Publication of CN112488063A
Application granted
Publication of CN112488063B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 20/00 Scenes; Scene-specific elements
            • G06V 20/40 Scenes; Scene-specific elements in video content
              • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 Pattern recognition
            • G06F 18/20 Analysing
              • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06F 18/22 Matching criteria, e.g. proximity measures
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/045 Combinations of networks
                • G06N 3/047 Probabilistic or stochastic networks
              • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video sentence positioning method based on a multi-stage aggregation Transformer model. Through multi-layer stacking, the resulting joint video sentence representation captures rich visual-linguistic cues and enables finer-grained matching. In the multi-stage aggregation module, the stage feature of the start stage, the stage feature of the intermediate stage and the stage feature of the end stage are concatenated to form the feature representation of a candidate segment. Because the obtained feature representation captures stage-specific information, it is well suited to accurately locating the start position and end position of a video segment. The two modules are integrated into an effective and efficient network that improves the accuracy of video sentence positioning.

Description

Video statement positioning method based on multi-stage aggregation Transformer model
Technical Field
The invention belongs to the technical field of video sentence positioning and retrieval, and particularly relates to a video sentence positioning method based on a multi-stage aggregation Transformer model.
Background
Video localization is a fundamental problem in computer vision and has wide application. Over the past decade, a great deal of research has gone into video action localization and related application systems. In recent years, with the growth of multimedia data and the diversification of user needs, the problem of locating sentences in video (video sentence positioning) has become increasingly important. The goal of video sentence positioning is to locate, within a long video, the video segment that corresponds to a query sentence. Compared with video action localization, sentence positioning poses greater challenges and has broader application prospects, such as video retrieval, automatic video captioning and intelligent human-machine interaction.
Video sentence localization is a challenging task. In addition to understanding the content of the video, the semantics of the video and the sentence must be matched.
Existing video sentence positioning methods can generally be divided into two categories: one-stage and two-stage methods. A one-stage method takes the video and the query sentence as input and directly predicts the start and end points of the queried video segment, directly generating the segment associated with the query. One-stage methods can be trained end to end, but they easily miss some correct video segments. Two-stage methods, in contrast, follow a candidate-segment generation and candidate-segment ranking pipeline: they first generate a series of candidate segments from the video and then rank the candidates according to their degree of match with the query sentence. Many approaches follow this route. Although two-stage methods can recall many possibly correct candidate video segments, several key problems remain unsolved:
1) How to efficiently perform fine-grained semantic matching between videos and sentences?
2) How to accurately locate the start and end points of a video segment matching a sentence in the original long video?
For the first problem, most existing methods process the video and the sentence sequence separately and then match them. However, processing them separately, for example encoding the sentence into a single vector and then matching, inevitably loses some detailed semantic content of the sentence, so fine-grained matching cannot be achieved.
For the second problem, existing methods typically use full convolution, average pooling or RoI pooling operations to obtain the feature representation of a candidate segment. However, the features obtained by these operations are not sufficiently discriminative in time. For example, a video segment usually contains several different stages, such as a start stage, an intermediate stage and an end stage. The information of these stages is very important for precisely locating the start and end points of the segment. Average pooling completely discards the stage information and cannot match the different stages exactly, so accurate localization cannot be achieved. Although full convolution or RoI pooling can delineate different stages to some extent, they do not rely on explicit stage-specific features and are therefore also limited for more accurate localization.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a video sentence positioning method based on a multi-stage aggregation Transformer model so as to improve the accuracy of video sentence positioning.
In order to achieve the above object, the present invention provides a video sentence positioning method based on a multi-stage aggregation Transformer model, comprising the following steps:
(1) video slice feature and word feature extraction
Uniformly dividing a video into N time points according to time; collecting a video slice (composed of consecutive multi-frame images, e.g. 50 frames) at each time point; performing feature extraction on each video slice to obtain a slice feature (N slice features in total); and placing the N slice features in temporal order to form a video feature sequence;
converting each word of the sentence into a word vector (Doc2Vec) to obtain word features, and then placing the word features in their order in the sentence to form a sentence feature sequence;
mapping the slice features in the video feature sequence and the word features in the sentence feature sequence to the same dimension, yielding a video feature sequence whose i-th element is the slice feature of the i-th video slice and a sentence feature sequence whose j-th element is the word feature of the j-th word of the sentence;
(2) Constructing a video sentence Transformer model, and calculating the video feature sequence and the sentence feature sequence
A D-layer video sentence Transformer model is constructed whose d-th layer, d = 1, 2, ..., D, updates the video features and the sentence features according to formula (1) (reproduced as an image in the original publication), where V and L denote the video and the sentence respectively, Q, K and W are learnable parameters whose different subscripts denote different parameter groups, and Att(·) is an attention calculation function;
the video feature sequence and the sentence feature sequence obtained in step (1) are fed as input to the video sentence Transformer model, yielding the D-th-layer output video feature sequence and sentence feature sequence;
(3) Constructing a multi-stage aggregation module, and calculating the stage feature sequences and prediction score sequences of three stages
The stage feature sequences r^sta, r^mid and r^end of the start stage, the intermediate stage and the end stage are calculated, where the start-stage sequence r^sta consists of the stage features r_i^sta of the N slices, i = 1, 2, ..., N, the intermediate-stage sequence r^mid consists of the stage features r_i^mid of the N slices, the end-stage sequence r^end consists of the stage features r_i^end of the N slices, and MLP_1^sta, MLP_1^mid and MLP_1^end are the multi-layer perceptrons (MLPs) used to calculate the stage feature sequences of the three stages respectively;
computing a sequence of prediction scores p for a beginning stage, an intermediate stage and an end stagesta、pmid、pend
Figure BDA0002845561540000041
Wherein the fractional sequence p is predicted at the beginningstaPrediction score from N slices
Figure BDA0002845561540000042
Composition, intermediate stage prediction score sequence pmidPrediction score from N slices
Figure BDA0002845561540000043
Composition, end stage prediction fraction orderColumn pendPrediction score from N slices
Figure BDA0002845561540000044
The components of the composition are as follows,
Figure BDA0002845561540000045
Figure BDA0002845561540000046
the multilayer perceptron is used for calculating the prediction score sequences of the three stages;
(4) Training the multi-stage aggregation Transformer model
The video sentence Transformer model and the multi-stage aggregation module form the multi-stage aggregation Transformer model;
a video sentence training data set is constructed in which each data item comprises a video, a sentence, and the video slice start position and end position of the video segment located by the sentence;
a data item is extracted from the video sentence training data set, one word of the sentence is randomly masked and replaced with the token "[MASK]", the video and the sentence are processed according to steps (1) to (3), and the true scores of the start stage, the intermediate stage and the end stage of each video slice are calculated; these true scores follow unnormalized two-dimensional Gaussian distributions whose standard deviations σ^sta, σ^mid and σ^end are controlled by the positive scalars α^sta, α^mid and α^end;
4.1) the weighted cross-entropy loss L_stage on the prediction layer is calculated between the predicted score sequences and the true score sequences of the three stages;
4.2) for the z-th candidate segment, the predicted values of its video slice start position and end position and its matching score predicted value are calculated from the concatenation of the stage features taken from r^sta, r^mid and r^end at the candidate's start, middle and end positions;
4.3) the boundary regression loss L_regress is calculated from the predicted values of step 4.2) and the video slice start position and end position of the z-th candidate segment, where Z is the total number of candidate segments;
4.4) the matching score weighted cross-entropy loss L_match is calculated, where y_z is the degree of overlap between the z-th candidate segment and the video segment located by the sentence, i.e. the video segment from the ground-truth start position to the ground-truth end position;
4.5) the cross-entropy loss L_word of masked word prediction is calculated as L_word = -log p_mask, where p_mask is the probability of the masked word predicted from the sentence feature sequence;
4.6) the loss L_total of the whole network for training the multi-stage aggregation Transformer model is calculated as L_total = L_stage + L_regress + L_match + L_word;
4.7) Updating the parameters of the whole network
Data items are taken from the video sentence training data set one by one and the parameters of the whole network are updated according to the loss L_total until the video sentence training data set is exhausted, yielding the trained multi-stage aggregation Transformer model;
(5) Video sentence positioning
A video and a complete query sentence without any masked word are input and processed according to steps (1) to (3); according to step 4.2), the matching score predicted value of each candidate segment and the predicted values of its video slice start and end positions are calculated, forming a new candidate segment; the new candidate segments are sorted by matching score from high to low, new candidate segments whose overlap exceeds 70% are removed using non-maximum suppression (NMS), and the top 1 or top 5 new candidate segments are returned as the finally located video segments.
The object of the invention is thus achieved.
To address the problems of existing methods, the invention constructs a multi-stage aggregation Transformer model for the video sentence positioning network. The multi-stage aggregation Transformer model consists of two parts: a video sentence Transformer model and a multi-stage aggregation module placed on top of it. In the video sentence Transformer model, the single BERT framework is retained, but the BERT parameters are decoupled into different groups to process the video and sentence information separately. The video sentence Transformer model thus models the two modalities, video and sentence, more effectively while maintaining the compactness and efficiency of a single BERT structure. In the video sentence Transformer model, each video slice or word can adaptively aggregate and align information from all other video slices and words of both modalities according to its own semantics. Through multi-layer stacking, the final joint video sentence representation captures rich visual-linguistic cues and enables finer-grained matching. In the multi-stage aggregation module, three stage features corresponding to different stages, namely the stage features of the start stage, the intermediate stage and the end stage, are computed for each video slice. Then, for a given candidate segment, the stage feature of the start stage, the stage feature of the intermediate stage and the stage feature of the end stage are concatenated to form the feature representation of the candidate segment. Because the obtained feature representation captures stage-specific information, it is well suited to accurately locating the start position and end position of the video segment. The two modules are integrated into an effective and efficient network, improving the accuracy of video sentence positioning.
Drawings
FIG. 1 is a flowchart of an embodiment of the video sentence positioning method based on a multi-stage aggregation Transformer model according to the present invention;
FIG. 2 is a schematic view of a video slice;
FIG. 3 is a schematic diagram illustrating an embodiment of the video sentence positioning method based on a multi-stage aggregation Transformer model according to the present invention;
FIG. 4 is a flow diagram of one embodiment of video sentence positioning.
Detailed Description
The following description of embodiments of the invention is provided with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It should be expressly noted that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the invention.
FIG. 1 is a flowchart of an embodiment of the video sentence positioning method based on a multi-stage aggregation Transformer model according to the present invention.
In this embodiment, as shown in FIG. 1, the video sentence positioning method based on the multi-stage aggregation Transformer model includes the following steps:
step S1: video slice feature and word feature extraction
In this embodiment, as shown in FIG. 2, the video is uniformly divided into N time points according to time; at each time point a video slice (composed of consecutive multi-frame images, e.g. 50 frames) is collected; feature extraction is performed on each video slice to obtain a slice feature (N slice features in total); and the N slice features are placed in temporal order to form the video feature sequence.
Each word of the sentence is converted into a word vector (Doc2Vec) to obtain word features, and the word features are then placed in their order in the sentence to form the sentence feature sequence.
The slice features in the video feature sequence and the word features in the sentence feature sequence are mapped to the same dimension, yielding a video feature sequence whose i-th element is the slice feature of the i-th video slice and a sentence feature sequence whose j-th element is the word feature of the j-th word of the sentence.
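A minimal sketch of this step, assuming pre-extracted C3D slice features and Doc2Vec word vectors; the class name, input dimensions and single linear projections are illustrative assumptions, not the patented implementation:

```python
import torch
import torch.nn as nn

class FeatureProjection(nn.Module):
    """Maps pre-extracted video-slice features and word features to a common dimension."""
    def __init__(self, video_dim=4096, word_dim=300, model_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, model_dim)   # e.g. C3D slice features
        self.word_proj = nn.Linear(word_dim, model_dim)     # e.g. Doc2Vec word vectors

    def forward(self, slice_feats, word_feats):
        # slice_feats: (N, video_dim) placed in temporal order
        # word_feats:  (M, word_dim) placed in sentence order
        return self.video_proj(slice_feats), self.word_proj(word_feats)
```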
Step S2: constructing the video sentence Transformer model, and calculating the video feature sequence and the sentence feature sequence
Most existing methods process the video and sentence sequences separately and then match them. However, processing the two sequences separately, for example encoding the sentence into a single vector and then matching, inevitably loses some detailed semantic content of the sentence, so fine-grained matching cannot be achieved. To solve this problem, as shown in FIG. 3, the invention constructs a new video sentence Transformer model as the backbone network. Its d-th layer, d = 1, 2, ..., D, updates the video features and the sentence features according to formula (1), where V and L denote the video and the sentence respectively, Q, K and W are learnable parameters whose different subscripts denote different parameter groups, and Att(·) is an attention calculation function.
The video feature sequence and the sentence feature sequence obtained in step S1 are fed as input to the video sentence Transformer model and computed layer by layer according to formula (1), yielding the D-th-layer output video feature sequence and sentence feature sequence.
Compared with the single BERT model widely used in the prior art, the multi-stage aggregation Transformer model does not change the structure or introduce additional computation, but uses different parameters to process the content of different modalities, thereby maintaining the compactness and efficiency of the model while improving its multi-modal modeling capability. Meanwhile, the multi-stage aggregation Transformer model also differs from other multi-modal BERT models, which use two BERT streams to process the content of different modalities. Models based on two BERT streams introduce an additional cross-modal layer to realize multi-modal interaction, whereas the multi-stage aggregation Transformer model keeps the same structure as the original BERT model and is more compact and efficient.
The multi-stage aggregation Transformer model consists of multiple layers of the computation in formula (1). After multi-layer stacking, the obtained joint video sentence representation has a rich ability to aggregate and align visual-linguistic cues. Each slice in the video can interact with each word of the query sentence, enabling more detailed and accurate video sentence matching, which is essential for accurate localization.
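As an illustration of the decoupled-parameter idea, the sketch below computes a single joint attention over the concatenated video and sentence tokens while using separate query/key/value and output projections for each modality. It is a single-head simplification under assumed dimensions, not a reproduction of formula (1):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDecoupledLayer(nn.Module):
    """One joint-attention layer over [video; sentence] tokens with modality-specific projections."""
    def __init__(self, dim=512):
        super().__init__()
        self.dim = dim
        # separate learnable parameter groups for video (V) and language (L)
        self.qkv_v = nn.Linear(dim, 3 * dim)
        self.qkv_l = nn.Linear(dim, 3 * dim)
        self.out_v = nn.Linear(dim, dim)
        self.out_l = nn.Linear(dim, dim)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, video, sent):
        # video: (N, dim) slice features, sent: (M, dim) word features
        n = video.size(0)
        qv, kv, vv = self.qkv_v(video).chunk(3, dim=-1)
        ql, kl, vl = self.qkv_l(sent).chunk(3, dim=-1)
        q = torch.cat([qv, ql], 0)                  # every token attends over
        k = torch.cat([kv, kl], 0)                  # both modalities jointly
        v = torch.cat([vv, vl], 0)
        att = F.softmax(q @ k.t() / self.dim ** 0.5, dim=-1)
        ctx = att @ v
        video = self.norm_v(video + self.out_v(ctx[:n]))
        sent = self.norm_l(sent + self.out_l(ctx[n:]))
        return video, sent
```

Stacking D such layers keeps the single-stream structure while letting each modality keep its own parameter group, which is the compactness argument made above.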
Step S3: constructing the multi-stage aggregation module, and calculating the stage feature sequences and prediction score sequences of the three stages
After the video sentence Transformer model, the obtained joint video sentence representation, i.e. the output video feature sequence and sentence feature sequence, carries richer information and supports finer matching. However, to overcome the problem that existing methods ignore the different stages contained in a video segment, the invention provides a multi-stage aggregation module on top of the video sentence Transformer model so as to accurately locate the start position and end position of the queried video segment. In the multi-stage aggregation module, the invention computes, for each video slice in the video sequence, stage features corresponding to three different temporal stages, namely the stage features of the start stage, the intermediate stage and the end stage. In order to improve the discriminability of the different stage features, the invention adds a prediction layer on top of the stage features to predict the scores of the start stage, the intermediate stage and the end stage respectively.
In this embodiment, as shown in FIG. 3, the multi-stage aggregation module computes the stage feature sequences r^sta, r^mid and r^end of the start stage, the intermediate stage and the end stage, where the start-stage sequence r^sta consists of the stage features r_i^sta of the N slices, i = 1, 2, ..., N, the intermediate-stage sequence r^mid consists of the stage features r_i^mid of the N slices, the end-stage sequence r^end consists of the stage features r_i^end of the N slices, and MLP_1^sta, MLP_1^mid and MLP_1^end are the multi-layer perceptrons (MLPs) that compute the stage feature sequences of the three stages respectively.
The prediction score sequences p^sta, p^mid and p^end of the start stage, the intermediate stage and the end stage are then computed, where the start-stage sequence p^sta consists of the prediction scores of the N slices, the intermediate-stage sequence p^mid consists of the prediction scores of the N slices, the end-stage sequence p^end consists of the prediction scores of the N slices, and the corresponding multi-layer perceptrons compute the prediction score sequences of the three stages respectively.
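A sketch of the multi-stage aggregation head, assuming each MLP is applied independently to every slice feature output by the last Transformer layer; the hidden sizes, the sigmoid on the scores and the class names are assumptions:

```python
import torch.nn as nn

def make_mlp(in_dim=512, hidden=512, out_dim=512):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

class MultiStageAggregation(nn.Module):
    """Per-slice stage features and stage scores for the start / middle / end stages."""
    def __init__(self, dim=512):
        super().__init__()
        # MLP_1^{sta/mid/end}: stage feature heads
        self.feat_heads = nn.ModuleDict({s: make_mlp(dim, dim, dim) for s in ("sta", "mid", "end")})
        # prediction-layer heads producing one score per slice and stage
        self.score_heads = nn.ModuleDict({s: make_mlp(dim, dim, 1) for s in ("sta", "mid", "end")})

    def forward(self, video_feats):
        # video_feats: (N, dim) slice features from the last Transformer layer
        feats = {s: self.feat_heads[s](video_feats) for s in ("sta", "mid", "end")}      # r^sta, r^mid, r^end
        scores = {s: self.score_heads[s](feats[s]).squeeze(-1).sigmoid()                 # p^sta, p^mid, p^end
                  for s in ("sta", "mid", "end")}
        return feats, scores
```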
Step S4: training the multi-stage aggregation Transformer model
The video sentence Transformer model and the multi-stage aggregation module form the multi-stage aggregation Transformer model.
A video sentence training data set is constructed in which each data item comprises a video, a sentence, and the video slice start position and end position of the video segment located by the sentence.
A data item is extracted from the video sentence training data set, one word of the sentence is randomly masked and replaced with the token "[MASK]", the video and the sentence are processed according to steps S1 to S3, and the true scores of the start stage, the intermediate stage and the end stage of each video slice are calculated. These true scores follow unnormalized two-dimensional Gaussian distributions whose standard deviations σ^sta, σ^mid and σ^end are controlled by the positive scalars α^sta, α^mid and α^end; the larger the value of α^sta / α^mid / α^end, the higher the scores of video slices near the start / middle / end position of the video segment.
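The exact label formula is not reproduced in this text; the sketch below shows one plausible one-dimensional form consistent with the description (an unnormalized Gaussian of the distance between a slice index and the segment's start, middle or end position, with the standard deviation controlled by the positive scalar α). The default α values are placeholders, not the patented settings:

```python
import numpy as np

def stage_soft_labels(n_slices, t_start, t_end, alpha_sta=0.25, alpha_mid=0.21, alpha_end=0.21):
    """Assumed form of the per-slice 'true scores': 1 at the stage position, Gaussian decay around it."""
    idx = np.arange(n_slices, dtype=np.float32)
    t_mid = 0.5 * (t_start + t_end)
    length = max(t_end - t_start, 1.0)
    labels = {}
    for name, centre, alpha in (("sta", t_start, alpha_sta),
                                ("mid", t_mid, alpha_mid),
                                ("end", t_end, alpha_end)):
        sigma = alpha * length                      # standard deviation controlled by the positive scalar alpha
        labels[name] = np.exp(-((idx - centre) ** 2) / (2.0 * sigma ** 2))
    return labels                                   # each entry: array of length n_slices with values in [0, 1]
```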
Step S4.1: the weighted cross-entropy loss L_stage on the prediction layer is computed between the predicted stage score sequences and the true score sequences of the three stages.
Step S4.2: for the z-th candidate segment, the predicted values of its video slice start position and end position and its matching score predicted value are computed from the concatenation of the stage features taken from r^sta, r^mid and r^end at the candidate's start, middle and end positions.
Because the three concatenated features are specific to the start, intermediate and end stages respectively, this concatenated representation is highly discriminative for accurate video segment localization.
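A sketch of step S4.2 under the assumption that the candidate representation is the concatenation of the start-stage feature at the candidate's start slice, the intermediate-stage feature at its middle slice and the end-stage feature at its end slice, from which small linear heads regress the boundaries and score the match; head shapes are illustrative:

```python
import torch
import torch.nn as nn

class CandidateHead(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.boundary = nn.Linear(3 * dim, 2)   # predicted start / end positions (or offsets)
        self.matching = nn.Linear(3 * dim, 1)   # matching score between candidate and sentence

    def forward(self, r_sta, r_mid, r_end, start_idx, end_idx):
        # r_sta, r_mid, r_end: (N, dim) stage feature sequences; start_idx, end_idx: slice indices
        mid_idx = (start_idx + end_idx) // 2
        # concatenate the stage-specific features of the three stages
        cand = torch.cat([r_sta[start_idx], r_mid[mid_idx], r_end[end_idx]], dim=-1)
        start_end = self.boundary(cand)                 # regressed boundary prediction
        score = torch.sigmoid(self.matching(cand))      # predicted matching / IoU score
        return start_end, score
```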
Step S4.3: the boundary regression loss L_regress is computed from the predicted values of step S4.2 and the video slice start position and end position of the z-th candidate segment, where Z is the total number of candidate segments.
Step S4.4: the matching score weighted cross-entropy loss L_match is computed, where y_z is the degree of overlap (IoU) between the z-th candidate segment and the video segment located by the sentence, i.e. the video segment from the ground-truth start position to the ground-truth end position.
Unlike previous methods that predict IoU without taking regression into account, the IoU predicted by the invention is the IoU between the regressed candidate segment and the ground truth, which allows the invention to measure the quality of the boundary regression.
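The temporal IoU between a (regressed) candidate and the ground-truth segment, used both as the matching target y_z and in evaluation, can be computed with a standard helper such as the following (a generic definition, not patent-specific):

```python
def temporal_iou(cand_start, cand_end, gt_start, gt_end):
    """Intersection-over-union of two temporal segments given by slice indices or times."""
    inter = max(0.0, min(cand_end, gt_end) - max(cand_start, gt_start))
    union = max(cand_end, gt_end) - min(cand_start, gt_start)
    return inter / union if union > 0 else 0.0
```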
For generating candidate segments, any candidate segment generation method may be applied within the framework of the invention. For simplicity, the invention first enumerates all possible video segments consisting of consecutive video slices. For shorter videos, the segments can be chosen densely as candidates; for longer videos, the sampling interval can be gradually increased so that candidates are selected sparsely. The main idea is to remove redundant candidate segments with large overlap.
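A sketch of this candidate-generation idea: dense enumeration for short spans, with the sampling interval widened gradually for longer ones; the stride-doubling rule and the length threshold are assumptions:

```python
def generate_candidates(n_slices, max_dense_len=16):
    """Enumerate (start, end) slice-index pairs; longer candidates are sampled more sparsely."""
    candidates = []
    for start in range(n_slices):
        stride = 1
        end = start + 1
        while end <= n_slices:
            candidates.append((start, end))
            if end - start >= max_dense_len:     # beyond this length, gradually widen the sampling interval
                stride *= 2
            end += stride
    return candidates
```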
Step S4.5: the cross-entropy loss L_word of masked word prediction is computed as L_word = -log p_mask, where p_mask is the probability of the masked word predicted from the sentence feature sequence.
During training, the invention takes video sentence pairs as the input of the network. Similar to the original Transformer model, one word in the sentence sequence is randomly masked and replaced with the special token "[MASK]". The model then predicts the masked word from the unmasked words and from information in the video sequence. Notably, predicting some important words, such as nouns corresponding to objects and verbs corresponding to actions, requires information from the video sequence. Masked word prediction therefore not only lets the Transformer model learn the language, but also better aligns the video and sentence modalities. The loss function for masked word prediction is the standard cross-entropy loss. The video sentence Transformer model is not pre-trained on any other data set; all parameters are randomly initialized.
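A minimal sketch of the masking step; the token string and word-level tokenization are assumptions:

```python
import random

def mask_one_word(words, mask_token="[MASK]"):
    """Randomly replace one word of the query sentence with the mask token."""
    idx = random.randrange(len(words))
    masked = list(words)
    masked[idx] = mask_token
    return masked, idx          # the model must predict words[idx] from the rest and from the video
```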
Step S4.6: the loss L_total of the whole network for training the multi-stage aggregation Transformer model is computed as L_total = L_stage + L_regress + L_match + L_word.
Step S4.7: updating the parameters of the whole network
Data items are taken from the video sentence training data set one by one and the parameters of the whole network are updated according to the loss L_total until the video sentence training data set is exhausted, yielding the trained multi-stage aggregation Transformer model.
step S5: video sentence localization
A video and a complete query sentence without any masked word are input, as shown in FIG. 4, and processed according to steps S1 to S3; according to step S4.2, the matching score predicted value of each candidate segment and the predicted values of its video slice start and end positions are calculated, forming a new candidate segment; the new candidate segments are sorted by matching score from high to low, new candidate segments whose overlap exceeds 70% are removed using non-maximum suppression (NMS), and the top 1 or top 5 new candidate segments are returned as the finally located video segments.
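Step S5 can be sketched as follows, assuming the temporal_iou helper from the earlier sketch is in scope; the 70% overlap threshold and the top-1/top-5 return follow the description, everything else is illustrative:

```python
def locate(candidates, scores, iou_threshold=0.7, top_k=5):
    """candidates: list of (start, end) after boundary regression; scores: their matching scores."""
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:                                  # non-maximum suppression over temporal segments
        if all(temporal_iou(*candidates[i], *candidates[j]) <= iou_threshold for j in kept):
            kept.append(i)
        if len(kept) == top_k:
            break
    return [candidates[i] for i in kept]             # top-1 or top-5 localized segments
```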
Performance evaluation
The invention is evaluated experimentally on two large public datasets, ActivityNet_Captions [14] and TACoS [24]. ActivityNet_Captions contains 20K videos and 100K query sentences, and the average video length is 2 minutes. TACoS contains 127 videos related to cooking activities with an average duration of 4.79 minutes, on average 148 query sentences per video, and 18,818 segment sentence pairs in total. TACoS is a very challenging dataset: its query sentences describe activities at multiple levels with different amounts of detail.
The invention is evaluated using Rank n@IoU=m, the percentage of correct localizations among all localizations, where a localization is counted as correct if at least one of the top n output segments matches the ground truth; a segment matches the ground truth if the IoU between the segment and the ground truth is greater than m.
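Using the same temporal_iou helper, the metric can be written as follows (a standard formulation, not taken verbatim from the patent text):

```python
def rank_n_at_iou(predictions, ground_truths, n=1, m=0.5):
    """predictions: per query, a ranked list of (start, end); ground_truths: per query, one (start, end)."""
    correct = 0
    for preds, gt in zip(predictions, ground_truths):
        # correct if any of the top-n predicted segments overlaps the ground truth by more than m
        if any(temporal_iou(s, e, *gt) > m for (s, e) in preds[:n]):
            correct += 1
    return correct / len(ground_truths)
```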
Adam is used to optimize the network. The batch size is set to 16 and the learning rate to 0.0001. The number of Transformer layers is set to 6, and the feature dimension of all layers to 512. The scalar standard-deviation parameter is set to 0.25, and α_s and α_m to 0.21. In ActivityNet_Captions and TACoS, the number of Transformer attention heads is set to 16 and 32 respectively. Video slice features are extracted with a C3D network. The sampled video slice length is set to 32 for the ActivityNet_Captions dataset and to 128 for the TACoS dataset.
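For reference, the reported settings can be collected into an illustrative configuration dictionary; values not stated in the description are omitted:

```python
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "batch_size": 16,
    "learning_rate": 1e-4,
    "transformer_layers": 6,
    "feature_dim": 512,
    "attention_heads": {"ActivityNet_Captions": 16, "TACoS": 32},
    "video_slice_length": {"ActivityNet_Captions": 32, "TACoS": 128},
    "slice_feature_extractor": "C3D",
}
```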
The multi-stage aggregation Transformer network proposed by the present invention is compared with various current advanced methods, and the comparison results are shown in tables 1-2.
TABLE 1 (reproduced as an image in the original publication): comparison with other methods on the ActivityNet_Captions dataset.
TABLE 2 (reproduced as an image in the original publication): comparison with other methods on the TACoS dataset.
From the experimental results, it can be seen that the invention achieves a significant improvement over previous methods. Although the invention is 1.09 points lower than CSMGAN [18] on Rank1@IoU=0.5 on the ActivityNet_Captions dataset, it outperforms CSMGAN on all other criteria. In particular, it is 2.63 and 3.55 points higher than CSMGAN on Rank1@IoU=0.7 and Rank5@IoU=0.7 respectively. Note that IoU=0.7 is a stricter criterion for judging whether a segment is correct, indicating that the invention achieves higher-quality localization. Furthermore, on the TACoS dataset the invention is more than 10 percentage points higher than CSMGAN on all evaluation metrics, showing its superiority over the CSMGAN method. The invention also achieves overwhelming advantages over the other methods, and these results fully demonstrate its effectiveness. In the video sentence Transformer model of the invention, each video slice can interact with each word of the query sentence, yielding more detailed and accurate video sentence alignment. Thanks to the multi-stage aggregation module, the computed video segment representations can match the activities of the different stages. The two modules are tightly combined into a very effective and efficient segment localization network.
Although illustrative embodiments of the invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are permissible as long as they fall within the spirit and scope of the invention as defined by the appended claims, and all inventive matter that uses the inventive concept is protected.

Claims (1)

1. A video statement positioning method based on a multi-stage aggregation Transformer model is characterized by comprising the following steps:
(1) video slice feature and word feature extraction
Uniformly dividing a video into N time points according to time, collecting a video slice at each time point, wherein the video slice consists of continuous multi-frame images, extracting the characteristics of each video slice to obtain N slice characteristics in total, and placing the N slice characteristics according to the time sequence to form a video characteristic sequence;
converting each word of the sentence into a word vector to obtain word features, and then placing the word features in their order in the sentence to form a sentence feature sequence;
mapping the slice features in the video feature sequence and the word features in the sentence feature sequence to the same dimension, yielding a video feature sequence whose i-th element is the slice feature of the i-th video slice and a sentence feature sequence whose j-th element is the word feature of the j-th word of the sentence;
(2) constructing a video sentence Transformer model, and calculating the video feature sequence and the sentence feature sequence
constructing a D-layer video sentence Transformer model whose d-th layer, d = 1, 2, ..., D, updates the video features and the sentence features, wherein V and L denote the video and the sentence respectively, Q, K and W are learnable parameters whose different subscripts denote different parameters, and Att(·) is an attention calculation function;
taking the video feature sequence and the sentence feature sequence as the input of the video sentence Transformer model and calculating to obtain the D-th-layer output video feature sequence and sentence feature sequence;
(3) constructing a multi-stage aggregation module, and calculating the stage feature sequences and prediction score sequences of three stages
calculating the stage feature sequences r^sta, r^mid and r^end of the start stage, the intermediate stage and the end stage, wherein the start-stage feature sequence r^sta consists of the stage features r_i^sta of the N slices, i = 1, 2, ..., N, the intermediate-stage feature sequence r^mid consists of the stage features r_i^mid of the N slices, i = 1, 2, ..., N, the end-stage feature sequence r^end consists of the stage features r_i^end of the N slices, i = 1, 2, ..., N, and MLP_1^sta, MLP_1^mid and MLP_1^end are the multi-layer perceptrons (MLPs) used for calculating the stage feature sequences of the three stages respectively;
calculating the prediction score sequences p^sta, p^mid and p^end of the start stage, the intermediate stage and the end stage, wherein the start-stage prediction score sequence p^sta consists of the prediction scores of the N slices, the intermediate-stage prediction score sequence p^mid consists of the prediction scores of the N slices, the end-stage prediction score sequence p^end consists of the prediction scores of the N slices, and the corresponding multi-layer perceptrons are used for calculating the prediction score sequences of the three stages;
(4) training the multi-stage aggregation Transformer model
the video sentence Transformer model and the multi-stage aggregation module form the multi-stage aggregation Transformer model;
constructing a video sentence training data set, wherein each data item comprises a video, a sentence, and the video slice start position and end position of the video segment located by the sentence;
extracting a data item from the video sentence training data set, randomly masking one word of the sentence and replacing it with the token "[MASK]", processing the video and the sentence according to steps (1) to (3), and calculating the true scores of the start stage, the intermediate stage and the end stage of each video slice, wherein the true scores follow unnormalized two-dimensional Gaussian distributions whose standard deviations σ^sta, σ^mid and σ^end are controlled by the positive scalars α^sta, α^mid and α^end;
4.1) calculating the weighted cross-entropy loss L_stage on the prediction layer;
4.2) calculating, for the z-th candidate segment, the predicted values of its video slice start position and end position and its matching score predicted value from the stage features taken from the stage feature sequences r^sta, r^mid and r^end obtained in step (3) at the candidate's video slice start position, middle position and end position respectively;
4.3) calculating the boundary regression loss L_regress, wherein Z is the total number of candidate segments;
4.4) calculating the matching score weighted cross-entropy loss L_match, wherein y_z is the degree of overlap between the z-th candidate segment and the video segment located by the sentence, i.e. the video segment from the start position to the end position;
4.5) calculating the cross-entropy loss L_word of masked word prediction as L_word = -log p_mask, wherein p_mask is the probability of the masked word predicted from the sentence feature sequence;
4.6) calculating the loss L_total of the whole network for training the multi-stage aggregation Transformer model as L_total = L_stage + L_regress + L_match + L_word;
4.7) updating the parameters of the whole network: sequentially taking data items from the video sentence training data set and updating the parameters of the whole network according to the loss L_total until the video sentence training data set is exhausted, thereby obtaining the trained multi-stage aggregation Transformer model;
(5) video sentence positioning
inputting a video and a complete query sentence without any masked word, processing them according to steps (1) to (3), calculating, according to step 4.2), the matching score predicted value of each candidate segment and the predicted values of its video slice start position and end position to form a new candidate segment, sorting the new candidate segments by matching score from high to low, removing new candidate segments whose overlap exceeds 70% using non-maximum suppression, and returning the top 1 or top 5 new candidate segments as the finally located video segments.
CN202011508292.1A 2020-12-18 2020-12-18 Video statement positioning method based on multi-stage aggregation Transformer model Active CN112488063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011508292.1A CN112488063B (en) 2020-12-18 2020-12-18 Video statement positioning method based on multi-stage aggregation Transformer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011508292.1A CN112488063B (en) 2020-12-18 2020-12-18 Video statement positioning method based on multi-stage aggregation Transformer model

Publications (2)

Publication Number Publication Date
CN112488063A CN112488063A (en) 2021-03-12
CN112488063B true CN112488063B (en) 2022-06-14

Family

ID=74914591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011508292.1A Active CN112488063B (en) 2020-12-18 2020-12-18 Video statement positioning method based on multi-stage aggregation Transformer model

Country Status (1)

Country Link
CN (1) CN112488063B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115708359B (en) * 2021-08-20 2024-09-03 小米科技(武汉)有限公司 Video clip interception method, device and storage medium
CN116740067B (en) * 2023-08-14 2023-10-20 苏州凌影云诺医疗科技有限公司 Infiltration depth judging method and system for esophageal lesions

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1144588A (en) * 1994-03-14 1997-03-05 美国赛特公司 A system for implanting image into video stream
CN110377792A (en) * 2019-06-14 2019-10-25 浙江大学 A method of task is extracted using the video clip that the cross-module type Internet solves Problem based learning
CN110225368A (en) * 2019-06-27 2019-09-10 腾讯科技(深圳)有限公司 A kind of video locating method, device and electronic equipment
CN110781347A (en) * 2019-10-23 2020-02-11 腾讯科技(深圳)有限公司 Video processing method, device, equipment and readable storage medium
CN111814489A (en) * 2020-07-23 2020-10-23 苏州思必驰信息科技有限公司 Spoken language semantic understanding method and system
CN111931736A (en) * 2020-09-27 2020-11-13 浙江大学 Lip language identification method and system using non-autoregressive model and integrated discharge technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Video Action Transformer Network; Rohit Girdhar et al.; 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020-01-09; pp. 244-253 *
Research on lip localization, tracking and feature extraction techniques in lip-reading applications; 杨阳 (Yang Yang); China Master's Theses Full-text Database (Information Science and Technology); 2009-09-15; I138-563 *

Also Published As

Publication number Publication date
CN112488063A (en) 2021-03-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant