CN113392717A - Video dense description generation method based on time sequence characteristic pyramid - Google Patents

Video dense description generation method based on time sequence characteristic pyramid

Info

Publication number
CN113392717A
CN113392717A
Authority
CN
China
Prior art keywords
video
feature
attention
time
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110558847.1A
Other languages
Chinese (zh)
Other versions
CN113392717B (en)
Inventor
俞俊
余宙
韩男佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110558847.1A priority Critical patent/CN113392717B/en
Publication of CN113392717A publication Critical patent/CN113392717A/en
Application granted granted Critical
Publication of CN113392717B publication Critical patent/CN113392717B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video dense description method based on a temporal feature pyramid. Under a Transformer network framework, the video is encoded while features of different resolutions are obtained through a local attention mechanism, and the features of different resolutions are then examined by multiple detection heads, so that events of different durations are comprehensively covered. After detecting the time segments that may contain events, the invention further fuses the video features of different resolutions, so as to generate more targeted descriptions for the events. Compared with other methods, the proposed method achieves higher precision and recall, and the description-generation decoder also produces higher-quality description sentences from the fused features, which demonstrates the generality of the method and its value in other multi-modal tasks.

Description

Video dense description generation method based on time sequence characteristic pyramid
Technical Field
The invention belongs to the field of video processing, and in particular relates to a dense video captioning (DVC) method based on a temporal feature pyramid.
Background
Dense video captioning is an emerging task in the multimedia field that aims to localize events in an uncut raw video and to generate a descriptive sentence for each of them. Specifically, given an input video file, the model locates the time intervals (start time and end time) of the events contained in the video. For example, a time segment that may contain an event may lie between the 2nd and the 12th second of the video, and another between the 21st and the 33rd second. For each time segment that may contain an event, e.g. between the 2nd and the 12th second, the dense video captioning model also needs to describe the content of the event occurring within that segment. To obtain more accurate predictions, the machine must understand the intrinsic meaning of the given video and text and, on that basis, perform a suitable cross-modal fusion of the two kinds of information so as to eliminate the semantic gap as far as possible. Compared with images, a video can be understood as a sequence of images with temporal consistency; making good use of the temporal information in the video, i.e. modeling along the time dimension, is therefore a key issue in video research.
In recent years, deep learning has received great attention from research institutions and industry, and its development has yielded many excellent network models and effective training methods. With the progress of academic research, cross-modal tasks are becoming a mainstream research direction, since they are closer to real-life scenarios and carry rich research significance and practical value. Video, a medium that has risen rapidly in recent years, combined with natural language forms the video-text cross-modal research direction, of which dense video captioning is one of the more important branches: it requires accurate description while events are being localized. Enabling a computer to automatically locate the start and end positions of the events contained in an input video and to describe those events in appropriate language is a research problem worth exploring in depth.
For many years, the cross-media research community has recognized the importance of capturing the associations between modalities, and attention mechanisms have been used to mine the rich correlations between them. Some studies have also begun to consider the interaction of information within a modality: before fusion, the correlations among intra-modal features are obtained through a self-attention mechanism or different linear layers. Since the understanding of cross-media information must be built on fully exploiting the information inside each single modality, both image-text and video data contain effective information worth mining; modeling intra-modal information undoubtedly helps deepen the understanding of the single modality and further enhances the expressive power of the final fused features.
In terms of practical application, the dense video captioning algorithm has a wide range of application scenarios. In entertainment scenarios, such as YouTube, iQiyi, Tencent Video and other video platforms, the segments of the latest videos that interest a user can be quickly found according to the user's historical data. The technique also has good research prospects and important significance in security systems.
In summary, dense video captioning is a topic worthy of in-depth study. This patent starts from several key points of the task, discusses and solves the difficulties of existing methods, and forms a complete dense video captioning system.
Natural-language descriptions generally come from different annotators, have a high degree of freedom, and do not follow a unified, fixed sentence structure. Meanwhile, the video carriers in natural scenes cover various themes, their content is complex and highly variable, and adjacent frames may be highly similar and redundant, so dense video captioning faces huge challenges. Specifically, there are two main difficulties:
(1) Event detection is an indispensable step in the dense video captioning task. After obtaining the video features, existing methods use a single detector to detect and localize the events occurring in the video, and usually adopt fine-grained video features for more accurate localization. However, a single detector can hardly handle events whose durations differ greatly, so only events whose durations fall within a specific range are detected well. In addition, since a long event requires coarse-grained features containing more global information during localization, a single fine-grained feature may lead to inaccurate localization. Therefore, how to make the model account for the different requirements that events of different durations place on feature resolution, so as to generate more accurate candidate time segments, is a difficult problem in dense video captioning and an important factor affecting performance.
(2) After detecting a time segment containing an event, the dense video captioning task also requires a description sentence to be generated for the event contained in that segment, and conventional methods generate the description from video features of a single resolution. This ignores the different effects that features of different resolutions have on the event description. In addition, a recurrent neural network is often adopted for description generation; limited by its recursive nature, the description-generation module is difficult to parallelize during training, which reduces training efficiency to a certain extent.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a dense video captioning (DVC) method based on a temporal feature pyramid. Its core is a multi-level temporal feature pyramid model proposed to solve the problem of detecting events of different durations, and its superiority is verified on the cross-modal deep-learning task of dense video captioning. Under a Transformer network framework, the video is encoded while features of different resolutions are obtained through a local attention mechanism; multiple detection heads then perform detection on the features of different resolutions, thereby achieving comprehensive coverage of events of different durations. After detecting the time segments that may contain events, the invention further fuses the video features of different resolutions so as to generate more targeted descriptions for the events. In experiments, an unedited video is input into the temporal-feature-pyramid-based dense video captioning model; after the candidate time segment module predicts the time segments, higher precision and recall are achieved than with other methods, and the description-generation decoder also produces higher-quality description sentences from the fused features, which demonstrates the generality of the method and its value in other multi-modal tasks.
The invention mainly comprises two points:
1. With the help of a local attention mechanism, multiple detection heads based on different resolutions are used simultaneously for event detection, so that events of different durations in the dense video captioning task are effectively covered, the intrinsic information of the video is fully explored, and a candidate time-segment set with higher precision and recall is obtained.
2. A description-generation decoder based on feature fusion is provided. Features of different resolutions are fused so that the low-level fine-grained features obtain the global semantic information of the high-level coarse-grained features. Given features that carry both detail and global information, the decoder can fully understand the contextual information and temporal correlations of the video and generate a more targeted description text.
The technical solution adopted by the invention to solve the above technical problems comprises the following steps:
Step (1), data preprocessing: extract features from the video and text data.
First, preprocess the video V and extract its features:
For an untrimmed video V, cut it into t blocks in units of a frames. The a frames of each block are fed into an I3D model pre-trained on the Kinetics dataset for feature extraction, and features are extracted in the same way from the corresponding optical-flow maps. The two kinds of features are then aligned in the time dimension and merged, and a feature vector X representing the whole video is obtained after passing through a trainable embedding matrix.
Second, extract features from the text information:
For a given sentence, punctuation marks are removed, each word is fed into a GloVe model to obtain its word embedding, and an embedding matrix is then used to adaptively learn the weights of the different dimensions, yielding a feature vector Y that represents the whole sentence.
Step (2), feature encoding with a video feature encoder based on a local attention mechanism:
The video feature encoder consists of L attention modules, and each attention module contains a self-attention submodule MHA and a feed-forward network submodule FFN. The video feature X is input into the video feature encoder to obtain a set of features of different resolutions {X^1, X^2, ..., X^L}. The specific process is as follows.
First, the video feature X is regarded as X^0 and input in turn into the self-attention submodule MHA and the feed-forward network submodule FFN of the 1st attention module. In the self-attention submodule, a local attention mechanism is adopted to limit the receptive field of each position, so that each position of the output feature is reconstructed only from the neighbouring positions of the input feature, forming a local receptive field similar to that of a convolutional neural network. The feed-forward network submodule remaps the output feature, yielding the output X^1 of the 1st attention module. X^1 is then taken as the input of the 2nd attention module, and this process is repeated until the output X^L of the L-th attention module is obtained.
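As an illustration of the local attention mechanism described above, the following minimal sketch builds a banded 0/1 mask that restricts every temporal position to its nearest neighbours; the window size, the function name and the use of NumPy are assumptions made for illustration and are not prescribed by the patent.

```python
import numpy as np

def local_attention_mask(t: int, window: int) -> np.ndarray:
    """Banded mask: position i may only attend to positions j with |i - j| <= window."""
    idx = np.arange(t)
    # A band of ones around the diagonal realizes the local receptive field.
    return (np.abs(idx[:, None] - idx[None, :]) <= window).astype(np.float32)

# Example: 10 temporal positions, each attending to itself and 2 neighbours on either side.
print(local_attention_mask(10, 2))
```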
Step (3), constructing a candidate segment generation module based on the characteristic pyramid structure;
as shown in FIG. 1, the output characteristics of different attention modules are first input into different detection heads, and for the L (1 ≦ L ≦ L) detection head, the output is
Figure BDA0003078323130000055
For QlIn (1)Each element
Figure BDA0003078323130000056
And obtaining corresponding starting time and ending time and corresponding confidence scores according to the sampling interval of the video characteristics. In addition, since the feature resolution of the output of the attention module at the lower level is lower and the feature resolution of the output of the attention module at the higher level is higher in the video feature encoder, the I-th detection head based on the output feature of the I-th attention module of the encoder is responsible for predicting the duration to be xil-1~ξlIn the event of a change.
In the training phase of the model, the output of the candidate segment generation module is divided into two parts. The first part is the predicted event center position and event duration, which determine the start and end times of the predicted time segment. For each annotated event, the element q_i^l of the output feature whose center position and anchor size match it most closely is selected to calculate the regression loss L_reg; here the regression loss function measures the deviation between the predicted value and the actual value. The second part is the predicted confidence, representing the likelihood that the current time segment contains an event. The elements used to calculate the regression loss are regarded as positive samples and the rest as negative samples, and the classification loss L_cls is calculated over all samples. Finally, the two losses are added to obtain the total loss L_prop^l of the l-th detection head in the event detection stage, and the loss L_prop of the event detection stage is obtained by adding the loss functions of all detection heads.
In the testing stage, after the different detection heads have generated their candidate time-segment sets, all time segments are merged and sorted from high to low by their confidence scores. A non-maximum suppression algorithm is then used to screen the time segments, yielding a set whose confidence scores are above a set confidence threshold and whose mutual overlap is below a set overlap threshold. Each remaining time segment is considered to contain a specific event, so the visual features located within it are input into a decoder to generate the corresponding description sentence.
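The test-time screening just described amounts to a temporal non-maximum suppression over (start, end, confidence) triples. The following is a minimal sketch; the threshold values and function names are illustrative assumptions rather than values fixed by the patent.

```python
def temporal_iou(a, b):
    """Temporal IoU of two segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, score_thresh=0.5, overlap_thresh=0.7):
    """segments: list of (start, end, confidence); returns the kept segments."""
    candidates = sorted((s for s in segments if s[2] >= score_thresh),
                        key=lambda s: s[2], reverse=True)
    kept = []
    for seg in candidates:
        # Keep a segment only if it does not overlap too much with an already-kept one.
        if all(temporal_iou(seg[:2], k[:2]) < overlap_thresh for k in kept):
            kept.append(seg)
    return kept

# Example: three proposals coming from different detection heads.
print(temporal_nms([(2.0, 12.0, 0.9), (2.5, 11.0, 0.8), (21.0, 33.0, 0.85)]))
```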
Step (4), constructing a description generation decoder based on feature fusion;
as shown in FIG. 2, for each time segment generated by the candidate time segment generation module, the original feature X of the video0And shielding the features beyond the starting time and the ending time and inputting the features into a video feature encoder to obtain video feature sets X with different resolutionscapAnd on the basis, performing feature fusion operation. In order to reduce the complexity of the model as much as possible, feature fusion is realized by adding corresponding positions. Inputting the features subjected to the fusion operation into a decoder, outputting words in the predicted description sentence, finally calculating the loss between the predicted word distribution and the actual words, and updating the parameters of the model by a loss function through a back propagation algorithm. After several iterations, the model can generate a targeted description statement for the events contained in each time segment.
The preprocessing of the video and the text in step (1) is specifically realized as follows:
1-1. All frames from the k·a-th frame to the (k+1)·a-th frame of the video are input into the I3D model to obtain an output feature vector x'_k. In addition, an optical-flow map is extracted for the k·a-th to (k+1)·a-th frames and input into the I3D model to obtain an output feature vector x''_k. x'_k and x''_k are concatenated to obtain the feature vector x_k (1 ≤ k ≤ t). After all frames of the whole video are processed in the same way and mapped with a trainable embedding matrix, the feature vector X = {x_1, x_2, ..., x_t} representing the whole video is obtained.
1-2. The b-th word (1 ≤ b ≤ n) of an annotated description sentence is converted into a One-Hot code according to its position in the vocabulary, the One-Hot code is input into the GloVe model to compress the feature dimensionality, and an embedding matrix is then used to adaptively learn the weights of the different dimensions, yielding the feature vector y_b representing that word. Each word of the sentence is processed in the same way, and the feature vector Y = {y_1, y_2, ..., y_n} representing the whole sentence is obtained.
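The following sketch illustrates this preprocessing, assuming that pre-trained I3D models for RGB and optical flow and a GloVe lookup table are available as callables/tensors; their interfaces, the argument names and the tensor shapes are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

def extract_video_feature(frames, flow, i3d_rgb, i3d_flow, embed: nn.Linear, a: int = 64):
    """frames/flow: (T, C, H, W) tensors; returns X of shape (t, d) with t = T // a blocks."""
    t = frames.shape[0] // a
    feats = []
    for k in range(t):
        clip_rgb = frames[k * a:(k + 1) * a]              # the a RGB frames of block k
        clip_flow = flow[k * a:(k + 1) * a]               # the matching optical-flow maps
        x_rgb = i3d_rgb(clip_rgb.unsqueeze(0))            # (1, d_rgb)
        x_flow = i3d_flow(clip_flow.unsqueeze(0))         # (1, d_flow)
        feats.append(torch.cat([x_rgb, x_flow], dim=-1))  # align in time and merge
    X = torch.cat(feats, dim=0)                           # (t, d_rgb + d_flow)
    return embed(X)                                       # trainable embedding matrix

def extract_text_feature(words, vocab, glove_table: torch.Tensor, embed: nn.Linear):
    """words: list of lower-cased tokens; glove_table: (|V|, d_glove) GloVe embedding table."""
    ids = torch.tensor([vocab[w] for w in words])         # one-hot index of each word
    Y = glove_table[ids]                                  # (n, d_glove) word embeddings
    return embed(Y)                                       # adaptively re-weight the dimensions
```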
The video feature encoder based on the local attention mechanism in step (2) consists of L attention modules, each containing a self-attention submodule MHA and a feed-forward network submodule FFN.
2-1. The self-attention submodule MHA is responsible for reconstructing the input feature, formulated as:
Z = MHA(X^l, X^l, X^l) = [head_1, head_2, ..., head_h] W^O    Formula (1)
head_i = Attention(X^l W_i^Q, X^l W_i^K, X^l W_i^V)    Formula (2)
Attention(Q, K, V) = (softmax(Q K^T / √d) ⊙ MASK) V    Formula (3)
where X^l denotes the input feature of the l-th attention module, W^O is the matrix used to map the output feature, W_i^Q, W_i^K and W_i^V are three different parameter matrices used to process the input feature, MASK is the mask matrix, ⊙ denotes element-wise multiplication of corresponding positions of two matrices, and Q, K, V are X^l W_i^Q, X^l W_i^K and X^l W_i^V, respectively.
2-2. The feed-forward network submodule FFN remaps the output feature of the self-attention submodule, formulated as:
X^{l+1} = FFN(Z) = max(0, Z W_1 + b_1) W_2 + b_2    Formula (4)
where W_1 and W_2 are two parameter matrices and b_1 and b_2 are two bias parameters.
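A minimal sketch of one encoder attention module corresponding to Formulas (1)-(4), assuming the mask is applied multiplicatively to the softmax weights as the ⊙ MASK term above suggests; the dimensions, head count and window size are illustrative, and residual connections and normalization are omitted because Formulas (1)-(4) do not show them.

```python
import math
import torch
import torch.nn as nn

class LocalAttentionModule(nn.Module):
    """One encoder attention module: local multi-head self-attention followed by an FFN."""
    def __init__(self, d: int = 512, h: int = 8, window: int = 2):
        super().__init__()
        self.h, self.dk, self.window = h, d // h, window
        self.wq, self.wk, self.wv = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.wo = nn.Linear(d, d)                                        # W^O of Formula (1)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:                 # x: (t, d)
        t, d = x.shape
        q = self.wq(x).view(t, self.h, self.dk).transpose(0, 1)         # (h, t, dk)
        k = self.wk(x).view(t, self.h, self.dk).transpose(0, 1)
        v = self.wv(x).view(t, self.h, self.dk).transpose(0, 1)
        weights = torch.softmax(q @ k.transpose(-1, -2) / math.sqrt(self.dk), dim=-1)
        idx = torch.arange(t)
        mask = ((idx[:, None] - idx[None, :]).abs() <= self.window).float()
        weights = weights * mask                                         # local MASK, Formula (3)
        z = self.wo((weights @ v).transpose(0, 1).reshape(t, d))         # concat heads, Formula (1)
        return self.ffn(z)                                               # Formula (4)

# Stacking L such modules yields the multi-resolution feature set {X^1, ..., X^L}.
print(LocalAttentionModule()(torch.randn(20, 512)).shape)                # torch.Size([20, 512])
```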
Step (3) a candidate segment generation module based on the characteristic pyramid structure:
3-1. For the output feature of the l-th attention module of the encoder, the l-th detection head ω^(l) detects the time segments that may contain events, giving the output Q^l. Each element q_i^l of Q^l is decomposed into a raw center offset c_i, a raw length h_i and a raw confidence o_i, from which the center position c_i', the duration h_i' and the corresponding confidence o_i' of the finally predicted time segment are obtained as follows:
(c_i, h_i, o_i) = q_i^l    Formula (5)
c_i' = p_i + sigmoid(c_i)    Formula (6)
h_i' = a_i · exp(h_i)    Formula (7)
o_i' = sigmoid(o_i)    Formula (8)
where a_i is the duration of the i-th anchor and p_i is the temporal center position corresponding to the prediction q_i^l.
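A minimal sketch of decoding one raw prediction according to Formulas (6)-(8); the conversion from (center, duration) to (start, end) and the function name are assumptions for illustration.

```python
import math

def decode_prediction(c: float, h: float, o: float, anchor_duration: float, center_pos: float):
    """Turn a raw prediction (c_i, h_i, o_i) into (start, end, confidence)."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    center = center_pos + sigmoid(c)             # c_i' = p_i + sigmoid(c_i)   (Formula (6))
    duration = anchor_duration * math.exp(h)     # h_i' = a_i * exp(h_i)       (Formula (7))
    confidence = sigmoid(o)                      # o_i' = sigmoid(o_i)         (Formula (8))
    return center - duration / 2, center + duration / 2, confidence

# Example: an anchor of 12 seconds whose prediction is centred near the 8th second.
print(decode_prediction(0.3, -0.1, 1.5, anchor_duration=12.0, center_pos=8.0))
```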
3-2. The different events of the annotated dataset are assigned to different detection heads for detection, as follows:
ξ_{l-1} < d_j ≤ ξ_l    Formula (9)
where d_j denotes the duration of the j-th annotated event in the dataset; only annotated events whose duration lies between ξ_{l-1} and ξ_l are handled by the l-th detection head.
3-3. For the output value of the l-th detection head, the deviation from the actual value is measured with a loss function, as follows:
L_prop^l = α_1 · L_reg + α_2 · L_cls    Formula (10)
where α_1 and α_2 are two different weighting coefficients used to adjust the proportions of the two loss functions during training.
The loss of the event detection stage is obtained by adding the loss functions of all detection heads, as follows:
L_prop = Σ_{l=1}^{L} L_prop^l    Formula (11)
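The following sketch illustrates the routing of Formula (9) and the loss combination of Formulas (10)-(11). The concrete regression and classification losses (smooth L1 and binary cross-entropy) are assumptions, since the patent does not name them, and the default weights follow the values given later in the detailed description.

```python
import torch
import torch.nn.functional as F

def head_index(duration: float, thresholds=(0.0, 12.0, 36.0, 408.0)) -> int:
    """Formula (9): the l-th head handles events with xi_{l-1} < duration <= xi_l."""
    for l in range(1, len(thresholds)):
        if thresholds[l - 1] < duration <= thresholds[l]:
            return l
    return len(thresholds) - 1

def head_loss(pred_reg, gt_reg, pred_conf, labels, alpha1=1.0, alpha2=100.0):
    """Formula (10) for one detection head.
    pred_reg/gt_reg: (P, 2) predicted/target (center, duration) of the positive elements;
    pred_conf: (N,) confidences of all elements; labels: (N,) float 0/1 positive mask."""
    l_reg = F.smooth_l1_loss(pred_reg, gt_reg)              # assumed regression loss
    l_cls = F.binary_cross_entropy(pred_conf, labels)       # assumed classification loss
    return alpha1 * l_reg + alpha2 * l_cls

def proposal_loss(per_head_losses):
    """Formula (11): the event-detection loss is the sum over all detection heads."""
    return torch.stack(per_head_losses).sum()

# Example: an event of 20 s is routed to head 2 (durations between 12 s and 36 s).
print(head_index(20.0))
```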
In step (4), after the time segments that may contain events have been obtained, a description-generation decoder based on feature fusion is used to generate a description sentence for the event contained in each time segment, as follows:
4-1. After the masked video features are input into the video feature encoder of the description-generation stage, the output feature set X_cap of the encoder is obtained:
X_cap = {X_cap^1, X_cap^2, ..., X_cap^L}    Formula (12)
4-2. X_cap^L is regarded as F^(L), and the feature fusion operation generates F^(L-1):
F^(L-1) = X_cap^{L-1} ⊕ F^(L)    Formula (13)
where ⊕ denotes the element-wise addition of corresponding positions of two matrices. Then, in the same top-down manner, the fusion feature set F corresponding to X_cap is generated:
F = {F^(1), F^(2), ..., F^(L)}    Formula (14)
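A minimal sketch of the top-down fusion of Formulas (12)-(14), assuming (as stated later in the detailed description) that the per-level features all have the same shape.

```python
import torch

def fuse_top_down(x_cap):
    """x_cap: list [X_cap^1, ..., X_cap^L] of equally shaped tensors.
    Returns F = [F^(1), ..., F^(L)] with F^(L) = X_cap^L and F^(l) = X_cap^l + F^(l+1)."""
    fused = [None] * len(x_cap)
    fused[-1] = x_cap[-1]
    for l in range(len(x_cap) - 2, -1, -1):
        fused[l] = x_cap[l] + fused[l + 1]   # element-wise addition of corresponding positions
    return fused

# Example with L = 3 levels of shape (t, d) = (20, 512).
levels = [torch.randn(20, 512) for _ in range(3)]
print(len(fuse_top_down(levels)), fuse_top_down(levels)[0].shape)
```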
4-3. The l-th attention module of the decoder receives the fusion feature F^(l) of the corresponding level together with the text feature Y^(l) and produces Y^(l+1), as follows:
Y^(l+1) = Θ^(l)(Y^(l), F^(l))    Formula (15)
where the attention module Θ^(l) comprises three sub-modules, namely a self-attention sub-module φ(·), a multi-head attention sub-module ψ(·,·) and a feed-forward network sub-module FFN(·), as shown below:
φ(Y^(l)) = LN(MHA(Y^(l), Y^(l), Y^(l)) + Y^(l))    Formula (16)
ψ(Y^(l), F^(l)) = LN(MHA(φ(Y^(l)), F^(l), F^(l)) + φ(Y^(l)))    Formula (17)
Y^(l+1) = LN(FFN(ψ(Y^(l), F^(l))) + ψ(Y^(l), F^(l)))    Formula (18)
where LN(·) denotes the layer normalization operation (Layer Normalization).
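A minimal sketch of one decoder attention module corresponding to Formulas (15)-(18), assuming PyTorch's nn.MultiheadAttention and a causal self-attention mask; these implementation choices, the dimensions and the head count are illustrative assumptions rather than specifics fixed by the patent.

```python
import torch
import torch.nn as nn

class FusionDecoderLayer(nn.Module):
    """One decoder attention module: masked self-attention over the word features,
    cross-attention to the fused video feature F^(l), then an FFN, each wrapped with a
    residual connection and layer normalization."""
    def __init__(self, d: int = 512, h: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2, self.ln3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, y: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        # y: (B, n, d) word features; f: (B, t, d) fused video features of this level.
        n = y.shape[1]
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        a = self.ln1(self.self_attn(y, y, y, attn_mask=causal)[0] + y)      # Formula (16)
        b = self.ln2(self.cross_attn(a, f, f)[0] + a)                        # Formula (17)
        return self.ln3(self.ffn(b) + b)                                     # Formula (18)

# Example: one such layer consuming the fused feature of its level.
layer = FusionDecoderLayer()
print(layer(torch.randn(1, 7, 512), torch.randn(1, 20, 512)).shape)
```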
The invention has the following beneficial effects:
the invention relates to a video dense description algorithm based on a time sequence feature pyramid structure, which considers the different requirements of events with different durations on feature resolution on the basis of the prior method, and simultaneously detects the events possibly existing in the video by using a plurality of detection heads, so that the generated time segment set has higher accuracy and recall rate. Furthermore, the invention also provides proper global semantic information for fine-grained features by using a feature fusion mode, so that a decoder can generate more targeted descriptive statements.
The invention has the advantages of reasonable parameter quantity, obvious effect, contribution to more efficient distributed training and contribution to being deployed in specific hardware with limited memory.
Drawings
FIG. 1: candidate time slice generation module based on pyramid structure
FIG. 2: description generation decoder based on feature fusion
Detailed Description
The detailed parameters of the invention are described in more detail below.
As shown in fig. 1 and 2, the present invention provides a method for generating dense video description based on a time-series feature pyramid.
Step (1), the video and text feature extraction, is specified as follows:
1-1. For video processing, a complete video is cut into several blocks in units of 64 frames, i.e. a = 64.
1-2. For text processing, punctuation marks are first removed from the sentence and the initial letters are converted to lower case; the words are then fed into the trained GloVe model to obtain the feature representation of the sentence.
Step (2), the video feature encoder based on the local attention mechanism is responsible for encoding the video features, specified as follows:
2-1. The number of heads of the self-attention submodule is 8, i.e. h = 8. After the input feature X^l enters the self-attention submodule, attention weights over the different dimensions are calculated, normalized with a softmax function, and the input feature is then reconstructed according to the normalized weights.
2-2. The feed-forward network submodule consists of two fully connected linear layers, and a Dropout operation with a rate of 10% is applied to the output of each linear layer.
Step (3), the candidate segment generation module based on the pyramid structure is responsible for generating time segments possibly containing events, and the details are as follows:
and 3-1, for the output characteristics of the first (L is more than or equal to 1 and less than or equal to L) attention module of the encoder, detecting by using the first detection head. The number of anchor points used for detection in the detection head is 128, and the sizes of the anchor points are obtained by clustering in the data set according to the duration of all labeled events through a K-means algorithm.
3-2. Three detection heads are used to detect the events that may occur in the video simultaneously, i.e. L = 3. Different detection heads are responsible for events of different durations, and the division thresholds ξ_0, ξ_1, ξ_2, ξ_3 are set to 0, 12, 36 and 408 seconds, respectively.
3-3. A loss function is used to measure the deviation between the predicted value and the true value, and the parameters α_1 and α_2 that adjust the weights of the positive and negative samples are set to 1 and 100, respectively.
Step (4), the description-generation decoder based on feature fusion generates the corresponding description sentence for each event, specified as follows:
4-1. The number of attention modules in the encoder and the decoder used in the description-generation stage is 3. To avoid the differences between tasks affecting the model, a separate video feature encoder is used that has the same structure as the one in the event detection stage but independently trained parameters.
4-2. After the encoder output feature set is obtained, feature fusion is performed directly by adding corresponding positions, since the features have the same size.
4-3. Residual connections are used between the different sub-modules of the decoder, and a Dropout operation with a rate of 10% is applied to the output of each linear layer in the feed-forward network sub-module.

Claims (5)

1. A video dense description method based on a temporal feature pyramid, characterized by comprising the following steps:
step (1), data preprocessing: extracting features from the video and text data:
first, preprocessing the video V and extracting its features:
for an untrimmed video V, cutting it into t blocks in units of a frames; feeding the a frames of each block into an I3D model pre-trained on the Kinetics dataset for feature extraction, extracting features in the same way from the corresponding optical-flow maps, then aligning the two kinds of features in the time dimension and merging them, and obtaining a feature vector X representing the whole video after a trainable embedding matrix;
second, extracting features from the text information:
for a given sentence, removing the punctuation marks, feeding each word into a GloVe model to obtain its word embedding, and then using an embedding matrix to adaptively learn the weights of the different dimensions, thereby obtaining a feature vector Y representing the whole sentence;
step (2), feature encoding with a video feature encoder based on a local attention mechanism:
the video feature encoder consists of L attention modules, and each attention module contains a self-attention submodule MHA and a feed-forward network submodule FFN; the video feature X is input into the video feature encoder to obtain a set of features of different resolutions {X^1, X^2, ..., X^L}; the specific process is as follows:
first, the video feature X is regarded as X^0 and input in turn into the self-attention submodule MHA and the feed-forward network submodule FFN of the 1st attention module; in the self-attention submodule, a local attention mechanism is adopted to limit the receptive field of each position, so that each position of the output feature is reconstructed only from the neighbouring positions of the input feature, forming a local receptive field similar to that of a convolutional neural network; the feed-forward network submodule remaps the output feature to obtain the output X^1 of the 1st attention module; X^1 is then taken as the input of the 2nd attention module, and the process is repeated until the output X^L of the L-th attention module is obtained;
step (3), constructing a candidate segment generation module based on the feature pyramid structure:
first, the output features of the different attention modules are input into different detection heads; for the l-th detection head (1 ≤ l ≤ L), the output is Q^l, and for each element q_i^l in Q^l, the corresponding start time, end time and confidence score are obtained according to the sampling interval of the video features; the l-th detection head, based on the output feature of the l-th attention module of the encoder, is responsible for predicting events whose duration lies between ξ_{l-1} and ξ_l;
in the training phase of the model, the output of the candidate segment generation module is divided into two parts, the first part being the predicted event center position and event duration, which determine the start time and end time of the predicted time segment; for each annotated event, the element q_i^l of the output feature whose center position and anchor size match it most closely is selected for calculating the regression loss L_reg, where the regression loss function measures the deviation between the predicted value and the actual value; the second part is the predicted confidence, representing the possibility that the current time segment contains an event; the elements used for calculating the regression loss are regarded as positive samples and the rest as negative samples, and the classification loss L_cls is calculated over all samples; finally, the two losses are added to obtain the total loss L_prop^l of the l-th detection head in the event detection stage, and the loss L_prop of the event detection stage is obtained by adding the loss functions of all detection heads;
in the testing stage, after the different detection heads generate their candidate time-segment sets, all time segments are merged and sorted from high to low according to the corresponding confidence scores; a non-maximum suppression algorithm is then used to screen the time segments, obtaining a set of time segments whose confidence scores are higher than a set confidence threshold and whose mutual overlap is lower than a set overlap threshold; each retained time segment is considered to contain a specific event, so the visual features located within it are input into a decoder to generate the corresponding description sentence;
step (4), constructing a description-generation decoder based on feature fusion:
for each time segment generated by the candidate time segment generation module, the features of the original video feature X^0 lying outside the start time and the end time are masked, and the result is input into a video feature encoder to obtain the set X_cap of video features of different resolutions, on which the feature fusion operation is performed; to reduce the complexity of the model as much as possible, feature fusion is realized by element-wise addition of corresponding positions; the fused features are input into a decoder, which outputs the words of the predicted description sentence; finally, the loss between the predicted word distribution and the actual words is calculated, and the loss function updates the parameters of the model through a back-propagation algorithm; after several iterations, the model can generate a targeted description sentence for the event contained in each time segment.
2. The video dense description method based on a temporal feature pyramid according to claim 1, characterized in that the preprocessing of the video and the text in step (1) is specifically realized as follows:
1-1. all frames from the k·a-th frame to the (k+1)·a-th frame of the video are input into the I3D model to obtain an output feature vector x'_k; in addition, an optical-flow map is extracted for the k·a-th to (k+1)·a-th frames and input into the I3D model to obtain an output feature vector x''_k; x'_k and x''_k are concatenated to obtain the feature vector x_k (1 ≤ k ≤ t); after all frames of the whole video are processed in the same way and mapped with a trainable embedding matrix, the feature vector X = {x_1, x_2, ..., x_t} representing the whole video is obtained;
1-2. the b-th word (1 ≤ b ≤ n) of an annotated description sentence is converted into a One-Hot code according to its position in the vocabulary, the One-Hot code is input into the GloVe model to compress the feature dimensionality, and an embedding matrix is then used to adaptively learn the weights of the different dimensions, obtaining the feature vector y_b representing that word; each word of the sentence is processed in the same way, and the feature vector Y = {y_1, y_2, ..., y_n} representing the whole sentence is obtained.
3. The video dense description method based on a temporal feature pyramid according to claim 2, characterized in that the video feature encoder based on the local attention mechanism in step (2) consists of L attention modules, each containing a self-attention submodule MHA and a feed-forward network submodule FFN;
2-1. the self-attention submodule MHA is responsible for reconstructing the input feature, formulated as:
Z = MHA(X^l, X^l, X^l) = [head_1, head_2, ..., head_h] W^O    Formula (1)
head_i = Attention(X^l W_i^Q, X^l W_i^K, X^l W_i^V)    Formula (2)
Attention(Q, K, V) = (softmax(Q K^T / √d) ⊙ MASK) V    Formula (3)
where X^l denotes the input feature of the l-th attention module, W^O is the matrix used to map the output feature, W_i^Q, W_i^K and W_i^V are three different parameter matrices used to process the input feature, MASK is the mask matrix, ⊙ denotes element-wise multiplication of corresponding positions of two matrices, and Q, K, V are X^l W_i^Q, X^l W_i^K and X^l W_i^V, respectively;
2-2. the feed-forward network submodule FFN remaps the output feature of the self-attention submodule, formulated as:
X^{l+1} = FFN(Z) = max(0, Z W_1 + b_1) W_2 + b_2    Formula (4)
where W_1 and W_2 are two parameter matrices and b_1 and b_2 are two bias parameters.
4. The video dense description method based on a temporal feature pyramid according to claim 3, characterized in that step (3) is specifically as follows:
3-1. for the output feature of the l-th attention module of the encoder, the l-th detection head ω^(l) detects the time segments that may contain events to obtain the output Q^l; each element q_i^l of Q^l is decomposed into a raw center offset c_i, a raw length h_i and a raw confidence o_i, from which the center position c_i', the duration h_i' and the corresponding confidence o_i' of the finally predicted time segment are obtained as follows:
(c_i, h_i, o_i) = q_i^l    Formula (5)
c_i' = p_i + sigmoid(c_i)    Formula (6)
h_i' = a_i · exp(h_i)    Formula (7)
o_i' = sigmoid(o_i)    Formula (8)
where a_i is the duration of the i-th anchor and p_i is the temporal center position corresponding to the prediction q_i^l;
3-2. the different events of the annotated dataset are assigned to different detection heads for detection, as follows:
ξ_{l-1} < d_j ≤ ξ_l    Formula (9)
where d_j denotes the duration of the j-th annotated event in the dataset; only annotated events whose duration lies between ξ_{l-1} and ξ_l are handled by the l-th detection head;
3-3. the deviation between the output value of the l-th detection head and the actual value is measured with a loss function, as follows:
L_prop^l = α_1 · L_reg + α_2 · L_cls    Formula (10)
where α_1 and α_2 are two different weighting coefficients used to adjust the proportions of the two loss functions during training;
the loss of the event detection stage is obtained by adding the loss functions of all detection heads, as follows:
L_prop = Σ_{l=1}^{L} L_prop^l    Formula (11)
5. The video dense description method based on a temporal feature pyramid according to claim 4, characterized in that step (4) is specifically as follows:
4-1. after the masked video features are input into the video feature encoder of the description-generation stage, the output feature set X_cap of the encoder is obtained:
X_cap = {X_cap^1, X_cap^2, ..., X_cap^L}    Formula (12)
4-2. X_cap^L is regarded as F^(L), and the feature fusion operation generates F^(L-1):
F^(L-1) = X_cap^{L-1} ⊕ F^(L)    Formula (13)
where ⊕ denotes the element-wise addition of corresponding positions of two matrices; then, in the same top-down manner, the fusion feature set F corresponding to X_cap is generated:
F = {F^(1), F^(2), ..., F^(L)}    Formula (14)
4-3. the l-th attention module of the decoder receives the fusion feature F^(l) of the corresponding level together with the text feature Y^(l) and produces Y^(l+1), as follows:
Y^(l+1) = Θ^(l)(Y^(l), F^(l))    Formula (15)
where the attention module Θ^(l) comprises three sub-modules, namely a self-attention sub-module φ(·), a multi-head attention sub-module ψ(·,·) and a feed-forward network sub-module FFN(·), as shown below:
φ(Y^(l)) = LN(MHA(Y^(l), Y^(l), Y^(l)) + Y^(l))    Formula (16)
ψ(Y^(l), F^(l)) = LN(MHA(φ(Y^(l)), F^(l), F^(l)) + φ(Y^(l)))    Formula (17)
Y^(l+1) = LN(FFN(ψ(Y^(l), F^(l))) + ψ(Y^(l), F^(l)))    Formula (18)
where LN(·) denotes the layer normalization operation (Layer Normalization).
CN202110558847.1A 2021-05-21 2021-05-21 Video dense description generation method based on time sequence feature pyramid Active CN113392717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110558847.1A CN113392717B (en) 2021-05-21 2021-05-21 Video dense description generation method based on time sequence feature pyramid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110558847.1A CN113392717B (en) 2021-05-21 2021-05-21 Video dense description generation method based on time sequence feature pyramid

Publications (2)

Publication Number Publication Date
CN113392717A true CN113392717A (en) 2021-09-14
CN113392717B CN113392717B (en) 2024-02-13

Family

ID=77618939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110558847.1A Active CN113392717B (en) 2021-05-21 2021-05-21 Video dense description generation method based on time sequence feature pyramid

Country Status (1)

Country Link
CN (1) CN113392717B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929092A (en) * 2019-11-19 2020-03-27 国网江苏省电力工程咨询有限公司 Multi-event video description method based on dynamic attention mechanism
CN111814844A (en) * 2020-03-17 2020-10-23 同济大学 Intensive video description method based on position coding fusion

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113810730A (en) * 2021-09-17 2021-12-17 咪咕数字传媒有限公司 Real-time text generation method and device based on video and computing equipment
CN113810730B (en) * 2021-09-17 2023-08-01 咪咕数字传媒有限公司 Video-based real-time text generation method and device and computing equipment
CN114359768A (en) * 2021-09-30 2022-04-15 中远海运科技股份有限公司 Video dense event description method based on multi-mode heterogeneous feature fusion
WO2023050295A1 (en) * 2021-09-30 2023-04-06 中远海运科技股份有限公司 Multimodal heterogeneous feature fusion-based compact video event description method
CN114359768B (en) * 2021-09-30 2024-04-16 中远海运科技股份有限公司 Video dense event description method based on multi-mode heterogeneous feature fusion
CN114998673A (en) * 2022-05-11 2022-09-02 河海大学 Dam defect time sequence image description method based on local self-attention mechanism
CN114998673B (en) * 2022-05-11 2023-10-13 河海大学 Dam defect time sequence image description method based on local self-attention mechanism
WO2023217163A1 (en) * 2022-05-11 2023-11-16 华能澜沧江水电股份有限公司 Dam defect time-sequence image description method based on local self-attention mechanism
CN116578769A (en) * 2023-03-03 2023-08-11 齐鲁工业大学(山东省科学院) Multitask learning recommendation method based on behavior mode conversion
CN116578769B (en) * 2023-03-03 2024-03-01 齐鲁工业大学(山东省科学院) Multitask learning recommendation method based on behavior mode conversion

Also Published As

Publication number Publication date
CN113392717B (en) 2024-02-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant