CN113392717A - Video dense description generation method based on time sequence characteristic pyramid - Google Patents

Video dense description generation method based on time sequence characteristic pyramid

Info

Publication number
CN113392717A
CN113392717A
Authority
CN
China
Prior art keywords
video
feature
attention
time
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110558847.1A
Other languages
Chinese (zh)
Other versions
CN113392717B (en)
Inventor
俞俊
余宙
韩男佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110558847.1A priority Critical patent/CN113392717B/en
Publication of CN113392717A publication Critical patent/CN113392717A/en
Application granted granted Critical
Publication of CN113392717B publication Critical patent/CN113392717B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video dense description method based on a temporal feature pyramid. Under a Transformer network framework, the video is encoded while features of different resolutions are obtained through a local attention mechanism, and the features of different resolutions are then examined by multiple detection heads, so that events of different durations are comprehensively covered. After detecting the time segments that may contain events, the invention further fuses the video features of different resolutions, so as to generate more targeted descriptions for the events. Compared with other methods, the proposed method achieves higher precision and recall, and the description-generation decoder also produces higher-quality description sentences from the fused features, which demonstrates the generality of the method and its value in other multi-modal tasks.

Description

Video dense description generation method based on time sequence characteristic pyramid
Technical Field
The invention belongs to the field of video processing, and in particular relates to a dense video captioning (DVC) method based on a temporal feature pyramid.
Background
Dense video captioning is an emerging task in the multimedia field that aims to localize events in an uncut raw video and to generate a descriptive sentence for each of them. Specifically, given an input video file, the model locates the time intervals (start time and end time) of the events contained in the video. For example, a time segment that may contain an event may lie between the 2nd and the 12th second of the video, and another between the 21st and the 33rd second. For each time segment that may contain an event, e.g. between the 2nd and the 12th second, the dense video captioning model also needs to describe the content of the event occurring within that segment. To obtain more accurate predictions, the machine must understand the intrinsic meaning of the given video and text and, on that basis, perform a suitable cross-modal fusion of the two kinds of information so as to eliminate the semantic gap as far as possible. Compared with images, a video can be understood as a sequence of images with temporal consistency; making good use of the temporal information in the video, i.e. modeling along the time dimension, is therefore a key issue in video research.
In recent years, deep learning has received great attention from research institutions and industry, and its development has yielded many excellent network models and effective training methods. With the progress of academic research, cross-modal tasks are becoming a mainstream research direction, since they are closer to real-life scenarios and carry rich research significance and practical value. Video, a medium that has risen rapidly in recent years, combined with natural language forms the video-text cross-modal research direction, of which dense video captioning is one of the more important branches: it requires accurate description while events are being localized. Enabling a computer to automatically locate the start and end positions of the events contained in an input video and to describe those events in appropriate language is a research problem worth exploring in depth.
For many years, the cross-media research community has recognized the importance of capturing the associations between modalities, and attention mechanisms have been used to mine the rich correlations between them. Some studies have also begun to consider the interaction of information within a modality: before fusion, the correlations among intra-modal features are obtained through a self-attention mechanism or different linear layers. Since the understanding of cross-media information must be built on fully exploiting the information inside each single modality, both image-text and video data contain effective information worth mining; modeling intra-modal information undoubtedly helps deepen the understanding of the single modality and further enhances the expressive power of the final fused features.
In terms of practical application, the dense video captioning algorithm has a wide range of application scenarios. In entertainment scenarios, such as YouTube, iQiyi, Tencent Video and other video platforms, the segments of the latest videos that interest a user can be quickly found according to the user's historical data. The technique also has good research prospects and important significance in security systems.
In summary, dense video captioning is a topic worthy of in-depth study. This patent starts from several key points of the task, discusses and solves the difficulties of existing methods, and forms a complete dense video captioning system.
Natural-language descriptions generally come from different annotators, have a high degree of freedom, and do not follow a unified, fixed sentence structure. Meanwhile, the video carriers in natural scenes cover various themes, their content is complex and highly variable, and adjacent frames may be highly similar and redundant, so dense video captioning faces huge challenges. Specifically, there are two main difficulties:
(1) Event detection is an indispensable step in the dense video captioning task. After obtaining the video features, existing methods use a single detector to detect and localize the events occurring in the video, and usually adopt fine-grained video features for more accurate localization. However, a single detector can hardly handle events whose durations differ greatly, so only events whose durations fall within a specific range are detected well. In addition, since a long event requires coarse-grained features containing more global information during localization, a single fine-grained feature may lead to inaccurate localization. Therefore, how to make the model account for the different requirements that events of different durations place on feature resolution, so as to generate more accurate candidate time segments, is a difficult problem in dense video captioning and an important factor affecting performance.
(2) After detecting a time segment containing an event, the dense video captioning task also requires a description sentence to be generated for the event contained in that segment, and conventional methods generate the description from video features of a single resolution. This ignores the different effects that features of different resolutions have on the event description. In addition, a recurrent neural network is often adopted for description generation; limited by its recursive nature, the description-generation module is difficult to parallelize during training, which reduces training efficiency to a certain extent.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a dense video captioning (DVC) method based on a temporal feature pyramid. Its core is a multi-level temporal feature pyramid model proposed to solve the problem of detecting events of different durations, and its superiority is verified on the cross-modal deep-learning task of dense video captioning. Under a Transformer network framework, the video is encoded while features of different resolutions are obtained through a local attention mechanism; multiple detection heads then perform detection on the features of different resolutions, thereby achieving comprehensive coverage of events of different durations. After detecting the time segments that may contain events, the invention further fuses the video features of different resolutions so as to generate more targeted descriptions for the events. In experiments, an unedited video is input into the temporal-feature-pyramid-based dense video captioning model; after the candidate time segment module predicts the time segments, higher precision and recall are achieved than with other methods, and the description-generation decoder also produces higher-quality description sentences from the fused features, which demonstrates the generality of the method and its value in other multi-modal tasks.
The invention mainly comprises two points:
1. With the help of a local attention mechanism, multiple detection heads based on different resolutions are used simultaneously for event detection, so that events of different durations in the dense video captioning task are effectively covered, the intrinsic information of the video is fully explored, and a candidate time-segment set with higher precision and recall is obtained.
2. A description-generation decoder based on feature fusion is provided. Features of different resolutions are fused so that the low-level fine-grained features obtain the global semantic information of the high-level coarse-grained features. Given features that carry both detail and global information, the decoder can fully understand the contextual information and temporal correlations of the video and generate a more targeted description text.
The technical solution adopted by the invention to solve the above technical problems comprises the following steps:
Step (1), data preprocessing: extract features from the video and text data.
First, preprocess the video V and extract its features:
For an untrimmed video V, cut it into t blocks in units of a frames. The a frames of each block are fed into an I3D model pre-trained on the Kinetics dataset for feature extraction, and features are extracted in the same way from the corresponding optical-flow maps. The two kinds of features are then aligned in the time dimension and merged, and a feature vector X representing the whole video is obtained after passing through a trainable embedding matrix.
Second, extract features from the text information:
For a given sentence, punctuation marks are removed, each word is fed into a GloVe model to obtain its word embedding, and an embedding matrix is then used to adaptively learn the weights of the different dimensions, yielding a feature vector Y that represents the whole sentence.
Step (2), feature encoding with a video feature encoder based on a local attention mechanism:
The video feature encoder consists of L attention modules, and each attention module contains a self-attention submodule MHA and a feed-forward network submodule FFN. The video feature X is input into the video feature encoder to obtain a set of features of different resolutions {X^1, X^2, ..., X^L}. The specific process is as follows.
First, the video feature X is regarded as X^0 and input in turn into the self-attention submodule MHA and the feed-forward network submodule FFN of the 1st attention module. In the self-attention submodule, a local attention mechanism is adopted to limit the receptive field of each position, so that each position of the output feature is reconstructed only from the neighbouring positions of the input feature, forming a local receptive field similar to that of a convolutional neural network. The feed-forward network submodule remaps the output feature, yielding the output X^1 of the 1st attention module. X^1 is then taken as the input of the 2nd attention module, and this process is repeated until the output X^L of the L-th attention module is obtained.
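As an illustration of the local attention mechanism described above, the following minimal sketch builds a banded 0/1 mask that restricts every temporal position to its nearest neighbours; the window size, the function name and the use of NumPy are assumptions made for illustration and are not prescribed by the patent.

```python
import numpy as np

def local_attention_mask(t: int, window: int) -> np.ndarray:
    """Banded mask: position i may only attend to positions j with |i - j| <= window."""
    idx = np.arange(t)
    # A band of ones around the diagonal realizes the local receptive field.
    return (np.abs(idx[:, None] - idx[None, :]) <= window).astype(np.float32)

# Example: 10 temporal positions, each attending to itself and 2 neighbours on either side.
print(local_attention_mask(10, 2))
```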
Step (3), constructing a candidate segment generation module based on the characteristic pyramid structure;
as shown in FIG. 1, the output characteristics of different attention modules are first input into different detection heads, and for the L (1 ≦ L ≦ L) detection head, the output is
Figure BDA0003078323130000055
For QlIn (1)Each element
Figure BDA0003078323130000056
And obtaining corresponding starting time and ending time and corresponding confidence scores according to the sampling interval of the video characteristics. In addition, since the feature resolution of the output of the attention module at the lower level is lower and the feature resolution of the output of the attention module at the higher level is higher in the video feature encoder, the I-th detection head based on the output feature of the I-th attention module of the encoder is responsible for predicting the duration to be xil-1~ξlIn the event of a change.
In the training phase of the model, the output of the candidate segment generation module is divided into two parts. The first part is the predicted event center position and event duration, which determine the start and end times of the predicted time segment. For each annotated event, the element q_i^l of the output feature whose center position and anchor size match it most closely is selected to calculate the regression loss L_reg; here the regression loss function measures the deviation between the predicted value and the actual value. The second part is the predicted confidence, representing the likelihood that the current time segment contains an event. The elements used to calculate the regression loss are regarded as positive samples and the rest as negative samples, and the classification loss L_cls is calculated over all samples. Finally, the two losses are added to obtain the total loss L_prop^l of the l-th detection head in the event detection stage, and the loss L_prop of the event detection stage is obtained by adding the loss functions of all detection heads.
In the testing stage, after the different detection heads have generated their candidate time-segment sets, all time segments are merged and sorted from high to low by their confidence scores. A non-maximum suppression algorithm is then used to screen the time segments, yielding a set whose confidence scores are above a set confidence threshold and whose mutual overlap is below a set overlap threshold. Each remaining time segment is considered to contain a specific event, so the visual features located within it are input into a decoder to generate the corresponding description sentence.
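The test-time screening just described amounts to a temporal non-maximum suppression over (start, end, confidence) triples. The following is a minimal sketch; the threshold values and function names are illustrative assumptions rather than values fixed by the patent.

```python
def temporal_iou(a, b):
    """Temporal IoU of two segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, score_thresh=0.5, overlap_thresh=0.7):
    """segments: list of (start, end, confidence); returns the kept segments."""
    candidates = sorted((s for s in segments if s[2] >= score_thresh),
                        key=lambda s: s[2], reverse=True)
    kept = []
    for seg in candidates:
        # Keep a segment only if it does not overlap too much with an already-kept one.
        if all(temporal_iou(seg[:2], k[:2]) < overlap_thresh for k in kept):
            kept.append(seg)
    return kept

# Example: three proposals coming from different detection heads.
print(temporal_nms([(2.0, 12.0, 0.9), (2.5, 11.0, 0.8), (21.0, 33.0, 0.85)]))
```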
Step (4), constructing a description generation decoder based on feature fusion;
as shown in FIG. 2, for each time segment generated by the candidate time segment generation module, the original feature X of the video0And shielding the features beyond the starting time and the ending time and inputting the features into a video feature encoder to obtain video feature sets X with different resolutionscapAnd on the basis, performing feature fusion operation. In order to reduce the complexity of the model as much as possible, feature fusion is realized by adding corresponding positions. Inputting the features subjected to the fusion operation into a decoder, outputting words in the predicted description sentence, finally calculating the loss between the predicted word distribution and the actual words, and updating the parameters of the model by a loss function through a back propagation algorithm. After several iterations, the model can generate a targeted description statement for the events contained in each time segment.
The preprocessing of the video and the text in step (1) is specifically realized as follows:
1-1. All frames from the k·a-th frame to the (k+1)·a-th frame of the video are input into the I3D model to obtain an output feature vector x'_k. In addition, an optical-flow map is extracted for the k·a-th to (k+1)·a-th frames and input into the I3D model to obtain an output feature vector x''_k. x'_k and x''_k are concatenated to obtain the feature vector x_k (1 ≤ k ≤ t). After all frames of the whole video are processed in the same way and mapped with a trainable embedding matrix, the feature vector X = {x_1, x_2, ..., x_t} representing the whole video is obtained.
1-2. The b-th word (1 ≤ b ≤ n) of an annotated description sentence is converted into a One-Hot code according to its position in the vocabulary, the One-Hot code is input into the GloVe model to compress the feature dimensionality, and an embedding matrix is then used to adaptively learn the weights of the different dimensions, yielding the feature vector y_b representing that word. Each word of the sentence is processed in the same way, and the feature vector Y = {y_1, y_2, ..., y_n} representing the whole sentence is obtained.
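The following sketch illustrates this preprocessing, assuming that pre-trained I3D models for RGB and optical flow and a GloVe lookup table are available as callables/tensors; their interfaces, the argument names and the tensor shapes are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

def extract_video_feature(frames, flow, i3d_rgb, i3d_flow, embed: nn.Linear, a: int = 64):
    """frames/flow: (T, C, H, W) tensors; returns X of shape (t, d) with t = T // a blocks."""
    t = frames.shape[0] // a
    feats = []
    for k in range(t):
        clip_rgb = frames[k * a:(k + 1) * a]              # the a RGB frames of block k
        clip_flow = flow[k * a:(k + 1) * a]               # the matching optical-flow maps
        x_rgb = i3d_rgb(clip_rgb.unsqueeze(0))            # (1, d_rgb)
        x_flow = i3d_flow(clip_flow.unsqueeze(0))         # (1, d_flow)
        feats.append(torch.cat([x_rgb, x_flow], dim=-1))  # align in time and merge
    X = torch.cat(feats, dim=0)                           # (t, d_rgb + d_flow)
    return embed(X)                                       # trainable embedding matrix

def extract_text_feature(words, vocab, glove_table: torch.Tensor, embed: nn.Linear):
    """words: list of lower-cased tokens; glove_table: (|V|, d_glove) GloVe embedding table."""
    ids = torch.tensor([vocab[w] for w in words])         # one-hot index of each word
    Y = glove_table[ids]                                  # (n, d_glove) word embeddings
    return embed(Y)                                       # adaptively re-weight the dimensions
```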
The video feature encoder based on the local attention mechanism in step (2) consists of L attention modules, each containing a self-attention submodule MHA and a feed-forward network submodule FFN.
2-1. The self-attention submodule MHA is responsible for reconstructing the input feature, formulated as:
Z = MHA(X^l, X^l, X^l) = [head_1, head_2, ..., head_h] W^O    Formula (1)
head_i = Attention(X^l W_i^Q, X^l W_i^K, X^l W_i^V)    Formula (2)
Attention(Q, K, V) = (softmax(Q K^T / √d) ⊙ MASK) V    Formula (3)
where X^l denotes the input feature of the l-th attention module, W^O is the matrix used to map the output feature, W_i^Q, W_i^K and W_i^V are three different parameter matrices used to process the input feature, MASK is the mask matrix, ⊙ denotes element-wise multiplication of corresponding positions of two matrices, and Q, K, V are X^l W_i^Q, X^l W_i^K and X^l W_i^V, respectively.
2-2. The feed-forward network submodule FFN remaps the output feature of the self-attention submodule, formulated as:
X^{l+1} = FFN(Z) = max(0, Z W_1 + b_1) W_2 + b_2    Formula (4)
where W_1 and W_2 are two parameter matrices and b_1 and b_2 are two bias parameters.
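A minimal sketch of one encoder attention module corresponding to Formulas (1)-(4), assuming the mask is applied multiplicatively to the softmax weights as the ⊙ MASK term above suggests; the dimensions, head count and window size are illustrative, and residual connections and normalization are omitted because Formulas (1)-(4) do not show them.

```python
import math
import torch
import torch.nn as nn

class LocalAttentionModule(nn.Module):
    """One encoder attention module: local multi-head self-attention followed by an FFN."""
    def __init__(self, d: int = 512, h: int = 8, window: int = 2):
        super().__init__()
        self.h, self.dk, self.window = h, d // h, window
        self.wq, self.wk, self.wv = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.wo = nn.Linear(d, d)                                        # W^O of Formula (1)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:                 # x: (t, d)
        t, d = x.shape
        q = self.wq(x).view(t, self.h, self.dk).transpose(0, 1)         # (h, t, dk)
        k = self.wk(x).view(t, self.h, self.dk).transpose(0, 1)
        v = self.wv(x).view(t, self.h, self.dk).transpose(0, 1)
        weights = torch.softmax(q @ k.transpose(-1, -2) / math.sqrt(self.dk), dim=-1)
        idx = torch.arange(t)
        mask = ((idx[:, None] - idx[None, :]).abs() <= self.window).float()
        weights = weights * mask                                         # local MASK, Formula (3)
        z = self.wo((weights @ v).transpose(0, 1).reshape(t, d))         # concat heads, Formula (1)
        return self.ffn(z)                                               # Formula (4)

# Stacking L such modules yields the multi-resolution feature set {X^1, ..., X^L}.
print(LocalAttentionModule()(torch.randn(20, 512)).shape)                # torch.Size([20, 512])
```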
Step (3) a candidate segment generation module based on the characteristic pyramid structure:
3-1. For the output feature of the l-th attention module of the encoder, the l-th detection head ω^(l) detects the time segments that may contain events, giving the output Q^l. Each element q_i^l of Q^l is decomposed into a raw center offset c_i, a raw length h_i and a raw confidence o_i, from which the center position c_i', the duration h_i' and the corresponding confidence o_i' of the finally predicted time segment are obtained as follows:
(c_i, h_i, o_i) = q_i^l    Formula (5)
c_i' = p_i + sigmoid(c_i)    Formula (6)
h_i' = a_i · exp(h_i)    Formula (7)
o_i' = sigmoid(o_i)    Formula (8)
where a_i is the duration of the i-th anchor and p_i is the temporal center position corresponding to the prediction q_i^l.
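A minimal sketch of decoding one raw prediction according to Formulas (6)-(8); the conversion from (center, duration) to (start, end) and the function name are assumptions for illustration.

```python
import math

def decode_prediction(c: float, h: float, o: float, anchor_duration: float, center_pos: float):
    """Turn a raw prediction (c_i, h_i, o_i) into (start, end, confidence)."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    center = center_pos + sigmoid(c)             # c_i' = p_i + sigmoid(c_i)   (Formula (6))
    duration = anchor_duration * math.exp(h)     # h_i' = a_i * exp(h_i)       (Formula (7))
    confidence = sigmoid(o)                      # o_i' = sigmoid(o_i)         (Formula (8))
    return center - duration / 2, center + duration / 2, confidence

# Example: an anchor of 12 seconds whose prediction is centred near the 8th second.
print(decode_prediction(0.3, -0.1, 1.5, anchor_duration=12.0, center_pos=8.0))
```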
3-2. The different events of the annotated dataset are assigned to different detection heads for detection, as follows:
ξ_{l-1} < d_j ≤ ξ_l    Formula (9)
where d_j denotes the duration of the j-th annotated event in the dataset; only annotated events whose duration lies between ξ_{l-1} and ξ_l are handled by the l-th detection head.
3-3. For the output value of the l-th detection head, the deviation from the actual value is measured with a loss function, as follows:
L_prop^l = α_1 · L_reg + α_2 · L_cls    Formula (10)
where α_1 and α_2 are two different weighting coefficients used to adjust the proportions of the two loss functions during training.
The loss of the event detection stage is obtained by adding the loss functions of all detection heads, as follows:
L_prop = Σ_{l=1}^{L} L_prop^l    Formula (11)
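The following sketch illustrates the routing of Formula (9) and the loss combination of Formulas (10)-(11). The concrete regression and classification losses (smooth L1 and binary cross-entropy) are assumptions, since the patent does not name them, and the default weights follow the values given later in the detailed description.

```python
import torch
import torch.nn.functional as F

def head_index(duration: float, thresholds=(0.0, 12.0, 36.0, 408.0)) -> int:
    """Formula (9): the l-th head handles events with xi_{l-1} < duration <= xi_l."""
    for l in range(1, len(thresholds)):
        if thresholds[l - 1] < duration <= thresholds[l]:
            return l
    return len(thresholds) - 1

def head_loss(pred_reg, gt_reg, pred_conf, labels, alpha1=1.0, alpha2=100.0):
    """Formula (10) for one detection head.
    pred_reg/gt_reg: (P, 2) predicted/target (center, duration) of the positive elements;
    pred_conf: (N,) confidences of all elements; labels: (N,) float 0/1 positive mask."""
    l_reg = F.smooth_l1_loss(pred_reg, gt_reg)              # assumed regression loss
    l_cls = F.binary_cross_entropy(pred_conf, labels)       # assumed classification loss
    return alpha1 * l_reg + alpha2 * l_cls

def proposal_loss(per_head_losses):
    """Formula (11): the event-detection loss is the sum over all detection heads."""
    return torch.stack(per_head_losses).sum()

# Example: an event of 20 s is routed to head 2 (durations between 12 s and 36 s).
print(head_index(20.0))
```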
In step (4), after the time segments that may contain events have been obtained, a description-generation decoder based on feature fusion is used to generate a description sentence for the event contained in each time segment, as follows:
4-1. After the masked video features are input into the video feature encoder of the description-generation stage, the output feature set X_cap of the encoder is obtained:
X_cap = {X_cap^1, X_cap^2, ..., X_cap^L}    Formula (12)
4-2. X_cap^L is regarded as F^(L), and the feature fusion operation generates F^(L-1):
F^(L-1) = X_cap^{L-1} ⊕ F^(L)    Formula (13)
where ⊕ denotes the element-wise addition of corresponding positions of two matrices. Then, in the same top-down manner, the fusion feature set F corresponding to X_cap is generated:
F = {F^(1), F^(2), ..., F^(L)}    Formula (14)
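A minimal sketch of the top-down fusion of Formulas (12)-(14), assuming (as stated later in the detailed description) that the per-level features all have the same shape.

```python
import torch

def fuse_top_down(x_cap):
    """x_cap: list [X_cap^1, ..., X_cap^L] of equally shaped tensors.
    Returns F = [F^(1), ..., F^(L)] with F^(L) = X_cap^L and F^(l) = X_cap^l + F^(l+1)."""
    fused = [None] * len(x_cap)
    fused[-1] = x_cap[-1]
    for l in range(len(x_cap) - 2, -1, -1):
        fused[l] = x_cap[l] + fused[l + 1]   # element-wise addition of corresponding positions
    return fused

# Example with L = 3 levels of shape (t, d) = (20, 512).
levels = [torch.randn(20, 512) for _ in range(3)]
print(len(fuse_top_down(levels)), fuse_top_down(levels)[0].shape)
```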
4-3. The l-th attention module of the decoder receives the fusion feature F^(l) of the corresponding level together with the text feature Y^(l) and produces Y^(l+1), as follows:
Y^(l+1) = Θ^(l)(Y^(l), F^(l))    Formula (15)
where the attention module Θ^(l) comprises three sub-modules, namely a self-attention sub-module φ(·), a multi-head attention sub-module ψ(·,·) and a feed-forward network sub-module FFN(·), as shown below:
φ(Y^(l)) = LN(MHA(Y^(l), Y^(l), Y^(l)) + Y^(l))    Formula (16)
ψ(Y^(l), F^(l)) = LN(MHA(φ(Y^(l)), F^(l), F^(l)) + φ(Y^(l)))    Formula (17)
Y^(l+1) = LN(FFN(ψ(Y^(l), F^(l))) + ψ(Y^(l), F^(l)))    Formula (18)
where LN(·) denotes the layer normalization operation (Layer Normalization).
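A minimal sketch of one decoder attention module corresponding to Formulas (15)-(18), assuming PyTorch's nn.MultiheadAttention and a causal self-attention mask; these implementation choices, the dimensions and the head count are illustrative assumptions rather than specifics fixed by the patent.

```python
import torch
import torch.nn as nn

class FusionDecoderLayer(nn.Module):
    """One decoder attention module: masked self-attention over the word features,
    cross-attention to the fused video feature F^(l), then an FFN, each wrapped with a
    residual connection and layer normalization."""
    def __init__(self, d: int = 512, h: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2, self.ln3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, y: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        # y: (B, n, d) word features; f: (B, t, d) fused video features of this level.
        n = y.shape[1]
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        a = self.ln1(self.self_attn(y, y, y, attn_mask=causal)[0] + y)      # Formula (16)
        b = self.ln2(self.cross_attn(a, f, f)[0] + a)                        # Formula (17)
        return self.ln3(self.ffn(b) + b)                                     # Formula (18)

# Example: one such layer consuming the fused feature of its level.
layer = FusionDecoderLayer()
print(layer(torch.randn(1, 7, 512), torch.randn(1, 20, 512)).shape)
```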
The invention has the following beneficial effects:
the invention relates to a video dense description algorithm based on a time sequence feature pyramid structure, which considers the different requirements of events with different durations on feature resolution on the basis of the prior method, and simultaneously detects the events possibly existing in the video by using a plurality of detection heads, so that the generated time segment set has higher accuracy and recall rate. Furthermore, the invention also provides proper global semantic information for fine-grained features by using a feature fusion mode, so that a decoder can generate more targeted descriptive statements.
The invention has the advantages of reasonable parameter quantity, obvious effect, contribution to more efficient distributed training and contribution to being deployed in specific hardware with limited memory.
Drawings
FIG. 1: candidate time slice generation module based on pyramid structure
FIG. 2: description generation decoder based on feature fusion
Detailed Description
The detailed parameters of the invention are described in more detail below.
As shown in fig. 1 and 2, the present invention provides a method for generating dense video description based on a time-series feature pyramid.
Step (1), the video and text feature extraction, is specified as follows:
1-1. For video processing, a complete video is cut into several blocks in units of 64 frames, i.e. a = 64.
1-2. For text processing, punctuation marks are first removed from the sentence and the initial letters are converted to lower case; the words are then fed into the trained GloVe model to obtain the feature representation of the sentence.
Step (2), the video feature encoder based on the local attention mechanism is responsible for encoding the video features, specified as follows:
2-1. The number of heads of the self-attention submodule is 8, i.e. h = 8. After the input feature X^l enters the self-attention submodule, attention weights over the different dimensions are calculated, normalized with a softmax function, and the input feature is then reconstructed according to the normalized weights.
2-2. The feed-forward network submodule consists of two fully connected linear layers, and a Dropout operation with a rate of 10% is applied to the output of each linear layer.
Step (3), the candidate segment generation module based on the pyramid structure is responsible for generating time segments possibly containing events, and the details are as follows:
and 3-1, for the output characteristics of the first (L is more than or equal to 1 and less than or equal to L) attention module of the encoder, detecting by using the first detection head. The number of anchor points used for detection in the detection head is 128, and the sizes of the anchor points are obtained by clustering in the data set according to the duration of all labeled events through a K-means algorithm.
3-2. Three detection heads are used to detect the events that may occur in the video simultaneously, i.e. L = 3. Different detection heads are responsible for events of different durations, and the division thresholds ξ_0, ξ_1, ξ_2, ξ_3 are set to 0, 12, 36 and 408 seconds, respectively.
3-3. A loss function is used to measure the deviation between the predicted value and the true value, and the parameters α_1 and α_2 that adjust the weights of the positive and negative samples are set to 1 and 100, respectively.
Step (4), the description-generation decoder based on feature fusion generates the corresponding description sentence for each event, specified as follows:
4-1. The number of attention modules in the encoder and the decoder used in the description-generation stage is 3. To avoid the differences between tasks affecting the model, a separate video feature encoder is used that has the same structure as the one in the event detection stage but independently trained parameters.
4-2. After the encoder output feature set is obtained, feature fusion is performed directly by adding corresponding positions, since the features have the same size.
4-3. Residual connections are used between the different sub-modules of the decoder, and a Dropout operation with a rate of 10% is applied to the output of each linear layer in the feed-forward network sub-module.

Claims (5)

1. A video dense description method based on a temporal feature pyramid, characterized by comprising the following steps:
step (1), data preprocessing: extracting features from the video and text data:
first, preprocessing the video V and extracting its features:
for an untrimmed video V, cutting it into t blocks in units of a frames; feeding the a frames of each block into an I3D model pre-trained on the Kinetics dataset for feature extraction, extracting features in the same way from the corresponding optical-flow maps, then aligning the two kinds of features in the time dimension and merging them, and obtaining a feature vector X representing the whole video after a trainable embedding matrix;
second, extracting features from the text information:
for a given sentence, removing the punctuation marks, feeding each word into a GloVe model to obtain its word embedding, and then using an embedding matrix to adaptively learn the weights of the different dimensions, thereby obtaining a feature vector Y representing the whole sentence;
step (2), feature encoding with a video feature encoder based on a local attention mechanism:
the video feature encoder consists of L attention modules, and each attention module contains a self-attention submodule MHA and a feed-forward network submodule FFN; the video feature X is input into the video feature encoder to obtain a set of features of different resolutions {X^1, X^2, ..., X^L}; the specific process is as follows:
first, the video feature X is regarded as X^0 and input in turn into the self-attention submodule MHA and the feed-forward network submodule FFN of the 1st attention module; in the self-attention submodule, a local attention mechanism is adopted to limit the receptive field of each position, so that each position of the output feature is reconstructed only from the neighbouring positions of the input feature, forming a local receptive field similar to that of a convolutional neural network; the feed-forward network submodule remaps the output feature to obtain the output X^1 of the 1st attention module; X^1 is then taken as the input of the 2nd attention module, and the process is repeated until the output X^L of the L-th attention module is obtained;
step (3), constructing a candidate segment generation module based on the feature pyramid structure:
first, the output features of the different attention modules are input into different detection heads; for the l-th detection head (1 ≤ l ≤ L), the output is Q^l, and for each element q_i^l in Q^l, the corresponding start time, end time and confidence score are obtained according to the sampling interval of the video features; the l-th detection head, based on the output feature of the l-th attention module of the encoder, is responsible for predicting events whose duration lies between ξ_{l-1} and ξ_l;
in the training phase of the model, the output of the candidate segment generation module is divided into two parts, the first part being the predicted event center position and event duration, which determine the start time and end time of the predicted time segment; for each annotated event, the element q_i^l of the output feature whose center position and anchor size match it most closely is selected for calculating the regression loss L_reg, where the regression loss function measures the deviation between the predicted value and the actual value; the second part is the predicted confidence, representing the possibility that the current time segment contains an event; the elements used for calculating the regression loss are regarded as positive samples and the rest as negative samples, and the classification loss L_cls is calculated over all samples; finally, the two losses are added to obtain the total loss L_prop^l of the l-th detection head in the event detection stage, and the loss L_prop of the event detection stage is obtained by adding the loss functions of all detection heads;
in the testing stage, after the different detection heads generate their candidate time-segment sets, all time segments are merged and sorted from high to low according to the corresponding confidence scores; a non-maximum suppression algorithm is then used to screen the time segments, obtaining a set of time segments whose confidence scores are higher than a set confidence threshold and whose mutual overlap is lower than a set overlap threshold; each retained time segment is considered to contain a specific event, so the visual features located within it are input into a decoder to generate the corresponding description sentence;
step (4), constructing a description-generation decoder based on feature fusion:
for each time segment generated by the candidate time segment generation module, the features of the original video feature X^0 lying outside the start time and the end time are masked, and the result is input into a video feature encoder to obtain the set X_cap of video features of different resolutions, on which the feature fusion operation is performed; to reduce the complexity of the model as much as possible, feature fusion is realized by element-wise addition of corresponding positions; the fused features are input into a decoder, which outputs the words of the predicted description sentence; finally, the loss between the predicted word distribution and the actual words is calculated, and the loss function updates the parameters of the model through a back-propagation algorithm; after several iterations, the model can generate a targeted description sentence for the event contained in each time segment.
2. The video dense description method based on a temporal feature pyramid according to claim 1, characterized in that the preprocessing of the video and the text in step (1) is specifically realized as follows:
1-1. all frames from the k·a-th frame to the (k+1)·a-th frame of the video are input into the I3D model to obtain an output feature vector x'_k; in addition, an optical-flow map is extracted for the k·a-th to (k+1)·a-th frames and input into the I3D model to obtain an output feature vector x''_k; x'_k and x''_k are concatenated to obtain the feature vector x_k (1 ≤ k ≤ t); after all frames of the whole video are processed in the same way and mapped with a trainable embedding matrix, the feature vector X = {x_1, x_2, ..., x_t} representing the whole video is obtained;
1-2. the b-th word (1 ≤ b ≤ n) of an annotated description sentence is converted into a One-Hot code according to its position in the vocabulary, the One-Hot code is input into the GloVe model to compress the feature dimensionality, and an embedding matrix is then used to adaptively learn the weights of the different dimensions, obtaining the feature vector y_b representing that word; each word of the sentence is processed in the same way, and the feature vector Y = {y_1, y_2, ..., y_n} representing the whole sentence is obtained.
3. The video dense description method based on a temporal feature pyramid according to claim 2, characterized in that the video feature encoder based on the local attention mechanism in step (2) consists of L attention modules, each containing a self-attention submodule MHA and a feed-forward network submodule FFN;
2-1. the self-attention submodule MHA is responsible for reconstructing the input feature, formulated as:
Z = MHA(X^l, X^l, X^l) = [head_1, head_2, ..., head_h] W^O    Formula (1)
head_i = Attention(X^l W_i^Q, X^l W_i^K, X^l W_i^V)    Formula (2)
Attention(Q, K, V) = (softmax(Q K^T / √d) ⊙ MASK) V    Formula (3)
where X^l denotes the input feature of the l-th attention module, W^O is the matrix used to map the output feature, W_i^Q, W_i^K and W_i^V are three different parameter matrices used to process the input feature, MASK is the mask matrix, ⊙ denotes element-wise multiplication of corresponding positions of two matrices, and Q, K, V are X^l W_i^Q, X^l W_i^K and X^l W_i^V, respectively;
2-2. the feed-forward network submodule FFN remaps the output feature of the self-attention submodule, formulated as:
X^{l+1} = FFN(Z) = max(0, Z W_1 + b_1) W_2 + b_2    Formula (4)
where W_1 and W_2 are two parameter matrices and b_1 and b_2 are two bias parameters.
4. The video dense description method based on a temporal feature pyramid according to claim 3, characterized in that step (3) is specifically as follows:
3-1. for the output feature of the l-th attention module of the encoder, the l-th detection head ω^(l) detects the time segments that may contain events to obtain the output Q^l; each element q_i^l of Q^l is decomposed into a raw center offset c_i, a raw length h_i and a raw confidence o_i, from which the center position c_i', the duration h_i' and the corresponding confidence o_i' of the finally predicted time segment are obtained as follows:
(c_i, h_i, o_i) = q_i^l    Formula (5)
c_i' = p_i + sigmoid(c_i)    Formula (6)
h_i' = a_i · exp(h_i)    Formula (7)
o_i' = sigmoid(o_i)    Formula (8)
where a_i is the duration of the i-th anchor and p_i is the temporal center position corresponding to the prediction q_i^l;
3-2. the different events of the annotated dataset are assigned to different detection heads for detection, as follows:
ξ_{l-1} < d_j ≤ ξ_l    Formula (9)
where d_j denotes the duration of the j-th annotated event in the dataset; only annotated events whose duration lies between ξ_{l-1} and ξ_l are handled by the l-th detection head;
3-3. the deviation between the output value of the l-th detection head and the actual value is measured with a loss function, as follows:
L_prop^l = α_1 · L_reg + α_2 · L_cls    Formula (10)
where α_1 and α_2 are two different weighting coefficients used to adjust the proportions of the two loss functions during training;
the loss of the event detection stage is obtained by adding the loss functions of all detection heads, as follows:
L_prop = Σ_{l=1}^{L} L_prop^l    Formula (11)
5. The video dense description method based on a temporal feature pyramid according to claim 4, characterized in that step (4) is specifically as follows:
4-1. after the masked video features are input into the video feature encoder of the description-generation stage, the output feature set X_cap of the encoder is obtained:
X_cap = {X_cap^1, X_cap^2, ..., X_cap^L}    Formula (12)
4-2. X_cap^L is regarded as F^(L), and the feature fusion operation generates F^(L-1):
F^(L-1) = X_cap^{L-1} ⊕ F^(L)    Formula (13)
where ⊕ denotes the element-wise addition of corresponding positions of two matrices; then, in the same top-down manner, the fusion feature set F corresponding to X_cap is generated:
F = {F^(1), F^(2), ..., F^(L)}    Formula (14)
4-3. the l-th attention module of the decoder receives the fusion feature F^(l) of the corresponding level together with the text feature Y^(l) and produces Y^(l+1), as follows:
Y^(l+1) = Θ^(l)(Y^(l), F^(l))    Formula (15)
where the attention module Θ^(l) comprises three sub-modules, namely a self-attention sub-module φ(·), a multi-head attention sub-module ψ(·,·) and a feed-forward network sub-module FFN(·), as shown below:
φ(Y^(l)) = LN(MHA(Y^(l), Y^(l), Y^(l)) + Y^(l))    Formula (16)
ψ(Y^(l), F^(l)) = LN(MHA(φ(Y^(l)), F^(l), F^(l)) + φ(Y^(l)))    Formula (17)
Y^(l+1) = LN(FFN(ψ(Y^(l), F^(l))) + ψ(Y^(l), F^(l)))    Formula (18)
where LN(·) denotes the layer normalization operation (Layer Normalization).
CN202110558847.1A 2021-05-21 2021-05-21 Video dense description generation method based on time sequence feature pyramid Active CN113392717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110558847.1A CN113392717B (en) 2021-05-21 2021-05-21 Video dense description generation method based on time sequence feature pyramid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110558847.1A CN113392717B (en) 2021-05-21 2021-05-21 Video dense description generation method based on time sequence feature pyramid

Publications (2)

Publication Number Publication Date
CN113392717A true CN113392717A (en) 2021-09-14
CN113392717B CN113392717B (en) 2024-02-13

Family

ID=77618939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110558847.1A Active CN113392717B (en) 2021-05-21 2021-05-21 Video dense description generation method based on time sequence feature pyramid

Country Status (1)

Country Link
CN (1) CN113392717B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929092A (en) * 2019-11-19 2020-03-27 国网江苏省电力工程咨询有限公司 Multi-event video description method based on dynamic attention mechanism
CN111814844A (en) * 2020-03-17 2020-10-23 同济大学 Intensive video description method based on position coding fusion

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113810730A (en) * 2021-09-17 2021-12-17 咪咕数字传媒有限公司 Real-time text generation method and device based on video and computing equipment
CN113810730B (en) * 2021-09-17 2023-08-01 咪咕数字传媒有限公司 Video-based real-time text generation method and device and computing equipment
CN114359768A (en) * 2021-09-30 2022-04-15 中远海运科技股份有限公司 Video dense event description method based on multi-mode heterogeneous feature fusion
WO2023050295A1 (en) * 2021-09-30 2023-04-06 中远海运科技股份有限公司 Multimodal heterogeneous feature fusion-based compact video event description method
CN114359768B (en) * 2021-09-30 2024-04-16 中远海运科技股份有限公司 Video dense event description method based on multi-mode heterogeneous feature fusion
CN114998673A (en) * 2022-05-11 2022-09-02 河海大学 Dam defect time sequence image description method based on local self-attention mechanism
CN114998673B (en) * 2022-05-11 2023-10-13 河海大学 Dam defect time sequence image description method based on local self-attention mechanism
WO2023217163A1 (en) * 2022-05-11 2023-11-16 华能澜沧江水电股份有限公司 Dam defect time-sequence image description method based on local self-attention mechanism
CN116578769A (en) * 2023-03-03 2023-08-11 齐鲁工业大学(山东省科学院) Multitask learning recommendation method based on behavior mode conversion
CN116578769B (en) * 2023-03-03 2024-03-01 齐鲁工业大学(山东省科学院) Multitask learning recommendation method based on behavior mode conversion

Also Published As

Publication number Publication date
CN113392717B (en) 2024-02-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant