CN114020964A - Method for implementing video summarization using a memory network and gated recurrent units - Google Patents

Method for implementing video summarization using a memory network and gated recurrent units

Info

Publication number
CN114020964A
Authority
CN
China
Prior art keywords
video
output
memory
input
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111346288.4A
Other languages
Chinese (zh)
Inventor
马然
苏敏
张冰
安平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202111346288.4A priority Critical patent/CN114020964A/en
Publication of CN114020964A publication Critical patent/CN114020964A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention provides a method for implementing video summarization using a memory network and gated recurrent units, which comprises the following steps: preprocessing a data set to obtain the video frame features of each video; designing an overall network model that takes the video frame features as input and the frame importance scores as output; training the overall network model; converting the frame importance scores obtained from the trained model into key shots; and combining the key shots into a dynamic video summary and testing the summarization result. In the invention, the overall network model performs feature processing on the input video, and the dynamic video summary is obtained using the frame importance scores, which improves the overall effectiveness of the model.

Description

Method for implementing video summarization using a memory network and gated recurrent units
Technical Field
The invention relates to the technical field of video summarization, and in particular to a method for implementing video summarization using a memory network and gated recurrent units.
Background
With the development of the internet and the emergence of various mobile portable shooting devices, people can record the people and things around them on video anytime and anywhere and upload the recordings to the internet, so the number of videos has grown explosively. This massive volume of video data inevitably puts pressure on video storage and management. Meanwhile, when users search for the information they need among a large number of related videos, they often have to browse the entire content of each video to filter out irrelevant information; although double-speed playback can accelerate browsing, it still costs considerable time and effort. Therefore, a means of extracting the important content of a video is urgently needed, and video summarization emerged to meet this need.
Video summarization technology condenses video content by extracting meaningful frames or meaningful shots from a video. It provides a way for people to efficiently access and manage massive video data, and relieves the pressure on video storage, transmission, archiving and retrieval caused by the explosive growth of online videos in recent years. At the same time, as a means for users to quickly browse and grasp the important content of a video, video summaries can greatly improve user experience.
The hierarchical structure of a video can generally be divided, from top to bottom, into video stream, scene, shot and frame, and starting from this hierarchy, video summaries can be divided into static and dynamic video summaries. A static video summary is composed of selected meaningful frames, while a dynamic video summary is composed of meaningful shots. For a static video summary, i.e. a summary in key-frame form, the summarization process can be divided into 3 steps: feature extraction, video content importance calculation and video summary generation. The features of the video frames are extracted first, the content importance of the frames is then calculated, and finally the important frames are selected according to their importance to form the static summary. For a dynamic video summary, i.e. a summary in key-shot form, the process can generally be divided into four steps: feature extraction, video shot segmentation, video content importance calculation and video summary generation. The frame features are extracted first, the video is then segmented into shots according to the extracted features, the content importance of each shot is calculated, and finally a certain number of shots are selected according to their importance to form the summary.
Early conventional video summarization methods mostly used hand-crafted features or clustering to decide whether to select video frames or video shots for the summary, e.g. Gygli M, Grabner H, Riemenschneider H, et al. Creating Summaries from User Videos. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T, editors. Computer Vision - ECCV 2014, Pt VI. Cham: Springer International Publishing AG, 2014: 505-520.
With the excellent performance of deep learning in various application fields, the first deep-learning-based video summarization work, Zhang K, Chao W-L, Sha F, et al. Video Summarization with Long Short-Term Memory [C]. Computer Vision - ECCV 2016, 2016: 766-782, appeared in 2016. The authors first model the variable-range temporal dependency between video frames using the long short-term memory network (LSTM), then estimate frame importance with a multilayer perceptron (MLP), and increase the visual diversity of the generated summary based on the determinantal point process (DPP).
Deep-learning-based video summarization can be classified into unsupervised, supervised and weakly supervised video summarization according to whether the algorithm needs supervision information and how strong that information is. Unsupervised methods evaluate the importance, representativeness, interestingness and so on of the video content using specific criteria, and then generate the video summary from the evaluation result. Hu M, Hu R, Wang X, et al. Unsupervised Temporal Attention Summarization Model for User Created Videos [C]. MultiMedia Modeling, 2021: 519-530 propose the temporal-attention-based summarization model TASM, in which more comprehensive information is obtained by combining the outputs of the encoder and decoder; a reward is designed to enhance the diversity of the selected key frames, and the deep deterministic policy gradient (DDPG) is adopted to train TASM. This network structure alleviates the high redundancy and information distortion between key frames found in existing user-video summaries. Supervised methods train a neural network on human-annotated data, so the model can usually capture video content with richer semantic information, and the resulting summaries are generally better than those of unsupervised methods. Zhao B, Li X, Lu X. TTH-RNN: Tensor-Train Hierarchical Recurrent Neural Network for Video Summarization [J]. IEEE Transactions on Industrial Electronics, 2021, 68(4): 3629-3637 propose the TTH-RNN network, which contains a Tensor-Train embedding layer to avoid an overly large mapping matrix from high-dimensional video features to the hidden layer, thereby reducing training difficulty, and uses hierarchical RNNs to explore long-range temporal correlations between video frames. Although supervised methods generally outperform unsupervised ones, the existing video summarization datasets are few in number and small in scale, and building a new large-scale dataset is time-consuming and laborious. A VESD (Variational Encoder-Summarizer-Decoder) model has been proposed, which consists of two parts, a variational encoder and an encoder-attention-decoder: the variational encoder learns latent semantics from web videos, and the encoder-attention-decoder performs importance evaluation and summary generation. Most of these methods use convolutional or recurrent neural networks to acquire the spatio-temporal information of the video, and their ability to acquire long-range memory is poor.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a method for implementing video summarization using a memory network and gated recurrent units.
According to one aspect of the present invention, there is provided a method for implementing video summarization using a memory network and gated recurrent units, comprising:
preprocessing a data set to obtain video frame characteristics of a video;
designing an overall network model which takes the video frame features as input and the frame importance scores as output;
training the whole network model;
converting the frame importance scores output from the trained overall network model into key shots;
and combining the key shots into a dynamic video abstract.
Preferably, preprocessing the data set to obtain the video frame features of the video comprises:
sampling a video into a sequence of video frames;
and extracting features of the video frame sequence using a pre-trained GoogLeNet model to obtain the video frame features corresponding to each video.
Preferably, the overall network model with the video frame features as input and the frame importance scores as output comprises:
an input module for fusing the forward and backward information of the input video frame features;
a memory module, which takes the fused result output by the input module as its input, repeatedly retrieves information in memory using a multi-hop structure, and outputs a memory output that maximally retains useful information;
and an output module, the memory output of the memory module being the input of the output module, the output module regressing the memory output to the frame importance scores.
preferably, the input module is configured to process the input video frame feature sequence using a layer of bidirectional GRUs, including:
the forward output of the GRU contains past information of the current time;
backward output of the GRU comprises future information of the current moment;
and the bidirectional GRU fuses and adds the forward output and the backward output element by element to obtain complete past and future information of the current moment.
Preferably, constructing the memory module comprises:
the coding vector obtained by passing the fused sequence output by the bidirectional GRU through a fully connected layer is used as the original problem vector and input into the first hop;
respectively inputting the fusion sequence output by the bidirectional GRU to two different full-connection layers of the first hop to obtain input memory and output memory;
the problem vector and the input memory are subjected to inner product, and self-attention weight is obtained through Softmax;
the self-attention weights and the output memory are combined by weighted summation, the result is multiplied element by element with the problem vector, and the memory output is obtained through a fully connected layer with a normalization layer and an activation function;
inputting the memory output as a new problem vector into the next hop to calculate self-attention weight and memory output; until the memory output of the last hop is used as the input of the output module.
Preferably, between adjacent hops, the fully connected matrix of the output memory of one hop is equal to the fully connected matrix of the input memory of the next hop; and the fully connected matrix of the original problem vector is equal to the fully connected matrix of the input memory of the first hop.
Preferably, constructing the output module comprises:
the memory output is residually connected with the fused sequence output by the bidirectional GRU;
sequentially passing the output of the residual connection through an LN layer and a Dropout layer;
and inputting the output result of the Dropout layer into a two-layer fully-connected network, wherein the node number of the output layer of the last fully-connected layer is 1, and the frame importance score corresponding to each video frame is obtained.
Preferably, the training the whole network model includes:
the network model is trained using the mean square error as a loss function, and an Adam optimizer.
Preferably, converting the frame importance scores obtained from the trained overall network model into key shots comprises:
detecting scene change points using a KTS algorithm, regarding the video clip between two adjacent scene change points as a video shot, and dividing the video into a plurality of video shots;
for each video shot, calculating the average value of the importance scores of all frames in the video shot as a shot score;
and executing a knapsack algorithm to select key shots for the plurality of divided shots and the shot scores.
Preferably, the knapsack algorithm maximizes the total score of the key shots, whose total length needs to be limited to 15% of the original video length.
Compared with the prior art, the invention has the following beneficial effects:
the method for realizing video abstraction by using the memory network and the gate control cycle unit has the advantages that the overall network model carries out feature processing on input videos, and dynamic video abstraction is obtained by using the frame importance scores, so that the overall effectiveness of the model is improved.
Aiming at the deficiency that convolutional neural networks and recurrent neural networks cannot acquire long-range memory, the method uses the memory network to repeatedly retrieve information in memory, so that useful information is retained to the maximum extent. Meanwhile, a bidirectional GRU is used in the input module to fuse the forward and backward information of the video, which improves the F1 score of the model. Experiments on the two benchmark datasets TVSum and SumMe verify the effectiveness of the method of the invention.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flowchart of a method for implementing video summarization using a memory network and gated recurrent units according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an overall network model according to a preferred embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an input module according to a preferred embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a memory module according to a preferred embodiment of the present invention;
fig. 5 is a schematic structural diagram of an output module according to a preferred embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments. The following embodiments will assist those skilled in the art in further understanding the invention, but do not limit the invention in any way. It should be noted that persons skilled in the art can make variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Fig. 1 shows a flowchart of a method for implementing video summarization using a memory network and gated recurrent units according to an embodiment of the present invention; the method comprises:
S1, preprocessing the data set to obtain the video frame features of the video;
S2, designing an overall network model with the video frame features as input and the frame importance scores as output;
S3, training the overall network model;
S4, converting the frame importance scores obtained from the trained overall network model into key shots;
and S5, combining the key shots into a dynamic video summary and testing the video summarization result.
For better data processing, the present invention provides a preferred embodiment to perform S1. The video is first sampled into a sequence of video frames at a sampling rate of 2 fps, and the features of the frame sequence are then extracted using the pool5 layer of a GoogLeNet model pre-trained on the large-scale image dataset ImageNet. The feature dimension of each video frame is 1024, so that a feature vector sequence {Xk} is obtained for each video. The length of {Xk} is N, where N denotes the number of video frames sampled from the video; the dimension of each vector Xk is D, where D denotes the feature dimension of the video frame.
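As an illustration of this preprocessing step, the following is a minimal sketch assuming PyTorch with a recent torchvision (whose pre-trained GoogLeNet exposes the 1024-dimensional global-average-pooling output corresponding to pool5) and OpenCV for frame decoding; the function name extract_features and the sampling logic are illustrative, not taken from the patent.

```python
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained GoogLeNet; replacing the classifier with Identity exposes
# the 1024-d output of the global average pooling ("pool5") layer.
net = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
net.fc = torch.nn.Identity()
net.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(video_path, fps=2):
    """Sample the video at `fps` frames per second; return an (N, 1024) tensor {Xk}."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / fps)), 1)
    feats, idx = [], 0
    with torch.no_grad():
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                feats.append(net(preprocess(rgb).unsqueeze(0)).squeeze(0))
            idx += 1
    cap.release()
    return torch.stack(feats)
```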
The present invention provides a preferred embodiment for performing S2. This embodiment aims to use the memory network to retain useful information to the maximum extent, while fusing, in the input module, the forward and backward video information at each time step using a Bidirectional Gated Recurrent Unit (BiGRU). Fig. 2 is a schematic diagram of the overall network model structure of this embodiment. As can be seen, the network model is divided into 3 modules: an input module, a memory module and an output module. The specific construction process of the 3 modules is as follows:
s21, constructing an input module, as shown in fig. 3, which is a schematic structural diagram of the input module. The input module is used for fusing the front and back information of the input video frame.
A layer of bidirectional GRU is selected to process the input sequence, and the output feature dimension is still 1024, as shown in formula (1):

Vk = →hk + ←hk (1)

where →hk denotes the hidden state of the forward GRU at time step k, and ←hk denotes the hidden state of the backward GRU at time step k; their element-wise sum Vk serves as the output of the input module and contains both past and future information of the current video frame.
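As a sketch of this input module (assuming PyTorch; the class name, and the choice of a hidden size equal to the feature dimension so that the element-wise sum keeps 1024 dimensions, are the only assumptions beyond the description above):

```python
import torch
import torch.nn as nn

class InputModule(nn.Module):
    def __init__(self, feat_dim=1024):
        super().__init__()
        # One bidirectional GRU layer; hidden size equals feat_dim so the
        # fused output Vk keeps the 1024-d feature dimension.
        self.bigru = nn.GRU(feat_dim, feat_dim, num_layers=1,
                            batch_first=True, bidirectional=True)

    def forward(self, x):               # x: (B, N, 1024) frame features {Xk}
        h, _ = self.bigru(x)            # (B, N, 2 * 1024)
        fwd, bwd = h.chunk(2, dim=-1)   # forward / backward hidden states
        return fwd + bwd                # Vk = →hk + ←hk, formula (1)
```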
S22, constructing a memory module, as shown in FIG. 4, which is a schematic structural diagram of the memory module. The function of the memory module is to repeatedly retrieve information in memory using a multi-hop structure to obtain a memory output {ok} that maximally retains useful information. The memory here refers to the input memory {ak} and the output memory {bk} in FIG. 4, which are obtained by mapping the fused forward-backward information of the video frame features through different fully connected layers.
The output {Vk} of the bidirectional GRU is passed through 3 different fully connected layers to obtain the problem vector {Qk}, the input memory {ak} and the output memory {bk}, with a Dropout layer added after each fully connected layer; temporal information coding matrices TA and TB are then added to the input memory {ak} and the output memory {bk}, respectively. The Dropout layers alleviate overfitting, and the temporal coding matrices are added because the video frame sequence is a time series and temporal encoding can improve model performance, i.e.:
Qk=dropout(linearQ(Vk)) (2)
ak=dropout(linearA(Vk))+TA(k) (3)
bk=dropout(linearB(Vk))+TB(k) (4)
An inner product of the problem vector {Qk} and the input memory {ak} is computed and passed through a Softmax layer to obtain the self-attention weight pk, as shown below:

pk = Softmax(Qᵀak) (5)

where Q denotes the matrix composed of the N vectors Qk.
The self-attention weights {pk} are used to compute a weighted sum of the output memory {bk}; the result is multiplied element by element with the problem vector Qk, and the memory output {ok} is obtained through a fully connected layer with an LN layer and a ReLU activation function, as shown below:

ok = relu(layernorm(linear((Σj pk,j bj) ⊙ Qk))) (6)

The memory output is then used as the new problem vector for computing the weights of the next hop:

Qk = ok (7)
the formulas (3) to (7) are a hop.
In order to reduce the number of parameters, the present invention provides a preferred embodiment in which, between adjacent hops, the fully connected matrix of the input memory of the next hop is equal to the fully connected matrix of the output memory of the current hop, i.e.:
linearA(k+1)=linearB(k) (8)
Furthermore, the fully connected matrix of the original problem vector is equal to the fully connected matrix of the input memory of the first hop.
The number of neurons in the output layers of the fully connected layers in the memory module is embed_size = 512, so the size of the output memory {ok} is embed_size × N.
S23, constructing an output module, as shown in fig. 5, which is a schematic structural diagram of the output module. The output module is used for regressing the memory output to the frame importance scores {sk}.
First, the memory output {ok} and the output {Vk} of the bidirectional GRU are connected residually; since {ok} and {Vk} have different dimensions, a linear mapping must be applied to Vk to match the dimensions. The output of the residual connection passes through the LN layer and the Dropout layer in turn and is then input into a two-layer fully connected network. The first layer consists of a fully connected layer, an LN layer, a ReLU layer and a Dropout layer, and the output dimension of its fully connected layer is embed_size. The second layer consists of a fully connected layer and a Sigmoid layer; the output dimension of the fully connected layer is 1, and the Sigmoid limits the output to between 0 and 1, thereby yielding the frame importance scores {sk}.
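A minimal sketch of this output module, assuming PyTorch (the Dropout probability and class names are assumptions):

```python
import torch
import torch.nn as nn

class OutputModule(nn.Module):
    def __init__(self, feat_dim=1024, embed_size=512, p_drop=0.5):
        super().__init__()
        self.match = nn.Linear(feat_dim, embed_size)  # maps Vk to ok's dimension
        self.norm = nn.LayerNorm(embed_size)
        self.drop = nn.Dropout(p_drop)
        self.fc1 = nn.Sequential(                     # first FC layer
            nn.Linear(embed_size, embed_size), nn.LayerNorm(embed_size),
            nn.ReLU(), nn.Dropout(p_drop))
        self.fc2 = nn.Sequential(nn.Linear(embed_size, 1), nn.Sigmoid())

    def forward(self, o, v):            # o: (N, embed_size), v: (N, feat_dim)
        x = o + self.match(v)           # residual connection with matched dims
        x = self.drop(self.norm(x))     # LN layer, then Dropout layer
        return self.fc2(self.fc1(x)).squeeze(-1)  # frame scores {sk} in (0, 1)
```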
The present invention provides an embodiment performing S3. In this embodiment, the mean square error is used as the loss function. Specifically, the mean square error loss function is as follows:
loss(si, si′) = (si − si′)² (9)
where si and si′ denote the frame importance score output by the network and the ground-truth frame importance score, respectively. The mean square error is one of the most commonly used loss functions in regression problems.
In this embodiment, the batch size is set to 1, and an Adam optimizer with a learning rate of 0.0001 and a weight decay coefficient of 0.00001 is used to train the network.
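A minimal training-loop sketch under these hyperparameters (batch size 1, MSE loss, Adam with learning rate 1e-4 and weight decay 1e-5); model and dataset are placeholders for the network and data described above:

```python
import torch
import torch.nn as nn

def train(model, dataset, epochs=50):   # epoch count is an assumption
    criterion = nn.MSELoss()            # mean square error, formula (9)
    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=1e-4, weight_decay=1e-5)
    model.train()
    for _ in range(epochs):
        for feats, gt_scores in dataset:   # one video per step (batch size 1)
            pred = model(feats)            # predicted frame importance scores
            loss = criterion(pred, gt_scores)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```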
Based on the above embodiments, the present invention provides a preferred embodiment executing S4, in which the frame importance scores are converted into key shots. This conversion requires two steps.
In the first step, scene change points are detected using the KTS algorithm; the video clip between two adjacent scene change points is regarded as a video shot, dividing the video into a plurality of shots, and for each shot the average of the importance scores of all frames in the shot is computed as the shot score.
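Assuming the KTS change points are already available, the shot-scoring part of this first step can be sketched as follows; the function name and the (start, end, score) representation are illustrative:

```python
import numpy as np

def shot_scores(frame_scores, change_points):
    """Average the frame importance scores within each shot.

    frame_scores: (N,) array of per-frame scores.
    change_points: sorted frame indices where scenes change, starting at 0.
    Returns a list of (start, end, score) tuples, one per shot.
    """
    bounds = list(change_points) + [len(frame_scores)]
    return [(s, e, float(np.mean(frame_scores[s:e])))
            for s, e in zip(bounds[:-1], bounds[1:])]
```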
In the second step, given the divided shots and their shot scores, a knapsack algorithm is executed to select key shots so as to maximize the total score of the key shots, with their total duration limited to 15% of the original video duration.
Specifically, the knapsack algorithm is formulated as follows:

max Σi=1..K ui yi  s.t.  Σi=1..K ui li ≤ L, ui ∈ {0, 1} (10)

where yi denotes the score of the i-th shot, ui indicates whether the i-th shot is selected as a key shot, li is the length of the i-th shot, L is equal to 15% of the original video duration, and K is the number of video shots.
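The selection itself is a standard 0/1 knapsack; below is a minimal dynamic-programming sketch in which shot lengths are measured in frames (an implementation assumption) and the budget L is 15% of the total frame count:

```python
def select_key_shots(shots, n_frames, budget_ratio=0.15):
    """shots: list of (start, end, score); returns indices of selected shots."""
    capacity = int(n_frames * budget_ratio)          # L in formula (10)
    lengths = [end - start for start, end, _ in shots]
    values = [score for _, _, score in shots]
    k = len(shots)
    dp = [0.0] * (capacity + 1)                      # best score at capacity c
    keep = [[False] * (capacity + 1) for _ in range(k)]
    for i in range(k):                               # 0/1 knapsack recurrence
        for c in range(capacity, lengths[i] - 1, -1):
            if dp[c - lengths[i]] + values[i] > dp[c]:
                dp[c] = dp[c - lengths[i]] + values[i]
                keep[i][c] = True
    selected, c = [], capacity                       # backtrack the choices
    for i in range(k - 1, -1, -1):
        if keep[i][c]:
            selected.append(i)
            c -= lengths[i]
    return sorted(selected)
```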
Based on S4 in the above embodiment, S5 is performed to combine the key shots into a dynamic summary, and the F1 score between the generated summary and the ground-truth summary is calculated.
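Assuming both summaries are represented as binary per-frame selection masks (the usual protocol on TVSum and SumMe), the F1 computation can be sketched as follows; this overlap-based definition is standard practice rather than quoted from the patent:

```python
import numpy as np

def f1_score(pred_mask, gt_mask):
    """pred_mask, gt_mask: (N,) binary arrays marking selected frames."""
    overlap = float(np.sum(pred_mask * gt_mask))     # jointly selected frames
    if overlap == 0:
        return 0.0
    precision = overlap / pred_mask.sum()
    recall = overlap / gt_mask.sum()
    return 2 * precision * recall / (precision + recall)
```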
To verify the effectiveness of the above embodiments, the invention applies them in practice and compares them with other methods. Specifically, 4 datasets are used: TVSum and SumMe, plus the auxiliary datasets YouTube and OVP. Canonical and Augmented denote two different dataset settings: Canonical uses 80% of the current dataset as the training set and 20% as the test set; Augmented uses 80% of the current dataset plus the remaining 3 datasets as the training set and 20% of the current dataset as the test set. For example, when the current dataset is SumMe, the training set consists of YouTube, OVP, TVSum and 80% of SumMe, and the test set consists of the remaining 20% of SumMe. The comparison of summarization results with other algorithms is shown in Table 1. The results in Table 1 are F1 scores (in %); the larger the F1 score, the better the performance of the algorithm. The last column is the average performance of each algorithm, and the effectiveness of the method of the invention can be seen from Table 1.
[Table 1: comparison of F1 scores (%) between the method of the invention and other algorithms on TVSum and SumMe under the Canonical and Augmented settings; table image not reproduced]
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The above-described preferred features may be used in any combination without conflict with each other.

Claims (10)

1. A method for implementing video summarization using a memory network and gated recurrent units, characterized by comprising:
preprocessing a data set to obtain video frame characteristics of a video;
designing an overall network model which takes the video frame features as input and the frame importance scores as output;
training the whole network model;
converting the frame importance scores output from the trained overall network model into key shots;
and combining the key shots into a dynamic video abstract.
2. The method according to claim 1, wherein preprocessing the data set to obtain the video frame features of the video comprises:
sampling a video into a sequence of video frames;
and extracting features of the video frame sequence using a pre-trained GoogLeNet model to obtain the video frame features corresponding to each video.
3. The method for implementing video summarization using a memory network and gated recurrent units according to claim 1, wherein the overall network model with the video frame features as input and the frame importance scores as output comprises:
an input module for fusing the forward and backward information of the input video frame features;
a memory module which takes the fused result output by the input module as its input; the memory module repeatedly retrieves the information in memory using a multi-hop structure and outputs a memory output that retains useful information;
and an output module, the memory output of the memory module being used as the input of the output module, the output module regressing the memory output to the frame importance scores.
4. The method according to claim 3, wherein the input module is configured to process the input video frame feature sequence using a layer of bidirectional GRUs, comprising:
the forward output of the GRU contains past information of the current time;
backward output of the GRU comprises future information of the current moment;
and the bidirectional GRU fuses and adds the forward output and the backward output element by element to obtain complete past and future information of the current moment.
5. The method for implementing video summarization using a memory network and gated recurrent units according to claim 3 or 4, wherein constructing the memory module comprises:
the coding vector obtained by passing the fused sequence output by the bidirectional GRU through a fully connected layer is used as the original problem vector and input into the first hop;
respectively inputting the fusion sequence output by the bidirectional GRU to two different full-connection layers of the first hop to obtain input memory and output memory;
the problem vector and the input memory are subjected to inner product, and self-attention weight is obtained through Softmax;
the self-attention weights and the output memory are combined by weighted summation, the result is multiplied element by element with the problem vector, and the memory output is obtained through a fully connected layer with a normalization layer and an activation function;
inputting the memory output as a new problem vector into the next hop to calculate self-attention weight and memory output; until the memory output of the last hop is used as the input of the output module.
6. The method for implementing video summarization using a memory network and gated recurrent units according to claim 5, wherein, between adjacent hops, the fully connected matrix of the output memory of one hop is equal to the fully connected matrix of the input memory of the next hop; and the fully connected matrix of the original problem vector is equal to the fully connected matrix of the input memory of the first hop.
7. The method for implementing video summarization using a memory network and gated recurrent units according to claim 3, wherein constructing the output module comprises:
the memory output is residually connected with the fused sequence output by the bidirectional GRU;
sequentially passing the output of the residual connection through an LN layer and a Dropout layer;
and inputting the output result of the Dropout layer into a two-layer fully-connected network, wherein the node number of the output layer of the last fully-connected layer is 1, and the frame importance score corresponding to each video frame is obtained.
8. The method according to claim 1, wherein training the overall network model comprises:
the network model is trained using the mean square error as a loss function, and an Adam optimizer.
9. The method for implementing video summarization using a memory network and gated recurrent units according to claim 1, wherein converting the frame importance scores obtained from the trained overall network model into key shots comprises:
detecting scene change points using a KTS algorithm, regarding the video clip between two adjacent scene change points as a video shot, and dividing the video into a plurality of video shots;
for each video shot, calculating the average value of the importance scores of all frames in the video shot as a shot score;
and executing a knapsack algorithm to select key shots for the plurality of divided shots and the shot scores.
10. The method according to claim 9, wherein the knapsack algorithm aims to maximize the total score of the key shots and limits the total length of the key shots to 15% of the original video length.
CN202111346288.4A 2021-11-15 2021-11-15 Method for implementing video summarization using a memory network and gated recurrent units Pending CN114020964A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111346288.4A CN114020964A (en) 2021-11-15 2021-11-15 Method for implementing video summarization using a memory network and gated recurrent units

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111346288.4A CN114020964A (en) 2021-11-15 2021-11-15 Method for implementing video summarization using a memory network and gated recurrent units

Publications (1)

Publication Number Publication Date
CN114020964A true CN114020964A (en) 2022-02-08

Family

ID=80064038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111346288.4A Pending CN114020964A (en) Method for implementing video summarization using a memory network and gated recurrent units

Country Status (1)

Country Link
CN (1) CN114020964A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114979801A (en) * 2022-05-10 2022-08-30 上海大学 Dynamic video abstraction algorithm and system based on bidirectional convolution long-short term memory network
CN115002559A (en) * 2022-05-10 2022-09-02 上海大学 Video abstraction algorithm and system based on gated multi-head position attention mechanism
CN115002559B (en) * 2022-05-10 2024-01-05 上海大学 Video abstraction algorithm and system based on gating multi-head position attention mechanism
CN116894115A (en) * 2023-06-12 2023-10-17 国网湖北省电力有限公司经济技术研究院 Automatic archiving method for power grid infrastructure files
CN116894115B (en) * 2023-06-12 2024-05-24 国网湖北省电力有限公司经济技术研究院 Automatic archiving method for power grid infrastructure files
CN117376502A (en) * 2023-12-07 2024-01-09 翔飞(天津)智能科技有限公司 Video production system based on AI technology
CN117376502B (en) * 2023-12-07 2024-02-13 翔飞(天津)智能科技有限公司 Video production system based on AI technology

Similar Documents

Publication Publication Date Title
CN109492157B (en) News recommendation method and theme characterization method based on RNN and attention mechanism
Jiang et al. Exploiting feature and class relationships in video categorization with regularized deep neural networks
Pan et al. Hierarchical recurrent neural encoder for video representation with application to captioning
Zhao et al. TTH-RNN: Tensor-train hierarchical recurrent neural network for video summarization
CN114020964A (en) Method for implementing video summarization using a memory network and gated recurrent units
Wang et al. Collaborative deep learning for recommender systems
Xiao et al. Convolutional hierarchical attention network for query-focused video summarization
CN111291261A (en) Cross-domain recommendation method integrating label and attention mechanism and implementation system thereof
Wang et al. Towards accurate and interpretable sequential prediction: A CNN & attention-based feature extractor
Shi et al. Star: Sparse transformer-based action recognition
CN113177141B (en) Multi-label video hash retrieval method and device based on semantic embedded soft similarity
CN111222039B (en) Session recommendation method and system based on long-term and short-term interest combination
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN114519145A (en) Sequence recommendation method for mining long-term and short-term interests of users based on graph neural network
CN113761359B (en) Data packet recommendation method, device, electronic equipment and storage medium
CN112434159A (en) Method for classifying thesis multiple labels by using deep neural network
Khoali et al. Advanced recommendation systems through deep learning
CN112699310A (en) Cold start cross-domain hybrid recommendation method and system based on deep neural network
CN113326384A (en) Construction method of interpretable recommendation model based on knowledge graph
CN115695950A (en) Video abstract generation method based on content perception
CN116975615A (en) Task prediction method and device based on video multi-mode information
Li et al. Application of Dual‐Channel Convolutional Neural Network Algorithm in Semantic Feature Analysis of English Text Big Data
Guo [Retracted] Intelligent Sports Video Classification Based on Deep Neural Network (DNN) Algorithm and Transfer Learning
CN114996566A (en) Intelligent recommendation system and method for industrial internet platform
CN116932862A (en) Cold start object recommendation method, cold start object recommendation device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination