CN116992079A - Multi-mode video abstract extraction method based on video captions - Google Patents


Info

Publication number
CN116992079A
CN116992079A (application CN202310767163.1A)
Authority
CN
China
Prior art keywords
video
ith
frame
mth
caption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310767163.1A
Other languages
Chinese (zh)
Inventor
胡珍珍
王振山
宋子杰
洪日昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202310767163.1A priority Critical patent/CN116992079A/en
Publication of CN116992079A publication Critical patent/CN116992079A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/20Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Studio Circuits (AREA)

Abstract

The invention discloses a multi-mode video abstract extraction method based on video captions, which comprises the following steps: 1, acquiring a frame feature representation of a video; 2, acquiring a feature representation of the captions; 3, performing automated video frame importance assessment; 4, constructing and optimizing a summarizer model; 5, selecting key frames with the trained summarizer; and 6, optimizing a key-frame-based video caption generator. The invention can rapidly output the key frame set of a short video together with the corresponding captions: the key frame set reflects the whole content of the video in visual form with a small number of video frames, and the matched captions summarize the video pictures in text form. This helps users screen short videos more efficiently, saves storage space and computing resources, and is more favorable for deployment and application on terminal devices.

Description

Multi-mode video abstract extraction method based on video captions
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to a multi-mode video abstract extraction method based on video captions.
Background
The explosive growth of short-video social software and self-media has led to a surge of internet videos, so how to quickly acquire the key information in a video has become an important problem. The goal of the video summarization task is to retrieve key frames or video clips, such as key shots, that contain as much information as possible with minimal redundancy. One straightforward application of video summarization is the cover display of videos on video websites, where a reasonable summary segment can help the user decide whether to click on the video. The specificities of the video summarization task, such as the strong subjectivity of the results, the great difficulty of annotating datasets and the variation of video resolution, pose great challenges to the improvement of video summarization technology.
The difficulty of annotating datasets results in a shortage of high-quality datasets in the field of video summarization, and conventional video summarization methods, such as MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization published by Xu et al. in 2022, tend to be based on the TVSum and SumMe datasets. For example, the TVSum dataset has 20 annotators score the importance of each frame for every video and contains 50 videos, while SumMe consists of key video segments selected by 15 to 20 annotators and contains only 20 videos. The cost of manually annotating a large-scale video summarization dataset is enormous and therefore impractical, so previous work has generally selected several low-quality datasets as supplementary training data. How to train a high-quality video summarization model with existing datasets without additional annotation cost, and how to use the summarized video frames in a reasonable way, remain problems to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a multi-mode video abstract extraction method based on video captions, which can output the video abstract and the video captions simultaneously, thereby helping users screen short videos more effectively, saving storage space and computing resources, and being more favorable for deployment and application on terminal devices.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
the invention discloses a multimode video abstract extraction method based on video captions, which is characterized by comprising the following steps:
step 1, acquiring frame characteristic representation of a video:
for a video subtitle data set D= { V, Y }, wherein V represents a video set and Y represents an English subtitle sentence set corresponding to each video in the video set V;
processing any ith video in the video set V with the visual encoder of the CLIP model to obtain the frame feature representation F_i = {f_{i,1}, f_{i,2}, ..., f_{i,n}, ..., f_{i,N}} of the ith video, wherein f_{i,n} represents the nth frame feature representation of the ith video and N represents the total number of frames of video i;
step 2, acquiring the characteristic representation of the caption:
adopting the text encoder of the CLIP model to process the English caption sentences Y_i = {y_{i,1,1}, ..., y_{i,1,W}; ...; y_{i,m,1}, y_{i,m,2}, ..., y_{i,m,t}, ..., y_{i,m,W}; ...; y_{i,M,1}, ..., y_{i,M,W}} corresponding to the ith video, thereby obtaining the English caption text vectors T_i = {t_{i,1}, t_{i,2}, ..., t_{i,m}, ..., t_{i,M}} corresponding to video i, wherein y_{i,m,t} represents the tth word in the mth caption sentence corresponding to the ith video, t_{i,m} represents the mth caption vector of the English caption sentences corresponding to the ith video, M represents the total number of caption sentences, and W represents the total number of words in a caption sentence;
Step 3, obtaining, by formula (1), the average similarity s(f_{i,n}, T_i) between the nth frame feature representation f_{i,n} of the ith video and the caption text vectors T_i, and taking it as the automated score s_{i,n} of the nth frame feature f_{i,n} of video i:
s(f_{i,n}, T_i) = (1/M) Σ_{m=1}^{M} (f_{i,n})^{Tr} t_{i,m}   (1)
In formula (1), Tr represents vector transposition;
Step 4, constructing a video summarizer comprising a self-attention mechanism layer, a local attention enhancement layer and a fully-connected network MLP, and training it;
Step 4.1, the self-attention mechanism layer calculates, by formula (2), the cross-relation score r(f_{i,n}, f_{i,j}) between the nth frame feature representation f_{i,n} and the jth frame feature representation f_{i,j} of the ith video:
r(f_{i,n}, f_{i,j}) = P × tanh(W_1 f_{i,n} + W_2 f_{i,j} + b)   (2)
In formula (2), P, W_1 and W_2 are three parameter matrices to be learned, b is a bias vector, and tanh represents an activation function;
Step 4.2, the local attention enhancement layer calculates, by formula (3), the locally attention-enhanced video frame feature f̃_{i,n} for the nth frame feature representation f_{i,n} of the ith video, thereby obtaining the locally attention-enhanced feature representation F̃_i = {f̃_{i,1}, ..., f̃_{i,N}} of the ith video:
f̃_{i,n} = Σ_{j=1}^{N} α_{i,n,j} ⊙ f_{i,j}   (3)
In formula (3), α_{i,n,j} represents the relation weight between the jth frame feature representation f_{i,j} and the nth frame feature representation f_{i,n} of the ith video, ⊙ represents element-wise multiplication of vectors, and the relation weight is given by formula (4):
α_{i,n,j} = exp(r(f_{i,n}, f_{i,j})) / Σ_{j'=1}^{N} exp(r(f_{i,n}, f_{i,j'}))   (4)
Step 4.3, the fully-connected network MLP calculates, by formula (5), the prediction score p_{i,n} of the nth frame feature representation f_{i,n} of the ith video:
p_{i,n} = MLP(f̃_{i,n} + f_{i,n})   (5)
In formula (5), the MLP uses the GeLU activation function, and + represents the residual connection between the locally enhanced feature and the original frame feature;
Step 4.4, constructing the binary cross-entropy loss L_vsum by formula (7):
L_vsum = -(1/(B×N)) Σ_{i=1}^{B} Σ_{n=1}^{N} [ s_{i,n} log p_{i,n} + (1 - s_{i,n}) log(1 - p_{i,n}) ]   (7)
In formula (7), B represents the number of videos in the video subtitle data set D;
in the first training stage, training the video summarizer with the back-propagation and gradient-descent method based on the video caption data set D, and stopping training when the binary cross-entropy loss L_vsum reaches its minimum, thereby obtaining the trained video summarizer model;
Step 5, inputting the frame feature representation F_i = {f_{i,1}, f_{i,2}, ..., f_{i,n}, ..., f_{i,N}} of the ith video into the trained video summarizer model, and selecting the top K frame feature representations with the highest prediction scores to form the optimal video frame set F*_i = {f*_{i,1}, ..., f*_{i,k}, ..., f*_{i,K}}, wherein f*_{i,k} represents the kth optimal frame feature representation of the ith video and K represents the number of selected optimal video frames;
Step 6, constructing a decoder consisting of a lightweight long short-term memory network LSTM, and training it;
Step 6.1, when t = 1, inputting the optimal video frame set F*_i corresponding to the ith video into the decoder to obtain the predicted word ŷ_{i,m,1} of the mth caption sentence corresponding to the ith video output at the 1st time step;
when t = 2, 3, ..., W, randomly initializing the tth-step control factor ξ_t; if ξ_t falls below a set threshold, the predicted word ŷ_{i,m,t-1} of the mth caption sentence corresponding to the ith video output at the (t-1)th time step is processed by the decoder to obtain the predicted word ŷ_{i,m,t} output at the tth time step; otherwise, the tth word y_{i,m,t} of the mth caption sentence corresponding to the ith video is processed by the decoder to obtain the predicted word ŷ_{i,m,t} output at the tth time step;
Step 6.2, constructing the cross-entropy loss L_XE by formula (8):
L_XE = -Σ_{i=1}^{B} Σ_{m=1}^{M} Σ_{t=1}^{W} log p_θ(y_{i,m,t})   (8)
In formula (8), p_θ(y_{i,m,t}) represents the prediction probability output by the decoder at the tth step for the tth word y_{i,m,t} of the mth caption sentence corresponding to the ith video, and θ represents the learnable parameters;
Step 6.3, in the second training stage, training the decoder with the back-propagation and gradient-descent method based on the English caption sentences Y_i, and stopping training when the cross-entropy loss L_XE reaches its minimum, thereby obtaining the trained decoder model, which performs caption output for the optimal video frames output by the trained video summarizer model.
The electronic device of the invention comprises a memory and a processor, wherein the memory is used for storing a program that supports the processor in executing the multi-mode video abstract extraction method, and the processor is configured to execute the program stored in the memory.
The invention relates to a computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when run by a processor, performs the steps of the multi-mode video abstract extraction method.
Compared with the prior art, the invention has the beneficial effects that:
1. Through the coupling of the two components of the dual video summarization framework, namely the summarizer and the decoder, the invention can rapidly and efficiently compress video content while ensuring the accuracy of semantic information, thereby improving the efficiency with which users browse short videos; it can also serve as the cover display of short videos on short-video websites.
2. The present invention summarizes video content using both the visual and text modalities: visual representations of the frames are extracted, frame-level scores are obtained from them, and the key frames are selected accordingly. The cross-modal video summarization model selects the most meaningful and semantically consistent frames so as to compress the video content without reducing the quality of the video description, thereby eliminating redundant frames in the video and improving the efficiency of video representation.
3. The invention uses a lightweight LSTM decoder to generate the description and can convey the same semantic information without a large number of key frames, thereby bringing beneficial application value to the fields of video coding and video-text data processing.
Drawings
FIG. 1 is a framework diagram of the multi-modal video summary model of the present invention;
FIG. 2 is a block diagram of a summarizer according to the present invention;
FIG. 3 is a block diagram of a caption generator according to the present invention;
FIG. 4 is a flow chart of the multi-modal video summary model training of the present invention.
Detailed Description
In this embodiment, a method for extracting a multi-mode video summary based on video subtitles, as shown in fig. 1 and fig. 4, is performed according to the following steps:
step 1, acquiring frame characteristic representation of a video:
for a video subtitle data set D= { V, Y }, wherein V represents a video set and Y represents an English subtitle sentence set corresponding to each video in the video set V;
processing any ith video in the video set V with the visual encoder of the CLIP model to obtain the frame feature representation F_i = {f_{i,1}, f_{i,2}, ..., f_{i,n}, ..., f_{i,N}} of the ith video, wherein f_{i,n} represents the nth frame feature representation of the ith video and N represents the total number of frames of video i; in this embodiment, N = 12;
step 2, acquiring the characteristic representation of the caption:
adopting the text encoder of the CLIP model to process the English caption sentences Y_i = {y_{i,1,1}, ..., y_{i,1,W}; ...; y_{i,m,1}, y_{i,m,2}, ..., y_{i,m,t}, ..., y_{i,m,W}; ...; y_{i,M,1}, ..., y_{i,M,W}} corresponding to the ith video, thereby obtaining the caption text vectors T_i = {t_{i,1}, t_{i,2}, ..., t_{i,m}, ..., t_{i,M}} corresponding to video i, wherein y_{i,m,t} represents the tth word in the mth caption sentence corresponding to the ith video and t_{i,m} represents the mth caption vector of the English caption sentences corresponding to the ith video; in this embodiment, M = 20 and W = 30;
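For illustration only, steps 1 and 2 of this embodiment can be sketched with the open-source OpenAI CLIP package roughly as follows; the ViT-B/32 checkpoint, the frame-path interface, the L2 normalization and all variable names are assumptions of the sketch rather than part of the disclosure:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_video_frames(frame_paths):
    """Step 1: frame feature representation F_i = {f_i,1, ..., f_i,N} (N = 12 here)."""
    images = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        frame_feats = model.encode_image(images)              # (N, d)
    return frame_feats / frame_feats.norm(dim=-1, keepdim=True)

def encode_captions(sentences):
    """Step 2: caption text vectors T_i = {t_i,1, ..., t_i,M} (M = 20, W <= 30 here)."""
    tokens = clip.tokenize(sentences, truncate=True).to(device)
    with torch.no_grad():
        text_feats = model.encode_text(tokens)                # (M, d)
    return text_feats / text_feats.norm(dim=-1, keepdim=True)
```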
Step 3, as shown in FIG. 2, obtaining, by formula (1), the average similarity s(f_{i,n}, T_i) between the nth frame feature representation f_{i,n} of the ith video and the caption text vectors T_i, and taking it as the automated score s_{i,n} of the nth frame feature f_{i,n} of video i:
s(f_{i,n}, T_i) = (1/M) Σ_{m=1}^{M} (f_{i,n})^{Tr} t_{i,m}   (1)
In formula (1), Tr represents vector transposition;
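A corresponding sketch of formula (1), computing the per-frame automated scores from the features produced above (the assumption that both feature sets are L2-normalized is carried over from the previous sketch):

```python
import torch

def automated_frame_scores(frame_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Formula (1): average similarity between each frame feature f_i,n and the
    M caption vectors in T_i, used as the automated score of that frame."""
    # frame_feats: (N, d), text_feats: (M, d)
    sim = frame_feats @ text_feats.t()        # (N, M) pairwise transposed-product similarities
    return sim.mean(dim=1)                    # (N,) average over the M captions
```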
Step 4, constructing a video summarizer comprising a self-attention mechanism layer, a local attention enhancement layer and a fully-connected network MLP, and training it;
Step 4.1, the self-attention mechanism layer calculates, by formula (2), the cross-relation score r(f_{i,n}, f_{i,j}) between the nth frame feature representation f_{i,n} and the jth frame feature representation f_{i,j} of the ith video:
r(f_{i,n}, f_{i,j}) = P × tanh(W_1 f_{i,n} + W_2 f_{i,j} + b)   (2)
In formula (2), P, W_1 and W_2 are three parameter matrices to be learned, b is a bias vector, and tanh represents an activation function;
Step 4.2, the local attention enhancement layer calculates, by formula (3), the locally attention-enhanced video frame feature f̃_{i,n} for the nth frame feature representation f_{i,n} of the ith video, thereby obtaining the locally attention-enhanced feature representation F̃_i = {f̃_{i,1}, ..., f̃_{i,N}} of the ith video:
f̃_{i,n} = Σ_{j=1}^{N} α_{i,n,j} ⊙ f_{i,j}   (3)
In formula (3), α_{i,n,j} represents the relation weight between the jth frame feature representation f_{i,j} and the nth frame feature representation f_{i,n} of the ith video, ⊙ represents element-wise multiplication of vectors, and the relation weight is given by formula (4):
α_{i,n,j} = exp(r(f_{i,n}, f_{i,j})) / Σ_{j'=1}^{N} exp(r(f_{i,n}, f_{i,j'}))   (4)
Step 4.3, the fully-connected network MLP calculates, by formula (5), the prediction score p_{i,n} of the nth frame feature representation f_{i,n} of the ith video:
p_{i,n} = MLP(f̃_{i,n} + f_{i,n})   (5)
In formula (5), the MLP uses the GeLU activation function, and + represents the residual connection between the locally enhanced feature and the original frame feature;
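For illustration only, steps 4.1 to 4.3 can be sketched in PyTorch roughly as follows; treating the relation weights of formulas (3) and (4) as scalar softmax weights, adding a sigmoid on the output, and the chosen layer sizes are assumptions of the sketch rather than the claimed structure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoSummarizer(nn.Module):
    """Sketch of the step-4 summarizer: a self-attention layer producing the
    cross-relation scores of formula (2), a local attention enhancement layer,
    and an MLP head with GeLU and a residual connection."""
    def __init__(self, dim: int):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)
        self.W2 = nn.Linear(dim, dim, bias=True)      # its bias plays the role of b
        self.P = nn.Linear(dim, 1, bias=False)        # parameter matrix P
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:   # frames: (N, d)
        # formula (2): r(f_i,n, f_i,j) = P x tanh(W1 f_i,n + W2 f_i,j + b)
        r = self.P(torch.tanh(self.W1(frames).unsqueeze(1)
                              + self.W2(frames).unsqueeze(0))).squeeze(-1)   # (N, N)
        attn = F.softmax(r, dim=-1)                   # relation weights, cf. formula (4)
        enhanced = attn @ frames                      # locally enhanced features, cf. formula (3)
        # formula (5): MLP prediction with a residual connection to the frame feature
        return torch.sigmoid(self.mlp(enhanced + frames)).squeeze(-1)        # (N,) scores
```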
Step 4.4, in the first training stage, training the video summarizer with the back-propagation and gradient-descent method based on the video subtitle data set D, and optimizing the video summarizer by minimizing the binary cross-entropy loss L_vsum shown in formula (7), thereby obtaining the trained video summarizer model:
L_vsum = -(1/(B×N)) Σ_{i=1}^{B} Σ_{n=1}^{N} [ s_{i,n} log p_{i,n} + (1 - s_{i,n}) log(1 - p_{i,n}) ]   (7)
In formula (7), B represents the number of videos in the video subtitle data set D.
In this embodiment, a maximum iteration number epoch_number is set 1 10, adopting an Adam optimization algorithm with learning rate and exponential decay rate by a gradient descent method, and when the iteration number reaches epoch_number 1 When the training is stopped, the objective function loses L vsum To the minimum;
Step 5, inputting the frame feature representation F_i = {f_{i,1}, f_{i,2}, ..., f_{i,n}, ..., f_{i,N}} of the ith video into the trained video summarizer model, and selecting the top K frame feature representations with the highest prediction scores to form the optimal video frame set F*_i = {f*_{i,1}, ..., f*_{i,k}, ..., f*_{i,K}}, wherein f*_{i,k} represents the kth optimal frame feature representation of the ith video and K represents the number of selected optimal video frames;
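Step 5 then reduces to a top-K selection over the prediction scores, sketched as follows:

```python
import torch

def select_key_frames(summarizer, frame_feats, k):
    """Step 5: keep the K frame features with the highest prediction scores as the
    optimal video frame set that is fed to the caption decoder."""
    with torch.no_grad():
        scores = summarizer(frame_feats)                       # (N,)
    topk = torch.topk(scores, k=min(k, scores.numel()))
    return frame_feats[topk.indices], topk.indices
```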
Step 6, constructing a decoder consisting of a lightweight long short-term memory network LSTM and training it, as shown in FIG. 3. The description is generated by the lightweight LSTM decoder: video frames thinned at a certain sampling rate are used to generate the ViT features of the video, which serve as the input of the summarizer of the model; the summarizer judges the information content of the input video frames and gives a specific quantitative evaluation, and the frame feature set with the highest information content is then screened out according to this evaluation and sent to the LSTM decoder to generate the language description.
Step 6.1, when t = 1, inputting the optimal video frame set F*_i corresponding to the ith video into the decoder to obtain the predicted word ŷ_{i,m,1} of the mth caption sentence corresponding to the ith video output at the 1st time step;
when t = 2, 3, ..., W, randomly initializing the tth-step control factor ξ_t; if ξ_t falls below a set threshold, the predicted word ŷ_{i,m,t-1} of the mth caption sentence corresponding to the ith video output at the (t-1)th time step is processed by the decoder to obtain the predicted word ŷ_{i,m,t} output at the tth time step; otherwise, the tth word y_{i,m,t} of the mth caption sentence corresponding to the ith video is processed by the decoder to obtain the predicted word ŷ_{i,m,t} output at the tth time step;
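For illustration only, step 6.1 can be sketched as the following lightweight LSTM decoder; mean-pooling the key-frame features into the initial state, the embedding and hidden sizes, and the 0.5 threshold on the control factor ξ_t are assumptions of the sketch, since they are not specified above:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Sketch of the step-6 lightweight LSTM decoder with scheduled sampling:
    at each step t >= 2 a random control factor xi_t chooses between the
    previous prediction and the ground-truth word."""
    def __init__(self, feat_dim: int, vocab_size: int, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(feat_dim, hidden)
        self.cell = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, key_frames, gt_words):                    # gt_words: (W,) token ids
        h = torch.tanh(self.init_h(key_frames.mean(dim=0, keepdim=True)))   # (1, hidden)
        c = torch.zeros_like(h)
        logits, prev = [], gt_words[:1]                          # t = 1: start from the first word
        for t in range(gt_words.numel()):
            h, c = self.cell(self.embed(prev), (h, c))
            step_logits = self.out(h)                            # (1, vocab)
            logits.append(step_logits)
            if t + 1 < gt_words.numel():
                xi = torch.rand(1).item()                        # control factor xi_t
                prev = (step_logits.argmax(dim=-1) if xi < 0.5
                        else gt_words[t + 1:t + 2])              # scheduled sampling
        return torch.cat(logits, dim=0)                          # (W, vocab) word logits
```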
Step 6.2, constructing the cross-entropy loss L_XE by formula (8):
L_XE = -Σ_{i=1}^{B} Σ_{m=1}^{M} Σ_{t=1}^{W} log p_θ(y_{i,m,t})   (8)
In formula (8), p_θ(y_{i,m,t}) represents the prediction probability output by the decoder at the tth step for the tth word y_{i,m,t} of the mth caption sentence corresponding to the ith video, and θ represents the learnable parameters;
Step 6.3, in the second training stage, training the decoder with the back-propagation and gradient-descent method based on the English caption sentences Y_i and computing L_XE to update the network parameters; the maximum iteration number epoch_number_2 is set to 30; in this step, the gradient-descent method adopts the Adam optimization algorithm with a learning rate and an exponential decay rate, and training stops when the iteration number reaches epoch_number_2, thereby obtaining the trained decoder, which performs caption output for the optimal video frames output by the trained video summarizer model.
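A minimal sketch of this second training stage, assuming a data loader that yields the selected key-frame features together with the token ids of one ground-truth caption sentence:

```python
import torch
import torch.nn.functional as F

def train_decoder(decoder, loader, epochs=30, lr=1e-4):
    """Second training stage: Adam optimisation of the cross-entropy loss L_XE of
    formula (8) on the ground-truth caption words, for epoch_number_2 = 30 epochs.
    The loader layout and the learning rate are assumptions of this sketch."""
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    for _ in range(epochs):
        for key_frames, gt_words in loader:            # key frames from the trained summarizer
            logits = decoder(key_frames, gt_words)     # (W, vocab)
            loss = F.cross_entropy(logits, gt_words)   # -(1/W) * sum_t log p_theta(y_i,m,t)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return decoder
```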
In this embodiment, an electronic device includes a memory for storing a program supporting the processor to execute the above method, and a processor configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the method described above.
In summary, the method aims at the hot tide of short video, and aims at a video output key frame set and corresponding subtitles thereof, wherein the key frame set reflects the whole content of the video in a visual form by a small number of video frames, and the matched subtitles summarize the video pictures in a text form and reflect the video content from two angles of visual text. The method has the advantages that the number of the used model parameters is small, the requirements on the storage space and the computing resources are limited, and the application can be effectively deployed.

Claims (3)

1. A multi-mode video abstract extraction method based on video captions is characterized by comprising the following steps:
step 1, acquiring frame characteristic representation of a video:
for a video subtitle data set D= { V, Y }, wherein V represents a video set and Y represents an English subtitle sentence set corresponding to each video in the video set V;
processing any ith video in the video set V with the visual encoder of the CLIP model to obtain the frame feature representation F_i = {f_{i,1}, f_{i,2}, ..., f_{i,n}, ..., f_{i,N}} of the ith video, wherein f_{i,n} represents the nth frame feature representation of the ith video and N represents the total number of frames of video i;
step 2, acquiring the characteristic representation of the caption:
adopting the text encoder of the CLIP model to process the English caption sentences Y_i = {y_{i,1,1}, ..., y_{i,1,W}; ...; y_{i,m,1}, y_{i,m,2}, ..., y_{i,m,t}, ..., y_{i,m,W}; ...; y_{i,M,1}, ..., y_{i,M,W}} corresponding to the ith video, thereby obtaining the English caption text vectors T_i = {t_{i,1}, t_{i,2}, ..., t_{i,m}, ..., t_{i,M}} corresponding to video i, wherein y_{i,m,t} represents the tth word in the mth caption sentence corresponding to the ith video, t_{i,m} represents the mth caption vector of the English caption sentences corresponding to the ith video, M represents the total number of caption sentences, and W represents the total number of words in a caption sentence;
Step 3, obtaining, by formula (1), the average similarity s(f_{i,n}, T_i) between the nth frame feature representation f_{i,n} of the ith video and the caption text vectors T_i, and taking it as the automated score s_{i,n} of the nth frame feature f_{i,n} of video i:
s(f_{i,n}, T_i) = (1/M) Σ_{m=1}^{M} (f_{i,n})^{Tr} t_{i,m}   (1)
In formula (1), Tr represents vector transposition;
Step 4, constructing a video summarizer comprising a self-attention mechanism layer, a local attention enhancement layer and a fully-connected network MLP, and training it;
Step 4.1, the self-attention mechanism layer calculates, by formula (2), the cross-relation score r(f_{i,n}, f_{i,j}) between the nth frame feature representation f_{i,n} and the jth frame feature representation f_{i,j} of the ith video:
r(f_{i,n}, f_{i,j}) = P × tanh(W_1 f_{i,n} + W_2 f_{i,j} + b)   (2)
In formula (2), P, W_1 and W_2 are three parameter matrices to be learned, b is a bias vector, and tanh represents an activation function;
Step 4.2, the local attention enhancement layer calculates, by formula (3), the locally attention-enhanced video frame feature f̃_{i,n} for the nth frame feature representation f_{i,n} of the ith video, thereby obtaining the locally attention-enhanced feature representation F̃_i = {f̃_{i,1}, ..., f̃_{i,N}} of the ith video:
f̃_{i,n} = Σ_{j=1}^{N} α_{i,n,j} ⊙ f_{i,j}   (3)
In formula (3), α_{i,n,j} represents the relation weight between the jth frame feature representation f_{i,j} and the nth frame feature representation f_{i,n} of the ith video, ⊙ represents element-wise multiplication of vectors, and the relation weight is given by formula (4):
α_{i,n,j} = exp(r(f_{i,n}, f_{i,j})) / Σ_{j'=1}^{N} exp(r(f_{i,n}, f_{i,j'}))   (4)
Step 4.3, the fully-connected network MLP calculates, by formula (5), the prediction score p_{i,n} of the nth frame feature representation f_{i,n} of the ith video:
p_{i,n} = MLP(f̃_{i,n} + f_{i,n})   (5)
In formula (5), the MLP uses the GeLU activation function, and + represents the residual connection between the locally enhanced feature and the original frame feature;
Step 4.4, constructing the binary cross-entropy loss L_vsum by formula (7):
L_vsum = -(1/(B×N)) Σ_{i=1}^{B} Σ_{n=1}^{N} [ s_{i,n} log p_{i,n} + (1 - s_{i,n}) log(1 - p_{i,n}) ]   (7)
In formula (7), B represents the number of videos in the video subtitle data set D;
in the first training stage, training the video summarizer with the back-propagation and gradient-descent method based on the video caption data set D, and stopping training when the binary cross-entropy loss L_vsum reaches its minimum, thereby obtaining the trained video summarizer model;
Step 5, inputting the frame feature representation F_i = {f_{i,1}, f_{i,2}, ..., f_{i,n}, ..., f_{i,N}} of the ith video into the trained video summarizer model, and selecting the top K frame feature representations with the highest prediction scores to form the optimal video frame set F*_i = {f*_{i,1}, ..., f*_{i,k}, ..., f*_{i,K}}, wherein f*_{i,k} represents the kth optimal frame feature representation of the ith video and K represents the number of selected optimal video frames;
Step 6, constructing a decoder consisting of a lightweight long short-term memory network LSTM, and training it;
Step 6.1, when t = 1, inputting the optimal video frame set F*_i corresponding to the ith video into the decoder to obtain the predicted word ŷ_{i,m,1} of the mth caption sentence corresponding to the ith video output at the 1st time step;
when t = 2, 3, ..., W, randomly initializing the tth-step control factor ξ_t; if ξ_t falls below a set threshold, the predicted word ŷ_{i,m,t-1} of the mth caption sentence corresponding to the ith video output at the (t-1)th time step is processed by the decoder to obtain the predicted word ŷ_{i,m,t} output at the tth time step; otherwise, the tth word y_{i,m,t} of the mth caption sentence corresponding to the ith video is processed by the decoder to obtain the predicted word ŷ_{i,m,t} output at the tth time step;
Step 6.2, constructing the cross-entropy loss L_XE by formula (8):
L_XE = -Σ_{i=1}^{B} Σ_{m=1}^{M} Σ_{t=1}^{W} log p_θ(y_{i,m,t})   (8)
In formula (8), p_θ(y_{i,m,t}) represents the prediction probability output by the decoder at the tth step for the tth word y_{i,m,t} of the mth caption sentence corresponding to the ith video, and θ represents the learnable parameters;
Step 6.3, in the second training stage, training the decoder with the back-propagation and gradient-descent method based on the English caption sentences Y_i, and stopping training when the cross-entropy loss L_XE reaches its minimum, thereby obtaining the trained decoder model, which performs caption output for the optimal video frames output by the trained video summarizer model.
2. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that supports the processor to perform the multimodal video summary extraction method of claim 1, the processor being configured to execute the program stored in the memory.
3. A computer readable storage medium having a computer program stored thereon, which when executed by a processor performs the steps of the multimodal video summary extraction method of claim 1.
CN202310767163.1A 2023-06-27 2023-06-27 Multi-mode video abstract extraction method based on video captions Pending CN116992079A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310767163.1A CN116992079A (en) 2023-06-27 2023-06-27 Multi-mode video abstract extraction method based on video captions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310767163.1A CN116992079A (en) 2023-06-27 2023-06-27 Multi-mode video abstract extraction method based on video captions

Publications (1)

Publication Number Publication Date
CN116992079A true CN116992079A (en) 2023-11-03

Family

ID=88520376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310767163.1A Pending CN116992079A (en) 2023-06-27 2023-06-27 Multi-mode video abstract extraction method based on video captions

Country Status (1)

Country Link
CN (1) CN116992079A (en)

Similar Documents

Publication Publication Date Title
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
KR102480323B1 (en) Method and system for retrieving video time segments
CN106919646B (en) Chinese text abstract generating system and method
CN106777125B (en) Image description generation method based on neural network and image attention point
CN109344288A (en) A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
JP7431833B2 (en) Language sequence labeling methods, devices, programs and computing equipment
CN109543820B (en) Image description generation method based on architecture phrase constraint vector and double vision attention mechanism
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
US12050983B2 (en) Attention neural networks with parallel attention and feed-forward layers
Bilkhu et al. Attention is all you need for videos: Self-attention based video summarization using universal transformers
CN113392265A (en) Multimedia processing method, device and equipment
RU2712101C2 (en) Prediction of probability of occurrence of line using sequence of vectors
CN114743143A (en) Video description generation method based on multi-concept knowledge mining and storage medium
CN114387537A (en) Video question-answering method based on description text
CN114925232B (en) Cross-modal time domain video positioning method under text segment question-answering framework
CN116912642A (en) Multimode emotion analysis method, device and medium based on dual-mode and multi-granularity interaction
CN115810068A (en) Image description generation method and device, storage medium and electronic equipment
CN112651225B (en) Multi-item selection machine reading understanding method based on multi-stage maximum attention
CN114330352A (en) Named entity identification method and system
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
Patankar et al. Image Captioning with Audio Reinforcement using RNN and CNN
CN116681078A (en) Keyword generation method based on reinforcement learning
CN115937641A (en) Method, device and equipment for intermodal joint coding based on Transformer
CN115906879A (en) Translation model training method for vertical domain and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination