CN115033736A - Video summarization method guided by natural language - Google Patents

Video summarization method guided by natural language

Info

Publication number
CN115033736A
Authority
CN
China
Prior art keywords
video
sequence
frame
natural language
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210652477.2A
Other languages
Chinese (zh)
Inventor
金永刚
郑婧
马海钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210652477.2A priority Critical patent/CN115033736A/en
Publication of CN115033736A publication Critical patent/CN115033736A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/738 Presentation of query results
    • G06F16/739 Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a natural-language-guided video summarization method, which comprises the following steps: decomposing a video file into a frame sequence and extracting frame image features; extracting frame semantic features and text semantic features and computing their spatial cosine similarity to obtain attention weights; building a natural-language-guided video summarization model; training the whole network model; and selecting the video summary according to the frame importance score sequence. The invention introduces a natural-language-guided attention mechanism, designed for the video summarization task, into a video summarization framework, and ablation experiments show that this attention mechanism significantly improves the performance of the summarization model. In addition, the proposed natural-language-guided attention mechanism is more objective, fully attends to the video segments related to the title text, and title text is readily available for Internet videos at no cost.

Description

Natural-language-guided video summarization method
Technical Field
The invention belongs to the technical field of video summarization, and particularly relates to a natural-language-guided video summarization method.
Background
With the rapid development of multimedia and network information technology in recent years, video has become an increasingly mainstream medium for communicating information. How to condense a lengthy video to several minutes or even a few seconds while retaining its key information, i.e., video summarization, has become an important research topic in the field of video technology. Video summarization techniques use computer algorithms to automatically select the important segments of a video as its summary, which reduces the storage space required for the video and lets users browse video information quickly.
The idea of mimicking human attention first appeared in the field of computer vision: by introducing an attention mechanism that focuses only on parts of an image rather than the entire image, the computational complexity of image processing is reduced and performance is improved. Similarly, each frame in a video has its own level of importance, and some segments contain the key information of the video content, which is exactly what people attend to when selecting a summary. Many researchers have therefore introduced attention mechanisms into video summarization methods, assigning different importance weights to different frames of the input sequence instead of treating all input frames equally, thereby modeling the internal relationship between the input video sequence and the output importance scores.
If a video segment is interesting to the user, it is more likely to be an important segment of the whole sequence, and many researchers have modeled summarization on user attention. For example, Ma et al. [A user attention model for video summarization [C]// Proceedings of the tenth ACM international conference on Multimedia. 2002: 533-542] score the importance of video segments using low-level features such as motion change and human faces, combine these scores into an attention curve, and extract the parts at the curve peaks as key shots to construct the summary. With the development of deep learning, several attention-based deep video summarization methods have been proposed. For example, Zhong et al. propose the AVS video summarization model [Video summarization with attention-based encoder-decoder networks [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 30(6): 1709-1717], which casts supervised video summarization as a sequence-to-sequence learning problem, explores two attention-based decoder networks using additive and multiplicative attention, obtains the attention weights from the properties of the video sequence itself, and learns them in a supervised manner. Apostolidis et al. [Combining Global and Local Attention with Positional Encoding for Video Summarization [C]// 2021 IEEE International Symposium on Multimedia (ISM). IEEE, 2021: 226-234] combine global and local multi-head attention mechanisms to model frame dependencies at different granularities, and their attention mechanism integrates components that encode the temporal position of each video frame.
However, existing attention-based deep video summarization methods generally use only the video image information to generate the summary and do not consider information from other modalities, even though key information highly related to the video content is often contained in descriptive texts such as the title.
Disclosure of Invention
In view of the above, the invention provides a natural-language-guided video summarization method. It introduces a natural-language-guided attention mechanism designed for the video summarization task, obtains attention weights by computing the similarity between the video sequence and the text, and incorporates this attention mechanism into an Encoder-Decoder video summarization framework; ablation experiments show that the natural-language-guided attention mechanism significantly improves the performance of the summarization model.
A natural-language-guided video summarization method comprises the following steps:
(1) decomposing the video files in the training set into frame sequences, and extracting image features from each frame sequence with a pre-trained deep image network to obtain the corresponding frame image feature sequence (f_1, …, f_n);
(2) extracting semantic features from the frame sequence with the image encoding network of a pre-trained multi-modal model to obtain the frame semantic feature sequence (x_1, …, x_n); extracting semantic features from the natural language text of the video file with the text encoding network of the pre-trained multi-modal model to obtain the text semantic feature t;
(3) computing the spatial cosine similarity between the frame semantic feature sequence (x_1, …, x_n) and the text semantic feature t to obtain the attention weight sequence (α_1, …, α_n);
(4) constructing a video summarization model based on the natural-language-guided attention mechanism, whose input is the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) and whose output is the frame importance score sequence (y_1, …, y_n), and training the video summarization model;
(5) feeding the frame image feature sequence (f_1, …, f_n) and attention weight sequence (α_1, …, α_n) of a test-set video file into the trained video summarization model, which outputs the frame importance score sequence (y_1, …, y_n) of the video file;
(6) selecting key shots according to the frame importance score sequence (y_1, …, y_n) and synthesizing them into the video summary.
Further, since the natural language text of a considerable portion of the video files in the training set does not describe the video content well, the natural language text needs to be optimized before its semantic features are extracted: candidate titles are first drafted according to the video content, several users then watch the videos through a questionnaire and choose the suitable titles, and finally the title chosen by the highest proportion of users is taken as the optimized natural language text.
Further, step (3) is implemented as follows: the cosine similarity between the text semantic feature t and each feature vector in the frame semantic feature sequence (x_1, …, x_n) is first calculated to obtain a similarity sequence (sim_1, …, sim_n); softmax normalization is then applied to the similarity sequence to obtain the attention weight sequence (α_1, …, α_n). The attention weight sequence reflects the importance of each video segment relative to the title text and provides an effective attention mechanism to guide summary generation.
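A minimal sketch of this step, assuming the frame semantic features are stacked into an array X of shape (n, d) and the text semantic feature is a vector t of length d (the function name and array layout are illustrative, not prescribed by the patent):

```python
import numpy as np

def attention_weights(X: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Natural-language-guided attention weights.

    X: frame semantic features, shape (n, d)
    t: text semantic feature, shape (d,)
    Returns: attention weights alpha of shape (n,), summing to 1.
    """
    # Cosine similarity between the text feature and every frame feature
    sims = X @ t / (np.linalg.norm(X, axis=1) * np.linalg.norm(t) + 1e-8)
    # Softmax normalization over the similarity sequence
    exp = np.exp(sims - sims.max())  # subtract the max for numerical stability
    return exp / exp.sum()
```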
Furthermore, the video summarization model consists of an Encoder module, an Attention module and a Decoder module connected in sequence. The Encoder module encodes the input frame image feature sequence (f_1, …, f_n) into a hidden sequence (h_1, …, h_n); the Attention module performs a weighted summation of the hidden sequence (h_1, …, h_n) using the attention weight sequence (α_1, …, α_n) to obtain a fused variable h; the Decoder module decodes the fused variable h and outputs the frame importance score sequence (y_1, …, y_n).
Further, the video summarization model in step (4) is trained as follows: the model parameters are first initialized; the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) are then fed into the model, which predicts the frame importance score sequence (y_1, …, y_n); the model parameters are iteratively updated by gradient descent and back-propagation according to the loss function L until L converges to a minimum or the maximum number of iterations is reached, at which point training is complete.
Further, the loss function L is the mean squared error between the predicted and annotated scores:

L = (1/n) Σ_{i=1}^{n} (y_i − s_i)²

where y_i denotes the importance score of the i-th video frame, s_i denotes the annotation score of the i-th video frame, and n is the total number of frames in the video file.
Further, the annotation score s_i of the i-th video frame is obtained by having multiple users label the importance of the video frame, where label 1 denotes important and label 0 denotes unimportant; the annotation score s_i is the fraction of users who assigned label 1 among all users.
Further, step (6) is implemented as follows: visually continuous frames are first grouped into shots, and the frame importance score sequence (y_1, …, y_n) is converted into a shot importance score sequence by averaging the importance scores of all frames within each shot; key shots are then selected with a 0/1 knapsack algorithm according to the shot importance scores and shot lengths, subject to the constraint that the total length of the key shots does not exceed 15% of the length of the whole video; finally, all key shots are synthesized into the video summary, as sketched in the example below.
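A minimal sketch of this key-shot selection, assuming shot boundaries and per-frame scores are already available; the shot segmentation itself is outside the sketch, and all names are illustrative:

```python
def select_key_shots(frame_scores, shot_bounds, budget_ratio=0.15):
    """Select key shots with a 0/1 knapsack over shot scores and lengths.

    frame_scores: per-frame importance scores y_1..y_n
    shot_bounds:  list of (start, end) frame indices, end exclusive
    budget_ratio: summary length budget as a fraction of the full video
    Returns: indices of the selected shots.
    """
    lengths = [end - start for start, end in shot_bounds]
    # Shot score = mean of the frame scores inside the shot
    scores = [sum(frame_scores[s:e]) / (e - s) for s, e in shot_bounds]
    capacity = int(budget_ratio * len(frame_scores))

    # Standard 0/1 knapsack dynamic programme over shot lengths
    n = len(shot_bounds)
    dp = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if lengths[i - 1] <= c:
                cand = dp[i - 1][c - lengths[i - 1]] + scores[i - 1]
                if cand > dp[i][c]:
                    dp[i][c] = cand

    # Backtrack to recover which shots were selected
    selected, c = [], capacity
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(i - 1)
            c -= lengths[i - 1]
    return sorted(selected)
```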
Based on the technical scheme, the invention has the following beneficial technical effects:
1. after 150 epochs of training, the five-fold cross-validation average F1-score for the natural language-guided video summary model was 47.0%. The ablation experiment for removing the natural language guidance is carried out, and the five-fold cross validation average F1-score of the video abstract model for removing the natural language guidance is 44.5%, which shows that the natural language guidance improves the model performance.
2. Compared with user-attention-based mechanisms, the natural-language-guided attention mechanism proposed by the invention is more objective and can fully attend to the video segments related to the title text; at the same time, title text is readily available for Internet videos at no cost, whereas user attention data must be collected and analyzed and is not easily obtained.
3. While learning how humans summarize videos, the summarization model fully attends to the key content related to the natural language text. When a user provides a natural language text together with a video, the summary generated by the method is strongly correlated with the text provided by the user and fully reflects the user's interest.
Drawings
FIG. 1 is a schematic diagram of attention weight generation based on natural language guidance according to the present invention.
FIG. 2 is a schematic structural diagram of the attention-based Encoder-Decoder video summarization model.
Detailed Description
To describe the present invention more specifically, the technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the natural-language-guided video summarization method of the present invention includes the following steps:
s1: decomposing the video file into a frame sequence, and extracting image features of the frame sequence by using a pre-training depth image network to obtain a frame image feature sequence (f) 1 ,…,f n )。
In this embodiment, the video is sampled into a frame sequence at 2 fps, and image features are then extracted from the pool5 layer of a GoogLeNet model pre-trained on the large-scale image dataset ImageNet, yielding the frame image feature sequence (f_1, …, f_n), where n is the video length and each frame feature has dimension 1024.
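A minimal sketch of this step under common tooling assumptions (OpenCV for frame decoding and torchvision's ImageNet-pretrained GoogLeNet as the deep image network; the patent does not prescribe these particular libraries):

```python
import cv2
import numpy as np
import torch
from torchvision import models, transforms

# GoogLeNet pre-trained on ImageNet; replacing the classifier with Identity
# exposes the 1024-dimensional pooled (pool5) feature for each frame.
net = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
net.fc = torch.nn.Identity()
net.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_features(video_path: str, sample_fps: float = 2.0) -> np.ndarray:
    """Sample a video at `sample_fps` and return frame features of shape (n, 1024)."""
    cap = cv2.VideoCapture(video_path)
    step = max(int(round(cap.get(cv2.CAP_PROP_FPS) / sample_fps)), 1)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            with torch.no_grad():
                feats.append(net(preprocess(rgb).unsqueeze(0)).squeeze(0).numpy())
        idx += 1
    cap.release()
    return np.stack(feats)
```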
S2: semantic feature extraction is carried out on the frame sequence and the natural language text by utilizing an image coding network and a text coding network of a pre-training multi-modal model to obtain a frame semantic feature sequence (x) 1 ,…,x n ) And text semantic features t, and performing space cosine similarity calculation to obtain an attention weight sequence (alpha) 1 ,…,α n )。
This embodiment uses the image encoding network and the text encoding network of a pre-trained CLIP model for semantic feature extraction from the frame sequence and the natural language text. During training, CLIP maps the feature spaces of its image encoder and text encoder into a common semantic space; this property provides the conditions needed to compute the similarity between video frame images and text. The frame semantic feature sequence (x_1, …, x_n) extracted by the pre-trained CLIP model and the text semantic feature t are compared by spatial cosine similarity to obtain a similarity sequence (sim_1, …, sim_n), which is normalized with softmax to yield the attention weight sequence (α_1, …, α_n). The resulting attention weight sequence reflects the importance of each video segment relative to the title text and provides an effective attention mechanism to guide summary generation:

sim_i = (t · x_i) / (‖t‖ ‖x_i‖)

α_i = exp(sim_i) / Σ_{j=1}^{n} exp(sim_j)
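A minimal sketch of this step, assuming the open-source CLIP package and its "ViT-B/32" variant, neither of which is prescribed by the patent (the patent only requires a pre-trained multi-modal model); frames are PIL images from the sampled frame sequence:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed CLIP variant

def clip_attention_weights(frames, title: str) -> torch.Tensor:
    """Natural-language-guided attention weights from CLIP features.

    frames: list of PIL.Image video frames (the sampled frame sequence)
    title:  natural language text of the video (e.g. its title)
    Returns: attention weights alpha of shape (n,), summing to 1.
    """
    with torch.no_grad():
        imgs = torch.stack([preprocess(f) for f in frames]).to(device)
        x = model.encode_image(imgs)                              # frame semantic features (n, d)
        t = model.encode_text(clip.tokenize([title]).to(device))  # text semantic feature (1, d)
        # Spatial cosine similarity followed by softmax normalization
        sims = torch.nn.functional.cosine_similarity(x, t)        # (n,)
        return torch.softmax(sims.float(), dim=0)
```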
S3: constructing an Encoder-Decoder video abstract model based on an attention mechanism and realized by an LSTM network, and inputting a frame image feature sequence (f) 1 ,…,f n ) And attention weight sequence (alpha) 1 ,…,α n ) Outputting a sequence of frame importance scores (y) 1 ,…,y n )。
The Encoder and Decoder in the adopted model may be implemented with a convolutional neural network, a recurrent neural network, or similar architectures; LSTM is chosen here because it handles long-range dependencies well and has advantages for long-sequence modeling. As shown in FIG. 2, the Encoder-Decoder video summarization model with natural-language-guided attention implemented by the LSTM network is divided into 3 modules: Encoder, Attention, and Decoder, implemented as follows:
the Encoder module is realized by n stacked LSTM unitsThe last moment memory state c is input into the LSTM cell t-1 And input f of the current time t Outputting the hidden state h at the moment t (ii) a Encoder input frame feature sequence (f) 1 ,…,f n ) Output hidden sequence (h) 1 ,…,h n ):
(h 1 ,…,h n )=Encoder(f 1 ,…,f n )
Figure BDA0003680495160000061
Attention Module Pair hidden sequence (h) 1 ,…,h n ) Performing weighted accumulation, wherein the attention weight sequence is (alpha) 1 ,…,α n ) And outputting a variable h:
h=∑h ii
the Decoder module is realized by n stacked LSTM units, inputs the obtained fusion variable h, and outputs an importance score sequence (y) 1 ,…,y n ):
(y 1 ,…,y n )=Decoder(h)
Figure BDA0003680495160000062
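A minimal PyTorch sketch of the three modules; the hidden size, the sigmoid score head, and feeding the fused variable h at every Decoder time step are assumptions made for illustration, since FIG. 2 is not reproduced here:

```python
import torch
import torch.nn as nn

class NLGuidedSummarizer(nn.Module):
    """Encoder-Attention-Decoder summarizer driven by external attention weights."""

    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.score = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, frames: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # frames: (B, n, feat_dim) frame image features f_1..f_n
        # alpha:  (B, n) natural-language-guided attention weights
        h_seq, _ = self.encoder(frames)                           # (B, n, hidden)
        # Attention module: weighted sum of hidden states -> fused variable h
        fused = torch.sum(alpha.unsqueeze(-1) * h_seq, dim=1)     # (B, hidden)
        # Decoder: the fused variable is fed at every time step (an assumption)
        dec_in = fused.unsqueeze(1).repeat(1, frames.size(1), 1)  # (B, n, hidden)
        d_seq, _ = self.decoder(dec_in)                           # (B, n, hidden)
        return self.score(d_seq).squeeze(-1)                      # (B, n) scores y_1..y_n
```

Calling the module with a (1, n, 1024) feature tensor and a (1, n) weight tensor returns the (1, n) frame importance score sequence used in the training step below.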
S4: and carrying out model training on the whole network model.
In this embodiment, the SumMe video summarization dataset is selected for training. However, the titles of some videos in the SumMe dataset do not reflect the video content well; for example, "Jumps" does not fully convey what the video shows, which harms training. The video title text of the SumMe dataset therefore needs to be optimized: candidate titles are drafted according to the video content, several users watch the videos through a questionnaire and choose the suitable titles, and for each video the title chosen by the highest proportion of users, whether the original SumMe title or one of the optimized candidates, is taken as the training text, as shown in Table 1.

TABLE 1 (original SumMe video titles and the optimized titles used as training text; the table contents are provided as an image in the original filing and are not reproduced here)
The SumMe dataset contains 25 videos. Because the dataset is small, cross-validation is suitable for training: it makes full use of the 25 videos and reduces the adverse effect of an unbalanced split. In this embodiment, 80% of the data is used as the training set and 20% as the test set under five-fold cross-validation, so that each split uses 20 videos for training and 5 for testing. The model is trained with a mean loss, where (y_1, …, y_n) are the scores predicted by the model and (s_1, …, s_n) are the annotation scores of the training set; the loss function is as follows:

L = (1/n) Σ_{i=1}^{n} (y_i − s_i)²
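A minimal training-loop sketch tying the pieces together; the Adam optimizer and learning rate are assumptions, while the 150 epochs, the mean (MSE) loss, and gradient descent with back-propagation follow the description above:

```python
import torch
import torch.nn as nn

def train(model: nn.Module, dataset, epochs: int = 150, lr: float = 1e-3):
    """dataset: list of (frames, alpha, labels) tuples with
       frames: (1, n, 1024) frame features, alpha: (1, n) attention weights,
       labels: (1, n) user annotation scores s_1..s_n."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # assumed optimizer
    mse = nn.MSELoss()                                 # mean loss over frames
    for epoch in range(epochs):
        total = 0.0
        for frames, alpha, labels in dataset:
            opt.zero_grad()
            pred = model(frames, alpha)    # frame importance scores y_1..y_n
            loss = mse(pred, labels)       # L = (1/n) * sum_i (y_i - s_i)^2
            loss.backward()                # back-propagation
            opt.step()                     # gradient-descent update
            total += loss.item()
        print(f"epoch {epoch + 1}: mean loss {total / max(len(dataset), 1):.4f}")
```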
s5: according to the sequence of frame importance scores (y) 1 ,…,y n ) And reasonably selecting the video abstract.
In this embodiment, visually continuous frames are grouped into shots, and the frame importance score sequence is converted into a shot importance score sequence by averaging the importance scores of all frames within each shot; key shots are then selected from the shot importance scores and shot lengths with a 0/1 knapsack algorithm, with the total key-shot length limited to 15% of the original video length, and the key shots are synthesized into the video summary.
The embodiments described above are presented to enable one of ordinary skill in the art to make and use the invention, and they may be modified in various obvious respects without departing from the invention. Therefore, the present invention is not limited to the above embodiments, and improvements and modifications made by those skilled in the art based on the disclosure of the present invention fall within the protection scope of the present invention.

Claims (8)

1. A natural-language-guided video summarization method, comprising the following steps:
(1) decomposing the video files in a training set into frame sequences, and extracting image features from the frame sequences with a pre-trained deep image network to obtain corresponding frame image feature sequences (f_1, …, f_n);
(2) extracting semantic features from the frame sequence with an image encoding network of a pre-trained multi-modal model to obtain a frame semantic feature sequence (x_1, …, x_n); extracting semantic features from the natural language text of the video file with a text encoding network of the pre-trained multi-modal model to obtain a text semantic feature t;
(3) computing the spatial cosine similarity between the frame semantic feature sequence (x_1, …, x_n) and the text semantic feature t to obtain an attention weight sequence (α_1, …, α_n);
(4) constructing a video summarization model based on a natural-language-guided attention mechanism, whose input is the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) and whose output is a frame importance score sequence (y_1, …, y_n), and training the video summarization model;
(5) feeding the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) of a test-set video file into the trained video summarization model, which outputs the frame importance score sequence (y_1, …, y_n) of the video file;
(6) selecting key shots according to the frame importance score sequence (y_1, …, y_n) and synthesizing the key shots into a video summary.
2. The video summarization method of claim 1, wherein: since the natural language text of some of the video files in the training set does not describe the video content well, the natural language text is optimized before its semantic features are extracted: candidate titles are first drafted according to the video content, several users then watch the videos through a questionnaire and choose the suitable titles, and finally the title chosen by the highest proportion of users is taken as the optimized natural language text.
3. The video summarization method of claim 1, wherein step (3) is implemented as follows: the cosine similarity between the text semantic feature t and each feature vector in the frame semantic feature sequence (x_1, …, x_n) is first calculated to obtain a similarity sequence (sim_1, …, sim_n); softmax normalization is then applied to the similarity sequence to obtain the attention weight sequence (α_1, …, α_n).
4. The video summarization method of claim 1, wherein: the video summarization model consists of an Encoder module, an Attention module and a Decoder module connected in sequence; the Encoder module encodes the input frame image feature sequence (f_1, …, f_n) into a hidden sequence (h_1, …, h_n); the Attention module performs a weighted summation of the hidden sequence (h_1, …, h_n) with the attention weight sequence (α_1, …, α_n) to obtain a fused variable h; the Decoder module decodes the fused variable h and outputs the frame importance score sequence (y_1, …, y_n).
5. The video summarization method of claim 1, wherein the video summarization model in step (4) is trained as follows: the model parameters are first initialized; the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) are fed into the model, which predicts the output frame importance score sequence (y_1, …, y_n); the model parameters are iteratively updated by gradient descent and back-propagation according to the loss function L until L converges to a minimum or the maximum number of iterations is reached, at which point training is complete.
6. The video summarization method of claim 5, wherein the loss function L is expressed as follows:

L = (1/n) Σ_{i=1}^{n} (y_i − s_i)²

where y_i denotes the importance score of the i-th video frame, s_i denotes the annotation score of the i-th video frame, and n is the total number of frames in the video file.
7. The video summarization method of claim 6, wherein the annotation score s_i of the i-th video frame is obtained by having multiple users label the importance of the video frame, with label 1 denoting important and label 0 denoting unimportant; the annotation score s_i is the percentage of users assigning label 1 among all users.
8. The video summarization method of claim 1, wherein step (6) is implemented as follows: visually continuous frames are first grouped into shots, and the frame importance score sequence (y_1, …, y_n) is converted into a shot importance score sequence by averaging the importance scores of all frames within each shot; key shots are then selected with a 0/1 knapsack algorithm according to the shot importance score sequence and the shot lengths, with the total length of the key shots not exceeding 15% of the length of the whole video; finally, all key shots are synthesized into the video summary.
CN202210652477.2A 2022-06-07 2022-06-07 Video abstraction method guided by natural language Pending CN115033736A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210652477.2A CN115033736A (en) 2022-06-07 2022-06-07 Video abstraction method guided by natural language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210652477.2A CN115033736A (en) 2022-06-07 2022-06-07 Video abstraction method guided by natural language

Publications (1)

Publication Number Publication Date
CN115033736A true CN115033736A (en) 2022-09-09

Family

ID=83122726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210652477.2A Pending CN115033736A (en) 2022-06-07 2022-06-07 Video abstraction method guided by natural language

Country Status (1)

Country Link
CN (1) CN115033736A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116208772A (en) * 2023-05-05 2023-06-02 浪潮电子信息产业股份有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN117835012A (en) * 2023-12-27 2024-04-05 北京智象未来科技有限公司 Controllable video generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Xiong et al. A unified framework for multi-modal federated learning
CN111930992B (en) Neural network training method and device and electronic equipment
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN111414461B (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN115033736A (en) Video abstraction method guided by natural language
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
Ning et al. Semantics-consistent representation learning for remote sensing image–voice retrieval
CN111242033B (en) Video feature learning method based on discriminant analysis of video and text pairs
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
Cao et al. Visual consensus modeling for video-text retrieval
CN113392265A (en) Multimedia processing method, device and equipment
Lin et al. PS-mixer: A polar-vector and strength-vector mixer model for multimodal sentiment analysis
CN115934951A (en) Network hot topic user emotion prediction method
CN115775349A (en) False news detection method and device based on multi-mode fusion
CN115731498A (en) Video abstract generation method combining reinforcement learning and contrast learning
Duan et al. A web knowledge-driven multimodal retrieval method in computational social systems: Unsupervised and robust graph convolutional hashing
CN111222010A (en) Method for solving video time sequence positioning problem by using semantic completion neural network
CN117539999A (en) Cross-modal joint coding-based multi-modal emotion analysis method
CN116628261A (en) Video text retrieval method, system, equipment and medium based on multi-semantic space
CN116975347A (en) Image generation model training method and related device
Qi et al. Video captioning via a symmetric bidirectional decoder
CN112199531A (en) Cross-modal retrieval method and device based on Hash algorithm and neighborhood map
Kim et al. SWAG-Net: Semantic Word-Aware Graph Network for Temporal Video Grounding
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination