CN106993240B - Multi-video abstraction method based on sparse coding - Google Patents
- Publication number: CN106993240B
- Application number: CN201710151147.4A
- Authority: CN (China)
- Prior art keywords: video, sub, frame, vector, videos
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8549—Creating video summaries, e.g. movie trailer
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Computer Security & Cryptography (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Television Signal Processing For Recording (AREA)
Abstract
The invention relates to video processing and aims to provide a multi-video summarization technique based on sparse coding that clusters videos (i.e., detects their sub-topics), ranks key shots, and generates a multi-video summary. The technical scheme is as follows: first, a multi-graph model is constructed from the text and visual information of the videos, and the videos are clustered by a graph-cut method, which detects the sub-topics; second, within each sub-topic, a sparse coding method associates the video frames with the web images retrieved for that sub-topic to obtain the key shots; finally, the key shots are ordered by video upload time to form the multi-video summary. The invention is mainly applied to video processing.
Description
Technical Field
The invention relates to video processing, in particular to a multi-video summarization method based on sparse coding.
Background
With the rapid development of information technology, video data is emerging in huge quantities and has become one of the important ways for people to acquire information. However, the dramatic increase in the number of videos brings a large amount of redundant and repeated information, which makes it difficult for users to quickly obtain what they need. A technology that can integrate and analyze massive video data under the same topic is therefore urgently needed, so that people can browse the main information of videos quickly and accurately and improve their ability to acquire information. As one of the effective ways to solve this problem, multi-video summarization has attracted increasing attention from researchers over the past decades. Multi-video summarization is a content-based video data compression technology: it analyzes and integrates multiple videos on related topics under the same event, extracts their main content, and presents the extracted content to the user according to a certain logical relationship. Currently, multi-video summaries are mainly evaluated from three aspects: 1) coverage; 2) novelty; 3) importance. Coverage means that the extracted content can cover the main content of the multiple videos on the same topic. Novelty means that duplicate, redundant information is removed from the summary. Importance means that important key shots are extracted from the video set according to prior information, so that the important content of the multiple videos is captured.
Although many single-video summarization methods have been proposed, research on multi-video summarization is sparse and still at a preliminary stage, mainly for two reasons. 1) Multiple videos of the same event exhibit topic diversity and topic overlap: topic diversity means that the videos emphasize different information and contain several sub-topics, while topic overlap means that the videos share similar content yet differ in the amount of information they carry. 2) Different videos expressing the same content may differ greatly in their audio, text, and visual information. These factors make multi-video summarization difficult to study with traditional single-video techniques.
Over the past decades, multi-video summarization methods have been proposed for the characteristics of multi-video data sets. The method based on complex-graph clustering is a relatively classic one: it constructs a complex graph from the keywords of the videos' transcript information and the videos' key frames, and produces the summary with a graph clustering algorithm. However, this method mainly targets news video and is of no use for video sets without transcript information. In addition, because multiple videos on the same topic are both diverse and redundant in content, clustering alone can hardly satisfy the maximum-coverage requirement; using only the visual information of the videos yields poor clustering results for multi-video summarization, and although combining other modalities helps somewhat, it brings high complexity.
A multi-video summary can draw on information from multiple modalities, such as the text, visual, and audio information of a video. Balanced AV-MMR (Balanced Audio-Video Maximal Marginal Relevance) is a multi-video summarization technique that makes effective use of multi-modal information: it analyzes the visual information, the audio information, and the semantic information within them, including audio, human faces, and temporal features that matter for video summarization. Although the method exploits the multi-modal information of the videos effectively, the extracted summaries still do not achieve a good effect.
In recent years, novel methods have been proposed. Among them, exploiting the visual co-occurrence of videos is one: it observes that important visual concepts recur across multiple videos under the same topic, proposes a maximal biclique finding algorithm based on this observation, and extracts the sparse co-occurrence patterns of the videos to realize multi-video summarization. However, the method suits only specific data sets and loses its meaning for video sets with little repeated content.
In addition, to exploit more related information, researchers have proposed using sensors on mobile phones, such as GPS and compass, to record information such as geographic position during shooting, thereby helping to determine the important information in a video and generate a multi-video summary. Web images have also been used as prior auxiliary information to better realize multi-video summarization. At present, because of the complexity of multi-video data, research on multi-video summarization has not reached an ideal effect; how to better exploit the information in multi-video data has therefore become a research hot spot.
Disclosure of Invention
To overcome the defects of the prior art, the invention aims to provide a multi-video summarization technique based on sparse coding that clusters videos (i.e., detects their sub-topics), ranks key shots, and generates a multi-video summary. The technical scheme is as follows: first, a multi-graph model is constructed from the text and visual information of the videos, and the videos are clustered by a graph-cut method, which detects the sub-topics; second, within each sub-topic, a sparse coding method associates the video frames with the web images retrieved for that sub-topic to obtain the key shots; finally, the key shots are ordered by video upload time to form the multi-video summary.
The sparse coding method is specifically as follows. Given the set of video frames for a particular event, let X = {x_1, x_2, …, x_N} denote the feature set of the N video frames and Z = {z_1, z_2, …, z_L} denote the feature set of the L web images, where x_i ∈ R^d and z_i ∈ R^d. The candidate video frames X are used as the basis vector group, and each video frame x_i and each web image z_i retrieved by sub-topic keyword search serve jointly as input vectors for learning the expression scores a_i of the video frames. Each frame x_i corresponds to a coefficient variable a_i, called the expression score of the i-th frame; with the aid of the web-image prior, it conveys how much each frame contributes to reconstructing the topic space. The objective function is therefore constructed as follows, with a regularization term added to ensure sparsity:

$$\min_{a}\;\sum_{i=1}^{N}\Big\|x_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\sum_{i=1}^{L}\Big\|z_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\gamma\|a\|_1$$

$$\text{s.t.}\;a_j\ge 0\;\text{for}\;j\in\{1,\dots,N\},\;\gamma>0 \tag{2}$$

where the coefficient a_j is the expression score of the j-th frame, all target vectors x_i and z_i share the same coefficient vector a = {a_1, a_2, …, a_N}, and ‖a‖_1 is the regularization term; its addition yields a sparse expression vector.
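As a rough illustration only (not the patented implementation; the function name and array layouts are assumptions), the objective — the reconstruction error of both the video frames and the web images from the candidate-frame basis, plus the L1 sparsity term — can be evaluated as:

```python
import numpy as np

def expression_cost(X, Z, a, gamma):
    """Cost of an expression-score vector under a formula-(2)-style objective.

    X : (N, d) array - candidate video-frame features (also the basis).
    Z : (L, d) array - web-image features for the sub-topic.
    a : (N,)  array - non-negative expression scores, shared by all targets.
    """
    recon = a @ X                          # shared reconstruction sum_j a_j * x_j, shape (d,)
    frame_err = np.sum((X - recon) ** 2)   # frames vs. the reconstruction
    image_err = np.sum((Z - recon) ** 2)   # web images vs. the reconstruction
    return frame_err + image_err + gamma * np.abs(a).sum()
```

Because all targets share one coefficient vector, a sparse `a` directly marks the few frames that best span the topic space.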
In a specific example, the objective function for extracting key shots is constructed as formula (2), and the sparse vector A is then obtained with a coordinate gradient descent algorithm. The specific process is as follows:

First, the vector A is initialized to the zero vector, and the partial derivative of the objective function with respect to each expression score a_i (i = 1, 2, …, N) is computed; then the expression score a_i whose partial derivative is largest is selected and updated by the soft-threshold method; finally, this process is iterated until the change in the value of the cost function is smaller than a threshold or the number of iterations reaches a set value, yielding the expression vector A.

Finally, a summary of a given length is generated. Given the summary time length l_k for the k-th sub-topic, the shots are selected by solving the following optimization problem:

$$\max_{\mu^k}\;\sum_{i=1}^{s_k}a_i^k\mu_i^k\quad\text{s.t.}\;\sum_{i=1}^{s_k}l_i^k\mu_i^k\le l_k,\;\mu_i^k\in\{0,1\}$$

where s_k is the number of shots under the k-th sub-topic, a_i^k is the importance (expression) score of the i-th shot under the k-th sub-topic, l_i^k is the time length of the i-th shot, and μ^k is the selection vector: μ_i^k = 1 indicates that the i-th shot is selected, and μ_i^k = 0 that it is not. This optimization problem is a classic knapsack problem, and solving it by dynamic programming realizes the multi-video summary.
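The shot-selection step can be sketched as a standard 0/1 knapsack solved by dynamic programming (a sketch under the assumption that shot lengths are integer seconds; `select_shots` and its signature are illustrative, not taken from the patent):

```python
def select_shots(scores, lengths, budget):
    """0/1 knapsack by dynamic programming: pick shots maximising the total
    expression score subject to the summary-length budget.

    scores  : importance/expression score per shot.
    lengths : integer time length per shot.
    budget  : total summary length l_k for the sub-topic.
    Returns the selection vector mu (0/1 per shot) and the best total score.
    """
    n = len(scores)
    # best[i][c] = best score using the first i shots within capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]            # skip shot i
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v  # take shot i
    # backtrack to recover the selection vector mu
    mu, c = [0] * n, budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            mu[i - 1] = 1
            c -= lengths[i - 1]
    return mu, best[n][budget]
```

For example, shots with scores (6, 10, 12) and lengths (1, 2, 3) under a 5-second budget select the second and third shots for a total score of 22.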
The invention has the following characteristics and beneficial effects:

The invention addresses the shortcomings of existing multi-video summarization methods by designing a sparse-coding-based method suited to the characteristics of multi-video data, making full use of the data's specific information with the assistance of effective prior information. Its advantages are mainly reflected in:

(1) Novelty: the multi-graph model is applied to video clustering, making full use of the multi-modal information of the videos and better detecting the sub-topics of a multi-video set. On this basis, the method of extracting key shots with sparse coding is proposed for the first time.

(2) Effectiveness: compared with a typical clustering method and with the minimum sparse reconstruction method used for single-video summarization, the proposed sparse-coding-based multi-video summarization method performs markedly better, and is therefore better suited to the multi-video summarization problem.

(3) Practicability: the method is simple and feasible and can be used in the field of multimedia information processing.
Description of the drawings:
FIG. 1 is a flow chart of the sparse-coding-based video key shot extraction provided by the invention;

FIG. 2 is a flow chart of the coordinate gradient descent algorithm for solving the objective function.
Detailed Description
Aiming at the large amount of redundant and repeated information in multimedia video data, the invention combines the visual information, text information, and other topic-related prior information of the videos and uses the idea of sparse coding to improve the traditional multi-video summarization method, so as to effectively exploit topic-related information and improve the efficiency with which users browse videos.
The invention aims to provide a multi-video summarization technique based on sparse coding. According to the characteristics of multi-video data, the invention first constructs a multi-graph model from the text and visual information of the videos and clusters the videos, i.e., detects their sub-topics, by a graph-cut method. Then, under each sub-topic, a sparse coding method links the video frames with the web images retrieved for that sub-topic to obtain the important key shots. Finally, the key shots are ordered by video upload time to form the multi-video summary.
The method provided by the invention mainly comprises: introducing web images as auxiliary information and designing a sparse coding method suited to the characteristics of multi-video summary data sets to obtain the key shots (frames) of multiple videos; on this basis, the key shots (frames) are ordered using the upload time of the videos.
Multi-video summarization aims to compress a long video set into a short summary set and help users quickly grasp its main information. Multiple video sets of the same event generally exhibit topic diversity, topic overlap, and similar characteristics, so simply applying single-video summarization methods to multiple videos is infeasible. Web images retrieved by sub-topic keywords can generally be considered to reflect the important content of the topic: each image is uploaded by a user, mostly taken from downloaded videos of the related topic, and therefore reflects user interest. Compared with video frames, such images capture the topic from a typical viewpoint, carry richer semantic information, and contain less noise. The specific principle of the method is as follows:
sparse coding aims at selecting a set of basis vectorsTo reconstruct the k input vectors xjTo minimize the reconstruction error, the following equation is formulated:
wherein a isijRepresenting the ith input vector xiAnd the jth base vectorWhere the first term represents the input vector xiAnd the base vector groupThe reconstruction error of (1). The second term in the equation guarantees that the reconstruction coefficient matrix a ═ is (a)ij) Gamma is the regularization coefficient, balancing the first and second terms.
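A minimal sketch of evaluating this general sparse coding objective (the function name and the matrix layout of X, B, and A are illustrative assumptions, not from the patent):

```python
import numpy as np

def sparse_coding_cost(X, B, A, gamma):
    """Formula-(1)-style objective: reconstruction error of the k input
    vectors from the basis, plus an L1 penalty on the coefficients.

    X : (k, d) input vectors, B : (n, d) basis vectors,
    A : (k, n) reconstruction coefficients, gamma : regularization weight.
    """
    recon_err = np.sum((X - A @ B) ** 2)  # first term: reconstruction error
    sparsity = gamma * np.abs(A).sum()    # second term: keeps A sparse
    return recon_err + sparsity
```

The γ weight trades off fidelity against sparsity: large γ drives most coefficients in A to zero, so each input is explained by only a few basis vectors.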
In the present invention, given a set of video frames for a particular event, X = {x_1, x_2, …, x_N} denotes the feature set of the N video frames and Z = {z_1, z_2, …, z_L} denotes the feature set of the L web images, where x_i ∈ R^d and z_i ∈ R^d. The essence of multi-video summarization is to select a certain number of frames to reconstruct the topic space of the original videos. The method combines the web-image prior, constructs an objective function using the idea of sparse coding, and learns the common pattern between the video frames and the web images retrieved by sub-topic keywords. The invention directly uses the candidate video frames X as the basis vector group, and each video frame x_i and each web image z_i retrieved by sub-topic keyword search serve jointly as input vectors for learning the expression scores a_i of the video frames. Each frame x_i corresponds to a coefficient variable a_i, called the expression score of the i-th frame; with the aid of the web-image prior, it conveys how much each frame contributes to reconstructing the topic space. The objective function is constructed as follows, with a regularization term added to ensure sparsity:

$$\min_{a}\;\sum_{i=1}^{N}\Big\|x_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\sum_{i=1}^{L}\Big\|z_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\gamma\|a\|_1$$

$$\text{s.t.}\;a_j\ge 0\;\text{for}\;j\in\{1,\dots,N\},\;\gamma>0 \tag{2}$$

where the coefficient a_j is the expression score of the j-th frame, all target vectors share the same coefficient vector a = {a_1, a_2, …, a_N}, and the added regularization term yields a sparse expression vector.
The present invention will be described in further detail with reference to the following drawings and specific examples.
Fig. 1 depicts the flow of extracting key shots from the videos with the sparse coding method, combined with the web-image prior under a sub-topic. The following process extracts key shots for one sub-topic; the key shots of the other sub-topics of the same event are extracted in the same way.

First, the video frames under each sub-topic and the sub-topic-based web image features are extracted. Given the set of video frames for a certain sub-topic of a particular event, X = {x_1, x_2, …, x_N} denotes the N-frame feature set and Z = {z_1, z_2, …, z_L} denotes the L-image web feature set, where x_i ∈ R^d and z_i ∈ R^d. Each frame x_i corresponds to a coefficient variable a_i, called the expression score of the i-th frame, which represents the contribution of the i-th frame in the reconstruction.
Second, the objective function for extracting key shots is constructed. Based on the idea of sparse coding, the invention directly uses the candidate video frames X as the basis vector group, and each video frame x_i and web image z_i retrieved by sub-topic keyword search serve jointly as input vectors for learning the expression scores a_i of the video frames. The goal is to simultaneously minimize the reconstruction error between the input vectors x_i, z_i and the basis vector group X = {x_1, x_2, …, x_N}, i.e., to learn the important shots of the videos with the assistance of the web images retrieved by sub-topic keywords; the objective function is formula (2). The sparse vector A is then obtained with a coordinate gradient descent algorithm, as follows:

First, the vector A is initialized to the zero vector, and the partial derivative of the objective function with respect to each expression score a_i (i = 1, 2, …, N) is computed; then the expression score a_i whose partial derivative is largest is selected and updated by the soft-threshold method. Finally, this process is iterated until the change in the value of the cost function is smaller than a threshold or the number of iterations reaches a set value, yielding the expression vector A.
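The iteration above can be sketched as greedy coordinate descent with soft-thresholding (a sketch only: the step size, the |gradient| tie-breaking, and the function name are illustrative choices, not details from the patent):

```python
import numpy as np

def solve_scores(X, Z, gamma=0.1, max_iter=200, tol=1e-8):
    """Greedy coordinate descent for the shared, non-negative expression-score
    vector: initialise at zero, repeatedly update the coordinate with the
    largest partial derivative via a soft-threshold step, stop when the cost
    change falls below tol."""
    N, d = X.shape
    L = Z.shape[0]

    def cost(a):
        r = a @ X  # shared reconstruction in R^d
        return np.sum((X - r) ** 2) + np.sum((Z - r) ** 2) + gamma * a.sum()

    a = np.zeros(N)  # initialise A as the zero vector
    # conservative step from a coordinate-wise Lipschitz bound on the gradient
    step = 1.0 / (2 * (N + L) * np.max(np.sum(X ** 2, axis=1)) + 1e-12)
    prev = cost(a)
    for _ in range(max_iter):
        r = a @ X
        resid = (N + L) * r - X.sum(axis=0) - Z.sum(axis=0)
        grad = 2 * X @ resid                   # partial derivatives of the smooth part
        j = int(np.argmax(np.abs(grad)))       # coordinate with largest |derivative|
        a[j] = max(0.0, a[j] - step * (grad[j] + gamma))  # soft-threshold + clip at 0
        cur = cost(a)
        if abs(prev - cur) < tol:              # cost change below threshold: stop
            break
        prev = cur
    return a
```

With the clip at zero, the L1 term reduces to the plain sum of scores, and the soft-threshold step both shrinks and sparsifies the expression vector.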
Finally, a summary of a given length is generated. All the key shots of the sub-topics are extracted according to the above process. Given the summary time length l_k for the k-th sub-topic, the invention selects shots by solving the following optimization problem:

$$\max_{\mu^k}\;\sum_{i=1}^{s_k}a_i^k\mu_i^k\quad\text{s.t.}\;\sum_{i=1}^{s_k}l_i^k\mu_i^k\le l_k,\;\mu_i^k\in\{0,1\}$$

where s_k is the number of shots under the k-th sub-topic, a_i^k is the importance (expression) score of the i-th shot (frame) under the k-th sub-topic, l_i^k is the time length of the i-th shot, and μ^k is the selection vector: μ_i^k = 1 indicates that the i-th shot is selected, and μ_i^k = 0 that it is not. This optimization problem is a classic knapsack problem, and solving it by dynamic programming realizes the multi-video summary.
Claims (1)
1. A multi-video summarization method based on sparse coding, characterized in that a multi-graph model is constructed from the text information and visual information of the videos, and the videos are clustered by a graph-cut method, i.e., the sub-topics of the videos are detected; secondly, under each sub-topic, a sparse coding method associates the video frames with the web images retrieved for that sub-topic to obtain the key shots; finally, the key shots are ordered by video upload time to realize the multi-video summary; the sparse coding method is specifically as follows: given the set of video frames for a particular event, X = {x_1, x_2, …, x_N} denotes the feature set of the N video frames and Z = {z_1, z_2, …, z_L} denotes the feature set of the L web images, where x_i ∈ R^d and z_i ∈ R^d; the candidate video frames X serve as the basis vector group, and each video frame x_i and each web image z_i retrieved by sub-topic keyword search serve jointly as input vectors for learning an expression score a_i of each video frame; each frame x_i corresponds to a coefficient variable a_i, called the expression score of the i-th frame, which conveys, with the aid of the web-image prior, the contribution of each frame to reconstructing the topic space; the objective function is constructed as follows, with a regularization term added to ensure sparsity:

$$\min_{a}\;\sum_{i=1}^{N}\Big\|x_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\sum_{i=1}^{L}\Big\|z_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\gamma\|a\|_1$$

$$\text{s.t.}\;a_j\ge 0\;\text{for}\;j\in\{1,\dots,N\},\;\gamma>0 \tag{1}$$

wherein the coefficient a_j is the expression score of the j-th frame, all target vectors x_i share the same coefficient vector a = {a_1, a_2, …, a_N}, and ‖a‖_1 is the regularization term whose addition yields a sparse expression vector;

the sparse vector A is obtained with a coordinate gradient descent algorithm, as follows: first, the sparse vector A is initialized to the zero vector, and the partial derivative of the objective function with respect to each expression score a_i, i = 1, 2, …, N, is computed; then the expression score a_i whose partial derivative is largest is selected and updated by the soft-threshold method; finally, this process is iterated until the change in the value of the cost function is smaller than a threshold or the number of iterations reaches a set value, yielding the sparse vector A;

finally, a summary of a given length is generated: all the key shots of the sub-topics are extracted according to the above process, and given the summary time length l_k for the k-th sub-topic, the shots are selected by solving the following optimization problem:

$$\max_{\mu^k}\;\sum_{i=1}^{s_k}a_i^k\mu_i^k\quad\text{s.t.}\;\sum_{i=1}^{s_k}l_i^k\mu_i^k\le l_k,\;\mu_i^k\in\{0,1\}$$

wherein s_k is the number of shots under the k-th sub-topic, a_i^k is the importance or expression score of the i-th shot/frame under the k-th sub-topic, l_i^k is the time length of the i-th shot, and μ^k is the selection vector: μ_i^k = 1 indicates that the i-th shot is selected, otherwise it is not selected.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710151147.4A CN106993240B (en) | 2017-03-14 | 2017-03-14 | Multi-video abstraction method based on sparse coding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106993240A CN106993240A (en) | 2017-07-28 |
CN106993240B true CN106993240B (en) | 2020-10-16 |
Family
ID=59411588
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710151147.4A Active CN106993240B (en) | 2017-03-14 | 2017-03-14 | Multi-video abstraction method based on sparse coding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106993240B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107911755B (en) * | 2017-11-10 | 2020-10-20 | 天津大学 | Multi-video abstraction method based on sparse self-encoder |
CN107943990B (en) * | 2017-12-01 | 2020-02-14 | 天津大学 | Multi-video abstraction method based on prototype analysis technology with weight |
CN109348287B (en) * | 2018-10-22 | 2022-01-28 | 深圳市商汤科技有限公司 | Video abstract generation method and device, storage medium and electronic equipment |
CN110110636B (en) * | 2019-04-28 | 2021-03-02 | 清华大学 | Video logic mining device and method based on multi-input single-output coding and decoding model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7783459B2 (en) * | 2007-02-21 | 2010-08-24 | William Marsh Rice University | Analog system for computing sparse codes |
CN104113789A (en) * | 2014-07-10 | 2014-10-22 | 杭州电子科技大学 | On-line video abstraction generation method based on depth learning |
CN106034264A (en) * | 2015-03-11 | 2016-10-19 | 中国科学院西安光学精密机械研究所 | Coordination-model-based method for obtaining video abstract |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Xue et al. | Advancing high-resolution video-language representation with large-scale video transcriptions | |
Gabeur et al. | Multi-modal transformer for video retrieval | |
Otani et al. | Learning joint representations of videos and sentences with web image search | |
CN106993240B (en) | Multi-video abstraction method based on sparse coding | |
CN107943990B (en) | Multi-video abstraction method based on prototype analysis technology with weight | |
CN107203636B (en) | Multi-video abstract acquisition method based on hypergraph master set clustering | |
CN101739428B (en) | Method for establishing index for multimedia | |
WO2020177673A1 (en) | Video sequence selection method, computer device and storage medium | |
CN107911755B (en) | Multi-video abstraction method based on sparse self-encoder | |
Liu et al. | Attention guided deep audio-face fusion for efficient speaker naming | |
Momeni et al. | Automatic dense annotation of large-vocabulary sign language videos | |
Choi et al. | Contextually customized video summaries via natural language | |
CN114048351A (en) | Cross-modal text-video retrieval method based on space-time relationship enhancement | |
CN113392265A (en) | Multimedia processing method, device and equipment | |
Liang et al. | Informedia@ trecvid 2016 med and avs | |
Varol et al. | Scaling up sign spotting through sign language dictionaries | |
CN113408619B (en) | Language model pre-training method and device | |
CN112883229B (en) | Video-text cross-modal retrieval method and device based on multi-feature-map attention network model | |
Li et al. | Image-text alignment and retrieval using light-weight transformer | |
Zhang et al. | SOR-TC: Self-attentive octave ResNet with temporal consistency for compressed video action recognition | |
Xu et al. | Multi-guiding long short-term memory for video captioning | |
CN110826397B (en) | Video description method based on high-order low-rank multi-modal attention mechanism | |
CN116977701A (en) | Video classification model training method, video classification method and device | |
Jiang | Web-scale multimedia search for internet video content | |
CN109857906B (en) | Multi-video abstraction method based on query unsupervised deep learning |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |