CN106993240B - Multi-video abstraction method based on sparse coding - Google Patents

Multi-video abstraction method based on sparse coding

Info

Publication number
CN106993240B
CN106993240B
Authority
CN
China
Prior art keywords
video, sub, frame, vector, videos
Prior art date
Legal status
Active
Application number
CN201710151147.4A
Other languages
Chinese (zh)
Other versions
CN106993240A (en)
Inventor
冀中 (Ji Zhong)
马亚茹 (Ma Yaru)
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201710151147.4A
Publication of CN106993240A
Application granted
Publication of CN106993240B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention relates to video processing and aims to provide a multi-video summarization technique based on sparse coding that clusters the videos, i.e., detects their sub-topics, ranks the key shots, and realizes multi-video summarization. The technical scheme adopted by the invention is as follows: in a sparse-coding-based multi-video summarization method, a multi-graph model is constructed from the text information and visual information of the videos, and the videos are clustered, i.e., their sub-topics are detected, by a graph-cut method; secondly, under each sub-topic, a sparse-coding method associates the video frames of the sub-topic with the web images based on that sub-topic to obtain the key shots; finally, the key shots are ordered by video upload time, thereby realizing the multi-video summary. The invention is mainly applied to video processing.

Description

Multi-video abstraction method based on sparse coding
Technical Field
The invention relates to video processing, in particular to a multi-video summarization method based on sparse coding.
Background
With the rapid development of information technology, video data is appearing in huge quantities and has become one of the important ways for people to acquire information. However, the dramatic increase in the number of videos brings a large amount of redundant and repeated information, making it difficult for users to quickly find what they need. A technology that can integrate and analyze massive video data under the same topic is therefore urgently needed, so that people can browse the main information of videos quickly and accurately and improve their ability to acquire information. As one of the effective ways to solve this problem, multi-video summarization has attracted increasing attention from researchers over the past decades. Multi-video summarization is a content-based video data compression technique: it analyzes and integrates multiple videos of related topics under the same event, extracts the main content of those videos, and presents the extracted content to the user according to a certain logical relationship. Currently, multi-video summaries are mainly analyzed from three aspects: 1) coverage; 2) novelty; 3) importance. Coverage means that the extracted content covers the main content of the multiple videos on the same topic. Novelty means that duplicate, redundant information is removed from the multi-video summary. Importance means that the key shots of the video set are extracted according to some prior information, so that the important content of the multiple videos is captured.
Although many single-video summarization methods have been proposed, research on multi-video summarization remains scarce and at a preliminary stage, mainly for two reasons. 1) Multiple videos under the same event exhibit topic diversity and topic crossover. Topic diversity means that the videos of one event emphasize different information and contain several sub-topics. Topic crossover means that the content of videos under the same event overlaps: the videos share similar content while carrying different amounts of information. 2) Multiple videos expressing the same content may differ greatly in their audio, text, and visual information. These characteristics make multi-video summarization difficult to study with traditional single-video summarization methods.
In the past decades, multi-video summarization methods have been proposed for the characteristics of multi-video data sets. Multi-video summarization based on complex-graph clustering is a relatively classical method: it constructs a complex graph from keywords extracted from the videos' transcript information and from the videos' key frames, and then produces a summary with a graph clustering algorithm. However, the method is mainly intended for news video and is meaningless for video sets without transcript information. Moreover, because the content of multiple videos on the same topic is both diverse and redundant, a clustering method alone can hardly satisfy the maximum-coverage requirement on video content; clustering that relies only on the visual information of the videos performs poorly for multi-video summarization, and although combining other modalities helps, it raises the complexity considerably.
A multi-video summary can draw on information from multiple modalities, such as the text, visual, and audio information of a video. Balanced AV-MMR (Balanced Audio-Video Maximal Marginal Relevance) is a multi-video summarization technique that exploits multi-modal video information effectively: it analyzes the visual and audio information of a video together with the semantic information they carry, including audio, human faces, and temporal features that matter for video summarization. The method makes effective use of the multi-modal information of the video, but the extracted video summary does not achieve a good result.
In recent years, novel methods have been proposed. Among them, realizing multi-video summarization by the visual co-occurrence property of videos is a representative one. It observes that important visual concepts recur across multiple videos of the same topic, proposes a maximal biclique finding algorithm based on this property, and extracts the sparse co-occurrence patterns of the videos to produce a multi-video summary. However, the method only suits specific data sets and loses its meaning on video sets with little repetition between videos.
In addition, to exploit more related information, researchers have proposed using sensors on mobile phones, such as GPS and compass, to record information like geographic position while a phone video is shot, which helps determine the important information in a video and generate a multi-video summary. Web-image prior information has also been used as auxiliary information in this field to realize multi-video summarization better. At present, owing to the complexity of multi-video data, research on multi-video summarization has not achieved the desired effect; how to make better use of the information in multi-video data to better realize multi-video summarization has therefore become a research hotspot.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a multi-video summarization technique based on sparse coding that clusters the videos, i.e., detects their sub-topics, ranks the key shots, and realizes multi-video summarization. The technical scheme adopted by the invention is as follows: in a sparse-coding-based multi-video summarization method, a multi-graph model is constructed from the text information and visual information of the videos, and the videos are clustered, i.e., their sub-topics are detected, by a graph-cut method; secondly, under each sub-topic, a sparse-coding method associates the video frames of the sub-topic with the web images based on that sub-topic to obtain the key shots; finally, the key shots are ordered by video upload time, thereby realizing the multi-video summary.
The sparse-coding method is specifically as follows. Given the set of video frames of a particular event, let X = {x_1, x_2, ..., x_N} denote the feature set of the N video frames and Z = {z_1, z_2, ..., z_L} the feature set of the L web images, where x_i ∈ R^d and z_i ∈ R^d. The candidate video frames X are used as the basis-vector group, and the video frames x_i together with the web images z_i retrieved by sub-topic keyword search are used jointly as input vectors, from which an expression score a_i is learned for each video frame. Each frame x_i corresponds to one coefficient variable a_i, called the expression score of the i-th frame; with the aid of the web-image prior information, it conveys how much each frame contributes to reconstructing the topic space. The objective function is therefore constructed as follows, with a regularization term added to ensure sparsity:

$$\min_{a}\ \sum_{i=1}^{N}\Big\|x_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\sum_{i=1}^{L}\Big\|z_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\gamma\sum_{j=1}^{N}|a_j|$$

$$\text{s.t. } a_j\ge 0\ \text{for } j\in\{1,\dots,N\},\quad \gamma>0 \qquad (2)$$

where the coefficient a_j is the expression score of the j-th frame and all target vectors x_i (and z_i) share the same coefficient vector a = {a_1, a_2, ..., a_N}; the term γ Σ_j |a_j| is the regularization term, and because it is added, the learned expression vector is sparse.
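As a concrete illustration (not part of the patent text), the following minimal NumPy sketch evaluates objective (2); the helper name objective and the toy data are assumptions for illustration only:

```python
import numpy as np

def objective(X, Z, a, gamma):
    """Value of objective (2): every video frame x_i and web image z_i is
    reconstructed from the shared nonnegative code `a` over the
    candidate-frame basis X.

    X : (N, d) candidate video-frame features (also the basis vectors)
    Z : (L, d) sub-topic web-image features
    a : (N,)  expression scores, one per candidate frame
    """
    recon = X.T @ a                           # sum_j a_j x_j, shape (d,)
    frame_err = np.sum((X - recon) ** 2)      # sum_i ||x_i - recon||^2
    image_err = np.sum((Z - recon) ** 2)      # sum_i ||z_i - recon||^2
    return frame_err + image_err + gamma * np.abs(a).sum()

# toy check: a uniform code over 40 candidate frames of dimension 128
rng = np.random.default_rng(0)
X, Z = rng.normal(size=(40, 128)), rng.normal(size=(10, 128))
print(objective(X, Z, np.full(40, 1.0 / 40), gamma=0.1))
```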
In a specific example, the objective function for extracting key shots is constructed as formula (2), and the sparse vector A is then obtained with a coordinate gradient descent algorithm. The specific process is as follows:
first, the vector A is initialized to the zero vector, and the partial derivative of the objective function with respect to each expression score a_i (i = 1, 2, ..., N) is computed; then the expression score a_i whose partial derivative is largest in magnitude is selected and updated by the soft-threshold method; finally, this process is iterated until the change in the value of the cost function is smaller than a certain threshold or the number of iterations reaches a certain value, which yields the expression vector A;
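A minimal sketch of that loop is given below, reusing the objective helper above; the greedy coordinate choice, the exact soft-threshold update, and the stopping constants are illustrative assumptions rather than values prescribed by the patent:

```python
import numpy as np

def coordinate_descent(X, Z, gamma=0.1, tol=1e-6, max_iter=500):
    """Greedy coordinate descent for objective (2) with a_j >= 0.

    Each iteration picks the coordinate whose (projected) partial
    derivative is largest in magnitude, then minimizes the objective
    exactly along it; the closed-form update is a soft-threshold
    clamped at zero.
    """
    N, d = X.shape
    targets = np.vstack([X, Z])      # all N + L input vectors share the code
    T = targets.shape[0]
    t_sum = targets.sum(axis=0)
    sq = np.sum(X ** 2, axis=1)      # ||x_j||^2 for each basis frame
    a = np.zeros(N)
    recon = np.zeros(d)              # current reconstruction sum_j a_j x_j
    prev = objective(X, Z, a, gamma)
    for _ in range(max_iter):
        grad = 2.0 * (X @ (T * recon - t_sum)) + gamma
        grad[(a == 0.0) & (grad > 0.0)] = 0.0   # respect the a_j >= 0 boundary
        j = int(np.argmax(np.abs(grad)))
        recon_mj = recon - a[j] * X[j]          # reconstruction without frame j
        c = X[j] @ (t_sum - T * recon_mj)
        a[j] = max(0.0, (c - gamma / 2.0) / (T * sq[j]))  # soft-threshold step
        recon = recon_mj + a[j] * X[j]
        val = objective(X, Z, a, gamma)
        if abs(prev - val) < tol:               # cost change below threshold
            break
        prev = val
    return a
```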
finally, a summary of a given length is generated: given the summary time length l_k for the k-th sub-topic, the method solves the following optimization problem:

$$\max_{\mu_k}\ \sum_{i=1}^{s_k}\mu_k^i a_k^i \qquad \text{s.t.}\ \sum_{i=1}^{s_k}\mu_k^i l_k^i\le l_k,\quad \mu_k^i\in\{0,1\} \qquad (3)$$

where s_k is the number of shots under the k-th sub-topic, a_k^i is the importance score (expression score) of the i-th shot under the k-th sub-topic, l_k^i is the time length of the i-th shot, and μ_k is the selection vector: μ_k^i = 1 indicates that the i-th shot is selected, and μ_k^i = 0 that it is not. This optimization problem is a typical 0-1 knapsack problem, and solving it by dynamic programming realizes the multi-video summary.
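For illustration, a minimal dynamic-programming solver for this 0-1 knapsack selection is sketched below; shot lengths are assumed to be discretized to integer units (e.g., seconds), and the function name select_shots is hypothetical:

```python
def select_shots(scores, lengths, budget):
    """Solve the 0-1 knapsack of formula (3) by dynamic programming.

    scores  : importance/expression score a_k^i per shot
    lengths : integer time length l_k^i per shot (e.g., in seconds)
    budget  : summary time length l_k for this sub-topic, same units
    Returns the indices of the selected shots.
    """
    n = len(scores)
    best = [0.0] * (budget + 1)                    # best[t]: max score within time t
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        for t in range(budget, lengths[i] - 1, -1):
            cand = best[t - lengths[i]] + scores[i]
            if cand > best[t]:
                best[t] = cand
                keep[i][t] = True                  # shot i used at capacity t
    chosen, t = [], budget
    for i in range(n - 1, -1, -1):                 # backtrack the choices
        if keep[i][t]:
            chosen.append(i)
            t -= lengths[i]
    return sorted(chosen)
```

For example, select_shots([3.0, 1.2, 2.5], [10, 4, 7], budget=15) returns [0, 1]: the highest-scoring subset of shots whose total length fits the 15-unit budget.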
The invention has the characteristics and beneficial effects that:
the invention mainly aims at the defects of the existing multi-video summarization method, designs a multi-video summarization method based on sparse coding and suitable for the characteristics of multi-video data, and makes full use of the specific information of the data under the assistance of effective prior information. The advantages are mainly reflected in that:
(1) the novelty is as follows: the multi-graph model is applied to video clustering, multi-mode information of videos is fully utilized, and sub-topic detection of a multi-video set is better achieved. On the basis, the method for extracting the key shots of the video by using the sparse coding is firstly proposed.
(2) Effectiveness: compared with a typical clustering method and a minimum sparse reconstruction method applied to single video abstraction, the sparse coding-based multi-video abstraction method disclosed by the invention is proved to have obviously better performance than the clustering method and the minimum sparse reconstruction method, so that the sparse coding-based multi-video abstraction method is more suitable for the multi-video abstraction problem.
(3) The practicability is as follows: the method is simple and feasible and can be used in the field of multimedia information processing.
Description of the drawings:
FIG. 1 is a flow chart of the sparse-coding-based video key-shot extraction provided by the invention;
FIG. 2 is a flow chart of the coordinate gradient descent algorithm for solving the objective function.
Detailed Description
Aiming at the large amount of redundant and repeated information in multimedia video data, the invention combines the visual information and text information of the videos with other prior information related to their topics, and uses the idea of sparse coding to improve the traditional multi-video summarization approach, thereby making effective use of topic-related information and improving the efficiency with which users browse videos.
The invention aims to provide a multi-video summarization technique based on sparse coding. According to the characteristics of multi-video data, the invention first constructs a multi-graph model from the text information and visual information of the videos and clusters the videos, i.e., detects their sub-topics, by a graph-cut method. Then, under each sub-topic, a sparse-coding method links the video frames of the sub-topic with the web images based on that sub-topic to obtain the important key shots. Finally, the key shots are ordered by video upload time, thereby realizing the multi-video summary.
The method provided by the invention mainly does the following: web images are introduced as auxiliary information, and a sparse-coding method suited to the characteristics of multi-video summarization data sets is designed to acquire the key frames (shots) of multiple videos; on this basis, the key shots (frames) are ordered using the upload-time information of the videos.
Multi-video summarization aims to compress a long video set into a short summary set and help the user quickly grasp the main information of the set. Multiple video sets of the same event generally exhibit topic diversity, topic crossover, and similar characteristics, so simply applying single-video summarization methods to multiple videos is infeasible. Web images retrieved with sub-topic keywords can generally be considered to reflect the important content of the topic: each image is uploaded by a user, and most come from downloaded videos of the related topic, so they reflect user interest; compared with video frames, they have the advantage of capturing the topic from a typical viewpoint, with richer semantic information and less noise. The specific principle of the method is as follows:
Sparse coding aims at selecting a set of basis vectors B = {b_1, b_2, ..., b_m} with which to reconstruct the k input vectors x_i while minimizing the reconstruction error, formulated as:

$$\min_{B,A}\ \sum_{i=1}^{k}\Big\|x_i-\sum_{j=1}^{m}a_{ij}b_j\Big\|_2^2+\gamma\sum_{i=1}^{k}\sum_{j=1}^{m}|a_{ij}| \qquad (1)$$

where a_{ij} is the reconstruction coefficient between the i-th input vector x_i and the j-th basis vector b_j. The first term is the error of reconstructing the input vectors x_i from the basis-vector group B; the second term ensures that the reconstruction-coefficient matrix A = (a_{ij}) is sparse; γ is the regularization coefficient that balances the first and second terms.
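Read this way, formula (2) follows from (1) under two assumptions of the construction described here: the basis-vector group is fixed to the candidate frames themselves (b_j = x_j, m = N), and all input vectors, video frames and web images alike, share a single coefficient vector (a_{ij} = a_j for every i). With a_j ≥ 0, so that |a_j| = a_j, the objective becomes:

$$\min_{a\ge 0}\ \underbrace{\sum_{i=1}^{N}\Big\|x_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2}_{\text{video frames}}+\underbrace{\sum_{i=1}^{L}\Big\|z_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2}_{\text{web images}}+\gamma\sum_{j=1}^{N}a_j$$

which matches the form solved for the shared expression vector a below.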
In the present invention, given the set of video frames of a particular event, X = {x_1, x_2, ..., x_N} denotes the feature set of the N video frames and Z = {z_1, z_2, ..., z_L} the feature set of the L web images, where x_i ∈ R^d and z_i ∈ R^d. The essence of multi-video summarization is to select a certain number of frames that reconstruct the topic space of the original videos. Combining the prior information of the web images, the method constructs an objective function with the idea of sparse coding and learns the common pattern of the video frames and the web images retrieved with sub-topic keywords. The invention directly uses the candidate video frames X as the basis-vector group, takes the video frames x_i together with the web images z_i retrieved by sub-topic keyword search as input vectors, and learns an expression score a_i for each video frame. Each frame x_i corresponds to one coefficient variable a_i, called the expression score of the i-th frame, which conveys, with the aid of the web-image prior information, the contribution of each frame to reconstructing the topic space. The objective function is constructed as in formula (2), with a regularization term added to ensure sparsity; all target vectors share the same coefficient vector a = {a_1, a_2, ..., a_N}, and because the regularization term is added, the learned expression vector is sparse.
The present invention will be described in further detail with reference to the following drawings and specific examples.
Fig. 1 depicts the flow of extracting key shots from videos by the sparse-coding method combined with the prior information of web images under a sub-topic. The following process extracts key frames for one sub-topic; the key shots of the other sub-topics of the same event are extracted in the same way.
First, the video frames under each sub-topic and the web-image features based on that sub-topic are extracted. In the present invention, given the set of video frames of a certain sub-topic of a particular event, X = {x_1, x_2, ..., x_N} denotes the N-frame video feature set and Z = {z_1, z_2, ..., z_L} the L-frame web-image feature set, where x_i ∈ R^d and z_i ∈ R^d. Each frame x_i corresponds to one coefficient variable a_i, the expression score of the i-th frame, which represents the magnitude of the i-th frame's contribution in the reconstruction.
Second, the objective function for extracting key shots is constructed. Based on the idea of sparse coding, the invention directly uses the candidate video frames X as the basis-vector group, takes the video frames x_i together with the web images z_i retrieved by sub-topic keyword search as input vectors, and learns an expression score a_i for each video frame. The goal is to minimize the reconstruction error of the input vectors x_i and z_i against the basis-vector group X = {x_1, x_2, ..., x_N} simultaneously, that is, to learn the important shots of the videos with the assistance of the web images retrieved with sub-topic keywords. The objective function is formula (2), and the sparse vector A is then obtained with the coordinate gradient descent algorithm. The specific process is as follows:
first, the vector A is initialized to the zero vector, and the partial derivative of the objective function with respect to each expression score a_i (i = 1, 2, ..., N) is computed; then the expression score a_i whose partial derivative is largest in magnitude is selected and updated by the soft-threshold method. Finally, this process is iterated until the change in the value of the cost function is smaller than a certain threshold or the number of iterations reaches a certain value, which yields the expression vector A.
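For reference, under objective (2) the partial derivative computed at this step has the closed form below; this is a reconstruction from the stated objective rather than notation given in the patent. Writing r = Σ_j a_j x_j for the current reconstruction and J for the objective (with a_j ≥ 0, so |a_j| = a_j):

$$\frac{\partial J}{\partial a_i}=2\sum_{n=1}^{N}x_i^{\top}\,(r-x_n)+2\sum_{l=1}^{L}x_i^{\top}\,(r-z_l)+\gamma$$

The soft-threshold update then moves the selected a_i to the exact minimizer of J along that coordinate, clamped at zero to respect the nonnegativity constraint.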
Finally, a summary of a given length is generated. All the key shots of the sub-topics are extracted according to the above process. Given the summary time length l_k under the k-th sub-topic, the selection is solved by the optimization problem of formula (3):

$$\max_{\mu_k}\ \sum_{i=1}^{s_k}\mu_k^i a_k^i \qquad \text{s.t.}\ \sum_{i=1}^{s_k}\mu_k^i l_k^i\le l_k,\quad \mu_k^i\in\{0,1\}$$

where s_k is the number of shots under the k-th sub-topic, a_k^i is the importance score (expression score) of the i-th shot (frame) under the k-th sub-topic, l_k^i is the time length of the i-th shot, and μ_k is the selection vector, with μ_k^i = 1 indicating that the i-th shot is selected and μ_k^i = 0 that it is not. This optimization problem is a typical 0-1 knapsack problem, and solving it by dynamic programming realizes the multi-video summary.
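Putting the steps together, the sketch below shows the overall flow in minimal form. Sub-topic detection (the multi-graph model and graph cut) is assumed to have been performed upstream, producing per-topic frame features; the sketch reuses the coordinate_descent and select_shots helpers above, and all names and data layouts are illustrative assumptions:

```python
def summarize_event(frames_by_topic, web_images_by_topic, budgets, gamma=0.1):
    """Sketch of the pipeline: per sub-topic, score frames by sparse coding,
    pick a budgeted subset by the knapsack of formula (3), then order the
    selected key shots by video upload time.

    frames_by_topic     : dict k -> (X, lengths, upload_times); X is the (N, d)
                          frame-feature matrix of sub-topic k, lengths are
                          integer shot lengths, upload_times are per shot
    web_images_by_topic : dict k -> (L, d) web-image features for sub-topic k
    budgets             : dict k -> integer summary time length l_k
    """
    selected = []
    for k, (X, lengths, upload_times) in frames_by_topic.items():
        a = coordinate_descent(X, web_images_by_topic[k], gamma)
        # each frame stands in for one shot here; in practice per-frame
        # scores would be pooled into per-shot scores before selection
        for i in select_shots(list(a), list(lengths), budgets[k]):
            selected.append((upload_times[i], k, i))
    selected.sort()                     # order key shots by upload time
    return selected
```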

Claims (1)

1. A multi-video summarization method based on sparse coding, characterized in that a multi-graph model is constructed using the text information and visual information of the videos, and the videos are clustered, namely the sub-topics of the videos are detected, by a graph-cut method; secondly, under each sub-topic, a sparse-coding method associates the video frames under the sub-topic with the web images based on that sub-topic to obtain the key shots; finally, the key shots are ordered by video upload time, thereby realizing the multi-video summary; the sparse-coding method is specifically: given the set of video frames of a particular event, X = {x_1, x_2, ..., x_N} denotes the feature set of the N video frames and Z = {z_1, z_2, ..., z_L} denotes the feature set of the L web images, where x_i ∈ R^d and z_i ∈ R^d; the candidate video frames X are used as the basis-vector group, and the video frames x_i together with the web images z_i retrieved by sub-topic keyword search are used jointly as input vectors to learn an expression score a_i for each video frame; each frame x_i corresponds to one coefficient variable a_i, called the expression score of the i-th frame, which, with the aid of the web-image prior information, conveys the contribution of each frame to reconstructing the topic space; the objective function is therefore constructed as follows, with a regularization term added to ensure sparsity:

$$\min_{a}\ \sum_{i=1}^{N}\Big\|x_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\sum_{i=1}^{L}\Big\|z_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\gamma\sum_{j=1}^{N}|a_j|$$

$$\text{s.t. } a_j\ge 0\ \text{for } j\in\{1,\dots,N\},\quad\gamma>0 \qquad (1)$$

wherein the coefficient a_j is the expression score of the j-th frame, all target vectors x_i share the same coefficient vector a = {a_1, a_2, ..., a_N}, and γ Σ_j |a_j| is the regularization term; because the regularization term is added, a sparse expression vector is obtained;
the sparse vector A is obtained using a coordinate gradient descent algorithm, the specific process being as follows:
first, the sparse vector A is initialized to the zero vector, and the partial derivative of the objective function with respect to each expression score a_i, i = 1, 2, ..., N, is computed; then the expression score a_i whose partial derivative is largest in magnitude is selected and updated by the soft-threshold method; finally, this process is iterated until the change in the value of the cost function is smaller than a certain threshold or the number of iterations reaches a certain value, yielding the sparse vector A;
finally, a summary of a given length is generated: all the key shots of the sub-topics are extracted according to the above process, and given the summary time length l_k under the k-th sub-topic, the method solves the following optimization problem:

$$\max_{\mu_k}\ \sum_{i=1}^{s_k}\mu_k^i a_k^i \qquad \text{s.t.}\ \sum_{i=1}^{s_k}\mu_k^i l_k^i\le l_k,\quad\mu_k^i\in\{0,1\}$$

wherein s_k is the number of shots under the k-th sub-topic, a_k^i is the importance score or expression score of the i-th shot/frame under the k-th sub-topic, l_k^i is the time length of the i-th shot, and μ_k is the selection vector, with μ_k^i = 1 indicating that the i-th shot is selected and μ_k^i = 0 that it is not.
CN201710151147.4A 2017-03-14 2017-03-14 Multi-video abstraction method based on sparse coding Active CN106993240B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710151147.4A CN106993240B (en) 2017-03-14 2017-03-14 Multi-video abstraction method based on sparse coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710151147.4A CN106993240B (en) 2017-03-14 2017-03-14 Multi-video abstraction method based on sparse coding

Publications (2)

Publication Number Publication Date
CN106993240A (en) 2017-07-28
CN106993240B (en) 2020-10-16

Family

ID=59411588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710151147.4A Active CN106993240B (en) 2017-03-14 2017-03-14 Multi-video abstraction method based on sparse coding

Country Status (1)

Country Link
CN (1) CN106993240B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107911755B (en) * 2017-11-10 2020-10-20 天津大学 Multi-video abstraction method based on sparse self-encoder
CN107943990B (en) * 2017-12-01 2020-02-14 天津大学 Multi-video abstraction method based on prototype analysis technology with weight
CN109348287B (en) * 2018-10-22 2022-01-28 深圳市商汤科技有限公司 Video abstract generation method and device, storage medium and electronic equipment
CN110110636B (en) * 2019-04-28 2021-03-02 清华大学 Video logic mining device and method based on multi-input single-output coding and decoding model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7783459B2 (en) * 2007-02-21 2010-08-24 William Marsh Rice University Analog system for computing sparse codes
CN104113789A (en) * 2014-07-10 2014-10-22 杭州电子科技大学 On-line video abstraction generation method based on depth learning
CN106034264A (en) * 2015-03-11 2016-10-19 中国科学院西安光学精密机械研究所 Coordination-model-based method for obtaining video abstract


Also Published As

Publication number Publication date
CN106993240A (en) 2017-07-28

Similar Documents

Publication Publication Date Title
Xue et al. Advancing high-resolution video-language representation with large-scale video transcriptions
Gabeur et al. Multi-modal transformer for video retrieval
Otani et al. Learning joint representations of videos and sentences with web image search
CN106993240B (en) Multi-video abstraction method based on sparse coding
CN107943990B (en) Multi-video abstraction method based on prototype analysis technology with weight
CN107203636B (en) Multi-video abstract acquisition method based on hypergraph master set clustering
CN101739428B (en) Method for establishing index for multimedia
WO2020177673A1 (en) Video sequence selection method, computer device and storage medium
CN107911755B (en) Multi-video abstraction method based on sparse self-encoder
Liu et al. Attention guided deep audio-face fusion for efficient speaker naming
Momeni et al. Automatic dense annotation of large-vocabulary sign language videos
Choi et al. Contextually customized video summaries via natural language
CN114048351A (en) Cross-modal text-video retrieval method based on space-time relationship enhancement
CN113392265A (en) Multimedia processing method, device and equipment
Liang et al. Informedia@ trecvid 2016 med and avs
Varol et al. Scaling up sign spotting through sign language dictionaries
CN113408619B (en) Language model pre-training method and device
CN112883229B (en) Video-text cross-modal retrieval method and device based on multi-feature-map attention network model
Li et al. Image-text alignment and retrieval using light-weight transformer
Zhang et al. SOR-TC: Self-attentive octave ResNet with temporal consistency for compressed video action recognition
Xu et al. Multi-guiding long short-term memory for video captioning
CN110826397B (en) Video description method based on high-order low-rank multi-modal attention mechanism
CN116977701A (en) Video classification model training method, video classification method and device
Jiang Web-scale multimedia search for internet video content
CN109857906B (en) Multi-video abstraction method based on query unsupervised deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant