CN107943990A - Multi-video summarization method based on weighted archetypal analysis - Google Patents
- Publication number
- CN107943990A CN107943990A CN201711249015.1A CN201711249015A CN107943990A CN 107943990 A CN107943990 A CN 107943990A CN 201711249015 A CN201711249015 A CN 201711249015A CN 107943990 A CN107943990 A CN 107943990A
- Authority
- CN
- China
- Prior art keywords
- video
- weight
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/738—Presentation of query results
- G06F16/739—Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Library & Information Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the technical field of video processing, and proposes a multi-video summarization method based on weighted archetypal analysis that suits the characteristics of such data, so that, with the aid of effective prior information, the information specific to the data can be fully exploited. The technical solution adopted by the present invention is a multi-video summarization method based on the weighted archetypal analysis technique: first, the relations between video frames are modeled with a weighted graph model, yielding the weight matrix required by weighted archetypal analysis; then key frames are obtained by weighted archetypal analysis, and a video summary of a given length is generated. The present invention is mainly applied to video processing.
Description
Technical field
The present invention relates to the technical field of video processing and, specifically, to a multi-video summarization method based on weighted archetypal analysis.
Background technology
With the rapid development of information technology, video data has proliferated and has become one of the important channels through which people obtain information. However, owing to the sharp increase in the number of videos, massive video data contains redundant and repeated information, which makes it difficult for users to quickly obtain the information they need. There is therefore an urgent need for a technique that can integrate and analyze the massive video data under the same topic, so as to meet people's need to browse the main information of videos quickly and accurately and to improve their ability to acquire information. Multi-video summarization, as one of the effective ways to solve the above problems, has attracted the attention of more and more researchers over the past few decades. Multi-video summarization is a content-based video data compression technique that aims to analyze and integrate multiple videos on related topics under the same event, extract the main content of the videos, and present the extracted content to the user according to a certain logical relation. At present, multi-video summaries are mainly analyzed from three aspects: 1) coverage; 2) redundancy; 3) importance. Coverage means that the extracted video content can cover the main content of the multiple videos under the same topic. Redundancy refers to removing repeated, redundant information from the multi-video summary. Importance means extracting the important key shots in the video set according to some prior information, so as to extract the important content of the multiple videos.
Although many single-video summarization methods have been proposed, research on multi-video summarization remains limited and is still at an early stage. There are two main reasons. 1) The first is the diversity of topics across the multiple videos under the same event and the overlap of topics between videos. Topic diversity means that the multiple videos under the same event emphasize different information and contain multiple sub-topics. Topic overlap means that the content of the videos under the same event intersects: there is similar content as well as different information. 2) The second is that the audio, text, and visual information with which multiple videos present the same content may differ considerably. These factors make multi-video summarization difficult to address with traditional single-video summarization.

Over the past few decades, methods targeting the characteristics of multi-video data sets have been proposed. Among them, multi-video summarization based on complex-graph clustering is a comparatively classical method. It extracts keywords from the script information associated with the videos and the key frames of the videos, builds a complex graph, and on this basis realizes summarization with a graph-clustering algorithm. However, this method mainly targets news video; for video sets without script information it loses its meaning. In addition, because the content of multiple videos under the same topic is both diverse and redundant, clustering alone only satisfies the maximal-coverage condition on video content: for multi-video summarization, clustering on visual information alone performs poorly, and although combining other modalities helps somewhat, the complexity is large.
Multi-video summarization involves information from multiple modalities, such as the text, visual, and audio information of a video. Balanced AV-MMR (Balanced Audio Video Maximal Marginal Relevance) is a multi-video summarization technique that effectively exploits this multi-modal information: it analyzes the visual and audio information of the videos together with the semantic information they carry, including audio, face, and temporal features, all of which are significant for video summarization. The method uses the multi-modal information of video efficiently, but the extracted summaries do not reach the desired quality.
In recent years, some novel methods have been proposed. Among them, exploiting the visual co-occurrence of videos is one of the relatively new approaches to multi-video summarization. This method assumes that important visual concepts tend to appear repeatedly across the multiple videos under the same topic and, based on this property, proposes a Maximal Biclique Finding algorithm that extracts sparse co-occurrence patterns across the videos, thereby realizing multi-video summarization. However, the method applies only to specific data sets; for video sets with little repetition across videos it loses its meaning.
In addition, to exploit more related information, researchers have proposed using sensors such as the GPS and compass on mobile phones to obtain information such as the geographical position during mobile video capture, using it to help judge the important information in the videos and generate multi-video summaries. The prior art also teaches using web images as auxiliary prior information to better realize multi-video summarization. At present, owing to the complexity of multi-video data, research on multi-video summarization has not reached the desired quality. How to better exploit the information of multi-video data and better realize multi-video summarization has therefore become a research hotspot. To this end, this document proposes realizing multi-video summarization with archetypal analysis (Archetypal Analysis).

Archetypal analysis (Archetypal Analysis, AA) regards each data point in a data set as a mixture of a set of individual, observable archetypes, while the archetypes themselves are restricted to sparse mixtures of the data points and generally lie on the boundary of the data set. AA models are widely used in different fields, such as economics, astrophysics, and pattern recognition, and their usefulness for feature extraction and dimensionality reduction has been exploited by machine learning algorithms in fields such as computer vision, neuroimaging, chemistry, text mining, and collaborative filtering.
Summary of the invention
To overcome the deficiencies of the prior art, the present invention aims to propose a multi-video summarization method based on weighted archetypal analysis that suits the characteristics of such data, so that, with the aid of effective prior information, the information specific to the data can be fully exploited. The technical solution adopted by the present invention is a multi-video summarization method based on the weighted archetypal analysis technique: first, the relations between video frames are modeled with a weighted graph model, yielding the weight matrix required by weighted archetypal analysis; then key frames are obtained by weighted archetypal analysis, and a video summary of a given length is generated.
The specific steps for obtaining the weight matrix required by weighted archetypal analysis are:

Build a weighted simple graph. Given l videos under the same event, n candidate key frames are obtained after preprocessing and expressed as feature vectors X={f_1,f_2,f_3,...,f_n}, f_i ∈ R^m, where f_i denotes the m-dimensional feature vector of the i-th candidate key frame. Taking the candidate key frames as vertices, construct a visual-similarity graph G=(X,E,W), where X denotes the vertices, E the connecting edges between video frames, and W the visual connection weights of the edges. To compute W, first compute the cosine similarity A(f_i,f_j) between video frames according to equation (1):

A(f_i, f_j) = sim(f_i, f_j) / Σ_{f_j ∈ X, j ≠ i} sim(f_i, f_j)    (1)

where sim(i,j) denotes the cosine similarity between the i-th and j-th frames.

Build a weighted graph model. Using the similarity between videos, an extra weight is added to the connecting edges between frames of different videos. To express this relation, the weight matrix W_v is designed according to equation (2):

W_v = 1                           if v(f_i) = v(f_j)
W_v = 1 + sim(v(f_i), v(f_j))     if v(f_i) ≠ v(f_j)    (2)

where v(f) denotes the video containing frame f and sim(v(f_i),v(f_j)) denotes the similarity between the video containing f_i and the video containing f_j; the similarity here is the cosine similarity computed from the text information of the videos. The expression above only increases the weight of connecting edges between frames of different videos, while the weights of edges between frames within the same video remain unchanged.

Compute the average similarity between a video frame and all web images and take it as the importance criterion of the frame, according to formula (3):

W_q = Σ_{j=1}^{k} sim(f_i, g_j)    (3)

where g_j denotes the j-th web image and sim(f_i,g_j) the cosine similarity between video frame f_i and g_j.

The connection-weight matrix W of the edges of the constructed weighted graph model is computed as in equation (4):

W = A ⊙ W_v ⊙ W_q    (4)
The steps in one example are as follows:

1) Extract the visual features of the video frames and of the query-based web images, and the text features of the videos: the visual features of the video frames are expressed as X={f_1,f_2,f_3,...,f_n}, f_i ∈ R^m; the visual features of the web images as {g_1,g_2,...,g_k}, g_k ∈ R^m, where g_k denotes the m-dimensional feature vector of the k-th web image; and the text features of the videos as {t_1,t_2,...,t_l}, t_a ∈ R^d, where t_a denotes the text feature of the a-th video;

2) Build the weighted complete graph: to model the dependency relations between video frames, take the video frames as vertices, build the weighted simple graph G=(X,E,W), and solve for the matrix W with formulas (1)-(4);

3) Use the weight matrix W obtained in step 2) as the weights of the archetypal analysis problem, and build the input matrix;

4) Perform weighted archetypal analysis on the input matrix, obtaining the optimal solution matrices P and Q by an alternating estimation algorithm, where P denotes the coefficient matrix with which the archetypes reconstruct the input and Q the coefficient matrix with which the input reconstructs the archetypes;

5) Compute the importance score S_i of each archetype;

6) Sort the archetypes in descending order and select those whose importance score exceeds a threshold ε;

7) Starting from the archetype with the largest importance score, select the video frame whose row index corresponds to the largest element in the corresponding column of Q, and judge its similarity to all previously selected frames; if the similarity exceeds a threshold, the frame is not included in the summary. If, after iterating over all archetypes, the summary has not reached the required length, perform the next round of selection, choosing from each column of Q the row index of the second-largest value; then iterate the above process until the required summary length is met.
Features and benefits of the present invention:

The present invention targets the characteristics of existing multi-video summarization data sets and designs a multi-video summarization technique based on weighted archetypal analysis that, with the aid of effective prior information, fully exploits the information specific to the data. Its main advantages are:

(1) Novelty: weighted archetypal analysis is applied for the first time to query-oriented multi-video summarization, and a weighted graph model is used to incorporate both the text information of the videos and the query-based web-image information into the multi-video summarization model to model the relations between video frames.

(2) Effectiveness: experiments confirm that, compared with the typical clustering methods and minimum sparse reconstruction methods applied to single-video summarization, the multi-video summarization method based on weighted archetypal analysis designed by the present invention clearly outperforms both, and is thus better suited to multi-video summarization problems.

(3) Practicality: simple and feasible; usable in the field of multimedia signal processing.
Brief description of the drawings:

Fig. 1 is the flow chart of video key-shot extraction with the weighted archetypal analysis method provided by the present invention.
Embodiments

Addressing characteristics of multimedia video data such as redundancy and abundant duplicate information, the present invention combines the visual information and text information of the videos with other topic-related prior information, and improves traditional multi-video summarization methods with the idea of archetypal analysis, achieving efficient use of topic-related video information and improving the efficiency with which users browse videos.

The method provided by the present invention is broadly divided into: 1) first, a weighted graph model is designed to construct the relevance between video frames; 2) then, the weighted archetypal analysis technique is used to design a key-frame selection method suited to the characteristics of query-oriented multi-video summarization data sets.

Archetypal analysis (Archetypal Analysis, AA) regards each data point in a data set as a mixture of a set of individual, observable archetypes, while the archetypes themselves are restricted to sparse mixtures of the data points and generally lie on the boundary of the data set.
Given the n × m matrix X={f_1,f_2,...,f_i,...,f_n}, f_i ∈ R^m, and with z << n, the archetypal analysis problem factorizes X into two stochastic matrices P ∈ R^{n×z} and Q ∈ R^{n×z}, where P denotes the coefficient matrix with which the archetypes reconstruct the input and Q the coefficient matrix with which the input reconstructs the archetypes:

X ≈ PA with A = Qᵀ X    (5)

The archetypal analysis algorithm first initializes the matrices P and Q and computes the archetype matrix A, then alternately updates P and Q until the residual sum of squares (RSS) converges to a sufficiently small value or the maximum number of iterations is reached.
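The alternating scheme just described can be sketched as follows. This is a minimal projected-gradient implementation of plain (unweighted) archetypal analysis under the X ≈ P(QᵀX) factorization; the simplex projection, step sizes, and stopping rule are illustrative choices, not the patent's exact update equations.

```python
import numpy as np

def project_rows_to_simplex(V):
    """Euclidean projection of each row of V onto the probability simplex
    (sorting-based algorithm)."""
    n, d = V.shape
    U = np.sort(V, axis=1)[:, ::-1]
    css = np.cumsum(U, axis=1) - 1.0
    ind = np.arange(1, d + 1)
    rho = np.count_nonzero(U - css / ind > 0, axis=1)
    theta = css[np.arange(n), rho - 1] / rho
    return np.maximum(V - theta[:, None], 0.0)

def archetypal_analysis(X, z, n_iter=300, seed=0):
    """Plain AA: find row-stochastic P (n x z) and column-stochastic Q (n x z)
    minimizing ||X - P (Q^T X)||_F^2 by alternating projected gradient steps."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    P = project_rows_to_simplex(rng.random((n, z)))
    Q = project_rows_to_simplex(rng.random((z, n))).T   # columns lie on the simplex
    for _ in range(n_iter):
        A = Q.T @ X                                     # z x m archetype matrix
        G = (P @ A - X) @ A.T                           # gradient w.r.t. P (up to a factor 2)
        P = project_rows_to_simplex(P - G / (np.linalg.norm(A @ A.T) + 1e-9))
        R = X - P @ (Q.T @ X)                           # residual with the updated P
        GQ = -X @ R.T @ P                               # gradient w.r.t. Q (up to a factor 2)
        step = 1.0 / (np.linalg.norm(X @ X.T) * np.linalg.norm(P.T @ P) + 1e-9)
        Q = project_rows_to_simplex((Q - step * GQ).T).T
    return P, Q
```

Both subproblems are convex quadratics, and the conservative Frobenius-norm step bounds keep each projected step non-increasing in the RSS.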
However, the archetypal analysis problem above treats all video frames as frames of equal weight: each data point (video frame) and its residual enter the minimization with the same weight when the archetypes are obtained. In multi-video summarization, video frames are not equivalent; they differ in importance. The present invention therefore obtains the key frames with weighted archetypal analysis.

The present invention first models the relations between video frames with a weighted graph model, obtaining the weight matrix required by weighted archetypal analysis.
To model the relations between video frames, the present invention constructs a weighted simple graph. Given l videos under the same event, n candidate key frames X={f_1,f_2,f_3,...,f_n}, f_i ∈ R^m are obtained after preprocessing. The present invention takes the candidate key frames as vertices and constructs the visual-similarity graph G=(X,E,W), where X denotes the vertices, E the connecting edges between video frames, and W the visual connection weights of the edges. To compute W, the present invention first computes the cosine similarity A(f_i,f_j) between video frames according to equation (1):

A(f_i, f_j) = sim(f_i, f_j) / Σ_{f_j ∈ X, j ≠ i} sim(f_i, f_j)    (1)

where sim(i,j) denotes the cosine similarity between the i-th and j-th frames.
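Equation (1) can be evaluated for all frame pairs at once. The sketch below assumes nonnegative feature vectors (so the cosine similarities are nonnegative) and excludes the diagonal from the normalization, as the summation index j ≠ i indicates.

```python
import numpy as np

def frame_similarity_A(X):
    """Row-normalized cosine similarity between candidate key frames, eq. (1):
    A(f_i, f_j) = sim(f_i, f_j) / sum_{j != i} sim(f_i, f_j)."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    S = Xn @ Xn.T                        # pairwise cosine similarities
    np.fill_diagonal(S, 0.0)             # the sum in eq. (1) runs over j != i
    return S / (S.sum(axis=1, keepdims=True) + 1e-12)
```

Each row of the returned matrix sums to one, so A can be read as a transition-like normalization of the visual-similarity graph.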
Observation shows that distinguishing the inter-frame similarity relations within a video from those across videos helps improve the quality of multi-video summaries. To reflect the influence of the relation between videos on the inter-frame similarity relations, a weighted graph model is constructed here: using the similarity between videos, the present invention adds an extra weight to the connecting edges between frames of different videos. To express this relation, the present invention designs the weight matrix W_v according to equation (2):

W_v = 1                           if v(f_i) = v(f_j)
W_v = 1 + sim(v(f_i), v(f_j))     if v(f_i) ≠ v(f_j)    (2)

where v(f) denotes the video containing frame f, and sim(v(f_i),v(f_j)) denotes the similarity between the video containing f_i and the video containing f_j; the similarity here is the cosine similarity computed from the text information of the videos. The expression above only increases the weight of connecting edges between frames of different videos, while the weights of edges between frames within the same video remain unchanged.
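A direct reading of equation (2) gives the following construction; the frame-to-video assignment and the video-level text similarities are assumed to be available from preprocessing.

```python
import numpy as np

def video_weight_matrix(video_of, text_sim):
    """Eq. (2): W_v[i, j] = 1 if frames i and j come from the same video,
    else 1 + cosine similarity of the two videos' text features.
    video_of: length-n array, video index of each frame.
    text_sim: l x l matrix of video-level text cosine similarities."""
    video_of = np.asarray(video_of)
    same = video_of[:, None] == video_of[None, :]          # same-video mask
    Wv = 1.0 + text_sim[video_of[:, None], video_of[None, :]]
    Wv[same] = 1.0                                         # within-video edges keep weight 1
    return Wv
```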
Recently, with more and more user-generated information, such as images and videos, retrievable from websites, a natural idea is to use this external information to assist summary generation. We regard the query images as prior information for obtaining the important content of the videos. Query images are uploaded by users, after careful selection, as complementary information for a video; they therefore present the main content of an event in a more semantic way and carry less redundancy and noise. All of this shows that query images, as prior information, contribute to the generation of multi-video summaries. The present invention therefore first computes the average similarity between a video frame and all web images and takes it as the importance criterion of the frame, according to formula (3):

W_q = Σ_{j=1}^{k} sim(f_i, g_j)    (3)

where g_j denotes the j-th web image, sim(f_i,g_j) denotes the cosine similarity between video frame f_i and g_j, and W_q denotes the sum of the cosine similarities between frame f_i and all web images.
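Equation (3) can likewise be evaluated for all frames at once; the sketch below assumes one feature vector per row for both the video frames and the web images.

```python
import numpy as np

def query_prior_scores(X, G):
    """Eq. (3): W_q[i] = sum_j sim(f_i, g_j), the summed cosine similarity
    between frame feature f_i (rows of X) and web-image features g_j (rows of G)."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    Gn = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-12)
    return (Xn @ Gn.T).sum(axis=1)       # one prior score per frame
```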
The connection-weight matrix W of the edges of the constructed weighted graph model is therefore computed as in equation (4):

W = A ⊙ W_v ⊙ W_q    (4)

After the weight matrix W is obtained, the invention obtains the key frames with the weighted archetypal analysis technique. The weighted archetypal analysis problem can be regarded as a minimization problem in which the reconstruction residual of each video frame enters with its weight from W; this problem can in turn be rewritten as a standard archetypal analysis problem on a weighted input matrix.

Thus, the resulting multi-video summarization method based on weighted archetypal analysis (Weighted archetypal analysis) mainly comprises three stages: preliminary data preparation; solving the weight matrix required by archetypal analysis with the weighted graph model; and solving the weighted archetypal analysis.
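The combination in equation (4) and the subsequent weighting of the input can be sketched as follows. Two points are assumptions rather than the patent's stated formulas: W_q is a per-frame quantity, so it is broadcast across the rows of the n × n matrices; and the edge weights are folded into per-frame weights by row aggregation before scaling the frames, since the exact construction of the input matrix from W is given by a formula not reproduced above.

```python
import numpy as np

def combined_weight_matrix(A, Wv, Wq):
    """Eq. (4): W = A * W_v * W_q (elementwise). Broadcasting the per-frame
    scores W_q over rows is an assumed reading."""
    return A * Wv * Wq[:, None]

def weighted_input(X, W):
    """Fold the n x n edge weights into per-frame weights and scale the frames.
    Row-sum aggregation and square-root scaling are illustrative assumptions."""
    w = W.sum(axis=1)
    w = w / (w.max() + 1e-12)            # normalize so the top frame keeps unit scale
    return np.sqrt(w)[:, None] * X
```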
Fig. 1 is described with reference to the web-image prior information: the flow chart of extracting the key shots in the videos with the weighted archetypal analysis method. The main idea of the method is to soft-cluster (soft-clustered) the video frames into weighted archetypes (archetypes), then rank the video frames according to the archetypes and select the top-ranked frames as key frames, generating a video summary of the given length. The specific steps of the invention are as follows:

1) Extract the visual features of the video frames and of the query-based web images, and the text features of the videos. The visual features of the video frames are expressed as X={f_1,f_2,f_3,...,f_n}, f_i ∈ R^m; the visual features of the web images as {g_1,g_2,...,g_k}, g_j ∈ R^m; and the text features of the videos as {t_1,t_2,...,t_l}, t_a ∈ R^d.

2) Build the weighted complete graph. To model the dependency relations between video frames, the present invention takes the video frames as vertices, builds the weighted simple graph G=(X,E,W), and solves for the matrix W with formulas (1)-(4).

3) Use the weight matrix W obtained in step 2) as the weights of the archetypal analysis problem, and build the input matrix.

4) Perform weighted archetypal analysis on the input matrix, obtaining the optimal solutions P and Q by an alternating estimation algorithm.

5) Compute the importance score S_i of each archetype.

6) Sort the archetypes in descending order and select those whose importance score exceeds a threshold ε.

7) Starting from the archetype with the largest importance score, select the video frame whose row index corresponds to the largest element in the corresponding column of Q, and judge its similarity to all previously selected frames; if the similarity exceeds a certain threshold, the frame is not included in the summary. If, after iterating over all archetypes, the summary has not reached the required length, perform the next round of selection, choosing from each column of Q the row index of the second-largest value. Then iterate the above process until the required summary length is met.
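Steps 5)-7) can be sketched as the following greedy selection loop. The archetype importance scores are taken as an input because their exact formula is not reproduced above, and the threshold values are illustrative.

```python
import numpy as np

def select_keyframes(Q, scores, frame_sim, budget, score_eps=0.0, sim_thresh=0.9):
    """Greedy key-frame selection over archetypes (steps 5-7).
    Q: n x z matrix (inputs reconstructing archetypes); column p scores how
    strongly each frame contributes to archetype p.
    scores: importance score per archetype; frame_sim: n x n frame similarities."""
    order = [p for p in np.argsort(scores)[::-1] if scores[p] > score_eps]
    chosen = []
    for rank in range(Q.shape[0]):           # rank 0: largest entry per column, 1: second largest, ...
        for p in order:
            if len(chosen) >= budget:
                return chosen
            cand = int(np.argsort(Q[:, p])[::-1][rank])
            if cand in chosen:
                continue
            if any(frame_sim[cand, f] > sim_thresh for f in chosen):
                continue                     # too similar to an already selected frame
            chosen.append(cand)
    return chosen
```

Each round visits the retained archetypes in score order; only when a full pass cannot fill the budget does the loop fall back to the next-largest entry of each column, matching the round-by-round description above.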
Claims (3)
1. A multi-video summarization method based on the weighted archetypal analysis technique, characterized in that, first, the relations between video frames are modeled with a weighted graph model, yielding the weight matrix required by weighted archetypal analysis; then key frames are obtained by weighted archetypal analysis, and a video summary of a given length is generated.
2. The multi-video summarization method based on the weighted archetypal analysis technique of claim 1, characterized in that the specific steps for obtaining the weight matrix required by weighted archetypal analysis are:

building a weighted simple graph: given l videos under the same event, n candidate key frames are obtained after preprocessing and expressed as feature vectors X={f_1,f_2,f_3,...,f_n}, f_i ∈ R^m, where f_i denotes the m-dimensional feature vector of the i-th candidate key frame; taking the candidate key frames as vertices, a visual-similarity graph G=(X,E,W) is constructed, where X denotes the vertices, E the connecting edges between video frames, and W the visual connection weights of the edges; to compute W, the cosine similarity A(f_i,f_j) between video frames is first computed according to equation (1):
A(f_i, f_j) = sim(f_i, f_j) / Σ_{f_j ∈ X, j ≠ i} sim(f_i, f_j)    (1)
where sim(i,j) denotes the cosine similarity between the i-th and j-th frames;

building a weighted graph model: using the similarity between videos, an extra weight is added to the connecting edges between frames of different videos; to express this relation, the weight matrix W_v is designed according to equation (2):
W_v = 1                           if v(f_i) = v(f_j)
W_v = 1 + sim(v(f_i), v(f_j))     if v(f_i) ≠ v(f_j)    (2)
where v(f) denotes the video containing frame f, and sim(v(f_i),v(f_j)) denotes the similarity between the video containing f_i and the video containing f_j, the similarity here being the cosine similarity computed from the text information of the videos; the expression above only increases the weight of connecting edges between frames of different videos, while the weights of edges between frames within the same video remain unchanged;

computing the average similarity between a video frame and all web images as the importance criterion of the frame, according to formula (3):
W_q = Σ_{j=1}^{k} sim(f_i, g_j)    (3)
where g_j denotes the j-th web image and sim(f_i,g_j) the cosine similarity between video frame f_i and g_j;

the connection-weight matrix W of the edges of the constructed weighted graph model being computed as in equation (4):

W = A ⊙ W_v ⊙ W_q    (4).
3. more video summarization methods of the archetypal analysis technology based on Weight as claimed in claim 1, it is characterized in that, one
Comprised the following steps that in example:
1) video frame text feature corresponding with the visual signature and video of the network image based on inquiry is extracted:Video frame regards
Feel character representation is X={ f1,f2,f3,...,fn},fi∈Rm, the visual signature of network image is expressed as { g1,g2,...,gk},
gk∈Rm, gkRepresent the m dimensional feature vectors of kth network image, the Text Representation of video is { t1,t2,...,tl},ta∈
Rd, taRepresent the text feature of a-th of video;
2) Build a weighted complete graph: to model the dependency relations between video frames, take the video frames as vertices, construct a weighted simple graph G=(X, E, W), and solve the matrix W using formulas (1)-(4);
3) Use the weight matrix W obtained in step 2) as the weights of the archetypal analysis problem, and use the given formula to construct the input matrix;
4) Perform weighted archetypal analysis on the constructed input matrix, and alternately obtain the optimal decomposition matrices P and Q using an estimation algorithm, where P denotes the coefficient matrix with which the archetypes reconstruct the input, and Q denotes the coefficient matrix with which the input reconstructs the archetypes;
5) Calculate the importance score Si of each archetype according to the given formula;
6) Sort the archetypes in descending order of importance score, and select the archetypes whose importance scores exceed a certain threshold ε;
7) Starting from the archetype with the largest importance score, select the video frame corresponding to the index of the largest element in its corresponding row of Q, and judge the similarity between this frame and all previously selected frames; if the similarity exceeds the threshold, the frame is not included in the summary. If, after all archetypes have been iterated over, the required summary length has not yet been reached, the next round of selection is carried out, choosing the key frame given by the index corresponding to the second-largest value in each column of Q; this process is then iterated until the required summary length is met.
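Steps 6)-7) above can be sketched as a selection routine. This is a simplified reading under stated assumptions: rows of Q are taken to correspond to archetypes and columns to frames, the threshold filter of step 6) is assumed to have been applied beforehand, and `select_keyframes` together with all its inputs are illustrative names, not the patent's actual estimation algorithm:

```python
import numpy as np

def select_keyframes(Q, scores, sims, sim_thresh, max_len):
    """Pick key frames from Q by descending archetype importance.

    Q: (num_archetypes, num_frames) coefficient matrix (orientation assumed)
    scores: importance score S_i per archetype
    sims: (num_frames, num_frames) frame-to-frame similarity matrix
    sim_thresh: frames more similar than this to a chosen frame are skipped
    max_len: required summary length (number of key frames)
    """
    order = np.argsort(scores)[::-1]  # archetypes by descending importance
    chosen = []
    # Round 0 takes each archetype's largest entry, round 1 the second largest, ...
    for rank in range(Q.shape[1]):
        for a in order:
            frame = int(np.argsort(Q[a])[::-1][rank])
            if frame in chosen:
                continue
            if any(sims[frame, c] > sim_thresh for c in chosen):
                continue  # too similar to an already-selected frame
            chosen.append(frame)
            if len(chosen) == max_len:
                return chosen
    return chosen

# Toy example: archetype 1 (higher score) contributes frame 1 first, then archetype 0 frame 0.
Q = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1]])
scores = np.array([1.0, 2.0])
sims = np.eye(3)  # distinct frames are dissimilar
print(select_keyframes(Q, scores, sims, 0.8, 2))  # → [1, 0]
```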
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711249015.1A CN107943990B (en) | 2017-12-01 | 2017-12-01 | Multi-video abstraction method based on prototype analysis technology with weight |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107943990A true CN107943990A (en) | 2018-04-20 |
CN107943990B CN107943990B (en) | 2020-02-14 |
Family
ID=61948265
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711249015.1A Active CN107943990B (en) | 2017-12-01 | 2017-12-01 | Multi-video abstraction method based on prototype analysis technology with weight |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107943990B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106993240A (en) * | 2017-03-14 | 2017-07-28 | Tianjin University | Multi-video summarization method based on sparse coding |
CN107203636A (en) * | 2017-06-08 | 2017-09-26 | Tianjin University | Multi-video summarization method based on hypergraph master set clustering |
Non-Patent Citations (2)
Title |
---|
ERCAN CANHASI ET AL: "Weighted archetypal analysis of the multi-element graph for query-focused multi-document summarization", Expert Systems with Applications * |
ERCAN CANHASI ET AL: "Weighted hierarchical archetypal analysis for multi-document summarization", Computer Speech and Language * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110769279A (en) * | 2018-07-27 | 2020-02-07 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Video processing method and apparatus |
US11445272B2 (en) | 2018-07-27 | 2022-09-13 | Beijing Jingdong Shangke Information Technology Co, Ltd. | Video processing method and apparatus |
CN109857906A (en) * | 2019-01-10 | 2019-06-07 | Tianjin University | Multi-video summarization method based on query-driven unsupervised deep learning |
CN109857906B (en) * | 2019-01-10 | 2023-04-07 | Tianjin University | Multi-video abstraction method based on query unsupervised deep learning |
CN110147469A (en) * | 2019-05-14 | 2019-08-20 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Data processing method, device and storage medium |
CN110147469B (en) * | 2019-05-14 | 2023-08-08 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Data processing method, device and storage medium |
CN110298270A (en) * | 2019-06-14 | 2019-10-01 | Tianjin University | Multi-video summarization method based on cross-modal importance perception |
CN111062284B (en) * | 2019-12-06 | 2023-09-29 | Zhejiang University of Technology | Visual understanding and diagnosis method for interactive video abstract model |
CN111339359A (en) * | 2020-02-18 | 2020-06-26 | Sun Yat-sen University | Sudoku-based automatic video thumbnail generation method |
Also Published As
Publication number | Publication date |
---|---|
CN107943990B (en) | 2020-02-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107943990A (en) | Multi-video summarization method based on weighted archetypal analysis technique | |
CN111931062B (en) | Training method and related device of information recommendation model | |
Al-Rousan et al. | Video-based signer-independent Arabic sign language recognition using hidden Markov models | |
CN105138991B (en) | Video emotion recognition method based on fusion of emotion-salient features | |
CN107203636B (en) | Multi-video abstract acquisition method based on hypergraph master set clustering | |
CN109902912B (en) | Personalized image aesthetic evaluation method based on character features | |
US20230353828A1 (en) | Model-based data processing method and apparatus | |
CN111949886B (en) | Sample data generation method and related device for information recommendation | |
EP3408836A1 (en) | Crowdshaping realistic 3d avatars with words | |
Xu et al. | Mining and application of tourism online review text based on natural language processing and text classification technology | |
CN113239159B (en) | Cross-modal retrieval method for video and text based on relational inference network | |
Zhang et al. | Retargeting semantically-rich photos | |
CN111954087B (en) | Method and device for intercepting images in video, storage medium and electronic equipment | |
CN106993240A (en) | Multi-video summarization method based on sparse coding | |
CN115293348A (en) | Pre-training method and device for multi-mode feature extraction network | |
Punyani et al. | Human age-estimation system based on double-level feature fusion of face and gait images | |
CN110351580A (en) | TV programme special recommendation method and system based on Non-negative Matrix Factorization | |
CN115203471A (en) | Attention mechanism-based multimode fusion video recommendation method | |
Lai et al. | Improving graph-based sentence ordering with iteratively predicted pairwise orderings | |
CN115033736A (en) | Video abstraction method guided by natural language | |
CN113343029B (en) | Complex video character retrieval method with enhanced social relationship | |
Guo et al. | Deep attentive factorization machine for app recommendation service | |
Lu et al. | Zero-shot video grounding with pseudo query lookup and verification | |
CN111897999A (en) | LDA-based deep learning model construction method for video recommendation | |
CN116701706B (en) | Data processing method, device, equipment and medium based on artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||