CN103426176A - Video shot detection method based on histogram improvement and clustering algorithm - Google Patents

Video shot detection method based on histogram improvement and clustering algorithm

Info

Publication number
CN103426176A
CN103426176A CN2013103799401A CN201310379940A
Authority
CN
China
Prior art keywords
shot
camera lens
frame
histogram
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013103799401A
Other languages
Chinese (zh)
Other versions
CN103426176B (en)
Inventor
瞿中
陈昌志
刘达明
薛峙
高腾飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201310379940.1A priority Critical patent/CN103426176B/en
Publication of CN103426176A publication Critical patent/CN103426176A/en
Application granted granted Critical
Publication of CN103426176B publication Critical patent/CN103426176B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

The invention discloses a video shot detection method based on an improved histogram and a clustering algorithm, and relates to image processing techniques. In the method, the improved histogram and the clustering algorithm are used to compute the intersection of the histograms of two adjacent image frames, and whether a shot change occurs is judged from the histogram similarity. If a shot change occurs, a secondary detection of the shot boundary is performed on the two adjacent frames using the inter-frame gray-level/color difference: the frames are processed with non-uniform block weighting, the pixel difference of each block is computed separately, each block's pixel difference is compared with a preset block frame-difference threshold to obtain a marker variable, the marker variables of all blocks are weighted and summed, and the weighted sum is compared with a preset block-partition threshold to complete the shot detection. The method improves the accuracy of shot detection and solves problems such as false detection of shots and discontinuous frame numbers.

Description

Video shot detection method based on improved histogram and clustering algorithm
Technical field
The present invention relates to image processing techniques, and in particular to a shot detection technique.
Background technology
A group of image frames that are continuous in the time domain forms a video stream. Because the video frame rate is generally high, even a very short video contains a large number of image frames, and adjacent frames are strongly correlated in their visual features, so content-based image retrieval methods cannot be applied directly in the video retrieval field. Only by structuring the video and building indexes and summaries for it, so that a linear structure of the video content is formed, can fast browsing and retrieval of video data be realized effectively. Video structuring includes shot segmentation, also called shot transition detection, which is the basis of video structural layering. Shot detection is required to avoid the influence of extraneous factors on the segmentation, to divide the video sequence into groups of shots, each consisting of consecutive frames with the same content, and to correctly detect the shot boundaries produced by various complex editing operations.
Shot segmentation must cut the video exactly at the shot boundaries to form individual independent shots, so as to guarantee the accuracy of key-frame extraction. Scholars such as Yeung and Nagasaka respectively proposed the histogram intersection algorithm and the χ² histogram algorithm, which improved the way the histogram difference is computed.
To reduce the interference that local motion within a shot may cause, Nagasaka et al. proposed processing each frame in blocks; to better detect sustained gradual transitions, Zhang et al. proposed a dual-threshold algorithm; for motion features, Shahraray et al. proposed a block matching algorithm that applies motion compensation to each block, improving the tolerance to local motion within a shot, while Akutsu et al. defined inter-frame similarity through the correlation coefficient of computed motion vectors and thereby detected shot transitions; since object edges within a shot also change when the shot changes, R. Zabih et al. proposed a scene segmentation method based on edge features; Chi-Chun Lo et al. proposed using the fuzzy C-means (FCM) clustering algorithm for shot segmentation, classifying all video frames into the two classes Shot Change (SC) and No Shot Change (NSC); Jin Hong et al. proposed using an unsupervised clustering algorithm to detect MPEG compressed video, with post-processing adapted to the characteristics of the video data; Cernekova [11] et al. proposed shot detection algorithms that combine mutual information and joint entropy between two adjacent frames. At present, many shot detection methods approach perfect detection of abrupt shot changes (cuts), but for gradual transitions, because of the diversity of transition modes and their vulnerability to noise, the detection performance of existing methods remains unsatisfactory. In addition, cuts and gradual transitions are generally detected with different methods, and identifying cuts alone is of little practical significance; a method that can identify both cuts and gradual transitions at the same time has therefore always been a research goal.
Shot segmentation is the basis of video structural layering; it has received wide attention from researchers and scholars, and rich research results exist. However, up to now there is still no "universally applicable" shot segmentation and detection method that performs well in all cases and for videos of all content types.
Shot transition detection divides a film or video into basic temporal units, namely shots. According to the editing mode that links shot boundaries, shot transitions can be divided into two classes: abrupt transitions (cuts) and gradual transitions. An abrupt transition (cut) is the process of switching suddenly from one shot to the next, corresponding to an editing mode in which two shots are joined directly; a gradual transition is the process in which the next shot gradually replaces the current one, also called a soft transition, corresponding to an editing mode in which two shots are joined using spatial or color effects. Gradual transitions include multiple transition modes, characterized by a switching process that is progressive and sustained. The more common gradual transitions are fade-in/fade-out, dissolve, wipe, sweep, and the like.
When a shot transition occurs, the video content (high-level semantics) usually changes as well. The ideal approach to shot detection and segmentation would analyze the video content (high-level semantics) directly, but because of the "semantic gap" and the ambiguity of high-level semantics involving human emotional factors, most shot detection algorithms still detect shot boundaries from changes in low-level video features at the boundary (visual and motion features such as color, edges, and texture). In general, a shot transition causes significant changes in the low-level features, such as a sudden change in the color distribution of the image frame or the moving in and out of object contours and edges. During a gradual transition, however, the low-level features change slowly and inconspicuously.
Moreover, even within the same shot, rapid changes of the video content and noise may cause large changes in the low-level features. In view of these many influencing factors, although existing algorithms can achieve good shot segmentation results in some particular cases, when the video contains extreme situations such as rapid object/camera motion or drastic changes of ambient illumination, and during gradual transitions, the segmentation results of many existing algorithms are still far from satisfactory.
In the prior art, the common approach to shot detection and segmentation is to compute a frame difference value Diff of low-level visual features or motion features between successive frames of the video and compare it with a preset or adaptive threshold T: if Diff > T, the position is a shot boundary; otherwise, the group of successive frames is considered to belong to the same shot. From this common approach, the metric used for the frame difference, the setting of the threshold, and the optimal combination of the two become the key points of shot detection and segmentation. Within the same shot, the video features change mainly for two reasons: object/camera motion and illumination changes. Object/camera motion causes new objects to appear constantly within the shot while old objects keep disappearing; if this is handled improperly it is easily confused with a gradual transition and causes false detection of shots. Illumination changes also occur frequently within a shot; if a frame suddenly brightens, the brightness-based frame difference jumps, and if handled improperly this will be detected as a cut, again causing false detection. Both factors must therefore be fully considered when designing the algorithm. To detect shot boundaries correctly and perform shot segmentation, the inter-frame content difference should ideally have the following property: the frame difference within a shot is small and relatively balanced, while at shot boundaries it is very large and jumps. Considering the two main causes of content change within a shot, the frame difference should be as insensitive as possible to object/camera motion and illumination changes within the shot, while keenly capturing the significant content changes at shot boundaries, where it jumps to a local maximum. In the research field of shot detection and segmentation, after decades of research and discussion many scholars and researchers have proposed their own algorithms, which, according to the characteristics of shot transitions, detect shot boundaries based on different frame visual features and camera motion features and have achieved certain results. In general, shot detection and segmentation algorithms fall into the following classes: pixel-based algorithms, histogram-based algorithms, algorithms based on motion features, algorithms based on edge features, and so on.
A histogram intuitively reflects the overall distribution of the gray levels (gray-level histogram) or colors (color histogram) of an image. Because of its excellent global property it is widely used in image processing, and several metrics exist: the basic method is to compute the histogram difference between adjacent video frames, but the result differs with the kind of histogram adopted. The basic method can be extended by introducing weighting coefficients and computing a weighted histogram distance between two images; the histogram intersection between two images, or other distance metrics, can also be computed.
Histogram-based algorithms are the most widely used shot detection and segmentation methods; they are simple and convenient to apply and have relatively low computational complexity, and for most videos they can achieve fairly good results as long as the threshold is set properly. The main advantage of histogram-based algorithms is precisely their global property.
The basic idea of histogram-based algorithms is the same as that of pixel-based algorithms: both compute a frame difference value; the difference lies in the metric adopted, the former being an extension built on the latter. Pixel-based algorithms sum the absolute gray-level or brightness differences of corresponding pixels of two adjacent frames to measure the degree of frame difference. This is the simplest and most basic algorithm for computing the frame difference, and it proceeds as follows:
The inter-frame gray-level or brightness difference of corresponding pixels is given by formula (1):

fd(i, j) = |f_n(i, j) − f_{n+1}(i, j)|   (1)

where f_n(i, j) and f_{n+1}(i, j) denote the gray level or brightness value of pixel (i, j) in the n-th and (n+1)-th frames respectively (depending on the histogram type). The total frame difference between the n-th and (n+1)-th frames is:

Fd = (1/(MN)) Σ_{i=1..M} Σ_{j=1..N} fd(i, j)   (2)

The total frame difference is then compared with a predetermined threshold; if it exceeds the threshold, a shot transition occurs at this position.
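As an illustration only, the pixel-based frame difference of formulas (1) and (2) can be sketched in a few lines of Python/NumPy; the function name, the synthetic frames, and the threshold value are assumptions made for this sketch and are not prescribed by the method described here.

```python
import numpy as np

def pixel_frame_difference(frame_n: np.ndarray, frame_n1: np.ndarray) -> float:
    """Mean absolute gray-level difference between two frames (formulas (1)-(2))."""
    # fd(i, j) = |f_n(i, j) - f_{n+1}(i, j)|
    fd = np.abs(frame_n.astype(np.float64) - frame_n1.astype(np.float64))
    # Fd = (1 / (M * N)) * sum of fd over all pixels
    return float(fd.mean())

# Usage: flag a shot transition when the difference exceeds a preset threshold T.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f1 = rng.integers(0, 256, size=(240, 320), dtype=np.uint8)  # n-th gray frame
    f2 = rng.integers(0, 256, size=(240, 320), dtype=np.uint8)  # (n+1)-th gray frame
    T = 30.0  # illustrative threshold, not a value prescribed by the text
    print("cut detected" if pixel_frame_difference(f1, f2) > T else "same shot")
```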
Although the pixel-based algorithm is simple and easy to implement, it is very sensitive to object/camera motion within a shot: such motion changes the gray level or brightness of many pixels in the frame, leading to erroneous detection of shot boundaries. For this reason, histogram-based shot detection and segmentation methods were proposed.
(1) Histogram distance
An image histogram counts the distribution of pixels over each gray level, intensity level, or color level. Tonomura and Abe [14] proposed using the gray-level histogram as the frame-difference metric, computing the difference between the gray-level histograms of two adjacent frames as the image frame difference:

Σ_{v=0..V} |H(I_t, v) − H(I_{t−1}, v)| > T   (3)

If the frame difference of two adjacent frames satisfies formula (3), a shot transition occurs at this position. Various histogram-based improvements followed, for example: for color histograms, computing a quantized histogram difference according to the visual characteristics of the human eye and the need to reduce computation; for three-dimensional color spaces (typically RGB, HSV, etc.), computing the inter-frame histogram difference for each of the three color channels separately and taking a weighted sum. A representative extension is the quantized inter-frame histogram difference measure for three-dimensional color spaces proposed by Gargi and Kasturi [15]:

Σ_{k=1..3} Σ_{v=0..V} |H(I_t, C_k, v) − H(I_{t−1}, C_k, v)| > T   (4)

where C_k denotes the color channel of the color space, such as RGB or HSV; if the frame difference satisfies formula (4), a shot change occurs at this position.
(2) Histogram weighting
In a three-dimensional color space, some color components affect the color appearance of an image to a greater degree than others, or human vision is more sensitive to them (such as the Hue component of the HSV color space). Therefore, analyzed case by case, a large weight should be assigned to color components that strongly influence color appearance or better match human visual sensitivity, and a smaller weight to components with less influence or that are hard to perceive directly. The weighted sum gives a weighted inter-frame histogram difference that better reflects the content distance or difference between video frames as perceived by human vision [16]. If

Σ_{k=1..3} Σ_{v=0..V} (L(I_t, C_k) / L_mean(I_t)) · |H(I_t, C_k, v) − H(I_{t−1}, C_k, v)| > T   (5)

a shot transition is considered to occur at this position, where L(I_t, C_k) denotes the value of the k-th color component of frame t and L_mean(I_t) denotes the mean color obtained from all color components of frame t. Zhao [17] et al. proposed a new learning method that obtains a better similarity measure through min-max optimization, setting a different weight for each color component and thereby obtaining a weighted histogram distance. If

Σ_{k=1..3} Σ_{v=0..V} w(k, v) · |H(I_t, C_k, v) − H(I_{t−1}, C_k, v)| > T   (6)

a shot change is considered to occur, where w(k, v) denotes the weighting coefficient of the k-th color component of frame t.
(3) Histogram intersection
In the shot detection field, histogram intersection [2], another measure of histogram similarity, is also widely used, and it can be computed in several ways. For example, the histogram intersection of frames t−1 and t can be obtained by the minimum-function method: if

(1 − (1/(xy)) Σ_{v=0..V} min(H(I_t, v), H(I_{t−1}, v))) > T   (7)

a shot change is considered to occur at this position, where xy denotes the total number of pixels in the image frame; the histogram intersection computed in this way lies in [0, 1].

Another method of computing the histogram intersection [18] is shown in formula (8): if

(1 − (1/(xy)) Σ_{v=0..V} min(H(I_t, v), H(I_{t−1}, v)) / max(H(I_t, v), H(I_{t−1}, v))) > T   (8)

a shot change is considered to occur at this position.

The histogram intersection method counts the number of pixels of two adjacent frames that have the same gray level, brightness, or color value; its essence is the same as directly computing the histogram distance.
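To make the minimum-function form of the histogram intersection (formula (7)) concrete, a small Python/NumPy sketch follows; the bin count, the function name, and the threshold are illustrative assumptions.

```python
import numpy as np

def histogram_intersection_distance(frame_t: np.ndarray, frame_t1: np.ndarray,
                                    bins: int = 256) -> float:
    """1 - (1/xy) * sum_v min(H(I_t, v), H(I_{t-1}, v)), as in formula (7)."""
    xy = frame_t.size  # total number of pixels in the frame
    h_t, _ = np.histogram(frame_t, bins=bins, range=(0, 256))
    h_t1, _ = np.histogram(frame_t1, bins=bins, range=(0, 256))
    intersection = np.minimum(h_t, h_t1).sum() / xy
    return 1.0 - intersection  # in [0, 1]; large values suggest a shot change

# A shot change would be declared when this distance exceeds a preset threshold T.
```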
(4) χ² histogram
The χ² histogram method [3] is an effective extension of the traditional histogram method; because it amplifies the inter-frame histogram difference and the algorithm is more stable, it better reflects the difference between two adjacent frames and has been widely applied. χ² is defined as:
χ²(I_t, I_{t−1}) = Σ_{v=0..V} (H(I_t, v) − H(I_{t−1}, v))² / H(I_{t−1}, v)
χ² is then compared with a predetermined threshold T; if it is greater than T, a shot change is considered to occur at this position. Compared with the Kolmogorov-Smirnov test and Yakimovsky's likelihood-ratio test, this method performs better [19].
(5) Dual-threshold comparison method
The transition types of video shots can be divided into two kinds, cuts and gradual transitions. In general, the difference between consecutive frames during a gradual transition is smaller in amplitude than at a cut, but over the duration of the gradual transition the accumulated frame difference becomes quite pronounced. A single threshold therefore obviously cannot cope with the various cases of cuts and gradual transitions. For this reason, Zhang et al. proposed the dual-threshold comparison method (twin comparison) on the basis of the histogram distance [5]. First, two thresholds T_h and T_l are set, used respectively to detect cuts and gradual transitions. The frame difference of each pair of adjacent frames is computed in turn; if the frame difference somewhere exceeds T_h, a cut is considered to occur there. If the frame difference is smaller than T_h but greater than T_l, a gradual transition is considered to start there; the frame differences of subsequent frames continue to be computed and, if still greater than T_l, are accumulated, otherwise no shot transition is considered to have occurred, the start frame is discarded, the accumulated frame difference is cleared, and the judgment starts again from the next frame. When the accumulated frame difference exceeds T_h, the gradual transition is considered to end there; if by the last frame of the video, or when the frame difference falls below T_l, the accumulated frame difference still has not reached T_h, the earlier frame differences greater than T_l are considered to have been caused by other reasons.
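The twin-comparison logic described above can be sketched as follows; the frame-difference function and the two thresholds are left abstract, since the method works with whatever difference metric and threshold values are chosen.

```python
from typing import Callable, List, Sequence, Tuple

def twin_comparison(frames: Sequence, diff: Callable[[object, object], float],
                    t_high: float, t_low: float) -> List[Tuple[str, int]]:
    """Detect cuts and gradual transitions with two thresholds (twin comparison)."""
    events: List[Tuple[str, int]] = []
    acc, start = 0.0, None  # accumulated difference and start of a candidate gradual
    for i in range(1, len(frames)):
        d = diff(frames[i - 1], frames[i])
        if d > t_high:
            events.append(("cut", i))
            acc, start = 0.0, None
        elif d > t_low:
            if start is None:
                start, acc = i, 0.0
            acc += d
            if acc > t_high:                      # accumulated change behaves like a cut
                events.append(("gradual", start))
                acc, start = 0.0, None
        else:
            acc, start = 0.0, None                # abandon the candidate start frame
    return events
```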
The histogram-based and pixel-based algorithms of the prior art have the following problems:
(1) A histogram reflects only the overall distribution of image gray levels or colors and cannot convey the positional information and visual content of the image; two images whose content is completely unrelated may still have the same gray/color distribution. Moreover, two images with the same color distribution may contain the same objects and background but with the objects at different positions; typical examples are the tricolor national flags of France and the Netherlands, or Ireland and Côte d'Ivoire.
(2) A histogram intuitively reflects the overall distribution of the gray levels (gray-level histogram) or colors (color histogram) of an image and is fairly robust to slow object/camera motion within a shot, but its detection performance for rapid object/camera motion and for gradual transitions is still unsatisfactory, easily causing false detection or missed detection of shots.
(3) The various histogram-based measures detect shot boundaries from the overall change of gray level or color between video frames and do not take into account the interference of object/camera motion within a shot. During detection, if object/camera motion inside a shot causes a significant change in the overall gray-level or color distribution of a frame, that intra-shot frame is very likely to be identified as a shot boundary, causing erroneous detection. This problem can be alleviated by dividing each video frame into n × n image blocks, computing the inter-frame gray-level or color histogram difference of the corresponding blocks of adjacent frames, excluding the block with the largest difference, and accumulating the inter-frame histogram differences of the remaining blocks in some way. Compared with the traditional histogram-based method, this improvement detects camera motion within a shot better, but for some gradual-transition special effects, such as fade-in/fade-out, the detection performance is still unsatisfactory. In addition, violent illumination changes (such as flashes) also greatly disturb histogram-based shot detection.
(4) The dual-threshold comparison method fully considers the different characteristics of cuts and gradual transitions and detects them separately, satisfying general shot segmentation requirements. Moreover, a gradual transition is declared only when, on the premise that the frame difference is never below T_l, the accumulated frame difference reaches T_h, so the method has some resistance to burst noise. However, for some gradual-transition processes whose inter-frame changes are inconspicuous, the transition may already be over before the accumulated frame difference reaches T_h, probably causing a missed detection. In addition, if the difference between some pair of consecutive frames during the transition is very small (below T_l), the accumulation is terminated directly, also probably causing a missed detection.
Clustering algorithms are widely applied in the information sciences. Their basic idea is to start from an initial clustering and, according to certain video features and some similarity measure, assign each element of the sample set X = (X_1, X_2, ..., X_n) to the cluster with which it has the highest similarity, until the system or user requirements are finally met.
B. Gunsel, M. R. Naphade, and other scholars successively proposed using the K-means clustering algorithm [22]: according to the gray-level/color histogram difference of adjacent frames, the scenes are divided into two classes, with significant change and without significant change, for shot detection and segmentation. An isolated scene-change position is judged to be a cut, and consecutive scene-change positions are judged to be a gradual transition. The great advantage of K-means clustering for shot detection and segmentation is that no threshold needs to be set, multiple video features can be used simultaneously, and the Euclidean distance of the feature vectors is computed to improve the detection result. The essence of the clustering algorithm is to divide the frame difference values into two classes according to the minimum sum-of-squared-error criterion; its detection result is equivalent to setting a reasonable global threshold for each video segment. The algorithm adapts to each video sequence, but it is rather sensitive to external noise, and if the gradual-transition process is not distinct, it is easy to classify the transition into the class without obvious scene change.
Considering that the boundary between these two classes is fuzzy in actual scenes, Chi-Chun Lo [9] et al. proposed using the fuzzy C-means (FCM) clustering algorithm for shot detection and segmentation: all frame difference values are divided into three classes, Shot Change (SC), Suspected Shot Change (SSC), and No Shot Change (NSC), and the n suspected shot-change elements SSC(j), SSC(j+1), ..., SSC(j+n−1) lying between two adjacent elements SC(i) and SC(i+1) of the shot-change class are analyzed; each image frame in the suspected shot-change class is judged by formula (14) to belong either to the shot-change class or to the no-shot-change class:

H_SSC(k) ≥ param × [0.5 × (H_SC(i) + H_SC(i+1))]   (14)

where H_SC(i) and H_SC(i+1) denote the inter-frame histogram differences of the adjacent SC-class elements SC(i) and SC(i+1), and H_SSC(k) denotes the inter-frame histogram difference of the SSC-class element SSC(k) lying between SC(i) and SC(i+1). This algorithm needs no threshold and, by introducing the suspected shot-change class for further analysis, can classify some borderline frame differences more reasonably.
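The re-classification rule of formula (14) can be sketched as follows; the data layout (plain lists of inter-frame histogram differences) and the default value of param are assumptions made for illustration.

```python
from typing import List

def reclassify_suspected(h_ssc: List[float], h_sc_i: float, h_sc_i1: float,
                         param: float = 1.0) -> List[bool]:
    """Apply formula (14) to every suspected shot-change element between SC(i) and SC(i+1).

    Returns True where an element is promoted to the shot-change class,
    False where it is assigned to the no-shot-change class.
    """
    reference = param * 0.5 * (h_sc_i + h_sc_i1)
    return [h >= reference for h in h_ssc]
```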
To reduce the computational complexity of the fuzzy clustering algorithm, Xinbo Gao et al. also adopted a coarse-to-fine stepwise clustering method. First, coarse clustering is performed between frames spaced l frames apart (l ≥ 2) to obtain the approximate temporal position of an abrupt transition; then fine, frame-by-frame clustering is performed near the possible abrupt transition to determine its exact position.
The fuzzy clustering algorithm proposed by Xinbo Gao et al. [23] can also be used to detect gradual transitions. The algorithm measures the similarity of adjacent frames with a histogram difference metric (HDM) and a spatial difference metric (SDM), and defines all video frames as a point set in the feature space F_D generated by the HDM and SDM values:

F_D = {F_D(t) = (D_S(t), D_H(t)), t = 1, 2, ..., T}   (15)

In this way, the shot detection problem is converted into the problem of dividing the feature space into two subspaces, Significant Change (SC) and No Significant Change (NSC).

When processing a video with the above algorithm, the membership degrees of the current video frame with respect to the SC and NSC subspaces are first computed. If the membership degree of the current frame with respect to the significant-change class is higher, the frame is assigned to that class and represented by the Boolean value 1, otherwise by 0, until all frames of the video have been clustered; the video sequence is thus converted into a binary sequence, for example 1101001011110100101010... Cuts and gradual transitions each have their own characteristic patterns in this binary sequence, so abrupt and gradual shot transitions can be detected separately by pattern analysis of the transformed binary sequence. According to the analysis of Xinbo Gao et al., the binary pattern 010 indicates a cut, and the patterns 011 and 110 indicate a gradual transition.
In addition, the feature values of each video frame can be classified directly: since the low-level features of the frames within a shot have a certain similarity, the shot with the greatest feature similarity can be chosen as the shot to which a frame belongs. At a shot transition, the content change causes the visual or motion features of the frames to change, so the current frame at the transition is assigned to the next shot.
Among unsupervised clustering algorithms, iterative procedures are the most widely used. Their basic idea is to start from some initial clustering (selected in some way or specified manually) and, with a certain similarity measure, assign the elements of the sample set to the known clusters until the predetermined requirements of the system or user are met.
Because there is no supervision from expert prior knowledge, unsupervised clustering is a self-organizing, iterative, dynamic analysis process: as long as the termination condition is not met, it keeps converging according to some similarity computation until the user's or system's requirement on the number of clusters or the cluster density is satisfied. When clustering video frames with an unsupervised algorithm, the similarity measures described above can be adopted, including color histograms, edge change ratios, motion vectors, and so on.
The unsupervised clustering algorithm controls the cluster density with a threshold δ [10]. Taking the first frame f_1 as the initial cluster, it computes, for each subsequent frame f_i, i ∈ [1, N], the similarity S(f_i, C_k) to all previously known cluster centers (intra-shot class centers) C_k, k ∈ [1, M], keeps the maximum value S_max and its index k, compares it with the similarity threshold δ to decide whether the frame belongs to an existing class, and performs dynamic feature clustering on that basis; consecutive frames assigned to the same class form the same shot. If the k-th cluster originally contains N_k frames, then

C_k = (f_i + Σ_{j=1..N_k} f_j) / (N_k + 1),   if S_max ≥ δ
C_{k+1} = f_i,                                if S_max < δ       (16)

where C_k and C_{k+1} are the centers of the k-th and (k+1)-th clusters respectively.
K-means and ISODATA (Iterative Self-Organizing Data Analysis Technique) are two commonly used iterative unsupervised clustering algorithms. The K-means algorithm randomly selects k initial cluster centers and assigns each sample to the nearest cluster center in feature space for dynamic clustering; ISODATA performs repeated self-organizing dynamic analysis of the sample data, and within the allowed range of variation of the relevant parameters the final number of clusters is not fixed.
Unsupervised clustering algorithms reduce the computational complexity to some extent and avoid setting a threshold, but when the content changes greatly within a shot, the frames of one shot may be assigned to different clusters (shots), causing false detection, and the classification result is closely related to the initial centroid (start frame). In addition, because the temporal characteristics of video are not fully considered when the unsupervised clustering algorithm is applied in practice, the frame numbers within a shot may become discontinuous.
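A minimal sketch of the threshold-controlled clustering of formula (16) is given below; frames are represented by feature vectors and the similarity function is left abstract, both assumptions made for illustration.

```python
import numpy as np
from typing import Callable, List

def threshold_clustering(features: List[np.ndarray],
                         similarity: Callable[[np.ndarray, np.ndarray], float],
                         delta: float) -> List[int]:
    """Assign each frame feature to the most similar existing center, or open a new cluster."""
    centers: List[np.ndarray] = [features[0].astype(np.float64)]
    counts: List[int] = [1]
    labels: List[int] = [0]
    for f in features[1:]:
        sims = [similarity(f, c) for c in centers]
        k = int(np.argmax(sims))
        if sims[k] >= delta:                       # S_max >= delta: join cluster k
            centers[k] = (centers[k] * counts[k] + f) / (counts[k] + 1)
            counts[k] += 1
            labels.append(k)
        else:                                      # S_max < delta: start a new cluster
            centers.append(f.astype(np.float64))
            counts.append(1)
            labels.append(len(centers) - 1)
    return labels
```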
Summary of the invention
In view of the problems of existing video detection algorithms, such as false detection of shots and discontinuous frame numbers, the present invention proposes, for the shot detection part, an image detection method based on an improved histogram and an improved clustering algorithm.
The technical scheme by which the present invention solves the above technical problem is: a shot detection method based on an improved histogram and a frame-difference method, comprising the steps of: computing the intersection of the histograms of two adjacent image frames and judging from the histogram similarity whether a shot change occurs; if a shot change occurs, performing a secondary detection of the shot boundary on the two adjacent frames using the inter-frame gray-level/color difference, applying non-uniform block weighting, computing the pixel difference of each block separately, comparing the pixel difference with a preset block frame-difference threshold to obtain a marker variable, taking the weighted sum of the marker variables of all blocks, and comparing the weighted sum with a preset block-partition threshold; and merging any shot with fewer than 20 frames back into the preceding shot.
The histogram similarity of adjacent frames t and t−1 is computed according to the formula

S(t, t−1) = (m_h × S_h(t, t−1) + m_s × S_s(t, t−1) + m_v × S_v(t, t−1)) / 3

where S_h(t, t−1), S_s(t, t−1), and S_v(t, t−1) are the histogram similarities of the H, S, and V components respectively. The similarity of the H components of two adjacent frames is determined according to

S_h(t, t−1) = Σ_{i=1..N} min(h_t(i), h_{t−1}(i)) / max(h_t(i), h_{t−1}(i))

where h_t(i) and h_{t−1}(i) denote the histograms of the H component of frames t and t−1, and N denotes the number of gray or color quantization levels of the image. The weighting coefficients m_h, m_s, m_v of the H, S, and V components may be set to 0.9:0.3:0.1.
The present invention also proposes a shot detection method based on clustering detection. The first frame f_1 of the video sequence is taken as the first shot and as the intra-class center of the first shot, and the Boolean access variable of this shot is set to Shot.access ≡ 1. The next frame f_2 of the video sequence is extracted, the histogram similarities between the video sequence frame and the intra-class center of the current shot on the three components H, S, V are computed respectively, and the total histogram similarity is computed by weighting according to the formula

S(f, Shot) = (m_h × S_H(f, Shot) + m_S × S_S(f, Shot) + m_V × S_V(f, Shot)) / 3

If S(f, Shot) > T, the video sequence frame f is considered to belong to the shot with intra-class center Shot; f is put into Shot, and the intra-class center of the shot is recomputed according to the formula

Shot = (f + Σ_{i=1..Shot.len} f_i) / (Shot.len + 1);   Shot.len = Shot.len + 1

If S(f, Shot) < T, a new shot is created, the video sequence frame f is put into the new shot as the intra-class center of the new shot, the Boolean access variable of the previous shot is set to 0, and the Boolean access variable of the new shot is set to Shot.access ≡ 1, where f_i denotes the frames already inside the shot.
Computing the histogram similarities between the video sequence frame and the intra-class center of the current shot on the three components H, S, V is specifically: projecting the video sequence V = {f_1, f_2, ..., f_n} onto the HSV color space, performing non-uniform quantization of the H, S, and V components and determining the quantization levels, and, according to the histogram components H(i), S(j), V(k), computing the histogram similarities on the three components between the current video sequence frame under test and the intra-class center of the current shot by

S_H(f, Shot) = Σ_{i=1..8} min(H(i), Shot_H(i)) / max(H(i), Shot_H(i))
S_S(f, Shot) = Σ_{j=1..3} min(S(j), Shot_S(j)) / max(S(j), Shot_S(j))
S_V(f, Shot) = Σ_{k=1..3} min(V(k), Shot_V(k)) / max(V(k), Shot_V(k))
The two methods proposed by the present invention have low computational complexity; without significantly increasing the computation and time complexity, they improve the accuracy of shot detection and solve problems such as false detection of shots and discontinuous frame numbers.
Description of the drawings
Fig. 1 is the processing flow of the histogram method of the present invention;
Fig. 2 is the processing flow of the frame-difference method of the present invention;
Fig. 3 is the flow of the clustering algorithm of the present invention.
Embodiment
Histograms can be applied in many ways; the present invention adopts an improved form, the histogram intersection.
Because a histogram cannot convey the positional information and visual content of an image, two images whose content is completely unrelated may still have the same gray/color distribution. Therefore, the present invention improves the histogram with a non-uniform blocking and weighting preprocessing step to highlight the contribution of the core region to the frame difference, while greatly reducing the influence of small-range motion within a shot on shot detection; compared with the traditional color histogram method, the result is closer to human visual cognition. In addition, for the video content, the interference of advertisements or captions at the top or bottom of the video with shot detection is effectively suppressed.
Specifically:
The histogram method is used to detect shots: whether the shot changes is determined from the intersection of the histograms of two adjacent image frames.
(1) Obtain the intersection of the histograms of the two adjacent frames, compute the histogram similarity of the two adjacent frames, and compare the similarity with a threshold to make a preliminary judgment of whether a shot change occurs; if the similarity is below the threshold, a shot change is preliminarily judged to occur. According to the general range established by experiment, the histogram similarity threshold lies in 0.75-0.95, and the best overall result is obtained when the threshold is set to 0.9.
The similarity of the H components of two adjacent frames is determined by:

S_h(t, t−1) = Σ_{i=1..N} min(h_t(i), h_{t−1}(i)) / max(h_t(i), h_{t−1}(i))   (21)

where h_t(i) and h_{t−1}(i) denote the histograms of the H component of frames t and t−1, and N denotes the number of gray or color quantization levels of the image. Similarly, the histogram similarities of the S and V components are:

S_s(t, t−1) = Σ_{i=1..N} min(s_t(i), s_{t−1}(i)) / max(s_t(i), s_{t−1}(i))  and  S_v(t, t−1) = Σ_{i=1..N} min(v_t(i), v_{t−1}(i)) / max(v_t(i), v_{t−1}(i))

where s_t(i), s_{t−1}(i) and v_t(i), v_{t−1}(i) denote the histograms of the S and V components of frames t and t−1 respectively.
In the HSV space, the histogram similarity of frames t and t−1 is determined according to:

S(t, t−1) = (m_h × S_h(t, t−1) + m_s × S_s(t, t−1) + m_v × S_v(t, t−1)) / 3   (22)
The histogram similarity threshold is generally set in the range 0.75-0.95; extensive comparative experiments on collections of image frames show that, within this range, the best overall result is obtained when the threshold is set to 0.9.
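A compact sketch of the weighted HSV histogram similarity of formulas (21) and (22) follows; the frames are assumed to be already converted to HSV with every channel scaled to [0, 1], and the 8/3/3 bin counts follow the non-uniform quantization levels mentioned later in the text.

```python
import numpy as np

# Non-uniform quantization levels assumed from the text: 8 H bins, 3 S bins, 3 V bins.
BINS = {"h": 8, "s": 3, "v": 3}
WEIGHTS = {"h": 0.9, "s": 0.3, "v": 0.1}   # m_h : m_s : m_v = 0.9 : 0.3 : 0.1

def component_similarity(a: np.ndarray, b: np.ndarray, bins: int) -> float:
    """Formula (21): sum over bins of min(h_t, h_{t-1}) / max(h_t, h_{t-1})."""
    ha, _ = np.histogram(a, bins=bins, range=(0.0, 1.0))
    hb, _ = np.histogram(b, bins=bins, range=(0.0, 1.0))
    num = np.minimum(ha, hb).astype(np.float64)
    den = np.maximum(ha, hb).astype(np.float64)
    mask = den > 0                              # skip bins that are empty in both frames
    return float((num[mask] / den[mask]).sum())

def hsv_similarity(frame_t: np.ndarray, frame_t1: np.ndarray) -> float:
    """Formula (22): weighted mean of the H, S, V component similarities.

    Frames are assumed to be float arrays of shape (rows, cols, 3) in HSV order.
    """
    total = 0.0
    for idx, c in enumerate("hsv"):
        total += WEIGHTS[c] * component_similarity(frame_t[..., idx],
                                                   frame_t1[..., idx], BINS[c])
    return total / 3.0
```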
(2) If the similarity is below the threshold, the inter-frame gray-level/color difference is further used to perform a secondary detection of the shot boundary on the two adjacent frames: non-uniform block partitioning is applied (for example into 9 blocks, with the central block having the largest proportion and the weights summing to 1), the pixel difference of each block is computed separately and compared with a preset block frame-difference threshold (value range 10-30) to produce a marker, then the weighted sum of the marker variables of the blocks is taken and compared with a preset block-partition threshold (value range 0.0-0.4) to judge whether a shot change occurs.
The block frame-difference threshold can be obtained as follows. The pixel difference of corresponding blocks of two adjacent frames is:

Fd = (1/(MN)) Σ_{i=1..M} Σ_{j=1..N} |f_n(i, j) − f_{n+1}(i, j)|

where M × N is the size of the block and f_n(i, j), f_{n+1}(i, j) are the color values of the n-th and (n+1)-th frames at point (i, j). The best overall results are obtained when the block frame-difference threshold lies in the range 10-30.
The non-uniform block partitioning is designed mainly to overcome the shortcomings that the histogram method ignores positional information and that the frame-difference method is very sensitive to object/camera motion within a shot, and thereby to improve the recall and precision of shot detection. Extensive experiments show that the best results are obtained when the block-partition threshold lies in the range 0.0-0.4.
Two adjacent frames are extracted from the video, their histogram intersection is computed in the HSV space to obtain their histogram similarity, and the similarity is compared with the preset threshold; if it is below the threshold, a shot change is preliminarily judged. To judge more accurately whether the shot has changed, a further test is made. The inter-frame gray-level/color difference is used for the secondary detection of the shot boundary: the two adjacent frames are extracted from the video and partitioned into non-uniform blocks, then the pixel difference of each corresponding block is computed. If the pixel difference of a block exceeds the block frame-difference threshold, the block is marked 1, otherwise 0. The marker variables are then weighted and summed; if the weighted sum exceeds the block-partition threshold, the shot has changed, otherwise it has not.
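The secondary, block-based check just described can be sketched as follows; the 3×3 layout, the weight matrix favoring the central block, and the two default thresholds are illustrative assumptions consistent with the ranges quoted above.

```python
import numpy as np

# Illustrative non-uniform weights for a 3x3 partition; the central block dominates
# and the weights sum to 1, as suggested in the text.
BLOCK_WEIGHTS = np.array([[0.05, 0.10, 0.05],
                          [0.10, 0.40, 0.10],
                          [0.05, 0.10, 0.05]])

def block_secondary_detection(frame_n: np.ndarray, frame_n1: np.ndarray,
                              block_diff_threshold: float = 20.0,   # range 10-30
                              partition_threshold: float = 0.2      # range 0.0-0.4
                              ) -> bool:
    """Return True when the weighted block markers indicate a shot change."""
    rows, cols = frame_n.shape[:2]
    markers = np.zeros((3, 3))
    r_edges = np.linspace(0, rows, 4, dtype=int)
    c_edges = np.linspace(0, cols, 4, dtype=int)
    for bi in range(3):
        for bj in range(3):
            a = frame_n[r_edges[bi]:r_edges[bi + 1], c_edges[bj]:c_edges[bj + 1]]
            b = frame_n1[r_edges[bi]:r_edges[bi + 1], c_edges[bj]:c_edges[bj + 1]]
            # Fd for the block: mean absolute pixel difference of corresponding pixels
            fd = np.abs(a.astype(np.float64) - b.astype(np.float64)).mean()
            markers[bi, bj] = 1.0 if fd > block_diff_threshold else 0.0
    return float((markers * BLOCK_WEIGHTS).sum()) > partition_threshold
```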
Because in the HSV color space the human eye is most sensitive to the H component, according to the weighting ratio of the H, S, and V components, after the quantization values of H, S, and V are obtained, the coefficient ratio of the H, S, and V components is Q_H : Q_S : Q_V, where Q_H, Q_S, Q_V are the quantization levels of the H, S, and V components respectively; in the present invention the optimal coefficient ratio may be set to 9:3:1.
S_h(t, t−1), S_s(t, t−1), and S_v(t, t−1) are the histogram similarities of the H, S, and V components respectively, and the ratio of the image gray or color quantization levels N used in the H, S, and V component similarities is Q_H : Q_S : Q_V. To better reflect the contribution of the H, S, and V components to the histogram similarity, the weights of the H, S, and V components are set in a certain ratio; for example, the weighting coefficients m_h, m_s, m_v of the three components may be set to 0.9:0.3:0.1.
Based on considerations of human visual perception, the H, S, and V color components are each quantized non-uniformly, and in the corresponding similarity matching each color component is assigned a different weight; the inter-frame histogram difference computed in this way better reflects the degree of difference perceived by human vision and has a certain perceptual uniformity.
(3) Considering situations of strong illumination change, especially flashes, a shot with fewer than 20 frames is merged back into the preceding shot.
To further improve the recall and precision of shot detection, the above method, after detecting shots with the improved histogram method, further filters the detected shots with the frame-difference method, forming an overall shot detection approach that combines the histogram method and the frame-difference method and effectively reduces the missed and false detections that a purely histogram-based method may bring. In addition, violent illumination changes, especially flashes, last only a small number of frames, and because of the persistence-of-vision effect that applies to visual media such as animation and film (24 frames per second), the present invention merges any shot with fewer than 20 frames back into the preceding shot, making the result fit the human visual system.
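The post-processing step that merges short shots can be sketched in a few lines; shots are represented here simply as lists of frame indices, an assumption made for illustration.

```python
from typing import List

def merge_short_shots(shots: List[List[int]], min_len: int = 20) -> List[List[int]]:
    """Merge every shot shorter than min_len frames back into the preceding shot."""
    merged: List[List[int]] = []
    for shot in shots:
        if merged and len(shot) < min_len:
            merged[-1].extend(shot)   # flash-like segments rejoin the previous shot
        else:
            merged.append(list(shot))
    return merged
```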
For the video under test, the improved histogram intersection method is applied in the HSV color space: based on considerations of human visual perception, the H, S, and V color components are each quantized non-uniformly, and in the corresponding similarity matching each color component is assigned a different weight, so that the computed inter-frame histogram difference better reflects the difference perceived by human vision and has a certain perceptual uniformity. After this processing, the post-processing flow of the improved pixel frame-difference method is entered, matching and weighting by non-uniform blocks; this effectively suppresses the interference of advertisements or captions at the top or bottom of the video with shot detection, fully takes into account the positional information of each pixel of the frame, and plays a good complementary role to the improved histogram method.
The present invention can also adopt an improved clustering detection method to detect video shots, judging from the similarity whether the frame under test belongs to the current shot.
Fig. 3 shows the flowchart of the improved clustering algorithm.
When traditional unsupervised clustering algorithms are used for shot detection, because the characteristics of the video data stream are not fully considered, every data object under test (image frame) is compared for similarity with all known cluster centers (intra-shot class centers) and assigned to the most similar cluster (shot). This is very likely to cause false shot detection and discontinuous frame numbers within a shot, and the time and computational complexity are also large. In view of this, considering the temporal characteristics of the video stream, each video frame is compared for clustering only with the current shot whose clustering is not yet complete; shots whose segmentation is complete no longer take part in subsequent clustering (only by first judging whether the shot has changed, and whether a completely segmented shot is followed by a new shot, can the video be cut exactly at the shot boundaries to form individual shots and guarantee the accuracy of key-frame extraction). For this purpose a Boolean access variable access is introduced: when access ≡ 0 for a shot, the shot has been completely segmented; otherwise, the shot is the one currently being compared in the clustering. In addition, because the clustering algorithm also uses histograms in the HSV space, the HSV histogram weighting must also be considered when computing the similarity between the frame under test and the current shot. The video sequence V = {f_1, f_2, ..., f_n} is projected onto the HSV color space, the H, S, and V components are quantized non-uniformly, and the histogram components H(i), S(j), V(k) are computed, where i ∈ [1, 8], j ∈ [1, 3], k ∈ [1, 3] index the quantization levels of the H, S, and V components respectively.
Then, using the histogram intersection algorithm, the histogram similarities on the three components between the current video sequence frame under test and the intra-class center of the current shot are computed:

S_H(f, Shot) = Σ_{i=1..8} min(H(i), Shot_H(i)) / max(H(i), Shot_H(i))
S_S(f, Shot) = Σ_{j=1..3} min(S(j), Shot_S(j)) / max(S(j), Shot_S(j))
S_V(f, Shot) = Σ_{k=1..3} min(V(k), Shot_V(k)) / max(V(k), Shot_V(k))   (23)
Specifically, the following procedure can be adopted:
(1) The first frame f_1 of the video sequence is regarded as the first shot, f_1 is also the intra-class center of the shot, and the Boolean access variable of this shot is set to Shot.access ≡ 1.
(2) The next frame f_2 of the video sequence is extracted, and after the histogram similarities on the H, S, and V components between the current video sequence frame and the intra-class center of the shot are computed respectively, the total histogram similarity is computed by weighting according to formula (24):

S(f_i, Shot) = (m_h × S_H(f_i, Shot) + m_S × S_S(f_i, Shot) + m_V × S_V(f_i, Shot)) / 3   (24)

where m_h, m_s, m_v are the weighting coefficients of the H, S, and V components respectively.
In general, because vision is most sensitive to the H component, m_h ≥ m_s and m_h ≥ m_v. Consistent with the quantization weighting ratio of the HSV color space, and in order to reflect the contribution of the S and V components to the similarity, the weighting coefficients may be assigned the values 0.9, 0.3, and 0.1 respectively; the shot currently in the clustering must satisfy Shot.access ≡ 1.
(3) If S(f, Shot) > T, the video sequence frame f is considered to belong to the shot Shot. f is put into Shot, and the intra-class center of Shot is recomputed as:

Shot = (f + Σ_{i=1..Shot.len} f_i) / (Shot.len + 1);   Shot.len = Shot.len + 1   (25)

where f_i denotes the frames already inside the shot.
Otherwise, if S(f, Shot) < T, f is considered not to belong to Shot: a new shot is created, f is put into the new shot and also serves as its intra-class center, the number of clusters is increased by 1, the access variable of the previous shot is set to 0, and the new shot is given Shot.access ≡ 1.
Here, Shot is the intra-class center of the shot, f is the current frame, f_i denotes the frames already inside the shot, T is the shot similarity threshold, and Shot.len is the number of frames in the cluster.
(4) If the video has not been completely processed, return to step (2); otherwise the algorithm ends.
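Putting steps (1) to (4) together, a sketch of the improved clustering loop is given below; frame features are abstracted as 8/3/3-bin HSV histograms, the intra-class center is maintained as the mean histogram of the frames in the open shot, and the similarity threshold T is left to the caller; these are assumptions made for illustration.

```python
import numpy as np
from typing import List

def hsv_histograms(frame_hsv: np.ndarray):
    """8/3/3-bin histograms of the H, S, V channels (channels assumed scaled to [0, 1])."""
    return [np.histogram(frame_hsv[..., c], bins=b, range=(0.0, 1.0))[0].astype(np.float64)
            for c, b in enumerate((8, 3, 3))]

def weighted_similarity(hists_f, hists_shot, weights=(0.9, 0.3, 0.1)) -> float:
    """Formulas (23)-(24): per-channel histogram intersection, then weighted mean over 3."""
    total = 0.0
    for w, hf, hs in zip(weights, hists_f, hists_shot):
        den = np.maximum(hf, hs)
        mask = den > 0
        total += w * float((np.minimum(hf, hs)[mask] / den[mask]).sum())
    return total / 3.0

def cluster_shots(frames_hsv: List[np.ndarray], T: float) -> List[List[int]]:
    """Sequentially cluster frames into shots, comparing only with the open (access=1) shot."""
    shots: List[List[int]] = [[0]]
    center = hsv_histograms(frames_hsv[0])          # intra-class center of the open shot
    for idx in range(1, len(frames_hsv)):
        hists = hsv_histograms(frames_hsv[idx])
        if weighted_similarity(hists, center) > T:  # frame joins the current shot
            n = len(shots[-1])
            center = [(c * n + h) / (n + 1) for c, h in zip(center, hists)]  # formula (25)
            shots[-1].append(idx)
        else:                                       # close the shot, open a new one
            shots.append([idx])
            center = hists
    return shots
```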
In selecting test samples, considering the ubiquity and popularity of the chosen videos, five types of video were selected, including animation (Beelzebub ED), advertisement (innisfree cm), news (Cctv_news), TV guide (Anime 10th anniversary), and music video (Taiyou no Uta_clip), and recall and precision were used to measure the detection performance of the shot detection algorithms.
Recall:  R = N_c / (N_c + N_m) × 100%   (26)

Precision:  P = N_c / (N_c + N_f) × 100%   (27)

where N_c, N_m, and N_f are respectively the number of correctly detected, missed, and falsely detected shots.
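A trivial helper for formulas (26) and (27), included only to make the evaluation measures explicit:

```python
def recall(n_correct: int, n_missed: int) -> float:
    """Formula (26): R = Nc / (Nc + Nm) * 100%."""
    return 100.0 * n_correct / (n_correct + n_missed)

def precision(n_correct: int, n_false: int) -> float:
    """Formula (27): P = Nc / (Nc + Nf) * 100%."""
    return 100.0 * n_correct / (n_correct + n_false)
```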
The intersection of the histograms of two frames is computed by the minimum-function method to measure their similarity, and is compared with a preset threshold T to judge whether a scene switch exists. The histogram similarity of two adjacent frames is defined as:

Sim = (1/(xy)) Σ_{v=0..V} min(H(I_t, v), H(I_{t−1}, v)) / max(H(I_t, v), H(I_{t−1}, v))   (28)
Considering that the traditional frame-difference method is very sensitive to object/camera motion in the video and therefore easily causes erroneous detection, the frame-difference method of the present invention incorporates the idea of non-uniform block partitioning: the pixel difference of each block is computed point by point and compared with the preset block frame-difference threshold to produce a marker, then the marker variables of the blocks are weighted and summed and compared with the preset block-partition threshold to judge whether a cut exists. The frame difference of corresponding blocks of two adjacent frames is defined as:

Fd = (1/(MN)) Σ_{i=1..M} Σ_{j=1..N} |f_n(i, j) − f_{n+1}(i, j)|   (29)
To quantitatively evaluate the shot segmentation algorithm of the present invention in comparison with the histogram method and the frame-difference method, the algorithms proposed by the present invention were tested separately; the experimental results are shown in Table 1.
Table 1 shot detection result
As can be seen from Table 1, the shot detection precision obtained by the overall approach is higher than that of the two classic methods, but the shot recall is constrained by the results obtained with each of those two methods. Taking the music video "Taiyou no Uta_clip" in the last row of the table as an example, because it contains many fast cuts, gradual transitions, subject motion within shots, and illumination changes within some shots (assuming that the frames before and after a gradual transition and the frames during the transition belong to different shots), each of the applied methods exhibits some missed detections.
The two algorithms proposed by the present invention have relatively low computational complexity; without significantly increasing the computation and time complexity, they improve the accuracy of shot detection.

Claims (6)

1. A shot detection method based on an improved histogram and a frame-difference method, characterized in that: the intersection of the histograms of two adjacent image frames is computed to obtain the histogram similarity, and whether the shot changes is preliminarily judged from the histogram similarity; the inter-frame gray-level/color difference is used for a secondary detection of the shot boundary: the two adjacent frames are extracted from the video and partitioned into non-uniform blocks, the pixel difference of each corresponding block is computed and compared with a preset block frame-difference threshold to obtain a marker variable, the marker variables of the blocks are weighted and summed, and the weighted sum is compared with a preset block-partition threshold; if it is greater than the block-partition threshold, the shot has changed; and a shot with fewer than 20 frames is merged back into the preceding shot.
2. The method according to claim 1, characterized in that obtaining the histogram similarity specifically comprises: computing the histogram similarity of adjacent frames t and t−1 according to the formula

S(t, t−1) = (m_h × S_h(t, t−1) + m_s × S_s(t, t−1) + m_v × S_v(t, t−1)) / 3

and comparing the histogram similarity of the two adjacent frames with a threshold to judge whether a shot change occurs, wherein S_h(t, t−1), S_s(t, t−1), and S_v(t, t−1) are the histogram similarities of the H, S, and V components respectively, determined according to the formulas

S_h(t, t−1) = Σ_{i=1..N} min(h_t(i), h_{t−1}(i)) / max(h_t(i), h_{t−1}(i)),
S_s(t, t−1) = Σ_{i=1..N} min(s_t(i), s_{t−1}(i)) / max(s_t(i), s_{t−1}(i)),
S_v(t, t−1) = Σ_{i=1..N} min(v_t(i), v_{t−1}(i)) / max(v_t(i), v_{t−1}(i)),

wherein h_t(i) and h_{t−1}(i) denote the histograms of the H component of frames t and t−1, and N denotes the number of gray or color quantization levels of the image.
3. The method according to claim 1, characterized in that obtaining the histogram similarity specifically comprises: calculating the histogram similarity, on the H, S and V components, between a video sequence frame and the cluster center of the current shot, specifically: projecting the video sequence V = {f_1, f_2, ..., f_n} onto the HSV color space, performing non-uniform quantization of the H, S and V components and determining the quantization levels, and, from the H, S and V components H(i), S(j), V(k) of the histogram, invoking the formulas
S_H(f, Shot) = \sum_{i=1}^{8} \min(H(i), Shot_H(i)) / \max(H(i), Shot_H(i)), S_S(f, Shot) = \sum_{j=1}^{3} \min(S(j), Shot_S(j)) / \max(S(j), Shot_S(j)), S_V(f, Shot) = \sum_{k=1}^{3} \min(V(k), Shot_V(k)) / \max(V(k), Shot_V(k)) to respectively calculate the histogram similarity, on the three components, between the video sequence frame and the cluster center of the current shot, wherein Shot denotes the shot cluster center.
4. The method according to claim 2, characterized in that judging whether a shot change occurs according to the histogram similarity further comprises: comparing the histogram similarity with a set threshold, and preliminarily judging that a shot change occurs when the similarity is less than the set threshold.
5. The method according to claim 3, characterized in that judging whether a shot change occurs according to the histogram similarity further comprises: taking the first frame f_1 of the video sequence as the first shot and as the cluster center of the first shot, and setting the Boolean access variable of this shot to Shot.access = 1; calculating the total weighted histogram similarity according to the formula S(f_i, Shot) = [m_h \times S_H(f_i, Shot) + m_S \times S_S(f_i, Shot) + m_V \times S_V(f_i, Shot)] / 3; if S(f_i, Shot) > T, the video sequence frame f_i is considered to belong to the shot whose cluster center is Shot, f_i is put into Shot, and the shot cluster center is recalculated according to the formula [given only as an image in the original publication] together with Shot.len = Shot.len + 1; if S(f_i, Shot) < T, a shot change occurs, a new shot is established, the video sequence frame f_i is put into the new shot as the cluster center of the new shot, the Boolean access variable of the previous shot is set to 0, and the Boolean access variable of the new shot is set to Shot.access = 1.
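A hedged end-to-end sketch of the clustering in claim 5, assuming each shot cluster center is represented by per-channel H/S/V histograms with the 8/3/3 quantization of claim 3 and is updated as a running mean of its member frames' histograms. The running-mean update stands in for the center-update formula that appears only as an image in the original publication, and the OpenCV calls, the threshold value T and the function names are assumptions for illustration.

```python
import cv2
import numpy as np

def hsv_hists(frame_bgr, bins=(8, 3, 3)):
    """Per-channel H, S, V histograms with the 8/3/3 non-uniform quantization of claim 3."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    ranges = ([0, 180], [0, 256], [0, 256])          # OpenCV's H channel spans 0..179
    return [cv2.calcHist([hsv], [c], None, [bins[c]], ranges[c]).ravel() for c in range(3)]

def sim(h_a, h_b, eps=1e-10):
    """Sum over bins of min/max of two histograms (the S_H / S_S / S_V form)."""
    return float(np.sum(np.minimum(h_a, h_b) / (np.maximum(h_a, h_b) + eps)))

def cluster_shots(frames, T=2.0, weights=(0.9, 0.3, 0.1)):
    """Online shot clustering: assign each frame to the current shot center or open a new shot."""
    shots = []       # each shot: {"frames": [...], "center": [H, S, V hists], "len", "access"}
    current = None
    for idx, frame in enumerate(frames):
        hists = hsv_hists(frame)
        if current is None:                          # first frame founds the first shot
            current = {"frames": [idx], "center": hists, "len": 1, "access": 1}
            shots.append(current)
            continue
        m_h, m_s, m_v = weights
        s_h, s_s, s_v = (sim(a, b) for a, b in zip(hists, current["center"]))
        s_total = (m_h * s_h + m_s * s_s + m_v * s_v) / 3.0
        if s_total > T:                              # frame belongs to the current shot cluster
            current["frames"].append(idx)
            n = current["len"]
            # Running-mean center update (assumption; the original formula is only an image).
            current["center"] = [(c * n + h) / (n + 1) for c, h in zip(current["center"], hists)]
            current["len"] = n + 1
        else:                                        # shot change: close the old shot, open a new one
            current["access"] = 0
            current = {"frames": [idx], "center": hists, "len": 1, "access": 1}
            shots.append(current)
    return shots
```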
6. The method according to claim 2 or claim 5, characterized in that the weighting coefficients m_h, m_s, m_v of the H, S and V components are set in the ratio 0.9:0.3:0.1.
CN201310379940.1A 2013-08-27 2013-08-27 Based on the shot detection method improving rectangular histogram and clustering algorithm Active CN103426176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310379940.1A CN103426176B (en) 2013-08-27 2013-08-27 Based on the shot detection method improving rectangular histogram and clustering algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310379940.1A CN103426176B (en) 2013-08-27 2013-08-27 Based on the shot detection method improving rectangular histogram and clustering algorithm

Publications (2)

Publication Number Publication Date
CN103426176A true CN103426176A (en) 2013-12-04
CN103426176B CN103426176B (en) 2017-03-01

Family

ID=49650866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310379940.1A Active CN103426176B (en) 2013-08-27 2013-08-27 Based on the shot detection method improving rectangular histogram and clustering algorithm

Country Status (1)

Country Link
CN (1) CN103426176B (en)

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090951A (en) * 2014-07-04 2014-10-08 李阳 Abnormal data processing method
CN104243769A (en) * 2014-09-12 2014-12-24 刘鹏 Video scene change detection method based on self-adaptation threshold value
CN104410867A (en) * 2014-11-17 2015-03-11 北京京东尚科信息技术有限公司 Improved video shot detection method
CN104469545A (en) * 2014-12-22 2015-03-25 无锡天脉聚源传媒科技有限公司 Method and device for verifying splitting effect of video clip
CN104469546A (en) * 2014-12-22 2015-03-25 无锡天脉聚源传媒科技有限公司 Video clip processing method and device
CN104539942A (en) * 2014-12-26 2015-04-22 赞奇科技发展有限公司 Video shot switching detection method and device based on frame difference cluster
WO2015131772A1 (en) * 2014-03-04 2015-09-11 Tencent Technology (Shenzhen) Company Limited Method and apparatus for dividing image area
CN104994366A (en) * 2015-06-02 2015-10-21 陕西科技大学 FCM video key frame extracting method based on feature weighing
CN105915758A (en) * 2016-04-08 2016-08-31 绍兴文理学院元培学院 Video searching method
CN106131434A (en) * 2016-08-18 2016-11-16 深圳市金立通信设备有限公司 A kind of image pickup method based on multi-camera system and terminal
CN106162158A (en) * 2015-04-02 2016-11-23 无锡天脉聚源传媒科技有限公司 A kind of method and device identifying lens shooting mode
CN106412619A (en) * 2016-09-28 2017-02-15 江苏亿通高科技股份有限公司 HSV color histogram and DCT perceptual hash based lens boundary detection method
CN106603886A (en) * 2016-12-13 2017-04-26 Tcl集团股份有限公司 Video scene distinguishing method and system
CN106898036A (en) * 2017-02-28 2017-06-27 宇龙计算机通信科技(深圳)有限公司 Image processing method and mobile terminal
CN106960211A (en) * 2016-01-11 2017-07-18 北京陌上花科技有限公司 Key frame acquisition methods and device
CN107408264A (en) * 2014-12-30 2017-11-28 电子湾有限公司 Similar Articles detecting
CN107424163A (en) * 2017-06-09 2017-12-01 广东技术师范学院 A kind of lens boundary detection method based on TextTiling
CN107798304A (en) * 2017-10-20 2018-03-13 央视国际网络无锡有限公司 A kind of method of fast video examination & verification
CN108320294A (en) * 2018-01-29 2018-07-24 袁非牛 A kind of full-automatic replacement method of portrait background intelligent of China second-generation identity card photo
CN108769458A (en) * 2018-05-08 2018-11-06 东北师范大学 A kind of deep video scene analysis method
CN108777755A (en) * 2018-04-18 2018-11-09 上海电力学院 A kind of switching detection method of video scene
CN104021544B (en) * 2014-05-07 2018-11-23 中国农业大学 A kind of greenhouse vegetable disease monitor video extraction method of key frame, that is, extraction system
CN108984648A (en) * 2018-06-27 2018-12-11 武汉大学深圳研究院 The retrieval of the main eigen and animated video of digital cartoon and altering detecting method
CN109036479A (en) * 2018-08-01 2018-12-18 曹清 Clip point judges system and clip point judgment method
CN109344780A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 A kind of multi-modal video scene dividing method based on sound and vision
CN109783684A (en) * 2019-01-25 2019-05-21 科大讯飞股份有限公司 A kind of emotion identification method of video, device, equipment and readable storage medium storing program for executing
CN109964221A (en) * 2016-11-30 2019-07-02 谷歌有限责任公司 The similitude between video is determined using shot durations correlation
WO2019127504A1 (en) * 2017-12-29 2019-07-04 深圳配天智能技术研究院有限公司 Similarity measurement method and device, and storage device
CN110012350A (en) * 2019-03-25 2019-07-12 联想(北京)有限公司 A kind of method for processing video frequency and device, equipment, storage medium
CN110096945A (en) * 2019-02-28 2019-08-06 中国地质大学(武汉) Indoor Video key frame of video real time extracting method based on machine learning
CN110135428A (en) * 2019-04-11 2019-08-16 北京航空航天大学 Image segmentation processing method and device
CN110188625A (en) * 2019-05-13 2019-08-30 浙江大学 A kind of video fine structure method based on multi-feature fusion
CN110210379A (en) * 2019-05-30 2019-09-06 北京工业大学 A kind of lens boundary detection method of combination critical movements feature and color characteristic
CN110248182A (en) * 2019-05-31 2019-09-17 成都东方盛行电子有限责任公司 A kind of scene segment lens detection method
CN110430443A (en) * 2019-07-11 2019-11-08 平安科技(深圳)有限公司 The method, apparatus and computer equipment of video lens shearing
CN110708606A (en) * 2019-09-29 2020-01-17 新华智云科技有限公司 Method for intelligently editing video
CN110781710A (en) * 2018-12-17 2020-02-11 北京嘀嘀无限科技发展有限公司 Target object clustering method and device
CN111292267A (en) * 2020-02-04 2020-06-16 北京锐影医疗技术有限公司 Image subjective visual effect enhancement method based on Laplacian pyramid
CN111563937A (en) * 2020-07-14 2020-08-21 成都四方伟业软件股份有限公司 Picture color extraction method and device
CN111641869A (en) * 2020-06-04 2020-09-08 虎博网络技术(上海)有限公司 Video split mirror method, video split mirror device, electronic equipment and computer readable storage medium
CN112488107A (en) * 2020-12-04 2021-03-12 北京华录新媒信息技术有限公司 Video subtitle processing method and processing device
CN112579823A (en) * 2020-12-28 2021-03-30 山东师范大学 Video abstract generation method and system based on feature fusion and incremental sliding window
CN113379693A (en) * 2021-06-01 2021-09-10 大连东软教育科技集团有限公司 Capsule endoscopy key focus image detection method based on video abstraction technology
CN113591588A (en) * 2021-07-02 2021-11-02 四川大学 Video content key frame extraction method based on bidirectional space-time slice clustering
CN114241367A (en) * 2021-12-02 2022-03-25 北京智美互联科技有限公司 Visual semantic detection method and system
CN117456204A (en) * 2023-09-25 2024-01-26 珠海视熙科技有限公司 Target tracking method, device, video processing system, storage medium and terminal
US11948359B2 (en) 2021-01-27 2024-04-02 Boe Technology Group Co., Ltd. Video processing method and apparatus, computing device and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915536A (en) * 2012-08-29 2013-02-06 太原理工大学 Domain histogram lens mutation detection calculating method
CN103093458A (en) * 2012-12-31 2013-05-08 清华大学 Detecting method and detecting device for key frame

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915536A (en) * 2012-08-29 2013-02-06 太原理工大学 Domain histogram lens mutation detection calculating method
CN103093458A (en) * 2012-12-31 2013-05-08 清华大学 Detecting method and detecting device for key frame

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Pan Lei et al.: "Video shot segmentation and key frame extraction based on clustering", Infrared and Laser Engineering *
Qu Zhong et al.: "Research on an improved video key frame extraction algorithm", Computer Science *
Qu Zhong et al.: "Research on an improved video key frame extraction algorithm", Computer Science, vol. 39, no. 8, 5 December 2012 (2012-12-05), pages 300-303 *

Cited By (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015131772A1 (en) * 2014-03-04 2015-09-11 Tencent Technology (Shenzhen) Company Limited Method and apparatus for dividing image area
US9852510B2 (en) 2014-03-04 2017-12-26 Tencent Technology (Shenzhen) Company Limited Method and apparatus for dividing image area
CN104021544B (en) * 2014-05-07 2018-11-23 中国农业大学 A kind of greenhouse vegetable disease monitor video extraction method of key frame, that is, extraction system
CN104090951A (en) * 2014-07-04 2014-10-08 李阳 Abnormal data processing method
CN104243769A (en) * 2014-09-12 2014-12-24 刘鹏 Video scene change detection method based on self-adaptation threshold value
CN104410867A (en) * 2014-11-17 2015-03-11 北京京东尚科信息技术有限公司 Improved video shot detection method
CN104469546A (en) * 2014-12-22 2015-03-25 无锡天脉聚源传媒科技有限公司 Video clip processing method and device
CN104469546B (en) * 2014-12-22 2017-09-15 无锡天脉聚源传媒科技有限公司 A kind of method and apparatus for handling video segment
CN104469545A (en) * 2014-12-22 2015-03-25 无锡天脉聚源传媒科技有限公司 Method and device for verifying splitting effect of video clip
CN104469545B (en) * 2014-12-22 2017-09-15 无锡天脉聚源传媒科技有限公司 A kind of method and apparatus for examining video segment cutting effect
CN104539942A (en) * 2014-12-26 2015-04-22 赞奇科技发展有限公司 Video shot switching detection method and device based on frame difference cluster
CN107408264A (en) * 2014-12-30 2017-11-28 电子湾有限公司 Similar Articles detecting
CN106162158A (en) * 2015-04-02 2016-11-23 无锡天脉聚源传媒科技有限公司 A kind of method and device identifying lens shooting mode
CN104994366A (en) * 2015-06-02 2015-10-21 陕西科技大学 FCM video key frame extracting method based on feature weighing
CN106960211B (en) * 2016-01-11 2020-04-14 北京陌上花科技有限公司 Key frame acquisition method and device
CN106960211A (en) * 2016-01-11 2017-07-18 北京陌上花科技有限公司 Key frame acquisition methods and device
CN105915758A (en) * 2016-04-08 2016-08-31 绍兴文理学院元培学院 Video searching method
CN105915758B (en) * 2016-04-08 2019-01-08 绍兴文理学院元培学院 A kind of video retrieval method
CN106131434A (en) * 2016-08-18 2016-11-16 深圳市金立通信设备有限公司 A kind of image pickup method based on multi-camera system and terminal
CN106412619A (en) * 2016-09-28 2017-02-15 江苏亿通高科技股份有限公司 HSV color histogram and DCT perceptual hash based lens boundary detection method
CN109964221A (en) * 2016-11-30 2019-07-02 谷歌有限责任公司 The similitude between video is determined using shot durations correlation
CN109964221B (en) * 2016-11-30 2023-09-12 谷歌有限责任公司 Determining similarity between videos using shot duration correlation
CN106603886B (en) * 2016-12-13 2020-08-18 Tcl科技集团股份有限公司 Video scene distinguishing method and system
CN106603886A (en) * 2016-12-13 2017-04-26 Tcl集团股份有限公司 Video scene distinguishing method and system
CN106898036A (en) * 2017-02-28 2017-06-27 宇龙计算机通信科技(深圳)有限公司 Image processing method and mobile terminal
CN107424163A (en) * 2017-06-09 2017-12-01 广东技术师范学院 A kind of lens boundary detection method based on TextTiling
CN107798304A (en) * 2017-10-20 2018-03-13 央视国际网络无锡有限公司 A kind of method of fast video examination & verification
CN107798304B (en) * 2017-10-20 2021-11-02 央视国际网络无锡有限公司 Method for rapidly auditing video
WO2019127504A1 (en) * 2017-12-29 2019-07-04 深圳配天智能技术研究院有限公司 Similarity measurement method and device, and storage device
CN108320294B (en) * 2018-01-29 2021-11-05 袁非牛 Intelligent full-automatic portrait background replacement method for second-generation identity card photos
CN108320294A (en) * 2018-01-29 2018-07-24 袁非牛 A kind of full-automatic replacement method of portrait background intelligent of China second-generation identity card photo
CN108777755A (en) * 2018-04-18 2018-11-09 上海电力学院 A kind of switching detection method of video scene
CN108769458A (en) * 2018-05-08 2018-11-06 东北师范大学 A kind of deep video scene analysis method
CN108984648A (en) * 2018-06-27 2018-12-11 武汉大学深圳研究院 The retrieval of the main eigen and animated video of digital cartoon and altering detecting method
CN109036479A (en) * 2018-08-01 2018-12-18 曹清 Clip point judges system and clip point judgment method
CN109344780A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 A kind of multi-modal video scene dividing method based on sound and vision
CN110781710A (en) * 2018-12-17 2020-02-11 北京嘀嘀无限科技发展有限公司 Target object clustering method and device
CN109783684A (en) * 2019-01-25 2019-05-21 科大讯飞股份有限公司 A kind of emotion identification method of video, device, equipment and readable storage medium storing program for executing
CN109783684B (en) * 2019-01-25 2021-07-06 科大讯飞股份有限公司 Video emotion recognition method, device and equipment and readable storage medium
CN110096945A (en) * 2019-02-28 2019-08-06 中国地质大学(武汉) Indoor Video key frame of video real time extracting method based on machine learning
CN110096945B (en) * 2019-02-28 2021-05-14 中国地质大学(武汉) Indoor monitoring video key frame real-time extraction method based on machine learning
CN110012350B (en) * 2019-03-25 2021-05-18 联想(北京)有限公司 Video processing method and device, video processing equipment and storage medium
CN110012350A (en) * 2019-03-25 2019-07-12 联想(北京)有限公司 A kind of method for processing video frequency and device, equipment, storage medium
CN110135428A (en) * 2019-04-11 2019-08-16 北京航空航天大学 Image segmentation processing method and device
CN110135428B (en) * 2019-04-11 2021-06-04 北京航空航天大学 Image segmentation processing method and device
CN110188625A (en) * 2019-05-13 2019-08-30 浙江大学 A kind of video fine structure method based on multi-feature fusion
CN110188625B (en) * 2019-05-13 2021-07-02 浙江大学 Video fine structuring method based on multi-feature fusion
CN110210379A (en) * 2019-05-30 2019-09-06 北京工业大学 A kind of lens boundary detection method of combination critical movements feature and color characteristic
CN110248182A (en) * 2019-05-31 2019-09-17 成都东方盛行电子有限责任公司 A kind of scene segment lens detection method
CN110430443B (en) * 2019-07-11 2022-01-25 平安科技(深圳)有限公司 Method and device for cutting video shot, computer equipment and storage medium
CN110430443A (en) * 2019-07-11 2019-11-08 平安科技(深圳)有限公司 The method, apparatus and computer equipment of video lens shearing
CN110708606A (en) * 2019-09-29 2020-01-17 新华智云科技有限公司 Method for intelligently editing video
CN111292267A (en) * 2020-02-04 2020-06-16 北京锐影医疗技术有限公司 Image subjective visual effect enhancement method based on Laplacian pyramid
CN111641869A (en) * 2020-06-04 2020-09-08 虎博网络技术(上海)有限公司 Video split mirror method, video split mirror device, electronic equipment and computer readable storage medium
CN111641869B (en) * 2020-06-04 2022-01-04 虎博网络技术(上海)有限公司 Video split mirror method, video split mirror device, electronic equipment and computer readable storage medium
CN111563937A (en) * 2020-07-14 2020-08-21 成都四方伟业软件股份有限公司 Picture color extraction method and device
CN112488107A (en) * 2020-12-04 2021-03-12 北京华录新媒信息技术有限公司 Video subtitle processing method and processing device
CN112579823B (en) * 2020-12-28 2022-06-24 山东师范大学 Video abstract generation method and system based on feature fusion and incremental sliding window
CN112579823A (en) * 2020-12-28 2021-03-30 山东师范大学 Video abstract generation method and system based on feature fusion and incremental sliding window
US11948359B2 (en) 2021-01-27 2024-04-02 Boe Technology Group Co., Ltd. Video processing method and apparatus, computing device and medium
CN113379693A (en) * 2021-06-01 2021-09-10 大连东软教育科技集团有限公司 Capsule endoscopy key focus image detection method based on video abstraction technology
CN113379693B (en) * 2021-06-01 2024-02-06 东软教育科技集团有限公司 Capsule endoscope key focus image detection method based on video abstraction technology
CN113591588A (en) * 2021-07-02 2021-11-02 四川大学 Video content key frame extraction method based on bidirectional space-time slice clustering
CN114241367A (en) * 2021-12-02 2022-03-25 北京智美互联科技有限公司 Visual semantic detection method and system
CN117456204A (en) * 2023-09-25 2024-01-26 珠海视熙科技有限公司 Target tracking method, device, video processing system, storage medium and terminal

Also Published As

Publication number Publication date
CN103426176B (en) 2017-03-01

Similar Documents

Publication Publication Date Title
CN103426176A (en) Video shot detection method based on histogram improvement and clustering algorithm
CN107944359B (en) Flame detecting method based on video
CN104598924A (en) Target matching detection method
CN108921130A (en) Video key frame extracting method based on salient region
NO329897B1 (en) Procedure for faster face detection
Prema et al. Survey on skin tone detection using color spaces
CN109271932A (en) Pedestrian based on color-match recognition methods again
Wang et al. Real-time smoke detection using texture and color features
CN114973112B (en) Scale self-adaptive dense crowd counting method based on countermeasure learning network
US9286690B2 (en) Method and apparatus for moving object detection using fisher&#39;s linear discriminant based radial basis function network
CN107045630B (en) RGBD-based pedestrian detection and identity recognition method and system
Ghazali et al. Pedestrian detection in infrared outdoor images based on atmospheric situation estimation
Ouyang et al. The comparison and analysis of extracting video key frame
CN107341456B (en) Weather sunny and cloudy classification method based on single outdoor color image
Bales et al. Bigbackground-based illumination compensation for surveillance video
CN110765982A (en) Video smoke detection method based on change accumulation graph and cascaded depth network
CN102163279B (en) Color human face identification method based on nearest feature classifier
CN115393788A (en) Multi-scale monitoring pedestrian re-identification method based on global information attention enhancement
Zhang et al. Shot boundary detection based on HSV color model
Gao et al. Spatio-temporal salience based video quality assessment
Sabeti et al. High-speed skin color segmentation for real-time human tracking
Yilmaz et al. Shot detection using principal coordinate system
Ghomsheh et al. A new skin detection approach for adult image identification
Vijaylaxmi et al. Fire detection using YCbCr color model
CN105023001A (en) Selective region-based multi-pedestrian detection method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant