WO2006076760A1 - Sequential data segmentation - Google Patents

Sequential data segmentation Download PDF

Info

Publication number
WO2006076760A1
WO2006076760A1 PCT/AU2006/000012 AU2006000012W WO2006076760A1 WO 2006076760 A1 WO2006076760 A1 WO 2006076760A1 AU 2006000012 W AU2006000012 W AU 2006000012W WO 2006076760 A1 WO2006076760 A1 WO 2006076760A1
Authority
WO
WIPO (PCT)
Prior art keywords
segmentation
samples
spectral clustering
clustering according
sequential data
Prior art date
Application number
PCT/AU2006/000012
Other languages
French (fr)
Inventor
Zhenghua Yu
Swaminathan Venkata Narayana Vishwanathan
Alex Smola
Original Assignee
National Ict Australia Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2005900278A external-priority patent/AU2005900278A0/en
Application filed by National Ict Australia Limited filed Critical National Ict Australia Limited
Priority to AU2006207811A priority Critical patent/AU2006207811A1/en
Publication of WO2006076760A1 publication Critical patent/WO2006076760A1/en

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2323Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing

Definitions

  • the invention concerns sequential data segmentation using spectral clustering.
  • aspects of the invention concern a method, system and computer software able to perform sequential data segmentation.
  • Sequential data is perhaps the most frequently encountered type of data existing in nature. Examples include speech, video, climatological or industrial processes, financial markets and biometrics engineering. For instance, EEG, MEG, PCG and ECG, also DNA sequences, context recognition, computer graphics and video sequences, to name just a few. Segmenting sequential data into homogeneous sections is an essential initial step in processing sequential data.
  • a shot is the most basic semantic structure of video and is a sequence of frames bounded by transitions to adjacent sequences of frames. Examples of transitions are cuts, dissolves, wipes or flashes.
  • indexing and analysis of the segments can be performed. For example, security video can be isolated into segments interesting to surveillance staff and made accessible for further analysis.
  • Video shot segmentation can be performed based on a graph of the video and shot boundaries identified from the graph.
  • the weight of each edge, A(i, j) represents the similarity between frames i and j.
  • Spectral clustering refers to a group of clustering methods that solve the normalised cut problem by calculating the first few leading eigenvectors of a graph Laplacian derived from an affinity matrix and perform clustering based on the eigenvectors. Each element of the affinity matrix denotes the similarity between frames.
  • a spectral clustering method has been proposed in A. Y. Ng, M. Jordan, and
  • D the diagonal matrix
  • D n ⁇ j A tJ
  • L D ⁇ m AD ⁇ m .
  • the invention provides sequential data segmentation using spectral clustering.
  • Data samples are extracted to construct affinity matrices for sequences of samples in sequential order.
  • K-segmentation is then applied to a representation of the sequences of samples derived from the affinity matrices to identify K-segments, and each segment is comprised of representations of samples in sequential order.
  • Spectral clustering methods such as those that integrate K-means view all data samples as independent and assign those samples to clusters individually.
  • K-means is not optimal for sequential data segmentation as data samples from the same segment may be assigned to different clusters.
  • the invention respects the sequential semantics of the data and ensures that adjacent samples are consistently assigned to the same cluster.
  • Elements in a row or a column of the affinity matrices represent data samples in sequential order.
  • a graph Laplacian may be derived from the affinity matrices and then the leading eigenvectors of the Laplacian are solved. Further, the graph Laplacian may be normalised before solving for the eigenvectors.
  • K-segmentation may be applied to the normalised eigenvectors representing the sequences of data samples in sequential order, such as the eigenvectors of the Laplacian of the affinity matrices.
  • K-segmentation may comprise identifying homogenous segments in the representation of the sequences of data samples. An estimation of the number of segments K is used to initially identify K homogenous segments.
  • K-segmentation may comprise identifying significant transitions in the representation of sequences of data points.
  • Spectral clustering may involve solving for the largest eigenvectors of the normalised graph Laplacian and performing clustering based on the eigenvectors.
  • An alternative is to use eigenvectors of the graph Laplacian directly.
  • the largest eigenvectors are stacked into columns to form matrix X such that the renormalisation of the rows of X gives matrix Y.
  • Section S is represented in the space spanned by the rows of Y i.e. by n samples ⁇ y x , " .,y n ⁇ e R k .
  • the extracted attributes of the data samples includes color attributes, edge attributes, edge energy distance values between sequential samples and the temporal adjacency of samples.
  • the attribute of random impulse noise may also be included in the affinity matrix.
  • the edge attributes may incorporate one or more local, semi-global or global edge histograms. For instance, an image of a single frame is divided into sub-images and edges of the sub-images are categorised to create multiple local edge histograms. From the local edge histogram one global edge histogram is calculated and multiple semi-global histograms are calculated.
  • edge detection may utilize Sobel edge detector templates and the sub-samples that overlap.
  • the extracted attributes of the sample data may also be based on the temporal adjacency of the samples. Utilizing both local and global information increases the accuracy of the segmentation when compared to most existing systems. Before K-segmentation is applied, pre-processing may be applied to identify potential boundaries and to reduce the computational complexity of the K- segmentation.
  • K-segmentation may be applied based on an estimation of the number of segments.
  • the number of segments is determined by rejecting segments that do not meet a predetermined threshold, such as rejecting a segment if the normalized conductance of any of its boundaries is less than a predetermined threshold.
  • the boundary of a segment may be fine tuned by considering the sample at the identified boundary of a segment and a predetermined number of adjacent samples and selecting the most suitable sample.
  • An incorrect boundary of a segment may be detected and rejected by extracting key features of the boundary and assessing those features using Support Vector machines classification.
  • the key features may be extracted from the transition of the cut values over time.
  • the invention provides a system for the segmentation of sequential data using spectral clustering.
  • the invention provides computer software able to perform the method described above.
  • Fig. 1 is flow chart of the method of performing shot segmentation;
  • Fig. 2 schematically shows the dividing of images into sub-images and blocks;
  • Fig. 3 schematically shows the dividing of sub-images into blocks that are also shifted by half a block size in comparison to the known MPEG-7 method;
  • Fig. 4 schematically shows the finite state machine used to determine the number of segments;
  • Fig. 5 is a schematic diagram of the segmentation of the time series S into k segments
  • Fig. 6 is a graph plotting the cut value over time used to estimate the number of segments; and Fig. 7 is a further graph plotting the cut value over time used to reject incorrectly detected shot boundaries.
  • a long video is cut 10 into fixed duration sections with overlaps between adjacent sections.
  • a fixed duration section may be 400 frames with an additional 100 frames for overlaps.
  • a fixed duration section may be 240 frames with 80 fames for overlaps.
  • the fixed duration sections are the basic computing elements upon which shot segmentation is performed on.
  • the affinity matrices of all the frames in the fixed duration section are constructed based on (a) color histogram, (b) edge histogram, (c) edge energy distances and (d) temporal adjacency distances between the frames. Then the affinity matrices are adjusted for (e) random impulse noise.
  • a color histogram is calculated in HSV color space. H is quantized into 16 bins, S and V are quantized into 4 bins each, so in total there are 256 bins (see Manjunath B S, Salembier P, Sikora T (ed), Introduction to MPEG-7, Multimedia Content Description Interface, John Wiley & Sons, Ltd, 2002, "Manjunath et al”).
  • the distance metric used is chi-squared distance.
  • an edge histogram is constructed with 80 bins (see Manjunath et al). Images are divided into 4 x 4 sub-images 50 as shown in Fig. 2. Each sub-image is divided into a fixed number of blocks 52. Then each image block is partitioned into 2x2 block of pixels 54. Edges in the sub-images are categorized into five types: vertical, horizontal, 45° diagonal, 135° diagonal and non-directional edges where each 2x2 block is considered a pixel. The following simple templates are used in edge detection:
  • each image is represented by 80 local edge histogram bins.
  • a global edge histogram and 65 semi-global edge histograms are computed from the 80 local histogram bins.
  • the global edge histogram the five types of edge distributions for all sub-images are accumulated.
  • the semi-global edge histograms subsets of sub-images are grouped.
  • Ll norm of the distance of local, semi-global and global histograms between two frames is adopted as the distance function.
  • the distance of the global histogram difference is multiplied by five given the number of bins of the global histogram is much smaller than that of local and semi-global histograms.
  • Other norms or distances may also be used based on domain knowledge about the problem on hand.
  • the edge detectors are based on simple 2x2 templates which do not characterize edges well.
  • the blocks are not overlapped. As a result, small movement of camera or objects may lead to large variation of edge values, which is not desirable.
  • edge detection templates are replaced with Sobel edge detector which is more accurate at detecting edges.
  • the new templates are:
  • the local edge histogram has 64 bins instead of 80.
  • Fig. 3 shows the blocks are divided according to the MPEG-7 method (16 non-overlapping blocks in total) and Fig. 3(b) shows the proposed method of partitioning blocks (16 non- overlapping blocks and 9 blocks shifted in both the horizontal and vertical directions; 25 blocks in total; all the blocks are of equal size).
  • Fig. 3(a) schematically shows the blocks are divided according to the MPEG-7 method (16 non-overlapping blocks in total) and Fig. 3(b) shows the proposed method of partitioning blocks (16 non- overlapping blocks and 9 blocks shifted in both the horizontal and vertical directions; 25 blocks in total; all the blocks are of equal size).
  • edge energy statistics to assist the detection of gradual transitions.
  • the outputs after applying edge detection can be denoted as e_h, e_v, e_45, e_135 respectively.
  • the edge value of the 2*2 block of pixels is calculated as:
  • Edge_value max(e_h, e_v, e_45, e_135).
  • the square root edge value is calculated as:
  • Edge_sqrt_value sqrt(e_h 2 +e_v 2 +e_45 2 +e_l 35 2 ).
  • mean_edge_value sum(edge_value[i])/number_of_edges
  • mean_edge_sqrt_value sum(edge_sqrt_value[i])/number_of_edges
  • edge_energy_distance (EED) between these two frames.
  • EED[i,j] abs(min(thres, std_edge_sqrt_yalue[i]) - min(thres, std_edge_sqrt_value[j])).
  • a threshold used to saturate std_edge_sqrt_value as large values tend to be noisy and useless (gradual transitions will mainly lead to small edge values).
  • Temporal adjacency is then integrated into the final affinity calculation so that the calculated affinity between frames incorporates both color and edge histograms, edge energy distance, and temporal adjacency.
  • a n 0
  • d t (i,j) is the difference in frame numbers between frames i and j
  • a y exp 2CTEED represents the affinity due to edge energy distance d EED (i, j) , and
  • fades are detected and excluded from further calculations 14. While fades can be detected using our generic shot segmentation method, since we have detected edges already it is more straightforward to detect fades using the edge information. Simple fades are just black frames; fancier ones may use blurring of non-black frames. In both cases they can be characterized by low edge values.
  • the fade detector has two modules:
  • each image is previously divided into 16 sub-images to calculate the edge values. For each sub-image, the following conditions are tested to determine if the standard deviation of its edge value ⁇ max(thresl, mean_edge_value_of _the_subimage*constl), and if the maximum edge value ⁇ max(thres2, mean_edge_value_of _the__subimage !t! const2). If all sub-images tested meet these conditions, then a fade frame is found.
  • X - [x j X 2 ...x t ] e R mk is then formed by stacking the eigenvectors in columns.
  • the matrix Y from X is formed by renormalizing each of Ts rows to have unit length (i.e.
  • the number of segments is estimated 18.
  • N_est in each section (e.g., 240 frames). N_est+6 is then used as the initial number of segments (in stead of 6).
  • T is a threshold that limits the number of frames we check the affinity of. Transitions from large bw/fw values to small values indicate the existence of cut transition, and vice versa.
  • Peak state can be entered from any state if bw[i]>thres2 and fw[i-l]>thres2, and bottom state can be entered from any state if bw[i] ⁇ thresl and fw[i-l] ⁇ thresl, where i is the current frame number, and thresl and thres2 are thresholds (0.25 and 1.75 used in experiments).
  • the estimated number of segments N_est equals to the average of the number of state transitions from peak to bottom and from bottom to peak.
  • pre-processing is applied 20 for dynamic programming.
  • Directly applying dynamic programming in spectral segmentation without pre-processing has two drawbacks:
  • Dynamic programming may not identify exactly the shot boundaries with the least cross segment similarities
  • Pre-processing may, for instance, consist of edge finding.
  • y(i) be a point in R k whose coordinates are taken from row i of the Y matrix, i.e., XO which represents data sample i.
  • Edges may be detected through finding zero crossing of % ' 1 ⁇ . However there exist lots of zero value points in % ' ⁇ as the data may be noisy. Therefore edges are detected as the mean of a local maximum/minimum pair in order to detect strong edges only. We also only detect one candidate edge point in a neighbourhood of T samples where T is a variable which may be adjusted depending on the application (for example
  • SV is a i oca l maximum (+- 772 samples) and S'V) > edge_thresl2 where edgejhres is a threshold, then try to find local minimum % ( m ) between 1+1 and 1+T. If S"( m ) ⁇ ⁇ ed 8 e _ thres ' 2 , then an edge is located at (l+m)/2. Edge points detected in this way constitute candidate points for dynamic programming.
  • ss(a, b) is defined as a segment of the time series S, ie, the consecutive samples ⁇ s a ,...,s b ⁇ where a ⁇ b ,
  • the k segments are schematically shown in Fig. 5.
  • Spectral K-segmentation of the sequences of frames is performed in the space spanned by the first few eigenvectors of the constructed normalised graph Laplacian.
  • the same time series can be represented in the space spanned by rows of matrix Y by n samples ( ⁇ 1 ,..., y,, ⁇ e R k .
  • a cost function is defined to illustrate the internal homogeneity of all segments
  • cost — ]T ]
  • K-segmentation problem can be solved optimally using dynamic programming.
  • shot boundary c is rejected if ⁇ * (C,) is less than a threshold.
  • the detected shot boundary point is fine-tuned 26 in a 3 -frame window
  • SVM Support Vector Machines
  • the key step is to represent the curve/time series using features of extracted feature points for SVM.
  • Cut value is the sum of affinities across the current frame, which is a terminology inherited from graph cut methods. We plot the cut value curve over time.
  • Fig. 6 shows a typical cut value curve of gradual transition. Frames in the middle of the transition 70 may have higher cut values, and there are two bottoms 72 in the transition. Spectral clustering may pick up one or two of the two bottoms 72 shown in the curve, depending on the transition.
  • the strategy is to identify key points out of a potential gradual transition, extract features of the key points and use these features for SVM classification.
  • there is no need to detect key points. Instead we just use frames near the transition for SVM classification. Therefore we have defined two classifiers, one for short transition ( ⁇ 5 frames), and one for long transition (>5 frames).
  • ThreslO and thresll typically are 2.5e-5 and 1, respectively. This condition typically represents sharp cuts;
  • Sub-step 2 examine all the probabilities calculated. Starting from the highest probability one, merge the corresponding DTR/CTR if the probability is higher than a threshold. Repeat the process until all probabilities have been examined.
  • the probability to merge DTR/CTR is calculated via SVM.
  • SVM features used in the SVM classifier include features about the seven feature points A 5 B, C, D, E, F, and G representing the first entry, first bottom, first exit, middle peak, second entry, second bottom, and second exit. Detailed features are:
  • Pre_conti_peak counts the length of continuous peak frames (greater than a threshold) starting from the beginning of the current transition.
  • Mean_eh_pre and std_ehjpre are the mean and the standard deviation of edge values between the current and the previous transitions.
  • Meanjpp_pre and std_pp_pre are the mean and the standard deviation of cut values between the current and the previous transitions.
  • Post_conti_peak counts the length of continuous peak frames (greater than a threshold) starting from the end of the current transition.
  • Mean_eh_post and std_eh_post are the mean and the standard deviation of edge values between the current and the next transitions.
  • Mean_pp_post and std_pp__post are the mean and the standard deviation of cut values between the current and the next transitions.
  • the features used include those similar to short transition features, together with additional features.
  • the local maximum and local minimum points in cut value curve are detected.
  • pre and post represent the frame numbers of the beginning and the end of the transition, up to seven feature points are selected according to the rules below:
  • Two additional peak/bottom frames may be selected if they are not selected among the five frames mentioned above.
  • type of the feature frames e.g, peak, bottom, others
  • relative distance between the feature frames e.g.
  • This method of identifying segments is sequential data can be performed by a computer that is able to accept data input of sequential data and to store the sequential data.
  • the computer then performs on a processor the steps outlined in Fig. 1 to produce an output of the identified K-segments.
  • the computer may accept digital video frames and displays the identified shot segments on a display means, such as a monitor.
  • the computer may also create or update an index document that indexes and is used to navigate to the shot segments.
  • the parameters of the algorithm were derived using three TV sequences, each with about 10000 frames: two news segments from TV channels CCTV-4 and CCTV- 9, and one sport (soccer) segment from TV channel Oriental TV.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Discrete Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention concerns sequential data segmentation using spectral clustering. Attributes of data samples are extracted to construct affinity matrices for sequences of samples in sequential order. K-segmentation is then applied to a representation of the sequences of samples derived from the affinity matrices to identify K-segments. Each K-segment is comprised of representations of samples in sequential order. In this way data samples are not viewed as independent and this prevents data samples from the same true segment being assigned to different clusters. The invention respects the sequential semantics of the data and ensures that adjacent samples are consistently assigned to the same cluster. An example of sequential data segmentation is the segmentation of sequential video frames into separate shots. The invention also concerns a computer system and computer software able to perform sequential data segmentation.

Description

Title
SEQUENTIAL DATA SEGMENTATION
Technical Field The invention concerns sequential data segmentation using spectral clustering.
Aspects of the invention concern a method, system and computer software able to perform sequential data segmentation.
Background Art Sequential data, is perhaps the most frequently encountered type of data existing in nature. Examples include speech, video, climatological or industrial processes, financial markets and biometrics engineering. For instance, EEG, MEG, PCG and ECG, also DNA sequences, context recognition, computer graphics and video sequences, to name just a few. Segmenting sequential data into homogeneous sections is an essential initial step in processing sequential data.
Taking as an example video content management systems that are used help to index, search and retrieve video. A shot is the most basic semantic structure of video and is a sequence of frames bounded by transitions to adjacent sequences of frames. Examples of transitions are cuts, dissolves, wipes or flashes. After video shots are segmented, indexing and analysis of the segments can be performed. For example, security video can be isolated into segments interesting to surveillance staff and made accessible for further analysis.
Known automated video shot segmentation systems are based on heuristics or specialized methods to detect each of the different transition types in a video. Similarly most of the techniques for automated sequential data segmentation rely on specialized domain knowledge about the problem.
Video shot segmentation can be performed based on a graph of the video and shot boundaries identified from the graph. A weighted undirected graph G = (S, E) can be constructed where the vertices of the graph S = [S1,..., sn} represent the n video frames in the video sequence, and an edge is formed between every pair of vertices. The weight of each edge, A(i, j) , represents the similarity between frames i and j.
Finding shot boundaries is equivalent to finding cuts in graph G while respecting the temporal semantics of the video sequence. Spectral clustering refers to a group of clustering methods that solve the normalised cut problem by calculating the first few leading eigenvectors of a graph Laplacian derived from an affinity matrix and perform clustering based on the eigenvectors. Each element of the affinity matrix denotes the similarity between frames. A spectral clustering method has been proposed in A. Y. Ng, M. Jordan, and
Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm," in NIPS 14, 2002 ("Ng et al"). The algorithm is summarized here. Given a set of points S = {sx,...,sn} in R' that we want to cluster into k subsets the following is performed:
1. Form the affinity matrix A e R"x" defined by An = exp 2ff2 if i ≠ j, and An = 0. d(i, j) denotes the Euclidean distance between S1 and S1 .
2. Define D to be the diagonal matrix where Dn = ∑jAtJ , and construct the normalised graph Laplacian given by L = D~mAD~m . 3. Find xvx2,...,xk , the k largest eigenvectors of L, and form the matrix
X - [xjX2...xt] € R"xk by stacking the eigenvectors in columns.
4. Form the matrix Y from X by renormalizing each of Xs rows to have unit length (i.e. Y9 = X8 Z(ZjX,2?12 ).
5. Treating each row of Y as a point hxRk , cluster them into k clusters via K- means.
6. Finally, assign the original point S1 to cluster j if and only if row i of the matrix 7 was assigned to clustery.
Summary of the Invention The invention provides sequential data segmentation using spectral clustering.
Data samples are extracted to construct affinity matrices for sequences of samples in sequential order. K-segmentation is then applied to a representation of the sequences of samples derived from the affinity matrices to identify K-segments, and each segment is comprised of representations of samples in sequential order. Spectral clustering methods such as those that integrate K-means view all data samples as independent and assign those samples to clusters individually. However, in the case of sequential data all samples between two segment boundaries are not independent as they are part of an ordered series. K-means is not optimal for sequential data segmentation as data samples from the same segment may be assigned to different clusters. To overcome this limitation we impose sequential ordering on samples when constructing the affinity matrices. The invention respects the sequential semantics of the data and ensures that adjacent samples are consistently assigned to the same cluster.
Elements in a row or a column of the affinity matrices represent data samples in sequential order. A graph Laplacian may be derived from the affinity matrices and then the leading eigenvectors of the Laplacian are solved. Further, the graph Laplacian may be normalised before solving for the eigenvectors.
K-segmentation may be applied to the normalised eigenvectors representing the sequences of data samples in sequential order, such as the eigenvectors of the Laplacian of the affinity matrices.
K-segmentation may comprise identifying homogenous segments in the representation of the sequences of data samples. An estimation of the number of segments K is used to initially identify K homogenous segments.
Alternatively, K-segmentation may comprise identifying significant transitions in the representation of sequences of data points.
Spectral clustering may involve solving for the largest eigenvectors of the normalised graph Laplacian and performing clustering based on the eigenvectors. An alternative is to use eigenvectors of the graph Laplacian directly. The largest eigenvectors are stacked into columns to form matrix X such that the renormalisation of the rows of X gives matrix Y. Section S is represented in the space spanned by the rows of Y i.e. by n samples {yx,".,yn} e Rk .
K-segmentation is then performed on section S to ensure that each sample of a segment is consistently assigned to the same cluster. For instance, this can be done by using dynamic programming. K-segmentation can be conducted on Y to find a sequence ysϊys2...ysk of k segments such that yslys2...ysk = {yχ,—,yn} and each yst is not empty. The k segments are found by considering the homogeneity of the segments. Attributes of the sample data may be extracted to construct the affinity matrix.
In the case of video shot segmentation, the extracted attributes of the data samples includes color attributes, edge attributes, edge energy distance values between sequential samples and the temporal adjacency of samples. The attribute of random impulse noise may also be included in the affinity matrix.
The edge attributes may incorporate one or more local, semi-global or global edge histograms. For instance, an image of a single frame is divided into sub-images and edges of the sub-images are categorised to create multiple local edge histograms. From the local edge histogram one global edge histogram is calculated and multiple semi-global histograms are calculated. Alternatively, edge detection may utilize Sobel edge detector templates and the sub-samples that overlap.
The extracted attributes of the sample data may also be based on the temporal adjacency of the samples. Utilising both local and global information increases the accuracy of the segmentation when compared to most existing systems. Before K-segmentation is applied, pre-processing may be applied to identify potential boundaries and to reduce the computational complexity of the K- segmentation.
K-segmentation may be applied based on an estimation of the number of segments.
Following K-segmentation, the number of segments is determined by rejecting segments that do not meet a predetermined threshold, such as rejecting a segment if the normalized conductance of any of its boundaries is less than a predetermined threshold.
This rejects the weaker segments that may result in the false identification of a segment.
The boundary of a segment may be fine tuned by considering the sample at the identified boundary of a segment and a predetermined number of adjacent samples and selecting the most suitable sample.
An incorrect boundary of a segment may be detected and rejected by extracting key features of the boundary and assessing those features using Support Vector machines classification. The key features may be extracted from the transition of the cut values over time.
In a further aspect the invention provides a system for the segmentation of sequential data using spectral clustering. In yet a further aspect the invention provides computer software able to perform the method described above.
Brief Description of the Drawings
An example of the invention will now be described with reference to the accompanying drawings, in which:
Fig. 1 is flow chart of the method of performing shot segmentation; Fig. 2 schematically shows the dividing of images into sub-images and blocks; Fig. 3 schematically shows the dividing of sub-images into blocks that are also shifted by half a block size in comparison to the known MPEG-7 method; Fig. 4 schematically shows the finite state machine used to determine the number of segments;
Fig. 5 is a schematic diagram of the segmentation of the time series S into k segments;
Fig. 6 is a graph plotting the cut value over time used to estimate the number of segments; and Fig. 7 is a further graph plotting the cut value over time used to reject incorrectly detected shot boundaries.
Best Modes of the Invention A method to automatically identify shots in a video sequence will now be described with reference to the flow diagram of Fig. 1.
Initially, a long video is cut 10 into fixed duration sections with overlaps between adjacent sections. For example, for a sports video a fixed duration section may be 400 frames with an additional 100 frames for overlaps. For news video a fixed duration section may be 240 frames with 80 fames for overlaps. The fixed duration sections are the basic computing elements upon which shot segmentation is performed on.
Next, frame data are extracted to construct 12 affinity matrices for sequences of frames in time series order. Given N frames in a fixed duration section, affinity matrix A e Rmn is defined by AtJ representing the similarity between frames i and j if i ≠ j , and An = 0. Subscripts / andj are both in the natural time series order of frames. For example, row k of affinity matrix A {Ajlc,i = l...N} represents the affinities of a sequence of frames in ascending order of time {1... N) to frame k. The affinity matrices of all the frames in the fixed duration section are constructed based on (a) color histogram, (b) edge histogram, (c) edge energy distances and (d) temporal adjacency distances between the frames. Then the affinity matrices are adjusted for (e) random impulse noise.
(a) A color histogram is calculated in HSV color space. H is quantized into 16 bins, S and V are quantized into 4 bins each, so in total there are 256 bins (see Manjunath B S, Salembier P, Sikora T (ed), Introduction to MPEG-7, Multimedia Content Description Interface, John Wiley & Sons, Ltd, 2002, "Manjunath et al"). The distance metric used is chi-squared distance.
(b) Following MPEG-7, an edge histogram is constructed with 80 bins (see Manjunath et al). Images are divided into 4 x 4 sub-images 50 as shown in Fig. 2. Each sub-image is divided into a fixed number of blocks 52. Then each image block is partitioned into 2x2 block of pixels 54. Edges in the sub-images are categorized into five types: vertical, horizontal, 45° diagonal, 135° diagonal and non-directional edges where each 2x2 block is considered a pixel. The following simple templates are used in edge detection:
Figure imgf000007_0001
Therefore each image is represented by 80 local edge histogram bins.
A global edge histogram and 65 semi-global edge histograms are computed from the 80 local histogram bins. For the global edge histogram, the five types of edge distributions for all sub-images are accumulated. For the semi-global edge histograms, subsets of sub-images are grouped.
Ll norm of the distance of local, semi-global and global histograms between two frames is adopted as the distance function. The distance of the global histogram difference is multiplied by five given the number of bins of the global histogram is much smaller than that of local and semi-global histograms. Other norms or distances may also be used based on domain knowledge about the problem on hand.
However, the following shortcomings are identified in the original MPEG-7 edge histogram descriptor. Firstly, the edge detectors are based on simple 2x2 templates which do not characterize edges well. Secondly, when partitioning sub- images into blocks 52, the blocks are not overlapped. As a result, small movement of camera or objects may lead to large variation of edge values, which is not desirable.
To address these shortcomings two improvements have been made. Firstly, the edge detection templates are replaced with Sobel edge detector which is more accurate at detecting edges. The new templates are:
1 0 - 1 1 2 1 2 1 0 0 1 2
2 2 0 0 - - 22 0 0 0 0 0 0 1 0 - 1 -1 0 1
1 0 - 1 -1 - 2 - 1 0 - 1 -2 -2 -1 0
Since we use only four directions, the local edge histogram has 64 bins instead of 80.
Secondly, when partitioning subimages into blocks, we use not only non- overlapping blocks (the same as MPEG-7), but also blocks whose origins are shifted by Vi a block size in both horizontal and vertical directions compared to the non- overlapping blocks. This is shown in Fig. 3 where Fig. 3(a) schematically shows the blocks are divided according to the MPEG-7 method (16 non-overlapping blocks in total) and Fig. 3(b) shows the proposed method of partitioning blocks (16 non- overlapping blocks and 9 blocks shifted in both the horizontal and vertical directions; 25 blocks in total; all the blocks are of equal size). (c) The following is a description of the edge energy distance calculation.
Dissolve and fade are two major types of gradual transitions. In dissolve, video frames from two shots are gradually mixed together by (usually) linear addition
£=a*t+b*(l-t), where a and b are two shots, t is the mixing ratio which is a variable of time, and f is the output. As every frame during a dissolve transition is the mixture of two frames, they usually contain weaker edges than frames before and after the dissolve. Therefore statistics on edge energy provides valuable clue to detect dissolves. The method described here is applicable to fades as well, however we use a more specialized fade detector to be described later.
Without losing generality, we incorporate edge energy statistics to assist the detection of gradual transitions.
When performing edge detection (see description of (b) edge histogram above), the outputs after applying edge detection can be denoted as e_h, e_v, e_45, e_135 respectively. The edge value of the 2*2 block of pixels is calculated as:
Edge_value=max(e_h, e_v, e_45, e_135). The square root edge value is calculated as:
Edge_sqrt_value=sqrt(e_h2+e_v2+e_452+e_l 352). The following four statistics about the edge values are calculated for each video frame: mean_edge_value=sum(edge_value[i])/number_of_edges mean_edge_sqrt_value=sum(edge_sqrt_value[i])/number_of_edges std__edge_value=sum((edge_value[i]-mean_edge_value) )/number_of_edges std_edge_sqrt_value=sum((edge_sqrt_value [i] - mean_edge_sqrt_value)2)/number_of_edges
Although the four of these statistics all characterize the "edgeness" of the video frame and can be used to calculate the edge energy distance between two frames, empirically we have found std_edge_sqrt_value provides the best performance.
Given two video frames i and j, and the respective std_edge_sqrt_values, the edge_energy_distance (EED) between these two frames is defined as:
EED[i,j]=abs(min(thres, std_edge_sqrt_yalue[i]) - min(thres, std_edge_sqrt_value[j])). There is a threshold used to saturate std_edge_sqrt_value as large values tend to be noisy and useless (gradual transitions will mainly lead to small edge values).
(d) Temporal adjacency is then integrated into the final affinity calculation so that the calculated affinity between frames incorporates both color and edge histograms, edge energy distance, and temporal adjacency.
Finally, the affinity between frames is calculated. An = A* A\} if i ≠ j , and
JlShJl
An = 0 where A\} = exp '2 represents the affinity due to temporal adjacency, dt(i,j) is the difference in frame numbers between frames i and j, and <W('J) Ay = exp 2CTEED represents the affinity due to edge energy distance dEED (i, j) , and
A* =
Figure imgf000010_0001
represents the affinity due to color and edge histograms as described in (a) and (b). For example, σk = 56469 and σ^=5 were derived through experimentation. The value of σt determines the influence of frames far away from the current one.
(e) Here we consider that the affinity matrix is being constructed in the presence of random impulse noise. Unfortunately video signals are usually corrupted with noise. Here we only consider noise that affects the global characteristics of a frame or few frames. Examples of such noise include flashes, blur (due to auto- focusing), and sudden movement of irrelevant foreground objects close to the camera. We generalize the noise as random impulse noise (in temporal domain).
If we view frames before and after the noise, these frames usually are similar. In general, we assume that frames close by in time should be more similar than frames that are further apart. When translated into the calculation of the affinity matrix, we can assume that the affinity between frames i and i+t should be greater than both the affinity between frames i and i+t+k where k>0, and the affinity between frames i-1 and i+t where l>0. Consequently we apply the following adjustment when constructing the affinity matrix:
Given frames i and i+t, for any pair of k and 1 where T>k>0 and T>l>0, if
{l+kf (l+k)2 A11+1 < At_hι÷t+k exp '2 , then make A1 M = 4_/i(+f+fc exp '2 . Here T is a threshold which limits the duration to check the affinities. T should be determined according to the characteristics of the video signal. In our experiments we used T=5 which was sufficient to handle most noise. This generic solution does not depend on the exact cause of noises. Next, fades are detected and excluded from further calculations 14. While fades can be detected using our generic shot segmentation method, since we have detected edges already it is more straightforward to detect fades using the edge information. Simple fades are just black frames; fancier ones may use blurring of non-black frames. In both cases they can be characterized by low edge values. The fade detector has two modules:
• For the first module (fade detector 1) if the percentage of detected edges among all pixels in the current frame < threshold 1 and the maximum edge value < threshold 2, then a fade frame is detected. • For the second module (fade detector 2), as described above each image is previously divided into 16 sub-images to calculate the edge values. For each sub-image, the following conditions are tested to determine if the standard deviation of its edge value < max(thresl, mean_edge_value_of _the_subimage*constl), and if the maximum edge value < max(thres2, mean_edge_value_of _the__subimage!t!const2). If all sub-images tested meet these conditions, then a fade frame is found.
When detecting a fade, the four corner sub-images are ignored as they are likely to contain station logos which may exist during fades. Detected fade frames are excluded in further calculations.
Next, spectral clustering is performed 16. Let us define D as the diagonal matrix where D11 = ∑jAy , and construct the normalised graph Laplacian
L = D~mAD~m . Next, X15X2,..., xk & largest eigenvectors of L are found. The matrix
X - [xjX2...xt] e Rmk is then formed by stacking the eigenvectors in columns. The matrix Y from X is formed by renormalizing each of Ts rows to have unit length (i.e.
Figure imgf000011_0001
Next the number of segments is estimated 18. A fixed number of segments (such as k=6) can be used. This is sufficient for most video sequences, however there are video sequences that contains a lot of transitions (mostly cuts). Using fixed k=6 in this case will lead to false deletion of transitions.
Therefore we have developed a method to estimate the number of cuts N_est in each section (e.g., 240 frames). N_est+6 is then used as the initial number of segments (in stead of 6).
Given an affinity matrix A[i,j], firstly we define two vectors which describe the sum of affinities to frames before and after the current one: bw\i] = ∑A[i,j]
<+r Mi] = ∑A[i,j]
J=«+l
T is a threshold that limits the number of frames we check the affinity of. Transitions from large bw/fw values to small values indicate the existence of cut transition, and vice versa.
Referring now to Fig. 4, a three-state machine is defined. Peak state can be entered from any state if bw[i]>thres2 and fw[i-l]>thres2, and bottom state can be entered from any state if bw[i]<thresl and fw[i-l]<thresl, where i is the current frame number, and thresl and thres2 are thresholds (0.25 and 1.75 used in experiments). The estimated number of segments N_est equals to the average of the number of state transitions from peak to bottom and from bottom to peak.
Next, pre-processing is applied 20 for dynamic programming. Directly applying dynamic programming in spectral segmentation without pre-processing has two drawbacks:
(a) Dynamic programming may not identify exactly the shot boundaries with the least cross segment similarities; and
(b) The computational complexity of dynamic programming is quite high. Therefore a pre-processing step is applied before dynamic programming to isolate the potential cut points.
Pre-processing may, for instance, consist of edge finding. Let y(i) be a point in Rk whose coordinates are taken from row i of the Y matrix, i.e., XO
Figure imgf000012_0001
which represents data sample i. Firstly, first degree derivative of y(i) is calculated * ^ ~ κ ^ "H 9 where Zz1 = [I 0 - Yf and * denotes convolution. Then second degree derivative is calculated & \.l) - S \l) «2 S where h2 = [- 1 if .
Edges may be detected through finding zero crossing of % ' 1^ . However there exist lots of zero value points in % '^ as the data may be noisy. Therefore edges are detected as the mean of a local maximum/minimum pair in order to detect strong edges only. We also only detect one candidate edge point in a neighbourhood of T samples where T is a variable which may be adjusted depending on the application (for example
T=IO was adopted in some experiments).
Suppose SV) is a iocal maximum (+- 772 samples) and S'V) > edge_thresl2 where edgejhres is a threshold, then try to find local minimum % (m) between 1+1 and 1+T. If S"(m) < ~ed8e _ thres ' 2 , then an edge is located at (l+m)/2. Edge points detected in this way constitute candidate points for dynamic programming.
Suppose there are p candidates, then the complexity of dynamic programming is reduced from 0^1 > to ^P ' . In experiments N values of 400 and 240 were used, while/? is usually < 10. Dynamic programming is then applied 22 on candidate points determined by pre-processing. Here spectral K-segmentation is performed on frame sequences in the fixed duration sections rather than each single frame. For example, a time series S consists of n samples (,S1 ,...,_?„} . Similar to the notations in J. Himberg, K. Korpiaho,
H. Mannila, J. Tikanmaki, "Time series segmentation for context recognition," Proc. 2001 IEEE Int. Conf. on Data Mining, pp. 203 - 210, 2001 ("ffinberg et al"), ss(a, b) is defined as a segment of the time series S, ie, the consecutive samples {sa,...,sb} where a ≤ b ,
As mentioned before, k-segmentation of S is a sequence Ss1SS2... ssk of k segments such that ssλss2...ssk = {sv...,sn} and each ss, is not empty. Equivalently the task is to find k-1 segment boundaries l J 2'""' k~λ , \ <cλ <c2 < ...<ck_x < n , where IS-T1 = ISIS(IJ C1 ) , -W2 = .SIS(C1 + 1, C2 ) , ..., ssk = ss(ck_λ + \, ή) . co = l , and ck = n . The k segments are schematically shown in Fig. 5. The number k could be a pre-determined value (eg, k=6) or the estimated number of segments (eg, k=N_est+6).
Spectral K-segmentation of the sequences of frames is performed in the space spanned by the first few eigenvectors of the constructed normalised graph Laplacian. Following spectral clustering step above, the same time series can be represented in the space spanned by rows of matrix Y by n samples (^1,..., y,,} e Rk . K-segmentation of time series S can be conducted on Y, ie, to find a sequence ys1ys2...ysk of k segments such that ysly$2...ysk = {yx,..., yn) and each ys, is not empty. Segmentation in Y space instead of S space is justified because the rows of Y will form tight clusters around k well separated points while such cluster structure may not be clear in S space (Ng et al).
A cost function is defined to illustrate the internal homogeneity of all segments
-t k c, cost = — ]T ] |^(y) - μ, where μt is the mean vector of all the data y in segment n ,=1 ;=C/_1+1 ySi - yg(c ι-\ + IJ c ι ) • Depending on the application other forms of cost function may be used as well.
The problem is to find the K segment boundaries c, that minimize the cost, therefore it is an instance of K-segmentation problem. As suggested in Hinberg et al,
K-segmentation problem can be solved optimally using dynamic programming.
However we apply dynamic programming to the sequence of data in the eigenspace rather than the space of the raw input data. The complexity of dynamic programming is
O(kn2) since the cost of segmentation can be calculated in linear time.
Next, the number of clusters is automatically determined 24. This is performed through thresholding of normalized conductance. This is achieved by starting from a large number of k potential segments (k=6 used in experiments which can be determined through experience, however see step 22 above that can be used to estimate the number of cuts) and then rejecting weak segments based on the criteria below. If C1 is shot boundary between segments P and Q, then normalized conductance
∑ Y A and IS - t] ,
Figure imgf000014_0001
the number of vertices in segment P.
However, vertices near a true shot boundary usually have small edge weights as well, so they should be excluded in the normalized conductance calculation. Define P* = M(C1., + β, c, - β) , and Q* = Ss(C1 + β, cI+1 - β) .
Figure imgf000014_0002
^ = min_shotjength 14
Then shot boundary c, is rejected if ^* (C,) is less than a threshold. Next, the detected shot boundary point is fine-tuned 26 in a 3 -frame window
{ C1-1 , C1 , cl+l } by finding the vertex that produces the smallest φ* and using that vertex as the shot boundary. Here the number three is a parameter that may be varied depending on the application.
Finally, the incorrectly detected shot boundaries are rejected using Support Vector Machines (SVM) classification 28. The method of determining the number of clusters 24 can be improved for gradual transitions. Graph cut/spectral clustering only finds single cut points, and may cut a gradual transition twice, one at the beginning and one at the end of the transition.
We view the rejection of incorrectly detected shot boundaries as a supervised classification problem and we adopt SVMs as the classifier. The key step is to represent the curve/time series using features of extracted feature points for SVM.
Cut value is the sum of affinities across the current frame, which is a terminology inherited from graph cut methods. We plot the cut value curve over time.
Fig. 6 shows a typical cut value curve of gradual transition. Frames in the middle of the transition 70 may have higher cut values, and there are two bottoms 72 in the transition. Spectral clustering may pick up one or two of the two bottoms 72 shown in the curve, depending on the transition.
The strategy is to identify key points out of a potential gradual transition, extract features of the key points and use these features for SVM classification. However if the duration of the transition is short (<=5 frames), then there is no need to detect key points. Instead we just use frames near the transition for SVM classification. Therefore we have defined two classifiers, one for short transition (<=5 frames), and one for long transition (>5 frames).
However, since spectral clustering only finds the "cut" point (the point with the minimum cut value), we need to (a) determine the duration of the detected transitions first, then (b) apply the classifiers to reject incorrectly detected transitions.
(a) In order to determine the duration of the detected transitions several steps must be performed.
(i) Starting from the cut point B in Fig. 6, try to find the entry and exit points (A and C). To find entry point A, starting from the cut point B, search backwards to find either:
• the first frame i with bw[i]<threslθ and bw[i-l]>thresl l. ThreslO and thresll typically are 2.5e-5 and 1, respectively. This condition typically represents sharp cuts; or
• the frame i with fw[i] greater than thresl and the difference between the edge value of frames i and i-1 is less than another threshold. Frame i may be further extended in the backwards direction by up to 3 frames if the difference between fw[i-l] and fw[i] is greater than thres3. Thresl and thres3 typically are 3.75 and 0.125, respectively. This condition typically represents gradual transitions. Similarly the exit point C can be found. Region A-B-C is called a detected transition region (DTR).
(ii) As gradual transitions may have two bottoms in the cut value curve, and sometimes only one of them has small cut value which can be detected by spectral clustering as a transition, we need to search for the first curves with a dip (i.e., similar to A-B-C) in the left and right-hand sides of the current transition region (A-B-C). The regions are found by locating the peak point first, then the nearest entry (transition from peak to bottom) point, then finding the nearest bottom point, then finding the next exit point similar to the process described in step 1. As an example, in Fig. 6, points D, E, F and G can be found in this way. Region E-F-G is called candidate transition region (CTR).
(iii) As a true gradual transition may be broken up into one DTR plus one CTR, or two DTRs, it is necessary to check whether we need to merge any DTR with a neighbouring CTR or DTR. As an example, the appearance of several DTRs and CTRs is shown in the graph of Fig. 7. There are following sub-steps to merge transitions. • Sub-step 1 : for every DTR, check the probabilities that it should be merged with neighbouring DTR/CTR to both sides of the current DTR, and store the probabilities
• Sub-step 2: examine all the probabilities calculated. Starting from the highest probability one, merge the corresponding DTR/CTR if the probability is higher than a threshold. Repeat the process until all probabilities have been examined.
The probability to merge DTR/CTR is calculated via SVM. The SVM features used in the SVM classifier include features about the seven feature points A5 B, C, D, E, F, and G representing the first entry, first bottom, first exit, middle peak, second entry, second bottom, and second exit. Detailed features are:
• The relative location of the seven feature points expressed as relative distances
• The following features of all the seven feature points: out value, fw value, bw value, mean edge value, mean luminance value and standard deviation of luminance value of the whole image and four sub-image columns and four sub-image rows.
After this step, we have detected (and properly merged) transitions to be further classified as true or false transitions by SVMs. Two SVM classifiers are used depending on the duration of the transition. In the SVM for short transitions, the following features are extracted for SVM classification:
• the duration of the detected transition
• the edge value of the beginning and the end frames of the detected transition
• the cut value, fw value, bw value, mean edge value, mean luminance value, standard deviation of luminance value of seven neighbouring frames centered around the frame within the detected transition with the smallest cut value.
• features about the transition detected before the current transition and the frames between the previous and the current transitions (we use BT to represent these frames below), including: minimum cut value of BT, the distances between the minimum cut location of BT and the beginning of the current transition and the end of the previous transition, maximum cut value of BT, the distances between the maximum cut location of BT and the beginning of the current transition and the end of the previous transition, the distance between the previous and the current transitions, the duration of the previous transition, the cut value of the end frame of the previous transition, pre_contijpeak, mean_ehjpre, std_eh_pre, mean_pp_pre, std_pp_pre. Pre_conti_peak counts the length of continuous peak frames (greater than a threshold) starting from the beginning of the current transition. Mean_eh_pre and std_ehjpre are the mean and the standard deviation of edge values between the current and the previous transitions. Meanjpp_pre and std_pp_pre are the mean and the standard deviation of cut values between the current and the previous transitions.
• features about the transition detected after the current transition and the frames between the next and the current transitions (we use BT to represent these frames), including: minimum cut value of BT, the distances between the minimum cut location of BT and the end of the current transition and the begin of the next transition, maximum cut value of BT, the distances between the maximum cut location of BT and the end of the current transition and the beginning of the next transition, the distance between the next and the current transitions, the duration of the next transition, the cut value of the beginning frame of the next transition, post_conti_peak, mean_eh_post, std_eh_post, mean_pp_post, std_pp_post. Post_conti_peak counts the length of continuous peak frames (greater than a threshold) starting from the end of the current transition. Mean_eh_post and std_eh_post are the mean and the standard deviation of edge values between the current and the next transitions. Mean_pp_post and std_pp__post are the mean and the standard deviation of cut values between the current and the next transitions.
For the SVM for long transitions, the features used include ones similar to the short-transition features, together with additional features. Given a detected transition, the local maximum and local minimum points of the cut value curve are detected. With pre and post representing the frame numbers of the beginning and the end of the transition, up to seven feature points are selected according to the rules below (a sketch of this selection follows the list):
• if peak_num=0 and bottom_num=0, then use the following 5 feature point frames: pre, post, and the 3 frames that equally divide the interval from pre to post
• if peak_num>0 and bottom_num=0, then use the following 5 feature point frames: pre, post, largest_peak_frame, the average of pre and largest_peak_frame, and the average of post and largest_peak_frame
• if peak_num=0 and bottom_num>0, then use the following 5 feature point frames: pre, post, smallest_bottom_frame, the average of pre and smallest_bottom_frame, and the average of post and smallest_bottom_frame
• if bottom_num=1 and peak_num>0, then use the following 5 feature point frames: pre, post, bottom_frame, largest_peak_frame, and the average of largest_peak_frame and either pre or post, depending on where bottom_frame is located (the other side is chosen)
• if bottom_num>1, then use the following 5 feature point frames: pre, post, the two bottom_frames with the smallest cut value, and the frame between the two bottom frames with the largest cut value
Two additional peak/bottom frames may be selected if they are not already among the five frames mentioned above.
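A minimal Python sketch of these selection rules, assuming the peak/bottom detection step has already produced lists of local-maximum and local-minimum frame numbers together with a cut-value lookup covering every frame of the transition; all names and tie-breaking details are illustrative rather than specified by the method.

    def select_feature_frames(pre, post, peaks, bottoms, cut):
        # pre, post:      first and last frame of the detected transition
        # peaks, bottoms: frames of local maxima / minima of the cut value curve
        # cut:            dict mapping every frame in [pre, post] to its cut value
        mid = lambda a, b: (a + b) // 2
        if not peaks and not bottoms:
            q = (post - pre) // 4
            frames = [pre, pre + q, pre + 2 * q, pre + 3 * q, post]
        elif peaks and not bottoms:
            p = max(peaks, key=cut.get)                  # largest peak
            frames = [pre, mid(pre, p), p, mid(p, post), post]
        elif bottoms and not peaks:
            b = min(bottoms, key=cut.get)                # smallest bottom
            frames = [pre, mid(pre, b), b, mid(b, post), post]
        elif len(bottoms) == 1:
            b, p = bottoms[0], max(peaks, key=cut.get)
            side = post if b < mid(pre, post) else pre   # side opposite the bottom
            frames = [pre, b, p, mid(p, side), post]
        else:
            b1, b2 = sorted(bottoms, key=cut.get)[:2]    # two smallest bottoms
            lo, hi = min(b1, b2), max(b1, b2)
            # frame between the two bottoms with the largest cut value
            between = max(range(lo + 1, hi), key=cut.get, default=mid(lo, hi))
            frames = [pre, b1, b2, between, post]
        # up to two extra peak/bottom frames not already selected
        extras = [f for f in peaks + bottoms if f not in frames][:2]
        return sorted(set(frames + extras))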
Having chosen the seven feature frames, a large number of features similar to those used for short transitions are used (see the assembly sketch after this list), namely:
• the duration of the detected transition
• the edge value of the beginning and the end frames of the detected transition
• the cut value, fw value, bw value, mean edge value, mean luminance value and standard deviation of luminance value of the 7 feature frames (if fewer than 7 frames are chosen, non-informative padding features are included).
• features about the transition detected before the current transition and the frames between the previous and the current transitions (the same as those for short transitions)
• features about the transition detected after the current transition and the frames between the next and the current transitions (the same as those for short transitions)
The additional features include:
• the type of each feature frame (e.g. peak, bottom, other)
• the relative distances between the feature frames.
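Purely as an illustration, a sketch of assembling the long-transition feature vector with non-informative padding when fewer than seven frames are selected; the padding value of -1 and the numeric type encoding are assumptions, not values given in the description.

    import numpy as np

    TYPE_CODE = {'peak': 0.0, 'bottom': 1.0, 'other': 2.0}   # assumed encoding

    def long_transition_vector(base, frames, per_frame, types, max_frames=7):
        # base:      the short-transition-style features
        # frames:    sorted feature frame numbers (up to 7)
        # per_frame: one statistics array per frame (cut/fw/bw/edge/luminance)
        # types:     'peak'/'bottom'/'other' label per frame
        pad = max_frames - len(frames)
        stats_dim = per_frame[0].size
        # non-informative padding for missing feature frames
        stats = np.concatenate(per_frame + [np.full(stats_dim, -1.0)] * pad)
        type_feats = np.array([TYPE_CODE[t] for t in types] + [-1.0] * pad)
        span = max(frames[-1] - frames[0], 1)
        rel = np.concatenate([np.diff(frames) / span, np.full(pad, -1.0)])
        return np.concatenate([base, stats, type_feats, rel])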
This method of identifying segments in sequential data can be performed by a computer that is able to accept input of sequential data and to store it. The computer then performs on a processor the steps outlined in Fig. 1 to produce as output the identified K-segments. For example, the computer may accept digital video frames and display the identified shot segments on a display means, such as a monitor. The computer may also create or update an index document that indexes, and is used to navigate to, the shot segments; a sketch of such an index follows.
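As an illustration of the last point only, a minimal sketch that writes the identified shots to a JSON index; the file layout and names are hypothetical, not part of the described method.

    import json

    def write_shot_index(shots, path='shot_index.json'):
        # shots: list of (start_frame, end_frame) pairs, one per K-segment
        index = [{'shot': i, 'start': s, 'end': e}
                 for i, (s, e) in enumerate(shots)]
        with open(path, 'w') as f:
            json.dump(index, f, indent=2)   # navigable index of shot segments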
The parameters of the algorithm were derived using three TV sequences, each with about 10000 frames: two news segments from TV channels CCTV-4 and CCTV-9, and one sport (soccer) segment from TV channel Oriental TV.
The performance of the proposed algorithm was evaluated using a completely new set of TV video sequences. In total, two hours of news and one hour of sport video sequences were used in the test. The breakdown is as follows:
[Table: breakdown of the test sequences by channel — image not reproduced]
None of the test sequences was used in algorithm development. Moreover, we included two completely new TV channels: CCTV-1 and Phoenix TV. In the case of CCTV-4, the development set and the test set were captured on two different days. In the case of Oriental TV Sport, the development set and the test set were from two different matches captured on two different days. We observed that news from CCTV-1, CCTV-4 and CCTV-9 exhibited different editing characteristics, and there is little overlap of content among them, so we regard them as different channels.
The test results are reported in the following tables. Machine-segmented shot boundaries were checked against manually segmented ground truth (a sketch of this boundary-matching evaluation follows the tables). As evident in the tables, the proposed algorithm achieved high precision and recall. Breaking the performance down by channel shows that it is consistent across all four channels. We also calculated performance separately by transition type: cuts and gradual transitions (transitions of more than one frame). The performance on gradual transitions is slightly lower, mainly due to some special editing effects.
[Tables: precision and recall of the proposed algorithm, overall, per channel and per transition type — images not reproduced]
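A minimal sketch of such an evaluation, assuming boundaries are represented as frame numbers; the matching tolerance of 3 frames is an illustrative assumption, not a value given in the description.

    def precision_recall(detected, truth, tol=3):
        # A detected boundary counts as correct if it lies within `tol`
        # frames of a ground-truth boundary not already matched.
        matched, used = 0, set()
        for d in detected:
            hit = next((t for t in truth
                        if abs(t - d) <= tol and t not in used), None)
            if hit is not None:
                used.add(hit)
                matched += 1
        precision = matched / len(detected) if detected else 0.0
        recall = matched / len(truth) if truth else 0.0
        return precision, recall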
Although the invention has been described with reference to video shot segmentation, it should be appreciated that it is not limited to that application. The invention can be applied to any time series or sequential data.
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

Claims

CLAIMS:
1. Sequential data segmentation using spectral clustering wherein attributes of data samples are extracted to construct affinity matrices for sequences of samples in sequential order, and K-segmentation is then applied to a representation of the sequences of samples derived from the affinity matrices to identify K-segments, and each K-segment comprises representations of samples in sequential order.
2. Sequential data segmentation using spectral clustering according to claim 1, wherein elements in a row or a column of the affinity matrices represent data samples in sequential order.
3. Sequential data segmentation using spectral clustering according to claim 1 or 2, wherein a graph Laplacian is derived from the affinity matrices and then the leading eigenvectors of the Laplacian are solved.
4. Sequential data segmentation using spectral clustering according to claim 3, wherein the graph Laplacian is normalised.
5. Sequential data segmentation using spectral clustering according to claim 2 or 3, wherein K-segmentation is applied to the normalised eigenvector space representing the sequences of data samples.
6. Sequential data segmentation using spectral clustering according to any one of the preceding claims, wherein K-segmentation comprises identifying homogenous segments in the representation of sequences of data samples.
7. Sequential data segmentation using spectral clustering according to claim 6, wherein an estimation of the number of segments K is used to initially identify K homogenous segments.
8. Sequential data segmentation using spectral clustering according to any one of claims 1 to 7, wherein K-segmentation comprises identifying significant transitions in the representation of sequences of data samples.
9. Sequential data segmentation using spectral clustering according to any one of the preceding claims, wherein a data sequence is initially divided into sections of a specific number of samples.
10. Sequential data segmentation using spectral clustering according to claim 9, wherein sequential data segmentation is applied to each section.
11. Sequential data segmentation using spectral clustering according to claim 9 or
10, wherein for a section S that consists of n samples {s1, ..., sn}, ss(a,b) is defined as a sequence of samples in sequential order {sa, ..., sb} where a ≤ b.
12. Sequential data segmentation using spectral clustering according to claim 9, 10 or 11 and limited by claim 3, wherein spectral clustering comprises performing clustering based on the largest eigenvectors.
13. Sequential data segmentation using spectral clustering according to claim 12, wherein the largest eigenvectors are stacked into columns to form matrix X such that the renormalisation of the rows of X gives matrix Y.
14. Sequential data segmentation using spectral clustering according to claim 13, wherein section S is represented in the space spanned by the rows of Y by n frames {y1, ..., yn} ⊂ Rk.
15. Sequential data segmentation using spectral clustering according to claim 13 or 14, wherein K-segmentation can be conducted on Y to find a sequence ys1 ys2 ... ysk of k segments such that ys1 ys2 ... ysk = {y1, ..., yn} and each ysi is not empty.
16. Sequential data segmentation using spectral clustering according to any one of the preceding claims, wherein the extracted attributes of data samples includes the color attributes of the samples.
17. Sequential data segmentation using spectral clustering according to any one of the preceding claims, wherein the extracted attributes of data samples includes the edge attributes of the samples.
18. Sequential data segmentation using spectral clustering according to claim 17, wherein the edge attributes of the samples are determined by analysing overlapping sub-samples and using edge detection templates.
19. Sequential data segmentation using spectral clustering according to any one of the preceding claims, wherein the extracted attribute of data sample is based on edge energy distance values between sequential couples of samples.
20. Sequential data segmentation using spectral clustering according to any one of the preceding claims, wherein the extracted attribute of data sample is based on the sequential adjacency of the samples.
21. Sequential data segmentation using spectral clustering according to any one of the preceding claims, wherein the extracted attributes of the data samples includes random impulse noise.
22. Sequential data segmentation using spectral clustering according to any one of the preceding claims, wherein before K-segmentation is applied, pre-processing is applied to identify potential boundaries and to reduce the computational complexity of the K-segmentation of the sequences of data samples.
23. Sequential data segmentation using spectral clustering according to any one of the preceding claims, wherein following K-segmentation, the number of segments is determined by rejecting segments that do not meet a predetermined threshold.
24. Sequential data segmentation using spectral clustering according to claim 23, wherein segments are rejected if the normalized conductance of at least one of a segment's boundaries is less than the predetermined threshold.
25. Sequential data segmentation using spectral clustering according to claim 24, where the normalized conductance is calculated from all data in the segments.
26. Sequential data segmentation using spectral clustering according to claim 24, where the normalized conductance is calculated from selected data in the segments.
27. Sequential data segmentation using spectral clustering according to any one of the preceding claims, wherein a boundary of a segment is fine tuned by considering the sample at the identified boundary of a segment and a predetermined number of adjacent samples and selecting the most suitable sample to be the boundary.
28. Sequential data segmentation using spectral clustering according to any one of the preceding claims, wherein an incorrect boundary of a segment is detected and rejected by extracting key features of the boundary and assessing those features using Support Vector Machines classification.
29. Sequential data segmentation using spectral clustering according to claim 28, wherein the key features are extracted from the transition of the cut values over time.
30. Sequential data segmentation using spectral clustering according to any one of the preceding claims, wherein the data is comprised of samples that are video frames, and each K-segment is a video shot.
31. A computer system for performing sequential data segmentation using spectral clustering according to any one of the preceding claims.
32. Computer software able to perform the sequential data segmentation using spectral clustering according to any one of claims 1 to 30.
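For illustration of the spectral clustering steps recited in claims 3, 12 and 13 (graph Laplacian, leading eigenvectors, stacking into X and row renormalisation to give Y), a minimal Python sketch follows; it uses the standard normalised spectral embedding and is an assumed reading of those steps, not the patented implementation.

    import numpy as np

    def spectral_embedding(A, k):
        # A: n x n affinity matrix over samples in sequential order.
        # The leading eigenvectors of the normalised affinity
        # D^{-1/2} A D^{-1/2} correspond to the smallest eigenvectors
        # of the normalised graph Laplacian.
        d = A.sum(axis=1)
        d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
        L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
        w, v = np.linalg.eigh(L)          # eigenvalues in ascending order
        X = v[:, -k:]                     # k largest eigenvectors as columns of X
        norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
        return X / norms                  # Y: renormalised rows, one per sample

K-segmentation is then conducted on the rows of the returned Y to find contiguous, non-empty segments.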
PCT/AU2006/000012 2005-01-24 2006-01-06 Sequential data segmentation WO2006076760A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2006207811A AU2006207811A1 (en) 2005-01-24 2006-01-06 Sequential data segmentation

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
AU2005900278A AU2005900278A0 (en) 2005-01-24 Shot segmentation
AU2005900278 2005-01-24
AU2005901525 2005-03-29
AU2005901525A AU2005901525A0 (en) 2005-03-29 Sequential data segmentation

Publications (1)

Publication Number Publication Date
WO2006076760A1 true WO2006076760A1 (en) 2006-07-27

Family

ID=36691915

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2006/000012 WO2006076760A1 (en) 2005-01-24 2006-01-06 Sequential data segmentation

Country Status (1)

Country Link
WO (1) WO2006076760A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7956893B2 (en) 2006-12-11 2011-06-07 Mavs Lab. Inc. Method of indexing last pitching shots in a video of a baseball game
US8335757B2 (en) 2009-01-26 2012-12-18 Microsoft Corporation Extracting patterns from sequential data
US8489537B2 (en) 2009-01-26 2013-07-16 Microsoft Corporation Segmenting sequential data with a finite state machine
CN107174232A (en) * 2017-04-26 2017-09-19 天津大学 A kind of electrocardiographic wave extracting method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ODOBEZ J.-M. ET AL.: "Spectral Structuring of Home Videos", PROC. INTERNATIONAL CONFERENCE ON IMAGE AND VIDEO RETRIEVAL, July 2003 (2003-07-01), pages 310 - 320 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7956893B2 (en) 2006-12-11 2011-06-07 Mavs Lab. Inc. Method of indexing last pitching shots in a video of a baseball game
US8335757B2 (en) 2009-01-26 2012-12-18 Microsoft Corporation Extracting patterns from sequential data
US8489537B2 (en) 2009-01-26 2013-07-16 Microsoft Corporation Segmenting sequential data with a finite state machine
CN107174232A (en) * 2017-04-26 2017-09-19 天津大学 A kind of electrocardiographic wave extracting method
CN107174232B (en) * 2017-04-26 2020-03-03 天津大学 Electrocardiogram waveform extraction method

Similar Documents

Publication Publication Date Title
US11004129B2 (en) Image processing
US8358837B2 (en) Apparatus and methods for detecting adult videos
US8548256B2 (en) Method for fast scene matching
US8467611B2 (en) Video key-frame extraction using bi-level sparsity
US7813552B2 (en) Methods of representing and analysing images
US20120027295A1 (en) Key frames extraction for video content analysis
CN110263712B (en) Coarse and fine pedestrian detection method based on region candidates
US20120148149A1 (en) Video key frame extraction using sparse representation
US20070030391A1 (en) Apparatus, medium, and method segmenting video sequences based on topic
CN106937114B (en) Method and device for detecting video scene switching
EP2270748A2 (en) Methods of representing images
US7840081B2 (en) Methods of representing and analysing images
WO2006076760A1 (en) Sequential data segmentation
e Souza et al. Survey on visual rhythms: A spatio-temporal representation for video sequences
EP2325802A2 (en) Methods of representing and analysing images
Mizher et al. An improved action key frames extraction algorithm for complex colour video shot summarization
AU2006207811A1 (en) Sequential data segmentation
Shah et al. Shot boundary detection using logarithmic intensity histogram: An application for video retrieval
e Santos et al. Adaptive video shot detection improved by fusion of dissimilarity measures
Deotale et al. Analysis of Human Activity Recognition Algorithms Using Trimmed Video Datasets
Liu et al. Analyzing Periodicity and Saliency for Pornographic Video Detection
Munaro et al. Scene specific people detection by simple human interaction
Tapu et al. Multiresolution median filtering based video temporal segmentation
Xing et al. Multi-scale Shot Segmentation Based on Weighted Subregion Color Histogram.
Porikli Video object segmentation by volume growing using feature-based motion estimator

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2006207811

Country of ref document: AU

ENP Entry into the national phase

Ref document number: 2006207811

Country of ref document: AU

Date of ref document: 20060106

Kind code of ref document: A

WWP Wipo information: published in national office

Ref document number: 2006207811

Country of ref document: AU

122 Ep: pct application non-entry in european phase

Ref document number: 06700258

Country of ref document: EP

Kind code of ref document: A1

WWW Wipo information: withdrawn in national office

Ref document number: 6700258

Country of ref document: EP