WO2007072347A2 - System and method for processing video - Google Patents

System and method for processing video

Info

Publication number
WO2007072347A2
Authority
WO
WIPO (PCT)
Prior art keywords
shots
shot
video
processor
groups
Prior art date
Application number
PCT/IB2006/054841
Other languages
French (fr)
Other versions
WO2007072347A3 (en)
Inventor
Jan A. D. Nesvadba
Yash S. Joshi
Stefan Pfundtner
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Publication of WO2007072347A2 publication Critical patent/WO2007072347A2/en
Publication of WO2007072347A3 publication Critical patent/WO2007072347A3/en

Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 - Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28 - Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content

Definitions

  • the invention relates to a system for processing video, the video comprising shots, wherein the system comprises a processor for processing the video, the processor comprising a shot grouper for grouping the shots into groups of visually similar ones of the shots.
  • the invention also relates to a method for processing video, the video comprising shots, wherein the method comprises automatically grouping the shots into groups of visually similar ones of the shots.
  • shots representing a continuous action in time and space are grouped into groups of visually similar shots. Shots that are visually similar, such as, for instance, a shot of an anchor person in a broadcast, are recognized and grouped. A hierarchical tree overview of video content is made from the groups.
  • the real tree structure can thereafter be made interactively by a user. Use is made of automatic processing in conjunction with the user input to provide video organization.
  • although the known system and method provide a useful tool, they require a relatively large effort by the user.
  • the system according to the invention is characterized in that the processor is arranged in such a way that the visually similar ones of the shots are grouped into a shot group on the condition that the number of intermediate shots between consecutive ones of the visually similar shots is not larger than a maximal shot window number (Nmax).
  • the method according to the invention is characterized in that the shots are automatically grouped into groups of visually similar ones of the shots, wherein the visually similar ones of the shots are grouped into a shot group on the condition that the number of intermediate shots is not larger than a maximal shot window number.
  • a second, frequently used directing instrument is the use of sequences of interleaved groups of shots. Examples of such sequences of interleaved groups of shots are what is called "crosscutting", which is used to visualize time-wise related, location-wise disjoint, parallel-running events, i.e. two or more events that are happening at the same time, but at different locations, and "shot/reverse shot", used to visualize one event such as a dialog between actors, i.e. an event happening at the same time and the same location but taken from different viewing points.
  • Such interleaved groups of shots form a very frequently used video technology. Very often, such sequences span an almost entire semantic video scene.
  • the inventors have realized that the number of different groups of shots in a scene is typically small and the number of shots in between shots belonging to the same group is usually small. Shots that are similar in visual content but are separated by more than a relatively small maximum number of intermediate shots often, in fact almost invariably, belong to different scenes. Limiting the grouping of shots having a similar visual content to those shots that are separated from each other by not more than a maximum number of intermediate shots thus has the beneficial effect of bringing the automatic grouping more in line with the actual story.
  • the known system and method group all shots with similar visual content into a single group, even if they are separated by many intermediate shots. This may well lead to grouping of shots so that shots belonging to different video scenes are grouped into a single group, for instance, all close-ups of a character within a movie.
  • the known system and method thus often give misleading results in the sense that a group of shots, although visually similar, in fact spans a scene boundary and thus, when seen in the context of the story developed in the movie, actually belongs to different parts of the movie.
  • the system and method of the invention reduce and, in fact, virtually eliminate this problem. This will reduce the amount of manual manipulation by the user.
  • the system and method of the invention can be used, for example, to create scene indexes and/or to detect commercials.
  • the processor is arranged in such a way that visually similar ones of the shots are grouped into the shot group on the condition that the number of intermediate shots between consecutive ones of the visually similar shots is not larger than a maximal shot window number (Nmax), but larger than one.
  • the exclusion of the predecessor shot N from the similarity check can prevent false grouping of shots when a system has weaker shot boundary detectors (e.g. precision below 97%).
  • the processor is arranged to identify interleaved groups of shots and to group interleaved groups of shots into a larger group and identify the beginning and end shot of the larger group.
  • Interleaved groups are groups for which a member of one group is time-wise positioned in between two members of the other group.
  • a group has at least two members. The inventors have found that, provided that the limitation of a maximal shot window number is imposed, such larger groups of interleaved groups almost invariably fall within a scene. Scenes very often begin or end at or near the beginning or end shot of such larger groups. Within the framework of this application, such larger groups will also be referred to as "a parallel shot group".
  • the identified beginning and end shots of a parallel shot group thus provide a good indication of where in the movie the scenes may end or begin. This provides a valuable indication of the end of scenes. Further post-processing may be necessary to pinpoint the end of scenes, but the method does provide a good starting point for pinpointing.
  • the prior-art system and method do not provide the user with any automatic indication of or near the end of scenes. Ends of scenes are, however, very important and provide semantic, meaningful information. Automatic generation of such indications will be a valuable helping tool for the user.
  • the invention further relates to shot grouping which, rather than being based on color histograms, is based on tracking of feature points.
  • Feature points, also called salient points, are points where the local gradient is high in vertical as well as horizontal directions.
  • Typical salient points are points where two edges meet in an image. This is particularly advantageous when used in conjunction with previously mentioned aspects, i.e. using a maximal shot window number Nmax to group shots in such a manner that the risk of grouping shots that do not belong within a scene is strongly reduced and that interleaved groups of shots are grouped into larger groups of parallel shots (PS), and the end and/or begin shot of a PS is identified, but this aspect can also be used independently of such measures.
  • shots are grouped and interleaved groups of shots are identified, while furthermore interleaving statistics are generated, attributing a genre to the data stream on the basis of these interleaving statistics.
  • the inventors have realized that such statistics are indicative of the genre of the movie.
  • the actual percentage of content that is part of a parallel-shot sequence is a valuable indication of the genre. Soaps and sitcoms typically contain 70 to 80% of parallel-shots, while movies typically contain about 50% and magazine shows an even lower percentage. This fact can also be used for content classification, solely or in conjunction with the interleaving patterns. Interleaving patterns are also indicative of the genre.
  • the pattern of keyframe links observed in a content item is considered to be a valuable indicator of the nature of the content.
  • the statistics of linked similar shots are useful for content classification of a movie.
  • shot sequences of the form ABABAB are characteristic of soap operas and situational comedies. Talk-show content shows similar characteristics, but repeated shots of a single person are often taken from different camera angles and hence might confound conventional linking methods. Movies are considered to exhibit a greater degree of variation - sequences of the type ABCDEDBF are not uncommon.
  • Fig. 1 illustrates the invention schematically
  • Fig. 2 illustrates interleaved groups of shots
  • Figs. 3, 4 and 5 illustrate video sequences comprising shots, parallel shots, scenes and scene boundaries
  • Fig. 6 illustrates the formation of keyframe links
  • Figs. 7 and 8 show images, wherein Fig. 8 shows the Y-frame of the image of Fig. 7; Figs. 9 and 10 illustrate the tracking of feature points;
  • Figs. 11 and 12 show the X and Y-gradient, respectively, of the Y-frame shown in Fig. 8;
  • Figs. 13 to 15 illustrate feature or salient points within an image
  • Fig. 16 compares the method with reality
  • Fig. 17 illustrates a system according to a further aspect of the invention
  • Figs. 18 and 19 illustrate linking patterns that are typical of certain types of video sequences.
  • video frames are the basic units of a broadcast stream.
  • the time-wise successively captured frames of a recording event are clustered together into video shots, which are limited by the start and end instants of the recording event.
  • these shots are e.g. compressed in MPEG-2 format and stored for further post-processing.
  • the director of e.g. a feature film concatenates multiple shots into larger groups of shots and then into semantically meaningful video scenes, using established and well-defined film grammar rules.
  • the instants at which two shots have been concatenated are called Shot Boundaries (SB).
  • interleaved groups of shots are also used for dialogs (e.g. conversation scenes captured with two cameras placed at different locations) or alternatively, for two time-wise parallel events developing at the same time and being interleaved into each other.
  • interleaved groups of shots are called parallel shot groups.
  • Such parallel shot groups are an essential and very frequently used technique, and very often a single parallel shot group spans an almost entire semantic video scene. In general, only an establishing shot and a concluding shot are added on either side of a parallel shot group, which together form a semantic video scene.
  • occasionally, a scene comprises two parallel shot groups, and rarely three or more.
  • Fig. 1 illustrates the method according to the invention.
  • a video stream 1 is an input for a system 2 according to the invention.
  • the initial requirement for parallel-shot sequence analysis is to segment video streams into individual video shots.
  • the shot boundaries are detected in shot boundary detector 3.
  • the first and the last frame of each shot are chosen as keyframes, but multiple frames of a shot may also be used for this purpose.
  • Detector 5 detects groups of shots by comparing the content of shots.
  • this is performed by selection of keyframes, e.g. the begin and/or end frame of a shot, and by comparing the content of begin and end frame of shots with the content of the begin or end frame of other shots.
  • the similarity between the identified keyframes of two shots is calculated. If a high similarity has been identified, these keyframes are linked together, which is referred to as a keyframe link. Consequently, a keyframe link, established between keyframes of two shots, results in a shot-link, linking these two shots together.
  • Any suitable method for comparing the shots or keyframes of the shots may be used, such as comparing an overall characteristic of the shots or keyframes of the shots, e.g. color or color distribution, or details of keyframes such as edge detection followed by edge comparison.
  • shots 5 and 14 are not visually similar and the prior-art system and method will not identify them as having any relation to each other.
  • the end of a parallel shot group does not necessarily imply that the next shot boundary is a scene boundary. Sometimes, there are several conclusion shots after the last shot of a parallel shot group before the scene boundary. Also, a scene may comprise two or more parallel shot groups. Post-processing, automatically, semi-automatically or manually, schematically indicated by the broken line to rectangle 10 in Fig. 1, may be needed to establish the true scene boundary. However, the end-of-parallel-shot indication does provide an important indication of where the scene boundary is located and, during post-processing, this indication limits the processing needed to find the scene boundary automatically or manually. The beginning-end indication of a parallel shot group provides valuable information.
  • the maximal shot window number Nmax may be an input for the group shot detection, so that the comparison in the group shot detection is only performed when the number of intermediate shots is below this maximal number. Whether or not this is the case may be deduced from counting the number of shot boundaries detected and/or from the shot segmentation.
  • the maximal number may be fixed, or may be manually set in embodiments allowing more intervention by the user.
  • the maximal number may be dependent on the genre of the movie, wherein fast movies such as action movies or science fiction movies are analyzed by using a larger Nmax (for instance, 8) than artistic movies (for instance, 5).
  • Movies often comprise some electronic indication of the genre of the movie and the system according to the invention preferably has a means for setting the number Nmax in dependence on a characteristic of the movie.
  • Other sources of information are e.g. Electronic Program Guides (EPG) as a service or Digital Video Broadcast-Service Information (DVB-SI).
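A minimal sketch of how such a genre-dependent window size could be chosen. The genre labels, the fallback default of 6 and the lookup itself are illustrative assumptions; only the values 8 (fast genres) and 5 (artistic movies) come from the text above.

```python
# Hypothetical mapping from a genre tag (e.g. taken from EPG or DVB-SI
# metadata) to the maximal shot window number Nmax. Only the values 8 and 5
# are mentioned in the text; the other entries and the default are assumed.
GENRE_TO_NMAX = {
    "action": 8,
    "science_fiction": 8,
    "art_house": 5,
}

def select_nmax(genre: str | None, default: int = 6) -> int:
    """Return Nmax for the given genre tag, falling back to a default."""
    if genre is None:
        return default
    return GENRE_TO_NMAX.get(genre.lower(), default)
```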
  • the system and method of the invention thus in fact limit the comparisons made in the shot group detection to shots that are relatively near to each other (separated by not more than a maximum number of intermediate shots).
  • This limitation, which prima facie may seem only to needlessly restrict the comparison of shots, is based on the above-mentioned insight into the organization of most movies and videos.
  • the limitation prevents needless comparisons of shots, thus increasing the speed of calculation, and reduces the number of false positives, i.e. shots that are included into a group on the basis of visual similarity, while yet not belonging to the same group when seen in the context of the story.
  • the known system and method allow grouping of shots that, although visually similar, actually belong to different scenes.
  • the limitation of the maximal shot window number Nmax is not imposed because of limited calculation power; given the ever-increasing calculation power available, such a limitation would otherwise become less and less necessary in the future.
  • the limitation stems from the insight mentioned above.
  • the system and method according to the invention allow methods of comparison using a large calculation power, thus allowing better comparison results.
  • the difference between the shots is usually relatively small, so that even relatively crude comparison methods may be useful.
  • Fig. 1 illustrates a preferred embodiment of the invention.
  • the inventors have recognized that, when the shots are grouped, scenes are often composed of interleaved groups of shots (usually two to four groups) followed by an ending shot, or alternatively the last shot of the scene is a final shot of one of the interleaved groups.
  • the end shot is provided with a label 'end of parallel shot'. This provides valuable information to a user.
  • the known system and method do not allow an easy way of establishing such information.
  • the present system and method do allow easy determination of the end of a group. Since shots are only compared with shots that are relatively close, it is clear that a particular shot is the end shot of a group if comparison with the next Nmax shots reveals that none of these Nmax shots belongs to the same group.
  • the intermediate shots may be grouped in one, two or three groups. A structure of two or more interleaved groups of shots is found.
  • Fig. 2 illustrates the system and method according to the invention.
  • Fig. 2 schematically illustrates a rather complex scene comprising four different groups of shots schematically indicated by either the grey shade or the pattern of the area representing the shot, for instance, two close-ups and two overviews of the scenery in which the two characters whose close-up is shown are situated.
  • the scene spans the shots S1 to S12 and may comprise an opening shot S0 and a closing shot S13. Shots S1, S4, S8 and S11 form a first group, S2, S6 form a second group, S5 and S9 form a third group, and S7 and S12 form the fourth group.
  • S10 is a single shot, not belonging to any group.
  • shots S1 to S12 will be grouped into four groups using comparison methods to establish shot links.
  • the number Nmax is set at 5.
  • shot S25 is visually comparable with shot S12
  • shot S35 is visually comparable with shot S9. This is by no means uncommon; quite often, certain shots, for instance, close-ups of a certain character are repeated in different scenes separated some time from each other. None of the other shots S13 to S34 is comparable with S12 or S9 or any of the other shots S1 to S11.
  • the shot S25 is grouped together with shots S7 and S12
  • shot S35 is grouped together with shots S5 and S9. This, however, is at best confusing for the user.
  • Shot S25 is clearly not a part of the same scene as shots S1 to S12, and neither is shot S35. These shots may be part of another group of shots with similar visual content but outside the scene S1 to S12 (or S0 to S13).
  • the known system thus provides false positives, i.e. shots that are placed in a group to which they do not belong, or the formation of supergroups which in fact span several scenes or even the whole movie and thus do not contribute much to understanding the setup of the movie.
  • the prior-art will group all of the close-ups of a certain character, independently of whether they belong or do not belong to the same scene. If the user wants to find the end of the scene, the grouping as performed by the known system and method is of little help, as the user has to manually check the results.
  • S12 will thus be recognized as the end shot of interleaved groups of shots. This shot is provided with an 'end of parallel shot' indication. This allows the automatic insertion of an indication for or near the end of scenes within a video. The known system does not allow such an indication.
  • a clustering approach is followed in these preferred embodiments, wherein interleaved groups of shots are identified and grouped in parallel shot groups, even though there may not be any visual similarity between two individual shots of the parallel shot group.
  • This approach increases the precision of the Scene Boundary Detector (ScBD), since parallel shot sequences indicate sequences that should certainly not be split during the segmentation process.
  • Fig. 3 illustrates a part of a movie. The shot number is plotted on the horizontal axis and the shot duration is plotted on the vertical axis.
  • Parallel shot groups are indicated by PS and the end of a scene is indicated by ScB. None of the parallel shot groups PS spans a scene boundary ScB. Each scene boundary ScB is relatively close to an end shot of a parallel shot group PS and the beginning shot of the next parallel shot group PS.
  • Fig. 4 illustrates a scene wherein crosscutting between events A and B is used.
  • the parallel shot group is indicated, as are the end and begin shots of the parallel shot group.
  • This parallel shot group comprises two interleaved groups of shots A and B.
  • Fig. 5 illustrates a scene in which a shot/reverse-shot is used between two speakers A and B.
  • This parallel shot group PS comprises three interleaved groups of shots: A, B and A&B.
  • Crosscuttings (Fig. 4) or Shot/Reverse-Shots (Fig. 5) are defined as video segments in which the same, or a very similar, camera shot is repeated at least once within a certain time or shot interval, provided that it is shot-wise separated.
  • Using a time interval entails the risk of losing sight of the specific artistic nature of the content.
  • a shot-dependent window avoids this trap and tackles the problem by starting from a higher semantic level.
  • a shot-based interval reduces the required processing power.
  • SIW Shot-based Interval Window
  • Nmax the number of past shots taken into consideration
  • KPS Keyframes Per Shot
  • a further aspect of the invention will now be described, which aspect is particularly advantageous when used in conjunction with the previously mentioned aspect, i.e. using a maximal shot window number Nmax so as to group shots in such a manner that the risk of grouping shots that do not belong within a scene is strongly reduced.
  • Grouping interleaved groups of shots into larger groups of parallel shots (PS) and identifying the end and/or begin shot of a PS can also be used independently of such measures.
  • US 6,278,446 describes a method in which color histograms are used to group shots. Unfortunately, such methods have their shortcomings: frames that look identical to humans can have different color histograms due to changes in lighting, while, conversely, frames that look dissimilar to humans may expose quite similar color histograms. Edge detection is described in US 6,278,446 as a further means for shot grouping.
  • Feature points, also called salient points, are points where the local gradient is high in vertical as well as horizontal directions; typical salient points are points where two edges meet in an image.
  • Feature points have so far been used to estimate camera motion in compressed video, as described by P. Kuhn in "Camera motion estimation using feature points in MPEG compressed domain", Int. Conf. on Image Processing, vol. 3, pp. 596-599, 2000.
  • the feature points may be scale-variant points, but are preferably scale-invariant feature points.
  • the inventors have realized that feature points are highly distinctive and traceable.
  • the following sections describe methods of selecting and tracking feature points from one frame of a video sequence to another frame.
  • In a first pre-calculation step, individual color pixels of the original frame (see Fig. 7 and Fig. 8), e.g. RGB pixels, are reduced to a suitable representative parameter, such as the luminance Y shown in Fig. 8.
  • the minimum eigenvalue of a "gradient matrix" derived from a window around each pixel in the Y-frame is calculated as shown in Fig. 10.
  • Arriving at this matrix requires computation of the horizontal and vertical gradients of the Y-frame first, which is done by convolving the Y-frame with a high-pass function, here the derivative of a Gaussian function.
  • Figs. 11 and 12 show the X-gradient and the Y- gradient of the Y-frame.
  • the procedure of gradient computation is elaborated below.
  • the discrete zero-mean, unity-variance Gaussian is given by
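In its standard zero-mean, unit-variance form (assumed here, since the exact discrete normalization is not reproduced above), this Gaussian and its derivative, the high-pass kernel mentioned above, read:

```latex
g(n) = \frac{1}{\sqrt{2\pi}}\, e^{-n^{2}/2},
\qquad
g'(n) = \frac{-n}{\sqrt{2\pi}}\, e^{-n^{2}/2}, \qquad n \in \mathbb{Z}
```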
  • the x-gradient of a given image is obtained by first convolving the image vertically with the Gaussian, and then convolving it horizontally with the derivative-of-Gaussian.
  • the matrix G for each pixel is calculated with an empirically chosen window area (WA) of e.g. 7*7 around the current pixel, as shown in Fig. 10. Thereafter, the strength of each pixel - i.e. its suitability to form a robust Feature Point - is measured by means of the minimum eigenvalue of the gradient matrix.
  • $\lambda_{\min} = \tfrac{1}{2}\left[\left(\sum g_x^2 + \sum g_y^2\right) - \sqrt{\left(\sum g_x^2 - \sum g_y^2\right)^2 + 4\left(\sum g_x g_y\right)^2}\right]$ (6). All pixels in the image can be represented with their x, y-location and their calculated minimum eigenvalue (z-axis) in a three-dimensional plot. The calculated minimum eigenvalues of the example Y-frame are also shown in Fig. 13, in two dimensions. Finally, the pixels with the highest minimum eigenvalue are selected - from best to worst - in a feature list, ensuring that new additions to the list are at least 10 pixels away in all four directions from all other pixels, which are already selected and added to the list.
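A compact sketch of this selection step, assuming the minimum-eigenvalue (Shi-Tomasi style) criterion, the 7*7 window area and the 10-pixel spacing mentioned above; the function names are illustrative and numpy/scipy are used for the convolutions.

```python
import numpy as np
from scipy.ndimage import convolve1d, uniform_filter

def gradients(y_frame: np.ndarray, sigma: float = 1.0):
    """x- and y-gradients via Gaussian / derivative-of-Gaussian convolution."""
    n = np.arange(-3, 4)
    gauss = np.exp(-n**2 / (2 * sigma**2))
    gauss /= gauss.sum()
    dgauss = -n * gauss / sigma**2                     # derivative of Gaussian
    gx = convolve1d(convolve1d(y_frame, gauss, axis=0), dgauss, axis=1)
    gy = convolve1d(convolve1d(y_frame, gauss, axis=1), dgauss, axis=0)
    return gx, gy

def select_feature_points(y_frame, num_points=100, win=7, min_dist=10):
    """Pick pixels with the largest minimum eigenvalue of the gradient matrix G."""
    gx, gy = gradients(y_frame.astype(np.float64))
    # Sums over the window area WA (win*win) around each pixel.
    sxx = uniform_filter(gx * gx, win) * win * win
    syy = uniform_filter(gy * gy, win) * win * win
    sxy = uniform_filter(gx * gy, win) * win * win
    # Minimum eigenvalue of G = [[sxx, sxy], [sxy, syy]], cf. equation (6).
    lam_min = 0.5 * ((sxx + syy) - np.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2))
    # From best to worst, keep points at least `min_dist` pixels apart.
    order = np.dstack(np.unravel_index(np.argsort(lam_min, axis=None)[::-1],
                                       lam_min.shape))[0]
    selected = []
    for yx in order:
        if all(np.abs(yx - np.array(p)).max() >= min_dist for p in selected):
            selected.append(tuple(yx))
            if len(selected) == num_points:
                break
    return selected, lam_min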
  • Feature Point Tracking: the aim is to track Feature Points from one frame in a video sequence, further referenced as I(x,y,t), to a time-wise succeeding frame I(x,y,t+n). First, the list of Feature Points for frame I(x,y,t) is found as described above by subjecting the frame to the feature-selection procedure. A feature list is obtained.
  • the gradient images gx and gy for both frames, I(x,y,t) and I(x,y,t+n), are calculated as described above. These are hereinafter referred to as g1x, g2x, g1y and g2y.
  • the following procedure is performed for each Feature Point of the feature list of I(x,y,t) with an optimal number of iterations, chosen to be five as derived from Table 1, with the estimated location of the Feature Point in I(x,y,t+n) being updated each time.
  • the procedure is applicable as long as the displacement vector of related Feature Points of F(t) and F(t+n) is limited.
  • the Feature Point displacement decreases with increasing iterations. After five iterations, the determinant of the matrix decreased too much for further calculations. Hence, five iterations proved to be sufficient for tracking the Feature Point to its final location.
  • Each iteration includes the following:
  • it is then checked whether the feature window, which represents an array of intensity values of the neighborhood of each pixel in the Y-frame, has been correctly tracked.
  • WA is the window size, i.e. the product of window height and window width.
  • the Feature Point is discarded if D is greater than a heuristically chosen threshold value of, for instance, 20, for a reference range of zero to 255, which threshold value represents a tradeoff between insufficient correct and too many false trackings. Those Feature Points that are not discarded are expected to have converged, due to the previous five displacement- calculation iterations, to the correct position in F(t+n) constituting the new coordinates of the Feature Point.
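A sketch of the tracking step under the usual Lucas-Kanade reading of the text: five displacement iterations using the gradient matrix G, followed by a dissimilarity check. The exact form of the dissimilarity D is not spelled out in the excerpt; the mean absolute difference between the two feature windows is assumed here, together with the threshold of 20 on a 0-255 range mentioned above.

```python
import numpy as np

def track_feature_point(frame_t, frame_tn, g1x, g1y, point,
                        win=7, iterations=5, d_threshold=20.0):
    """Track one Feature Point from frame_t to frame_tn (KLT-style sketch).

    g1x, g1y are the gradient images of frame_t computed as described above.
    Returns the new (row, col) position, or None if the point is discarded
    because the dissimilarity D exceeds the threshold.
    """
    half = win // 2

    def window(img, r, c):
        r, c = int(round(r)), int(round(c))
        if r - half < 0 or c - half < 0:
            return None
        w = img[r - half:r + half + 1, c - half:c + half + 1]
        return w.astype(np.float64) if w.shape == (win, win) else None

    r0, c0 = point
    template = window(frame_t, r0, c0)
    gx, gy = window(g1x, r0, c0), window(g1y, r0, c0)
    if template is None or gx is None or gy is None:
        return None
    # 2x2 gradient matrix G over the window area WA (cf. equation (6)).
    G = np.array([[(gx * gx).sum(), (gx * gy).sum()],
                  [(gx * gy).sum(), (gy * gy).sum()]])
    r, c = float(r0), float(c0)
    for _ in range(iterations):                        # five iterations suffice
        current = window(frame_tn, r, c)
        if current is None or abs(np.linalg.det(G)) < 1e-9:
            return None
        diff = template - current
        e = np.array([(diff * gx).sum(), (diff * gy).sum()])
        dc, dr = np.linalg.solve(G, e)                 # displacement update
        r, c = r + dr, c + dc
    tracked = window(frame_tn, r, c)
    if tracked is None:
        return None
    d = np.abs(tracked - template).mean()              # assumed dissimilarity D
    return (r, c) if d <= d_threshold else None
```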
  • Scale-Invariant Feature Point Transform: the scale-space model L(x,y,σ) is obtained by convolving the frame for which Feature Points are to be selected (which is henceforth referred to as I(x,y)) with a family of Gaussians of different variance in accordance with the following formula:
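In the standard formulation (cf. Lowe, 2004), this scale-space convolution reads:

```latex
L(x, y, \sigma) = G(x, y, \sigma) * I(x, y),
\qquad
G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}}\, e^{-(x^{2}+y^{2})/2\sigma^{2}}
```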
  • The corresponding differential scale-space representation D(x,y,σ) is given by taking the difference of two scales separated by a constant multiplicative factor k.
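Written out in the same standard formulation:

```latex
D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)
```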
  • this method yields Feature Points which, in contrast to the previous method, additionally have assigned orientations that reflect the dominant direction of the local gradient, which may be considered as a kind of Feature Point signature improving the tracking robustness. This is done in two steps. First, the gradient magnitude m(x,y) and gradient orientation θ(x,y) are calculated for each pixel in frame F(t) at the scale of the Feature Point in accordance with the following formulae:
  • $m(x,y) = \sqrt{\left(L(x+1,y) - L(x-1,y)\right)^2 + \left(L(x,y+1) - L(x,y-1)\right)^2}$ (17).
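The text announces formulae for both the magnitude and the orientation, but only (17) is reproduced; in the standard SIFT formulation the corresponding orientation would be:

```latex
\theta(x, y) = \tan^{-1}\!\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right)
```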
  • an orientation histogram, consisting of the gradient orientations θ(x,y) of all points within the feature window (FW), is calculated around each Feature Point.
  • the orientation histogram consists of 36 bins covering a 360° orientation range. Each sample added to the histogram is weighted by its m(x,y). Subsequently, the highest peak in the histogram is detected and the corresponding θ(x,y) is assigned to the Feature Point, resulting in a Feature Point descriptor vector containing the orientation histogram entries.
  • the Feature Points can therefore be tracked robustly from I(x,y,t) to I(x,y,t+n) by searching for the best matching descriptors, i.e. for the minimal Euclidean distance between Feature Point descriptor vectors. For more information, reference is made to the paper by David G. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, 2004.
  • the pair-wise similarity between selected shot-representative keyframes can be calculated by means of Feature Point selection and tracking.
  • the feature points may be scale-variant feature points or the even more robust Scale-Invariant Feature Transform (SIFT) points, or any other feature points with additional signature information, i.e. having additional information to identify the point and find the best matching point in a reference image.
  • Two keyframes are classified as similar if the number of tracked Feature Points, i.e. Feature Points which have been successfully tracked from keyframe A to keyframe B, exceeds a threshold.
  • the keyframe linker receives, per shot boundary, a set of W (window size as defined in Fig. 6) keyframe-pair similarity results, each consisting of the number of tracked feature points and an index such as the Time Stamps of the keyframes involved. If the number of tracked feature points exceeds a pre-defined SimilarityThreshold Th, the related keyframe pair is labeled as a keyframe link. Time-wise crossing keyframe links point to the existence of a parallel shot group.
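A sketch of this linking rule; the data layout (one record per compared keyframe pair) and the function names are assumptions made for illustration, not structures taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class KeyframePairResult:
    shot_a: int          # earlier shot index
    shot_b: int          # later shot index
    tracked_points: int  # number of Feature Points tracked from A to B

def keyframe_links(results, similarity_threshold):
    """Label keyframe pairs as links when enough Feature Points were tracked."""
    return [(r.shot_a, r.shot_b) for r in results
            if r.tracked_points > similarity_threshold]

def links_cross(link1, link2):
    """True when two links interleave time-wise, pointing to a parallel shot group."""
    (a1, b1), (a2, b2) = sorted(link1), sorted(link2)
    return a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1
```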
  • the performance of the Parallel Shot Detector has been evaluated by means of precision and recall.
  • the parameter recall (Re) represents the percentage of all shots correctly classified as Parallel Shots (Ncorrect) in relation to all ground-truth parallel shots, as defined by $Re = \frac{N_{correct}}{N_{correct} + N_{missed}} \cdot 100$, $Re \in [0\%, 100\%]$.
  • Nmissed represents the number of shots which are members of a parallel shot, but have not been classified as such. They are also called false negatives.
  • precision (Pr) is the percentage of all correctly detected parallel shots in relation to all detected shots classified as parallel shots, calculated by means of $Pr = \frac{N_{correct}}{N_{correct} + N_{fake}} \cdot 100$, $Pr \in [0\%, 100\%]$, where Nfake denotes shots falsely classified as parallel shots, also known as false positives or oversegmentation.
  • Fig. 16 illustrates recall and precision for parallel shot detection.
  • An evaluation was performed, using a 20-hour AV corpus consisting of various genres. It was found that very few shots were falsely classified as Parallel Shots (high precision), although a certain percentage of shots which are members of a PS were not recognized as such (lower recall). Most importantly, however, it was successfully accomplished to cluster shots into PS without crossing Scene Boundaries (ScB). None of the automatically derived PS sequences crossed a manually annotated Scene Boundary. This permits use of the PSD (parallel shot detection) as a pre-processing step to reduce the number of potential scene boundaries by pre-clustering a huge number of shots into Parallel Shots, which increases the precision of Scene Boundary Detection.
  • Fig. 17 illustrates this aspect of the invention.
  • the system comprises a statistical analyzer 11 which receives data from the shot grouper 6 and the parallel shot grouper 9.
  • Statistical data may include, but are not limited to, the average distance between shots in a group, the amount of interleaving (such as, for instance, the number of interleaved groups within a parallel shot), and the average length of a shot and/or of a parallel shot.
  • Such statistical data is indicative of the genre of the data stream, and the statistical analyzer inserts or couples a genre indicator in or to the data stream, for instance, an electronic label at the beginning of the movie.
  • the statistical analyzer 11 may compare the results of the statistical analysis with data in a look-up table 12.
  • Figs. 18 and 19 further illustrate this aspect of the invention.
  • Fig. 19 shows link patterns for movies and magazines.
  • a test revealed that, on average, 46% of the content was part of a parallel shot for a number of movies and 72% for a number of series, while the percentage was much lower, approximately 35%, for magazines.
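A toy illustration of how the parallel-shot percentage could be turned into a genre indicator via a look-up table, as suggested for the statistical analyzer 11 and look-up table 12. The decision boundaries are illustrative assumptions loosely derived from the percentages quoted above, not values from the patent.

```python
def classify_genre(parallel_shot_fraction: float) -> str:
    """Map the fraction of content inside parallel shots to a coarse genre label.

    Thresholds are assumed, based on the approximate figures in the text
    (series ~72%, movies ~46-50%, magazines ~35%).
    """
    if parallel_shot_fraction >= 0.65:
        return "series/soap/sitcom"
    if parallel_shot_fraction >= 0.40:
        return "movie"
    return "magazine"

# Example: a recording in which 72% of the shots belong to parallel shot groups.
print(classify_genre(0.72))  # -> "series/soap/sitcom"
```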
  • the invention is also embodied in any computer program product for a method or system according to the invention.
  • a computer program product should be understood to be any physical realization of a collection of commands enabling a processor, generic or special-purpose, after a series of steps (which may include intermediate conversion steps, such as translation to an intermediate language, and a final processor language) to load the commands into the processor so as to execute any of the characteristic functions of the invention.
  • the computer program product may be realized as data on a carrier such as e.g. a disk or tape, data present in a memory, data conveyed via a network connection, wired or wireless, or program codes on paper.
  • characteristic data required for the program may also be embodied as a computer program product.

Abstract

In a system (2) for processing video (1), the system comprises a processor for processing the video. The processor comprises a shot grouper (6) for grouping shots into groups of visually similar ones of the shots. The shot grouper is operative to compare corresponding feature points in the shots. In a method for processing video, the method comprises automatically grouping shots into groups of visually similar ones of the shots, using a comparison of corresponding feature points in the shots.

Description

System and method for processing video
FIELD OF THE INVENTION
The invention relates to a system for processing video, the video comprising shots, wherein the system comprises a processor for processing the video, the processor comprising a shot grouper for grouping the shots into groups of visually similar ones of the shots.
The invention also relates to a method for processing video, the video comprising shots, wherein the method comprises automatically grouping the shots into groups of visually similar ones of the shots.
BACKGROUND OF THE INVENTION
A system and method as described in the opening paragraphs are known from United States patent US 6,278,446.
In the known system and method, shots representing a continuous action in time and space are grouped into groups of visually similar shots. Shots that are visually similar, such as, for instance, a shot of an anchor person in a broadcast, are recognized and grouped. A hierarchical tree overview of video content is made from the groups.
The real tree structure can thereafter be made interactively by a user. Use is made of automatic processing in conjunction with the user input to provide video organization. Although the known system and method provide a useful tool, the known system and method require a relatively large effort by the user.
OBJECT AND SUMMARY OF THE INVENTION
It is an object of the present invention to provide a system and method of the type described in the opening paragraphs, which are more helpful to the user in organizing the video, thus requiring less human intervention and increasing the speed of organizing the video.
To this end, the system according to the invention is characterized in that the processor is arranged in such a way that the visually similar ones of the shots are grouped into a shot group on the condition that the number of intermediate shots between consecutive ones of the visually similar shots is not larger than a maximal shot window number (Nmax).
The method according to the invention is characterized in that the shots are automatically grouped into groups of visually similar ones of the shots, wherein the visually similar ones of the shots are grouped into a shot group on the condition that the number of intermediate shots is not larger than a maximal shot window number.
The inventors have realized that, although it is difficult to analyze film content objectively since every film director has his or her own style, almost every director fortunately follows some film grammar rules. An almost universally used grammar rule is that a film is divided into scenes, equivalent to a book divided into chapters. A second, frequently used directing instrument is the use of sequences of interleaved groups of shots. Examples of such sequences of interleaved groups of shots are what is called "crosscutting", which is used to visualize time-wise related, location-wise disjoint, parallel-running events, i.e. two or more events that are happening at the same time, but at different locations, and "shot/reverse shot", used to visualize one event such as a dialog between actors, i.e. an event happening at the same time and the same location but taken from different viewing points. Such interleaved groups of shots form a very frequently used video technology. Very often, such sequences span an almost entire semantic video scene. The inventors have realized that the number of different groups of shots in a scene is typically small and the number of shots in between shots belonging to the same group is usually small. Shots that are similar in visual content but are separated by more than a relatively small maximum number of intermediate shots often, in fact almost invariably, belong to different scenes. Limiting the grouping of shots having a similar visual content to those shots that are separated from each other by not more than a maximum number of intermediate shots thus has the beneficial effect of bringing the automatic grouping more in line with the actual story. The known system and method group all shots with similar visual content into a single group, even if they are separated by many intermediate shots. This may well lead to grouping of shots so that shots belonging to different video scenes are grouped into a single group, for instance, all close-ups of a character within a movie. The known system and method thus often give misleading results in the sense that a group of shots, although visually similar, in fact spans a scene boundary and thus, when seen in the context of the story developed in the movie, actually belongs to different parts of the movie. The system and method of the invention reduce and, in fact, virtually eliminate this problem. This will reduce the amount of manual manipulation by the user. The system and method of the invention can be used, for example, to create scene indexes and/or to detect commercials.
In an embodiment, the processor is arranged in such a way that visually similar ones of the shots are grouped into the shot group on the condition that the number of intermediate shots between consecutive ones of the visually similar shots is not larger than a maximal shot window number (Nmax), but larger than one. The exclusion of the predecessor shot N from the similarity check can prevent false grouping of shots when a system has weaker shot boundary detectors (e.g. precision below 97%).
In preferred embodiments, the processor is arranged to identify interleaved groups of shots and to group interleaved groups of shots into a larger group and identify the beginning and end shot of the larger group. Interleaved groups are groups for which a member of one group is time-wise positioned in between two members of the other group. Within the concept of the invention, a group has at least two members. The inventors have found that, provided that the limitation of a maximal shot window number is imposed, such larger groups of interleaved groups almost invariably fall within a scene. Scenes very often begin or end at or near the beginning or end shot of such larger groups. Within the framework of this application, such larger groups will also be referred to as "a parallel shot group". The identified beginning and end shots of a parallel shot group thus provide a good indication of where in the movie the scenes may end or begin. This provides a valuable indication of the end of scenes. Further post-processing may be necessary to pinpoint the end of scenes, but the method does provide a good starting point for pinpointing.
The prior-art system and method do not provide the user with any automatic indication of or near the end of scenes. Ends of scenes are, however, very important and provide semantic, meaningful information. Automatic generation of such indications will be a valuable helping tool for the user.
The invention further relates to shot grouping which, rather than being based on color histograms, is based on tracking of feature points. Feature points, also called salient points, are points where the local gradient is high in vertical as well as horizontal directions. Typical salient points are points where two edges meet in an image. This is particularly advantageous when used in conjunction with previously mentioned aspects, i.e. using a maximal shot window number Nmax to group shots in such a manner that the risk of grouping shots that do not belong within a scene is strongly reduced and that interleaved groups of shots are grouped into larger groups of parallel shots (PS), and the end and/or begin shot of a PS is identified, but this aspect can also be used independently of such measures. Prior-art methods using color similarity tests have serious shortcomings, e.g. for humans, identical-looking frames can have different color histograms due to changes in lighting. In contrast, for humans, dissimilar-looking frames may expose quite similar color histograms. Furthermore, for shots that have been taken at night or in the evening, there is almost always a high color similarity between shots, independent of the actual content, as everything looks grey in the evening or at night. Sudden changes in illumination, such as a cloud passing before the sun, will introduce appreciable changes in color and thereby an apparent change in the shot or scene. Tracking of feature points removes or at least reduces such problems, thereby providing a much more robust method. A further aspect of the invention, which can be used independently or in conjunction with the previously mentioned aspect, is based on the insight that various film genres typically have different interleaving patterns of groups.
According to this aspect of the invention, shots are grouped and interleaved groups of shots are identified, while furthermore interleaving statistics are generated, attributing a genre to the data stream on the basis of these interleaving statistics.
The inventors have realized that such statistics are indicative of the genre of the movie. The actual percentage of content that is part of a parallel-shot sequence is a valuable indication of the genre. Soaps and sitcoms typically contain 70 to 80% of parallel-shots, while movies typically contain about 50% and magazine shows an even lower percentage. This fact can also be used for content classification, solely or in conjunction with the interleaving patterns. Interleaving patterns are also indicative of the genre. The pattern of keyframe links observed in a content item is considered to be a valuable indicator of the nature of the content. Hence, the statistics of linked similar shots are useful for content classification of a movie. As an example, shot sequences of the form ABABAB are characteristic of soap operas and situational comedies. Talk-show content shows similar characteristics, but repeated shots of a single person are often taken from different camera angles and hence might confound conventional linking methods. Movies are considered to exhibit a greater degree of variation - sequences of the type ABCDEDBF are not uncommon.
These and other objects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter. BRIEF DESCRIPTION OF THE DRAWINGS
In the drawings:
Fig. 1 illustrates the invention schematically;
Fig. 2 illustrates interleaved groups of shots; Figs. 3, 4 and 5 illustrate video sequences comprising shots, parallel shots, scenes and scene boundaries;
Fig. 6 illustrates the formation of keyframe links;
Figs. 7 and 8 show images, wherein Fig. 8 shows the Y-frame of the image of Fig. 7; Figs. 9 and 10 illustrate the tracking of feature points;
Figs. 11 and 12 show the X and Y-gradient, respectively, of the Y-frame shown in Fig. 8;
Figs. 13 to 15 illustrate feature or salient points within an image;
Fig. 16 compares the method with reality; Fig. 17 illustrates a system according to a further aspect of the invention;
Figs. 18 and 19 illustrate linking patterns that are typical of certain types of video sequences.
The Figures are not drawn to scale. Generally, identical components are denoted by the same reference numerals in the Figures.
DESCRIPTION OF EMBODIMENTS
In the production of video broadcasts, video frames are the basic units of a broadcast stream. During the recording event of a video sequence, the time-wise successively captured frames of a recording event are clustered together into video shots, which are limited by the start and end instants of the recording event. In the next stage, these shots are e.g. compressed in MPEG-2 format and stored for further post-processing. In the next processing step, the director of e.g. a feature film concatenates multiple shots into larger groups of shots and then into semantically meaningful video scenes, using established and well-defined film grammar rules. The instants at which two shots have been concatenated are called Shot Boundaries (SB). Methods and systems are known which can automatically retrieve the Shot Boundaries using Shot Boundary Detectors (SBD) known, for instance, from WO 2004/075537.
To either increase the tension or emphasize the parallel development of two events at the same time, film producers such as directors use the film grammar technology of interleaved groups of shots. The latter are also used for dialogs (e.g. conversation scenes captured with two cameras placed at different locations) or alternatively, for two time-wise parallel events developing at the same time and being interleaved into each other. Within the scope of the present invention, such interleaved groups of shots are called parallel shot groups. Such parallel shot groups are an essential and very frequently used technique and very often a single parallel shot group spans an almost entire semantic video scene. In general, only an establishing and a conclusion shot are added at both sides to a parallel shot group, which together form a semantic video scene. Occasionally, a scene comprises two parallel shot groups and rarely three or more parallel shot groups. Fig. 1 illustrates the method according to the invention. A video stream 1 is an input for a system 2 according to the invention. The initial requirement for parallel-shot sequence analysis is to segment video streams into individual video shots. First, the shot boundaries are detected in shot boundary detector 3. This leads to a segmentation of the video stream into shots performed in segmentator 4. Consequently, keyframes - representative frames of the individual shots - are identified. In the example shown in Fig. 1, the first and the last frame of each shot are chosen as keyframes, but multiple frames of a shot may also be used for this purpose. Detector 5 detects groups of shots by comparing the content of shots. In this example, this is performed by selection of keyframes, e.g. the begin and/or end frame of a shot, and by comparing the content of begin and end frame of shots with the content of the begin or end frame of other shots. In the next step, the similarity between the identified keyframes of two shots is calculated. If a high similarity has been identified, these keyframes are linked together, which is referred to as keyframe link. Consequently, a keyframe link, established between keyframes of two shots, results in a shot- link, linking these two shots together. Any suitable method for comparing the shots or keyframes of the shots may be used, such as comparing an overall characteristic of the shots or keyframes of the shots, e.g. color or color distribution, or details of keyframes such as edge detection followed by edge comparison. In the known system and method, color and edge similarities are used for shot comparison and subsequent shot grouping. Shots that, upon comparison of similarities, meet similarity conditions are grouped in a group of shots in grouper 6. All of these elements of the system and steps of the method are equivalent or similar to the elements and steps known from the prior-art system and method. The system and method according to the invention differ from the known system and method in that shots that are visually similar are grouped into a shot group on the condition that the number of intermediate shots is not larger than a maximal shot window number Nmax, wherein Nmax is typically smaller than 10, preferably 3 to 7 and more preferably 4 to 6. The Figure also illustrates a preferred embodiment of the invention. 
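A schematic rendering of the comparison step just described: each new shot's keyframes are compared only with keyframes of preceding shots inside the maximal shot window Nmax (the variant of the embodiment that skips the immediate predecessor shot is included as an option). The similarity function is left abstract, e.g. a tracked-feature-point count or a color comparison; the names used here are illustrative, not taken from the patent.

```python
from typing import Callable, List, Tuple

def detect_shot_links(keyframes: List[list],
                      similar: Callable[[object, object], bool],
                      n_max: int = 5,
                      skip_predecessor: bool = True) -> List[Tuple[int, int]]:
    """Return (earlier_shot, later_shot) pairs of visually similar shots.

    keyframes[i] holds the keyframes (e.g. first and last frame) of shot i;
    `similar` compares two keyframes, e.g. by counting tracked feature points.
    Only shots separated by at most n_max intermediate shots are compared.
    """
    links = []
    for j in range(len(keyframes)):
        first_back = 2 if skip_predecessor else 1      # optionally skip shot j-1
        for back in range(first_back, n_max + 2):      # up to Nmax intermediate shots
            i = j - back
            if i < 0:
                break
            if any(similar(ka, kb) for ka in keyframes[i] for kb in keyframes[j]):
                links.append((i, j))
    return links
```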
A series of shots is chained together as soon as a shot-link has been identified, which in general would be a link between time-wise disconnected shots not exceeding the defined Maximal Shot Window (Nmax). As a result of this, all intermediate shots would be grouped into this Parallel Shot sequence as well. Consequently, the Time Stamp (TS) of the last shot of the parallel shot group will be indexed as the 'End of Parallel Shot (PS)'. Time stamps are just an example. For the same purpose, any other index such as frame number or another time index such as characteristic point information (CPI) can be used. When another, e.g. time-wise successive, shot is retrieved, which meets the similarity and Nmax criteria, the Parallel Shot group is extended, including this new shot. This implies that the last Parallel Shot is extended until the current shot. Finally, if no further shots, in the given Nmax range, can be linked to the previous Parallel Shot group, the last TS of the last frame of the last shot of the parallel shot group will be indexed as the definite 'end of parallel shot group' and archived. This is done in parallel shot grouper 9. To clarify this process, if, within a sequence of shots 1 to 17 forming a scene, there is one shot- link from shot 5 to shot 9, another from shot 7 to shot 12 and another from shot 10 to shot 14, then there are three interleaved groups that are identified and the parallel shot group spans shots 5 to 14, with shot 5 being identified as the beginning shot of the parallel shot group and shot 14 as the end shot of the parallel shot group. It is noted that shots 5 and 14 are not visually similar and the prior-art system and method will not identify them as having any relation to each other. The end of a parallel shot group does not necessarily imply that the next shot boundary is a scene boundary. Sometimes, there are two to several conclusion shots after the last shot of a parallel shot group before the scene boundary. Also a scene may comprise two or more parallel shot groups. Post-processing, automatically, semi-automatically or manually, schematically indicated by the broken line to rectangle 10 in Fig. 1, may be needed to establish the true scene boundary. However, the end of the parallel shot indication does provide an important indication of where the scene boundary is located and, during postprocessing, this indication limits processing needed to find the scene boundary automatically or manually. The beginning-end indication of a parallel shot group provides valuable information. The likelihood of the end of a scene falling within a parallel shot group has been experimentally found to be virtually non-existent. In a very high percentage of cases, the ends of scenes are at or near the shots identified as beginning and end shots of parallel shot groups. The maximal shot window number Nmax, indicated by rectangle 8 in Fig. 1 , may be an input for the group shot detection, so that the comparison in the group shot detection is only performed when the number of intermediate shots is below this maximal number. Whether or not this is the case may be deduced from counting the number of shot boundaries detected and/or from the shot segmentation. In simple embodiments, the maximal number may be fixed, or may be manually set in embodiments allowing more intervention by the user. 
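A sketch of how the shot-links could then be chained into parallel shot groups and the end of each group labeled, following the worked example above (links from shot 5 to 9, 7 to 12 and 10 to 14 merging into one group spanning shots 5 to 14). The merging rule and the names are an illustrative reading of the text, not code from the patent.

```python
def parallel_shot_groups(links):
    """Merge overlapping or chained shot-links into parallel shot groups.

    links: iterable of (begin_shot, end_shot) pairs, e.g. from detect_shot_links.
    Returns a list of (begin_shot, end_shot) spans; the second element of each
    span is the shot to be indexed 'end of parallel shot'.
    """
    groups = []
    for begin, end in sorted(links):
        if groups and begin <= groups[-1][1]:      # link starts inside current group
            groups[-1] = (groups[-1][0], max(groups[-1][1], end))
        else:                                      # no link within reach: new group
            groups.append((begin, end))
    return groups

# Worked example from the text: links 5-9, 7-12 and 10-14 chain into one
# parallel shot group spanning shots 5 to 14.
print(parallel_shot_groups([(5, 9), (7, 12), (10, 14)]))  # -> [(5, 14)]
```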
Alternatively, the maximal number may be dependent on the genre of the movie, wherein fast movies such as action movies or science fiction movies are analyzed by using a larger Nmax (for instance, 8) than artistic movies (for instance, 5). Movies often comprise some electronic indication of the genre of the movie and the system according to the invention preferably has a means for setting the number Nmax in dependence on a characteristic of the movie. Other sources of information are e.g. Electronic Program Guides (EPG) as a service or Digital Video Broadcast-Service Information (DVB-SI).
The system and method of the invention thus in fact limit the comparisons made in the shot group detection to shots that are relatively near to each other (separated by not more than a maximum number of intermediate shots). This limitation, which prima facie may seem only to needlessly restrict the comparison of shots, is based on the above-mentioned insight into the organization of most movies and videos. The limitation prevents needless comparisons of shots, thus increasing the speed of calculation, and reduces the number of false positives, i.e. shots that are included into a group on the basis of visual similarity, while yet not belonging to the same group when seen in the context of the story. The known system and method allow grouping of shots that, although visually similar, actually belong to different scenes. It is noted that the limitation of the maximal shot window number Nmax is not imposed because of limited calculation power; given the ever-increasing calculation power available, such a limitation would otherwise become less and less necessary in the future. The limitation stems from the insight mentioned above.
On the one hand, the system and method according to the invention allow methods of comparison using a large calculation power, thus allowing better comparison results. On the other hand, by reducing the comparison to shots that are relatively close to each other, the difference between the shots is usually relatively small, so that even relatively crude comparison methods may be useful.
Fig. 1 illustrates a preferred embodiment of the invention. The inventors have recognized that, when the shots are grouped, scenes are often composed of interleaved groups of shots (usually two to four groups) followed by an ending shot, or alternatively the last shot of the scene is the final shot of one of the interleaved groups. The end shot is provided with a label 'end of parallel shot'. This provides valuable information to a user. The known system and method do not allow an easy way of establishing such information. The present system and method do allow easy determination of the end of a group. Since shots are only compared with shots that are relatively close, it is clear that a particular shot is the end shot of a group if comparison with the next Nmax shots reveals that none of these Nmax shots belongs to the same group. It is not necessary to scan the entire movie for comparable shots. This establishes the end shot of a parallel shot group. The begin shot is then easily found. The intermediate shots may be grouped in one, two or three groups. A structure of two or more interleaved groups of shots is found.
Fig. 2 illustrates the system and method according to the invention. Fig. 2 schematically illustrates a rather complex scene comprising four different groups of shots, schematically indicated by either the grey shade or the pattern of the area representing the shot, for instance, two close-ups and two overviews of the scenery in which the two characters whose close-ups are shown are situated. The scene spans the shots S1 to S12 and may comprise an opening shot S0 and a closing shot S13. Shots S1, S4, S8 and S11 form a first group, S2 and S6 form a second group, S5 and S9 form a third group, and S7 and S12 form the fourth group. S10 is a single shot, not belonging to any group. These shots S1 to S12 will be grouped into four groups using comparison methods to establish shot links. In this example, the number Nmax is set at 5.

To complicate matters, however, shot S25 is visually comparable with shot S12 and shot S35 is visually comparable with shot S9. This is by no means uncommon; quite often, certain shots, for instance close-ups of a certain character, are repeated in different scenes separated some time from each other. None of the other shots S13 to S34 is comparable with S12 or S9 or any of the other shots S1 to S11. In the prior-art system and method, shot S25 is grouped together with shots S7 and S12, and shot S35 is grouped together with shots S5 and S9. This, however, is at best confusing for the user. Shot S25 is clearly not a part of the same scene as shots S1 to S12, and neither is shot S35. These shots may be part of another group of shots with similar visual content but outside the scene S1 to S12 (or S0 to S13). The known system thus provides false positives, i.e. shots that are placed in a group to which they do not belong, or the formation of supergroups which in fact span several scenes or even the whole movie and thus do not contribute much to understanding the setup of the movie. The prior-art system will group all of the close-ups of a certain character, independently of whether or not they belong to the same scene. If the user wants to find the end of the scene, the grouping as performed by the known system and method is of little help, as the user has to check the results manually.

Because of the limitation of the number Nmax, the present system and method will not include S25 and S35 in the groups. S12 will thus be recognized as the end shot of the interleaved groups of shots. This shot is provided with an 'end of parallel shot' indication. This allows the automatic insertion of an indication for or near the end of scenes within a video. The known system does not allow such an indication.
Instead of purely splitting the movie into groups of shots, a clustering approach is followed in these preferred embodiments, wherein interleaved groups of shots are identified and grouped in parallel shot groups, even though there may not be any visual similarity between two individual shots of the parallel shot group. This approach increases the precision of the Scene Boundary Detector (ScBD), since parallel shot sequences indicate sequences that should certainly not be split during the segmentation process. Fig. 3 illustrates a part of a movie. The shot number is plotted on the horizontal axis and the shot duration is plotted on the vertical axis. In the Figure, several parallel shot groups are indicated by PS, and the end of a scene is indicated by ScB. None of the parallel shot groups PS spans a scene boundary ScB. Each scene boundary ScB is relatively close to an end shot of a parallel shot group PS and the beginning shot of the next parallel shot group PS.
Fig. 4 illustrates a scene wherein crosscutting between events A and B is used. The parallel shot group is indicated, as are the end and begin shots of the parallel shot group. This parallel shot group comprises two interleaved groups of shots A and B.
Fig. 5 illustrates a scene in which a shot/reverse-shot is used between two speakers A and B. This parallel shot group PS comprises three interleaved groups of shots A, B and A&B.
Crosscuttings (Fig. 4) or Shot/Reverse-Shots (Fig. 5) are defined as video segments in which the same, or a very similar, camera shot is repeated at least once within a certain time or shot interval, provided that it is shot-wise separated. Using a time interval entails the risk of losing sight of the specific artistic nature of the content. Thus, a shot-dependent window avoids this trap and tackles the problem by starting from a higher semantic level. Furthermore, a shot-based interval reduces the required processing power. For more information on such techniques, reference is made to the book "Film Art" by D. Bordwell and K. Thompson, ISBN 0-07-248455, McGraw-Hill. Flexibility may be increased by converting the Shot-based Interval Window (SIW), defining the number of past shots taken into consideration, i.e. the maximal shot window number Nmax, into a FrameBufferSize Window (W) by multiplying the Shot-based Interval Window (Nmax) with a chosen number of representative Keyframes Per Shot (KPS) and adding a factor α (here α=1 to include the last keyframe of shot N-2), as schematically shown in Fig. 6 and restated symbolically below. Consequently, keyframes - representative frames of the individual shots - are identified. In the present case, the first and the last frame of each shot have been chosen as keyframes, resulting in a Keyframe Per Shot (KPS) value of two, which is a reasonable assumption taking into consideration that the visual content of a given shot is largely consistent. Nevertheless, an adaptive number of Keyframes Per Shot could also have been used for this purpose, proportional to e.g. motion, genre type, emotion content, shot length and shot depth.
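The window conversion just described can be restated symbolically (the formula is not spelled out in the text and is inferred here from the wording; the example values Nmax = 5, KPS = 2 and α = 1 follow the values mentioned in the description):

$$W = N_{max} \cdot KPS + \alpha, \qquad \text{e.g.}\;\; W = 5 \cdot 2 + 1 = 11 \;\text{keyframes}$$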
A further aspect of the invention will now be described, which aspect is particularly advantageous when used in conjunction with the previously mentioned aspect, i.e. using a maximal shot window number Nmax so as to group shots in such a manner that the risk of grouping shots that do not belong within a scene is strongly reduced. Grouping interleaved groups of shots into larger groups of parallel shots (PS) and identifying the end and/or begin shot of a PS can also be used independently of such measures.
US 6,278,446 describes a method in which color histograms are used to group shots. Unfortunately, these methods have their shortcomings: frames that look identical to humans can have different color histograms due to changes in lighting, while frames that look dissimilar to humans may exhibit quite similar color histograms. Edge detection is described in US 6,278,446 as a further means for shot grouping.
The inventors have realized that shot grouping, rather than being based on color histograms, can be based very advantageously on tracking of feature points. Feature points, also called salient points, are points where the local gradient is high in the vertical as well as in the horizontal direction. Typical salient points are points where two edges meet in an image. Feature points have so far been used to estimate camera motion in compressed video, as described by P. Kuhn in "Camera motion estimation using feature points in MPEG compressed domain", Int. Conf. on Image Processing, pp. 596-599, vol. 3, 2000. The feature points may be scale-variant points, but are preferably scale-invariant feature points.
The inventors have realized that feature points are highly distinctive and traceable. The following sections describe methods of selecting and tracking feature points from one frame of a video sequence to another frame.
Scale-variant Feature Point Selection and Tracking

Feature point selection:
In a first pre-calculation step, individual color pixels of the original frame (see Fig. 7 and Fig. 8), e.g. RGB pixels, are reduced to a suitable representative parameter, such as the luminance Y shown in Fig. 8. Subsequently, the minimum eigenvalue of a "gradient matrix" derived from a window around each pixel in the Y-frame is calculated as shown in Fig. 10. Arriving at this matrix requires computation of the horizontal and vertical gradients of the Y-frame first, which is done by convolving the Y-frame with a high-pass function, here the derivative of a Gaussian function. The horizontal gradients are obtained by convolving the high-pass function row-wise on the pixels of the Y-frame, and the vertical gradients are obtained by convolving it column-wise, respectively. Figs. 11 and 12 show the X-gradient and the Y-gradient of the Y-frame. The procedure of gradient computation is elaborated below. The discrete zero-mean, unity-variance Gaussian is given by

$$f(u) = e^{-\frac{u^2}{2}} \qquad \text{and its derivative is} \qquad f'(u) = -u\, e^{-\frac{u^2}{2}} \qquad (1)$$
Herewith, the x-gradient of a given image, e.g. Fig. 11, is obtained by first convolving the image vertically with the Gaussian, and then convolving it horizontally with the derivative-of-Gaussian. Likewise, the y-gradient is obtained by first convolving the image horizontally with the Gaussian and then vertically with the derivative-of-Gaussian, using

$$g_x(x,y) = \sum_{u}\Big[\sum_{v} I(u,v)\, f(y-v)\Big]\, f'(x-u) \qquad (2)$$

$$g_y(x,y) = \sum_{v}\Big[\sum_{u} I(u,v)\, f(x-u)\Big]\, f'(y-v) \qquad (3)$$
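A minimal sketch of the separable gradient computation of equations (2) and (3), assuming numpy and scipy are available; the kernel radius is an illustrative choice not specified in the text:

```python
# Illustrative sketch of the separable gradient computation of eqs. (2)-(3):
# smooth along one axis with the Gaussian, differentiate along the other with
# its derivative. The kernel radius is an assumption.
import numpy as np
from scipy.ndimage import convolve1d

def gradients(y_frame, radius=3):
    y_frame = np.asarray(y_frame, dtype=float)
    u = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-u**2 / 2.0)           # Gaussian, eq. (1)
    dg = -u * np.exp(-u**2 / 2.0)     # derivative of Gaussian, eq. (1)
    # x-gradient: Gaussian vertically (axis 0), derivative horizontally (axis 1)
    gx = convolve1d(convolve1d(y_frame, g, axis=0), dg, axis=1)
    # y-gradient: Gaussian horizontally, derivative vertically
    gy = convolve1d(convolve1d(y_frame, g, axis=1), dg, axis=0)
    return gx, gy
```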
As a result of this operation, two gradient images gx and gy, as shown in Figs. 11 and 12, are obtained. The gradient matrix G for each pixel is given by
$$G = \begin{bmatrix} \sum g_x(x)^2 & \sum g_x(x)\, g_y(x) \\ \sum g_x(x)\, g_y(x) & \sum g_y(x)^2 \end{bmatrix} \qquad (4)$$
The matrix G for each pixel is calculated with an empirically chosen window area (WA) of e.g. 7*7 around the current pixel, as shown in Fig. 10. Thereafter, the strength of each pixel - i.e. its suitability to form a robust Feature Point - is measured by means of the minimum eigenvalue of the gradient matrix. The eigenvalues are derived using G·x = λ·x,

$$\det(G - \lambda I) = 0 \;\Rightarrow\; \lambda_{min} \qquad (5)$$

with x representing the eigenvector, which finally results in the eigenvalues λ+ and λ−

$$\lambda_{\pm} = \tfrac{1}{2}\Big[(g_{xx} + g_{yy}) \pm \sqrt{4\, g_{xy}^2 + (g_{xx} - g_{yy})^2}\Big] \qquad (6)$$

wherein g_xx = Σ g_x(x)², g_yy = Σ g_y(x)² and g_xy = Σ g_x(x)g_y(x) are the entries of G. All pixels in the image can be represented with their x, y-location and their calculated minimum eigenvalue (z-axis) in a three-dimensional plot. The calculated minimum eigenvalues of the example Y-frame are also shown in Fig. 13, in two dimensions. Finally, the pixels with the highest minimum eigenvalue are selected - from best to worst - in a feature list, ensuring that new additions to the list are at least 10 pixels away in all four directions from all other pixels which have already been selected and added to the list. Pixels that do not meet this criterion are discarded as unsuitable Feature Points. This results in a feature list containing only well-spaced Feature Points that can be tracked, as shown by the white dots in Figs. 14 and 15, which show the Y-frame at times t and t+n.

Feature Point Tracking:

The aim is to track Feature Points from one frame in a video sequence, further referenced as I(x,y,t), to a time-wise succeeding frame I(x,y,t+n). First, the list of Feature Points for frame I(x,y,t) is found as described above by subjecting the frame to the feature-selection procedure. A feature list is obtained. Then the gradient images gx and gy for both frames, I(x,y,t) and I(x,y,t+n), are calculated as described above. These are hereinafter referred to as g1x, g2x, g1y and g2y. Subsequently, the following procedure is performed for each Feature Point of the feature list of I(x,y,t) with an optimal number of iterations, chosen to be five as derived from Table 1, with the estimated location of the Feature Point in I(x,y,t+n) being updated each time. The procedure is applicable as long as the displacement vector of related Feature Points of F(t) and F'(t+n) is limited. The Feature Point displacement decreases with increasing iterations. After five iterations, the determinant of the matrix decreased too much for further calculations. Hence, five iterations proved to be sufficient for tracking the Feature Point to its final location.
Table 1. Evaluation of optimal number of iterations.
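The feature-selection step can be sketched as follows, reusing the gradients() helper from the previous sketch; the 7x7 window area and the 10-pixel spacing follow the text, while the number of points kept and the interpretation of the spacing rule as a Chebyshev distance are assumptions:

```python
# Illustrative sketch (not the patent's reference implementation): select
# well-spaced Feature Points by the minimum eigenvalue of the gradient matrix G
# computed over a 7x7 window area, per equations (4)-(6).
import numpy as np
from scipy.ndimage import uniform_filter

def select_feature_points(y_frame, max_points=100, min_dist=10, wa=7):
    gx, gy = gradients(y_frame)                # from the previous sketch
    gxx = uniform_filter(gx * gx, size=wa)     # windowed sums of eq. (4)
    gyy = uniform_filter(gy * gy, size=wa)     # (up to a constant factor,
    gxy = uniform_filter(gx * gy, size=wa)     #  which does not change the ranking)
    # minimum eigenvalue, eq. (6)
    lam_min = 0.5 * ((gxx + gyy) - np.sqrt(4 * gxy**2 + (gxx - gyy)**2))
    # pick pixels from best to worst, enforcing a 10-pixel spacing
    order = np.argsort(lam_min, axis=None)[::-1]
    features = []
    for idx in order:
        y, x = np.unravel_index(idx, lam_min.shape)
        if all(max(abs(y - fy), abs(x - fx)) >= min_dist for fy, fx in features):
            features.append((int(y), int(x)))
        if len(features) == max_points:
            break
    return features
```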
Each iteration includes the following:
For a window area (WA, size 7x7) around the feature point in question, the horizontal and vertical gradient sums sx and sy are calculated by simply adding the corresponding values of the gradient images in both frames

$$s_x(x) = g_{1x}(x) + g_{2x}(a), \qquad s_y(x) = g_{1y}(x) + g_{2y}(a) \qquad (7)$$

wherein a initially has the value x. Then the gradient sum matrix S may be calculated

$$S = \begin{bmatrix} \sum s_x(x)^2 & \sum s_x(x)\, s_y(x) \\ \sum s_x(x)\, s_y(x) & \sum s_y(x)^2 \end{bmatrix} \qquad (8)$$

The 2-by-1 error vector e in the equation

$$S \cdot d = e \qquad (9)$$

is minimized to arrive at an estimate for the position of the Feature Point in I(x,y,t+n), with d representing the displacement vector, wherein e is defined by

$$e = \begin{bmatrix} \sum \big(I(x,y,t) - I(x,y,t+n)\big)\, s_x(x) \\ \sum \big(I(x,y,t) - I(x,y,t+n)\big)\, s_y(x) \end{bmatrix} \qquad (10)$$

By using Cramer's rule, the equation may be solved by means of

$$d_1 = \frac{e_1\, S_{yy} - e_2\, S_{xy}}{S_{xx}\, S_{yy} - S_{xy}^2}, \qquad d_2 = \frac{e_2\, S_{xx} - e_1\, S_{xy}}{S_{xx}\, S_{yy} - S_{xy}^2} \qquad (11)$$

resulting in a displacement vector d, which is added to the estimated location of the current Feature Point in frame I(x,y,t+n), which forms the end of the iteration.
In a subsequent step, it is verified whether the feature window, which represents an array of intensity values of the neighborhood of each pixel in the Y-frame, has been correctly tracked. The average intensity difference D of the original and the tracked window area (WA), in frames I(x,y,t) and I(x,y,t+n), respectively, is computed by means of

$$D = \frac{1}{WA}\sum_{WA}\big(I(x,y,t) - I(x,y,t+n)\big) \qquad (12)$$

wherein WA represents the window size, a product of window height and window width.
The Feature Point is discarded if D is greater than a heuristically chosen threshold value of, for instance, 20 for a reference range of zero to 255, which threshold value represents a trade-off between too few correct trackings and too many false trackings. Those Feature Points that are not discarded are expected to have converged, due to the previous five displacement-calculation iterations, to the correct position in F(t+n), constituting the new coordinates of the Feature Point.
The entire calculation described above is executed for all Feature Points mentioned in the feature list.
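A simplified sketch of the iterative tracking loop of equations (7) to (12); integer-pixel sampling is used, border handling is omitted, frames are assumed to be float arrays, the gradient images are taken as inputs, and the threshold of 20 follows the example value in the text, so this is not a complete implementation:

```python
# Illustrative sketch of the iterative Feature Point tracking of eqs. (7)-(12).
# Integer-pixel sampling is used for brevity; a full implementation would
# interpolate sub-pixel positions and handle image borders.
import numpy as np

def track_point(frame1, frame2, g1x, g1y, g2x, g2y, p, wa=7, iters=5, d_max=20):
    r = wa // 2
    y, x = p                                    # feature point in frame1
    ay, ax = float(y), float(x)                 # estimated location in frame2
    win = np.s_[y - r:y + r + 1, x - r:x + r + 1]
    for _ in range(iters):
        iy, ix = int(round(ay)), int(round(ax))
        win2 = np.s_[iy - r:iy + r + 1, ix - r:ix + r + 1]
        sx = g1x[win] + g2x[win2]               # eq. (7)
        sy = g1y[win] + g2y[win2]
        sxx, syy, sxy = (sx * sx).sum(), (sy * sy).sum(), (sx * sy).sum()
        diff = frame1[win] - frame2[win2]
        e1, e2 = (diff * sx).sum(), (diff * sy).sum()   # eq. (10)
        det = sxx * syy - sxy ** 2
        if abs(det) < 1e-6:                     # matrix became near-singular
            break
        d1 = (e1 * syy - e2 * sxy) / det        # eq. (11), Cramer's rule
        d2 = (e2 * sxx - e1 * sxy) / det
        ay, ax = ay + d2, ax + d1               # update estimated location
    iy, ix = int(round(ay)), int(round(ax))
    win2 = np.s_[iy - r:iy + r + 1, ix - r:ix + r + 1]
    D = (frame1[win] - frame2[win2]).mean()     # eq. (12)
    return (iy, ix) if abs(D) <= d_max else None   # discard if D exceeds threshold
```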
Scale-Invariant Feature Point Selection and Tracking
A further advanced Feature Point selection method is described below and is referred to as Scale-Invariant Feature Transform (SIFT). The scale-space model L(x,y,σ) is obtained by convolving the frame for which Feature Points are to be selected (which is henceforth referred to as I(x,y)) with a family of Gaussians of different variance in accordance with the following formula:
$$L(x,y,\sigma) = G(x,y,\sigma) * I(x,y) \qquad (13)$$

wherein

$$G(x,y,\sigma) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (14)$$

constitutes a two-dimensional Gaussian of variance sigma.
The corresponding differential scale-space representation D(x,y,σ) is given by taking the difference of two scales separated by a constant multiplicative factor k:

$$D(x,y,\sigma) = \big(G(x,y,k\sigma) - G(x,y,\sigma)\big) * I(x,y) = L(x,y,k\sigma) - L(x,y,\sigma) \qquad (15)$$

Within D(x,y,σ), the local maxima are identified on the basis of the following simple condition:

$$|D(x,y,\sigma)| > a \qquad (16)$$

with a heuristically derived threshold a. The resulting local maxima constitute Feature Points, which, in contrast to the previous method, additionally have assigned orientations that reflect the dominant direction of the local gradient, which may be considered as a kind of Feature Point signature improving the tracking robustness. This is done in two steps. First, the gradient magnitude m(x,y) and gradient orientation θ(x,y) are calculated for each pixel in frame F(t) at the scale of the Feature Point in accordance with the following formulae:
$$m(x,y) = \sqrt{\big(L(x+1,y) - L(x-1,y)\big)^2 + \big(L(x,y+1) - L(x,y-1)\big)^2} \qquad (17)$$

$$\theta(x,y) = \arctan\frac{L(x,y+1) - L(x,y-1)}{L(x+1,y) - L(x-1,y)} \qquad (18)$$
Secondly, an orientation histogram is calculated around each Feature Point, consisting of the gradient orientations θ(x,y) of all points within the FW. The orientation histogram consists of 36 bins covering a 360° orientation range. Each sample added to the histogram is weighted by its m(x,y). Subsequently, the highest peak in the orientation histogram is detected and the corresponding θ(x,y) is assigned to the Feature Point, resulting in a Feature Point descriptor vector containing the orientation histogram entries. The Feature Points can therefore be tracked robustly from I(x,y,t) to I(x,y,t+n) by searching for the best matching descriptors, i.e. for the minimal Euclidean distance between Feature Point descriptor vectors. For more information, reference is made to the paper by David G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision, 2004.
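A minimal sketch of the orientation-histogram assignment described above, assuming the m and θ arrays have already been computed per equations (17) and (18) and that θ is given in radians over the full circle (e.g. computed with atan2); the feature-window radius is an illustrative choice:

```python
# Illustrative sketch of the 36-bin, magnitude-weighted orientation histogram
# of a Feature Point, as described above. The window radius is an assumption;
# the text does not fix the feature window size here.
import numpy as np

def orientation_descriptor(m, theta, y, x, radius=8):
    """Build the 36-bin orientation histogram around Feature Point (y, x)."""
    hist = np.zeros(36)
    win_m = m[y - radius:y + radius + 1, x - radius:x + radius + 1]
    win_t = theta[y - radius:y + radius + 1, x - radius:x + radius + 1]
    bins = ((np.degrees(win_t) % 360) // 10).astype(int)   # 10 degrees per bin
    for b, w in zip(bins.ravel(), win_m.ravel()):
        hist[b] += w                        # each sample weighted by m(x,y)
    dominant = 10 * int(np.argmax(hist))    # assigned orientation (degrees)
    return dominant, hist                   # hist serves as descriptor vector

def descriptor_distance(desc_a, desc_b):
    """Smaller Euclidean distance between descriptors means a better match."""
    return float(np.linalg.norm(desc_a - desc_b))
```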
Feature-Point-based frame similarity measure:
For the purpose of Shot grouping and Parallel Shot Detection, the pair-wise similarity between selected shot-representative keyframes, e.g. all I-frames or the first and last I-frame of each shot, can be calculated by means of Feature Point selection and tracking. The feature points may be scale-variant feature points, the even more robust Scale-Invariant Feature Points (SIFT), or any other feature points with additional signature information, i.e. having additional information to identify the point and find the best matching point in a reference image. Two keyframes are classified as similar if the number of tracked Feature Points, i.e. Feature Points which have been successfully tracked from keyframe A to keyframe B, exceeds a threshold.
In tests, the performance of the feature-point-based solution revealed sufficiently high robustness to allow proceeding exclusively with the 'tracked feature point' similarity, saving scarce processing power. Hence, the keyframe linker receives, per shot boundary, a set of W (window size as defined in Fig. 6) keyframe pair similarity results, each consisting of the number of tracked feature points and an index such as the Time Stamps of the keyframes involved. If the number of tracked feature points exceeds a pre-defined SimilarityThreshold Th, the related keyframe pair is labeled as a keyframe link. Time-wise crossing keyframe links point to the existence of a parallel shot group.
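The keyframe linker can be sketched as follows; count_tracked_points() is a hypothetical helper standing in for the feature-point selection and tracking described earlier, and the data layout is an assumption:

```python
# Illustrative sketch of the keyframe linker: two keyframes are linked when the
# number of successfully tracked Feature Points exceeds a SimilarityThreshold.
# count_tracked_points() is assumed to wrap the selection/tracking sketches above.

def find_keyframe_links(keyframes, window_w, similarity_threshold):
    """keyframes: list of (shot_index, frame) tuples in temporal order.
    Returns (shot_a, shot_b) pairs whose keyframes are classified as similar."""
    links = []
    for i, (shot_a, kf_a) in enumerate(keyframes):
        for shot_b, kf_b in keyframes[i + 1:i + 1 + window_w]:
            if shot_b == shot_a:
                continue                 # only shot-wise separated repeats count
            if count_tracked_points(kf_a, kf_b) > similarity_threshold:
                links.append((shot_a, shot_b))
    return links
```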
Two methods for Feature Point detection and tracking have been described in this section, but the invention is not limited to these two methods. Other feature point detection methods including variations of the described methods may be used.
Finally, some results are illustrated.
The performance of the Parallel Shot Detector has been evaluated by means of precision and recall. The parameter recall (Re) represents the percentage of all shots correctly classified as Parallel Shots (Ncorrect) in relation to all ground-truth parallel shots, as defined by
$$Re = \frac{N_{correct}}{N_{correct} + N_{missed}} \cdot 100\,, \qquad Re \in [0\%, 100\%] \qquad (19)$$

wherein N_missed represents the number of shots which are members of a parallel shot but have not been classified as such. They are also called false negatives. In contrast, precision (Pr) is the percentage of all correctly detected parallel shots in relation to all detected shots classified as parallel shots, calculated by means of

$$Pr = \frac{N_{correct}}{N_{correct} + N_{fake}} \cdot 100\,, \qquad Pr \in [0\%, 100\%] \qquad (20)$$

wherein N_fake denotes shots falsely classified as parallel shots, also known as false positives or oversegmentation.
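As a small worked example of equations (19) and (20) (the counts are invented for illustration):

```python
# Illustrative sketch: recall and precision of the Parallel Shot Detector per
# equations (19)-(20), computed from shot-level counts.

def recall(n_correct, n_missed):
    return 100.0 * n_correct / (n_correct + n_missed)

def precision(n_correct, n_fake):
    return 100.0 * n_correct / (n_correct + n_fake)

# e.g. 90 correctly detected PS shots, 10 missed, 2 falsely detected
print(recall(90, 10), precision(90, 2))   # 90.0  ~97.8
```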
Fig. 16 illustrates recall and precision for parallel shot detection. An evaluation was performed using a 20-hour AV corpus consisting of various genres. It was found that very few shots were falsely classified as Parallel Shots (high precision), although a certain percentage of shots which are members of a PS were not recognized as such. Most importantly, however, it was successfully accomplished to cluster shots into PS without crossing Scene Boundaries (ScB). None of the automatically derived PS sequences crossed a manually annotated Scene Boundary. This permits use of the PSD (parallel shot detection) as a pre-processing step to reduce the number of potential scene boundaries by pre-clustering a huge number of shots into Parallel Shots, which increases the precision of Scene Boundary Detection. An article which describes a scene boundary detector and its oversegmentation problem is e.g. J. Nesvadba, N. Louis, J. Benois-Pineau, M. Desainte-Catherine, M. Klein Middelink, 'Low-level cross-media statistical approach for semantic partitioning of audio-visual content in a home multimedia environment', Proc. IEEE Int. Workshop on Systems, Signals and Image Processing (IWSSIP'04), pp. 235-238, Poznan, Poland, September 13-15, 2004.
Furthermore, the inventors have realized that such statistics are indicative of the genre of the movie. The actual percentage of content that is part of a parallel shot sequence is a valuable indication of the genre. Soaps and sitcoms typically contain 70-80% parallel shots, while movies typically contain about 50% and magazines contain even fewer parallel shots. This fact can also be used for content classification, solely or in conjunction with the interleaving patterns. Interleaving patterns are also indicative of the genre. The pattern of keyframe links observed in a content item is considered to be a valuable indicator of the nature of the content. Hence, the statistics of linked similar shots are useful for content classification of a movie. As an example, shot sequences of the form ABABAB are characteristic of soap operas and situational comedies. Talk-show content shows similar characteristics, but repeated shots of a single person are often taken from different camera angles and hence might confound conventional linking methods. Movies are considered to exhibit a greater degree of variation - sequences of the type ABCDEDBF are not uncommon.
Fig. 17 illustrates this aspect of the invention. The system comprises a statistical analyzer 11 which receives data from the shot grouper 6 and the parallel shot grouper 9. Statistical data may include, but are not limited to, the average distance between shots in a group, the amount of interleaving such as, for instance, the number of interleaved groups within a parallel shot, and the average length of a shot and/or of a parallel shot. Such statistical data are indicative of the genre of the data stream, and the statistical analyzer inserts or couples a genre indicator in or to the data stream, for instance, an electronic label at the beginning of the movie. For this purpose, the statistical analyzer 11 may compare the results of the statistical analysis with data in a look-up table 12.
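A sketch of how the statistical analyzer 11 might match such statistics against the look-up table 12; only the rough parallel-shot percentages quoted in the text are taken from the description, while the table layout, the numeric ranges for movies and magazines, and the function name are assumptions:

```python
# Illustrative sketch of the statistical analyzer (11): derive a simple statistic
# from the parallel shot grouping and match it against a look-up table (12).
# Only the soap/sitcom 70-80% figure follows the text; the other ranges are
# illustrative assumptions.

GENRE_TABLE = {                       # look-up table: genre -> PS-content range (%)
    "soap/sitcom": (70, 80),
    "movie":       (45, 60),
    "magazine":    (0, 40),
}

def classify_genre(ps_shot_count, total_shot_count, table=GENRE_TABLE):
    ps_percentage = 100.0 * ps_shot_count / total_shot_count
    for genre, (low, high) in table.items():
        if low <= ps_percentage <= high:
            return genre, ps_percentage
    return "unknown", ps_percentage

print(classify_genre(ps_shot_count=150, total_shot_count=200))  # ('soap/sitcom', 75.0)
```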
Figs. 18 and 19 further illustrate this aspect of the invention.
The inventors have realized that the pattern of keyframe links observed in a content item is a valuable indicator of the nature of the content. Hence, the notion of linking similar shots as well as the segmentation that has been discussed thus far is useful for content classification. As an example, it is noted as already mentioned above that shot sequences of the form ABABAB are characteristic of soap operas and situational comedies. Talk-show content shows similar characteristics, but repeated shots of a single person are often taken from different camera angles and hence might confound conventional linking methods. Movies are considered to exhibit a greater degree of variation - sequences of the type ABCDEDBF are not uncommon. An analysis may be made of the patterns of linkage observed. Figs. 18 and 19 show typical link patterns for different genres, wherein Fig. 18 shows typical link patterns for soaps and talk shows and Fig. 19 shows link patterns for movies and magazines. A test revealed that an average 46% of the content was part of a parallel shot for a number of movies, 72% was part of a parallel shot for a number of series, while the percentage was much lower, approximately 35%, for magazines.
There are a number of distinct approaches for classifying the content, inter alia:
Matching the detected link patterns with the sample link patterns given above. This is most intuitive, but might be computationally expensive and might also lead to false detections.
Counting various parameters (e.g. the total number of links in a given time interval, the number of times a link crosses another link) and using statistical techniques to match these numbers with standard numbers for each genre.
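The second, parameter-counting approach can be sketched as follows; the representation of keyframe links as shot-index pairs and the crossing test are assumptions:

```python
# Illustrative sketch of the parameter-counting approach: count the number of
# keyframe links and the number of times links cross each other. Links are
# assumed to be (shot_a, shot_b) pairs with shot_a < shot_b.

def link_statistics(links):
    crossings = 0
    for i, (a1, b1) in enumerate(links):
        for a2, b2 in links[i + 1:]:
            # two links cross if their shot intervals interleave without nesting
            if a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1:
                crossings += 1
    return {"link_count": len(links), "crossings": crossings}

# ABAB-style interleaving produces crossing links, as in soaps and sitcoms
print(link_statistics([(1, 3), (2, 4), (5, 7), (6, 8)]))
# -> {'link_count': 4, 'crossings': 2}
```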
The invention is also embodied in any computer program product for a method or system according to the invention. A computer program product should be understood to be any physical realization of a collection of commands enabling a processor, generic or special-purpose, after a series of steps (which may include intermediate conversion steps, such as translation to an intermediate language, and a final processor language) to load the commands into the processor so as to execute any of the characteristic functions of the invention. In particular, the computer program product may be realized as data on a carrier such as e.g. a disk or tape, data present in a memory, data conveyed via a network connection, wired or wireless, or program codes on paper. Apart from program codes, characteristic data required for the program may also be embodied as a computer program product.
Some of the steps required for the operation of the method may already be present in the functionality of the processor instead of described in the computer program product, such as data input and output steps.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The invention may be implemented by any combination of features of various different preferred embodiments as described above.

Claims

CLAIMS:
1. A system (2) for processing video (1), the video comprising shots, wherein the system comprises a processor for processing the video, the processor comprising a shot grouper (6) for grouping the shots into groups of visually similar ones of the shots, wherein the shot grouper is operative to compare corresponding feature points in the shots.
2. A system (2) as claimed in claim 1, wherein the processor is arranged to compare scale-invariant feature points.
3. A method for processing video, the video comprising shots, wherein the method comprises automatically grouping the shots into groups of visually similar ones of the shots using a comparison of corresponding feature points in the shots.
4. A method as claimed in claim 3, wherein the feature points are scale-invariant feature points.
5. A system (2) for processing video (1), the video comprising shots, wherein the system comprises a processor for processing the video, the processor comprising a shot grouper (6) for grouping the shots into groups of visually similar ones of the shots, wherein the system comprises an analyzer (11) for analyzing the grouping of the shots.
6. A system as claimed in claim 5, wherein the processor is arranged to provide a genre indication on the basis of the analysis.
7. A system as claimed in claims 5 or 6, wherein the processor is arranged to analyze the grouping of the shots by determining a percentage of shots belonging to the ones of the groups of visually similar shots which comprise more than one shot.
8. A system as claimed in claims 5, 6 or 7, wherein the processor is arranged to analyze the grouping of the shots by determining an interleaving pattern of the groups of visually similar shots.
9. A method for processing video, wherein the method comprises processing the video by automatically grouping shots into groups of visually similar ones of the shots and analyzing the grouping of the shots.
10. A method as claimed in claim 9, wherein a genre indication is provided on the basis of the analysis.
11. A method as claimed in claims 9 or 10, wherein analyzing the grouping of the shots comprises determining a percentage of shots belonging to the ones of the groups of visually similar shots which comprise more than one shot.
12. A method as claimed in claim 9, 10 or 11, wherein analyzing the grouping of the shots comprises determining an interleaving pattern of the groups of visually similar shots.
13. A system (2) for processing video (1), the video comprising shots, wherein the system comprises a processor for processing the video, the processor comprising a shot grouper (6) for grouping the shots into groups of visually similar ones of the shots, wherein the processor is arranged in such a way that the visually similar ones of the shots are grouped into a shot group on the condition that the number of intermediate shots between consecutive ones of the visually similar shots is not larger than a maximal shot window number (Nmax).
14. A system as claimed in claim 13, wherein the processor is arranged in such a way that visually similar ones of the shots are grouped into the shot group on the condition that the number of intermediate shots between consecutive ones of the visually similar shots is not larger than a maximal shot window number (Nmax), but larger than one.
15. A system as claimed in claims 13 or 14, wherein the processor is further arranged to group the intermediate shots into the shot group.
16. A system as claimed in any one of claims 13 to 15, wherein the processor comprises a keyframe selector (7) and the shot grouper compares the data of keyframes so as to establish keyframe links.
17. A system as claimed in any one of claims 13 to 16, wherein the processor is arranged to group shots, using a comparison of corresponding feature points in the shots.
18. A method for processing video, the video comprising shots, wherein the method comprises automatically grouping the shots into groups of visually similar ones of the shots, wherein the visually similar ones of the shots are grouped into a shot group on the condition that the number of intermediate shots is not larger than a maximal shot window number.
19. A method as claimed in claim 18, wherein the visually similar ones of the shots are grouped into the shot group on the condition that the number of intermediate shots is not larger than a maximal shot window number, but larger than one.
20. A computer program product comprising instructions for enabling a programmable device to carry out a method as claimed in any one of claims 3 to 4, 9 to 12 or 18 and 19.
PCT/IB2006/054841 2005-12-22 2006-12-14 System and method for processing video WO2007072347A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP05112746.2 2005-12-22
EP05112746 2005-12-22

Publications (2)

Publication Number Publication Date
WO2007072347A2 true WO2007072347A2 (en) 2007-06-28
WO2007072347A3 WO2007072347A3 (en) 2008-01-31

Family

ID=38189049

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2006/054841 WO2007072347A2 (en) 2005-12-22 2006-12-14 System and method for processing video

Country Status (1)

Country Link
WO (1) WO2007072347A2 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008155094A2 (en) * 2007-06-20 2008-12-24 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Automated method for time segmentation of a video into scenes in scenes allowing for different types of transitions between image sequences
EP2408190A1 (en) * 2010-07-12 2012-01-18 Mitsubishi Electric R&D Centre Europe B.V. Detection of semantic video boundaries
CN105981372A (en) * 2014-03-27 2016-09-28 诺日士精密株式会社 Image processing device
EP2649556A4 (en) * 2010-12-09 2017-05-17 Nokia Technologies Oy Limited-context-based identifying key frame from video sequence
CN111694822A (en) * 2020-04-30 2020-09-22 云南电网有限责任公司信息中心 Low-voltage distribution network operation state data acquisition system and acquisition method thereof

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MIKOLAJCZYK K ET AL: "An affine invariant interest point detector" PROCEEDINGS OF EUROPEAN CONFERENCE ON COMPUTER VISION ECCV 2002, BERLIN, LNCS 2350, SPRINGER-VERLAG, 28 May 2002 (2002-05-28), pages 128-142, XP002305781 *
MIKOLAJCZYK K ET AL: "Scale & Affine Invariant Interest Point Detectors" INTERNATIONAL JOURNAL OF COMPUTER VISION, KLUWER ACADEMIC PUBLISHERS, BO, vol. 60, no. 1, 1 October 2004 (2004-10-01), pages 63-86, XP019216425 ISSN: 1573-1405 *
ROTHGANGER F ET AL: "Segmenting, modeling, and matching video clips containing multiple moving objects" PROCEEDINGS OF THE 2004 IEEE CS CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2004, WASHINGTON, USA, 27 JUNE - 2 JULY 2004, vol. 2, 27 June 2004 (2004-06-27), pages 914-921, XP010708693 ISBN: 0-7695-2158-4 *
YEUNG M ET AL: "Segmentation of Video by Clustering and Graph Analysis" COMPUTER VISION AND IMAGE UNDERSTANDING, ACADEMIC PRESS, SAN DIEGO, CA, US, vol. 71, no. 1, July 1998 (1998-07), pages 94-109, XP004448871 ISSN: 1077-3142 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008155094A2 (en) * 2007-06-20 2008-12-24 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Automated method for time segmentation of a video into scenes in scenes allowing for different types of transitions between image sequences
WO2008155094A3 (en) * 2007-06-20 2009-04-02 Fraunhofer Ges Forschung Automated method for time segmentation of a video into scenes in scenes allowing for different types of transitions between image sequences
US8189114B2 (en) 2007-06-20 2012-05-29 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Automated method for temporal segmentation of a video into scenes with taking different types of transitions between frame sequences into account
EP2408190A1 (en) * 2010-07-12 2012-01-18 Mitsubishi Electric R&D Centre Europe B.V. Detection of semantic video boundaries
JP2012023727A (en) * 2010-07-12 2012-02-02 Mitsubishi Electric R&D Centre Europe Bv Detection of semantic video boundaries
EP2649556A4 (en) * 2010-12-09 2017-05-17 Nokia Technologies Oy Limited-context-based identifying key frame from video sequence
CN105981372A (en) * 2014-03-27 2016-09-28 诺日士精密株式会社 Image processing device
JPWO2015146241A1 (en) * 2014-03-27 2017-04-13 ノーリツプレシジョン株式会社 Image processing device
EP3128744A4 (en) * 2014-03-27 2017-11-01 Noritsu Precision Co., Ltd. Image processing device
US10469793B2 (en) 2014-03-27 2019-11-05 Noritsu Precision Co., Ltd. Image processing device
CN111694822A (en) * 2020-04-30 2020-09-22 云南电网有限责任公司信息中心 Low-voltage distribution network operation state data acquisition system and acquisition method thereof

Also Published As

Publication number Publication date
WO2007072347A3 (en) 2008-01-31

Similar Documents

Publication Publication Date Title
Lu et al. Fast video shot boundary detection based on SVD and pattern matching
US8316301B2 (en) Apparatus, medium, and method segmenting video sequences based on topic
Hanjalic Content-based analysis of digital video
Hanjalic et al. Automated high-level movie segmentation for advanced video-retrieval systems
US8363960B2 (en) Method and device for selection of key-frames for retrieving picture contents, and method and device for temporal segmentation of a sequence of successive video pictures or a shot
US20050228849A1 (en) Intelligent key-frame extraction from a video
US5708767A (en) Method and apparatus for video browsing based on content and structure
CN104520875B (en) It is preferred for searching for and retrieving the method and apparatus that the slave video content of purpose extracts descriptor
EP1067786B1 (en) Data describing method and data processor
Oh et al. Content-based scene change detection and classification technique using background tracking
Javed et al. A decision tree framework for shot classification of field sports videos
Hanjalic et al. Template-based detection of anchorperson shots in news programs
WO2007072347A2 (en) System and method for processing video
Nasir et al. Event detection and summarization of cricket videos
Truong et al. Improved fade and dissolve detection for reliable video segmentation
Lu et al. An effective post-refinement method for shot boundary detection
JP5116017B2 (en) Video search method and system
Liang et al. Design of video retrieval system using MPEG-7 descriptors
CN113255423A (en) Method and device for extracting color scheme from video
Duan et al. Nonparametric motion characterization for robust classification of camera motion patterns
Jin et al. Network video summarization based on key frame extraction via superpixel segmentation
KR20050033075A (en) Unit for and method of detection a content property in a sequence of video images
KR102605070B1 (en) Apparatus for Learning Recognition Model, Apparatus for Analyzing Video and Apparatus for Providing Video Searching Service
Ionescu et al. A contour-color-action approach to automatic classification of several common video genres
Zhang et al. Feature based segmentation and clustering on forest fire video

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06842512

Country of ref document: EP

Kind code of ref document: A2

122 Ep: pct application non-entry in european phase

Ref document number: 06842512

Country of ref document: EP

Kind code of ref document: A2