WO2009150425A2 - Automatic detection of repeating video sequences - Google Patents

Automatic detection of repeating video sequences

Info

Publication number
WO2009150425A2
Authority
WO
WIPO (PCT)
Prior art keywords
sequences
video
frames
sequence
sub
Prior art date
Application number
PCT/GB2009/001460
Other languages
French (fr)
Other versions
WO2009150425A3 (en)
Inventor
Rainer W. Lienhart
Ina DÖHRING
Original Assignee
Half Minute Media Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Half Minute Media Ltd
Publication of WO2009150425A2
Publication of WO2009150425A3


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/16 Analogue secrecy systems; Analogue subscription systems
    • H04N7/162 Authorising the user terminal, e.g. by paying; Registering the use of a subscription channel, e.g. billing
    • H04N7/163 Authorising the user terminal, e.g. by paying; Registering the use of a subscription channel, e.g. billing by receiver means only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/785 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using colour or luminescence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/7854 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using shape
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/38 Registration of image sequences
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23424 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/433 Content storage operation, e.g. storage operation in response to a pause request, caching operations
    • H04N21/4332 Content storage operation, e.g. storage operation in response to a pause request, caching operations by placing content in organized collections, e.g. local EPG data repository
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof
    • H04N21/812 Monomedia components thereof involving advertisement data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8455 Structuring of content, e.g. decomposing content into time segments involving pointers to the content, e.g. pointers to the I-frames of the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/14 Picture signal circuitry for video frequency region
    • H04N5/147 Scene change detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Definitions

  • the present specification relates to the automatic detection of repeating video sequences, such as, for example, commercial spots in a television broadcast. It also relates to an improved method of detecting video sequences using a new matching technique. These video sequences may be, for example, commercial spots, music videos, sports clips, news clips, copyrighted works, video sequences available on the internet or downloadable to phones or other personal communication devices.
  • a commercial spot may be an advertisement, an intro, an outro, or some similar segment of video which is marketing a product, a service, a television show, a program, a sports event etc., or is an introduction to or a closing from a commercial break containing such segments. Detection may be required in order to cut and replace existing advertisements with new ones, or possibly to count commercial spots for statistical reasons and for controlling contracted broadcasts.
  • the first approach focuses on the technical characteristics of commercials, such as high cut frequency, disappearance of the channel logo, increased volume, and the like.
  • the second approach is based on the assumption that commercials are repeated over a period of time. Depending on interests and contracts of the companies that are advertising their products, these recurrences may appear within hours or days.
  • the third approach searches for recurrence of known sequences stored in a database. This approach is, however, reliant on the quality of the data in the database for the detection.
  • techniques for detecting a known video sequence in a video stream include the approximate substring matching algorithm, which employs a frame-by-frame distance measure that is tolerant of small changes in the frames but has a disadvantage of high computational complexity. Improvements are required in matching video sequences to one another, to improve precision and reduce processing resources. It would be desirable to be able to mine for repeating video sequences, such as commercial spots, in a video stream using an automated technique.
  • a method of mining a video stream for repeating video sequences comprising: receiving a video stream; generating hash values for frames of the video stream; identifying frames with matching hash values; identifying sub-sequences of frames within the video stream that each contain a frame with a matched hash value; comparing sub-sequences and determining a match when a sub-sequence contains a threshold level of frames with hash values matched to frames of another sub-sequence; finding sequences of frames, where each sequence contains a group of sub-sequences that have been matched to other sub-sequences; determining whether a sequence of frames in a first portion of the video stream is repeating as a sequence of frames in a second portion of the video stream by comparing the groups of matched sub-sequences and determining whether a threshold level of similarity is met; and when a repeating video sequence is detected, detecting start and end frames of the repeating video sequence.
  • the method of the present invention is applicable to any type of repeating video sequence.
  • the repeating video sequences being mined are commercial spots. These can be filtered from other repeating sequences, such as music videos and news clips, by selecting repeating video sequences with durations which are typical for spots such as multiples of 5 seconds, as well as which are within a given interval specified by a minimum duration and a maximum duration.
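The duration filter described above can be sketched as follows. This is not part of the patent text: the function name, the 25 fps frame rate, the duration bounds and the tolerance of 5 frames are illustrative assumptions.

```python
def is_spot_duration(num_frames, fps=25, min_s=5, max_s=90, tol_frames=5):
    """Accept a candidate repeating sequence as a possible commercial
    spot if its duration lies within [min_s, max_s] seconds and is
    close to a multiple of 5 seconds (bounds and tolerance assumed)."""
    if not (min_s * fps - tol_frames <= num_frames <= max_s * fps + tol_frames):
        return False
    step = 5 * fps  # number of frames in a 5-second multiple
    remainder = num_frames % step
    # distance to the nearest multiple of 5 seconds, in frames
    return min(remainder, step - remainder) <= tol_frames
```

A 30-second spot at 25 fps (750 frames) passes this filter, while a 16-second sequence (400 frames) is rejected as an atypical duration.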
  • the purpose of the step of detecting the start and end frames for the repeating video sequence is to enable the sequence to be identified as a commercial spot.
  • the detection of a new commercial spot can be determined upon its repetition within a certain time, typically one or two days, and by its typical durations, i.e., 10, 15 or 30 seconds. Therefore, it is able to work independently of the non-inherent or only partly inherent properties mentioned above. This brings the advantage that it is much more universal.
  • the algorithm above is used to mine automatically for commercial spots, or other repeating video sequences, and then automatically update a database without human intervention being required to manage or assess the results.
  • An important application of this algorithm is in the use of automated mining for commercial spots in order to update automatically a database of fingerprints for the commercial spots, which is then used in an automated apparatus that performs ad detection and replacement on a live broadcast in real-time.
  • in this method there is a first step of identifying small similar or identical sequences (the "sub-sequences"), then a second step where longer sequences or segments (the "sequences") containing a similar or identical group of sub-sequences are matched, before a last step of determining the exact start and end positions of the repeating video sequences.
  • the main advantage of this method is its greater robustness concerning small differences between repetitions as well as its independence from any temporal video segmentation such as shot detection.
  • the present invention provides a method of mining a video stream for repeating video sequences comprising: receiving a video stream; generating hash values for frames of the video stream; identifying candidate frames which have the same hash values as other frames; identifying a test sub-sequence of frames containing a candidate frame; comparing the test sub-sequence to a candidate sub-sequence which contains a frame with the same hash value as the candidate frame, to determine if the sub-sequences are sufficiently similar to be considered a match; identifying a test sequence containing a group of sub-sequences, the test sequence corresponding to a candidate sequence in another portion of the video stream containing a group of similar sub-sequences that are considered to match those of the test sequence; determining whether the test sequence is sufficiently similar to the candidate sequence to be considered a match; and identifying a beginning and end of the matched sequence of frames to identify a repeating video sequence.
  • the frames are considered as "matched" when a threshold of frame similarity is reached, which includes frames that are similar but not necessarily identical.
  • This may be a function of the hash value; for example, use of a hash function which is calculated for an entire frame that results in the same value for similar images (for example, frames from the same scene which include substantially the same features but are spaced temporally, so that a feature may have shifted slightly but not enough to change the hash value), or use of a hash function which is more discriminative and calculated for a set of regions within a frame where only a threshold level, which is less than 100%, of the best matching regions have to be the same for a match to be declared (for example, a match may still be found even where a banner is superimposed over a portion of the image because the regions corresponding to the banner are ignored).
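One hedged way to realise the region-based, banner-tolerant frame match described above. This is a sketch only: the 64-region count appears elsewhere in the specification, but the 75% threshold is an assumed value, and per-region hashes are simplified to integers.

```python
def regions_match(hashes_a, hashes_b, min_fraction=0.75):
    """Declare two frames "matched" when at least min_fraction of their
    per-region hash values agree, so a banner superimposed over a few
    regions does not prevent the match (threshold is an assumption)."""
    assert len(hashes_a) == len(hashes_b)
    same = sum(1 for a, b in zip(hashes_a, hashes_b) if a == b)
    return same / len(hashes_a) >= min_fraction
```

With 64 regions, a frame whose bottom 8 regions are covered by a banner (56/64 agreeing) still matches, while a frame agreeing in only 40 of 64 regions does not.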
  • the frame matching step is tolerant to slight differences in the image.
  • the sub-sequences are considered as "matched” when a threshold of subsequence similarity is reached, which in addition to identical sub-sequences, also includes sub-sequences that are similar in that they contain "matched" frames in the same order but the "matched" sub-sequence may include additional frames or may be missing frames as compared to the candidate sub-sequence.
  • This has the advantage of making the subsequence matching step tolerant at a fine scale to slight differences in the sub-sequence of frames which might be the result of broadcasters' preferences or the equipment the video stream is received through.
  • sequences are considered as "matched” when a threshold of sequence similarity is reached, which in addition to identical sequences, also includes sequences that are similar, in that they contain "matched" sub-sequences in the same order but the "matched” sequence may include additional frames or may be missing frames as compared to the candidate sequence.
  • This has the advantage of making the sequence matching step tolerant, at a coarser scale than the frame and sub-sequence matching steps, to slight differences in the sequence of frames which might be the result of broadcasters' preferences or the equipment the video stream is received through.
  • the algorithm uses all three of these matching preferences to produce a method that is particularly tolerant to noise in the video signal and yet can still identify repeating video sequences with a high degree of precision.
  • the hash values may be generated for entire frames, or more preferably are generated for regions of a frame, and more preferably still a plurality of regions within a frame.
  • Each hash value may represent a collection of frames or frame portions, or more preferably corresponds to a single frame, in particular a collection of regions (for example, 64 regions) captured from a frame, in particular a captured set of regions from within a frame which, when recombined, cannot be used to reconstruct the frame.
  • the method may include the step of capturing data representing a plurality of regions from frames of a video stream. This may be performed at the same or a different location to the step of generating hash values for frames of the video stream.
  • the captured data representing a plurality of regions from frames is preferably in the form where it is not possible to reconstruct the original frames from the data because less than an entire frame is captured in each set of plurality of captured regions.
  • This data may be stored or transmitted to another location prior to generating the hash values.
  • the data may be in a modified form, for example, as feature vectors or some other functional representation, representing each of a combination of the plurality of regions in a given frame, the combination constituting less than an entire frame.
  • the hash value is preferably generated from a function of colour, for example, feature vectors like colour patches, gradient histograms, colour coherence vectors, colour histograms, etc.
  • colour patches, and in particular colour patches based on average colour, and gradient histograms are most preferred because these reflect the spatial distribution within a frame. They are also particularly quick to calculate and to compare values for, as well as providing excellent recall and precision (recall can be defined as the number of positive matches found divided by the number of relevant sequences in a test video stream, and precision can be taken as the number of positive matches divided by the number of found sequences).
  • gradient histograms are preferred for their greater discriminative power where the slight additional computational time over, say, colour patches, is of less concern.
  • the feature vectors that are used generate hash values which have a scalar feature value.
  • the hash value is generated from a gradient histogram algorithm.
  • the gradient histogram algorithm, or other hash-value-generating algorithm, generates 1-byte integer values (for example, preferably the gradient histogram values are reduced and mapped to 1-byte integer values). This leads to benefits not only in reducing the fingerprint size for the frame, but also drastically reduces the search time in the database.
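The specification does not give the exact reduction, but one plausible sketch of mapping a frame's gradient histogram to a single 1-byte value is shown below. The 8-bin orientation histogram and the particular quantisation (dominant bin in the high 3 bits, coarse strength in the low 5 bits) are illustrative assumptions.

```python
import numpy as np

def gradient_histogram_hash(gray):
    """Reduce a greyscale frame to a 1-byte hash via a magnitude-weighted
    gradient-orientation histogram (binning/quantisation are assumptions)."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                      # orientations in [-pi, pi]
    edges = np.linspace(-np.pi, np.pi, 9)         # 8 orientation bins
    hist, _ = np.histogram(ang, bins=edges, weights=mag)
    hist = hist / (hist.sum() + 1e-9)             # normalise to a distribution
    dom = int(hist.argmax())                      # dominant orientation: 3 bits
    strength = int(min(hist.max() * 32, 31))      # coarse peak strength: 5 bits
    return (dom << 5) | strength                  # fits in one byte (0..255)
```

Frames with horizontal versus vertical dominant gradients receive different byte values, which is the discriminative property the comparison relies on.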
  • the sub-sequences are shorter than the sequences and may represent between 5 and 500 frames, more preferably between 10 and 100 frames and more preferably still between 15 and 50 frames.
  • the sub-sequences being matched are between 20 and 30 frames, most preferably the candidate sub-sequence is 25 frames (which corresponds to 1 second at 25 frames per second).
  • the test sub-sequence, which is being compared to the candidate sub-sequence is preferably the same length as the candidate sub-sequence, for example, it may be 25 frames, but also it may be slightly longer or shorter, for example, where there are additional or dropped frames in the test subsequence as compared to the candidate sub-sequence.
  • a threshold might be set at ±5 frames on the candidate sub-sequence length.
  • the sequences correspond to commercial spots; these will typically be of the order of 15, 30 or 45 seconds in duration.
  • a 30 second commercial, for example, would have 750 frames at 25 frames per second, so a sequence may correspond to 750 frames ⁇ 10 frames.
  • the method preferably includes the step of building an inverted image index, for example, by means of a hash table.
  • the hash values are ordered in ascending (or descending) value and entries are linked to the hash value, each entry identifying a frame (or set of frames) in the video stream with that hash value.
  • an image index is created that is ordered by its hash value. This allows for a fast look up by holding image locations for a corresponding hash value.
  • the hash value, which can be derived from a feature vector like a gradient histogram, can identify the similar or identical frames easily.
  • each entry in the inverted image index identifies not only the frame number, i.e. its position within the video stream, but also the number of consecutive frames which share the same hash value. This has the advantage of reducing the size of the inverted image index which can speed up search times.
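The inverted image index with run-length entries can be sketched as follows. This is a minimal illustration: the layout of hash value mapping to a list of (start frame, run length) pairs is inferred from the description, and the function name is an assumption.

```python
from collections import defaultdict

def build_inverted_index(frame_hashes):
    """Inverted image index: hash value -> list of (start_frame, run_length),
    collapsing consecutive frames that share the same hash value to keep
    the index small and look-ups fast."""
    index = defaultdict(list)
    i, n = 0, len(frame_hashes)
    while i < n:
        h = frame_hashes[i]
        j = i
        while j + 1 < n and frame_hashes[j + 1] == h:
            j += 1                       # extend the run of identical hashes
        index[h].append((i, j - i + 1))  # (position, consecutive count)
        i = j + 1
    return dict(index)
```

For the hash stream `[7, 7, 7, 3, 3, 7]` the index holds two entries for hash 7 (a run of three frames at position 0 and a single frame at position 5) and one run of two frames for hash 3.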
  • hash tables may be incorporated into the algorithm to facilitate matching of the sequences. For example, fingerprints of identified repeating video sequences may be incorporated into a hash table to detect matches with a database of fingerprints of previously found or known repeating video sequences. In this way new repeating video sequences can be identified easily and incorporated into updates for commercial spot detection and replacement apparatus. Hash tables may also be used in the sub-sequence and sequence matching described above.
  • the video stream is searched through image by image, generating hash values for the hash table in the step of building an inverted image index.
  • This is possible with feature vectors like colour patches and gradient histograms because they are relatively quick and easy to calculate.
  • matching sub-sequences are identified, for example a first sub-sequence in a first portion of the video stream which shares a similar or identical pattern of "matching" frames with a second sub-sequence in a second portion of the video stream.
  • when a threshold number of frames in a particular sub-sequence (also referred to herein as a "short sequence") are matched, the two small video sequences are considered to be "matched".
  • this threshold level is calculated as a match ratio indicating the number of matched frames as a fraction of the candidate length.
  • a minimum match ratio of 0.2 or greater is used in the algorithm, more preferably it is equal to or greater than 0.32, and in the most preferred embodiments a minimum match ratio of 0.4 is used as the threshold to decide when a test sub-sequence matches a candidate subsequence. This corresponds to 10 matching frames in a candidate length of 25 frames.
  • this minimum match ratio is less than 0.8, more preferably less than 0.6 and most preferably less than 0.48, so that only a minimum of comparison needs to be made before a match is declared and the further searching can be proceeded with.
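A minimal sketch of the sub-sequence match ratio described in the bullets above, using the preferred threshold of 0.4. Frame hashes are simplified here to one integer per frame, and the function name is an assumption.

```python
def subsequence_matches(test_hashes, cand_hashes, min_match_ratio=0.4):
    """Match ratio = matched frames / candidate length; the pair is
    declared "matched" once the ratio reaches the threshold, so not
    every frame of the sub-sequence needs to match."""
    matched = sum(1 for t, c in zip(test_hashes, cand_hashes) if t == c)
    return matched / len(cand_hashes) >= min_match_ratio
```

With a candidate length of 25 frames, 10 matching frames (ratio 0.4) is enough to declare a match, while 9 matching frames (ratio 0.36) is not.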
  • a last matching frame of the test sub-sequence may be identified and compared to the candidate sub-sequence to check that it corresponds to a frame member within a possible set of frames comprising the matched candidate frame and the subsequent frames of matching hash value.
  • a distance between the sub-sequences is calculated, and if it meets a predetermined threshold level, a match is detected.
  • the distance might be calculated from the hash values for each frame or a portion of the frames, for example, by counting the number of hash matches, or by performing some other such calculation based on the hash values.
  • the results are then compared, and if these numbers are within a predetermined percentage of each other, for example, 20% or more hash matches for a candidate sub-sequence of 25 frames, then a match is detected.
  • an important feature of this invention is that it is not necessary for all the frames in the sub-sequence to be matching; a similarity threshold, which is less than 100%, would be set to determine when a test sub-sequence is considered to be "matched” to another candidate sub-sequence.
  • the test sub-sequence may contain additional frames that do not "match” or have dropped frames, and the "matched" frames may only be similar rather than identical.
  • the sub-sequences may start at every frame whenever a "matched" frame is detected.
  • the distance between the start of each sub-sequence can be determined to decide whether it meets a certain threshold to be a new sub-sequence within a group or to determine whether it might be part of a new sequence.
  • pairs of similar short sequences are aligned to identify longer sequences (the “sequences").
  • a similarity threshold may be set where a certain number of paired sub-sequences need to be identified within a predetermined length of the video stream before a match is detected. Small differences in the longer sequences can be tolerated because the algorithm is basing the matching on much smaller similar or identical pairs of sub-sequences, which in turn is based on similar or identical frames being matched.
  • the method includes a step of aligning the sequences.
  • This may shift the sequences slightly, allowing a better alignment for estimating the start and end frames. This can be done by estimating occurring shot boundaries or, alternatively, by finding the offset with the minimum distance between both sequences. Cut detection may be performed, and where three or more cuts are located, this can be used to help align the sequences.
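Finding the offset with the minimum distance between both sequences could be sketched as below; the ±5 frame search window is an assumed parameter, and frame hashes are simplified to integers.

```python
def best_alignment_offset(seq_a, seq_b, max_shift=5):
    """Return the shift of seq_b relative to seq_a (in frames) that
    minimises the fraction of non-matching frame hashes in the overlap."""
    best_off, best_dist = 0, float("inf")
    for off in range(-max_shift, max_shift + 1):
        if off >= 0:
            a, b = seq_a[off:], seq_b
        else:
            a, b = seq_a, seq_b[-off:]
        overlap = min(len(a), len(b))
        if overlap == 0:
            continue
        dist = sum(1 for x, y in zip(a[:overlap], b[:overlap]) if x != y) / overlap
        if dist < best_dist:
            best_dist, best_off = dist, off
    return best_off
```

If the second copy of a sequence starts two frames later in the stream, the search recovers that shift before the start and end frames are estimated.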
  • a frame-by-frame comparison backward and forward is carried out for the exact estimation of the start and end positions, respectively.
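The backward and forward frame-by-frame expansion for estimating the exact start and end positions might look like this sketch; the function name and the strict stop-on-first-mismatch rule are assumptions.

```python
def expand_match(h1, h2, i, j):
    """From a matched anchor frame pair (i in h1, j in h2), walk backward
    and then forward while frame hashes agree, returning the exact
    inclusive [start, end] span of the repeat in each stream."""
    s1, s2 = i, j
    while s1 > 0 and s2 > 0 and h1[s1 - 1] == h2[s2 - 1]:
        s1 -= 1
        s2 -= 1
    e1, e2 = i, j
    while e1 + 1 < len(h1) and e2 + 1 < len(h2) and h1[e1 + 1] == h2[e2 + 1]:
        e1 += 1
        e2 += 1
    return (s1, e1), (s2, e2)
```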
  • all found recurring sequences are compared with each other, because often there are pairwise repetitions, for example, two commercial spots which are often, but not always, repeated one after the other. Consequently, identifying the component parts means that the individual commercial spots can then be recognised.
  • the method also includes a step of filtering, which may be performed on the matched sub-sequences and/or the matched sequences.
  • a minimum length requirement may be applied to reject short matches, which might be caused by accidental matches of unrelated sub-sequences.
  • This might be set to five or ten seconds, or 125 or 250 frames, for example.
  • matches of frames or subsequences that are very close to the test frame may be rejected, for example, within 80 seconds or 2000 frames, to avoid detection within the same scene.
  • Filtering may also be conducted to reduce the number of candidates that need to be investigated further, and in this way optimise processing resources.
  • Counts may be conducted to monitor the number of "matches" and/or "non-matches", and where these meet a particular threshold, the candidate sub-sequence or sequences can be either investigated more thoroughly or discarded from the candidate list as appropriate. For example, in one embodiment, during a frame-by-frame comparison, a count is kept of the number of consecutive non-matching frames detected, and if this exceeds a threshold, the candidate sub-sequence is removed from further investigation. In another, a count is kept of the number of matching frames in the candidate sequence, and only if it exceeds a threshold is it moved on for further investigation.
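These filters can be sketched together as a cheap candidate pre-check. The 2000-frame proximity threshold follows the example above, while the limit of 5 consecutive non-matching frames and the combination of both checks into one function are illustrative assumptions.

```python
def keep_candidate(test_hashes, cand_hashes, test_pos, cand_pos,
                   min_gap_frames=2000, max_consecutive_misses=5):
    """Pre-filter: reject candidates too close to the test position
    (likely the same scene), then abort on a long run of consecutive
    non-matching frames before any full comparison is attempted."""
    if abs(test_pos - cand_pos) < min_gap_frames:
        return False  # within ~80 s at 25 fps: probably the same scene
    run = 0
    for t, c in zip(test_hashes, cand_hashes):
        if t == c:
            run = 0
        else:
            run += 1
            if run > max_consecutive_misses:
                return False  # too many consecutive misses: discard early
    return True
```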
  • the repeating video sequence is compared against the already found sequences to identify if it is a new one. This may be achieved by comparing the repeating video sequence, or more preferably, a fingerprint or set of fingerprints of the repeating video sequence, against a reference database of such fingerprints or set of fingerprints for already found repeating video sequences, in order to detect a match. If it is a new repeating video sequence, preferably the method includes the step of automatically updating the reference database with the new repeating video sequence or fingerprint(s) thereof.
  • the present invention provides a method of identifying the recurrence in a video stream of a stored sequence, the method comprising: receiving a video stream; generating hash values for frames of the video stream; comparing the hash values from the video stream to hash values for frames of a stored sequence to identify frames with matching hash values; identifying sub-sequences of frames in the video stream that each contain a frame with a matched hash value to a frame of the stored sequence; comparing sub-sequences of frames in the video stream to candidate sub-sequences of frames in the stored sequence and determining a match when a sub-sequence contains a threshold level of frames with hash values matched to frames of a candidate sub-sequence; identifying a sequence of frames, where the sequence contains a group of sub-sequences that have been matched to candidate sub-sequences; and determining whether the sequence of frames in the stored sequence is recurring as a sequence of frames in the video stream by comparing the groups of matched sub-sequences and determining whether a threshold level of similarity is met.
  • the same matching technique of (i) identifying "matched" frames (which may be identical or similar), (ii) identifying "matched" sub-sequences (which may be identical or similar), and (iii) identifying "matched" sequences (which may be identical or similar), is applied to the process of searching for a stored video sequence within a video stream.
  • the sequence may be known in the sense of it being a known commercial spot, or it may be unknown, in the sense of being known to be a commercial spot because of its characteristics but unknown in terms of content.
  • the method has particular utility in connection with an automated system which first mines television broadcasts according to the first aspect, uses the found repeating sequences to build a database of sequences that are considered to be commercial spots, and then uses the commercial spots as stored sequences in the second aspect to mine automatically an incoming video stream, detect commercial spots automatically and then replace the sequences automatically with alternative sequences.
  • this detection and replacement is conducted on a live video stream, with the detection and replacement being conducted in real-time (i.e., with a delay which is essentially unnoticed by the observer, for example one or two seconds).
  • the commercial spots are stored on the database as a fingerprint or set of fingerprints that can be used to identify the sequence of frames and the detection and replacement apparatus is configured to mine the incoming video stream by looking for matching fingerprints.
  • the present invention also extends to the automatic generation of a database which is built from repeating video sequences identified from the method of the first aspect.
  • the database can be sold or leased to operators of commercial spot detection and replacement apparatus.
  • the method can be used to automatically identify new commercial spots and add these to an existing database, in order to update a reference database for the method of the second aspect.
  • the accuracy and reliability of the commercial spot detection and replacement apparatus is dependent on the reference database. While the algorithm is optimised as far as possible for accuracy and reliability, inevitably a balance exists with the amount of processing resources required to achieve this. In practice, the odd miss-detection at the ad detection and replacement apparatus may go unnoticed, and so while it is preferred to have no errors in the database if at all possible, a few may be tolerated. For the present purposes, it is preferred to try to capture as many fingerprints of commercial spots as possible, so that the ad detection and replacement apparatus has a large reference database to refer to, rather than to ensure that 100% of these are actual commercial spots.
  • the database contains information about the sequences which cannot be used to recreate the sequences, only identify the sequences.
  • the database may comprise the hash values discussed above, which have been calculated for regions of frames (preferably without the whole frame ever being captured from the source video stream), that are indexed to the images to identify and recognise particular sequences within the video stream.
  • the steps of the first or second method may be conducted by a processor of a computer which is programmed to execute the algorithms of the first and / or second aspect of the present invention.
  • the algorithm of the first aspect will be conducted on a first computer in a first location and the algorithm of the second aspect will be conducted on a second computer in a second location.
  • a communication link may be provided to update the second computer with data of the repeating video sequences identified by the first computer.
  • the present invention provides a system comprising: a) a fingerprint database generating apparatus having: an input for inputting a video stream; a processor which is programmed with a first algorithm to analyse the video stream in order to identify repeating video sequences; a fingerprint database which is updated automatically with the fingerprints of a repeating video sequence when a new repeating video sequence is detected; and an output for outputting fingerprint data of detected repeating video sequences, b) a detection and replacement apparatus, which is adapted to perform video sequences detection (e.g., commercial spot detection) and replacement on a live video broadcast in real time, the detection and replacement apparatus having: a video input for receiving a video broadcast; a video output for outputting a video signal to an observer; a video switch for selecting a source of the outputted video signal; and a processor which is programmed with a second algorithm in order to detect a known video sequence (e.g., a commercial spot) by generating fingerprints of the video broadcast, comparing the fingerprints to stored fingerprints
  • the video stream that is being received by the input for the fingerprint database generating apparatus might be in the form of the original broadcaster's video signal or a multi-channel video signal. More preferably, however, it is a data stream representing just a captured portion or captured regions of frames from the broadcaster's video signal or multi-channel signal, the video stream then representing less than the entire frames of the original broadcast and it not being possible to reconstruct the original broadcast from the captured portion or regions.
  • the video stream is a captured video signal or captured multi-channel signal.
  • the system may include a broadcast capturing apparatus which is arranged to receive a broadcast in the form of a video signal or multi-channel signal from a broadcaster via an input, the apparatus having a video processing device to capture only a portion or regions of frames from the broadcast and an output for outputting a video stream corresponding to the broadcast but omitting information which could allow the broadcast to be reconstructed to its original form.
  • a broadcast capturing apparatus which is arranged to receive a broadcast in the form of a video signal or multi-channel signal from a broadcaster via an input, the apparatus having a video processing device to capture only a portion or regions of frames from the broadcast and an output for outputting a video stream corresponding to the broadcast but omitting information which could allow the broadcast to be reconstructed to its original form.
  • the broadcast capturing apparatus may be part of an assembly also containing the fingerprint database generating apparatus, but is more preferably part of a separate device at a different location which communicates with the fingerprint database generating apparatus via a network.
  • the present invention can also be seen to provide an apparatus for mining a video stream for repeating video sequences comprising: a processor which receives a video stream that is being fed into an input; wherein the processor is programmed to execute the steps of: generating hash values for frames of the video stream; identifying frames with matching hash values; identifying sub-sequences of frames within the video stream that each contain a frame with a matched hash value; comparing sub-sequences and determining a match when a sub-sequence contains a threshold level of frames with hash values matched to frames of another subsequence; finding sequences of frames, where each sequence contains a group of subsequences that have been matched to other sub-sequences; determining whether a sequence of frames in a first portion of the video stream is repeating as a sequence of frames in a second portion of the video stream by comparing the groups of sub-sequences and determining whether a
  • the processor is coupled to a database which stores the detected repeating video sequences, either as the repeating video sequences themselves in a normal or compressed form, or more preferably as a fingerprint which can be used to identify the sequence, for example, a fingerprint containing hash value data for the sequence.
  • This apparatus may form part of the fingerprint database generating apparatus of the system described above.
  • the database which stores the detected repeating video sequences is accessed each time a new repeating video sequence is detected to check if it has been detected previously (for example, by the fingerprint database generating apparatus or by a different apparatus located in another location which may be mining a different video stream and updating the database with new repeating video sequences when they are detected).
  • the present invention can also be seen to provide an apparatus for identifying the recurrence in a video stream of a stored sequence, the apparatus comprising: a processor which receives a video stream that is being fed into an input; wherein the processor is programmed to execute the steps of: generating hash values for frames of the video stream; comparing the hash values from the video stream to hash values for frames of a stored sequence to identify frames with matching hash values; identifying sub-sequences of frames in the video stream that each contain a frame with a matched hash value to a frame of the stored sequence; comparing sub-sequences of frames in the video stream to candidate subsequences of frames in the stored sequence and determining a match when a sub-sequence contains a threshold level of frames with hash values matched to frames of a candidate subsequence; identifying a sequence of frames, where the sequence contains a group of sub-sequences that have been matched
  • This apparatus may form part of the detection and replacement apparatus of the system described above.
  • the method and apparatus of the second aspect may also have wider application in looking for video sequences that are not commercial spots but relate to other types of video sequence, for example, scenes from a film, a music video or other program. It may be used to monitor broadcasts for audit purposes or to detect infringements. It may be applied to live broadcasts or may be applied to archived material, for example, as a search engine. It may be used to mine large video archives or websites for particular video sequences. For example, an internet service provider might use the method to locate particular video files for audit or exclusion purposes.
  • the method may include the step of decompressing a video stream from a first format to a second format before the matching is performed.
  • the video stream may originate from a MPEG file or similar format and the hash values may be generated just for reference frames.
  • the above references to a video stream include a video stream that is multidimensional.
  • the video stream may have more than one channel.
  • the method or apparatus may analyse two or more channels of the video stream simultaneously, generating multiple files of hash values for the different channels.
  • the "first portion" of the video stream may be one channel and the "second portion" of the video stream having the "matched" recurring video sequence may be from a different channel.
  • the method or apparatus may also generate multiple files of hash values representing a video stream at different times, and then mine through all of the files looking for repeating video sequences.
  • the "first portion" of the video stream may represent one recording of a channel at a given time and a "second portion" of the video stream having the "matched" recurring video sequence may represent the same or a different channel recorded at a different time.
  • a "first portion" of the video stream may represent a first 2 hour segment from a channel and a "second portion" of the video stream may represent a later 2 hour segment, possibly spaced by a 24 hour period.
  • These files may be produced and stored at several locations and a network provided to mine across the data at the various locations.
  • the repeating video sequences identified at each location are added to a central database at intervals.
  • the video stream being received may be in the form of images or may be modified, for example, the video stream being analysed may be representations of frames or portions of frames that have been captured from an original video stream.
  • the present invention also relates to a software product which contains instructions that when loaded onto a computer, causes the computer to execute the method of the first and / or second aspect of the present invention.
  • the present invention further extends to a software product for putting into effect the method of the first and / or second aspect of the present invention, wherein the software product is a physical data carrier.
  • the present invention also extends to a software product for putting into effect the method of the first and / or second aspect of the present invention, wherein the software product comprises instructions transmitted from a remote location.
  • the present invention also encompasses a database of stored video sequences (preferably stored as fingerprints of the video sequences) which has been generated through updates of newly found repeating video sequences as a result of the automated mining process discussed above.
  • Figure 1 is a flowchart of steps 1 - 2 in the preferred algorithm
  • Figure 2 is a flowchart of step 3 in the preferred algorithm
  • Figure 8 shows a flowchart for the preferred recurring sequence search algorithm
  • Figure 9a shows length statistics for detected multiples in ChartTV over 24 hours
  • Figure 9b shows length statistics for detected multiples in ChartTV over 48 hours with no limitation in index values, where the labelled intervals are multiples of 10 seconds and have a range of ±5 frames, the intermediate (not labelled) intervals cover the wider range (239s) between the marked lengths, and all multiples longer than 2005 seconds are merged into the same histogram bin;
  • Figure 10a shows length statistics for detected multiples in SkySportsNews over 24 hours without limitation in index values
  • Figure 10b shows the length statistics with limitation, where the labelled intervals are multiples of 10 seconds and have a range of ±5 frames, the intermediate (not labelled) intervals cover the wider range (239s) between the marked lengths, and all multiples longer than 2005 seconds are merged into the same histogram bin; and
  • Figure 11 shows an example of a system incorporating the present invention.
  • Various ways of fingerprinting frames of a video stream are already known. Each has its associated benefits in terms of how powerful it is at discriminating between different images, how quick it is to calculate and how quick it is to match.
  • Three techniques will be described below which have been found to have particular utility in the algorithm of the present invention. These are colour patches, gradient histograms and the SIFT (Scale-Invariant-Feature-Transform) features.
  • SIFT Scale-Invariant-Feature-Transform
  • Other feature vectors, for example ones based on these with simple modifications to the calculation to represent the property with a different value, and feature vectors which calculate a descriptor for other properties of the video, are also envisaged within this invention. Thus these feature vectors are intended as examples only, but are examples that have been shown to have particular utility in the present invention.
  • the colour patches fingerprint algorithm measures primarily a coarse colour distribution in a single frame. In essence, these are a calculation of the average colour (or other similar function describing the intensity of a particular colour) for regions of a frame or frames. They can be calculated as follows: Let
  • the colour patches fingerprint vector contains 3 × N × M values R_nm, G_nm, B_nm of averaged colour intensities.
  • each entire frame, or each captured portion of a frame, is divided into 8 x 8 subareas, for example arranged as an 8-by-8 grid.
  • the distance between fingerprints can be calculated using the L1-norm (summing the absolute differences between the values for corresponding regions) or the L2-norm (summing the squares of the differences between the values for corresponding regions).
  • the L1-norm is calculated and multiplied by a factor, for example by 80.
  • the colour patches are very fast to compute, tolerant to image noise, and very good at matching identical clips.
  • the colour patches algorithm gives very good recall values, but comes with low precision for certain types of video such as those containing relatively dark spots with low intensity values.
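As a concrete illustration, the colour patches fingerprint and its L1-norm distance described above might be sketched as follows. The 8-by-8 grid and the scaling factor of 80 follow the description; the function names, frame layout and use of NumPy are illustrative assumptions, not the patent's actual implementation:

```python
import numpy as np

def colour_patches(frame, n=8, m=8):
    """Average R, G, B intensities over an n-by-m grid of subareas.

    `frame` is an (H, W, 3) array of pixel values; the result is a
    fingerprint vector of 3 * n * m averaged colour intensities.
    """
    h, w, _ = frame.shape
    patches = []
    for i in range(n):
        for j in range(m):
            # Extract subarea S_nm and average each colour channel.
            sub = frame[i * h // n:(i + 1) * h // n,
                        j * w // m:(j + 1) * w // m]
            patches.extend(sub.reshape(-1, 3).mean(axis=0))
    return np.array(patches)

def l1_distance(fp_a, fp_b, factor=80):
    """L1 norm between two fingerprints, scaled by a factor as above."""
    return factor * np.abs(fp_a - fp_b).sum()
```

Identical frames give a distance of zero; the factor only rescales the distance so that thresholds can be expressed on a convenient range.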
  • the size of the gradient histogram fingerprint is N × M × K values H_nm^k.
  • SIFT Scale Invariant Feature Transformation
  • the fingerprint size is N × M × L² × K values, for example with K = 8 as in the test case with gradient histograms.
  • the SIFT fingerprint gives good recall with lower precision. It is a more time consuming method than gradient histograms (for example, gradient histograms can be 1.5 times as long to calculate as colour patches, whereas SIFT fingerprints can be five times as long). Moreover, the higher spatial resolution, while giving slightly better results for object matching, often shows worse performance for image matching compared to the other two fingerprint algorithms.
  • the present invention concerns a new method, and apparatus executing the method, for detecting matching video sequences.
  • the method is used to identify previously unknown repeating video sequences in a video stream.
  • the method is adapted to look for repeats of a known video sequence in a video stream.
  • the matching algorithm uses the fingerprint techniques discussed above in its execution. It is based on matching strings of data and will be referred to below as the new String Matching Algorithm.
  • An approximate substring matching algorithm has been used previously which employs a frame-by-frame distance measure that is able to tolerate inserted and/or missing frames.
  • one disadvantage of this method is its high computational complexity: the distance has to be evaluated for each query sequence in the database.
  • Another drawback which has been noticed is that the algorithm includes a local measure for deciding whether two frames match or whether instead a frame needs to be inserted or deleted.
  • the global measure including insertions and deletions is the criterion for finding the corresponding substring. Problems arise if surrounding frames are judged similar enough for the global measure by being below a specified threshold. In such cases the end of the commercial spot is often not detected correctly.
  • the new string matching algorithm has been developed with the objective to outperform the existing approximate substring matching (ASM) algorithm in two aspects:
  • an inverted index may be incorporated into the string matching algorithm.
  • An inverted index structure provides a look-up table, which assigns to each image feature value all query sequences containing frames characterized by this feature. Additionally, the position of the matching frames can be stored in the index.
  • the new string matching algorithm combines the inverted image index technique with a frame-by-frame check.
  • the look-up in the index file is used for searching for candidate sequences to limit the time spent on frame-by-frame matching.
  • preferably, a scalar feature value is used for creating the image index. This may be a single property like frame intensity or a scalar function of a (small) feature vector.
  • the index feature should meet the following conditions. On the one hand it should be robust in the sense that "similar" frames are likely to be assigned the same value in the index table. On the other hand, its range should be large enough to distinguish between "different" frames. There is a balance to be found between recognising all the relevant sequences in the index look-up step, while also limiting the number of candidates. If the set of possible or occurring index values, respectively, is too small, too many "similar" images are found, which leads to a loss in performance because of the tendency to do frame-by-frame checking for every sequence in the database.
  • Matched frames for all candidates are counted.
  • a dynamic table is used for storing the candidates.
  • the number of consecutive non-matching frames is counted too. If this number exceeds a critical value, the candidate spot is removed from the list.
  • if a threshold ratio of hits is reached and hits are found at the end of the candidate string, a frame-by-frame check is conducted over the last buffered frames, measuring the distance between individual frames. If the distance of this substring is less than a threshold value, we have a "hot" candidate and we try to estimate the end of the sequence in the test stream.
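The candidate bookkeeping described in the preceding bullets (counting hits, counting consecutive non-matching frames, and pruning) could be sketched roughly as follows. The function name, the dictionary layout of the dynamic candidate table, and the threshold value are hypothetical:

```python
def update_candidates(candidates, hits, max_misses=25):
    """One step of candidate bookkeeping.

    `candidates` is a dynamic table (dict) of candidate spots;
    `hits` is the set of candidate ids whose index value matched the
    current test frame. A hit increments the match count and resets
    the consecutive-miss count; a candidate accumulating more than
    `max_misses` consecutive non-matching frames is removed.
    """
    for cid in hits:
        entry = candidates.setdefault(cid, {"hits": 0, "misses": 0})
        entry["hits"] += 1
        entry["misses"] = 0
    for cid in list(candidates):
        if cid not in hits:
            candidates[cid]["misses"] += 1
            if candidates[cid]["misses"] > max_misses:
                del candidates[cid]  # pruned from the candidate list
    return candidates
```

A candidate whose hit ratio then exceeds the required minimum would be handed on to the frame-by-frame check.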
  • for each index value H we have a list of N_E(H) entries E_i(H).
  • Each entry consists of three integers: s_Hi refers to the spot id that the entry belongs to, f_Hi identifies the position of a frame with index value H in spot s_Hi, and l_Hi is the number of contiguous frames with index value H starting at position f_Hi.
  • l_Hi has been introduced due to the assumption of the index property to map consecutive frames with only small differences to the same index. Therefore, if we have l_Hi contiguous frames all with the same index value H, we generate only one entry in the index table. This is a form of run-length encoding.
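The run-length encoded index entries (spot id, start position, run length) described above might be built along these lines; the function name and data layout are illustrative assumptions:

```python
from collections import defaultdict

def build_index(spot_id, index_values):
    """Build run-length encoded inverted index entries for one spot.

    `index_values` is the per-frame sequence of scalar index values.
    For each run of contiguous frames sharing the same index value H,
    a single entry (spot id, start frame, run length) is stored under H.
    """
    table = defaultdict(list)
    pos = 0
    while pos < len(index_values):
        h = index_values[pos]
        run = 1
        # Extend the run while consecutive frames share index value H.
        while pos + run < len(index_values) and index_values[pos + run] == h:
            run += 1
        table[h].append((spot_id, pos, run))
        pos += run
    return table
```

Runs of similar frames thus cost only one entry each, keeping the index compact.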
  • control parameters:
    - minimum ratio of matched frames to the number of all frames of the candidate, which must be reached for further investigation;
    - minimum required percentage of frames from the test stream which must have caused a hit in order to trigger further investigation;
    - maximum allowed count of consecutive non-matching frames from the test stream; if the value of the candidate spot exceeds this value, it is deleted from the candidate list;
    - maximum gap between the last frame L and the last frame of the candidate spot, which signals that the end of the candidate is reached;
    and the parameters concerning the frame-by-frame distance measure:
    - frame-by-frame distance for two sequences of length n with the last frames I_1 and I_2, respectively;
    - maximum distance for the frame-by-frame check for similarity of two sequences;
    - minimum distance for two frames to be different;
    - length of the test buffer for the frame-by-frame distance measure.
  • All candidates which have not had any hit over a certain time interval are removed from the candidate list.
  • the colour patches fingerprint vector contains values R_nm, G_nm and B_nm denoting the average red, green, and blue intensity values for subarea S_nm of the image.
  • Reducing the feature vector to a dimension of three only gives the advantage of being less sensitive to small changes between images. For instance, a moving object before a homogeneous background does not cause changes to the average values R_I, G_I, and B_I.
  • For generating the scalar index value we use the first B bits of each of the averaged values, resulting in a 3B-bit integer value. The number of significant bits B controls the size S of the bins which the single values are mapped to, as well as the index range R_CP = 2^(3B).
  • the subscript CP refers to the colour patches fingerprint.
  • the bin size and index range are inversely proportional to each other. In general the algorithm can deal with any bin size, but bit-related sizes have the advantage of being easily computed.
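A minimal sketch of deriving the scalar index from the first (most significant) B bits of each 8-bit averaged colour value, yielding the 3B-bit integer described above; the function name and default B are illustrative:

```python
def colour_patch_index(r_avg, g_avg, b_avg, b_bits=3):
    """Scalar index from the first b_bits of each averaged colour value.

    Each 8-bit average is reduced to its b_bits most significant bits,
    and the three reduced values are packed into one 3*b_bits-bit
    integer, giving an index range of 2**(3*b_bits).
    """
    shift = 8 - b_bits  # discard the least significant bits (binning)
    return (((int(r_avg) >> shift) << (2 * b_bits))
            | ((int(g_avg) >> shift) << b_bits)
            | (int(b_avg) >> shift))
```

Dropping low-order bits is what makes "similar" frames land in the same bin, while the packed range still separates clearly different frames.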
  • the gradient histogram fingerprint vector originally contains N × M × K floating point values H_nm^k, where n and m are indices of the N × M subareas S_nm, and k refers to the bin in the direction histogram.
  • H_nm^k denotes the portion of direction k in subarea S_nm for the normalized distribution.
  • In a first step we reduce the fingerprint size to a quarter of its former size by mapping all floating point values to 1-byte integer values. From the above equation we know that all H_nm^k are in the range between 0 and 1. We expect maxima and a relatively dense distribution around the value of the uniform distribution 1/(N × M × K), whereas values near the borders of the interval, especially near 1, are very unlikely to be observed. Nevertheless, a linear mapping from (0, 1) → (0, 255) is preferably used in order to keep computational costs low, i.e. we use simple scaling of H_nm^k to get integer values.
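The linear mapping from (0, 1) to (0, 255) amounts to simple scaling of each histogram value; a minimal sketch (function name assumed):

```python
def quantize(h_values):
    """Map gradient histogram floats in [0, 1] to 1-byte integers by
    simple linear scaling, keeping computational costs low."""
    return [min(255, int(h * 255)) for h in h_values]
```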
  • index value range is significantly larger compared to the case of the colour patches fingerprints.
  • regarding the index table size, in the case of a large K or B we can replace the raw index table with a hash table, the dimension of which we can define by technical demands.
  • the index value H(I) is evaluated by a properly designed hash function dependent on all values of the fingerprint (two-step hashing).
  • test videos of different properties were investigated: one 4-hour NTSC video from US television (431,540 frames) and one PAL video from the UK with a duration of approximately 3 hours (281,896 frames). Both videos differ not only in their technical properties, but also in their content and quality. These differences have an impact on the results.
  • the characteristics of the test videos were as follows.
  • the US test video contained a mixture of news, sports, movies and commercials, whereas the one from the UK is taken from sports TV, i.e., where the main part of the video is sports news which is paused by commercial blocks.
  • the visual impression is that the UK video is of inferior quality and of higher intensity than the US video.
  • An example frame from the broadcast on the UK video might have an almost white ticker section, which disappears when commercials are broadcast and this makes it easy to distinguish between the two by colour properties.
  • the lower quality, accompanied by blurry and noisy patterns, is a worst-case condition for the gradient based algorithm.
  • in the US video one tends not to find such pronounced layout characteristics, and there is usually better image quality and lower mean intensity.
  • the US video contained 152 manually detected commercials, of which 97 are different. Searching for all of the 152 spots and taking into account all of the corresponding repetitions, the maximum count of possible matches adds up to 380.
  • Table 1 Reference values from the approximate substring matching approach. Given are values from the parameter sets with the best results.
  • Table 1 shows reference values from the approximate substring algorithm.
  • Results from tests with the index table search are given in Table 2.
  • a summary of configurations and their corresponding index ranges and bin sizes, to which the fingerprint values are mapped, is shown in Table 3.
  • colour patches and gradient histograms perform differently on both videos.
  • the colour patches fingerprints have stable recall and precision values.
  • evaluation time is significantly higher than that of the gradient histogram.
  • the colour patches fingerprints perform well with the index table and we can reduce search time by 75-85% using the inverted index. Additionally, we can improve the precision with the new algorithm.
  • Table 2 Search time, recall and precision for different fingerprints (CP - Colour Patches, GH - Gradient Histogram) and different sizes of index values with the index table based algorithm. There are additional true positive matches from commercials with identical sequences at the end. The single false positive match is the result from a spot that is a subsequence of a longer spot.
  • Table 3 Index ranges and bin sizes for the fingerprints and indexing combinations shown in Table 2
  • the index value H(I) of one single frame is created from the average red, green, and blue intensities R_I, G_I and B_I, respectively. If we take the first B bits of each of the three colour averages, the index value has a size of 3B bits and its corresponding range is shown in Table 3. In more detail, the index is evaluated as
  • the index H(I) can be represented by a 4-byte integer value. Since the corresponding ranges are not that large, we choose an index table where each possible index value is represented by itself, i.e., we have an array T of size 2^(3B), where each field T(H(I)) contains a pointer to the entries E_i(H(I)).
  • the index values of the darker US video are mainly assigned to low numbers, whereas the UK indices tend to the larger values.
  • the distribution is relatively sparse, with less than 5 % of all possible index values being used, but this is a preliminary conclusion because the number of frames in the videos is of the same order of magnitude as the index range itself. Nevertheless, there are large differences in the number of entries per index values.
  • the index value for the gradient histogram fingerprint is a function of K values
  • which represent the gradient histogram for the whole image.
  • from Table 3 it is possible to see that we are dealing with very large ranges corresponding to large N, M, K, and B.
  • hash functions are a good instrument for mapping sparse distributed data to a "small" size index table.
  • hash functions with good mixing properties are used, so different input data are, in most cases, mapped to different hash values. If different input data are mapped to the same index value, we have a collision.
  • hash functions are based on the modulo function, where the index range is taken as the modulo.
  • the modulo function can be evaluated with the Horner scheme, of which we make use too.
  • the Horner scheme implies an iterative evaluation. So we have an index function which is applied iteratively over the fingerprint values, and the final value is the resulting index value.
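An iterative modulo evaluation in the style of the Horner scheme might look as follows. The table size of 100,003 is the value used later for the recurring sequence search; the function name and the base of 256 (one byte per quantized fingerprint value) are illustrative assumptions:

```python
def horner_hash(values, table_size=100003, base=256):
    """Map a sequence of quantized fingerprint values to a hash table
    index using the Horner scheme, taking the modulo at each step so
    that intermediate values stay small."""
    h = 0
    for v in values:
        h = (h * base + v) % table_size
    return h
```

Because the modulo is applied inside the loop, arbitrarily long fingerprints can be hashed without large intermediate integers, and different input vectors are, with a well mixing base and a prime table size, mostly mapped to different indices.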
  • the value R_MIN which separates good and bad candidates depends on the type of index value and, mainly, on the value of B. As we can see, too, the parameters R do not have a significant influence on the precision.
  • L_FQ is another parameter which turned out to be important for a successful query. It is the control parameter for signalling the end of a match that has been found, i.e., we guess we have found the end of a possible match (which passed the thresholds) if a hit is marked at the last L_FQ frames of the query sequence.
  • Figures 7(c) and 7(d) concern parameters of the frame-by-frame-distance measure:
  • control parameter which defines the maximum count of consecutively non-matching frames which are allowed. If this value is exceeded, the candidate is removed from the list.
  • the recall decreases if this value drops below a critical value, and time increases for large values because of a larger candidate list. There is no effect on precision.
  • This new string matching algorithm, which combines an inverted index table for a coarse search with a frame-by-frame distance measure for a more detailed comparison of test and query sequence, is a significant improvement. It has been shown that we can reach a significant speed-up if data is well conditioned and the system is adequately configured. For ill-conditioned data or originally compact fingerprints like the 4 x 4 x 4 gradient histogram there might be long evaluation times or losses in recall. The combination of the inverted index table with the colour patches fingerprint gives stable and excellent recall and precision values with moderate search times for less optimal input data, whereas the combination with the 8 x 8 x 8 gradient histogram offers fast performance; however, the miss rate increases for worse data.
  • the new String Matching Algorithm provides a good basis for automatically detecting repeatedly occurring sequences.
  • the proposed recurring sequence search algorithm is based on the inverted index search algorithm introduced above, i.e., feature vectors for all images contained in the video are calculated and an inverted index containing each single frame is created using hash functions.
  • GH feature gradient histogram
  • Steps 1 & 2 Generating feature vectors and inverted index
  • The inverted image index is stored in a hash table of size 100,003, by means of a simple modulo function evaluated with the Horner scheme. Because we deal with very large video sequences of 24 and 48 hours, respectively, for practical reasons we split the video into 2-hour segments so that we can deal with smaller data volumes. Accordingly, we have a number of hash tables, each containing the indices of 180,000 frames (2 hours at 25 frames/sec). This video splitting is somewhat arbitrary and is simply a way of handling very large videos, i.e. large amounts of feature vectors and correspondingly large index tables. It makes the algorithm independent of video length and memory capacity. Furthermore, searching within several index tables opens the possibility of an alternative definition of the input data (for instance, we may compare with the data from a previous day).
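A minimal sketch of this indexing step follows; the base value is an assumption, since the text specifies only the table size of 100,003 and the use of the Horner scheme with a simple modulo:

```python
HASH_TABLE_SIZE = 100_003  # prime table size used in the text

def hash_index(feature_vector, base=256, size=HASH_TABLE_SIZE):
    """Map a quantised feature vector to a hash-table slot using
    Horner's scheme with a simple modulo. The base of 256 is an
    assumption; any base at least as large as the per-component
    alphabet works."""
    h = 0
    for component in feature_vector:
        h = (h * base + component) % size  # Horner's rule, reduced each step
    return h
```

Reducing modulo the table size at every step keeps the intermediate value small, so the scheme works for arbitrarily long feature vectors.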
  • Step 3 Searching for recurring short sequences
  • Step 4 Building candidate recurring sequences
  • Each vector contains the start frame numbers of the short-sequence candidates that build the duplicate, which are grouped into the following sequences:
  • Step 5 Coarse filtering of candidate sequences
  • Duplicates must have a minimum number n of matching short sequences and a minimum (and possibly a maximum) length, and the standard deviation of the follow-up distance sd(S1, S2) must not exceed a critical value for the candidate to proceed further.
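The coarse filter can be expressed as a single predicate; the threshold values used here are illustrative, not those of the actual system:

```python
from statistics import pstdev

def passes_coarse_filter(matches, min_matches=4, min_len=200, max_len=2000,
                         max_follow_up_sd=3.0):
    """Coarse filter for a candidate duplicate (illustrative thresholds).
    `matches` is a list of (start_in_S1, start_in_S2) frame numbers of
    the matched short sequences."""
    if len(matches) < min_matches:
        return False
    length = max(s1 for s1, _ in matches) - min(s1 for s1, _ in matches)
    if not (min_len <= length <= max_len):
        return False
    # follow-up distance: offset between the two occurrences; its
    # standard deviation must stay below a critical value
    distances = [s2 - s1 for s1, s2 in matches]
    return pstdev(distances) <= max_follow_up_sd
```

A genuine duplicate keeps a near-constant offset between its two occurrences, so the standard deviation of the follow-up distances is small; accidental matches scatter and are rejected.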
  • Step 6 Aligning sequences
  • the next step aligns the two similar sequences of the duplicates.
  • both sequences S1 and S2 may be shifted with respect to each other and may only be a part of the target sequence.
  • To estimate the beginning and end of the target sequence we need properly aligned sequences.
  • For calculating the displacement we use the intervals between cuts.
  • We use this approach because it is simple to implement and provides relatively exact results.
  • The disadvantage is that we need more than three cuts for proper alignment, and proportionally more if there are errors in cut detection. As a result, we reduce our detection rate at this step, neglecting all commercials with a lower number of cuts. It may be possible to improve the method by adding a cut-independent method for aligning, for instance searching for the minimum of a string distance measure over several displacements.
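One way to realise the cut-interval alignment is to slide the two cut-interval lists against each other until at least three consecutive intervals agree, mirroring the requirement for more than three cuts; the tolerance value is an assumption:

```python
def align_by_cuts(cuts_a, cuts_b, tolerance=2):
    """Estimate the frame displacement between two similar sequences by
    matching the intervals between their detected cuts (a sketch; real
    cut lists contain detection errors). Returns the displacement of
    sequence B relative to A, or None if too few intervals match."""
    intervals_a = [b - a for a, b in zip(cuts_a, cuts_a[1:])]
    intervals_b = [b - a for a, b in zip(cuts_b, cuts_b[1:])]
    # slide one interval list over the other, looking for a run of at
    # least three intervals that agree within the tolerance
    for i in range(len(intervals_a) - 2):
        for j in range(len(intervals_b) - 2):
            if all(abs(intervals_a[i + k] - intervals_b[j + k]) <= tolerance
                   for k in range(3)):
                return cuts_b[j] - cuts_a[i]
    return None
```

Cut intervals are invariant under a constant frame shift, which is why matching them directly yields the displacement without comparing frame content.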
  • Step 7 Identifying the beginning and end of a detected spot
  • FIG. 8 illustrates a flow-chart of the process.
  • a video stream is retrieved, either from a live broadcast or from storage.
  • the video stream is processed to generate feature vectors.
  • an inverted index (hash table) is generated from the feature vectors.
  • similar short sequences ("sub-sequences") are built by identifying short sequences with frames with matching hash values 108.
  • step 110 of finding long sequences ("sequences") as groups of short sequences. This includes matching short sequences 112. There may still be undetected parts 114.
  • the lower matched long sequence in step 110 of Figure 8 represents a repeated sequence 116.
  • Coarse filtering 118 is applied to the sequences.
  • the sequences are aligned 122.
  • Start and end frame detection 124 is performed on the detected repeated sequence 126.
  • Grouping of repeatedly detected sequences 128 is conducted.
  • the multiple detected repeated sequence 130 is stored and/or fingerprinted for future comparison by an ad detection and replacement apparatus.
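The whole FIG. 8 pipeline can be illustrated on a toy example in which each frame is represented directly by its hash value; real frames would first be reduced to feature vectors and hashed as in steps 1 and 2, and the real algorithm tolerates partial matches within a window rather than requiring exact equality as this sketch does:

```python
from collections import defaultdict

def find_repeats(frame_hashes, window=3, min_windows=2):
    """Toy end-to-end sketch of the FIG. 8 pipeline: build an inverted
    index of frame hashes, match short windows via the index, then merge
    consecutive matched windows into longer repeated sequences.
    Returns a list of ((start_a, start_b), length) pairs."""
    # Steps 1-2: inverted index  hash -> frame positions
    index = defaultdict(list)
    for pos, h in enumerate(frame_hashes):
        index[h].append(pos)
    # Step 3: short-sequence matches found through the index
    window_matches = set()
    for pos in range(len(frame_hashes) - window + 1):
        for other in index[frame_hashes[pos]]:
            if other > pos and (frame_hashes[other:other + window]
                                == frame_hashes[pos:pos + window]):
                window_matches.add((pos, other))
    # Steps 4-6: merge runs of matched windows with a constant offset
    repeats = []
    seen = set()
    for pos, other in sorted(window_matches):
        if (pos, other) in seen:
            continue
        a, b, length = pos, other, window
        while (a + 1, b + 1) in window_matches:
            seen.add((a + 1, b + 1))
            a, b, length = a + 1, b + 1, length + 1
        if length >= window + min_windows - 1:
            repeats.append(((pos, other), length))
    return repeats
```

The coarse filtering, alignment, and start/end refinement of steps 118-124 would operate on the `repeats` list produced here.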
  • Table 4 Number of all found recurring sequences, and number of all found multiples (sequences which are considered as different from each other).
  • Table 4 shows the number of all found sequences and the number of multiples they are grouped into. Comparing the number of found sequences or multiples for the 24-hour search and the 48-hour search, respectively, we note that the number for 48 hours is more than twice that for 24 hours, indicating that there are sequences which would potentially not be found by two 24-hour searches.
  • The two lines at the bottom of Table 4 contain the results of searches in which similar frames are neglected in the candidate-building step. Surprisingly, in most cases more multiples are found than with no limitations. Usually this is caused by very long sequences being split into several smaller ones when frames in the middle are neglected, or we may get more variants of the same repeated sequence if frequently occurring frames at the boundaries of the sequence are neglected.
  • ChartTV as a representative music channel
  • SkySportsNews from the sports channels
  • ShowMatch which enables browsing through all found sequences, shows the number of repetitions and lets us have a look at individual frames around the detected boundaries. Then we can assign the found sequence to a predefined category. At this stage we distinguish three main categories: commercials, channel related stuff, and false. All three main categories offer some subclasses. For the estimation of recall we deal with the commercials, which are divided into the subclasses shown in Table 5.
  • EXACT and EXACT +/-5 are a little bit arbitrary. Whereas some channels like ChartTV normally broadcast commercials with their original length (at least in our test video), the length of commercials shown in SkySportsNews is more variable. Sometimes single frames are cut at the beginning or end, and sometimes dissolves are inserted and sometimes not. Therefore, according to our algorithm, some sequences belonging to the same multiple are exactly detected, while others have small errors at the borders. For calculating the recall we take into account multiples which are classified as EXACT or EXACT +/-5.
  • DOUBLE In rare cases one commercial is only ever broadcast directly following another; then we can only detect the combined sequence of these two commercials (DOUBLE). Normally the commercial also occurs in another context, so that we have the chance of a separate detection. In this case the commercial is rated with the better class, perhaps EXACT or EXACT +/-5, for estimating the recall.
  • CHANNEL is like DOUBLE, but the commercial is bound to channel-related material. This can be a logo sequence, an intro or outro, previews or similar things. Because such channel-related sequences occur more frequently (at the beginning and end of commercial blocks), the probability of finding such compounds is higher than for DOUBLEs. For the calculation of recall we apply the same criteria as for the DOUBLEs.
  • Table 6 shows the number of detected sequences for the first two hours of the ChartTV and the SkySportsNews video, respectively. As in former investigations, our results strongly depend on specific channel characteristics. For the ChartTV video we reach a recall of around 80%, whereas for the SkySportsNews video the recall is only around 60%, and consequently much lower. There are several reasons for the lower recall. Besides the more casual handling of commercial boundaries mentioned above, a certain fraction of the commercial spots is related to the Sky group, advertising Sky programming. In many cases of such trailers, a "set reminder" button for adding the programme to the personal planner appears. The appearance of this icon is not strongly deterministic; it can be shown for only a small part of the trailer, and may only disappear after the beginning of the next commercial.
  • the number of exact matches may become higher even if the concrete commercial repeats within the first 24 hours, because the spot comes with more embedding variations.
  • the numbers of DOUBLE and CHANNEL grow too, because of the higher probability for finding two corresponding double sequences, but if a commercial is only a DOUBLE within 24 hours, we get the chance for finding another sequence combination within the next 24 hours and can extract the single spot.
  • Table 9 shows the numbers of detected multiples which we have assigned to all classes, as well as the number of UNSET multiples which we have not yet classified.
  • For estimation of precision we only take the already assigned multiples into account.
  • We calculate the exact precision, which relates to commercials detected ready for the commercial database, and additionally the value which relates to sequences that could give hints for a subsequent manual detection of commercials.
  • For ChartTV the precision is nearly independent of our modification of the algorithm concerning the number of entries per index in the hash table. In contrast, for SkySportsNews we can significantly improve the precision by limiting evaluation to sparsely occupied indices.
  • The Precision EXACT value counts the EXACT and EXACT +/-5 numbers as detected commercials, whereas for "Precision COMMERCIAL" all commercial subcategories are included. Note that the 48-hour searches for SkySportsNews are not completely classified. On doubling the video length for searching we observe a decrease in precision for ChartTV. As mentioned in the recall section, the benefit in finding commercials is only small, whereas the number of detected video clips rises significantly. We estimate that the probability of broadcasting a video clip exactly once within 24 hours is much higher than for commercials. For SkySportsNews we cannot make a reliable statement because of the incomplete classification, but in the case of no restriction on index entries (where we have more than half of the sequences classified) the doubling tends to have much less effect on precision.
  • the most time consuming step is generally the computation of image features.
  • the calculation of gradient histograms for our test videos takes around 20 minutes for one 2-hour segment on an Intel Xeon 2.33 GHz CPU.
  • Video decoding is included in the measured time.
  • Generating the hash index table for 2 hours of video takes about 10 seconds. Consequently, the overall preparation time (steps 1-2) for recurring sequence detection is about 4 hours per 24 hours of video.
  • Table 10 shows the evaluation times for finding short sequences (step 3) and for building multiples from these short sequences (steps 4-8), respectively. Because we deal with 2-hour segments, our search algorithm for short sequences has a quadratic dependency on video length. Note that the Gemini and MTV2 videos contain errors, which appear as black frames over a longer period, so the values shown deviate from normal behaviour; however, we can see how this situation is handled. For MTV2 the errors are situated within the first 24 hours, and seem to take a significant fraction of the evaluation time if we do not limit the number of index entries. That is why the short-sequence search time over 48 hours is not much greater than for 24 hours.
  • Evaluation time is strongly influenced by the channel content. If we do not restrict evaluation to frames with sparsely occurring hash indices, the time needed for the short-sequence search is much too long for practical use on the SkySportsNews video, with its many similar but not identical sequences, whereas there are only slight improvements for the other videos.
  • In Table 11 the average evaluation times and standard deviations are shown. We can see that our modification of only taking hash indices with fewer than 100 entries not only significantly decreases the computation times, but also leads to more constant time values. Without the limitation the standard deviation is of the same order of magnitude as the mean, indicating a large range of occurring values, whereas computation time is much less spread if we concentrate on characteristic frames.
  • Table 12 shows the number of detected short sequences for each of the six test videos. It is clearly seen how we can eliminate video errors (black frames) in Gemini and MTV2, as well as a lot of similar sequences in the sports channels SkySportsNews and SS3, by neglecting indices with many entries. For the non-corrupted music channel videos ChartTV and MTV we can reduce the number of detected short sequences with only little consequence for the detected multiples (see Tables 4 and 6).
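The restriction to sparsely occupied hash indices amounts to a simple occupancy filter on the inverted index; the cut-off of 100 entries is the one used in the text:

```python
def characteristic_frames(index, max_entries=100):
    """Keep only hash buckets holding fewer than `max_entries` frames.
    Densely occupied buckets (black frames, near-identical studio shots)
    are skipped, as described in the text."""
    return {h: frames for h, frames in index.items()
            if len(frames) < max_entries}
```

Frames whose hash value occurs very often carry little discriminative information, so dropping their buckets removes black-frame errors and near-duplicate studio shots at negligible cost to the detected multiples.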
  • Figures 9a and 9b show two histograms over the duration of multiples detected in ChartTV, taking into account the 12 most occupied classes.
  • the x-axis shows the length in terms of the number of frames and the y-axis shows the count. Histogram bins are not equally spaced, but the labelled bins (multiples of 250) are much smaller than the others.
  • Each labelled bin, which correlates to lengths that are multiples of 10 seconds, covers the narrow interval [LABEL - 5, LABEL + 5], whereas the unlabelled bins contain all values between, i.e., the interval [PREVIOUS LABEL+5, NEXT LABEL-5].
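The binning rule above can be written directly; returning string bin labels here is purely for illustration:

```python
def length_bin(length, step=250, half_width=5):
    """Assign a sequence length (in frames) to a histogram bin as in
    Figures 9 and 10: a narrow bin [k*step - 5, k*step + 5] around each
    multiple of `step` (250 frames = 10 s at 25 fps), and wide unlabelled
    bins covering everything in between."""
    nearest = round(length / step) * step
    if abs(length - nearest) <= half_width:
        return f"{nearest}"  # labelled narrow bin around a multiple of step
    lower = (length // step) * step
    # wide bin between two labelled bins
    return f"({lower + half_width}, {lower + step - half_width})"
```

The narrow bins isolate commercials, whose lengths cluster at multiples of 10 seconds, from other repeating material.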
  • Figures 10a and 10b show the length statistics of detected multiples in SkySportsNews within 24 hours without and with limitation to indices with less than 100 entries, respectively.
  • the x-axis shows the length in terms of the number of frames and the y-axis shows the count. It is clearly seen that we can suppress the detection of non- commercials by neglecting indices with many entries. This applies especially to scenes which are due to channel characteristics very similar to each other like the presentation of sports news and the transition into and out of the commercial blocks.
  • The 48-hour statistics are not shown because of their incompleteness, but there are a number of very long falsely detected OTHER sequences (see Table 9) which are due to the repetition of complete reportages.
  • This test concerns the length of the short sequences which are searched for in the first step of the search.
  • The value is represented by the parameter TEST_WINDOW; its reference value, used in the previous experiments, is 25 frames, which corresponds to 1 second in a PAL-coded video.
  • Table 14 shows evaluation times in dependence on the parameter TEST_WINDOW, i.e., the length of the short sequences. Reference values are, as in all following tables, in bold. Additionally, we varied the parameters MIN_COUNTER_CUT and MIN_COUNTER, respectively. These are thresholds for the minimum number of short sequences which form duplicate sequences: the smaller value MIN_COUNTER_CUT represents a necessary condition (all candidate duplicate sequences are discarded if they contain fewer than MIN_COUNTER_CUT short sequences, regardless of whether the thresholds of the other parameters are passed), and the higher value MIN_COUNTER is a sufficient condition (all candidate duplicate sequences are evaluated if they meet the necessary conditions of the other thresholds).
  • the test of this number of short sequences is connected to the test of a minimum length of the candidate sequence and the correlation of the order of the short sequences.
  • the rule for passing this test is the completion of all necessary conditions for these three parameters as well as the sufficient condition of one of these parameters.
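The combination of necessary ("CUT") and sufficient thresholds described above can be sketched generically; the parameter names and threshold values below are illustrative, not the system's actual configuration:

```python
def passes_candidate_test(values, necessary, sufficient):
    """Combined test described in the text: every parameter must clear
    its necessary ("CUT") threshold, and at least one parameter must
    clear its stricter sufficient threshold. `values`, `necessary` and
    `sufficient` are dicts keyed by parameter name."""
    # necessary conditions: any single failure discards the candidate
    if any(values[p] < necessary[p] for p in necessary):
        return False
    # sufficient conditions: one strong parameter carries the candidate
    return any(values[p] >= sufficient[p] for p in sufficient)
```

This lets one strongly satisfied parameter (e.g. many matching short sequences) compensate for parameters that are only weakly satisfied, while still rejecting candidates that fail any hard cut.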
  • The adjustment of the MIN_COUNTER values is necessary because the product of MIN_COUNTER and TEST_WINDOW implies a minimum length for the sequences, which would be much too high for the increased short-sequence length if the minimum number were left at the reference values.
  • the evaluation time is inversely proportional to the size of the short sequences searched for. This is mainly due to the increased number of found short sequences with smaller size.
  • The parameters MIN_COUNTER_CUT and MIN_COUNTER have only little influence in dependence on the short-sequence length.
  • Table 16 shows the number of found commercials, the corresponding recall, as well as the number of all found recurring sequences and the corresponding precision for the first 2 hours of our test video.
  • We find the best recall for the smallest short sequences length, which is caused by the fact that we can detect smaller clips. Both clips we have found additionally to the reference case are clips with a length of 250 frames. If we compare the COMMERCIAL EXACT and the ALL COMMERCIALS row we can recognize that we can increase the exactness of the search by a smaller TEST_WINDOW size too, because of the finer granularity of the search. As a result, the higher number of found sequences for the case TEST_WINDOW 100 with decreased MIN_COUNTER values is also caused by the reduced exactness of the search with longer short sequences.
  • ALL shows the number of all found recurring sequences within the first two hours of our test video; additionally, the number of all found sequences shorter than 2005 frames is given (probably most of the commercials are shorter than this). This simple length filter leads to an acceptable recall. Number of entries per index in hash table:
  • COMMERCIALS - all detected commercials including fragments and commercials which are detected together with channel logos or other commercials; ALL - all detected recurring clips including video clips and channel-related material; ALL (length < 2005) - all detected recurring clips with a length smaller than 2005 frames (80 seconds). Recall includes detected commercials from COMMERCIAL EXACT.
  • Precision corresponds to the COMMERCIAL EXACT/ALL ratio.
  • Precision (length < 2005) corresponds to the COMMERCIAL EXACT/ALL (length < 2005) ratio.
  • Table 17 Evaluation time in seconds for searching for short sequences (HashFP) and finding multiples (Detect), respectively, as a dependence on the maximum number of entries for a hash value to be taken into account for evaluation.
  • Table 18 Number of all found recurring sequences, and number of all found multiples, as a dependence on the maximum number of entries for a hash value to be taken into account for evaluation.
  • The performance values shown in Table 19 indicate only a slight influence of the parameter MAX_HASH_ENTRIES. In summary, we have found a definite but small influence of this parameter on evaluation time and performance values.
  • Table 19 Categories of all found sequences, recall, and precision as a dependence on the maximum number of entries for a hash value to be taken into account for evaluation. For a detailed description see Table 16.
  • Table 20 Evaluation time in seconds for finding multiples (Detect) in dependence on the algorithm used for aligning similar sequences.
  • Table 21 Number of all found recurring sequences, and number of all found multiples as a dependence on the algorithm used for aligning similar sequences.
  • Table 21 shows the number of all found recurring sequences and multiples found by using the different aligning methods.
  • the alternative aligning method can significantly improve the recall due to the number of sequences with no or too few cuts for proper aligning, with the corresponding decrease of precision.
  • Table 22 Categories of all found sequences, recall, and precision in dependence on the algorithm used for aligning similar sequences. For a detailed description see Table 16.
  • Table 23 Evaluation time in seconds for searching for short sequences (HashFP) and finding multiples (Detect), respectively, as a dependence on file size (total number of entries in one single hash table).
  • The evaluation time is much smaller for the 1-hour slices, as well as for the 4-hour slices, than for the 2-hour slices. Note that the influence of this parameter strongly depends on the hardware of the evaluation system, especially on the available memory and hard disk access time.
  • the MAX_HASH_ENTRIES parameter in relation to the total number of entries in the hash table.
  • Table 24 Number of all found recurring sequences, and number of all found multiples in dependence on file size (total number of entries in one single hash table).
  • Table 24 shows only a slight dependency of the number of found recurring sequences on the video file size. The small increase for smaller sizes is probably due to the more frequent occurrence of split sequences at the video borders. The influence of the MAX_HASH_ENTRIES parameter has already been discussed.
  • The MIN_COUNTER and MIN_LENGTH parameters have a significant influence on the minimum length of clips that can be found. Therefore, if you want to search for very short sequences (not commercials), like channel logos, you should decrease these values.
  • the TEST_WINDOW parameter should be small enough for exactly recognizing the clips, because for larger values the granularity is not sufficient.
  • Table 25 Categories of all found sequences, recall, and precision, as a dependence on file size (total number of entries in one single hash table). For a detailed description see Table 16.
  • FIG 11 shows an embodiment of the system of the present invention where a video stream is initially captured from a broadcast.
  • the system may comprise a broadcast capturing apparatus 1 which receives a broadcast 2.
  • the broadcast 2 may be a conventional video signal or multi-channel signal received through an aerial or via satellite, cable, etc., or it may be a transmission received via the internet or any other such programming.
  • the broadcast capturing apparatus 1 captures only a portion of the broadcast, for example, regions of frames from the broadcast, and these captured regions representing the frames of the original broadcast 2 then form the video stream 3 for the subsequent video processing. While the video stream maps the frames of the original broadcast 2, it is not possible to reconstruct the original broadcast 2 from the video stream 3.
  • the broadcast capturing apparatus 1 may store the video stream 3 before transmitting it to the fingerprint database generating apparatus 4.
  • the fingerprint database generating apparatus 4 then analyses the video stream 3 to detect recurring video sequences as described above and adds any new ones that it finds to a database 5. Possibly in a different location, for example in a pub, a detection and replacement apparatus 6 monitors a live broadcast 7 and outputs the broadcast in its as received form via a video output 8 to a large screen or other display device (not shown). When the detection and replacement apparatus detects a sequence of frames which matches one on its reference database 9, for example by using the matching algorithm described above, it swaps the matched sequence in the broadcast for a replacement sequence in a modified video output 10.
  • a switching device 11 may be provided to switch between the outputs 8,10, automatically switching to the modified video output 10 from video output 8 whenever a matching sequence is detected and switching back again to the video output 8 once the detected sequence has finished. In this way the audience watching the video output 12 from the switching device on a large screen would see the unmodified broadcast 8 when a match is not detected and the modified video output 10 for just those periods when a match is detected.
  • the detection and replacement apparatus 6 is set up to swap the commercial spots which are detected in the live broadcast 7.
  • the detection and replacement apparatus 6 may update its reference database 9 periodically with fingerprints of new recurring video sequences 12 from the database 5 of the fingerprint database generating apparatus 4.
  • One or more broadcast capturing apparatus 1 may communicate with a fingerprint database generating apparatus 4, feeding captured video stream 3 for the detection of repeating video sequences.
  • the fingerprint database generating apparatus 4 may communicate with other such apparatus 4 in different locations to update its library of repeating video sequences.
  • the present invention also provides a second aspect of a method of identifying the recurrence in a video stream of a stored sequence and apparatus employing the method.
  • the method is preferably employed in a fully automated system with the method of the first aspect, as described in claim 19, the methods preferably being conducted in different locations.
  • the method of the second aspect is described in clauses 1 to 12 below, namely:
  • a method of identifying the recurrence in a video stream of a stored sequence comprising: receiving a video stream; generating hash values for frames of the video stream; comparing the hash values from the video stream to hash values for frames of a stored sequence to identify frames with matching hash values; identifying sub-sequences of frames in the video stream that each contain a frame with a matched hash value to a frame of the stored sequence; comparing sub-sequences of frames in the video stream to candidate sub-sequences of frames in the stored sequence and determining a match when a sub-sequence contains a threshold level of frames with hash values matched to frames of a candidate sub-sequence; identifying a sequence of frames, where the sequence contains a group of subsequences that have been matched to candidate sub-sequences; determining whether the sequence of frames in the stored sequence is recurring as a sequence of frames in the video stream by comparing the groups of sub-sequences and determining whether a threshold level of similarity is met.

Abstract

A method of mining a video stream for repeating video sequences is described. A video stream is received and hash values are generated for the frames. Frames with matching hash values are identified. These are used to identify candidate sub-sequences of frames. The sub-sequences are compared and when a threshold level of frames with hash values matched to frames of another sub-sequence are detected, a match is identified. Longer sequences are identified from the matched sub-sequences and examined to determine if they are similar. If they are, start and end frames are detected. A fingerprint of the sequence is created by the system (4) and added to a database (5), with updates being sent to a reference database (9) of an automated detection and replacement apparatus (6). The output (8, 10) of the apparatus is altered according to whether a match is detected in a broadcast.

Description

Automatic Detection of Repeating Video Sequences
The present specification relates to the automatic detection of repeating video sequences, such as for example, commercial spots in a television broadcast. It also relates to an improved method of detecting video sequences using a new matching technique. These video sequences may be, for example, commercial spots, music videos, sports clips, news clips, copyrighted works, video sequences available on the internet or downloadable to phones or other personal communication devices.
Detecting commercial spots is known. A commercial spot may be an advertisement, an intro, an outro, or some such similar segment of video which is marketing a product, a service, a television show, a program, a sports event etc, or is an introduction to or a closing from a commercial break containing such segments. Detection may be required in order to cut and replace existing advertisements with new ones or possibly to count commercial spots for statistical reasons and for controlling contracted broadcast.
In general there are three different approaches for detecting commercials. The first focuses on the technical characteristics of commercials such as high cut frequency, disappearance of channel logo, increased volume, and the like. The second approach is based on the assumption that commercials are repeated over a period of time. Depending on interests and contracts of the companies that are advertising their products, these recurrences may appear within hours or days. The third approach searches for recurrence of known sequences stored in a database. This approach is, however, reliant on the quality of the data in the database for the detection.
Until now it has not been possible to detect previously unknown repeating video sequences, such as commercial spots, in a way that is, on the one hand, sufficiently reliable to be incorporated in an automated technique which can generate a database of meaningful results without human intervention, and on the other hand provide a technique which is also economical on processing power so that the commercial spot or other repeating video sequence can be identified quickly and easily.
Additionally, techniques for detecting a known video sequence in a video stream include the approximate substring matching algorithm, which employs a frame-by-frame distance measure that is tolerant of small changes in the frames but has a disadvantage of high computational complexity. Improvements are required in matching video sequences to one another, to improve precision and reduce processing resources. It would be desirable to be able to mine for repeating video sequences, such as commercial spots, in a video stream using an automated technique.
It would also be desirable to provide improvements for automated searching of known video sequences.
According to the present invention, from a first broad aspect, there is provided a method of mining a video stream for repeating video sequences comprising: receiving a video stream; generating hash values for frames of the video stream; identifying frames with matching hash values; identifying sub-sequences of frames within the video stream that each contain a frame with a matched hash value; comparing sub-sequences and determining a match when a sub-sequence contains a threshold level of frames with hash values matched to frames of another sub-sequence; finding sequences of frames, where each sequence contains a group of subsequences that have been matched to other sub-sequences; determining whether a sequence of frames in a first portion of the video stream is repeating as a sequence of frames in a second portion of the video stream by comparing the groups of matched sub-sequences and determining whether a threshold level of similarity is met; and when a repeating video sequence is detected, detecting start and end frames for the repeating video sequence.
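The sub-sequence comparison in the method above reduces to counting frames with matched hash values against a threshold; the value 0.8 is an illustrative choice, as the text only speaks of "a threshold level of frames":

```python
def subsequences_match(hashes_a, hashes_b, threshold=0.8):
    """Declare two equally long sub-sequences matched when the fraction
    of frame positions with equal hash values reaches `threshold`
    (an illustrative value)."""
    assert len(hashes_a) == len(hashes_b)
    matches = sum(1 for a, b in zip(hashes_a, hashes_b) if a == b)
    return matches / len(hashes_a) >= threshold
```

Requiring only a fraction of matching frames, rather than all of them, is what makes the method robust against small differences between repetitions of the same spot.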
Although the method of the present invention is applicable to any type of repeating video sequence, preferably the repeating video sequences being mined are commercial spots. These can be filtered from other repeating sequences, such as music videos and news clips, by selecting repeating video sequences with durations which are typical for spots such as multiples of 5 seconds, as well as which are within a given interval specified by a minimum duration and a maximum duration. Thus preferably the step of detecting the start and end frames for the repeating video sequence is to identify the sequence as a commercial spot.
In the present invention, the detection of a new commercial spot can be determined upon its repetition within a certain time, typically one or two days, and by its typical durations, i.e., 10, 15 or 30 seconds. Therefore, it is able to work independently of the non-inherent or only partly inherent properties mentioned above. This brings the advantage that it is much more universal. In one embodiment, the algorithm above is used to mine automatically for commercial spots, or other repeating video sequences, and then automatically update a database without human intervention being required to manage or assess the results. An important application of this algorithm is the use of automated mining for commercial spots in order to update automatically a database of fingerprints for the commercial spots, which is then used in an automated apparatus that performs ad detection and replacement on a live broadcast in real time.
According to this method there is a first step of identifying small similar or identical sequences (the "sub-sequences") and then a second step where longer sequences or segments (the "sequences") containing a similar or identical group of sub-sequences are matched before a last step of determining the exact start and end positions of the repeating video sequences. The main advantage of this method is its greater robustness concerning small differences between repetitions as well as its independence from any temporal video segmentation such as shot detection.
In one embodiment, the present invention provides a method of mining a video stream for repeating video sequences comprising: receiving a video stream; generating hash values for frames of the video stream; identifying candidate frames which have the same hash values as other frames; identifying a test sub-sequence of frames containing a candidate frame; comparing the test sub-sequence to a candidate sub-sequence which contains a frame with the same hash value as the candidate frame, to determine if the sub-sequences are sufficiently similar to be considered a match; identifying a test sequence containing a group of sub-sequences, the test sequence corresponding to a candidate sequence in another portion of the video stream containing a group of similar sub-sequences that are considered to match those of the test sequence; determining whether the test sequence is sufficiently similar to the candidate sequence to be considered a match; and identifying a beginning and end of the matched sequence of frames to identify a repeating video sequence.
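The three-stage search just outlined can be sketched in highly simplified form as follows. This is an illustrative toy, not the claimed method itself: it assumes a precomputed list of per-frame hash values, compares fixed-length windows only, and omits the boundary-detection stage:

```python
from collections import defaultdict

def find_repeats(hashes, sub_len=25, min_ratio=0.4):
    """Toy three-stage search over a list of per-frame hash values:
    (1) index frames by hash value, (2) compare test sub-sequences
    against candidate sub-sequences sharing a hash, (3) collect
    matching window pairs as repeat candidates."""
    index = defaultdict(list)               # stage 1: frames by hash value
    for pos, h in enumerate(hashes):
        index[h].append(pos)

    repeats = set()
    for pos in range(len(hashes) - sub_len):
        for cand in index[hashes[pos]]:     # candidate frames, same hash
            if cand <= pos or cand + sub_len > len(hashes):
                continue
            # stage 2: count matching frames in the two sub-sequences
            matched = sum(hashes[pos + i] == hashes[cand + i]
                          for i in range(sub_len))
            if matched / sub_len >= min_ratio:
                repeats.add((pos, cand))    # stage 3 would merge/extend these
    return repeats

stream = [1, 2, 3, 4] * 2 + [9, 9]          # the block 1, 2, 3, 4 repeats
print(sorted(find_repeats(stream, sub_len=4)))
```

In the full method, the matched window pairs would then be grouped into longer sequences and their exact start and end frames determined.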
Preferably the frames are considered as "matched" when a threshold of frame similarity is reached, which includes frames that are similar but not necessarily identical. This may be a function of the hash value; for example, use of a hash function which is calculated for an entire frame that results in the same value for similar images (for example, frames from the same scene which include substantially the same features but are spaced temporally, so that a feature may have shifted slightly but not enough to change the hash value), or use of a hash function which is more discriminative and calculated for a set of regions within a frame where only a threshold level, which is less than 100%, of the best matching regions have to be the same for a match to be declared (for example, a match may still be found even where a banner is superimposed over a portion of the image because the regions corresponding to the banner are ignored). Thus preferably the frame matching step is tolerant to slight differences in the image.
Preferably the sub-sequences are considered as "matched" when a threshold of sub-sequence similarity is reached, which, in addition to identical sub-sequences, also includes sub-sequences that are similar in that they contain "matched" frames in the same order but where the "matched" sub-sequence may include additional frames or may be missing frames as compared to the candidate sub-sequence. This has the advantage of making the sub-sequence matching step tolerant at a fine scale to slight differences in the sub-sequence of frames which might be the result of broadcasters' preferences or the equipment the video stream is received through.
Preferably the sequences are considered as "matched" when a threshold of sequence similarity is reached, which, in addition to identical sequences, also includes sequences that are similar, in that they contain "matched" sub-sequences in the same order but where the "matched" sequence may include additional frames or may be missing frames as compared to the candidate sequence. This has the advantage of making the sequence matching step tolerant, at a coarser scale than the frame and sub-sequence matching steps, to slight differences in the sequence of frames which might be the result of broadcasters' preferences or the equipment the video stream is received through.
Preferably the algorithm uses all three of these matching preferences to produce a method that is particularly tolerant to noise in the video signal and yet can still identify repeating video sequences with a high degree of precision.
The hash values may be generated for entire frames, or more preferably are generated for regions of a frame, and more preferably still a plurality of regions within a frame. Each hash value may represent a collection of frames or frame portions, or more preferably corresponds to a single frame, in particular a collection of regions (for example, 64 regions) captured from a frame, in particular a captured set of regions from within a frame that, when recombined, cannot be used to reconstruct the frame. Thus, in one preferred embodiment, the method may include the step of capturing data representing a plurality of regions from frames of a video stream. This may be performed at the same or a different location to the step of generating hash values for frames of the video stream. The captured data representing a plurality of regions from frames is preferably in a form where it is not possible to reconstruct the original frames from the data, because less than an entire frame is captured in each set of captured regions. This data may be stored or transmitted to another location prior to generating the hash values. The data may be in a modified form, for example, as feature vectors or some other functional representation, representing each of a combination of the plurality of regions in a given frame, the combination constituting less than an entire frame.
The hash value is preferably generated from a function of colour, for example, feature vectors like colour patches, gradient histograms, colour coherence vectors, colour histograms, etc. Of these, colour patches, and in particular colour patches based on average colour, and gradient histograms are most preferred because these reflect the spatial distribution within a frame. They are also particularly quick to calculate and to compare values for, as well as providing excellent recall and precision (recall can be defined as the number of positive matches found divided by the number of relevant sequences in a test video stream, and precision can be taken as the number of positive matches divided by the number of found sequences). In one embodiment, gradient histograms are preferred for their greater discriminative power where the slight additional computational time over, say, colour patches, is of less concern.
Preferably the feature vectors that are used generate hash values which have a scalar feature value. Preferably the hash value is generated from a gradient histogram algorithm. Preferably the gradient histogram algorithm, or other hash value generating algorithm, generates 1-byte integer values (for example, preferably the gradient histogram values are reduced and mapped to 1-byte integer values). This leads to benefits not only in reducing the fingerprint size for the frame, but also drastically reduces the search time in the database.
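A minimal sketch of such a reduction, assuming a frame is available as rows of grayscale pixel values, is shown below. The binning and folding scheme here is an illustrative assumption, not the specification's exact mapping; the point is only that a histogram of gradients is condensed to a single 1-byte value:

```python
def gradient_hash(frame, bins=8):
    """Toy fingerprint: histogram of horizontal gradient magnitudes,
    folded down to a single 1-byte integer so that index lookup and
    comparison are cheap."""
    hist = [0] * bins
    for row in frame:
        for x in range(len(row) - 1):
            g = abs(row[x + 1] - row[x])          # horizontal gradient
            hist[min(g * bins // 256, bins - 1)] += 1
    # fold the histogram into one byte
    h = 0
    for count in hist:
        h = (h * 31 + count) & 0xFF
    return h

frame = [[10, 10, 200, 200], [10, 10, 200, 200]]
print(gradient_hash(frame))   # a deterministic value in 0..255
```

The same frame always yields the same byte, so similar frames collide into the same index entry while the fingerprint stays one byte per frame.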
The sub-sequences are shorter than the sequences and may represent between 5 and 500 frames, more preferably between 10 and 100 frames and more preferably still between 15 and 50 frames. In the most preferred embodiments, the sub-sequences being matched are between 20 and 30 frames, most preferably the candidate sub-sequence is 25 frames (which corresponds to 1 second at 25 frames per second). The test sub-sequence, which is being compared to the candidate sub-sequence, is preferably the same length as the candidate sub-sequence, for example, it may be 25 frames, but also it may be slightly longer or shorter, for example, where there are additional or dropped frames in the test sub-sequence as compared to the candidate sub-sequence. A threshold might be set at ± 5 frames on the candidate sub-sequence length.
In the preferred embodiments where the sequences correspond to commercial spots, these will be typically of the order of 15, 30 or 45 seconds duration. A 30 second commercial, for example, would have 750 frames at 25 frames per second, so a sequence may correspond to 750 frames ± 10 frames.
The method preferably includes the step of building an inverted image index, for example, by means of a hash table. In the inverted image index, the hash values are ordered in ascending (or descending) value and entries are linked to the hash value, each entry identifying a frame (or set of frames) in the video stream with that hash value. Thus an image index is created that is ordered by its hash value. This allows for a fast look up by holding image locations for a corresponding hash value. The hash value, which can be derived from a feature vector like a gradient histogram, can identify the similar or identical frames easily.
Preferably each entry in the inverted image index identifies not only the frame number, i.e. its position within the video stream, but also the number of consecutive frames which share the same hash value. This has the advantage of reducing the size of the inverted image index which can speed up search times.
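The run-length form of the inverted image index described above might be sketched as follows (the data layout is an illustrative assumption; the specification only requires that each entry records the frame position and the number of consecutive frames sharing the hash value):

```python
from collections import defaultdict

def build_inverted_index(frame_hashes):
    """Inverted image index: hash value -> list of (start_frame,
    run_length) entries. Runs of consecutive frames sharing a hash
    value are stored as one entry, keeping the index small."""
    index = defaultdict(list)
    i = 0
    while i < len(frame_hashes):
        j = i
        while j + 1 < len(frame_hashes) and frame_hashes[j + 1] == frame_hashes[i]:
            j += 1                                  # extend the run
        index[frame_hashes[i]].append((i, j - i + 1))
        i = j + 1
    return dict(index)

hashes = [7, 7, 7, 3, 3, 7, 5]
print(build_inverted_index(hashes))
```

A lookup by hash value then returns every location (and run length) in the stream where a similar frame occurred, which is the fast first step of the search.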
Further hash tables may be incorporated into the algorithm to facilitate matching of the sequences. For example, fingerprints of identified repeating video sequences may be incorporated into a hash table to detect matches with a database of fingerprints of previously found or known repeating video sequences. In this way new repeating video sequences can be identified easily and incorporated into updates for commercial spot detection and replacement apparatus. Hash tables may also be used in the sub-sequence and sequence matching described above.
Preferably the video stream is searched through image by image, generating hash values for the hash table in the step of building an inverted image index. This is possible with feature vectors like colour patches and gradient histograms because they are relatively quick and easy to calculate. However, for some feature vectors, it may be desirable to generate hash values for every second frame or other multiple of frames. All entries for each hash value in the hash table may be considered as similar images, assuming a hash function has been chosen with a small number of collisions. The level of similarity required for one frame to match another will depend on the function used to generate the hash value.
Having identified the matching frames (which may be similar or identical), it is possible to identify matching sub-sequences, for example a first sub-sequence in a first portion of the video stream which shares a similar or identical pattern of "matching" frames to a second sub-sequence in a second portion of the video stream. In this way, when the number of frames in a particular sub-sequence (also referred to herein as a "short sequence") that are considered to "match" those of another sub-sequence in another portion of the video stream exceeds a threshold level, then the two small video sequences (the sub-sequences) are considered to be "matched".
In one preferred embodiment, this threshold level is calculated as a match ratio indicating the number of matched frames as a fraction of the candidate length. Preferably a minimum match ratio of 0.2 or greater is used in the algorithm, more preferably it is equal to or greater than 0.32, and in the most preferred embodiments a minimum match ratio of 0.4 is used as the threshold to decide when a test sub-sequence matches a candidate sub-sequence. This corresponds to 10 matching frames in a candidate length of 25 frames. Preferably this minimum match ratio is less than 0.8, more preferably less than 0.6 and most preferably less than 0.48, so that only a minimum of comparison needs to be made before a match is declared and searching can proceed.
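The match-ratio test above can be expressed directly; the function name is an illustrative assumption, but the 0.4 threshold (10 matching frames in a 25-frame candidate, i.e. one second at 25 frames per second) is the preferred value from the text:

```python
def is_subsequence_match(test_hashes, cand_hashes, min_ratio=0.4):
    """Compare a test sub-sequence against a candidate sub-sequence
    frame by frame and declare a match when the fraction of matching
    hash values reaches min_ratio of the candidate length."""
    matched = sum(t == c for t, c in zip(test_hashes, cand_hashes))
    return matched / len(cand_hashes) >= min_ratio

cand = list(range(25))                 # a 25-frame candidate sub-sequence
test = cand[:10] + [99] * 15           # 10 of 25 frames match -> ratio 0.4
print(is_subsequence_match(test, cand))
```

Because only 40% of frames need to match, the test tolerates dropped, inserted or slightly altered frames within the sub-sequence.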
In one embodiment, a last matching frame of the test sub-sequence may be identified and compared to the candidate sub-sequence to check that it corresponds to a frame member within a possible set of frames comprising the matched candidate frame and the subsequent frames of matching hash value.
In one embodiment a distance between the sub-sequences is calculated, and if it meets a predetermined threshold level, a match is detected. The distance might be calculated from the hash values for each frame or a portion of the frames, for example, by counting the number of hash matches, or by performing some other such calculation based on the hash values. The results are then compared, and if these numbers are within a predetermined percentage of each other, for example, 20% or more hash matches for a candidate sub-sequence of 25 frames, then a match is detected.
As mentioned above, an important feature of this invention is that it is not necessary for all the frames in the sub-sequence to be matching; a similarity threshold, which is less than 100%, would be set to determine when a test sub-sequence is considered to be "matched" to another candidate sub-sequence. Thus the test sub-sequence may contain additional frames that do not "match" or have dropped frames, and the "matched" frames may only be similar rather than identical.
The sub-sequences may start at every frame whenever a "matched" frame is detected. The distance between the start of each sub-sequence can be determined to decide whether it meets a certain threshold to be a new sub-sequence within a group or to determine whether it might be part of a new sequence.
Once sub-sequences have been identified, pairs of similar short sequences (the "sub-sequences") are aligned to identify longer sequences (the "sequences"). This has the advantage that only a maximum delay of the order of the sub-sequence length may be required before a match can be detected. A similarity threshold may be set where a certain number of paired sub-sequences need to be identified within a predetermined length of the video stream before a match is detected. Small differences in the longer sequences can be tolerated because the algorithm bases the matching on much smaller similar or identical pairs of sub-sequences, which in turn are based on similar or identical frames being matched.
Preferably the method includes a step of aligning the sequences. This may shift the sequences slightly, allowing a better alignment for estimating the start and end frames. This can be done by estimating occurring shot boundaries or, alternatively, by finding the offset with the minimum distance between both sequences. Cut detection may be performed, and where three or more cuts are located, this can be used to help align the sequences. After aligning, a frame-by-frame comparison backward and forward is carried out for the exact estimation of the start and end position, respectively. In a last step all found recurring sequences are compared with each other, because often there are pairwise repetitions, for example, two commercial spots which are often repeated one after the other, but not always. Consequently, identifying the component parts means that the individual commercial spots can then be recognised.
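The minimum-distance alternative for alignment might be sketched as follows (the shift range and the treatment of frames without a counterpart are illustrative assumptions):

```python
def best_alignment(seq_a, seq_b, max_shift=5):
    """Find the shift of seq_b (within +/- max_shift frames) that
    minimises the number of non-matching hash values against seq_a;
    used to align two detected repetitions before the frame-by-frame
    boundary search backward and forward."""
    best_shift, best_dist = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        dist = 0
        for i, a in enumerate(seq_a):
            j = i + shift
            if 0 <= j < len(seq_b):
                dist += a != seq_b[j]
            else:
                dist += 1          # frames with no counterpart count as misses
        if dist < best_dist:
            best_shift, best_dist = shift, dist
    return best_shift, best_dist

a = [1, 2, 3, 4, 5, 6]
b = [9, 1, 2, 3, 4, 5]             # b is a, delayed by one frame
print(best_alignment(a, b))
```

With the best shift applied, the exact start and end frames can then be located by stepping outward frame by frame until the sequences diverge.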
Preferably the method also includes a step of filtering, which may be performed on the matched sub-sequences and/or the matched sequences. For example, sequences which are very short and do not meet a minimum length requirement (which might be caused by accidental matches of unrelated sub-sequences) can be discarded. This might be set to five or ten seconds, or 125 or 250 frames, for example. Additionally, matches of frames or sub-sequences that are very close to the test frame may be rejected, for example, within 80 seconds or 2000 frames, to avoid detection within the same scene. There may also be a filtering system for detected sequence repetitions, for example to take into account the properties of the false candidates, for instance the much greater length of music clips, and so on.
Filtering may also be conducted to reduce the number of candidates that need to be investigated further, and in this way optimise processing resources. Counts may be conducted to monitor the number of "matches" and/or "non-matches", and where these meet a particular threshold, the candidate sub-sequence or sequences can be either investigated more thoroughly or discarded from the candidate list as appropriate. For example, in one embodiment, during a frame-by-frame comparison, a count is kept of the number of consecutive non-matching frames detected, and if this exceeds a threshold, the candidate sub-sequence is removed from further investigation. In another, a count is kept of the number of matching frames in the candidate sequence, and only if it exceeds a threshold is it moved on for further investigation.
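Two of the filters described above, the proximity rejection and the consecutive-non-match cut-off, can be combined in a single check. The threshold values and function name are illustrative assumptions; the 2000-frame gap corresponds to the 80 seconds at 25 frames per second mentioned in the text:

```python
def passes_filters(test_hashes, cand_hashes, start_gap,
                   min_gap=2000, max_consecutive_misses=5):
    """Reject candidates that are too close to the test position
    (likely the same scene) or that accumulate too many consecutive
    non-matching frames during the frame-by-frame comparison."""
    if start_gap < min_gap:               # e.g. within 2000 frames / 80 s
        return False
    misses = 0
    for t, c in zip(test_hashes, cand_hashes):
        misses = misses + 1 if t != c else 0
        if misses > max_consecutive_misses:
            return False                  # abandon this candidate early
    return True
```

Candidates failing either test are dropped before the more expensive sequence comparison, conserving processing resources.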
Preferably the repeating video sequence is compared against the already found sequences to identify if it is a new one. This may be achieved by comparing the repeating video sequence, or more preferably, a fingerprint or set of fingerprints of the repeating video sequence, against a reference database of such fingerprints or set of fingerprints for already found repeating video sequences, in order to detect a match. If it is a new repeating video sequence, preferably the method includes the step of automatically updating the reference database with the new repeating video sequence or fingerprint(s) thereof.
According to a second aspect, the present invention provides a method of identifying the recurrence in a video stream of a stored sequence, the method comprising: receiving a video stream; generating hash values for frames of the video stream; comparing the hash values from the video stream to hash values for frames of a stored sequence to identify frames with matching hash values; identifying sub-sequences of frames in the video stream that each contain a frame with a matched hash value to a frame of the stored sequence; comparing sub-sequences of frames in the video stream to candidate sub-sequences of frames in the stored sequence and determining a match when a sub-sequence contains a threshold level of frames with hash values matched to frames of a candidate sub-sequence; identifying a sequence of frames, where the sequence contains a group of sub-sequences that have been matched to candidate sub-sequences; determining whether the sequence of frames in the stored sequence is recurring as a sequence of frames in the video stream by comparing the groups of sub-sequences and candidate sub-sequences and determining whether a threshold level of similarity is met; and when a recurrence of a video sequence is detected, generating a signal indicative of having identified a recurrence in the video stream of the stored sequence.
In this aspect, the same matching technique of (i) identifying "matched" frames (which may be identical or similar), (ii) identifying "matched" sub-sequences (which may be identical or similar), and (iii) identifying "matched" sequences (which may be identical or similar), is applied to the process of searching for a stored video sequence within a video stream. The sequence may be known in the sense of it being a known commercial spot, or it may be unknown in the sense of being known to be a commercial spot because of its characteristics, but unknown in terms of content.
Where the method has particular utility is in connection with an automated system which first mines television broadcasts according to the first aspect, uses the found repeating sequences to build a database of sequences that are considered to be commercial spots, and then uses the commercial spots as stored sequences in the second aspect to mine automatically an incoming video stream, detect commercial spots automatically and then replace the sequences automatically with alternative sequences. Preferably this detection and replacement is conducted on a live video stream, with the detection and replacement being conducted in real-time (i.e., with a delay which is essentially unnoticed by the observer, for example one or two seconds). Preferably the commercial spots are stored on the database as a fingerprint or set of fingerprints that can be used to identify the sequence of frames and the detection and replacement apparatus is configured to mine the incoming video stream by looking for matching fingerprints.
The preferred features of the first aspect may apply equally to and are interchangeable with the second aspect, and vice versa.
Thus the present invention also extends to the automatic generation of a database which is built from repeating video sequences identified from the method of the first aspect. The database can be sold or leased to operators of commercial spot detection and replacement apparatus. The method can be used to automatically identify new commercial spots and add these to an existing database, in order to update a reference database for the method of the second aspect. The accuracy and reliability of the commercial spot detection and replacement apparatus is dependent on the reference database. While the algorithm is optimised as far as possible for accuracy and reliability, inevitably a balance exists with the amount of processing resources required to achieve this. In practice, the odd mis-detection at the ad detection and replacement apparatus may go unnoticed, and so while it is preferred to have no errors in the database if at all possible, a few may be tolerated. For the present purposes, it is preferable to capture as many fingerprints of commercial spots as possible, so that the ad detection and replacement apparatus has a large reference database to refer to, rather than to ensure that 100% of these are actual commercial spots.
Consequently, mining broadcasts automatically in this way means that these reference databases can be continually updated as new commercial spots are released. Tens or even hundreds of channels can be mined simultaneously to build up the reference database, without human intervention. After a while, with establishment of the reference database, the updates of newly identified repeating sequences will comprise just a small number of fingerprints.
Preferably the database contains information about the sequences which cannot be used to recreate the sequences, only identify the sequences. Thus the database may comprise the hash values discussed above, which have been calculated for regions of frames (preferably without the whole frame ever being captured from the source video stream), that are indexed to the images to identify and recognise particular sequences within the video stream.
The steps of the first or second method may be conducted by a processor of a computer which is programmed to execute the algorithms of the first and / or second aspect of the present invention. Typically the algorithm of the first aspect will be conducted on a first computer in a first location and the algorithm of the second aspect will be conducted on a second computer in a second location. A communication link may be provided to update the second computer with data of the repeating video sequences identified by the first computer.
Thus according to another aspect, the present invention provides a system comprising: a) a fingerprint database generating apparatus having: an input for inputting a video stream; a processor which is programmed with a first algorithm to analyse the video stream in order to identify repeating video sequences; a fingerprint database which is updated automatically with the fingerprints of a repeating video sequence when a new repeating video sequence is detected; and an output for outputting fingerprint data of detected repeating video sequences, b) a detection and replacement apparatus, which is adapted to perform video sequence detection (e.g., commercial spot detection) and replacement on a live video broadcast in real time, the detection and replacement apparatus having: a video input for receiving a video broadcast; a video output for outputting a video signal to an observer; a video switch for selecting a source of the outputted video signal; and a processor which is programmed with a second algorithm in order to detect a known video sequence (e.g., a commercial spot) by generating fingerprints of the video broadcast, comparing the fingerprints to stored fingerprints on a reference database, and when a match is detected, to trigger automatically the video switch to switch the source of the outputted video signal from the received video broadcast to a video output having a replacement video sequence (e.g., a replacement commercial spot), so that the outputted video signal corresponds to the received video broadcast with replacement video sequences (e.g., commercial spots), and c) wherein the detection and replacement apparatus has a communication link for communicating with the output of the fingerprint database generating apparatus to receive and store updates of the fingerprint database and thereby update automatically its reference database of fingerprints.
The video stream that is being received by the input for the fingerprint database generating apparatus might be in the form of the original broadcaster's video signal or a multi-channel video signal. More preferably, however, it is a data stream representing just a captured portion or captured regions of frames from the broadcaster's video signal or multi-channel signal, the video stream then representing less than the entire frames of the original broadcast and it not being possible to reconstruct the original broadcast from the captured portion or regions. Thus preferably the video stream is a captured video signal or captured multi-channel signal.
The system may include a broadcast capturing apparatus which is arranged to receive a broadcast in the form of a video signal or multi-channel signal from a broadcaster via an input, the apparatus having a video processing device to capture only a portion or regions of frames from the broadcast and an output for outputting a video stream corresponding to the broadcast but omitting information which could allow the broadcast to be reconstructed to its original form.
The broadcast capturing apparatus may be part of an assembly also containing the fingerprint database generating apparatus, but is more preferably part of a separate device at a different location which communicates with the fingerprint database generating apparatus via a network.
Having regard to the method of the first aspect, the present invention can also be seen to provide an apparatus for mining a video stream for repeating video sequences comprising: a processor which receives a video stream that is being fed into an input; wherein the processor is programmed to execute the steps of: generating hash values for frames of the video stream; identifying frames with matching hash values; identifying sub-sequences of frames within the video stream that each contain a frame with a matched hash value; comparing sub-sequences and determining a match when a sub-sequence contains a threshold level of frames with hash values matched to frames of another sub-sequence; finding sequences of frames, where each sequence contains a group of sub-sequences that have been matched to other sub-sequences; determining whether a sequence of frames in a first portion of the video stream is repeating as a sequence of frames in a second portion of the video stream by comparing the groups of sub-sequences and determining whether a threshold level of similarity is met; and when a repeating video sequence is detected, detecting start and end frames for the repeating video sequence.
Preferably the processor is coupled to a database which stores the detected repeating video sequences, either as the repeating video sequences themselves in a normal or compressed form, or more preferably as a fingerprint which can be used to identify the sequence, for example, a fingerprint containing hash value data for the sequence. This apparatus may form part of the fingerprint database generating apparatus of the system described above. Preferably the database which stores the detected repeating video sequences is accessed each time a new repeating video sequence is detected to check if it has been detected previously (for example, by the fingerprint database generating apparatus or by a different apparatus located in another location which may be mining a different video stream and updating the database with new repeating video sequences when they are detected).

Having regard to the method of the second aspect, the present invention can also be seen to provide an apparatus for identifying the recurrence in a video stream of a stored sequence, the apparatus comprising: a processor which receives a video stream that is being fed into an input; wherein the processor is programmed to execute the steps of: generating hash values for frames of the video stream; comparing the hash values from the video stream to hash values for frames of a stored sequence to identify frames with matching hash values; identifying sub-sequences of frames in the video stream that each contain a frame with a matched hash value to a frame of the stored sequence; comparing sub-sequences of frames in the video stream to candidate sub-sequences of frames in the stored sequence and determining a match when a sub-sequence contains a threshold level of frames with hash values matched to frames of a candidate sub-sequence; identifying a sequence of frames, where the sequence contains a group of sub-sequences that have been matched to candidate sub-sequences; determining whether the sequence of frames in the stored sequence is recurring as a sequence of frames in the video stream by comparing the groups of sub-sequences and candidate sub-sequences and determining whether a threshold level of similarity is met; and when a recurrence of a video sequence is detected, generating a signal indicative of having identified a recurrence in the video stream of the stored sequence, which is fed to an output.
This apparatus may form part of the detection and replacement apparatus of the system described above.
The method and apparatus of the second aspect may also have wider application in looking for video sequences that are not commercial spots but relate to other types of video sequence, for example, scenes from a film, a music video or other program. It may be used to monitor broadcasts for audit purposes or to detect infringements. It may be applied to live broadcasts or may be applied to archived material, for example, as a search engine. It may be used to mine large video archives or websites for particular video sequences. For example, an internet service provider might use the method to locate particular video files for audit or exclusion purposes. The method may include the step of decompressing a video stream from a first format to a second format before the matching is performed. The video stream may originate from a MPEG file or similar format and the hash values may be generated just for reference frames.
The above references to a video stream include a video stream that is multidimensional. For example, the video stream may have more than one channel. The method or apparatus may analyse two or more channels of the video stream simultaneously, generating multiple files of hash values for the different channels. Thus, the "first portion" of the video stream may be one channel and the "second portion" of the video stream having the "matched" recurring video sequence may be from a different channel. The method or apparatus may also generate multiple files of hash values representing a video stream at different times, and then mine through all of the files looking for repeating video sequences. For example, the "first portion" of the video stream may represent one recording of a channel at a given time and a "second portion" of the video stream having the "matched" recurring video sequence may represent the same or a different channel recorded at a different time. In order to handle the large amounts of data, for example, it may be desirable to split up large durations of video into smaller, more manageable sections that are temporally spaced. These might be compared and so a "first portion" of the video stream may represent a first 2 hour segment from a channel and a "second portion" of the video stream may represent a later 2 hour segment, possibly spaced by a 24 hour period. These files may be produced and stored at several locations and a network provided to mine across the data at the various locations. In another preferred embodiment, the repeating video sequences identified at each location are added to a central database at intervals. The video stream being received may be in the form of images or may be modified, for example, the video stream being analysed may be representations of frames or portions of frames that have been captured from an original video stream.
The present invention also relates to a software product which contains instructions that, when loaded onto a computer, cause the computer to execute the method of the first and/or second aspect of the present invention. The present invention further extends to a software product for putting into effect the method of the first and/or second aspect of the present invention, wherein the software product is a physical data carrier. The present invention also extends to a software product for putting into effect the method of the first and/or second aspect of the present invention, wherein the software product comprises instructions transmitted from a remote location. The present invention also encompasses a database of stored video sequences (preferably stored as fingerprints of the video sequences) which has been generated through updates of newly found repeating video sequences as a result of the automated mining process discussed above.
Certain preferred embodiments of the present invention will now be described in greater detail and by way of example only, with references to the accompanying drawings, in which:
Figure 1 is a flowchart of steps 1 - 2 in the preferred algorithm;
Figure 2 is a flowchart of step 3 in the preferred algorithm;
Figure 3 shows a distribution of index values for colour patches with B = 6;
Figure 4 shows a distribution of index values for colour patches with B = 6 in a sample of UK video;
Figure 5 shows a distribution of index values for an 8 x 8 x 8 gradient histogram with B = 7 in a sample of US video;
Figure 6 shows a distribution of index values for an 8 x 8 x 8 gradient histogram with B = 7 in a sample of the UK video;
Figure 7 shows the performance values for the sample of US video with an 8 x 8 x 8 gradient histogram and B = 7. Shown are recall, the number of false positive matches, and search time for varying parameters: 7(a) RMIN, 7(b) LFG, 7(c) DMAX, and 7(d) LB, with values for the other parameters given at the bottom of the sub-figures;
Figure 8 shows a flowchart for the preferred recurring sequence search algorithm;
Figure 9a shows length statistics for detected multiples in ChartTV over 24 hours, and Figure 9b shows length statistics for detected multiples in ChartTV over 48 hours with no limitation in index values, where the labelled intervals are multiples of 10 seconds and have a range of ±5 frames, the intermediate (not labelled) intervals cover the wider range (239s) between the marked lengths, and all multiples longer than 2005 seconds are merged into the same histogram bin;
Figure 10a shows length statistics for detected multiples in SkySportsNews over 24 hours without limitation in index values, and Figure 10b shows the length statistics with limitation, where the labelled intervals are multiples of 10 seconds and have a range of ±5 frames, the intermediate (not labelled) intervals cover the wider range (239s) between the marked lengths, and all multiples longer than 2005 seconds are merged into the same histogram bin; and
Figure 11 shows an example of a system incorporating the present invention.

Various ways of fingerprinting frames of a video stream are already known. Each has its associated benefits in terms of how powerful it is at discriminating between different images, how quick it is to calculate and how quick it is to match. Three techniques will be described below which have been found to have particular utility in the algorithm of the present invention. These are colour patches, gradient histograms and SIFT (Scale-Invariant Feature Transform) features. Other feature vectors, for example, ones based on these with simple modifications to the calculation to represent the property with a different value, and feature vectors which calculate a descriptor for other properties of the video, are also envisaged within this invention. Thus these feature vectors are intended as examples only, but examples that have been shown to have particular utility in the present invention.
Colour Patches
The colour patches fingerprint algorithm measures primarily a coarse colour distribution in a single frame. In essence, these are a calculation of the average colour (or other similar function describing the intensity of a particular colour) for regions of a frame or frames. They can be calculated as follows. Let R(x), G(x) and B(x) denote the red, green and blue intensity values of the image at pixel x, with R(x), G(x), B(x) in (0, 255). For N x M subareas Snm of the image, averages of the colour intensity values are calculated:

Rnm = (1 / |Snm|) * sum over all x in Snm of R(x),

and analogously for Gnm and Bnm.
In this way, the colour patches fingerprint vector contains 3 x N x M values Rnm, Gnm, Bnm of averaged colour intensities. In one embodiment, each entire frame, or each captured portion of a frame, is divided into 8 x 8 subareas, for example, arranged as an 8-by-8 grid. The average intensity for each region is calculated for each of the three colour channels, red, green, and blue (i.e., 3 x 64 byte values = 192 byte values = 192 bytes). The distance between fingerprints can be calculated using the L1-Norm (summing the absolute differences between the values for corresponding regions) or L2-Norm (summing the squares of the differences between the values for corresponding regions). In one embodiment, the L1-Norm is calculated and multiplied by a factor, for example by 80. Preferably the matching involves selecting the n closest matching regions of a possible m regions to determine if a similarity threshold is passed (n <= m). The colour patches are very fast to compute, tolerant to image noise, and very good at matching identical clips.
When it comes to matching fingerprints, the colour patches algorithm gives very good recall values, but comes with low precision for certain types of video such as those containing relatively dark spots with low intensity values.
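The colour patch calculation and L1 comparison described above can be sketched as follows. This is a minimal illustration, not the patented implementation: frames are assumed to arrive as H x W x 3 NumPy arrays of 8-bit RGB values, and the function names are ours.

```python
# Sketch of the colour patches fingerprint: average R, G, B per cell of
# an 8x8 grid (192 byte values per frame), compared with the L1 norm.
import numpy as np

GRID = 8  # 8x8 subareas -> 3 * 64 = 192 byte values

def colour_patches(frame: np.ndarray) -> np.ndarray:
    """Return the 192-value colour patch fingerprint of an RGB frame."""
    h, w, _ = frame.shape
    fp = np.empty((GRID, GRID, 3), dtype=np.float64)
    for n in range(GRID):
        for m in range(GRID):
            cell = frame[n * h // GRID:(n + 1) * h // GRID,
                         m * w // GRID:(m + 1) * w // GRID]
            fp[n, m] = cell.reshape(-1, 3).mean(axis=0)  # avg R, G, B
    return fp.ravel().astype(np.uint8)

def l1_distance(fp1: np.ndarray, fp2: np.ndarray) -> int:
    """L1 norm: sum of absolute differences over corresponding regions."""
    return int(np.abs(fp1.astype(int) - fp2.astype(int)).sum())

frame = np.full((480, 640, 3), 128, dtype=np.uint8)  # uniform grey frame
fp = colour_patches(frame)
assert fp.shape == (192,) and l1_distance(fp, fp) == 0
```

The per-region averaging is what makes the fingerprint cheap to compute and tolerant to pixel-level noise, at the cost of discriminative power on dark, low-intensity material.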
Gradient Histograms

For obtaining structural information about images we need to consider spatial distributions of colours or intensity, respectively. Another possible approach is analyzing intensity changes, characterizing edges and corners in the image. In this case it is not necessary to identify particular objects but only to identify identical frames in the video, so it is sufficient to calculate descriptors providing information about intensity changes in the image.
In the method described below, we only consider the distribution of gradient orientation in different areas of the image. For each image point, gradient orientation and magnitude are calculated. For each area the histogram representing the distribution of gradient orientation is computed, and each sample point is weighted by the gradient magnitude.
Let I(x) be the grayscale intensity value, I(x) in (0, 255), and ∇I(x) the gradient in intensity at point x. The magnitude of the gradient,

m(x, y) = |∇I(x, y)| = sqrt(Ix^2 + Iy^2),

and its orientation,

θ(x, y) = arctan(Iy / Ix),

are calculated by using pixel differences as the gradient approximation:

Ix = I(x + 1, y) - I(x - 1, y),
Iy = I(x, y + 1) - I(x, y - 1).

This discrete representation is used for the intensity gradient magnitude and orientation computation.
For histogram evaluation we divide the whole image into N x M subareas Inm, over each of which we accumulate gradient magnitude values in K bins covering the range of possible gradient orientations:

H^k_nm = (1/C) * sum of m(x) over all points x in Inm whose orientation θ(x) falls into bin k,

with the normalisation factor C chosen such that the values over all subareas and bins sum to 1. The distance between two images I1 and I2 we measure with the L1-Norm:

d(I1, I2) = sum over n, m, k of |H^k_nm(I1) - H^k_nm(I2)|.

The size of the gradient histogram fingerprint is N x M x K values H^k_nm.
In one example, the spatial resolution is 8 x 8 subareas, and the number of bins is set to 8 too (N = M = 8, K = 8). This gives a relatively large fingerprint. To reduce the size of the fingerprint, a resolution of 4 x 4 with 4 bins may be preferred (N = M = 4, K = 4), which provides a coarser resolution; alternatively an intermediate resolution may be chosen, such as N = M = 8, K = 4, or N = M = 4, K = 8.
The gradient histogram algorithm gives excellent recall values with an acceptable precision. False matches are limited to matches of similar spots which have identical scenes but significantly different durations, e.g. one commercial contains subsequences of the other. For our test cases, we have found no significant differences in recall and precision for different parameters N, M, and K. Setting the parameters as N = M = K = 8 gives good qualitative results, but a lower resolution works with almost the same quality and benefits from a much smaller fingerprint size.
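The gradient histogram computation described above can be sketched as follows, as a hedged illustration rather than the claimed implementation: grayscale frames are assumed to be 2-D NumPy arrays, and NumPy's central-difference gradient stands in for the pixel-difference approximation.

```python
# Sketch of the gradient histogram fingerprint: an N x M grid of K-bin
# orientation histograms, each sample weighted by gradient magnitude,
# normalised so that all entries sum to 1.
import numpy as np

def gradient_histogram(gray, N=4, M=4, K=4):
    """N x M grid of K-bin gradient-orientation histograms."""
    gy, gx = np.gradient(gray.astype(np.float64))  # pixel-difference approx.
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.arctan2(gy, gx)                       # orientation in (-pi, pi]
    bins = np.minimum((ang + np.pi) / (2 * np.pi) * K, K - 1).astype(int)
    h, w = gray.shape
    hist = np.zeros((N, M, K))
    for n in range(N):
        for m in range(M):
            sl = (slice(n * h // N, (n + 1) * h // N),
                  slice(m * w // M, (m + 1) * w // M))
            for k in range(K):
                hist[n, m, k] = mag[sl][bins[sl] == k].sum()
    total = hist.sum()
    return hist / total if total else hist  # normalise to sum 1

gray = np.tile(np.arange(64.0), (64, 1))   # simple horizontal ramp image
h = gradient_histogram(gray)
assert h.shape == (4, 4, 4) and abs(h.sum() - 1.0) < 1e-9
```

With N = M = K = 4 the fingerprint has 64 values, matching the size-reduction trade-off discussed in the text.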
SIFT Features
A very popular method in object recognition is the extraction of so-called SIFT (Scale Invariant Feature Transformation) features. SIFT features were primarily developed for matching different images of an object or a scene. The method consists of extracting key points, characteristic points that are invariant to scale and orientation, and assigning them a keypoint descriptor representing the gradient distribution in the region around the keypoint. The last operation is done on image data that has been rotated into the major direction of the local gradient orientation, in order to obtain invariance to rotational transformations. Because keypoint extraction is expensive and results in a large number of key points, we restrict the algorithm to the keypoint descriptor part, assigning keypoint descriptors to arbitrary points on a regular grid. As in the gradient histogram algorithm we work on grayscale images, i.e. we have N x M subareas Inm, the centre of each of which serves as a keypoint. Intensity gradients are evaluated as in the magnitude and orientation equations above. Histograms are accumulated over K orientation bins and an L x L spatial array. The gradient magnitude is weighted by a Gaussian window with its centre at the keypoint. In our tests we choose the size SL of the (square) spatial bins such that the image is completely covered in the dimension with the smaller size (height):
SL = H / (N x L).
We also disregard border pixels in the other dimension (width). Thus, border pixels in all dimensions are almost neglected by the Gaussian window.
The fingerprint size is N x M x L^2 x K. In one example, for evaluating the keypoint descriptor of the SIFT feature method we take the "standard" parameters of Lowe (see the article entitled "Distinctive image features from scale-invariant keypoints" by David G. Lowe, International Journal of Computer Vision, 60(2), 2004). Around each keypoint we compute gradient histograms over a spatial L x L array, with L = 4, and K = 8 orientation bins. The simplest test case assigns one keypoint to the centre of each frame, N = M = 1. This leads to L x L = 4 x 4 gradient magnitude histograms with K = 8 orientation bins. Thus it is comparable to the case N = M = 4, K = 8 in the gradient histogram algorithm, but the magnitudes are weighted with respect to the spatial distance to the keypoint, the areas over which values for one histogram are accumulated have different sizes and, due to the transformation into the major gradient orientation, are sometimes at different locations. In a second example, we take N = M = 2, which is similar to the N = M = K = 8 test case with gradient histograms.
The SIFT fingerprint gives good recall with lower precision. It is a more time-consuming method than gradient histograms (for example, gradient histograms can take 1.5 times as long to calculate as colour patches, whereas SIFT fingerprints can take five times as long). However, while the higher spatial resolution can give slightly better results for object matching, it often shows worse performance for image matching compared to the other two fingerprint algorithms.
The present invention concerns a new method, and apparatus executing the method, for detecting matching video sequences. In a first aspect the method is used to identify previously unknown repeating video sequences in a video stream. In a second aspect, the method is adapted to look for repeats of a known video sequence in a video stream. The matching algorithm uses the fingerprint techniques discussed above in its execution. It is based on matching strings of data and will be referred to below as the new String Matching Algorithm.
The New String Matching Algorithm
An approximate substring matching algorithm has been used previously which employs a frame-by-frame distance measure that is able to tolerate inserted and/or missing frames. However, one disadvantage of this method is its high computational complexity: the distance has to be evaluated for each query sequence in the database. Another drawback which has been noticed is that the algorithm includes a local measure for deciding whether two frames match or whether instead a frame needs to be inserted or deleted. For matching sequences, the global measure including insertions and deletions is the criterion for finding the corresponding substring. Problems arise if surrounding frames are judged similar enough for the global measure by being below a specified threshold. In such cases the end of the commercial spot is often not detected correctly.
The new string matching algorithm has been developed with the objective to outperform the existing approximate substring matching (ASM) algorithm in two aspects:
• Computational complexity: The ASM algorithm compares every frame of the test stream with every frame of each query sequence in the database. This is computationally quite expensive and increases linearly with the size of the fingerprint database. In the present invention, an appropriate indexing structure on the fingerprint database is used such that the frames of only a small subset of the query sequences in the fingerprint database need actually to be compared at any time instance with a test frame. This results in huge computational performance savings.
• Precision for dark frames: The ASM algorithm has difficulties in estimating the correct end of relatively dark spots when followed by black frames. The new string matching algorithm solves that problem too.
In contrast to the problem of string matching of sequences which are composed of members of a small sized alphabet, such as in text recognition or DNA analysis, single frames which form a video sequence are high-dimensional structures themselves. Therefore two problems arise if we have to decide whether a sequence of images is equal to another one. First we need a criterion for deciding if two single frames are considered as equal, or better: when are two frames similar enough to call them equal, and, second, we need a criterion for deciding when two sequences of similar frames are equal.
To increase look-up speed and avoid searching through the whole database, an inverted index may be incorporated into the string matching algorithm. An inverted index structure provides a look-up table which assigns to each image feature value all query sequences containing frames characterized by this feature. Additionally, the position of the matching frames can be stored in the index.
The new string matching algorithm combines the inverted image index technique with a frame-by-frame check. Here, the look-up in the index file is used for searching for candidate sequences to limit the time spent on frame-by-frame matching. In contrast to some techniques, we use a scalar feature value for creating the image index. This may be a single property like frame intensity or a scalar function of a (small) feature vector.
The index feature should meet the following conditions. On the one hand it should be robust in the sense, that "similar" frames are likely to be assigned the same value in the index table. On the other, its range should be large enough to distinguish between "different" frames. There is a balance to be found between recognising all the relevant sequences in the index-look-up step, while also limiting the number of candidates. If the set of possible or occurring index values, respectively, is too small, too many "similar" images are found which leads to a loss in performance, because of the tendency to do frame-by-frame-checking for every sequence in the database.
Matched frames for all candidates are counted. For a large database, a dynamic table is used for storing the candidates. In addition to counting matched frames, the number of consecutive non-matching frames is counted too. If this number exceeds a critical value, the candidate spot is removed from the list. On the other hand, if a threshold ratio of hits is reached and hits are found at the end of the candidate string, a frame-by-frame check is conducted over the last buffered frames, measuring the distance between individual frames. If the distance of this substring is less than a threshold value, we have a "hot" candidate and we try to estimate the end of the sequence in the test stream.
In more detail, we create an image index of the following structure. For each index value H we have a list of NE(H) entries Ei(H), with

Ei(H) = (sH,i, fH,i, lH,i).

Each entry consists of three integers: sH,i refers to the spot id that the entry belongs to, fH,i identifies the position of a frame with index value H in spot sH,i, and lH,i is the number of contiguous frames with index value H starting at position fH,i. lH,i has been introduced due to the assumption that the index property maps consecutive frames with only small differences to the same index value. Therefore, if we have lH,i contiguous frames all with the same index value H, we generate only one entry in the index table. This is a form of run-length encoding.
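The run-length encoded inverted index described above can be sketched as follows. This is an illustrative data structure only: index values are plain integers here, and their actual computation (from colour patches or gradient histograms) is covered later in the text.

```python
# Sketch of the inverted index: each index value H maps to a list of
# entries (spot_id, frame_pos, run_length). Runs of consecutive frames
# that share the same index value are stored as one entry (run-length
# encoding), keeping the table compact.
from collections import defaultdict

def build_index(spots):
    """spots: {spot_id: [index value per frame]} -> {H: [(s, f, l)]}"""
    index = defaultdict(list)
    for sid, values in spots.items():
        f = 0
        while f < len(values):
            run = 1
            while f + run < len(values) and values[f + run] == values[f]:
                run += 1
            index[values[f]].append((sid, f, run))  # one entry per run
            f += run
    return index

idx = build_index({"spotA": [5, 5, 5, 9, 9, 5]})
assert idx[5] == [("spotA", 0, 3), ("spotA", 5, 1)]
assert idx[9] == [("spotA", 3, 2)]
```

A look-up for index value H then returns, in one step, every candidate spot containing a frame with that value, which is what limits the frame-by-frame work during matching.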
During examination of the test stream all spots with hits are held in a dynamic candidate list of size NC. Thus, we have j = 1, . . . , NC candidate spots at any time. Beside the spot id, the following parameters are stored for each candidate Cj:

- Nj: the length of candidate Cj, i.e. its number of frames;
- Mj(f): an array for storing found matches, with Mj(f) = 0 if there was not yet a match for frame f, and Mj(f) = 1 if an index match for this frame was found;
- NjM: the number of frames which are marked as matches, i.e. the sum of Mj(f) over all frames f of the candidate;
- Rj = NjM / Nj: the fraction of matched frames;
- LFj: the position of the last frame which is marked as a hit, i.e. the largest f with Mj(f) = 1;
- NjH: the number of frames from the test stream which have caused a hit for candidate Cj; for every test frame which has caused one or more hits, NjH is increased by 1;
- NjF: the number of consecutive frames from the test stream which have caused no matches in Cj; NjF is increased by 1 for every test frame which has not caused any hit, and is reset to 0 otherwise;
- a status flag which provides information on whether the current test frame caused a match; and
- a status flag indicating the result of the last frame-by-frame check, i.e. whether the frame-by-frame distance indicates a positive match;

with the following control parameters:

- RMIN: the minimum ratio of matched frames to the number of all frames of the candidate which must be reached for further investigation;
- a minimum required percentage of frames from the test stream which must have caused a hit in order to trigger further investigation;
- FMAX: the maximum allowed count of consecutive non-matching frames from the test stream; if NjF of the candidate spot exceeds this value, the candidate is deleted from the candidate list;
- LFG: the maximum gap between the last matched frame LFj and the last frame of the candidate spot; a gap within LFG signals that the end of the candidate is reached;

and the parameters concerning the frame-by-frame distance measure:

- Dj(I1, I2, n): the frame-by-frame distance for two sequences of length n with the last frames I1 and I2, respectively;
- DMAX: the maximum distance for the frame-by-frame check for similarity of two sequences;
- DMIN: the minimum distance for two frames to be considered different;
- LB: the length of the test buffer for the frame-by-frame distance measure.
In one preferred embodiment, we have the following program flow for the String Matching Algorithm:

1. Evaluate the index value H(I) for the current frame I of the test video.

2. For each entry Ei(H) do:
(a) Check if the corresponding spot Cj is already listed as a candidate. If not, add an entry for it.
(b) Mark the frames from Ei(H) as hits in Mj.
(c) Update the match ratio Rj.
(d) Set the last-frame variable LFj to the frame number of the last matched frame.

3. For each candidate Cj in the candidate list do:
(a) If the current frame is a match, increase the hit counter NjH and set NjF to 0; else increase the counter for non-matching frames NjF.
(b) If the match ratio Rj reaches RMIN and the last matched frame LFj lies within LFG of the end of the candidate:
i. If we have not yet checked the frame-by-frame distance Dj(LFj, I, LB), or the distance of an earlier check was greater than the threshold DMAX, evaluate the frame-by-frame test.
ii. If Dj(LFj, I, LB) <= DMAX, the candidate is marked as a "hot" candidate, i.e. we have (1) a minimum of matched frames, (2) hits near the end of the candidate, and (3) a frame-by-frame distance of the last frames that indicates a match.

4. For each "hot" candidate: if the distance between the current frame I from the test stream and the last matched frame from the candidate is larger than the minimum distance DMIN, i.e. both frames are different, we claim that the end of the spot is reached. The "hot" candidate is declared a positive match.

5. Positive matches are removed from the candidate list.

6. All candidates with NjF > FMAX are removed from the candidate list, i.e. all candidates which have not had any hit over a certain time interval.
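The candidate bookkeeping in steps 1-6 can be sketched as follows. This is a simplified illustration under stated assumptions: only the index look-up, hit counting and pruning are modelled, the frame-by-frame distance check is omitted, and the control-parameter values are placeholders, not those of the patent.

```python
# Simplified per-frame candidate update: look up the current frame's
# index value, credit hits to the matching spots, track consecutive
# misses, prune stale candidates and report "hot" ones.
R_MIN, F_MAX = 0.5, 25  # minimum match ratio; max consecutive misses

def process_frame(H, index, candidates, spot_lengths):
    hit_spots = set()
    for sid, f, run in index.get(H, []):             # step 2: index look-up
        c = candidates.setdefault(sid, {"hits": 0, "miss": 0, "last": -1})
        c["hits"] += run                             # credit the whole run
        c["last"] = max(c["last"], f + run - 1)
        hit_spots.add(sid)
    hot = []
    for sid in list(candidates):                     # step 3: update counters
        c = candidates[sid]
        c["miss"] = 0 if sid in hit_spots else c["miss"] + 1
        if c["miss"] > F_MAX:                        # step 6: prune stale
            del candidates[sid]
        elif c["hits"] / spot_lengths[sid] >= R_MIN:
            hot.append(sid)            # would now run the frame-by-frame check
    return hot

index = {7: [("ad1", 0, 4)]}
cands, lengths = {}, {"ad1": 4}
hot = process_frame(7, index, cands, lengths)
assert hot == ["ad1"] and cands["ad1"]["hits"] == 4
```

The full algorithm additionally requires hits near the end of the candidate and a passing frame-by-frame distance before a candidate becomes "hot"; those checks are deliberately left out here for brevity.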
Flowcharts of steps 1 and 2, dealing with the look-up in the index table, are shown in Figure 1, and the part described in step 3 is shown in Figure 2.

New String Matching Algorithm with Colour Patches
If we divide one image into N x M subareas Snm, the colour patches fingerprint vector contains 3 x N x M values, with Rnm, Gnm, and Bnm denoting the average red, green, and blue intensity values for subarea Snm of the image, n = 1, . . . , N and m = 1, . . . , M.
We are looking for a scalar index value which distinguishes between different frames in a wide range, but may assign similar frames to the same index value. Similar frames include identical frames in repeating sequences as well as frames directly following each other showing only small differences. Mapping identical frames from repeating spots to the same index value is an essential condition for the algorithm being successful. The mapping of slightly different successive frames is a convenient feature for minimizing the index table, because such sequences of identical index values can be stored as one single entry.
In one example, we use the first B bits of the 8-bit image average intensities RI, GI and BI, with

RI = (1 / (N x M)) * sum over n, m of Rnm,

and analogously GI and BI. Reducing the feature vector to a dimension of three gives the advantage of being less sensitive to small changes between images. For instance, a moving object before a homogeneous background does not cause changes to the average values RI, GI, and BI. For generating the scalar index value we use the first B bits of each of the averaged values, resulting in a 3B-bit integer value. The number of significant bits B controls the size S of the bins which the single values are mapped to, as well as the index range R:

SCP = 2^(8 - B), RCP = 2^(3B).

The subscript CP refers to the colour patches fingerprint. The bin size and index range are inversely related to each other. In general the algorithm can deal with any bin size, but bit-related sizes have the advantage of being easily computed.
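The bit-packing just described can be sketched in a few lines. The function name and argument layout are ours; the operation itself (keep the B most significant bits of each 8-bit average, concatenate into a 3B-bit integer) follows the text.

```python
# Sketch of the scalar colour patches index value: take the first B
# bits of each 8-bit whole-image average intensity and pack them into
# a single 3B-bit integer.
def cp_index(r_avg: int, g_avg: int, b_avg: int, B: int = 6) -> int:
    """r_avg, g_avg, b_avg: 8-bit image averages (0..255)."""
    top = lambda v: v >> (8 - B)          # keep the B most significant bits
    return (top(r_avg) << (2 * B)) | (top(g_avg) << B) | top(b_avg)

# With B = 6 the index range is 2**18 and the bin size is 2**(8-6) = 4,
# so averages that fall into the same 4-wide bin map to the same index.
assert cp_index(255, 0, 255, B=6) == (63 << 12) | 63
assert cp_index(128, 128, 128) == cp_index(130, 131, 129)
```

The bit shifts make the quantisation essentially free, which is the advantage of bit-related bin sizes noted above.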
New String Matching Algorithm with Gradient Histograms
Firstly we need to reduce the fingerprint size. The gradient histogram fingerprint vector originally contains N x M x K floating point values H^k_nm, where n and m are indices of the N x M subareas Snm, and k = 1, . . . , K refers to the bin in the direction histogram. Thus H^k_nm denotes the portion of direction k in subarea Snm for the normalized distribution, i.e. the values over all subareas and bins sum to 1.
In a first step we reduce the fingerprint size to a quarter of its former size by mapping all floating point values to 1-byte integer values. From the normalization above we know that all H^k_nm are in the range between 0 and 1. We expect maxima and a relatively dense distribution around the value of the uniform distribution, 1 / (N x M x K), whereas values near the borders of the interval, especially near 1, are very unlikely to be observed. Nevertheless, a linear mapping from (0, 1) to (0, 255) is preferably used in order to keep computational costs low, i.e. we use simple scaling of H^k_nm to get integer values. Due to the non-uniform distribution of H^k_nm in (0, 1), c = 256, i.e. a bin size of 1/256, may result in unsatisfying integer values. Because of the relatively sparse occurrence of values near the upper border 1, we choose c > 256 and map all values greater than 255 to the highest value. Experiments show that c = 50 x 256, i.e. c = 12,800, gives the best values, with relatively stable results in the range from 20 x 256 to 100 x 256, i.e. c = 5,120 to 25,600.
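The scale-and-clip quantisation described above amounts to one line of arithmetic; the sketch below uses the c = 12,800 value the text reports as working best, with the function name being our own.

```python
# Sketch of the fingerprint quantisation: scale each histogram value
# by c and clip at 255, so each float becomes one byte.
def quantise(h_values, c=12_800):
    """Map floats in (0, 1) to 1-byte integers by scaling and clipping."""
    return [min(int(c * h), 255) for h in h_values]

# A uniform 4x4x4 histogram has entries of 1/64 each, which map to
# 12800/64 = 200; values of 1/50 and above saturate at 255.
assert quantise([1 / 64]) == [200]
assert quantise([0.5, 1.0]) == [255, 255]
```

Choosing c well above 256 spreads the densely populated low end of the distribution across the byte range, while the rare large values are simply saturated.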
We then need to determine the index value and quantization. In a similar way to the colour patches fingerprint, we choose the image-related gradient distribution

H^k_I = (1 / (N x M)) * sum over n, m of H^k_nm, k = 1, . . . , K,

to build the image index value. We expect these values to be similarly robust to small changes as the averaged colour intensities. If we take again the first B bits of every single value, we get the following bin size S and index range R, respectively:

SGH = 2^(8 - B), RGH = 2^(KB).
Because we have typical sizes of 4 or 8 for K, the index value range is significantly larger compared to the case of the colour patches fingerprints. To reduce the index table size in the case of a large K or B, we can replace the raw index table with a hash table, the dimension of which we can define by technical demands. In this case the index value H(I) is evaluated by a properly designed hash function dependent on all H^k_I (two-step hashing).
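The two-step hashing idea can be sketched as follows. The bit-packing mirrors the colour patches case; the folding function and table size are illustrative assumptions, since the text only specifies that the hash function be properly designed for the given technical demands.

```python
# Sketch of two-step hashing: first pack the first B bits of each of
# the K image-level histogram values into a raw K*B-bit index, then
# fold that raw index into a fixed-size hash table.
TABLE_BITS = 20  # a table of 2**20 slots, chosen here for illustration

def gh_index(h_bins, B=7):
    """Concatenate the first B bits of each quantised histogram value."""
    idx = 0
    for v in h_bins:                # v: 1-byte integer per orientation bin
        idx = (idx << B) | (v >> (8 - B))
    return idx

def hashed_index(h_bins, B=7):
    raw = gh_index(h_bins, B)       # step 1: raw index, up to K*B bits
    return (raw * 2654435761) % (1 << TABLE_BITS)  # step 2: multiplicative fold

slot = hashed_index([200, 10, 99, 255, 0, 31, 64, 128])  # K = 8 bins
assert 0 <= slot < (1 << TABLE_BITS)
```

With K = 8 and B = 7 the raw index spans 56 bits, far too many for a direct table, which is exactly the situation the second hashing step addresses.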
Experiments:
Two test videos with different properties were investigated: one 4-hour NTSC video from US television (431,540 frames) and one PAL video from the UK with a duration of approximately 3 hours (281,896 frames). The videos differ not only in their technical properties, but also in their content and quality. These differences have an impact on the results. The characteristics of the test videos were as follows.
The US test video contained a mixture of news, sports, movies and commercials, whereas the one from the UK is taken from sports TV, i.e., the main part of the video is sports news interrupted by commercial blocks. The visual impression is that the UK video is of inferior quality and of higher intensity than the US video. There are further differences in the presentation of commercials. We mostly find hard cuts and one dissolve in the UK video, whereas in the US video the commercials are mostly separated by a couple of black frames. These individual properties result in different sensitivities of the tested algorithms.
An example frame from the broadcast on the UK video might have an almost white ticker section, which disappears when commercials are broadcast; this makes it easy to distinguish between the two by colour properties. The lower quality, accompanied by blurry and noisy patterns, is a worst-case condition for the gradient based algorithm. In the US video, one tends not to find such pronounced layout characteristics, and there is usually better image quality and lower mean intensity.
The US video contained 152 manually detected commercials, of which 97 are different. Searching for all of the 152 spots and taking into account all of the corresponding repetitions, the maximum count of possible matches adds up to 380.
We reduced the query set of the UK video to 104 spots, containing 65 different commercials. In this case the maximum number of possible matches was 266. The data set was reduced because the earlier set included some spots with different lengths, and because the group of intros and outros has some special features, so they are analyzed separately.
In our experiments we tested the new String Matching Algorithm utilising an inverted image index and compared its performance to that of the old approximate substring matching method. For further evaluation of the new method we investigated the properties of the index table in more detail, because the distribution of entries in the table is of significant importance for fast and successful searching. We also analyzed the influence of various parameters controlling the program flow.
Results:
Table 1: Reference values from the approximate substring matching approach. Given are values from the parameter sets with the best results.
We compared the performance of the proposed algorithm with that of the approximate substring matching algorithm. Besides the recall and precision values, with the definitions

recall = NTP / (NTP + NFN), precision = NTP / (NTP + NFP),

where NTP, NFN and NFP denote the numbers of true positive, false negative and false positive matches, respectively, we also compared the evaluation times, because we expected a significant speed-up with the use of the index table.
we also compared the evaluation times because we expected a significant speed up with the use of the index table.
Table 1 shows reference values from the approximate substring algorithm. We found that the reduction of the gradient histogram fingerprint to 1-byte integer values not only reduces the fingerprint size by 75%, but also drastically decreases the search time in the database. The runtime for the gradient histogram fingerprints is now comparable with or even less than that of the colour patches fingerprint, whereas recall and precision have not experienced any negative changes. A slight improvement was also seen in precision, though this might have been the result of rounding effects. Precision was the most noticeable difference between the two test videos.
Results from tests with the index table search are given in Table 2. For each of the two test videos various fingerprints and index values are applied. The colour patches fingerprint is used with B = 5 and B = 6, that is, with bin sizes for index mapping of 8 and 4, respectively. The 8 x 8 x 8 gradient histogram fingerprint was tested with B = 6, . . . , 8, i.e., bin sizes between 256 and 64, and the 4 x 4 x 4 one was tested with B = 8, . . . , 10, the corresponding bin sizes being 16, 8, and 4, respectively. A summary of the configurations and their corresponding index ranges and bin sizes, to which the fingerprint values are mapped, is shown in Table 3.
As can be seen from the numbers in Table 2, colour patches and gradient histograms perform differently on the two videos. The colour patches fingerprints have stable recall and precision values. For the "worse conditions" in the US video, however, the evaluation time is significantly higher than that of the gradient histogram. Nevertheless, the colour patches fingerprints perform well with the index table and we can reduce the search time by 75-85% using the inverted index. Additionally, we can improve the precision with the new algorithm.
In the case of the 8 x 8 x 8 gradient histogram fingerprints, the execution time for the UK video is higher than that for the US video, even though the UK test video had fewer frames and fewer commercial spots to search through. Additionally, we found a loss in recall. The sensitivity of recall to quantization, as well as the significant differences in search time, indicate a lower robustness of the gradient histogram fingerprints. For B = 7 we were able to get optimal and stable time values for both videos, but for the UK video some spots were missed.
Table 2: Search time, recall and precision for different fingerprints (CP - Colour Patches, GH - Gradient Histogram) and different sizes of index values with the index table based algorithm. The additional true positive matches come from commercials with identical sequences at the end. The single false positive match results from a spot that is a subsequence of a longer spot.
Table 3: Index ranges and bin sizes for the fingerprint and indexing combinations shown in Table 2.
The combination of the inverted index table with the 4 x 4 x 4 gradient histogram gave the worst results. The poor performance of the hash function is probably due to the relatively small size of the fingerprint. We could not find any applicable quantization that we could build into the hash function. Higher values of B, corresponding to small quantization bin sizes, result in a loss in recall. On the other hand, small B with large bins leads to extremely long search times. Additionally, the count of false positive matches rose, although all of the false matches correspond to hits inside true matches in the test video, partially caused by different variants of a group of spots or, sometimes, by the end being signalled too early. This could be an indication of the need for a longer section for frame-by-frame testing than that actually used. As the small-size 4 x 4 x 4 gradient histogram fingerprint is less sensitive, it could be necessary to test over the whole length of the query spot, which would further increase search time.
Looking at the numbers of true and false positive matches, we can make the following observations. The number of positive matches is higher than the actual count (UK video: 266 spots, US video: 380 spots) because there are similar commercials among the queries which have identical sequences at the end. Since we only do the frame-by-frame check for the buffered images at the end, we find additional matches. On the other hand, the single false positive match in the case of the colour patches and the 8 x 8 x 8 gradient histogram for the US video arises because one query is identical with the beginning of another one and causes a false hit at the middle of the longer commercial.
As explained above, the index value H(I) of one single frame is created from the average red, green, and blue intensities R_I, G_I and B_I, respectively. If we take the first B bits of each of the three colour averages, the index value has a size of 3B bits and its corresponding range is shown in Table 3. In more detail, the index is evaluated as

H(I) = R_I,B · 2^(2B) + G_I,B · 2^B + B_I,B,

where X_B denotes the first (most significant) B bits of the 8-bit average X. The index H(I) can be represented by a 4-byte integer value. Since the corresponding ranges are not that large, we choose an index table where each possible index value is represented by itself, i.e., we have an array T of size 2^(3B), where each field T(H(I)) contains a pointer to the entries of all frames whose index value equals H(I).
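As an illustration only, the colour-patches index just described can be sketched in Python. The function name, the B = 6 default and the (video, frame) entry format are our own illustrative choices, not part of the description above:

```python
def colour_index(avg_r, avg_g, avg_b, b_bits=6):
    """Map 8-bit average colour intensities to a 3*B-bit index.

    Keeps only the first (most significant) B bits of each channel,
    then concatenates them: H(I) = R_B * 2^(2B) + G_B * 2^B + B_B.
    """
    shift = 8 - b_bits                      # drop the low-order bits
    r_b = avg_r >> shift
    g_b = avg_g >> shift
    b_b = avg_b >> shift
    return (r_b << (2 * b_bits)) | (g_b << b_bits) | b_b

# The index table is a plain array of size 2^(3B); every possible
# index value addresses its own bucket of frame references.
table = [[] for _ in range(1 << (3 * 6))]
table[colour_index(200, 100, 50)].append(("video_a", 1234))  # frame pointer
```

With B = 6 the table has 2^18 = 262,144 fields, corresponding to the ranges discussed in connection with Table 3.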
For optimal performance the index table should be evenly filled, with approximately the same number of entries for each index. Figure 5 shows the distributions for the case B = 6 for both test videos. As can be seen, the distributions are far from uniform. According to the index equation above, the index value is a superposition of the individual colour distributions. Due to the non-uniform distributions of the average colours, the index distribution has corresponding maxima too. In Figure 4 it is possible to see the distribution in the UK video for one section in more detail. The influence of the blue colour distribution, which determines the lower bits, can be clearly seen on this smaller scale. Although there is no qualitative difference (flatness, peaks, etc.) to be seen, the difference in average intensities causes different positions of the maxima in the distributions. The index values of the darker US video are mainly assigned to low numbers, whereas the UK indices tend to the larger values. In summary, the distribution is relatively sparse, with less than 5% of all possible index values being used, but this is a preliminary conclusion because the number of frames in the videos is of the same order of magnitude as the index range itself. Nevertheless, there are large differences in the number of entries per index value.
The index value for the gradient histogram fingerprint is a function of the K values H_k, which represent the gradient histogram for the whole image. In Table 3 it can be seen that we are dealing with very large ranges corresponding to large N, M, K, and B. Using a hash function for evaluating the index value provides a solution to this problem. Usually, the range of the hash values is chosen according to the given problem and is significantly smaller than the range of the input data, which means that hash functions are not injective. Therefore hash functions are a good instrument for mapping sparsely distributed data to a "small" index table. Commonly, hash functions with good mixing properties are used, so different input data are, in most cases, mapped to different hash values. If different input data are mapped to the same index value, we have a collision. In our case, collisions do not matter, as long as the data in the index table are well distributed. Because the index is only used for limiting the queries for the frame-by-frame distance measure, collisions only increase the number of these candidates, and therefore the searching time is increased, whereas the quality of the search should not be affected. Only in the case of too many collisions does the algorithm become inefficient.
Often hash functions are based on the modulo function, where the table size is taken as the modulus. For large or multi-dimensional input data the modulo function can be evaluated with the Horner scheme, of which we make use too. The Horner scheme implies an iterative evaluation, so we have an index function of the following form:

h_0 = 0,   h_i = (h_(i-1) · 2^B + H_i,B) mod T   for i = 1, . . . , K,

where T is the size of the index table and H_i,B are the quantized (first B bits) fingerprint values. The value h_K is the resulting index value.
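A minimal Python sketch of such a Horner-scheme modulo hash follows. The function name and the default arguments are illustrative assumptions; the table size of 100,003 is the prime used for the recurring sequence search later in this description:

```python
def gradient_hist_index(values, b_bits=6, table_size=100003):
    """Hash a K-dimensional gradient-histogram fingerprint to an index.

    Horner-scheme evaluation of the modulo hash: the quantized values
    (first B bits of each 1-byte entry) are treated as digits of a
    base-2^B number, reduced mod table_size at every step so the
    intermediate result stays small.
    """
    h = 0
    for v in values:
        h = (h * (1 << b_bits) + (v >> (8 - b_bits))) % table_size
    return h
```

Because the reduction happens inside the loop, arbitrarily long fingerprints can be hashed without large intermediate numbers, which is the point of the Horner formulation.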
Distributions of index values are shown in Figures 5 and 6 for the US and the UK video, respectively. The index values are more evenly spread out than with the colour patches. Some peaks may correspond to collisions caused by a bad hash function, or may also indicate maxima in the distributions of the input data. For our calculations we used an index range of 100,003 (a prime number), which meets the criteria for choosing a good table size. In this case, the distribution for the US video is flatter than that of the UK video (note the different scales of the figures). The occupation rate is approximately 60%, higher than in the UK case (~30%). Both characteristics show that the gradient histogram performs better for the US video, as can be seen from the search time values shown in Table 2.
Having discussed the possibilities for building the index table and the impact of quantization of the fingerprints on the search quality as well as on the computational complexity, we now want to examine the influence of the various program parameters mentioned above on the performance of a query search. In Figure 9 the dependence of recall and precision, as well as the corresponding query time, on basic control parameters is shown. Because the precision values are close to 1, these values are not shown directly. Instead the number of false positives is drawn.
Certainly, in addition to the distance measure threshold, the ratio thresholds R are the most important parameters of the proposed algorithm. They define the relative amount of identical index values a test sequence needs in order to become a candidate for more intensive evaluation. Accordingly, these parameters are important for accuracy and time control. High values correspond to a stricter selection, allowing only a small number of test sequences to reach the time-consuming step of frame-by-frame comparison. Clearly, high threshold values may reduce the recall. If only a minimum of matching frame indices is required, the count of possible candidates increases, and so does the time. In Figure 7(a) this behaviour of recall and time can be seen. For the given configuration the value RMIN = 0.4 turns out to be the ratio which best separates matching and non-matching test sequences. For higher values the recall decreases. For lower values the time increases significantly, because too many candidates are found, whereas for higher values the changes in time are small, corresponding primarily to the increasing count of missing candidates. The value of RMIN which separates good and bad candidates depends on the type of index value and, mainly, on the value of B. As we can also see, the parameters R do not have a significant influence on the precision.
LFQ is another parameter which turned out to be important for a successful query. LFQ is the control parameter for signalling the end of a found match, i.e., we guess we have found the end of a possible match (one which passed the thresholds above) if a hit is marked at the last LFQ frames of the query sequence. Due to fade-outs and dissolves there may be large differences, especially at the end of the commercial spots. In the test case shown in Figure 7(b) we use a value of LFQ chosen to obtain a recall of 1.
The increase in time has two causes: the increased count of candidates and the need for repeated evaluation of the frame-by-frame distance due to steps 3(b)i and 4 in the example of the program flow given above. The influence of LFQ on precision is very low.
Figures 7(c) and 7(d) concern parameters of the frame-by-frame distance measure:
• the frame-distance threshold, which controls when two frames are considered completely different in the approximate substring matching algorithm; and
• the buffer length LB, which specifies the length of the substring of the test sequence for which the distance D is evaluated.
As can be seen in Figure 7(c), this threshold is very important for recall and precision. Interestingly, the strictness of the frame-by-frame distance has consequences for the measured search time too. This is again due to repeated computation of D if the threshold is missed. The relation between LB and the performance values shown in Figure 7(d) is relatively evident. If only short sequences are compared frame-by-frame, the precision becomes smaller than for longer tests. Presumably, the end of sequences is incorrectly determined (too early), so the recall is lowered by these false matches. The time increases almost linearly with the buffer length, which indicates that the greater part of the consumed time is taken by the distance measure.
Not shown is the influence of the control parameter which defines the maximum count of consecutively non-matching frames that are allowed. If this value is exceeded, the candidate is removed from the list. The recall decreases if this parameter drops below a critical value, and the time increases for large values because of a larger candidate list. There is no effect on precision.
Conclusion on the new String Matching Algorithm:
This new string matching algorithm, which combines an inverted index table for a coarse search with a frame-by-frame distance measure for a more detailed comparison of test and query sequences, is a significant improvement. It has been shown that we can reach a significant speed-up if the data is well conditioned and the system is adequately configured. For ill-conditioned data or originally compact fingerprints like the 4 x 4 x 4 gradient histogram there may be long evaluation times or losses in recall. The combination of the inverted index table with the colour patches fingerprint gives stable and excellent recall and precision values with moderate search times for less optimal input data, whereas the combination with the 8 x 8 x 8 gradient histogram offers fast performance, although the miss rate increases for worse data. The new String Matching Algorithm provides a good basis for automatically detecting repeatedly occurring sequences.
Detecting Recurring Video Sequences: Recurring Sequence Search Algorithm:
The proposed recurring sequence search algorithm is based on the inverted index search algorithm introduced above, i.e., feature vectors for all images contained in the video are calculated and an inverted index containing each single frame is created using hash functions. In one preferred embodiment, we take the GH feature (gradient histogram), because — for the purpose of detecting recurring sequences over a certain time — we assume a constant signal quality. Thus, we do not need to pay too much attention to the robustness of features across different capture cards and capture sources in general, since we can operate on the same system all the time. However, we may also benefit from the higher precision of the GH over the CP (colour patches).
In one preferred embodiment we implemented the recurring sequence algorithm as an incremental process with the following steps:
1. generating feature vectors (e.g., GH or CP as described above);
2. generating inverted index (hash table — e.g., using the new String Matching Algorithm described above);
3. searching the whole video for similar short sequences (length = 1 second);
4. finding raw candidate recurring sequences as groups of duplicate short sequences, which correlate in time;
5. coarse filtering of found candidates (this step is mainly due to rejecting very short candidates, which are mainly caused by accidental matches of unrelated sequences);
6. aligning the two matching sequences of a candidate;
7. identifying the beginning and end of a detected spot; and
8. comparing with already found sequences for deciding if we have a new one.
All steps, their intention, and their implementation, especially for steps 3 - 8, are explained in more detail below.

Steps 1 & 2: Generating feature vectors and inverted index
For high precision we use the 8 x 8 x 8 gradient histogram feature vector described above in its reduced form, where all values are mapped to 1-byte integer values. The inverted index is built, as above, by taking the first B (B = 6) bits of the values of the image-related gradient distribution.
The inverted image index is stored in a hash table of size 100,003, by means of a simple modulo function evaluated with the Horner scheme. Because we deal with very long video sequences of 24 and 48 hours, respectively, for practical reasons we split the video into 2-hour segments, so that we can deal with smaller data volumes. Accordingly, we have a number of hash tables, each containing the indices of 180,000 frames (2 hours at 25 frames/sec). This video splitting is done somewhat arbitrarily as a way of handling very large videos, i.e., large amounts of feature vectors and correspondingly large index tables. It makes the algorithm independent of video length and memory capacities. Furthermore, searching within several index tables opens the possibility of an alternative definition of input data (for instance, we may compare with the data from a previous day).
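The 2-hour segmentation might be sketched as follows; the helper names and the per-frame index representation are hypothetical, while the 180,000-frame segment size comes from the description above:

```python
SEG_FRAMES = 180_000   # 2 hours at 25 frames/sec

def build_segment_tables(indices, hasher):
    """Split a long video's per-frame index values into 2-hour segments
    and build one hash table (dict of hash value -> frame numbers) per
    segment, keeping memory use independent of total video length.
    """
    tables = []
    for seg_start in range(0, len(indices), SEG_FRAMES):
        table = {}
        segment = indices[seg_start:seg_start + SEG_FRAMES]
        for offset, value in enumerate(segment):
            # store absolute frame numbers so segments stay comparable
            table.setdefault(hasher(value), []).append(seg_start + offset)
        tables.append(table)
    return tables
```

Keeping one table per segment also allows the alternative inputs mentioned above, e.g. searching today's frames against the tables built from the previous day.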
Step 3: Searching for recurring short sequences
Since we do not know anything about the recurring sequences that we are searching for, in a first step we look for very short sequences of about 1 s which re-occur or are similar to each other. The assumption is that every possible sequence is composed of such small identical pieces. To do this short sequence search we slightly alter the string matching algorithm for known commercials discussed above. In that algorithm, we made a coarse matching to all queries in the database by counting identically occurring indices. If we exceed a matching threshold, then we do a frame-by-frame distance evaluation for a more precise prediction. Randomly occurring identical index values are periodically removed if the threshold is not reached within a certain time. For detecting unknown repeatedly appearing sequences, we first build short candidate sequences of our defined test length (for example, 1 s), i.e., we assign a frame matched by its index to an already existing candidate if the candidate's first frame precedes the found frame by no more than our test window size of 1 second. In more detail, we scan our test video frame-by-frame, calculate the index of each frame and look up in the index table similar frames which have the same index as our actual test image. We only take into account matching frames which come later in time, searching the video in only one direction. Additionally, we reject matches which are very close to the test frame; for example, we may require a minimal gap of 2000 frames (80 s) to avoid detection within the same scene. We can be sure that the delay between recurring commercials is greater than the chosen value. Each matching frame which meets these conditions is assigned to a candidate as mentioned above or acts as the seed of a new one.
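The scan just described might look roughly like this in Python. All names, the dictionary-based lookup and the candidate representation are illustrative simplifications (the actual search operates on the inverted index tables); the 2000-frame gap and 1-second window come from the description above:

```python
FPS = 25
MIN_GAP = 2000      # ~80 s: reject matches inside the same scene
WINDOW = FPS        # 1-second test window

def find_short_candidates(frame_indices, lookup):
    """Scan the video frame by frame and group index matches into
    1-second candidate pairs.

    frame_indices: one index value per frame of the test video.
    lookup: maps an index value to the earlier frame numbers that
            produced it (a stand-in for the inverted index table).
    Returns candidates holding the matched (earlier, later) frame pairs.
    """
    candidates = []
    for t, idx in enumerate(frame_indices):
        for m in lookup.get(idx, []):
            if t - m < MIN_GAP:
                continue            # too close: same scene, not a repeat
            for c in candidates:    # extend an open 1 s candidate ...
                if 0 <= t - c["start"] < WINDOW:
                    c["pairs"].append((m, t))
                    break
            else:                   # ... or seed a new one
                candidates.append({"start": t, "pairs": [(m, t)]})
    return candidates
```

Only frames earlier than the current test frame appear in the lookup, so the video is searched in one direction, as required.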
Apart from the altered creation of candidates, we follow the method for searching for known commercials described previously. If the fraction of matched frames is greater than the threshold (currently 20%), and matches near the end occur, a frame-by-frame distance measure between the short candidate sequence and the short sequence preceding the actual test frame is performed. If the calculated distance is below the distance threshold, these two short sequences are considered to be similar.
Step 4: Building candidate recurring sequences
In this step we build raw candidate recurring sequences by grouping the short sequence candidates. Short candidates whose segments are both closer than 150 frames to the corresponding short sequences are assumed to belong to the same recurring video sequence. After this step we have a number of sequences which are pairwise identical or similar in parts. Each pair of similar sequences is called a duplicate D. Each duplicate contains two vectors S1 and S2, representing the corresponding sequences:

D = (S1, S2).

Each vector contains the start frame numbers of the short sequence candidates building the duplicate, which are grouped into the following sequences:

S1 = (s1,1, . . . , s1,n),   S2 = (s2,1, . . . , s2,n),

with n the number of short candidates forming the sequences of the duplicate, where s1,i and s2,i originate from the same short sequence candidate.
Due to our search implementation, the frames of sequence S1 are in increasing order, whereas this is not necessarily true for sequence S2. Especially for long still scenes there may be disordered matching candidates.
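The grouping step can be sketched as follows. The function name and the duplicate representation are illustrative; the 150-frame proximity criterion is the one stated above:

```python
MAX_JOIN_GAP = 150  # frames: both segments must be this close to join

def build_duplicates(short_pairs):
    """Group short-sequence matches (s1, s2) into candidate duplicates.

    A pair extends a duplicate if both of its start frames lie within
    MAX_JOIN_GAP frames of the duplicate's last pair; otherwise it
    seeds a new duplicate. Returns the (S1, S2) vectors per duplicate.
    """
    duplicates = []
    for s1, s2 in sorted(short_pairs):
        for dup in duplicates:
            if (s1 > dup["S1"][-1] and
                    s1 - dup["S1"][-1] <= MAX_JOIN_GAP and
                    abs(s2 - dup["S2"][-1]) <= MAX_JOIN_GAP):
                dup["S1"].append(s1)
                dup["S2"].append(s2)
                break
        else:
            duplicates.append({"S1": [s1], "S2": [s2]})
    return duplicates
```

Note that S2 is compared with an absolute distance, reflecting the observation above that its frames need not be in increasing order.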
Step 5: Coarse filtering of candidate sequences
For eliminating non-commercial recurring sequences, we follow two assumptions in a first step. One concerns the appearance of commercials, namely we expect a minimum as well as a maximum length. The second assumption is more associated with the expected content of the commercial, i.e., we assume the sequences S1 and S2 to be well correlated, corresponding to sufficiently dynamic content. The second criterion has been introduced to separate long self-similar sequences like talk shows, which contain a lot of short similar sequences, but not necessarily in the same order. As a measure of correlation we use the standard deviation of the follow-up distance d_i = s2,i - s1,i:

sd(S1, S2) = ( (1/n) · Σ_i (d_i - d̄)² )^(1/2),

where d̄ is the mean follow-up distance. In the case where we have a series of exactly matching frames, the standard deviation sd(S1, S2) becomes zero.
Duplicates must have a minimum number n of matching short sequences, a minimum (and possibly maximum) length, and the standard deviation of the follow-up distance sd(S1, S2) must not exceed a critical value in order to proceed further.
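A sketch of this coarse filter follows. The sd(S1, S2) criterion is the one defined above; the concrete constants (minimum count, length bounds, critical deviation) are invented placeholders, since the description does not fix them:

```python
import statistics

MIN_SHORT_SEQS = 5           # assumed minimum number n of short pieces
MIN_LEN, MAX_LEN = 250, 3000 # assumed length bounds in frames (10 s - 2 min)
MAX_SD = 3.0                 # assumed critical standard deviation

def passes_coarse_filter(S1, S2):
    """Keep a duplicate only if it is long enough and its two segments
    are well correlated in time.

    The correlation measure is the standard deviation of the follow-up
    distance d_i = S2[i] - S1[i]; for a series of exactly matching
    frames it is zero, while scrambled self-similar content (e.g. a
    talk show) produces a large deviation and is rejected.
    """
    if len(S1) < MIN_SHORT_SEQS:
        return False
    length = S1[-1] - S1[0]
    if not (MIN_LEN <= length <= MAX_LEN):
        return False
    distances = [b - a for a, b in zip(S1, S2)]
    return statistics.pstdev(distances) <= MAX_SD
```

A constant offset between S1 and S2 gives sd = 0 and passes; alternating offsets, as produced by out-of-order matches, give a large sd and are filtered out.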
Step 6: Aligning sequences
The next step aligns the two similar sequences of the duplicates. According to our algorithm, both sequences S1 and S2 may be shifted with respect to each other and may each be only a part of the target sequence. To estimate the beginning and end of the target sequence we need properly aligned sequences. For calculating the displacement we use the intervals between cuts. We use this approach because it is simple to implement and provides relatively exact results. The disadvantage is that we need more than three cuts for proper alignment, and proportionally more if there are errors in cut detection. As a result, we reduce our detection rate at this step, neglecting all commercials with a lower number of cuts. It may be possible to improve the method by adding a cut-independent method for aligning, for instance searching for a string distance measure minimum over several displacements.
Step 7: Identifying the beginning and end of a detected spot
After alignment, we can compare frame-by-frame backwards from the shared start frame estimated by the alignment, as well as forwards from the estimated last frame, until we identify that corresponding frames are different. Here the problem occurs that we may include a couple of black frames if commercials are separated by black frames.
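The boundary search can be sketched as a pair of while loops; the distance callback, its threshold and the function name are placeholders (bounds checking against the video length is omitted for brevity):

```python
def extend_boundaries(dist, start1, start2, end1, end2, max_d=10):
    """Grow an aligned match backwards from the shared start frames and
    forwards from the estimated end frames while corresponding frames
    still match, i.e. while their frame distance stays below max_d.

    dist(i, j) is assumed to return the frame-by-frame distance between
    frame i of the first occurrence and frame j of the second.
    """
    while start1 > 0 and start2 > 0 and dist(start1 - 1, start2 - 1) < max_d:
        start1 -= 1
        start2 -= 1
    while dist(end1 + 1, end2 + 1) < max_d:
        end1 += 1
        end2 += 1
    return start1, start2, end1, end2
```

Because black separator frames are identical in both occurrences, this loop can run a few frames past the true spot boundary, which is exactly the black-frame problem noted above.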
Collecting multiples
In a last step we compare the pairwise matched sequences against those from the other duplicates in order to group more frequently occurring sequences together. We only search for the case where one of the two sequences also appears in another duplicate. If our detection results in pairwise different duplicates, our algorithm ends up with two final recurring sequences.
Figure 8 illustrates a flow-chart of the process. In step 100, a video stream is retrieved, either from a live broadcast or from storage. In step 102 the video stream is processed to generate feature vectors. In step 104 an inverted index (hash table) is generated from the feature vectors. In step 106 similar short sequences ("sub-sequences") are built by identifying short sequences with frames with matching hash values 108. Next there is a step 110 of finding long sequences ("sequences") as groups of short sequences. This includes matching short sequences 112. There may still be undetected parts 114. The lower matched long sequence in step 110 of Figure 8 represents a repeated sequence 116. Coarse filtering 118 is applied to the sequences. This includes the step of rejecting very short repeated sequences 120. The sequences are aligned 122. Start and end frame detection 124 is performed on the detected repeated sequence 126. Grouping of repeatedly detected sequences 128 is conducted. The multiple detected repeated sequence 130 is stored and/or fingerprinted for future comparison by an ad detection and replacement apparatus.
Experiments for the Recurring Sequence Search Algorithm:
For our experiments we used six 48-hour videos. All videos were recorded from British television, mainly music and sports channels, and are already downscaled to PAL-VCD resolution of 352x288. Two of the videos (MTV2 and Gemini) have some recording errors showing only black frames over a period of time. These errors were not cut out, in order to test how the algorithm handles such cases. As a first step we create the GH fingerprints for all videos as explained above.
For a live detection system we envisage an automatic search through about 24 hours. However, there are commercials, which are only repeated once a day or bound to specific content. Therefore, we conducted a search over 24 hours as well as 48 hours in order to investigate the effect on recall and precision caused by the amount of input.
As we noticed large differences in evaluation time depending on video content, we additionally performed another modification test. The sports channels especially contain long, nearly similar sequences, for instance a long tennis match in the SS3 video discussed below. Similar frames having the same index expand the search space and slow down the search process tremendously, because the number of candidates scales with the square of the number of index entries. Our first approach is to neglect such highly frequented indices and the corresponding frames. In our tests we closely examine the impact on the performance measures, as well as the achieved reduction in evaluation time.
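This modification might be sketched as a simple pruning pass over the inverted index; the function name is illustrative, while the cap of 100 occurrences per 2-hour segment is the value used in the experiments described below:

```python
MAX_ENTRIES = 100   # neglect indices seen more than 100 times per 2 h segment

def prune_index(table):
    """Drop over-full buckets from an inverted index table.

    Long nearly-still scenes put thousands of frames behind the same
    index value; since the number of candidate pairs scales with the
    square of the bucket size, such buckets dominate the search time
    while carrying little discriminating information.
    """
    return {idx: frames for idx, frames in table.items()
            if len(frames) <= MAX_ENTRIES}
```

Pruning trades a small loss of matchable frames for a large reduction in the quadratic candidate-pair count.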
Results:
For a first evaluation we have a look at the output produced by the raw algorithm explained above (steps 1-8). Central performance measures are, again, recall and precision, with the following definitions:

recall = (number of correctly found recurring commercials) / (number of all recurring commercials),
precision = (number of correctly found recurring sequences) / (number of all found sequences).
Table 4: Number of all found recurring sequences, and number of all found multiples (sequences which are considered different from each other).
Because, in this test case, we do not know the ground truth of our database, we first have a look at the measurable results and then try to estimate our recall values. Table 4 shows the number of all found sequences and the number of multiples they are grouped into. Comparing the numbers of found sequences and multiples for the 24-hour and 48-hour searches, we note that the number for 48 hours is more than twice that for 24 hours, indicating that there are sequences which would potentially not be found by two 24-hour searches.
The two lines at the bottom of Table 4 contain the numbers for searches in which we neglect frequently occurring frames in the candidate building step. Surprisingly, in most cases more multiples are found than with no limitations. Mostly this is caused by very long sequences being split into several smaller ones if frames in the middle are neglected; alternatively, we may get more variants of the same repeated sequence if we neglect often-occurring frames at the boundaries of the sequence.
For two example videos, ChartTV as a representative music channel and SkySportsNews from the sports channels, we perform a more detailed analysis of recall and resources in the following sections.
Recall:
Because we do not know the ground truth of our 48-hour videos, we estimate the recall for the first 2 hours by comparing the automatically found commercials with the ones labelled by hand. We will focus on real commercials and neglect channel-related recurrences, whose special characteristics depend on the particular channel. Accordingly, in ChartTV, commercial blocks are mostly embedded in short sequences showing the channel logo. These logo sequences come in a wide variety, differing in colours and kind of animation. In SkySportsNews, intros and outros of commercial blocks are accompanied by special sponsor commercials and are frequently merged with the preceding and following content, respectively. Both of these types of commercial block embedding are bad conditions for our search algorithm, because the logos from ChartTV do not contain cuts, and the intros/outros from SkySportsNews are highly variable in their appearance. Especially in SkySportsNews there are some channel-related commercials, mostly concerning other Sky channels, but because these sequences have some common characteristics with the "external" commercials we consider them mostly as real commercials, although it is sometimes hard to decide where the spots belong.
For a detailed analysis we use the tool named "ShowMatch", which enables browsing through all found sequences, shows the number of repetitions and lets us look at individual frames around the detected boundaries. We can then assign the found sequence to a predefined category. At this stage we distinguish three main categories: commercials, channel-related material, and false. All three main categories have some subclasses. For the estimation of recall we deal with the commercials, which are divided into the subclasses shown in Table 5.
Table 5: Subclasses of detected commercials (EXACT, EXACT +/-5, INEXACT, DOUBLE, CHANNEL, CUT, PART, OTHER).
The distinction between EXACT and EXACT +/-5 is a little arbitrary. Whereas some channels like ChartTV normally broadcast commercials with their original length (at least in our test video), the length of commercials shown in SkySportsNews is more variable. Sometimes single frames are cut at the beginning or end, and sometimes dissolves are inserted and sometimes not. Therefore, according to our algorithm, some sequences belonging to the same multiple are exactly detected, and others have small errors at the borders. For calculating the recall we take into account multiples which are classified as EXACT or EXACT +/-5.
INEXACT commercials show a greater error in detection, and some reasons for this are discussed later in more detail.
In rare cases one commercial is only ever broadcast directly following another; then we can only detect the combined sequence of these two commercials (DOUBLE). Normally, the commercial also occurs in another context, so that we have the chance of a separate detection. In this case the commercial is rated with the better class, e.g. EXACT or EXACT +/-5, for estimating the recall. CHANNEL is like DOUBLE, but the commercial is bound to channel-related material. This can be a logo sequence, an intro or outro, previews or similar things. Because such channel-related sequences occur more frequently (at the beginning and end of commercial blocks), the probability of finding such compounds is higher than for DOUBLEs. For the calculation of recall we apply the same criteria as for the DOUBLEs.
Sometimes a commercial is cut at the 2-hour boundary. In these cases both parts of this commercial are considered as separate sequences and, if this commercial is repeated, are consequently found as single recurring sequences. This error can easily be reduced by splitting the video into longer parts and dealing with overlapping boundaries. That is why we assign this case to a separate subclass, CUT. Another case occurs if several commercials for the same product contain a common sequence together with a varying part. Then the common part is detected as a single recurring sequence and classified as PART. As in the DOUBLE case, there exists the possibility that the whole commercial is detected too, if it is repeated in the same variant.
All detected sequences which are more or less related to commercials, but could not be assigned to one of the above subcategories, are classified as OTHER.
Table 6 shows the number of detected sequences for the first two hours of the ChartTV and the SkySportsNews video, respectively. As in former investigations, our results strongly depend on specific channel characteristics. For the ChartTV video we reach a recall of around 80%, while for the SkySportsNews video the recall is only around 60%. There are several reasons for the lower recall. Besides the more casual handling of commercial boundaries mentioned above, a certain fraction of the commercial spots is related to the Sky group, advertising Sky programming. In many cases of such trailers, a "set reminder" button for adding the programme to the personal planner appears. The appearance of this icon is not strongly deterministic: it can be shown for only a small part of the trailer, and may only disappear after the beginning of the next commercial. Most inexact detections of commercials in SkySportsNews are related to the "set reminder" feature, either as programme advertisement or as a disturbance at the beginning of the following spot. The impact on detection is clearly higher for less structured images. To circumvent this problem a mask could be used.
Table 6: Found commercial statistics for the first 2 hours of ChartTV and SkySportsNews. The Recall value takes into account the EXACT and EXACT +/-5 numbers, whereas for "Recall all" all commercial subcategories are included.
Our modification of neglecting frames with indices occurring more than 100 times within 2 hours of video has only a small effect on recall. Only one commercial is missed in the 24-hour as well as in the 48-hour test for the ChartTV video. The number of exact detections for SkySportsNews did not change at all.
We can improve the recall by searching through 48 hours rather than through 24 hours. In ChartTV we catch two additional commercials, one of which appears only once within the first 24 hours but three times within the second half. Thus this one would also be caught by two 24-hour searches, whereas the second one appears only once in each 24-hour period and can only be detected by a 48-hour search. In SkySportsNews we found two commercials which appear only once within each 24-hour period, one which could be found by two 24-hour searches, and one where it is not clear why it was not detected in the 24-hour search. All in all, the number of found recurring sequences is increased for all types, but the benefit is quite small compared to the additional evaluation time. In particular, in the SkySportsNews sample, the number of exact matches may become higher even if the concrete commercial repeats within the first 24 hours, because the spot comes with more embedding variations. The numbers of DOUBLE and CHANNEL grow too, because of the higher probability of finding two corresponding double sequences; but if a commercial is only a DOUBLE within 24 hours, we get the chance of finding another sequence combination within the next 24 hours and can extract the single spot.
Precision:
For estimating precision we take a more detailed look at the falsely found sequences. Although we do not strictly need the subclasses of CHANNEL and FALSE for calculating precision, it is of interest to know where potential false candidates occur. Tables 7 and 8 show the subcategories for CHANNEL and FALSE, respectively. We differentiate between these two categories because channel-related sequences repeat more often and have, to a certain extent, similarities with commercials. On the other hand, it is sometimes also of interest to detect items such as logos and intros/outros.
Table 9 shows the numbers of detected multiples which we have assigned to all classes, as well as the number of UNSET multiples which we have not yet classified. For the estimation of precision we only take the already assigned multiples into account. We calculate the exact precision, which relates to commercials detected ready for the commercial database, and additionally the value related to sequences which could give hints for a subsequent manual detection of commercials. Again, we can recognize channel-specific characteristics in the statistics of detected multiples, leading to different dependencies of precision on search strategies. For ChartTV the precision is nearly independent of our modification of the algorithm concerning the number of entries per index in the hash table. In contrast, for SkySportsNews we can significantly improve the precision with the limitation to rarely occurring indices. That is caused by two mechanisms: we increase the detection rate by neglecting many accompanying channel-related frames which occur with high frequency, and we reduce the detection of frequently occurring sequences, such as intros/outros, which all have similar characteristics, as well as self-similar sequences, for instance the characteristic news layout of SkySportsNews.
The "Precision EXACT" value takes into account the EXACT and EXACT +/-5 numbers as detected commercials, whereas for "Precision COMMERCIAL" all commercial subcategories are included. Note that the 48 hours searches for SkySportsNews are not completely classified.

By doubling the video length for searching we observe a decrease in precision for ChartTV. As mentioned in the recall section, the benefit in finding commercials is only small, whereas the number of detected video clips rises significantly. We estimate that the probability of broadcasting a video clip exactly once within 24 hours is much higher than for commercials. For SkySportsNews we cannot make a reliable statement because of the incomplete classification, but in the case of no restriction on index entries (where more than half of the multiples are classified) the doubling appears to have much less effect on precision.
Altogether, this raw precision is not particularly good, because our performance numbers focus on commercial detection, while our algorithm is designed for finding repeating video sequences. In the analysis part below we will see how we can easily improve the rate by simple filtering for commercials.
Resources:
The most time-consuming step is generally the computation of image features. The calculation of gradient histograms for our test videos takes around 20 minutes for one 2 hours segment on an Intel Xeon 2.33 GHz CPU; video decoding is included in the measured time. Generating the hash index table for 2 hours of video takes about 10 seconds. Consequently, the overall preparation time (steps 1-2) for recurring sequence detection is about 4 hours per 24 hours of video.
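The feature extraction and indexing steps (1-2) can be sketched as follows. This is an illustrative Python sketch, not the implementation measured above: the exact gradient histogram formulation, the number of bins, and the quantisation used to form the hash index are assumptions, since the text does not specify them.

```python
import numpy as np

def gradient_histogram(frame, bins=8):
    """Gradient-orientation histogram of a grayscale frame,
    weighted by gradient magnitude (illustrative feature)."""
    gy, gx = np.gradient(frame.astype(float))
    angles = np.arctan2(gy, gx)  # orientations in [-pi, pi]
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi),
                           weights=np.hypot(gx, gy))
    total = hist.sum()
    return hist / total if total > 0 else hist

def hash_index(hist, levels=4):
    """Map the histogram to a single scalar hash index by coarse
    quantisation of each bin (illustrative hashing scheme)."""
    q = np.minimum((hist * levels).astype(int), levels - 1)
    idx = 0
    for v in q:
        idx = idx * levels + int(v)
    return idx

def build_inverted_index(frames):
    """Inverted image index: hash value -> list of frame numbers."""
    table = {}
    for n, frame in enumerate(frames):
        h = hash_index(gradient_histogram(frame))
        table.setdefault(h, []).append(n)
    return table
```

In this sketch the inverted index maps each quantised histogram (the hash value) to the list of frame numbers at which it occurs, so identical frames share a bucket and can be found in constant time.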
Table 10 shows the evaluation times for finding short sequences (step 3) and for building multiples from these short sequences (steps 4-8), respectively. Because we process the video in 2 hours segments, our search algorithm for short sequences has a quadratic dependency on video length. Note that Gemini and MTV2 contain errors, which appear as black frames over a longer period, so the shown values deviate from normal behaviour; however, this lets us see how such a situation is handled. For MTV2 the errors are situated within the first 24 hours, and seem to take a significant fraction of the evaluation time if we do not limit the number of index entries. That is why the short sequence search time over 48 hours is not much greater than for 24 hours.
Again, as for the performance values, the evaluation time is strongly influenced by the channel content. If we do not limit the search to frames with sparsely occurring hash indices, the time needed for the short sequence search is much too large for practical use for the SkySportsNews video, with its many similar but not identical sequences, whereas there are only slight improvements for the other videos. Table 11 shows average evaluation times and standard deviations. We can see that our modification of only taking hash indices with fewer than 100 entries not only significantly decreases the computation times, but also leads to more constant time values. Without the limitation the standard deviation is of the same order of magnitude as the mean, indicating a large range of occurring values, whereas the computation time is much less spread if we concentrate on characteristic frames.
Table 12 shows the number of detected short sequences for each of the six test videos. It is clearly seen how we can eliminate video errors (black frames) in Gemini and MTV2, as well as many similar sequences in the sports channels SkySportsNews and SS3, by neglecting indices with many entries. For the non-corrupted music channel videos ChartTV and MTV we can reduce the number of detected short sequences with only little consequence for the detected multiples (see Tables 4 and 6).
Analysis:
We now take a more detailed look at qualitative aspects of the detected recurring sequences: if we can find characteristics of our classes and subclasses, we can improve our results by effective filtering.
Clip length:
Figures 9a and 9b show two histograms over the duration of multiples detected in ChartTV, taking into account the 12 most occupied classes. The x-axis shows the length in terms of the number of frames and the y-axis shows the count. Histogram bins are not equally spaced: the labelled bins (multiples of 250) are much smaller than the others. Each labelled bin, which corresponds to lengths that are multiples of 10 seconds, covers the narrow interval [LABEL - 5, LABEL + 5], whereas the unlabelled bins contain all values in between, i.e., the interval [PREVIOUS LABEL + 5, NEXT LABEL - 5]. We can see that almost all detected EXACT commercials belong to one of the "10 seconds" bins. This is probably caused by the commercial selling model. The channel-related material is more equally distributed in length, and therefore most of it belongs to one of the residual bins. In particular, video clips are much longer than commercials and can easily be filtered out by a cut-off duration. Figure 9b shows the histogram for a 48 hours search. Compared with the 24 hours search in Figure 9a there are few differences: the proportion of video clips grows, as already discussed in the previous section, and we find more DOUBLE commercials in the higher "10 seconds" intervals.
Figures 10a and 10b show the length statistics of detected multiples in SkySportsNews within 24 hours, without and with the limitation to indices with fewer than 100 entries, respectively. The x-axis shows the length in terms of the number of frames and the y-axis shows the count. It is clearly seen that we can suppress the detection of non-commercials by neglecting indices with many entries. This applies especially to scenes which, due to channel characteristics, are very similar to each other, such as the presentation of sports news and the transitions into and out of the commercial blocks. The 48 hours statistics are not shown because of their incompleteness, but they contain a number of very long, falsely detected OTHER sequences (see Table 9) which are due to the repetition of complete reportages.
A closer look at the distributions in Figures 9a, 9b, 10a and 10b suggests filtering out all sequences assigned to the intermediate intervals. Doing so yields the performance values shown in Table 13. By simple length filtering we can improve our precision values with only minimal losses in recall.
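The length filter just described can be expressed compactly. The Python sketch below is illustrative: it keeps only sequences whose length falls into one of the narrow "10 seconds" bins [k·250 - 5, k·250 + 5], with the ±5 frame tolerance mirroring the bin width given for Figures 9a and 9b (250 frames correspond to 10 seconds of PAL video at 25 fps).

```python
def near_ten_second_multiple(length_frames, spot_unit=250, tolerance=5):
    """True if a clip length (in frames) falls in a narrow
    "10 seconds" bin [k*250 - 5, k*250 + 5] for some k >= 1."""
    k = round(length_frames / spot_unit)
    return k >= 1 and abs(length_frames - k * spot_unit) <= tolerance

def length_filter(sequences):
    """Discard detected sequences assigned to the intermediate
    intervals; `sequences` is a list of (start, end) frame pairs
    (illustrative layout)."""
    return [(s, e) for (s, e) in sequences
            if near_ten_second_multiple(e - s + 1)]
```

Sequences in the residual bins, such as video clips and channel-related material, are dropped, while commercials sold in 10-second units pass the filter.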
Table 13: Results for ChartTV and SkySportsNews after length filtering. Note that the precision for the 48 hours SkySportsNews search is inaccurate.
Thus it can be seen that we have introduced a system for automatically detecting repeating sequences, which can be used to detect unknown commercials. Discussed below are some additional changes which have an impact on run time or qualitative performance. There are also some experiments for evaluating the dependence of the system on other system parameters.
Sequence alignment:
In the previously described method, we used the positions of cuts for proper alignment (in particular three or more cuts). This method has two disadvantages. Firstly, the cut detection is relatively time-consuming and an additional source of errors. Secondly, in the case of erroneous cut detection or a complete absence of cuts, sequences cannot be aligned and are discarded. Therefore, we added the possibility of aligning two sequences by searching for a local minimum distance between the two clips using the distance of the feature vectors. The distances between two short sub-sequences, each having a length of 25 frames, are calculated for different offsets. The two sequences are then aligned at the offset which has the (local) minimum distance of the corresponding feature vectors.
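The feature-vector-distance alignment can be sketched as follows. Only the 25-frame sub-sequence length comes from the text; the shift range and the use of the Euclidean distance are illustrative assumptions.

```python
import numpy as np

def align_by_feature_distance(feats_a, feats_b, window=25, max_shift=12):
    """Align two similar clips without cut detection: slide a short
    sub-sequence of `window` frames over a range of offsets and pick
    the offset with the minimum feature-vector distance.

    `feats_a` / `feats_b` are (num_frames, feature_dim) arrays of
    per-frame feature vectors. Returns (best_shift, best_distance)."""
    best_shift, best_dist = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        # Positive shift drops leading frames of A, negative of B.
        a0, b0 = max(0, shift), max(0, -shift)
        if min(len(feats_a) - a0, len(feats_b) - b0) < window:
            continue  # not enough overlap at this offset
        d = np.linalg.norm(feats_a[a0:a0 + window] - feats_b[b0:b0 + window])
        if d < best_dist:
            best_shift, best_dist = shift, d
    return best_shift, best_dist
```

Because the comparison uses only the already computed frame features, no additional cut detection pass is needed, and sequences without cuts are no longer discarded.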
Experiments:
The experiments are based on the test video ChartTV analysed above, which is the best-analysed video of our test suite. All given values refer to a 24 hours search.

Length of short sequences:
This test concerns the length of the short sequences which are searched for in the first step of the search. The value is represented by the parameter TEST_WINDOW; its reference value, used in the previous experiments, is 25 frames, which corresponds to 1 second in a PAL coded video.
Table 14: Evaluation time in seconds for searching for short sequences (HashFP) and finding multiples (Detect), respectively, in dependence on the short sequences length.
Table 14 shows evaluation times in dependence on the parameter TEST_WINDOW, i.e., the length of the short sequences. Reference values are, as in all following tables, in bold. Additionally, we varied the parameters MIN_COUNTER_CUT and MIN_COUNTER, respectively. These are thresholds for the minimum number of short sequences which form duplicate sequences: the smaller value, MIN_COUNTER_CUT, represents a necessary condition (all candidate duplicate sequences are discarded if they contain fewer than MIN_COUNTER_CUT short sequences, regardless of whether the thresholds of the other parameters are passed), and the higher value, MIN_COUNTER, is a sufficient condition (all candidate duplicate sequences are evaluated if they meet the necessary conditions of the other thresholds). The test of this number of short sequences is combined with the test of a minimum length of the candidate sequence and of the correlation of the order of the short sequences. The rule for passing this test is the fulfilment of all necessary conditions for these three parameters together with the sufficient condition of at least one of them.
The adjustment of the MIN_COUNTER values is necessary because the product of MIN_COUNTER and TEST_WINDOW implies a minimum length of the sequences, which would be much too high for the increased short sequence lengths if the minimum number were left at the reference values.
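The acceptance rule for candidate duplicate sequences — every necessary (lower) threshold met, plus at least one sufficient (higher) threshold exceeded — can be sketched generically. The concrete threshold values in the test below are placeholders, not the reference values of the experiments.

```python
def passes_candidate_test(counts, necessary, sufficient):
    """Three-parameter acceptance rule for candidate duplicates.

    `counts`, `necessary` and `sufficient` are dicts keyed by the three
    tested quantities (number of matched short sequences, candidate
    length, order correlation). A candidate passes when every necessary
    threshold is met AND at least one sufficient threshold is exceeded,
    mirroring the MIN_COUNTER_CUT / MIN_COUNTER pairing in the text."""
    # Necessary conditions: any failure discards the candidate outright.
    if any(counts[k] < necessary[k] for k in necessary):
        return False
    # Sufficient condition: one decisive quantity is enough to accept.
    return any(counts[k] >= sufficient[k] for k in sufficient)
```

A candidate that merely scrapes past every lower bound is thus rejected unless at least one quantity is convincingly high, which keeps weakly supported duplicates out without requiring all three quantities to be large.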
We can see that the evaluation time is inversely proportional to the size of the short sequences searched for. This is mainly due to the increased number of found short sequences at smaller sizes. The parameters MIN_COUNTER_CUT and MIN_COUNTER have only little influence.
Table 15: Number of all found short sequences and of all found recurring clips, respectively, in dependence on the short sequences length.
Table 15 shows the number of all found short sequences and that of all found recurring clips within the 24 hours search through ChartTV. There is a greater chance of detecting recurring sequences when using a smaller test window (the length of the short sequences), whereas the gain is mostly due to the effect of the combined parameters TEST_WINDOW and MIN_COUNTER on the minimum length of the detected clip. (See the discussion above.) We can compensate for this effect by decreasing the MIN_COUNTER values in a proper way, as can be seen for the TEST_WINDOW=50 test case with decreased MIN_COUNTER values. For TEST_WINDOW=100 we get an overcompensation and find many more duplicate sequences, but as we can see in the performance discussion below this is not necessarily associated with an increased precision.
Table 16 shows the number of found commercials, the corresponding recall, as well as the number of all found recurring sequences and the corresponding precision for the first 2 hours of our test video. We find the best recall for the smallest short sequence length, because smaller clips can then be detected. Both clips found in addition to the reference case have a length of 250 frames. If we compare the COMMERCIAL EXACT and the ALL COMMERCIALS rows we recognize that a smaller TEST_WINDOW size also increases the exactness of the search, because of its finer granularity. Conversely, the higher number of found sequences for the case TEST_WINDOW=100 with decreased MIN_COUNTER values is also caused by the reduced exactness of the search with longer short sequences.
Category ALL shows the number of all found recurring sequences within the first two hours of our test video; additionally, the number of all found sequences shorter than 2005 frames is given. (Probably most of the commercials are shorter than this.) This simple length filter leads to an acceptable recall.

Number of Entries per Index in Hash Table:
This section investigates the maximum number of entries for one value in the hash table. We introduced this parameter in the earlier search algorithms in order to reduce computation time. We have previously discussed its minor effect on the performance values in comparison to the situation with no limiting, i.e., with consideration of all values in the hash table. Here we vary this number around the reference value to observe the influence of the concrete choice of the parameter MAX_HASH_ENTRIES on our results.
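The effect of MAX_HASH_ENTRIES can be illustrated with a small sketch that discards over-full hash buckets. The representation of the hash table as a mapping from hash value to frame numbers is an assumption for illustration.

```python
def prune_hash_table(table, max_hash_entries=100):
    """Drop over-full buckets: hash values with more than
    MAX_HASH_ENTRIES frame entries are neglected entirely, which
    removes frequent, uncharacteristic frames (black frames,
    recurring studio layouts) from the short-sequence search.

    `table` maps a hash value to the list of frame numbers at which
    that value occurs."""
    return {h: frames for h, frames in table.items()
            if len(frames) <= max_hash_entries}
```

Pruning before the short-sequence search both reduces computation time and suppresses matches between merely similar, frequently recurring frames, consistent with the observations reported for SkySportsNews.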
Table 16: Categories of all found sequences, recall, and precision in dependence on the short sequences length. COMMERCIAL EXACT - commercials with exact start and end determination; COMMERCIAL EXACT +/-5 - commercials with errors of less than 5 frames in start or end determination; COMMERCIAL EXACT (sum) - sum of both commercial categories above; ALL COMMERCIALS - all detected commercials including fragments and commercials which are detected together with channel logos or other commercials; ALL - all detected recurring clips including video clips and channel-related material; ALL (length < 2005) - all detected recurring clips with a length smaller than 2005 frames (80 seconds). Recall counts detected commercials from COMMERCIAL EXACT. (The total number of commercials in the first 2 hours of ChartTV is 48.) Precision corresponds to the COMMERCIAL EXACT/ALL ratio; Precision (length < 2005) corresponds to the COMMERCIAL EXACT/ALL (length < 2005) ratio.
Table 17: Evaluation time in seconds for searching for short sequences (HashFP) and finding multiples (Detect), respectively, as a dependence on the maximum number of entries for a hash value to be taken into account for evaluation.
Table 18: Number of all found recurring sequences, and number of all found multiples, in dependence on the maximum number of entries for a hash value to be taken into account for evaluation.
Looking at Table 17, we find that we can reduce the evaluation time by further decreasing the maximum number of hash value entries. It is not clear why the run time of Detect.exe for the reference value MAX_HASH_ENTRIES=100 is larger than for both of the other values, but it has turned out that the evaluation time for Detect.exe strongly depends on the memory load of the test system. The numbers in bold are the reference values from previous tests, whereas the number in parentheses is from a reference test run under the same conditions as the other ones. If we take the latter into account, we cannot find differences in the evaluation time for Detect.exe. As already discussed, we detect more (wrong) sequences if we consider a smaller number of hash values (Table 18), because clips are sometimes split into a couple of fragments due to larger gaps in detection.
As expected, we have a higher probability of missing a clip if we take fewer hash values into account. The performance values shown in Table 19 show a slight influence of the parameter MAX_HASH_ENTRIES. In summary, we have found a definite but small influence of this parameter on evaluation time and performance values.
Table 19: Categories of all found sequences, recall, and precision as a dependence on the maximum number of entries for a hash value to be taken into account for evaluation. For a detailed description see Table 16.
Note that, in contrast to the other parameters discussed, the influence on the performance values has to be considered in relation to the length of the video slices, i.e., the total number of entries in the hash table. In the experiment we used 2-hour slices of the video, each having its own hash table (with 180000 entries).
Sequence Alignment:
As already discussed above, we improved our code for sequence aligning to be independent of cut detection. In this section we compare the old algorithm with cut detection, the new one without cut detection, as well as a combination of both variants. In the latter case we perform the cut detection mechanism first and, if we cannot find a sufficient number of cuts for proper aligning, we use the method based on the feature vector distance.
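The combined variant can be sketched as follows. The Clip container and the simplified cut-based aligner (matching on the first cut only) are illustrative stand-ins; only the fallback logic — cut-based alignment when enough cuts are present, feature distance otherwise — follows the text.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Clip:
    cuts: List[int]        # frame numbers of detected cuts
    features: List[float]  # one scalar feature per frame (simplified)

def align_by_cuts(a, b):
    """Align on the first detected cut in each clip (simplified
    stand-in for the cut-pattern matching of the original method)."""
    return a.cuts[0] - b.cuts[0]

def align_by_features(a, b, window=25, max_shift=12):
    """Fallback: offset with the minimum summed feature distance."""
    def dist(shift):
        a0, b0 = max(0, shift), max(0, -shift)
        return sum(abs(a.features[a0 + i] - b.features[b0 + i])
                   for i in range(window))
    return min(range(-max_shift, max_shift + 1), key=dist)

def align_combined(a, b, min_cuts=3):
    """Cut-based alignment when both clips have enough cuts,
    otherwise the feature-distance fallback."""
    if len(a.cuts) >= min_cuts and len(b.cuts) >= min_cuts:
        return align_by_cuts(a, b)
    return align_by_features(a, b)
```

The combination keeps the accuracy of the cut-based method where cuts are available while no longer discarding cut-free sequences.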
Table 20: Evaluation time in seconds for finding multiples (Detect) in dependence on the algorithm used for aligning similar sequences.
Because the sequence aligning algorithm only concerns Detect.exe, the evaluation time for finding short sequences is not shown in Table 20. As in the last section, there are problems in comparing the reference value with the two varying test cases. If we take the value in parentheses, which was measured under the same test conditions as the other ones, there is, contrary to expectations, no difference between the two algorithms, and the combination of both takes a little more time because some sequences are processed twice.
Table 21: Number of all found recurring sequences, and number of all found multiples, in dependence on the algorithm used for aligning similar sequences.
Table 21 shows the number of all recurring sequences and multiples found using the different aligning methods. We find more sequences with the feature vector distance based method than with the cut detection, because sequences which do not contain cuts are not discarded; but the disadvantage is the increased number of wrongly detected sequences, which are mainly short parts of the recurring clips. To overcome this problem we tried the combination described above and are able to reduce the number of wrongly detected sequences compared to only using the feature vector distance. As seen in Table 22, the alternative aligning method can significantly improve the recall, due to the number of sequences with no or too few cuts for proper aligning, with a corresponding decrease of precision.
Table 22: Categories of all found sequences, recall, and precision in dependence on the algorithm used for aligning similar sequences. For a detailed description see Table 16.
Video File Size:
In this section we investigate the influence of the video file size, which corresponds to the total number of entries in one hash table. As mentioned above, we split the video into 2 hour slices to allow the files to be handled properly. The size of these slices has some impact on the evaluation time, but if the MAX_HASH_ENTRIES parameter is chosen appropriately there should be only a minor influence on the performance, except for the cases of clips cut at the video slice boundaries.
Table 23: Evaluation time in seconds for searching for short sequences (HashFP) and finding multiples (Detect), respectively, as a dependence on file size (total number of entries in one single hash table).
Surprisingly, the evaluation time is much smaller for the 1 hour slices, as well as for the 4 hours slices, than for the 2 hours slices. Note that the influence of this parameter strongly depends on the hardware of the evaluation system, especially on the available memory and hard disk access time. In addition, we scaled the MAX_HASH_ENTRIES parameter in relation to the total number of entries in the hash table.
Table 24: Number of all found recurring sequences, and number of all found multiples in dependence on file size (total number of entries in one single hash table).
Table 24 shows only a slight dependency of the number of found recurring sequences on the video file size. The small increase for smaller sizes is probably due to the more frequent occurrence of sequences split at the slice boundaries. The influence of the MAX_HASH_ENTRIES parameter has already been discussed.
The recall is slightly increased for larger video file sizes. This is probably also due to sequences split at the slice boundaries, whose double sequences are therefore not found.
Conclusion:
Most of the investigated parameters cause only minor changes to the performance of the system. Clearly, MIN_COUNTER and MIN_LENGTH have a significant influence on the minimum length of clips that can be found. Therefore, to search for very short sequences (not commercials) such as channel logos, these values should be decreased. The TEST_WINDOW parameter should be small enough for exactly recognizing the clips, because for larger values the granularity is not sufficient.
Table 25: Categories of all found sequences, recall, and precision, as a dependence on file size (total number of entries in one single hash table). For a detailed description see Table 16.
The most significant improvement of recall (with a corresponding decrease in precision) is obtained by the new alignment algorithm, since we no longer discard sequences without significant cuts. The combination of the cut based alignment with the feature vector based one in the case of failure proves to be the most effective variant, because the cut based alignment method is more accurate than the feature vector based method in the case where there are enough cuts.
Thus, in summary of these tests, we can increase the recall by varying a couple of parameters simultaneously. If we take the 4 hours video slices (MAX_HASH_ENTRIES=200) with very small MIN_COUNTER values (MIN_COUNTER=3, MIN_COUNTER_CUT=2), i.e., we require only two similar short sequences to build duplicates, and use the combination alignment method, we may catch 43 of the 48 commercials correctly (three of them in the +/-5 range), and two further ones we detect incorrectly, mostly due to incorrect alignment. As a result we reach a recall of 89.5% for correctly detected commercials and 93.7% for all correctly or incorrectly detected commercials, with no loss in evaluation time for the Detect.exe part (23 seconds), and the faster evaluation corresponding to the 4 hour slices for the HashFP.exe part (193 seconds). The precision, at 39.8% (56.5% with the simple length filter), is somewhat lower than for the reference case. However, for the precision values of all these experiments, keep in mind that there is a lot of channel-related material which has similar characteristics to commercials and is therefore difficult to separate; sometimes, however, it is desirable to treat these sequences as commercials too.
Figure 11 shows an embodiment of the system of the present invention where a video stream is initially captured from a broadcast. The system may comprise a broadcast capturing apparatus 1 which receives a broadcast 2. The broadcast 2 may be a conventional video signal or multi-channel signal received through an aerial or via satellite, cable, etc., or it may be a transmission received via the internet or any other such programming. The broadcast capturing apparatus 1 captures only a portion of the broadcast, for example, regions of frames from the broadcast, and these captured regions representing the frames of the original broadcast 2 then form the video stream 3 for the subsequent video processing. While the video stream maps the frames of the original broadcast 2, it is not possible to reconstruct the original broadcast 2 from the video stream 3. The broadcast capturing apparatus 1 may store the video stream 3 before transmitting it to the fingerprint database generating apparatus 4. The fingerprint database generating apparatus 4 then analyses the video stream 3 to detect recurring video sequences as described above and adds any new ones that it finds to a database 5. Possibly in a different location, for example in a pub, a detection and replacement apparatus 6 monitors a live broadcast 7 and outputs the broadcast in its as received form via a video output 8 to a large screen or other display device (not shown). When the detection and replacement apparatus detects a sequence of frames which matches one on its reference database 9, for example by using the matching algorithm described above, it swaps the matched sequence in the broadcast for a replacement sequence in a modified video output 10. 
A switching device 11 may be provided to switch between the outputs 8,10, automatically switching to the modified video output 10 from video output 8 whenever a matching sequence is detected and switching back again to the video output 8 once the detected sequence has finished. In this way the audience watching the video output 12 from the switching device on a large screen would see the unmodified broadcast 8 when a match is not detected and the modified video output 10 for just those periods when a match is detected. Preferably the detection and replacement apparatus 6 is set up to swap the commercial spots which are detected in the live broadcast 7. The detection and replacement apparatus 6 may update its reference database 9 periodically with fingerprints of new recurring video sequences 12 from the database 5 of the fingerprint database generating apparatus 4. One or more broadcast capturing apparatus 1 may communicate with a fingerprint database generating apparatus 4, feeding captured video stream 3 for the detection of repeating video sequences. The fingerprint database generating apparatus 4 may communicate with other such databases 4 in different locations to update its library of repeating video sequences.
Whilst the claims are directed to the first aspect of an automated method of mining a video stream for repeating video sequences and apparatus employing the method, the present invention also provides a second aspect of a method of identifying the recurrence in a video stream of a stored sequence and apparatus employing the method. The method is preferably employed in a fully automated system with the method of the first aspect, as described in claim 19, the methods preferably being conducted in different locations. The method of the second aspect is described in clauses 1 to 12 below, namely:
1. A method of identifying the recurrence in a video stream of a stored sequence, the method comprising: receiving a video stream; generating hash values for frames of the video stream; comparing the hash values from the video stream to hash values for frames of a stored sequence to identify frames with matching hash values; identifying sub-sequences of frames in the video stream that each contain a frame with a matched hash value to a frame of the stored sequence; comparing sub-sequences of frames in the video stream to candidate sub-sequences of frames in the stored sequence and determining a match when a sub-sequence contains a threshold level of frames with hash values matched to frames of a candidate sub-sequence; identifying a sequence of frames, where the sequence contains a group of subsequences that have been matched to candidate sub-sequences; determining whether the sequence of frames in the stored sequence is recurring as a sequence of frames in the video stream by comparing the groups of sub-sequences and candidate sub-sequences and determining whether a threshold level of similarity is met; and when a recurrence of a video sequence is detected, generating a signal indicative of having identified a recurrence in the video stream of the stored sequence.
2. A method as described in clause 1, wherein the method includes the step of building an inverted image index.
3. A method as described in clause 1 or 2, wherein the hash values are generated for regions within a frame.
4. A method as described in clause 3, wherein the regions represent less than an entire frame, preferably less than 80% of a frame.
5. A method as described in any preceding clause (1 to 4 above), wherein feature vectors are used for generating hash values that create a scalar feature value.
6. A method as described in clause 5, wherein the feature vector values are reduced and mapped to 1-byte integer values.
7. A method as described in any preceding clause (1 to 6 above), wherein the hash values are generated from a function of colour within the frame.
8. A method as described in clause 7, wherein the hash value is generated from a gradient histogram algorithm.

9. A method as described in any preceding clause (1 to 8 above), wherein the subsequences being matched are between 20 and 30 frames, most preferably 25 frames.
10. A method as described in any preceding clause (1 to 9 above), wherein the subsequences are considered similar when a threshold level of matched frames is exceeded, which is preferably 40%.
11. A method as described in any preceding clause (1 to 10 above), wherein the sequences being matched are filtered to remove sequences having more than a threshold number of consecutive non-matching frames.
12. A method as described in any preceding clause (1 to 11), wherein the video stream comprises a plurality of different video signals and / or segments of video which are temporally distinct.

Claims:
1. A method of mining a video stream for repeating video sequences comprising: receiving a video stream; generating hash values for frames of the video stream; identifying frames with matching hash values; identifying sub-sequences of frames within the video stream that each contain a frame with a matched hash value; comparing sub-sequences and determining a match when a sub-sequence contains a threshold level of frames with hash values matched to frames of another sub-sequence; finding sequences of frames, where each sequence contains a group of subsequences that have been matched to other sub-sequences; determining whether a sequence of frames in a first portion of the video stream is repeating as a sequence of frames in a second portion of the video stream by comparing the groups of matched sub-sequences and determining whether a threshold level of similarity is met; and when a repeating video sequence is detected, detecting start and end frames for the repeating video sequence.
2. A method as claimed in claim 1, wherein the method includes the step of building an inverted image index.
3. A method as claimed in claim 1 or 2, wherein the hash values are generated for regions within a frame.
4. A method as claimed in claim 3, wherein the regions represent less than an entire frame, preferably less than 80% of a frame.
5. A method as claimed in any preceding claim, wherein feature vectors are used to generate hash values that create a scalar feature value.
6. A method as claimed in claim 5, wherein the feature vector values are reduced and mapped to 1-byte integer values.
7. A method as claimed in any preceding claim, wherein the hash values are generated from a function of colour within the frame.
8. A method as claimed in claim 7, wherein the hash value is generated from a gradient histogram algorithm.
9. A method as claimed in any preceding claim, wherein the sub-sequences being matched are between 20 and 30 frames, most preferably 25 frames.
10. A method as claimed in any preceding claim, wherein the sub-sequences are considered similar when a threshold level of matched frames is exceeded, which is preferably 40%.
11. A method as claimed in any preceding claim, wherein the sequences being matched are filtered to remove sequences having more than a threshold number of consecutive non-matching frames.
12. A method as claimed in any preceding claim, including the step of searching a reference database of stored video sequences, preferably stored as fingerprints of the sequences, identifying whether the repeating video sequence is new, and if it is, adding the repeating video sequence to the reference database.
13. A method as claimed in claim 12, wherein the step of identifying whether the repeating video sequence is new includes the step of comparing hash values for frames of the repeating video sequence against hash values for frames of the stored sequences in the reference database, identifying sub-sequences containing frames with matching hash values, matching the sub-sequences, finding sequences containing the sub-sequences and determining if the repeating video sequence matches a stored sequence.
14. A method as claimed in any preceding claim, wherein the video stream comprises a plurality of different video signals and/or segments of video which are temporally distinct.
15. A method as claimed in any preceding claim, wherein the reference database is transmitted to a video sequence detection and replacement apparatus for use in an automated video sequence detection and replacement method which is executed at a different location to the method of mining a video stream.
16. An apparatus having an input for receiving a video stream and a processor which is programmed to execute the method of any of the preceding claims.
17. A system comprising:
a) a fingerprint database generating apparatus having:
an input for inputting a video stream;
a processor which is programmed with a first algorithm to analyse the video stream in order to identify repeating video sequences;
a fingerprint database which is updated automatically with the fingerprints of a repeating video sequence when a new repeating video sequence is detected; and
an output for outputting fingerprint data of detected repeating video sequences;
b) a detection and replacement apparatus, which is adapted to perform commercial spot detection and replacement on a live video broadcast in real time, the detection and replacement apparatus having:
a video input for receiving a video broadcast;
a video output for outputting a video signal to an observer;
a video switch for selecting a source of the outputted video signal; and
a processor which is programmed with a second algorithm in order to detect a commercial spot by generating fingerprints of the video broadcast, comparing the fingerprints to stored fingerprints on a reference database, and, when a match is detected, to trigger automatically the video switch to switch the source of the outputted video signal from the received video broadcast to a video output having a replacement commercial spot, so that the outputted video signal corresponds to the received video broadcast with replacement commercial spots; and
c) wherein the detection and replacement apparatus has a communication link for communicating with the output of the fingerprint database generating apparatus to receive and store updates of the fingerprint database and thereby update automatically its reference database of fingerprints.
18. A system as claimed in claim 17, wherein the fingerprint database generating apparatus is a computing system with a processor that is programmed with a first algorithm to analyse the video stream in order to detect repeating video sequences, the first algorithm comprising the method of mining a video stream for repeating video sequences as claimed in any of claims 1 to 15.
19. A system as claimed in claim 17, wherein the detection and replacement apparatus is a computing system with a processor that is programmed with a second algorithm to analyse the video broadcast in order to detect the recurrence of a stored sequence, the second algorithm comprising a method of identifying the recurrence in a video stream of a stored sequence, the method comprising:
receiving a video stream;
generating hash values for frames of the video stream;
comparing the hash values from the video stream to hash values for frames of a stored sequence to identify frames with matching hash values;
identifying sub-sequences of frames in the video stream that each contain a frame with a matched hash value to a frame of the stored sequence;
comparing sub-sequences of frames in the video stream to candidate sub-sequences of frames in the stored sequence and determining a match when a sub-sequence contains a threshold level of frames with hash values matched to frames of a candidate sub-sequence;
identifying a sequence of frames, where the sequence contains a group of sub-sequences that have been matched to candidate sub-sequences;
determining whether the sequence of frames in the stored sequence is recurring as a sequence of frames in the video stream by comparing the groups of sub-sequences and candidate sub-sequences and determining whether a threshold level of similarity is met; and
when a recurrence of a video sequence is detected, generating a signal indicative of having identified a recurrence in the video stream of the stored sequence.
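The inverted image index of claim 2 and the 1-byte hash values of claim 6 can be sketched as follows. This is an illustrative outline only, not the claimed implementation: `frame_hash` stands in for whatever colour- or gradient-based feature function is used (claims 7 and 8), with the reduction to a 1-byte integer mimicked by taking the feature value modulo 256, and `min_gap` is an assumed parameter separating the two occurrences of a candidate repeat.

```python
# Sketch of the hash-collision stage of claim 1: per-frame hash values
# feed an inverted index (claim 2) mapping each hash value to the frame
# positions where it occurred; collisions between sufficiently distant
# positions seed the sub-sequence comparisons of the later claims.

from collections import defaultdict

def frame_hash(feature_value: int) -> int:
    # Claim 6: reduce and map the feature value to a 1-byte integer.
    return feature_value % 256

def build_inverted_index(features):
    """Inverted image index: hash value -> list of frame positions."""
    index = defaultdict(list)
    for pos, f in enumerate(features):
        index[frame_hash(f)].append(pos)
    return index

def collision_pairs(index, min_gap=25):
    """Yield pairs of frame positions sharing a hash value and lying far
    enough apart to belong to distinct occurrences of a repeat."""
    for positions in index.values():
        for i, p in enumerate(positions):
            for q in positions[i + 1:]:
                if q - p >= min_gap:
                    yield p, q
```

Each yielded pair marks two places in the stream where the same frame hash occurred; the surrounding sub-sequences are then compared against the matched-frame threshold before whole repeating sequences, with start and end frames, are assembled.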
PCT/GB2009/001460 2008-06-10 2009-06-10 Automatic detection of repeating video sequences WO2009150425A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0810618.9A GB2460844B (en) 2008-06-10 2008-06-10 Automatic detection of repeating video sequences
GB0810618.9 2008-06-10

Publications (2)

Publication Number Publication Date
WO2009150425A2 true WO2009150425A2 (en) 2009-12-17
WO2009150425A3 WO2009150425A3 (en) 2010-09-10

Family

ID=39650770

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2009/001460 WO2009150425A2 (en) 2008-06-10 2009-06-10 Automatic detection of repeating video sequences

Country Status (2)

Country Link
GB (1) GB2460844B (en)
WO (1) WO2009150425A2 (en)

Cited By (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012027178A1 (en) * 2010-08-25 2012-03-01 Eastman Kodak Company Detecting recurring events in consumer image collections
WO2014145938A1 (en) * 2013-03-15 2014-09-18 Zeev Neumeier Systems and methods for real-time television ad detection using an automated content recognition database
US8930980B2 (en) 2010-05-27 2015-01-06 Cognitive Networks, Inc. Systems and methods for real-time television ad detection using an automated content recognition database
US9154942B2 (en) 2008-11-26 2015-10-06 Free Stream Media Corp. Zero configuration communication between a browser and a networked media device
EP2850838A4 (en) * 2012-04-01 2016-01-20 Tvtak Ltd Methods and systems for providing broadcast ad identification
US9258383B2 (en) 2008-11-26 2016-02-09 Free Stream Media Corp. Monetization of television audience data across muliple screens of a user watching television
US9386356B2 (en) 2008-11-26 2016-07-05 Free Stream Media Corp. Targeting with television audience data across multiple screens
US9519772B2 (en) 2008-11-26 2016-12-13 Free Stream Media Corp. Relevancy improvement through targeting of information based on data gathered from a networked device associated with a security sandbox of a client device
EP2982131A4 (en) * 2013-03-15 2017-01-18 Cognitive Media Networks, Inc. Systems and methods for real-time television ad detection using an automated content recognition database
US9560425B2 (en) 2008-11-26 2017-01-31 Free Stream Media Corp. Remotely control devices over a network without authentication or registration
US20170032034A1 (en) * 2009-05-29 2017-02-02 Vizio Inscape Technologies, Llc Systems and methods for addressing a media database using distance associative hashing
US9602870B2 (en) 2011-03-31 2017-03-21 Tvtak Ltd. Devices, systems, methods, and media for detecting, indexing, and comparing video signals from a video display in a background scene using a camera-enabled device
US9635417B2 (en) 2013-04-05 2017-04-25 Dolby Laboratories Licensing Corporation Acquisition, recovery, and matching of unique information from file-based media for automated file detection
US9838753B2 (en) 2013-12-23 2017-12-05 Inscape Data, Inc. Monitoring individual viewing of television events using tracking pixels and cookies
US9906834B2 (en) 2009-05-29 2018-02-27 Inscape Data, Inc. Methods for identifying video segments and displaying contextually targeted content on a connected television
US9955192B2 (en) 2013-12-23 2018-04-24 Inscape Data, Inc. Monitoring individual viewing of television events using tracking pixels and cookies
US9961388B2 (en) 2008-11-26 2018-05-01 David Harrison Exposure of public internet protocol addresses in an advertising exchange server to improve relevancy of advertisements
WO2018085090A1 (en) * 2016-11-02 2018-05-11 Alphonso Inc. System and method for detecting repeating content, including commercials, in a video data stream
US9986279B2 (en) 2008-11-26 2018-05-29 Free Stream Media Corp. Discovery, access control, and communication with networked services
US10073584B2 (en) 2016-06-12 2018-09-11 Apple Inc. User interfaces for retrieving contextually relevant media content
US10080062B2 (en) 2015-07-16 2018-09-18 Inscape Data, Inc. Optimizing media fingerprint retention to improve system resource utilization
US10116972B2 (en) 2009-05-29 2018-10-30 Inscape Data, Inc. Methods for identifying video segments and displaying option to view from an alternative source and/or on an alternative device
US10136185B2 (en) 2016-10-25 2018-11-20 Alphonso Inc. System and method for detecting unknown TV commercials from a live TV stream
US10192138B2 (en) 2010-05-27 2019-01-29 Inscape Data, Inc. Systems and methods for reducing data density in large datasets
US10296166B2 (en) 2010-01-06 2019-05-21 Apple Inc. Device, method, and graphical user interface for navigating and displaying content in context
US10324973B2 (en) 2016-06-12 2019-06-18 Apple Inc. Knowledge graph metadata network based on notable moments
US10334324B2 (en) 2008-11-26 2019-06-25 Free Stream Media Corp. Relevant advertisement generation based on a user operating a client device communicatively coupled with a networked media device
US10362219B2 (en) 2016-09-23 2019-07-23 Apple Inc. Avatar creation and editing
US10375451B2 (en) 2009-05-29 2019-08-06 Inscape Data, Inc. Detection of common media segments
US10405014B2 (en) 2015-01-30 2019-09-03 Inscape Data, Inc. Methods for identifying video segments and displaying option to view from an alternative source and/or on an alternative device
CN110245267A (en) * 2019-05-17 2019-09-17 天津大学 Multi-user's video flowing deep learning is shared to calculate multiplexing method
US10419541B2 (en) 2008-11-26 2019-09-17 Free Stream Media Corp. Remotely control devices over a network without authentication or registration
US10482349B2 (en) 2015-04-17 2019-11-19 Inscape Data, Inc. Systems and methods for reducing data density in large datasets
US10567823B2 (en) 2008-11-26 2020-02-18 Free Stream Media Corp. Relevant advertisement generation based on a user operating a client device communicatively coupled with a networked media device
US10572132B2 (en) 2015-06-05 2020-02-25 Apple Inc. Formatting content for a reduced-size user interface
CN111027419A (en) * 2019-11-22 2020-04-17 腾讯科技(深圳)有限公司 Method, device, equipment and medium for detecting video irrelevant content
US10631068B2 (en) 2008-11-26 2020-04-21 Free Stream Media Corp. Content exposure attribution based on renderings of related content across multiple devices
US10803135B2 (en) 2018-09-11 2020-10-13 Apple Inc. Techniques for disambiguating clustered occurrence identifiers
US10846343B2 (en) 2018-09-11 2020-11-24 Apple Inc. Techniques for disambiguating clustered location identifiers
US10873788B2 (en) 2015-07-16 2020-12-22 Inscape Data, Inc. Detection of common media segments
US10880340B2 (en) 2008-11-26 2020-12-29 Free Stream Media Corp. Relevancy improvement through targeting of information based on data gathered from a networked device associated with a security sandbox of a client device
US10902048B2 (en) 2015-07-16 2021-01-26 Inscape Data, Inc. Prediction of future views of video segments to optimize system resource utilization
US10949458B2 (en) 2009-05-29 2021-03-16 Inscape Data, Inc. System and method for improving work load management in ACR television monitoring system
US10977693B2 (en) 2008-11-26 2021-04-13 Free Stream Media Corp. Association of content identifier of audio-visual data with additional data through capture infrastructure
US10983984B2 (en) 2017-04-06 2021-04-20 Inscape Data, Inc. Systems and methods for improving accuracy of device maps using media viewing data
US11086935B2 (en) 2018-05-07 2021-08-10 Apple Inc. Smart updates from historical database changes
CN113553469A (en) * 2020-04-23 2021-10-26 阿里巴巴集团控股有限公司 Data processing method and device, electronic equipment and computer storage medium
US11243996B2 (en) 2018-05-07 2022-02-08 Apple Inc. Digital asset search user interface
US11272248B2 (en) 2009-05-29 2022-03-08 Inscape Data, Inc. Methods for identifying video segments and displaying contextually targeted content on a connected television
US11308144B2 (en) 2015-07-16 2022-04-19 Inscape Data, Inc. Systems and methods for partitioning search indexes for improved efficiency in identifying media segments
US11334209B2 (en) 2016-06-12 2022-05-17 Apple Inc. User interfaces for retrieving contextually relevant media content
CN115309920A (en) * 2022-10-08 2022-11-08 国家广播电视总局信息中心 Audio and video management method and system based on fusion big data
CN116567351A (en) * 2023-07-06 2023-08-08 腾讯科技(深圳)有限公司 Video processing method, device, equipment and medium
US11782575B2 (en) 2018-05-07 2023-10-10 Apple Inc. User interfaces for sharing contextually relevant media content
CN110083739B (en) * 2013-03-15 2024-04-30 构造数据有限责任公司 System and method for addressing media databases using distance associative hashing

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101456926B1 (en) 2013-06-14 2014-10-31 (주)엔써즈 System and method for detecting advertisement based on fingerprint
EP3061233B1 (en) 2013-10-25 2019-12-11 Microsoft Technology Licensing, LLC Representing blocks with hash values in video and image coding and decoding
CN105684441B (en) 2013-10-25 2018-09-21 微软技术许可有限责任公司 The Block- matching based on hash in video and image coding
CN105556971B (en) 2014-03-04 2019-07-30 微软技术许可有限责任公司 It stirs for the block in intra block duplication prediction and determines with the coder side of dancing mode
US10567754B2 (en) 2014-03-04 2020-02-18 Microsoft Technology Licensing, Llc Hash table construction and availability checking for hash-based block matching
KR102287779B1 (en) 2014-06-23 2021-08-06 마이크로소프트 테크놀로지 라이센싱, 엘엘씨 Encoder decisions based on results of hash-based block matching
GB2591695B (en) * 2014-06-24 2021-12-08 Grass Valley Ltd Hash-based media search
GB2527528B (en) 2014-06-24 2021-09-29 Grass Valley Ltd Hash-based media search
EP3416386B1 (en) * 2014-09-30 2021-01-13 Microsoft Technology Licensing, LLC Hash-based encoder decisions for video coding
GB2570601B8 (en) * 2015-02-24 2020-12-09 Visenze Pte Ltd Method and system for identifying relevant media content
US10390039B2 (en) 2016-08-31 2019-08-20 Microsoft Technology Licensing, Llc Motion estimation for screen remoting scenarios
US11095877B2 (en) 2016-11-30 2021-08-17 Microsoft Technology Licensing, Llc Local hash-based motion estimation for screen remoting scenarios
US10715863B2 (en) 2017-06-30 2020-07-14 The Nielsen Company (Us), Llc Frame certainty for automatic content recognition
US11202085B1 (en) 2020-06-12 2021-12-14 Microsoft Technology Licensing, Llc Low-cost hash table construction and hash-based block matching for variable-size blocks
CN111696105B (en) * 2020-06-24 2023-05-23 北京金山云网络技术有限公司 Video processing method and device and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004080073A2 (en) * 2003-03-07 2004-09-16 Half Minute Media Ltd Method and system for video segment detection and substitution

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1240218C (en) * 1999-11-01 2006-02-01 皇家菲利浦电子有限公司 Method and apparatus for swapping the video contents of undesired commercial breaks or other video sequences
TWI242376B (en) * 2004-06-24 2005-10-21 Via Tech Inc Method and related system for detecting advertising by integrating results based on different detecting rules
US20060195860A1 (en) * 2005-02-25 2006-08-31 Eldering Charles A Acting on known video entities detected utilizing fingerprinting
GB2423881B (en) * 2005-03-01 2011-03-09 Half Minute Media Ltd Detecting known video entities taking into account regions of disinterest
US20060271947A1 (en) * 2005-05-23 2006-11-30 Lienhart Rainer W Creating fingerprints
WO2008062145A1 (en) * 2006-11-22 2008-05-29 Half Minute Media Limited Creating fingerprints

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004080073A2 (en) * 2003-03-07 2004-09-16 Half Minute Media Ltd Method and system for video segment detection and substitution

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
INA DÖHRING ET AL: "Fast and Effective Features for Recognizing Recurring Video Clips in Very Large Databases", IMAGE ANALYSIS AND PROCESSING WORKSHOPS (ICIAPW 2007), 14TH INTERNATIONAL CONFERENCE ON, IEEE, 1 September 2007 (2007-09-01), pages 65-70, XP031199910, ISBN: 978-0-7695-2921-9 *
LIENHART R: "Dynamic video summarization of home video", PROCEEDINGS OF THE INTERNATIONAL SOCIETY FOR OPTICAL ENGINEERING (SPIE), vol. 3972, 1 December 1999 (1999-12-01), pages 378-389, DOI: 10.1117/12.373569, XP002393545, ISSN: 0277-786X *
MASIHI Z G ET AL: "Content based Video Retrieval based on Approximate String Matching", COMPUTER AS A TOOL (EUROCON 2005), THE INTERNATIONAL CONFERENCE ON, BELGRADE, SERBIA AND MONTENEGRO, 21-24 NOV. 2005, IEEE, vol. 2, 21 November 2005 (2005-11-21), pages 1300-1303, DOI: 10.1109/EURCON.2005.1630196, XP010916359, ISBN: 978-1-4244-0049-2 *

Cited By (111)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9986279B2 (en) 2008-11-26 2018-05-29 Free Stream Media Corp. Discovery, access control, and communication with networked services
US10986141B2 (en) 2008-11-26 2021-04-20 Free Stream Media Corp. Relevancy improvement through targeting of information based on data gathered from a networked device associated with a security sandbox of a client device
US10977693B2 (en) 2008-11-26 2021-04-13 Free Stream Media Corp. Association of content identifier of audio-visual data with additional data through capture infrastructure
US10880340B2 (en) 2008-11-26 2020-12-29 Free Stream Media Corp. Relevancy improvement through targeting of information based on data gathered from a networked device associated with a security sandbox of a client device
US10791152B2 (en) 2008-11-26 2020-09-29 Free Stream Media Corp. Automatic communications between networked devices such as televisions and mobile devices
US9154942B2 (en) 2008-11-26 2015-10-06 Free Stream Media Corp. Zero configuration communication between a browser and a networked media device
US9167419B2 (en) 2008-11-26 2015-10-20 Free Stream Media Corp. Discovery and launch system and method
US10771525B2 (en) 2008-11-26 2020-09-08 Free Stream Media Corp. System and method of discovery and launch associated with a networked media device
US9258383B2 (en) 2008-11-26 2016-02-09 Free Stream Media Corp. Monetization of television audience data across muliple screens of a user watching television
US9386356B2 (en) 2008-11-26 2016-07-05 Free Stream Media Corp. Targeting with television audience data across multiple screens
US9519772B2 (en) 2008-11-26 2016-12-13 Free Stream Media Corp. Relevancy improvement through targeting of information based on data gathered from a networked device associated with a security sandbox of a client device
US10631068B2 (en) 2008-11-26 2020-04-21 Free Stream Media Corp. Content exposure attribution based on renderings of related content across multiple devices
US9560425B2 (en) 2008-11-26 2017-01-31 Free Stream Media Corp. Remotely control devices over a network without authentication or registration
US10567823B2 (en) 2008-11-26 2020-02-18 Free Stream Media Corp. Relevant advertisement generation based on a user operating a client device communicatively coupled with a networked media device
US9576473B2 (en) 2008-11-26 2017-02-21 Free Stream Media Corp. Annotation of metadata through capture infrastructure
US9591381B2 (en) 2008-11-26 2017-03-07 Free Stream Media Corp. Automated discovery and launch of an application on a network enabled device
US9589456B2 (en) 2008-11-26 2017-03-07 Free Stream Media Corp. Exposure of public internet protocol addresses in an advertising exchange server to improve relevancy of advertisements
US10425675B2 (en) 2008-11-26 2019-09-24 Free Stream Media Corp. Discovery, access control, and communication with networked services
US10419541B2 (en) 2008-11-26 2019-09-17 Free Stream Media Corp. Remotely control devices over a network without authentication or registration
US9686596B2 (en) 2008-11-26 2017-06-20 Free Stream Media Corp. Advertisement targeting through embedded scripts in supply-side and demand-side platforms
US9706265B2 (en) 2008-11-26 2017-07-11 Free Stream Media Corp. Automatic communications between networked devices such as televisions and mobile devices
US9703947B2 (en) 2008-11-26 2017-07-11 Free Stream Media Corp. Relevancy improvement through targeting of information based on data gathered from a networked device associated with a security sandbox of a client device
US9716736B2 (en) 2008-11-26 2017-07-25 Free Stream Media Corp. System and method of discovery and launch associated with a networked media device
US9838758B2 (en) 2008-11-26 2017-12-05 David Harrison Relevancy improvement through targeting of information based on data gathered from a networked device associated with a security sandbox of a client device
US10334324B2 (en) 2008-11-26 2019-06-25 Free Stream Media Corp. Relevant advertisement generation based on a user operating a client device communicatively coupled with a networked media device
US9848250B2 (en) 2008-11-26 2017-12-19 Free Stream Media Corp. Relevancy improvement through targeting of information based on data gathered from a networked device associated with a security sandbox of a client device
US9854330B2 (en) 2008-11-26 2017-12-26 David Harrison Relevancy improvement through targeting of information based on data gathered from a networked device associated with a security sandbox of a client device
US10142377B2 (en) 2008-11-26 2018-11-27 Free Stream Media Corp. Relevancy improvement through targeting of information based on data gathered from a networked device associated with a security sandbox of a client device
US9866925B2 (en) 2008-11-26 2018-01-09 Free Stream Media Corp. Relevancy improvement through targeting of information based on data gathered from a networked device associated with a security sandbox of a client device
US10074108B2 (en) 2008-11-26 2018-09-11 Free Stream Media Corp. Annotation of metadata through capture infrastructure
US10032191B2 (en) 2008-11-26 2018-07-24 Free Stream Media Corp. Advertisement targeting through embedded scripts in supply-side and demand-side platforms
US9961388B2 (en) 2008-11-26 2018-05-01 David Harrison Exposure of public internet protocol addresses in an advertising exchange server to improve relevancy of advertisements
US9967295B2 (en) 2008-11-26 2018-05-08 David Harrison Automated discovery and launch of an application on a network enabled device
US10949458B2 (en) 2009-05-29 2021-03-16 Inscape Data, Inc. System and method for improving work load management in ACR television monitoring system
US10185768B2 (en) 2009-05-29 2019-01-22 Inscape Data, Inc. Systems and methods for addressing a media database using distance associative hashing
US10375451B2 (en) 2009-05-29 2019-08-06 Inscape Data, Inc. Detection of common media segments
US11080331B2 (en) 2009-05-29 2021-08-03 Inscape Data, Inc. Systems and methods for addressing a media database using distance associative hashing
US9906834B2 (en) 2009-05-29 2018-02-27 Inscape Data, Inc. Methods for identifying video segments and displaying contextually targeted content on a connected television
US20170032034A1 (en) * 2009-05-29 2017-02-02 Vizio Inscape Technologies, Llc Systems and methods for addressing a media database using distance associative hashing
US11272248B2 (en) 2009-05-29 2022-03-08 Inscape Data, Inc. Methods for identifying video segments and displaying contextually targeted content on a connected television
US10116972B2 (en) 2009-05-29 2018-10-30 Inscape Data, Inc. Methods for identifying video segments and displaying option to view from an alternative source and/or on an alternative device
US10820048B2 (en) 2009-05-29 2020-10-27 Inscape Data, Inc. Methods for identifying video segments and displaying contextually targeted content on a connected television
US10271098B2 (en) 2009-05-29 2019-04-23 Inscape Data, Inc. Methods for identifying video segments and displaying contextually targeted content on a connected television
US10169455B2 (en) 2009-05-29 2019-01-01 Inscape Data, Inc. Systems and methods for addressing a media database using distance associative hashing
US10732790B2 (en) 2010-01-06 2020-08-04 Apple Inc. Device, method, and graphical user interface for navigating and displaying content in context
US11592959B2 (en) 2010-01-06 2023-02-28 Apple Inc. Device, method, and graphical user interface for navigating and displaying content in context
US10296166B2 (en) 2010-01-06 2019-05-21 Apple Inc. Device, method, and graphical user interface for navigating and displaying content in context
US11099712B2 (en) 2010-01-06 2021-08-24 Apple Inc. Device, method, and graphical user interface for navigating and displaying content in context
US10192138B2 (en) 2010-05-27 2019-01-29 Inscape Data, Inc. Systems and methods for reducing data density in large datasets
US8930980B2 (en) 2010-05-27 2015-01-06 Cognitive Networks, Inc. Systems and methods for real-time television ad detection using an automated content recognition database
US8811755B2 (en) 2010-08-25 2014-08-19 Apple Inc. Detecting recurring events in consumer image collections
WO2012027178A1 (en) * 2010-08-25 2012-03-01 Eastman Kodak Company Detecting recurring events in consumer image collections
US8634662B2 (en) 2010-08-25 2014-01-21 Apple Inc. Detecting recurring events in consumer image collections
US9860593B2 (en) 2011-03-31 2018-01-02 Tvtak Ltd. Devices, systems, methods, and media for detecting, indexing, and comparing video signals from a video display in a background scene using a camera-enabled device
US9602870B2 (en) 2011-03-31 2017-03-21 Tvtak Ltd. Devices, systems, methods, and media for detecting, indexing, and comparing video signals from a video display in a background scene using a camera-enabled device
EP2850838A4 (en) * 2012-04-01 2016-01-20 Tvtak Ltd Methods and systems for providing broadcast ad identification
CN109905726A (en) * 2013-03-15 2019-06-18 构造数据有限责任公司 The system and method for real-time television purposes of commercial detection
CN110083739A (en) * 2013-03-15 2019-08-02 构造数据有限责任公司 For using the system and method apart from relevance hashing to media database addressing
EP3534615A1 (en) * 2013-03-15 2019-09-04 Inscape Data, Inc. Systems and methods for real-time television ad detection using an automated content recognition database
EP4221235A3 (en) * 2013-03-15 2023-09-20 Inscape Data, Inc. Systems and methods for identifying video segments for displaying contextually relevant content
WO2014145938A1 (en) * 2013-03-15 2014-09-18 Zeev Neumeier Systems and methods for real-time television ad detection using an automated content recognition database
EP2982131A4 (en) * 2013-03-15 2017-01-18 Cognitive Media Networks, Inc. Systems and methods for real-time television ad detection using an automated content recognition database
CN110083739B (en) * 2013-03-15 2024-04-30 构造数据有限责任公司 System and method for addressing media databases using distance associative hashing
CN109905726B (en) * 2013-03-15 2021-06-04 构造数据有限责任公司 System and method for real-time television advertisement detection
US9635417B2 (en) 2013-04-05 2017-04-25 Dolby Laboratories Licensing Corporation Acquisition, recovery, and matching of unique information from file-based media for automated file detection
US10349125B2 (en) 2013-04-05 2019-07-09 Dolby Laboratories Licensing Corporation Method and apparatus for enabling a loudness controller to adjust a loudness level of a secondary media data portion in a media content to a different loudness level
US10284884B2 (en) 2013-12-23 2019-05-07 Inscape Data, Inc. Monitoring individual viewing of television events using tracking pixels and cookies
US10306274B2 (en) 2013-12-23 2019-05-28 Inscape Data, Inc. Monitoring individual viewing of television events using tracking pixels and cookies
US9955192B2 (en) 2013-12-23 2018-04-24 Inscape Data, Inc. Monitoring individual viewing of television events using tracking pixels and cookies
US11039178B2 (en) 2013-12-23 2021-06-15 Inscape Data, Inc. Monitoring individual viewing of television events using tracking pixels and cookies
US9838753B2 (en) 2013-12-23 2017-12-05 Inscape Data, Inc. Monitoring individual viewing of television events using tracking pixels and cookies
US11711554B2 (en) 2015-01-30 2023-07-25 Inscape Data, Inc. Methods for identifying video segments and displaying option to view from an alternative source and/or on an alternative device
US10405014B2 (en) 2015-01-30 2019-09-03 Inscape Data, Inc. Methods for identifying video segments and displaying option to view from an alternative source and/or on an alternative device
US10945006B2 (en) 2015-01-30 2021-03-09 Inscape Data, Inc. Methods for identifying video segments and displaying option to view from an alternative source and/or on an alternative device
US10482349B2 (en) 2015-04-17 2019-11-19 Inscape Data, Inc. Systems and methods for reducing data density in large datasets
US10572132B2 (en) 2015-06-05 2020-02-25 Apple Inc. Formatting content for a reduced-size user interface
US10902048B2 (en) 2015-07-16 2021-01-26 Inscape Data, Inc. Prediction of future views of video segments to optimize system resource utilization
US10873788B2 (en) 2015-07-16 2020-12-22 Inscape Data, Inc. Detection of common media segments
US10674223B2 (en) 2015-07-16 2020-06-02 Inscape Data, Inc. Optimizing media fingerprint retention to improve system resource utilization
US11451877B2 (en) 2015-07-16 2022-09-20 Inscape Data, Inc. Optimizing media fingerprint retention to improve system resource utilization
US11308144B2 (en) 2015-07-16 2022-04-19 Inscape Data, Inc. Systems and methods for partitioning search indexes for improved efficiency in identifying media segments
US11659255B2 (en) 2015-07-16 2023-05-23 Inscape Data, Inc. Detection of common media segments
US10080062B2 (en) 2015-07-16 2018-09-18 Inscape Data, Inc. Optimizing media fingerprint retention to improve system resource utilization
US11941223B2 (en) 2016-06-12 2024-03-26 Apple Inc. User interfaces for retrieving contextually relevant media content
US11334209B2 (en) 2016-06-12 2022-05-17 Apple Inc. User interfaces for retrieving contextually relevant media content
US11681408B2 (en) 2016-06-12 2023-06-20 Apple Inc. User interfaces for retrieving contextually relevant media content
US10324973B2 (en) 2016-06-12 2019-06-18 Apple Inc. Knowledge graph metadata network based on notable moments
US10073584B2 (en) 2016-06-12 2018-09-11 Apple Inc. User interfaces for retrieving contextually relevant media content
US10891013B2 (en) 2016-06-12 2021-01-12 Apple Inc. User interfaces for retrieving contextually relevant media content
US10362219B2 (en) 2016-09-23 2019-07-23 Apple Inc. Avatar creation and editing
US10805681B2 (en) 2016-10-25 2020-10-13 Alphonso Inc. System and method for detecting unknown TV commercials from a live TV stream
US10136185B2 (en) 2016-10-25 2018-11-20 Alphonso Inc. System and method for detecting unknown TV commercials from a live TV stream
US10614137B2 (en) 2016-11-02 2020-04-07 Alphonso Inc. System and method for detecting repeating content, including commercials, in a video data stream
WO2018085090A1 (en) * 2016-11-02 2018-05-11 Alphonso Inc. System and method for detecting repeating content, including commercials, in a video data stream
US10108718B2 (en) 2016-11-02 2018-10-23 Alphonso Inc. System and method for detecting repeating content, including commercials, in a video data stream
US10983984B2 (en) 2017-04-06 2021-04-20 Inscape Data, Inc. Systems and methods for improving accuracy of device maps using media viewing data
US11782575B2 (en) 2018-05-07 2023-10-10 Apple Inc. User interfaces for sharing contextually relevant media content
US11086935B2 (en) 2018-05-07 2021-08-10 Apple Inc. Smart updates from historical database changes
US11243996B2 (en) 2018-05-07 2022-02-08 Apple Inc. Digital asset search user interface
US10846343B2 (en) 2018-09-11 2020-11-24 Apple Inc. Techniques for disambiguating clustered location identifiers
US10803135B2 (en) 2018-09-11 2020-10-13 Apple Inc. Techniques for disambiguating clustered occurrence identifiers
US11775590B2 (en) 2018-09-11 2023-10-03 Apple Inc. Techniques for disambiguating clustered location identifiers
CN110245267B (en) * 2019-05-17 2023-08-11 天津大学 Multi-user video stream deep-learning shared-computation multiplexing method
CN110245267A (en) * 2019-05-17 2019-09-17 天津大学 Multi-user video stream deep-learning shared-computation multiplexing method
CN111027419A (en) * 2019-11-22 2020-04-17 腾讯科技(深圳)有限公司 Method, device, equipment and medium for detecting video irrelevant content
CN111027419B (en) * 2019-11-22 2023-10-20 腾讯科技(深圳)有限公司 Method, device, equipment and medium for detecting video irrelevant content
CN113553469B (en) * 2020-04-23 2023-12-22 阿里巴巴集团控股有限公司 Data processing method, device, electronic equipment and computer storage medium
CN113553469A (en) * 2020-04-23 2021-10-26 阿里巴巴集团控股有限公司 Data processing method and device, electronic equipment and computer storage medium
CN115309920A (en) * 2022-10-08 2022-11-08 国家广播电视总局信息中心 Audio and video management method and system based on fusion big data
CN116567351B (en) * 2023-07-06 2023-09-12 腾讯科技(深圳)有限公司 Video processing method, device, equipment and medium
CN116567351A (en) * 2023-07-06 2023-08-08 腾讯科技(深圳)有限公司 Video processing method, device, equipment and medium

Also Published As

Publication number Publication date
GB2460844B (en) 2012-06-06
GB0810618D0 (en) 2008-07-16
GB2460844A (en) 2009-12-16
WO2009150425A3 (en) 2010-09-10

Similar Documents

Publication Publication Date Title
WO2009150425A2 (en) Automatic detection of repeating video sequences
JP5479340B2 (en) Detect and classify matches between time-based media
EP2293250B1 (en) Method and apparatus for representing a group of images
CN108353208B (en) Optimizing media fingerprint retention to improve system resource utilization
GB2485695A (en) Method of identifying the recurrence in a video stream of a stored sequence
JP5711387B2 (en) Method and apparatus for comparing pictures
US9418297B2 (en) Detecting video copies
JP5878238B2 (en) Method and apparatus for comparing pictures
CN101442641B (en) Method and system for monitoring video copy based on content
US7400784B2 (en) Search of similar features representing objects in a large reference database
JP2009542081A (en) Generate fingerprint for video signal
EP3716095B1 (en) A method, apparatus and computer program product for storing images of a scene
GB2485694A (en) Commercial video spots in a stream are replaced depending on matching with a fingerprint database
EP3161722B1 (en) Hash-based media search
Law-To et al. Local behaviours labelling for content based video copy detection
CN101441666B (en) Video copy monitoring method and system based on content
Caspi et al. Sharing video annotations
Mahmoud An enhanced method for evaluating automatic video summaries
EP3716096A1 (en) A method, apparatus and computer program product for identifying new images of a scene
Hong et al. Abrupt shot change detection using multiple features and classification tree
Guimaraes et al. Counting of video clip repetitions using a modified BMH algorithm: Preliminary results
CN112183328A (en) Video identification method, device, equipment and storage medium
Haolin et al. A novel signature for fast video retrieval
Li et al. Segment Oriented Search (SOS) Method for TV Repeats Detection

Legal Events

Date Code Title Description

NENP Non-entry into the national phase
     Ref country code: DE

122  EP: PCT application non-entry in European phase
     Ref document number: 09761974
     Country of ref document: EP
     Kind code of ref document: A2