WO2007148290A2 - Generating fingerprints of information signals - Google Patents

Generating fingerprints of information signals

Info

Publication number
WO2007148290A2
WO2007148290A2 PCT/IB2007/052368
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
frames
fingerprints
fingerprint
data
Prior art date
Application number
PCT/IB2007/052368
Other languages
French (fr)
Other versions
WO2007148290A3 (en)
Inventor
Jaap A. Haitsma
Vikas Bhargava
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Publication of WO2007148290A2 publication Critical patent/WO2007148290A2/en
Publication of WO2007148290A3 publication Critical patent/WO2007148290A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/0021 Image watermarking
    • G06T1/005 Robust watermarking, e.g. average attack or collusion attack resistant
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/0021 Image watermarking
    • G06T1/0085 Time domain based watermarking, e.g. watermarks spread over several images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2201/00 General purpose image data processing
    • G06T2201/005 Image watermarking
    • G06T2201/0065 Extraction of an embedded watermark; Reliable detection

Definitions

  • the present invention relates to the generation of fingerprints indicative of the contents of information signals comprising sequences of data frames.
  • embodiments of the invention are concerned with the generation of digital fingerprints of video signals.
  • a fingerprint of an information signal comprising a sequence of data frames is a piece of information indicative of the content of that signal.
  • the fingerprint may, in certain circumstances, be regarded as a short summary of the information signal.
  • Fingerprints in the present context may also be described as signatures or hashes.
  • a known use for such fingerprints is to identify the contents of unknown information signals, by comparing their fingerprints with fingerprints stored in a database. For example, to identify the content of an unknown video signal, a fingerprint of the signal may be generated and then compared with fingerprints of known video objects (e.g. television programmes, films, adverts etc.). When a match is found, the identity of the content is thus determined.
  • it is known to generate fingerprints of information signals having known content and to store those fingerprints in a database.
  • preferably, the method of generating a fingerprint provides a robust indication of content, in the sense that the fingerprint can be used to correctly identify the content even when the information signal is a processed, degraded, transformed, or otherwise derived version of another information signal having that content.
  • An alternative way of expressing this robustness requirement is that the fingerprints of different versions (i.e. different information signals) of the same content should be sufficiently similar to enable identification of that common content to be made.
  • an original video signal, comprising a sequence of frames of pixel data, may contain a film.
  • a fingerprint of that original video signal may be generated, and stored in a database along with metadata, such as the film's name.
  • copies of the original video signal may then be made.
  • it would be desirable to have a fingerprint generation method which, when used on any one of the copies, would yield a fingerprint sufficiently similar to that of the original for the content of the copy to be identifiable by consulting the database.
  • a number of factors make this object more difficult to achieve.
  • the global brightness and/or the contrast in one or more frames may have changed.
  • the copy may be in a different format, and/or the image in one or more frames may have been scaled, shifted, or rotated.
  • the pixel data in a frame of one version of the film (e.g. a copy) may be completely different from the pixel data in a corresponding frame of another version (e.g. the original) of the same film.
  • a problem is, therefore, to devise a fingerprint generation method that yields fingerprints that are robust (i.e. insensitive) to a certain degree to one or more of the above-mentioned factors.
  • WO02/065782 discloses a method of generating robust hashes (in effect, fingerprints) of information signals, including audio signals and image or video signals.
  • a hash for a video signal comprising a sequence of frames is extracted from 30 consecutive frames, and comprises 30 hash words (i.e. one for each of the consecutive frames).
  • the hash is generated by firstly dividing each frame into blocks. For each block, the mean of the luminance values of the pixels is computed. Then, in order to make the hash independent of the global level and scale of the luminance, the luminance differences between two consecutive blocks are computed. Also, to reduce the correlation of the hash words in the temporal direction, the difference of spatial differential mean luminance values in consecutive frames is also computed.
  • each bit is derived from the mean luminances of a respective two consecutive blocks in a respective frame of the video signal and from the mean luminances of the same two blocks in an immediately preceding frame.
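As an illustration, the bit-derivation rule summarized in the bullets above can be sketched as follows. This is a minimal reading of the description given here, not the exact algorithm of WO02/065782; the block count, the example luminance values and all names are assumptions.

```python
def hash_bits(prev_means, cur_means):
    """Derive one hash word (as a list of bits) for a frame.

    prev_means, cur_means: per-block mean luminances of the previous
    and current frame, in a fixed block order.
    """
    bits = []
    for k in range(len(cur_means) - 1):
        # spatial luminance difference between consecutive blocks,
        # minus the same spatial difference in the preceding frame
        diff = (cur_means[k + 1] - cur_means[k]) - (prev_means[k + 1] - prev_means[k])
        bits.append(1 if diff > 0 else 0)
    return bits

prev = [100.0, 110.0, 105.0, 120.0]  # mean luminances, previous frame
cur = [102.0, 118.0, 101.0, 130.0]   # mean luminances, current frame
print(hash_bits(prev, cur))  # [1, 0, 1]
```

Because each bit depends only on differences of differences, a global shift or scale of the luminance leaves the bits unchanged, which is the robustness property described above.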
  • although the method disclosed in WO02/065782 provides hashes having a certain degree of robustness, a problem remains in that the hashes are sensitive to the frame rates of the signals from which they are derived. If the information signal is a video signal comprising frames having a frame rate R (i.e. there are R frames for a second of content), then the disclosed method results in the generation of hash words having the same rate (in the sense that there are R hash words for a second of content). This is problematic, because different versions of video content may employ different frame rates; for example, in the US the typical frame rate for television signals is 30 frames per second, whereas in Europe it is 25.
  • if the disclosed method were used to generate hashes of a US television version of a particular programme or film and of a European television version of the same content, then in the former case the hash would comprise 30 hash words per second of content, and in the latter case 25 hash words per second. It would therefore be very difficult, if not impossible, to determine that the two versions had the same content from a comparison of their hashes.
  • the problem of generating fingerprints indicative of content and robust to the frame rates of information signals carrying that content is not limited to the field of video signals; it applies to any information signal comprising a sequence of data frames having a frame rate.
  • an object is to provide a method that generates fingerprints which can be used to identify when two signals, having different frame rates, have the same content.
  • a first aspect of the present invention provides a method of generating a fingerprint indicative of a content of an information signal, the information signal comprising a first sequence of data frames having a first frame rate, the method comprising: computing a sequence of sub-fingerprints from the first sequence of frames, the sequence of sub-fingerprints having a predetermined rate independent of the first frame rate and each sub-fingerprint being derived from and dependent upon a data content of at least one frame of the information signal; and concatenating the sub-fingerprints to form the fingerprint.
  • the "rate" of the sequence of sub-fingerprints means the number of sub-fingerprints that are generated in correspondence with a second of content in the information signal.
  • where the information signal is processed in real time, the sub-fingerprint rate may thus be the actual rate at which the sub-fingerprints are generated.
  • in other cases, for example when a stored file is processed faster than real time, the physical rate of generation of the sub-fingerprints may be different from the sub-fingerprint sequence's characteristic rate (e.g. much higher).
  • the method achieves the object of generating fingerprints that are robust, at least to a degree, with respect to the frame rate of the source information signal.
  • the method facilitates recognition of common content in signals having different frame rates because a particular time length of content is represented by the same number of sub-fingerprints in each case, and these can be compared.
  • the step of computing the sequence of sub-fingerprints comprises: computing a second sequence of data frames from the first sequence, the second sequence of frames having said predetermined rate and the data content of each of the second sequence of frames being derived from the data content of at least one of the first sequence of frames; and computing the sequence of sub-fingerprints from the second sequence of data frames.
  • the second sequence of data frames can thus be regarded as an intermediate sequence, and computing the sub-fingerprints from it provides the advantage that known sub-fingerprint calculation/extraction techniques can be used in that part of the method and yet still provide a resultant fingerprint that is frame rate robust, to at least a degree.
  • the step of computing the sequence of sub-fingerprints from the second sequence of data frames may also employ sub-fingerprint extraction techniques and algorithms which provide the resultant fingerprint with robustness with regard to other factors, in addition to the inherent frame rate robustness provided by derivation of the sub-fingerprints from the derived second frame sequence, which already has the predetermined, independent frame rate.
  • the data contents of the second sequence of frames are derived from the data contents of the first sequence of frames by a process comprising interpolation.
  • the frames of the second sequence may contain data relating to some feature or property extracted from the data content of the source frames (such as, in the case of video signals, the mean luminance of blocks of pixels).
  • the data content of the second sequence of frames may be considerably less than that of the information signal, and this facilitates calculation of the sub-fingerprints and any subsequent searching of databases for fingerprint matches.
  • aspects of the invention provide use of the inventive method to generate fingerprints of information signals having frame rates lower than the predetermined rate, and of signals having frame rates higher than the predetermined rate.
  • Another aspect provides signal processing apparatus arranged to carry out the inventive method.
  • aspects provide a computer program enabling the carrying out of the inventive method, and a record carrier on which such a program is recorded. Yet further aspects provide broadcast monitoring methods, filtering methods, automatic video library organization methods, selective recording methods, and tamper detection methods using the inventive fingerprint generation method.
  • Fig. 1 is a schematic representation of a fingerprint generation method embodying the invention.
  • Fig. 2 is a schematic representation of part of another fingerprint generation method embodying the invention.
  • Fig. 3 is a schematic representation of part of yet another fingerprint generation method embodying the invention.
  • Fig. 4 is a schematic illustration of an interpolation technique used in certain embodiments of the invention.
  • Fig. 5 is a schematic representation of part of yet another fingerprint generation method embodying the invention, generating sub-fingerprints indicative of the content of a video signal;
  • Fig. 6 is a schematic representation of a video fingerprinting system embodying the invention.
  • Fig. 7 is a schematic representation of the division of a frame of an information signal into blocks, as used in certain embodiments of the invention.
  • Fig. 8 is a schematic representation of part of a sequence of extracted feature frames generated in a method embodying the invention;
  • Fig. 9 is a schematic representation of the division of a frame of a video signal into blocks, as used in certain embodiments of the invention.
  • Fig. 10 is a schematic representation of the division of a frame of a video signal into blocks, as used in certain embodiments of the invention.
  • Fig. 11 is a schematic representation of the division of a frame of a video signal into blocks, as used in certain embodiments of the invention.
  • An information signal 2 comprises a first sequence of data frames 20 having a first frame rate. For ease of representation, only four of the data frames 20 are shown in the figure. However, it will be appreciated that in practice the number of data frames in the information signal whose fingerprint is being generated may be very much larger.
  • the sequence of first data frames 20 is shown at positions along a time line.
  • the frame rate of the sequence of frames 20 is constant.
  • the data frames can be regarded as samples of a content at regular time intervals.
  • the time interval between adjacent frames 20 in the sequence is constant and is denoted by t2, which in this illustrative example is one third of a second.
  • information signal 2 is in the form of a file stored on some appropriate medium.
  • the information signal 2 may be a broadcast signal, for example, such that the time interval t2 is the real time interval between the broadcast or transmission of successive frames (and hence also the real time interval between receipt of successive frames at some destination).
  • a sequence 3 of sub-fingerprints 30 is computed from the information signal 2.
  • This computing process is denoted by block 23 in Fig. 1.
  • the information signal 2 is processed to produce the sequence 3 of sub- fingerprints.
  • the sequence of sub-fingerprints has a predetermined rate, independent of and different from the first frame rate.
  • in this example, the frame rate of the source information signal 2 is lower than the predetermined rate of the sequence of sub-fingerprints, hence there are more sub-fingerprints for a second of content of the source information signal than there are frames of the information signal.
  • the nominal positions of the sub- fingerprints 30 are shown with respect to a time line.
  • the time interval between successive sub-fingerprints is denoted by t3, which in this example is 0.2 seconds.
  • the frame rate of the source signal is 3 frames per second.
  • the predetermined rate of the sub-fingerprints is 5 per second.
  • the computing step 23 may employ a variety of techniques to produce the sequence of sub-fingerprints from the source information signal. However, a common feature of these techniques is that each of the sub-fingerprints 30 is derived from, and dependent upon, a data content of at least one frame 20 of the source information signal.
  • the sequence 3 of sub-fingerprints produced by the processing step 23 may be in the form of a file stored on a suitable medium, or alternatively may be a real-time succession of sub-fingerprints 30 output from a suitably arranged processor.
  • the method embodying the invention includes a further processing step 31 which operates on the sequence 3 of sub-fingerprints 30 and concatenates them to form a fingerprint 1.
  • the resultant fingerprint 1 is indicative of a content of the source information signal 2.
  • the step of computing the sequence of sub-fingerprints from the source sequence of frames comprises an intermediate step 24 of computing a second sequence 4 of data frames 40 from the first sequence, that second sequence 4 of frames having the predetermined rate, and the data content of each of the second sequence 4 of frames 40 being derived from the data content of at least one of the first sequence of frames 20. Then, the sequence 3 of sub-fingerprints 30 is computed from the second sequence 4 of data frames 40, in processing step 43.
  • an advantage of producing this second sequence of frames is that processing step 43 can use previously known sub-fingerprint techniques or algorithms (which in themselves did not provide frame rate robustness), and the resultant sub-fingerprints 30 can be combined to form a fingerprint 1 which does incorporate frame rate robustness.
  • processing step 24 comprises a frame-rate conversion of the full contents of the frames 20 of information signal 2.
  • each frame 40 of the second sequence of frames is substantially the same size as a frame 20 of the source information signal 2.
  • the data frames 40 of the second sequence are smaller than those of the first sequence, and this provides the advantage that subsequent processing to form the sub-fingerprints is facilitated (this advantage manifests itself in faster processing speeds and can yield smaller resultant fingerprints, which in turn facilitate subsequent storage and handling, including searching of databases for matches with other fingerprints).
  • One way of achieving this reduction in frame size in producing the second sequence of frames is for the frames 40 to contain data related to a feature or features of the contents of the source frames 20 from which they are derived, rather than reproducing the whole source data contents.
  • such features may relate to some average property of the data or groups of data contained within the source frames 20.
  • the processing required to extract average information is relatively simple, quick and enables the second sequence of frames to contain much less data than the source signal from which it is derived.
  • the step of computing the second sequence 4 of frames comprises deriving the data content of a frame 40 of the second sequence from the data content of a plurality of frames 20 of the first sequence.
  • the step of computing the sequence of sub-fingerprints from the second sequence 4 of data frames comprises deriving a sub-fingerprint 30 from the data contents of a plurality of frames 40 of the second sequence. This enables the sub-fingerprints to be dependent upon temporal variations in the data contents of the frames 40 of the second sequence 4.
  • the data contents of the second sequence 4 of frames 40 are derived from the data contents of the first sequence 2 of frames 20 by a process comprising interpolation.
  • this enables frames 40 to be constructed at positions on the time line that do not correspond exactly to positions of source data frames 20 (i.e. frames 40 can be constructed at positions between those of source frames).
  • in doing so, the contents of the neighbouring source frames on the time line can be taken into account.
  • the resultant fingerprint produced by a method embodying the invention is thereby made a more reliable and frame-rate robust indication of the content of the information signal 2.
  • the data contents of the second sequence of frames 40 are derived from the contents of the first sequence 2 by a process comprising linear interpolation.
  • the step of computing the second sequence 4 of frames 40 comprises the additional step of computing a sequence 5 of extracted feature data frames from the first sequence 2, the sequence 5 of extracted feature frames 50 having the first frame rate (i.e. the same frame rate as the source information signal), and each extracted frame 50 containing feature data indicative of at least one feature of a respective one of the first sequence 2 of frames 20.
  • the method then also comprises a step of computing the second sequence 4 of frames from the sequence 5 of extracted feature frames 50, the data contents of the second sequence 4 of frames 40 being derived from the feature data contained in the sequence 5 of extracted feature frames 50.
  • the step of computing the sequence 5 of extracted feature frames 50 is denoted by arrow 25
  • the step of computing the second sequence 4 of data frames from the sequence 5 of extracted feature frames 50 is denoted by arrow 54.
  • the data content of each of the extracted feature frames 50 is denoted by the letter F
  • the data content of each of the second sequence of frames is denoted by F'.
  • the feature data F in frame 50 of sequence 5 is indicative of a property of at least a portion of a respective frame 20 of the first sequence 2.
  • the extracted feature data F may, for example, relate to one or more of the following: the mean luminance of one or more groups of pixels; the mean chrominance of one or more groups of pixels; centroid positional information derived from pixel luminance; centroid positional information derived from pixel colour information; or any other information relating to properties of the source frame 20 or parts of it.
  • the data contents F' of the second sequence 4 of frames 40 are derived from the feature data F contained in the sequence 5 of extracted feature frames by a process comprising interpolation. This may comprise linear interpolation, or some other form of interpolation.
  • Fig. 4 illustrates an interpolation technique which may be used in embodiments of the invention.
  • the figure shows just two extracted feature frames 50 from a sequence 5. The first of these frames is at nominal position 0 on the time line and the data it contains indicates that a particular extracted feature has a value V 1 .
  • the second of these frames 50 is at nominal position 5 on the time line and contains data indicating that the extracted feature has a value V 2 .
  • the predetermined rate of the second sequence 4 of frames is higher than the source frame rate in this example, and the three frames 40 of the second sequence 4 correspond to positions 0, 2, and 4 on the time line.
  • Each of these frames 40 contains derived feature data indicative of a value V of the extracted feature.
  • These values V are derived from the values in the extracted feature data frames 50 by linear interpolation as follows: for a frame 40 at position p on the time line between positions 0 and 5, V = V1 + (p/5)(V2 - V1), so that the frames 40 at positions 0, 2 and 4 contain the values V1, V1 + 0.4(V2 - V1) and V1 + 0.8(V2 - V1) respectively.
  • Linear interpolation in this way is advantageous as it provides a simple and quick method of processing extracted feature data to construct derived feature data at positions on the time line that were not occupied by frames of the source information signal. It will be appreciated, however, that other interpolation techniques may be used. For example, where a frame of the second sequence corresponds to a position on the time line that is between two of the source frames, instead of its contents being derived just from those two immediately adjacent source frames, those contents may be derived from a larger number of source frames. This may be desirable in embodiments where the source frame rate is more than twice the predetermined rate of the sequence of sub-fingerprints.
  • the number of source frames from which each of the second sequence of frames is derived may be selected such that the contents of all source frames have an influence on the eventual fingerprint (if only "nearest neighbours" on the time line were taken into account to determine the contents of each of the second sequence of frames then, in some cases, this could result in one or more of the source frames being "ignored" in the fingerprint generation method).
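The linear interpolation described above, using the Fig. 4 example of source values V1 and V2 at time-line positions 0 and 5 and derived frames 40 at positions 0, 2 and 4, can be sketched as follows. Function and variable names are illustrative assumptions.

```python
def interpolate_feature(positions, values, target_positions):
    """Linearly interpolate (position, value) samples at new positions."""
    out = []
    for t in target_positions:
        # find the pair of source samples surrounding position t
        for i in range(len(positions) - 1):
            t0, t1 = positions[i], positions[i + 1]
            if t0 <= t <= t1:
                frac = (t - t0) / (t1 - t0)
                out.append(values[i] + frac * (values[i + 1] - values[i]))
                break
    return out

V1, V2 = 100.0, 150.0
# derived feature values for frames 40 at positions 0, 2 and 4
print(interpolate_feature([0, 5], [V1, V2], [0, 2, 4]))  # [100.0, 120.0, 140.0]
```

The same routine applies per block feature, so a whole frame 40 is built by interpolating each of its feature values independently.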
  • Fig. 5 shows part of a fingerprint generation method embodying the invention for generating digital fingerprints of an information signal 2 in the form of a video signal comprising a sequence of video frames 20, each containing pixel data.
  • the method comprises a processing step 26 of dividing each of the source frames 20 into a plurality of blocks 21.
  • each frame 20 is shown divided into just four blocks, which are labelled b1, b2 ... b4. It will be appreciated that this number of blocks is just an example, and in practice a different number of blocks may be used.
  • the method further comprises the steps of calculating a feature of each block 21 and then using the calculated feature data to produce the sequence 5 of extracted feature frames 50 such that each extracted feature frame 50 contains the calculated block feature data for each of the plurality of blocks of the respective one of the first sequence of frames.
  • the feature calculated in processing step 27 is the mean luminance L of the group of pixels in each block 21.
  • each extracted feature frame 50 contains four mean luminance values, L1, L2 ... L4.
  • the second sequence 4 of data frames 40 is constructed from the sequence 5 of extracted feature frames.
  • Each of the second sequence of frames 40 contains four mean luminance values, one for each of the four blocks into which the source frames were divided.
  • because the second sequence 4 of data frames 40 is at the predetermined rate, which is in general different from the source frame rate, some of the second sequence frames 40 correspond to positions on the time line which are between positions of the extracted feature data frames 50.
  • the mean luminance values contained in the second sequence data frames 40 are derived from the contents of the extracted feature frames 50 by a process comprising interpolation.
  • the first illustrated frame of the second sequence 4 corresponds exactly to the position on the time line of the first of the extracted feature frames 50, and hence the mean luminance value it contains can simply be copied from that extracted feature frame 50.
  • the second in the sequence of data frames 40 occurs at a position in the time line that is between the first and second extracted feature frames 50.
  • each of the mean luminance values in this second frame 40 has been derived by a process involving a calculation using two mean luminance values from the "surrounding" extracted feature frames 50 on the time line. Then, in processing step 43, the sequence of sub-fingerprints 30 is calculated (i.e. derived) from the block mean luminance values in the sequence of data frames 40.
  • each sub-fingerprint 30 is derived from the contents of a respective one of the second sequence 4 of frames 40 and from the immediately preceding frame 40 in that second sequence 4.
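The pipeline described above (per-block mean luminances extracted at the source rate, linear resampling to the predetermined rate, and sub-fingerprint bits derived from each resampled frame 40 and its predecessor) can be sketched end to end as follows. The 2x2 block grid, the example rates and all names are illustrative assumptions, not the patent's exact parameters.

```python
def block_means(frame, blocks=2):
    """Mean luminance of each block of a square pixel frame (nested lists)."""
    step = len(frame) // blocks
    means = []
    for by in range(blocks):
        for bx in range(blocks):
            pix = [frame[y][x]
                   for y in range(by * step, (by + 1) * step)
                   for x in range(bx * step, (bx + 1) * step)]
            means.append(sum(pix) / len(pix))
    return means

def resample(frames_means, src_rate, dst_rate):
    """Linearly interpolate per-block feature values onto the predetermined rate."""
    duration = (len(frames_means) - 1) / src_rate
    out = []
    for i in range(int(duration * dst_rate) + 1):
        t = i / dst_rate * src_rate               # position in source-frame units
        j = min(int(t), len(frames_means) - 2)    # left neighbour on the time line
        frac = t - j
        out.append([a + frac * (b - a)
                    for a, b in zip(frames_means[j], frames_means[j + 1])])
    return out

def sub_fingerprints(resampled):
    """One bit per block pair, from each frame and its immediate predecessor."""
    sfps = []
    for prev, cur in zip(resampled, resampled[1:]):
        sfps.append([1 if (cur[k + 1] - cur[k]) - (prev[k + 1] - prev[k]) > 0 else 0
                     for k in range(len(cur) - 1)])
    return sfps

# four tiny 4x4 "frames" at 3 fps, resampled to 5 feature frames per second
frames = [[[y * 4 + x + f for x in range(4)] for y in range(4)] for f in range(4)]
means = [block_means(fr) for fr in frames]
resampled = resample(means, src_rate=3, dst_rate=5)
print(len(resampled), len(sub_fingerprints(resampled)))  # 6 resampled frames -> 5 sub-fingerprints
```

Because the resampled sequence always has the predetermined rate, two versions of the same content at different source frame rates yield the same number of sub-fingerprints per second of content, which is the frame-rate robustness the method aims at.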
  • a video fingerprint in certain embodiments, is a code (e.g. a digital piece of information) that identifies the content of a segment of video.
  • a video fingerprint for a particular content should not only be unique (i.e. different from the fingerprints of all other video segments having different contents) but also be robust against distortions and transformations.
  • a video fingerprint can also be seen as a short summary of a video object.
  • a fingerprint function F should map a video object X, consisting of a large and variable number of bits, to a fingerprint consisting of only a smaller and fixed number of bits, in order to facilitate database storage and effective searching (for matches with other fingerprints).
  • the requirements of a video fingerprint for it to be a good content classifier can also be summarized as follows: ideally, the fingerprints of a video clip are unique, implying that the probability of fingerprints of different video clips being similar is low; and fingerprints for different versions of same video clip should be similar, implying that the probability of similarity of the fingerprints of an original video and its processed version is high.
  • a sub-fingerprint is a piece of data indicative of the content of part of a sequence of frames of an information signal.
  • a sub-fingerprint is, in certain embodiments, a binary word, and in particular embodiments is a 32 bit sequence.
  • a sub-fingerprint may be derived from and dependent upon the contents of more than one source frame;
  • a fingerprint of a video segment represents an orderly collection of all of its sub-fingerprints;
  • a fingerprint block can be regarded as a sub-group of the "fingerprint" class, and in certain embodiments is a sequence of 256 sub-fingerprints representing a contiguous sequence of video frames;
  • metadata is "soft" information about a video clip, consisting of parameters like 'name of the video', 'artist' etc.; an end-application is typically interested in retrieving this metadata;
  • Hamming distance: in comparing two bit patterns, the Hamming distance is the count of bits different in the two patterns. More generally, if two ordered lists of items are compared, the Hamming distance is the number of items that do not identically agree. This distance is applicable to encoded information, and is a particularly simple metric of comparison, often more useful than the city-block distance (the sum of absolute values of distances along the coordinate axes) or Euclidean distance (the square root of the sum of squares of the distances along the coordinate axes).
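For bit patterns stored as integers, the Hamming distance defined above reduces to an XOR followed by a bit count; a minimal sketch, assuming 32-bit sub-fingerprint words represented as Python integers:

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two equal-width binary words."""
    return bin(a ^ b).count("1")

print(hamming(0b1011, 0b0010))          # 2 (bits 0 and 3 differ)
print(hamming(0xFFFFFFFF, 0x00000000))  # 32 (all bits of a 32-bit word differ)
```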
  • Inter-Class BER refers to the bit error rate between two fingerprint blocks corresponding to two different video sequences.
  • Intra-Class BER refers to the bit error rate between two fingerprint blocks belonging to the same video sequence. It may be noted that the two compared video sequences may be different in the sense that one might have undergone geometrical or other qualitative transformations; however, they are perceptually similar to the human eye.
  • a video fingerprinting system embodying the invention is shown in Fig. 6.
  • This video fingerprinting system provides two functionalities: fingerprint generation and fingerprint identification. Fingerprint generation is performed both during the pre-processing stage and during the identification stage.
  • the fingerprints 1 of the video files 62 (movies, television programmes and commercials etc.) are generated and stored in a database 65.
  • Fig. 6 shows this stage in box 61.
  • during the identification stage, fingerprints 1 are again generated from the video sequences to be identified (input video queries 68) and are sent to the system as a query.
  • the fingerprint identification stage consists primarily of a database search strategy. Owing to the huge number of fingerprints in the database, it is practically impossible to use a brute-force approach to search for fingerprints. A different approach, which searches fingerprints efficiently in real time, has been adopted in certain embodiments of the invention.
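The text does not disclose the adopted search strategy. Purely as an illustration of one way to avoid brute force, known from the related fingerprinting literature, the following sketch builds an inverted index keyed on exact sub-fingerprint words, so a query block only needs to verify candidates that share at least one word at a consistent alignment. All names and the data layout are assumptions, and this is not necessarily the embodiment's method.

```python
def build_index(database):
    """database: {title: [sub-fingerprint words]} -> word -> [(title, position)]."""
    index = {}
    for title, words in database.items():
        for pos, w in enumerate(words):
            index.setdefault(w, []).append((title, pos))
    return index

def lookup(index, query_words):
    """Candidate (title, alignment offset) pairs sharing at least one exact word."""
    candidates = set()
    for qpos, w in enumerate(query_words):
        for title, pos in index.get(w, []):
            candidates.add((title, pos - qpos))
    return candidates

db = {"movie A": [10, 20, 30, 40], "movie B": [50, 60, 20, 70]}
idx = build_index(db)
print(sorted(lookup(idx, [20, 30])))  # [('movie A', 1), ('movie B', 2)]
```

Each candidate alignment would then be verified by comparing full fingerprint blocks at that offset, for example against a bit-error-rate threshold.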
  • the input in this stage is a fingerprint block query 68 and output is a metadata 625 consisting of identification result(s).
  • encoded data 623 from video files 62 is normalised (which, for example, may comprise scaling the video resolution to a fixed resolution) and decoded by a decoder and normalizer 63.
  • This stage 63 then provides normalised decoded video frames to a fingerprint extraction stage 64, which processes the incoming frames with a fingerprint extraction algorithm to generate a fingerprint 1 of the source video file.
  • This fingerprint 1 is stored in the database 65 along with corresponding metadata 625 for the video file 62.
  • An input video query 68 comprises encoded data 683 which is also processed by the decoder/normaliser 63, and the fingerprint extraction stage 64 generates a fingerprint 1 corresponding to the query and provides that fingerprint to a fingerprint search module 66. That module searches for a matching fingerprint in the database 65, and when a match is found for the query, the corresponding metadata 625 is provided as an output 67.
  • FAR false acceptance rate
  • Fingerprint size: how much storage is needed for a fingerprint? To enable fast searching, fingerprints are usually stored in RAM. Therefore the fingerprint size, usually expressed in bits per second or bits per movie, determines to a large degree the memory resources that are needed for a fingerprint database server. Granularity: how many seconds of video are needed to identify a video clip?
  • Granularity is a parameter that can depend on the application. In some applications the whole movie can be used for identification, in others one prefers to identify a movie with only a short excerpt of video.
  • Search speed and scalability: how long does it take to find a fingerprint in a fingerprint database? What if the database contains thousands of movies? For the commercial deployment of video fingerprint systems, search speed and scalability are key parameters. Search speed should be in the order of milliseconds for a database containing over 10,000 movies using only limited computing resources (e.g. a few high-end PCs).
  • video fingerprints can change due to different transformations and processing applied on a video sequence.
  • Such transformations include smoothening and compression, for example. These transformations result in different fingerprint blocks for an original video sequence and the transformed sequence and hence a bit error rate (BER) is incurred when the fingerprints of the original and transformed versions are compared.
  • BER bit error rate
  • compression to a low bit rate is a highly severe process compared to mere smoothening (noise reduction) of the frames in the video sequence. The BER in the former case is therefore much higher than in the latter.
  • the correlation between the two fingerprint blocks also varies depending upon the severity of transformation. The less severe the transformation, the higher is the correlation. Searching for fingerprints in a database is not an easy task. A search technique which may be used in embodiments of the invention is described in WO 02/065782. A brief description of the problem is as follows.
  • the video fingerprint system generates sub-fingerprints at 55Hz.
  • the search task has to find the position in the 396 million sub-fingerprints. With brute-force searching, this takes 396 million fingerprint block comparisons. Using a modern PC, a rate of approximately 200,000 fingerprint block comparisons per second can be achieved. Therefore the total search time for our example will be in the order of 30 minutes.
  • the brute-force approach can be improved by using an indexed list. For example, consider the following sequence: "AMSTERDAMBERLINNEWYORKPARISLONDON"
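An indexed lookup of this kind can be sketched in a few lines (our own illustration, not the search method of WO 02/065782 itself; the substring length and query are arbitrary choices). Instead of scanning every position in the sequence, a lookup table maps short substrings to candidate positions, and only those candidates are checked:

```python
from collections import defaultdict

def build_index(text, k=3):
    """Map every k-character substring to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index

text = "AMSTERDAMBERLINNEWYORKPARISLONDON"
index = build_index(text)

# Instead of scanning every position (brute force), jump straight to the
# candidate positions sharing the first three letters of the query:
candidates = index["PAR"]
matches = [i for i in candidates if text[i:i + 5] == "PARIS"]
```

The fingerprint search analogue uses (sub-)fingerprint values rather than letters as the index keys, so that only a handful of database positions need a full block comparison.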
  • each bit in a sub-fingerprint is ranked according to its strength.
  • the weak bits of the sub-fingerprints are toggled, in increasing order of their strength.
  • the weakest bit is toggled first and a match is searched for with the resulting new fingerprint; if a match is not found, then the next weakest bit is toggled, and so on.
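The toggling order described above can be sketched as a small generator (our own illustration; the names and the one-bit-at-a-time interpretation are assumptions drawn from the surrounding text):

```python
def toggle_candidates(bits, strengths, max_toggles):
    """Yield candidate sub-fingerprints for the search: first the original
    pattern, then versions with one weak bit toggled, weakest first.

    bits      : list of 0/1 values of the sub-fingerprint
    strengths : per-bit reliability estimate (lower value = weaker bit)
    """
    order = sorted(range(len(bits)), key=lambda i: strengths[i])
    yield list(bits)                      # try an exact match first
    for j in range(max_toggles):
        cand = list(bits)
        cand[order[j]] ^= 1               # toggle the j-th weakest bit
        yield cand
```

Each yielded candidate would be looked up in the database in turn, until a match within the BER threshold is found.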
  • the one with the least BER below the threshold is selected.
  • a database hit represents the situation when the match (which may be an exact match, or a close match) is found in the database.
  • Video fingerprinting applications of embodiments of the invention will now be discussed in more detail.
  • Other technologies, such as watermarking, are available for the identification of video sequences within third-party transmissions. This process, however, relies on a video sequence being modified and the watermark being inserted into the video stream; the watermark is then retrieved from the stream at a later time and compared with the database entry. This requires the watermark to travel with the video material.
  • a video fingerprint is stored centrally and it does not need to travel with the material. Therefore, video fingerprinting can still identify material after it has been transmitted on the web.
  • a number of applications of video fingerprinting have been considered. They are listed as follows:
  • Filtering Technology for File Sharing The movie industry throughout the world suffers great losses due to video file sharing over peer-to-peer networks. Generally, when a movie is released, "handy cam" prints of the video are already doing the rounds on the so-called sharing sites. Although the file sharing protocols are quite different from each other, most of them share files using un-encrypted methods. Filtering refers to active intervention in this kind of content distribution. Video fingerprinting is considered a good candidate for such a filtering mechanism. Moreover, it is better suited than other techniques, such as watermarking, that can be used for content identification, because a watermark has to travel with the video, which cannot be guaranteed. Thus, one aspect of the invention provides a filtering method and a filtering system utilising a fingerprint generation method in accordance with the first aspect of the invention.
  • Broadcast Monitoring refers to tracking of radio, television or web broadcasts for, among others, the purposes of royalty collection, program verification and people metering. This application is passive in the sense that it has no direct influence on what is being broadcast: the main purpose of the application is to observe and report.
  • a broadcast monitoring system based on fingerprinting consists of several monitoring sites and a central site where the fingerprint server is located. At the monitoring sites fingerprints are extracted from all the (local) broadcast channels. The central site collects the fingerprints from the monitoring sites. Subsequently the fingerprint server, containing a huge fingerprint database, produces the play lists of the respective broadcast channel.
  • another aspect of the invention provides a broadcast monitoring method and a broadcast monitoring system utilising a fingerprint generation method in accordance with the first aspect of the invention.
  • Automated indexing of multimedia library Many computer users have a video library containing several hundreds, sometimes even thousands, of video files. When the files are obtained from different sources, such as ripping from a DVD, scanning of images and downloading from file sharing services, these libraries are often not well organized. By identifying these files with fingerprinting, the files can be automatically labeled with the correct metadata, allowing easy organization based on, for example, artist, music album or genre.
  • another aspect of the invention provides an automated indexing method and system utilising a fingerprint generation method in accordance with the first aspect of the invention.
  • Television commercial blocking can be accomplished in a digital broadcast scenario.
  • MHP Multimedia Home Platform
  • DVB Digital Video Broadcasting
  • the television is connected to the outside world.
  • With a fingerprinting server, and a television equipped with fingerprint generation capability, television commercials can be blocked from the viewer.
  • This application can also be used as an enabling tool for selective recording of programs with the added advantage of commercials filtering.
  • other aspects of the invention provide commercial blocking and selective recording methods and systems utilising fingerprint generation methods in accordance with the first aspect of the invention.
  • the fingerprints of an original movie and its transformed (or processed) version are generally different from each other.
  • the BER function can be used to ascertain the difference between the two. This property of the fingerprints can be used to detect the malfunctioning of a transmission line which is supposed to transmit a correct video sequence. It can also be used to detect automatically, without manual intervention, whether a movie or video material has been tampered with.
  • other aspects of the invention provide tampering and error detection methods and systems utilising fingerprint generation methods in accordance with the first aspect of the invention.
  • Video fingerprint tests have been used to evaluate fingerprint extraction algorithms used in embodiments of the invention. These tests have included reliability tests and robustness tests. Reliability of the fingerprints generated by an algorithm is closely related to the false acceptance rate. In reliability tests, the BER distribution of bits resulting from comparison of two fingerprint blocks has been studied, to provide a theoretical false acceptance rate. The inter-class BER distribution serves as a robust indicator of the performance of the algorithm, for example. In robustness tests, used to evaluate fingerprint extraction algorithms used in embodiments of the invention, a small database consisting of 4 video clips and several of their transformed versions was created. A video can undergo several transformations.
  • the luminance feature is more important compared to color components.
  • the YUV color space, with subsampled chroma components, is the universally accepted primary color representation for video encoders. Hence luminance values are used to extract features.
  • the proposed algorithm is based on a simple statistic, the mean luminance, computed over relatively large regions.
  • the sub-fingerprints are extracted as follows. 1. Each video frame is divided into a grid of R rows and C columns, resulting in RxC blocks. For each of these blocks, the mean of the luminance values of the pixels is computed.
  • FIG. 7 illustrates a video data frame 20 divided into blocks 21 in this way.
  • the mean of the luminance values is calculated for each of the blocks resulting in RxC mean values.
  • Each of the numbers represents a corresponding region in the input video frame. Thus, the means of the luminance values in each of these regions have been calculated.
  • the computed mean luminance values in step 1 can be visualized as RxC "pixels" in a frame (an extracted feature frame). In other words, these represent the energy of different portions of the frame.
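The block-mean step can be sketched as follows (our own illustration, not the patented implementation; the frame is represented here as a plain 2-D list of luminance values):

```python
def block_means(frame, rows, cols):
    """Divide a luminance frame into rows x cols blocks and return the mean
    luminance of each block, i.e. a low-resolution extracted feature frame.

    frame: 2-D list of luminance values (h x w). Remainder pixels at the
    right/bottom edge, if h or w is not divisible, are simply ignored.
    """
    h, w = len(frame), len(frame[0])
    bh, bw = h // rows, w // cols
    means = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0
            for y in range(r * bh, (r + 1) * bh):
                for x in range(c * bw, (c + 1) * bw):
                    total += frame[y][x]
            means[r][c] = total / (bh * bw)
    return means
```

The resulting `rows x cols` grid of means is the "extracted feature frame" referred to above.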
  • a spatial filter with kernel [-1 1] (i.e. taking differences between neighboring blocks in the same row) and
  • a temporal filter with kernel [-α 1] are applied on this sequence of low resolution gray-scale images.
  • the sign of the resulting value SftFPn determines the value of the bit in the sub-fingerprint. More specifically, the bit is 1 if SftFPn ≥ 0, and 0 if SftFPn < 0.
  • alpha can be considered to be a weighting factor, representing the degree to which values in the "next" frame are taken into account. Different embodiments may use different values for alpha. In certain embodiments, alpha equals 1, for example.
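One plausible reading of this spatio-temporal filtering step can be sketched as follows (our own illustration; the exact sign convention and the placement of alpha are assumptions consistent with the kernels given above, with alpha = 1 as in certain embodiments):

```python
ALPHA = 1.0  # weighting of the "next" frame (assumption: alpha = 1)

def sub_fingerprint(cur, nxt, alpha=ALPHA):
    """Derive sub-fingerprint bits from two consecutive R x C mean-luminance
    feature frames.  Spatial kernel [-1 1] across columns, temporal kernel
    [-alpha 1] across frames; the sign of the filtered value gives the bit."""
    rows, cols = len(cur), len(cur[0])
    bits = []
    for r in range(rows):
        for c in range(cols - 1):
            d_cur = cur[r][c + 1] - cur[r][c]   # spatial difference, current frame
            d_nxt = nxt[r][c + 1] - nxt[r][c]   # spatial difference, next frame
            soft = d_cur - alpha * d_nxt        # soft sub-fingerprint value
            bits.append(1 if soft >= 0 else 0)  # bit = 1 iff the value is >= 0
    return bits
```

With R rows and C columns this yields R x (C - 1) bits per frame pair, one sub-fingerprint per frame.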
  • the frame rate is the number of frames or images that are projected or displayed per second. Frame rates are used in synchronizing audio and pictures, whether film, television, or video. Frame rates of 24, 25 and 30 frames per second are common, each having uses in different portions of the industry. In the U.S., the professional frame rate for motion pictures is 24 frames per second and, for television, 30 frames per second. However, these frame rates are variable because different standards are followed in the video broadcast throughout the world.
  • the basic differential block luminance fingerprint extraction algorithm described above works on a frame by frame basis.
  • fingerprints are generated from a video query for identification purposes.
  • video sources in these two stages have frame rates ν and μ respectively, then the fingerprint blocks (consisting of 256 sub-fingerprints) in these two cases would represent (256/ν) seconds and (256/μ) seconds of video respectively. These time frames are different and hence the sub-fingerprints generated during these durations come from different frames. Hence, they would not match.
  • a modification of the basic differential block mean luminance algorithm, to provide a degree of frame rate robustness, is described below.
  • the basic algorithm, unmodified, can be used in certain embodiments of the invention to produce frame-rate-robust fingerprints.
  • a frame-rate converted video signal is generated from the source signal, and that frame-rate-converted signal (having the predetermined, independent frame rate) is fed into the basic algorithm. Acting on this rate-converted "source", the basic algorithm is then able to generate sub-fingerprints which also have the predetermined, independent rate.
  • the further examples mentioned below use this frequency of fingerprint extraction (but it will be appreciated that the frequency is itself just one example, and further embodiments may utilise different predetermined frequencies).
  • F(r, c, 2) and F(r, c, 3) represent the mean frames at times 2/25 and 3/25 respectively.
  • the mean frames F(r, c, 4), F(r, c, 5), F(r, c, 6) and F(r, c, 7) represent the linearly interpolated mean frames at times 4/55, 5/55, 6/55 and 7/55 respectively.
  • the contents of these linearly interpolated mean frames have been constructed, by calculation from the contents of the mean frames that were obtained directly from the source frame sequence.
  • the modified algorithm comprises the generation of a sequence of extracted feature frames (containing mean luminance values) having the predetermined frame rate (55Hz in this example), the contents of those frames being derived from the contents of the source frames (via the sequence of directly extracted feature frames) by a process comprising interpolation (where necessary).
  • linear interpolation is used in the above example, other interpolation techniques may be used in alternative embodiments.
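A minimal sketch of such a linear interpolation step (our own illustration; the 25 Hz source rate and 55 Hz target rate follow the example above, and a scalar feature stands in for a full R x C mean frame, which would be interpolated element-wise):

```python
SOURCE_RATE = 25.0   # frame rate of the source video (Hz)
TARGET_RATE = 55.0   # predetermined sub-fingerprint extraction rate (Hz)

def interpolated_feature(source_feats, n):
    """Linearly interpolate the feature value at target-rate sample n
    (time n / TARGET_RATE) from the feature values extracted directly
    from source frames at SOURCE_RATE."""
    t = n * SOURCE_RATE / TARGET_RATE      # position on the source frame axis
    i = int(t)
    if i + 1 >= len(source_feats):         # clamp at the end of the sequence
        return float(source_feats[-1])
    frac = t - i
    return (1 - frac) * source_feats[i] + frac * source_feats[i + 1]
```

Feeding the interpolated sequence into the basic algorithm yields sub-fingerprints at the predetermined rate regardless of the source frame rate.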
  • a first further modification will be described as a Centrally-Oriented Differential Block Luminance Algorithm.
  • This algorithm differs from the previous one in that it takes into consideration more representative features of the frame. In order to do so, it extracts the fingerprints from central portions of the video frame. Development of this modified algorithm was based on an appreciation of the following: a) It was noticed from use of the previous algorithm that black portions of the frame contributed very little information to the fingerprints. However, many of the video formats are 'letterboxed'. Letterboxing is the practice of copying widescreen film to video formats while preserving the original aspect ratio.
  • the resulting master must include masked-off areas above and below the picture area (these are often referred to as "black bars", resembling a letterbox slot).
  • the reliability of the fingerprints can be increased by not taking the fingerprints of these areas.
  • the movies can also contain logos at the top which remain constant for the entire length of the movie. These logos are also present in different movies under the same production banner.
  • the centrally oriented differential block mean luminance algorithm is very similar to the differential block luminance algorithm.
  • the centrally oriented algorithm differs in the step where it divides a source frame into blocks. Instead of dividing the entire frame into blocks, these blocks or regions 21 are defined as shown in Fig. 9. Thus, only a central portion of the frame 20 has been divided into blocks 21; the portions in the outskirts of the frame have not been used. This helps in improving reliability. Having divided the frames into blocks in this way, the remainder of the algorithm calculates a sequence of sub-fingerprints in exactly the same way as the previously described algorithm. Thus, the means of the luminance values in each of the blocks/regions are calculated, resulting in 36 mean values for each frame (36 is just an example, however - a different number of blocks may again be used). Similarly, the mean values are collected from the next frame. Frame rate robustness may be incorporated at this stage by constructing/producing interpolated mean-frames to form the sequence at the desired, predetermined frame rate (and, indeed, the subsequent results for CODBLA are based on the algorithm including the frame rate robustness feature).
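The central-portion selection can be sketched as a simple crop applied before the block-mean step (our own illustration; the margin value is an assumption, since the text does not fix the exact extent of the excluded outskirts):

```python
def central_region(frame, margin_frac=0.2):
    """Return only the central portion of a frame, discarding a border of
    margin_frac of the height/width on every side.  margin_frac = 0.2 is an
    assumed value; the patent does not specify the exact margins."""
    h, w = len(frame), len(frame[0])
    mh, mw = int(h * margin_frac), int(w * margin_frac)
    return [row[mw:w - mw] for row in frame[mh:h - mh]]
```

The cropped frame is then divided into blocks and processed exactly as in the basic algorithm, so letterbox bars and static logos at the edges never contribute bits.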
  • CODBLA centrally oriented block luminance algorithm
  • DBLA differential block luminance algorithm (with frame rate robustness)
  • the performance of the CODBLA was found to be better, in terms of the robustness of the resultant fingerprints, in certain cases, for example in the case of transformations comprising cropping or shifts. This result can be understood because the top portions of the video frames generally do not have much movement and hence they do not contribute much information.
  • the CODBLA is particularly suited to fingerprinting of video that is in letterboxed format.
  • the Differential Pie-Block Luminance Algorithm is different from the previous ones as it takes into consideration the geometry of the video frame. It extracts features from the frame in blocks shaped like sectors which are more resistant to scaling and shifting.
  • the means of luminance were extracted from rectangular blocks. These means were representative of that portion of the frame and provided a representative bit (in a sub-fingerprint) after spatio-temporal filtering and thresholding. A sequence of these bits represented a frame.
  • the DPBLA uses the means (i.e. mean luminance values) to generate sub-fingerprints from the luminances of pixels in the blocks in the same way as the DBLA and the CODBLA.
  • the video frame 20 is divided into 33 "blocks" 21 in order to extract 32 values by the clockwise spatial differencing explained below.
  • the blocks are now shaped similar to the sectors of a circle.
  • the uniform increase in the area of the sectors in the radial direction makes them more resistant to scaling.
  • the portions in the outskirts of the frame have not been used (so this particular DPBLA is also centrally oriented).
  • the central portion of the frame, represented in the form of a circle, has not been used for calculating means. This portion is highly vulnerable to scaling, shifting and even small amounts of rotation. Excluding it helps in improving reliability.
  • Each of the numbers represents a corresponding region in the input video frame.
  • the means of the luminance values in each of these regions is calculated. This process results in 33 mean values.
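Assigning pixels to pie-shaped sectors can be sketched as follows (our own illustration; the inner and outer radius fractions are assumptions, since the text excludes a central circle and the outskirts without fixing their sizes):

```python
import math

def sector_index(x, y, w, h, n_sectors=32, inner_frac=0.1, outer_frac=0.45):
    """Assign pixel (x, y) to one of n_sectors pie-shaped regions around the
    frame centre, or return None for the excluded central circle and the
    unused outskirts.  The radius fractions are illustrative assumptions."""
    cx, cy = w / 2.0, h / 2.0
    dx, dy = x - cx, y - cy
    r = math.hypot(dx, dy)
    if r < min(w, h) * inner_frac or r > min(w, h) * outer_frac:
        return None                        # central circle / outskirts unused
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return int(angle / (2 * math.pi / n_sectors))
```

The mean luminance of each sector then plays the same role as a rectangular block mean in the earlier algorithms; because sector area grows uniformly with radius, the means are less disturbed by scaling.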
  • the frame rate robustness is applied at this stage to get the interpolated mean-frames.
  • This procedure has been described in detail above, and will not be repeated here.
  • the frames are represented as F(n, p) instead of as F(r, c, p).
  • the mean frames are interpolated likewise.
  • the computed mean luminance values in step 1 can be visualized as 33 "pixel regions" in a frame. In other words, these represent the energy of different regions of the frame.
  • a spatial filter with kernel [-1 1] i.e. taking differences between neighboring blocks in the same row
  • a temporal filter with kernel [-1 1] is applied on this sequence of low resolution gray-scale images.
  • Take M13 and M14 to be the mean values originating from regions 13 and 14 of the current frame, and
  • M″13 and M″14 to be the mean values coming from the corresponding regions in the next frame; then the value (called the soft sub-fingerprint) is computed as SftFPn = (M14 − M13) − (M″14 − M″13).
  • the sign of SftFPn determines the value of the bit. More specifically, the bit is 1 if SftFPn ≥ 0, and 0 if SftFPn < 0.
  • a compensation factor is used in the algorithm.
  • the means of a particular region now also include partial sums of the means of adjacent regions. This helps in increasing robustness against rotation, while increasing the standard deviation of the inter-class BER distribution only slightly.
  • the algorithm also offers improved robustness towards vertical scaling.
  • the version of the pie-block algorithm with rotation compensation provides significant improvement in finding a close match between fingerprints of original and transformed signals.
  • DVSBLA Differential Variable Size Block Luminance Algorithm
  • the luminance means are extracted from rectangular blocks. These means are representative of that portion of the frame and provide a representative bit after spatio-temporal filtering and thresholding.
  • the regions that get affected the most are the ones lying on the outskirts of the processed video frame. These regions most often result in weak bits. Hence, if these regions are made larger, the probability of getting weak bits from these regions is reduced substantially.
  • the DVSBLA extraction algorithm is similar to the CODBLA. However, in the DVSBLA the regions (blocks 21) are defined as shown in Fig. 11.
  • the sizes of the various blocks in this particular example are given in the following tables 1 and 2, and are represented in terms of percentage of the frame width. The remainders represent the area to be left out on either side.
  • Table 1 The table shows the sizes of various columns in the differential variable size block luminance algorithm.
  • Table 2 The table shows the sizes of various rows in the differential variable size block luminance algorithm.
  • the blocks are rectangular, just like those used in the centrally oriented differential block luminance algorithm. However, they are now of variable size. The size decreases steadily towards the centre of the video frame. The geometric increase in the area of the rectangles from the centre of the frame helps in providing more coverage for the outer regions, which are the ones most affected during geometrical transformations like cropping, scaling and rotation. In the case of shifting, all the regions are affected equally. It may be noticed that the portions in the outskirts of the frame have not been used. This helps in improving reliability by producing fewer weak bits.
  • the frame rate robustness is applied at this stage to get the interpolated mean-frames. This procedure has been described in detail above.
  • the sub-fingerprints are then derived from the sequence of mean frames (at the predetermined rate, constructed using interpolation) in the same way as described above in relation to the DBLA and CODBLA.
  • the DVSBLA provides more resistance to weaker bits (resulting from border portions) by providing them with a larger area.
  • Robustness of the video fingerprinting system is related to the reliability of the algorithm in correctly identifying a transformed version of a video sequence.
  • the performance of various algorithms in terms of robustness against various transformations is listed in table 3 below.
  • Table 3 The table shows the qualitative performance of the four algorithms with respect to various geometric transformations and other processing on video sequences.
  • DVSBLA differential variable size block luminance algorithm
  • the reliability of a video fingerprinting system is related to the false acceptance rate of the system.
  • their inter-class BER distribution was studied. It was noticed that the distribution closely followed the normal distribution. Hence, assuming the distribution to be normal, standard deviation and percentage of outliers were computed. The standard deviation thus computed gave an idea of the theoretical false acceptance rate of the system. These parameters are shown in table 4, below, for the 4 algorithms.
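The theoretical false acceptance rate under this normal-distribution model can be sketched as follows (our own illustration; the mean of 0.5 for the inter-class BER of unrelated fingerprints, and the specific threshold and standard deviation values, are assumptions used only for the example):

```python
import math

def theoretical_far(threshold, std_dev, mean=0.5):
    """Estimate the false acceptance rate by modelling the inter-class BER
    as a normal distribution: FAR = P(BER < threshold).
    mean = 0.5 is the expected BER when comparing unrelated fingerprints."""
    z = (threshold - mean) / std_dev
    return 0.5 * math.erfc(-z / math.sqrt(2))   # standard normal CDF at z
```

A smaller standard deviation of the inter-class BER distribution therefore translates directly into a lower theoretical false acceptance rate for a given BER threshold.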
  • Table 4 The table shows the parameters obtained from the inter-class BER distribution for the four algorithms
  • differential pie block luminance algorithm with rotation compensation (DPBLA2) has very good figures.
  • differential variable size block luminance algorithm (DVSBLA) is close and can outperform DPBLA2 in certain applications due to its high robustness.
  • a fingerprint system based on DVSBLA shall have a very low false acceptance rate.
  • Fingerprint size for all the algorithms is constant at 880 bps. Hence for storing fingerprints corresponding to 5000 hours of video, 3960 MB of storage is needed. However, for various applications, fingerprints corresponding to different amounts of video need to be stored in the database.
  • Table 5 illustrates a typical storage scenario for various applications discussed above.
  • Table 5 The table shows the approximate storage requirements for fingerprints in various applications discussed above. In practice, these storage requirements can be handled very well by the search algorithm described above. Hence, the storage requirements of video fingerprinting systems embodying the invention are practical. With regard to granularity, the results show that a video fingerprinting system embodying the invention can reliably identify video from a sequence of approximately 5 s duration.
  • Video fingerprinting systems embodying the invention consist of a fingerprint extraction algorithm module and a search module to search for such a fingerprint in a fingerprint database.
  • sub-fingerprints are extracted at a constant frequency on a frame-by-frame basis (irrespective of the frame rate of the video source). These sub-fingerprints in certain embodiments are obtained from energy differences along both the time and the space axis. Investigations reveal that the sequence of such sub-fingerprints contains enough information to uniquely identify a video sequence.
  • the search module uses a search strategy for "matching" video fingerprints based on matching methods as described in WO 02/065782, for example.
  • This search strategy does not use a naïve brute-force search approach, because it is impossible to produce results in real time by doing so, due to the huge number of fingerprints in the database.
  • exact bit-copy of the fingerprints may not be given as input to the search module as the input video query might have undergone several image or video transformations (intentionally or unintentionally). Therefore, the search module uses the strength of bits in the fingerprint (computed during fingerprint extraction) to estimate their respective reliability and toggles them accordingly to get a fair (not exact) match.
  • Video fingerprinting systems embodying the invention have been tested and found to be highly reliable, needing just 5s of video in certain cases to identify the clip correctly.
  • the storage requirement for fingerprints corresponding to 5000 hours of video in certain examples has been approximately 4 GB.
  • Search modules in certain systems have been found to work well enough to produce results in real-time (in the order of ms).
  • Fingerprinting systems embodying the invention have also been found to be highly scalable, deployable on Windows, Linux and other UNIX-like platforms.
  • Certain video fingerprinting systems embodying the invention have also been optimized for performance by using MMX instructions to exploit the inherent parallelism in the algorithms they use.
  • embodiments of the invention, by extracting sub-fingerprints at a constant, predetermined rate, irrespective of and different from the frame rate of the source signal, provide the advantage that the resultant fingerprints are indications of the content of the source information signal that are robust with respect to source frame rate. Thus, common content between two signals having different frame rates may be recognised. Particular embodiments of the invention provide additional robustness with respect to other factors. It will be appreciated that throughout the present specification, including the claims, the words "comprising" and "comprises" are to be interpreted in the sense that they do not exclude other elements or steps.

Abstract

The present invention provides a method for generating fingerprints (1) of information signals (2), and in particular of video signals, those fingerprints being robust with respect to the frame rate of the information signal. Embodiments achieve this robustness by computing a sequence (3) of sub-fingerprints (30) from the sequence of frames of the source information signal, the sequence of sub-fingerprints having a predetermined rate independent of the frame rate of the source, and each sub-fingerprint being derived from and dependent upon a data content of at least one frame of the information signal. The sub-fingerprints at the predetermined rate are then concatenated to form the fingerprint (1) of the source signal.

Description

Generating fingerprints of information signals
FIELD OF THE INVENTION
The present invention relates to the generation of fingerprints indicative of the contents of information signals comprising sequences of data frames. In particular, although not exclusively, embodiments of the invention are concerned with the generation of digital fingerprints of video signals.
BACKGROUND OF THE INVENTION
A fingerprint of an information signal comprising a sequence of data frames is a piece of information indicative of the content of that signal. The fingerprint may, in certain circumstances, be regarded as a short summary of the information signal. Fingerprints in the present context may also be described as signatures or hashes. A known use for such fingerprints is to identify the contents of unknown information signals, by comparing their fingerprints with fingerprints stored in a database. For example, to identify the content of an unknown video signal, a fingerprint of the signal may be generated and then compared with fingerprints of known video objects (e.g. television programmes, films, adverts etc.). When a match is found, the identity of the content is thus determined. Clearly, it is also known to generate fingerprints of information signals having known content, and to store those fingerprints in a database.
It is desirable for the method of generating a fingerprint to be such that the resultant fingerprint is a robust indication of content, in the sense that the fingerprint can be used to correctly identify the content, even when the information signal is a processed, degraded, transformed, or otherwise derived version of another information signal having that content. An alternative way of expressing this robustness requirement is that the fingerprints of different versions (i.e. different information signals) of the same content should be sufficiently similar to enable identification of that common content to be made. In the case of video signals, for example, an original video signal, comprising a sequence of frames of pixel data, may contain a film. A fingerprint of that original video signal may be generated, and stored in a database along with metadata, such as the film's name. Copies (i.e. other versions) of the original video signal may then be made. Ideally, one would like a fingerprint generation method which, when used on any one of the copies, would yield a fingerprint sufficiently similar to that of the original for the content of the copy to be identifiable by consulting the database. However, a number of factors make this object more difficult to achieve. For example, in a copy of the original video signal, the global brightness and/or the contrast in one or more frames may have changed. Similarly, there may have been changes in colour and/or image sharpness. In addition, the copy may be in a different format, and/or the image in one or more frames may have been scaled, shifted, or rotated. In an extreme case, the pixel data in a frame of one version of the film (e.g. a copy) may be completely different from the pixel data in a corresponding frame of another version (e.g. the original) of the same film. A problem is, therefore, to devise a fingerprint generation method that yields fingerprints that are robust (i.e. 
insensitive) to a certain degree to one or more of the above-mentioned factors.
WO02/065782 discloses a method of generating robust hashes (in effect, fingerprints) of information signals, including audio signals and image or video signals. In one disclosed embodiment, a hash for a video signal comprising a sequence of frames is extracted from 30 consecutive frames, and comprises 30 hash words (i.e. one for each of the consecutive frames). The hash is generated by firstly dividing each frame into blocks. For each block, the mean of the luminance values of the pixels is computed. Then, in order to make the hash independent of the global level and scale of the luminance, the luminance differences between two consecutive blocks are computed. Also, to reduce the correlation of the hash words in the temporal direction, the difference of spatial differential mean luminance values in consecutive frames is also computed. Thus, in the resultant binary hash, each bit is derived from the mean luminances of a respective two consecutive blocks in a respective frame of the video signal and from the mean luminances of the same two blocks in an immediately preceding frame.
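The bit-derivation rule described above can be sketched in code. This is an illustrative reconstruction of the general scheme, not the patent's exact algorithm; frames are simplified to 1-D lists of pixel luminances, and the block layout and sign convention are assumptions.

```python
# Illustrative sketch of the hash-bit rule: each bit compares the luminance
# difference of two consecutive blocks in the current frame with the same
# difference in the immediately preceding frame.

def block_means(frame, n_blocks):
    """Mean luminance of each of n_blocks equal slices of a pixel list."""
    size = len(frame) // n_blocks
    return [sum(frame[i * size:(i + 1) * size]) / size for i in range(n_blocks)]

def hash_word(prev_frame, cur_frame, n_blocks):
    """Derive one hash word (n_blocks - 1 bits) from two consecutive frames."""
    prev_m = block_means(prev_frame, n_blocks)
    cur_m = block_means(cur_frame, n_blocks)
    bits = []
    for b in range(n_blocks - 1):
        spatial_cur = cur_m[b + 1] - cur_m[b]     # difference of consecutive blocks
        spatial_prev = prev_m[b + 1] - prev_m[b]  # same difference, previous frame
        bits.append(1 if spatial_cur - spatial_prev > 0 else 0)
    return bits
```

Because each bit depends only on differences, a global brightness offset added to every pixel cancels out, which is precisely the independence from global luminance level that the scheme aims for.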
Although the method disclosed in WO02/065782 provides hashes having a certain degree of robustness, a problem remains in that the hashes are sensitive to the frame rates of the signals from which they are derived. If the information signal is a video signal comprising frames having a frame rate R (i.e. there are R frames for a second of content) then the disclosed method results in the generation of hash words having the same rate (in the sense that there are R hash words for a second of content). This is problematic, because different versions of video content may employ different frame rates; for example, in the US the typical frame rate for television signals is 30 frames per second, whereas in Europe it is 25. Thus, if the disclosed method were used to generate hashes of a US television version of a particular programme or film and a European television version of the same content, then in the former case the hash would comprise 30 hash words per second of content, and in the latter case 25 hash words per second. It would therefore be very difficult, if not impossible, to determine that the two versions had the same content from a comparison of their hashes. The problem of generating fingerprints indicative of content and robust to the frame rates of information signals carrying that content is not limited to the field of video signals; it applies to any information signal comprising a sequence of data frames having a frame rate.
SUMMARY OF THE INVENTION
It is an object of the invention to provide a method of generating a fingerprint indicative of the content of an information signal which yields a fingerprint that is robust, at least to a degree, with respect to the frame rate of the information signal. In other words, an object is to provide a method that generates fingerprints which can be used to identify when two signals, having different frame rates, have the same content.
A first aspect of the present invention provides a method of generating a fingerprint indicative of a content of an information signal, the information signal comprising a first sequence of data frames having a first frame rate, the method comprising: computing a sequence of sub-fingerprints from the first sequence of frames, the sequence of sub-fingerprints having a predetermined rate independent of the first frame rate and each sub-fingerprint being derived from and dependent upon a data content of at least one frame of the information signal; and concatenating the sub-fingerprints to form the fingerprint.
It will be appreciated that the "rate" of the sequence of sub-fingerprints means the number of sub-fingerprints that are generated in correspondence with a second of content in the information signal. In embodiments in which the sub-fingerprints are generated in real time (for example from an incoming broadcast signal), the sub-fingerprint rate may thus be the actual rate at which the sub-fingerprints are generated. In alternative embodiments, for example where the method is used to generate fingerprints of an information signal stored in a file, the physical rate of generation of the sub-fingerprints may be different from the sub-fingerprint sequence's characteristic rate (e.g. much higher).
The fact that the sub-fingerprint sequence's rate is predetermined and independent of the information signal's frame rate means that a second (or some other time length) of content will be represented by a fixed number of sub-fingerprints in the fingerprint generated by the method, regardless of how many frames of the information signal are used to carry that content. Thus, the method achieves the object of generating fingerprints that are robust, at least to a degree, with respect to the frame rate of the source information signal. The method facilitates recognition of common content in signals having different frame rates because a particular time length of content is represented by the same number of sub-fingerprints in each case, and these can be compared.
In certain embodiments, the step of computing the sequence of sub-fingerprints comprises: computing a second sequence of data frames from the first sequence, the second sequence of frames having said predetermined rate and the data content of each of the second sequence of frames being derived from the data content of at least one of the first sequence of frames; and computing the sequence of sub-fingerprints from the second sequence of data frames. The second sequence of data frames can thus be regarded as an intermediate sequence, and computing the sub-fingerprints from it provides the advantage that known sub-fingerprint calculation/extraction techniques can be used in that part of the method and yet still provide a resultant fingerprint that is frame rate robust, to at least a degree. The step of computing the sequence of sub-fingerprints from the second sequence of data frames may also employ sub-fingerprint extraction techniques and algorithms which provide the resultant fingerprint with robustness with regard to other factors, in addition to the inherent frame rate robustness provided by derivation of the sub-fingerprints from the derived second frame sequence which already has the predetermined, independent frame rate.
Advantageously, in certain embodiments the data contents of the second sequence of frames are derived from the data contents of the first sequence of frames by a process comprising interpolation. By constructing the second sequence of frames in this way (including constructing frames corresponding to positions on a time line that are in-between positions of frames of the source information signal) the resultant fingerprints are even more robust and reliable indications of content. Advantageously, the frames of the second sequence may contain data relating to some feature or property extracted from the data content of the source frames (such as, in the case of video signals, the mean luminance of blocks of pixels). Thus, the data content of the second sequence of frames may be considerably less than that of the information signal, and this facilitates calculation of the sub-fingerprints and any subsequent searching of databases for fingerprint matches.
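The intermediate-sequence idea can be illustrated with a short sketch: per-frame feature values at the source frame rate are linearly interpolated onto a grid at the fixed, predetermined rate, so the number of derived values per second of content no longer depends on the source frame rate. The function name and the one-feature-value-per-frame simplification are assumptions made for illustration.

```python
# Resample a list of per-frame feature values from the source frame rate
# onto a fixed-rate time grid covering the same duration, using linear
# interpolation between the two neighbouring source values.

def resample(values, src_rate, target_rate):
    """Interpolate values (sampled at src_rate frames/s) onto a grid
    at target_rate frames/s spanning the same stretch of content."""
    duration = (len(values) - 1) / src_rate      # seconds spanned by the input
    n_out = int(duration * target_rate) + 1
    out = []
    for k in range(n_out):
        t = k / target_rate                      # time of derived frame k
        pos = t * src_rate                       # position on the source grid
        i = min(int(pos), len(values) - 2)
        frac = pos - i
        out.append(values[i] + frac * (values[i + 1] - values[i]))
    return out
```

For example, a second of content carried by 25 source frames and the same second carried by 30 source frames both resample to the same number of derived values at, say, 5 per second, making the sub-fingerprints computed from them directly comparable.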
Other aspects of the invention provide use of the inventive method to generate fingerprints of information signals having frame rates lower than the predetermined rate, and of signals having frame rates higher than the predetermined rate.
Another aspect provides signal processing apparatus arranged to carry out the inventive method.
Further aspects provide a computer program enabling the carrying out of the inventive method, and a record carrier on which such a program is recorded. Yet further aspects provide broadcast monitoring methods, filtering methods, automatic video library organization methods, selective recording methods, and tamper detection methods using the inventive fingerprint generation method.
These and other aspects of the invention, and further features of embodiments of the invention and their associated advantages, will be apparent from the following description of embodiments and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will now be described with reference to the accompanying drawings, of which:
Fig. 1 is a schematic representation of a fingerprint generation method embodying the invention;
Fig. 2 is a schematic representation of part of another fingerprint generation method embodying the invention;
Fig. 3 is a schematic representation of part of yet another fingerprint generation method embodying the invention;
Fig. 4 is a schematic illustration of an interpolation technique used in certain embodiments of the invention;
Fig. 5 is a schematic representation of part of yet another fingerprint generation method embodying the invention, generating sub-fingerprints indicative of the content of a video signal;
Fig. 6 is a schematic representation of a video fingerprinting system embodying the invention;
Fig. 7 is a schematic representation of the division of a frame of an information signal into blocks, as used in certain embodiments of the invention;
Fig. 8 is a schematic representation of part of a sequence of extracted feature frames generated in a method embodying the invention;
Fig. 9 is a schematic representation of the division of a frame of a video signal into blocks, as used in certain embodiments of the invention;
Fig. 10 is a schematic representation of the division of a frame of a video signal into blocks, as used in certain embodiments of the invention; and
Fig. 11 is a schematic representation of the division of a frame of a video signal into blocks, as used in certain embodiments of the invention.
DESCRIPTION OF PREFERRED EMBODIMENTS
Referring now to Fig. 1, this is a highly schematic representation of a fingerprint generation method in accordance with the present invention. An information signal 2 comprises a first sequence of data frames 20 having a first frame rate. For ease of representation, only four of the data frames 20 are shown in the figure. However, it will be appreciated that in practice the number of data frames in the information signal whose fingerprint is being generated may be very much larger. The sequence of first data frames 20 is shown at positions along a time line. The frame rate of the sequence of frames 20 is constant. In other words, the data frames can be regarded as samples of a content at regular time intervals. In this example the time interval between adjacent frames 20 in the sequence is constant and is denoted by t2, which in this illustrative example is one third of a second. Again, this time period has been chosen merely to simplify representation, and it will be appreciated that, in practice, the time intervals between successive frames may be much smaller (corresponding to the information signal having a much higher frame rate). In certain embodiments the information signal 2 is in the form of a file stored on some appropriate medium. In alternative embodiments, the information signal 2 may be a broadcast signal, for example, such that the time interval t2 is the real time interval between the broadcast or transmission of successive frames (and hence also the real time interval between receipt of successive frames at some destination).
In this first example, a sequence 3 of sub-fingerprints 30 is computed from the information signal 2. This computing process is denoted by block 23 in Fig. 1. In other words, the information signal 2 is processed to produce the sequence 3 of sub-fingerprints. Rather than the sub-fingerprints being produced at the same rate as the frame rate of the information signal, as was the case in certain prior art techniques, in this embodiment of the invention the sequence of sub-fingerprints has a predetermined rate, independent of and different from the first frame rate. In the illustrated example, the frame rate of the source information signal 2 is lower than the predetermined rate of the sequence of sub-fingerprints, hence there are more sub-fingerprints for a second of content of the source information signal than there are frames of the information signal. Again, in the figure the nominal positions of the sub-fingerprints 30 are shown with respect to a time line. The time interval between successive sub-fingerprints is denoted by t3, which in this example is 0.2 seconds. Thus, in the somewhat simplified example of Fig. 1, the frame rate of the source signal is 3 frames per second, and the predetermined rate of the sub-fingerprints is 5 per second. The computing step 23 may employ a variety of techniques to produce the sequence of sub-fingerprints from the source information signal. However, a common feature of these techniques is that each of the sub-fingerprints 30 is derived from, and dependent upon, a data content of at least one frame 20 of the source information signal. As was the case with the source information signal, the sequence 3 of sub-fingerprints produced by the processing step 23 may be in the form of a file stored on a suitable medium, or alternatively may be a real-time succession of sub-fingerprints 30 output from a suitably arranged processor.
The method embodying the invention includes a further processing step 31 which operates on the sequence 3 of sub-fingerprints 30 and concatenates them to form a fingerprint 1. As each of the sub-fingerprints 30 is derived from and dependent upon a data content of at least one frame of the source information signal, the resultant fingerprint 1 is indicative of a content of the source information signal 2.
Referring now to Fig. 2, in another embodiment of the present invention the step of computing the sequence of sub-fingerprints from the source sequence of frames comprises an intermediate step 24 of computing a second sequence 4 of data frames 40 from the first sequence, that second sequence 4 of frames having the predetermined rate, and the data content of each of the second sequence 4 of frames 40 being derived from the data content of at least one of the first sequence of frames 20. Then, the sequence 3 of sub-fingerprints 30 is computed from the second sequence 4 of data frames 40, in processing step 43. As mentioned above, an advantage of producing this second sequence of frames (i.e. an intermediate sequence) at the predetermined rate and computing the sub-fingerprints from them is that, if desired, processing step 43 can use previously known sub-fingerprint techniques or algorithms (which in themselves did not provide frame rate robustness), and the resultant sub-fingerprints 30 can be combined to form a fingerprint 1 which does incorporate frame rate robustness. In certain embodiments, processing step 24 comprises a frame-rate conversion of the full contents of the frames 20 of information signal 2. Thus, in such cases each frame 40 of the second sequence of frames is substantially the same size as a frame 20 of the source information signal 2. However, in other embodiments, the data frames 40 of the second sequence are smaller than those of the first sequence and this provides the advantage that subsequent processing to form the sub-fingerprints is facilitated (this advantage manifests itself in faster processing speeds and can yield smaller resultant fingerprints, which in turn facilitate subsequent storage and handling, including searching of databases for matches with other fingerprints).
One way of achieving this reduction in frame size in producing the second sequence of frames is for the frames 40 to contain data related to a feature or features of the contents of the source frames 20 from which they are derived, rather than reproducing the whole source data contents. Advantageously, such features may relate to some average property of the data or groups of data contained within the source frames 20. The processing required to extract average information is relatively simple and quick, and enables the second sequence of frames to contain much less data than the source signal from which it is derived. In certain embodiments the step of computing the second sequence 4 of frames comprises deriving the data content of a frame 40 of the second sequence from the data content of a plurality of frames 20 of the first sequence.
Also, in certain embodiments the step of computing the sequence of sub-fingerprints from the second sequence 4 of data frames comprises deriving a sub-fingerprint 30 from the data contents of a plurality of frames 40 of the second sequence. This enables the sub-fingerprints to be dependent upon temporal variations in the data contents of the frames 40 of the second sequence 4.
Advantageously, the data contents of the second sequence 4 of frames 40 are derived from the data contents of the first sequence 2 of frames 20 by a process comprising interpolation. This enables frames 40 to be constructed at positions on the time line that do not correspond exactly to positions of source data frames 20 (i.e. frames 40 can be constructed at positions between those of source frames). In constructing those frames 40 at intermediate positions, the contents of the neighbouring source frames on the time line can be taken into account. By doing this, the resultant fingerprint produced by a method embodying the invention is made a more reliable and frame-rate robust indication of the content of the information signal 2.
In certain embodiments, the data contents of the second sequence of frames 40 are derived from the contents of the first sequence 2 by a process comprising linear interpolation. However, it will be appreciated that other interpolation techniques may be used. Referring now to Fig. 3, in a further embodiment of the invention the step of computing the second sequence 4 of frames 40 comprises the additional step of computing a sequence 5 of extracted feature data frames from the first sequence 2, the sequence 5 of extracted feature frames 50 having the first frame rate (i.e. the same frame rate as the source information signal), and each extracted frame 50 containing feature data indicative of at least one feature of a respective one of the first sequence 2 of frames 20. The method then also comprises a step of computing the second sequence 4 of frames from the sequence 5 of extracted feature frames 50, the data contents of the second sequence 4 of frames 40 being derived from the feature data contained in the sequence 5 of extracted feature frames 50. In Fig. 3, the step of computing the sequence 5 of extracted feature frames 50 is denoted by arrow 25, and the step of computing the second sequence 4 of data frames from the sequence 5 of extracted feature frames 50 is denoted by arrow 54. The data content of each of the extracted feature frames 50 is denoted by the letter F, and the data content of each of the second sequence of frames is denoted by F'. In certain embodiments the feature data F in a frame 50 of sequence 5 is indicative of a property of at least a portion of a respective frame 20 of the first sequence 2.
If the information signal 2 is a video signal comprising frames 20 containing pixel data, the extracted feature data F may, for example, relate to one or more of the following: the mean luminance of one or more groups of pixels; the mean chrominance of one or more groups of pixels; centroid positional information derived from pixel luminance; centroid positional information derived from pixel colour information; or any other information relating to properties of the source frame 20 or parts of it.
In certain embodiments of the invention, the data contents F' of the second sequence 4 of frames 40 are derived from the feature data F contained in the sequence 5 of extracted feature frames by a process comprising interpolation. This may comprise linear interpolation, or some other form of interpolation.
Moving on to Fig. 4, this illustrates an interpolation technique which may be used in embodiments of the invention. The figure shows just two extracted feature frames 50 from a sequence 5. The first of these frames is at nominal position 0 on the time line and the data it contains indicates that a particular extracted feature has a value V1. The second of these frames 50 is at nominal position 5 on the time line and contains data indicating that the extracted feature has a value V2. Beneath the pair of extracted feature frames there is shown a corresponding part of the second sequence 4 of data frames 40 derived from the extracted feature frame sequence 5 by a process comprising interpolation. The predetermined rate of the second sequence 4 of frames is higher than the source frame rate in this example, and the three frames 40 of the second sequence 4 correspond to positions 0, 2, and 4 on the time line. Each of these frames 40 contains derived feature data indicative of a value V' of the extracted feature. These values V' are derived from the values in the extracted feature data frames 50 by linear interpolation as follows. The first frame 40 of the second sequence 4 occurs at time position 0, which corresponds to the position of the first frame 50 of the sequence 5, and hence V1' = V1. The second frame in the sequence 4 occurs at position 2, and hence V2' = V1 + 2/5 (V2 - V1). Lastly, the third frame in the second sequence 4 occurs at position t = 4, and hence V3' = V1 + 4/5 (V2 - V1).
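The Fig. 4 calculation can be written out directly. The numerical feature values below (V1 = 10, V2 = 20) are illustrative assumptions; only the positions 0, 2 and 4 and the span of 5 come from the example above.

```python
# Linear interpolation between two extracted feature values, as in Fig. 4:
# v1 sits at position 0 on the time line and v2 at position span.

def interpolate(v1, v2, pos, span=5):
    """Linearly interpolated value at position pos between v1 and v2."""
    return v1 + (pos / span) * (v2 - v1)

# With illustrative values V1 = 10 and V2 = 20:
# position 0 -> 10.0 (simply copied), position 2 -> 14.0, position 4 -> 18.0
```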
Linear interpolation in this way is advantageous as it provides a simple and quick method of processing extracted feature data to construct derived feature data at positions on the time line that were not occupied by frames of the source information signal. It will be appreciated, however, that other interpolation techniques may be used. For example, where a frame of the second sequence corresponds to a position on the time line that is between two of the source frames, instead of its contents being derived just from those two immediately adjacent source frames, those contents may be derived from a larger number of source frames. This may be desirable in embodiments where the source frame rate is more than twice the predetermined rate of the sequence of sub-fingerprints. The number of source frames from which each of the second sequence of frames is derived may be selected such that the contents of all source frames have an influence on the eventual fingerprint (if only "nearest neighbours" on the time line were taken into account to determine the contents of each of the second sequence of frames then, in some cases, this could result in one or more of the source frames being "ignored" in the fingerprint generation method).
Referring now to Fig. 5, this shows part of a fingerprint generation method embodying the invention for generating digital fingerprints of an information signal 2 in the form of a video signal comprising a sequence of video frames 20, each containing pixel data. The method comprises a processing step 26 of dividing each of the source frames 20 into a plurality of blocks 21. For simplicity, each frame 20 is shown divided into just four blocks, which are labelled b1, b2...b4. It will be appreciated that this number of blocks is just an example, and in practice a different number of blocks may be used. The method further comprises the steps of calculating a feature of each block 21 and then using the calculated feature data to produce the sequence 5 of extracted feature frames 50, such that each extracted feature frame 50 contains the calculated block feature data for each of the plurality of blocks of the respective one of the first sequence of frames. In the illustrated example, the feature calculated in processing step 27 is the mean luminance L of the group of pixels in each block 21. Thus, each extracted feature frame 50 contains four mean luminance values, L1, L2...L4. Then, in processing step 54, the second sequence 4 of data frames 40 is constructed from the sequence 5 of extracted feature frames. Each of the second sequence of frames 40 contains four mean luminance values, one for each of the four blocks into which the source frames were divided. As the second sequence 4 of data frames 40 is at the predetermined rate, which is in general different from the source frame rate, some of the second sequence frames 40 correspond to positions on the time line which are between positions of the extracted feature data frames 50. Thus, in this example the mean luminance values contained in the second sequence data frames 40 are derived from the contents of the extracted feature frames 50 by a process comprising interpolation.
In the figure, the first illustrated frame of the second sequence 4 corresponds exactly to the position on the time line of the first of the extracted feature frames 50, and hence the mean luminance values it contains can simply be copied from that extracted feature frame 50. However, the second in the sequence of data frames 40 occurs at a position on the time line that is between the first and second extracted feature frames 50. Accordingly, each of the mean luminance values in this second frame 40 has been derived by a process involving a calculation using two mean luminance values from the "surrounding" extracted feature frames 50 on the time line. Then, in processing step 43, the sequence of sub-fingerprints 30 is calculated (i.e. derived) from the block mean luminance values in the sequence of data frames 40. In this example, each sub-fingerprint 30 is derived from the contents of a respective one of the second sequence 4 of frames 40 and from the immediately preceding frame 40 in that second sequence 4.
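The whole Fig. 5 chain — block division, mean luminance extraction, interpolation to the predetermined rate, and sub-fingerprint derivation — can be sketched as below. The 2x2 block grid matches the simplified figure, but the bit rule (one bit per block, set when the block's mean luminance rises relative to the preceding derived frame) is an illustrative stand-in for the actual extraction algorithm, and frames are simplified to 2-D lists of luminance values.

```python
# Sketch of the Fig. 5 pipeline under the simplifying assumptions above.

def block_mean_luminances(frame):
    """Mean luminance of each quadrant (b1..b4) of a 2-D frame."""
    h, w = len(frame), len(frame[0])
    means = []
    for r0, r1 in ((0, h // 2), (h // 2, h)):
        for c0, c1 in ((0, w // 2), (w // 2, w)):
            vals = [frame[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            means.append(sum(vals) / len(vals))
    return means

def resample_features(feature_frames, src_rate, target_rate):
    """Linearly interpolate each block's mean across time to target_rate."""
    duration = (len(feature_frames) - 1) / src_rate
    out = []
    for k in range(int(duration * target_rate) + 1):
        pos = (k / target_rate) * src_rate
        i = min(int(pos), len(feature_frames) - 2)
        frac = pos - i
        out.append([a + frac * (b - a)
                    for a, b in zip(feature_frames[i], feature_frames[i + 1])])
    return out

def sub_fingerprints(derived_frames):
    """One bit per block: did its mean luminance rise since the last frame?"""
    return [[1 if cur > prev else 0
             for prev, cur in zip(derived_frames[j - 1], derived_frames[j])]
            for j in range(1, len(derived_frames))]
```

Concatenating the bit lists returned by `sub_fingerprints` corresponds to the final concatenation step 31 that forms the fingerprint.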
Further background information relating to the fingerprinting of information signals, and video signals in particular, will now be given, along with descriptions of further embodiments and further features of embodiments of the invention. A video fingerprint, in certain embodiments, is a code (e.g. a digital piece of information) that identifies the content of a segment of video. Ideally, a video fingerprint for a particular content should not only be unique (i.e. different from the fingerprints of all other video segments having different contents) but also be robust against distortions and transformations. A video fingerprint can also be seen as a short summary of a video object.
Preferably, a fingerprint function F should map a video object X, consisting of a large and variable number of bits, to a fingerprint consisting of only a smaller and fixed number of bits, in order to facilitate database storage and effective searching (for matches with other fingerprints). The requirements of a video fingerprint for it to be a good content classifier can also be summarized as follows: ideally, the fingerprints of a video clip are unique, implying that the probability of fingerprints of different video clips being similar is low; and fingerprints for different versions of same video clip should be similar, implying that the probability of similarity of the fingerprints of an original video and its processed version is high.
Some definitions useful in understanding the following description are as follows: a sub-fingerprint is a piece of data indicative of the content of part of a sequence of frames of an information signal. In the case of video signals, a sub-fingerprint is, in certain embodiments, a binary word, and in particular embodiments is a 32 bit sequence. In embodiments of the invention a sub-fingerprint may be derived from and dependent upon the contents of more than one source frame. A fingerprint of a video segment represents an orderly collection of all of its sub-fingerprints. A fingerprint block can be regarded as a sub-group of the "fingerprint" class, and in certain embodiments is a sequence of 256 sub-fingerprints representing a contiguous sequence of video frames. Metadata is "soft" information of a video clip consisting of parameters like 'name of the video', 'artist' etc., and an end-application would be interested in getting this metadata.
Hamming distance: In comparing two bit patterns, the Hamming distance is the count of bits different in the two patterns. More generally, if two ordered lists of items are compared, the Hamming distance is the number of items that do not identically agree. This distance is applicable to encoded information, and is a particularly simple metric of comparison, often more useful than the city-block distance (the sum of absolute values of distances along the coordinate axes) or Euclidean distance (the square root of the sum of squares of the distances along the coordinate axes).
Bit Error Rate (BER): The bit error rate between two fingerprints is the fraction representing the number of dissimilar bits in the two. It may also be termed the ratio of the Hamming distance between the bit strings of two fingerprint blocks to the number of bits in a fingerprint block (i.e. 256 x 32 = 8192).
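The two measures just defined are straightforward to compute; a minimal sketch follows, with fingerprint blocks represented as flat bit lists (a full block as defined above would be 256 x 32 = 8192 bits long).

```python
# Hamming distance and bit error rate between two equal-length bit sequences.

def hamming_distance(bits_a, bits_b):
    """Count of positions at which the two bit sequences differ."""
    return sum(1 for a, b in zip(bits_a, bits_b) if a != b)

def bit_error_rate(bits_a, bits_b):
    """Fraction of differing bits: Hamming distance over sequence length."""
    return hamming_distance(bits_a, bits_b) / len(bits_a)
```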
Inter-Class BER Comparison: Inter-class BER refers to the bit error rate between two fingerprint blocks corresponding to two different video sequences. Intra-Class BER Comparison: Intra-class BER refers to the bit error rate between two fingerprint blocks belonging to the same video sequence. It may be noted that two video sequences may be different in the sense that they might have undergone geometrical or other qualitative transformations. However, they are perceptually similar to the human eye.
A video fingerprinting system embodying the invention is shown in Fig. 6. This video fingerprinting system provides two functionalities: fingerprint generation and fingerprint identification. Fingerprint generation is done both during the pre-processing stage and during the identification stage. In the pre-processing stage, the fingerprints 1 of the video files 62 (movies, television programmes, commercials etc.) are generated and stored in a database 65. Fig. 6 shows this stage in box 61. During the identification stage, fingerprints 1 are again generated from such sequences (input video queries 68) and are sent to the system as a query. The fingerprint identification stage consists primarily of a database search strategy. It may be noted that, owing to the huge number of fingerprints in the database, it is not practical to use a brute-force approach to search for fingerprints. A different approach, which searches fingerprints efficiently in real time, has been adopted in certain embodiments of the invention. The input in this stage is a fingerprint block query 68 and the output is metadata 625 consisting of identification result(s).
In slightly more detail, in the embodiment shown in Fig. 6, encoded data 623 from video files 62 is normalised (which, for example, may comprise scaling the video resolution to a fixed resolution) and decoded by a decoder and normaliser 63. This stage 63 then provides normalised decoded video frames to a fingerprint extraction stage 64, which processes the incoming frames with a fingerprint extraction algorithm to generate a fingerprint 1 of the source video file. This fingerprint 1 is stored in the database 65 along with corresponding metadata 625 for the video file 62. An input video query 68 comprises encoded data 683 which is also processed by the decoder/normaliser 63, and the fingerprint extraction stage 64 generates a fingerprint 1 corresponding to the query and provides that fingerprint to a fingerprint search module 66. That module searches for a matching fingerprint in the database 65, and when a match is found for the query, the corresponding metadata 625 is provided as an output 67.
Parameters to consider in a video fingerprint system are as follows: Robustness: can a video clip still be identified after severe signal degradation? In order to achieve high robustness, the fingerprint should be based on perceptual features that are invariant (at least to a certain degree) with respect to signal degradations. Preferably, severely degraded video still leads to very similar fingerprints. The false rejection rate (FRR) is generally used to express the robustness. A false rejection occurs when the fingerprints of perceptually similar video clips are too different to lead to a positive match.
Reliability: how often is a movie incorrectly identified? The rate at which this occurs is usually referred to as the false acceptance rate (FAR).
Fingerprint size: how much storage is needed for a fingerprint? To enable fast searching, fingerprints are usually stored in RAM. Therefore the fingerprint size, usually expressed in bits per second or bits per movie, determines to a large degree the memory resources needed for a fingerprint database server. Granularity: how many seconds of video are needed to identify a video clip?
Granularity is a parameter that can depend on the application. In some applications the whole movie can be used for identification, in others one prefers to identify a movie with only a short excerpt of video.
Search speed and scalability: how long does it take to find a fingerprint in a fingerprint database? What if the database contains thousands of movies? For the commercial deployment of video fingerprint systems, search speed and scalability are key parameters. Search speed should be in the order of milliseconds for a database containing over 10,000 movies, using only limited computing resources (e.g. a few high-end PCs).
Effect of transformations on fingerprints: video fingerprints can change due to the different transformations and processing applied to a video sequence. Such transformations include smoothening and compression, for example. These transformations result in different fingerprint blocks for an original video sequence and the transformed sequence, and hence a bit error rate (BER) is incurred when the fingerprints of the original and transformed versions are compared. In certain cases, compression to a low bit rate is a much more severe process than mere smoothening (noise reduction) of the frames in the video sequence. The BER in the former case is therefore much higher than in the latter.
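The BER between two fingerprint blocks is simply the fraction of differing bits. A minimal sketch, using hypothetical 8-bit sub-fingerprints (the bit patterns below are illustrative, not taken from the text):

```python
# Sketch of the bit error rate (BER) between two fingerprint blocks,
# each represented as a list of equal-length bit strings (sub-fingerprints).

def bit_error_rate(block_a, block_b):
    assert len(block_a) == len(block_b)
    errors = total = 0
    for sub_a, sub_b in zip(block_a, block_b):
        assert len(sub_a) == len(sub_b)
        errors += sum(a != b for a, b in zip(sub_a, sub_b))
        total += len(sub_a)
    return errors / total

original   = ["10110100", "01101001"]
smoothed   = ["10110100", "01101011"]  # mild transform: 1 bit flipped
compressed = ["10010110", "11101011"]  # severe transform: 4 bits flipped
print(bit_error_rate(original, smoothed))    # 0.0625
print(bit_error_rate(original, compressed))  # 0.25
```

The more severe transformation produces the higher BER, mirroring the smoothening-versus-compression comparison above.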
The correlation between the two fingerprint blocks also varies depending upon the severity of the transformation. The less severe the transformation, the higher the correlation. Searching for fingerprints in a database is not an easy task. A search technique which may be used in embodiments of the invention is described in WO 02/065782. A brief description of the problem is as follows.
In certain embodiments of the invention, the video fingerprint system generates sub-fingerprints at 55Hz. Hence, from a video of duration 2 hours the number of sub-fingerprints generated would be: (2 x 60 x 60)s x 55 sub-fingerprints/s = 396000 sub-fingerprints. In a database consisting of fingerprints of 2000 hours of video (396 million sub-fingerprints), it would not be possible for a brute-force search algorithm to produce a result in real time. The search task has to find the position in the 396 million sub-fingerprints. With brute-force searching, this takes 396 million fingerprint block comparisons. Using a modern PC, a rate of approximately 200,000 fingerprint block comparisons per second can be achieved. Therefore the total search time for our example will be in the order of 30 minutes. The brute-force approach can be improved by using an indexed list. For example, consider the following sequence: "AMSTERDAMBERLINNEWYORKPARISLONDON"
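The arithmetic above can be checked directly; the following sketch simply reproduces the figures from the text:

```python
# The brute-force search figures from the text, reproduced as arithmetic.
SUBFP_RATE_HZ = 55
movie_subfps = 2 * 60 * 60 * SUBFP_RATE_HZ     # sub-fingerprints in a 2-hour movie
db_subfps = 2000 * 60 * 60 * SUBFP_RATE_HZ     # 2000-hour database
comparisons_per_sec = 200_000                  # modern-PC rate cited in the text
search_minutes = db_subfps / comparisons_per_sec / 60

print(movie_subfps)           # 396000
print(db_subfps)              # 396000000
print(round(search_minutes))  # 33 -> "in the order of 30 minutes"
```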
We could index the list by the starting letter of each city. If we want to look up the word "PARIS", we could go directly to the sub-list for "P" and search for the word there. However, the situation in the case of fingerprints is not as easy as depicted in this example. This is evident from the question: will the query contain the exact word "PARIS"? The query could contain "QARIS", "QBRIS", "QASIS", "PBRHS" or even "OBSJT" or some other near word. Hence, there is a possibility that we might not even get a correct starting position in the index at which to start our search, and the system would falsely reject the scaled version of the clip. The solution is to find close matches. Hence, when unable to find an exact match for the query word "OBSJT", each of the letters in this word is toggled and a match is searched for the resulting word.
Thus, in certain embodiments of the invention, while calculating the sub-fingerprints, each bit in a sub-fingerprint is ranked according to its strength. When an exact match is not found for any of the sub-fingerprints (letters), the weak bits of the sub-fingerprints are toggled, in increasing order of their strength. Hence, the weakest bit is toggled first and a match is searched for the resulting new fingerprint; if a match is not found, then the next weakest bit is toggled, and so on. In case more than one match is found by toggling up to the pre-defined maximum number of bits, the one with the least BER (below a threshold) is deemed the closest match. Hence, if the query is "QARIS" and the strength estimation algorithm ranks "Q" as the weakest letter, the match would be found almost instantaneously after toggling "Q" to "P", for example. However, if "Q" is ranked as the strongest, the search would take longer.
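A sketch of this weak-bit toggling strategy follows. It assumes that bit "strengths" (reliabilities) are supplied by the extraction stage, and it flips bits cumulatively, weakest first; the toy 4-bit database, the strength values and the cumulative flipping order are illustrative assumptions, not details taken from the text.

```python
# Sketch of weak-bit toggling lookup: when an exact match for a sub-fingerprint
# fails, candidate words are generated by flipping the least reliable bits first.

def toggled_candidates(word, strengths, max_toggles):
    """Yield variants of `word` (a list of bits): the exact word first, then
    versions with one additional weak bit flipped, weakest bit first."""
    order = sorted(range(len(word)), key=lambda i: strengths[i])  # weakest first
    candidate = list(word)
    yield tuple(candidate)
    for i in order[:max_toggles]:
        candidate[i] ^= 1          # flip the next-weakest bit (cumulatively)
        yield tuple(candidate)

database = {(1, 0, 1, 1): "clip A", (0, 1, 1, 0): "clip B"}
query = [1, 0, 1, 0]               # no exact match in the database
strengths = [0.9, 0.7, 0.8, 0.1]   # the last bit is by far the weakest

match = next((database[c] for c in toggled_candidates(query, strengths, 2)
              if c in database), None)
print(match)  # clip A: flipping the weakest bit recovers the stored word
```

In a full system, every candidate that hits the database would then be verified by comparing whole fingerprint blocks and keeping the hit with the lowest BER below the threshold.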
In the analysis of performance of algorithms, the term database hits is used frequently. A database hit represents the situation when the match (which may be an exact match, or a close match) is found in the database. Video fingerprinting applications of embodiments of the invention will now be discussed in more detail. Apart from video fingerprinting, there are other technologies, such as watermarking, available for the identification of video sequences within third-party transmissions. This process, however, relies on a video sequence being modified and the watermark being inserted into the video stream; this is then retrieved from the stream at a later time and compared with the database entry. This requires the watermark to travel with the video material. On the other hand, a video fingerprint is stored centrally and it does not need to travel with the material. Therefore, video fingerprinting can still identify material after it has been transmitted on the web. A number of applications of video fingerprinting have been considered. They are listed as follows:
Filtering Technology for File Sharing: The movie industry throughout the world suffers great losses due to video file sharing over peer-to-peer networks. Generally, by the time a movie is released, "handy cam" copies of the video are already circulating on the so-called sharing sites. Although the file sharing protocols are quite different from each other, most of them share files using unencrypted methods. Filtering refers to active intervention in this kind of content distribution. Video fingerprinting is considered a good candidate for such a filtering mechanism. Moreover, it is more suitable than other techniques, such as watermarking, that can be used for content identification, since a watermark has to travel with the video, which cannot be guaranteed. Thus, one aspect of the invention provides a filtering method and a filtering system utilising a fingerprint generation method in accordance with the first aspect of the invention.
Broadcast Monitoring: Monitoring refers to tracking of radio, television or web broadcasts for, among others, the purposes of royalty collection, program verification and people metering. This application is passive in the sense that it has no direct influence on what is being broadcast: the main purpose of the application is to observe and report. A broadcast monitoring system based on fingerprinting consists of several monitoring sites and a central site where the fingerprint server is located. At the monitoring sites fingerprints are extracted from all the (local) broadcast channels. The central site collects the fingerprints from the monitoring sites. Subsequently the fingerprint server, containing a huge fingerprint database, produces the play lists of the respective broadcast channel. Thus, another aspect of the invention provides a broadcast monitoring method and a broadcast monitoring system utilising a fingerprint generation method in accordance with the first aspect of the invention.
Automated indexing of a multimedia library: Many computer users have a video library containing several hundred, sometimes even thousands, of video files. When the files are obtained from different sources, such as ripping from a DVD, scanning of images and downloading from file sharing services, these libraries are often not well organized. By identifying these files with fingerprinting, the files can be automatically labeled with the correct metadata, allowing easy organization based on, for example, artist, album or genre. Thus, another aspect of the invention provides an automated indexing method and system utilising a fingerprint generation method in accordance with the first aspect of the invention.
Television Commercial Blocking and Selective Recording: Television commercial blocking can be accomplished in a digital broadcast scenario. For example, in a Multimedia Home Platform (MHP) scenario based on the Digital Video Broadcasting (DVB) standard, the television is connected to the outside world. With one such connection to the fingerprinting server, and a television equipped with fingerprint generation capability, television commercials can be blocked from the viewer. This application can also be used as an enabling tool for the selective recording of programs, with the added advantage of commercial filtering. Thus, other aspects of the invention provide commercial blocking and selective recording methods and systems utilising fingerprint generation methods in accordance with the first aspect of the invention.
Detection of Video Tampering or Errors in Transmission Lines: As discussed above, the fingerprints of an original movie and its transformed (or processed) version are generally different from each other. The BER function can be used to ascertain the difference between the two. This property of the fingerprints can be used to detect the malfunctioning of a transmission line which is supposed to transmit a correct video sequence. It can also be used to detect automatically (without manual intervention) whether a movie or other video material has been tampered with. Thus, other aspects of the invention provide tampering and error detection methods and systems utilising fingerprint generation methods in accordance with the first aspect of the invention.
Video fingerprint tests have been used to evaluate fingerprint extraction algorithms used in embodiments of the invention. These tests have included reliability tests and robustness tests. The reliability of the fingerprints generated by an algorithm is closely related to the false acceptance rate. In reliability tests, the BER distribution of bits resulting from the comparison of two fingerprint blocks has been studied, to provide a theoretical false acceptance rate. The inter-class BER distribution serves as a robust indicator of the performance of the algorithm, for example. In robustness tests, used to evaluate fingerprint extraction algorithms used in embodiments of the invention, a small database consisting of 4 video clips and several of their transformed versions was created. A video can undergo several transformations. In order to test the fingerprinting algorithms developed, the following transformations on images were considered: scaling; horizontal scaling; vertical scaling; rotation; upward shift; downward shift; CIF (Common Interchange Format) scaling; QCIF (Quarter Common Interchange Format) scaling; SIF (Standard Common Interchange Format) scaling; median filtering; change in brightness; change in contrast; compression; change in frame rate. Thus, transformed versions of an original clip, using these different transformations, were made, and the fingerprints of the original and transformed versions compared.
Algorithms used in video fingerprinting methods and systems embodying the invention will now be described. Firstly, a so-called differential block luminance algorithm will be described. Improvements to the basic algorithm, which increase its robustness, are then discussed. The Differential Block Luminance Algorithm computes features in the spatio-temporal domain. Moreover, one of the major applications for video fingerprinting is the filtering of video files on peer-to-peer networks. The stream of compressed data available to the system can be used beneficially if the feature extraction uses block-based DCT (discrete cosine transformation) coefficients. The guiding principles of this algorithm are as follows:
1) To obtain features uniquely representing the video sequence on a frame by frame basis.
2) To obtain perceptually important features. It may be noticed that in an image, the luminance component is perceptually more important than the color components. Also, the YUV color space, in which the chrominance components are sub-sampled, is almost universally used by video encoders. Hence luminance values are used to extract features.
3) To allow easy feature extraction from most compressed video streams as well, we choose features which can easily be computed from block-based DCT coefficients. Based on these considerations, the proposed algorithm is based on a simple statistic, the mean luminance, computed over relatively large regions. The sub-fingerprints are extracted as follows. 1. Each video frame is divided into a grid of R rows and C columns, resulting in RxC blocks. For each of these blocks, the mean of the luminance values of the pixels is computed. The mean luminance of block (r, c) in frame p is denoted F(r, c, p) for r = 1, 2, . . ., R and c = 1, 2, . . ., C. Fig. 7 illustrates a video data frame 20 divided into blocks 21 in this way. The representation of the frame shows the RxC blocks for R = 4 and C = 9 (i.e. 36 blocks in total in this example). The mean of the luminance values is calculated for each of the blocks, resulting in RxC mean values. Each of the numbers represents a corresponding region in the input video frame. Thus, the mean of the luminance values in each of these regions has been calculated.
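Step 1 can be sketched as follows. For simplicity this sketch assumes the frame dimensions are multiples of R and C; a real implementation would handle the remainder pixels (e.g. by cropping or padding), which the text does not specify.

```python
import numpy as np

# Step 1 of the differential block luminance algorithm: divide a luminance
# frame into an R x C grid and compute the mean luminance of each block.

def block_means(frame, R, C):
    H, W = frame.shape
    bh, bw = H // R, W // C                     # block height and width
    cropped = frame[:R * bh, :C * bw]           # drop remainder pixels
    return cropped.reshape(R, bh, C, bw).mean(axis=(1, 3))

frame = np.arange(72, dtype=float).reshape(8, 9)  # toy 8x9 "luminance" frame
F = block_means(frame, R=4, C=9)                  # the R=4, C=9 grid of Fig. 7
print(F.shape)   # (4, 9): one mean luminance value per block
print(F[0, 0])   # mean of the top-left 2x1 block: (0 + 9) / 2 = 4.5
```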
2. The mean luminance values computed in step 1 can be visualized as RxC "pixels" in a frame (an extracted feature frame). In other words, these represent the energy of different portions of the frame. A spatial filter with kernel [-1 1] (i.e. taking differences between neighboring blocks in the same row) and a temporal filter with kernel [-α 1] are applied to this sequence of low-resolution gray-scale images.
Hence, if we consider M13 and M14 to be the mean values originating from regions 13 and 14 in the current frame, and M″13 and M″14 to be the mean values coming from the corresponding regions in the next frame, then the value (called the soft sub-fingerprint) is computed as

SftFP13 = {M″14 − M″13} − α·{M14 − M13}

3. The sign of SftFPn determines the value of the bit in the sub-fingerprint. More specifically, bitn = 0 if SftFPn < 0, and bitn = 1 if SftFPn ≥ 0.

Summarizing, and more precisely, we have for r = 1, 2, ..., R and c = 1, 2, ..., C:

B(r,c,p) = 1 if Q(r,c,p) ≥ 0, and B(r,c,p) = 0 if Q(r,c,p) < 0,

where

Q(r,c,p) = (F(r,c+1,p) − F(r,c,p)) − α·(F(r,c+1,p−1) − F(r,c,p−1)).

This algorithm is called the "differential block luminance algorithm". It yields a sequence of sub-fingerprints, one sub-fingerprint for each of the "source" image frames it acts on, the bits of those sub-fingerprints being given by B(r,c,p) above.
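Steps 2 and 3, the spatio-temporal filtering and sign thresholding, can be sketched as follows for a pair of consecutive mean-luminance frames; the toy 2x3 frames are illustrative values only.

```python
import numpy as np

# Sketch of steps 2-3: spatio-temporal filtering of consecutive mean-luminance
# frames F_prev (frame p-1) and F_cur (frame p), each R x C, followed by
# thresholding on the sign of Q to obtain the sub-fingerprint bits.

def sub_fingerprint_bits(F_cur, F_prev, alpha=1.0):
    # Q(r,c,p) = (F(r,c+1,p) - F(r,c,p)) - alpha * (F(r,c+1,p-1) - F(r,c,p-1))
    q = np.diff(F_cur, axis=1) - alpha * np.diff(F_prev, axis=1)
    return (q >= 0).astype(int)        # B = 1 where Q >= 0, else 0

F_prev = np.array([[10., 12., 11.], [20., 19., 25.]])
F_cur  = np.array([[10., 15., 11.], [20., 18., 25.]])
print(sub_fingerprint_bits(F_cur, F_prev))  # [[1 0]
                                            #  [0 1]]
```

Note that each row of C block means yields C−1 differences, so a sub-fingerprint carries R·(C−1) bits per frame.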
In this algorithm, alpha can be considered to be a weighting factor, representing the degree to which values in the "next" frame are taken into account. Different embodiments may use different values for alpha. In certain embodiments, alpha equals 1, for example.
We shall now discuss the problem of robustness against variable frame rate in relation to the above algorithm. In motion pictures, television, and computer video displays, the frame rate is the number of frames or images that are projected or displayed per second. Frame rates are used in synchronizing audio and pictures, whether film, television, or video. Frame rates of 24, 25 and 30 frames per second are common, each having uses in different portions of the industry. In the U.S., the professional frame rate for motion pictures is 24 frames per second and, for television, 30 frames per second. However, these frame rates vary because different standards are followed in video broadcasting throughout the world. The basic differential block luminance fingerprint extraction algorithm described above works on a frame-by-frame basis. Hence, the sub-fingerprint generation rate is the same as the frame rate provided by the video source; e.g. if fingerprints are extracted from a movie being broadcast in the USA, 30 sub-fingerprints would be extracted per second. Therefore, the corresponding fingerprint block stored in the database would represent 256 / 30 = 8.53s of video. If a video query from Europe is given to the system, it would have a frame rate of 25Hz. In this case, a fingerprint block would represent 256 / 25 = 10.24s of video. In principle, these two fingerprint blocks would not match each other as they represent two different time frames. Looking at this in general terms, a fingerprint system may provide essentially two functions. Firstly, fingerprints are generated for storage in a database. Secondly, fingerprints are generated from a video query for identification purposes. In general, if the video sources in these two stages have frame rates v and μ respectively, then the fingerprint blocks (consisting of 256 sub-fingerprints) in these two cases would represent (256/v) seconds and (256/μ) seconds of video respectively. These time frames are different, and hence the sub-fingerprints generated during these durations come from different frames. Hence, they would not match.
A modification of the basic differential block mean luminance algorithm, to provide a degree of frame rate robustness, is described below. However, it should be noted that the basic algorithm, unmodified, can be used in certain embodiments of the invention to produce frame-rate-robust fingerprints. In such embodiments, rather than the basic algorithm being provided with the source video signal having the source frame rate, a frame-rate-converted video signal is generated from the source signal, and that frame-rate-converted signal (having the predetermined, independent frame rate) is fed into the basic algorithm. Acting on this rate-converted "source", the basic algorithm is then able to generate sub-fingerprints which also have the predetermined, independent rate.
It will be appreciated, however, that considerable processing is required to produce a full frame-rate-converted version of the original video signal (with the data content, i.e. size, of each of the "converted" frames being substantially the same as that of an original frame). This may preclude use of the technique to generate sub-fingerprints in real time, and slows the process of video identification. Thus, although it is possible to use an unmodified differential mean luminance algorithm in certain embodiments, it is generally preferred to use the modified algorithm, as follows. Frame rate robustness in embodiments of the invention is incorporated by generating sub-fingerprints at a constant rate irrespective of the frame rate of the video source. The two most common frame rates of video are 25 (PAL) and 30 (NTSC) Hz. One choice for a predetermined sub-fingerprint generation rate would then be the mean of these two, i.e. (25 + 30) / 2 = 27.5Hz. Hence, a fingerprint block formed from 256 sub-fingerprints generated at this rate would represent 256 / 27.5 = 9.3s of video. In some applications of video fingerprinting (such as television commercial blocking), a higher granularity might be required. Hence, in certain embodiments, an alternative (higher) frequency of 27.5 x 2 = 55Hz is used for fingerprint generation. The further examples mentioned below use this frequency of fingerprint extraction (but it will be appreciated that this frequency is itself just one example, and further embodiments may utilise different predetermined frequencies).
In order to incorporate frame rate robustness in the differential block mean luminance algorithm, changes are made between steps 1 and 2 of the algorithm described above. If the frequency of the video source is v Hz, then the sequence F(r, c, p) ... F(r, c, p + v) is interpolated to 55 Hz. This process leads to the generation of 55 sub-fingerprints every second (except the first second, where 54 sub-fingerprints would be generated, as p ≥ 1). This makes the sub-fingerprint generation independent of the video source's frame rate. The sub-fingerprints generated now represent the frames in terms of a constant time frame, irrespective of the time frame of the video source. Fig. 8 illustrates the scenario explained above. Suppose the video source has a frequency of 25 Hz. Hence, F(r, c, 2) and F(r, c, 3) represent the mean frames at times 2/25 and 3/25 respectively. The mean frames F(r, c, 4), F(r, c, 5), F(r, c, 6) and F(r, c, 7) represent the linearly interpolated mean frames at times 4/55, 5/55, 6/55 and 7/55 respectively. In other words, the contents of these linearly interpolated mean frames have been constructed by calculation from the contents of the mean frames that were obtained directly from the source frame sequence. Thus, the modified algorithm comprises the generation of a sequence of extracted feature frames (containing mean luminance values) having the predetermined frame rate (55Hz in this example), the contents of those frames being derived from the contents of the source frames (via the sequence of directly extracted feature frames) by a process comprising interpolation (where necessary). Although linear interpolation is used in the above example, other interpolation techniques may be used in alternative embodiments.
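A sketch of this interpolation step for a toy mean-frame sequence follows. The boundary bookkeeping is simplified (the text's exact off-by-one behaviour, 54 sub-fingerprints in the first second, is not reproduced): here the 55 Hz grid is simply clipped to the time span covered by the source frames.

```python
import numpy as np

# Sketch of the frame-rate normalisation step: block-mean frames sampled at
# the source rate v are linearly interpolated onto a fixed 55 Hz time grid,
# making sub-fingerprint generation independent of the source frame rate.

def resample_to_55hz(mean_frames, src_rate, target_rate=55.0):
    mean_frames = np.asarray(mean_frames, dtype=float)
    n = len(mean_frames)
    src_t = np.arange(n) / src_rate                 # source sample times
    tgt_t = np.arange(int(src_t[-1] * target_rate) + 1) / target_rate
    flat = mean_frames.reshape(n, -1)               # interpolate per block
    out = np.column_stack([np.interp(tgt_t, src_t, col) for col in flat.T])
    return out.reshape((len(tgt_t),) + mean_frames.shape[1:])

# one second of 25 Hz "mean frames", each holding a single block value 0..24
src = np.arange(25.0).reshape(25, 1, 1)
out = resample_to_55hz(src, src_rate=25.0)
print(out.shape[0])          # 53 grid points over the 24/25 s source span
print(float(out[11, 0, 0]))  # value at t = 11/55 = 0.2 s: 25 * 0.2 = 5.0
```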
Properties of the fingerprints resulting from the modified differential block mean luminance algorithm described above (using interpolation to produce extracted feature frames at the predetermined rate) have been analyzed, including performing tests to evaluate the bit error rate due to various transformations discussed above. In tests, a searching strategy as described above (using toggling of bits) was used to look for close matches of fingerprints of original versions and fingerprints of transformed versions, in addition to searches for exact matches. The following features were noticed from the results:
A good degree of frame-rate robustness was achieved. However, horizontal scaling and vertical scaling, if large, could lead to high BERs. This can be understood from the fact that during horizontal and vertical scaling, the pixels in the frame move into neighboring blocks. This results in the calculation of a different mean. The effect of horizontal scaling is more prominent because the blocks are smaller horizontally than vertically. Hence the means do not change much in the case of vertical scaling, and this results in a lower BER.
Like scaling, large rotations could result in a high BER as well. Clips which were stationary or had large amounts of dark regions tended to yield lower BERs compared to their fast and bright counterparts.
In certain cases it was not possible to find even a single exact match when the transformations were as severe as a large amount of scaling or rotation. However, in the case of rotation, it was possible to find close matches. Also, in the case of compression to a very low bit rate, the number of close matches went up substantially. Toggling the weak bits in order to find a close match helps to increase the robustness of the algorithm against various transformations.
Thus, although the above-described fingerprint generation method, using the modified differential block mean luminance algorithm, provides much improved frame rate robustness with regard to prior art techniques, tests indicated that the algorithm was vulnerable to high amounts of scaling and rotation. Further modifications have therefore been made to the algorithm, and are described below. The modifications aimed to make the algorithm more robust to scaling and rotation in particular.
A first further modification will be described as a Centrally-Oriented Differential Block Luminance Algorithm. This algorithm differs from the previous one in that it takes into consideration more representative features of the frame. In order to do so, it extracts the fingerprints from central portions of the video frame. Development of this modified algorithm was based on an appreciation of the following: a) It was noticed from use of the previous algorithm that black portions of the frame contributed very little information to the fingerprints. However, many video formats are 'letterboxed'. Letterboxing is the practice of copying widescreen film to video formats while preserving the original aspect ratio. Since the video display most often has a squarer aspect ratio than the original film, the resulting master must include masked-off areas above and below the picture area (these are often referred to as "black bars", resembling a letterbox slot). The reliability of the fingerprints can be increased by not taking the fingerprints of these areas. b) Generally, most of the movement in a video frame is centre-oriented. This can be understood from the fact that the cameraman will focus his camera towards the centre of the scene being shot. c) Sometimes, movies contain subtitles at the bottom of each frame.
These subtitles are generally constant over a number of frames and do not contribute any significant information to the fingerprint. d) Movies can also contain logos at the top which remain constant for the entire length of the movie. These logos are also present in different movies under the same production banner.
Taking these factors into account, the centrally oriented differential block mean luminance algorithm is very similar to the differential block luminance algorithm.
However, the centrally oriented algorithm differs in the step where it divides a source frame into blocks. Instead of dividing the entire frame into blocks, the blocks or regions 21 are defined as shown in Fig. 9. Thus, only a central portion of the frame 20 has been divided into blocks 21; the portions in the outskirts of the frame have not been used. This helps in improving reliability. Having divided the frames into blocks in this way, the remainder of the algorithm calculates a sequence of sub-fingerprints in exactly the same way as the previously described algorithm. Thus, the mean of the luminance values in each of the blocks/regions is calculated, resulting in 36 mean values for each frame (36 is just an example, however; a different number of blocks may again be used). Similarly, the mean values are collected from the next frame. Frame rate robustness may be incorporated at this stage by constructing interpolated mean-frames to form the sequence at the desired, predetermined frame rate (and, indeed, the subsequent results for the CODBLA are based on the algorithm including the frame rate robustness feature).
Tests have been performed to analyze the performance of the centrally oriented differential block luminance algorithm (CODBLA) with respect to the previous full-frame (non-centrally oriented) differential block luminance algorithm (again, incorporating frame rate robustness) (DBLA). The performance of the CODBLA was found to be better, in terms of the robustness of the resultant fingerprints, in certain cases, for example in the case of transformations comprising cropping or shifts. This result can be understood because the top portions of video frames generally do not have much movement and hence do not contribute much information. Also, the CODBLA is particularly suited to the fingerprinting of video that is in letterboxed format.
Building on the principle of the CODBLA (concentrating on the central portions of the frame), the fingerprint extraction algorithm was further modified to improve robustness to scaling and rotational transformations. This yielded the Differential Pie-Block Luminance Algorithm (DPBLA), as follows. The Differential Pie-Block Luminance Algorithm differs from the previous ones in that it takes into consideration the geometry of the video frame. It extracts features from the frame in blocks shaped like sectors, which are more resistant to scaling and shifting. In the CODBLA the means of luminance were extracted from rectangular blocks. These means were representative of that portion of the frame and provided a representative bit (in a sub-fingerprint) after spatio-temporal filtering and thresholding. A sequence of these bits represented a frame. However, the use of rectangular blocks is vulnerable to scaling. Hence, when the video frame is scaled, the portions of the frame covered by the blocks are also scaled and do not represent the original portions uniquely. Hence, in the DPBLA the means (i.e. mean luminance values or data) are extracted from portions of the frame which are shaped like sectors of a circle and are resistant to horizontal scaling. In other words, in the DPBLA, the step of dividing a frame into blocks comprises dividing the frame into blocks as shown in Fig. 10. Apart from this difference in the block division step, the DPBLA operates to generate sub-fingerprints from the luminances of pixels in the blocks in the same way as the DBLA and the CODBLA. In this particular example of the DPBLA, the video frame 20 is divided into 33 "blocks" 21 in order to extract 32 values by the clockwise spatial differential explained below. The blocks are now shaped like the sectors of a circle. The uniform increase in the area of the sectors in the radial direction makes them more resistant to scaling.
It may be noticed that the portions in the outskirts of the frame have not been used (so this particular DPBLA is also centrally oriented). Also, the central portion of the frame, represented in the form of a circle, has not been used for calculating means. This portion is highly vulnerable to scaling, shifting and even small amounts of rotation. Excluding it helps in improving reliability. Each of the numbers represents a corresponding region in the input video frame. The mean of the luminance values in each of these regions is calculated. This process results in 33 mean values.
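A sketch of the pie-block division follows, assigning each pixel of a central annulus to one of 33 angular sectors around the frame centre, and excluding both the inner circle and the frame outskirts. The inner and outer radii used here are illustrative assumptions; the text does not give exact dimensions for Fig. 10.

```python
import numpy as np

# Sketch of the pie-block division: each pixel in a central annulus is
# assigned to one of n_sectors angular sectors; pixels inside the inner
# circle or beyond the outer radius are excluded (label -1).

def sector_labels(H, W, n_sectors=33, r_inner=0.15, r_outer=0.45):
    y, x = np.mgrid[0:H, 0:W]
    dy, dx = y - (H - 1) / 2.0, x - (W - 1) / 2.0
    r = np.hypot(dx, dy) / min(H, W)                 # normalised radius
    theta = np.mod(np.arctan2(dy, dx), 2 * np.pi)    # angle in [0, 2*pi)
    labels = (theta / (2 * np.pi) * n_sectors).astype(int)
    labels[(r < r_inner) | (r > r_outer)] = -1       # excluded pixels
    return labels

def sector_means(frame, labels, n_sectors=33):
    # mean luminance of each sector-shaped "block"
    return np.array([frame[labels == s].mean() for s in range(n_sectors)])

frame = np.random.default_rng(0).random((72, 96))    # toy luminance frame
labels = sector_labels(72, 96)
means = sector_means(frame, labels)
print(len(means))  # 33 mean luminance values, one per sector
```

The subsequent differential and thresholding steps then operate on these 33 sector means exactly as on the rectangular block means.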
The frame rate robustness is applied at this stage to get the interpolated mean-frames. This procedure has been described in detail above, and will not be repeated here. Unlike the previous two algorithms, in this case a small difference is that the frames are represented as F(n, p) instead of as F(r, c, p). Hence the mean frames are interpolated likewise. The computed mean luminance values in step 1 can be visualized as 33 "pixel regions" in a frame. In other words, these represent the energy of different regions of the frame. A spatial filter with kernel [-1 1] (i.e. taking differences between neighboring sector blocks) and a temporal filter with kernel [-1 1], as explained, are applied to this sequence of low-resolution gray-scale images. Hence, if we consider M13 and M14 to be the mean values originating from regions 13 and 14 in the current frame, and M″13 and M″14 to be the mean values coming from the corresponding regions in the next frame, then the value (called the soft sub-fingerprint) is computed as

SftFP13 = {M″14 − M″13} − {M14 − M13}

or, in general,

SftFPn = {F(n+1, p) − F(n, p)} − {F(n+1, p−1) − F(n, p−1)} for n = 1 to 32.

3. The sign of SftFPn determines the value of the bit. More specifically, for n = 1, ..., 32:

bitn = 0 if SftFPn < 0, and bitn = 1 if SftFPn ≥ 0.
Tests have been performed to analyze the performance of the Differential Pie-Block Luminance Algorithm without rotation compensation (DPBLA1) with respect to the Centrally Oriented Differential Block Luminance Algorithm (CODBLA). In terms of equal scaling in both directions and horizontal scaling, the pie algorithm performs better. However, it is vulnerable to rotation, vertical scaling and upward shift. The vulnerability to a large amount of rotation can be understood because rotation causes the sectors to change in the spatial domain, and hence each of the sub-fingerprint bits is affected.
In order to make the DPBLA resilient to rotation, a further modification can be made: a compensation factor is used in the algorithm. The mean of a particular region now also incorporates partial sums of the means of adjacent regions. This helps in increasing robustness against rotation, while increasing the standard deviation of the inter-class BER distribution only slightly. The algorithm also offers improved robustness towards vertical scaling. Hence, the version of the pie-block algorithm with rotation compensation provides a significant improvement in finding a close match between fingerprints of original and transformed signals.
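The compensation can be sketched as follows. The blending weight `alpha` and the wrap-around neighbourhood of angular sectors are assumptions for illustration; the text does not specify the exact partial sums used:

```python
def compensate_rotation(means, alpha=0.5):
    """Blend each pie-sector mean with a fraction of its two angular
    neighbours.  A small rotation moves luminance between adjacent
    sectors, so mixing in the neighbours makes each per-region value
    change more slowly under rotation.  `alpha` and the symmetric
    two-neighbour sum are illustrative assumptions only."""
    n = len(means)
    return [means[k] + alpha * (means[(k - 1) % n] + means[(k + 1) % n])
            for k in range(n)]
```

The compensated means then feed into the same spatio-temporal filtering and thresholding as before.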
Some conclusions that can be drawn from the analysis are as follows. The pie differential block luminance algorithm with rotation compensation performs better than the centrally oriented differential block luminance algorithm in most cases. The inter- and intra-class BER distributions show that it serves as a better classification tool than the centrally oriented differential block luminance algorithm. For applications where there is less likelihood of the video being modified (like broadcast monitoring on television, selective recording and commercial filtering), this algorithm can perform better than the ones discussed before. However, it is more vulnerable to rotation. This is because even a small amount of rotation changes the fingerprints significantly, and these changes may be aggravated by other omnipresent transforms like compression and changes in brightness levels.

Another algorithm used in embodiments of the invention will now be described. It shall be referred to as the Differential Variable Size Block Luminance Algorithm (DVSBLA). As background, we recall that the centrally oriented differential block luminance algorithm was vulnerable to large amounts of rotation and scaling, while the pie differential block luminance algorithm with rotation compensation yielded fingerprints that were highly robust against scaling but vulnerable to rotation. In this description of the DVSBLA, we describe how the performance of the centrally oriented differential block luminance algorithm can be improved against transformations like scaling and shifting by using a variable size for the luminance blocks.
In the basic CODBLA described above, the luminance means are extracted from rectangular blocks. These means are representative of that portion of the frame and provide a representative bit after spatio-temporal filtering and thresholding. However, during geometric transformations, the regions that are affected the most are the ones lying on the outskirts of the processed video frame, and these regions most often result in weak bits. Hence, if these regions are made larger, the probability of getting weak bits from them is reduced substantially.
The DVSBLA extraction algorithm is similar to the CODBLA. However, in the DVSBLA the regions (blocks 21) are defined as shown in Fig. 11. The sizes of the various blocks in this particular example are given in tables 1 and 2 below, expressed as percentages of the frame width or height. The remainders represent the area to be left out on either side.
Table 1: The table shows the sizes of various columns in the differential variable size block luminance algorithm.
Table 2: The table shows the sizes of various rows in the differential variable size block luminance algorithm.
The blocks are rectangular, just like those used in the centrally oriented differential block luminance algorithm, but they are now of variable size: the block size decreases steadily towards the centre of the video frame. The geometric increase in the area of the rectangles away from the centre of the frame provides more coverage for the outer regions, which are the ones most affected by geometric transformations like cropping, scaling and rotation. In the case of shifting, all the regions are affected equally. It may be noticed that the portions in the outskirts of the frame are not used; this helps in improving reliability by producing fewer weak bits.
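The variable-size layout can be sketched as follows, converting per-column (or per-row) block sizes given as percentages into pixel boundaries. Any percentage values passed in are illustrative only; the actual sizes are those of tables 1 and 2:

```python
def block_edges(total, pct_sizes):
    """Convert per-column (or per-row) block sizes, given as percentages
    of the frame width (or height), into pixel boundaries.

    pct_sizes lists the block sizes from one edge of the frame to the
    other, larger at the edges and smaller towards the centre.  Any
    remainder is split equally between the two margins, which are left
    out of the fingerprint.
    """
    margin = (100.0 - sum(pct_sizes)) / 2.0
    edges = [round(total * margin / 100.0)]
    acc = margin
    for size in pct_sizes:
        acc += size
        edges.append(round(total * acc / 100.0))
    return edges
```

For a 100-pixel-wide frame with illustrative column sizes of 20, 10, 10 and 20 percent, this leaves a 20 percent margin on each side and gives larger outer blocks, matching the qualitative layout described above.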
The frame rate robustness is applied at this stage to get the interpolated mean frames. This procedure has been described in detail above. The sub-fingerprints are then derived from the sequence of mean frames (at the predetermined rate, constructed using interpolation) in the same way as described above in relation to the DBLA and CODBLA.
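The frame-rate conversion step can be sketched as follows. Linear interpolation between neighbouring mean frames is used here for illustration; the interpolation procedure itself is described in detail earlier in the document:

```python
def interpolate_mean_frames(frames, src_rate, dst_rate):
    """Resample a sequence of mean frames (each a list of region means)
    from the source frame rate to the predetermined sub-fingerprint
    rate, so that fingerprints are comparable across sources with
    different frame rates.  Linear interpolation is one possible
    choice, used here as an illustrative sketch."""
    duration = (len(frames) - 1) / src_rate
    n_out = int(duration * dst_rate) + 1
    out = []
    for i in range(n_out):
        t = i / dst_rate * src_rate        # position in source-frame units
        j = min(int(t), len(frames) - 2)   # index of the earlier frame
        frac = t - j                       # fractional distance to the next
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(frames[j], frames[j + 1])])
    return out
```

For example, two mean frames at 1 frame/s resampled to 2 frames/s yield three output frames, the middle one halfway between its neighbours.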
Analysis of the performance of the DVSBLA, looking at BERs for the wide variety of transformations, has indicated that the BERs have decreased significantly compared to the version with fixed block size. The algorithm has thus become more robust towards all kinds of transformations. The DVSBLA provides more resistance to weak bits (resulting from border portions) by providing those portions with a larger area.
Indeed, tests have indicated that, for certain applications, the differential block luminance algorithm with variable size blocks performs better than all the other algorithms discussed so far (being equally reliable and more robust). For applications where there is a high likelihood of the video being modified (like p2p file sharing of cam prints of movies), this algorithm can perform better than the ones discussed before.
Having tested the four major algorithms described above, their relative performance can be summarised as follows:
Robustness of the video fingerprinting system is related to the reliability of the algorithm in correctly identifying a transformed version of a video sequence. The performance of various algorithms in terms of robustness against various transformations is listed in table 3 below.
Table 3: The table shows the qualitative performance of the four algorithms with respect to various geometric transformations and other processing on video sequences.
It may be noted that the differential variable size block luminance algorithm (DVSBLA) performs particularly well in terms of robustness. Hence, a fingerprinting system using DVSBLA shall be highly robust against various transformations. However, it will be appreciated that each of the four algorithms in the table (all of which incorporate frame rate robustness by extracting sub-fingerprints at the predetermined rate) provides improved robustness over prior art techniques for at least some of the various types of transformation.
The reliability of a video fingerprinting system is related to the false acceptance rate of the system. In order to find the false acceptance rate of the various algorithms, their inter-class BER distributions were studied. It was noticed that each distribution closely followed the normal distribution. Hence, assuming the distribution to be normal, the standard deviation and percentage of outliers were computed. The standard deviation thus computed gives an idea of the theoretical false acceptance rate of the system. These parameters are shown in table 4, below, for the four algorithms.
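Under the normality assumption, the theoretical false acceptance rate for a single fingerprint comparison can be computed from the inter-class BER statistics as the probability mass of the normal distribution below the decision threshold. This is a sketch: the default mean and standard deviation below are placeholders, the measured values being those of table 4:

```python
import math

def false_acceptance_rate(threshold, mean=0.5, std=0.05):
    """Theoretical false acceptance rate for one fingerprint comparison,
    assuming the inter-class BER distribution is normal with the given
    mean and standard deviation (placeholder values; the measured ones
    appear in table 4).  A match is declared when the BER falls below
    `threshold`, so the FAR is the normal tail probability below it."""
    z = (threshold - mean) / std
    # P(X < threshold) for X ~ N(mean, std^2), via the complementary
    # error function: Phi(z) = 0.5 * erfc(-z / sqrt(2)).
    return 0.5 * math.erfc(-z / math.sqrt(2))
```

Lowering the match threshold thus trades a higher false rejection rate for an exponentially lower false acceptance rate.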
Table 4: The table shows the parameters obtained from the inter-class BER distributions for the four algorithms.
It may be noted that the differential pie block luminance algorithm with rotation compensation (DPBLA2) has very good figures. However, the differential variable size block luminance algorithm (DVSBLA) is close, and can outperform DPBLA2 in certain applications due to its high robustness. Hence, a fingerprint system based on DVSBLA shall have a very low false acceptance rate.
The fingerprint data rate for all the algorithms is constant at 880 bps. Hence, for storing fingerprints corresponding to 5000 hours of video, 3960 MB of storage is needed. However, for various applications, fingerprints corresponding to different amounts of video need to be stored in the database. Table 5 below illustrates a typical storage scenario for the various applications discussed above.
Table 5: The table shows the approximate storage requirements for fingerprints in the various applications discussed above.

In practice, these storage requirements can be handled very well by the search algorithm described above. Hence, the storage requirements of video fingerprinting systems embodying the invention are practical. With regard to granularity, the results show that a video fingerprinting system embodying the invention can reliably identify video from a sequence of approximately 5 s duration.
The search speed for a database consisting of 24 hours of video has been estimated to be of the order of 100 ms.

From the above description it will be appreciated that certain video fingerprinting systems embodying the invention consist of a fingerprint extraction module and a search module that searches for such a fingerprint in a fingerprint database. In embodiments of the invention, sub-fingerprints are extracted at a constant frequency on a frame-by-frame basis (irrespective of the frame rate of the video source). These sub-fingerprints in certain embodiments are obtained from energy differences along both the time and the space axes. Investigations reveal that the sequence of such sub-fingerprints contains enough information to uniquely identify a video sequence.
In certain embodiments, the search module uses a search strategy for "matching" video fingerprints based on matching methods as described in WO 02/065782, for example. This search strategy does not use a naïve brute-force search approach, because it is impossible to produce results in real time that way, due to the huge number of fingerprints in the database. Also, an exact bit-copy of the fingerprints may not be given as input to the search module, as the input video query might have undergone several image or video transformations (intentionally or unintentionally). Therefore, the search module uses the strength of the bits in the fingerprint (computed during fingerprint extraction) to estimate their respective reliability, and toggles them accordingly to get a fair (not exact) match.
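The weak-bit aspect of this strategy can be sketched as follows: the least reliable bits of a sub-fingerprint (those with the smallest soft sub-fingerprint magnitudes) are toggled to generate the nearby candidate words to look up in the database index. This is an illustrative sketch of the approach of WO 02/065782, not its exact implementation:

```python
from itertools import combinations

def candidate_lookups(subfp, reliability, n_weak=3):
    """Candidate sub-fingerprint values to look up in the database index:
    the extracted 32-bit word itself, plus every variant obtained by
    toggling subsets of its n_weak least reliable bits.

    subfp       : 32-bit sub-fingerprint as an int.
    reliability : per-bit reliability scores, e.g. the soft
                  sub-fingerprint magnitudes from extraction.
    """
    weak = sorted(range(32), key=lambda b: reliability[b])[:n_weak]
    candidates = [subfp]
    for r in range(1, n_weak + 1):
        for bits in combinations(weak, r):
            variant = subfp
            for b in bits:
                variant ^= 1 << b    # toggle one unreliable bit
            candidates.append(variant)
    return candidates
```

Each candidate hit in the index then seeds a full-fingerprint BER comparison, which keeps the search far cheaper than brute force.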
Algorithms with better performance have been designed, investigated and tested on a large scale. Video fingerprinting systems embodying the invention have been tested and found to be highly reliable, needing just 5 s of video in certain cases to identify the clip correctly. The storage requirement for fingerprints corresponding to 5000 hours of video in certain examples has been approximately 4 GB. Search modules in certain systems have been found to work well enough to produce results in real time (of the order of ms). Fingerprinting systems embodying the invention have also been found to be highly scalable, deployable on Windows, Linux and other UNIX-like platforms. Certain video fingerprinting systems embodying the invention have also been optimized for performance by using MMX instructions to exploit the inherent parallelism in the algorithms they use.
It will be appreciated that embodiments of the invention, by extracting sub-fingerprints at a constant, predetermined rate, irrespective of and different from the frame rate of the source signal, provide the advantage that the resultant fingerprints are robust indications of the content of the source information signal with respect to source frame rate. Thus, common content between two signals having different frame rates may be recognised. Particular embodiments of the invention provide additional robustness with respect to other factors.

It will be appreciated that throughout the present specification, including the claims, the words "comprising" and "comprises" are to be interpreted in the sense that they do not exclude other elements or steps. Also, it will be appreciated that "a" or "an" do not exclude a plurality, and that a single processor or other unit may fulfil the functions of several units, functional blocks or stages as recited in the description or claims. It will also be appreciated that reference signs in the claims shall not be construed as limiting the scope of the claims.

Claims

CLAIMS:
1. A method of generating a fingerprint (1) indicative of a content of an information signal (2), the information signal comprising a first sequence of data frames (20) having a first frame rate, the method comprising: computing a sequence (3) of sub-fingerprints (30) from the first sequence of frames, the sequence of sub-fingerprints having a predetermined rate independent of the first frame rate, and each sub-fingerprint being derived from and dependent upon a data content of at least one frame of the information signal; and concatenating the sub-fingerprints to form the fingerprint (1).
2. A method in accordance with claim 1, wherein the step of computing the sequence of sub-fingerprints comprises: computing a second sequence (4) of data frames (40) from the first sequence, the second sequence of frames having said predetermined rate and the data content of each of the second sequence of frames being derived from the data content of at least one of the first sequence of frames; and computing the sequence of sub-fingerprints from the second sequence of data frames.
3. A method in accordance with claim 2, wherein the data contents of the second sequence of frames are derived from the data contents of the first sequence of frames by a process comprising interpolation.
4. A method in accordance with claim 2, wherein the step of computing the sequence of sub-fingerprints from the second sequence of data frames comprises deriving a sub-fingerprint from the data contents of a plurality of frames of the second sequence.
5. A method in accordance with claim 2, wherein the step of computing the second sequence of frames comprises: computing a sequence (5) of extracted feature data frames (50) from the first sequence, the sequence of extracted feature frames having the first frame rate, and each extracted feature frame containing feature data indicative of at least one feature of a respective one of the first sequence of frames; and computing the second sequence (4) of frames from the sequence of extracted feature frames, the data contents of the second sequence of frames being derived from the feature data contained in the sequence of extracted feature frames.
6. A method in accordance with claim 5, wherein the data contents of the second sequence of frames are derived from the feature data contained in the sequence of extracted feature frames by a process comprising interpolation.
7. A method in accordance with claim 5, wherein the step of computing the sequence of extracted feature frames comprises: dividing each of the first sequence of frames into a plurality of blocks (21); calculating a feature of each block; and using the calculated block feature data to produce the sequence (5) of extracted feature frames (50) such that each extracted feature frame contains the calculated block feature data for each of the plurality of blocks of the respective one of the first sequence of frames.
8. A method in accordance with claim 7, wherein the information signal (2) is a video signal and each of the first sequence of frames contains data on a plurality of pixels, and each block (21) comprises the data on a respective group of said pixels.
9. A method in accordance with claim 8, wherein the step of calculating a feature of each block comprises calculating a mean property of the respective group of pixels of each block.
10. A method in accordance with claim 9, wherein said mean property is a mean luminance.
11. A method in accordance with claim 4, wherein the information signal is a compressed information signal, each of the first sequence of frames comprising compressed data divided into blocks, and the step of computing a sequence of extracted feature data frames from the first sequence comprises extracting a selected portion of the data from each block.
12. A method in accordance with claim 11, wherein the compressed information signal is a compressed video signal, each block of compressed video data contains data indicative of a mean property of a corresponding group of pixels, and the step of extracting a selected portion comprises extracting said data indicative of a mean property from each block.
13. A method in accordance with claim 1, wherein the step of computing the sequence of sub-fingerprints comprises generating the sequence of sub-fingerprints in real time, at said predetermined rate.
14. A method in accordance with claim 1, further comprising the step of identifying the content of the information signal by comparing the fingerprint with a plurality of fingerprints stored in a database (65), each of said stored fingerprints comprising a respective sequence of sub-fingerprints having said predetermined rate.
15. A method of identifying whether or not two information signals have common content, the first information signal comprising a sequence of data frames having a first frame rate and the second information signal comprising a sequence of data frames having a second, different frame rate, the method comprising the steps of: generating a first fingerprint of the first information signal using a method in accordance with claim 1, the first fingerprint comprising sub-fingerprints having said predetermined rate; generating a second fingerprint of the second information signal using a method in accordance with claim 1, the second fingerprint comprising sub-fingerprints having said same predetermined rate; and comparing said fingerprints.
16. Use of a method in accordance with claim 1 to generate a fingerprint of an information signal comprising a first sequence of data frames having a frame rate lower than said predetermined rate.
17. Use of a method in accordance with claim 1 to generate a fingerprint of an information signal comprising a first sequence of data frames having a frame rate higher than said predetermined rate.
18. Signal processing apparatus arranged to receive an information signal comprising a first sequence of data frames having a first frame rate, and to generate a fingerprint indicative of a content of the information signal using a method in accordance with claim 1.
19. A computer program enabling the carrying out of a method in accordance with claim 1.
20. A record carrier on which a computer program in accordance with claim 19 is stored.
21. A broadcast monitoring method comprising: receiving broadcast information signals each comprising a respective sequence of data frames having a respective frame rate; generating respective fingerprints of the received signals using a method in accordance with claim 1; comparing the generated fingerprints with fingerprints stored in a database (65) to determine the identities of the contents of the information signals; and generating play lists using the determined identities.
22. A filtering method comprising: generating a fingerprint of an information signal using a method in accordance with claim 1; comparing the generated fingerprint with fingerprints stored in a database (65) to determine an identity of the content of the information signal; and using the determined identity to determine whether to allow or block transmission of the information signal to a destination.
23. An automatic indexing method comprising: generating a fingerprint of an information signal stored in a file using a method in accordance with claim 1; comparing the generated fingerprint with fingerprints stored in a database (65) to determine an identity of the content of the information signal; and using the determined identity to label the file.
24. A selective recording method comprising: generating a fingerprint of an information signal using a method in accordance with claim 1; comparing the generated fingerprint with fingerprints stored in a database (65) to determine an identity of the content of the information signal; and using the determined identity to determine whether to record the information signal.
25. A method of detecting tampering with or an error in transmission of an information signal, the method comprising: generating a fingerprint of an information signal using a method in accordance with claim 1; comparing the generated fingerprint with fingerprints stored in a database (65); and using the results of the comparison to determine whether or not the information signal is a tampered signal or if a transmission error has occurred.
PCT/IB2007/052368 2006-06-20 2007-06-20 Generating fingerprints of information signals WO2007148290A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP06115733.5 2006-06-20
EP06115733 2006-06-20

Publications (2)

Publication Number Publication Date
WO2007148290A2 true WO2007148290A2 (en) 2007-12-27
WO2007148290A3 WO2007148290A3 (en) 2008-03-13

Family

ID=38698341



Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002065782A1 (en) * 2001-02-12 2002-08-22 Koninklijke Philips Electronics N.V. Generating and matching hashes of multimedia content
EP1667093A2 (en) * 2004-12-02 2006-06-07 Hitachi, Ltd. Frame rate conversion device, image display apparatus, and method of converting frame rate






Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07789743

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 Ep: pct application non-entry in european phase

Ref document number: 07789743

Country of ref document: EP

Kind code of ref document: A2