GB2455280A

GB2455280A - Generating a security signature based on face analysis

Info

Publication number: GB2455280A
Application number: GB0720526A
Authority: GB
Inventors: Baolin Tan
Original assignee: Dwight Cavendish Systems Ltd
Current assignee: Dwight Cavendish Systems Ltd
Priority date: 2007-10-22
Filing date: 2007-10-22
Publication date: 2009-06-10
Also published as: GB0720526D0; WO2009053685A1

Abstract

A method of generating a security signature for video content comprises analysing a segment of video content, having a run time t, to determine one or more faces in the content, and outputting the analysis results as a data signature describing at least part of the video segment. An associated apparatus, comprising a face detection and analysis module, is also independently claimed. The method may be used to control access to video content over a network via comparison with stored data signatures. Various facial properties may be used to form the signature, such as face temporal or position coordinates, presence or absence, coterminous appearances of faces, and face identity; faces appearing less than a predetermined number of times may be discarded. Unlike other protection techniques, there is no requirement to embed identification data, such as a watermark, in the video content. Furthermore, as face information in the video content cannot be altered without detriment to the content entertainment value, the technique is robust to protection countermeasures.

Description

Method and apparatus for generating a security signature The invention concerns a method and apparatus for generating a security signature, in particular one that can be used to control access to video content over a network.

The rise of video piracy over the internet is a major source of concern and revenue loss for content providers. As home access speeds to the internet and the number of total people connected to the internet via "broadband" internet access increase, the potential volume and convenience of this method of piracy also increases.

A method is required which can automatically detect video content which consists of an unauthorised copy of a copyrighted film, television programme, or other commercial audio/visual work.

Various methods have been proposed. Some incorporate a watermark into the audio or video content before it is originally sold to the public, and have detectors to look for these watermarks in potential pirate copies. Other methods require no modification to the content, but instead build up a database of "fingerprints": short numerical representations of the content which can be generated from the content using a specific algorithm. When potential pirate copies are found, the same algorithm produces a fingerprint from the pirate copy, and if this fingerprint is found in the database then the content can be identified, and appropriate action taken.

In practice, the matches do not always need to be exact, and a given piece of content may have multiple discrete "fingerprints" for different sections; this is one possible method of enabling even short excerpts to be identified.

Several audio fingerprint technologies have been developed, and some video fingerprint technologies have also been brought to the market.

S

The challenge for all such technologies is that they must reliably construct a matching fingerprint from the pirated content, even though that content may be dramatically altered from the original. Specifically, video encoding can degrade the original footage, introducing all kinds of video coding artefacts, e.g. blockiness, pixilation, stuttering, mosquito noise etc. Further, casual users or determined video pirates may modify the content, e.g. by cropping, rotating, zooming, resizing, low pass filtering, median filtering, noise addition, affine transformation, changing brig htness/ contrast! gamma! colour etc. Many of these can be accidentally or intentionally introduced by pointing a video camera at another display or projection screen, and so are inherently present in content pirated in a movie theatre or cinema.

We have therefore appreciated that there is a need for an improved rneinuu of deieciing video content, so that pirated copies can oe iaentifiea.

Summary of Invention

The invention is defined in the independent claims to which reference should now be made. Advantageous features are set out in the dependent claims.

The invention provides a security signature for video content that is generated based on the faces of the protagonists appearing in the content.

In a further embodiment, the signature is then used in a technique for identifying pirated video content on a network. Unlike other protection techniques, there is no requirement to embed identification data, such as a watermark, in the video content for the protection to take effect.

Furthermore, as the face information of the video content cannot be altered without detriment to the entertainment value of the content, the technique is robust to protection countermeasures that might modify the content and avoid the protection. Various properties of detected faces may be used to form the signature, such as the presence or absence of faces,

S

coterminous appearances of faces, the location of faces, and the identity of faces.

Brief Description of the Drawings

A preferred embodiment of the invention will now be described by way of example, and with reference to the drawings in which: Figure 1 is a schematic illustration of the components of a first embodiment of the invention; Figure 2 is a flowchart illustrating the operation of the first embodiment of the invention; Figure 3 is an illustration of face analysis used in a face detection technique; Figure 4 is an illustration of an analysis step in the face detection technique; Figure 5 is a schematic illustration of face detection in two frames of video content.

Figure 6 illustrates an additional face detection technique; Figure 7 is an illustrative diagram showing one example of data that may be used to describe video data; Figure 8 is an illustrative example of a database structure for use in the system of figure 1; Figure 9 is a schematic illustration of the components of a second embodiment of the invention;

Detailed Description of the Preferred Embodiments

The present invention seeks to use a specific aspect of the content for identification via fingerprinting. The chosen aspect is one which cannot easily be destroyed, either accidentally or intentionally, without destroying much of the entertainment value of the content. The present invention uses the faces of the protagonists, namely the actors or other individuals appearing in the video content, as a means to identify the content.

The face detection technique could be combined with text recognition techniques run on the same video content, such as identifying the title of the video content, if a title is displayed on screen towards the beginning of the content, or from any credits appearing at the beginning or the end.

However, not all video content comprises text, and pirated feature films or movies could be edited so that the text is obscured. Thus, face detentinn provides a technique that is robust and that cannot easily be circumvented.

A first preferred embodiment of the invention will now be described with respect to Figures 1 to 7.

As illustrated in Figure 1, the first preferred embodiment comprises face detection and authentication modules 2 deployed on a server architecture 4, and/or a video player architecture 6 respectively. The video player architecture 6 may be video player software such as that provided on a Personal Computer, or may be in dedicated video players such as home electronics equipment like DVD players/recorders or hard disk player/recorders. Additionally, a rights authentication module 8 is illustrated that is accessible to the server 4 and player 6 via a network, such as the Internet. The various modules may be implemented as one or more software programs interacting with the server or host computer network as appropriate for the server or computer platform, as will be known in the art. Alternatively, the modules may be implemented fully or in part as hardware.

Face detection and authorisation modules 2 comprise conversion moduLe 10, dedicated face detection module 12 and authorisation module 14.

Rights authentication module comprises a determination module 16 and a database 18. The database 18 contains data describing various instances of video content, in particular feature films, movies, television programmes. Such data preferably includes at least one fingerprint or signature identifying the video content, as well as textual data or image data describing the content or acting as a reference material in connection with it.

A broad overview of the operation of the system will now be discussed with refrenr.e tn Fi9iir 7 In ctp S7 cnnverion mrwli ile 10 re!ves video content from a video content source and in step s4 converts it into a format suitable for parsing by the face detection and analysis module 12 In step s6, face detection and analysis module 12 parses at least a portion of the video data to extract data about the video content, and generate a signature. The extracted signature is then transmitted to the rights authentication module 8 for verification.

The determination module 16 receives the extracted video content data from the face detection and analysis module 12, as well as an indication of the IP address from which the information is transmitted, and in step slO, compares this with the information stored in the database. The IP address will be that of the server or the personal computer on which the face detection and authentication module is housed, and so can be extracted from the transmission to the rights management module.

The extracted video content data is compared with the information stored in the database to first identify what feature film, movie or television programme title has been received by the server 4 or the player 6 from the video content source. If there is a match with data stored in the database, the IP address is compared with those stored in the database to determine whether or not the IP address is registered as an authorised holder or player of the video content. The results of the determination are then transmitted to the authorisation module 14 Authorisation module either allows further use of the video content to continue, in step s12, or blocks further use of the video content in step s14 If use of the video content is blocked, the authorisation module, or the determination module may, in step s16 take further action to enforce the rights in video content, such as deleting the video content so that it cannot be used, requesting payment before the video content can be used, or forwarding the IP address of the unauthorised server or player to a copyright enforcement service. This list is not exhaustive.

Each of the modules will now be described in more detail.

1) Conversion Module The conversion module is arranged to receive the video content in its original file encoding, such as DivX, WMV, AVI, 0n2, FLV, MOV, or MPG for example and convert it to a file encoding suitable for use with the face detection and analysis module. Preferably, this encoding is MPEG-2 This ensures that the face detection module and analysis receives video data in a pre-determined format, and so allows the technology to work with any input video content. Video encoding conversion facilities are widely known in the art, and so shall not be explained in detail here.

A further advantage of using a conversion module is to parse the video in a more compact manner than would be possible with uncompressed video. However, where the original file encoding can be decompressed directly before the face detection and analysis stage, the conversion module may be omitted.

S

The conversion module is preferably located on the server 4 such that it receives any video content uploaded to the server for storage, and passes the converted video content to the face analysis and detection module In this way, all video content will be checked, and action can be taken if deemed necessary.

If the conversion module is located on a client machine, then it is preferably provided as part of a video content player or browser. Such programmes can be modified using plug-ins or components that alter their functionality.

The output of the conversion module is data in MPEG 2 format that represents frames of the received video content. It will be appreciated that the output could be in file form, or as streamed data, and need not be actually displayed in order for further processing and analysis tn nt-rilr 2) Face detection and analysis module The face detection and analysis receives this data from the conversion module, and parses it in order to identify the primary characteristics of the video content. Preferably the video content is analysed frame by frame. In particular, the face detection and analysis module detects the faces of the protagonists in the pixel domain of the video content, as well as other information such as coordinates describing the position of the detected faces in the frames of video content. Face colour, edge and movement of the faces may also be used.

When considering a typical Hollywood film, there are key factors associated with the appearance of faces in this content: 1 Who is in the film; 2. When and where they appear on-screen; 3. Who appears on screen simultaneously; 4. Their movements on screen.

Existing commercial face recognition algorithms are not currently capable of assessing a 120 minute Hollywood film and delivering all this information accurately. However, this may change in the future, and if suitable software were to become available, it could be used in the preferred embodiment.

A preferred face detection method will now be described by way of reference, although it will be appreciated that various techniques exist and could be used within the scope of the claimed invention.

The method involves breaking the image down into rectangles and ubquiiy uuIIlpdrirlg ihe average brightness in two, three, or tour adjacent rectangles of an image. The features of these rectangles which are appropriate to detecting faces (e.g. choice of size, shape, and which is the brighter rectangle in a "face" as opposed to an area of image which is not a "face") can be determined by a machine learning algorithm fed with suitable training data, namely pictures correctly sorted into those which are of faces and those which are not.

The learning algorithm uses the Adaboost technique. This works by finding approximate solutions to the problem (weak hypotheses) and then concentrating on the cases where the solution was incorrect, that is false positives (a non-face is identified as a face) and false negatives (a face is identified as a non face). Mathematically, each of the possible solutions (rectangle areas that can be compared) are weighted and summed, and the weights are increased for those solutions which correct previous mistakes. It may take about six rounds of processing (running through the entire dataset) to reach a reasonable solution which allows the vast majority of possible rectangle comparisons to be ignored, and those most important to face recognition to be used in order of importance.

This may be understood by reference to Figure 3. This shows the two most important rectangle comparisons for detecting faces as judged by the application of the Adaboost technique. The most important feature is that the rectangle covering the eye region is darker than the rectangle immediately below it. The second most important feature is that the rectangle covering the bridge of the noise is lighter than the rectangles either side of it. It is important to understand that these rectangles have not been chosen manually; rather these have been chosen from the set of all possible rectangle combinations (many millions) using the Adaboost technique to learn from the training data of faces and non-faces. Also, in practice, comparing the brightness in the rectangles comprises subtracting the brightness in one rectangle from the brightness in the other(s), and thresholding the result, where a result above the threshold indicates a face, and a result below the threshold indicates a non-face. The threshold is determined from the training data set to best delineate faces from non-faces.

When the chosen rectangle comparisons are used to find faces in real images, two methods speed up the process greatly.

Firstly, the input images are transformed into integral images' where each pixel in the integral image is the sum of all the pixels above and to the left of that pixel in the original image. Hence the brightness of arbitrary rectangles from the original image can be calculated from just four pixels in the integral image (rather than summing all the relevant pixels in the original image).

This may be understood by reference to Figure 4. We wish to know the average brightness of rectangle D, having corners with coordinates (xl,yl), (x2,y2), (x3,y3), (x4,y4). Note that since xl = x3, x2 =x4, yl = y2,

I

and y3 = y4 only four different numbers are required to define the coordinates of this rectangle, but the present nomenclature is chosen to simplify references to the corners of the rectangle, which are key to the integral image. With only the original images, we would have to add together the brightness of every pixel within rectangle D, and divide by the number of pixels within the rectangle. If we were to work with a full resolution "PAL" TV signal as typically broadcast on digital television, or delivered via the DVD format, the largest rectangle would be nearly half a million pixels. In practice we shall work with much lower resolution images, but the number of pixels summed is still wasteful. Consider instead the integral image of Figure 4. Let us call the pixel in the integral image at (xl,yl) (i.e. the sum of all pixels above and to the left of pixel (xl,yl) in the original image) l(xl,yl). Hence l(xl,yl) is the sum of all the pixels in rectangle A of the original image. l(x2,y2) is the sum of all the pixels in rectangle A + B of the original image. In the same way, l(x3,y3) corresponds to A + C, and l(x4,y4) corresponds to A + B + C + D. Hence it can be shown by simple algebra that the sum of all pixels in rectangle D may be computed thus: I(x4,x4)+!(xl,yl)-(I(x2,y2)+I(x3,y3)) A�B+C+D+A-(A+B+A+C) =2A+B+C+D-(2A +B�C) =D Hence, the sum of all pixels in rectangle D, however large, can be calculated by the addition and subtraction of 4 pixel values from the integral image.

To calculate the average pixel brightness, we divide by the number of pixels. In fact there is no necessity to divide by the number of pixels, and a more efficient approach can be to work with absolute totals, and simply to scale the thresholds against which these are judged when looking over larger areas than the original dataset.

Secondly, the rectangle classifiers are "cascaded". The first level is "designed" (via the Adaboost algorithm, as described above) to return many false positives, but very few false negatives. In other words, during training, false negatives are the undesired result, which must be aggressively corrected, while corrections to false positives are less strongly weighted. The resulting "solutions" (rectangle comparisons) weed out the portions of the image which definitely do not contain faces, making subsequent levels in the cascade much faster, since they consider a smaller amount of data. In subsequent lower levels many more rectangle classifiers can be used simultaneously, given that most picture areas will have been rejected at higher levels.

We have found that the recognition of faces that are viewed head on, and those that are viewed from the side, is advantageously dealt with by two pai decikni puueess. ii is possibie to use a separate training oata set in parallel to enable the identification of faces viewed from the side, and this is done here. The result is that two broadly separate identification processes must be run in parallel. This is not essential, but is beneficial in improving the accuracy of face tracking.

The entire process works across the entire image, at a number of different scales. Hence there can be many "faces" found for each real face. Any found "faces" that overlap are taken to be multiple hits (correct finds) of the same real face, and so are combined.

Extra steps can be incorporated since the source is video, rather than a still image. False positives can be rejected if they appear to move around at random, and/or appear/disappear at random. False negatives may be caught by movement (including lip movement, potentially correlated with speech in the audio track), translation or rotation, and removal of occlusions (things which partially obstruct the face temporarily). The summation of useful information and the rejection of discontinuous

I

(unwanted or erroneous) information over several frames can be accomplished via a condensation algorithm.

The above is useful but costly in computation time. On balance, apart from rudimentary condensation over frames, which must be achieved in some of the implementations anyway, these steps are best left out.

A technique that we have found improves the speed of the processing is described next. At a given scale (or zoom), faces are comprised of "details". "Details" are encoded with higher frequency DCT coefficients (or comparable) in the video data (e.g. MPEG-2). At a given scale, there is no need to search blocks of the image where only low frequency (or just DC) coefficients are encoded, since these blocks are devoid of details. This can reduce the "search space" considerably, and efficiently, since this processing can be applied to the oriainal encoded video data vn hfnr it is decoded. It will often be possible to skip parts of frames, and sometimes possible to skip entire frames, based on this approach. If parts of the video are not decoded (and hence not processed), this is a significant saving.

It will be appreciated that the face detection and analysis module could scan all of the video content, or just a segment of video content having a run time of t, where t is less than or equal to the total length of the video content T. Scanning a segment does however assume that the segment itself is indicative of the video content as a whole, that is it displays at least one key protagonists and lasts for enough time for the position of the face throughout the segment to constitute a signature for the content.

The preferred embodiment therefore operates as follows: firstly, the face detection module is used to locate "faces" throughout the content. The module merely finds a "face", and may provide information about its orientation (i.e. the direction the person is looking); it does not identify the face at this stage This is illustrated schematically in Figure 5, where two faces are indicated as A and B, these are allocated identities ID#1 and lD#2, and their positions within the video content are described by coordinates x, y and t. Other coordinates and coordinates systems may be used as described below. Figure 5 shows the two faces in two frames before and after movement from initial positions A0 and BO to final positions A and B. Post-processing is employed to ignore "minor" faces (i.e. smaller faces) in frames where there are more than a pre-set number of "faces" found, otherwise crowd scenes (for example) may overload the process Also, as there are thousands of frames (i.e. individual images) in the film, some resilience to errors in the face detection module can be achieved by discarding "faces" that appear only for a handful of frames, since these are either faces that only appear momentarily or other imne fetiirs fIciy detected as "faces". However, this stage may be skipped (depending on how the data will be used subsequently) as the same "cameos" or "mistakes" will be detected when examining a pirated version of the content.

Looking at the resulting data frame by frame allows each "face" to be tracked. This data can be used in one of two ways.

The simplest implementation, and one that works for a reasonably small data set, such as several movies or large clip durations, that is pirate copies of many minutes of a movie, is to form the fingerprint simply from the number of faces and their locations.

We have found that the location or coordinate data for each face can be quantised to discrete square/regions (e.g. based on a 24x24 grid).

Preferably, the coordinate data is also transformed into a relative rather than absolute positional indicators (e.g distance and direction relative to other faces, rather than absolute location on screen -one notation is in

I

the form of normalised angular co-ordinates, rather than normalised or absolute XY cord mates) to make it robust to cropping, rotation etc. If the location of one face is also stored absolutely relative to top left, or centre of the image, this can be used to make an easier identification on content without significant cropping, rotation etc. If a "face" is detected within a short distance of a "face" in the previous frame, it can be assumed to be the same face. The movement of the "faces" with time can be described in a compact manner using motion vectors to approximate the movement of each face over several frames.

Thus the output for a sequence of frames is a set of location and motion vector data for each major "face" on screen at that time.

It is possible to extract some of this information directly from motion ir'fnr in th nrrr1rI,ir1r trm ( n th MPFr,_7 n'nrIinri\ hi it -...--.---.------potential advances in video compression technology, or a multiplicity of video formats to deal with make this less practical than simply relying on the data gathered from the face detection stage. In any case, this should only be seen as a possible speed up, since the quality of motion vectors from the encoded video stream is likely to be lower than that which is generated by analysing the output from the face detection algorithm.

In a more complex and robust implementation, which improves accuracy with shorter clips, a face recognition algorithm is run on the "faces" found by the face detection process. The objective is not necessarily to identify who the face belongs to, but to match the faces found in each frame of the film with faces in other frames of the film. Thus, where a particular face appears for 5 minutes, is absent for several minutes, and then appears again for another 10 minutes, this is detected and the fact that both appearances are of "the same face" is recorded in the database.

In this embodiment, each "face" is tagged with an ID which is consistent within that content. e.g. ID1 appears 32 seconds into the movie 15% up and 32% right from the centre of the screen, moves this way and that, and then disappears for several minutes before returning. 1D2 appears 40 seconds into the movie, at 15 degrees from IDI, and 30% screen height away, moves this way then that, then remains stationary. The actual binary representation has been found to be quite compact, with either fixed or variable length fields storing this data in a compact, quantised manner.

Each number above (apart from time) requires only 8 bits to store, typically much less.

With improved face recognition software, or manual intervention, it is possible to add a real name to each "ID", so Movie="Top Gun", ID1="Tom Cruise", etc could be included in the database. To aid the face recognition software in this process, it should not simply learn what "Tom Cruise" looks like, but what "Tom Cruise" looks like in "Top Gun", and what "Tom C,uist" iuoks iike iii Th Lasi ainurai as inese are quite airrerent.

This is illustrated in Figure 6. The first two blocks in the process illustrate detecting a face in the video content, and detecting its orientation. The face detection module may also extract information about the features of the face and match these with features stored in the database.

Once the face detection and analysis module has completed scanning the video content and has amassed data to describe it, the data is output. The output data represents a signature or fingerprint identifying the video content that was scanned.

Ensuring that the signature or fingerprint is unique is a matter of increasing the detail of the descriptive data elements in the data output, while balancing this against the increased processing and storage costs associated with more data.

Thus, a signature may comprise one or more of the following data features, as shown in Figure 7: a) frame numbers or time describing presence/absence of a face with ID number #n; b) frame numbers or time describing presence/absence of a face with ID number #n; C) indications of when face with ID #n and ID #m share frames d) position within frame (absolute or relative) of a face with ID number #n or#m; e) identity of a detected face.

This list is not exhaustive but is intended to show the preferred possibilities for signature generation. This generated signature or fingerprint is preferably transmitted to the determination module 16 on the rights authentication module 8, where comparison of the generated signature is made with pre-generated and stored signatures for known video content.

The IP address of the server or computer on which the player is located is also transmitted to the rights authentication module.

3) Rights Authentication Module Rights authentication module comprises determination module 16 and a database 18. Determination module 16 receives the generated signature from the face detection and analysis module, and compares it with pre-generated signatures in the database 18.

It is assumed that signatures will be stored in the database for nearly all commercial video content that is available and that is to be protected via the preferred system. It is also assumed that the signatures stored in the database are generated by substantially the same face detection module as is provided in the server and the player. In this way, determining the identity of the video content received at the server 4 or being played on player 6 is simply a matter of matching one signature with another. As the signature itself is data representing at least the presence of a face and the coordinates of the face in the video content matching signatures is furthermore a matter of comparing corresponding data elements in the two signatures and determining with data elements match.

Even if the signatures are generated using substantially the same face detection module, it is unlikely that the pre-generated signature for known video content stored in the database will match exactly with the signature generated by the face detection modules in the server 4 or player 6. One reason for this that even slight differences in the video content received at the server or player, and the original video content, can produce different signatures: slightly modified colour or brightness values can alter the difficulty with which faces are detected, and can affect where the f any detected aw U,dwII, screen sizes and coorainates of detected faces can be affected where the video content received at the server or player is a separate camera recording of video content being played back; and if the video content has been streamed, then gaps in the received data may affect the timing when faces are deemed to be present.

For this reason, an exact match between two signatures is often out of the question, and what is required instead is a confidence level that the generated signature is sufficiently similar to the prestored signature to be deemed a match. A 60% correlation between the two signatures, or between parts of the signatures, allowing for some distortion may for example be acceptable.

For face detection, the data set forming the signature can be compared against that in the database by means of an equation or correlation, such as that given below.

For example, in face detection techniques where the position of a face is used, the tolerance in the position (r) of the face can be given as follows: r (((xl i-x2i)2 + (yl i-y2i)2)112) / ((X02+ Y02)112) where xli and yl i are the x and y coordinates of a face on the screen of each frame of the video content stored in the database, x21 and y2i are the corresponding coordinates in the generated signature, and XO and Y0 are the resolution of the screen for the X and Y axes.

If r=O it means there is a 100% match, while increasing r implies a decreasing match.

4) Database As well as pre-stored signatures for identifying video content, the database also stores information relating to rights management. This is illustrated schematically in Figure 8 to which reference should be made. Broadly speaking the database stores the name. title or other text for the video content 20, information identifying authorised rights holders 22, that is those who have the right though payment of a fee or otherwise to store the video content or play it, as well as data defining the video content signature 24.

The information 22 identifying authorised rights holders may be one or more of individual or corporate name, postal or street address, but most importantly should contain at least one of Internet protocol, email or web address to identify the location where authorised storage or playback of the video content can take place on the network. For example, if a rights holder is authorised to play back or store Film #1 on their server or personal computer, then detection of protected video content at that IP address, email account or web address will not trigger the protection, while detection at a different IP address, email account or web address will.

The information identifying the authorised rights holder may optionally include account or debiting information. In some configurations, it may be desired to levy a fee from the right holder each time the video content is played or recorded.

The name, title or text information is optional 20, but is useful so that a human operator can view the database contents and see at a glance which video content is protected and who are the authorised rights holders.

Following on from the above, the video content signature data 24 may comprise any of a) to e) mentioned above. Where pie-generated signatures have already been recorded in the database for each video content, the data will logically be arranged in rows of the database, wherein each row corresponds to a video content title.

The determination module 16, on receiving a signature from the face detection module in the server or player, checks the database to find a matching entry. This can happen at any level, from using just the face detection data, to using the guessed ID names.

In a simple implementation, just the number of faces on screen and their motion vectors are scanned to find a possible match. Appropriate sorting of the data can improve the sorting time, such as by beginning with the only signatures having a corresponding number of protagonists. For this reason, it can be beneficial to arrange the signatures (and the video content entries) according to the number of protagonists identified in the signature (that is just 1 protagonist, then 2 then 3 people and so on), then within each group, creating sub groups for those signatures representing no movement, a little movement, a lot of movement etc. Caching of common results can also improve this, so that the most frequently found fingerprints are at the top of the database search to be hit first, easily and efficiently. This is possible because in a given week a high percentage of pirated content will be from the top 10 to 20 movies, but there could be tens of thousands of fingerprints in the data base.

In the more developed implementation of the face detection module, where the identity of the faces is part of the signature, it makes more sense to work down from the detected IDs to concentrate on those signatures indicating those protagonists. If necessary, on screen times, combinations of them, positions and motion vectors could also be checked.

The advantages of the more developed approach are many. Firstly, this process will work independently of any transforms which leave the face recognisable, including acceptable brightness, contrast, gamma, and colour changes, low pass filtering, addition of noise etc. It is to be anticipated that certain extreme changes of this type could make the face unrecognisable, but such changes would also make the content unwatchable.

Secondly, the other attacks on the content are, broadly speaking, attacks on its geometry. By their nature, face detection and recognition themselves can work around "geometrical distortions" since the act of pointing a camera at a person creates a geometrically distorted view of the face, and such distortion can be changed completely simply by moving the camera. Hence the algorithms are already built to cope with this. However, the motion vectors are susceptible to such distortion, but because they already exist in the geometric domain, they can be "corrected" for such distortion. If it is minor, it will probably be lost in the quantisation. If it is major, it can easily be checked for. There is no need to "know" the distortion -simply assume that typical cropping, rotation etc may have occurred, and as well as checking the actual motion vectors from the pirated content against the database, check ones corrected for typical geometric distortions. If the vectors are stored in suitable co-ordinate spaces, then such correction can be done using simple linear geometry. It may make sense purely from computational efficiency versus storage requirements to store the location and motion vectors in two different geometries, such as Cartesian (XY) or angular (R/theta) to enable efficient calculation of all possible geometric distortions.

Since the distortion is geometric, and the data is geometric, this is an advantage: we're working in the domain that may be distorted, and so can easily correct for it. This contrasts with many other techniques (especially early watermarking techniques) where the only way to combat geometric distortion is to correct it in the linear video domain (by processing large amounts of data) before applying the algorithm, since a geometric distortion essentially destroys the data in the domain in which the algorithm works, rather than simply changing it in a predictable way, as here.

Based on the signature received from the face detection and analysis module, and the information stored in the dths th rPtPrmin9tion module makes a determination as to whether the instance of video content detected is authorised or legal. This determination is then transmitted to authorisation module 14 located on the server or player. If the video content is determined to be illegal, the authorisation module 14 may take any number of actions, as discussed below.

If the detected instance of video content is found to be illegal, the authorisation module may block access to the video content. In this context, access includes playback, storage, duplication, transmission or manipulation of the video content. Thus at the player, the authorisation module may simply prevent the player from playing back the content, while at the server, the authorisation module may prevent the video content being stored, or may delete any stored copies.

Other measures may be taken, depending on different rights management policies. For example, the authorisation module may simply cause a warning message to be displayed at the server or player, warning a user that the video content is potentially illegal or pirated material. The display of the warning may be logged as future evidence in any copyright infringement action. Alternatively, the warning may request that the user contact the entity responsible for maintaining the rights authorisation module to negotiate access. This may involve the payment of a fee. It is also desirable if the proprietors of the video content are notified, as in many piracy cases, they do not know when and where their content is being reproduced illegally.

Preferably, the database and determination module is located on a dedicated server across the network. In this setting, the face detection is configured to transit the signature to the determination module, and the authorisation module receives a determination from the authorisation module in return. This arrangement benefits from the fact the database is likely to be quite large, and can be more efficiently stored centrally.

However, in alternative embodiments, the database and the determination may be located at the server or at the player. In order to accommodate them, it may be necessary to have a reduced size implementation of the data signature or analysis, such as use of face identity where possible.

Periodic updates of information to the server or player could be used to install new information.

A second embodiment of the invention will now be illustrated with respect to Figure 9. The second embodiment comprises a module 30 for scanning video content available on websites, and determining whether the display or manipulation of that content on the website is authorised or constitutes a breach of the video content owner's rights.

The second embodiment comprises similar elements of functionality to the first, and where possible these are given the same reference numbers in Figure 9, to aid clarity. Thus, the second embodiment can be seen to comprise a rights authorisation module 8 having determination module 16 and database 18. Additionally, scanner 30 comprises conversion module

S

10, face detection module 12, and authorisation module 14. It also comprises scanning module 32.

Scanning module 32 scans websites on the Internet 34 for new video content. Such websites may be stored on computers 36 or servers 38. The websites being scanned may be limited to those sites known to carry video content, or further to those that carry video content and are known to disregard copyright issues surrounding such content. The site may be scanned simply by noting each new file that comes on line, or by scanning for the titles of known video content: film titles for example, especially those of newly released, or about-to-be released films.

If scanning module 32 finds a new file, or finds a film title of interest, it downloads the file or part of the file to conversion module 10 and subsequently to face detection module 12 for analysis. As before the fr.e detection module produces a signature which is passed to determination module for matching with the information stored in the database 18.

Preferably, the database contains additional information that allows a decision to be taken regarding suitable action. In connection with a film title, the database may: a) state that the title is not released beyond cinemas (so should not be on websites) b) state that the title is released for sale (so should only be official/licensed websites) c) state that the content has been released for wider distribution (such as a movie trailer that can be copied) d) state that the content is historic and no longer of significant commercial interest.

As before, if the determination module concludes from analysis of the database that the video content is unauthorised, a number of actions may be taken, such as denial of service, action via the Internet Service Provider, legal action, or action to corrupt the illegal content.

A further alternative embodiment is monitor traffic within the internet infrastructure, such as at an Internet Service Provider or caching service, at a node backbone, or intercontinental connection. Video content packets passing through the monitoring point would be analysed in the manner described above and action taken. Method and apparatus for generating a security signature The invention concerns a method and apparatus for generating a security signature, in particular one that can be used to control access to video content over a network.

S

Summary of Invention

S

Brief Description of the Drawings

Detailed Description of the Preferred Embodiments

Each of the modules will now be described in more detail.

S

I

S

A further alternative embodiment is monitor traffic within the internet infrastructure, such as at an Internet Service Provider or caching service, at a node backbone, or intercontinental connection. Video content packets passing through the monitoring point would be analysed in the manner described above and action taken.

Claims

Claims 1. A method of generating a security signature for video content, comprising: analysing a segment of video content, having a segment run time t, to determine one or more protagonist faces in the content; outputting the results of the analysis as a data signature describing at least part of the video content segment.
2. A method of controlling access to video content over a network, comprising generating a signature according to claim 1, and comparing the output data signature with stored data signatures to determine what the video content is, and whether access to the video content is allowed.
3. The method of claim 1 or 2, wherein the analysing step includes determining coordinates defining the position of the one or more protagonist faces within the segment.
4. The method of claim 3, wherein the coordinates include temporal coordinates.
5. The method of claim 4 or 5, wherein the coordinates include positional coordinates.
6. The method of claim 3, 4, or 5 wherein the analysing step comprises: i) determining a first protagonist face and coordinates defining the position of the first protagonist face within the segment; ii) determining a second protagonist face and coordinates defining the position of the second protagonist face within the segment;

I

wherein the coordinates are expressed as relative coordinates of the first protagonist face to the second.
7. The method of any preceding claim, wherein the analysis step comprises determining the presence of a face in the segment of video content, and recording the periods for which said face is present or absent.
8. The method of any preceding claim, wherein the analysis step comprises determining the identity of the face.
9. The method of any preceding claim, comprising discarding detected faces that appear in less than a predetermined number of frames of the video content.
10. The method of any preceding claim, comprising discarding the smaller detecied laces in lidriles wheie piuwiiy ul idS aw ULLtiJ.
11. The method of claim 2, wherein access includes playback, storage, duplication, transmission or manipulation of the video content.
12. The method of claim 2, wherein determining whether access to the video content is to be allowed is solely based on the output data signature, and is entirely without reference to any other data embedded in the video content for purposes of data protection.
13. An apparatus for generating a security signature for video content, comprising a face detection and analysis module arranged to: a) receive a segment of video content having a segment run time t; b) analyse the segment of video content to determine one or more protagonist faces in the content; C) outputting the results of the analysis as a data signature describing at least part of the video content segment.

S
14. An apparatus for controlling access to video content over a network, comprising an apparatus for generating a signature according to claim 1, and a further module arranged to compare the output data signature with stored data signatures to determine what the video content is, and whether access to the video content is allowed.
15. A computer program product for generating a security signature for video content, having a computer readable medium on which code is stored, wherein said code when executed by a processor causes the processor to: analyse a segment of video content, having a segment run time t, to determine one or more protagonist faces in the content; output the results of the analysis as a data signature describing at least part of the video content segment.
16. A computer program product for controlling access to video content over a network generating a security signature for video content, having a computer readable medium on which code is stored, wherein said code when executed by a processor causes the processor to a) generate a signature according to claim 15; b) compare the generated data signature with stored data signatures to determine what the video content is, and whether access to the video content is allowed

16. A computer program product for controlling access to video content over a network generating a security signature for video content, having a computer readable medium on which code is stored, wherein said code when executed by a processor causes the processor to a) generate a signature according to claim 15; b) compare the generated data signature with stored data signatures to determine what the video content is, and whether access to the video content is allowed

Claims 1. A method of generating a security signature for video content, comprising: analysing a segment of video content, having a segment run time t, to determine one or more protagonist faces in the content; outputting the results of the analysis as a data signature describing at least part of the video content segment.

2. A method of controlling access to video content over a network, comprising generating a signature according to claim 1, and comparing the output data signature with stored data signatures to determine what the video content is, and whether access to the video content is allowed.

3. The method of claim 1 or 2, wherein the analysing step includes determining coordinates defining the position of the one or more protagonist faces within the segment.

4. The method of claim 3, wherein the coordinates include temporal coordinates.

5. The method of claim 4 or 5, wherein the coordinates include positional coordinates.

6. The method of claim 3, 4, or 5 wherein the analysing step comprises: i) determining a first protagonist face and coordinates defining the position of the first protagonist face within the segment; ii) determining a second protagonist face and coordinates defining the position of the second protagonist face within the segment;

I

wherein the coordinates are expressed as relative coordinates of the first protagonist face to the second.

7. The method of any preceding claim, wherein the analysis step comprises determining the presence of a face in the segment of video content, and recording the periods for which said face is present or absent.

8. The method of any preceding claim, wherein the analysis step comprises determining the identity of the face.

9. The method of any preceding claim, comprising discarding detected faces that appear in less than a predetermined number of frames of the video content.

10. The method of any preceding claim, comprising discarding the smaller detecied laces in lidriles wheie piuwiiy ul idS aw ULLtiJ.

11. The method of claim 2, wherein access includes playback, storage, duplication, transmission or manipulation of the video content.

12. The method of claim 2, wherein determining whether access to the video content is to be allowed is solely based on the output data signature, and is entirely without reference to any other data embedded in the video content for purposes of data protection.

13. An apparatus for generating a security signature for video content, comprising a face detection and analysis module arranged to: a) receive a segment of video content having a segment run time t; b) analyse the segment of video content to determine one or more protagonist faces in the content; C) outputting the results of the analysis as a data signature describing at least part of the video content segment.

S

14. An apparatus for controlling access to video content over a network, comprising an apparatus for generating a signature according to claim 1, and a further module arranged to compare the output data signature with stored data signatures to determine what the video content is, and whether access to the video content is allowed.

15. A computer program product for generating a security signature for video content, having a computer readable medium on which code is stored, wherein said code when executed by a processor causes the processor to: analyse a segment of video content, having a segment run time t, to determine one or more protagonist faces in the content; output the results of the analysis as a data signature describing at least part of the video content segment.