WO2007036892A1 - Method and apparatus for long term memory model in face detection and recognition - Google Patents

Method and apparatus for long term memory model in face detection and recognition

Info

Publication number
WO2007036892A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
video
importance
faces
doi
Application number
PCT/IB2006/053527
Other languages
French (fr)
Inventor
Shreeharsh Kelkar
Nevenka Dimitrova
Shih-Fu Chang
Original Assignee
Koninklijke Philips Electronics, N.V.
Application filed by Koninklijke Philips Electronics, N.V.
Publication of WO2007036892A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata automatically derived from the content, using objects detected or recognised in the video content
    • G06F16/784 Retrieval characterised by using metadata automatically derived from the content, the detected or recognised objects being people
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata automatically derived from the content, using low-level visual features of the video content
    • G06F16/785 Retrieval characterised by using low-level visual features of the video content, using colour or luminescence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions



Abstract

A system, apparatus and method are provided for a long term memory model for use in face detection and role recognition, wherein the long term memory comprises a reference database (1001) of faces detected in videos that is used to associate roles with faces in videos. An algorithm is included for processing a new video against the reference database to both detect/recognize faces therein and to extend the reference database (1001) to include detected but previously unrecognized faces. A method is also provided for computing a degree of importance (DoI) metric that provides a measure of the importance of a recognized face, along with several methods for using this metric.

Description

METHOD AND APPARATUS FOR LONG TERM MEMORY MODEL IN FACE
DETECTION AND RECOGNITION
The present invention relates to a long term memory model for use in face detection and role recognition wherein the long term memory comprises a reference database of faces detected in videos that is used to associate roles with faces in videos. An algorithm is included for processing a new video against the reference database to both detect/recognize faces therein and to extend the reference database to include detected but previously unrecognized faces. Most face detection and recognition methods do not include any memory models. Face recognition is applied to video as if a video is a sequence of unrelated frames. Each frame is treated as an independent image and face recognition is applied to a particular frame regardless of the previous history of face appearances in the same TV program or home video. There is a lack of continuity and memory models in the recognition phase. As a result, a temporary occlusion, lack of lighting, or a camera flash might severely degrade the performance of the detector/recognizer. Thus, current detectors and recognizers lack robustness.
While face detection and recognition techniques are known, most work well for mug shots but not for general video, e.g., home videos and movies.
In order to solve the face detection/recognition robustness problem and to extend techniques to include general video, the system, apparatus, and method of the present invention provide long-term memory models for both face detection and recognition. A video content analysis technique such as face detection and recognition is combined with concepts from psychology such as mathematical human memory models and the basic principles of video cognition (i.e. how humans perceive video). These novel concepts are applied to TV programs (e.g. situation comedies or sitcoms), taking into consideration the grammar of the underlying program. A degree of importance (DoI) of a face is computed within each of a frame, a shot and a scene. These features are then used for long term memory-based face detection and recognition, where the memory is a reference memory of faces suitable for matching against newly detected faces in order to recognize a known face, or for extension by inclusion therein of a newly detected face. These measures of importance are also used for computing scene boundaries within video programs. This type of boundary corresponds to the theatrical definition of using an actor entrance/exit as a scene boundary.
An algorithm is provided for matching detected faces with those in a reference database of faces and determining if the face is one already recognized or a newly recognized face. The degree of importance information is calculated for each face detected in a video and is used to update the reference database as well as to recognize a face by matching with faces already stored in the reference database. Several options are provided for improving detection and recognition based on memory models. Details of the invention disclosed herein shall be described with the aid of the figures listed below, wherein:
FIG. 1 illustrates a face identification algorithm;
FIG. 2 illustrates an observed 180-degree rule for camera placement in sitcoms;
FIG. 3 illustrates the 180-degree camera placement rule as applied to pairs of static characters exchanging dialog in a sitcom;
FIG. 4 illustrates skin samples from a typical face database;
FIG. 5 illustrates a distribution of skin pixels in a face database used to train a face detection program according to the present invention;
FIG. 6 illustrates the closest Gaussian distribution fitted to the distribution of FIG. 5, according to the present invention;
FIG. 7 illustrates the transformation from gray scale to a binary image showing skin and non-skin areas; a) is the original image, b) is a skin likelihood image, c) is a segmented skin image;
FIG. 8 illustrates a face template;
FIG. 9 illustrates an original image a) and the superposition b) of the skin template of
FIG. 8 on the detected skin areas thereof;
FIG. 10 illustrates an apparatus according to the present invention for using a memory model to recognize faces in video;
FIG. 11 illustrates a system for face detection and role recognition in a video incorporating an apparatus according to the present invention that uses a memory model; and
FIG. 12 illustrates a plot of the DoIs of the faces with frame number for a sitcom episode.
Video (especially films and television videos) can be logically segmented into scenes. While there are many approaches to segmenting a video into scenes, the contents of a scene (or a shot) can be effectively characterized by the characters appearing in the scene. Detecting the characters in a scene can be termed Role Detection. Role Detection may be done using audio, text and video analysis. A preferred embodiment of the present invention uses computer vision techniques (especially face detection). These techniques are combined with concepts from psychology, especially mathematical human memory models and the basic principles of video cognition (i.e. how humans perceive video). In a preferred embodiment, both of these concepts are applied to sitcoms, taking into consideration the grammar of the sitcom. The role detection algorithm is tested on an episode of the popular sitcom "Friends".
Consider how useful automatic role detection can be in browsing an unknown video. For example, browsing a video by character, e.g., query and show all of Phoebe's scenes in an episode of Friends. Also, we can browse and see which character(s) were most dominant for a particular movie or episode or part of the movie (e.g. Phoebe dominated the first half of the episode). First, the video is segmented into its constituent shots and scenes, remembering that the scene structure in videos depends as much on the characters in the scene as on the location. Role detection is used such that a scene is labeled by the most important characters that appear in the scene. This labeling extends to the shot level as well as to the entire video.
This allows the determination of a specified number of most important characters in a video using the importance derived across the entire video for each character.
If a video is completely unknown, then a face of a character is used to characterize each role in the video. Suppose one identifies the five most important faces (roles) for the entire video, the five most important faces for each scene (these could be entirely different from the five most important faces for the video) and even the five most important faces in each shot. Knowing the important characters in each shot/scene facilitates browsing (e.g., browsing a video by character).
In a preferred embodiment, the Rational model of human memory is used; see John Anderson, "Rational Analysis of Memory", in Varieties of Memory and Consciousness: Essays in Honor of Endel Tulving, Lawrence Erlbaum Associates, Publishers (1989). According to this model, human memory behaves as an optimal solution to the information retrieval problems facing humans. Assume a memory structure for an item was introduced t time units ago and has been used n times since; then the probability that this item will be used in the next time unit is

RF(t, n) = 1 - ((t + b) / (t + b + 1))^(v + n)

which, if (t + b) >> (v + n), reduces to

RF(t, n) ≈ (v + n) / (t + b)

where v and b are parameters of the model. What this model implies is that the more recent the memory structure is, or the more recently a particular memory structure is reinforced, the more likely it is to be remembered. This becomes particularly relevant when considering how humans perceive video, especially with respect to the characters in the video.
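For concreteness, the minimal sketch below (in Python) evaluates the retrieval probability as reconstructed above; the parameter values v and b are illustrative placeholders, since the model's parameters are not specified in this disclosure.

```python
# Sketch of the reconstructed Rational-model retrieval probability.
# v and b are hypothetical parameter values chosen for illustration.
def retrieval_probability(t, n, v=1.0, b=1.0):
    """Exact form: 1 - ((t + b) / (t + b + 1))^(v + n)."""
    return 1.0 - ((t + b) / (t + b + 1.0)) ** (v + n)

def retrieval_probability_approx(t, n, v=1.0, b=1.0):
    """Approximation valid when (t + b) >> (v + n)."""
    return (v + n) / (t + b)

# Example: an item introduced 100 time units ago and used 5 times since
print(retrieval_probability(100, 5))         # ~0.057
print(retrieval_probability_approx(100, 5))  # ~0.059
```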
For Role Detection in videos, this model becomes particularly relevant. This is because long-term memory has to be taken into consideration while detecting the important roles in a shot, scene or the entire video itself. Note the probability term RF(t, n) in the expression above. If the face (or equivalently, the role) to be detected is considered a memory structure, the probability that a structure (or face) is likely to be remembered (i.e. RF(t, n)) is a function of the time t since it was introduced and the number of times n the face was 'seen'. In a preferred embodiment, a quantity called the Degree of Importance (DoI), which is analogous to RF(t, n), is modeled for each face. The DoI of a face thus gives the probability that the face will be remembered. Calculating the DoIs of the faces over a scene, shot or video thus provides, in a preferred embodiment, the importance of the face and is equivalent to the probability that the memory structure will be remembered. While watching any movie/television show, the most important characteristic of any role is the face. Obviously the most important characters in the video take up a large part of the video's screen time and are thus easily remembered. Other characters may dominate screen time during some isolated portion of the video, e.g., a lead actor has been hurt and is being operated on by doctors. An isolated scene might focus on the doctor operating on the lead. The doctor, while important for that shot/scene, could hardly be called an important character in the larger context of the entire video.
Thus, while watching a video, the minor characters 'register' themselves in memory but since their screen time is so minimal overall, they quickly 'fade out' of memory. On the other hand, important characters reappear time and again, and are therefore more likely to be remembered due to the reinforcement provided by their reappearances throughout the video.
Another interesting characteristic of video is that the important characters also take up more screen space. Central characters tend to be generally in focus in the center of the screen. This is especially true of films. Dialogue scenes with over-the-shoulder shots, close-ups and point-of-view shots mean that a character's face occupies a large portion of the screen space and thus ensures that the viewer is more likely to remember the character. Minor characters (like, say, the doctor mentioned earlier) are rarely, if ever, shot in close-up.
We thus have two important characteristics of roles in video: time and space, since it is in these terms that humans are more likely to remember the characters. Of time and space, time is definitely more important in evaluating the important roles over the entire video.
However, while detecting roles in localized shots or scenes, the space factor can have important effects.
With these concepts in mind, the following sections disclose what a preferred embodiment of a general Role Detection algorithm consists of and provide a discussion of how this algorithm works. This algorithm can be applied to a wide range of TV programs such as movies, news, talk shows and home video. Then, an example of how a preferred embodiment applies these concepts is presented using sitcoms. Lastly, the algorithm is applied to a typical sitcom episode.
The concepts presented for the sitcom example can also be extrapolated to general TV programs and home video.
Referring now to FIG. 1, the first step in role detection is face detection at step 101.
Once a face is found at step 102, then head tracking is performed to acquire a tracked face segment of the video at step 103. Given a tracked face segment 104, a frontal face calculation is performed at step 105 and when a most frontal face 106 is found, face recognition is performed at step 107 and a recognized face ID is output 108.
While face detection 101, head/face tracking 103 and face recognition 107 seem to be the most important steps in Role Detection, a preferred embodiment incorporates a few modifications in this detection algorithm.
1. A video must first be segmented into shots and the shots then grouped into scenes.
2. The entire video is then analyzed sequentially, scene by scene, and a DoI (Degree of Importance) is computed for every face detected in each scene as an indication of how relevant the particular face (which here represents the character) is to the scene.
3. The characters (or faces) whose DoI values exceed a pre-determined threshold are deduced to be the principal characters in a particular scene/shot.
Detecting faces in movies/television is more complicated than detecting faces in a news video. It is rarely possible to get a frontal view of the face in a movie. It could be covered by some objects, like a handkerchief, a helmet, goggles, etc. Moreover, for some particular scenes, a frontal face never occurs but just the back of the head appears throughout the whole scene. Using the face detection methods alone, described below, it is almost impossible to detect such a face. Here, in a preferred embodiment, other techniques, such as screenplay, audio, or text analysis, or a combination of these, are employed (see: R. Turetsky, N. Dimitrova: Screenplay alignment for closed-system speaker identification and analysis of feature films. IEEE ICME 2004: 1659-1662).
Example: Role Detection in sitcoms
Television programs, especially situational comedies (sitcoms), rely more on their dialogue and ping-pong witticisms for extracting laughs. Because much of comedy is a matter of timing (for instance, two jokes must be spaced out; otherwise the audience, which is laughing because of the first joke, may miss the second one altogether), the visual aspects of sitcoms are strictly conventional (and arranged so that the viewer's attention remains focused only on the dialogue). There are generally no fades, only straight cuts. In particular, matched cuts are used, which means that the cuts are almost unnoticeable. Also, more importantly, sitcoms never focus on inanimate objects or imagery. Because the dialogue is the most important component, a character's face is always in focus.
The situation for films is different. Films rely as much (if not more) on visual imagery as on dialogue. Also films use a variety of visual techniques. Split-screens, fade-ins, fade-outs, rapid crosscutting and editing, etc., are far more likely to be found in films than in television. The simple structure in a sitcom thus makes it easier to apply experimental techniques, which may then, in general be extended to other visual media.
Grammar of the sitcom
Sitcoms generally follow a rigid structure. The settings are generally limited to three, although some episodes may be shot in different locations. However, most of the scenes in an archetypical episode take place in one of these three locations:
1. A first primary character's living room;
2. A second primary character's living room; and
3. An alternate non-domestic location, e.g., a coffee shop.
A 180-degree rule applies to sitcoms, as follows:
180-degree sitcom rule: the director places all cameras on the same side of an imaginary line, called the line of interest, in order to ensure that a left-right orientation is consistent across shots. As discussed above, sitcoms are essentially dialogue-driven. Hence, crudely speaking, most of the scenes consist of two people (or a group of people) speaking to each other. The camera angles are generally restricted to those shown in FIG. 3, where the camera stays on the same side of the line of interest 201 joining any pair of characters 301. The characters 301 are static. In this way, various and ample camera coverage is obtained for two static characters 301 during an exchange of dialogue by these two characters 301.
Sitcoms have straightforward scenes shown one after the other, with two back-to-back scenes generally taking place in different locations. Each scene is also preceded by a long shot of its location; e.g., a plurality of shop/restaurant scenes start with the nametag of the shop/restaurant ('shop/restaurant name') being shown. Thus, the scene structure of sitcoms is rigid. There is no rapid inter-cutting between parallel events to enhance dramatic tension. In a sitcom, a scene is generally contained to one location (e.g. the shop/restaurant) with a limited number of characters that are essentially static, i.e. their movements are limited.
With these assumptions in mind, a preferred embodiment of role detection in sitcoms is disclosed. Since a character is essentially identified by its face, a face detection and tracking algorithm is provided for this purpose. The following processes are required to implement the algorithm:
Shot segmentation
The video is first decomposed into its constituent shots. This is relatively easy for a sitcom since most of the shots are straight cuts. There are hardly any cross-fades, fade-ins or fadeouts in a sitcom. Sitcoms mostly consist of only conversational shots staged one after another in different locations. All the conversations are generally shown in over-the-shoulder shots. Characters rarely move during a shot. Even if they do, the movement is slow, allowing the shot detection algorithm of the present invention to work well.
The shot segmentation algorithm works as follows:
In a preferred embodiment, a color histogram of each frame is first obtained and the difference between the color histograms of two consecutive frames in the video is computed. If this difference exceeds a pre-determined histogram-threshold, then a cut is declared at that frame. Since sitcoms mostly consist of straight cuts, only one threshold is needed.
The simple procedure described above works very well with sitcoms. To make it work even better so that its output is highly accurate, a filtering approach is added in a preferred embodiment. Here, cuts which are detected within three frames of each other are removed and only the lowest-numbered one among them is kept. For example, if cuts are detected at frames 733, 735, 736, 800, 801, 900, ..., then 735, 736 and 801 are removed from the list. Thus the list now reads 733, 800, and 900.
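As an illustration only, the sketch below implements a histogram-difference cut detector of the kind described above; the bin counts, the L1 distance, the threshold value and the three-frame merging window are assumptions chosen for the example, not values prescribed by this disclosure.

```python
# Sketch of the shot-segmentation step: declare a cut where the color
# histograms of consecutive frames differ by more than a threshold, then
# merge cuts that fall within three frames of each other.
import cv2
import numpy as np

def detect_cuts(video_path, hist_threshold=0.35, merge_window=3):
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, frame_no = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # 8x8x8-bin color histogram of the frame, normalized to sum to 1
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = hist.ravel() / hist.sum()
        if prev_hist is not None:
            # L1 distance between consecutive frame histograms
            if np.abs(hist - prev_hist).sum() > hist_threshold:
                cuts.append(frame_no)
        prev_hist, frame_no = hist, frame_no + 1
    cap.release()
    # Filtering step: drop cuts within merge_window frames of a kept cut,
    # keeping only the lowest-numbered frame of each cluster
    filtered = []
    for c in cuts:
        if not filtered or c - filtered[-1] > merge_window:
            filtered.append(c)
    return filtered
```

With the example above, an input list of cuts at frames 733, 735, 736, 800, 801, 900 is filtered down to 733, 800, 900.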
Scene segmentation
There are two alternative embodiments for segmenting a sitcom episode into its scenes. Note that these two embodiments are only applicable to sitcoms because of the rigid structure of the sitcom.
1. A first embodiment of scene segmentation detects a series of a pre-determined number of frames, e.g., 100 frames indicating a montage of about 3 to 4 seconds, which is characterized by the presence of large line objects (generally long shots of buildings) or the absence of faces. Thus a sequence of a pre-determined number of consecutive frames that are characterized by no faces indicates a scene boundary.
2. Audio analysis is used in a second embodiment for scene segmentation. On studying a sitcom, one concludes that there is little or no background music during many of the scenes since the emphasis is entirely on the dialogue and the timing of the dialogue. However, when there is a scene transition, and especially if the next scene is at a different location from the previous scene, there is a brief burst of background music on the soundtrack, which accompanies the 'establishing shot'. Detecting this music is used in this second alternative for automatic scene segmentation of sitcom episodes.
Face Detection
The algorithm for face detection is based in part on Cai, Goshtasby, and Yu, "Detecting Human Faces in Color Images", International Workshop on Multi-Media Database Management Systems, 1998, and includes the steps of: Step 1. A skin color model is built using a training set of skin samples and the YCbCr color space. Using this skin color model, one can transform a color image into a gray scale image such that the gray value at each pixel shows the likelihood of the pixel belonging to the skin. In a preferred embodiment, the skin detection program is trained using skin samples from a publicly available face database. Examples of skin samples from such a database are shown in FIG. 4.
In a typical training exercise of the face detection program, a total of 52 skin images consisting of more than 50,000 pixels were used to train the face detection program to detect skin color. The skin samples were then filtered using a low-pass filter to reduce the effect of noise in the samples. In a preferred embodiment this filter has the impulse response given by
h = (1/9) × | 1 1 1 |
            | 1 1 1 |
            | 1 1 1 |

i.e., a 3×3 averaging filter.
The color distribution of the 50,000 pixels in chromatic color space is shown in FIG. 5. In a preferred embodiment, this distribution is approximated with a Gaussian model whose mean and covariance are given by:
Mean: m = E{x}, where x = (r, b)^T
Covariance: C = E{(x - m)(x - m)^T}
where r and b are the Cb and Cr coordinates of a pixel. The closest Gaussian distribution fitted by the program of a preferred embodiment is illustrated in FIG. 6.
Thus, the likelihood that the pixel (r, b) is a skin pixel is given by the probability:

Likelihood = P(r, b) = exp[-0.5 (x - m)^T C^(-1) (x - m)]

where x = (r, b)^T.
Step 2. With appropriate thresholding, the gray scale images can then be further transformed to a binary image showing skin regions and non-skin regions. An example sequence is shown in FIG. 7.
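By way of illustration, a minimal sketch of the Gaussian skin-likelihood computation of Step 1 and the thresholding of Step 2 follows; the array shapes, helper names and the 0.5 threshold are assumptions made for the example.

```python
# Sketch of the skin color model: fit a 2-D Gaussian to (Cb, Cr) skin
# samples, evaluate the per-pixel likelihood, then binarize.
import numpy as np

def fit_skin_model(skin_pixels_cbcr):
    """skin_pixels_cbcr: (N, 2) array of (Cb, Cr) values from skin samples."""
    m = skin_pixels_cbcr.mean(axis=0)
    c = np.cov(skin_pixels_cbcr, rowvar=False)
    return m, np.linalg.inv(c)

def skin_likelihood(image_cbcr, m, c_inv):
    """image_cbcr: (H, W, 2) array of per-pixel (Cb, Cr) values."""
    d = image_cbcr - m                       # deviation from the mean
    # exp(-0.5 (x - m)^T C^-1 (x - m)) evaluated at every pixel
    mahal = np.einsum('hwi,ij,hwj->hw', d, c_inv, d)
    return np.exp(-0.5 * mahal)

def segment_skin(likelihood, threshold=0.5):
    """Binarize the gray-scale likelihood image into skin / non-skin."""
    return likelihood > threshold
```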
Step 3. Each of the skin color regions is then tested with a face template to determine whether or not it is a face. Note that the template has to be resized and rotated corresponding to the size and orientation of the original skin region in the image. An example of a template used for the purpose is given in FIG. 8. An example of the superposition of the template on two skin regions is illustrated in FIG. 9.
Face Recognition (VQ histograms)
In a preferred embodiment, face recognition is performed using VQ histograms as defined in Cai, Goshtasby, and Yu, "Detecting Human Faces in Color Images", International Workshop on Multi-Media Database Management Systems, 1998, and includes the steps of:
Step 1: divide the face image into 4-by-4 blocks.
Step 2: calculate the minimum intensity in each 4-by-4-pixel block, and subtract the minimum intensity from each block. Therefore, an intensity variation is obtained for each block.
Step 3: for each block division from the face image, match the block with all the codes in a codebook; the most similar codevector is selected using Euclidean distance for the distance matching. Other distance matching methods include L1, the intersection method, and chi-square.
Step 4: after performing VQ for all the blocks extracted from a facial image, matched frequencies for each codevector are counted and a histogram is generated, known as the VQ histogram of the face image.
Note that the VQ histogram must be normalized so that the size of the face does not matter during recognition.
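A minimal sketch of the VQ-histogram computation and matching follows; the codebook is treated as a given array of 16-element codevectors, and all names are illustrative, since the codebook construction is not detailed here.

```python
# Sketch of Steps 1-4: per-block minimum-intensity subtraction, nearest
# codevector lookup, and a size-normalized histogram of codevector matches.
import numpy as np

def vq_histogram(face_gray, codebook):
    """face_gray: 2-D gray-scale face image; codebook: (K, 16) codevectors."""
    h, w = face_gray.shape
    counts = np.zeros(len(codebook))
    for y in range(0, h - 3, 4):
        for x in range(0, w - 3, 4):
            block = face_gray[y:y + 4, x:x + 4].astype(float).ravel()
            block -= block.min()             # remove the minimum intensity
            # nearest codevector by Euclidean distance
            idx = np.argmin(((codebook - block) ** 2).sum(axis=1))
            counts[idx] += 1
    return counts / counts.sum()             # normalize for face size

def match_face(face_hist, reference_hists):
    """Return the index of the closest reference face (Euclidean distance)."""
    dists = [np.linalg.norm(face_hist - r) for r in reference_hists]
    return int(np.argmin(dists))
```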
Role importance detection algorithm
A preferred embodiment uses certain human memory models to justify the algorithm.
Assume that as human beings watch a video (say a sitcom), the current flow of visual information is rapidly stored in a scratch-pad memory called working memory or short-term memory (STM). The contents of this scratch pad decay rapidly with time, however, and are continuously replaced with new information. However, constant repetition of the same information (e.g. the same faces for characters appearing again and again in the video) means that this repeatedly occurring information is better remembered, i.e. this information gets passed on to the long-term memory
(LTM). As a human watches a video, the contents of the STM are continuously changing while the important information from the STM continuously updates the contents of the LTM.
Assume for the present invention that a face-detection and face-tracking algorithm implementation is provided, e.g., in software. The software detects all the faces in an image, e.g., a video frame. Assume that the tracking code also tracks each of these faces reliably.
The second assumption is that the shots in a sitcom video are essentially static, i.e. both the character and the camera are generally stationary. Thus, the first frame of a shot can be considered to be representative of the entire shot. This is generally true for most scenes in a sitcom. Shots in which the characters actually move and are tracked by the camera are rare. Only the first frame of each shot needs to be processed, thus saving valuable computational resources.
Note that with sufficient computational resources, the algorithm can be applied to all the frames in a video as well. Alternatively, the middle frame or a randomly selected frame in a shot can be used. The algorithm can also take more than one frame per shot.
The video is processed scene-by-scene, analyzing each shot within a scene sequentially. From the structure of a sitcom, it can easily be seen that the same faces appear in sequential shots of a scene and the positions of the face over sequential shots of a scene change only minutely.
The Algorithm
1. The video is read frame-by-frame. Note that this means that only those frames where a cut has previously been detected are processed. This is a reasonable assumption to make considering the structure of the sitcom.
2. Let C be an array consisting of all the frames where a cut is detected.
3. For each frame in the array C:
a) all the faces in the frame are detected;
b) each face detected is compared with a reference database comprising the faces of the main characters of the sitcom using the VQ histogram method, and a closest match in the database is found using a simple Euclidean distance measure, such that if the face does not match any of the faces in the database, it is considered to be an 'external' face (e.g. the face of a guest actor). [In an alternative embodiment, this external face can be added dynamically to the database.]
c) The Degree of Importance (DoI) is calculated for each face that matches a face in the reference database.
d) Three DoIs are defined as follows:
i. the DoI of a particular face per frame (DoIf) is given by

DoIf = (area of face) / (area of frame)

The DoI is an indication of how well the face will be remembered or, in other words, how important the face is in the context of the video.
ii. the DoI of a particular face per shot (DoISh) is given by

DoISh = (Σ_{i=1..N_f} DoIf_i) / N

where
N = the number of frames in the shot
N_f = the number of frames in the shot where the face is found
DoIf_i = the DoIf of the face in the i-th frame
iii. the DoI of a particular face per scene (DoIS) is given by

DoIS = (Σ_{i=1..N_s} DoISh_i) / N

where
N = the number of shots in the scene
N_s = the number of shots in the scene where the face is present
DoISh_i = the DoISh of the face in the i-th shot
In a similar way, the DoI of a face for the whole video can be calculated.
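As an illustration, the sketch below computes DoIf, DoISh and DoIS from per-frame face areas; the data layout and all names are assumptions made for the example.

```python
# Sketch of the DoI computations defined above. Faces are assumed to arrive
# as per-frame dicts mapping a face ID to its bounding-box pixel area.
def doi_frame(face_area, frame_area):
    """DoIf: fraction of the frame occupied by the face."""
    return face_area / frame_area

def doi_shot(per_frame_faces, frame_area):
    """DoISh per face: sum of DoIf over frames where the face appears,
    divided by the total number of frames N in the shot."""
    n = len(per_frame_faces)
    totals = {}
    for faces in per_frame_faces:            # one {face_id: area} dict per frame
        for face_id, area in faces.items():
            totals[face_id] = totals.get(face_id, 0.0) + doi_frame(area, frame_area)
    return {face_id: s / n for face_id, s in totals.items()}

def doi_scene(per_shot_dois):
    """DoIS per face: sum of DoISh over shots where the face appears,
    divided by the total number of shots N in the scene."""
    n = len(per_shot_dois)
    totals = {}
    for shot in per_shot_dois:               # one DoISh dict per shot
        for face_id, d in shot.items():
            totals[face_id] = totals.get(face_id, 0.0) + d
    return {face_id: s / n for face_id, s in totals.items()}
```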
Applying the Algorithm
In a general scenario, consider how the DoI can be used for Role-importance detection. Suppose that a sitcom episode comprises five scenes A, B, C, D and E, in that order. Suppose also that each scene comprises S shots, where S = S_A, S_B, S_C, S_D, and S_E for each scene. Each scene is processed shot-by-shot by considering the first frame of each shot. For every face found in the scene, a quantity called DoIS is computed for the scene that indicates how well the face (i.e., character) is remembered by a viewer after the scene is over. In other words, it is an indicator of the importance of the character or role in that scene.
Suppose the process starts with shot 1 of scene A, that faces F1 and F2 are found in shot 1, and that faces F3 and F4 are found in shot 2. The DoISh's of faces F1 and F2 in shot 1 are calculated, and the DoISh's of faces F3 and F4 in shot 2 are likewise calculated. This process continues until all the shots in the scene are processed. At completion of the process for scene A, the DoIS (the DoI of a face for the entire scene A) of each face has been calculated by summing its DoISh's over the shots of the scene. The process is repeated for each scene and can be extended to find the DoI of each of the faces for the entire video.
Using DoI to Improve Face Detection
In a preferred embodiment, a filtering approach based on face area is employed. Faces detected having DoIs that are less than a pre-determined threshold are rejected because these are generally not faces at all. A threshold of 0.01 was found to be satisfactory. In an alternative embodiment, a temporally based filtering approach is employed, based on the determination of a pattern in which faces appear in a video. For example, a conversation scene in a sitcom comprises a long sequence of shots (shot, reaction shot). Thus, a face detected in a shot would repeat in a shot that follows it. Such an approach eliminates false face detections (false positives).
Improvement in Face Recognition
Instead of storing only one face per character in the reference database, many different images of each character are stored to improve the likelihood of a match. In practice, this requires a priori knowledge of the characters in a video.
Referring now to FIG. 10, an apparatus 1000 is shown that uses a memory module stored in a reference database 1001 and a processor module 1002 executing a face detection module 1003 and a face/role recognition module 1004 to recognize faces/roles in an incoming video stream 1005. The face detection module comprises inter alia a video segment location module 1003.1 to locate cuts in a scene of the video, i.e., to locate shots therein. Detected faces are recognized or not by the face/role recognition module 1004. A degree of importance module 1003.2 can provide a degree of importance metric for filtering faces that fall below an importance threshold, i.e., are not really faces. Faces are detected by segmenting a video into scenes and scenes into shots, using video content to define the cut criteria. A detected face is recognized as a role based on the representation of roles in the reference database 1001. This database represents a viewer's memory of the video as well as reference role representations. The viewer's memory takes the form of data about faces that appear in sequential shots of a scene as reinforcement, much as a human learns from repeated exposure to faces that the faces represent recognizable roles. FIG. 11 illustrates a system 1100 including an apparatus according to the present invention
(as shown in FIG. 10) that analyzes a video for recognizable roles contained in a reference database 1001 (a long term memory counterpart) and dynamically updates the reference database 1001 with roles recognized in videos.
Testing & Discussion of the Results
As a test, the DoIf (the DoI of a face per frame) was computed for the entire video. The face recognition is done by matching VQ histograms of the detected faces with the VQ-histogram database. Since the accuracy of the face recognition turns out to be very low (some measures to improve the face recognition are discussed in the next section), the figure below may improve if a different face recognition method is used.
FIG. 12 illustrates a plot of the DoIs of the faces with frame number (from top to bottom, the DoI plots for characters A, B, C, D, E and F).
Still, FIG. 12 provides a fair idea of how the DoI can be used to detect roles in videos. Taken along with the scene boundaries (not shown in the figure), this distribution of the DoI for each character would give an idea of the scenes which each character inhabits in the episode. This would facilitate easy browsing of the video by character presence.
Table I below gives a brief idea of each scene in the episode and also how long the scene is. Note that the credits sequence has not been considered here as a scene.
Table I: Summary of the scenes in an episode of a sitcom.
The episode consists of 16 scenes (excluding the credits sequence). The second column gives the characters in the scene. The third indicates the number of shots in the scene, while the fourth gives the duration of the scene.

Detection of plot points using DoI
1. Produce an ordered list of character appearances (denoted as C), ordered by the number of times the characters occur in the sequence (e.g., using speaker ID).
2. Create a matrix A that records the transitions from one character to another. Each element aij of the matrix is a counter of transitions from character i to character j.
3. Compute the mean and standard deviation for each row of matrix A.
4. As the starting point, use the ordered list of characters C. Rows are taken from matrix A in the order given by list C. For each character in list C, a grouping of characters is started:
i. the character is grouped with those characters whose transition frequency exceeds the (mean + standard deviation) of that row;
ii. heuristics are applied in order to find the final groupings of characters.
Here the decision point is how to start a new group based on the row mean and standard deviation. If the character in the current row is not part of an existing group, and the characters it is grouped with are not part of an existing group, then a new group is made. Next, the transitivity rule is applied: if the character in the current row is associated with characters that are already grouped, it is inserted into the already existing groups. If the character in the current row is associated with characters, some of whom are already grouped and some of whom are not, then these are made into a separate group.
After this process, the duration can be used to eliminate characters that occur in multiple plots. A simplified sketch of this grouping procedure follows.
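The sketch below condenses steps 1-4; the (mean + standard deviation) row threshold follows the text, while the merge logic is a simplified stand-in for the full transitivity heuristics, and the function name is hypothetical:

    import numpy as np

    def character_groups(transitions, characters):
        # transitions[i][j] counts cuts from character i to character j;
        # characters is the appearance-ordered list C
        A = np.asarray(transitions, dtype=float)
        groups = []
        for i, current in enumerate(characters):
            row = A[i]
            linked = {characters[j] for j in np.flatnonzero(row > row.mean() + row.std())}
            linked.add(current)
            for g in groups:
                if g & linked:         # transitivity rule: merge into an existing
                    g |= linked        # group that shares a member
                    break
            else:
                groups.append(linked)  # otherwise start a new group
        return groups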
While a system, apparatus and method have been presented for identifying the different characters in a sitcom video, this is only one example of the Role Detection approach of the present invention, which takes into account the structure of the video. One skilled in the art will realize that other approaches fall within the scope of the appended claims, such as averaging the DoI's for a face over time instead of per frame. Further, audio analysis can be used to assist shot segmentation, and lighting can be factored into face detection (into the training samples for skin-color detection), or the samples could be selected more carefully. Multiple face templates can be used, such as side views. Face recognition is not limited to VQ histograms; for example, the Eigenface approach and other well-known approaches can be used. Further, the reference database can be dynamically extended to include new faces as they are detected. The scope of the present invention is determined only by the appended claims and not by the examples, which are used for explanation only and not in any limiting sense.

Claims

CLAIMS:
1. A method for face detection and role recognition in a video, comprising the steps of:
providing a reference database of known faces and known roles associated with known faces;
segmenting the video into an ordered sequence of shots; and,
for each shot in order, performing the steps of:
a. detecting faces; and
b. for each detected face, performing the steps of:
i. recognizing the face as a known role in the reference database if the detected face matches a face associated with a known role in the database, and
ii. adding a data description of the detected face to the reference database that includes a measure of the importance of the known role such that future recognition of the face is enhanced.
2. The method of claim 1, wherein the segmenting step further comprises the steps of: reading the video frame-by-frame; detecting a cut in a read frame; and defining a shot as a frame having a cut detected therein.
3. The method of claim 1, wherein the segmenting the video step further comprises the steps of: segmenting the video into an ordered sequence of scenes; and segmenting each scene into an ordered sequence of shots to achieve an ordered sequence of shots over all scenes.
4. The method of claim 1, wherein the recognizing step further comprises the step of using an Eigenface method.
5. The method of claim 1, wherein:
the step of providing the reference database (1001) further comprises the step of including a codebook of codevectors and VQ histograms of known faces;
the detecting step further comprises the step of computing a VQ histogram of a detected face using the codebook;
the recognizing step further comprises the step of matching the computed VQ histogram with the VQ histograms of known faces stored in the reference database (1001) according to a pre-determined similarity criterion; and
the adding step further comprises the steps of:
i. adding an entry that describes the computed VQ histogram to the reference database (1001), and
ii. if the detected face does not match a VQ histogram of a known face in the reference database (1001), including the computed VQ histogram in the added entry identified as an 'external' face.
6. The method of claim 5, wherein the pre-determined similarity criterion is that a distance from the computed VQ histogram to a database VQ histogram of a known face is less than a pre-specified tolerance, the distance being based on a distance measure selected from the group consisting of Euclidean, L1, intersection method, and chi-square.
7. The method of claim 5, wherein the adding step further comprises the steps of:
computing a degree of importance of the detected face for each shot of each scene;
computing a degree of importance of the detected face for each scene of the video as the sum of the degrees of importance of the detected face computed for each shot of the scene;
ordering the detected faces, when stored in the reference database (1001), in decreasing order of their computed degrees of importance; and
storing the computed degree of importance of the detected face for each shot and each scene in the reference database (1001) such that the computed degree of importance is associated with the detected face in the reference database (1001).
8. The method of claim 7, further comprising the step of filtering the detected face based on the computed degree of importance (DoI) of the detected face selected from the group consisting of DoI of a shot and DoI of a scene.
9. The method of claim 8, wherein the filtering step further comprises the step of rejecting a detected face if a DoI of the detected face is less than a pre-determined importance threshold.
10. The method of claim 7, further comprising the steps of: computing a degree of importance of the detected face for the video as a sum of the computed degrees of importance of the detected face computed for each scene of the video; and storing the computed degree of importance of the detected face for the video in the reference database.
11. The method of claim 10, further comprising the step of filtering the detected face based on the computed degree of importance (DoI) of the detected face selected from the group consisting of DoI of a shot, DoI of a scene, and DoI of the video.
12. The method of claim 11, wherein the filtering step further comprises the step of rejecting a detected face if a DoI of the detected face is less than a pre-determined importance threshold.
13. An apparatus (1000) for face detection and role recognition in a video, comprising:
a reference memory (1001) containing a codebook of a plurality of codes and codevectors for VQ histogram construction and a plurality of normalized VQ histograms as reference descriptions of faces of known roles/main characters in videos;
a face detection module (1003) to detect faces in a video and create a reference description of the detected faces;
a face/role recognition module (1004) to accept a reference face description and query the reference memory (1001) therewith to determine one set of conditions selected from the group consisting of (a) the face is contained in the reference memory and is recognized, and (b) the face is not contained in the reference memory, is not recognized, and is an external face; and
a processor module (1002) configured to:
i. accept an incoming video signal (1005) comprising a video to be analyzed for the presence of roles contained in the reference database (1001),
ii. execute the face detection module (1003) to detect faces in the video,
iii. for a detected face, execute the face/role recognition module (1004) to recognize whether or not the detected face is associated with a known role in the reference database (1001),
iv. update the reference database (1001) with information about the detected face whereby a memory model of recognized faces contained therein is reinforced, and
v. add an unrecognized face to the reference database.
14. The apparatus of claim 13, wherein the face detection module (1003) further comprises a video segment detection module (1003.1) that segments the incoming video signal (1005) into an ordered sequence of scenes, each scene comprising an ordered sequence of shots, and for each shot in order detects faces therein.
15. The apparatus of claim 14, wherein the video segment detection module (1003.1) is further configured to read the incoming video signal (1005) frame-by-frame, detect a cut in a read frame, and define a shot as a frame having a cut detected therein.
16. The apparatus of claim 15, wherein:
the face detection module (1003) is further configured to compute a VQ histogram of a detected face using the codebook of the reference database (1001);
the face/role recognition module (1004) is further configured to match the computed VQ histogram with the VQ histograms of known faces stored in the reference database in accordance with a pre-determined similarity criterion; and
the processor module (1002) is further configured to:
i. add an entry that describes the computed VQ histogram to the reference database, and
ii. when the detected face does not match a VQ histogram of a known face in the reference database, include the computed VQ histogram in the added entry identified as an 'external' face.
17. The apparatus of claim 15, wherein the pre-determined similarity criterion is that a Euclidean distance from the computed VQ histogram to a database VQ histogram of a known face is less than a pre-specified tolerance.
18. The apparatus of claim 15, wherein the face detection module (1003) further comprises a degree of importance computation module (1003.2) configured to:
compute a degree of importance of each detected face for each shot of each scene;
compute a degree of importance of each detected face for each scene of the video as the sum of the degrees of importance of that face computed for each shot of the scene;
arrange detected faces in decreasing order of their computed degrees of importance; and
store the computed degree of importance of each detected face for each shot and each scene in the reference database (1001).
19. The apparatus of claim 18, wherein the degree of importance module (1003.2) is further configured to apply a filter to each detected face based on the computed degree of importance (DoI) of the detected face selected from the group consisting of DoI of a shot and DoI of a scene.
20. The apparatus of claim 19, wherein the filter further comprises rejection of the detected face if a DoI of the detected face is less than a pre-determined importance threshold.
21. The apparatus of claim 18, wherein the degree of importance module (1003.2) is further configured to: compute a degree of importance of each detected face of the video as a sum of the computed degrees of importance of the detected face computed for each scene of the video; and store the computed degree of importance of each detected face for the video in the reference database (1001).
22. The apparatus of claim 21, wherein the degree of importance module (1003.2) is further configured to apply a filter to each detected face based on the computed degree of importance (DoI) of the detected face selected from the group consisting of DoI of a shot, DoI of a scene, and DoI of the video.
23. The apparatus of claim 22, wherein the filter further comprises rejection of the detected face if a DoI of the detected face is less than a pre-determined importance threshold.
24. A system (1100) for face detection and role recognition in a video, comprising: a control workstation (1101) to provide the video as an incoming video signal (1005); and an apparatus (1000) according to claim 22 for face detection and recognition that accepts the incoming video signal (1005) provided by the control workstation (1101), detects and recognizes faces therein, and outputs the detected and recognized faces to the control workstation (1101).
25. A computer program that implements the method of claim 12, is executable by a processor (1002), comprises a face detection module (1003) including a video segment detection module (1003.1) and a degree of importance computation module (1003.2) and a face/role recognition module (1004), and creates and maintains a reference database (1001) of information related to and defining known faces and the associated roles thereof.
26. A method for determining a pre-determined number of most important faces/roles of a video having a plurality of detectable faces/roles, comprising the steps of: using the method of claim 12 to obtain a DoI for each detectable face/role of the video; ranking the detected faces/roles by DoI; and determining the most important roles as the first pre-determined number of roles of the faces/roles ranked by DoI.
27. A method for determining at least one plot point of a video having a plurality of detectable faces/roles, comprising the steps of: obtaining a DoI for each detectable face/role of the video by performance of the method of claim 11 on the video; and step for obtaining at least one plot point with the obtained DoI for each detectable face/role.
PCT/IB2006/053527 2005-09-30 2006-09-27 Method and apparatus for long term memory model in face detection and recognition WO2007036892A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US72289205P 2005-09-30 2005-09-30
US60/722,892 2005-09-30

Publications (1)

Publication Number Publication Date
WO2007036892A1 (en) 2007-04-05

Family

ID=37672371

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2006/053527 WO2007036892A1 (en) 2005-09-30 2006-09-27 Method and apparatus for long term memory model in face detection and recognition

Country Status (1)

Country Link
WO (1) WO2007036892A1 (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Using models of Human Memory for Role Detection in Movies and Television", 24 April 2005 (2005-04-24), XP002417731, Retrieved from the Internet <URL:www.ee.columbia.edu/~sak2010/_files/Final+Reportmod2_ver1CommentsNevenka_final.pdf> [retrieved on 20070130] *
ANER-WOLF A ET AL: "VIDEO MINING, Chapter 5, Movie Content Analysis, Indexing and Skimming via Multimodal Information", VIDEO MINING, KLUWER INTERNATIONAL SERIES IN VIDEO COMPUTING, NORWELL, MA: KLUWER ACADEMIC PUBL, US, 2003, pages 123 - 154, XP002417732, ISBN: 1-4020-7549-9 *
Retrieved from the Internet <URL:http://web.archive.org/web/20050424212107/www.ee.columbia.edu/~sak2010/_files/Final+Reportmod2_ver1CommentsNevenka_final.pdf> [retrieved on 20070130] *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008125481A1 (en) * 2007-04-13 2008-10-23 Atg Advanced Us Technology Group, Inc. Method for recognizing content in an image sequence
US8077930B2 (en) 2007-04-13 2011-12-13 Atg Advanced Swiss Technology Group Ag Method for recognizing content in an image sequence
WO2010008520A1 (en) * 2008-07-14 2010-01-21 Google Inc. Method and system for automated annotation of persons in video content
KR20110036934A (en) * 2008-07-14 2011-04-12 구글 인코포레이티드 Method and system for automated annotation of persons in video content
US8213689B2 (en) 2008-07-14 2012-07-03 Google Inc. Method and system for automated annotation of persons in video content
KR101640268B1 (en) * 2008-07-14 2016-07-15 구글 인코포레이티드 Method and system for automated annotation of persons in video content
CN101783019B (en) * 2008-12-26 2013-04-24 佳能株式会社 Subject tracking apparatus and control method therefor, image capturing apparatus, and display apparatus

Similar Documents

Publication Publication Date Title
Rasheed et al. Detection and representation of scenes in videos
US20050228849A1 (en) Intelligent key-frame extraction from a video
JP3485766B2 (en) System and method for extracting indexing information from digital video data
Lefèvre et al. A review of real-time segmentation of uncompressed video sequences for content-based search and retrieval
Truong et al. Scene extraction in motion pictures
US20020028021A1 (en) Methods and apparatuses for video segmentation, classification, and retrieval using image class statistical models
US20060062474A1 (en) Methods of representing and analysing images
EP2270748A2 (en) Methods of representing images
US20070113248A1 (en) Apparatus and method for determining genre of multimedia data
Pfeiffer et al. Scene determination based on video and audio features
CN1685712A (en) Enhanced commercial detection through fusion of video and audio signatures
JP2011008509A (en) Important information extraction method and device
Lu et al. An effective post-refinement method for shot boundary detection
Zhu et al. Video scene segmentation and semantic representation using a novel scheme
Gade et al. Audio-visual classification of sports types
WO2007036892A1 (en) Method and apparatus for long term memory model in face detection and recognition
Liu et al. Effective feature extraction for play detection in american football video
El Khoury Unsupervised video indexing based on audiovisual characterization of persons
Wang et al. Automatic story segmentation of news video based on audio-visual features and text information
Ionescu et al. A contour-color-action approach to automatic classification of several common video genres
Li et al. Person identification in TV programs
Quenot et al. Rushes summarization by IRIM consortium: redundancy removal and multi-feature fusion
Chaloupka A prototype of audio-visual broadcast transcription system
Zhang et al. What makes for good multiple object trackers?
Masneri et al. SVM-based video segmentation and annotation of lectures and conferences

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06809423

Country of ref document: EP

Kind code of ref document: A1