EP1112549A4 - Method of face indexing for efficient browsing and searching of people in video - Google Patents

Method of face indexing for efficient browsing and searching of people in video

Info

Publication number
EP1112549A4
EP1112549A4 (Application EP99943190A)
Authority
EP
European Patent Office
Prior art keywords
face
track
index
class
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP99943190A
Other languages
German (de)
French (fr)
Other versions
EP1112549A1 (en)
Inventor
Itzhak Wilf
Hayit Greenspan
Arie Pikaz
Ovadya Menadeva
Yaron Caspi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MATE-MEDIA ACCESS TECHNOLOGIES Ltd
MATE MEDIA ACCESS TECHNOLOGIES
Original Assignee
MATE-MEDIA ACCESS TECHNOLOGIES Ltd
MATE MEDIA ACCESS TECHNOLOGIES
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by MATE-MEDIA ACCESS TECHNOLOGIES Ltd (MATE MEDIA ACCESS TECHNOLOGIES)
Publication of EP1112549A1
Publication of EP1112549A4

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/179 Human faces, e.g. facial parts, sketches or expressions metadata assisted face recognition

Definitions

  • the present invention relates to a method and apparatus for generating indices of classes of objects appearing in a collection of images, particularly in a sequence of video frames, for use in searching or browsing for particular members of the respective class.
  • the invention is particularly useful for indexing human faces in order to produce an index for face recognition tasks in querying operations.
  • a system for video browsing and searching, based on key-frames is depicted in Fig. 1A.
  • a video image sequence is inputted from a video feed module 110.
  • the video feed may be a live program or recorded on tape.
  • Analog video is digitized in video digitizer 112.
  • the system may receive digital representation, such as Motion JPEG or MPEG directly.
  • a user interface console 111 is used to select program and digitization parameters as well as to control key-frame selection.
  • a key-frame selection module 113 receives the digitized video image sequence. Key-frames can be selected at scene transitions by detecting cuts and gradual transitions such as dissolves. This coarse segmentation into shots can be refined by selecting additional key-frames in a process of tracking changes in the visual appearance of the video along the shot.
  • a feature extraction module 114 processes key-frames as well as non key-frame video data to compute key-frames characteristic data. This data is stored in the key-frames characteristic data store 115 and is accessed by a video search engine 116 in order to answer queries generated by the browser-searcher interface 117. Such queries may relate to content attributes of the video data, such as color, texture, motion and others. Key-frame data can also be accessed directly by the browser 117. In browsing, a user may review key-frames instead of the original video, thus reducing storage and bandwidth requirements.
  • a certain collection of M video shots may consist of only N < M different scenes such that at least one scene spans more than one shot.
  • the prior art describes how to cluster such shots based on their similarity of appearance, utilizing low-level features. It has become standard practice to extract features such as color, texture and motion cues, together with a set of distance metrics, and then utilize the distance metrics in the related feature spaces (respectively) for determining the similarity between key-frames of the video contents or shots. In this scenario, the video content is limited to the definition in the low-level feature space. What is missing in the prior art is the automatic extraction of high-level object-related information from the video during the indexing phase, so as to facilitate future searching applications.
  • the key-frames extraction and the automatic searching are separate processes.
  • Combining the processes in a unified framework means taking into account high-level user-queries (in search mode), during the indexing phase. Spending more effort in a more intelligent indexing process proves beneficial in a short turn around rate in the searching process.
  • a prior art method of face detection and recognition in video is depicted in Fig. 1B.
  • a face detection module (122) operates on a set of frames from the input video image sequence. This set of frames may consist of the entire image sequence if the probability of detection is of primary importance. However, the face content in the video does not change with every frame. Therefore, to save computational resources a subset of the original sequence may be utilized. Such a subset may be derived by decimating the sequence in time by a fixed factor or by selecting key-frames by a key-frame extraction module 121. The detected faces are stored in the face detection data store 123.
  • face features are extracted (124), where the features can be facial feature templates, geometrical constraints, and global facial characteristics, such as eigen-features and features of other known algorithms in the art.
  • the face representation can be compared to a currently awaiting search query, or to a predefined face database; alternatively, it can be stored in a face feature database (125) for future use.
  • By comparing face characteristic data from the database or from a user-defined query with the face characteristic data extracted from the video, the identity of people in the video can be established. This is done by the face recognition module 126 and recorded in the video face recognition report 127 with the associated confidence factor.
  • One prior art method uses a set of geometrical features, such as nose width and length, mouth position and chin shape; Another prior art method is based on template matching. One particular method represents the query and the detected faces as a combination of eigen-faces.
  • FIG. 1C shows a simple sequence of video scenes and the associated face content, or lack of it. In this example, some people appear in several scenes.
  • FIG. 1D depicts the results of a sequential face-indexing scheme such as the one depicted in FIG. 1B.
  • This representation provides only a partial, highly redundant description of the face content of the video scenes.
  • a person may be visible for only a part, or for several parts, of the scene.
  • In a scene and across scenes, a person generally has many redundant views, but also several different views. In such a situation it is desirable to prune redundant views, and also to increase robustness by comparing the user-defined query against all different views of the same person.
  • a person may go from a position where the person can be detected to a position where the person is visible but cannot be detected by automatic processing. In several applications it is useful to report the full segment of visibility for each recognized person.
  • a method of generating an index of at least one class of objects appearing in a collection of images for use in searching or browsing for particular members of the class comprising: processing the collection of images to extract therefrom features characteristic of the class of objects; and grouping the images in groups according to extracted features helpful in identifying individual members of the class of objects.
  • the collection of the images represents a sequence of video frames
  • the class of objects is human faces
  • the grouping forms face tracks of contiguous frames, each track being identified by the starting and ending frames and containing face regions.
  • a method of generating an index of human faces appearing in a sequence of video frames comprising: processing the sequence of video frames to extract therefrom facial features; and grouping the video frames to produce face tracks of contiguous frames, each face track being identified by the starting and ending frames in the track and each face track containing face data characteristic of an individual face.
  • a method of generating an index of at least one class of objects appearing in a collection of images to aid in browsing or searching for individual members of the class comprising: processing the collection of images to generate an index of features characteristic of the class of objects; and annotating the index with annotations.
  • the invention also provides a method of processing a sequence of video frames having a video track and an audio track, to generate a speech annotated face index associated with speakers, comprising: generating a face index from the video track; generating a transcription of the audio track; and aligning the transcription with the face index.
  • the invention further provides a method of processing a video track having face segments (including parts of face segments) to label such segments as talking or non-talking, comprising: tracking the face segments in the video track; detecting segments having mouth motion; and estimating from the detected segments, those having talking mouth motion vs. non-talking mouth motion.
  • a method of annotating a sequence of video frames comprising: processing the sequence of video frames to generate a face index; and attaching a description to at least one entry in the face index.
  • the invention further provides a method of processing a sequence of video frames having a video track and an audio track, comprising: extracting from the video track, face segments representing human faces, and producing a face track for each individual face; extracting audio segments from the audio track; fitting a model based on a set of the audio segments corresponding to the individual face of a face track; and associating the model with the face track of the corresponding individual face.
  • a method of searching a collection of images for individual members of a class of objects comprising: processing the collection of images to generate an index of features characteristic of the class of objects; and searching the index for individual members of the class of objects.
  • there is also provided apparatus for generating an index of a class of objects appearing in a collection of images for use in searching or browsing for particular members of the class, comprising: a processor for processing the collection of images to extract therefrom features characteristic of the objects, and for outputting therefrom indexing data with respect to the features; a user interface for selecting the features to be extracted to enable searching for and identifying individual members of the class of objects; and a store for storing the index data outputted from the processor in groups linked according to the features selected for extraction.
  • the invention is particularly applicable for parsing a video stream into face/no face segments, indexing and logging the facial content, and utilizing the indexed content as an intelligent facial database for future facial content queries of the video data.
  • the invention may use a high-level visual module in the indexing of a video stream, specifically, based on human facial information.
  • the invention also uses audio information as part of the indexing process and merges the audio and the video.
  • the invention is particularly useful for indexing facial content in order to enable face recognition
  • the invention could also be used for indexing the characteristics of other classes of objects, such as billboards, logos, overlayed text (text added on to a video frame), geographical sites, etc. to enable recognition of individual members of such classes of objects, at an increased speed and/or probability of recognition.
  • FIG. 1A describes a prior art method of searching in video by key-frame selection and feature extraction.
  • FIG. 1B describes a prior art method of face recognition.
  • FIG. 1C describes a sample sequence of video content.
  • FIG. 1D presents the results of face detection applied to the sequence of FIG. 1C, organized as a sequential index.
  • FIG. 2 describes the browsing and searching system with face indexing, as introduced in this invention.
  • FIG. 3 depicts the face index.
  • FIG. 4 depicts a preferred embodiment of a face track structure.
  • FIG. 5 presents a sample face index generated by a particular embodiment of the present invention for the example of FIG. 1C.
  • FIG. 6 shows a set of face characteristic views selected from a face track, as taught by the present invention.
  • FIG. 7 provides an overview of the process of generating a face track from a video image sequence and extracting the associated characteristic data from it.
  • FIG. 8 shows a particular embodiment of a face tracking module.
  • FIG. 9 shows a particular embodiment for selecting the set of face characteristic views subject to self-similarity criteria.
  • FIG. 10a illustrates a template as the bounding box around a face region.
  • FIG. 10b illustrates the extraction of geometrical features from a face region.
  • FIG. 11 displays the vector store of the face characteristic data.
  • FIG. 12 shows the extraction of audio data corresponding timewise to the video data.
  • FIGS. 13 and 14 show how to combine the audio track with the face index to create an audio-visual index.
  • FIG. 15 describes the linking and information merging stage as part of the face indexing process.
  • FIG. 16 presents a sample video annotation generated by a particular embodiment of the present invention for the example of FIG. 1C.
  • FIG. 17 is a flow chart describing the method of searching an index and retrieving related video frames.
  • a main purpose of the present invention is to provide a method of generating an index of a class of objects, such as a face object, appearing in a collection of images, such as a sequence of video frames, for use in searching or browsing for a particular member of the class, such as a specific individual's face (e.g., "Clinton").
  • the described method can be utilized for indexing various classes of objects (both visual objects and audio signatures) as long as the class of objects has a set of visual and/or audio characteristics that may be a priori defined and modeled, so as to be used for automatic detection of objects within an input video stream, using video and audio analysis tools, as described herein.
  • an input video stream is automatically parsed into segments that contain the class of objects of interest, segments representative of particular members of said class of objects are grouped, and an index or index store is generated.
  • the indexed content may be used for any future browsing and for efficient searching of any particular object.
  • Examples of classes of objects and particular members of the class include: face objects and "Clinton"; logo objects and the "EuroSport" logo; and text objects (automatic detection of words in the video stream) and "economy".
  • A system for video browsing and searching of face content in accordance with the present invention is depicted in FIG. 2.
  • a video image sequence is inputted from a video feed module (210).
  • the video feed may be a live program or recorded on tape.
  • Analog video is digitized in a video digitizer (215).
  • the system may receive the digital representation directly.
  • the video source, the program selection and digitization parameters, and the face-indexing selection parameters are all controlled by the user from an interface console (230).
  • a subset of the video sequence is inputted into an indexing module (220).
  • the computed face index data is stored in the face index data store (250).
  • the face-index data is edited by a face index editor to correct several types of errors that may occur during automatic indexing. Such errors can originate from false face detection, that is, the identification of non-face regions as faces.
  • An additional form of error is over-segmentation of a particular face: two or more instances of the same face fail to be linked between appearances and thus generate two or more index entries.
  • False faces will generally not be recognized, and over-segmentation will result in some additional processing time and possibly reduced robustness.
  • In applications in which the generated index is queried frequently, it is generally cost-effective to have an editor review the graphical representation of the face index, to delete false alarms, and to merge entries originating from the same person.
  • the editor annotates the face index store by specifying the name of the person, or by linking the face appearance and another database entity.
  • This embodiment provides a method of semi-automatic annotation of video by first generating a face index and then manually annotating only the index entries. The annotation becomes immediately linked to all tracks in the video that correspond to the specific index entry. Since the number of frames where a specific person is visible is much larger than the number of index entries, and since the spatial information (that is, the location within the frame) is readily available, a major saving of effort is achieved. A more elaborate description of utilizing the face index store for annotation is provided below.
  • the face index can be utilized for efficient browsing and searching. Browsing can be effected via a browser-searcher module (240), and searching can be effected via a video search engine (260), both accessing the index in order to process face-based queries.
  • An extended description of the browsing and searching operational modalities is provided below.
  • FIG. 3 illustrates one possible structure of the face index 250, which is provided with fields arranged in vertical columns and horizontal rows.
  • each face entry (Fn) is associated with a plurality of tracks (Tn), along with representative data per track.
  • a face track or segment is a contiguous sequence of video frames in which a particular face appears.
  • Each face track is depicted as a separate horizontal row in the index to include the information associated with the respective track as inputted into the face index store 250.
  • Generating an index as in FIG. 3 involves several processes. First, the input video is processed to automatically detect and extract objects of the predefined object category. Processing involves utilizing the set of visual and audio characteristics that are a priori defined and modeled using video and audio analysis tools as described herein.
  • a second major process involves grouping of the information in groups according to extracted features helpful in identifying individual members of a class.
  • Grouping involves the generation of a track of contiguous frames (Tn), for each individual member of the class of objects to be identified.
  • Grouping may also involve the merging of frames that contain similar members of a class of objects as indicated in associating different tracks (Tn) to a particular face entry (Fn).
  • a face track or segment is represented by a start frame, Fs, and an end frame, Fe. Face coordinates are associated with each frame in the track: {X(Fs), ..., X(Fe)}. In one embodiment, such coordinates could be the bounding-box coordinates of the face, as shown in FIG. 10a.
  • the face track data structure includes:
  • Face Frontal Views, which are the identified frontal views within the set of Face Characteristic Views; frontal views have better chances of being recognized properly.
  • Face Characteristic Data, which are attributes computed from the face image data and stored for later comparison with the corresponding attributes extracted from the query image; such data can include audio characteristic data that can be associated with the face track.
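As an illustration of how such a face-track record and index entry might be organized in code, the following Python sketch lays out the fields described above; the class and field names are illustrative and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

BoundingBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)

@dataclass
class FaceTrack:
    """One contiguous appearance of a face: frames Fs..Fe plus per-frame data."""
    start_frame: int                                              # Fs
    end_frame: int                                                # Fe
    coords: Dict[int, BoundingBox] = field(default_factory=dict)  # X(Fs)..X(Fe)
    characteristic_views: List[int] = field(default_factory=list) # frame numbers of set C
    frontal_views: List[int] = field(default_factory=list)        # frame numbers of set F
    characteristic_data: dict = field(default_factory=dict)       # templates, feature vectors, audio data

@dataclass
class FaceEntry:
    """One index entry Fn, grouping all tracks attributed to the same face."""
    face_id: str                                   # e.g. "A", "B", ...
    tracks: List[FaceTrack] = field(default_factory=list)
    annotation: str = ""                           # optional label / description
```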
  • FIG. 4 shows a preferred embodiment of a face track data structure.
  • FIG. 5 depicts a sample face index that is generated in module 250, FIG. 2, according to a particular embodiment of the invention for the example depicted in FIG. 1C.
  • the first detected face appearance is inputted as the first entry in the face index store (labeled as face A).
  • the track is inputted as a separate row (A1) in the data store, including the start and end frame indices, as well as the spatial coordinates of the face location in each frame of the sequence.
  • In frames 77 through 143, two faces are detected. One of them is found similar to face A, such that the face track is linked to the face A entry in the face index store, indicated as a separate row, A2.
  • the second face is a new detected face, initiating a new entry in the face index (labeled as face B), and the track information is inputted as a separate row, B1.
  • the process is repeated for new face C, etc.
  • Face entries in the data store are labeled randomly (A, B, ...); the main focus is to differentiate among them as separate entities.
  • Annotation can be added to an index entry (right-most field). It can include a label (name of person), a description (affiliation, visual features, etc); it may be input manually or input automatically, as will be described in detail below.
  • FIG. 6 shows a set of face characteristic views selected from a face track in accordance with the present invention
  • a star denotes a face frontal view which may be selected as taught by the present invention.
  • FIG. 7 shows an overview of the process of generating a face track from a video image sequence, and extracting the associated characteristic data. The processing steps can be initiated at each frame of the video or more likely at a sub-set of the video frames, selected by module (710). The sub-set can be obtained by even sampling or by a key-frame selection process.
  • Two process modules are applied to the frames: a face detection module (720) and a face tracking module (730). In the face detection module (720), each frame is analyzed to find face-like regions. Face detection algorithms as known in the prior art may be utilized.
  • the detection method as taught by prior art locates facial features in addition to the bounding box of the entire face region.
  • Such features generally consist of eye features and additionally mouth features and nose features.
  • In the face tracking module (730), each set of consecutive frames (I, I+1) is analyzed to find correlation between faces in the two frames. Correlation-based tracking is described below with respect to FIG. 8.
  • the decision logic of block 750 may include the following set of rules.
  • a currently detected face in frame I+1 is defined as a new face entry if either no face was detected in the same spatial locality in frame I, or a face that did appear in the local neighborhood in frame I is substantially dissimilar in appearance and attributes.
  • Otherwise, the currently detected face region is part of an existing track.
  • In that case, correlation-based checking is pursued as a similarity check between the new face region and the faces in the previous frame. If a match is found, the existing track entry is augmented with information from the current face region, and the track index is updated to include the current frame.
  • a third case should also be considered, in which a track exists from the previous frame, I, yet the detection module (720) does not detect the face in the current frame, I+1. Rather, by using the track information, the tracking module (730) is able to find a face-like region that correlates with the existing track.
  • The face region in the previous frame may guide the search for the face region in the current frame, and a correlation may be found between the smoothly transitioned viewpoints of the face.
  • the existing track entry is augmented with information from the current face region, and the face track index is updated to include the current frame.
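The decision logic above can be illustrated with a short Python sketch. This is a minimal, assumed realization of the "same locality and similar appearance" rules of block 750, not the patented implementation; the thresholds, the IoU locality test, and the normalized-correlation similarity (face patches assumed resized to a common size) are all illustrative choices.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter + 1e-9)

def patch_similarity(p, q):
    """Normalized correlation of two equal-sized grayscale face patches."""
    p = (p - p.mean()) / (p.std() + 1e-9)
    q = (q - q.mean()) / (q.std() + 1e-9)
    return float((p * q).mean())

def update_tracks(active_tracks, detections, frame_idx,
                  locality_thresh=0.3, similarity_thresh=0.5):
    """Decision logic in the spirit of block 750: a detection extends an active track
    when it lies in the same spatial locality as the track's last face and looks
    similar to it; otherwise it opens a new track. Thresholds are illustrative."""
    matched_ids = set()
    for det_box, det_patch in detections:            # detections from module 720
        best, best_score = None, 0.0
        for track in active_tracks:
            if iou(det_box, track["last_box"]) < locality_thresh:
                continue                             # not in the same locality as frame I
            score = patch_similarity(det_patch, track["last_patch"])
            if score > best_score:
                best, best_score = track, score
        if best is not None and best_score >= similarity_thresh:
            best.update(last_box=det_box, last_patch=det_patch, end=frame_idx)
            matched_ids.add(id(best))
        else:                                        # new face entry
            active_tracks.append({"start": frame_idx, "end": frame_idx,
                                  "last_box": det_box, "last_patch": det_patch})
            matched_ids.add(id(active_tracks[-1]))
    # Tracks with no matching detection would be handed to the tracker (module 730),
    # which may still extend them by correlation with the previous face region.
    unmatched = [t for t in active_tracks if id(t) not in matched_ids]
    return active_tracks, unmatched
```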
  • the terminated face track structure is taken out of the face track store and inserted and merged into the face index store (block 770).
  • FIG. 8 describes a particular embodiment of the face tracking module 730 effected via eye tracking.
  • a defined tracking reference frame, R, is set to K (810).
  • the current frame, J, is set to the following frame in the sequence (820).
  • a prediction or estimate can be made as to the most probable location of the same features in the current frame, J (830).
  • the location of the features can be estimated as the same location in the previous frame.
  • trajectory estimates may be used, specifically when motion of the features is detected.
  • a similarity transformation can be derived (840) as the mathematical definition of the disparity between the corresponding features. This transformation may be used to update R as a modified frame R1, such that any differences (e.g. zoom, rotation) between the new reference frame R1 and the current frame, J, are reduced (850).
  • Correlation matching may be used for the final feature search based on a best match search between the features in the two frames (860). For example, such a matching may involve the correlation of a window surrounding the features (e.g. eyes) in frame R1 with several windows at the locality of the predicted feature locations in frame J, and selecting the highest correlation score.
  • a verification step may be included (870) enabling an update of the reference frame.
  • the reference frame may be updated to eliminate the possibility of a continuous reduction in the correlation score between the reference frame and the new input frame (for example for a slowly rotating face), that may lead to a premature termination of the track.
  • If an update is warranted, the reference frame R is updated to be, in fact, the current frame J.
  • the iteration is completed, and the next frame in the sequence is analyzed in the next loop of the tracking process (back to 820). In case of no update, the reference frame remains unchanged in the next loop.
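Two pieces of this eye-tracking loop lend themselves to a concrete sketch: estimating the similarity transformation of step 840 from a pair of eye correspondences, and the correlation search of step 860 around the predicted feature location. The Python sketch below is one plausible realization under those assumptions; it is not the patent's own formulation.

```python
import numpy as np

def similarity_from_eyes(eyes_ref, eyes_cur):
    """One way to realize step 840: estimate the 2-D similarity transform (scale,
    rotation, translation) mapping the eye pair of reference frame R onto the eye
    pair of the current frame J. eyes_* are ((xL, yL), (xR, yR))."""
    r = np.asarray(eyes_ref, dtype=float)
    c = np.asarray(eyes_cur, dtype=float)
    vr, vc = r[1] - r[0], c[1] - c[0]                 # inter-ocular vectors
    scale = np.linalg.norm(vc) / (np.linalg.norm(vr) + 1e-9)
    angle = np.arctan2(vc[1], vc[0]) - np.arctan2(vr[1], vr[0])
    A = scale * np.array([[np.cos(angle), -np.sin(angle)],
                          [np.sin(angle),  np.cos(angle)]])
    t = c.mean(axis=0) - A @ r.mean(axis=0)           # align the eye midpoints
    return A, t                                       # p_cur ~= A @ p_ref + t

def normcorr(a, b):
    """Zero-mean normalized correlation of two equal-sized image windows."""
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    return float((a * b).mean())

def best_match(window_ref, image_cur, predicted_xy, radius=5):
    """Step 860: correlate the reference feature window (e.g. an eye) against windows
    around the predicted location in the current frame and keep the best score."""
    h, w = window_ref.shape
    best_xy, best_score = predicted_xy, -1.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = predicted_xy[0] + dx, predicted_xy[1] + dy
            if x < 0 or y < 0:
                continue                              # window falls off the image
            patch = image_cur[y:y + h, x:x + w]
            if patch.shape != window_ref.shape:
                continue
            score = normcorr(patch, window_ref)
            if score > best_score:
                best_xy, best_score = (x, y), score
    return best_xy, best_score
```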
  • FIG. 9 shows a preferred embodiment for selecting the Face-Characteristic-Views subject to self-similarity criteria. The process starts with the start frame of a track entering the set, C, of Face-Characteristic-Views (905). The start frame, I, is taken as the reference frame, and the next frame is loaded as frame K (910). Given the currently selected reference frame, I, the consecutive frame, K, is compared against I. In a procedure similar to the one described with respect to FIG. 8, the correspondence between facial features is used to solve for any face motion between frames I and K (920). The face region in frame K is compensated for the computed face motion (930).
  • The compensated region is then subtracted from the corresponding face region in I (940), and the difference value, D(I,K), is used to determine whether K is to be defined as a new member of the Face-Characteristic-View set (950). If the difference between the corresponding faces in the two frames is not large, the next frame, K+1, is loaded in the next loop. If the difference is larger than a predefined threshold, the face in frame K is added to the set C, reference frame I is set to K, and K is set to the next frame, K+1 (960). The process, as shown in FIG. 9, is terminated at the end of a face track. At the end of a track, the set C contains the most distinctive views of the face.
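A compact Python sketch of this selection loop, under the simplifying assumption that motion compensation (steps 920-930) is supplied externally, might look as follows; the difference measure and the threshold value are illustrative.

```python
import numpy as np

def face_difference(face_ref, face_cur_compensated):
    """D(I, K) as in step 940: mean absolute intensity difference between the
    reference face region and the motion-compensated face region of frame K."""
    a = face_ref.astype(float)
    b = face_cur_compensated.astype(float)
    return float(np.abs(a - b).mean())

def select_characteristic_views(track_faces, diff_thresh=20.0, compensate=None):
    """Sketch of FIG. 9: walk along the track, motion-compensate each face toward the
    current reference, and add it to set C only when D(I, K) exceeds the threshold.
    `compensate(ref, cur)` stands in for steps 920-930 (feature correspondence plus
    warping); it defaults to the identity, which assumes faces are pre-aligned."""
    compensate = compensate or (lambda ref, cur: cur)
    C = [0]                        # the start frame always enters the set (905)
    ref = track_faces[0]
    for k in range(1, len(track_faces)):
        warped = compensate(ref, track_faces[k])
        if face_difference(ref, warped) > diff_thresh:
            C.append(k)            # distinctive enough: new characteristic view (960)
            ref = track_faces[k]   # it also becomes the new reference
    return C
```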
  • the set of Face-Characteristic-Views, C, is a set of face templates at varying viewpoint angles. Contiguous frames of similar face appearance can be reduced to a single face-frame. Frames that are different enough, in the sense that they cannot be reconstructed from existing frames via a similarity transformation, are included in the set. A future recognition process can therefore be less sensitive to the following (and other) parameters: viewpoint angle of the face; facial expressions, including opening vs. closing of the mouth; blinking; and external distractions, such as sunglasses.
  • the Face-Characteristic-View set, C, also enables the identification of dominant features of the particular face, including (among others): skin-tone color; eye color; hair shades; and any special marks (such as birth marks) that are consistent within the set.
  • In some cases, the set C will contain only a limited set of views, as there is limited variability in the face and its viewpoint.
  • In other cases, the set will contain a large number of entries per face, encompassing the variety of viewpoints of the person in the scene.
  • the Face-Frontal-View set, F, is a set of face templates of the more frontal views of the face. These frames are generally the most recognizable frames.
  • the selection process is implemented by symmetry-controlled and quality-controlled criteria:
  • the score is computed from correlation values of eyes and mouth candidate regions with at least one eye and mouth template set, respectively.
  • the quality index depends on a face orientation score computed from a mirrored correlation value of the two eyes.
  • the face centerline is estimated from the mouth and nose locations, and the face orientation score is computed from the ratio of the distances from the left and right eyes to the facial centerline.
  • the face quality index also includes a measure of the occlusion of the face, in which an approximating ellipse is fitted to the head contour and the ellipse is tested for intersection with the frame boundaries. In yet another embodiment, the ellipse is tested for intersection with other regions.
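The symmetry cues described above (mirrored eye correlation and the balance of eye-to-centerline distances) can be combined into a simple frontal-view score. The sketch below is an assumed formulation with an arbitrary equal weighting, not the patent's exact quality index.

```python
import numpy as np

def frontal_score(left_eye_patch, right_eye_patch,
                  left_eye_x, right_eye_x, centerline_x):
    """Two illustrative symmetry cues for ranking frontal views: (a) correlation of
    one eye patch with the mirror image of the other, and (b) how evenly the two
    eyes straddle the facial centerline estimated from the nose and mouth."""
    a = left_eye_patch.astype(float)
    b = np.fliplr(right_eye_patch).astype(float)       # mirror the right eye
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    mirror_corr = float((a * b).mean())                 # near 1 for a symmetric, frontal face

    d_left = abs(centerline_x - left_eye_x)
    d_right = abs(right_eye_x - centerline_x)
    balance = min(d_left, d_right) / (max(d_left, d_right) + 1e-9)  # 1 when centred

    return 0.5 * mirror_corr + 0.5 * balance            # illustrative equal weighting
```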
  • Geometrical information can include interocular distance between the eyes, vertical distance between the eyes and mouth, width of chin, width of mouth and so on, as shown in FIG. 10b.
  • Geometrical features such as the ones mentioned herein, are known in the prior art as one of the means for face recognition.
  • a histogram may be used to analyze the color content of the region. If only natural skin color is detected (a single mode), there is a strong probability of baldness. In the case of two dominant colors, one is the natural skin color and the second dominant color is the hair color.
  • a probabilistic interpretation can be associated with the results from each of the above-mentioned algorithms. In one embodiment, the associated probabilities can be provided as confidence levels affiliated with each result.
  • the Face-Characteristic-Data combines the above data, including templates (matrices of intensity values), feature vectors containing geometrical information such as characteristic distances or other information such as coefficients of the eigen-face representation, and a unique characteristic vector, as shown in FIG. 11.
  • Audio-Characteristic-Data is incorporated as an additional informative characteristic for the segment.
  • Audio-Characteristic-Data may include a set of parameters representing the audio signature of the speaker.
  • Speech-to-text may be incorporated, as known in the prior art, to extract recognizable words, and the extracted words can be part of the index.
  • the overall recognition accuracy can be improved.
  • FIG. 12 shows a timeline and the video and audio data which correspond to that timeline.
  • the face/no-face segmentation of the video stream serves as a master temporal segmentation that is applied to the audio stream.
  • the audio segments derived can be used to enhance the recognition capability of any person recognition system built on top of the indexing system constructed according to the present invention.
  • a further purpose of the present invention is to match audio characteristic data which correspond to two different face tracks in order to confirm the identity of the face tracks.
  • the present invention may utilize prior art methods of audio characterization and speaker segmentation. The latter is required for cases in which the audio may correspond to at least two speakers.
  • FIG. 13 shows how to combine the audio track with the face index to create an audio-visual index.
  • the present invention may use prior art methods in speech processing and speaker identification.
  • Gaussian Mixture Models (GMMs) may be used to model individual speakers.
  • Given a speech utterance, it may be scored against a speaker model, as described below, to determine whether the speech belongs to that speaker.
  • Acoustic features are extracted as in the training phase (e.g., by computing mel-frequency cepstral coefficients); then, considering each feature as a multi-dimensional vector and viewing the model as a probability distribution on such vectors, the likelihood of these features is computed.
  • In a closed-set setting, a group of speakers is given and a model for each speaker of that group is trained. In this setting, it is a priori known that the speech utterance belongs to one of the speakers in the group.
  • computing the likelihood of each speaker model and taking the one with maximum likelihood identifies the correct speaker.
  • the speaker verification problem is actually a hypothesis test.
  • the likelihood of the speaker model is compared with the likelihood of a so-called cohort model, representing the alternative hypothesis (that the speech utterance belongs to another speaker). If the likelihood ratio passes a threshold, the utterance is said to belong to that speaker.
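A minimal sketch of GMM training and the cohort likelihood-ratio test, using scikit-learn's GaussianMixture as a stand-in for the speaker models; the library choice, the number of mixture components, the feature type, and the threshold are all assumptions.

```python
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(features, n_components=8, seed=0):
    """Fit a GMM to a speaker's acoustic feature vectors (e.g. MFCCs, one per row)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed)
    gmm.fit(features)
    return gmm

def verify_speaker(utterance_feats, speaker_gmm, cohort_gmm, threshold=0.0):
    """Likelihood-ratio test described above: accept the claimed speaker if the
    average log-likelihood under the speaker model exceeds that under the cohort
    (alternative-hypothesis) model by more than a threshold."""
    llr = speaker_gmm.score(utterance_feats) - cohort_gmm.score(utterance_feats)
    return llr > threshold, llr
```

In the closed-set identification setting described above, one would instead score the utterance against each speaker's GMM and pick the model with maximum likelihood.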
  • unsupervised speaker segmentation may be done using an iterative algorithm (1340). Parameters for the speaker GMMs are first initialized using a clustering procedure and then are iteratively improved using the Viterbi algorithm to compute segmentation. Alternatively, one may first detect audio-track changes, or cuts, by considering a sliding window of the audio and testing the hypothesis that this window has no cuts against the hypothesis that there is a cut. Then speaker identification may be performed on each segment (defined as a portion of the audio track between cuts).
  • Audio cut detection may be done by a generalized likelihood ratio test, in which the first hypothesis ("no cuts") is modeled by fitting a best Gaussian distribution to the acoustic feature data derived from the audio window, and the second hypothesis ("there is a cut”) is modeled by exhausting over cut points, fitting a best Gaussian to each part of the window (left or right to the cut point), and taking the cut point for which the likelihood is maximal.
  • the likelihood of the first hypothesis is compared to the likelihood of the second hypothesis, and a threshold is used in the decision process (taking into account the larger number of parameters in the second hypothesis).
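The generalized likelihood ratio test for a cut can be sketched as follows; the BIC-style penalty used here is one common way to account for the larger number of parameters in the second hypothesis, and the exact penalty weight and minimum segment length are assumptions.

```python
import numpy as np

def gaussian_loglik(X):
    """Maximized log-likelihood of data X (n samples x d features) under a single
    full-covariance Gaussian fitted to it."""
    n, d = X.shape
    cov = np.atleast_2d(np.cov(X, rowvar=False, bias=True)) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    # At the MLE, the log-likelihood reduces to -n/2 * (log|Sigma| + d*log(2*pi) + d)
    return -0.5 * n * (logdet + d * np.log(2 * np.pi) + d)

def detect_cut(X, penalty=1.0, min_seg=10):
    """Generalized-likelihood-ratio sketch of the audio cut test: compare 'no cut'
    (one Gaussian over the whole window) against the best cut point (one Gaussian
    per side), penalizing the extra parameters. Returns (cut_index or None, score)."""
    n, d = X.shape
    if n < 2 * min_seg:
        return None, float("-inf")
    l0 = gaussian_loglik(X)
    best_i, best_gain = None, -np.inf
    for i in range(min_seg, n - min_seg):
        gain = gaussian_loglik(X[:i]) + gaussian_loglik(X[i:]) - l0
        if gain > best_gain:
            best_i, best_gain = i, gain
    n_params = d + d * (d + 1) / 2                    # extra mean + covariance terms
    score = best_gain - penalty * 0.5 * n_params * np.log(n)  # BIC-style penalty
    return (best_i, score) if score > 0 else (None, score)
```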
  • FIG. 13 illustrates a preferred embodiment, wherein the segmentation of the audio track is aided by visual cues from the video face indexing process (1310).
  • the audio is partitioned with respect to the video content.
  • the initial hypothesis can be a single speaker.
  • the audio characteristic data (such as the GMMs parameters) are associated with that face (1350).
  • the initial hypothesis can be two speakers.
  • the audio characteristic data of a speech segment are associated with the face of highest mouth activity as computed by the visual mouth activity detector (1320).
  • Tracking mouth activity is known in prior art. Such technology may entail a camera capturing the mouth image followed by thresholding the image into two levels, i.e., black and white. Binary mouth images are analyzed to derive the mouth open area, the perimeter, the height, and the width.
  • the binary mouth images themselves may be used as the visual feature, with clustering schemes used to classify between these binary images. Derivative information of the mouth geometrical quantities, as well as optical flow input have been suggested as well.
  • Prior art also entails the detection of the lip boundaries (upper and lower lips), the modeling of the lip boundaries via splines and the tracking of the spline movements across frames. Motion analysis of the lip boundaries (spline models) can facilitate an overall analysis of visual mouth activity, such as the segmentation into talking vs. non-talking segments. Further analysis utilizing learning technologies such as neural-networks, and additional techniques such as a technique called "Manifold Learning", can facilitate learning complex lip configurations. These techniques are used in the prior art to augment speech recognition performance. In this invention they are utilized for the detection of motion and activity towards the affiliation of an audio signature to a visual face, as well as the affiliation of extracted words from the audio to the visual face.
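As a rough illustration of how binary mouth images could be turned into a talking vs. non-talking labeling, the sketch below thresholds the temporal fluctuation of the mouth-open area; real systems, as noted above, use richer geometric, spline-based, or learned lip features, so the window length and variance threshold here are purely illustrative.

```python
import numpy as np

def talking_segments(mouth_masks, window=15, var_thresh=50.0):
    """Label frames of a face track as talking when the binarized mouth-open area
    fluctuates strongly within a sliding window (an assumed, simplified criterion)."""
    areas = np.array([int(m.sum()) for m in mouth_masks], dtype=float)  # open pixels per frame
    labels = np.zeros(len(areas), dtype=bool)
    for i in range(0, len(areas) - window + 1):
        if areas[i:i + window].var() > var_thresh:
            labels[i:i + window] = True       # strong mouth-area fluctuation => talking
    return labels
```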
  • audio characteristic data from the speech segment can augment the face index.
  • a speech to text engine is employed (1410) to generate a text transcription of the speech segments in the video program.
  • Such engines are commercially available (e.g., from Entropic Ltd. of Cambridge, England, or ViaVoice from IBM).
  • Such an engine may operate on a computer file representation of the audio track, or in real time using the audio signal input into a computer with a digital audio card (such as Sound Blaster by Creative Labs).
  • a full transcription of the audio may be extracted by speech recognition technology, as described above.
  • closed-caption decoding may be utilized, when available.
  • full transcription can be incorporated.
  • a subset of predefined keywords may be extracted.
  • the transcription of the audio is next aligned with the video face index.
  • the alignment is achieved by means of the related time codes (1420).
  • said alignment may include the identification of the start and end points of each word or utterance, herein termed the word time segment.
  • a word is associated with a face track for which there is an overlap between the word time segment and the face track time segment.
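The overlap rule for attaching words to face tracks is straightforward to express in code; the sketch below assumes words and tracks carry start and end times on a common time base (seconds or frame numbers), which is what the time-code alignment above provides.

```python
def align_words_to_tracks(words, tracks):
    """Associate each transcribed word with every face track whose time segment
    overlaps the word time segment.
    words:  list of (text, t_start, t_end)
    tracks: list of (track_id, t_start, t_end) on the same time base."""
    annotations = {tid: [] for tid, _, _ in tracks}
    for text, ws, we in words:
        for tid, ts, te in tracks:
            if ws < te and we > ts:           # non-empty temporal overlap
                annotations[tid].append(text)
    return annotations

# Example: align_words_to_tracks([("economy", 12.0, 12.4)], [("A1", 10.0, 20.0)])
# returns {"A1": ["economy"]}.
```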
  • the transcription may be attached as text to the face track that is determined visually to be the best match, which we may term speech-to-speaker matching. For example, when an entire shot includes a single face track, the initial hypothesis can be a single speaker.
  • In this case, the transcription is attached in full to the matching face.
  • the audio transcription of a speech segment is associated with the face of highest mouth activity as computed by the visual mouth activity detector (1320).
  • For the visual mouth activity detector (1320), there is a need to identify face segments, or parts of face segments, as talking segments or non-talking segments, in which the associated face is talking or not talking, respectively. This is valuable information to be included in the face index store, enabling a future search for a person in a talking state. Moreover, this information may be critical for the matching process described above between speech and speaker.
  • mouth motion is extracted in a face track by tracking the mouth on a frame-by-frame basis throughout the track.
  • the extracted motion characteristics are compared against talking-lips movement characteristics, and a characterization is made into one of two states: talking vs. non-talking.
  • the extracted information from a face track may be incorporated into the face index, and links may be provided between similar face tracks to merge the information.
  • the linking and information-merging stage, as part of the face indexing process, is depicted in Fig. 15. If the face index store is empty (1510), the current face track initializes the index, providing its first entry.
  • distances are calculated between the new face track characteristics and the characteristics of each face entry in the index (1520).
  • distances can be computed between several fields of the Face-Characteristic-Data: template data (Fg in FIG. 11); feature vector data (Ff in FIG. 11), using distance metrics as known in the art, such as the Euclidean metric or the Mahalanobis metric; and unique characteristic data (Fu in FIG. 11).
  • An overall distance measure is calculated as a function of the individual distance components, in one embodiment being the weighted sum of the distances. Distance measures are ranked in increasing order (1530).
  • the smallest distance is compared to a similarity threshold parameter (1540) to categorize the entry either as a new face to be entered into the index, or as an already existing face, in which case the information is merged into an existing entry in the index.
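A sketch of this linking-and-merging step (FIG. 15), with an assumed weighted-sum distance over equal-length feature vectors keyed by field name; the representation of entries and the weighting scheme are illustrative.

```python
import numpy as np

def merge_track_into_index(index, track_features, weights, similarity_thresh):
    """Compute a weighted distance between the new track's characteristic data and
    every existing face entry, then either merge into the closest entry or open a
    new one. `track_features` and each entry's 'features' are dicts of equal-length
    numpy vectors keyed by field name (template, geometry, ...)."""
    if not index:                                            # empty store (1510)
        index.append({"features": track_features, "tracks": [track_features]})
        return 0
    def dist(a, b):
        return sum(w * np.linalg.norm(a[k] - b[k]) for k, w in weights.items())
    dists = [dist(track_features, entry["features"]) for entry in index]   # (1520)
    best = int(np.argmin(dists))                             # ranking (1530)
    if dists[best] <= similarity_thresh:                     # existing face (1540)
        index[best]["tracks"].append(track_features)
        return best
    index.append({"features": track_features, "tracks": [track_features]})  # new face
    return len(index) - 1
```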
  • This completes the generation of the index of FIG. 3, involving the processing of the inputted video and the grouping of the information into tracks, and of the tracks into appropriate index entries.
  • an additional grouping step can be considered to better facilitate browsing and searching and augment recognition performance.
  • the information associated with the index entry from all corresponding tracks can be merged, to produce a representative set of characteristic data.
  • One such set includes the best-quality frontal face view (obtained by ranking the frontal face views from all tracks); a second set includes the conjunction of all the unique characteristic data.
  • the editor can annotate the face index.
  • Annotation may be implemented in several modes of operation.
  • a description or label may be attached manually to an entry in the index, following which the annotation becomes automatically linked to all frames in all tracks that relate to the index entry.
  • each new frame that is added incrementally to a respective location in the index store automatically receives all annotations available for the respective index.
  • Annotation descriptions or labels may be inputted manually by an operator (270).
  • Annotations may also be derived automatically from any characteristic data in the data store (e.g., the occurrence of glasses or a beard).
  • Audio characteristic data, such as keywords detected within a segment or the entire spoken text, can be extracted via speech-to-text, as known in the art, to incorporate annotation of audio information as additional information in the index.
  • each video frame may have associated with it a text string that contains the annotations as present in the related index entry.
  • a video annotation attached to each video frame is exemplified in FIG. 16.
  • the index annotation field of FIG. 5 (right-most field) is now linked to the related video frames, as shown in FIG. 16.
  • a single index entry is associated with a video track, and a video track may be associated in turn with a large number of video frames.
  • Each frame in the corresponding sequence is now (automatically) annotated.
  • annotation may include a label (such as the person name), any manually inputted description (such as the affiliation of the person), any description of visual features (e.g., 'beard', 'glasses'), and other, as taught in this invention.
  • a listing may be generated and displayed, as shown in Table 1, summarizing all annotated information as related to frame segments.
  • the face index store is used for browsing and searching applications.
  • a user interested in browsing can select to browse at varying resolutions.
  • the display includes the best-quality, most frontal face template (a single image of the face on screen).
  • the user may select to see the set of best characteristic views (e.g. one row of face images of distinct views).
  • The user may select to see all views in all tracks available in the store for the person of interest.
  • Several screens of frames will be displayed. Frames may be ordered based on quality metrics, or they may be ordered via a time line per the video source. The user may interactively select the mode of interest and shift amongst them.
  • the present invention teaches a method of searching in video (or in a large collection of images) for particular members of an object class, by generating the above-described index and searching the index.
  • FIG. 17 presents a flowchart of the searching methodology.
  • the input search query may address any of the variety of fields present in the index, and any combinations therefrom. Corresponding fields in the index will be searched (1710).
  • the user can search for example by the name of the person.
  • the label or description inputted by the user is compared to all labeled data; if a name in fact exists in the index, all related face tracks would be provided as search results.
  • the user can also search by providing an image of a face and searching for similar looking faces; in this case, template data will be used, such as the best quality face template, or the best set of characteristic views per person entry.
  • template data will be used, such as the best quality face template, or the best set of characteristic views per person entry.
  • the user can also search by attribute data; in this scenario, feature vectors are compared.
  • the user can query using multiple attributes, such as the face label as well as a set of keywords; in this case, the search would be conducted simultaneously on multiple parameters, such as the face template and audio characteristics.
  • Matching index entries are extracted and can be optionally ranked and sorted based on the match quality (1720).
  • a match score may be computed by correlating the input template to the best quality template per index store entry; or the input template can be compared with the best set of characteristic view templates; or the input characteristic data can be compared with stored characteristics per index store entry, etc.
  • the final match score can be chosen as the best match score, or as a weighted combination of the scores.
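A sketch of the search-and-rank step, assuming index entries and queries are represented as dictionaries of optional fields (label text, keyword list, template or attribute vectors); the field names, scoring functions, and weights are illustrative, not the patent's own.

```python
import numpy as np

def search_index(index, query, weights):
    """Score each index entry against whichever query fields are present (1710) and
    return entries ranked by the combined, weighted score (1720)."""
    def field_score(entry, key):
        if key == "label":
            return 1.0 if query["label"].lower() in entry.get("label", "").lower() else 0.0
        if key == "keywords":
            hits = set(query["keywords"]) & set(entry.get("keywords", []))
            return len(hits) / max(len(query["keywords"]), 1)
        # numeric fields (template / attribute vectors): inverse-distance similarity
        if key in entry:
            return 1.0 / (1.0 + float(np.linalg.norm(query[key] - entry[key])))
        return 0.0
    results = []
    for i, entry in enumerate(index):
        score = sum(weights[k] * field_score(entry, k) for k in query if k in weights)
        results.append((i, score))
    results.sort(key=lambda t: -t[1])          # ranked by match quality
    return results
```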
  • the matched index entries are automatically associated with the corresponding sets of video frames (1730).
  • the search results may be displayed in a variety of forms.
  • a listing of the index entries may be outputted.
  • the video frames are displayed as icons. Output frames may be ordered based on match quality; alternatively they may be displayed on a time-line basis per related video source (e.g. per tape or movie source).
  • Utilizing the face index store allows for much greater efficiency in browsing and searching of the video data, as well as increasing recognition accuracy. Automatic organization of the data into representative sets, merged across multiple tracks from different video sources, allows for efficient browsing, at varying resolutions, as described above. In the searching mode, rather than searching on a frame-by-frame basis for the person of interest, a comparison is made on the indexed material in the face index store.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A method and apparatus for generating an index (250) of at least one class of objects appearing in a collection of images (210, 230) for use in searching or browsing (240) for particular members of the class, by: processing (260) the collection of images to extract therefrom features characteristic of the class of objects; and grouping the images in groups according to extracted features helpful in identifying individual members of the class of objects. In the described preferred embodiments, the collection of images represents a sequence of video frames, and the grouping produces face tracks, each identified by its starting and ending frames and containing face segments characteristic of a respective face. Audio characteristic data and/or annotations (270, 280) may also be associated with the face tracks.

Description

METHOD OF FACE INDEXING FOR EFFICIENT BROWSING AND SEARCHING OF PEOPLE IN VIDEO
FIELD OF THE INVENTION
The present invention relates to a method and apparatus for generating indices of classes of objects appearing in a collection of images, particularly in a sequence of video frames, for use in searching or browsing for particular members of the respective class. The invention is particularly useful for indexing human faces in order to produce an index for face recognition tasks in querying operations.
DESCRIPTION OF THE RELATED ART
The amount of video data stored in multimedia libraries grows very rapidly, which makes searching a time-consuming task. Both time and storage requirements can be reduced by creating a compact representation of the video footage in the form of key-frames, that is, a subset of the original video frames used to represent them. The prior art focuses on key-frame extraction, using key-frames as the basic primitives in the representation of video for browsing and searching applications.
A system for video browsing and searching, based on key-frames, is depicted in Fig. 1A. A video image sequence is inputted from a video feed module 110. The video feed may be a live program or recorded on tape. Analog video is digitized in video digitizer 112. Optionally, the system may receive digital representation, such as Motion JPEG or MPEG directly. A user interface console 111 is used to select program and digitization parameters as well as to control key-frame selection. A key-frame selection module 113 receives the digitized video image sequence. Key-frames can be selected at scene transitions by detecting cuts and gradual transitions such as dissolves. This coarse segmentation into shots can be refined by selecting additional key-frames in a process of tracking changes in the visual appearance of the video along the shot. A feature extraction module 114 processes key-frames as well as non key-frame video data to compute key-frames characteristic data. This data is stored in the key-frames characteristic data store 115 and is accessed by a video search engine 116 in order to answer queries generated by the browser-searcher interface 117. Such queries may relate to content attributes of the video data, such as color, texture, motion and others. Key-frame data can also be accessed directly by the browser 117. In browsing, a user may review key-frames instead of the original video, thus reducing storage and bandwidth requirements.
In an edited video program, the editor switches between different scenes. Thus, a certain collection of M video shots may consist only of N<M different scenes such that at least one scene spans more than one shot. The prior art describes how to cluster such shots based on their similarity of appearance, utilizing low-level features. It has become standard practice to extract features such as color, texture and motion cues, together with a set of distance metrics, and then utilize the distance metrics in the related feature spaces (respectively) for determining the similarity between key-frames of the video contents or shots. In this scenario, the video content is limited to the definition in the low-level feature space. What is missing in the prior art is the automatic extraction of high-level object-related information from the video during the indexing phase, so as to facilitate future searching applications.
In current systems for browsing and automatic searching, which are based on key-frames, the key-frames extraction and the automatic searching are separate processes. Combining the processes in a unified framework means taking into account high-level user-queries (in search mode), during the indexing phase. Spending more effort in a more intelligent indexing process proves beneficial in a short turn around rate in the searching process.
In automatic searching of video data by content, detecting and recognizing faces are of primary importance for many application domains such as news. The prior art describes methods for face detection and recognition in still images and in video image sequences.
A prior art method of face detection and recognition in video is depicted in Fig. 1B. A face detection module (122) operates on a set of frames from the input video image sequence. This set of frames may consist of the entire image sequence if the probability of detection is of primary importance. However, the face content in the video does not change with every frame. Therefore, to save computational resources, a subset of the original sequence may be utilized. Such a subset may be derived by decimating the sequence in time by a fixed factor or by selecting key-frames by a key-frame extraction module 121. The detected faces are stored in the face detection data store 123.
For each detected face, face features are extracted (124), where the features can be facial feature templates, geometrical constraints, and global facial characteristics, such as eigen-features and features of other known algorithms in the art. The face representation can be compared to a currently awaiting search query, or to a predefined face database; alternatively, it can be stored in a face feature database (125) for future use. By comparing face characteristic data from the database or from a user-defined query with the face characteristic data extracted from the video, the identity of people in the video can be established. This is done by the face recognition module 126 and recorded in the video face recognition report 127 with the associated confidence factor.
Several algorithms for face recognition are described in the prior art. One prior art method uses a set of geometrical features, such as nose width and length, mouth position and chin shape; Another prior art method is based on template matching. One particular method represents the query and the detected faces as a combination of eigen-faces.
Co-pending Application No. PCT/IL99/00169 by the same assignee, entitled "Method of Selecting Key-Frames from a Video Sequence", describes a method of key-frame extraction by post-processing the key-frame set so as to optimize the set for face recognition. A face detection algorithm is applied to the key-frames; and in the case of a possible face being detected, the position of that key-frame along the time axis may be modified to allow a locally better view of the face. That application does not teach how to link between different views of the same person or how to globally optimize the different retained views of that person. FIG. 1C shows a simple sequence of video scenes and the associated face content, or lack of it. In this example, some people appear in several scenes. Additionally, some scenes have more than one person depicted. FIG. 1D depicts the results of a sequential face-indexing scheme such as the one depicted in FIG. 1B. Clearly this representation provides only a partial, highly redundant description of the face content of the video scenes.
In a dynamic scene, a person may be visible for only a part, or for several parts, of the scene. In a scene and across the scenes, a person generally has many redundant views, but also several different views. In such a situation it is desirable to prune redundant views, and also to increase the robustness by comparing the user-defined query against all different views of the same person.
Also during a video segment, a person may go from a position where the person can be detected to a position where the person is visible but cannot be detected by automatic processing. In several applications it is useful to report the full segment of visibility for each recognized person.
SUMMARY OF THE INVENTION
According to one broad aspect of the present invention, there is provided a method of generating an index of at least one class of objects appearing in a collection of images for use in searching or browsing for particular members of the class, comprising: processing the collection of images to extract therefrom features characteristic of the class of objects; and grouping the images in groups according to extracted features helpful in identifying individual members of the class of objects. According to a preferred embodiment of the invention described below, the collection of the images represents a sequence of video frames, the class of objects is human faces, and the grouping forms face tracks of contiguous frames, each track being identified by the starting and ending frames and containing face regions. According to another aspect of the present invention, there is provided a method of generating an index of human faces appearing in a sequence of video frames, comprising: processing the sequence of video frames to extract therefrom facial features; and grouping the video frames to produce face tracks of contiguous frames, each face track being identified by the starting and ending frames in the track and each face track containing face data characteristic of an individual face.
According to a still further aspect of the present invention, there is provided a method of generating an index of at least one class of objects appearing in a collection of images to aid in browsing or searching for individual members of the class, comprising: processing the collection of images to generate an index of features characteristic of the class of objects; and annotating the index with annotations.
The invention also provides a method of processing a sequence of video frames having a video track and an audio track, to generate a speech annotated face index associated with speakers, comprising: generating a face index from the video track; generating a transcription of the audio track; and aligning the transcription with the face index. The invention further provides a method of processing a video track having face segments (including parts of face segments) to label such segments as talking or non-talking, comprising: tracking the face segments in the video track; detecting segments having mouth motion; and estimating from the detected segments, those having talking mouth motion vs. non-talking mouth motion.
According to a further aspect of the present invention, there is also provided a method of annotating a sequence of video frames, comprising: processing the sequence of video frames to generate a face index; and attaching a description to at least one entry in the face index.
The invention further provides a method of processing a sequence of video frames having a video track and an audio track, comprising: extracting from the video track, face segments representing human faces, and producing a face track for each individual face; extracting audio segments from the audio track; fitting a model based on a set of the audio segments corresponding to the individual face of a face track; and associating the model with the face track of the corresponding individual face.
According to another aspect of the present invention, there is provided a method of searching a collection of images for individual members of a class of objects, comprising: processing the collection of images to generate an index of features characteristic of the class of objects; and searching the index for individual members of the class of objects.
According to yet a further aspect of the present invention, there is provided apparatus for generating an index of a class of objects appearing in a collection of images for use in searching or browsing for particular members of the class, comprising: a processor for processing the collection of images to extract therefrom features characteristic of the objects, and for outputting therefrom indexing data with respect to the features; a user interface for selecting the features to be extracted to enable searching for and identifying individual members of the class of objects; and a store for storing the index data outputted from the processor in groups linked according to the features selected for extraction. As described more particularly below, the invention is particularly applicable for parsing a video stream into face/no face segments, indexing and logging the facial content, and utilizing the indexed content as an intelligent facial database for future facial content queries of the video data. The invention may use a high-level visual module in the indexing of a video stream, specifically, based on human facial information. Preferably, the invention also uses audio information as part of the indexing process and merges the audio and the video.
While the invention is particularly useful for indexing facial content in order to enable face recognition, the invention could also be used for indexing the characteristics of other classes of objects, such as billboards, logos, overlayed text (text added on to a video frame), geographical sites, etc. to enable recognition of individual members of such classes of objects, at an increased speed and/or probability of recognition.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1A describes a prior art method of searching in video by key-frame selection and feature extraction.
FIG. 1B describes a prior art method of face recognition. FIG. 1C describes a sample sequence of video content.
FIG. 1D presents the results of face detection applied to the sequence of FIG. 1C, organized as a sequential index.
FIG. 2 describes the browsing and searching system with face indexing, as introduced in this invention. FIG. 3 depicts the face index.
FIG. 4 depicts a preferred embodiment of a face track structure. FIG. 5 presents a sample face index generated by a particular embodiment of the present invention for the example of FIG. 1C.
FIG. 6 shows a set of face characteristic views selected from a face track, as taught by the present invention. FIG. 7 provides an overview of the process of generating a face track from a video image sequence and extracting the associated characteristic data from it.
FIG. 8 shows a particular embodiment of a face tracking module. FIG. 9 shows a particular embodiment for selecting the set of face characteristic views subject to self-similarity criteria.
FIG. 10a illustrates a template as the bounding box around a face region.
FIG. 10b illustrates the extraction of geometrical features from a face region. FIG. 11 displays the vector store of the face characteristic data.
FIG. 12 shows the extraction of audio data corresponding timewise to the video data.
FIGS. 13 and 14 show how to combine the audio track with the face index to create an audio-visual index. FIG. 15 describes the linking and information merging stage as part of the face indexing process.
FIG. 16 presents a sample video annotation generated by a particular embodiment of the present invention for the example of FIG. 1C.
FIG. 17 is a flow chart describing the method of searching an index and retrieving related video frames.
DETAILED DESCRIPTION OF THE INVENTION
As indicated above, a main purpose of the present invention is to provide a method of generating an index of a class of objects, such as a face object, appearing in a collection of images, such as a sequence of video frames, for use in searching or browsing for a particular member of the class, such as a specific individual's face (e.g., "Clinton"). The described method can be utilized for indexing various classes of objects (both visual objects and audio signatures) as long as the class of objects has a set of visual and/or audio characteristics that may be a-priori defined and modeled, so as to be used for automatic detection of objects within an input video stream, using video and audio analysis tools, as described herein.
According to the present invention, an input video stream is automatically parsed into segments that contain the class of objects of interest, segments representative of particular members of said class of objects are grouped, and an index or index store is generated. The indexed content may be used for any future browsing and for efficient searching of any particular object. Some examples of classes of objects and particular members of the class include: face objects and "Clinton"; logo objects and the "EuroSport" logo; and text objects (automatic detection of words in the video stream) and "economy".
The following description relates primarily to the case of face indexing; however, the methods taught herein are general and can be readily applied to include indexing for browsing and searching of any class of objects as described above. A system for video browsing and searching of face content in accordance with the present invention is depicted in FIG. 2. A video image sequence is inputted from a video feed module (210). The video feed may be a live program or recorded on tape. Analog video is digitized in a video digitizer (215). Optionally, the system may receive the digital representation directly. The video source, the program selection and digitization parameters, and the face-indexing selection parameters are all controlled by the user from an interface console (230). A subset of the video sequence is inputted into an indexing module (220). The computed face index data is stored in the face index data store (250). Preferably the face-index data is edited by a face index editor to correct several types of errors that may occur during automatic indexing. Such errors can originate from false face detection, that is, identifying non-face regions as faces. An additional form of error is over-segmentation of a particular face: two or more instances of the same face fail to be linked between appearances and thus generate two or more index entries. These errors are handled reasonably by any recognition scheme: false faces will generally not be recognized, and over-segmentation will result in some additional processing time and possibly reduced robustness. However, in applications in which the generated index is queried frequently, it is generally cost-effective to have an editor review the graphical representation of the face index, delete false alarms, and merge entries originating from the same person.
Preferably, the editor annotates the face index store by specifying the name of the person, or by linking the face appearance and another database entity. This embodiment provides a method of semi-automatic annotation of video by first generating a face index and then manually annotating only the index entries. The annotation becomes immediately linked to all tracks in the video that correspond to the specific index entry. Since the number of frames where a specific person is visible is much larger than the number of index entries, and since the spatial information (that is, the location within the frame) is readily available, a major saving in effort is achieved. A more elaborate description of utilizing the face index store for annotation is provided below.
The face index can be utilized for efficient browsing and searching. Browsing can be effected via a browser-searcher module (240), and searching can be effected via a video search engine (260), both accessing the index in order to process face-based queries. An extended description of the browsing and searching operational modalities is provided below.
FIG. 3 illustrates one possible structure of the face index 250, which is provided with fields arranged in vertical columns and horizontal rows. As will be described below, each face entry (Fn) is associated with a plurality of tracks (Tn), along with representative data per track. A face track or segment is a contiguous sequence of video frames in which a particular face appears. Each face track is depicted as a separate horizontal row in the index to include the information associated with the respective track as inputted into the face index store 250. Generating an index as in FIG. 3 involves several processes. First, the input video is processed to automatically detect and extract objects of the predefined object category. Processing involves utilizing the set of visual and audio characteristics that are a-priori defined and modeled using video and audio analysis tools as described herein. A second major process involves grouping the information according to extracted features helpful in identifying individual members of a class. Grouping involves the generation of a track of contiguous frames (Tn) for each individual member of the class of objects to be identified. Grouping may also involve the merging of frames that contain similar members of a class of objects, as indicated in associating different tracks (Tn) with a particular face entry (Fn).
A face track or segment is represented by a start frame, Fs, and an end frame, Fe. Face coordinates are associated with each frame in the track, {X(Fs), ..., X(Fe)}. In one embodiment, such coordinates could be the bounding-box coordinates of the face, as shown in FIG. 10a. In addition, the face track data structure includes the following items (an illustrative sketch of one possible data structure is given after this list):
• Face Characteristic Views, which are the visually distinct appearances of the face in the track.
• Face Frontal Views, which are the identified frontal views within the set of Face Characteristic Views; frontal views have better chances of being recognized properly.
• Face Characteristic Data, which are attributes computed from the face image data and stored for later comparison with the corresponding attributes extracted from the query image; such data can include audio characteristic data that can be associated with the face track. FIG. 4 shows a preferred embodiment of a face track data structure.
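By way of illustration only, the following is a minimal sketch, in Python, of one possible realization of the face track and face index structures described above; all class names and fields are hypothetical and are not taken from the drawings.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class FaceTrack:
    # one contiguous appearance of a face (one row of the index of FIG. 3)
    start_frame: int                                            # Fs
    end_frame: int                                              # Fe
    coords: Dict[int, Tuple[int, int, int, int]] = field(default_factory=dict)  # frame -> (x, y, w, h)
    characteristic_views: List = field(default_factory=list)    # visually distinct appearances
    frontal_views: List = field(default_factory=list)           # subset judged frontal / most recognizable
    characteristic_data: dict = field(default_factory=dict)     # Fg / Ff / Fu attributes, optional audio signature

@dataclass
class FaceEntry:
    # one individual (entry Fn of FIG. 3), linking all of its tracks
    label: str                                                  # arbitrary label (A, B, ...) until annotated
    tracks: List[FaceTrack] = field(default_factory=list)
    annotation: str = ""                                        # name, affiliation, visual attributes, etc.

face_index: Dict[str, FaceEntry] = {}                           # the face index store (250)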
FIG. 5 depicts a sample face index that is generated in module 250, FIG. 2, according to a particular embodiment of the invention for the example depicted in FIG. 1C. The first detected face appearance is inputted as the first entry in the face index store (labeled as face A). The starting frame is inputted as fs=43. The end frame for the face track is frame fe=76. The track is inputted as a separate row (A1) in the data store, including the start and end frame indices, as well as the spatial coordinates of the face location in each frame of the sequence. In frames 77 through 143, two faces are detected. One of them is found similar to face A, such that the face track is linked to the face A entry in the face index store, indicated as a separate row, A2. The second face is a newly detected face, initiating a new entry in the face index (labeled as face B), and the track information is inputted as a separate row, B1. The process is repeated for new face C, etc. It is to be noted that without an annotation process, face entries in the data store are labeled arbitrarily (A, B, ...); the main focus is to differentiate among them as separate entities. Annotation can be added to an index entry (right-most field). It can include a label (name of person) and a description (affiliation, visual features, etc.); it may be inputted manually or automatically, as will be described in detail below.
FIG. 6 shows a set of face characteristic views selected from a face track in accordance with the present invention; a star denotes a face frontal view which may be selected as taught by the present invention. FIG. 7 shows an overview of the process of generating a face track from a video image sequence and extracting the associated characteristic data. The processing steps can be initiated at each frame of the video or, more likely, at a sub-set of the video frames, selected by module (710). The sub-set can be obtained by even sampling or by a key-frame selection process. Two processing modules are applied to the frames: a face detection module (720) and a face tracking module (730). In the face detection module (720), each frame is analyzed to find face-like regions. Face detection algorithms as known in the prior art may be utilized. Preferably, the detection method, as taught by the prior art, locates facial features in addition to the bounding box of the entire face region. Such features generally consist of eye features and, additionally, mouth features and nose features. In the face tracking module (730), each set of consecutive frames (I, I+1) is analyzed to find correlation between faces in the two frames. Correlation-based tracking is described below with respect to FIG. 8.
The face detection results and the face tracking results are merged and analyzed in module 740. A decision is made in logic block 750 whether to make a new entry insertion into the face track structure, and/or whether to augment an existing entry. The decision logic of block 750 may include the following set of rules. A currently detected face in frame I+1 is defined as a new face entry if either no face was detected in the same spatial locality in frame I, or a face that did appear in the local neighborhood in frame I is substantially non-similar in appearance and attributes. Once a detected face is declared as a new face entry, a face track entry is added to the face index (FIG. 3) and a new face track structure is initiated with the starting frame index. Initially, all detected faces are inputted into the face index, with each also initializing a face track structure. In a second case, the currently detected face region is part of an existing track. In one embodiment, correlation-based checking is pursued for a similarity check between the new face region and the faces in the previous frame. If a matching track is found, the existing track entry is augmented with information from the current face region, and the track index is updated to include the current frame. A third case should also be considered, in which a track exists from the previous frame, I, yet the detection module (720) does not detect the face in the current frame, I+1. Rather, by using the track information, the tracking module (730) is able to find a face-like region that correlates with the existing track. Such a case could occur, for example, when a person is slowly rotating his head. The rotation angle may be large enough such that the face is no longer detectable as a face. Using the track information, however, the face region in the previous frame may guide the search for the face region in the current frame, and a correlation may be found between the smoothly transitioned viewpoints of the face. In this case, the existing track entry is augmented with information from the current face region, and the face track index is updated to include the current frame. Once a face track structure is terminated, as described below with respect to FIG. 8, the characterizing sets are selected and characterizing data is extracted, as indicated by block 760 and as described more particularly below.
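As a non-authoritative illustration of the decision logic of block 750, the following sketch (reusing the hypothetical FaceTrack and FaceEntry classes above) shows how a detected or tracked face region in frame I+1 might either continue an existing track or open a new one; the similarity test and the region fields are assumed placeholders.

def update_tracks(frame_idx, detections, tracked_regions, open_tracks, face_index, is_similar):
    # candidate regions: detector output plus tracker-only regions (the third case above)
    candidates = list(detections) + [r for r in tracked_regions
                                     if not any(is_similar(r.box, d.box) for d in detections)]
    for region in candidates:
        matched = None
        for track in open_tracks:
            prev = track.coords.get(frame_idx - 1)
            # continue a track if a similar face occupied the same locality in frame I
            if prev is not None and is_similar(region.box, prev):
                matched = track
                break
        if matched is not None:
            matched.coords[frame_idx] = region.box       # augment the existing track
            matched.end_frame = frame_idx
        else:
            new_track = FaceTrack(start_frame=frame_idx, end_frame=frame_idx,
                                  coords={frame_idx: region.box})
            open_tracks.append(new_track)                # new face track ...
            entry = face_index.setdefault(region.label, FaceEntry(label=region.label))
            entry.tracks.append(new_track)               # ... initially also a new index entry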
The terminated face track structure is taken out of the face track store and inserted and merged into the face index store (block 770).
FIG. 8 describes a particular embodiment of the face tracking module 730 effected via eye tracking. Initialized by a detection event at frame K, a defined tracking reference frame, R, is set to K (810), and the current frame, J, is set to the following frame in the sequence (820). Based on the location of features in the reference frame, R, a prediction or estimate can be made as to the most probable location of the same features in the current frame, J (830). For example, the location of the features can be estimated as the same location in the previous frame. Alternatively, trajectory estimates may be used, specifically when motion of the features is detected. Utilizing the location estimates for features in frame R and corresponding features in frame J, a similarity transformation can be derived (840) as the mathematical definition of the disparity between the corresponding features. This transformation may be used to update R as a modified frame R1, such that any differences (e.g. zoom, rotation) between the new reference frame R1 and the current frame, J, are reduced (850).
Correlation matching may be used for the final feature search based on a best match search between the features in the two frames (860). For example, such a matching may involve the correlation of a window surrounding the features (e.g. eyes) in frame R1 with several windows at the locality of the predicted feature locations in frame J, and selecting the highest correlation score.
If the correlation score is high, the detected face track is continued. A verification step may be included (870), enabling an update of the reference frame. The reference frame may be updated to eliminate the possibility of a continuous reduction in the correlation score between the reference frame and the new input frame (for example, for a slowly rotating face), which may lead to a premature termination of the track. Thus, if consecutive frames have a high correlation score, yet the correlation score between the input frame and the reference frame is below a threshold, the reference frame is updated. In such a scenario, the reference frame R is updated to be the current frame J. The iteration is completed, and the next frame in the sequence is analyzed in the next loop of the tracking process (back to 820). In case of no update, the reference frame remains unchanged in the next loop.
If the correlation score is below an acceptable threshold, no match is found and the end of the track is declared. The state of a track end is acknowledged as a track termination step. It leads to the extraction of feature sets representative of the track.
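The following sketch is one possible rendering of the tracking loop of FIG. 8; the feature detection, prediction, warping and correlation helpers, as well as the two threshold values, are assumptions rather than the patented method itself.

def track_face(frames, k, detect_features, predict_location, warp_reference, correlate,
               match_thresh=0.7, refresh_thresh=0.85):
    ref, ref_feats = k, detect_features(frames[k])       # reference frame R (810)
    track = [k]
    for j in range(k + 1, len(frames)):                  # next frame J (820)
        guess = predict_location(ref_feats)              # same-location or trajectory estimate (830)
        warped = warp_reference(frames[ref], ref_feats, guess)   # similarity transform R -> R1 (840-850)
        score, feats_j = correlate(warped, frames[j], guess)     # best-match feature search (860)
        if score < match_thresh:
            break                                        # end of track; characteristic data extracted next
        track.append(j)
        if score < refresh_thresh:                       # verification step (870): refresh the reference
            ref, ref_feats = j, feats_j                  # e.g. for a slowly rotating face
        # otherwise the reference frame remains unchanged for the next loop
    return track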
FIG. 9 shows a preferred embodiment for selecting the Face-Characteristic-Views subject to self-similarity criteria. The process starts with the start frame of a track entering the set, C, of Face-Characteristic-Views (905). The start frame, I, is taken as the reference frame, and the next frame is loaded as frame K (910). Given the currently selected reference frame, I, the consecutive frame, K, is compared against I. In a procedure similar to the one described with respect to FIG. 8, the correspondence between facial features is used to solve for any face motion between frames I and K (920). The face region in frame K is compensated for the computed face motion (930). The compensated region is then subtracted from the corresponding face region in I (940), and the difference value, D(I,K), is used to determine whether K is to be defined as a new member of the Face-Characteristic-View set (950). If the difference between the corresponding faces in the two frames is not large, the next frame, K+1, is loaded in the next loop. If the difference is larger than a predefined threshold, the face in frame K is added to the set C, the reference frame I is set to K, and K is set to the next frame, K+1 (960). The process, as shown in FIG. 9, is terminated at the end of a face track. At the end of a track, the set C contains the most distinctive views of the face.
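A compact sketch of the selection loop of FIG. 9 is given below; the motion estimation, compensation and face-difference helpers, and the threshold value, are placeholders.

def select_characteristic_views(track_frames, estimate_motion, compensate, face_diff,
                                diff_thresh=0.25):
    views = [track_frames[0]]                  # the start frame enters set C (905)
    ref = track_frames[0]                      # reference frame I
    for frame_k in track_frames[1:]:           # load the next frame K (910)
        motion = estimate_motion(ref, frame_k)             # solve for face motion (920)
        aligned = compensate(frame_k, motion)              # motion-compensate the face region (930)
        if face_diff(ref, aligned) > diff_thresh:          # subtract and threshold D(I,K) (940-950)
            views.append(frame_k)                          # distinctive view: add to C (960)
            ref = frame_k                                  # and reset the reference frame
    return views                               # at track end, C holds the most distinctive views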
The set of Face-Characteristic-Views, C, is a set of face templates at varying viewpoint angles. Contiguous frames of similar face appearance can be reduced to a single face-frame. Frames that are different enough, in the sense that they cannot be reconstructed from existing frames via a similarity transformation, are included in the set. A future recognition process can therefore be less sensitive to the following (and other) parameters: viewpoint angle of the face; facial expressions, including opening vs. closing of the mouth; blinking; and external distractions, such as sunglasses. The Face-Characteristic-View set, C, also enables the identification of dominant features of the particular face, including (among others): skin-tone color; eye color; hair shades; and any special marks (such as birth marks) that are consistent within the set.
In an anchorperson scene, the set C will contain a limited set of views, as there is limited variability in the face and its viewpoint. In a more dynamic scene, the set will contain a large number of entries per face, encompassing the variety of viewpoints of the person in the scene.
The Face-Frontal-View set, F, is a set of face templates of the more frontal views of the face. These frames are generally the most recognizable frames. The selection process is implemented by symmetry-controlled and quality-controlled criteria. In a preferred embodiment, the score is computed from correlation values of eye and mouth candidate regions with at least one eye and mouth template set, respectively. In another preferred embodiment, the quality index depends on a face orientation score computed from a mirrored correlation value of the two eyes. In yet another embodiment, the face centerline is estimated from the mouth and nose locations, and the face orientation score is computed from the ratio of distances between the left/right eye and the facial centerline. In yet another embodiment, the face quality index also includes a measure of the occlusion of the face, in which an approximating ellipse is fitted to the head contour and the ellipse is tested for intersection with the frame boundaries. In yet another embodiment, the ellipse is tested for intersection with other regions.
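One of the face orientation scores mentioned above, the ratio of the left-eye and right-eye distances to a centerline estimated from the nose and mouth, might be computed as in the sketch below; the epsilon, the 2-D coordinate convention and the score normalization are assumptions.

import numpy as np

def frontal_score(left_eye, right_eye, nose, mouth, eps=1e-6):
    p1, p2 = np.asarray(nose, float), np.asarray(mouth, float)
    axis = p2 - p1
    axis /= (np.linalg.norm(axis) + eps)                  # unit vector along the facial centerline
    def dist_to_centerline(eye):
        v = np.asarray(eye, float) - p1
        return abs(v[0] * axis[1] - v[1] * axis[0])       # perpendicular distance (2-D cross product)
    d_left = dist_to_centerline(left_eye)
    d_right = dist_to_centerline(right_eye)
    return min(d_left, d_right) / (max(d_left, d_right) + eps)   # 1.0 = symmetric, i.e. near-frontal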
The prior art describes a variety of face recognition methods, some based on correlation techniques between input templates and others on distance metrics between feature sets. In order to accommodate the recognition process, a set of Face-Characteristic-Data is extracted. In a preferred embodiment, the Face-Characteristic-Data include the following: • Fg = Global information; consists of face templates at selected viewpoints, such as bounding boxes surrounding the face region, as shown in FIG. 10a. Templates of facial components may also be included, in one implementation of eyes, nose and mouth. Templates are known in the prior art as one of the means for face recognition.
• Ff = Facial feature geometrical information indicative of the relationships between the facial components. Geometrical information can include interocular distance between the eyes, vertical distance between the eyes and mouth, width of chin, width of mouth and so on, as shown in FIG. 10b. Geometrical features, such as the ones mentioned herein, are known in the prior art as one of the means for face recognition.
• Fu = Unique characteristics, such as eyeglasses, beard, baldness, hair color. Extracting each of these characteristics can be done automatically with algorithms known in the prior art. In one embodiment, following the detection of the eyes, a larger window is opened surrounding the eye region, and a search is conducted for the presence of glasses, for example by searching for strong edges underneath the eyes, using classification techniques such as neural networks. A vertical distance can be taken from the eyes to the mouth, and further towards the chin, to check for a beard. In one embodiment the check may be for a color that is different from the natural skin color. In order to extract the characterizing hair color, a vertical projection from the eyes to the forehead can be taken. A rectangular window can be extracted at the suspected forehead and hair-line region. A histogram may be used to analyze the color content of the region. If only natural skin color is detected (a single mode), there is a strong probability of baldness. In the case of two dominant colors, one is the natural skin color and the second dominant color is the hair color. A probabilistic interpretation can be associated with the results from each of the above-mentioned algorithms. In one embodiment, the associated probabilities can be provided as confidence levels affiliated with each result. In a preferred embodiment, the Face-Characteristic-Data combine the above data, including templates (matrices of intensity values), feature vectors containing geometrical information such as characteristic distances or other information such as coefficients of the eigen-face representation, and a unique characteristic vector, as shown in FIG. 11.
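The forehead/hair-line color check described above might look like the following sketch; the window extraction, the is_skin() predicate and the 0.8 skin fraction are illustrative assumptions, and the returned confidence is one possible probabilistic interpretation.

import numpy as np

def hair_attribute(forehead_window, is_skin, skin_fraction_thresh=0.8):
    # forehead_window: H x W x 3 pixel array cut above the eyes; is_skin: per-pixel skin-color test
    pixels = forehead_window.reshape(-1, 3).astype(float)
    skin_mask = np.array([bool(is_skin(p)) for p in pixels])
    if skin_mask.mean() > skin_fraction_thresh:           # essentially a single (skin-colored) mode
        return {"bald": True, "hair_color": None, "confidence": float(skin_mask.mean())}
    hair_pixels = pixels[~skin_mask]                      # second dominant color taken as the hair color
    return {"bald": False,
            "hair_color": tuple(hair_pixels.mean(axis=0)),
            "confidence": float(1.0 - skin_mask.mean())}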
Following the definition of a face segment, audio information in the form of Audio-Characteristic-Data is incorporated as an additional informative characteristic for the segment. One purpose of the present invention is to associate Audio-Characteristic-Data with a face track or part of a face track. Audio-Characteristic-Data may include a set of parameters representing the audio signature of the speaker. In addition, speech-to-text may be incorporated, as known in the prior art, to extract recognizable words, and the extracted words can be part of the index.
By combining the results of visual-based face recognition and audio-based speaker identification, the overall recognition accuracy can be improved.
FIG. 12 shows a timeline and the video and audio data which correspond to that timeline. The face/no-face segmentation of the video stream serves as a master temporal segmentation that is applied to the audio stream. The audio segments derived can be used to enhance the recognition capability of any person recognition system built on top of the indexing system constructed according to the present invention. A further purpose of the present invention is to match audio characteristic data which correspond to two different face tracks in order to confirm the identity of the face tracks. The present invention may utilize prior art methods of audio characterization and speaker segmentation. The latter is required for the case where the audio may correspond to at least two speakers. FIG. 13 shows how to combine the audio track with the face index to create an audio-visual index. The present invention may use prior art methods in speech processing and speaker identification.
It is known in the prior art to model speakers by using Gaussian Mixture Models (GMM) of their acoustic features. A model for a speaker is trained from an utterance of his speech by first computing acoustic features, such as mel-frequency cepstral coefficients, computed every 10 ms, and then, considering each feature as a multi-dimensional vector, fitting the parameters of a mixture of multi-dimensional Gaussian distributions to said vectors. The set of parameters of the mixture is used to represent the speaker.
Given a speech utterance, it may be scored against a speaker model, as described below, to determine whether the speech belongs to that speaker. First, we derive acoustic features as in the training phase (e.g. computing mel-frequency cepstral coefficients); then, considering each feature as a multi-dimensional vector and viewing the model as a probability distribution on such vectors, we compute the likelihood of these features. In a closed-set setting, a group of speakers is given and a model for each speaker of that group is trained. In this setting, it is a-priori known that the speech utterance belongs to one of the speakers in the group. Thus, computing the likelihood of each speaker model and taking the one with maximum likelihood identifies the correct speaker. In an open-set setting, such prior knowledge is unavailable. In that case, the speaker verification problem is actually a hypothesis test. The likelihood of the speaker model is compared with the likelihood of a so-called cohort model, representing the alternative hypothesis (that the speech utterance belongs to another speaker). If the likelihood ratio passes a threshold, the utterance is said to belong to that speaker.
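A sketch of such a GMM speaker model, using the librosa and scikit-learn libraries as stand-ins for the acoustic front end and the mixture model, follows; the library choice, sampling rate and model sizes are assumptions, not part of the patented method.

import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path, sr=16000, n_mfcc=13, hop=160):
    y, rate = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=rate, n_mfcc=n_mfcc, hop_length=hop).T   # ~10 ms frames

def train_speaker_model(wav_path, n_components=16):
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(mfcc_features(wav_path))                       # one feature vector per 10 ms frame
    return gmm

def avg_log_likelihood(gmm, wav_path):
    return gmm.score(mfcc_features(wav_path))              # used for closed-set or open-set decisions

# Closed-set identification: choose the speaker model with the maximum likelihood.
# Open-set verification: compare the likelihood ratio against a cohort model to a threshold.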
In a preferred embodiment, no prior knowledge of the speakers is assumed. In that embodiment, unsupervised speaker segmentation may be done using an iterative algorithm (1340). Parameters for the speaker GMMs are first initialized using a clustering procedure and then are iteratively improved using the Viterbi algorithm to compute the segmentation. Alternatively, one may first detect audio-track changes, or cuts, by considering a sliding window of the audio and testing the hypothesis that this window has no cuts against the hypothesis that there is a cut. Then speaker identification may be performed on each segment (defined as a portion of the audio track between cuts). Audio cut detection may be done by a generalized likelihood ratio test, in which the first hypothesis ("no cuts") is modeled by fitting a best Gaussian distribution to the acoustic feature data derived from the audio window, and the second hypothesis ("there is a cut") is modeled by exhausting over cut points, fitting a best Gaussian to each part of the window (left or right of the cut point), and taking the cut point for which the likelihood is maximal. The likelihood of the first hypothesis is compared to the likelihood of the second hypothesis, and a threshold is used in the decision process (taking into account the larger number of parameters in the second hypothesis).

FIG. 13 illustrates a preferred embodiment, wherein the segmentation of the audio track is aided by visual cues from the video face indexing process (1310). In particular, the audio is partitioned with respect to the video content. For example, when an entire shot includes a single face track, the initial hypothesis can be a single speaker. Once verified, the audio characteristic data (such as the GMM parameters) are associated with that face (1350). In another example, when an entire shot includes only two faces, the initial hypothesis can be two speakers. Once verified, the audio characteristic data of a speech segment are associated with the face of highest mouth activity as computed by the visual mouth activity detector (1320).

Tracking mouth activity is known in the prior art. Such technology may entail a camera capturing the mouth image, followed by thresholding the image into two levels, i.e., black and white. Binary mouth images are analyzed to derive the mouth open area, the perimeter, the height, and the width. These parameters are then used for recognition. Alternatively, the binary mouth images themselves may be used as the visual feature, with clustering schemes used to classify between these binary images. Derivative information of the mouth geometrical quantities, as well as optical flow input, has been suggested as well. The prior art also entails the detection of the lip boundaries (upper and lower lips), the modeling of the lip boundaries via splines, and the tracking of the spline movements across frames. Motion analysis of the lip boundaries (spline models) can facilitate an overall analysis of visual mouth activity, such as the segmentation into talking vs. non-talking segments. Further analysis utilizing learning technologies such as neural networks, and additional techniques such as "Manifold Learning", can facilitate learning complex lip configurations. These techniques are used in the prior art to augment speech recognition performance. In this invention, these techniques are utilized to detect motion and activity in order to affiliate an audio signature with a visual face, as well as to affiliate words extracted from the audio with that visual face.
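Stepping back to the audio cut detection described earlier in this passage, the generalized likelihood ratio test over a window of acoustic feature vectors might be sketched as follows; the covariance regularization, the minimum segment length and the penalty term standing in for the extra-parameter correction are illustrative simplifications.

import numpy as np

def gaussian_loglik(X):
    # log-likelihood of the rows of X under the best-fitting (maximum-likelihood) Gaussian
    n, d = X.shape
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)       # small regularization for stability
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (logdet + d * (1.0 + np.log(2.0 * np.pi)))

def detect_cut(X, penalty=50.0, min_seg=20):
    h0 = gaussian_loglik(X)                                # "no cut": one Gaussian for the whole window
    best_gain, best_cut = -np.inf, None
    for t in range(min_seg, len(X) - min_seg):             # "cut": one Gaussian per side, exhaust cut points
        gain = gaussian_loglik(X[:t]) + gaussian_loglik(X[t:]) - h0
        if gain > best_gain:
            best_gain, best_cut = gain, t
    return best_cut if best_gain > penalty else None       # None means "no cut" in this window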
Once the audio track is segmented (1330, 1340) and associated with a video track (1350), audio characteristic data from the speech segment can augment the face index. In a preferred embodiment, as described in FIG. 14, a speech-to-text engine is employed (1410) to generate a text transcription of the speech segments in the video program. Such engines are commercially available [e.g. from Entropic Ltd. of Cambridge, England; or ViaVoice from IBM]. Such an engine may operate on a computer file representation of the audio track or in real time, using the audio signal input into a computer with a digital audio card (such as Sound Blaster by Creative Labs). A full transcription of the audio may be extracted by speech recognition technology, as described above. In addition, closed-caption decoding may be utilized, when available. In one embodiment, the full transcription can be incorporated. In yet another embodiment, a subset of predefined keywords may be extracted.

The transcription of the audio is next aligned with the video face index. The alignment is achieved by means of the related time codes (1420). In one embodiment, said alignment may include the identification of each word or utterance start and end points, hereby termed the word time segment. A word is associated with a face track for which there is an overlap between the word time segment and the face track time segment. In addition, as described above for the case of attaching audio signature data for speaker identification and person recognition, the transcription may be attached as text to the face track that is determined visually to be the most matching, which may be termed speech-to-speaker matching. For example, when an entire shot includes a single face track, the initial hypothesis can be a single speaker, and the transcription is attached in full to the matching face. In another example, when an entire shot includes two faces, the audio transcription of a speech segment is associated with the face of highest mouth activity as computed by the visual mouth activity detector (1320).

In one embodiment of the invention, there is a need to identify face segments, or parts of face segments, as talking segments or non-talking segments, in which the associated face is talking or not talking, respectively. This information is valuable information to be included in the face index store, enabling a future search for a person in a talking state. Moreover, this information may be critical for the matching process described above between speech and speaker. In a preferred embodiment, in order to partition the face segments into talking and non-talking segments, mouth motion is extracted in a face track, tracking the mouth on a frame-by-frame basis throughout the track. The extracted motion characteristics are compared against talking-lips movement characteristics, and a characterization is made into one of two states: talking vs. non-talking.

According to the present invention, the extracted information from a face track may be incorporated into the face index, and links may be provided between similar face tracks to merge the information. The linking and information-merging stage, as part of the face indexing process, is depicted in FIG. 15. If the face index store is empty (1510), the current face track initializes the index, providing its first entry. Otherwise, distances are calculated between the new face track characteristics and the characteristics of each face entry in the index (1520).
In a preferred embodiment, distances can be computed between several fields of the Face-Characteristic-Data. Template data (Fg in FIG. 11) can be compared with correlation-based techniques. Feature vector data (Ff in FIG. 11), which represent geometrical characteristics or eigen-basis characteristics, can be compared utilizing distance metrics known in the art, such as the Euclidean metric or the Mahalanobis metric. The Unique characteristic data (Fu in FIG. 11) are compared similarly. An overall distance measure is calculated as a function of the individual distance components, in one embodiment being the weighted sum of the distances. Distance measures are ranked in increasing order (1530). The smallest distance is compared to a similarity threshold parameter (1540) to categorize the entry either as a new face to be entered into the index, or as an already existing face, in which case the information is merged into an existing entry in the index.
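The linking decision of FIG. 15 (blocks 1510-1540), again reusing the hypothetical FaceEntry class sketched earlier, might be rendered as follows; the component distance functions, the weights and the threshold are placeholders.

def link_track_to_index(new_track, face_index, dist_template, dist_features, dist_unique,
                        weights=(0.5, 0.3, 0.2), similarity_thresh=1.0):
    if not face_index:                                     # 1510: empty index -> first entry
        face_index["A"] = FaceEntry(label="A", tracks=[new_track])
        return "A"
    distances = []
    for label, entry in face_index.items():                # 1520: distance to every face entry
        d = (weights[0] * dist_template(new_track, entry) +
             weights[1] * dist_features(new_track, entry) +
             weights[2] * dist_unique(new_track, entry))
        distances.append((d, label))
    distances.sort()                                       # 1530: rank in increasing order
    best_d, best_label = distances[0]
    if best_d < similarity_thresh:                         # 1540: existing face -> merge information
        face_index[best_label].tracks.append(new_track)
        return best_label
    new_label = chr(ord("A") + len(face_index))            # otherwise start a new face entry
    face_index[new_label] = FaceEntry(label=new_label, tracks=[new_track])
    return new_label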
The description has so far focused on the generation of the face index (FIG. 3), involving the processing of the inputted video and the grouping of the information into tracks and of the tracks into appropriate index entries. As part of the information merging process, an additional grouping step can be considered to better facilitate browsing and searching and to augment recognition performance. In one embodiment, once the index contains multiple track entries per index entry, and at predefined time intervals, the information associated with the index entry from all corresponding tracks can be merged to produce a representative set of characteristic data. In a preferred embodiment, such a set involves the best quality frontal face view (ranking frontal face views from all tracks); a second set includes the conjunction of all the unique characteristic data. In a preferred embodiment, the editor can annotate the face index.
Annotation may be implemented in several modes of operation. In one mode, a description or label may be attached manually to an entry in the index, following which the annotation becomes automatically linked to all frames in all tracks that relate to the index entry. In a second mode of operation, following the annotation process, each new frame that is added incrementally to a respective location in the index store automatically receives all annotations available for the respective index entry. Annotation descriptions or labels may be inputted manually by an operator (270). Alternatively, any characteristic data in the data store (e.g., the occurrence of glasses or a beard) may be automatically converted, by means of a stored dictionary, to a text description or label and added to the annotation fields. In this fashion, attributes, features and names, in textual form, are gradually and automatically incorporated into the index, enabling text search and increasing search speed. In yet another mode of operation, audio characteristic data, such as keywords detected within a segment, or the entire spoken text, can be converted via speech-to-text, as known in the art, to incorporate annotation of audio information as additional information in the index.
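The dictionary-based conversion of stored characteristic data into text labels might be sketched as below; the attribute names, the dictionary contents and the location of the Fu attributes inside the hypothetical track structure are assumptions.

ATTRIBUTE_DICTIONARY = {                 # illustrative stored dictionary
    "glasses": "wears glasses",
    "beard": "has a beard",
    "bald": "bald",
}

def auto_annotate(entry):
    # collect the unique (Fu) attributes from all tracks of this index entry
    attributes = {}
    for track in entry.tracks:
        attributes.update(track.characteristic_data.get("unique", {}))
    labels = [text for attr, text in ATTRIBUTE_DICTIONARY.items() if attributes.get(attr)]
    if labels:                           # append the derived text so it becomes searchable
        entry.annotation = "; ".join([entry.annotation] + labels if entry.annotation else labels)
    return entry.annotation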
The annotation process as described above serves to augment the generated index as taught by the present invention. The present invention further teaches how to generate video annotation from the index annotation, as depicted in 280 of FIG. 2. In one embodiment, each video frame may have associated with it a text string that contains the annotations present in the related index entry. A video annotation attached to each video frame is exemplified in FIG. 16. The index annotation field of FIG. 5 (right-most field) is now linked to the related video frames, as shown in FIG. 16. A single index entry is associated with a video track, and a video track may be associated in turn with a large number of video frames. Each frame in the corresponding sequence is now (automatically) annotated. Note that the annotation may include a label (such as the person's name), any manually inputted description (such as the affiliation of the person), any description of visual features (e.g., 'beard', 'glasses'), and other information, as taught in this invention. In another embodiment, a listing may be generated and displayed, as shown in Table 1, summarizing all annotated information as related to frame segments.
Table 1
The face index store is used for browsing and searching applications. A user interested in browsing can select to browse at varying resolutions. In one browsing mode, the display includes the best-quality, most frontal face template (a single image of the face on screen). In a second browsing mode, the user may select to see the set of best characteristic views (e.g., one row of face images of distinct views). In a third browsing mode, the user may select to see all views in all tracks available in the store for the person of interest; several screens of frames will be displayed. Frames may be ordered based on quality metrics, or they may be ordered along a time line per the video source. The user may interactively select the mode of interest and shift amongst the modes.
The present invention teaches a method of searching in video (or in a large collection of images) for particular members of an object class, by generating the above-described index and searching the index. FIG. 17 presents a flowchart of the searching methodology. The input search query may address any of the variety of fields present in the index, and any combination thereof. Corresponding fields in the index will be searched (1710). The user can search, for example, by the name of the person. The label or description inputted by the user is compared to all labeled data; if a name in fact exists in the index, all related face tracks would be provided as search results. The user can also search by providing an image of a face and searching for similar-looking faces; in this case, template data will be used, such as the best quality face template, or the best set of characteristic views per person entry. The user can also search by attribute data; in this scenario, feature vectors are compared. The user can query using multiple attributes, such as the face label as well as a set of keywords; in this case, the search would be conducted simultaneously on multiple parameters, such as the face template and audio characteristics. Matching index entries are extracted and can be optionally ranked and sorted based on the match quality (1720). A match score may be computed by correlating the input template to the best quality template per index store entry; or the input template can be compared with the best set of characteristic view templates; or the input characteristic data can be compared with stored characteristics per index store entry, etc. In any comparison mode, the final match score can be chosen as the best match score, or as a weighted combination of the scores.
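The multi-field matching step (1710-1720) might be sketched as follows; the query fields, the per-field comparators returning scores in [0, 1], and the weights are all placeholders.

def search_index(face_index, query, comparators, weights):
    results = []
    for label, entry in face_index.items():
        fields = [f for f in query if f in comparators]    # only the fields present in the query
        if not fields:
            continue
        score = sum(weights[f] * comparators[f](query[f], entry) for f in fields)
        score /= sum(weights[f] for f in fields)           # weighted combination of the field scores
        results.append((score, label, entry))
    results.sort(key=lambda r: r[0], reverse=True)         # 1720: rank and sort by match quality
    return results                                         # 1730: each entry links to its video frames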
The matched index entries are automatically associated with the corresponding sets of video frames (1730). The search results may be displayed in a variety of forms. In one embodiment a listing of the index entries may be outputted. In another embodiment, the video frames are displayed as icons. Output frames may be ordered based on match quality; alternatively they may be displayed on a time-line basis per related video source (e.g. per tape or movie source).
Utilizing the face index store allows for much greater efficiency in browsing and searching of the video data, as well as increasing recognition accuracy. Automatic organization of the data into representative sets, merged across multiple tracks from different video sources, allows for efficient browsing, at varying resolutions, as described above. In the searching mode, rather than searching on a frame-by-frame basis for the person of interest, a comparison is made on the indexed material in the face index store.
The foregoing description has been concerned primarily with the indexing of facial content in a video sequence. However, the methods taught by the invention can be readily applied to indexing other classes of objects for browsing and searching purposes.

Claims

What Is Claimed is:
1. A method of generating an index of a class of objects appearing in a collection of images for use in searching or browsing for particular members of said class, comprising: processing said collection of images to extract therefrom features characteristic of said class of objects; and grouping said images in groups according to extracted features helpful in identifying individual members of said class of objects.
2. The method according to claim 1, wherein said collection of images represents a sequence of video frames.
3. The method according to claim 2, wherein said grouping of images in groups, according to extracted features helpful in identifying individual members of a class, produces a track of continuous frames for each individual member of said class of objects to be identified. 4. The method according to claim 1, wherein said groups of images are stored in a database store for use in searching or browsing for individual members of the class.
5. The method according to claim 1, wherein said class of objects is human faces. 6. The method according to claim 5, wherein said collection of images represents a sequence of video frames, and said grouping includes forming face tracks of contiguous frames, each track being identified by the starting and ending frames and containing face regions.
7. The method according to claim 6, wherein said sequence of video frames is processed to also include audio data associated with said face tracks.
8. The method according to claim 7, wherein said sequence of video frames is processed to include audio data associated with a said face track by: generating a face index from the sequence of video frames; generating a transcription of the audio data; and aligning said transcription of the audio data with said face index.
9. The method according to claim 8, wherein said transcription of the audio data is generated by closed-caption decoding.
10. The method according to claim 8, wherein said transcription of the audio data is generated by speech recognition.
11. The method according to claim 8, wherein said aligning is effected by: identifying start and end points of speech segments in the sound data; extracting face start and end points from the face index; and outputting all speech segments that have non-zero temporal intersection with the respective face track.
12. The method according to claim 7, wherein said sequence of video frames is also processed to label face tracks as talking or non-talking tracks by: tracking said face regions in the video frames; detecting regions having mouth motion; and estimating from said detected regions those having talking mouth motion vs. non-talking mouth motion. 13. The method according to claim 12, wherein regions estimated to have talking mouth motion are enabled for attaching speech to a speaker in a face track.
14. The method according to claim 7, wherein said sequence of video frames is processed by: extracting all audio data from video frame regions of a particular face; fitting a model based on said extracted audio data; and associating said model with the respective face track.
15. The method according to claim 14, wherein said audio data are speech utterance. 16. The method according to claim 6, wherein said grouping further includes merging tracks containing similar faces from a plurality of said face tracks.
17. The method according to claim 6, wherein said sequence of video frames is processed to include annotations associated with said face track. 18. The method according to claim 6, wherein said sequence of video frames is processed to include face characteristic views associated with said face tracks.
19. The method according to claim 6, wherein said face regions include eye, nose and mouth templates characteristic of individual faces.
20. The method according to claim 6, wherein said face regions include image coordinates of geometric face features characteristic of individual faces. 21. The method according to claim 6, wherein said face regions include coefficients of the eigen-face representation characteristic of individual faces.
22. A method of generating an index of human faces appearing in a sequence of video frames, comprising: processing said sequence of video frames to extract therefrom facial features; and grouping said video frames to produce face tracks of contiguous frames, each face track being identified by the starting and ending frames in the track and containing face characteristic data of an individual face.
23. The method according to claim 22, wherein said face tracks are stored in a face index for use in searching or browsing for individual faces.
24. The method according to claim 22, wherein said face characteristic data includes eye, nose and mouth templates.
25. The method according to claim 22, wherein said face characteristic data includes image coordinates of geometric face features. 26. The method according to claim 22, wherein said face characteristic data includes coefficients of the eigen-face representation.
27. The method according to claim 22, wherein said grouping of video frames to produce said face tracks includes: detecting predetermined facial features in a frame and utilizing said facial features for estimating the head boundary for the respective face; opening a tracking window based on said estimated head boundary; and utilizing said tracking window for tracking sequential frames which include said predetermined facial features.
28. The method according to claim 22, wherein said sequence of video frames is processed to also include audio data associated with said face tracks.
29. The method according to claim 22, wherein said sequence of video frames is processed to also include annotations associated with said face tracks.
30. The method according to claim 22, wherein said sequence of video frames is processed to also include face characteristic views associated with said face tracks.
31. The method according to claim 22, wherein said grouping includes merging tracks containing similar faces from a plurality of said face tracks.
32. A method of generating an index of at least one class of objects appearing in a collection of images to aid in browsing or searching for individual members of said class, comprising: processing said collection of images to generate an index of features characteristic of said class of objects; and annotating said index with annotations. 33. The method according to claim 32, wherein said method further comprises grouping said images in said index according to features helpful in identifying individual members of said class of objects.
34. The method according to claim 33, wherein said collection of images are a sequence of video frames. 35. The method according to claim 34, wherein said grouping of images, according to features helpful in identifying individual members of said class, produces a track of sequential frames for each individual member of said class of objects to be identified.
36. The method according to claim 35, wherein a listing of said annotations is generated and displayed for at least some of said tracks.
37. The method according to claim 35, wherein said at least one class of objects are human faces, and said grouping of images produces face tracks of contiguous frames, each track being identified by the starting and ending frames and containing face data characteristic of an individual face. 38. The method according to claim 37, wherein said sequence of video frames is processed to also include audio data associated with said face tracks.
39. The method according to claim 37, wherein said sequence of video frames is processed to also include face characteristic views associated with said face tracks.
40. The method according to claim 32, wherein said annotations include applying descriptions to various entries in said index.
41. The method according to claim 40, wherein said descriptions are applied manually during an editing operation.
42. The method according to claim 40, wherein said descriptions are applied automatically by means of a stored dictionary. 43. A method of processing a sequence of video frames having a video track and an audio track, to generate a speech annotated face index associated with speakers, comprising: generating a face index from the video track; generating a transcription of the audio track; and aligning said transcription with said face index.
44. The method according to claim 43, wherein said transcription is generated by closed-caption decoding.
45. The method according to claim 43, wherein said transcription is generated by speech recognition.
46. The method according to claim 43, wherein said alignment is done by: identifying start and end points of speech segments in said transcription; extracting from the face index start and end points of face segments; and outputting all speech segments that have non-zero temporal intersection with the respective face segment.
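The alignment of claim 46 reduces to an interval-overlap test: every speech segment whose time span has a non-zero intersection with a face segment is attached to that face track. A minimal sketch, assuming both segment lists carry start and end times in seconds:

```python
# Sketch of claim 46: attach each transcribed speech segment to every face
# segment it temporally overlaps (non-zero intersection).
def align_speech_to_faces(speech_segments, face_segments):
    """speech_segments: list of (start, end, text); face_segments: list of
    (start, end, track_id). Returns {track_id: [text, ...]}."""
    aligned = {}
    for f_start, f_end, track_id in face_segments:
        for s_start, s_end, text in speech_segments:
            if min(f_end, s_end) - max(f_start, s_start) > 0:  # non-zero intersection
                aligned.setdefault(track_id, []).append(text)
    return aligned
```

For example, a speech segment spanning 12.0 to 15.5 seconds would be attached to any face track whose segment covers part of that interval.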
47. A method of processing a video track having face segments to label such segments as talking or non-talking, comprising: tracking said face segments in the video track; detecting segments having mouth motion; and estimating from said detected segments, those having talking mouth motion vs. non-talking mouth motion.
48. The method according to claim 47, wherein segments estimated to have talking mouth motion are enabled for attaching speech to a speaker in a face segment.
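Claims 47 and 48 label face segments as talking or non-talking from mouth motion. A crude illustrative test is to measure frame-to-frame intensity change inside the mouth region and require both sufficient motion energy and sufficient temporal variation; the thresholds below are assumptions, not values taken from the patent.

```python
# Illustrative sketch of claims 47-48: decide talking vs. non-talking from the
# frame-to-frame change energy inside the mouth region of a tracked face.
import numpy as np

def classify_mouth_motion(mouth_crops, energy_thresh=4.0, var_thresh=2.0):
    """mouth_crops: equally sized grayscale mouth-region arrays, one per frame
    of the face segment. Returns "talking" or "non-talking"."""
    diffs = [np.mean(np.abs(mouth_crops[i].astype(float) -
                            mouth_crops[i - 1].astype(float)))
             for i in range(1, len(mouth_crops))]
    if not diffs:
        return "non-talking"
    energy = float(np.mean(diffs))        # overall amount of mouth motion
    variability = float(np.std(diffs))    # talking motion opens and closes repeatedly
    return "talking" if energy > energy_thresh and variability > var_thresh else "non-talking"
```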
49. A method of annotating a sequence of video frames, comprising: processing said sequence of video frames to generate a face index having a plurality of entries; and attaching descriptions to entries in said face index.
50. The method according to claim 49, further comprising attaching descriptions from said face index to at least one video frame.
51. A method of processing a sequence of video frames having a video track and an audio track, comprising: extracting from said video track, face segments representing human faces, and producing a face track for each individual face; extracting audio segments from said audio track; fitting a model based on a set of said audio segments corresponding to the individual face of a face track; and associating said model with the face track of the corresponding individual face.
52. The method according to claim 51, wherein said audio segments are speech segments.
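Claims 51 and 52 fit a model to the speech segments associated with one face track. A common choice for such speaker models is a Gaussian mixture over short-time spectral features; the sketch below assumes the feature frames (for example, MFCCs) are already extracted and uses scikit-learn's GaussianMixture purely as an illustration, not as the patented method.

```python
# Sketch of claims 51-52: fit a speaker model (a GMM over precomputed feature
# frames) to the audio segments of one face track. The 16-component diagonal
# GMM is an illustrative choice, not a value taken from the patent.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_speaker_model(segment_features, n_components=16):
    """segment_features: list of arrays, each (n_frames, n_features), one per
    speech segment attached to the face track. Returns the fitted model."""
    X = np.vstack(segment_features)
    model = GaussianMixture(n_components=n_components, covariance_type="diag")
    model.fit(X)
    return model

def score_segment(model, features):
    """Average per-frame log-likelihood of a new segment under the track's model."""
    return float(model.score(features))
```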
53. Apparatus for generating an index of a class of objects appearing in a collection of images for use in searching or browsing for particular members of said class, comprising: a processor for processing said collection of images to extract therefrom features characteristic of said objects, and for outputting therefrom indexing data with respect to said features; a user interface for selecting the features to be extracted to enable searching for and identifying individual members of said class of objects; and a store for storing said indexing data outputted from said processor in groups according to the features selected for extraction.
54. The apparatus according to claim 53, further including a browser-searcher for browsing and searching said store of indexing data to locate particular members of said class of objects.
55. The apparatus according to claim 54, further including an editor for scanning the indexing data store and for correcting errors occurring therein.
56. The apparatus according to claim 55, wherein said editor also enables annotating said indexing data store with annotations.
57. The apparatus according to claim 53, wherein said user interface enables selecting human facial features to be extracted from said collection of images to enable searching for individual human faces.
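The apparatus of claims 53 through 57 can be pictured as a few cooperating components: a processor that extracts the user-selected features, a store for the resulting indexing data, and a browser-searcher and editor layered on top. The classes below are a purely hypothetical composition, not the patented apparatus.

```python
# Purely illustrative composition of the components named in claims 53-57;
# all class and method names are assumptions made for this sketch.
class FeatureProcessor:
    """Extracts the user-selected features and emits indexing records."""
    def __init__(self, selected_features):
        self.selected_features = selected_features   # chosen through the user interface
    def process(self, images):
        # placeholder extraction: one record per image, keyed by the selected features
        return [{"image_id": i, "features": {f: None for f in self.selected_features}}
                for i, _ in enumerate(images)]

class IndexStore:
    """Stores indexing data output by the processor, grouped by feature set."""
    def __init__(self):
        self.groups = {}
    def add(self, group_key, records):
        self.groups.setdefault(group_key, []).extend(records)

class BrowserSearcher:
    """Browses and searches the stored indexing data (claim 54)."""
    def __init__(self, store):
        self.store = store
    def browse(self, group_key):
        return self.store.groups.get(group_key, [])

class IndexEditor:
    """Corrects and annotates entries in the indexing data store (claims 55-56)."""
    def __init__(self, store):
        self.store = store
    def annotate(self, group_key, entry_idx, text):
        self.store.groups[group_key][entry_idx]["annotation"] = text
```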
58. A method of searching a collection of images for individual members of a class of objects, comprising: processing said collection of images to generate an index of said class of objects; and searching said index for individual members of said class of objects.
59. The method according to claim 58, wherein said collection of images are a sequence of video frames.
60. The method according to claim 59, wherein processing said sequence of video frames produces a track of sequential frames for each individual member of said class of objects.
61. The method according to claim 60, wherein said class of objects are human faces.
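Claims 58 through 61 then reduce searching to a lookup over the generated index. A minimal sketch, assuming each index entry stores a face descriptor as in the hypothetical record above and the query is itself a descriptor; Euclidean distance is an illustrative choice.

```python
# Sketch of claims 58-61: nearest-neighbour search over the per-track face
# descriptors stored in the index.
import numpy as np

def search_index(index, query_descriptor, top_k=5):
    """index: list of FaceTrack-like records with a .face_descriptor attribute.
    Returns the top_k tracks closest to the query descriptor."""
    dists = [float(np.linalg.norm(t.face_descriptor - query_descriptor))
             for t in index]
    order = np.argsort(dists)[:top_k]
    return [(index[i], dists[i]) for i in order]
```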
EP99943190A 1998-09-10 1999-09-08 Method of face indexing for efficient browsing and searching of people in video Withdrawn EP1112549A4 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US9970298P 1998-09-10 1998-09-10
US99702P 1998-09-10
PCT/IL1999/000487 WO2000016243A1 (en) 1998-09-10 1999-09-08 Method of face indexing for efficient browsing and searching of people in video

Publications (2)

Publication Number Publication Date
EP1112549A1 EP1112549A1 (en) 2001-07-04
EP1112549A4 EP1112549A4 (en) 2004-03-17

Family

ID=22276216

Family Applications (1)

Application Number Title Priority Date Filing Date
EP99943190A Withdrawn EP1112549A4 (en) 1998-09-10 1999-09-08 Method of face indexing for efficient browsing and searching of people in video

Country Status (3)

Country Link
EP (1) EP1112549A4 (en)
AU (1) AU5645999A (en)
WO (1) WO2000016243A1 (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7522186B2 (en) 2000-03-07 2009-04-21 L-3 Communications Corporation Method and apparatus for providing immersive surveillance
US8954432B2 (en) * 2000-11-15 2015-02-10 Mark Frigon Users tagging users in photos online
EP1276320B1 (en) * 2001-06-15 2015-05-27 L-1 Identity Solutions AG Method for making unrecognisable and for restoring image content
US7130446B2 (en) 2001-12-03 2006-10-31 Microsoft Corporation Automatic detection and tracking of multiple individuals using multiple cues
GB2395852B (en) * 2002-11-29 2006-04-19 Sony Uk Ltd Media handling system
GB2395853A (en) * 2002-11-29 2004-06-02 Sony Uk Ltd Association of metadata derived from facial images
US7184577B2 (en) * 2003-03-14 2007-02-27 Intelitrac, Inc. Image indexing search system and method
CA2529903A1 (en) 2003-06-19 2004-12-29 Sarnoff Corporation Method and apparatus for providing a scalable multi-camera distributed video processing and visualization surveillance system
WO2005031612A1 (en) * 2003-09-26 2005-04-07 Nikon Corporation Electronic image accumulation method, electronic image accumulation device, and electronic image accumulation system
JP2007079641A (en) 2005-09-09 2007-03-29 Canon Inc Information processor and processing method, program, and storage medium
JP2009510877A (en) * 2005-09-30 2009-03-12 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Face annotation in streaming video using face detection
US8014572B2 (en) 2007-06-08 2011-09-06 Microsoft Corporation Face annotation framework with partial clustering and interactive labeling
JP4424396B2 (en) * 2007-09-03 2010-03-03 ソニー株式会社 Data processing apparatus and method, data processing program, and recording medium on which data processing program is recorded
KR101268987B1 (en) 2007-09-11 2013-05-29 삼성전자주식회사 Method and apparatus for recording multimedia data by automatically generating/updating metadata
US20090110245A1 (en) * 2007-10-30 2009-04-30 Karl Ola Thorn System and method for rendering and selecting a discrete portion of a digital image for manipulation
US9407942B2 (en) 2008-10-03 2016-08-02 Finitiv Corporation System and method for indexing and annotation of video content
US8325999B2 (en) 2009-06-08 2012-12-04 Microsoft Corporation Assisted face recognition tagging
JP2011035837A (en) * 2009-08-05 2011-02-17 Toshiba Corp Electronic apparatus and method for displaying image data
US9251854B2 (en) 2011-02-18 2016-02-02 Google Inc. Facial detection, recognition and bookmarking in videos
EP2680189A1 (en) * 2012-06-26 2014-01-01 Alcatel-Lucent Method and system for generating multimedia descriptors
US8983836B2 (en) 2012-09-26 2015-03-17 International Business Machines Corporation Captioning using socially derived acoustic profiles
US9165182B2 (en) 2013-08-19 2015-10-20 Cisco Technology, Inc. Method and apparatus for using face detection information to improve speaker segmentation
US9646227B2 (en) 2014-07-29 2017-05-09 Microsoft Technology Licensing, Llc Computerized machine learning of interesting video sections
US9934423B2 (en) * 2014-07-29 2018-04-03 Microsoft Technology Licensing, Llc Computerized prominent character recognition in videos
KR102306538B1 (en) * 2015-01-20 2021-09-29 삼성전자주식회사 Apparatus and method for editing content
CN104991906B (en) * 2015-06-17 2020-06-02 百度在线网络技术(北京)有限公司 Information acquisition method, server, terminal, database construction method and device
CN108229322B (en) * 2017-11-30 2021-02-12 北京市商汤科技开发有限公司 Video-based face recognition method and device, electronic equipment and storage medium
CN111191484A (en) * 2018-11-14 2020-05-22 普天信息技术有限公司 Method and device for recognizing human speaking in video image
CN109784270B (en) * 2019-01-11 2023-05-26 厦门大学嘉庚学院 Processing method for improving face picture recognition integrity
CN112906466B (en) * 2021-01-15 2024-07-05 深圳云天励飞技术股份有限公司 Image association method, system and device, and image searching method and system
US11417099B1 (en) 2021-11-08 2022-08-16 9219-1568 Quebec Inc. System and method for digital fingerprinting of media content
CN117079256B (en) * 2023-10-18 2024-01-05 南昌航空大学 Fatigue driving detection algorithm based on target detection and key frame rapid positioning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB8710737D0 (en) * 1987-05-06 1987-06-10 British Telecomm Video image encoding
DE4028191A1 (en) * 1990-09-05 1992-03-12 Philips Patentverwaltung CIRCUIT ARRANGEMENT FOR DETECTING A HUMAN FACE
US5787414A (en) * 1993-06-03 1998-07-28 Kabushiki Kaisha Toshiba Data retrieval system using secondary information of primary data to be retrieved as retrieval key
US5974172A (en) * 1997-02-14 1999-10-26 At&T Corp Method and apparatus for coding segmented regions which may be transparent in video sequences for content-based scalability
US5895464A (en) * 1997-04-30 1999-04-20 Eastman Kodak Company Computer program product and a method for using natural language for the description, search and retrieval of multi-media objects

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5570343A (en) * 1993-07-31 1996-10-29 Motorola, Inc. Communications system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
COLLOBERT M ET AL: "LISTEN: a system for locating and tracking individual speakers", AUTOMATIC FACE AND GESTURE RECOGNITION, 1996., PROCEEDINGS OF THE SECOND INTERNATIONAL CONFERENCE ON KILLINGTON, VT, USA 14-16 OCT. 1996, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 14 October 1996 (1996-10-14), pages 283 - 288, XP010200434, ISBN: 0-8186-7713-9 *
GUNSEL B ET AL: "Video indexing through integration of syntactic and semantic features", APPLICATIONS OF COMPUTER VISION, 1996. WACV '96., PROCEEDINGS 3RD IEEE WORKSHOP ON SARASOTA, FL, USA 2-4 DEC. 1996, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 2 December 1996 (1996-12-02), pages 90 - 95, XP010206416, ISBN: 0-8186-7620-5 *
NAM J ET AL: "Speaker identification and video analysis for hierarchical video shot classification", IMAGE PROCESSING, 1997. PROCEEDINGS., INTERNATIONAL CONFERENCE ON SANTA BARBARA, CA, USA 26-29 OCT. 1997, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 26 October 1997 (1997-10-26), pages 550 - 553, XP010254004, ISBN: 0-8186-8183-7 *
SATOH S ET AL: "Name-It: association of face and name in video", COMPUTER VISION AND PATTERN RECOGNITION, 1997. PROCEEDINGS., 1997 IEEE COMPUTER SOCIETY CONFERENCE ON SAN JUAN, PUERTO RICO 17-19 JUNE 1997, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 17 June 1997 (1997-06-17), pages 368 - 373, XP010237550, ISBN: 0-8186-7822-4 *
See also references of WO0016243A1 *

Also Published As

Publication number Publication date
EP1112549A1 (en) 2001-07-04
WO2000016243A1 (en) 2000-03-23
AU5645999A (en) 2000-04-03

Similar Documents

Publication Publication Date Title
EP1112549A1 (en) Method of face indexing for efficient browsing and searching of people in video
US11776267B2 (en) Intelligent cataloging method for all-media news based on multi-modal information fusion understanding
US6578040B1 (en) Method and apparatus for indexing of topics using foils
US6751354B2 (en) Methods and apparatuses for video segmentation, classification, and retrieval using image class statistical models
US6404925B1 (en) Methods and apparatuses for segmenting an audio-visual recording using image similarity searching and audio speaker recognition
Tsekeridou et al. Content-based video parsing and indexing based on audio-visual interaction
Greenspan et al. Probabilistic space-time video modeling via piecewise GMM
US7246314B2 (en) Methods and apparatuses for interactive similarity searching, retrieval and browsing of video
KR101640268B1 (en) Method and system for automated annotation of persons in video content
Vijayakumar et al. A study on video data mining
US20040143434A1 (en) Audio-Assisted segmentation and browsing of news videos
JP2001515634A (en) Multimedia computer system having story segmentation function and its operation program
Syeda-Mahmood et al. Detecting topical events in digital video
Radha Video retrieval using speech and text in video
Liu et al. Major cast detection in video using both speaker and face information
Ewerth et al. Videana: a software toolkit for scientific film studies
Li et al. Person identification in TV programs
Wei et al. Semantics-based video indexing using a stochastic modeling approach
Saravanan et al. A review on content based video retrieval, classification and summarization
Snoek The authoring metaphor to machine understanding of multimedia
US20240126994A1 (en) Transcript paragraph segmentation and visualization of transcript paragraphs
US20240127820A1 (en) Music-aware speaker diarization for transcripts and text-based video editing
US20240127858A1 (en) Annotated transcript text and transcript thumbnail bars for text-based video editing
Ferman et al. Editing cues for content-based analysis and summarization of motion pictures
Satou et al. Video corpus construction and analysis

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20010410

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

A4 Supplementary search report drawn up and despatched

Effective date: 20040129

RIC1 Information provided on ipc code assigned before grant

Ipc: 7G 06F 17/30 B

Ipc: 7G 06K 9/00 A

17Q First examination report despatched

Effective date: 20040712

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20050125