WO2009119272A1 - Video processing apparatus and method

Video processing apparatus and method

Info

Publication number
WO2009119272A1
WO2009119272A1 (PCT/JP2009/054116)
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
appearance duration
appearance
performer
duration
Prior art date
Application number
PCT/JP2009/054116
Other languages
English (en)
Inventor
Taishi Shimomori
Tatsuya Uehara
Original Assignee
Kabushiki Kaisha Toshiba
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kabushiki Kaisha Toshiba filed Critical Kabushiki Kaisha Toshiba
Priority to JP2009514284A priority Critical patent/JP2011519183A/ja
Publication of WO2009119272A1 publication Critical patent/WO2009119272A1/fr

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/102 Programmed access in sequence to addressed parts of tracks of operating record carriers
    • G11B27/105 Programmed access in sequence to addressed parts of tracks of operating record carriers of operating discs

Definitions

  • the present invention relates to a video processing apparatus, which can play back a scene in which a desired performer appears in units in which videos of interviews, performances, and the like change, from only a received video.
  • a facial image is detected from a received or recorded program, and is collated with those stored in advance in a facial image database so as to identify a person corresponding to the detected facial image.
  • the identified information is managed as a performer database together with a point which reflects the appearance duration of that person in the program.
  • the ratios of appearance of a given performer are calculated with reference to the performer database and points, and corresponding scenes are presented in descending order of ratio (for example, see JP-A 2006-33659 (KOKAI)).
  • names of people have to be separately registered in the facial image database, and when new faces or unknown people appear, the database needs to be updated.
  • names of people can only be detected in units of appearance shots, but names of people who appear in a program cannot be detected in program configuration units such as interviews, performances, and the like.
  • a video processing apparatus comprising: a first extraction unit configured to extract performer information including a first performer name and a first appearance duration in which the performer appears in a video, from the video or first information appended to the video; a second extraction unit configured to extract a plurality of features of performers from the video or the first information; a first determination unit configured to determine, of figures who appear in a first sequence included in the video, a plurality of figures having similarities of the features larger than a threshold as representing one and the same person; a creation unit configured to create a second appearance duration of at least one second sequence included in the video in which it is determined that one and the same person appears, and a first cluster identifier of a first cluster including the second sequence; a second determination unit configured to determine if the second appearance duration for each sequence is included in the first appearance duration; and an association unit configured to associate, when the second appearance duration is included in the first appearance duration, a second cluster identifier of a second cluster including the second sequence corresponding to the second appearance duration with the first performer name.
  • FIG. 1 is a block diagram of a video processing apparatus according to an embodiment
  • FIG. 2 is a block diagram of a video processing apparatus according to a modification of the embodiment
  • FIG. 3 is a view showing the relationship among performer information, segments, clusters, and sequences
  • FIG. 4 is a flowchart showing an example of the operation of the video processing apparatus shown in FIGS. 1 and 2;
  • FIG. 5 is a flowchart showing an example of the operation of step S403 in FIG. 4;
  • FIG. 6 is a flowchart showing an example of the operation of step S404 in FIG. 4;
  • FIG. 7 is a block diagram of a video processing apparatus according to a first practical example of the embodiment;
  • FIG. 8 is a view showing an example of closed captions
  • FIG. 9 is a view for explaining a method of associating cluster IDs and performer names by a labeling unit 103 in FIG. 7;
  • FIG. 10 is a view showing an example of the cluster IDs and performer names associated by the method shown in FIG. 9;
  • FIG. 11 is a block diagram of a video processing apparatus according to a second practical example of the embodiment.
  • FIG. 12 is a view for explaining a method of associating cluster IDs and performer names by a labeling unit 103 in FIG. 11;
  • FIG. 13 is a view showing an example of the cluster IDs and performer names associated by the method shown in FIG. 12;
  • FIG. 14 is a block diagram of a video processing apparatus according to a third practical example of the embodiment.
  • FIG. 15 is a view for explaining a method of associating cluster IDs and performer names by a labeling unit 103 in FIG. 14;
  • FIG. 16 is a view showing an example of the associated cluster IDs and performer names.
  • FIG. 17 is a view for explaining a method of associating cluster IDs and performer names by a labeling unit 103.
  • the user can easily discover a desired scene from only a received video.
  • a video processing apparatus of this embodiment will be described below with reference to FIG. 1.
  • the video processing apparatus of this embodiment includes a performer information extraction unit 101, clustering unit 102, and labeling unit 103.
  • the performer information extraction unit 101 extracts performer information including a performer name and an appearance duration in which a performer appears in a video, from the video or information appended to the video.
  • the performer information extraction unit 101 extracts performer information from electronic program guide (EPG) information related to the video.
  • the clustering unit 102 extracts features of performers from the video or information appended to the video, and executes clustering by deciding figures having similar features as representing one and the same person. For example, the clustering unit 102 determines, among figures who appear in sequences included in the video, a plurality of figures whose feature similarities are larger than a threshold as representing one and the same person. Similar features include, for example, the face shape, a quantity indicating whether or not a person of interest is a speaker, and the like.
  • the clustering result produced by the clustering unit 102 includes a cluster ID and the appearance durations of the sequences included in the cluster with this ID.
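  • As a minimal, hypothetical sketch of this kind of threshold-based grouping (not the patent's implementation), sequences carrying a feature vector and an appearance duration can be merged whenever their similarity exceeds a threshold, which yields exactly the clustering result described above, i.e. a cluster ID mapped to the appearance durations of its sequences:

```python
# Illustrative sketch only: group sequences whose feature vectors are more similar
# than a threshold, and return {cluster_id: [appearance durations]}.
# All names and the cosine-similarity choice are assumptions, not the patent's code.
from itertools import combinations
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_sequences(sequences, threshold=0.8):
    """sequences: list of {"feature": np.ndarray, "duration": (start_sec, end_sec)}."""
    parent = list(range(len(sequences)))              # union-find over sequence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Decide that two sequences show one and the same person when their
    # feature similarity is larger than the threshold, and merge them.
    for i, j in combinations(range(len(sequences)), 2):
        if cosine(sequences[i]["feature"], sequences[j]["feature"]) > threshold:
            parent[find(i)] = find(j)

    clusters = {}                                      # cluster ID -> appearance durations
    for idx, seq in enumerate(sequences):
        clusters.setdefault(find(idx), []).append(seq["duration"])
    return clusters
```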
  • the labeling unit 103 executes matching between the appearance duration of the performer name in the performer information and those of the sequences included in the cluster, based on the clustering result, to associate the cluster (precisely, a cluster identifier [also referred to as a cluster ID hereinafter]) with the performer name.
  • the cluster will be described later with reference to FIG. 3.
  • An example of the detailed operation of the labeling unit 103 will be described later with reference to FIG. 5.
  • a video processing apparatus includes a performer information extraction unit 101, clustering unit 102, labeling unit 103, segment association unit 201, segment association storage unit 202, and segmentation result storage unit 203.
  • the segmentation result storage unit 203 stores segmentation periods obtained by detecting change points of videos, and dividing the change points of the videos into periods.
  • the segmentation result storage unit 203 stores, for example, a segment identifier (ID) and a segment duration including the start and end times of a segment with this ID.
  • the segmentation result storage unit 203 automatically detects divisions between interview scenes and performance scenes from the video. The user may divide segments, or the segmentation result storage unit 203 may automatically divide them in correspondence with respective interview and performance scenes. The segments will be described later with reference to FIG. 3.
  • the segment association unit 201 associates segments and performer names from sequences included in clusters within segmentation periods obtained by dividing the change points of the videos into periods. The sequences will be described later with reference to FIG. 3. An example of the detailed operation of the segment association unit 201 will be described later with reference to FIG. 6.
  • the segment association storage unit 202 stores the segments and performer names associated by the segment association unit 201 in relationship to each other.
  • in the video processing apparatus of FIG. 2, by associating segments and performer names from sequences included in clusters within segmentation periods obtained by dividing the change points of videos into periods, a scene in which a desired performer appears can be played back, from only a received video, in units in which videos of interviews, performances, and the like change. Compared to the conventional method, the processing speed can be increased.
  • the performer information associates each performer name with an appearance duration in which this performer appears in a video.
  • a specific video is temporally divided into a plurality of segments.
  • the example of FIG. 3 shows segments 1, 2, 3, and 4.
  • a sequence is a time period of a video group which is included in a segment. This time period is designated by an appearance duration between start and end times. This time period may have various durations, as shown in FIG. 3.
  • a cluster includes a plurality of sequences determined as representing one and the same person.
  • the example of FIG. 3 shows clusters 1, 2, and 3.
  • FIG. 4 is a flowchart showing the operation of the video processing apparatus according to this embodiment until segments are labeled from a video.
  • the performer information extraction unit 101 extracts performer information including performer names and appearance durations of the performer names from a video (step S401).
  • the clustering unit 102 executes clustering by extracting features of performers from the video, and deciding figures having similar features as representing one and the same person (step S402).
  • the labeling unit 103 detects overlapping parts of the appearance durations of the performer names in the performer information and those of the sequences included in the clusters clustered by the clustering unit 102, and associates the performer names with the cluster IDs (step S403).
  • the segment association unit 201 associates segments and the performer names from sequences included in the clusters in segmentation periods which are stored in the segmentation result storage unit 203 and are obtained by dividing change points of videos into periods (step S404).
  • the segment association storage unit 202 stores the associated segments and performer names (step S405).
  • An example of the operation of step S403 in FIG. 4 executed by the labeling unit 103 will be described below with reference to FIG. 5.
  • the labeling unit 103 acquires a cluster ID and appearance durations of sequences included in that cluster from the clustering result of the clustering unit 102 (step S501).
  • the unit 103 acquires a performer name and an appearance duration of the performer name from the performer information extraction unit 101 (step S502).
  • the unit 103 compares the appearance durations of the sequences included in the cluster with that of the performer name to check whether the appearance durations of the sequences are included in that of the performer name (step S503). If the appearance durations of the sequences are not included in that of the performer name, the process returns to step S502 to acquire a new performer name and an appearance duration of the performer name.
  • the unit 103 associates the cluster ID including these sequences with the performer name (step S504).
  • the unit 103 checks whether sequences included in a new cluster corresponding to an appearance duration extracted by the performer information extraction unit 101 are available (step S505). If such sequences are available, the process returns to step S501 to acquire a cluster ID and appearance durations of sequences included in this cluster. If no such sequences are available, the unit 103 selects a performer name corresponding to one cluster ID from the association results of the cluster IDs and performer names (step S506), thus ending the processing.
  • An example of the cluster IDs and performer names obtained in the process of step S506 will be described later with reference to FIGS. 10, 13, and 17. Note that the unit 103 may select a plurality of performer names from one cluster in step S506. Details of such a case will be given later with reference to FIGS. 12 and 13.
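  • A compact sketch of this labeling loop (steps S501 to S506) is shown below. It assumes the simple data structures of the earlier sketch and a "most frequent name wins" selection in step S506; both are illustrative assumptions rather than the patent's exact rules:

```python
# Hedged sketch of steps S501-S506: match cluster sequence durations against each
# performer name's appearance duration, then keep one performer name per cluster ID.
from collections import Counter

def within(inner, outer):
    """True if the inner (start, end) duration lies inside the outer duration."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def label_clusters(clusters, performer_info):
    """
    clusters:       {cluster_id: [sequence durations]}            (clustering result)
    performer_info: [(performer_name, appearance_duration), ...]  (extracted information)
    returns:        {cluster_id: performer_name}
    """
    votes = {}                                                  # cluster_id -> name counts
    for cluster_id, seq_durations in clusters.items():          # S501
        for name, name_duration in performer_info:              # S502
            # S503/S504: associate when the cluster's sequences fall inside
            # the performer name's appearance duration.
            if any(within(d, name_duration) for d in seq_durations):
                votes.setdefault(cluster_id, Counter())[name] += 1
    # S506: select one performer name per cluster ID.
    return {cid: c.most_common(1)[0][0] for cid, c in votes.items()}
```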
  • An example of the operation of step S404 in FIG. 4 executed by the segment association unit 201 will be described below with reference to FIG. 6.
  • the operation of step S404 is executed for each specific sequence.
  • the segment association unit 201 acquires a cluster ID and appearance durations of sequences included in that cluster from the clustering result of the clustering unit 102 (step S601).
  • the unit 201 acquires a segment ID and segment duration from the segmentation result stored in the segmentation result storage unit 203 (step S602).
  • the unit 201 checks whether the appearance durations of the sequences included in the cluster are included in the segment duration (step S603). If the appearance durations of the sequences are included in the segment duration, the unit 201 acquires an association result of the cluster ID and performer name from the processing result of the labeling unit 103 (step S604).
  • the unit 201 checks whether the cluster ID in step S601 matches that in step S604 (step S605).
  • If the two cluster IDs do not match, the unit 201 acquires the next association result of the cluster ID and performer name from the processing result of the labeling unit 103 (step S604). If the two cluster IDs match, the unit 201 associates the segment ID acquired in step S602 with the performer name acquired in step S604 (step S606).
  • the segment association unit 201 checks whether a new segment is available (step S607). If a new segment is available, the process returns to step S602 to acquire a segment ID and segment duration. If no new segment is available, the unit 201 checks whether sequences included in a new cluster are available (step S608). If sequences included in a new cluster are available, the process returns to step S601 to acquire a cluster ID and cluster appearance duration. If no sequences included in a new cluster are available, the processing ends.
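  • The segment association loop (steps S601 to S608) can be pictured with the same assumed data structures; again this is an illustrative sketch rather than the patent's implementation:

```python
# Hedged sketch of steps S601-S608: a segment is labeled with a performer name when a
# cluster associated with that name has a sequence contained in the segment duration.
def associate_segments(clusters, segments, cluster_names):
    """
    clusters:      {cluster_id: [sequence durations]}   (clustering result)
    segments:      {segment_id: (start_sec, end_sec)}   (segmentation result)
    cluster_names: {cluster_id: performer_name}         (labeling result of step S403)
    returns:       {segment_id: set of performer names}
    """
    result = {}
    for cluster_id, seq_durations in clusters.items():              # S601
        for segment_id, (seg_start, seg_end) in segments.items():   # S602
            # S603: are any of the cluster's sequences inside this segment?
            if any(seg_start <= s and e <= seg_end for s, e in seq_durations):
                name = cluster_names.get(cluster_id)                # S604/S605
                if name is not None:
                    result.setdefault(segment_id, set()).add(name)  # S606
    return result
```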
  • a video processing apparatus of the first practical example will be described below with reference to FIG. 7.
  • in the first practical example, the performer information extraction unit 101, clustering unit 102, and segmentation result storage unit 203 in FIG. 2 are replaced by a closed caption extraction unit 701, a speaker clustering unit 702, and a speaker segmentation storage unit 703, respectively.
  • the closed caption extraction unit 701 extracts speaker names and utterance durations from closed captions appended to a video.
  • MPEG2-TS as a digital broadcasting protocol allows multiple transmission of various data (closed captions, EPG, BML, etc.) required for the broadcasting purpose in addition to audio and video data.
  • the closed captions are transmitted as text data of utterance contents of performers together with utterance durations and the like, so as to facilitate viewing by the hard of hearing.
  • FIG. 8 shows an example of closed captions.
  • when a speaking performer cannot be discriminated from the video alone (for example, when a plurality of performers appear in the video, when no speaker appears in the video, and so forth), a performer name in parentheses is written before the utterance content.
  • the speaker clustering unit 702 divides the speech data of a video into fine utterance periods, calculates features for the respective periods, expresses each period as a speaker vector whose components are likelihoods with respect to a plurality of statistical speaker models, compares neighboring speech periods expressed by the speaker vectors, and detects the switch timing of speakers at which the similarity becomes minimal.
  • the speaker clustering unit 702 clusters the speaker vectors at respective detected timings using a predetermined clustering method such as a MeanShift method or the like.
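  • As an illustration of the final clustering step only, the per-period speaker vectors could be grouped with scikit-learn's MeanShift; the feature extraction and speaker models above are assumed to have produced the vectors, and the bandwidth and toy values are made up:

```python
# Sketch: cluster speaker vectors (one per utterance period) with the MeanShift
# method mentioned above.  `speaker_vectors` has shape (n_periods, n_speaker_models).
import numpy as np
from sklearn.cluster import MeanShift

def cluster_speaker_vectors(speaker_vectors, bandwidth=0.5):
    """Return one cluster label per utterance period."""
    return MeanShift(bandwidth=bandwidth).fit_predict(np.asarray(speaker_vectors))

# Toy example: two well-separated "speakers" across four utterance periods.
toy = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
print(cluster_speaker_vectors(toy))   # e.g. [0 0 1 1]: the two groups get distinct labels
```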
  • the speaker clustering method is not limited to the aforementioned method, and any other existing method such as a method described in, e.g., JP-A 11-175090 (KOKAI) and the like may be used as long as it has a function of extracting speaker features in a video, and clustering figures who are one and the same person.
  • the speaker segmentation storage unit 703 expresses, as a vector, a duration occupied by sequences of each cluster as a result of speaker clustering of the speaker clustering unit 702, and stores a point where a similarity of continuous periods becomes minimal as a change point of speaker configurations.
  • the speaker segmentation method is not limited to the aforementioned method, and any other existing method may be used as long as it has a function of detecting change points of speaker configurations in a video, and dividing into periods.
  • the closed caption extraction unit 701 extracts closed captions including performer names and appearance durations of the performer names from a video (step S401).
  • the speaker clustering unit 702 executes speaker clustering by detecting utterance periods of performers from the video, extracting features of speakers, and deciding figures having similar features as representing one and the same person (step S402).
  • the labeling unit 103 detects overlapping parts of the appearance durations of the performer names included in the closed captions and those of speaker sequences included in speaker clusters clustered by the speaker clustering unit 702, and associates the performer names and speaker cluster IDs (step S403).
  • the segment association unit 201 associates segments and the performer names from speaker sequences included in speaker clusters within speaker segmentation periods which are stored in the speaker segmentation storage unit 703 and are obtained by dividing change points of speaker configurations into periods (step S404).
  • the segment association storage unit 202 stores the associated segments and performer names (step S405).
  • the labeling unit 103 acquires a speaker cluster ID and appearance durations of speaker sequences included in that speaker cluster from the speaker clustering result of the speaker clustering unit 702 (step S501).
  • the labeling unit 103 acquires a performer name and an appearance duration of the performer name from a closed caption extracted by the closed caption extraction unit 701 (step S502).
  • the labeling unit 103 compares the appearance durations of the speaker sequences included in the speaker cluster with that of the performer name to check whether the appearance durations of the speaker sequences are included in that of the performer name (step S503).
  • a method of selecting a cluster within the appearance duration of the performer name upon checking whether the appearance durations of the speaker sequences are included in that of the performer name, and extracting an overlapping duration will be described below with reference to FIG. 9.
  • the distribution of the appearance duration of each performer name is expressed by a triangular distribution, but any other distribution such as a uniform distribution, normal distribution, and the like may be assumed.
  • the labeling unit 103 calculates the sum of the distribution of sequences in a cluster and that of the appearance durations of each performer name. When the sum for the cluster is greater than or equal to a threshold, the unit 103 associates the cluster ID of that cluster with the performer name (step S504). In FIG. 9, the unit 103 associates clusters 1 and 3, whose sums are greater than or equal to the threshold, with performer names A and C.
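  • One possible reading of this scoring is sketched below: each performer name's appearance duration is modelled as a triangular distribution and the mass covered by the cluster's sequences is accumulated; the triangular model and the integration grid are illustrative assumptions:

```python
# Sketch of a FIG. 9 style score: how much of a performer name's triangular
# appearance-duration distribution is covered by a cluster's sequence durations.
# Clusters whose score reaches a threshold would be associated with that name.
import numpy as np

def triangular_density(t, start, end):
    """Triangular distribution over [start, end], peaking at the midpoint, area 1."""
    mid, half = (start + end) / 2.0, (end - start) / 2.0
    return np.clip(1.0 - np.abs(t - mid) / half, 0.0, None) / half

def overlap_score(seq_durations, name_duration, grid=0.1):
    """Accumulate the density over the parts of the sequences inside the duration."""
    start, end = name_duration
    score = 0.0
    for s, e in seq_durations:
        lo, hi = max(s, start), min(e, end)
        if hi > lo:
            t = np.arange(lo, hi, grid)
            score += float(np.sum(triangular_density(t, start, end)) * grid)
    return score

# A sequence spanning the middle fifth of a 100-second appearance duration covers
# roughly a third of the triangle's mass.
print(round(overlap_score([(40.0, 60.0)], (0.0, 100.0)), 2))   # ~0.36
```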
  • the labeling unit 103 checks whether speaker sequences included in a new speaker cluster corresponding to an appearance duration extracted by the closed caption extraction unit 701 are available (step S505). If speaker sequences included in a new speaker cluster are available, the process returns to step S501 to acquire a speaker cluster and appearance durations of speaker sequences included in the speaker cluster. If no speaker sequences included in a new speaker cluster are available, the labeling unit 103 selects a performer name for one speaker cluster ID from the association results of the cluster IDs and performer names (step S506).
  • Details of step S404 will be given below. The segment association unit 201 acquires a speaker cluster ID and appearance durations of speaker sequences included in that speaker cluster from the speaker clustering result of the speaker clustering unit 702 (step S601).
  • the segment association unit 201 acquires a speaker segment ID and speaker segment duration from the speaker segmentation result (step S602).
  • the segment association unit 201 checks whether the appearance durations of the speaker sequences included in the speaker cluster are included in the speaker segment duration (step S603).
  • If the segment association unit 201 determines that the appearance durations of the speaker sequences included in the speaker cluster are included in the speaker segment duration, it acquires the association result of the speaker cluster ID and performer name from the processing result of the labeling unit 103 (step S604).
  • the segment association unit 201 checks whether the speaker cluster ID in step S601 matches that in step S604 (step S605). If the two IDs do not match, the segment association unit 201 acquires the next association result of the speaker cluster ID and performer name (step S604). If the two speaker cluster IDs match, the segment association unit 201 associates the speaker segment ID acquired in step S602 with the performer name acquired in step S604 (step S606).
  • the segment association unit 201 checks whether a new segment is available (step S607). If a new speaker segment is available, the process returns to step S602, and the segment association unit 201 acquires a speaker segment ID and speaker segment duration. If no new speaker segment is available, the segment association unit 201 checks whether sequences included in a new cluster are available (step S608). If speaker sequences included in a new speaker cluster are available, the process returns to step S601, and the segment association unit 201 acquires a speaker cluster ID and appearance durations of speaker sequences included in that speaker cluster. If no new speaker cluster ID is available, the processing ends.
  • (Second Practical Example)
  • the caption recognition unit 1101 detects caption appearance periods and regions from a video. For example, the unit 1101 detects edges from a plurality of continuous frame images using a Sobel operator, and extracts only the edges common to these frames to obtain still edges. The unit 1101 also extracts only pixels with small luminance variations, and detects a caption region based on the position information of the still edges and of those pixels.
  • the unit 1101 recognizes characters by OCR, and calculates a caption appearance duration based on the areas of the caption regions in the continuous frames.
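  • The still-edge idea can be sketched with OpenCV as follows; the threshold is an arbitrary illustrative value, and the luminance-variation and OCR steps described above are omitted:

```python
# Rough sketch (not the patented detector): edges that persist across consecutive
# frames, found with a Sobel operator, serve as a cue for caption regions.
import cv2
import numpy as np

def still_edge_mask(gray_frames, edge_thresh=100.0):
    """gray_frames: list of grayscale frames (np.uint8). Returns a boolean mask."""
    masks = []
    for frame in gray_frames:
        gx = cv2.Sobel(frame, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(frame, cv2.CV_32F, 0, 1)
        masks.append(cv2.magnitude(gx, gy) > edge_thresh)
    # Keep only edges common to every frame, i.e. edges that do not move.
    return np.logical_and.reduce(masks)
```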
  • the caption recognition method is not limited to the aforementioned method.
  • any other existing caption recognition method may be used as long as it has a function of extracting captions and caption appearance durations in a video, as disclosed in JP-A 2001-285716 (KOKAI).
  • a performer name may be extracted from the caption recognition result by matching against performer names included in the performer information in EPG data, or by using a named entity (intrinsic representation) extraction method disclosed in, e.g., JP-A 2007-148785 (KOKAI).
  • the face clustering unit 1102 detects facial regions in a video, and creates many fragmentary sequences by tracking.
  • the unit 1102 creates a facial image partial space for each fragmentary sequence, and creates a similarity matrix of the facial image partial spaces for a combination of fragmentary sequences.
  • the unit 1102 creates a facial still image dictionary of a representative face for each fragmentary sequence, and creates a similarity matrix based on facial feature points for a combination of fragmentary sequences.
  • the unit 1102 executes hierarchical clustering using the two similarity matrices.
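  • The final step could look like the following sketch, which runs hierarchical clustering over the fragmentary face sequences given the two similarity matrices; averaging the two matrices is purely an illustrative assumption, since the text does not say how they are combined:

```python
# Sketch: hierarchical clustering of fragmentary face sequences from two (n, n)
# similarity matrices (facial-subspace based and feature-point based).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_face_sequences(sim_subspace, sim_points, distance_thresh=0.5):
    """Both inputs are symmetric similarity matrices with values in [0, 1]."""
    similarity = (np.asarray(sim_subspace) + np.asarray(sim_points)) / 2.0
    distance = 1.0 - similarity
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)        # condensed form for linkage()
    tree = linkage(condensed, method="average")
    # One face-cluster ID per fragmentary sequence.
    return fcluster(tree, t=distance_thresh, criterion="distance")
```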
  • the face clustering method is not limited to the aforementioned method, and any other existing face clustering method may be used as long as it has a function of extracting features of faces in a video, and clustering figures determined as representing one and the same person.
  • the similar shot segmentation storage unit 1103 executes determination of similar shots, calculations of dialogue frequencies, and the like to determine chapter boundaries, and stores similar shots, as described in, e.g., JP-A 2005-130416 (KOKAI).
  • in one similar shot segmentation method, a point where the hue or luminance levels differ greatly between neighboring frames of the video data is detected as a cut point, inter-frame similarity determination is made by a round-robin method between several frames tracing back from a previous cut point and several frames after the nearest cut point, and two shots are determined to be similar when the number of pairs of frames determined as similar frames is greater than or equal to a threshold.
  • for dialogue detection, a period (part) where similar shots appear intensively is regarded as a significant period, and an index called the "dialogue frequency" is introduced to quantify how concentrated the appearance of similar shots is. The "dialogue frequency" increases as the following conditions are satisfied.
  • Condition 1: Many shots are included.
  • Condition 2: The number of similar shots is large.
  • Condition 3: The total duration of similar shots is long.
  • the similar shot segmentation storage unit 1103 calculates dialogue periods based on the dialogue frequencies.
  • the unit 1103 connects adjacent dialogue periods.
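  • The exact dialogue frequency formula is not given here, so the sketch below uses an illustrative score that simply grows with conditions 1 to 3 above for a window of shots:

```python
# Illustrative only: a score that increases with the number of shots, the number of
# shots having a similar partner, and the total duration of those similar shots.
def dialogue_frequency(shots, similar_pairs):
    """
    shots:         list of (start_sec, end_sec) shot durations within a window
    similar_pairs: set of (i, j) index pairs judged to be similar shots
    """
    in_dialogue = {i for pair in similar_pairs for i in pair}
    similar_duration = sum(shots[i][1] - shots[i][0] for i in in_dialogue)
    return len(shots) * (len(in_dialogue) + 1) * (similar_duration + 1.0)

# Example window: shots 0 and 2 cut back and forth between the same two camera views.
print(dialogue_frequency([(0, 4), (4, 8), (8, 12), (12, 16)], {(0, 2)}))
```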
  • the similar shot segmentation method is not limited to the aforementioned method, and any other existing similar shot segmentation method may be used as long as it has a function of segmenting similar shots in a video. An example of the operation of the video processing apparatus of the second practical example will be described below with reference to FIGS. 4, 5, and 6.
  • the caption recognition unit 1101 extracts captions including performer names and appearance durations of the performer names from a video (step S401).
  • the face clustering unit 1102 executes face clustering by detecting the faces of performers from the video, and deciding figures having similar features as representing one and the same person (step S402).
  • the labeling unit 103 detects overlapping parts of the appearance durations of the performer names in captions and those of face sequences included in face clusters clustered by the face clustering unit 1102, and associates the performer names with the face cluster IDs (step S403).
  • the segment association unit 201 associates segments with the performer names from the face sequences included in the face clusters with the face cluster IDs within similar shot segmentation periods which are stored in the similar shot segmentation storage unit 1103 and are obtained by dividing the change points of similar shots into periods (step S404).
  • the segment association storage unit 202 stores the associated segments and performer names (step S405).
  • the labeling unit 103 acquires a face cluster ID and appearance durations of face sequences included in that face cluster from the face clustering result of the face clustering unit 1102 (step S501).
  • the labeling unit 103 acquires a performer name and an appearance duration of the performer name from a caption recognized by the caption recognition unit 1101 (step S502).
  • the labeling unit 103 compares the appearance durations of the face sequences included in the face cluster with that of the performer name to check whether the appearance durations of the face sequences are included in that of the performer name (step S503).
  • a method of selecting a cluster within the appearance duration of the performer name upon checking whether the appearance durations of the face sequences are included in that of the performer name, and extracting an overlapping duration will be described below with reference to FIG. 12.
  • the distribution of the appearance duration of each performer name is expressed by a triangular distribution, but any other distribution such as a uniform distribution, normal distribution, and the like may be assumed.
  • the labeling unit 103 calculates the sum of the distribution of sequences in a cluster and that of the appearance duration of each performer name, and associates cluster IDs with the performer name for up to a threshold number of clusters, in descending order of the sums (step S504).
  • in FIG. 12, the sums of the distributions of sequences and the distributions of appearance durations of the respective performer names satisfy cluster 1 > cluster 3 > cluster 2.
  • This threshold indicates the upper limit of the number of clusters to be selected. For example, when the threshold is "2" in FIG. 12, clusters 1 and 3 are associated with performer names.
  • the labeling unit 103 checks whether face sequences included in a new face cluster corresponding to the appearance duration acquired by the caption recognition unit 1101 are available (step S505).
  • If face sequences included in a new face cluster are available, the process returns to step S501 to acquire a face cluster and appearance durations of face sequences included in the face cluster. If no face sequences included in a new face cluster are available, the unit 103 selects a performer name corresponding to one face cluster ID from the association results of face cluster IDs and performer names (step S506).
  • FIG. 13 shows an example of the association results of cluster IDs and performer names. In FIG. 13, in associations between face cluster IDs and performer names, a performer name with a maximum frequency of occurrence is associated with each face cluster ID. Details of step S404 will be given below.
  • the segment association unit 201 acquires a face cluster ID and the appearance durations of the face sequences included in that face cluster from the face clustering result of the face clustering unit 1102 (step S601).
  • the segment association unit 201 acquires a similar shot segment ID and similar shot segment duration from the similar shot segmentation result (step S602).
  • the segment association unit 201 checks whether the appearance durations of the face sequences included in the face cluster are included in the similar shot segment duration (step S603). If the segment association unit 201 determines that the appearance durations of the face sequences included in the face cluster are included in the similar shot segment duration, it acquires the association result of the face cluster ID and performer name from the processing result of the labeling unit 103 (step S604).
  • the segment association unit 201 checks whether the face cluster ID in step S601 matches that in step S604 (step S605). If the two IDs do not match, the segment association unit 201 acquires the next association result of the face cluster ID and performer name (step S604). If the two face cluster IDs match, the segment association unit 201 associates the similar shot segment ID acquired in step S602 with the performer name acquired in step S604 (step S606). If the appearance durations of the face sequences are not included in the similar shot segment duration in step S603, the segment association unit 201 checks whether a new similar shot segment is available (step S607).
  • If a new similar shot segment is available, the process returns to step S602, and the segment association unit 201 acquires a similar shot segment ID and similar shot segment duration. If no new similar shot segment is available, the segment association unit 201 checks whether face sequences included in a new face cluster are available (step S608). If face sequences included in a new face cluster are available, the process returns to step S601, and the segment association unit 201 acquires a face cluster ID and appearance durations of face sequences included in the face cluster. If no face sequences included in a new face cluster are available, the processing ends.
  • (Third Practical Example) A video processing apparatus of the third practical example will be described below with reference to FIG. 14.
  • in the third practical example, the performer information extraction unit 101, clustering unit 102, and segmentation result storage unit 203 in FIG. 2 are replaced by a speech recognition unit 1401, the speaker clustering unit 702, and a musical period segmentation storage unit 1402, respectively.
  • the speech recognition unit 1401 recognizes speech from a video and detects speech appearance periods. For example, the unit 1401 calculates the similarities or distances between stored speech models of words to be recognized and a feature parameter sequence of speech, and outputs words associated with the speech models with a maximum similarity (or a minimum distance) as a recognition result.
  • for example, a method of also expressing the speech models by feature parameter sequences and calculating the distances between the feature parameter sequences of the speech models and that of the input speech by dynamic programming (DP), a method of expressing the speech models using hidden Markov models (HMM) and calculating the probability of each speech model given the feature parameter sequence of the input speech, and the like are available.
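  • A minimal DP (dynamic time warping style) match between a stored word model and an input feature sequence is sketched below; plain Euclidean frame distances stand in for whatever acoustic features a real recognizer would use:

```python
# Sketch of DP matching: accumulate frame-to-frame distances between a word model
# and the input feature sequence; recognition picks the word with the smallest cost.
import numpy as np

def dp_distance(model, speech):
    """model, speech: (n_frames, n_dims) feature arrays. Returns accumulated cost."""
    n, m = len(model), len(speech)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(model[i - 1] - speech[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def recognize(speech, word_models):
    """word_models: {word: (n_frames, n_dims) array}. Returns the best-matching word."""
    return min(word_models, key=lambda w: dp_distance(word_models[w], speech))
```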
  • the speech recognition method is not limited to the aforementioned method, and any other existing speech recognition method may be used as long as it has a function of recognizing speech from a video and detecting speech appearance periods.
  • a performer name may be extracted from the speech recognition result by matching against performer names included in the performer information in EPG data, or by using a named entity (intrinsic representation) extraction method disclosed in, e.g., JP-A 2007-148785 (KOKAI).
  • the musical period segmentation storage unit 1402 segments speech data in a video into musical periods and stores the musical periods.
  • for each frame, a feature that represents the sound type is compared with statistical models, which are learned in advance and are used to determine the sound types, to calculate log likelihoods for the respective models.
  • while shifting a boundary candidate point at given intervals, the absolute values of the differences between features extracted by windows centered on the candidate point are calculated, so as to detect points where the feature change is greater than or equal to a threshold as change points.
  • Periods where the likelihoods of music or music multiplexed speech exceed a threshold are determined as musical periods by removing short musical periods and merging musical periods with low likelihoods at their two ends.
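  • The last stage (thresholding the music likelihood and discarding short runs) might look like the following sketch; the per-frame likelihoods are assumed to come from the pre-trained sound-type models, and all values here are illustrative:

```python
# Sketch: turn per-frame music log-likelihoods into musical periods by thresholding
# and dropping periods that are too short (the merging of low-likelihood ends
# described above is omitted).
def musical_periods(loglik, frame_sec=1.0, thresh=0.0, min_sec=3.0):
    """loglik: per-frame music log-likelihoods. Returns [(start_sec, end_sec), ...]."""
    periods, start = [], None
    for i, value in enumerate(loglik):
        if value >= thresh and start is None:
            start = i
        elif value < thresh and start is not None:
            periods.append((start, i))
            start = None
    if start is not None:
        periods.append((start, len(loglik)))
    return [(s * frame_sec, e * frame_sec) for s, e in periods
            if (e - s) * frame_sec >= min_sec]

print(musical_periods([-1, 2, 3, 2, -1, -2, 1, -1, 4, 4, 4, -1]))  # [(1.0, 4.0), (8.0, 11.0)]
```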
  • the musical period segmentation method is not limited to the aforementioned method, and any other existing musical period segmentation method may be used as long as it has a function of segmenting speech data in a video into musical periods.
  • the speech recognition unit 1401 recognizes speech including performer names and appearance durations of the performer names from a video (step S401).
  • the speaker clustering unit 702 executes speaker clustering by detecting utterance periods of performers from the video, extracting features of speakers, and deciding figures having similar features as representing one and the same person (step S402).
  • the labeling unit 103 detects overlapping parts of the appearance durations of the performer names in speech and those of speech sequences included in speech clusters clustered by the speaker clustering unit 702, and associates the performer names with speaker cluster IDs (step S403).
  • the segment association unit 201 associates segments with the performer names from speaker sequences included in speaker clusters within musical period segmentation periods, which are stored in the musical period segmentation storage unit 1402 and are obtained by dividing change points of music into periods (step S404).
  • the segment association storage unit 202 stores the associated segments and performer names (step S405).
  • the labeling unit 103 acquires a speaker cluster ID and appearance durations of speaker sequences included in that speaker cluster from the speaker clustering result of the speaker clustering unit 702 (step S501).
  • the labeling unit 103 acquires a performer name and an appearance duration of the performer name from speech recognized by the speech recognition unit 1401 (step S502).
  • the labeling unit 103 compares the appearance durations of the speaker sequences included in the speaker cluster and that of the performer name to check whether the appearance durations of the speaker sequences are included in that of the performer name (step S503).
  • a method of selecting a cluster within the appearance duration of the performer name upon checking whether the appearance durations of the speaker sequences are included in that of the performer name, and extracting an overlapping duration will be described below with reference to FIG. 15.
  • the distribution of the appearance duration of each performer name is expressed by a triangular distribution, but any other distribution such as a uniform distribution, normal distribution, and the like may be assumed.
  • the labeling unit 103 associates a cluster ID with the performer name for every cluster in which the number of sequences that appear within the appearance duration distribution of that performer name is greater than or equal to a threshold (step S504).
  • the labeling unit 103 checks whether speaker sequences included in a new speaker cluster corresponding to an appearance duration acquired by the speech recognition unit 1401 are available (step S505). If speaker sequences included in a new speaker cluster are available, the process returns to step S501 to acquire a speaker cluster and appearance durations of speaker sequences included in the speaker cluster. If no speaker sequences included in a new speaker cluster are available, the labeling unit 103 selects a performer name for one speaker cluster ID from the association results of the speaker cluster IDs and performer names (step S506).
  • FIG. 16 shows an example of the association results of cluster IDs and performer names.
  • a performer name with a maximum likelihood is associated with each speaker cluster ID.
  • the reliability is calculated by dividing the frequency of occurrence in the association result of each speaker cluster ID and performer name by the product of the total frequency of occurrence for each performer name and the number of clusters. Details of step S404 will be given below.
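  • Before turning to step S404, the reliability just described can be written out directly; the variable names are assumptions:

```python
# reliability(cluster, name) = count(cluster, name) / (total_count(name) * num_clusters)
def reliability(pair_count, total_count_for_name, num_clusters):
    """pair_count: occurrences of this (speaker cluster ID, performer name) pair."""
    return pair_count / (total_count_for_name * num_clusters)

print(reliability(pair_count=6, total_count_for_name=10, num_clusters=3))  # 0.2
```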
  • the segment association unit 201 acquires a speaker cluster ID and appearance durations of speaker sequences included in that speaker cluster from the speaker clustering result of the speaker clustering unit 702 (step S601).
  • the segment association unit 201 acquires a musical period segment ID and musical period segment duration from the musical period segmentation result (step S602).
  • the segment association unit 201 checks whether the appearance durations of the speaker sequences included in the speaker cluster are included in the musical period segment duration (step S603). If the segment association unit 201 determines that the appearance durations of the speaker sequences included in the speaker cluster are included in the musical period segment duration, it acquires the association result of the speaker cluster ID and performer name from the processing result of the labeling unit 103 (step S604).
  • the segment association unit 201 checks whether the speaker cluster ID in step S601 matches that in step S604 (step S605). If the two IDs do not match, the segment association unit 201 acquires the next association result of the speaker cluster ID and performer name (step S604). If the two speaker cluster IDs match, the segment association unit 201 associates the musical period segment ID acquired in step S602 with the performer name acquired in step S604 (step S606).
  • the segment association unit 201 checks whether a new musical period segment is available (step S607). If a new musical period segment is available, the process returns to step S602, and the segment association unit 201 acquires a musical period segment ID and musical period segment duration. If no new musical period segment is available, the segment association unit 201 checks whether a new speaker cluster ID and speaker sequences included in that speaker cluster are available (step S608). If speaker sequences included in a new speaker cluster are available, the process returns to step S601, and the segment association unit 201 acquires a speaker cluster ID and appearance durations of speaker sequences included in that speaker cluster. If no new speaker cluster ID is available, the processing ends.
  • a corner caption segmentation storage unit may be used in place of the segmentation result storage unit 203 or chapters manually assigned by the user may be stored.
  • the corner caption segmentation storage unit detects corner captions each of which is kept displayed for a long duration (e.g., during one corner or the like), and stores segments by automatically chaptering based on their display periods.
  • slice images are created by cutting a spatiotemporal image along a plane parallel to the time axis, line segment probabilities in the slice images are calculated, and binarization is made to extract pixels having predetermined line segment probabilities or more.
  • Two-dimensional region extraction is made in the slice images, and the extracted regions are combined in the spatiotemporal image to calculate a three-dimensional spatiotemporal caption region.
  • a chapter is determined in correspondence with the start time of the extracted spatiotemporal caption.
  • the corner caption segmentation method is not limited to the aforementioned method, and any other existing method may be used as long as it has a function of determining chapters based on corner caption display periods.
  • cluster IDs and performer names may be associated with each other for up to a threshold number of clusters, in descending order of the number of sequence appearances included in the clusters within the performer name appearance duration distribution, as shown in FIG. 17.
  • of clusters 1 to 3, clusters are selected up to the threshold number in descending order of the number of sequences, starting from cluster 3, which has the largest number of sequences; that is, cluster 3 is associated with performer name C.
  • as the clustering unit 102 used to associate the labeling results for the respective performers with the segmentation periods, one of the speaker clustering unit and the face clustering unit is used.
  • alternatively, both clustering units may be used at the same time; a correspondence between the speaker clustering and the face clustering may be calculated from the maximum frequency of occurrence, and one of these clusterings may be used after conversion.
  • a cluster of either one of the performer names may be preferentially used, or a cluster of the performer name with a larger frequency of occurrence may be preferentially used.
  • since clusters and performer names can be associated with each other using performer information and the clustering result of figures having similar features, the need for registering facial images in advance and collating with these facial images can be obviated. Since segments and performer names are associated with each other from sequences included in clusters within segmentation periods obtained by dividing change points of videos into periods, a scene in which a desired performer appears can be played back, from only a received video, in units in which videos of interviews, performances, and the like change. Furthermore, the processing speed can be increased compared to the conventional method. Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details and representative embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Television Signal Processing For Recording (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

An apparatus includes a unit which extracts performer information including a performer name and a first appearance duration in which the performer appears in a video; a unit which extracts features of performers; a unit which determines, among figures appearing in a first sequence included in the video, figures whose feature similarities are larger than a threshold as representing one and the same person; a unit which creates a second appearance duration of at least one second sequence included in the video in which the one and the same person is determined to appear, and a first cluster identifier of a first cluster including the second sequence; a unit which determines whether the second appearance duration for each sequence is included in the first appearance duration; and a unit which, when the second appearance duration is included in the first appearance duration, associates a second cluster identifier of a second cluster including the second sequence corresponding to the second appearance duration with the performer name.
PCT/JP2009/054116 2008-03-24 2009-02-26 Video processing apparatus and method WO2009119272A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2009514284A JP2011519183A (ja) 2008-03-24 2009-02-26 Video processing apparatus and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2008076575 2008-03-24
JP2008-076575 2008-03-24

Publications (1)

Publication Number Publication Date
WO2009119272A1 true WO2009119272A1 (fr) 2009-10-01

Family

ID=40749111

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2009/054116 WO2009119272A1 (fr) 2008-03-24 2009-02-26 Appareil et procédé de traitement vidéo

Country Status (2)

Country Link
JP (1) JP2011519183A (fr)
WO (1) WO2009119272A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6934098B1 (ja) * 2020-08-28 2021-09-08 Kddi株式会社 情報処理装置、情報処理方法及びプログラム

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MARK EVERINGHAM ET AL: "Hello! My name is... Buffy - Automatic Naming of Characters in TV Video", PROCEEDINGS OF THE BRITISH MACHINE VISION CONFERENCE (2006), 4 September 2006 (2006-09-04), pages 899 - 908, XP009100754 *
SATOH S ET AL: "NAME-IT: NAMING AND DETECTING FACES IN NEWS VIDEOS", IEEE MULTIMEDIA, vol. 6, no. 1, 1 January 1999 (1999-01-01), pages 22 - 35, XP000832068, ISSN: 1070-986X *
YAO WANG ET AL: "Multimedia Content Analysis - Using Both Audio and Visual Clues", IEEE SIGNAL PROCESSING MAGAZINE, vol. 17, no. 6, 1 November 2000 (2000-11-01), pages 12 - 36, XP011089877, ISSN: 1053-5888 *

Also Published As

Publication number Publication date
JP2011519183A (ja) 2011-06-30

Similar Documents

Publication Publication Date Title
US11197073B2 (en) Advertisement detection system and method based on fingerprints
EP1081960B1 Signal processing method and video/audio signal processing apparatus
KR100828166B1 Method of extracting metadata through speech recognition and caption recognition of a moving picture, method of searching a moving picture using the metadata, and recording medium recording the same
US8200061B2 (en) Signal processing apparatus and method thereof
US8457469B2 (en) Display control device, display control method, and program
US7920761B2 (en) Multimodal identification and tracking of speakers in video
US9009054B2 (en) Program endpoint time detection apparatus and method, and program information retrieval system
JP4300697B2 Signal processing apparatus and method
US20180144194A1 (en) Method and apparatus for classifying videos based on audio signals
JP4332700B2 Method and apparatus for segmenting and indexing television programs using multimedia cues
KR20000054561A Network-based video retrieval system using a video indexing method, and operating method thereof
JP2005173569A Apparatus and method for classifying audio signals
US20070201764A1 (en) Apparatus and method for detecting key caption from moving picture to provide customized broadcast service
US7734096B2 (en) Method and device for discriminating obscene video using time-based feature value
US20100169248A1 (en) Content division position determination device, content viewing control device, and program
Haloi et al. Unsupervised story segmentation and indexing of broadcast news video
WO2009119272A1 (fr) Video processing apparatus and method
US20100079673A1 (en) Video processing apparatus and method thereof
JP2007060606A Computer program comprising a system for automatic structure extraction and provision of video
JP2009049667A Information processing apparatus, and processing method and program therefor
US20060092327A1 (en) Story segmentation method for video
JP2002014973A Video retrieval apparatus and method, and recording medium recording a video retrieval program
JP2000092435A Signal feature extraction method and apparatus, speech recognition method and apparatus, and moving image editing method and apparatus
Tiwari et al. Video Segmentation and Video Content Analysis
Leonardi et al. Top-Down and Bottom-Up Semantic Indexing of Multimedia

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 2009514284

Country of ref document: JP

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09725799

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09725799

Country of ref document: EP

Kind code of ref document: A1