JP2007323318A

JP2007323318A - Speaker face image determination method, device, and program

Info

Publication number: JP2007323318A
Application number: JP2006152189A
Authority: JP
Inventors: Dan Mikami; 弾三上; Hidenobu Osada; 秀信長田; Shozo Azuma; 正造東; Masashi Morimoto; 正志森本
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2006-05-31
Filing date: 2006-05-31
Publication date: 2007-12-13
Anticipated expiration: 2026-05-31
Also published as: JP4685712B2

Abstract

PROBLEM TO BE SOLVED: To identify a speaker without viewing a video, even when the speaker's name is not known.

SOLUTION: The speaker face image determination method comprises: recognizing the audio of an input video by an unsupervised speaker recognition technique, assigning a speaker ID to each speaker section (defined by a start time and an end time in the video) and to the speaker corresponding to that section, and storing the result in a speaker recognition result storage means as a speaker recognition result; detecting, by a method that locates faces in the input video, the face images contained in the speaker sections corresponding to each speaker ID, and storing them together with the speaker ID in a per-speaker-ID face image storage means; extracting a personal feature from each face image stored for each speaker ID; and determining, from those personal features, the most suitable personal feature for each speaker ID and storing it in a speaker personal feature determination result storage means.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a speaker face image determination method, apparatus, and program, and more particularly to a method, apparatus, and program for associating speakers with face images in a video that includes audio.

Speaker recognition is one of the technologies used for video indexing. It is broadly divided into supervised and unsupervised speaker recognition. Supervised speaker recognition relies on training data, which must be prepared before recognition can be performed. Unsupervised speaker recognition, by contrast, can divide a video into per-speaker sections without any training data.

However, although unsupervised speaker recognition can divide a video into per-speaker sections, it cannot tell whose speech each section contains. Even when labels are assigned to the sections, the result is merely a sequence such as

"Speaker 1, Speaker 2, Speaker 1, ..."

Consequently, a user who wants to watch only the portions spoken by a particular speaker (Mr. A) must first view the video once to find out which speaker label corresponds to Mr. A.

The present invention has been made in view of the above. Its object is to provide a speaker face image determination method, apparatus, and program that associate each speaker label with the face image of that speaker, so that a speaker can be identified without viewing the video even when the speaker's name is unknown.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (Claim 1) is a face image determination method for determining, from an input video that includes audio, the face image of a speaker appearing in the video, comprising:
an acoustic analysis step (step 1) in which acoustic analysis means recognizes the audio of the input video by an unsupervised speaker recognition technique, assigns a speaker ID to each speaker section, defined by a start time and an end time in the video, and to the speaker corresponding to that section, and stores the result in speaker recognition result storage means as a speaker recognition result;
a face detection step (step 2) in which face detection means detects, by a method that locates faces in the input video, the face images contained in the speaker section corresponding to a speaker ID, and stores them together with the speaker ID in per-speaker-ID face image storage means;
a personal feature extraction step (step 3) in which personal feature extraction means extracts a personal feature for each face image of each speaker ID stored in the per-speaker-ID face image storage means; and
a speaker personal feature determination step (step 4) in which speaker personal feature determination means determines, from the personal features of the face images, the most suitable personal feature for each speaker ID and stores it in speaker personal feature determination result storage means.

In the present invention (Claim 2), the speaker personal feature determination step (step 4) clusters the personal features and takes, as the most suitable personal feature, that of the face image appearing most frequently among the face images in the video of the speaker section.

In the present invention (Claim 3), the speaker personal feature determination step (step 4) takes, as the most suitable personal feature, the one that appears most frequently among the personal features whose mutual distance is smaller than a threshold.

FIG. 2 is a principle configuration diagram of the present invention.

The present invention (Claim 4) is a face image determination apparatus for determining, from an input video that includes audio, the face image of a speaker appearing in the video, comprising:
acoustic analysis means 1 that recognizes the audio of the input video by an unsupervised speaker recognition technique, assigns a speaker ID to each speaker section, defined by a start time and an end time in the video, and to the speaker corresponding to that section, and stores the result in speaker recognition result storage means 5 as a speaker recognition result;
face detection means 2 that detects, by a method that locates faces in the input video, the face images contained in the speaker section corresponding to a speaker ID, and stores them together with the speaker ID in per-speaker-ID face image storage means 6;
personal feature extraction means 3 that extracts a personal feature for each face image of each speaker ID stored in the per-speaker-ID face image storage means 6; and
speaker personal feature determination means 4 that determines, from the personal features of the face images, the most suitable personal feature for each speaker ID and stores it in speaker personal feature determination result storage means 7.

In the present invention (Claim 5), the speaker personal feature determination means 4 includes means for clustering the personal features and taking, as the most suitable personal feature, that of the face image appearing most frequently among the face images in the video of the speaker section.

In the present invention (Claim 6), the speaker personal feature determination means 4 includes means for taking, as the most suitable personal feature, the one that appears most frequently among the personal features whose mutual distance is smaller than a threshold.

The present invention (Claim 7) is a face image determination program that causes a computer to function as each of the means of the face image determination apparatus according to any of Claims 4 to 6.

As described above, according to the present invention, personal features are extracted from the detected face images, the speaker is identified from the speaker information and the personal features, and the result is displayed together with the face image. Although the speaker is presented as a face image rather than as a name in text, the user can obtain information about the speaker without having to play back the video first.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 3 shows the configuration of a speaker face image determination apparatus according to one embodiment of the present invention.

The speaker face image determination apparatus shown in the figure is connected to a video storage device 8 in which videos with audio are stored.

The speaker face image determination apparatus comprises an unsupervised speaker recognition unit 1, a face detection unit 2, a personal feature extraction unit 3, a speaker personal feature determination unit 4, a per-speaker-ID face image storage unit 6, and a speaker personal feature determination result storage unit 7.

The unsupervised speaker recognition unit 1 acquires a video with audio from the video storage device 8 and recognizes speakers from the feature quantities of the audio. As the speaker recognition method, for example, the one disclosed in Japanese Patent Application Laid-Open No. 2004-145161 (Speech database registration processing method, speech source recognition method, speech section search method, speech database registration processing apparatus, speech source recognition apparatus, speech section search apparatus, program therefor, and recording medium for the program) can be used. The speaker recognition method is not limited to the one in that publication, and the unsupervised speaker recognition technique itself is therefore not discussed here. The speaker recognition result produced by the unsupervised speaker recognition unit 1 is stored in the speaker recognition result storage unit 5.

The speaker recognition result storage unit 5 stores the speaker recognition result. FIG. 4 shows an example of the speaker recognition result storage unit in one embodiment of the present invention. The speaker recognition result storage unit 5 shown in FIG. 4 holds a video section ID, the start time and end time of the video section, and a speaker ID. The video section ID is a unique number assigned each time the speaker changes. The start and end times of a video section may be expressed in any notation that identifies the time; this embodiment uses seconds. The speaker ID is a label assigned to each speaker. Because the recognition is unsupervised, a personal name cannot be assigned, so the speaker ID is simply a symbol that distinguishes speakers. Integers are used here, but letters of the alphabet or the like would serve equally well.
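The table held by the speaker recognition result storage unit 5 can be pictured with a small sketch. This is an illustration only: the class name `SpeakerSection`, the field names, the helper `sections_of`, and the sample values are all assumptions and do not come from FIG. 4.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpeakerSection:
    """One row of the (hypothetical) speaker recognition result table."""
    section_id: int    # unique number, assigned each time the speaker changes
    start_sec: float   # start time of the video section, in seconds
    end_sec: float     # end time of the video section, in seconds
    speaker_id: int    # anonymous label; no personal name is available


# Made-up recognition result: speaker 1, then speaker 2, then speaker 1 again.
sections = [
    SpeakerSection(0, 0.0, 12.5, 1),
    SpeakerSection(1, 12.5, 20.0, 2),
    SpeakerSection(2, 20.0, 31.0, 1),
]


def sections_of(speaker_id: int, table: List[SpeakerSection]) -> List[SpeakerSection]:
    """Return every section attributed to the given speaker ID."""
    return [s for s in table if s.speaker_id == speaker_id]


speaker_1_sections = sections_of(1, sections)
```

Keeping one row per section, rather than one row per speaker, preserves the order of speaker changes, which the later face detection stage walks through section by section.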

The face detection unit 2 refers to the speaker recognition result storage unit 5 and detects the face images contained in each speaker section (the interval between the start time and end time of the video section identified by a video section ID). Any existing method for detecting faces in still images may be used; the method is not particularly limited. Each detected face image is given the speaker ID of the speaker corresponding to that section ID, and the face images are stored per speaker ID in the per-speaker-ID face image storage unit 6.

The personal feature extraction unit 3 extracts the personal features of each speaker from the face images stored per speaker ID in the per-speaker-ID face image storage unit 6. The present invention does not limit the personal feature extraction method. Candidates include measurements proposed in conventional face recognition methods, such as the width of the face, the distance between the centers of the eyes, and the height from the top of the head to the eyes, as well as eigenfaces. As a simpler alternative, a rough form of person identification can be performed by taking the color of the clothing region below the face.

The speaker personal feature determination unit 4 clusters the set of face images corresponding to a speaker ID, using the personal features as the feature quantities. It then finds the personal feature closest to the centroid of the personal features in the largest cluster and stores it in the speaker personal feature determination result storage unit 7 as the personal feature for that speaker ID.

Next, the sequence of operations in the above configuration will be described.

FIG. 5 is a flowchart of the operation of the speaker face image determination apparatus according to one embodiment of the present invention.

Step 100) The unsupervised speaker recognition unit 1 recognizes speakers by acoustic analysis of the input video with audio and stores the speaker recognition result in the speaker recognition result storage unit 5. Clustering methods include, for example, those described in Sadaaki Miyamoto, "Introduction to Cluster Analysis" (Morikita Publishing). Applying such a method to the feature quantities used in speaker recognition (see Japanese Patent Application Laid-Open No. 2004-145161 mentioned above) enables unsupervised speaker recognition. The technique is not limited to this one.

Step 200) The face detection unit 2 obtains face images based on the information in the speaker recognition result storage unit 5 and records them in the per-speaker-ID face image storage unit 6. The details are described below.

FIG. 6 is a flowchart of the face image detection process according to one embodiment of the present invention.

Step 501) The section ID i is initialized to i = 0.

Step 502) The start time and end time of section i are obtained from the speaker recognition result storage unit 5, the video for that time range is obtained from the video storage device 8, and cut-point detection is performed on that video.

Step 503) It is determined whether there is a cut point. If there is, the process proceeds to step 504; otherwise it proceeds to step 509. The frames examined need not be cut points; frames taken at a preset fixed time interval, for example, may be used instead.

Step 504) Face detection is performed at the cut point. Many existing techniques are available for detecting faces in still images, and any technique that can locate the position of a face may be used here. For example, Intel's Open Source Computer Vision Library (OpenCV) includes face detection.

Step 505) It is determined whether a face was detected. If so, the process proceeds to step 506; if not, it proceeds to step 508.

Step 506) The speaker recognition result storage unit 5 is searched using section ID i to obtain the speaker ID of the speaker, which is denoted S.

Step 507) The detected face image is recorded in the per-speaker-ID face image storage unit 6 as a face image of speaker ID S.

Step 508) It is determined whether face detection has been performed for all cut points detected in step 502. If not, the process returns to step 504; if all cut points have been processed, it proceeds to step 509.

Step 509) It is determined whether steps 502 through 509 have been performed for all sections. If not, the process proceeds to step 510; otherwise the process ends.

Step 510) The section ID i is incremented (i = i + 1), and the process returns to step 502.
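The loop in steps 501 to 510 can be sketched as two nested iterations, one over sections and one over the frames sampled within each section. The sketch below is only an illustration under stated assumptions: `collect_faces`, `fixed_interval`, and `fake_detector` are hypothetical names, frame sampling uses the fixed-interval alternative mentioned in step 503 rather than real cut-point detection, and the detector is a stand-in that returns strings instead of images.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# (section_id, start_sec, end_sec, speaker_id): assumed layout of one table row
Section = Tuple[int, float, float, int]


def collect_faces(
    sections: Sequence[Section],
    find_sample_times: Callable[[float, float], List[float]],
    detect_faces: Callable[[float], List[str]],
) -> Dict[int, List[str]]:
    """Gather the faces seen in each speaker's sections (one-to-many mapping)."""
    faces_by_speaker: Dict[int, List[str]] = defaultdict(list)
    for _section_id, start, end, speaker_id in sections:   # steps 501, 509, 510
        for t in find_sample_times(start, end):            # steps 502, 503, 508
            for face in detect_faces(t):                   # steps 504, 505
                faces_by_speaker[speaker_id].append(face)  # steps 506, 507
    return dict(faces_by_speaker)


def fixed_interval(start: float, end: float, step: float = 5.0) -> List[float]:
    """Stand-in for cut-point detection: sample every `step` seconds."""
    times, t = [], start
    while t < end:
        times.append(t)
        t += step
    return times


def fake_detector(t: float) -> List[str]:
    """Stand-in detector: pretends one face is visible before t = 20 s."""
    return [f"face@{t:g}"] if t < 20 else []


result = collect_faces(
    [(0, 0.0, 10.0, 1), (1, 10.0, 25.0, 2)], fixed_interval, fake_detector
)
```

Passing the sampler and the detector in as functions keeps the loop independent of any particular cut-point or face detection technique, in line with the text's statement that neither is limited.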

FIG. 7 shows an example of the contents of the per-speaker-ID face image storage unit 6 produced by steps 501 through 510. As shown in the figure, each speaker ID is paired with the face images obtained in its video sections. In the per-speaker-ID face image storage unit 6, a single speaker ID is associated with multiple face images, that is, the relationship is one-to-many. Any storage format may be used as long as it can represent this relationship.

Step 300) The personal feature extraction unit 3 extracts the personal features from the face images obtained for each speaker ID in step 200.

The personal feature extraction method is not limited here. In the following, applying the feature extraction process P to a detected image X is assumed to yield F = P(X), where F is a feature vector whose dimensionality equals the number of features obtained by the extraction.

Step 400) The speaker personal feature determination unit 4 determines the personal feature of each speaker by clustering the features extracted by the personal feature extraction unit 3.

In step 300, personal features were extracted from the face images obtained for each speaker ID. However, the face images for a speaker ID include not only the person who actually corresponds to that speaker ID but also various other people, because the person who is speaking is not necessarily the one shown on screen. It is therefore necessary to determine, from the personal features P(X_k) of the face images X_k obtained for a speaker ID k, the personal feature that truly belongs to speaker k.

One way to do this is clustering. This method rests on the premise that the face appearing most frequently on screen while a given speaker is talking is that speaker's face.

An example of personal feature determination by clustering is shown below.

FIG. 8 is a flowchart of the speaker feature determination process by clustering according to one embodiment of the present invention.

Step 701) The speaker ID q is initialized to q = 0.

Step 702) The set of face images X belonging to speaker ID q is clustered using the personal features P(X) as the feature quantities.

Step 703) From the clustering result, the personal feature closest to the centroid of the personal features P(X') in the largest cluster is found and denoted P(X_m).

Step 704) The pair of speaker ID q and personal feature P(X_m) is stored in the speaker personal feature determination result storage unit 7 as the personal feature of speaker ID q.

Step 705) It is determined whether all speaker IDs have been processed. If so, the process ends; if not, it proceeds to step 706.

Step 706) q is incremented (q = q + 1), and the process returns to step 702.
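Steps 702 and 703 can be illustrated for a single speaker ID as follows. The text does not name a clustering algorithm, so this sketch assumes one: single-linkage grouping with a distance threshold, implemented with a small union-find. The function names, the threshold, and the sample feature vectors are all hypothetical.

```python
import math
from collections import defaultdict
from typing import List, Sequence

Vector = Sequence[float]


def cluster(features: List[Vector], threshold: float) -> List[List[int]]:
    """Group feature indices whose Euclidean distance is within `threshold`
    (single linkage; an assumed choice, not one prescribed by the text)."""
    parent = list(range(len(features)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if math.dist(features[i], features[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(features)):
        groups[find(i)].append(i)
    return list(groups.values())


def representative_index(features: List[Vector], threshold: float) -> int:
    """Index m of the feature closest to the centroid of the largest cluster."""
    largest = max(cluster(features, threshold), key=len)          # step 703
    dim = len(features[0])
    centroid = [sum(features[i][d] for i in largest) / len(largest)
                for d in range(dim)]
    return min(largest, key=lambda i: math.dist(features[i], centroid))


# Three similar faces (presumably the speaker) and one outlier (someone else).
features = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1), (5.0, 5.0)]
m = representative_index(features, threshold=0.5)
```

Returning an index rather than the feature vector itself makes it easy to recover the face image X_m that produced P(X_m), which is what a representative face image requires.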

Another way to carry out the speaker personal feature determination process is to use the most frequent personal feature. An example follows.

FIG. 9 is a flowchart of the personal feature determination process using the most frequent personal feature according to one embodiment of the present invention.

Step 801) The speaker ID q is initialized to q = 0.

Step 802) k is initialized to k = 0.

Step 803) n is initialized to n = 0 and the counter C_k to C_k = 0.

Step 804) The distance |P(X_k) - P(X_n)| between P(X_k), the personal feature of face image X_k, and P(X_n), the personal feature of face image X_n, is computed. If it is less than or equal to the threshold th, the process proceeds to step 805; otherwise it proceeds to step 806.

Step 805) Because the distance obtained in step 804 is within the threshold, the frequency counter C_k is incremented (C_k = C_k + 1).

Step 806) If n < N, n is incremented (n = n + 1) and the process returns to step 804; otherwise it proceeds to step 807.

Step 807) If k < N, k is incremented (k = k + 1) and the process returns to step 803; otherwise it proceeds to step 808.

Step 808) The value of k that maximizes the frequency counter C_k (k = 0, ..., N-1) is found and assigned to m.

Step 809) The pair of speaker ID q and personal feature P(X_m) is stored in the speaker personal feature determination result storage unit 7 as the personal feature of speaker ID q.

Step 810) It is determined whether all speaker IDs have been processed. If so, the process ends; if not, it proceeds to step 811.

Step 811) q is incremented (q = q + 1), and the process returns to step 802.
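For one speaker ID, the counting in steps 802 to 808 amounts to finding the face image with the most neighbors within the threshold. The following is a minimal sketch under assumptions: N is taken to be the number of face images for the speaker ID, both loop indices run over 0 to N-1, the distance is Euclidean, and the function name and sample values are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def most_frequent_index(features: List[Vector], th: float) -> int:
    """Return m, the index k whose counter C_k is largest (steps 802 to 808)."""
    counts = []
    for f_k in features:                       # outer loop over k (802, 807)
        c_k = 0                                # counter reset (803)
        for f_n in features:                   # inner loop over n (806)
            if math.dist(f_k, f_n) <= th:      # distance test (804)
                c_k += 1                       # increment (805)
        counts.append(c_k)
    return max(range(len(features)), key=counts.__getitem__)  # step 808


# One-dimensional toy features: three close values and one outlier.
features = [(0.0,), (0.1,), (0.2,), (3.0,)]
m = most_frequent_index(features, th=0.15)
```

Each feature is also compared with itself, as in the flowchart, so every counter is at least 1. The cost is quadratic in the number of face images, which is usually acceptable when faces are sampled only at cut points.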

FIG. 10 shows an example of the speaker personal feature determination result in one embodiment of the present invention.

Here, pairs of a speaker ID and a personal feature are stored. In this example, the face image from which that personal feature was obtained is also held as a representative face image.

An example of the present invention will now be described with reference to the drawings.

The example uses a quiz program in which questions are posed and the contestants compete for territory on a panel of 25 squares. In many quiz programs the contestants are members of the viewing public and change with every episode, so such programs are an example of content to which supervised speaker recognition is difficult to apply (and offers no benefit).

FIG. 11 shows an example of an unsupervised speaker recognition result in one example of the present invention.

For the video of each speaker ID, the face detection unit 2 performs cut-point detection and face detection, and stores the face images for each speaker ID in the per-speaker-ID face image storage unit 6. Personal features are then extracted from those images.

In the quiz program used here, as shown in FIG. 12, a color specific to each contestant is placed around that contestant's seat, so the color information around the detected face is used as the personal feature. The personal feature extraction unit 3 takes the image in FIG. 10 as X_0 and obtains the pixels of the shaded portion by P(X_0). When information is available about how people appear on screen, whether in a quiz program or in any other program, it is appropriate to use a heuristic personal information extraction rule tailored to that information. On the other hand, in a quiz program where everyone wears clothing of the same color, clothing color cannot be used as a personal feature.
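A color-based personal feature of this kind might be computed roughly as below. The sketch makes several assumptions that the text does not specify: the "shaded portion" is approximated by a band of fixed width around the face bounding box, the feature is the mean RGB value of that band, and the image is a plain nested list of RGB tuples. The function name and the toy image are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def surrounding_color(image: List[List[Pixel]], box: Box, margin: int) -> Tuple[float, float, float]:
    """Mean RGB of the pixels within `margin` of the face box, excluding the face."""
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    total = [0, 0, 0]
    count = 0
    for y in range(max(0, top - margin), min(height, bottom + margin)):
        for x in range(max(0, left - margin), min(width, right + margin)):
            if left <= x < right and top <= y < bottom:
                continue  # skip the face region itself
            for c in range(3):
                total[c] += image[y][x][c]
            count += 1
    if count == 0:
        raise ValueError("no pixels around the face box")
    return (total[0] / count, total[1] / count, total[2] / count)


# Toy 4x4 frame: a skin-toned 2x2 "face" surrounded by a red seat color.
RED, SKIN = (255, 0, 0), (200, 180, 160)
frame = [[SKIN if 1 <= x < 3 and 1 <= y < 3 else RED for x in range(4)]
         for y in range(4)]
feature = surrounding_color(frame, box=(1, 1, 3, 3), margin=1)
```

A mean color is about the simplest statistic available; a color histogram would be more robust to mixed backgrounds, at the cost of a higher-dimensional feature vector.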

For the image set X belonging to each speaker ID q, the speaker personal feature determination unit 4 clusters the personal features P(X_i) described above and finds the P(X_m) closest to the centroid of the largest cluster. This yields P(X_m) as the personal feature corresponding to speaker ID q. Using the correspondence q ⇔ X_m, it becomes easy, for example, to view only the utterance sections of a speaker by clicking that speaker's face image.

FIG. 13 shows an example in which the result is used to display indexing results in one example of the present invention. In the figure, the upper part of the screen is a video display area and the lower part is a detected-speaker display area. Each face image in the lower part is associated with the sections shown in FIG. 11 in which the speaker with that face spoke. Clicking an image plays back the associated video sections.
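The association between a displayed face image and its playable sections can be sketched as a simple lookup table. Everything named here is an assumption made for illustration: the function `build_index`, the file names standing in for representative face images, and the section times.

```python
from typing import Dict, List, Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def build_index(
    sections: Sequence[Tuple[float, float, int]],  # (start_sec, end_sec, speaker_id)
    representatives: Dict[int, str],               # speaker_id -> representative face
) -> Dict[str, List[Span]]:
    """Map each representative face image to the sections its speaker spoke in."""
    index: Dict[str, List[Span]] = {}
    for start, end, speaker_id in sections:
        face = representatives.get(speaker_id)
        if face is not None:  # speakers with no detected face are skipped
            index.setdefault(face, []).append((start, end))
    return index


sections = [(0.0, 12.5, 1), (12.5, 20.0, 2), (20.0, 31.0, 1)]
representatives = {1: "face_a.png", 2: "face_b.png"}
index = build_index(sections, representatives)

# A click handler would look up the clicked face and play these spans in order.
spans_to_play = index["face_a.png"]
```

Speakers for whom no face was ever detected have no entry in `representatives` and therefore do not appear in the index, so the display lists only speakers that can be shown.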

In this way, the present invention makes it possible to list the speakers' face images and, because each face image is associated with video sections, to play back those sections directly.

In the present invention, each function of the apparatus shown in FIG. 3 can be implemented as a program, which can be installed and executed on a computer used as the speaker face image determination apparatus, or distributed via a network.

The program can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and then installed on a computer used as the speaker face image determination apparatus, or distributed.

The present invention is not limited to the embodiment and example described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to video content processing technology, in particular to multimodal processing technology that combines speaker recognition and image processing.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a principle configuration diagram of the present invention.
FIG. 3 is a configuration diagram of the speaker face image determination apparatus in one embodiment of the present invention.
FIG. 4 is an example of the speaker recognition result storage unit in one embodiment of the present invention.
FIG. 5 is a flowchart of the operation of the speaker face image determination apparatus in one embodiment of the present invention.
FIG. 6 is a flowchart of the face image detection process in one embodiment of the present invention.
FIG. 7 is an example of the per-speaker-ID face image storage unit in one embodiment of the present invention.
FIG. 8 is a flowchart of the personal feature determination process by clustering in one embodiment of the present invention.
FIG. 9 is a flowchart of the personal feature determination process using the most frequent personal feature in one embodiment of the present invention.
FIG. 10 is an example of the speaker personal feature determination result in one embodiment of the present invention.
FIG. 11 is an example of an unsupervised speaker recognition result in one example of the present invention.
FIG. 12 is an example of a personal feature in one example of the present invention.
FIG. 13 is an example of use as a display of indexing results in one example of the present invention.

Explanation of symbols

1 Acoustic analysis means, unsupervised speaker recognition unit
2 Face detection means, face detection unit
3 Personal feature extraction means, personal feature extraction unit
4 Speaker personal feature determination means, speaker personal feature determination unit
5 Speaker recognition result storage means, speaker recognition result storage unit
6 Face image storage means for each speaker ID, face image storage unit for each speaker ID
7 Speaker personal feature determination result storage means, speaker personal feature determination result storage unit

Claims

A face image determination method for determining the face image of a speaker appearing in an input video that includes audio, comprising:
an acoustic analysis step in which acoustic analysis means recognizes the audio of the input video by an unsupervised speaker recognition technique, assigns a speaker ID to each speaker section, consisting of start and end times in the video, and to each speaker corresponding to that speaker section, and stores the result in speaker recognition result storage means as a speaker recognition result;
a face detection step in which face detection means detects the face images included in the speaker sections corresponding to each speaker ID, by a method of detecting face positions in the input video, and stores each face image together with its speaker ID in face image storage means for each speaker ID;
a personal feature extraction step in which personal feature extraction means extracts a personal feature for each face image of each speaker ID stored in the face image storage means for each speaker ID; and
a speaker personal feature determination step in which speaker personal feature determination means determines, from the personal features of the face images, the personal feature most suitable for each speaker ID and stores it in speaker personal feature determination result storage means.
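
As an illustration only, the hand-off from the acoustic analysis step to the face detection step can be sketched as follows. This is a minimal sketch under stated assumptions: the names `SpeakerSection`, `FaceDetection`, and `group_faces_by_speaker` are hypothetical, times are assumed to be seconds, sections are treated as half-open intervals, and face images are stood in for by string labels. The source does not specify any of these details.

```python
from collections import defaultdict
from typing import NamedTuple


class SpeakerSection(NamedTuple):
    """One row of a (hypothetical) speaker recognition result."""
    start: float      # section start time, assumed to be in seconds
    end: float        # section end time, assumed to be in seconds
    speaker_id: str   # ID assigned by unsupervised speaker recognition


class FaceDetection(NamedTuple):
    """One detected face; `face` is a placeholder for the cropped image."""
    time: float
    face: str


def group_faces_by_speaker(sections, detections):
    """Collect, per speaker ID, the faces detected inside that speaker's sections."""
    faces_by_speaker = defaultdict(list)
    for det in detections:
        for sec in sections:
            # Assumption: sections are half-open intervals [start, end).
            if sec.start <= det.time < sec.end:
                faces_by_speaker[sec.speaker_id].append(det.face)
    return dict(faces_by_speaker)


sections = [
    SpeakerSection(0.0, 5.0, "A"),
    SpeakerSection(5.0, 9.0, "B"),
    SpeakerSection(9.0, 12.0, "A"),
]
detections = [
    FaceDetection(1.0, "f1"),
    FaceDetection(6.0, "f2"),
    FaceDetection(10.0, "f3"),
    FaceDetection(20.0, "f4"),  # outside every section, so it is dropped
]
grouped = group_faces_by_speaker(sections, detections)
```

Faces that fall in the same speaker's sections end up in one list even when the sections are not contiguous, which is what allows a later step to pick one representative feature per speaker ID.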

The face image determination method according to claim 1, wherein, in the speaker personal feature determination step, the personal features are clustered and the personal feature that occurs most frequently among the face images of the video in the speaker section is taken as the most suitable personal feature.
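
The clustering-based determination can be illustrated with a short sketch. The source does not name a clustering algorithm or a feature representation, so the following assumes a simple greedy "leader" clustering over numeric feature vectors with Euclidean distance, and it returns the member of the largest cluster closest to that cluster's centroid. The names `leader_cluster`, `most_frequent_cluster_feature`, and `radius` are hypothetical.

```python
import math


def leader_cluster(features, radius):
    """Greedy clustering (an assumed stand-in for the unspecified method).

    Each feature joins the first cluster whose leader (first member) lies
    within `radius`; otherwise it starts a new cluster.
    """
    clusters = []
    for f in features:
        for c in clusters:
            if math.dist(f, c[0]) <= radius:
                c.append(f)
                break
        else:
            clusters.append([f])
    return clusters


def most_frequent_cluster_feature(features, radius):
    """Return a representative feature of the largest cluster."""
    clusters = leader_cluster(features, radius)
    largest = max(clusters, key=len)  # raises ValueError if features is empty
    dim = len(largest[0])
    centroid = tuple(sum(f[i] for f in largest) / len(largest) for i in range(dim))
    # Representative: the actual member closest to the centroid.
    return min(largest, key=lambda f: math.dist(f, centroid))


# Three similar features (the face seen most often) and one outlier.
features = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.0), (5.0, 5.0)]
representative = most_frequent_cluster_feature(features, radius=1.0)
```

Returning an actual member rather than the centroid keeps the result tied to a real face image, which matters if the chosen feature is later used to display a face for the speaker.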

The face image determination method according to claim 1, wherein, in the speaker personal feature determination step, among the personal features whose mutual distance is smaller than a threshold, the personal feature with the highest appearance frequency is taken as the most suitable personal feature.
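
One possible reading of this threshold-based selection is sketched below. It assumes that the "appearance frequency" of a personal feature is the number of other features lying closer to it than the threshold, and that features are numeric vectors compared by Euclidean distance. Both are assumptions, and the name `most_frequent_by_threshold` is hypothetical.

```python
import math


def most_frequent_by_threshold(features, threshold):
    """Return the feature with the most neighbours closer than `threshold`.

    Assumption: a feature's appearance frequency is counted as the number
    of other features within `threshold` of it (strictly smaller distance).
    """
    def neighbour_count(i):
        return sum(
            1
            for j, g in enumerate(features)
            if j != i and math.dist(features[i], g) < threshold
        )

    best = max(range(len(features)), key=neighbour_count)
    return features[best]


# One-dimensional toy features: (0.4,) has two neighbours within 0.5.
features = [(0.0,), (0.4,), (0.8,), (3.0,)]
selected = most_frequent_by_threshold(features, threshold=0.5)
```

Unlike a clustering approach, this variant needs no cluster assignment, at the cost of comparing every pair of features.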

A face image determination device for determining the face image of a speaker appearing in an input video that includes audio, comprising:
acoustic analysis means for recognizing the audio of the input video by an unsupervised speaker recognition technique, assigning a speaker ID to each speaker section, consisting of start and end times in the video, and to each speaker corresponding to that speaker section, and storing the result in speaker recognition result storage means as a speaker recognition result;
face detection means for detecting the face images included in the speaker sections corresponding to each speaker ID, by a method of detecting face positions in the input video, and storing each face image together with its speaker ID in face image storage means for each speaker ID;
personal feature extraction means for extracting a personal feature for each face image of each speaker ID stored in the face image storage means for each speaker ID; and
speaker personal feature determination means for determining, from the personal features of the face images, the personal feature most suitable for each speaker ID and storing it in speaker personal feature determination result storage means.

The face image determination device according to claim 4, wherein the speaker personal feature determination means includes means for clustering the personal features and taking the personal feature that occurs most frequently among the face images of the video in the speaker section as the most suitable personal feature.

The face image determination device according to claim 4, wherein the speaker personal feature determination means includes means for taking, among the personal features whose mutual distance is smaller than a threshold, the personal feature with the highest appearance frequency as the most suitable personal feature.

A face image determination program that causes a computer to execute each means of the face image determination device according to claim 4.