JP4595364B2 - Information processing apparatus and method, program, and recording medium

Info

Publication number: JP4595364B2
Application number: JP2004084814A
Authority: JP
Inventors: 智彦後藤; 克浩竹松; 環児嶋
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2004-03-23
Filing date: 2004-03-23
Publication date: 2010-12-08
Anticipated expiration: 2024-03-23
Also published as: JP2005274707A

Description

The present invention relates to an information processing apparatus and method, a program, and a recording medium, and in particular to an information processing apparatus and method, a program, and a recording medium that accurately extract the voice data of each speaker based on speaker orientations recognized from a captured image, so that, for example, speakers can be identified accurately.

In recent years, electronic conference systems have appeared that automatically record, as minutes, the utterances of multiple speakers participating in a conference or the like. When utterances recorded by such a system are played back, it is desirable not only that the reproduced sound contain little noise, but also that the listener be able to identify the speaker (the person who made each utterance).

To allow each speaker to be identified at playback, it has been proposed, for example, to install a microphone for each speaker in advance and register the microphone assigned to each speaker in the electronic conference system, so that the speaker is identified from the microphone to which the voice was input.

It has also been proposed, for example, to fix the directivity of a microphone toward each speaker and register each speaker's orientation in the electronic conference system, so that the speaker is identified from the direction of the sound source.

Patent Document 1 discloses an imaging system that estimates the sound source direction from the voice data of an utterance input to a microphone array, points a camera and a highly directional sound-collecting microphone in that direction, and acquires both the image and the voice of the speaker who is talking.

Patent Document 2 discloses a robot that recognizes and tracks a speaker.

Patent Document 1: JP-A-6-351015

Patent Document 2: JP 2002-366191 A

With conventional electronic conference systems such as those described above, identifying a speaker from an utterance input to a microphone presupposes that the correspondence between each microphone, or its directivity, and each speaker is known. When that correspondence is unknown, speaker identification cannot be performed.

That is, the user must register in advance, in the electronic conference system, the relationship between microphones and speakers or between microphone directivity and speakers.

The technique described in Patent Document 1 requires two kinds of microphones: a microphone array that collects the sound from which the speaker's direction is estimated, and a highly directional sound-collecting microphone that collects each speaker's voice after the direction has been estimated.

Furthermore, the technique described in Patent Document 2 aims to track a specific speaker by controlling a microphone and a camera, and does not consider identifying each of a plurality of speakers.

There is also a method that performs speaker identification using a standard pattern (model) of each speaker's voice registered in advance and voice features extracted from the voice data of utterances in a conference or the like. With this method, however, speaker identification accuracy can degrade because the environment in which the voice used to create the registered standard pattern is obtained differs from the environment in which the voice for feature extraction is obtained during the conference.

The present invention has been made in view of these circumstances. It accurately extracts the voice data of each speaker based on speaker orientations recognized from a captured image, so that, for example, speakers can be identified accurately.

The information processing apparatus of the present invention comprises: imaging means for imaging one or more speakers and outputting an image; extraction range calculation means for calculating, based on the positional relationship of the speakers in the image output by the imaging means, an extraction range that is the range of directions from which the voice data of each speaker is extracted; voice data extraction means for extracting, from the voice data of utterances by the speakers, the voice data within the extraction range calculated by the extraction range calculation means; storage means for storing, for each speaker, a standard pattern for speaker identification based on the voice data extracted by the voice data extraction means; speaker identification means for performing speaker identification using voice data newly extracted by the voice data extraction means and the standard patterns stored in the storage means; sound source direction estimation means for estimating a sound source direction based on the voice data newly extracted by the voice data extraction means; and determination means for determining whether the speaker present in the sound source direction estimated by the sound source direction estimation means matches the speaker identified by the speaker identification means.

The information processing apparatus of the present invention may further include face detection means for detecting the speakers' faces from the image output by the imaging means, and the extraction range calculation means may calculate the extraction range based on the positional relationship of the faces detected by the face detection means.

The information processing method of the present invention includes: an imaging step of imaging one or more speakers and outputting an image; an extraction range calculation step of calculating, based on the positional relationship of the speakers in the image output by the imaging step, an extraction range that is the range of directions from which the voice data of each speaker is extracted; a voice data extraction step of extracting, from the voice data of utterances by the speakers, the voice data within the extraction range calculated by the extraction range calculation step; a storage step of storing, for each speaker, a standard pattern for speaker identification based on the voice data extracted by the voice data extraction step; a speaker identification step of performing speaker identification using voice data newly extracted by the voice data extraction step and the standard patterns stored by the storage step; a sound source direction estimation step of estimating a sound source direction based on the voice data newly extracted by the voice data extraction step; and a determination step of determining whether the speaker present in the sound source direction estimated by the sound source direction estimation step matches the speaker identified by the speaker identification step.

The program of the present invention includes: an imaging step of imaging one or more speakers and outputting an image; an extraction range calculation step of calculating, based on the positional relationship of the speakers in the image output by the imaging step, an extraction range that is the range of directions from which the voice data of each speaker is extracted; a voice data extraction step of extracting, from the voice data of utterances by the speakers, the voice data within the extraction range calculated by the extraction range calculation step; a storage control step of controlling the storage, for each speaker, of a standard pattern for speaker identification based on the voice data extracted by the voice data extraction step; a speaker identification step of performing speaker identification using voice data newly extracted by the voice data extraction step and the standard patterns stored under the storage control step; a sound source direction estimation step of estimating a sound source direction based on the voice data newly extracted by the voice data extraction step; and a determination step of determining whether the speaker present in the sound source direction estimated by the sound source direction estimation step matches the speaker identified by the speaker identification step.

The program recorded on the recording medium of the present invention includes: an imaging step of imaging one or more speakers and outputting an image; an extraction range calculation step of calculating, based on the positional relationship of the speakers in the image output by the imaging step, an extraction range that is the range of directions from which the voice data of each speaker is extracted; a voice data extraction step of extracting, from the voice data of utterances by the speakers, the voice data within the extraction range calculated by the extraction range calculation step; a storage control step of controlling the storage, for each speaker, of a standard pattern for speaker identification based on the voice data extracted by the voice data extraction step; a speaker identification step of performing speaker identification using voice data newly extracted by the voice data extraction step and the standard patterns stored under the storage control step; a sound source direction estimation step of estimating a sound source direction based on the voice data newly extracted by the voice data extraction step; and a determination step of determining whether the speaker present in the sound source direction estimated by the sound source direction estimation step matches the speaker identified by the speaker identification step.

In the information processing apparatus and method, the program, and the recording medium of the present invention, one or more speakers are imaged and an image is output. Based on the positional relationship of the speakers in that image, an extraction range for extracting the voice data of each speaker is calculated. From the voice data of utterances by the speakers, the voice data within the extraction range is extracted, and a standard pattern for speaker identification based on the extracted voice data is stored for each speaker. Speaker identification is then performed using newly extracted voice data and the stored standard patterns, a sound source direction is estimated based on the newly extracted voice data, and it is determined whether the speaker present in the estimated sound source direction matches the speaker identified by speaker identification.
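
The final determination summarized above can be illustrated with a minimal sketch. It assumes that each speaker is represented by an orientation and an extraction range width in degrees, and that the speaker "present in the sound source direction" is the one whose extraction range contains the estimated direction. The names `Speaker`, `speaker_at`, and `identification_is_consistent` are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Speaker:
    speaker_id: int
    azimuth_deg: float      # speaker orientation recognized from the image
    range_width_deg: float  # extraction range, centred on azimuth_deg (assumed)


def angular_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two azimuths, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def speaker_at(speakers: List[Speaker], source_azimuth_deg: float) -> Optional[int]:
    """ID of the first speaker whose extraction range contains the direction."""
    for s in speakers:
        if angular_diff(s.azimuth_deg, source_azimuth_deg) <= s.range_width_deg / 2.0:
            return s.speaker_id
    return None


def identification_is_consistent(
    speakers: List[Speaker], source_azimuth_deg: float, identified_id: int
) -> bool:
    """True when the speaker located by direction equals the identified speaker."""
    located = speaker_at(speakers, source_azimuth_deg)
    return located is not None and located == identified_id
```

For example, with three speakers at 0, 120, and 240 degrees and 60-degree ranges, a source estimated at 125 degrees is consistent with an identification result of speaker 2 and inconsistent with speaker 3.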

According to the present invention, the voice data of each speaker can be extracted accurately based on speaker orientations recognized from a captured image.

Also according to the present invention, speakers can be identified accurately.

The best mode of the present invention is described below. First, the correspondence between the constituent features recited in the claims and the specific examples in the embodiments is illustrated as follows. This description is intended to confirm that specific examples supporting the claimed invention are described in the embodiments. Accordingly, even if a specific example is described in the embodiments but is not listed here as corresponding to a constituent feature, that does not mean the specific example does not correspond to that constituent feature. Conversely, even if a specific example is listed here as corresponding to a constituent feature, that does not mean the specific example does not correspond to other constituent features.

Furthermore, this description does not mean that every invention corresponding to the specific examples described in the embodiments is recited in the claims. In other words, it does not deny the existence of inventions that correspond to specific examples described in the embodiments but are not recited in the claims of this application, that is, inventions that may be filed as divisional applications or added by amendment in the future.

The information processing apparatus according to claim 1 comprises: imaging means (for example, the camera module 2 in FIG. 2) for imaging one or more speakers and outputting an image; extraction range calculation means (for example, the beam width calculation unit 122 in FIG. 4) for calculating, based on the positional relationship of the speakers in the image output by the imaging means, an extraction range that is the range of directions from which the voice data of each speaker is extracted; voice data extraction means (for example, the beamforming processing unit 143 in FIG. 4) for extracting, from the voice data of utterances by the speakers, the voice data within the extraction range calculated by the extraction range calculation means; storage means (for example, the speaker information DB 132 in FIG. 4) for storing, for each speaker, a standard pattern for speaker identification based on the voice data extracted by the voice data extraction means; speaker identification means (for example, the voice identification processing unit 151 in FIG. 4) for performing speaker identification using voice data newly extracted by the voice data extraction means and the standard patterns stored in the storage means; sound source direction estimation means (for example, the sound source localization detection unit 142 in FIG. 4) for estimating a sound source direction based on the voice data newly extracted by the voice data extraction means; and determination means (for example, the identification information adding unit 152 in FIG. 4) for determining whether the speaker present in the sound source direction estimated by the sound source direction estimation means matches the speaker identified by the speaker identification means.

The information processing apparatus according to claim 2 may further include face detection means (for example, the face detection unit 121 in FIG. 4) for detecting the speakers' faces from the image output by the imaging means, and the extraction range calculation means may calculate the extraction range based on the positional relationship of the faces detected by the face detection means.

The information processing method according to claim 3 includes: an imaging step (for example, the process of step S1 in FIG. 6) of imaging one or more speakers and outputting an image; an extraction range calculation step (for example, the process of step S4 in FIG. 6) of calculating, based on the positional relationship of the speakers in the image output by the imaging step, an extraction range that is the range of directions from which the voice data of each speaker is extracted; a voice data extraction step (for example, the process of step S45 in FIG. 7 or step S74 in FIG. 8) of extracting, from the voice data of utterances by the speakers, the voice data within the extraction range calculated by the extraction range calculation step; a storage step (for example, the process of step S47 in FIG. 7) of storing, for each speaker, a standard pattern for speaker identification based on the voice data extracted by the voice data extraction step; a speaker identification step (for example, the process of step S76 in FIG. 8) of performing speaker identification using voice data newly extracted by the voice data extraction step and the standard patterns stored by the storage step; a sound source direction estimation step (for example, the process of step S72 in FIG. 8) of estimating a sound source direction based on the voice data newly extracted by the voice data extraction step; and a determination step (for example, the process of step S77 in FIG. 8) of determining whether the speaker present in the sound source direction estimated by the sound source direction estimation step matches the speaker identified by the speaker identification step.

For the program according to claim 4 and the program recorded on the recording medium according to claim 5, the embodiment (one example only) to which each step corresponds is the same as for the information processing method according to claim 3.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 shows an embodiment of an example of use of a tabletop speaker identification device 1 to which the present invention is applied.

FIG. 1 shows three speakers seated around a table in a conference room, living room, or the like, holding a meeting. The number of speakers is not limited to three and may be any number of one or more.

The tabletop speaker identification device 1, placed at approximately the center of the table, identifies speakers and records each speaker in association with the content of his or her utterances as minutes. FIG. 2 shows an enlarged view of the external appearance of the tabletop speaker identification device 1.

As shown in FIG. 2, the tabletop speaker identification device 1 has a cylindrical shape and comprises a camera module 2, a microphone array 3 provided on the upper surface of the camera module 2, an information processing unit 4 disposed below the camera module 2, and a display unit 5 provided on the surface of the information processing unit 4.

The camera module 2 can image the full 360-degree surroundings; for example, it is a module in which a single camera 2A is set directly below a hyperboloidal mirror (the hemisphere at the top of the camera module 2). The omnidirectional image data captured by the camera module 2 is supplied to the information processing unit 4.

Alternatively, the camera module 2 may, for example, mechanically rotate the imaging direction of a camera to image the full surroundings, or may consist of a plurality of cameras each imaging a different direction.

The microphone array 3 consists of, for example, four microphones 3-1, 3-2, 3-3, and 3-4, such as condenser microphones. The sound collected by the microphones 3-1 to 3-4 is supplied to the information processing unit 4 through a cable or the like (not shown). The microphone array 3 may of course consist of a number of microphones other than four.

Based on the omnidirectional image data supplied from the camera module 2, the information processing unit 4 detects the orientations of the speakers surrounding the tabletop speaker identification device 1 and, based on the positional relationship of the speakers, calculates for each speaker a beamforming width, which is the width of the range of directions over which the voice data is beamformed.
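
As a rough illustration of deriving a per-speaker width from the positional relationship of the speakers, the sketch below uses an assumed rule that is not taken from the specification: each speaker's width equals the angular gap to the nearest other speaker, capped at a maximum, so that a range centred on the speaker stops halfway to that neighbor. The function name `beam_widths` and the parameter `max_width_deg` are hypothetical.

```python
from typing import List


def beam_widths(azimuths_deg: List[float], max_width_deg: float = 90.0) -> List[float]:
    """Return one beamforming width (degrees) per speaker, in input order.

    Assumed rule for illustration only: the width is the angular distance to
    the nearest other speaker, limited to max_width_deg.
    """
    widths = []
    for i, a in enumerate(azimuths_deg):
        # Angular distance (0..180 degrees) from this speaker to every other one.
        distances = []
        for j, b in enumerate(azimuths_deg):
            if j == i:
                continue
            gap = (b - a) % 360.0
            distances.append(min(gap, 360.0 - gap))
        # A lone speaker has no neighbor, so only the cap applies.
        nearest = min(distances, default=360.0)
        widths.append(min(nearest, max_width_deg))
    return widths
```

With speakers at 0, 40, and 180 degrees, the two close speakers each receive a 40-degree width, while the isolated speaker receives the 90-degree cap.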

Using the calculated beamforming widths, the information processing unit 4 also extracts each speaker's utterances from the sound input to the microphone array 3, performs speaker identification, and records the identification result in association with the content of the utterance.

As described in detail later, beamforming extracts each speaker's utterances accurately (with little noise). Compared with collecting utterances without beamforming, this yields standard patterns better suited to identifying speakers, and speaker identification performed after the standard patterns have been obtained is also more accurate.
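
The passage above does not state which beamforming algorithm is used. Purely to show why steering toward a speaker reduces noise, the following sketch uses delay-and-sum beamforming, a common textbook technique chosen here as an assumption. It further assumes a far-field source, microphone positions given in meters in a horizontal plane, and whole-sample delays; `steering_delays` and `delay_and_sum` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature


def steering_delays(
    mic_xy: Sequence[Tuple[float, float]], azimuth_deg: float, sample_rate: float
) -> List[int]:
    """Whole-sample arrival delay of each microphone for a far-field source."""
    theta = math.radians(azimuth_deg)
    ux, uy = math.cos(theta), math.sin(theta)
    # A microphone lying further toward the source hears the wavefront earlier.
    arrival = [-(x * ux + y * uy) / SPEED_OF_SOUND for x, y in mic_xy]
    earliest = min(arrival)
    return [round((t - earliest) * sample_rate) for t in arrival]


def delay_and_sum(channels: Sequence[Sequence[float]], delays: Sequence[int]) -> List[float]:
    """Advance each channel by its delay and average the aligned samples."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]
```

Sound from the steered direction adds coherently after alignment, while sound from other directions is misaligned and partly averages out, which is the sense in which the extracted utterance contains less noise.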

The display unit 5 consists of an LCD (Liquid Crystal Display) or the like and displays, for example, image data of the speakers captured by the camera module 2 and the results of speaker identification by the information processing unit 4.

FIG. 3 is a block diagram showing an example hardware configuration of the tabletop speaker identification device 1 of FIG. 2.

A CPU (Central Processing Unit) 51 executes various processes according to a program stored in a ROM (Read Only Memory) 52 or a program loaded from a storage unit 58 into a RAM (Random Access Memory) 53. The RAM 53 also stores, as appropriate, data that the CPU 51 needs in order to execute the various processes.

The CPU 51, the ROM 52, and the RAM 53 are connected to one another via a bus 54. An input/output interface 55 is also connected to the bus 54.

Connected to the input/output interface 55 are an input unit 56 consisting of the camera module 2, the microphone array 3, various buttons (not shown), and the like; an output unit 57 consisting of the display unit 5, a speaker (not shown), and the like; a storage unit 58 consisting of a hard disk or the like; and a communication unit 59 consisting of a modem, a LAN (Local Area Network) adapter, or the like.

The storage unit 58 is controlled by the CPU 51, stores data such as programs supplied via the communication unit 59, and supplies the stored data to the RAM 53, the output unit 57, the communication unit 59, and so on as needed.

The communication unit 59 communicates with other devices (not shown) via a network. In addition to communication using a modem or the like, the communication unit 59 has a function of performing communication processing conforming to various standards such as USB (Universal Serial Bus), IEEE (Institute of Electrical and Electronics Engineers) 1394, or SCSI (Small Computer System Interface).

A drive 60 is also connected to the input/output interface 55. A magnetic disk 61 (including a flexible disk), an optical disk 62 (including a CD-ROM (Compact Disk-Read Only Memory) and a DVD (Digital Versatile Disk)), a magneto-optical disk 63 (including an MD (Mini-Disk) (trademark)), a semiconductor memory 64, or the like is mounted on it as appropriate, and programs read from these media are installed in the storage unit 58 as needed.

FIG. 4 shows an example functional configuration of the information processing unit 4 of FIG. 2.

At least part of the configuration of FIG. 4 is realized, for example, by the CPU 51 of FIG. 3 executing a predetermined program.

Implemented in the information processing unit 4 are: an image data processing unit 111 consisting of a face detection unit 121 and a beam width calculation unit 122; a registration unit 112 consisting of a DB (database) management unit 131 and a speaker information DB 132; a voice data processing unit 113 consisting of a microphone array I/F (interface) 141, a sound source localization detection unit 142, and a beamforming processing unit 143; and a speaker identification unit 114 consisting of a voice identification processing unit 151, an identification information adding unit 152, an output control unit 153, and a minutes DB 154.

画像データ処理部１１１の顔検出部１２１は、カメラモジュール２から供給される全周囲画像データのうちの話者の顔に対応する領域の顔画像データを検出し、顔の方位を、話者の方位として検出する。顔検出部１２１は、検出した各話者の話者方位と顔画像データに、その話者を識別するための話者ID（Identification）を対応付け、話者IDが対応付けられた話者方位と顔画像データをビーム幅算出部１２２に供給する。 The face detection unit 121 of the image data processing unit 111 detects face image data in a region corresponding to the speaker's face in the all-around image data supplied from the camera module 2, and determines the orientation of the face. Detect as direction. The face detection unit 121 associates the detected speaker orientation and face image data of each speaker with a speaker ID (Identification) for identifying the speaker, and the speaker orientation associated with the speaker ID. And the face image data are supplied to the beam width calculator 122.

なお、顔検出部１２１においては、顔画像データにおける、例えば、目や鼻などに対応する領域の重心や、肌色の領域の重心が検出され、カメラモジュール２から、その重心に向かう方位が話者方位として検出される。 Note that the face detection unit 121 detects, for example, the centroid of the regions corresponding to the eyes, nose, and the like, or the centroid of the skin-colored region, in the face image data, and detects the direction from the camera module 2 toward that centroid as the speaker orientation.

ビーム幅算出部１２２は、顔検出部１２１から供給される話者方位から得られる話者の位置関係に基づいて、話者それぞれについてビームフォーミング幅を算出し、話者方位と、算出したビームフォーミング幅を話者IDに対応付けて音源定位検出部１４２に供給する。また、ビームフォーミング幅算出部１２２は、顔検出部１２１から供給される顔画像データをDB管理部１３１に供給する。ビーム幅算出部１２２によるビームフォーミング幅の算出については図５を用いて後述する。 The beam width calculation unit 122 calculates a beamforming width for each speaker based on the positional relationship between the speakers obtained from the speaker orientations supplied from the face detection unit 121, and supplies the speaker orientation and the calculated beamforming width, associated with the speaker ID, to the sound source localization detection unit 142. The beam width calculation unit 122 also supplies the face image data supplied from the face detection unit 121 to the DB management unit 131. The calculation of the beamforming width by the beam width calculation unit 122 will be described later with reference to FIG. 5.

登録部１１２のDB管理部１３１は、ビーム幅算出部１２２から供給される顔画像データや、話者識別部１１４から供給される、各話者の音声の標準パターンを話者情報DB１３２に記憶させる。話者情報DB１３２に登録された顔画像データや標準パターンは、必要に応じて、音声識別処理部１５１に提供される。 The DB management unit 131 of the registration unit 112 stores, in the speaker information DB 132, the face image data supplied from the beam width calculation unit 122 and the standard pattern of each speaker's voice supplied from the speaker identification unit 114. The face image data and standard patterns registered in the speaker information DB 132 are provided to the voice identification processing unit 151 as necessary.

音声データ処理部１１３のマイクアレイI/F１４１は、マイクアレイ３から供給されるアナログ音声信号をA/D（Analog/Digital）変換し、得られたディジタルの音声データを音源定位検出部１４２とビームフォーミング処理部１４３に供給する。 The microphone array I/F 141 of the audio data processing unit 113 performs A/D (analog/digital) conversion on the analog audio signal supplied from the microphone array 3, and supplies the resulting digital audio data to the sound source localization detection unit 142 and the beamforming processing unit 143.

音源定位検出部１４２は、マイクアレイI/F１４１から供給される音声データから音源方位を推定し、その推定した音源方位（推定方位）に最も近い方位にいる話者を、ビーム幅算出部１２２から供給される話者方位に基づいて検出する。 The sound source localization detection unit 142 estimates the sound source direction from the audio data supplied from the microphone array I/F 141 and, based on the speaker orientations supplied from the beam width calculation unit 122, detects the speaker whose orientation is closest to the estimated sound source direction (estimated direction).
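The nearest-orientation lookup described above can be illustrated with a minimal Python sketch. The function names, the use of degrees, and the dictionary of speaker IDs are assumptions made for illustration; the description itself only states that the speaker closest to the estimated direction is detected. The angular difference is taken on a circle so that 350 degrees and 10 degrees count as 20 degrees apart.

```python
def circular_diff(a_deg, b_deg):
    """Smallest absolute angle between two bearings, in degrees (0 to 180)."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def nearest_speaker(estimated_deg, speaker_bearings):
    """Return the speaker ID whose bearing is closest to the estimated bearing.

    speaker_bearings maps a (hypothetical) speaker ID to a bearing in degrees.
    """
    return min(
        speaker_bearings,
        key=lambda sid: circular_diff(estimated_deg, speaker_bearings[sid]),
    )


# Illustrative bearings for four speakers around the device (assumed values).
bearings = {"A": 10.0, "B": 100.0, "C": 190.0, "D": 300.0}
print(nearest_speaker(350.0, bearings))
```

An estimate of 350 degrees is matched to speaker "A" here because the comparison wraps around the 0/360 degree seam.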

ここで、音源定位検出部１４２では、例えば、マイク３−１乃至３−４のそれぞれにより集音された音声データの相関性が最も大きくなるときの時間差（位相差）が求められ、その時間差に基づいて、発話の音源（話者）の音源方位が推定される。また、音源定位検出部１４２では、マイク３−１乃至３−４それぞれからの音声データを解析することにより、例えば、同時に発話された複数の音声データそれぞれの音源方位を推定することも行われる。 Here, the sound source localization detection unit 142 obtains, for example, the time difference (phase difference) at which the correlation between the audio data collected by the microphones 3-1 to 3-4 is greatest, and estimates the direction of the sound source (speaker) of the utterance based on that time difference. The sound source localization detection unit 142 also analyzes the audio data from each of the microphones 3-1 to 3-4 to estimate, for example, the sound source direction of each of a plurality of utterances spoken simultaneously.
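As a rough illustration of the correlation-based time difference, the following Python sketch finds the lag that maximizes the cross-correlation between two microphone signals and converts it to an angle. It is a simplification under stated assumptions: only one microphone pair is used (the description uses four microphones), the source is assumed to be in the far field, and the names `best_lag` and `lag_to_angle_deg`, the sample rate, the microphone spacing, and the speed of sound of 343 m/s are all illustrative choices rather than values from the description.

```python
import math


def best_lag(x, y, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive result means that y is a delayed copy of x.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(x)):
            m = n + lag
            if 0 <= m < len(y):
                score += x[n] * y[m]
        if score > best_score:
            best, best_score = lag, score
    return best


def lag_to_angle_deg(lag, sample_rate_hz, mic_spacing_m, speed_of_sound=343.0):
    """Far-field angle (relative to broadside of the pair) for a given lag."""
    s = speed_of_sound * (lag / sample_rate_hz) / mic_spacing_m
    s = max(-1.0, min(1.0, s))  # clamp against rounding and oversized lags
    return math.degrees(math.asin(s))


# A short pulse and a copy of it delayed by three samples.
x = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
lag = best_lag(x, y, max_lag=5)
print(lag, lag_to_angle_deg(lag, sample_rate_hz=16000, mic_spacing_m=0.1))
```

A single pair only resolves the angle relative to its own axis; combining several pairs, as a four-microphone array allows, would be needed to obtain a full 360-degree direction.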

また、音源定位検出部１４２は、検出した話者のビームフォーミング幅を、ビーム幅算出部１２２から供給される全ての話者のビームフォーミング幅の中から選択し、選択したビームフォーミング幅を、その話者の話者方位とともにビームフォーミング処理部１４３に供給する。 The sound source localization detection unit 142 also selects the beamforming width of the detected speaker from among the beamforming widths of all the speakers supplied from the beam width calculation unit 122, and supplies the selected beamforming width, together with that speaker's orientation, to the beamforming processing unit 143.

さらに、音源定位検出部１４２は、検出した話者の話者ID（話者方位に対応付けられている話者ID）を、話者識別部１１４の音声識別処理部１５１と識別情報付与部１５２に供給することも行う。 Furthermore, the sound source localization detection unit 142 supplies the speaker ID of the detected speaker (the speaker ID associated with the speaker orientation) to the voice identification processing unit 151 and the identification information adding unit 152 of the speaker identification unit 114.

ビームフォーミング処理部１４３は、音源定位検出部１４２から供給される話者方位とビームフォーミング幅を用いて、マイクアレイI/F１４１から供給される音声データのビームフォーミングを行う。 The beamforming processing unit 143 performs beamforming of audio data supplied from the microphone array I / F 141 using the speaker orientation and the beamforming width supplied from the sound source localization detection unit 142.

即ち、ビームフォーミング処理部１４３は、音源定位検出部１４２から供給される話者方位を基準とし、ビームフォーミング幅により表される範囲の方位からの音声データを、マイクアレイI/F１４１から供給される全ての方位からの音声データから強調することにより、その範囲の方位からの音声データを抽出する。 That is, taking the speaker orientation supplied from the sound source localization detection unit 142 as the reference, the beamforming processing unit 143 emphasizes the audio data arriving from the directions within the range represented by the beamforming width, out of the audio data from all directions supplied from the microphone array I/F 141, and thereby extracts the audio data from that range of directions.

従って、ここで抽出される音声データは、ある１人の話者による音声データのみが強調されたものとなり（他の話者の音声や反射音の音声データなどが除去（抑制）されたものとなる）、精度の高い話者識別などが実現可能となる標準パターン、或いは、話者識別の対象とする音声を得ることが可能になる。 Therefore, the audio data extracted here is data in which only the audio data of one speaker is emphasized (the voices of other speakers, reflected sound, and the like are removed (suppressed)), which makes it possible to obtain a standard pattern that enables highly accurate speaker identification, or speech to be subjected to speaker identification.

ビームフォーミング処理部１４３は、ビームフォーミングにより抽出された音声データを音声識別処理部１５１に供給する。 The beam forming processing unit 143 supplies the voice identification processing unit 151 with the voice data extracted by the beam forming.

なお、ビームフォーミング処理部１４３は、音源定位検出部１４２において複数の音源方位が推定された場合には、その複数の音源方位に対応するそれぞれの話者方位とビームフォーミング幅の供給を受け、その複数の話者方位とビームフォーミング幅それぞれに対して、マイクアレイI/F１４１から供給される音声データから音声データを抽出する。 Note that when a plurality of sound source directions are estimated by the sound source localization detection unit 142, the beamforming processing unit 143 receives the speaker orientation and beamforming width corresponding to each of those sound source directions, and extracts audio data from the audio data supplied from the microphone array I/F 141 for each of those speaker orientations and beamforming widths.

話者識別部１１４の音声識別処理部１５１は、ビームフォーミング処理部１４３から供給される音声データ（音源定位検出部１４２から供給される話者方位とビームフォーミング幅に対して行われたビームフォーミングにより得られた音声データ）を離散フーリエ変換（DFT(Discrete Fourier Transform)）するなどして音声特徴量を抽出し、その音声特徴量に基づき、話者識別のために用いられる、例えば、HMM（Hidden Markov Model）などの標準パターン（モデル）を生成する。 The voice identification processing unit 151 of the speaker identification unit 114 extracts voice features from the audio data supplied from the beamforming processing unit 143 (the audio data obtained by beamforming with the speaker orientation and beamforming width supplied from the sound source localization detection unit 142), for example by applying a discrete Fourier transform (DFT), and, based on the voice features, generates a standard pattern (model) used for speaker identification, such as an HMM (Hidden Markov Model).
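To make the DFT step concrete, the sketch below computes a magnitude spectrum for one short frame of audio in plain Python. It is only an assumption that magnitude spectra would serve as the voice features; the description does not specify the feature set, frame length, or windowing, and the function name `dft_magnitudes` is hypothetical. The direct O(N^2) sum is used for clarity rather than a fast transform.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0 .. N/2 of a real-valued frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(
            frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        mags.append(abs(acc))
    return mags


# An 8-sample frame containing exactly two cycles of a cosine.
frame = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
mags = dft_magnitudes(frame)
print(max(range(len(mags)), key=lambda k: mags[k]))
```

The energy of the test frame concentrates in bin 2, which is the bin matching its two cycles per frame.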

また、音声識別処理部１５１は、その標準パターンに、音源定位検出部１４２から供給される話者ID（標準パターンを生成するのに用いられた音声データを得るためのビームフォーミングに使用された話者方位とビームフォーミング幅に対応付けられた話者ID）を対応付け、それらのデータをDB管理部１３１に供給する。DB管理部１３１に供給された、話者IDに対応付けられた標準パターンは話者情報DB１３２に記憶される。 The voice identification processing unit 151 also associates the standard pattern with the speaker ID supplied from the sound source localization detection unit 142 (the speaker ID associated with the speaker orientation and beamforming width used in the beamforming that produced the audio data from which the standard pattern was generated), and supplies these data to the DB management unit 131. The standard pattern associated with the speaker ID and supplied to the DB management unit 131 is stored in the speaker information DB 132.

マイクアレイ３に入力された音声の話者識別を行うとき、音声識別処理部１５１は、ビームフォーミング処理部１４３から供給される音声データから抽出された音声特徴量と、話者情報DB１３２に登録されている標準パターンを用いて、例えば、HMM法に基づく話者識別を行う。HMM法に基づく話者識別により得られる、ビームフォーミング処理部１４３からの音声データが観測される尤度が最も高い標準パターン（モデル）に対応付けられている話者IDは、話者の識別結果として、ビームフォーミング処理部１４３から供給された音声データとともに識別情報付与部１５２に供給される。 When identifying the speaker of speech input to the microphone array 3, the voice identification processing unit 151 performs speaker identification based on, for example, the HMM method, using the voice features extracted from the audio data supplied from the beamforming processing unit 143 and the standard patterns registered in the speaker information DB 132. The speaker ID associated with the standard pattern (model) that gives the highest likelihood of observing the audio data from the beamforming processing unit 143 is supplied, as the speaker identification result, to the identification information adding unit 152 together with the audio data supplied from the beamforming processing unit 143.
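The selection of the most likely standard pattern can be sketched as follows. Note the substitution: instead of an HMM, each registered model here is a single diagonal-covariance Gaussian (a mean vector and a variance vector), chosen only to keep the example short. The maximum-likelihood selection over speaker IDs is the part that mirrors the description; the model form, the names `log_likelihood` and `identify`, and the numbers are assumptions.

```python
import math


def log_likelihood(features, model):
    """Log-likelihood of feature vectors under a diagonal Gaussian model.

    model is a (mean, variance) pair of equal-length lists. This is a
    stand-in for the HMM scoring mentioned in the description.
    """
    mean, var = model
    total = 0.0
    for vec in features:
        for x, m, v in zip(vec, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total


def identify(features, models):
    """Return the speaker ID whose model gives the highest likelihood."""
    return max(models, key=lambda sid: log_likelihood(features, models[sid]))


# Two hypothetical registered speakers and a short utterance near speaker "B".
models = {
    "A": ([0.0, 0.0], [1.0, 1.0]),
    "B": ([5.0, 5.0], [1.0, 1.0]),
}
utterance = [[4.8, 5.1], [5.2, 4.9]]
print(identify(utterance, models))
```

Replacing `log_likelihood` with an HMM forward-algorithm score would leave `identify` unchanged.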

識別情報付与部１５２は、音源定位検出部１４２から供給される話者IDと、音声識別処理部１５１から供給される話者IDが一致するか否かの判定を行う。識別情報付与部１５２は、それらの話者IDが一致すると判定した場合、即ち、音声識別処理部１５１における話者識別の結果得られる話者と、音源定位検出部１４２で推定された音源の推定方位に最も近い方位にいる話者が一致すると判定した場合、その話者IDに対応付けられている顔画像データをDB管理部１３１を介して話者情報DB１３２から読み出し、音声識別処理部１５１から供給される音声データとともに出力制御部１５３に供給する。 The identification information adding unit 152 determines whether the speaker ID supplied from the sound source localization detection unit 142 matches the speaker ID supplied from the voice identification processing unit 151. When it determines that the speaker IDs match, that is, when the speaker obtained by speaker identification in the voice identification processing unit 151 matches the speaker whose orientation is closest to the sound source direction estimated by the sound source localization detection unit 142, the identification information adding unit 152 reads the face image data associated with that speaker ID from the speaker information DB 132 via the DB management unit 131, and supplies it to the output control unit 153 together with the audio data supplied from the voice identification processing unit 151.

音声識別処理部１５１における話者識別の結果得られる話者と、音源定位検出部１４２で推定された音源の推定方位に最も近い方位にいる話者が一致する場合に出力制御部１５３に供給されるようにしたため、より精度の高い話者識別結果を得ることが可能になる。 Because the data is supplied to the output control unit 153 when the speaker obtained by speaker identification in the voice identification processing unit 151 matches the speaker whose orientation is closest to the sound source direction estimated by the sound source localization detection unit 142, a more accurate speaker identification result can be obtained.
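The agreement check between the two estimates amounts to a simple comparison, sketched below. The function name `confirm_speaker` and the choice to return `None` on disagreement are assumptions; the description states what happens when the speaker IDs match and does not say how a mismatch is handled.

```python
def confirm_speaker(localized_id, recognized_id):
    """Return the speaker ID when both estimates agree, otherwise None.

    localized_id: ID of the speaker nearest the estimated sound direction.
    recognized_id: ID chosen by voice-based speaker identification.
    """
    return localized_id if localized_id == recognized_id else None


print(confirm_speaker("A", "A"))
print(confirm_speaker("A", "B"))
```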

出力制御部１５３は、識別情報付与部１５２から供給される顔画像データより得られる画像を表示部５に表示させる。従って、表示部５には、識別された話者の顔画像が表示されることになる。なお、話者識別の結果である顔画像が表示部５に表示されるだけでなく、識別された話者の音声がスピーカから出力制御部１５３により出力されるようにしてもよい。 The output control unit 153 causes the display unit 5 to display an image obtained from the face image data supplied from the identification information adding unit 152. Therefore, the face image of the identified speaker is displayed on the display unit 5. Note that the face image as a result of speaker identification is not only displayed on the display unit 5, but the voice of the identified speaker may be output from the speaker by the output control unit 153.

また、出力制御部１５３は、識別情報付与部１５２から供給される音声データと顔画像データを話者IDなどに対応付けて議事録DB１５４に記憶させる。 Further, the output control unit 153 stores the audio data and face image data supplied from the identification information adding unit 152 in the minutes DB 154 in association with the speaker ID or the like.

ここで、以上のような構成を有する情報処理部４によるビームフォーミングについて説明する。 Here, beam forming by the information processing unit 4 having the above configuration will be described.

情報処理部４において行われる遅延和ビームフォーミング（遅延和方式によるビームフォーミング）は、マイク３−１乃至３−４それぞれからの音声データを、マイク間の距離に対応する時間遅延に基づいて同相化し、所定の方位からの音声データのみを、その位相を合わせることによって強調させるものである。 The delay-and-sum beamforming performed in the information processing unit 4 brings the audio data from the microphones 3-1 to 3-4 into phase based on time delays corresponding to the distances between the microphones, and emphasizes only the audio data from a predetermined direction by aligning its phase.

従って、遅延和ビームフォーミングによれば、所定の方位以外の、他の方位からの音声データや、マイク間の距離に無関係な、例えば、マイク自身の雑音などが除去されることになる。 Therefore, delay-and-sum beamforming removes audio data from directions other than the predetermined direction, as well as components unrelated to the distances between the microphones, such as the noise of the microphones themselves.
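A minimal delay-and-sum sketch in Python follows. It assumes integer sample delays that are already known for the look direction (how the delays are derived from the microphone geometry is not shown), and the function name `delay_and_sum` is illustrative. Each channel is advanced by its arrival delay so that sound from the look direction lines up, and the aligned channels are averaged.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its arrival delay (in samples) and average.

    channels: list of equal-length lists of samples, one per microphone.
    delays: per-channel arrival delay for the look direction, in samples.
    Samples shifted in from outside the buffer are treated as zero.
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = t + d
            if 0 <= idx < n:
                acc += ch[idx]
        out.append(acc / len(channels))
    return out


# One pulse reaching three microphones 0, 1, and 2 samples late.
channels = [
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]
aligned = delay_and_sum(channels, [0, 1, 2])
unaligned = delay_and_sum(channels, [0, 0, 0])
print(max(aligned), max(unaligned))
```

With the matching delays the pulse adds coherently to full amplitude, while with mismatched delays it is smeared to one third of that, which is the emphasis effect described above.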

ところで、ビームフォーミング時のビームフォーミング幅は、それが狭いほど、ある１人の話者の音声データを、他の話者の音声データから分離することができる。しかしながら、ビームフォーミング幅が狭いと、マイクアレイ３で得られた音声データの情報量が低減し、さらに、ビームフォーミングによって抽出される音声データが、こもった音となってしまう。 Incidentally, the narrower the beamforming width, the better the audio data of one speaker can be separated from that of other speakers. However, when the beamforming width is narrow, the amount of information in the audio data obtained by the microphone array 3 is reduced, and the audio data extracted by beamforming sounds muffled.

一方、ビームフォーミング幅を単に広くしてしまうと、マイクアレイ３からの音声データから、ある１人の話者の音声データを精度良く抽出することが困難となる。即ち、ビームフォーミング幅を単に広くした場合、ある１人の話者の音声データを抽出するときに、その話者に隣接する話者の音声データも抽出してしまうことになる。 On the other hand, if the beam forming width is simply widened, it becomes difficult to accurately extract the voice data of a single speaker from the voice data from the microphone array 3. That is, when the beamforming width is simply widened, when voice data of a certain speaker is extracted, voice data of a speaker adjacent to the speaker is also extracted.

そこで、情報処理部４では、各話者の音声データを、他の話者の音声データと精度良く区別して抽出することができる、より広い幅のビームフォーミング幅を、話者同士の位置関係に基づいて算出することが行われる。これにより、話者それぞれに最適なビームフォーミング幅を得ることができる。 Therefore, the information processing unit 4 calculates, based on the positional relationship between the speakers, a beamforming width that is as wide as possible while still allowing the audio data of each speaker to be extracted accurately and distinguished from that of the other speakers. This yields an optimal beamforming width for each speaker.

話者同士の位置関係に基づくビームフォーミング幅の算出について、図５Ａ，Ｂを参照して説明する。 Calculation of the beam forming width based on the positional relationship between speakers will be described with reference to FIGS. 5A and 5B.

図５Ａは、図５Ｂに示されるように、４人の話者Ｘ−１乃至Ｘ＋２がテーブルトップ型話者識別装置１を囲んでいる状態において、カメラモジュール２により撮像された全周囲の画像を展開したものの例を示している。 FIG. 5A shows an image of the entire periphery imaged by the camera module 2 in a state where four speakers X-1 to X + 2 surround the tabletop speaker identification device 1 as shown in FIG. 5B. An example of the development is shown.

なお、図５Ａでは、カメラモジュール２を中心とする３６０度の周囲の全周囲画像データを、横方向を円周方向（角度方向）とするとともに、縦方向を半径方向として、長方形状の画像データとして示してある。従って、図５の上の全周囲画像データの左端の方位を例えば０度とすると、その右端の方位は３６０度である。 In FIG. 5A, the all-around image data covering 360 degrees around the camera module 2 is shown as rectangular image data, with the horizontal direction as the circumferential (angular) direction and the vertical direction as the radial direction. Therefore, if the direction at the left end of the all-around image data in the upper part of FIG. 5 is, for example, 0 degrees, the direction at its right end is 360 degrees.
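Given this layout, a horizontal pixel position in the unrolled image maps linearly to a direction. The Python sketch below assumes a panorama `image_width` pixels wide whose left edge corresponds to 0 degrees, and estimates a speaker orientation from the mean column of the pixels belonging to a detected face; the function names and pixel values are illustrative, and the linear mapping is an assumption about the unrolling.

```python
def column_to_azimuth(column, image_width):
    """Map a pixel column of the unrolled panorama to degrees (0 to 360)."""
    return 360.0 * column / image_width


def face_azimuth(face_columns, image_width):
    """Azimuth of the centroid of the columns belonging to one face."""
    centroid = sum(face_columns) / len(face_columns)
    return column_to_azimuth(centroid, image_width)


# A face occupying columns around the middle of a 720-pixel-wide panorama.
print(face_azimuth([350, 360, 370], image_width=720))
```

A face that straddles the left and right edges of the image would need its columns unwrapped before averaging; that case is not handled here.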

また、図５Ａでは、話者Ｘ＋ｉが位置する方位（話者方位）をＤ_X+iで表している（ｉ＝−１，０，１，２）。 In FIG. 5A, the direction in which speaker X+i is located (speaker orientation) is denoted by D_{X+i} (i = -1, 0, 1, 2).

図４のビーム幅算出部１２２は、話者Ｘのビームフォーミング幅Ｂ_Xを算出する場合、話者Ｘの方位Ｄ_Xと、話者Ｘの反時計回り方向に隣接する話者Ｘ−１の方位Ｄ_X-1の方位の差（角度）｜Ｄ_X-1−Ｄ_X｜を算出するとともに、話者Ｘの方位Ｄ_Xと、話者Ｘの時計回り方向に隣接する話者Ｘ＋１の方位Ｄ_X+1の方位の差｜Ｄ_X+1−Ｄ_X｜を算出する。 When calculating the beamforming width B_X of speaker X, the beam width calculation unit 122 in FIG. 4 calculates the difference (angle) |D_{X-1} - D_X| between the direction D_X of speaker X and the direction D_{X-1} of speaker X-1, who is adjacent to speaker X in the counterclockwise direction, and also calculates the difference |D_{X+1} - D_X| between the direction D_X of speaker X and the direction D_{X+1} of speaker X+1, who is adjacent to speaker X in the clockwise direction.

そして、ビーム幅算出部１２２は、話者Ｘと話者Ｘ−１の方位の差｜Ｄ_X-1−Ｄ_X｜と、話者Ｘと話者Ｘ＋１の方位の差｜Ｄ_X+1−Ｄ_X｜のうちの小さい方を、話者Ｘのビームフォーミング幅Ｂ_X（図５Ｂの斜線が付されている部分）とする。 The beam width calculation unit 122 then takes the smaller of the difference in direction between speaker X and speaker X-1, |D_{X-1} - D_X|, and the difference in direction between speaker X and speaker X+1, |D_{X+1} - D_X|, as the beamforming width B_X of speaker X (the hatched portion in FIG. 5B).

同様に、ビーム幅算出部１２２は、話者Ｘ＋１と話者Ｘの方位の差｜Ｄ_X−Ｄ_X+1｜と、話者Ｘ＋１と話者Ｘ＋２の方位の差｜Ｄ_X+2−Ｄ_X+1｜のうちの小さい方を、話者Ｘ＋１のビームフォーミング幅Ｂ_X+1とする。 Similarly, the beam width calculation unit 122 takes the smaller of the difference in direction between speaker X+1 and speaker X, |D_X - D_{X+1}|, and the difference in direction between speaker X+1 and speaker X+2, |D_{X+2} - D_{X+1}|, as the beamforming width B_{X+1} of speaker X+1.

さらに、ビーム幅算出部１２２は、話者Ｘ＋２と話者Ｘ＋１の方位の差｜Ｄ_X+1−Ｄ_X+2｜と、話者Ｘ＋２と話者Ｘ−１の方位の差｜Ｄ_X-1−Ｄ_X+2｜のうちの小さい方を、話者Ｘ＋２のビームフォーミング幅Ｂ_X+2とし、話者Ｘ−１と話者Ｘ＋２の方位の差｜Ｄ_X+2−Ｄ_X-1｜と、話者Ｘ−１と話者Ｘの方位の差｜Ｄ_X−Ｄ_X-1｜のうちの小さい方を、話者Ｘ−１のビームフォーミング幅Ｂ_X-1とする。 Furthermore, the beam width calculation unit 122 takes the smaller of the difference in direction between speaker X+2 and speaker X+1, |D_{X+1} - D_{X+2}|, and the difference in direction between speaker X+2 and speaker X-1, |D_{X-1} - D_{X+2}|, as the beamforming width B_{X+2} of speaker X+2, and takes the smaller of the difference in direction between speaker X-1 and speaker X+2, |D_{X+2} - D_{X-1}|, and the difference in direction between speaker X-1 and speaker X, |D_X - D_{X-1}|, as the beamforming width B_{X-1} of speaker X-1.

なお、話者と反時計回り方向に隣接する他の話者との方位差と、話者と時計回り方向に隣接する他の話者との方位差のうちの小さい方を、その話者のビームフォーミング幅とすることで、話者の方位に最も近い方位にいる他の話者の方位の中心までが、話者のビームフォーミング幅とされる。従って、話者それぞれにとって、最も近い方位にいる他の話者の方位の中心までがビームフォーミングされることとなる。 Note that the smaller of the heading difference between the speaker and another speaker adjacent in the counterclockwise direction and the heading difference between the speaker and another speaker adjacent in the clockwise direction is the speaker's By setting the beam forming width, the beam forming width of the speaker is set up to the center of the direction of the other speaker in the direction closest to the direction of the speaker. Therefore, for each speaker, beam forming is performed up to the center of the direction of the other speaker in the closest direction.

上述のように、ビーム幅算出部１２２は、所定の話者のビームフォーミング幅を、所定の話者の方位と、反時計回り方向に隣接する他の話者の方位との差、もしくは、所定の話者の方位と、時計回り方向に隣接する他の話者の方位との差のうちの小さい方とする。これにより、所定の話者の音声データが、他の話者の音声データと分離されて最適なビームフォーミング幅でビームフォーミングされることとなる。 As described above, the beam width calculation unit 122 sets the beamforming width of a given speaker to the smaller of the difference between the direction of that speaker and the direction of the adjacent speaker in the counterclockwise direction and the difference between the direction of that speaker and the direction of the adjacent speaker in the clockwise direction. As a result, the audio data of the given speaker is separated from that of the other speakers and beamformed with an optimal beamforming width.
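The rule of taking the smaller of the two neighbor gaps can be sketched in Python as follows. Several details are assumptions made for illustration: bearings are given in degrees in a dictionary keyed by speaker ID, the gaps are measured around the circle so that the neighbors across the 0/360 degree seam are handled, and a fallback `default_width` of 60 degrees is used when fewer than two speakers are present (the value 60 is arbitrary).

```python
def beam_widths(bearings, default_width=60.0):
    """Beamforming width per speaker: the smaller gap to either neighbor.

    bearings maps a speaker ID to a bearing in degrees. default_width is a
    hypothetical fallback used when there is no neighbor to measure against.
    """
    if len(bearings) < 2:
        return {sid: default_width for sid in bearings}
    # Order the speakers around the circle by bearing.
    order = sorted(bearings, key=lambda sid: bearings[sid] % 360.0)
    widths = {}
    for i, sid in enumerate(order):
        prev_id = order[i - 1]
        next_id = order[(i + 1) % len(order)]
        gap_prev = (bearings[sid] - bearings[prev_id]) % 360.0
        gap_next = (bearings[next_id] - bearings[sid]) % 360.0
        widths[sid] = min(gap_prev, gap_next)
    return widths


# Four speakers at unevenly spaced bearings (illustrative values).
print(beam_widths({"A": 10.0, "B": 100.0, "C": 190.0, "D": 300.0}))
```

In this example speakers "A" and "D", who sit closest together across the seam, both receive the narrowest width of 70 degrees.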

なお、その他、例えば、話者Ｘについて、（Ｄ_X-1＋Ｄ_X）／２から（Ｄ_X+1＋Ｄ_X）／２までの範囲の角度を、ビームフォーミング幅とすることが可能である。他の話者Ｘ−１，Ｘ＋１、およびＸ＋２についても同様である。 Alternatively, for speaker X, for example, the range of angles from (D_{X-1} + D_X)/2 to (D_{X+1} + D_X)/2 can be used as the beamforming width. The same applies to the other speakers X-1, X+1, and X+2.
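This midpoint variant can be sketched as below. The function name `midpoint_range` is hypothetical, bearings are assumed to be in degrees and to increase from the previous neighbor through the speaker to the next neighbor, and the arithmetic is done modulo 360 so that a speaker next to the 0/360 degree seam is handled. Away from the seam the result equals the two averages given in the text.

```python
def midpoint_range(prev_deg, own_deg, next_deg):
    """Beam range from halfway to one neighbor to halfway to the other.

    Returns (start, end) in degrees. The range may be asymmetric about
    own_deg when the two neighbors are at different distances.
    """
    gap_prev = (own_deg - prev_deg) % 360.0
    gap_next = (next_deg - own_deg) % 360.0
    start = (own_deg - gap_prev / 2.0) % 360.0
    end = (own_deg + gap_next / 2.0) % 360.0
    return start, end


print(midpoint_range(10.0, 100.0, 190.0))
print(midpoint_range(300.0, 10.0, 100.0))
```

Unlike the smaller-gap rule, this range is generally not centered on the speaker, so it would have to be passed to the beamformer as a start and end direction rather than as a single width.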

このように、話者同士の位置関係に基づいて、ビームフォーミング幅が求められ、そのビームフォーミング幅を用いて、話者それぞれに適したビームフォーミングがなされるので、マイクアレイ３が出力する音声データから他の話者の発話や、マイク自身のノイズなどが除かれた音声データを抽出することができる。 Thus, the beam forming width is obtained based on the positional relationship between the speakers, and the beam forming width is used to perform beam forming suitable for each speaker. It is possible to extract speech data from which other speakers' utterances and noise of the microphone itself are removed.

次に、以上の構成を有するテーブルトップ型話者識別装置１の動作について説明する。 Next, the operation of the tabletop speaker identification device 1 having the above configuration will be described.

始めに、図６のフローチャートを参照して、話者方位とビームフォーミング幅を求めるとともに、顔画像データを話者情報DB１３２に登録するテーブルトップ型話者識別装置１の処理について説明する。 First, with reference to the flowchart of FIG. 6, the process of the table top type speaker identification device 1 that obtains the speaker orientation and the beamforming width and registers the face image data in the speaker information DB 132 will be described.

例えば、テーブルトップ型話者識別装置１の電源が投入され、所定のボタンが操作されたとき、ステップＳ１において、カメラモジュール２は、テーブルトップ型話者識別装置１を囲む全周囲を撮像し、全周囲画像データを取得して、それを顔検出部１２１に供給する。 For example, when the table top type speaker identification device 1 is turned on and a predetermined button is operated, in step S1, the camera module 2 images the entire periphery surrounding the table top type speaker identification device 1, All-around image data is acquired and supplied to the face detection unit 121.

ステップＳ２において、顔検出部１２１は、カメラモジュール２から供給される全周囲画像データから話者の顔に対応する顔画像データを抽出し、抽出した顔画像データのそれぞれに固有の話者IDを付与する。 In step S 2, the face detection unit 121 extracts face image data corresponding to the speaker's face from the entire surrounding image data supplied from the camera module 2, and sets a speaker ID unique to each of the extracted face image data. Give.

ステップＳ３において、顔検出部１２１は、話者のそれぞれについて、その話者の顔画像データからカメラモジュール２を中心とした実世界での話者の位置の方位である話者方位を検出する。検出された話者方位は、顔画像データとともに話者IDに対応付けられて、ビーム幅算出部１２２に供給される。 In step S 3, for each speaker, the face detection unit 121 detects the speaker orientation, which is the orientation of the speaker position in the real world with the camera module 2 as the center, from the speaker's face image data. The detected speaker orientation is associated with the speaker ID together with the face image data, and is supplied to the beam width calculation unit 122.

ステップＳ４において、ビーム幅算出部１２２は、顔検出部１２１から供給される、話者IDに対応付けられた話者方位に基づいてビームフォーミング幅を算出する。即ち、ビーム幅算出部１２２は、ある話者IDに注目し、その注目している話者IDに対応付けられている話者方位と、その話者IDの話者の両隣にいる２人の話者の話者IDに対応付けられた話者方位を認識し、その２人の話者の一方または他方の話者方位それぞれと、注目している話者IDの話者の話者方位との差を求め、小さい方の差を、注目している話者IDの話者のビームフォーミング幅とする。同様に、ビーム幅算出部１２２は、すべての話者（話者ID）についてビームフォーミング幅を算出する。 In step S4, the beam width calculation unit 122 calculates the beamforming widths based on the speaker orientations, associated with speaker IDs, supplied from the face detection unit 121. That is, the beam width calculation unit 122 takes one speaker ID as the speaker ID of interest, identifies the speaker orientation associated with it and the speaker orientations associated with the speaker IDs of the two speakers on either side of that speaker, obtains the difference between the orientation of each of those two speakers and the orientation of the speaker of interest, and takes the smaller difference as the beamforming width of the speaker of interest. The beam width calculation unit 122 calculates the beamforming width for every speaker (speaker ID) in the same way.

また、ビーム幅算出部１２２は、ステップＳ５において、話者IDに対応付けられている話者方位と、その話者IDの話者のビームフォーミング幅を対応付け、各話者IDと、話者IDに対応付けられている話者方位とビームフォーミング幅を、音源定位検出部１４２に供給する。 In step S5, the beam width calculation unit 122 associates the speaker orientation associated with each speaker ID with the beamforming width of that speaker, and supplies each speaker ID, together with its associated speaker orientation and beamforming width, to the sound source localization detection unit 142.

ステップＳ６において、ビーム幅算出部１２２は、各話者IDに対応付けた顔画像データをDB管理部１３１に供給する。DB管理部１３１においては、話者IDに対応付けられている顔画像データが話者情報DB１３２に登録され、処理が終了する。 In step S 6, the beam width calculation unit 122 supplies face image data associated with each speaker ID to the DB management unit 131. In the DB management unit 131, face image data associated with the speaker ID is registered in the speaker information DB 132, and the process ends.

なお、ビーム幅算出部１２２は、話者が一人の場合、即ち、顔検出部１２１から供給された話者IDが１つの場合には、話者のビームフォーミング幅を、例えば、所定のデフォルトのビームフォーミング幅とする。 Note that when there is only one speaker, that is, when only one speaker ID is supplied from the face detection unit 121, the beam width calculation unit 122 sets the speaker's beamforming width to, for example, a predetermined default beamforming width.

以上のようにして、カメラモジュール２により取得された全周囲画像データに基づいて、話者方位、顔画像データ、およびビームフォーミング幅が得られ、話者IDに対応付けられた話者方位とビームフォーミング幅が音源定位検出部１４２に供給される。また、話者IDに対応付けられた顔画像データが話者情報DB１３２に登録される。 As described above, the speaker orientation, face image data, and beamforming width are obtained based on the all-around image data acquired by the camera module 2, and the speaker orientation and beamforming width associated with each speaker ID are supplied to the sound source localization detection unit 142. In addition, the face image data associated with each speaker ID is registered in the speaker information DB 132.

次に、図７のフローチャートを参照して、話者の音声データの標準パターンを話者情報DB１３２に登録する処理について説明する。 Next, processing for registering a standard pattern of speaker voice data in the speaker information DB 132 will be described with reference to the flowchart of FIG.

この処理は、図６を参照して説明した処理の後に行われるものである。従って、音源定位検出部１４２には、各話者の話者IDに対応付けられた話者方位とビームフォーミング幅がビーム幅算出部１２２から供給されている。 This process is performed after the process described with reference to FIG. Therefore, the sound source localization detection unit 142 is supplied with the speaker orientation and the beamforming width associated with the speaker ID of each speaker from the beam width calculation unit 122.

ステップＳ４１において、音源定位検出部１４２は、ビーム幅算出部１２２から供給される話者IDに対応付けられた話者方位とビームフォーミング幅を受信し、その内蔵するメモリ（図示せず）に記憶する。 In step S41, the sound source localization detection unit 142 receives the speaker orientation and the beamforming width associated with the speaker ID supplied from the beam width calculation unit 122, and stores them in a built-in memory (not shown). To do.

話者方位とビームフォーミング幅をメモリに記憶した後、音源定位検出部１４２は、ステップＳ４２に進み、マイクアレイI/F１４１から音声データが供給されたか否かを判定する。 After storing the speaker orientation and the beamforming width in the memory, the sound source localization detection unit 142 proceeds to step S42, and determines whether or not audio data is supplied from the microphone array I / F 141.

ステップＳ４２において、マイクアレイI/F１４１から音声データが供給されていないと判定された場合、即ち、いずれの話者も発話を行っていない場合、音源定位検出部１４２は、音声データが供給されるまで待機する。 In step S42, when it is determined that no sound data is supplied from the microphone array I / F 141, that is, when no speaker is speaking, the sound source localization detection unit 142 is supplied with the sound data. Wait until.

一方、ステップＳ４２において、マイクアレイI/F１４１から音声データが供給されたと判定された場合、即ち、発話がマイクアレイ３に入力され、マイクアレイI/F１４１により得られたディジタルの音声データが音源定位検出部１４２とビームフォーミング処理部１４３に供給された場合、ステップＳ４３に進み、音源定位検出部１４２は、マイクアレイ３を構成するマイク３−１乃至３−４それぞれからの音声データに基づいて、発話の音源方位を推定する。 On the other hand, when it is determined in step S42 that the voice data is supplied from the microphone array I / F 141, that is, the speech is input to the microphone array 3, and the digital voice data obtained by the microphone array I / F 141 is the sound source localization. When supplied to the detection unit 142 and the beam forming processing unit 143, the process proceeds to step S43, and the sound source localization detection unit 142 is based on the audio data from each of the microphones 3-1 to 3-4 constituting the microphone array 3. Estimate the sound source direction of the utterance.

音源定位検出部１４２は、ステップＳ４３において、ステップＳ４１でビーム幅算出部１２２から供給され、メモリに記憶しておいた全ての話者方位に基づいて、ステップＳ４２で推定した音源方位に最も近い方位の話者を検出し、その話者の方位（撮像画像から得られた方位）に対応付けられている話者IDを、同じくメモリに記憶しておいた全ての話者の話者IDの中から選択する。選択された話者IDは音声識別処理部１５１に供給される。 In step S43, based on all the speaker orientations supplied from the beam width calculation unit 122 in step S41 and stored in the memory, the sound source localization detection unit 142 detects the speaker whose orientation is closest to the sound source direction estimated in step S42, and selects the speaker ID associated with that speaker's orientation (the orientation obtained from the captured image) from among the speaker IDs of all the speakers, which are also stored in the memory. The selected speaker ID is supplied to the voice identification processing unit 151.

ステップＳ４４において、音源定位検出部１４２は、ステップＳ４３の処理で選択した話者IDに対応付けられている話者方位とビームフォーミング幅をビームフォーミング処理部１４３に供給する。ここで供給される話者方位とビームフォーミング幅は、音声データをビームフォーミングする方位とビームフォーミング幅としてビームフォーミング処理部１４３により設定される。 In step S44, the sound source localization detection unit 142 supplies the speaker orientation and the beamforming width associated with the speaker ID selected in the process of step S43 to the beamforming processing unit 143. The speaker orientation and beamforming width supplied here are set by the beamforming processing unit 143 as the orientation and beamforming width for beamforming voice data.

ステップＳ４５において、ビームフォーミング処理部１４３は、マイクアレイI/F１４１から供給される音声データを、音源定位検出部１４２から供給される話者方位とビームフォーミング幅に基づいてビームフォーミングし、これにより、音源定位検出部１４２からの話者方位を中心とする、ビームフォーミング幅が表す範囲の方位（話者方位から、ビームフォーミング幅が表す範囲（角度）の＋１／２の角度から−１／２の角度までの方位）からの音声データを強調し、強調して得られた音声データを音声識別処理部１５１に供給する。 In step S45, the beamforming processing unit 143 performs beamforming on the audio data supplied from the microphone array I/F 141 based on the speaker orientation and beamforming width supplied from the sound source localization detection unit 142. It thereby emphasizes the audio data arriving from the range of directions represented by the beamforming width and centered on the speaker orientation from the sound source localization detection unit 142 (that is, the directions from the speaker orientation plus one half of the angle represented by the beamforming width to the speaker orientation minus one half of that angle), and supplies the emphasized audio data to the voice identification processing unit 151.
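The range of directions emphasized in this step, namely the speaker orientation plus or minus half the beamforming width, can be expressed as a simple membership test. The sketch below is illustrative only: the function name `in_beam` and the use of degrees are assumptions, and an actual beamformer applies a gradual gain pattern rather than a hard cutoff.

```python
def in_beam(azimuth_deg, speaker_deg, beam_width_deg):
    """True if azimuth lies within speaker_deg +/- beam_width_deg / 2.

    The comparison wraps around the 0/360 degree seam.
    """
    d = abs(azimuth_deg - speaker_deg) % 360.0
    d = min(d, 360.0 - d)
    return d <= beam_width_deg / 2.0


# Speaker at 10 degrees with a 70-degree beam: covers 335 to 45 degrees.
print(in_beam(30.0, 10.0, 70.0), in_beam(50.0, 10.0, 70.0), in_beam(350.0, 10.0, 70.0))
```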

ステップＳ４６において、音声識別処理部１５１は、ビームフォーミング処理部１４３から供給される音声データ（ビームフォーミングされた音声データ）から音声特徴量を抽出し、その音声特徴量から標準パターンを生成する。さらに、音声識別処理部１５１は、その標準パターンに対して、音源定位検出部１４２からの話者IDを対応付けてDB管理部１３１に供給する。 In step S46, the voice identification processing unit 151 extracts a voice feature amount from the voice data (beamformed voice data) supplied from the beamforming processing unit 143, and generates a standard pattern from the voice feature amount. Further, the voice identification processing unit 151 associates the speaker ID from the sound source localization detection unit 142 with the standard pattern and supplies the standard pattern to the DB management unit 131.

ステップＳ４７において、DB管理部１３１は、音声識別処理部１５１から供給される話者IDに対応付けられた標準パターンを、話者情報DB１３２に登録されている話者IDのうちの同一の話者IDに対応付けて登録し、処理を終了させる。 In step S47, the DB management unit 131 sets the standard pattern associated with the speaker ID supplied from the voice identification processing unit 151 to the same speaker among the speaker IDs registered in the speaker information DB 132. Register in association with the ID, and terminate the process.

なお、ステップＳ４２乃至Ｓ４７の処理は、話者それぞれについて少なくとも１回以上行われ、話者情報DB１３２には、話者それぞれの顔画像データおよび標準パターンが、話者IDに対応付けられて登録される。この話者情報DB１３２に登録された情報に基づいて、図８の話者識別処理が行われる。 Note that the processing of steps S42 to S47 is performed at least once for each speaker, and the face image data and standard pattern of each speaker are registered in the speaker information DB 132 in association with the speaker ID. The speaker identification processing of FIG. 8 is performed based on the information registered in the speaker information DB 132.

次に、図８のフローチャートを参照して、話者情報DB１３２に登録されている情報に基づいて行われる話者識別処理について説明する。 Next, speaker identification processing performed based on information registered in the speaker information DB 132 will be described with reference to the flowchart of FIG.

ステップＳ７１において、音源定位検出部１４２は、マイクアレイI/F１４１から音声データが供給されたか否かを判定し、音声データが供給されるまで待機する。 In step S71, the sound source localization detection unit 142 determines whether audio data is supplied from the microphone array I / F 141 and waits until the audio data is supplied.

音源定位検出部１４２は、ステップＳ７１において、マイクアレイI/F１４１から音声データが供給されたと判定した場合、ステップＳ７２に進み、マイクアレイ３を構成するマイク３−１乃至３−４それぞれからの音声データに基づいて発話の音源方位を推定する。 If the sound source localization detection unit 142 determines in step S71 that audio data has been supplied from the microphone array I / F 141, the sound source localization detection unit 142 proceeds to step S72 and performs audio from each of the microphones 3-1 to 3-4 constituting the microphone array 3. Estimate the sound source direction of the utterance based on the data.

In step S72, the sound source localization detection unit 142 detects, among the speaker directions supplied from the beam width calculation unit 122 in step S41 of FIG. 7, the speaker whose direction is closest to the estimated direction, and supplies the speaker ID associated with that speaker's direction to the identification information adding unit 152.
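
Selecting the speaker whose direction is closest to the estimated direction can be sketched as follows. This is an illustrative sketch only: the function names, the use of degrees, and the dictionary mapping speaker IDs to directions are assumptions, and the wraparound handling at 360 degrees is included because the camera covers the full circumference.

```python
def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two directions, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def nearest_speaker(estimated_deg, speaker_directions):
    """Return the speaker ID whose registered direction is closest.

    speaker_directions maps a (hypothetical) speaker ID to the direction,
    in degrees, computed beforehand from the camera image.
    """
    return min(
        speaker_directions,
        key=lambda sid: angular_distance(estimated_deg, speaker_directions[sid]),
    )
```

For example, with speakers registered at 10, 180 and 300 degrees, an estimated direction of 350 degrees is 20 degrees from the first speaker and 50 degrees from the third, so the first speaker is selected.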

In step S73, the sound source localization detection unit 142 supplies to the beamforming processing unit 143 the speaker direction and the beamforming width associated with the speaker ID that it supplied to the identification information adding unit 152 in step S72. The beamforming processing unit 143 sets the supplied speaker direction and beamforming width as the direction and the width to be used for beamforming.

In step S74, the beamforming processing unit 143 performs beamforming on the voice data supplied from the microphone array I/F 141, based on the speaker direction and the beamforming width supplied from the sound source localization detection unit 142. This emphasizes the voice data arriving from the range of directions represented by the beamforming width and centered on the speaker direction (that is, from the speaker direction minus half the angle represented by the beamforming width to the speaker direction plus half that angle). The beamforming processing unit 143 supplies the emphasized voice data to the voice identification processing unit 151.
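
The angular range that is emphasized can be written as a simple membership test. The sketch below covers only the geometry of the range (speaker direction plus or minus half the beamforming width); it does not perform any signal processing, and the function name and the use of degrees are assumptions for illustration.

```python
def in_beam(angle_deg, speaker_deg, beam_width_deg):
    """True if angle_deg lies within +/- half the beam width of speaker_deg."""
    # Signed difference folded into the interval [-180, 180).
    diff = (angle_deg - speaker_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= beam_width_deg / 2.0
```

Folding the difference into [-180, 180) keeps the test correct for a beam that straddles 0 degrees, for example a 30 degree beam centered on 5 degrees.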

In step S75, the voice identification processing unit 151 extracts voice features from the beamformed voice data supplied from the beamforming processing unit 143.

In step S76, the voice identification processing unit 151 refers to the standard patterns of the speakers registered in the speaker information DB 132, obtained through the DB management unit 131, and identifies the speaker who made the utterance.

That is, among the standard patterns registered in the speaker information DB 132, the voice identification processing unit 151 finds the standard pattern under which the voice features of the voice data from the beamforming processing unit 143 are observed with a high likelihood, and specifies the speaker ID associated with that pattern. The voice identification processing unit 151 then supplies that speaker ID and the voice data from the beamforming processing unit 143 to the identification information adding unit 152.
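
Likelihood-based selection of a standard pattern can be sketched as below. Note the assumptions: the embodiment uses the HMM method, whereas this sketch substitutes a much simpler model (one diagonal-covariance Gaussian per speaker) purely to show the selection step, and it picks the single pattern with the highest likelihood. The function names and the `(mean, variance)` pattern layout are hypothetical.

```python
import math


def log_likelihood(features, pattern):
    """Log-likelihood of a feature vector under a diagonal Gaussian.

    pattern is a (mean, variance) pair of equal-length sequences; this is a
    simplified stand-in for the standard pattern, not an HMM.
    """
    mean, var = pattern
    total = 0.0
    for x, m, v in zip(features, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total


def identify_speaker(features, patterns):
    """Return the speaker ID whose pattern gives the highest log-likelihood."""
    return max(patterns, key=lambda sid: log_likelihood(features, patterns[sid]))
```

Working in the log domain avoids numerical underflow when many feature dimensions are multiplied together, which is the usual reason for comparing log-likelihoods rather than raw likelihoods.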

The process then proceeds to step S77, where the identification information adding unit 152 determines whether the speaker ID from the sound source localization detection unit 142 (the speaker ID associated with the speaker direction closest to the direction estimated by the sound source localization detection unit 142 in step S72) matches the speaker ID obtained as the result of the speaker identification and supplied from the voice identification processing unit 151 in step S76.

If the identification information adding unit 152 determines in step S77 that the speaker ID from the sound source localization detection unit 142 does not match the speaker ID obtained as the result of the speaker identification by the voice identification processing unit 151, the process proceeds to step S78, where predetermined error processing is performed.

As the error processing in step S78, for example, the camera module 2 can acquire omnidirectional image data again, and the speaker directions and beamforming widths can be recalculated. In this way, when a speaker moves and the speaker direction or the optimum beamforming width changes, the speaker direction and the beamforming width can be updated in real time.

On the other hand, if it is determined in step S77 that the speaker ID from the sound source localization detection unit 142 matches the speaker ID obtained as the result of the speaker identification by the voice identification processing unit 151, the process proceeds to step S79. There, the identification information adding unit 152 takes the matching speaker ID as the final speaker identification result and acquires the face image data associated with that speaker ID from the speaker information DB 132 through the DB management unit 131. The identification information adding unit 152 then supplies the face image data, together with the voice data supplied from the voice identification processing unit 151, to the output control unit 153, and ends the speaker identification process.
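
The branch between steps S78 and S79 amounts to a consistency check between two independently obtained speaker IDs. The following is a minimal sketch under stated assumptions: the function name, the `face_images` dictionary and the `on_mismatch` callback (standing in for the predetermined error processing) are all hypothetical.

```python
def finalize_identification(direction_id, voice_id, face_images, on_mismatch):
    """Accept the speaker ID only if both estimates agree.

    direction_id: ID of the speaker nearest to the estimated sound source direction.
    voice_id:     ID obtained from speaker identification on the voice features.
    face_images:  mapping from speaker ID to registered face image data.
    on_mismatch:  callback standing in for the error processing of step S78.
    """
    if direction_id != voice_id:
        on_mismatch()  # e.g. re-acquire the image and recalculate directions
        return None
    # Step S79: the agreed ID is the final result; fetch its face image.
    return direction_id, face_images[direction_id]
```

Returning `None` on a mismatch makes it explicit that no identification result is produced for that utterance in this sketch; what the actual device outputs in that case depends on the error processing chosen.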

The output control unit 153 displays the face image and outputs the voice.

As described above, in the tabletop speaker identification device 1, a speaker ID is taken as the final speaker identification result only when the speaker ID of the speaker whose direction is closest to the estimated sound source direction matches the speaker ID obtained as the result of the speaker identification by the voice identification processing unit 151. The result is therefore more accurate than when no such determination is made.

In addition, because the voice data used to generate the standard patterns and the voice data used to identify the speaker are acquired in the same environment, the accuracy of the speaker identification in the voice identification processing unit 151 can be improved.

In step S72 of FIG. 8, the sound source localization detection unit 142 may estimate the sound source directions of a plurality of utterances from the voice data of the microphones 3-1 to 3-4, that is, a plurality of speakers may speak at the same time so that a plurality of sound source directions are estimated. In that case, the subsequent beamforming processing unit 143 performs beamforming using the speaker direction and the beamforming width associated with the speaker ID of each of the plurality of speakers supplied from the sound source localization detection unit 142, and the voice identification processing unit 151 performs speaker identification based on each resulting set of voice data. Each speaker can therefore be identified even when a plurality of speakers speak at the same time.

Furthermore, because faces are detected from the omnidirectional image data, the work of separately registering the face image data of each speaker, which is displayed as the speaker identification result, can be omitted. Although the camera module 2 obtains omnidirectional image data in the above description, it may instead obtain image data covering only part of the full circumference.

The present invention is applicable not only to the tabletop speaker identification device 1 but also, for example, to audiovisual technology for robots. When the present invention is applied to the audiovisual technology of a robot, each speaker can be identified even when a plurality of speakers speak at the same time within the robot's field of view.

In the present embodiment, the voice identification processing unit 151 extracts the voice features by a discrete Fourier transform, but the method of extracting the voice features is not limited to the discrete Fourier transform.
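
For reference, a discrete Fourier transform of one frame of samples can be computed directly from its definition, as sketched below. The choice of the magnitude spectrum as the feature, the function name, and the framing are assumptions for illustration; the specification does not state which quantities derived from the transform are used as voice features.

```python
import cmath


def dft_magnitudes(frame):
    """Magnitude spectrum of a real-valued frame via the direct DFT definition.

    Returns the magnitudes for bins 0 .. N/2, which are the non-redundant
    bins for real input. This O(N^2) form is for clarity, not speed.
    """
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(abs(s))
    return mags
```

A frame of eight samples containing exactly two cycles of a unit-amplitude cosine, `[1, 0, -1, 0, 1, 0, -1, 0]`, yields a magnitude of 4 (that is, N/2) in bin 2 and approximately zero in the other bins. In practice a fast Fourier transform would be used instead of this direct form.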

Likewise, the speaker identification in the voice identification processing unit 151 may be performed by a method other than the HMM method.

When determining, in step S42 of FIG. 7 and in step S71 of FIG. 8, whether voice data has been supplied to the sound source localization detection unit 142, an utterance section, which is a section in which a speaker is speaking, may be detected from the sound signal output by the microphone array 3, and voice data may be determined to have been supplied when an utterance section is detected.
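
The specification does not say how utterance sections are detected. One common and simple approach, shown here only as an assumed example, is to threshold the short-time energy of the signal frame by frame. The function name, the frame length and the threshold value are all hypothetical parameters.

```python
def detect_utterance_sections(samples, frame_len, threshold):
    """Return (start, end) sample index pairs of sections judged to be speech.

    A frame counts as speech when its mean squared amplitude is at least
    `threshold`. Consecutive speech frames are merged into one section;
    `end` is exclusive. Trailing samples that do not fill a frame are ignored.
    """
    sections = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy >= threshold and start is None:
            start = i * frame_len            # speech begins at this frame
        elif energy < threshold and start is not None:
            sections.append((start, i * frame_len))  # speech ended before this frame
            start = None
    if start is not None:
        sections.append((start, n_frames * frame_len))  # speech ran to the end
    return sections
```

Under this sketch, "voice data has been supplied" would correspond to the returned list being non-empty. A fixed energy threshold is sensitive to background noise, so a practical detector would likely adapt the threshold or use additional cues.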

The program with which the CPU 51 of FIG. 3 executes the series of processes described above can be downloaded from a download site and installed. The program can also be installed from a recording medium.

The recording medium on which the program is recorded can be distributed as package media consisting of the magnetic disk 61, the optical disk 62, the magneto-optical disk 63, the semiconductor memory 64, or the like.

In this specification, the processing of the steps described in the flowcharts includes not only processing performed in time series in the described order but also processing that is executed in parallel or individually and not necessarily in time series.

The series of processes described above is executed by software in the above description, but it can also be executed by dedicated hardware.

FIG. 1 is a diagram showing a usage example of an embodiment of a tabletop speaker identification device to which the present invention is applied.
FIG. 2 is an enlarged perspective view of the tabletop speaker identification device of FIG. 1.
FIG. 3 is a block diagram showing a hardware configuration example of the tabletop speaker identification device.
FIG. 4 is a block diagram showing a functional configuration example of the information processing unit of FIG. 2.
FIG. 5 is a diagram explaining beamforming.
FIG. 6 is a flowchart explaining the process of obtaining the speaker directions and beamforming widths.
FIG. 7 is a flowchart explaining the process of registering standard patterns of voice data in the speaker information DB.
FIG. 8 is a flowchart explaining the speaker identification process.

Explanation of symbols

1 tabletop speaker identification device, 2 camera module, 3 microphone array, 4 information processing unit, 5 display unit, 51 CPU, 52 ROM, 53 RAM, 54 bus, 55 input/output interface, 56 input unit, 57 output unit, 58 storage unit, 59 communication unit, 60 drive, 61 magnetic disk, 62 optical disk, 63 magneto-optical disk, 64 semiconductor memory, 121 face detection unit, 122 beam width calculation unit, 131 DB management unit, 132 speaker information DB, 141 microphone array I/F, 142 sound source localization detection unit, 143 beamforming processing unit, 151 voice identification processing unit, 152 identification information adding unit, 153 output control unit, 154 minutes DB

Claims

An information processing apparatus comprising:
imaging means for imaging one or more speakers and outputting the resulting image;
extraction range calculation means for calculating, based on the positional relationship of the speakers in the image output by the imaging means, an extraction range that is a range of directions from which voice data of each speaker is to be extracted;
voice data extraction means for extracting, from the voice data of utterances by the speakers, the voice data within the extraction range calculated by the extraction range calculation means;
storage means for storing, for each speaker, a standard pattern for speaker identification based on the voice data extracted by the voice data extraction means;
speaker identification means for performing speaker identification using voice data newly extracted by the voice data extraction means and the standard patterns stored in the storage means;
sound source direction estimation means for estimating a sound source direction based on the voice data newly extracted by the voice data extraction means; and
determination means for determining whether the speaker located in the sound source direction estimated by the sound source direction estimation means matches the speaker identified by the speaker identification means.

The information processing apparatus according to claim 1, further comprising face detection means for detecting the faces of the speakers from the image output by the imaging means,
wherein the extraction range calculation means calculates the extraction range based on the positional relationship of the faces detected by the face detection means.

An information processing method comprising:
an imaging step of imaging one or more speakers and outputting the resulting image;
an extraction range calculation step of calculating, based on the positional relationship of the speakers in the image output in the imaging step, an extraction range that is a range of directions from which voice data of each speaker is to be extracted;
a voice data extraction step of extracting, from the voice data of utterances by the speakers, the voice data within the extraction range calculated in the extraction range calculation step;
a storage step of storing, for each speaker, a standard pattern for speaker identification based on the voice data extracted in the voice data extraction step;
a speaker identification step of performing speaker identification using voice data newly extracted in the voice data extraction step and the standard patterns stored in the storage step;
a sound source direction estimation step of estimating a sound source direction based on the voice data newly extracted in the voice data extraction step; and
a determination step of determining whether the speaker located in the sound source direction estimated in the sound source direction estimation step matches the speaker identified in the speaker identification step.

A computer-executable program comprising:
an imaging step of imaging one or more speakers and outputting the resulting image;
an extraction range calculation step of calculating, based on the positional relationship of the speakers in the image output in the imaging step, an extraction range that is a range of directions from which voice data of each speaker is to be extracted;
a voice data extraction step of extracting, from the voice data of utterances by the speakers, the voice data within the extraction range calculated in the extraction range calculation step;
a storage control step of controlling storage, for each speaker, of a standard pattern for speaker identification based on the voice data extracted in the voice data extraction step;
a speaker identification step of performing speaker identification using voice data newly extracted in the voice data extraction step and the standard patterns stored under the control of the storage control step;
a sound source direction estimation step of estimating a sound source direction based on the voice data newly extracted in the voice data extraction step; and
a determination step of determining whether the speaker located in the sound source direction estimated in the sound source direction estimation step matches the speaker identified in the speaker identification step.

A recording medium on which a computer-executable program is recorded, the program comprising:
an imaging step of imaging one or more speakers and outputting the resulting image;
an extraction range calculation step of calculating, based on the positional relationship of the speakers in the image output in the imaging step, an extraction range that is a range of directions from which voice data of each speaker is to be extracted;
a voice data extraction step of extracting, from the voice data of utterances by the speakers, the voice data within the extraction range calculated in the extraction range calculation step;
a storage control step of controlling storage, for each speaker, of a standard pattern for speaker identification based on the voice data extracted in the voice data extraction step;
a speaker identification step of performing speaker identification using voice data newly extracted in the voice data extraction step and the standard patterns stored under the control of the storage control step;
a sound source direction estimation step of estimating a sound source direction based on the voice data newly extracted in the voice data extraction step; and
a determination step of determining whether the speaker located in the sound source direction estimated in the sound source direction estimation step matches the speaker identified in the speaker identification step.