JP4822458B2

JP4822458B2 - Interface device and interface method

Info

Publication number: JP4822458B2
Application number: JP2008132542A
Authority: JP
Inventors: 晃佐宗
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2008-05-20
Filing date: 2008-05-20
Publication date: 2011-11-24
Anticipated expiration: 2028-05-20
Also published as: JP2009282644A

Description

The present invention relates to an interface device and an interface method for inputting user operations to a computer or the like by indicating a position in three-dimensional space with breath or voice.

Conventionally, there are many pointing devices, such as the mouse, touch pad, trackball, and pointing stick, but all of them are intended for inputting a position on a two-dimensional plane. In a three-dimensional CAD system, for example, techniques already exist for controlling the orientation of a three-dimensional model displayed on the screen, or the position of the viewpoint, with a conventional mouse or the like. Because these techniques input three-dimensional spatial information through a combination of two-dimensional movements, however, it is difficult to specify the desired position or viewpoint.

Solving this problem requires a pointing device that allows three-dimensional information to be input intuitively. For example, pointing devices such as the Spaceball and the SpaceMouse, which allow three-dimensional input by adding sensor rotation axes to a conventional mouse or joystick, are already in practical use. Devices that indicate a three-dimensional position through a combination of arms are also being developed.

Conventional pointing devices require the user to grip or touch some device during operation. For users with impaired hand function, for example, such devices are difficult to handle, which becomes a barrier to accessing information terminals that rely on them.

To address this problem, Patent Document 1 (JP 2004-280301 A), for example, describes a pointing device that moves a cursor smoothly as the user, without wearing any special equipment, moves the mouth or face while blowing onto a microphone array.

The technique described in Patent Document 2 (JP 2007-228135 A) uses a microphone array to estimate the position of a user who speaks within a specific region, even in an environment with ambient noise.

Patent Document 1: JP 2004-280301 A
Patent Document 2: JP 2007-228135 A

However, the technique described in Patent Document 1 cannot distinguish the user's breath sound from ambient noise, so noise easily causes malfunctions. In addition, because it uses a microphone array with microphones arranged on a plane and senses the movement of the user's mouth or face two-dimensionally, it shares the limitation of conventional two-dimensional pointing devices: inputting a three-dimensional position is difficult.

The technique described in Patent Document 2 uses a microphone array to estimate the position of a user who speaks within a specific region, even in an environment with ambient noise. However, because it estimates the utterance position as a two-dimensional position, it does not function as a pointing device for inputting three-dimensional information.

In view of these problems with the prior art, an object of the present invention is to provide an interface device that can three-dimensionally identify the position from which a user's breath sound or voice is emitted, and similar quantities, even in a noisy environment.

To solve the above problem, the invention according to claim 1 is an interface device comprising: a microphone array in which a plurality of microphones are provided in a predetermined arrangement; and utterance position specifying means for three-dimensionally specifying, based on audio data acquired by the microphone array, the utterance position of a sound emitted from the user's nasal or oral cavity.

The invention according to claim 2 is the interface device according to claim 1, further comprising utterance position detection signal generating means for generating a signal based on the utterance position specified by the utterance position specifying means.

The invention according to claim 3 is an interface device comprising: a microphone array in which a plurality of microphones are provided in a predetermined arrangement; and vector specifying means for three-dimensionally specifying, based on audio data acquired by the microphone array, a vector connecting the utterance start position and the utterance end position of a sustained sound emitted from the user's nasal or oral cavity.

The invention according to claim 4 is an interface device comprising: a microphone array in which a plurality of microphones are provided in a predetermined arrangement; and trajectory specifying means for three-dimensionally specifying, based on audio data acquired by the microphone array, the trajectory from the utterance start position to the utterance end position of a sustained sound emitted from the user's nasal or oral cavity.

The invention according to claim 5 is an interface device comprising: a microphone array in which a plurality of microphones are provided in a predetermined arrangement; and utterance direction specifying means for specifying, based on audio data acquired by the microphone array, the azimuth angle and the elevation angle of the direction from which a sound emitted from the user's nasal or oral cavity arrives.
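As an illustration of what specifying a direction by azimuth and elevation can mean, the following minimal sketch converts a three-dimensional direction vector into the two angles. The coordinate convention (x forward, y to the left, z up, angles in degrees) and the function name `direction_to_angles` are assumptions made for this example; the specification does not define them.

```python
import math
from typing import Tuple


def direction_to_angles(x: float, y: float, z: float) -> Tuple[float, float]:
    """Return (azimuth, elevation) in degrees for a direction vector.

    Assumed convention (not from the specification): azimuth is measured
    in the horizontal x-y plane from the x axis, elevation is measured
    upward from that plane toward the z axis.
    """
    horizontal = math.hypot(x, y)
    if horizontal == 0.0 and z == 0.0:
        raise ValueError("direction vector must be non-zero")
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, horizontal))
    return azimuth, elevation
```

A sound arriving from straight ahead along the assumed x axis yields an azimuth and elevation of zero, and a sound from directly above yields an elevation of 90 degrees.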

The invention according to claim 6 is the interface device according to claim 5, further comprising utterance direction detection signal generating means for generating a signal based on the azimuth angle and the elevation angle specified by the utterance direction specifying means.

The invention according to claim 7 is an interface device comprising: a microphone array in which a plurality of microphones are provided in a predetermined arrangement; and direction vector specifying means for specifying, based on audio data acquired by the microphone array, a vector connecting the azimuth and elevation angles of the utterance start direction and the azimuth and elevation angles of the utterance end direction of a sustained sound emitted from the user's nasal or oral cavity.

The invention according to claim 8 is an interface device comprising: a microphone array in which a plurality of microphones are provided in a predetermined arrangement; and direction trajectory specifying means for specifying, based on audio data acquired by the microphone array, the trajectory from the azimuth and elevation angles of the utterance start direction to the azimuth and elevation angles of the utterance end direction of a sustained sound emitted from the user's nasal or oral cavity.

The invention according to claim 9 is an interface device comprising: a microphone array in which a plurality of microphones are provided in a predetermined arrangement; and time specifying means for specifying, based on audio data acquired by the microphone array, the time from the start to the end of utterance of a sustained sound emitted from the user's nasal or oral cavity.

The invention according to claim 10 is an interface device comprising: a microphone array in which a plurality of microphones are provided in a predetermined arrangement; and moving speed specifying means for specifying, based on audio data acquired by the microphone array, the moving speed of a sustained sound emitted from the user's nasal or oral cavity.
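The quantities named in these claims for a sustained sound (the vector from start to end position, the duration, and the moving speed) can all be derived from a time-stamped sequence of estimated positions. The following is a minimal sketch under that assumption; the sample layout `(t, x, y, z)`, the function names, and the choice of mean speed along the path are illustrative and are not taken from the specification.

```python
import math
from typing import List, Tuple

# One estimated position of the sound source: (time, x, y, z).
# The units are whatever the localization stage reports (assumed here).
Sample = Tuple[float, float, float, float]


def start_end_vector(track: List[Sample]) -> Tuple[float, float, float]:
    """Vector from the utterance start position to the end position."""
    _, x0, y0, z0 = track[0]
    _, x1, y1, z1 = track[-1]
    return (x1 - x0, y1 - y0, z1 - z0)


def duration(track: List[Sample]) -> float:
    """Time from the start to the end of the utterance."""
    return track[-1][0] - track[0][0]


def path_length(track: List[Sample]) -> float:
    """Length of the trajectory, summed over consecutive samples."""
    return sum(math.dist(a[1:], b[1:]) for a, b in zip(track, track[1:]))


def mean_speed(track: List[Sample]) -> float:
    """Mean moving speed along the trajectory (0.0 if no time elapsed)."""
    elapsed = duration(track)
    return path_length(track) / elapsed if elapsed > 0 else 0.0
```

Using the path length rather than the straight-line distance means that a curved movement is not underestimated; a straight-line variant would divide the norm of `start_end_vector` by the duration instead.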

The invention according to claim 11 is the interface device according to any one of claims 1 to 10, further comprising: feature point specifying means for specifying a feature point of the sound emitted from the user's nasal or oral cavity; and feature point detection signal generating means for generating a signal based on the feature point specified by the feature point specifying means.

The invention according to claim 12 is the interface device according to claim 11, wherein the feature point of the sound emitted from the user's nasal or oral cavity that is extracted by the feature point extraction means is any one of a feature point of the fundamental frequency, a feature point concerning whether the sound is unvoiced or voiced, a feature point of the volume, a feature point concerning the type of sound source, and a feature point concerning the content of the utterance.

The invention according to claim 13 is an interface method comprising an utterance position specifying step of three-dimensionally specifying, based on audio data acquired by a microphone array in which a plurality of microphones are provided in a predetermined arrangement, the utterance position of a sound emitted from the user's nasal or oral cavity.

The invention according to claim 14 is the interface method according to claim 13, further comprising an utterance position detection signal generation step of generating a signal based on the utterance position specified in the utterance position specifying step.

The invention according to claim 15 is an interface method comprising a vector specifying step of three-dimensionally specifying, based on audio data acquired by a microphone array in which a plurality of microphones are provided in a predetermined arrangement, a vector connecting the utterance start position and the utterance end position of a sustained sound emitted from the user's nasal or oral cavity.

The invention according to claim 16 is an interface method comprising a trajectory specifying step of three-dimensionally specifying, based on audio data acquired by a microphone array in which a plurality of microphones are provided in a predetermined arrangement, the trajectory from the utterance start position to the utterance end position of a sustained sound emitted from the user's nasal or oral cavity.

The invention according to claim 17 is an interface method comprising an utterance direction specifying step of specifying, based on audio data acquired by a microphone array in which a plurality of microphones are provided in a predetermined arrangement, the azimuth angle and the elevation angle of the direction from which a sound emitted from the user's nasal or oral cavity arrives.

The invention according to claim 18 is the interface method according to claim 17, further comprising an utterance direction detection signal generation step of generating a signal based on the azimuth angle and the elevation angle specified in the utterance direction specifying step.

The invention according to claim 19 is an interface method comprising a direction vector specifying step of specifying, based on audio data acquired by a microphone array in which a plurality of microphones are provided in a predetermined arrangement, a vector connecting the azimuth and elevation angles of the utterance start direction and the azimuth and elevation angles of the utterance end direction of a sustained sound emitted from the user's nasal or oral cavity.

The invention according to claim 20 is an interface method comprising a direction trajectory specifying step of specifying, based on audio data acquired by a microphone array in which a plurality of microphones are provided in a predetermined arrangement, the trajectory from the azimuth and elevation angles of the utterance start direction to the azimuth and elevation angles of the utterance end direction of a sustained sound emitted from the user's nasal or oral cavity.

The invention according to claim 21 is an interface method comprising a time specifying step of specifying, based on audio data acquired by a microphone array in which a plurality of microphones are provided in a predetermined arrangement, the time from the start to the end of utterance of a sustained sound emitted from the user's nasal or oral cavity.

The invention according to claim 22 is an interface method comprising a moving speed specifying step of specifying, based on audio data acquired by a microphone array in which a plurality of microphones are provided in a predetermined arrangement, the moving speed of a sustained sound emitted from the user's nasal or oral cavity.

The invention according to claim 23 is the interface method according to any one of claims 13 to 22, further comprising: a feature point specifying step of specifying a feature point of the sound emitted from the user's nasal or oral cavity; and a feature point detection signal generation step of generating a signal based on the feature point specified in the feature point specifying step.

The invention according to claim 24 is the interface method according to claim 23, wherein the feature point of the sound emitted from the user's nasal or oral cavity that is extracted in the feature point extraction step is any one of a feature point of the fundamental frequency, a feature point concerning whether the sound is unvoiced or voiced, a feature point of the volume, a feature point concerning the type of sound source, and a feature point concerning the content of the utterance.

According to the interface device and the interface method of the present invention, items such as the position from which the user's breath sound or voice is emitted are identified three-dimensionally even in a noisy environment, and the computer can execute processing that corresponds to the identified items.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a perspective view of the external appearance of the interface device according to an embodiment of the present invention, and FIG. 2 is a block diagram of that interface device. In FIGS. 1 and 2, reference numeral 100 denotes the interface device, 200 a microphone array, 201 a silicon microphone, 202 a windscreen, 210 a stand, 211 a main column, 212 a left column, 213 a right column, 280 a microphone amplifier, 290 an AD conversion unit, 300 a CPU, 400 a storage unit, and 500 a connection port unit.

FIG. 1 shows the configuration of the user interface portion of the interface device 100, which, as illustrated, functions as an input device for a computer or the like (not shown) based on sounds emitted from the user's nasal or oral cavity. The computer may be, for example, a general-purpose personal computer, but is not limited to this; the device can serve as an interface to a variety of computers. The interface device 100 of the present invention can also be used for input to electrical appliances and vehicles, not only to computers.

Externally, the interface device 100 consists of a main column 211 erected on a stand 210, a left column 212 and a right column 213 that branch to the left and right from the main column 211, and a group of microphones provided on each column, and it can be placed on a desktop. More specifically, silicon microphones 201 are mounted on a substrate (not shown) at 3 cm intervals on each of the main column 211, the left column 212, and the right column 213, and the resulting twelve microphones in total make up the microphone array 200. The interface device 100 of this embodiment is described as using twelve silicon microphones 201, but any number of three or more suffices, and the present invention is not limited to twelve. Too few silicon microphones 201 degrade noise robustness, while too many increase the load of processing the audio data, so this embodiment uses twelve, as stated above. Each silicon microphone 201 is a small device of about 3 mm × 5 mm.

The four silicon microphones 201 on each column are covered by a windscreen 202, which prevents wind noise from being picked up. The microphone groups on the left column 212 and the right column 213 are laid out roughly in the shape of the katakana character 「ハ」, and the microphone group on the main column 211 is arranged vertically.

FIG. 2 is a block diagram of the interface device 100. The output of the microphone array 200, which consists of twelve silicon microphones 201, is amplified by the microphone amplifier 280, converted from analog to digital by the AD conversion unit 290, and then input to the CPU 300. The storage unit 400 consists of a ROM that holds the program run by the CPU 300 and a RAM that serves as the work area of the CPU 300. The CPU 300 operates according to the program stored in the storage unit 400 and thereby functions as the interface device 100 of the present invention.

Each of the means recited in the apparatus claims, such as the "utterance position specifying means", "utterance position detection signal generating means", "vector specifying means", "trajectory specifying means", "utterance direction specifying means", "utterance direction detection signal generating means", "direction vector specifying means", "direction trajectory specifying means", "time specifying means", "moving speed specifying means", and "feature point specifying means", is realized by the CPU 300 operating according to the program stored in the storage unit 400.

Likewise, each of the steps recited in the method claims, such as the "utterance position specifying step", "utterance position detection signal generation step", "vector specifying step", "trajectory specifying step", "utterance direction specifying step", "utterance direction detection signal generation step", "direction vector specifying step", "direction trajectory specifying step", "time specifying step", "moving speed specifying step", and "feature point specifying step", is realized by the CPU 300 operating according to the program stored in the storage unit 400.

The storage unit 400 also stores an event database, described later. The connection port unit 500 is an interface for connecting to other equipment such as a computer, and a well-known interface such as USB can be used.

The ways in which the interface device 100 configured as above is used are described next. Various embodiments are described individually below, and each can be realized by changing the program stored in the storage unit 400. An interface device that combines any of the embodiments described individually below is also included in the interface device of the present invention.

FIG. 3 shows an example of how the interface device according to an embodiment of the present invention is used. In this embodiment, the interface device 100 identifies which region the utterance position estimated in three-dimensional space belongs to.

In the following, the term "utterance" covers every kind of sound emitted from the user's nasal or oral cavity. Such sounds include, for example, a tongue click, but typical use is expected to involve short sounds such as "shu!" or "pa!" and sustained sounds such as "shuu" or "aah".

In the embodiment shown in FIG. 3, an utterance detection region is defined for the user, only the user's utterances inside this region are detected, and sound from outside the region is treated as noise. The defined utterance detection region is further divided into, for example, eight sub-regions as illustrated, and the device identifies in which of the eight sub-regions the utterance was detected. This corresponds to the "utterance position specifying means" and the "utterance position specifying step" in the claims. When an utterance is detected in the front upper-left sub-region, for example, the device generates a signal corresponding to a left mouse click and sends it to the computer. Generating such a signal is expressed in the claims as the "utterance position detection signal generating means" and the "utterance position detection signal generation step".

The processing performed by the interface device in this embodiment is described next. FIG. 4 is a flowchart of the processing of the interface device according to an embodiment of the present invention.

When processing starts in step S100, it proceeds to step S101, where audio data is acquired from the microphone array 200. More specifically, in this step the analog audio signal output from the microphone array 200 is amplified by the microphone amplifier 280, converted into a digital signal by the AD conversion unit 290, and temporarily stored in the storage unit 400.

In the next step, S102, three-dimensional information on the user's utterance position and on the arrival direction of ambient noise is identified. More specifically, the user's utterance position and the ambient noise arrival direction are identified in three-dimensional space using the techniques described in the specifications and drawings of JP 2007-228135 A, JP 2008-67854 A, and Japanese Patent Application No. 2006-240721, all by the inventors of the present application.

Next, step S103 determines whether the user has made an utterance. This step detects the user's utterance using the technique described in Japanese Patent Application No. 2006-240721. If no utterance is detected, processing repeats from step S101; if an utterance is detected, processing proceeds to step S104.

Step S104 suppresses ambient noise. Using the technique described in Japanese Patent Application No. 2006-240721, this step performs sound source separation that suppresses ambient noise and enhances the user's utterance.

Step S105 identifies the three-dimensional utterance position. More specifically, it identifies which region the utterance position estimated in three-dimensional space belongs to. For example, the user's utterance detection region is defined as shown in FIG. 3 and further divided into eight sub-regions, and the step identifies in which of the eight sub-regions the utterance was detected.
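The region lookup of step S105 can be sketched as follows. This is a minimal illustration that assumes the utterance detection region is an axis-aligned box split at its midpoint along each axis into eight sub-regions; the specification does not state the shape or the split, and the names `Box` and `classify_region` are hypothetical.

```python
from typing import Optional, Tuple

# Detection region as ((x_min, x_max), (y_min, y_max), (z_min, z_max)).
Box = Tuple[Tuple[float, float], Tuple[float, float], Tuple[float, float]]


def classify_region(
    position: Tuple[float, float, float], box: Box
) -> Optional[Tuple[int, int, int]]:
    """Return the sub-region index of an estimated utterance position.

    Each index component is 0 for the lower half and 1 for the upper
    half of the box along that axis, giving 2 x 2 x 2 = 8 sub-regions.
    Positions outside the box return None, so that the caller can treat
    the sound as noise.
    """
    index = []
    for value, (low, high) in zip(position, box):
        if not (low <= value < high):
            return None
        midpoint = (low + high) / 2.0
        index.append(0 if value < midpoint else 1)
    return (index[0], index[1], index[2])
```

Which index corresponds to "front", "left", or "upper" depends on how the axes are oriented relative to the device, which is left open here.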

Step S106 identifies the event. The event database held in the storage unit 400 stores events that correspond to the utterance detection position and other conditions. That is, combinations of conditions such as the utterance detection position with events are defined and held in the event database.

イベントデータベースには、例えば、図３の上段手前の左側の領域で短時間の発声として定義したイベントがあらかじめ登録されている。そして、ステップＳ１０６のイベントの特定処理では、発声位置が前述の位置になっているかを判断し、発声継続時間があるしきい値以下であるかを判断し、更に発声が無声音であるかなどを判断し、全ての条件が適合したときにそのイベントが発生したと判断する。 In the event database, for example, an event defined as a short utterance in the upper-front-left region of FIG. 3 is registered in advance. In the event identification process of step S106, it is determined whether the utterance position is in that region, whether the utterance duration is at or below a certain threshold, and whether the utterance is an unvoiced sound, among other conditions. The event is judged to have occurred when all of the conditions are met.
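The matching against the event database described above might look like the following sketch. The record type `EventRule`, its field names, and the function `match_events` are hypothetical, chosen only to mirror the three conditions named in the text (position, duration threshold, unvoiced sound).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRule:
    """One hypothetical entry of the event database."""
    name: str
    region: int            # region index produced by the position step
    max_duration_s: float  # utterance must last no longer than this
    unvoiced: bool         # True if the rule requires an unvoiced sound


def match_events(rules, region, duration_s, unvoiced):
    """Return the names of all rules whose conditions are all satisfied."""
    return [rule.name for rule in rules
            if rule.region == region
            and duration_s <= rule.max_duration_s
            and rule.unvoiced == unvoiced]
```

An empty result corresponds to the "no event" branch of step S107, in which processing returns to step S101.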

ステップＳ１０７では、該当イベントがあるかが判定される。ステップＳ１０６で、イベントデータベースに適合するイベントが検出されたかどうかを調べ、もしイベントが一つも検出されなければ、ステップＳ１０１へ戻る。もし、イベントが検出された場合は、ステップＳ１０８へ進む。 In step S107, it is determined whether there is a corresponding event. In step S106, it is checked whether an event that matches the event database is detected. If no event is detected, the process returns to step S101. If an event is detected, the process proceeds to step S108.

ステップＳ１０８では、コンピュータ側のアプリケーションレベルに対して、イベント検出信号を送信する。 In step S108, an event detection signal is transmitted to the application level on the computer side.

アプリケーション側の典型的な処理が点線の囲み中に示されている。以下、アプリケーション側で想定される典型的な処理について説明する。ステップＳ２０１では、本発明のインターフェイス装置から送られるイベント検出信号の受信を待ち続ける。もし、イベント検出信号を受信した場合は、ステップＳ２０２へ移る。ステップＳ２０２では、受信したイベント検出信号に対応した適切な処理を実行する。そして、ステップＳ２０１へ戻る。 Typical processing on the application side is shown in the dotted box. Hereinafter, typical processing assumed on the application side will be described. In step S201, it continues to wait for the reception of an event detection signal sent from the interface device of the present invention. If an event detection signal is received, the process proceeds to step S202. In step S202, an appropriate process corresponding to the received event detection signal is executed. Then, the process returns to step S201.
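The application-side loop of steps S201 and S202 can be sketched with a blocking queue. This is only an illustration: the use of `queue.Queue` as the transport, the handler table, and the stop token are assumptions, since the description does not specify how the event detection signal is delivered.

```python
import queue


def run_application(events, handlers, stop_token=None):
    """Wait for event detection signals and dispatch them to handlers."""
    results = []
    while True:
        signal = events.get()          # S201: block until a signal arrives
        if signal is stop_token:       # hypothetical way to end the loop
            break
        handler = handlers.get(signal)
        if handler is not None:
            results.append(handler())  # S202: run the matching processing
    return results
```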

このように本実施形態によれば、雑音がある環境下でもユーザーの呼気音や発声の発声位置などが３次元的に特定され、特定された事項に応じた処理をコンピュータ側で実行することができるようになる。 As described above, according to this embodiment, the position and other attributes of the user's breath sound or utterance are identified three-dimensionally even in a noisy environment, and processing corresponding to the identified items can be executed on the computer side.

なお、以上に説明した実施形態においては、発声検出領域を８つの領域に分割する例について説明したが、これに限らず、例えば図５に示すような４つに分割することによって、コンピュータ側のディスプレイ部に表示されるカーソルの十字移動に利用することも可能である。図５は本発明の実施の形態に係るインターフェイス装置の他の領域分割例を示す図である。 In the embodiment described above, an example in which the utterance detection area is divided into eight areas has been described. However, the present invention is not limited to this, and for example, by dividing the utterance detection area into four areas as shown in FIG. It can also be used for cross-movement of the cursor displayed on the display unit. FIG. 5 is a diagram showing another example of area division of the interface device according to the embodiment of the present invention.

次に、本発明の他の実施形態について説明する。図６は本発明の他の実施の形態に係るインターフェイス装置の利用形態例を示す図である。本実施形態では、ユーザーによって「アー」などの継続音の発声があった場合にその継続音の発声開始位置（Ａ）と発声終了位置（Ｂ）との間を結ぶベクトルを３次元的に特定するものである。このような特定について、特許請求の範囲において、「ベクトル特定手段」、「ベクトル特定ステップ」として表現されている。 Next, another embodiment of the present invention will be described. FIG. 6 is a diagram showing an example of how the interface device according to another embodiment of the present invention is used. In this embodiment, when the user utters a sustained sound such as “aah”, the vector connecting the utterance start position (A) and the utterance end position (B) of that sustained sound is identified three-dimensionally. In the claims, this identification is expressed as “vector identification means” and “vector identification step”.

このような実施形態は、図４に示すフローチャートのステップＳ１０５における処理を、発声開始時の位置と発声終了時の位置を記録し、その２点を結ぶベクトルを特定する処理に変更することによって実現することが可能となる。 Such an embodiment can be realized by changing the processing in step S105 of the flowchart shown in FIG. 4 to processing that records the position at the start of the utterance and the position at its end and identifies the vector connecting the two points.
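The vector identification described above reduces to a component-wise difference of the two recorded positions. The following sketch assumes positions are given as (x, y, z) tuples; the function names are hypothetical.

```python
import math


def utterance_vector(start, end):
    """Vector from the utterance start position A to the end position B."""
    return tuple(b - a for a, b in zip(start, end))


def vector_length(vector):
    """Euclidean length of the vector."""
    return math.sqrt(sum(c * c for c in vector))
```

An application could, for example, map the direction of this vector to a movement command and its length to a movement amount; that mapping is a design choice outside the description.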

このように本実施形態によれば、雑音がある環境下でもユーザーの呼気音や発声の発声位置などが３次元内でベクトル的に特定され、特定された事項に応じた処理をコンピュータ側で実行することができるようになる。 As described above, according to this embodiment, the position and other attributes of the user's breath sound or utterance are identified as a vector in three-dimensional space even in a noisy environment, and processing corresponding to the identified items can be executed on the computer side.

次に、本発明の他の実施形態について説明する。図７は本発明の他の実施の形態に係るインターフェイス装置の利用形態例を示す図である。本実施形態では、ユーザーによって「アー」などの継続音の発声があった場合にその継続音の発声開始位置（Ａ）から発声終了位置（Ｂ）までの軌跡を３次元的に特定するものである。このような特定について、特許請求の範囲において、「軌跡特定手段」、「軌跡特定ステップ」として表現されている。 Next, another embodiment of the present invention will be described. FIG. 7 is a diagram showing an example of how the interface device according to another embodiment of the present invention is used. In this embodiment, when the user utters a sustained sound such as “aah”, the trajectory from the utterance start position (A) to the utterance end position (B) of that sustained sound is identified three-dimensionally. In the claims, this identification is expressed as “trajectory identification means” and “trajectory identification step”.

このような実施形態は、図４に示すフローチャートのステップＳ１０５における処理を、発声開始時の位置と発声終了時の位置を記録し、その間の軌跡を特定する処理に変更することによって実現することが可能となる。 Such an embodiment can be realized by changing the processing in step S105 of the flowchart shown in FIG. 4 to processing that records the position at the start of the utterance and the position at its end and identifies the trajectory between them.

このように本実施形態によれば、雑音がある環境下でもユーザーの呼気音や発声の発声位置などが３次元内で軌跡状に特定され、特定された事項に応じた処理をコンピュータ側で実行することができるようになる。 As described above, according to this embodiment, the position and other attributes of the user's breath sound or utterance are identified as a trajectory in three-dimensional space even in a noisy environment, and processing corresponding to the identified items can be executed on the computer side.

次に、本発明の他の実施形態について説明する。図８は本発明の他の実施の形態に係るインターフェイス装置の利用形態例を示す図である。本実施形態は、前記マイクロフォンアレイ２００により取得された音声データに基づいてユーザーの鼻口腔から発せられた音が到来する方向の方位角（θ）と仰角（φ）とを特定するものである。このような特定は、特許請求の範囲において、「発声方向特定手段」、「発声方向特定ステップ」として表現されている。 Next, another embodiment of the present invention will be described. FIG. 8 is a diagram showing an example of how the interface device according to another embodiment of the present invention is used. In this embodiment, the azimuth angle (θ) and elevation angle (φ) of the direction from which the sound emitted from the user's nose or mouth arrives are identified based on the audio data acquired by the microphone array 200. In the claims, this identification is expressed as “utterance direction identification means” and “utterance direction identification step”.

また、上記のようにユーザーの発声が到来する方向の方位角（θ）と仰角（φ）が特定されたとき、特定された方位角（θ）と仰角（φ）とに基づいた信号を生成する。このような機能は、特許請求の範囲においては、「発声方向検出信号生成手段」、「発声方向検出信号生成ステップ」と表現されている。 Further, when the azimuth angle (θ) and elevation angle (φ) of the direction from which the user's utterance arrives have been identified as described above, a signal based on the identified azimuth angle (θ) and elevation angle (φ) is generated. In the claims, this function is expressed as “utterance direction detection signal generation means” and “utterance direction detection signal generation step”.

このような実施形態は、図４に示すフローチャートのステップＳ１０５において、発声方向の特定処理を行うようにすることで、実現することが可能となる。また、ステップＳ１０６における発声方向である方位角（θ）と仰角（φ）に応じたイベントを定義しておく。 Such an embodiment can be realized by performing utterance direction identification in step S105 of the flowchart shown in FIG. 4. In addition, events corresponding to the utterance direction, that is, the azimuth angle (θ) and elevation angle (φ), are defined in advance for use in step S106.

より具体的には、ステップＳ１０５においては、３次元空間内で発声した音声の到来方向の方位角（θ）と仰角（φ）を推定し、その到来方向が方位角−仰角平面上でどの領域に属すかを特定する。例えば、図１０のＡに示すように、方位角−仰角平面を一定の間隔でグリッド状に分割する。そして、推定された到来方向が方位角−仰角平面内のどの領域内で検出されたかを特定する。図１０は本発明の他の実施の形態に係るインターフェイス装置で想定する方位角（θ）―仰角（φ）平面を示す図である。 More specifically, in step S105, the azimuth angle (θ) and elevation angle (φ) of the arrival direction of the voice uttered in three-dimensional space are estimated, and the region of the azimuth-elevation plane to which that arrival direction belongs is identified. For example, as shown at A in FIG. 10, the azimuth-elevation plane is divided into a grid at regular intervals. The step then identifies in which region of the azimuth-elevation plane the estimated arrival direction was detected. FIG. 10 is a diagram showing the azimuth (θ)-elevation (φ) plane assumed in the interface device according to another embodiment of the present invention.
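The grid lookup on the azimuth-elevation plane can be sketched as below. The angular ranges (azimuth in [-180, 180) degrees, elevation in [-90, 90] degrees), the 30-degree grid spacing, and the function name `grid_cell` are assumptions made for illustration; the description only states that the plane is divided at regular intervals.

```python
import math


def grid_cell(azimuth_deg, elevation_deg, step_deg=30.0):
    """Return (column, row) of the grid cell containing the arrival direction."""
    if not (-180.0 <= azimuth_deg < 180.0 and -90.0 <= elevation_deg <= 90.0):
        raise ValueError("direction outside the assumed angular range")
    column = int(math.floor((azimuth_deg + 180.0) / step_deg))
    row = int(math.floor((elevation_deg + 90.0) / step_deg))
    # An elevation of exactly +90 degrees would index one row too far; clamp it.
    last_row = int(math.ceil(180.0 / step_deg)) - 1
    return column, min(row, last_row)
```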

このように本実施形態によれば、雑音がある環境下でもユーザーの呼気音や発声の発声位置などの方向が特定され、特定された事項に応じた処理をコンピュータ側で実行することができるようになる。 As described above, according to this embodiment, the direction of the user's breath sound or utterance is identified even in a noisy environment, and processing corresponding to the identified items can be executed on the computer side.

次に、本発明の他の実施形態について説明する。図９は本発明の他の実施の形態に係るインターフェイス装置の利用形態例を示す図である。本実施形態では、ユーザーによって「アー」などの継続音の発声があった場合にその継続音の発声開始方向（Ａ）の方位角（θ_A）と仰角（φ_A）と発声終了方向（Ｂ）の方位角（θ_B）と仰角（φ_B）との間を結ぶベクトルを特定するものである。このような特定について、特許請求の範囲において、「方向ベクトル特定手段」、「方向ベクトル特定ステップ」として表現されている。なお、この実施形態における「ベクトル」は、図１０の方位角−仰角平面内に示されるものである。 Next, another embodiment of the present invention will be described. FIG. 9 is a diagram showing an example of how the interface device according to another embodiment of the present invention is used. In this embodiment, when the user utters a sustained sound such as “aah”, the vector connecting the utterance start direction (A), given by azimuth angle (θ_A) and elevation angle (φ_A), and the utterance end direction (B), given by azimuth angle (θ_B) and elevation angle (φ_B), is identified. In the claims, this identification is expressed as “direction vector identification means” and “direction vector identification step”. The “vector” in this embodiment is one drawn in the azimuth-elevation plane of FIG. 10.

このような実施形態は、図４に示すフローチャートのステップＳ１０５における処理を、発声開始時の方向と発声終了時の方向を記録し、その２点を結ぶベクトルを特定する処理に変更することによって実現することが可能となる。つまり、ステップＳ１０５において、３次元空間内で発声開始時の音声の到来方向（方位角と仰角）と発声終了時の音声の到来方向（方位角と仰角）を記録し、方位角−仰角平面上でその２点を結ぶベクトルを特定する。 Such an embodiment can be realized by changing the processing in step S105 of the flowchart shown in FIG. 4 to processing that records the direction at the start of the utterance and the direction at its end and identifies the vector connecting the two points. That is, in step S105, the arrival direction of the voice (azimuth and elevation) at the start of the utterance and that at the end of the utterance are recorded in three-dimensional space, and the vector connecting the two points on the azimuth-elevation plane is identified.

このように本実施形態によれば、雑音がある環境下でもユーザーの呼気音や発声の発声方向がベクトル的に特定され、特定された事項に応じた処理をコンピュータ側で実行することができるようになる。 As described above, according to this embodiment, the direction of the user's breath sound or utterance is identified as a vector even in a noisy environment, and processing corresponding to the identified items can be executed on the computer side.

次に、本発明の他の実施形態について説明する。図９は本発明の他の実施の形態に係るインターフェイス装置の利用形態例を示す図である。本実施形態では、ユーザーによって「アー」などの継続音の発声があった場合にその継続音の発声開始方向（Ａ）の方位角（θ_A）と仰角（φ_A）から、発声終了方向（Ｂ）の方位角（θ_B）と仰角（φ_B）までの軌跡を特定するものである。このような特定について、特許請求の範囲において、「方向軌跡特定手段」、「方向軌跡特定ステップ」として表現されている。なお、この実施形態における「軌跡」は、図１０の方位角−仰角平面内に示されるものである。 Next, another embodiment of the present invention will be described. FIG. 9 is a diagram showing an example of how the interface device according to another embodiment of the present invention is used. In this embodiment, when the user utters a sustained sound such as “aah”, the trajectory from the utterance start direction (A), given by azimuth angle (θ_A) and elevation angle (φ_A), to the utterance end direction (B), given by azimuth angle (θ_B) and elevation angle (φ_B), is identified. In the claims, this identification is expressed as “direction trajectory identification means” and “direction trajectory identification step”. The “trajectory” in this embodiment is one drawn in the azimuth-elevation plane of FIG. 10.

このような実施形態は、図４に示すフローチャートのステップＳ１０５における処理を、発声開始時の方向と発声終了時の方向を記録し、その２点を結ぶ軌跡を特定する処理に変更することによって実現することが可能となる。つまり、ステップＳ１０５において、３次元空間内で発声開始から発声終了までの音声の到来方向（方位角と仰角）を全て記録し、方位角−仰角平面上でその移動軌跡を特定する。 Such an embodiment can be realized by changing the processing in step S105 of the flowchart shown in FIG. 4 to processing that records the direction at the start of the utterance and the direction at its end and identifies the trajectory connecting the two points. That is, in step S105, all arrival directions of the voice (azimuth and elevation) from the start to the end of the utterance are recorded in three-dimensional space, and the movement trajectory on the azimuth-elevation plane is identified.

このように本実施形態によれば、雑音がある環境下でもユーザーの呼気音や発声の発声方向が軌跡的に特定され、特定された事項に応じた処理をコンピュータ側で実行することができるようになる。 As described above, according to this embodiment, the direction of the user's breath sound or utterance is identified as a trajectory even in a noisy environment, and processing corresponding to the identified items can be executed on the computer side.

次に、本発明の他の実施形態について説明する。図１１は本発明の他の実施の形態に係るインターフェイス装置の処理のフローチャートを示す図である。図１１のフローチャートにおいて、図４のフローチャートと同じ内容が記載されたステップは、同様の処理を示しているので省略する。図１１のフローチャートにおいて本実施形態と関連する部分のみを説明する。 Next, another embodiment of the present invention will be described. FIG. 11 is a flowchart of the processing of the interface device according to another embodiment of the present invention. In the flowchart of FIG. 11, steps labeled with the same content as in the flowchart of FIG. 4 represent the same processing, and their description is omitted. Only the parts of the flowchart of FIG. 11 relevant to this embodiment are described.

本実施形態においては、図１１のステップＳ３０５であり、ステップＳ３０５では、継続音の開始から終了までの時間を特定する処理を行う。このような特定処理については、特許請求の範囲において「時間特定手段」、「時間特定ステップ」として示されている。 The step relevant to this embodiment is step S305 in FIG. 11, in which the time from the start to the end of the sustained sound is identified. In the claims, this identification processing is indicated as “time identification means” and “time identification step”.

このように本実施形態によれば、雑音がある環境下でもユーザーの呼気音や発声の継続時間が特定され、特定された事項に応じた処理をコンピュータ側で実行することができるようになる。 As described above, according to the present embodiment, even in a noisy environment, a user's breath sound and duration of utterance are specified, and processing corresponding to the specified items can be executed on the computer side.

次に、本発明の他の実施形態について説明する。図１１は本発明の他の実施の形態に係るインターフェイス装置の処理のフローチャートを示す図である。図１１のフローチャートにおいて、図４のフローチャートと同じ内容が記載されたステップは、同様の処理を示しているので省略する。図１１のフローチャートにおいて本実施形態と関連する部分のみを説明する。 Next, another embodiment of the present invention will be described. FIG. 11 is a flowchart of the processing of the interface device according to another embodiment of the present invention. In the flowchart of FIG. 11, steps labeled with the same content as in the flowchart of FIG. 4 represent the same processing, and their description is omitted. Only the parts of the flowchart of FIG. 11 relevant to this embodiment are described.

本実施形態においては、図１１のステップＳ３０５であり、ステップＳ３０５では、継続音の発生位置の移動速度を特定する処理を行う。このような特定処理については、特許請求の範囲において「移動速度特定手段」、「移動速度特定ステップ」として示されている。 The step relevant to this embodiment is step S305 in FIG. 11, in which the moving speed of the position where the sustained sound is produced is identified. In the claims, this identification processing is indicated as “moving speed identification means” and “moving speed identification step”.
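One way to obtain a moving speed from tracked positions is to divide the path length by the utterance duration. The sketch below assumes the tracker yields timestamped (x, y, z) positions in chronological order; the function names and the choice of mean speed (rather than instantaneous speed) are assumptions, as the description does not define how the speed is computed.

```python
import math


def path_length(points):
    """Sum of straight-line distances between consecutive positions."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def mean_speed(samples):
    """samples: list of (time_s, (x, y, z)); returns metres per second."""
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        raise ValueError("need at least two samples with increasing time")
    return path_length([position for _, position in samples]) / duration
```

The same `duration` value is what the preceding embodiment (time from start to end of the sustained sound) would report.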

このように本実施形態によれば、雑音がある環境下でもユーザーの呼気音や発声の移動速度が特定され、特定された事項に応じた処理をコンピュータ側で実行することができるようになる。 As described above, according to the present embodiment, the movement speed of the user's exhalation sound and utterance can be specified even in a noisy environment, and the processing corresponding to the specified matter can be executed on the computer side.

本実施形態における特徴的な処理はステップＳ３０６であり、ステップＳ３０６ではユーザーの鼻口腔から発せられた音の特徴点を特定する処理を実行する。なお、このようなステップは、図４に記載されたフローチャートにおいても、ステップＳ１０６とステップＳ１０７との間に加入することもできる。先に述べたように、本発明のインターフェイス装置には、明細書に記載された任意の実施形態を組み合わせたものが含まれる。 The characteristic processing in this embodiment is step S306, which identifies feature points of the sound emitted from the user's nose or mouth. Such a step can also be inserted between step S106 and step S107 in the flowchart shown in FIG. 4. As noted earlier, the interface device of the present invention includes any combination of the embodiments described in the specification.

このようなユーザーの鼻口腔から発せられた音の特徴点を特定することは、特許請求の範囲においては「特徴点特定手段」、「特徴点特定ステップ」として表現されている。 In the claims, identifying the feature points of the sound emitted from the user's nose or mouth is expressed as “feature point identification means” and “feature point identification step”.

次に、このような特徴点特定手段で特定するユーザーの鼻口腔から発せられた音の特徴点とは具体的にどのようなものであるのかについて説明する。図１２は本発明の他の実施の形態に係るインターフェイス装置における特徴点抽出処理の種類を示す図である。 Next, the specific nature of the feature points of the sound emitted from the user's nose or mouth that are identified by the feature point identification means will be described. FIG. 12 is a diagram showing the types of feature point extraction processing in the interface device according to another embodiment of the present invention.

図１２に示すように、ユーザーの鼻口腔から発せられた音の特徴点には、「基本周波数の特徴点」、「無声音・有声音の別に係る特徴点」、「音量の特徴点」、「音源の別に係る特徴点」、「発話内容に係る特徴点」を挙げることができる。 As shown in FIG. 12, the feature points of the sound emitted from the user's nose or mouth include “fundamental frequency feature points”, “feature points distinguishing unvoiced from voiced sounds”, “volume feature points”, “feature points distinguishing sound source types”, and “feature points relating to utterance content”.

「基本周波数の特徴点」の特定は、ユーザーの鼻口腔から発せられた音の基本周波数を抽出することによって行う。 The “fundamental frequency feature point” is identified by extracting the fundamental frequency of the sound emitted from the user's nose or mouth.
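The description does not say how the fundamental frequency is extracted. As one common possibility, the sketch below picks the lag with the largest autocorrelation inside an assumed pitch range of 80 to 400 Hz; the function name `estimate_f0`, the range, and the unnormalised autocorrelation (which slightly favours shorter lags) are all assumptions.

```python
import math


def estimate_f0(samples, sample_rate, f_min=80.0, f_max=400.0):
    """Estimate the fundamental frequency in Hz, or None if nothing is found."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_value = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation of the frame with itself shifted by `lag` samples.
        value = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else None
```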

また、「無声音・有声音の別に係る特徴点」の特定は、無声音・有声音を判別することによって行う。 Further, the “feature point related to unvoiced / voiced sound” is specified by discriminating unvoiced / voiced sound.

また、「音量の特徴点」の特定は、音の大きさを表すパワーなどの音量に相当するパラメータを計測することによって行う。 The “volume feature point” is specified by measuring a parameter corresponding to the volume, such as power representing the volume of the sound.
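A power-based volume parameter can be computed per frame as the mean squared amplitude, here expressed in decibels. The function name `frame_power_db` and the small constant `eps` (added to avoid taking the logarithm of zero for silent frames) are assumptions for illustration.

```python
import math


def frame_power_db(samples, eps=1e-12):
    """Mean power of one frame in dB relative to full scale (amplitude 1.0)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square + eps)
```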

また、「音源の別に係る特徴点」の特定は、音声や呼気音以外にも口笛、舌打ちなど様々な音源の種類の同定を行うことによって行う。 The “feature point distinguishing sound source types” is identified by determining the type of sound source, which, besides voice and breath sounds, may be whistling, tongue clicking, and so on.

また、「発話内容に係る特徴点」の特定は、音声認識などを行い、発話内容を認識することによって行う。 Further, the “feature point related to the utterance content” is specified by performing speech recognition or the like and recognizing the utterance content.

このように本実施形態によれば、雑音がある環境下でもユーザーの鼻口腔から発せられた音の特徴点が特定され、特定された事項に応じた処理をコンピュータ側で実行することができるようになる。 As described above, according to this embodiment, the feature points of the sound emitted from the user's nose or mouth are identified even in a noisy environment, and processing corresponding to the identified items can be executed on the computer side.

次に、インターフェイス装置１００の処理における要素技術について説明する。 Next, elemental technologies in the processing of the interface device 100 will be described.

インターフェイス装置１００では、周囲雑音が存在する環境下でも、３次元的なユーザの発声位置、および雑音を分離したユーザー音声が必要となる。これらの情報を抽出するために必要な3次元音声ポインティングデバイスであるインターフェイス装置１００の５つの処理、1．ユーザ発声位置の推定（近距離音源の推定）、2．周囲雑音の到来方向推定（遠距離にある音源の音波到来方向の推定）、3．ユーザーの発話検出、４．音源分離、５．音声認識処理（特願２００３−３２０１８３号）について以下に述べる。 The interface device 100 requires the user's three-dimensional utterance position and the user's voice separated from noise, even in an environment where ambient noise is present. The five processes of the interface device 100, a three-dimensional voice pointing device, that are needed to extract this information are described below: 1. estimation of the user's utterance position (estimation of a short-range sound source); 2. estimation of the arrival direction of ambient noise (estimation of the sound wave arrival direction of a distant sound source); 3. detection of the user's utterance; 4. sound source separation; and 5. speech recognition processing (Japanese Patent Application No. 2003-320183).

1．ユーザー発声位置の推定（近距離音源の推定） 1. Estimation of user utterance position (estimation of short-range sound source)

マイクロフォンアレイから約１ｍ以内の近距離にある音源の位置を、マイクロフォンアレイで推定する方法について以下に説明する。 A method for estimating, with the microphone array, the position of a sound source at a short distance of within about 1 m of the array is described below.

複数のマイクロフォンは3次元空間中の任意の位置に配置可能である。３次元空間中の任意の位置に置かれた音源から出力された音響信号を、３次元空間中の任意の位置に配置されたＱ個のマイクロフォンで受音する。音源と各マイクロフォン間の距離Ｒｑは次式で求められる。 The microphones can be placed at arbitrary positions in three-dimensional space. An acoustic signal output from a sound source placed at an arbitrary position in three-dimensional space is received by Q microphones placed at arbitrary positions in that space. The distance Rq between the sound source and each microphone is obtained by the following equation.

音源から各マイクロフォンまでの伝播時間τｑは、音速をｖとすると、次式で求められる。 The propagation time τq from the sound source to each microphone can be obtained by the following equation, where the speed of sound is v.
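Under the definitions above, the distance and the propagation time can plausibly be written as follows. The coordinate symbols $(x_0, y_0, z_0)$ for the sound source position and $(x_q, y_q, z_q)$ for the position of the q-th microphone are assumptions introduced here for illustration; only Rq, τq, v, and Q appear in the description.

```latex
R_q = \sqrt{(x_0 - x_q)^2 + (y_0 - y_q)^2 + (z_0 - z_q)^2},
\qquad
\tau_q = \frac{R_q}{v},
\qquad q = 1, \dots, Q
```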

各マイクロフォンで受音した中心周波数ωの狭帯域信号の、音源のそれに対する利得ｇｑは、一般的に、音源とマイクロフォン間の距離Ｒｑと中心周波数ωの関数として定義される。 The gain gq of the narrow band signal having the center frequency ω received by each microphone relative to that of the sound source is generally defined as a function of the distance Rq between the sound source and the microphone and the center frequency ω.

例えば、利得を距離Ｒｑだけの関数として、実験的に求めた次式のような関数を用いる。 For example, a function such as the following expression obtained experimentally is used with the gain as a function of only the distance Rq.

中心周波数ωの狭帯域信号に関する、音源と各マイクロフォン間の伝達特性は、 The transfer characteristics between the sound source and each microphone for the narrowband signal with the center frequency ω are:

と表される。そして、位置Ｐ０にある音源を表す位置ベクトルａ（ω，Ｐ０）を、次式のように、狭帯域信号に関する、音源と各マイクロフォン間の伝達特性を要素とする複素ベクトルとして定義する。 is expressed as above. The position vector a(ω, P0) representing the sound source at position P0 is then defined, as in the following equation, as a complex vector whose elements are the transfer characteristics between the sound source and each microphone for the narrowband signal.
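A near-field position vector of this kind can be sketched as below, assuming each element has the form gain times a pure delay, g(Rq)·exp(−jωτq). The sign convention of the exponent, the speed of sound of 340 m/s, and the default gain 1/R (spherical spreading) are assumptions; the description says only that the gain function was obtained experimentally. The function name is hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 340.0  # m/s, assumed value for v


def near_field_position_vector(omega, source, mics, gain=lambda r: 1.0 / r):
    """Return the Q transfer characteristics for a source at `source`.

    omega  : angular centre frequency of the narrowband signal (rad/s)
    source : (x, y, z) of the sound source, must not coincide with a microphone
    mics   : list of (x, y, z) microphone positions
    """
    vector = []
    for mic in mics:
        r_q = math.dist(source, mic)      # distance R_q
        tau_q = r_q / SPEED_OF_SOUND      # propagation time tau_q
        vector.append(gain(r_q) * cmath.exp(-1j * omega * tau_q))
    return vector
```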

音源位置の推定はＭＵＳＩＣ法（相関行列を固有値分解することで信号部分空間と雑音部分空間を求め、任意の音源位置ベクトルと雑音部分空間の内積の逆数を求めることにより、音源の音波到来方向や位置を調べる手法）を用いて、以下の手順で行う。ｑ番目のマイクロフォン入力の短時間フーリエ変換を The sound source position is estimated by the MUSIC method (the signal subspace and the noise subspace are obtained by eigenvalue decomposition of the correlation matrix, and the reciprocal of the inner product of an arbitrary sound source position vector and the noise subspace is obtained. The following procedure is performed using the method for checking the position. Short-time Fourier transform of qth microphone input

で表し、これを要素として観測ベクトルを次のように定義する。 The observation vector is defined as follows using this as an element.

ここで、ｎはフレーム時刻のインデックスである。連続するＮ個の観測ベクトルから相関行列を次式により求める。 Here, n is an index of frame time. A correlation matrix is obtained from the continuous N observation vectors by the following equation.

この相関行列の大きい順に並べた固有値を The eigenvalues of this correlation matrix, arranged in descending order, are denoted by

とし、それぞれに対応する固有ベクトルを and the corresponding eigenvectors by

とする。そして、音源数Ｓを次式により推定する。 The number of sound sources S is then estimated by the following equation.

もしくは、固有値に対する閾値を設け、その閾値を超える固有値の数を音源数Sとすることも可能である。 Alternatively, a threshold can be set for the eigenvalues, and the number of eigenvalues exceeding it can be taken as the number of sound sources S.

雑音部分空間の基底ベクトルから行列Ｒｎ（ω）を次のように定義し、 The matrix Rn(ω) is defined from the basis vectors of the noise subspace as follows,

周波数帯域 and, with the frequency band given by

および音源位置推定の探索領域Ｕを and the search region U for sound source position estimation given by

として、 the following function

を計算する。そして、関数Ｆ（Ｐ）が極大値をとる座標ベクトルを求める。ここでは仮にＳ個の極大値を与える座標ベクトルがＰ１，Ｐ２，・・・，Ｐｓが推定されたとする。次にその各々の座標ベクトルにある音源のパワーを次式により求める。 is calculated. The coordinate vectors at which the function F(P) takes local maxima are then found. Suppose here that coordinate vectors P1, P2, ..., Ps giving S local maxima have been estimated. Next, the power of the sound source at each of these coordinate vectors is obtained by the following equation.
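The MUSIC procedure outlined above can be sketched for a single narrowband frequency as follows. This is a minimal illustration using NumPy, not the implementation of the description: the function names `noise_subspace` and `music_value` are hypothetical, the number of sources is passed in rather than estimated from the eigenvalues, and the normalisation by the squared norm of the position vector is an assumption. The description evaluates F(P) over a frequency band and a search region U; here only one frequency and one candidate position vector are handled at a time.

```python
import numpy as np


def noise_subspace(observations, num_sources):
    """Return a basis of the noise subspace.

    observations : complex array of shape (N, Q), one observation vector per frame
    num_sources  : assumed number of sound sources S (S < Q)
    """
    x = np.asarray(observations)
    # Correlation matrix R = (1/N) * sum_n x(n) x(n)^H, shape (Q, Q).
    r = x.T @ x.conj() / x.shape[0]
    # eigh returns eigenvalues in ascending order with eigenvectors as columns.
    _, eigvecs = np.linalg.eigh(r)
    # The Q - S eigenvectors with the smallest eigenvalues span the noise subspace.
    return eigvecs[:, : x.shape[1] - num_sources]


def music_value(position_vector, noise_basis):
    """Reciprocal of the normalised projection of a candidate onto the noise subspace."""
    a = np.asarray(position_vector)
    projection = noise_basis.conj().T @ a
    return float(np.vdot(a, a).real / np.vdot(projection, projection).real)
```

Evaluating `music_value` for the position vector of every candidate point in the search region and taking the local maxima gives the position estimates; a candidate that matches a true source is nearly orthogonal to the noise subspace and therefore produces a sharp peak.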

そして、２つの閾値Ｆｔｈｒ，Ｐｔｈｒを用意し、各位置ベクトルにおけるＦ（Ｐｓ）とＰ（Ｐｓ）が次の条件を満足するときに、 Then, two threshold values Fthr and Pthr are prepared, and when F (Ps) and P (Ps) in each position vector satisfy the following conditions,

連続するＮ個のフレーム時間内の座標ベクトルＰｓにおいて発声があったと判断する。 it is determined that an utterance occurred at the coordinate vector Ps within the time of the N consecutive frames.

音源位置の推定処理は連続するＮ個のフレームを１つのブロックとして処理する。音源位置の推定をより安定に行うためには、フレーム数Ｎを増やす、そして／また連続するＮｂ個のブロックの全てで式（２０）の条件が満たされたら発声があったと判断する。ブロック数は任意に設定する。連続するＮフレームの時間内において、近似的に音源が静止していると見られるほどの速さで音源が移動している場合は、前記手法により音源の移動軌跡を捉えることができる。 The sound source position estimation processes N consecutive frames as one block. To make the estimate more stable, the number of frames N is increased, and/or an utterance is judged to have occurred only when the condition of expression (20) is satisfied in all of Nb consecutive blocks. The number of blocks can be set arbitrarily. If the sound source moves slowly enough to be regarded as approximately stationary within the time of N consecutive frames, the movement trajectory of the sound source can be captured by this method.

2．周囲雑音の到来方向推定（遠距離にある音源の音波到来方向の推定） 2. Estimation of the arrival direction of ambient noise (estimation of the sound wave arrival direction of a distant sound source)

マイクロフォンアレイから遠距離にある音源の音波が到来する方向を、マイクロフォンアレイで推定する手法について以下に述べる。 A method for estimating, with the microphone array, the direction from which the sound wave of a sound source far from the array arrives is described below.

複数のマイクロフォンは3次元空間中の任意の位置に配置可能である。遠距離から到来する音波は平面波として観測されると考える。 The microphones can be placed at arbitrary positions in three-dimensional space. Sound waves arriving from a long distance are assumed to be observed as plane waves.

図１３は本発明のマイクロフォンアレイを用いた受音機能を説明する説明図である。図１３は、例として、任意の位置に配置された３個のマイクロフォンｍ１、ｍ２、ｍ３で、音源から到来した音波を受音する場合を示している。図１３で、点ｃは基準点を示しており、この基準点のまわりで音波の到来方向を推定する。図１３で、平面ｓは、基準点ｃを含む平面波の断面を示している。平面ｓの法線ベクトルｎは、そのベクトルの向きを音波の伝播方向と逆向きとし、次式のように定義する。 FIG. 13 is an explanatory diagram for explaining a sound receiving function using the microphone array of the present invention. FIG. 13 shows, as an example, a case in which three microphones m1, m2, and m3 arranged at arbitrary positions receive sound waves that have arrived from a sound source. In FIG. 13, a point c indicates a reference point, and the arrival direction of the sound wave is estimated around this reference point. In FIG. 13, the plane s indicates a cross section of a plane wave including the reference point c. The normal vector n of the plane s is defined as the following equation, with the direction of the vector opposite to the propagation direction of the sound wave.

3次元空間中の音源の音波到来方向は２つのパラメータ（θ，φ）で表される。方向（θ，φ）から到来する音波を各マイクロフォンで受音し、そのフーリエ変換を求めることで受音信号を狭帯域信号に分解し、各受音信号の狭帯域信号毎に利得と位相を複素数として表し、それを要素として狭帯域信号毎に全受音信号分だけ並べたベクトルを音源の位置ベクトルと定義する。以下の処理において、方向（θ，φ）から到来する音波は、前述の位置ベクトルとして表現される。位置ベクトルは具体的に以下のように求められる。ｑ番目のマイクロフォンと平面ｓの間の距離ｒｑを次式により求める。 The arrival direction of a sound wave from a sound source in three-dimensional space is expressed by two parameters (θ, φ). The sound wave arriving from direction (θ, φ) is received by each microphone, and each received signal is decomposed into narrowband signals by taking its Fourier transform. For each narrowband signal of each received signal, the gain and phase are expressed as a complex number, and the vector formed by arranging these complex numbers for all received signals, for each narrowband signal, is defined as the position vector of the sound source. In the processing below, a sound wave arriving from direction (θ, φ) is represented by this position vector. Specifically, the position vector is obtained as follows. The distance rq between the q-th microphone and the plane s is obtained by the following equation.

距離ｒｑは平面ｓに関してマイクロフォンが音源側に位置すれば正となり、逆に音源と反対側にある場合は負の値をとる。音速をｖとするとマイクロフォンと平面ｓ間の伝播時間Ｔｑは次式で表される。 The distance rq is positive when the microphone is located on the sound source side with respect to the plane s, and is negative when the microphone is on the opposite side of the sound source. If the speed of sound is v, the propagation time Tq between the microphone and the plane s is expressed by the following equation.

平面ｓでの振幅を基準としてそこから距離ｒｑ離れた位置の振幅に関する利得を、狭帯域信号の中心周波数ωと距離ｒｑの関数として次のように定義する。 The gain related to the amplitude at a distance rq away from the amplitude in the plane s is defined as a function of the center frequency ω of the narrowband signal and the distance rq as follows.

平面ｓでの位相を基準としてそこから距離ｒｑ離れた位置の位相差は、次式で表される。 Taking the phase at the plane s as the reference, the phase difference at a position a distance rq away from it is given by the following equation.

以上より、平面ｓを基準として、各マイクロフォンで観測される狭帯域信号の利得と位相差は次式で表される。 From the above, with the plane s as a reference, the gain and phase difference of the narrowband signal observed by each microphone are expressed by the following equations.

Ｑ個のマイクで（θ、φ）方向から到来する音波を観測するとき、音源の位置ベクトルは、各マイクロフォンについて式（２６）に従い求めた値を要素とするベクトルとして次式のように定義される。 When observing a sound wave coming from the (θ, φ) direction with Q microphones, the position vector of the sound source is defined as the following expression as a vector whose elements are values obtained according to Expression (26) for each microphone. The
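The plane-wave position vector can be sketched as follows. The parameterisation of the normal vector by azimuth and elevation, the unit gain, the speed of sound of 340 m/s, and the function names are assumptions made for illustration; the description defines the gain as a function of ω and rq without giving its form here. The sign of the exponent follows from the convention that a microphone on the source side of the plane s (positive rq) receives the wave earlier than the reference point.

```python
import cmath
import math

SPEED_OF_SOUND = 340.0  # m/s, assumed value for v


def arrival_normal(azimuth, elevation):
    """Unit normal of plane s, pointing from the reference point toward the source."""
    return (math.cos(elevation) * math.cos(azimuth),
            math.cos(elevation) * math.sin(azimuth),
            math.sin(elevation))


def far_field_position_vector(omega, azimuth, elevation, mics,
                              reference=(0.0, 0.0, 0.0)):
    """Position vector for a plane wave arriving from (azimuth, elevation) in radians."""
    n = arrival_normal(azimuth, elevation)
    vector = []
    for mic in mics:
        # Signed distance r_q from plane s: positive on the source side.
        r_q = sum(n_i * (m_i - c_i) for n_i, m_i, c_i in zip(n, mic, reference))
        t_q = r_q / SPEED_OF_SOUND
        # Unit gain assumed; a microphone nearer the source leads in phase.
        vector.append(cmath.exp(1j * omega * t_q))
    return vector
```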

音源の位置ベクトルが定義されたら、音波の到来方向推定は、ＭＵＳＩＣ法を用いて行われる。式（１５）で与えられる行列Ｒｎ（ω）を用い、音波到来方向推定の探索領域Ｉを Once the position vector of the sound source has been defined, the arrival direction of the sound wave is estimated using the MUSIC method. Using the matrix Rn(ω) given by expression (15), and with the search region I for sound wave arrival direction estimation given by

として、 the following function

を計算する。そして、関数Ｊ（θ、φ）が極大値を与える方向（θ、φ）を求める。ここでは仮にＫ個の音源が存在し、極大値を与えるＫ個の音波到来方向（（θ１、φ１），・・・，（θＫ、φＫ））が推定されたとする。次にその各々の音波到来方向にある音源のパワーを次式により求める。 is calculated. The directions (θ, φ) at which the function J(θ, φ) takes local maxima are then found. Suppose here that K sound sources exist and that K sound wave arrival directions ((θ1, φ1), ..., (θK, φK)) giving local maxima have been estimated. Next, the power of the sound source in each of these arrival directions is obtained by the following equation.

そして、２つの閾値Ｊｔｈｒ，Ｑｔｈｒを用意し、各到来方向におけるＪ（θｋ，φｋ）とＱ（θｋ，φｋ）が次の条件を満足するときに、 Then, two threshold values Jthr and Qthr are prepared, and when J (θk, φk) and Q (θk, φk) in each arrival direction satisfy the following conditions,

連続するＮ個のフレーム時間内の到来方向（θｋ，φｋ）において発声があったと判断する。音波の到来方向の推定処理は連続するＮ個のフレームを１つのブロックとして処理する。到来方向の推定をより安定に行うためには、フレーム数Ｎを増やす、そして／また連続するＮｂ個のブロックの全てで式（３１）の条件が満たされたらその方向から音波の到来があったと判断する。ブロック数は任意に設定する。連続するＮフレームの時間内において、近似的に音源が静止していると見られるほどの速さで音源が移動している場合は、前記手法により音波の到来方向の移動軌跡を捉えることができる。 it is determined that an utterance occurred in the arrival direction (θk, φk) within the time of the N consecutive frames. The arrival direction estimation processes N consecutive frames as one block. To make the estimate more stable, the number of frames N is increased, and/or a sound wave is judged to have arrived from that direction only when the condition of expression (31) is satisfied in all of Nb consecutive blocks. The number of blocks can be set arbitrarily. If the sound source moves slowly enough to be regarded as approximately stationary within the time of N consecutive frames, the movement trajectory of the arrival direction of the sound wave can be captured by this method.
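The block-wise stabilisation described above can be sketched as a run-length check. The sketch assumes the detection condition is that both J and Q meet or exceed their thresholds Jthr and Qthr; the direction of the inequalities is an assumption, and the function name `detect_over_blocks` is hypothetical.

```python
def detect_over_blocks(block_values, j_thr, q_thr, n_blocks):
    """Return True once n_blocks consecutive blocks satisfy both thresholds.

    block_values : iterable of (J, Q) pairs, one pair per block of N frames
    """
    run = 0
    for j_value, q_value in block_values:
        passed = j_value >= j_thr and q_value >= q_thr
        run = run + 1 if passed else 0   # a failing block resets the run
        if run >= n_blocks:
            return True
    return False
```

Requiring more consecutive blocks lowers the chance of a false detection at the cost of a longer delay before a sound is reported.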

近距離音源の位置推定結果と遠距離音源の音波到来方向推定結果は、続く発話検出処理や音源分離処理で重要な役割を果たすが、近距離音源と遠距離音源が同時に発生していて、更に、遠距離音源から到来する音波に対して近距離音源のパワーが著しく大きくなるとき、遠距離音源の音波の到来方向推定がうまく行えない場合がある。このような時は、近距離音源が発生する直前に推定された、遠距離音源の音波の到来方向推定結果を用いるなどして対処する。 The position estimate for the short-range sound source and the arrival direction estimate for the distant sound source play an important role in the subsequent utterance detection and sound source separation. However, when a short-range source and a distant source are active at the same time and the power of the short-range source is much larger than that of the sound wave arriving from the distant source, the arrival direction of the distant source may not be estimated well. Such cases are handled, for example, by using the arrival direction estimate for the distant source obtained immediately before the short-range source became active.

3．ユーザーの発話検出 3. User utterance detection

複数の音源が存在している場合、どの音源が認識すべき音声なのかの特定は一般的に難しい。一方、音声を用いたインタフェースを採用するシステムでは、予めシステムのユーザがシステムに対して相対的にどのような位置で発声するかを表すユーザ発声領域を決めておくことができる。この場合、前述の方法でシステムの周囲に音源が複数存在しているとしても、各音源の位置や音波の到来方向を推定できれば、システムが予め想定しているユーザ発声領域に入る音源を選択することで容易にユーザの音声を特定できるようになる。 When multiple sound sources are present, it is generally difficult to identify which one is the voice to be recognized. On the other hand, a system that uses a voice interface can define in advance a user utterance region that represents where, relative to the system, its user will speak. In that case, even if multiple sound sources exist around the system, as long as the position of each source or the arrival direction of its sound wave can be estimated by the methods described above, the user's voice can easily be identified by selecting the source that falls within the user utterance region assumed by the system.

The presence of a sound source is detected when the condition of Equation (20) or Equation (31) is satisfied, and a user utterance is detected when the conditions on the source position or the direction of arrival are also satisfied. This detection result serves as utterance section information and plays an important role in the subsequent speech recognition. Speech recognition requires detecting the start and end points of the utterance section in the input signal, but this is not always easy in an environment with ambient noise. In general, recognition accuracy degrades markedly when the detected start point of the utterance section is off. On the other hand, even when multiple sound sources exist, the functions of Equations (18) and (29) show sharp peaks at the positions of the sources and in the directions of arrival of their sound waves. The speech recognition apparatus of the present invention, which uses this information for utterance section detection, therefore has the advantage of detecting utterance sections robustly and maintaining high recognition accuracy even in the presence of multiple ambient noise sources.

For example, user utterance regions can be defined as shown in FIG. 14, which is a functional explanatory diagram of the utterance detection processing according to the present invention. For simplicity the figure shows only the X-Y plane, but arbitrary user utterance regions can be defined in the same way in three-dimensional space. FIG. 14 assumes processing with eight microphones m1 to m8 arranged at arbitrary positions, and defines user utterance regions in both the short-distance search region and the long-distance search region. The short-distance search space is the rectangle whose diagonal connects the two points (PxL, PyL) and (PxH, PyH). Within it, two rectangles are defined as user utterance regions: one whose diagonal connects (PTxL1, PTyL1) and (PTxH1, PTyH1), and one whose diagonal connects (PTxL2, PTyL2) and (PTxH2, PTyH2). Therefore, among the source positions judged by Equation (20) to contain an utterance, selecting those whose coordinate vector lies inside a user utterance region identifies the user's voice among the short-distance sources.
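The region test for short-distance sources can be sketched as below. This is only an illustration under stated assumptions: the `Rect` type, the function `select_user_sources`, and all coordinate values in the usage note are hypothetical, and the candidate positions are assumed to have already passed the detection condition of Equation (20).

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle given by the two ends of its diagonal."""
    x_lo: float
    y_lo: float
    x_hi: float
    y_hi: float

    def contains(self, x, y):
        return self.x_lo <= x <= self.x_hi and self.y_lo <= y <= self.y_hi


def select_user_sources(positions, user_regions):
    """Keep only the detected (x, y) positions that lie in any user region."""
    return [
        p for p in positions
        if any(region.contains(p[0], p[1]) for region in user_regions)
    ]
```

For instance, with two regions `Rect(0, 0, 1, 1)` and `Rect(2, 2, 3, 3)`, a detected source at `(1.5, 1.5)` would be discarded as ambient noise, while sources at `(0.5, 0.5)` and `(2.5, 3.0)` would be kept as user utterances.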

The long-distance search space, on the other hand, takes point C as its reference and covers the directions from angle θL to θH; within it, the range from angle θTL1 to θTH1 is defined as the user utterance region. Therefore, among the directions of arrival judged by Equation (31) to contain an utterance, selecting those that fall inside the user utterance region identifies the user's voice among the long-distance sources.
4. Sound source separation
The following describes the sound source separation processing, which uses the position estimate or direction-of-arrival estimate of the detected utterance to emphasize the user's voice and suppress ambient noise. The utterance position or direction of arrival of the user's voice has been obtained by the utterance detection processing, and the source positions or directions of arrival of the ambient noise have also been estimated. Using these estimates, the sound source position vectors of Equations (8) and (27), and σ, which represents the variance of the omnidirectional noise, the matrix V(ω) is defined by the following expression.

Here, the correlation matrix V(ω) contains (S + K) sound sources in total, namely S short-distance sources and K long-distance sources, so Z(ω) is defined by the following expression using the (S + K) largest eigenvalues and their eigenvectors.

The separation filter W(ω) that emphasizes the voice of a user located at the short-distance coordinate vector P is given by the following expression.

Multiplying the observation vector of Equation (10) by the separation filter of Equation (36) yields the voice v(ω) of the user at the coordinate vector P.

The waveform of this emphasized user voice is obtained by computing the inverse Fourier transform of Equation (37).
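The two operations named above, applying a per-frequency filter to the multichannel observation and transforming the result back to a waveform, can be sketched for a single frame as follows. This is a generic filter-and-sum illustration, not the filter of Equation (36): the weights are supplied by the caller, the use of the conjugated weights in the inner product is an assumption, and the names `apply_filter`, `dft`, and `inverse_dft` are hypothetical. A practical system would use an FFT library and overlap-add across frames.

```python
import cmath


def dft(frame):
    """Plain DFT of one real-valued frame (for illustration only)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def apply_filter(weights, observations):
    """Per-bin inner product v[k] = sum_m conj(w[k][m]) * y[k][m].

    weights[k] and observations[k] hold one value per microphone for bin k.
    Conjugating the weights is an assumed convention.
    """
    return [
        sum(w.conjugate() * y for w, y in zip(w_k, y_k))
        for w_k, y_k in zip(weights, observations)
    ]


def inverse_dft(spectrum):
    """Inverse DFT of one frame, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]
```

With two microphones that observe the same signal and weights of 0.5 on each channel, the output waveform reproduces the input frame, which is a convenient sanity check.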

On the other hand, the separation filter M(ω) that emphasizes the voice of a user located in the long-distance direction (θ, φ) is given by the following expression.

Multiplying the observation vector of Equation (10) by the separation filter of Equation (38) yields the emphasized voice v(ω) of the user in the direction (θ, φ).

The waveform of this emphasized user voice is obtained by computing the inverse Fourier transform of Equation (37). When the sound source moves slowly enough that it can be regarded as approximately stationary within the N consecutive frames, this method yields the emphasized voice of a moving user.
5. Speech recognition processing
The sound source separation described above is effective against directional noise, but some noise remains in the case of omnidirectional noise. It also offers little suppression of noise that occurs over a short time, such as sudden noise. To reduce the effect of this residual noise, the user voice emphasized by the sound source separation is recognized with a speech recognition engine that incorporates, for example, the feature correction method described in Japanese Patent Application No. 2003-320183, "Background Noise Distortion Correction Processing Method and Speech Recognition System Using the Same". The present invention is not limited to the speech recognition engine of Japanese Patent Application No. 2003-320183; speech recognition engines implementing various other noise-robust methods may also be used.

The feature correction method described in Japanese Patent Application No. 2003-320183 corrects the features of noise-superimposed speech based on the Hidden Markov Model (HMM) that the speech recognition engine already holds as its template model for recognition. The HMM is trained on Mel-Frequency Cepstrum Coefficients (MFCC) computed from clean, noise-free speech. No additional parameters therefore need to be prepared for feature correction, and the method can be incorporated into an existing recognition engine relatively easily. The method treats noise as consisting of a stationary component and a non-stationary component that changes temporarily, and estimates the stationary component from the few frames immediately before the utterance.

A copy of the distributions held by the HMM is generated, and the estimated stationary noise component is added to it to produce the feature distribution of speech with stationary noise superimposed. Evaluating the posterior probability of the observed noisy-speech features against this distribution absorbs the distortion caused by the stationary noise component. This processing alone, however, does not account for distortion caused by the non-stationary component, so the posterior probability obtained in this way is no longer accurate when a non-stationary component is present. Using an HMM for feature correction, on the other hand, makes available the temporal structure of the feature time series and the cumulative output probability computed along it. Applying a weight calculated from this cumulative output probability to the posterior probability improves the reliability of posterior probabilities degraded by the temporarily changing non-stationary noise component.

音声認識を行う場合、入力信号の中から発話区間の開始時点と終了時点を検出する必要がある。しかし、周囲雑音が存在する雑音環境下での発話区間検出は必ずしも容易ではない。特に、前記特徴補正を組み込んだ音声認識エンジンは、発話開始直前の数フレームから周囲雑音の定常的な特徴を推定するので、発話区間の開始時点がずれると認識精度が著しく劣化してしまう。一方、複数の音源が存在していても、その音源がある位置や音波の到来方向において、式（１８）や式（２９）で表される関数は鋭いピークを示す。従って、この情報を用いて発話区間検出を行っている本発明音声認識装置は、複数の周囲雑音が存在しても頑健に発話区間検出が行え、高い音声認識精度を保つことができる。 When performing speech recognition, it is necessary to detect the start time and end time of an utterance section from an input signal. However, it is not always easy to detect an utterance section in a noise environment in which ambient noise exists. In particular, since the speech recognition engine incorporating the feature correction estimates a steady feature of ambient noise from several frames immediately before the start of speech, the recognition accuracy is significantly deteriorated when the start time of the speech section is shifted. On the other hand, even if there are a plurality of sound sources, the functions represented by Expression (18) and Expression (29) show a sharp peak at the position where the sound source is and the arrival direction of the sound waves. Therefore, the speech recognition apparatus of the present invention that performs speech segment detection using this information can robustly perform speech segment detection even when a plurality of ambient noises exist, and can maintain high speech recognition accuracy.

次に、インターフェイス装置１００の雑音に対する頑健性を調べるための実験について説明する。 Next, an experiment for examining robustness against noise of the interface device 100 will be described.

実験は防音室内で行った。ユーザー音声と周囲雑音を別々に録音し、計算機上で所望のＳＮＲになるように音声と雑音を混合する。雑音を混合する前のクリーン・ユーザ音声から推定した音源位置を正解とする。そして、同様に雑音混合音声から音源位置を推定し、クリーン・ユーザ音声から推定した結果と比較することで、雑音が音源定位の推定にどれだけ影響を及ぼすかを測定する。 The experiment was conducted in a soundproof room. User voice and ambient noise are recorded separately, and the voice and noise are mixed so that the desired SNR is obtained on the computer. The sound source position estimated from the clean user voice before mixing the noise is taken as the correct answer. Similarly, the position of the sound source is estimated from the noise-mixed speech and compared with the result estimated from the clean user speech to measure how much noise affects the estimation of the sound source localization.

The experiment used the interface device 100 with the structure shown in FIG. 1. Each microphone array unit has four microphones arranged at 3 cm intervals, for a total of 12 microphones. Sound source positions in the detection space (user utterance region) were searched on a 10 × 10 × 10 grid of discrete points with a spacing of 2 cm, each of the X, Y, and Z axes being discretized into 10 points.

インターフェイス装置１００の周囲には１つのスピーカを置く。そのスピーカからは、男性と女性の音声を雑音としてそれぞれ流す。スピーカは、インターフェイス装置１００の中心から１２０ｃｍ離し、前後左右の４ヵ所に順に設置し、それぞれの場所で流した雑音を録音する。ＳＮＲを２０ｄＢから−１０ｄＢまで５ｄＢ刻みとし、雑音混合音声を生成した。 One speaker is placed around the interface device 100. From the speaker, male and female voices are played as noise. The speakers are placed 120 cm away from the center of the interface device 100, and are sequentially installed at four locations, front, rear, left, and right, to record the noise flowing at each location. The SNR was changed from 20 dB to -10 dB in 5 dB steps to generate noise mixed speech.
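Mixing separately recorded speech and noise at a target SNR, as done in this experiment, can be sketched as follows. This is a minimal illustration under stated assumptions: both signals are plain lists of samples at the same sampling rate, the noise recording is at least as long as the speech and has nonzero power, and the SNR is defined as the ratio of mean powers over the whole utterance. The names `mean_power`, `mix_at_snr`, and `SNR_STEPS_DB` are hypothetical.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power equals snr_db, then add."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    target_ratio = 10 ** (snr_db / 10)  # linear power ratio for the requested SNR
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * target_ratio))
    return [s + gain * n for s, n in zip(speech, noise)]


# SNR conditions from 20 dB down to -10 dB in 5 dB steps.
SNR_STEPS_DB = list(range(20, -15, -5))
```

Scaling the noise rather than the speech keeps the clean reference signal identical across all SNR conditions, which matches the use of the clean-speech estimate as the correct answer.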

The experimental results are shown in FIGS. 15 and 16.

FIG. 15 shows the rate at which either of two outcomes occurred: the source position was not detected from the clean user voice but was detected from the noise-mixed voice (false detection), or the source position was detected from the clean user voice but not from the noise-mixed voice (missed detection). Such false and missed detections tend to increase as the SNR decreases, but even in a noise-dominant environment of about -10 dB they occurred in slightly under 3% of cases.

FIG. 16 shows, for the cases in which the source position was estimated from both the clean user voice and the noise-mixed voice, the average error in three-dimensional space between the two estimated positions. The average error was about 2 cm to 2.5 cm. Because source positions in this experiment are estimated on grid points spaced 2 cm apart, an average error of this size means that the estimate landed on a grid point adjacent to the correct position.
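The relation between the 2 cm grid and the reported error can be checked with a short sketch. It builds the 10 × 10 × 10 search grid and lists the possible distances from a grid point to its adjacent grid points. Placing the grid origin at (0, 0, 0) is an assumption made only for illustration, and the names used are hypothetical.

```python
import itertools
import math

SPACING_CM = 2.0
POINTS_PER_AXIS = 10

# All candidate source positions: 10 points per axis, 2 cm apart.
grid = [
    tuple(SPACING_CM * i for i in index)
    for index in itertools.product(range(POINTS_PER_AXIS), repeat=3)
]


def neighbor_distances(spacing):
    """Distinct distances from a grid point to its 26 adjacent grid points."""
    offsets = [
        o for o in itertools.product((-1, 0, 1), repeat=3) if o != (0, 0, 0)
    ]
    return sorted(
        {round(spacing * math.sqrt(sum(c * c for c in o)), 6) for o in offsets}
    )
```

`neighbor_distances(2.0)` gives 2 cm for a face-adjacent point, about 2.83 cm for an edge-adjacent point, and about 3.46 cm for a corner-adjacent point, so a mean error of 2 cm to 2.5 cm is consistent with estimates that are off by a single grid step.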

以上の結果から、本発明のインターフェイス装置１００は、周囲の雑音の影響を受けにくいと言うことができる。 From the above results, it can be said that the interface device 100 of the present invention is not easily affected by ambient noise.

Next, another experiment examining the robustness of the interface device 100 against noise is described. This experiment was conducted in a soundproof room. The user voice and the ambient noise were recorded separately and mixed on a computer to obtain the desired SNR. The source position estimated from the clean user voice, before any noise is mixed in, is taken as the correct answer. The source position is then estimated in the same way from the noise-mixed user voice and compared with the estimate from the clean user voice, to measure how much the noise affects sound source localization.

The experiment used the interface device 100 with the structure shown in FIG. 1. Each microphone array unit has four microphones arranged at 3 cm intervals, for a total of 12 microphones. Sound source positions in the detection space (user utterance region) were searched on a 10 × 10 × 10 grid of discrete points with a spacing of 2 cm, each of the X, Y, and Z axes being discretized into 10 points.

One loudspeaker was placed near the interface device 100, and male and female voices were each played from it as noise. The loudspeaker was placed in turn at four locations (front, back, left, and right) 120 cm from the center of the pointing device, and the noise played at each location was recorded with the microphone array. Noise-mixed user voices were generated at each SNR from 20 dB to -10 dB in 5 dB steps.

The experimental results are shown in FIGS. 17 and 18.

FIG. 17 shows the rates of the following three outcomes:
A. The sound source was not detected from the clean user voice but was detected from the noise-mixed user voice (false detection), or the sound source was detected from the clean user voice but not from the noise-mixed user voice (missed detection).
B. The source positions estimated from the clean user voice and from the noise-mixed user voice matched (correct).
C. The source positions estimated from the two voices did not match (error).

The false and missed detections of outcome A tend to increase as the SNR decreases, but even in a noise-dominant environment of about -10 dB their rate was slightly under 3%. The rate of correct results B decreases gradually as the SNR decreases, while the rate of errors C increases gradually. For the cases counted as errors in C, the distance between the source position estimated from the noise-mixed user voice and the corresponding position estimated from the clean user voice was measured; FIG. 18 shows the resulting mean and standard deviation. The mean error is about 2 cm to 2.5 cm, and the standard deviation is at most about 1 cm. Because source positions in this experiment are estimated on grid points spaced 2 cm apart, even a wrongly estimated position is off by only about one grid step, landing on a point nearly adjacent to the correct position.
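The tally behind outcomes A, B, and C and the error statistics can be sketched as follows. The data layout and the function name `tally` are hypothetical, and two choices here are assumptions the text does not state: trials in which neither signal yields a detection are excluded from the denominator, and the population standard deviation is used.

```python
import math
from statistics import mean, pstdev


def tally(trials):
    """Classify trials into outcomes A, B, C and summarize the C errors.

    trials: list of (clean_pos, noisy_pos) pairs; each entry is an (x, y, z)
            tuple in cm, or None when no source was detected.
    Returns (rates, (mean_error_cm, std_error_cm)).
    """
    counts = {"A": 0, "B": 0, "C": 0}
    errors = []
    for clean, noisy in trials:
        if clean is None and noisy is None:
            continue  # nothing detected in either signal (assumed excluded)
        if clean is None or noisy is None:
            counts["A"] += 1  # false detection or missed detection
        elif clean == noisy:
            counts["B"] += 1  # same grid point: correct
        else:
            counts["C"] += 1  # different grid points: error
            errors.append(math.dist(clean, noisy))
    total = sum(counts.values())
    rates = {k: (v / total if total else 0.0) for k, v in counts.items()}
    stats = (mean(errors), pstdev(errors)) if errors else (None, None)
    return rates, stats
```

Comparing positions with `==` is adequate here only because both estimates are restricted to the same discrete grid points.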

以上の実験結果から、本発明の３次元音声ポインティングデバイスによる音源位置推定は、周囲雑音の影響を受けにくいと言うことができる。 From the above experimental results, it can be said that the sound source position estimation by the three-dimensional audio pointing device of the present invention is hardly influenced by the ambient noise.

以上、本発明のインターフェイス装置とインターフェイス方法によれば、雑音がある環境下でもユーザーの呼気音や発声の発声位置などの事項が３次元的に特定され、特定された事項に応じた処理をコンピュータ側で実行することができるようになる。 As described above, according to the interface device and the interface method of the present invention, items such as a user's breathing sound and the utterance position of the utterance are specified three-dimensionally even in a noisy environment, and processing corresponding to the specified items is performed by the computer. Will be able to run on the side.

FIG. 1 is a perspective view showing the external appearance of an interface device according to an embodiment of the present invention.
FIG. 2 is a diagram showing the block configuration of the interface device according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of how the interface device according to the embodiment of the present invention is used.
FIG. 4 is a flowchart of the processing of the interface device according to the embodiment of the present invention.
FIG. 5 is a diagram showing another example of region division in the interface device according to the embodiment of the present invention.
FIG. 6 is a diagram showing an example of how an interface device according to another embodiment of the present invention is used.
FIG. 7 is a diagram showing an example of how an interface device according to another embodiment of the present invention is used.
FIG. 8 is a diagram showing an example of how an interface device according to another embodiment of the present invention is used.
FIG. 9 is a diagram showing an example of how an interface device according to another embodiment of the present invention is used.
FIG. 10 is a diagram showing the azimuth (θ) - elevation (φ) plane assumed in an interface device according to another embodiment of the present invention.
FIG. 11 is a flowchart of the processing of an interface device according to another embodiment of the present invention.
FIG. 12 is a diagram showing the types of feature point extraction processing in an interface device according to another embodiment of the present invention.
FIG. 13 is an explanatory diagram of the sound receiving function using the microphone array of the present invention.
FIG. 14 is a functional explanatory diagram of the utterance detection processing according to the present invention.
FIG. 15 is a diagram showing experimental results for examining the robustness of the present invention against noise.
FIG. 16 is a diagram showing experimental results for examining the robustness of the present invention against noise.
FIG. 17 is a diagram showing experimental results for examining the robustness of the present invention against noise.
FIG. 18 is a diagram showing experimental results for examining the robustness of the present invention against noise.

Explanation of symbols

100 ... interface device, 200 ... microphone array, 201 ... silicon microphone, 202 ... windscreen, 210 ... stand, 211 ... main support column, 212 ... left support column, 213 ... right support column, 280 ... microphone amplifier, 290 ... AD converter, 300 ... CPU, 400 ... storage unit, 500 ... connection port unit

Claims

An interface device comprising a microphone array and a CPU that takes in the output of the microphone array, stores it in a storage unit, and operates based on a program, wherein
the microphone array includes a main support column standing vertically on a stand, a left support column and a right support column branching from the main support column to the left and right, and a microphone group provided on each support column, the microphone groups arranged on the respective support columns together forming a substantially C-shaped layout,
a user utterance detection area is defined with reference to the microphone array and divided into an arbitrary number of divided areas, only user utterances inside the utterance detection area are detected, and sound from outside the utterance detection area is treated as noise,
the CPU functions to sequentially execute: step S101 of capturing voice data from the microphone array; step S102 of specifying three-dimensional information on the user utterance position and on the direction of arrival of ambient noise;
a step of determining whether a user utterance has been detected, returning to step S101 if no user utterance has been detected and proceeding to step S104 if a user utterance has been detected; step S104 of sound source separation that suppresses ambient noise and emphasizes the user's utterance; step S105 of identifying, for the user utterance detection area defined in advance and divided into an arbitrary number of divided areas, in which divided area the utterance position estimated in the three-dimensional space was detected; step S106, for which combinations of utterance detection position and event are defined in advance and stored in a database, comprising step S106a of determining whether the utterance position is at a specific position within the divided area, step S106b of determining whether the duration of the utterance is equal to or less than a predetermined threshold, and step S106c of determining whether the utterance is an unvoiced sound,
the corresponding event being identified in step S106 when step S106a determines that the utterance position is at a specific position in the divided area, step S106b determines that the utterance duration is equal to or less than the predetermined threshold, and step S106c determines that the utterance is an unvoiced sound;
a step of checking whether an event matching the event database was detected in step S106, returning to step S101 if no event was detected and proceeding to step S108 if an event was detected; and
step S108 of transmitting an event detection signal to the application level on the computer side,
the search range of sound source position estimation is set to the user utterance detection area,
in step S102,
the MUSIC method is used to estimate the sound source position,
when estimating the user utterance position,
a coordinate vector at which the following function F(P) takes a maximum value is obtained,
the power P(P_s) of the sound source at the obtained coordinate vector is obtained,
when the function F(P) and the sound source power P(P_s) are each equal to or greater than a predetermined threshold, it is determined that there is an utterance at the coordinate vector Pl,
the search range of sound wave arrival direction estimation is set on the assumption that the arriving wave is a plane wave,
the MUSIC method is used to estimate the direction of arrival of sound waves,
when estimating the direction of arrival of ambient noise,
the direction (θ, φ) at which the following function J(θ, φ) takes a maximum value is obtained,
the power Q(θ_k, φ_k) of the sound source in the obtained sound wave arrival direction is obtained,
when the function J(θ, φ) and the sound source power Q(θ_k, φ_k) are each equal to or greater than a predetermined threshold,
it is determined that there is an utterance in the sound wave arrival direction (θ_k, φ_k),
in step S104,
using the coordinate vector Pl, the sound wave arrival direction (θ_k, φ_k), the sound source position vectors, and the variance σ of omnidirectional noise, a correlation matrix V(ω) is defined by the following expression,
using the eigenvalues and eigenvectors e_k of the correlation matrix V(ω), Z(ω) is defined by the following expression,
a separation filter W(ω) that emphasizes the voice of the user at the short-distance coordinate vector P is given by the following expression,

the user's voice v(ω) at the coordinate vector P is obtained by multiplying the observation vector by the above separation filter,
and the interface device comprises utterance position specifying means for three-dimensionally specifying, based on the user's voice v(ω), the utterance position of a sound emitted from the user's nasal cavity,
where P is an arbitrary position in the three-dimensional space, ω is the frequency of the narrowband signal,
a(ω, P) is a position vector representing a sound source at position P, defined as a complex vector whose elements are the transfer characteristics between the sound source and each microphone for the narrowband signal,
Rn(ω) is obtained based on the number S of sound sources, which is found from the eigenvalues λ of the correlation matrix R(ω) obtained from N observation vectors y(ω, n) consisting of the short-time Fourier transform values of the respective microphone inputs,
and on the eigenvectors dk corresponding to the eigenvalues λ, and
b(ω, θ, φ) is a position vector of the sound source when a sound wave arriving from the (θ, φ) direction is observed at the frequency ω.

The interface device according to claim 1, further comprising utterance position detection signal generating means for generating a signal based on the utterance position specified by the utterance position specifying means.

The interface device according to claim 1, wherein vector specifying means is used as the utterance position specifying means, the vector specifying means three-dimensionally specifying, based on the voice data acquired by the microphone array, a vector connecting the utterance start position and the utterance end position of a continuous sound emitted from the user's nasal cavity.

The interface device according to claim 1, wherein trajectory specifying means is used that three-dimensionally specifies, based on the voice data acquired by the microphone array, the trajectory from the utterance start position to the utterance end position of a continuous sound emitted from the user's nasal cavity.

The interface device according to claim 1, wherein the utterance position specifying means is utterance direction specifying means that specifies, based on the voice data acquired by the microphone array, the azimuth angle and elevation angle of the direction from which a sound emitted from the user's nasal cavity arrives.

The interface device according to claim 5, further comprising an utterance direction detection signal generating means that generates a signal based on the azimuth angle and the elevation angle specified by the utterance direction specifying means.
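
The relation between an arrival direction and its azimuth and elevation angles can be sketched as follows. The axis convention is an assumption not stated in the claims (azimuth measured in the x-y plane from the x axis, elevation measured upward from that plane), and the function names are hypothetical.

```python
import math

def direction_to_angles(x, y, z):
    """Return (azimuth, elevation) in radians for a direction vector (x, y, z)."""
    azimuth = math.atan2(y, x)
    elevation = math.atan2(z, math.hypot(x, y))
    return azimuth, elevation

def angles_to_direction(azimuth, elevation):
    """Return the unit vector pointing in the (azimuth, elevation) direction."""
    cos_el = math.cos(elevation)
    return (cos_el * math.cos(azimuth),
            cos_el * math.sin(azimuth),
            math.sin(elevation))
```

Under this convention a direction specified only by its two angles carries no range information, which is what distinguishes it from a three-dimensional position.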

The interface device according to claim 1, wherein the utterance position specifying means is a direction vector specifying means that specifies a vector connecting the azimuth angle and elevation angle of the utterance start direction with the azimuth angle and elevation angle of the utterance end direction of a continuous sound emitted from the user's nasal cavity, based on the voice data acquired by the microphone array.

The interface device according to claim 1, wherein the utterance position specifying means is a direction trajectory specifying means that specifies the trajectory from the azimuth angle and elevation angle of the utterance start direction to the azimuth angle and elevation angle of the utterance end direction of a continuous sound emitted from the user's nasal cavity, based on the voice data acquired by the microphone array.

The interface device according to claim 1, wherein the utterance position specifying means is a time specifying means that specifies the time from the start of utterance to the end of utterance of a continuous sound emitted from the user's nasal cavity, based on the voice data acquired by the microphone array.

The interface device, wherein the utterance position specifying means is a moving speed specifying means that specifies the moving speed of a continuous sound emitted from the user's nasal cavity, based on the voice data acquired by the microphone array.
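
The vector, trajectory, utterance time and moving speed named in the claims above could all be derived from one sequence of time-stamped utterance positions, as sketched below. The input format, the function name `summarize_track`, and the choice of mean speed along the path as the "moving speed" are assumptions made for illustration.

```python
import math

def summarize_track(times, positions):
    """times: seconds; positions: (x, y, z) tuples, one per time stamp."""
    if len(times) != len(positions) or len(times) < 2:
        raise ValueError("need at least two time-stamped positions")
    start, end = positions[0], positions[-1]
    # Vector connecting the utterance start position and the utterance end position.
    vector = tuple(e - s for s, e in zip(start, end))
    # Time from the start of utterance to the end of utterance.
    duration = times[-1] - times[0]
    # Length of the trajectory, summed over consecutive positions.
    path_length = sum(math.dist(p, q) for p, q in zip(positions, positions[1:]))
    mean_speed = path_length / duration if duration > 0 else 0.0
    return {
        "vector": vector,
        "trajectory": list(positions),
        "duration": duration,
        "mean_speed": mean_speed,
    }
```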

The interface device according to claim 1, further comprising: a feature point specifying means that specifies a feature point of a sound emitted from the user's nasal cavity; and a feature point detection signal generating means that generates a signal based on the feature point specified by the feature point specifying means.

The interface device according to claim 11, wherein the feature point of the sound emitted from the user's nasal cavity that is extracted by the feature point specifying means is one of: a feature point of the fundamental frequency, a feature point relating to unvoiced/voiced sound, a feature point of the volume, a feature point relating to the sound source, and a feature point relating to the utterance content.
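
Three of the feature points listed above (volume, voiced/unvoiced, fundamental frequency) can be estimated from a single audio frame as sketched below. This uses a common autocorrelation approach chosen for illustration, not a method stated in the claims. The function name `frame_features`, the search range for the fundamental frequency, and the voicing threshold are hypothetical.

```python
import numpy as np

def frame_features(frame, sample_rate, f0_min=50.0, f0_max=500.0, voiced_threshold=0.5):
    """Estimate RMS volume, a voiced/unvoiced flag and F0 for one audio frame."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    volume_rms = float(np.sqrt(np.mean(frame ** 2)))
    # Autocorrelation for non-negative lags; acf[0] is the frame energy.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    if acf[0] <= 0.0 or lag_max <= lag_min:
        return {"volume_rms": volume_rms, "voiced": False, "f0_hz": None}
    # Strongest periodicity inside the assumed F0 search range.
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    voiced = bool(acf[lag] / acf[0] >= voiced_threshold)
    f0_hz = sample_rate / lag if voiced else None
    return {"volume_rms": volume_rms, "voiced": voiced, "f0_hz": f0_hz}
```

Feature points relating to the sound source or to the utterance content would need a separate recognizer and are outside this sketch.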