JP4660740B2

JP4660740B2 - Voice input device for electric wheelchair

Info

Publication number: JP4660740B2
Application number: JP2006248485A
Authority: JP
Inventors: 晃佐宗; 宏明児島
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2006-09-13
Filing date: 2006-09-13
Publication date: 2011-03-30
Anticipated expiration: 2026-09-13
Also published as: JP2008067854A

Description

The present invention relates to a voice input device mounted on an electric wheelchair that allows elderly or disabled users to operate the wheelchair by voice in real environments containing various ambient noises, without having to wear a microphone.

Prior art on voice-controllable electric wheelchairs includes Patent Document 1 and Patent Document 2, both of which assume a single microphone as the voice input device. Patent Document 3 is prior art that uses a microphone array as the voice input device, and Patent Document 4 and others describe estimating the sound source position with a microphone array and controlling an electric wheelchair on that basis.

Patent Document 1: JP 2003-310665 A
Patent Document 2: JP-A-6-225910
Patent Document 3: Japanese Patent Application No. 2006-044711
Patent Document 4: Japanese Patent Application No. 2006-045096

When an electric wheelchair is operated by voice in a real environment containing various ambient noises, speech recognition that is robust against noise is indispensable. A conventional electric wheelchair controlled by speech from a single microphone requires a close-talking microphone such as a headset to keep noise out. However, a headset microphone must be put on every time the wheelchair is used, and if it shifts during use the user must reposition it unaided. This is not necessarily practical for, say, a disabled person who can speak to some extent but has difficulty moving their hands freely. To avoid this problem, the microphones must be fixed to the wheelchair so that the operator can control it without wearing any microphone at all. In that case, however, the distance between the operator and the microphones increases, so ambient noise enters the signal and degrades recognition accuracy, and ambient noise can also cause the wheelchair to malfunction.
One way to solve this is to receive the operator's voice with multiple microphones and apply processing such as sound source position estimation (Japanese Patent Application No. 2006-045096) and suppression of interfering noise. For example, the prior-art Japanese Patent Application No. 2006-044711 describes a voice input device in which multiple microphones are arranged on supports long enough to extend from behind the operator, over both shoulders, to beyond the operator's mouth. However, for a disabled person with cerebral palsy who has spasticity and involuntary movements, for example, microphones mounted at a high position pose a safety problem, and the design also tends to enclose the operator.

An object of the present invention is to provide a voice input device for mounting on an electric wheelchair that is suitable for wide, general use without restricting who can operate it.

In the voice input device for mounting on an electric wheelchair according to the present invention, microphone mounting bodies, each carrying a plurality of spaced-apart microphones forming a microphone array, are attached so that the microphones are located at the tips of the armrests of the electric wheelchair. Control means provided on the vehicle identifies the operator's instruction by performing sound source position estimation or speech recognition on the signals captured from the microphones on both sides. Furthermore, once the operator's instruction has been identified, the vehicle is driven in accordance with that instruction.
In addition, on the pair of microphone mounting bodies attached to the tips of the left and right armrests, the microphones are arranged at an angle so that, seen from the operator, the two arrays form the shape of the katakana character "ハ" (two oblique strokes that are closer together at one end than at the other).

Because the voice input device of the present invention uses microphones fixed to the wheelchair, it realizes a practical electric wheelchair that requires no procedures such as putting on a microphone or correcting its position, even when used by a disabled person who can speak to some extent but has difficulty moving their hands freely. In addition, by adopting microphone stands with the structure described above and combining the microphone array voice input device with sound source position or direction-of-arrival estimation and with sound source separation, the problems of degraded recognition accuracy due to ambient noise and of wheelchair malfunction caused by ambient noise are solved. Moreover, even a disabled person with cerebral palsy who has spasticity and involuntary movements can operate the electric wheelchair safely without coming into contact with the microphone arrays.
Further, because the microphones on the pair of microphone mounting bodies attached to the tips of the left and right armrests are arranged at an angle so as to form a "ハ" shape as seen from the operator, each microphone is at substantially the same distance from the center of the seat, and sounds around the operator can be picked up at substantially the same level.
Also, because the microphones are arranged in a "ハ" shape about the operator, sound arriving at the operator from the surroundings can be picked up over a wide central angle centered on the operator. This contrasts with the conventional single-microphone case, in which only sound from a specific direction is picked up.
By placing two microphone arrays a certain distance apart, it becomes possible in principle, for example, to estimate the direction of arrival of a sound wave at each array and to estimate the coordinates of the sound source as the intersection of the two directions.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is an external view of an electric wheelchair equipped with the voice input device of the present invention, and FIG. 2 is a block circuit diagram of the voice input device shown in FIG. 1.
The electric wheelchair shown in FIG. 1 is an electric wheelchair provided with the voice input device and related components.
As a wheelchair, it has, for example, two rear wheels 23, two front wheels 22, a seat 20 and a backrest 25 installed above the rear wheels 23, armrests 21a and 21b installed on both sides of the backrest 25, and a footrest 24 installed in front of the front wheels 22. Microphone mounting bodies 10a and 10b are provided on the armrests 21a and 21b, respectively.
The voice input device is configured as shown in FIG. 2. Its main components are housed in the seat 20 or the backrest 25.

The microphone mounting bodies 10a and 10b, each consisting of a microphone array 12 of several microphones 11 arranged in a row on a substrate 13, are supported at the tips of the left and right armrests 21a and 21b by supports 14 that carry the wiring. The microphone arrays 12 on the two sides are arranged so that they form a "ハ" shape as seen by a person sitting on the seat 20. With this arrangement, each microphone is at substantially the same distance from the center of the seat 20, and sounds around the operator can be picked up at substantially the same level.
The microphone array 12 on each of the microphone mounting bodies 10a and 10b can carry any number of microphones 11, and that number is adjustable. The number of microphones, their spacing, and so on are set as desired.

FIG. 2 is a functional block diagram of the electric wheelchair of the present invention.
As shown in FIG. 2, expressed as functional blocks, the electric wheelchair has: two microphone arrays 12, a microphone amplifier 61, and an ADC (analog-to-digital converter) 61, which form part of the voice input device; a display 31 serving as display means; a CPU (central processing unit) board 63 and a storage device 64 serving as control means; drive control means 65 and a drive motor 67 serving as drive means; and operation switches 66, such as a joystick and an emergency stop button, serving as operation means. The CPU board 63 and the drive control means 65 are connected by a serial cable 69.
The microphone amplifier 61, the ADC 61, the CPU board 63 and storage device 64 serving as control means, and the drive control means 65 and drive motor 67 serving as drive means are housed in the seat 20 or the backrest 25 of the wheelchair.
The control means comprises the microphone amplifier 61, the ADC 61, the CPU board 63, and the storage device 64.

(Voice input device)
The voice input means includes sound receiving means consisting of a plurality of microphone arrays 12 arranged apart from one another to receive the user's voice.

(Utterance position estimation means and control means)
The CPU (central processing unit) board 63 is a board carrying a CPU and includes the utterance position estimation means and the control means. The utterance position estimation means and the control means are provided with the storage device 64 connected to the CPU board 63.
FIG. 3 is a functional explanatory diagram of the microphone arrays.
As shown in FIG. 3, the utterance position estimation means estimates the user's utterance position from the multichannel audio data received by the sound receiving means and outputs an utterance position estimation signal.
The control means controls the drive control means on the basis of the utterance position estimation signal and the auxiliary operation signal.
The ADC 61 and the CPU board 63 are connected via a USB cable 68, and power for the microphone amplifier and the ADC 61 is supplied from the CPU board 63. The sampling rate can be set freely, for example to 8 kHz, and the number of quantization bits can likewise be set freely, for example to 16 bits. To increase processing accuracy, the sampling rate and the number of quantization bits are raised.
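As a rough illustration of what these settings imply for the link between the ADC and the CPU board, the raw PCM data rate can be worked out directly. The sketch below assumes 8 channels, taken from the eight microphones drawn in FIG. 3; the function name is hypothetical and not part of the patent.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Raw (uncompressed) PCM data rate in bytes per second."""
    if bits_per_sample % 8 != 0:
        raise ValueError("bits_per_sample must be a whole number of bytes")
    return sample_rate_hz * (bits_per_sample // 8) * channels


# Example settings from the text: 8 kHz sampling, 16-bit quantization.
# The channel count of 8 is an assumption (FIG. 3 shows microphones 1 to 8).
rate = pcm_bytes_per_second(8000, 16, 8)
print(rate)  # 128000 bytes per second
```

Doubling either the sampling rate or the sample width doubles this figure, which is the cost of the higher processing accuracy mentioned in the text.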

(Auxiliary input means)
The auxiliary operation means, although not shown, is represented by the operation switches 66 and outputs an auxiliary operation signal from, for example, coordinate position designating means consisting of a joystick (not shown) and an emergency stop button (not shown).

(Image display means)
The image display means has the display 31 and visually presents the utterance position estimation signal, the state of the wheelchair, and so on.

(Drive means)
The drive means includes the drive control device 65 and controls the drive motor 67, which is the drive source for the wheelchair wheels.

(Utterance position detection)
The utterance position estimation means performs utterance position detection using the input signals from the voice input device, which has a plurality of sound receiving means.
To control the wheelchair by voice, it is necessary to determine whether a sound picked up by the microphones is the user's voice or environmental noise. This can be decided by estimating the position of the sound source: if the source is outside the wheelchair it is judged to be environmental noise, and if it is inside the wheelchair it is judged to be the user's voice.
For example, with only one microphone array the direction of arrival of a sound wave can be estimated, but unless the microphone spacing is made considerably wider it is difficult to measure the distance from the array to the source. In contrast, by placing two microphone arrays a certain distance apart as shown in FIG. 3, it becomes possible in principle, for example, to estimate the direction of arrival at each array and to estimate the coordinates of the source as the intersection of the two directions. Here, "a certain distance" means a separation large enough that the incoming wave, observed from the two microphone arrays, can be observed as a spherical wave.
For these reasons, the present invention adopts the structure shown in FIG. 3, in which two microphone arrays are placed a certain distance apart.
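The intersection idea can be sketched in two dimensions as follows. This is a minimal illustration under stated assumptions, not the patent's own procedure: each array is reduced to a single reference point, bearings are angles in a common horizontal coordinate system, and the function name and array coordinates are hypothetical.

```python
import numpy as np


def intersect_bearings(p1, theta1, p2, theta2, eps=1e-9):
    """Return the 2-D point where the rays from p1 and p2 cross.

    p1, p2: reference points of the two arrays; theta1, theta2: bearings in radians.
    Returns None when the rays are (nearly) parallel or cross behind an array.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    d1 = np.array([np.cos(theta1), np.sin(theta1)])
    d2 = np.array([np.cos(theta2), np.sin(theta2)])
    # Solve p1 + t1*d1 = p2 + t2*d2 for the ray parameters t1, t2.
    A = np.column_stack([d1, -d2])
    if abs(np.linalg.det(A)) < eps:
        return None
    t1, t2 = np.linalg.solve(A, p2 - p1)
    if t1 < 0 or t2 < 0:
        return None
    return p1 + t1 * d1


# Hypothetical layout: arrays 0.6 m apart at the armrest tips, speaker behind them.
left_array, right_array = (-0.3, 0.0), (0.3, 0.0)
true_source = np.array([0.0, -0.4])
theta_left = np.arctan2(true_source[1] - left_array[1], true_source[0] - left_array[0])
theta_right = np.arctan2(true_source[1] - right_array[1], true_source[0] - right_array[0])
estimate = intersect_bearings(left_array, theta_left, right_array, theta_right)
```

With noisy bearings the two rays rarely cross exactly where the source is, which is one reason a direct position search over a grid, as described later in the text, can be preferable.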

(Speech recognition device)
FIG. 4 is a block diagram of the speech recognition device of the present invention. In FIG. 2, this speech recognition device is made up of the CPU board 63 and the storage device 64.
The speech recognition device 40 consists of a microphone array processing unit 41 and a speech recognition processing unit 42.
The microphone array processing unit 41 consists of: a microphone array voice input device 43 that picks up the input sound; far-field direction-of-arrival estimation means 45 that estimates, from the sound output by the device 43, the direction of arrival of sound waves from distant sources; near-field position estimation means 46 that estimates, from the same output, the position of nearby sources; sound source separation means 44 that, using the source position information from the means 45 and 46, separates the sound of the target source from the output of the device 43; user utterance detection means 47 that detects utterances of the user (the wearer of the headset-type microphone array voice input device) from the source position information of the means 45 and 46; and switching means 48 that switches the output of the speech signal from the sound source separation means 44 according to the detection signal from the user utterance detection means 47.
The speech recognition processing unit 42 consists of feature correction means 49, which applies feature correction to the speech signal from the switching means 48, and speech recognition means 50, which recognizes the feature-corrected speech signal from the means 49 and outputs the recognition result.

The speech recognition device using the microphone arrays of the present invention is built from the following five elemental technologies.
1. Position estimation of sound sources at a short distance from the microphone array
2. Direction-of-arrival estimation of sound sources at a long distance from the microphone array
3. User utterance detection
4. Sound source separation
5. Speech recognition (Japanese Patent Application No. 2003-320183)
These elemental technologies are described in detail below.

(Sound source position estimation)
FIG. 3 is a functional explanatory diagram of the microphone arrays of the present invention.
As shown in FIG. 3, microphones 1, 2, 3, and 4 are arranged facing microphones 5, 6, 7, and 8. The positional relationship between the microphones and the sound source is assumed to be as shown in the figure.
The following describes how the microphone arrays are used to estimate the position of a sound source at a short distance, within about 1 m of the arrays.

The microphones can be placed at arbitrary positions in three-dimensional space. An acoustic signal emitted by a sound source at an arbitrary position P0 in three-dimensional space is received by Q microphones placed at arbitrary positions in that space. The distance Rq between the sound source and each microphone is the Euclidean distance between P0 and the position of the q-th microphone.

The propagation time τq from the sound source to each microphone is τq = Rq / v, where v is the speed of sound.

The gain gq of the narrowband signal of center frequency ω received at each microphone, relative to that signal at the source, is in general defined as a function of the source-to-microphone distance Rq and the center frequency ω.

For example, the gain is treated as a function of the distance Rq alone, using a function determined experimentally.

The transfer characteristic between the sound source and each microphone for the narrowband signal of center frequency ω is expressed in terms of the gain gq and the propagation time τq. The position vector a(ω, P0) representing a sound source at position P0 is then defined as the complex vector whose elements are the narrowband transfer characteristics between the source and each of the microphones.
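A minimal sketch of such a position vector is given below. Several choices here are assumptions rather than the patent's definitions: the gain is taken as 1/Rq (the experimentally determined gain function is not reproduced in this text), the phase term is exp(-jωτq), the speed of sound is fixed at 343 m/s, and the function name is hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for room-temperature air


def near_field_steering_vector(source_pos, mic_positions, omega, v=SPEED_OF_SOUND):
    """Complex position vector a(omega, P0) for a near-field point source.

    source_pos: (3,) source position P0 in metres.
    mic_positions: (Q, 3) microphone positions in metres.
    omega: angular centre frequency of the narrowband signal in rad/s.
    """
    mics = np.asarray(mic_positions, dtype=float)
    src = np.asarray(source_pos, dtype=float)
    R = np.linalg.norm(mics - src, axis=1)   # distances R_q
    tau = R / v                              # propagation times tau_q
    gain = 1.0 / np.maximum(R, 1e-3)         # ASSUMED spherical-spreading gain
    return gain * np.exp(-1j * omega * tau)  # one transfer characteristic per microphone


a = near_field_steering_vector((0.0, 0.0, 0.0),
                               [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
                               omega=2 * np.pi * 100.0)
```

Because the distances differ per microphone, both the magnitude and the phase of each element depend on the source position, which is what lets a near-field search recover range as well as direction.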

The sound source position is estimated with the MUSIC method (a technique that obtains the signal subspace and the noise subspace by eigenvalue decomposition of the correlation matrix and examines the direction of arrival or position of a source from the reciprocal of the inner product between an arbitrary source position vector and the noise subspace), in the following steps. The short-time Fourier transform of the q-th microphone input is computed, and the observation vector is defined as the vector whose elements are these transforms for the Q microphones.

The observation vector is indexed by the frame time n. The correlation matrix is computed from N consecutive observation vectors.
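The correlation matrix step can be sketched as below. The normalization by 1/N and the array layout (one row per frame, one column per microphone, at a single frequency bin) are assumptions, since the original equation is not reproduced in this text; the function name is hypothetical.

```python
import numpy as np


def spatial_correlation(X):
    """Sample correlation matrix from N observation vectors at one frequency.

    X: complex array of shape (N, Q); row n is the observation vector of frame n.
    Returns the (Q, Q) matrix (1/N) * sum_n x_n x_n^H.
    """
    X = np.asarray(X)
    n_frames = X.shape[0]
    # (X.T @ X.conj())[i, j] = sum_n X[n, i] * conj(X[n, j])
    return (X.T @ X.conj()) / n_frames


R_demo = spatial_correlation(np.array([[1.0, 1.0j]]))
```

The result is Hermitian by construction, which is what allows the real, sortable eigenvalues used in the next step.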

The eigenvalues of this correlation matrix are arranged in descending order, and the corresponding eigenvectors are ordered accordingly. The number of sound sources S is then estimated from the eigenvalues.

Alternatively, a threshold may be set for the eigenvalues and the number of eigenvalues exceeding it taken as the number of sound sources S.
The matrix Rn(ω) is defined from the basis vectors of the noise subspace.
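The threshold variant of this step might look as follows. This sketch covers only the threshold-based source count mentioned in the text; treating the noise-subspace matrix as the eigenvectors beyond the first S is the standard MUSIC construction and is assumed here, and the function name is hypothetical.

```python
import numpy as np


def split_subspaces(R, eig_threshold):
    """Eigendecompose a Hermitian correlation matrix and split off the noise subspace.

    Returns (S, noise_basis, eigenvalues) where S is the number of eigenvalues
    above eig_threshold, noise_basis holds the remaining Q - S eigenvectors as
    columns, and eigenvalues are sorted in descending order.
    """
    w, V = np.linalg.eigh(R)        # eigh returns ascending eigenvalues
    order = np.argsort(w)[::-1]     # reorder to descending
    w, V = w[order], V[:, order]
    S = int(np.sum(w > eig_threshold))
    return S, V[:, S:], w


# Synthetic check: one source with position vector a_demo plus weak white noise.
a_demo = np.ones(4, dtype=complex)
R_synth = np.outer(a_demo, a_demo.conj()) + 0.01 * np.eye(4)
S_est, noise_basis, eigvals = split_subspaces(R_synth, eig_threshold=0.5)
```

In this synthetic case the single large eigenvalue belongs to the source, and the noise basis is orthogonal to the source's position vector, which is the property the search function exploits.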

Given a frequency band and a search region U for sound source position estimation, the function F(P) of Expression (18) is computed over U. The coordinate vectors at which F(P) takes local maxima are then found. Suppose here that S coordinate vectors P1, P2, ..., Ps giving local maxima have been estimated. Next, the power of the sound source at each of these coordinate vectors is computed.
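A grid search of this kind can be sketched as follows. This is an illustrative single-frequency version: the patent's F(P) is defined over a frequency band by an equation not reproduced in this text, so the form 1 / ||En^H a(P)||^2, the 1/R gain, the microphone layout, and all names below are assumptions.

```python
import numpy as np


def steering(P, mics, omega, v=343.0):
    """Near-field position vector with an ASSUMED 1/R gain."""
    R = np.linalg.norm(mics - np.asarray(P, dtype=float), axis=1)
    return np.exp(-1j * omega * R / v) / R


def music_spectrum(candidates, mics, omega, noise_basis):
    """Evaluate 1 / ||En^H a(P)||^2 for each candidate position P."""
    values = []
    for P in candidates:
        a = steering(P, mics, omega)
        a = a / np.linalg.norm(a)                 # normalise so only direction in C^Q matters
        proj = noise_basis.conj().T @ a           # component of a(P) in the noise subspace
        energy = float(np.real(np.vdot(proj, proj)))
        values.append(1.0 / max(energy, 1e-12))   # floor avoids division by zero
    return np.array(values)


# Hypothetical layout: two microphones on each armrest tip (metres).
mics = np.array([[-0.3, 0.0, 0.0], [-0.3, 0.1, 0.0], [0.3, 0.0, 0.0], [0.3, 0.1, 0.0]])
omega = 2 * np.pi * 1000.0
candidates = np.array([[0.0, -0.4, 0.3], [0.5, 1.0, 0.0], [-1.0, 0.2, 0.1]])

# Synthetic correlation matrix for a source at the first candidate.
a_true = steering(candidates[0], mics, omega)
R = np.outer(a_true, a_true.conj()) + 1e-6 * np.eye(4)
w, V = np.linalg.eigh(R)          # ascending eigenvalues
noise_basis = V[:, :3]            # three smallest: noise subspace for one source
F = music_spectrum(candidates, mics, omega, noise_basis)
best = candidates[int(np.argmax(F))]
```

The true position produces a sharp peak because its position vector is orthogonal to the noise subspace; in practice the candidates would form a dense grid covering the search region, and local maxima rather than a single argmax would be collected.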

Two thresholds Fthr and Pthr are then prepared, and when F(Ps) and P(Ps) at a position vector satisfy the threshold condition of Expression (20), it is judged that an utterance occurred at that coordinate vector within the N consecutive frames.
The sound source position estimation processes N consecutive frames as one block. To make the estimate more stable, the number of frames N is increased, and/or an utterance is judged to have occurred only when the condition of Expression (20) is satisfied in all of Nb consecutive blocks. The number of blocks is set as desired. If the source moves slowly enough to be regarded as approximately stationary within the duration of N consecutive frames, this method can capture the trajectory of the moving source.
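The block-wise stabilization can be sketched as a small state machine. The comparison direction (both values at or above their thresholds) is an assumption, because the condition itself is not reproduced in this text; the class and parameter names are hypothetical.

```python
from collections import deque


class UtteranceDetector:
    """Declare an utterance only after n_blocks consecutive blocks pass both thresholds."""

    def __init__(self, f_thr: float, p_thr: float, n_blocks: int):
        self.f_thr = f_thr
        self.p_thr = p_thr
        self._history = deque(maxlen=n_blocks)  # keeps only the latest n_blocks results

    def update(self, f_value: float, power: float) -> bool:
        """Feed the peak value and source power of one block; return the decision."""
        passed = f_value >= self.f_thr and power >= self.p_thr  # ASSUMED form of the condition
        self._history.append(passed)
        return len(self._history) == self._history.maxlen and all(self._history)


detector = UtteranceDetector(f_thr=10.0, p_thr=0.5, n_blocks=3)
blocks = [(20, 1), (20, 1), (5, 1), (20, 1), (20, 1), (20, 1)]
decisions = [detector.update(f, p) for f, p in blocks]
print(decisions)  # [False, False, False, False, False, True]
```

A single failing block resets the run, so a larger block count trades response latency for fewer false triggers from short noise bursts.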
(Direction-of-arrival estimation of ambient noise)

The following describes a method for estimating, with the microphone array, the direction from which sound waves arrive from a source at a long distance from the array.
The microphones can be placed at arbitrary positions in three-dimensional space. Sound waves arriving from a long distance are assumed to be observed as plane waves.

FIG. 5 is an explanatory diagram of the sound receiving function using the microphone array of the present invention.
As an example, FIG. 5 shows a case in which three microphones m1, m2, and m3 placed at arbitrary positions receive a sound wave arriving from a source. In FIG. 5, point c is the reference point, and the direction of arrival is estimated about this reference point. The plane s in FIG. 5 is a cross section of the plane wave that contains the reference point c. The normal vector n of the plane s is defined so that it points opposite to the direction of propagation of the sound wave.

The direction of arrival of sound waves from a source in three-dimensional space is expressed by two parameters (θ, φ). Sound waves arriving from direction (θ, φ) are received by each microphone, and each received signal is decomposed into narrowband signals by taking its Fourier transform. For each narrowband signal of each received signal, the gain and phase are expressed as a complex number, and the vector formed by arranging these complex numbers for all received signals, for each narrowband signal, is defined as the position vector of the sound source. In the processing that follows, a sound wave arriving from direction (θ, φ) is represented by this position vector. Concretely, the position vector is obtained as follows. First, the signed distance rq between the q-th microphone and the plane s is computed.

The distance rq is positive when the microphone lies on the source side of the plane s and negative when it lies on the side opposite the source. With v as the speed of sound, the propagation time Tq between the microphone and the plane s is Tq = rq / v.

Taking the amplitude at the plane s as the reference, the gain for the amplitude at a point a distance rq away is defined as a function of the center frequency ω of the narrowband signal and the distance rq.

Taking the phase at the plane s as the reference, the phase difference at a point a distance rq away is expressed in terms of the propagation time Tq.

From the above, with the plane s as the reference, the gain and phase difference of the narrowband signal observed at each microphone are combined in Expression (26).

When sound waves arriving from direction (θ, φ) are observed with Q microphones, the position vector of the sound source is defined as the vector whose elements are the values obtained for each microphone according to Expression (26).
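A plane-wave position vector along these lines can be sketched as below. The angle convention (θ as azimuth, φ as elevation), unit gain, and the sign of the phase term are assumptions, since the defining equations are not reproduced in this text; the function name is hypothetical.

```python
import numpy as np


def far_field_steering_vector(theta, phi, mic_positions, omega,
                              ref_point=(0.0, 0.0, 0.0), v=343.0):
    """Position vector for a plane wave arriving from direction (theta, phi).

    theta: azimuth in radians, phi: elevation in radians (ASSUMED convention).
    The unit normal n points from the reference point toward the source, so the
    signed distance r_q is positive for microphones on the source side of plane s.
    """
    n = np.array([np.cos(phi) * np.cos(theta),
                  np.cos(phi) * np.sin(theta),
                  np.sin(phi)])
    rel = np.asarray(mic_positions, dtype=float) - np.asarray(ref_point, dtype=float)
    r = rel @ n          # signed distances r_q to the plane s
    T = r / v            # propagation times T_q
    # Microphones on the source side receive the wavefront earlier than plane s,
    # hence a phase lead; the gain is ASSUMED to be 1 for a plane wave.
    return np.exp(1j * omega * T)


omega_demo = 2 * np.pi * 100.0
a_far = far_field_steering_vector(0.0, 0.0,
                                  [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
                                  omega_demo)
```

Unlike the near-field vector, this one depends only on direction: moving the source farther away along the same bearing leaves every element unchanged.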

Once the position vector of the sound source has been defined, the direction of arrival is estimated with the MUSIC method. Using the matrix Rn(ω) given by Expression (15) and a search region I for direction-of-arrival estimation, the function J(θ, φ) of Expression (29) is computed. The directions (θ, φ) at which J(θ, φ) takes local maxima are then found. Suppose here that K sound sources exist and that K directions of arrival ((θ1, φ1), ..., (θK, φK)) giving local maxima have been estimated. Next, the power of the sound source in each of these directions of arrival is computed.

Two thresholds Jthr and Qthr are then prepared, and when J(θk, φk) and Q(θk, φk) for a direction of arrival satisfy the threshold condition of Expression (31), it is judged that an utterance occurred in the direction (θk, φk) within the N consecutive frames. The direction-of-arrival estimation processes N consecutive frames as one block. To make the estimate more stable, the number of frames N is increased, and/or a sound wave is judged to have arrived from a direction only when the condition of Expression (31) is satisfied in all of Nb consecutive blocks. The number of blocks is set as desired. If the source moves slowly enough to be regarded as approximately stationary within the duration of N consecutive frames, this method can capture the trajectory of the direction of arrival.

The position estimation results for short-distance sound sources and the arrival direction estimation results for long-distance sound sources play an important role in the subsequent speech detection and sound source separation processing. However, when a short-distance source and a long-distance source are active at the same time, and the power of the short-distance source is far greater than that of the sound wave arriving from the long-distance source, the arrival direction of the long-distance source may not be estimated well. In such a case, one remedy is to use the arrival direction estimate for the long-distance source that was obtained immediately before the short-distance source became active.

(Speech detection processing)
When a plurality of sound sources exist, it is generally difficult to identify which one is the speech to be recognized. On the other hand, in a system that employs a voice interface, a user utterance region, which represents where the user speaks relative to the system, can be determined in advance. In that case, even if several sound sources exist around the system, once the position of each source or the arrival direction of its sound wave has been estimated by the methods described above, the user's voice can be identified easily by selecting the source that falls within the user utterance region assumed by the system.

The presence of a sound source is detected when the condition of equation (20) or equation (31) is satisfied, and the user's utterance is detected when the condition on the source position or the arrival direction is satisfied as well. This detection result plays an important role in the subsequent speech recognition processing as utterance section information. Speech recognition requires detecting the start and end times of the utterance section in the input signal, but detecting the utterance section is not always easy in an environment with ambient noise. In general, if the detected start time of the utterance section is off, speech recognition accuracy deteriorates significantly. On the other hand, even when a plurality of sound sources exist, the functions of equations (18) and (29) show sharp peaks at the positions of the sources and in the arrival directions of their sound waves. The speech recognition apparatus of the present invention, which uses this information to detect utterance sections, therefore has the advantage that it can detect utterance sections robustly even in the presence of multiple ambient noise sources and can maintain high speech recognition accuracy.
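As an illustration of how per-block detection results could be turned into utterance section information, the sketch below converts a sequence of detection flags into start and end indices. The flag representation and the name `utterance_sections` are assumptions made for this example, not part of the described apparatus.

```python
def utterance_sections(flags):
    """Return (start, end) block indices, end exclusive, for each run of True flags."""
    sections = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # utterance section begins
        elif not flag and start is not None:
            sections.append((start, i))    # utterance section ends
            start = None
    if start is not None:                  # section still open at the end of input
        sections.append((start, len(flags)))
    return sections
```

For example, `utterance_sections([False, True, True, False, True])` yields two sections, one covering blocks 1 and 2 and one covering block 4.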

For example, a user utterance region such as that shown in FIG. 6 can be defined.
FIG. 6 is a functional explanatory diagram of the speech detection processing according to the present invention.
For simplicity the figure shows only the X-Y plane, but an arbitrary user utterance region can be defined in the same way in three-dimensional space. FIG. 6 assumes processing with eight microphones m1 to m8 arranged at arbitrary positions, and defines a user utterance region within each of the short-distance and long-distance sound source search regions. The short-distance search space is the rectangular region whose diagonal is the line connecting the two points (PxL, PyL) and (PxH, PyH). Within it, the two rectangular regions whose diagonals connect (PTxL1, PTyL1) with (PTxH1, PTyH1) and (PTxL2, PTyL2) with (PTxH2, PTyH2) are defined as the user's utterance regions. Accordingly, among the sound source positions at which an utterance was judged to have occurred by equation (20), selecting those whose coordinate vectors lie within a user utterance region identifies the user's voice among the sources at short distance.
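The selection of short-distance sources inside the rectangular user utterance regions can be sketched as a simple point-in-rectangle test. Each region is given by the two end points of its diagonal, as in the description; the function names and the tuple representation are hypothetical.

```python
def in_rect(point, corner_a, corner_b):
    """True if `point` lies in the axis-aligned rectangle with diagonal corner_a-corner_b."""
    x, y = point
    x_lo, x_hi = sorted((corner_a[0], corner_b[0]))
    y_lo, y_hi = sorted((corner_a[1], corner_b[1]))
    return x_lo <= x <= x_hi and y_lo <= y <= y_hi

def select_user_sources(positions, user_regions):
    """Keep the estimated source positions that fall inside any user utterance region."""
    return [p for p in positions
            if any(in_rect(p, a, b) for a, b in user_regions)]
```

Sorting the corner coordinates means the two diagonal end points can be supplied in either order.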

The search space for long-distance sound sources, on the other hand, takes point C as its reference and covers the directions from angle θL to θH; within it, the range from angle θTL1 to θTH1 is defined as the user's utterance region. Accordingly, among the arrival directions in which an utterance was judged to have occurred by equation (31), selecting those that lie within the user utterance region identifies the user's voice among the sources at long distance.
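For long-distance sources the corresponding test is an angular one. The sketch below assumes angles in radians measured counter-clockwise about the reference point and handles sectors that wrap past 2π; these conventions and the function names are assumptions for illustration.

```python
import math

def in_sector(theta, theta_lo, theta_hi):
    """True if `theta` lies in the sector swept counter-clockwise from theta_lo to theta_hi."""
    two_pi = 2.0 * math.pi
    span = (theta_hi - theta_lo) % two_pi     # angular width of the sector
    offset = (theta - theta_lo) % two_pi      # position of theta relative to theta_lo
    return offset <= span

def select_user_directions(directions, theta_lo, theta_hi):
    """Keep the estimated arrival directions that fall inside the user utterance sector."""
    return [th for th in directions if in_sector(th, theta_lo, theta_hi)]
```

Working with offsets modulo 2π avoids a special case when the user utterance sector straddles the zero-angle direction.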

(Sound source separation processing)
The sound source separation processing, which emphasizes the user's voice and suppresses ambient noise using the position estimate or arrival direction estimate of the source for which speech was detected, is described below.
The utterance position or arrival direction of the user's voice has been obtained by the speech detection processing, and the source positions or arrival directions of the ambient noise have also already been estimated. Using these estimates, the sound source position vectors of equations (8) and (27), and σ, which represents the variance of the omnidirectional noise, the correlation matrix V(ω) is defined.

The eigenvalues of this correlation matrix are arranged in descending order, and the eigenvector corresponding to each eigenvalue is obtained.

Here, since the correlation matrix V(ω) contains (S+K) sound sources in total, S short-distance sources and K long-distance sources, Z(ω) is defined using the (S+K) largest eigenvalues and their eigenvectors.

The separation filter W(ω) that emphasizes the voice of a user located at the short-distance coordinate vector P is then given by equation (36).
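The selection of the (S+K) dominant eigenpairs can be sketched with NumPy as below. How Z(ω) and the separation filter are assembled from these pairs is defined by equations not reproduced in this text, so the sketch covers only the ordering and selection step; the name `dominant_eigenpairs` is hypothetical.

```python
import numpy as np

def dominant_eigenpairs(V, n_sources):
    """Return the `n_sources` largest eigenvalues of Hermitian V and their eigenvectors.

    n_sources corresponds to S + K, the total number of short- and
    long-distance sources assumed to be present.
    """
    vals, vecs = np.linalg.eigh(V)        # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]        # indices in descending eigenvalue order
    keep = order[:n_sources]
    return vals[keep], vecs[:, keep]      # eigenvectors are the columns
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because a correlation matrix is Hermitian, which guarantees real eigenvalues and orthonormal eigenvectors.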

Multiplying the separation filter of equation (36) by the observation vector of equation (10) yields the voice v(ω) of the user located at the coordinate vector P, as expressed by equation (37).

The waveform signal of this emphasized user voice is obtained by calculating the inverse Fourier transform of equation (37).
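Applying a separation filter and returning to the time domain can be sketched as follows for a single frame. The sketch assumes that the filter and the observation are stored as arrays of shape (frequency bins, microphones), that the product is the Hermitian inner product over microphones at each frequency, and that a one-sided spectrum is used; the filter itself is taken as given, and overlap-add across frames is omitted. The name `enhance_frame` is hypothetical.

```python
import numpy as np

def enhance_frame(W, X, frame_len):
    """Apply a per-frequency separation filter and return the time-domain frame.

    W, X      : (n_bins, n_mics) complex arrays, n_bins = frame_len // 2 + 1
    frame_len : number of time samples in the frame
    """
    # Assumed convention: v(w) = W(w)^H x(w), evaluated bin by bin.
    v = np.sum(np.conj(W) * X, axis=1)
    # Inverse FFT of the one-sided spectrum gives the real waveform.
    return np.fft.irfft(v, n=frame_len)
```

With a filter that simply averages identical channels, the output reproduces the input waveform, which is a convenient sanity check before substituting a real separation filter.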

On the other hand, the separation filter M(ω) for emphasizing the voice of a user located in the long-distance direction (θ, φ) is given by equation (38).

Multiplying the separation filter of equation (38) by the observation vector of equation (10) yields the emphasized voice v(ω) of the user in the direction (θ, φ).

The waveform signal of this emphasized user voice is obtained by calculating the inverse Fourier transform of equation (37).
When the sound source moves slowly enough that it can be regarded as approximately stationary within the time of N consecutive frames, the emphasized voice of a moving user can be obtained by the above method.

(Voice recognition processing)
The sound source separation processing is effective against directional noise, but some noise remains in the case of omnidirectional noise. Nor can much suppression be expected for noise of short duration, such as sudden noise. Therefore, the user's voice emphasized by the sound source separation processing is recognized with a speech recognition engine incorporating, for example, the feature correction method described in Japanese Patent Application No. 2003-320183, "Background Noise Distortion Correction Processing Method and Speech Recognition System Using the Same", which reduces the influence of the residual noise. The speech recognition engine of the present invention is not limited to that of Japanese Patent Application No. 2003-320183; a speech recognition engine implementing any of various other noise-robust methods may also be used.

The feature correction method described in Japanese Patent Application No. 2003-320183 corrects the features of noise-superimposed speech on the basis of the Hidden Markov Model (HMM) that the speech recognition engine already holds as a template model for speech recognition. The HMM is trained on Mel-Frequency Cepstrum Coefficients (MFCC) obtained from clean, noise-free speech. Consequently, no new parameters need to be prepared for feature correction, and the method can be incorporated into an existing recognition engine relatively easily. The method treats noise as consisting of a stationary component and a non-stationary component that varies temporarily, and estimates the stationary component from the several frames immediately preceding the utterance.
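The dependence of the stationary noise estimate on the detected utterance start can be illustrated with a small sketch. The source states only that the stationary component is estimated from several frames just before the utterance; averaging those frames, the default of five frames, and the name `stationary_noise_estimate` are assumptions made here.

```python
import numpy as np

def stationary_noise_estimate(features, start_frame, n_frames=5):
    """Mean feature vector of up to `n_frames` frames immediately before `start_frame`.

    features   : (n_total_frames, n_dims) array of per-frame feature vectors
    start_frame: index of the first frame of the detected utterance
    """
    first = max(0, start_frame - n_frames)
    if first == start_frame:
        raise ValueError("no frames precede the utterance start")
    # Assumed estimator: simple average over the pre-utterance frames.
    return features[first:start_frame].mean(axis=0)
```

If `start_frame` is detected too late, the averaged frames already contain speech, which is one way to see why an inaccurate utterance start degrades recognition with this kind of correction.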

A copy of the distributions held by the HMM is generated, and the estimated stationary noise component is added to it to produce the feature distribution of speech with stationary noise superimposed. Evaluating the posterior probability of the observed noisy-speech features against this distribution absorbs the distortion caused by the stationary noise component. This processing alone, however, does not account for distortion caused by the non-stationary component, so when a non-stationary component is present the posterior probability obtained in this way is no longer accurate. Using the HMM for feature correction, on the other hand, makes available the temporal structure of the feature time series and the cumulative output probability obtained along it. Applying a weight calculated from this cumulative output probability to the posterior probability improves the reliability of the posterior probability degraded by the temporarily varying non-stationary noise component.

Speech recognition requires detecting the start and end times of the utterance section in the input signal, but detecting the utterance section is not always easy in an environment with ambient noise. In particular, because a speech recognition engine incorporating the feature correction estimates the stationary characteristics of the ambient noise from the several frames immediately before the start of the utterance, recognition accuracy deteriorates significantly if the detected start time of the utterance section is off. On the other hand, even when a plurality of sound sources exist, the functions of equations (18) and (29) show sharp peaks at the positions of the sources and in the arrival directions of their sound waves. The speech recognition apparatus of the present invention, which uses this information to detect utterance sections, can therefore detect utterance sections robustly even in the presence of multiple ambient noise sources and can maintain high speech recognition accuracy.
The drive mechanism of the wheelchair is controlled using the signal obtained as the result of this speech recognition.

FIG. 1 is an overview of the microphone array voice input device of the present invention mounted on a wheelchair.
FIG. 2 is a functional block diagram of the electric wheelchair of the present invention.
FIG. 3 is a functional explanatory diagram of the microphone array of the present invention.
FIG. 4 is a block configuration diagram of the speech recognition apparatus of the present invention.
FIG. 5 is an explanatory diagram of the sound reception function using the microphone array of the present invention.
FIG. 6 is a functional explanatory diagram of the speech detection processing according to the present invention.

Explanation of symbols

10a, 10b Microphone attachment body
11 Microphone
12 Microphone array
13 Substrate
14 Support
20 Seat
21a, 21b Armrest
25 Backrest
30a, 30b Parallel microphone array
31 Display
32 Microphone amplifier and ADC
33 CPU board
34 Storage device
35 Earphone speaker
36 Transmission/reception device
40 Speech recognition device
41 Microphone array processing unit
42 Speech recognition processing unit
43 Microphone array voice input device
44 Sound source separation processing means
45 Sound wave arrival direction estimation means for long-distance sound sources
46 Position estimation means for short-distance sound sources
47 User speech detection means
48 Switch
49 Feature correction processing means
50 Speech recognition means
m1, m2, m3, m4, m5, m6, m7, m8 Microphone

Claims

A voice input device comprising: an electric wheelchair having armrests; microphone attachment bodies, each fitted with a microphone array in which a plurality of microphones are spaced apart, disposed so as to protrude from the tip portions of the left and right armrests of the electric wheelchair; and control means for performing sound source position estimation or voice recognition based on the signals captured from both microphone arrays,
wherein the microphones are arranged on the attachment bodies so that the two microphone arrays together form a C shape when viewed from the operator.

The voice input device according to claim 1, wherein the control means specifies an operator instruction based on the sound source position estimation or voice recognition, and controls the vehicle to travel according to the instruction.