JP2010056762A

JP2010056762A - Microphone array

Info

Publication number: JP2010056762A
Application number: JP2008218398A
Authority: JP
Inventors: Keishin Nishiura; 敬信西浦
Original assignee: Murata Machinery Ltd; Ritsumeikan Trust
Current assignee: Murata Machinery Ltd; Ritsumeikan Trust
Priority date: 2008-08-27
Filing date: 2008-08-27
Publication date: 2010-03-11

Abstract

PROBLEM TO BE SOLVED: To provide a microphone array that yields a sound signal with an improved SNR (signal-to-noise ratio) compared with conventional techniques at a site where loud noise is generated, such as a plant. SOLUTION: The microphone array 10 includes a microphone 1 disposed at the upper vertex of a pyramid such that its main axis of radiation is substantially directed toward the speaker's mouth, and a plurality of microphones 2, 3 and 4 disposed at at least two vertices of the base of the pyramid such that their main axes of radiation are substantially parallel to the direction of the speaker's mouth. The pyramid is, for instance, a triangular pyramid or a regular triangular pyramid. COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a microphone array suitable for improving the speech recognition rate in, for example, a predetermined noise environment, and to a speech recognition apparatus using the microphone array.

For example, Patent Document 1 discloses a speech recognition apparatus that can improve the speech recognition rate by estimating the direction or position of a speaker.

In the speech recognition apparatus according to this conventional example, which is provided with a microphone array in which a plurality of microphones are arranged side by side at a predetermined interval, a direction estimation unit estimates the azimuth angle of at least one sound source received by the microphone array based on the electrical signals output from the microphones, and a beamforming unit generates, based on those electrical signals, at least one beam signal corresponding to the direction of the estimated azimuth angle of the at least one sound source. Next, a sound source determination unit determines, for each beam signal, whether the beam signal is speech or non-speech using a speech HMM and a noise HMM, and a speech recognition unit 17, when the beam signal is determined to be speech, performs speech recognition on that beam signal and outputs a speech recognition result.

Patent Document 1: JP 2002-091469 A.
Patent Document 2: JP 2003-044092 A.
Patent Document 3: JP H11-327593 A.
Non-Patent Document 1: S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-27, pp. 113-120, April 1979.

However, when a microphone array is used for speech recognition at a site where loud noise is generated, such as a factory, there is a problem in that the noise significantly lowers the speech recognition rate.

An object of the present invention is to solve the above problem and to provide a microphone array capable of obtaining an audio signal whose signal-to-noise power ratio (hereinafter, SNR) is improved compared with the prior art at a site where loud noise is generated, such as a factory, and a speech recognition apparatus that achieves a higher speech recognition rate than the prior art by performing speech recognition using that microphone array.

The microphone array according to the present invention comprises:
a first microphone provided at the upper vertex among the vertices of a pyramid such that its main axis of radiation is substantially directed toward the speaker's mouth; and
a plurality of second microphones provided at at least two vertices of the base of the pyramid such that their main axes of radiation are substantially parallel to the direction of the speaker's mouth.

In the above microphone array, the pyramid is a triangular pyramid or a regular triangular pyramid.

Also, in the above microphone array, three second microphones are provided at the three vertices of the base of the regular triangular pyramid.

Furthermore, the microphone array is a microphone array for speech recognition.

According to the microphone array of the present invention, collecting the speaker's voice with at least three microphones makes it possible to obtain an audio signal that is improved compared with the prior art. Further, by recording audio signals with the microphone array, generating a plurality of cardioid signals using the subtractive array method, adding those cardioid signals that have the higher SNRs, removing noise from the sum signal using the spectral subtraction method, and then performing speech recognition, the speech recognition rate can be improved compared with the prior art at a site where loud noise is generated, such as a factory.

Hereinafter, embodiments according to the present invention will be described with reference to the drawings. In the following embodiments, like components are denoted by the same reference numerals.

FIG. 1 is a perspective view showing the arrangement of a microphone array 10 according to an embodiment of the present invention, FIG. 2 is a side view showing a microphone housing 11 containing the microphone array 10 of FIG. 1, and FIG. 3 is a front view showing the microphone housing 11 of FIG. 2. The microphone array 10 according to the present embodiment is characterized in that omnidirectional microphones 1, 2, 3 and 4 are provided at the vertices of a regular triangular pyramid. In FIG. 1 and the subsequent arrangement diagrams, the positions of microphones 1, 2, 3 and 4 are shown in an XYZ three-dimensional coordinate system, and the coordinates of microphones 1 to 4 are as follows.

(A) XYZ coordinates of microphone 1 = (0, 0, 0); the position of the upper vertex of the regular triangular pyramid, located at the origin of the XYZ three-dimensional coordinate system.
(B) XYZ coordinates of microphone 2 = (0, √6·d/3, √3·d/3); the position of one vertex of the base of the regular triangular pyramid, located in the direction of 0 degrees in the XY plane and 55 degrees in the XZ plane.
(C) XYZ coordinates of microphone 3 = (d/2, √6·d/3, −√3·d/6); the position of one vertex of the base of the regular triangular pyramid, located in the direction of 30 degrees in the XY plane and 110 degrees in the XZ plane.
(D) XYZ coordinates of microphone 4 = (−d/2, √6·d/3, −√3·d/6); the position of one vertex of the base of the regular triangular pyramid, located in the direction of 300 degrees in the XY plane and 110 degrees in the XZ plane.
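As a quick consistency check of the coordinates in (A) to (D), the following minimal Python sketch builds the four positions for an edge length d and confirms that every pair of microphones is separated by d, as a regular triangular pyramid requires. The function names are illustrative only and do not appear in the specification.

```python
import math
from itertools import combinations


def tetrahedron_mics(d):
    """Return XYZ coordinates of microphones 1-4 for edge length d,
    following items (A)-(D): apex at the origin, base at Y = sqrt(6)*d/3."""
    h = math.sqrt(6) * d / 3   # height of the pyramid along +Y
    r = math.sqrt(3) * d / 3   # distance from the base centroid to a base vertex
    return {
        1: (0.0, 0.0, 0.0),
        2: (0.0, h, r),
        3: (d / 2, h, -r / 2),   # -sqrt(3)*d/6 == -r/2
        4: (-d / 2, h, -r / 2),
    }


def pairwise_distances(mics):
    """Distance between every pair of microphones."""
    return {
        (a, b): math.dist(mics[a], mics[b])
        for a, b in combinations(sorted(mics), 2)
    }


if __name__ == "__main__":
    for pair, dist in pairwise_distances(tetrahedron_mics(10.0)).items():
        print(pair, round(dist, 6))
```

With d = 10 (millimetres), all six pairwise distances come out equal to d up to floating-point rounding.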

In FIG. 1, the XYZ three-dimensional coordinate system is arranged so that the voice emission direction, indicated by arrow 6, of the speaker's voice from the tip 5 of the speaker's mouth is the Y-axis direction. That is, the Y-axis direction is the direction of the normal vector from the tip 5 of the speaker's mouth, the X-axis direction is horizontal, and the Z-axis direction is vertical. The main axis of radiation of each of microphones 1 to 4 (the axis of the main direction of the radiation directivity characteristic, corresponding to the axis of the microphone's cylindrical shape) is arranged so as to face the voice emission direction 6 and to be substantially parallel to it.

In FIGS. 2 and 3, the microphone array 10 consisting of the four microphones 1 to 4 is housed in the microphone housing 11, and the microphone housing 11 is attached to the tip of a flexible arm 12 of the speaker's headphone set. When the microphone housing 11 is viewed from the front, as is apparent from FIG. 3, the radiation surfaces of the four microphones 1 to 4 are visible, but the arrangement is such that only microphone 1 at the upper vertex is closer to the speaker's mouth. As is apparent from FIG. 2, the spacing between each two adjacent microphones among microphones 1 to 4 is set to 10 mm. This value was determined as follows. For a sampling frequency of 16 kHz used to sample the audio signal, the maximum distance allowed between microphones is, analogously to the sampling theorem, the speed of sound divided by the sampling frequency, that is, 340000 / 16000 = 21.25 mm (with the speed of sound expressed in mm/s). When signal processing using cardioids, described in detail later, is performed, this distance must be halved again to prevent aliasing, which gives a maximum allowable distance of 10.625 mm. Within that limit, a regular triangular pyramid is adopted as the shape that yields the largest phase differences and angle differences.
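The spacing arithmetic above can be reproduced with a short Python sketch. The function name and the constant name are illustrative; the numeric values (340000 mm/s and 16 kHz) are the ones given in the text.

```python
SPEED_OF_SOUND_MM_PER_S = 340_000  # value used in the text


def max_spacing_mm(fs_hz, c_mm_per_s=SPEED_OF_SOUND_MM_PER_S):
    """Return (limit from speed of sound / sampling frequency,
    halved limit used for cardioid processing), both in mm."""
    base_limit = c_mm_per_s / fs_hz
    return base_limit, base_limit / 2


if __name__ == "__main__":
    base, halved = max_spacing_mm(16_000)
    print(base, halved)  # 21.25 10.625
```

The chosen spacing of 10 mm sits just under the halved limit of 10.625 mm.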

In the embodiment of FIGS. 1 to 3, microphones 1 to 4 are arranged at the vertices of a regular triangular pyramid, but the present invention is not limited to this: the regular triangular pyramid may instead be a triangular pyramid, a polygonal pyramid, or any pyramid, and it suffices to arrange at least two of the microphones 2 to 4 at vertices of the base. In the case of a polygonal pyramid, the number of microphones arranged at the vertices of the base need only be at least two, that is, a plurality.

FIG. 4 is a block diagram showing the configuration of a speech recognition apparatus using the microphone array 10 of FIG. 1.

In FIG. 4, sound input to microphone 1 is converted into an audio signal, then converted into a digital audio signal S1 via a low-frequency amplifier 21 and an A/D converter 26, and input to subtractors 41, 42 and 43. Sound input to microphone 2 is converted into an audio signal and then into a digital audio signal S2 via a low-frequency amplifier 22 and an A/D converter 27; S2 is input to subtractor 41 via delay unit 31, directly to subtractor 44, to subtractor 45 via delay unit 35, to subtractor 48 via delay unit 38, and directly to subtractor 49. Sound input to microphone 3 is converted into an audio signal and then into a digital audio signal S3 via a low-frequency amplifier 23 and an A/D converter 28; S3 is input to subtractor 42 via delay unit 32, to subtractor 44 via delay unit 34, directly to subtractors 45 and 46, and to subtractor 47 via delay unit 37. Sound input to microphone 4 is converted into an audio signal and then into a digital audio signal S4 via a low-frequency amplifier 24 and an A/D converter 29; S4 is input to subtractor 43 via delay unit 33, to subtractor 46 via delay unit 36, directly to subtractors 47 and 48, and to subtractor 49 via delay unit 39. In the present embodiment, each of delay units 31 to 39 has a delay of 29.4 microseconds in order to compensate for the arrival time difference of the audio signal between adjacent microphones.
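The 29.4 microsecond figure matches the time sound takes to travel the 10 mm microphone spacing at 340 m/s. The sketch below, with illustrative function names, reproduces that value and also expresses it in sample periods at 16 kHz. It comes to less than one sample period (62.5 microseconds); how the delay units realize such a sub-sample delay is not stated in the specification.

```python
def inter_mic_delay_us(spacing_mm=10.0, c_mm_per_s=340_000.0):
    """Travel time of sound across the microphone spacing, in microseconds."""
    return spacing_mm / c_mm_per_s * 1e6


def delay_in_samples(delay_us, fs_hz=16_000):
    """Express a delay given in microseconds as a number of sample periods."""
    return delay_us * 1e-6 * fs_hz


if __name__ == "__main__":
    tau = inter_mic_delay_us()
    print(round(tau, 1))                     # 29.4
    print(round(delay_in_samples(tau), 3))   # 0.471
```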

The delay-type array circuit 30 comprises nine delay units 31 to 39 and nine subtractors 41 to 49 and, using the known subtractive array method (see, for example, Non-Patent Document 2), generates predetermined cardioids C1 to C9, each of which forms a zero point (a minimum point of directivity gain) toward a noise direction, as described with reference to FIGS. 5 and 6.

The subtractor 41 subtracts the delayed digital audio signal S2 from the digital audio signal S1 and outputs the resulting cardioid audio signal SC1 (a digital audio signal detected with the directivity of cardioid C1, described later) to the signal evaluation and selection circuit 50. The subtractor 42 subtracts the delayed digital audio signal S3 from the digital audio signal S1 and outputs the resulting cardioid audio signal SC2 (a digital audio signal detected with the directivity of cardioid C2, described later) to the signal evaluation and selection circuit 50. The subtractor 43 subtracts the delayed digital audio signal S4 from the digital audio signal S1 and outputs the resulting cardioid audio signal SC3 (a digital audio signal detected with the directivity of cardioid C3, described later) to the signal evaluation and selection circuit 50.

The subtractor 44 subtracts the delayed digital audio signal S3 from the digital audio signal S2 and outputs the resulting cardioid audio signal SC4 (a digital audio signal detected with the directivity of cardioid C4, described later) to the signal evaluation and selection circuit 50. The subtractor 45 subtracts the delayed digital audio signal S2 from the digital audio signal S3 and outputs the resulting cardioid audio signal SC5 (a digital audio signal detected with the directivity of cardioid C5, described later) to the signal evaluation and selection circuit 50. The subtractor 46 subtracts the delayed digital audio signal S4 from the digital audio signal S3 and outputs the resulting cardioid audio signal SC6 (a digital audio signal detected with the directivity of cardioid C6, described later) to the signal evaluation and selection circuit 50. The subtractor 47 subtracts the delayed digital audio signal S3 from the digital audio signal S4 and outputs the resulting cardioid audio signal SC7 (a digital audio signal detected with the directivity of cardioid C7, described later) to the signal evaluation and selection circuit 50. The subtractor 48 subtracts the delayed digital audio signal S2 from the digital audio signal S4 and outputs the resulting cardioid audio signal SC8 (a digital audio signal detected with the directivity of cardioid C8, described later) to the signal evaluation and selection circuit 50. The subtractor 49 subtracts the delayed digital audio signal S4 from the digital audio signal S2 and outputs the resulting cardioid audio signal SC9 (a digital audio signal detected with the directivity of cardioid C9, described later) to the signal evaluation and selection circuit 50.
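To illustrate why subtracting a delayed signal places a zero point toward the delayed microphone, the following sketch evaluates the magnitude response of y(t) = s_a(t) − s_b(t − τ) for a single frequency. It is a simplified model, not the circuit of FIG. 4: it assumes a far-field plane wave, ideal omnidirectional microphones, τ = d/c, and the 10 mm spacing and 340 m/s used in the text. The pair table follows the routing to subtractors 41 to 49; all names are illustrative.

```python
import math

# Cardioid index -> (undelayed signal, delayed signal), per subtractors 41-49.
CARDIOID_PAIRS = {
    1: (1, 2), 2: (1, 3), 3: (1, 4),
    4: (2, 3), 5: (3, 2), 6: (3, 4),
    7: (4, 3), 8: (4, 2), 9: (2, 4),
}


def delay_and_subtract_gain(theta_rad, freq_hz, d_m=0.010, c_m_per_s=340.0):
    """|H| of y(t) = s_a(t) - s_b(t - tau) with tau = d/c (plane-wave assumption).

    theta_rad is the angle between the direction the sound arrives from and
    the axis pointing from microphone a toward microphone b.
    """
    tau = d_m / c_m_per_s
    omega = 2 * math.pi * freq_hz
    # Microphone b hears the wave d*cos(theta)/c earlier than microphone a.
    lead = d_m * math.cos(theta_rad) / c_m_per_s
    return abs(2 * math.sin(omega * (tau - lead) / 2))


if __name__ == "__main__":
    for deg in (0, 90, 180):
        print(deg, round(delay_and_subtract_gain(math.radians(deg), 1000.0), 4))
```

Under these assumptions the gain is zero for sound arriving from the side of the delayed microphone (0 degrees) and largest from the opposite side (180 degrees), which is the cardioid-like behaviour described for C1 to C9.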

For the nine input cardioid audio signals SC1 to SC9, the signal evaluation and selection circuit 50 detects speech sections and noise sections using a VAD (Voice Activity Detection) function, calculates the SNR of each signal on that basis, selects the two cardioid audio signals with the highest SNR (three in a modification), adds the selected signals, and outputs the resulting cardioid audio signal to the noise removal circuit 51. The VAD function detects a speech section under the following conditions.
(1) The signal has a level equal to or greater than a predetermined threshold.
(2) No cardioid signal differs from the others by a predetermined power level or more. This exploits the following tendency among the three cardioid signals corresponding to the mouth direction and the six cardioid signals corresponding to the face-plane direction: for speech from the mouth direction, the power of the former three rises, and that of the latter six also rises slightly, whereas a sound from any other direction is likely to fall into the blind spot of one or more cardioids, so that the power differences among the nine signals tend to widen.
(3) The 500 milliseconds before and after a frame detected as a speech section are also treated as a speech section.
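The selection step and condition (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the circuit's actual procedure: the specification does not say how the SNR is computed from the detected sections, so the sketch simply uses the ratio of mean frame power in speech-labelled frames to that in noise-labelled frames. All function names and the frame-based data layout are hypothetical.

```python
import math


def frame_power(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def extend_speech(is_speech, hang_frames):
    """Condition (3): also mark hang_frames frames before and after each
    detected speech frame as speech (hang_frames should span 500 ms)."""
    out = list(is_speech)
    for i, flag in enumerate(is_speech):
        if flag:
            lo = max(0, i - hang_frames)
            hi = min(len(is_speech), i + hang_frames + 1)
            for j in range(lo, hi):
                out[j] = True
    return out


def snr_db(frames, is_speech):
    """Assumed SNR estimate: speech-frame power over noise-frame power, in dB."""
    speech = [frame_power(f) for f, s in zip(frames, is_speech) if s]
    noise = [frame_power(f) for f, s in zip(frames, is_speech) if not s]
    return 10 * math.log10((sum(speech) / len(speech)) / (sum(noise) / len(noise)))


def select_and_sum(signals, is_speech, k=2):
    """Pick the k signals with the highest SNR and add them sample by sample.

    signals maps a name such as "SC1" to a list of frames (lists of samples).
    """
    ranked = sorted(signals, key=lambda n: snr_db(signals[n], is_speech), reverse=True)
    chosen = ranked[:k]
    summed = [
        [sum(signals[n][i][j] for n in chosen) for j in range(len(signals[chosen[0]][i]))]
        for i in range(len(is_speech))
    ]
    return chosen, summed
```

Calling `select_and_sum(signals, is_speech, k=3)` would correspond to the modification in which three signals are added.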

Next, the noise removal circuit 51 removes noise from the input cardioid audio signal using the known spectral subtraction method (hereinafter, the SS method) and outputs the processed digital audio signal to the speech recognition circuit 52. The SS method has long been used as a noise removal method in the frequency domain: a separately estimated noise power spectrum is subtracted from the power spectrum of the speech signal to which noise has been added, and the resulting power spectrum is inverse Fourier transformed to restore a speech signal from which the noise has been removed (see, for example, Patent Document 3 and Non-Patent Document 1). The spectral component X(f) after the SS operation is given by the following equation.

[Equation 1]
$$X^{2}(f) = \max\{\, x(f) - \alpha N(f),\ \beta N(f) \,\} \qquad (1)$$

Here, α and β are predetermined constants, for example α = 2.0 and β = 0.001. X(f) is the spectral component resulting from spectral subtraction of the noise, x(f) is the spectral component of the recorded audio data (speech plus noise), and N(f) is the spectral component of the noise.
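The right-hand side of Equation (1) can be evaluated bin by bin as in the sketch below. It assumes that x and n are lists holding the spectral components of the recorded signal and of the noise estimate for the same frequency bins; the Fourier transform, the noise estimation and the inverse transform are outside its scope, and the function name is illustrative.

```python
def spectral_subtract(x, n, alpha=2.0, beta=0.001):
    """Per-bin right-hand side of Equation (1): max(x(f) - alpha*N(f), beta*N(f)).

    x: spectral components of the recorded signal (speech + noise)
    n: spectral components of the separately estimated noise
    """
    return [max(xf - alpha * nf, beta * nf) for xf, nf in zip(x, n)]


if __name__ == "__main__":
    # The first bin keeps the subtracted value; the second is floored at beta*N(f).
    print(spectral_subtract([10.0, 1.0], [2.0, 2.0]))  # [6.0, 0.002]
```

The floor term β N(f) keeps the result positive in bins where the scaled noise estimate exceeds the recorded component.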

The speech recognition circuit 52 performs speech recognition processing on the input digital audio signal using, for example, a predetermined speech dictionary or speech model (for example, an HMM), and either displays the text data of the recognition result on a liquid crystal display (LCD) 53 or outputs it to an external device such as a personal computer.

Next, the cardioids C1 to C9 formed in the speech recognition apparatus of FIG. 4 will be described with reference to FIGS. 5 and 6.

FIG. 5 is a perspective view showing the three cardioids C1, C2 and C3, corresponding to the mouth direction, that are realized in the speech recognition apparatus of FIG. 4. In FIG. 5, cardioid C1 is formed from the digital audio signals S1 and S2 and has a zero point in the direction toward microphone 2. Cardioid C2 is formed from the digital audio signals S1 and S3 and has a zero point in the direction toward microphone 3. Cardioid C3 is formed from the digital audio signals S1 and S4 and has a zero point in the direction toward microphone 4.

FIG. 6 is a perspective view showing the six cardioids C4, C5, C6, C7, C8 and C9, corresponding to the horizontal direction of the face, that are realized in the speech recognition apparatus of FIG. 4. In FIG. 6, cardioids C4 and C5 are formed from the digital audio signals S2 and S3; cardioid C4 has a zero point in the direction toward microphone 3, and cardioid C5 has a zero point in the direction toward microphone 2. Cardioids C6 and C7 are formed from the digital audio signals S3 and S4; cardioid C6 has a zero point in the direction toward microphone 4, and cardioid C7 has a zero point in the direction toward microphone 3. Cardioids C8 and C9 are formed from the digital audio signals S4 and S2; cardioid C8 has a zero point in the direction toward microphone 2, and cardioid C9 has a zero point in the direction toward microphone 4.

FIG. 7 is a perspective view showing the noise arrangement in a simulation experiment according to Example 1 (three stationary noises Nst11, Nst12 and Nst13) carried out by the present inventors. In FIG. 7, the loudspeaker symbols indicate the positions and radiation directions of the three stationary noises Nst11, Nst12 and Nst13. Stationary noise Nst11 is radiated from the direction of 60 degrees in the XY plane and 90 degrees in the XZ plane, stationary noise Nst12 is radiated in the direction from the +Y axis toward the origin, and stationary noise Nst13 is radiated from the direction of 300 degrees in the XY plane and 90 degrees in the XZ plane. The SNR(Cn) for each cardioid Cn (n = 1, 2, ..., 9) evaluated by the speech recognition apparatus of FIG. 4 under these conditions is as follows.

[Table 1]
-----------------
SNR(C1) = 25.8 dB
SNR(C2) = 24.4 dB
SNR(C3) = 24.1 dB
SNR(C4) = 15.0 dB
SNR(C5) = 14.8 dB
SNR(C6) = 13.6 dB
SNR(C7) = 13.8 dB
SNR(C8) = 14.9 dB
SNR(C9) = 14.9 dB
-----------------

The SNR_ADD(Tm) obtained when the cardioid audio signals with the top m values of SNR(Cn) in Table 1 (m = 2, 3, ..., 9) are added is shown below.

[Table 2]
-----------------
SNR_ADD(T2) = 25.3 dB
SNR_ADD(T3) = 25.9 dB
SNR_ADD(T4) = 23.3 dB
SNR_ADD(T5) = 21.6 dB
SNR_ADD(T6) = 20.7 dB
SNR_ADD(T7) = 20.0 dB
SNR_ADD(T8) = 19.4 dB
SNR_ADD(T9) = 18.7 dB
-----------------

As is apparent from Table 2, the audio signal with the highest SNR is obtained by adding the top three cardioid audio signals.

FIG. 8 is a perspective view showing the noise arrangement in a simulation experiment according to Example 2 (one sudden noise Nsu21) carried out by the present inventors. In FIG. 8, the loudspeaker symbol indicates the position and radiation direction of the sudden noise Nsu21. The sudden noise Nsu21 is radiated from the direction of 60 degrees in the XY plane and 90 degrees in the XZ plane. The SNR(Cn) for each cardioid Cn (n = 1, 2, ..., 9) evaluated by the speech recognition apparatus of FIG. 4 under these conditions is as follows.

[Table 3]
-----------------
SNR(C1) = 5.2 dB
SNR(C2) = 0.8 dB
SNR(C3) = 16.4 dB
SNR(C4) = −6.5 dB
SNR(C5) = 1.3 dB
SNR(C6) = 16.0 dB
SNR(C7) = −8.6 dB
SNR(C8) = −6.6 dB
SNR(C9) = 1.6 dB
-----------------

The SNR_ADD(Tm) obtained when the cardioid audio signals with the top m values of SNR(Cn) in Table 3 (m = 2, 3, ..., 9) are added is shown below.

[Table 4]
-----------------
SNR_ADD(T2) = 16.2 dB
SNR_ADD(T3) = 9.5 dB
SNR_ADD(T4) = 7.1 dB
SNR_ADD(T5) = 6.5 dB
SNR_ADD(T6) = 5.0 dB
SNR_ADD(T7) = 2.7 dB
SNR_ADD(T8) = 1.3 dB
SNR_ADD(T9) = −0.5 dB
-----------------

As is apparent from Table 4, the audio signal with the highest SNR is obtained by adding the top two cardioid audio signals.

FIG. 9 is a perspective view showing the noise arrangement in a simulation experiment according to Example 3 (one sudden noise Nsu31 and one stationary noise Nst32) carried out by the present inventors. In FIG. 9, the loudspeaker symbols indicate the positions and radiation directions of the sudden noise Nsu31 and the stationary noise Nst32. The sudden noise Nsu31 is radiated from the direction of 60 degrees in the XY plane and 90 degrees in the XZ plane, and the stationary noise Nst32 is radiated from the direction of 300 degrees in the XY plane and 90 degrees in the XZ plane. The SNR(Cn) for each cardioid Cn (n = 1, 2, ..., 9) evaluated by the speech recognition apparatus of FIG. 4 under these conditions is as follows.

[Table 5]
-----------------
SNR(C1) = 9.3 dB
SNR(C2) = 6.4 dB
SNR(C3) = 9.4 dB
SNR(C4) = −1.5 dB
SNR(C5) = 0.8 dB
SNR(C6) = −0.2 dB
SNR(C7) = −2.9 dB
SNR(C8) = −1.2 dB
SNR(C9) = 1.0 dB
-----------------

The SNR_ADD(Tm) obtained when the cardioid audio signals with the top m values of SNR(Cn) in Table 5 (m = 2, 3, ..., 9) are added is shown below.

[Table 6]
-----------------
SNR_ADD(T2) = 10.0 dB
SNR_ADD(T3) = 7.6 dB
SNR_ADD(T4) = 7.0 dB
SNR_ADD(T5) = 6.4 dB
SNR_ADD(T6) = 5.6 dB
SNR_ADD(T7) = 4.9 dB
SNR_ADD(T8) = 4.3 dB
SNR_ADD(T9) = −3.4 dB
-----------------

As is apparent from Table 6, the audio signal with the highest SNR is obtained by adding the top two cardioid audio signals.

FIG. 10 is a perspective view showing the noise arrangement in the simulation experiment of Example 4 (one stationary noise Nst41) performed by the present inventors. In FIG. 10, the loudspeaker symbol indicates the position and radiation direction of the stationary noise Nst41. Here, the stationary noise Nst41 is radiated at a background noise level of 90 dBA from the direction of 30 degrees in the XY plane and 90 degrees in the XZ plane. The SNR_ADD(Tm) obtained when the top two or three cardioid speech signals are added, based on the cardioid speech signals SCn (n = 1, 2, ..., 9) evaluated by the speech recognition apparatus of FIG. 4, is shown below.

[Table 7]
―――――――――――――――――
SNR_ADD(T2) = 8.0 dB
SNR_ADD(T3) = 7.3 dB
―――――――――――――――――

The SNR_SS obtained when the top two cardioid speech signals having the higher SNRs are added and the noise removal circuit 51 of FIG. 4, which uses the SS method, is then applied is shown below.

[Table 8]
―――――――――――――――――――――――――――
SNR_SS(α = 1.0; β = 0.001) = 8.0 dB
SNR_SS(α = 2.0; β = 0.001) = 10.3 dB
―――――――――――――――――――――――――――

As is apparent from Table 8, using the noise removal circuit 51 based on the SS method greatly improves the SNR: with α = 2.0 it rises from the 8.0 dB of Table 7 to 10.3 dB.
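For orientation, the role of the two parameters in Table 8 can be sketched with the commonly used per-bin spectral subtraction rule. This is an assumption rather than the actual design of the noise removal circuit 51: here α is taken to be the subtraction coefficient applied to the estimated noise power, β the flooring coefficient applied to the noisy power, and the function name `spectral_subtract` is hypothetical.

```python
def spectral_subtract(noisy_power, noise_power, alpha=2.0, beta=0.001):
    """Per-bin spectral subtraction on power spectra (assumed textbook form).

    For each frequency bin: subtract alpha times the estimated noise power
    from the noisy power; if the result falls to or below beta times the
    noisy power, use that floor instead.
    """
    cleaned = []
    for x, n in zip(noisy_power, noise_power):
        subtracted = x - alpha * n
        floor = beta * x
        cleaned.append(subtracted if subtracted > floor else floor)
    return cleaned

# Two bins: one dominated by speech, one dominated by noise.
print(spectral_subtract([10.0, 1.0], [2.0, 1.0], alpha=2.0, beta=0.001))
# [6.0, 0.001]
```

Under this assumed rule, a larger α removes more of the estimated noise from each bin, while β keeps heavily subtracted bins from dropping to zero or below.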

In Examples 1 to 4 above, the stationary noise is white noise such as that generated by a belt conveyor, and the sudden noise is impulsive noise such as that generated by punching a metal material.

In Example 5, the present inventors had a speaker read aloud 100 different four-digit numbers in various severe noise environments (at the applicant's Inuyama factory) under the following experimental conditions, and measured the speech recognition rate.

[Table 9]
―――――――――――――――――――――――――――――――――――――――
(A) Speech recognition software: NEC speech recognition test application
(B) Recognition dictionary: four-digit number recognition dictionary
(C) Microphones used:
(C1) NEC headset microphone (Comparative Example 1; comprising a unidirectional speech microphone and an omnidirectional noise microphone)
(C2) Sennheiser HMD-25 microphone (Comparative Example 2)
(C3) Microphone array according to the present embodiment (embodiment; as shown in FIGS. 1 to 3, comprising one omnidirectional speech microphone 1 and three omnidirectional noise microphones 2, 3, 4)
―――――――――――――――――――――――――――――――――――――――

FIG. 11 is a table showing the experimental results (speech recognition rates) of the speech recognition experiment under noise of Example 5 performed by the present inventors. As is apparent from FIG. 11, even in a very severe noise environment with a noise level of 80 dBA, collecting sound with the microphone array 10 according to the present embodiment yields a speech signal with an SNR greatly improved over the prior art.

Further, as is apparent from the results of Examples 1 to 4, the speech recognition rate can be greatly improved by collecting sound with the microphone array 10 according to the present embodiment and performing speech recognition with the speech recognition apparatus of FIG. 4 according to the present embodiment.

In the above embodiment, the subtractive array method and the SS method are used together. However, the present invention is not limited to this, and speech recognition may be performed after signal processing using the subtractive array method alone.

As described in detail above, the microphone array according to the present invention collects the speaker's voice using at least three microphones and thereby obtains an improved speech signal compared with the prior art. Further, by recording speech signals with the microphone array, generating a plurality of cardioid signals using the subtractive array method, adding those cardioid signals that have the higher SNRs, removing noise from the sum signal using the SS method, and then performing speech recognition, the speech recognition rate can be improved over the prior art at sites where large noise is generated, such as factories.
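The delay-and-subtract operation that forms one cardioid in the subtractive array method can be sketched as below. This is a minimal single-pair illustration under stated assumptions: the function names `delay` and `subtractive_pair`, the integer sample delay `d`, and the front/rear microphone pairing are hypothetical, and the actual delays and pairings used by the delay devices 31 to 39 and subtractors 41 to 49 are not reproduced here.

```python
def delay(signal, d):
    """Delay a list of samples by d samples (0 <= d <= len), zero-padding the start."""
    return [0.0] * d + signal[:len(signal) - d]

def subtractive_pair(front, rear, d):
    """Compute y[n] = front[n] - rear[n - d].

    A wave arriving from the rear reaches the rear microphone d samples
    before the front one, so the delayed rear signal cancels it exactly.
    """
    return [f - r for f, r in zip(front, delay(rear, d))]

s = [1.0, -2.0, 3.0, 0.5, 0.0, 0.0]   # test waveform
d = 2                                  # inter-microphone travel time in samples

# Sound from the rear: rear microphone hears s first, front hears it d samples later.
print(subtractive_pair(delay(s, d), s, d))   # all zeros: rear sound is cancelled

# Sound from the front: front microphone hears s first, so the output is non-zero.
print(subtractive_pair(s, delay(s, d), d))
```

The null toward the rear and the non-zero response toward the front give the pair its cardioid-like directivity; choosing different microphone pairs points the null in different directions.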

FIG. 1 is a perspective view showing the arrangement of the microphone array 10 according to one embodiment of the present invention.
FIG. 2 is a side view showing the microphone housing 11 provided with the microphone array 10 of FIG. 1.
FIG. 3 is a front view showing the microphone housing 11 of FIG. 2.
FIG. 4 is a block diagram showing the configuration of a speech recognition apparatus using the microphone array 10 of FIG. 1.
FIG. 5 is a perspective view showing the three cardioids C1, C2, and C3 corresponding to the mouth direction, realized in the speech recognition apparatus of FIG. 4.
FIG. 6 is a perspective view showing the six cardioids C4, C5, C6, C7, C8, and C9 corresponding to the horizontal direction of the face, realized in the speech recognition apparatus of FIG. 4.
FIG. 7 is a perspective view showing the noise arrangement in the simulation experiment of Example 1 (three stationary noises Nst11, Nst12, Nst13) performed by the present inventors.
FIG. 8 is a perspective view showing the noise arrangement in the simulation experiment of Example 2 (one sudden noise Nsu21) performed by the present inventors.
FIG. 9 is a perspective view showing the noise arrangement in the simulation experiment of Example 3 (one sudden noise Nsu31 and one stationary noise Nst32) performed by the present inventors.
FIG. 10 is a perspective view showing the noise arrangement in the simulation experiment of Example 4 (one stationary noise Nst41) performed by the present inventors.
FIG. 11 is a table showing the experimental results (speech recognition rates) of the speech recognition experiment under noise of Example 5 performed by the present inventors.

Explanation of symbols

1, 2, 3, 4 ... microphones,
5 ... mouth tip,
6 ... voice radiation direction,
10 ... microphone array,
11 ... microphone housing,
12 ... flexible arm,
21, 22, 23, 24 ... low-frequency amplifiers,
26, 27, 28, 29 ... A/D converters,
30 ... delay-type array circuit,
31, 32, 33, 34, 35, 36, 37, 38, 39 ... delay devices,
41, 42, 43, 44, 45, 46, 47, 48, 49 ... subtractors,
50 ... signal evaluation and selection circuit,
51 ... noise removal circuit,
52 ... speech recognition circuit,
53 ... liquid crystal display (LCD),
C1, C2, C3, C4, C5, C6, C7, C8, C9 ... cardioids,
Nst11, Nst12, Nst13, Nst32, Nst41 ... stationary noise,
Nsu21, Nsu31 ... sudden noise.

Claims

1. A microphone array comprising:
a first microphone provided at the upper vertex, among the vertices of a pyramid, such that its main radiation axis is substantially directed toward a speaker's mouth; and
a plurality of second microphones provided at at least two vertices of the bottom surface of the pyramid such that their main radiation axes are substantially parallel to the direction of the speaker's mouth.

2. The microphone array according to claim 1, wherein the pyramid is a triangular pyramid.

3. The microphone array according to claim 1, wherein the pyramid is a regular triangular pyramid.

4. The microphone array according to claim 3, wherein three second microphones are provided at three vertices of the bottom surface of the regular triangular pyramid.

5. The microphone array according to claim 1, wherein the microphone array is a voice recognition microphone array.