JP2005303574A - Voice recognition headset

Info

Publication number: JP2005303574A
Application number: JP2004115185A
Authority: JP
Inventors: Shinichi Tanaka; 信一田中; Yasuyuki Masai; 康之正井; Ko Amada; 皇天田; Takumi Yamamoto; 琢己山本
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2004-04-09
Filing date: 2004-04-09
Publication date: 2005-10-27

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition headset that does not need spherical-wave approximation even when the sound source is close to the microphone array, and that does not need to correct the difference in arrival time between microphones even when the speaker moves. SOLUTION: Right and left ear muffs 2 and 3 are connected by a flexible support frame 1 and are worn on the head of a user; an arm 6 extends from one ear muff 2, and microphones 7A and 7B are arranged on the arm 6. The microphones are fixed to the arm 6 so that, when the user wears the headset, they lie on the same spherical surface centered approximately on the mouth 8 of the user and are equally distant from the mouth. A signal processing module 10 performs microphone array processing on the voice signals input from the plurality of microphones, suppresses noise, outputs a voice signal in which the voice is enhanced, recognizes this voice output, and outputs the recognition result. COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to headset microphone technology used for hands-free calling, voice recognition, and the like, and in particular to a technique that uses a plurality of microphones to enhance a target voice signal in an input acoustic signal or to detect the direction of the target voice signal.

When voice recognition technology is used in a real environment, ambient noise has a large effect on the recognition rate. In a car, for example, there are many noises such as engine noise, wind noise, the sound of oncoming and overtaking vehicles, and the car stereo. Even in relatively quiet places such as offices, there are many noises that hinder voice recognition, such as footsteps and doors closing. The same applies not only to the voice input unit for voice recognition but also to voice calls, such as telephone calls, in noisy environments. These noises are mixed with the speaker's voice and input to the voice recognition device, causing a significant drop in the recognition rate. One way to solve this noise problem is to use a microphone array. A microphone array performs signal processing on the sound input from a plurality of microphones and outputs a signal in which the target voice is enhanced. Specifically, the target voice is enhanced by forming sharp directivity toward the target voice and lowering the sensitivity in other directions.

For example, in a delay-and-sum microphone array (see, for example, Non-Patent Document 1), the output signal $S_e(t)$ is obtained by shifting the signals $S_n(t)$ ($n = 1, \dots, N$) obtained by N microphones by a time difference $\tau$ matched to the direction of arrival of the target voice, and adding them. That is, the enhanced voice signal $S_e(t)$ is expressed as

$$S_e(t) = \sum_{n=1}^{N} S_n(t + n\tau) \qquad (1)$$

where the microphones are assumed to be arranged at equal intervals in the order of the subscript n. The delay-and-sum array forms directivity toward the target voice by using the phase difference of the incoming signals. It is based on the principle that the target signal is superimposed in phase and strengthened, whereas noise arriving from directions other than that of the target signal is weakened because its phases are shifted relative to each other.
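Equation (1) can be illustrated with a minimal sketch. It assumes that the steering delay is a whole number of samples and that samples outside the recording are zero; the function name `delay_and_sum` and the test pulses are illustrative and do not come from the specification.

```python
def delay_and_sum(channels, tau):
    """Equation (1): Se(t) = sum over n = 1..N of Sn(t + n * tau).

    channels: list of N equal-length sample lists (microphone n = 1..N).
    tau: steering delay per microphone in whole samples (assumption;
         fractional delays would need interpolation).
    """
    length = len(channels[0])
    out = [0.0] * length
    for n, signal in enumerate(channels, start=1):
        shift = n * tau
        for t in range(length):
            idx = t + shift
            if 0 <= idx < length:  # samples outside the recording count as zero
                out[t] += signal[idx]
    return out


def pulse(at, length=32):
    """Unit pulse at sample index `at`."""
    return [1.0 if i == at else 0.0 for i in range(length)]


# A pulse that reaches microphone n two samples later than microphone n - 1.
channels = [pulse(10 + 2 * n) for n in (1, 2, 3)]
steered = delay_and_sum(channels, tau=2)    # delays match the arrival direction
unsteered = delay_and_sum(channels, tau=0)  # no alignment
print(max(steered), max(unsteered))
```

With the matching delay the three pulses add in phase at one sample, while without alignment they stay separate and no sample exceeds the level of a single microphone.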

There is also a method of detecting the sound source direction using an adaptive beamformer (see, for example, Patent Document 1).
When a speaker speaks at a relatively short distance from the microphone array, the voice reaches the microphones as a spherical wave. Therefore, even if the speaker speaks directly in front of the microphone array, the sound wave arrives later at the microphones at the ends than at the microphones in the center of the array. The method of equation (1) is a theory that assumes the sound source is infinitely far away, so that the sound wave can be approximated as a plane wave. When this assumption does not hold, that is, when the sound source is close relative to the size of the microphone array, the sound wave must be treated as a spherical wave. Treating it as a spherical wave has the drawback that the calculation is more complicated than for a plane wave. In addition, the arrival time difference between microphones changes when the speaker moves in the depth direction, so keeping it constant restricts the position from which the speaker can speak.
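The size of the near-field effect can be estimated with a short calculation. The sketch below assumes a three-microphone linear array with 5 cm spacing, a speed of sound of 343 m/s, and a talker directly in front of the array; these figures are illustrative and are not taken from the specification.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def arrival_times(source, mics):
    """Propagation time in seconds from a point source to each microphone (2-D, metres)."""
    return [math.dist(source, m) / SPEED_OF_SOUND for m in mics]


def spread(times):
    """Largest difference in arrival time across the array."""
    return max(times) - min(times)


# Linear array on the x axis; the talker is on the y axis, in front of the centre.
linear_array = [(-0.05, 0.0), (0.0, 0.0), (0.05, 0.0)]
near = spread(arrival_times((0.0, 0.10), linear_array))  # talker 10 cm away
far = spread(arrival_times((0.0, 10.0), linear_array))   # talker 10 m away
print(f"near: {near * 1e6:.1f} us, far: {far * 1e6:.2f} us")
```

For the distant talker the spread is negligible, which is the plane-wave case. For the close talker the end microphones lag the centre microphone by tens of microseconds, and this lag changes as the talker moves in depth.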

Non-Patent Document 1: "Acoustic Systems and Digital Processing", Chapter 7, IEICE, 1995
Patent Document 1: Japanese Patent Laid-Open No. 11-052977

As described above, when the sound source is close to the microphone array, the plane-wave approximation does not hold and a spherical-wave approximation is required, which complicates the processing and restricts the utterance position.

The present invention has been made to solve the above problem. Its object is to provide a voice recognition headset that does not require spherical-wave approximation even when the sound source is close to the microphone array, and that does not require correction of the arrival time difference between microphones even when the speaker moves.

To solve the above problem, the present invention comprises a plurality of microphones that detect voice and generate voice signals, a microphone support section on which the plurality of microphones are arranged, enhanced voice signal generating means for combining the voice signals of the plurality of microphones to generate an enhanced voice signal, and voice recognition means for recognizing the enhanced voice signal.

The present invention also comprises a plurality of microphones that detect voice and generate voice signals, a microphone support section that places the plurality of microphones on the same circumference centered on the mouth, enhanced voice signal generating means for combining the voice signals of the plurality of microphones to generate an enhanced voice signal, and voice recognition means for recognizing the enhanced voice signal.

The present invention further comprises a plurality of microphones that detect voice and generate voice signals, a microphone support section that places the plurality of microphones on the same spherical surface centered on the mouth, enhanced voice signal generating means for combining the voice signals of the plurality of microphones to generate an enhanced voice signal, and voice recognition means for recognizing the enhanced voice signal.

Furthermore, the plurality of microphones are supported so that the distances between them are kept constant.

According to the present invention, signal processing is simple and the recognition rate is improved even when the sound source is close to the microphone array. In addition, the voice recognition rate is maintained without restricting the speaker position.

Embodiments of the present invention will be described below with reference to the drawings.

FIG. 1 and FIG. 2 show the appearance of a voice recognition headset according to Embodiment 1 of the present invention and its schematic system configuration. FIG. 1(a) is a front view of the voice recognition headset, and FIG. 1(b) is a side view seen from the direction of the arrow in FIG. 1(a). The voice recognition headset comprises a support frame 1, ear pads 2 and 3, speakers 4 and 5, an arm 6, microphones 7A and 7B that detect the voice uttered by the wearer (user) of the headset and generate electrical voice signals, and a signal processing module 10 that digitizes these voice signals and performs voice recognition. In FIG. 1, two microphones are used for simplicity, but three or more microphones may be arranged. The microphones may also be arranged with their directivity pointed toward the mouth.

This voice recognition headset (hereinafter simply "headset" where appropriate) has a shape in which the left and right ear pads 2 and 3 are connected by a flexible support frame 1, and is worn on the user's head. An arm 6 extends from one ear pad 2, and microphones 7A and 7B are arranged on the arm 6. The microphones 7A and 7B are fixed to the arm 6 so that, when the user wears the headset, they lie on the same circumference 9A centered approximately on the user's mouth 8 and are at equal distances from the mouth.

The ear pad 2 houses the speaker 5 (left and right) and the signal processing module 16. The signal processing module 16 may instead be housed in the ear pad 3. Although not shown, the elements are connected by cables as necessary.

As shown in FIG. 2, the voice input from the first microphone 7A undergoes analog-to-digital conversion in the first AD converter 11A and is input to the microphone array signal processing unit 12. Similarly, the voice input from the second microphone 7B undergoes analog-to-digital conversion in the second AD converter 11B and is input to the microphone array signal processing unit 12. The microphone array signal processing unit 12 performs microphone array processing on the voice signals input from the plurality of microphones and outputs a voice signal in which noise is suppressed and the voice is enhanced. The voice recognition unit 13 recognizes this voice output and outputs the recognition result.

FIG. 3 shows an example of the internal structure of the microphone array signal processing unit 12. The digitized voice signals of the microphones 7A and 7B are combined by an adder 121, and a voice in which these signals reinforce each other is output. Because the microphones 7A and 7B are arranged on the same circumference 9A centered on the mouth 8, they are equidistant from the sound source at the mouth 8, and the voice is input without any delay between microphones caused by the spherical wave from the source. Therefore, the voice arriving from the mouth 8 is enhanced simply by adding the in-phase signals, without a separate delay unit.
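The effect of the equidistant placement can be checked numerically. The sketch below assumes two microphones 5 cm from the mouth at plus and minus 20 degrees, a 2 kHz tone, and a noise source 80 cm to the side; it ignores distance attenuation. All coordinates and names are illustrative assumptions, not values from the specification.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def on_circle(center, radius, angles_deg):
    """Points on a circle of the given radius around `center`."""
    return [(center[0] + radius * math.cos(math.radians(a)),
             center[1] + radius * math.sin(math.radians(a)))
            for a in angles_deg]


def plain_sum_gain(source, mics, freq_hz):
    """Amplitude of the undelayed sum of a unit tone received at each microphone.

    Each microphone contributes a unit phasor whose phase is set by its
    propagation delay; distance attenuation is ignored.
    """
    re = im = 0.0
    for m in mics:
        phase = 2 * math.pi * freq_hz * math.dist(source, m) / SPEED_OF_SOUND
        re += math.cos(phase)
        im += math.sin(phase)
    return math.hypot(re, im)


mouth = (0.0, 0.0)
mics = on_circle(mouth, 0.05, [-20.0, 20.0])
speech_gain = plain_sum_gain(mouth, mics, 2000.0)       # source at the centre
noise_gain = plain_sum_gain((0.0, 0.8), mics, 2000.0)   # source off to the side
print(speech_gain, noise_gain)
```

The tone from the mouth adds with the full gain of two because both paths have the same length, while the tone from the side arrives with a phase difference and adds with a smaller gain, without any delay unit in the signal path.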

FIG. 4 shows another example of the internal structure of the microphone array signal processing unit 12. This configuration uses an adaptive beamformer to suppress noise and enhance the voice. The microphone array signal processing unit 12 comprises an adder 121, a beamformer processing unit 122, and a voice enhancement unit 123.

Here, the internal configuration of the beamformer processing unit 122 is described with reference to FIG. 5. The beamformer processing unit 122 applies, to the voice signals obtained by analog-to-digital conversion of the outputs of the microphones 7A and 7B, a filter operation called adaptive beamformer processing that suppresses the voice signal from the sound source. Various processing methods are known for the beamformer processing unit 122, for example the generalized sidelobe canceller (GSC), the Frost beamformer, and the reference signal method. This embodiment is applicable to any adaptive beamformer, but a two-channel GSC is described here as an example.

FIG. 5 shows, as an example of the beamformer processing unit 122, a configuration of the Griffiths-Jim type GSC, which is a common form of two-channel GSC. The beamformer processing unit 122 is a GSC comprising a subtractor 1221, an adder 1222, a delay unit 1223, an adaptive filter 1224, and a subtractor 1225. Various adaptive filters such as LMS, RLS, and projection LMS can be used as the adaptive filter 1224. The filter length La is, for example, La = 50. The delay of the delay unit 1223 is, for example, La/2.

When an LMS adaptive filter is used as the adaptive filter 1224 of the two-channel Griffiths-Jim type GSC shown in FIG. 5 that constitutes the beamformer processing unit 122, the filter update is expressed by the following equations, where n is the time, W(n) is the coefficient vector of the adaptive filter 1224, x_i(n) is the input signal of the i-th channel, and X_i(n) = (x_i(n), x_i(n-1), ..., x_i(n-La+1)) is the input signal vector of the i-th channel.

$$y(n) = x_0(n) + x_1(n) \qquad (2)$$
$$X'(n) = X_1(n) - X_0(n) \qquad (3)$$
$$e(n) = y(n) - W(n) X'(n) \qquad (4)$$
$$W(n+1) = W(n) - \mu X'(n) e(n) \qquad (5)$$

When a signal arrives from the direction of the target sound source, the filter in the beamformer processing unit 122 has low sensitivity in that direction. The direction of the target sound source can therefore be estimated by examining the directivity, that is, the direction dependence of the sensitivity, from the coefficients of this filter.
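The update of equations (2) to (5) can be sketched as follows. This is a minimal illustration under several assumptions: a short filter (8 taps instead of La = 50), sinusoidal test signals, and a step size `mu` chosen by hand. The coefficient update is applied in the direction that reduces the squared error for a positive `mu`, which corresponds to a negative μ under the sign convention of equation (5) as printed. The function name `gsc_two_channel` is hypothetical.

```python
from collections import deque
import math


def gsc_two_channel(x0, x1, taps=8, mu=0.05):
    """Two-channel generalized sidelobe canceller following equations (2) to (5).

    Returns the error signal e(n), used here as the output, and the final
    adaptive filter coefficients W.
    """
    w = [0.0] * taps                 # W(n)
    diff_hist = [0.0] * taps         # X'(n), newest sample first
    delay = taps // 2                # delay on the sum path (La / 2)
    sum_line = deque([0.0] * (delay + 1), maxlen=delay + 1)
    out = []
    for a, b in zip(x0, x1):
        sum_line.append(a + b)       # (2) y(n) = x0(n) + x1(n)
        y = sum_line[0]              # y delayed by `delay` samples
        diff_hist = [b - a] + diff_hist[:-1]                   # (3)
        e = y - sum(wi * xi for wi, xi in zip(w, diff_hist))   # (4)
        # LMS update in the direction that reduces e(n)^2 for mu > 0
        # (assumption about the sign convention of equation (5)).
        w = [wi + mu * xi * e for wi, xi in zip(w, diff_hist)]
        out.append(e)
    return out, w


def mean_power(samples):
    return sum(s * s for s in samples) / len(samples)


# The target reaches both microphones in phase, so the difference path is zero.
target = [math.sin(0.2 * n) for n in range(400)]
target_out, target_w = gsc_two_channel(target, target)

# Off-axis noise reaches microphone 1 three samples later than microphone 0.
noise0 = [math.sin(2 * math.pi * 0.05 * n) for n in range(2000)]
noise1 = [0.0, 0.0, 0.0] + noise0[:-3]
noise_out, _ = gsc_two_channel(noise0, noise1)
noise_in_power = mean_power([a + b for a, b in zip(noise0, noise1)])
```

In this sketch the in-phase target passes through unchanged apart from the fixed delay, and the filter coefficients stay at zero. The off-axis tone appears in the difference path, so the adaptive filter learns to cancel it from the sum.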

In an environment where there are so many noise sources that their directions cannot be identified, the noise suppression performance of a beamformer degrades. However, because the input voice is directional, a beamformer in which the direction of the target is set as the direction to be suppressed can extract a noise-only output in which the signal from the target sound source is suppressed. The output of the beamformer processing unit 122 is therefore a noise-only signal, and the voice enhancement unit 123 uses it to enhance the voice by spectral subtraction (SS).

Spectral subtraction includes 2chSS, which uses two channels (a reference noise signal and a voice signal), and 1chSS, which uses only a one-channel voice signal. In this embodiment, voice enhancement is performed by 2chSS using the output of the beamformer processing unit 122 as the reference noise. Normally, the noise signal for 2chSS is taken from a microphone placed at a distance from the microphone that collects the target voice, so that the target voice is not picked up. The nature of that noise signal then differs from the noise mixed into the target voice microphone, which lowers the accuracy of SS.

In contrast, this embodiment does not use a dedicated noise-collecting microphone. The noise signal is extracted by the microphone array method using a plurality of microphones, so the nature of the noise does not differ and SS can be performed accurately.

The 2chSS has, for example, the configuration shown in FIG. 6, and the processing of FIG. 6 is performed block by block on input data divided into blocks. The 2chSS shown in FIG. 6 comprises a first FFT 1231 that Fourier-transforms the noise signal, a first band power conversion unit 1232 that converts the frequency components obtained by the first FFT into band power, a noise power calculation unit 1233 that averages the obtained band power in the time direction, a second FFT 1234 that Fourier-transforms the voice signal, a second band power conversion unit 1235 that converts the frequency components obtained by the second FFT into band power, a voice power calculation unit 1236 that averages the obtained band power in the time direction, a band weight calculation unit 1237 that calculates a weight for each band from the obtained noise power and voice power, a weighting unit 1238 that weights the frequency spectrum obtained from the voice signal by the second FFT with the weight of each band, and an inverse FFT 1239 that applies an inverse FFT to the weighted frequency spectrum and outputs the voice.

The block length is, for example, 256 points, matched to the number of FFT points. For the FFT, windowing is performed with, for example, a Hanning window, and the same processing is repeated while shifting by 128 points, half the block length. Finally, the waveforms obtained by the inverse FFT are added while being overlapped by 128 points, which restores the deformation caused by windowing, and the result is output.
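The reason the half-block overlap restores the waveform can be seen in a small sketch. It assumes the periodic form of the Hanning window, for which two copies shifted by half a block sum to exactly one, and it replaces the FFT, weighting, and inverse FFT with an identity `process` step. The function names are illustrative.

```python
import math


def hann(n_points):
    """Periodic Hanning window; copies shifted by n_points / 2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n_points)
            for i in range(n_points)]


def overlap_add(signal, block=256, shift=128, process=lambda frame: frame):
    """Window each block, process it, and overlap-add the results."""
    window = hann(block)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - block + 1, shift):
        frame = [signal[start + i] * window[i] for i in range(block)]
        frame = process(frame)  # FFT, weighting and inverse FFT would go here
        for i in range(block):
            out[start + i] += frame[i]
    return out


signal = [math.sin(0.01 * i) for i in range(1024)]
rebuilt = overlap_add(signal)
# Away from the first and last half block, every sample is covered by two
# windows whose weights sum to 1, so the input is recovered.
max_error = max(abs(rebuilt[i] - signal[i]) for i in range(128, 896))
print(max_error)
```

Only the first and last 128 samples, which are covered by a single window, remain attenuated.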

For the conversion to band power, the frequency components are divided and grouped into 16 bands, for example as shown in Table 1, and the sum of squares of the frequency components in each band is calculated as the band power. The noise power and voice power are calculated for each band by, for example, a first-order recursive filter as in the following equations.

$$p_{k,n} = a \cdot \mathit{pp}_k + (1 - a) \cdot p_{k,n-1} \qquad (6)$$
$$v_{k,n} = a \cdot \mathit{vv}_k + (1 - a) \cdot v_{k,n-1} \qquad (7)$$

Here, k is the band number, n is the block number, p is the averaged band power of the noise channel, pp is the band power of the noise channel in the current block, v is the averaged band power of the voice channel, vv is the band power of the voice channel in the current block, and a is a constant. The value of a is, for example, 0.5.
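The band power conversion and the recursive averaging of equations (6) and (7) can be sketched as below. Table 1 is not reproduced in this text, so the sketch assumes a uniform split of the frequency components into 16 bands; that mapping and the function names are placeholders.

```python
def band_powers(spectrum, n_bands=16):
    """Sum of squared magnitudes of the frequency components in each band.

    The uniform split is a placeholder for the bin-to-band mapping of Table 1.
    """
    per_band = len(spectrum) // n_bands
    return [sum(abs(c) ** 2 for c in spectrum[k * per_band:(k + 1) * per_band])
            for k in range(n_bands)]


def smooth(previous, current, a=0.5):
    """Equations (6) and (7): new[k] = a * current[k] + (1 - a) * previous[k]."""
    return [a * cur + (1.0 - a) * prev for prev, cur in zip(previous, current)]


# 128 frequency components of unit magnitude give a band power of 8 per band.
spectrum = [1 + 0j] * 128
block_power = band_powers(spectrum)
averaged = smooth([0.0] * 16, block_power)  # first block
averaged_next = smooth(averaged, block_power)  # second block
print(block_power[0], averaged[0], averaged_next[0])
```

With a = 0.5 the average moves halfway toward the power of the current block at every step, so it tracks changes in the noise level within a few blocks.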

Next, the band weight calculation unit uses the obtained noise and voice band powers to calculate the weight w_{k,n} for each band, for example by the following equation.

$$w_{k,n} = |v_{k,n} - p_{k,n}| / v_{k,n} \qquad (8)$$

Next, using the weight of each band, the frequency components of the voice channel are weighted, for example by the following equation.

$$Y_{i,n} = X_{i,n} \cdot w_{k,n} \qquad (9)$$

Here, Y_{i,n} is the weighted frequency component, X_{i,n} is the frequency component obtained by the second FFT of the voice channel, and i is the frequency component number. The weight w_{k,n} used is that of the band k corresponding to the frequency component number i in Table 1.
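Equations (8) and (9) can be sketched as follows. The lookup list `band_of_bin`, which stands in for the correspondence of Table 1, and the small `floor` that guards against division by zero are assumptions added for the illustration.

```python
def band_weights(voice_power, noise_power, floor=1e-12):
    """Equation (8): w[k] = |v[k] - p[k]| / v[k].

    `floor` avoids division by zero and is not part of the original equation.
    """
    return [abs(v - p) / max(v, floor)
            for v, p in zip(voice_power, noise_power)]


def apply_weights(spectrum, weights, band_of_bin):
    """Equation (9): Y[i] = X[i] * w[k], where k = band_of_bin[i]."""
    return [x * weights[band_of_bin[i]] for i, x in enumerate(spectrum)]


# Two bands: band 0 is mostly voice, band 1 contains only noise.
weights = band_weights(voice_power=[4.0, 2.0], noise_power=[1.0, 2.0])
spectrum = [2 + 0j, 2 + 0j, 1j, 1j]
band_of_bin = [0, 0, 1, 1]  # hypothetical bin-to-band mapping
weighted = apply_weights(spectrum, weights, band_of_bin)
print(weights, weighted)
```

A band whose averaged power is explained entirely by the noise reference receives a weight of zero, while a band dominated by the voice keeps most of its amplitude.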

The processing flow of the voice enhancement unit 123 using 2chSS is described with reference to FIG. 7. First, initial settings are made, for example block length = 256, number of FFT points = 256, number of shift points = 128, and number of bands = 16 (step S101). Next, the first FFT 1231 reads the noise channel data, performs windowing and FFT, and obtains the frequency components of the noise (step S102). Next, the second FFT 1234 reads the voice channel data, performs windowing and FFT, and obtains the frequency components of the voice (step S103). Next, the first band power conversion unit 1232 calculates the band power of the noise from the frequency components of the noise according to the correspondence in Table 1 (step S104). Next, the second band power conversion unit 1235 calculates the band power of the voice from the frequency components of the voice according to the correspondence in Table 1 (step S105). Next, the noise power calculation unit 1233 obtains the average noise power according to equation (6) (step S106). Next, the voice power calculation unit 1236 obtains the average voice power according to equation (7) (step S107). Next, the band weight calculation unit 1237 obtains the band weights according to equation (8) (step S108). Next, the weighting unit 1238 weights the frequency components of the voice with the weights obtained in step S108 according to equation (9) (step S109). Next, the inverse FFT 1239 applies an inverse FFT to the frequency components weighted in step S109 to obtain a waveform, which is overlapped and added onto the last 128 points of the waveform obtained up to the previous block and then output (step S110).

Steps S102 to S110 are repeated until there is no more input. It is convenient to perform this processing as block processing synchronized with the overall processing, including that of the beamformer. In that case, the block length of the beamformer is matched to the shift length of the voice enhancement unit, 128 points. In this way, the voice enhancement unit 123 outputs a voice in which noise is suppressed, and the voice recognition unit 13 recognizes it.

The voice recognition unit 13 is now described with reference to FIG. 8, which shows its internal configuration. The voice output of the microphone array signal processing unit 12 is first input to an acoustic analysis unit 131, which converts the input voice into feature parameters. Typical feature parameters used for voice recognition include the power spectrum, which can be obtained with band-pass filters or a Fourier transform, and cepstrum coefficients obtained by LPC (linear prediction) analysis, but the type of feature parameter does not matter here. The acoustic analysis unit 131 converts the input voice into feature parameters at fixed time intervals, so its output is a time series of feature parameters (a feature parameter sequence). This feature parameter sequence is supplied to a model matching unit 132.

A recognition vocabulary storage unit 133 stores the reading information of the words needed to create a speech model of each word in the recognition vocabulary, together with an identifier, for example a command ID, that corresponds to the recognition result when each word is recognized. In this embodiment, voice control by word recognition is described as an example of voice recognition in the headset, but the present invention is not limited to this. The voice recognition unit 13 in the headset performs voice recognition with low computational cost, memory usage, and power consumption, such as continuous word recognition, sentence recognition, word spotting, or voice intention understanding, and outputs the result as the voice recognition result.

A recognition model creation/storage unit 134 stores in advance, according to the recognition vocabulary stored in the recognition vocabulary storage unit 133, the speech model of each word and a word ID, which is the identification signal output from the model matching unit 132 as the recognition result when that word is recognized. When recognition other than word recognition is performed, a corresponding identification signal is stored.

The model matching unit 132 obtains the similarity or distance between each speech model of the words to be recognized, stored in the recognition model creation/storage unit 134, and the feature parameter sequence of the input voice, and outputs as the recognition result the word ID associated with the speech model having the maximum similarity (or the minimum distance).

Widely used matching methods for the model matching unit 132 include a method in which the speech models are also expressed as feature parameter sequences and the distance between the feature parameter sequence of a speech model and that of the input voice is obtained by DP (dynamic programming), and a method in which the speech models are expressed with HMMs (hidden Markov models) and the probability of each speech model is calculated when the feature parameter sequence of the input voice is given. The method used does not matter. The word ID output from the model matching unit 132 is output as is as the recognition result of the voice recognition unit 13.
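The DP-based matching can be illustrated with a small dynamic time warping sketch. It assumes that each speech model is a template sequence of feature vectors and that the local distance is Euclidean. The word IDs `CMD_UP` and `CMD_DOWN` and the one-dimensional features are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic-programming (DTW) distance between two feature vector sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = local + min(cost[i - 1][j],      # input frame repeated
                                     cost[i][j - 1],      # template frame repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[len(a)][len(b)]


def recognise(features, models):
    """Return the word ID whose template has the minimum distance to the input."""
    return min(models, key=lambda word_id: dtw_distance(features, models[word_id]))


models = {
    "CMD_UP": [[0.0], [1.0], [2.0], [3.0]],
    "CMD_DOWN": [[3.0], [2.0], [1.0], [0.0]],
}
# A slower utterance of the rising pattern: more frames than the template.
spoken = [[0.1], [0.9], [1.1], [2.0], [2.9], [3.0]]
result = recognise(spoken, models)
print(result)
```

The warping path lets the six input frames align with the four template frames, so differences in speaking rate do not prevent the match.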

FIG. 9 shows the results of a recognition experiment using the voice recognition headset of the present invention. The conventional method recognizes voice input using only one of the microphones arranged on the headset, whereas the method of the present invention inputs to the voice recognition device the enhanced voice obtained by microphone array processing of the voice input from the two microphones arranged on the headset. The recognition rate improves; for example, errors are reduced by about 6% for telephone-bell noise.

In addition, the sound source direction detection function of the microphone array signal processing unit 10 of the headset of the present invention was evaluated when used to reject input of voice and noise that are not targets of recognition. With the conventional method, insertion errors occurred at a rate of 3.3 characters per minute for voice and 1.0 characters per minute for the telephone bell, whereas using the sound source direction detection function of the headset of the present invention prevented all of these insertion errors.

The evaluation data were collected by having a person located 80 cm beside the speaker to be recognized generate four types of noise, which were then superimposed on the speech of the speaker to be recognized. The four types of noise are voice, telephone bell, paper-turning sound, and keyboard typing sound.

As described above, because the above configuration places the plurality of microphones on the same circumference centered on the mouth, the mouth, which is the sound source, is equidistant from each microphone. Speech can therefore be enhanced by the microphone array method with a simple configuration that requires no delay unit, and the speech recognition rate can be further improved. The speech recognition rate can also be maintained regardless of the speaker's position. Although the directivity of the microphones has not been specifically described here, orienting the directivity of each microphone toward the mouth further enhances the speech signal to be recognized and improves the recognition rate.
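The benefit of equal path lengths can be illustrated with a deliberately idealized sketch. It is not the beamformer of the embodiment: it simply averages two microphone signals with no delay compensation. The signal frequencies and the half-period noise delay are hypothetical values chosen so that the noise cancels completely; real noise would only be partly attenuated, depending on its frequency and arrival delay.

```python
import math

def sum_without_delay(mic_a, mic_b):
    """Average two microphone signals sample by sample, with no delay unit."""
    return [(a + b) / 2.0 for a, b in zip(mic_a, mic_b)]

def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)

n = 400
# Voice from the mouth: equal path lengths, so both microphones receive it in phase
voice = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
# Off-axis noise with a 20-sample period; assumed to reach microphone B
# 10 samples (half a period) later than microphone A
noise_a = [math.sin(2 * math.pi * 20 * t / n) for t in range(n)]
noise_b = [math.sin(2 * math.pi * 20 * (t - 10) / n) for t in range(n)]

mic_a = [v + x for v, x in zip(voice, noise_a)]
mic_b = [v + x for v, x in zip(voice, noise_b)]

enhanced = sum_without_delay(mic_a, mic_b)
# What remains after subtracting the clean voice is the residual noise
residual = [e - v for e, v in zip(enhanced, voice)]
```

Because the voice components are already time-aligned, the plain average keeps the voice at full amplitude while the out-of-phase noise components cancel each other.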

FIG. 10 shows the appearance of a voice recognition headset according to Embodiment 2 of the present invention. FIG. 10(a) is a front view of the voice recognition headset, and FIG. 10(b) is a side view seen from the direction of the arrow in FIG. 10(a). Components identical to those of Embodiment 1 described above are given the same reference numerals. The voice recognition headset comprises a support frame 1, ear pads 2 and 3, speakers 4 and 5, an arm 6, microphones 7A and 7B that detect the voice uttered by the wearer (user) of the headset and generate electrical speech signals, and a signal processing module 10 that digitizes these speech signals and performs speech recognition on them. For simplicity, FIG. 10 uses two microphones, but the invention can also be implemented with three or more microphones.

This headset has a shape in which the left and right ear pads 2 and 3 are connected by a flexible support frame 1, and it is worn on the user's head. An arm 6 extends from one ear pad 2, and the microphones 7A and 7B are arranged on the arm 6. The microphones 7A and 7B are fixed to the arm 6 so that, when the user wears the headset, they lie on the same spherical surface 9B centered approximately on the user's mouth 8 and each microphone is at an equal distance from the mouth.
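The geometric point can be checked with a short worked calculation. All coordinates below are hypothetical and serve only as an illustration: the 5 cm sphere radius, the microphone angles, and the direction of the neighboring noise source (placed 80 cm from the mouth, as in the evaluation above) are assumptions, as is the approximate speed of sound.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature

def on_sphere(center, radius, azimuth_deg, elevation_deg):
    """Point on a sphere around `center`, given in spherical angles."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (
        center[0] + radius * math.cos(el) * math.cos(az),
        center[1] + radius * math.cos(el) * math.sin(az),
        center[2] + radius * math.sin(el),
    )

def arrival_time_difference(source, mic_a, mic_b):
    """Difference in arrival time (seconds) of a sound at mic_a versus mic_b."""
    return (math.dist(source, mic_a) - math.dist(source, mic_b)) / SPEED_OF_SOUND

mouth = (0.0, 0.0, 0.0)
# Hypothetical microphone positions on a 5 cm sphere centered on the mouth
mic_7a = on_sphere(mouth, 0.05, azimuth_deg=-20.0, elevation_deg=0.0)
mic_7b = on_sphere(mouth, 0.05, azimuth_deg=20.0, elevation_deg=10.0)
# Hypothetical noise source 80 cm to the side of the mouth
neighbor = (0.0, 0.8, 0.0)

tdoa_mouth = arrival_time_difference(mouth, mic_7a, mic_7b)
tdoa_neighbor = arrival_time_difference(neighbor, mic_7a, mic_7b)
```

Under these assumptions the arrival time difference for the mouth is zero up to rounding error, while the noise source produces a clearly nonzero difference, regardless of where on the sphere the two microphones sit.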

The configuration of the signal processing module 10, from the speech signals detected by the microphones 7A and 7B through to speech recognition, is the same as in Embodiment 1, so its description is omitted here.
As described above, because the plurality of microphones are arranged on the same spherical surface centered on the mouth, the mouth is equidistant from each microphone, as in Embodiment 1. Speech can therefore be enhanced by the microphone array method with a simple configuration that requires no delay unit, and the speech recognition rate can be further improved. The speech recognition rate can also be maintained regardless of the speaker's position.

FIG. 1 is an external view of a voice recognition headset according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing the configuration of the signal processing module.
FIG. 3 is a block diagram showing an example of the internal configuration of the microphone array processing unit.
FIG. 4 is a block diagram showing another example of the internal configuration of the microphone array processing unit.
FIG. 5 is a block diagram showing an example of the internal configuration of the beamformer processing unit.
FIG. 6 is a block diagram showing an example of the internal configuration of the speech enhancement unit.
FIG. 7 is a flowchart showing the processing procedure of the speech enhancement unit.
FIG. 8 is a block diagram showing an example of the internal configuration of the speech recognition unit.
FIG. 9 is a diagram showing the evaluation results of recognition performance.
FIG. 10 is an external view of a voice recognition headset according to Embodiment 2 of the present invention.

Explanation of symbols

1 ... Support frame
2, 3 ... Ear pads
4, 5 ... Speakers
6 ... Arm
7A, 7B ... Microphones
8 ... Mouth
10 ... Signal processing module
11A, 11B ... AD converters
12 ... Microphone array signal processing unit
13 ... Speech recognition unit

Claims

1. A voice recognition headset comprising: a plurality of microphones that detect sound and generate sound signals; a microphone support on which the plurality of microphones are arranged; enhanced speech signal generating means for synthesizing the sound signals of the plurality of microphones to generate an enhanced speech signal; and speech recognition means for recognizing the enhanced speech signal.

2. A voice recognition headset comprising: a plurality of microphones that detect sound and generate sound signals; a microphone support that places the plurality of microphones on the same circumference centered on the mouth; enhanced speech signal generating means for synthesizing the sound signals of the plurality of microphones to generate an enhanced speech signal; and speech recognition means for recognizing the enhanced speech signal.

3. A voice recognition headset comprising: a plurality of microphones that detect sound and generate sound signals; a microphone support that places the plurality of microphones on the same spherical surface centered on the mouth; enhanced speech signal generating means for synthesizing the sound signals of the plurality of microphones to generate an enhanced speech signal; and speech recognition means for recognizing the enhanced speech signal.

4. The voice recognition headset according to claim 1, wherein the plurality of microphones are supported so that the distance between them is kept constant.

5. The voice recognition headset according to claim 2, wherein the directivity of the microphone is directed toward the mouth.