JP2011081322A

JP2011081322A - Voice recognition system and voice recognition method

Info

Publication number: JP2011081322A
Application number: JP2009235565A
Authority: JP
Inventors: Masaomi Iida; 雅臣飯田
Original assignee: Murata Machinery Ltd
Current assignee: Murata Machinery Ltd
Priority date: 2009-10-09
Filing date: 2009-10-09
Publication date: 2011-04-21

Abstract

PROBLEM TO BE SOLVED: To detect an utterance while discriminating it from noise, without estimating the sound source position.

SOLUTION: A voice recognition system includes a plurality of directional microphones and a voice recognition section that performs voice recognition on a signal from at least one of the plurality of directional microphones. The system further includes an omnidirectional microphone and an utterance detection section that detects an utterance section from the signal of the omnidirectional microphone, and the voice recognition section performs voice recognition on the signal in the utterance section.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to speech recognition using a plurality of microphones, and more particularly to the detection of utterances.

Patent Document 1 (JP 2009-210956 A) discloses a speech recognition apparatus that uses a microphone array. In Patent Document 1, the position of a sound source is estimated from the differences in the arrival time of sound at the individual microphones, and the position of the speaker is separately stored in advance. Acoustic signals whose source is at the speaker's position are then treated as the target of speech recognition. This approach, however, requires the sound source position to be estimated continuously, which makes the signal processing heavy.

Patent Document 1: JP 2009-210956 A

An object of the present invention is to detect utterances, distinguishing them from noise, without estimating the position of the sound source.

The present invention is a speech recognition system comprising a plurality of directional microphones and a speech recognition unit that performs speech recognition on a signal from at least one of the plurality of directional microphones, characterized by further comprising an omnidirectional microphone and an utterance detection unit that detects an utterance section from a signal from the omnidirectional microphone, wherein the speech recognition unit is configured to perform speech recognition on the signal in the utterance section.

The present invention is also a method of performing speech recognition, by a speech recognition apparatus, on a signal from at least one of a plurality of directional microphones, characterized in that the speech recognition apparatus detects an utterance section from a signal from an omnidirectional microphone and performs speech recognition on the signal in the utterance section.

In this specification, statements about the speech recognition apparatus apply as they are to the speech recognition method, and conversely statements about the method apply as they are to the apparatus. The directional microphones and the omnidirectional microphone may be obtained, for example, by providing a plurality of omnidirectional microphones and realizing a plurality of directional microphones from combinations of them. Alternatively, one omnidirectional microphone and a plurality of directional microphones may be provided separately.

When a noise source lies in the pointing direction of a directional microphone, the microphone captures even weak noise as a large signal, and the noise is easily mistaken for an utterance. Moreover, when the noise source moves or the speaker moves, it is difficult to arrange the directional microphones so as to avoid the noise source. In contrast, when the utterance section is detected from the signal of an omnidirectional microphone, fluctuations in noise intensity become smaller and the noise level approaches a constant, which makes noise and speech easier to tell apart. The utterance section can therefore be detected without processing such as estimating the position of the sound source or arranging the directional microphones to avoid noise sources.

Preferably, the system includes a headset having the plurality of directional microphones and the omnidirectional microphone. Because the posture and position of the worker wearing the headset are not fixed, a directional microphone pointing at a noise source is a particular problem. Even in such a case, using the omnidirectional microphone allows utterances to be detected separately from noise.

Also preferably, the speech recognition unit includes a selection unit that determines, for the plurality of directional microphones, which directional microphone's signal is to be subjected to speech recognition, based on at least one of (a) the degree of difference in signal strength between the utterance section and other sections, for example the S/N ratio, and (b) the frequency band of the signal in the utterance section. These quantities indicate whether an individual directional microphone is picking up speech or picking up noise. The signal from a directional microphone that is likely to be picking up speech is therefore subjected to speech recognition, while the signal from a directional microphone that is unlikely to be picking up speech need not be.

Preferably, the utterance detection unit detects an utterance by at least comparing the signal from the omnidirectional microphone with a threshold, learns the threshold according to the strength of the signal from the omnidirectional microphone outside utterance sections, and changes the threshold according to the position of the speech recognition apparatus. Learning the threshold according to the noise level allows utterances to be detected more reliably. When the microphones of the speech recognition apparatus are mounted on a headset, for example, and the worker wearing the headset moves between a quiet environment and a noisy one, learning of the noise level cannot keep up. After a move to a noisy environment, noise is easily mistaken for an utterance; after a move to a quiet environment, utterances are easily missed because people tend to lower their voices. Forcibly changing the threshold according to position compensates for the learning lag.

FIG. 1 shows the external appearance of the speech recognition apparatus of the embodiment.
FIG. 2 is a circuit diagram of the filter used to form the directional microphones in the embodiment.
FIG. 3 is a block diagram of the speech recognition apparatus of the embodiment.
FIG. 4 is a plan view showing the arrangement of directional microphones and an omnidirectional microphone in a modification.
FIG. 5 is a block diagram of the main part of the speech recognition apparatus of the modification.
FIG. 6 shows the detection model for the utterance section in the embodiment.
FIG. 7 is a block diagram of the selector in the embodiment.
FIG. 8 is a flowchart showing detection of the utterance section in the embodiment.
FIG. 9 is a flowchart showing selection of the directional microphones in the embodiment.

The best mode for carrying out the present invention is described below. The scope of the present invention should be determined on the basis of the claims, with reference to the specification and well-known techniques in this field, in accordance with the understanding of those skilled in the art.

FIGS. 1 to 9 show the speech recognition apparatus 2 and the speech recognition method of the embodiment. In FIG. 1, 4 is a speaker and 6 is a microphone array that includes a plurality of omnidirectional microphones 7 to 10. 12 is an arm, 13 is a cord, and 14 is the speech recognition apparatus body, which includes a power supply 15, a signal processing unit 16, and a communication unit 18. In the embodiment, the speech recognition apparatus 2 performs speech recognition on its own and communicates with a server (not shown) via the communication unit 18. For example, a picking instruction from the server is conveyed to the worker through the speaker 4, the worker's voice is recognized by the microphones 7 to 10 and the signal processing unit 16, and the result is reported to the server from the communication unit 18. The server also monitors the worker's position using a camera, GPS, or the like, and changes the threshold for utterance detection. Alternatively, speech recognition may be performed on the server, and the signal processing unit 16 may be omitted from the speech recognition apparatus 2.

The speech recognition apparatus 2 of the embodiment is suitable for recognizing the speech not only of picking workers but also of aircraft pilots, automobile drivers, surgeons and dentists during procedures, factory workers, call center operators, and the like. The speech recognition apparatus 2 of the embodiment takes the form of a headset; when the worker turns his or her head, the orientation of the microphone array 6 changes, so its orientation relative to noise changes constantly. When the worker moves, the position relative to surrounding noise sources also changes.

The left side of FIG. 1 shows the microphone array 6 enlarged. The microphone array 6 includes, for example, four non-directional microphones 7 to 10, arranged, for example, at the vertices of a regular tetrahedron, with microphone 7 projecting upward relative to microphones 8 to 10. Of the four microphones 7 to 10, one, for example, is used as it is as the omnidirectional microphone. Combining the four microphones 7 to 10 two at a time gives ₄C₂ = 6 combinations.

These six combinations realize six virtual directional microphones. For example, pairing microphone 7 with each of the three microphones 8 to 10 yields three directional microphones. Pairing microphone 8 with microphone 9 yields one microphone directed rightward and one directed leftward, and the pairs of microphones 8 and 10 and of microphones 9 and 10 likewise yield two each, for a total of, for example, nine directional microphones (6 + 3).
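
The counting above can be checked with a short sketch. It assumes that each pair containing microphone 7 contributes one beam and each pair among microphones 8 to 10 contributes two opposite beams; which of the two directions is kept for the pairs containing microphone 7 is not stated in the text, so the ordering used here is arbitrary. The function name `enumerate_virtual_mics` and the tuple convention (delayed microphone, undelayed microphone) are illustrative only.

```python
from itertools import combinations


def enumerate_virtual_mics(mics=(7, 8, 9, 10), top=7):
    """Return (pairs, beams) for the tetrahedral array.

    pairs: unordered microphone pairs (4C2 = 6).
    beams: ordered pairs (delayed_mic, undelayed_mic), one per virtual
           directional microphone. Assumption: pairs containing the top
           microphone give one beam, all other pairs give two.
    """
    pairs = list(combinations(mics, 2))
    beams = []
    for a, b in pairs:
        beams.append((a, b))
        if top not in (a, b):
            beams.append((b, a))  # opposite direction of the same pair
    return pairs, beams


pairs, beams = enumerate_virtual_mics()
print(len(pairs), len(beams))  # prints: 6 9
```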

FIG. 2 shows a directional microphone formed by combining microphones 7 and 8. Microphones 7 and 8 are used as the example, but the same applies to other combinations. 20 and 21 schematically show the sensitivity distributions (directivity) of microphones 7 and 8; sensitivity drops in the direction shadowed by the other microphone. 22 is an amplifier that amplifies the audio signals of microphones 7 and 8. 25 is a delay unit that amplifies the signal and delays it by the time a sound wave takes to travel the distance between microphones 7 and 8. Because microphones 7 to 10 are arranged at the vertices of a regular tetrahedron, the spacing between any two microphones is the same, and the delay applied by the delay unit 25 is the same regardless of the combination. One delay unit 25 is therefore provided for each of microphones 7 to 10.

26 is a subtractor that obtains the difference between the signal from the delay unit 25 and the undelayed signal from the other microphone of the pair. For example, subtractor 26a subtracts the signal of microphone 8 from the signal of microphone 7 to obtain a signal with directivity toward the lower side of FIG. 2. When a sound wave reaches microphone 7 in FIG. 2 from above, its signal is input to subtractor 26a through the delay unit 25 and is canceled by the signal from microphone 8, so the output of subtractor 26a is small. When a sound wave reaches microphone 8 from below, its signal is input to subtractor 26a immediately, whereas the signal from microphone 7 is delayed by the delay unit 25, so the signals do not cancel. In this case a single acoustic event is output from subtractor 26a twice, with inverted sign and a slight time offset, forming a kind of repeated signal; however, because the acoustic model 36 of FIG. 3 processes the signal after converting it into a frequency spectrum or the like, the effect is small. If necessary, a filter that removes the repeated signal may be provided between the delay unit 25 and subtractors 26a, 26b, and so on. Similarly, subtractor 26b in FIG. 2 subtracts the signal of microphone 7 from the signal of microphone 8 to obtain a microphone directed toward the upper side of FIG. 2.
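
The delay-and-subtract behavior of delay unit 25 and subtractor 26a can be illustrated with a minimal discrete-time sketch. It assumes idealized unit pulses, equal microphone gains, and an inter-microphone travel time of exactly `N_DELAY` samples; these values and the helper names are hypothetical and are not taken from the specification.

```python
N_DELAY = 3  # assumed travel time between microphones 7 and 8, in samples


def delay(signal, n):
    """Delay a list of samples by n samples, padding the start with zeros."""
    return [0.0] * n + list(signal[:len(signal) - n])


def subtractor(delayed_mic, undelayed_mic, n_delay=N_DELAY):
    """Delayed signal of one microphone minus the undelayed signal of the other."""
    d = delay(delayed_mic, n_delay)
    return [a - b for a, b in zip(d, undelayed_mic)]


def pulse(length, at):
    """A unit pulse at sample index `at`."""
    s = [0.0] * length
    s[at] = 1.0
    return s


# Sound from above: reaches microphone 7 at sample 5, microphone 8 at sample 8.
from_above = subtractor(pulse(20, 5), pulse(20, 8))
# Sound from below: reaches microphone 8 at sample 5, microphone 7 at sample 8.
from_below = subtractor(pulse(20, 8), pulse(20, 5))

print(sum(abs(v) for v in from_above))  # canceled
print(sum(abs(v) for v in from_below))  # not canceled: two pulses of opposite sign
```

In this sketch the pulse from below appears twice at the output, once inverted and once delayed, which corresponds to the repeated signal mentioned above.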

FIG. 3 shows the overall configuration of the speech recognition apparatus 2. An amplifier 22 is connected to each of microphones 7 to 10. The signal from any one of the non-directional microphones 7 to 10, for example microphone 7, is processed by the utterance detection unit 30 to detect utterance sections. The learning unit 32 changes the utterance detection threshold based on the signal from the amplifier 22 connected to microphone 7 outside utterance sections. The threshold is additionally changed according to the position of the speech recognition apparatus 2; the position signal is generated, for example, by providing GPS in the speech recognition apparatus 2. Alternatively, a server (not shown) may recognize the position of the speech recognition apparatus 2 and supply the position signal.

The utterance detection threshold changes on a time scale of, for example, seconds to hours. For example, for sections other than utterance sections, at predetermined intervals such as 1 second to 1 minute, the weighted average of the current threshold and the noise level over roughly the past 1 second to 1 minute is taken as the new threshold. The weights are, for example, about 99% to 80% for the current threshold and about 1% to 20% for the ambient noise level. The threshold is thus learned (changed) at a predetermined rate set by the update interval and the weight given to the ambient noise level. The threshold is changed by the position signal, by contrast, when, for example, a picking worker passes through a door from a nearly silent area such as a refrigerated or freezer room into a noisy area where automated guided vehicles and the like travel. On passing through the door, the threshold is changed with priority over learning.
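
The two update paths for the threshold, gradual learning and a position-triggered override, might be sketched as follows. The class name `ThresholdLearner`, the zone names, and the numeric values are hypothetical; the weighted average follows the text literally and assumes the noise level is expressed on the same scale as the threshold.

```python
class ThresholdLearner:
    """Sketch of learning unit 32 under the assumptions stated above."""

    def __init__(self, threshold, noise_weight=0.1, zone_thresholds=None):
        self.threshold = threshold
        self.noise_weight = noise_weight  # text suggests roughly 0.01 to 0.20
        self.zone_thresholds = zone_thresholds or {}  # hypothetical per-zone presets

    def learn(self, noise_level):
        """Blend the current threshold with the recent non-utterance noise level."""
        w = self.noise_weight
        self.threshold = (1.0 - w) * self.threshold + w * noise_level
        return self.threshold

    def on_position(self, zone):
        """Position signal: replace the threshold at once, ahead of learning."""
        if zone in self.zone_thresholds:
            self.threshold = self.zone_thresholds[zone]
        return self.threshold


learner = ThresholdLearner(
    threshold=1.0,
    noise_weight=0.2,
    zone_thresholds={"cold_room": 0.2, "agv_aisle": 2.0},
)
after_learning = learner.learn(noise_level=2.0)  # moves 20% of the way toward 2.0
after_door = learner.on_position("agv_aisle")    # jumps to the preset for the zone
```

With a small noise weight the learned threshold drifts slowly, which is why the sketch lets the position signal overwrite it outright when the worker changes zones.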

As shown in FIG. 2, the filter 24 produces the signals of the nine directional microphones. The selector 34 selects the signals of two directional microphones, the adder 35 adds the two selected signals, and the acoustic model 36 converts the sum into phonemes. The language model 38 then converts the phonemes into language, completing speech recognition. The selector 34, the acoustic model 36, and the language model 38 constitute the speech recognition unit.

The embodiment uses a headset as the example, but as shown in FIG. 4, directional microphones 41 and an omnidirectional microphone 42 may instead be arranged on a fixed table 40, with the utterance section detected from the signal of the omnidirectional microphone 42. Suppose, for example, that a noise source 44 passes around the table 40. If the utterance section were detected from the signals of the directional microphones 41, a particular directional microphone 41 would pick up the passing noise source 44, which would easily be mistaken for an utterance section. With the omnidirectional microphone 42, in contrast, the noise level changes little even when the position of the noise source 44 changes, and the microphone is less affected by any particular noise source 44, so the utterance section can be detected more accurately.

FIG. 5 shows a modification of the selector, in which the audio signals from the nine virtual microphones generated by the filter 24 are each converted into phonemes by the acoustic model 36. The acoustic model 36 produces a likelihood when converting to phonemes, and the selector 50 may select, for example, the two signals with the highest likelihood and input those two phoneme signals to the language model 38. In this arrangement, however, the acoustic model 36 also processes signals that the language model 38 never uses, so the processing load is heavier.

FIGS. 5 to 9 show the operation of the embodiment. For example, a picking worker wears the speech recognition apparatus 2 as a headset and performs picking while communicating with a server (not shown) through the communication unit 18. The server is assumed to know the worker's position by means of, for example, a camera or GPS (not shown). The server gives the worker instructions through the speaker 4, and the worker's input to the server is made through the speech recognition apparatus 2. When the orientation of the headset changes, its orientation relative to noise sources changes, and the noise level changes as the worker moves.

Here, the four omnidirectional microphones 7 to 10 are combined to form nine virtual directional microphones, and any one of the four microphones 7 to 10 is used for utterance detection. In utterance detection, as shown in FIG. 6, thresholds are set, for example, on both sides of the zero level of the signal, and the presence or absence of an utterance is determined from, for example, the number of times the signal crosses the zero level after exceeding a threshold. This amounts to evaluating both the strength and the frequency of the signal.
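
One possible reading of this detection rule is sketched below: a zero crossing is counted only if the signal first went beyond the threshold on one side of zero. The function names and the `min_crossings` decision value are assumptions for illustration; the specification does not give a specific count.

```python
def count_armed_zero_crossings(samples, threshold):
    """Count zero crossings that follow an excursion beyond +/-threshold."""
    count = 0
    armed = 0  # +1 or -1 once the signal has exceeded the threshold on that side
    for x in samples:
        if armed != 0 and x * armed < 0:
            count += 1  # signal changed sign after a large excursion
            armed = 0
        if abs(x) > threshold:
            armed = 1 if x > 0 else -1
    return count


def is_utterance(frame, threshold, min_crossings=3):
    """Assumed decision rule: enough armed zero crossings within the frame."""
    return count_armed_zero_crossings(frame, threshold) >= min_crossings


loud = [2.0, -2.0] * 3   # strong oscillation, exceeds the threshold on both sides
quiet = [0.3, -0.3] * 3  # weak noise, never reaches the threshold

print(is_utterance(loud, threshold=1.0))   # True
print(is_utterance(quiet, threshold=1.0))  # False
```

Weak noise crosses zero just as often as speech in this example, but it is never counted because it does not first exceed the threshold, so both strength and frequency enter the decision.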

Because the noise level changes with the worker's position and other factors, the threshold is updated by learning from the microphone signal in sections where no utterance is detected. The noise level also changes when the worker passes through a door into another room. Learning generally cannot keep up with a change in noise level caused by a change of position, so the threshold is forcibly changed based on position information from the server or the like.

When an utterance is detected, speech recognition is performed using the signals of, for example, the two microphones with the highest likelihood among the nine directional microphones. FIG. 7 shows the configuration for this. 76 is one of the virtual directional microphones. The averaging unit 71 stores the average signal level outside utterance sections, and the divider 70 obtains the ratio of the signal from microphone 76 to that average as the S/N ratio. The counter 72 obtains the number of zero crossings and the like in the utterance section. In addition, a band-pass filter or other simple frequency conversion unit 73 obtains the signal strength in the frequency band corresponding to speech, preferably at a plurality of frequencies, to determine whether the frequency spectrum corresponds to a speech signal.

The S/N ratio indicates the degree to which speech, as opposed to noise, is being picked up; the number of zero crossings indicates the average frequency of the detected signal; and the signal of the frequency conversion unit 73 likewise indicates the spectral shape of the input signal. The evaluation unit 74 evaluates these signals and outputs, as a likelihood, how probable it is that speech is being picked up. Because the signal of the counter 72 and the signal of the frequency conversion unit 73 both relate to signal frequency, only one of them may be used. Processing the signals from, for example, the two directional microphones with the highest likelihood in the acoustic model 36 allows accurate speech recognition. Instead of the top two, only the signal with the highest likelihood, or only the signals with the top three likelihoods, may be used.
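
The selection described for FIG. 7 could be sketched as below. The specification does not say how the evaluation unit 74 combines its inputs, so the scoring rule here (the S/N ratio, zeroed when the zero-crossing count falls outside an assumed speech-like range) is purely an assumption, as are the beam names, the sample values, and the function names. Noise floors are assumed to be positive.

```python
def mean_abs(samples):
    """Average absolute level of a list of samples."""
    return sum(abs(x) for x in samples) / len(samples)


def zero_crossings(samples):
    """Number of sign changes between consecutive samples (role of counter 72)."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)


def likelihood(utterance, noise_floor, zc_range):
    """Assumed scoring rule for evaluation unit 74."""
    snr = mean_abs(utterance) / noise_floor  # role of divider 70 and averaging unit 71
    lo, hi = zc_range
    return snr if lo <= zero_crossings(utterance) <= hi else 0.0


def select_top(beams, noise_floors, zc_range, k=2):
    """Return the names of the k beams with the highest likelihood."""
    scores = {
        name: likelihood(sig, noise_floors[name], zc_range)
        for name, sig in beams.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


beams = {
    "a": [2.0, 2.0, -2.0, -2.0, 2.0, 2.0, -2.0, -2.0],   # strong, speech-like rate
    "b": [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0],   # weaker, speech-like rate
    "c": [3.0, -3.0, 3.0, -3.0, 3.0, -3.0, 3.0, -3.0],   # loud but too high in frequency
    "d": [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5],   # no louder than its noise floor
}
noise_floors = {name: 0.5 for name in beams}

selected = select_top(beams, noise_floors, zc_range=(2, 4))
print(selected)  # ['a', 'b']
```

Beam "c" has the largest level but is rejected on frequency grounds, which illustrates why the text evaluates strength and frequency together rather than level alone.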

FIGS. 8 and 9 show the processing of the embodiment, the content of which is as described above. In the utterance section detection of FIG. 8, steps 1 and 2 distinguish utterance sections from non-utterance sections by the number of times the signal of the omnidirectional microphone crosses zero from outside the thresholds. In non-utterance sections, the utterance detection threshold is updated by learning, as in steps 3 and 4, and when the worker passes through a door into another room, the threshold is corrected according to the position information (step 5).

Speech recognition is performed on the utterance section. In steps 10 and 11, the two directional microphones with the highest likelihood are selected based on the S/N ratio of the signal from each directional microphone, the number of zero crossings, the frequency distribution, and so on, and speech recognition is performed in step 12.

In the embodiment, utterances can be detected, with noise and speech distinguished, without estimating the position of the sound source. Which directional microphone's signal to recognize can also be determined without estimating the sound source position. Furthermore, when the speech recognition apparatus 2 is provided in a headset, the utterance section can be detected without being affected by changes in the orientation of the microphones relative to noise sources or by changes in position relative to noise sources as the worker moves. Changing the utterance detection threshold by learning allows a threshold matched to the noise level at each point in time, and changing the threshold by position information allows the threshold to be changed on moving to a new environment without waiting for learning.

2 Speech recognition apparatus
4 Speaker
6 Microphone array
7 to 10 Microphones
12 Arm
13 Cord
14 Speech recognition apparatus body
15 Power supply
16 Signal processing unit
18 Communication unit
20, 21 Sensitivity distributions
22 Amplifier
24 Filter
25 Delay unit
26 Subtractor
30 Utterance detection unit
32 Learning unit
34 Selector
35 Adder
36 Acoustic model
38 Language model
40 Table
41 Directional microphone
42 Omnidirectional microphone
44 Noise source
50 Selector
70 Divider
71 Averaging unit
72 Counter
73 Frequency conversion unit
74 Evaluation unit
76 Directional microphone

Claims

1. A speech recognition system comprising a plurality of directional microphones and a speech recognition unit that performs speech recognition on a signal from at least one of the plurality of directional microphones, the system further comprising:
an omnidirectional microphone; and an utterance detection unit that detects an utterance section from a signal from the omnidirectional microphone, wherein the speech recognition unit is configured to perform speech recognition on the signal in the utterance section.

2. The speech recognition system according to claim 1, comprising a headset that includes the plurality of directional microphones and the omnidirectional microphone.

3. The speech recognition system according to claim 1, wherein the speech recognition unit includes a selection unit that determines, for the plurality of directional microphones, which directional microphone's signal is to be subjected to speech recognition, based on at least one of the degree of difference in signal strength between the utterance section and other sections and the frequency band of the signal in the utterance section.

4. The speech recognition system according to any one of claims 1 to 3, wherein the utterance detection unit is configured to detect an utterance by at least comparing the signal from the omnidirectional microphone with a threshold, to learn the threshold according to the strength of the signal from the omnidirectional microphone outside utterance sections, and to change the threshold according to the position of the speech recognition apparatus.

5. A method of performing speech recognition, by a speech recognition apparatus, on a signal from at least one of a plurality of directional microphones,
wherein the speech recognition apparatus detects an utterance section from a signal from an omnidirectional microphone and performs speech recognition on the signal in the utterance section.