JP2011118124A

JP2011118124A - Speech recognition system and recognition method

Info

Publication number: JP2011118124A
Application number: JP2009274853A
Authority: JP
Inventors: Masaomi Iida; 雅臣飯田
Original assignee: Murata Machinery Ltd
Current assignee: Murata Machinery Ltd
Priority date: 2009-12-02
Filing date: 2009-12-02
Publication date: 2011-06-16

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition system and a recognition method that are robust to noise even when the speaker moves.

SOLUTION: A noise spectrum is stored for each position, an acoustic model that includes the noise at each position is stored, and the speaker's position is determined. The spectrum of the acoustic signal from a microphone in a non-speech section is compared with the stored noise spectrum for that position, and when a difference at or above a threshold is detected, noise is removed from at least part of the portion exceeding the threshold. Speech recognition is then performed using the acoustic model for that position.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to speech recognition, and more particularly to improving robustness against noise.

Patent Document 1 (JP 2005-208483 A) discloses changing the language model according to the organization to which a speaker belongs, for speech recognition at a call center or the like. In this technique, a plurality of small-vocabulary language models are prepared and switched according to the speaker, thereby reducing misrecognition. Patent Document 2 (Japanese Patent No. 4003566) discloses suppressing noise by applying nonlinear spectral subtraction on the low-frequency side and a Wiener filter or a Kalman filter on the high-frequency side.

If the language-model switching of Patent Document 1 were applied to switching noise removal parameters, the noise parameters would have to be learned over a certain period of time. When the speaker moves quickly, however, this learning cannot keep up.

Patent Document 1: JP 2005-208483 A
Patent Document 2: Japanese Patent No. 4003566

An object of the present invention is to provide a speech recognition system and a recognition method that are robust against noise even when the speaker moves.
An additional object of the present invention is to switch the language model appropriately in speech recognition for picking, guidance, navigation, and the like.

The speech recognition system of the present invention is a system in which noise in an acoustic signal from a microphone is processed by a noise removal unit and speech is recognized by a speech recognition unit that stores acoustic models, the system comprising:
a noise spectrum storage unit that stores a noise spectrum for each position;
an acoustic model storage unit that stores, for each position, an acoustic model that includes the noise at that position;
position determination means; and
a noise comparison unit that compares the spectrum of the acoustic signal from the microphone in a non-speech section with the noise spectrum stored in the noise spectrum storage unit for the position determined by the determination means,
wherein, when the noise comparison unit detects a difference at or above a threshold, the noise removal unit removes noise from the acoustic signal from the microphone in the speech section, for at least part of the portion exceeding the threshold, and
the speech recognition unit performs speech recognition using the acoustic model for the position determined by the determination means.

The speech recognition method of the present invention is a method in which noise in an acoustic signal from a microphone is processed by a noise removal unit and speech is recognized by a speech recognition unit that stores acoustic models, the method comprising:
storing a noise spectrum for each position in storage means;
storing, in storage means, an acoustic model for each position that includes the noise at that position;
determining the position with position determination means;
comparing, with comparison means, the spectrum of the acoustic signal from the microphone in a non-speech section with the noise spectrum stored for the position determined by the determination means and, when a difference at or above a threshold is detected, removing noise with the noise removal unit from the acoustic signal from the microphone in the speech section, for at least part of the portion exceeding the threshold; and
performing speech recognition with the speech recognition unit using the acoustic model for the position determined by the determination means.

The present invention provides the following effects.
(1) The speaker's position is determined by comparing noise spectra or by a position sensor such as GPS, and an acoustic model that includes the noise at each position is used. As a result, when the change from the stored noise spectrum is small, the influence of noise can be made extremely small. Even the same speaker's voice changes with the surrounding environment, and this too is already incorporated into the acoustic model for each location.
(2) Instead of learning the noise as the speaker moves, the speaker's position is identified and the acoustic model for that position is used. The system can therefore follow the speaker's movement quickly.
(3) Noise removal is performed when the noise spectrum stored for the location and the current noise spectrum differ by the threshold or more. Because noise removal needs to be applied only to the small portion exceeding the threshold, or to part of it, the speech signal is distorted very little.
In this specification, statements about the speech recognition system apply equally to the speech recognition method, and statements about the speech recognition method apply equally to the speech recognition system.

Preferably, the speech recognition system comprises a server and a plurality of mobile terminals; the microphone and the determination means are provided in each terminal; the noise spectrum storage unit, the acoustic model storage unit, and the noise comparison unit are provided in the server; and the noise removal unit and the speech recognition unit are provided in either the server or the terminal.

Preferably, the determination means is a position sensor such as GPS. When the speaker's position is determined by a position sensor, switching of the acoustic model does not lag behind the speaker's movement. The speaker's position can also be determined by comparing the stored noise spectra with the actual noise, but because the noise must first be acquired and compared, the position determination may lag behind the speaker's movement. The position sensor only needs to determine the position well enough to switch the acoustic model and the like; it does not need to detect the position with high accuracy.

Preferably, a plurality of language models, each consisting of a dictionary for speech recognition, are stored, and the language model is switched according to the position determined by the determination means. Switching the language model according to the speaker's position enables more accurate speech recognition.

FIG. 1 is a block diagram of the speech recognition system of the embodiment.
FIG. 2 is a diagram showing the algorithm for switching the acoustic model and the language model in the embodiment.
FIG. 3 is a diagram showing an example in which the stored noise spectrum and the actual noise spectrum nearly coincide.
FIG. 4 is a diagram showing an example in which the difference between the stored noise spectrum and the actual noise spectrum is at or above the threshold and noise removal is performed.
FIG. 5 is a diagram showing an example in which the language model is switched between one for books and one for food according to the speaker's position.
FIG. 6 is a diagram showing an example in which the language model is switched between one for picking and one for assorting according to the speaker's position.

The best mode for carrying out the present invention is described below. The scope of the present invention should be determined based on the claims, in light of the specification and well-known techniques in this field, and in accordance with the understanding of those skilled in the art.

FIGS. 1 to 6 show the speech recognition system of the embodiment. In FIG. 1, reference numeral 2 denotes the server of the speech recognition system, which is connected to a plurality of terminals 4 through communication units 18 and 38. The noise comparison unit 6 compares the noise received from a terminal 4, either as acoustic data or as a spectrum, with the data in the noise spectrum storage unit 8, which stores a noise spectrum for each of, for example, n positions. Once the noise comparison unit 6 has found the position whose stored noise spectrum is closest to the actual noise, it determines whether the difference between the actual noise and the noise spectrum for that position is at or above a threshold. When a position signal is sent from the position information acquisition device 22 of the terminal 4, there is no need to determine which position's noise spectrum is closest to the actual noise. When the difference between the actual noise and the stored noise spectrum is at or above the threshold, the noise removal unit 10 removes noise from the portion exceeding the threshold, or from part of it, before processing by the speech recognition unit 16, using spectral subtraction, a Kalman filter, a Wiener filter, or the like.
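The role of the noise comparison unit 6 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the spectrum representation (a list of per-band magnitudes), the mean-absolute-difference distance, the per-band test `observed - stored >= threshold`, and the names `spectral_distance` and `compare_noise` are all assumptions introduced here.

```python
from typing import Dict, List, Optional, Tuple

Spectrum = List[float]  # assumed: one magnitude per frequency band


def spectral_distance(a: Spectrum, b: Spectrum) -> float:
    """Mean absolute difference per band (an assumed distance measure)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def compare_noise(
    observed: Spectrum,
    stored: Dict[str, Spectrum],
    threshold: float,
    position: Optional[str] = None,
) -> Tuple[str, List[int]]:
    """Return the position and the bands where the actual noise exceeds
    the stored noise by at least `threshold`.

    If `position` is given (e.g. from a position sensor), the nearest-spectrum
    search is skipped, as described for device 22.
    """
    if position is None:
        # No position signal: pick the position with the closest stored spectrum.
        position = min(stored, key=lambda p: spectral_distance(observed, stored[p]))
    reference = stored[position]
    # Assumption: only bands where the actual noise is louder are flagged.
    excess_bands = [
        i for i, (o, s) in enumerate(zip(observed, reference)) if o - s >= threshold
    ]
    return position, excess_bands
```

An empty band list would mean the acoustic model for that position can be used without additional noise removal.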

The language model storage unit 12 stores, for example, n language models covering the range over which the speaker wearing the terminal 4 moves. Each language model functions as a dictionary that converts the phoneme sequence obtained by speech recognition into words. Switching the language model according to position allows accurate, high-speed speech recognition with a small dictionary whose limited vocabulary suits that position. In the embodiment, positions, language models, and noise spectra correspond to one another 1:1:1. However, a plurality of positions and a plurality of noise spectra may be associated with a single language model. The acoustic model storage unit 14 stores an acoustic model for each position and each speaker. For simplicity, however, the acoustic models may depend on position only rather than on both speaker and position. The acoustic models here are created from a small number of samples, and because pronunciation differs from speaker to speaker, they are preferably per-speaker models. Moreover, even the same speaker pronounces the same phoneme differently when the ambient noise differs, so the acoustic models are made position-dependent. When there is only one terminal 4, there is no need to provide an acoustic model for each speaker, and the server 2 may be omitted by giving the terminal 4 the functions of the server 2.
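The lookup implied by this paragraph might look like the sketch below: acoustic models keyed by speaker and position with a position-only fallback, and a position-to-language-model map in which several positions may share one model. The key layout, the fallback rule, and the function names are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

ModelKey = Tuple[Optional[str], str]  # (speaker id or None, position id)


def select_acoustic_model(
    models: Dict[ModelKey, object], position: str, speaker: Optional[str] = None
) -> object:
    """Prefer the per-speaker, per-position model; otherwise fall back to the
    position-only model stored under (None, position)."""
    if speaker is not None and (speaker, position) in models:
        return models[(speaker, position)]
    return models[(None, position)]


def select_language_model(position_to_model: Dict[str, str], position: str) -> str:
    """Several positions may map to the same language model name."""
    return position_to_model[position]
```

Keeping the position-only entry under a `None` speaker key lets the simplified, position-only variant and the per-speaker variant share one table.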

The speech recognition unit 16 recognizes the speech signal transmitted from the terminal 4. The acoustic model it uses corresponds to the speaker wearing the terminal 4 and includes the typical noise at the speaker's position. The language model it uses also corresponds to the speaker's position. When the stored noise spectrum and the actual noise spectrum differ by the threshold or more, speech recognition is performed after the noise removal unit 10 has removed the noise.

In the embodiment, the server 2 issues work instructions such as picking to the plurality of terminals 4 through the loudspeaker 20; the worker wearing a terminal 4 reports the work status and the like through the microphone 23, and the server 2 recognizes this speech. The task state storage unit 17 therefore stores the task state for each terminal 4. The task state is, for example, a picking instruction and the report made in response to it. Beyond this, the embodiment can also be used for speech recognition when a speaker talks while moving among a plurality of positions, for example during maintenance or other work. It can also be used for speech recognition for pilots, surgeons, dentists, and the like.

The configuration of the terminal 4 is as follows. The loudspeaker 20 gives the worker instructions such as picking, and the position information acquisition device 22, for example a GPS receiver, obtains the current position of the terminal 4. The microphone 23 picks up the worker's voice, and the feature extraction unit 24 converts it into feature vectors or the like. A feature vector is obtained by dividing the speech section into suitable time windows and extracting, for each window, features of the frequency spectrum of the speech data as a vector of, for example, 10 to 20 dimensions. The speech section detection unit 25 distinguishes speech sections from non-speech sections. For example, it counts the number of times the signal from the microphone 23 (bypassing the feature extraction unit) moves from outside a predetermined threshold to inside it and then crosses zero. If this count per unit time is at or above a predetermined value, the section is treated as a speech section; otherwise it is treated as a non-speech section. In other words, a section is a speech section when signals at or above the threshold strength cross zero frequently. The method of detecting speech sections is itself arbitrary. The noise storage unit 26 stores the ambient noise picked up by the microphone 23 outside speech sections (in non-speech sections). The noise may be stored as acoustic data or as its spectrum.
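One way to read the zero-crossing rule of the speech section detection unit 25 is sketched below. The exact counting rule, the parameter names, and the choice to count a direct swing from a strong positive to a strong negative sample as one crossing are assumptions; the text itself notes that the detection method is arbitrary.

```python
from typing import Sequence


def count_strong_zero_crossings(samples: Sequence[float], amplitude_threshold: float) -> int:
    """Count zero crossings that follow an excursion beyond +/- amplitude_threshold.

    `amplitude_threshold` is assumed to be positive.
    """
    count = 0
    armed_sign = 0  # sign of the last strong excursion; 0 means none is pending
    for s in samples:
        if abs(s) >= amplitude_threshold:
            sign = 1 if s > 0 else -1
            if armed_sign != 0 and sign != armed_sign:
                count += 1  # swung directly from one strong polarity to the other
            armed_sign = sign
        elif armed_sign != 0 and s * armed_sign <= 0:
            count += 1  # came back inside the threshold and reached or crossed zero
            armed_sign = 0
    return count


def is_speech_section(
    samples: Sequence[float],
    sample_rate: float,
    amplitude_threshold: float,
    min_crossings_per_second: float,
) -> bool:
    """Classify a window as speech when strong zero crossings are frequent enough."""
    if not samples:
        return False
    duration = len(samples) / sample_rate
    rate = count_strong_zero_crossings(samples, amplitude_threshold) / duration
    return rate >= min_crossings_per_second
```

Low-level background noise that never leaves the threshold band contributes no counted crossings, so such a window is classified as a non-speech section and can be used to record noise.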

In the embodiment, speech recognition is performed on the server 2, and the units from the noise removal unit 10 through the speech recognition unit 16 are provided in the server 2. These processes may instead be performed in the terminal 4. In that case the server 2 compares the actual noise at the terminal 4 with the noise spectrum stored in the storage unit 8 and, if the difference is at or above the threshold, requests the terminal 4 to remove the noise in the portion exceeding the threshold. The server 2 also instructs the terminal which position's acoustic model to use. GPS equipment is expensive, and if the terminal 4 does not include the device 22, the server 2 compares the noise from the terminal 4 with all of the noise spectra stored in the storage unit 8 and determines the position of the terminal 4 from the similarity of the noise.

When speech recognition is performed in the terminal 4, the terminal 4 is provided with a noise removal unit 28, a language model storage unit 30, an acoustic model storage unit 32, and a speech recognition unit 34. The language models are per-position language models, used to change the vocabulary according to the work position. The acoustic model storage unit 32 stores acoustic models of the worker wearing the terminal 4, recorded at different positions and including the noise there. The task state storage unit 36 stores the state of tasks such as picking for the terminal 4, the communication unit 38 communicates with the server 2, and a power source 40 such as a battery supplies power to the terminal 4.

The reason for creating and storing an acoustic model for each position is as follows. Even the same speaker's voice changes when the ambient noise differs, as is commonly observed when comparing a conversation in a quiet place with one in a very noisy place. In addition, the stored acoustic model contains not only the speech signal but also the ambient noise. Using an acoustic model for each location, or more precisely for each combination of location and person, therefore makes it possible to store an acoustic model that includes the noise at that location and is close to the actual speech.

FIG. 2 shows the speech recognition algorithm. The terminal records noise outside speech sections and obtains and stores the noise data (acoustic data) or its spectrum. If the terminal has GPS or the like, it also keeps the current position continuously up to date (step 1). When the terminal communicates with the server 2, for example because speech has started, it transmits the current position and the noise data to the server. The noise data may be transmitted as a noise spectrum. In addition to the noise data, the speech data is converted, for example into feature vectors by the feature extraction unit 24, and transmitted to the server 2 (step 2). When speech recognition is performed in the terminal 4, the speech data itself need not be transmitted.

On the server side, if the terminal has no GPS or the like, the current position is treated as unknown, and in step 3 the received noise is compared with the stored noise spectra. The terminal is assumed to be at the position whose noise spectrum is closest. In step 4, the stored noise spectrum for the current position is compared with the actual noise spectrum; if the difference is below the threshold, the acoustic model for that position is used as is. If the difference is at or above the threshold, noise removal is performed so as to eliminate the deviation exceeding the threshold. Spectral subtraction or the like is used for the noise removal.

With this approach, the acoustic model used is based on that speaker's voice at the position where that noise is present. The ambient noise is therefore already incorporated into the acoustic model, and when the actual noise deviates from the stored noise spectrum by the threshold or more, noise removal is performed so as to eliminate that difference. As a result, only a small amount of noise removal is needed, and the speech signal is not distorted as it would be by large-scale noise removal. The server selects the language model according to the position (step 5) and performs speech recognition in step 6. For work such as picking, the progress of the work is tracked through speech recognition, and the next task is instructed taking the worker's current position into account.

FIGS. 3 and 4 show an example of spectral subtraction. In both figures the dark line is the stored noise spectrum and the light line is the actual noise spectrum. In FIG. 3 the two nearly coincide. In FIG. 4, the stored noise spectrum 42 and the actual noise spectrum 44 differ by at least a threshold (not shown) in some frequency band; the portion exceeding the threshold is, for example, the region 46 between the spectra 42 and 44. Noise corresponding to the region 46, or to part of it, is then removed: for example, spectral subtraction removes the portion exceeding the threshold (region 46), or 2/3 of it, from the spectrum of the acoustic signal in the speech section. The noise corresponding to the spectrum 42 is already incorporated into the acoustic model and does not hinder speech recognition. Because only the small amount of noise corresponding to the region 46 is removed, the speech signal is distorted very little.
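The partial subtraction described for region 46 could be sketched as follows. Treating the excess as the per-band difference between the actual and stored noise magnitudes, flooring the result at zero, and the names `subtract_excess_noise` and `fraction` are assumptions for illustration; the patent does not specify these details.

```python
from typing import List, Sequence


def subtract_excess_noise(
    speech: Sequence[float],
    observed_noise: Sequence[float],
    stored_noise: Sequence[float],
    threshold: float,
    fraction: float = 1.0,
) -> List[float]:
    """Subtract only the noise that the acoustic model does not already include.

    For each band, the excess is observed_noise - stored_noise. Where the excess
    is at or above `threshold`, `fraction` of it (e.g. 2/3) is subtracted from
    the speech-section magnitude; other bands are left untouched.
    """
    cleaned = []
    for s, o, n in zip(speech, observed_noise, stored_noise):
        excess = o - n
        if excess >= threshold:
            s = max(s - fraction * excess, 0.0)  # floor at zero magnitude
        cleaned.append(s)
    return cleaned
```

Because the stored noise is never subtracted, the signal stays matched to the noise-inclusive acoustic model for that position, and only the unexpected component is removed.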

FIGS. 5 and 6 show applications to picking. In FIG. 5, reference numerals 50 and 52 denote two workers. For example, worker 50 follows the solid-line route, moving from the book shelf 53 to the food shelf 54 to pick items, while worker 52 follows the dotted-line route, picking first at the food shelf 54 and then at the book shelf 53. The positions of workers 50 and 52 are obtained from GPS, from comparison of noise spectra, or the like, and the language model is switched accordingly. Thus, near the food shelf 54 an ambiguous word ending should be the counter "ko" (個, pieces) rather than "satsu" (冊, volumes), and near the book shelf 53 an ambiguous ending is recognized as "satsu" rather than "ko". Narrowing the vocabulary enables fast and accurate speech recognition.
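The effect of position-dependent vocabularies on an ambiguous word can be illustrated with a toy sketch. The zone names, the vocabulary contents, and the function `resolve_word` are hypothetical; a real recognizer would apply the language model inside decoding rather than as a post-filter.

```python
from typing import Dict, Optional, Sequence, Set

# Hypothetical per-position vocabularies (romanized Japanese counters).
LANGUAGE_MODELS: Dict[str, Set[str]] = {
    "book_shelf": {"satsu", "ichi", "ni", "san"},  # "satsu": counter for books
    "food_shelf": {"ko", "ichi", "ni", "san"},     # "ko": counter for pieces
}


def resolve_word(candidates: Sequence[str], position: str) -> Optional[str]:
    """Return the first acoustic candidate that the position's vocabulary allows.

    `candidates` is assumed to be ordered from most to least acoustically likely.
    """
    vocabulary = LANGUAGE_MODELS[position]
    for word in candidates:
        if word in vocabulary:
            return word
    return None
```

With the same acoustic candidates, the worker's position alone decides which counter is recognized.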

In FIG. 6, the worker moves between a picking zone 60 and an assort zone 62 (a zone where items are assorted into sets). In the picking zone 60, the worker's speech signal is interpreted with a language model containing vocabulary for picking, and in the assort zone 62 it is interpreted with a language model containing vocabulary for assorting.

The embodiment provides the following effects.
(1) The speaker's position is determined by comparing noise spectra or by a position sensor such as GPS, and an acoustic model that includes the noise at each position is used. As a result, when the change from the stored noise spectrum is small, the influence of noise can be made extremely small. Even the same speaker's voice changes with the surrounding environment, and this too is already incorporated into the acoustic model for each location.
(2) Instead of learning the noise as the speaker moves, the speaker's position is identified and the acoustic model for that position is used. The system can therefore follow the speaker's movement quickly.
(3) In particular, when the speaker's position is determined by a position sensor, switching of the acoustic model does not lag behind the speaker's movement. The speaker's position can also be determined by comparing the stored noise spectra with the actual noise, but because the noise must first be acquired and compared, the position determination may lag behind the speaker's movement.
(4) Noise removal is performed when the noise spectrum stored for the location and the current noise spectrum differ by the threshold or more. Because noise removal needs to be applied only to the small portion exceeding the threshold, or to part of it, the speech signal is distorted very little.
(5) Switching the language model according to the speaker's position enables more accurate speech recognition.
(6) When the speaker does not move but the noise spectrum changes among various patterns, the noise spectrum for each pattern is stored and the spectrum closest to the current noise is determined. Switching among acoustic models prepared for each noise spectrum according to the result allows efficient speech recognition.

2 Server
4 Terminal
6 Noise comparison unit
8 Noise spectrum storage unit
10, 28 Noise removal unit
12, 30 Language model storage unit
14, 32 Acoustic model storage unit
16, 34 Speech recognition unit
17 Task state storage unit
18, 38 Communication unit
20 Loudspeaker
22 Position information acquisition device
23 Microphone
24 Feature extraction unit
25 Speech section detection unit
26 Noise storage unit
36 Task state storage unit
40 Power source
50, 52 Worker
53 Book shelf
54 Food shelf
60 Picking zone
62 Assort zone

Claims

A speech recognition system in which noise in an acoustic signal from a microphone is processed by a noise removal unit and speech is recognized by a speech recognition unit that stores acoustic models, the system comprising:
a noise spectrum storage unit that stores a noise spectrum for each position;
an acoustic model storage unit that stores, for each position, an acoustic model that includes the noise at that position;
position determination means; and
a noise comparison unit that compares the spectrum of the acoustic signal from the microphone in a non-speech section with the noise spectrum stored in the noise spectrum storage unit for the position determined by the determination means,
wherein, when the noise comparison unit detects a difference at or above a threshold, the noise removal unit removes noise from the acoustic signal from the microphone in the speech section, for at least part of the portion exceeding the threshold, and
the speech recognition unit performs speech recognition using the acoustic model for the position determined by the determination means.

The speech recognition system according to claim 1, comprising a server and a plurality of mobile terminals, wherein the microphone and the determination means are provided in each terminal, the noise spectrum storage unit, the acoustic model storage unit, and the noise comparison unit are provided in the server,
and the noise removal unit and the speech recognition unit are provided in either the server or the terminal.

The speech recognition system according to claim 1 or 2, wherein the determination means is a position sensor.

The speech recognition system according to any one of claims 1 to 3, wherein a plurality of language models, each consisting of a dictionary for speech recognition, are stored, and the language model is switched according to the position determined by the determination means.

A speech recognition method in which noise in an acoustic signal from a microphone is processed by a noise removal unit and speech is recognized by a speech recognition unit that stores acoustic models, the method comprising:
storing a noise spectrum for each position in storage means;
storing, in storage means, an acoustic model for each position that includes the noise at that position;
determining the position with position determination means;
comparing, with comparison means, the spectrum of the acoustic signal from the microphone in a non-speech section with the noise spectrum stored for the position determined by the determination means and, when a difference at or above a threshold is detected, removing noise with the noise removal unit from the acoustic signal from the microphone in the speech section, for at least part of the portion exceeding the threshold; and
performing speech recognition with the speech recognition unit using the acoustic model for the position determined by the determination means.