JP6594278B2

JP6594278B2 - Acoustic model learning device, speech recognition device, method and program thereof

Info

Publication number: JP6594278B2
Application number: JP2016182579A
Authority: JP
Inventors: 清彰松井; 学岡本; 隆朗福冨
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-09-20
Filing date: 2016-09-20
Publication date: 2019-10-23
Anticipated expiration: 2036-09-20
Also published as: JP2018049041A

Description

The present invention relates to a technique for learning an acoustic model, a technique for recognizing speech, and a technique for processing noise.

In speech recording for learning an acoustic model, or for speech recognition, an intelligent microphone (see, for example, Non-Patent Document 1) is used to collect speech focused on the speaker and speech focused on the background noise.

In such cases, to prevent noise from degrading the accuracy of acoustic model learning and speech recognition, noise suppression was sometimes applied to the recorded signal, and the acoustic model was learned from the signal with the noise removed. With this method, however, the acoustic model is learned only from noise-free speech, so speech recognition performance could degrade in a real environment when the noise cannot be removed at the preceding stage.

Noise includes stationary noise, which varies little over time, and non-stationary noise, whose properties differ depending on when it is observed. In particular, when non-stationary noise is present, the performance of a speech recognition system could degrade.

When speech recognition installed on a smartphone or the like is used, the most common real-world condition is a non-stationary noise environment, and a system that operates robustly in such an environment has been demanded.

[Non-Patent Document 1] NTT Advanced Technology Corporation, "High-Noise Sound Collection Software: Intelligent Microphone Library", [online], [retrieved September 12, 2016], Internet <URL: http://www.ntt-at.co.jp/product/i-mic/>

An object of the present invention is to provide an acoustic model learning device, a speech recognition device, and methods and programs for them that are more robust to non-stationary noise than conventional ones.

An acoustic model learning device according to one aspect of the present invention includes: a sound collection unit that generates a speech signal by collecting a target sound and generates a noise signal by collecting background noise; a feature amount extraction unit that generates a speech feature amount based on the speech signal and generates a noise feature amount based on the noise signal; a noise information processing device that determines, using noise feature amounts corresponding to past stationary noise or non-stationary noise, whether the noise feature amount corresponds to stationary noise or to non-stationary noise, and that, when the noise feature amount is determined to correspond to non-stationary noise, performs processing to bring the noise feature amount closer to a noise feature amount corresponding to stationary noise; and an acoustic model learning unit that generates an acoustic model by learning an acoustic model using the processed noise feature amount and the speech feature amount.

A speech recognition device according to one aspect of the present invention includes: a sound collection unit that generates a speech signal by collecting a target sound and generates a noise signal by collecting background noise; a feature amount extraction unit that generates a speech feature amount based on the speech signal and generates a noise feature amount based on the noise signal; a noise information processing device that determines, using noise feature amounts corresponding to past stationary noise or non-stationary noise, whether the noise feature amount corresponds to stationary noise or to non-stationary noise, and that, when the noise feature amount is determined to correspond to non-stationary noise, performs processing to bring the noise feature amount closer to a noise feature amount corresponding to stationary noise; and a speech recognition unit that performs speech recognition using the processed noise feature amount, the speech feature amount, and the acoustic model generated by the acoustic model learning device of claim 1.

It is possible to provide an acoustic model learning device, a speech recognition device, and methods and programs for them that are more robust to non-stationary noise than conventional ones.

FIG. 1 is a block diagram illustrating an example of the acoustic model learning device.
FIG. 2 is a flowchart illustrating an example of the acoustic model learning method.
FIG. 3 is a block diagram illustrating an example of the speech recognition device.
FIG. 4 is a flowchart illustrating an example of the speech recognition method.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings.

[Acoustic model learning device and method]
A block diagram of the acoustic model learning device is shown in FIG. 1.

As shown in FIG. 1, the acoustic model learning device includes, for example, a sound collection unit 101, a feature amount extraction unit 102, a noise information processing device 103, an acoustic model learning unit 104, and an acoustic model storage unit 106. The acoustic model learning method is realized, for example, by the units of the acoustic model learning device performing the processing of steps S1 to S4 shown in FIG. 2 and described below.

The processing of each unit of the acoustic model learning device is described below.

<Sound collection unit 101>
The sound collection unit 101 collects the target sound and the background noise. It generates a speech signal by collecting the target sound, and generates a noise signal by collecting the background noise (step S1).

The sound collection unit 101 is, for example, two intelligent microphones. Hereinafter, an "intelligent microphone" is also referred to as an "intelli-mic". For details of the intelligent microphone, see, for example, Non-Patent Document 1.

One of the intelligent microphones is focused on the speaker to collect the target sound, and the other is focused on the ambient noise to collect the surrounding background noise, which is not the target sound. Focusing on a desired location can be achieved by changing the internal parameters of the intelligent microphone.

Regarding the focus positions for the target sound and the background noise: when the target sound and the noise do not move, their positions are treated as known and the focus is fixed; when the target sound moves, the sound source position is first estimated by a function of the intelligent microphone itself and the focus is then set. As a result, for example, a 3-channel speech signal and a 3-channel noise signal are obtained.

The obtained signals are output to the feature amount extraction unit 102.

<Feature amount extraction unit 102>
The feature amount extraction unit 102 generates a speech feature amount based on the speech signal, and generates a noise feature amount based on the noise signal (step S2).

For example, the feature amount extraction unit 102 first applies the preprocessing of the intelligent microphone to each of the speech signal and the noise signal. This preprocessing extracts only the signal of the focused region from the 3-channel signal and outputs 1-channel processed audio.

The feature amount extraction unit 102 then performs feature extraction on each of the preprocessed signals to obtain filter bank feature amounts.

Next, the feature amount extraction unit 102 takes the logarithm of the filter bank feature amounts and appends feature amounts related to temporal change. Existing techniques can be used to calculate the filter bank feature amounts and for the subsequent processing. In this way, the feature amount extraction unit 102 may use logarithmic values of the filter bank feature amounts as the speech feature amount and the noise feature amount.
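
The logarithm and temporal-change step described above can be sketched as follows. This is only an illustration under assumptions: the function name `log_fbank_with_delta`, the floor value `eps`, and the use of a first-order difference (`np.gradient`) as the temporal-change feature are not specified in the text, which leaves these details to existing techniques.

```python
import numpy as np

def log_fbank_with_delta(fbank, eps=1e-10):
    """Return log filter bank features with appended temporal-change features.

    fbank: array of shape (frames, bands) holding non-negative filter bank
           energies (assumed to be computed beforehand by an existing method).
    eps:   small floor to avoid log(0); an assumption of this sketch.
    """
    fbank = np.asarray(fbank, dtype=float)
    log_fbank = np.log(fbank + eps)           # logarithm of the filter bank features
    delta = np.gradient(log_fbank, axis=0)    # first-order change along the time axis
    return np.concatenate([log_fbank, delta], axis=1)

# Toy input: 5 frames x 40 bands of positive energies.
rng = np.random.default_rng(0)
energies = rng.random((5, 40)) + 0.1
features = log_fbank_with_delta(energies)
```

The same function would be applied separately to the speech signal and to the noise signal, giving one feature vector per frame for each.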

The generated speech feature amount is output to the acoustic model learning unit 104. The generated noise feature amount is output to the noise information processing device 103. Because the speech feature amount and the noise feature amount can also be expressed as vectors, they are also written as the speech feature vector and the noise feature vector, respectively.

<Noise information processing device 103>
The noise information processing device 103 determines whether the noise feature amount obtained from the feature amount extraction unit 102 represents stationary noise or non-stationary noise, and performs processing suited to each. Specifically, the noise information processing device 103 determines, using noise feature amounts corresponding to past stationary noise or non-stationary noise, whether the noise feature amount corresponds to stationary noise or to non-stationary noise, and, when the noise feature amount is determined to correspond to non-stationary noise, performs processing to bring it closer to a noise feature amount corresponding to stationary noise (step S3).

Noise feature amounts can be divided into two types: those of stationary noise, which, like the sound of an air conditioner, does not change much over time, and those of non-stationary noise, whose nature changes greatly from moment to moment, as on a city street. Because these two kinds of noise have entirely different characteristics, it is preferable to handle each differently.

Whether the noise is stationary or non-stationary can be determined from the similarity between the average noise vector held by the system and the input noise feature vector.

The average noise vector is, for example, a vector obtained from the average of the stationary noise up to the immediately preceding time. There are several ways to obtain its initial value, such as computing it from stationary noise observed in advance, or deriving it from the non-speech segment at the beginning of the input speech.

An existing technique can be used to determine the similarity between vectors. For example, when cosine similarity is used, the inner product of the average noise vector n_ave and the input vector n_in_t is taken as in equation (1) below to obtain the angle θ_t between the two vectors, and this is used as the similarity for the determination.

$$\cos\theta_t = \frac{n_{ave} \cdot n_{in\_t}}{\lVert n_{ave} \rVert \, \lVert n_{in\_t} \rVert} \tag{1}$$

The cosine similarity defined by, for example, equation (1) above takes a value closer to 1 the closer the two vectors are.

If the cosine similarity defined by, for example, equation (1) above is greater than a threshold, the noise is determined to be stationary; if it is equal to or less than the threshold, the noise is determined to be non-stationary. The threshold can be set to any value between -1 and 1.
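
The threshold decision described above can be sketched as follows. The function names, the default threshold of 0.9, and the handling of zero vectors are assumptions made for illustration; the text only requires that the threshold lie between -1 and 1.

```python
import numpy as np

def cosine_similarity(n_ave, n_in_t):
    """Cosine of the angle between the average noise vector and the input vector."""
    n_ave = np.asarray(n_ave, dtype=float)
    n_in_t = np.asarray(n_in_t, dtype=float)
    denom = np.linalg.norm(n_ave) * np.linalg.norm(n_in_t)
    if denom == 0.0:
        # Assumption of this sketch: a zero vector is treated as dissimilar.
        return 0.0
    return float(np.dot(n_ave, n_in_t) / denom)

def is_stationary(n_ave, n_in_t, threshold=0.9):
    """Stationary if the similarity exceeds the threshold, non-stationary otherwise."""
    return cosine_similarity(n_ave, n_in_t) > threshold
```

A threshold close to 1 labels more frames as non-stationary; a lower threshold is more permissive.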

If, as a result of computing the similarity, the noise is determined to be stationary, the average of the noise feature amount and the average noise vector is taken and used as the new average noise vector at time t.
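
A minimal sketch of how the average noise vector might be held and updated is shown below. The class name `AverageNoiseTracker` and the choice of initializing from a block of frames assumed to contain only stationary noise (for example, a leading non-speech segment) are illustrative assumptions. The update itself follows the text literally: the new vector is the plain average of the current average noise vector and the incoming noise feature vector, which weights recent frames more heavily than a true running mean would.

```python
import numpy as np

class AverageNoiseTracker:
    """Holds the average noise vector and updates it on stationary frames."""

    def __init__(self, initial_frames):
        # Initial value: mean of frames assumed to be stationary noise,
        # given as an array of shape (frames, dims).
        self.n_ave = np.mean(np.asarray(initial_frames, dtype=float), axis=0)

    def update(self, n_in_t):
        """Call only for a frame that was judged to be stationary noise."""
        n_in_t = np.asarray(n_in_t, dtype=float)
        self.n_ave = 0.5 * (self.n_ave + n_in_t)  # average of the two vectors
        return self.n_ave

tracker = AverageNoiseTracker([[1.0, 1.0], [3.0, 3.0]])
updated = tracker.update([4.0, 0.0])
```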

If the noise is determined to be non-stationary, leaving it as is would adversely affect recognition, so processing is performed to bring the properties of the noise feature amount closer to those of stationary noise. Various methods can be used for this conversion. For example, if cosine similarity has been calculated, the angle between the two vectors is already known, so the noise feature vector can be rotated using the cosine of that angle to bring its properties closer to those of the average noise vector. For the rotation of a multidimensional vector, the technique of Reference 1 can be used, for example.
[Reference 1] A. Antonio; P. A. Ricardo, "General n-Dimensional Rotations," WSCG 2004, pp. 1-8, 2004
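
One possible way to realize such a rotation is sketched below. It is not the procedure of Reference 1: it is a simpler swapped-in technique that rotates the noise feature vector within the plane spanned by it and the average noise vector, preserving its length. The function name `rotate_toward` and the parameter `alpha` (the fraction of the angle to remove, where 1.0 aligns the vector fully with the average noise vector) are assumptions of this sketch.

```python
import numpy as np

def rotate_toward(n_in_t, n_ave, alpha=1.0, eps=1e-12):
    """Rotate n_in_t toward n_ave in their common plane, keeping its norm.

    alpha: fraction of the angle between the vectors to remove (0 = no change,
           1 = same direction as n_ave). Hypothetical parameter.
    """
    x = np.asarray(n_in_t, dtype=float)
    a = np.asarray(n_ave, dtype=float)
    x_norm, a_norm = np.linalg.norm(x), np.linalg.norm(a)
    if x_norm < eps or a_norm < eps:
        return x.copy()                      # rotation undefined for zero vectors
    u = a / a_norm                           # unit vector along the average noise
    w = x - np.dot(x, u) * u                 # component of x orthogonal to u
    w_norm = np.linalg.norm(w)
    if w_norm < eps:
        return x.copy()                      # (anti-)parallel: plane is undefined
    v = w / w_norm                           # second basis vector of the plane
    theta = np.arctan2(w_norm, np.dot(x, u)) # angle between x and n_ave
    new_theta = (1.0 - alpha) * theta        # remaining angle after rotation
    return x_norm * (np.cos(new_theta) * u + np.sin(new_theta) * v)
```

With `alpha` below 1 the vector keeps part of its original direction, which may be preferable if fully replacing the direction discards too much information about the actual noise.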

When the noise feature amount is determined to correspond to non-stationary noise, the noise feature amount that has undergone the processing to bring it closer to a noise feature amount corresponding to stationary noise is output to the acoustic model learning unit 104. When the noise feature amount is determined to correspond to stationary noise, that noise feature amount is output to the acoustic model learning unit 104.

<Acoustic model learning unit 104>
The acoustic model learning unit 104 generates an acoustic model by learning an acoustic model using the speech feature amount output by the feature amount extraction unit 102 and the noise feature amount output by the noise information processing device 103 (step S4).
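
The text does not state how the two kinds of feature amounts are combined for learning. One common arrangement, shown here purely as an assumption, is to concatenate the speech feature vector and the processed noise feature vector of each frame into a single input vector for the model. The function name `build_training_inputs` is hypothetical.

```python
import numpy as np

def build_training_inputs(speech_feats, noise_feats):
    """Concatenate per-frame speech and noise features (assumed combination).

    speech_feats: array of shape (frames, speech_dims)
    noise_feats:  array of shape (frames, noise_dims), after step S3
    """
    speech_feats = np.asarray(speech_feats, dtype=float)
    noise_feats = np.asarray(noise_feats, dtype=float)
    if speech_feats.shape[0] != noise_feats.shape[0]:
        raise ValueError("speech and noise features must have the same number of frames")
    return np.concatenate([speech_feats, noise_feats], axis=1)

inputs = build_training_inputs(np.zeros((3, 2)), np.ones((3, 3)))
```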

As the acoustic model learning method, an existing method can be used, such as a learning method for a DNN acoustic model. For example, the method described in Reference 2 can be used.
[Reference 2] Hinton, et al., "A fast learning algorithm for deep belief nets," Neural Computation, Vol. 18, No. 7, pp. 1527-1554, 2006
The generated acoustic model is stored in the acoustic model storage unit 106.

[Speech recognition device and method]
A block diagram of the speech recognition device is shown in FIG. 3.

As shown in FIG. 3, the speech recognition device includes, for example, a sound collection unit 101, a feature amount extraction unit 102, a noise information processing device 103, a speech recognition unit 105, and an acoustic model storage unit 106. The speech recognition method is realized, for example, by the units of the speech recognition device performing the processing of steps S1 to S5 shown in FIG. 4 and described below.

The processing of each unit of the speech recognition device is described below. The description focuses on the speech recognition unit 105, which is the part that differs from the acoustic model learning device. Redundant description of the parts shared with the acoustic model learning device is omitted.

<Sound collection unit 101>
The sound collection unit 101 is the same as the sound collection unit 101 of the acoustic model learning device.

<Feature amount extraction unit 102>
In the same manner as the feature amount extraction unit 102 of the acoustic model learning device, the feature amount extraction unit 102 generates a speech feature amount based on the speech signal and generates a noise feature amount based on the noise signal (step S2).

The generated speech feature amount is output to the speech recognition unit 105.

<Noise information processing device 103>
In the same manner as the noise information processing device 103 of the acoustic model learning device, the noise information processing device 103 determines, using noise feature amounts corresponding to past stationary noise or non-stationary noise, whether the noise feature amount corresponds to stationary noise or to non-stationary noise, and, when the noise feature amount is determined to correspond to non-stationary noise, performs processing to bring it closer to a noise feature amount corresponding to stationary noise (step S3).

<Acoustic model storage unit 106>
The acoustic model storage unit 106 stores the acoustic model generated by the acoustic model learning device.

<Speech recognition unit 105>
The speech recognition unit 105 performs speech recognition using the speech feature amount output by the feature amount extraction unit 102, the noise feature amount output by the noise information processing device 103, and the acoustic model read from the acoustic model storage unit 106 (step S5). An existing technique can be used for the speech recognition itself.

[Program and recording medium]
When the processing in the acoustic model learning device, the speech recognition device, or the noise information processing device is implemented by a computer, the processing content of the functions that these devices should have is described by a program. The processing of each device is then realized on the computer by executing this program on it.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

Each processing means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

[Modifications]
As indicated by the broken line in FIG. 3, the speech recognition device may further include an acoustic model learning unit 104. In this case, the acoustic model learning unit 104 generates an acoustic model by the same processing as the acoustic model learning unit 104 of the acoustic model learning device, and outputs the generated acoustic model to the speech recognition unit 105. The speech recognition unit 105 then performs speech recognition using the acoustic model generated by the acoustic model learning unit 104 instead of the acoustic model read from the acoustic model storage unit 106.

It goes without saying that other modifications can be made as appropriate without departing from the spirit of the present invention.

Claims

An acoustic model learning device comprising:
a sound collection unit that generates a speech signal by collecting a target sound and generates a noise signal by collecting background noise;
a feature amount extraction unit that generates a speech feature amount based on the speech signal and generates a noise feature amount based on the noise signal;
a noise information processing device that determines, using noise feature amounts corresponding to past stationary noise or non-stationary noise, whether the noise feature amount corresponds to stationary noise or to non-stationary noise, and that, when the noise feature amount is determined to correspond to non-stationary noise, performs processing to bring the noise feature amount closer to a noise feature amount corresponding to stationary noise; and
an acoustic model learning unit that generates an acoustic model by learning an acoustic model using the processed noise feature amount and the speech feature amount.

The acoustic model learning device according to claim 1,
wherein the speech feature amount and the noise feature amount are logarithmic values of filter bank feature amounts.

A speech recognition device comprising:
a sound collection unit that generates a speech signal by collecting a target sound and generates a noise signal by collecting background noise;
a feature amount extraction unit that generates a speech feature amount based on the speech signal and generates a noise feature amount based on the noise signal;
a noise information processing device that determines, using noise feature amounts corresponding to past stationary noise or non-stationary noise, whether the noise feature amount corresponds to stationary noise or to non-stationary noise, and that, when the noise feature amount is determined to correspond to non-stationary noise, performs processing to bring the noise feature amount closer to a noise feature amount corresponding to stationary noise; and
a speech recognition unit that performs speech recognition using the processed noise feature amount, the speech feature amount, and the acoustic model generated by the acoustic model learning device according to claim 1.

The speech recognition device according to claim 3,
wherein the speech feature amount and the noise feature amount are logarithmic values of filter bank feature amounts.

An acoustic model learning method comprising:
a sound collection step in which a sound collection unit generates a speech signal by collecting a target sound and generates a noise signal by collecting background noise;
a feature amount extraction step in which a feature amount extraction unit generates a speech feature amount based on the speech signal and generates a noise feature amount based on the noise signal;
a noise information processing step in which a noise information processing device determines, using noise feature amounts corresponding to past stationary noise or non-stationary noise, whether the noise feature amount corresponds to stationary noise or to non-stationary noise, and, when the noise feature amount is determined to correspond to non-stationary noise, performs processing to bring the noise feature amount closer to a noise feature amount corresponding to stationary noise; and
an acoustic model learning step in which an acoustic model learning unit generates an acoustic model by learning an acoustic model using the processed noise feature amount and the speech feature amount.

A program for causing a computer to function as each unit of the acoustic model learning device according to claim 1 or 2 or of the speech recognition device according to claim 3 or 4.