JP2009139894A

JP2009139894A - Noise suppressing device, speech recognition device, noise suppressing method and program

Info

Publication number: JP2009139894A
Application number: JP2007319239A
Authority: JP
Inventors: Takatoshi Jitsuhiro (實廣 貴敏); Tomoji Toriyama (鳥山 朋二); Kiyoshi Kogure (小暮 潔)
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2007-12-11
Filing date: 2007-12-11
Publication date: 2009-06-25

Abstract

PROBLEM TO BE SOLVED: To provide a noise suppressing device capable of removing noise from speech on which a plurality of kinds of fluctuating noise and unknown noise are superimposed. SOLUTION: The noise suppressing device includes: a speech noise acoustic model storage section 14 for storing a plurality of speech noise acoustic models, each being an acoustic model of one kind of speech or noise contained in training data; a synthesized acoustic model storage section 15 for storing a synthesized acoustic model, which is an acoustic model in which two or more kinds of speech and noise are combined; a dictionary information storage section 17 for storing dictionary information that associates a label identifying the kind of speech and noise with an acoustic model; a receiving section 19 for receiving noise-superimposed speech data; a label recognition section 20 for recognizing a label corresponding to the noise-superimposed speech data by using the acoustic models and the dictionary information; and a particle filter noise suppressing section 21 for generating clean speech data by sampling particles from the acoustic model corresponding to the recognized label and updating the sampled particles. COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a noise suppression device and the like that suppress noise.

Conventionally, many noise-robust speech recognition techniques have been proposed for recognizing noise-superimposed speech. For stationary noise, there are Spectral Subtraction (see, for example, Non-Patent Document 1) and Parallel Model Combination (PMC) (see, for example, Non-Patent Document 2). A technique based on Minimum Mean-Squared Error (MMSE) estimation using a Gaussian Mixture Model (GMM) (see, for example, Non-Patent Document 3) processes the input frame-synchronously and can therefore follow fluctuations in the input speech. More recently, research on non-stationary noise has become active (see, for example, Non-Patent Documents 4 and 5). These techniques generally consider only one type of noise and assume that the noise can be modeled with a single model. In real environments, however, sudden noise is common in addition to stationary noise. A further problem is how to estimate the actual noise from the input.

The inventors of the present application have previously proposed a technique (MM-NS: Multi-Model Noise Suppression) that uses multiple-model composition, obtains a maximum-likelihood label sequence by multi-pass search, and suppresses noise with multiple noise models (see, for example, Non-Patent Documents 6 and 7). Using noise models obtained from training data and synthesized models obtained by combining them, a noise label sequence is obtained by noise label recognition based on multi-pass search. Based on that sequence, noise is suppressed by a technique that extends the GMM-based MMSE noise suppression method (see, for example, Non-Patent Document 3) to multiple synthesized models. Because the noise-superimposed speech model is obtained by model composition rather than by training, distributions that combine speech and noise at SNRs absent from the training data can also be created. Furthermore, in the GMM-based MMSE noise suppression method, each distribution is weighted by its posterior probability given the input speech, and the noise is estimated from the weighted distributions. Noise suppression is therefore performed with a noise model that reflects the input noise-superimposed speech. However, since performance improves as the number of mixture components of each noise model increases, the finer the granularity of each noise model, the better.
Non-Patent Document 1: S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction”, IEEE Trans. Acoust. Speech Signal Process., vol. 27, no. 2, p. 113-120, 1979
Non-Patent Document 2: M. F. J. Gales, “Model-based techniques for noise robust speech recognition”, PhD thesis, University of Cambridge, 1995
Non-Patent Document 3: J. C. Segura, A. de la Torre, M. C. Benitez, A. M. Peinado, “Model-based compensation of the additive noise for continuous speech recognition. Experiments using the AURORA II database and tasks”, Proc. of EUROSPEECH 2001, vol. 1, p. 221-224, 2001
Non-Patent Document 4: K. Yao, K. K. Paliwal, S. Nakamura, “Noise adaptive speech recognition based on sequential noise parameter estimation”, Speech Communication, vol. 42, no. 1, p. 5-23, 2004
Non-Patent Document 5: M. Fujimoto, S. Nakamura, “A non-stationary noise suppression method based on particle filtering and Polyak averaging”, IEICE Trans. Inf. & Syst., vol. E89-D, no. 3, 2006
Non-Patent Document 6: Jitsuhiro (實廣), Toriyama, Kogure, “Search-based noise suppression method using multiple noise composition models” (in Japanese), IEICE Technical Report, vol. SP2007-16, p. 49-54, 2007
Non-Patent Document 7: Jitsuhiro (實廣), Toriyama, Kogure, “Noise suppression method based on multi-pass search using multiple noise composition models” (in Japanese), Proceedings of the Acoustical Society of Japan meeting, p. 151-154, 2007

However, the technique (MM-NS) that uses multiple-model composition, obtains a maximum-likelihood label sequence by multi-pass search, and suppresses noise with multiple noise models is based on noise models estimated from training data. For unknown noise that does not appear in the training data, a nearby distribution may happen to be used, but such noise is difficult to handle rigorously. In other words, dealing more actively with fluctuations in the input noise, and with unknown noise, has remained a problem.

The present invention has been made to solve this problem, and its object is to provide a noise suppression device and the like that can appropriately remove noise from speech on which a plurality of types of noise are superimposed and that can also remove fluctuating noise.

To achieve the above object, a noise suppression device according to the present invention includes: a speech noise acoustic model storage unit that stores a plurality of speech noise acoustic models, each being an acoustic model of one type of speech data or one type of noise data included in training data that includes training speech data and training noise data; a synthesized acoustic model storage unit that stores a synthesized acoustic model, which is an acoustic model in which two or more types of the speech data and noise data included in the training data are combined; a dictionary information storage unit that stores dictionary information, which is information that associates a label, which is information identifying the types of speech and noise superimposed in the training data, with the speech noise acoustic model or the synthesized acoustic model; a reception unit that receives noise-superimposed speech data, which is speech data on which noise is superimposed; a label recognition unit that recognizes, for each frame, a label corresponding to the noise-superimposed speech data by using the speech noise acoustic models, the synthesized acoustic model, and the dictionary information; and a particle filter noise suppression unit that samples particles from the acoustic model corresponding to the label recognized by the label recognition unit and updates the sampled particles, thereby generating clean speech data in which the noise of the noise-superimposed speech data is suppressed.

With this configuration, label recognition reveals what kind of noise is contained in the noise-superimposed speech data, so appropriate noise suppression can be performed even for a new type of noise that suddenly enters the noise-superimposed speech data. In addition, because noise suppression uses a particle filter, appropriate noise suppression can be performed even when the noise fluctuates.

In the noise suppression device according to the present invention, the particle filter noise suppression unit may include: feature extraction means that extracts a feature for each frame of the noise-superimposed speech data; particle filter means that, using the feature corresponding to the noise-superimposed speech data and the speech noise acoustic model or synthesized acoustic model corresponding to the label recognized by the label recognition unit, samples a plurality of particles from the acoustic model corresponding to that label, updates the sampled particles, and calculates a weight for each updated particle; noise component calculation means that calculates a noise component for each frame using the particles updated by the particle filter means and the weights calculated by the particle filter means; and noise suppression means that removes the noise component calculated by the noise component calculation means from the noise-superimposed speech data to obtain clean speech data.

With this configuration, the label recognition result reveals what noise component is superimposed on the speech data, particles can be sampled from the acoustic model corresponding to that type of noise, and the noise component can be removed from the noise-superimposed speech data.
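The four means described above (sampling particles from the label's model, updating them, weighting them, and removing the weighted noise estimate) can be illustrated with a minimal sketch. It is not the implementation of this publication: it assumes a scalar feature in a domain where speech and noise add, a random-walk particle update, and a single Gaussian as the speech model. The names `sample_gmm`, `suppress_frame`, `drift_std`, `speech_mean`, and `speech_var` are hypothetical.

```python
import math
import random


def sample_gmm(gmm, rng):
    """Draw one value from a 1-D GMM given as (weight, mean, variance) tuples."""
    r = rng.random()
    acc = 0.0
    for weight, mean, var in gmm:
        acc += weight
        if r <= acc:
            return rng.gauss(mean, math.sqrt(var))
    # Guard against rounding in the weights: fall back to the last component.
    _, mean, var = gmm[-1]
    return rng.gauss(mean, math.sqrt(var))


def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def suppress_frame(y, particles, speech_mean, speech_var, drift_std, rng):
    """Process one frame; returns (clean, noise_estimate, resampled_particles)."""
    # Update: let each noise hypothesis drift (random-walk assumption).
    moved = [n + rng.gauss(0.0, drift_std) for n in particles]
    # Weight: how plausible is y - n as speech under the assumed speech model?
    weights = [gauss_pdf(y - n, speech_mean, speech_var) for n in moved]
    total = sum(weights)
    if total > 0.0:
        weights = [w / total for w in weights]
    else:
        weights = [1.0 / len(moved)] * len(moved)
    # Noise component: weighted mean of the updated particles.
    noise = sum(w * n for w, n in zip(weights, moved))
    # Suppression: remove the estimated noise from the observation.
    clean = y - noise
    # Resample so that later frames start from the more likely hypotheses.
    resampled = rng.choices(moved, weights=weights, k=len(moved))
    return clean, noise, resampled
```

In use, the initial particles for a frame would be drawn with `sample_gmm` from whichever model the recognized label selects, and `suppress_frame` would then be called once per frame with the particles returned by the previous call.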

The noise suppression device according to the present invention may further include a label language model storage unit that stores a label language model, which is a language model of the labels, and the label recognition unit may recognize, for each frame, the label corresponding to the noise-superimposed speech data by using the speech noise acoustic models, the synthesized acoustic model, the label language model, and the dictionary information.
With this configuration, using the label language model can potentially improve the accuracy of label recognition.

The noise suppression device according to the present invention may further include: a training data storage unit that stores the training data; a label storage unit that stores training label information, which is time-series information on the labels corresponding to the training data; a model generation unit that generates the speech noise acoustic models and the synthesized acoustic model from the training data stored in the training data storage unit by using the training label information, and accumulates them in the speech noise acoustic model storage unit and the synthesized acoustic model storage unit, respectively; and a label language model generation unit that generates the label language model by using the training label information stored in the label storage unit, accumulates it in the label language model storage unit, and also generates the dictionary information and accumulates it in the dictionary information storage unit.

With this configuration, the speech noise acoustic models, the synthesized acoustic model, the dictionary information, and the label language model used for label recognition can be generated, and label recognition can be performed using the generated models and information.

A speech recognition device according to the present invention includes: the noise suppression device; a speech recognition acoustic model storage unit that stores an acoustic model for the speech data to be recognized; a speech recognition dictionary information storage unit that stores speech recognition dictionary information used in speech recognition; a language model storage unit that stores a language model for the language to be recognized; a speech recognition unit that performs speech recognition on the clean speech data generated by the noise suppression device by using the acoustic model, the speech recognition dictionary information, and the language model; and an output unit that outputs the speech recognition result of the speech recognition unit.

With this configuration, speech recognition is performed on clean speech data in which noise has been suppressed, so speech in the received noise-superimposed speech data can potentially be recognized with high accuracy.

According to the noise suppression device and the like of the present invention, two or more types of noise can be effectively removed from speech on which they are superimposed, and fluctuating noise can also be effectively removed. As a result, the accuracy of processing such as speech recognition can potentially be improved.

Hereinafter, a noise suppression device and a speech recognition device according to the present invention will be described by way of embodiments. In the following embodiments, components and steps denoted by the same reference numerals are the same or equivalent, and repeated description may be omitted.

(Embodiment 1)
A noise suppression device according to Embodiment 1 of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of the noise suppression device 1 according to the present embodiment. The noise suppression device 1 includes a training data storage unit 11, a label storage unit 12, a model generation unit 13, a speech noise acoustic model storage unit 14, a synthesized acoustic model storage unit 15, a label language model generation unit 16, a dictionary information storage unit 17, a label language model storage unit 18, a reception unit 19, a label recognition unit 20, a particle filter noise suppression unit 21, and a storage unit 22.

The training data storage unit 11 stores training data. The training data includes speech data and noise data, both of which are used for training the models. Speech data is data that is not noise, for example, data of speech uttered by a human. Noise data is data of noise such as beeps and machine noise. The model generation unit 13, described later, trains the models using this training data.

The process by which the training data comes to be stored in the training data storage unit 11 does not matter. For example, the training data may be stored in the training data storage unit 11 via a recording medium, training data transmitted over a communication line or the like may be stored there, or training data input through an input device such as a microphone may be stored there. Storage in the training data storage unit 11 may be temporary storage in RAM or the like of training data read from an external storage device or the like, or may be long-term storage. The training data storage unit 11 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk).

The label storage unit 12 stores training label information, which is time-series information on the labels corresponding to the training data. This training label information labels the types of speech data and noise data in the training data stored in the training data storage unit 11. For example, the labels “beep”, “target”, and “beep.target” indicate that the training data contains a beep, a target utterance, and data in which a beep and a target utterance are superimposed, respectively. A target utterance is a desired utterance, that is, an utterance that is the object of processing, listening, or the like; in speech recognition, for example, it is the utterance to be recognized. Because the training label information is time-series label information, it may include information on the time of the training data (for example, time codes) so that the period of training data corresponding to each label can be identified.
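As an illustration of the training label information described above, the following sketch represents each label together with the period of training data it covers. The record layout, the use of seconds as the time code, and the reading of “.” as the separator between superimposed types are assumptions made for this example; the names `LabelSegment` and `segments_for` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelSegment:
    """One entry of time-series label information (assumed layout)."""
    label: str        # e.g. "beep", "target", "beep.target"
    start_sec: float  # start of the labeled period in the training data
    end_sec: float    # end of the labeled period in the training data

    def components(self):
        """Single types superimposed in this segment (assumes '.' separates them)."""
        return tuple(self.label.split("."))


def segments_for(segments, label):
    """Return the (start, end) periods whose label matches exactly."""
    return [(s.start_sec, s.end_sec) for s in segments if s.label == label]
```

A model generator could call `segments_for(segments, "beep")` to find the periods from which a beep model is to be trained.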

The process by which the training label information comes to be stored in the label storage unit 12 does not matter. For example, the training label information may be stored in the label storage unit 12 via a recording medium, training label information transmitted over a communication line or the like may be stored there, or training label information input through an input device may be stored there. Storage in the label storage unit 12 may be temporary storage in RAM or the like of training label information read from an external storage device or the like, or may be long-term storage. The label storage unit 12 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk).

The model generation unit 13 generates speech noise acoustic models and synthesized acoustic models from the training data stored in the training data storage unit 11 by using the training label information, and accumulates them in the speech noise acoustic model storage unit 14 and the synthesized acoustic model storage unit 15, respectively. A speech noise acoustic model is an acoustic model of one type of speech data or noise data included in the training data. Types of speech data are, for example, “target utterance” and “other utterance”. Types of noise data are, for example, “beep” and “machine noise”. An acoustic model of speech data may be, for example, an acoustic model of the speech data of the target utterance or an acoustic model of the speech data of another person's utterance. An acoustic model of noise data may be, for example, an acoustic model of a beep or an acoustic model of machine noise. A synthesized acoustic model is an acoustic model in which two or more types of the speech data and noise data included in the training data are combined. “Two or more types of the speech data and noise data” may be, for example, two or more types of speech data, one or more types of speech data together with one or more types of noise data, or two or more types of noise data. The combinations used for synthesized acoustic models may be all combinations of the single types of speech data or noise data included in the training data, or only some of them. Even in the former case, it is preferable that the maximum number of combinations be determined. In the latter case, the combinations may be, for example, those combinations of single types that actually occur in the training data, or other combinations. The speech noise acoustic models and synthesized acoustic models generated by the model generation unit 13 may or may not be speaker-adapted.

By using the training label information, the model generation unit 13 can extract from the training data sections of desired speech data, sections of desired noise data, sections in which desired speech and noise are superimposed, and so on. For example, when generating a speech noise acoustic model of a beep, the model generation unit 13 uses the training label information to extract from the training data the noise data sections corresponding to the beep, and generates the speech noise acoustic model of the beep from that noise data. The model generation unit 13 generally performs modeling with a GMM (Gaussian Mixture Model), but may instead use an HMM (Hidden Markov Model). This embodiment describes the case of modeling with GMMs.

Methods by which the model generation unit 13 generates a speech noise acoustic model from training data are already known, and their description is omitted. When modeling with GMMs, the number of mixture components may differ for each type of speech data or noise data. For example, 200 mixture components may be used for the target utterance and 4 for a beep.
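For readers unfamiliar with the known technique referred to above, the following is a minimal sketch of fitting a GMM with the EM algorithm, with the number of mixture components chosen per type. It assumes scalar features and a simple quantile-based initialization, which are simplifications for illustration; the text does not specify the training procedure. The names `MIXTURES_BY_TYPE` and `fit_gmm_1d` are hypothetical.

```python
import math

# Hypothetical per-type configuration, using the example counts from the text.
MIXTURES_BY_TYPE = {"target": 200, "beep": 4}


def fit_gmm_1d(samples, n_components, n_iter=50, var_floor=1e-3):
    """Fit a 1-D GMM by EM; returns a list of (weight, mean, variance)."""
    ordered = sorted(samples)
    n = len(ordered)
    # Initialize means at evenly spaced quantiles (an assumed heuristic).
    means = [ordered[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    mean_all = sum(ordered) / n
    var_all = max(sum((x - mean_all) ** 2 for x in ordered) / n, var_floor)
    variances = [var_all] * n_components
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in ordered:
            dens = [
                w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            if nk < 1e-8:
                continue  # leave an empty component unchanged
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, ordered)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, ordered)) / nk
            variances[k] = max(var_k, var_floor)
    total_w = sum(weights)
    return [(w / total_w, m, v) for w, m, v in zip(weights, means, variances)]
```

A caller would pass the frames extracted for one label, for example `fit_gmm_1d(beep_frames, MIXTURES_BY_TYPE["beep"])`, where `beep_frames` is a hypothetical list of features taken from the beep sections.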

The model generation unit 13 may generate a synthesized acoustic model directly from the training data, or may generate speech noise acoustic models from the training data and combine two or more of them to generate a synthesized acoustic model. In this embodiment, the model generation unit 13 generates speech noise acoustic models and generates synthesized acoustic models by combining two or more of those models. When, for example, two speech noise acoustic models are combined, the number of mixture components of the synthesized acoustic model is the product of the numbers of mixture components of the two models. For example, combining a speech noise acoustic model with 3 mixture components and one with 2 mixture components yields a synthesized acoustic model with 2 × 3 = 6 mixture components. As a model composition method, performing composition on the model parameters in the same manner as PMC (Parallel Model Combination) is known, and its detailed description is omitted. Model composition using PMC is described, for example, in Non-Patent Document 2 mentioned above.
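The statement that the component counts multiply can be illustrated with a small sketch that pairs every component of one GMM with every component of another. It assumes scalar parameters that are already in a linear spectral domain where independent sources add, so that means and variances simply sum; full PMC additionally converts between the cepstral or log-spectral domain and the linear domain, which is omitted here. The name `compose_gmms` is hypothetical.

```python
from itertools import product


def compose_gmms(gmm_a, gmm_b):
    """Combine two 1-D GMMs given as lists of (weight, mean, variance).

    Every component of gmm_a is paired with every component of gmm_b, so the
    result has len(gmm_a) * len(gmm_b) components. Means and variances are
    summed under the assumption of independent, additive sources.
    """
    return [
        (wa * wb, ma + mb, va + vb)
        for (wa, ma, va), (wb, mb, vb) in product(gmm_a, gmm_b)
    ]
```

With a 3-component model and a 2-component model as inputs, the result has 6 components, matching the example in the text, and the combined weights still sum to 1.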

Here, the merits of generating a synthesized acoustic model by combining speech noise acoustic models are briefly described.

First, because the training data is finite, there may be few samples of the sections in which the relevant noise and speech are superimposed, which a synthesized acoustic model requires. Generating speech noise acoustic models from the training data and combining two or more of them is therefore considered more accurate than generating a synthesized acoustic model directly from the training data. In addition, when each single model is trained separately and a synthesized acoustic model is generated from combinations of their mixture components, the model can also cover patterns of noise and speech combinations that do not appear in the training data. The range that the model can represent can thus be made wider than that of a model obtained directly by training. In general, generating a synthesized acoustic model by combining acoustic models is also considered faster than generating an acoustic model from data, so combining two or more speech noise acoustic models is considered more appropriate from the viewpoint of shortening model generation time as well. Furthermore, it is difficult to identify appropriately, in the training data, sections where speech and noise overlap or sections where two or more types of noise overlap. Identifying the speech-only and noise-only sections, which are easier to identify, and combining the acoustic models corresponding to those sections is therefore considered to yield a more accurate model.

As described later, the model generation unit 13 may also receive audio data of the background noise contained in the noise-superimposed speech data and generate an acoustic model corresponding to that background noise.

The speech noise acoustic model storage unit 14 stores a plurality of speech noise acoustic models, each being an acoustic model of one type of speech data or noise data included in the training data, which includes training speech data and noise data. The speech noise acoustic model storage unit 14 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk).

The synthetic acoustic model storage unit 15 stores synthetic acoustic models, each of which is an acoustic model obtained by synthesizing two or more types of data among the speech data and noise data contained in the training data. The synthetic acoustic model storage unit 15 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk).

The label language model generation unit 16 uses the training label information stored in the label storage unit 12 to generate a label language model, which it accumulates in the label language model storage unit 18, and to generate dictionary information, which it accumulates in the dictionary information storage unit 17. Here, the label language model may be, for example, an N-gram model of labels or a grammar. The grammar may be, for example, a network grammar, a CFG (Context Free Grammar), or a version of either that uses probabilities. N-gram models and grammars are well known in natural language processing and speech recognition, and a detailed description is omitted. The present embodiment describes the case where the label language model is an N-gram model. The dictionary information is information that associates a label with a speech noise acoustic model or a synthetic acoustic model. For example, the dictionary information may associate information identifying a label with information identifying a speech noise acoustic model or a synthetic acoustic model. The information identifying the label may be the label itself, and the information identifying the speech noise acoustic model or the synthetic acoustic model may be the name of that model. Accordingly, the dictionary information may, for example, associate "beep", which identifies the label of a beep sound, with "beep", which identifies the speech noise acoustic model corresponding to the beep sound. When generating the label language model and the dictionary information, the label language model generation unit 16 may refer to the speech noise acoustic models, the synthetic acoustic models, and the like in order to obtain the information identifying each model.

Note that the label language model generation unit 16 generates the label language model in the same way as an ordinary N-gram language model or grammar is generated (the only difference here is that labels take the place of the words of the language model), so a description is omitted. When the label language model generation unit 16 generates an N-gram model, the value of N for the label N-gram model is assumed to be determined in advance. N may be, for example, 2 (bigram), 3 (trigram), or both.
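
As an illustration of the two artifacts described above, the following minimal sketch builds a label bigram model by relative frequency and a dictionary that maps each label to a model name. The label sequences, the `<s>` and `</s>` boundary symbols, the function name, and the absence of smoothing are all assumptions made for the example; the source does not specify them.

```python
from collections import Counter, defaultdict

def train_label_bigram(sequences, bos="<s>", eos="</s>"):
    """Estimate P(next label | previous label) by relative frequency (no smoothing)."""
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        padded = [bos] + list(seq) + [eos]
        for prev, nxt in zip(padded, padded[1:]):
            pair_counts[prev][nxt] += 1
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {nxt: count / total for nxt, count in counter.items()}
    return model

# Hypothetical training label information: one label per segment.
training_labels = [
    ["background", "beep", "beep.target", "target", "background"],
    ["background", "target", "background"],
]
bigram = train_label_bigram(training_labels)

# Dictionary information: label -> name of the acoustic model (here the label itself).
dictionary = {label: label for seq in training_labels for label in seq}
```

A trigram model would count label triples in the same way; a practical system would normally also smooth the estimates so that unseen label transitions keep a nonzero probability.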

The dictionary information storage unit 17 stores dictionary information, which associates a label, that is, information identifying the types of speech and noise superimposed in the training data, with a speech noise acoustic model or a synthetic acoustic model. The dictionary information storage unit 17 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk).

The label language model storage unit 18 stores the label language model of the labels, which are information identifying the types of speech and noise superimposed in the training data. The label language model storage unit 18 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk). When the label language model includes a grammar, part or all of that grammar need not be generated by the label language model generation unit 16; it may, for example, be created manually. In that case, part or all of the label language model may be input from outside and accumulated in the label language model storage unit 18.

The reception unit 19 receives noise-superimposed speech data, that is, speech data on which noise is superimposed. For example, the reception unit 19 may receive noise-superimposed speech data input from an input device (for example, a microphone), noise-superimposed speech data transmitted over a wired or wireless communication line, or noise-superimposed speech data read from a predetermined recording medium (for example, an optical disk, a magnetic disk, or a semiconductor memory). The reception unit 19 may or may not include a device for reception (for example, a modem or a network card). The reception unit 19 may be realized by hardware or by software such as a driver that drives a predetermined device. A recording medium (not shown) that temporarily stores the noise-superimposed speech data received by the reception unit 19 may also be provided.

The label recognition unit 20 recognizes the labels corresponding to the noise-superimposed speech data received by the reception unit 19, using the speech noise acoustic models stored in the speech noise acoustic model storage unit 14, the synthetic acoustic models stored in the synthetic acoustic model storage unit 15, the dictionary information stored in the dictionary information storage unit 17, and the label language model stored in the label language model storage unit 18. This recognition is performed in the same manner as speech recognition, except that labels take the place of words and models of speech data, noise data, or mixtures of them take the place of phoneme models. A component that performs conventional speech recognition using an acoustic model, a language model, and a dictionary can therefore be used as the label recognition unit 20, and a detailed description is omitted.

As a result of this label recognition, the labels corresponding to the noise-superimposed speech data are identified in time order. For example, the label "beep" is associated with frames 1 to 50 of the noise-superimposed speech data, and the label "beep.target" is associated with frames 51 to 200. Information indicating the recognized labels is stored in a recording medium (not shown).
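
The time-ordered recognition result in this example can be represented as a list of segments and expanded to one label per frame. The sketch below does this and picks out the frames that contain the target utterance. The segment tuple layout, the helper names, and the reading of "beep.target" as components joined by "." are assumptions for illustration only.

```python
def expand_segments(segments):
    """Turn contiguous (first_frame, last_frame, label) triples into per-frame labels."""
    labels = []
    for first, last, label in segments:
        labels.extend([label] * (last - first + 1))
    return labels

def contains_target(label, target="target", sep="."):
    """True if a (possibly composite) label includes the target utterance."""
    return target in label.split(sep)

# Recognition result from the example in the text (frames are 1-based).
segments = [(1, 50, "beep"), (51, 200, "beep.target")]
frame_labels = expand_segments(segments)

# Frames in which the target utterance is present.
speech_frames = [i + 1 for i, lab in enumerate(frame_labels) if contains_target(lab)]
```

With these segments, `frame_labels` has 200 entries and `speech_frames` runs from frame 51 to frame 200, so the later stages can treat speech sections and noise-only sections separately.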

Note that the label recognition unit 20 may obtain the maximum likelihood label sequence by performing a beam search or a multi-pass search that switches among a plurality of models. For example, a search using the speech noise acoustic models or synthetic acoustic models together with a bigram may be performed in the first pass, and rescoring with a trigram may be performed in the second pass.

The particle filter noise suppression unit 21 samples particles from the acoustic model corresponding to the label recognized by the label recognition unit 20 and updates the sampled particles, thereby generating clean speech data in which the noise of the noise-superimposed speech data is suppressed. The particle filter noise suppression unit 21 may perform noise suppression in the feature space or on the speech signal. In the former case, its output is a feature quantity from which the noise component has been removed; in the latter case, its output is a speech signal from which the noise component has been removed.

FIG. 2 is a block diagram showing the detailed configuration of the particle filter noise suppression unit 21 according to the present embodiment. The particle filter noise suppression unit 21 includes a feature extraction unit 31, a particle filter unit 32, a noise component calculation unit 33, and a noise suppression unit 34.

The feature extraction unit 31 extracts a feature quantity for each frame of the noise-superimposed speech data. For example, the feature extraction unit 31 may generate the log mel spectrum corresponding to the noise-superimposed speech data by performing mel filter bank analysis on each frame. The feature extraction unit 31 may also extract other feature quantities.

The particle filter unit 32 uses the feature quantities corresponding to the noise-superimposed speech data and the speech noise acoustic model or synthetic acoustic model corresponding to the label recognized by the label recognition unit 20. It samples a plurality of particles from the acoustic model corresponding to the recognized label, updates the sampled particles, and calculates the weight of each updated particle. The specific equations for this processing are described later.

The noise component calculation unit 33 calculates the noise component for each frame using the particles updated by the particle filter unit 32 and the weights calculated by the particle filter unit 32. The specific equations for this processing are described later.

The noise suppression unit 34 removes the noise component calculated by the noise component calculation unit 33 from the noise-superimposed speech data to obtain clean speech data. The noise component may be removed, for example, by constructing a Wiener filter from the estimated noise component and obtaining a time-domain speech waveform through noise suppression by filtering, or by subtracting the noise component in the log mel spectral domain. In the former case, a Wiener filter can be constructed in the log mel spectral domain, and applying it as a filtering process allows the speech waveform to be estimated from the input noise-superimposed speech data. In the latter case, the noise component is removed in the log mel spectral domain, and the resulting clean speech data is a log mel spectrum. The present embodiment describes the former case.
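
The source does not give the form of the Wiener filter, so the following is only a sketch of one common choice. It assumes that the log mel features are logarithms of mel-band power, that the clean power is approximated by the noisy power minus the estimated noise power, and that a small gain floor (a hypothetical parameter) prevents negative gains. Converting the per-band gains into a time-domain filter is outside the sketch.

```python
import math

def wiener_gain(noisy_logmel, noise_logmel, floor=0.05):
    """Per-band gain H = max(1 - N/X, floor), with X and N given as log powers."""
    return [max(1.0 - math.exp(n - x), floor)
            for x, n in zip(noisy_logmel, noise_logmel)]

# Band 0: noisy power 4, noise power 1 -> gain 0.75.
# Band 1: estimated noise exceeds the observation -> gain clipped to the floor.
gains = wiener_gain([math.log(4.0), 0.0], [0.0, 1.0])
```

The alternative mentioned in the text, subtraction in the log mel spectral domain, would instead keep the result as a log mel spectrum rather than producing a filter.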

The accumulation unit 22 accumulates the clean speech data whose noise has been suppressed by the particle filter noise suppression unit 21 in a predetermined recording medium. This recording medium is, for example, a semiconductor memory, an optical disk, or a magnetic disk, and may be included in the accumulation unit 22 or exist outside it. The recording medium may or may not store the clean speech data only temporarily.

The present embodiment describes the case where the noise suppression device 1 accumulates the clean speech data after noise suppression. However, the noise suppression device 1 may include, in place of the accumulation unit 22, an output unit that outputs the clean speech data after noise suppression. The output may be, for example, transmission to a predetermined device over a communication line, audio output from a speaker, or accumulation in a recording medium. The output unit may or may not include an output device (for example, a speaker or a communication device). The output unit may be realized by hardware or by software such as a driver that drives such a device.

Any two or more of the recording media of the training data storage unit 11, the label storage unit 12, the speech noise acoustic model storage unit 14, the synthetic acoustic model storage unit 15, the dictionary information storage unit 17, the label language model storage unit 18, and the recording medium (not shown) in which the accumulation unit 22 accumulates clean speech data may be realized by the same recording medium or by separate recording media. In the former case, for example, the area storing the training data serves as the training data storage unit 11, and the area storing the training label information serves as the label storage unit 12.

Next, the noise suppression method used in the noise suppression device 1 according to the present embodiment will be described.
In the log mel spectral domain, let x_t be the feature vector corresponding to the noise-superimposed speech data in the t-th frame, s_t the feature vector corresponding to the clean speech data, and n_t(n) the feature vector corresponding to the n-th of N types of noise data. Note that x_t, s_t, and n_t(n) may be referred to simply as noise-superimposed speech, clean speech, and noise. A state space model is defined for these. The observation model is expressed as follows.

Here, n_t is the composite noise containing all noises in the t-th frame. g(s_t, n_t) is the mismatch component between the clean speech s_t and the noise-superimposed speech x_t. v_t denotes the observation noise, and Σ_x is the covariance matrix of x_t.
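
The observation equation itself is not reproduced in this text. As a sketch only, one form consistent with the symbols just defined is shown below. The additive structure and the zero-mean Gaussian observation noise are assumptions; the second line is the usual log-spectral relation that follows when speech and noise add in the linear mel power domain, and it is not confirmed by the source.

```latex
\begin{align}
x_t &= s_t + g(s_t, n_t) + v_t, \qquad v_t \sim \mathcal{N}(0, \Sigma_x) \\
g(s_t, n_t) &= \log\bigl(1 + \exp(n_t - s_t)\bigr)
\end{align}
```

Under that second relation, x_t without observation noise equals log(exp(s_t) + exp(n_t)), which is why g is called a mismatch between clean and noisy speech.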

In the noise suppression device 1 according to the present embodiment, label recognition is performed for each frame, so speech sections (sections containing the target utterance) can be distinguished from other sections. Noise suppression can therefore be performed separately for each kind of section. The mismatch component for each mixture component can be defined as follows.

Here, μ_{x,l} is the l-th mean vector in the mixture distribution of the acoustic model corresponding to the noise-superimposed speech data. For sections containing the target utterance, it is generated by combining the l-th mean vector μ_{s,l} in the mixture distribution of the acoustic model corresponding to the clean speech data with the noise component of the acoustic model corresponding to the estimated noise data. Which noise components are combined can be determined from the recognized label. In other sections, μ_{x,l} is composed of noise components only. ε is a small positive value used to adjust the residual signal power after noise suppression.

Furthermore, the system model of the l-th mixture component n_{t,l}(n) of the acoustic model corresponding to the noise n_t(n) is assumed to follow a random walk, as follows.

Here, u_t(n,l) is the system noise, and Σ_{u_t(n,l)} = Σ_{n(n),l} is the covariance matrix of the noise n_{t,l}(n).
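
A random-walk system model of this kind moves each noise vector by adding zero-mean Gaussian system noise to its previous value. The sketch below shows one such step; the diagonal covariance, the three-dimensional example vector, and the function name are assumptions for illustration.

```python
import random

def random_walk_step(particle, variances, rng):
    """n_t = n_{t-1} + u_t, with u_t ~ N(0, diag(variances))."""
    return [x + rng.gauss(0.0, v ** 0.5) for x, v in zip(particle, variances)]

rng = random.Random(0)
particle = [2.0, 1.5, 0.7]        # hypothetical log mel noise vector at frame t-1
variances = [0.01, 0.02, 0.0]     # zero variance leaves that dimension unchanged
next_particle = random_walk_step(particle, variances, rng)
```

Larger variances let the noise estimate drift faster between frames, which suits non-stationary noise at the cost of noisier estimates.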

The state space model is given by Equations (2) and (5). A particle filter is defined in the same manner as in Non-Patent Document 5 and integrated into the noise suppression performed by the noise suppression device 1 according to the present embodiment.

Similarly, the noise suppression device 1 according to the present embodiment uses an extended Kalman particle filter, resampling (residual sampling), and Markov chain Monte Carlo (MCMC). The noise suppression device 1 uses the mixture distribution of the acoustic model corresponding to the recognized label as the prior distribution of each particle.

FIG. 3 is a diagram outlining particle filtering in the noise suppression device 1 according to the present embodiment. First, particles for the noise distribution are sampled from the acoustic model of background noise. This acoustic model is generated, for example, from the data of a section containing only background noise. Up to this stage, the processing is the same as in Non-Patent Document 5. Next, in a frame where label recognition detects a new noise, "noise 1", at time t, new particles are sampled from the acoustic model of "noise 1", which has been created in advance from training data and stored in the speech noise acoustic model storage unit 14 or the synthetic acoustic model storage unit 15. In the next frame, "noise 1" is still detected, so noise with a similar distribution is considered to be present and is expected to be easy to estimate from the "noise 1" particles at time t. The "noise 1" particles at time t+1 are therefore estimated using the extended Kalman filter, as in the conventional method. When "noise 1" is no longer detected at time t+2, its particles are removed from the particles used for estimation. In this way, the noise suppression device 1 according to the present embodiment can handle sudden noise by sampling particles from the acoustic model corresponding to the recognized label. In addition, when particle filtering is performed in each frame, combinations of clean speech particles and noise particles are taken into account and used for estimation in order to obtain an estimate closer to the noise-superimposed speech, which improves estimation accuracy.

Next, the particle filter will be described.
Here, a particle filter known as sequential importance sampling is defined. When the state space model is given by Equations (2) and (5), the posterior probability density function is as follows.

Here, n_{0:t} = {n_0, ..., n_t} and x_{0:t} = {x_0, ..., x_t}. In the particle filter, this posterior probability density function is approximated as follows.

Here, (k) is the particle index and K is the total number of particles. δ(·) is the Dirac delta function, and w_t^(k) is the weight (importance weight) of the k-th particle at time t. Because sampling from p(n_{0:t} | x_{0:t}) itself is difficult, a distribution q(n_{0:t} | x_{0:t}) called the importance density is introduced. The weight of each particle can then be expressed as follows.

Here, by assuming that p(n_t^(k) | n_{t-1}^(k)) = q(n_t^(k) | n_{0:t}^(k), x_{0:t}), Equation (9) can be simplified as follows.
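
The weight equations are not reproduced in this text. As a sketch, the standard sequential importance sampling recursion and its simplified form are written out below. The first line is the textbook form, written with the conditioning history n_{0:t-1}, and is an assumption about what Equation (9) contains; the second line matches the weight update stated later in the procedure.

```latex
\begin{align}
w_t^{(k)} &\propto w_{t-1}^{(k)}\,
  \frac{p\bigl(x_t \mid n_t^{(k)}\bigr)\, p\bigl(n_t^{(k)} \mid n_{t-1}^{(k)}\bigr)}
       {q\bigl(n_t^{(k)} \mid n_{0:t-1}^{(k)}, x_{0:t}\bigr)} \\
w_t^{(k)} &\propto w_{t-1}^{(k)}\, p\bigl(x_t \mid n_t^{(k)}\bigr)
\end{align}
```

Choosing the importance density equal to the transition prior cancels the ratio in the first line, so each particle's weight is simply rescaled by how well it explains the current observation.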

In the particle filter, samples are drawn from the acoustic model of clean speech and from the acoustic model of noise. For the noise, the initial frames use an acoustic model of background noise estimated from a short span of data at the start of the received noise-superimposed speech data; when noise is subsequently detected by label recognition, the acoustic model of noise corresponding to the recognized label is used. Sampling from each model is defined as follows.

[Sampling from the acoustic model of clean speech]

Here, μ_{s,l} and Σ_{s,l} are, respectively, the l-th mean vector and covariance matrix in the mixture distribution of the acoustic model corresponding to the clean speech data. L_s is the number of mixture components in that mixture distribution.

[Sampling from the acoustic model of noise]

Here, μ_{n(n),l} and Σ_{n(n),l} are, respectively, the l-th mean vector and covariance matrix in the mixture distribution of the acoustic model corresponding to the noise data for the label recognized in that frame (the noise may be a single noise or a synthesized noise). L_{n(n)} is the number of mixture components in that mixture distribution.
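
Sampling a particle from a Gaussian mixture acoustic model, as in both cases above, can be done by first choosing a mixture component according to the mixture weights and then drawing from that component's Gaussian. The sketch below assumes diagonal covariances and uses a hypothetical two-component noise model; the exact sampling equations (11) and (12) are not reproduced in this text.

```python
import random

def sample_gmm(weights, means, variances, rng):
    """Draw one vector from a diagonal-covariance Gaussian mixture.

    Returns the index of the chosen component and the sampled vector.
    """
    l = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    sample = [rng.gauss(m, v ** 0.5) for m, v in zip(means[l], variances[l])]
    return l, sample

# Hypothetical 2-component noise model over 2 log mel bands.
rng = random.Random(0)
noise_weights = [0.3, 0.7]
noise_means = [[1.0, 0.5], [2.0, 1.5]]
noise_variances = [[0.1, 0.1], [0.2, 0.2]]
component, noise_particle = sample_gmm(noise_weights, noise_means, noise_variances, rng)
```

The same function can be applied to the clean speech model by passing its weights, μ_{s,l}, and the diagonal of Σ_{s,l}.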

Pairs of noise-superimposed speech particles are created by combining the speech particles and the noise particles sampled in this way. Next, each particle is updated using the extended Kalman filter, and the mean ^n_t^(k) and the covariance ^Σ_{n_t}^(k) of n_t^(k) are estimated. Here, ^A denotes A with a hat (^) placed above it.

Here, t|t-1 indicates a parameter of the t-th frame predicted from the (t-1)-th frame.

After particle filtering, residual resampling and MCMC (Metropolis-Hastings sampling) are performed, as in Non-Patent Document 5.

Using this method, the current noise distribution can be estimated, and a single distribution representing it is obtained. In the conventional noise suppression method that performs label recognition, the acoustic model of noise was also represented by a mixture distribution, so calculations over that mixture were required; in this method such calculations are unnecessary, and the amount of computation is reduced. As a concrete example, suppose the acoustic model of clean speech has 200 mixture components. If the acoustic model of noise is a mixture distribution with 4 components, the acoustic model of noise-superimposed speech has 200 × 4 = 800 components, and calculations must be performed for all 800. If instead the acoustic model of noise is a single distribution (1 component), the acoustic model of noise-superimposed speech has 200 × 1 = 200 components, so calculations are needed for only 200 components, and the amount of computation is reduced.

The description above uses an extended Kalman filter (EKF) in the stage preceding the particle filter, but an unscented Kalman filter (UKF) may be used instead. The EKF is derived by approximating the nonlinear model with a first-order Taylor expansion. The UKF, which uses the unscented transformation (U transformation), has been proposed as a more accurate approximation and can approximate terms up to the second order. The UKF can be introduced into the noise suppression method according to the present embodiment simply by replacing the EKF with the UKF.

For the UKF, see the following references.
Reference: S. J. Julier, J. K. Uhlmann, "A new extension of the Kalman filter to nonlinear systems", Proc. AeroSense: Int. Symp. Aerospace/Defense Sensing, Simul. and Controls, Orlando, FL, 1997

Reference: Takashi Yahagi, "Kalman Filter and Adaptive Signal Processing", Corona Publishing, 2005

Reference: Masaki Yamakita, "What is the UKF (Unscented Kalman Filter)?", Systems, Control and Information, vol. 50, no. 7, pp. 261-266, 2006

For a particle filter that uses the UKF, see the following reference.
Reference: R. van der Merwe, A. Doucet, N. de Freitas, E. Wan, "The unscented particle filter", Technical Report CUED/F-INFENG/TR 380, Cambridge University Engineering Department, 2000

Next, the noise suppression method based on the particle filter will be described. That is, the procedure leading up to GMM-based noise suppression is described using the definitions above.

(I) Initialization (for frame number t = 0)
Each particle n_0^(i) (i = 0, ..., I) is sampled from the acoustic model of background noise.

(II) For each frame (t = 1, 2, ..., T)
(II-1) Importance sampling step (particle filter):

If the recognized label indicates that the frame contains the target utterance, speech particles s_t^(i) are sampled from the acoustic model of clean speech according to Equation (11).
If the recognized label indicates that a new noise n(n) has been detected in the frame, noise particles are sampled according to Equation (12).
s_t^(i) and n_{t-1}^(j) are combined to create pairs of noise-superimposed speech particles. Here, the total number of particles is K = I × (J + 1).

For k = 1, ..., K, each particle is updated by the extended Kalman filter. That is, the mean ^n_t^(k) and the covariance ^Σ_{n_t}^(k) of n_t^(k) are estimated by Equations (13) and (14).

For k = 1, ..., K, the weight w_t^(k) ∝ w_{t-1}^(k) p(x_t | n_t^(k)) is calculated. The value calculated one frame earlier is used for w_{t-1}^(k). p(x_t | n_t^(k)) can be calculated using the estimated mean ^n_t^(k) and covariance ^Σ_{n_t}^(k).

ｋ＝１，...，Ｋについて、計算したｗ_ｔ ^（ｋ）を正規化して、正規化重み＾ｗ_ｔ ^（ｋ）を計算する。正規化重み＾ｗ_ｔ ^（ｋ）は、ｗ_ｔ ^（ｋ）を、ｋについて和をとったｗ_ｔ ^（ｋ）で除算することによって算出できる。 For k = 1,..., K, the calculated w _t ^(k) is normalized to calculate a normalization weight ^ w _t ^(k) . Normalized weights _^ ^{w t (k)} _can be calculated by dividing by ^{w t} a ^(k), it was summed for ^k _w ^{t (k).}
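The weight update and normalization described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: it assumes that the likelihood of the observation for each particle is a diagonal Gaussian whose mean and variance are already available, and that the previous weights are strictly positive. The function names `gaussian_log_likelihood` and `update_weights` are hypothetical.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def update_weights(prev_weights, x_t, means, variances):
    """Return normalized weights proportional to w_{t-1}^(k) * p(x_t | particle k).

    Assumption: every previous weight is > 0 (math.log would fail otherwise).
    """
    log_w = [
        math.log(w) + gaussian_log_likelihood(x_t, m, v)
        for w, m, v in zip(prev_weights, means, variances)
    ]
    shift = max(log_w)  # subtract the maximum before exponentiating to avoid underflow
    unnormalized = [math.exp(lw - shift) for lw in log_w]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

Working in the log domain is a design choice of this sketch; it keeps the product of a small previous weight and a small likelihood from underflowing before normalization.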

(II-2) Residual sampling step:
Particles with high importance weights are multiplied, and particles with low importance weights are suppressed. This residual sampling method is already known, and a detailed description is omitted.
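Since the text omits the details, the following is a minimal sketch of the textbook form of residual resampling, which may differ from what the embodiment actually uses. Each particle is first copied floor(K × weight) times, and the remaining slots are drawn at random in proportion to the leftover fractional weights. The function name `residual_resample` is hypothetical, and the weights are assumed to be normalized.

```python
import math
import random


def residual_resample(weights, rng=None):
    """Return K particle indices chosen by residual resampling.

    Assumption: `weights` is normalized (sums to 1) and has length K.
    """
    rng = rng if rng is not None else random
    k = len(weights)
    # Deterministic part: floor(K * w) copies of each particle.
    counts = [math.floor(k * w) for w in weights]
    residuals = [k * w - c for w, c in zip(weights, counts)]
    remaining = k - sum(counts)
    indices = [i for i, c in enumerate(counts) for _ in range(c)]
    # Random part: fill the remaining slots according to the fractional residuals.
    if remaining > 0:
        indices += rng.choices(range(k), weights=residuals, k=remaining)
    return indices
```

Particles with large weights are therefore duplicated and particles with very small weights tend to disappear, which is the multiplication and suppression behavior the step describes.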

(II-3) MCMC step:
Metropolis-Hastings sampling is applied. This method is also already known, and a detailed description is omitted.
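As a rough illustration of this step, the sketch below performs one random-walk Metropolis-Hastings move on a single particle. The text does not specify the proposal or the target density, so both are assumptions here: the proposal is a symmetric Gaussian perturbation, and `log_target` is a caller-supplied function returning the log of the (unnormalized) target density. The name `mh_move` is hypothetical.

```python
import math
import random


def mh_move(particle, log_target, step=0.1, rng=random):
    """Apply one random-walk Metropolis-Hastings move to a particle (list of floats)."""
    proposal = [v + rng.gauss(0.0, step) for v in particle]
    log_ratio = log_target(proposal) - log_target(particle)
    # With a symmetric proposal, accept with probability min(1, target ratio).
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal
    return list(particle)
```

A move of this kind is typically applied after resampling to restore diversity among particles that have become identical copies.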

(II-4) Estimation of the noise posterior distribution:
The posterior distribution is estimated from the particles according to equation (15).
Here, $\mu_{\hat{n}_t}$ and $\Sigma_{\hat{n}_t}$ are the estimated mean vector and the covariance matrix of the noise acoustic model, respectively.
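One common way to summarize a weighted particle set by a single Gaussian is moment matching, sketched below. Whether equation (15) takes exactly this form is an assumption, as are the diagonal covariances and the name `collapse_to_gaussian`; the inputs are the normalized weights and the per-particle means and variances.

```python
def collapse_to_gaussian(weights, means, variances):
    """Moment-match a weighted set of diagonal Gaussians to a single Gaussian.

    Returns (mu, var), both lists with one entry per feature dimension.
    """
    dim = len(means[0])
    mu = [sum(w * m[d] for w, m in zip(weights, means)) for d in range(dim)]
    # Total variance = weighted within-particle variance + spread of the particle means.
    var = [
        sum(w * (v[d] + (m[d] - mu[d]) ** 2)
            for w, m, v in zip(weights, means, variances))
        for d in range(dim)
    ]
    return mu, var
```

For two equally weighted particles with means 0 and 2 and unit variances, the result has mean 1 and variance 2.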

(II-5) Noise suppression using labels, based on the GMM of clean speech:
The clean speech is estimated using the mismatch component, according to equations (16) and (17).

Here, $\mu_{x,l}$ and $\Sigma_{x,l}$ are the mean vector and the covariance matrix, respectively, of the acoustic model corresponding to the label of each frame. They are estimated as appropriate, by a first-order Taylor series expansion (see the reference below), from the clean speech model and the estimated noise acoustic model $N(\mu_{\hat{n}_t}, \Sigma_{\hat{n}_t})$.

Reference: P. J. Moreno, "Speech recognition in noisy environments", PhD thesis, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1996.

Next, the operation of the noise suppression device 1 according to the present embodiment will be described with reference to the flowchart of FIG. 4.
(Step S101) The model generation unit 13 determines whether it is time to generate models. If it is, the process proceeds to step S102; otherwise, the process proceeds to step S106. The model generation unit 13 may determine that it is time to generate models when, for example, the noise suppression device 1 has received an instruction to generate models, when training data has been accumulated in the training data storage unit 11 and training label information has been accumulated in the label storage unit 12, or at some other timing.

(Step S102) The model generation unit 13 uses the training label information stored in the label storage unit 12 to identify the sections that contain a single type of speech data or a single type of noise data, generates a speech noise acoustic model for each such section, and accumulates the models in the speech noise acoustic model storage unit 14. When a model is accumulated, the label indicated by the training label information (for example, "beep") may be stored in association with the speech noise acoustic model.

(Step S103) The model generation unit 13 reads out two or more speech noise acoustic models stored in the speech noise acoustic model storage unit 14, generates a synthesized acoustic model by synthesizing them, and accumulates it in the synthesized acoustic model storage unit 15.

As described above, rather than covering every combination of speech noise acoustic models, synthesized acoustic models may be generated only for the combinations indicated by the training label information stored in the label storage unit 12. When a synthesized acoustic model is accumulated, the label indicated by the training label information (for example, "beep.target") may be stored in association with the synthesized acoustic model.

(Step S104) The label language model generation unit 16 generates a label language model using the training label information stored in the label storage unit 12 and accumulates it in the label language model storage unit 18.

(Step S105) The label language model generation unit 16 generates dictionary information and accumulates it in the dictionary information storage unit 17. The process then returns to step S101. As described above, the dictionary information associates labels with speech noise acoustic models or synthesized acoustic models. The dictionary information may therefore be, for example, information that associates the label information attached to an acoustic model with the name of the label. In that case, for example, the label information "beep" attached to an acoustic model and the label name "beep" are associated with each other in the dictionary information.

(Step S106) The reception unit 19 determines whether noise-superimposed speech data has been received. If it has, the process proceeds to step S107; otherwise, the process returns to step S101.

(Step S107) The label recognition unit 20 recognizes the labels corresponding to the noise-superimposed speech data using the speech noise acoustic models, the synthesized acoustic models, the label language model, and the dictionary information. The recognition result is accumulated in a recording medium (not shown).

(Step S108) The particle filter noise suppression unit 21 uses the recognized labels to generate clean speech data in which the noise of the noise-superimposed speech data has been suppressed. The details of this processing will be described later with reference to the flowchart of FIG. 5.

(Step S109) The storage unit 22 accumulates the clean speech data obtained after noise suppression in a recording medium (not shown). The process then returns to step S101.
In the flowchart of FIG. 4, the processing is terminated by power-off or by an interrupt that ends the processing.

FIG. 5 is a flowchart showing the details of the noise suppression processing (step S108) in the flowchart of FIG. 4.
(Step S201) The model generation unit 13 first generates an acoustic model of the background noise from the received noise-superimposed speech data and accumulates it in the speech noise acoustic model storage unit 14. The particle filter noise suppression unit 21 samples from that background noise acoustic model. Background noise is the noise present in the background while the label recognition unit 20 recognizes no label. The acoustic model is therefore generated from noise-superimposed speech data that contains no beep sound, machine noise, target utterance, or the like, for example a predetermined number of frames at the beginning of the input, and background noise is sampled from that model as described in "(I) Initialization". The model generation unit 13 may also generate, as appropriate, synthesized acoustic models obtained by synthesizing the background noise acoustic model with other acoustic models or synthesized acoustic models.

(Step S202) The particle filter noise suppression unit 21 sets the counter t to 1.
(Step S203) The particle filter noise suppression unit 21 acquires the label corresponding to frame t of the noise-superimposed speech data. This label is the one recognized by the label recognition unit 20 and is used by the particle filter means 32 and the noise component calculation means 33.

(Step S204) The feature extraction means 31 extracts the features of frame t of the noise-superimposed speech data. In the present embodiment, the feature extraction means 31 generates a log mel spectrum by applying mel filter bank analysis to frame t of the noise-superimposed speech data.
(Step S205) The particle filter means 32 executes the particle filter processing on the features of frame t.

Specifically, when the label acquired for frame t indicates that the target utterance is included, speech particles are sampled from the acoustic model of the speech data according to equation (11). When the label indicates that noise is included, noise particles are sampled from the acoustic model corresponding to that noise according to equation (12). When both speech particles and noise particles have been sampled, the particle filter means 32 generates a set of noise-superimposed speech particles by combining the two. In that case, the total number of particles K is K = I × J + I, where I is the number of speech particles and J is the number of noise particles. The "I" in the second term on the right-hand side is the number of background noise particles.
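The particle count K = I × J + I can be illustrated with the sketch below, which pairs every speech particle with every noise particle and then appends the background noise particles. It is a reading of the text under assumptions: features are taken to be log mel spectra, so speech and noise are combined as log(exp(s) + exp(n)), which may not match the combination function the embodiment defines, and the names `log_add` and `build_particle_set` are hypothetical.

```python
import math
from itertools import product


def log_add(a, b):
    """Elementwise log(exp(a) + exp(b)), computed in a numerically stable way."""
    out = []
    for x, y in zip(a, b):
        m = max(x, y)
        out.append(m + math.log(math.exp(x - m) + math.exp(y - m)))
    return out


def build_particle_set(speech, noise, background):
    """Combine I speech and J noise particles (I*J pairs), then append background particles."""
    combined = [log_add(s, n) for s, n in product(speech, noise)]
    return combined + [list(b) for b in background]
```

With I = 2 speech particles, J = 3 noise particles, and 2 background noise particles, the resulting set holds 2 × 3 + 2 = 8 particles.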

The particle filter means 32 also updates all particles using the extended Kalman filter. That is, for each particle, the particle filter means 32 estimates the mean $\hat{n}_t^{(k)}$ and the variance $\hat{\Sigma}_{n_t}^{(k)}$ of $n_t^{(k)}$.

The particle filter means 32 further calculates, for all particles, the weight $w_t^{(k)} \propto w_{t-1}^{(k)}\, p(x_t \mid n_t^{(k)})$. For $w_{t-1}^{(k)}$, the value calculated one frame earlier is used. The likelihood $p(x_t \mid n_t^{(k)})$ can be calculated using the estimated mean $\hat{n}_t^{(k)}$ and variance $\hat{\Sigma}_{n_t}^{(k)}$. The particle filter means 32 then normalizes the calculated weights $w_t^{(k)}$.

After that, the particle filter means 32 applies residual sampling and Metropolis-Hastings sampling.

Note that, in this particle filter processing, if the label acquired in step S203 matches the label of the immediately preceding frame, that is, frame (t − 1), the particle filter means 32 need not sample speech particles or noise particles. In that case, the particle filter means 32 performs the particle filtering using the estimation results from the preceding frame. If, on the other hand, the label acquired in step S203 does not match the label of the preceding frame, the particle filter means 32 discards the particles held so far, samples speech particles and noise particles as appropriate according to equations (11) and (12) as described above, and proceeds with the newly sampled particles.
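The label-dependent choice between reusing and resampling particles can be expressed as a small helper. This is a sketch only: `particles_for_frame` and the callback `sample_for_label` (standing in for sampling according to equations (11) and (12)) are hypothetical names.

```python
def particles_for_frame(label, prev_label, prev_particles, sample_for_label):
    """Reuse the previous particles when the label is unchanged; otherwise resample.

    `sample_for_label` is a caller-supplied function that draws fresh particles
    from the acoustic model(s) associated with `label`.
    """
    if prev_particles is not None and label == prev_label:
        return prev_particles  # continue filtering from the previous estimates
    return sample_for_label(label)  # label changed: discard and draw new particles
```

Keeping the particles while the label is stable lets the filter track gradual changes in the same noise, while a label change triggers a fresh start for the newly detected sound class.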

(Step S206) The noise component calculation means 33 calculates the noise component corresponding to frame t using the mean $\hat{n}_t^{(k)}$ and variance $\hat{\Sigma}_{n_t}^{(k)}$ of each particle estimated by the particle filter means 32, the weight of each particle, and the speech noise acoustic model corresponding to clean speech data.

Specifically, the noise component calculation means 33 estimates the posterior distribution of the noise using equation (15). Using the normalized weights $\hat{w}_t^{(k)}$, the means $\hat{n}_t^{(k)}$, and the variances $\hat{\Sigma}_{n_t}^{(k)}$ calculated by the particle filter means 32, the noise component calculation means 33 can evaluate the middle expression of equation (15).

Next, the noise component calculation means 33 evaluates equation (17). In equation (17), $w_{s,l}$ is obtained from the acoustic model corresponding to the speech data, and $\mu_{x,l}$ and $\Sigma_{x,l}$ are the mean vector and the covariance matrix, respectively, of the acoustic model corresponding to the label of each frame. Because the noise has already been estimated by equation (15), when the label recognized by the label recognition unit 20 indicates that the target utterance is not included, $\mu_{x,l}$ and $\Sigma_{x,l}$ are equal to the noise mean vector and covariance matrix calculated by equation (15). When the recognized label indicates that the target utterance is included, the mean vector and covariance matrix are calculated for the model obtained by synthesizing the noise model calculated by equation (15) with the acoustic model for the target utterance stored in the speech noise acoustic model storage unit 14. These calculated values become $\mu_{x,l}$ and $\Sigma_{x,l}$.

The noise component calculation means 33 also calculates $g(s_{t,l}, n_{t,l})$ using the form of equation (4) that corresponds to the label recognized by the label recognition unit 20. The term $\mu_{x,l}$ on the right-hand side of equation (4) can be calculated as described above, and the term $\mu_{s,l}$ on the right-hand side of equation (4) can be calculated using the acoustic data corresponding to the target utterance. In this way, the noise component calculation means 33 can calculate the second term on the right-hand side of equation (16).

Although FIG. 2 shows the speech noise acoustic models being input to the noise component calculation means 33, strictly speaking, the noise component calculation means 33 uses only the acoustic model corresponding to the target utterance.

The present embodiment describes the case where the noise component is removed from the speech signal, so the noise component calculation means 33 converts the calculated second term on the right-hand side of equation (16) into an impulse response (a time-domain parameter). The noise component calculation means 33 may use, for example, MEL-warped IDCT, which performs the conversion by mapping the mel spectrum onto a linear spectrum (see, for example, the following reference).

Reference: W. Herbordt, T. Horiuchi, M. Fujimoto, T. Jitsuhiro, S. Nakamura, "Hands-free speech recognition and communication on PDAs using microphone array technology", Proc. of ASRU 2005, pp. 302-307, 2005.

(Step S207) The noise suppression means 34 obtains clean speech data by removing the noise component calculated by the noise component calculation means 33 from the noise-superimposed speech data. For example, when the noise component is given as an impulse response, the noise suppression means 34 can obtain frame t of the clean speech data by convolving that impulse response with frame t of the noise-superimposed speech data. Frame t of the clean speech data may be passed to the storage unit 22 immediately, or it may be stored temporarily in a recording medium (not shown) until it is passed to the storage unit 22.
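The convolution of an impulse response with one frame can be sketched as a direct time-domain sum. This is a simplified illustration: the output is truncated to the frame length, so the tail that would spill into the next frame is ignored, and the name `apply_impulse_response` is hypothetical.

```python
def apply_impulse_response(frame, impulse_response):
    """Convolve a frame with an impulse response, keeping the first len(frame) samples."""
    out = [0.0] * len(frame)
    for i in range(len(frame)):
        for j, h in enumerate(impulse_response):
            if i - j < 0:
                break  # no earlier samples within this frame
            out[i] += h * frame[i - j]
    return out
```

An impulse response of `[1.0]` leaves the frame unchanged, and `[0.0, 1.0]` delays it by one sample, which is a quick way to check the indexing.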

(Step S208) The particle filter noise suppression unit 21 increments the counter t by 1.
(Step S209) The particle filter noise suppression unit 21 determines whether frame t exists in the noise-superimposed speech data. If it does, the process returns to step S203; otherwise, the process returns to the flowchart of FIG. 4.
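The per-frame control flow of steps S202 to S209 can be summarized as a loop. The sketch below only mirrors the order of the steps; the four callables stand in for the feature extraction, particle filter, noise component calculation, and noise suppression means, and all names are hypothetical.

```python
def suppress_frames(frames, labels, extract, filter_step, noise_component, remove):
    """Run the per-frame loop of FIG. 5 and return the list of clean frames."""
    clean = []
    t = 1                                           # S202: initialize the counter
    while t <= len(frames):                         # S209: does frame t exist?
        label = labels[t - 1]                       # S203: label for frame t
        features = extract(frames[t - 1])           # S204: feature extraction
        state = filter_step(features, label)        # S205: particle filter
        component = noise_component(state, label)   # S206: noise component
        clean.append(remove(frames[t - 1], component))  # S207: noise removal
        t += 1                                      # S208: increment the counter
    return clean
```

Because each clean frame is produced inside the loop, a caller could equally well hand frames to storage one by one instead of collecting them in a list.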

Note that the flowchart of FIG. 4 shows the case where the clean speech data is accumulated after the noise suppression has been completed. When the noise suppression is performed as shown in the flowchart of FIG. 5, however, the storage unit 22 may instead accumulate the frames of clean speech data one after another as they are produced.

Next, the operation of the noise suppression device 1 according to the present embodiment will be described using a specific example.
FIG. 6 shows an example of the training label information stored in the label storage unit 12. The training label information associates each label with the time span of the training data to which the label applies. Each time span consists of a start time and an end time, and the unit in FIG. 6 is seconds. This training label information indicates, for example, that the training data from 0.5 seconds to 0.8 seconds is a beep sound, and that from 0.8 seconds to 1.0 seconds a beep sound and the target utterance are superimposed.

Suppose that, with training data stored in the training data storage unit 11 and training label information stored in the label storage unit 12, the user operates an input device (not shown; for example, a mouse or a keyboard) to input an instruction to generate models to the noise suppression device 1. The model generation unit 13 then determines that it is time to generate models (step S101). Referring to the label storage unit 12, the model generation unit 13 identifies the labels that correspond to a single type of speech data or a single type of noise data. By acquiring the times associated with those labels from the training label information, it identifies the sections containing a single type of speech data or a single type of noise data. The model generation unit 13 then generates a speech noise acoustic model for each identified section and accumulates the models in the speech noise acoustic model storage unit 14 (step S102). When accumulating each speech noise acoustic model, the model generation unit 13 stores it in association with the name of the corresponding label.

Referring to the label storage unit 12, the model generation unit 13 also identifies the labels that correspond to two or more types of speech data or noise data. It identifies the speech data and noise data included in each such label and reads the corresponding speech noise acoustic models from the speech noise acoustic model storage unit 14. The model generation unit 13 then synthesizes the speech noise acoustic models it has read to generate the synthesized acoustic model corresponding to that label and accumulates it in the synthesized acoustic model storage unit 15 (step S103). When accumulating each synthesized acoustic model, the model generation unit 13 stores it in association with the name of the corresponding label.
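The separation of single-class labels from combined labels can be sketched as follows. The sketch assumes that training label information is available as `(label, start, end)` tuples and that combined labels join their component names with a period, as in "beep.target"; the function name `split_training_labels` and the "target" entry in the usage example are hypothetical.

```python
def split_training_labels(entries):
    """Separate single-class entries from combined ones such as "beep.target".

    Returns (single, combined): `single` maps a label to its (start, end) sections,
    and `combined` maps a combined label to the list of its component labels.
    """
    single, combined = {}, {}
    for label, start, end in entries:
        parts = label.split(".")
        if len(parts) == 1:
            single.setdefault(label, []).append((start, end))
        else:
            combined[label] = parts
    return single, combined


entries = [("beep", 0.5, 0.8), ("beep.target", 0.8, 1.0), ("target", 1.0, 1.4)]
single, combined = split_training_labels(entries)
```

Here `single` lists the sections from which individual models would be trained, and `combined` lists which individual models would be synthesized for each combined label, so only combinations that actually occur in the training labels are built.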

The label language model generation unit 16 generates an N-gram model of the labels using the training label information stored in the label storage unit 12 and accumulates it in the label language model storage unit 18 (step S104). The label language model generation unit 16 also generates dictionary information from the training label information and accumulates it in the dictionary information storage unit 17 (step S105). This dictionary information is as shown in FIG. 7. In FIG. 7, the dictionary information associates the name of each label, which identifies the label, with information that identifies an acoustic model. In this specific example, the label name is also used as the information that identifies the acoustic model, so the two entries are identical.
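As an illustration of a label N-gram model and of dictionary information, the sketch below estimates unsmoothed bigram probabilities (N = 2) from a label sequence and builds a dictionary in which each label name maps to itself. The choice of N, the absence of smoothing, and the names `train_label_bigram` and `build_dictionary` are assumptions made for brevity.

```python
from collections import Counter, defaultdict


def train_label_bigram(label_sequence):
    """Maximum-likelihood bigram probabilities P(next label | current label)."""
    pair_counts = Counter(zip(label_sequence, label_sequence[1:]))
    context_counts = Counter(label_sequence[:-1])
    model = defaultdict(dict)
    for (current, nxt), count in pair_counts.items():
        model[current][nxt] = count / context_counts[current]
    return dict(model)


def build_dictionary(labels):
    """Map each label name to an acoustic-model identifier (here, the same string)."""
    return {label: label for label in labels}
```

A practical label language model would normally add smoothing so that label transitions unseen in training do not receive zero probability.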

Next, suppose that noise-superimposed speech picked up by a microphone is converted into a digital signal by an A/D converter (not shown) and accumulated, and that the accumulated series of noise-superimposed speech data is received by the reception unit 19 (step S106). The label recognition unit 20 then performs label recognition, by the same technique as speech recognition, using the speech noise acoustic models, the synthesized acoustic models, the dictionary information, and the label language model, that is, the N-gram model (step S107). For example, as shown in FIG. 8, when noise-superimposed speech data containing a beep sound (beep), machine noise (machine noise), a target utterance (target utterance), and the like is received, labels such as "beep", "beep.machine", and "machine" are assigned to the respective sections of the noise-superimposed speech data. In the label recognition, recognition label information that associates the name of each recognized label with the time span of that label, as shown in FIG. 9, may be constructed and stored temporarily in a recording medium (not shown). The unit of time in the recognition label information shown in FIG. 9 is the frame number.
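Recognition label information of this kind can be queried per frame as sketched below. The sketch assumes entries of the form `(label, start_frame, end_frame)` with both ends inclusive, which the text does not state; the function name `label_for_frame` and the frame numbers in the usage example are hypothetical.

```python
def label_for_frame(recognized, t):
    """Return the label whose [start, end] frame range contains frame t, else None."""
    for label, start, end in recognized:
        if start <= t <= end:
            return label
    return None


recognized = [("beep", 1, 30), ("beep.machine", 31, 50)]
first = label_for_frame(recognized, 1)
later = label_for_frame(recognized, 31)
```

A lookup like this is what the per-frame noise suppression would consult in step S203 to obtain the label for the current frame.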

Next, the noise suppression processing performed by the particle filter noise suppression unit 21 will be described. First, the model generation unit 13 generates an acoustic model of the background noise from the noise-superimposed speech data in the section preceding the leftmost beep sound shown in FIG. 8 and accumulates it in the speech noise acoustic model storage unit 14. The particle filter noise suppression unit 21 samples from the generated acoustic model. Next, the noise component calculation means 33 acquires the label name "beep" corresponding to frame 1 from the recognition label information generated by the label recognition unit 20 (steps S202 and S203). The feature extraction means 31 generates the log mel spectrum corresponding to frame 1 of the noise-superimposed speech data and passes it to the particle filter means 32 (step S204). The particle filter means 32 samples from the speech noise acoustic model corresponding to the beep sound. Because the target utterance is not included in this frame, no sampling is performed from the acoustic model corresponding to the target utterance. The particle filter means 32 then updates the background noise particles and the beep sound particles using the extended Kalman filter, and calculates and normalizes the weight of each particle. After that, the particle filter means 32 applies residual sampling and Metropolis-Hastings sampling.

Next, the noise component calculation means 33 calculates the noise component using the mean $\hat{n}_t^{(k)}$ and variance $\hat{\Sigma}_{n_t}^{(k)}$ of each particle estimated by the particle filter means 32, the weight of each particle, and the speech noise acoustic model corresponding to clean speech data. Because frame 1 does not include the target utterance, the noise component is calculated using the lower expression of equation (4) (step S206). The calculated noise component is converted into an impulse response using MEL-warped IDCT as described above and passed to the noise suppression means 34. The noise suppression means 34 generates frame 1 of the clean speech data by convolving the impulse response with frame 1 of the noise-superimposed speech data and passes it to the storage unit 22 (step S207). In this way, the particle filter noise suppression unit 21 executes the particle-filter-based noise suppression processing on each frame in turn (steps S108, S203 to S209).

The storage unit 22 accumulates the frames of clean speech data received from the particle filter noise suppression unit 21, one after another, in a recording medium (not shown) (step S109). In this way, the noise suppression processing is applied to the noise-superimposed speech data and clean speech data is obtained. The clean speech data may be used, for example, for a user to listen to, for speech recognition processing as described in Embodiment 2 below, or for other processing.

As described above, the noise suppression device 1 according to the present embodiment obtains the maximum likelihood label sequence of the noise-superimposed speech data using multiple-model synthesis, and performs noise suppression processing extended to multiple-model synthesis in accordance with that label sequence. As a result, even when a new type of noise is input suddenly, the type of that sudden noise can be identified from the recognized label, and the suddenly input noise can be suppressed appropriately. This could not be achieved by a noise suppression method that uses only a particle filter. In addition, because the particle filter allows the noise to be estimated dynamically using information obtained from the input noise-superimposed speech data as well, the device can also cope with fluctuations in the noise. This could not be achieved by the conventional noise suppression methods described in Non-Patent Documents 6 and 7. The noise suppression device 1 according to the present embodiment can thus respond appropriately both to the sudden input of new noise and to noise fluctuations. Specific experimental results are described in Embodiment 2.

（実施の形態２） (Embodiment 2)
本発明の実施の形態２による音声認識装置について、図面を参照しながら説明する。本実施の形態による音声認識装置は、実施の形態１による雑音抑圧装置を備え、その雑音抑圧装置による雑音抑圧後のクリーン音声データに対して、音声認識処理を行うものである。 A speech recognition device according to Embodiment 2 of the present invention will be described with reference to the drawings. The speech recognition device according to the present embodiment includes the noise suppression device according to Embodiment 1, and performs speech recognition processing on the clean speech data obtained after noise suppression by that noise suppression device.

図１０は、本実施の形態による音声認識装置２の構成を示すブロック図である。本実施の形態による音声認識装置２は、実施の形態１による雑音抑圧装置１の各構成要素に加えて、音声認識用音響モデル記憶部４１と、言語モデル記憶部４２と、音声認識用辞書情報記憶部４３と、音声認識部４４と、出力部４５とを備える。このように、本実施の形態による音声認識装置２は、実施の形態１による雑音抑圧装置１を含んでいることになる。なお、音声認識用音響モデル記憶部４１、言語モデル記憶部４２、音声認識用辞書情報記憶部４３、音声認識部４４、出力部４５以外の構成及び動作は、実施の形態１と同様であり、その説明を省略する。 FIG. 10 is a block diagram showing the configuration of the speech recognition device 2 according to the present embodiment. In addition to the components of the noise suppression device 1 according to Embodiment 1, the speech recognition device 2 according to the present embodiment includes a speech recognition acoustic model storage unit 41, a language model storage unit 42, a speech recognition dictionary information storage unit 43, a speech recognition unit 44, and an output unit 45. The speech recognition device 2 according to the present embodiment thus contains the noise suppression device 1 according to Embodiment 1. The configurations and operations other than those of the speech recognition acoustic model storage unit 41, the language model storage unit 42, the speech recognition dictionary information storage unit 43, the speech recognition unit 44, and the output unit 45 are the same as in Embodiment 1, and their description is omitted.

音声認識用音響モデル記憶部４１では、音声認識の対象となる音声データに関する音響モデルが記憶される。なお、この音響モデルは、音声認識用のものである。音声認識用の音響モデルは、すでに公知であり、その詳細な説明を省略する。音声認識用音響モデル記憶部４１に音響モデルが記憶される過程は問わない。例えば、記録媒体を介して音響モデルが音声認識用音響モデル記憶部４１で記憶されるようになってもよく、あるいは、通信回線等を介して送信された音響モデルが音声認識用音響モデル記憶部４１で記憶されるようになってもよい。 The speech recognition acoustic model storage unit 41 stores an acoustic model for the speech data that is the target of speech recognition. This acoustic model is for speech recognition. Acoustic models for speech recognition are already known and are not described in detail. The process by which the acoustic model comes to be stored in the speech recognition acoustic model storage unit 41 does not matter. For example, the acoustic model may be stored in the speech recognition acoustic model storage unit 41 via a recording medium, or an acoustic model transmitted via a communication line or the like may be stored in the speech recognition acoustic model storage unit 41.

音声認識用音響モデル記憶部４１での記憶は、外部のストレージデバイス等から読み出した音響モデルのＲＡＭ等における一時的な記憶でもよく、あるいは、長期的な記憶でもよい。音声認識用音響モデル記憶部４１は、所定の記録媒体（例えば、半導体メモリや磁気ディスク、光ディスクなど）によって実現されうる。 The storage in the acoustic model storage unit 41 for speech recognition may be temporary storage in the RAM of the acoustic model read from an external storage device or the like, or may be long-term storage. The acoustic model storage unit 41 for speech recognition can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, an optical disk, etc.).

言語モデル記憶部４２では、音声認識の認識対象言語に関する言語モデルが記憶される。この言語モデルは、音声認識用のものであり、例えば、バイグラムの言語モデルや、トライグラムの言語モデル等である。音声認識用の言語モデルは、すでに公知であり、その詳細な説明を省略する。言語モデル記憶部４２に言語モデルが記憶される過程は問わない。例えば、記録媒体を介して言語モデルが言語モデル記憶部４２で記憶されるようになってもよく、あるいは、通信回線等を介して送信された言語モデルが言語モデル記憶部４２で記憶されるようになってもよい。 The language model storage unit 42 stores a language model for the language that is the target of speech recognition. This language model is for speech recognition and is, for example, a bigram language model or a trigram language model. Language models for speech recognition are already known and are not described in detail. The process by which the language model comes to be stored in the language model storage unit 42 does not matter. For example, the language model may be stored in the language model storage unit 42 via a recording medium, or a language model transmitted via a communication line or the like may be stored in the language model storage unit 42.

言語モデル記憶部４２での記憶は、外部のストレージデバイス等から読み出した言語モデルのＲＡＭ等における一時的な記憶でもよく、あるいは、長期的な記憶でもよい。言語モデル記憶部４２は、所定の記録媒体（例えば、半導体メモリや磁気ディスク、光ディスクなど）によって実現されうる。 Storage in the language model storage unit 42 may be temporary storage in a RAM or the like of a language model read from an external storage device or the like, or may be long-term storage. The language model storage unit 42 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, an optical disk, etc.).

音声認識用辞書情報記憶部４３では、音声認識で用いる音声認識用辞書情報が記憶される。音声認識用辞書情報は、例えば、音素列と単語とを対応付ける情報である。音声認識用の辞書情報は、すでに公知であり、その詳細な説明を省略する。音声認識用辞書情報記憶部４３に情報が記憶される過程は問わない。例えば、記録媒体を介して音声認識用辞書情報が音声認識用辞書情報記憶部４３で記憶されるようになってもよく、あるいは、通信回線等を介して送信された音声認識用辞書情報が音声認識用辞書情報記憶部４３で記憶されるようになってもよい。 The speech recognition dictionary information storage unit 43 stores speech recognition dictionary information used for speech recognition. The speech recognition dictionary information is information that associates phoneme strings with words, for example. The dictionary information for speech recognition is already known and will not be described in detail. The process in which information is stored in the dictionary information storage unit 43 for speech recognition does not matter. For example, the speech recognition dictionary information may be stored in the speech recognition dictionary information storage unit 43 via a recording medium, or the speech recognition dictionary information transmitted via a communication line or the like may be stored as speech. It may be stored in the recognition dictionary information storage unit 43.

音声認識用辞書情報記憶部４３での記憶は、外部のストレージデバイス等から読み出した音声認識用辞書情報のＲＡＭ等における一時的な記憶でもよく、あるいは、長期的な記憶でもよい。音声認識用辞書情報記憶部４３は、所定の記録媒体（例えば、半導体メモリや磁気ディスク、光ディスクなど）によって実現されうる。 Storage in the speech recognition dictionary information storage unit 43 may be temporary storage in the RAM or the like of speech recognition dictionary information read from an external storage device or the like, or may be long-term storage. The voice recognition dictionary information storage unit 43 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, an optical disk, or the like).

なお、訓練データ記憶部１１、ラベル記憶部１２、音声雑音音響モデル記憶部１４、合成音響モデル記憶部１５、辞書情報記憶部１７、ラベル言語モデル記憶部１８、蓄積部２２がクリーン音声データを蓄積する図示しない記録媒体、音声認識用音響モデル記憶部４１、言語モデル記憶部４２、音声認識用辞書情報記憶部４３のうち、任意の２以上の記録媒体は、同一の記録媒体によって実現されてもよく、別々の記録媒体によって実現されてもよい。 Note that, among the training data storage unit 11, the label storage unit 12, the speech noise acoustic model storage unit 14, the synthesized acoustic model storage unit 15, the dictionary information storage unit 17, the label language model storage unit 18, the recording medium (not shown) in which the accumulation unit 22 accumulates clean speech data, the speech recognition acoustic model storage unit 41, the language model storage unit 42, and the speech recognition dictionary information storage unit 43, any two or more recording media may be realized by the same recording medium or by separate recording media.

音声認識部４４は、雑音抑圧装置１が生成したクリーン音声データを、音響モデル、音声認識用辞書情報、及び、言語モデルを用いて音声認識する。この音声認識の処理は、すでに公知であり、その詳細な説明を省略する。音声認識の処理については、例えば、次の文献を参照されたい。 The speech recognition unit 44 performs speech recognition on the clean speech data generated by the noise suppression device 1, using the acoustic model, the speech recognition dictionary information, and the language model. This speech recognition processing is already known and is not described in detail. For the speech recognition processing, see, for example, the following document.
文献：鹿野清宏、河原達也、山本幹雄、伊藤克亘、武田一哉、「ＩＴＴｅｘｔ音声認識システム」、オーム社、２００１年 Document: Kiyohiro Shikano, Tatsuya Kawahara, Mikio Yamamoto, Katsunobu Ito, Kazuya Takeda, "IT Text Speech Recognition System", Ohmsha, 2001

出力部４５は、音声認識部４４による音声認識結果を出力する。この音声認識結果は、例えば、テキストデータである。ここで、この出力は、例えば、表示デバイス（例えば、ＣＲＴや液晶ディスプレイなど）への表示でもよく、所定の機器への通信回線を介した送信でもよく、プリンタによる印刷でもよく、記録媒体への蓄積でもよく、他の構成要素への引き渡しでもよい。なお、出力部４５は、出力を行うデバイス（例えば、表示デバイスやプリンタなど）を含んでもよく、あるいは含まなくてもよい。また、出力部４５は、ハードウェアによって実現されてもよく、あるいは、それらのデバイスを駆動するドライバ等のソフトウェアによって実現されてもよい。 The output unit 45 outputs the voice recognition result by the voice recognition unit 44. This voice recognition result is, for example, text data. Here, the output may be, for example, display on a display device (for example, a CRT or a liquid crystal display), transmission via a communication line to a predetermined device, printing by a printer, or output to a recording medium. It may be accumulated or delivered to another component. The output unit 45 may or may not include an output device (for example, a display device or a printer). The output unit 45 may be realized by hardware, or may be realized by software such as a driver that drives these devices.

次に、本実施の形態による雑音抑圧装置の動作について、図１１のフローチャートを用いて説明する。なお、図１１のフローチャートにおいて、ステップＳ３０１〜Ｓ３０３以外の処理は、実施の形態１の図４のフローチャートと同様であり、その説明を省略する。 Next, the operation of the noise suppression device according to the present embodiment will be described using the flowchart of FIG. 11. In the flowchart of FIG. 11, the processes other than steps S301 to S303 are the same as those in the flowchart of FIG. 4 of Embodiment 1, and their description is omitted.

（ステップＳ３０１）音声認識部４４は、音声認識処理を行うタイミングかどうか判断する。そして、音声認識処理を行うタイミングである場合には、ステップＳ３０２に進み、そうでない場合には、ステップＳ１０１に戻る。 (Step S301) The speech recognition unit 44 determines whether it is time to perform speech recognition processing. If it is time to perform the speech recognition process, the process proceeds to step S302; otherwise, the process returns to step S101.

（ステップＳ３０２）音声認識部４４は、音声認識用音響モデル記憶部４１で記憶されている音響モデル、言語モデル記憶部４２で記憶されている言語モデル、音声認識用辞書情報記憶部４３で記憶されている音声認識用辞書情報を用いて、蓄積部２２が蓄積したクリーン音声データに対する音声認識処理を行う。 (Step S302) The speech recognition unit 44 performs speech recognition processing on the clean speech data accumulated by the accumulation unit 22, using the acoustic model stored in the speech recognition acoustic model storage unit 41, the language model stored in the language model storage unit 42, and the speech recognition dictionary information stored in the speech recognition dictionary information storage unit 43.

（ステップＳ３０３）出力部４５は、音声認識部４４が音声認識処理を行った音声認識結果を出力する。そして、ステップＳ１０１に戻る。 (Step S303) The output unit 45 outputs the speech recognition result produced by the speech recognition unit 44. Then, the process returns to step S101.
なお、図１１のフローチャートにおいて、電源オフや処理終了の割り込みにより処理は終了する。 Note that in the flowchart of FIG. 11, the processing ends upon power-off or an interrupt that terminates the processing.
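The control flow of steps S301 to S303 can be illustrated with a minimal Python sketch. All function and parameter names here (`run_recognition_loop`, `suppress`, `is_recognition_time`, `recognize`, `output`) are hypothetical stand-ins and do not come from the specification. Clearing the accumulated frames after each recognition pass is also an assumption, since the text does not state how the accumulated data is handled afterward.

```python
from typing import Any, Callable, Iterable, List


def run_recognition_loop(
    frames: Iterable[Any],
    suppress: Callable[[Any], Any],
    is_recognition_time: Callable[[List[Any]], bool],
    recognize: Callable[[List[Any]], Any],
    output: Callable[[Any], None],
) -> List[Any]:
    """Hypothetical sketch of the FIG. 11 loop (not the patented implementation)."""
    accumulated: List[Any] = []  # stands in for the medium used by unit 22
    results: List[Any] = []
    for noisy_frame in frames:
        # Embodiment 1 processing (steps S101 to S109), reduced to one call.
        accumulated.append(suppress(noisy_frame))
        # Step S301: decide whether it is time to run speech recognition.
        if is_recognition_time(accumulated):
            # Step S302: recognize the accumulated clean speech data.
            result = recognize(accumulated)
            # Step S303: output the result, then continue from step S101.
            output(result)
            results.append(result)
            accumulated = []  # assumption: start a fresh buffer
    return results
```

The timing test is passed in as a callable because the specification leaves open what counts as the right moment for recognition.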

また、音声認識処理の具体例は、すでに公知であり、音声認識処理以外の具体例は実施の形態１と同様であるため、音声認識装置２の動作の具体例の説明を省略する。 Since specific examples of speech recognition processing are already known, and specific examples other than the speech recognition processing are the same as in Embodiment 1, a description of a specific example of the operation of the speech recognition device 2 is omitted.

次に、本実施の形態による音声認識装置２の実験例について説明する。この実験例では、ある病院において看護師が実作業を行いつつ録音したデータを、訓練データ、及び雑音重畳音声データとして用いた。具体的には、初日分のデータを雑音重畳音声データとし、２日目分をモデル学習のために用いる訓練データとした。なお、訓練データに対応する訓練ラベル情報は、訓練データをもとに人手によって作成したものである。 Next, an experimental example of the speech recognition apparatus 2 according to this embodiment will be described. In this experimental example, data recorded while a nurse performed actual work in a hospital was used as training data and noise-superimposed speech data. Specifically, the data for the first day was used as noise superimposed voice data, and the data for the second day was used as training data used for model learning. The training label information corresponding to the training data is created manually based on the training data.

図１２は、詳細な実験条件を示す表である。訓練データ、及び雑音重畳音声データに含まれるデータは、１０秒間の長さであり、目的発話を含むものである。そのデータは、病院にてサンプリング周波数３２ｋＨｚ、１６ｂｉｔで収録後、１６ｋＨｚにダウンサンプリングした。勤務シフトの関係で、訓練データの評価話者は女性８名となった。音声認識器などのツールには、ＡＴＲ音声言語コミュニケーション研究所で開発されたＡＴＲＡＳＲ大語彙音声認識システムＶｅｒ．３．６を用いた。雑音抑圧で用いる特徴量の抽出やＧＭＭの学習にはＨＴＫＶｅｒ．３．３を用いた。特徴量としては、２４次対数メルフィルタバンクの出力（ＦＢＡＮＫ）を用いた。雑音ラベル認識には、ＦＢＡＮＫモデルから変換して得たＭＦＣＣモデルを用いた。クリーン音声モデルとして、話者適応ＧＭＭを用いた。その他の音声や雑音モデルには、４，８，１２混合分布を持つＧＭＭを用いた。共分散行列として、すべて対角共分散行列を用いた。学習データから３２種類の雑音モデルが得られ、合成モデルも含むモデルの合成数は１９４であった。合成モデルにおいて、一つのモデルに合成されたモデルの数は最大３個であった。また、背景雑音は、各雑音重畳音声データに対して推定され、すべてのモデルに合成された。これら１９４モデルをマルチラベル辞書のエントリーとした。未知マルチラベル率は、３．７７％であった。 FIG. 12 is a table showing the detailed experimental conditions. Each item of the training data and the noise-superimposed speech data is 10 seconds long and contains a target utterance. The data was recorded at the hospital at a sampling frequency of 32 kHz with 16-bit resolution and then downsampled to 16 kHz. Owing to work shifts, the evaluation speakers in the training data were eight women. ATRASR large-vocabulary speech recognition system Ver. 3.6, developed at the ATR Spoken Language Communication Research Laboratories, was used for tools such as the speech recognizer. HTK Ver. 3.3 was used to extract the features used for noise suppression and to train the GMMs. The features were the outputs of a 24-channel log mel filter bank (FBANK). For noise label recognition, MFCC models obtained by conversion from the FBANK models were used. A speaker-adapted GMM was used as the clean speech model. GMMs with 4, 8, or 12 mixture components were used for the other speech and noise models. All covariance matrices were diagonal. Thirty-two types of noise models were obtained from the training data, and the number of models, including synthesized models, was 194. At most three models were combined into any one synthesized model. Background noise was estimated for each item of noise-superimposed speech data and synthesized into all of the models. These 194 models were used as the entries of the multi-label dictionary. The unknown multi-label rate was 3.77%.
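The text states only that the MFCC models were obtained by conversion from the FBANK models and does not give the conversion itself. One common way to do this for a diagonal-covariance Gaussian is to apply the discrete cosine transform to the mean and keep the diagonal of the transformed covariance. The following Python sketch shows that approach under several assumptions: the DCT scaling, the omission of the zeroth coefficient, and the default of 12 cepstral coefficients are illustrative choices, and the names `dct_matrix` and `fbank_gaussian_to_mfcc` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def dct_matrix(n_ceps: int, n_filters: int) -> List[List[float]]:
    """Rows k = 1..n_ceps of a DCT-II matrix (c0 omitted; scaling assumed)."""
    scale = math.sqrt(2.0 / n_filters)
    return [
        [scale * math.cos(math.pi * k * (j - 0.5) / n_filters)
         for j in range(1, n_filters + 1)]
        for k in range(1, n_ceps + 1)
    ]


def fbank_gaussian_to_mfcc(
    mean: Sequence[float],
    var: Sequence[float],
    n_ceps: int = 12,  # assumption: the source does not give this number
) -> Tuple[List[float], List[float]]:
    """Map a diagonal Gaussian over log mel outputs to the cepstral domain.

    The mean is transformed linearly (C @ mean). The exact covariance would
    be C diag(var) C^T; only its diagonal is kept so the model stays diagonal.
    """
    c = dct_matrix(n_ceps, len(mean))
    mfcc_mean = [sum(cij * m for cij, m in zip(row, mean)) for row in c]
    mfcc_var = [sum(cij * cij * v for cij, v in zip(row, var)) for row in c]
    return mfcc_mean, mfcc_var
```

Dropping the off-diagonal covariance terms is an approximation. It matches the statement that all covariance matrices were diagonal, but the original system may have handled this step differently.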

また、パーティクルフィルタに関して、従来法（前述の非特許文献５参照）では、３００パーティクルを用い、本実施の形態による手法では１１０パーティクルを用いた。これは雑音抑圧処理がほぼ同程度となる設定である（ＩｎｔｅｌＰｅｎｔｉｕｍ（登録商標）−Ｄ３．２ＧＨｚでの計測でＲｅａｌＴｉｍｅＦａｃｔｏｒが約２．５である）。本実施の形態による手法では、雑音ラベル認識の処理時間も含んでいる。音声認識器などのツールには、ＡＴＲ音声言語コミュニケーション研究所で開発されたＡＴＲＡＳＲ大語彙音声認識システムＶｅｒ．３．６を用いた。音声認識用音響モデルの構造学習には、ＭＤＬ−ＳＳＳ（次の文献１参照）を用いた。この実験では話者が女性だけのため、再学習で作成した５混合分布の女声モデルのみを用いた。話者適応手法として、音声認識系音響モデルにはＭＡＰ−ＶＦＳ（次の文献２参照）を用いた。 Regarding the particle filter, the conventional method (see Non-Patent Document 5 mentioned above) used 300 particles, and the method according to the present embodiment used 110 particles. These settings make the noise suppression processing cost roughly equal for the two methods (a real-time factor of about 2.5, measured on an Intel Pentium (registered trademark) D at 3.2 GHz). For the method according to the present embodiment, this figure includes the processing time for noise label recognition. ATRASR large-vocabulary speech recognition system Ver. 3.6, developed at the ATR Spoken Language Communication Research Laboratories, was used for tools such as the speech recognizer. MDL-SSS (see Document 1 below) was used for structure learning of the acoustic model for speech recognition. Because all speakers in this experiment were women, only a female-voice model with 5 mixture components, created by retraining, was used. As the speaker adaptation method, MAP-VFS (see Document 2 below) was applied to the acoustic model for speech recognition.
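For reference, the real-time factor is conventionally the ratio of processing time to audio duration. The Python sketch below uses that conventional definition, which is assumed here rather than stated in the source, and the function name `real_time_factor` is hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (conventional RTF)."""
    if audio_seconds <= 0.0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

Under this definition, a real-time factor of about 2.5 means that one of the 10-second recordings takes roughly 25 seconds to process.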

文献１：Ｔ．Ｊｉｔｓｕｈｉｒｏ，Ｔ．Ｍａｔｓｕｉ，Ｓ．Ｎａｋａｍｕｒａ，「Ａｕｔｏｍａｔｉｃｇｅｎｅｒａｔｉｏｎｏｆｎｏｎ−ｕｎｉｆｏｒｍＨＭＭｔｏｐｏｌｏｇｉｅｓｂａｓｅｄｏｎｔｈｅＭＤＬｃｒｉｔｅｒｉｏｎ」、ＩＥＩＣＥＴｒａｎｓ．ｏｎＩｎｆｏｒｍａｔｉｏｎａｎｄＳｙｓｔｅｍｓ，ｖｏｌ．Ｅ８７−Ｄ，ｎｏ．８，ｐ．２１２１−２１２９，２００４ Document 1: T. Jitsuhiro, T. Matsui, S. Nakamura, "Automatic generation of non-uniform HMM topologies based on the MDL criterion", IEICE Trans. on Information and Systems, vol. E87-D, no. 8, pp. 2121-2129, 2004

文献２：Ｍ．Ｔｏｎｏｍｕｒａ，Ｔ．Ｋｏｓａｋａ，Ｓ．Ｍａｔｓｕｎａｇａ，「Ｓｐｅａｋｅｒａｄａｐｔａｔｉｏｎｂａｓｅｄｏｎｔｒａｎｓｆｅｒｖｅｃｔｏｒｆｉｅｌｄｓｍｏｏｔｈｉｎｇｕｓｉｎｇｍａｘｉｍｕｍａｐｏｓｔｅｒｉｏｒｉｐｒｏｂａｂｉｌｉｔｙｅｓｔｉｍａｔｉｏｎ」、ＣｏｍｐｕｔｅｒＳｐｅｅｃｈａｎｄＬａｎｇｕａｇｅ，ｖｏｌ．１０，ｐ．１１７−１３２，１９９６ Document 2: M. Tonomura, T. Kosaka, S. Matsunaga, "Speaker adaptation based on transfer vector field smoothing using maximum a posteriori probability estimation", Computer Speech and Language, vol. 10, pp. 117-132, 1996

図１３は、ラベル認識を行う雑音抑制手法（ＭＭ−ＮＳ法）で用いた雑音抑圧用モデルの総分布数の評価話者に対する平均を示している。図１３から、分布数は各雑音モデルの混合数に対して指数関数的に増加することが分かる。この実験例で用いた音声認識用の音響モデルに含まれる総分布数が１０，４６０であることを考えると、それぞれ大変大きな分布数であるといえる。 FIG. 13 shows an average of the total number of distributions of the noise suppression model used in the noise suppression method (MM-NS method) for performing label recognition for the evaluation speakers. FIG. 13 shows that the number of distributions increases exponentially with respect to the number of mixtures in each noise model. Considering that the total number of distributions included in the acoustic model for speech recognition used in this experimental example is 10,460, it can be said that the numbers are very large.
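The growth in the number of distributions can be illustrated with a short Python sketch. It assumes that synthesizing several GMMs pairs every mixture component of one model with every component of the others, so the component counts multiply. The text does not spell out this composition rule, so it is an assumption here, and the function name `composite_component_count` is hypothetical.

```python
from math import prod
from typing import Sequence


def composite_component_count(mixtures_per_model: Sequence[int]) -> int:
    """Number of Gaussians in a synthesized model under the product assumption."""
    if not mixtures_per_model:
        raise ValueError("at least one model is required")
    return prod(mixtures_per_model)
```

Under this assumption, combining three models of 4 components each gives 64 Gaussians, while three models of 12 components each give 1,728, before the clean speech model and background noise are taken into account.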

図１４に従来法と、本実施の形態による手法の単語認識率を示す。「ＳＭ−ＮＳ」は雑音モデルとして単一分布を用いた「Ｓｉｎｇｌｅ−ＭｏｄｅｌＮｏｉｓｅＳｕｐｐｒｅｓｓｉｏｎ」手法を示す。この単一分布としては各入力音声の開始１００ｍｓから推定されたものを用いた。「ＰＦ」はパーティクルフィルタを用いた従来法（非特許文献５参照）を示す。この方法は、このタスクではｂａｓｅｌｉｎｅよりよい精度を得られなかった。多くの挿入エラーが生じたためであり、突発的な雑音に対して追跡することが困難であると考えられる。「ＰＦ−ＭＭ−ＮＳ（４ｍｉｘ．）」、「ＰＦ−ＭＭ−ＮＳ（８ｍｉｘ．）」、「ＰＦ−ＭＭ−ＮＳ（１２ｍｉｘ．）」は、それぞれ提案法において、各雑音モデルの混合分布数が４、８、１２である場合を示す。これらのパターンは同じモデルを用いた従来法のＭＭＮＳ、「ＭＭ−ＮＳ（４ｍｉｘ．）」、「ＭＭ−ＮＳ（８ｍｉｘ．）」、「ＭＭ−ＮＳ（１２ｍｉｘ．）」、それぞれの精度を１％程度、上回った。「ＰＦ−ＭＭ−ＮＳ（１２ｍｉｘ．）」はｂａｓｅｌｉｎｅに対し、１２．３％のエラー改善率を得た。また、「ＰＦ−ＭＭ−ＮＳ（４ｍｉｘ．）」と「ＭＭ−ＮＳ（８ｍｉｘ．）」、あるいは、「ＰＦ−ＭＭ−ＮＳ（８ｍｉｘ．）」と「ＭＭ−ＮＳ（１２ｍｉｘ．）」との比較では、本実施の形態による手法が同程度の性能を少ない分布数（それぞれ、２８％、４６％）で得られることが分かった。 FIG. 14 shows the word recognition rates of the conventional methods and the method according to the present embodiment. "SM-NS" denotes the "Single-Model Noise Suppression" method, which uses a single distribution as the noise model. This single distribution was estimated from the first 100 ms of each input utterance. "PF" denotes the conventional method using a particle filter (see Non-Patent Document 5). On this task, that method did not achieve better accuracy than the baseline. This is because many insertion errors occurred, which suggests that it has difficulty tracking sudden noise. "PF-MM-NS (4 mix.)", "PF-MM-NS (8 mix.)", and "PF-MM-NS (12 mix.)" denote the proposed method with 4, 8, and 12 mixture components per noise model, respectively. Each of these configurations exceeded the accuracy of the corresponding conventional MM-NS configuration using the same models, "MM-NS (4 mix.)", "MM-NS (8 mix.)", and "MM-NS (12 mix.)", by about 1%. "PF-MM-NS (12 mix.)" achieved an error improvement rate of 12.3% relative to the baseline. Furthermore, comparing "PF-MM-NS (4 mix.)" with "MM-NS (8 mix.)", and "PF-MM-NS (8 mix.)" with "MM-NS (12 mix.)", showed that the method according to the present embodiment achieves comparable performance with a smaller number of distributions (28% and 46%, respectively).
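The error improvement rate quoted above is presumably a relative reduction in word error with respect to the baseline. The text does not define it, so the Python sketch below rests on that assumption, and the function name `relative_error_reduction` is hypothetical. It derives the figure from two word recognition rates given in percent.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Relative error reduction in percent, from two accuracies in percent."""
    baseline_error = 100.0 - baseline_accuracy
    new_error = 100.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline has no errors to reduce")
    return 100.0 * (baseline_error - new_error) / baseline_error
```

Measured this way, a given gain in absolute accuracy corresponds to a larger relative improvement when the baseline error is small.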

次に、より詳細に提案法の効果を見るために、中間的な処理パターンを２つ評価した。１つは、ラベル認識を行う雑音抑圧手法における音声区間検出だけの効果を見るために、ラベル認識による音声区間検出を行い、音声区間以外では雑音抑圧処理を行うが、音声区間では何もしない、つまり、雑音抑圧を行わない場合「ＭＭ−ＮＳ（４ｍｉｘ．、ＶＡＤｏｎｌｙ）」である。もう１つは、本実施の形態による手法「ＰＦ−ＭＭ−ＮＳ（４ｍｉｘ．）」のパーティクルフィルタにおいて、雑音に応じたパーティクルを用いない、すなわち、背景雑音モデルから得たパーティクルのみで逐次雑音を推定・抑圧する場合「ＰＦ（ＢＧ）−ＭＭＮＳ（４ｍｉｘ．）」である。なお、このときのパーティクル数は提案法「ＰＦ−ＭＭ−ＮＳ（４ｍｉｘ．）」と同じ１１０とした。 Next, to examine the effect of the proposed method in more detail, two intermediate processing configurations were evaluated. The first, "MM-NS (4 mix., VAD only)", isolates the effect of speech segment detection in the noise suppression method that performs label recognition: speech segments are detected by label recognition, noise suppression is applied outside the speech segments, and nothing is done within the speech segments, that is, no noise suppression is performed there. The second, "PF(BG)-MM-NS (4 mix.)", is the method according to the present embodiment, "PF-MM-NS (4 mix.)", with the particle filter using no noise-specific particles, that is, with noise estimated and suppressed sequentially using only particles obtained from the background noise model. The number of particles in this case was 110, the same as for the proposed method "PF-MM-NS (4 mix.)".

図１５に、それぞれのパターンに対する単語認識率を示す。上記の２つ以外は図１４に含まれるものと同じ結果である。「ＭＭ−ＮＳ（４ｍｉｘ．、ＶＡＤｏｎｌｙ）」では、音声区間検出を行うことで「Ｂａｓｅｌｉｎｅ（ｗ／ｏＮＳ）」からの改善があったが、雑音抑圧処理を行う「ＭＭＮＳ（４ｍｉｘ．）」よりは精度が低かった。複数モデルによる雑音抑圧の効果があることがわかる。「ＰＦ（ＢＧ）−ＭＭ−ＮＳ（４ｍｉｘ．）」は逐次雑音推定を行うことで、「ＭＭ−ＮＳ（４ｍｉｘ．）」に若干上回る精度を得たが、提案法「ＰＦ−ＭＭ−ＮＳ（４ｍｉｘ．）」には及ばなかった。パーティクルフィルタにおいて、雑音種類に応じたモデルをパーティクルの事前分布として用いる効果があるといえる。 FIG. 15 shows the word recognition rate for each configuration. Apart from the two configurations described above, the results are the same as those in FIG. 14. "MM-NS (4 mix., VAD only)" improved on "Baseline (w/o NS)" through speech segment detection, but was less accurate than "MM-NS (4 mix.)", which performs noise suppression. This shows that noise suppression with multiple models is effective. "PF(BG)-MM-NS (4 mix.)" achieved slightly higher accuracy than "MM-NS (4 mix.)" through sequential noise estimation, but fell short of the proposed method "PF-MM-NS (4 mix.)". This indicates that, in the particle filter, using a model matched to the noise type as the prior distribution of the particles is effective.

このように、従来提案されてきた複数モデル雑音抑圧手法（Ｍｕｌｔｉ−ＭｏｄｅｌＮｏｉｓｅＳｕｐｐｒｅｓｓｉｏｎ、ＭＭ−ＮＳ）では、実環境で音声データと同時に収録される複数種類の雑音を扱うことが可能である。しかし、あらかじめ学習データから得られた雑音モデルやその組合せで得られる合成モデルを元に雑音抑圧をするため、学習データで得られない組合せや未知の雑音分布に対して対応することができなかった。そこで、本実施の形態による手法では、パーティクルフィルタに基づく複数モデル雑音抑圧手法を提案した。そのパーティクルフィルタを用いることにより、入力された雑音重畳音声データから雑音分布を推定することができる。ただし、突発的な雑音に対応するために、ラベル認識結果を利用し、新たな雑音が検出されたら、その雑音モデルを事前分布として新たなパーティクルをサンプリングする。同じ種類の雑音が継続するうちは拡張カルマンフィルタにより推定し、検出できなくなったら、そのパーティクルは消去する。これにより、学習データから得られた先験的知識を利用し、かつ、雑音重畳音声データから得られる情報を用い、動的に雑音推定を行うことが可能になる。実験結果から、本実施の形態による手法は、同じ雑音モデルを用いた従来型のラベル認識を行う雑音抑圧手法より高い精度を得ることができることがわかった。また、詳細な比較実験により、ラベル認識を行う雑音抑圧手法での音声区間検出による効果、複数雑音モデルを用いる効果、パーティクルの事前分布として雑音に応じたモデルを用いる効果を示すことができた。 As described above, the previously proposed multi-model noise suppression method (Multi-Model Noise Suppression, MM-NS) can handle multiple types of noise recorded together with speech data in a real environment. However, because it suppresses noise based on noise models obtained in advance from training data and on synthesized models obtained by combining them, it could not cope with combinations that do not occur in the training data or with unknown noise distributions. The method according to the present embodiment is therefore a multi-model noise suppression method based on a particle filter. Using the particle filter, the noise distribution can be estimated from the input noise-superimposed speech data. To cope with sudden noise, however, the label recognition result is also used: when a new noise is detected, new particles are sampled with that noise model as the prior distribution. While the same type of noise continues, it is estimated by the extended Kalman filter, and once it is no longer detected, its particles are deleted. This makes it possible to estimate noise dynamically, using both the a priori knowledge obtained from the training data and the information obtained from the noise-superimposed speech data.
From the experimental results, it was found that the method according to the present embodiment achieves higher accuracy than the conventional noise suppression method that performs label recognition using the same noise models. In addition, detailed comparison experiments demonstrated the effect of speech segment detection in the noise suppression method that performs label recognition, the effect of using multiple noise models, and the effect of using noise-specific models as the prior distribution of the particles.
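The particle lifecycle summarized above (sample on detection, update while the noise continues, delete when it disappears) can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment. The names `step_particles`, `priors`, `update`, and `n_new` are hypothetical, each prior is reduced to a single diagonal Gaussian, and the extended Kalman filter update is represented by a caller-supplied placeholder function.

```python
import random
from typing import Callable, Dict, List, Sequence, Set, Tuple

Particle = List[float]
Prior = Tuple[Sequence[float], Sequence[float]]  # (mean, variance) per dimension


def step_particles(
    particles: Dict[str, List[Particle]],
    active_labels: Set[str],
    priors: Dict[str, Prior],
    update: Callable[[Particle], Particle],
    n_new: int,
    rng: random.Random,
) -> Dict[str, List[Particle]]:
    """Advance the particle sets by one frame according to recognized labels."""
    # Delete particles for noise types that are no longer detected.
    for label in list(particles):
        if label not in active_labels:
            del particles[label]
    for label in sorted(active_labels):
        if label in particles:
            # Same noise type continues: apply the per-particle update
            # (a placeholder for the extended Kalman filter step).
            particles[label] = [update(p) for p in particles[label]]
        else:
            # Newly detected noise type: sample particles from its model,
            # used here as the prior distribution.
            mean, var = priors[label]
            particles[label] = [
                [rng.gauss(m, v ** 0.5) for m, v in zip(mean, var)]
                for _ in range(n_new)
            ]
    return particles
```

Keying the particle sets by label keeps the sampling, update, and deletion cases explicit. How the 110 particles used in the experiment are shared among concurrently active noise types is not stated in this section and is not modeled here.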

以上のように、本実施の形態による音声認識装置２によれば、実施の形態１で説明した雑音抑圧装置１によって雑音を抑圧したクリーン音声データを用いて音声認識を行うため、より高い単語認識率を得ることができる。 As described above, according to the speech recognition apparatus 2 according to the present embodiment, since speech recognition is performed using the clean speech data in which noise is suppressed by the noise suppression apparatus 1 described in Embodiment 1, higher word recognition is possible. Rate can be obtained.

なお、本実施の形態において、音声認識装置２が蓄積部２２を備えた構成について説明したが、音声認識装置２は、蓄積部２２を備えず、雑音抑圧されたクリーン音声データを音声認識部４４に直接渡してもよい。 In the present embodiment, a configuration in which the speech recognition device 2 includes the accumulation unit 22 has been described. However, the speech recognition device 2 may omit the accumulation unit 22 and pass the noise-suppressed clean speech data directly to the speech recognition unit 44.

また、上記各実施の形態において、雑音抑圧装置１や音声認識装置２は、訓練データ記憶部１１、ラベル記憶部１２、モデル生成部１３、ラベル言語モデル生成部１６を備えていなくてもよい。雑音抑圧装置１等がそれらの構成要素を含まない場合には、例えば、装置外部において訓練データや訓練ラベル情報に基づいて、音声雑音音響モデルや合成音響モデルや、合成音響モデル、ラベル言語モデル、辞書情報が生成され、その生成されたモデル等が音声雑音音響モデル記憶部１４や合成音響モデル記憶部１５、辞書情報記憶部１７、ラベル言語モデル記憶部１８に蓄積されるものとする。各記憶部にモデル等が記憶される過程は問わない。ただし、合成音響モデルの生成は、雑音抑圧装置１等において行われてもよい。その場合には、雑音抑圧装置１等はモデル生成部１３を備えており、そのモデル生成部１３は、音声雑音音響モデル記憶部１４で記憶されている音声雑音音響モデルを合成することによって、合成音響モデルを生成し、合成音響モデル記憶部１５に蓄積する処理を行うことになる。また、ラベル言語モデルの一部または全部は、前述のように、人手によって生成されたものであってもよい。 In each of the above embodiments, the noise suppression device 1 and the speech recognition device 2 need not include the training data storage unit 11, the label storage unit 12, the model generation unit 13, and the label language model generation unit 16. When the noise suppression device 1 or the like does not include these components, it is assumed that, for example, the speech noise acoustic models, the synthesized acoustic model, the label language model, and the dictionary information are generated outside the device based on the training data and the training label information, and that the generated models and other information are accumulated in the speech noise acoustic model storage unit 14, the synthesized acoustic model storage unit 15, the dictionary information storage unit 17, and the label language model storage unit 18. The process by which the models and other information come to be stored in each storage unit does not matter. The synthesized acoustic model may nevertheless be generated in the noise suppression device 1 or the like. In that case, the noise suppression device 1 or the like includes the model generation unit 13, which generates the synthesized acoustic model by synthesizing the speech noise acoustic models stored in the speech noise acoustic model storage unit 14 and accumulates it in the synthesized acoustic model storage unit 15. Part or all of the label language model may also be created manually, as described above.

また、上記各実施の形態において、ラベル言語モデルを用いてラベルの認識を行う場合について説明したが、ラベル言語モデルを用いないでラベルの認識を行ってもよい。いわゆる、「ノー・グラマー（ＮｏＧｒａｍｍａｒ）」と呼ばれる方法である。その場合には、雑音抑圧装置１や音声認識装置２は、ラベル言語モデル記憶部１８を備えていなくてもよく、ラベル認識部２０は、ラベル言語モデルを用いないでラベルの認識を行ってもよい。 In each of the above embodiments, the case where labels are recognized using a label language model has been described, but labels may be recognized without using a label language model. This is the so-called "No Grammar" method. In that case, the noise suppression device 1 and the speech recognition device 2 need not include the label language model storage unit 18, and the label recognition unit 20 may recognize labels without using a label language model.

また、上記各実施の形態では、雑音抑圧装置１や音声認識装置２がスタンドアロンである場合について説明したが、雑音抑圧装置１や音声認識装置２は、スタンドアロンの装置であってもよく、サーバ・クライアントシステムにおけるサーバ装置であってもよい。後者の場合には、出力部や受付部は、通信回線を介して入力を受け付けたり、画面を出力したりすることになる。 In each of the above embodiments, the case where the noise suppression device 1 and the speech recognition device 2 are stand-alone has been described. However, the noise suppression device 1 and the speech recognition device 2 may each be a stand-alone device or a server device in a server-client system. In the latter case, the reception unit and the output unit receive input and output screens via a communication line.

また、上記各実施の形態において、各処理または各機能は、単一の装置または単一のシステムによって集中処理されることによって実現されてもよく、あるいは、複数の装置または複数のシステムによって分散処理されることによって実現されてもよい。 In each of the above embodiments, each processing or each function may be realized by centralized processing by a single device or a single system, or distributed processing by a plurality of devices or a plurality of systems. May be realized.

また、上記各実施の形態において、各構成要素が実行する処理に関係する情報、例えば、各構成要素が受け付けたり、取得したり、選択したり、生成したり、送信したり、受信したりした情報や、各構成要素が処理で用いるしきい値や数式、アドレス等の情報等は、上記説明で明記していない場合であっても、図示しない記録媒体において、一時的に、あるいは長期にわたって保持されていてもよい。また、その図示しない記録媒体への情報の蓄積を、各構成要素、あるいは、図示しない蓄積部が行ってもよい。また、その図示しない記録媒体からの情報の読み出しを、各構成要素、あるいは、図示しない読み出し部が行ってもよい。 Also, in each of the above embodiments, information related to processing executed by each component, for example, each component received, acquired, selected, generated, transmitted, or received Information and information such as threshold values, mathematical formulas, addresses, etc. used by each component in processing are retained temporarily or over a long period of time on a recording medium (not shown) even if not explicitly stated in the above description. May be. Further, the storage of information in the recording medium (not shown) may be performed by each component or a storage unit (not shown). Further, reading of information from the recording medium (not shown) may be performed by each component or a reading unit (not shown).

また、上記各実施の形態において、雑音抑圧装置１や音声認識装置２に含まれる２以上の構成要素が通信デバイスや入力デバイス等を有する場合に、２以上の構成要素が物理的に単一のデバイスを有してもよく、あるいは、別々のデバイスを有してもよい。 In each of the above embodiments, when two or more components included in the noise suppression device 1 or the speech recognition device 2 include a communication device or an input device, the two or more components are physically single. You may have devices or you may have separate devices.

また、上記各実施の形態において、各構成要素は専用のハードウェアにより構成されてもよく、あるいは、ソフトウェアにより実現可能な構成要素については、プログラムを実行することによって実現されてもよい。例えば、ハードディスクや半導体メモリ等の記録媒体に記録されたソフトウェア・プログラムをＣＰＵ等のプログラム実行部が読み出して実行することによって、各構成要素が実現され得る。なお、上記実施の形態における雑音抑圧装置１を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、コンピュータを、雑音の重畳されている音声データである雑音重畳音声データを受け付ける受付部と、訓練用の音声データと訓練用の雑音データとを含む訓練データに含まれる一種類の音声データまたは一種類の雑音データの音響モデルである音声雑音音響モデルが複数記憶される音声雑音音響モデル記憶部で記憶されている音声雑音音響モデル、前記訓練データに含まれる音声データと雑音データのうちの二種類以上が合成された音響モデルである合成音響モデルが記憶される合成音響モデル記憶部で記憶されている合成音響モデル、前記訓練データにおいて重畳されている音声と雑音の種類を識別する情報であるラベルと前記音声雑音音響モデルまたは前記合成音響モデルとを対応付ける情報である辞書情報が記憶される辞書情報記憶部で記憶されている辞書情報を用いて、前記雑音重畳音声データに対応するラベルをフレームごとに認識するラベル認識部と、前記ラベル認識部が認識したラベルに応じた音響モデルからパーティクルをサンプリングし、当該サンプリングしたパーティクルを更新することによって、前記雑音重畳音声データの雑音が抑圧されたクリーン音声データを生成するパーティクルフィルタ雑音抑圧部として機能させるためのものである。 In each of the above embodiments, each component may be configured by dedicated hardware, or a component that can be realized by software may be realized by executing a program. For example, each component can be realized by a program execution unit such as a CPU reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory. The software that realizes the noise suppression device 1 in the above embodiments is the following program. That is, this program causes a computer to function as: a reception unit that receives noise-superimposed speech data, which is speech data on which noise is superimposed; a label recognition unit that recognizes, for each frame, a label corresponding to the noise-superimposed speech data, using (i) the speech noise acoustic models stored in a speech noise acoustic model storage unit, which stores a plurality of speech noise acoustic models, each being an acoustic model of one type of speech data or one type of noise data included in training data that includes training speech data and training noise data, (ii) the synthesized acoustic model stored in a synthesized acoustic model storage unit, which stores a synthesized acoustic model that is an acoustic model in which two or more types of the speech data and noise data included in the training data are synthesized, and (iii) the dictionary information stored in a dictionary information storage unit, which stores dictionary information that associates a label, which is information identifying the types of speech and noise superimposed in the training data, with the speech noise acoustic model or the synthesized acoustic model; and a particle filter noise suppression unit that generates clean speech data, in which the noise of the noise-superimposed speech data is suppressed, by sampling particles from the acoustic model corresponding to the label recognized by the label recognition unit and updating the sampled particles.

Note that the functions realized by the above program do not include functions that can be realized only by hardware. For example, functions that can be realized only by hardware such as a modem or an interface card, in a reception unit that receives information or an output unit that outputs information, are at least not among the functions realized by the program.

The program may be executed after being downloaded from a server or the like, or may be executed by reading out a program recorded on a predetermined recording medium (for example, an optical disk such as a CD-ROM, a magnetic disk, or a semiconductor memory).

The program may be executed by a single computer or by a plurality of computers. That is, either centralized processing or distributed processing may be performed.

FIG. 16 is a schematic diagram showing an example of the external appearance of a computer that executes the above program to realize the noise suppression device and the speech recognition device according to the above embodiments. The above embodiments are realized by computer hardware and a computer program executed on it.

In FIG. 16, a computer system 100 includes a computer 101 that includes a CD-ROM (Compact Disk Read Only Memory) drive 105 and an FD (Flexible Disk) drive 106, a keyboard 102, a mouse 103, and a monitor 104.

FIG. 17 is a diagram showing the computer system. In FIG. 17, the computer 101 includes, in addition to the CD-ROM drive 105 and the FD drive 106, a CPU (Central Processing Unit) 111; a ROM (Read Only Memory) 112 for storing programs such as a boot-up program; a RAM (Random Access Memory) 113 that is connected to the CPU 111, temporarily stores instructions of application programs, and provides a temporary storage space; a hard disk 114 that stores application programs, system programs, and data; and a bus 115 that interconnects the CPU 111, the ROM 112, and the other components. The computer 101 may also include a network card (not shown) that provides a connection to a LAN.

A program that causes the computer system 100 to execute the functions of the noise suppression device and the speech recognition device according to the above embodiments may be stored on a CD-ROM 121 or an FD 122, inserted into the CD-ROM drive 105 or the FD drive 106, and transferred to the hard disk 114. Alternatively, the program may be transmitted to the computer 101 via a network (not shown) and stored on the hard disk 114. The program is loaded into the RAM 113 when it is executed. The program may also be loaded directly from the CD-ROM 121, the FD 122, or the network.

The program need not include an operating system (OS), third-party programs, or the like that cause the computer 101 to execute the functions of the noise suppression device and the speech recognition device according to the above embodiments. The program may contain only those instructions that call the appropriate functions (modules) in a controlled manner so that the desired results are obtained. How the computer system 100 operates is well known, and a detailed description is omitted.

The present invention is not limited to the above embodiments. Various modifications are possible, and these are, needless to say, also included within the scope of the present invention.

As described above, the noise suppression device and the like according to the present invention can remove noise from speech on which noise is superimposed, and are useful as a noise suppression device, a speech recognition device, and the like.

FIG. 1: Block diagram showing the configuration of the noise suppression device according to Embodiment 1 of the present invention
FIG. 2: Block diagram showing the configuration of the particle filter noise suppression unit of the noise suppression device according to the same embodiment
FIG. 3: Diagram outlining the particle filtering of the noise suppression device according to the same embodiment
FIG. 4: Flowchart showing the operation of the noise suppression device according to the same embodiment
FIG. 5: Flowchart showing the operation of the noise suppression device according to the same embodiment
FIG. 6: Diagram showing an example of training label information in the same embodiment
FIG. 7: Diagram showing an example of dictionary information in the same embodiment
FIG. 8: Diagram explaining label recognition in the same embodiment
FIG. 9: Diagram showing an example of a label recognition result in the same embodiment
FIG. 10: Block diagram showing the configuration of the speech recognition device according to Embodiment 2 of the present invention
FIG. 11: Flowchart showing the operation of the speech recognition device according to the same embodiment
FIG. 12: Diagram showing an example of experimental conditions in the same embodiment
FIG. 13: Diagram explaining the average total number of distributions in the noise models in the same embodiment
FIG. 14: Diagram showing an example of word recognition rate results in the same embodiment
FIG. 15: Diagram comparing the effects of the PF-MM-NS method in the same embodiment
FIG. 16: Schematic diagram showing an example of the external appearance of the computer system in the same embodiment
FIG. 17: Diagram showing an example of the configuration of the computer system in the same embodiment

Explanation of symbols

1 Noise suppression device
2 Speech recognition device
11 Training data storage unit
12 Label storage unit
13 Model generation unit
14 Speech noise acoustic model storage unit
15 Synthetic acoustic model storage unit
16 Label language model generation unit
17 Dictionary information storage unit
18 Label language model storage unit
19 Reception unit
20 Label recognition unit
21 Particle filter noise suppression unit
22 Accumulation unit
31 Feature extraction means
32 Particle filter means
33 Noise component calculation means
34 Noise suppression means
41 Acoustic model storage unit for speech recognition
42 Language model storage unit
43 Dictionary information storage unit for speech recognition
44 Speech recognition unit
45 Output unit

Claims

A noise suppression device comprising:
a speech noise acoustic model storage unit that stores a plurality of speech noise acoustic models, each being an acoustic model of one type of speech data or one type of noise data included in training data that includes training speech data and training noise data;
a synthetic acoustic model storage unit that stores a synthetic acoustic model, which is an acoustic model obtained by combining two or more types of the speech data and noise data included in the training data;
a dictionary information storage unit that stores dictionary information, which is information associating a label, being information that identifies the types of speech and noise superimposed in the training data, with the speech noise acoustic model or the synthetic acoustic model;
a reception unit that receives noise-superimposed speech data, which is speech data on which noise is superimposed;
a label recognition unit that recognizes, for each frame, a label corresponding to the noise-superimposed speech data, using the speech noise acoustic models, the synthetic acoustic model, and the dictionary information; and
a particle filter noise suppression unit that generates clean speech data, in which the noise of the noise-superimposed speech data is suppressed, by sampling particles from the acoustic model corresponding to the label recognized by the label recognition unit and updating the sampled particles.

The noise suppression device according to claim 1, wherein the particle filter noise suppression unit comprises:
feature extraction means for extracting a feature amount for each frame of the noise-superimposed speech data;
particle filter means for sampling a plurality of particles from the acoustic model corresponding to the label recognized by the label recognition unit, using the feature amount corresponding to the noise-superimposed speech data and the speech noise acoustic model or synthetic acoustic model corresponding to that label, updating the sampled particles, and calculating a weight for each updated particle;
noise component calculation means for calculating a noise component for each frame, using the particles updated by the particle filter means and the weights calculated by the particle filter means; and
noise suppression means for removing the noise component calculated by the noise component calculation means from the noise-superimposed speech data to obtain clean speech data.

A speech recognition device comprising:
the noise suppression device according to claim 1 or 2;
an acoustic model storage unit for speech recognition that stores an acoustic model for the speech data to be recognized;
a speech recognition dictionary information storage unit that stores speech recognition dictionary information used in speech recognition;
a language model storage unit that stores a language model for the language to be recognized;
a speech recognition unit that performs speech recognition on the clean speech data generated by the noise suppression device, using the acoustic model, the speech recognition dictionary information, and the language model; and
an output unit that outputs the speech recognition result obtained by the speech recognition unit.

A noise suppression method comprising:
a reception step of receiving noise-superimposed speech data, which is speech data on which noise is superimposed;
a label recognition step of recognizing, for each frame, a label corresponding to the noise-superimposed speech data, using: the speech noise acoustic models stored in a speech noise acoustic model storage unit that stores a plurality of speech noise acoustic models, each being an acoustic model of one type of speech data or one type of noise data included in training data that includes training speech data and training noise data; the synthetic acoustic model stored in a synthetic acoustic model storage unit that stores a synthetic acoustic model, which is an acoustic model obtained by combining two or more types of the speech data and noise data included in the training data; and the dictionary information stored in a dictionary information storage unit that stores dictionary information, which is information associating a label, being information that identifies the types of speech and noise superimposed in the training data, with the speech noise acoustic model or the synthetic acoustic model; and
a particle filter noise suppression step of generating clean speech data, in which the noise of the noise-superimposed speech data is suppressed, by sampling particles from the acoustic model corresponding to the label recognized in the label recognition step and updating the sampled particles.

A program for causing a computer to function as:
a reception unit that receives noise-superimposed speech data, which is speech data on which noise is superimposed;
a label recognition unit that recognizes, for each frame, a label corresponding to the noise-superimposed speech data, using: the speech noise acoustic models stored in a speech noise acoustic model storage unit that stores a plurality of speech noise acoustic models, each being an acoustic model of one type of speech data or one type of noise data included in training data that includes training speech data and training noise data; the synthetic acoustic model stored in a synthetic acoustic model storage unit that stores a synthetic acoustic model, which is an acoustic model obtained by combining two or more types of the speech data and noise data included in the training data; and the dictionary information stored in a dictionary information storage unit that stores dictionary information, which is information associating a label, being information that identifies the types of speech and noise superimposed in the training data, with the speech noise acoustic model or the synthetic acoustic model; and
a particle filter noise suppression unit that generates clean speech data, in which the noise of the noise-superimposed speech data is suppressed, by sampling particles from the acoustic model corresponding to the label recognized by the label recognition unit and updating the sampled particles.