JP5233330B2

JP5233330B2 - Acoustic analysis condition normalization system, acoustic analysis condition normalization method, and acoustic analysis condition normalization program

Info

Publication number: JP5233330B2
Application number: JP2008057491A
Authority: JP
Inventors: 隆行荒川; 剛範辻川
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-03-07
Filing date: 2008-03-07
Publication date: 2013-07-10
Anticipated expiration: 2028-03-07
Also published as: JP2009216760A

Abstract

PROBLEM TO BE SOLVED: Recognition performance degrades when the acoustic analysis condition used when training the acoustic model differs from the acoustic analysis condition applied to the input signal to be recognized, and changing the acoustic analysis condition requires a high calculation cost and a large volume of data.

SOLUTION: The acoustic analysis condition normalization system includes at least: acoustic analysis means (I) for extracting a feature quantity from the input signal to be recognized under acoustic analysis condition (I); speech model storage means for storing a speech model (I) prepared from the feature quantity that acoustic analysis means (I) extracts under acoustic analysis condition (I) from a first learning input signal; correspondence table storage means for storing a correspondence table between the speech model (I) and a speech model (O), the latter prepared from the feature quantity that acoustic analysis means (O) extracts from a second learning input signal under acoustic analysis condition (O), which is identical to the condition used when preparing the acoustic model; and acoustic analysis condition correcting means for correcting the feature quantity of the input signal to be recognized, using the speech model (I) and the correspondence table.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention corrects differences in acoustic analysis conditions in a speech recognition system.

Speech recognition systems are known to suffer degraded recognition performance under the influence of background noise and the transfer characteristics of recording equipment such as microphones. To prevent degradation caused by such additive or multiplicative noise, noise suppression is commonly performed in the acoustic analysis unit.

FIG. 12 is a configuration diagram of a speech recognition system that performs general noise suppression, as inferred from Patent Document 1. In FIG. 12, the input signal acquisition unit 101 first converts a speech input signal recorded with a microphone or the like into digital information, cuts it into frames, and applies an FFT (Fast Fourier Transform) or the like to obtain the spectral information of the speech. Let S_{t,f} be the speech at frequency f and time t, N_{t,f} the additive noise, and H_{t,f} the multiplicative noise; the spectrum X_{t,f} of the input signal is then expressed by the following equation.
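The equation itself (Equation 1) is not reproduced in this text. Assuming the usual model in which the multiplicative noise scales the speech spectrum and the additive noise is then added to it, a form consistent with the definitions above would be:

```latex
X_{t,f} = H_{t,f}\, S_{t,f} + N_{t,f} \tag{1}
```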

Next, the noise component calculation unit 102 estimates the additive noise component contained in the input signal. One possible way to calculate the noise component is to average a plurality of frames of the input signal that contain no speech. Next, the estimated speech calculation unit 103 suppresses the noise component using the spectral subtraction method, the Wiener filter method, or the like, and calculates an estimate of the speech. An example of the spectral subtraction method is shown below. In spectral subtraction, the calculated noise component is subtracted from the input signal in the spectral domain, which suppresses the noise component and yields an estimate of the clean speech.
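The spectral subtraction equation (Equation 2) is not reproduced in this text. The following is a minimal Python sketch under stated assumptions: it uses a commonly seen form, S_hat = max[X - alpha * <N>, beta * X], which may differ from the exact expression in the original, and the function names `estimate_noise` and `spectral_subtraction` are hypothetical. Spectra are represented as plain lists of per-bin magnitudes (or powers).

```python
from typing import List


def estimate_noise(nonspeech_frames: List[List[float]]) -> List[float]:
    """Average the spectra of frames assumed to contain no speech (gives <N_f>)."""
    n_frames = len(nonspeech_frames)
    n_bins = len(nonspeech_frames[0])
    return [sum(frame[f] for frame in nonspeech_frames) / n_frames
            for f in range(n_bins)]


def spectral_subtraction(x: List[float], noise: List[float],
                         alpha: float = 1.0, beta: float = 0.1) -> List[float]:
    """Assumed form: S_hat[f] = max(X[f] - alpha * N[f], beta * X[f]).

    alpha scales how much noise is subtracted; beta sets the floor that
    keeps the estimate from dropping to zero or below.
    """
    return [max(xf - alpha * nf, beta * xf) for xf, nf in zip(x, noise)]


# Two non-speech frames give the per-bin noise estimate.
noise = estimate_noise([[1.0, 2.0], [3.0, 2.0]])
# One input frame is cleaned with that estimate.
clean = spectral_subtraction([10.0, 2.5], noise, alpha=1.0, beta=0.1)
```

A larger alpha removes more noise at the risk of distorting the speech, while a larger beta leaves more residual noise but reduces distortion.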

Here, Ŝ^{SS}_{t,f} is the estimate of the clean speech calculated by the spectral subtraction method, and <N_f> is the noise component calculated by the noise component calculation unit 102. max[] indicates that the larger of the two comma-separated values is adopted. α is a parameter that controls the strength of noise suppression and the resulting distortion. β is the flooring parameter, which likewise controls the strength of noise suppression and the distortion. The optimal values of these parameters are known to depend on the type of noise and the SNR. Patent Document 1 describes a method of changing these parameters according to the SNR. Next, the feature extraction unit 104 extracts from the estimated speech the cepstrum, the log power, their first-order and second-order differences, or a combination of these. At this point, in order to remove the multiplicative noise, the portion of the feature quantity related to the multiplicative noise is subtracted or normalized. Cepstral mean normalization is a widely known method for removing multiplicative noise.
The configuration described so far, from the input signal to the extraction of the feature quantity, is called the acoustic analysis unit 100, and a number of configurations other than the above have been proposed. Next, the decoder unit 601 searches for an appropriate word string for the feature quantity using the acoustic model 602 and the language model 603. The Hidden Markov Model (HMM) and similar models are widely used as the acoustic model. The acoustic model is created in advance by training on feature quantities extracted from the training data under the same conditions as the acoustic analysis unit 100. Finally, the output unit 604 outputs the word string.

FIG. 13 is a configuration diagram of a general distributed speech recognition system, as inferred from Patent Document 2. In such a distributed speech recognition system, the acoustic analysis unit 100 on the terminal side first performs acoustic analysis and calculates the feature quantity. Next, the feature transmission unit 701 transmits the calculated feature quantity to the server side. Next, the feature reception unit 702 on the server side receives the feature quantity. Next, the decoder unit 601 searches for an appropriate word string for the received feature quantity using the acoustic model 602 and the language model 603. Next, the word string transmission unit 703 transmits the word string to the terminal side. Next, the word string reception unit 704 on the terminal side receives the word string. Finally, the output unit 705 on the terminal side outputs the word string.

Patent Document 1: JP 2005-165221 A
Patent Document 2: JP 2006-350090 A

The disclosures of Patent Documents 1 and 2 are incorporated herein by reference. The following analysis is given by the present invention.

In the conventional method of performing noise suppression in the acoustic analysis unit shown in FIG. 12, recognition performance drops when the acoustic analysis condition used when training the acoustic model differs from the acoustic analysis condition applied to the input signal to be recognized. Consequently, if acoustic analysis is to be performed under a condition suited to the noise or SNR of the input signal, the acoustic model must be retrained under that condition or adapted to it. However, because the acoustic model is large, retraining or adapting it under a new acoustic analysis condition requires a high calculation cost and a large amount of data. The acoustic analysis conditions referred to here are the processing and parameters that constitute the acoustic analysis unit 100, the microphone characteristics, and the like.

In the conventional distributed speech recognition system shown in FIG. 13, the above problem becomes even more pronounced. In an ordinary speech recognition system the acoustic analysis unit and the decoder unit are in a one-to-one relationship, whereas in a distributed speech recognition system the decoder unit of a single server may be assigned to the acoustic analysis units of a plurality of terminals. Each terminal is used in a different noise environment and therefore has different optimal acoustic analysis conditions. Characteristics such as those of the microphone also differ from terminal to terminal. The acoustic model would therefore have to be matched to the acoustic analysis conditions of the terminals, but matching the conditions of every terminal is very difficult.

As described above, conventional systems have the following problems.

The first problem is that, in the conventional method of performing noise suppression in the acoustic analysis unit, recognition performance drops when the acoustic analysis condition used when training the acoustic model differs from the acoustic analysis condition applied to the input signal to be recognized.

The second problem is that, in the conventional method of performing noise suppression in the acoustic analysis unit, changing the acoustic analysis condition applied to the input signal to be recognized makes it necessary to retrain or adapt the acoustic model, which requires a high calculation cost and a large amount of data.

An object of the present invention is to provide an acoustic analysis condition normalization system and an acoustic analysis condition normalization program that normalize acoustic analysis conditions so that high-performance speech recognition can be performed even when the acoustic analysis condition used when training the acoustic model differs from the acoustic analysis condition applied to the input signal to be recognized.

Another object of the present invention is to provide an acoustic analysis condition normalization system and an acoustic analysis condition normalization program that normalize acoustic analysis conditions with a low calculation cost and a small amount of data.

To solve the above problems, the invention disclosed in the present application is generally configured as follows.

An acoustic analysis condition normalization system according to one aspect of the present invention includes at least: acoustic analysis means (I) for extracting a feature quantity from an input signal to be recognized under acoustic analysis condition (I); speech model storage means for storing a speech model (I) created in advance from the feature quantity that the acoustic analysis means (I) extracts under acoustic analysis condition (I) from a first learning input signal; correspondence table storage means for storing a correspondence table that records the correspondence between the speech model (I) and a speech model (O), the speech model (O) being created from the feature quantity that acoustic analysis means (O), which differs from the acoustic analysis means (I), extracts from a second learning input signal under acoustic analysis condition (O), the same condition as when the acoustic model used for speech recognition was created; and acoustic analysis condition correcting means for correcting the feature quantity of the input signal to be recognized using the speech model (I) and the correspondence table, and obtaining the corrected feature quantity.

With this configuration, the mismatch caused by a difference in acoustic analysis conditions can be corrected and high-performance speech recognition can be performed, so the first object is achieved. In addition, compared with adapting the acoustic model to the acoustic analysis condition, a sufficiently small speech model can be used, so the correction can be performed with a small amount of computation and a small increase in memory, and the second object is achieved.

In a preferred embodiment, the present invention includes: speech model storage means for storing a speech model (I) created in advance under the same acoustic analysis condition as that applied to the input signal; correspondence table storage means for storing a correspondence table that records in advance the correspondence between the speech model (I) and a speech model (O) created under the acoustic analysis condition used when training the acoustic model; and acoustic analysis condition correcting means for correcting the acoustic analysis condition using the speech model (I) and the correspondence table.

In the following, for ease of explanation, components related to the acoustic analysis condition used when training the acoustic model for speech recognition are marked with (O), and components related to the acoustic analysis condition used when analyzing the input signal are marked with (I).

Next, a first embodiment for carrying out the invention is described.

FIG. 1 is a diagram showing the configuration of the first embodiment of the present invention. Referring to FIG. 1, the acoustic analysis condition normalization system according to the first embodiment includes an input signal acquisition unit 101, a noise component calculation unit 102, an estimated speech calculation unit 103, a feature extraction unit 104, a speech model storage unit 105, a correspondence table storage unit 106, an acoustic analysis condition correction unit 107, an output unit 108, and the like.

The input signal acquisition unit 101 cuts a speech input signal recorded with a microphone or the like into frames and applies an FFT or the like to obtain the spectral information of the speech.

The noise component calculation unit 102 calculates the noise component from the input signal.

The estimated speech calculation unit 103 calculates a provisional estimated speech from the input signal and the noise component.

The feature extraction unit 104 extracts the feature quantity from the estimated speech.

The speech model storage unit 105 stores a speech model (I) created in advance under the same acoustic analysis condition as that applied to the input signal.

The correspondence table storage unit 106 stores a correspondence table between the speech model (I) and a speech model (O) created under the same acoustic analysis condition as when the acoustic model was trained.

The acoustic analysis condition correction unit 107 corrects the feature quantity extracted from the estimated speech, using the speech model (I) and the correspondence table.

The output unit 108 outputs the corrected feature quantity.

The input signal acquisition unit 101, the noise component calculation unit 102, the estimated speech calculation unit 103, and the feature extraction unit 104 are collectively referred to as the acoustic analysis unit (I) 100.

The functions and processing of each of these units may of course be implemented by a program executed on a computer constituting the noise suppression system (the same applies to the other embodiments).

Next, the speech model (I) and the correspondence table used in this embodiment are described in detail, together with how they are created.

FIG. 2 shows the speech model (O) created under the same acoustic analysis condition (O) as when the acoustic model was trained, the speech model (I) created under the same acoustic analysis condition (I) as that applied to the input signal, and the correspondence table between the elements that make up the two speech models. Possible speech models include a VQ (Vector Quantization) codebook, a GMM (Gaussian Mixture Model), and an HMM (Hidden Markov Model). An example using a GMM as the speech model is shown here. A GMM is a model expressed as a probability density function formed by mixing a plurality of Gaussian distributions. Each Gaussian distribution corresponds to a unit such as a phoneme or a speech cluster. The horizontal and vertical axes of FIG. 2 represent two dimensions selected from the feature quantity. Each ellipse in the figure represents a Gaussian distribution, and the point at its center represents the mean of that distribution. An arrow connecting two Gaussian distributions indicates that the two are in correspondence.

An example of how to create the two speech models and the correspondence table is described using the configuration diagram of FIG. 3 and the flowchart of FIG. 5. Note that the first learning data described below corresponds to the second learning input signal in the claims, and the second learning data corresponds to the first learning input signal in the claims.

First, the acoustic analysis unit (O) 200 analyzes the first learning data in advance under acoustic analysis condition (O), calculates the feature quantity (O), and creates the speech model (O) using the EM (Expectation Maximization) algorithm or the like (step S101). This processing can be performed in advance when the acoustic model is created. The speech model (O) can also be created using some or all of the Gaussian distributions that make up the acoustic model. The created speech model is stored in the speech model (O) storage unit 201. Here, the speech model (O) is assumed to be expressed as a GMM formed by a mixture of K Gaussian distributions, as shown in Equation 3 below.

Here, p(F^O) is the probability density function that outputs the feature quantity F^O. a_i is the mixture weight of the i-th Gaussian distribution, and N() denotes a Gaussian distribution as shown in Equation 4.

Here, μ is the mean, Σ is the variance, and D is the dimensionality of the feature quantity.
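Equations 3 and 4 are not reproduced in this text. Given the definitions above, they presumably take the standard forms of a K-component Gaussian mixture and a D-dimensional Gaussian density. The (O) superscripts on the mean and variance are an assumed notation, chosen to match the (I) superscripts used later for the speech model (I).

```latex
\begin{align}
p(F^{O}) &= \sum_{i=1}^{K} a_i \, N\!\left(F^{O};\, \mu_i^{O},\, \Sigma_i^{O}\right) \tag{3} \\
N(F;\, \mu,\, \Sigma) &= \frac{1}{(2\pi)^{D/2}\, |\Sigma|^{1/2}}
  \exp\!\left(-\tfrac{1}{2}\,(F-\mu)^{\top} \Sigma^{-1} (F-\mu)\right) \tag{4}
\end{align}
```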

Next, the acoustic analysis unit (O) 200 analyzes the second learning data under acoustic analysis condition (O) and calculates the feature quantity (O) (step S102). In parallel, the acoustic analysis unit (I) 100 analyzes the second learning data under acoustic analysis condition (I) and calculates the feature quantity (I) (step S103). The second learning data may come from the same acoustic environment as the first learning data or from a different one, but it is desirable that it be speech recorded in the same acoustic environment as the input signal to be recognized. The acoustic environment here refers to differences in background noise, speaker, or a combination of both. As the second learning data, the same input signal may be fed to both units as shown in FIG. 1, or two input signals from different acoustic environments may be fed as shown in FIG. 4. The two input signals from different acoustic environments must either be synchronized in time or come with information, such as phoneme labels, indicating which frames in the two time series correspond to each other.

Next, the speech model adaptation unit 202 creates the adapted speech model (I) from the speech model (O), using the feature quantity (O) and the feature quantity (I) (step S104). Specifically, one possible approach is to classify each frame of the feature quantity (O) time series according to which Gaussian distribution in the speech model (O) it is closest to, construct a new Gaussian distribution from the occurrence probabilities of the feature quantities (I) corresponding to the frames in each class, and use the result as the speech model (I). General adaptation algorithms such as the MLLR (Maximum Likelihood Linear Regression) method or methods based on the EM algorithm can also be used. The created speech model is stored in the speech model (I) storage unit 105. Equation 5 is the GMM that represents the speech model (I).
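The frame-classification approach to step S104 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: Gaussians use diagonal variances, each frame is hard-assigned to its highest-scoring Gaussian of the speech model (O), the mixture weights are left unchanged, and a component that receives no frames keeps its (O) parameters. The names `log_gauss` and `adapt_model` are hypothetical. Equation 5 is not reproduced in this text; it is presumably the mixture p(F^I) = sum_i a_i N(F^I; mu_i^I, Sigma_i^I) whose parameters this sketch estimates.

```python
import math
from typing import List, Tuple

Gaussian = Tuple[List[float], List[float]]  # (mean, diagonal variance)


def log_gauss(x: List[float], mean: List[float], var: List[float]) -> float:
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def adapt_model(feats_o: List[List[float]], feats_i: List[List[float]],
                weights: List[float], model_o: List[Gaussian],
                var_floor: float = 1e-3) -> List[Gaussian]:
    """Build speech model (I) from frame-aligned (O) and (I) features."""
    k = len(model_o)
    dim = len(feats_i[0])
    buckets: List[List[List[float]]] = [[] for _ in range(k)]
    # Classify each frame by its best-matching Gaussian in model (O),
    # then collect the paired (I) feature under that Gaussian's index.
    for fo, fi in zip(feats_o, feats_i):
        scores = [math.log(weights[i]) + log_gauss(fo, *model_o[i])
                  for i in range(k)]
        buckets[scores.index(max(scores))].append(fi)
    model_i: List[Gaussian] = []
    for i, frames in enumerate(buckets):
        if not frames:
            # Assumption: with no data, keep the (O) Gaussian unchanged.
            model_i.append((list(model_o[i][0]), list(model_o[i][1])))
            continue
        n = len(frames)
        mean = [sum(f[d] for f in frames) / n for d in range(dim)]
        var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, var_floor)
               for d in range(dim)]
        model_i.append((mean, var))
    return model_i
```

Because component i of the result is built from the frames assigned to component i of the speech model (O), the index itself carries the correspondence between the two models.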

Next, the correspondence table creation unit 203 creates the correspondence between the Gaussian distributions in the speech model (O) and the Gaussian distributions in the speech model (I) (step S105). Specifically, an index is assigned to each Gaussian distribution in the speech model (I), and the mean of the corresponding Gaussian distribution in the speech model (O) is held for that index. The created correspondence table is stored in the correspondence table storage unit 106. Equation 6 shows the correspondence table.
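Equation 6 is not reproduced in this text. From the description above (the table is indexed by the Gaussians of the speech model (I) and holds the mean of the corresponding Gaussian of the speech model (O)), it would plausibly read as follows, where the superscript (O) is an assumed notation:

```latex
\mathrm{table}[i] = \mu_i^{O}, \qquad i = 1, \dots, K \tag{6}
```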

Steps S102 to S104 can be performed in advance when acoustic analysis condition (I) can be set beforehand, for example when the noise in the input signal can be predicted to some extent or when the transfer characteristics of the microphone that acquires the input signal are known. Alternatively, when the noise in the input signal changes and acoustic analysis condition (I) must be changed accordingly, the first few frames or utterances of the input signal can be used as adaptation data to perform steps S102 to S105, so that adaptation is done online. The method described here creates the speech model (O) first, under the same acoustic analysis condition (O) as the acoustic model, and then creates the correspondence with the speech model (I) under the same acoustic analysis condition (I) as that applied to the input signal. Conversely, the speech model (I) may be created first and then put into correspondence with the speech model (O).

Next, the operation of this embodiment is described with reference to FIG. 1 and the flowchart of FIG. 6.

First, the acoustic analysis unit (I) 100 extracts the feature quantity (I) from the input signal (step S201). After this, cepstral mean normalization or the like may be applied to normalize the multiplicative noise component.
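Cepstral mean normalization is mentioned only by name in the text. A minimal sketch, assuming the features are cepstral vectors stored as lists and that the mean is taken over all frames of one utterance (the function name `cepstral_mean_normalize` is hypothetical):

```python
from typing import List


def cepstral_mean_normalize(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-dimension mean over time from every frame.

    A stationary multiplicative distortion in the spectrum becomes an
    additive constant in the cepstral (log) domain, so removing the
    time average removes that constant.
    """
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in frames]


normalized = cepstral_mean_normalize([[1.0, 4.0], [3.0, 8.0]])
```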

Next, the acoustic analysis condition correction unit 107 corrects the feature quantity (I) so that it matches the acoustic analysis condition (O) under which the acoustic model was created, using the speech model (I) stored in the speech model storage unit 105 and the correspondence table stored in the correspondence table storage unit 106 (step S202). Specifically, the following calculation is performed.

Here, F^{I→O} is the corrected feature quantity. table[i] is the correspondence table shown in Equation 6. P_post(i|F^I) is the posterior probability for the uncorrected feature quantity F_t^I at time t, given by Equation 8 below.

Here, the weight a_i, the mean vector μ_i^I, and the variance Σ_i^I of the i-th Gaussian distribution are the values held in the speech model (I) shown in Equation 5.
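Equations 7 and 8 are not reproduced in this text. The posterior of Equation 8 presumably follows the standard form a_i N(F; mu_i^I, Sigma_i^I) / sum_j a_j N(F; mu_j^I, Sigma_j^I). For Equation 7, the sketch below assumes the corrected feature is the posterior-weighted sum of the table entries, F^{I->O} = sum_i P_post(i|F^I) * table[i]; the original may instead add a posterior-weighted offset to the input feature, so this should be read as one plausible reading only. Diagonal variances and the names `log_gauss`, `posteriors`, and `correct_feature` are assumptions.

```python
import math
from typing import List


def log_gauss(x: List[float], mean: List[float], var: List[float]) -> float:
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def posteriors(x: List[float], weights: List[float],
               means_i: List[List[float]],
               vars_i: List[List[float]]) -> List[float]:
    """P_post(i | x) under speech model (I), computed in the log domain."""
    logs = [math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, means_i, vars_i)]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def correct_feature(x: List[float], weights: List[float],
                    means_i: List[List[float]], vars_i: List[List[float]],
                    table: List[List[float]]) -> List[float]:
    """Assumed form of Equation 7: posterior-weighted sum of table entries."""
    post = posteriors(x, weights, means_i, vars_i)
    return [sum(p * table[i][d] for i, p in enumerate(post))
            for d in range(len(x))]
```

Only the small speech model (I) and the table of means are evaluated per frame, which is where the low computation and memory cost claimed for the method comes from.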

Finally, the corrected feature quantity is output (step S203). Speech recognition may also be performed using the corrected feature quantity.

Next, the effects of this embodiment are described.

By correcting the acoustic analysis condition with a correspondence table that holds the correspondence between the speech model (I), created under the acoustic analysis condition (I) applied to the input signal to be recognized, and the speech model (O), created under the same acoustic analysis condition (O) as when the acoustic model was trained, the mismatch caused by a difference in acoustic analysis conditions can be corrected. In addition, compared with adapting the acoustic model to the acoustic analysis condition, a sufficiently small speech model can be used, so the correction can be performed with a small amount of computation and a small increase in memory.

FIG. 7 is a diagram showing the configuration of the second embodiment of the present invention.

In addition to the configuration of the first embodiment, the second embodiment includes a speech model storage unit 301 and a speech model selection unit 302.

The speech model storage unit 301 stores a plurality of speech models corresponding to a plurality of acoustic analysis conditions.

The speech model selection unit 302 selects a speech model from the plurality of speech models, based on the feature quantity extracted by the feature extraction unit 104.

Next, the operation of this embodiment is described.

In this embodiment, after the feature quantity is calculated in step S201 of the first embodiment, the speech model selection unit 302 uses the calculated feature quantity to select a speech model from a plurality of speech models (i, ii, ...) prepared in advance for a plurality of acoustic analysis conditions, and from step S202 onward the acoustic analysis condition is corrected using the selected speech model. One possible selection criterion is to compute the likelihood of each speech model for the feature quantity and select the speech model with the highest likelihood.
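The likelihood-based selection criterion can be sketched as follows. This is an illustration under stated assumptions: each candidate speech model is a diagonal-covariance GMM given as a `(weights, means, variances)` tuple, the score is the total log-likelihood over the available frames, and the names `log_gauss`, `gmm_log_likelihood`, and `select_model` are hypothetical.

```python
import math
from typing import List, Tuple

GMM = Tuple[List[float], List[List[float]], List[List[float]]]


def log_gauss(x: List[float], mean: List[float], var: List[float]) -> float:
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(frames: List[List[float]], weights: List[float],
                       means: List[List[float]],
                       variances: List[List[float]]) -> float:
    """Sum over frames of log sum_i a_i N(x; mu_i, Sigma_i)."""
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gauss(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logs)  # log-sum-exp with the maximum factored out
        total += top + math.log(sum(math.exp(value - top) for value in logs))
    return total


def select_model(frames: List[List[float]], models: List[GMM]) -> int:
    """Return the index of the speech model with the highest likelihood."""
    scores = [gmm_log_likelihood(frames, *model) for model in models]
    return scores.index(max(scores))
```

For example, with one model centered at 0 and another centered at 5, frames near 5 select the second model.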

予め複数の音響分析条件に対応した音声モデルを用意しておき、音響分析条件の補正を行うことで、雑音条件の変化などによって音響分析条件が変化する場合に対応することができる。 By preparing speech models corresponding to a plurality of acoustic analysis conditions in advance and correcting the acoustic analysis conditions, it is possible to cope with a case where the acoustic analysis conditions change due to a change in noise conditions or the like.

また、本実施の形態は、音声モデル選択の基準として特徴量抽出部１０４において抽出された特徴量を用いることに加えて、入力信号取得部１０１において取得された入力信号や、雑音成分算出部１０２において算出された雑音成分、あるいは推定音声算出部１０３において算出された推定音声を用いて音声モデルを選択する方法も考えられる。 In this embodiment, besides using the feature amount extracted by the feature amount extraction unit 104 as the criterion for selecting a speech model, the speech model may also be selected using the input signal acquired by the input signal acquisition unit 101, the noise component calculated by the noise component calculation unit 102, or the estimated speech calculated by the estimated speech calculation unit 103.

図８は本発明の第３の実施例の構成を示す図である。 FIG. 8 is a diagram showing the configuration of the third embodiment of the present invention.

第３の実施例は第２の実施例の構成の形態に対して、ＳＮＲ算出部４０１と推定音声算出部４０２と音声モデル選択部４０３とを備えている。 The third example includes an SNR calculation unit 401, an estimated speech calculation unit 402, and a speech model selection unit 403, as compared to the configuration of the second example.

ＳＮＲ算出部４０１は、入力信号取得部１０１で取得された入力信号と、雑音成分算出部１０２で算出された雑音成分との比を計算し、信号対雑音比（ＳＮＲ）を算出する。 The SNR calculation unit 401 calculates a ratio between the input signal acquired by the input signal acquisition unit 101 and the noise component calculated by the noise component calculation unit 102, and calculates a signal-to-noise ratio (SNR).

推定音声算出部４０２は、推定音声算出部１０３の代わりに、入力信号取得部１０１で取得された入力信号と雑音成分算出部１０２で算出された雑音成分と前記算出されたＳＮＲから、推定音声を算出する。 In place of the estimated speech calculation unit 103, the estimated speech calculation unit 402 calculates the estimated speech from the input signal acquired by the input signal acquisition unit 101, the noise component calculated by the noise component calculation unit 102, and the calculated SNR.

音声モデル選択部４０３は、前記音声モデル選択部３０２の代わりに、前記算出されたＳＮＲから前記音声モデルを選択する。 In place of the speech model selection unit 302, the speech model selection unit 403 selects the speech model based on the calculated SNR.

このような構成とすることにより、ＳＮＲに応じて推定音声算出部における雑音抑圧処理方法および雑音抑圧処理に係わるパラメータを変化させることができ、また、このようにして変更した音響分析条件に対応した音声モデルを選択し、補正を行うことができる。 With this configuration, the noise suppression method used in the estimated speech calculation unit and the parameters of the noise suppression processing can be changed according to the SNR, and the speech model corresponding to the acoustic analysis condition changed in this way can be selected and used for the correction.
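As a rough illustration of SNR-driven selection, the sketch below computes the SNR as the ratio of input-signal power to estimated noise power and maps it to a speech model through a threshold table. Expressing the ratio in decibels, the threshold values and the model names are all assumptions for the example; the text only states that a ratio is computed and that the model is selected from it.

```python
import math

def snr_db(input_power, noise_power, floor=1e-12):
    """SNR in dB from input-signal power and estimated noise power."""
    return 10.0 * math.log10(max(input_power, floor) / max(noise_power, floor))

def select_by_snr(snr, thresholds):
    """Pick the first model whose lower SNR bound (dB) is met.

    `thresholds` is a list of (lower_bound_db, model_name) sorted from the
    highest bound to the lowest.
    """
    for lower_bound, name in thresholds:
        if snr >= lower_bound:
            return name
    return thresholds[-1][1]

# Hypothetical mapping: clean-ish input -> model "i", very noisy -> model "iii".
thresholds = [(20.0, "i"), (10.0, "ii"), (float("-inf"), "iii")]
chosen = select_by_snr(snr_db(50.0, 1.0), thresholds)
```

The same SNR value could also switch the noise suppression parameters, so that the selected speech model always matches the suppression setting actually applied.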

図９は本発明の第４の実施例の構成を示す図である。 FIG. 9 is a diagram showing the configuration of the fourth embodiment of the present invention.

第４の実施例は第１の実施例の構成の形態に対して、第２の推定音声算出部５０１と第２の特徴量抽出部５０２と出力部５０３とを備えている。 The fourth example includes a second estimated speech calculation unit 501, a second feature amount extraction unit 502, and an output unit 503 with respect to the configuration of the first example.

第２の推定音声算出部５０１は、音響分析条件補正部１０７で補正された推定音声と、入力信号取得部１０１で取得された入力信号と、雑音成分算出部１０２で算出された雑音成分とから再度、推定音声を算出する。 The second estimated speech calculation unit 501 calculates the estimated speech again from the estimated speech corrected by the acoustic analysis condition correction unit 107, the input signal acquired by the input signal acquisition unit 101, and the noise component calculated by the noise component calculation unit 102.

第２の特徴量抽出部５０２は、前記第２の推定音声から特徴量を抽出する。 The second feature amount extraction unit 502 extracts feature amounts from the second estimated speech.

出力部５０３は、前記第２の特徴量を出力する。 The output unit 503 outputs the second feature amount.

本実施例は、第１の実施例のステップＳ２０２までを行い、認識対象の入力信号から抽出した特徴量を音響モデル学習時の音響分析条件(O)に補正した後で、第２の推定音声算出部５０１において、前記補正された特徴量と前記雑音成分を用いて雑音抑圧用のフィルターを算出し、前記入力信号に適用する。具体的な処理について、ここではウィナーフィルタを用いた方法について説明する。以後の処理は周波数帯域毎に独立に行うことができる。式７を用いて求めた補正後の特徴量Ｆ_ｔ ^Ｉ→０は、逆変換を行いスペクトル量Ｓ_ｔ ^Ｉ→０に戻す必要がある。特徴量としてケプストラムを使用している場合には以下の式９に示すような逆変換を行う。 This embodiment performs the processing up to step S202 of the first embodiment, correcting the feature amount extracted from the input signal to be recognized to the acoustic analysis condition (O) used at acoustic model training. The second estimated speech calculation unit 501 then calculates a noise suppression filter from the corrected feature amount and the noise component, and applies it to the input signal. As a concrete example, a method using a Wiener filter is described here. The subsequent processing can be performed independently for each frequency band. The corrected feature amount F_t^{I→O} obtained with Equation 7 must be inverse-transformed back to a spectral quantity S_t^{I→O}. When a cepstrum is used as the feature amount, the inverse transform shown in Equation 9 below is applied.
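Equation 9 is not reproduced in this text. A common way to go from a cepstrum back to a spectrum, shown here only as an assumed sketch, is to apply the inverse DCT to recover the log spectrum and then exponentiate. The unnormalised DCT convention and the function names below are assumptions; a real front end may also use a mel filter bank and truncated cepstra, which this example ignores.

```python
import math

def dct(values):
    """Unnormalised DCT-II, used here to turn a log spectrum into a cepstrum."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (2 * j + 1) / (2 * n))
                for j, v in enumerate(values))
            for k in range(n)]

def idct(coeffs):
    """Inverse of dct() above (a scaled DCT-III)."""
    n = len(coeffs)
    return [(coeffs[0] / 2
             + sum(coeffs[k] * math.cos(math.pi * k * (2 * j + 1) / (2 * n))
                   for k in range(1, n))) * 2 / n
            for j in range(n)]

def cepstrum_to_spectrum(cepstrum):
    """Spectrum from cepstrum: exponentiate the inverse DCT (the log spectrum)."""
    return [math.exp(v) for v in idct(cepstrum)]

# Round trip on a made-up four-band spectrum.
spectrum = [1.0, 2.0, 4.0, 8.0]
cepstrum = dct([math.log(s) for s in spectrum])
recovered = cepstrum_to_spectrum(cepstrum)
```

Since each frequency band is recovered separately, the later filtering steps can run per band, as the text notes.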

ここで、IDCT()は逆離散コサイン変換を示す。あるいは、対応テーブルとして式６の代わりに、以下の式１０で示すように、音声モデル(O)の各ガウス分布に対応する平均スペクトルを保存しておくことも考えられる。 Here, IDCT () indicates inverse discrete cosine transform. Alternatively, it is also conceivable to store an average spectrum corresponding to each Gaussian distribution of the speech model (O) as shown in the following Expression 10 instead of Expression 6 as a correspondence table.

この場合、式９は以下の式１１のように変更される。 In this case, Equation 9 is changed to Equation 11 below.

また、上記例のように対応テーブルにスペクトルを保存しておく代わりに対数スペクトルを使う方法なども当然考えることができる。 In addition, a method of using a logarithmic spectrum instead of storing the spectrum in the correspondence table as in the above example can be naturally considered.

以上のような手順によって求まった補正した特徴量をスペクトル領域の量に変形したものと、前記雑音成分算出部１０２において算出された雑音成分のスペクトル領域の量を用いてウィナーフィルタのフィルタゲインを以下の式１２にしたがって算出する。 The filter gain of the Wiener filter is calculated according to Equation 12 below, using the corrected feature amount obtained by the above procedure, converted to a spectral-domain quantity, and the spectral-domain quantity of the noise component calculated by the noise component calculation unit 102.

前記フィルタゲインを入力信号に乗ずることで第２の推定音声を得る。 A second estimated speech is obtained by multiplying the input signal by the filter gain.
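Equation 12 is not reproduced in this text. The sketch below uses the textbook per-band Wiener gain S / (S + N), which may differ in detail from the patent's formula, and then multiplies the input spectrum by that gain to obtain the second estimated speech. The function names and the numbers are illustrative assumptions.

```python
def wiener_gain(speech_spectrum, noise_spectrum, floor=1e-12):
    """Per-band Wiener gain S / (S + N) from speech and noise power spectra."""
    return [s / max(s + n, floor)
            for s, n in zip(speech_spectrum, noise_spectrum)]

def apply_gain(input_spectrum, gain):
    """Second estimated speech: the input spectrum scaled band by band."""
    return [g * x for g, x in zip(gain, input_spectrum)]

# Made-up spectra: corrected speech estimate, noise estimate, noisy input.
corrected_speech = [3.0, 0.0, 1.0]
noise = [1.0, 1.0, 1.0]
noisy_input = [4.0, 4.0, 4.0]
gain = wiener_gain(corrected_speech, noise)
second_estimate = apply_gain(noisy_input, gain)
```

Applying the gain to the original input, rather than using the corrected spectrum directly, keeps the fine detail of the input while the corrected estimate only shapes the filter.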

次に、第２の特徴量抽出部５０２において、前記第２の推定音声から特徴量を抽出する。特徴量としては特徴量抽出部１０４と同じものを用いても良いし異なるものを用いても良いが、音響モデル学習時とあわせておく必要がある。 Next, the second feature amount extraction unit 502 extracts a feature amount from the second estimated speech. The feature amount may be the same as or different from the one used by the feature amount extraction unit 104, but it must match the one used at acoustic model training.

最後に、第１の実施例のステップＳ２０３と同様に、出力部５０３において、前記第２の特徴量抽出部において抽出された特徴量を出力する。また、前記特徴量を用いて音声認識を行っても良い。 Finally, as in step S203 of the first embodiment, the output unit 503 outputs the feature amount extracted by the second feature amount extraction unit. Further, speech recognition may be performed using the feature amount.

第１の実施例のステップＳ２０２までに補正された特徴量は、特に音声モデル(I)を構成するガウス分布の数が少ないときには、過剰に平滑化された値として推定される。前記補正された推定音声と、前記雑音成分を用いて雑音抑圧フィルターを作成し、式１３のように元の入力信号に対して適用することで、元の入力信号の持っている音声の情報を損なうことなく雑音抑圧手法の違いによる差分だけを補正することができる。 The feature amount corrected up to step S202 of the first embodiment is estimated as an excessively smoothed value, particularly when the speech model (I) consists of only a few Gaussian distributions. By building a noise suppression filter from the corrected estimated speech and the noise component and applying it to the original input signal as in Equation 13, only the difference caused by the difference in noise suppression methods can be corrected, without losing the speech information contained in the original input signal.

また、本実施例では、第２の特徴量抽出部５０２において、特徴量抽出部１０４と異なる特徴量を用いることによって、入力信号に含まれる加算性雑音および乗算性雑音に対する雑音抑圧手法の違いを補正するだけでなく、特徴量の違いも補正することができる。 In addition, in this embodiment, by using in the second feature amount extraction unit 502 a feature amount different from that of the feature amount extraction unit 104, it is possible to correct not only differences in the noise suppression methods for additive and multiplicative noise contained in the input signal but also differences in the feature amount itself.

第１から第４の実施の形態は、図１０に示すように音声認識システムと一緒に用いることも考えられる。 The first to fourth embodiments may also be used together with a speech recognition system, as shown in FIG. 10.

本実施例は、第１から第４の実施例の構成に対し、デコーダー部６０１と出力部６０４とを備えている。 This embodiment adds a decoder unit 601 and an output unit 604 to the configurations of the first to fourth embodiments.

デコーダー部６０１は、特徴量抽出部１０４あるいは第２の特徴量抽出部５０２で抽出された特徴量に対して、音響モデル６０２と言語モデル６０３を用いて単語列を探索する。 The decoder unit 601 searches for a word string for the feature amount extracted by the feature amount extraction unit 104 or the second feature amount extraction unit 502, using the acoustic model 602 and the language model 603.

出力部６０４は、前記探索された単語列を出力する。 The output unit 604 outputs the searched word string.

このような構成とすることにより、入力信号に対して行う音響分析条件を、前記デコーダー部で用いる音響モデル学習時の音響分析条件と合わせることができ、認識性能の高い音声認識を行うことができる。図１０の例では、音響モデルが１つの時の例を示したが、音響モデルを複数持つことも考えられる。 With this configuration, the acoustic analysis condition applied to the input signal can be matched to the acoustic analysis condition used when training the acoustic model used by the decoder unit, enabling speech recognition with high recognition performance. FIG. 10 shows an example with a single acoustic model, but a plurality of acoustic models may also be used.

第１から第４の実施例は、図１１に示すような分散音声認識と一緒に用いることも考えられる。 The first to fourth embodiments may also be used together with distributed speech recognition, as shown in FIG. 11.

本実施例は、第１から第４の実施例の構成に対し、特徴量送信部７０１と特徴量受信部７０２とデコーダー部６０１と単語列送信部７０３と単語列受信部７０４と出力部７０５とを備えている。 This embodiment adds a feature amount transmission unit 701, a feature amount reception unit 702, a decoder unit 601, a word string transmission unit 703, a word string reception unit 704, and an output unit 705 to the configurations of the first to fourth embodiments.

特徴量送信部７０１は、特徴量抽出部１０４あるいは第２の特徴量抽出部５０２で抽出された特徴量をサーバー側に送信する。 The feature amount transmission unit 701 transmits the feature amount extracted by the feature amount extraction unit 104 or the second feature amount extraction unit 502 to the server side.

特徴量受信部７０２は、サーバー側で前記送信された特徴量を受信する。 The feature amount receiving unit 702 receives the transmitted feature amount on the server side.

デコーダー部６０１は、前記受信された特徴量に対して音響モデル６０２と言語モデル６０３を用いて単語列を探索する。 The decoder unit 601 searches for a word string using the acoustic model 602 and the language model 603 for the received feature amount.

単語列送信部７０３は、前記探索された単語列を端末側に送信する。 The word string transmission unit 703 transmits the searched word string to the terminal side.

単語列受信部７０４は、端末側で前記送信された単語列を受信する。 The word string receiving unit 704 receives the transmitted word string on the terminal side.

出力部７０５は、前記受信された単語列を出力する。 The output unit 705 outputs the received word string.

このような構成とすることにより、端末側で音響分析環境を音響モデル学習時と同じになるように補正するため、サーバー側に保持されている前記音響モデルはある１つの音響分析条件でのみ作成しておけば良い。もちろん音響モデルを複数用意することも可能である。また、この図１１に示す例では音響分析補正部を端末側に配置したがサーバー側に配置することも、もちろん考えられる。 With this configuration, the acoustic analysis environment is corrected on the terminal side to match that at acoustic model training, so the acoustic model held on the server side needs to be created under only one acoustic analysis condition. Of course, a plurality of acoustic models may also be prepared. In the example of FIG. 11 the acoustic analysis correction unit is placed on the terminal side, but it may of course be placed on the server side instead.

分散音声認識システムでは複数の端末の音響分析部に対して１台のサーバーのデコーダー部が割り当てられることが考えられる。本実施例の構成を用いて特徴量を補正することにより、前記複数の端末側で音響分析条件が異なっている場合でも、１台のサーバーのデコーダーで複数の端末に対応することができる。 In the distributed speech recognition system, it is conceivable that the decoder unit of one server is assigned to the acoustic analysis units of a plurality of terminals. By correcting the feature value using the configuration of the present embodiment, even when the acoustic analysis conditions are different on the plurality of terminals, the decoder of one server can cope with the plurality of terminals.

本発明は、雑音下での音声認識といった用途に適用できる。 The present invention can be applied to uses such as speech recognition under noise.

本発明の第１の実施例に係わる雑音用圧処理システムの構成を示すブロック図である。 A block diagram showing the configuration of the noise suppression processing system according to the first embodiment of the present invention.
本発明の第１の実施例に係わる音声モデルと対応テーブルの関係を示す概念図である。 A conceptual diagram showing the relationship between the speech models and the correspondence table according to the first embodiment of the present invention.
本発明の第１の実施例に係わる音声モデル作成法の構成を示すブロック図である。 A block diagram showing the configuration of the speech model creation method according to the first embodiment of the present invention.
本発明の第１の実施例に係わる音声モデル作成法の別の構成を示すブロック図である。 A block diagram showing another configuration of the speech model creation method according to the first embodiment of the present invention.
本発明の第１の実施例に係わる音声モデル作成法の手順を示すフロー図である。 A flowchart showing the procedure of the speech model creation method according to the first embodiment of the present invention.
本発明の第１の実施例に係わる雑音抑圧処理システムの手順を示すフロー図である。 A flowchart showing the procedure of the noise suppression processing system according to the first embodiment of the present invention.
本発明の第２の実施例に係わる雑音抑圧システムの構成を示すブロック図である。 A block diagram showing the configuration of the noise suppression system according to the second embodiment of the present invention.
本発明の第３の実施例に係わる雑音抑圧システムの構成を示すブロック図である。 A block diagram showing the configuration of the noise suppression system according to the third embodiment of the present invention.
本発明の第４の実施例に係わる雑音抑圧システムの構成を示すブロック図である。 A block diagram showing the configuration of the noise suppression system according to the fourth embodiment of the present invention.
本発明の第５の実施例に係わる音声認識システムの構成を示すブロック図である。 A block diagram showing the configuration of the speech recognition system according to the fifth embodiment of the present invention.
本発明の第６の実施例に係わる分散音声認識システムの構成を示すブロック図である。 A block diagram showing the configuration of the distributed speech recognition system according to the sixth embodiment of the present invention.
従来の音声分析部で雑音抑圧処理を行う音声認識システムの構成を示すブロック図である。 A block diagram showing the configuration of a conventional speech recognition system that performs noise suppression processing in the speech analysis unit.
従来の分散音声認識システムの構成を示すブロック図である。 A block diagram showing the configuration of a conventional distributed speech recognition system.

Explanation of symbols

１００音響分析部(I)
１０１入力信号取得部
１０２雑音成分算出部
１０３推定音声算出部
１０４特徴量抽出部
１０５音声モデル格納部（音声モデル(I)格納部）
１０６対応テーブル格納部
１０７音響分析条件補正部
１０８出力部
２００音響分析部(O)
２０１音声モデル(O)格納部
２０２音声モデル適応部（O→I）
２０３対応テーブル作成部
３０１複数の音声モデル格納部
３０２音声モデル選択部
４０１ＳＮＲ算出部
４０２推定音声算出部
４０３音声モデル選択部
５０１第２の推定音声算出部
５０２第２の特徴量抽出部
５０３出力部
６０１デコーダー部
６０２音響モデル
６０３言語モデル
６０４出力部
７０１特徴量送信部
７０２特徴量受信部
７０３単語列送信部
７０４単語列受信部
７０５出力部

100 Acoustic analysis unit (I)
101 Input signal acquisition unit
102 Noise component calculation unit
103 Estimated speech calculation unit
104 Feature amount extraction unit
105 Speech model storage unit (speech model (I) storage unit)
106 Correspondence table storage unit
107 Acoustic analysis condition correction unit
108 Output unit
200 Acoustic analysis unit (O)
201 Speech model (O) storage unit
202 Speech model adaptation unit (O → I)
203 Correspondence table creation unit
301 Storage unit for a plurality of speech models
302 Speech model selection unit
401 SNR calculation unit
402 Estimated speech calculation unit
403 Speech model selection unit
501 Second estimated speech calculation unit
502 Second feature amount extraction unit
503 Output unit
601 Decoder unit
602 Acoustic model
603 Language model
604 Output unit
701 Feature amount transmission unit
702 Feature amount reception unit
703 Word string transmission unit
704 Word string reception unit
705 Output unit

Claims

An acoustic analysis condition normalization system for speech recognition, comprising:
acoustic analysis means (I) for extracting a feature amount from an input signal to be recognized under an acoustic analysis condition (I);
speech model storage means for storing a speech model (I), the speech model (I) being a mixture of a plurality of Gaussian distributions created from feature amounts extracted from a first learning input signal under the acoustic analysis condition (I) by the acoustic analysis means (I);
correspondence table storage means for storing a correspondence table that holds the correspondence between each Gaussian distribution included in a speech model (O) and each Gaussian distribution included in the speech model (I), the speech model (O) being a mixture of a plurality of Gaussian distributions created from feature amounts extracted from a second learning input signal by acoustic analysis means (O), different from the acoustic analysis means (I), under the same acoustic analysis condition (O) as when the acoustic model used for speech recognition was created; and
acoustic analysis condition correcting means for calculating, for each Gaussian distribution constituting the speech model (I), a posterior probability given the feature amount of the input signal to be recognized, and for correcting the feature amount of the input signal to match the acoustic analysis condition (O) by taking a weighted average, with the posterior probabilities as weights, of the mean values of the corresponding Gaussian distributions obtained through the correspondence table, thereby obtaining a corrected feature amount.

The acoustic analysis condition normalization system according to claim 1, wherein the correspondence table stores pairs each consisting of an index of a Gaussian distribution constituting the speech model (I) and the mean value of the Gaussian distribution constituting the speech model (O) that corresponds to the index.

The acoustic analysis condition normalization system according to any one of claims 1 to 2, wherein the correspondence between the respective Gaussian distributions is established by using the same learning data when creating the speech model (I) and the speech model (O), and is stored in the correspondence table.

The acoustic analysis condition normalization system according to any one of claims 1 to 2, wherein the correspondence between the respective Gaussian distributions is established by creating each Gaussian distribution in units of phonemes when creating the speech model (I) and the speech model (O), and is stored in the correspondence table.

The acoustic analysis condition normalization system according to any one of claims 1 to 4, wherein the acoustic analysis condition is any one of a characteristic of the microphone that acquires the input signal, a noise estimation method for estimating a noise component from the input signal, a noise suppression method for suppressing the noise component of the input signal to obtain estimated speech, and a feature amount calculation method, or a combination of one or more of these.

The acoustic analysis condition normalization system according to claim 5, wherein:
the speech model storage means stores a plurality of speech models (i, ii, ...) created under a plurality of different acoustic analysis conditions (i, ii, ...);
the system further comprises speech model selection means for selecting the speech model (I) using the input signal, the noise component, the estimated speech, the feature amount, or a combination thereof; and
the acoustic analysis condition correcting means corrects the feature amount of the input signal to be recognized using the selected speech model (I) and the correspondence table.

The acoustic analysis condition normalization system according to any one of claims 1 to 5, wherein:
the speech model storage means stores a plurality of speech models (i, ii, ...) created under a plurality of different acoustic analysis conditions (i, ii, ...);
the system further comprises SNR estimation means for estimating an SNR from the input signal, and speech model selection means for selecting the speech model (I) based on the estimated SNR; and
the acoustic analysis condition correcting means corrects the feature amount of the input signal to be recognized using the selected speech model (I) and the correspondence table.

A distributed speech recognition system in which an acoustic analysis unit on a terminal side extracts a feature amount from input speech to be recognized and a decoder unit on a server side converts the feature amount into a word string using the acoustic model,
wherein the feature amount extracted under the acoustic analysis condition (I) by the acoustic analysis unit on the terminal side is corrected, on the terminal side or the server side, using the means according to any one of claims 1 to 7.

A terminal of a distributed speech recognition system in which an acoustic analysis unit of the terminal extracts a feature amount from input speech to be recognized and a decoder unit on a server side converts the feature amount into a word string using the acoustic model,
wherein the feature amount extracted under the acoustic analysis condition (I) is corrected using the means according to any one of claims 1 to 7.

An acoustic analysis condition normalization method for speech recognition, comprising:
a first step of extracting a feature amount from an input signal to be recognized under an acoustic analysis condition (I);
a second step of storing a speech model (I), the speech model (I) being a mixture of a plurality of Gaussian distributions created from feature amounts extracted from a first learning input signal under the acoustic analysis condition (I) by acoustic analysis means (I);
a third step of storing a correspondence table that holds the correspondence between each Gaussian distribution included in a speech model (O) and each Gaussian distribution included in the speech model (I), the speech model (O) being a mixture of a plurality of Gaussian distributions created from feature amounts extracted from a second learning input signal by acoustic analysis means (O), different from the acoustic analysis means (I), under the same acoustic analysis condition (O) as when the acoustic model used for speech recognition was created; and
a fourth step of calculating, for each Gaussian distribution constituting the speech model (I), a posterior probability given the feature amount of the input signal to be recognized, and correcting the feature amount of the input signal to match the acoustic analysis condition (O) by taking a weighted average, with the posterior probabilities as weights, of the mean values of the corresponding Gaussian distributions obtained through the correspondence table, thereby obtaining a corrected feature amount.

The acoustic analysis condition normalization method according to claim 10, wherein the correspondence table stores pairs each consisting of an index of a Gaussian distribution constituting the speech model (I) and the mean value of the Gaussian distribution constituting the speech model (O) that corresponds to the index.

The acoustic analysis condition normalization method according to any one of claims 10 to 11, wherein the correspondence between the respective Gaussian distributions is established by using the same learning data when creating the speech model (I) and the speech model (O), and is stored in the correspondence table.

The acoustic analysis condition normalization method according to claim 10, wherein the correspondence between the respective Gaussian distributions is established by creating each Gaussian distribution in units of phonemes when creating the speech model (I) and the speech model (O), and is stored in the correspondence table.

The acoustic analysis condition normalization method according to claim 10, wherein the acoustic analysis condition is any one of a characteristic of the microphone that acquires the input signal, a noise estimation method for estimating a noise component from the input signal, a noise suppression method for suppressing the noise component of the input signal to obtain estimated speech, and a feature amount calculation method, or a combination of one or more of these.

The acoustic analysis condition normalization method according to claim 14, wherein:
in the second step, a plurality of speech models (i, ii, ...) created under a plurality of different acoustic analysis conditions (i, ii, ...) are stored;
the method further comprises a step of selecting the speech model (I) using the input signal, the noise component, the estimated speech, the feature amount, or a combination thereof; and
in the fourth step, the feature amount of the input signal to be recognized is corrected using the selected speech model (I) and the correspondence table.

The acoustic analysis condition normalization method according to claim 10, wherein:
in the second step, a plurality of speech models (i, ii, ...) created under a plurality of different acoustic analysis conditions (i, ii, ...) are stored;
the method further comprises a step of estimating an SNR from the input signal and a step of selecting the speech model (I) based on the estimated SNR; and
in the fourth step, the feature amount of the input signal to be recognized is corrected using the selected speech model (I) and the correspondence table.

An acoustic analysis condition normalization method wherein, when an acoustic analysis unit on a terminal side extracts a feature amount from input speech to be recognized and a decoder unit on a server side converts the feature amount into a word string using the acoustic model,
the feature amount extracted under the acoustic analysis condition (I) by the acoustic analysis unit on the terminal side is corrected, on the terminal side or the server side, using the method according to any one of claims 10 to 16.

A terminal of a distributed speech recognition system wherein, when an acoustic analysis unit of the terminal extracts a feature amount from input speech to be recognized and a decoder unit on a server side converts the feature amount into a word string using the acoustic model,
the feature amount extracted under the acoustic analysis condition (I) is corrected using the method according to any one of claims 10 to 16.

An acoustic analysis condition normalization program that runs in an acoustic analysis condition normalization system for speech recognition, the program causing a computer to function as:
acoustic analysis means (I) for extracting a feature amount from an input signal to be recognized under an acoustic analysis condition (I);
speech model storage means for storing a speech model (I), the speech model (I) being a mixture of a plurality of Gaussian distributions created from feature amounts extracted from a first learning input signal under the acoustic analysis condition (I) by the acoustic analysis means (I);
correspondence table storage means for storing a correspondence table that holds the correspondence between each Gaussian distribution included in a speech model (O) and each Gaussian distribution included in the speech model (I), the speech model (O) being a mixture of a plurality of Gaussian distributions created from feature amounts extracted from a second learning input signal by acoustic analysis means (O), different from the acoustic analysis means (I), under the same acoustic analysis condition (O) as when the acoustic model used for speech recognition was created; and
acoustic analysis condition correcting means for calculating, for each Gaussian distribution constituting the speech model (I), a posterior probability given the feature amount of the input signal to be recognized, and for correcting the feature amount of the input signal to match the acoustic analysis condition (O) by taking a weighted average, with the posterior probabilities as weights, of the mean values of the corresponding Gaussian distributions obtained through the correspondence table, thereby obtaining a corrected feature amount.

The acoustic analysis condition normalization program according to claim 19, wherein the correspondence table stores pairs each consisting of an index of a Gaussian distribution constituting the speech model (I) and the mean value of the Gaussian distribution constituting the speech model (O) that corresponds to the index.

The acoustic analysis condition normalization program according to any one of claims 19 to 20, wherein the correspondence between the respective Gaussian distributions is established by using the same learning data when creating the speech model (I) and the speech model (O), and is stored in the correspondence table.

The acoustic analysis condition normalization program according to any one of claims 19 to 20, wherein the correspondence between the respective Gaussian distributions is established by creating each Gaussian distribution in units of phonemes when creating the speech model (I) and the speech model (O), and is stored in the correspondence table.

The acoustic analysis condition normalization program according to any one of claims 19 to 22, wherein the acoustic analysis condition is any one of a characteristic of the microphone that acquires the input signal, a noise estimation method for estimating a noise component from the input signal, a noise suppression method for suppressing the noise component of the input signal to obtain estimated speech, and a feature amount calculation method, or a combination of one or more of these.

The acoustic analysis condition normalization program according to claim 23, wherein:
the speech model storage means stores a plurality of speech models (i, ii, ...) created under a plurality of different acoustic analysis conditions (i, ii, ...);
the program further causes the computer to function as speech model selection means for selecting the speech model (I) using the input signal, the noise component, the estimated speech, the feature amount, or a combination thereof; and
the acoustic analysis condition correcting means corrects the feature amount of the input signal to be recognized using the selected speech model (I) and the correspondence table.

The acoustic analysis condition normalization program according to any one of claims 19 to 23, wherein:
the speech model storage means stores a plurality of speech models (i, ii, ...) created under a plurality of different acoustic analysis conditions (i, ii, ...);
the program further causes the computer to function as SNR estimation means for estimating an SNR from the input signal, and as speech model selection means for selecting the speech model (I) based on the estimated SNR; and
the acoustic analysis condition correcting means corrects the feature amount of the input signal to be recognized using the selected speech model (I) and the correspondence table.