JP2009210647A - Noise canceler, method thereof, program thereof and recording medium

Info

Publication number: JP2009210647A
Application number: JP2008051175A
Authority: JP
Inventors: Masakiyo Fujimoto; 雅清藤本; Kentaro Ishizuka; 健太郎石塚; Tomohiro Nakatani; 智広中谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-02-29
Filing date: 2008-02-29
Publication date: 2009-09-17
Anticipated expiration: 2028-02-29
Also published as: JP4856662B2

Abstract

PROBLEM TO BE SOLVED: To perform highly accurate speech signal section estimation and noise removal by integrating speech signal section estimation technology and noise removal technology.

SOLUTION: An acoustic feature of the input signal is extracted frame by frame. Noise model parameters are estimated by parallel processing in both the forward and the reverse direction along the time axis, using probability models of a clean speech signal and a silence signal. The speech and non-speech state probabilities, and the ratio of the speech state probability to the non-speech state probability, are calculated for each frame, and the speech section is estimated by comparing this ratio with a threshold. A frequency response filter for removing the noise signal is then created from the parameters of each probability model and the speech and non-speech state probabilities. The frequency response filter is converted to an impulse response filter, and the impulse response filter is convolved with the input signal to generate and output a noise-removed speech signal.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a noise removal technique that estimates the sections in which a speech signal is present in an acoustic signal containing both a speech signal and a noise signal, and removes the noise signal, that is, everything other than the speech signal, from the signal in the estimated sections.

When automatic speech recognition is used in a real environment, the section in which the target speech signal is present must be estimated from an acoustic signal that also contains signals other than the target speech, that is, noise, and the noise must then be removed. The use of automatic speech recognition in real environments is widely anticipated in the coming information society, and this problem needs to be solved promptly.

Non-Patent Document 1, listed below, discloses a speech signal section estimation method that uses features such as the frequency spectrum of the input acoustic signal, the full-band energy of the signal, the energy of each band after band division, the zero-crossing count of the signal waveform, and the time derivatives of these quantities.

Non-Patent Document 2, listed below, discloses a method that estimates the noise component contained in the cepstrum of the input acoustic signal with a sequential EM algorithm and removes noise by subtracting the estimated noise component from the input acoustic signal.

Non-Patent Document 3, listed below, discloses a speech signal section estimation method that applies noise removal based on Wiener filter theory to the input acoustic signal and then uses features of the noise-removed signal, such as its full-band energy, the energy of each band after band division, and the variance of its frequency spectrum.

Non-Patent Document 1: Benyassine, A., Shlomot, E., and Su, H.-Y., “ITU-T recommendation G.729 Annex B: A silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications,” IEEE Communications Magazine, pp. 64-73, September 1997.
Non-Patent Document 2: Myrvoll, T. A. and Nakamura, S., “Online Cepstral Filtering Using A Sequential EM Approach with Polyak Averaging and Feedback,” Proc. ICASSP '05, Vol. I, pp. 261-264, Philadelphia, USA, March 2005.
Non-Patent Document 3: ETSI ES 202 050 v.1.1.4, “Speech processing, Transmission and Quality aspects (STQ), Distributed Speech Recognition; Advanced Front-end feature extraction algorithm; Compression algorithms,” Nov. 2005.

Two technologies are indispensable for automatic speech recognition in a real environment: speech signal section estimation, which estimates from the input acoustic signal the sections containing the speech to be recognized, and noise removal, which removes noise from the input acoustic signal to obtain a high-quality speech signal.

The techniques described in Non-Patent Documents 1 and 2 are standalone techniques for speech signal section estimation and for noise removal, respectively. In general, automatic speech recognition in a real environment is performed by connecting such techniques in sequence in the processing flow. However, the techniques do not share the information and parameters they each need; the individual techniques are simply chained together. Processing errors, parameter estimation errors, and the like must therefore be evaluated separately for each technique. Compared with the case where both techniques can mutually evaluate the information and parameters they need, it becomes harder to deal with processing errors and parameter estimation errors, and highly accurate speech signal section estimation and noise removal cannot be achieved.

The technique of Non-Patent Document 3 connects speech signal section estimation and noise removal internally, and performs speech signal section estimation after noise removal. The aim is to improve the accuracy of speech signal section estimation by using features of the target speech signal in which the influence of noise has been reduced by noise removal. However, this technique merely uses the result of noise removal for speech signal section estimation; parameters and other information are not closely shared between the two techniques. It is therefore difficult to achieve highly accurate speech signal section estimation and noise removal with the technique of Non-Patent Document 3 as well.

The present invention has been made in view of these problems. Its object is to provide a noise removal technique that achieves highly accurate speech signal section estimation and noise removal by closely sharing information such as parameters between speech signal section estimation and noise removal and by handling the two in an integrated manner.

In the present invention, probability model parameters are first stored in a model parameter storage unit. These are the parameters of probability models that express the output probabilities of a clean speech signal and of a silence signal, each as a Gaussian mixture distribution containing a plurality of normal distributions. An acoustic signal analysis process is then executed, which extracts and outputs the speech feature of the input signal for each frame, a frame being a fixed time interval. A forward estimation process is also executed: it receives the speech feature and the probability model parameters of the clean speech signal and the silence signal stored in the model parameter storage unit, and sequentially estimates and outputs the noise model parameters of the current frame with a parallel nonlinear Kalman filter, proceeding from past frames toward the current frame. A backward estimation process is executed as well: it receives the noise model parameters and the probability model parameters of the clean speech signal and the silence signal, and sequentially re-estimates the noise model parameters of the current frame in the backward direction with a parallel Kalman smoother, proceeding from future frames toward the current frame. Based on the backward-estimated noise model parameters, it sequentially estimates the parameters of probability models that express the output probabilities of the speech (noise + clean speech) signal and the non-speech (noise + silence) signal, each as a Gaussian mixture distribution, and calculates and outputs the output probability of the speech signal and of the non-speech signal. The calculation results obtained in the forward and backward estimation processes are stored in a parameter storage unit.

Next, a state probability ratio calculation process is executed. It receives the output probabilities of the speech signal and the non-speech signal, calculates the speech state probability, the non-speech state probability, and the ratio of the speech state probability to the non-speech state probability, and outputs them. A speech signal section estimation process is then executed. It receives the state probability ratio, compares the ratio with a threshold for each frame, and outputs a decision indicating whether each frame belongs to the speech state or to the non-speech state.

Further, a noise removal process is executed. Its inputs are the mean of each normal distribution in the probability model parameters of the speech signal and the non-speech signal, the mean of each normal distribution in the probability model parameters of the clean speech signal and the silence signal, and the speech and non-speech state probabilities. For each normal distribution, the process takes the value of the clean speech or silence mean relative to the corresponding speech or non-speech mean, and computes a weighted average of these relative values using the speech state probability and the non-speech state probability. This yields a frequency response filter that removes the noise signal. The process converts the frequency response filter into an impulse response filter, convolves the impulse response filter with the input signal, and generates and outputs a noise-removed speech signal.

In the present invention, the parameters generated for the speech signal section estimation process can be reused as parameters for the noise removal process. As a result, information such as parameters is closely shared between speech signal section estimation and noise removal, the two are handled in an integrated manner, and highly accurate speech signal section estimation and noise removal can be achieved.

Embodiments of the present invention are described below with reference to the drawings. In the drawings used in the following description, identical parts are given identical reference numerals. Their names and functions are also identical, and their description is not repeated.

In the following description, symbols such as “^” and “~” used in the text should properly be written directly above the character that follows them, but because of the limitations of text notation they are written immediately before that character. In the equations, these symbols appear in their proper positions. Vectors are written with the word “vector” placed immediately before them, as in “vector A”. Unless otherwise noted, processing described for an individual vector element applies to all elements of the vector.

[First Embodiment]
<Configuration>
First, the configuration of the noise removal apparatus 1 of this embodiment is described.
FIG. 1 is a block diagram showing the functional configuration of the noise removal apparatus 1 of the first embodiment.
As shown in FIG. 1, the noise removal apparatus 1 includes an acoustic signal analysis unit 10, a model parameter storage unit 20, a forward estimation unit 30, a backward estimation unit 40, an estimation processing parameter storage unit 50, a state probability ratio calculation unit 60, a speech signal section estimation unit 70, a signal removal unit 80, and a control unit 90.

The noise removal apparatus 1 of this embodiment is built, for example, by loading a predetermined program into a known computer comprising a CPU (central processing unit), a RAM (random-access memory), and the like, and having the CPU execute the program. Each process of the noise removal apparatus 1 is controlled by the control unit 90. Unless otherwise stated, each processing result is stored in a temporary memory (not shown) such as a register and read out for use in subsequent processing. In other words, although this is not stated explicitly below, sending data from one processing unit to another means that the data generated by the first unit is stored in the temporary memory and the second unit reads it from that memory.

FIG. 2A shows the details of the model parameter storage unit 20 shown in FIG. 1, and FIG. 2B shows the details of the estimation processing parameter storage unit 50.
As shown in FIG. 2A, the model parameter storage unit 20 of this embodiment includes a silence GMM storage unit 21, a clean speech GMM storage unit 22, a non-speech GMM storage unit 23, and a speech GMM storage unit 24. As shown in FIG. 2B, the estimation processing parameter storage unit 50 of this embodiment includes an initial noise model estimation buffer 51 and a noise model estimation buffer 52.

FIG. 3 is a block diagram illustrating the detailed configuration of the forward estimation unit 30 shown in FIG. 1. As shown in FIG. 3, the forward estimation unit 30 of this embodiment includes a noise model parameter prediction unit 31, a noise model parameter update unit 32, a forward probability model parameter generation unit 33, a forward speech/non-speech output probability calculation unit 34, a forward first weighted average calculation unit 35, and a forward second weighted average calculation unit 37.

FIG. 4 is a block diagram illustrating the detailed configuration of the backward estimation unit 40 shown in FIG. 1. As shown in FIG. 4, the backward estimation unit 40 of this embodiment includes a noise model parameter re-estimation unit 42, a backward probability model parameter generation unit 43, a backward speech/non-speech output probability calculation unit 44, a backward first weighted average calculation unit 45, and a backward second weighted average calculation unit 47.

FIG. 5 is a block diagram illustrating the detailed configuration of the state probability ratio calculation unit 60 shown in FIG. 1. As shown in FIG. 5, the state probability ratio calculation unit 60 includes a state transition probability table 61, a forward probability calculation unit 62, a backward probability calculation unit 63, a probability ratio calculation buffer 64, and a probability ratio calculation unit 65.
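The components listed for the state probability ratio calculation unit 60 (a state transition table, forward and backward probability calculation, and a ratio calculation) resemble the textbook forward-backward algorithm for a two-state hidden Markov model. The following is a minimal sketch under that assumption; the function name, argument layout, and the example transition values are hypothetical, and the exact formulas used by the apparatus are not given in this part of the description.

```python
def state_probability_ratio(output_probs, transition, initial=(0.5, 0.5)):
    """Per-frame ratio P(speech | all frames) / P(non-speech | all frames).

    output_probs[t][j]: output probability of frame t under state j
                        (j = 0: non-speech, j = 1: speech).
    transition[i][j]:   probability of moving from state i to state j
                        (stands in for the state transition probability table).
    initial:            assumed initial state probabilities (hypothetical).
    """
    n = len(output_probs)

    # Forward pass, normalised per frame to avoid numerical underflow.
    alpha = []
    prev = None
    for t in range(n):
        if t == 0:
            a = [initial[j] * output_probs[0][j] for j in (0, 1)]
        else:
            a = [sum(prev[i] * transition[i][j] for i in (0, 1)) * output_probs[t][j]
                 for j in (0, 1)]
        total = sum(a)
        prev = [v / total for v in a]
        alpha.append(prev)

    # Backward pass, also normalised per frame.
    beta = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        b = [sum(transition[i][j] * output_probs[t + 1][j] * beta[t + 1][j]
                 for j in (0, 1))
             for i in (0, 1)]
        total = sum(b)
        beta[t] = [v / total for v in b]

    # The per-frame scale factors cancel in the ratio of the two states.
    return [(alpha[t][1] * beta[t][1]) / (alpha[t][0] * beta[t][0]) for t in range(n)]


# Toy input: the first frame looks like non-speech, the next two like speech.
probs = [(0.9, 0.1), (0.2, 0.8), (0.1, 0.9)]
sticky = [[0.9, 0.1], [0.1, 0.9]]
ratios = state_probability_ratio(probs, sticky)
print(ratios)
```

Normalising the forward and backward values per frame leaves the ratio unchanged, because both states of a frame are scaled by the same factor.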

FIG. 6 is a block diagram illustrating the detailed configuration of the speech signal section estimation unit 70 shown in FIG. 1. As shown in FIG. 6, the speech signal section estimation unit 70 includes an L(s) register 71, a threshold TH register 72, and a comparison unit 73.
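The comparison performed by the speech signal section estimation unit 70 can be sketched as follows. This is only an illustration: the function names are hypothetical, and whether a ratio exactly equal to the threshold counts as speech is an assumption, since the description here only says that the ratio is compared with the threshold TH.

```python
def classify_frames(ratios, threshold):
    """Return 1 (speech state) or 0 (non-speech state) for each frame.

    Assumption: a ratio equal to the threshold is treated as speech.
    """
    return [1 if ratio >= threshold else 0 for ratio in ratios]


def speech_sections(decisions):
    """Group consecutive speech frames into (first_frame, last_frame) pairs."""
    sections = []
    start = None
    for t, is_speech in enumerate(decisions):
        if is_speech and start is None:
            start = t                      # a speech section begins
        elif not is_speech and start is not None:
            sections.append((start, t - 1))  # the section ended at the previous frame
            start = None
    if start is not None:
        sections.append((start, len(decisions) - 1))
    return sections


decisions = classify_frames([0.2, 3.0, 5.0, 0.5, 2.0], threshold=1.0)
print(decisions)
print(speech_sections(decisions))
```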

FIG. 7 is a block diagram illustrating the detailed configuration of the signal removal unit 80 shown in FIG. 1. As shown in FIG. 7, the signal removal unit 80 includes a frequency response filter generation unit 81, an impulse response filter conversion unit 82, an input signal readout unit 83, and a filtering unit 84.
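The chain of a frequency response filter, an impulse response filter, and filtering by convolution can be sketched as below. Several points are assumptions made for illustration and are not taken from the description: the "relative value" of the means is computed as exp(clean mean - noisy mean) because the features are logarithmic, the per-distribution weights `component_weights` are hypothetical, and the gains are treated as if they were already given on linear DFT bins (the mapping from mel channels is omitted).

```python
import cmath
import math


def frequency_response(state_probs, component_weights, clean_means, noisy_means):
    """Gain per bin: weighted average of exp(mu_clean - mu_noisy).

    state_probs[j]:          non-speech (j = 0) and speech (j = 1) state probability.
    component_weights[j][k]: assumed weight of normal distribution k in model j.
    clean_means[j][k][u], noisy_means[j][k][u]: log-domain means per bin u.
    """
    n_bins = len(clean_means[0][0])
    gains = [0.0] * n_bins
    for j, p_state in enumerate(state_probs):
        for k, w in enumerate(component_weights[j]):
            for u in range(n_bins):
                gains[u] += p_state * w * math.exp(clean_means[j][k][u]
                                                   - noisy_means[j][k][u])
    return gains


def impulse_response(gains):
    """Inverse DFT of a zero-phase response given on bins 0 .. N/2."""
    half = len(gains) - 1
    n = 2 * half
    full = list(gains) + list(gains[-2:0:-1])   # mirror to a symmetric spectrum
    h = [sum(full[k] * cmath.exp(2j * math.pi * k * m / n) for k in range(n)).real / n
         for m in range(n)]
    # Rotate so the response is centred, which delays the output by `half` samples.
    return h[half:] + h[:half]


def convolve(signal, taps):
    """Direct-form linear convolution of the input signal with the filter taps."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, x in enumerate(signal):
        for t, h in enumerate(taps):
            out[i + t] += x * h
    return out


# A flat gain of 1.0 should pass the signal through, delayed by 4 samples.
taps = impulse_response([1.0] * 5)
print(convolve([1.0, 2.0, 3.0], taps))
```

With a flat unit gain the filter reduces to a pure delay, which gives a quick sanity check of the inverse transform and the convolution.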

<Overall processing>
Next, the overall processing of this embodiment is described.
In this embodiment, as preprocessing, probability model parameters are stored in the model parameter storage unit. These are the parameters of probability models that express the output probabilities of the clean speech signal and of the silence signal, each as a Gaussian mixture distribution containing a plurality of normal distributions. The following processes are then executed. An acoustic signal analysis process extracts and outputs the speech feature of the input signal for each frame, a frame being a fixed time interval. A forward estimation process receives the speech feature and the probability model parameters of the clean speech signal and the silence signal stored in the model parameter storage unit, and sequentially estimates and outputs the noise model parameters of the current frame with a parallel nonlinear Kalman filter, proceeding from past frames toward the current frame. A backward estimation process receives the noise model parameters and the probability model parameters of the clean speech signal and the silence signal, and sequentially re-estimates the noise model parameters of the current frame in the backward direction with a parallel Kalman smoother, proceeding from future frames toward the current frame. Based on the backward-estimated noise model parameters, it sequentially estimates the parameters of probability models that express the output probabilities of the speech (noise + clean speech) signal and the non-speech (noise + silence) signal, each as a Gaussian mixture distribution, and calculates and outputs the output probability of the speech signal and of the non-speech signal. The calculation results obtained in the forward and backward estimation processes are stored in the parameter storage unit. A state probability ratio calculation process receives the output probabilities of the speech signal and the non-speech signal, calculates the speech state probability, the non-speech state probability, and the ratio of the speech state probability to the non-speech state probability, and outputs them.
A speech signal section estimation process then receives the state probability ratio, compares the ratio with a threshold for each frame, and outputs a decision indicating whether each frame belongs to the speech state or to the non-speech state. Finally, a noise removal process is executed. Its inputs are the mean of each normal distribution in the probability model parameters of the speech signal and the non-speech signal, the mean of each normal distribution in the probability model parameters of the clean speech signal and the silence signal, and the speech and non-speech state probabilities. For each normal distribution, the process takes the value of the clean speech or silence mean relative to the corresponding speech or non-speech mean, and computes a weighted average of these relative values using the speech state probability and the non-speech state probability. This yields a frequency response filter that removes the noise signal. The process converts the frequency response filter into an impulse response filter, convolves the impulse response filter with the input signal, and generates and outputs a noise-removed speech signal.

In this embodiment, the parameters for the speech signal section estimation process are thus reused as parameters for the noise removal process. This allows speech signal section estimation and noise removal to be handled in an integrated manner and makes highly accurate speech signal section estimation and noise removal possible. In addition, the frequency response filter that removes the noise signal is generated by taking, for each normal distribution, the value of the clean speech or silence mean relative to the corresponding speech or non-speech mean, and averaging these relative values with the speech state probability and the non-speech state probability as weights. Weighting by the state probabilities in this way improves the accuracy of the frequency response filter and therefore the accuracy of noise removal.

In this embodiment, the speech signal (clean speech signal and silence signal) and the noise signal are defined as follows.

Even when a recording is made in a soundproof room or similar environment where no noise is present at all, a very faint, white-noise-like component is observed in the recorded signal. In this embodiment, the signal observed in such an environment is defined as the silence signal.

The silence signal can therefore be regarded as a kind of noise, but it is noise generated by electrical factors such as the electric circuits of the recording equipment and the transmission system. In contrast, sounds such as a moving car or the wind are noise generated by acoustic factors, observed as sound waves travelling through the air. This embodiment distinguishes noise due to electrical factors from noise due to acoustic factors and defines only the latter as the noise signal.

When someone speaks in an environment where the silence signal is observed, the uttered speech is observed superimposed on the silence signal. In this embodiment, this superimposed signal is defined as the clean speech signal.

In an environment without a noise signal, clean speech signals are observed between stretches of continuous silence signal. In this embodiment, the silence signal and the clean speech signal are collectively defined as the speech signal.

<Details of processing>
Next, the details of the processing of this embodiment are described.
[Preprocessing]
The silence GMM storage unit 21 and the clean speech GMM storage unit 22 of the model parameter storage unit 20 store the parameters of probability models, prepared in advance, of the silence signal and the clean speech signal, respectively. In this embodiment, a Gaussian mixture model (GMM), which contains a plurality of normal distributions, is used as the probability model. The more normal distributions the GMM contains, the better the estimation accuracy but the lower the processing speed. In practice, the number of normal distributions is therefore preferably between 2 and 512, and most preferably about 32. Each normal distribution is defined by a mixture weight $w_{j,k}$, a mean $\mu^{S}_{j,k,u}$, and a variance $\sigma^{S}_{j,k,u}$. Here, $j$ is the GMM type ($j = 0$: silence GMM, $j = 1$: clean speech GMM) and $k$ is the index of the normal distribution. How a GMM is constructed is a known technique and is not described here (see, for example, Seiichi Nakagawa, “Speech Recognition Using Probabilistic Models”, IEICE). The following processing is executed on the premise of this preprocessing.
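To make the role of the stored parameters concrete, the sketch below evaluates the log output probability of a feature vector under a GMM given mixture weights, means, and variances. It assumes diagonal covariance (one variance per feature element), which is suggested by the per-element variance parameter but not stated explicitly here; the function names and the toy parameter values are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a normal distribution with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, weights, means, variances):
    """log sum_k w_k N(x; mu_k, sigma_k), using log-sum-exp for stability."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Toy two-component models over a 2-element feature (values are made up).
silence_gmm = ([0.5, 0.5], [[0.0, 0.0], [0.5, 0.5]], [[1.0, 1.0], [1.0, 1.0]])
speech_gmm = ([0.4, 0.6], [[4.0, 5.0], [6.0, 5.0]], [[1.0, 1.0], [2.0, 2.0]])

feature = [5.0, 5.0]
print(gmm_log_likelihood(feature, *silence_gmm))
print(gmm_log_likelihood(feature, *speech_gmm))
```

For the toy feature above, the speech model should score higher than the silence model, since its means lie much closer to the feature.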

[Processing of acoustic signal analysis unit 10]
First, an acoustic signal $o_\nu$ that has been sampled at a predetermined sampling frequency (for example, 8000 Hz) and converted into a discrete signal is input to the acoustic signal analysis unit 10. This acoustic signal $o_\nu$ is the target speech signal with a noise signal superimposed on it, and in this embodiment it is called the input signal $o_\nu$. Here, $\nu$ is a discrete value indicating the sampling time.

The acoustic signal analysis unit 10 cuts frames of fixed length, $o_{t,0}, \ldots, o_{t,m}, \ldots, o_{t,M-1}$, out of the input signal $o_\nu$ while shifting the start point along the time axis by a fixed step. For example, it cuts out input signals 160 samples long (20 ms at a sampling frequency of 8000 Hz) while shifting the start point by 80 samples (10 ms at a sampling frequency of 8000 Hz). Here, $t$ is the frame number assigned to each frame. The initial value of $t$ is 0, and each time a new frame is cut out it is given the previous frame number plus 1. $M$ is the number of samples cut out per frame, and $o_{t,m}$ is the $(m+1)$-th input sample contained in frame $t$.
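The framing step can be sketched as follows, using the example values from the text (160-sample frames shifted by 80 samples). The function name is hypothetical, and dropping trailing samples that do not fill a whole frame is an assumption, since the description does not say how the end of the signal is handled.

```python
def split_into_frames(signal, frame_length=160, frame_shift=80):
    """Cut overlapping frames out of a sampled signal.

    Frame t holds samples o_{t,0} ... o_{t,M-1} with M = frame_length, and the
    start point advances by frame_shift samples from one frame to the next.
    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    start = 0
    while start + frame_length <= len(signal):
        frames.append(list(signal[start:start + frame_length]))
        start += frame_shift
    return frames


# One second of audio at 8000 Hz; each sample value is simply its index.
samples = list(range(8000))
frames = split_into_frames(samples)
print(len(frames))       # number of frames cut out
print(frames[1][0])      # first sample o_{1,0} of frame t = 1
```

With a shift of half the frame length, consecutive frames overlap by 50 percent, so each sample away from the signal edges appears in two frames.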

The acoustic signal analysis unit 10 then applies a fast Fourier transform to the input signals $o_{t,0}, \ldots, o_{t,m}, \ldots, o_{t,M-1}$ of each frame to convert them into a frequency-domain signal, applies a 24-dimensional mel filter bank analysis to generate 24 spectral power coefficients, and takes their logarithms to calculate and output a vector $O_t = \{O_{t,0}, \ldots, O_{t,u}, \ldots, O_{t,23}\}$ whose elements are the 24-dimensional log mel spectrum. In this embodiment, this vector $O_t$ is used as the speech feature at frame number $t$ (hereinafter, speech feature $O_t$). The subscript $u$ is the element index of the vector and corresponds to the channel number of the mel filter bank, that is, the index of the triangular window applied to the frequency-domain signal on the mel scale in the mel filter bank analysis. The order of the mel filter bank analysis means the number of channels of the filter bank, that is, the number of triangular windows. This embodiment uses a 24-dimensional mel filter bank analysis as an example, but the number of dimensions is not limited to this. Since mel filter bank analysis is a known technique, its description is omitted (see, for example, Non-Patent Document 3).
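The feature extraction described above can be sketched as follows, as a minimal pure-Python illustration rather than the implementation of the embodiment. The FFT length of 256, the absence of a window function, the mel-scale formula, the floor applied before the logarithm, and all function names are assumptions of this sketch; a naive DFT stands in for the fast Fourier transform to keep the example self-contained.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame, n_fft):
    """Naive DFT power spectrum for bins 0..n_fft/2 (stand-in for an FFT)."""
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    spec = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, x in enumerate(padded))
        spec.append(abs(acc) ** 2)
    return spec


def mel_filterbank(n_channels, n_fft, sample_rate):
    """Triangular windows spaced evenly on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_channels + 2 edge frequencies, expressed as fractional bin positions.
    edges = [mel_to_hz(mel_max * i / (n_channels + 1)) * n_fft / sample_rate
             for i in range(n_channels + 2)]
    bank = []
    for u in range(n_channels):
        lo, mid, hi = edges[u], edges[u + 1], edges[u + 2]
        row = []
        for b in range(n_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_mel_features(frame, n_channels=24, n_fft=256, sample_rate=8000,
                     floor=1e-10):
    """Return the log mel spectrum O_t = [O_{t,0}, ..., O_{t,n_channels-1}]."""
    spec = power_spectrum(frame, n_fft)
    bank = mel_filterbank(n_channels, n_fft, sample_rate)
    # The floor avoids log(0) for channels that receive no energy.
    return [math.log(max(sum(w * p for w, p in zip(row, spec)), floor))
            for row in bank]
```

Each returned element corresponds to one channel $u$ of the filter bank, matching the element index of $O_t$.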

[Processing of Forward Estimation Unit 30]
FIG. 8 is a flowchart for explaining the processing procedure of the forward estimation unit 30 of the first embodiment. The processing of the forward estimation unit 30 is described below with reference to FIG. 8 and FIG. 3 described above.

First, the speech feature $O_{t,u}$ and the forward second weighted average values $\hat{N}_{t-1,u}$, $\hat{\sigma}^{N}_{t-1,u}$ at frame number $t-1$ are input to the noise model parameter prediction unit 31, which generates and outputs noise model parameter predicted values consisting of the mean $N^{\mathrm{pred}}_{t,u}$ and the variance $\sigma^{N,\mathrm{pred}}_{t,u}$.

This processing is described in detail following the procedure of FIG. 8. Although FIG. 8 shows the processing for only one frame, the same processing is in fact repeated for each frame (the same applies to the other flowcharts described below).

First, the control unit 90 performs frame determination processing S301 to determine the frame number of the speech feature $O_t$ output from the acoustic signal analysis unit 10. If $t < 10$ is determined in S301, the control unit 90 stores the speech feature $O_{t,u}$ in the initial noise model estimation buffer 51 of the estimation processing parameter storage unit 50 in buffering processing S302, and ends the processing of the forward estimation unit 30 for that frame.

If $t = 10$ is determined in frame determination processing S301, the noise model parameter prediction unit 31 reads the speech features $O_{0,u}, \ldots, O_{9,u}$ from the initial noise model estimation buffer 51 of the estimation processing parameter storage unit 50 in read processing S303. The noise model parameter prediction unit 31 then estimates the initial noise model parameters $N^{\mathrm{init}}_{u}$ and $\sigma^{N,\mathrm{init}}_{u}$ in initial parameter estimation processing S304 as follows.
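The equations for this initial estimation are not reproduced in this text. As a minimal sketch, assuming the initial parameters are simply the per-channel sample mean and variance of the first ten buffered speech features (an assumption, not a statement of the embodiment's equations), the step could look like this in Python. The function name `initial_noise_model` is hypothetical.

```python
def initial_noise_model(first_frames):
    """Estimate initial noise mean and variance per channel u.

    first_frames: list of feature vectors O_0 .. O_9, each a list over u.
    Assumes the leading frames contain noise only.
    """
    n = len(first_frames)
    dims = len(first_frames[0])
    mean = [sum(f[u] for f in first_frames) / n for u in range(dims)]
    var = [sum((f[u] - mean[u]) ** 2 for f in first_frames) / n
           for u in range(dims)]
    return mean, var


# Two toy frames with two channels each.
mean, var = initial_noise_model([[1.0, 2.0], [3.0, 2.0]])
```

This choice treats the first frames as noise-only, which is consistent with buffering them before any speech/non-speech decision is available.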

If $t > 10$ is determined in frame determination processing S301, the noise model parameter prediction unit 31 reads the forward second weighted average values $\hat{N}_{t-1,u}$, $\hat{\sigma}^{N}_{t-1,u}$ of the preceding frame from the noise model estimation buffer 52 of the estimation processing parameter storage unit 50 in read processing S305.

The processing of S301 to S305 uses $t = 10$ as the decision threshold, but this is only an example of the most desirable value; in practice, the threshold may be set as appropriate in the range $t = 1$ to $20$.

When $t \geq 10$, parameter prediction processing S306 is performed next. When $t > 10$, the noise model parameter prediction unit 31 predicts the noise model parameters of the current frame from the estimation result read for frame number $t-1$ by the following random walk process.

In the above equations, $N^{\mathrm{pred}}_{t,u}$ and $\sigma^{N,\mathrm{pred}}_{t,u}$ are the noise model parameter predicted values (mean and variance) at frame number $t$. $\varepsilon$ is a constant representing the degree of change in the noise; in practice it is desirably set to a value between 0.0001 and 0.001, most desirably about 0.001. When $t = 10$, the prediction is made as follows.
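The prediction equations are not reproduced in this text. Under the usual reading of a random walk, the predicted mean equals the previous estimate and the predicted variance grows by $\varepsilon$; the Python sketch below illustrates that reading and should be taken as an assumption. For $t = 10$ it is assumed that the initial parameters $N^{\mathrm{init}}_{u}$, $\sigma^{N,\mathrm{init}}_{u}$ take the place of the previous estimate. The function name `predict_noise_model` is hypothetical.

```python
def predict_noise_model(prev_mean, prev_var, eps=0.001):
    """Random-walk prediction of noise model parameters (assumed form).

    prev_mean, prev_var: per-channel estimates from frame t-1, or the
    initial parameters when t equals the warm-up threshold.
    eps: constant expressing how fast the noise is allowed to change.
    """
    pred_mean = list(prev_mean)               # mean carried over unchanged
    pred_var = [v + eps for v in prev_var]    # uncertainty grows by eps
    return pred_mean, pred_var


pred_mean, pred_var = predict_noise_model([1.0, -2.0], [0.5, 0.25])
```

A larger `eps` lets the estimate follow non-stationary noise more quickly at the cost of a noisier estimate.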

The noise model parameter predicted values calculated as described above, consisting of the mean $N^{\mathrm{pred}}_{t,u}$ and the variance $\sigma^{N,\mathrm{pred}}_{t,u}$, are sent to the noise model parameter update unit 32.

The noise model parameter update unit 32 receives the speech feature $O_{t,u}$ sent from the acoustic signal analysis unit 10, the noise model parameter predicted values $N^{\mathrm{pred}}_{t,u}$, $\sigma^{N,\mathrm{pred}}_{t,u}$, and the probabilistic model parameters $\mu^{S}_{j,k,u}$, $\sigma^{S}_{j,k,u}$ of the silence signal and the clean speech signal read from the silence GMM storage unit 21 and the clean speech GMM storage unit 22 of the model parameter storage unit 20. Using this information, the noise model parameter update unit 32 generates and outputs noise model parameter updated values consisting of the mean $\hat{N}_{t,j,k,u}$ and the variance $\hat{\sigma}^{N}_{t,j,k,u}$.

This processing is described in detail following the procedure of FIG. 8.
In parameter update processing S307, because the clean speech signal and the silence signal each have one set of probabilistic model parameters per normal distribution, the noise model parameter predicted values are updated once for each of these parameter sets, in parallel. That is, the number of update results obtained equals the total number of normal distributions contained in the probabilistic models of the clean speech signal and the silence signal. Using the input information, the noise model parameter update unit 32 updates the noise model parameters according to the following equations.

The values $\hat{N}_{t,j,k,u}$ and $\hat{\sigma}^{N}_{t,j,k,u}$ obtained by equations (11) and (12) are the noise model parameter updated values (mean and variance). The calculated updated values are sent to the forward probability model parameter generation unit 33 and also to the forward first weighted average calculation unit 35.

The forward probability model parameter generation unit 33 receives the noise model parameter updated values $\hat{N}_{t,j,k,u}$, $\hat{\sigma}^{N}_{t,j,k,u}$ sent from the noise model parameter update unit 32, and the probabilistic model parameters $\mu^{S}_{j,k,u}$, $\sigma^{S}_{j,k,u}$ of the silence signal and the clean speech signal read from the silence GMM storage unit 21 and the clean speech GMM storage unit 22 of the model parameter storage unit 20. Using this information, the forward probability model parameter generation unit 33 generates and outputs forward probability model parameters consisting of the mean $\mu^{O}_{t,j,k,u}$ and the variance $\sigma^{O}_{t,j,k,u}$.

This processing is described in detail following the procedure of FIG. 8.
In probability model parameter generation processing S308, the forward probability model parameter generation unit 33 uses the input information to generate, by the following equations, the mean $\mu^{O}_{t,j,k,u}$ and the variance $\sigma^{O}_{t,j,k,u}$ of the probabilistic models of speech (noise + clean speech: $j = 1$) and non-speech (noise + silence: $j = 0$) adapted to the noise environment at frame number $t$.
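The generation equations are not reproduced in this text. One common way to combine a clean-signal Gaussian with a noise Gaussian in the log mel domain is the log-add relation $O = \log(e^{S} + e^{N})$ with a first-order approximation of the variance. The Python sketch below shows that approach purely as an assumption; it is not confirmed to be the form used in this embodiment, and the function name `combine_log_mel` is hypothetical.

```python
import math


def combine_log_mel(mu_s, var_s, n, var_n):
    """Combine one clean-signal Gaussian and one noise Gaussian (one channel).

    Assumed model: O = log(exp(S) + exp(N)), linearized around the means.
    mu_s, var_s: mean and variance of the clean speech or silence component.
    n, var_n:    mean and variance of the noise estimate.
    """
    # Partial derivative of O with respect to S at the means.
    g = 1.0 / (1.0 + math.exp(n - mu_s))
    mu_o = mu_s + math.log1p(math.exp(n - mu_s))
    var_o = g * g * var_s + (1.0 - g) ** 2 * var_n
    return mu_o, var_o


mu_o, var_o = combine_log_mel(0.0, 1.0, 0.0, 3.0)
```

When the noise mean is far below the clean-signal mean, the combined mean stays close to the clean-signal mean, which matches the intuition that weak noise barely changes the observation.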

The parameters $\mu^{O}_{t,j,k,u}$, $\sigma^{O}_{t,j,k,u}$ of the speech and non-speech probabilistic models set as described above are stored in the speech GMM storage unit 24 and the non-speech GMM storage unit 23 of the model parameter storage unit 20, respectively, and are also sent to the forward speech/non-speech output probability calculation unit 34. In the subsequent processing, the mixture weights of these models are taken to be the mixture weights $w_{j,k}$ of the probabilistic model parameters of the clean speech signal and the silence signal.

The forward speech/non-speech output probability calculation unit 34 receives the speech feature $O_{t,u}$ sent from the acoustic signal analysis unit 10, the speech and non-speech probabilistic model parameters $\mu^{O}_{t,j,k,u}$, $\sigma^{O}_{t,j,k,u}$, and the mixture weights $w_{j,k}$ of the probabilistic model parameters of the clean speech signal and the silence signal read from the silence GMM storage unit 21 and the clean speech GMM storage unit 22 of the model parameter storage unit 20. Using this information, the forward speech/non-speech output probability calculation unit 34 generates and outputs the forward output probabilities $b_j(O_t)$ of speech and non-speech at frame number $t$, and the forward normalized output probabilities $w^{OF}_{t,j,k}$ obtained by decomposing $b_j(O_t)$ by normal distribution $k$ and normalizing.

This processing is described in detail following the procedure of FIG. 8.
In output probability calculation processing S309, the forward speech/non-speech output probability calculation unit 34 obtains, by the following equation, the forward output probabilities $b_j(O_t)$ of speech and non-speech over the whole of each probabilistic model when the speech feature $O_{t,u}$ is input to the speech and non-speech probabilistic models generated in S308.

In the above equation, $w_{j,k} b_{j,k}(O_t)$ is the output probability of each normal distribution $k$ contained in the speech and non-speech probabilistic models. The forward speech/non-speech output probability calculation unit 34 normalizes these by the following equation so that the values $w_{j,k} b_{j,k}(O_t)$ sum to 1.

In the above equation, $w^{OF}_{t,j,k}$ is the forward normalized output probability of each normal distribution $k$ contained in the speech and non-speech probabilistic models. The forward speech/non-speech output probability calculation unit 34 sends the generated forward output probabilities $b_j(O_t)$ of speech and non-speech to the forward second weighted average calculation unit 37 and the state probability ratio calculation unit 60, and sends the forward normalized output probabilities $w^{OF}_{t,j,k}$ to the forward first weighted average calculation unit 35. The forward normalized output probabilities $w^{OF}_{t,j,k}$ are also stored in the noise model estimation buffer 52 of the estimation processing parameter storage unit 50.
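The output probability and its normalization can be illustrated with the following Python sketch. It follows the description above, $b_j(O_t) = \sum_k w_{j,k} b_{j,k}(O_t)$ with each component divided by that sum, and assumes diagonal-covariance normal distributions, which the per-channel parameters suggest but the text does not state explicitly. The function names are hypothetical.

```python
import math


def gaussian_diag(x, mean, var):
    """Density of a diagonal-covariance normal distribution at vector x."""
    log_p = 0.0
    for xi, m, v in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
    return math.exp(log_p)


def output_probability(x, weights, means, variances):
    """Return (b_j, normalized component probabilities) for one GMM j.

    weights[k], means[k][u], variances[k][u] describe component k.
    The normalized values sum to 1 over k.
    """
    per_component = [w * gaussian_diag(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
    b_j = sum(per_component)
    normalized = [p / b_j for p in per_component]
    return b_j, normalized


b, w_norm = output_probability([0.0], [0.5, 0.5],
                               [[0.0], [0.0]], [[1.0], [1.0]])
```

With 24-dimensional features the densities can underflow, so a practical implementation would more likely work with log probabilities; this sketch keeps the linear domain for readability.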

The forward first weighted average calculation unit 35 receives the noise model parameter updated values $\hat{N}_{t,j,k,u}$, $\hat{\sigma}^{N}_{t,j,k,u}$ and the forward normalized output probabilities $w^{OF}_{t,j,k}$, and outputs the forward first weighted average values of the noise model parameters, consisting of the mean $\hat{N}_{t,j,u}$ and the variance $\hat{\sigma}^{N}_{t,j,u}$.

This processing is described in detail following the procedure of FIG. 8.
In first weighted average processing S310, the forward first weighted average calculation unit 35 takes the weighted average of the multiple noise model parameter update results obtained in parameter update processing S307, using the forward normalized output probabilities $w^{OF}_{t,j,k}$ obtained in output probability calculation processing S309 as weights. This yields the forward first weighted average values $\hat{N}_{t,j,u}$, $\hat{\sigma}^{N}_{t,j,u}$, which are the noise parameter estimation results corresponding to the speech and non-speech probabilistic models. The weighted average is calculated by the following equations.
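The weighted-average equations are not reproduced in this text. A minimal Python sketch of the step as described, assuming a plain weighted sum over the components $k$ of one model $j$ applied in the same way to the means and to the variances, is given below. The function name `first_weighted_average` is hypothetical.

```python
def first_weighted_average(updates, weights):
    """Average per-component values over k for one model j.

    updates[k][u]: updated noise parameter (mean or variance) of component k.
    weights[k]:    normalized output probability of component k (sums to 1).
    Returns one value per channel u.
    """
    dims = len(updates[0])
    return [sum(w * upd[u] for w, upd in zip(weights, updates))
            for u in range(dims)]


avg = first_weighted_average([[1.0, 2.0], [3.0, 4.0]], [0.25, 0.75])
```

Components that explain the current frame well receive larger weights, so their update results dominate the averaged noise estimate.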

The generated forward first weighted average values $\hat{N}_{t,j,u}$, $\hat{\sigma}^{N}_{t,j,u}$ are sent to the forward second weighted average calculation unit 37.

The forward second weighted average calculation unit 37 receives the forward first weighted average values $\hat{N}_{t,j,u}$, $\hat{\sigma}^{N}_{t,j,u}$ and the forward output probabilities $b_j(O_t)$, and generates and outputs the forward second weighted average values at frame number $t$, consisting of the mean $\hat{N}_{t,u}$ and the variance $\hat{\sigma}^{N}_{t,u}$.

This processing is described in detail following the procedure of FIG. 8.
In second weighted average processing S312, the forward second weighted average calculation unit 37 takes the weighted average of the forward first weighted average values $\hat{N}_{t,j,u}$, $\hat{\sigma}^{N}_{t,j,u}$ obtained in first weighted average processing S310, using the forward output probabilities $b_j(O_t)$ obtained in output probability calculation processing S309 as weights. This yields the forward second weighted average values $\hat{N}_{t,u}$, $\hat{\sigma}^{N}_{t,u}$, which are the noise model parameter estimation results at frame number $t$ and are used to estimate the noise parameters of the next frame. The weighted average is calculated by the following equations.
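The equations for this step are likewise not reproduced in this text. The Python sketch below assumes that the two per-model estimates ($j = 0$ non-speech, $j = 1$ speech) are averaged with weights $b_j(O_t)$ normalized by $b_0(O_t) + b_1(O_t)$; this normalization is an assumption of the sketch, and the function name `second_weighted_average` is hypothetical.

```python
def second_weighted_average(per_model, b):
    """Combine the non-speech (j=0) and speech (j=1) estimates.

    per_model[j][u]: first weighted average for model j and channel u.
    b[j]:            output probability b_j(O_t) of model j.
    """
    total = b[0] + b[1]
    dims = len(per_model[0])
    return [(b[0] * per_model[0][u] + b[1] * per_model[1][u]) / total
            for u in range(dims)]


combined = second_weighted_average([[1.0], [3.0]], [0.1, 0.3])
```

The result is a single noise estimate per channel, which is what the prediction step of the next frame takes as input.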

Finally, in buffering processing S313, the control unit 90 stores the following in the noise model estimation buffer 52 of the estimation processing parameter storage unit 50: the input signals $o_{t,0}, \ldots, o_{t,m}, \ldots, o_{t,M-1}$ contained in the frame with frame number $t$, and, as obtained by the processing of S301 to S312 for that frame, the speech feature $O_{t,u}$, the noise model parameter predicted values $N^{\mathrm{pred}}_{t,u}$, $\sigma^{N,\mathrm{pred}}_{t,u}$, the noise model parameter updated values $\hat{N}_{t,j,k,u}$, $\hat{\sigma}^{N}_{t,j,k,u}$, and the forward second weighted average values $\hat{N}_{t,u}$, $\hat{\sigma}^{N}_{t,u}$.

The prediction processing of equations (3) and (4) and the update processing of equations (7) to (12) have the same form of calculation as a conventional nonlinear Kalman filter. In this embodiment, however, a separate filter is configured for each of the normal distributions contained in the GMMs of the clean speech signal and the silence signal, and the multiple estimation results obtained from these filters are combined by weighted averaging (parallel nonlinear Kalman filter). This processing enables more accurate parameter estimation of the noise model.

[Processing of Backward Estimation Unit 40]
FIG. 9 is a flowchart for explaining the processing procedure of the backward estimation unit 40 of the first embodiment. The processing of the backward estimation unit 40 is described below with reference to FIG. 9 and FIG. 4 described above.

First, the noise model parameter re-estimation unit 42 receives, from the noise model estimation buffer 52 of the estimation processing parameter storage unit 50, the noise model parameter predicted values $N^{\mathrm{pred}}_{s,u}$, $\sigma^{N,\mathrm{pred}}_{s,u}$ at frame number $s$, the noise model parameter updated values $\hat{N}_{s-1,j,k,u}$, $\hat{\sigma}^{N}_{s-1,j,k,u}$ at frame number $s-1$, and the noise model parameter re-estimated values $\tilde{N}_{s,j,k,u}$, $\tilde{\sigma}^{N}_{s,j,k,u}$ at frame number $s$. The noise model parameter re-estimation unit 42 generates and outputs noise model parameter re-estimated values at frame number $s-1$, consisting of the mean $\tilde{N}_{s-1,j,k,u}$ and the variance $\tilde{\sigma}^{N}_{s-1,j,k,u}$.

This processing is described in detail following the procedure of FIG. 9.
First, the control unit 90 performs frame determination processing S401 to determine the frame number of the speech feature $O_t$ output from the acoustic signal analysis unit 10. If $t < 10$ is determined in S401, the control unit 90 sets the variable $tb$ to 0 in variable setting processing S402 and ends the processing of the backward estimation unit 40 for that frame.

If $t \geq 10$ is determined, the control unit 90 checks in variable determination processing S403 whether $tb$ is less than the number of frames $B$ required for backward estimation. If so, it adds 1 to $tb$ in variable rewriting processing S404 and ends the processing; if $tb$ is equal to or greater than $B$, it sets the backward estimation counter $bw$ to $B$ in variable setting processing S405. A larger $B$ contributes to estimation accuracy but reduces processing speed, so in practice $B$ is desirably set to a value between 1 and 10, most desirably about 10.
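The counter logic of S401 to S405 can be sketched in Python as follows. The function name `backward_gate` and the convention of returning `None` for $bw$ when backward estimation is skipped are assumptions of this sketch; the variables `tb`, `bw`, and `B` follow the text.

```python
def backward_gate(t, tb, B=10, warmup=10):
    """Decide whether backward estimation runs for frame t.

    Returns (new_tb, bw). bw is None when the backward pass is skipped,
    and B when enough frames have accumulated to start it.
    """
    if t < warmup:          # S401/S402: still in the warm-up frames
        return 0, None
    if tb < B:              # S403/S404: not enough look-ahead frames yet
        return tb + 1, None
    return tb, B            # S405: start the backward pass with bw = B


# Find the first frame for which the backward pass is triggered.
tb, first_backward_frame = 0, None
for t in range(40):
    tb, bw = backward_gate(t, tb)
    if bw is not None and first_backward_frame is None:
        first_backward_frame = t
```

With the default values, the backward pass first runs once $B$ frames have been counted after the warm-up period, which makes the look-ahead delay of the smoothing explicit.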

Next, in read processing S406, the noise model parameter re-estimation unit 42 reads the following from the noise model estimation buffer 52 of the estimation processing parameter storage unit 50: the noise model parameter predicted values $N^{\mathrm{pred}}_{s,u}$, $\sigma^{N,\mathrm{pred}}_{s,u}$ at frame number $s = t - B + bw$ calculated by the forward estimation unit 30, the speech feature $O_{s-1,u}$ at frame number $s-1$, the noise model parameter updated values $\hat{N}_{s-1,j,k,u}$, $\hat{\sigma}^{N}_{s-1,j,k,u}$ at frame number $s-1$, and the noise model parameter re-estimated values $\tilde{N}_{s,j,k,u}$, $\tilde{\sigma}^{N}_{s,j,k,u}$ at frame number $s = t - B + bw$ calculated by the backward estimation unit 40. When $bw = B$, that is, when $s = t$, the noise model parameter re-estimation unit 42 reads $\hat{N}_{t,j,k,u}$, $\hat{\sigma}^{N}_{t,j,k,u}$, $\hat{N}_{t,u}$, $\hat{\sigma}^{N}_{t,u}$ from the noise model estimation buffer 52 and sets $\tilde{N}_{s,j,k,u} = \hat{N}_{t,j,k,u}$, $\tilde{\sigma}^{N}_{s,j,k,u} = \hat{\sigma}^{N}_{t,j,k,u}$, $\tilde{N}_{s,u} = \hat{N}_{t,u}$, and $\tilde{\sigma}^{N}_{s,u} = \hat{\sigma}^{N}_{t,u}$.

Then, in parameter smoothing processing S407, the noise model parameter re-estimation unit 42 re-estimates (smooths) the parameters by backward estimation according to the following equations.

The values $\tilde{N}_{s-1,j,k,u}$ and $\tilde{\sigma}^{N}_{s-1,j,k,u}$ obtained by equations (27) and (28) are the noise model parameter re-estimated values. They are stored in the noise model estimation buffer 52 of the estimation processing parameter storage unit 50 for the next smoothing step and are also sent to the backward probability model parameter generation unit 43.

The backward probability model parameter generation unit 43 receives the noise model parameter re-estimated values $\tilde{N}_{s-1,j,k,u}$, $\tilde{\sigma}^{N}_{s-1,j,k,u}$ and the probabilistic model parameters $\mu^{S}_{j,k,u}$, $\sigma^{S}_{j,k,u}$ of the silence signal and the clean speech signal read from the silence GMM storage unit 21 and the clean speech GMM storage unit 22 of the model parameter storage unit 20. Using this information, the backward probability model parameter generation unit 43 calculates and outputs backward probability model parameters consisting of the mean $\mu^{O}_{s-1,j,k,u}$ and the variance $\sigma^{O}_{s-1,j,k,u}$.

This processing is described in detail following the procedure of FIG. 9.
In probability model parameter generation processing S408, the backward probability model parameter generation unit 43 generates, by the following equations, the parameters $\mu^{O}_{s-1,j,k,u}$, $\sigma^{O}_{s-1,j,k,u}$ of the probabilistic models of speech (noise + clean speech: $j = 1$) and non-speech (noise + silence: $j = 0$) adapted to the noise environment at frame number $s-1$.

The parameters $\mu^{O}_{s-1,j,k,u}$, $\sigma^{O}_{s-1,j,k,u}$ of the speech and non-speech probabilistic models set as described above are stored in the speech GMM storage unit 24 and the non-speech GMM storage unit 23 of the model parameter storage unit 20, respectively, and are also sent to the backward speech/non-speech output probability calculation unit 44. Here too, the mixture weights of these models are taken in the subsequent processing to be the mixture weights $w_{j,k}$ of the probabilistic model parameters of the clean speech signal and the silence signal.

The backward speech/non-speech output probability calculation unit 44 receives the speech feature $O_{s-1,u}$ sent from the acoustic signal analysis unit 10, the speech and non-speech probabilistic model parameters $\mu^{O}_{s-1,j,k,u}$, $\sigma^{O}_{s-1,j,k,u}$, and the mixture weights $w_{j,k}$ of the probabilistic model parameters of the clean speech signal and the silence signal read from the silence GMM storage unit 21 and the clean speech GMM storage unit 22 of the model parameter storage unit 20. Using this information, the backward speech/non-speech output probability calculation unit 44 outputs the output probabilities $b_j(O_{s-1})$ of speech and non-speech at frame number $s-1$, and the backward normalized output probabilities $w^{OB}_{s-1,j,k}$ obtained by decomposing $b_j(O_{s-1})$ by normal distribution $k$ and normalizing.

This processing is described in detail following the procedure of FIG. 9.
In output probability calculation processing S409, the backward speech/non-speech output probability calculation unit 44 obtains, by the following equation, the output probabilities $b_j(O_{s-1})$ of speech and non-speech over the whole of each probabilistic model when the speech feature $O_{s-1,u}$ is input to the speech and non-speech probabilistic models generated in S408.

In the above equation, $w_{j,k} b_{j,k}(O_{s-1})$ is the output probability of each normal distribution $k$ contained in the speech and non-speech probabilistic models. The backward speech/non-speech output probability calculation unit 44 normalizes these by the following equation so that the values $w_{j,k} b_{j,k}(O_{s-1})$ sum to 1.

In the above equation, $w^{OB}_{s-1,j,k}$ is the backward normalized output probability of each normal distribution $k$ contained in the speech and non-speech probabilistic models. The backward speech/non-speech output probability calculation unit 44 sends the generated output probabilities $b_j(O_{s-1})$ of speech and non-speech to the backward second weighted average calculation unit 47 and the state probability ratio calculation unit 60, and sends the backward normalized output probabilities $w^{OB}_{s-1,j,k}$ to the backward first weighted average calculation unit 45. It also stores the backward normalized output probabilities $w^{OB}_{s-1,j,k}$ in the noise model estimation buffer 52 of the estimation processing parameter storage unit 50.

The backward first weighted average calculation unit 45 receives the noise model parameter re-estimated values $\tilde{N}_{s-1,j,k,u}$, $\tilde{\sigma}^{N}_{s-1,j,k,u}$ and the backward normalized output probabilities $w^{OB}_{s-1,j,k}$, and outputs the backward first weighted average values of the noise model parameters, consisting of the mean $\tilde{N}_{s-1,j,u}$ and the variance $\tilde{\sigma}^{N}_{s-1,j,u}$.

この具体的処理を、図９の処理手順に従い説明する。
第１加重平均処理Ｓ４１０では、後向き第１加重平均算出部４５が、パラメータ平滑処理Ｓ４０７で得られた複数の雑音モデルパラメータ更新結果を出力確率算出処理Ｓ４０９で得られた後向き正規化出力確率ｗ^OB _s-1,j,kを用いて加重平均することにより、音声、非音声それぞれの確率モデルに対応する雑音パラメータ推定結果である後向き第１加重平均値〜Ｎ_s-1,j,u、〜σ^N _s-1、j、uを得る。加重平均は次式により行う。 This specific processing will be described according to the processing procedure of FIG.
In the first weighted average process S410, the backward first weighted average calculation unit 45 takes the plurality of noise model parameter update results obtained in the parameter smoothing process S407 and averages them, weighted by the backward normalized output probability w^OB_{s-1,j,k} obtained in the output probability calculation process S409. This yields the backward first weighted average values ~N_{s-1,j,u} and ~σ^N_{s-1,j,u}, which are the noise parameter estimates corresponding to the speech and non-speech probability models. The weighted average is calculated by the following formula.

生成された後向き第１加重平均値〜Ｎ_s-1,j,u、〜σ^N _s-1、j、uは後向き第２加重平均算出部４７に送られる。 The generated backward first weighted average values ˜N _{s−1, j, u} , ˜σ ^N _{s−1, j, u} are sent to the backward second weighted average calculating unit 47.
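As a rough sketch of the first weighted average, the per-distribution estimates can be combined with the normalized output probabilities as weights. The function name and the list layout (`means[k][u]`, `variances[k][u]`) are assumptions for illustration, and applying the same linear weights to the variances is also an assumption, since the formula itself did not survive in this text.

```python
def first_weighted_average(w_ob, means, variances):
    """Combine per-distribution noise estimates into one estimate per model j.

    w_ob[k]         : normalized output probability of distribution k (sums to 1)
    means[k][u]     : re-estimated noise mean for distribution k, channel u
    variances[k][u] : re-estimated noise variance for distribution k, channel u
    """
    n_channels = len(means[0])
    mean = [sum(w * m[u] for w, m in zip(w_ob, means)) for u in range(n_channels)]
    var = [sum(w * v[u] for w, v in zip(w_ob, variances)) for u in range(n_channels)]
    return mean, var


# Hypothetical example with two distributions and two channels.
mean_j, var_j = first_weighted_average(
    [0.25, 0.75],
    [[0.0, 4.0], [4.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
)
```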

後向き第２加重平均算出部４７には、前記後向き第１加重平均値〜Ｎ_s-1,j,u、〜σ^N _s-1、j、uと前記出力確率ｂ_ｊ(Ｏ_s-1）とが入力され、平均値〜Ｎ_s-1,uと分散値〜σ^N _s-1、uとからなるフレーム番号ｓ−１における後向き第２加重平均値を出力する。 The backward second weighted average calculation unit 47 receives the backward first weighted average values ~N_{s-1,j,u} and ~σ^N_{s-1,j,u} together with the output probability b_j(O_{s-1}), and outputs the backward second weighted average at frame number s-1, consisting of the mean ~N_{s-1,u} and the variance ~σ^N_{s-1,u}.

この具体的処理を、図９の処理手順に従い説明する。
第２加重平均処理Ｓ４１２では、後向き第２加重平均算出部４７が、第１加重平均処理Ｓ４１０で得られた後向き第１加重平均値〜Ｎ_s-1,j,u、〜σ^N _s-1、j、uを、出力確率算出処理Ｓ４０９で得られた出力確率ｂ_ｊ(Ｏ_s-1）、を用いて加重平均することにより、フレーム番号ｓ−１における雑音モデルパラメータ推定結果である後向き第２加重平均値〜Ｎ_s-1,u、〜σ^N _s-1、uを算出し、次のフレーム番号の雑音パラメータの推定に利用する。加重平均は次式により行う。 This specific processing will be described according to the processing procedure of FIG.
In the second weighted average process S412, the backward second weighted average calculation unit 47 averages the backward first weighted average values ~N_{s-1,j,u} and ~σ^N_{s-1,j,u} obtained in the first weighted average process S410, weighted by the output probability b_j(O_{s-1}) obtained in the output probability calculation process S409. This yields the backward second weighted average values ~N_{s-1,u} and ~σ^N_{s-1,u}, which are the noise model parameter estimates at frame number s-1 and are used to estimate the noise parameters of the next frame number. The weighted average is calculated by the following formula.

生成された後向き第２加重平均値〜Ｎ_s-1,u、〜σ^N _s-1、u推定処理用パラメータ記憶部５０の雑音モデル推定用バッファ５２に格納される。 The generated backward second weighted average values ~N_{s-1,u} and ~σ^N_{s-1,u} are stored in the noise model estimation buffer 52 of the estimation processing parameter storage unit 50.
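The second weighted average merges the non-speech (j = 0) and speech (j = 1) estimates into a single noise estimate per frame. The sketch below assumes that the weights are the output probabilities b_j(O_{s-1}) normalized to sum to 1; that normalization, the function name, and the data layout are assumptions made for illustration.

```python
def second_weighted_average(b, means, variances):
    """Merge per-model estimates (j = 0 non-speech, j = 1 speech) into one.

    b[j]            : output probability b_j(O) of model j
    means[j][u]     : first weighted average mean for model j, channel u
    variances[j][u] : first weighted average variance for model j, channel u
    """
    total = sum(b)
    weights = [p / total for p in b]  # assumption: weights are normalized b_j
    n_channels = len(means[0])
    mean = [sum(w * m[u] for w, m in zip(weights, means)) for u in range(n_channels)]
    var = [sum(w * v[u] for w, v in zip(weights, variances)) for u in range(n_channels)]
    return mean, var


# Hypothetical example: the speech model explains the frame three times better.
noise_mean, noise_var = second_weighted_average([1.0, 3.0], [[0.0], [4.0]], [[2.0], [2.0]])
```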

そして、制御部９０が、変数書替処理Ｓ４１３において、ｂｗの値を１減算（すなわちフレーム番号ｓの値を１減算）し、変数判定処理Ｓ４１４において、ｂｗ＞０であれば処理をＳ４０６に戻し、そうでなければ処理を終了する。 Then, the controller 90 subtracts 1 from the value of bw (that is, subtracts 1 from the value of the frame number s) in the variable rewriting process S413, and returns the process to S406 if bw> 0 in the variable determination process S414. Otherwise, the process ends.

前向き推定部３０の各処理で得られた結果のうち、出力確率算出処理Ｓ３０９で得られた出力確率ｂ_ｊ(Ｏ_ｔ）と、後向き推定部４０の各処理で得られた結果のうち、出力確率算出処理Ｓ４０９で得られた出力確率ｂ_ｊ(Ｏ_s-1）が、状態確率比算出部６０における処理に使用される。つまり、出力確率ｂ_ｊ(Ｏ_ｔ）,..., ｂ_ｊ(Ｏ_t-B）が状態確率比算出部６０への入力パラメータとなる。 Of the results obtained in the processes of the forward estimation unit 30, the output probability b_j(O_t) obtained in the output probability calculation process S309 is used for the processing in the state probability ratio calculation unit 60, as is, of the results obtained in the processes of the backward estimation unit 40, the output probability b_j(O_{s-1}) obtained in the output probability calculation process S409. That is, the output probabilities b_j(O_t), ..., b_j(O_{t-B}) are the input parameters to the state probability ratio calculation unit 60.

式 (26)〜(28)の平滑処理は、従来のカルマンスムーザと計算式の構成自体は同様であるが、本形態ではクリーン音声信号、無音信号それぞれのＧＭＭに含まれる複数の正規分布ごとに複数のフィルタを構成し、これらを利用することにより得られる複数の推定結果を加重平均する（並列カルマンスムーザ）。このような処理を行うことによって、より正確な雑音モデルのパラメータ推定が実現される。 The smoothing of equations (26) to (28) has the same form as a conventional Kalman smoother. In this embodiment, however, a separate filter is constructed for each of the normal distributions included in the GMMs of the clean speech signal and the silence signal, and the resulting estimates are combined by weighted averaging (parallel Kalman smoother). This processing realizes more accurate parameter estimation of the noise model.

パラメータ記憶部５０の雑音モデル推定用バッファ５２は、前向き推定部３０と後向き推定部４０における処理の過程で得られた計算結果を記憶する。 The noise model estimation buffer 52 of the parameter storage unit 50 stores calculation results obtained in the process of the forward estimation unit 30 and the backward estimation unit 40.

［状態確率比算出部６０の処理］
状態遷移確率テーブル６１は、有限状態機械により表現された音声／非音声の状態遷移モデルにおいて適宜設定した状態遷移確率ａ_i,jを記憶する。 [Processing of State Probability Ratio Calculation Unit 60]
The state transition probability table 61 stores state transition probabilities a _{i, j} appropriately set in a speech / non-speech state transition model expressed by a finite state machine.

図１２は、音声状態／非音声状態の状態遷移モデルを示す概念図であり、非音声状態Ｈ_０と音声状態Ｈ_１と各状態への状態遷移確率ａ_i,jとを含む（ｉは状態遷移元の状態番号、ｊは状態遷移先の状態番号で、状態番号０は非音声状態を、状態番号１は音声状態を示す）。ａ_i,jは音声状態確率及び非音声状態確率を求める上での基準となる値で、定数を設定しても入力信号の特徴に応じて適応的に決定しても構わないが、本形態においては定数を設定し、これを状態遷移確率テーブル６１に記憶して音声状態確率及び非音声状態確率の計算に使用する。設定するａ_i,jはａ_i,0＋ａ_i,1＝１を満たす値で、ａ_0,0及びａ_1,1を0.5〜0.9の範囲で、ａ_0,1及びａ_1,0を0.5〜0.1の範囲で設定するのが望ましく、ａ_0,0＝0.8、ａ_0,1＝0.2、ａ_1,0＝0.1、ａ_1,1＝0.9程度が最も望ましい。 FIG. 12 is a conceptual diagram showing the state transition model of the speech and non-speech states. It includes the non-speech state H_0, the speech state H_1, and the state transition probabilities a_{i,j} to each state (i is the state number of the transition source and j is the state number of the transition destination; state number 0 indicates the non-speech state and state number 1 the speech state). a_{i,j} is a reference value for obtaining the speech state probability and the non-speech state probability. It may be set to a constant or determined adaptively according to the characteristics of the input signal; in this embodiment a constant is set, stored in the state transition probability table 61, and used to calculate the speech state probability and the non-speech state probability. The values of a_{i,j} satisfy a_{i,0} + a_{i,1} = 1. It is desirable to set a_{0,0} and a_{1,1} in the range of 0.5 to 0.9 and a_{0,1} and a_{1,0} in the range of 0.5 to 0.1; values of about a_{0,0} = 0.8, a_{0,1} = 0.2, a_{1,0} = 0.1, and a_{1,1} = 0.9 are most desirable.
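For illustration, the state transition probability table with the values named as most desirable can be written as a small constant, together with a check of the constraint a_{i,0} + a_{i,1} = 1. The names `TRANSITION` and `is_valid_transition_table` are hypothetical and not taken from the patent.

```python
# Rows are the source state i, columns the destination state j
# (0 = non-speech, 1 = speech), using the values given in the text.
TRANSITION = [
    [0.8, 0.2],  # a_{0,0}, a_{0,1}
    [0.1, 0.9],  # a_{1,0}, a_{1,1}
]


def is_valid_transition_table(a, tol=1e-9):
    """Return True if every row holds probabilities that sum to 1."""
    return all(
        abs(sum(row) - 1.0) <= tol and all(0.0 <= p <= 1.0 for p in row)
        for row in a
    )
```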

前向き確率算出部６２は、前記出力確率ｂ_ｊ(Ｏ_s）と、状態遷移確率ａ_i,jと、フレーム番号ｓ−１の前向き確率α_s-1、jとが入力され、フレーム番号ｓの前向き確率α_s、jを出力する。 The forward probability calculation unit 62 receives the output probability b_j(O_s), the state transition probability a_{i,j}, and the forward probability α_{s-1,j} of frame number s-1, and outputs the forward probability α_{s,j} of frame number s.

図１０は、状態確率比算出部６０の処理手順を説明するためのフローチャートである。
この具体的処理を、図１０の処理手順に従い説明する。
音声状態確率及び非音声状態確率の算出は、まず前向き確率α_s、jを求め、続いて後向き確率β_s、jを求めて、それらの積をとることによって求める。そして、フレーム番号ｓの後向き確率β_s、jは、前記後向き推定部４０における計算と同様にＢフレーム未来のフレーム番号ｓ＋Ｂから遡って算出する。 FIG. 10 is a flowchart for explaining the processing procedure of the state probability ratio calculation unit 60.
This specific processing will be described according to the processing procedure of FIG.
The speech state probability and the non-speech state probability are calculated by first obtaining the forward probability α_{s,j}, then obtaining the backward probability β_{s,j}, and taking their product. As in the backward estimation unit 40, the backward probability β_{s,j} of frame number s is calculated by working back from frame number s + B, which lies B frames in the future.

そこで、まず、制御部９０が、変数判定処理Ｓ６０１において、音響信号分析部１０から出力される音声特徴量Ｏ_ｔのフレーム番号を判定する。ここで、ｔ＜１０＋Ｂ、すなわちｓ＜１０と判定された場合は、前向き確率算出部６２が、初期値設定処理Ｓ６０２において前向き確率α_s、jを以下のように設定し、それらをバッファリング処理Ｓ６０３において確率比算出用バッファ６４に記憶して処理を終了する。 Therefore, first, the control unit 90 determines the frame number of the audio feature amount O _t output from the acoustic signal analysis unit 10 in the variable determination process S601. Here, when it is determined that t <10 + B, that is, s <10, the forward probability calculation unit 62 sets the forward probability α _{s, j} as follows in the initial value setting process S602, and buffers them. In step S603, the data is stored in the probability ratio calculation buffer 64, and the process ends.

α_s,0＝１ (42)
α_s,1＝０ (43)
ｔ＜１０＋Ｂでない場合、すなわちｓ≧１０の場合は、前向き確率算出部６２が、読み出し処理Ｓ６０４において、確率比算出用バッファ６４からフレーム番号ｓ−１の前向き確率α_s-1、jを読み出す。 α _{s, 0} = 1 (42)
α _{s, 1} = 0 (43)
When t <10 + B is not satisfied, that is, when s ≧ 10, the forward probability calculation unit 62 reads the forward probability α _{s−1, j} of the frame number s−1 from the probability ratio calculation buffer 64 in the reading process S604.

次に、前向き確率算出部６２は、前向き確率算出処理Ｓ６０５において状態遷移確率テーブル６１から音声状態確率ａ_i,jを読み出し、これとフレーム番号ｓの前記出力確率ｂ_ｊ(Ｏ_s）と、フレーム番号ｓ−１の前記前向き確率α_s-1、jとから次式によりフレーム番号ｓの前向き確率α_s、jを算出し、これらをバッファリング処理６０６において確率比算出用バッファ６４に記憶する。 Next, in the forward probability calculation process S605, the forward probability calculation unit 62 reads the state transition probabilities a_{i,j} from the state transition probability table 61. From these, the output probability b_j(O_s) of frame number s, and the forward probability α_{s-1,j} of frame number s-1, it calculates the forward probability α_{s,j} of frame number s by the following equation. The results are stored in the probability ratio calculation buffer 64 in the buffering process S606.
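The forward update can be sketched with the standard forward recursion for a two-state model, α_{s,j} = b_j(O_s) · Σ_i α_{s-1,i} a_{i,j}. The equation itself is missing from this text, so treating it as the standard form is an assumption, as is the function name `forward_step`.

```python
def forward_step(alpha_prev, a, b):
    """One forward update: alpha_{s,j} = b_j(O_s) * sum_i alpha_{s-1,i} * a_{i,j}.

    alpha_prev[i] : forward probability of state i at frame s-1
    a[i][j]       : state transition probability from state i to state j
    b[j]          : output probability b_j(O_s) of state j at frame s
    """
    n_states = len(alpha_prev)
    return [
        b[j] * sum(alpha_prev[i] * a[i][j] for i in range(n_states))
        for j in range(n_states)
    ]


# Start from the initial values alpha_{s,0} = 1, alpha_{s,1} = 0 used for s < 10.
alpha = forward_step([1.0, 0.0], [[0.8, 0.2], [0.1, 0.9]], [0.5, 2.0])
```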

後向き確率算出部６３は、フレーム番号ｓ＋１の前記出力確率ｂ_ｊ(Ｏ_s+1）と、状態遷移確率ａ_i,jと、フレーム番号ｓ＋１の後向き確率β_s+1、iとが入力され、フレーム番号ｓの後向き確率β_s、iを出力する。 The backward probability calculation unit 63 receives the output probability b_j(O_{s+1}) of frame number s+1, the state transition probability a_{i,j}, and the backward probability β_{s+1,i} of frame number s+1, and outputs the backward probability β_{s,i} of frame number s.

この具体的処理を、図１０の処理手順に従い説明する。
まず、変数設定処理Ｓ６０７において、制御部９０が、後向き確率算出用のカウンタｂｗの値をＢに設定する。 This specific processing will be described according to the processing procedure of FIG.
First, in the variable setting process S607, the control unit 90 sets the value of the counter bw for calculating the backward probability to B.

次に、後向き確率算出部６３が、後向き確率算出処理Ｓ６０８において状態遷移確率テーブル６１から音声状態確率ａ_i,jを読み出し、これとフレーム番号ｓ＋ｂｗの前記出力確率ｂ_ｊ(Ｏ_s+bw）とフレーム番号ｓ＋ｂｗの前記後向き確率β_bw、jとからフレーム番号ｓ＋ｂｗ−１の後向き確率β_s+bw-1、iを次式により算出する。なお、ｂｗ＝Ｂの場合は初期値β_s+B,i＝１を与える。 Next, the backward probability calculation unit 63 reads the speech state probability a _{i, j} from the state transition probability table 61 in the backward probability calculation process S608, and the output probability b _j (O _{s + bw} ) of the frame number s + bw. From the backward probability β _{bw, j} of the frame number s + bw, the backward probability β _{s + bw-1, i} of the frame number s + bw-1 is calculated by the following equation. When bw = B, the initial value β _{s + B, i} = 1 is given.

そして、制御部９０が、変数書替処理Ｓ６０９においてｂｗの値を１減算し、変数判定処理Ｓ６１０においてｂｗ＞０であれば処理をＳ６０７に戻し、そうでなければこの時点でフレーム番号ｓにおける後向き確率β_s,iが得られるので、これをバッファリング処理Ｓ６１１において確率比算出用バッファ６４に記憶し、確率比算出処理Ｓ６１２に移行する。 Then, the control unit 90 subtracts 1 from the value of bw in the variable rewriting process S609. If bw > 0 in the variable determination process S610, the process returns to S607. Otherwise, the backward probability β_{s,i} at frame number s has been obtained at this point, so it is stored in the probability ratio calculation buffer 64 in the buffering process S611, and the process proceeds to the probability ratio calculation process S612.
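The backward loop over bw = B, ..., 1 can be sketched as below, assuming the standard backward recursion β_{s+bw-1,i} = Σ_j a_{i,j} b_j(O_{s+bw}) β_{s+bw,j} with the initial value β_{s+B,i} = 1. The function name and the layout of `b_future` are assumptions for illustration.

```python
def backward_probability(a, b_future):
    """Return beta_{s,i} by recursing back from frame s+B to frame s.

    a[i][j]        : state transition probability from state i to state j
    b_future[n][j] : output probability b_j(O_{s+1+n}) for n = 0 .. B-1
    """
    n_states = len(a)
    beta = [1.0] * n_states  # initial value beta_{s+B,i} = 1
    for b in reversed(b_future):  # frames s+B, s+B-1, ..., s+1
        beta = [
            sum(a[i][j] * b[j] * beta[j] for j in range(n_states))
            for i in range(n_states)
        ]
    return beta


# Hypothetical example with a look-ahead of B = 1 frame.
beta = backward_probability([[0.8, 0.2], [0.1, 0.9]], [[0.5, 2.0]])
```

With an empty look-ahead (B = 0) the function returns the initial value 1 for both states.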

確率比算出用バッファ６４は、前向き確率算出部６２で算出された前向き確率α_s、jと、後向き確率算出部６３で算出されたと後向き確率β_s,iを記憶する。 The probability ratio calculation buffer 64 stores the forward probability α _{s, j} calculated by the forward probability calculation unit 62 and the backward probability β _{s, i} calculated by the backward probability calculation unit 63.

確率比算出部６５には、前記前向き確率α_s、jと前記後向き確率β_s,iとが入力され、確率比算出部６５は、図１０の確率比算出処理Ｓ６１２において、非音声状態の確率に対する音声状態の確率の比Ｌ(s)を次式により算出する。 The probability ratio calculation unit 65 receives the forward probability α_{s,j} and the backward probability β_{s,i}. In the probability ratio calculation process S612 of FIG. 10, it calculates the ratio L(s) of the speech state probability to the non-speech state probability by the following equation.

つまり、状態確率比算出部６０は、該当フレーム番号ｔよりもＢフレーム過去のフレーム番号ｓ＝ｔ−Ｂにおける前向き確率α_s,j、後向き確率β_s,i、及び非音声状態の確率に対する音声状態の確率の比Ｌ(s)を算出することになる。このように算出された音声状態の確率の比Ｌ(s)は、音声信号区間推定部７０に送られる。また、その生成過程で得られた音声状態／非音声状態確率γ_ｓ,ｊは、推定処理用パラメータ記憶部５０の雑音モデル推定用バッファ５２に格納される。 That is, the state probability ratio calculation unit 60 calculates the forward probability α_{s,j}, the backward probability β_{s,i}, and the ratio L(s) of the speech state probability to the non-speech state probability at frame number s = t − B, which is B frames before the current frame number t. The ratio L(s) calculated in this way is sent to the speech signal section estimation unit 70. The speech state / non-speech state probability γ_{s,j} obtained in the process is stored in the noise model estimation buffer 52 of the estimation processing parameter storage unit 50.
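A minimal sketch of this step, assuming γ_{s,j} = α_{s,j} · β_{s,j} as in equation (54) and L(s) = γ_{s,1} / γ_{s,0}: the function name and the handling of a zero denominator are assumptions introduced for illustration.

```python
def state_probability_ratio(alpha, beta):
    """Return (gamma, L) for one frame.

    alpha[j], beta[j] : forward and backward probabilities of state j
                        (0 = non-speech, 1 = speech)
    gamma[j]          : alpha[j] * beta[j]
    L                 : gamma[1] / gamma[0]
    """
    gamma = [a * b for a, b in zip(alpha, beta)]
    if gamma[0] == 0.0:
        # Assumption: report an infinite ratio when the non-speech probability vanishes.
        return gamma, float("inf")
    return gamma, gamma[1] / gamma[0]


# Hypothetical forward and backward probabilities for one frame.
gamma, ratio = state_probability_ratio([0.4, 0.4], [0.8, 1.85])
```

With the initial forward values (α_{s,0} = 1, α_{s,1} = 0) the ratio is 0, so the first frames are treated as non-speech.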

なお、式(46)は以下に示す過程を経て導かれる。
まず、フレーム番号ｓにおける信号の状態をｑ_ｓ＝Ｈ_ｊと定義すると、非音声状態確率ｐ（ｑ_ｓ＝Ｈ_０｜Ｏ_0:s）及び音声状態確率ｐ（ｑ_ｓ＝Ｈ_１｜Ｏ_0:s）はベイズの定理により次式により得られる。 Equation (46) is derived through the following process.
First, if the state of the signal at frame number s is defined as q _s = H _j , non-voice state probability p (q _s = H ₀ | O _{0: s} ) and voice state probability p (q _s = H ₁ | O _{0 : s} ) is obtained by the following equation according to Bayes' theorem.

上式において、Ｏ_0:s＝{Ｏ₀，・・・，Ｏ_s}であり、雑音信号Ｎ_0:s＝{Ｎ₀，・・・，Ｎ_s}の時間変動を考慮すると、上式は次式のように拡張される。 In the above equation, O_{0:s} = {O_0, ..., O_s}. When the time variation of the noise signal N_{0:s} = {N_0, ..., N_s} is taken into account, the above equation is extended as follows.

上式は、過去のフレーム番号の状態を考慮した再帰式（１次マルコフ過程）により、次式のように展開される。 The above equation is developed as the following equation by a recursive equation (first-order Markov process) considering the state of the past frame number.

上式において、ｐ(ｑ_ｓ＝Ｈ_ｊ|ｑ_s-1＝Ｈ_ｉ)＝ａ_i,j、ｐ(Ｏ_ｓ|ｑ_ｓ＝Ｈ_ｊ,Ｎ_ｓ)＝ｂ_ｊ(Ｏ_ｓ)、ｐ(Ｎ_ｓ|Ｎ_s-1)＝１に相当し、またｐ(Ｏ_ｓ,ｑ_ｓ＝Ｈ_ｊ,Ｎ_ｓ)は時間軸方向に算出される前向き確率α_s、jに相当する。すなわち上式は、次式の再帰式により得られる。 In the above equation, p (q _s = H _j | q _s−1 = H _i ) = a _{i, j} , p (O _s | q _s = H _j , N _s ) = b _j (O _s ), p ( N _s | N _s-1 ) = 1, and p (O _s , q _s = H _j , N _s ) corresponds to the forward probability α _{s, j} calculated in the time axis direction. That is, the above equation is obtained by the following recursive equation.

次に、時刻ｓより未来の時刻、すなわち時刻ｓ＋１，・・・，ｔ＝ｓ＋Ｂにおける状態の影響を考慮すると、非音声状態確率ｐ（ｑ_ｓ＝Ｈ_０｜Ｏ_0:t）及び音声状態確率ｐ（ｑ_ｓ＝Ｈ_１｜Ｏ_0:t）は以下のようになる。 Next, considering the influence of the state at a time later than time s, that is, time s + 1,..., T = s + B, non-voice state probability p (q _s = H ₀ | O _{0: t} ) and voice state probability p (q _s = H ₁ | O _{0: t} ) is as follows.

上式の確率ｐ(Ｏ_s+1:t,Ｎ_s+1:t|ｑ_ｓ,Ｎ_ｓ)は、フレーム番号ｓより未来のフレーム番号の状態を考慮した再帰式（１次マルコフ過程）により、次式のように展開される。 The probability p (O _{s + 1: t} , N _{s + 1: t} | q _s , N _s ) in the above equation is given by a recursive equation (first-order Markov process) that takes into consideration the state of the future frame number from frame number s Is expanded as follows:

上式において、ｐ(ｑ_S+1＝Ｈ_ｊ|ｑ_s＝Ｈ_ｉ)＝ａ_i,j、ｐ(Ｏ_S+1|ｑ_S+1＝Ｈ_ｊ,Ｎ_S+1) ＝ｂ_ｊ(Ｏ_S+1)、ｐ(Ｎ_ｓ＋１|Ｎ_s)＝１に相当し、またｐ(Ｏ_S+1:t,Ｎ_S+1:t|ｑ_ｓ＝Ｈ_ｉ,Ｎ_ｓ)は時間軸方向に算出される後向き確率β_s、ｉに相当する。すなわち上式は、次式の再帰式により得られる。 In the above equation, p(q_{s+1} = H_j | q_s = H_i) = a_{i,j}, p(O_{s+1} | q_{s+1} = H_j, N_{s+1}) = b_j(O_{s+1}), and p(N_{s+1} | N_s) = 1, and p(O_{s+1:t}, N_{s+1:t} | q_s = H_i, N_s) corresponds to the backward probability β_{s,i} calculated in the time axis direction. That is, the above equation is obtained by the following recursion.

よって、式(51)の確率ｐ(Ｏ_０:ｓ,ｑ_ｓ＝Ｈ_ｊ,Ｎ_０:ｓ)・ｐ(Ｏ_s+1:t,Ｎ_s+1:t|ｑ_ｓ＝Ｈ_ｉ,Ｎ_ｓ) は、次式のような前向き確率α_ｓ,ｊと後向き確率β_ｓ,ｊの積で得られる。
γ_ｓ,ｊ＝α_ｓ,ｊ・β_ｓ,ｊ (54)
ここで、音声状態確率と非音声状態の確率の比Ｌ(s)は次式により得られる。 Therefore, the probability p (O _{0: s} , q _s = H _j , N _{0: s} ) · p (O _{s + 1: t} , N _{s + 1: t} | q _s = H _i , N in equation (51) _s ) is obtained by the product of the forward probability α _{s, j} and the backward probability β _{s, j} as shown in the following equation.
γ _{s, j} = α _{s, j} · β _{s, j} (54)
Here, the ratio L (s) of the speech state probability and the non-speech state probability is obtained by the following equation.

また、雑音信号Ｎ_0:s＝{Ｎ₀，・・・，Ｎ_s}の時間変動を考慮すると、上式は次式のように拡張される。 Further, when the time variation of the noise signal N _{0: s} = {N ₀ ,..., N _s } is taken into consideration, the above equation is expanded as the following equation.

次に、フレーム番号ｓより未来のフレーム番号、すなわちフレーム番号ｓ＋１，・・・，ｔ＝ｓ＋Ｂにおける状態の影響を考慮すると、確率比Ｌ(s)は次式のように表現される。 Next, considering the influence of the state in the future frame number from the frame number s, that is, the frame numbers s + 1,..., T = s + B, the probability ratio L (s) is expressed as follows.

ここで、式(54)を用いて式(57)を変形すると以下のようになり、式(46)が導かれる。 Here, when Expression (57) is transformed using Expression (54), the following is obtained, and Expression (46) is derived.
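The final step of this derivation can be written out as follows. This is a reconstruction from the surrounding definitions (γ_{s,j} = α_{s,j} · β_{s,j} in equation (54), with state 1 the speech state and state 0 the non-speech state), not a copy of the original equation, which did not survive in this text.

```latex
L(s) = \frac{p(q_s = H_1 \mid O_{0:t})}{p(q_s = H_0 \mid O_{0:t})}
     = \frac{\gamma_{s,1}}{\gamma_{s,0}}
     = \frac{\alpha_{s,1}\,\beta_{s,1}}{\alpha_{s,0}\,\beta_{s,0}}
```

Any normalizing factor common to both state probabilities cancels in the ratio, which is why only the products of forward and backward probabilities remain.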

［音声信号区間推定部７０の処理］
音声信号区間推定部７０に、状態確率比算出部６０から出力さ音声状態の確率の比Ｌ(s)が入力され、音声信号区間推定部７０は、フレーム番号ｓのフレームが音声状態に属するか非音声状態に属するかを判定する。 [Processing of Speech Signal Section Estimating Unit 70]
The speech signal section estimation unit 70 receives the speech state probability ratio L(s) output from the state probability ratio calculation unit 60 and determines whether the frame of frame number s belongs to the speech state or the non-speech state.

Ｌ(s)レジスタ７１（図６）は、状態確率比算出部６０において算出された前記非音声状態の確率に対する音声状態の確率の比Ｌ(s）を入力し記憶する。 The L (s) register 71 (FIG. 6) inputs and stores the ratio L (s) of the voice state probability to the non-voice state probability calculated by the state probability ratio calculation unit 60.

閾値ＴＨレジスタ７２は、比較部７３において前記確率比Ｌ(s)が音声状態に属するか非音声状態に属するかを判断する閾値ＴＨを記憶する。なお、閾値ＴＨの値は、事前に固定された値に決定しておいても、入力信号の特徴に応じて適応的に決定してもよい。固定値を設定する場合は、一般的には１０程度の値に設定するのが最も望ましいが、用途に応じ0.5〜10,000の範囲で適宜設定して構わない。 The threshold TH register 72 stores a threshold TH for determining in the comparison unit 73 whether the probability ratio L (s) belongs to a voice state or a non-voice state. Note that the value of the threshold TH may be determined in advance or may be determined adaptively according to the characteristics of the input signal. When setting a fixed value, it is generally most desirable to set it to a value of about 10, but it may be set appropriately in the range of 0.5 to 10,000 depending on the application.

比較部７３は、Ｌ(s)レジスタ７１から前記確率比Ｌ(s)を読み出すとともに、閾値レジスタ７２から閾値ＴＨを読み出し、フレーム番号ｓのフレームが音声状態に属するか非音声状態に属するかを判定し、判定結果ＶＡＤ(ｓ)を出力する。 The comparison unit 73 reads the probability ratio L (s) from the L (s) register 71 and also reads the threshold value TH from the threshold value register 72 to determine whether the frame with the frame number s belongs to the voice state or the non-voice state. Judgment is made and a judgment result VAD (s) is output.

具体的には、例えばＬ(s)の値が閾値ＴＨ以上であれば、フレーム番号ｓのフレームが音声状態に属すると判断してＶＡＤ(ｓ)＝１を出力し、閾値ＴＨ未満であれば、フレーム番号ｓのフレームが非音声状態に属すると判断してＶＡＤ(ｓ)＝０を出力する。出力された判定結果は、信号除去部８０に送られる。 Specifically, for example, if the value of L(s) is equal to or greater than the threshold TH, the frame of frame number s is judged to belong to the speech state and VAD(s) = 1 is output; if it is less than the threshold TH, the frame is judged to belong to the non-speech state and VAD(s) = 0 is output. The output determination result is sent to the signal removal unit 80.
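The threshold comparison can be sketched in a few lines. The function name `vad_decision` is hypothetical; the default threshold of 10 follows the fixed value the text describes as generally most desirable.

```python
def vad_decision(ratio, threshold=10.0):
    """Return VAD(s): 1 if L(s) >= TH (speech), otherwise 0 (non-speech)."""
    return 1 if ratio >= threshold else 0


# Hypothetical probability ratios L(s) for four consecutive frames.
decisions = [vad_decision(r) for r in (0.0, 2.3, 10.0, 250.0)]
```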

[信号除去部８０の処理]
図１１（ａ）は、信号除去部８０が行うフィルタ生成処理を説明するためのフローチャートであり、図１１（ｂ）は、フィルタリング処理を説明するためのフローチャートである。 [Processing of signal removal unit 80]
FIG. 11A is a flowchart for explaining the filter generation process performed by the signal removal unit 80, and FIG. 11B is a flowchart for explaining the filtering process.

信号除去部８０（図７）には、前記音声信号及び非音声信号の各確率モデルパラメータである正規分布ごとの平均μ^Ｓ _ｊ,ｋ,ｕと、前記クリーン音声信号及び無音信号の各確率モデルパラメータである正規分布ごとの平均μ^Ｏ _{ｓ,ｊ,ｋ,ｕ}と、音声、非音声それぞれの確率モデルに含まれる各正規分布ｋの後向き正規化出力確率ｗ^ＯＢ _ｓ,ｊ,ｋと、前記音声状態確率及び前記非音声状態確率γ_ｓ,ｊとが入力される。信号除去部８０は、前記音声信号と非音声信号の各確率モデルパラメータである正規分布ごとの前記平均μ^Ｏ _{ｓ,ｊ,ｋ,ｕ}に対する、前記クリーン音声信号と無音信号の各確率モデルパラメータである正規分布ごとの前記平均μ^Ｓ _ｊ,ｋ,ｕの各相対値(μ^Ｓ _ｊ,ｋ,ｕ−μ^Ｏ _{ｓ,ｊ,ｋ,ｕ})を、正規分布ごとの後向き正規化出力確率ｗ^ＯＢ _ｓ,ｊ,ｋと、前記音声状態確率及び前記非音声状態確率γ_ｓ,ｊとを用いて加重平均し、雑音信号を除去する周波数応答フィルタを生成し、当該周波数応答フィルタをインパルス応答フィルタに変換し、前記入力信号に対して当該インパルス応答フィルタを畳み込んで雑音除去音声信号を生成して出力する。 The signal removal unit 80 (FIG. 7) receives the mean μ^S_{j,k,u} of each normal distribution, which is a probability model parameter of the speech signal and the non-speech signal; the mean μ^O_{s,j,k,u} of each normal distribution, which is a probability model parameter of the clean speech signal and the silence signal; the backward normalized output probability w^OB_{s,j,k} of each normal distribution k included in the speech and non-speech probability models; and the speech state probability and non-speech state probability γ_{s,j}. The signal removal unit 80 takes the relative values (μ^S_{j,k,u} − μ^O_{s,j,k,u}) of the means μ^S_{j,k,u} of the normal distributions of the clean speech and silence probability models with respect to the means μ^O_{s,j,k,u} of the normal distributions of the speech and non-speech probability models, and averages them, weighted by the backward normalized output probability w^OB_{s,j,k} of each normal distribution and by the speech state probability and non-speech state probability γ_{s,j}, to generate a frequency response filter that removes the noise signal. It then converts the frequency response filter into an impulse response filter and convolves the impulse response filter with the input signal to generate and output a noise-removed speech signal.

この具体的処理を、図１１（ａ）の処理手順に従い説明する。
信号除去部８０は、現在のフレーム番号ｔのフレームよりもＢフレーム遡ったフレームｓ＝ｔ−Ｂに視点を移して処理を行う。まず、制御部９０がフレーム判定処理Ｓ７０１を行い、音響信号分析部１０から出力される音声特徴量Ｏ_ｔ＋Ｂのフレーム番号を判定する。このフレーム判定処理Ｓ７０１においてｔ＜１０＋Ｂ、すなわちｓ＜１０であると判定されたのであれば、制御部９０は、そのフレームの信号除去部８０の処理を終了する。 This specific processing will be described in accordance with the processing procedure of FIG.
The signal removal unit 80 performs its processing on frame s = t − B, which is B frames earlier than the frame of the current frame number t. First, the control unit 90 performs the frame determination process S701 to determine the frame number of the speech feature O_{t+B} output from the acoustic signal analysis unit 10. If it is determined in the frame determination process S701 that t < 10 + B, that is, s < 10, the control unit 90 ends the processing of the signal removal unit 80 for that frame.

また、フレーム判定処理Ｓ３０１においてｔ＜１０＋Ｂでないと判定されたのであれば、周波数応答フィルタ生成部８１が、雑音モデル推定用バッファ読出処理７０２において、推定処理用パラメータ記憶部５０の雑音モデル推定用バッファ５２から、正規分布ごとの後向き正規化出力確率ｗ^ＯＢ _ｓ,ｊ,ｋと、音声状態／非音声状態確率γ_ｓ,ｊとを読み込む。さらに、周波数応答フィルタ生成部８１が、ＧＭＭパラメータ読み出し処理７０３において、モデルパラメータ記憶部２０の無音ＧＭＭ記憶部２１及びクリーン音声ＧＭＭ記憶部２２から無音信号及びクリーン音声信号の確率モデルのパラメータ（平均μ^S _j,k,u）を読み出し、非音声ＧＭＭ記憶部２３及び音声ＧＭＭ記憶部２４から非音声信号及び音声信号の確率モデルのパラメータ（平均μ^O _s,j,k,u）を読み込む。 If it is determined in the frame determination process S301 that t < 10 + B does not hold, the frequency response filter generation unit 81, in the noise model estimation buffer read process 702, reads the backward normalized output probability w^OB_{s,j,k} of each normal distribution and the speech state / non-speech state probability γ_{s,j} from the noise model estimation buffer 52 of the estimation processing parameter storage unit 50. Further, in the GMM parameter read process 703, the frequency response filter generation unit 81 reads the parameters (means μ^S_{j,k,u}) of the probability models of the silence signal and the clean speech signal from the silence GMM storage unit 21 and the clean speech GMM storage unit 22 of the model parameter storage unit 20, and reads the parameters (means μ^O_{s,j,k,u}) of the probability models of the non-speech signal and the speech signal from the non-speech GMM storage unit 23 and the speech GMM storage unit 24.

次に、周波数応答フィルタ生成部８１が、周波数応答フィルタ生成処理Ｓ７０４により、雑音除去を行う周波数応答フィルタＦＩＬＴＥＲ_ｓ，ｕを次式により算出する。 Next, the frequency response filter generation unit 81 calculates a frequency response filter FILTER _{s, u} for performing noise removal by the following equation, in frequency response filter generation processing S704.

式中の(μ^Ｓ _ｊ,ｋ,ｕ−μ^Ｏ _{ｓ,ｊ,ｋ,ｕ})は、音声状態もしくは非音声状態の確率モデルの何れかの正規分布のパラメータ（平均）を用いて算出される雑音除去フィルタの対数周波数応答である。なお、これが対数周波数応答となるのは、本形態では音声特徴量として対数メルスペクトルを要素に持つベクトルを用いているからである。これを音声状態及び非音声状態の確率モデルの各正規分布に割り当てられた後向き正規化出力確率ｗ^ＯＢ _ｓ,ｊ,ｋと、音声状態／非音声状態確率γ_ｓ,ｊとを用いて加重平均し、指数変換することにより、雑音除去を行う周波数応答フィルタＦＩＬＴＥＲ_ｓ,ｕが得られる。このように生成された周波数応答フィルタＦＩＬＴＥＲ_ｓ,ｕは、インパルス応答フィルタ変換部８２に送られる。 The term (μ^S_{j,k,u} − μ^O_{s,j,k,u}) in the equation is the logarithmic frequency response of a noise removal filter, calculated from the parameters (means) of one normal distribution of either the speech state or the non-speech state probability model. It is a logarithmic frequency response because this embodiment uses, as the speech feature, a vector whose elements are a log mel spectrum. Taking the weighted average of these terms, using the backward normalized output probability w^OB_{s,j,k} assigned to each normal distribution of the speech and non-speech probability models and the speech state / non-speech state probability γ_{s,j}, and then applying an exponential transform yields the frequency response filter FILTER_{s,u} for noise removal. The frequency response filter FILTER_{s,u} generated in this way is sent to the impulse response filter conversion unit 82.
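The filter construction described here can be sketched as a weighted average of log-domain responses followed by an exponential. The filter equation is missing from this text, so the exact form is an assumption: in particular, this sketch normalizes γ_{s,j} to sum to 1 before using it as a weight. The function name and data layout are also hypothetical.

```python
import math


def frequency_response_filter(gamma, w_ob, mu_s, mu_o):
    """Return FILTER[u] for one frame.

    gamma[j]      : speech / non-speech state probability (j = 0, 1)
    w_ob[j][k]    : normalized output probability of distribution k in model j
    mu_s[j][k][u] : mean of the clean-speech / silence model, channel u
    mu_o[j][k][u] : mean of the speech / non-speech (noisy) model, channel u
    """
    total = sum(gamma)
    n_channels = len(mu_s[0][0])
    result = []
    for u in range(n_channels):
        log_response = 0.0
        for j, g in enumerate(gamma):
            # Weighted average over the distributions k of model j.
            inner = sum(
                w * (ms[u] - mo[u])
                for w, ms, mo in zip(w_ob[j], mu_s[j], mu_o[j])
            )
            log_response += (g / total) * inner  # assumption: gamma normalized
        result.append(math.exp(log_response))  # back from the log domain
    return result
```

When the noisy-model mean equals the clean-model mean in every channel, the log response is 0 and the filter gain is 1, so the signal passes unchanged.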

インパルス応答フィルタ変換部８２は、インパルス応答変換処理Ｓ７０５において、周波数応答フィルタＦＩＬＴＥＲ_ｓ,ｕを時間領域に変換してインパルス応答フィルタｆｉｌｔｅｒ_ｓ,τに変換する。ただし、本形態では、音声特徴量として周波数について非線形なメルスケール上で生成された対数メルスペクトルを要素に持つベクトルを用いている。そこで、次式のようにメル周波数の重み付けがなされた逆離散コサイン変換（IDCT: Inverse Discrete Cosine Transform）によって、メルスケール上の周波数応答フィルタＦＩＬＴＥＲ_ｓ,ｕを時間領域のインパルス応答フィルタｆｉｌｔｅｒ_ｓ,τに変換する。 In the impulse response conversion processing S705, the impulse response filter conversion unit 82 converts the frequency response filter FILTER _{s, u} into the time domain and converts it into the impulse response filter filter _{s, τ} . However, in the present embodiment, a vector having a log mel spectrum generated on a mel scale nonlinear with respect to frequency as an element is used as a voice feature amount. Accordingly, the frequency response filter FILTER _{s, u} on the mel scale is converted into the time domain impulse response filter filter _{s, τ} by inverse discrete cosine transform (IDCT) in which the mel frequency is weighted as in the following equation. Convert to

ここで、τは離散時間であり、ｍｅｌＩＤＥＣＴ_ｕ,τはメル周波数の重み付けがなされた逆離散コサイン変換係数であり、次式で表現される（非特許文献３参照）。 Here, τ is a discrete time, melIDECT _{u, τ} is an inverse discrete cosine transform coefficient weighted with a mel frequency, and is expressed by the following equation (see Non-Patent Document 3).

ここで、ｆ_ｓａｍｐは入力信号のサンプリング周波数である。また、ｆ_{ｃｅｎｔｒ}（ｕ）は、音響信号分析部１０でのメルフィルタバンク分析におけるチャネル番号ｕのバンドの中心周波数を意味し、次式によって表現される。 Here, f_samp is the sampling frequency of the input signal, and f_centr(u) is the center frequency of the band of channel number u in the mel filter bank analysis in the acoustic signal analysis unit 10, expressed by the following equation.

なお、Ｎ_ＳＰＥＣ＝（Ｎ_ＦＦＴ／２）＋１であり、Ｎ_ＦＦＴは音響信号分析部１０での高速フーリエ変換の次元数であり、Ｗ(u,i)は音響信号分析部１０でのメルフィルタバンク分析におけるチャネル番号ｕの三角窓関数であり、ｉは周波数ビンである。 N _SPEC = (N _FFT / 2) +1, where N _FFT is the number of dimensions of the fast Fourier transform in the acoustic signal analysis unit 10, and W (u, i) is a Mel filter in the acoustic signal analysis unit 10 It is a triangular window function of channel number u in bank analysis, and i is a frequency bin.

以上のように生成されたインパルス応答フィルタｆｉｌｔｅｒ_ｓ,τは、フィルタリング部８４に送られる。 The impulse response filter filter _{s, τ} generated as described above is sent to the filtering unit 84.

フィルタリング部８４は、入力信号ｏ_ｓ，τに対して当インパルス応答フィルタｆｉｌｔｅｒ_ｓ,τを畳み込んで雑音除去音声信号ｓ_ｓ，τを生成して出力する。 The filtering unit 84 convolves the impulse response filter filter_{s,τ} with the input signal o_{s,τ} to generate and output the noise-removed speech signal s_{s,τ}.

この具体的処理を、図１１（ｂ）の処理手順に従い説明する。
入力信号読み出し部８３（図７）には、音声信号区間推定部７０から出力された判定結果ＶＡＤ(ｓ)が入力される。入力信号読み出し部８３は、音声区間判定処理Ｓ８０１を行い、判定結果ＶＡＤ(ｓ)が０であるか１であるかを判定する。ここで、ＶＡＤ(ｓ)＝０、すなわち、判定結果ＶＡＤ(ｓ)がフレーム番号ｓのフレームが非音声状態に属するとの判定結果を示す場合、入力信号読み出し部８３は、このフレームについての信号除去部８０の処理を終了させる。 This specific processing will be described in accordance with the processing procedure of FIG.
The determination result VAD(s) output from the speech signal section estimation unit 70 is input to the input signal reading unit 83 (FIG. 7). The input signal reading unit 83 performs the speech section determination process S801 and determines whether VAD(s) is 0 or 1. When VAD(s) = 0, that is, when the determination result indicates that the frame of frame number s belongs to the non-speech state, the input signal reading unit 83 ends the processing of the signal removal unit 80 for this frame.

一方、ＶＡＤ(ｓ)＝１、すなわち、判定結果ＶＡＤ(ｓ)がフレーム番号ｓのフレームが音声状態に属するとの判定結果を示す場合、入力信号読み出し部８３は、雑音モデル推定用バッファ読出処理Ｓ８０２において、推定処理用パラメータ記憶部５０の雑音モデル推定用バッファ５２から、フレーム番号ｓのフレームが含む各入力信号o_s、mを読み込み、それらをフィルタリング部８４に送る。 On the other hand, when VAD(s) = 1, that is, when the determination result VAD(s) indicates that the frame of frame number s belongs to the speech state, the input signal reading unit 83, in the noise model estimation buffer read process S802, reads the input signals o_{s,m} contained in the frame of frame number s from the noise model estimation buffer 52 of the estimation processing parameter storage unit 50 and sends them to the filtering unit 84.

フィルタリング部８４は、フィルタリング処理Ｓ８０３において、各入力信号o_s、mに対して、インパルス応答フィルタｆｉｌｔｅｒ_ｓ,τを所定のフィルタタップ数Ｌで以下のように畳み込み、雑音除去音声信号s_s、mを生成して出力する。 In the filtering process S803, the filtering unit 84 convolves the impulse response filter filter _{s, τ} with the predetermined number of filter taps L with respect to the input signals o _{s, m} as follows, and the noise-removed audio signal s _{s, m} Is generated and output.

ここで得られた雑音除去音声信号s_s、mが本形態の雑音除去装置１の出力となる。 The noise-removed speech signal s _{s, m} obtained here is the output of the noise-removing device 1 of this embodiment.
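The filtering step can be sketched as a direct-form FIR convolution with L taps, s_{s,m} = Σ_{τ=0}^{L-1} filter_{s,τ} · o_{s,m-τ}. The convolution equation is missing from this text, so this standard form is an assumption, as is treating samples before the start of the frame as zero. The function name `fir_filter` is hypothetical.

```python
def fir_filter(signal, taps):
    """Convolve `signal` with the impulse response `taps` (len(taps) = L).

    Samples before the start of `signal` are treated as zero (assumption).
    """
    output = []
    for m in range(len(signal)):
        acc = 0.0
        for tau, h in enumerate(taps):
            if m - tau >= 0:
                acc += h * signal[m - tau]
        output.append(acc)
    return output


# Hypothetical two-tap filter applied to a unit impulse.
impulse_response = fir_filter([1.0, 0.0, 0.0], [0.5, 0.25])
```

Feeding a unit impulse returns the filter taps themselves, which is a quick way to check an implementation.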

〔第２実施形態〕
本発明の第２実施形態は、第１実施形態における前向き第１加重平均算出部３５、前向き第２加重平均算出部３７、後向き第１加重平均算出部４５、及び後向き第２加重平均算出部４７における計算方法が異なるもので、装置構成は第１実施形態と同様である。 [Second Embodiment]
The second embodiment of the present invention differs from the first embodiment in the calculation method used by the forward first weighted average calculation unit 35, the forward second weighted average calculation unit 37, the backward first weighted average calculation unit 45, and the backward second weighted average calculation unit 47. The apparatus configuration is the same as that of the first embodiment.

従って、機能構成例については第１実施形態における上記それぞれの部位の番号が異なるのみであるため、図を分けずに前向き推定部に係る図３及び後向き推定部に係る図４に第２実施形態における部位番号をカッコ書きで記すにとどめる。 Accordingly, the functional configuration differs from the first embodiment only in the reference numbers of these units. Rather than providing separate figures, the reference numbers for the second embodiment are simply given in parentheses in FIG. 3, which shows the forward estimation unit, and FIG. 4, which shows the backward estimation unit.

前向き第１加重平均算出部１３５は、前記雑音モデルパラメータ更新値^Ｎ_t,j,k,u、^σ^N _{t、j、k、u}と前記前向き正規化出力確率ｗ^OF _t,j,kとが入力され、平均値^Ｎ_t,j,uと分散値^σ^N _t、j、uとからなる雑音モデルパラメータの前向き第１加重平均値を出力する。 The forward first weighted average calculation unit 135 receives the noise model parameter update values ^N_{t,j,k,u} and ^σ^N_{t,j,k,u} together with the forward normalized output probability w^OF_{t,j,k}, and outputs the forward first weighted average of the noise model parameters, consisting of the mean ^N_{t,j,u} and the variance ^σ^N_{t,j,u}.

In this embodiment, among the forward normalized output probabilities w^{OF}_{t,j,k} calculated for each normal distribution k, the highest one is identified, and the noise model parameter update values ^N_{t,j,k,u} and ^σ^N_{t,j,k,u} of the corresponding normal distribution k are output as the forward first weighted average values ^N_{t,j,u} and ^σ^N_{t,j,u}.

This processing avoids computing a weighted average and therefore speeds up processing. However, when the forward normalized output probabilities differ only slightly among the normal distributions, ignoring the other normal distributions has a larger effect than when one normal distribution has a markedly higher probability. When using this embodiment, it is therefore desirable that the probability of one particular normal distribution be sufficiently higher than those of the others.

The forward second weighted average calculator 137 receives the forward first weighted average values ^N_{t,j,u} and ^σ^N_{t,j,u} and the forward output probability b_j(O_t), and outputs the forward second weighted average value at frame number t, consisting of the mean ^N_{t,u} and the variance ^σ^N_{t,u}.

In this embodiment, of the forward output probabilities b_j(O_t) calculated for speech and non-speech, the forward first weighted average values ^N_{t,j,u} and ^σ^N_{t,j,u} corresponding to whichever of speech (j = 1) or non-speech (j = 0) has the higher probability are output as the forward second weighted average values ^N_{t,u} and ^σ^N_{t,u}.

This processing avoids computing a weighted average and therefore speeds up processing. However, when the difference between the two probabilities is small, ignoring one of them has a larger effect, so when using this embodiment it is desirable that the difference between the two probabilities be sufficiently large.
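The two selection steps of this embodiment can be sketched as follows. This is a minimal illustration, not the specification's implementation: the helper names `pick_most_probable` and `dominance_margin` and all numeric values are hypothetical, and a single frequency bin u is assumed. The margin helper only illustrates the caveat that the selected probability should clearly dominate the others.

```python
def pick_most_probable(candidates, probabilities):
    """Return the candidate whose probability is largest (no averaging)."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return candidates[best]


def dominance_margin(probabilities):
    """Gap between the largest and second-largest probability."""
    ordered = sorted(probabilities, reverse=True)
    return ordered[0] - ordered[1]


# Hypothetical per-component update values (mean, variance) for one bin u.
updates = {
    0: [(1.0, 0.5), (2.0, 0.4), (3.0, 0.3)],  # j = 0: non-speech model
    1: [(4.0, 0.2), (5.0, 0.1), (6.0, 0.6)],  # j = 1: speech model
}
w_of = {0: [0.1, 0.8, 0.1], 1: [0.2, 0.7, 0.1]}  # forward normalized output probabilities
b = [0.2, 0.9]                                    # forward output probabilities b_j(O_t)

# Stage 1: per state j, keep the update of the most probable normal distribution k.
first_stage = [pick_most_probable(updates[j], w_of[j]) for j in (0, 1)]
# Stage 2: keep the stage-1 result of the more probable state.
second_stage = pick_most_probable(first_stage, b)
```

A caller could compare `dominance_margin(w_of[j])` or `dominance_margin(b)` against a chosen tolerance to decide whether selection is a safe substitute for averaging; the tolerance itself is not specified in the text.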

The forward first weighted average calculator 135 and the forward second weighted average calculator 137 have been described above; the backward first weighted average calculator 145 and the backward second weighted average calculator 147 can perform the same processing as the forward first weighted average calculator 135 and the forward second weighted average calculator 137, respectively.

[Modifications]
In each of the above embodiments, parameter prediction process S306 predicts the parameters at the current time from the estimation result of the previous frame by a random walk process, but the prediction may instead use an autoregressive method (linear prediction) or the like. In that case, the final noise model parameter estimation performance is expected to improve according to the order of the autoregressive coefficients.
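The difference between the two prediction schemes can be sketched as below. This is a hedged illustration under stated assumptions: the random walk is taken to carry the previous mean forward and to add a process noise variance `eps` to the previous variance, and the autoregressive predictor is a plain linear combination of past estimates. The function names, the role assigned to `eps`, and the coefficient values are hypothetical and are not taken from process S306 itself.

```python
def predict_random_walk(prev_mean, prev_var, eps):
    """Random walk prediction: the mean is carried over, the variance grows by eps.

    Assumption: eps acts as the process noise variance of the random walk.
    """
    return prev_mean, prev_var + eps


def predict_autoregressive(history, coeffs):
    """AR(p) prediction of the mean from the p most recent estimates.

    history[-1] is the most recent estimate; coeffs[0] multiplies it.
    """
    return sum(a * x for a, x in zip(coeffs, reversed(history)))


# Hypothetical values.
mean, var = predict_random_walk(2.0, 0.5, 0.001)
# Order-2 coefficients (2, -1) amount to linear extrapolation of the last two estimates.
ar_mean = predict_autoregressive([1.0, 2.0, 3.0], [2.0, -1.0])
```

A higher autoregressive order lets the predictor follow trends that a random walk cannot, at the cost of having to choose or estimate the coefficients.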

In the above embodiments, a sudden abnormality detection and correction unit 74, which examines the durations of speech signal sections and non-speech signal sections and automatically corrects the speech signal section estimation result, may be connected after the threshold determination in the speech signal section estimation unit 70, as shown by the broken line in FIG. 6. Alternatively, as also shown by a broken line in FIG. 6, a signal obtained by multiplying the speech/non-speech determination result by the input signal o_ν may be output, so as to act in the same way as the sudden abnormality detection and correction unit 74. Configuring the speech signal section estimation unit 70 in this way allows sudden identification errors to be corrected, so the performance of speech signal section estimation is expected to improve.
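One simple way to correct sudden identification errors based on section duration is sketched below. It is an assumption-laden illustration rather than the behaviour of unit 74: any run of identical frame decisions shorter than `min_len` frames that is enclosed on both sides by the opposite decision is absorbed into its neighbours. The function name, the `min_len` parameter, and the example sequence are hypothetical.

```python
def correct_short_runs(decisions, min_len):
    """Absorb enclosed runs shorter than min_len into the surrounding label.

    decisions: per-frame labels, 1 = speech, 0 = non-speech.
    Runs touching the start or end of the sequence are left unchanged.
    """
    out = list(decisions)
    n = len(out)
    i = 0
    while i < n:
        j = i
        while j < n and out[j] == out[i]:
            j += 1
        # out[i:j] is one run of identical labels.
        if i > 0 and j < n and (j - i) < min_len:
            out[i:j] = [out[i - 1]] * (j - i)
        i = j
    return out


# Hypothetical decision sequence with two isolated one-frame errors.
raw = [0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0]
fixed = correct_short_runs(raw, min_len=2)
```

Here the isolated speech frame at index 2 and the isolated non-speech frame at index 9 are both removed, while runs of two or more frames are kept.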

In frequency response filter generation process S704, the frequency response filter FILTER_{s,u} may be generated using the forward normalized output probability w^{OF}_{t,j,k} instead of the backward normalized output probability w^{OB}_{s,j,k}.

When the above configuration is implemented by a computer, the processing of the functions that each device should have is described by a program. Executing this program on the computer realizes the above processing functions on the computer.

The program describing this processing can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be implemented in hardware.

[Experimental Results]
To demonstrate the effect of the method of the embodiments, an example is presented in which an input signal containing both a speech signal and a noise signal was input to the noise removal device 1 of the first embodiment and noise removal was performed. The experimental method and results are described below.

In this experiment, the proposed method was evaluated using CENSREC-1-C, a database designed for evaluating voice activity detection (VAD) (Norihide Kitaoka, Takeshi Yamada, Satoru Tsuge, Chiyomi Miyajima, Takanobu Nishiura, Masato Nakayama, Yuki Denda, Masakiyo Fujimoto, Kazumasa Yamamoto, Tetsuya Takiguchi, Shingo Kuroiwa, Kazuya Takeda, Satoshi Nakamura, "CENSREC-1-C: Construction of an evaluation framework for voice activity detection under noise," IPSJ SIG Technical Report, SLP-63-1, pp. 1-6, Oct. 2006).

CENSREC-1-C contains two types of data: artificially created simulation data and real data recorded in real environments. In this experiment, the real data were used for evaluation in order to investigate the effects of speech quality degradation in real environments (such as noise and changes in vocalization).

The real data of CENSREC-1-C were recorded in two environments, a student cafeteria (Restaurant) and near a highway (Street), each at two SNR conditions: High SNR (noise level around 60 dB(A)) and Low SNR (noise level around 70 dB(A)). Each speech file contains one speaker uttering continuous digit strings of 1 to 12 digits, 8 to 10 times at intervals of about 2 seconds, and 4 files were recorded per speaker in each environment. There are 10 speakers (5 male and 5 female), although the evaluation uses data from 9 speakers, excluding one male speaker. Each signal was sampled at 8,000 Hz with 16-bit quantization. The acoustic signal analysis unit 10 processed this acoustic signal with a frame length of 25 ms (200 samples), moving the start point of the frame every 10 ms (80 samples).
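The framing parameters above can be checked with a short sketch. It splits a signal into 200-sample frames with an 80-sample shift; dropping a trailing partial frame is an assumption made here for simplicity, and the function name `split_into_frames` and the placeholder signal are hypothetical.

```python
SAMPLE_RATE_HZ = 8000
FRAME_LEN = 25 * SAMPLE_RATE_HZ // 1000    # 25 ms -> 200 samples
FRAME_SHIFT = 10 * SAMPLE_RATE_HZ // 1000  # 10 ms -> 80 samples


def split_into_frames(samples, frame_len=FRAME_LEN, shift=FRAME_SHIFT):
    """Split a sample sequence into overlapping frames.

    Assumption: a trailing partial frame is dropped rather than padded.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of placeholder samples at 8 kHz.
signal = [0] * SAMPLE_RATE_HZ
frames = split_into_frames(signal)
```

With these settings, consecutive frames overlap by 120 samples (15 ms).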

The probability models of the silence signal and the clean speech signal were each a mixed normal distribution model with 32 mixture components whose acoustic feature is a 24-dimensional log mel spectrum, trained on silence signals and clean speech signals, respectively.

In parameter prediction process S306, the parameter ε was set to 0.001, and in variable determination process S403, the number of frames β required for backward estimation was set to 5. In the state transition probability table 61, the state transition probabilities a_{i,j} were set to a_{0,0} = 0.8, a_{0,1} = 0.2, a_{1,1} = 0.1, and a_{1,0} = 0.9. In the speech signal section estimation unit 70, the threshold TH was set to 10.
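To show how these settings interact, the following sketch runs a two-state forward-backward recursion with the transition probabilities above, forms the state probabilities as products of forward and backward probabilities, and compares the speech to non-speech ratio with TH. Several points are assumptions made for illustration: the uniform initial state distribution, the backward pass over the whole sequence (rather than a β-frame look-ahead), the absence of scaling, and the output probability values themselves. The function name `state_probability_ratios` is hypothetical.

```python
A = [[0.8, 0.2],   # A[i][j]: transition from state i to state j
     [0.9, 0.1]]   # state 0 = non-speech, state 1 = speech
TH = 10.0


def state_probability_ratios(output_probs, trans=A, initial=(0.5, 0.5)):
    """Return the ratio speech / non-speech state probability for every frame.

    output_probs[t] = (b_0(O_t), b_1(O_t)). No scaling is applied, so this
    sketch is only suitable for short sequences with non-zero probabilities.
    """
    n = len(output_probs)
    fwd = [[0.0, 0.0] for _ in range(n)]
    bwd = [[1.0, 1.0] for _ in range(n)]
    for j in range(2):
        fwd[0][j] = initial[j] * output_probs[0][j]
    for t in range(1, n):
        for j in range(2):
            fwd[t][j] = output_probs[t][j] * sum(
                fwd[t - 1][i] * trans[i][j] for i in range(2))
    for t in range(n - 2, -1, -1):
        for i in range(2):
            bwd[t][i] = sum(
                trans[i][j] * output_probs[t + 1][j] * bwd[t + 1][j]
                for j in range(2))
    return [(fwd[t][1] * bwd[t][1]) / (fwd[t][0] * bwd[t][0]) for t in range(n)]


# Hypothetical output probabilities for three frames.
probs = [(0.9, 0.001), (0.001, 0.9), (0.9, 0.001)]
ratios = state_probability_ratios(probs)
decisions = [r > TH for r in ratios]
```

Only the middle frame, whose speech output probability strongly dominates, exceeds the threshold in this toy example.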

FIGS. 13(a) and 13(b) show, respectively, the speech signal waveform when only speech signal section detection was performed and the speech signal waveform when noise removal was performed by the method of this embodiment. These results show that the method of this embodiment removes noise effectively.

FIGS. 13(c), 13(d), and 13(e) show evaluation results by speech recognition for, respectively, the baseline defined in the CENSREC-1-C database, the case where only speech signal section detection was performed, and the method of this embodiment. These results show that the method of this embodiment achieves high speech recognition performance in noisy environments.

FIG. 1 is a block diagram showing the functional configuration of the noise removal device of the first embodiment.
FIG. 2(a) shows details of the model parameter storage unit 20 shown in FIG. 1, and FIG. 2(b) shows details of the estimation processing parameter storage unit 50.
FIG. 3 is a block diagram illustrating the detailed configuration of the forward estimation unit shown in FIG. 1.
FIG. 4 is a block diagram illustrating the detailed configuration of the backward estimation unit shown in FIG. 1.
FIG. 5 is a block diagram illustrating the detailed configuration of the state probability ratio calculation unit shown in FIG. 1.
FIG. 6 is a block diagram illustrating the detailed configuration of the speech signal section estimation unit shown in FIG. 1.
FIG. 7 is a block diagram illustrating the detailed configuration of the signal removal unit shown in FIG. 1.
FIG. 8 is a flowchart explaining the processing procedure of the forward estimation unit 30 of the first embodiment.
FIG. 9 is a flowchart explaining the processing procedure of the backward estimation unit of the first embodiment.
FIG. 10 is a flowchart explaining the processing procedure of the state probability ratio calculation unit.
FIG. 11(a) is a flowchart explaining the filter generation process performed by the signal removal unit 80, and FIG. 11(b) is a flowchart explaining the filtering process.
FIG. 12 is a conceptual diagram showing the state transition model of the speech state and the non-speech state.
FIGS. 13(a) and 13(b) show, respectively, the speech signal waveform when only speech signal section detection was performed and the speech signal waveform when noise removal was performed by the method of this embodiment. FIGS. 13(c), 13(d), and 13(e) show evaluation results by speech recognition.

Claims

A noise removal device that removes a noise signal from an input signal containing a speech signal and the noise signal, the device comprising:
an acoustic signal analysis unit that extracts and outputs an acoustic feature value of the input signal for each frame of a fixed time interval;
a model parameter storage unit that stores probability model parameters of probability models in which the output probabilities of a clean speech signal and a silence signal are each expressed by a mixed normal distribution containing a plurality of normal distributions;
a forward estimation unit that receives the acoustic feature value and the probability model parameters of the clean speech signal and the silence signal stored in the model parameter storage unit, and sequentially estimates and outputs noise model parameters of the current frame, from past frames toward the current frame, by a parallel nonlinear Kalman filter;
a backward estimation unit that receives the noise model parameters output from the forward estimation unit and the probability model parameters of the clean speech signal and the silence signal stored in the model parameter storage unit, sequentially estimates backward the noise model parameters of the current frame, from future frames toward the current frame, by a parallel Kalman smoother, sequentially estimates, based on the backward-estimated noise model parameters, probability model parameters of probability models in which the output probabilities of a speech (noise + clean speech) signal and a non-speech (noise + silence) signal are each expressed by a mixed normal distribution, and calculates and outputs the output probability of each of the speech signal and the non-speech signal;
a parameter storage unit that stores calculation results obtained in the course of processing in the forward estimation unit and the backward estimation unit;
a state probability ratio calculation unit that receives the output probability of each of the speech signal and the non-speech signal, calculates a speech state probability, a non-speech state probability, and the ratio of the speech state probability to the non-speech state probability, and outputs them;
a speech signal section estimation unit that receives the ratio of the state probabilities, compares the ratio for each frame with a threshold, and outputs a determination result indicating whether each frame belongs to the speech state or the non-speech state; and
a noise removal unit that receives the mean of each normal distribution, which is a probability model parameter of each of the speech signal and the non-speech signal, the mean of each normal distribution, which is a probability model parameter of each of the clean speech signal and the silence signal, and the speech state probability and the non-speech state probability, generates a frequency response filter for removing the noise signal by taking a weighted average, using the speech state probability and the non-speech state probability, of the relative values of the means of the normal distributions of the clean speech signal and the silence signal with respect to the means of the normal distributions of the speech signal and the non-speech signal, converts the frequency response filter into an impulse response filter, and convolves the impulse response filter with the input signal to generate and output a noise-removed speech signal.

In the noise removal apparatus of Claim 1,
The forward estimation unit includes:
A noise model parameter prediction unit that receives the acoustic feature value and a forward second weighted average value one frame before, calculates and outputs a noise model parameter prediction value of the current frame from the past frame toward the current frame; ,
The acoustic feature value, the noise model parameter prediction value, and each probability model parameter of the clean speech signal and silence signal are input, and the update process of the noise model parameter is performed for each probability model of the clean speech signal and silence signal. A noise model parameter update unit that outputs a noise model parameter update value in parallel for each of a plurality of normal distributions;
The noise model parameter update value and the probability model parameters of the clean speech signal and the silence signal are input, and the speech (noise + clean speech) probability model parameter suitable for the noise environment at the time in units of each frame; A forward probability model parameter generation unit that generates and outputs a non-voice (noise + silence) probability model parameter;
For each of the normal distributions, the acoustic feature amount, the speech probability model parameter and the non-speech probability model parameter output from the forward probability model parameter generation unit, and the probability model parameters of the clean speech signal and the silence signal. Forward weights of speech and non-speech output probabilities for each frame, and forward normalized output probabilities obtained by decomposing the forward output probabilities for each of the normal distributions, and output them. An audio output probability calculation unit;
A forward first weighted average calculating unit that receives the noise model parameter update value and the forward normalized output probability and calculates and outputs a forward first weighted average value of the noise model parameter;
A forward second weighted average calculating unit that receives the forward first weighted average value and the forward output probability of each of the speech and non-speech and calculates and outputs the forward second weighted average value of the current frame;
and the backward estimation unit includes:
The noise model parameter prediction value after one frame, the noise model parameter update value of the current frame, and the noise model parameter re-estimation value after one frame are input, and re-estimation processing of the forward noise model parameter of the current frame is performed. A noise model parameter re-estimation unit that outputs a noise model parameter re-estimation value in parallel for each of the plurality of normal distributions of each probability model of the clean speech signal and the silence signal from the future time to the current time; ,
The noise model parameter re-estimation value and the probability model parameters of the clean speech signal and silence signal are input, and the speech (noise + clean speech) probability model parameter suitable for the noise environment at the time in units of the frame And a backward probability model parameter generation unit that generates and outputs a non-voice (noise + silence) probability model parameter;
For each of the normal distributions, the acoustic feature amount, the speech probability model parameter and the non-speech probability model parameter output from the backward probability model parameter generation unit, and each probability model parameter of the clean speech signal and the silence signal Backward speech / non-speech output for calculating and outputting the output probability of each of speech and non-speech for each frame and the backward normalized output probability obtained by decomposing this output probability for each normal distribution A probability calculator;
A backward first weighted average calculating unit that receives the noise model parameter re-estimated value and the backward normalized output probability, calculates and outputs a backward first weighted average value of the noise model parameter;
A backward second weighted average calculating unit that receives the backward first weighted average value and the output probabilities of the speech and non-speech and calculates and outputs the backward second weighted average value of the current frame;
the noise removal unit further receives at least one of the forward normalized output probability and the backward normalized output probability for each normal distribution, and
the noise removal unit generates the frequency response filter by taking a weighted average, further using at least one of the forward normalized output probability and the backward normalized output probability for each normal distribution, of the relative values of the means of the normal distributions, which are probability model parameters of the clean speech signal and the silence signal, with respect to the means of the normal distributions, which are probability model parameters of the speech signal and the non-speech signal.

The noise removal device according to claim 1 or 2, wherein the state probability ratio calculation unit comprises:
a state transition probability table that stores state transition probabilities set in advance for a speech/non-speech state transition model expressed by a finite state machine;
a forward probability calculation unit that receives the output probability of each of speech and non-speech of the current frame, the state transition probabilities, and the forward probability of the previous frame, and calculates and outputs the forward probability of the current frame;
a backward probability calculation unit that receives the output probability of each of speech and non-speech of the next frame, the state transition probabilities, and the backward probability of the next frame, and calculates and outputs the backward probability of the current frame;
a probability ratio calculation buffer that stores the forward probability and the backward probability obtained in the course of processing in the forward probability calculation unit and the backward probability calculation unit; and
a probability ratio calculation unit that receives the forward probability of the current frame and the backward probability of the current frame, calculates the non-speech state probability and the speech state probability as products of the forward probability of the current frame and the backward probability of the current frame, calculates the ratio of the speech state probability to the non-speech state probability, and outputs the non-speech state probability, the speech state probability, and the ratio of the speech state probability to the non-speech state probability.

The noise removal device according to claim 2 or 3, wherein
the forward first weighted average calculating unit outputs, as the forward first weighted average value of the noise model parameters, the noise model parameter update value whose forward normalized output probability is the maximum among the noise model parameter update values,
the forward second weighted average calculating unit outputs, as the forward second weighted average value of the current frame, the forward first weighted average value whose forward output probability is the maximum among the forward first weighted average values,
the backward first weighted average calculating unit outputs, as the backward first weighted average value of the noise model parameters, the noise model parameter re-estimation value whose backward normalized output probability is the maximum among the noise model parameter re-estimation values, and
the backward second weighted average calculating unit outputs, as the backward second weighted average value of the current frame, the backward first weighted average value whose state transition probability is the maximum among the backward first weighted average values.

A noise removal method for removing a noise signal from an input signal containing a speech signal and the noise signal, wherein probability model parameters of probability models in which the output probabilities of a clean speech signal and a silence signal are each expressed by a mixed normal distribution containing a plurality of normal distributions are stored in a model parameter storage unit, the method comprising:
an acoustic signal analysis process in which an acoustic signal analysis unit extracts and outputs an acoustic feature value of the input signal for each frame of a fixed time interval;
a forward estimation process in which a forward estimation unit, which receives the acoustic feature value and the probability model parameters of the clean speech signal and the silence signal stored in the model parameter storage unit, sequentially estimates and outputs noise model parameters of the current frame, from past frames toward the current frame, by a parallel nonlinear Kalman filter;
a backward estimation process in which a backward estimation unit, which receives the noise model parameters and the probability model parameters of the clean speech signal and the silence signal, sequentially estimates backward the noise model parameters of the current frame, from future frames toward the current frame, by a parallel Kalman smoother, sequentially estimates, based on the backward-estimated noise model parameters, probability model parameters of probability models in which the output probabilities of a speech (noise + clean speech) signal and a non-speech (noise + silence) signal are each expressed by a mixed normal distribution, and calculates and outputs the output probability of each of the speech signal and the non-speech signal;
a process of storing, in a parameter storage unit, calculation results obtained in the course of the forward estimation process and the backward estimation process;
a state probability ratio calculation process in which a state probability ratio calculation unit, which receives the output probability of each of the speech signal and the non-speech signal, calculates a speech state probability, a non-speech state probability, and the ratio of the speech state probability to the non-speech state probability, and outputs them;
a speech signal section estimation process in which a speech signal section estimation unit, which receives the ratio of the state probabilities, compares the ratio for each frame with a threshold and outputs a determination result indicating whether each frame belongs to the speech state or the non-speech state; and
a noise removal process in which a noise removal unit, which receives the mean of each normal distribution, which is a probability model parameter of each of the speech signal and the non-speech signal, the mean of each normal distribution, which is a probability model parameter of each of the clean speech signal and the silence signal, and the speech state probability and the non-speech state probability, generates a frequency response filter for removing the noise signal by taking a weighted average, using the speech state probability and the non-speech state probability, of the relative values of the means of the normal distributions of the clean speech signal and the silence signal with respect to the means of the normal distributions of the speech signal and the non-speech signal, converts the frequency response filter into an impulse response filter, and convolves the impulse response filter with the input signal to generate and output a noise-removed speech signal.

In the noise removal method of Claim 5,
The forward estimation process includes:
A noise model parameter prediction process in which the noise model parameter prediction unit, to which the acoustic feature amount and the forward second weighted average value of one frame before are input, calculates and outputs the noise model parameter prediction value of the current frame, proceeding from past frames toward the current frame,
A noise model parameter update process in which the noise model parameter update unit, to which the acoustic feature amount, the noise model parameter prediction value, and each probability model parameter of the clean speech signal and the silence signal are input, performs the noise model parameter update processing in parallel for each normal distribution of each probability model of the clean speech signal and the silence signal, and outputs the noise model parameter update values,
A forward probability model parameter generation process in which the forward probability model parameter generation unit, to which the noise model parameter update value and the probability model parameters of the clean speech signal and the silence signal are input, generates and outputs, for each frame, a speech (noise + clean speech) probability model parameter and a non-speech (noise + silence) probability model parameter adapted to the noise environment at that time;
A forward speech / non-speech output probability calculation process in which the forward speech / non-speech output probability calculation unit, to which are input the acoustic feature amount, the speech probability model parameter and the non-speech probability model parameter output from the forward probability model parameter generation unit, and the mixture weight of each normal distribution, which is a probability model parameter of each of the clean speech signal and the silence signal, calculates and outputs, for each frame, the forward output probability of each of speech and non-speech and the forward normalized output probability obtained by decomposing the forward output probability for each normal distribution,
A forward first weighted average calculation process in which the forward first weighted average calculation unit, to which the noise model parameter update value and the forward normalized output probability are input, calculates and outputs a forward first weighted average value of the noise model parameter,
A forward second weighted average calculation process in which the forward second weighted average calculation unit, to which the forward first weighted average value and the forward output probabilities of speech and non-speech are input, calculates and outputs the forward second weighted average value of the current frame;
The backward estimation process includes:
A noise model parameter re-estimation process in which the noise model parameter re-estimation unit, to which the noise model parameter prediction value of one frame after, the noise model parameter update value of the current frame, and the noise model parameter re-estimation value of one frame after are input, performs the noise model parameter re-estimation processing in parallel for each normal distribution of each probability model of the clean speech signal and the silence signal, proceeding from future frames toward the current frame, and outputs the noise model parameter re-estimation values,
A backward probability model parameter generation process in which the backward probability model parameter generation unit, to which the noise model parameter re-estimation value and the probability model parameters of the clean speech signal and the silence signal are input, generates and outputs, for each frame, a speech (noise + clean speech) probability model parameter and a non-speech (noise + silence) probability model parameter adapted to the noise environment at that time;
A backward speech / non-speech output probability calculation process in which the backward speech / non-speech output probability calculation unit, to which are input the acoustic feature amount, the speech probability model parameter and the non-speech probability model parameter output from the backward probability model parameter generation unit, and the mixture weight of each normal distribution, which is a probability model parameter of each of the clean speech signal and the silence signal, calculates and outputs, for each frame, the output probability of each of speech and non-speech and the backward normalized output probability obtained by decomposing this output probability for each normal distribution,
A backward first weighted average calculation process in which the backward first weighted average calculation unit, to which the noise model parameter re-estimation value and the backward normalized output probability are input, calculates and outputs a backward first weighted average value of the noise model parameter,
A backward second weighted average calculation process in which the backward second weighted average calculation unit, to which the backward first weighted average value and the output probabilities of speech and non-speech are input, calculates and outputs the backward second weighted average value of the current frame;
The noise removal process includes:
A step in which the noise removing unit, to which at least one of the forward normalized output probability and the backward normalized output probability for each normal distribution is further input, generates the frequency response filter by weighted-averaging the relative value of the average of each normal distribution that is a probability model parameter of each of the clean speech signal and the silence signal, with respect to the average of each normal distribution that is a probability model parameter of each of the speech signal and the non-speech signal, using at least one of the forward normalized output probability and the backward normalized output probability for each normal distribution;
A noise removal method characterized by the above.
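The two-stage weighted averaging recited in this claim can be illustrated by the following Python sketch. It assumes, for the example only, that the per-distribution noise model parameter values are held in an array indexed by model (non-speech, speech), normal distribution, and parameter dimension, and that the second-stage weights are the output probabilities normalized to sum to 1; the function name and array layout are hypothetical. The same computation would apply to the backward case with the re-estimation values and backward probabilities.

```python
import numpy as np

def two_stage_weighted_average(values, norm_probs, out_probs):
    """Collapse per-distribution noise parameter values to one value per frame.

    values: (2, K, D) noise model parameter values, one per normal distribution
        of the [non-speech, speech] models.
    norm_probs: (2, K) normalized output probabilities, each row summing to 1.
    out_probs: (2,) output probabilities of non-speech and speech.
    """
    # First weighted average: over the K normal distributions of each model.
    first = np.einsum("jk,jkd->jd", norm_probs, values)   # (2, D)
    # Second weighted average: over the two models (weights normalized here,
    # which is an assumption of this sketch).
    w = out_probs / out_probs.sum()
    second = w @ first                                    # (D,)
    return first, second
```

The second weighted average value is what the claim feeds back to the noise model parameter prediction unit for the next frame.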

In the noise removal method of Claim 5 or 6,
The state probability ratio calculation process includes:
A state transition probability table stores state transition probabilities set in advance for a speech / non-speech state transition model expressed by a finite state machine,
A forward probability calculation process in which the forward probability calculation unit, to which the output probability of each of speech and non-speech of the current frame, the state transition probability, and the forward probability of one frame before are input, calculates and outputs the forward probability of the current frame,
A backward probability calculation process in which the backward probability calculation unit, to which the output probability of each of speech and non-speech of one frame after, the state transition probability, and the backward probability of one frame after are input, calculates and outputs the backward probability of the current frame;
A buffering process for storing the forward probability and the backward probability obtained in the forward probability calculation process and the backward probability calculation process in a probability ratio calculation buffer;
A probability ratio calculation process in which the probability ratio calculation unit, to which the forward probability of the current frame and the backward probability of the current frame are input, calculates the non-speech state probability and the speech state probability from the product of the forward probability and the backward probability of the current frame, calculates the ratio of the speech state probability to the non-speech state probability, and outputs the non-speech state probability, the speech state probability, and the ratio,
A noise removal method characterized by comprising the above processes.
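The forward probability, backward probability, and probability ratio calculations of this claim follow the pattern of the forward-backward recursion on a two-state (non-speech, speech) model. The Python sketch below shows one way this could look, together with the per-frame threshold comparison of the speech signal section estimation. The function names, the initial state probabilities `pi`, and the per-frame rescaling (used only for numerical stability; it does not change the ratio) are assumptions of this sketch and are not taken from the claims.

```python
import numpy as np

def state_probability_ratio(b, A, pi):
    """b: (T, 2) output probabilities per frame for [non-speech, speech].
    A: (2, 2) state transition probabilities, A[i, j] = P(next j | current i).
    pi: (2,) initial state probabilities (assumed for this sketch).
    Returns the state probabilities (T, 2) and the speech / non-speech ratio (T,).
    """
    T = b.shape[0]
    alpha = np.zeros((T, 2))
    beta = np.ones((T, 2))
    # Forward probabilities: use the previous frame's forward probability.
    alpha[0] = pi * b[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]
        alpha[t] /= alpha[t].sum()
    # Backward probabilities: use the next frame's outputs and backward probability.
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    # State probabilities from the product of forward and backward probabilities.
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma, gamma[:, 1] / gamma[:, 0]

def estimate_speech_frames(ratio, threshold=1.0):
    """Mark a frame as speech when the ratio exceeds the threshold."""
    return ratio > threshold
```

For example, with output probabilities that favor non-speech in the first and last frames and speech in the two middle frames, the ratio exceeds 1 only in the middle frames.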

In the noise removal method of Claim 6 or 7,
The forward first weighted average calculation process is a process of outputting, as the forward first weighted average value of the noise model parameter, the noise model parameter update value having the maximum forward normalized output probability among the noise model parameter update values,
The forward second weighted average calculation process is a process of outputting, as the forward second weighted average value of the current frame, the forward first weighted average value having the maximum forward output probability among the forward first weighted average values,
The backward first weighted average calculation process is a process of outputting, as the backward first weighted average value of the noise model parameter, the noise model parameter re-estimation value having the maximum backward normalized output probability among the noise model parameter re-estimation values,
The backward second weighted average calculation process is a process of outputting, as the backward second weighted average value of the current frame, the backward first weighted average value having the maximum state transition probability among the backward first weighted average values,
A noise removal method characterized by the above.
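This claim replaces each weighted average with the selection of a single maximum-probability candidate. A minimal Python sketch of that selection is given below, assuming the same hypothetical array layout of model, normal distribution, and parameter dimension; `model_scores` stands for whichever per-model quantity the claim names as the second-stage criterion (the forward output probability in the forward case, the state transition probability in the backward case). The function and parameter names are illustrative only.

```python
import numpy as np

def select_by_maximum(values, norm_probs, model_scores):
    """Pick maximum-probability candidates instead of averaging.

    values: (2, K, D) noise model parameter values per model and distribution.
    norm_probs: (2, K) normalized output probabilities per distribution.
    model_scores: (2,) per-model criterion for the second stage.
    """
    # First stage: for each model, the value of its most probable distribution.
    best_k = np.argmax(norm_probs, axis=1)                # (2,)
    first = values[np.arange(values.shape[0]), best_k]    # (2, D)
    # Second stage: the first-stage value of the highest-scoring model.
    second = first[np.argmax(model_scores)]               # (D,)
    return first, second
```

Compared with weighted averaging, this selection avoids the multiply-accumulate over all normal distributions, at the cost of ignoring the lower-probability candidates.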

A program for causing a computer to function as the noise removal apparatus according to any one of claims 1 to 4.

A computer-readable recording medium on which the program according to claim 9 is recorded.