JP4673828B2

JP4673828B2 - Speech signal section estimation apparatus, method thereof, program thereof and recording medium

Info

Publication number: JP4673828B2
Application number: JP2006335536A
Authority: JP
Inventors: 雅清藤本; 健太郎石塚; 比呂子加藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2006-12-13
Filing date: 2006-12-13
Publication date: 2011-04-20
Anticipated expiration: 2026-12-13
Also published as: JP2008145923A

Abstract

PROBLEM TO BE SOLVED: To provide a speech signal section estimation technique that estimates speech signal sections with high precision by accurately tracking the state transitions of a signal, even under non-stationary noise whose statistical properties change over time.

SOLUTION: A sound signal analyzer 10 extracts sound features from frames obtained by cutting the input signal into fixed-length sections. Using probability models (GMMs) of a clean speech signal and a silence signal, a forward estimating unit 30 and a backward estimating unit 40 estimate noise model parameters both forward and backward along the time axis, processing the plurality of normal distributions contained in the GMMs in parallel. From the estimated noise model parameters, speech/non-speech output probabilities and a noise state transition probability are calculated. A state probability ratio calculator 60 calculates, for each frame, the ratio of the speech state probability to the non-speech state probability, and a speech signal section estimating unit 70 compares the ratio with a threshold to decide whether each frame is in a speech state or a non-speech state.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a speech signal section estimation apparatus that estimates, from an acoustic signal containing a speech signal and a noise signal, the sections in which the speech signal exists by obtaining a speech state probability and a non-speech state probability, and to a method therefor, a program therefor, and a recording medium storing the program.

In many speech signal processing technologies, such as speech coding, noise suppression, dereverberation, and automatic speech recognition, the sections in which the target speech signal exists must be estimated from an acoustic signal that also contains signals other than the target speech, that is, noise signals. The accuracy of this section estimation greatly affects the effectiveness of the subsequent processing. Section estimation is therefore a foundation of all speech signal processing technologies, and improving it is an urgent problem.
Non-Patent Document 1, listed below, discloses a speech signal section estimation method that uses features such as the frequency spectrum of the input acoustic signal, the full-band energy of the signal, the energy of each band after band division, the number of zero crossings of the signal waveform, and the time derivatives of these quantities. In methods using these acoustic features, the input acoustic signal is divided into segments of a fixed length of about 25 ms, the features are calculated in each segment, and a segment is judged to be a speech section if the value exceeds a separately defined threshold and a non-speech section otherwise.

Non-Patent Document 2, listed below, discloses a speech signal section estimation method that applies noise removal based on Wiener filter theory to the input acoustic signal and then uses features such as the full-band energy of the denoised signal, the energy of each band after band division, and the variance of the frequency spectrum. In methods using these acoustic features, the input acoustic signal is divided into segments of a fixed length of about 25 ms, the features are calculated in each segment, and a segment is judged to be a speech section if the value exceeds a separately defined threshold and a non-speech section otherwise.
Non-Patent Document 3, listed below, discloses a speech signal section estimation method that defines state transitions of the signal. In this method, the input acoustic signal is regarded as a signal that moves between a speech state and a non-speech state over time. The transitions between the two states are decided on the basis of the probability that the input signal belongs to the speech state and the probability that it belongs to the non-speech state, and only the signal belonging to the speech state is output.
In addition, improving the performance of speech signal section estimation requires a technique for accurately estimating the noise signal contained in the input acoustic signal. Of particular importance is the sequential estimation of non-stationary noise signals, whose statistical characteristics change from moment to moment.

Non-Patent Document 4, listed below, discloses the Kalman filter, a general sequential estimation method for time-series parameters. This method obtains an optimal parameter estimate by taking into account the influence of the parameters at past times on the parameters at the current time.
Non-Patent Document 5, listed below, discloses the extended (nonlinear) Kalman filter, which develops the Kalman filter so that estimation can also be performed with nonlinear models. It also discloses the Kalman smoother, another development of the Kalman filter. The Kalman smoother obtains a more accurate parameter estimate by taking into account the relationship with the parameters not only at past times but also at future times.
Non-Patent Document 1: Benyassine, A., Shlomot, E., and Su, H.-Y., “ITU-T recommendation G.729 Annex B: A silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications,” IEEE Communications Magazine, pp. 64-73, September 1997.
Non-Patent Document 2: ETSI ES 202 050 v.1.1.4, “Speech processing, Transmission and Quality aspects (STQ), Advanced Distributed Speech Recognition; Front-end feature extraction algorithm; Compression algorithms,” November 2005.
Non-Patent Document 3: Sohn, J., Kim, N. S., and Sung, W., “A Statistical Model-Based Voice Activity Detection,” IEEE Signal Processing Letters, Vol. 6, No. 1, pp. 1-3, January 1999.
Non-Patent Document 4: Kalman, R. E., “A New Approach to Linear Filtering and Prediction Problems,” Transactions of the ASME-Journal of Basic Engineering, Vol. 82, Series D, pp. 35-45, 1960.
Non-Patent Document 5: Toru Katayama, Applied Kalman Filter, Chapters 5 and 7, Asakura Shoten, 1983.

The techniques described in Non-Patent Documents 1, 2, and 3 perform speech signal section estimation on the assumption that the characteristics of the noise signal contained in the input acoustic signal are stationary. However, most noise signals in real environments are non-stationary; that is, their statistical characteristics vary over time. The techniques of Non-Patent Documents 1, 2, and 3 therefore cannot follow the time variation of the noise and cannot estimate speech signal sections with high accuracy.
For the estimation of non-stationary noise signals, the techniques described in Non-Patent Documents 4 and 5 sequentially estimate the target signal using a Kalman filter and a Kalman smoother. The Kalman filter estimates in the forward direction in time, and the Kalman smoother re-estimates the Kalman filter's results in the backward direction in time. However, these sequential estimation methods output only one estimate at each time. Consequently, if a fatal error occurs at some time, it affects the estimates at subsequent times, and recovery from the error becomes difficult.

The technique described in Non-Patent Document 3 estimates speech signal sections by focusing on the fact that the input acoustic signal moves back and forth between a speech state and a non-speech state. However, the destination state is decided from the states of the past signal only; the influence of the states of the future signal is not considered, so the state transitions of the signal cannot be represented accurately.
An object of the present invention is therefore to provide an apparatus, a method, a program, and a recording medium for estimating speech signal sections with high accuracy, which estimate the non-stationary noise signal contained in the input acoustic signal and are thus applicable beyond stationary noise, and which take into account the influence of the state of the input acoustic signal not only at past and present times but also at future times.

The speech signal section estimation apparatus of the present invention comprises an acoustic signal analysis unit, a noiseless model storage unit, a forward estimation unit, a backward estimation unit, a parameter storage unit, a state probability ratio calculation unit, and a speech signal section estimation unit.
The acoustic signal analysis unit extracts a speech feature for each frame obtained by cutting the input signal into fixed-length sections.
The noiseless model storage unit stores, for each of a clean speech signal and a silence signal, the parameters of a probability model based on a mixture of a plurality of normal distributions (GMM: Gaussian Mixture Model).
The forward estimation unit receives the speech feature and the probability model parameters stored in the noiseless model storage unit, and sequentially estimates and outputs the noise model parameters at the current time, proceeding from past times toward the current time, using a parallel nonlinear Kalman filter.

The backward estimation unit receives the noise model parameters output from the forward estimation unit and the probability model parameters stored in the noiseless model storage unit, and sequentially estimates the noise model parameters at the current time backward, from future times toward the current time, using a parallel Kalman smoother. Based on these backward-estimated noise model parameters, it sequentially estimates the probability model parameters of speech (noise + clean speech) and non-speech (noise + silence), and calculates and outputs the output probabilities of speech and non-speech. From these output probabilities and the backward-estimated noise model parameters, it also calculates and outputs the noise state transition probability from the noise model parameter estimate one frame earlier to that of the current frame.

The parameter storage unit stores the calculation results obtained in the course of processing in the forward estimation unit and the backward estimation unit.
The state probability ratio calculation unit receives the speech output probability, the non-speech output probability, and the noise state transition probability output from the backward estimation unit, calculates the speech state probability and the non-speech state probability, and outputs the ratio of the speech state probability to the non-speech state probability.
The speech signal section estimation unit receives the state probability ratio, compares it with a threshold for each frame, and outputs either the speech state or the non-speech state as the comparison result.
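The per-frame decision described above can be illustrated with a minimal Python sketch. The function names, the strict `>` comparison, the small floor that guards the division, and the grouping of frames into sections are assumptions made for illustration; the text only states that the ratio is compared with a threshold for each frame.

```python
from typing import List, Tuple


def probability_ratio(p_speech: float, p_nonspeech: float,
                      floor: float = 1e-300) -> float:
    """Ratio of the speech state probability to the non-speech state probability.

    The floor (an assumption) avoids division by zero.
    """
    return p_speech / max(p_nonspeech, floor)


def label_frames(ratios: List[float], threshold: float) -> List[bool]:
    """True marks a speech frame; '>' rather than '>=' is an assumption."""
    return [r > threshold for r in ratios]


def speech_sections(labels: List[bool]) -> List[Tuple[int, int]]:
    """Group consecutive speech frames into (start, end_exclusive) frame ranges."""
    sections = []
    start = None
    for i, is_speech in enumerate(labels):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(labels)))
    return sections
```

For example, ratios of 0.2, 3.0, 10.0, 0.9, and 5.0 with a hypothetical threshold of 1.0 yield two speech sections, covering frames 1 to 2 and frame 4.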

The speech signal section estimation apparatus of the present invention estimates a plurality of noise parameters, one for each of the normal distributions contained in the probability models (GMMs) of the silence signal and the clean speech signal, in the forward direction in time and then also in the backward direction, and determines the noise parameters at each time by taking a weighted average of the resulting estimates. As a result, even under non-stationary noise whose statistical properties change over time, the state transitions of the signal can be tracked accurately and speech signal sections can be estimated with high accuracy. Moreover, even if a large error occurs at some time, estimation at subsequent times can proceed without being affected by it.

Embodiments of the present invention are described below with reference to the drawings. In the drawings used in the following description, identical parts are given identical reference numerals. Their names and functions are also identical, and their description is not repeated.
In the following description, symbols such as “^” and “˜” used in the text should properly appear directly above the character that follows them, but because of the limitations of text notation they are written immediately before that character. In the equations, these symbols are written in their proper positions. Vectors are written with the word “vector” immediately before them, for example “vector A”. Processing described for an individual element of a vector applies to all elements of the vector unless otherwise noted.

[First Embodiment]
FIG. 1 shows an example of the functional configuration of the speech signal section estimation apparatus 1 of the present invention.
The speech signal section estimation apparatus 1 comprises an acoustic signal analysis unit 10, a noiseless model storage unit 20, a forward estimation unit 30, a backward estimation unit 40, a parameter storage unit 50, a state probability ratio calculation unit 60, and a speech signal section estimation unit 70.
The acoustic signal analysis unit 10 receives an acoustic signal O(t) in which a speech signal and a noise signal are superimposed. It first cuts out frames of a fixed length from O(t) while shifting the start point along the time axis by a fixed step. For example, it cuts out frames of 160 samples (20 ms at a sampling frequency of 8000 Hz) while shifting the start point by 80 samples (10 ms at a sampling frequency of 8000 Hz).
It then applies a fast Fourier transform and a 24-dimensional mel filter bank analysis to each frame, and calculates and outputs the vector O_t = {O_{t,0}, ..., O_{t,l}, ..., O_{t,23}}, whose elements are the 24-dimensional log mel spectrum (the speech feature of the frame at time t; l is the element index of the vector).
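The framing and feature extraction can be sketched in plain Python as follows. The frame length, frame shift, sampling frequency, and number of mel channels come from the text. The 256-point transform size, the use of the power spectrum, the triangular filter shape, the mel-scale formula, and the absence of a window function are assumptions. For self-containment, a direct DFT stands in for the fast Fourier transform; it produces the same spectrum, only more slowly.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz (from the text)
FRAME_LEN = 160      # samples, 20 ms (from the text)
FRAME_SHIFT = 80     # samples, 10 ms (from the text)
N_MEL = 24           # mel channels (from the text)
N_FFT = 256          # transform size (assumption)


def frames(signal):
    """Cut fixed-length frames, shifting the start point by FRAME_SHIFT."""
    last_start = len(signal) - FRAME_LEN
    return [signal[s:s + FRAME_LEN] for s in range(0, last_start + 1, FRAME_SHIFT)]


def power_spectrum(frame):
    """Power spectrum via a direct DFT on the zero-padded frame."""
    padded = list(frame) + [0.0] * (N_FFT - len(frame))
    spectrum = []
    for k in range(N_FFT // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / N_FFT)
                  for n, x in enumerate(padded))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank():
    """N_MEL triangular filters spaced evenly on the mel scale up to Nyquist."""
    mel_max = hz_to_mel(SAMPLE_RATE / 2)
    edges = [mel_to_hz(mel_max * i / (N_MEL + 1)) for i in range(N_MEL + 2)]
    bin_hz = [k * SAMPLE_RATE / N_FFT for k in range(N_FFT // 2 + 1)]
    bank = []
    for m in range(1, N_MEL + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_mel(frame, bank, floor=1e-10):
    """24-dimensional log mel spectrum O_t of one frame (floor avoids log(0))."""
    spec = power_spectrum(frame)
    return [math.log(max(sum(w * p for w, p in zip(row, spec)), floor))
            for row in bank]
```

A 400-sample input yields four frames starting at samples 0, 80, 160, and 240, each mapped to a 24-element vector.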

In the present invention, the speech signal (together with the silence signal and the clean speech signal) and the noise signal are defined as follows.
Even when recording is performed in a soundproof room or a similar place with no noise at all, a very faint, white-like noise is observed in the recorded signal. In the present invention, the signal observed in such an environment is defined as the silence signal.
The silence signal can thus also be regarded as a kind of noise, but it is noise generated by electrical factors such as the electric circuits of the recording equipment and the transmission system. In contrast, the sound of a moving car, the sound of wind, and the like are noise generated by acoustic factors, observed as sound waves travelling through the air. The present invention distinguishes noise due to electrical factors from noise due to acoustic factors and defines only the latter as the noise signal.

When an utterance is made in an environment in which the silence signal is observed, the uttered speech is observed superimposed on the silence signal. In the present invention, this superimposed signal is defined as the clean speech signal.
In an environment with no noise signal, clean speech signals are observed between stretches of continuous silence signal. In the present invention, the silence signal and the clean speech signal are collectively defined as the speech signal.
The noiseless model storage unit 20 stores, for each of the clean speech signal and the silence signal, a probability model prepared in advance that is based on a mixture of a plurality of normal distributions (GMM: Gaussian Mixture Model). A larger number of normal distributions improves the estimation accuracy, but because of the trade-off with processing speed a value between 2 and 512 is desirable in practice, and about 32 is most desirable.
Each normal distribution has as its parameters the mixture weight w_{j,k}, the mean μ^S_{j,k,l}, and the variance σ^S_{j,k,l}. Here, j is the type of GMM (j = 0: silence GMM, j = 1: clean speech GMM), and k is the index of each normal distribution.

The method of constructing a GMM is a known technique, and its description is omitted.
FIG. 2 shows an example of the functional configuration of the forward estimation unit 30.
The forward estimation unit 30 comprises a noise model parameter prediction unit 31, a noise model parameter update unit 32, a forward probability model parameter generation unit 33, a forward speech/non-speech output probability calculation unit 34, a forward first weighted average calculation unit 35, a forward noise state transition probability estimation unit 36, and a forward second weighted average calculation unit 37.
The noise model parameter prediction unit 31 receives the speech feature O_{t,l} and the forward second weighted average values ^N_{t-1,l} and ^σ^N_{t-1,l} at time t−1, and outputs the noise model parameter prediction values consisting of the mean N_{t,l}^pred and the variance σ^N_{t,l}^pred.

The specific processing is described following the procedure in FIG. 3.
First, if t < 10 in the frame determination process S301, the speech feature O_{t,l} is stored in the parameter storage unit 50 in the buffering process S302. If t = 10 in the frame determination process S301, O_{0,l}, ..., O_{9,l} are read from the parameter storage unit 50 in the read process S303, and the initial noise model parameters N_l^init and σ^N_l^init are estimated in the initial parameter estimation process S304 as follows.
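The expressions for the initial estimate are not reproduced in this text. As a minimal sketch, assuming that they are the per-dimension sample mean and (biased) sample variance of the ten buffered feature vectors O_{0,l}, ..., O_{9,l}, the initialization could look like this in Python; the function name and the choice of the biased variance are assumptions.

```python
def initial_noise_params(buffered):
    """Estimate N_l^init and sigma^N_l^init from the buffered feature vectors.

    buffered: list of feature vectors (one per frame, e.g. frames 0..9).
    Assumption: sample mean and biased sample variance per dimension l.
    """
    n = len(buffered)
    dims = len(buffered[0])
    mean = [sum(frame[l] for frame in buffered) / n for l in range(dims)]
    var = [sum((frame[l] - mean[l]) ** 2 for frame in buffered) / n
           for l in range(dims)]
    return mean, var
```

This treats the first frames as noise only, which is consistent with buffering them before any estimation starts.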

If t > 10 in the frame determination process S301, the forward second weighted average values ^N_{t-1,l} and ^σ^N_{t-1,l} of one time step earlier are read from the parameter storage unit 50 in the read process S305.
Although the processes S301 to S305 use t = 10 as the reference, this is given as the most desirable reference value; in practice it may be set as appropriate in the range t = 1 to 20.
When t ≥ 10, the parameter prediction process S306 is performed next. When t > 10, the noise model parameters at the current time are predicted from the estimation result at time t−1 by the following random walk process.

In the above equations, N_{t,l}^pred and σ^N_{t,l}^pred are the noise model parameter prediction values at time t, and ε is a constant representing the degree of change of the noise; in practice it is desirably set to a value between 0.0001 and 0.001, and about 0.001 is most desirable. When t = 10, the prediction is made as follows.
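The prediction equations themselves are not reproduced in this text. The following Python sketch shows the standard prediction step of a random walk model, under the assumption that the predicted mean equals the previous estimate and that ε is added to the previous variance. The function name and the default value of `epsilon` (taken from the most desirable value stated above) are illustrative.

```python
def predict_noise_params(prev_mean, prev_var, epsilon=0.001):
    """Random-walk prediction of the noise model parameters (assumed form).

    prev_mean, prev_var: estimates at time t-1, one value per dimension l.
    Returns the predicted mean N^pred and variance sigma^N,pred at time t.
    """
    pred_mean = list(prev_mean)                 # mean carried over unchanged
    pred_var = [v + epsilon for v in prev_var]  # uncertainty grows by epsilon
    return pred_mean, pred_var
```

For t = 10 the same function could be fed the initial estimates N_l^init and σ^N_l^init, although the text's own expression for that case is not shown.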

The noise model parameter update unit 32 receives the speech feature O_{t,l}, the noise model parameter prediction values N_{t,l}^pred and σ^N_{t,l}^pred, and the probability model parameters μ^S_{j,k,l} and σ^S_{j,k,l} of the clean speech signal and the silence signal, and outputs the noise model parameter update values consisting of the mean ^N_{t,j,k,l} and the variance ^σ^N_{t,j,k,l}.

The specific processing is described following the procedure in FIG. 3.
In the parameter update process S307, the clean speech signal and the silence signal each have a set of probability model parameters for every normal distribution, so the noise model parameter prediction values are updated with each of these parameter sets, in parallel. That is, as many update results are obtained as the total number of normal distributions contained in the probability models of the clean speech signal and the silence signal. The update is performed by the following equations.
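The update equations are not reproduced in this text, so the following Python sketch only illustrates the general shape of a parallel nonlinear (extended) Kalman update: one scalar update per normal distribution, all starting from the same prediction. The observation model O = S + log(1 + exp(N − S)) for log mel spectra and the use of the clean-model variance σ^S as observation noise are assumptions, as are all function names.

```python
import math


def ekf_update(obs, n_pred, var_pred, mu_s, var_s):
    """One scalar extended Kalman update for a single Gaussian and dimension.

    Assumed observation model: obs = mu_s + log(1 + exp(N - mu_s)) + noise,
    with noise variance var_s (an assumption, not taken from the text).
    """
    predicted_obs = mu_s + math.log1p(math.exp(n_pred - mu_s))
    jac = 1.0 / (1.0 + math.exp(mu_s - n_pred))   # d(predicted_obs)/dN
    gain = var_pred * jac / (jac * jac * var_pred + var_s)
    n_new = n_pred + gain * (obs - predicted_obs)
    var_new = (1.0 - gain * jac) * var_pred
    return n_new, var_new


def parallel_update(obs, n_pred, var_pred, gaussians):
    """Run the update once per Gaussian (mu_s, var_s); returns one result each."""
    return [ekf_update(obs, n_pred, var_pred, mu_s, var_s)
            for mu_s, var_s in gaussians]
```

Keeping one result per normal distribution, rather than a single estimate, is what later allows a poor update from one component to be down-weighted instead of propagating.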

The values ^N_{t,j,k,l} and ^σ^N_{t,j,k,l} obtained by Equations (11) and (12) are the noise model parameter update values.
The forward probability model parameter generation unit 33 receives the noise model parameter update values ^N_{t,j,k,l} and ^σ^N_{t,j,k,l} and the probability model parameters μ^S_{j,k,l} and σ^S_{j,k,l} of the clean speech signal and the silence signal, and outputs the forward probability model parameters consisting of the mean μ^O_{t,j,k,l} and the variance σ^O_{t,j,k,l}.

The specific processing is described following the procedure in FIG. 3.
In the probability model parameter generation process S308, the probability model parameters μ^O_{t,j,k,l} and σ^O_{t,j,k,l} of speech (noise + clean speech: j = 1) and non-speech (noise + silence: j = 0), adapted to the noise environment at time t, are generated by the following equations.

In the subsequent processing, the mixture weights of these models are taken to be the mixture weights w_{j,k} of the probability model parameters of the clean speech signal and the silence signal.
The forward speech/non-speech output probability calculation unit 34 receives the speech feature O_{t,l}, the speech and non-speech probability model parameters μ^O_{t,j,k,l} and σ^O_{t,j,k,l}, and the mixture weights w_{j,k} of the probability model parameters of the clean speech signal and the silence signal. It outputs the forward output probability b_j(O_t) of speech and non-speech at time t, and the forward normalized output probability w^OF_{j,k} obtained by decomposing b_j(O_t) by normal distribution k and normalizing.

The specific processing is described following the procedure in FIG. 3.
In the output probability calculation process S309, the forward output probability b_j(O_t) of speech and non-speech, that is, the output of the whole speech or non-speech probability model when the speech feature O_{t,l} is input to the speech and non-speech probability models generated in S308, is obtained by the following equation.

Here, w_{j,k} b_{j,k}(O_t) in the above equation is the output probability of each normal distribution k contained in the speech and non-speech probability models. It is normalized by the following equation so that the w_{j,k} b_{j,k}(O_t) sum to 1.

In the above equation, w^OF_{j,k} is the forward normalized output probability of each normal distribution k contained in the speech and non-speech probability models.
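The two quantities of S309 follow directly from the description: b_j(O_t) is the weighted sum over k of the component outputs w_{j,k} b_{j,k}(O_t), and w^OF_{j,k} is each term divided by that sum. A minimal Python sketch for one model j is given below. It assumes diagonal-covariance normal distributions (consistent with the per-dimension means and variances indexed by l), and the function names are illustrative. A practical implementation would likely work in the log domain to avoid underflow with 24-dimensional features.

```python
import math


def gaussian_pdf_diag(x, mean, var):
    """Density of a diagonal-covariance normal distribution at vector x."""
    log_p = 0.0
    for xi, m, v in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
    return math.exp(log_p)


def output_probability(x, weights, means, variances):
    """Return b_j(O_t) and the normalized per-component probabilities w^OF_{j,k}.

    weights[k], means[k], variances[k] describe normal distribution k of model j.
    Raises ZeroDivisionError if every component underflows to zero.
    """
    parts = [w * gaussian_pdf_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    b = sum(parts)
    normalized = [p / b for p in parts]
    return b, normalized
```

Calling it once with the speech model (j = 1) and once with the non-speech model (j = 0) gives the pair of output probabilities used in the later steps.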
The forward first weighted average calculation unit 35 receives the noise model parameter update values ^N_{t,j,k,l} and ^σ^N_{t,j,k,l} and the forward normalized output probability w^OF_{j,k}, and outputs the forward first weighted average values of the noise model parameters, consisting of the mean ^N_{t,j,l} and the variance ^σ^N_{t,j,l}.

具体的処理について、図３の処理手順に従い説明する。
第１加重平均処理Ｓ３１０では、パラメータ更新処理Ｓ３０７で得られた複数の雑音モデルパラメータ更新結果を出力確率算出処理Ｓ３０９で得られた前向き正規化出力確率ｗ^OF _j,kを用いて加重平均することにより、音声、非音声それぞれの確率モデルに対応する雑音パラメータ推定結果である前向き第１加重平均値^Ｎ_t,j,l、^σ^N _t、j、lを得る。加重平均は次式により行う。
Specific processing will be described in accordance with the processing procedure of FIG. 3.
In the first weighted average process S310, the plurality of noise model parameter update results obtained in the parameter update process S307 are weighted and averaged using the forward normalized output probability w^OF_{j,k} obtained in the output probability calculation process S309. This yields the forward first weighted average values ^N_{t,j,l}, ^σ^N_{t,j,l}, which are the noise parameter estimation results corresponding to the speech and non-speech probability models. The weighted average is calculated by the following formula.
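As a rough sketch of the first weighted average, the per-component estimates (index k) can be collapsed into one estimate per model j using the normalized output probabilities as weights. The mean follows directly from the description; treating the variance as a plain weighted average is an assumption made here for illustration, and `first_weighted_average` is a hypothetical name.

```python
def first_weighted_average(w_of, means_k, vars_k):
    """Combine per-component noise estimates into one estimate for model j.

    w_of    : normalized output probabilities, one per normal distribution k
    means_k : list over k of mean vectors (indexed by dimension l)
    vars_k  : list over k of variance vectors (indexed by dimension l)
    """
    dims = len(means_k[0])
    mean_j = [sum(w * m[l] for w, m in zip(w_of, means_k)) for l in range(dims)]
    # Assumption: variances are averaged with the same weights.
    var_j = [sum(w * v[l] for w, v in zip(w_of, vars_k)) for l in range(dims)]
    return mean_j, var_j


if __name__ == "__main__":
    print(first_weighted_average([0.25, 0.75],
                                 [[0.0, 4.0], [4.0, 8.0]],
                                 [[1.0, 1.0], [3.0, 1.0]]))
```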

前向き雑音状態遷移確率算出部３６は、前記雑音モデルパラメータ更新値^Ｎ_t,j,k,lと前記前向き正規化出力確率ｗ^OF _j,kと時刻ｔ−１における前向き第２加重平均値^Ｎ_t-1,lとが入力され、前向き雑音状態遷移確率ｃ_t,jを出力する。

The forward noise state transition probability calculation unit 36 receives the noise model parameter update value ^N_{t,j,k,l}, the forward normalized output probability w^OF_{j,k}, and the forward second weighted average value ^N_{t-1,l} at time t-1, and outputs the forward noise state transition probability c_{t,j}.

具体的処理について、図３の処理手順に従い説明する。
状態遷移確率算出処理Ｓ３１１では、まず、時刻ｔ−１における推定結果（前向き第２加重平均値^Ｎ_t-1,l）から時刻ｔにおける推定結果^Ｎ_t,j,lへの状態遷移確率ｄ_t,jを次式により算出する。

そして、ｄ_t,jの合計が１になるように次式で正規化を行い、前向き雑音状態遷移確率ｃ_t,jを得る。

前向き第２加重平均算出部３７は、前記前向き第１加重平均値^Ｎ_t,j,l、^σ^N _t、j、lと前記前向き出力確率ｂ_ｊ(Ｏ_ｔ）と前記前向き雑音状態遷移確率ｃ_t,jとが入力され、平均値^Ｎ_t,lと分散値^σ^N _t、lとからなる時刻ｔにおける前向き第２加重平均値を出力する。
Specific processing will be described in accordance with the processing procedure of FIG. 3.
In the state transition probability calculation process S311, first, the state transition probability d_{t,j} from the estimation result at time t-1 (the forward second weighted average value ^N_{t-1,l}) to the estimation result ^N_{t,j,l} at time t is calculated by the following equation.

Then, normalization is performed by the following equation so that the sum of d_{t,j} becomes 1, and the forward noise state transition probability c_{t,j} is obtained.

The forward second weighted average calculation unit 37 receives the forward first weighted average values ^N_{t,j,l}, ^σ^N_{t,j,l}, the forward output probability b_j(O_t), and the forward noise state transition probability c_{t,j}, and outputs the forward second weighted average value at time t, consisting of the mean ^N_{t,l} and the variance ^σ^N_{t,l}.

具体的処理について、図３の処理手順に従い説明する。
第２加重平均処理Ｓ３１２では、第１加重平均処理Ｓ３１０で得られた前向き第１加重平均値^Ｎ_t,j,l、^σ^N _t、j、lを、出力確率算出処理Ｓ３０９で得られた前向き出力確率ｂ_ｊ(Ｏ_ｔ）、及び状態遷移確率算出処理Ｓ３１１で得られた前向き雑音状態遷移確率ｃ_t,jとを用いて加重平均することにより、時刻ｔにおける雑音モデルパラメータ推定結果である前向き第２加重平均値^Ｎ_t,l、^σ^N _t、lを算出し、次の時刻の雑音パラメータの推定に利用する。加重平均は次式により行う。
Specific processing will be described in accordance with the processing procedure of FIG. 3.
In the second weighted average process S312, the forward first weighted average values ^N_{t,j,l}, ^σ^N_{t,j,l} obtained in the first weighted average process S310 are weighted and averaged using the forward output probability b_j(O_t) obtained in the output probability calculation process S309 and the forward noise state transition probability c_{t,j} obtained in the state transition probability calculation process S311. This yields the forward second weighted average values ^N_{t,l}, ^σ^N_{t,l}, which are the noise model parameter estimation result at time t and are used to estimate the noise parameters at the next time. The weighted average is calculated by the following formula.
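One plausible reading of the second weighted average is sketched below. It assumes that the weight of each model j (0 = non-speech, 1 = speech) is proportional to the product c_{t,j} * b_j(O_t), normalized over j; this weighting is an assumption for illustration only, since the patent's formula is not reproduced in this text, and `second_weighted_average` is a hypothetical name.

```python
def second_weighted_average(b, c, means_j, vars_j):
    """Merge the speech and non-speech noise estimates into one estimate at time t.

    b       : output probabilities b_j(O_t), indexed by j
    c       : noise state transition probabilities c_{t,j}, indexed by j
    means_j : list over j of mean vectors (indexed by dimension l)
    vars_j  : list over j of variance vectors (indexed by dimension l)
    """
    raw = [bj * cj for bj, cj in zip(b, c)]   # assumed weight: c_{t,j} * b_j(O_t)
    total = sum(raw)
    weights = [r / total for r in raw]
    dims = len(means_j[0])
    mean = [sum(w * m[l] for w, m in zip(weights, means_j)) for l in range(dims)]
    var = [sum(w * v[l] for w, v in zip(weights, vars_j)) for l in range(dims)]
    return mean, var


if __name__ == "__main__":
    print(second_weighted_average([0.1, 0.3], [0.5, 0.5],
                                  [[0.0], [4.0]], [[1.0], [3.0]]))
```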

最後にＳ３１３のバッファリング処理で、Ｓ３０１〜３１２の処理により得られた当該時刻ｔにおける音声特徴量Ｏ_t,l、雑音モデルパラメータ予測値Ｎ_t,l ^pred、σ^N _t,l ^pred、雑音モデルパラメータ更新値^Ｎ_t,j,k,l、^σ^N _{t、j、k、l}、及び前向き第２加重平均値^Ｎ_t,l、^σ^N _t、lがパラメータ記憶部５０に記憶される。
式(3)(4)の予測処理、及び式(7)〜(12)の更新処理は、従来の非線形カルマンフィルタと計算式の構成自体は同様であるが、本発明ではクリーン音声信号、無音信号それぞれのＧＭＭに含まれる複数の正規分布ごとに複数のフィルタを構成し、これらを利用することにより得られる複数の推定結果を加重平均する（並列非線形カルマンフィルタ）。このような処理を行うことによって、より正確な雑音モデルのパラメータ推定が実現される。

Finally, in the buffering process S313, the speech feature quantity O_{t,l}, the noise model parameter prediction values N_{t,l}^pred, σ^N_{t,l}^pred, the noise model parameter update values ^N_{t,j,k,l}, ^σ^N_{t,j,k,l}, and the forward second weighted average values ^N_{t,l}, ^σ^N_{t,l} obtained at time t by processes S301 to S312 are stored in the parameter storage unit 50.
The prediction processing of Equations (3) and (4) and the update processing of Equations (7) to (12) have the same formula structure as a conventional nonlinear Kalman filter. In the present invention, however, a separate filter is configured for each of the normal distributions included in the GMMs of the clean speech signal and the silence signal, and the estimation results obtained from these filters are weighted and averaged (parallel nonlinear Kalman filter). This processing enables more accurate estimation of the noise model parameters.

図４は後向き推定部４０の機能構成例である。
後向き推定部４０は、雑音モデルパラメータ再推定部４２、後向き確率モデルパラメータ生成部４３、後向き音声／非音声出力確率算出部４４、後向き第１加重平均算出部４５、後向き雑音状態遷移確率推定部４６、後向き第２加重平均算出部４７から構成される。
雑音モデルパラメータ再推定部４２は、パラメータ記憶部５０に記憶された時刻ｓにおける雑音モデルパラメータ予測値Ｎ_s,l ^pred、σ^N _s,l ^pred、時刻ｓ−１における雑音モデルパラメータ更新値^Ｎ_s-1,j,k,l、^σ^N _{s-1、j、k、l}及び時刻ｓにおける雑音モデルパラメータ再推定値〜Ｎ_s,j,k,l、〜σ^N _{s、j、k、l}とが入力され、平均値〜Ｎ_s-1,j,k,lと分散値〜σ^N _{s-1、j、k、l}とからなる時刻ｓ−１における雑音モデルパラメータ再推定値を出力する。 FIG. 4 is a functional configuration example of the backward estimation unit 40.
The backward estimation unit 40 comprises a noise model parameter re-estimation unit 42, a backward probability model parameter generation unit 43, a backward speech/non-speech output probability calculation unit 44, a backward first weighted average calculation unit 45, a backward noise state transition probability estimation unit 46, and a backward second weighted average calculation unit 47.
The noise model parameter re-estimation unit 42 receives the noise model parameter prediction values N_{s,l}^pred, σ^N_{s,l}^pred at time s stored in the parameter storage unit 50, the noise model parameter update values ^N_{s-1,j,k,l}, ^σ^N_{s-1,j,k,l} at time s-1, and the noise model parameter re-estimation values ~N_{s,j,k,l}, ~σ^N_{s,j,k,l} at time s, and outputs the noise model parameter re-estimation values at time s-1, consisting of the mean ~N_{s-1,j,k,l} and the variance ~σ^N_{s-1,j,k,l}.

具体的処理について、図５の処理手順に従い説明する。
まず、フレーム判定処理Ｓ４０１においてｔ＜１０であれば、変数設定処理Ｓ４０２において変数ｔｂを０に設定して処理を終了する。ｔ≧１０の場合、変数判定処理Ｓ４０３においてｔｂが後向き推定に要するフレーム数Ｂ未満であれば変数書替処理Ｓ４０４にてｔｂの値を１加算して処理を終了し、ｔｂの値がＢ以上であれば変数設定処理Ｓ４０５において後向き推定用カウンタ値ｂｗにＢを設定する。Ｂは大きいほど推定精度向上に寄与する反面、処理速度を損なうため、実効的には１〜１０の間の値に設定するのが望ましく、１０程度が最も望ましい。
Specific processing will be described in accordance with the processing procedure of FIG. 5.
First, if t < 10 in the frame determination process S401, the variable tb is set to 0 in the variable setting process S402 and the process ends. If t ≥ 10 and tb is less than the number of frames B required for backward estimation in the variable determination process S403, tb is incremented by 1 in the variable rewriting process S404 and the process ends. If tb is B or more, the backward estimation counter bw is set to B in the variable setting process S405. A larger B improves estimation accuracy but reduces processing speed, so in practice B is preferably set between 1 and 10, and most preferably to about 10.
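The counter logic of S401 to S405, together with the time index s = t-B+bw used by the subsequent loop, can be sketched as a small scheduler. The class name `BackwardScheduler` and its method are hypothetical; the warm-up length of 10 frames and the counters tb and bw follow the description above.

```python
class BackwardScheduler:
    """Decides, for each incoming frame t, which times s the backward pass visits."""

    def __init__(self, B=10, warmup=10):
        self.B = B              # number of frames used for backward estimation
        self.warmup = warmup    # frames before backward estimation is considered
        self.tb = 0             # frames accumulated since the warm-up ended

    def times_to_smooth(self, t):
        """Return the times s = t - B + bw for bw = B, B-1, ..., 1 (or [] if not ready)."""
        if t < self.warmup:          # S401 -> S402
            self.tb = 0
            return []
        if self.tb < self.B:         # S403 -> S404
            self.tb += 1
            return []
        # S405: bw starts at B and is decremented until it reaches 0.
        return [t - self.B + bw for bw in range(self.B, 0, -1)]


if __name__ == "__main__":
    sched = BackwardScheduler(B=3)
    for t in range(15):
        print(t, sched.times_to_smooth(t))
```

With B = 3, the first non-empty schedule appears at t = 13 and visits s = 13, 12, 11, so the parameters at s-1 = 12, 11, 10 are re-estimated.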

次に読み出し処理Ｓ４０６において、パラメータ記憶部５０から前向き推定部３０において算出された時刻ｓ＝ｔ−Ｂ＋ｂｗにおける雑音モデルパラメータ予測値Ｎ_s,l ^pred、σ^N _s,l ^pred、時刻ｓ−１における音響特徴量Ｏ_s-1,l、時刻ｓ−１における雑音モデルパラメータ更新値^Ｎ_s-1,j,k,l、^σ^N _{s-1、j、k、l}、及び後向き推定部４０において算出された時刻ｓ＝ｔ−Ｂ＋ｂｗにおける雑音モデルパラメータ再推定値〜Ｎ_s,j,k,l、〜σ^N _{s、j、k、l}を読み出す。なお、ｂｗ＝Ｂ、すなわち時刻ｓ＝ｔの場合は、^Ｎ_t,j,k,l、^σ^N _{t、j、k、l}、^Ｎ_t,l、^σ^N _t、lを読み出し、〜Ｎ_s,j,k,l＝^Ｎ_t,j,k,l、〜σ^N _{s、j、k、l}＝^σ^N _{t、j、k、l}、〜Ｎ_s,l＝^Ｎ_t,l、〜σ^N _s、l＝^σ^N _t、lとする。
そして、パラメータ平滑処理Ｓ４０７において、後向き推定を用いて次式によるパラメータの再推定（平滑化）を行う。
Next, in the reading process S406, the following are read from the parameter storage unit 50: the noise model parameter prediction values N_{s,l}^pred, σ^N_{s,l}^pred at time s = t-B+bw calculated by the forward estimation unit 30; the acoustic feature quantity O_{s-1,l} at time s-1; the noise model parameter update values ^N_{s-1,j,k,l}, ^σ^N_{s-1,j,k,l} at time s-1; and the noise model parameter re-estimation values ~N_{s,j,k,l}, ~σ^N_{s,j,k,l} at time s = t-B+bw calculated by the backward estimation unit 40. When bw = B, that is, when s = t, the values ^N_{t,j,k,l}, ^σ^N_{t,j,k,l}, ^N_{t,l}, ^σ^N_{t,l} are read instead, and ~N_{s,j,k,l} = ^N_{t,j,k,l}, ~σ^N_{s,j,k,l} = ^σ^N_{t,j,k,l}, ~N_{s,l} = ^N_{t,l}, and ~σ^N_{s,l} = ^σ^N_{t,l} are set.
Then, in parameter smoothing processing S407, parameters are re-estimated (smoothed) by the following equation using backward estimation.

式(27)と式(28)で求められた〜Ｎ_s-1,j,k,lと〜σ^N _{s-1、j、k、l}とが雑音モデルパラメータ再推定値である。なお、〜Ｎ_s-1,j,k,lと〜σ^N _{s-1、j、k、l}は次回の平滑処理のためにパラメータ記憶部５０に記憶する。
後向き確率モデルパラメータ生成部４３は、前記雑音モデルパラメータ再推定値〜Ｎ_s-1,j,k,l、〜σ^N _{s-1、j、k、l}と前記クリーン音声信号、無音信号それぞれの確率モデルパラメータμ^S _j,k,l、σ^S _j,k,lとが入力され、平均値μ^O _s-1,j,k,lと分散値σ^O _{s-1、j、k、l}とからなる後向き確率モデルパラメータを出力する。

The values ~N_{s-1,j,k,l} and ~σ^N_{s-1,j,k,l} obtained by Equations (27) and (28) are the noise model parameter re-estimation values. They are stored in the parameter storage unit 50 for the next smoothing process.
The backward probability model parameter generation unit 43 receives the noise model parameter re-estimation values ~N_{s-1,j,k,l}, ~σ^N_{s-1,j,k,l} and the probability model parameters μ^S_{j,k,l}, σ^S_{j,k,l} of the clean speech signal and the silence signal, and outputs the backward probability model parameters consisting of the mean μ^O_{s-1,j,k,l} and the variance σ^O_{s-1,j,k,l}.
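For orientation, one backward smoothing step of a conventional scalar Kalman (Rauch-Tung-Striebel) smoother is sketched below. It assumes random-walk noise dynamics (state transition coefficient 1) and is applied independently to each component (j, k, l); it is the textbook form, not a reproduction of the patent's Equations (26) to (28), and `smooth_step` is a hypothetical name.

```python
def smooth_step(filt_mean, filt_var, pred_mean, pred_var,
                next_smooth_mean, next_smooth_var):
    """Re-estimate the noise parameter at time s-1 from the smoothed value at time s.

    filt_mean, filt_var               : forward update values at time s-1
    pred_mean, pred_var               : forward prediction values at time s
    next_smooth_mean, next_smooth_var : re-estimated (smoothed) values at time s
    """
    gain = filt_var / pred_var                      # smoother gain (assumes F = 1)
    mean = filt_mean + gain * (next_smooth_mean - pred_mean)
    var = filt_var + gain * gain * (next_smooth_var - pred_var)
    return mean, var


if __name__ == "__main__":
    print(smooth_step(1.0, 0.5, 1.0, 1.0, 2.0, 0.8))
```

When the smoothed value at time s equals the forward prediction, the step leaves the forward update at s-1 unchanged, which is the expected behavior of a smoother.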

具体的処理について、図５の処理手順に従い説明する。
確率モデルパラメータ生成処理Ｓ４０８では、時刻ｓ−１における雑音環境に適合した、音声（雑音＋クリーン音声：ｊ＝１）、非音声（雑音＋無音：ｊ＝０）それぞれの確率モデルパラメータμ^O _s-1,j,k,l、σ^O _{s-1、j、k、l}を次式により生成する。

なお、ここでの混合重みについても前記クリーン音声信号、無音信号それぞれの確率モデルパラメータにおける混合重みｗ_j,kであるものとして以降の処理を行う。
後向き音声／非音声出力確率算出部４４は、前記音声特徴量Ｏ_s-1,lと前記音声、非音声それぞれの確率モデルパラメータμ^O _s-1,j,k,l、σ^O _{s-1、j、k、l}と前記クリーン音声信号、無音信号それぞれの確率モデルパラメータにおける混合重みｗ_j,kとが入力され、時刻ｓ−１における音声・非音声の出力確率ｂ_ｊ(Ｏ_s-1）と、この出力確率ｂ_ｊ(Ｏ_s-1）を前記正規分布ｋごとに分解して正規化した後向き正規化出力確率ｗ^OB _j,kとを出力する。
Specific processing will be described in accordance with the processing procedure of FIG. 5.
In the probability model parameter generation process S408, the probability model parameters μ^O_{s-1,j,k,l}, σ^O_{s-1,j,k,l} of speech (noise + clean speech: j = 1) and non-speech (noise + silence: j = 0), adapted to the noise environment at time s-1, are generated by the following equations.

The subsequent processing assumes that the mixture weights here are also the mixture weights w_{j,k} of the probability model parameters of the clean speech signal and the silence signal.
The backward speech/non-speech output probability calculation unit 44 receives the speech feature quantity O_{s-1,l}, the speech and non-speech probability model parameters μ^O_{s-1,j,k,l}, σ^O_{s-1,j,k,l}, and the mixture weights w_{j,k} of the probability model parameters of the clean speech signal and the silence signal. It outputs the speech/non-speech output probability b_j(O_{s-1}) at time s-1 and the backward normalized output probability w^OB_{j,k}, obtained by decomposing this output probability for each normal distribution k and normalizing it.

具体的処理について、図５の処理手順に従い説明する。
出力確率算出処理Ｓ４０９では、前記音声特徴量Ｏ_s-1,lをＳ４０８の処理で生成された前記音声、非音声それぞれの確率モデルに入力した際の、前記音声、非音声それぞれの確率モデル全体における音声、非音声の出力確率ｂ_ｊ(Ｏ_s-1）を次式により求める。
Specific processing will be described in accordance with the processing procedure of FIG. 5.
In the output probability calculation process S409, the speech feature quantity O_{s-1,l} is input to each of the speech and non-speech probability models generated in process S408, and the output probability b_j(O_{s-1}) of speech and non-speech over each entire probability model is obtained by the following equation.

また、上式のｗ_j,kｂ_j,k(Ｏ_s-1)は、音声、非音声それぞれの確率モデルに含まれる各正規分布ｋの出力確率であり、ｗ_j,kｂ_j,k(Ｏ_s-1)の合計が１になるよう次式で正規化を行う。

上式のｗ^OB _j,kが、音声、非音声それぞれの確率モデルに含まれる各正規分布ｋの後向き正規化出力確率である。
後向き第１加重平均算出部４５は、前記雑音モデルパラメータ再推定値〜Ｎ_s-1,j,k,l、〜σ^N _{s-1、j、k、l}と前記後向き正規化出力確率ｗ^OB _j,kとが入力され、平均値〜Ｎ_s-1,j,lと分散値〜σ^N _s-1、j、lとからなる雑音モデルパラメータの後向き第１加重平均値を出力する。

Here, w_{j,k} b_{j,k}(O_{s-1}) in the above equation is the output probability of each normal distribution k included in the speech and non-speech probability models, and it is normalized by the following equation so that the sum of w_{j,k} b_{j,k}(O_{s-1}) becomes 1.

In the above equation, w^OB_{j,k} is the backward normalized output probability of each normal distribution k included in the speech and non-speech probability models.
The backward first weighted average calculation unit 45 receives the noise model parameter re-estimation values ~N_{s-1,j,k,l}, ~σ^N_{s-1,j,k,l} and the backward normalized output probability w^OB_{j,k}, and outputs the backward first weighted average value of the noise model parameters, consisting of the mean ~N_{s-1,j,l} and the variance ~σ^N_{s-1,j,l}.

具体的処理について、図５の処理手順に従い説明する。
第１加重平均処理Ｓ４１０では、パラメータ平滑処理Ｓ４０７で得られた複数の雑音モデルパラメータ更新結果を出力確率算出処理Ｓ４０９で得られた後向き正規化出力確率ｗ^OB _j,kを用いて加重平均することにより、音声、非音声それぞれの確率モデルに対応する雑音パラメータ推定結果である後向き第１加重平均値〜Ｎ_s-1,j,l、〜σ^N _s-1、j、lを得る。加重平均は次式により行う。

後向き雑音状態遷移確率算出部４６は、時刻ｓにおける後向き第２加重平均値〜Ｎ_s,lと時刻ｓ−１における前記雑音モデルパラメータ再推定値〜Ｎ_s-1,j,k,lと時刻ｓ−１における後向き第1加重平均値〜Ｎ_s-1,j,lと前記後向き正規化出力確率ｗ^OB _j,kとが入力され、雑音状態遷移確率ｃ_s,jを出力する。
Specific processing will be described in accordance with the processing procedure of FIG. 5.
In the first weighted average process S410, the plurality of noise model parameter re-estimation results obtained in the parameter smoothing process S407 are weighted and averaged using the backward normalized output probability w^OB_{j,k} obtained in the output probability calculation process S409. This yields the backward first weighted average values ~N_{s-1,j,l}, ~σ^N_{s-1,j,l}, which are the noise parameter estimation results corresponding to the speech and non-speech probability models. The weighted average is calculated by the following formula.

The backward noise state transition probability calculation unit 46 receives the backward second weighted average value ~N_{s,l} at time s, the noise model parameter re-estimation value ~N_{s-1,j,k,l} at time s-1, the backward first weighted average value ~N_{s-1,j,l} at time s-1, and the backward normalized output probability w^OB_{j,k}, and outputs the noise state transition probability c_{s,j}.

具体的処理について、図５の処理手順に従い説明する。
状態遷移確率算出処理Ｓ４１１では、まず、時刻ｓ−１における推定結果（後向き第１加重平均値〜Ｎ_s-1,j,l）から時刻ｓにおける推定結果〜Ｎ_s,lへの状態遷移確率ｄ_s,jを次式により算出する。

そして、ｄ_s,jの合計が１になるように次式で正規化を行い、雑音状態遷移確率ｃ_s,jを得る。

後向き第２加重平均算出部４７は、前記後向き第１加重平均値〜Ｎ_s-1,j,l、〜σ^N _s-1、j、lと前記出力確率ｂ_ｊ(Ｏ_s-1）と前記雑音状態遷移確率ｃ_s,jとが入力され、平均値〜Ｎ_s-1,lと分散値〜σ^N _s-1、lとからなる時刻ｓ−１における後向き第２加重平均値を出力する。
Specific processing will be described in accordance with the processing procedure of FIG. 5.
In the state transition probability calculation process S411, first, the state transition probability d_{s,j} from the estimation result at time s-1 (the backward first weighted average value ~N_{s-1,j,l}) to the estimation result ~N_{s,l} at time s is calculated by the following equation.

Then, normalization is performed by the following equation so that the sum of d_{s,j} becomes 1, and the noise state transition probability c_{s,j} is obtained.

The backward second weighted average calculation unit 47 receives the backward first weighted average values ~N_{s-1,j,l}, ~σ^N_{s-1,j,l}, the output probability b_j(O_{s-1}), and the noise state transition probability c_{s,j}, and outputs the backward second weighted average value at time s-1, consisting of the mean ~N_{s-1,l} and the variance ~σ^N_{s-1,l}.

具体的処理について、図５の処理手順に従い説明する。
第２加重平均処理Ｓ４１２では、第１加重平均処理Ｓ４１０で得られた後向き第１加重平均値^Ｎ_s-1,j,l、^σ^N _s-1、j、lを、出力確率算出処理Ｓ４０９で得られた出力確率ｂ_ｊ(Ｏ_s-1）、及び状態遷移確率算出処理Ｓ４１１で得られた雑音状態遷移確率ｃ_s,jとを用いて加重平均することにより、時刻ｓ−１における雑音モデルパラメータ推定結果である後向き第２加重平均値〜Ｎ_s-1,l、〜σ^N _s-1、lを算出し、次の時刻の雑音パラメータの推定に利用する。加重平均は次式により行う。
Specific processing will be described in accordance with the processing procedure of FIG. 5.
In the second weighted average process S412, the backward first weighted average values ~N_{s-1,j,l}, ~σ^N_{s-1,j,l} obtained in the first weighted average process S410 are weighted and averaged using the output probability b_j(O_{s-1}) obtained in the output probability calculation process S409 and the noise state transition probability c_{s,j} obtained in the state transition probability calculation process S411. This yields the backward second weighted average values ~N_{s-1,l}, ~σ^N_{s-1,l}, which are the noise model parameter estimation result at time s-1 and are used to estimate the noise parameters at the next time. The weighted average is calculated by the following formula.

そして、変数書替処理Ｓ４１３において、ｂｗの値を１減算（すなわち時刻ｓの値を１減算）し、変数判定処理Ｓ４１４において、ｂｗ＞０であれば処理Ｓ４０６に戻り、そうでなければ処理を終了する。
後向き推定部４０の各処理で得られた結果のうち、出力確率算出処理Ｓ４０９で得られた出力確率ｂ_ｊ(Ｏ_s-1）と状態遷移確率算出処理Ｓ４１１で得られた雑音状態遷移確率ｃ_s,jとが、状態確率比算出部６０における処理に使用される。

Then, in the variable rewriting process S413, bw is decremented by 1 (that is, time s is decremented by 1). In the variable determination process S414, if bw > 0 the process returns to S406; otherwise the process ends.
Of the results obtained by the backward estimation unit 40, the output probability b_j(O_{s-1}) obtained in the output probability calculation process S409 and the noise state transition probability c_{s,j} obtained in the state transition probability calculation process S411 are used by the state probability ratio calculation unit 60.

式 (26)〜(28)の平滑処理は、従来のカルマンスムーザと計算式の構成自体は同様であるが、本発明ではクリーン音声信号、無音信号それぞれのＧＭＭに含まれる複数の正規分布ごとに複数のフィルタを構成し、これらを利用することにより得られる複数の推定結果を加重平均する（並列カルマンスムーザ）。このような処理を行うことによって、より正確な雑音モデルのパラメータ推定が実現される。
パラメータ記憶部５０は、前向き推定部３０と後向き推定部４０における処理の過程で得られた計算結果を記憶する。
The smoothing processing of Equations (26) to (28) has the same formula structure as a conventional Kalman smoother. In the present invention, however, a separate filter is configured for each of the normal distributions included in the GMMs of the clean speech signal and the silence signal, and the estimation results obtained from these filters are weighted and averaged (parallel Kalman smoother). This processing enables more accurate estimation of the noise model parameters.
The parameter storage unit 50 stores calculation results obtained in the course of processing in the forward estimation unit 30 and the backward estimation unit 40.

図６は状態確率比算出部６０の機能構成例である。
状態確率比算出部６０は、音声状態遷移確率テーブル６１、前向き確率算出部６２、後向き確率算出部６３、確率比算出用バッファ６４、確率比算出部６５から構成される。
音声状態遷移確率テーブル６１は、有限状態機械により表現された音声／非音声の状態遷移モデルにおいて適宜設定した音声状態遷移確率ａ_i,jを記憶する。 FIG. 6 is a functional configuration example of the state probability ratio calculation unit 60.
The state probability ratio calculation unit 60 includes a speech state transition probability table 61, a forward probability calculation unit 62, a backward probability calculation unit 63, a probability ratio calculation buffer 64, and a probability ratio calculation unit 65.
The speech state transition probability table 61 stores speech state transition probabilities a _{i, j} set as appropriate in a speech / non-speech state transition model expressed by a finite state machine.

図７は、音声状態／非音声状態の状態遷移モデルであり、非音声状態Ｈ_０と音声状態Ｈ_１と各状態への音声状態遷移確率ａ_i,jとを含む（ｉは状態遷移元の状態番号、ｊは状態遷移先の状態番号で、状態番号０は非音声状態を、状態番号１は音声状態を示す）。ａ_i,jは音声状態確率及び非音声状態確率を求める上での基準となる値で、定数を設定しても入力信号の特徴に応じて適応的に決定しても構わないが、本発明においては定数を設定し、これを音声状態遷移確率テーブル６１に記憶して音声状態確率及び非音声状態確率の計算に使用する。この。設定するａ_i,jはａ_i,0＋ａ_i,1＝１を満たす値で、ａ_0,0及びａ_1,1を0.5〜0.9の範囲で、ａ_0,1及びａ_1,0を0.5〜0.1の範囲で設定するのが望ましく、ａ_0,0＝0.8、ａ_0,1＝0.2、ａ_1,0＝0.1、ａ_1,1＝0.9程度が最も望ましい。
前向き確率算出部６２は、前記出力確率ｂ_ｊ(Ｏ_s-1）と前記雑音状態遷移確率ｃ_s,jと、音声状態遷移確率ａ_i,jと、時刻ｓ−１の前向き確率α_s-1、jとが入力され、時刻ｓの前向き確率α_s、jを出力する。
FIG. 7 shows a state transition model of the speech state and the non-speech state, which includes the non-speech state H_0, the speech state H_1, and the speech state transition probabilities a_{i,j} to each state (i is the state number of the transition source and j is the state number of the transition destination; state number 0 denotes the non-speech state and state number 1 the speech state). a_{i,j} is a reference value for obtaining the speech state probability and the non-speech state probability. It may be set to a constant or determined adaptively according to the characteristics of the input signal; in the present invention a constant is set, stored in the speech state transition probability table 61, and used to calculate the speech state probability and the non-speech state probability. The values a_{i,j} satisfy a_{i,0} + a_{i,1} = 1. It is desirable to set a_{0,0} and a_{1,1} in the range 0.5 to 0.9 and a_{0,1} and a_{1,0} in the range 0.5 to 0.1, and most desirable to use approximately a_{0,0} = 0.8, a_{0,1} = 0.2, a_{1,0} = 0.1, and a_{1,1} = 0.9.
The forward probability calculation unit 62 receives the output probability b_j(O_{s-1}), the noise state transition probability c_{s,j}, the speech state transition probability a_{i,j}, and the forward probability α_{s-1,j} at time s-1, and outputs the forward probability α_{s,j} at time s.

具体的処理について、図８の処理手順に従い説明する。
音声状態確率及び非音声状態確率の算出は、まず前向き確率α_s、jを求め、続いて後向き確率β_s、jを求めて、それらの積をとることによって求める。そして、現在の時刻ｓの後向き確率β_s、jは、前記後向き推定部４０における計算と同様にＢフレーム未来の時刻ｓ＋Ｂから遡って算出する。
そこで、変数判定処理Ｓ６０１においては、例えばｔ＜１０＋Ｂ、すなわちｓ＜１０の場合は初期値設定処理Ｓ６０２において前向き確率α_s、jを以下のように設定し、それらをバッファリング処理Ｓ６０３において確率比算出用バッファ６４に記憶して処理を終了する。
α_s,0＝１ (42)
α_s,1＝０ (43)
ｔ＜１０＋Ｂでない場合、すなわちｓ≧１０の場合は、読み出し処理Ｓ６０４において、確率比算出用バッファ６４から時刻ｓ−１の前向き確率確率α_s-1、jを読み出す。
Specific processing will be described in accordance with the processing procedure of FIG. 8.
The speech state probability and the non-speech state probability are calculated by first obtaining the forward probability α_{s,j}, then obtaining the backward probability β_{s,j}, and taking their product. The backward probability β_{s,j} at the current time s is calculated backward from time s+B, B frames in the future, in the same manner as the calculation in the backward estimation unit 40.
Therefore, in the variable determination process S601, if for example t < 10+B, that is, s < 10, the forward probability α_{s,j} is set as follows in the initial value setting process S602 and stored in the probability ratio calculation buffer 64 in the buffering process S603, and the process ends.
α_{s,0} = 1 (42)
α_{s,1} = 0 (43)
Otherwise, that is, when s ≥ 10, the forward probability α_{s-1,j} at time s-1 is read from the probability ratio calculation buffer 64 in the reading process S604.

次に、前向き確率算出処理Ｓ６０５において音声状態遷移確率テーブル６１から音声状態確率ａ_i,jを読み出し、これと時刻ｓ−１の前記出力確率ｂ_ｊ(Ｏ_s-1）と時刻ｓの前記雑音状態遷移確率ｃ_s,jと時刻ｓ−１の前記前向き確率α_s-1、jとから次式により時刻ｓの前向き確率α_s、jを算出し、これらをバッファリング処理６０６において確率比算出用バッファ６４に記憶する。

後向き確率算出部６３は、時刻ｓ＋１の前記出力確率ｂ_ｊ(Ｏ_s+1）と時刻ｓ＋１の前記雑音状態遷移確率ｃ_s+1,jと、音声状態遷移確率ａ_i,jと、時刻ｓ＋１の後向き確率β_s+1、iとが入力され、時刻ｓの後向き確率β_s、iを出力する。
Next, in the forward probability calculation process S605, the speech state transition probability a_{i,j} is read from the speech state transition probability table 61. From this value, the output probability b_j(O_{s-1}) at time s-1, the noise state transition probability c_{s,j} at time s, and the forward probability α_{s-1,j} at time s-1, the forward probability α_{s,j} at time s is calculated by the following equation and stored in the probability ratio calculation buffer 64 in the buffering process S606.

The backward probability calculation unit 63 receives the output probability b_j(O_{s+1}) at time s+1, the noise state transition probability c_{s+1,j} at time s+1, the speech state transition probability a_{i,j}, and the backward probability β_{s+1,i} at time s+1, and outputs the backward probability β_{s,i} at time s.

具体的処理について、図８の処理手順に従い説明する。
まず、変数設定処理Ｓ６０７において、後向き確率算出用のカウンタｂｗの値をＢに設定する。
次に、後向き確率算出処理Ｓ６０８において音声状態遷移確率テーブル６１から音声状態確率ａ_i,jを読み出し、これと時刻ｓ＋ｂｗの前記出力確率ｂ_ｊ(Ｏ_s+bw）と時刻ｓの前記雑音状態遷移確率ｃ_s+bw,jと時刻ｓ＋ｂｗの前記後向き確率β_bw、jとから時刻ｓ＋ｂｗ−１の後向き確率β_s+bw-1、iを次式により算出する。なお、ｂｗ＝Ｂの場合は初期値β_s+B,i＝１を与える。

そして、変数書替処理Ｓ６０９においてｂｗの値を１減算し、変数判定処理Ｓ６１０においてｂｗ＞０であれば処理Ｓ６０７に戻り、そうでなければこの時点で時刻ｓにおける後向き確率β_s,iが得られるので、これをバッファリング処理Ｓ６１１において確率比算出用バッファ６４に記憶し、確率比算出処理Ｓ６１２に移行する。
確率比算出用バッファ６４は、前向き確率算出部６２で算出された前向き確率α_s、jと、後向き確率算出部６３で算出されたと後向き確率β_s,iを記憶する。
Specific processing will be described in accordance with the processing procedure of FIG. 8.
First, in the variable setting process S607, the value of the counter bw for calculating the backward probability is set to B.
Next, in the backward probability calculation process S608, the speech state transition probability a_{i,j} is read from the speech state transition probability table 61. From this value, the output probability b_j(O_{s+bw}) at time s+bw, the noise state transition probability c_{s+bw,j}, and the backward probability β_{s+bw,j} at time s+bw, the backward probability β_{s+bw-1,i} at time s+bw-1 is calculated by the following equation. When bw = B, the initial value β_{s+B,i} = 1 is given.

Then, bw is decremented by 1 in the variable rewriting process S609. In the variable determination process S610, if bw > 0 the process returns to S607; otherwise the backward probability β_{s,i} at time s has been obtained at this point, so it is stored in the probability ratio calculation buffer 64 in the buffering process S611 and the process proceeds to the probability ratio calculation process S612.
The probability ratio calculation buffer 64 stores the forward probability α_{s,j} calculated by the forward probability calculation unit 62 and the backward probability β_{s,i} calculated by the backward probability calculation unit 63.

確率比算出部６５は、前記前向き確率α_s、jと前記後向き確率β_s,iとが入力され、図８の確率比算出処理Ｓ６１２において、非音声状態の確率に対する音声状態の確率の比Ｌ(s)を次式により算出する。

つまり、状態確率比算出部６０は、該当時刻ｔよりもＢフレーム過去の時刻ｓ＝ｔ−Ｂにおける前向き確率α_s、j、後向き確率β_s,i、及び非音声状態の確率に対する音声状態の確率の比Ｌ(s)を算出することになる。
The probability ratio calculation unit 65 receives the forward probability α_{s,j} and the backward probability β_{s,i} and, in the probability ratio calculation process S612 of FIG. 8, calculates the ratio L(s) of the speech state probability to the non-speech state probability by the following equation.

That is, the state probability ratio calculation unit 60 calculates the forward probability α_{s,j}, the backward probability β_{s,i}, and the ratio L(s) of the speech state probability to the non-speech state probability at time s = t-B, which is B frames before the current time t.
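Since each state probability is described as the product of the forward and backward probabilities, the ratio L(s) can be sketched as below. The form α_{s,1}β_{s,1} / (α_{s,0}β_{s,0}) is the usual forward-backward state posterior ratio and is assumed here; Equation (46) itself is not reproduced in this text, and `state_probability_ratio` is a hypothetical name.

```python
import math


def state_probability_ratio(alpha, beta):
    """Return L(s) = (alpha[1] * beta[1]) / (alpha[0] * beta[0]).

    alpha, beta : forward and backward probabilities at time s,
                  indexed by state (0 = non-speech, 1 = speech)
    """
    speech = alpha[1] * beta[1]
    non_speech = alpha[0] * beta[0]
    if non_speech == 0.0:
        # Degenerate case: treat a vanishing non-speech probability as an
        # infinite ratio (an illustrative choice, not from the source).
        return math.inf
    return speech / non_speech


if __name__ == "__main__":
    print(state_probability_ratio([0.08, 0.06], [0.14, 0.28]))
```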

なお、式(46)は以下に示す過程を経て導かれる。
まず、時刻ｓにおける信号の状態をｑ_ｓ＝Ｈ_ｊと定義すると、音声状態確率と非音声状態の確率の比Ｌ(s)は次式により得られる。

上式において、Ｏ_0:s＝{Ｏ₀，・・・，Ｏ_s}であり、確率比Ｌ(s)はベイズの定理により次式のように展開される。

また、雑音信号Ｎ_0:s＝{Ｎ₀，・・・，Ｎ_s}の時間変動を考慮すると、上式は次式のように拡張される。

上式は、過去の時刻の状態を考慮した再帰式（１次マルコフ過程）により、次式のように展開される。

上式において、ｐ(ｑ_ｓ＝Ｈ_ｊ|ｑ_s-1＝Ｈ_ｉ)＝ａ_i,j、ｐ(Ｏ_ｓ|ｑ_ｓ＝Ｈ_ｊ,Ｎ_ｓ)＝ｂ_ｊ(Ｏ_ｓ)、ｐ(Ｎ_ｓ|ｑ_ｓ＝Ｈ_ｊ,Ｎ_s-1)＝ｃ_s,jに相当し、またｐ(Ｏ_ｓ,ｑ_ｓ＝Ｈ_ｊ,Ｎ_ｓ)は時間軸方向に算出される前向き確率α_s、jに相当する。すなわち上式は、次式の再帰式により得られる。 Equation (46) is derived through the following process.
First, if the state of the signal at time s is defined as q_s = H_j, the ratio L(s) of the speech state probability to the non-speech state probability is obtained by the following equation.

In the above equation, O_{0:s} = {O_0, ..., O_s}, and the probability ratio L(s) is expanded by Bayes' theorem as follows.

Furthermore, when the time variation of the noise signal N_{0:s} = {N_0, ..., N_s} is taken into account, the above equation is extended as follows.

The above equation is expanded as follows by a recursive formula (first-order Markov process) that takes the state at the past time into account.

In the above equation, p(q_s = H_j | q_{s-1} = H_i) = a_{i,j}, p(O_s | q_s = H_j, N_s) = b_j(O_s), and p(N_s | q_s = H_j, N_{s-1}) = c_{s,j}, and p(O_s, q_s = H_j, N_s) corresponds to the forward probability α_{s,j} calculated along the time axis. That is, the above equation is obtained by the following recursive formula.
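Given the correspondences just listed, the forward recursion has the shape of the standard forward algorithm for a two-state model, with the noise state transition probability as an additional factor. The sketch below assumes α_{s,j} = b_j(O_s) c_{s,j} Σ_i a_{i,j} α_{s-1,i}; the function name `forward_step` is hypothetical, and the transition values are the ones the description calls most desirable.

```python
def forward_step(alpha_prev, a, b, c):
    """One step of the forward recursion for states 0 (non-speech) and 1 (speech).

    alpha_prev : forward probabilities at time s-1, indexed by state i
    a          : speech state transition probabilities a[i][j]
    b          : output probabilities b_j(O_s), indexed by j
    c          : noise state transition probabilities c_{s,j}, indexed by j
    """
    n = len(alpha_prev)
    return [b[j] * c[j] * sum(alpha_prev[i] * a[i][j] for i in range(n))
            for j in range(n)]


if __name__ == "__main__":
    A = [[0.8, 0.2], [0.1, 0.9]]     # a_{0,0}, a_{0,1}, a_{1,0}, a_{1,1}
    alpha = [1.0, 0.0]               # initial values of Equations (42), (43)
    alpha = forward_step(alpha, A, b=[0.2, 0.6], c=[0.5, 0.5])
    print(alpha)
```

In a long utterance the values of α shrink quickly, so a practical implementation would rescale them at each step or work with logarithms.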

次に、時刻ｓより未来の時刻、すなわち時刻ｓ＋１，・・・，ｔ＝ｓ＋Ｂにおける状態の影響を考慮すると、確率比Ｌ(s)は次式のように表現される。

上式の確率ｐ(Ｏ_s+1:t,Ｎ_s+1:t|ｑ_ｓ＝Ｈ_ｉ,Ｎ_ｓ)は、時刻ｓより未来の時刻の状態を考慮した再帰式（１次マルコフ過程）により、次式のように展開される。

上式において、ｐ(ｑ_S+1＝Ｈ_ｊ|ｑ_s＝Ｈ_ｉ)＝ａ_i,j、ｐ(Ｏ_S+1|ｑ_S+1＝Ｈ_ｊ,Ｎ_S+1) ＝ｂ_ｊ(Ｏ_S+1)、ｐ(Ｎ_S+1|ｑ_S+1＝Ｈ_ｊ,Ｎ_s)＝ｃ_S+1,jに相当し、またｐ(Ｏ_S+1:t,Ｎ_S+1:t|ｑ_ｓ＝Ｈ_ｉ,Ｎ_ｓ)は時間軸方向に算出される後向き確率β_s、ｉに相当する。すなわち上式は、次式の再帰式により得られる。

つまり、式(52)に式(50)(51)及び式(52)(53)を適用することにより、式(46)が導かれる。

Next, considering the influence of the states at times later than s, that is, at times s+1, ..., t = s+B, the probability ratio L(s) is expressed by the following equation.

The probability p(O_{s+1:t}, N_{s+1:t} | q_s = H_i, N_s) in the above equation is expanded as follows by a recursive formula (first-order Markov process) that takes the states at times later than s into account.

In the above equation, p(q_{s+1} = H_j | q_s = H_i) = a_{i,j}, p(O_{s+1} | q_{s+1} = H_j, N_{s+1}) = b_j(O_{s+1}), and p(N_{s+1} | q_{s+1} = H_j, N_s) = c_{s+1,j}, and p(O_{s+1:t}, N_{s+1:t} | q_s = H_i, N_s) corresponds to the backward probability β_{s,i} calculated along the time axis. That is, the above equation is obtained by the following recursive formula.

That is, Equation (46) is derived by applying Equations (50), (51) and Equations (52), (53) to Equation (52).
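The backward recursion can be sketched in the same way. The code assumes β_{s,i} = Σ_j a_{i,j} b_j(O_{s+1}) c_{s+1,j} β_{s+1,j}, started from β = 1 at time s+B and run back to time s; the function name `backward_probability` and the argument layout are hypothetical.

```python
def backward_probability(a, b_future, c_future):
    """Return beta at time s from the B frames that follow it.

    a        : speech state transition probabilities a[i][j]
    b_future : list of [b_0, b_1] for times s+1, ..., s+B
    c_future : list of [c_0, c_1] for times s+1, ..., s+B
    """
    n = len(a)
    beta = [1.0] * n                       # initial value at time s+B
    # Walk from the most distant future frame back toward time s.
    for b, c in zip(reversed(b_future), reversed(c_future)):
        beta = [sum(a[i][j] * b[j] * c[j] * beta[j] for j in range(n))
                for i in range(n)]
    return beta


if __name__ == "__main__":
    A = [[0.8, 0.2], [0.1, 0.9]]
    print(backward_probability(A, [[0.2, 0.6]], [[0.5, 0.5]]))
```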

図９は音声信号区間推定部７０の機能構成例である。
音声信号区間推定部７０は、Ｌ(s)レジスタ７１、閾値ＴＨレジスタ７２、比較部７３から構成される。
Ｌ(s)レジスタ７１は、状態確率比算出部６０において算出された前記非音声状態の確率に対する音声状態の確率の比Ｌ(s）を入力し記憶する。
閾値ＴＨレジスタ７２は、比較部７３において前記確率比Ｌ(s)が音声状態に属するか非音声状態に属するかを判断する閾値ＴＨを記憶する。なお、閾値ＴＨの値は、事前に固定された値に決定しておいても、入力信号の特徴に応じて適応的に決定してもよい。固定値を設定する場合は、一般的には１０程度の値に設定するのが最も望ましいが、用途に応じ0.5〜10,000の範囲で適宜設定して構わない。
比較部７３は、Ｌ(s)レジスタ７１から前記確率比Ｌ(s)を読み出すとともに、閾値レジスタ７２から閾値ＴＨを読み出し、時刻ｓのフレームが音声状態に属するか非音声状態に属するかを判定し、判定結果を出力する。
具体的には、例えばＬ(s)の値が閾値ＴＨ以上であれば、時刻ｓのフレームが音声状態に属すると判断して１を出力し、閾値ＴＨ未満であれば、時刻ｓのフレームが非音声状態に属すると判断して０を出力する。 FIG. 9 is a functional configuration example of the speech signal section estimation unit 70.
The speech signal section estimation unit 70 comprises an L(s) register 71, a threshold TH register 72, and a comparison unit 73.
The L(s) register 71 receives and stores the ratio L(s) of the speech state probability to the non-speech state probability calculated by the state probability ratio calculation unit 60.
The threshold TH register 72 stores the threshold TH used by the comparison unit 73 to determine whether the probability ratio L(s) indicates the speech state or the non-speech state. The threshold TH may be fixed in advance or determined adaptively according to the characteristics of the input signal. When a fixed value is used, a value of about 10 is generally most desirable, but it may be set appropriately in the range 0.5 to 10,000 depending on the application.
The comparison unit 73 reads the probability ratio L(s) from the L(s) register 71 and the threshold TH from the threshold TH register 72, determines whether the frame at time s belongs to the speech state or the non-speech state, and outputs the determination result.
Specifically, for example, if L(s) is greater than or equal to the threshold TH, the frame at time s is judged to belong to the speech state and 1 is output; if L(s) is less than TH, the frame is judged to belong to the non-speech state and 0 is output.
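The comparison rule amounts to a per-frame threshold test. A minimal sketch follows, using the fixed threshold of about 10 mentioned above as the default; the function name `classify_frames` is hypothetical.

```python
def classify_frames(ratios, threshold=10.0):
    """Map each probability ratio L(s) to 1 (speech) or 0 (non-speech)."""
    return [1 if ratio >= threshold else 0 for ratio in ratios]


if __name__ == "__main__":
    print(classify_frames([0.2, 10.0, 35.0, 9.99]))
```

A ratio exactly equal to the threshold is classified as speech, matching the "greater than or equal to" rule in the text.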

[Second Embodiment]
The second embodiment of the present invention differs from the first embodiment only in the calculation methods used by the forward first weighted average calculation unit 35, the forward second weighted average calculation unit 37, the backward first weighted average calculation unit 45, and the backward second weighted average calculation unit 47; the apparatus configuration is the same as in the first embodiment.
Because the functional configuration differs from the first embodiment only in the reference numerals of these parts, no separate figures are provided. The reference numerals of the second embodiment are simply given in parentheses in FIG. 2 (forward estimation unit) and FIG. 4 (backward estimation unit).
The forward first weighted average calculation unit 135 receives the noise model parameter update values $\hat{N}_{t,j,k,l}$, $\hat{\sigma}^{N}_{t,j,k,l}$ and the forward normalized output probability $w^{OF}_{j,k}$, and outputs the forward first weighted average value of the noise model parameters, consisting of the mean $\hat{N}_{t,j,l}$ and the variance $\hat{\sigma}^{N}_{t,j,l}$.

In this embodiment, among the forward normalized output probabilities $w^{OF}_{j,k}$ calculated for each normal distribution k, the highest one is identified, and the noise model parameter update values $\hat{N}_{t,j,k,l}$, $\hat{\sigma}^{N}_{t,j,k,l}$ of the corresponding normal distribution k are output as the forward first weighted average value $\hat{N}_{t,j,l}$, $\hat{\sigma}^{N}_{t,j,l}$.
This removes the need to compute a weighted average, so processing is faster. However, when the forward normalized output probabilities differ little across the normal distributions, ignoring the other normal distributions has a larger effect than when one particular normal distribution has a markedly higher probability. This embodiment should therefore be used when the probability of one particular normal distribution is sufficiently higher than those of the others.
The forward second weighted average calculation unit 137 receives the forward first weighted average value $\hat{N}_{t,j,l}$, $\hat{\sigma}^{N}_{t,j,l}$, the forward output probability $b_j(O_t)$, and the forward noise state transition probability $c_{t,j}$, and outputs the forward second weighted average value at time t, consisting of the mean $\hat{N}_{t,l}$ and the variance $\hat{\sigma}^{N}_{t,l}$.

In this embodiment, of the forward noise state transition probabilities $c_{t,j}$ calculated for speech and non-speech, the forward first weighted average value $\hat{N}_{t,j,l}$, $\hat{\sigma}^{N}_{t,j,l}$ of whichever of speech or non-speech has the higher probability is output as the forward second weighted average value $\hat{N}_{t,l}$, $\hat{\sigma}^{N}_{t,l}$.
This also removes the need to compute a weighted average, so processing is faster. However, when the difference between the two probabilities is small, ignoring one of them has a large effect, so this embodiment should be used when the difference between the two probabilities is sufficiently large.
The forward first weighted average calculation unit 135 and the forward second weighted average calculation unit 137 have been described above; the backward first weighted average calculation unit 145 and the backward second weighted average calculation unit 147 can perform the same processing.
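The trade-off between the weighted average and the maximum-probability selection of the second embodiment can be illustrated as follows. This is only a sketch under stated assumptions: the first-embodiment weighted average is assumed here to be a probability-weighted mean over the normal distributions k (its exact formula is not given in this section), only the mean parameter is shown, and the function names and numeric values are hypothetical.

```python
from typing import List, Sequence


def weighted_average(updates: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Assumed first-embodiment form: probability-weighted mean over distributions k."""
    total = sum(weights)
    dims = len(updates[0])
    return [sum(w * u[l] for w, u in zip(weights, updates)) / total
            for l in range(dims)]


def max_selection(updates: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[float]:
    """Second-embodiment form: copy the update of the most probable distribution k."""
    best = max(range(len(weights)), key=lambda k: weights[k])
    return list(updates[best])


# Hypothetical update values for K = 3 normal distributions and L = 2 dimensions.
updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

peaked = [0.02, 0.95, 0.03]  # one distribution clearly dominates
flat = [0.34, 0.33, 0.33]    # probabilities are nearly equal

print(weighted_average(updates, peaked), max_selection(updates, peaked))
print(weighted_average(updates, flat), max_selection(updates, flat))
```

With the peaked weights the two results nearly coincide, while with the flat weights the selected value differs substantially from the weighted mean, which is the situation the description warns against. The same selection applies to the second weighted average, with the noise state transition probabilities for speech and non-speech taking the role of the weights.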

[Modifications]
In the above embodiments, the parameter prediction process S306 predicts the parameters at the current time from the estimation result of the previous time step using a random walk process, but the prediction may instead use an autoregressive method (linear prediction) or the like. In that case, the final noise model parameter estimation performance can be expected to improve according to the order of the autoregressive coefficients.
In the above embodiments, a sudden abnormality detection and correction unit 74 may also be connected after the threshold decision in the speech signal section estimation unit 70, as indicated by the broken line in FIG. 9. This unit examines the durations of the speech signal sections and non-speech signal sections and automatically corrects the speech signal section estimation result. Alternatively, as also indicated by a broken line in FIG. 9, a signal obtained by multiplying the speech / non-speech determination result by the input signal O(t) may be output, with an effect similar to that of the sudden abnormality detection and correction unit 74. Configuring the speech signal section estimation unit 70 in this way allows sudden classification errors to be corrected, so the performance of speech signal section estimation can be expected to improve.
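One possible form of the duration-based correction is sketched below. The description does not specify the correction rule, so everything here is an assumption made for illustration: the function name `correct_short_runs`, the minimum run length `min_run`, and the rule of flipping any interior run of identical decisions shorter than `min_run` frames.

```python
from itertools import groupby
from typing import List


def correct_short_runs(decisions: List[int], min_run: int = 3) -> List[int]:
    """Flip interior runs shorter than min_run frames (assumed correction rule)."""
    # Collapse the 0/1 sequence into (label, run length) pairs.
    runs = [(label, len(list(group))) for label, group in groupby(decisions)]
    corrected: List[int] = []
    for index, (label, length) in enumerate(runs):
        is_interior = 0 < index < len(runs) - 1
        if is_interior and length < min_run:
            # An interior run is flanked by the opposite label on both sides.
            label = 1 - label
        corrected.extend([label] * length)
    return corrected


# Hypothetical frame decisions with one spurious speech frame and one dropout.
raw = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1]
print(correct_short_runs(raw, min_run=3))
```

Runs at the very start or end of the sequence are left unchanged in this sketch because their true duration is not yet known.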

[Experimental Results]
To demonstrate the effect of the present invention, an example is described in which an acoustic signal containing both a speech signal and a noise signal is input to the speech signal section detection apparatus of the present invention and the speech signal sections are detected. The experimental method and results are described below.
In this experiment, 2,292 clean speech utterances recorded in a Japanese travel dialogue speech database were used as clean speech, and noise recorded in an airport lobby was used as noise; the two were added artificially at a signal-to-noise ratio of 0 dB to create the input signal O(t). Each signal was sampled at a sampling frequency of 8,000 Hz with 16-bit quantization. The acoustic signal analysis unit 11 was applied to this input acoustic signal with a frame length of 20 ms (160 sample points) and the frame start point shifted every 10 ms (80 sample points), and a 24-dimensional mel spectrum was extracted as the acoustic feature.
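The framing used in the experiment can be sketched as follows. The frame length and shift follow from the stated 8,000 Hz sampling frequency; the function name `split_into_frames` and the choice to discard a trailing partial frame are assumptions made for illustration, and the mel spectrum computation itself is not shown.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 8000
FRAME_LENGTH = SAMPLE_RATE_HZ * 20 // 1000  # 20 ms -> 160 sample points
FRAME_SHIFT = SAMPLE_RATE_HZ * 10 // 1000   # 10 ms -> 80 sample points


def split_into_frames(signal: Sequence[float],
                      frame_length: int = FRAME_LENGTH,
                      frame_shift: int = FRAME_SHIFT) -> List[List[float]]:
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    last_start = len(signal) - frame_length
    return [list(signal[start:start + frame_length])
            for start in range(0, last_start + 1, frame_shift)]


# One second of a placeholder (all-zero) signal.
one_second = [0.0] * SAMPLE_RATE_HZ
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]))
```

Each frame overlaps the previous one by half its length, so one second of input yields 99 full frames.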

Each GMM was a model with 64 mixture components using the 24-dimensional log mel spectrum as the acoustic feature; one was trained on silence signals and the other on clean speech signals. In the parameter prediction process S306, the parameter ε was set to 0.001, and in process S403, the number of frames B required for backward estimation was set to 5. In the speech state transition probability table 61, the speech state transition probabilities $a_{i,j}$ were set to 0.8, 0.2, 0.9, and 0.1, respectively. In the speech signal section estimation unit 70, the threshold TH was set to 10.
Performance was evaluated using the Harmonic mean, that is, the harmonic mean of False acceptance and False rejection. False acceptance is the proportion of non-speech sections incorrectly identified as speech sections, and False rejection is the proportion of speech sections incorrectly identified as non-speech sections. With the Harmonic mean as the evaluation measure, the performance of the present invention was compared with that of the prior art.
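The evaluation measure can be sketched as follows. The defining equation is not reproduced in this text, so the standard harmonic mean 2·FA·FR/(FA+FR) is assumed, with False acceptance computed over reference non-speech frames and False rejection over reference speech frames; the function names and the sample label sequences are hypothetical.

```python
from typing import Sequence, Tuple


def error_rates(reference: Sequence[int], hypothesis: Sequence[int]) -> Tuple[float, float]:
    """Return (false acceptance rate, false rejection rate) for frame-level 0/1 labels."""
    non_speech = sum(1 for r in reference if r == 0)
    speech = sum(1 for r in reference if r == 1)
    false_accepts = sum(1 for r, h in zip(reference, hypothesis) if r == 0 and h == 1)
    false_rejects = sum(1 for r, h in zip(reference, hypothesis) if r == 1 and h == 0)
    return false_accepts / non_speech, false_rejects / speech


def harmonic_mean(a: float, b: float) -> float:
    """Standard harmonic mean of two rates (assumed form of the missing equation)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


# Hypothetical frame labels: 1 = speech, 0 = non-speech.
reference = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
hypothesis = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

far, frr = error_rates(reference, hypothesis)
score = harmonic_mean(far, frr)
print(far, frr, score)
```

Smaller values indicate better performance, since both inputs are error rates.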

FIG. 10 shows the experimental results. The vertical axis of FIG. 10 is the Harmonic mean; smaller values indicate higher performance. The horizontal axis shows the speech signal section estimation methods: 81, 82, and 83 are the results of the methods disclosed in Non-Patent Document 1, Non-Patent Document 2, and Non-Patent Document 3, respectively, and 84 is the result of the first embodiment of the present invention.
The results in FIG. 10 show that the present invention achieves higher performance than the prior art.

FIG. 1: Block diagram of the speech signal section estimation apparatus according to the present invention.
FIG. 2: Block diagram of the forward estimation unit in the speech signal section estimation apparatus according to the present invention.
FIG. 3: Processing procedure of the forward estimation unit in the speech signal section estimation apparatus according to the present invention.
FIG. 4: Block diagram of the backward estimation unit in the speech signal section estimation apparatus according to the present invention.
FIG. 5: Processing procedure of the backward estimation unit in the speech signal section estimation apparatus according to the present invention.
FIG. 6: Block diagram of the state probability ratio calculation unit in the speech signal section estimation apparatus according to the present invention.
FIG. 7: Diagram showing the state transition model of the speech state / non-speech state.
FIG. 8: Processing procedure of the state probability ratio calculation unit in the speech signal section estimation apparatus according to the present invention.
FIG. 9: Block diagram of the speech signal section estimation unit in the speech signal section estimation apparatus according to the present invention.
FIG. 10: Experimental results of speech signal section estimation according to the present invention.

Claims

A speech signal section estimation device that detects and estimates a time section in which a speech signal exists in an input signal containing a speech signal and a noise signal,
An acoustic signal analysis unit that extracts an acoustic feature quantity for each frame obtained by cutting out the input signal at predetermined intervals;
A noiseless model storage unit that stores the parameters of probability models (GMM: Gaussian Mixture Model), each a mixture of a plurality of normal distributions, for a clean speech signal and for a silence signal;
A forward estimation unit that receives the acoustic feature quantity and each probability model parameter stored in the noiseless model storage unit, and sequentially estimates and outputs the noise model parameters at the current time, proceeding from past times toward the current time, using parallel nonlinear Kalman filters;
A backward estimation unit that receives the noise model parameters output from the forward estimation unit and each probability model parameter stored in the noiseless model storage unit; sequentially re-estimates the noise model parameters at the current time backward, from future times toward the current time, using parallel Kalman smoothers; sequentially estimates probability model parameters for speech (noise + clean speech) and non-speech (noise + silence) based on the backward-estimated noise model parameters; calculates the output probabilities of speech and non-speech; and calculates and outputs, from these output probabilities and the backward-estimated noise model parameters, the noise state transition probability from the noise model parameter estimation result of the previous frame to that of the current frame;
A parameter storage unit that stores calculation results obtained in the course of processing in the forward estimation unit and the backward estimation unit;
A state probability ratio calculation unit that receives the speech output probability, the non-speech output probability, and the noise state transition probability output from the backward estimation unit, calculates a speech state probability and a non-speech state probability, and outputs the ratio of the speech state probability to the non-speech state probability;
A speech signal section estimation unit that receives the state probability ratio, compares it with a threshold for each frame, and outputs either a speech state or a non-speech state as the comparison result;
A speech signal section estimation device comprising:

The speech signal section estimation device according to claim 1,
The forward estimation unit includes:
A noise model parameter prediction unit that receives the acoustic feature quantity and the forward second weighted average value of the previous frame, and calculates and outputs a noise model parameter prediction value of the current frame, proceeding from the past time toward the current time;
A noise model parameter update unit that receives the acoustic feature quantity, the noise model parameter prediction value, and each probability model parameter stored in the noiseless model storage unit, performs noise model parameter update processing in parallel for each of the plurality of normal distributions of each probability model stored in the noiseless model storage unit, and outputs noise model parameter update values;
A forward probability model parameter generation unit that receives the noise model parameter update values and each probability model parameter stored in the noiseless model storage unit, and generates and outputs, frame by frame, speech (noise + clean speech) probability model parameters and non-speech (noise + silence) probability model parameters suited to the noise environment at that time;
A forward speech / non-speech output probability calculation unit that receives the acoustic feature quantity, each probability model parameter output from the forward probability model parameter generation unit, and each probability model parameter stored in the noiseless model storage unit, and calculates and outputs, for each frame, the forward output probabilities of speech and non-speech and the forward normalized output probabilities obtained by decomposing each forward output probability over the normal distributions;
A forward first weighted average calculation unit that receives the noise model parameter update values and the forward normalized output probabilities, and calculates and outputs a forward first weighted average value of the noise model parameters;
A forward noise state transition probability calculation unit that receives the forward second weighted average value of the previous frame, the noise model parameter update values, the forward normalized output probabilities, and the forward first weighted average value, and calculates and outputs the forward noise state transition probability from the estimation result of the previous frame to the estimation result of the current frame;
A forward second weighted average calculation unit that receives the forward first weighted average value, the forward output probabilities of speech and non-speech, and the forward noise state transition probability, and calculates and outputs the forward second weighted average value of the current frame;
Comprising
The backward estimation unit is
A noise model parameter re-estimation unit that receives the noise model parameter prediction value one frame later, the noise model parameter update value of the current frame, and the noise model parameter re-estimation value one frame later, performs re-estimation of the forward noise model parameters of the current frame in parallel for each normal distribution of each probability model stored in the noiseless model storage unit, proceeding from the future time toward the current time, and outputs noise model parameter re-estimation values;
A backward probability model parameter generation unit that receives the noise model parameter re-estimation values and each probability model parameter stored in the noiseless model storage unit, and generates and outputs, frame by frame, speech (noise + clean speech) probability model parameters and non-speech (noise + silence) probability model parameters suited to the noise environment at that time;
A backward speech / non-speech output probability calculation unit that receives the acoustic feature quantity, each probability model parameter output from the backward probability model parameter generation unit, and each probability model parameter stored in the noiseless model storage unit, and calculates and outputs, for each frame, the output probabilities of speech and non-speech and the backward normalized output probabilities obtained by decomposing each output probability over the normal distributions;
A backward first weighted average calculation unit that receives the noise model parameter re-estimation values and the backward normalized output probabilities, and calculates and outputs a backward first weighted average value of the noise model parameters;
A backward noise state transition probability calculation unit that receives the backward second weighted average value of the previous frame, the noise model parameter re-estimation values, the backward normalized output probabilities, and the backward first weighted average value, and calculates and outputs the noise state transition probability from the estimation result of the previous frame to the estimation result of the current frame;
A backward second weighted average calculation unit that receives the backward first weighted average value output from the backward first weighted average calculation unit, the speech output probability and the non-speech output probability output from the backward speech / non-speech output probability calculation unit, and the noise state transition probability output from the backward noise state transition probability calculation unit, and calculates and outputs the backward second weighted average value of the current frame;
A speech signal section estimation device comprising:

The speech signal section estimation device according to claim 1 or 2,
The state probability ratio calculation unit
A speech state transition probability table that stores speech state transition probabilities set appropriately in a speech / non-speech state transition model expressed by a finite state machine;
A forward probability calculation unit that receives the speech output probability, the non-speech output probability, and the noise state transition probability of the current frame output from the backward estimation unit, the speech state transition probabilities, and the forward probability of the previous frame, and calculates and outputs the forward probability of the current frame;
A backward probability calculation unit that receives the speech output probability, the non-speech output probability, and the noise state transition probability one frame later output from the backward estimation unit, the speech state transition probabilities, and the backward probability one frame later, and calculates and outputs the backward probability of the current frame;
A probability ratio calculation buffer for storing the forward probability and the backward probability obtained in the course of processing in the forward probability calculation unit and the backward probability calculation unit;
A probability ratio calculation unit that receives the forward probability of the current frame and the backward probability of the current frame, calculates a ratio of the speech state probability to the non-speech state probability, and outputs the ratio.
A speech signal section estimation device comprising:

The speech signal section estimation device according to claim 2 or 3,
The forward first weighted average calculation unit outputs, as the forward first weighted average value of the noise model parameters, the noise model parameter update value whose forward normalized output probability is the maximum among the noise model parameter update values,
The forward second weighted average calculation unit outputs, as the forward second weighted average value of the current frame, the forward first weighted average value whose forward noise state transition probability is the maximum among the forward first weighted average values,
The backward first weighted average calculation unit outputs, as the backward first weighted average value of the noise model parameters, the noise model parameter re-estimation value whose backward normalized output probability is the maximum among the noise model parameter re-estimation values,
The backward second weighted average calculation unit outputs, as the backward second weighted average value of the current frame, the backward first weighted average value whose noise state transition probability is the maximum among the backward first weighted average values.
A speech signal section estimation device characterized by the above.

A speech signal section estimation method for detecting and estimating a time section in which a speech signal exists in an input signal containing a speech signal and a noise signal,
A process in which an acoustic signal analysis unit extracts an acoustic feature quantity for each frame obtained by cutting out the input signal at predetermined intervals;
A process in which a forward estimation unit, from the parameters of probability models (GMM: Gaussian Mixture Model), each a mixture of a plurality of normal distributions, for a clean speech signal and for a silence signal, sequentially estimates the noise model parameters at the current time, proceeding from past times toward the current time, using parallel nonlinear Kalman filters;
A process in which a backward estimation unit, from the noise model parameters output from the forward estimation unit and the parameters of the probability models (GMM) for the clean speech signal and the silence signal, sequentially re-estimates the noise model parameters at the current time backward, from future times toward the current time, using parallel Kalman smoothers; sequentially estimates probability model parameters for speech (noise + clean speech) and non-speech (noise + silence) based on the backward-estimated noise model parameters; calculates the output probabilities of speech and non-speech; and calculates, from these output probabilities and the backward-estimated noise model parameters, the noise state transition probability from the noise model parameter estimation result of the previous frame to that of the current frame;
A process in which a state probability ratio calculation unit calculates a speech state probability and a non-speech state probability from the speech output probability, the non-speech output probability, and the noise state transition probability output from the backward estimation unit, and calculates the ratio of the speech state probability to the non-speech state probability;
A process in which a speech signal section estimation unit compares the state probability ratio with a threshold for each frame and estimates whether the frame is in a speech state or a non-speech state;
A speech signal section estimation method comprising:

The speech signal section estimation method according to claim 5,
The process in which the forward estimation unit sequentially estimates the noise model parameters comprises:
A process in which a noise model parameter prediction unit calculates, from the acoustic feature quantity and the forward second weighted average value of the previous frame, a noise model parameter prediction value of the current frame, proceeding from the past time toward the current time;
A process in which a noise model parameter update unit, from the acoustic feature quantity, the noise model parameter prediction value, and the parameters of the probability models (GMM), each a mixture of a plurality of normal distributions, for the clean speech signal and the silence signal, performs noise model parameter update processing in parallel for each of the normal distributions and calculates noise model parameter update values;
A process in which a forward probability model parameter generation unit generates, from the noise model parameter update values and the parameters of the probability models (GMM) for the clean speech signal and the silence signal, frame by frame, speech (noise + clean speech) probability model parameters and non-speech (noise + silence) probability model parameters suited to the noise environment at that time;
A process in which a forward speech / non-speech output probability calculation unit calculates, from the acoustic feature quantity, each probability model parameter calculated by the forward probability model parameter generation unit, and the parameters of the probability models (GMM) for the clean speech signal and the silence signal, the forward output probabilities of speech and non-speech for each frame and the forward normalized output probabilities obtained by decomposing each forward output probability over the normal distributions;
A process in which a forward first weighted average calculation unit calculates a forward first weighted average value of the noise model parameters from the noise model parameter update values and the forward normalized output probabilities;
A process in which a forward noise state transition probability calculation unit calculates, from the forward second weighted average value of the previous frame, the noise model parameter update values, the forward normalized output probabilities, and the forward first weighted average value, the forward noise state transition probability from the estimation result of the previous frame to the estimation result of the current frame;
A process in which a forward second weighted average calculation unit calculates the forward second weighted average value of the current frame from the forward first weighted average value, the forward output probabilities of speech and non-speech, and the forward noise state transition probability;
The process in which the backward estimation unit calculates the output probabilities and the noise state transition probability comprises:
A process in which a noise model parameter re-estimation unit, from the noise model parameter prediction value one frame later, the noise model parameter update value of the current frame, and the noise model parameter re-estimation value one frame later, performs re-estimation of the forward noise model parameters of the current frame in parallel for each of the plurality of normal distributions included in the respective probability models of the clean speech signal and the silence signal, proceeding from the future time toward the current time, and calculates noise model parameter re-estimation values;
A process in which a backward probability model parameter generation unit generates, from the noise model parameter re-estimation values and the parameters of the probability models (GMM) for the clean speech signal and the silence signal, frame by frame, speech (noise + clean speech) probability model parameters and non-speech (noise + silence) probability model parameters suited to the noise environment at that time;
A process in which a backward speech / non-speech output probability calculation unit calculates, from the acoustic feature quantity, each probability model parameter calculated by the backward probability model parameter generation unit, and the parameters of the probability models (GMM) for the clean speech signal and the silence signal, the output probabilities of speech and non-speech for each frame and the backward normalized output probabilities obtained by decomposing each output probability over the normal distributions;
A process in which a backward first weighted average calculation unit calculates a backward first weighted average value of the noise model parameters from the noise model parameter re-estimation values and the backward normalized output probabilities;
A process in which a backward noise state transition probability calculation unit calculates, from the backward second weighted average value of the previous frame, the noise model parameter re-estimation values, the backward normalized output probabilities, and the backward first weighted average value, the noise state transition probability from the estimation result of the previous frame to the estimation result of the current frame;
A process in which a backward second weighted average calculation unit calculates the backward second weighted average value of the current frame from the backward first weighted average value output from the backward first weighted average calculation unit, the speech output probability and the non-speech output probability output from the backward speech / non-speech output probability calculation unit, and the noise state transition probability output from the backward noise state transition probability calculation unit;
A speech signal section estimation method comprising:

The speech signal section estimation method according to claim 5 or 6,
The process in which the state probability ratio calculation unit calculates the ratio of the speech state probability to the non-speech state probability,
A process in which a forward probability calculation unit calculates the forward probability of the current frame from the speech output probability, the non-speech output probability, and the noise state transition probability of the current frame output from the backward estimation unit, speech state transition probabilities set appropriately in a speech / non-speech state transition model expressed by a finite state machine, and the forward probability of the previous frame;
A process in which a backward probability calculation unit calculates the backward probability of the current frame from the speech output probability, the non-speech output probability, and the noise state transition probability one frame later output from the backward estimation unit, the speech state transition probabilities, and the backward probability one frame later;
A process of calculating a ratio of a speech state probability to a non-speech state probability from a forward probability of the current frame and a backward probability of the current frame;
A speech signal section estimation method comprising:
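
The forward probability, backward probability, and state probability ratio described in the claim above can be illustrated with a minimal sketch. It is a simplified two-state forward-backward recursion, not the claimed implementation: it assumes that the per-frame speech and non-speech output probabilities are already given and already incorporate the noise state transition probability, and the names `state_probability_ratio`, `b`, `A`, and `pi` are hypothetical.

```python
import numpy as np

def state_probability_ratio(b, A, pi):
    """Per-frame ratio of speech to non-speech state probability.

    Assumptions (not from the source):
      b  : (T, 2) output probabilities, column 0 = non-speech, column 1 = speech
      A  : (2, 2) speech state transition probabilities, A[i, j] = P(j at t | i at t-1)
      pi : (2,)   initial state probabilities
    """
    T = b.shape[0]
    alpha = np.zeros((T, 2))  # forward probabilities
    beta = np.ones((T, 2))    # backward probabilities

    # Forward pass: current frame from the previous frame's forward probability.
    alpha[0] = pi * b[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]
        alpha[t] /= alpha[t].sum()  # per-frame scaling; the ratio is unaffected

    # Backward pass: current frame from the backward probability one frame later.
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()

    # State probability is proportional to forward times backward probability.
    post = alpha * beta
    return post[:, 1] / post[:, 0]

# Illustrative values only: frames 2 and 3 favour speech.
b = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.9, 0.1]])
A = np.array([[0.8, 0.2], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
ratio = state_probability_ratio(b, A, pi)
is_speech = ratio > 1.0  # comparison with a threshold, here assumed to be 1
```

Because both states of a frame are scaled by the same factor, the per-frame normalization keeps the recursion numerically stable without changing the ratio that is compared with the threshold.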

The speech signal section estimation method according to claim 6 or 7,
wherein the process in which the forward first weighted average calculation unit calculates the forward first weighted average value is a process of taking, among the noise model parameter update values, the noise model parameter update value having the maximum forward normalized output probability as the forward first weighted average value of the noise model parameter,
The process in which the forward second weighted average calculation unit calculates the forward second weighted average value is a process of taking, among the forward first weighted average values, the forward first weighted average value having the maximum forward noise state transition probability as the forward second weighted average value of the current frame,
The process in which the backward first weighted average calculation unit calculates the backward first weighted average value is a process of taking, among the noise model parameter re-estimated values, the noise model parameter re-estimated value having the maximum backward normalized output probability as the backward first weighted average value of the noise model parameter, and
The process in which the backward second weighted average calculation unit calculates the backward second weighted average value is a process of taking, among the backward first weighted average values, the backward first weighted average value having the maximum state transition probability as the backward second weighted average value of the current frame.
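
The difference between a weighted average and the maximum-probability selection described in the claim above can be sketched as follows. This is only an illustration under assumed data layouts: `values` holds one candidate noise model parameter vector per row, `weights` holds the corresponding normalized probabilities, and both function names are hypothetical.

```python
import numpy as np

def weighted_average(values, weights):
    """Probability-weighted average of candidate parameter vectors.

    values  : (K, D) candidate parameters (assumed layout)
    weights : (K,)   normalized probabilities that sum to 1
    """
    return weights @ values

def max_selection(values, weights):
    """Take the single candidate with the largest probability instead of averaging."""
    return values[np.argmax(weights)]

# Illustrative values only.
values = np.array([[1.0], [3.0]])
weights = np.array([0.25, 0.75])
averaged = weighted_average(values, weights)
selected = max_selection(values, weights)
```

Selection avoids the sum over all candidates at the cost of discarding the contribution of the less probable ones.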

A program for causing a computer to function as the apparatus according to any one of claims 1 to 4.

A computer-readable recording medium on which the program according to claim 9 is recorded.