JP6139429B2

JP6139429B2 - Signal processing apparatus, method and program

Info

Publication number: JP6139429B2
Application number: JP2014025196A
Authority: JP
Inventors: 小川 厚徳; 木下 慶介; 堀 貴明; 中谷 智広; 中村 篤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-02-13
Filing date: 2014-02-13
Publication date: 2017-05-31
Anticipated expiration: 2034-02-13
Also published as: JP2015152704A

Description

The present invention relates to techniques for processing signals such as speech signals and acoustic signals.

When an acoustic signal is picked up in an environment with noise or reverberation, the observed signal is the original signal with acoustic distortion (noise and reverberation) superimposed on it. When the acoustic signal is speech, the superimposed distortion greatly reduces intelligibility. As a result, it becomes difficult to extract the properties of the original speech signal, and, for example, the recognition rate of a speech recognition system drops. Preventing this drop requires some means of removing the superimposed acoustic distortion.

For this reason, the conventional signal processing apparatus described below has been proposed. Besides speech recognition, this apparatus can be used in, for example, hearing aids, video conferencing systems, machine control interfaces, and music information processing systems that search for or transcribe music.

[Signal processing apparatus]
FIG. 1 shows an example functional configuration of the conventional signal processing apparatus, whose operation is briefly described here. The apparatus includes a Fourier transform unit 101, a feature generation unit 102, a matching unit 103, a speech enhancement filtering unit 104, and a case model storage unit 105.

Speech containing noise and reverberation is input to the Fourier transform unit 101 as the input signal. The input signal is windowed with a short-time Hamming window of, for example, about 30 ms, and each windowed frame is converted into an amplitude spectrum by a discrete Fourier transform (step S1, FIG. 2). The amplitude spectrum is the magnitude data of the frequency spectrum. It is supplied to the feature generation unit 102 and the speech enhancement filtering unit 104.
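Step S1 can be sketched in Python with NumPy as follows. This is a minimal illustration, not the apparatus itself: the 16 kHz sample rate, the 10 ms frame shift, and the function name `amplitude_spectrogram` are assumptions; only the roughly 30 ms Hamming window and the discrete Fourier transform come from the description.

```python
import numpy as np

def amplitude_spectrogram(signal, sample_rate=16000, win_ms=30.0, hop_ms=10.0):
    """Return amplitude spectra, one row per frame.

    sample_rate and hop_ms are assumed values; the text only specifies
    a Hamming window of about 30 ms. The signal must be at least one
    window long.
    """
    win_len = int(round(sample_rate * win_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    window = np.hamming(win_len)
    num_frames = 1 + (len(signal) - win_len) // hop_len
    # Cut the signal into overlapping frames and apply the window to each.
    frames = np.stack([signal[k * hop_len:k * hop_len + win_len]
                       for k in range(num_frames)])
    # Magnitude of the one-sided DFT of every windowed frame.
    return np.abs(np.fft.rfft(frames * window, axis=1))
```

With these assumed settings, one second of audio yields an array of shape (98, 241): 98 frames of 241 frequency bins each.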

The feature generation unit 102 converts every amplitude spectrum output by the Fourier transform unit 101 into, for example, mel-cepstral features (step S2, FIG. 2). Commonly used mel cepstra are of order 10 to 20 at most, but here a high order (for example, about 30 to 100) is used so that the case data is represented accurately. Features other than the mel cepstrum may also be used. The generated features are supplied to the matching unit 103.

The case model storage unit 105 stores clean speech data corresponding to the cases, and a case model M, which is a sequence (segment) of indices of the Gaussian distributions in a Gaussian mixture distribution that give the maximum likelihood for the feature of each frame. The clean speech data corresponding to a case is, for example, the amplitude spectrum of the corresponding clean speech. FIG. 3 shows an example of the segments contained in the case model M. Each cell corresponds to the i-th time frame, and the number in each cell is the index m_i of the Gaussian distribution, within the Gaussian mixture distribution g, that gives the maximum likelihood. The case model is generated in advance by a case model generation apparatus and stored in the case model storage unit 105 beforehand. To generate it, a large amount of clean speech obtained from a speech corpus or the like is combined with noise and reverberation data obtained in a wide range of environments (noise waveforms and room impulse responses) to simulate observed signals in various environments, and the simulated observed signals are converted into the feature domain. The case model generation apparatus is described in detail later.

The matching unit 103 matches the features of the input signal against the feature cases held in the case model storage unit 105 and searches for the segment in the case model that is closest to the input signal (step S3, FIG. 2). Information about the segment found is supplied to the speech enhancement filtering unit 104. The matching unit 103 is described in detail later.

The speech enhancement filtering unit 104 builds a speech enhancement filter from the clean-speech amplitude spectrum corresponding to the segment found by the matching unit 103, that is, the segment in the case model closest to the input signal, and filters the input signal with that filter (step S4, FIG. 2). The clean-speech amplitude spectrum for that segment is read from the case model storage unit 105. For details of the speech enhancement filtering unit 104, see, for example, Non-Patent Document 1 and Patent Document 1.

It has been reported that this signal processing apparatus can remove highly time-varying noise, which was previously difficult. Highly time-varying noise here means noise such as the sound of an alarm clock, as opposed to background noise.

[Case model generation apparatus]
The case model generation apparatus, which generates the case model stored in the case model storage unit 105, is described here. FIG. 4 shows an example of its functional configuration. The apparatus includes a Fourier transform unit 201, a feature generation unit 202, a Gaussian mixture model learning unit 203, and a maximum-likelihood Gaussian distribution calculation unit 204.

The function of each unit of the case model generation apparatus is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The input to the case model generation apparatus is speech data from various noise and reverberation environments. This data is assumed to include speech data from a clean environment. The following processing is performed on the speech data of each environment.
The Fourier transform unit 201 and the feature generation unit 202 are the same as the Fourier transform unit 101 and the feature generation unit 102 of FIG. 1, respectively, so their description is not repeated.

The Gaussian mixture model learning unit 203 takes the features x_i obtained by the feature generation unit 202 for each short-time frame i as training data and obtains a Gaussian mixture model g by ordinary maximum likelihood estimation. The Gaussian mixture model g is given by the following equation.
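The equation itself is not reproduced in this text. A form consistent with the symbols defined in the next paragraph (mixture weights w(m), Q components, component densities g(x_i | m) with mean μ_m and variance Σ_m) would be the standard Gaussian mixture density. It is offered as a reconstruction, not as the published formula:

```latex
\[
g(x_i) = \sum_{m=1}^{Q} w(m)\, g(x_i \mid m),
\qquad
g(x_i \mid m) = \mathcal{N}\!\left(x_i;\ \mu_m, \Sigma_m\right)
\]
```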

g(x_i | m) denotes the m-th Gaussian distribution, with mean μ_m and variance Σ_m. In most cases g(x_i | m) is a multidimensional Gaussian distribution whose dimensionality equals that of the feature x_i, in which case the mean μ_m and the variance Σ_m are each vectors. For brevity, g(x_i | m) is referred to simply as a Gaussian distribution even when it is multidimensional. w(m) is the mixture weight of the m-th Gaussian distribution, and Q is the number of mixture components. Q is set to a fairly large value, for example 4096 or 8192.

The maximum-likelihood Gaussian distribution calculation unit 204 finds, for each time frame i, the index m_i of the Gaussian distribution within the Gaussian mixture distribution g that gives the maximum likelihood, and obtains the time series of these indices as the case model M. The case model M is expressed in terms of the set of Gaussian indices m_i and the Gaussian mixture model g, as in the following equation.
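The equation is not reproduced in this text. By analogy with the segment notation M_{u:u+τ} = {g, m_i : i = u, ..., u+τ} used later in the description, a consistent form would be the following. The explicit arg max is an assumed reading of "the index that gives the maximum likelihood":

```latex
\[
M = \{\, g,\ m_i : i = 1, 2, \ldots, I \,\},
\qquad
m_i = \operatorname*{arg\,max}_{1 \le m \le Q}\ g(x_i \mid m)
\]
```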

Here, m_i is the index of the Gaussian distribution that gives the maximum likelihood for the feature x_i of the i-th frame; it identifies the Gaussian distribution g(x_i | m_i) within the Gaussian mixture distribution g. I is the total number of frames in the training data. For example, one hour of training data gives I = 3.5 × 10^5. The generated case model M is stored in the case model storage unit 105 (FIG. 1). A case model is generated in this way for the training data of each noise and reverberation environment.
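The per-frame index selection can be sketched as follows. The sketch assumes diagonal-covariance Gaussians (consistent with the variance being described as a vector) and assumes that the mixture weights are not included in the arg max; the function names are hypothetical.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Log density of every frame under every diagonal Gaussian.

    x: (I, D) features, means: (Q, D), variances: (Q, D).
    Returns an (I, Q) array of log g(x_i | m).
    """
    diff = x[:, None, :] - means[None, :, :]            # (I, Q, D)
    mahalanobis = np.sum(diff ** 2 / variances, axis=2)  # (I, Q)
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (Q,)
    return -0.5 * (mahalanobis + log_norm)

def case_model_indices(x, means, variances):
    """Index m_i of the most likely Gaussian for each frame (0-based)."""
    return np.argmax(log_gauss_diag(x, means, variances), axis=1)
```

The returned integer array, together with the mixture model parameters, plays the role of the case model M in this sketch.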

For the clean environment, the amplitude spectrum data output by the Fourier transform unit 201 is also stored in the case model storage unit 105 (FIG. 1).

[Specific processing of the matching unit 103]
The processing in the matching unit 103 is now described in detail. For simplicity, only the case model M of a single noise and reverberation environment is considered, and time warping when matching the input feature sequence against training data segments is ignored. Using the input signal features y_t and the case model M, the matching unit 103 searches for the training data segment closest to the input feature sequence and outputs the training data segment M^t_{u:u+τ_max} that is expected to give the clean speech sequence closest to the clean speech contained in the input signal.

The input signal consists of T time frames, and its feature sequence is y = {y_t : t = 1, 2, ..., T}. Let y_{t:t+τ} be the sequence of input features from time frame t to t+τ, and let M_{u:u+τ} = {g, m_i : i = u, u+1, ..., u+τ} be the Gaussian distribution sequence corresponding to the consecutive time frames u through u+τ of the training data.

Several definitions of the distance between the input feature sequence y_{t:t+τ} and a segment of the training data, and several ways of searching for the training data closest to y_{t:t+τ}, are conceivable, Euclidean distance among them. Here, the training data segment closest to the input feature sequence is taken to be the longest of the training data segments that match the input feature sequence well. That is, the training data segment M^t_{u:u+τ} closest to the input feature sequence is found by maximizing the posterior probability given by the following equation.
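Equation (2) is not reproduced in this text. From the description of its numerator and of the two terms of its denominator given in the following paragraphs, a consistent form would be the one below. It assumes equal prior probabilities for all candidates, which the text does not state, and is a reconstruction rather than the published formula:

```latex
\[
p\!\left(M_{u:u+\tau} \mid y_{t:t+\tau}\right)
= \frac{p\!\left(y_{t:t+\tau} \mid M_{u:u+\tau}\right)}
       {\displaystyle\sum_{u'} p\!\left(y_{t:t+\tau} \mid M_{u':u'+\tau}\right)
        + p\!\left(y_{t:t+\tau} \mid g\right)}
\tag{2}
\]
```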

Here, p(M_{u:u+τ} | y_{t:t+τ}) is the posterior probability. It has the property that, when y_{t:t+τ} and M_{u:u+τ} match reasonably well, a longer τ yields a higher posterior probability. The proof of this property is given in detail in Non-Patent Document 1. Searching for longer segments makes the match less susceptible to noise that is present only locally in time, so matching that is relatively robust to noise can be expected.

The numerator of equation (2), p(y_{t:t+τ} | M_{u:u+τ}), is the likelihood of y_{t:t+τ} given the training data segment corresponding to M_{u:u+τ}. It is computed by the following equation.
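Equation (3) is not reproduced in this text. Under the frame-independence assumption stated next, a consistent reconstruction is a product of per-frame Gaussian likelihoods, pairing input frame t+k with training frame u+k:

```latex
\[
p\!\left(y_{t:t+\tau} \mid M_{u:u+\tau}\right)
= \prod_{k=0}^{\tau} g\!\left(y_{t+k} \mid m_{u+k}\right)
\tag{3}
\]
```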

For simplicity, adjacent frames are assumed to be independent. The first term of the denominator of equation (2) is the sum of p(y_{t:t+τ} | M_{u':u'+τ}) over every time frame u' in the training data taken as a starting point. The second term of the denominator is the likelihood of y_{t:t+τ} given the Gaussian mixture model g, computed by the following equation.
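This equation is also missing from the text. With independent frames and the mixture weights w(m) defined earlier, a consistent reconstruction is:

```latex
\[
p\!\left(y_{t:t+\tau} \mid g\right)
= \prod_{k=0}^{\tau} \sum_{m=1}^{Q} w(m)\, g\!\left(y_{t+k} \mid m\right)
\]
```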

The segment search procedure in the matching unit 103 is now described more concretely. First, the maximum segment length is limited to (τ_lim + 1) frames; for example, limiting it to 30 frames means τ_lim = 29. Under this limit, set τ = 0 (segment length 1) and find the length-1 segment with the maximum posterior probability according to equation (2). Next, set τ = 1 (segment length 2) and find the length-2 segment with the maximum posterior probability according to equation (2). Repeat this up to τ = τ_lim, and finally pick, from the candidates of different lengths found in this way, the segment with the maximum posterior probability. The τ of that segment is τ_max.
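The procedure above can be sketched as an exhaustive search over the linear case model. The sketch works in the log domain and assumes the posterior has the form described for equation (2), with the segment likelihood taken as a sum of per-frame log-likelihoods and equal priors. The names `best_segment`, `frame_ll`, and `gmm_ll` are hypothetical, and indices are 0-based.

```python
import numpy as np

def best_segment(frame_ll, gmm_ll, model, t, tau_lim):
    """Exhaustive search for the maximum-posterior segment.

    frame_ll: (T, Q) array, frame_ll[k, m] = log g(y_k | m)
    gmm_ll:   (T,) array, log-likelihood of y_k under the whole mixture
    model:    (I,) integer array of Gaussian indices m_i
    Returns (log posterior, start frame u, tau).
    """
    T, I = frame_ll.shape[0], len(model)
    seg_ll = np.zeros(I)   # running log-likelihood of the segment starting at u
    gmm_cum = 0.0          # running log-likelihood under the mixture model
    best = (-np.inf, None, None)
    for tau in range(min(tau_lim, T - 1 - t, I - 1) + 1):
        n = I - tau        # start points u whose segment still fits
        # Extend every candidate by one frame.
        seg_ll = seg_ll[:n] + frame_ll[t + tau, model[tau:tau + n]]
        gmm_cum += gmm_ll[t + tau]
        # Denominator: sum over all start points plus the mixture term.
        log_den = np.logaddexp(np.logaddexp.reduce(seg_ll), gmm_cum)
        u = int(np.argmax(seg_ll))
        log_post = seg_ll[u] - log_den
        if log_post > best[0]:
            best = (log_post, u, tau)
    return best
```

Every value of τ touches all remaining start points u, so the cost per input position is on the order of I × (τ_lim + 1) likelihood lookups.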

This segment search in the matching unit 103 can be carried out on the case model M, which, as shown in FIG. 3, can be represented as a linear memory of I frames.

Non-Patent Document 1: J. Ming, R. Srinivasan, and D. Crooke, “A Corpus-Based Approach to Speech Enhancement From Nonstationary Noise,” IEEE Trans. On Acoustics, Speech and Signal Processing, 19(4), pp. 822-836, 2011.

Patent Document 1: JP 2013-37174 A

In the conventional signal processing apparatus, the computational cost of searching for the segment closest to the input signal in the matching unit 103 can be high. This is clear from the number of segment candidates. For example, there are I candidates of segment length 1, one for each frame of the training data, and as noted above even a mere one hour of training data gives the enormous number I = 3.5 × 10^5.

An object of the present invention is to provide a signal processing apparatus, method, and program in which the computational cost of the matching unit is lower than in the conventional art.

A signal processing apparatus according to one aspect of the present invention includes a case model storage unit and a matching unit. The case model storage unit stores segments, each a sequence of indices of the Gaussian distributions in a Gaussian mixture distribution that give the maximum likelihood for the feature of each frame of a predetermined signal, with the portion covering at least a predetermined number of frames from each starting frame represented as a tree structure. The matching unit searches for the segment that gives the maximum posterior probability for the feature sequence of an input signal, taking as candidates segments of arbitrary length that start at the root node of the stored tree structure. With the input signal divided in two, the first part being the first-half signal and the second part the second-half signal, the posterior probability in the matching is expressed using the likelihood of the first-half signal evaluated with a segment of the corresponding length and the likelihood of the second-half signal evaluated with the Gaussian mixture model.

The computational cost of the matching unit can be reduced compared with the conventional art.

FIG. 1: Block diagram illustrating an example of the signal processing apparatus.
FIG. 2: Flowchart illustrating an example of the signal processing method.
FIG. 3: Diagram illustrating an example of segments.
FIG. 4: Diagram illustrating an example of the case model generation apparatus.
FIG. 5: Diagram illustrating the structuring of segments.
FIG. 6: Diagram illustrating the structuring of segments.
FIG. 7: Diagram illustrating segment evaluation by equation (7).

Embodiments of the signal processing apparatus and method are described below with reference to the drawings.

[First embodiment]
Like the conventional signal processing apparatus, the signal processing apparatus according to the first embodiment includes, as illustrated in FIG. 1, a Fourier transform unit 101, a feature generation unit 102, a matching unit 103, a speech enhancement filtering unit 104, and a case model storage unit 105.

The description below focuses on the matching unit 103, which is the part that differs from the conventional art. The Fourier transform unit 101, the feature generation unit 102, and the speech enhancement filtering unit 104 are the same as the corresponding units of the conventional signal processing apparatus, so their description is not repeated.

To solve the problem of the conventional method, namely the high computational cost of the segment search in the matching unit 103, the present invention uses a structured representation of the segments contained in the case model. That is, a case model whose segments are represented in structured form is stored in the case model storage unit 105.

First, the case model M of FIG. 3 is represented, as shown in FIG. 5, in units of (τ_lim + 1) frames, the maximum segment length. With j an integer of 1 or more, segment j in FIG. 5 means the segment consisting of the cells from the j-th frame to the (j + τ_lim + 1)-th frame.

What FIG. 5 shows is that, for example, although there are I candidates of segment length 1, there are effectively only Q distinct kinds, where Q is the number of Gaussian mixture components. In general Q << I, that is, Q is much smaller than I. To reduce the computational cost, the structure of FIG. 5 is therefore represented as a tree, as in FIG. 6, in which nodes that can be shared from the beginning of the segment candidates are shared.

In FIG. 5, at segment length 1, segments 2, 3, and 4 are all represented by Gaussian index 7, so they are represented by a single node as shown in FIG. 6. Segments 2 and 3 are also represented by the same Gaussian index sequence {7, 7} at segment length 2, so they are represented by the same node sequence. Repeating this for all segment candidates from segment length 1 to segment length τ_lim + 1 completes the tree representation of the segments. The segments of the case model are stored in the case model storage unit 105 in this tree representation.
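The construction described above amounts to building a prefix tree (trie) over the index sequences that start at each frame. The following is a minimal sketch; the nested-dictionary layout, the `starts` list, and the function name are assumptions made for illustration.

```python
def build_segment_tree(model, max_len):
    """Build a prefix tree over the segments starting at every frame.

    model:   sequence of Gaussian indices m_i
    max_len: maximum segment length (tau_lim + 1)
    Each node maps a Gaussian index to a child node; "starts" lists the
    0-based start frames u whose segment passes through that node.
    """
    root = {"children": {}, "starts": []}
    for u in range(len(model)):
        node = root
        for m in model[u:u + max_len]:
            node = node["children"].setdefault(
                m, {"children": {}, "starts": []})
            node["starts"].append(u)
    return root

# Made-up index sequence in which frames 2, 3 and 4 (1-based) all have
# index 7, mirroring the situation described for FIG. 5.
tree = build_segment_tree([3, 7, 7, 7, 2, 5], max_len=3)
node7 = tree["children"][7]
print(node7["starts"])                 # start frames sharing the node "7"
print(node7["children"][7]["starts"])  # start frames sharing "7, 7"
```

Keeping the start frames at each node lets a search result be mapped back to a position in the training data, which is needed to look up the corresponding clean-speech amplitude spectra.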

In other words, the case model storage unit 105 stores segments, each a sequence of indices of the Gaussian distributions in the Gaussian mixture distribution that give the maximum likelihood for the feature of each frame of a predetermined signal, with the first τ_lim + 1 frames from each starting frame represented as a tree structure.

The matching unit 103 searches for segments by referring to the tree-structured case model. With the tree representation of FIG. 6, the likelihood computation of equation (3) is shared while the segment length is short, so the computational cost is greatly reduced compared with the case without the structuring of FIG. 6.

As an example, consider segment length 1. When the case model illustrated in FIG. 5 is searched by the conventional method, the same computation for Gaussian index 7 is performed three times in evaluating segments 2, 3, and 4. In an implementation, the value could be stored in memory when Gaussian index 7 is first computed for segment 2 and then looked up when segments 3 and 4 are evaluated, but the lookups themselves become costly when they are numerous.

In contrast, when the search uses the case model represented by the tree structure of FIG. 6, computing Gaussian index 7 just once evaluates segments 2, 3, and 4 at the same time. At segment length 1 in particular, the number of computations drops from I to Q, which greatly reduces the computational cost. The advantage of the tree representation of FIG. 6 grows as the number Q of Gaussian distributions in the Gaussian mixture model g increases and as the amount of training data increases (in other words, as the number of frames I increases).
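The sharing of likelihood computations can be illustrated with a depth-first walk over such a tree, in which each node costs exactly one addition regardless of how many training segments pass through it. This is a sketch with hypothetical names and a made-up index sequence; the compact tree here omits the start-frame bookkeeping a full implementation would need.

```python
def build_tree(model, max_len):
    """Compact prefix tree: each node is a dict from index to child node."""
    root = {}
    for u in range(len(model)):
        node = root
        for m in model[u:u + max_len]:
            node = node.setdefault(m, {})
    return root

def score_tree(node, frame_ll, t, prefix_ll=0.0, path=(), out=None):
    """Accumulate segment log-likelihoods along every root-to-node path.

    frame_ll[k][m] is log g(y_k | m). Returns a dict that maps each
    distinct index path to its log-likelihood; its size equals the
    number of per-node evaluations performed.
    """
    out = {} if out is None else out
    if t >= len(frame_ll):
        return out
    for m, child in node.items():
        ll = prefix_ll + frame_ll[t][m]  # one evaluation per tree node
        out[path + (m,)] = ll
        score_tree(child, frame_ll, t + 1, ll, path + (m,), out)
    return out

model = [3, 7, 7, 7, 2, 5]
frame_ll = [[-float(m) for m in range(8)] for _ in range(2)]
scores = score_tree(build_tree(model, max_len=1), frame_ll, t=0)
print(len(scores), "evaluations instead of", len(model))
```

At segment length 1 the walk performs one evaluation per distinct Gaussian index that occurs in the model, rather than one per training frame.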

In other words, the conventional matching unit 103 searched individually over partial segments of arbitrary length contained in the case model. The matching unit 103 of this embodiment instead searches over segments of arbitrary length that start at the root node of the tree representation, which reduces the number of candidate segments to be searched. The posterior probability of equation (2) is then computed for each candidate, and the segment with the maximum posterior probability is found.

In this way, the matching unit 103 searches for the segment that gives the maximum posterior probability for the feature sequence of the input signal, taking as candidates segments of arbitrary length that start at the root node of the tree structure stored in the case model storage unit 105 (step S3, FIG. 2). Information about the segment found is output to the speech enhancement filtering unit 104.

This reduces the computational cost of the matching unit 103 compared with the conventional art. Case search therefore becomes faster than with the conventional method, and as a result speech enhancement can be performed faster as well.

(Modification of the first embodiment)
Representing the segment candidates as a tree as in FIG. 6 greatly reduces the computational cost of the segment search in the matching unit 103, but it can introduce a problem: the tree representation requires an enormous number of nodes and consumes a large amount of memory.

Consider the relationship between segment length and the number of distinct segments. At segment length 1, the number of distinct segments is Q (Q << I), as noted above. At segment length 2, Q^2 distinct segments are theoretically possible. The actual number is smaller than Q^2, but it is easy to see that the number of distinct segments grows rapidly with segment length. For example, with Q = 4096, once the segment length reaches 10 the number of distinct segments is almost equal to its upper limit I. The tree representation therefore reduces the computational cost substantially only for the first few frames.
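The growth in the number of distinct segments can be checked on any index sequence by counting distinct windows of each length. The sketch below uses a uniformly random sequence with small, made-up values of Q and I purely for illustration; real speech index sequences are far from uniform, so the saturation point will differ.

```python
import random

def distinct_segments_by_length(model, max_len):
    """Number of distinct index sequences for each length 1..max_len."""
    return [len({tuple(model[u:u + n])
                 for u in range(len(model) - n + 1)})
            for n in range(1, max_len + 1)]

rng = random.Random(0)
Q, I = 16, 5000  # illustrative values, much smaller than in the text
model = [rng.randrange(Q) for _ in range(I)]
for n, count in enumerate(distinct_segments_by_length(model, 6), start=1):
    print(f"length {n}: {count} distinct segments (at most {I - n + 1})")
```

Each count is also the number of tree nodes at that depth, so the lengths at which the count approaches I are the depths at which the tree no longer saves computation and only costs memory.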

In the modification of the signal processing apparatus of the first embodiment, therefore, only the first few frames of the segments of the case model stored in the case model storage unit 105 are represented as a tree, and no structured representation is used beyond that.

Beyond the first few frames, the likelihood computation is thus performed in the conventional way, on the case model M that can be represented as a linear memory of I frames as shown in FIG. 3. This both reduces the computational cost and avoids the increase in memory consumption.

In this way, segments, each of which is a sequence of indices of the Gaussian distributions in the Gaussian mixture distribution that give the maximum likelihood to the feature amounts of the respective frames of a predetermined signal, may be stored in the case model storage unit 105 with only the portion of at least a predetermined number of frames from the head expressed in a tree structure.
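A minimal sketch of such a depth-limited tree follows. It is one possible data layout, not the layout of the patent: the class `Node`, the function names, and the choice to store start positions at the deepest node reached are all assumptions. Frames beyond `depth` are not structured and would be read from the linear index list directly.

```python
from typing import Dict, List


class Node:
    """One tree node: children keyed by Gaussian index, plus the start
    positions in the linear case model whose prefix ends at this node."""

    def __init__(self) -> None:
        self.children: Dict[int, "Node"] = {}
        self.starts: List[int] = []


def build_truncated_tree(indices: List[int], depth: int) -> Node:
    """Build a prefix tree over the first `depth` frames of every segment.

    indices -- the linear case model M as a list of Gaussian indices
    depth   -- number of leading frames expressed in the tree (assumed name)
    """
    root = Node()
    for u in range(len(indices)):
        node = root
        for m in indices[u:u + depth]:
            node = node.children.setdefault(m, Node())
        # Remember where this prefix occurs so that the remaining frames
        # can be evaluated on the linear model starting at u + depth.
        node.starts.append(u)
    return root


def count_nodes(node: Node) -> int:
    """Count all nodes in the tree, including the root."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


model = [0, 1, 0, 1, 2]  # toy index sequence, illustration only
tree = build_truncated_tree(model, depth=2)
print(count_nodes(tree))
```

Limiting `depth` bounds the node count by the number of distinct prefixes of that length, while the linear list keeps memory for the remaining frames at I entries.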

The other parts of the modification of the signal processing apparatus of the first embodiment are the same as those of the signal processing apparatus of the first embodiment, so redundant description is omitted.

[Second Embodiment]
In the signal processing apparatus of the second embodiment, the matching unit 103 searches for the segment closest to the input signal by evaluating segments of different lengths fairly under a common length measured in frames.

The following description focuses on the differences from the first embodiment. Redundant description of the parts that are the same as in the first embodiment is omitted.

In the matching unit 103 of the second embodiment, instead of equation (3), the likelihood of the feature amount sequence y_{t:t+τ} of an input signal of a predetermined number of frames is calculated using both the case model M and the Gaussian mixture model g. That is, y_{t:t+τ} is divided into y_{t:t+ν} and y_{t+ν+1:t+τ} (0 ≦ ν ≦ τ), and the former is evaluated with M and the latter with g. Specifically, the likelihood of the feature amount sequence y_{t:t+τ} of the input signal is calculated as in the following equation.
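The equation images are not reproduced in this text. A reconstruction consistent with the symbol definitions in the surrounding description might read as follows; it is a sketch, not the source's exact typography or numbering. The second line additionally assumes frame-wise independence, with m_{u+k} the Gaussian index stored at frame u+k of the case model, w(m) the mixture weight of Gaussian m, and k a summation index introduced here for illustration.

```latex
\begin{align}
p(y_{t:t+\tau} \mid M_{u:u+\nu}, \phi_{u+\nu+1:u+\tau})
  &= p(y_{t:t+\nu} \mid M_{u:u+\nu})\,
     p(y_{t+\nu+1:t+\tau} \mid \phi_{u+\nu+1:u+\tau}) \\
  &= \prod_{k=0}^{\nu} g(y_{t+k} \mid m_{u+k})
     \prod_{k=\nu+1}^{\tau} \sum_{m} w(m)\, g(y_{t+k} \mid m)
\end{align}
```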

Here, p(y_{t:t+ν} | M_{u:u+ν}) represents the likelihood of the feature amount sequence y_{t:t+ν} of the input signal given the case model M_{u:u+ν}. p(y_{t+ν+1:t+τ} | φ_{u+ν+1:u+τ}) represents the likelihood of the feature amount sequence y_{t+ν+1:t+τ} of the input signal given the mixture model φ_{u+ν+1:u+τ}, where φ_{u+ν+1:u+τ} is the Gaussian mixture distribution corresponding to frames u+ν+1 through u+τ. p(y_{t:t+τ} | M_{u:u+ν}, φ_{u+ν+1:u+τ}) represents the likelihood of the feature amount sequence y_{t:t+τ} of the input signal given the case model M_{u:u+ν} and the mixture model φ_{u+ν+1:u+τ}.

y_{t:t+ν} is the part of the feature amount sequence y_{t:t+τ} of the input signal whose length corresponds to the case model segment M_{u:u+ν}. In other words, y_{t:t+ν} is the feature amount sequence of the input signal corresponding to frames t through t+ν. y_{t+ν+1:t+τ} is the part of the feature amount sequence y_{t:t+τ} of the input signal that extends beyond the length of the case model segment M_{u:u+ν}. In other words, y_{t+ν+1:t+τ} is the feature amount sequence of the input signal corresponding to frames t+ν+1 through t+τ.

That is, equation (5) treats the input signal to be evaluated as an input signal of a predetermined length (here, τ+1 frames). For the part of its feature amount sequence that can be evaluated with the case model, the likelihood p(y_{t:t+ν} | M_{u:u+ν}) is evaluated with the case model. For the feature amount sequence y_{t+ν+1:t+τ}, which cannot be evaluated with the case model segment M_{u:u+ν} because it extends beyond the length of that segment, the likelihood p(y_{t+ν+1:t+τ} | φ_{u+ν+1:u+τ}) is evaluated with the mixture model g.

In other words, when the input signal is divided in two and the first part is called the first-half partial signal and the second part the second-half partial signal, the likelihood that the matching unit 103 calculates based on equation (4) can be said to integrate the likelihood p(y_{t:t+ν} | M_{u:u+ν}) of the first-half partial signal, evaluated based on a segment of the length corresponding to that signal, with the likelihood p(y_{t+ν+1:t+τ} | φ_{u+ν+1:u+τ}) of the second-half partial signal, evaluated based on the model given by the Gaussian mixture distribution.

The likelihood based on the mixture model g corresponds to something like a likelihood smoothed over the entire model. By substituting this average likelihood for the part that cannot be evaluated with the case model, the method aims to evaluate input signals fairly over an equal number of frames.
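The split evaluation described above can be sketched in the log domain as follows. This is an illustration under stated assumptions, not the patented implementation: features are one-dimensional, the case-model likelihood is taken as a product of the indexed Gaussians over frames, and every function and parameter name is hypothetical.

```python
import math
from typing import List, Sequence


def log_gauss(y: float, mean: float, var: float) -> float:
    """Log density of a one-dimensional Gaussian (1-D is an assumption)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def log_gmm(y: float, weights: Sequence[float], means: Sequence[float],
            variances: Sequence[float]) -> float:
    """log sum_m w(m) g(y | m), computed with the log-sum-exp trick."""
    terms = [math.log(w) + log_gauss(y, mu, v)
             for w, mu, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def hybrid_log_likelihood(y: Sequence[float], model: List[int], u: int,
                          nu: int, weights: Sequence[float],
                          means: Sequence[float],
                          variances: Sequence[float]) -> float:
    """Log-likelihood of y (tau + 1 frames) with frames 0..nu scored by the
    case-model Gaussians model[u..u+nu] and the rest by the whole mixture."""
    total = 0.0
    for k, obs in enumerate(y):
        if k <= nu:
            m = model[u + k]  # Gaussian index stored in the case model
            total += log_gauss(obs, means[m], variances[m])
        else:
            total += log_gmm(obs, weights, means, variances)
    return total


# Toy two-component mixture and a two-frame input, illustration only.
w, mu, var = [0.5, 0.5], [0.0, 10.0], [1.0, 1.0]
case_model = [0, 1, 0]
obs = [0.0, 10.0]
full = hybrid_log_likelihood(obs, case_model, 0, 1, w, mu, var)
short = hybrid_log_likelihood(obs, case_model, 0, 0, w, mu, var)
print(full > short)
```

In this toy case the input matches the case model on both frames, so scoring the second frame with its own Gaussian yields a higher value than falling back to the mixture.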

Using this likelihood of y_{t:t+τ}, the matching unit 103 obtains the segment M^t_{u:u+ν_max} that best matches y_{t:t+τ} according to the following equations (6) and (7), where t, τ, u, ν, u′ and ν′ are integers.
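Equations (6) and (7) are not reproduced in this text. Based on the descriptions of their roles in the surrounding paragraphs, they can plausibly be written as below. This reconstruction assumes a uniform prior over candidates, so the posterior is the likelihood normalized by its sum over all start points u′ and split points ν′. The exact form in the source may differ.

```latex
\begin{align}
M^{t}_{u:u+\nu_{\max}}
  &= \operatorname*{arg\,max}_{u,\,\nu}\;
     p(M_{u:u+\nu}, \phi_{u+\nu+1:u+\tau} \mid y_{t:t+\tau}) \\
p(M_{u:u+\nu}, \phi_{u+\nu+1:u+\tau} \mid y_{t:t+\tau})
  &= \frac{p(y_{t:t+\tau} \mid M_{u:u+\nu}, \phi_{u+\nu+1:u+\tau})}
          {\sum_{u'} \sum_{\nu'}
           p(y_{t:t+\tau} \mid M_{u':u'+\nu'}, \phi_{u'+\nu'+1:u'+\tau})}
\end{align}
```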

Here, the denominator of equation (7) is the sum of p(y_{t:t+τ} | M_{u′:u′+ν′}, φ_{u′+ν′+1:u′+τ}) over every start point u′ in the training data and every split point ν′ of y_{t:t+τ}.

As shown in equations (4) and (5) above, the posterior probability p(M_{u:u+ν}, φ_{u+ν+1:u+τ} | y_{t:t+τ}) defined by equation (7) is expressed using two likelihoods, where the first half of the input signal divided in two is called the first-half partial signal and the second half the second-half partial signal: the likelihood p(y_{t:t+ν} | M_{u:u+ν}) of the first-half partial signal, evaluated based on a segment of the length corresponding to that signal, and the likelihood p(y_{t+ν+1:t+τ} | φ_{u+ν+1:u+τ}) of the second-half partial signal, evaluated based on the model given by the Gaussian mixture distribution.

As in the conventional method, the maximum segment length is limited to (τ_lim + 1) frames. For example, if the maximum segment length is limited to 30 frames, τ_lim = 29. FIG. 7 illustrates the segment evaluation by equation (7) under this limit. As the figure shows, in this embodiment segments of every length are evaluated fairly under the common length of (τ_lim + 1) frames. Seen another way, this embodiment searches for the optimum segment length (ν_max) and the segment start point (u) simultaneously.
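The joint search over start point and segment length can be sketched as a brute-force loop on the linear case model. Because the denominator of equation (7) is shared by all candidates, the sketch maximizes the log-likelihood directly. The inputs are assumed to be precomputed per-frame log-likelihoods, and all names are hypothetical.

```python
from typing import List, Tuple


def best_segment(frame_ll: List[List[float]], gmm_ll: List[float],
                 model: List[int]) -> Tuple[int, int, float]:
    """Return (u, nu, score) maximizing the split log-likelihood.

    frame_ll[k][m] -- log g(y_{t+k} | m) for input frame k and Gaussian m
    gmm_ll[k]      -- log sum_m w(m) g(y_{t+k} | m) for input frame k
    model          -- linear case model: Gaussian index per training frame
    """
    n = len(gmm_ll)  # tau_lim + 1 input frames, the common evaluation length
    # tail[k] is the mixture score of input frames k..n-1 (tail[n] == 0).
    tail = [0.0] * (n + 1)
    for k in range(n - 1, -1, -1):
        tail[k] = tail[k + 1] + gmm_ll[k]

    best = (-1, -1, float("-inf"))
    for u in range(len(model)):
        head = 0.0  # case-model score of input frames 0..nu
        for nu in range(min(n, len(model) - u)):
            head += frame_ll[nu][model[u + nu]]
            score = head + tail[nu + 1]  # every candidate covers n frames
            if score > best[2]:
                best = (u, nu, score)
    return best


# Toy values, illustration only: three Gaussians, two input frames.
toy_frame_ll = [[-1.0, -5.0, -5.0], [-5.0, -1.0, -5.0]]
toy_gmm_ll = [-2.0, -2.0]
print(best_segment(toy_frame_ll, toy_gmm_ll, [0, 1, 2]))
```

Every candidate is scored over the same n frames, which is what makes short and long segments directly comparable.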

In the following, it is shown that the posterior probability of equation (7) according to the present invention, like the posterior probability of equation (2) of the conventional method, has the property that when y_{t:t+τ} and M_{u:u+τ} match relatively well, a longer τ gives a higher posterior probability. To this end, the posterior probabilities are compared between the case where y_{t:t+τ} is divided into y_{t:t+ν} and y_{t+ν+1:t+τ}, with the former evaluated with M and the latter with g (equation (4)), and the case where y_{t:t+τ} is divided into y_{t:t+ν-1} and y_{t+ν:t+τ}, with the former evaluated with M and the latter with g.

As is clear from equation (7), the denominator is the same in both cases, so from equation (4) the ratio of the two posterior probabilities is equal to the following likelihood ratio.

Here, assume that y_{t+ν} matches m_{u+ν} well. In this case, the denominator of equation (8) can be approximated by w(m_{u+ν}) g(y_{t+ν} | m_{u+ν}), so equation (8) is approximately equal to 1 / w(m_{u+ν}). Since w(m_{u+ν}) is 1 or less, equation (8) is 1 or more. This shows that when y_{t:t+τ} and M_{u:u+τ} match relatively well, the longer τ is, the higher the posterior probability calculated by equation (7) becomes.
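The argument can be written out as a worked ratio. The following is a reconstruction of equation (8), whose image is not reproduced here, under the assumption that the case-model likelihood is a product of per-frame Gaussians g(y | m) and the mixture likelihood is a per-frame sum over m of w(m) g(y | m). All factors for frames other than t+ν cancel between numerator and denominator.

```latex
\begin{align}
\frac{p(y_{t:t+\tau} \mid M_{u:u+\nu}, \phi_{u+\nu+1:u+\tau})}
     {p(y_{t:t+\tau} \mid M_{u:u+\nu-1}, \phi_{u+\nu:u+\tau})}
  &= \frac{g(y_{t+\nu} \mid m_{u+\nu})}
          {\sum_{m} w(m)\, g(y_{t+\nu} \mid m)} \\
  &\approx \frac{g(y_{t+\nu} \mid m_{u+\nu})}
                {w(m_{u+\nu})\, g(y_{t+\nu} \mid m_{u+\nu})}
   = \frac{1}{w(m_{u+\nu})} \;\ge\; 1
\end{align}
```

The approximation in the second line holds when the term for m_{u+ν} dominates the mixture sum, which is the "matches well" assumption stated above.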

When the likelihood calculation is performed according to equations (6) and (7), a transition to the Gaussian mixture model node is allowed from every node in the second through τ_lim-th layers of the segment tree structure of FIG. 6. By searching, as candidates, segments of arbitrary length starting from the root node, the likelihood calculation of equation (6) can be performed at high speed.
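One way to picture this search is a traversal of the prefix tree in which every visited node may hand the remaining input frames over to the mixture model. The sketch below is an assumption-laden illustration rather than the structure of FIG. 6: the tree is a nested dict, the inputs are precomputed per-frame log-likelihoods, and all names are hypothetical.

```python
from typing import Dict, List, Tuple


def build_tree(model: List[int], depth: int) -> Dict[int, dict]:
    """Nested-dict prefix tree over the first `depth` frames of each segment."""
    root: Dict[int, dict] = {}
    for u in range(len(model)):
        node = root
        for m in model[u:u + depth]:
            node = node.setdefault(m, {})
    return root


def search_tree(tree: Dict[int, dict], frame_ll: List[List[float]],
                gmm_ll: List[float]) -> Tuple[List[int], float]:
    """Return the best index prefix and its score.

    frame_ll[k][m] -- log g(y_{t+k} | m); gmm_ll[k] -- mixture log-likelihood.
    At each node the remaining frames are scored by the mixture model,
    which plays the role of the transition to the mixture-model node.
    """
    n = len(gmm_ll)
    tail = [0.0] * (n + 1)  # tail[k]: mixture score of frames k..n-1
    for k in range(n - 1, -1, -1):
        tail[k] = tail[k + 1] + gmm_ll[k]

    best: Tuple[List[int], float] = ([], float("-inf"))
    stack = [(tree, [], 0.0)]  # (node, index path so far, case-model score)
    while stack:
        node, path, head = stack.pop()
        k = len(path)
        if k >= n:
            continue  # no input frames left to match
        for m, child in node.items():
            new_head = head + frame_ll[k][m]
            score = new_head + tail[k + 1]
            if score > best[1]:
                best = (path + [m], score)
            stack.append((child, path + [m], new_head))
    return best


toy_tree = build_tree([0, 1, 2], depth=2)
result = search_tree(toy_tree, [[-1.0, -5.0, -5.0], [-5.0, -1.0, -5.0]],
                     [-2.0, -2.0])
print(result)
```

Each distinct prefix is scored once no matter how many start points in the training data share it, which is where the saving over the linear scan comes from.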

[Modifications, etc.]
The present invention can also be extended, as described in Non-Patent Document 1, to the case where case models of a plurality of noise/reverberation environments are considered and to the case where time warping is considered at the time of matching.

The processes described for the above signal processing apparatus and method need not be executed in time series in the order described. They may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed.

When each unit of the signal processing apparatus is realized by a computer, the processing contents of the functions that each unit should have are described by a program. Each unit is then realized on the computer by executing this program on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

Each processing means may be configured by executing a predetermined program on a computer, or at least part of the processing contents may be realized in hardware.

Needless to say, other modifications can be made as appropriate without departing from the spirit of the present invention.

101 Fourier transform unit
102 Feature amount generation unit
103 Matching unit
104 Speech enhancement filtering unit
105 Case model storage unit

Claims

A signal processing apparatus comprising:
a case model storage unit that stores segments, each of which is a sequence of indices of the Gaussian distributions in a Gaussian mixture distribution that give the maximum likelihood to the feature amounts of the respective frames of a predetermined signal, with only the portion of at least a predetermined number of frames from the head expressed in a tree structure; and
a matching unit that searches for the segment giving the maximum posterior probability for the feature amount sequence of an input signal, using as candidates segments of arbitrary length starting from the root node of the tree structure stored in the case model storage unit,
wherein, with the first half of the input signal divided in two taken as a first-half partial signal and the second half as a second-half partial signal,
the posterior probability in the matching unit is expressed using the likelihood of the first-half partial signal evaluated based on a segment of the length corresponding to the first-half partial signal and the likelihood of the second-half partial signal evaluated based on the model given by the Gaussian mixture distribution.

A signal processing method in which a case model storage unit stores segments, each of which is a sequence of indices of the Gaussian distributions in a Gaussian mixture distribution that give the maximum likelihood to the feature amounts of the respective frames of a predetermined signal, with only the portion of at least a predetermined number of frames from the head expressed in a tree structure, the method comprising:
a matching step in which a matching unit searches for the segment giving the maximum posterior probability for the feature amount sequence of an input signal, using as candidates segments of arbitrary length starting from the root node of the tree structure stored in the case model storage unit,
wherein, with the first half of the input signal divided in two taken as a first-half partial signal and the second half as a second-half partial signal,
the posterior probability in the matching step is expressed using the likelihood of the first-half partial signal evaluated based on a segment of the length corresponding to the first-half partial signal and the likelihood of the second-half partial signal evaluated based on the model given by the Gaussian mixture distribution.

A program for causing a computer to function as each unit of the signal processing apparatus according to claim 1.