JP2012181233A

JP2012181233A - Speech enhancement device, method and program

Info

Publication number: JP2012181233A
Application number: JP2011042116A
Authority: JP
Inventors: Hiroshi Saruwatari (猿渡 洋); Ryu Wakisaka (脇坂 龍); Tomoya Takatani (高谷 智哉)
Original assignee: Nara Institute of Science and Technology NUC; Toyota Motor Corp
Current assignee: Nara Institute of Science and Technology NUC; Toyota Motor Corp
Priority date: 2011-02-28
Filing date: 2011-02-28
Publication date: 2012-09-20
Anticipated expiration: 2031-02-28
Also published as: JP5687522B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech enhancement device, a speech enhancement method, and a speech enhancement program that can effectively suppress noise and enhance speech. SOLUTION: A speech enhancement device includes an observation-signal cumulant estimation section 21 that estimates the cumulant of an observation signal; a noise estimation section 11 that estimates a noise component; an estimated-noise cumulant estimation section 22 that estimates the cumulant of the estimated noise; a speech-component cumulant estimation section 24 that estimates the cumulant of the speech component from the cumulant of the observation signal and the cumulant of the estimated noise; a kurtosis estimation section 23 that estimates the kurtosis of the speech component from the cumulant of the speech component; a subtraction coefficient calculation section that calculates a subtraction coefficient based on the kurtosis of the speech component; and a noise subtraction section 12 that subtracts noise from the observation signal using the subtraction coefficient calculated by the subtraction coefficient calculation section.

Description

The present invention relates to a speech enhancement device, a speech enhancement method, and a speech enhancement program for enhancing the speech in an observation signal that contains a noise component and a speech component.

In recent years, as applications that use speech have multiplied, demand has grown for extracting only the target speech from a noisy environment. For example, suppose a speaker speaks in the environment shown in FIG. 3. The utterance is picked up by the microphone 1, together with the ambient noise. The observation signal X(f,t) acquired by the microphone 1 therefore contains the target speech signal S(f,t) and the noise signal N(f,t); that is, X(f,t) = S(f,t) + N(f,t).

Noise estimation is then performed on the acquired observation signal X(f,t): an estimated noise signal (estimated noise spectrum) is derived from X(f,t). In FIG. 1, the estimated noise signal is written as N(f,t) with a hat, which denotes an estimate. Subtracting noise using the estimated noise signal yields the output signal Y(f,t).

Two specific noise estimation methods are as follows. The first estimates the intervals in which the user is not speaking. This method assumes that the noise is stationary; the intervals are classified using kurtosis, a power threshold, or the like, and the estimated noise spectrum is calculated from them.

The second method uses a microphone array. It assumes that the sound radiated by the user is the point source closest to the microphones. A null is formed in the direction of the user, and the estimated noise spectrum is calculated.

Noise is subtracted using the noise spectrum estimated in this way. Most nonlinear noise suppression processes apply a filter coefficient H(f,t) to the observation signal X(f,t) after it has been converted into the time-frequency domain. Specifically, the output signal Y(f,t) is obtained by the following equation (1).

Y(f,t) = H(f,t) X(f,t)   (1)

The design of the filter coefficient H(f,t) differs from method to method, but in each case H(f,t) is generated from the observation signal X(f,t), the estimated noise signal, and a subtraction coefficient β. Specific design methods include (a) the spectral subtraction (SS) method, (b) the generalized spectral subtraction (GSS) method, (c) the Wiener filter (WF) method, and (d) the parametric Wiener filter (PWF) method. Their filter coefficients H(f,t) are given by equations (2) to (5), respectively.

(a) SS method: equation (2)
(b) GSS method: equation (3)
(c) WF method: equation (4)
(d) PWF method: equation (5)
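As an illustration only, the four filter families named above can be sketched with commonly used textbook gain functions. The exact forms of equations (2) to (5) are not reproduced in this text, so the formulas below, the function name `suppression_gain`, the exponent parameter `n`, and the `floor` parameter are assumptions rather than the forms used in this specification.

```python
import numpy as np

def suppression_gain(method, x_pow, n_pow, beta=1.0, n=1.0, floor=0.0):
    """Textbook-style gain H(f,t) from observed power |X|^2 and estimated noise power |N|^2.

    These are assumed, commonly used forms, not the specification's equations (2)-(5).
    """
    x_pow = np.asarray(x_pow, dtype=float)
    n_pow = np.asarray(n_pow, dtype=float)
    eps = 1e-12  # guards against division by zero
    if method == "ss":    # spectral subtraction in the power domain
        return np.sqrt(np.maximum(1.0 - beta * n_pow / (x_pow + eps), floor))
    if method == "gss":   # generalized SS with amplitude exponent 2n
        ratio = (n_pow / (x_pow + eps)) ** n
        return np.maximum(1.0 - beta * ratio, floor) ** (1.0 / (2.0 * n))
    # Rough speech power estimate shared by the Wiener-type gains
    s_pow = np.maximum(x_pow - n_pow, 0.0)
    if method == "wf":    # Wiener filter (beta is not used in this textbook form)
        return s_pow / (s_pow + n_pow + eps)
    if method == "pwf":   # parametric Wiener filter, beta scales the noise term
        return s_pow / (s_pow + beta * n_pow + eps)
    raise ValueError(f"unknown method: {method}")

# Equation (1): the output is the gain applied to the observation.
X = np.array([2.0 + 0.0j])            # one time-frequency bin, |X|^2 = 4
H = suppression_gain("ss", np.abs(X) ** 2, 1.0, beta=1.0)
Y = H * X
```

In these assumed forms, a larger β gives a smaller gain, that is, stronger suppression.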

In methods (a) to (d) above, noise suppression performance and sound quality vary with the setting of the subtraction coefficient β. FIG. 4 shows a simulation of the relationship between the subtraction coefficient and each measure. As FIG. 4 shows, a large β gives higher suppression performance but lower sound quality, whereas a small β gives lower suppression performance but higher sound quality.

In a real environment, the mixing ratio of noise and speech differs from frequency to frequency, so the optimum value of the subtraction coefficient β varies. Moreover, because the mixing ratio is unknown in a real environment, even a graph like FIG. 4 cannot be drawn. It is therefore difficult to determine the optimum subtraction coefficient β.

Patent Document 1: JP 2000-330597 A
Patent Document 2: JP 2007-330597 A

Non-Patent Document 1: EUSIPCO 2010, pp. 994-998

Patent Document 1, Patent Document 2, and Non-Patent Document 1 disclose other methods of enhancing speech by suppressing noise. Patent Document 1 estimates the S/N ratio of the input speech signal and provides a subtraction coefficient data table that stores a plurality of subtraction coefficients for controlling the amount of noise suppression. The subtraction coefficient is selected from this table according to the S/N ratio.

Patent Document 2 calculates the signal-to-noise ratio (SNR) of the input signal in each frequency bin. When the SNR is low (the bin is judged to contain much noise and little speech), the subtraction coefficient is corrected to increase the amount subtracted, so the input signal is suppressed more strongly. When the SNR is high (the bin is judged to contain little noise and to be dominated by speech), the subtraction coefficient is corrected to reduce the amount subtracted, so the input signal is suppressed less.

However, in a spoken dialogue system, calculating the correction coefficients or the subtraction coefficient data table requires installing the system in the environment where it will operate. Noise and speech data must be measured in that environment in advance, the speech recognition rate must be calculated for each subtraction coefficient value, and the table values and correction coefficients must then be determined. Such advance processing is difficult for an actual product. In addition, whenever equipment such as the microphone or the AD converter is changed, the table values must be determined again. If subtraction coefficients set in a different environment are used instead, they are not optimal, so noise components are over-subtracted or under-subtracted. This degrades the speech component or leaves residual noise (musical noise), which in turn lowers the speech recognition rate and degrades sound quality.

Non-Patent Document 1 automatically estimates the noise reduction rate (NRR) achieved by the processing and calculates the change in the distribution shape of the noise interval before and after processing as a "kurtosis ratio". The subtraction coefficient is selected adaptively so that the kurtosis ratio stays at or below a set value, which controls over-subtraction and under-subtraction in non-speech intervals.

However, Non-Patent Document 1 controls over-subtraction and under-subtraction only in non-speech intervals; it does not evaluate them in speech intervals. Over-subtraction or under-subtraction may therefore occur in the speech intervals, which are the actual target of speech recognition. Thus, with Patent Document 1, Patent Document 2, and Non-Patent Document 1, it is difficult to enhance speech effectively.

The present invention has been made in view of the above problems, and its object is to provide a speech enhancement device, a speech enhancement method, and a speech enhancement program that can effectively enhance speech.

A speech enhancement device according to one aspect of the present invention enhances speech in an observation signal acquired by a microphone unit, and includes: a first cumulant estimation unit that estimates the cumulant of the observation signal, which contains a noise component and a speech component; a noise estimation unit that estimates the noise component contained in the observation signal; a second cumulant estimation unit that estimates the cumulant of the estimated noise produced by the noise estimation unit; a third cumulant estimation unit that estimates the cumulant of the speech component based on the cumulant of the observation signal and the cumulant of the estimated noise; a first kurtosis estimation unit that estimates the kurtosis of the speech component based on the cumulant of the speech component; a subtraction coefficient adaptation unit that calculates a subtraction coefficient based on the kurtosis of the speech component; and a noise subtraction unit that subtracts noise from the observation signal using the subtraction coefficient calculated by the subtraction coefficient adaptation unit.

The above speech enhancement device may further include a fourth cumulant estimation unit that estimates the cumulant of the output signal from the noise subtraction unit, and a kurtosis estimation unit that estimates the kurtosis of the output signal based on the cumulant of the output signal, and the subtraction coefficient adaptation unit may calculate the subtraction coefficient based on the kurtosis of the output signal.

In the above speech enhancement device, the cumulant of the speech component may be estimated based on the difference between the cumulant of the observation signal and the cumulant of the estimated noise.

In the above speech enhancement device, the microphone unit may include a microphone array having a plurality of microphones, and the noise estimation unit may estimate the estimated noise by microphone array processing.

A speech enhancement method according to one aspect of the present invention enhances speech in an observation signal acquired by a microphone unit, and includes the steps of: calculating the cumulant of the observation signal, which contains a noise component and a speech component; estimating the noise contained in the observation signal; calculating the cumulant of the estimated noise; calculating the cumulant of the speech component based on the cumulant of the observation signal and the cumulant of the estimated noise; estimating the kurtosis of the speech component based on the cumulant of the speech component; calculating a subtraction coefficient based on the kurtosis of the speech component; and subtracting noise from the observation signal using the subtraction coefficient.

The above speech enhancement method may further include a step of calculating the cumulant of the output signal and a step of calculating the kurtosis of the output signal based on the cumulant of the output signal, and the subtraction coefficient may be calculated based on the kurtosis of the output signal and the kurtosis of the speech component.

In the above speech enhancement method, the cumulant of the speech component may be estimated based on the difference between the cumulant of the observation signal and the cumulant of the estimated noise.

In the above speech enhancement method, the microphone unit may include a microphone array having a plurality of microphones, and the estimated noise may be estimated by microphone array processing.

A speech enhancement program according to one aspect of the present invention enhances speech in an observation signal acquired by a microphone unit, and causes a computer to execute the steps of: calculating the cumulant of the observation signal, which contains a noise component and a speech component; estimating the noise contained in the observation signal; calculating the cumulant of the estimated noise; calculating the cumulant of the speech component based on the cumulant of the observation signal and the cumulant of the estimated noise; estimating the kurtosis of the speech component based on the cumulant of the speech component; calculating a subtraction coefficient based on the kurtosis of the speech component; and subtracting noise from the observation signal using the subtraction coefficient.

The above speech enhancement program may further cause the computer to execute a step of calculating the cumulant of the output signal and a step of calculating the kurtosis of the output signal based on the cumulant of the output signal, and may cause the subtraction coefficient to be calculated based on the kurtosis of the output signal and the kurtosis of the speech component.

In the above speech enhancement program, the cumulant of the speech component may be estimated based on the difference between the cumulant of the observation signal and the cumulant of the estimated noise.

In the above speech enhancement program, the microphone unit may include a microphone array having a plurality of microphones, and the estimated noise may be estimated by microphone array processing.

According to the present invention, it is possible to provide a speech enhancement device, a speech enhancement method, and a speech enhancement program that can effectively enhance speech.

FIG. 1 is a block diagram showing the configuration of a speech enhancement device according to Embodiment 1.
FIG. 2 is a block diagram showing the configuration of a speech enhancement device according to Embodiment 2.
FIG. 3 is a diagram showing general noise subtraction processing.
FIG. 4 shows simulation results for the relationship between the subtraction coefficient and performance in noise subtraction processing.

Hereinafter, embodiments of the speech enhancement device according to the present invention will be described in detail with reference to the drawings. The present invention is not, however, limited to the following embodiments. For clarity, the following description and the drawings are simplified as appropriate.

Embodiment 1
First, a speech enhancement device according to Embodiment 1 of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the system configuration of the speech enhancement device. The microphone 1 picks up sound generated in its surroundings and outputs an observation signal based on that sound. The observation signal contains a speech component and a noise component. The speech component is the signal of the speaker's voice, which is the target of speech recognition; the noise component is any signal other than the speaker's voice. The microphone 1 is connected to the speech enhancement device 2, so the observation signal picked up by the microphone 1 is input to the speech enhancement device 2.

The speech enhancement device 2 enhances the speech in the observation signal and outputs the speech-enhanced output signal to the output-side device 3. The output-side device 3 is a speech recognition system, a communication device, or the like, and performs predetermined processing on the output signal. A speech recognition system, for example, performs speech recognition on the output signal.

The speech enhancement device 2 is an arithmetic processing device having a CPU (Central Processing Unit), ROM (Read Only Memory), RAM (Random Access Memory), a communication interface, and the like; more specifically, it is a personal computer (PC) or the like. The speech enhancement device 2 also has a removable HDD, optical disk, magneto-optical disk, or the like, which stores various programs and control parameters and supplies those programs and data to a memory (not shown) or the like as needed. The speech enhancement device 2 is of course not limited to a single physical unit. The speech enhancement device 2 performs speech processing on the sound data collected by the microphone 1.

The speech enhancement device 2 includes a noise estimation unit 11, a noise subtraction unit 12, a kurtosis calculation unit 20, and a subtraction coefficient calculation unit 30. The kurtosis calculation unit 20 includes an observation-signal cumulant estimation unit 21, an estimated-noise cumulant estimation unit 22, a kurtosis estimation unit 23, and a speech-component cumulant estimation unit 24. The subtraction coefficient calculation unit 30 includes a subtraction coefficient adaptor 31, an output-signal cumulant estimation unit 32, and an output-signal kurtosis estimation unit 33.

The observation signal from the microphone 1 is input to the noise estimation unit 11, the noise subtraction unit 12, and the observation-signal cumulant estimation unit 21. The input observation signal X(f,t) has already been converted into a time-frequency-domain signal by preprocessing performed before the speech enhancement processing. Specifically, the observation signal for a predetermined period is stored in a buffer and divided into k frames (k is an integer of 2 or more). Here, the frames are divided with a half shift so that adjacent frames overlap by half in the time domain. A window function may also be used in the frame division. Each frame of the observation signal is then subjected to a discrete Fourier transform, which yields the time-frequency-domain observation signal X(f,t). This preprocessing may be performed by the speech enhancement device 2 or by another device, for example a microphone unit that includes the microphone 1.
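The preprocessing described above (half-shift framing, optional windowing, discrete Fourier transform) can be sketched as follows. This is a minimal illustration under assumptions: the function name `stft_half_shift`, the frame length of 512 samples, and the Hann window are not specified in the text.

```python
import numpy as np

def stft_half_shift(x, frame_len=512):
    """Split x into half-overlapping frames, window them, and apply the DFT.

    Returns X with shape (frequency bins, frames), i.e. X[f, t].
    The frame length and the Hann window are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    hop = frame_len // 2                        # half shift: 50 % overlap
    n_frames = 1 + (len(x) - frame_len) // hop  # k frames
    window = np.hanning(frame_len)
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T        # one-sided spectrum per frame

# Usage: a tone centred on DFT bin 32 of a 512-point frame
n = np.arange(2048)
X = stft_half_shift(np.cos(2 * np.pi * 32 * n / 512))
```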

The noise estimation unit 11 performs noise estimation on the observation signal X(f,t), which generates the estimated noise signal N(f,t). In FIG. 1 the estimated noise signal is written as N(f,t) with a hat denoting an estimate, but in this description it is simplified to N(f,t) where appropriate. The estimated noise signal N(f,t) is input to the noise subtraction unit 12 and the estimated-noise cumulant estimation unit 22. The estimated-noise cumulant estimation unit 22 estimates the cumulant of the estimated noise signal. The observation-signal cumulant estimation unit 21 estimates the cumulant of the observation signal X(f,t). The speech-component cumulant estimation unit 24 estimates the cumulant of the speech component from the cumulant of the observation signal X(f,t) and the cumulant of the estimated noise. Because cumulants are additive, the cumulant of the speech component is given by the difference between the cumulant of the observation signal and the cumulant of the estimated noise. The kurtosis estimation unit 23 estimates the kurtosis of the speech component based on the cumulant of the speech component.

The kurtosis of the speech component is input to the subtraction coefficient adaptor 31, which adapts the subtraction coefficient β based on it. The subtraction coefficient β obtained by the subtraction coefficient adaptor 31 is input to the noise subtraction unit 12. The noise subtraction unit 12 performs noise subtraction using β and outputs the noise-subtracted output signal Y(f,t). The output signal Y(f,t) is also input to the output-signal cumulant estimation unit 32, which estimates the cumulant of Y(f,t). The output-signal kurtosis estimation unit 33 estimates the kurtosis of the output signal from that cumulant. The kurtosis of the output signal is input to the subtraction coefficient adaptor 31.

The subtraction coefficient adaptor 31 calculates the subtraction coefficient β based on the kurtosis of the output signal and the kurtosis of the speech component. For example, the calculation is repeated until the difference between the kurtosis of the output signal and the kurtosis of the speech component converges; that is, a subtraction coefficient β is calculated for which this difference converges. The noise subtraction unit 12 then performs noise subtraction based on this β. Any of methods (a) to (d) above, that is, equations (2) to (5), can be used for the noise subtraction. The subtraction coefficient β is an adaptive coefficient determined according to the input observation signal X(f,t). In other words, the noise subtraction filter adapts itself based on the input observation signal.
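The text does not give the update rule used by the subtraction coefficient adaptor 31. As one possible stand-in, the sketch below uses a simple search over candidate values of β and keeps the one whose output kurtosis is closest to the estimated speech kurtosis. The function name `adapt_beta`, the candidate list, and the callable `output_kurtosis` (which would wrap noise subtraction plus kurtosis estimation of Y(f,t)) are hypothetical.

```python
def adapt_beta(output_kurtosis, speech_kurtosis, candidates):
    """Return the candidate beta whose output kurtosis best matches the speech kurtosis.

    output_kurtosis: callable mapping beta to the kurtosis of the output signal
                     obtained with that beta (hypothetical; supplied by the caller).
    This exhaustive search is an assumed substitute for the iterative adaptation.
    """
    return min(candidates,
               key=lambda beta: abs(output_kurtosis(beta) - speech_kurtosis))

# Toy usage: a made-up monotone relation between beta and output kurtosis
toy_relation = lambda beta: 3.0 + 2.0 * beta
best_beta = adapt_beta(toy_relation, speech_kurtosis=6.0,
                       candidates=[0.5, 1.0, 1.5, 2.0])
```

An exhaustive search makes no assumption about whether the output kurtosis rises or falls with β; an iterative update would need that relationship to be known.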

Next, the speech enhancement method used in the speech enhancement device 2 will be described in detail.
(Step 1)
First, the time-domain observation signal x(t) acquired by the microphone 1 is divided into frames and then subjected to a discrete Fourier transform. This yields the time-frequency-domain observation signal X(f,t).
(Step 2)
The noise estimation unit 11 performs noise estimation. Here, speech and non-speech intervals are determined for the observation signal X(f,t), and the non-speech intervals are used as the estimated noise signal. Known methods can be used for steps 1 and 2, and the processing is not particularly limited.
(Step 3)
The frequency bin is set to f = 0.
(Step 4)
The observation-signal cumulant estimation unit 21 and the estimated-noise cumulant estimation unit 22 calculate the cumulant C_{observed} of the observation signal X(f,t) and the cumulant C_{noise} of the estimated noise signal N(f,t), respectively. To do so, the moments of X(f,t) and N(f,t) are obtained first. For example, the second-order moment M_{2,observed}, fourth-order moment M_{4,observed}, sixth-order moment M_{6,observed}, and eighth-order moment M_{8,observed} of the observation signal X(f,t) can be obtained by the following equation (6).

Here, <>_t denotes the time average over the frames. The cumulant C_{observed} of the observation signal is obtained from the moment M_{observed} of the observation signal. The second-order cumulant C_{2,observed}, fourth-order cumulant C_{4,observed}, sixth-order cumulant C_{6,observed}, and eighth-order cumulant C_{8,observed} of the observation signal X(f,t) can be obtained by the following equation (7).

Similarly, the second-order moment M_{2,noise}, fourth-order moment M_{4,noise}, sixth-order moment M_{6,noise}, and eighth-order moment M_{8,noise} of the estimated noise signal N(f,t) can be obtained by the following equation (8).

The cumulant C_{noise} of the estimated noise signal is obtained from the moment M_{noise} of the estimated noise signal. The second-order cumulant C_{2,noise}, fourth-order cumulant C_{4,noise}, sixth-order cumulant C_{6,noise}, and eighth-order cumulant C_{8,noise} of the estimated noise signal N(f,t) can be obtained by the following equation (9).

In this way, the cumulant C_{observed signal} of the observation signal X(f, t) and the cumulant C_{noise signal} of the estimated noise signal N(f, t) can be calculated. For a time-domain signal, if the probability density function of the signal is assumed to be zero-mean and symmetric, the odd-order moments and cumulants are zero, so they need not be calculated. Although the description above uses the second-, fourth-, sixth-, and eighth-order moments and cumulants, the orders used are not limited to these.
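As an illustration of the moment and cumulant computation described above, the following is a minimal Python sketch. Equations (6) to (9) are not reproduced in this text, so the sketch relies on the standard moment-to-cumulant relations for a real-valued, zero-mean, symmetric distribution (odd-order moments equal to zero); whether these match the patent's equations exactly is an assumption. The function names `even_moments` and `moments_to_cumulants` are hypothetical.

```python
from typing import Dict, Sequence


def even_moments(samples: Sequence[float]) -> Dict[int, float]:
    """Time-averaged moments of order 2, 4, 6, 8 over one frame of samples."""
    n = len(samples)
    return {k: sum(abs(x) ** k for x in samples) / n for k in (2, 4, 6, 8)}


def moments_to_cumulants(m: Dict[int, float]) -> Dict[int, float]:
    """Even-order cumulants, assuming a zero-mean symmetric distribution
    (standard relations with all odd-order moments set to zero)."""
    m2, m4, m6, m8 = m[2], m[4], m[6], m[8]
    return {
        2: m2,
        4: m4 - 3 * m2 ** 2,
        6: m6 - 15 * m4 * m2 + 30 * m2 ** 3,
        8: m8 - 28 * m6 * m2 - 35 * m4 ** 2 + 420 * m4 * m2 ** 2 - 630 * m2 ** 4,
    }
```

A quick sanity check is that the moments of a unit-variance Gaussian (1, 3, 15, 105) give zero cumulants above the second order.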

(Step 5)
The speech component cumulant estimation unit 24 calculates the cumulant C_{speech component} of the speech component in the observation signal X(f, t). Because cumulants are additive, the cumulant of the speech component is given by the difference between the cumulant of the observation signal and the cumulant of the estimated noise. The second-order cumulant C_{2, speech component}, the fourth-order cumulant C_{4, speech component}, the sixth-order cumulant C_{6, speech component}, and the eighth-order cumulant C_{8, speech component} of the speech component are therefore given by the following equation (10).
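Written out, the relation described in step 5 can be sketched as follows. This is a reconstruction from the text rather than a copy of equation (10); the subscript labels are chosen here for readability, and the additivity used is the standard property of cumulants for statistically independent speech and noise.

```latex
C_{n,\mathrm{speech}} = C_{n,\mathrm{observed}} - C_{n,\mathrm{noise}},
\qquad n \in \{2, 4, 6, 8\}
```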

(Step 6)
The kurtosis estimation unit 23 estimates the kurtosis K_{speech component} of the speech component from the cumulant C_{speech component} of the speech component. The method of estimating the kurtosis is not particularly limited; for example, equation (11) can be used. The kurtosis K_{speech component} of the speech component in the power spectral domain can thereby be calculated.
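The step from cumulants to a power-spectral-domain kurtosis can be sketched as below. Equation (11) is not reproduced in this text, so the sketch makes two assumptions: the cumulants are first converted back to moments with the standard relations for a zero-mean symmetric distribution, and kurtosis is taken as the fourth moment of the power p = x^2 divided by its squared second moment, which equals M_8 / M_4^2 in terms of the amplitude moments. The patent's equation (11) may differ. The function names are hypothetical.

```python
from typing import Dict


def cumulants_to_moments(c: Dict[int, float]) -> Dict[int, float]:
    """Even-order moments from even-order cumulants (zero-mean, symmetric case)."""
    c2, c4, c6, c8 = c[2], c[4], c[6], c[8]
    return {
        2: c2,
        4: c4 + 3 * c2 ** 2,
        6: c6 + 15 * c4 * c2 + 15 * c2 ** 3,
        8: c8 + 28 * c6 * c2 + 35 * c4 ** 2 + 210 * c4 * c2 ** 2 + 105 * c2 ** 4,
    }


def power_domain_kurtosis(c: Dict[int, float]) -> float:
    """Assumed definition: E[p^4] / E[p^2]^2 with p = x^2, i.e. M_8 / M_4^2."""
    m = cumulants_to_moments(c)
    return m[8] / m[4] ** 2
```

Under this assumed definition, the second and fourth moments of the power correspond to the fourth and eighth moments of the amplitude, which is consistent with the text computing cumulants up to the eighth order.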

(Step 7)
The initial value of the subtraction coefficient β is set in the subtraction coefficient adaptor 31, and the update count is set to i = 0. Any suitable value can be chosen as the initial value of β.
(Step 8)
Using the initial value of the subtraction coefficient β, the noise subtraction unit 12 performs noise subtraction on the observation signal X(f, t). For the noise subtraction, any one of methods (a) to (d) can be used, for example. Accordingly, one of equations (2) to (5) is adopted and the initial value of β is substituted into it, which yields the filter coefficient H(f, t). The output signal Y(f, t) is then calculated from the filter coefficient H(f, t) and the observation signal X(f, t). Specifically, Y(f, t) = H(f, t) X(f, t).
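The filtering in step 8 can be sketched as follows. Methods (a) to (d) and equations (2) to (5) are not reproduced in this section, so the gain below is an assumed stand-in: a common power-spectral-subtraction gain with flooring at zero, not necessarily any of the patent's four methods. The function names and the parameter `n_power` (estimated noise power in the bin) are hypothetical.

```python
import math


def subtraction_gain(x_power: float, n_power: float, beta: float) -> float:
    """Assumed gain H = sqrt(max(1 - beta * |N|^2 / |X|^2, 0))."""
    if x_power <= 0.0:
        return 0.0  # nothing to scale in an empty bin
    return math.sqrt(max(1.0 - beta * n_power / x_power, 0.0))


def apply_gain(x: complex, n_power: float, beta: float) -> complex:
    """Y(f, t) = H(f, t) X(f, t) for a single time-frequency point."""
    return subtraction_gain(abs(x) ** 2, n_power, beta) * x
```

A larger β removes more of the estimated noise power, and the gain is clamped at zero once β times the noise power exceeds the observed power.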

(Step 9)
The output signal cumulant estimation unit 32 estimates the cumulant C_{output signal} of the output signal Y(f, t). To do so, the moment M_{output signal} of the output signal Y(f, t) is obtained first. For example, the second-order moment M_{2, output signal}, the fourth-order moment M_{4, output signal}, the sixth-order moment M_{6, output signal}, and the eighth-order moment M_{8, output signal} of the output signal Y(f, t) can be obtained by the following equation (12).

The cumulant C_{output signal} of the output signal is obtained from these moments. The second-order cumulant C_{2, output signal}, the fourth-order cumulant C_{4, output signal}, the sixth-order cumulant C_{6, output signal}, and the eighth-order cumulant C_{8, output signal} of the output signal Y(f, t) can be obtained by the following equation (13).

(Step 10)
The output signal kurtosis estimation unit 33 calculates the kurtosis K_{output signal} of the output signal based on the cumulant C_{output signal}. The method of estimating the kurtosis is not particularly limited; for example, equation (14) can be used. The kurtosis K_{output signal} in the power spectral domain can thereby be calculated.

(Step 11)
The subtraction coefficient adaptor 31 compares the kurtosis K_{output signal} of the output signal with the kurtosis K_{speech component} of the speech component and updates the subtraction coefficient β. For example, the difference between the kurtosis K_{output signal} of the output signal and the kurtosis K_{speech component} of the speech component is obtained, and β is updated according to this difference. Specifically, β is updated using the following equation (15).

Here, Threshold is the threshold for determining whether the subtraction coefficient β has converged, and can be set to any value. Δβ is the increment of the subtraction coefficient β in the loop calculation that makes β converge, and can also be set to any value. Δβ may also be changed according to the kurtosis difference. Thus, when the kurtosis K_{speech component} of the speech component is larger than the kurtosis K_{output signal} of the output signal, the subtraction coefficient adaptor 31 determines that the noise subtraction is insufficient and increases the subtraction coefficient β. When the absolute value of the kurtosis difference is smaller than the threshold, the subtraction coefficient adaptor 31 determines that β has converged; in that case, the frequency bin f is incremented and the process proceeds to step 14, described later.
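The update rule described above can be sketched as follows. Equation (15) is not reproduced in this text, so this follows only the prose: β is increased when the speech kurtosis exceeds the output kurtosis, decreased in the opposite case, and treated as converged when the absolute difference falls below the threshold. The function name `update_beta` and the default values of `delta_beta` and `threshold` are hypothetical.

```python
from typing import Tuple


def update_beta(beta: float, k_speech: float, k_output: float,
                delta_beta: float = 0.1, threshold: float = 0.05) -> Tuple[float, bool]:
    """Return (new_beta, converged) from one kurtosis comparison."""
    diff = k_speech - k_output
    if abs(diff) < threshold:
        return beta, True                 # converged: keep beta as is
    if diff > 0:
        return beta + delta_beta, False   # subtraction insufficient: raise beta
    return beta - delta_beta, False       # subtraction excessive: lower beta
```

A fixed step keeps the sketch simple; the text also allows Δβ to depend on the size of the kurtosis difference.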

(Step 12)
The update count i is incremented.
(Step 13)
It is determined whether the update count i has exceeded I, that is, whether the loop calculation for obtaining the subtraction coefficient β has been performed a sufficient number of times. If i is smaller than I, the process returns to step 8. If i is equal to or greater than I, the frequency bin f is incremented and the process proceeds to step 14. In other words, if β does not converge, the loop calculation of steps 8 to 12 is repeated until i reaches I. As noted above, if β converges, the loop may be exited before i reaches I.
(Step 14)
It is determined whether the subtraction coefficient β has been calculated for all frequency bins. Specifically, if the frequency bin f is smaller than F, the process returns to step 4 to obtain the subtraction coefficient β for the next frequency bin, where F is the number of frequency bins. If f is equal to or greater than F, a time-domain output signal is obtained: the output signal Y(f, t) calculated by the noise subtraction unit 12 is inverse Fourier transformed, the result is windowed, and time-domain data are obtained by overlap-add. The time-domain output signal y(t) is then output to the output-side device 3. In other words, the loop calculation of steps 4 to 13 is repeated until f reaches F. The processing of step 14 may be performed by the speech enhancement device 2 or by another device, for example the output-side device 3.
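The iteration over steps 8 to 13 for a single frequency bin can be sketched as a bounded loop. This is a simplified illustration, not the patent's implementation: the callback `output_kurtosis` stands in for steps 8 to 10 (noise subtraction, then output cumulants, then output kurtosis), and all names and default values are hypothetical.

```python
from typing import Callable, Tuple


def adapt_beta(k_speech: float,
               output_kurtosis: Callable[[float], float],
               beta0: float = 1.0,
               delta_beta: float = 0.1,
               threshold: float = 0.05,
               max_updates: int = 50) -> Tuple[float, int]:
    """Iterate beta until the kurtosis difference is below threshold
    or the update count reaches max_updates (the limit I in the text)."""
    beta, i = beta0, 0
    while i < max_updates:
        diff = k_speech - output_kurtosis(beta)   # steps 8 to 11
        if abs(diff) < threshold:
            break                                  # converged: leave early
        beta += delta_beta if diff > 0 else -delta_beta
        i += 1                                     # step 12
    return beta, i
```

For example, with a toy model in which the output kurtosis is `2.0 + beta` and the speech kurtosis is 3.5, the loop stops after five updates with β close to 1.5.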

In this way, the kurtosis of the speech component in the observation signal is estimated blindly. Here, the additivity of cumulants is used to calculate the cumulant of the speech component from the difference between the cumulant of the observation signal and that of the estimated noise, and the kurtosis is then calculated from the cumulant of the speech component. As a result, the kurtosis of the speech component can be estimated without complicated operations such as convolution. Whether the noise subtraction is excessive or insufficient is evaluated by comparing the estimated kurtosis of the speech component with the estimated kurtosis of the output signal, and the subtraction coefficient is adjusted according to the result. That is, the subtraction coefficient is decreased if the spectral subtraction is excessive and increased if it is insufficient. Noise can thereby be suppressed appropriately. In particular, the subtraction coefficient is calculated adaptively using the kurtosis of the speech component, which relates to quality in speech evaluation. That is, by calculating the kurtosis of the speech interval, the change in the distribution shape of the speech interval before and after processing can be obtained. Distortion of the speech component (for example, cepstral distortion) can thereby be suppressed. The kurtosis of the speech component can be calculated accurately from an observation signal in which noise and speech are mixed, and the accuracy of speech recognition in the output-side device 3 can be improved.

In the description above, the subtraction coefficient β is calculated for each frequency bin. This allows noise to be suppressed more appropriately and improves the accuracy of speech recognition. The subtraction coefficient β may, of course, also be obtained by batch processing. A method of calculating β collectively is described below as Modification 1.

Modification 1 of Embodiment 1
The basic noise subtraction method is the same as the processing described above, so its description is omitted. The description of Modification 1 focuses on the differences from the method of Embodiment 1.

In Modification 1, the subtraction coefficient β for a plurality of frequency bins is calculated collectively. Step 3, the increment of the frequency bin f in step 13, and the determination on the frequency bin f in step 14 are therefore unnecessary. In addition, the moment calculation formulas in steps 4 and 9 are different. Specifically, the following equations (16), (17), and (18) are used in place of equations (6), (8), and (12). These give the moment M_{observed signal} of the observation signal, the moment M_{noise component} of the noise component, and the moment M_{output signal} of the output signal, respectively.

The cumulants are estimated from these moments, and the kurtosis is then calculated from the cumulants. The calculations for obtaining the cumulants and the kurtosis are the same as in Embodiment 1, so their description is omitted. In Modification 1, the subtraction coefficient can be calculated collectively for the 0th through (F-1)th frequency bins. That is, a common subtraction coefficient β is used for the F frequency bins. The calculation time can thereby be made shorter than in Embodiment 1. In addition, because the kurtosis of the speech interval is used, speech can be enhanced effectively.
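The batch computation in Modification 1 can be sketched as follows. Equations (16) to (18) are not reproduced in this text, so the sketch assumes that they average over both frequency and time rather than over time alone; this is an assumption, and the function name `pooled_even_moments` and the `[f][t]` list layout are hypothetical.

```python
from typing import Dict, Sequence


def pooled_even_moments(spectrogram: Sequence[Sequence[complex]]) -> Dict[int, float]:
    """Even-order moments of |X(f, t)| averaged over all bins f and frames t.

    spectrogram[f][t] holds the complex value of bin f at frame t.
    """
    magnitudes = [abs(x) for row in spectrogram for x in row]
    n = len(magnitudes)
    return {k: sum(v ** k for v in magnitudes) / n for k in (2, 4, 6, 8)}
```

Because a single set of moments is produced for all F bins, the cumulant, kurtosis, and β calculations run once instead of F times, which is where the reduction in calculation time comes from.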

Embodiment 2
The speech enhancement device 2 according to Embodiment 2 is described with reference to FIG. 2. In Embodiment 2, a microphone array 5 provided with six microphones 1 is used, and the observation signals X_0 to X_5 acquired by the microphone array 5 are input to the speech enhancement device 2. The noise estimation processing also differs from that of Embodiment 1. The other basic processing is the same as in Embodiment 1 or Modification 1, so its description is omitted.

When the microphone array 5 composed of a plurality of microphones 1 is used, noise from a nearby point sound source can be suppressed by phase difference control. In this embodiment, therefore, noise estimation is performed by phase difference control. There is no need to estimate speech/non-speech intervals on a frame basis as described in Embodiment 1. That is, the speech can be cancelled even when speech is present. The voice activity detection processing described above can therefore be omitted.

The observation signals from the microphone array 5 are input to the noise estimation processing unit 41, the speech estimation processing unit 42, and the observation signal cumulant calculation unit 51. As in Embodiment 1, the observation signals have been converted into time-frequency domain signals by preprocessing before noise reduction. The noise estimation processing unit 41 performs noise estimation on the plurality of observation signals X_0(f, t) to X_5(f, t). In this embodiment, microphone array processing is performed, and an estimated noise signal is generated for each of the observation signals X_0(f, t) to X_5(f, t). For example, the position of the sound source is estimated by microphone array processing, and sound arriving from directions other than that of the sound source is estimated to be noise. Specifically, noise is estimated by adaptive array processing using a null beamformer or ICA (independent component analysis). Noise estimation with the microphone array 5 is not limited to a particular method, and any known method can be used.

The observation signal cumulant calculation unit 51 calculates the cumulant of the observation signal in the same way as the observation signal cumulant estimation unit 21. The estimated noise cumulant calculation unit 52 estimates the cumulant of the estimated noise signal in the same way as the estimated noise cumulant estimation unit 22. The speech component cumulant calculation unit 53, like the speech component cumulant estimation unit 24, calculates the cumulant of the speech component based on the cumulant of the observation signal and the cumulant of the estimated noise signal. Specifically, the cumulant of the speech component can be obtained from the difference between the cumulant of the observation signal and the cumulant of the estimated noise signal. The kurtosis calculation unit 54, like the kurtosis estimation unit 23, calculates the kurtosis of the speech component from the cumulant of the speech component. The subtraction parameter adaptation determiner 55, like the subtraction coefficient adaptor 31, calculates the subtraction coefficient based on the kurtosis of the speech component.

The subtraction coefficient β calculated by the subtraction parameter adaptation determiner 55 is input to the speech estimation processing unit 42, which performs speech estimation using this β. That is, like the noise subtraction unit 12, the speech estimation processing unit 42 performs noise subtraction using the filter coefficient and the subtraction coefficient β. An output signal in which the speech is enhanced is thereby generated. The output signal is input to the output signal cumulant calculation unit 57.

The output signal cumulant calculation unit 57 calculates the cumulant of the output signal in the same way as the output signal cumulant estimation unit 32. The kurtosis calculation unit 56, like the output signal kurtosis estimation unit 33, calculates the kurtosis of the output signal based on the cumulant of the output signal. The kurtosis of the output signal is input to the subtraction parameter adaptation determiner 55, which, like the subtraction coefficient adaptor 31, calculates the subtraction coefficient β from the kurtosis of the output signal and the kurtosis of the speech component. The subtraction coefficient β calculated by the subtraction parameter adaptation determiner 55 is then input to the speech estimation processing unit 42, which performs speech estimation based on the updated β. That is, like the noise subtraction unit 12, the speech estimation processing unit 42 performs noise subtraction using the filter coefficient and the subtraction coefficient. The speech is thereby estimated, and an output signal in which the speech is enhanced is output to an external device, for example the output-side device 3 described in Embodiment 1 (omitted in FIG. 2). The above processing can be carried out with the equations given in Embodiment 1 or Modification 1. Output signals Y_0(f, t) to Y_5(f, t) can thereby be obtained, one for each microphone 1.

Next, the speech enhancement method of the speech enhancement device 2 of this embodiment is described in detail. In the following description, the kurtosis is calculated for each frequency bin as in Embodiment 1, but it may also be calculated collectively as in Modification 1.
(Step 101)
First, the time-domain observation signals x_0(t) to x_5(t) acquired by the microphone array 5 are divided into frames, and a discrete Fourier transform is then performed. The time-frequency domain observation signals X_0(f, t) to X_5(f, t) are thereby obtained.
(Step 102)
The frequency bin is set to f = 0.
(Step 103)
The noise estimation processing unit 41 executes noise estimation. Here, noise estimation signals are obtained by microphone array processing, one for each microphone 1. In the example of FIG. 2 there are six microphones 1, so noise estimation signals N_0(f, t) to N_5(f, t) are calculated. Known methods can be used for steps 101 and 103, which are not limited to a particular method. The number of microphones 1 is, of course, not limited to six.
(Step 104)
The microphone number is set to n = 0. That is, the processing for calculating the cumulants and the kurtosis is performed on the observation signal X_0(f, t) acquired by the first microphone 1 and on its noise estimation signal N_0(f, t).

(Step 105)
The observation signal cumulant calculation unit 51 and the estimated noise cumulant calculation unit 52 calculate the cumulant C_{observed signal n} of the observation signal X_n(f, t) and the cumulant C_{noise signal n} of the estimated noise signal N_n(f, t), respectively. To do so, the moments of the observation signal X_n(f, t) and of the estimated noise signal N_n(f, t) are obtained first. For example, the second-order moment M_{2, observed signal n}, the fourth-order moment M_{4, observed signal n}, the sixth-order moment M_{6, observed signal n}, and the eighth-order moment M_{8, observed signal n} of the observation signal X_n(f, t) can be obtained by an equation similar to equation (6) above. The observation signal cumulant calculation unit 51 obtains the cumulant C_{observed signal n} of the observation signal from the moment M_{observed signal n} of the observation signal. The second-order cumulant C_{2, observed signal n}, the fourth-order cumulant C_{4, observed signal n}, the sixth-order cumulant C_{6, observed signal n}, and the eighth-order cumulant C_{8, observed signal n} of the observation signal X_n(f, t) can be obtained by equation (7) above.

Similarly, the estimated noise cumulant calculation unit 52 obtains the second-order moment M_{2, noise signal n}, the fourth-order moment M_{4, noise signal n}, the sixth-order moment M_{6, noise signal n}, and the eighth-order moment M_{8, noise signal n} of the estimated noise signal N_n(f, t) by equation (8) above. The estimated noise cumulant calculation unit 52 then obtains the cumulant C_{noise signal n} of the estimated noise signal from the moment M_{noise signal n} of the estimated noise signal. The second-order cumulant C_{2, noise signal n}, the fourth-order cumulant C_{4, noise signal n}, the sixth-order cumulant C_{6, noise signal n}, and the eighth-order cumulant C_{8, noise signal n} of the estimated noise signal N_n(f, t) can be obtained by equation (9) above. In this way, the cumulant C_{observed signal n} of the observation signal X_n(f, t) and the cumulant C_{noise signal n} of the estimated noise signal N_n(f, t) can be calculated.
(Step 106)
The speech component cumulant calculation unit 53 calculates the cumulant C_{speech component n} of the speech component in the observation signal X_n(f, t). Because cumulants are additive, the cumulant of the speech component is given by the difference between the cumulant of the observation signal and the cumulant of the estimated noise. The second-order cumulant C_{2, speech component n}, the fourth-order cumulant C_{4, speech component n}, the sixth-order cumulant C_{6, speech component n}, and the eighth-order cumulant C_{8, speech component n} of the speech component are therefore given by equation (10) above.

(Step 107)
The kurtosis calculation unit 54 estimates the kurtosis K_{speech component n} of the speech component from the cumulant C_{speech component n} of the speech component. The method of estimating the kurtosis is not particularly limited; for example, equation (11) above can be used. The kurtosis K_{speech component n} of the speech component in the power spectral domain can thereby be calculated.

(Step 108)
The initial value of the subtraction coefficient β is set in the subtraction parameter adaptation determiner 55, and the update count is set to i = 0. Any suitable value can be chosen as the initial value of β.
(Step 109)
Using the initial value of the subtraction coefficient β, the speech estimation processing unit 42 performs noise subtraction on the observation signal X_n(f, t). For the noise subtraction, any one of methods (a) to (d) can be used, for example. Accordingly, one of equations (2) to (5) is adopted and the initial value of β is substituted into it, which yields the filter coefficient H_n(f, t). The output signal Y_n(f, t) is then calculated from the filter coefficient H_n(f, t) and the observation signal X_n(f, t). Specifically, Y_n(f, t) = H_n(f, t) X_n(f, t).

(Step 110)
The output signal cumulant calculation unit 57 estimates the cumulant C_{output signal n} of the output signal Y_n(f, t). To do so, the moment M_{output signal n} of the output signal Y_n(f, t) is obtained first. For example, the second-order moment M_{2, output signal n}, the fourth-order moment M_{4, output signal n}, the sixth-order moment M_{6, output signal n}, and the eighth-order moment M_{8, output signal n} of the output signal Y_n(f, t) can be obtained by equation (12) above. The cumulant C_{output signal n} of the output signal is obtained from these moments. The second-order cumulant C_{2, output signal n}, the fourth-order cumulant C_{4, output signal n}, the sixth-order cumulant C_{6, output signal n}, and the eighth-order cumulant C_{8, output signal n} of the output signal Y_n(f, t) can be obtained by equation (13) above.

(Step 111)
The kurtosis calculation unit 56 calculates the kurtosis K_{output signal n} of the output signal based on the cumulant C_{output signal n}. The method of estimating the kurtosis is not particularly limited; for example, equation (14) above can be used. The kurtosis K_{output signal n} in the power spectral domain can thereby be calculated.

(Step 112)
The subtraction parameter adaptation determiner 55 updates the subtraction coefficient β and increments the update count i. It compares the kurtosis K_{output signal n} of the output signal with the kurtosis K_{speech component n} of the speech component to calculate β. For example, the difference between the kurtosis K_{output signal n} of the output signal and the kurtosis K_{speech component n} of the speech component is obtained, and β is updated according to this difference. Specifically, β is updated using equation (15) above. The update count i is then incremented.

(Step 113)
It is determined whether the update count i has reached I, that is, whether the loop calculation for obtaining the subtraction coefficient β has been performed a sufficient number of times. If the update count i is smaller than I, the process returns to step 109. If the update count i is greater than or equal to I, the frequency bin f is incremented and the process proceeds to step 114. In other words, when the subtraction coefficient β does not converge, the processing of steps 109 to 112 is repeated until the update count i reaches I. If the subtraction coefficient β converges before the update count i reaches I, the loop may be exited and the process may proceed to step 114. For example, the loop calculation of steps 109 to 112 may be exited when the difference or ratio between the kurtosis values is smaller than a threshold value Threshold.
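The loop control described in this step (repeat until the update count reaches I, or exit early once the kurtosis difference falls below a threshold) can be sketched in Python as follows. The callables `output_kurtosis` and `update` stand in for the noise subtraction, kurtosis calculation, and coefficient update of the preceding steps; their signatures, the toy model in the example, and the name `adapt_beta` are assumptions made only for this illustration.

```python
from typing import Callable, Tuple


def adapt_beta(beta: float,
               k_speech: float,
               output_kurtosis: Callable[[float], float],
               update: Callable[[float, float, float], float],
               max_updates: int,
               threshold: float) -> Tuple[float, int]:
    """Run the adaptation loop; return the final beta and update count."""
    i = 0
    while i < max_updates:
        # Stand-in for subtracting noise with the current beta and
        # measuring the kurtosis of the resulting output signal.
        k_output = output_kurtosis(beta)
        if abs(k_output - k_speech) < threshold:
            break  # treated as converged: leave the loop early
        beta = update(beta, k_output, k_speech)
        i += 1
    return beta, i


# Toy model (hypothetical): output kurtosis grows linearly with beta.
toy_kurtosis = lambda b: 2.0 + b
toy_update = lambda b, k_out, k_sp: b + 0.5 * (k_sp - k_out)

converged = adapt_beta(0.0, 4.0, toy_kurtosis, toy_update,
                       max_updates=10, threshold=0.2)
capped = adapt_beta(0.0, 4.0, toy_kurtosis, toy_update,
                    max_updates=2, threshold=0.2)
```

In the first call the loop exits early on the threshold test; in the second it stops because the update count reaches the limit.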

(Step 114)
It is determined whether the subtraction coefficient β has been calculated for all of the microphones 1. For example, when the number of microphones 1 included in the microphone array 5 is M, it is determined whether the microphone number n is M or more. If the microphone number n is smaller than M, the process returns to step 105. If the microphone number n is greater than or equal to M, the frequency bin f is incremented and the process proceeds to step 115.

(Step 115)
It is determined whether the subtraction coefficient β has been calculated for all frequency bins. Specifically, if the frequency bin f is smaller than F, where F is the number of frequency bins, the process returns to step 104 and the subtraction coefficient β of the next frequency bin f is obtained. If the frequency bin f is F or more, a time-domain output signal is obtained. Specifically, the output signal Y(f, t) calculated by the speech estimation processing unit 42 is subjected to an inverse Fourier transform. The inverse-transformed output signal is then windowed (Hamming window), and time-domain data is obtained by overlap-add. As a result, the time-domain output signal y(t) is output from the speech enhancement device 2. Note that the processing of step 115 may be performed by the speech enhancement device 2 or by another device.
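The reconstruction of the time-domain signal (inverse Fourier transform, Hamming window, overlap-add) can be sketched with the Python standard library as follows. The sketch assumes that each frame holds the full N-point complex spectrum, uses a naive inverse DFT for clarity rather than speed, and applies no gain normalization for the window overlap; the function names and the hop size are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def inverse_dft(spectrum: Sequence[complex]) -> List[float]:
    """Naive inverse DFT; returns the real part of each time sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
            for k in range(n)).real / n
        for t in range(n)
    ]


def hamming(n: int) -> List[float]:
    """Symmetric Hamming window of length n (n >= 2)."""
    if n < 2:
        raise ValueError("window length must be at least 2")
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * t / (n - 1))
            for t in range(n)]


def overlap_add(frames: Sequence[Sequence[complex]], hop: int) -> List[float]:
    """Inverse-transform each frame, window it, and overlap-add."""
    if not frames:
        return []
    n = len(frames[0])
    window = hamming(n)
    out = [0.0] * (hop * (len(frames) - 1) + n)
    for index, spectrum in enumerate(frames):
        samples = inverse_dft(spectrum)
        start = index * hop
        for t in range(n):
            out[start + t] += window[t] * samples[t]
    return out


# Two frames containing only a DC component, with 50% overlap.
y = overlap_add([[4, 0, 0, 0], [4, 0, 0, 0]], hop=2)
```

In a practical system a fast Fourier transform would replace `inverse_dft`, and the window and hop size would be chosen so that the overlapping windows sum to a constant gain.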

In this way, as in Embodiment 1, the noise in the observation signal is effectively reduced. The speech in the observation signal can therefore be enhanced, and the accuracy of speech recognition processing in a subsequent speech recognition system can be improved. Furthermore, the microphone array 5 is used as the microphone unit for acquiring the speech signal, so noise can be estimated effectively. In Embodiments 1 and 2, the loop calculation for calculating the subtraction coefficient may be performed on the same observation signal, or may be performed using observation signals acquired as needed. That is, the latest observation signal may be used for each loop calculation.

The noise suppression processing described above may be realized by causing a computer that includes a DSP (Digital Signal Processor), an MPU (Micro Processing Unit), a CPU (Central Processing Unit), or a combination thereof to execute a program.

In the above example, a program including a group of instructions for causing a computer to perform the speech enhancement processing can be stored on various types of non-transitory computer readable media and supplied to the computer. Non-transitory computer readable media include various types of tangible storage media. Examples of non-transitory computer readable media include magnetic recording media (for example, flexible disks, magnetic tapes, and hard disk drives), magneto-optical recording media (for example, magneto-optical disks), CD-ROM (Read Only Memory), CD-R, CD-R/W, and semiconductor memory (for example, mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, and RAM (Random Access Memory)). The program may also be supplied to the computer by various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the program to the computer via a wired communication path such as an electric wire or an optical fiber, or via a wireless communication path.

DESCRIPTION OF SYMBOLS
1 Microphone
2 Speech enhancement device
3 Output-side device
5 Microphone array
11 Noise estimation unit
12 Noise subtraction unit
20 Kurtosis calculation unit
21 Observation signal cumulant estimation unit
22 Estimated noise cumulant estimation unit
23 Kurtosis estimation unit
24 Speech component cumulant estimation unit
30 Subtraction coefficient calculation unit
31 Subtraction coefficient adapter
32 Output signal cumulant estimation unit
33 Output signal kurtosis estimation unit
41 Noise estimation processing unit
42 Speech estimation processing unit
51 Observation signal cumulant calculation unit
52 Estimated noise cumulant calculation unit
53 Speech component cumulant calculation unit
54 Kurtosis calculation unit
55 Subtraction parameter adaptive determination unit
56 Kurtosis calculation unit
57 Output signal cumulant calculation unit

Claims

A speech enhancement device that enhances speech in an observation signal acquired by a microphone unit, the device comprising:
a first cumulant estimation unit that estimates a cumulant of the observation signal, which includes a noise component and a speech component;
a noise estimation unit that estimates the noise component included in the observation signal;
a second cumulant estimation unit that estimates a cumulant of the estimated noise estimated by the noise estimation unit;
a third cumulant estimation unit that estimates a cumulant of the speech component based on the cumulant of the observation signal and the cumulant of the estimated noise;
a first kurtosis estimation unit that estimates a kurtosis of the speech component based on the cumulant of the speech component;
a subtraction coefficient adaptation unit that calculates a subtraction coefficient based on the kurtosis of the speech component; and
a noise subtraction unit that performs noise subtraction on the observation signal using the subtraction coefficient calculated by the subtraction coefficient adaptation unit.

The speech enhancement device according to claim 1, further comprising:
a fourth cumulant estimation unit that estimates a cumulant of the output signal output from the noise subtraction unit; and
a kurtosis estimation unit that estimates a kurtosis of the output signal based on the cumulant of the output signal,
wherein the subtraction coefficient adaptation unit calculates the subtraction coefficient based on the kurtosis of the output signal.

The speech enhancement device according to claim 1, wherein the cumulant of the speech component is estimated based on a difference between the cumulant of the observation signal and the cumulant of the estimated noise.

The speech enhancement device according to claim 1, wherein the microphone unit includes a microphone array having a plurality of microphones, and
the noise estimation unit estimates the estimated noise by microphone array processing.

A speech enhancement method for enhancing speech in an observation signal acquired by a microphone unit, the method comprising:
calculating a cumulant of the observation signal, which includes a noise component and a speech component;
estimating noise included in the observation signal;
calculating a cumulant of the estimated noise;
calculating a cumulant of the speech component based on the cumulant of the observation signal and the cumulant of the estimated noise;
estimating a kurtosis of the speech component based on the cumulant of the speech component;
calculating a subtraction coefficient based on the kurtosis of the speech component; and
subtracting noise from the observation signal using the subtraction coefficient.

The speech enhancement method according to claim 5, further comprising:
calculating a cumulant of the output signal; and
calculating a kurtosis of the output signal based on the cumulant of the output signal,
wherein the subtraction coefficient is calculated based on the kurtosis of the output signal and the kurtosis of the speech component.

The speech enhancement method according to claim 5 or 6, wherein the cumulant of the speech component is estimated based on a difference between the cumulant of the observation signal and the cumulant of the estimated noise.

The speech enhancement method according to claim 5, wherein the microphone unit includes a microphone array having a plurality of microphones, and
the estimated noise is estimated by microphone array processing.

A speech enhancement program for enhancing speech in an observation signal acquired by a microphone unit, the program causing a computer to execute:
a step of calculating a cumulant of the observation signal, which includes a noise component and a speech component;
a step of estimating noise included in the observation signal;
a step of calculating a cumulant of the estimated noise;
a step of calculating a cumulant of the speech component based on the cumulant of the observation signal and the cumulant of the estimated noise;
a step of estimating a kurtosis of the speech component based on the cumulant of the speech component;
a step of calculating a subtraction coefficient based on the kurtosis of the speech component; and
a step of subtracting noise from the observation signal using the subtraction coefficient.

The speech enhancement program according to claim 9, further causing the computer to execute:
a step of calculating a cumulant of the output signal; and
a step of calculating a kurtosis of the output signal based on the cumulant of the output signal,
wherein the subtraction coefficient is calculated based on the kurtosis of the output signal and the kurtosis of the speech component.

The speech enhancement program according to claim 9 or 10, wherein the cumulant of the speech component is estimated based on a difference between the cumulant of the observation signal and the cumulant of the estimated noise.

The speech enhancement program according to any one of claims 9 to 11, wherein the microphone unit includes a microphone array having a plurality of microphones, and
the estimated noise is estimated by microphone array processing.