JPWO2015093025A1

JPWO2015093025A1 - Audio processing apparatus, audio processing method, and audio processing program

Info

Publication number: JPWO2015093025A1
Application number: JP2015553369A
Authority: JP
Inventors: 秀治古明地; 剛範辻川; 亮輔磯谷
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2013-12-17
Filing date: 2014-12-12
Publication date: 2017-03-16
Also published as: WO2015093025A1

Abstract

The invention retains the benefit of selecting distributions in a speech model based on time transition information about speech features, and still achieves high noise suppression accuracy even when the state transitions of past speech features are continuously and strongly affected by noise. The speech processing device 4 includes: a storage unit 40 that stores a first speech model having time transition information, which indicates the record of state transitions of speech features, and a second speech model having no time transition information; an expected value calculation unit 60 that selects one of the first and second speech models according to a predetermined criterion based on information indicated by an input signal in which a speech component and a noise component are mixed, and calculates an expected value for the speech component using the selected speech model and the input signal; and a noise suppression unit 81 that uses the expected value to generate noise-suppressed speech in which the noise component included in the input signal is suppressed.

Description

The present invention relates to a speech processing apparatus and the like that perform processing for suppressing a noise component included in a speech signal.

In recent years, model-based noise suppression and speech enhancement techniques, which use a speech model obtained by modeling speech patterns, have been developed. Such techniques aim to reduce the distortion caused by suppressing noise excessively and to perform a conversion better suited to speech recognition and abnormal sound detection.

As a related technique, Patent Document 1 discloses a noise suppression system that obtains a provisional estimated speech spectrum from an input signal spectrum and an estimate of the noise spectrum, and corrects that provisional estimated speech spectrum using a speech model. The noise suppression system calculates a noise suppression filter from the corrected provisional estimated speech spectrum and the estimate of the noise spectrum, and calculates an estimated speech spectrum from the noise suppression filter and the input signal spectrum.

Patent Document 2 discloses a noise suppression system that, when calculating the speech features of the estimated speech, uses as its speech model a model to which time transition information of speech has been added. This noise suppression system corrects features by selecting distributions in line with the transitions of past speech features.

Furthermore, Patent Document 3 discloses a speech recognition apparatus that holds a speech model with multiple levels of detail representing the characteristic properties of speech and, based on that speech model, selects the level of detail closest to the characteristic properties of a speech input signal. The speech recognition apparatus controls parameters related to speech recognition according to the selected level of detail.

Patent Document 1: Japanese Patent No. 4765461
Patent Document 2: JP 2005-84653 A
Patent Document 3: International Publication No. WO 2008/108232

Pedro J. Moreno, Bhiksha Raj and Richard M. Stern, "A Vector Taylor Series Approach for Environment Independent Speech Recognition", Proc. ICASSP 1996, pp. 733-736, vol. 2, 1996.

The technique disclosed in Patent Document 2 can normally achieve higher noise suppression accuracy than the technique disclosed in Patent Document 1. However, particularly in a noisy environment, using a speech model with added time transition information of speech to calculate the speech features of the estimated speech, as disclosed in Patent Document 2, can instead lower the noise suppression accuracy.

For example, when past speech features are continuously and strongly affected by noise, the accuracy of distribution selection decreases. In that case, the noise suppression system disclosed in Patent Document 2 may achieve lower noise suppression accuracy than the technique of Patent Document 1, which does not use the transitions of past speech features.

Note that the technique disclosed in Patent Document 3 improves speech recognition accuracy by using a plurality of speech models, but does not mention any technique for improving noise suppression accuracy.

Therefore, the problem is to perform signal processing with high noise suppression accuracy by retaining the benefit of selecting distributions in the speech model using time transition information, while preventing the accuracy from degrading even when the transitions of past speech features are continuously and strongly affected by noise.

The main object of the present invention is to provide a speech processing apparatus and the like that can solve the above-described problem.

A speech processing apparatus according to one aspect of the present invention includes: storage means for storing a first speech model having time transition information, which indicates the record of state transitions of speech features, and a second speech model having no time transition information; expected value calculation means for selecting one of the first and second speech models according to a predetermined criterion based on information indicated by an input signal in which a speech component and a noise component are mixed, and for calculating an expected value for the speech component using the selected speech model and the input signal; and noise suppression means for generating, using the expected value, first noise-suppressed speech in which the noise component included in the input signal is suppressed.

In another aspect for achieving the above object, a speech processing method according to one aspect of the present invention includes, by an information processing apparatus: referring to storage means that stores a first speech model having time transition information, which indicates the record of state transitions of speech features, and a second speech model having no time transition information, and selecting one of the first and second speech models according to a predetermined criterion based on information indicated by an input signal in which a speech component and a noise component are mixed; calculating an expected value for the speech component using the selected speech model and the input signal; and generating, using the expected value, first noise-suppressed speech in which the noise component included in the input signal is suppressed.

In a further aspect for achieving the above object, a computer-readable recording medium according to one aspect of the present invention stores a speech processing program that causes a computer to execute: expected value calculation processing of referring to storage means that stores a first speech model having time transition information, which indicates the record of state transitions of speech features, and a second speech model having no time transition information, selecting one of the first and second speech models according to a predetermined criterion based on information indicated by an input signal in which a speech component and a noise component are mixed, and calculating an expected value for the speech component using the selected speech model and the input signal; and noise suppression processing of generating, using the expected value, first noise-suppressed speech in which the noise component included in the input signal is suppressed.

Furthermore, the present invention can also be realized by the computer program stored in such a recording medium.

The present invention retains the benefit of selecting distributions in a speech model based on time transition information, which indicates the record of state transitions of speech features, and achieves high noise suppression accuracy even when the state transitions of past speech features are continuously and strongly affected by noise.

A block diagram showing the configuration of the speech processing apparatus according to the first embodiment of the present invention.
A flowchart showing the operation of the speech processing apparatus according to the first embodiment of the present invention.
A block diagram showing the configuration of the speech processing apparatus according to the second embodiment of the present invention.
A block diagram showing the configuration of the speech processing apparatus according to the third embodiment of the present invention.
A block diagram showing the configuration of the speech processing apparatus according to the fourth embodiment of the present invention.
A block diagram showing the configuration of an information processing apparatus capable of executing the speech processing apparatus according to each embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<First Embodiment>
FIG. 1 is a block diagram conceptually showing the configuration of the speech processing apparatus 1 of the first embodiment.

The speech processing apparatus 1 according to the present embodiment uses a "model fitness", which indicates how well the input signal fits a speech model, as a numerical measure of how deeply the input signal is buried in noise. When performing noise suppression processing, the speech processing apparatus 1 uses a speech model without time transition information when the model fitness is low, and a speech model with time transition information when the model fitness is high. Hereinafter, the speech model without time transition information may be referred to as "model A", and the speech model with time transition information as "model B".
As shown in FIG. 1, the speech processing apparatus 1 includes an input signal acquisition unit 10, a noise estimation unit 20, a provisional estimated speech calculation unit 30, a storage unit 40, a speech model fitness calculation unit 50, an expected value calculation unit 60, a filter calculation unit 70, and a filtering unit 80.

The input signal acquisition unit 10, the noise estimation unit 20, the provisional estimated speech calculation unit 30, the speech model fitness calculation unit 50, the expected value calculation unit 60, the filter calculation unit 70, and the filtering unit 80 may be electronic circuits, or may be a computer program and a processor that operates according to that program. The storage unit 40 is an electronic device, such as a magnetic disk or an electronic memory, whose access is controlled by an electronic circuit or by a computer program and a processor that operates according to that program.

The input signal acquisition unit 10 has a function of acquiring an input signal related to speech. The input signal acquired by the input signal acquisition unit 10 is, for example, a signal acquired from a microphone or the like through an A (Analog) / D (Digital) converter. Other examples include an input signal read from a hard disk and an input signal obtained from communication packets, but the input signal in the present invention is not limited to these.

The input signal acquisition unit 10 divides the digital signal corresponding to the acquired input signal into segments of a unit time. Hereinafter, the digital signal of one extracted segment is referred to as a frame. The frame corresponding to the input signal at time t is written x(t-τ) (τ = 0, ..., T-1, where T is the number of samples in a frame).

For example, a digital signal generated by linear PCM (Pulse Code Modulation) with a sampling frequency of 8000 Hz and 16 quantization bits contains 8000 values per second. If the length of one frame is 25 milliseconds, one frame contains 200 values. That is, in this case T = 200.
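The frame arithmetic above can be checked with a short sketch. This is only an illustration: the function name `split_into_frames` is hypothetical, and the use of consecutive, non-overlapping frames is an assumption, since the text gives only the frame length and not the frame shift.

```python
import numpy as np

def split_into_frames(signal, sample_rate_hz=8000, frame_ms=25):
    """Cut a 1-D signal into consecutive, non-overlapping frames.

    Non-overlapping frames are an assumption made for brevity; the
    text only states the frame length, not the frame shift.
    """
    samples_per_frame = sample_rate_hz * frame_ms // 1000  # T in the text
    n_frames = len(signal) // samples_per_frame
    trimmed = np.asarray(signal[: n_frames * samples_per_frame])
    return trimmed.reshape(n_frames, samples_per_frame)

# One second of 16-bit linear PCM sampled at 8000 Hz (all zeros here).
one_second = np.zeros(8000, dtype=np.int16)
frames = split_into_frames(one_second)
samples_per_frame = frames.shape[1]  # 200 samples for a 25 ms frame
```

With these settings one second of signal yields 40 frames of T = 200 samples each.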

Note that the input signal acquisition unit 10 may convert the frame corresponding to the input signal into the absolute value of the spectrum or into a power spectrum using a short-time discrete Fourier transform. In the following description, the input signal acquisition unit 10 is assumed to output the absolute value of the spectrum of each frame, but the present invention is not limited to this. Hereinafter, the absolute value of the spectrum is also simply referred to as the spectrum.

The spectrum of the input signal at time t is written X(t, k) (k = 0, ..., K-1). Here, k is the frequency bin index and K is the number of frequency bins.
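As a minimal sketch of how one frame could be turned into the magnitude spectrum X(t, k), the following uses NumPy's real-input FFT. The Hann window and the choice of keeping T // 2 + 1 bins are assumptions for illustration; the text does not specify a window or the value of K.

```python
import numpy as np

def magnitude_spectrum(frame):
    """Return the absolute value of the DFT of one real-valued frame.

    The Hann window is an assumption; the text does not name a window.
    """
    window = np.hanning(len(frame))
    return np.abs(np.fft.rfft(frame * window))

T = 200                                   # samples per frame
sample_rate_hz = 8000
n = np.arange(T)
frame = np.sin(2 * np.pi * 1000 * n / sample_rate_hz)  # 1 kHz test tone

X = magnitude_spectrum(frame)
K = len(X)                      # bins kept by rfft: T // 2 + 1
peak_bin = int(np.argmax(X))    # bin spacing is 8000 / 200 = 40 Hz
```

Because the bin spacing is 40 Hz, the 1 kHz tone peaks at bin 25.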

The input signal acquisition unit 10 supplies the spectrum X(t, k) of the input signal to each of the noise estimation unit 20, the provisional estimated speech calculation unit 30, and the filtering unit 80.

The noise estimation unit 20 has a function of estimating the spectrum of the noise component included in the spectrum of the input signal. Specifically, the noise estimation unit 20 acquires the spectrum X(t, k) of the input signal from the input signal acquisition unit 10 and estimates the spectrum of the noise component included in it. The noise estimation unit 20 supplies the estimated spectrum of the noise component to each of the provisional estimated speech calculation unit 30 and the filter calculation unit 70.

Hereinafter, the noise component estimated by the noise estimation unit 20 is referred to as estimated noise. The spectrum of the estimated noise is written N^(t, k) (k = 0, ..., K-1), using the hat symbol "^" to indicate an estimated value. In this document the hat symbol is written to the right of the preceding character because of notational restrictions, although it is normally written above that character. In the present embodiment, the noise estimation unit 20 calculates the estimated noise using a known technique such as the weighted noise estimation method (Weighted Noise Estimation; WiNE), but the method used in the present invention is not limited to this. The noise estimation unit 20 may calculate the estimated noise using a predetermined method other than WiNE.

The provisional estimated speech calculation unit 30 has a function of removing the estimated noise from the spectrum of the input signal and calculating the spectrum and speech feature of the provisional estimated speech. Specifically, the provisional estimated speech calculation unit 30 acquires the spectrum X(t, k) of the input signal from the input signal acquisition unit 10 and the estimated noise spectrum N^(t, k) from the noise estimation unit 20. It then calculates the spectrum of the provisional estimated speech based on X(t, k) and N^(t, k), and from that spectrum calculates a speech feature such as a mel spectrum or a mel cepstrum. These speech features may include first-order and second-order dynamic components. However, the speech features of the present invention are not limited to these. The provisional estimated speech calculation unit 30 supplies the calculated speech feature of the provisional estimated speech to the speech model fitness calculation unit 50 and the expected value calculation unit 60.

Hereinafter, the spectrum of the provisional estimated speech calculated by the provisional estimated speech calculation unit 30 is written S^(t, k) (k = 0, ..., K-1). The speech feature of the provisional estimated speech is written s^(t) (∈ R^M, where R^M denotes the M-dimensional real vector space). In the present embodiment, the provisional estimated speech calculation unit 30 calculates the spectrum of the provisional estimated speech using a known technique such as the spectral subtraction method (Spectral Subtraction; SS) or the Wiener filter method (Wiener Filter; WF), but the method used in the present invention is not limited to these. The provisional estimated speech calculation unit 30 may calculate the spectrum of the provisional estimated speech using a predetermined method.
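As one illustration of the spectral subtraction option named above, the sketch below subtracts the estimated noise magnitude from the input magnitude in each frequency bin. The flooring factor `floor` is a hypothetical parameter added to keep the result non-negative; it does not come from the text, and the text does not state which variant of spectral subtraction is used.

```python
import numpy as np

def spectral_subtraction(X, N_hat, floor=0.01):
    """Provisional speech spectrum by magnitude spectral subtraction.

    X      : magnitude spectrum of the input signal, X(t, k)
    N_hat  : estimated noise magnitude spectrum, N^(t, k)
    floor  : hypothetical flooring factor (an assumption, not from the
             text) that prevents negative magnitudes
    """
    X = np.asarray(X, dtype=float)
    N_hat = np.asarray(N_hat, dtype=float)
    return np.maximum(X - N_hat, floor * X)

X = np.array([1.0, 0.5, 0.2])
N_hat = np.array([0.3, 0.6, 0.1])
S_hat = spectral_subtraction(X, N_hat)  # the middle bin is floored
```

In the middle bin the noise estimate exceeds the input, so the floor value 0.01 * 0.5 is used instead of a negative magnitude.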

The storage unit 40 stores speech models obtained by modeling speech features. Specifically, as shown in FIG. 1, the storage unit 40 stores a Gaussian mixture model GMM (Gaussian Mixture Model) 401 and a hidden Markov model HMM (Hidden Markov Model) 402.

In the present embodiment, the GMM and the HMM are separate models, but the set of Gaussian distributions belonging to the HMM can also be treated as a GMM. The storage unit 40 may therefore store only the HMM.

The GMM 401 is a speech model that has no time transition information. The GMM 401 is created using, as training data, speech features extracted from speech data collected in advance. Specifically, the GMM consists of a plurality of Gaussian distributions, each of which has a weight, a mean vector, and a covariance matrix. Here, time transition information refers to constraints on the temporal state transitions of a model.

Hereinafter, the number of mixtures of the GMM (the number of Gaussian distributions constituting the GMM) is denoted N_GMM, and the weight, mean vector, and covariance matrix of the i-th Gaussian distribution are denoted w_{i,GMM}, μ_{i,GMM} (∈ R^M), and Σ_{i,GMM} (∈ R^{M×M}), respectively (i = 0, ..., N_GMM-1).

The speech data used to train the GMM is preferably subjected to the same feature calculation processing for provisional estimated speech as that performed by the provisional estimated speech calculation unit 30.

The HMM 402 is a speech model that has time transition information. Specifically, the HMM 402 has a plurality of states and state transition probabilities indicating the probability of transition between states. Each state has a GMM with a predetermined number of mixtures. The state transition probabilities correspond to the time transition information of the model.

Hereinafter, the number of states of the HMM 402 is denoted S (states are indexed s = 0, ..., S-1), the number of mixtures of the GMM in each state s is denoted N_{s,HMM}, and the weight, mean vector, and covariance matrix of the i-th Gaussian distribution are denoted w_{i,s,HMM}, μ_{i,s,HMM} (∈ R^M), and Σ_{i,s,HMM} (∈ R^{M×M}), respectively (i = 0, ..., N_{s,HMM}-1). The state transition probability of the model moving from state p to state q is denoted a_qp.
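The parameters listed above could be held in simple containers such as the following. This is a sketch only: the class names `Gmm` and `Hmm`, the field names, and the helper `uniform_gmm` are hypothetical and are not part of the described apparatus.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gmm:
    weights: np.ndarray  # shape (N,), mixture weights w_i
    means: np.ndarray    # shape (N, M), mean vectors mu_i
    covs: np.ndarray     # shape (N, M, M), covariance matrices Sigma_i

@dataclass
class Hmm:
    states: list         # one Gmm per state, length S
    trans: np.ndarray    # shape (S, S); trans[q, p] = a_qp (from p to q)

def uniform_gmm(n, m):
    """Placeholder GMM: equal weights, zero means, identity covariances."""
    return Gmm(np.full(n, 1.0 / n),
               np.zeros((n, m)),
               np.tile(np.eye(m), (n, 1, 1)))

gmm = uniform_gmm(4, 3)                          # N_GMM = 4, M = 3
hmm = Hmm([uniform_gmm(2, 3) for _ in range(3)],  # S = 3 states
          np.full((3, 3), 1.0 / 3))
```

Storing `trans[q, p]` for the transition from p to q follows the index order of a_qp in the text.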

The speech model fitness calculation unit 50 has a function of calculating the fitness of the provisional estimated speech to the speech model. Specifically, the speech model fitness calculation unit 50 acquires the feature s^(t) of the provisional estimated speech from the provisional estimated speech calculation unit 30 and acquires the GMM 401 stored in the storage unit 40. It then calculates the speech model fitness based on the acquired feature s^(t) and the GMM 401, and supplies the calculated speech model fitness to the expected value calculation unit 60.
The speech model fitness K(t) calculated by the speech model fitness calculation unit 50 can be expressed by Equation 1.

$$K(t) = \max_{i} L(i \mid \hat{s}(t)) \qquad \text{(Equation 1)}$$

Here, L(i | s^(t)) denotes the likelihood of Gaussian distribution i given the feature s^(t) of the provisional estimated speech, and can be expressed by Equation 2.

(Equation 2)

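N(x; μ, Σ) can be expressed by Equation 3.

$$N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{\mathsf{T}} \Sigma^{-1} (x-\mu)\right) \qquad \text{(Equation 3)}$$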
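The following sketch evaluates an M-dimensional Gaussian density N(x; μ, Σ) and a per-component likelihood. The form used for L(i | s^(t)), namely the mixture weight multiplied by the Gaussian density, is an assumption made for illustration; Equation 2 may differ, for example by a normalization term. The function names are hypothetical.

```python
import numpy as np

def gaussian_log_density(x, mean, cov):
    """Natural log of N(x; mean, cov) for an M-dimensional Gaussian."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)           # log |Sigma|
    maha = diff @ np.linalg.solve(cov, diff)     # (x-mu)^T Sigma^-1 (x-mu)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + maha)

def component_likelihoods(s_hat, weights, means, covs):
    """Assumed form of L(i | s_hat): w_i * N(s_hat; mu_i, Sigma_i)."""
    return np.array([
        w * np.exp(gaussian_log_density(s_hat, m, c))
        for w, m, c in zip(weights, means, covs)
    ])

weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
covs = np.array([np.eye(2), np.eye(2)])
L = component_likelihoods([0.0, 0.0], weights, means, covs)
```

Working with the log density and exponentiating at the end avoids forming the determinant and inverse explicitly.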

Note that although Equation 1 uses the maximum likelihood among the Gaussian distributions of the GMM 401 as the speech model fitness, the speech model fitness is not limited to this. For example, as shown in Equation 4, the speech model fitness may be the sum of the likelihoods of the top N Gaussian distributions i of the GMM 401, further summed over the period from the current time t back to time (t - T_K).

$$K(t) = \sum_{\tau=0}^{T_K} \sum_{i \in U_N} L(i \mid \hat{s}(t-\tau)) \qquad \text{(Equation 4)}$$

Here, U_N is the set of indices of the Gaussian distributions that rank in the top N when the distributions i are ordered by their likelihood L(i | s^(t)).
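The two fitness definitions described above, the maximum likelihood over the Gaussian distributions and the sum of the top-N likelihoods accumulated over recent frames, can be sketched as follows. The function names are hypothetical, and ranking the top-N set separately for each frame is an assumption; the text does not say whether U_N is fixed by the current frame or recomputed per frame.

```python
import numpy as np

def fitness_max(likelihoods_t):
    """Fitness as the largest per-component likelihood of one frame."""
    return float(np.max(likelihoods_t))

def fitness_topn_sum(likelihood_history, top_n, t_k):
    """Sum of the top-N likelihoods over the current frame and the
    t_k frames before it.

    likelihood_history has shape (frames, components); the last row is
    the current frame. The top-N set is ranked per frame (assumption).
    """
    recent = np.asarray(likelihood_history, dtype=float)[-(t_k + 1):]
    top = np.sort(recent, axis=1)[:, -top_n:]   # N largest in each row
    return float(top.sum())

history = np.array([[0.1, 0.5, 0.2],    # frame t-1
                    [0.3, 0.1, 0.4]])   # frame t
k_max = fitness_max(history[-1])
k_sum = fitness_topn_sum(history, top_n=2, t_k=1)
```

Here the maximum-based fitness of the current frame is 0.4, and the top-2 sum over both frames is (0.5 + 0.2) + (0.4 + 0.3).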

In Equation 1 the speech model fitness is calculated from likelihoods based on the GMM 401, but it may instead be calculated from likelihoods based on the HMM 402.

The expected value calculation unit 60 has a function of calculating the expected value of the speech spectrum using the feature of the provisional estimated speech, the speech model fitness, and the GMM 401 and HMM 402 stored in the storage unit 40. Specifically, the expected value calculation unit 60 acquires the feature s^(t) of the provisional estimated speech from the provisional estimated speech calculation unit 30, acquires the speech model fitness K(t) from the speech model fitness calculation unit 50, and acquires the GMM 401 and the HMM 402 from the storage unit 40.

Then, the expected value calculation unit 60 calculates the expected value S^_E(t, k) (k = 0, ..., K-1) of the speech spectrum from the acquired feature of the provisional estimated speech, the acquired speech model fitness, and the acquired GMM 401 and HMM 402. The expected value S^_E(t, k) of the speech spectrum is calculated by Equation 5.

$$\hat{S}_{E}(t,k) = \begin{cases} \hat{S}_{HMM}(t,k) & \text{if } K(t) \ge \mathrm{th1} \\ \hat{S}_{GMM}(t,k) & \text{if } K(t) < \mathrm{th1} \end{cases} \qquad \text{(Equation 5)}$$

Here, Equation 5 indicates that whether the expected speech spectrum S^_E(t, k) is calculated using the GMM 401 or the HMM 402 is decided according to the result of comparing the speech model fitness K(t) with a threshold th1. In Equation 5, S^_GMM(t, k) is the expected speech spectrum calculated using the GMM 401, and S^_HMM(t, k) is the expected speech spectrum calculated using the HMM 402; both are described below. An experimentally determined value can be used for the threshold th1.

In Equation 5 the model is selected for each frame t, but the model may instead be selected for each speech segment.
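The per-frame selection can be sketched as below: the HMM-based expected spectrum is used in frames whose fitness reaches the threshold, and the GMM-based one otherwise. The function name and array layout are hypothetical, and the threshold value in the example is arbitrary, since the text only says that th1 is determined experimentally.

```python
import numpy as np

def select_expected_spectra(fitness, th1, spectra_gmm, spectra_hmm):
    """Choose, per frame, between the GMM-based and HMM-based spectra.

    fitness      : shape (frames,), speech model fitness K(t)
    spectra_gmm  : shape (frames, K), expected spectra from the GMM
    spectra_hmm  : shape (frames, K), expected spectra from the HMM
    """
    use_hmm = np.asarray(fitness) >= th1          # True where fitness is high
    return np.where(use_hmm[:, None], spectra_hmm, spectra_gmm)

fitness = np.array([0.2, 0.9])
spectra_gmm = np.zeros((2, 3))   # stand-in values for illustration
spectra_hmm = np.ones((2, 3))
selected = select_expected_spectra(fitness, 0.5, spectra_gmm, spectra_hmm)
```

In this example the first frame falls below the threshold and takes the GMM-based row, while the second frame takes the HMM-based row.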

Next, the expected speech spectrum S^_GMM(t, k) calculated using the GMM 401 and the expected speech spectrum S^_HMM(t, k) calculated using the HMM 402 in Equation 5 are described.
The expected speech spectrum S^_GMM(t, k) calculated using the GMM 401 is given by Equation 6.

(Equation 6)

Here, S_{μ,i,GMM}(k) (k = 0, ..., K-1) is the value obtained by converting the mean vector μ_{i,GMM} of Gaussian distribution i of the GMM into a spectrum.
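One common way to form an expected spectrum from a GMM is to average the per-component mean spectra S_{μ,i,GMM}(k), weighting each by the normalized likelihood of its component. The sketch below shows that approach purely as an assumption; it is not confirmed to be the form of Equation 6, and the function name is hypothetical.

```python
import numpy as np

def expected_spectrum_gmm(likelihoods, mean_spectra):
    """Likelihood-weighted average of per-component mean spectra.

    likelihoods  : shape (N,), per-component likelihoods for one frame
    mean_spectra : shape (N, K), mean vectors converted to spectra
    This weighting is an assumed form, not taken from the text.
    """
    likelihoods = np.asarray(likelihoods, dtype=float)
    posteriors = likelihoods / likelihoods.sum()   # weights sum to 1
    return posteriors @ np.asarray(mean_spectra, dtype=float)

likelihoods = np.array([0.3, 0.1])
mean_spectra = np.array([[1.0, 2.0],
                         [3.0, 4.0]])
S_gmm = expected_spectrum_gmm(likelihoods, mean_spectra)
```

With normalized weights 0.75 and 0.25, the result lies closer to the first component's mean spectrum.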

Next, the expected speech spectrum S^_HMM(t, k) calculated using the HMM 402 is given by Equation 7. Its calculation is described here based on the calculation method used in Patent Document 2.

(Equation 7)

Here, S denotes the number of states, and S_{μ,i,s,HMM}(k) (k = 0, ..., K-1) is the value obtained by converting the mean vector μ_{i,s,HMM} of Gaussian distribution i in state s of the HMM into a spectrum. The weighting coefficient α(s, t) is calculated recursively by Equations 8 and 9 using the state transition probabilities a_qp, which are the time transition information. Note that the second term on the right side of Equation 9 is a natural logarithm.

... (Equation 8)

... (Equation 9)
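Equations 8 and 9 are not reproduced in this text. Because the weighting coefficient α(s, t) is described as being computed recursively from the state transition probabilities, with a natural-logarithm term, a log-domain forward recursion is one plausible reading. The sketch below assumes that reading; the function names and the final normalization are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log values."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))

def forward_step(log_alpha_prev, log_trans, log_obs):
    """log_alpha[s] = logsumexp over q of (log_alpha_prev[q] + log_trans[q][s]) + log_obs[s]."""
    n = len(log_obs)
    return [logsumexp([log_alpha_prev[q] + log_trans[q][s] for q in range(n)]) + log_obs[s]
            for s in range(n)]

def normalize(log_alpha):
    """Convert log forward scores into state weights that sum to one."""
    total = logsumexp(log_alpha)
    return [math.exp(v - total) for v in log_alpha]
```

Here log_trans[q][s] holds the logarithm of the probability of moving from state q to state s, and log_obs[s] holds the log likelihood of the current feature in state s. Because each step depends on the previous frame's scores, a run of noise-dominated frames can carry its influence forward, which is the situation the model switching is intended to address.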

As shown in Equation 5, the expected value calculation unit 60 selects either the GMM, which has no time transition information, or the HMM, which has time transition information, as the speech model used to calculate the expected value of the speech spectrum. The expected value calculation unit 60 makes this selection based on the model fitness K(t), that is, the degree to which the temporary estimated speech fits the speech model. When the fitness is equal to or greater than the threshold, the expected value calculation unit 60 calculates the expected value of the speech spectrum using the model that has time transition information. When the fitness is less than the threshold, it calculates the expected value using the model that has no time transition information. The expected value calculation unit 60 supplies the calculated expected value of the speech spectrum to the filter calculation unit 70.

The filter calculation unit 70 has a function of calculating a suppression filter that suppresses noise, based on the estimated noise spectrum and the expected value of the speech spectrum. Specifically, the filter calculation unit 70 acquires the estimated noise spectrum N^(t, k) from the noise estimation unit 20 and the expected value S^_E(t, k) of the speech spectrum from the expected value calculation unit 60, and calculates the suppression filter from them. The suppression filter is denoted W(t, k) (k = 0, ..., K-1). The filter calculation unit 70 calculates the suppression filter W(t, k) using Equation 10.

... (Equation 10)

Thereafter, the filter calculation unit 70 supplies the calculated suppression filter W(t, k) to the filtering unit 80.
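Equation 10 is not reproduced in this text. Since the embodiment is described as following a model-based Wiener filtering approach, the sketch below assumes the usual Wiener gain computed from the expected speech spectrum and the estimated noise spectrum. The function name and the floor parameter are hypothetical.

```python
def wiener_gain(s_expected, n_estimated, floor=1e-12):
    """Per-bin gain W(k) = S^2 / (S^2 + N^2) from magnitude spectra (assumed Equation 10 form)."""
    gains = []
    for s, n in zip(s_expected, n_estimated):
        ps, pn = s * s, n * n
        gains.append(ps / max(ps + pn, floor))   # floor avoids division by zero
    return gains
```

With this form the gain approaches 1 in bins where the expected speech dominates and approaches 0 in bins where the estimated noise dominates.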

The filtering unit 80 has a function of outputting the estimated speech (output signal) by filtering the input signal with the suppression filter. Specifically, the filtering unit 80 acquires the spectrum X(t, k) of the input signal from the input signal acquisition unit 10 and the suppression filter W(t, k) from the filter calculation unit 70. The filtering unit 80 then uses the acquired input signal spectrum and the suppression filter to calculate the spectrum of the estimated speech output by the speech processing device 1. The spectrum of the estimated speech is denoted S^_OUT(t, k) (k = 0, ..., K-1). The filtering unit 80 calculates the spectrum of the estimated speech using Equation 11.

... (Equation 11)

The filtering unit 80 outputs the spectrum of the estimated speech calculated using Equation 11. When the output destination is a speech recognition device, the filtering unit 80 converts the spectrum of the estimated speech into a feature vector and outputs the feature vector of the estimated speech. When the output destination is a sound reproduction device such as a speaker, the filtering unit 80 converts the spectrum of the estimated speech into a time-domain signal by inverse Fourier transform and outputs the resulting digital signal.
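Equation 11 is not reproduced in this text. The sketch below assumes the common case in which the suppression filter is applied as a per-bin multiplication of the input spectrum. The function name is hypothetical.

```python
def apply_suppression_filter(x_spectrum, gains):
    """Estimated speech spectrum as the element-wise product of W(t, k) and X(t, k)."""
    if len(x_spectrum) != len(gains):
        raise ValueError("spectrum and filter must have the same number of bins")
    return [w * x for w, x in zip(gains, x_spectrum)]
```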

Next, the operation (processing) of the speech processing device 1 of the present embodiment will be described in detail with reference to the flowchart of FIG. 2.

The input signal acquisition unit 10 acquires an input signal (step S101). The input signal acquisition unit 10 divides the digital signal generated from the acquired input signal into frames of a unit time. It converts each extracted frame into the absolute value of its spectrum and outputs the result as the input signal spectrum (step S102). The noise estimation unit 20 estimates the spectrum of the noise component from the input signal spectrum output by the input signal acquisition unit 10, and outputs the estimated noise spectrum (step S103).

The temporary estimated speech calculation unit 30 calculates the spectrum and the speech feature amount of the temporary estimated speech using the input signal spectrum output from the input signal acquisition unit 10 and the estimated noise spectrum output from the noise estimation unit 20. The temporary estimated speech calculation unit 30 outputs the calculated speech feature amount of the temporary estimated speech (step S104).

The speech model fitness calculation unit 50 calculates the speech model fitness using the speech feature amount of the temporary estimated speech output from the temporary estimated speech calculation unit 30 and the GMM 401 or the HMM 402 stored in the storage unit 40. The speech model fitness calculation unit 50 outputs the calculated speech model fitness to the expected value calculation unit 60 (step S105).

When the speech model fitness is equal to or greater than the threshold (Yes in step S106), the expected value calculation unit 60 calculates the expected value of the speech spectrum using the speech feature amount of the temporary estimated speech output from the temporary estimated speech calculation unit 30 and the HMM 402 stored in the storage unit 40. The expected value calculation unit 60 outputs the calculated expected value of the speech spectrum to the filter calculation unit 70 (step S107).

When the speech model fitness is less than the threshold (No in step S106), the expected value calculation unit 60 calculates the expected value of the speech spectrum using the speech feature amount of the temporary estimated speech output from the temporary estimated speech calculation unit 30 and the GMM 401 stored in the storage unit 40. The expected value calculation unit 60 outputs the calculated expected value of the speech spectrum to the filter calculation unit 70 (step S108).

The filter calculation unit 70 calculates the suppression filter using the estimated noise spectrum output from the noise estimation unit 20 and the expected value of the speech spectrum output from the expected value calculation unit 60. The filter calculation unit 70 outputs the calculated suppression filter to the filtering unit 80 (step S109). The filtering unit 80 calculates the estimated speech using the input signal spectrum output from the input signal acquisition unit 10 and the suppression filter output from the filter calculation unit 70 (step S110).

The input signal acquisition unit 10 checks whether there is a further input signal. If there is (Yes in step S111), the processing returns to step S101; if there is not (No in step S111), the entire processing ends.
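The flow of steps S101 to S111 can be sketched as a per-frame loop. This is a minimal sketch in which every processing unit is passed in as a hypothetical callable. None of these names come from the specification, and the filtering in step S110 is assumed to be a per-bin multiplication.

```python
def run_pipeline(frames, estimate_noise, provisional_speech, model_fitness,
                 expect_hmm, expect_gmm, make_filter, th1):
    """Process input spectra frame by frame, following steps S101 to S111."""
    outputs = []
    for x in frames:                            # S101-S102: next input signal spectrum
        n = estimate_noise(x)                   # S103: estimated noise spectrum
        feature = provisional_speech(x, n)      # S104: temporary estimated speech feature
        fitness = model_fitness(feature)        # S105: speech model fitness
        if fitness >= th1:                      # S106: compare with the threshold
            s_e = expect_hmm(feature)           # S107: expectation from the HMM
        else:
            s_e = expect_gmm(feature)           # S108: expectation from the GMM
        w = make_filter(s_e, n)                 # S109: suppression filter
        outputs.append([wk * xk for wk, xk in zip(w, x)])  # S110: filtering
    return outputs                              # S111: no further input
```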

The speech processing device 1 according to the present embodiment retains the benefit of selecting the distributions included in the speech model based on time transition information, and still obtains high noise suppression accuracy when the state transitions of the speech feature amounts are continuously and strongly affected by noise. This is because, when the expected value calculation unit 60 calculates the expected value of the speech spectrum, it uses the HMM 402 when the speech model fitness is equal to or greater than the threshold and uses the GMM 401 when the speech model fitness is less than the threshold.

The speech processing device 1 according to the present embodiment exploits the property that the model fitness of a speech signal decreases as the proportion of noise components in the signal increases. That is, when calculating the expected value of the speech spectrum, the speech processing device 1 uses the HMM 402 when the model fitness is high, and thereby benefits from selecting the distributions included in the speech model based on the time transition information of the HMM. When the model fitness is low, the speech processing device 1 avoids the HMM 402 and uses the GMM 401, and thereby obtains high noise suppression accuracy even when the state transitions of the speech feature amounts are continuously and strongly affected by noise.

Although the present embodiment adopts, as its noise suppression method, a method similar to the model-based Wiener filtering method (Patent Document 1), for example as shown in Equation 5, another method may be used. The noise suppression performed in the present embodiment need only be model-based noise suppression, and may be performed, for example, in the form of the vector Taylor series expansion described in Non-Patent Document 1.

<Second Embodiment>
FIG. 3 is a block diagram conceptually showing the configuration of the speech processing device 2 of the second embodiment. For convenience of explanation, components having the same functions as those included in FIG. 1, described for the first embodiment, are given the same reference numerals as in the first embodiment, and their description is omitted.

The speech processing device 2 according to the present embodiment uses the signal-to-noise ratio (SNR) as a numerical value indicating how deeply the input signal is buried in noise. When performing noise suppression, the speech processing device 2 uses a speech model without time transition information (model A) when the SNR is low, and a speech model with time transition information (model B) when the SNR is high.
As shown in FIG. 3, the configuration of the speech processing device 2 is the same as that of the speech processing device 1 except that the speech model fitness calculation unit 50 of the speech processing device 1 is replaced with an SNR calculation unit 90.

Unlike in the first embodiment, the noise estimation unit 20 according to the present embodiment supplies the estimated spectrum of the noise component not only to the temporary estimated speech calculation unit 30 and the filter calculation unit 70 but also to the SNR calculation unit 90. The temporary estimated speech calculation unit 30 according to the present embodiment supplies the calculated spectrum of the temporary estimated speech to the SNR calculation unit 90. The expected value calculation unit 60 according to the present embodiment acquires the SNR calculated by the SNR calculation unit 90.

The SNR calculation unit 90 calculates the SNR from the temporary estimated speech spectrum and the estimated noise spectrum. Specifically, the SNR calculation unit 90 acquires the spectrum S^(t, k) of the temporary estimated speech from the temporary estimated speech calculation unit 30 and the estimated noise spectrum N^(t, k) from the noise estimation unit 20. The SNR calculation unit 90 then supplies the calculated SNR to the expected value calculation unit 60.

Hereinafter, denoting the SNR calculated by the SNR calculation unit 90 as SNR(t), SNR(t) can be expressed by Equation 12.

... (Equation 12)

Here, T_avg is the number of frames used for the average value calculation. The frames used for the average value calculation may be those belonging to a speech section obtained by speech detection. The SNR calculation method is not limited to Equation 12.
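Equation 12 is not reproduced in this text. The sketch below assumes one plausible form: the ratio, in decibels, of the temporary estimated speech power to the estimated noise power, accumulated over the most recent T_avg frames. The function name, the decibel scale, and the floor parameter are assumptions.

```python
import math

def snr_db(speech_frames, noise_frames, t, t_avg, floor=1e-12):
    """SNR in dB over the t_avg frames ending at frame t, inclusive (assumed form)."""
    start = max(0, t - t_avg + 1)
    sig = sum(v * v for frame in speech_frames[start:t + 1] for v in frame)
    noi = sum(v * v for frame in noise_frames[start:t + 1] for v in frame)
    return 10.0 * math.log10(max(sig, floor) / max(noi, floor))
```

Averaging over several frames smooths the SNR, so a single noisy frame is less likely to flip the model selection.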

The expected value calculation unit 60 calculates the expected value of the speech spectrum using the speech feature amount of the temporary estimated speech, the SNR, and the GMM 401 and HMM 402 stored in the storage unit 40. Specifically, the expected value calculation unit 60 acquires the feature amount s^(t) of the temporary estimated speech from the temporary estimated speech calculation unit 30, SNR(t) from the SNR calculation unit 90, and the GMM 401 and HMM 402 from the storage unit 40.

Then, the expected value calculation unit 60 calculates the expected value S^_E(t, k) (k = 0, ..., K-1) of the speech spectrum from the acquired speech feature amount of the temporary estimated speech, the acquired SNR, and the acquired GMM 401 and HMM 402. The expected value S^_E(t, k) of the speech spectrum is calculated by Equation 13.

... (Equation 13)

Here, Equation 13 indicates that whether the expected speech spectrum value S^_E(t, k) is set to S^_GMM(t, k) or to S^_HMM(t, k) is determined by comparing SNR(t) with the threshold th2. S^_GMM(t, k) is the expected value of the speech spectrum calculated using the GMM 401, and S^_HMM(t, k) is the expected value of the speech spectrum calculated using the HMM 402. An experimentally determined value can be used for the threshold th2.
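Equation 13 is not reproduced in this text. From the description of the second embodiment (the HMM 402 is used when the SNR is equal to or greater than the threshold, the GMM 401 otherwise), it can be written in the following piecewise form. This is a reconstruction from the description, not a copy of the original equation.

```latex
\hat{S}_E(t,k) =
\begin{cases}
\hat{S}_{\mathrm{HMM}}(t,k) & \text{if } \mathrm{SNR}(t) \ge th_2 \\
\hat{S}_{\mathrm{GMM}}(t,k) & \text{otherwise}
\end{cases}
```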

The speech processing device 2 according to the present embodiment retains the benefit of selecting the distributions included in the speech model based on time transition information, and still obtains high noise suppression accuracy when the state transitions of the speech feature amounts are continuously and strongly affected by noise. This is because, when the expected value calculation unit 60 calculates the expected value of the speech spectrum, it uses the HMM 402 when the SNR is equal to or greater than the threshold and uses the GMM 401 when the SNR is less than the threshold.

The speech processing device 2 according to the present embodiment exploits the property that the SNR of a speech signal decreases as the proportion of noise components in the signal increases. That is, when calculating the expected value of the speech spectrum, the speech processing device 2 uses the HMM 402 when the SNR is high, and thereby benefits from selecting the distributions included in the speech model based on the time transition information of the HMM. When the SNR is low, the speech processing device 2 avoids the HMM 402 and uses the GMM 401, and thereby obtains high noise suppression accuracy even when the state transitions of the speech feature amounts are continuously and strongly affected by noise.

<Third Embodiment>
FIG. 4 is a block diagram conceptually showing the configuration of the speech processing device 3 of the third embodiment. For convenience of explanation, components having the same functions as those included in FIG. 1, described for the first embodiment, are given the same reference numerals as in the first embodiment, and their description is omitted.

The speech processing device 3 according to the present embodiment exploits the property that an input signal buried in noise tends to fit specific distributions included in the speech model, and judges how deeply the input signal is buried in noise from whether the input signal fits those specific distributions. When performing noise suppression, the speech processing device 3 uses a speech model without time transition information (model A) when the input signal fits a specific distribution, and a speech model with time transition information (model B) when it does not.

As shown in FIG. 4, the configuration of the speech processing device 3 is the same as that of the speech processing device 1 except that the speech model fitness calculation unit 50 of the speech processing device 1 is replaced with a designated distribution matching unit 91 and a designated distribution storage unit 41 is added.

The designated distribution storage unit 41 stores designated distributions. Specifically, the designated distribution storage unit 41 stores a distribution set P designated in advance from among the Gaussian distributions included in the GMM 401 stored in the storage unit 40, and supplies this distribution set P to the designated distribution matching unit 91. Here, a designated distribution is a distribution that an input signal buried in noise tends to fit, and can be determined experimentally in advance, for example.

The designated distribution matching unit 91 determines whether the distribution that gives the maximum likelihood for the speech feature amount of the temporary estimated speech, among the Gaussian distributions included in the GMM 401, is included in the designated distribution set P. Specifically, the designated distribution matching unit 91 acquires the feature amount s^(t) of the temporary estimated speech from the temporary estimated speech calculation unit 30 and the designated distribution set P from the designated distribution storage unit 41. The designated distribution matching unit 91 then finds, among the Gaussian distributions included in the GMM 401, the distribution that gives the maximum likelihood for the feature amount s^(t), and matches it against the distribution set P. The designated distribution matching unit 91 supplies the matching result to the expected value calculation unit 60.

Denoting the matching result of the designated distribution matching unit 91 as Q(t), the matching result Q(t) can be expressed by Equation 14.

... (Equation 14)

Here, p' denotes the distribution that gives the maximum likelihood for the feature amount s^(t) of the temporary estimated speech among the Gaussian distributions included in the GMM 401.
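Equation 14 is not reproduced in this text. The matching it describes can be sketched as an argmax over per-component likelihoods followed by a set membership test. The function names are hypothetical, and the encoding of Q(t) as 1 for a match and 0 otherwise is an assumption.

```python
def best_component(log_likelihoods):
    """Index p' of the Gaussian giving the largest likelihood for the current feature."""
    return max(range(len(log_likelihoods)), key=log_likelihoods.__getitem__)

def matching_result(log_likelihoods, designated):
    """Q(t): 1 if p' belongs to the designated set P, otherwise 0 (assumed encoding)."""
    return 1 if best_component(log_likelihoods) in designated else 0
```

Storing P as a set of component indices keeps the membership test constant-time per frame.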

The expected value calculation unit 60 calculates the expected value of the speech spectrum using the feature amount of the temporary estimated speech, the matching result, and the GMM 401 and HMM 402 stored in the storage unit 40. Specifically, the expected value calculation unit 60 acquires the feature amount s^(t) of the temporary estimated speech from the temporary estimated speech calculation unit 30, Q(t) from the designated distribution matching unit 91, and the GMM 401 and HMM 402 from the storage unit 40.

Then, the expected value calculation unit 60 calculates the expected value S^_E(t, k) (k = 0, ..., K-1) of the speech spectrum from the acquired feature amount of the temporary estimated speech, the acquired matching result, and the acquired GMM 401 and HMM 402. The expected value S^_E(t, k) of the speech spectrum is calculated by Equation 15 below.

... (Equation 15)

Here, Equation 15 indicates that whether the expected speech spectrum value S^_E(t, k) is set to S^_GMM(t, k) or to S^_HMM(t, k) is determined according to whether the value of Q(t) is 1. S^_GMM(t, k) is the expected value of the speech spectrum calculated using the GMM 401, and S^_HMM(t, k) is the expected value of the speech spectrum calculated using the HMM 402.
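Equation 15 is not reproduced in this text. Assuming that Q(t) = 1 denotes that the maximum-likelihood distribution belongs to the designated set P (an assumption about the encoding), the selection described for the third embodiment can be written as follows.

```latex
\hat{S}_E(t,k) =
\begin{cases}
\hat{S}_{\mathrm{GMM}}(t,k) & \text{if } Q(t) = 1 \\
\hat{S}_{\mathrm{HMM}}(t,k) & \text{otherwise}
\end{cases}
```

Under this reading, a match with the designated distributions is treated as evidence that the frame is buried in noise, so the model without time transition information is used.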

In the present embodiment, an example has been shown in which the speech processing device 3 matches the distribution of the GMM 401 that has the maximum likelihood, as calculated from the temporary estimated speech, against the distribution set P. The speech processing device 3 may instead match a state of the HMM 402 calculated from the temporary estimated speech against a designated state set. Alternatively, the speech processing device 3 may match a state sequence of the HMM 402 calculated from the temporary estimated speech against a designated set of state sequences.

The speech processing device 3 according to the present embodiment retains the benefit of selecting the distributions included in the speech model based on time transition information, and still obtains high noise suppression accuracy when the state transitions of the speech feature amounts are continuously and strongly affected by noise. This is because, when the expected value calculation unit 60 calculates the expected value of the speech spectrum, it uses the HMM 402 when the input signal does not fit a specific distribution and uses the GMM 401 when the input signal fits that specific distribution.

The speech processing device 3 according to the present embodiment exploits the property that a speech signal becomes more likely to fit specific distributions included in the speech model as the proportion of noise components in the signal increases. That is, when calculating the expected value of the speech spectrum, the speech processing device 3 uses the HMM 402 when the input signal does not fit a specific distribution, and thereby benefits from selecting the distributions included in the speech model based on the time transition information of the HMM. When the input signal fits a specific distribution, the speech processing device 3 avoids the HMM 402 and uses the GMM 401, and thereby obtains high noise suppression accuracy even when the state transitions of the speech feature amounts are continuously and strongly affected by noise.

<Fourth Embodiment>
FIG. 5 is a block diagram conceptually showing the configuration of the speech processing device 4 of the fourth embodiment.

The speech processing device 4 according to the present embodiment includes a storage unit 40, an expected value calculation unit 60, and a noise suppression unit 81.

The storage unit 40 stores a first speech model having time transition information that indicates the record of state transitions of speech feature amounts, and a second speech model that does not have such time transition information.

The expected value calculation unit 60 selects one of the first and second speech models according to a predetermined criterion, based on information indicated by the input signal, which is a signal in which a speech component and a noise component are mixed. The expected value calculation unit 60 calculates an expected value of the speech component using the selected speech model and the input signal.

The noise suppression unit 81 uses the expected value calculated by the expected value calculation unit 60 to generate noise-suppressed speech in which the noise component included in the input signal is suppressed.

The speech processing device 4 according to the present embodiment retains the benefit of selecting the distributions included in the speech model based on time transition information, and still obtains high noise suppression accuracy when the state transitions of the speech feature amounts are continuously and strongly affected by noise. This is because, when the expected value calculation unit 60 calculates the expected value of the speech spectrum, it selects one of the first and second speech models according to the predetermined criterion based on the information indicated by the input signal, and uses the selected speech model.
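The structure of the fourth embodiment can be sketched as a small class in which the two speech models, the selection criterion, and the suppression step are interchangeable parts. The class, attribute, and method names are hypothetical, and each part is represented by a plain callable for illustration only.

```python
class SpeechProcessor:
    """Sketch combining storage unit 40, expected value calculation unit 60 and noise suppression unit 81."""

    def __init__(self, first_model, second_model, prefer_first, suppress):
        self.first_model = first_model      # model with time transition information
        self.second_model = second_model    # model without time transition information
        self.prefer_first = prefer_first    # predetermined criterion (hypothetical callable)
        self.suppress = suppress            # noise suppression step (hypothetical callable)

    def process(self, x):
        """Select a model from the input, compute the expectation, then suppress noise."""
        model = self.first_model if self.prefer_first(x) else self.second_model
        expected = model(x)
        return self.suppress(x, expected)
```

The first to third embodiments then correspond to different choices of the criterion: a fitness threshold, an SNR threshold, or a designated-distribution match.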

<Hardware configuration example>
Each unit shown in FIG. 1 and FIGS. 3 to 5 of the above-described embodiments can be realized by dedicated hardware (electronic circuits). Each unit (except the storage unit 40) can also be regarded as a functional (processing) unit of a software program, that is, a software module. However, the division into units shown in these drawings is made for convenience of explanation, and various configurations are possible in an actual implementation. An example of the hardware environment in that case is described with reference to FIG. 6.

FIG. 6 illustrates the configuration of an information processing apparatus 900 (computer) capable of implementing the speech processing apparatus according to each exemplary embodiment of the present invention. That is, FIG. 6 shows the configuration of a computer (information processing apparatus) capable of realizing the speech processing apparatuses shown in FIG. 1 and FIGS. 3 to 5, and represents a hardware environment capable of realizing each function of the above-described embodiments.

The information processing apparatus 900 illustrated in FIG. 6 includes the following components:
- CPU 901 (Central Processing Unit),
- ROM 902 (Read Only Memory),
- RAM 903 (Random Access Memory),
- hard disk 904 (storage device),
- communication interface 905 for communicating with external devices,
- reader/writer 908 capable of reading and writing data stored in a storage medium 907 such as a CD-ROM (Compact Disc Read Only Memory),
- input/output interface 909.
The information processing apparatus 900 is a general computer in which these components are connected via a bus 906 (communication line).

The present invention, described using the above embodiments as examples, is achieved by supplying the information processing apparatus 900 shown in FIG. 6 with a computer program capable of realizing the following functions, and then having the CPU 901 of that hardware read, interpret, and execute the program. The functions are those of the block diagrams (FIG. 1 and FIGS. 3 to 5) or the flowchart (FIG. 2) referred to in the description of the embodiments. The computer program supplied to the apparatus may be stored in readable and writable volatile memory (RAM 903) or in a nonvolatile storage device such as the hard disk 904.

In this case, a currently common procedure can be used to supply the computer program to the hardware. Examples include installing it into the apparatus via a storage medium 907 such as a CD-ROM, and downloading it from outside via a communication line such as the Internet. In such a case, the present invention can be regarded as being constituted by the code of the computer program, or by the storage medium 907 in which that code is stored.

The present invention has been described above using the above-described embodiments as exemplary examples. However, the present invention is not limited to those embodiments. That is, various aspects that can be understood by those skilled in the art can be applied within the scope of the present invention.

This application claims priority based on Japanese Patent Application No. 2013-259846, filed on December 17, 2013, the entire disclosure of which is incorporated herein.

DESCRIPTION OF SYMBOLS
1 Speech processing apparatus
2 Speech processing apparatus
3 Speech processing apparatus
4 Speech processing apparatus
10 Input signal acquisition unit
20 Noise estimation unit
30 Temporary estimated speech calculation unit
40 Storage unit
401 GMM
402 HMM
41 Designated distribution storage unit
50 Speech model suitability calculation unit
60 Expected value calculation unit
70 Filter calculation unit
80 Filtering unit
81 Noise suppression unit
90 SNR calculation unit
91 Designated distribution matching unit
900 Information processing apparatus
901 CPU
902 ROM
903 RAM
904 Hard disk
905 Communication interface
906 Bus
907 Storage medium
908 Reader/writer
909 Input/output interface

Claims

A speech processing apparatus comprising:
storage means for storing a first speech model having time transition information indicating a record of state transitions relating to speech features, and a second speech model not having the time transition information;
expected value calculation means for selecting one of the first and second speech models according to a predetermined criterion from information indicated by an input signal in which a speech component and a noise component are mixed, and for calculating an expected value for the speech component using the selected speech model and the input signal; and
noise suppression means for generating first noise-suppressed speech in which the noise component contained in the input signal is suppressed using the expected value.

The speech processing apparatus according to claim 1, further comprising temporary estimated speech calculation means for generating a signal relating to second noise-suppressed speech in which the noise component contained in the input signal is suppressed by a predetermined method,
wherein the expected value calculation means selects one of the first and second speech models from information represented by the signal relating to the second noise-suppressed speech.

The speech processing apparatus according to claim 2, further comprising speech model suitability calculation means for calculating a speech model suitability indicating the degree to which the second noise-suppressed speech fits the first or second speech model,
wherein the expected value calculation means selects the first speech model when the value of the speech model suitability is greater than or equal to a threshold, and selects the second speech model when that value is less than the threshold.

The speech processing apparatus according to claim 2 or 3, further comprising speech-to-noise ratio calculation means for calculating, for the second noise-suppressed speech, a speech-to-noise ratio indicating the ratio of the speech component to the noise component,
wherein the expected value calculation means selects the first speech model when the value of the speech-to-noise ratio is greater than or equal to a threshold, and selects the second speech model when that value is less than the threshold.

The speech processing apparatus according to claim 2, further comprising:
designated distribution storage means for storing a predetermined distribution set included in the first or second speech model; and
designated distribution matching means for performing pattern matching between a third speech model relating to the second noise-suppressed speech and the predetermined distribution set, and determining whether or not the third speech model is included in the predetermined distribution set,
wherein the expected value calculation means selects the second speech model when the third speech model is included in the predetermined distribution set, and selects the first speech model when the third speech model is not included in the predetermined distribution set.

The speech processing apparatus according to claim 5, wherein the designated distribution matching means uses a Gaussian mixture model as the third speech model, and performs pattern matching between the predetermined distribution set and the Gaussian distribution that, among the plurality of Gaussian distributions included in the Gaussian mixture model, gives the maximum likelihood for the speech feature of the second noise-suppressed speech.

A speech processing method comprising, by an information processing apparatus:
referring to storage means that stores a first speech model having time transition information indicating a record of state transitions relating to speech features and a second speech model not having the time transition information, selecting one of the first and second speech models according to a predetermined criterion from information indicated by an input signal in which a speech component and a noise component are mixed, and calculating an expected value for the speech component using the selected speech model and the input signal; and
generating, using the expected value, first noise-suppressed speech in which the noise component contained in the input signal is suppressed.

The speech processing method according to claim 7, further comprising:
generating a signal relating to second noise-suppressed speech in which the noise component contained in the input signal is suppressed by a predetermined method; and
selecting one of the first and second speech models from information represented by the signal relating to the second noise-suppressed speech.

A computer-readable recording medium storing a speech processing program that causes a computer to execute:
an expected value calculation process of referring to storage means that stores a first speech model having time transition information indicating a record of state transitions relating to speech features and a second speech model not having the time transition information, selecting one of the first and second speech models according to a predetermined criterion from information indicated by an input signal in which a speech component and a noise component are mixed, and calculating an expected value for the speech component using the selected speech model and the input signal; and
a noise suppression process of generating, using the expected value, first noise-suppressed speech in which the noise component contained in the input signal is suppressed.

The computer-readable recording medium storing the speech processing program according to claim 9, wherein the program further causes the computer to execute a temporary estimated speech calculation process of generating a signal relating to second noise-suppressed speech in which the noise component contained in the input signal is suppressed by a predetermined method, and
wherein the expected value calculation process selects one of the first and second speech models from information represented by the signal relating to the second noise-suppressed speech.