JP6734233B2

JP6734233B2 - Signal processing device, case model generation device, collation device, signal processing method, and signal processing program

Info

Publication number: JP6734233B2
Application number: JP2017150755A
Authority: JP
Inventors: 小川 厚徳 (Atsunori Ogawa); 木下 慶介 (Keisuke Kinoshita); マーク・デルクロア (Marc Delcroix); 中谷 智広 (Tomohiro Nakatani)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-08-03
Filing date: 2017-08-03
Publication date: 2020-08-05
Anticipated expiration: 2037-08-03
Also published as: JP2019028390A

Description

The present invention relates to a signal processing device, a case model generation device, a collation device, a signal processing method, and a signal processing program.

Techniques for picking up an acoustic signal and extracting the target speech component have conventionally been used in speech recognition systems, hearing aids, video conferencing systems, machine control interfaces, music information processing systems for music search and transcription, and similar applications.

In general, when an acoustic signal is picked up in a real environment with noise and reverberation, what is observed is not the target speech signal alone but a signal on which noise and reverberation (acoustic distortion) are superimposed. Once noise and reverberation are superimposed, the original speech components become difficult to extract, which greatly degrades the clarity and intelligibility of the speech signal. As a result, it becomes difficult to extract the properties of the original speech signal and, for example, the recognition rate of a speech recognition system falls. Preventing this drop in recognition rate requires some means of removing the superimposed acoustic distortion.

To this end, a signal processing device has been proposed that uses case models represented by a Gaussian mixture model (GMM): it evaluates the similarity between the case models and features derived from the input speech, and takes the case model showing high similarity as the speech signal candidate (see, for example, Non-Patent Document 1).

The conventional signal processing device performs signal processing using case models represented by a mixture distribution model trained in advance. A case model contains, for example, the amplitude spectrum of the clean speech corresponding to each case and a sequence (segment) of indices of the Gaussian mixture component that gives the maximum likelihood for the feature of each frame (for example, mel-frequency cepstral coefficients). To obtain the amplitude spectrum of the clean speech most similar to the input speech, the signal processing device takes the segment of features generated from the input speech and searches the previously obtained case models for the segment that gives the maximum posterior probability.

Such a conventional signal processing device has been reported to remove highly time-varying noise, which had previously been difficult. Highly time-varying noise here means noise such as the alarm of an alarm clock, as opposed to background noise.

Non-Patent Document 1: J. Ming, R. Srinivasan, and D. Crookes, “A Corpus-Based Approach to Speech Enhancement From Nonstationary Noise”, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 4, pp. 822-836, May 2011.
Non-Patent Document 2: G. Hinton, L. Deng, D. Yu, G.E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, and B. Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups”, IEEE Signal Processing Magazine, pp. 82-97, Nov. 2012.

However, the mel-frequency cepstral coefficients used for the segment search are a simple feature derived from the amplitude spectrum. When the input signal contains noise or reverberation, the mel-frequency cepstral coefficients are therefore also affected by them, and the segment search in the conventional signal processing device was not always accurate.

In addition, although case models are prepared with a variety of acoustic distortion environments in mind, it is in practice difficult to prepare case models covering every such environment. The conventional signal processing device was therefore sometimes unable to find, among the case models, a segment highly similar to the segment of features generated from the input signal.

Consequently, because the features used for the search are affected by noise and reverberation, the conventional signal processing device was limited in how accurately it could search for clean speech features similar to the input signal.

The present invention has been made in view of the above, and its object is to provide a signal processing device, a case model generation device, a collation device, a signal processing method, and a signal processing program that accurately search for clean speech similar to an input signal.

To solve the problem described above and achieve the object, a signal processing device according to the present invention has: a storage unit that stores case models output using an acoustic model based on a neural network (NN), with speech containing noise or acoustic distortion, or clean speech, as input; a feature generation unit that generates features from an input signal; a collation unit that collates the output obtained by feeding the features to the NN-based acoustic model against the case models stored in the storage unit, and obtains the clean speech features corresponding to the input signal; and an output unit that outputs enhanced speech obtained by multiplying the input signal by a filter constructed from the clean speech features obtained by the collation unit.

Further, to solve the problem described above and achieve the object, a case model generation device according to the present invention has: a feature generation unit that generates features from a training input signal; a training unit that uses the features to train an HMM (hidden Markov model) acoustic model based on a DNN (deep neural network); and a maximum-likelihood HMM state calculation unit that, based on the HMM state likelihoods output by the DNN-based HMM acoustic model, computes as a case model the sequence of indices of the HMM state that gives the maximum likelihood for the feature of each time frame.

Further, to solve the problem described above and achieve the object, a collation device according to the present invention has a collation unit that inputs the features of an input signal to a DNN-based HMM acoustic model, collates the output of the DNN-based HMM acoustic model against case models obtained by training on speech containing noise or acoustic distortion, or on clean speech, using the DNN-based HMM acoustic model, and obtains the clean speech features corresponding to the input signal.

According to the present invention, clean speech similar to an input signal can be searched for accurately.

FIG. 1 is a diagram showing an example of the functional configuration of the signal processing device according to the embodiment.
FIG. 2 is a diagram for explaining an example of a segment.
FIG. 3 is a diagram for explaining the processing of the matching unit shown in FIG. 1.
FIG. 4 is a flowchart showing the procedure of the signal processing method executed by the signal processing device shown in FIG. 1.
FIG. 5 is a block diagram showing an example of the functional configuration of the case model generation device according to the embodiment.
FIG. 6 is a flowchart showing the procedure of the case model generation process performed by the case model generation device shown in FIG. 5.
FIG. 7 is a diagram showing an example of a computer that realizes the signal processing device or the case model generation device by executing a program.

An embodiment of the present invention is described in detail below with reference to the drawings. The present invention is not limited by this embodiment. In the drawings, identical parts are denoted by identical reference numerals.

[Embodiment]
First, the signal processing device according to the embodiment is described. This device removes acoustic distortion from an input signal containing noise and reverberation (acoustic distortion) and outputs a clear enhanced speech signal.

[Configuration of the signal processing device]
FIG. 1 is a diagram showing an example of the functional configuration of the signal processing device according to the embodiment. The signal processing device 100 according to the embodiment is realized, for example, by loading a predetermined program into a computer that includes a ROM (Read Only Memory), a RAM (Random Access Memory), a CPU (Central Processing Unit), and the like, and having the CPU execute the program.

As shown in FIG. 1, the signal processing device 100 has a case model storage unit 101 (storage unit), a Fourier transform unit 102, a feature generation unit 103, a matching unit 104 (collation unit), and a speech enhancement filtering unit 105 (output unit). The signal processing device 100 performs a sequence (segment) search using case models represented by an HMM (hidden Markov model) acoustic model based on a DNN (deep neural network), hereinafter called the DNN-HMM acoustic model. The DNN-HMM acoustic model is highly robust to noise.

The case model storage unit 101 stores the case model Ms that was output using the DNN-HMM acoustic model with speech containing noise or acoustic distortion, or clean speech, as input. Specifically, the case model storage unit 101 stores the clean speech data corresponding to each case together with the case model Ms. The clean speech data is, for example, the amplitude spectrum of the clean speech corresponding to the case. The case model is represented by the sequence of s (the maximum-likelihood HMM state sequence), where s is the index of the HMM state that gives the maximum likelihood for the feature of each time frame. The case model Ms also includes the prior probability P(s) of each HMM state s because, as described later, the matching process in the matching unit 104 uses P(s).

Here, the case model Ms is generated in advance by the case model generation device 200 (described later) and stored in the case model storage unit 101. The case model generation device 200 uses a large amount of clean speech obtained from a speech corpus or the like, together with noise and reverberation data obtained in various environments (noise waveforms, room impulse responses, and so on), to simulate observed signals in various environments as training speech signals. It converts the simulated observed signals into the feature domain, trains a DNN-HMM acoustic model on them (see Non-Patent Document 2 for details), and generates the case model Ms.
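
The simulation of observed signals described above can be sketched as follows. This is a minimal illustration rather than the procedure used in the patent: the function name `simulate_observation`, the 16 kHz sampling rate, the toy impulse response, and the SNR-based noise scaling are all assumptions introduced here.

```python
import numpy as np

def simulate_observation(clean, rir, noise, snr_db):
    """Reverberate `clean` with `rir`, then add `noise` scaled to `snr_db`."""
    # Reverberation: convolve with the room impulse response, trimmed to the clean length.
    reverberant = np.convolve(clean, rir)[: len(clean)]
    # Tile or trim the noise so that it covers the whole utterance.
    reps = int(np.ceil(len(reverberant) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverberant)]
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)      # stand-in for 1 s of clean speech (assumed 16 kHz)
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)  # toy decaying impulse response
noise = rng.standard_normal(4000)       # stand-in for a recorded noise waveform
observed = simulate_observation(clean, rir, noise, snr_db=10.0)
```

Repeating this over many clean utterances, impulse responses, noise recordings, and SNR values would give the kind of varied training material the paragraph describes.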

The DNN-HMM acoustic model estimates, from the features of the input speech, the corresponding HMM state number. One HMM state number corresponds to the beginning, middle, or end portion of a single phoneme such as "a", "i", or "u", and the number of states is usually defined to be about 3,000 to 10,000. The DNN-HMM acoustic model converts the input speech features into nonlinear features in hidden layers of multiple nodes, and then outputs the likelihoods of the roughly 3,000 to 10,000 HMM states s at the output layer.

For this reason, the case model generation device 200 (described later) uses the DNN-HMM acoustic model (hereinafter denoted g_s) to obtain, from the features of the training speech signal, the index s_i of the HMM state that gives the maximum likelihood for each time frame i (the maximum-likelihood HMM state sequence). The time sequence (segment) of the obtained indices s_i becomes one of the case models Ms. The case model Ms is expressed by equation (1) using the set of maximum-likelihood HMM state sequences s_i and the DNN-HMM acoustic model g_s.
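
Equation (1) itself is missing from this copy of the text. Judging only from the sentence that introduces it and the symbols it names, it presumably has the following form; this is a reconstruction under that assumption, not the published equation.

```latex
\[
  M_s = \{\, g_s,\; s_i : i = 1, 2, \ldots, I \,\} \tag{1}
\]
```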

Here, s_i is the index of the HMM state that gives the maximum likelihood for the feature k_i of the i-th frame, and I is the total number of frames in the training speech signal. For example, assuming one hour of training data, I = 3.5 × 10^5, which corresponds to roughly one frame every 10 ms.

Next, an example of a segment contained in the case model is described. FIG. 2 is a diagram for explaining an example of a segment. Each cell of the segment shown in FIG. 2 corresponds to the i-th of the I time frames, and the number in each cell is the index s_i of the HMM state that gives the maximum likelihood.
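
A segment of this kind can be computed with a per-frame argmax, as in the sketch below. The function names are hypothetical, the likelihood table is made-up toy data, and estimating P(s) from smoothed state counts is an assumption; the text only says that P(s) is stored with the case model.

```python
import numpy as np

def max_likelihood_states(log_lik):
    """Return s_i = argmax_s log p(k_i | s) for each frame i.

    log_lik has shape (I, S): one row per training frame, one column per HMM state.
    """
    return np.argmax(log_lik, axis=1)

def state_priors(states, num_states, smoothing=1.0):
    """Estimate P(s) from smoothed state counts (an assumption, see the lead-in)."""
    counts = np.bincount(states, minlength=num_states) + smoothing
    return counts / counts.sum()

# Toy likelihoods for 4 frames and 3 HMM states.
frame_probs = np.array([[0.1, 0.7, 0.2],
                        [0.6, 0.3, 0.1],
                        [0.2, 0.2, 0.6],
                        [0.1, 0.8, 0.1]])
segment = max_likelihood_states(np.log(frame_probs))
priors = state_priors(segment, num_states=3)
```

Each entry of `segment` plays the role of one cell in FIG. 2.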

Returning to FIG. 1, the Fourier transform unit 102 is described next. The Fourier transform unit 102 converts the input signal into an amplitude spectrum frame by frame. A speech signal containing noise and reverberation is input to the Fourier transform unit 102 as the input signal. First, the Fourier transform unit 102 cuts the waveform of the input signal into short time segments, for example by multiplying it by a window function such as a Hamming window about 30 ms long. The Fourier transform unit 102 then applies a discrete Fourier transform to each segment and converts it into an amplitude spectrum, that is, the amplitude data of the frequency spectrum. The Fourier transform unit 102 passes the resulting amplitude spectra to the feature generation unit 103 and the speech enhancement filtering unit 105.
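
The framing and transform steps above can be sketched with NumPy. The 30 ms Hamming window comes from the text; the 16 kHz sampling rate, the 10 ms frame shift, and the function name are assumptions made for this illustration.

```python
import numpy as np

def amplitude_spectrogram(signal, sample_rate=16000, frame_ms=30.0, shift_ms=10.0):
    """Cut `signal` into Hamming-windowed frames and return |DFT| for each frame."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * shift : i * shift + frame_len] * window
                       for i in range(num_frames)])
    # rfft keeps the non-negative frequencies; abs() discards the phase.
    return np.abs(np.fft.rfft(frames, axis=1))

t = np.arange(16000) / 16000.0
tone = np.sin(2.0 * np.pi * 1000.0 * t)   # 1 kHz test tone, 1 s long
spec = amplitude_spectrogram(tone)        # one row per frame, one column per frequency bin
```

The phase is dropped here because both later stages described in the text work on amplitude spectra; resynthesising a waveform would additionally need the phase of the observed signal.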

The feature generation unit 103 generates the feature y_t from the amplitude spectrum output by the Fourier transform unit 102. In other words, it generates a segment of features y_t from the amplitude spectra received from the Fourier transform unit 102, where t denotes the frame being processed. The feature generation unit 103 converts all of the amplitude spectra output by the Fourier transform unit 102 into, for example, mel-frequency cepstral coefficients. The input signal is thereby represented as a segment of feature vectors, one per frame.

Commonly used mel-frequency cepstral coefficients are of about order 10 to 20. To represent the case model Ms accurately, the signal processing device 100 uses mel-frequency cepstral coefficients of a higher order than usual (for example, about 30 to 100). The feature generation unit 103 therefore converts all of the amplitude spectra output by the Fourier transform unit 102 into mel-frequency cepstral coefficients of, for example, about order 30 to 100. The feature generation unit 103 may also use features other than mel-frequency cepstral coefficients (for example, cepstral coefficients). The feature generation unit 103 passes the generated features y_t to the matching unit 104.
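
One common way to obtain mel-frequency cepstral coefficients from an amplitude spectrum is a triangular mel filter bank followed by a logarithm and a DCT. The sketch below follows that textbook recipe; the text does not specify the filter bank design, so the 64 filters, the 40 coefficients (chosen inside the 30 to 100 range mentioned above), the mel formula, and all function names are assumptions.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters spaced evenly on the mel scale, shape (num_filters, num_bins)."""
    bin_hz = np.linspace(0.0, sample_rate / 2.0, num_bins)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), num_filters + 2))
    bank = np.zeros((num_filters, num_bins))
    for m in range(num_filters):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        # The smaller of the two ramps, floored at zero, gives a triangle peaking at `mid`.
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(amplitude_spec, sample_rate=16000, num_filters=64, num_ceps=40):
    """Mel-frequency cepstral coefficients for each row (frame) of `amplitude_spec`."""
    bank = mel_filterbank(num_filters, amplitude_spec.shape[1], sample_rate)
    log_mel = np.log((amplitude_spec ** 2) @ bank.T + 1e-10)
    # DCT-II written out explicitly so that only NumPy is needed.
    n = np.arange(num_filters)
    basis = np.cos(np.pi * np.outer(np.arange(num_ceps), n + 0.5) / num_filters)
    return log_mel @ basis.T

rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((5, 241)))   # stand-in for 5 frames of a 480-point DFT
feats = mfcc(spec)                             # shape (5, 40)
```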

The matching unit 104 takes the feature y_t as input, collates (matches) the output obtained using the DNN-HMM acoustic model g_s against the case model Ms stored in the case model storage unit 101, and obtains the clean speech features corresponding to the input signal. The matching unit 104 takes, as candidates for the target speech signal, the clean speech corresponding to the case model Ms that shows high similarity to the DNN-HMM acoustic model's output for the input speech feature y_t.

Specifically, the matching unit 104 inputs the input speech feature y_t to the DNN-HMM acoustic model, collates the likelihoods of the HMM states s output by the model against the maximum-likelihood HMM state sequence s_i of the case model stored in the case model storage unit 101, and takes the clean speech features corresponding to the case model Ms showing high similarity as the clean speech features corresponding to the input signal.

In other words, the matching unit 104 searches the case models in the case model storage unit 101 for the segment that attains the highest likelihood under the HMM state likelihoods output by the DNN-HMM acoustic model. The matching unit 104 passes information on the segment found by the search to the speech enhancement filtering unit 105. The processing of the matching unit 104 is described in detail later.

The speech enhancement filtering unit 105 outputs an enhanced speech signal obtained by multiplying the input signal by a filter constructed from the clean speech features obtained by the matching unit 104. Specifically, it regards the amplitude spectrum of the clean speech corresponding to the features of the segment of the case model Ms found by the matching unit 104 as the amplitude spectrum of the clean speech most similar to the input signal, and reads that amplitude spectrum from the case model storage unit 101. It then builds a speech enhancement filter from the clean speech amplitude spectrum it has read and filters the input signal with it. As a result, the speech enhancement filtering unit 105 outputs an enhanced speech signal in which the acoustic distortion has been removed from the input signal.
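
The text does not state the form of the enhancement filter. As one possible reading, the sketch below builds a Wiener-style gain that treats the matched clean amplitude spectrum as the speech estimate and whatever power the observation has beyond it as noise. The gain formula and the name `enhance` are assumptions, not the method claimed in the patent.

```python
import numpy as np

def enhance(observed_amp, matched_clean_amp, floor=1e-10):
    """Multiply the observed amplitude spectrum by a Wiener-style gain in [0, 1]."""
    clean_power = matched_clean_amp ** 2
    observed_power = observed_amp ** 2
    # Noise power estimate: observation power in excess of the matched clean speech.
    noise_power = np.maximum(observed_power - clean_power, 0.0)
    gain = clean_power / np.maximum(clean_power + noise_power, floor)
    return gain * observed_amp

observed = np.array([[2.0, 1.0, 4.0]])   # one frame, three frequency bins
matched = np.array([[1.0, 1.0, 0.0]])    # clean amplitude spectrum read for the matched segment
enhanced = enhance(observed, matched)    # gains are 0.25, 1.0 and 0.0
```

Bins where the matched clean speech explains the observation pass unchanged, while bins where the clean spectrum is silent are suppressed.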

[Processing of the matching unit]
Next, the processing of the matching unit 104 is described in detail. FIG. 3 is a diagram for explaining the processing of the matching unit 104 shown in FIG. 1.

As shown in FIG. 3, the input to the matching unit 104 is the input speech feature y_t, which is affected by noise and reverberation. y_t consists of, for example, mel-frequency cepstral coefficients or filter bank coefficients.

The mel-frequency cepstral coefficients used for speech recognition have about 13 dimensions, and their Δ and ΔΔ coefficients are often used with them, so the feature has 13 × 3 = 39 dimensions in total. When features are input to the DNN-HMM acoustic model, it is common to input not only the current frame but also, for example, the 5 frames before and after it, 11 frames in all, at once. In that case the feature y_t has 39 × 11 = 429 dimensions.

When the input consists of filter bank coefficients, the dimensionality is larger still: the base dimensionality is 40, and including Δ and ΔΔ gives 40 × 3 = 120 dimensions. Taking the 5 frames before and after the current frame into account, 11 frames in all, gives a feature y_t of 120 × 11 = 1,320 dimensions.
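
The dimension counts above can be checked with a short sketch that appends Δ and ΔΔ coefficients and then stacks 11 frames. The helper names are hypothetical, the simple finite difference stands in for whatever Δ formula is actually used, and repeating edge frames at the utterance boundaries is an assumption.

```python
import numpy as np

def add_deltas(static):
    """Append first and second time differences (delta, delta-delta) to each frame."""
    delta = np.gradient(static, axis=0)    # simple finite difference along time
    delta2 = np.gradient(delta, axis=0)
    return np.concatenate([static, delta, delta2], axis=1)

def splice(features, context=5):
    """Stack each frame with `context` frames on either side (edge frames are repeated)."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    num_frames = features.shape[0]
    return np.concatenate(
        [padded[k : k + num_frames] for k in range(2 * context + 1)], axis=1)

rng = np.random.default_rng(0)
mfcc_frames = add_deltas(rng.standard_normal((20, 13)))    # 13 * 3 = 39 dimensions
fbank_frames = add_deltas(rng.standard_normal((20, 40)))   # 40 * 3 = 120 dimensions
mfcc_input = splice(mfcc_frames)                           # 39 * 11 = 429 dimensions
fbank_input = splice(fbank_frames)                         # 120 * 11 = 1320 dimensions
```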

This feature y_t is input to the DNN-HMM acoustic model in the matching unit 104. The number of nodes in the input layer of the DNN-HMM acoustic model equals the dimensionality of y_t. The input feature y_t undergoes nonlinear feature transformation in hidden layers of, for example, 2,048 nodes each (typically about 5 to 10 layers), and the output layer, which has, for example, about 3,000 to 10,000 nodes, yields the HMM state likelihoods p(y_t|s). Specifically, p(y_t|s) is computed by equations (2) and (3).
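
Equations (2) and (3) are missing from this copy of the text. The symbols defined for them match the standard hybrid DNN-HMM formulation of Non-Patent Document 2, in which the softmax output of the network is divided by the state prior. Under that assumption they would read roughly as follows; this is a reconstruction, and z_r^{L-1}(y_t), the input to node r of the last hidden layer, is a symbol introduced here.

```latex
\begin{align}
  p(y_t \mid s) &\propto \frac{P(s \mid y_t)}{P(s)}
    = \frac{1}{P(s)} \cdot
      \frac{\exp\{ z_s^L(y_t) \}}{\sum_{s'} \exp\{ z_{s'}^L(y_t) \}} \tag{2} \\
  z_s^L(y_t) &= b_s^L + \sum_{r} w_{s,r}^L \, f\bigl( z_r^{L-1}(y_t) \bigr) \tag{3}
\end{align}
```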

Here, z_s^L(y_t) is the activation of the s-th node (corresponding to HMM state s) of the output layer (the L-th layer of the DNN-HMM acoustic model) when the feature y_t is given. P(s) is the prior probability of HMM state s and is included in the case model Ms. w_{s,r}^L is the weight between the r-th node of the last hidden layer (the (L-1)-th layer of the DNN-HMM acoustic model) and the s-th node of the output layer (the L-th layer). f(·) is the activation function (typically a sigmoid). b_s^L is the bias of the s-th node of the output layer (the L-th layer). How p(y_t|s) is obtained from y_t with the DNN-HMM acoustic model is described in detail in, for example, Non-Patent Document 2.
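
The computation these symbols describe can be sketched as a small feed-forward pass. This assumes the usual hybrid DNN-HMM convention, in which a softmax over the output activations gives state posteriors that are then divided by the prior P(s). The layer sizes are tiny stand-ins for the dimensions quoted above, the weights are random, and the function name is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_scaled_likelihoods(features, weights, biases, log_priors):
    """Return log p(y_t | s) up to a per-frame constant, shape (T, S).

    weights[l] has shape (nodes_out, nodes_in), matching the index order of w_{s,r}.
    """
    hidden = features
    for w, b in zip(weights[:-1], biases[:-1]):
        hidden = sigmoid(hidden @ w.T + b)           # f(.) applied in each hidden layer
    z = hidden @ weights[-1].T + biases[-1]          # output activations z_s^L(y_t)
    z = z - z.max(axis=1, keepdims=True)             # stabilise the softmax
    log_posteriors = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return log_posteriors - log_priors               # divide the posterior by P(s)

rng = np.random.default_rng(0)
sizes = [6, 8, 8, 4]   # tiny stand-ins for, e.g., 429 inputs, 2048-node layers, 3000 states
weights = [0.1 * rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
log_priors = np.log(np.full(4, 0.25))
log_lik = log_scaled_likelihoods(rng.standard_normal((3, 6)), weights, biases, log_priors)
```

Working in the log domain keeps later sums over frames numerically stable.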

The matching unit 104 performs a matching process that matches the HMM state likelihoods p(y_t|s) output from the output layer of the DNN-HMM acoustic model against the case model Ms stored in the case model storage unit 101. It searches for the case model Ms showing high similarity to the likelihoods p(y_t|s), and takes the clean speech corresponding to the case model Ms found as a candidate for the target speech signal.

An example of how to search for the maximum-likelihood HMM state sequence of the case model Ms closest to the HMM state likelihoods p(y_t|s) of the input features is as follows. The matching unit 104 aligns the HMM state likelihoods output by the DNN-HMM acoustic model with each segment (maximum-likelihood HMM state sequence) among the case models in the case model storage unit 101, and extracts the segment that attains the highest likelihood under those output likelihoods. In other words, the matching unit 104 fits the sequence of HMM state numbers output by the DNN-HMM acoustic model to each segment of the maximum-likelihood HMM state sequences and extracts the segment with the highest likelihood for that state number sequence.
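
A simplified version of this search is sketched below. It slides a fixed-length window over a stored state sequence and sums the log-likelihoods that the input frames assign to the stored states. The fixed window length, the exhaustive scan, and the name `best_segment` are assumptions; the text does not say how segment boundaries or lengths are chosen.

```python
import numpy as np

def best_segment(log_lik, corpus_states):
    """Return (start, score) of the stored segment that best explains the input frames.

    log_lik has shape (T, S): log p(y_t | s) for each input frame and HMM state.
    corpus_states is the stored maximum-likelihood state sequence s_1 .. s_I.
    """
    num_frames = log_lik.shape[0]
    num_starts = len(corpus_states) - num_frames + 1
    if num_starts < 1:
        raise ValueError("corpus is shorter than the input")
    frame_idx = np.arange(num_frames)
    # Score of a candidate segment: sum over t of log p(y_t | s_{start + t}).
    scores = np.array([
        log_lik[frame_idx, corpus_states[start : start + num_frames]].sum()
        for start in range(num_starts)
    ])
    start = int(np.argmax(scores))
    return start, float(scores[start])

corpus_states = np.array([0, 1, 2, 2, 1, 0, 3, 1])
frame_probs = np.array([[0.1, 0.1, 0.7, 0.1],    # input frame 0 favours state 2
                        [0.1, 0.7, 0.1, 0.1],    # input frame 1 favours state 1
                        [0.7, 0.1, 0.1, 0.1]])   # input frame 2 favours state 0
start, score = best_segment(np.log(frame_probs), corpus_states)
```

Here the window starting at index 3, whose stored states are 2, 1, 0, scores highest. The clean amplitude spectra stored for those frames would then be handed to the enhancement stage.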

Alternatively, the matching unit 104 may search the case model Ms on the basis of a distance, for example the Euclidean distance, between the HMM state likelihoods p(y_t|s) of the input features and the maximum-likelihood HMM state sequence forming a segment of the case model Ms.

The matching unit 104 then passes to the speech enhancement filtering unit 105 the information on the segment of the case model Ms that it has found, that is, the segment considered to yield the clean speech sequence most similar to the clean speech contained in the input signal. The speech enhancement filtering unit 105 uses the amplitude spectrum of the clean speech in the case model storage unit 101 corresponding to that segment to build a speech enhancement filter, filters the input signal with it, and outputs the enhanced speech signal.

In this way, the signal processing device 100 according to the embodiment performs the segment search using the HMM state likelihoods p(y_t|s) output by the DNN-HMM acoustic model. The DNN-HMM acoustic model is highly robust to noise: it can estimate HMM state numbers with high accuracy even when the input speech features are affected by noise or reverberation. By using this model, the signal processing device 100 performs a segment search that is robust to, that is, little affected by, noise and reverberation. The procedure of the signal processing method using the DNN-HMM acoustic model in the signal processing device 100 is described next.

[Signal Processing Method in Signal Processing Device]
Next, the signal processing method in the signal processing device 100 is described. FIG. 4 is a flowchart showing the processing procedure of the signal processing method executed by the signal processing device 100 shown in FIG. 1.

First, the Fourier transform unit 102 performs a Fourier transform process that converts the input signal into an amplitude spectrum (step S1). The feature amount generation unit 103 performs a feature amount generation process that generates feature amounts, such as mel-frequency cepstral coefficients, from the amplitude spectrum output by the Fourier transform unit 102 (step S2).
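Step S1 can be illustrated with a minimal sketch. The Hann window, the frame length, the hop size, and the function names below are assumptions made for illustration; the description only states that the input signal is converted into an amplitude spectrum. A direct DFT is used to keep the example self-contained, where a practical implementation would presumably use an FFT.

```python
import cmath
import math


def split_frames(signal, frame_len, hop):
    """Split a signal into overlapping frames (frame_len and hop are assumed values)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def amplitude_spectrum(frame):
    """Return |DFT| of a Hann-windowed frame for bins 0 .. N/2."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * t / n))
                for t, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


# Toy input: a cosine that completes 4 cycles per 32-sample frame.
signal = [math.cos(2 * math.pi * 4 * t / 32) for t in range(64)]
spectra = [amplitude_spectrum(f) for f in split_frames(signal, 32, 16)]
```

Each entry of `spectra` is the amplitude spectrum of one time frame; the feature amounts of step S2 would then be computed from these spectra.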

The matching unit 104 inputs the feature amount generated by the feature amount generation unit 103 into the DNN-HMM acoustic model, matches the HMM state likelihoods output from the DNN-HMM acoustic model against the maximum likelihood HMM state sequence of the case model Ms in the case model storage unit 101, and performs a matching process that takes the clean speech corresponding to the case model Ms showing high similarity as a candidate for the target speech signal (step S3).
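The idea behind step S3 can be sketched as follows. This is a simplified illustration, not the search procedure of the embodiment: it assumes a fixed-length window equal to the number of input frames and scores each window of the stored state sequence by the summed log-likelihood of the input frames. The function name `best_segment` and the toy values are hypothetical.

```python
def best_segment(frame_loglik, model_states):
    """Slide a window over the stored state sequence and score each position.

    frame_loglik[t][s] is log p(y_t | s) for input frame t and HMM state s.
    model_states is the stored maximum likelihood HMM state sequence.
    Returns (start_index, score) of the best-scoring window.
    """
    n = len(frame_loglik)
    best_start, best_score = None, float("-inf")
    for start in range(len(model_states) - n + 1):
        score = sum(frame_loglik[t][model_states[start + t]] for t in range(n))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Toy data: three input frames that favour states 2, 1, 0 in that order.
frame_loglik = [[-5.0, -5.0, -1.0],
                [-5.0, -1.0, -5.0],
                [-1.0, -5.0, -5.0]]
model_states = [0, 1, 2, 2, 1, 0]
start, score = best_segment(frame_loglik, model_states)
```

Because the score depends only on the state likelihoods, noise in the input affects the search only through the acoustic model's state estimates, which is the property the description relies on.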

The speech enhancement filtering unit 105 performs a speech enhancement filtering process (step S4): it creates a speech enhancement filter using the amplitude spectrum of the clean speech corresponding to the feature amount of the segment of the case model Ms found by the matching unit 104, and outputs the enhanced speech obtained by multiplying the input signal by that filter.
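A minimal sketch of step S4 for a single time frame is given below. The form of the filter is an assumption: this passage does not define it, so the example uses a per-bin gain equal to the ratio of the clean amplitude to the observed amplitude, capped at 1. The function names and the `eps` guard are hypothetical.

```python
def enhancement_gain(clean_amp, observed_amp, eps=1e-12):
    """Per-bin gain from the matched clean amplitude spectrum (assumed ratio form)."""
    return [min(1.0, c / max(o, eps)) for c, o in zip(clean_amp, observed_amp)]


def apply_filter(gain, observed_amp):
    """Multiply the observed amplitude spectrum by the gain, bin by bin."""
    return [g * o for g, o in zip(gain, observed_amp)]


observed = [2.0, 4.0, 1.0, 0.0]   # amplitude spectrum of the input frame
clean = [1.0, 4.0, 2.0, 0.0]      # amplitude spectrum of the matched clean frame
gain = enhancement_gain(clean, observed)
enhanced = apply_filter(gain, observed)
```

Capping the gain at 1 means the sketch only attenuates bins; whether the embodiment's filter does the same is not stated here.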

[Configuration of Case Model Generation Device]
Next, the case model generation device 200, which generates the case model Ms stored in the case model storage unit 101 of the signal processing device 100, is described. The case model generation device 200 likewise generates the case model Ms by training with the highly noise-robust DNN-HMM acoustic model on feature amounts y_t, such as mel-frequency cepstral coefficients, generated from training speech signals.

FIG. 5 is a block diagram showing an example of the functional configuration of the case model generation device 200. The case model generation device 200 shown in FIG. 5 is realized, for example, by loading a predetermined program into a computer that includes a ROM, a RAM, a CPU, and the like, and having the CPU execute that program. The case model generation device 200 has a Fourier transform unit 201, a feature amount generation unit 202, a DNN-HMM acoustic model learning unit 203 (learning unit), and a maximum likelihood HMM state calculation unit 204.

First, the training speech signals input to the case model generation device 200 are described. The signals input to the case model generation device 200 are speech signals from various noise and reverberation environments, including speech signals from a clean environment. Specifically, a large amount of clean speech obtained from a speech corpus or the like is combined with noise and reverberation data obtained in various environments (noise signal waveforms, room impulse responses, and so on) to simulate the signals that would be observed in those environments, and the resulting simulated observation signals are input to the case model generation device 200 as training speech signals. The following processing is performed on each of these training speech signals.
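One common way to generate such a simulated observation signal is to convolve clean speech with a room impulse response and add noise scaled to a target signal-to-noise ratio. The sketch below assumes this recipe; the description does not specify the mixing procedure, and the function names and the `snr_db` parameter are hypothetical.

```python
import math


def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def power(x):
    """Mean squared value of a sequence."""
    return sum(v * v for v in x) / len(x)


def simulate_observation(clean, rir, noise, snr_db):
    """Reverberate clean speech with a room impulse response, then add scaled noise."""
    reverberant = convolve(clean, rir)[:len(clean)]
    noise = noise[:len(clean)]
    # Choose the noise gain so that reverberant power / noise power matches snr_db.
    gain = math.sqrt(power(reverberant) / (power(noise) * 10 ** (snr_db / 10)))
    return [r + gain * n for r, n in zip(reverberant, noise)]


clean = [1.0, 0.0, 0.0, 0.0]
rir = [1.0, 0.5]                      # toy room impulse response
noise = [1.0, -1.0, 1.0, -1.0]
observed = simulate_observation(clean, rir, noise, snr_db=10.0)
```

Repeating this for many combinations of clean utterances, impulse responses, noise recordings, and SNR values would yield training signals covering a range of environments.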

The Fourier transform unit 201 and the feature amount generation unit 202 perform, on the training speech signals, the same processing as the Fourier transform unit 102 and the feature amount generation unit 103, respectively, of the signal processing device 100 shown in FIG. 1. When the input speech is clean speech, the Fourier transform unit 201 stores the amplitude spectrum of the clean speech in the case model storage unit 101 of the signal processing device 100 as part of the case model Ms.

The DNN-HMM acoustic model learning unit 203 trains a DNN-based HMM acoustic model using the feature amounts generated by the feature amount generation unit 202. It inputs the feature amounts into the DNN-based HMM acoustic model to perform training and acquires the HMM state likelihoods output by that model. The DNN-HMM acoustic model learning unit 203 inputs the feature amount y_t generated by the feature amount generation unit 202 into the DNN-HMM acoustic model as training data, and outputs the HMM state likelihood p(y_t|s) output from the DNN-HMM acoustic model to the maximum likelihood HMM state calculation unit 204. Because the matching unit 104 of the signal processing device 100 performs a calculation that uses the prior probability P(s) of the HMM state s, as shown in Expression (2), the DNN-HMM acoustic model learning unit 203 also generates the prior probability P(s) of the HMM state s and stores it in the case model storage unit 101 of the signal processing device 100 as part of the case model.

Based on the DNN-HMM acoustic model g_s output by the DNN-HMM acoustic model learning unit 203, that is, on the HMM state likelihood p(y_t|s), the maximum likelihood HMM state calculation unit 204 calculates the sequence of s, the index of the HMM state that gives the maximum likelihood for the feature amount of each time frame (the maximum likelihood HMM state sequence). The maximum likelihood HMM state calculation unit 204 obtains the index s_i of the HMM state that gives the maximum likelihood for each time frame i, and stores the time series (segment) of the obtained indices s_i in the case model storage unit 101 of the signal processing device 100 as the case model Ms based on the DNN-HMM.
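The per-frame selection described above can be sketched in a few lines. Two points are assumptions rather than statements from the description: the prior P(s) is estimated here from state occupancy counts in a training alignment, and the likelihood is obtained from DNN posteriors by the common hybrid convention p(y_t|s) ∝ p(s|y_t)/P(s). All function names are hypothetical.

```python
def state_priors(alignment, num_states):
    """Estimate P(s) as the relative frequency of each state in a training alignment."""
    counts = [0] * num_states
    for s in alignment:
        counts[s] += 1
    return [c / len(alignment) for c in counts]


def scaled_likelihoods(posteriors, priors, floor=1e-10):
    """Divide DNN posteriors p(s | y_t) by P(s) to get values proportional to p(y_t | s)."""
    return [[p / max(pr, floor) for p, pr in zip(frame, priors)]
            for frame in posteriors]


def max_likelihood_states(likelihoods):
    """For each time frame, return the index of the state with the largest likelihood."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in likelihoods]


priors = state_priors([0, 0, 1, 2], num_states=3)
posteriors = [[0.40, 0.35, 0.25],
              [0.10, 0.20, 0.70]]
state_sequence = max_likelihood_states(scaled_likelihoods(posteriors, priors))
```

In the first toy frame the posterior is largest for state 0, but after division by the prior the likelihood is largest for state 1, which shows why the prior matters when likelihoods rather than posteriors are compared.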

[Case Model Generation Process]
Next, the case model generation process is described. FIG. 6 is a flowchart showing the processing procedure of the case model generation process performed by the case model generation device 200 shown in FIG. 5.

In the case model generation device 200, the Fourier transform unit 201 and the feature amount generation unit 202 perform steps S11 and S12 on the input training speech signal, following the same procedure as steps S1 and S2 shown in FIG. 4.

The DNN-HMM acoustic model learning unit 203 trains the DNN-HMM acoustic model using the feature amounts input from the preceding feature amount generation unit 202 (step S13). The DNN-HMM acoustic model learning unit 203 also calculates the prior probability P(s) of the HMM state s.

Next, the maximum likelihood HMM state calculation unit 204 performs a maximum likelihood HMM state calculation process (step S14): based on the HMM state likelihoods output by the DNN-HMM acoustic model, it calculates, as the case model Ms, the sequence of s, the index of the HMM state that gives the maximum likelihood for the feature amount of each time frame (the maximum likelihood HMM state sequence). The case model generation device 200 then performs a storage process that stores this maximum likelihood HMM state sequence in the case model storage unit 101 of the signal processing device 100 as the case model Ms (step S15).

In this way, the case model generation device 200 generates the case model Ms using the DNN-HMM acoustic model, in correspondence with the signal processing device 100. The signal processing device 100 can therefore execute the matching process using a case model Ms that reflects high noise robustness.

[Effects of the Embodiment]
The signal processing device 100 according to the present embodiment performs signal processing using the DNN-HMM acoustic model, which is highly robust to noise. Specifically, the signal processing device 100 inputs the feature amount of the input signal into the DNN-HMM acoustic model and matches the output of the DNN-HMM acoustic model against the case model Ms stored in the case model storage unit 101 to obtain the clean speech feature amount corresponding to the input signal.

As described above, the DNN-HMM acoustic model is highly robust to noise. In other words, it can estimate the HMM state with high accuracy even when the feature amount of the input speech is affected by noise or reverberation. By using this noise-robust DNN-HMM acoustic model, the signal processing device 100 can perform a segment search that is robust to noise and reverberation, that is, one that is less susceptible to their influence. In addition, the case model generation device 200 generates the case model Ms using the DNN-HMM acoustic model, in correspondence with the signal processing device 100. The signal processing device 100 can therefore execute the matching process using a case model Ms that reflects high noise robustness.

Thus, according to the present embodiment, the influence of noise and reverberation on the search for clean speech similar to the input signal can be reduced, and clean speech similar to the input signal can be found accurately.

[Modifications]
The present embodiment has been described for the case where the DNN-HMM acoustic model is used. This DNN-HMM acoustic model is based on a so-called fully-connected feed-forward neural network, but acoustic models based on neural networks with other structures can also be used in the present embodiment. For example, the acoustic model may be one based on a Convolutional Neural Network (CNN), one based on a Recurrent Neural Network (RNN), or an RNN acoustic model based on LSTM (Long Short-Term Memory).

In this case, the case model storage unit 101 stores a case model obtained by training on speech containing noise or acoustic distortion, or on clean speech, using any one of the CNN-based acoustic model, the RNN-based acoustic model, and the LSTM-based RNN acoustic model. The matching unit 104 inputs the feature amount into any one of the CNN-based acoustic model, the RNN-based acoustic model, and the LSTM-based RNN acoustic model, and matches the output of that neural network against the case model stored in the case model storage unit 101.

In the above description, the DNN-HMM acoustic model, or an acoustic model based on another NN structure, is trained by the case model generation device, but this is not essential. For example, an existing acoustic model trained separately by the method described in Non-Patent Document 2 can be used in the case model generation device. Such acoustic models need not have been trained on the simulated observation signals described above; they may have been trained on speech signals that inherently contain noise and reverberation.

In the present embodiment, the matching unit 104 may also divide the input signal into two and perform the matching process on each of the first half and the second half. For the division of the input signal, see, for example, Japanese Patent No. 6139429 or Japanese Patent No. 6139430, both by the applicant.

[System Configuration, etc.]
The constituent elements of the illustrated devices are functional concepts and need not be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to what is illustrated, and all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the signal processing device 100 and the case model generation device 200 may be a single integrated device. Furthermore, all or any part of the processing functions performed by each device can be realized by a CPU and a program analyzed and executed by that CPU, or as hardware using wired logic.

Of the processes described in the present embodiment, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can also be performed automatically by a known method. The processes described in the present embodiment need not be executed only in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.

[Program]
FIG. 7 is a diagram showing an example of a computer that realizes the signal processing device 100 or the case model generation device 200 by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the signal processing device 100 or the case model generation device 200 is implemented as the program module 1093, in which code executable by the computer 1000 is written. The program module 1093 is stored in, for example, the hard disk drive 1031. For example, the program module 1093 for executing the same processing as the functional configuration of the signal processing device 100 or the case model generation device 200 is stored in the hard disk drive 1031. The hard disk drive 1031 may be replaced by an SSD (Solid State Drive).

The setting data used in the processing of the embodiment described above is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1031 into the RAM 1012 as needed and executes them.

The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1031; they may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like) and read from that computer by the CPU 1020 via the network interface 1070.

The embodiments to which the invention made by the present inventors is applied have been described above, but the present invention is not limited by the description and drawings that form part of the disclosure of the present invention according to these embodiments. That is, all other embodiments, examples, operational techniques, and the like made by those skilled in the art on the basis of these embodiments fall within the scope of the present invention.

100, 100P Signal processing device
101, 101P Case model storage unit
102, 102P Fourier transform unit
103, 103P Feature amount generation unit
104, 104P Matching unit
105, 105P Speech enhancement filtering unit
200, 200P Case model generation device

Claims

A signal processing device comprising:
a storage unit that stores a case model output by using a neural network-based acoustic model that takes as input speech containing noise or acoustic distortion, or clean speech;
a feature amount generation unit that generates a feature amount from an input signal;
a matching unit that obtains a clean speech feature amount corresponding to the input signal by matching an output result, obtained by inputting the feature amount into the neural network-based acoustic model, against the case model stored in the storage unit; and
an output unit that outputs enhanced speech obtained by multiplying the input signal by a filter composed of the clean speech feature amount obtained by the matching unit.

The signal processing device according to claim 1, wherein the storage unit and the matching unit use a DNN (Deep Neural Network)-based HMM (Hidden Markov Model) acoustic model that takes as input speech containing noise or acoustic distortion, or clean speech.

The signal processing device according to claim 1, wherein the storage unit and the matching unit use any one of a CNN (Convolutional Neural Network)-based acoustic model, an RNN (Recurrent Neural Network)-based acoustic model, and an LSTM (Long Short-Term Memory)-based RNN acoustic model.

A case model generation device comprising:
a feature amount generation unit that generates a feature amount from a training input signal;
a learning unit that trains a DNN-based HMM acoustic model using the feature amount; and
a maximum likelihood HMM state calculation unit that, based on the HMM state likelihoods output from the DNN-based HMM acoustic model, calculates as a case model the sequence of indices of the HMM states that give the maximum likelihood for the feature amount of each time frame.

A matching device comprising a matching unit that inputs a feature amount of an input signal into a DNN-based HMM acoustic model, matches the output result of the DNN-based HMM acoustic model against a case model obtained by training on speech containing noise or acoustic distortion, or on clean speech, using the DNN-based HMM acoustic model, and obtains a clean speech feature amount corresponding to the input signal.

A signal processing method executed by a signal processing device, wherein the signal processing device has a storage unit that stores a case model output by using a DNN-based HMM acoustic model that takes as input speech containing noise or acoustic distortion, or clean speech, the method comprising:
a feature amount generation step of generating a feature amount from an input signal;
a matching step of inputting the feature amount into the DNN-based HMM acoustic model, matching the output result of the DNN-based HMM acoustic model against the case model stored in the storage unit, and obtaining a clean speech feature amount corresponding to the input signal; and
an output step of outputting enhanced speech obtained by multiplying the input signal by a filter composed of the clean speech feature amount obtained in the matching step.

A signal processing method executed by a case model generation device, the method comprising:
a feature amount generation step of generating a feature amount from a training input signal;
a learning step of training a DNN-based HMM acoustic model using the feature amount; and
a maximum likelihood HMM state calculation step of calculating as a case model, based on the HMM state likelihoods output from the DNN-based HMM acoustic model, the sequence of indices of the HMM states that give the maximum likelihood for the feature amount of each time frame.

A signal processing program for causing a computer to function as any one of the signal processing device according to claim 1, the case model generation device according to claim 4, and the matching device according to claim 5.