JP6499095B2

JP6499095B2 - Signal processing method, signal processing apparatus, and signal processing program

Info

Publication number: JP6499095B2
Application number: JP2016015464A
Authority: JP
Inventors: 小川　厚徳; 厚徳小川; 慶介木下; マークデルクロア; 拓也吉岡; 中谷　智広; 智広中谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-01-29
Filing date: 2016-01-29
Publication date: 2019-04-10
Anticipated expiration: 2036-01-29
Also published as: JP2017134321A

Description

The present invention relates to a signal processing method, a signal processing device, and a signal processing program.

Conventionally, speech recognition systems, hearing aids, video conferencing systems, machine control interfaces, music information processing systems for music retrieval and transcription, and similar systems collect acoustic signals with a microphone and extract the components of a target speech signal.

In general, when an acoustic signal is collected with a microphone in a real environment with noise and reverberation, what is observed is not the target speech signal alone but a signal on which noise and reverberation (acoustic distortion) are superimposed. Once noise and reverberation are superimposed, it becomes difficult to extract the components of the target speech signal, and the clarity and intelligibility of the speech are greatly reduced. As a result, for example, the recognition rate of a speech recognition system decreases.

Techniques for removing noise and reverberation superimposed on a speech signal have therefore been proposed (see, for example, Non-Patent Document 1). A conventional signal processing device for speech signals is described here with reference to FIG. 10, a block diagram showing an example of its configuration. The signal processing device 1P shown in FIG. 10 uses case models expressed by a Gaussian mixture model (GMM), evaluates their similarity to features obtained by converting the input speech, and takes the case models showing high similarity as candidates for the target speech signal.

In the conventional signal processing device 1P, the case model storage unit 11P stores case models expressed by a mixture distribution model trained in advance. Specifically, it stores the amplitude spectrum of the clean speech corresponding to each case, together with a case model containing the sequence (segment) of indices of the Gaussians that give the maximum likelihood for the feature of each frame (for example, mel-frequency cepstral coefficients).

First, the Fourier transform unit 12P applies a discrete Fourier transform to the input signal, which contains acoustic distortion, to obtain an amplitude spectrum, and the feature generation unit 13P generates a feature segment from the amplitude spectrum.

Next, the matching unit 15P matches the feature segment generated by the feature generation unit 13P against the segments contained in the case models in the case model storage unit 11P, and searches the case models for the segment with the highest similarity to the generated feature segment. Specifically, the matching unit 15P searches the segments of the case models for the one that gives the maximum posterior probability for the feature segment generated by the feature generation unit 13P.

The speech enhancement filtering unit 16P then regards the clean-speech amplitude spectrum corresponding to the case model segment found by the matching unit 15P as the clean-speech amplitude spectrum most similar to the input signal, reads that amplitude spectrum from the case model storage unit 11P, and builds a speech enhancement filter from it. Filtering the input signal with this filter yields an enhanced speech signal from which the acoustic distortion has been removed.

J. Ming, R. Srinivasan, and D. Crookes, “A Corpus-Based Approach to Speech Enhancement From Nonstationary Noise,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 4, pp. 822-836, 2011.

Thus, to find the clean-speech amplitude spectrum most similar to the input speech, the conventional signal processing device 1P uses the feature segment generated by the feature generation unit 13P to search the case models in the case model storage unit 11P for the segment that gives the maximum posterior probability.

However, the mel-frequency cepstral coefficients used for the segment search are simple features derived from the amplitude spectrum. When the input signal contains noise or reverberation, the coefficients are therefore also affected by it, and the segment search by the matching unit 15P is not necessarily accurate.

Moreover, although case models are prepared for a variety of acoustic distortion environments, it is practically difficult to prepare case models for every such environment. As a result, the matching unit 15P was sometimes unable to find, among the case models, a segment highly similar to the feature segment generated by the feature generation unit 13P.

Consequently, in the conventional signal processing device, the features used for the search are affected by noise and reverberation, which limits the accuracy of the search for clean-speech features similar to the input signal.

The present invention has been made in view of the above, and its object is to provide a signal processing method, a signal processing device, and a signal processing program that reduce the influence of noise and reverberation on the search for clean speech similar to an input signal.

To solve the above problem and achieve the object, a signal processing method according to the present invention is executed by a signal processing device that has a storage unit storing a mixture distribution model trained on clean speech or on speech containing noise or acoustic distortion. The method includes: a feature generation step in which the signal processing device generates a first feature from an input signal; a feature conversion step in which the signal processing device converts the first feature into a second feature that has undergone noise or acoustic distortion reduction processing; a matching step in which the signal processing device calculates, based on the parameters of the mixture distribution model stored in the storage unit, posterior probabilities indicating the probability that the second feature belongs to each distribution of the mixture distribution model, and obtains the clean-speech feature with the highest posterior probability as the clean-speech feature corresponding to the input signal; and an output step in which the signal processing device outputs an enhanced speech signal obtained by multiplying the input signal by a filter constructed from the clean-speech feature obtained in the matching step.

According to the present invention, the influence of noise and reverberation on the search for clean speech similar to an input signal can be reduced.

FIG. 1 is a diagram schematically showing an example of the configuration of the signal processing device according to Embodiment 1.
FIG. 2 is a diagram for explaining an example of a segment.
FIG. 3 is a conceptual diagram for explaining the processing of the feature conversion unit shown in FIG. 1.
FIG. 4 is a flowchart showing the processing procedure executed by the signal processing device shown in FIG. 1.
FIG. 5 is a block diagram showing an example of the functional configuration of the case model generation device according to Embodiment 1.
FIG. 6 is a flowchart showing the procedure of the case model generation processing performed by the case model generation device shown in FIG. 5.
FIG. 7 is a block diagram showing the configuration of the signal processing device according to Embodiment 2.
FIG. 8 is a flowchart showing the processing procedure executed by the signal processing device shown in FIG. 7.
FIG. 9 is a diagram showing an example of a computer that realizes the signal processing device by executing a program.
FIG. 10 is a block diagram showing an example of the configuration of a conventional signal processing device.

An embodiment of the present invention is described in detail below with reference to the drawings. The present invention is not limited by this embodiment. In the drawings, identical parts are denoted by identical reference signs.

[Embodiment 1]
First, the signal processing device according to Embodiment 1 is described. This device removes acoustic distortion from an input signal that contains noise and reverberation (acoustic distortion) and outputs a clear enhanced speech signal.

[Configuration of the signal processing device]
FIG. 1 is a diagram schematically showing an example of the configuration of the signal processing device according to Embodiment 1. The signal processing device 1 according to Embodiment 1 is realized, for example, by loading a predetermined program into a computer that includes a ROM (Read Only Memory), a RAM (Random Access Memory), a CPU (Central Processing Unit), and the like, and having the CPU execute that program.

As shown in FIG. 1, the signal processing device 1 has a case model storage unit 11, a Fourier transform unit 12, a feature generation unit 13, a feature conversion unit 14, a matching unit 15 (collation unit), and a speech enhancement filtering unit 16 (output unit). The signal processing device 1 uses the case model M expressed by a GMM, evaluates its similarity to the features obtained by converting the input signal, and uses the case model M showing high similarity as a candidate for the target speech signal.

The case model storage unit 11 stores a mixture distribution model trained on speech containing acoustic distortion or on clean speech. Specifically, it stores clean-speech data corresponding to each case and the case model M. The clean-speech data is, for example, the amplitude spectrum of the clean speech corresponding to the case. The case model M contains, as parameters of the mixture distribution model, the sequence (segment) of indices of the Gaussians that give the maximum likelihood for the feature of each frame.

The case model M is generated in advance by the case model generation device 2 (described later) and stored in the case model storage unit 11. The case model generation device 2 uses a large amount of clean speech obtained from a speech corpus or the like, together with noise and reverberation data obtained in various environments (noise waveforms, room impulse responses, and so on), to simulate observed signals in various environments as training speech signals. It then generates the case model M from these simulated observed signals after converting them into the feature domain.

Specifically, based on the features of the training speech signal, the case model generation device 2 (described later) finds, for each time frame i, the index m_i of the Gaussian in the Gaussian mixture model g that gives the maximum likelihood. The time sequence (segment) of the indices m_i found in this way becomes one case model M. The case model M is expressed using the Gaussian mixture model g and the set of indices m_i of its Gaussians, as in equation (1):

M = {g, m_i : i = 1, 2, ..., I}    (1)

Here, m_i is the index of the Gaussian that gives the maximum likelihood for the feature k_i of the i-th frame, and it identifies the Gaussian distribution g(k_i | m) within the Gaussian mixture. I is the total number of frames of the training speech signal. For example, assuming one hour of training data, I = 3.5 × 10^5.
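The construction of the index sequence m_i can be illustrated with a minimal sketch. It assumes diagonal-covariance Gaussians and ignores mixture weights when picking the maximum-likelihood component; the function names `log_gauss_diag` and `index_sequence` and the toy parameters are hypothetical and do not come from the patent.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Log density of every frame under every diagonal Gaussian.

    x: (I, D) frame features; means, variances: (K, D). Returns (I, K).
    """
    diff = x[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2.0 * np.pi * variances), axis=1))

def index_sequence(features, means, variances):
    """For each frame i, return the index m_i of the most likely Gaussian."""
    return np.argmax(log_gauss_diag(features, means, variances), axis=1)

# Toy GMM with three 2-dimensional components (assumed values).
means = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
variances = np.ones_like(means)
features = np.array([[0.1, -0.2], [4.8, 5.3], [-5.1, 4.9], [0.3, 0.1]])

m = index_sequence(features, means, variances)
print(m.tolist())  # [0, 1, 2, 0]
```

The resulting integer sequence plays the role of the segment stored in the case model M; only the indices and the shared GMM need to be kept.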

Next, an example of a segment contained in the case model M is described. FIG. 2 is a diagram for explaining an example of a segment. Each cell of the segment shown in FIG. 2 corresponds to the i-th time frame among the I frames. The number in each cell is the index m_i of the Gaussian in the Gaussian mixture model g that gives the maximum likelihood.

The Fourier transform unit 12 converts the input signal into an amplitude spectrum frame by frame. A speech signal containing noise and reverberation is input to the Fourier transform unit 12 as the input signal. First, the Fourier transform unit 12 cuts the waveform data of the input signal into short time segments, for example by multiplying it by a window function such as a short-time Hamming window of about 30 msec. It then applies a discrete Fourier transform to each cut-out segment and converts it into an amplitude spectrum. The amplitude spectrum is the amplitude data of the frequency spectrum. The Fourier transform unit 12 inputs the resulting amplitude spectrum to the feature generation unit 13 and the speech enhancement filtering unit 16.
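The framing and transform steps can be sketched as follows. The 30 msec Hamming window comes from the text; the 10 msec frame shift, the 16 kHz sampling rate, and the function name `amplitude_spectrogram` are assumptions made only for this illustration.

```python
import numpy as np

def amplitude_spectrogram(signal, sample_rate, frame_ms=30.0, shift_ms=10.0):
    """Cut the signal into Hamming-windowed frames and return |DFT| per frame.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    The signal must be at least one frame long.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))  # assumed frame shift
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[i * shift:i * shift + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 1 kHz sine sampled at 16 kHz (assumed test signal).
sr = 16000
t = np.arange(sr) / sr
spec = amplitude_spectrogram(np.sin(2.0 * np.pi * 1000.0 * t), sr)
print(spec.shape)  # (98, 241)
```

With 480-sample frames the bin spacing is 16000 / 480 Hz, so the 1 kHz tone peaks at bin 30 of each frame.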

The feature generation unit 13 generates a feature (first feature) x_t from the amplitude spectrum output by the Fourier transform unit 12. In other words, it generates a segment of features x_t from the amplitude spectrum input from the Fourier transform unit 12, where t denotes the frame being processed. The feature generation unit 13 converts all of the amplitude spectra output by the Fourier transform unit 12 into, for example, mel-frequency cepstral coefficients. The input signal is thereby represented, frame by frame, as a segment of feature vectors.

Mel-frequency cepstral coefficients are commonly used at an order of about 10 to 20. To represent the case model M accurately, the signal processing device 1 uses mel-frequency cepstral coefficients of a higher order than is common (for example, about 30 to 100). The feature generation unit 13 therefore converts all of the amplitude spectra output by the Fourier transform unit 12 into mel-frequency cepstral coefficients of, for example, about order 30 to 100. The feature generation unit 13 may also use features other than mel-frequency cepstral coefficients (for example, cepstral coefficients). The feature generation unit 13 inputs the generated feature x_t to the feature conversion unit 14.

The feature conversion unit 14 converts the feature x_t generated by the feature generation unit 13 into a feature (second feature) that has undergone noise or reverberation (acoustic distortion) reduction processing. That is, it converts the feature generated by the feature generation unit 13, such as mel-frequency cepstral coefficients, into a feature that is highly robust to acoustic distortion.

Specifically, the feature conversion unit 14 converts the feature x_t generated by the feature generation unit 13 by applying, in multiple stages, the nonlinear feature transformations of a DNN (Deep Neural Network)-HMM (Hidden Markov Model) acoustic model, and thereby generates a bottleneck feature b_t that is highly robust to acoustic distortion. In doing so, it uses not only the feature segment of the frame being processed but also the feature segments of a predetermined number of preceding and following frames. The bottleneck feature b_t is extracted from a network with a bottleneck structure, in which the number of units in an intermediate layer of the neural network is kept small. The feature extracted at this bottleneck layer is a dimension-reduced version of the input feature that is robust to acoustic distortion. The feature conversion unit 14 inputs the generated bottleneck feature b_t to the matching unit 15.

A "feature robust to acoustic distortion" means the following: if, for example, two different acoustic distortions are added to the same input speech, the two features generated from the two distorted versions are similar to each other. In other words, it is a feature in which the influence of acoustic distortion is reduced.

The matching unit 15 uses the case model M to evaluate the similarity to the features of the input speech, and takes the clean speech corresponding to the case model M showing high similarity as a candidate for the target speech signal. Specifically, based on the parameters of the mixture distribution model stored in the case model storage unit 11, the matching unit 15 calculates posterior probabilities indicating the probability that the input feature (the bottleneck feature b_t produced by the feature conversion unit 14) belongs to each distribution of the mixture distribution model, and obtains the clean-speech feature with the highest posterior probability as the clean-speech feature corresponding to the input signal.
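The per-component posterior computation described above can be sketched for a single feature vector. This is a minimal illustration assuming a diagonal-covariance GMM with explicit mixture weights; the function name `component_posteriors` and all numeric values are hypothetical.

```python
import numpy as np

def component_posteriors(b, weights, means, variances):
    """Posterior probability of each Gaussian component given feature b.

    b: (D,), weights: (K,), means and variances: (K, D). Returns (K,).
    """
    log_lik = -0.5 * (np.sum((b - means) ** 2 / variances, axis=1)
                      + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_lik
    log_joint -= np.max(log_joint)  # subtract the max for numerical stability
    p = np.exp(log_joint)
    return p / p.sum()

# Toy mixture with three 2-dimensional components (assumed values).
weights = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])
variances = np.ones((3, 2))

post = component_posteriors(np.array([2.7, 3.2]), weights, means, variances)
best = int(np.argmax(post))
print(best)  # 1
```

In the device, the feature passed in would be the bottleneck feature b_t, and the component with the highest posterior selects the associated clean-speech data.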

In other words, the matching unit 15 matches the segment of features (bottleneck features b_t) input from the feature conversion unit 14 against the segments contained in the case model M in the case model storage unit 11, and searches the case model M for the segment with the highest posterior probability for the input feature segment. The matching unit 15 inputs information about the segment of the case model M found by the search to the speech enhancement filtering unit 16. The processing of the matching unit 15 is described in detail later.

The speech enhancement filtering unit 16 outputs an enhanced speech signal obtained by multiplying the input signal by a filter constructed from the clean-speech feature obtained by the matching unit 15. Specifically, it regards the clean-speech amplitude spectrum corresponding to the segment of the case model M found by the matching unit 15 as the clean-speech amplitude spectrum most similar to the input signal, and reads that amplitude spectrum from the case model storage unit 11. It then builds a speech enhancement filter from the amplitude spectrum it has read and filters the input signal with that filter. As a result, the speech enhancement filtering unit 16 outputs an enhanced speech signal from which the acoustic distortion has been removed.
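The text does not specify the form of the filter at this point. As one possible illustration, the sketch below assumes a Wiener-style gain computed per frequency bin from the retrieved clean amplitude spectrum and the observed amplitude spectrum, then multiplies the observed spectrum by that gain. The gain formula, the function name `wiener_style_gain`, and the numbers are assumptions, not the patent's definition.

```python
import numpy as np

def wiener_style_gain(clean_amp, noisy_amp, eps=1e-10):
    """Per-bin gain in [0, 1] from clean and observed amplitude spectra.

    Assumed form: clean power divided by observed power, clipped to [0, 1].
    """
    gain = clean_amp ** 2 / (noisy_amp ** 2 + eps)
    return np.clip(gain, 0.0, 1.0)

# One frame with four frequency bins (assumed values).
noisy = np.array([1.0, 2.0, 0.5, 4.0])   # observed amplitude spectrum
clean = np.array([1.0, 1.0, 0.5, 0.0])   # retrieved clean amplitude spectrum

gain = wiener_style_gain(clean, noisy)
enhanced = gain * noisy                  # filter multiplied onto the input
```

Bins where the retrieved clean spectrum has little energy are attenuated, while bins where clean and observed spectra agree pass almost unchanged.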

[Processing of the feature conversion unit]
Next, the processing of the feature conversion unit 14 is described in detail. The feature conversion unit 14 converts the feature generated by the feature generation unit 13, such as mel-frequency cepstral coefficients, into a bottleneck feature b_t that is highly robust to acoustic distortion. As described above, a DNN-HMM acoustic model is applied to the feature conversion unit 14. Its processing is described below with reference to FIG. 3.

FIG. 3 is a conceptual diagram for explaining the processing of the feature conversion unit 14 configured with a DNN-HMM acoustic model. The feature conversion unit 14 receives as input data the feature x_t generated by the feature generation unit 13, such as mel-frequency cepstral coefficients. It receives not only the feature x_t of the frame t being processed but also the features of several frames before and after it.

For example, in addition to the 40-dimensional feature x_t (a row vector) of frame t, the feature conversion unit 14 receives the features of the five preceding and five following frames: x_{t-5}, x_{t-4}, x_{t-3}, x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2}, x_{t+3}, x_{t+4}, and x_{t+5}. In this case, it receives a 440-dimensional feature [x_{t-5}^T, ..., x_t^T, ..., x_{t+5}^T]^T covering 11 frames in total (T denotes vector transposition).

The feature x_t of the frame t being processed may consist not only of static features but also, for example, of their first- and second-order regression coefficients. In that case, the dimensionality of the feature received by the feature conversion unit 14 increases accordingly. For example, if x_t consists of a 40-dimensional static feature and its first- and second-order regression coefficients, it has 120 dimensions in total. Including the five frames before and after, the feature received by the feature conversion unit 14 has 120 × 11 = 1320 dimensions.
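The context stacking described above can be sketched as follows. The ±5 frame context and the 40- and 120-dimensional cases come from the text; padding the edges by repeating the first and last frames is an assumption, as is the function name `splice`.

```python
import numpy as np

def splice(features, context=5):
    """Stack each frame with `context` frames on either side.

    features: (T, D) -> (T, (2 * context + 1) * D).
    Edge frames are padded by repetition (an assumption of this sketch).
    """
    n_frames = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[k:k + n_frames] for k in range(2 * context + 1)])

rng = np.random.default_rng(0)
static = rng.standard_normal((20, 40))        # 40-dim static features
with_deltas = rng.standard_normal((20, 120))  # static + 1st/2nd regression

print(splice(static).shape)       # (20, 440)
print(splice(with_deltas).shape)  # (20, 1320)
```

The centre block of each spliced row is the frame itself, so `splice(static)[:, 200:240]` reproduces `static`.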

Having received the 440-dimensional feature, the feature conversion unit 14 passes it through several intermediate layers (typically about 5 to 10) of, for example, 2048 nodes each in the DNN-HMM acoustic model, and finally obtains the bottleneck feature b_t, compressed to about 80 dimensions by a bottleneck layer of, for example, 80 nodes. The feature conversion unit 14 inputs this bottleneck feature b_t to the matching unit 15.
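The layer structure described above can be sketched as a plain feed-forward pass that stops at the bottleneck layer. The weights here are random and untrained, the sigmoid activation and linear bottleneck output are assumptions, and the function names are hypothetical; the demo uses 256 hidden units instead of 2048 only to keep memory use small.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_layers(input_dim=440, hidden_dim=2048, n_hidden=5,
                bottleneck_dim=80, seed=0):
    """Random (untrained) weights, purely illustrative."""
    rng = np.random.default_rng(seed)
    sizes = [input_dim] + [hidden_dim] * n_hidden + [bottleneck_dim]
    return [(0.01 * rng.standard_normal((n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def bottleneck_features(spliced, layers):
    """Apply the hidden layers, then read out the bottleneck layer."""
    h = spliced
    for weights, bias in layers[:-1]:
        h = sigmoid(h @ weights + bias)   # nonlinear transformation per layer
    weights, bias = layers[-1]
    return h @ weights + bias             # linear bottleneck output (assumed)

layers = init_layers(hidden_dim=256)      # smaller than 2048 for the demo
spliced = np.random.default_rng(1).standard_normal((3, 440))
b = bottleneck_features(spliced, layers)
print(b.shape)  # (3, 80)
```

In a trained DNN-HMM acoustic model the layers after the bottleneck would predict HMM states during training; for feature extraction only the part up to the bottleneck is needed, which is what this sketch evaluates.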

By applying nonlinear feature transformations in multiple stages in the DNN-HMM acoustic model, the feature conversion unit 14 can obtain a bottleneck feature b_t that is highly robust to acoustic distortion. The matching unit 15 can then perform an accurate segment search by using the bottleneck feature b_t input from the feature conversion unit 14. The processing of the matching unit 15 using this bottleneck feature b_t is described next.

[Processing of the matching unit]
For simplicity, only the case model M for a single noise and reverberation environment is considered here. Also for simplicity, time warping is not considered when matching the feature segment y_t of the input signal against training data segments. In Embodiment 1, the bottleneck feature b_t produced by the preceding feature conversion unit 14 is input as the feature segment y_t of the input signal.

まず、マッチング部１５は、入力された特徴量のセグメントｙ_ｔと事例モデル記憶部１１に記憶された事例モデルＭのセグメントとのマッチングを行う。続いて、マッチング部１５は、事例モデルＭのセグメントの中から、入力信号の特徴量の系列ｙ_{ｔ：ｔ＋τ}に最も近いセグメントを探索し、入力信号に含まれるクリーン音声に最も類似するクリーン音声系列を与えると思われるセグメントＭ^ｔ _{ｕ：ｕ＋τｍａｘ}を求めて、出力する。これは、（２）式のように定式化することができる。 First, the matching unit 15 matches the segment y_t of the input feature amount against the segments of the case model M stored in the case model storage unit 11. Next, from among the segments of the case model M, the matching unit 15 searches for the segment closest to the input-signal feature sequence y_{t:t+τ}, and obtains and outputs the segment M^t_{u:u+τmax} that is expected to give the clean speech sequence most similar to the clean speech contained in the input signal. This can be formulated as equation (2).

ここで、入力される特徴量ｙ_ｔは、Ｌ個の時間フレームから成るとし、その入力信号の特徴量系列をｙ＝｛ｙ_ｔ:ｔ=１，２，・・・，Ｌ｝とする。また、ｙ_{ｔ：ｔ＋τ}を入力信号の特徴量の時間フレームｔからｔ＋τまでの系列とする。そして、Ｍ_{ｕ：ｕ＋τ}＝｛ｇ，ｍ_ｉ：ｉ＝ｕ，ｕ＋１，・・・，ｕ＋τ｝を、事例モデルＭの中のｕ番目からｕ＋τ番目までの連続する時間フレームに対応するガウス分布系列とする。 Here, the input feature amount y_t is assumed to consist of L time frames, and the feature sequence of the input signal is written y = {y_t : t = 1, 2, ..., L}. Let y_{t:t+τ} be the sequence of input-signal feature amounts from time frame t to t+τ. Let M_{u:u+τ} = {g, m_i : i = u, u+1, ..., u+τ} be the sequence of Gaussian distributions corresponding to the consecutive time frames from the u-th to the (u+τ)-th in the case model M.

入力信号の特徴量の系列ｙ_{ｔ：ｔ＋τ}と事例モデルＭの中のあるセグメントとの距離の定義や、入力信号の特徴量系列ｙ_{ｔ：ｔ＋τ}と一番近い事例モデルＭの探索方法として、ユークリッド距離など、他のいくつかの方法を考えることができる。ここでは、入力信号の特徴量系列に対する一番近い事例モデルＭのセグメントは、入力信号の特徴量系列によく一致する事例モデルＭのセグメントの中でも長さの最も長いものとする。つまり、入力信号の特徴量系列に最も近い事例モデルＭのセグメントＭ^ｔ _{ｕ：ｕ＋τ}は、（３）式に示す事後確率を最大化することで求めることができる。 Several methods, such as the Euclidean distance, can be considered for defining the distance between the input-signal feature sequence y_{t:t+τ} and a segment in the case model M, and for searching for the segment of the case model M closest to y_{t:t+τ}. Here, the segment of the case model M closest to the input feature sequence is taken to be the longest among the segments of the case model M that match the input feature sequence well. That is, the segment M^t_{u:u+τ} of the case model M closest to the input feature sequence can be found by maximizing the posterior probability shown in equation (3).

この場合、ｐ（Ｍ_ｕ:ｕ+τ|ｙ_ｔ:ｔ+τ）は、事後確率を表し、ｙ_ｔ:ｔ+τとＭ_ｕ:ｕ+τが比較的よく一致している場合、τが長ければ長いほど高い事後確率を与えるという特徴を持っている。より長いセグメントを探索するという方法を取ることで、ある時間に局所的に存在する雑音などの影響を受けにくくなり、雑音などに対して比較的ロバストなマッチングが行われると思われる。 Here, p(M_{u:u+τ} | y_{t:t+τ}) denotes the posterior probability. It has the property that, when y_{t:t+τ} and M_{u:u+τ} match relatively well, the longer τ is, the higher the posterior probability becomes. Searching for longer segments makes the matching less susceptible to noise that is present only locally in time, so the matching is expected to be relatively robust to such noise.

なお、（３）式の分子の項ｐ（ｙ_ｔ:ｔ+τ|Ｍ_ｕ:ｕ+τ）は、Ｍ_ｕ:ｕ+τに対応する事例モデルＭのセグメントに対するｙ_ｔ:ｔ+τの尤度である。この尤度は、（４）式で計算される。 Note that the numerator term p(y_{t:t+τ} | M_{u:u+τ}) in equation (3) is the likelihood of y_{t:t+τ} for the segment of the case model M corresponding to M_{u:u+τ}. This likelihood is calculated by equation (4).

ここでは、簡単のため、隣り合うフレームは独立であることを仮定している。（３）式の分母の第１項は、事例モデルＭの中のあらゆる時間フレームｕ’を開始点として，ｐ（ｙ_ｔ:ｔ+τ｜Ｍ_{ｕ’:ｕ’+τ}）の和を取ったものである。そして、（３）式の分母の第２項は、ガウス混合モデルｇに対するｙ_ｔ:ｔ+τの尤度であり、（５）式で計算される。 Here, for simplicity, adjacent frames are assumed to be independent. The first term of the denominator of equation (3) is the sum of p(y_{t:t+τ} | M_{u':u'+τ}) taken with every time frame u' in the case model M as the starting point. The second term of the denominator of equation (3) is the likelihood of y_{t:t+τ} under the Gaussian mixture model g, and is calculated by equation (5).
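The images of equations (3) to (5) are not reproduced in this text. A form consistent with the description above (a numerator likelihood, a denominator made of a sum over all start frames u' plus a Gaussian mixture term, and frame independence) might look as follows. This is a reconstruction under assumptions, not the patent's own typesetting: equal prior probabilities are assumed for all segments and for the mixture term, and the offset symbol ε is introduced here for illustration.

```latex
% (3) posterior of a case-model segment given the input segment (equal priors assumed)
p(M_{u:u+\tau} \mid y_{t:t+\tau}) =
  \frac{p(y_{t:t+\tau} \mid M_{u:u+\tau})}
       {\sum_{u'} p(y_{t:t+\tau} \mid M_{u':u'+\tau}) + p(y_{t:t+\tau} \mid g)}

% (4) segment likelihood under frame independence
p(y_{t:t+\tau} \mid M_{u:u+\tau}) = \prod_{\epsilon=0}^{\tau} g(y_{t+\epsilon} \mid m_{u+\epsilon})

% (5) likelihood of the same frames under the Gaussian mixture model g
p(y_{t:t+\tau} \mid g) = \prod_{\epsilon=0}^{\tau} \sum_{m=1}^{Q} w(m)\, g(y_{t+\epsilon} \mid m)
```

Under this form, a well-matched segment gains posterior mass as τ grows because the mixture term in the denominator shrinks relative to the matched segment's likelihood with every added frame, which agrees with the property stated for the posterior.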

続いて、マッチング部１５におけるセグメント探索処理の手順をさらに具体的に記述する。まず、セグメントの最大長を（τ_ｌｉｍ＋１）フレームに制限する。例えば、セグメントの最大長を３０フレームと制限するならば、τ_ｌｉｍ＝２９である。 Subsequently, the procedure of the segment search process in the matching unit 15 will be described more specifically. First, the maximum segment length is limited to (τ _lim +1) frames. For example, if the maximum segment length is limited to 30 frames, τ _lim = 29.

まず、マッチング部１５は、この制限の下で、τ＝０、すなわち、セグメント長＝１として、（３）式に従い、最大事後確率を与えるセグメント長＝１のセグメントを探索する。次に、マッチング部１５は、τ＝１、すなわち、セグメント長＝２として、（３）式に従い、最大事後確率を与えるセグメント長＝２のセグメントを探索する。 First, the matching unit 15 searches for a segment having a segment length = 1 that gives the maximum posterior probability according to the equation (3), with τ = 0, that is, the segment length = 1, under this restriction. Next, the matching unit 15 searches for a segment with segment length = 2 that gives the maximum posterior probability according to the equation (3), with τ = 1, that is, segment length = 2.

マッチング部１５は、この処理をτ＝τ_ｌｉｍまで繰り返す。そして、マッチング部１５は、探索した長さの異なるセグメント候補の中から、最大事後確率を与えるセグメントを見つける。τ_ｍａｘは、この最大事後確率を与えるセグメントの長さである。このようなマッチング部１５におけるセグメント探索処理は、図２に示すような、Ｉフレーム分のリニアなメモリで表現できる事例モデルＭ上で行うことができる。 The matching unit 15 repeats this process up to τ = τ_lim. The matching unit 15 then finds, among the searched segment candidates of different lengths, the segment that gives the maximum posterior probability. τ_max is the length of the segment giving this maximum posterior probability. This segment search in the matching unit 15 can be performed on a case model M that can be represented by a linear memory of I frames, as shown in FIG. 2.
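The search procedure above (try τ = 0, 1, ..., τ_lim, find the best start frame u for each length, then keep the candidate with the highest posterior) can be sketched as follows. This is an illustrative sketch, not the patented implementation: it assumes diagonal-covariance Gaussians, equal priors in the posterior, no time warping, and a brute-force scan over all start frames. The function and variable names are hypothetical.

```python
import numpy as np

def log_gauss_all(y, mu, var):
    """Log density of one frame y (D,) under every diagonal Gaussian; returns (Q,)."""
    return -0.5 * (np.sum(np.log(2 * np.pi * var), axis=1)
                   + np.sum((y - mu) ** 2 / var, axis=1))

def logsumexp(a):
    a = np.asarray(a, dtype=float)
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def search_longest_segment(y, t, m_idx, w, mu, var, tau_lim):
    """Return (u, tau, log_posterior) of the best case-model segment for y[t:t+tau+1].

    m_idx: Gaussian index per case-model frame (length I); w, mu, var: GMM parameters.
    """
    n = min(tau_lim + 1, len(y) - t, len(m_idx))   # longest usable segment length
    # Frame-wise log-likelihoods of the input frames under every Gaussian: (n, Q)
    ll = np.stack([log_gauss_all(y[t + e], mu, var) for e in range(n)])
    best = (None, None, -np.inf)
    for tau in range(n):
        # Numerator: log-likelihood of each candidate segment starting at u
        seg_ll = np.array([sum(ll[e, m_idx[u + e]] for e in range(tau + 1))
                           for u in range(len(m_idx) - tau)])
        # Second denominator term: log-likelihood under the whole mixture
        gmm_ll = sum(logsumexp(np.log(w) + ll[e]) for e in range(tau + 1))
        log_post = seg_ll - logsumexp(np.append(seg_ll, gmm_ll))
        u = int(np.argmax(log_post))
        if log_post[u] > best[2]:
            best = (u, tau, float(log_post[u]))
    return best

# Toy example: four well-separated Gaussians and an 8-frame case model.
mu = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
var = np.ones((4, 2))
w = np.full(4, 0.25)
m_idx = [0, 1, 2, 3, 0, 2, 1, 3]
y = mu[[2, 3, 0, 2]]                  # input frames that follow case-model frames 2..5
u, tau, log_post = search_longest_segment(y, 0, m_idx, w, mu, var, tau_lim=3)
print(u, tau, np.exp(log_post))
```

In the toy example the single-frame match is ambiguous (two case-model frames share the same Gaussian), while the four-frame match is unique, so the longest candidate receives the highest posterior.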

そして、マッチング部１５は、探索した最大事後確率を与えるセグメント、すなわち、入力信号に含まれるクリーン音声に最も類似するクリーン音声系列を与えると思われる事例モデルＭのセグメントＭ^ｔ _{ｕ：ｕ＋τｍａｘ}についての情報を、音声強調フィルタリング部１６に入力する。これによって、音声強調フィルタリング部１６は、セグメントＭ^ｔ _{ｕ：ｕ＋τｍａｘ}に対応する事例モデル記憶部１１内のクリーン音声の振幅スペクトルを用いて、音声強調のためのフィルタを作成し、該フィルタで入力信号をフィルタリングすることによって、強調音声信号を出力する。 The matching unit 15 then inputs, to the speech enhancement filtering unit 16, information about the segment found to give the maximum posterior probability, that is, the segment M^t_{u:u+τmax} of the case model M expected to give the clean speech sequence most similar to the clean speech contained in the input signal. The speech enhancement filtering unit 16 then creates a filter for speech enhancement using the amplitude spectrum of the clean speech in the case model storage unit 11 corresponding to the segment M^t_{u:u+τmax}, and outputs an enhanced speech signal by filtering the input signal with that filter.

［信号処理装置における信号処理方法］ [Signal processing method in the signal processing device]
次に、信号処理装置１における信号処理方法について説明する。図４は、図１に示す信号処理装置１が実行する処理手順を示すフローチャートである。 Next, the signal processing method in the signal processing device 1 will be described. FIG. 4 is a flowchart showing the processing procedure executed by the signal processing device 1 shown in FIG. 1.

まず、フーリエ変換部１２は、入力信号を振幅スペクトルに変換するフーリエ変換処理（ステップＳ１）を行う。特徴量生成部１３は、フーリエ変換部１２から出力された振幅スペクトルから、メル周波数ケプストラム係数等の特徴量を生成する特徴量生成処理（ステップＳ２）を行う。 First, the Fourier transform unit 12 performs a Fourier transform process (step S1) for converting an input signal into an amplitude spectrum. The feature amount generation unit 13 performs a feature amount generation process (step S2) for generating a feature amount such as a mel frequency cepstrum coefficient from the amplitude spectrum output from the Fourier transform unit 12.
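Step S1 can be illustrated with a short-time Fourier transform that keeps only the magnitude. The sketch below is an assumption-laden illustration: the 400-sample frame, 160-sample hop (25 ms and 10 ms at 16 kHz), the Hann window, and the function name `amplitude_spectrogram` are not taken from the patent.

```python
import numpy as np

def amplitude_spectrogram(signal, frame_len=400, hop=160):
    """Split a 1-D signal into windowed frames and return |FFT| per frame.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    Frame length, hop, and window are illustrative assumptions.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 1 kHz tone sampled at 16 kHz.
sr = 16000
time = np.arange(sr) / sr
spec = amplitude_spectrogram(np.sin(2 * np.pi * 1000 * time))
print(spec.shape)
```

Features such as mel-frequency cepstral coefficients (step S2) would then be computed from each row of this amplitude spectrum.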

特徴量変換部１４は、特徴量生成部１３が生成した特徴量を、雑音又は残響（音響歪み）の低減処理を施したボトルネック特徴量に変換する特徴量変換処理（ステップＳ３）を行う。 The feature amount conversion unit 14 performs a feature amount conversion process (step S3) for converting the feature amount generated by the feature amount generation unit 13 into a bottleneck feature amount subjected to noise or reverberation (acoustic distortion) reduction processing.

マッチング部１５は、事例モデル記憶部１１の事例モデルＭのセグメントと、入力されたボトルネック特徴量のセグメントとのマッチングを行い、事例モデルＭのセグメントの中から、入力されたボトルネック特徴量のセグメントに対して最も高い事後確率をとるセグメントを探索するマッチング処理（ステップＳ４）を行う。 The matching unit 15 performs a matching process (step S4) that matches the segments of the case model M in the case model storage unit 11 against the segment of the input bottleneck feature amount and searches the segments of the case model M for the segment with the highest posterior probability for the input bottleneck feature segment.

音声強調フィルタリング部１６は、マッチング部１５が探索した事例モデルＭのセグメントの特徴量に対応するクリーン音声の振幅スペクトルを用いて音声強調のためのフィルタを作成し、該フィルタを入力信号に乗算した強調音声を出力する音声強調フィルタリング処理（ステップＳ５）を行う。 The speech enhancement filtering unit 16 creates a filter for speech enhancement using the amplitude spectrum of clean speech corresponding to the feature amount of the segment of the case model M searched by the matching unit 15, and multiplies the input signal by the filter. A voice enhancement filtering process (step S5) for outputting the emphasized voice is performed.

［本実施の形態１の効果］ [Effects of Embodiment 1]
このように、本実施の形態１に係る信号処理装置１は、マッチング部１５がセグメント探索に用いる特徴量ｙ_ｔとして、メル周波数ケプストラム係数等の振幅スペクトルから単純に得られる特徴量ｘ_ｔではなく、この特徴量ｘ_ｔに対して、さらに雑音又は残響（音響歪み）の低減処理を施したボトルネック特徴量ｂ_ｔを用いている。言い換えれば，マッチング部１５は、音響歪み耐性が高いボトルネック特徴量ｂ_ｔを用いてセグメント探索を行うため、セグメント探索に対する雑音又は残響の影響を低減でき、セグメント探索の精度を高めることができる。したがって、信号処理装置１によれば、入力信号に類似するクリーン音声の特徴量を高精度で探索でき、入力信号を明瞭な強調音声信号に変換することができる。 As described above, the signal processing device 1 according to Embodiment 1 uses, as the feature amount y_t that the matching unit 15 uses for the segment search, not the feature amount x_t obtained simply from the amplitude spectrum, such as mel-frequency cepstral coefficients, but the bottleneck feature amount b_t obtained by further applying noise or reverberation (acoustic distortion) reduction processing to x_t. In other words, because the matching unit 15 performs the segment search using the bottleneck feature amount b_t, which is highly robust to acoustic distortion, the influence of noise or reverberation on the segment search can be reduced and the accuracy of the segment search can be improved. Therefore, according to the signal processing device 1, a feature amount of clean speech similar to the input signal can be searched for with high accuracy, and the input signal can be converted into a clear enhanced speech signal.

［事例モデル生成装置］ [Case model generation device]
また、信号処理装置１の事例モデル記憶部１１に記憶される事例モデルＭを生成する事例モデル生成装置２について説明する。この事例モデル生成装置２においても、例えば、学習用の音声信号から生成されたメル周波数ケプストラム係数等の特徴量ｘ_ｔに対して、雑音又は残響（音響歪み）の低減処理を施したボトルネック特徴量ｂ_ｔを用いて、事例モデルＭの生成を行っている。 Next, the case model generation device 2, which generates the case model M stored in the case model storage unit 11 of the signal processing device 1, is described. The case model generation device 2 also generates the case model M using the bottleneck feature amount b_t, obtained by applying noise or reverberation (acoustic distortion) reduction processing to a feature amount x_t, such as mel-frequency cepstral coefficients, generated from a training speech signal.

図５は、事例モデル生成装置２の機能構成例を示すブロック図である。図５に示す事例モデル生成装置２は、例えばＲＯＭ、ＲＡＭ、ＣＰＵ等を含むコンピュータ等に所定のプログラムが読み込まれて、ＣＰＵが所定のプログラムを実行することで実現される。事例モデル生成装置２は、フーリエ変換部１２、特徴量生成部１３、特徴量変換部１４、ガウス混合モデル学習部２５及び最尤ガウス分布計算部２６を有する。 FIG. 5 is a block diagram illustrating a functional configuration example of the case model generation device 2. The case model generation device 2 illustrated in FIG. 5 is realized by, for example, a predetermined program being read into a computer including a ROM, a RAM, a CPU, and the like, and the CPU executing the predetermined program. The case model generation device 2 includes a Fourier transform unit 12, a feature amount generation unit 13, a feature amount conversion unit 14, a Gaussian mixture model learning unit 25, and a maximum likelihood Gaussian distribution calculation unit 26.

まず、事例モデル生成装置２に入力される学習用の音声信号について説明する。事例モデル生成装置２に入力される信号は、様々な雑音／残響環境の音声信号である。この様々な雑音／残響環境の音声信号の中には、クリーン環境の音声信号が含まれている。具体的には、音声コーパスなどから得られる大量のクリーン音声と、種々の環境で得られる雑音及び残響データ（雑音信号の波形や、室内インパルス応答等）とを用い、さまざまな環境での観測信号を模擬生成した模擬観測信号が、学習用の音声信号として事例モデル生成装置２に入力される。これらの学習用の音声信号のそれぞれについて以下の処理が行われる。 First, a learning speech signal input to the case model generation device 2 will be described. The signal input to the case model generation apparatus 2 is an audio signal having various noise / reverberation environments. Among the various noise / reverberation environment audio signals, clean environment audio signals are included. Specifically, using a large amount of clean speech obtained from a speech corpus and noise and reverberation data (noise signal waveforms, room impulse responses, etc.) obtained in various environments, observation signals in various environments The simulated observation signal generated by simulating the signal is input to the case model generation device 2 as a speech signal for learning. The following processing is performed for each of these learning speech signals.
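Generating a simulated observation from clean speech, a room impulse response, and a noise recording can be sketched as below. This is a generic illustration under assumptions, not the procedure specified by the patent: the truncation to the clean signal's length, the SNR-based noise scaling, and the function name `simulate_observation` are all illustrative choices.

```python
import numpy as np

def simulate_observation(clean, rir, noise, snr_db):
    """Convolve clean speech with a room impulse response and add scaled noise.

    The noise is scaled so that the reverberant-speech-to-noise power ratio
    equals snr_db (an assumed convention for this sketch).
    """
    reverberant = np.convolve(clean, rir)[:len(clean)]
    noise = noise[:len(reverberant)]
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + scale * noise

# Synthetic stand-ins for a clean utterance, an impulse response, and noise.
rng = np.random.default_rng(0)
clean = rng.normal(size=1000)
rir = np.array([1.0, 0.5, 0.25])
noise = rng.normal(size=1000)
observed = simulate_observation(clean, rir, noise, snr_db=10.0)
print(observed.shape)
```

Passing an impulse response of `[1.0]` and a very high SNR approximates the clean environment that the text says is included among the training conditions.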

フーリエ変換部１２、特徴量生成部１３及び特徴量変換部１４は、図１に示す信号処理装置１におけるフーリエ変換部１２、特徴量生成部１３及び特徴量変換部１４とそれぞれ同様の処理を、学習用の音声信号に対して実行する。特徴量変換部１４は、学習用の音声信号に対応する特徴量ｘ_ｔをボトルネック特徴量ｂ_ｔに変換し、ガウス混合モデル学習部２５に入力する。 The Fourier transform unit 12, the feature amount generation unit 13, and the feature amount conversion unit 14 perform the same processes as the Fourier transform unit 12, the feature amount generation unit 13, and the feature amount conversion unit 14 in the signal processing apparatus 1 illustrated in FIG. This is performed on the audio signal for learning. The feature amount conversion unit 14 converts the feature amount x _t corresponding to the learning speech signal into a bottleneck feature amount b _t and inputs it to the Gaussian mixture model learning unit 25.

ガウス混合モデル学習部２５は、各短時間フレームｔでの特徴量ｂ_ｉを学習データとして、通常の最尤推定法によりガウス混合モデルｇを得る。ここで、ガウス混合モデル学習部２５では、前段の特徴量変換部１４から入力されたボトルネック特徴量ｂ_ｔを学習データとして用いてガウス混合モデルｇを得る。このガウス混合モデルｇは、（６）式により示される。また、ガウス混合モデルｇの中のガウス分布を表すｇ（ｂ_ｉ｜ｍ）は、（７）式により示される。なお、ｂ_ｉは、ｉ番目のフレームのボトルネック特徴量である。 The Gaussian mixture model learning unit 25 obtains a Gaussian mixture model g by the ordinary maximum likelihood estimation method, using the feature amount b_i at each short-time frame as training data. Here, the Gaussian mixture model learning unit 25 uses the bottleneck feature amount b_t input from the preceding feature amount conversion unit 14 as the training data. This Gaussian mixture model g is expressed by equation (6). Further, g(b_i | m), which represents a Gaussian distribution in the Gaussian mixture model g, is expressed by equation (7). Note that b_i is the bottleneck feature amount of the i-th frame.

ｇ（ｂ_ｉ｜ｍ）は、平均μ_ｍ、分散Σ_ｍを持つｍ番目のガウス分布を表す。ｇ（ｂ_ｉ｜ｍ）は、多くの場合多次元ガウス分布であり、その次元数は特徴量ｂ_ｉの次元数と同じである。ｇ（ｂ_ｉ｜ｍ）が多次元ガウス分布である場合、平均μ_ｍ及び分散Σ_ｍのそれぞれはベクトルとなる。ここでは、ｇ（ｂ_ｉ｜ｍ）が多次元ガウス分布であっても、記載の簡略化のため、ｇ（ｂ_ｉ｜ｍ）のことを単にガウス分布と表現する。ｗ（ｍ）は、ｍ番目のガウス分布に対する混合重みを表す。Ｑは、混合数を表す。Ｑには、例えば、４０９６や８１９２など、かなり大きな値を設定する。 g (b _i | m) represents the m-th Gaussian distribution with mean μ _m and variance Σ _m . g (b _i | m) is often a multidimensional Gaussian distribution, and the number of dimensions is the same as the number of dimensions of the feature quantity b _i . When g (b _i | m) is a multidimensional Gaussian distribution, each of the mean μ _m and the variance Σ _m is a vector. Here, even if g (b _i | m) is a multidimensional Gaussian distribution, g (b _i | m) is simply expressed as a Gaussian distribution in order to simplify the description. w (m) represents the mixing weight for the mth Gaussian distribution. Q represents the number of mixtures. For Q, for example, a fairly large value such as 4096 or 8192 is set.
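The images of equations (6) and (7) are not reproduced in this text. Given the definitions above (mixture weights w(m), Q components, mean μ_m, and variance Σ_m held as a vector, that is, a diagonal covariance), the standard Gaussian mixture form would read as follows. The left-hand notation p(b_i | g), the symbol D for the dimension of b_i, and the per-dimension index d are assumptions introduced for this sketch.

```latex
% (6) Gaussian mixture model g with Q components
p(b_i \mid g) = \sum_{m=1}^{Q} w(m)\, g(b_i \mid m)

% (7) m-th Gaussian with mean \mu_m and diagonal variance \Sigma_m
g(b_i \mid m) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\,\Sigma_{m,d}}}
  \exp\!\left( -\frac{(b_{i,d} - \mu_{m,d})^{2}}{2\,\Sigma_{m,d}} \right)
```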

最尤ガウス分布計算部２６は、各時間フレームｉに対して最大の尤度を与えるガウス混合モデルｇの中のガウス分布のインデックスｍ_ｉを求め、そのインデックスｍ_ｉの時間系列を、事例モデルＭの一つのセグメントとして取得する。なお、事例モデルＭは、ガウス分布のインデックスｍ_ｉの集合とガウス混合モデルｇを用いて、前述した（１）式のように表される。 The maximum likelihood Gaussian distribution calculation unit 26 obtains, for each time frame i, the index m_i of the Gaussian distribution in the Gaussian mixture model g that gives the maximum likelihood, and acquires the time sequence of the indices m_i as one segment of the case model M. The case model M is expressed as in equation (1) above, using the set of Gaussian distribution indices m_i and the Gaussian mixture model g.
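The per-frame selection of the maximum-likelihood Gaussian can be sketched as follows. The sketch assumes diagonal covariances and scores each Gaussian by g(b_i | m) alone, without the mixture weight w(m); whether the weight enters the comparison is not stated in this passage. The function name is hypothetical.

```python
import numpy as np

def max_likelihood_indices(features, mu, var):
    """Return, for each frame, the index of the Gaussian with the highest likelihood.

    features: (L, D) bottleneck features; mu, var: (Q, D) diagonal Gaussian parameters.
    """
    diff = features[:, None, :] - mu[None, :, :]                 # (L, Q, D)
    log_g = -0.5 * (np.sum(np.log(2 * np.pi * var), axis=1)[None, :]
                    + np.sum(diff ** 2 / var[None, :, :], axis=2))
    return np.argmax(log_g, axis=1)                              # index sequence m_i

# Three toy Gaussians and four frames lying near components 2, 0, 1, 1.
mu = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
var = np.ones((3, 2))
frames = mu[[2, 0, 1, 1]] + 0.1
segment = max_likelihood_indices(frames, mu, var)
print(segment.tolist())
```

The resulting index sequence is what the text stores as one segment of the case model M, so the model itself only needs one integer per training frame plus the shared mixture g.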

事例モデルＭのセグメントの生成は、学習用の音声信号のそれぞれに対して行われ、生成された各セグメントを含む事例モデルＭは、事例モデル記憶部１１（図１）に記憶される。また、環境がクリーンの場合は、フーリエ変換部１２から出力された振幅スペクトルデータもクリーン音声の振幅スペクトルとして事例モデル記憶部１１（図１）に記憶される。 The generation of the segment of the case model M is performed for each of the learning speech signals, and the generated case model M including each segment is stored in the case model storage unit 11 (FIG. 1). When the environment is clean, the amplitude spectrum data output from the Fourier transform unit 12 is also stored in the case model storage unit 11 (FIG. 1) as the amplitude spectrum of clean speech.

［事例モデル生成処理］ [Case model generation process]
次に、事例モデル生成処理について説明する。図６は、事例モデル生成装置２による事例モデル生成処理の処理手順を示すフローチャートである。 Next, the case model generation process will be described. FIG. 6 is a flowchart showing the processing procedure of the case model generation process performed by the case model generation device 2.

事例モデル生成装置２において、フーリエ変換部１２、特徴量生成部１３及び特徴量変換部１４は、入力された学習用の音声信号に対し、図４に示すステップＳ１〜Ｓ３と同様の手順でステップＳ１１〜ステップＳ１３の処理を行う。 In the case model generation device 2, the Fourier transform unit 12, the feature amount generation unit 13, and the feature amount conversion unit 14 perform steps S11 to S13 on the input training speech signal in the same manner as steps S1 to S3 shown in FIG. 4.

ガウス混合モデル学習部２５は、前段の特徴量変換部１４から入力されたボトルネック特徴量ｂ_ｔを学習データとして用い、通常の最尤推定法によりガウス混合モデルｇを得るガウス混合モデル学習処理を行う（ステップＳ１４）。 The Gaussian mixture model learning unit 25 performs a Gaussian mixture model learning process for obtaining a Gaussian mixture model g by a normal maximum likelihood estimation method using the bottleneck feature amount b _t input from the preceding feature amount conversion unit 14 as learning data. Perform (step S14).

続いて、最尤ガウス分布計算部２６は、各時間フレームｉに対して最大の尤度を与えるガウス混合モデルｇの中のガウス分布のインデックスｍ_ｉを求め、求めたインデックスｍ_ｉの時間系列を、事例モデルＭの一つのセグメントとして取得する最尤ガウス分布計算処理を行う（ステップＳ１５）。そして、事例モデル生成装置２は、このインデックスｍ_ｉの時間系列を、事例モデルＭの一つのセグメントとして信号処理装置１の事例モデル記憶部１１に格納する格納処理を行う（ステップＳ１６）。 Subsequently, the maximum likelihood Gaussian distribution calculation unit 26 performs a maximum likelihood Gaussian distribution calculation process (step S15) that obtains, for each time frame i, the index m_i of the Gaussian distribution in the Gaussian mixture model g that gives the maximum likelihood, and acquires the time sequence of the obtained indices m_i as one segment of the case model M. The case model generation device 2 then performs a storage process (step S16) that stores this time sequence of indices m_i in the case model storage unit 11 of the signal processing device 1 as one segment of the case model M.

このように、事例モデル生成装置２では、信号処理装置１に対応させて、ボトルネック特徴量ｂ_ｔを用いて事例モデルＭの生成を行っている。 As described above, the case model generation device 2 generates the case model M using the bottleneck feature quantity b _t in correspondence with the signal processing device 1.

［実施の形態２］ [Embodiment 2]
次に、実施の形態２について説明する。実施の形態２では、音響歪みの影響を軽減させるとともに、話者性を考慮したセグメント探索を行う信号処理装置について説明する。 Next, Embodiment 2 will be described. Embodiment 2 describes a signal processing device that reduces the influence of acoustic distortion and performs a segment search that takes speaker characteristics into account.

［信号処理装置の構成］ [Configuration of the signal processing device]
図７は、実施の形態２に係る信号処理装置の構成を示すブロック図である。図７に示すように、実施の形態２に係る信号処理装置２０１は、図１に示す信号処理装置１と比して、特徴量変換部１４と並列に設けられた話者特徴量生成部２１７と、特徴量変換部１４及び話者特徴量生成部２１７の後段に設けられた連結部２１８と、をさらに有する。 FIG. 7 is a block diagram showing the configuration of the signal processing device according to Embodiment 2. As shown in FIG. 7, compared with the signal processing device 1 shown in FIG. 1, the signal processing device 201 according to Embodiment 2 further includes a speaker feature amount generation unit 217 provided in parallel with the feature amount conversion unit 14, and a concatenating unit 218 provided downstream of the feature amount conversion unit 14 and the speaker feature amount generation unit 217.

話者特徴量生成部２１７は、話者の特徴を表現した話者特徴量を生成する。話者特徴量生成部２１７は、特徴量生成部１３から出力されるメルケプストラム等の特徴量ｘ_ｔを受け取り、この特徴量ｘ_ｔを用いて、話者性を表現する例えばi-vector等の数十〜数百次元程度の話者特徴量ｗ_ｔを生成する。 The speaker feature value generation unit 217 generates a speaker feature value expressing the features of the speaker. The speaker feature quantity generation unit 217 receives a feature quantity x _t such as a mel cepstrum output from the feature quantity generation unit 13, and uses the feature quantity x _t to express speaker characteristics such as an i-vector. A speaker feature w _t of about several tens to several hundreds of dimensions is generated.

連結部２１８は、特徴量変換部１４が変換したボトルネック特徴量ｂ_ｔと、話者特徴量生成部２１７が生成した話者特徴量ｗ_ｔとを連結した連結特徴量［ｂ_ｔ＾Ｔ，ｗ_ｔ^Ｔ］＾Ｔ(Ｔはベクトルの転置を表す)を生成し、後段のマッチング部１５に入力する。 The concatenating unit 218 connects the bottleneck feature value b _t converted by the feature value converting unit 14 and the speaker feature value w _t generated by the speaker feature value generating unit 217 [b _t ^ T, w _t ^ T] ^ T (T represents transposition of the vector) is generated and input to the matching unit 15 at the subsequent stage.
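The concatenation [b_t^T, w_t^T]^T amounts to stacking the two vectors for every frame. The minimal sketch below assumes that a single utterance-level speaker vector is repeated across all frames; that tiling, the dimensions (80 and 100), and the function name are assumptions for illustration.

```python
import numpy as np

def concat_features(bottleneck, speaker_vec):
    """Append the same speaker vector to every frame of bottleneck features.

    bottleneck: (L, Db) per-frame features; speaker_vec: (Dw,) utterance-level vector.
    Returns an (L, Db + Dw) array of concatenated features.
    """
    tiled = np.tile(speaker_vec, (bottleneck.shape[0], 1))
    return np.hstack([bottleneck, tiled])

b = np.zeros((5, 80))      # five frames of 80-dimensional bottleneck features
w_e = np.ones(100)         # one 100-dimensional speaker vector
combined = concat_features(b, w_e)
print(combined.shape)
```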

そして、マッチング部１５は、連結特徴量［ｂ_ｔ＾Ｔ，ｗ_ｔ^Ｔ］＾Ｔが混合分布モデルの各分布に該当する確率を示す事後確率を計算し、最も高い事後確率をとるクリーン音声特徴量を入力信号に対応するクリーン音声特徴量として求める。 Then, the matching unit 15 calculates a posteriori probability indicating the probability that the connected feature value [b _t ^ T, w _t ^ T] ^ T corresponds to each distribution of the mixed distribution model, and clean speech that takes the highest a posteriori probability. The feature amount is obtained as a clean speech feature amount corresponding to the input signal.

［話者特徴量生成部の処理］ [Processing of the speaker feature amount generation unit]
ここで、話者特徴量生成部２１７による話者特徴量ｗ_ｔの生成処理について説明する。ここでは、話者特徴量生成部２１７が、話者の特徴を数十〜数百次元程度のベクトルで表現したi-vectorと呼ばれる特徴量ベクトル（ベクトルｗ_ｅ）を生成する場合について説明する。また、ここでは、ＧＭＭ−ＵＢＭ（Universal Background Model）アプローチで、話者認識におけるi-vectorを抽出する方法について説明する。ＧＭＭ−ＵＢＭアプローチは、「音声らしい」モデル(ＵＢＭ)を多数の不特定話者の大量のＵＢＭ学習用の音声データを用いて学習しておき、新たな話者のモデル（ＧＭＭ）は、当該話者の少量の音声データを用いてＵＢＭを適応して得るという手法である。ＵＢＭは、図示しない記憶部に記憶されている。 Here, the process by which the speaker feature amount generation unit 217 generates the speaker feature amount w_t is described. The case is described in which the speaker feature amount generation unit 217 generates a feature vector called an i-vector (vector w_e), which represents the characteristics of a speaker as a vector of several tens to several hundreds of dimensions. A method of extracting the i-vector used in speaker recognition with the GMM-UBM (Universal Background Model) approach is also described. In the GMM-UBM approach, a "speech-like" model (UBM) is trained in advance on a large amount of UBM training speech data from many unspecified speakers, and the model (GMM) of a new speaker is obtained by adapting the UBM using a small amount of that speaker's speech data. The UBM is stored in a storage unit (not shown).

以下、i-vectorであるベクトルｗ_ｅの具体的な一連の抽出手順について述べる。i-vectorｗ_ｅを求めるため、まず、実施の形態１に示した（３）式を用いて、信号処理装置２０１に入力された入力信号ｅから得られるＬフレームの特徴量ベクトル系列Ｘ_ｅの各フレームの特徴量ｘ_ｔ（ｔ=１，２，・・・，Ｌ）がＵＢＭのｍ番目のガウス分布から生成される事後確率γ_ｔ(ｍ)を計算する。続いて、（３）式で計算した事後確率γ_t(ｍ)を用いて、下記の（８）式〜（１２）式に従い、i-vectorｗ_ｅを計算する。 The specific procedure for extracting the vector w_e, which is the i-vector, is described below. To obtain the i-vector w_e, first, using equation (3) shown in Embodiment 1, the posterior probability γ_t(m) that the feature amount x_t (t = 1, 2, ..., L) of each frame of the L-frame feature vector sequence X_e, obtained from the input signal e input to the signal processing device 201, is generated from the m-th Gaussian distribution of the UBM is calculated. Then, using the posterior probability γ_t(m) calculated by equation (3), the i-vector w_e is calculated according to equations (8) to (12) below.

事後確率γ_t(ｍ)を用いると、ＵＢＭを用いた入力信号ｅに対する０次、１次のBaum-Welch統計量Ｎ_ｅ,ｍ、ベクトルＦ_ｅ,ｍは、下記の（８）式及び（９）式のようにそれぞれ書くことができる。ただし、ベクトルＦ_ｅ,ｍは、Ｄ次元のベクトルである。 Using the posterior probability γ _t (m), the 0th-order and 1st-order Baum-Welch statistics N _{e, m} and the vector F _{e, m} for the input signal e using UBM are expressed by the following equation (8) and ( 9) Each can be written as However, the vector F _{e, m} is a D-dimensional vector.

さらに、（８）式及び（９）式を用いて、（１０）式及び（１１）式のように、０次、１次のBaum-Welch統計量である行列Ｎ_ｅ、ベクトルＦ_ｅを定義する。ただし、行列Ｎ_ｅはＣＤ次元×ＣＤ次元の行列であり、ベクトルＦ_ｅはＤ次元のベクトルである。 Further, using equation (8) and equation (9), a matrix N _e and a vector F _e that are 0th-order and first-order Baum-Welch statistics are defined as in equations (10) and (11). To do. However, the matrix N _e is a CD dimension × CD dimension matrix, and the vector F _e is a D dimension vector.

ここで、上記の（１０）式の対角成分に現れる行列Ｉ_Ｄは、Ｄ次元×Ｄ次元の単位行列である。また、行列Ｔは、全変動行列と呼ばれるＣＤ次元×Ｍ次元の矩形行列（Ｍ＜＜ＣＤ)である。行列Σを全変動行列Ｔで表現できない残留変動成分をモデル化するＤ次元×Ｄ次元の対角共分散行列とする。以上を用いてi-vectorｗ_ｅは、（１２）式のように計算できる。 Here, the matrix I_D appearing in the diagonal blocks of equation (10) is the D × D identity matrix. The matrix T is a CD × M rectangular matrix (M << CD) called the total variability matrix. Let the matrix Σ be a D × D diagonal covariance matrix that models the residual variability that cannot be expressed by the total variability matrix T. Using the above, the i-vector w_e can be calculated as in equation (12).
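The images of equations (8) to (12) are not reproduced in this text. The forms below follow the i-vector formulation commonly used in speaker recognition and are offered as a plausible reconstruction, not as the patent's own equations. Assumptions: the first-order statistic is written in its centered form using the UBM mean μ_m, C denotes the number of UBM Gaussians, and, for the matrix products in (12) to be conformable with the CD × M matrix T, the stacked vector F_e and the matrix Σ are treated as having CD rows.

```latex
% (8), (9) zeroth- and first-order Baum-Welch statistics for the m-th UBM Gaussian
N_{e,m} = \sum_{t=1}^{L} \gamma_t(m), \qquad
F_{e,m} = \sum_{t=1}^{L} \gamma_t(m)\,(x_t - \mu_m)

% (10), (11) stacked statistics over all C Gaussians
N_e = \begin{bmatrix} N_{e,1} I_D & & 0 \\ & \ddots & \\ 0 & & N_{e,C} I_D \end{bmatrix}, \qquad
F_e = \begin{bmatrix} F_{e,1} \\ \vdots \\ F_{e,C} \end{bmatrix}

% (12) i-vector
w_e = \left( I_M + T^{\top} \Sigma^{-1} N_e\, T \right)^{-1} T^{\top} \Sigma^{-1} F_e
```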

なお、（１２）式における行列Ｉ_Ｍは、Ｍ次元×Ｍ次元の単位行列である。（１２）式に示すベクトルｗ_ｅが入力音声データｅに対するＭ次元のi-vectorである。話者特徴量生成部２１７は、このベクトルｗ_ｅを、話者特徴量ｗ_ｔとして、連結部２１８に出力する。 Note that the matrix I_M in equation (12) is the M × M identity matrix. The vector w_e given by equation (12) is the M-dimensional i-vector for the input speech data e. The speaker feature amount generation unit 217 outputs this vector w_e to the concatenating unit 218 as the speaker feature amount w_t.
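The extraction steps (frame posteriors, Baum-Welch statistics, then a linear solve) can be sketched with NumPy. This is a sketch of the commonly used i-vector computation under assumptions, not the patent's exact procedure: the UBM has diagonal covariances, the frame posteriors are ordinary mixture responsibilities, the first-order statistics are centered on the UBM means, and the residual covariance is supplied as a length-CD diagonal. The matrix `T` in the example is random and untrained, and all names are hypothetical.

```python
import numpy as np

def extract_ivector(X, w, mu, var, T, sigma_diag):
    """Compute an M-dimensional i-vector for one utterance.

    X: (L, D) features; w: (C,), mu: (C, D), var: (C, D) UBM parameters;
    T: (C*D, M) total variability matrix; sigma_diag: (C*D,) residual variances.
    """
    C, D = mu.shape
    M = T.shape[1]
    # Frame posteriors gamma_t(m) under the UBM (computed in the log domain).
    diff = X[:, None, :] - mu[None, :, :]
    log_g = -0.5 * (np.sum(np.log(2 * np.pi * var), axis=1)[None, :]
                    + np.sum(diff ** 2 / var[None, :, :], axis=2))
    log_p = np.log(w)[None, :] + log_g
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)        # (L, C)
    # Zeroth- and centered first-order Baum-Welch statistics.
    N = gamma.sum(axis=0)                            # (C,)
    F = gamma.T @ X - N[:, None] * mu                # (C, D)
    n_diag = np.repeat(N, D)                         # diagonal of the block matrix N_e
    f_vec = F.reshape(-1)                            # stacked vector F_e
    # Solve (I_M + T' Sigma^-1 N_e T) w_e = T' Sigma^-1 F_e.
    Tt_Sinv = T.T / sigma_diag[None, :]
    A = np.eye(M) + (Tt_Sinv * n_diag[None, :]) @ T
    return np.linalg.solve(A, Tt_Sinv @ f_vec)

rng = np.random.default_rng(0)
C, D, M = 3, 2, 4
mu = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
var = np.ones((C, D))
w = np.full(C, 1.0 / C)
T = rng.normal(size=(C * D, M))          # hypothetical, untrained matrix
sigma_diag = np.ones(C * D)
X = rng.normal(size=(50, D)) + mu[1]
w_e = extract_ivector(X, w, mu, var, T, sigma_diag)
print(w_e.shape)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical choice of this sketch; the result is the same vector that the closed-form expression describes.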

［信号処理装置の処理］ [Processing of the signal processing device]
そこで、信号処理装置２０１が強調音声信号を出力するまでの処理について説明する。 Next, the processing up to the point where the signal processing device 201 outputs an enhanced speech signal will be described.
図８は、信号処理装置２０１が実行する処理手順を示すフローチャートである。 FIG. 8 is a flowchart showing the processing procedure executed by the signal processing device 201.

ステップＳ２１〜ステップＳ２３は、図４に示すステップＳ１〜Ｓ３である。そして、話者特徴量生成部２１７は、入力された特徴量ｘ_ｔを用いて、話者特徴量ｗ_ｔを生成する話者特徴量生成処理を行う（ステップＳ２４）。なお、ステップＳ２３及びステップＳ２４は、例えば、並列に実行される。 Steps S21 to S23 are the same as steps S1 to S3 shown in FIG. 4. The speaker feature amount generation unit 217 then performs a speaker feature amount generation process (step S24) that generates the speaker feature amount w_t from the input feature amount x_t. Steps S23 and S24 are executed in parallel, for example.

連結部２１８は、特徴量変換部１４が変換したボトルネック特徴量ｂ_ｔと、話者特徴量生成部２１７が生成した話者特徴量ｗ_ｔとを連結した連結特徴量［ｂ_ｔ＾Ｔ，ｗ_ｔ^Ｔ］＾Ｔ(Ｔはベクトルの転置を表す)を生成する連結処理を行う（ステップＳ２５）。 The concatenating unit 218 connects the bottleneck feature value b _t converted by the feature value converting unit 14 and the speaker feature value w _t generated by the speaker feature value generating unit 217 [b _t ^ T, A concatenation process for generating w _t ^ T] ^ T (T represents transposition of the vector) is performed (step S25).

マッチング部１５は、事例モデル記憶部１１の事例モデルＭのセグメントに対するマッチング対象として、連結部２１８が生成した連結特徴量［ｂ_ｔ＾Ｔ，ｗ_ｔ^Ｔ］＾Ｔを用い、図４のステップＳ４と同様の処理手順を行って、マッチング処理を行う（ステップＳ２６）。図８に示すステップＳ２７は、図４に示すステップＳ５である。 The matching unit 15 uses the connected feature [b _t ^ T, w _t ^ T] ^ T generated by the connecting unit 218 as a matching target for the segment of the case model M in the case model storage unit 11, and performs the steps of FIG. The same processing procedure as S4 is performed to perform matching processing (step S26). Step S27 shown in FIG. 8 is step S5 shown in FIG.

［実施の形態２の効果］ [Effects of Embodiment 2]
音声認識においては、話者性は不要な情報であるので、ＤＮＮ−ＨＭＭ音響モデルを通す特徴量変換処理では、話者性を軽減するような特徴量変換を行う。したがって、特徴量変換部１４においては、ＤＮＮ−ＨＭＭ音響モデルを通してボトルネック特徴量を抽出する際に、話者性も軽減している。そこで、実施の形態２では、話者性が軽減されたボトルネック特徴量ｂ_ｔに話者特徴量ｗ_ｔを連結した連結特徴量を用いて、マッチング部１５によるセグメント探索を行うことによって、最終的に信号処理装置２０１から出力される強調音声信号を、話者性を含ませたものとすることができる。 In speech recognition, speaker characteristics are unnecessary information, so the feature conversion through the DNN-HMM acoustic model performs a conversion that reduces speaker characteristics. Consequently, when the feature amount conversion unit 14 extracts the bottleneck feature amount through the DNN-HMM acoustic model, speaker characteristics are reduced as well. In Embodiment 2, therefore, the matching unit 15 performs the segment search using a concatenated feature amount obtained by concatenating the speaker feature amount w_t to the bottleneck feature amount b_t, whose speaker characteristics have been reduced, so that the enhanced speech signal finally output from the signal processing device 201 retains speaker characteristics.

このように、実施の形態２では、音響歪みの影響を軽減したボトルネック特徴量と話者性を表現する話者特徴量とを連結して用いることで、マッチング部１５において、音響歪みの影響が軽減し、かつ、話者性を考慮したセグメント探索を行うことが可能になる。 As described above, in Embodiment 2, concatenating the bottleneck feature amount, in which the influence of acoustic distortion is reduced, with the speaker feature amount, which expresses speaker characteristics, enables the matching unit 15 to perform a segment search that is less affected by acoustic distortion and that takes speaker characteristics into account.

［信号処理装置及び事例モデル生成装置の構成について］ [Configuration of the signal processing device and the case model generation device]
なお、この発明は、複数の音響歪み(雑音／残響環境)の事例モデルを考慮する際の時間、及び、マッチング時に時間伸縮について考慮する際の時間は、非特許文献１に記載されているように、拡張可能である。また、事例モデル記憶部１１は、例えば、出願人による特開２０１５−１５２７０４号公報に記載された木構造化構成を適用したセグメントを含む事例モデルＭを記憶していてもよい。この場合、マッチング部１５は、この木構造化構成のセグメントを含む事例モデルＭから、入力信号に対応するセグメントに最も類似したセグメントを探索してもよい。また、マッチング部１５は、例えば、出願人による特開２０１５−１５２７０５号公報に記載されたセグメント評価関数を用いてセグメント探索を行ってもよい。 Note that, as described in Non-Patent Document 1, the present invention can be extended to the case where case models for a plurality of acoustic distortions (noise/reverberation environments) are considered and to the case where time warping is considered during matching. Further, the case model storage unit 11 may store, for example, a case model M including segments to which the tree-structured configuration described in Japanese Patent Application Laid-Open No. 2015-152704 by the applicant is applied. In this case, the matching unit 15 may search the case model M including the tree-structured segments for the segment most similar to the segment corresponding to the input signal. The matching unit 15 may also perform the segment search using, for example, the segment evaluation function described in Japanese Patent Application Laid-Open No. 2015-152705 by the applicant.

［システム構成等］ [System configuration, etc.]
図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部又は一部を、各種の負荷や使用状況等に応じて、任意の単位で機能的又は物理的に分散・統合して構成することができる。例えば、信号処理装置１，２０１及び事例モデル生成装置２は、一体の装置であってもよい。さらに、各装置にて行なわれる各処理機能は、その全部又は任意の一部が、ＣＰＵ及び当該ＣＰＵにて解析実行されるプログラムにて実現され、あるいは、ワイヤードロジックによるハードウェアとして実現され得る。 Each component of each illustrated device is functionally conceptual and does not necessarily need to be physically configured as illustrated. In other words, the specific form of distribution and integration of each device is not limited to the one shown in the figures, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the signal processing devices 1 and 201 and the case model generation device 2 may be a single integrated device. Further, all or any part of each processing function performed in each device can be realized by a CPU and a program that is analyzed and executed by the CPU, or can be realized as hardware using wired logic.

また、本実施形態において説明した各処理のうち、自動的に行われるものとして説明した処理の全部又は一部を手動的におこなうこともでき、あるいは、手動的におこなわれるものとして説明した処理の全部又は一部を公知の方法で自動的におこなうこともできる。また、本実施形態において説明した各処理は、記載の順にしたがって時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されてもよい。この他、上記文書中や図面中で示した処理手順、制御手順、具体的名称、各種のデータやパラメータを含む情報については、特記する場合を除いて任意に変更することができる。 In addition, among the processes described in this embodiment, all or a part of the processes described as being automatically performed can be manually performed, or the processes described as being manually performed can be performed. All or a part can be automatically performed by a known method. In addition, each process described in the present embodiment is not only executed in time series according to the order of description, but may be executed in parallel or individually as required by the processing capability of the apparatus that executes the process. . In addition, the processing procedure, control procedure, specific name, and information including various data and parameters shown in the above-described document and drawings can be arbitrarily changed unless otherwise specified.

[Program]
FIG. 9 is a diagram illustrating an example of a computer that realizes the signal processing device or the learning model generation device by executing a program. The computer 1000 includes, for example, a memory 1010 and a CPU 1020. The computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the signal processing device and the learning model generation device is implemented as the program module 1093, in which code executable by the computer 1000 is written. The program module 1093 is stored, for example, in the hard disk drive 1031. For example, a program module 1093 for executing the same processing as the functional configuration of the signal processing device and the learning model generation device is stored in the hard disk drive 1031. The hard disk drive 1031 may be replaced by an SSD (Solid State Drive).

The setting data used in the processing of the above-described embodiment is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1031. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1031 into the RAM 1012 as necessary and executes them.

Note that the program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1031; they may, for example, be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN, WAN, etc.) and read from that computer by the CPU 1020 via the network interface 1070.

An embodiment to which the invention made by the present inventors is applied has been described above; however, the present invention is not limited by the description and drawings that form part of the disclosure of the present invention according to this embodiment. That is, all other embodiments, examples, operational techniques, and the like made by those skilled in the art on the basis of this embodiment are included in the scope of the present invention.

1, 1P, 201 Signal processing device
2 Case model generation device
11, 11P Case model storage unit
12, 12P Fourier transform unit
13, 13P Feature value generation unit
14 Feature value conversion unit
15, 15P Matching unit
16, 16P Speech enhancement filtering unit
25 Gaussian mixture model learning unit
26 Maximum likelihood Gaussian distribution calculation unit
217 Speaker feature value generation unit
218 Connecting unit

Claims

1. A signal processing method executed by a signal processing device,
the signal processing device including a storage unit that stores a mixture distribution model trained on speech containing noise or acoustic distortion, or on clean speech,
the signal processing method comprising:
a feature value generation step in which the signal processing device generates a first feature value from an input signal;
a feature value conversion step in which the signal processing device converts the first feature value into a second feature value that has been subjected to noise or acoustic distortion reduction processing;
a speaker feature value generation step in which the signal processing device generates a speaker feature value expressing the characteristics of the speaker;
a connecting step in which the signal processing device generates a connected feature value by connecting the second feature value and the speaker feature value;
a matching step in which the signal processing device calculates, based on the parameters of the mixture distribution model stored in the storage unit, a posterior probability indicating the probability that the connected feature value corresponds to each distribution of the mixture distribution model, and obtains the clean speech feature value having the highest posterior probability as the clean speech feature value corresponding to the input signal; and
an output step in which the signal processing device outputs enhanced speech obtained by multiplying the input signal by a filter constructed from the clean speech feature value obtained in the matching step.
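
The matching step of the method above can be illustrated with a minimal Python sketch. It assumes a diagonal-covariance Gaussian mixture and toy parameters; the function names (`log_gaussian_diag`, `posteriors`, `match`) and all numeric values are hypothetical and do not come from the claims. The sketch connects the second feature value and the speaker feature value, computes the posterior probability of each mixture component, and returns the index of the component with the highest posterior. How a clean speech feature value is associated with each component, and how the filter is built from it, is deliberately left out.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Posterior probability of each mixture component given x."""
    log_joint = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp normalisation for numerical stability.
    peak = max(log_joint)
    log_norm = peak + math.log(sum(math.exp(lj - peak) for lj in log_joint))
    return [math.exp(lj - log_norm) for lj in log_joint]


def match(second_feature, speaker_feature, weights, means, variances):
    """Connect the two feature values and pick the most probable component."""
    connected = list(second_feature) + list(speaker_feature)
    post = posteriors(connected, weights, means, variances)
    best = max(range(len(post)), key=post.__getitem__)
    return best, post


# Toy model (assumed values): two components over a 3-dimensional
# connected feature value (2 dimensions of second feature + 1 speaker dimension).
weights = [0.5, 0.5]
means = [[0.0, 0.0, 0.0], [2.0, 2.0, 1.0]]
variances = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

best, post = match([1.9, 2.1], [1.0], weights, means, variances)
```

In this toy case the connected feature value lies close to the mean of the second component, so `best` is 1 and nearly all of the posterior mass falls on that component.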

2. The signal processing method according to claim 1, wherein the reduction processing is processing that obtains a bottleneck feature value from a DNN (Deep Neural Network)-HMM (Hidden Markov Model) acoustic model.

3. The signal processing method according to claim 1 or 2, further comprising:
a learning feature value generation step in which the signal processing device generates a third feature value from an input signal for learning;
a learning feature value conversion step in which the signal processing device generates a fourth feature value by applying the noise or acoustic distortion reduction processing to the third feature value;
a Gaussian mixture model learning step in which the signal processing device obtains a Gaussian mixture distribution model by maximum likelihood estimation using the fourth feature value as learning data;
a maximum likelihood Gaussian distribution calculation step in which the signal processing device obtains, for each time, the index of the Gaussian distribution in the Gaussian mixture distribution model that gives the maximum likelihood, and thereby obtains a time series of the index; and
a storing step in which the signal processing device stores the time series of the index in the storage unit as a parameter of the mixture distribution model.
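
The learning steps of this claim can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: it uses the EM algorithm as the maximum likelihood estimator, a diagonal-covariance model, hand-picked initial means, and the weighted component likelihood when selecting the index for each time. The names `fit_gmm`, `max_likelihood_indices`, and the toy `frames` data are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (a - m) ** 2 / v
                      for a, m, v in zip(x, mean, var))


def fit_gmm(data, init_means, n_iter=20, var_floor=1e-3):
    """Maximum likelihood estimation of a diagonal GMM by EM (assumed method)."""
    k, dim, n = len(init_means), len(data[0]), len(data)
    weights = [1.0 / k] * k
    means = [list(m) for m in init_means]
    variances = [[1.0] * dim for _ in range(k)]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame.
        resp = []
        for x in data:
            lj = [math.log(weights[j]) + log_gauss(x, means[j], variances[j])
                  for j in range(k)]
            peak = max(lj)
            exps = [math.exp(v - peak) for v in lj]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weights, means, and floored variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = [sum(r[j] * x[d] for r, x in zip(resp, data)) / nj
                        for d in range(dim)]
            variances[j] = [
                max(var_floor,
                    sum(r[j] * (x[d] - means[j][d]) ** 2
                        for r, x in zip(resp, data)) / nj)
                for d in range(dim)
            ]
    return weights, means, variances


def max_likelihood_indices(data, weights, means, variances):
    """Index of the best-scoring Gaussian for each time (frame)."""
    return [
        max(range(len(weights)),
            key=lambda j: math.log(weights[j]) + log_gauss(x, means[j], variances[j]))
        for x in data
    ]


# Toy "fourth feature values": three frames near 0, then three frames near 5.
frames = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0],
          [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
weights, means, variances = fit_gmm(frames, init_means=[frames[0], frames[3]])
indices = max_likelihood_indices(frames, weights, means, variances)
```

For this toy data the index time series is `[0, 0, 0, 1, 1, 1]`; it is this kind of sequence that the storing step keeps as a parameter of the mixture distribution model.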

4. A signal processing device comprising:
a storage unit that stores a mixture distribution model trained on speech containing noise or acoustic distortion, or on clean speech;
a feature value generation unit that generates a first feature value from an input signal;
a feature value conversion unit that converts the first feature value into a second feature value that has been subjected to noise or acoustic distortion reduction processing;
a speaker feature value generation unit that generates a speaker feature value expressing the characteristics of the speaker;
a connecting unit that generates a connected feature value by connecting the second feature value and the speaker feature value;
a matching unit that calculates, based on the parameters of the mixture distribution model stored in the storage unit, a posterior probability indicating the probability that the connected feature value corresponds to each distribution of the mixture distribution model, and obtains the clean speech feature value having the highest posterior probability as the clean speech feature value corresponding to the input signal; and
an output unit that outputs enhanced speech obtained by multiplying the input signal by a filter constructed from the clean speech feature value obtained by the matching unit.

5. A signal processing program for causing a signal processing device,
the signal processing device including a storage unit that stores a mixture distribution model trained on speech containing noise or acoustic distortion, or on clean speech,
to execute:
a feature value generation step of generating a first feature value from an input signal;
a feature value conversion step of converting the first feature value into a second feature value that has been subjected to noise or acoustic distortion reduction processing;
a speaker feature value generation step of generating a speaker feature value expressing the characteristics of the speaker;
a connecting step of generating a connected feature value by connecting the second feature value and the speaker feature value;
a matching step of calculating, based on the parameters of the mixture distribution model stored in the storage unit, a posterior probability indicating the probability that the connected feature value corresponds to each distribution of the mixture distribution model, and obtaining the clean speech feature value having the highest posterior probability as the clean speech feature value corresponding to the input signal; and
an output step of outputting enhanced speech obtained by multiplying the input signal by a filter constructed from the clean speech feature value obtained in the matching step.