JP2018028618A

JP2018028618A - Parameter estimation device for mask estimation, parameter estimation method for mask estimation, and parameter estimation program for mask estimation

Info

Publication number: JP2018028618A
Application number: JP2016160668A
Authority: JP
Inventors: Takuya Higuchi; Takuya Yoshioka; Tomohiro Nakatani
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-08-18
Filing date: 2016-08-18
Publication date: 2018-02-22
Anticipated expiration: 2036-08-18
Also published as: JP6517760B2

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of speech recognition with limited learning data when performing speech recognition while adapting to the environment. SOLUTION: In a situation where an acoustic signal of a target sound source and an acoustic signal of noise coexist, a time-frequency analysis unit 11 applies short-time signal analysis to each of a plurality of observation signals to extract an observation signal for each time-frequency point, and constructs an observation vector for each time-frequency point. A mask estimation unit 12 estimates a mask for each acoustic signal on the basis of the observation vector and a parameter for mask estimation. A speech enhancement unit 13 obtains enhanced speech by multiplying the observation vector by the mask for the acoustic signal of the target sound source. A speech recognition unit 14 estimates the phoneme state posterior probability of the enhanced speech using a parameter for speech recognition. A parameter estimation unit 15 estimates the parameter for mask estimation so that a distance criterion between the phoneme state posterior probability and a reference label is minimized. SELECTED DRAWING: Figure 1

Description

The present invention relates to a mask estimation parameter estimation apparatus, a mask estimation parameter estimation method, and a mask estimation parameter estimation program.

Conventionally, known methods of performing speech recognition while adapting to the environment include performing speech enhancement as pre-processing for speech recognition (see, for example, Non-Patent Document 1) and adapting the speech recognizer itself to the environment (see, for example, Non-Patent Document 2).

Non-Patent Document 1: N. Ito, S. Araki, T. Yoshioka, and T. Nakatani, "Relaxed disjointness based clustering for joint blind source separation and dereverberation," in Proc. Int. Worksh. Acoust. Echo, Noise Contr., pp. 268-272, 2014.
Non-Patent Document 2: D. Yu and L. Deng, "Automatic speech recognition: a deep learning approach," Springer, 2015.
Non-Patent Document 3: T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto, C. Yu, W. J. Fabian, M. Espi, T. Higuchi, S. Araki, and T. Nakatani, "The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices," in Proc. Worksh. Automat. Speech Recognition, Understanding, pp. 436-443, 2015.

However, conventional methods of performing speech recognition while adapting to the environment have the problem that a large amount of learning data may be required to improve recognition accuracy. For example, the method described in Non-Patent Document 2 improves recognition accuracy by adapting the parameters of the speech recognizer to the environment. Because the speech recognizer computes phoneme state posterior probabilities by performing linear and nonlinear operations multiple times, it consists of many parameters, and adapting all of them to the environment may require a large amount of learning data.

The mask estimation parameter estimation apparatus of the present invention includes: a time-frequency analysis unit that, in a situation where N acoustic signals coexist, consisting of one first acoustic signal corresponding to a target sound source and N−1 second acoustic signals corresponding to noise (where N is an integer of 2 or more), applies short-time signal analysis to each of M observation signals recorded at different positions (where M is an integer of 2 or more) to extract an observation signal for each time-frequency point, and constructs an observation vector that is an M-dimensional column vector of the observation signals at each time-frequency point; a mask estimation unit that, based on the observation vector and a parameter for mask estimation, estimates a mask indicating the proportion in which each of the N acoustic signals is contained in the observation vector at each time-frequency point; a speech enhancement unit that obtains enhanced speech by multiplying the observation vector by the mask for the first acoustic signal at each time-frequency point; a speech recognition unit that, using a parameter for speech recognition learned in advance from learning data, estimates a phoneme state posterior probability indicating which phoneme state the enhanced speech is likely to be in at each time; and a parameter estimation unit that estimates the parameter for mask estimation so that a predetermined distance criterion between the phoneme state posterior probability and an externally input reference label of the phoneme state is minimized.

The mask estimation parameter estimation method of the present invention is a method executed by a mask estimation parameter estimation apparatus, and includes: a time-frequency analysis step of, in a situation where N acoustic signals coexist, consisting of one first acoustic signal corresponding to a target sound source and N−1 second acoustic signals corresponding to noise, applying short-time signal analysis to each of M observation signals recorded at different positions to extract an observation signal for each time-frequency point, and constructing an observation vector that is an M-dimensional column vector of the observation signals at each time-frequency point; a mask estimation step of estimating, based on the observation vector and a parameter for mask estimation, a mask indicating the proportion in which each of the N acoustic signals is contained in the observation vector at each time-frequency point; a speech enhancement step of obtaining enhanced speech by multiplying the observation vector by the mask for the first acoustic signal at each time-frequency point; a speech recognition step of estimating, using a parameter for speech recognition learned in advance from learning data, a phoneme state posterior probability indicating which phoneme state the enhanced speech is likely to be in at each time; and a parameter estimation step of estimating the parameter for mask estimation so that a predetermined distance criterion between the phoneme state posterior probability and an externally input reference label of the phoneme state is minimized.

According to the present invention, when performing speech recognition while adapting to the environment, the accuracy of speech recognition can be improved with limited learning data.

FIG. 1 is a diagram illustrating an example of the configuration of the mask estimation parameter estimation apparatus according to the first embodiment.
FIG. 2 is a diagram illustrating an example of the configuration of the parameter estimation unit of the mask estimation parameter estimation apparatus according to the first embodiment.
FIG. 3 is a flowchart showing the processing flow of the mask estimation parameter estimation apparatus according to the first embodiment.
FIG. 4 is a flowchart showing the processing flow of the parameter estimation unit of the mask estimation parameter estimation apparatus according to the first embodiment.
FIG. 5 is a diagram illustrating an example of a computer that realizes the mask estimation parameter estimation apparatus by executing a program.

Hereinafter, embodiments of the mask estimation parameter estimation apparatus, the mask estimation parameter estimation method, and the mask estimation parameter estimation program according to the present application will be described in detail with reference to the drawings. The present invention is not limited by these embodiments.

[Configuration of First Embodiment]
First, the configuration of the mask estimation parameter estimation apparatus according to the first embodiment will be described with reference to FIG. 1. FIG. 1 is a diagram illustrating an example of the configuration of the mask estimation parameter estimation apparatus according to the first embodiment. As illustrated in FIG. 1, the mask estimation parameter estimation apparatus 10 includes a time-frequency analysis unit 11, a mask estimation unit 12, a speech enhancement unit 13, a speech recognition unit 14, and a parameter estimation unit 15.

In the first embodiment, the parameter estimation unit 15 treats the mask estimation unit 12, the speech enhancement unit 13, and the speech recognition unit 14 as a single calculation network. That is, the parameter estimation unit 15 estimates the parameter for mask estimation based on the output of the calculation network formed by the mask estimation unit 12, the speech enhancement unit 13, and the speech recognition unit 14. Each processing unit is described below.

In a situation where N acoustic signals coexist, consisting of one first acoustic signal corresponding to the target sound source and N−1 second acoustic signals corresponding to noise, the time-frequency analysis unit 11 applies short-time signal analysis to each of M observation signals recorded at different positions to extract an observation signal for each time-frequency point, and constructs an observation vector that is an M-dimensional column vector of the observation signals at each time-frequency point. The noise includes interfering sounds and background noise.

Let y_{f,t} denote the observation feature vector obtained by short-time signal analysis such as the short-time Fourier transform, where t and f are the time and frequency indices, t being an integer from 1 to T and f an integer from 0 to F. Assuming that the target sound source and the noise are sparse, so that at most one sound source is present at each time-frequency point, it is known that the observation vector y_{f,t} at each time-frequency point can be modeled in the form of either Expression (1) or Expression (2) (see, for example, Non-Patent Document 1).

Here, Expression (1) represents the case where only the target sound source is present at the time-frequency point, and Expression (2) represents the case where only the n-th noise is present. s(f,t) is the time-frequency component of the first acoustic signal, that is, the acoustic signal corresponding to the target sound source. v_n(f,t) is the time-frequency component of the n-th of the N−1 second acoustic signals, that is, the acoustic signal corresponding to the n-th noise.

Based on the observation vector and the parameter for mask estimation, the mask estimation unit 12 estimates a mask indicating the proportion in which each of the N acoustic signals is contained in the observation vector y_{f,t} at each time-frequency point.

Specifically, for each frequency, the mask estimation unit 12 models the probability distribution of the observation vector y_{f,t} as a mixture distribution composed of N element distributions, one for each of the N acoustic signals, and estimates the posterior probabilities of the element distributions as the masks corresponding to the respective N acoustic signals.

The mask estimation unit 12 first expresses the probability distribution of the observation vector y_{f,t} at each time-frequency point by Expression (3).

Here, Θ^{(s)} is the parameter of the element distribution corresponding to the target sound source, and Θ^{(v_n)} is the parameter of the element distribution corresponding to the n-th noise. α_f^{(s)} is the weight parameter of the element distribution corresponding to the target sound source at frequency f, and α_f^{(v_n)} is the weight parameter of the element distribution corresponding to the n-th noise at frequency f. α_f^{(s)} and α_f^{(v_n)} satisfy Expression (4).

The mask estimation unit 12 then calculates the mask λ_{f,t}^{(s)} corresponding to the target sound source at the time-frequency point (f,t) by Expression (5).
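Expressions (3) to (5) are not reproduced in this text. As a hedged reconstruction based only on the definitions given above, a weighted mixture and its posterior-based mask would take the following general form. The notation p(y | Θ) for an element distribution is an assumption introduced here for illustration, and the exact form in the original drawings may differ.

```latex
% Sketch of (3): mixture of N element distributions at frequency f
p(y_{f,t}) = \alpha_f^{(s)} \, p\!\left(y_{f,t} \mid \Theta^{(s)}\right)
           + \sum_{n=1}^{N-1} \alpha_f^{(v_n)} \, p\!\left(y_{f,t} \mid \Theta^{(v_n)}\right)

% Sketch of (4): the mixture weights sum to one
\alpha_f^{(s)} + \sum_{n=1}^{N-1} \alpha_f^{(v_n)} = 1

% Sketch of (5): the target mask is the posterior of the target element distribution
\lambda_{f,t}^{(s)} =
  \frac{\alpha_f^{(s)} \, p\!\left(y_{f,t} \mid \Theta^{(s)}\right)}
       {\alpha_f^{(s)} \, p\!\left(y_{f,t} \mid \Theta^{(s)}\right)
        + \sum_{n=1}^{N-1} \alpha_f^{(v_n)} \, p\!\left(y_{f,t} \mid \Theta^{(v_n)}\right)}
```

Under this reading, the masks of all N acoustic signals at a given time-frequency point are non-negative and sum to one.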

Expression (5) can be interpreted as a calculation network that takes the observation vector y_{f,t} as input and estimates the mask λ_{f,t}^{(s)} using the internal parameters Θ^{(s)}, Θ^{(v_1)}, ..., Θ^{(v_{N−1})}, α_f^{(s)}, α_f^{(v_1)}, ..., α_f^{(v_{N−1})}. The operation of this calculation network is expressed by Expression (6), where ν is s or v_n.

The speech enhancement unit 13 obtains enhanced speech by multiplying the observation vector y_{f,t} by the mask for the first acoustic signal at each time-frequency point. For example, as shown in Expression (7), the speech enhancement unit 13 obtains the enhanced speech by multiplying, at each time-frequency point, the component y_{f,t}^{(m')} of the observation vector corresponding to the m'-th of the M observation signals by the mask λ_{f,t}^{(s)}.

Here, the operation of Expression (7) is written as Expression (8).

Using the parameters for speech recognition learned in advance from learning data, the speech recognition unit 14 estimates a phoneme state posterior probability indicating which phoneme state the enhanced speech is likely to be in at each time. The calculation of the phoneme state posterior probability by the speech recognition unit 14 is written as Expression (9). ^I_t is a vector in which the phoneme state posterior probabilities corresponding to the I phoneme states at time t are arranged. In the following description, ^a denotes the symbol a with a circumflex (^) placed above it.

From Expressions (6), (8), and (9), the process of estimating the phoneme state posterior probability from the observation vector can be described as a single calculation network. By holding the parameters and structure of this calculation network, the parameter estimation unit 15 estimates the mask estimation parameter so that a predetermined distance criterion between the phoneme state posterior probability and an externally input reference label of the phoneme state is minimized.

As a result, the mask estimation parameter estimation apparatus 10 of the first embodiment can optimize the parameter for mask estimation to suit the environment. Moreover, the parameters optimized by the mask estimation parameter estimation apparatus 10 are parameters that were originally estimated from a small amount of data under a likelihood maximization criterion. They are therefore relatively few in number, and parameter estimation can be performed while preventing overfitting even when only a small amount of data is used.

[Example]
The processing of the mask estimation parameter estimation apparatus 10 will be described based on an example. In the example, it is assumed that an acoustic signal emitted from one target sound source is recorded by M microphones in the presence of noise. Letting y^{(m)}(τ) denote the observation signal recorded by microphone m, y^{(m)}(τ) is expressed, as shown in Expression (10), as the sum of the acoustic signal s^{(m)}(τ) corresponding to the target sound source and the acoustic signal v^{(m)}(τ) corresponding to noise.

The time-frequency analysis unit 11 receives the observation signals recorded by all the microphones and applies short-time signal analysis to each observation signal y^{(m)}(τ) to obtain a signal feature Y^{(m)}(f,t) for each time-frequency point. As the short-time signal analysis, the time-frequency analysis unit 11 can use a technique such as the short-time discrete Fourier transform or the short-time discrete cosine transform. The time-frequency analysis unit 11 further constructs, as in Expression (11), the observation vector y_{f,t}, which collects the signals Y^{(m)}(f,t) obtained at each time-frequency point over all microphones.
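The construction of the observation vectors can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the apparatus: the function names `stft` and `observation_vectors`, the Hann window, the frame length of 256, and the hop of 128 are all hypothetical choices, and the input is synthetic noise standing in for M = 3 microphone signals.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Short-time DFT of one 1-D signal; returns an array of shape (F, T)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )  # shape (frame_len, n_frames)
    return np.fft.rfft(frames, axis=0)

def observation_vectors(signals, frame_len=256, hop=128):
    """Stack the per-microphone spectra so that y[f, t] is an M-dimensional vector."""
    return np.stack([stft(x, frame_len, hop) for x in signals], axis=-1)

rng = np.random.default_rng(0)
signals = rng.standard_normal((3, 4000))   # synthetic stand-in for M = 3 microphones
y = observation_vectors(signals)
print(y.shape)                             # (frequency bins, frames, microphones)
```

Each `y[f, t]` then plays the role of the M-dimensional observation vector y_{f,t} at one time-frequency point.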

In the example, to simplify the notation without loss of generality, the weight parameters of the element distributions are set to α^{(s)} = α^{(v_1)} = ... = α^{(v_{N−1})} = 1/N. The mask estimation unit 12 models the observation vector at each time-frequency point by a mixture of two normal distributions, one for the target sound source and one for the noise. Given the distribution parameter φ_{f,t}^{(ν)} and the mask estimation parameter R_f^{(ν)}, the mask estimation unit 12 expresses the posterior probability corresponding to each normal distribution by Expression (12).

Here, using the update rule for the parameter φ_{f,t}^{(ν)} described in Non-Patent Document 1, Expression (12) can be rewritten as Expression (13).

From Expressions (5) and (13), the mask estimation unit 12 calculates the mask λ_{f,t}^{(s)} for the target sound source as in Expression (14).

Here, p_{f,t}^{(s)} and p_{f,t}^{(v)} are given by Expressions (15) and (16), respectively.

The operation of the mask estimation unit 12 can be interpreted as a network that calculates the mask corresponding to the target sound source from the observation vector with R_f^{(ν)} as an internal parameter. The operation of this calculation network is expressed by Expression (17), which corresponds to Expression (6).

As one example, the mask estimation unit 12 models the probability distribution of the observation vector y_{f,t} as a mixture distribution of N zero-mean M-dimensional complex Gaussian distributions, each of whose covariance matrix is expressed as the product of a scalar parameter that takes a different value at each time and a Hermitian matrix whose elements are time-invariant parameters. For example, φ_{f,t}^{(ν)} in Expressions (10) and (11) can serve as the scalar parameter that takes a different value at each time, and R_f^{(ν)} as the time-invariant parameter.
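The posterior computation for such a mixture can be sketched as below. This is a hedged illustration, not the patent's Expressions (12) to (16): it assumes equal weights 1/N, uses the standard density of a zero-mean complex Gaussian with covariance φR, and takes φ as a given input rather than applying the update rule of Non-Patent Document 1. The names `log_density` and `estimate_masks` and the test data are hypothetical.

```python
import numpy as np

def log_density(y, phi, R):
    """Log density of a zero-mean complex Gaussian with covariance phi[t] * R.

    y: (T, M) observation vectors at one frequency, phi: (T,), R: (M, M) Hermitian.
    """
    M = y.shape[-1]
    R_inv = np.linalg.inv(R)
    # Quadratic form y^H R^{-1} y for every frame t.
    quad = np.real(np.einsum("tm,mn,tn->t", y.conj(), R_inv, y))
    _, logdet = np.linalg.slogdet(R)
    # log det(phi * R) = M * log(phi) + log det(R)
    return -M * np.log(np.pi) - M * np.log(phi) - logdet - quad / phi

def estimate_masks(y, phis, Rs):
    """Posterior of each element distribution (equal weights); returns (N, T)."""
    log_p = np.stack([log_density(y, phi, R) for phi, R in zip(phis, Rs)])
    log_p = log_p - log_p.max(axis=0, keepdims=True)  # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
T, M = 5, 2
y = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
phis = [np.ones(T), np.ones(T)]
Rs = [np.eye(M, dtype=complex), 2.0 * np.eye(M, dtype=complex)]
lam = estimate_masks(y, phis, Rs)
print(lam[0])  # mask for the first (target) element distribution
```

Working in the log domain and subtracting the per-frame maximum before exponentiating is a design choice made here to avoid underflow; it does not change the resulting posteriors.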

The speech enhancement unit 13 receives the mask and, using Expression (18), calculates the enhanced speech ^s_{f,t} by multiplying the component y_{f,t}^{(m')} recorded by the m'-th microphone, which serves as the reference microphone, by the mask λ_{f,t}^{(s)}. The speech enhancement unit 13 may instead calculate the enhanced speech ^s_{f,t} by multiplying y_{f,t}^{(m')} by the mask λ_{f,t}^{(s)} raised to the power β.
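The masking step described above can be sketched in a few lines. The function name `enhance`, the argument names, and the array layout (frequency, time, microphone) are assumptions made for illustration only.

```python
import numpy as np

def enhance(y, mask, ref_mic=0, beta=1.0):
    """Multiply the reference-microphone component by the mask raised to beta.

    y: (F, T, M) observation vectors, mask: (F, T) target mask in [0, 1].
    """
    return (mask ** beta) * y[..., ref_mic]

y = np.full((2, 3, 2), 1.0 + 1.0j)   # toy observation vectors
mask = np.full((2, 3), 0.5)          # toy target mask
s_hat = enhance(y, mask)             # plain mask multiplication
s_hat_sharp = enhance(y, mask, beta=2.0)  # mask raised to the power beta
print(s_hat[0, 0], s_hat_sharp[0, 0])
```

With a mask in [0, 1], choosing β greater than 1 attenuates time-frequency points with low target probability more strongly.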

Based on the vector ^s_t = [^s_{1,t}, ..., ^s_{F,t}] in which the enhanced speech at each frequency is arranged, the speech recognition unit 14 repeats linear and nonlinear operations multiple times using the parameters for speech recognition learned in advance, and calculates the phoneme state posterior probabilities ^I_t = [^I_{1,t}, ..., ^I_{K,t}] for each time. This operation of the speech recognition unit 14 is written as Expression (19).

From Expressions (17) to (19), the mask estimation unit 12, the speech enhancement unit 13, and the speech recognition unit 14 can be interpreted as a single calculation network that takes the observation vector as input and outputs the phoneme state posterior probability. The estimation of the parameter for mask estimation by the parameter estimation unit 15 will now be described with reference to FIG. 2. FIG. 2 is a diagram illustrating an example of the configuration of the parameter estimation unit of the mask estimation parameter estimation apparatus according to the first embodiment. As shown in FIG. 2, the parameter estimation unit 15 includes a mask estimation parameter initialization unit 151, a gradient calculation unit 152, a parameter holding unit 153, a parameter update unit 154, and a convergence determination unit 155.

In the example, the recognition result obtained when speech recognition is performed without speech enhancement is used as a binary reference label. In the parameter estimation unit 15, the objective function for the parameter update can then be defined, as in Expression (20), as the cross entropy between the phoneme state posterior probability ^I_t and the reference label I_t = [I_{1,t}, ..., I_{K,t}].
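A cross-entropy objective of this kind can be sketched as follows. This assumes the usual definition, the negative sum of reference labels times log posteriors, which may differ in detail from Expression (20); the function name `cross_entropy`, the small constant `eps`, and the toy arrays are hypothetical.

```python
import numpy as np

def cross_entropy(ref, post, eps=1e-12):
    """Cross entropy between binary reference labels and posteriors.

    ref, post: arrays of shape (T, K); each row of post sums to one.
    eps guards against log(0) and is an implementation assumption.
    """
    return -np.sum(ref * np.log(post + eps))

ref = np.array([[1.0, 0.0], [0.0, 1.0]])       # one-hot reference labels I_t
post = np.array([[0.5, 0.5], [0.25, 0.75]])    # posteriors ^I_t
loss = cross_entropy(ref, post)
print(loss)
```

With one-hot labels, only the posterior assigned to the reference phoneme state at each time contributes to the loss.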

The cross entropy of Expression (20) is one example of the distance criterion minimized by the parameter estimation unit 15. The mask estimation parameter initialization unit 151 determines the initial value of the mask estimation parameter R_f^{(ν)} and the learning rate γ. The mask estimation parameter R_f^{(ν)} includes R_f^{(s)} and R_f^{(v)}. The mask estimation parameter initialization unit 151 may set the initial value of the mask estimation parameter R_f^{(ν)} to the identity matrix, or may obtain it by the likelihood maximization criterion described in Non-Patent Document 1. The parameter holding unit 153 holds the parameters of the speech enhancement unit 13 and the parameters of the speech recognition unit 14.

The parameter update unit 154 updates the mask estimation parameter R_f^{(ν)} by Expression (21), based on the principle of the steepest descent method. In this case, what is actually updated is the inverse matrix of the mask estimation parameter R_f^{(ν)}.

As described above, the mask estimation unit 12, the speech enhancement unit 13, and the speech recognition unit 14 can be interpreted as a single calculation network. The gradient calculation unit 152 therefore receives the phoneme state posterior probability ^I_t, the reference label I_t, and the parameters held by the parameter holding unit 153, and calculates the gradient ∂L(I_t, ^I_t)/∂{R_f^{(ν)−1}}^* in Expression (21) using the chain rule, as in Expression (22).
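Expressions (21) and (22) are not reproduced in this text. As a hedged sketch inferred from the surrounding description (a steepest-descent step on the inverse matrix with learning rate γ, and a chain rule through the enhanced speech, the mask, and the element densities), they may have roughly the following shape. The exact indices, any summation over frames or element distributions, and any additional conjugate terms arising from complex differentiation are assumptions and may differ from the original.

```latex
% Sketch of (21): steepest-descent update of the inverse matrix
\left\{R_f^{(\nu)}\right\}^{-1} \leftarrow \left\{R_f^{(\nu)}\right\}^{-1}
  - \gamma \, \frac{\partial L(I_t, \hat{I}_t)}
                   {\partial \left\{ \{R_f^{(\nu)}\}^{-1} \right\}^{*}}

% Sketch of (22): chain rule through the calculation network
\frac{\partial L(I_t, \hat{I}_t)}{\partial \left\{ \{R_f^{(\nu)}\}^{-1} \right\}^{*}}
  = \frac{\partial L(I_t, \hat{I}_t)}{\partial \hat{s}_{f,t}}
    \cdot \frac{\partial \hat{s}_{f,t}}{\partial \lambda_{f,t}^{(s)}}
    \cdot \frac{\partial \lambda_{f,t}^{(s)}}{\partial p_{f,t}^{(\nu)}}
    \cdot \frac{\partial p_{f,t}^{(\nu)}}{\partial \left\{ \{R_f^{(\nu)}\}^{-1} \right\}^{*}}
```

Each factor corresponds to one stage of the network: the recognizer, the mask multiplication, the posterior computation, and the element density.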

The convergence determination unit 155 determines whether the objective function has converged as a result of the update by the parameter update unit 154. When the convergence determination unit 155 determines that it has converged, the parameter estimation unit 15 outputs the estimated mask estimation parameter. When the convergence determination unit 155 determines that it has not converged, the parameter estimation unit 15 repeats the processing by the gradient calculation unit 152 and the parameter update unit 154 using the updated mask estimation parameter. The convergence determination unit 155 may also determine that the processing has converged once it has been repeated a predetermined number of times.
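The iterate-until-converged structure described above can be sketched generically. This is not the patent's update on the complex inverse matrix: it is a real-valued toy with a quadratic objective, and the names `steepest_descent`, `lr` (playing the role of γ), `tol`, and `max_iter` are hypothetical.

```python
import numpy as np

def steepest_descent(loss_fn, grad_fn, theta0, lr=0.1, tol=1e-8, max_iter=1000):
    """Repeat gradient steps until the objective stops changing or max_iter is hit."""
    theta = np.asarray(theta0, dtype=float)
    prev = loss_fn(theta)
    it = 0
    for it in range(1, max_iter + 1):
        theta = theta - lr * grad_fn(theta)   # parameter update step
        cur = loss_fn(theta)
        if abs(prev - cur) < tol:             # convergence determination
            break
        prev = cur
    return theta, it

# Toy objective standing in for the cross entropy: squared distance to a target.
target = np.array([1.0, -2.0])
loss_fn = lambda th: float(np.sum((th - target) ** 2))
grad_fn = lambda th: 2.0 * (th - target)

theta, n_iter = steepest_descent(loss_fn, grad_fn, np.zeros(2))
print(theta, n_iter)
```

The `max_iter` bound mirrors the option of declaring convergence after a predetermined number of repetitions.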

In this way, the parameter estimation unit 15 can obtain a mask estimation parameter that locally minimizes the cross entropy between the phoneme state posterior probability ^I_t and the reference label I_t.

When the speech recognition unit 14 is configured as a neural network, the gradient calculation unit 152 can calculate the gradient ∂L(I_t, ^I_t)/∂^s_{f,t} in Expression (22) using a backpropagation-based method of the kind used to estimate the parameters of that neural network. For example, the gradient calculation unit 152 calculates the gradient ∂^s_{f,t}/∂λ_{f,t}^{(x)} as Expression (23), based on Expression (18).

Also, for example, the gradient calculation unit 152 calculates the gradient ∂λ_{f,t}^{(x)}/∂p_{f,t}^{(ν)} as Expression (24) or (25), based on Expression (14).

また、例えば、勾配計算部１５２は、勾配∂ｐ_ｆ，ｔ ^（ν）／∂｛Ｒ_ｆ ^{（ν）−１}｝^＊を、式（１５）および（１６）に基づき、式（２６）として計算する。 Also, for example, the gradient calculation unit 152 calculates the gradient ∂p_{f,t}^(ν)/∂{R_f^{(ν)−1}}^* as equation (26), based on equations (15) and (16).

［第１の実施形態の処理］ [Processing of the First Embodiment]
図３を用いて、マスク推定用パラメータ推定装置１０の処理の流れについて説明する。図３は、第１の実施形態に係るマスク推定用パラメータ推定装置の処理の流れを示すフローチャートである。 The processing flow of the mask estimation parameter estimation apparatus 10 will be described with reference to FIG. 3. FIG. 3 is a flowchart showing the processing flow of the mask estimation parameter estimation apparatus according to the first embodiment.

図３に示すように、まず、マスク推定用パラメータ推定装置１０の時間周波数分析部１１は、目的音源と雑音に対応した音響信号に対し時間周波数分析を行い、観測ベクトルを取得する（ステップＳ１１）。次に、マスク推定部１２は、マスク推定用パラメータと観測ベクトルとを基に、音声強調のためのマスクを推定する（ステップＳ１２）。 As shown in FIG. 3, first, the time-frequency analysis unit 11 of the mask estimation parameter estimation apparatus 10 performs time-frequency analysis on the acoustic signals corresponding to the target sound source and noise, and obtains observation vectors (step S11). Next, the mask estimation unit 12 estimates a mask for speech enhancement based on the mask estimation parameters and the observation vectors (step S12).

音声強調部１３は、観測ベクトルと、マスク推定部１２によって推定されたマスクとを掛け合わせ、強調音声を取得する（ステップＳ１３）。そして、音声認識部１４は、強調音声と音声認識用のパラメータとを用いて、音声認識を行う（ステップＳ１４）。そして、パラメータ推定部１５は、マスク推定部１２、音声強調部１３および音声認識部１４を１つの計算ネットワークとし、音声認識結果が参照ラベルに近くなるようにマスク推定用パラメータの推定を行う（ステップＳ１５）。 The speech enhancement unit 13 multiplies the observation vector by the mask estimated by the mask estimation unit 12 to obtain enhanced speech (step S13). The speech recognition unit 14 then performs speech recognition using the enhanced speech and the speech recognition parameters (step S14). Finally, the parameter estimation unit 15 treats the mask estimation unit 12, the speech enhancement unit 13, and the speech recognition unit 14 as a single computational network, and estimates the mask estimation parameters so that the speech recognition result approaches the reference label (step S15).
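Step S13 can be pictured with the following minimal sketch, which scales the observation vector at every time-frequency point by the target-source mask. The data layout (nested lists indexed by time, frequency, and microphone) and the name `enhance` are assumptions made for illustration. This passage does not state whether the enhanced speech keeps all M channels, so the sketch simply scales every channel.

```python
def enhance(observations, target_mask):
    """Multiply each observation vector by the target-source mask.

    observations[t][f]: list of M complex short-time spectrum values
                        (one per microphone) at frame t, frequency f.
    target_mask[t][f]:  mask value for the target source, assumed in [0, 1].
    """
    return [
        [
            [target_mask[t][f] * y for y in observations[t][f]]
            for f in range(len(observations[t]))
        ]
        for t in range(len(observations))
    ]


# Toy input: T = 1 frame, F = 2 frequency bins, M = 2 microphones.
obs = [[[1 + 1j, 2 - 1j], [0.5j, -1 + 0j]]]
mask = [[0.9, 0.1]]
enhanced = enhance(obs, mask)
```

A mask close to 1 passes the point through almost unchanged, while a mask close to 0 suppresses it as noise-dominated.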

次に、図４を用いて、パラメータ推定部１５の処理について説明する。図４は、第１の実施形態に係るマスク推定用パラメータ推定装置のパラメータ推定部の処理の流れを示すフローチャートである。図４に示すように、パラメータ推定部１５のマスク推定用パラメータ初期化部１５１は、マスク推定用パラメータの初期値を決定する（ステップＳ１５１）。次に、勾配計算部１５２は、音声状態事後確率と、参照ラベルと、音声強調部１３および音声認識部１４のパラメータとを受け取り、音声状態事後確率と参照ラベルとの間の距離基準の勾配を計算する（ステップＳ１５２）。パラメータ更新部１５４は、距離基準が小さくなるようにマスク推定用のパラメータを更新する（ステップＳ１５３）。 Next, the processing of the parameter estimation unit 15 will be described with reference to FIG. 4. FIG. 4 is a flowchart showing the processing flow of the parameter estimation unit of the mask estimation parameter estimation apparatus according to the first embodiment. As shown in FIG. 4, the mask estimation parameter initialization unit 151 of the parameter estimation unit 15 determines initial values of the mask estimation parameters (step S151). Next, the gradient calculation unit 152 receives the phoneme state posterior probability, the reference label, and the parameters of the speech enhancement unit 13 and the speech recognition unit 14, and calculates the gradient of the distance criterion between the phoneme state posterior probability and the reference label (step S152). The parameter update unit 154 then updates the mask estimation parameters so that the distance criterion becomes smaller (step S153).

収束判定部１５５がマスク推定用のパラメータが収束したと判定した場合（ステップＳ１５４、Ｙｅｓ）、パラメータ推定部１５は処理を終了する。また、収束判定部１５５がマスク推定用のパラメータが収束していないと判定した場合（ステップＳ１５４、Ｎｏ）、パラメータ推定部１５は、処理をステップＳ１５２に戻し、更新したマスク推定用パラメータを用い、勾配計算部１５２およびパラメータ更新部１５４による処理をさらに繰り返す。 When the convergence determination unit 155 determines that the mask estimation parameters have converged (step S154, Yes), the parameter estimation unit 15 ends the process. When the convergence determination unit 155 determines that the mask estimation parameters have not converged (step S154, No), the parameter estimation unit 15 returns the process to step S152 and, using the updated mask estimation parameters, repeats the processing by the gradient calculation unit 152 and the parameter update unit 154.
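The loop of steps S151 to S154 can be sketched as a plain gradient-descent iteration. This is only an illustration under simplifying assumptions: the parameters are a flat list of real numbers rather than the complex matrices of the embodiment, the gradient and objective are supplied as callables, and convergence is declared either when the objective changes by less than a tolerance or when a fixed number of iterations is reached. All names (`estimate_parameters`, `tolerance`, and so on) are hypothetical.

```python
def estimate_parameters(initial, gradient, objective, learning_rate,
                        max_iterations=30, tolerance=1e-8):
    """Iterate gradient steps until the objective stops changing.

    Returns the final parameters and the number of iterations performed.
    """
    params = list(initial)                       # S151: initial values
    previous = objective(params)
    for iteration in range(1, max_iterations + 1):
        grad = gradient(params)                  # S152: gradient of the criterion
        params = [p - learning_rate * g          # S153: move against the gradient
                  for p, g in zip(params, grad)]
        current = objective(params)
        if abs(previous - current) < tolerance:  # S154: converged?
            return params, iteration
        previous = current
    # Reaching the iteration limit is also treated as convergence.
    return params, max_iterations


# Toy objective: squared distance to a fixed target.
target = [1.0, -2.0]
toy_objective = lambda p: sum((a - b) ** 2 for a, b in zip(p, target))
toy_gradient = lambda p: [2.0 * (a - b) for a, b in zip(p, target)]
result, n_iter = estimate_parameters([0.0, 0.0], toy_gradient, toy_objective, 0.25)
```

The default of 30 iterations mirrors the number of update-rule iterations used in the confirmation experiment; the tolerance value is an arbitrary illustrative choice.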

［第１の実施形態の効果］ [Effects of the First Embodiment]
時間周波数分析部１１は、目的音源に対応する１個の第１の音響信号と、雑音に対応するＮ−１個の第２の音響信号と、を含んだＮ個の音響信号が混在する状況において、それぞれ異なる位置で収録されたＭ個の観測信号のそれぞれに短時間信号分析を適用して時間周波数点ごとの観測信号を抽出し、時間周波数点ごとの観測信号のＭ次元縦ベクトルである観測ベクトルを構成する。また、マスク推定部１２は、観測ベクトルとマスク推定用のパラメータとに基づいて、Ｎ個の音響信号のそれぞれが、時間周波数点ごとに、観測ベクトルにどの程度の割合で含まれているかを表すマスクを推定する。また、音声強調部１３は、観測ベクトルと第１の音響信号についてのマスクとを、時間周波数点のそれぞれにおいて掛け合わせることで強調音声を取得する。また、音声認識部１４は、学習データを用いて事前に学習した音声認識用のパラメータを用いて、強調音声が各時刻においてどの音素状態であるらしいかを表す音素状態事後確率を推定する。また、パラメータ推定部１５は、音素状態事後確率と外部から入力された音素状態の参照ラベルとの間の所定の距離基準が最小化されるようにマスク推定用のパラメータを推定する。 In a situation where N acoustic signals are mixed, comprising one first acoustic signal corresponding to the target sound source and N−1 second acoustic signals corresponding to noise, the time-frequency analysis unit 11 applies short-time signal analysis to each of M observation signals recorded at different positions to extract an observation signal for each time-frequency point, and constructs an observation vector that is an M-dimensional column vector of the observation signals for each time-frequency point. The mask estimation unit 12 estimates, based on the observation vector and the mask estimation parameters, a mask representing the proportion in which each of the N acoustic signals is contained in the observation vector at each time-frequency point. The speech enhancement unit 13 obtains enhanced speech by multiplying the observation vector by the mask for the first acoustic signal at each time-frequency point. The speech recognition unit 14 estimates, using speech recognition parameters learned in advance from learning data, a phoneme state posterior probability indicating which phoneme state the enhanced speech is likely to be in at each time. The parameter estimation unit 15 estimates the mask estimation parameters so that a predetermined distance criterion between the phoneme state posterior probability and an externally input reference label of the phoneme state is minimized.

このように、第１の実施形態では、音声認識部１４による音声認識結果を、マスク推定用のパラメータの推定に反映させ、また、音声認識部１４のパラメータを更新する必要がないので、環境に適応しながら音声認識を行う際に、限られた学習データで音声認識の精度を向上させることができる。 Thus, in the first embodiment, the speech recognition result from the speech recognition unit 14 is reflected in the estimation of the mask estimation parameters, and the parameters of the speech recognition unit 14 do not need to be updated. Consequently, when speech recognition is performed while adapting to the environment, the accuracy of speech recognition can be improved with limited learning data.

マスク推定部１２は、周波数ごとに、観測ベクトルの確率分布を、Ｎ個の音響信号のそれぞれに対応するＮ個の要素分布からなる混合分布でモデル化し、要素分布の事後確率を、Ｎ個の音響信号のそれぞれに対応するマスクとして推定してもよい。これにより、目的音源および雑音に対応した音響信号のそれぞれに対し、マスクの推定を行うことが可能となる。 For each frequency, the mask estimation unit 12 may model the probability distribution of the observation vector by a mixture distribution composed of N component distributions corresponding respectively to the N acoustic signals, and may estimate the posterior probability of each component distribution as the mask corresponding to each of the N acoustic signals. This makes it possible to estimate a mask for each of the acoustic signals corresponding to the target sound source and the noise.
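Treating component posteriors as masks can be illustrated as follows. The sketch assumes that, for one time-frequency point, the log of each component's likelihood (with any mixture weight already folded in, which is an assumption since this passage does not mention weights) is available as a plain list. The function name `posterior_masks` is hypothetical.

```python
import math


def posterior_masks(log_weighted_likelihoods):
    """Convert per-component log-likelihoods at one time-frequency point
    into posterior probabilities, which serve as the N masks.

    The maximum is subtracted before exponentiating so that very small
    likelihoods do not underflow to zero.
    """
    peak = max(log_weighted_likelihoods)
    scaled = [math.exp(v - peak) for v in log_weighted_likelihoods]
    norm = sum(scaled)
    return [s / norm for s in scaled]


# N = 2 components: index 0 for the target source, index 1 for noise.
masks = posterior_masks([-3.0, -5.0])
```

Because the posteriors sum to one, the N masks partition each time-frequency point among the N acoustic signals.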

マスク推定部１２は、観測ベクトルの確率分布を、平均が０であるＮ個のＭ次元複素ガウス分布であって、共分散行列が、各時刻において異なる値を取るスカラーパラメータと時不変のパラメータとを要素にもつエルミート行列の積で表されるＭ次元複素ガウス分布からなる混合分布でモデル化してもよい。 The mask estimation unit 12 may model the probability distribution of the observation vector by a mixture distribution composed of N zero-mean M-dimensional complex Gaussian distributions, each having a covariance matrix expressed as the product of a scalar parameter that takes a different value at each time and a Hermitian matrix whose elements are time-invariant parameters.
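For reference, the standard density of a zero-mean circular complex Gaussian with such a covariance can be written as below. The symbols are partly assumptions: y_{f,t} denotes the observation vector and φ_{f,t}^{(ν)} the time-varying scalar, neither of which is named in this passage, and identifying the left-hand side with the p_{f,t}^{(ν)} used in the gradient expressions is likewise an assumption.

```latex
p_{f,t}^{(\nu)}
  = \frac{1}{\pi^{M}\,\det\!\bigl(\phi_{f,t}^{(\nu)} R_f^{(\nu)}\bigr)}
    \exp\!\Bigl(-\mathbf{y}_{f,t}^{\mathsf{H}}
      \bigl(\phi_{f,t}^{(\nu)} R_f^{(\nu)}\bigr)^{-1}\mathbf{y}_{f,t}\Bigr),
\qquad
\log p_{f,t}^{(\nu)}
  = -M\log\pi - M\log\phi_{f,t}^{(\nu)} - \log\det R_f^{(\nu)}
    - \frac{1}{\phi_{f,t}^{(\nu)}}\,
      \mathbf{y}_{f,t}^{\mathsf{H}} R_f^{(\nu)-1}\,\mathbf{y}_{f,t}
```

The log form follows because the determinant of a scalar multiple of an M-by-M matrix picks up the scalar raised to the power M.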

一般的に、目的音源に対応する音響信号は、マイクロホンからみて音源方向から主に到来し、雑音はあらゆる方向から到来する。また、エルミート行列には、音源方向に対応する部分空間に最大の固有値を持ち、それ以外の部分空間の固有値は比較的小さな値を持つという性質があるため、エルミート行列を用いてモデル化することで、推定したマスクがどの音響信号に対応したものであるかが明確になる。 In general, the acoustic signal corresponding to the target sound source arrives mainly from the direction of the sound source as seen from the microphones, whereas noise arrives from all directions. In addition, the Hermitian matrix has the property that its largest eigenvalue lies in the subspace corresponding to the sound source direction while its eigenvalues in the other subspaces are relatively small. Modeling with Hermitian matrices therefore makes clear which acoustic signal each estimated mask corresponds to.

ここで、本発明の効果を確認するために行った、従来の方法および第１の実施形態を用いた確認実験について説明する。確認実験では、学習率γを１０^５、Ｒ_ｆ ^（ν）の初期値を非特許文献１に記載の尤度最大化基準で求めた値、更新則の反復回数を３０回とした。また、音声強調は、マスクをそのまま掛け合わせることで行った。 Here, a confirmation experiment using the conventional methods and the first embodiment, performed to confirm the effect of the present invention, will be described. In the confirmation experiment, the learning rate γ was set to 10^5, the initial value of R_f^(ν) was the value obtained by the likelihood maximization criterion described in Non-Patent Document 1, and the number of iterations of the update rule was 30. Speech enhancement was performed by multiplying by the mask directly.

確認実験では、バスの中、カフェ等の背景雑音の存在する環境下において、１人の話者がタブレットに向かって文章を読み上げている状況で、タブレットに装着されたＭ＝６個のマイクで収録した信号に対する音声認識を行った。以下に、従来の方法を用いて音声認識を行った場合と第１の実施形態を用いて音声認識を行った場合の単語誤り率を示す。
（１）音声強調を行わず音声認識を行った場合：２４．６６（％）
（２）非特許文献１に記載の尤度最大化規準で分布パラメータを推定した後、マスキングによって音声強調を行ったうえで音声認識を行った場合：１９．８８（％）
（３）音声認識部のパラメータの一部を、非特許文献２に記載の方法で再推定したうえで音声認識を行った場合：２４．１０（％）
（４）第１の実施形態の方法で分布パラメータを推定し、マスキングによって音声強調を行ったうえで音声認識を行った場合：１８．３５（％）
確認実験の結果、（４）の場合が最も単語誤り率が小さくなった。これより、第１の実施形態によれば、従来の方法と比べて音声認識精度を向上させることができるといえる。
In the confirmation experiment, speech recognition was performed on signals recorded with M = 6 microphones mounted on a tablet, in a situation where one speaker read sentences aloud toward the tablet in environments with background noise, such as inside a bus or in a cafe. The word error rates obtained with the conventional methods and with the first embodiment are shown below.
(1) Speech recognition without speech enhancement: 24.66 (%)
(2) Distribution parameters estimated with the likelihood maximization criterion described in Non-Patent Document 1, followed by speech enhancement by masking and then speech recognition: 19.88 (%)
(3) Speech recognition after re-estimating some of the parameters of the speech recognition unit by the method described in Non-Patent Document 2: 24.10 (%)
(4) Distribution parameters estimated by the method of the first embodiment, followed by speech enhancement by masking and then speech recognition: 18.35 (%)
As a result of the confirmation experiment, the word error rate was lowest in case (4). It can therefore be said that the first embodiment improves speech recognition accuracy compared with the conventional methods.
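To put the four figures in proportion, the short sketch below computes the relative word error rate reduction of case (4) against the other cases. Only the four percentages come from the text; the dictionary keys and the helper name `relative_reduction` are illustrative labels.

```python
# Word error rates (%) reported for cases (1) to (4).
wer = {
    "no_enhancement": 24.66,      # case (1)
    "likelihood_mask": 19.88,     # case (2)
    "recognizer_adapted": 24.10,  # case (3)
    "first_embodiment": 18.35,    # case (4)
}


def relative_reduction(baseline, improved):
    """Relative reduction in percent of `improved` with respect to `baseline`."""
    return 100.0 * (baseline - improved) / baseline


vs_no_enhancement = relative_reduction(wer["no_enhancement"], wer["first_embodiment"])
vs_likelihood_mask = relative_reduction(wer["likelihood_mask"], wer["first_embodiment"])
```

Under these figures, case (4) reduces the word error rate by roughly a quarter relative to recognition without enhancement, and by a smaller margin relative to the likelihood-based mask of case (2).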

［システム構成等］ [System Configuration, etc.]
また、図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部または一部を、各種の負荷や使用状況等に応じて、任意の単位で機能的または物理的に分散・統合して構成することができる。さらに、各装置にて行われる各処理機能は、その全部または任意の一部が、ＣＰＵおよび当該ＣＰＵにて解析実行されるプログラムにて実現され、あるいは、ワイヤードロジックによるハードウェアとして実現され得る。 Each component of each illustrated apparatus is functionally conceptual and does not necessarily need to be physically configured as illustrated. In other words, the specific form of distribution and integration of each apparatus is not limited to that illustrated, and all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of each processing function performed in each apparatus may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware using wired logic.

また、本実施形態において説明した各処理のうち、自動的に行われるものとして説明した処理の全部または一部を手動的に行うこともでき、あるいは、手動的に行われるものとして説明した処理の全部または一部を公知の方法で自動的に行うこともできる。この他、上記文書中や図面中で示した処理手順、制御手順、具体的名称、各種のデータやパラメータを含む情報については、特記する場合を除いて任意に変更することができる。 Also, among the processes described in this embodiment, all or part of the processes described as being performed automatically can be performed manually, or the processes described as being performed manually can be performed. All or a part can be automatically performed by a known method. In addition, the processing procedure, control procedure, specific name, and information including various data and parameters shown in the above-described document and drawings can be arbitrarily changed unless otherwise specified.

［プログラム］ [Program]
一実施形態として、マスク推定用パラメータ推定装置１０は、パッケージソフトウェアやオンラインソフトウェアとして上記のマスク推定用パラメータ推定を実行するマスク推定用パラメータ推定プログラムを所望のコンピュータにインストールさせることによって実装できる。例えば、上記のマスク推定用パラメータ推定プログラムを情報処理装置に実行させることにより、情報処理装置をマスク推定用パラメータ推定装置１０として機能させることができる。ここで言う情報処理装置には、デスクトップ型またはノート型のパーソナルコンピュータが含まれる。また、その他にも、情報処理装置にはスマートフォン、携帯電話機やＰＨＳ（Personal Handyphone System）等の移動体通信端末、さらには、ＰＤＡ（Personal Digital Assistant）等のスレート端末等がその範疇に含まれる。 As one embodiment, the mask estimation parameter estimation apparatus 10 can be implemented by installing, on a desired computer, a mask estimation parameter estimation program that executes the above mask estimation parameter estimation as package software or online software. For example, by causing an information processing apparatus to execute the mask estimation parameter estimation program, the information processing apparatus can be made to function as the mask estimation parameter estimation apparatus 10. The information processing apparatus referred to here includes desktop and notebook personal computers. The category also includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) devices, as well as slate terminals such as PDAs (Personal Digital Assistants).

また、マスク推定用パラメータ推定装置１０は、ユーザが使用する端末装置をクライアントとし、当該クライアントに上記のマスク推定用パラメータ推定に関するサービスを提供するマスク推定用パラメータ推定サーバ装置として実装することもできる。例えば、マスク推定用パラメータ推定サーバ装置は、計算ネットワークの各パラメータ、音声認識結果および参照ラベルを入力とし、マスク推定用パラメータを出力とするマスク推定用パラメータ推定サービスを提供するサーバ装置として実装される。この場合、マスク推定用パラメータ推定サーバ装置は、Ｗｅｂサーバとして実装することとしてもよいし、アウトソーシングによって上記のマスク推定用パラメータ推定に関するサービスを提供するクラウドとして実装することとしてもかまわない。 The mask estimation parameter estimation apparatus 10 can also be implemented as a mask estimation parameter estimation server apparatus that treats a terminal device used by a user as a client and provides the client with a service related to the above mask estimation parameter estimation. For example, the mask estimation parameter estimation server apparatus is implemented as a server apparatus that provides a mask estimation parameter estimation service taking the parameters of the computational network, the speech recognition result, and the reference label as input and producing the mask estimation parameters as output. In this case, the mask estimation parameter estimation server apparatus may be implemented as a Web server, or as a cloud that provides the service related to the above mask estimation parameter estimation through outsourcing.

図５は、プログラムが実行されることによりマスク推定用パラメータ推定装置が実現されるコンピュータの一例を示す図である。コンピュータ１０００は、例えば、メモリ１０１０、ＣＰＵ１０２０を有する。また、コンピュータ１０００は、ハードディスクドライブインタフェース１０３０、ディスクドライブインタフェース１０４０、シリアルポートインタフェース１０５０、ビデオアダプタ１０６０、ネットワークインタフェース１０７０を有する。これらの各部は、バス１０８０によって接続される。 FIG. 5 is a diagram illustrating an example of a computer that realizes a mask estimation parameter estimation apparatus by executing a program. The computer 1000 includes a memory 1010 and a CPU 1020, for example. The computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

メモリ１０１０は、ＲＯＭ（Read Only Memory）１０１１およびＲＡＭ１０１２を含む。ＲＯＭ１０１１は、例えば、ＢＩＯＳ（Basic Input Output System）等のブートプログラムを記憶する。ハードディスクドライブインタフェース１０３０は、ハードディスクドライブ１０９０に接続される。ディスクドライブインタフェース１０４０は、ディスクドライブ１１００に接続される。例えば磁気ディスクや光ディスク等の着脱可能な記憶媒体が、ディスクドライブ１１００に挿入される。シリアルポートインタフェース１０５０は、例えばマウス１１１０、キーボード１１２０に接続される。ビデオアダプタ１０６０は、例えばディスプレイ１１３０に接続される。 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores a boot program such as BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to the hard disk drive 1090. The disk drive interface 1040 is connected to the disk drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example. The video adapter 1060 is connected to the display 1130, for example.

ハードディスクドライブ１０９０は、例えば、ＯＳ１０９１、アプリケーションプログラム１０９２、プログラムモジュール１０９３、プログラムデータ１０９４を記憶する。すなわち、マスク推定用パラメータ推定装置１０の各処理を規定するプログラムは、コンピュータにより実行可能なコードが記述されたプログラムモジュール１０９３として実装される。プログラムモジュール１０９３は、例えばハードディスクドライブ１０９０に記憶される。例えば、マスク推定用パラメータ推定装置１０における機能構成と同様の処理を実行するためのプログラムモジュール１０９３が、ハードディスクドライブ１０９０に記憶される。なお、ハードディスクドライブ１０９０は、ＳＳＤにより代替されてもよい。 The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, a program that defines each process of the mask estimation parameter estimation apparatus 10 is implemented as a program module 1093 in which a code executable by a computer is described. The program module 1093 is stored in the hard disk drive 1090, for example. For example, a program module 1093 for executing processing similar to the functional configuration in the mask estimation parameter estimation apparatus 10 is stored in the hard disk drive 1090. Note that the hard disk drive 1090 may be replaced by an SSD.

また、上述した実施形態の処理で用いられる設定データは、プログラムデータ１０９４として、例えばメモリ１０１０やハードディスクドライブ１０９０に記憶される。そして、ＣＰＵ１０２０が、メモリ１０１０やハードディスクドライブ１０９０に記憶されたプログラムモジュール１０９３やプログラムデータ１０９４を必要に応じてＲＡＭ１０１２に読み出して実行する。 The setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 and executes them as necessary.

なお、プログラムモジュール１０９３やプログラムデータ１０９４は、ハードディスクドライブ１０９０に記憶される場合に限らず、例えば着脱可能な記憶媒体に記憶され、ディスクドライブ１１００等を介してＣＰＵ１０２０によって読み出されてもよい。あるいは、プログラムモジュール１０９３およびプログラムデータ１０９４は、ネットワーク（ＬＡＮ、ＷＡＮ（Wide Area Network）等）を介して接続された他のコンピュータに記憶されてもよい。そして、プログラムモジュール１０９３およびプログラムデータ１０９４は、他のコンピュータから、ネットワークインタフェース１０７０を介してＣＰＵ１０２０によって読み出されてもよい。 The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read out by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN, WAN (Wide Area Network), etc.). The program module 1093 and the program data 1094 may be read by the CPU 1020 from another computer via the network interface 1070.

１０マスク推定用パラメータ推定装置
１１時間周波数分析部
１２マスク推定部
１３音声強調部
１４音声認識部
１５パラメータ推定部
１５１マスク推定用パラメータ初期化部
１５２勾配計算部
１５３パラメータ保持部
１５４パラメータ更新部
１５５収束判定部
DESCRIPTION OF SYMBOLS
10 Mask estimation parameter estimation apparatus
11 Time-frequency analysis unit
12 Mask estimation unit
13 Speech enhancement unit
14 Speech recognition unit
15 Parameter estimation unit
151 Mask estimation parameter initialization unit
152 Gradient calculation unit
153 Parameter holding unit
154 Parameter update unit
155 Convergence determination unit

Claims

A mask estimation parameter estimation apparatus comprising:
a time-frequency analysis unit that, in a situation where N acoustic signals (where N is an integer of 2 or more) are mixed, the N acoustic signals including one first acoustic signal corresponding to a target sound source and N−1 second acoustic signals corresponding to noise, applies short-time signal analysis to each of M observation signals (where M is an integer of 2 or more) recorded at different positions to extract an observation signal for each time-frequency point, and constructs an observation vector that is an M-dimensional column vector of the observation signals for each time-frequency point;
a mask estimation unit that estimates, based on the observation vector and a parameter for mask estimation, a mask representing the proportion in which each of the N acoustic signals is contained in the observation vector at each time-frequency point;
a speech enhancement unit that obtains enhanced speech by multiplying the observation vector by the mask for the first acoustic signal at each of the time-frequency points;
a speech recognition unit that estimates, using a speech recognition parameter learned in advance from learning data, a phoneme state posterior probability indicating which phoneme state the enhanced speech is likely to be in at each time; and
a parameter estimation unit that estimates the parameter for mask estimation so that a predetermined distance criterion between the phoneme state posterior probability and an externally input reference label of the phoneme state is minimized.

The mask estimation parameter estimation apparatus according to claim 1, wherein the mask estimation unit models, for each frequency, the probability distribution of the observation vector by a mixture distribution composed of N component distributions corresponding respectively to the N acoustic signals, and estimates the posterior probability of each component distribution as the mask corresponding to each of the N acoustic signals.

The mask estimation parameter estimation apparatus according to claim 2, wherein the mask estimation unit models the probability distribution of the observation vector by a mixture distribution composed of N zero-mean M-dimensional complex Gaussian distributions, each having a covariance matrix expressed as the product of a scalar parameter that takes a different value at each time and a Hermitian matrix whose elements are time-invariant parameters.

A mask estimation parameter estimation method executed by a mask estimation parameter estimation apparatus, the method comprising:
a time-frequency analysis step of, in a situation where N acoustic signals (where N is an integer of 2 or more) are mixed, the N acoustic signals including one first acoustic signal corresponding to a target sound source and N−1 second acoustic signals corresponding to noise, applying short-time signal analysis to each of M observation signals (where M is an integer of 2 or more) recorded at different positions to extract an observation signal for each time-frequency point, and constructing an observation vector that is an M-dimensional column vector of the observation signals for each time-frequency point;
a mask estimation step of estimating, based on the observation vector and a parameter for mask estimation, a mask representing the proportion in which each of the N acoustic signals is contained in the observation vector at each time-frequency point;
a speech enhancement step of obtaining enhanced speech by multiplying the observation vector by the mask for the first acoustic signal at each of the time-frequency points;
a speech recognition step of estimating, using a speech recognition parameter learned in advance from learning data, a phoneme state posterior probability indicating which phoneme state the enhanced speech is likely to be in at each time; and
a parameter estimation step of estimating the parameter for mask estimation so that a predetermined distance criterion between the phoneme state posterior probability and an externally input reference label of the phoneme state is minimized.

The mask estimation parameter estimation method according to claim 4, wherein the mask estimation step models, for each frequency, the probability distribution of the observation vector by a mixture distribution composed of N component distributions corresponding respectively to the N acoustic signals, and estimates the posterior probability of each component distribution as the mask corresponding to each of the N acoustic signals.

The mask estimation parameter estimation method according to claim 5, wherein the mask estimation step models the probability distribution of the observation vector by a mixture distribution composed of N zero-mean M-dimensional complex Gaussian distributions, each having a covariance matrix expressed as the product of a scalar parameter that takes a different value at each time and a Hermitian matrix whose elements are time-invariant parameters.

A mask estimation parameter estimation program for causing a computer to function as the mask estimation parameter estimation apparatus according to any one of claims 1 to 3.