JP6114053B2

JP6114053B2 - Sound source separation device, sound source separation method, and program

Info

Publication number: JP6114053B2
Application number: JP2013028074A
Authority: JP
Inventors: ソウデンメレツ; 慶介木下; 中谷　智広; 智広中谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-02-15
Filing date: 2013-02-15
Publication date: 2017-04-12
Anticipated expiration: 2033-02-15
Also published as: JP2014157261A

Description

The present invention relates to a sound source separation technique for accurately extracting each target signal from an input signal that contains a plurality of target signals.

When an acoustic signal is recorded in an environment with a plurality of sound sources, a mixed signal in which the target signals overlap one another may be observed. If the target signal of interest is a speech signal, the signals of the other sound sources superimposed on it greatly reduce the intelligibility of the target speech and make it difficult to extract the properties of the original speech. For example, when an automatic speech recognition system is applied to the target speech, the recognition rate drops significantly. In such cases, sound source separation processing that separates the individual target signals can restore the intelligibility of the target speech and improve the speech recognition rate.

Using sound source separation as a component technology in other acoustic signal processing systems can improve the performance of the system as a whole. Systems in which sound source separation can contribute in this way include those listed below. Speech recorded in a real environment often contains sound from sources other than the target speech, such as other speakers and noise; the systems listed below are examples assumed to be used in such conditions.
1. A hearing aid that extracts the target signal from sound picked up in a real environment and makes it easier to hear.
2. A communication system, such as a video conferencing system, that extracts the target signal and improves speech intelligibility.
3. A speech recognition system used in a real environment.
4. A machine control interface that passes commands to a machine in response to sounds produced by a person, and a dialogue device between machines and humans.
5. A music information processing system that extracts the target signal from music sung by a person, played on an instrument, or played through a loudspeaker, in order to search for or transcribe the piece.

One such sound source separation technique is described in Non-Patent Document 1. The technique of Non-Patent Document 1 is described below with reference to FIG. 1.

As shown in FIG. 1, the sound source separation device of Non-Patent Document 1 includes a complex feature vector calculation unit 1, a speech presence probability calculation unit 3, and a filtering unit 6.

An observation signal y(t), collected by a plurality of microphones M_1, …, M_N (N > 1) in an environment with a plurality of sound sources, is input to the sound source separation device. Here, t is the time frame index. The observation signal y(t) is a mixture in which a plurality of target signals overlap, and it is assumed to have already been transformed into the frequency domain, for example by a short-time Fourier transform. The input observation signal y(t) is passed to the complex feature vector calculation unit 1.

The complex feature vector calculation unit 1 calculates, from the observation signal y(t), a complex feature vector ψ(t) that characterizes each time-frequency bin. The complex feature vector ψ(t) is obtained by normalizing the complex-valued observation signal by its norm. This normalization removes the variation caused by the speech signal itself and projects the observation onto the complex unit sphere. The complex feature vector ψ(t) is given by Equation (1).
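Equation (1) is not reproduced in this text. Going by the description above (the observation divided by its norm), a minimal NumPy sketch of the normalization might look as follows. The function name `complex_feature`, the array layout, and the `eps` guard for silent bins are assumptions made for illustration, not part of the original.

```python
import numpy as np

def complex_feature(y, eps=1e-12):
    """Project each observation vector onto the complex unit sphere.

    y   : complex array of shape (..., C); the last axis holds the channels.
    eps : small floor on the norm to avoid division by zero (an assumption).
    """
    norm = np.linalg.norm(y, axis=-1, keepdims=True)
    return y / np.maximum(norm, eps)

# Two time-frequency bins observed with two microphones (made-up values).
y = np.array([[3 + 4j, 0j],
              [1 + 1j, 1 - 1j]])
psi = complex_feature(y)
```

Each row of `psi` has unit Euclidean norm, while the relative phase between channels, which the clustering in Non-Patent Document 1 relies on, is preserved.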

The calculated complex feature vector ψ(t) is input to the speech presence probability calculation unit 3. Based on ψ(t), the speech presence probability calculation unit 3 calculates, for each time-frequency bin, the speech presence probability, that is, the probability that each of the target signals is present. The speech presence probability is obtained by maximum likelihood estimation of the parameters of a mixture model with L components, where L is the number of target signals contained in the observation signal. Because speech signals are sparse, the complex feature vector ψ(t) can be modeled accurately by a multimodal distribution: each mode spreads around the mean of the normalized vectors computed from exactly one of the L target signals. The task of sound source separation therefore reduces to determining, for each time-frequency bin, a hidden variable H that indicates the mode of the multimodal distribution. The hidden variable H takes L discrete values, denoted H_1, …, H_L. If H = H_λ (λ an integer from 1 to L), the λ-th target signal can be said to be dominant in the observation signal. In other words, if the L posterior probabilities p(H_λ|ψ(t)) can be calculated in each time-frequency bin, sound source separation can be performed by clustering the observation signals. Specifically, sound source separation is performed by clustering the complex feature vectors ψ(t) using the mixture model shown in Equation (2).

Here, θ denotes the model parameters, and w_λ satisfies the relationship in Equation (3).
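Equations (2) and (3) are not reproduced in this text. Under the standard finite-mixture formulation, which is assumed here rather than taken from the original, they would plausibly take the following form, with the posterior used for clustering following from Bayes' theorem:

```latex
% Assumed form of the mixture model (2) and the weight constraint (3)
p(\psi(t);\theta) = \sum_{\lambda=1}^{L} w_\lambda \, p(\psi(t) \mid H_\lambda;\theta),
\qquad
\sum_{\lambda=1}^{L} w_\lambda = 1, \quad 0 \le w_\lambda \le 1 .

% Posterior of the hidden variable by Bayes' theorem
p(H_\lambda \mid \psi(t)) =
  \frac{w_\lambda \, p(\psi(t) \mid H_\lambda;\theta)}
       {\sum_{\lambda'=1}^{L} w_{\lambda'} \, p(\psi(t) \mid H_{\lambda'};\theta)} .
```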

To model the probability distribution of the complex feature vectors, a distribution similar to the Gaussian distribution, described in Non-Patent Document 1, or the Watson mixture distribution, described in Non-Patent Document 2, is used. The Watson mixture distribution is the probability distribution given by Equation (4).

Here, a_λ is the concentration parameter and κ_λ is the centroid of the distribution. C is the number of microphones that recorded the observation signal. Γ(·) is the gamma function. M(·,·,·) is Kummer's confluent hypergeometric function. The superscript H denotes the conjugate transpose, that is, the transpose of a matrix or vector with every element replaced by its complex conjugate.

The parameter θ shown in Equation (5) is estimated with the EM algorithm, whose E step yields the posterior probabilities of the L clusters; these correspond to the presence probabilities of the individual sound sources. In Equation (5), ·^T denotes the transpose of a vector or matrix.

The calculated speech presence probabilities are input to the filtering unit 6. The filtering unit 6 calculates an estimate of the desired target signal by multiplying the value of each time-frequency bin of the observation signal y(t) by the speech presence probability of that target signal. With this method, the plurality of target signals contained in the observation signal can be recovered accurately.
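The filtering step described above amounts to soft time-frequency masking. The sketch below illustrates it for a single frequency bin; the function name `apply_masks` and the array shapes are assumptions made for this example.

```python
import numpy as np

def apply_masks(y, post):
    """Multiply each observed bin by the presence probability of each source.

    y    : (T, C) complex observations for one frequency bin.
    post : (T, L) presence probabilities per frame and source.
    Returns an array of shape (L, T, C), one masked copy of y per source.
    """
    return post.T[:, :, None] * y[None, :, :]

# Two frames, two channels, two sources (made-up values).
y = np.array([[1 + 1j, 2 + 0j],
              [0 + 1j, 1 - 1j]])
post = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
estimates = apply_masks(y, post)
```

When the probabilities in each frame sum to one, the per-source estimates add back up to the observation, so the mask only redistributes the observed energy among the sources.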

Non-Patent Document 1: H. Sawada, S. Araki, and S. Makino, “Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 516-527, March 2011.
Non-Patent Document 2: D.H. Tran and R. Haeb-Umbach, “Blind separation employing directional statistics in an expectation maximization framework”, IEEE ICASSP 2010, pp. 241-244.

However, the sound source separation technique described in Non-Patent Document 1 cannot recover the signals accurately when the channels of the input signal have different sampling frequencies. This situation often arises, for example, when the microphone array is made up of several independent recording devices (such as IC recorders). In the following description, a microphone array made up of several independent recording devices is called a distributed microphone array.

An object of the present invention is to provide a sound source separation technique that can estimate the target speech appropriately even when the channels of the input signal have different sampling frequencies.

To solve the above problem, the sound source separation device of the present invention includes: an energy feature vector calculation unit that calculates an energy feature vector representing the energy of each node from an observation signal obtained by recording a mixed signal, in which a plurality of target signals overlap, with a microphone array consisting of two or more nodes each including one or more microphones; a speech presence probability calculation unit that calculates, based on the energy feature vector, a speech presence probability indicating the probability that the speech of each target signal is present; and a filtering unit that multiplies the observation signal by the speech presence probability to obtain an estimate of the target signal.

According to the sound source separation technique of the present invention, the target speech can be estimated appropriately even when the channels of the input signal have different sampling frequencies.

FIG. 1 is a diagram illustrating the functional configuration of a conventional sound source separation device.
FIG. 2 is a diagram illustrating the functional configuration of the sound source separation device of the first embodiment.
FIG. 3 is a diagram illustrating the functional configuration of the speech presence probability calculation unit of the first embodiment.
FIG. 4 is a diagram illustrating the processing flow of the sound source separation device of the first embodiment.
FIG. 5 is a diagram illustrating the functional configuration of the sound source separation device of the second embodiment.
FIG. 6 is a diagram illustrating the functional configuration of the speech presence probability calculation unit of the second embodiment.
FIG. 7 is a diagram illustrating the processing flow of the sound source separation device of the second embodiment.
FIG. 8 is a diagram explaining the experimental conditions.
FIG. 9 is a diagram explaining the experimental results.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and duplicate description is omitted.

[First embodiment]
The sound source separation device and method of the first embodiment estimate the target signal from a specific sound source, in an environment with L sound sources, from an observation signal recorded with a distributed microphone array consisting of N nodes each including at least one microphone.

<Observation signal>
In this embodiment, the input is an observation signal obtained by recording acoustic signals originating from L sound sources with a distributed microphone array consisting of N nodes each including at least one microphone. Here, L is an integer of 2 or more, and N is an integer of 2 or more. The smallest configuration is therefore a distributed microphone array of two nodes with one microphone each, so the observation signal contains at least two channels. The nodes need not all contain the same number of microphones; the numbers of microphones in the N nodes are denoted C_1, …, C_N. That is, with n an integer from 1 to N, the n-th node contains C_n microphones. Hence C = Σ_{n=1}^{N} C_n, where C is the total number of microphones that recorded the observation signal.

The observation signal y_n(k,t) recorded by the n-th node is expressed by Equation (6).

Here, t (1 ≤ t) is the time frame index; k (1 ≤ k ≤ K) is the frequency bin index, where K is the number of frequency bins; and n (1 ≤ n ≤ N) is the node index, where N is the number of nodes in the distributed microphone array.

The observation signal y_n(k,t) recorded by the n-th node contains C_n channels of audio. It is therefore defined by Equation (7).

The observation signal y(k,t) of the whole distributed microphone array is defined by Equation (8).

The observation signal x_{n,λ}(k,t), obtained when the n-th node records the λ-th target signal S_λ(k,t), has a short reverberation h_{n,λ}(k), also called channel distortion, superimposed on it, as defined by Equation (9).

In Equation (9), h_{n,λ}(k) is the transfer function between the λ-th sound source and the n-th node. h_{n,λ}(k) is defined by Equation (10).

Because all processing in the present invention is performed independently for each frequency bin, the frequency bin index k is omitted in the following description.

<Energy feature vector>
Sound energy is known to differ greatly depending on the distance between the sound source and the microphone; in a free sound field, for example, it decays in proportion to the inverse square of the distance. To exploit this distance-dependent difference for sound source separation, the present invention calculates the energy feature vector ρ(t) defined by Equation (11).

Here, the energy feature ρ_n(t) for the observation signal of the n-th node is defined by Equation (12).

That is, ρ_n(t) is the normalized energy of the observation signal of the n-th node. The denominator of ρ_n(t) in Equation (12) can be obtained by summation if each node shares only its own ||y_n||² with the other nodes.

The complex feature vector ψ(t) of Equation (1), used in the sound source separation technique of Non-Patent Document 1, contains phase information. Phase information is strongly affected when the channels have different sampling frequencies, so clustering based on a feature vector that contains phase information cannot achieve effective sound source separation. In contrast, the energy feature vector ρ(t) of Equation (11) contains no phase information and represents amplitude information only. Even with a sampling frequency mismatch, the effect of the amplitude information on sound source separation remains small as long as the resulting drift does not greatly exceed the frame length. Using the energy feature vector for clustering can therefore be expected to yield robust and effective clustering even when the channels have different sampling frequencies.
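Equations (11) and (12) are not reproduced in this text. From the description above, ρ_n(t) appears to be the energy ||y_n(t)||² of node n divided by the sum of the energies of all nodes. The NumPy sketch below follows that reading; the function name, the array layout, and the `eps` guard are assumptions. It also applies a constant phase rotation to one node, as a deliberately crude stand-in for the phase disturbance caused by a sampling frequency mismatch, to show that the feature does not depend on phase.

```python
import numpy as np

def energy_features(node_signals, eps=1e-12):
    """Normalized per-node energy for one frequency bin.

    node_signals : list of N complex arrays; entry n has shape (T, C_n).
    Returns rho of shape (T, N) whose rows sum to one.
    """
    # ||y_n(t)||^2: sum of squared magnitudes over the channels of node n.
    energy = np.stack(
        [np.sum(np.abs(y_n) ** 2, axis=-1) for y_n in node_signals], axis=-1
    )
    total = np.sum(energy, axis=-1, keepdims=True)
    return energy / np.maximum(total, eps)

# Node 1 has one microphone, node 2 has two (made-up values, two frames).
node_1 = np.array([[1 + 0j], [0 + 2j]])
node_2 = np.array([[1j, 1 + 0j], [0j, 0j]])

rho = energy_features([node_1, node_2])
# Rotating the phase of node 2 leaves the energy features unchanged.
rho_shifted = energy_features([node_1, node_2 * np.exp(0.7j)])
```

Only the scalar energy of each node enters the computation, which matches the remark above that a node needs to share nothing but ||y_n||² with the other nodes.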

Note that the energy feature vector ρ(t) arranges the observation-signal energies of all nodes into a single vector, and is therefore a feature of the observation signal across nodes (an inter-node feature). The complex feature vector ψ(t), by contrast, is a feature of the observation signal within each node (an intra-node feature).

<Modeling of the energy feature vector>
The energy feature vector can be modeled with a Dirichlet mixture model (DMM). Assuming, as in Equation (13), that the sound from every source reaches every microphone, the Dirichlet mixture model is a reasonable probability distribution for modeling the energy feature vector.

The Dirichlet mixture model is defined by Equation (14).

Each component of Equation (14) is defined by Equation (15).

In Equation (15), the parameter α is defined by Equation (16).
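Equations (14) to (16) are not reproduced in this text. The sketch below evaluates the standard Dirichlet density and a weighted mixture of such densities, which is what a Dirichlet mixture model is generally understood to be; whether the original equations use exactly this parameterization is an assumption, as are the function names.

```python
import math
import numpy as np

def dirichlet_logpdf(rho, alpha):
    """Log-density of a Dirichlet distribution at one point of the simplex.

    rho   : (N,) strictly positive entries that sum to one.
    alpha : (N,) strictly positive parameters.
    """
    log_norm = math.lgamma(float(np.sum(alpha))) - sum(
        math.lgamma(float(a)) for a in alpha
    )
    return log_norm + float(np.sum((alpha - 1.0) * np.log(rho)))

def mixture_pdf(rho, weights, alphas):
    """Weighted sum of Dirichlet densities, one component per sound source."""
    return sum(
        w * math.exp(dirichlet_logpdf(rho, a)) for w, a in zip(weights, alphas)
    )

rho = np.array([0.2, 0.3, 0.5])
flat = np.ones(3)  # alpha = (1, 1, 1) gives the uniform density on the simplex
value = mixture_pdf(rho, [0.5, 0.5], [flat, flat])
```

The density requires every ρ_n to be strictly positive, which is consistent with the assumption stated above that sound from every source reaches every microphone.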

For details of the Dirichlet mixture model, see T.P. Minka, “Estimating a Dirichlet distribution,” Technical report, Microsoft Research, Cambridge, 2003 (Reference 1) and N. Bouguila, D. Ziou, and J. Vaillancourt, “Unsupervised learning of a finite mixture model based on the Dirichlet distribution and its application,” IEEE Trans. Image Process., vol. 13, pp. 1533-1543, Nov. 2004 (Reference 2).

<Parameter estimation of the Dirichlet mixture distribution>
The parameters of the Dirichlet mixture distribution can be estimated under a criterion such as likelihood maximization. One example is the EM algorithm. The detailed procedure for estimating the parameters of the Dirichlet mixture distribution with the EM algorithm is described below.

First, following Bayes' theorem, the posterior speech presence probability P(t,λ,^θ) of each target signal, calculated from the energy feature vector, is defined by Equation (17).
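Equation (17) is not reproduced in this text. Read as a direct application of Bayes' theorem to a Dirichlet mixture, the posterior of source λ in frame t is the weighted component density divided by the sum over all components. The sketch below computes this in the log domain for numerical stability; the function name and the array shapes are assumptions.

```python
import math
import numpy as np

def posteriors(rho, weights, alphas):
    """Posterior probability of each mixture component for each frame.

    rho     : (T, N) energy feature vectors with strictly positive entries.
    weights : (L,) mixture weights.
    alphas  : (L, N) Dirichlet parameters, one row per component.
    Returns an array of shape (T, L) whose rows sum to one.
    """
    num_frames, num_sources = rho.shape[0], len(weights)
    log_joint = np.empty((num_frames, num_sources))
    for lam in range(num_sources):
        a = alphas[lam]
        log_norm = math.lgamma(float(a.sum())) - sum(
            math.lgamma(float(v)) for v in a
        )
        # log w + log Dirichlet density, evaluated for all frames at once
        log_joint[:, lam] = np.log(weights[lam]) + log_norm + np.log(rho) @ (a - 1.0)
    # Subtract the row maximum before exponentiating to avoid underflow.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

rho = np.array([[0.9, 0.1],
                [0.1, 0.9]])
weights = np.array([0.5, 0.5])
alphas = np.array([[9.0, 1.0],
                   [1.0, 9.0]])
post = posteriors(rho, weights, alphas)
```

In this toy example the first frame, whose energy is concentrated at node 1, is assigned almost entirely to the component whose parameters favor node 1, and the second frame to the other component.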

Here, ^θ denotes the unknown parameters α of the Dirichlet mixture distribution, which can be obtained by maximizing Equation (14). To maximize Equation (14) efficiently, instead of maximizing it directly, an auxiliary function of Equation (14) may be defined as in Equation (18) and the parameters that maximize it estimated iteratively.

In Equation (18), ^θ' is the previously obtained estimate of the parameter ^θ. The function Q_1 is defined by Equation (19), and the function Q_2 is defined by Equation (20).

The maximization of the auxiliary function in Equation (18) must satisfy the constraint in Equation (21).

As a result, the w_λ that maximizes Equation (19) can be obtained from Equation (22).
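Equations (19), (21), and (22) are not reproduced in this text. Assuming that Q_1 has the usual EM form for mixture weights and that the constraint (21) is that the weights sum to one, the update can be derived with a Lagrange multiplier μ as sketched below. The symbol T for the number of time frames is introduced here for illustration.

```latex
% Assumed form of Q_1 and of the constraint
Q_1 = \sum_{t=1}^{T} \sum_{\lambda=1}^{L} P(t,\lambda,\hat{\theta}') \log w_\lambda ,
\qquad \sum_{\lambda=1}^{L} w_\lambda = 1 .

% Stationarity of the Lagrangian Q_1 + \mu (1 - \sum_\lambda w_\lambda)
\frac{1}{w_\lambda} \sum_{t=1}^{T} P(t,\lambda,\hat{\theta}') - \mu = 0
\;\Longrightarrow\;
w_\lambda = \frac{1}{\mu} \sum_{t=1}^{T} P(t,\lambda,\hat{\theta}') .

% Summing over \lambda and using \sum_\lambda P(t,\lambda,\hat{\theta}') = 1 gives \mu = T
w_\lambda = \frac{1}{T} \sum_{t=1}^{T} P(t,\lambda,\hat{\theta}') .
```

Under these assumptions, each weight is simply the average posterior probability of the corresponding source over all frames.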

No closed-form solution exists for the Dirichlet mixture parameters that maximize Q(^θ,^θ') in Equation (18), but it is known that accurate estimates can be obtained with the Newton-Raphson algorithm (see Reference 1 for details).

To guarantee that the estimated parameter α_{n,λ} is positive, it must be reparameterized as α_{n,λ} = exp(β_{n,λ}) for some real value β_{n,λ} (see Reference 2 for details). With β_λ = [β_{1,λ} … β_{N,λ}]^T, repeating the operation of Equation (23) several times yields an accurate α_{n,λ} = exp(β_{n,λ}).

In Equation (23), j is the iteration count. Δ(β_λ) is the gradient of Q(^θ,^θ') with respect to β_λ. ∇(β_λ) is the Hessian of Q(^θ,^θ'), which is invertible. Δ(β_λ) and ∇(β_λ) are described in detail below.

First, the variable γ_{n,λ} is defined by Equation (24).

The variable τ_{n,λ} is defined by Equation (25).

Using the variables γ_{n,λ} and τ_{n,λ}, the n-th element of Δ(β_λ) is expressed as in Equation (26).

Here, ψ(·) is the digamma function. The diagonal elements of ∇(β_λ) are expressed as in Equation (27), and the off-diagonal elements as in Equation (28).

Here, ψ'(·) is the trigamma function. As a result, ∇(β_λ) can be written in the simple form of Equation (29).

Here, * denotes element-wise multiplication, and diag[·] denotes the diagonal matrix whose diagonal elements are those of the input vector. The inverse of this matrix is easily computed with the Sherman-Morrison formula.
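Equation (29) is not reproduced in this text, but the mention of diag[·] and of the Sherman-Morrison formula suggests a matrix of the form "diagonal plus rank-one". The sketch below shows how a linear system with such a matrix can be solved without forming or inverting the full matrix. The generic form diag(d) + c·u·uᵀ and all names are assumptions for illustration; they are not the Hessian of the original.

```python
import numpy as np

def solve_diag_plus_rank_one(d, u, c, g):
    """Solve (diag(d) + c * u u^T) x = g with the Sherman-Morrison formula.

    d : (N,) nonzero diagonal entries.
    u : (N,) vector defining the rank-one term.
    c : scalar weight of the rank-one term.
    g : (N,) right-hand side, for example a gradient.
    Costs O(N) operations instead of the O(N^3) of a general solver.
    """
    dinv_g = g / d
    dinv_u = u / d
    denom = 1.0 + c * np.dot(u, dinv_u)  # must be nonzero for invertibility
    return dinv_g - dinv_u * (c * np.dot(u, dinv_g) / denom)

d = np.array([2.0, 3.0, 4.0])
u = np.array([1.0, 2.0, 3.0])
c = 0.5
g = np.array([1.0, 0.0, -1.0])
x = solve_diag_plus_rank_one(d, u, c, g)
```

In a Newton-Raphson update, `g` would be the gradient and `x` the resulting step, so each iteration stays cheap even when the number of nodes N is large.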

For details of the digamma and trigamma functions, see I.S. Gradshteyn and I.M. Ryzhik, “Table of integrals, series, and products, seventh edition”, Academic Press, MA, USA, 2007 (Reference 3). For details of the Sherman-Morrison formula, see J. Sherman and W. J. Morrison, “Adjustment of an inverse matrix corresponding to a change in one element of a given matrix,” Annals of Mathematical Statistics, vol. 21, pp. 124-127, 1950 (Reference 4).

As described above, when the parameters of the Dirichlet mixture distribution are estimated with the EM algorithm, Equations (15) and (17) are executed as the E step and Equations (22) and (23) as the M step, repeatedly, until a predetermined criterion is satisfied. One possible criterion is to compute the value of the Q function from the parameters of the Dirichlet mixture distribution and the posterior speech presence probabilities of the target signals, and to regard the criterion as satisfied when the difference between the values before and after an update falls below a predetermined threshold. Another is to regard the criterion as satisfied when a predetermined number of iterations has been reached. Repeating the process maximizes the value of the Q function in Equation (18).
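The iteration and the two stopping rules described above can be sketched as a generic driver loop. The callables `e_step`, `m_step`, and `q_value` are hypothetical placeholders for Equations (15) and (17), Equations (22) and (23), and Equation (18) respectively; their signatures, and the default values of `tol` and `max_iter`, are assumptions made for this example.

```python
def run_em(e_step, m_step, q_value, params, tol=1e-6, max_iter=100):
    """Alternate E and M steps until the Q function stops changing.

    e_step(params)        -> posteriors
    m_step(posteriors)    -> updated params
    q_value(params, post) -> scalar value of the auxiliary function
    Stops when |Q_new - Q_old| < tol or after max_iter iterations.
    """
    prev_q = None
    for iteration in range(1, max_iter + 1):
        post = e_step(params)
        params = m_step(post)
        q = q_value(params, post)
        if prev_q is not None and abs(q - prev_q) < tol:
            break
        prev_q = q
    return params, post, iteration

# Toy stand-ins so the loop can run: the "parameter" is halved each round.
params, post, used = run_em(
    e_step=lambda p: p,
    m_step=lambda post: post / 2.0,
    q_value=lambda p, post: p,
    params=1.0,
)
```

With the toy callables the loop ends through the threshold rule well before `max_iter`; passing `tol=0.0` would disable that rule and leave only the iteration cap.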

As noted above, these posterior probabilities are calculated independently in each frequency bin, so a permutation problem arises; it can be resolved by applying the permutation alignment method described in Non-Patent Document 1.

<Configuration>
As shown in FIG. 2, the sound source separation device of the first embodiment includes an energy feature vector calculation unit 2, a speech presence probability calculation unit 4, and a filtering unit 6. The sound source separation device is, for example, a special device configured by loading a special program into a known or dedicated computer having a CPU (Central Processing Unit), RAM (Random Access Memory), and so on. As shown in FIG. 3, the speech presence probability calculation unit 4 includes a posterior probability calculation means 41, a parameter estimation means 42, and an iterative processing means 43.

＜動作＞
図４を参照して、第一実施形態の音源分離装置の動作例を説明する。 <Operation>
With reference to FIG. 4, the operation example of the sound source separation apparatus of the first embodiment will be described.

C個のマイクＭ_1,1,…,Ｍ_1,C1,…,Ｍ_N,1,…,Ｍ_N,CNからなる分散型マイクアレイで収音した観測信号y(t)はエネルギー特徴ベクトル計算部２へ入力される。エネルギー特徴ベクトル計算部２は観測信号y(t)に基づいて各ノードのエネルギーを表すエネルギー特徴ベクトルρ(t)を計算する（ステップＳ２）。エネルギー特徴ベクトルρ(t)は式(11)(12)により計算できる。詳しくは上述の＜エネルギー特徴ベクトル＞を参照されたい。計算したエネルギー特徴ベクトルρ(t)は音声存在確率計算部４へ入力される。 The observed signal y (t) collected by the distributed microphone array consisting of C microphones M _1,1 , ..., M _{1, C1} , ..., M _{N, 1} , ..., M _{N, CN} is energy feature vector calculation Input to part 2. The energy feature vector calculator 2 calculates an energy feature vector ρ (t) representing the energy of each node based on the observed signal y (t) (step S2). The energy feature vector ρ (t) can be calculated by equations (11) and (12). For details, see <Energy Feature Vector> above. The calculated energy feature vector ρ (t) is input to the speech existence probability calculation unit 4.

音声存在確率計算部４は、事後確率計算手段４１により、エネルギー特徴ベクトルρ(t)に基づいて目的信号ごとの音声が存在する確率を示す音声存在確率P(t,λ,^θ)を求める（ステップＳ４１）。音声存在確率P(t,λ,^θ)は式(15)(17)により計算できる。詳しくは上述の＜ディリクレ混合分布のパラメータ推定＞を参照されたい。 The speech existence probability calculation unit 4 obtains a speech existence probability P (t, λ, ^ θ) indicating the probability that there is speech for each target signal based on the energy feature vector ρ (t) by the posterior probability calculation means 41. (Step S41). The speech existence probability P (t, λ, ^ θ) can be calculated by equations (15) and (17). For details, see <Estimation of parameters of Dirichlet mixture distribution> above.

音声存在確率計算部４は、パラメータ推定手段４２により、エネルギー特徴ベクトルρ(t)と音声存在確率P(t,λ,^θ)に基づいてディリクレ混合分布のパラメータα_λを更新する（ステップＳ４２）。パラメータα_λは式(22)(23)により求めることができる。詳しくは上述の＜ディリクレ混合分布のパラメータ推定＞を参照されたい。 Speech presence probability calculator 4, the parameter estimation unit 42, an energy feature vector [rho (t) and the speech presence probability P (t, λ, ^ θ ) to update the parameter alpha _lambda Dirichlet mixture model based on (step S42 ). The parameter α _λ can be obtained by the equations (22) and (23). For details, see <Estimation of parameters of Dirichlet mixture distribution> above.

音声存在確率計算部４は、反復処理手段４３により、所定の基準を満たすかどうかを判断する（ステップＳ４３）。所定の基準を満たさない場合には、ステップＳ４１へ戻る。所定の基準を満たす場合には、最終的に得られた音声存在確率P(t,λ,^θ)をフィルタリング部６へ出力する。所定の基準については上述の＜ディリクレ混合分布のパラメータ推定＞で詳述したためここでは説明を省略する。 The speech existence probability calculation unit 4 determines whether or not a predetermined criterion is satisfied by the iterative processing unit 43 (step S43). If the predetermined standard is not satisfied, the process returns to step S41. When the predetermined criterion is satisfied, the finally obtained speech existence probability P (t, λ, ^ θ) is output to the filtering unit 6. Since the predetermined criterion has been described in detail in <Parameter estimation of Dirichlet mixture distribution> described above, description thereof is omitted here.

The filtering unit 6 obtains an estimate of each target signal by multiplying the value of each time-frequency bin of the observation signal y(t) by the speech presence probability P(t,λ,^θ) for that target signal (step S6).
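Step S6 amounts to time-frequency masking. The sketch below assumes that a single reference channel of the observation is masked, which this passage does not specify. The function name `apply_masks` and the nested-list layout are hypothetical.

```python
def apply_masks(observation, presence_probs):
    """Multiply each time-frequency bin by the presence probability of each source.

    observation[t][f]: complex STFT coefficient of the (assumed) reference channel.
    presence_probs[s][t][f]: speech presence probability of source s in that bin.
    Returns estimates[s][t][f].
    """
    return [
        [
            [p * x for p, x in zip(p_row, x_row)]
            for p_row, x_row in zip(mask, observation)
        ]
        for mask in presence_probs
    ]


obs = [[1 + 2j, 3 + 0j], [0.5j, -1 + 0j]]
masks = [
    [[0.75, 0.25], [1.0, 0.0]],  # source 0
    [[0.25, 0.75], [0.0, 1.0]],  # source 1
]
estimates = apply_masks(obs, masks)
```

When the probabilities of all sources sum to one in every bin, the source estimates add up to the original observation.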

The estimated target signal is output from the output terminal S.

[Second Embodiment]
The sound source separation device and method of the second embodiment estimate the target signal of a specific sound source from observation signals collected, in an environment where L sound sources exist, by a distributed microphone array composed of N nodes each including at least two microphones. The difference from the sound source separation device and method of the first embodiment is therefore that each node of the distributed microphone array includes a plurality of microphones.

<Complex Feature Vector>
The complex feature vector ψ_n(t) of this embodiment is expressed by Equation (30).

This complex feature vector ψ_n(t) is the complex feature vector ψ(t) described in Non-Patent Document 1, calculated separately at each node. It can also be regarded as a feature obtained by normalizing the observation signal of each node, that is, an intra-node feature.
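A minimal sketch of such an intra-node feature follows. Equation (30) is not reproduced in this section. Based on the description above and on the claims, which describe normalizing the observation signal for each node, the sketch assumes division by the Euclidean norm. The function name `complex_feature` is hypothetical.

```python
import math


def complex_feature(y_n):
    """Scale the observation vector of one node to unit Euclidean norm.

    y_n: complex STFT coefficients of the microphones of node n for one
    time-frequency bin. Assumed form: psi_n = y_n / ||y_n||.
    """
    norm = math.sqrt(sum(abs(x) ** 2 for x in y_n))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero observation")
    return [x / norm for x in y_n]


psi = complex_feature([3 + 0j, 4j])
```

The normalization discards the overall level of the bin but keeps the level ratio and phase difference between the microphones of the node. Since only microphones within one node are compared, the feature does not rely on synchronization between nodes.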

<Modeling of the Complex Feature Vector>
Since the complex feature vector ψ_n(t) is an intra-node feature, it can be modeled with a Watson mixture distribution in the same way as in Non-Patent Document 1. The unknown parameter ~θ of the Watson mixture distribution is expressed by Equation (31).

The parameter ~θ may be estimated by optimizing each node independently, as in Non-Patent Document 1, but information may also be shared between nodes for more accurate estimation. Here, the independence expressed by Equation (32) is assumed for the observation signals of the nodes.

Here, ~ψ(t) is defined by Equation (33).

Applying Bayes' rule under the assumption of Equation (32), the posterior probability ~P(t,λ,~θ) for the λ-th target signal can be expressed as in Equation (34).

Here, χ(t,~θ) is a normalization term, and ζ(t,λ,~θ) is defined by Equation (35).

For details of Equation (35), see J. Kittler, M. Hatef, R.P.W. Duin, and J. Matas, "On combining classifiers", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, pp. 226-239, March 1998 (Reference 5).

According to Equations (34) and (35), within the overall parameter estimation only the posterior probability p(H_λ|ψ_n(t);~θ) is shared between nodes, and all other parameters are estimated independently at each node. In an actual acoustic space, it is more effective to integrate the hypotheses by addition, as shown in Equation (36), than by multiplication, as shown in Equation (29).
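The difference between multiplicative and additive integration of per-node posteriors can be illustrated with a small sketch. Equations (35) and (36) are not reproduced in this section, so the two functions below are simplified stand-ins in the spirit of the product rule and the sum rule of Reference 5: a normalized product and a plain average of the per-node posteriors, with any prior-dependent terms omitted. The names `combine_product` and `combine_sum` are hypothetical.

```python
def combine_product(node_posteriors):
    """Normalized element-wise product of the per-node posteriors."""
    prod = [1.0] * len(node_posteriors[0])
    for post in node_posteriors:
        prod = [a * b for a, b in zip(prod, post)]
    z = sum(prod)
    return [p / z for p in prod]


def combine_sum(node_posteriors):
    """Average of the per-node posteriors."""
    n_nodes = len(node_posteriors)
    return [sum(col) / n_nodes for col in zip(*node_posteriors)]


# Two nodes favor source 0; one node (for example, a badly placed one) disagrees strongly.
posts = [[0.9, 0.1], [0.8, 0.2], [0.01, 0.99]]
by_product = combine_product(posts)  # the single outlier dominates
by_sum = combine_sum(posts)          # the majority still wins
```

In this example the product lets one near-zero posterior veto the other nodes, whereas the average degrades gracefully, which is one plausible reading of why additive integration is reported as more effective in real acoustic spaces.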

<Model Parameter Estimation>
Since the energy feature vector ρ(t) and the complex feature vectors ψ_1(t), ..., ψ_N(t) capture complementary information, they can be assumed to be statistically independent of each other. The posterior speech presence probability P^(ρ,~ψ)(t,λ,^θ) of each target signal using these feature vectors is therefore expressed as in Equation (37).

The posterior speech presence probability P^(ρ,~ψ)(t,λ,^θ) of Equation (37) can be calculated by Equation (38).

In addition, applying the law of total probability yields Equation (39).

Here, ^θ denotes all model parameters and can be obtained by maximizing Equation (39). As an efficient way to do so, instead of maximizing Equation (39) directly, an auxiliary function of Equation (39) may be defined as in Equation (40), and the parameters that maximize it may be estimated iteratively.

In Equation (40), ^θ' is a previously obtained estimate of ^θ. Q_1 is defined by Equation (41), Q_2 by Equation (42), and Q_3 by Equation (43).

When maximizing the auxiliary function of Equation (40), the constraint of Equation (21) above must be satisfied. As a result, the w_λ that maximizes Equation (41) can be obtained by Equation (44).
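The weight update can be sketched as follows. Equations (21), (41), and (44) are not reproduced in this section. The sketch assumes the usual mixture-model setting in which the constraint requires the weights to sum to one and Q_1 is the posterior-weighted sum of log w_λ, in which case a Lagrange-multiplier argument gives the average posterior as the maximizer. The function name `update_weights` is hypothetical.

```python
def update_weights(posteriors):
    """Mixture weights as the average posterior of each source.

    posteriors[t][k]: posterior probability of source k at bin t; each row is
    assumed to sum to one, so the returned weights also sum to one.
    """
    n_bins = len(posteriors)
    return [sum(col) / n_bins for col in zip(*posteriors)]


w = update_weights([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.5, 0.5]])
```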

Similarly, setting the partial derivatives of Equation (43) with respect to a_{n,λ} and κ_{n,λ} to zero shows that a_{n,λ} is given as the eigenvector corresponding to the largest eigenvalue r_{n,λ} of the matrix R_{n,λ} shown in Equation (45).
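The eigenvector computation can be sketched with power iteration. Equation (45) is not reproduced in this section, so the sketch assumes that R_{n,λ} is the posterior-weighted covariance of the complex feature vectors of node n, as is common for Watson mixture models. The names `weighted_covariance` and `principal_eigenpair` are hypothetical, and power iteration is only one of several ways to obtain the largest eigenpair.

```python
import math


def weighted_covariance(features, posts):
    """Assumed form: R = sum_t P_t * psi_t psi_t^H / sum_t P_t."""
    dim = len(features[0])
    total = sum(posts)
    R = [[0j] * dim for _ in range(dim)]
    for psi, p in zip(features, posts):
        for i in range(dim):
            for j in range(dim):
                R[i][j] += p * psi[i] * psi[j].conjugate() / total
    return R


def principal_eigenpair(R, n_iter=200):
    """Power iteration for a Hermitian positive semi-definite matrix.

    Limitation of this sketch: the start vector must not be orthogonal to
    the principal eigenvector, and R must not map it to zero.
    """
    dim = len(R)
    v = [complex(1.0 / math.sqrt(dim))] * dim
    for _ in range(n_iter):
        w = [sum(R[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(abs(x) ** 2 for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^H R v gives the eigenvalue for the unit vector v.
    Rv = [sum(R[i][j] * v[j] for j in range(dim)) for i in range(dim)]
    eigval = sum(v[i].conjugate() * Rv[i] for i in range(dim)).real
    return eigval, v


s = 1.0 / math.sqrt(2.0)
features = [[s, s * 1j]] * 3 + [[1 + 0j, 0j]]
posts = [1.0, 1.0, 1.0, 0.1]
r, a = principal_eigenpair(weighted_covariance(features, posts))
```

Here `a` plays the role of a_{n,λ} and `r` the role of r_{n,λ}. The eigenvector is determined only up to a unit-modulus phase factor.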

Here, κ_{n,λ} must satisfy Equation (46).

A closed-form solution for κ_{n,λ} cannot be derived from Equation (46), but the approximation of Equation (47) for κ_{n,λ} is known to be effective.

For details of this approximation, see A.S. Bijral, M. Breitenbach, and G. Grudic, "Mixture of Watson distributions: a generative model for hyperspherical embedding", J. Machine Learning Research, pp. 35-42, 2007 (Reference 6), and S. Sra and D. Karp, "The multivariate Watson distribution: maximum-likelihood estimation and other aspects", preprint: arXiv:1104.4422v2, May 2012 (Reference 7).

There is no closed-form solution for the parameters of the Dirichlet mixture distribution that maximize Q(^θ,^θ') in Equation (40), but it is known that accurate parameter estimation is possible with the Newton-Raphson algorithm (see Reference 1 for details).

As described above, when the parameter estimation of this embodiment is performed with the EM algorithm, the following is repeated until a predetermined criterion is satisfied: Equations (34), (36), (15), and (38) are executed as the E step, and Equations (44), (45), (47), and (23) are executed as the M step. The predetermined criterion is the same as in the first embodiment, so a detailed description is omitted.

The permutation problem also arises in this embodiment. As in the first embodiment, the permutation resolution method described in Non-Patent Document 1 may be applied.

<Configuration>
As shown in FIG. 5, the sound source separation device of the second embodiment includes N complex feature vector calculation units 1_1, ..., 1_N, an energy feature vector calculation unit 2, a speech presence probability calculation unit 5, and a filtering unit 6. The sound source separation device is, for example, a special device configured by loading a special program into a known or dedicated computer having a CPU (Central Processing Unit), a RAM (Random Access Memory), and the like. As shown in FIG. 6, the speech presence probability calculation unit 5 includes first presence probability calculation means 51, second presence probability calculation means 52, posterior probability calculation means 53, parameter estimation means 54, and iterative processing means 55.

<Operation>
An operation example of the sound source separation device of the second embodiment is described with reference to FIG. 7.

The N nodes of the distributed microphone array and the N complex feature vector calculation units 1_1, ..., 1_N correspond one to one. The C_n-channel observation signal y_n(t), collected by the C_n microphones M_{n,1}, ..., M_{n,C_n} of the n-th node, is input to the complex feature vector calculation unit 1_n. Based on the observation signal y_n(t) collected at the n-th node, the complex feature vector calculation unit 1_n calculates a complex feature vector ψ_n(t) that characterizes each time-frequency bin (step S1). The complex feature vector ψ_n(t) can be calculated by Equation (30). For details, see <Complex Feature Vector> above. The complex feature vectors ψ_1(t), ..., ψ_N(t) calculated by the complex feature vector calculation units 1_1, ..., 1_N are input to the speech presence probability calculation unit 5.

The C-channel observation signal y(t) collected by the entire distributed microphone array is input to the energy feature vector calculation unit 2. Based on the observation signal y(t), the energy feature vector calculation unit 2 calculates an energy feature vector ρ(t) representing the energy of each node (step S2). The processing of the energy feature vector calculation unit 2 is the same as in the first embodiment, so a detailed description is omitted. The calculated energy feature vector ρ(t) is input to the speech presence probability calculation unit 5.

Using the first presence probability calculation means 51, the speech presence probability calculation unit 5 obtains, based on the energy feature vector ρ(t), a first speech presence probability indicating the probability that speech of each target signal is present (step S51). The first speech presence probability can be calculated by Equation (15). For details, see <Parameter Estimation of the Dirichlet Mixture Distribution> above.

Using the second presence probability calculation means 52, the speech presence probability calculation unit 5 obtains, based on the complex feature vectors ψ_1(t), ..., ψ_N(t), a second speech presence probability indicating the probability that speech of each target signal is present (step S52). The second speech presence probability can be calculated by Equations (34) and (36). For details, see <Model Parameter Estimation> above.

Using the posterior probability calculation means 53, the speech presence probability calculation unit 5 integrates the first speech presence probability and the second speech presence probability to obtain the speech presence probability P^(ρ,~ψ)(t,λ,^θ) indicating the probability that speech of each target signal is present (step S53). The speech presence probability P^(ρ,~ψ)(t,λ,^θ) can be obtained by Equation (38). For details, see <Model Parameter Estimation> above.

Using the parameter estimation means 54, the speech presence probability calculation unit 5 updates the parameter α_λ of the Dirichlet mixture distribution and the parameter κ_λ of the Watson mixture distribution based on the energy feature vector ρ(t), the complex feature vectors ψ_1(t), ..., ψ_N(t), and the speech presence probability P^(ρ,~ψ)(t,λ,^θ) (step S54). The parameter α_λ can be obtained by Equations (22) and (23). For details, see <Parameter Estimation of the Dirichlet Mixture Distribution> above. The parameter κ_λ can be obtained by Equations (44), (45), and (47). For details, see <Model Parameter Estimation> above.

Using the iterative processing means 55, the speech presence probability calculation unit 5 determines whether a predetermined criterion is satisfied (step S55). If the criterion is not satisfied, the process returns to step S51. If the criterion is satisfied, the finally obtained speech presence probability P^(ρ,~ψ)(t,λ,^θ) is output to the filtering unit 6. The predetermined criterion was described in detail in <Parameter Estimation of the Dirichlet Mixture Distribution> above, so its description is omitted here.

The filtering unit 6 obtains an estimate of each target signal by multiplying the value of each time-frequency bin of the observation signal y(t) by the speech presence probability P^(ρ,~ψ)(t,λ,^θ) for that target signal (step S6).

[Experimental Results]
According to the present invention, accurate sound source separation can be performed stably even when the sampling frequencies of the channels of the input signal differ, as is the case, for example, when sound is collected in a distributed microphone array environment.

A simulation experiment was performed to confirm the effect of the present invention. FIG. 8 illustrates the experimental environment. In this experiment, there were three target signals (L=3), three nodes (N=3), and two microphones in each node (C_1=2, C_2=2, C_3=2). The source signals of the speakers were taken from data of 12 male and 12 female speakers randomly extracted from the TIMIT database. The speakers were placed at equal intervals on a circle 3 m from the center of the distributed microphone array, and the microphone nodes were placed at equal intervals 0.3 m from the center of the array. The reverberation time of the room in which the experiment was conducted was 240 ms.

The signal-to-interference ratio (SIR), which indicates the energy ratio between the target sound source and the other sound sources, was used as the evaluation metric. A higher SIR indicates more accurate sound source separation. Three conditions were prepared to examine how performance changes with the sampling frequency mismatch between nodes. Condition (0,0,0) corresponds to the case in which the sampling frequencies of all three nodes match. Condition (0,4,8) corresponds to the case in which the sampling frequency of the second node is offset by +4 samples/second and that of the third node by +8 samples/second relative to the first node. Condition (0,16,32) corresponds to the case in which the sampling frequencies of the second and third nodes are offset by +16 and +32 samples/second, respectively. In all conditions, the sampling frequency of the first node was 16 kHz. In a preliminary experiment, two IC recorders of the same model from the same manufacturer differed by no more than 1 sample/second, whereas two IC recorders from different manufacturers differed by as much as about 30 samples/second. The offsets above were chosen based on this preliminary experiment.
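To make these quantities concrete, the sketch below computes an SIR value and the size of the sampling frequency offsets. The text defines SIR only as an energy ratio. Expressing it in decibels and having the target and interference components of a separated output available separately (as is possible in simulation) are assumptions of this sketch. The function names are hypothetical.

```python
import math


def sir_db(target, interference):
    """SIR in dB from the target and interference components of one output."""
    e_target = sum(abs(x) ** 2 for x in target)
    e_interf = sum(abs(x) ** 2 for x in interference)
    return 10.0 * math.log10(e_target / e_interf)


def offset_ppm(offset_hz, nominal_hz=16000.0):
    """Sampling frequency offset relative to the nominal rate, in parts per million."""
    return 1e6 * offset_hz / nominal_hz


def drift_samples(offset_hz, seconds):
    """Number of samples by which an offset node runs ahead after a given time."""
    return offset_hz * seconds


ratio = sir_db([1.0, 1.0], [0.1, 0.1])  # target energy 2, interference energy 0.02
worst = offset_ppm(32.0)                # third node in condition (0,16,32)
ahead = drift_samples(32.0, 10.0)       # misalignment accumulated over 10 s
```

Under these assumptions, the +32 samples/second offset corresponds to 2000 ppm of the 16 kHz nominal rate, and the third node runs 320 samples (20 ms) ahead of the first node after 10 seconds. This illustrates why features that compare phases across nodes become unreliable.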

FIG. 9 shows the experimental results. "Conventional method (whole array)" is the result of using the conventional sound source separation technique described in Non-Patent Document 1, extracting the complex feature vector of Equation (1) from all microphones of all nodes, and performing sound source separation. "Conventional method (per node)" is the result of using the conventional technique of Non-Patent Document 1, extracting the complex feature vector of Equation (1) and performing sound source separation separately for each node, and generating the separated signal of each speaker from the node closest to that speaker. "Proposed method (second embodiment)" is the result of using the sound source separation technique of the second embodiment described above.

The "conventional method (whole array)" is strongly affected by the sampling frequency mismatch, and its performance degrades as the mismatch increases. The "conventional method (per node)" was not affected by the sampling frequency mismatch because it processes each node separately. However, partly because it uses only two microphones, its SIR is low overall and it does not achieve high separation performance. The "proposed method" is inferior to the "conventional method (whole array)" when there is no sampling frequency mismatch, but it greatly outperforms both conventional methods when there is a mismatch, which shows that it achieves accurate sound source separation stably. These results confirm that the sound source separation technique of the present invention can stably perform accurate sound source separation in various distributed microphone array environments.

[Program, Recording Medium]
The present invention is not limited to the embodiments described above, and it goes without saying that modifications can be made as appropriate without departing from the spirit of the invention. The various processes described in the above embodiments need not be executed in time series in the order described. They may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

When the various processing functions of each device described in the above embodiments are implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer implements the various processing functions of each device on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer over a network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from a server computer in its own storage device. When executing processing, the computer reads the program stored in its own recording medium and performs processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and perform processing according to it, or the computer may perform processing according to the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are implemented only through execution instructions and retrieval of results. The program in this embodiment includes information that is provided for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be implemented in hardware.

Description of Symbols
1 Complex feature vector calculation unit
2 Energy feature vector calculation unit
3, 4, 5 Speech presence probability calculation unit
6 Filtering unit
41 Posterior probability calculation means
42 Parameter estimation means
43 Iterative processing means
51 First presence probability calculation means
52 Second presence probability calculation means
53 Posterior probability calculation means
54 Parameter estimation means
55 Iterative processing means

Claims

An energy feature vector calculation unit that calculates, from an observation signal obtained by collecting a mixed signal in which a plurality of target signals overlap using a microphone array composed of two or more nodes each including one or more microphones, an energy feature vector that represents the energy of each node and does not include phase information;
A speech presence probability calculation unit that calculates, based on the energy feature vector, a speech presence probability that indicates the probability that speech of each target signal is present and that does not depend on phase information;
A filtering unit that obtains an estimated value of the target signal by multiplying the observation signal by the speech presence probability;
Including
||·|| is the norm of ·, n is the number of the node, t is the number of the time frame, y(t) is the observation signal of the microphone array in the t-th time frame, and y_n(t) is the observation signal of the n-th node in the t-th time frame,
The energy feature vector calculation unit calculates the energy feature vector for n = 1,...

The sound existence probability calculating unit obtains the sound existence probability by modeling the energy feature vector with a Dirichlet mixture distribution.

An energy feature vector calculation unit that calculates, from an observation signal obtained by collecting a mixed signal in which a plurality of target signals overlap using a microphone array composed of two or more nodes each including one or more microphones, an energy feature vector that represents the energy of each node and does not include phase information;
A complex feature vector calculation unit that calculates a complex feature vector by normalizing the observation signal for each node;
A speech presence probability calculation unit that calculates, based on the energy feature vector, a speech presence probability that indicates the probability that speech of each target signal is present and that does not depend on phase information;
A filtering unit that obtains an estimated value of the target signal by multiplying the observation signal by the speech presence probability;
Including
The speech existence probability calculating unit calculates a first speech existence probability indicating a probability that a speech for each target signal exists based on the energy feature vector, and a speech for each target signal is calculated based on the complex feature vector. Calculating a second voice presence probability indicating a probability of existence, and obtaining the voice presence probability by integrating the first voice presence probability and the second voice presence probability;
A sound source separation device characterized by that.

The sound source separation device according to claim 1,
Further including a complex feature vector calculation unit that calculates a complex feature vector by normalizing the observation signal for each node;
The speech existence probability calculating unit calculates a first speech existence probability indicating a probability that a speech for each target signal exists based on the energy feature vector, and a speech for each target signal is calculated based on the complex feature vector. Calculating a second voice presence probability indicating a probability of existence, and obtaining the voice presence probability by integrating the first voice presence probability and the second voice presence probability;
A sound source separation device characterized by that.

The sound source separation device according to claim 2 or 3,
||·|| is the norm of ·, n is the number of the node, t is the number of the time frame, and y_n(t) is the observation signal of the n-th node in the t-th time frame,
The complex feature vector calculation unit calculates a complex feature vector according to the following equation:

The speech existence probability calculator calculates the first speech existence probability by modeling the energy feature vector with a Dirichlet mixture distribution, and calculates the second speech existence probability by modeling the complex feature vector with a Watson mixture distribution. A sound source separation device characterized by:

An energy feature vector calculation step in which an energy feature vector calculation unit calculates, from an observation signal obtained by collecting a mixed signal in which a plurality of target signals overlap using a microphone array composed of two or more nodes each including one or more microphones, an energy feature vector that represents the energy of each node and does not include phase information;
A speech presence probability calculation step in which a speech presence probability calculation unit calculates, using the energy feature vector, a speech presence probability that indicates the probability that speech of each target signal is present and that does not depend on phase information;
A filtering step of obtaining an estimated value of the target signal by multiplying the observation signal by the speech presence probability;
Including,
||·|| is the norm of ·, n is the number of the node, t is the number of the time frame, y(t) is the observation signal of the microphone array in the t-th time frame, and y_n(t) is the observation signal of the n-th node in the t-th time frame,
The energy feature vector calculation step calculates the energy feature vector according to the following equation for n = 1,...

The speech existence probability calculation step obtains the speech presence probability by modeling the energy feature vector with a Dirichlet mixture distribution.
A sound source separation method characterized by the above .

An energy feature vector calculation step in which an energy feature vector calculation unit calculates, from an observation signal obtained by collecting a mixed signal in which a plurality of target signals overlap using a microphone array composed of two or more nodes each including one or more microphones, an energy feature vector that represents the energy of each node and does not include phase information;
A complex feature vector calculation step in which a complex feature vector calculation unit calculates a complex feature vector by normalizing the observation signal for each node;
A speech presence probability calculation step in which a speech presence probability calculation unit calculates, based on the energy feature vector, a speech presence probability that indicates the probability that speech of each target signal is present and that does not depend on phase information;
A filtering step of obtaining an estimated value of the target signal by multiplying the observation signal by the voice existence probability;
Including
The speech presence probability calculating step calculates a first speech presence probability indicating a probability that a speech for each target signal exists based on the energy feature vector, and a speech for each target signal based on the complex feature vector Calculating a second voice presence probability indicating a probability of existence, and obtaining the voice presence probability by integrating the first voice presence probability and the second voice presence probability;
A sound source separation method characterized by the above.

A program for causing a computer to function as the sound source separation device according to any one of claims 1 to 4.