JP6314393B2

JP6314393B2 - Acoustic signal analyzing apparatus, acoustic signal analyzing method, and computer program

Info

Publication number: JP6314393B2
Application number: JP2013189156A
Authority: JP
Inventors: 陽前澤
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2013-09-12
Filing date: 2013-09-12
Publication date: 2018-04-25
Anticipated expiration: 2033-09-12
Also published as: JP2015055766A

Description

The present invention relates to an acoustic signal analyzing apparatus that, from a mixed sound in which sounds generated by a plurality of sound sources are mixed, estimates the sound generated by each sound source.

Acoustic signal analyzing apparatuses that extract the sound generated by each sound source from a mixed sound are known, for example as described in Non-Patent Documents 1 and 2 below. The apparatus described in Non-Patent Document 1 learns a model of a first sound (a speaker's voice) and a model of a second sound (background music) in advance, and uses the learning results to extract the first sound and the second sound from the mixed sound.

The apparatus described in Non-Patent Document 2 uses independent component analysis to separate the mixed sound into a disturbance and other components.

Non-Patent Document 1: Hakan Erdogan, Emad M. Grais, "Semi-Blind Speech-Music Separation Using Sparsity and Continuity Priors", Pattern Recognition (ICPR), 2010 20th International Conference on, Aug. 2010, pp. 4573-4576
Non-Patent Document 2: Satoshi Sawada, Jani Even, Hiroshi Saruwatari, Kiyohiro Shikano, Tomoya Takatani, "Examination of microphone array placement in an internal noise suppression type robot spoken dialogue system", AI Challenge Study Group, November 2009, pp. 26-31

The apparatus of Non-Patent Document 1 can handle only mixed sounds composed of sounds for which models have been learned. In other words, it lacks versatility.

The apparatus of Non-Patent Document 2 requires as many sound collecting devices (microphones) as there are sound sources, which makes the configuration of the apparatus complicated.

Furthermore, the apparatuses of Non-Patent Documents 1 and 2 can be used to obtain a maximum likelihood estimate of the sound field characteristics of the room in which the apparatus is installed. The reliability of the estimated sound field characteristics depends on the sounds generated by the sound sources: the reliability is low in frequency bands that are not contained in those sounds. Because not all of the sounds are known (the sound generated by at least one sound source is unknown), it is impossible to determine in which frequency bands the estimated sound field characteristics are reliable and in which they are not. It is therefore difficult to estimate the sound generated by each sound source accurately.

The present invention has been made to address the above problems. Its object is to provide an acoustic signal analyzing apparatus, an acoustic signal analyzing method, and a computer program that are highly versatile and can accurately estimate, from a mixed sound, the sound generated by each sound source. In the following description of the constituent features of the invention, the reference numerals of the corresponding parts of the embodiments are given in parentheses to facilitate understanding; however, the constituent features should not be construed as being limited to the configurations of the corresponding parts indicated by those reference numerals.

To achieve the above object, the present invention is characterized by an acoustic signal analyzing apparatus (10, 20) comprising: sound emitting means (161) for emitting a predetermined first sound; sound collecting means (162) for collecting a mixed sound that includes the emitted first sound and a second sound emitted from a sound source different from the sound emitting means; and estimating means (S12 to S15, S23 to S31) for simultaneously performing Bayesian estimation of the second sound and of the characteristics of the sound field in which the sound emitting means and the sound collecting means are installed, based on the collected mixed sound and the first sound. The mixed sound consists of the first sound convolved with the characteristics of the sound field, and the second sound. The characteristics of the sound field are expressed as a set of coefficients by which the intensities of the frequency components of the first sound are multiplied. The estimating means sets a lower-bound function that is a lower bound of the posterior distribution of a generative model in which the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a complex normal distribution and a generalized inverse Gaussian distribution, respectively, the lower-bound function being expressed using a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field. The estimating means approximately estimates the posterior distribution by iteratively updating the auxiliary variables and the parameters to determine the lower-bound function.

The present invention is also characterized by an acoustic signal analyzing apparatus comprising: sound emitting means for emitting a predetermined first sound; sound collecting means for collecting a mixed sound that includes the emitted first sound and a second sound emitted from a sound source different from the sound emitting means; and estimating means for simultaneously performing Bayesian estimation of the second sound and of the characteristics of the sound field in which the sound emitting means and the sound collecting means are installed, based on the collected mixed sound and the first sound. The mixed sound consists of the first sound convolved with the characteristics of the sound field, and the second sound. The characteristics of the sound field are expressed as a set of coefficients by which the intensities of the frequency components of the first sound are multiplied. The estimating means sets a lower-bound function that is a lower bound of the posterior distribution of a generative model in which the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a Poisson distribution and a gamma distribution, respectively, the lower-bound function being expressed using a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field. The estimating means approximately estimates the posterior distribution by iteratively updating the auxiliary variables and the parameters to determine the lower-bound function.

The present invention is also characterized by an acoustic signal analyzing apparatus comprising: sound emitting means for emitting a predetermined first sound; sound collecting means for collecting a mixed sound that includes the emitted first sound and a second sound emitted from a sound source different from the sound emitting means; and estimating means for simultaneously performing Bayesian estimation of the second sound and of the characteristics of the sound field in which the sound emitting means and the sound collecting means are installed, based on the collected mixed sound and the first sound. The mixed sound consists of the first sound convolved with the characteristics of the sound field, and the second sound convolved with the characteristics of the sound field. The characteristics of the sound field are expressed as a set of coefficients by which the intensities of the frequency components of the first sound are multiplied. The estimating means sets a lower-bound function that is a lower bound of the posterior distribution of a generative model in which the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a complex normal distribution and a generalized inverse Gaussian distribution, respectively, the lower-bound function being expressed using a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field. The estimating means approximately estimates the posterior distribution by iteratively updating the auxiliary variables and the parameters to determine the lower-bound function.

The present invention is also characterized by an acoustic signal analyzing apparatus comprising: sound emitting means for emitting a predetermined first sound; sound collecting means for collecting a mixed sound that includes the emitted first sound and a second sound emitted from a sound source different from the sound emitting means; and estimating means for simultaneously performing Bayesian estimation of the second sound and of the characteristics of the sound field in which the sound emitting means and the sound collecting means are installed, based on the collected mixed sound and the first sound. The mixed sound consists of the first sound convolved with the characteristics of the sound field, and the second sound convolved with the characteristics of the sound field. The characteristics of the sound field are expressed as a set of coefficients by which the intensities of the frequency components of the first sound are multiplied. The estimating means sets a lower-bound function that is a lower bound of the posterior distribution of a generative model in which the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a Poisson distribution and a gamma distribution, respectively, the lower-bound function being expressed using a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field. The estimating means approximately estimates the posterior distribution by iteratively updating the auxiliary variables and the parameters to determine the lower-bound function.

With the acoustic signal analyzing apparatus configured as described above, there is no need to learn in advance models of the sounds that make up the mixed sound (or models of the sound sources that generate them), so the singing voice and the sound field characteristics can be estimated for any mixed sound. In other words, the acoustic signal analyzing apparatus 10 is more versatile than the apparatus of Non-Patent Document 1.

The estimated sound field characteristics depend strongly on the frequency characteristics of the musical sound (direct sound). According to the present invention, however, the posterior distribution of the sound field characteristics is estimated, so the uncertainty of the estimated sound field characteristics can be assessed. Specifically, a frequency band in which the variance of the estimated posterior distribution exceeds a predetermined threshold can be judged to have low reliability, and a band in which the variance is at or below the threshold can be judged to have high reliability. If the sound field characteristics in low-reliability bands are then corrected, for example with an equalizer, the second sound can be extracted from the mixed sound more accurately.
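The reliability rule described above can be sketched as a simple threshold test on the posterior variance of each frequency band. The function name `classify_bands` and the example variances and threshold are illustrative assumptions, not values from the specification.

```python
def classify_bands(posterior_variances, threshold):
    """Label each frequency band by comparing its posterior variance to a threshold.

    A variance above the threshold is treated as low reliability; a variance
    at or below the threshold is treated as high reliability.
    """
    return ["low" if v > threshold else "high" for v in posterior_variances]


# Hypothetical posterior variances for four frequency bands.
variances = [0.02, 0.5, 0.1, 1.3]
labels = classify_bands(variances, threshold=0.1)
print(labels)
```

Bands labeled "low" are the ones that would be candidates for correction, for example with an equalizer.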

According to another feature of the present invention, the sound collecting means samples the second sound and the mixed sound in real time, and the estimating means approximately estimates the posterior distribution by updating and optimizing the expected value of the lower-bound function in real time. According to yet another feature, a parameter relating to the characteristics of the sound field, namely a parameter relating to the coefficient by which the intensity of a given frequency component is multiplied, is updated using an update formula set so that a weighting coefficient (η_n) corresponding to the number of updates (n) is applied only to the sum of the intensities of that frequency component in the spectrum of the first sound from the start of sound generation to the present, and to the sum of the intensities of that frequency component in the spectrum of the mixed sound from the start of sound generation to the present.

With this arrangement, weights are applied only to specific variables, so an increase in the number of iterations needed to determine the lower-bound function can be suppressed compared with adopting the technique known as deterministic annealing.

FIG. 1 is a block diagram showing the process by which the mixed sound is generated in the first embodiment of the present invention.
FIG. 2 is a block diagram showing the structure of the sound field characteristics in FIG. 1.
FIG. 3 is a block diagram showing the configuration of the acoustic signal analyzing apparatus according to the first and second embodiments of the present invention.
FIG. 4 is a flowchart showing the procedure for batch processing of an acoustic signal.
FIG. 5 is a flowchart showing the procedure for real-time processing of an acoustic signal.
FIG. 6 is a block diagram showing the process by which the mixed sound is generated in the second embodiment of the present invention.
FIG. 7 is a block diagram showing the structure of the sound field characteristics in FIG. 6.

(First embodiment)
An acoustic signal analyzing apparatus 10 according to the first embodiment of the present invention will now be described, beginning with an overview. The apparatus 10 emits a predetermined sound (in this embodiment, a predetermined piece of music) from a sound emitting device (speaker 161; see FIG. 3) and collects the sound around the apparatus (in this embodiment, a singer's singing voice) with a sound collecting device (microphone 162; see FIG. 3). In this embodiment, the sound emitting device and the sound collecting device are installed far apart from each other. The sound collecting device therefore also collects the musical sound convolved with the acoustic characteristics of the room in which the two devices are installed (hereinafter, the sound field characteristics). That is, it collects not only the sound emitted from the sound emitting device (direct sound) but also the sound reflected by the walls, floor, and other surfaces of the room (reverberation). The sound collecting device is installed close to the singer, so the singing voice it collects is not affected by the sound field. In this embodiment, the sound in which the musical sound including reverberation and the singing voice (direct sound only) are mixed is called the mixed sound.

The relationship among the power spectrum Y of the mixed sound, the power spectrum X of the musical sound (direct sound), the sound field characteristics H, and the power spectrum S of the singing voice can be represented by the block diagrams shown in FIGS. 1 and 2. This model can be formulated as equation (1) below. The sound field characteristics H and the power spectrum S of the singing voice are not directly observed and are therefore latent variables of the model. The musical sound corresponds to the first sound of the present invention, and the singing voice corresponds to the second sound. The acoustic signal analyzing apparatus 10 records (samples) the mixed sound and, using the recorded mixed sound as observed data, simultaneously performs Bayesian estimation of the singing voice and the characteristics of the sound field.

Here, f is the index of a frequency bin (f = 1, 2, ..., F) and t is the index of a time frame (hereinafter simply a frame; t = 1, 2, ..., T). Thus X_{f,t} is the intensity (amplitude) of the f-th frequency bin in the t-th frame of the music, Y_{f,t} is the intensity (amplitude) of the f-th frequency bin in the t-th frame of the mixed sound, and S_{f,t} is the intensity (amplitude) of the f-th frequency bin in the t-th frame of the singing voice. The sound field characteristics H are expressed as a set of coefficients H_{f,i}, where i is the coefficient index (i = 1, 2, ..., I). That is, H_{f,i} is the coefficient by which the intensity (amplitude) of the f-th frequency bin delayed by i frames is multiplied.
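Equation (1) is not reproduced in this text. A form consistent with the definitions above would be Y_{f,t} = Σ_i H_{f,i} X_{f,t-i} + S_{f,t}. The sketch below computes that sum for small nested lists. The exact delay offset (whether the sum starts at a delay of one frame or zero) and the treatment of frames before the start of the music as silence are assumptions.

```python
def mix_power_spectrum(X, H, S):
    """Combine music X, sound field coefficients H, and singing voice S per bin.

    X[f][t] and S[f][t] are intensities of bin f in frame t (0-based lists).
    H[f][i - 1] is the coefficient applied to bin f delayed by i frames.
    Frames before the start of the music are treated as silence (assumption).
    """
    F, T = len(X), len(X[0])
    Y = [[0.0] * T for _ in range(F)]
    for f in range(F):
        for t in range(T):
            reverberant = sum(
                H[f][i - 1] * X[f][t - i]
                for i in range(1, len(H[f]) + 1)
                if t - i >= 0
            )
            Y[f][t] = reverberant + S[f][t]
    return Y


# One frequency bin, three frames, two delay coefficients.
X = [[1.0, 0.0, 0.0]]   # an impulse of music in the first frame
H = [[0.5, 0.25]]       # coefficients for delays of 1 and 2 frames
S = [[0.0, 0.0, 1.0]]   # singing voice only in the last frame
Y = mix_power_spectrum(X, H, S)
print(Y)
```

In this toy case the impulse reappears one and two frames later, scaled by the two coefficients, and the singing voice is added without passing through the sound field, as in the first embodiment.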

Next, the configuration of the acoustic signal analyzing apparatus 10 will be described. As shown in FIG. 3, the apparatus 10 includes input operators 11, a computer unit 12, a display 13, a storage device 14, an external interface circuit 15, and a sound system 16, which are connected to one another via a bus BS.

The input operators 11 consist of switches for on/off operation (for example, a numeric keypad for entering numbers), volume controls or rotary encoders for rotary operation, volume controls or linear encoders for sliding operation, a mouse, a touch panel, and the like. They are operated by hand by the performer and are used to start or stop processing, to set various parameters for the analysis of the acoustic signal, and so on. When an input operator 11 is operated, operation information representing the operation is supplied via the bus BS to the computer unit 12 described below.

The computer unit 12 consists of a CPU 12a, a ROM 12b, and a RAM 12c, each connected to the bus BS. The CPU 12a reads from the ROM 12b and executes a program that describes the procedure for estimating the singing voice and the characteristics of the sound field from the mixed sound. In addition to the program, the ROM 12b stores various data such as initial setting parameters and the graphic data and character data used to generate display data representing the images shown on the display 13. The RAM 12c temporarily stores data needed while the program runs. For example, the RAM 12c stores mixed sound data consisting of a plurality of sample values obtained by sampling the mixed sound collected by the microphone 162, described below, at a predetermined sampling period (for example, 1/44100 s).

The display 13 is a liquid crystal display (LCD). The computer unit 12 uses the graphic data, character data, and the like to generate display data representing the content to be displayed and supplies it to the display 13. The display 13 displays an image based on the display data supplied from the computer unit 12.

The storage device 14 consists of large-capacity nonvolatile recording media such as an HDD, FDD, CD, or DVD, and the drive units corresponding to those media. The storage device 14 stores music data representing the predetermined piece of music. The music data consists of a plurality of sample values obtained by sampling a performance of the piece at a predetermined sampling period (for example, 1/44100 s), and the sample values are recorded in order at consecutive addresses in the storage device 14. The music data also includes title information representing the title of the piece, data size information representing its size, and the like. The music data may be stored in the storage device 14 in advance, or may be imported from outside via the external interface circuit 15 described below.

The external interface circuit 15 has connection terminals that allow the acoustic signal analyzing apparatus 10 to be connected to external equipment such as an electronic music device or a personal computer. Via the external interface circuit 15, the apparatus 10 can also be connected to a communication network such as a LAN (Local Area Network) or the Internet.

The sound system 16 includes a D/A converter that converts the music data into an analog sound signal, an amplifier that amplifies the converted analog sound signal, and the speaker 161, which converts the amplified analog sound signal into sound and emits it. The sound system 16 also includes the microphone 162 for collecting the mixed sound and an A/D converter that converts the collected mixed sound, an analog sound signal, into a digital sound signal. Whereas the apparatus of Non-Patent Document 2 has a plurality of microphones, this embodiment has only the single microphone 162.

Next, the operation of the acoustic signal analyzing apparatus 10 configured as described above (the procedure for estimating the singing voice and the sound field characteristics) will be described. As shown in FIG. 4, the estimation process starts at step S10. In step S11, various variables (the auxiliary variables and posterior distribution parameters described below, among others) are initialized. In step S12, the music data is supplied to the sound system 16 so that the speaker 161 begins emitting the music, and sampling of the mixed sound collected by the microphone 162 begins. The sampled mixed sound data is stored in the RAM 12c. When emission of the music ends, the singing voice and the sound field characteristics are estimated simultaneously (jointly) by Bayesian estimation, using the mixed sound data stored in the RAM 12c as observed data, as described below.

Assuming that the spectrum of the singing voice, the spectrum of the musical sound, and the spectrum of the mixed sound (short-time Fourier transforms) are generated from complex normal distributions, a generative model expressed by equations (2) to (4) below can be constructed. In equations (2) to (4), GIG(a, b, c) denotes the generalized inverse Gaussian distribution defined by the parameters a, b, and c. In the following description, where necessary, the name of the variable to which a parameter relates (S, H, and so on) is written in parentheses as a superscript. For example, the parameter written a^(H) relates to the sound field characteristics.
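Equations (2) to (4) are not reproduced in this text. As background for the complex normal assumption only, the sketch below draws samples of one spectral bin from a zero-mean circularly symmetric complex normal distribution and checks that the average squared magnitude approaches the variance, which is why such models are often parameterized by a power value. The function name, the variance value, and the parameterization itself are illustrative assumptions and are not taken from the specification.

```python
import random


def sample_complex_normal(variance, rng):
    """Draw one circularly symmetric complex normal value with E[|z|^2] = variance."""
    # The real and imaginary parts are independent and each carry half the variance.
    sigma = (variance / 2.0) ** 0.5
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))


rng = random.Random(0)          # fixed seed so the run is repeatable
power = 4.0                     # hypothetical power of one time-frequency bin
samples = [sample_complex_normal(power, rng) for _ in range(20000)]
mean_power = sum(abs(z) ** 2 for z in samples) / len(samples)
print(mean_power)               # close to `power` for a large sample
```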

The posterior distribution of the above model is computed using the variational Bayesian method. The log joint distribution is expressed by equation (5) below, in which constant terms are omitted.

However, the variational Bayesian method cannot be applied to equation (5), so a lower bound is established using an auxiliary function. Specifically, a lower-bound function such as equation (6) below is set, and the newly introduced auxiliary variables M and Φ are updated.
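Equation (6) is not reproduced in this text, so the specific bound used in the specification is not shown here. As a general illustration of how an auxiliary variable yields a lower bound, the sketch below uses Jensen's inequality for the logarithm of a sum: for weights φ_k that are positive and sum to one, Σ_k φ_k log(a_k / φ_k) ≤ log(Σ_k a_k), with equality when φ_k is proportional to a_k. Whether the auxiliary variable Φ in the specification plays exactly this role is an assumption.

```python
import math


def log_sum(a):
    """The quantity to be bounded: the log of a sum of positive terms."""
    return math.log(sum(a))


def jensen_lower_bound(a, phi):
    """Lower bound on log(sum(a)) for weights phi on the probability simplex."""
    return sum(p * math.log(x / p) for x, p in zip(a, phi) if p > 0)


def optimal_phi(a):
    """Weights that make the bound tight: phi_k proportional to a_k."""
    total = sum(a)
    return [x / total for x in a]


a = [1.0, 2.0, 5.0]
loose = jensen_lower_bound(a, [1.0 / 3.0] * 3)   # arbitrary weights
tight = jensen_lower_bound(a, optimal_phi(a))    # optimized weights
print(loose, tight, log_sum(a))
```

Optimizing the auxiliary weights tightens the bound, which is the general idea behind alternating between auxiliary variables and model parameters.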

Specifically, in step S13, the auxiliary variable M is optimized under the conditions defined by equations (7) to (10), and the auxiliary variable Φ is optimized under the condition defined by equation (11).

Next, in step S14, the parameters of the posterior distribution are updated using equations (12) and (13) below. The parameters appearing in equations (12) and (13) are defined by equations (14) to (19) below.
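Steps S13 and S14 alternate between two groups of quantities, which is a coordinate-ascent pattern. The sketch below shows that pattern with a stopping rule based on the change in the objective. The toy objective, the two update rules, the tolerance, and all names are stand-ins chosen for illustration; they are not equations (7) to (19).

```python
def coordinate_ascent(objective, update_aux, update_params, aux, params,
                      tol=1e-10, max_iter=1000):
    """Alternate two update steps until the objective stops improving."""
    previous = objective(aux, params)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        aux = update_aux(params)        # plays the role of step S13
        params = update_params(aux)     # plays the role of step S14
        current = objective(aux, params)
        if abs(current - previous) < tol:   # convergence test
            break
        previous = current
    return aux, params, iteration


def toy_bound(m, p):
    """Toy concave objective standing in for the lower-bound function."""
    return -(m - 1.0) ** 2 - (p - m) ** 2


# Each update maximizes the toy objective over one variable with the other fixed.
m_final, p_final, n_iter = coordinate_ascent(
    toy_bound,
    update_aux=lambda p: (1.0 + p) / 2.0,
    update_params=lambda m: m,
    aux=0.0,
    params=0.0,
)
print(m_final, p_final, n_iter)
```

Both variables approach 1.0, the maximizer of the toy objective, and the loop stops once an iteration changes the objective by less than the tolerance.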

Next, in step S15, it is determined whether the lower-bound function has converged, that is, whether the auxiliary variables M and Φ and the parameters of the posterior distribution have converged. If the lower-bound function has not converged, the determination is "No", and the auxiliary variables M and Φ and the parameters of the posterior distribution are updated again in steps S13 and S14. If the lower-bound function has converged, the determination is "Yes", and the estimation processing of the singing voice and the sound field characteristics ends in step S16. By iteratively updating the auxiliary variables M and Φ and the parameters of the posterior distribution in this way, the lower-bound function is determined and the posterior distribution is computed approximately. As a result, the singing voice and the sound field characteristics are estimated simultaneously (jointly) by Bayesian estimation. The singing voice can be extracted from the mixed sound by applying the mask given by the following equation (20) to the spectrum (short-time Fourier transform) of the mixed sound in the t-th frame and computing its inverse Fourier transform.
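As an illustration only, the masking step may be sketched as follows. Equation (20) is not reproduced in this text, so the sketch assumes a Wiener-style ratio mask built from hypothetical per-bin power estimates `voice_power` and `music_power`; the function names are likewise illustrative, and a naive inverse DFT stands in for the inverse Fourier transform.

```python
import cmath

def apply_mask(mixed_frame, voice_power, music_power, eps=1e-12):
    """Scale each complex bin of one mixed-sound frame by an assumed ratio mask."""
    masked = []
    for y, s, m in zip(mixed_frame, voice_power, music_power):
        mask = s / (s + m + eps)  # assumed form; equation (20) may differ
        masked.append(mask * y)
    return masked

def inverse_dft(spectrum):
    """Naive O(N^2) inverse DFT of one frame (illustration only)."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * i / n) for k, c in enumerate(spectrum)) / n
        for i in range(n)
    ]

# One 4-bin frame of the mixed-sound spectrum (hypothetical values).
frame = [4 + 0j, 1 - 1j, 0 + 0j, 1 + 1j]
voice = apply_mask(frame, voice_power=[1.0, 3.0, 0.0, 3.0], music_power=[3.0, 1.0, 1.0, 1.0])
samples = inverse_dft(voice)
```

In practice the masked frames would be overlap-added after the inverse transform; that step is omitted here.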

With the acoustic signal analyzer 10 configured as described above, there is no need to learn in advance a model of each sound constituting the mixed sound (or a model of each sound source that generates such a sound), so the singing voice and the sound field characteristics can be estimated for any mixed sound. In other words, the acoustic signal analyzer 10 is more versatile than the acoustic signal analyzer of Non-Patent Document 1.

In addition, unlike the acoustic signal analyzer of Non-Patent Document 2, only a single microphone is required, so the configuration of the apparatus is simple.

The estimated sound field characteristics depend strongly on the frequency characteristics of the musical sound (direct sound). In this embodiment, however, the posterior distribution of the sound field characteristics is estimated, so the uncertainty of the estimated sound field characteristics can be assessed. Specifically, a frequency band in which the variance of the posterior distribution of the estimated sound field characteristics exceeds a predetermined threshold (for example, a predetermined value, or a value proportional to the power of a specific frequency bin in the power spectrum X of the musical sound) can be judged to have low reliability, and a frequency band in which the variance is at or below the threshold can be judged to have high reliability. If the sound field characteristics in the low-reliability frequency bands are corrected using an equalizer or the like, the singing voice can be extracted from the mixed sound more accurately.
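The reliability judgment described above amounts to a per-band threshold test on the posterior variance. A minimal sketch follows; the function names, the variance values, and the `scale` factor are hypothetical and chosen only for illustration.

```python
def classify_reliability(posterior_variance, threshold):
    """Return True for bands whose posterior variance is at or below the threshold."""
    return [v <= threshold for v in posterior_variance]

def proportional_threshold(music_power_bin, scale):
    """Threshold proportional to the power of one bin of the musical sound spectrum X."""
    return scale * music_power_bin

# Hypothetical posterior variances of the sound field characteristics, one per band.
variance = [0.02, 0.50, 0.08, 1.20]
reliable = classify_reliability(variance, threshold=0.10)
```

Bands flagged as unreliable would then be candidates for correction, for example with an equalizer.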

Instead of the generative model represented by equations (2) to (4) in the above embodiment, the generative model represented by the following equations (21) to (23) may be adopted.

In this case, by using the reproductive property of the Poisson distribution and constraining the auxiliary variables M^(S) and M^(H) to satisfy the following equation (24), the generative model represented by equations (21) to (23) becomes equivalent to the generative model represented by the following equations (25) to (28).

The log joint distribution is then expressed as in the following equation (29).

The posterior distribution is updated using the following equations (30) to (33).

Note that Z_{f,t} in equation (31) is a normalization coefficient as given by the following equation (34). The other parameters are defined as in the following equations (35) to (41).

In this case, the singing voice can be extracted from the mixed sound by attaching the phase of the short-time Fourier transform of the mixed sound to the mean of "S_{f,t}".
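The phase-attachment step may be sketched as follows. The sketch assumes that the mean of S_{f,t} is used directly as a magnitude; if it represents power, a square root would be taken first, which the text does not specify. The function name `attach_phase` is hypothetical.

```python
import cmath

def attach_phase(voice_mean, mixed_frame):
    """Combine estimated magnitudes with the phase of the mixed-sound STFT bins."""
    return [
        m * cmath.exp(1j * cmath.phase(y))  # keep magnitude m, borrow the phase of y
        for m, y in zip(voice_mean, mixed_frame)
    ]

# Two bins: estimated voice magnitudes and the corresponding mixed-sound bins.
voice_bins = attach_phase([2.0, 1.0], [3j, -5 + 0j])
```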

In the above embodiment, "I" must be fixed. However, "I" depends on the reverberation time of the sound field, so to be robust across various sound fields it is desirable to weaken the dependence on "I". To this end, a variable g_j defined as in the following equation (42) may be introduced into equation (21), and the model may be formulated as in the following equation (43). Here, the variable g_j represents the gain of the singing voice in the current frame and the gains of the musical sound in past frames.

With this formulation, the model behaves as if the value of "I" changed dynamically according to the sound field. As a result, the singing voice can be extracted from the mixed sound more accurately.

The first embodiment and its modification described above use formulas containing a term that sums over frames (a sum over "t"), so the singing voice and the sound field characteristics cannot be estimated in real time. The following describes a procedure that modifies the estimation procedure described using equations (21) to (43) so that the singing voice and the sound field characteristics are estimated in real time.

First, the posterior distribution is regarded as independent for each frequency bin, and attention is focused on a single frequency bin. The posterior distribution is then expressed as in the following equations (44) and (45).

The above posterior distribution is computed using the variational Bayes method. Specifically, the posterior distribution is computed approximately by optimizing an objective function such as the following equation (46). "H_q(θ)" in equation (46) is defined as in the following equation (47).

Here, as shown in equation (48), among the variables in the posterior distribution, those that are independent from frame to frame are optimized first. Optimizing the objective function J′ shown in equation (48) is equivalent to optimizing the objective function J shown in equation (46).

Furthermore, a random variable τ ~ Uniform(0, T), distributed uniformly between "0" and "T", is introduced, and an objective function J_τ is defined as in the following equation (49). The objective function J_τ is the objective function J′ obtained when all of the observation data are regarded as belonging to frame τ.

Since the observation data are uniformly distributed, the following equation (50) holds. By incrementing the frame index "t" by "1" at a time, evaluating the objective function J_τ(θ) in each frame, and accumulating its optimum values, the average value of J_I(θ) can be updated in real time.

The procedure of the real-time estimation processing of the singing voice and the sound field characteristics is described with reference to FIG. 5. In step S20, the real-time estimation processing is started. In step S21, various variables are initialized. Next, in step S22, playback of the music is started. Next, in step S23, the spectrum of the mixed sound corresponding to the t-th frame is computed. Specifically, sample values are stored sequentially in a buffer at a predetermined sampling period, and the spectrum of the mixed sound corresponding to the t-th frame is computed from the sample values in the buffer that belong to the t-th frame. Since "t" is initialized to "0", the spectrum corresponding to the first frame is computed first. Next, in step S24, the auxiliary variable Φ is updated. Next, in step S25, the latent variable S_t is updated. Next, in step S26, it is determined whether the auxiliary variable Φ has converged. If it has not converged, the determination is "No", steps S24 and S25 are executed, and the auxiliary variable Φ and the latent variable S_t are updated again. If the auxiliary variable Φ has converged, the determination is "Yes", and in step S27 the acoustic signal of the singing voice is reconstructed based on the mean of the latent variable S_t. Then, in step S28, the parameters of the posterior distribution are updated. Next, in step S29, the posterior distribution is updated. Here, "ρ_t" is a coefficient satisfying 0 ≤ ρ_t ≤ 1. When the following equation (51) is satisfied, convergence of the variable Φ is guaranteed.

Next, in step S30, the frame index is updated (incremented). Next, in step S31, it is determined whether processing of the final frame has been completed. That is, if the frame index (t) exceeds the index (T) of the final frame, it is determined that the final frame has already been processed, and the real-time processing ends in step S32. If the frame index (t) is less than or equal to the index (T) of the final frame, the processing of steps S23 to S31 is executed again.
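The frame-by-frame structure of steps S23 to S31 may be sketched as follows. The update formulas themselves are not reproduced in this text, so the sketch makes two assumptions: the posterior update of step S29 is taken to be a convex blend with weight ρ_t, and the schedule `step_size` is a hypothetical choice that satisfies 0 ≤ ρ_t ≤ 1. Whether that schedule matches equation (51) cannot be confirmed from the text. With ρ_t = 1/(t + 1), the blended value equals the running average of the per-frame estimates.

```python
def step_size(t, offset=1.0, decay=1.0):
    """Hypothetical schedule with 0 <= rho_t <= 1; equals 1/(t+1) by default."""
    return (t + offset) ** (-decay)

def online_update(frame_estimates, initial=0.0):
    """Blend per-frame estimates into one global parameter, one frame at a time."""
    theta = initial
    history = []
    for t, estimate in enumerate(frame_estimates):  # t starts at 0, as in the text
        rho = step_size(t)
        theta = (1.0 - rho) * theta + rho * estimate  # assumed form of step S29
        history.append(theta)
    return history

# Per-frame estimates stand in for the converged results of steps S24 to S28.
history = online_update([2.0, 4.0, 6.0])
```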

In the above real-time processing, "H" may be made hierarchical and the hyperparameters of its prior distribution may be optimized. In that case, "H" is independent from frame to frame, so "H" can be updated from the data of the current frame alone. As it stands, however, the number of samples becomes too small, and the estimation accuracy of the singing voice may deteriorate. Therefore, an appropriate prior distribution of "H" is updated from the posterior distribution of "H" estimated in each frame. That is, the following equation (52) is used in place of equation (22).

With this formulation, the update formula for "H_{f,t,i}" contains no term that sums over frames (a sum over "t"). Therefore, "H_{f,t,i}" is updated independently in each frame.

Next, the hyperparameters are optimized by maximizing the objective function shown in the following equation (53). Equations (54) to (57) hold for any "f" and "i".

Here, setting the partial derivative of G(a, b) with respect to "b" equal to "0" yields the relation "b = a/c". Setting the partial derivative of G(a, b) with respect to "a" equal to "0" yields the relation "d + log b − ψ(a) = 0". Once "a" is obtained, "b" follows, but obtaining "a" is difficult. Therefore, the following equation (58), which combines the two relations, is solved.

Here, letting exp(a′) = a, the objective function shown in the following equation (59) is derived.

Applying the Newton-Raphson method to equation (59) yields the update formula shown in the following equation (60).

That is, "a" is expressed as in the following equation (61).

Using an approximation of the digamma function, the following equations (62) and (63) hold.

Accordingly, an update formula such as the following equation (64) is derived. The parameters may be updated using this update formula.
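For illustration, the two relations stated in the text, b = a/c and d + log b − ψ(a) = 0, can be solved numerically with a Newton-Raphson iteration on a′ = log a. The sketch below is not the update of equation (64), which is not reproduced here; it is a plain Newton iteration under the assumption that the combined condition is log a − ψ(a) = log c − d. The digamma and trigamma helpers are series approximations written for this sketch, and all function names are hypothetical.

```python
import math

def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x+1) - 1/x plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def trigamma(x):
    """Trigamma via the recurrence psi'(x) = psi'(x+1) + 1/x^2 plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))

def solve_gamma_hyperparameters(c, d, iterations=50, tol=1e-12):
    """Solve b = a / c and d + log(b) - digamma(a) = 0 for (a, b)."""
    s = math.log(c) - d        # must be positive for a solution to exist
    log_a = math.log(0.5 / s)  # start from log(a) - digamma(a) ~ 1 / (2a)
    for _ in range(iterations):
        a = math.exp(log_a)
        g = log_a - digamma(a) - s   # residual of the combined condition
        dg = 1.0 - a * trigamma(a)   # derivative of g with respect to log(a)
        step = g / dg
        log_a -= step
        if abs(step) < tol:
            break
    a = math.exp(log_a)
    return a, a / c

# Self-consistency check: build (c, d) from known a = 3, b = 1.5 and recover them.
c_value = 2.0
d_value = digamma(3.0) - math.log(1.5)
a_hat, b_hat = solve_gamma_hyperparameters(c_value, d_value)
```

Working in the log domain keeps "a" positive throughout the iteration, which is the practical benefit of the substitution exp(a′) = a.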

In view of equations (41) and (42), it might seem sufficient simply to accumulate the history, for example by using an update formula such as the following equation (65).

However, the variational Bayes method relies on iterative computation, and the initial value of "S" is sparse (values close to "0" are favored). Meanwhile, the confidence in "a^(H)" and "b^(H)" increases as more observation data become available; in the initial stage the amount of observation data is small, so the confidence is low. Moreover, when equation (65) is used, new observations have only a weak influence relative to old observations. Therefore, as shown in the following equations (66) and (67), a variable η_n that weights the accumulated history is introduced. The variable η_n corresponds to a weighting coefficient, dependent on the number of updates, by which only the following two sums are multiplied: the sum of the intensities of a predetermined frequency component of the musical sound (first sound) spectrum from the start of sounding to the present, and the sum of the intensities of the predetermined frequency component of the mixed sound spectrum from the start of sounding to the present. Here, "n" denotes the number of iterations. In the initial stage (that is, when "n" is small), the variable η_n is set to a value close to "0", and it is gradually brought closer to "1" as the number of iterations increases.
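The role of η_n may be sketched as follows. Equations (66) and (67) are not reproduced in this text, so both the exponential ramp and the additive combination with a prior value are assumptions made for illustration; `history_weight`, `weighted_parameter`, and `time_constant` are hypothetical names.

```python
import math

def history_weight(n, time_constant=5.0):
    """Hypothetical ramp: 0 at n = 0, approaching 1 as the iteration count n grows."""
    return 1.0 - math.exp(-n / time_constant)

def weighted_parameter(prior_value, accumulated_sum, n):
    """Apply the weight only to the accumulated sum, not to the prior value."""
    return prior_value + history_weight(n) * accumulated_sum

early = weighted_parameter(prior_value=1.0, accumulated_sum=10.0, n=0)
late = weighted_parameter(prior_value=1.0, accumulated_sum=10.0, n=50)
```

Early iterations therefore rely mostly on the prior, while later iterations let the accumulated history contribute almost fully.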

Although the above description modifies the estimation procedure described using equations (21) to (43) to estimate the singing voice and the sound field characteristics in real time, the estimation procedure described using equations (2) to (20) can be modified in the same way to estimate the singing voice and the sound field characteristics in real time.

(Second Embodiment)
Next, an acoustic signal analyzer 20 according to a second embodiment of the present invention is described, beginning with an outline. Like the acoustic signal analyzer 10, the acoustic signal analyzer 20 emits predetermined music from a sound emitting device (speaker 161) and collects a singer's singing voice using a sound collecting device (microphone 162). In this embodiment, however, unlike the first embodiment, the sound emitting device and the sound collecting device are installed close to each other. In addition, the sound collecting device is installed at a position far from the singer and the sound emitting device. Consequently, the sound collecting device collects sound in which the acoustic characteristics (sound field characteristics) of the room where the sound emitting device and the sound collecting device are installed are convolved with the musical sound and the singing voice. That is, not only the sound emitted from the sound emitting device (direct sound) and the singer's singing voice (direct sound) but also sound reflected from the walls, floor, and other surfaces of the room (reverberation) is collected. In this embodiment, the sound obtained by mixing the sound of the music including reverberation with the singing voice including reverberation is called the mixed sound.

The relationship among the power spectrum Y of the mixed sound, the power spectrum X of the musical sound (direct sound), the sound field characteristics H, and the power spectrum S of the singing voice can be represented by block diagrams such as those shown in FIGS. 6 and 7, and this model can be formulated as in the following equation (68). The sound field characteristics H and the power spectrum S of the singing voice are not directly observed and are therefore latent variables in this model.
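A forward model consistent with this description may be sketched as follows for a single frequency bin. Equation (68) is not reproduced in this text, so the sketch assumes that the sound field characteristics act as a short sequence of per-delay coefficients applied to the sum of the musical sound power and the singing voice power in the current and past frames. The function name and the numeric values are hypothetical.

```python
def mixed_power(music, voice, field):
    """Predict the mixed-sound power over time for one frequency bin.

    music, voice: power per frame; field: assumed coefficients per frame delay.
    """
    predicted = []
    for t in range(len(music)):
        total = 0.0
        for delay, coefficient in enumerate(field):
            if t - delay >= 0:  # frames before the start contribute nothing
                total += coefficient * (music[t - delay] + voice[t - delay])
        predicted.append(total)
    return predicted

# Two-tap sound field: direct path (1.0) and one reverberant frame (0.5).
y = mixed_power(music=[1.0, 0.0, 0.0], voice=[0.0, 2.0, 0.0], field=[1.0, 0.5])
```

In the estimation problem only `y` and `music` are observed, while `voice` and `field` play the role of the latent variables S and H.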

The acoustic signal analyzer 20 records (samples) the mixed sound and, using the recorded mixed sound as observation data, estimates the singing voice and the sound field characteristics simultaneously (jointly) by Bayesian estimation. The rest of the configuration of the acoustic signal analyzer 20 is the same as that of the acoustic signal analyzer 10, and its description is omitted.

Next, the operation of the acoustic signal analyzer 20 configured as described above (the procedure for estimating the singing voice and the sound field characteristics) is described. As in the first embodiment, and as shown in FIG. 4, the estimation processing of the singing voice and the sound field characteristics is started in step S10. Next, in step S11, various variables (such as the auxiliary variables and posterior distribution parameters described later) are initialized. Next, in step S12, music data is supplied to the sound system 16, emission of the music from the speaker 161 is started, and sampling of the mixed sound collected by the microphone 162 is started. The sampled mixed sound data is stored in the RAM 12c. When emission of the music ends, the singing voice and the sound field characteristics are estimated simultaneously (jointly) by Bayesian estimation, as described below, using the mixed sound data stored in the RAM 12c as observation data.

If the singing voice spectrum, the musical sound spectrum, and the mixed sound spectrum (short-time Fourier transforms) are assumed to be generated from complex normal distributions, a generative model represented by the following equations (69) to (71) can be constructed.

The posterior distribution of the above model is computed using the variational Bayes method. The log joint distribution is expressed as in the following equation (72), in which constant terms are omitted.

However, the variational Bayes method cannot be applied to equation (72), so a lower bound is set using auxiliary functions. Specifically, a lower-bound function such as the following equation (73) is set, and the newly introduced auxiliary variables M and Φ are updated.

Specifically, in step S13, the auxiliary variable M is optimized under the conditions defined by equations (74) to (77), and the auxiliary variable Φ is optimized under the condition defined by equation (78).

Next, in step S14, the parameters of the posterior distribution are updated using the following equations (79) and (80). The parameters appearing in equations (79) and (80) are defined as in the following equations (81) to (86).

Next, in step S15, it is determined whether the lower-bound function has converged, that is, whether the auxiliary variables M and Φ and the parameters of the posterior distribution have converged. If the lower-bound function has not converged, the determination is "No", and the auxiliary variables M and Φ and the parameters of the posterior distribution are updated again in steps S13 and S14. If the lower-bound function has converged, the determination is "Yes", and the estimation processing of the singing voice and the sound field characteristics ends in step S16. The posterior distribution is computed approximately in this way. As a result, the singing voice and the sound field characteristics are estimated simultaneously (jointly) by Bayesian estimation. The singing voice can be extracted from the mixed sound by applying the mask given by the following equation (87) to the spectrum (short-time Fourier transform) of the mixed sound in the t-th frame and computing its inverse Fourier transform.

The acoustic signal analyzer 20 configured as described above provides the same effects as the first embodiment.

Instead of the generative model represented by equations (69) to (71) in the above embodiment, the generative model represented by the following equations (88) to (90) may be adopted.

In this case, by using the reproductive property of the Poisson distribution and constraining the auxiliary variables M^(S) and M^(H) to satisfy the following equation (91), the generative model represented by equations (88) to (90) becomes equivalent to the generative model represented by the following equations (92) to (95).

In this case, the log joint distribution is expressed as in the following equation (96).

The posterior distribution is updated using the following equations (97) to (100).

Note that Z_{f,t} in equation (98) is a normalization coefficient as given by the following equation (101). The other parameters are defined as in the following equations (102) to (108).

In the second embodiment as well, the sound field characteristics and the singing voice can be estimated in real time by modifying the formulas in the same manner as in the modification of the first embodiment.

10, 20: acoustic signal analyzer; 161: speaker; 162: microphone; Y: power spectrum of the mixed sound; X: power spectrum of the musical sound; H: sound field characteristics; S: power spectrum of the singing voice

Claims

  An acoustic signal analyzer comprising:
  sound emission means for emitting a predetermined first sound;
  sound collecting means for collecting a mixed sound including the emitted first sound and a second sound emitted from a sound source different from the sound emission means; and
  estimation means for simultaneously performing Bayesian estimation of the second sound and of the characteristics of the sound field in which the sound emission means and the sound collecting means are installed, based on the collected mixed sound and the first sound, wherein
  the mixed sound includes the second sound and a sound in which the first sound is convolved with the characteristics of the sound field,
  the characteristics of the sound field are expressed as a set of coefficients by which the intensity of each frequency component of the first sound is multiplied, and
  the estimation means sets a lower-bound function that is a lower bound of the posterior distribution of a generative model representing that the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a complex normal distribution and a generalized inverse Gaussian distribution, respectively, the lower-bound function being expressed using a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field, and approximately estimates the posterior distribution by determining the lower-bound function through repeated updating of the auxiliary variables and the parameters.

  An acoustic signal analyzer comprising:
  sound emission means for emitting a predetermined first sound;
  sound collecting means for collecting a mixed sound including the emitted first sound and a second sound emitted from a sound source different from the sound emission means; and
  estimation means for simultaneously performing Bayesian estimation of the second sound and of the characteristics of the sound field in which the sound emission means and the sound collecting means are installed, based on the collected mixed sound and the first sound, wherein
  the mixed sound includes the second sound and a sound in which the first sound is convolved with the characteristics of the sound field,
  the characteristics of the sound field are expressed as a set of coefficients by which the intensity of each frequency component of the first sound is multiplied, and
  the estimation means sets a lower-bound function that is a lower bound of the posterior distribution of a generative model representing that the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a Poisson distribution and a gamma distribution, respectively, the lower-bound function being expressed using a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field, and approximately estimates the posterior distribution by determining the lower-bound function through repeated updating of the auxiliary variables and the parameters.

  An acoustic signal analyzer comprising:
  sound emission means for emitting a predetermined first sound;
  sound collecting means for collecting a mixed sound including the emitted first sound and a second sound emitted from a sound source different from the sound emission means; and
  estimation means for simultaneously performing Bayesian estimation of the second sound and of the characteristics of the sound field in which the sound emission means and the sound collecting means are installed, based on the collected mixed sound and the first sound, wherein
  the mixed sound includes a sound in which the first sound is convolved with the characteristics of the sound field and a sound in which the second sound is convolved with the characteristics of the sound field,
  the characteristics of the sound field are expressed as a set of coefficients by which the intensity of each frequency component of the first sound is multiplied, and
  the estimation means sets a lower-bound function that is a lower bound of the posterior distribution of a generative model representing that the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a complex normal distribution and a generalized inverse Gaussian distribution, respectively, the lower-bound function being expressed using a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field, and approximately estimates the posterior distribution by determining the lower-bound function through repeated updating of the auxiliary variables and the parameters.

  An acoustic signal analyzer comprising:
  sound emission means for emitting a predetermined first sound;
  sound collecting means for collecting a mixed sound including the emitted first sound and a second sound emitted from a sound source different from the sound emission means; and
  estimation means for simultaneously performing Bayesian estimation of the second sound and of the characteristics of the sound field in which the sound emission means and the sound collecting means are installed, based on the collected mixed sound and the first sound, wherein
  the mixed sound includes a sound in which the first sound is convolved with the characteristics of the sound field and a sound in which the second sound is convolved with the characteristics of the sound field,
  the characteristics of the sound field are expressed as a set of coefficients by which the intensity of each frequency component of the first sound is multiplied, and
  the estimation means sets a lower-bound function that is a lower bound of the posterior distribution of a generative model representing that the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a Poisson distribution and a gamma distribution, respectively, the lower-bound function being expressed using a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field, and approximately estimates the posterior distribution by determining the lower-bound function through repeated updating of the auxiliary variables and the parameters.

  The acoustic signal analyzer according to any one of claims 1 to 4, wherein
  The sound collection means samples the second sound and the mixed sound in real time, and
  The estimation means approximately estimates the posterior distribution by updating and optimizing an expected value of the lower limit function in real time.

  The acoustic signal analyzer according to claim 5, wherein
  Among the parameters relating to the characteristics of the sound field, the parameter relating to the coefficient that is multiplied by the intensity of a predetermined frequency component is updated based on an update formula set such that only the sum of the intensities of the predetermined frequency component of the spectrum of the first sound from the start of sound generation to the present and the sum of the intensities of the predetermined frequency component of the spectrum of the mixed sound from the start of sound generation to the present are multiplied by a weighting coefficient corresponding to the number of updates.
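
The bookkeeping described in this claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the update formula of the specification: the class name `OnlineCoefficientTracker`, the gamma-style shape and rate parameters, the prior values, and the `weight` function of the update count are all hypothetical. The sketch also attributes all mixed-sound intensity to the first sound, so it shows only how the two running sums and the count-dependent weighting enter the update.

```python
class OnlineCoefficientTracker:
    """Tracks one frequency bin's sound-field coefficient from running sums.

    Hypothetical sketch: shape and rate are gamma-style parameters, and
    weight(n) is an assumed weighting that depends on the update count n.
    """

    def __init__(self, prior_shape=1.0, prior_rate=1.0, weight=lambda n: 1.0):
        self.prior_shape = prior_shape
        self.prior_rate = prior_rate
        self.weight = weight
        self.sum_first = 0.0  # first-sound intensity summed from the start
        self.sum_mixed = 0.0  # mixed-sound intensity summed from the start
        self.n_updates = 0
        self.shape = prior_shape
        self.rate = prior_rate

    def update(self, first_intensity, mixed_intensity):
        """Consume one frame and return the current point estimate."""
        self.sum_first += first_intensity
        self.sum_mixed += mixed_intensity
        self.n_updates += 1
        w = self.weight(self.n_updates)
        # Only the two running sums are scaled by the weighting coefficient.
        self.shape = self.prior_shape + w * self.sum_mixed
        self.rate = self.prior_rate + w * self.sum_first
        return self.shape / self.rate


tracker = OnlineCoefficientTracker(weight=lambda n: 1.0 / n)
for first, mixed in [(2.0, 4.0), (1.0, 1.0), (3.0, 5.0)]:
    estimate = tracker.update(first, mixed)
print(estimate)
```

Because each update touches only two scalars per frequency bin, the cost per frame does not grow with the length of the recording, which is what makes this style of update usable in real time.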

  An acoustic signal analysis method in which a mixed sound including a predetermined first sound emitted from sound emitting means and a second sound emitted from a sound source different from the sound emitting means is collected using sound collecting means, and the second sound and the characteristics of the sound field in which the sound emitting means and the sound collecting means are installed are simultaneously estimated by Bayesian estimation based on the collected mixed sound and the first sound, wherein
  The mixed sound includes a sound in which the first sound and the characteristics of the sound field are convolved, and the second sound,
  The characteristics of the sound field are expressed as a set of coefficients that are multiplied by the intensity of each frequency component of the first sound, and
  A lower limit function that is a lower limit of the posterior distribution of a generation model representing that the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a complex normal distribution and a generalized inverse Gaussian distribution, respectively, is set, the lower limit function being expressed by a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field, and the posterior distribution is approximately estimated by determining the lower limit function through repeated updating of the auxiliary variables and the parameters.
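
The observation model in this claim can be sketched in code. The sketch assumes that the coefficients act on power spectra, so that the expected power of each mixed-sound bin is the coefficient times the first-sound intensity plus the second-sound intensity, and that each bin is drawn from a zero-mean circular complex normal distribution with that variance. The names `expected_mixed_power`, `sample_complex_normal`, `h`, `first_power` and `second_power` are hypothetical and do not come from the claims.

```python
import math
import random


def expected_mixed_power(h, first_power, second_power):
    """Expected mixed-sound power for every (frequency, frame) bin.

    h[f]               : assumed sound-field coefficient of frequency bin f
    first_power[f][t]  : intensity of the known first sound
    second_power[f][t] : intensity of the unknown second sound
    """
    return [
        [h[f] * first_power[f][t] + second_power[f][t]
         for t in range(len(first_power[f]))]
        for f in range(len(h))
    ]


def sample_complex_normal(variance, rng):
    """Draw one zero-mean circular complex normal sample."""
    # The real and imaginary parts each carry half of the variance.
    sigma = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))


rng = random.Random(0)
power = expected_mixed_power(
    [0.5, 2.0],
    [[2.0, 4.0], [1.0, 3.0]],
    [[1.0, 1.0], [0.0, 0.5]],
)
# One synthetic mixed-sound spectrum drawn from the assumed model.
mixed = [[sample_complex_normal(v, rng) for v in row] for row in power]
```

Averaging the squared magnitude of many samples for one bin recovers that bin's expected power, which is the link between the complex spectrum and the intensity-domain coefficients in this sketch.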

  An acoustic signal analysis method in which a mixed sound including a predetermined first sound emitted from sound emitting means and a second sound emitted from a sound source different from the sound emitting means is collected using sound collecting means, and the second sound and the characteristics of the sound field in which the sound emitting means and the sound collecting means are installed are simultaneously estimated by Bayesian estimation based on the collected mixed sound and the first sound, wherein
  The mixed sound includes a sound in which the first sound and the characteristics of the sound field are convolved, and the second sound,
  The characteristics of the sound field are expressed as a set of coefficients that are multiplied by the intensity of each frequency component of the first sound, and
  A lower limit function that is a lower limit of the posterior distribution of a generation model representing that the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a Poisson distribution and a gamma distribution, respectively, is set, the lower limit function being expressed by a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field, and the posterior distribution is approximately estimated by determining the lower limit function through repeated updating of the auxiliary variables and the parameters.
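
The alternation between auxiliary variables and parameters can be illustrated with a deliberately simplified variant. The sketch below computes a maximum a posteriori point estimate, not the approximate posterior distribution that the claim describes, for an assumed model in which each mixed-sound intensity Y[f][t] is Poisson with mean h[f] * X[f][t] + S[f][t], and h and S have gamma priors. The function names, the prior values, and the shape/rate parameterization are hypothetical. The auxiliary variable phi makes a Jensen lower bound on the log posterior tight, after which h and S have closed-form updates.

```python
import math


def log_posterior(Y, X, h, S, a_h, b_h, a_s, b_s):
    """Unnormalised log posterior of h and S under the assumed model."""
    total = 0.0
    for f in range(len(Y)):
        # Gamma(shape a_h, rate b_h) prior on the sound-field coefficient.
        total += (a_h - 1.0) * math.log(h[f]) - b_h * h[f]
        for t in range(len(Y[f])):
            rate = h[f] * X[f][t] + S[f][t]
            # Poisson log-likelihood without the constant log(Y!) term.
            total += Y[f][t] * math.log(rate) - rate
            # Gamma(shape a_s, rate b_s) prior on the second-sound intensity.
            total += (a_s - 1.0) * math.log(S[f][t]) - b_s * S[f][t]
    return total


def fit_map(Y, X, n_iter=30, a_h=2.0, b_h=1.0, a_s=1.5, b_s=1.0):
    """Alternate auxiliary-variable and parameter updates (shapes > 1 assumed)."""
    n_freq, n_frames = len(Y), len(Y[0])
    h = [1.0] * n_freq
    S = [[1.0] * n_frames for _ in range(n_freq)]
    history = []
    for _ in range(n_iter):
        for f in range(n_freq):
            # Auxiliary variable: share of each bin attributed to the first
            # sound; this choice makes the Jensen bound tight.
            phi = [h[f] * X[f][t] / (h[f] * X[f][t] + S[f][t])
                   for t in range(n_frames)]
            # Closed-form maximisers of the bound for h[f] and S[f][t].
            h[f] = ((a_h - 1.0 + sum(Y[f][t] * phi[t] for t in range(n_frames)))
                    / (b_h + sum(X[f])))
            for t in range(n_frames):
                S[f][t] = (a_s - 1.0 + Y[f][t] * (1.0 - phi[t])) / (b_s + 1.0)
        history.append(log_posterior(Y, X, h, S, a_h, b_h, a_s, b_s))
    return h, S, history


X = [[4.0, 0.0, 2.0, 6.0], [1.0, 3.0, 0.0, 5.0]]  # known first-sound intensities
Y = [[9, 1, 5, 12], [2, 4, 1, 7]]                  # observed mixed-sound intensities
h, S, history = fit_map(Y, X)
```

Each pass cannot decrease the log posterior, because the bound touches it at the current estimate and is then maximized. A full variational treatment would replace the point estimates with gamma-distributed factors and use their expectations in the same two steps.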

  An acoustic signal analysis method in which a mixed sound including a predetermined first sound emitted from sound emitting means and a second sound emitted from a sound source different from the sound emitting means is collected using sound collecting means, and the second sound and the characteristics of the sound field in which the sound emitting means and the sound collecting means are installed are simultaneously estimated by Bayesian estimation based on the collected mixed sound and the first sound, wherein
  The mixed sound includes a sound in which the first sound and the characteristics of the sound field are convolved, and a sound in which the second sound and the characteristics of the sound field are convolved,
  The characteristics of the sound field are expressed as a set of coefficients that are multiplied by the intensity of each frequency component of the first sound, and
  A lower limit function that is a lower limit of the posterior distribution of a generation model representing that the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a complex normal distribution and a generalized inverse Gaussian distribution, respectively, is set, the lower limit function being expressed by a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field, and the posterior distribution is approximately estimated by determining the lower limit function through repeated updating of the auxiliary variables and the parameters.

  An acoustic signal analysis method in which a mixed sound including a predetermined first sound emitted from sound emitting means and a second sound emitted from a sound source different from the sound emitting means is collected using sound collecting means, and the second sound and the characteristics of the sound field in which the sound emitting means and the sound collecting means are installed are simultaneously estimated by Bayesian estimation based on the collected mixed sound and the first sound, wherein
  The mixed sound includes a sound in which the first sound and the characteristics of the sound field are convolved, and a sound in which the second sound and the characteristics of the sound field are convolved,
  The characteristics of the sound field are expressed as a set of coefficients that are multiplied by the intensity of each frequency component of the first sound, and
  A lower limit function that is a lower limit of the posterior distribution of a generation model representing that the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a Poisson distribution and a gamma distribution, respectively, is set, the lower limit function being expressed by a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field, and the posterior distribution is approximately estimated by determining the lower limit function through repeated updating of the auxiliary variables and the parameters.

  A computer program causing a computer to execute: a sound collection step of collecting, using sound collecting means, a mixed sound including a predetermined first sound emitted from sound emitting means and a second sound emitted from a sound source different from the sound emitting means; and
  An estimation step of simultaneously performing Bayesian estimation of the second sound and the characteristics of the sound field in which the sound emitting means and the sound collecting means are installed, based on the collected mixed sound and the first sound, wherein
  The mixed sound includes a sound in which the first sound and the characteristics of the sound field are convolved, and the second sound,
  The characteristics of the sound field are expressed as a set of coefficients that are multiplied by the intensity of each frequency component of the first sound, and
  The estimation step includes
  a step of setting a lower limit function that is a lower limit of the posterior distribution of a generation model representing that the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a complex normal distribution and a generalized inverse Gaussian distribution, respectively, the lower limit function being expressed by a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field, and approximately estimating the posterior distribution by determining the lower limit function through repeated updating of the auxiliary variables and the parameters.

  A computer program causing a computer to execute: a sound collection step of collecting, using sound collecting means, a mixed sound including a predetermined first sound emitted from sound emitting means and a second sound emitted from a sound source different from the sound emitting means; and
  An estimation step of simultaneously performing Bayesian estimation of the second sound and the characteristics of the sound field in which the sound emitting means and the sound collecting means are installed, based on the collected mixed sound and the first sound, wherein
  The mixed sound includes a sound in which the first sound and the characteristics of the sound field are convolved, and the second sound,
  The characteristics of the sound field are expressed as a set of coefficients that are multiplied by the intensity of each frequency component of the first sound, and
  The estimation step includes
  a step of setting a lower limit function that is a lower limit of the posterior distribution of a generation model representing that the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a Poisson distribution and a gamma distribution, respectively, the lower limit function being expressed by a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field, and approximately estimating the posterior distribution by determining the lower limit function through repeated updating of the auxiliary variables and the parameters.

  A computer program causing a computer to execute: a sound collection step of collecting, using sound collecting means, a mixed sound including a predetermined first sound emitted from sound emitting means and a second sound emitted from a sound source different from the sound emitting means; and
  An estimation step of simultaneously performing Bayesian estimation of the second sound and the characteristics of the sound field in which the sound emitting means and the sound collecting means are installed, based on the collected mixed sound and the first sound, wherein
  The mixed sound includes a sound in which the first sound and the characteristics of the sound field are convolved, and a sound in which the second sound and the characteristics of the sound field are convolved,
  The characteristics of the sound field are expressed as a set of coefficients that are multiplied by the intensity of each frequency component of the first sound, and
  The estimation step includes
  a step of setting a lower limit function that is a lower limit of the posterior distribution of a generation model representing that the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a complex normal distribution and a generalized inverse Gaussian distribution, respectively, the lower limit function being expressed by a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field, and approximately estimating the posterior distribution by determining the lower limit function through repeated updating of the auxiliary variables and the parameters.

  A computer program causing a computer to execute: a sound collection step of collecting, using sound collecting means, a mixed sound including a predetermined first sound emitted from sound emitting means and a second sound emitted from a sound source different from the sound emitting means; and
  An estimation step of simultaneously performing Bayesian estimation of the second sound and the characteristics of the sound field in which the sound emitting means and the sound collecting means are installed, based on the collected mixed sound and the first sound, wherein
  The mixed sound includes a sound in which the first sound and the characteristics of the sound field are convolved, and a sound in which the second sound and the characteristics of the sound field are convolved,
  The characteristics of the sound field are expressed as a set of coefficients that are multiplied by the intensity of each frequency component of the first sound, and
  The estimation step includes
  a step of setting a lower limit function that is a lower limit of the posterior distribution of a generation model representing that the spectrum of the mixed sound, the time series of the spectrum of the second sound, and the set of coefficients are generated according to a Poisson distribution and a gamma distribution, respectively, the lower limit function being expressed by a plurality of auxiliary variables and including parameters relating to the second sound and the characteristics of the sound field, and approximately estimating the posterior distribution by determining the lower limit function through repeated updating of the auxiliary variables and the parameters.