JP5334142B2 - Method and system for estimating mixing ratio in mixed sound signal and method for phoneme recognition

Info

Publication number: JP5334142B2
Application number: JP2011523664A
Authority: JP
Inventors: 弘将藤原; 真孝後藤
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2009-07-21
Filing date: 2010-07-21
Publication date: 2013-11-06
Anticipated expiration: 2030-07-21
Also published as: JPWO2011010647A1; WO2011010647A1

Abstract

Provided are a mixed-sound signal mixture ratio estimating method and system for estimating a mixture ratio between a target sound signal and a noise signal in a mixed-sound signal. The gain of a target sound spectrum template, which constitutes a probabilistic spectrum template, and the gain of a noise spectrum template are modified to obtain a plurality of gain-modified spectrum templates. One of the gain-modified spectrum templates that exhibits the shortest distance from an observation spectrum is decided as a shortest-distance gain-modified spectrum template. The mixture ratio is estimated based on the gain of the shortest-distance gain-modified spectrum template and the gain of the noise spectrum template.

Description

The present invention relates to a method and system for estimating the mixing ratio between a target sound signal and a noise signal in a mixed sound signal, and to a phoneme recognition method.

Conventionally, techniques for improving recognition accuracy have been proposed, both for recognizing speech contained in an acoustic signal and for phoneme recognition, on the premise that the mixing ratio (S/N ratio) between the target sound signal and the noise signal in the mixed sound signal is known (Non-Patent Document 1).

Gales, M. J. F. and Young, S., "An improved approach to the hidden Markov model decomposition of speech and noise", Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1997), pp. 835-838 (1997)

Because the conventional techniques assume that the mixing ratio (S/N ratio) is known, they have the problem that the estimation accuracy of the mixing ratio deteriorates when the noise signal contained in the mixed sound signal fluctuates widely.

An object of the present invention is to provide a mixing ratio estimation method and system for a mixed sound signal that can estimate the mixing ratio between a target sound signal and a noise signal in the mixed sound signal.

In addition to the above object, another object of the present invention is to provide a mixing ratio estimation method for a mixed sound signal that can also estimate the fundamental frequency F0 when estimating the mixing ratio of a voiced sound signal.

Another object of the present invention is to provide a phoneme recognition method that performs phoneme recognition using the estimated mixing ratio.

The present invention improves upon a mixing ratio estimation method for a mixed sound signal in which a computer is used to estimate the mixing ratio between a target sound signal and a noise signal contained in one frame signal discretely acquired from the mixed sound signal. In this specification, target sound signals include speech signals (including singing voice signals), acoustic signals of musical instruments, and the like. A noise signal is any signal contained in the mixed sound signal other than the target sound signal. A "discretely acquired one-frame signal" is a signal acquired from the mixed sound signal using a Hanning window of a predetermined time width as one frame.
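
The frame acquisition described above can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the 64-sample frame length, the 8 kHz sampling rate, and the test tone are all assumptions chosen for illustration, and a naive DFT stands in for the FFT that a practical system would likely use.

```python
import cmath
import math


def hanning_window(n):
    """Return a symmetric Hanning window of length n."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frame_power_spectrum(signal, start, frame_len):
    """Cut one windowed frame out of `signal` and return its power spectrum.

    A naive O(N^2) DFT keeps the sketch dependency-free.
    """
    window = hanning_window(frame_len)
    frame = [signal[start + i] * window[i] for i in range(frame_len)]
    spectrum = []
    for k in range(frame_len // 2 + 1):
        acc = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / frame_len)
                  for i in range(frame_len))
        spectrum.append(abs(acc) ** 2)
    return spectrum


# Hypothetical input: a 1 kHz tone sampled at 8 kHz (falls on DFT bin 8).
sample_rate = 8000
frame_len = 64
tone = [math.sin(2.0 * math.pi * 1000.0 * t / sample_rate) for t in range(256)]
power = frame_power_spectrum(tone, start=0, frame_len=frame_len)
peak_bin = max(range(len(power)), key=power.__getitem__)
```

The list `power` plays the role of the observed spectrum of one frame; repeating the call with a shifted `start` would give the spectra of successive frames.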

In the present invention, one or more target sound spectrum templates are prepared, each indicating the relationship between the frequency components of one or more learning target sound signals and the probability distribution of their power spectra. Likewise, one or more noise spectrum templates are prepared, each indicating the relationship between the frequency components of one or more learning noise signals and the probability distribution of their power spectra. One or more probabilistic spectral templates are then created by combining and synthesizing the one or more target sound spectrum templates with the one or more noise spectrum templates.

In this specification, a set of probability distributions over where the spectrum of a mixed sound signal containing speech (including singing voice) or the like exists is called a probabilistic spectral template.

Here, the learning target sound signals are one or more sound signals for learning, collected according to the target sound. For example, when the target sound is speech, sound signals of single sounds, such as voiced sounds (vowels, consonants, and so on) and unvoiced sounds, serve as the learning target sound signals. To increase accuracy, it is preferable to acquire a plurality of single-sound signals from the speech signals of a plurality of people as the learning target sound signals. Depending on the mixed sound signal to be observed, several kinds of learning target sound signals may be used, divided into categories such as male speech signals, female speech signals, and children's speech signals. When the target sound is the sound of a stringed instrument, a single-note signal of a given stringed instrument serves as the learning target sound signal; when the target sound is the sound of a percussion instrument, a single-note signal of a given percussion instrument serves as the learning target sound signal.

In this specification, a learning noise signal is a sound signal other than the sound signal of the target sound contained in the mixed sound signal of interest. If the mixed sound signal is the music signal of a piece that contains a singing voice, the singing voice is the target sound and the background accompaniment is the noise. The learning noise is therefore selected as appropriate, taking into account the kinds of noise expected in the mixed sound signal of interest. If a sound signal containing only the singing voice is available, it serves as the learning target sound signal; if a sound signal containing only the accompaniment is available, it serves as the learning noise signal. Such learning target sound signals and learning noise signals are obtained separately.

However, learning target sound signals and learning noise signals are not always easy to obtain. In such cases, the target sound spectrum template of the learning target sound signal and the noise spectrum template of the learning noise signal may both be estimated from a learning mixed sound signal. Here, a learning mixed sound is formed by mixing a sound signal corresponding to the target sound with a sound signal corresponding to noise. For example, if the target sound is a singing voice, a sound signal containing both the singing voice and accompaniment is the mixed sound signal; if the target sound is spoken voice such as a speech, a sound signal containing that voice and background noise is the mixed sound signal.

If the mixed sound signal to be observed contains a female vocal singing voice, it is preferable to use mixed sound signals containing a female vocal singing voice as the one or more learning mixed sound signals. Even when the learning signals differ in kind from the mixed sound signal to be observed, collecting a reasonable number of mixed sound signals as learning mixed sound signals and estimating a plurality of target sound spectrum templates and a plurality of noise spectrum templates from them yields averaged learning data, so accuracy is not seriously degraded.

In the method of the present invention, an observed spectrum for one frame is acquired from the mixed sound signal to be observed. The observed spectrum is a spectral waveform showing the relationship between frequency and power spectrum for the signal in one frame obtained from the mixed sound signal. A plurality of gain-modified spectrum templates are obtained by varying the gains of the one or more target sound spectrum templates and the gains of the one or more noise spectrum templates that constitute the one or more probabilistic spectral templates. Among these, the gain-modified spectrum template whose distance from the observed spectrum is smallest is determined as the minimum-distance gain-modified spectrum template. The mixing ratio is then estimated based on the gain of the minimum-distance gain-modified spectrum template and the gain of the noise spectrum template.

A quasi-Newton method can be used for the optimization that determines the gains. The mixing ratio (S/N ratio) of the mixed sound signal in one frame is estimated from the gain Gs of the target sound spectrum template and the gain Gn of the noise spectrum template in the determined minimum-distance gain-modified spectrum template. Specifically, Gs/Gn is the mixing ratio of the mixed sound signal in that frame.
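
The gain search and the ratio Gs/Gn can be sketched as follows. This is only an illustration under several assumptions that are not taken from the specification: each template is reduced to a mean power spectrum, the gain-modified template is formed as Gs times the target mean plus Gn times the noise mean, the distance is a sum of squared log-spectral differences, and an exhaustive grid search replaces the quasi-Newton optimization. All names and numbers are hypothetical.

```python
import math


def log_spectral_distance(observed, template):
    """Sum of squared differences between log power spectra (assumed metric)."""
    return sum((math.log(o) - math.log(t)) ** 2
               for o, t in zip(observed, template))


def estimate_mixing_ratio(observed, target_mean, noise_mean, gain_grid):
    """Try every (Gs, Gn) pair on the grid and keep the closest template."""
    best = None
    for gs in gain_grid:
        for gn in gain_grid:
            # Gain-modified template (assumes target and noise powers add).
            template = [gs * s + gn * n
                        for s, n in zip(target_mean, noise_mean)]
            dist = log_spectral_distance(observed, template)
            if best is None or dist < best[0]:
                best = (dist, gs, gn)
    _, gs, gn = best
    return gs, gn, gs / gn


# Toy data: the observation is built with Gs = 2.0 and Gn = 0.5.
target_mean = [1.0, 4.0, 2.0, 0.5]
noise_mean = [1.0, 1.0, 1.0, 1.0]
observed = [2.0 * s + 0.5 * n for s, n in zip(target_mean, noise_mean)]
gain_grid = [0.25, 0.5, 1.0, 2.0, 4.0]
gs, gn, ratio = estimate_mixing_ratio(observed, target_mean, noise_mean, gain_grid)
```

A grid search makes the "choose the closest of several gain-modified templates" step explicit; a gradient-based optimizer such as the quasi-Newton method named in the text would refine the gains continuously instead of being limited to grid points.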

According to the present invention, the mixing ratio can be recognized directly from a spectrum in which the target sound (speech, singing voice, etc.) is mixed with other noise (accompaniment, etc.), without separating the two. Because the present invention also makes use of information about the background noise, estimation accuracy can be improved compared with the conventional approach of first separating the target sound and the noise that make up the mixed sound and then recognizing the separated sound. Furthermore, because the S/N ratio is estimated in each frame of the mixed sound signal, the method has the advantage of being robust against fluctuations in the noise.

If the target sound signal has a harmonic structure, as a voiced sound signal does, the target sound spectrum template is defined as the product of a driving sound source function and a speech envelope template. The driving sound source function is a filter that represents the frequency components of a standard spectrum of the harmonic structure of a sound signal having such a structure. When the driving sound source function is used, the fundamental frequency F0 of the driving sound source function is estimated at the same time as the minimum-distance gain-modified spectrum template is determined. The quasi-Newton method mentioned above can also be used to estimate the fundamental frequency F0. Using the driving sound source function has the advantage that, because the spectral envelope of the target sound signal's spectrum does not need to be estimated, a sound having a harmonic structure can be represented as it is.
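
A rough sketch of the product of a driving sound source function and an envelope follows. The specification does not give the form of the function in this passage, so the harmonic comb below (Gaussian bumps at integer multiples of F0) is purely a hypothetical stand-in, as are the function names, the bump width, and the flat toy envelope. The envelope is also reduced to a single value per frequency rather than a probability distribution.

```python
import math


def driving_source(freq_hz, f0_hz, n_harmonics=10, width_hz=20.0):
    """Hypothetical harmonic comb: Gaussian bumps at multiples of f0."""
    return sum(math.exp(-0.5 * ((freq_hz - h * f0_hz) / width_hz) ** 2)
               for h in range(1, n_harmonics + 1))


def voiced_template(freqs_hz, envelope, f0_hz):
    """Voiced case: template = driving source function x envelope."""
    return [driving_source(f, f0_hz) * e for f, e in zip(freqs_hz, envelope)]


def unvoiced_template(envelope):
    """Unvoiced case: the envelope itself serves as the template."""
    return list(envelope)


freqs_hz = [50.0 * k for k in range(21)]   # 0 Hz to 1000 Hz in 50 Hz steps
envelope = [1.0] * len(freqs_hz)           # flat toy envelope
template = voiced_template(freqs_hz, envelope, f0_hz=200.0)
```

With a flat envelope the resulting template peaks at 200 Hz, 400 Hz, and so on, and is close to zero between harmonics. Changing `f0_hz` moves the peaks, which is why F0 can be treated as one more parameter to optimize alongside the gains.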

If the target sound signal is a speech signal, the target sound spectrum template is a speech spectrum template. If the sound signal having a harmonic structure is a voiced sound signal, the target sound spectrum template is defined as the product of a driving sound source function, which represents the frequency components of a standard spectrum of the harmonic structure of the voiced sound signal, and a speech envelope template. If the target sound signal is an unvoiced sound signal, the target sound spectrum template is the speech envelope template. Here, the speech envelope template represents the distribution of the envelopes that connect the power peaks in a plurality of frequency spectrum waveforms, each showing the relationship between frequency components and power, obtained by frequency analysis of the learning sound signals collected for the voiced or unvoiced sound of interest.

The probability distribution of the power spectrum is preferably represented by a log-normal distribution at each frequency. A log-normal representation simplifies the computation required for estimation.
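
One reason a log-normal representation is convenient is that the logarithm of the power is then normally distributed, so scoring an observation reduces to a Gaussian computation on log power. The sketch below scores an observed spectrum against a template given as per-frequency parameters `mu` and `sigma`. Treating frequency bins as independent, and all names and numbers, are assumptions made for illustration.

```python
import math


def lognormal_log_pdf(power, mu, sigma):
    """Log density of a log-normal distribution at `power` (power > 0)."""
    z = (math.log(power) - mu) / sigma
    return -math.log(power * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def template_log_likelihood(observed, mus, sigmas):
    """Sum per-frequency log densities, treating bins as independent."""
    return sum(lognormal_log_pdf(y, m, s)
               for y, m, s in zip(observed, mus, sigmas))


# Toy observation whose log power is exactly [0, 1].
observed = [1.0, math.e]
ll_matching = template_log_likelihood(observed, [0.0, 1.0], [1.0, 1.0])
ll_mismatched = template_log_likelihood(observed, [2.0, -1.0], [1.0, 1.0])
```

A higher log-likelihood corresponds to a smaller distance between template and observation, so `ll_matching` exceeds `ll_mismatched` here.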

If the kind of target sound is not known, a plurality of target sound spectrum templates may be prepared in advance.

According to the present invention, the mixing ratio of the observed mixed sound signal can be estimated frame by frame with higher accuracy than before. In addition, when the target sound is a voiced sound and the driving sound source function is used, the fundamental frequency F0 of the driving sound source function can be estimated at the same time as the minimum-distance gain-modified spectrum template is determined.

A mixing ratio estimation system that implements the mixing ratio estimation method of the present invention comprises a spectrum template storage unit, a probabilistic spectral template creation unit, an observed spectrum acquisition unit, a determination unit, and a mixing ratio estimation unit.

The spectrum template storage unit stores one or more target sound spectrum templates, each indicating the relationship between the frequency components of one or more learning target sound signals and the probability distribution of their power spectra, and one or more noise spectrum templates, each indicating the relationship between the frequency components of one or more learning noise signals and the probability distribution of their power spectra. The probabilistic spectral template creation unit creates one or more probabilistic spectral templates by combining and synthesizing the one or more target sound spectrum templates with the one or more noise spectrum templates. The observed spectrum acquisition unit acquires an observed spectrum for one frame from the mixed sound signal. The determination unit obtains a plurality of gain-modified spectrum templates by varying the gain of the target sound spectrum template and the gain of the noise spectrum template that constitute each of the one or more probabilistic spectral templates, and determines the one whose distance from the observed spectrum is smallest as the minimum-distance gain-modified spectrum template. The mixing ratio estimation unit estimates the mixing ratio based on the gain of the minimum-distance gain-modified spectrum template and the gain of the noise spectrum template.

The system of the present invention may include a template generation unit that generates the one or more target sound spectrum templates and the one or more noise spectrum templates. The template generation unit can be configured so that, when the target sound signal is a voiced sound signal having a harmonic structure, it defines the target sound spectrum template as the product of a driving sound source function, which represents the frequency components of a standard spectrum of the harmonic structure of the voiced sound signal, and a speech envelope template, and so that, when the target sound signal is an unvoiced sound signal, it uses the speech envelope template as the target sound spectrum template.

The template generation unit may also be configured to estimate both the target sound spectrum template and the noise spectrum template from a learning mixed sound signal.

In the phoneme recognition method of the present invention, the phoneme corresponding to the minimum-distance gain-modified spectrum template obtained by the mixing ratio estimation method for a mixed sound signal is determined as the phoneme of one frame. The type of speech is then determined based on the continuity of the phonemes determined for a plurality of frames. Here, "continuity of the phonemes of frames" refers to the tendency, in real signals, for the same phoneme to appear consecutively over a plurality of frames.
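
One simple way to exploit frame-to-frame continuity is to smooth the per-frame phoneme decisions and then merge consecutive identical labels into segments. The sliding majority vote below is a hypothetical stand-in; this passage does not specify how continuity is actually used, and the function names and the example label sequence are invented for illustration.

```python
from collections import Counter


def smooth_labels(frame_labels, half_width=1):
    """Replace each frame label by the majority label in a sliding window."""
    smoothed = []
    for i in range(len(frame_labels)):
        lo = max(0, i - half_width)
        hi = min(len(frame_labels), i + half_width + 1)
        counts = Counter(frame_labels[lo:hi])
        top = max(counts.values())
        winners = [lab for lab, c in counts.items() if c == top]
        # On a tie, keep the frame's own label if it is among the winners.
        if frame_labels[i] in winners:
            smoothed.append(frame_labels[i])
        else:
            smoothed.append(winners[0])
    return smoothed


def collapse_runs(labels):
    """Merge consecutive identical labels into (label, run_length) pairs."""
    runs = []
    for lab in labels:
        if runs and runs[-1][0] == lab:
            runs[-1] = (lab, runs[-1][1] + 1)
        else:
            runs.append((lab, 1))
    return runs


# Per-frame decisions with one isolated outlier ("i") inside a run of "a".
frame_phonemes = ["a", "a", "i", "a", "a", "k", "k", "k"]
smoothed = smooth_labels(frame_phonemes)
segments = collapse_runs(smoothed)
```

In this toy sequence the isolated "i" is overruled by its neighbours, leaving two segments: five frames of "a" followed by three frames of "k".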

A block diagram showing the configuration of an example of a phoneme recognition system that includes an embodiment of the mixing ratio estimation system of the present invention, which implements the mixing ratio estimation method of the present invention.
A flowchart showing the algorithm of a program used when the embodiment of FIG. 1 is realized using a computer.
(a) to (c) are diagrams used to explain the process of generating a speech spectrum template as the target sound spectrum template when the target sound is speech.
(a) to (d) are diagrams used to explain the process of generating the probabilistic spectral template Y_f from the speech spectrum template Y_{v,f} and the noise spectrum template Y_{n,f}, and the process of obtaining the distance (likelihood) between the probabilistic spectral template and the observed spectrum y(f).
A diagram showing an outline of the phoneme recognition method.
A diagram showing an example of the algorithm of a program that uses a computer to obtain the distance (likelihood) between the gain-modified spectrum template Y'_f and the observed spectrum y(f).
A diagram showing an example of the algorithm for estimating the fundamental frequency F0 in step ST12 of FIG. 6.
A flowchart showing an example of the algorithm of a program used when phoneme estimation is performed using a computer.
(a) to (d) are diagrams showing an example of the parameter estimation process.
A flowchart of the algorithm of a program used when parameter estimation is performed by a computer.
A flowchart showing an algorithm for estimating the target sound spectrum template and the noise spectrum template from a learning mixed sound signal.
A diagram schematically showing the concept of sampling.

FIG. 1 is a block diagram showing the configuration of an example of a phoneme recognition system that includes an embodiment of the mixing ratio estimation system of the present invention, which implements the mixing ratio estimation method of the present invention. FIG. 2 is a flowchart showing the algorithm of a program used when the embodiment of FIG. 1 is realized using a computer. FIG. 3 is a diagram used to explain the process of generating a speech spectrum template as the target sound spectrum template when the target sound is speech. FIG. 4 is a diagram used to explain the process of generating a probabilistic spectral template from a speech spectrum template and a noise spectrum template, and the process of obtaining the distance (likelihood) between the probabilistic spectral template and the observed spectrum.

The mixing ratio estimation system 1 of the present embodiment comprises a template generation unit 2, a spectrum template storage unit 3, a probabilistic spectral template creation unit 9, an observed spectrum acquisition unit 14, a determination unit 15, and a mixing ratio estimation unit 25. The template generation unit 2 generates the target sound spectrum templates and the noise spectrum templates. The template generation unit 2 used in the present embodiment is configured so that it can carry out either of two generation methods. When carrying out the first generation method, the template generation unit 2 defines the target sound spectrum template, when the target sound signal is a voiced sound signal having a harmonic structure, as the product of a driving sound source function, which represents the frequency components of a standard spectrum of the harmonic structure of the voiced sound signal, and a speech envelope template; when the target sound signal is an unvoiced sound signal, it uses the speech envelope template as the target sound spectrum template. When carrying out the second generation method, the template generation unit 2 estimates both the target sound spectrum template and the noise spectrum template from a learning mixed sound signal. The first and second generation methods are described in detail later.

The spectrum template storage unit 3 consists of a target sound spectrum template storage unit 5, which stores the target sound spectrum templates generated by the template generation unit 2, and a noise spectrum template storage unit 7, which stores the noise spectrum templates generated by the template generation unit 2. The target sound spectrum template storage unit 5 stores a plurality of target sound spectrum templates prepared in advance from a plurality of learning target sound signals (because the present embodiment uses them for phoneme recognition, these are specifically "speech spectrum templates Y_{v,f}"). As shown for example in FIG. 3(c), a target sound spectrum template indicates the relationship between the frequency components of a plurality of learning target sound signals and the probability distribution (probability density) of their power spectra, and is created from those learning target sound signals. For example, when the target sound is a speech signal, the plurality of target sound spectrum templates are templates that indicate the relationship between frequency components and the probability distribution (probability density) of the power spectrum, obtained for each of a plurality of single-sound learning signals such as voiced sounds (vowels and consonants) and unvoiced sounds.

Here, the one or more learning target sound signals are one or more sound signals for learning, collected according to the target sound. For example, when the target sound is speech, they are sound signals of single sounds, such as voiced sounds (vowels, consonants, and so on) and unvoiced sounds, acquired from the speech signals of a plurality of people. Depending on the mixed sound signal to be observed, the learning target sound signals may be divided into categories such as male speech signals, female speech signals, and children's speech signals. The one or more learning noise signals are sound signals other than the sound signal of the target sound contained in the mixed sound signal of interest. The learning noise is selected as appropriate, taking into account the kinds of noise expected in the mixed sound signal of interest. For example, if a sound signal containing only a singing voice is available, it serves as the learning target sound signal; if a sound signal containing only accompaniment is available, it serves as the learning noise signal.

A learning mixed sound signal is formed by mixing a sound signal corresponding to the target sound with a sound signal corresponding to noise. For example, if the target sound is a singing voice, a sound signal containing both the singing voice and accompaniment is the mixed sound signal; if the target sound is spoken voice such as a speech, a sound signal containing that voice and background noise is the mixed sound signal.

If the mixed sound signal to be observed contains a female vocal singing voice, it is preferable to use mixed sound signals containing a female vocal singing voice as the one or more learning mixed sound signals. However, even when the learning signals differ in kind from the observed mixed sound signal, averaged learning data can be obtained by collecting a reasonable number of mixed sound signals as learning mixed sound signals, acquiring a plurality of learning target sound signals and a plurality of learning noise signals from them, and preparing a plurality of target sound spectrum templates and a plurality of noise spectrum templates, so accuracy is not seriously degraded.

If the target sound signal is a voiced sound signal, the template generation unit 2 generates the target sound spectrum template as the product of the driving sound source function H(f; f₀) shown in FIG. 3(b) and the speech envelope template Y'_{v,f} shown in FIG. 3(a). The driving sound source function H(f; f₀) is a filter that represents the frequency components of a standard spectrum of the harmonic structure of a voiced sound signal. The appropriate fundamental frequency F0 of the driving sound source function H(f; f₀) is determined at the same time as the gains of the speech spectrum template Y_{v,f} and the noise spectrum template, or the weight parameters g_v and g_n described later, are optimized.

As shown in FIG. 3(a), the voice envelope template Y′_{v,f} represents the distribution (probability density) of the envelope connecting the power peaks in the frequency spectrum, that is, in the relationship between frequency components and power obtained by frequency analysis of one or more learning target sound signals collected for a target sound (voiced or unvoiced). The shading in FIG. 3(a) indicates this probability density. A voice envelope template Y′_{v,f} is prepared for each target sound; for phoneme recognition, one is prepared for every phoneme to be recognized. As described above, when the target sound is voiced, the speech spectrum template obtained as the product of the driving sound source function H(f; f₀) shown in FIG. 3(b) and the voice envelope template Y′_{v,f} shown in FIG. 3(a) is stored in the target sound spectrum template storage unit 5. H(f; f₀) and Y′_{v,f} are held in an internal memory of the template generation unit 2, and their product is computed by a calculation unit in the template generation unit 2.

When the target sound is unvoiced, the template generation unit 2 stores the voice envelope template Y′_{v,f} held in its internal memory, as it is, in the target sound spectrum template storage unit 5 as the target sound spectrum template.
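The voiced and unvoiced cases above can be illustrated with a minimal sketch. It works only with the mean shape of a template in linear power, not with the full probability density. This passage does not give the functional form of H(f; f₀), so the Gaussian bumps at integer multiples of f₀, the `width_hz` parameter, and the toy envelope are assumptions made purely for illustration.

```python
import math

def harmonic_filter(freqs_hz, f0_hz, width_hz=20.0):
    """Hypothetical H(f; f0): Gaussian bumps centred on integer multiples of f0."""
    out = []
    for f in freqs_hz:
        k = max(1, round(f / f0_hz))  # index of the nearest harmonic (at least 1)
        out.append(math.exp(-0.5 * ((f - k * f0_hz) / width_hz) ** 2))
    return out

def target_template(envelope, freqs_hz, f0_hz=None):
    """Voiced: envelope times harmonic filter. Unvoiced (f0_hz is None): envelope as is."""
    if f0_hz is None:
        return list(envelope)
    weights = harmonic_filter(freqs_hz, f0_hz)
    return [e * w for e, w in zip(envelope, weights)]

freqs = [50.0 * i for i in range(1, 41)]               # 50 Hz ... 2000 Hz
envelope = [1.0 / (1.0 + f / 1000.0) for f in freqs]   # toy smooth envelope
voiced = target_template(envelope, freqs, f0_hz=200.0)
unvoiced = target_template(envelope, freqs)
```

In this sketch the voiced template keeps the envelope value at the harmonics of 200 Hz and is strongly attenuated between them, while the unvoiced template is the envelope itself.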

The noise spectrum template storage unit 7 stores one or more types of noise spectrum templates [see FIG. 4(b)]. A noise spectrum template represents the relationship between the frequency components of a learning noise signal and the probability distribution of its power spectrum. Here, a learning noise signal is any sound signal, other than the target sound, contained in the mixed sound signal to be observed. Noise also differs with the type of mixed sound signal, so the learning noise is selected as appropriate for the kind of noise expected in the target mixed sound signal. That is, it is preferable to create noise spectrum templates according to the type of mixed sound signal (for example, by music genre, such as pop music or classical music such as opera). In the present embodiment, the template generation unit 2 creates a noise spectrum template from the relationship between the frequency components of the learning noise signal and the probability distribution of its power spectrum, and stores it in the noise spectrum template storage unit 7. A plurality of types of noise spectrum templates, matched to the types of mixed sound signal to be observed, are stored in the noise spectrum template storage unit 7. The shading in the noise spectrum template of FIG. 4(b) indicates probability density.

The stochastic spectrum template creation unit 9 includes a combination unit 11 and a stochastic spectrum template storage unit 13. The combination unit 11 creates one or more stochastic spectrum templates by pairing each of the one or more target sound spectrum templates stored in the target sound spectrum template storage unit 5 with each of the one or more types of noise spectrum templates stored in the noise spectrum template storage unit 7 and synthesizing each pair. With 100 target sound spectrum templates (speech spectrum templates) and two noise spectrum templates, the combination unit 11 produces 200 stochastic spectrum templates, which are stored in the stochastic spectrum template storage unit 13. FIG. 4(c) shows an example of a stochastic spectrum template Y_f.
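The pairing performed by the combination unit 11 amounts to a Cartesian product of the two template sets. The following sketch only enumerates the pairs; the identifier strings are placeholders invented for illustration, and the synthesis of each pair into a distribution is not shown.

```python
from itertools import product

# Placeholder identifiers standing in for stored templates (hypothetical names).
target_ids = [f"target_{i:03d}" for i in range(100)]
noise_ids = ["noise_pop", "noise_classical"]

# One stochastic spectrum template per (target, noise) pair.
stochastic_templates = [
    {"target": t, "noise": n} for t, n in product(target_ids, noise_ids)
]
print(len(stochastic_templates))  # 200
```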

The observed spectrum acquisition unit 14 performs frequency analysis on a one-frame signal taken discretely from the mixed sound signal to be observed, and obtains the observed spectrum y(f), which represents the relationship between frequency and power spectrum as shown in FIG. 4(d). Specifically, a one-frame signal is cut out of the mixed sound signal using a Hanning window of a predetermined duration as one frame, and frequency analysis is applied to it to obtain the observed spectrum.
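As a rough illustration of obtaining one observed spectrum, the sketch below applies a Hanning window to a single frame and computes its power spectrum with a direct discrete Fourier transform. The frame length (256 samples), the sampling rate (8000 Hz), and the 1000 Hz test tone are assumed values chosen for the example; the text above specifies only a Hanning window of a predetermined duration.

```python
import cmath
import math

def hann(n):
    """Symmetric Hanning window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def power_spectrum(frame):
    """Power spectrum of one Hanning-windowed frame (bins 0 .. n/2), via a direct DFT."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hann(n))]
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(acc) ** 2)
    return spec

sample_rate = 8000                      # assumed sampling rate in Hz
n = 256                                 # assumed frame length in samples
frame = [math.sin(2.0 * math.pi * 1000.0 * t / sample_rate) for t in range(n)]
y = power_spectrum(frame)
peak_bin = max(range(len(y)), key=y.__getitem__)
peak_hz = peak_bin * sample_rate / n    # frequency of the strongest bin
```

A practical implementation would use a fast Fourier transform; the direct sum is used here only to keep the sketch free of external dependencies.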

The determination unit 15 comprises a selection unit 17, a distance calculation unit 19, a temporary storage unit 21, and a decision unit 23. The selection unit 17 selects the stochastic spectrum templates from the stochastic spectrum template storage unit 13 one at a time. For the selected stochastic spectrum template, the distance calculation unit 19 varies the gain Gs (weight parameter g_v) of its target sound spectrum template and the gain Gn (weight parameter g_n) of its noise spectrum template to obtain a plurality of gain-modified spectrum templates Y′_f, and computes the distance (likelihood) between each of them and the observed spectrum y(f); a smaller distance corresponds to a higher likelihood. The gain-modified spectrum template with the smallest distance is taken as the minimum-distance gain-modified spectrum template Y′_{f,min} for that stochastic spectrum template and is stored in the temporary storage unit 21.

After Y′_{f,min} has been obtained for every stochastic spectrum template in the stochastic spectrum template storage unit 13 and stored in the temporary storage unit 21, the decision unit 23 selects, from the stored minimum-distance gain-modified spectrum templates, the one with the smallest distance. The estimation unit 25 then estimates the mixing ratio Gs/Gn from the gain Gs (weight parameter g_v) of the target sound spectrum template and the gain Gn (weight parameter g_n) of the noise spectrum template of the selected Y′_{f,min}. For example, with 100 target sound spectrum templates and two noise spectrum templates there are 200 stochastic spectrum templates. Varying the two gains for each of them yields 200 candidates, one minimum-distance gain-modified spectrum template per stochastic spectrum template, and the candidate closest to the observed spectrum is chosen as the final minimum-distance gain-modified spectrum template. A quasi-Newton method can be used to optimize the gains.
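The two-stage search described above (the best gains within each stochastic spectrum template, then the best template overall) can be sketched as follows. Several simplifications are assumptions of this sketch and not the embodiment itself: templates are reduced to their mean shapes in linear power, the distance is a squared log-spectral difference, and an exhaustive grid over a few gain values stands in for the quasi-Newton optimization. All template names and values are toy data.

```python
import math
from itertools import product

def distance(observed, target, noise, gs, gn):
    """Squared log-spectral distance between y(f) and Gs*target + Gn*noise."""
    return sum(
        (math.log(y) - math.log(gs * s + gn * n)) ** 2
        for y, s, n in zip(observed, target, noise)
    )

def best_gains(observed, target, noise, gains):
    """Minimum-distance gain pair for one stochastic template: (distance, Gs, Gn)."""
    return min(
        ((distance(observed, target, noise, gs, gn), gs, gn)
         for gs, gn in product(gains, gains)),
        key=lambda c: c[0],
    )

def estimate_mixing_ratio(observed, templates, gains):
    """Pick the template whose best gain pair is closest overall; return (name, Gs/Gn)."""
    best = None
    for name, (target, noise) in templates.items():
        d, gs, gn = best_gains(observed, target, noise, gains)
        if best is None or d < best[0]:
            best = (d, name, gs, gn)
    _, name, gs, gn = best
    return name, gs / gn

templates = {
    "a+pop": ([1.0, 4.0, 2.0, 0.5], [1.0, 1.0, 1.0, 1.0]),
    "o+pop": ([3.0, 1.0, 0.5, 2.0], [1.0, 1.0, 1.0, 1.0]),
}
gains = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]
observed = [5.0, 17.0, 9.0, 3.0]        # equals 4 * target("a") + 1 * noise
name, ratio = estimate_mixing_ratio(observed, templates, gains)
```

Here the observed spectrum was constructed with Gs = 4 and Gn = 1 from the first template, so the search returns that template with a mixing ratio of 4.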

The mixing ratio Gs/Gn of one frame of the mixed sound signal estimated by the estimation unit 25 is stored in the estimation result storage unit 27 together with the identification information of the target sound spectrum template (information specifying the phoneme type). Based on the data stored in the estimation result storage unit 27, the phoneme recognition unit 29 determines the phoneme corresponding to the minimum-distance gain-modified spectrum template as the phoneme of that frame. The type of speech is then determined from the continuity of the frame phonemes. Here, "continuity of frame phonemes" refers to the tendency, in real signals, for the same phoneme to appear in many consecutive frames. For example, a single vowel in a singing voice can last 100 frame periods or more.

Therefore, when the phonemes of a singing voice are determined from frame phonemes, all or most of the phonemes in a run of consecutive frames are necessarily the same. For this reason, the present embodiment determines the type of speech from the continuity of the frame phonemes. In this way, phoneme recognition can be performed without extracting the speech signal alone from the mixed sound signal.
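The passage above does not state how the continuity of frame phonemes is exploited. One simple possibility, shown here purely as an assumed illustration, is a sliding-window majority vote over the per-frame decisions; the window size and the example label sequence are arbitrary.

```python
from collections import Counter

def smooth_phonemes(frame_phonemes, half_window=2):
    """Replace each frame label by the most frequent label in a window around it."""
    smoothed = []
    for i in range(len(frame_phonemes)):
        lo = max(0, i - half_window)
        hi = min(len(frame_phonemes), i + half_window + 1)
        smoothed.append(Counter(frame_phonemes[lo:hi]).most_common(1)[0][0])
    return smoothed

frames = ["a", "a", "o", "a", "a", "a", "i", "i", "i", "a", "i", "i"]
smoothed = smooth_phonemes(frames)
```

In this toy sequence the isolated "o" and the isolated "a" inside the run of "i" are treated as frame-level errors and replaced by the surrounding phoneme.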

Next, the flowchart of FIG. 2 is described. It shows the algorithm of a program used when the embodiment of FIG. 1 is implemented on a computer. The flowchart is an example, and the present invention is not limited to it. In step ST1, a plurality of stochastic spectrum templates are created: a plurality of target sound spectrum templates prepared in advance from a plurality of learning target sound signals and one or more types of noise spectrum templates prepared in advance from a plurality of learning mixed sound signals are paired one by one and synthesized. In step ST2, the observed spectrum of one frame is obtained from the mixed sound signal. In step ST3, for each of the stochastic spectrum templates (a single template is possible in theory), the gain of its target sound spectrum template and the gain of its noise spectrum template are varied to obtain a plurality of gain-modified spectrum templates, and the one closest to the observed spectrum is taken as the minimum-distance gain-modified spectrum template. In step ST4, the mixing ratio is estimated from the target sound spectrum template gain and the noise spectrum template gain of the minimum-distance gain-modified spectrum template that has the smallest distance among those determined for the individual stochastic spectrum templates.

[Specific application example]
Next, an embodiment is described in which the mixing ratio estimation method and system of the above embodiment are used to recognize simultaneously the lyrics (phonemes) and the fundamental frequency (F0) of a singing voice in a mixed sound signal. Lyrics express what the singer wants to convey, and the fundamental frequency F0 represents the melody of the piece as well as the singer's technique and expression, so both are essential elements of a singing voice. A technique for recognizing these elements automatically in mixed sound is therefore an important basic technology with applications such as music information retrieval. For example, if lyrics can be recognized, a song whose lyrics are not known in advance can be retrieved using lyrics as a clue. Automatic phoneme recognition can also be applied to the temporal alignment of lyrics and music, and hence to music players that display lyrics karaoke-style and to automatic captioning of music videos. Estimation of the singing-voice F0 is applicable to automatic transcription of vocal parts and to query-by-humming retrieval. It has further been reported that integrating lyric information into query-by-humming improves its accuracy, so estimating lyrics and F0 simultaneously widens the range of applications still further. However, compared with speech, a singing voice shows much more variation caused by vibrato, a wide F0 range, and the singer's emotional expression; in addition, accompaniment is superimposed at high volume. Automatic recognition of a singing voice (its phonemes) is therefore a very difficult problem.

The inventors have previously studied methods for temporally aligning music and lyrics (Papers 1 and 2 below) and a method for estimating the F0 of a singing voice in mixed sound (Paper 3 below).

[Paper 1]
Fujihara, H. and Goto, M., "Three Techniques for Improving Automatic Synchronization between Music and Lyrics: Fricative Sound Detection, Filler Model, and Novel Feature Vectors for Vocal Activity Detection," Proceedings of the 2008 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2008), pp. 69-72 (2008).
[Paper 2]
Fujihara, H., Goto, M., Ogata, J., Komatani, K., Ogata, T. and Okuno, H. G., "Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals," Proc. ISM, pp. 257-264 (2006).
[Paper 3]
Fujihara, H., Goto, M. and Okuno, H. G., "歌声の統計的モデル化とビタビ探索を用いた多重奏中のボーカルパートに対する音高推定手法" (A pitch estimation method for the vocal part in polyphonic music using statistical modeling of the singing voice and Viterbi search), Transactions of the Information Processing Society of Japan (情報処理学会論文誌), Vol. 49, No. 10 (2008).

The methods in the above papers share a common approach: a sound is separated from the mixed sound using its harmonic structure as a cue and is then classified by a statistical method. Specifically, for lyric alignment, the method identifies which phoneme the sound at the singing-voice F0 (estimated by an existing method) corresponds to; for singing-voice F0 estimation, it classifies each frequency-component candidate at each time as singing voice or other sound. However, these methods have the following two problems.

[Separation problem]
Recognition performance for the singing voice depended heavily on the performance of the preceding separation stage. Errors in F0 estimation, and in selecting harmonic components from the spectrum during separation, therefore degraded performance. In addition, the background noise (sound other than the separation target), which carries information such as the S/N ratio between singing voice and noise and the degree of distortion of the singing voice, was discarded in the separation process.

[Spectral envelope estimation problem]
Conventional methods recognized a singing voice by estimating the spectral envelope from the harmonic structure of the separated singing voice and computing distances between spectral envelopes. However, the harmonic components of a harmonic structure can be regarded as samples of the original spectral envelope taken at integer multiples of F0, so the original envelope cannot, in principle, be recovered uniquely from a given harmonic structure. It was therefore difficult to compute the distance accurately when the gaps between harmonic components are wide, as for a sound with a high F0.

In the present embodiment, the singing voice is not separated and the spectral envelope is not estimated from a single harmonic structure. Instead, the observed spectrum is modeled probabilistically as it is, with the accompaniment superimposed. In addition, the learning process uses a plurality of harmonic structures to estimate the spectral envelope more accurately.

Specifically, as shown in FIG. 4(c) and FIG. 4(d), the spectrum of a mixed sound signal containing a singing voice is assumed to be generated from a set of probability distributions. The power appearing in each frequency bin (frequency analysis width) of the spectrum follows some probability distribution, and this distribution differs from bin to bin. Assuming additivity of spectra, the stochastic spectrum template can be expressed as the sum, on a linear axis, of a speech (singing voice) spectrum template representing the singing voice [FIG. 4(a)] and a noise spectrum template representing all other sound [FIG. 4(b)]. Introducing weight parameters (gain adjustment) and adding the two spectrum templates with weights makes it possible to represent spectra with various S/N ratios. Further, assuming a source-filter model, the speech (singing voice) spectrum template can be regarded as the product of a speech (singing voice) envelope template (Vocal Envelope Template) [FIG. 3(a)], which represents the spectral envelope, and a driving sound source function (Harmonic Filter) [FIG. 3(b)], which represents the harmonic structure of the excitation source. The shape of the driving sound source function is controlled by the value of the fundamental frequency F0 as a parameter.

Once the parameters of the probability model are fixed, namely the F0 of the driving sound source function and the weights of the speech (singing voice) spectrum template and the noise spectrum template, the likelihood (distance) of the observed spectrum under the probability model (stochastic spectrum template) can be computed. With this model, as shown in FIG. 5, the speech (singing voice) envelope templates Y′_{v,f} representing the individual phonemes [phoneme /a/, phoneme /b/, ..., phoneme /o/, ...] are learned in advance. Phoneme recognition is then performed by selecting the envelope template Y′_{v,f} with the maximum likelihood (smallest distance) for the observed spectrum, and F0 estimation is performed by finding the F0 value with the maximum likelihood (smallest distance). As described for the first embodiment with reference to FIG. 3, the envelope template of each phoneme is multiplied by the driving sound source function H(f; f₀) to produce a plurality of speech (singing voice) spectrum templates (target sound spectrum templates) Y_{v,f}, one set per phoneme. Next, as shown in FIG. 4, these speech (singing voice) spectrum templates Y_{v,f} are combined with the noise spectrum templates Y_{n,f} to create a plurality of stochastic spectrum templates Y_f corresponding to the speech (singing voice) spectrum templates.

To determine the weights of the speech (singing voice) spectrum template and the noise spectrum template that make up the stochastic spectrum template of each phoneme, the gain Gs (weight parameter g_v) of the target sound spectrum template and the gain Gn (weight parameter g_n) of the noise spectrum template in each stochastic spectrum template are varied, giving a plurality of gain-modified spectrum templates Y′_f for each phoneme. The distance (likelihood) between each of these and the observed spectrum y(f) is computed, and the gain-modified spectrum template with the smallest distance is taken as the minimum-distance gain-modified spectrum template Y′_{f,min} for that stochastic spectrum template. In other words, among the gain-modified spectrum templates Y′_f for a phoneme, the one with the smallest distance becomes Y′_{f,min} for that phoneme. Y′_{f,min} is obtained for the stochastic spectrum templates of all phonemes, and the phoneme corresponding to the Y′_{f,min} with the smallest distance among them is adopted as the recognized phoneme.

FIG. 6 shows an example of the algorithm of a program that computes, on a computer, the distance (likelihood) between the gain-modified spectrum template Y′_f and the observed spectrum y(f). In step ST11, initial values are set for the fundamental frequency F0, the gain of the speech spectrum template, and the gain of the noise spectrum template. In step ST12, the optimum gains and F0 are estimated by a nonlinear optimization method such as a quasi-Newton method. In step ST13, the likelihood for the resulting gains and F0 value is computed.

FIG. 7 shows an example of an algorithm for estimating the fundamental frequency F0 in step ST12. In step ST21, a plurality of F0 candidates are estimated from the observed spectrum. Known methods can be used for this, such as using the frequency peaks of the observed spectrum or estimating from the response of a comb filter.
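Step ST21 can be illustrated with a simple harmonic-summation score, which behaves like an idealized comb filter response: for each trial F0, the spectrum values at its first few harmonics are summed, and the highest-scoring trial values are kept as candidates. The grid of trial F0 values, the number of harmonics, and the toy spectrum are assumptions of this sketch, which is one possible form of the known methods mentioned above.

```python
def f0_candidates(spectrum, bin_hz, f0_grid_hz, n_harmonics=5, top_k=3):
    """Rank trial F0 values by the summed spectrum power at their harmonics."""
    scores = []
    for f0 in f0_grid_hz:
        score = 0.0
        for h in range(1, n_harmonics + 1):
            k = int(round(h * f0 / bin_hz))   # bin nearest to the h-th harmonic
            if k < len(spectrum):
                score += spectrum[k]
        scores.append((score, f0))
    scores.sort(reverse=True)
    return [f0 for _, f0 in scores[:top_k]]

# Toy spectrum: 10 Hz bins, flat floor with peaks at the harmonics of 220 Hz.
bin_hz = 10.0
spectrum = [0.01] * 200
for harmonic_hz in (220, 440, 660, 880, 1100):
    spectrum[harmonic_hz // 10] = 1.0

candidates = f0_candidates(spectrum, bin_hz, [110.0, 220.0, 330.0, 440.0])
```

The sub-octave and octave of the true F0 also score fairly well, which is why several candidates are kept and each is later refined by the likelihood-based optimization.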

In step ST22, loop 1 over all F0 candidates begins. In step ST23, loop 2 over all speech spectrum templates begins. In step ST24, loop 3 over all noise spectrum templates begins. In step ST25, with the F0 candidate as the initial value, the optimum F0 is computed from the likelihood of the observed spectrum given the speech spectrum template and the noise spectrum template, and is stored. The optimum F0 is computed using Step 0 to Step 3 of the "parameter estimation" description given later, with the F0 candidate used as the initial F0 value in Step 0. Loop 3 ends in step ST26 and loop 2 ends in step ST27. In step ST28, the F0 value and likelihood for which the likelihood was highest within loops 2 and 3 are stored. Loop 1 ends in step ST29, and in step ST30 the F0 with the highest likelihood in loop 1 is output as the estimation result.

FIG. 8 is a flowchart showing an example of the algorithm of a program used when phoneme estimation is performed on a computer. In step ST31, loop 1 over all phonemes begins. In step ST32, loop 2 over all speech spectrum templates of the current phoneme begins. In step ST33, loop 3 over all noise spectrum templates begins. In step ST34, the likelihood of the observed spectrum given the speech spectrum template and the noise spectrum template is computed and stored. Loop 3 ends in step ST35 and loop 2 ends in step ST36. In step ST37, the highest likelihood found in loops 2 and 3 is stored as the likelihood of the current phoneme. Loop 1 ends in step ST38, and in step ST39 the phoneme with the highest likelihood in loop 1 is output as the estimation result.
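The nested loops of FIG. 8 can be written down directly, as in the sketch below. The likelihood itself is passed in as a function; `toy_log_likelihood` is a stand-in invented for this example (a negative squared log-spectral distance with both gains fixed to 1) and is not the likelihood of the embodiment, which also optimizes the gains and F0. The template values are toy data.

```python
import math

def recognize_phoneme(observed, speech_templates, noise_templates, log_likelihood):
    """Return (phoneme, score) with the highest likelihood over all template pairs."""
    best_phoneme, best_score = None, float("-inf")
    for phoneme, templates in speech_templates.items():       # loop 1 (ST31 to ST38)
        phoneme_score = float("-inf")
        for speech in templates:                               # loop 2 (ST32 to ST36)
            for noise in noise_templates:                      # loop 3 (ST33 to ST35)
                score = log_likelihood(observed, speech, noise)    # ST34
                phoneme_score = max(phoneme_score, score)          # kept for ST37
        if phoneme_score > best_score:
            best_phoneme, best_score = phoneme, phoneme_score
    return best_phoneme, best_score                            # ST39

def toy_log_likelihood(observed, speech, noise):
    """Hypothetical stand-in: negative squared log distance, both gains fixed to 1."""
    return -sum(
        (math.log(y) - math.log(s + n)) ** 2
        for y, s, n in zip(observed, speech, noise)
    )

speech_templates = {
    "a": [[2.0, 4.0, 1.0], [1.0, 1.0, 1.0]],
    "o": [[4.0, 1.0, 3.0]],
}
noise_templates = [[1.0, 1.0, 1.0], [0.5, 2.0, 0.5]]
observed = [3.0, 5.0, 2.0]
phoneme, score = recognize_phoneme(
    observed, speech_templates, noise_templates, toy_log_likelihood
)
```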

In this specific embodiment, the state in which noise (accompaniment) is mixed with the speech (singing voice) is modeled as it is, without separating the speech. Given that humans can recognize speech (a singing voice) without separating it, this is also a natural approach from the standpoint of human perception. Because the method of the present embodiment estimates the S/N ratio between speech (singing voice) and noise (accompaniment) for each frame, the system is robust to fluctuations in the noise (accompaniment). Preparing a plurality of noise spectrum templates and selecting the maximum-likelihood one makes the system more robust still.

In this embodiment, the spectral envelope is not estimated from a single harmonic structure, so the system is also robust for sounds with a high F0. Furthermore, this embodiment can easily be extended to other sounds and sound sources, such as unvoiced consonants that have no F0, by preparing voice (singing voice) spectrum templates that do not use the driving sound source function.

[Formulation]
A specific formulation of the method and system described above is given below. To implement the method of the present invention on a computer, the following three points must be made concrete.

(1) How to represent the probabilistic spectrum template.

(2) How to compute the sum of two spectrum templates.

(3) How to optimize the parameters, namely F0 and the gains.

These three points are addressed as follows.

(1) A lognormal distribution is used as the distribution of each frequency bin of the probabilistic spectrum template.

(2) The sum of random variables that follow lognormal distributions is assumed to follow a lognormal distribution as well.

(3) The parameters are optimized by a quasi-Newton method.

[Probabilistic spectrum template]
The spectrum y(f) of a mixed sound containing voice (singing voice) is assumed to be generated from a random variable Y_f, where f denotes frequency on a logarithmic axis and s denotes spectral power on a logarithmic axis. This (set of) random variable(s) Y_f is the probabilistic spectrum template described above.

Next, Y_f is assumed to decompose into two different spectrum templates according to the following equation.

Here, Y_v,f represents the spectrum of the voice (singing voice) and is the voice (singing voice) spectrum template described above. Y_n,f represents the spectrum of sounds other than the voice (noise or accompaniment) and is the noise spectrum template described above. g_v and g_n are the weights of the voice spectrum template and the noise spectrum template; changing them changes the S/N ratio between the voice (singing voice) and the other sounds. Equation (1) assumes that spectra are additive on the linear axis. Y_v,f and Y_n,f are assumed to follow normal distributions (on the logarithmic frequency axis), as in the following equations.

Here, N(y; μ, σ²) is the normal distribution with mean μ and variance σ². Furthermore, by assuming a source-filter model, the voice (singing voice) Y_v,f having a harmonic structure is assumed to be expressible as the sum, on the logarithmic axis, of a probabilistic model of the envelope and a filter representing the harmonic structure, as in the following equation (FIG. 3).

Here, Y′_v,f ~ N(y; μ′_v,f, σ²_v,f) is a random variable representing the spectral envelope of the voice (singing voice) and is the voice (singing voice) envelope template described above. H(f; f₀) represents a filter whose F0 is f₀ and is called the driving sound source function. The driving sound source function H(f; f₀) is not a random variable. To summarize, the probabilistic spectrum template Y_f, which represents a spectrum in which voice (singing voice) and noise (accompaniment) are mixed, is expressed as follows.
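The equations of this section are not reproduced in this text. One reading that is consistent with the surrounding definitions is written out below. It is an assumption rather than the published equations, and it deliberately omits equation numbers: the gains g_v and g_n are taken to be offsets on the logarithmic axis (as suggested by the expansion point μ_v,f + g_v, μ_n,f + g_n used in the next section), and additivity is taken on the linear axis.

```latex
\begin{align*}
\exp(Y_f) &= \exp(Y_{v,f} + g_v) + \exp(Y_{n,f} + g_n) \\
Y_{v,f} &\sim \mathcal{N}\!\left(y;\ \mu_{v,f},\ \sigma^2_{v,f}\right), \qquad
Y_{n,f} \sim \mathcal{N}\!\left(y;\ \mu_{n,f},\ \sigma^2_{n,f}\right) \\
Y_{v,f} &= Y'_{v,f} + H(f; f_0), \qquad
Y'_{v,f} \sim \mathcal{N}\!\left(y;\ \mu'_{v,f},\ \sigma^2_{v,f}\right)
\end{align*}
```

Under this reading, μ_v,f = μ′_v,f + H(f; f₀), because H(f; f₀) is deterministic and only shifts the mean of the envelope template.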

[Approximation of the addition of spectrum templates]
Because the probabilistic spectrum template Y_f given by equation (1) is difficult to compute analytically, it is approximated using a normal distribution. Consider the following function l(x₁, x₂).

The second-order Taylor expansion of the above function around (x₁, x₂) = (μ_v,f + g_v, μ_n,f + g_n) is computed as follows.

Here, C is a constant independent of x₁ and x₂. When the parameters g_v, g_n, and f₀ are fixed, equation (12) is a weighted sum of x₁ and x₂, so the probabilistic spectrum template Y_f is expressed as follows.

Y_f is then written as follows.
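The approximation can be illustrated for a single frequency bin. The sketch below rests on assumptions that are not confirmed by this text: it takes l(x₁, x₂) = log(exp(x₁) + exp(x₂)), keeps only the linear terms of the expansion (which is what makes the result a weighted sum of x₁ and x₂), and treats the voice and noise components as independent. The function names are hypothetical.

```python
import math


def log_add(x1, x2):
    """Assumed form of l(x1, x2) = log(exp(x1) + exp(x2)), computed stably."""
    hi, lo = max(x1, x2), min(x1, x2)
    return hi + math.log1p(math.exp(lo - hi))


def approx_mixture_bin(mu_v, var_v, g_v, mu_n, var_n, g_n):
    """Normal approximation (mean, variance) of Y_f for one frequency bin.

    Linearizes l around (mu_v + g_v, mu_n + g_n); the two partial derivatives
    act as weights that sum to one.
    """
    a = mu_v + g_v
    b = mu_n + g_n
    mean = log_add(a, b)
    w_v = math.exp(a - mean)  # dl/dx1 at the expansion point
    w_n = math.exp(b - mean)  # dl/dx2 at the expansion point
    var = w_v ** 2 * var_v + w_n ** 2 * var_n  # assumes independence
    return mean, var
```

When one component is much louder than the other, its weight approaches one and the approximated bin simply follows that component, which matches the intuition that the louder source dominates the log spectrum.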

[Estimation of phoneme and F0]
To recognize the phoneme and F0 with this model, a voice (singing voice) envelope template θⁱ_v representing each phoneme i and a noise spectrum template θ_n must first be prepared. Given an observed spectrum y(f), the phoneme i and F0 contained in y(f) can be estimated by the following equation.

Here, μ_f and σ²_f are defined by equations (16) and (17), respectively.

[Parameter optimization by the quasi-Newton method]
The parameters θ = (g_v, g_n, f₀) needed to evaluate equation (19) are optimized with a quasi-Newton method based on the BFGS (Broyden-Fletcher-Goldfarb-Shanno) formula. The quasi-Newton method is a kind of hill-climbing method and updates the parameters iteratively. In this model, the objective function Q(θ) to be minimized is given by the following equation.

Here, y(f) is the observed spectrum.

Newton's method approximates the objective function by its second-order Taylor expansion around the current parameters and updates the parameters successively. However, Newton's method assumes that the Hessian matrix of second derivatives required for the second-order Taylor expansion is positive definite, and this assumption did not always hold. The quasi-Newton method, by contrast, does not compute the Hessian directly; it approximates it numerically from the change in the first derivatives caused by the parameter update, as in the following equation, which enables stable optimization.

Here, k denotes the iteration number.

The parameters can be optimized as follows.

Step 0: Set k = 0 and B^(0) = I, and initialize θ^(0).

Step 1: Update θ^(k+1) by the following equation.

The value of α^(k) is determined by a line search.

Step 2: Update B^(k+1) using equation (21).

Step 3: Return to Step 1.
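Steps 0 to 3 can be sketched with a textbook BFGS iteration. Because the update equations are not reproduced in this text, the sketch makes its own choices, all of which are assumptions: it maintains an approximation `H` of the inverse Hessian (the B of the text may instead approximate the Hessian itself), it uses Armijo backtracking as the line search for α, and it stops when the gradient is small. The function name and the toy objective in the usage note are hypothetical.

```python
def bfgs_minimize(f, grad, theta0, iters=100, tol=1e-8):
    """Minimize f with a plain BFGS iteration (inverse-Hessian form)."""
    n = len(theta0)
    theta = list(theta0)
    # Step 0: start from the identity matrix.
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(theta)
    for _ in range(iters):
        if max(abs(v) for v in g) < tol:
            break
        # Step 1: search direction d = -H g, step size by backtracking.
        d = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        slope = sum(gi * di for gi, di in zip(g, d))
        f0, alpha = f(theta), 1.0
        while alpha > 1e-12 and (
            f([t + alpha * di for t, di in zip(theta, d)])
            > f0 + 1e-4 * alpha * slope
        ):
            alpha *= 0.5
        new_theta = [t + alpha * di for t, di in zip(theta, d)]
        new_g = grad(new_theta)
        # Step 2: BFGS update from the parameter change s and gradient change y.
        s = [a - b for a, b in zip(new_theta, theta)]
        y = [a - b for a, b in zip(new_g, g)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:  # skip the update if the curvature condition fails
            rho = 1.0 / sy
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(yi * hyi for yi, hyi in zip(y, Hy))
            for i in range(n):
                for j in range(n):
                    H[i][j] += ((1.0 + rho * yHy) * rho * s[i] * s[j]
                                - rho * (Hy[i] * s[j] + s[i] * Hy[j]))
        theta, g = new_theta, new_g  # Step 3: continue with the next iteration
    return theta
```

As a usage example, minimizing the toy function (x − 1)² + 10(y + 2)² from the starting point (0, 0) with its analytic gradient drives the iterate toward (1, −2). In the method described here, the objective would be Q(θ) with θ = (g_v, g_n, f₀).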

[Estimation of the singing voice envelope template]
The voice (singing voice) envelope template Y_v,f and the noise spectrum template Y_n,f in equation (4) are estimated from training data. In general, the spectrum of a voice (singing voice) with a harmonic structure can be regarded as the true spectral envelope sampled at frequencies that are integer multiples of the fundamental frequency. An observed spectrum (harmonic structure) can therefore correspond to many possible underlying spectral envelopes (a one-to-many relationship), so it is difficult to estimate the true spectral envelope from the harmonic structure of a single frame. In this embodiment, a reliable spectral envelope is therefore estimated by using the harmonic structures of multiple frames with different F0 values. In addition, because the spectral envelope is estimated as a probability distribution rather than being determined uniquely, the estimate is robust against variations in the singing voice and against differences between training data and test data. When the underlying spectral envelope is estimated from multiple harmonic structures, the difference in volume between frames must be taken into account. This embodiment solves this problem by introducing a parameter that normalizes the volume of each frame and estimating it as an additional unknown parameter.

[Mixed regression distribution]
A mixed regression model that uses linear regression as each regression component is introduced as the model representing a spectrum template. This mixed regression model is described, for example, in Jacobs, R.J., Jordan, M., Nowlan, S.J. and Hinton, G.E., "Adaptive mixtures of local experts", Neural Computation, Vol. 3, pp. 79-87 (1991). As described above, in this embodiment a spectrum template must be defined using a model in which the distribution of log power at a given frequency f is a normal distribution, and this model satisfies that requirement. In the mixed regression model, the mean μ_v,f and variance σ²_v,f of the spectrum template are expressed as follows.

Here, G_m(f; ψ_m, μ_m, σ²_m) is the output of the gate function, for which the normalized Gaussian function defined by the following equation was used. This normalized Gaussian function is described in Xu, L., Jordan, M.I. and Hinton, G.E., "An alternative model for mixtures of experts", Advances in Neural Information Processing Systems 7, pp. 633-640 (1994).

In this model, the unknown parameters are {ψ_m, μ_m, σ²_m, a_m, b_m, β²_m}, and they can be estimated by the EM (Expectation-Maximization) method, subject to ψ_m ≥ 0 and Σ_m ψ_m = 1.
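The gate function and the resulting template statistics can be sketched as below. The equations are not reproduced in this text, so the forms used here are assumptions: the gate is taken to be ψ_m N(f; μ_m, σ²_m) normalized over m, the mean is taken to be the gate-weighted sum of the linear regressions a_m f + b_m, and the variance is taken to be the gate-weighted sum of β²_m. The function names are hypothetical, and the published variance expression may contain further terms.

```python
import math


def gate_outputs(f, psi, mu, var):
    """Normalized Gaussian gates G_m(f); the returned weights sum to one.

    Assumes f is close enough to some component that the sum does not underflow.
    """
    weighted = [
        p * math.exp(-0.5 * (f - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        for p, m, v in zip(psi, mu, var)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


def template_mean_var(f, psi, mu, var, a, b, beta2):
    """Assumed mean and variance of the spectrum template at log frequency f."""
    g = gate_outputs(f, psi, mu, var)
    # Each component contributes its linear regression a_m * f + b_m.
    mean = sum(gm * (am * f + bm) for gm, am, bm in zip(g, a, b))
    # Assumption: the variance is the gate-weighted sum of beta^2_m.
    variance = sum(gm * b2 for gm, b2 in zip(g, beta2))
    return mean, variance
```

Because each gate is localized around its own μ_m, each linear regression only shapes the envelope in its own frequency region, and the gates blend neighboring regressions smoothly.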

[Parameter estimation]
Suppose that, for the harmonic structure s_i (i = 1, ..., I) of one frame given as training data, the frequency f_i,h of the h-th harmonic and its log power y_i,h are expressed by the following equation.

The likelihood function to be maximized is then given by the following equation.

Here, k_i is an offset parameter that normalizes the volume of each harmonic structure. Because it is difficult to optimize the parameters of the mixed regression model and k_i simultaneously, they are updated iteratively in alternation.

The parameters are estimated by the following procedure.

Step 0: Set k_i = 0 and give initial values to the other parameters.

Step 1: Estimate the parameters of the mixed regression model by the EM method.

Step 2: Update k_i by the following equation.

Step 3: Return to Step 1.

FIG. 9 shows an example of the parameter estimation process for the mixed regression model. In each panel, the thick center line represents the mean of the mixed regression model, and the two thin lines above and below it represent the standard deviation. The fine dots in the background represent the harmonic components of the training data, and the multiple peaks at the bottom of each panel represent the gate functions G_m(f; ψ_m, μ_m, σ²_m). The figure shows that, as the updates are repeated, the offset parameter k_i for each harmonic structure of the training data is optimized and a regression curve with smaller variance is estimated. The noise spectrum template can be estimated in the same way by regarding s_i (i = 1, ..., I) as the spectrum itself rather than as a harmonic structure.

FIG. 10 is a flowchart of the algorithm of a program used when this parameter estimation is performed on a computer. First, in step ST41, the parameters are initialized. The training data, namely a plurality of harmonic structures (the F0 and power of each harmonic), are used for the initialization. Next, in step ST42, loop 1 is started with t = 1. In step ST43, the membership probability of each harmonic structure of the training data with respect to each mixed regression model is calculated using the current offset parameters and the parameters of each mixed regression model. In step ST44, the parameters of each mixed regression model are estimated by the EM algorithm using the current offset parameters and the membership probabilities. In step ST45, the offset parameters are updated. In step ST46, it is determined whether t has exceeded a predetermined count. If Yes, the process ends in step ST48; if No, loop 1 is repeated.

In the above embodiment, it is assumed that the learning target sound signals and the learning noise signals are each obtained separately. However, such signals are not always easy to obtain. In that case, both the target sound spectrum template and the noise spectrum template can be estimated from learning mixed sound signals. This estimation can be realized by changing the configuration of the template generation unit 2 in FIG. 1. A learning mixed sound signal is a mixture of a signal of the type of sound to which the target sound belongs and a signal of sound corresponding to noise. For example, if the mixed sound signal under observation contains a female singing voice, mixed sound signals containing a female singing voice are used as the one or more learning mixed sound signals.

Specifically, when templates are estimated from learning mixed sounds, the voice envelope template and the noise spectrum template must be estimated simultaneously. FIG. 11 shows the algorithm of a program used when the template generation unit 2 is implemented on a computer. In step ST51, the parameters are initialized. Assume that I observed spectra y₁(f), ..., y_i(f), ..., y_I(f) have been observed. Let the parameters of the target sound (voice) spectrum template to be estimated be θ_v = {ψ_v,m, μ_v,m, σ²_v,m, a_v,m, b_v,m, β²_v,m}, and let the parameters of the noise spectrum template be θ_n = {ψ_n,m, μ_n,m, σ²_n,m, a_n,m, b_n,m, β²_n,m}. The target sound spectrum template for the i-th spectrum, after the driving sound source function has been added, can be expressed as follows.

Here, f₀(i), the F0 of the i-th observed spectrum, is assumed to be known for all i.

In the previous embodiment, the sum of lognormal distributions was approximated using a first-order Taylor expansion. However, the resulting equations (15) to (17) are complicated, which makes it difficult to optimize the target sound (voice) spectrum template θ_v and the noise spectrum template θ_n. This embodiment therefore takes the approach of computing the sum of lognormal distributions exactly according to its definition and then estimating the parameters approximately. Writing the probability density function of the combined spectrum template as p_i,f(y; θ_v, θ_n, g_i,v, g_i,n) (the subscript i is added because the shape of the density differs for each observed spectrum i), the objective function L is expressed as follows.

Here, g_i,v and g_i,n are offset parameters (weights) that normalize the volume between frames, like the offset parameter k_i of the previous embodiment. They also serve to adjust the SIR (Signal-to-Interference Ratio) between the voice (singing voice) envelope template and the noise spectrum template. In the actual implementation, the continuous wavelet transform is computed at discrete points on the frequency axis, so the integral over f is replaced by a sum.

The parameters to be estimated are {g_i,v, g_i,n, θ_v, θ_n}. Because it is difficult to optimize all of them simultaneously, they are optimized in turn. From step ST52, the weights g_i,n and the noise spectrum template θ_n are fixed, and the weights g_i,v and the target sound spectrum template θ_v are optimized according to equation (31); from step ST56, the weights g_i,v and the target sound spectrum template θ_v are fixed, and the weights g_i,n and the noise spectrum template θ_n are optimized according to equation (32). These two optimizations are repeated alternately. First, in step ST52, when g_i,n and θ_n are fixed, the inside of the sum in equation (31) can be regarded as the computation of an expected value. The computation of the expected value over the sample U (which involves an integral over a normal distribution) is therefore approximated by a sum obtained through sampling. Here, sampling means approximating the integral over a distribution by a sum over many points, as illustrated schematically in FIG. 12. This sampling makes approximate optimization of g_i,v and θ_v possible. Specifically, the normal distribution N(U; μ_n,f + g_i,n, σ²_n,f) for the learning noise sound is truncated at U = y_i(f), giving a singly truncated normal distribution whose domain is bounded above. When R samples (U_i,1,f, ..., U_i,r,f, ..., U_i,R,f) are drawn from this distribution for each i and f, the objective function L can be approximated as follows.
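Drawing samples from a normal distribution truncated from above at the observed value can be sketched as below. This is a minimal illustration under stated assumptions: inverse-CDF sampling is one of several possible ways to sample the truncated distribution and is not stated in the text, the function names are hypothetical, and the weights π_i,r,f of the approximated objective are not modeled. The second function evaluates log(exp(y) − exp(U)), the log power left for the target sound once a noise sample U is subtracted on the linear axis.

```python
import math
import random
from statistics import NormalDist


def sample_noise_below(y, mean, std, count, rng):
    """Draw `count` samples from N(mean, std^2) truncated to values below y."""
    dist = NormalDist(mean, std)
    upper = dist.cdf(y)  # probability mass below the truncation point
    samples = []
    for _ in range(count):
        # Uniform draw restricted to the mass below y, kept inside (0, 1).
        p = min(max(rng.random() * upper, 1e-12), 1.0 - 1e-12)
        # Guard against rounding so that every sample is strictly below y.
        samples.append(min(dist.inv_cdf(p), y - 1e-9))
    return samples


def implied_target_log_power(y, u):
    """log(exp(y) - exp(u)) for u < y, computed without overflow."""
    return y + math.log1p(-math.exp(u - y))
```

For example, `sample_noise_below(2.0, 1.0, 1.0, 300, random.Random(0))` returns 300 noise samples below the observed log power 2.0; the truncation guarantees that the remaining target power is positive, so its logarithm is defined.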

In a specific example, R is set to 300. When the weights g_i,n and the noise spectrum template θ_n are fixed, π_i,r,f and log(exp(y_i(f)) − exp(U_i,r,f)) are constants, so the weights g_i,v and the target sound spectrum template θ_v can be optimized using equation (33) (steps ST51 to ST55). The same applies when the weights g_i,v and the target sound spectrum template θ_v are fixed: an equation analogous to equation (33) is derived from equation (31) by sampling, and the weights g_i,n and the noise spectrum template θ_n are optimized (steps ST56 to ST59).

However, because equation (33) has the form of the logarithm (log) of a sum (Σ), it is still difficult to optimize directly. Equation (33) is therefore optimized iteratively by a method similar to the EM algorithm. For convenience, the parameters to be estimated are written λ = {g_i,v, θ_v}, and the parameter estimates from the previous iteration are written λ′. First, consider the following variable z_i,r,f.

Let z′_i,r,f denote z_i,r,f computed using λ′ (step ST53). Then, with z_i,r,f fixed, the following new objective function Q₁(λ|λ′) is defined.

The true objective function L can then be maximized by alternating between optimizing this objective function with respect to λ and recomputing z_i,r,f with the optimized λ (that is, by repeating steps ST53 to ST55). At least one such iteration suffices. A close look at equation (36) shows that π_i,r,f plays no role in the optimization. Optimizing the following function Q₂(λ|λ′) is therefore equivalent to optimizing Q₁(λ|λ′).

Furthermore, apart from the presence of the constant term z, Q₂ has the same form as equation (27). The Q₂ function of equation (37) is therefore optimized (step ST54). That is, the Q₂ function can be optimized in the same way as in the template estimation from separate learning target sound signals and learning noise signals described in the previous embodiment.

The same operation is performed with the weights g_i,v and the target sound spectrum template θ_v fixed: an equation analogous to equation (33) is derived from equation (31) by sampling, and the weights g_i,n and the noise spectrum template θ_n are optimized (steps ST56 to ST59). The process ends once steps ST52 to ST59 have been repeated a predetermined number of times (step ST60). At least one iteration suffices.

In summary, the parameters are estimated by the following procedure.

Step ST51: Set g_i,v = 0 and g_i,n = 0, and give initial values to the other parameters as described later.

Step ST52: With g_i,n and θ_n fixed, sample U in equation (31).

Step ST53: Using the sampled U and the current parameters g_i,v and θ_v, calculate z_i,r,f of equation (35).

Step ST54: Using z_i,r,f calculated in step ST53, optimize the Q₂ function of equation (37). An iterative optimization method is used for this optimization.

Step ST55: If the number of iterations of steps ST52 to ST54 has exceeded the specified count, proceed to step ST56; otherwise, return to step ST52.

Step ST56: With g_i,v and θ_v fixed, sample U in equation (31).

Step ST57: Using the sampled U and the current parameters g_i,n and θ_n, calculate z_i,r,f of equation (35).

Step ST58: Using the calculated z_i,r,f, optimize the Q₂ function of equation (37). An iterative optimization method is also used for this optimization.

Step ST59: If the number of iterations of steps ST57 to ST58 has exceeded the specified count, proceed to step ST60; otherwise, return to step ST57.

Step ST60: If the number of iterations of steps ST52 to ST59 has exceeded the specified count, end the process; otherwise, return to step ST52.
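The control flow of steps ST52 to ST60 can be sketched as a skeleton. The three callbacks `sample_u`, `compute_z`, and `optimize_q2` are hypothetical placeholders for the sampling, the computation of z_i,r,f, and the Q₂ optimization; their signatures are assumptions made for illustration, and initialization (ST51) is assumed to have been done by the caller. The loop boundaries follow the step list as written, including the return to ST57 rather than ST56 in the second inner loop.

```python
def run_alternating_estimation(sample_u, compute_z, optimize_q2,
                               outer_iters=1, inner_iters=1):
    """Alternate between updating the target side and the noise side.

    sample_u(fixed_side)        -> samples of U with that side held fixed
    compute_z(samples, side)    -> z values for the side being updated
    optimize_q2(z_values, side) -> updates the gains and template of that side
    """
    for _ in range(outer_iters):                 # ST60: outer repetition
        for _ in range(inner_iters):             # ST55: repeat ST52 to ST54
            u = sample_u("noise")                # ST52: noise side fixed
            z = compute_z(u, "target")           # ST53
            optimize_q2(z, "target")             # ST54
        u = sample_u("target")                   # ST56: target side fixed
        for _ in range(inner_iters):             # ST59: repeat ST57 to ST58
            z = compute_z(u, "noise")            # ST57
            optimize_q2(z, "noise")              # ST58
```

Passing the work in as callbacks keeps the skeleton independent of how the sampling and the Q₂ optimization are realized; the text notes that a single repetition of each loop is already sufficient.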

The initial values of the target sound spectrum template are obtained, for example when the target sound is a song, from the acoustic signal of a solo performance by a singer other than the singer of the target sound. The initial values of the noise spectrum template are obtained from a music acoustic signal that contains no singing voice (for example, a karaoke track). In both cases, the parameter values estimated as in the previous embodiment may be used.

According to the present invention, a spectrum in which the target sound (speech, singing voice, etc.) is mixed with other noise (accompaniment, etc.) can be recognized as it is, without separation. Compared with the conventional approach of separating the constituent sounds of a mixture and then recognizing the separated sound, the present invention also makes use of information about the background noise and can therefore improve performance. In addition, because the S/N ratio of the mixed sound signal is estimated for each frame, the method is robust against fluctuations in the noise.

DESCRIPTION OF SYMBOLS
1 Mixing ratio estimation system
2 Template generation unit
3 Spectrum template storage unit
5 Target sound spectrum template storage unit
7 Noise spectrum template storage unit
9 Probabilistic spectrum template creation unit
11 Combining unit
13 Probabilistic spectrum template storage unit
14 Observed spectrum acquisition unit
15 Determination unit
17 Selection unit
19 Distance calculation unit
21 Temporary storage unit
23 Finalizing unit
25 Estimation unit
27 Estimation result storage unit
29 Phoneme recognition unit

Claims

A method for estimating a mixing ratio of a mixed sound signal by using a computer to estimate a mixing ratio between a target sound signal and a noise signal included in one frame signal obtained discretely from the mixed sound signal,
One or more target sound spectrum templates showing the relationship between the frequency components of one or more learning target sound signals and the probability distribution of the power spectrum, and the relationship between the frequency components of one or more learning noise signals and the probability distribution of the power spectrum. Prepare one or more noise spectrum templates,
Creating one or more stochastic spectrum templates by combining the one or more target sound spectrum templates and the one or more noise spectrum templates;
Obtaining an observation spectrum in the one frame from the mixed sound signal;
Distance between the observed spectrum and a plurality of gain-change spectrum templates obtained by changing the gain of the one or more target sound spectrum templates and the gain of the one or more noise spectrum templates constituting the one or more stochastic spectrum templates Is determined as a minimum distance gain change spectrum template, and the mixing ratio is estimated based on the gain of the minimum distance gain change spectrum template and the gain of the noise spectrum template,
Wherein, when the target sound signal is a voiced sound signal having a harmonic structure, the target sound spectrum template is determined by the product of a driving sound source function, which indicates the frequency components of a standard spectrum of the harmonic structure of the voiced sound signal, and a speech envelope template,
When the target sound signal is an unvoiced sound signal, the speech envelope template is used as the target sound spectrum template, and
The speech envelope template is a template indicating the distribution of envelopes, each envelope connecting a plurality of power peaks in a frequency spectrum waveform that shows the relationship between frequency components and power and is obtained by frequency analysis of a learning sound signal for a target voiced or unvoiced sound.
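The gain search recited in claim 1 can be illustrated with a minimal Python sketch. Several points here are assumptions made for illustration and are not taken from the claims: the templates are reduced to a mean power value per frequency bin, the gain-changed template is formed as a power-domain sum of the scaled target and noise templates, the distance is a squared log-spectral distance, the gains are searched on a coarse grid, and the mixing ratio is reported as a power ratio in dB. All names (`mix`, `log_distance`, `estimate_mixing_ratio`) are hypothetical.

```python
import math

def mix(target, noise, g_t, g_n):
    """Gain-changed template: power-domain sum of the scaled templates (assumed)."""
    return [g_t * t + g_n * n for t, n in zip(target, noise)]

def log_distance(observed, template):
    """Squared log-spectral distance (an illustrative choice of distance)."""
    return sum((math.log(o) - math.log(m)) ** 2
               for o, m in zip(observed, template))

def estimate_mixing_ratio(observed, target, noise, gains):
    """Try every (g_t, g_n) pair on the grid and keep the closest template."""
    best = None
    for g_t in gains:
        for g_n in gains:
            d = log_distance(observed, mix(target, noise, g_t, g_n))
            if best is None or d < best[0]:
                best = (d, g_t, g_n)
    _, g_t, g_n = best
    # Mixing ratio expressed as the power ratio of the scaled templates, in dB.
    snr_db = 10.0 * math.log10((g_t * sum(target)) / (g_n * sum(noise)))
    return g_t, g_n, snr_db

# Hypothetical four-bin templates (mean power per frequency bin).
target = [4.0, 1.0, 3.0, 0.5]
noise = [1.0, 1.0, 1.0, 1.0]
# Synthetic observation built from known gains so the result can be checked.
observed = mix(target, noise, 2.0, 0.5)
gains = [0.25, 0.5, 1.0, 2.0, 4.0]
g_t, g_n, snr_db = estimate_mixing_ratio(observed, target, noise, gains)
print(g_t, g_n, round(snr_db, 2))
```

Because the observation is synthesized from gains that lie on the grid, the search recovers them exactly. A real observation would not match any grid point exactly, which is one reason a continuous optimizer may be preferable to a grid.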

The method for estimating a mixing ratio in a mixed sound signal according to claim 1, wherein both the target sound spectrum template and the noise spectrum template are estimated from a learning mixed signal.

The method for estimating a mixing ratio in a mixed sound signal according to claim 1, wherein the fundamental frequency F0 of the driving sound source function is estimated when the minimum distance gain change spectrum template is determined.
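As a rough illustration of how F0 might be estimated while the closest template is being determined, the following Python sketch builds a voiced template as the product of a harmonic function and an envelope and picks the candidate F0 with the smallest distance. The shape of the driving sound source function is not given in the claims. The Gaussian comb used here, the candidate grid, the envelope shape, and the squared log-spectral distance are all assumptions, and every name is hypothetical.

```python
import math

def harmonic_comb(freqs, f0, width=20.0, floor=1e-3):
    """Hypothetical driving source function: Gaussian bumps at multiples of f0."""
    top = max(freqs)
    n_harmonics = int(top // f0)
    comb = []
    for f in freqs:
        peaks = sum(math.exp(-0.5 * ((f - h * f0) / width) ** 2)
                    for h in range(1, n_harmonics + 1))
        comb.append(peaks + floor)  # the floor keeps the logarithm finite
    return comb

def voiced_template(freqs, f0, envelope):
    """Voiced template as the product of the comb and the envelope."""
    return [c * e for c, e in zip(harmonic_comb(freqs, f0), envelope)]

def estimate_f0(observed, freqs, envelope, candidates):
    """Return the candidate F0 whose template is closest to the observation."""
    def distance(f0):
        template = voiced_template(freqs, f0, envelope)
        return sum((math.log(o) - math.log(t)) ** 2
                   for o, t in zip(observed, template))
    return min(candidates, key=distance)

freqs = [50.0 * k for k in range(1, 41)]               # 50 Hz to 2000 Hz
envelope = [1.0 / (1.0 + f / 1000.0) for f in freqs]   # hypothetical envelope
observed = voiced_template(freqs, 200.0, envelope)     # synthetic voiced frame
f0 = estimate_f0(observed, freqs, envelope, [100.0, 150.0, 200.0, 250.0])
print(f0)
```

For an unvoiced sound the comb would simply be omitted and the envelope used directly, as claim 1 states.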

A method for estimating a mixing ratio in a mixed sound signal, in which a computer estimates the mixing ratio between a target sound signal and a noise signal contained in a one-frame signal obtained discretely from the mixed sound signal, the method comprising:
Preparing one or more target sound spectrum templates, each showing the relationship between the frequency components of one or more learning target sound signals and the probability distribution of the power spectrum, and one or more noise spectrum templates, each showing the relationship between the frequency components of one or more learning noise signals and the probability distribution of the power spectrum;
Creating one or more stochastic spectrum templates by combining the one or more target sound spectrum templates and the one or more noise spectrum templates;
Obtaining an observation spectrum in the one frame from the mixed sound signal;
Determining, as a minimum distance gain change spectrum template, the gain change spectrum template whose distance from the observed spectrum is smallest among a plurality of gain change spectrum templates obtained by changing the gain of the one or more target sound spectrum templates and the gain of the one or more noise spectrum templates constituting the one or more stochastic spectrum templates, and estimating the mixing ratio based on the gain of the minimum distance gain change spectrum template and the gain of the noise spectrum template,
Wherein the fundamental frequency F0 of the driving sound source function is estimated when the minimum distance gain change spectrum template is determined.

The method for estimating a mixing ratio in a mixed sound signal according to claim 1, wherein the probability distribution of the power spectrum is represented by a log-normal distribution at each frequency.
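One way to read a per-frequency log-normal model as a distance is through the negative log-likelihood of the observed power values. The Python sketch below takes that reading as an assumption. The claims do not state that the distance is a likelihood, and the parameter names (`mus`, `sigmas` for the mean and standard deviation of log-power in each bin) are hypothetical. It also uses the fact that scaling a log-normal variable by a gain g shifts its log-mean by ln g.

```python
import math

def lognormal_nll(power, mu, sigma):
    """Negative log-likelihood of one power value under a log-normal density."""
    z = (math.log(power) - mu) / sigma
    return math.log(power * sigma * math.sqrt(2.0 * math.pi)) + 0.5 * z * z

def template_distance(observed, mus, sigmas, log_gain=0.0):
    """Sum of per-bin negative log-likelihoods; a gain shifts every log-mean."""
    return sum(lognormal_nll(x, mu + log_gain, sigma)
               for x, mu, sigma in zip(observed, mus, sigmas))

# Hypothetical three-bin template and an observation at the template medians.
mus = [0.0, 1.0, -0.5]
sigmas = [0.5, 0.4, 0.6]
observed = [math.exp(mu) for mu in mus]

d_matched = template_distance(observed, mus, sigmas, log_gain=0.0)
d_shifted = template_distance(observed, mus, sigmas, log_gain=1.0)
print(d_matched < d_shifted)
```

With the observation placed at the medians, the unshifted template scores a lower value than the gain-shifted one, which is the behavior a minimum-distance search relies on.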

The method for estimating a mixing ratio in a mixed sound signal according to claim 3, wherein a quasi-Newton method is used for the optimization of the gain and the estimation of the fundamental frequency F0.
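The claim names a quasi-Newton method without saying which variant. The following Python sketch uses a hand-written BFGS update with a backtracking line search as one member of that family. It optimizes only the two gains and leaves F0 out for brevity. The objective (a squared log-spectral distance over log-gains), the finite-difference gradient, and all names are assumptions for illustration.

```python
import math

def objective(params, observed, target, noise):
    """Squared log-spectral distance as a function of the two log-gains."""
    g_t, g_n = math.exp(params[0]), math.exp(params[1])
    return sum((math.log(o) - math.log(g_t * t + g_n * n)) ** 2
               for o, t, n in zip(observed, target, noise))

def gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += h
        lo[i] -= h
        grad.append((f(hi) - f(lo)) / (2.0 * h))
    return grad

def bfgs(f, x0, iterations=200, tol=1e-8):
    """Minimize f with BFGS updates of an inverse-Hessian estimate H."""
    n = len(x0)
    x = list(x0)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = gradient(f, x)
    for _ in range(iterations):
        if math.sqrt(sum(v * v for v in g)) < tol:
            break
        p = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        slope = sum(gi * pi for gi, pi in zip(g, p))
        step, fx = 1.0, f(x)
        # Backtracking line search with the Armijo sufficient-decrease test.
        while step > 1e-12:
            trial = [xi + step * pi for xi, pi in zip(x, p)]
            if f(trial) <= fx + 1e-4 * step * slope:
                break
            step *= 0.5
        x_new = [xi + step * pi for xi, pi in zip(x, p)]
        g_new = gradient(f, x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:  # curvature condition keeps H positive definite
            rho = 1.0 / sy
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(yi * v for yi, v in zip(y, Hy))
            for i in range(n):
                for j in range(n):
                    H[i][j] += ((1.0 + rho * yHy) * rho * s[i] * s[j]
                                - rho * (Hy[i] * s[j] + s[i] * Hy[j]))
        x, g = x_new, g_new
    return x

target = [4.0, 1.0, 3.0, 0.5]   # hypothetical templates, as mean power per bin
noise = [1.0, 1.0, 1.0, 1.0]
observed = [2.0 * t + 0.5 * n for t, n in zip(target, noise)]
solution = bfgs(lambda p: objective(p, observed, target, noise), [0.0, 0.0])
g_t, g_n = math.exp(solution[0]), math.exp(solution[1])
print(round(g_t, 3), round(g_n, 3))
```

Optimizing over log-gains keeps both gains positive without explicit constraints. Adding F0 as a third parameter would require a template that is differentiable in F0.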

A phoneme recognition method, wherein the phoneme corresponding to the minimum distance gain change spectrum template obtained by the method for estimating a mixing ratio in a mixed sound signal according to any one of claims 1 to 6 is determined as the phoneme of the one frame, and a speech type is determined based on the continuity of the phonemes of a plurality of such frames.
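The claim does not specify how continuity across frames is evaluated. As one possible reading, the Python sketch below keeps a phoneme only when the same per-frame label persists for a minimum number of consecutive frames. The run-length rule, the `min_run` threshold, and the function name are assumptions for illustration.

```python
from itertools import groupby

def recognize(frame_phonemes, min_run=3):
    """Keep phonemes that persist for at least `min_run` consecutive frames."""
    return [label for label, run in groupby(frame_phonemes)
            if len(list(run)) >= min_run]

# Per-frame labels, e.g. the phoneme of each frame's minimum-distance template.
frames = ["a", "a", "a", "k", "a", "i", "i", "i", "i"]
print(recognize(frames))  # ['a', 'i']
```

Isolated single-frame labels such as the lone "k" are treated as spurious and dropped under this rule.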

A mixing ratio estimation system for a mixed sound signal, which estimates a mixing ratio between a target sound signal and a noise signal contained in a one-frame signal obtained discretely from the mixed sound signal, the system comprising:
A spectrum template storage unit that stores one or more target sound spectrum templates, each showing the relationship between the frequency components of one or more learning target sound signals and the probability distribution of the power spectrum, and one or more noise spectrum templates, each showing the relationship between the frequency components of one or more learning noise signals and the probability distribution of the power spectrum;
A stochastic spectrum template creating unit that creates one or more stochastic spectrum templates by combining the one or more target sound spectrum templates and the one or more noise spectrum templates;
An observation spectrum acquisition unit for acquiring an observation spectrum in the one frame from the mixed sound signal;
A determination unit that determines, as a minimum distance gain change spectrum template, the gain change spectrum template whose distance from the observed spectrum is smallest among a plurality of gain change spectrum templates obtained by changing the gain of the target sound spectrum template and the gain of the noise spectrum template constituting each of the one or more stochastic spectrum templates;
An estimation unit that estimates the mixing ratio based on the gain of the minimum distance gain change spectrum template and the gain of the noise spectrum template;
A template generation unit that generates the one or more target sound spectrum templates and the one or more noise spectrum templates;
Wherein the template generation unit is configured to determine the target sound spectrum template, when the target sound signal is a voiced sound signal having a harmonic structure, by the product of a driving sound source function, which indicates the frequency components of a standard spectrum of the harmonic structure of the voiced sound signal, and a speech envelope template, and to use the speech envelope template as the target sound spectrum template when the target sound signal is an unvoiced sound signal, and
The speech envelope template is a template indicating the distribution of envelopes, each envelope connecting a plurality of power peaks in a frequency spectrum waveform that shows the relationship between frequency components and power and is obtained by frequency analysis of a learning sound signal for a target voiced or unvoiced sound.

The mixing ratio estimation system in a mixed sound signal according to claim 8, wherein the template generation unit is configured to estimate both the target sound spectrum template and the noise spectrum template from a learning mixed signal.