JP2006084619A - Device and program for automatic detection of breathy area of speech data

Info

Publication number: JP2006084619A
Application number: JP2004267887A
Authority: JP
Inventors: Carlos Toshinori Ishii; Nick Campbell
Original assignee: Japan Science and Technology Agency; ATR Advanced Telecommunications Research Institute International
Current assignee: Japan Science and Technology Agency; ATR Advanced Telecommunications Research Institute International
Priority date: 2004-09-15
Filing date: 2004-09-15
Publication date: 2006-03-30
Anticipated expiration: 2024-09-15
Also published as: JP3851328B2

Abstract

PROBLEM TO BE SOLVED: To provide a device that automatically detects breathy areas of speech data.

SOLUTION: The breathy area automatic detecting device includes first and second filters 81 and 82, which filter the speech data with first and second frequency bands corresponding to the frequencies of the first and third formants; first and second envelope estimation parts 83 and 84, which find the envelope waveforms of the signals filtered by the first and second filters 81 and 82; a correlation arithmetic part 85, which finds the correlation between the two envelope waveforms; a comparison part 86, which compares the correlation result with a first threshold; and a decision part 89, which decides the breathy areas of the speech data.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to techniques for detecting features of speech data other than phonemic and prosodic features, and more particularly to a breath leak region automatic detection device and a breath leak region automatic detection program that automatically detect breath leak (breathy) regions in speech data.

Breath leak components in speech convey, depending on the language, important linguistic and paralinguistic information, that is, emotional information such as anger, joy, and sadness. For example, a contrast between weakly breath-leaking (breathy) phonation and modal-voice phonation on vowels occurs in many minority languages (Non-Patent Document 1). For English, relationships between various phonation styles and paralinguistic information have been reported (Non-Patent Document 2), including a relationship between strongly breath-leaking (whispery) phonation and "fear" and between breathy phonation and "sadness".

There are also reports that breathiness is used in Japanese to express attitude and degree of politeness (Non-Patent Documents 3 and 4). Furthermore, breath leak has been observed to occur frequently in the vowel segments of laughter.

When the average glottal opening is small, as in modal phonation, the noise component is weak relative to the periodic component. Conversely, when the glottis is opened too far, the vocal folds do not vibrate and only noise is generated (whisper). When the glottal opening is moderate, as in breathy voice, the vocal folds keep vibrating, but two changes appear in the spectrum of the glottal source (Non-Patent Document 5). One is that the spectrum falls off steeply in the higher harmonics; the other is that the airflow increases the noise component.

Non-Patent Documents 6 and 7 propose a parameter related to breathiness, the NAQ (Normalised Amplitude Quotient). It is defined as the ratio of the peak-to-peak amplitude of the glottal vibration waveform to the negative peak value of its derivative waveform, normalized by the fundamental frequency F0.
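The verbal definition of NAQ above can be written as a formula. The symbols here are assumptions introduced for illustration and do not appear in the source: f_ac is the peak-to-peak amplitude of the glottal vibration waveform, d_peak is the magnitude of the negative peak of its derivative, and T = 1/F0 is the fundamental period.

```latex
\mathrm{NAQ} \;=\; \frac{f_{\mathrm{ac}}}{d_{\mathrm{peak}} \cdot T}
\;=\; \frac{f_{\mathrm{ac}} \cdot F_0}{d_{\mathrm{peak}}},
\qquad T = \frac{1}{F_0}
```

Because f_ac / d_peak has the dimension of time, dividing by the period T makes the quantity dimensionless, which is the sense in which it is normalized by F0.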

However, computing this parameter requires vocal tract inverse filtering, which is error-prone and difficult to perform stably. NAQ also has difficulty distinguishing breath leak from nasal sounds, presumably because the fundamental frequency component is strong after inverse filtering for both kinds of voice. A further drawback is that NAQ does not make effective use of the characteristics of the noise component of breath leak.

On the other hand, a subjective noise rating method that takes the noise component of breath leak into account has also been proposed (Non-Patent Document 8). In this method, irregularity is rated by visually inspecting a waveform obtained by applying a band-pass filter around the frequency of the third formant (F3) (hereinafter the "F3 frequency"). With breath leak, the noise component tends to be stronger than the periodic component in the F3 frequency region, and the claim is that a significant noise component in this high-frequency band is related to the perception of breathy voice quality. Non-Patent Document 9 also applies this method and reports high correlation between subjects.

Non-Patent Document 1: Gordon, M., Ladefoged, P. "Phonation types: a cross-linguistic overview," J. of Phonetics 29, 383-406, 2001.
Non-Patent Document 2: Klasmeyer, G., Sendlmeier, W. F. "Voice and Emotional States," In Voice Quality Measurement, Singular Thomson Learning, Ch. 15, 339-358, 2000.
Non-Patent Document 3: Campbell, N., Mokhtari, P. "Voice quality; the 4th prosodic parameter," Proc. 15th International Congress of Phonetic Sciences, 2417-2420, 2003.
Non-Patent Document 4: Ito, M. "Politeness and voice quality - The alternative method to measure aspiration noise," Proc. Speech Prosody 2004, 213-216, 2004.
Non-Patent Document 5: Stevens, K. "Turbulence Noise at the Glottis During Breathy and Modal Voicing," In Acoustic Phonetics, The MIT Press, 445-450, 2000.
Non-Patent Document 6: Alku, P., Vilkman, E. "Amplitude domain quotient for characterization of the glottal volume velocity waveform estimated by inverse filtering," Speech Communication, Vol. 18, No. 2, 131-138, 1996.
Non-Patent Document 7: Mokhtari, P., Campbell, N. "Automatic measurement of pressed/breathy phonation at acoustic centres of reliability in continuous speech," IEICE Trans. on Inform. and Systems, Japan, Vol. E-86-D, No. 3, 574-582, 2003.
Non-Patent Document 8: Klatt, D., Klatt, L. "Analysis, synthesis, and perception of voice quality variations among female and male talkers," J. Acoustic. Soc. Amer., Vol. 87: 820-857, 1990.
Non-Patent Document 9: Hanson, H. "Glottal characteristics of female speakers: Acoustic Correlates," J. Acoustic. Soc. Amer., Vol. 101: 466-481, 1997.

However, although Non-Patent Documents 8 and 9 show that irregularity in the band around the F3 frequency is related to breath leak, they do not disclose a technique for quantifying the degree of that irregularity and automatically detecting breath leak regions from speech data.

Accordingly, an object of the present invention is to provide a device and a program that quantify the degree of irregularity in the F3 frequency band and automatically detect a regular sound region of speech data.

Another object of the present invention is to provide a device and a program that automatically and accurately detect breath leak regions from speech data.

A breath leak region automatic detection device according to one aspect of the present invention includes: waveform correlation calculation means for obtaining the correlation between the envelope waveform of the signal component in the first formant frequency region of speech data and the envelope waveform of the signal component in the third formant frequency region; comparison means for comparing the correlation result obtained by the waveform correlation calculation means with a predetermined threshold; and determination means for determining a breath leak region of the speech data based on the comparison result of the comparison means.

This breath leak region automatic detection device obtains the correlation between the envelope waveform of the signal component in the first formant frequency region of the speech data and the envelope waveform of the signal component in the third formant frequency region. When there is breath leak, the noise component in the third formant frequency region increases and the two envelope waveforms are dissimilar. Conversely, when there is no breath leak, the noise component in the third formant frequency region is small and the two envelope waveforms resemble each other. This similarity between the envelope waveforms, that is, their correlation, is quantified; the comparison means compares the resulting value with the threshold, and the determination means decides from the comparison result whether the region is a breath leak region. Because the similarity between the waveforms is quantified in this way, the determination means can decide automatically whether breath leak is present, and breath leak regions contained in the speech data can therefore be detected automatically.

Preferably, the waveform correlation calculation means includes: first filter means for filtering the speech data with a filter of a first frequency band corresponding to the first formant frequency; second filter means for filtering the speech data with a filter of a second frequency band corresponding to the third formant frequency; first envelope estimation means for obtaining the envelope waveform of the signal filtered by the first filter means; second envelope estimation means for obtaining the envelope waveform of the signal filtered by the second filter means; and calculation means for obtaining the correlation between the two envelope waveforms obtained by the first and second envelope estimation means.

This breath leak region automatic detection device filters the speech data with the first and second filter means and obtains the envelope waveforms of the filtered signals with the two envelope estimation means. Obtaining the correlation between the two estimated envelope waveforms makes it possible to judge how likely it is that the corresponding portion of the speech data is a breath leak region.

More preferably, the breath leak region automatic detection device further includes intensity difference calculation means for calculating the difference between the intensity of the signal filtered by the first filter means and the intensity of the signal filtered by the second filter means, and the determination means includes means for determining a breath leak region of the speech data based on both the comparison result of the comparison means and the calculation result of the intensity difference calculation means.

The determination means bases its decision on the result of the intensity difference calculation means in addition to that of the comparison means. When the intensity of the signal filtered by the second filter means is relatively small, a listener cannot perceive the breath leak even if it is present. Adding the intensity difference to the decision conditions therefore allows breath leak regions in the speech data to be detected accurately and in closer agreement with human perception.

One example of the intensity difference calculated by the intensity difference calculation means is the difference between the power value of the strongest component in the signal filtered by the first filter means and the power value of the strongest component in the signal filtered by the second filter means.

The determination means may also include: frame determination means for determining, for each frame of the speech data, whether the frame is a breath leak region based on the comparison result of the comparison means; and means for determining a predetermined number of consecutive frames of the speech data to be a breath leak region in response to the frame determination means having judged all of those frames to be breath leak regions.

With this configuration, breath leak regions of the speech data can be determined more accurately and more stably.

A breath leak region automatic detection program according to another aspect of the present invention is a computer-executable program that, when executed by a computer, causes the computer to operate as any of the breath leak region automatic detection devices described above.

-Basic concept-
In the subjective noise rating disclosed in Non-Patent Document 8, subjects rate the waveform filtered by a band-pass filter around the F3 frequency while viewing it alongside the waveform of the F1 frequency region. Although Non-Patent Document 8 does not point this out, the subjects are therefore presumably comparing the envelopes of these two waveforms, without being particularly aware of doing so. The embodiment of the present invention described below builds on this idea: it estimates the envelope of the waveform component filtered by a band-pass filter around F3, and quantifies waveform similarity by taking its cross-correlation with the envelope of the waveform component around F1.

-Realization by computer-
The embodiment of the present invention described below is realized by a computer and software running on that computer. Some or all of the functions described below can of course be implemented in hardware instead of software.

FIG. 1 shows an external view of the computer system 20 used in this embodiment, and FIG. 2 shows a block diagram of the computer system 20. The computer system 20 shown here is only an example, and various other configurations are possible.

Referring to FIG. 1, the computer system 20 includes a computer 40 and a monitor 42, a keyboard 46, and a mouse 48, all connected to the computer 40. The computer 40 also has a built-in CD-ROM (Compact Disk Read-Only Memory) drive 50 and FD (Flexible Disk) drive 52.

Referring to FIG. 2, the computer system 20 further includes a printer 44 connected to the computer 40, which is not shown in FIG. 1. The computer 40 further includes a bus 66 connected to the CD-ROM drive 50 and the FD drive 52 and, all connected to the bus 66, a central processing unit (CPU) 56, a ROM (Read-Only Memory) 58 storing the boot-up program of the computer 40 and the like, a RAM (Random Access Memory) 60 providing the work area used by the CPU 56 and the storage area for programs executed by the CPU 56, and a hard disk 54 storing speech data.

The software that implements the system of the embodiment described below is distributed, for example, on a recording medium such as a CD-ROM 62, read into the computer 40 through a reading device such as the CD-ROM drive 50, and stored on the hard disk 54. To execute the program, the CPU 56 reads it from the hard disk 54 into the RAM 60, then fetches and executes instructions from the address specified by a program counter (not shown). The CPU 56 reads the data to be processed from the hard disk 54 and stores the processing results on the hard disk 54 as well.

The operation of the computer system 20 itself is well known, so its details are not repeated here.

The form in which the software is distributed is not limited to one fixed on a storage medium as described above. For example, the software may be distributed by receiving data from another computer connected over a network. Another possible form is one in which part of the software is stored on the hard disk 54 in advance and the remainder is fetched onto the hard disk 54 over the network and integrated at run time.

In general, modern programs achieve their purpose by using general-purpose functions provided by the computer's operating system (OS) and executing them in a form organized for that purpose. Therefore, even a program (or group of programs) that does not itself contain the general-purpose functions provided by the OS or a third party, and only specifies the combination and order in which those functions are executed, clearly falls within the technical scope of the present invention, as long as it has a control structure that uses those functions to achieve the desired purpose as a whole.

-Functional configuration-
FIG. 3 is a block diagram that shows the program of this embodiment functionally, viewed as a breath leak region automatic detection device. Referring to FIG. 3, this device 80 performs the processing described below on speech data 90 stored on the hard disk 54, detects breath leak, estimates the presence of breath leak regions, and outputs the result. In this embodiment the speech data 90 is divided into frames 32 msec long taken at 10 msec intervals, and the following processing is performed for each frame.
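The framing described above (32 msec frames taken every 10 msec) might be sketched as follows. The function name, the list-of-samples representation, and the 16 kHz sampling rate used in the usage note are assumptions for illustration; the source does not state a sampling rate.

```python
def split_into_frames(samples, fs, frame_ms=32.0, shift_ms=10.0):
    """Cut `samples` into overlapping frames: 32 ms long, one every 10 ms.

    `fs` is the sampling rate in Hz (an assumed input, not given in the source).
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(fs * frame_ms / 1000.0))
    shift = int(round(fs * shift_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

At an assumed 16 kHz, each frame holds 512 samples and consecutive frames start 160 samples apart, so adjacent frames overlap by 22 msec.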

The device 80 includes an F1 band filter unit 81, which filters the speech data 90 with a filter whose frequency band corresponds to the frequency of the first formant (hereinafter the "F1 band"), and an F3 band filter unit 82, which filters the speech data 90 with a filter whose frequency band corresponds to the frequency of the third formant (hereinafter the "F3 band").

To avoid automatic formant extraction, which is error-prone, the F1 band filter unit 81 and the F3 band filter unit 82 use fixed bands for the F1 band and the F3 band, respectively. In this embodiment, the F3 band is fixed at 1800 to 4000 Hz, a range likely to contain the third formant frequency for both male and female speakers. The F1 band, the low-frequency band, is fixed at 100 to 1500 Hz, where the periodic component is little affected by the noise component of breath leak.
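As a minimal sketch of the two fixed band-pass filters, the code below uses a Hamming-windowed sinc FIR design. Only the band edges (100 to 1500 Hz and 1800 to 4000 Hz) come from the text; the filter type, the 101-tap length, and all names are assumptions, since the source does not say how the filters are built.

```python
import math

F1_BAND = (100.0, 1500.0)    # Hz, fixed F1 band from the text
F3_BAND = (1800.0, 4000.0)   # Hz, fixed F3 band from the text

def bandpass_fir(low_hz, high_hz, fs, num_taps=101):
    """Windowed-sinc band-pass FIR (Hamming window); num_taps must be odd."""
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            ideal = 2.0 * (high_hz - low_hz) / fs
        else:
            # Difference of two ideal low-pass responses gives a band-pass.
            ideal = (math.sin(2 * math.pi * high_hz * k / fs)
                     - math.sin(2 * math.pi * low_hz * k / fs)) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def fir_filter(samples, taps):
    """Zero-padded convolution, output aligned with the input (delay removed)."""
    mid = (len(taps) - 1) // 2
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for j, t in enumerate(taps):
            idx = i + mid - j
            if 0 <= idx < len(samples):
                acc += t * samples[idx]
        out.append(acc)
    return out
```

Filtering a frame with `bandpass_fir(*F1_BAND, fs)` gives the signal the text calls the F1 wave, and with `bandpass_fir(*F3_BAND, fs)` the F3 wave. With 101 taps the band edges are blurred by a few hundred hertz at an assumed 16 kHz rate, so a longer filter would be needed if a sharp 100 Hz lower edge matters.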

For simplicity, in the following description the signal filtered by the F1 band filter unit 81 is called the F1 wave and its waveform the F1 waveform, and the signal filtered by the F3 band filter unit 82 is called the F3 wave and its waveform the F3 waveform.

The device 80 further includes: an F1 envelope estimation unit 83 for obtaining the amplitude envelope waveform of the F1 wave; an F3 envelope estimation unit 84 for obtaining the amplitude envelope waveform of the F3 wave; a correlation calculation unit 85 that obtains the synchronization rate between the two envelope waveforms from the envelope estimation units 83 and 84 by calculating their cross-correlation; a first comparison unit 86 that compares the synchronization rate calculated by the correlation calculation unit 85 with a predetermined threshold and outputs whether the two envelope waveforms are correlated; an intensity difference calculation unit 87 that calculates the difference A1-A3 between the power value A1 of the F1 wave filtered by the F1 band filter unit 81 and the power value A3 of the F3 wave filtered by the F3 band filter unit 82; a second comparison unit 88 that compares the difference A1-A3 calculated by the intensity difference calculation unit 87 with a predetermined threshold and outputs the result; and a determination unit 89 that determines, from the comparison results of the first comparison unit 86 and the second comparison unit 88, whether the speech of the frame is a breath leak region.

To obtain the envelope waveforms, this embodiment first estimates the instantaneous amplitudes of the F1 waveform and the F3 waveform by the Hilbert envelope method, and then smooths each Hilbert envelope with a 1 ms Hanning window to obtain the amplitude envelope waveform.
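The envelope estimation step could look roughly like the sketch below: a Hilbert envelope computed from the analytic signal, followed by smoothing with a 1 ms Hanning window. The naive DFT, the treatment of the frame edges, and every function name are assumptions made to keep the example self-contained; a practical implementation would normally use an FFT library.

```python
import cmath
import math

def _dft(x, inverse=False):
    """Naive O(N^2) DFT; acceptable for one short frame, slow for long signals."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    twiddle = [cmath.exp(sign * 2j * math.pi * m / n) for m in range(n)]
    out = [sum(x[t] * twiddle[(k * t) % n] for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def hilbert_envelope(x):
    """Instantaneous amplitude: magnitude of the analytic signal of x."""
    n = len(x)
    half = n // 2
    gains = [0.0] * n                # negative frequencies are zeroed
    gains[0] = 1.0                   # DC is kept once
    if n % 2 == 0:
        gains[half] = 1.0            # Nyquist bin is kept once
        for k in range(1, half):
            gains[k] = 2.0           # positive frequencies are doubled
    else:
        for k in range(1, half + 1):
            gains[k] = 2.0
    spectrum = _dft(x)
    analytic = _dft([s * g for s, g in zip(spectrum, gains)], inverse=True)
    return [abs(v) for v in analytic]

def smooth_hanning(env, fs, window_ms=1.0):
    """Weighted moving average with a Hanning window of about window_ms."""
    length = max(1, int(round(fs * window_ms / 1000.0)))
    win = [0.5 - 0.5 * math.cos(2 * math.pi * (i + 1) / (length + 1))
           for i in range(length)]
    offset = length // 2
    out = []
    for i in range(len(env)):
        acc = 0.0
        wsum = 0.0
        for j, w in enumerate(win):
            idx = i + j - offset
            if 0 <= idx < len(env):  # renormalize near the frame edges
                acc += w * env[idx]
                wsum += w
        out.append(acc / wsum)
    return out
```

Applying `smooth_hanning(hilbert_envelope(wave), fs)` to the F1 wave and to the F3 wave would give the two amplitude envelopes that are compared in the next step.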

The correlation calculation unit 85 obtains the synchronization rate between the two envelope waveforms from the F1 and F3 envelope estimation units 83 and 84 by calculating their cross-correlation. When the synchronization rate is low, that is, when the two envelope waveforms are uncorrelated, the noise component in the F3 band is probably being generated independently of the glottal pulses. The reasoning is that the input speech data 90 then contains breath leak noise and is likely to be a breath leak region.
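One plausible way to turn "cross-correlation of the two envelopes" into a single synchronization rate is a normalized correlation coefficient at zero lag, sketched below. The source does not specify the normalization or the lags used, so this particular formula and the function name are assumptions.

```python
import math

def synchronization_rate(env_f1, env_f3):
    """Zero-lag normalized cross-correlation of two envelopes, in [-1, 1].

    Values near 1 mean the F3 envelope rises and falls with the F1 envelope
    (pulse-synchronous); values near 0 mean the two vary independently.
    """
    n = min(len(env_f1), len(env_f3))
    if n == 0:
        return 0.0
    a, b = env_f1[:n], env_f3[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0   # a flat envelope carries no timing information
    return cov / math.sqrt(var_a * var_b)
```

Removing the means and dividing by the standard deviations makes the measure insensitive to the overall level of each band, so the F3 envelope can be much weaker than the F1 envelope and still score as synchronized.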

Specifically, in this embodiment the intensity difference calculation unit 87 calculates the difference A1-A3 between the power value A1 of the strongest component in the F1 wave filtered by the F1 band filter unit 81 and the power value A3 of the strongest component in the F3 wave filtered by the F3 band filter unit 82. As described later, a low correlation between the two envelope waveforms obtained by the F1 and F3 envelope estimation units 83 and 84 does not guarantee that breath leak is perceived, so the power difference A1-A3 is used as a parameter representing spectral tilt.

Instead of calculating the difference A1-A3 between the power value A1 of the strongest component in the F1 wave and the power value A3 of the strongest component in the F3 wave, the power values of several components may be averaged in each band and the difference between the averages used as the intensity difference.
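The A1-A3 measure might be approximated as below, taking the strongest spectral component inside each fixed band directly from a frame's DFT. Expressing the difference in decibels, skipping an analysis window, and reading the peaks from the unfiltered frame rather than from the filtered F1 and F3 waves are all simplifying assumptions, as are the function names.

```python
import cmath
import math

def band_peak_power_db(frame, fs, low_hz, high_hz):
    """Power (dB) of the strongest DFT component between low_hz and high_hz."""
    n = len(frame)
    k_lo = int(math.ceil(low_hz * n / fs))
    k_hi = min(int(math.floor(high_hz * n / fs)), n // 2)
    peak = 0.0
    for k in range(k_lo, k_hi + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        peak = max(peak, abs(acc) ** 2)
    return 10.0 * math.log10(peak + 1e-12)   # small offset avoids log(0)

def a1_minus_a3(frame, fs, f1_band=(100.0, 1500.0), f3_band=(1800.0, 4000.0)):
    """Difference between the strongest F1-band and F3-band components, in dB."""
    return (band_peak_power_db(frame, fs, *f1_band)
            - band_peak_power_db(frame, fs, *f3_band))
```

A large positive value means the F3 band is much weaker than the F1 band, which is the situation in which the text says listeners tend not to hear breath leak.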

Specifically, when the first comparison unit 86 finds that the synchronization rate of the two envelope waveforms is at or above the predetermined threshold, that is, when the two waveforms are correlated, the determination unit 89 judges that the speech data is unlikely to contain breath leak noise and therefore that there is no breath leak. Conversely, when the first comparison unit 86 finds that the synchronization rate is below the threshold, that is, when the correlation between the two waveforms is low, breath leak can be presumed to be present. Even in this case, however, the breath leak may be one that a listener cannot clearly perceive (that is, one the listener is not aware of as breath leak). As described later, when labeling the speech in a large speech database, the results should match the impressions of human listeners as closely as possible. Breath leak that a listener cannot clearly perceive should therefore be excluded even if it is present.

To exclude breath leaks that the listener cannot clearly perceive, the comparison result of the second comparison unit 88 is consulted. In experiments, even when the synchronization rate was low, listeners usually judged that a breath leak was present when the power difference A1-A3 was small and absent when A1-A3 was large. Accordingly, when the synchronization rate is low and A1-A3 is smaller than a predetermined threshold, a breath leak is judged to be present; when the synchronization rate is low but A1-A3 is equal to or greater than the threshold, it is judged to be absent.
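The two-stage rule described above can be written as a small predicate. This is only a sketch: the function and parameter names are hypothetical, and the default thresholds (0.4 for the synchronization rate, 25 dB for A1-A3) are the values used in the preliminary evaluation reported later in this description, not values fixed by the embodiment.

```python
def frame_has_breath_leak(sync_rate, a1_minus_a3_db,
                          sync_threshold=0.4, tilt_threshold_db=25.0):
    """Per-frame decision combining the two comparison results.

    sync_rate        -- synchronization rate of the F1 and F3 envelopes
    a1_minus_a3_db   -- power difference A1 - A3 in dB
    """
    # High synchronization: the envelopes are correlated, so no breath leak.
    if sync_rate >= sync_threshold:
        return False
    # Low synchronization: count it only if the F3 band is strong enough
    # relative to F1 for the noise to be perceptible.
    return a1_minus_a3_db < tilt_threshold_db
```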

This determination is performed for each frame of the audio data 90. Preferably, an audio frame is judged to contain a breath leak only when the breath-leak-present determination has been made several times in succession (for example, three or more times). This is because experiments suggest that a certain duration (30 to 40 ms) is needed for a breath leak to be perceived in a syllable or phrase uttered by a speaker.

-Operation-
The apparatus 80 shown in FIGS. 1 to 3 operates as follows.

First, the audio data 90 is filtered by the F1 band filter unit 81 and the F3 band filter unit 82. Next, the F1 envelope estimation unit 83 and the F3 envelope estimation unit 84 obtain the amplitude envelope waveforms of the F1 wave and the F3 wave, respectively, and the correlation calculation unit 85 then obtains the synchronization rate of the two envelope waveforms.
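As an illustration of the last step, the sketch below computes a correlation between two envelope sequences of equal length. It assumes that the synchronization rate can be modeled as a Pearson correlation coefficient; that choice, the function name, and the handling of flat envelopes are assumptions made for this example rather than the formula used by the correlation calculation unit 85.

```python
import math

def sync_rate(env_f1, env_f3):
    """Pearson correlation between two amplitude envelopes of equal length."""
    n = len(env_f1)
    if n == 0 or n != len(env_f3):
        raise ValueError("envelopes must be non-empty and of equal length")
    mean1 = sum(env_f1) / n
    mean3 = sum(env_f3) / n
    cov = sum((a - mean1) * (b - mean3) for a, b in zip(env_f1, env_f3))
    var1 = sum((a - mean1) ** 2 for a in env_f1)
    var3 = sum((b - mean3) ** 2 for b in env_f3)
    # A flat envelope carries no shape information; treat it as uncorrelated.
    if var1 == 0.0 or var3 == 0.0:
        return 0.0
    return cov / math.sqrt(var1 * var3)
```

Envelopes that rise and fall together give a value near 1, while unrelated envelopes give a value near 0.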

The first comparison unit 86 compares the obtained synchronization rate with a threshold and passes the comparison result to the determination unit 89.

Meanwhile, the intensity difference calculation unit 87 calculates the difference A1-A3 between the power value A1 of the F1 wave filtered by the F1 band filter unit 81 and the power value A3 of the F3 wave filtered by the F3 band filter unit 82, and passes it to the second comparison unit 88. The second comparison unit 88 compares this difference A1-A3, between the power value A1 of the strongest component in the F1 wave and the power value A3 of the strongest component in the F3 wave, with a threshold and outputs the result to the determination unit 89. The determination unit 89 uses the comparison results of the first comparison unit 86 and the second comparison unit 88 to determine whether a breath leak is present.

FIG. 4 shows the subroutine of the breath leak presence/absence determination process performed by the determination unit 89. First, from the comparison result of the first comparison unit 86, it is determined whether the synchronization rate is equal to or greater than the threshold (step 101). If it is (YES in step 101), the audio data is unlikely to contain breath leak noise, so it is judged that there is no breath leak (step 106) and the determination process ends.

If the synchronization rate is below the threshold (NO in step 101), the determination unit 89 waits for the comparison result of the second comparison unit 88.

The determination unit 89 then determines whether the power difference A1-A3 is smaller than its threshold (step 102). If it is (YES in step 102), the breath leak is taken to be perceptible to the listener and a breath leak is judged to be present (step 103). If it is not (NO in step 102), any breath leak is taken to be imperceptible and it is judged that there is no breath leak (step 106).

Next, in step 104, the determination unit 89 determines whether the breath-leak-present judgment has been made for three consecutive frames. If it has (YES in step 104), the unit produces an output for labeling that frame of the audio data as containing a breath leak. If it has not (NO in step 104), the unit judges that there is no breath leak (step 106) and ends the determination process without doing anything further.
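The consecutive-frame condition of step 104 can be sketched as a pass over the per-frame decisions. The code below marks every frame that belongs to a run of at least `min_run` consecutive positive frames; labeling the whole run (rather than only the frame at which the run reaches three) is an interpretation chosen for this example, and the function name is hypothetical.

```python
def label_breath_leak_frames(frame_flags, min_run=3):
    """Keep only positive frames that lie in a run of at least min_run frames."""
    labels = [False] * len(frame_flags)
    run_start = None
    # A trailing False acts as a sentinel so the last run is also closed.
    for i, flag in enumerate(list(frame_flags) + [False]):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start >= min_run:
                for j in range(run_start, i):
                    labels[j] = True
            run_start = None
    return labels
```

With this rule, an isolated pair of positive frames is discarded, while a run of three or more is labeled as a breath leak region.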

In this way, the breath leak region automatic detection device described above makes it possible to label audio data with breath leak regions automatically and continuously.

-Related experimental results-
FIGS. 5 and 6 show the vowel /a/ uttered with various voice qualities, taken from a database of a female speaker, together with the F1 waveform (third from the top), the F3 waveform (fourth from the top), and the amplitude envelope waveforms of those waveforms (the bottom traces labeled F1env and F3env; in the drawings the F1 envelope is drawn as a line connecting "△" marks and the F3 envelope as a line connecting "×" marks). The waveforms shown in FIGS. 5 and 6 have been scaled to make visual comparison easier.

FIG. 5(a) shows modal voice. Here the amplitude envelope waveforms estimated for the F1 and F3 waveforms (F1env, F3env) are observed to be synchronized (highly correlated). The synchronization rate in this case (labeled F1F3syn in the figure) is 0.75.

FIGS. 6(a) and 6(b) show examples of breathy voice (weak breath leak) and whispery voice (strong breath leak). In breathy voice a clear periodicity is observed in the F1 waveform, whereas in whispery voice it is less clear. In both cases, however, the F3 waveform shows no regularity. Furthermore, the amplitude envelope waveforms F1env and F3env have different shapes, and the synchronization rate F1F3syn is low (0.05 and 0.03).

FIG. 5(b) shows an example of creaky voice. Creaky voice often occurs with a low F0 (fundamental frequency), and the analysis window may contain only a single glottal pulse, as in FIG. 5(b). In this case, even though the analysis window shows no glottal pulse periodicity at all, the amplitude envelope waveforms F1env and F3env have similar shapes and the synchronization rate F1F3syn is high, at 0.87.

As these results show, comparing the synchronization rate of the amplitude envelope waveforms with a threshold reliably excludes voice qualities without a breath leak that have a high synchronization rate, such as the modal voice of FIG. 5(a) and the creaky voice of FIG. 5(b).

Next, 50 utterances were selected as speech data from the everyday natural conversation of one female speaker in her thirties. These 50 utterances were chosen to cover a variety of speaking styles and voice qualities (breathy voice, whispery voice, creaky voice, hoarse voice, tense voice, laughter, and so on).

For each utterance, eight subjects marked the syllables in which they perceived a breath leak. A syllable marked by three or more subjects was judged to contain a breath leak.
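The annotation rule amounts to a vote count per syllable. The following sketch assumes each subject's marks are available as a collection of syllable identifiers; that data layout and the function name are assumptions made for illustration.

```python
from collections import Counter

def syllables_with_breath_leak(marks_by_subject, min_votes=3):
    """Return the syllables marked by at least min_votes subjects.

    marks_by_subject -- one iterable of syllable identifiers per subject
    """
    votes = Counter()
    for marked in marks_by_subject:
        # set() ensures a subject counts at most once per syllable.
        votes.update(set(marked))
    return {syllable for syllable, count in votes.items() if count >= min_votes}
```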

The 50 utterances were segmented manually at the phoneme level, and the acoustic parameters extracted from all frames of the vowel segments were used in the evaluation experiment.

As a result, 42 vowels (847 frames) were judged to have a breath leak and the remaining 345 vowels (4532 frames) were judged to have none. FIG. 7 plots the acoustic parameters extracted from all frames of the vowels judged to have (a) and not to have (b) a breath leak. The horizontal axis is the synchronization rate F1F3syn and the vertical axis is the power difference A1-A3. Frames near phoneme boundaries (within 20 ms on either side) and frames in the modal-voiced portions of vowel segments judged to have a breath leak were omitted from FIG. 7(a), leaving 410 frames.

For vowels with a breath leak, the synchronization rate is low and the difference A1-A3 is bounded above at roughly 30 dB. However, many frames judged to have no breath leak also show a low synchronization rate, and the low synchronization rates are concentrated at high values of A1-A3. When A3 is much lower than A1, that is, when A1-A3 is large (for example, 30 to 40 dB), the noise component in the F3 band may not be clearly audible. For this reason, when A1-A3 is large, the estimated synchronization rate is not used for breath leak detection.

A preliminary evaluation of breath leak detection was carried out using a threshold of 0.4 for the synchronization rate F1F3syn and a threshold of 25 dB for the difference A1-A3. Most frames of the vowels judged to have a breath leak were detected correctly. However, false detections also occurred in some frames of vowels in which no breath leak had been perceived.

When the restriction used in this embodiment was added, namely that three consecutive frames must show the characteristics of a breath leak, agreement with the syllables identified by the subjects reached 90% or more. As noted earlier, this restriction is appropriate because a certain duration is thought to be necessary for a breath leak to be perceived in a syllable or phrase.

As described above, this embodiment can automatically distinguish breath leak regions from other regions in speech segments that contain a breath leak. As a result, speech in a large-scale speech database can be automatically judged and labeled as breath leak region or not, in a way that agrees with human perception. The technique is also applicable to recognizing speech in various speaking styles and voice qualities, estimating a speaker's emotion, and synthesizing speech with a voice quality that conveys a particular emotion to the listener. It can further be used for automatic detection of laughter containing much breath leak and of surprised voices.

The embodiment disclosed herein is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each claim, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is an external view of a computer system that executes a program for realizing a breath leak region automatic detection device according to an embodiment of the present invention.
FIG. 2 is a block diagram of the computer system of FIG. 1.
FIG. 3 is a block diagram of the breath leak region automatic detection device 80 according to an embodiment of the present invention.
FIG. 4 is a flowchart showing the breath leak region determination process performed by the determination unit 89.
FIG. 5 shows the F1 and F3 waveforms and their amplitude envelope waveforms for audio data of various voice qualities.
FIG. 6 likewise shows the F1 and F3 waveforms and their amplitude envelope waveforms for audio data of various voice qualities.
FIG. 7 shows the acoustic parameter data extracted from all frames of vowels judged to have, or not to have, a breath leak.

Explanation of symbols

80: breath leak region automatic detection device; 81: F1 band filter unit (first filter means); 82: F3 band filter unit (second filter means); 83: F1 envelope estimation unit (first envelope estimation means); 84: F3 envelope estimation unit (second envelope estimation means); 85: correlation calculation unit; 86: first comparison unit; 87: intensity difference calculation unit; 88: second comparison unit; 89: determination unit

Claims

A breath leak region automatic detection device comprising:
waveform correlation calculation means for obtaining a correlation between an envelope waveform of the signal component in the first formant frequency region of audio data and an envelope waveform of the signal component in the third formant frequency region;
comparison means for comparing the correlation obtained by the waveform correlation calculation means with a predetermined threshold value; and
determination means for determining a breath leak region of the audio data based on a comparison result by the comparison means.

The breath leak region automatic detection device according to claim 1, wherein the waveform correlation calculation means includes:
first filter means for filtering the audio data with a filter in a first frequency band corresponding to the frequency of the first formant;
second filter means for filtering the audio data with a filter in a second frequency band corresponding to the frequency of the third formant;
first envelope estimation means for obtaining an envelope waveform of the signal wave filtered by the first filter means;
second envelope estimation means for obtaining an envelope waveform of the signal wave filtered by the second filter means; and
computing means for obtaining a correlation between the two envelope waveforms respectively obtained by the first and second envelope estimation means.

The breath leak region automatic detection device according to claim 2, further comprising intensity difference calculation means for calculating the difference between the intensity of the signal wave filtered by the first filter means and the intensity of the signal wave filtered by the second filter means,
wherein the determination means includes means for determining a breath leak region of the audio data based on the comparison result by the comparison means and the calculation result by the intensity difference calculation means.

The breath leak region automatic detection device according to claim 3, wherein the intensity difference calculation means includes means for calculating the difference in power value between the strongest component in the signal wave filtered by the first filter means and the strongest component in the signal wave filtered by the second filter means.

The breath leak region automatic detection device according to any one of claims 1 to 4, wherein the determination means includes:
frame determination means for determining, for each frame of the audio data, a breath leak region of the audio data based on the comparison result by the comparison means; and
means for determining a predetermined number of frames to be a breath leak region in response to the frame determination means obtaining a breath leak determination result for that predetermined number of consecutive frames of the audio data.

A computer-executable breath leak region automatic detection program that, when executed by a computer, causes the computer to operate as the breath leak region automatic detection device according to any one of claims 1 to 5.