JP4677548B2

JP4677548B2 - Paralinguistic information detection apparatus and computer program

Info

Publication number: JP4677548B2
Application number: JP2005269699A
Authority: JP
Inventors: イシイ・カルロス・トシノリ; 浩石黒; 紀博萩田
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-09-16
Filing date: 2005-09-16
Publication date: 2011-04-27
Anticipated expiration: 2025-09-16
Also published as: JP2007079363A

Description

The present invention relates to an apparatus for detecting paralinguistic information, which is independent of utterance content, from human speech, and in particular to a paralinguistic information detection apparatus that detects paralinguistic information from the prosodic information and the voice quality information contained in human speech.

With recent technological advances, a variety of devices that produce human speech have come into production. One example is the car navigation system. In a car navigation system the machine speaks to the person in one direction only, but there are also devices that must hold a dialogue with a person, such as robots.

Devices such as robots are likely to be even more closely involved in human life than car navigation systems. For such a device to converse smoothly with people, it must therefore take into account not only the content of what a person says but also the person's emotions.

When estimating the emotion that accompanies an utterance, it is reasonable to consider not only the utterance content but also paralinguistic information such as the speaker's intention, attitude, and emotion, which does not depend on the utterance content. In other words, estimating human emotion from the utterance content together with the paralinguistic information that accompanies it is more reasonable and more accurate than learning in advance the emotions corresponding to every expected utterance content.

Conventional research on the extraction of paralinguistic information has emphasized prosodic features, as disclosed in Non-Patent Document 1.

FIG. 1 shows a functional block diagram of a conventional paralinguistic information extraction apparatus 30 that uses prosodic features. Referring to FIG. 1, the paralinguistic information extraction apparatus 30 includes a prosody-based speech processing unit 40, which processes the speech signal on the basis of prosody and outputs a parameter called phrase-final tone information, and a paralinguistic information extraction unit 42, which extracts and outputs paralinguistic information from the phrase-final tone information obtained from the prosody-based speech processing unit 40 by using a probability distribution of the relationship between phrase-final tone information and paralinguistic information learned in advance from training data. This conventional technique uses, as the phrase-final tone information, a parameter called F0move that represents the change in pitch.

When the user speaks, the speech is converted into a speech signal by a microphone (not shown). This speech signal is supplied to the speech processing unit 40, whose processing yields the phrase-final tone information. The paralinguistic information extraction unit 42 then extracts paralinguistic information using the phrase-final tone information obtained by the prosody-based speech processing unit 40.
Carlos Toshinori Ishi, Parham Mokhtari, Nick Campbell, "Perceptually-related Acoustic-Prosodic Features of Phrase Finals in Spontaneous Speech", Eurospeech 2003, pp. 405-408, 2003

In a paralinguistic information detection apparatus 30 that uses only such prosodic features, the paralinguistic information extracted from utterances expressing different emotions may overlap.

This overlap is described with reference to FIG. 2. Here, the pitch change F0move and the utterance duration are used as the prosodic features.

The vertical axis of the graph shows the utterance duration, and the horizontal axis shows the change in pitch. As indicated by legend 50, each symbol plotted in the graph represents an emotion of the speaker. As the graph shows, when only prosodic information is used, different paralinguistic information is represented by the same prosodic information and utterance duration. That is, for paralinguistic information with a given utterance duration and a given pitch change, it is unclear whether it is "asking back" or "surprise/unexpectedness". The accuracy of paralinguistic information detection therefore drops.

Furthermore, expressive speech includes sounds from which pitch is difficult to extract, such as breathy voice, that is, voice containing breath leakage.

Accordingly, an object of the present invention is to solve these problems and to provide highly accurate paralinguistic information that, in paralinguistic information detection, can be distinguished more clearly than when only prosodic features are used.

Another object of the present invention is to provide a highly accurate paralinguistic information detection apparatus by extracting paralinguistic information using not only prosodic information but also voice quality information.

A paralinguistic information detection apparatus according to a first aspect of the present invention detects paralinguistic information independent of utterance content from a human speech signal, and includes: first speech processing means for processing information on the prosody of the speech signal; second speech processing means for processing information on the voice quality of the speech signal; and paralinguistic information extraction means for extracting paralinguistic information on the speech from the prosody information and the voice quality information.

Preferably, the second speech processing means includes vocal fry ratio calculation means for calculating the proportion of the utterance section of the speech signal occupied by vocal fry sections.

More preferably, the second speech processing means further includes aperiodicity/double-periodicity ratio calculation means for calculating the proportion of the utterance section of the speech signal occupied by aperiodic/double-periodic sections.

Still more preferably, the second speech processing means includes aperiodicity/double-periodicity ratio calculation means for calculating the proportion of the utterance section of the speech signal occupied by aperiodic/double-periodic sections.

More preferably, the second speech processing means further includes breathiness ratio calculation means for calculating the proportion of the utterance section of the speech signal occupied by breathy sections.

Still more preferably, the second speech processing means includes breathiness ratio calculation means for calculating the proportion of the utterance section of the speech signal occupied by breathy sections.

A computer program according to a second aspect of the present invention, when executed by a computer, causes the computer to operate as any of the paralinguistic information detection apparatuses described above.

With this paralinguistic information detection apparatus, not only prosody information but also voice quality information can be used in detecting the information. The accuracy of paralinguistic information detection can therefore be improved, and a more accurate paralinguistic information detection apparatus can be provided.

An embodiment of the present invention is described below with reference to the drawings. The embodiment relates to a paralinguistic information detection apparatus that performs prosody-based speech processing and voice-quality-based speech processing on a speech signal and extracts paralinguistic information.

<Configuration>
FIG. 3 shows a functional block diagram of the paralinguistic information detection apparatus 60 according to the present embodiment. Referring to FIG. 3, the paralinguistic information detection apparatus 60 includes: a prosody-based speech processing unit 70, which processes the speech signal on the basis of prosody and outputs parameters used for extracting paralinguistic information; a voice-quality-based speech processing unit 72, which processes the speech signal on the basis of voice quality and outputs parameters used for extracting paralinguistic information; and a paralinguistic information extraction unit 74, which extracts and outputs paralinguistic information from the parameters obtained from the prosody-based speech processing unit 70 and the voice-quality-based speech processing unit 72, in accordance with a probability distribution, learned in advance from training data, that represents the relationship between the parameters and paralinguistic information.

FIG. 4 is a functional block diagram showing details of the prosody-based speech processing unit 70. Referring to FIG. 4, the prosody-based speech processing unit 70 includes: a prosodic feature processing unit 80, which converts the speech signal into F0move, a parameter representing pitch movement, that is, the change in pitch; an utterance duration extraction unit 84, which extracts information on the utterance duration; and a tone parameter extraction unit 82, which extracts tone parameters from the F0move obtained by the prosodic feature processing unit 80 and the utterance duration information obtained by the utterance duration extraction unit 84.

Tone parameters are described with reference to FIG. 5, taking the Japanese phrase "naine" (ないね) as an example. A tone parameter parameterizes the rises and falls in pitch within a word. For example, in tone parameter 1a (100), the pitch changes between the "na" and the "ine" of "naine": the pitch falls in moving from "na" to "ine".

In FIG. 5, the symbol ┐ indicates that the pitch falls, the symbol ┌ indicates that the lowered pitch returns to the original pitch, and an arrow pointing up and to the right indicates that the pitch rises.

FIG. 5 shows seven types of tone parameters, but the present embodiment uses five of them: 1a (100), 2a (102), 2b (104), 2c (106), and 3 (108).

FIG. 6 is a functional block diagram showing details of the prosodic feature processing unit 80. Referring to FIG. 6, the prosodic feature processing unit 80 includes an F0 extraction unit 90, which obtains from the speech signal the parameter F0, which is information on pitch, and an F0move extraction unit 92, which uses the parameter F0 to extract F0move, a parameter that expresses the pitch movement (direction and degree) within a syllable, that is, the change in pitch, in semitone units. The F0 extraction unit 90 extracts from the speech signal only F0, the information on the height of the sound, and converts it so that it is expressed on a musical scale.

FIG. 7 is a functional block diagram showing details of the voice-quality-based speech processing unit 72. Referring to FIG. 7, the voice-quality-based speech processing unit 72 includes a vocal fry detection unit 120, which detects vocal fry in the speech signal, and a vocal fry ratio calculation unit 122, which calculates the proportion of the whole utterance section occupied by vocal fry sections. Here, vocal fry refers to pulsed voice of very low frequency, about 7 Hz to 78 Hz, produced when the excitation of the vocal tract has almost completely decayed.

The voice-quality-based speech processing unit 72 further includes: an aperiodicity/double-periodicity detection unit 124, which detects aperiodic section information and double-periodic section information, that is, information on sections of the given speech signal that lie outside the vocal fry sections and in which the speech waveform is aperiodic or double-periodic; and an aperiodicity/double-periodicity ratio calculation unit 126, which removes the vocal fry section information 132 detected by the vocal fry detection unit 120 from the aperiodic and double-periodic section information detected by the aperiodicity/double-periodicity detection unit 124 and calculates the proportion of the whole utterance section occupied by the aperiodic and double-periodic sections. Here, aperiodicity means that the speech waveform is aperiodic. Double periodicity means that the speech waveform has a shape in which a set of two waveforms differing in peak length and peak width is repeated periodically.

The voice-quality-based speech processing unit 72 further includes a breathiness detection unit 128, which detects breathy section information from the given speech signal, and a breathiness ratio calculation unit 130, which calculates the proportion of the whole utterance section occupied by breathy sections. Here, breathiness refers to the degree of breath leakage contained in the voice. A whisper is one example of a breathy voice.
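The three ratio calculation units (122, 126, and 130) each reduce section information to a proportion of the whole utterance. The following is a minimal sketch of that step, assuming the section information is available as one boolean flag per analysis frame; this representation and the function name section_ratios are illustrative assumptions, not taken from the patent.

```python
def section_ratios(fry, aperiodic, breathy):
    """Return (vocal fry, aperiodic/double-periodic, breathy) ratios.

    Each argument is a list of booleans, one per frame of the utterance
    (an assumed representation). Frames flagged as vocal fry are removed
    from the aperiodic/double-periodic count, as unit 126 is described.
    """
    n = len(fry)
    if n == 0 or len(aperiodic) != n or len(breathy) != n:
        raise ValueError("flag lists must be non-empty and equally long")
    fry_ratio = sum(fry) / n
    aperiodic_ratio = sum(a and not f for a, f in zip(aperiodic, fry)) / n
    breathy_ratio = sum(breathy) / n
    return fry_ratio, aperiodic_ratio, breathy_ratio
```

Counting frames keeps the sketch short; an implementation working on time intervals would measure durations instead.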

FIG. 8 is a functional block diagram showing details of the vocal fry detection unit 120. Referring to FIG. 8, the vocal fry detection unit 120 includes: a band-pass filter 140, which passes only the frequency components of the speech signal between 100 Hz and 1500 Hz; a very-short-term peak detection unit 142, which divides the speech signal 154 that has passed through the band-pass filter 140 into frames of a very-short-term frame length and outputs, as position information on power peak candidates, information 150 indicating the positions of frames whose power is greater than that of the two frames before and after them by more than a predetermined power threshold; a short-term periodicity detection unit 144, which calculates a value of intra-frame periodicity (IFP value) for the speech signal 154 divided into frames of a short-term frame length and sets to null the IFP values of all frames other than those in which intra-frame periodicity is present over at least a predetermined number of frames; a periodicity check unit 146, which passes to a similarity check unit 148 only the information 156 on those parts of the peak position information 150 supplied by the very-short-term peak detection unit 142 for which the frame value is null according to the short-term periodicity information 152 supplied by the short-term periodicity detection unit 144; and the similarity check unit 148, which detects the peak position information of those power peak candidates specified by the information 156 for which the value of inter-pulse similarity (IPS value) between the waveform near the candidate and the waveform near the preceding power peak is at or above a predetermined threshold, detects vocal fry section information from the frames lying between adjacent pulses with high IPS values on the basis of this peak position information, and supplies it to the vocal fry ratio calculation unit 122 and the aperiodicity/double-periodicity ratio calculation unit 126.

FIG. 9 is a functional block diagram showing details of the aperiodicity/double-periodicity detection unit 124. Referring to FIG. 9, the aperiodicity/double-periodicity detection unit 124 includes: a normalized autocorrelation function calculation unit 160, which calculates a normalized autocorrelation function by filtering the speech signal and detecting peaks of the speech waveform; a normalized autocorrelation function parameter calculation unit 162, which calculates, from the waveform of the normalized autocorrelation function calculated by the normalized autocorrelation function calculation unit 160, normalized autocorrelation function parameters expressed by peak values, relationships between peak positions, and the like; and an aperiodicity/double-periodicity section information detection unit 164, which detects aperiodic and double-periodic section information from the values of the calculated normalized autocorrelation function parameters.

The normalized autocorrelation function parameter calculation unit 162 detects the first two peaks (P1 and P2) of the normalized autocorrelation function obtained by the normalized autocorrelation function calculation unit 160. Only peaks whose value exceeds 0.2 are regarded as peaks.

The normalized autocorrelation values of these peaks are called NAC(P1) and NAC(P2), and their normalized autocorrelation positions are called TL(P1) and TL(P2); these are treated as the normalized autocorrelation function parameters.
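The peak-picking step can be sketched as follows. This is an illustration under stated assumptions rather than the patented procedure: the autocorrelation is normalized by its zero-lag value, a peak is taken to be a simple local maximum, and the function names are hypothetical. Only the 0.2 floor and the "first two peaks" rule come from the description.

```python
import numpy as np

def normalized_autocorrelation(frame):
    """Autocorrelation for lags 0..N-1, divided by the zero-lag value
    (an assumed normalization)."""
    x = np.asarray(frame, dtype=float)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    if ac[0] == 0.0:
        return np.zeros_like(ac)
    return ac / ac[0]

def first_two_peaks(nac, min_value=0.2):
    """Return up to two (lag, value) pairs: the first local maxima of the
    normalized autocorrelation whose value exceeds min_value."""
    peaks = []
    for lag in range(1, len(nac) - 1):
        is_local_max = nac[lag] > nac[lag - 1] and nac[lag] >= nac[lag + 1]
        if is_local_max and nac[lag] > min_value:
            peaks.append((lag, float(nac[lag])))
            if len(peaks) == 2:
                break
    return peaks
```

In this sketch the lags of the two returned pairs play the role of TL(P1) and TL(P2), and the values play the role of NAC(P1) and NAC(P2). For a cleanly periodic frame the second peak sits at twice the lag of the first.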

FIG. 10 is a functional block diagram showing details of the normalized autocorrelation function calculation unit 160. Referring to FIG. 10, the normalized autocorrelation function calculation unit 160 includes: a high-pass filter 170, which passes only the frequency components of the speech signal at or above 60 Hz; a high-frequency emphasis unit 172, which emphasizes the high-frequency part of the signal output by the high-pass filter 170; a vocal tract parameter extraction unit 174, which performs linear prediction analysis on the signal output by the high-frequency emphasis unit 172 and extracts vocal tract parameters; an inverse filter 176, which inverse-filters the signal output by the high-pass filter 170 using the vocal tract parameters extracted by the vocal tract parameter extraction unit 174, yielding a residual signal corresponding to the glottal source waveform; a low-pass filter 178, which passes only signal components at or below 2 kHz in order to ease the peak detection needed in later processing; an autocorrelation function calculation unit 180, which, given the signal that has passed through the low-pass filter 178, sets the window size to 80 ms and calculates an autocorrelation function from the signal contained in the window; a peak detection unit 182, which detects the largest peak in each frame from the waveform of the autocorrelation function calculated by the autocorrelation function calculation unit 180; an autocorrelation function recalculation unit 184, which extracts the time lag between the largest peak detected by the peak detection unit 182 and the largest peak immediately before or after it, readjusts the frame length so that one frame spans four times that lag, and calculates the autocorrelation function of the readjusted frame; and a normalization unit 186, which normalizes the resulting autocorrelation function.

FIG. 11 is a functional block diagram showing details of the breathiness detection unit 128. Referring to FIG. 11, the breathiness detection unit 128 includes: an F1 pass filter 202, which passes only the frequency components of the speech signal between 100 Hz and 1500 Hz; an amplitude envelope extraction unit 204, which extracts the change in amplitude from the whole waveform that has passed through the F1 pass filter 202; an F3 pass filter 200, which passes only the frequency components of the speech signal between 1800 Hz and 4000 Hz; an amplitude envelope extraction unit 210, which extracts the change in amplitude from the whole waveform that has passed through the F3 pass filter 200; and a cross-correlation calculation unit 214, which calculates the cross-correlation between the change in amplitude obtained from the amplitude envelope extraction unit 204 and the change in amplitude obtained from the amplitude envelope extraction unit 210. Here, the frequency components that have passed through the F1 pass filter 202 are called the F1 wave, and those that have passed through the F3 pass filter 200 are called the F3 wave. The change in amplitude extracted by the amplitude envelope extraction unit 204 is called the F1 amplitude envelope, and that extracted by the amplitude envelope extraction unit 210 is called the F3 amplitude envelope.

The breathiness detection unit 128 further includes: a first maximum frequency component extraction unit 206, which extracts the maximum frequency component from the F1 wave, consisting of the components that have passed through the F1 pass filter 202; a second maximum frequency component extraction unit 212, which extracts the maximum frequency component from the F3 wave, consisting of the components that have passed through the F3 pass filter 200; and a spectral tilt calculation unit 216, which calculates the spectral tilt value A1-A3, the difference between the maximum frequency component contained in the F1 wave and the maximum frequency component contained in the F3 wave.

The breathiness detection unit 128 further includes a breathiness determination unit 218, which determines whether a section is breathy according to whether the F1F3 correlation value obtained from the cross-correlation calculation unit 214 is below a certain threshold and the spectral tilt value A1-A3 obtained from the spectral tilt calculation unit 216 is below a certain threshold, and outputs breathy section information.
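The decision made by the breathiness determination unit 218 can be sketched as below. The text does not give the threshold values or the exact form of the cross-correlation, so this sketch assumes a zero-lag Pearson correlation between the two amplitude envelopes, and both thresholds are left as caller-supplied parameters; the function names are hypothetical.

```python
import numpy as np

def f1f3_correlation(f1_envelope, f3_envelope):
    """Zero-lag correlation coefficient between the F1 and F3 amplitude
    envelopes (an assumed form of the cross-correlation)."""
    return float(np.corrcoef(f1_envelope, f3_envelope)[0, 1])

def is_breathy(f1f3_corr, a1_a3, corr_threshold, tilt_threshold):
    """A section is judged breathy when the F1F3 correlation and the
    spectral tilt A1-A3 are both below their thresholds."""
    return f1f3_corr < corr_threshold and a1_a3 < tilt_threshold
```

For example, is_breathy(f1f3_correlation(env1, env3), tilt, 0.8, 10.0) applies the rule with illustrative thresholds of 0.8 and 10.0, which are placeholders and not values from the patent.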

<Operation>
Referring to FIG. 3, when the user speaks, the speech is first converted into a speech signal by a microphone (not shown). The speech signal is supplied to the prosody-based speech processing unit 70 and to the voice-quality-based speech processing unit 72. Processing in the prosody-based speech processing unit 70 yields phrase-final tone information. Processing in the voice-quality-based speech processing unit 72 yields information on the proportions of the whole utterance occupied by vocal fry, by aperiodicity and double periodicity, and by breathiness. The processing in the prosody-based speech processing unit 70 and the voice-quality-based speech processing unit 72 is described in detail below.

The operation of the prosody-based speech processing unit 70 is described in detail with reference to FIG. 4. On receiving the speech signal, the prosodic feature processing unit 80 first converts it into F0move, a parameter representing pitch movement, that is, the change in pitch. F0move is obtained from F0, which is information on pitch.

The operation of the prosodic feature processing unit 80 is described in detail with reference to FIG. 6. On receiving the speech signal, the F0 extraction unit 90 extracts from it only the information on the height of the sound and converts this into musical scale information to obtain the parameter F0.

Using the parameter F0, the F0move extraction unit 92 extracts F0move, a parameter that expresses the pitch movement (direction and degree) within a syllable, that is, the change in pitch, in semitone units. F0move can be obtained from the differences among a plurality of F0 values.
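As a worked illustration of expressing pitch change in semitones, the sketch below converts F0 values in hertz to a semitone scale with the standard relation 12·log2(f / f_ref) and takes a difference. Treating F0move as the difference between a syllable's final and initial F0, the 55 Hz reference, and the function names are assumptions made for illustration; the description only states that F0move can be derived from differences among F0 values.

```python
import math

def hz_to_semitones(f0_hz, ref_hz=55.0):
    """Convert a frequency in Hz to semitones above ref_hz."""
    if f0_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(f0_hz / ref_hz)

def f0_move(f0_start_hz, f0_end_hz):
    """Pitch movement in semitones: positive for a rise, negative for a
    fall (assumed definition: end-of-syllable F0 minus start F0)."""
    return hz_to_semitones(f0_end_hz) - hz_to_semitones(f0_start_hz)
```

The sign gives the direction of the movement and the magnitude gives its degree; a rise from 220 Hz to 440 Hz is one octave, that is, +12 semitones. The reference frequency cancels in the difference, so its choice does not affect the result.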

Referring to FIG. 4, the utterance duration extraction unit 84 extracts information on the utterance duration from the speech signal.

The tone parameter extraction unit 82 extracts tone parameters using the F0move extracted by the prosodic feature processing unit 80 and the utterance duration information extracted by the utterance duration extraction unit 84. The extracted tone parameters are used in the subsequent processing by the paralinguistic information extraction unit 74.

Referring to FIG. 7, the voice-quality-based speech processing unit 72 operates as follows. First, the vocal fry detection unit 120 detects vocal fry section information from the utterance speech signal.

Referring to FIG. 8, the vocal fry detection unit 120 operates as follows. The bandpass filter 140 passes only the 100 Hz to 1500 Hz frequency components of the speech signal. The filtered speech signal 154 is supplied to the ultra-short-term peak detection unit 142, the short-term periodicity detection unit 144, and the similarity check unit 148. The ultra-short-term peak detection unit 142 divides the speech signal 154 into ultra-short-term frames and calculates the ultra-short-term power of each frame. Each frame whose power differs from that of the two frames before and after it by more than a power threshold is taken as a power peak candidate, and information 150 indicating the positions of those frames is output.
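The power peak candidate selection can be sketched as follows. This is one possible reading, assuming that a frame qualifies when its power exceeds the power of each of the two frames on either side by more than the threshold; the function name `power_peak_candidates`, the `span` parameter, and the sample values are hypothetical.

```python
def power_peak_candidates(frame_power, threshold, span=2):
    """Return indices of frames whose power exceeds every neighbour within
    `span` frames on each side by more than `threshold`.

    frame_power: per-frame power values (units assumed consistent with threshold).
    """
    peaks = []
    for i in range(span, len(frame_power) - span):
        neighbours = frame_power[i - span:i] + frame_power[i + 1:i + span + 1]
        if all(frame_power[i] - p > threshold for p in neighbours):
            peaks.append(i)
    return peaks


# Two isolated bursts of power at frames 2 and 6.
candidates = power_peak_candidates([0, 0, 10, 0, 0, 0, 9, 1, 0], threshold=5)
```

Frames closer than `span` to either end of the signal are skipped in this sketch because they lack a full set of neighbours.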

The short-term periodicity detection unit 144 divides the speech signal 154 into frames and calculates an IFP value for each frame. Each IFP value is compared with a threshold, and if it is below the threshold, the IFP value of that frame is set to null. If a run of non-null frames is shorter than three consecutive frames, the IFP values of those frames are also corrected to null. The corrected IFP values are then supplied to the periodicity check unit 146.
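The thresholding and run-length correction of the IFP values can be sketched as below. The sketch assumes the IFP values have already been computed per frame and represents "null" as `None`; the function name `clean_ifp` and the sample values are hypothetical.

```python
def clean_ifp(ifp_values, threshold, min_run=3):
    """Null out IFP values below `threshold`, then null out any run of
    non-null frames that is shorter than `min_run` frames."""
    cleaned = [v if v >= threshold else None for v in ifp_values]
    i = 0
    n = len(cleaned)
    while i < n:
        if cleaned[i] is None:
            i += 1
            continue
        # Find the end of the current run of non-null frames.
        j = i
        while j < n and cleaned[j] is not None:
            j += 1
        if j - i < min_run:
            for k in range(i, j):
                cleaned[k] = None
        i = j
    return cleaned


# The first two frames form a run of length 2 and are nulled;
# frames 3 to 6 form a run of length 4 and are kept.
result = clean_ifp([0.9, 0.8, 0.1, 0.7, 0.9, 0.8, 0.6, 0.2], threshold=0.5)
```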

Of the peak position information 150 supplied by the ultra-short-term peak detection unit 142, the periodicity check unit 146 passes to the similarity check unit 148 only the information 156 for those portions whose frame IFP value is null according to the short-term periodicity information 152 supplied by the short-term periodicity detection unit 144.

For each power peak candidate in the sections identified by the information 156, the similarity check unit 148 calculates the IPS value between the waveform around that power peak and the waveform around the preceding power peak. It compares each IPS value with a threshold and detects the peak position information of the power peaks whose IPS value is at or above the threshold. Based on this peak position information, the frames lying between adjacent pulses with high IPS values are detected as vocal fry sections, and information indicating them (vocal fry section information) is output.

Referring to FIG. 7, the detected vocal fry section information is supplied to the vocal fry ratio calculation unit 122, which calculates the proportion of the entire utterance section occupied by vocal fry sections. The proportion is obtained by dividing the vocal fry sections by the entire utterance section. The calculated vocal fry ratio information is supplied to the paralinguistic information extraction unit 74 for later processing.
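The ratio calculation amounts to a single division, sketched below under the assumption that each detected section is a non-overlapping `(start, end)` pair in seconds and that the utterance duration is known. The function name `section_ratio` and the sample values are hypothetical.

```python
def section_ratio(sections, utterance_duration):
    """Fraction of the utterance covered by the given (start, end) sections.

    Sections are assumed not to overlap one another.
    """
    if utterance_duration <= 0:
        raise ValueError("utterance_duration must be positive")
    covered = sum(end - start for start, end in sections)
    return covered / utterance_duration


# Two 0.5 s sections in a 4 s utterance cover a quarter of it.
ratio = section_ratio([(0.0, 0.5), (1.0, 1.5)], utterance_duration=4.0)
```

The same function would apply unchanged to any other kind of detected section, since only the section list differs.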

The aperiodicity/double-periodicity detection unit 124 detects the sections of the utterance speech signal in which the speech waveform is aperiodic and the sections in which it is double-periodic, and outputs aperiodic section information and double-periodic section information indicating them.

Referring to FIG. 9, the aperiodicity/double-periodicity detection unit 124 operates as follows. When an utterance speech signal is supplied, the normalized autocorrelation function calculation unit 160 filters the signal and analyzes the resulting waveform to calculate an autocorrelation function. It then normalizes that function to obtain the normalized autocorrelation function. The processing in the normalized autocorrelation function calculation unit 160 is described in detail below.

Referring to FIG. 10, when a speech signal is supplied, the high-pass filter 170 passes only frequency components of 60 Hz and above. The resulting signal is supplied to the high-frequency emphasis unit 172 and the inverse filter 176. The high-frequency emphasis unit 172 emphasizes the high-frequency part of the supplied signal, and the vocal tract parameter extraction unit 174 then estimates the filter parameters that characterize the vocal tract. The inverse filter 176 is then applied to the output of the high-pass filter 170, using the vocal tract parameters supplied by the vocal tract parameter extraction unit 174, to obtain the glottal source signal.

The residual signal produced by the inverse filter 176 is then supplied to the low-pass filter 178. To simplify the peak detection needed in later processing, the low-pass filter 178 passes only frequency components of 2 kHz and below. Its output is supplied to the autocorrelation function calculation unit 180 and the autocorrelation function recalculation unit 184. The autocorrelation function calculation unit 180 uses a frame length of 80 ms for the detection process, computes the autocorrelation function of the speech waveform within the frame, and outputs it.

The peak detection unit 182 detects the largest peak in the autocorrelation function obtained by the autocorrelation function calculation unit 180.

The autocorrelation function recalculation unit 184 first sets the new frame length to four times the time position of the largest peak detected by the peak detection unit 182. The frame is readjusted in this way so that the autocorrelation function can be calculated properly: with a fixed frame length, a frame that is either too long or too short makes a proper calculation difficult. The autocorrelation function is then computed again from the readjusted frame.
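The two-pass calculation (an initial 80 ms frame, then a frame four times the lag of the largest peak) can be sketched as follows. This is a simplified illustration that omits the preceding filtering stages, treats the largest local maximum other than lag 0 as "the largest peak", normalizes by the zero-lag value, and always starts the frame at the beginning of the signal; all function names and these choices are assumptions.

```python
import math


def autocorrelation(frame, max_lag):
    """Raw autocorrelation of `frame` for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def largest_peak_lag(acf):
    """Lag of the largest local maximum, ignoring the trivial peak at lag 0."""
    best = None
    for lag in range(1, len(acf) - 1):
        is_peak = acf[lag] > acf[lag - 1] and acf[lag] >= acf[lag + 1]
        if is_peak and (best is None or acf[lag] > acf[best]):
            best = lag
    return best


def adaptive_normalized_acf(signal, sample_rate, initial_ms=80.0):
    """Return (peak lag in samples, normalized ACF of the readjusted frame)."""
    initial_len = int(sample_rate * initial_ms / 1000.0)
    frame = signal[:initial_len]
    lag = largest_peak_lag(autocorrelation(frame, len(frame) // 2))
    if lag is None:
        return None, []
    # Readjust the frame to four times the lag of the largest peak.
    frame = signal[:4 * lag]
    acf = autocorrelation(frame, len(frame) // 2)
    if acf[0] == 0:
        return lag, []
    return lag, [value / acf[0] for value in acf]


# Usage: a 100 Hz sine sampled at 8 kHz has a period of 80 samples.
SAMPLE_RATE = 8000
sine = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(1600)]
peak_lag, nacf = adaptive_normalized_acf(sine, SAMPLE_RATE)
```

The direct summation is quadratic in the frame length, which is acceptable for a sketch; an FFT-based autocorrelation would be the usual choice for real signals.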

Next, the normalization unit 186 normalizes the autocorrelation function thus obtained. Referring to FIG. 9, the normalized autocorrelation function parameter calculation unit 162 then performs its calculation on the normalized autocorrelation function produced by the normalized autocorrelation function calculation unit 160. To capture the aperiodicity and double periodicity of the sound wave, peak values and peak positions are detected from the waveform of the normalized autocorrelation function, and the ratio of the peak values and the ratio of the peak positions are then calculated. The peak value ratio is given by 1000 * NAC(P2) / NAC(P1), and the peak position ratio by 2000 * TL(P2) / TL(P1).

Using the calculated normalized autocorrelation function parameters, the aperiodic/double-periodic section information detection unit 164 then detects the sections in which the speech signal is aperiodic or double-periodic. The detection process is as follows.

If both of the autocorrelation function parameters described above are close to 1000, the utterance speech waveform in the section represented by that autocorrelation function can be regarded as periodic. Sections in which the parameters take other values can therefore be extracted as aperiodic or double-periodic sections.
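The two parameters and the "close to 1000" test can be sketched as below. The formulas follow the expressions 1000 * NAC(P2) / NAC(P1) and 2000 * TL(P2) / TL(P1) given in the text. How the peaks P1 and P2 are selected is not restated in this passage, so their values and lags are simply taken as inputs. The function names and the `tolerance` of 100 are hypothetical; the text does not quantify what counts as close to 1000.

```python
def nac_parameters(nac_p1, nac_p2, lag_p1, lag_p2):
    """Peak value ratio and peak position ratio of two NAC peaks P1 and P2."""
    peak_value_ratio = 1000.0 * nac_p2 / nac_p1
    peak_position_ratio = 2000.0 * lag_p2 / lag_p1
    return peak_value_ratio, peak_position_ratio


def is_periodic(peak_value_ratio, peak_position_ratio, tolerance=100.0):
    """True when both parameters lie within `tolerance` of 1000 (assumed margin)."""
    return (abs(peak_value_ratio - 1000.0) <= tolerance
            and abs(peak_position_ratio - 1000.0) <= tolerance)


# Equal peak heights, with P2 at half the lag of P1, give 1000 for both parameters.
value_ratio, position_ratio = nac_parameters(0.9, 0.9, lag_p1=160, lag_p2=80)
periodic = is_periodic(value_ratio, position_ratio)
```

Sections for which `is_periodic` returns false would be the candidates for aperiodic or double-periodic sections under this reading.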

The aperiodic/double-periodic section information detected by the aperiodic/double-periodic section information detection unit 164 is supplied to the aperiodicity/double-periodicity ratio calculation unit 126.

Referring to FIG. 7, the aperiodicity/double-periodicity ratio calculation unit 126 calculates the proportion of the entire utterance section occupied by aperiodic and double-periodic sections. The proportion is obtained by dividing the aperiodic and double-periodic sections by the entire utterance section.

Before this calculation, the sections detected as vocal fry sections by the vocal fry detection unit 120 are first removed from the aperiodic/double-periodic section information. Vocal fry also has aperiodic characteristics, but the target here is aperiodicity and double periodicity other than vocal fry.
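Removing vocal fry sections from the aperiodic/double-periodic sections is an interval subtraction, sketched below. The sketch assumes both inputs are lists of `(start, end)` pairs in the same time unit; the function name `subtract_sections` and the sample values are hypothetical.

```python
def subtract_sections(sections, removed):
    """Remove from each (start, end) section every part that overlaps a
    section in `removed`, returning the remaining pieces in order."""
    result = []
    for start, end in sections:
        pieces = [(start, end)]
        for r_start, r_end in removed:
            next_pieces = []
            for p_start, p_end in pieces:
                if r_end <= p_start or r_start >= p_end:
                    # No overlap: keep the piece unchanged.
                    next_pieces.append((p_start, p_end))
                    continue
                if p_start < r_start:
                    next_pieces.append((p_start, r_start))
                if r_end < p_end:
                    next_pieces.append((r_end, p_end))
            pieces = next_pieces
        result.extend(pieces)
    return result


# A vocal fry section from 0.5 s to 2.5 s trims both aperiodic sections.
remaining = subtract_sections([(0.0, 1.0), (2.0, 3.0)], [(0.5, 2.5)])
```

The ratio would then be computed from the remaining sections only, so that vocal fry is not counted twice.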

Referring to FIG. 11, the breathiness detection unit 128 operates as follows. When an utterance speech signal is supplied, the F1 pass filter 202 first passes only the 100 Hz to 1500 Hz frequency components of the signal. The amplitude envelope extraction unit 204 extracts the amplitude envelope from the waveform of the F1 wave that has passed through the F1 pass filter 202.

Similarly, the F3 pass filter 200 passes only the 1800 Hz to 4000 Hz frequency components of the utterance speech signal. The amplitude envelope extraction unit 210 extracts the amplitude envelope from the waveform of the F3 wave that has passed through the F3 pass filter 200.

The cross-correlation calculation unit 214 calculates the cross-correlation between the F1 amplitude envelope obtained from the amplitude envelope extraction unit 204 and the F3 amplitude envelope obtained from the amplitude envelope extraction unit 210. This yields the F1F3 correlation value, which indicates how the F1 and F3 amplitude envelopes relate to each other.

From the F1 wave that has passed through the F1 pass filter 202, the maximum frequency component extraction unit 206 also extracts the largest of the frequency components contained in that band. The maximum frequency component extraction unit 212 applies the same processing to the F3 wave that has passed through the F3 pass filter 200. The spectral tilt calculation unit 216 then calculates the difference between the maximum frequency component of the F1 wave and that of the F3 wave, that is, the spectral tilt. This spectral tilt is denoted A1-A3.

The breathiness determination unit 218 uses the F1F3 correlation value and the spectral tilt A1-A3 to decide whether a section is breathy, and outputs breathiness section information. A section is judged to be breathy if the F1F3 correlation value is below one threshold and the A1-A3 value is below another. These thresholds are obtained in advance by learning. The presence or absence of breathiness is determined by comparing the measured F1F3 correlation value and A1-A3 value against them.
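The breathiness decision can be sketched as below. The sketch assumes the two amplitude envelopes are already available as equal-length sequences, uses a zero-lag normalized correlation as the F1F3 correlation value, and takes A1 and A3 as band amplitude levels in dB. The function names, both threshold values, and these interpretations are hypothetical; the text only states that the thresholds are learned in advance.

```python
import math


def correlation(a, b):
    """Zero-lag normalized correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def is_breathy(f1_envelope, f3_envelope, a1_db, a3_db,
               corr_threshold, tilt_threshold):
    """Breathy when both the F1F3 correlation and the tilt A1-A3 are below
    their thresholds (threshold values are assumed to come from training)."""
    f1f3 = correlation(f1_envelope, f3_envelope)
    tilt = a1_db - a3_db
    return f1f3 < corr_threshold and tilt < tilt_threshold


# Envelopes that move in opposite directions, with a small tilt: judged breathy.
breathy = is_breathy([1, 2, 3, 4], [4, 3, 2, 1], 60.0, 55.0,
                     corr_threshold=0.5, tilt_threshold=10.0)
# Envelopes that move together: not judged breathy.
modal = is_breathy([1, 2, 3, 4], [1, 2, 3, 4], 60.0, 55.0,
                   corr_threshold=0.5, tilt_threshold=10.0)
```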

The breathiness section information is supplied to the breathiness ratio calculation unit 130. Referring to FIG. 7, the breathiness ratio calculation unit 130 calculates the proportion of the entire utterance section occupied by breathy sections by dividing the breathy sections by the entire utterance section. The calculated breathiness ratio is supplied to the paralinguistic information extraction unit 74 for later processing.

Referring to FIG. 3, the paralinguistic information extraction unit 74 extracts paralinguistic information using the phrase-final tone information obtained by the prosody-based speech processing unit 70 together with three items obtained by the voice-quality-based speech processing unit 72: the proportion of vocal fry in the whole utterance, the proportion of aperiodicity or double periodicity in the whole utterance, and the proportion of breathiness in the whole utterance.

This processing requires that data on the relationship between paralinguistic information and the following features be collected in advance: the phrase-final tone information, the proportion of vocal fry sections in the whole utterance, the proportion of aperiodic and double-periodic sections, and the proportion of breathiness. From the collected data, a model that determines which paralinguistic information can be detected from which input parameters can then be built by learning.

Decision trees, neural networks, support vector machines (SVMs), and the like are possible choices for this model.

[Implementation by computer]
The system of this embodiment is implemented by computer hardware, a program executed by that hardware, and data stored in it. FIG. 12 shows the external appearance of this computer system 330, and FIG. 13 shows its internal configuration.

Referring to FIG. 12, the computer system 330 includes a computer 340 having an FD (flexible disk) drive 352 and a CD-ROM (compact disc read-only memory) drive 350, a keyboard 346, a mouse 348, and a monitor 342.

Referring to FIG. 13, in addition to the FD drive 352 and the CD-ROM drive 350, the computer 340 includes a CPU (central processing unit) 356; a bus 366 connected to the CPU 356, the FD drive 352, and the CD-ROM drive 350; a read-only memory (ROM) 358 that stores a boot-up program and the like; and a random access memory (RAM) 360 that is connected to the bus 366 and stores program instructions, a system program, working data, and the like. The computer system 330 further includes a printer 344.

Although not shown here, the computer 340 may further include a network adapter board that provides a connection to a local area network (LAN).

The computer program that causes the computer system 330 to operate as the paralinguistic information extraction device 60 is stored on a CD-ROM 362 or an FD 364 inserted into the CD-ROM drive 350 or the FD drive 352, and is then transferred to the hard disk 354. Alternatively, the program may be sent to the computer 340 over a network (not shown) and stored on the hard disk 354. The program is loaded into the RAM 360 when it is executed. The program may also be loaded into the RAM 360 directly from the CD-ROM 362, from the FD 364, or over the network.

This program includes a plurality of instructions that cause the computer 340 to operate as the paralinguistic information extraction device 60 of this embodiment. Some of the basic functions needed for this operation are provided by the operating system (OS) running on the computer 340, by third-party programs, or by modules of the various toolkits installed on the computer 340. The program therefore need not contain every function required to implement the system and method of this embodiment. It need only contain those instructions that carry out the operation of the paralinguistic information extraction device 60 described above by calling the appropriate functions or "tools" in a controlled manner so as to obtain the desired result. The operation of the computer system 330 is well known and is not repeated here.

As described above, using information on voice quality in addition to information on prosody when detecting paralinguistic information improves the accuracy of the detection.

The specific numerical values used in the embodiment disclosed here are examples.

The embodiment disclosed here is merely illustrative, and the present invention is not limited to the embodiment described above. The scope of the present invention is defined by each claim of the appended claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of those claims.

FIG. 1 is a functional block diagram of a paralinguistic information extraction device 30 that uses prosodic features.
FIG. 2 is a graph showing the overlap of paralinguistic information when paralinguistic information is detected using prosodic features.
FIG. 3 is a functional block diagram of the paralinguistic information detection device 60 according to this embodiment.
FIG. 4 is a functional block diagram showing details of the processing of the prosody-based speech processing unit 70.
FIG. 5 is a diagram explaining the tone parameters.
FIG. 6 is a functional block diagram showing details of the prosodic feature processing unit 80.
FIG. 7 is a functional block diagram showing details of the voice-quality-based speech processing unit 72.
FIG. 8 is a functional block diagram showing details of the vocal fry detection unit 120.
FIG. 9 is a functional block diagram showing details of the aperiodicity/double-periodicity detection unit 124.
FIG. 10 is a functional block diagram showing details of the normalized autocorrelation function calculation unit 160.
FIG. 11 is a functional block diagram showing details of the breathiness detection unit 128.
FIG. 12 is an external view of a computer system that implements the paralinguistic information extraction device 30 according to an embodiment of the present invention.
FIG. 13 is a block diagram of the computer shown in FIG. 12.

Explanation of symbols

70 prosody-based speech processing unit
72 voice-quality-based speech processing unit
74 paralinguistic information extraction unit
122 vocal fry ratio calculation unit
126 aperiodicity/double-periodicity ratio calculation unit
130 breathiness ratio calculation unit

Claims

A paralinguistic information detection device for detecting, from a human utterance speech signal, paralinguistic information independent of utterance content, the device including:
first speech processing means for processing information relating to the prosody of the speech signal;
second speech processing means for processing information relating to the voice quality of the speech signal; and
paralinguistic information extracting means for extracting paralinguistic information about the utterance from the information relating to the prosody and the information relating to the voice quality,
wherein the second speech processing means includes vocal fry ratio calculation means for calculating the ratio of vocal fry sections in the utterance section of the speech signal.

The paralinguistic information detection device according to claim 1, wherein the second speech processing means further includes aperiodicity/double-periodicity ratio calculation means for calculating the ratio of aperiodic/double-periodic sections in the utterance section of the speech signal.

A paralinguistic information detection device for detecting, from a human utterance speech signal, paralinguistic information independent of utterance content, the device including:
first speech processing means for processing information relating to the prosody of the speech signal;
second speech processing means for processing information relating to the voice quality of the speech signal; and
paralinguistic information extracting means for extracting paralinguistic information about the utterance from the information relating to the prosody and the information relating to the voice quality,
wherein the second speech processing means includes aperiodicity/double-periodicity ratio calculation means for calculating the ratio of aperiodic/double-periodic sections in the utterance section of the speech signal.

The paralinguistic information detection device according to any one of claims 1 to 3, wherein the second speech processing means further includes breathiness ratio calculation means for calculating the ratio of breathy sections in the utterance section of the speech signal.

A paralinguistic information detection device for detecting, from a human utterance speech signal, paralinguistic information independent of utterance content, the device including:
first speech processing means for processing information relating to the prosody of the speech signal;
second speech processing means for processing information relating to the voice quality of the speech signal; and
paralinguistic information extracting means for extracting paralinguistic information about the utterance from the information relating to the prosody and the information relating to the voice quality,
wherein the second speech processing means includes breathiness ratio calculation means for calculating the ratio of breathy sections in the utterance section of the speech signal.

A computer program that, when executed by a computer, causes the computer to operate as the paralinguistic information detection device according to any one of claims 1 to 5.