JPH03273280A - Voice synthesizing system for vocal exercise - Google Patents

Voice synthesizing system for vocal exercise

Info

Publication number
JPH03273280A
Authority
JP
Japan
Prior art keywords
speech
voice
standard
input
prosodic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2072888A
Other languages
Japanese (ja)
Other versions
JP2844817B2 (en)
Inventor
Keiko Nagano
永野 敬子
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp filed Critical NEC Corp
Priority to JP2072888A priority Critical patent/JP2844817B2/en
Publication of JPH03273280A publication Critical patent/JPH03273280A/en
Application granted granted Critical
Publication of JP2844817B2 publication Critical patent/JP2844817B2/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current


Abstract

PURPOSE: To provide a teacher's voice suited to each learner by converting the prosodic information of the learner's input speech into the prosodic information of a standard speech, synthesizing speech from the combined information, and outputting the synthesized speech as the teacher's voice. CONSTITUTION: A pitch extracting/analyzing part 70 extracts the pitch of the input speech from the input speech file analyzed by an input speech analyzing part 40, receives the phoneme boundaries of the input speech from a data collating part 50, finds the pitch division positions, analyzes the input speech file pitch-synchronously, and writes the result into a second synthesized-waveform creation file, in the same way as the pitch-synchronous analysis of the standard speech file. Based on the synthesized-waveform creation files, a synthesizing part 80 synthesizes speech using at least one item of prosodic information of the standard speech together with the residual and vocal-tract characteristics of the input speech to form the teacher's voice, so that the prosodic information of the teacher's voice is converted into that of the standard speech. A synthesized waveform storing part 130 stores the teacher's voice and outputs it when necessary. Thus, correct training can be attained.

Description

[Detailed Description of the Invention]

(Industrial Field of Application)

The present invention relates to a speech synthesis system for vocal practice in which the speech of a learner is analyzed to extract its features, and synthesized speech obtained by converting the features of the learner's speech into the features of a standard speaker's speech is used as the teacher's voice.

(Prior Art)

As a conventional vocal training device, a device is already known that analyzes the pitch frequency and formant frequencies of the speech uttered by a learner, displays the analysis results on a monitor screen, and lets the learner practice while watching that screen. Details of this device are given in the paper by Umezaki et al. entitled "Development of a Speech and Utterance Training Device for the Hearing-Impaired" (Proceedings of the Acoustical Society of Japan, 1-4-11, pp. 297-298, March 1988) (Reference 1).

Besides this, a training device is known in which, based on phonetic knowledge, the speech uttered by the learner is analyzed into various parameters by means of various sensors, a training method is decided from those parameters, and vocal training is carried out by comparing each parameter of the learner's speech with the corresponding parameter of a teacher's voice.

Details of this device are given in the paper by Yamada et al. entitled "Development of a Vocal Training Device for the Speech-Impaired (6th Report)" (IEICE Technical Report ET87-8, pp. 25-30, January 1988) (Reference 2).

More recently, a vocal training device has been realized with which the learner can practice rhythm and intonation, practice pronouncing individual sounds, and practice word listening while hearing a teacher's voice, and can compare his or her own utterance with the teacher's voice by ear. Details are given in the paper by Takada et al. entitled "Development of an English Speech Practice System" (IEICE Technical Report ET86-12, pp. 49-52, March 1987) (Reference 3).

(Problems to Be Solved by the Invention)

However, the vocal training device of Reference 1 displays on screen only the intonation of the analyzed input speech, that is, of the speech uttered by the learner, so it is impossible to check how the input speech deviates from the standard utterance; the drawback is that the learner finds it hard to grasp the difference between the standard speech and his or her own.

The vocal training device of Reference 2 requires the learner to wear various sensors during training, which is very cumbersome, and has the further drawback that the attached sensors distort the way the learner speaks.

With the vocal training device of Reference 3, in what is called rhythm and intonation practice, when the learner pronounces along with a pre-registered teacher's voice, the rhythm and intonation of the teacher's voice and of the learner's utterance are shown on a display. Looking at the display, however, does not tell the learner what to correct or how; it merely shows that the learner's utterance differs from the teacher's voice, which is the standard speech, so it is hard to connect the display to actual learning of pronunciation. Moreover, this device uses as the teacher's voice, which the learner listens to and takes as the target, a standard speech uttered by another speaker; since that voice differs from the learner's not only in phonemic and prosodic information but also in voice quality, the learner tends to pay attention to the difference in voice quality, and it is difficult to train the phonemic and prosodic information toward the teacher's voice. In addition, since the speaking rate of the standard speaker is generally fixed, it may not match the learner's speaking rate, and the learner has had to change his or her speaking rate during pronunciation training.

The object of the present invention is therefore to provide a teacher's voice suited to each individual learner by generating a teacher's voice whose voice quality and speaking rate are close to the learner's; further, to make clear which parts are problematic by comparing and displaying the learner's speech against the teacher's voice phoneme by phoneme; and, by computing the distance between the learner's utterance and the teacher's voice and presenting it as a degree of learning achievement, to provide a speech synthesis system for vocal practice that shows in an easily understood way how effective the practice has been.

(Means for Solving the Problems)

The first speech synthesis system for vocal practice according to the present invention comprises: means for receiving the speech uttered by a learner as input speech and analyzing the input speech to extract prosodic information such as the temporal change of the fundamental frequency and phoneme durations; means for recording the prosodic information of standard speech, which is the previously analyzed speech of a standard speaker; means for determining the phoneme-level correspondence between the input speech and the standard speech from their respective prosodic information, converting the prosodic information of the input speech into the previously analyzed prosodic information of the standard speech, synthesizing the input speech with the standard speech, and outputting the synthesized speech as a teacher's voice; and means for outputting to a screen the prosodic information extracted from the teacher's voice and from the input speech, displaying in that output the phoneme names and phoneme boundaries of the input speech at the corresponding positions of the standard speech.

The second speech synthesis system for vocal practice according to the present invention comprises: means for receiving the speech uttered by a learner as input speech and analyzing the input speech to extract prosodic information such as the temporal change of the fundamental frequency and phoneme durations; means for recording the prosodic information of standard speech, which is the previously analyzed speech of a standard speaker; means for determining the phoneme-level correspondence between the input speech and the standard speech from their respective prosodic information, converting the prosodic information of the input speech into the previously analyzed prosodic information of the standard speech, synthesizing the input speech with the standard speech, and outputting the synthesized speech as a teacher's voice; and either means for comparing the phoneme durations obtained from the analysis of the input speech with the phoneme durations obtained from the analysis of the standard speech, calculating the difference, and outputting standard speech whose phoneme durations have been stretched or compressed by that difference as the standard speech used by the teacher-voice output means, or means for determining, for each of the input speech and the standard speech, the relation between the duration of the stressed vowel and the durations of the other vowels, converting the duration of the stressed vowel of the standard speech into the duration of the corresponding vowel of the input speech, converting, for the other vowels, the inter-vowel duration relations of the standard speech into the inter-vowel duration relations of the input speech, and outputting the standard speech with the inter-vowel durations thus converted as the standard speech used by the teacher-voice output means.

The third speech synthesis system for vocal practice according to the present invention comprises: means for receiving the speech uttered by a learner as input speech and analyzing the input speech to extract prosodic information such as the temporal change of the fundamental frequency and phoneme durations; means for recording the prosodic information of standard speech, which is the previously analyzed speech of a standard speaker; means for determining the phoneme-level correspondence between the input speech and the standard speech from their respective prosodic information, converting the prosodic information of the input speech into the previously analyzed prosodic information of the standard speech, synthesizing the input speech with the standard speech, and outputting the synthesized speech as a teacher's voice; and means for recording in advance a plurality of utterances of different durations, comparing the durations of the input speech and the standard speech, and, when the durations differ, selecting from the plurality of utterances the one whose duration is closest to that of the input speech and outputting it to the teacher-voice output means as the standard speech.

The fourth speech synthesis system for vocal practice according to the present invention comprises: means for receiving the speech uttered by a learner as input speech and analyzing the input speech to extract prosodic information such as the temporal change of the fundamental frequency and phoneme durations; means for recording the prosodic information of standard speech, which is the previously analyzed speech of a standard speaker; means for determining the phoneme-level correspondence between the input speech and the standard speech from their respective prosodic information, converting the prosodic information of the input speech into the previously analyzed prosodic information of the standard speech, synthesizing the input speech with the standard speech, and outputting the synthesized speech as a teacher's voice; and means for calculating the distance between the prosodic information of the input speech and that of the teacher's voice, converting the distance into a degree of learning achievement, and outputting it to a screen.

(Operation)

According to the present invention, a learner can accurately acquire, and learn to produce, the prosodic information of a standard speaker, in particular rhythm and intonation. Practice uses both visual and auditory information. By displaying on the screen the phoneme-level correspondence between the teacher's voice and the learner's utterance, the learner can confirm where and by how much the two differ. Further, because the teacher's voice is built from the learner's own utterance and carries the learner's voice quality, the difference in voice quality between the learner's voice and the teacher's voice is reduced and the difference in prosodic information between the two is emphasized, so the learner can practice while grasping the image of what it would sound like if he or she imitated the prosodic information of the standard speech exactly.

For example, with regard to intonation, in order to show clearly where the learner's intonation differs from the standard speaker's, the learner's intonation contour is displayed on screen superimposed on the standard speaker's, visually informing the learner which parts differ. The name of the phoneme concerned is also displayed, pointing the learner to the problem spot more precisely; in addition, the speech is evaluated phoneme by phoneme and the parts that differ from the teacher's voice are indicated. As the teacher's voice for practice, speech is synthesized using the standard speech for the prosodic information to be practiced and the learner's speech for the remaining vocal-tract and sound-source characteristics.
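As a rough illustration of this overlay display (the patent does not specify a toolkit, and the function and parameter names here are invented), the learner's F0 contour can be drawn over the standard speaker's with phoneme boundaries and names marked:

```python
import matplotlib.pyplot as plt

def plot_intonation(t_std, f0_std, t_in, f0_in, boundaries, phonemes):
    """Overlay the learner's F0 contour on the standard speaker's and mark
    phoneme boundaries/names so problem segments can be read off."""
    fig, ax = plt.subplots()
    ax.plot(t_std, f0_std, label="standard speaker")
    ax.plot(t_in, f0_in, label="learner", alpha=0.7)
    for start, name in zip(boundaries, phonemes):
        ax.axvline(start, color="gray", linewidth=0.5)   # phoneme boundary
        ax.text(start, ax.get_ylim()[1], name, va="top", fontsize=8)
    ax.set_xlabel("time [s]")
    ax.set_ylabel("F0 [Hz]")
    ax.legend()
    plt.show()
```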

Since the synthesized speech used as the teacher's voice is the learner's own except for the prosodic information to be practiced, there is little difference in voice quality when the learner compares his or her own utterance with the teacher's voice by ear; the difference in the prosodic information being practiced becomes easier to hear, and the target prosody becomes easier to acquire.

Furthermore, when the speaking rates of the learner and the standard speaker differ markedly, the standard speaker's rate is stretched or compressed toward the learner's rate before synthesis, which removes the drawback that the learner cannot keep up with the speaking rate of the teacher's voice and thus cannot train correctly.

(Embodiments)

The present invention will now be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing an embodiment of the first speech synthesis system for vocal practice according to the present invention. A standard speech pitch extraction unit 30 applies linear predictive analysis to the standard speech entered at input terminal 1 to separate it into residual characteristics and vocal-tract characteristics, then performs pitch extraction on the residual by the autocorrelation method to obtain the pitch division positions and phoneme boundaries; all of these together form the standard speech file. This follows the method of Iwata's "Speech Pitch Extraction Device" (Japanese Patent Application No. 62-210690) (Reference 4). A data storage unit 10 stores the data of the standard speech pitch-extracted by the unit 30, that is, the standard speech file. When already existing data are used as the standard speech, they are retrieved from this data storage unit 10.
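The patent does not reproduce the algorithm of Reference 4, but the pipeline it names (LPC separation into vocal-tract and residual parts, then autocorrelation pitch extraction on the residual) can be sketched as follows. The LPC order, frame windowing, and F0 search range are assumptions, and `lpc_coeffs` and `residual_pitch` are hypothetical helper names:

```python
import numpy as np
from scipy.signal import lfilter

def lpc_coeffs(frame: np.ndarray, order: int = 12) -> np.ndarray:
    """LPC by the autocorrelation method (Levinson-Durbin recursion)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-9                       # guard against silent frames
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a                                # A(z) = 1 + a1 z^-1 + ...

def residual_pitch(frame: np.ndarray, fs: int = 16000,
                   fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Separate the frame into vocal-tract (LPC) and residual parts, then
    pick the pitch period from the autocorrelation of the residual."""
    a = lpc_coeffs(frame * np.hamming(len(frame)))
    resid = lfilter(a, [1.0], frame)        # inverse filtering -> residual
    ac = np.correlate(resid, resid, mode="full")[len(resid) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))    # strongest periodicity
    return fs / lag                         # F0 estimate in Hz
```

Working on the residual rather than the waveform suppresses the formant structure, which is why autocorrelation peak picking is more reliable there.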

An input speech detection unit 40 applies linear predictive analysis to the input speech, that is, the learner's speech entered at input terminal 6, separating it into residual and vocal-tract characteristics, and then performs automatic speech detection to find the start and end positions of the input speech. Speech detection uses the method of Rabiner et al., "An Algorithm for Determining the Endpoints of Isolated Utterances" (Bell System Technical Journal, Vol. 54, No. 2, pp. 297-315, February 1975) (Reference 5).
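A minimal sketch of endpoint detection in the Rabiner-Sambur style follows; the frame size, calibration window, and threshold multipliers are placeholders, not the constants of the original paper, and the leading frames are assumed to be background noise:

```python
import numpy as np

def detect_endpoints(x: np.ndarray, fs: int = 16000, frame_ms: int = 10):
    """Return (start, end) sample indices of the utterance, or None."""
    n = fs * frame_ms // 1000
    frames = x[: len(x) // n * n].reshape(-1, n).astype(float)
    energy = (frames ** 2).sum(axis=1)
    sign = np.signbit(frames).astype(np.int8)
    zcr = np.abs(np.diff(sign, axis=1)).sum(axis=1)   # zero-crossing count
    calib = slice(0, 10)                    # assume first 100 ms is silence
    e_thr = energy[calib].mean() + 3.0 * energy[calib].std()
    z_thr = zcr[calib].mean() + 2.0 * zcr[calib].std()
    speech = (energy > e_thr) | (zcr > z_thr)  # ZCR catches weak fricatives
    idx = np.flatnonzero(speech)
    if idx.size == 0:
        return None
    return idx[0] * n, (idx[-1] + 1) * n
```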

A data matching unit 50 uses DP matching to compute the distance between the analysis results of the input speech and the standard speech produced by the input speech detection unit 40 and the standard speech pitch extraction unit 30, respectively, and establishes the time-axis correspondence between the two utterances. For this DP matching, the method of Sakoe et al., "Similarity Evaluation of Speech Patterns by Dynamic Programming" (Proceedings of the 1970 National Convention of the Institute of Electronics and Communication Engineers, p. 136, August 1970) (Reference 6), is used. Also, by means of DP matching and using the standard speech file, which already contains the phoneme names and phoneme boundary positions, the positions of the phoneme boundaries in the input speech are assigned to the input speech file.
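The following sketch illustrates the role of unit 50: a plain DTW alignment of per-frame feature vectors, followed by projection of the standard speech's phoneme boundaries through the warping path onto the learner's time axis. The feature choice and step pattern are assumptions, not the exact formulation of Reference 6:

```python
import numpy as np

def dtw_path(A: np.ndarray, B: np.ndarray):
    """A: (Ta, D) learner features, B: (Tb, D) standard features.
    Returns the warping path as (learner_frame, standard_frame) pairs."""
    Ta, Tb = len(A), len(B)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            c = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = c + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    i, j, path = Ta, Tb, []
    while i > 0 and j > 0:                  # backtrack the optimal path
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def map_boundaries(path, std_bounds):
    """Project standard-speech phoneme boundary frames onto learner frames."""
    first = {}
    for li, sj in path:
        first.setdefault(sj, li)            # earliest learner frame per sj
    return [first[b] for b in std_bounds if b in first]
```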

An analysis unit 60 analyzes, pitch-synchronously, the standard speech file into which the standard speech pitch extraction unit 30 has inserted the pitch division positions. The analyzed standard speech file is separated into phoneme names and phoneme boundary positions, pitch division positions, durations, residual characteristics, vocal-tract characteristics and so on, and these are written into a first synthesized-waveform creation file.

A pitch extraction/analysis unit 70 performs pitch extraction on the input speech, using the input speech file that has been analyzed by the input speech detection unit 40 and given the phoneme boundaries of the input speech by the data matching unit 50, to obtain the pitch division positions, and then analyzes the input speech file pitch-synchronously on the basis of this result. The analysis result is written into a second synthesized-waveform creation file, just as for the pitch-synchronous analysis of the standard speech file.
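One assumed reading of the pitch-synchronous analysis used for both files is sketched below: a two-period window is cut around each pitch mark and analyzed by LPC, giving one record per mark. It reuses `lpc_coeffs` from the earlier sketch, and the record field names are invented stand-ins for the contents of a synthesized-waveform creation file:

```python
import numpy as np
from scipy.signal import lfilter
# lpc_coeffs() is the Levinson-Durbin helper from the earlier sketch.

def pitch_sync_analyze(x, pitch_marks, order=12):
    """One analysis record per pitch mark: mark position, vocal-tract
    (LPC) coefficients, and the residual (excitation) segment."""
    records = []
    for k in range(1, len(pitch_marks) - 1):
        left = pitch_marks[k] - pitch_marks[k - 1]
        right = pitch_marks[k + 1] - pitch_marks[k]
        seg = x[pitch_marks[k] - left: pitch_marks[k] + right]
        seg = seg * np.hanning(len(seg))    # two-period tapered grain
        a = lpc_coeffs(seg, order)
        records.append({
            "mark": pitch_marks[k],               # pitch division position
            "vocal_tract": a,                     # LPC = vocal-tract part
            "residual": lfilter(a, [1.0], seg),   # inverse-filtered excitation
        })
    return records
```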

Using the first and second synthesized-waveform creation files, a synthesis unit 80 performs speech synthesis with at least one item of the prosodic information of the standard speech together with the residual and vocal-tract characteristics of the input speech, generating the teacher's voice. The synthesis unit 80 also uses the input speech file and the standard speech file, which hold information such as the pitch division positions, phoneme boundaries, residual characteristics and vocal-tract characteristics, to convert the target prosodic information of the teacher's voice into that of the standard speech.

At this point, pitch control is used to convert the time length of the input speech or the standard speech. The synthesis unit 80 not only generates the teacher's voice but also plays back the standard speech and the input speech.
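The patent does not spell out the pitch-control algorithm; a PSOLA-flavored sketch of pitch-mark-based time scaling is given below as one plausible reading. Grains cut around pitch marks are overlap-added, re-used or skipped as needed, each advanced by its local pitch period so F0 is preserved:

```python
import numpy as np

def time_scale(x, marks, factor):
    """Stretch (factor > 1) or compress (factor < 1) the waveform while
    keeping F0. marks: strictly increasing pitch-mark sample positions."""
    marks = np.asarray(marks, dtype=int)
    out_len = int(round((marks[-1] - marks[0]) * factor))
    pad = int(np.diff(marks).max())         # headroom for grain overlap
    out = np.zeros(out_len + 2 * pad)
    t = float(pad)                          # output write position (grain centre)
    while t < out_len + pad:
        src = marks[0] + (t - pad) / factor  # matching instant in the input
        k = int(np.clip(np.searchsorted(marks, src), 1, len(marks) - 2))
        left = marks[k] - marks[k - 1]
        right = marks[k + 1] - marks[k]
        grain = x[marks[k] - left: marks[k] + right] * np.hanning(left + right)
        c = int(t)
        out[c - left: c - left + len(grain)] += grain
        t += left                           # advance by local period -> F0 kept
    return out[pad: pad + out_len]
```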

Each of these synthesized voices is output at output terminal 5. A synthesized waveform storage unit 130 stores the teacher's voice and outputs it when necessary.

A screen display unit 150 takes the phoneme names and phoneme boundary positions of the input speech obtained by the data matching unit 50 and those of the teacher's voice obtained by the synthesis unit 80, adds the phoneme name and phoneme boundary information at the corresponding places in the respective prosodic information, and outputs the result to the screen.

With respect to the information thus displayed, when a signal specifying the phonemes of the portion to be replayed is entered at input terminal 3 from a keyboard, mouse or the like, the synthesis unit 80 performs speech synthesis for the speech portion corresponding to that signal, and the teacher's voice is output from output terminal 5.

FIG. 2 is a block diagram showing an embodiment of the second speech synthesis system for vocal practice according to the present invention. In FIG. 2, components bearing the same numbers as in the embodiment of FIG. 1 operate identically to those of FIG. 1. A speech rate conversion unit 90 compares the total phoneme duration of the standard speech and the input speech using the standard speech file and the input speech file; if the phoneme durations of the standard speech and the input speech differ markedly, it calculates the difference between them. The time length of the standard speech is then stretched or compressed by this difference on the residual signal, using a pitch control method.

This method is described in detail in Iwata's paper "A Study of a Speech Synthesis System Based on Residual Excitation" (Proceedings of the Acoustical Society of Japan, 3-2-7, pp. 183-184, October 1988) (Reference 7).
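A small sketch of the rate-matching decision in unit 90 follows: compare total phoneme durations from the two analysis files and derive the stretch factor handed to a pitch-control time-scaling routine such as `time_scale` above. The 20% trigger threshold is an assumption; the patent only says "markedly different":

```python
def rate_factor(input_durations, standard_durations, trigger=0.2):
    """Per-phoneme durations in seconds; returns the factor by which the
    standard speech should be stretched toward the learner's rate."""
    t_in = sum(input_durations)
    t_std = sum(standard_durations)
    if abs(t_in - t_std) / t_std < trigger:
        return 1.0                  # close enough: leave the standard speech
    return t_in / t_std             # stretch/compress toward the learner
```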

As another method of speech rate conversion, the speech rate conversion unit 90 obtains, for each of the input speech and the standard speech, from the input speech file and the standard speech file produced by their analysis, the ratio between the duration of the most strongly uttered, stressed vowel and the durations of the other vowels. The duration of the most strongly uttered stressed vowel in the standard speech is converted into the duration of the corresponding vowel portion of the input speech. Next, the durations of the other vowels are stretched or compressed so that the duration ratios among the vowels of the standard speech remain the same. In this way the durations of the standard speech are recomputed.
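Because preserving the standard speech's inter-vowel ratios while pinning the stressed vowel to the learner's duration amounts to a uniform rescaling of the vowel durations, the second method reduces to a few lines; the data layout here is invented:

```python
def convert_vowel_durations(std_dur, in_dur, stressed):
    """std_dur / in_dur: {vowel_index: duration in s}; stressed: index of
    the most strongly uttered vowel. The stressed vowel takes the
    learner's duration; the others are rescaled by the same factor so
    the standard speech's inter-vowel ratios are preserved."""
    scale = in_dur[stressed] / std_dur[stressed]
    return {v: d * scale for v, d in std_dur.items()}
```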

FIG. 3 is a block diagram showing an embodiment of the third speech synthesis system for vocal practice according to the present invention. In FIG. 3, components bearing the same numbers as in the embodiments of FIGS. 1 and 2 operate identically to those embodiments. A speech rate matching unit 100 compares the time lengths of the input speech and the standard speech; when they differ markedly, it retrieves from the data storage unit 20 a standard speech whose time length is close to that of the input speech, replacing the standard speech with one equal in time length to the input speech.
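A sketch of this selection step, assuming several pre-recorded standard utterances of the same sentence at different lengths; the in-memory representation and the 20% trigger are assumptions:

```python
def pick_standard(candidates, input_len, trigger=0.2):
    """candidates: list of (duration_s, utterance) for one sentence.
    Keep the current standard speech if it is close enough to the
    learner's duration; otherwise swap in the closest recording."""
    current = candidates[0]
    if abs(current[0] - input_len) / input_len <= trigger:
        return current
    return min(candidates, key=lambda c: abs(c[0] - input_len))
```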

FIG. 4 is a block diagram showing an embodiment of the fourth speech synthesis system for vocal practice according to the present invention. In FIG. 4, components bearing the same numbers as in the embodiments of FIGS. 1, 2 and 3 operate identically to those embodiments. A distance calculation unit 120 computes the distance between the prosodic parameters of the teacher's voice and of the input speech, and converts the result into an evaluation value (for example, 0 to 100 points) by looking it up in a predetermined table of evaluation values.
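The patent does not give the distance measure or the table; the sketch below shows one plausible shape, a weighted distance over aligned F0 and duration sequences mapped through a lookup table. Weights, bin edges and scores are placeholders:

```python
import numpy as np

# (upper distance bound, score) pairs standing in for the patent's
# predetermined evaluation table; the values are placeholders.
SCORE_TABLE = [(0.05, 100), (0.10, 80), (0.20, 60), (0.40, 40)]

def prosody_score(f0_teacher, f0_input, dur_teacher, dur_input,
                  w_f0=1.0, w_dur=1.0):
    """Weighted distance between aligned prosodic parameters, mapped to
    a 0-100 evaluation value through the table above."""
    d_f0 = np.mean(np.abs(np.log(f0_teacher) - np.log(f0_input)))
    d_dur = np.mean(np.abs(np.asarray(dur_teacher) - np.asarray(dur_input)))
    d = w_f0 * d_f0 + w_dur * d_dur
    for bound, score in SCORE_TABLE:
        if d <= bound:
            return score
    return 20                       # floor score for large deviations
```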

A screen output unit 110 displays the evaluation value obtained by the distance calculation unit 120 and the temporal change of the prosodic parameters of the teacher's voice and of the input speech. Evaluation values are displayed in a visually easy-to-grasp form, using numbers, graphs, figures and the like. A column giving advice on pronunciation is also provided: for example, when durations are being practiced, messages such as "Please pronounce the sound 'a' shorter" or "Please pronounce the sound 'i' a little longer" are shown, and in the case of pitch, messages such as "Please pronounce the sound 'a' higher" or "Please pronounce the sound 'u' lower". The parts common to these display sentences are prepared in the screen output unit 110 in advance, and the required phoneme name is inserted into the phoneme slot before the sentence is shown on the screen. By setting past evaluation values against the current one, the learner can thus be made to grasp how much he or she has improved.
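The advice text assembly described here, canned templates with a slot for the phoneme name, can be sketched directly; the template strings and keys are illustrative, not the patent's actual tables:

```python
# Canned sentence templates with a slot for the phoneme name.
ADVICE = {
    ("duration", "too_long"):  "Please pronounce the sound '{p}' a little shorter.",
    ("duration", "too_short"): "Please pronounce the sound '{p}' a little longer.",
    ("pitch", "too_low"):      "Please pronounce the sound '{p}' a little higher.",
    ("pitch", "too_high"):     "Please pronounce the sound '{p}' a little lower.",
}

def advice(parameter: str, problem: str, phoneme: str) -> str:
    """Insert the offending phoneme name into the common sentence part."""
    return ADVICE[(parameter, problem)].format(p=phoneme)

# e.g. advice("duration", "too_long", "a")
# -> "Please pronounce the sound 'a' a little shorter."
```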

An evaluation value storage unit 140 stores past evaluation values so that they can be shown on the screen as past practice results. These evaluation results are displayed on the screen via screen output terminal 20.

(Effects of the Invention)

As described in detail above, according to the present invention, the voice quality of the teacher's voice used for vocal practice is close to the learner's, and its speaking rate is almost the same as that of the learner's speech, so a speech synthesis system for vocal practice can be provided that generates a teacher's voice which is easy for the learner to follow and suited to each individual learner. Furthermore, in the screen display of the speech synthesis system for vocal practice of the present invention, phoneme boundaries are shown for both the teacher's voice and the input speech, which makes it easy to see which sound is problematic; since the method of practicing that sound is also shown, this greatly helps the learner to improve.


[Description of Reference Numerals]

1: input terminal; 3: input terminal; 5: output terminal; 6: input terminal; 10: data storage unit; 20: screen output terminal; 30: standard speech pitch extraction unit; 40: input speech detection unit; 50: data matching unit; 60: analysis unit; 70: pitch extraction/analysis unit; 80: synthesis unit; 90: speech rate conversion unit; 100: speech rate matching unit; 110: screen output unit; 120: distance calculation unit; 130: synthesized waveform storage unit; 140: evaluation value storage unit; 150: screen display unit.

[Brief Description of the Drawings] FIGS. 1 to 4 are block diagrams showing embodiments of the first to fourth speech synthesis systems for vocal practice according to the present invention, respectively.

Claims (4)

[Claims]

(1) A speech synthesis system for vocal practice, comprising: means for receiving the speech uttered by a learner as input speech and analyzing the input speech to extract prosodic information such as the temporal change of the fundamental frequency and phoneme durations; means for recording the prosodic information of standard speech, which is the previously analyzed speech of a standard speaker; means for determining the phoneme-level correspondence between the input speech and the standard speech from their respective prosodic information, converting the prosodic information of the input speech into the previously analyzed prosodic information of the standard speech, synthesizing the input speech with the standard speech, and outputting the synthesized speech as a teacher's voice; and means for outputting to a screen the prosodic information extracted from the teacher's voice and from the input speech, and displaying in that output the phoneme names and phoneme boundaries of the input speech at the corresponding positions of the standard speech.
(2) A speech synthesis system for vocal practice, comprising: means for receiving the speech uttered by a learner as input speech and analyzing the input speech to extract prosodic information such as the temporal change of the fundamental frequency and phoneme durations; means for recording the prosodic information of standard speech, which is the previously analyzed speech of a standard speaker; means for determining the phoneme-level correspondence between the input speech and the standard speech from their respective prosodic information, converting the prosodic information of the input speech into the previously analyzed prosodic information of the standard speech, synthesizing the input speech with the standard speech, and outputting the synthesized speech as a teacher's voice; and either means for comparing the phoneme durations obtained from the analysis of the input speech with the phoneme durations obtained from the analysis of the standard speech, calculating the difference, and outputting standard speech whose phoneme durations have been stretched or compressed by that difference as the standard speech used by the teacher-voice output means, or means for determining, for each of the input speech and the standard speech, the relation between the duration of the stressed vowel and the durations of the other vowels, converting the duration of the stressed vowel of the standard speech into the duration of the corresponding vowel of the input speech, converting, for the other vowels, the inter-vowel duration relations of the standard speech into the inter-vowel duration relations of the input speech, and outputting the standard speech with the inter-vowel durations thus converted as the standard speech used by the teacher-voice output means.
(3) A speech synthesis system for vocal practice, comprising: means for receiving the speech uttered by a learner as input speech and analyzing the input speech to extract prosodic information such as the temporal change of the fundamental frequency and phoneme durations; means for recording the prosodic information of standard speech, which is the previously analyzed speech of a standard speaker; means for determining the phoneme-level correspondence between the input speech and the standard speech from their respective prosodic information, converting the prosodic information of the input speech into the previously analyzed prosodic information of the standard speech, synthesizing the input speech with the standard speech, and outputting the synthesized speech as a teacher's voice; and means for recording in advance a plurality of utterances of different durations, comparing the durations of the input speech and the standard speech, and, when the durations differ, selecting from the plurality of utterances the one whose duration is closest to that of the input speech and outputting it to the teacher-voice output means as the standard speech.
(4) A speech synthesis system for vocal practice, comprising: means for receiving the speech uttered by a learner as input speech and analyzing the input speech to extract prosodic information such as the temporal change of the fundamental frequency and phoneme durations; means for recording the prosodic information of standard speech, which is the previously analyzed speech of a standard speaker; means for determining the phoneme-level correspondence between the input speech and the standard speech from their respective prosodic information, converting the prosodic information of the input speech into the previously analyzed prosodic information of the standard speech, synthesizing the input speech with the standard speech, and outputting the synthesized speech as a teacher's voice; and means for calculating the distance between the prosodic information of the input speech and that of the teacher's voice, converting the distance into a degree of learning achievement, and outputting it to a screen.
JP2072888A 1990-03-22 1990-03-22 Speech synthesis method for utterance practice Expired - Lifetime JP2844817B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2072888A JP2844817B2 (en) 1990-03-22 1990-03-22 Speech synthesis method for utterance practice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2072888A JP2844817B2 (en) 1990-03-22 1990-03-22 Speech synthesis method for utterance practice

Publications (2)

Publication Number Publication Date
JPH03273280A true JPH03273280A (en) 1991-12-04
JP2844817B2 JP2844817B2 (en) 1999-01-13

Family

ID=13502336

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2072888A Expired - Lifetime JP2844817B2 (en) 1990-03-22 1990-03-22 Speech synthesis method for utterance practice

Country Status (1)

Country Link
JP (1) JP2844817B2 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1063287A (en) * 1996-08-21 1998-03-06 Brother Ind Ltd Pronunciation trainer
US6666687B2 (en) 1996-09-25 2003-12-23 Sylvan Learning Systems, Inc. Method for instructing a student using an automatically generated student profile
US6795684B2 (en) 1996-09-25 2004-09-21 Sylvan Learning Systems, Inc. System and method for recording teacher notes during a learning session
WO2002031799A1 (en) * 2000-08-04 2002-04-18 Sylvan Learning Systems, Inc. Automated testing and electronic instructional delivery and student management system
JP2008513840A (en) * 2004-09-16 2008-05-01 インフォチュア インコーポレイテッド Learning system and method using situation feedback
JP2006178214A (en) * 2004-12-22 2006-07-06 Yamaha Corp Language learning system
JP4543919B2 (en) * 2004-12-22 2010-09-15 ヤマハ株式会社 Language learning device
JP2006178334A (en) * 2004-12-24 2006-07-06 Yamaha Corp Language learning system
US9672809B2 (en) 2013-06-17 2017-06-06 Fujitsu Limited Speech processing device and method
JP2022025493A (en) * 2020-07-29 2022-02-10 株式会社オトデザイナーズ Speech training system

Also Published As

Publication number Publication date
JP2844817B2 (en) 1999-01-13

Similar Documents

Publication Publication Date Title
De Pijper Modelling British English intonation: An analysis by resynthesis of British English intonation
US5940797A (en) Speech synthesis method utilizing auxiliary information, medium recorded thereon the method and apparatus utilizing the method
Strange et al. Acoustic and perceptual similarity of North German and American English vowels
Iida et al. A corpus-based speech synthesis system with emotion
US6865533B2 (en) Text to speech
US7890330B2 (en) Voice recording tool for creating database used in text to speech synthesis system
Cambier-Langeveld The domain of final lengthening in the production of Dutch
RU2690863C1 (en) System and method for computerized teaching of a musical language
WO2004063902A2 (en) Speech training method with color instruction
JPH065451B2 (en) Pronunciation training device
Felps et al. Foreign accent conversion through concatenative synthesis in the articulatory domain
KR20150076126A (en) System and method on education supporting of pronunciation including dynamically changing pronunciation supporting means
JP2003186379A (en) Program for voice visualization processing, program for voice visualization figure display and for voice and motion image reproduction processing, program for training result display, voice-speech training apparatus and computer system
Hinterleitner Quality of Synthetic Speech
JP2844817B2 (en) Speech synthesis method for utterance practice
JP2003162291A (en) Language learning device
JP2001249679A (en) Foreign language self-study system
JP2002525663A (en) Digital voice processing apparatus and method
JP2806364B2 (en) Vocal training device
JPS616732A (en) Vocal training device
JPH01154189A (en) Enunciation training apparatus
KR20150075502A (en) System and method on education supporting of pronunciation
Rashad et al. Diphone speech synthesis system for Arabic using MARY TTS
JPS60201376A (en) Enunciation training machine
JPS58172680A (en) Enunciation training apparatus