JPH11259094A

JPH11259094A - Rule-based speech synthesis device

Info

Publication number: JPH11259094A
Application number: JP10057723A
Authority: JP
Inventors: Haru Andou; ハル安藤; Yoshinori Kitahara; 義典北原; Nobuo Nukaga; 信尾額賀
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1998-03-10
Filing date: 1998-03-10
Publication date: 1999-09-24

Abstract

PROBLEM TO BE SOLVED: To enable a general user to correct prosodic errors easily, by providing the device with a function that corrects the prosody information attached to a user-selected character string according to the user's corrections. SOLUTION: A system program (1001) stored on a disk 10 controls the whole system, and a mail display program (1002) displays a mail. A synthesis parameter calculation program (1003) calculates parameters representing the features of speech from a character string entered through a character string input device 3 and transfers them to a memory in a main storage device 2. A prosody pattern generation program (1004) selects and generates prosody parameters from a word dictionary with prosody information, and creates the prosody pattern of a sentence by summing those prosody parameters with the prosody parameters of the phrase component obtained from the number of phrases. A speech synthesis program (1005) then synthesizes speech from the parameters output by the synthesis parameter calculation unit 2 and the prosody parameters generated by program 1004, and transfers the speech waveform to the memory.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a rule-based speech synthesizer that can modify the prosody of speech synthesized from a character string into a prosody that sounds natural to the user or that the user prefers.

[0002]

[Description of the Related Art] In recent years, multimedia products such as personal computers and portable information devices have spread remarkably, and their market is expanding into ordinary households. Speech media are no exception: speech synthesis and recognition technology is used on personal computers as a matter of course and is now also being built into portable information devices and car navigation systems. Speech synthesis in particular is used for reading e-mail aloud on personal computers and portable information devices, and for route guidance and the utterance of place names in car navigation.

[0003]

[Problems to Be Solved by the Invention] E-mail makes heavy use of sentences that resemble spoken language, and car navigation makes heavy use of proper nouns such as personal names and place names. Spoken language contains many sentences, such as 「いい案じゃない。」 (ii an ja nai) and 「食べたそうだ。」 (tabeta sou da), whose meaning can be interpreted uniquely only once they are uttered aloud and their prosody is perceived.

[0004] For example, 「いい案じゃない。」 can be interpreted either as an exclamation or as a negation. Proper nouns, moreover, are not exhaustively covered by Japanese dictionaries or accent dictionaries, and their number is enormous. As for prosody, take 「坂田」 (Sakata) and 「前田」 (Maeda): the second kanji of both names is the same 「田」, and the first kanji have the same accent type, yet 「坂田」 takes a type-1 accent while 「前田」 takes a type-0 accent.

[0005] If the prosody of such sentences and proper nouns is wrong, understanding of the utterance is impaired, so a speech synthesizer must render that prosody accurately. With current speech synthesizers, however, when such a sentence or proper noun is given as text, the synthesis processing unit can only read it out with the prosody defined as the default, which sounds unnatural to the user. Furthermore, it has been difficult for a general user to modify the prosody and register it as the correct or preferred one.

[0006] An object of the present invention is to provide means by which a general user can easily correct prosodic errors and by which, when the meaning of a sentence is judged from its prosody, prosody errors are reduced by generating a prosody that does not commit to either meaning, so that speech that is pleasant for the user to listen to can be synthesized.

[0007]

[Means for Solving the Problems] To achieve the above object, one aspect of the present invention provides a rule-based speech synthesizer comprising: a character string input unit for inputting a character string; a synthesis parameter calculation unit that calculates parameters representing the features of speech from the character string; a speech synthesis unit that synthesizes speech from the parameters; a speech output unit that outputs the speech synthesized by the speech synthesis unit; a speech input unit through which the user inputs speech; a speech recognition processing unit that A/D-converts the speech and extracts speech parameters such as pitch by speech analysis; and an information storage device that stores the character string, prosody parameters, synthesized speech, and the like, the synthesizer having a function of correcting the prosody information attached to a character string selected by the user according to the user's corrections.

[0008] Also provided is a rule-based speech synthesizer having a function of generating, for a plurality of prosody pattern candidates output from the speech synthesis unit, a prosody that differs from all of those candidates. Also provided is a rule-based speech synthesizer having a function whereby, when the user selects one of the prosody pattern candidates, the speech synthesis unit synthesizes speech using the selected candidate and the speech output unit outputs the synthesized speech.

[0009] Also provided is a rule-based speech synthesizer having a function whereby the speech recognition processing unit analyzes an utterance input by the user through the speech input unit, and the speech synthesis unit performs speech synthesis using the pitch information obtained from the analysis.

[0010] Also provided is a rule-based speech synthesizer having a function of converting the pitch extracted for each frame with a certain function, thereby shifting it from the user's vocal range to the band of the synthesizer's sound source.

[0011]

[Embodiments of the Invention] An embodiment in which the present invention is applied to a system is described below with reference to the drawings. The system performs rule-based speech synthesis, that is, it synthesizes given text as speech. An e-mail reading system is used as the example for describing the invention.

[0012] FIG. 1 is a block diagram showing the configuration of the present invention. Reference numeral 1 denotes an information processing device that performs processing according to the program that has been started; 2, a main storage device that stores data and the like; 3, a character string input device, typically a keyboard, for entering a character string and transferring it to a memory in the main storage device 2; 4, a position input device, typically a mouse, for entering position information on the screen; 5, a speech input device, typically a microphone, through which the user inputs speech; 6, an information display device, typically a display, that presents information visually; 7, a screen control device that controls the character information, image information, and the like displayed on the screen of the display device 6; 8, an audio output control device that controls the volume and the like of the output speech; 9, an audio output device, typically a speaker, that outputs speech and the like; and 10, a disk device.

[0013] In addition to the execution programs that perform the processing of the present invention, the disk 10 stores programs that perform the basic control of the computer system, the data to be processed, the dictionary referred to during processing, the sound source database, and the like. Specifically, 1001 is a system program that controls the whole system. 1002 is a mail display program. 1003 is a synthesis parameter calculation program that calculates parameters representing the features of speech from the character string entered through the character string input device 3 and transfers them to the memory in the main storage device 2. 1004 is a prosody pattern generation program that selects and generates prosody parameters from the word dictionary with prosody information described later, and generates the prosody pattern of a sentence by summing those prosody parameters with the prosody parameters of the phrase component obtained from the number of phrases. 1005 is a speech synthesis program that synthesizes speech from the parameters output by the synthesis parameter calculation unit 2, the prosody parameters generated by the prosody pattern generation program 1004, and the like, and transfers the speech waveform to the memory. 1006 is a prosody modification program that provides the function of modifying the prosody of the speech output from the audio output device 9 and builds the user interface for it. 1007 is a speech output program that outputs the speech waveform transferred to the memory by the speech synthesis program 1005 to the audio output device. 1008 is a speech input program that A/D-converts the speech input from the microphone 5. 1009 is a speech recognition program that extracts pitch parameters, which represent the height of the sound, from the speech digitized by the speech input program 1008 and stores them in the memory. 1010 is a word dictionary with prosody information, in which each word is listed together with the prosodies that may be assigned to it. 1011 is a sound source database that stores sound source parameters, one of the speech synthesis parameters used by the speech synthesis program 1005. Although FIG. 1 shows the programs and data stored on a single disk 10, they may be stored on a plurality of disk devices (not shown).

[0014] The execution programs and the data they access are read into the main storage device 2 as needed, and the information processing device 1 performs the data processing according to the present invention.

[0015] First, the system program 1001 is started. Next, the user starts the mail display program 1002, and a mail is displayed on the screen of the information display device, for example as shown in FIG. 2. Reference numeral 200 denotes a mail menu display button for selecting the mail to be read aloud; 201, the area in which the mail is displayed; 202, a button for starting the reading; and 203, a button pressed to modify the prosody.

[0016] The sentence 「坂田さんが、それを食べたそうだ。」 (Sakata-san ga, sore wo tabeta sou da.) is used as the example here. The character string displayed on the screen is stored in the memory of the main storage device 2. The synthesis parameter calculation program is then started, reads the character string from the memory, and divides it into phonemes. The division into phonemes uses, for example, the method of Miyazaki et al. ("Language processing method for Japanese sentence speech output," Transactions of the Information Processing Society of Japan, Vol. 27, No. 11, pp. 1053-1061, 1986). This is of course only one example, and another phoneme division method may be used. In this way the character string is divided into the phonemes "s/a/k/a/t/a/s/a/n/g/a, s/o/r/e/w/o/, t/a/b/e/t/a/s/o/u/d/a." and the phoneme division data is stored in the memory. Here "s", "a", and so on are symbols denoting phonemes.
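The phoneme division data described above can be illustrated with a minimal sketch. It assumes the text has already been romanized and that each letter is one phoneme symbol, as in the example string; it is not the method of Miyazaki et al., which starts from Japanese text. The function names `split_phonemes` and `phoneme_division` are hypothetical.

```python
def split_phonemes(romanized_phrase: str) -> list[str]:
    """Split a romanized phrase into one-letter phoneme symbols (assumption:
    one letter per phoneme, as in the example in the text)."""
    return [ch for ch in romanized_phrase if ch.isalpha()]


def phoneme_division(phrases: list[str]) -> str:
    """Join phonemes with '/', phrases with ', ', and end the sentence with '.'."""
    return ", ".join("/".join(split_phonemes(p)) for p in phrases) + "."


if __name__ == "__main__":
    print(phoneme_division(["sakatasanga", "sorewo", "tabetasouda"]))
```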

[0017] The above phoneme symbols are of course only an example, and another phoneme notation may be used. The unit is also not limited to the phoneme; it may be half a phoneme or a syllable. Next, the phoneme division data is read from the memory, the duration of each phoneme is calculated, and the duration data is transferred to the memory. The duration of each phoneme is calculated using, for example, the method of Sagisaka et al. ("Phoneme duration control for speech synthesis by rule," Transactions of the Institute of Electronics and Communication Engineers of Japan, Vol. J67-A, No. 7, pp. 629-636, 1984). This is of course only one example, and another method of calculating phoneme duration may be used.

[0018] In this way, duration data, for example in milliseconds, is obtained from the phoneme division data "s/a/k/a/t/a/s/a/n/g/a, s/o/r/e/w/o/, t/a/b/e/t/a/s/o/u/d/a." and stored in the memory. The unit is of course only an example, and duration data in seconds or another unit may be used.

[0019] The character string is also morphologically analyzed and the result is stored in the memory. Here the result is 「坂田／さん／が／、／食べた／そうだ／。」 (Sakata / san / ga / , / tabeta / sou da / .). Next, the character string and the phoneme division data are read from the memory, and the morphological analysis result stored in the memory is matched against the word dictionary with prosody information 1010.

[0020] The word dictionary with prosody information has the structure shown in FIG. 3. Reference numeral 300 denotes a cell storing the word class, which indicates the type of each morpheme; 301, a cell storing the word string; 302, a cell storing the word number; 303, a cell storing the kana reading of the word string; 304, a priority number cell indicating the priority used when selecting a prosody; 305, a prosody number cell identifying the prosody; and 306, a cell storing the parameters that describe the pitch shape for each prosody number.
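The dictionary layout can be sketched as a small data structure. This is only an illustration under assumptions: the class and field names, the word class label, the word number, and all pitch values are hypothetical placeholders, and only the cell numbers 300 to 306 come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ProsodyEntry:
    priority: int              # cell 304: priority used when selecting a prosody
    prosody_number: str        # cell 305: identifier of the prosody
    pitch_shape: list[float]   # cell 306: pitch-shape parameters (placeholder values)


@dataclass
class DictionaryWord:
    word_class: str            # cell 300: type of the morpheme
    word: str                  # cell 301: word string
    word_number: int           # cell 302: word number
    kana: str                  # cell 303: kana reading
    prosodies: list[ProsodyEntry] = field(default_factory=list)

    def preferred(self) -> ProsodyEntry:
        """Return the prosody whose priority number is smallest (1 = first choice)."""
        return min(self.prosodies, key=lambda p: p.priority)


# Hypothetical entry for "Sakata"; the pitch values are made up for illustration.
sakata = DictionaryWord(
    word_class="proper noun",
    word="坂田",
    word_number=1,
    kana="さかた",
    prosodies=[
        ProsodyEntry(priority=2, prosody_number="1-1", pitch_shape=[180.0, 140.0, 120.0]),
        ProsodyEntry(priority=1, prosody_number="1-4", pitch_shape=[120.0, 160.0, 160.0]),
    ],
)

if __name__ == "__main__":
    print(sakata.preferred().prosody_number)
```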

[0021] The morphological analysis result is matched against the word dictionary, and prosody is assigned using the prosody parameters that are extracted. For the morpheme of the proper noun 「坂田」, the prosody number with priority 1 is selected here, that is, prosody number 1-4. The generated prosody data is transferred to the memory. For 「食べた」 and 「そうだ。」, the numbers in parentheses within the prosody numbers of 「食べた」 yield the combination of 40-1 and 30-1 (40-1-30-1) and the combination of 40-2 and 30-2 (40-2-30-2).
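The pairing of prosody numbers can be sketched as follows. The mapping is an assumed reading of the parenthesized numbers: each prosody number of 「食べた」 is taken to name the prosody number of 「そうだ」 that it combines with. The function name `combine_linked` is hypothetical.

```python
def combine_linked(linked: dict[str, str]) -> list[str]:
    """Build combined prosody numbers from (first word -> linked second word) pairs."""
    return [f"{first}-{second}" for first, second in linked.items()]


# Assumed links: prosody number of "tabeta" -> prosody number of "souda".
tabeta_links = {"40-1": "30-1", "40-2": "30-2"}

if __name__ == "__main__":
    print(combine_linked(tabeta_links))
```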

[0022] In this way the character string 「坂田さんが、それを食べたそうだ。」 is converted into two sets of prosody data, prosody data A "sakatasaNga,sorewo,ta'betasouda." and prosody data B "sakatasaNga,sorewo,tabetaso'uda.", which are stored in the memory.

[0023] Here, a syllable marked with "'" is the syllable carrying the accent nucleus, "," marks a phrase boundary, and "." marks the end of the sentence. These symbols are of course only an example, and other symbols may be used. The two prosodies express "desire" and "hearsay", respectively.
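The notation can be read mechanically, as the following minimal sketch suggests. It assumes the accent mark directly follows the syllable that carries the nucleus and reports the character offset at which the mark appears; the function name `parse_prosody` and the return layout are hypothetical.

```python
from typing import Optional


def parse_prosody(notation: str) -> list[tuple[str, Optional[int]]]:
    """Split prosody notation into phrases.

    Returns (phrase text without the accent mark, offset of the mark or None).
    """
    phrases = []
    for raw in notation.rstrip(".").split(","):
        mark = raw.find("'")                 # -1 when the phrase has no accent nucleus mark
        text = raw.replace("'", "")
        phrases.append((text, mark if mark >= 0 else None))
    return phrases


if __name__ == "__main__":
    print(parse_prosody("sakatasaNga,sorewo,ta'betasouda."))
    print(parse_prosody("sakatasaNga,sorewo,tabetaso'uda."))
```

In this sketch the two prosodies differ only in the offset of the nucleus in the last phrase, which is what distinguishes the "desire" reading from the "hearsay" reading.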

[0024] Next, a prosody is generated that is interpreted as neither "desire" nor "hearsay", or as both. For example, the pitch values of each phoneme in prosody data A and prosody data B are added and their intermediate value is calculated. The resulting data for each phoneme is stored in the memory as prosody data C. Next, the phoneme division data and the prosody data are read from the memory, the fundamental frequency is calculated, and the fundamental frequency data is transferred to the memory. The fundamental frequency is calculated using, for example, the method of Fujisaki et al. ("Fundamental frequency patterns of Japanese word accent and their generation mechanism," Journal of the Acoustical Society of Japan, 27, pp. 445-453, 1971). This is of course only one example, and another method of calculating the fundamental frequency may be used.
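The intermediate prosody C can be sketched as a per-phoneme average of the two candidates. This assumes both candidates hold one pitch value per phoneme over the same phoneme sequence; the function name and the sample values in hertz are hypothetical.

```python
def intermediate_prosody(pitch_a: list[float], pitch_b: list[float]) -> list[float]:
    """Add the per-phoneme pitch values of A and B and take their midpoint."""
    if len(pitch_a) != len(pitch_b):
        raise ValueError("both prosodies must cover the same phoneme sequence")
    return [(a + b) / 2 for a, b in zip(pitch_a, pitch_b)]


if __name__ == "__main__":
    prosody_a = [120.0, 180.0, 140.0]   # made-up values: early pitch peak
    prosody_b = [120.0, 140.0, 180.0]   # made-up values: late pitch peak
    print(intermediate_prosody(prosody_a, prosody_b))
```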

[0025] In this way the character string 「坂田さんが、それを食べたそうだ。」 is converted into the fundamental frequency data of Formula 1 and stored in the memory.

[0026]

[Formula 1]

$$(F_0[1],\ F_0[2],\ \ldots,\ F_0[i],\ \ldots,\ F_0[p])$$

Here, $F_0[i]$ is a numerical value representing the fundamental frequency, obtained for example every 10 milliseconds, and $p$ is the number of fundamental frequency values.

[0027] The above representation of the fundamental frequency is of course only an example. Any method that yields the fundamental frequency values may be used, such as obtaining the fundamental frequency at time units that are not a fixed interval, or representing it by a set of parameters of a model that generates the fundamental frequency. The synthesis parameters calculated in this way are transferred to the memory.
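A fixed-interval fundamental frequency sequence of the kind in Formula 1 can be illustrated with a simple sketch. It assumes a piecewise-constant contour, one pitch value per phoneme held for that phoneme's duration, which is a simplification and not the method of Fujisaki et al.; the function name and sample values are hypothetical.

```python
def frame_f0(pitch_hz: list[float], duration_ms: list[int], frame_ms: int = 10) -> list[float]:
    """Expand per-phoneme pitch values into one F0 value per frame."""
    frames: list[float] = []
    for hz, ms in zip(pitch_hz, duration_ms):
        frames.extend([hz] * (ms // frame_ms))   # whole frames only; remainders are dropped
    return frames


if __name__ == "__main__":
    print(frame_f0([120.0, 160.0], [30, 20]))
```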

[0028] Next, the speech synthesis unit 4 reads the synthesis parameters from the memory and performs speech synthesis by driving a synthesis filter, generating speech data. The speech synthesis uses, for example, the method described in Furui, "Digital Speech Processing," p. 22, Tokai University Press, 1985. This is of course only one example, and another speech synthesis method may be used. The speech data generated in this way is stored in the memory. The speech data stored in the memory is then output through the speech output device 5. The target speech is synthesized by the above procedure.

[0029] As another example, take 「黄色い虎の小屋」 (kiiroi tora no koya, "yellow tiger hut"). From the character string alone it cannot be determined whether the tiger or the hut is yellow. Either interpretation can be induced, however, by lengthening or shortening the pause between 「黄色い」 (yellow) and 「虎」 (tiger) and by varying the pitch of 「虎」. For example, setting the pause length to 50 msec and using prosody number 501-4-1 of FIG. 3 leads to the reading "yellow tiger". Setting the pause length to 50 msec and making the rising pitch of the "o" in 「虎」 (tora) 70 Hz lower than the pitch of "i", the last phoneme of 「黄色い」, which here means adopting 501-4-2 as the prosody number 305 of 「虎」 in the word dictionary, leads to the reading in which the hut is yellow. Making the pause between 「黄色い」 and 「虎」 100 msec or longer also leads to the reading in which the hut is yellow.
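The relation between pause, pitch, and reading can be summarized in a small sketch. The text gives single example values (50 msec, 100 msec, 70 Hz); treating 100 msec and 70 Hz as thresholds is an assumption made here for illustration, and the function name is hypothetical.

```python
def induced_reading(pause_ms: int, pitch_drop_hz: float) -> str:
    """Guess which reading of 'kiiroi tora no koya' a prosody induces.

    pause_ms: pause between 'kiiroi' and 'tora'.
    pitch_drop_hz: how far the 'o' of 'tora' starts below the final 'i' of 'kiiroi'.
    """
    if pause_ms >= 100 or pitch_drop_hz >= 70:   # assumed thresholds
        return "yellow hut"
    return "yellow tiger"


if __name__ == "__main__":
    print(induced_reading(pause_ms=50, pitch_drop_hz=0))
    print(induced_reading(pause_ms=50, pitch_drop_hz=70))
    print(induced_reading(pause_ms=100, pitch_drop_hz=0))
```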

[0030] Next, suppose the user wants to modify the prosody of 「坂田」 in the example sentence 「坂田さんが、それを食べたそうだ。」. The user selects the part to be modified, here 「坂田」, with a position input device such as a mouse and presses the prosody modification button 203 shown in FIG. 2, whereupon a prosody modification window 205 is displayed as shown in FIG. 4. Synthesized speech data is then generated from the character string 「坂田」 by the prosody pattern candidate generation method and the speech synthesis method described above, and the generated speech data is stored in the memory. Here it is assumed that six prosody pattern candidates, and therefore six sets of speech data, are generated.

[0031] When the user selects one of prosody pattern candidates A, B, C, D, E, and F displayed in the window, the corresponding one of the speech data stored in the memory is output through the speech output device 5. If one of the candidates has the prosody the user judges to be correct or preferred, the user selects one of the prosody pattern candidates A to D and presses the prosody confirmation button displayed in the prosody modification window 205. The selected candidate is then registered in the word dictionary with prosody information 1010, with its prosody priority number 304 set to "1".

[0032] For example, if prosody pattern candidate C is selected, the prosody priority number of prosody number 1-3 becomes "1". As a result, the next time that character string appears in a mail, speech is synthesized by the method described above using the registered prosody data.
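Registration of the user's choice can be sketched as a priority update. The text states only that the chosen prosody number receives priority "1"; pushing the other entries down while keeping their relative order is an assumption, as are the function name and the starting priorities.

```python
def register_choice(priorities: dict[str, int], chosen: str) -> dict[str, int]:
    """Give `chosen` priority 1 and renumber the rest in their previous order."""
    if chosen not in priorities:
        raise KeyError(chosen)
    others = sorted((n for n in priorities if n != chosen), key=lambda n: priorities[n])
    updated = {chosen: 1}
    for rank, number in enumerate(others, start=2):
        updated[number] = rank
    return updated


if __name__ == "__main__":
    before = {"1-1": 2, "1-2": 3, "1-3": 4, "1-4": 1}   # hypothetical starting priorities
    print(register_choice(before, "1-3"))
```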

[0033] The user can also modify the prosody by speech input. In that case the user first designates on the screen the part whose prosody is to be modified, presses the speech input button 204, and says 「さかた」 (sakata) into the speech input device 5. When the utterance is finished, the user presses the speech input button. The speech is captured, for example, as monaural audio at a sampling frequency of 16 kHz with 16-bit quantization. The speech input device 5 then transfers the speech data to the memory.

[0034] Next, the speech parameters are analyzed. The speech recognition program 1009 is started, reads the speech data from the memory, analyzes the fundamental frequency, and extracts the fundamental frequency data $(F'_0[0],\ F'_0[1],\ \ldots,\ F'_0[i],\ \ldots,\ F'_0[k])$. The fundamental frequency data is then brought into the same range as the sound source frequency range stored in the sound source database. Letting M_P be the center value of the fundamental frequency in the sound source database, Formula 2 yields fundamental frequency data matched to the sound source.

[0035]

[Formula 2]

$$F'_0[p] = \frac{\sum F'_0[p]}{p} - M\_P + F'_0[p]$$

The fundamental frequency data $F'_0[p]$ is transferred to the memory. Here, $F'_0[i]$ is a fundamental frequency value and $k$ is the number of values in the fundamental frequency data. The fundamental frequency is analyzed using, for example, the method of obtaining it from the autocorrelation function of the residual signal of LPC analysis. This is of course only one example, and another fundamental frequency analysis method may be used. In this way the fundamental frequency data of the utterance 「さかた」 is stored in the memory. This data is registered in the word dictionary with prosody information. On registration, the prosody priority number is set to "1" and, as the prosody symbol, the fundamental frequencies are stored as a time series.
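Formula 2 describes an additive shift built from the mean of the user's fundamental frequency contour and the center value M_P. The sketch below assumes the intent is to re-center the contour on M_P, so it subtracts the user's mean and adds M_P; this sign convention is an assumption and differs from the formula as printed. The function name and sample values are hypothetical.

```python
def shift_to_source_band(f0_user: list[float], source_center_hz: float) -> list[float]:
    """Shift a user's F0 contour so that its mean equals the source center value."""
    if not f0_user:
        raise ValueError("empty F0 contour")
    mean_user = sum(f0_user) / len(f0_user)
    offset = source_center_hz - mean_user       # same offset for every frame
    return [f + offset for f in f0_user]


if __name__ == "__main__":
    user_contour = [200.0, 220.0, 240.0]        # made-up contour of the user's utterance
    print(shift_to_source_band(user_contour, source_center_hz=120.0))
```

Because the same offset is added to every frame, the shape of the contour, and with it the accent pattern the user spoke, is preserved while the register moves to that of the sound source.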

[0036]

[Effects of the Invention] As described above, according to the present invention, the prosody information attached to a character string selected by the user can be corrected according to the user's corrections. In addition, for a plurality of prosody pattern candidates output from the speech synthesis unit, a prosody that differs from all of those candidates can be generated. Furthermore, when the user selects one of the prosody pattern candidates, the speech synthesis unit can synthesize speech using the selected candidate and the speech output unit can output it.

[0037] The speech recognition processing unit can also analyze an utterance input by the user through the speech input unit, and the speech synthesis unit can synthesize speech using the pitch information obtained from the analysis. Furthermore, by converting the pitch extracted for each frame with a certain function, the pitch can be shifted from the user's vocal range to the band of the synthesizer's sound source.

[Brief description of the drawings]

FIG. 1 is a system configuration diagram of an embodiment of a rule-based speech synthesizer to which the present invention is applied.

FIG. 2 is an explanatory diagram (1) showing an example of the screen display of the rule-based speech synthesizer shown in FIG. 1.

FIG. 3 is an explanatory diagram showing an example of the structure of the word dictionary with prosody information.

FIG. 4 is an explanatory diagram (2) showing an example of the screen display of the rule-based speech synthesizer shown in FIG. 1.

[Explanation of symbols]

1: information processing device; 2: main storage device; 3: character string input device; 4: position information input device; 5: speech input device; 6: information display device; 7: display control device; 8: audio output control device; 9: audio output device; 10: disk; 1001: system program; 1002: mail display program; 1003: synthesis parameter calculation program; 1004: prosody parameter generation program; 1005: speech synthesis program; 1006: prosody modification program; 1007: speech output program; 1008: speech input program; 1009: speech recognition program; 1010: word dictionary with prosody information; 1011: sound source database.

Claims

[Claims]

1. A rule-based speech synthesizer comprising: a character string input unit for inputting a character string; a synthesis parameter calculation unit that calculates parameters representing the features of speech from the character string; a speech synthesis unit that synthesizes speech from the parameters; a speech output unit that outputs the speech synthesized by the speech synthesis unit; a speech input unit through which a user inputs speech; a speech recognition processing unit that A/D-converts the speech and extracts speech parameters such as pitch by speech analysis; and an information storage device that stores the character string, prosody parameters, synthesized speech, and the like, wherein the synthesizer has a function of correcting the prosody information attached to a character string selected by the user according to the user's correction request.

2. The rule-based speech synthesizer according to claim 1, having a function of generating, for a plurality of prosody pattern candidates output from the speech synthesis unit, a prosody pattern that differs from all of the prosody pattern candidates.

3. The rule-based speech synthesizer according to claim 1, having a function whereby, when the user selects one of the prosody pattern candidates, the speech synthesis unit synthesizes speech using the selected prosody pattern candidate and the speech output unit outputs the synthesized speech.

4. The rule-based speech synthesizer according to claim 1, having a function whereby the speech recognition processing unit analyzes the utterance input by the user through the speech input unit and extracts a time series of pitch, and the speech synthesis unit performs speech synthesis using the pitch information.

5. The rule-based speech synthesizer according to claim 4, having a function of converting the extracted time series of pitch with a predetermined function, thereby shifting it from the user's vocal range to the band of the synthesizer's sound source.