JP3421963B2

JP3421963B2 - Speech component creation method, speech component database and speech synthesis method

Info

Publication number: JP3421963B2
Application number: JP00730497A
Authority: JP
Inventors: 正信東田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-01-20
Filing date: 1997-01-20
Publication date: 2003-06-30
Anticipated expiration: 2017-01-20
Also published as: JPH10207488A

Abstract

PROBLEM TO BE SOLVED: To create, by a simple method and with a small number of speech components, synthesized speech that is easy to listen to. SOLUTION: A sentence whose character string has a known phoneme sequence and prosody, 'SHIAWASE HA ARUI-...', is read aloud at a constant speed and recorded on a medium 8; the time needed to reproduce the first two syllables 'SHIA' is measured; the speech reproduced from the medium 8 is cut successively at intervals of this time length (for instance 200 msec) into two-syllable speech waveforms 8a, 8b, ...; the character string of the recorded sentence is matched to these waveforms 8a, 8b, ..., which are labeled with the phonemes 'SHIA' and a prosody (rising type), the phonemes 'WASE' and a prosody (rising type), and so on; and the labeled waveforms are stored at different addresses of a database.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech component creation method for creating, from continuously uttered human speech, speech waveforms used for speech synthesis as speech components; to a speech component database storing various speech waveforms used for speech synthesis; and to a speech synthesis method that uses this database to generate synthesized speech corresponding to Japanese reading text.

[0002]

[Prior Art] Conventionally, the following first to third methods are known as recording-and-editing speech synthesis methods that use human speech as the synthesis unit. In the first method, speech components of one syllable each are cut out of recorded human speech for every Japanese syllable: the 50 basic sounds such as "a", "i", "u", "e", "o", ..., contracted sounds (yoon) such as "kya" and "sha", voiced sounds, semi-voiced sounds, and so on. These components are registered in a database or the like and combined to synthesize Japanese reading text. For example, as in ni/ho/n, the components "ni", "ho", and "n" are combined syllable by syllable to synthesize the text.

[0003] However, when the Japanese reading text contains long vowels or geminate consonants (sokuon), as in "to-kyo-" (Tokyo) or "roppongi", speech synthesized by the first method sounds noticeably unnatural. For this reason, as in the second and third methods below, speech components of two or more syllables are generally registered in a database, and the components corresponding to the Japanese reading text are retrieved to generate the synthesized speech.

[0004] In the second method, speech components are cut out of recorded human speech for each multi-syllable phrase (bunsetsu) and registered in a database; the components corresponding to the Japanese reading text are retrieved phrase by phrase and connected to synthesize the text. For these phrase-level components, versions of the same phoneme string (reading) with various prosodic variations are stored in the database. For example, the components for phrases formed from the noun "Osaka" and a particle consist of the phoneme strings "Osaka wa", "Osaka ga", "Osaka ni", "Osaka e", and "Osaka o", each with various prosodic variations, and all of these are stored in the database.

[0005] When text analysis divides the sentence to be synthesized into phrases, as in watashi wa / ashita / Osaka e / ikimasu ("I will go to Osaka tomorrow"; "/" marks a phrase boundary), the speech components matching the phoneme and prosody sequences of the phrases "watashi wa", "ashita", "Osaka e", and "ikimasu" are retrieved from the database and connected to synthesize the speech.

[0006] In the third method, Japanese reading text is synthesized by combining speech components, cut out of recorded human speech and registered, that correspond to words of every part of speech, including the inflected forms of adjectives and verbs. As in the second method, components with various prosodic variations for the same phoneme string are stored in the database.

[0007] When text analysis decomposes the sentence to be synthesized by part of speech, as in watashi / wa / ashita / Osaka / e / iki / masu ("/" marks a part-of-speech boundary), speech components having the phoneme and prosody sequences corresponding to each word are retrieved from the database and connected to synthesize the Japanese reading text.

[0008]

[Problems to Be Solved by the Invention] Cutting speech components out of recorded human speech is laborious, because the portions with the desired phonemes and prosody must be selected while viewing the speech waveform. A creation method that needs as few speech components as possible is therefore desirable. The second conventional method, however, requires components with various phoneme and prosody sequences for every phrase. Even counting only the combinations of nouns with particles and of verbs with auxiliary verbs, an enormous number of components must be created. Unless the Japanese reading text to be synthesized is restricted, creating the components takes so much effort that the method is difficult to put to practical use.

[0009] The third method needs fewer speech components than the second. Even so, once the inflected forms of adjectives, verbs, and other inflecting words are counted in addition to the words of every part of speech, the number of components required far exceeds the number of entries in a Japanese dictionary (specifically, more than 100,000 words), so creating the components is as laborious as in the second method.

[0010] An object of the present invention is to provide a speech component creation method that needs only a small number of speech components, takes little effort, and yields components that capture the rhythm of human utterance; a speech component database storing those components; and a method of synthesizing speech using that database.

[0011]

[Means for Solving the Problems] The speech component creation method of the present invention is characterized mainly in that: speech reading aloud, at a constant speed, a sentence whose character sequence, phonemes, and prosody are known is recorded; in the playback of the recording, the playback time of the first group of syllables equal to the number of beats (syllables) underlying the human utterance rhythm, that is, the number of syllables in one period of the utterance rhythm, is set as a unit time; the recording is cut at every unit time, in order from the beginning, to create recording pieces; and each recording piece, in the order in which it was cut out, is given the corresponding character string of the sentence and its prosody, thereby creating speech components.

[0012] In the speech component database of the present invention, the stored speech waveforms all have the same number of constituent syllables and the same length, and each waveform is stored at a different address according to its constituent phonemes and prosody type. The first speech synthesis method of the present invention uses the speech component database of the present invention; recorded base sentence holding means in which the speech of the base sentence, that is, the response sentence excluding its insertion phrase, is recorded; and an insertion phrase component combination dictionary describing the storage addresses in the speech component database of the speech components that make up the insertion phrase. The method comprises: a character information input step of inputting a character string to fill the insertion phrase; a synthesized speech assembly step of selecting from the speech component database, by reference to the insertion phrase component combination dictionary, the speech components corresponding to the input character string and connecting them; a speech synthesis step of synthesizing the connected components into the synthesized speech of the insertion phrase; and an output step of fitting that synthesized speech into the recorded base sentence speech and outputting the speech of the response sentence.

[0013] The second speech synthesis method of the present invention uses the speech component database of the present invention and a fixed-sentence component combination dictionary describing, for each of various fixed sentences, the storage addresses in the speech component database of the speech components that make up that sentence. The method comprises: a character information input step of inputting the character string of a fixed sentence; a synthesized speech assembly step of selecting from the speech component database, by reference to the fixed-sentence component combination dictionary, the speech components corresponding to the input character string and connecting them; a speech synthesis step of synthesizing the connected components into synthesized speech; and an output step of outputting the synthesized speech.

[0014] The third speech synthesis method of the present invention uses the speech component database of the present invention and a phrase component combination dictionary describing, for each of various phrases, the storage addresses in the speech component database of the speech components that make up that phrase. The method comprises: a character information input step of inputting an arbitrary character string together with its phrase boundaries and pause positions; a character string synthesized speech assembly step of selecting from the speech component database, by reference to the phrase component combination dictionary, the speech components corresponding to the input character string and connecting them; a speech synthesis step of synthesizing the connected components into synthesized speech; and an output step of outputting the synthesized speech.

[0015] Thus, in the present invention, the speech components needed to assemble synthesized speech are created by cutting human speech recorded at a constant speed into portions at every unit time based on the human utterance rhythm. The human utterance rhythm is thereby built into the speech components and carried into the synthesized speech generated from them, and the speech components are also easy to create.

[0016]

[Operation] In the present invention, sentences that contain a variety of phonemes and prosody and whose character sequence is known are first read aloud at a constant speed and recorded on a recording medium. The recording is played back and, while viewing the speech waveform, the playback time corresponding to the first group of beats (syllables) underlying the human utterance rhythm is measured; that is, the time of one period of the human utterance rhythm, counted in syllables, is measured from the start of the playback. This measured time is set as the unit time. The recording is then cut mechanically and simply at every unit time from its beginning, creating a number of recording pieces, each a speech waveform of fixed length.

[0017] Each recording piece (speech waveform) is given the phonemes and prosody of the character string it contains by referring to the sentence in the order in which the pieces were cut. Recording pieces with the various phonemes and prosody are collected to form the speech components, which are stored in the speech component database. The speech components stored in this database all have the same number of syllables and the same length, and are stored at different addresses according to their combination of phonemes and prosody type. As described in the following paragraphs, synthesized speech is generated by selecting suitable speech components from this database according to the input character string and phrase boundaries and connecting them.

[0018] For example, to output as speech a response sentence containing an insertion phrase, the speech of the sentence other than the insertion phrase is recorded in advance as the recorded base sentence speech. A character string to fill the insertion phrase is input; based on it, the insertion phrase component combination dictionary is consulted, and the speech components that make up the insertion phrase are selected from the speech component database and connected. The connected components are synthesized into synthesized speech, which is then fitted into the recorded base sentence speech, and the response sentence is output as speech.

[0019] To output one of several fixed sentences as speech, the storage addresses in the speech component database of the speech components that make up each fixed sentence are described in advance in the fixed-sentence component combination dictionary. The dictionary is consulted with the input fixed-sentence character string, the speech components that make up the speech of the fixed sentence are selected from the speech component database and connected, the connected components are synthesized into synthesized speech, and the synthesized speech of the fixed sentence is output.

[0020] To synthesize an arbitrary sentence, the character string to be synthesized is input together with character information giving its phrase boundaries and pause positions. Based on this input, the phrase component combination dictionary is consulted and, for each phrase of the sentence, the speech components that make up the phrase are selected from the speech component database and connected, until the components for all phrases, that is, for the whole character string, have been selected and connected. Pauses are added to the connected components, the result is synthesized into synthesized speech, and the synthesized speech of the arbitrary character string is output.
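The arbitrary-sentence flow described in this paragraph can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the "/" phrase separator follows the notation used in this description, while the `<P>` pause marker, the function name `plan_utterance`, and the addresses in the example dictionary are hypothetical and are not taken from the patent.

```python
PAUSE = "<P>"  # hypothetical marker for a pause position in the input


def plan_utterance(text, phrase_dictionary):
    """Turn phrase-delimited text into an ordered list of component addresses.

    text: phrases separated by "/", with PAUSE given as its own segment.
    phrase_dictionary: maps each phrase to the database addresses of the
        speech components that make it up (the phrase combination dictionary).
    """
    plan = []
    for segment in text.split("/"):
        if segment == PAUSE:
            plan.append(PAUSE)  # keep the pause position in the plan
        elif segment:
            plan.extend(phrase_dictionary[segment])
    return plan


# Hypothetical dictionary entries; real addresses depend on the database layout.
phrase_dictionary = {
    "ashita": ["101-2", "102-3"],
    "ikimasu": ["201-1", "202-3"],
}
print(plan_utterance("ashita/<P>/ikimasu", phrase_dictionary))
```

The plan keeps pauses as explicit entries so that a later stage can add silence at those positions after the components have been connected.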

[0021]

[Embodiments of the Invention] FIG. 1 is a flowchart showing one embodiment of the speech component creation method according to the present invention. The method of this embodiment comprises a database address setting step S1, a constant-speed recording step S2, a unit time setting step S3, a recording cutting step S4, a recording piece phoneme/prosody assignment step S5, an address assignment step S6, and a storing step S7.

[0022] In the database address setting step S1, the addresses of the speech component database 1, which stores the speech components serving as the units from which synthesized speech is generated, are set in advance. Each speech component to be created is a two-syllable, fixed-length speech waveform having the phonemes and prosody needed for speech synthesis. FIG. 2 shows the phoneme matrix of two-syllable speech components 2, which forms part of the speech component database 1. The rows and the columns each list the Japanese 50 sounds together with the moraic nasal (hatsuon), voiced sounds, semi-voiced sounds, contracted sounds (yoon), geminate consonants (sokuon), long vowels, and silence; the rows give the first sound of the two syllables and the columns the second. In the figure, phonemes are written in kana. Not every element of this matrix is needed as a synthesis unit. For example, of the two units "gak" (がっ) and "kou" (こう) that make up "gakkou" (school), "kou" can be adequately replaced by "ko-" (こー, written with a long-vowel mark). As a result, only a few thousand kinds of speech components 2 are needed to synthesize Japanese reading text.

[0023] In the database address setting step S1, addresses are assigned to the elements of this phoneme matrix in order, with "aa" at address 1, "ai" at address 2, and so on. Each two-syllable element of the matrix is further given three prosodic variations, shown in FIG. 3: (a) flat, (b) rising, and (c) falling. In the flat type there is no change in intonation between the first and second sounds; in the rising type the second sound is accented more strongly than the first; in the falling type the second sound is accented more weakly than the first.
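The three prosody types can be illustrated with a small Python sketch. The patent does not say how accent strength is measured or how small a difference still counts as "no change", so the numeric accent values (for instance a mean pitch per sound), the function name `prosody_type`, and the 5% tolerance below are all assumptions made for illustration.

```python
def prosody_type(first_accent, second_accent, tolerance=0.05):
    """Classify a two-syllable unit as flat, rising, or falling.

    first_accent, second_accent: hypothetical accent-strength measurements
        for the first and second sound (for example mean pitch in Hz).
    tolerance: assumed relative difference below which the unit is "flat".
    """
    diff = second_accent - first_accent
    scale = max(abs(first_accent), abs(second_accent), 1e-9)
    if abs(diff) <= tolerance * scale:
        return "flat"      # no change in intonation between the two sounds
    if diff > 0:
        return "rising"    # second sound accented more strongly
    return "falling"       # second sound accented more weakly


print(prosody_type(100.0, 100.0))
print(prosody_type(100.0, 130.0))
print(prosody_type(130.0, 100.0))
```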

[0024] FIG. 5 shows an overall view of the speech component database 1. Part A of FIG. 4 shows how the speech components 2 of the phoneme matrix with flat prosody are stored, part B how those with rising prosody are stored, and part C how those with falling prosody are stored.

[0025] As shown in the figure, each speech component 2 is given, in addition to its phoneme matrix address, a branch address determined by its prosody type: branch address 1 for the flat type, 2 for the rising type, and 3 for the falling type. For example, address 1-1 of the speech components 2 is the flat type of the phonemes "aa", a prosody with no change in intonation between the first "a" and the second "a"; address 2-3 is the falling type of the phonemes "ai", a prosody in which the second sound "i" is accented more weakly than the first sound "a".
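A minimal Python sketch of this two-part address follows. It assumes that matrix addresses are numbered row by row (first sound as row, second sound as column), which matches "aa" = 1 and "ai" = 2 in the description but is otherwise an assumption. The five-sound inventory `SOUNDS` and the function name `part_address` are hypothetical; with the full inventory of FIG. 2 the numbers would differ, so this sketch does not reproduce addresses such as the one given later for "shia".

```python
# Hypothetical, heavily shortened sound inventory; the full row/column
# order of the phoneme matrix in FIG. 2 is not reproduced here.
SOUNDS = ["a", "i", "u", "e", "o"]

# Branch addresses by prosody type, as stated in the description.
BRANCH = {"flat": 1, "rising": 2, "falling": 3}


def part_address(first, second, prosody):
    """Return '<matrix address>-<branch address>' for a two-syllable unit."""
    row = SOUNDS.index(first)    # first sound selects the row
    col = SOUNDS.index(second)   # second sound selects the column
    matrix_address = row * len(SOUNDS) + col + 1  # assumed row-major, 1-based
    return f"{matrix_address}-{BRANCH[prosody]}"


print(part_address("a", "a", "flat"))     # 1-1
print(part_address("a", "i", "falling"))  # 2-3
```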

[0026] In the constant-speed recording step S2, sentences whose phoneme and prosody sequences are known are recorded at a constant speed, as shown in FIG. 6. Specifically, several ordinary sentences are first prepared such that their phoneme and prosody sequences are known and that, when divided every two syllables, they yield every two-syllable unit with the phonemes and prosody needed for speech synthesis. The division into two syllables reflects the view that natural human biological and vocal rhythm is in duple time, just as marches and cheering songs are in duple time. Building this duple rhythm into the speech components 2 lets speech synthesized from them keep a natural rhythm and sound natural and easy to follow to the listener. In other words, the basic number of beats adopted to capture the human utterance rhythm is two syllables (one beat corresponding to one syllable).

[0027] The sentences are read by a speaker 9 at a constant speed and recorded on a recording medium 8. The speaker 9 is preferably someone trained to read at a constant speed. In the unit time setting step S3, the unit time is set for cutting out of the recording the recording pieces 8a, 8b, ..., fixed-length speech waveforms from which the speech components 2 are made. That is, the recording medium 8 is played back, the playback time containing the first two syllables (the number of beats of the utterance rhythm) is measured while viewing the speech waveform, and this time is set as the unit time (for example, 200 ms).

[0028] In the recording cutting step S4, the speech waveform recorded on the recording medium 8 is simply cut, in order, at every unit time set above, yielding recording pieces 8a, 8b, ..., each a speech waveform of fixed length. Each of the recording pieces 8a, 8b, ... carries the human utterance rhythm based on two syllables. In the recording piece phoneme/prosody assignment step S5, phonemes and prosody are assigned to the recording pieces 8a, 8b, .... Because the phoneme and prosody sequences of the text recorded on the recording medium 8 are known, as noted above, the phonemes and prosody of each recording piece can easily be determined by matching the phoneme and prosody sequences of the text to the pieces in the order in which they were cut out.
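The cutting in step S4 amounts to slicing the sampled waveform at a fixed stride. The Python sketch below shows one way this could look. The description works with a recording medium rather than sample arrays, so the sample-based representation, the 8000 Hz sample rate in the example, the function name `cut_into_pieces`, and the choice to drop a trailing piece shorter than the unit time are all assumptions.

```python
def cut_into_pieces(samples, sample_rate, unit_ms):
    """Cut a recording into consecutive pieces of exactly unit_ms each.

    samples: sequence of audio samples for the whole recording.
    sample_rate: samples per second (assumed known for the recording).
    unit_ms: unit time in milliseconds, e.g. 200 for two syllables.
    A trailing remainder shorter than one unit is dropped (an assumption).
    """
    unit_len = sample_rate * unit_ms // 1000
    if unit_len <= 0:
        raise ValueError("unit time is shorter than one sample")
    last_start = len(samples) - unit_len
    return [samples[i:i + unit_len] for i in range(0, last_start + 1, unit_len)]


# One second of placeholder audio at an assumed 8000 Hz, cut every 200 ms.
recording = [0.0] * 8000
pieces = cut_into_pieces(recording, sample_rate=8000, unit_ms=200)
print(len(pieces), len(pieces[0]))
```

Because every piece has the same length, the pieces stay aligned with the two-syllable grid only if the speaker really keeps a constant speed, which is why the description asks for a trained speaker.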

[0029] A full set of recording pieces having the phonemes and prosody needed to generate synthesized speech is assembled in this way, and each becomes a speech component 2. In the address assignment step S6, the address in the speech component database 1 at which each speech component 2 is to be stored is determined from the phoneme and prosody information assigned to it.

[0030] In the storing step S7, the speech component 2 is placed, based on its address information, at the corresponding address in the speech component database 1 shown in FIGS. 2 and 4. Next, a concrete example of the speech component creation method described above is explained with reference to FIG. 6. As shown in the figure, the speaker 9 records at a constant speed the sentence "Shiawase wa aruite konai. Dakara, aruite yuku n da ne." ("Happiness won't come walking to you. So you walk to it.") The phoneme sequence of this sentence is "shiawasewa aruitekonai. dakara, aruiteyukundane", and its character string information is known, for example that the prosody of "shia" is the rising type and that of "wase" is the rising type. This sentence is only part of the text the speaker 9 is asked to read. Because the speaker 9 reads at a constant speed, the pauses inserted at appropriate positions in the sentence have a length of n syllables (n: an integer) (constant-speed recording step S2).

[0031] The recording medium 8 then holds "shiawasewa (pause) aruitekonai (pause) dakara (pause) aruiteikundane". The time taken to play back the recording piece 8a containing the first two syllables of the recording, "shia", is measured; if this playback time is, for example, 200 ms, then 200 ms is set as the unit time (unit time setting step S3).

[0032] Next, the playback of the recording medium 8 is cut successively at every unit time (200 ms), yielding recording pieces 8a, 8b, ... of two syllables each (recording cutting step S4). The character string of the recorded sentence is matched to the recording pieces 8a, 8b, ..., and phonemes and prosody are assigned in order: phonemes "shia" with prosody (rising type), phonemes "wase" with prosody (rising type), and so on. The recording pieces given phonemes and prosody become the speech components 2 (recording piece phoneme/prosody assignment step S5).
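The labeling in step S5 pairs each piece, in cut order, with the next entry of the known phoneme and prosody sequence. A minimal Python sketch is given below. The `VoicePart` record, the function name `label_pieces`, and the placeholder sample values are hypothetical; only the two labels "shia" (rising) and "wase" (rising) come from the description.

```python
from dataclasses import dataclass


@dataclass
class VoicePart:
    """Hypothetical record for one labeled speech component."""
    phonemes: str    # two-syllable reading, e.g. "shia"
    prosody: str     # "flat", "rising", or "falling"
    waveform: list   # samples of the fixed-length recording piece


def label_pieces(pieces, transcript):
    """Attach the known (phonemes, prosody) labels to pieces in cut order."""
    if len(pieces) != len(transcript):
        raise ValueError("pieces and transcript entries must match one to one")
    return [VoicePart(phonemes, prosody, wave)
            for (phonemes, prosody), wave in zip(transcript, pieces)]


# First two units of the example sentence; the waveforms are placeholders.
transcript = [("shia", "rising"), ("wase", "rising")]
pieces = [[0.1, 0.2], [0.3, 0.4]]
for part in label_pieces(pieces, transcript):
    print(part.phonemes, part.prosody)
```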

[0033] The address in the speech component database 1 is determined from the phonemes and prosody assigned to each speech component 2. For example, the speech component 2 for the first unit, phonemes "shia" with prosody (rising type), is assigned address 741-2 (address assignment step S6). The speech component 2 is then stored at that address in the speech component database 1 (storing step S7).

[0034] Next, a speech synthesizer that uses the speech components created as described above is explained. As shown in FIG. 4, the speech synthesizer comprises the speech component database 1, a character information input unit 3, a component combination dictionary 4, a synthesized speech assembly unit 5, and a speech output unit 7. The speech component database 1 stores speech components 2 having the various phonemes and prosody described above.

[0035] The character information input unit 3 inputs the character information corresponding to the speech to be synthesized. Besides the character string, this information includes phrase boundaries and pause positions where needed. The component combination dictionary 4 describes, for each kind of input character information, the combination of speech components 2 needed to synthesize speech from the character information supplied by the character information input unit 3, and specifies each speech component 2 by its address in the speech component database 1.

[0036] The synthesized voice assembling unit 5 selects from the voice component database 1 the voice components 2 needed to generate the synthesized speech, based on the input character information (character string, clause boundaries, pause positions, and so on) supplied from the character information input unit 3, and connects them. The voice synthesizing unit 6 processes the voice components 2 connected by the synthesized voice assembling unit 5 so that they form smooth, continuous speech, and synthesizes them into synthesized speech.

[0037] The voice output unit 7 outputs the synthesized speech. Next, specific examples of this embodiment configured as above will be described.

(Specific Example 1) Synthesizing the insertion phrase portion. Character information input by the user is treated as an insertion phrase, and the fixed portion of the response sentence other than the insertion phrase is recorded in advance as recorded base-sentence speech. Synthesized speech for the insertion phrase portion is then generated from the character information input to the character information input unit 3, and this synthesized speech is embedded in the recorded base-sentence speech to output the response sentence as speech.

[0038] For example, in a telephone number inquiry, suppose that the prompt "お調べになりたい方のお住まいになる都道府県名か市区郡名を入力して下さい。" ("Please enter the name of the prefecture, or of the city, ward, or county, where the person you are looking up lives.") has been output as speech, the user has responded with a prefecture name or a city, ward, or county name, and the response sentence "○○○○○でよろしいですか？" ("Is ○○○○○ correct?") is to be output as speech to confirm that answer. In this case the fixed portion of the sentence, "でよろしいですか？", is recorded in advance as recorded base-sentence speech, and the "○○○○○" portion, as the insertion phrase of the recorded base sentence, is synthesized from the character information of the specific prefecture name or city, ward, or county name input by the user.
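The embedding of a synthesized insertion phrase into a pre-recorded base sentence might be sketched as below. The representation is an assumption made for illustration: waveforms are plain lists of samples, the response template is a list of segments, and a hypothetical `SLOT` marker shows where the insertion phrase goes. Nothing here is taken from the patent beyond the idea of filling the "○○○○○" position.

```python
SLOT = object()  # hypothetical marker for the insertion phrase position


def build_response(template, insertion):
    """Return one waveform: recorded segments with the insertion phrase in the slot."""
    response = []
    for segment in template:
        # Replace the slot with the synthesized insertion phrase; copy recorded
        # base-sentence segments unchanged.
        response.extend(insertion if segment is SLOT else segment)
    return response


# Placeholder samples standing in for real audio.
recorded_base = [0.5, 0.5, 0.5]      # recorded "でよろしいですか？"
synthesized_phrase = [0.1, 0.2]      # synthesized "○○○○○"

response = build_response([SLOT, recorded_base], synthesized_phrase)
print(response)
```

Because the slot is part of the template, the same function also covers responses where the insertion phrase sits in the middle of the recorded sentence.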

[0039] Meanwhile, for each kind of input character information that can fill the insertion phrase portion, the combination of voice components 2 needed to generate the synthesized speech, and the address of each of those voice components in the voice component database, are described in the insertion phrase component combination dictionary, which serves as the component combination dictionary 4, as shown in FIG. 7. In that figure, both the kanji character string and the kana character string are accepted as keywords for the input character information of an insertion phrase.

[0040] For example, when the user inputs the character string "岡山県" or "おかやまけん" (Okayama-ken, Okayama Prefecture) to the character information input unit 3, the synthesized voice assembling unit 5 refers to the insertion phrase component combination dictionary shown in FIG. 7 on the basis of this character string, and selects and connects the voice component for "おか" (oka) at address 286-2, the voice component for "やま" (yama) at address 2482-2, and the voice component for "けん" (ken) at address 610-2.
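The dictionary lookup in this example can be illustrated with a small sketch. The three addresses and the two keywords come from the text; the data layout, the function name `select_and_connect`, and the placeholder waveforms are assumptions made only for illustration, and FIG. 7 itself is not reproduced here.

```python
# Hypothetical excerpt of the insertion phrase component combination dictionary.
# Both the kanji and the kana spelling are keywords for the same address list.
OKAYAMA_KEN = ["286-2", "2482-2", "610-2"]  # おか, やま, けん
INSERTION_PHRASE_DICTIONARY = {
    "岡山県": OKAYAMA_KEN,
    "おかやまけん": OKAYAMA_KEN,
}

# Placeholder database: address -> waveform samples (real entries would be audio).
VOICE_COMPONENT_DATABASE = {
    "286-2": [0.1, 0.1],
    "2482-2": [0.2, 0.2],
    "610-2": [0.3, 0.3],
}


def select_and_connect(text, dictionary, database):
    """Look up the address list for the input text and concatenate the waveforms."""
    connected = []
    for address in dictionary[text]:
        connected.extend(database[address])
    return connected


from_kanji = select_and_connect("岡山県", INSERTION_PHRASE_DICTIONARY, VOICE_COMPONENT_DATABASE)
from_kana = select_and_connect("おかやまけん", INSERTION_PHRASE_DICTIONARY, VOICE_COMPONENT_DATABASE)
print(from_kanji == from_kana)
```

Sharing one address list between the two keywords keeps the kanji and kana entries from drifting apart.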

[0041] The voice synthesizing unit 6 synthesizes the connected voice components of the insertion phrase portion into smooth, continuous synthesized speech. The voice output unit 7 then embeds the synthesized speech of the insertion phrase portion in the recorded base-sentence speech and outputs the response sentence as speech. When the names of prefectures, cities, wards, and counties in Japan are synthesized as insertion phrases as in Specific Example 1, experiments have confirmed that 1175 kinds of voice components 2 are sufficient to generate synthesized speech for the roughly 4000 place names that actually exist.

(Specific Example 2) Outputting a fixed sentence.

[0042] In Specific Example 1 the words of the insertion phrase portion were synthesized; in Specific Example 2, fixed sentences are synthesized. In this case, for fixed sentences such as "お調べになりたい方のお住まいになる都道府県名を入力して下さい。" ("Please enter the name of the prefecture where the person you are looking up lives."), as also shown in Specific Example 1, or "ありがとうございました。" ("Thank you very much."), the combination of voice components 2 for synthesizing the sentence, with pauses placed at appropriate positions, and the address of each of those voice components in the voice component database are described in the fixed sentence component combination dictionary, which serves as the component combination dictionary 4, as shown in FIG. 8. The fixed sentence component combination dictionary shown in that figure accepts either the Japanese character string or the kana character string of a fixed sentence as a keyword.

[0043] The synthesized voice assembling unit 5 then refers to the fixed sentence component combination dictionary on the basis of the character information of the fixed sentence character string input by the user to the character information input unit 3, and selects and connects the voice components 2 corresponding to the fixed sentence. The voice synthesizing unit 6 synthesizes the connected voice components into synthesized speech, and the voice output unit 7 outputs this synthesized speech. Specifically, to output the fixed sentence "こちらは電話番号案内です。" ("This is telephone directory assistance.") as speech, the user inputs, for example, the kana character string "こちらはでんわばんごうあんないです。" to the character information input unit 3. The synthesized voice assembling unit 5 then refers to the fixed sentence component combination dictionary on the basis of this input character string and selects and connects voice components 2 from the voice component database 1: the voice component 2 at address 716-1 for the phoneme "こち" (kochi), the voice component 2 at address 2603-2 for the phoneme "らわ" (rawa), and so on.

[0044] In this case, to output a fixed sentence as speech, it is not necessary to input the entire character string of the fixed sentence as a Japanese or kana character string as described above. If the fixed sentences are numbered in advance and the fixed sentence number is used as a keyword, the synthesized speech of a fixed sentence can also be output simply by inputting its number to the character information input unit 3.

(Specific Example 3) Outputting an arbitrary sentence.

[0045] To synthesize an arbitrary sentence, the character string of the sentence, its clause boundaries, and its pause positions are input to the character information input unit 3. Meanwhile, the clause component combination dictionary, which serves as the component combination dictionary 4, describes for each clause the voice components 2 that assemble the clause and the address of each of those voice components in the voice component database, as shown in FIG. 9. In the clause component combination dictionary shown in that figure as well, keywords are set so that a clause is accepted either as a Japanese character string or as a kana character string. For example, the dictionary describes that the clause character string "私は" or "わたしは" (watashi wa, "I") is assembled from the voice component 2 at address 2901-2 for the phoneme "わた" (wata) and the voice component 2 at address 780-2 for the phoneme "しわ" (shiwa).

[0046] Creating this clause component combination dictionary means recording an enormous number of entries, because, as noted for the second conventional method, there are very many kinds of clauses. Unlike the second conventional method, however, it does not involve the laborious work of cutting out an enormous number of voice components while observing the speech waveform. Here, for each of these many clauses, only the addresses of the voice components 2 that assemble the clause are stored in a predetermined storage medium, so the dictionary is easy to create.

[0047] The synthesized voice assembling unit 5 refers to the clause component combination dictionary on the basis of the clauses recognized from the character string and clause boundaries of the arbitrary sentence input to the character information input unit 3, selects from the voice component database 1 the voice components 2 needed for speech synthesis for each clause, and connects them. For example, the arbitrary sentence "私は、学校へ行きます。" ("I go to school.") divides into clauses as "私は／学校へ／行きます。" (where "／" indicates a clause boundary), and a pause position falls between "私は" and "学校へ". The character strings of the clauses "私は", "学校へ", and "行きます", together with the pause position, are therefore input to the character information input unit 3 as character information.

[0048] For each clause, the synthesized voice assembling unit 5 then refers to the clause component combination dictionary, and selects from the voice component database 1 and connects the voice components 2 that assemble the clause. For the clause "私は", it selects from the voice component database 1 the voice component 2 at address 2901-2 for the phoneme "わた" (wata) and the voice component 2 at address 780-2 for the phoneme "しわ" (shiwa). For the clause "学校へ", it selects the voice component 2 at address 2520-2 for the phoneme "がっ" (ga with a geminate stop), the voice component 2 at address 623-1 for the phoneme "こー" (ko with a long vowel), and the voice component 2 at address 48-1 for the phoneme "え（無音）" (e followed by silence). Voice components 2 are selected in this way for each clause in turn, and the selected voice components 2 are connected.

[0049] In addition, the synthesized voice assembling unit 5 inserts a pause of, for example, 100 ms into the connected sequence of voice components 2 for all the clauses, in accordance with the specified pause positions. In the example sentence above, a pause is inserted between the phonemes "わた" "しわ" and the phonemes "がっ" "こう" because a punctuation mark falls there. The voice synthesizing unit 6 synthesizes the connected sequence of voice components 2, with the pause inserted, into smoothly continuous synthesized speech, and the voice output unit 7 outputs this synthesized speech.
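The clause-by-clause assembly with an inserted pause could look roughly like the following sketch. The clause entries and the 100 ms pause come from the text; the sample rate, the use of silence (zero samples) for the pause, the function name `assemble_sentence`, and the placeholder waveforms are all assumptions. The clause "行きます" is left out because its addresses are not given in this passage.

```python
SAMPLE_RATE_HZ = 8000  # assumption: the patent does not state a sample rate
PAUSE_MS = 100         # example pause length given in the text

# Hypothetical excerpt of the clause component combination dictionary.
CLAUSE_DICTIONARY = {
    "私は": ["2901-2", "780-2"],             # わた, しわ
    "わたしは": ["2901-2", "780-2"],
    "学校へ": ["2520-2", "623-1", "48-1"],   # がっ, こー, え (silence)
}


def assemble_sentence(clauses, pause_after, dictionary, database,
                      sample_rate_hz=SAMPLE_RATE_HZ):
    """Concatenate the components of each clause, adding silence at pause positions.

    pause_after is a set of clause indices after which a pause is inserted.
    """
    pause = [0.0] * (sample_rate_hz * PAUSE_MS // 1000)
    connected = []
    for index, clause in enumerate(clauses):
        for address in dictionary[clause]:
            connected.extend(database[address])
        if index in pause_after:
            connected.extend(pause)
    return connected


# Placeholder database: every component is four non-zero samples.
database = {a: [1.0] * 4 for a in ["2901-2", "780-2", "2520-2", "623-1", "48-1"]}
waveform = assemble_sentence(["私は", "学校へ"], {0}, CLAUSE_DICTIONARY, database)
print(len(waveform))
```

The smoothing performed by the voice synthesizing unit 6 is not modelled; the sketch stops at the connected sequence that unit would receive.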

[0050] As described above, according to this embodiment, the recording medium 8 is played back, the speech waveform of the first two syllables, which form the basis of the human utterance rhythm, is observed, and the playback time of that portion is set as the unit time. Voice components 2 of two syllables each can then be cut out simply by cutting the playback signal of the recording medium 8 at every unit time. This is far easier than the conventional work of cutting out every voice component while playing back the recording medium and observing the speech waveform.
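The fixed-interval cutting summarized here can be sketched in a few lines. The 200 ms unit time and the labels "しあ" and "わせ" (both rising) follow the example given in the abstract; the sample rate, the function names `cut_into_pieces` and `label_pieces`, and the dictionary layout of a labelled component are assumptions made for illustration.

```python
def cut_into_pieces(samples, sample_rate_hz, unit_time_ms):
    """Cut the recording into consecutive pieces of one unit time each."""
    step = sample_rate_hz * unit_time_ms // 1000
    if step <= 0:
        raise ValueError("unit time is shorter than one sample")
    return [samples[i:i + step] for i in range(0, len(samples), step)]


def label_pieces(pieces, labels):
    """Attach the known phoneme and prosody of the read sentence, in cutting order."""
    return [
        {"phoneme": phoneme, "prosody": prosody, "waveform": waveform}
        for waveform, (phoneme, prosody) in zip(pieces, labels)
    ]


# Placeholder recording: 400 ms of samples at an assumed 8000 Hz sample rate.
recording = [0.0] * 3200
pieces = cut_into_pieces(recording, sample_rate_hz=8000, unit_time_ms=200)
components = label_pieces(pieces, [("しあ", "rising"), ("わせ", "rising")])
print(len(pieces), components[0]["phoneme"])
```

The labelling works without inspecting the waveform only because the sentence is read at a constant speed, so that the n-th piece is assumed to hold the n-th pair of syllables.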

[0051] Moreover, the number of voice components 2 does not exceed 100,000 kinds as in the second and third conventional examples; several thousand kinds suffice, which further reduces the effort of creating the voice components 2. Furthermore, according to this embodiment, the unit from which synthesized speech is generated is the two-syllable voice component 2, which embodies the natural two-beat rhythm of human utterance. Synthesized speech assembled from these voice components 2 therefore also carries the natural human utterance rhythm and is easy to listen to.

[0052] The present invention is not limited to the above embodiment, and various modes of implementation are possible. For example, in the specific examples of the above embodiment, both Japanese character strings containing kanji and kana character strings are accepted as input, but the keywords may instead be set so that only kana character strings are accepted.

[0053] Also, in this embodiment the voice components 2 stored in the voice component database 1 are given three kinds of intonation change, but a richer variety of intonation changes may be provided.

[0054]

[Effects of the Invention] As is clear from the above description, with the voice component creation method of this invention, voice components can be created by the simple method of cutting a recording medium, on which human speech has been recorded at a constant speed, at every unit time. Compared with the conventional creation method of cutting out every voice component while observing the speech waveform, creating the voice components takes far less effort.

[0055] In addition, because fewer voice components are needed, even less creation effort is required. Furthermore, because the unit time used for cutting the recording medium is based on the human utterance rhythm, the voice components cut out at each unit time carry the natural human utterance rhythm. The synthesized speech generated by combining these voice components therefore also carries that natural rhythm and is easy to listen to.

[Brief description of drawings]

FIG. 1 is a diagram showing the processing procedure of one embodiment of the voice component creation method according to the invention.

FIG. 2 is a diagram showing an example of the phoneme matrix of the voice component database.

FIG. 3 is a diagram showing three kinds of prosody change.

FIG. 4 is a block diagram showing an example of the functional configuration of a speech synthesis apparatus to which the speech synthesis method of this invention is applied.

FIG. 5 is a diagram showing an example of the overall picture of the voice component database according to this invention.

FIG. 6 is an explanatory diagram showing the method of creating voice components.

FIG. 7 is a diagram showing an example of the insertion phrase component combination dictionary.

FIG. 8 is a diagram showing an example of the fixed sentence component combination dictionary.

FIG. 9 is a diagram showing an example of the clause component combination dictionary.

Claims

(57) [Claims]

1. A voice component creation method comprising: recording speech in which a sentence whose character string, phoneme sequence, and prosody are known is read aloud at a constant speed; setting as a unit time the playback time of the initial portion of the recorded speech that spans the number of syllables in one period of the human utterance rhythm; cutting the recorded speech in order from the beginning at every unit time to create recording pieces; and creating voice components by assigning to each of these recording pieces, in the order in which they were cut out, the corresponding character string and its prosody.

2. A voice component database storing various speech waveforms, wherein every speech waveform has the same number of constituent syllables and the same length, and each speech waveform is stored at a different address according to its constituent phonemes and its prosody type.

3. A speech synthesis method that uses the voice component database according to claim 2, a recorded base sentence holding unit that holds recorded base-sentence speech for the portion of a response sentence other than an insertion phrase, and an insertion phrase component combination dictionary describing the addresses in the voice component database of the voice components that assemble the insertion phrase, the method comprising: a character information input step of inputting a character string to be applied to the insertion phrase; a synthesized voice assembling step of referring to the insertion phrase component combination dictionary and selecting from the voice component database and connecting the voice components corresponding to the input character string; a speech synthesis step of synthesizing the connected voice components into synthesized speech for the insertion phrase portion; and an output step of embedding the synthesized speech of the insertion phrase portion in the recorded base-sentence speech and outputting the response sentence as speech.

4. A speech synthesis method that uses the voice component database according to claim 2 and a fixed sentence component combination dictionary describing, for various fixed sentences, the addresses in the voice component database of the voice components that assemble each fixed sentence, the method comprising: a character information input step of inputting the character string of a fixed sentence; a synthesized voice assembling step of referring to the fixed sentence component combination dictionary and selecting from the voice component database and connecting the voice components corresponding to the input character string; a speech synthesis step of synthesizing the connected voice components into synthesized speech; and an output step of outputting the synthesized speech.

5. A speech synthesis method that uses the voice component database according to claim 2 and a clause component combination dictionary describing, for various clauses, the addresses in the voice component database of the voice components that assemble each clause, the method comprising: a character information input step of inputting an arbitrary character string together with its clause boundaries and pause positions; a synthesized voice assembling step of referring to the clause component combination dictionary and selecting from the voice component database and connecting the voice components corresponding to the input arbitrary character string; a speech synthesis step of synthesizing the connected voice components into synthesized speech; and an output step of outputting the synthesized speech.