JPH10207488A - Method of making voice parts, voice part data base, and method of voice synthesis

Info

Publication number: JPH10207488A
Application number: JP9007304A
Authority: JP
Inventors: Masanobu Higashida (東田 正信)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-01-20
Filing date: 1997-01-20
Publication date: 1998-08-07
Anticipated expiration: 2017-01-20
Also published as: JP3421963B2

Abstract

PROBLEM TO BE SOLVED: To create voice parts by a simple method and with a small number of parts, so that the resulting synthesized speech is easy to listen to. SOLUTION: A sentence whose character string has a known phoneme sequence and prosody, 'SHIAWASE HA ARUI-...', is read aloud at a constant speed and recorded on a medium 8. The time needed to reproduce the first two syllables 'SHIA' is measured. The reproduced speech on the medium 8 is then cut successively at intervals of this time length (for instance 200 msec) into two-syllable speech waveforms 8a, 8b, .... The character string of the recorded sentence is matched to these waveforms 8a, 8b, ..., which are labeled with the phoneme 'SHIA' and a prosody (rising type), the phoneme 'WASE' and a prosody (rising type), and so on, and each is stored at a different address of a database.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a voice part creation method that creates, from continuously uttered human speech, the speech waveforms used for speech synthesis as voice parts; to a voice part database that stores the various speech waveforms used for speech synthesis; and to a speech synthesis method that uses this voice part database to generate synthesized speech corresponding to a Japanese reading sentence.

[0002]

[Prior Art] Recording-and-editing speech synthesis methods that use human speech as the synthesis unit have conventionally been known in the following first to third forms. In the first method, voice parts for every one-syllable unit of Japanese are cut out of recorded human speech: the fifty basic sounds such as "あ" (a), "い" (i), "う" (u), "え" (e), "お" (o), ..., contracted sounds such as "きゃ" (kya), "しゃ" (sha), ..., voiced sounds, semi-voiced sounds, and so on. These parts are registered in a database or the like and are combined to synthesize a Japanese reading sentence. For example, as in に／ほ／ん (ni/ho/n), the voice parts "に", "ほ", and "ん" are combined one syllable at a time to synthesize the sentence.

[0003] However, when a Japanese reading sentence contains long vowels or geminate consonants, as in "とーきょー" (Tokyo) or "ろっぽんぎ" (Roppongi), speech synthesized by the first method sounds conspicuously unnatural. In general, therefore, voice parts of two or more syllables are registered in a database, as in the second and third methods described below, and the parts corresponding to the Japanese reading sentence are retrieved to generate its synthesized speech.

[0004] In the second method, a voice part is cut out of recorded human speech for each multi-syllable phrase (bunsetsu) and registered in a database; the parts corresponding to a Japanese reading sentence are then retrieved phrase by phrase and connected to synthesize the sentence. For these phrase-level parts, the database stores versions of the same phoneme sequence (reading) with various prosodic variations. For example, the parts for phrases that combine the noun "大阪" (Osaka) with a particle are the phoneme sequences "大阪は", "大阪が", "大阪に", "大阪へ", and "大阪を", each with various prosodic variations added, and all of these are stored in the database.

[0005] When text analysis divides the sentence to be synthesized into phrases, as in 私は／明日／大阪へ／行きます。 ("I will go to Osaka tomorrow"; ／ marks a phrase boundary), the voice parts matching the phoneme sequence and prosody sequence of each of the phrases "私は", "明日", "大阪へ", and "行きます" are retrieved from the database and connected to synthesize the speech.

[0006] In the third method, a Japanese reading sentence is synthesized by combining voice parts, cut out of recorded human speech and registered in advance, that correspond to the words of every part of speech, including the conjugated forms of adjectives and verbs. As in the second method, the database stores voice parts with various prosodic variations for the same phoneme sequence.

[0007] When text analysis breaks the sentence to be synthesized into parts of speech, as in 私／は／明日／大阪／へ／行き／ます。 (／ marks a part-of-speech boundary), the voice parts having the phoneme sequence and prosody sequence of each word are retrieved from the database and connected to synthesize the Japanese reading sentence.

[0008]

[Problems to Be Solved by the Invention] Cutting voice parts out of recorded human speech is laborious, because the portions with the desired phonemes and prosody must be selected while looking at the speech waveform. A method of creating voice parts that keeps their number as small as possible is therefore desired. The conventional second method, however, requires voice parts with the various phoneme and prosody sequences of every phrase. Even counting only the combinations of nouns with particles and of verbs with auxiliary verbs, an enormous number of voice parts must be created. Unless the Japanese reading sentences to be synthesized are restricted, creating the parts takes so much effort that the method is difficult to put to practical use.

[0009] The third method needs fewer voice parts than the second. Even so, once the conjugated forms of inflecting words such as adjectives and verbs are counted in addition to the words of every part of speech, the number of voice parts required far exceeds the number of entries in a Japanese dictionary (specifically, 100,000 words or more). Creating the voice parts is therefore as laborious as in the second method.

[0010] An object of the present invention is to provide a voice part creation method that requires only a small number of voice parts, takes little effort, and yields voice parts that capture the rhythm of human utterance; a voice part database in which those voice parts are stored; and a method of synthesizing speech using that voice part database.

[0011]

[Means for Solving the Problems] The principal feature of the voice part creation method of the present invention is as follows. A sentence whose character string, phonemes, and prosody are known is read aloud at a constant speed and recorded. In the reproduced sound of that recording, the reproduction time of the first group of beats (syllables) that forms the basis of the human utterance rhythm, that is, of as many syllables as make up one period of the utterance rhythm, is set as a unit time. The recorded speech is cut at every unit time, in order from the beginning, to create recording pieces. Each recording piece is then given, in the order in which it was cut out, the corresponding character string of the sentence and its prosody, which yields the voice parts.

[0012] In the voice part database of the present invention, the stored speech waveforms all consist of the same number of syllables and all have the same length, and each waveform is stored at a different address according to its constituent phonemes and prosody type. The first speech synthesis method of the present invention is provided with the voice part database of the present invention, recorded base-sentence holding means in which the speech of the base sentence, that is, the response sentence without its inserted phrase, is recorded, and an inserted-phrase part combination dictionary that describes the storage addresses in the voice part database of the voice parts that make up the inserted phrase. The method comprises a character information input step of inputting a character string to fill the inserted phrase; a synthesized speech assembly step of selecting from the voice part database, and connecting, the voice parts corresponding to the input character string by referring to the inserted-phrase part combination dictionary; a speech synthesis step of synthesizing the connected voice parts into the synthesized speech of the inserted phrase; and an output step of fitting the synthesized speech of the inserted phrase into the recorded base-sentence speech and outputting the speech of the response sentence.

[0013] The second speech synthesis method of the present invention is provided with the voice part database of the present invention and a fixed-sentence part combination dictionary that describes, for each of various fixed sentences, the storage addresses in the voice part database of the voice parts that make up that sentence. The method comprises a character information input step of inputting the character string of a fixed sentence; a synthesized speech assembly step of selecting from the voice part database, and connecting, the voice parts corresponding to the input character string by referring to the fixed-sentence part combination dictionary; a speech synthesis step of synthesizing the connected voice parts into synthesized speech; and an output step of outputting the synthesized speech.

[0014] The third speech synthesis method of the present invention is provided with the voice part database of the present invention and a phrase part combination dictionary that describes, for each of various phrases, the storage addresses in the voice part database of the voice parts that make up that phrase. The method comprises a character information input step of inputting an arbitrary character string together with its phrase boundaries and pause positions; a character string synthesized speech assembly step of selecting from the voice part database, and connecting, the voice parts corresponding to the input character string by referring to the phrase part combination dictionary; a speech synthesis step of synthesizing the connected voice parts into synthesized speech; and an output step of outputting the synthesized speech.

[0015] In this way, the present invention creates the voice parts needed to assemble synthesized speech by cutting human speech recorded at a constant speed into portions of one unit time each, the unit time being based on the rhythm of human utterance. The voice parts thus carry the human utterance rhythm, the synthesized speech generated from them carries it as well, and the voice parts are easy to create.

[0016]

[Operation] In the present invention, a text that contains various phonemes and prosodies and whose character string is known is first read aloud at a constant speed and recorded on a recording medium. The speech on the medium is reproduced and, while the speech waveform is observed, the reproduction time corresponding to the first group of beats (syllables) that forms the basis of the human utterance rhythm is measured. In other words, the time of one period of the human utterance rhythm, counted in syllables, is measured from the beginning of the reproduced speech, and this measured time is set as the unit time. The recorded speech is then cut simply and mechanically at every unit time from its beginning, producing a number of recording pieces, each a speech waveform of the same fixed length.
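The mechanical cutting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cut_into_pieces`, the 8 kHz sample rate, the 200 ms unit time, and the decision to drop a trailing remainder shorter than one unit are all hypothetical choices, not details taken from the text.

```python
def cut_into_pieces(samples, sample_rate_hz, unit_ms):
    """Cut a recording into consecutive pieces of one unit time each."""
    unit_len = sample_rate_hz * unit_ms // 1000  # samples per unit time
    if unit_len <= 0:
        raise ValueError("unit time is shorter than one sample")
    # Keep only whole pieces; dropping a shorter trailing remainder is a
    # choice of this sketch, not something the text specifies.
    n_pieces = len(samples) // unit_len
    return [samples[i * unit_len:(i + 1) * unit_len] for i in range(n_pieces)]


# One second of dummy audio at 8 kHz, cut every 200 ms.
recording = list(range(8000))
pieces = cut_into_pieces(recording, sample_rate_hz=8000, unit_ms=200)
print(len(pieces), len(pieces[0]))  # 5 1600
```

Because every cut falls at a fixed multiple of the unit time, no per-piece inspection of the waveform is needed once the unit time has been measured.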

[0017] Each recording piece (speech waveform) is given the phonemes and prosody of the character string it contains, by referring to the text in the order in which the pieces were cut. Recording pieces with the various phonemes and prosodies are collected and used as voice parts, which are stored in the voice part database. The voice parts stored in this database all have the same number of syllables and the same length, and are stored at different addresses according to their combination of phonemes and prosody type. As described below, the appropriate voice parts are selected from this database and connected, according to the input character string and phrase boundaries, to generate synthesized speech.

[0018] For example, to output as speech a response sentence that contains an inserted phrase, the speech of the sentence without the inserted phrase is recorded in advance as the recorded base-sentence speech. A character string to fill the inserted phrase is input. Based on that string, the inserted-phrase part combination dictionary is consulted, and the voice parts that make up the inserted phrase are selected from the voice part database and connected. The connected voice parts are synthesized into synthesized speech, which is then fitted into the already recorded base-sentence speech, and the response sentence is output as speech.

[0019] To output one of several fixed sentences as speech, the storage addresses in the voice part database of the voice parts that make up each fixed sentence are described beforehand in the fixed-sentence part combination dictionary. Using the input fixed-sentence character string, the dictionary is consulted, the voice parts that make up the speech of the fixed sentence are selected from the voice part database and connected, the connected voice parts are synthesized into synthesized speech, and the synthesized speech of the fixed sentence is output.

[0020] To synthesize an arbitrary sentence, the arbitrary character string to be synthesized is input together with character information giving its phrase boundaries and pause positions. Based on this input, the phrase part combination dictionary is consulted and, for each phrase of the sentence, the voice parts that make up the phrase are selected from the voice part database and connected. Eventually the voice parts for all phrases, that is, for the whole character string, have been selected and connected. Pauses are added to the connected voice parts, the result is synthesized into synthesized speech, and the synthesized speech of the arbitrary character string is output.
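The phrase-by-phrase assembly with pauses might look roughly like the following Python sketch. All data here is made up for illustration: the addresses, the dummy waveforms, the phrase keys, the four-sample part length, and the default pause of one syllable are assumptions, and the names `database`, `phrase_dictionary`, and `synthesize` are hypothetical.

```python
UNIT_LEN = 4  # samples per two-syllable part in this toy example

# Toy voice part database: address -> fixed-length waveform (dummy values).
database = {
    "10-1": [1, 1, 1, 1],
    "20-2": [2, 2, 2, 2],
    "30-3": [3, 3, 3, 3],
}

# Toy phrase part combination dictionary: phrase -> addresses of its parts.
phrase_dictionary = {
    "phrase_a": ["10-1", "20-2"],
    "phrase_b": ["30-3"],
}


def synthesize(phrases, pause_after, pause_syllables=1):
    """Connect the parts of each phrase, adding silence at pause positions."""
    syllable_len = UNIT_LEN // 2  # one syllable is half of a two-syllable part
    output = []
    for index, phrase in enumerate(phrases):
        for address in phrase_dictionary[phrase]:
            output.extend(database[address])
        if index in pause_after:
            # A pause is modeled as silence lasting a whole number of syllables.
            output.extend([0] * (syllable_len * pause_syllables))
    return output


speech = synthesize(["phrase_a", "phrase_b"], pause_after={0})
print(speech)  # [1, 1, 1, 1, 2, 2, 2, 2, 0, 0, 3, 3, 3, 3]
```

The dictionary holds only addresses, so the same stored waveform can be reused by any phrase that contains that two-syllable unit.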

[0021]

[Embodiments of the Invention] FIG. 1 is a flowchart showing one embodiment of the voice part creation method according to the present invention. The method of this embodiment comprises a database address setting step S1, a constant-speed recording step S2, a unit time setting step S3, a recording cutting step S4, a recording piece phoneme and prosody assignment step S5, an address assignment step S6, and a storage step S7.

[0022] In the database address setting step S1, the addresses of the voice part database 1, which stores the voice parts that serve as the units for generating synthesized speech, are set in advance. Each voice part to be created is a two-syllable speech waveform of fixed length that has the phonemes and prosody needed for speech synthesis. FIG. 2 shows the phoneme matrix of the two-syllable voice parts 2, which forms part of the voice part database 1. Both the rows and the columns list the fifty basic Japanese sounds together with the moraic nasal, voiced sounds, semi-voiced sounds, contracted sounds, geminate consonants, long vowels, and silence; the rows give the first of the two syllables and the columns the second. In the figure the phonemes are written in kana. Not every element of this matrix is needed as a unit for speech synthesis. For example, of the phonemes "がっ" and "こう" in "学校" (gakkō, "school"), the "こう" part can be adequately replaced by "こー". As a result, a few thousand kinds of voice parts 2 suffice to synthesize Japanese reading sentences.

[0023] In the database address setting step S1, addresses are assigned to the elements of this phoneme matrix in order, with "ああ" at address 1, "あい" at address 2, and so on. In addition, each two-syllable element of the matrix is given the three prosodic variations shown in FIG. 3: (a) flat type, (b) rising type, and (c) falling type. In the flat type, there is no change in intonation between the first and second sounds; in the rising type, the second sound is accented more strongly than the first; and in the falling type, the second sound is accented more weakly than the first.

[0024] FIG. 4 shows an overall view of the voice part database 1. Part A of FIG. 4 shows how the voice parts 2 of the phoneme matrix with flat-type prosody are stored, part B shows the same for rising-type prosody, and part C for falling-type prosody.

[0025] As shown in the figure, each voice part 2 is given a branch address determined by its prosody type, in addition to its address in the phoneme matrix. The branch address is 1 for the flat type, 2 for the rising type, and 3 for the falling type. For example, address 1-1 of the voice parts 2 is the flat type of the phoneme "ああ", in which there is no change in intonation between the first sound "あ" and the second sound "あ". Address 2-3 is the falling type of the phoneme "あい", in which the second sound "い" is accented more weakly than the first sound "あ".
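The two-level address (matrix address plus branch address) can be illustrated with a short Python sketch. The five-sound inventory below is a hypothetical stand-in for the full matrix of FIG. 2, and the row-major numbering is an assumption inferred from "ああ" being address 1 and "あい" address 2; with the real inventory the numbers for other elements would differ. The names `Prosody`, `SOUNDS`, `matrix_address`, and `part_address` are invented for this example.

```python
from enum import IntEnum


class Prosody(IntEnum):
    FLAT = 1     # no change in intonation between the two sounds
    RISING = 2   # second sound accented more strongly than the first
    FALLING = 3  # second sound accented more weakly than the first


# Hypothetical, heavily shortened inventory; the real matrix also covers the
# remaining kana, voiced sounds, long vowels, silence, and so on.
SOUNDS = ["あ", "い", "う", "え", "お"]


def matrix_address(first, second):
    """Row-major, 1-based address of a two-syllable matrix element."""
    row = SOUNDS.index(first)    # rows hold the first sound
    col = SOUNDS.index(second)   # columns hold the second sound
    return row * len(SOUNDS) + col + 1


def part_address(first, second, prosody):
    """Full address in the form '<matrix address>-<branch address>'."""
    return f"{matrix_address(first, second)}-{int(prosody)}"


print(part_address("あ", "あ", Prosody.FLAT))     # 1-1
print(part_address("あ", "い", Prosody.FALLING))  # 2-3
```

Keeping the prosody type as a separate branch number means one matrix address identifies the phoneme pair regardless of how it is intoned.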

[0026] In the constant-speed recording step S2, text whose phoneme and prosody sequences are known is recorded at a constant speed, as shown in FIG. 6. Specifically, a number of ordinary sentences are prepared whose phoneme and prosody sequences are known and which, when divided every two syllables, yield every two-syllable unit, with its phonemes and prosody, that is needed for speech synthesis. The division into two syllables reflects the view that natural human biological and utterance rhythms are in duple time, just as marches and cheering songs are. Building this duple rhythm into the voice parts 2 lets speech synthesized from them keep a natural rhythm and sound natural and easy to follow to the listener. In other words, the basic number of beats used to capture the human utterance rhythm (one beat corresponds to one syllable) is two syllables.

[0027] A speaker 9 reads the text at a constant speed, and the reading is recorded on a recording medium 8. The speaker 9 is preferably someone trained to read at a constant speed. In the unit time setting step S3, the unit time is set for cutting out of the recording the recording pieces 8a, 8b, ..., fixed-length speech waveforms from which the voice parts 2 are made. That is, the recording medium 8 is reproduced and, while the speech waveform is observed, the reproduction time that contains the first two syllables (the number of beats of the utterance rhythm) is measured; this time is set as the unit time (for example, 200 ms).

[0028] In the recording cutting step S4, the speech waveform recorded on the recording medium 8 is simply cut, in order, at every unit time thus set, giving recording pieces 8a, 8b, ..., each a speech waveform of fixed length. Each of these pieces carries the human utterance rhythm, which is based on two syllables. In the recording piece phoneme and prosody assignment step S5, phonemes and prosody are assigned to the recording pieces 8a, 8b, .... Because the phoneme and prosody sequences of the text recorded on the medium 8 are known, as noted above, the phonemes and prosody of each piece can be read off easily by matching the sequences of the text to the pieces in the order in which they were cut out.

[0029] A complete set of recording pieces with the phonemes and prosodies needed to generate synthesized speech is assembled in this way, and each piece becomes a voice part 2. In the address assignment step S6, the address in the voice part database 1 at which each voice part 2 is to be stored is determined from the phoneme and prosody information assigned to it.

[0030] In the storage step S7, each voice part 2 is placed, according to its address information, at its address in the voice part database 1 shown in FIGS. 2 and 4. A concrete example of the voice part creation method described above is now explained with reference to FIG. 6. As shown in the figure, the speaker 9 records the text "幸せは歩いてこない。だから、歩いて行くんだね。" ("Happiness does not come walking to us, so we walk toward it.") at a constant speed. The phoneme sequence of this text is "しあわせわあるいてこない。だから、あるいてゆくんだね", and its character string information is known, for example that the prosody of "しあ" is the rising type and that of "わせ" is the rising type. The text shown is only part of what the speaker 9 is asked to read. Because the speaker 9 reads at a constant speed, each pause placed at an appropriate point in the text has a length of n syllables, where n is an integer (constant-speed recording step S2).
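Because pauses last a whole number of syllables, they fit on the same syllable grid as the speech, and silence is one of the entries of the phoneme matrix. The Python sketch below shows how a syllable sequence with pauses could be grouped into two-syllable units. The `"_"` silence marker, the `("pause", n)` token format, the one-syllable pause length in the example, and the padding of an odd tail with silence are all assumptions made for illustration.

```python
SILENCE = "_"  # hypothetical marker for one syllable of silence


def expand_pauses(tokens):
    """Replace each ('pause', n) token by n silent syllables."""
    syllables = []
    for token in tokens:
        if isinstance(token, tuple) and token[0] == "pause":
            syllables.extend([SILENCE] * token[1])
        else:
            syllables.append(token)
    return syllables


def group_in_twos(syllables):
    """Group syllables into two-syllable units, padding an odd tail with silence."""
    if len(syllables) % 2:
        syllables = syllables + [SILENCE]
    return ["".join(syllables[i:i + 2]) for i in range(0, len(syllables), 2)]


# Opening of the example text, with an assumed pause of one syllable.
tokens = ["し", "あ", "わ", "せ", "わ", ("pause", 1), "あ", "る", "い", "て"]
print(group_in_twos(expand_pauses(tokens)))
# ['しあ', 'わせ', 'わ_', 'ある', 'いて']
```

A pause that was not a whole number of syllables would shift every later unit off the grid, which is why constant-speed reading matters for the fixed-interval cut.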

[0031] The recording medium 8 then holds "しあわせわ (pause) あるいてこない (pause) だから (pause) あるいていくんだね". The time taken to reproduce the recording piece 8a, which contains the first two syllables "しあ" of the recorded text, is measured. If this time is, for example, 200 ms, then 200 ms is set as the unit time (unit time setting step S3).

[0032] Next, the reproduced sound of the recording medium 8 is cut successively at every unit time (200 ms), giving the two-syllable recording pieces 8a, 8b, ... (recording cutting step S4). The character string of the recorded text is matched to these pieces, and phonemes and prosody are assigned to them in order: phoneme "しあ" with prosody (rising type), phoneme "わせ" with prosody (rising type), and so on. A recording piece to which phonemes and prosody have been assigned is a voice part 2 (recording piece phoneme and prosody assignment step S5).
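Assigning labels in cutting order amounts to pairing two lists position by position. A minimal Python sketch follows; the dictionary layout, the function name `label_pieces`, and the dummy waveforms standing in for pieces 8a and 8b are assumptions, while the two labels are the ones given in the text.

```python
def label_pieces(pieces, labels):
    """Attach (phoneme, prosody) labels to recording pieces in cutting order."""
    if len(pieces) != len(labels):
        raise ValueError("number of pieces and labels must match")
    return [
        {"phoneme": phoneme, "prosody": prosody, "waveform": piece}
        for piece, (phoneme, prosody) in zip(pieces, labels)
    ]


# Dummy waveforms standing in for recording pieces 8a and 8b.
pieces = [[0.0] * 1600, [0.1] * 1600]
labels = [("しあ", "rising"), ("わせ", "rising")]

parts = label_pieces(pieces, labels)
print([(p["phoneme"], p["prosody"]) for p in parts])
# [('しあ', 'rising'), ('わせ', 'rising')]
```

The length check guards against a recording that was cut into more or fewer pieces than the known text predicts, which would otherwise shift every label silently.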

[0033] The address in the voice part database 1 is determined from the phonemes and prosody assigned to the voice part 2. For example, the voice part 2 for the first piece, phoneme "しあ" with prosody (rising type), is assigned address 741-2 (address assignment step S6). The voice part 2 with its assigned address is then stored at that address in the voice part database 1 (storage step S7).

[0034] Next, a speech synthesizer that uses the voice parts created as described above is explained. As shown in FIG. 5, this speech synthesizer comprises the voice part database 1, a character information input unit 3, a part combination dictionary 4, a synthesized speech assembly unit 5, and a speech output unit 7. The voice part database 1 stores voice parts 2 with the same variety of phonemes and prosodies as described above.

[0035] The character information input unit 3 receives the character information corresponding to the speech to be synthesized. In addition to the character string, this information includes phrase boundaries and pause positions where needed. The part combination dictionary 4 describes, for the various kinds of input character information, the combinations of voice parts 2 needed to synthesize speech from the information supplied by the character information input unit 3, and specifies the address of each voice part 2 in the voice part database 1.

[0036] The synthesized speech assembly unit 5 selects from the voice component database 1 the voice components 2 needed to generate the synthesized speech, based on the input character information (character string, phrase boundaries, pause positions, and so on) received from the character information input unit 3, and connects them. The speech synthesis unit 6 processes the voice components 2 connected by the synthesized speech assembly unit 5 so that they form smooth continuous speech, and produces the result as synthesized speech.

[0037] The voice output unit 7 outputs the synthesized speech. Specific examples of this embodiment are described next.

(Specific Example 1) Synthesizing the insertion phrase.

The character information entered by the user is treated as an insertion phrase, and the fixed portion of the response sentence other than the insertion phrase is recorded in advance as recorded base-sentence speech. Synthesized speech for the insertion phrase is then generated from the character information entered at the character information input unit 3 and is fitted into the recorded base-sentence speech to output the response sentence as speech.

[0038] Consider, for example, a telephone number inquiry. The system first outputs the spoken prompt "Please enter the name of the prefecture or the city, ward, or county where the person you are looking for lives," and the user answers with such a name. To confirm the answer, the system needs to output the response sentence "○○○○○, is that correct?" (○○○○○でよろしいですか？). In this case the fixed portion ", is that correct?" (でよろしいですか？) is recorded in advance as the recorded base-sentence speech, and the "○○○○○" portion is synthesized, as the insertion phrase of the recorded base sentence, from the character information of the specific prefecture, city, ward, or county name entered by the user.

[0039] For each kind of input character information that can fill the insertion phrase, the combination of voice components 2 needed to generate the synthesized speech and the address of each of those components in the voice component database are described, as shown in FIG. 7, in an insertion phrase component combination dictionary, which serves as the component combination dictionary 4. In FIG. 7, both kanji character strings and kana character strings are accepted as keywords for the input character information of an insertion phrase.

[0040] For example, when the user enters the character string "岡山県" (Okayama Prefecture) or "おかやまけん" (okayamaken) at the character information input unit 3, the synthesized speech assembly unit 5 looks the string up in the insertion phrase component combination dictionary of FIG. 7 and selects and connects the voice component for "oka" at address 286-2, the voice component for "yama" at address 2482-2, and the voice component for "ken" at address 610-2.
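
The dictionary lookup in this example can be sketched as a mapping in which the kanji and kana keywords point to the same ordered list of addresses. The three addresses are taken from the text; the names `INSERTION_PHRASE_DICT` and `select_components` are hypothetical, and the actual layout of the dictionary in FIG. 7 is not reproduced here.

```python
from typing import Dict, List, Tuple

# Addresses for "oka", "yama", "ken" as given in the text.
_OKAYAMA: Tuple[str, ...] = ("286-2", "2482-2", "610-2")

# Hypothetical in-memory form of the insertion phrase component
# combination dictionary: both spellings are registered as keywords.
INSERTION_PHRASE_DICT: Dict[str, Tuple[str, ...]] = {
    "岡山県": _OKAYAMA,
    "おかやまけん": _OKAYAMA,
}


def select_components(text: str) -> List[str]:
    """Return the addresses to fetch, in connection order."""
    try:
        return list(INSERTION_PHRASE_DICT[text])
    except KeyError:
        raise KeyError(f"no dictionary entry for {text!r}") from None


kanji_result = select_components("岡山県")
kana_result = select_components("おかやまけん")
```

Both spellings resolve to the same address sequence, so the assembly step does not need to know which form the user typed.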

[0041] The speech synthesis unit 6 synthesizes the connected voice components of the insertion phrase into smooth continuous speech. The voice output unit 7 then fits this synthesized speech into the recorded base-sentence speech and outputs the response sentence as speech. When the place names of prefectures, cities, wards, and counties in Japan are synthesized as insertion phrases, as in Specific Example 1, experiments have confirmed that 1,175 kinds of voice components 2 are sufficient to synthesize the approximately 4,000 place names that actually exist.

(Specific Example 2) Outputting a fixed sentence.

[0042] Specific Example 1 synthesized only the words of the insertion phrase, whereas Specific Example 2 synthesizes an entire fixed sentence. For fixed sentences such as "Please enter the name of the prefecture where the person you are looking for lives," which also appeared in Specific Example 1, or "Thank you very much," the combination of voice components 2 needed to synthesize the sentence, with pauses placed at appropriate positions, and the address of each of those components in the voice component database are described in a fixed sentence component combination dictionary, shown in FIG. 8, which serves as the component combination dictionary 4. The dictionary of FIG. 8 accepts either the Japanese character string or the kana character string of a fixed sentence as a keyword.

[0043] The synthesized speech assembly unit 5 refers to the fixed sentence component combination dictionary on the basis of the fixed sentence character string that the user enters at the character information input unit 3, and selects and connects the voice components 2 corresponding to the fixed sentence. The speech synthesis unit 6 synthesizes the connected voice components into synthesized speech, and the voice output unit 7 outputs it. Specifically, to output the fixed sentence "こちらは電話番号案内です。" (This is telephone directory assistance.), the user may, for example, enter the kana string "こちらはでんわばんごうあんないです。" at the character information input unit 3. The synthesized speech assembly unit 5 then refers to the fixed sentence component combination dictionary and selects the voice component 2 at address 716-1 for the phoneme "kochi", the voice component 2 at address 2603-2 for the phoneme "rawa", and so on, selecting and connecting the voice components 2 from the voice component database 1.

[0044] In this case, the entire character string of the fixed sentence need not be entered as a Japanese or kana character string as described above. If each fixed sentence is assigned a number in advance and that number is registered as a keyword, the synthesized speech of the fixed sentence can be output simply by entering the fixed sentence number at the character information input unit 3.

(Specific Example 3) Outputting an arbitrary sentence.

[0045] To synthesize an arbitrary sentence, its character string, phrase boundaries, and pause positions are entered at the character information input unit 3. A phrase component combination dictionary, which serves as the component combination dictionary 4, describes for each phrase the voice components 2 from which the phrase is assembled and their addresses in the voice component database, as shown in FIG. 9. In the dictionary of FIG. 9, too, keywords are set so that a phrase is accepted as either a Japanese character string or a kana character string. For example, the dictionary describes that the phrase "私は" or "わたしは" (watashi wa) is assembled from the voice component 2 at address 2901-2 for the phoneme "wata" and the voice component 2 at address 780-2 for the phoneme "shiwa".

[0046] As noted for the second conventional method, the number of distinct phrases is very large, so this phrase component combination dictionary must hold an enormous number of entries. Unlike the second conventional method, however, there is no need for the laborious work of cutting out that enormous number of voice components while observing the speech waveform. Here it is only necessary to store, for each phrase, the addresses of the voice components 2 from which the phrase is assembled on a given storage medium, so the dictionary is easy to create.

[0047] The synthesized speech assembly unit 5 identifies the phrases from the character string and phrase boundaries of the arbitrary sentence entered at the character information input unit 3, refers to the phrase component combination dictionary, selects for each phrase the voice components 2 needed for synthesis from the voice component database 1, and connects them. For example, dividing the arbitrary sentence "私は、学校へ行きます。" (I go to school.) into phrases gives "私は／学校へ／行きます。" (where "／" marks a phrase boundary), and a pause position falls between "私は" and "学校へ". The character strings of the phrases "私は", "学校へ", and "行きます", together with the pause position, are therefore entered at the character information input unit 3.

[0048] For each phrase, the synthesized speech assembly unit 5 refers to the phrase component combination dictionary, selects the voice components 2 that make up the phrase from the voice component database 1, and connects them. For the phrase "私は", it selects the voice component 2 for the phoneme "wata" at address 2901-2 and the voice component 2 for the phoneme "shiwa" at address 780-2. For the phrase "学校へ", it selects the voice component 2 for the phoneme "gak" (がっ) at address 2520-2, the voice component 2 for the phoneme "kō" (こー) at address 623-1, and the voice component 2 for the phoneme "e (silence)" at address 48-1. It proceeds in this way phrase by phrase and then connects the selected voice components 2.

[0049] The synthesized speech assembly unit 5 further inserts a pause of, for example, 100 ms into the connected sequence of voice components 2 for all phrases, according to the specified pause positions. In the preceding example, a pause is inserted between the phonemes "wata", "shiwa" and the phonemes "gak", "kō", because a punctuation mark falls there. The speech synthesis unit 6 synthesizes the connected sequence of voice components 2, including the pause, into smoothly continuous synthesized speech, and the voice output unit 7 outputs it.
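
The phrase-by-phrase selection and pause insertion of Specific Example 3 can be sketched as follows. The addresses and the 100 ms pause come from the text. The names `PHRASE_DICT`, `assemble`, and the `("pause", ms)` marker are hypothetical, and the entry for "行きます" is left out because its addresses are not given.

```python
from typing import Dict, List, Set, Tuple, Union

PAUSE_MS = 100  # example pause length mentioned in the text

# Hypothetical in-memory form of the phrase component combination dictionary.
PHRASE_DICT: Dict[str, List[str]] = {
    "私は": ["2901-2", "780-2"],            # "wata", "shiwa"
    "わたしは": ["2901-2", "780-2"],        # kana keyword for the same phrase
    "学校へ": ["2520-2", "623-1", "48-1"],  # "がっ", "こー", "e (silence)"
}

Item = Union[str, Tuple[str, int]]


def assemble(phrases: List[str], pause_after: Set[int]) -> List[Item]:
    """Concatenate component addresses phrase by phrase.

    pause_after holds the indices of phrases that are followed by a pause.
    """
    sequence: List[Item] = []
    for index, phrase in enumerate(phrases):
        sequence.extend(PHRASE_DICT[phrase])
        if index in pause_after:
            sequence.append(("pause", PAUSE_MS))
    return sequence


# "私は" / "学校へ", with a pause after the first phrase.
result = assemble(["私は", "学校へ"], pause_after={0})
```

The returned list is only an ordered plan of addresses and pauses; fetching the waveforms and smoothing the joins would be left to later stages.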

[0050] As described above, according to this embodiment, the recording medium 8 is played back, the speech waveform of the first two syllables, which form the basic unit of the human utterance rhythm, is observed, and its playback time is set as the unit time. Voice components 2 of two syllables each can then be cut out simply by cutting the playback signal of the recording medium 8 at every unit time. This is far easier than the conventional work of cutting out every voice component while playing back the recording medium and observing the speech waveform.
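
The fixed-interval cutting described above amounts to slicing the sample sequence into equal-length chunks. The following is a minimal sketch, assuming the recording is available as a list of samples; the function name `cut_into_components` and the 16 kHz sampling rate are assumptions, while 200 ms is the example unit time mentioned in the abstract.

```python
from typing import List, Sequence


def cut_into_components(
    samples: Sequence[int], sample_rate_hz: int, unit_ms: int
) -> List[List[int]]:
    """Cut a recording into consecutive pieces of unit_ms each.

    The last piece may be shorter if the recording length is not an
    exact multiple of the unit time.
    """
    unit = sample_rate_hz * unit_ms // 1000  # samples per unit time
    if unit <= 0:
        raise ValueError("unit time is shorter than one sample")
    return [list(samples[i:i + unit]) for i in range(0, len(samples), unit)]


# Half a second of silence at an assumed 16 kHz, cut every 200 ms.
pieces = cut_into_components([0] * 8000, sample_rate_hz=16000, unit_ms=200)
piece_lengths = [len(p) for p in pieces]
```

Because the text is read at a constant speed, each piece can then be labelled in order with the corresponding two syllables of the sentence; that labelling step is not shown here.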

[0051] The number of voice components 2 is also a few thousand rather than the more than 100,000 of the second and third conventional examples, which further reduces the effort of creating them. Moreover, because this embodiment uses as the unit of synthesis a two-syllable voice component 2 that reflects the natural two-beat rhythm of human utterance, the synthesized speech assembled from these components also carries that natural rhythm and is easy to listen to.

[0052] The present invention is not limited to the above embodiment and can be carried out in various other forms. For example, in each specific example above, character string input is accepted as either a Japanese character string containing kanji or a kana character string, but the keywords may instead be set so that only kana character strings are accepted.

[0053] In this embodiment the voice components 2 stored in the voice component database 1 have three kinds of intonation change, but a richer set of intonation changes may be provided.

【００５４】[0054]

[Effects of the Invention] As is clear from the above description, the voice component creation method of the present invention creates voice components by the simple method of cutting a recording of human speech read at a constant speed at every unit time. Compared with the conventional method of cutting out every voice component while observing the speech waveform, creating the voice components takes far less effort.

[0055] Because fewer voice components are needed, the creation effort is reduced still further. In addition, because the unit time used for cutting the recording is based on the human utterance rhythm, each voice component cut out at that interval carries the natural rhythm of human utterance. Synthesized speech generated by combining these components therefore also carries that natural rhythm and is easy to listen to.

[Brief description of the drawings]

FIG. 1 is a diagram showing the processing procedure of an embodiment of the voice component creation method according to the present invention.

FIG. 2 is a diagram showing an example of the phoneme matrix of the voice component database.

FIG. 3 is a diagram showing three types of prosody change.

FIG. 4 is a block diagram showing an example of the functional configuration of a speech synthesis apparatus to which the speech synthesis method of the present invention is applied.

FIG. 5 is a diagram showing an example of the overall structure of the voice component database according to the present invention.

FIG. 6 is an explanatory diagram showing the method of creating voice components.

FIG. 7 is a diagram showing an example of the insertion phrase component combination dictionary.

FIG. 8 is a diagram showing an example of the fixed sentence component combination dictionary.

FIG. 9 is a diagram showing an example of the phrase component combination dictionary.

Claims

1. A voice component creation method comprising: recording speech in which a sentence whose character string, phonemes, and prosody are known is read aloud at a constant speed; setting as a unit time the playback time of the first syllables of the recorded speech, equal in number to the syllables in one period of the human utterance rhythm; cutting the recorded speech at every unit time, in order from the beginning, to create recorded pieces; and assigning to each recorded piece, in the order of cutting, the corresponding character string of the sentence and its prosody, thereby creating voice components.

2. A voice component database storing various speech waveforms, wherein every speech waveform has the same number of constituent syllables and the same length, and each speech waveform is stored at a different address according to the type of its constituent phonemes and prosody.

3. A speech synthesis method using the voice component database according to claim 2, recorded base-sentence holding means that holds recorded base-sentence speech for the portion of a response sentence other than an insertion phrase, and an insertion phrase component combination dictionary describing the addresses in the voice component database of the voice components from which each insertion phrase is assembled, the method comprising: a character information input step of inputting a character string for the insertion phrase; a synthesized speech assembly step of referring to the insertion phrase component combination dictionary and selecting from the voice component database, and connecting, the voice components corresponding to the input character string; a speech synthesis step of synthesizing the connected voice components into synthesized speech for the insertion phrase; and a step of fitting the synthesized speech of the insertion phrase into the recorded base-sentence speech and outputting the response sentence as speech.

4. A speech synthesis method using the voice component database according to claim 2 and a fixed sentence component combination dictionary describing, for various fixed sentences, the addresses in the voice component database of the voice components from which each fixed sentence is assembled, the method comprising: a character information input step of inputting the character string of a fixed sentence; a synthesized speech assembly step of referring to the fixed sentence component combination dictionary and selecting from the voice component database, and connecting, the voice components corresponding to the input character string; a speech synthesis step of synthesizing the connected voice components into synthesized speech; and an output step of outputting the synthesized speech.

5. A speech synthesis method using the voice component database according to claim 2 and a phrase component combination dictionary describing, for various phrases, the addresses in the voice component database of the voice components from which each phrase is assembled, the method comprising: a character information input step of inputting an arbitrary character string together with its phrase boundaries and pause positions; a synthesized speech assembly step of referring to the phrase component combination dictionary and selecting from the voice component database, and connecting, the voice components corresponding to the input arbitrary character string; a speech synthesis step of synthesizing the connected voice components into synthesized speech; and an output step of outputting the synthesized speech.