JP2002189489A

JP2002189489A - Speech synthesizer

Info

Publication number: JP2002189489A
Application number: JP2001021447A
Authority: JP
Inventors: Yuji Wada (和田祐司)
Original assignee: Victor Company of Japan Ltd
Current assignee: Victor Company of Japan Ltd
Priority date: 2000-02-18
Filing date: 2001-01-30
Publication date: 2002-07-05

Abstract

PROBLEM TO BE SOLVED: To reduce cost by eliminating prosody processing, thereby shortening processing time and suppressing growth in memory capacity. SOLUTION: A phoneme string output unit 2 decomposes input text information into phoneme strings whose phoneme symbols equal those of phoneme strings stored in a speech database 1, and outputs them to a phoneme string waveform selection unit 3. The phoneme string waveform selection unit 3 attends in turn to each phoneme string input from the phoneme string output unit 2, extracts candidate phoneme strings from the speech database 1, and selects and outputs the optimal one. A fundamental frequency pattern creating means 4 divides the input text information into a plurality of phoneme strings each having one accent and creates a fundamental frequency pattern for each. A selection means 5 narrows down the candidate phoneme strings by comparing the fundamental frequency of the waveform data of each candidate with the value of the fundamental frequency pattern, determining whether the difference between the two is within a predetermined range, and further determining whether each candidate satisfies a predetermined criterion.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech synthesizer that produces speech output from input text information.

[0002]

2. Description of the Related Art

A speech synthesizer usually decomposes the input text information for speech synthesis into phoneme strings whose phoneme symbols equal those of phoneme strings stored in a speech database, extracts the optimal waveform data for each decomposed phoneme string from the speech database, and connects the extracted waveform data to produce speech output for the text information.

[0003]

Problems to Be Solved by the Invention

Recent speech synthesizers, however, are required to produce natural, smooth speech output that is as close as possible to human utterance. Producing such output requires generating optimal waveform data through prosody processing or the like, but prosody processing is complex and takes considerable processing time. Consequently, a speech synthesizer that must output input text information as speech in real time has the drawback that it can no longer output speech continuously once the input text becomes long.

[0004] Moreover, prosody processing requires a large memory capacity for storing prosody information, which has been one factor hindering cost reduction.

[0005] The present invention has been made in view of the above circumstances, and its object is to provide a speech synthesizer that obtains natural, smooth speech output without performing prosody processing.

[0006]

Means for Solving the Problems

As means for solving the above problems, the invention according to claim 1 is a speech synthesizer comprising: a speech database storing waveform data, and feature data thereof, for phoneme strings used in speech synthesis; a phoneme string output unit that receives text information for speech synthesis, decomposes the text information, and sequentially outputs phoneme strings whose phoneme symbols equal those of phoneme strings stored in the speech database; a phoneme string waveform selection unit that, after receiving the phoneme strings from the phoneme string output unit, sequentially determines among them a phoneme string of interest for which waveform data is to be generated and, upon determining a given phoneme string of interest, takes phoneme strings having the same phoneme symbols as the phoneme string of interest as candidate phoneme strings, extracts the waveform data and feature data of the candidate phoneme strings from the speech database, and selects one of the extracted waveform data as the optimal waveform data for the phoneme string of interest, thereby sequentially outputting the optimal waveform data for each phoneme string of interest; a waveform data connection unit that connects the waveform data of the phoneme strings from the phoneme string waveform selection unit; and a speech output unit that outputs the text information as speech based on the connected waveform data input from the waveform data connection unit.

The speech synthesizer is characterized in that the phoneme string waveform selection unit has: fundamental frequency pattern creating means that, based on morphological analysis of the text information, divides the text information into units of phoneme strings each having exactly one accent and creates a fundamental frequency pattern for each divided phoneme string; and selection means that compares the fundamental frequency of the waveform data, contained in the feature data of each candidate phoneme string for the determined phoneme string of interest, with the value of the fundamental frequency at a predetermined position of the fundamental frequency pattern of the one-accent phoneme string corresponding to that phoneme string of interest, selects as the optimal waveform data for the phoneme string of interest the waveform data of a candidate phoneme string for which the difference between the two is within a predetermined range and which satisfies a predetermined criterion, and sequentially outputs the optimal waveform data for each phoneme string of interest in the input order of the text information.

[0007] The invention according to claim 2 is the invention according to claim 1, characterized in that, when there are a plurality of candidate phoneme strings whose waveform data has the difference within the predetermined range and satisfies the predetermined criterion, the selection means obtains the formant mean square error between the waveform data already selected as the optimal waveform data by the previous selection operation and the waveform data of each candidate phoneme string, and selects the waveform data with the smallest error as the optimal waveform data.

[0008] The invention according to claim 3 is the invention according to claim 1 or 2, characterized in that, as the determination of whether the predetermined criterion is satisfied, the selection means determines, with respect to phoneme symbols, whether the phoneme strings preceding and following the phoneme string of interest in the text information match the phoneme strings preceding and following each candidate phoneme string stored in the speech database, and performs the determination of whether the difference between the two fundamental frequency values is within the predetermined range after this determination on the preceding and following phoneme strings.

[0009] The invention according to claim 4 is the invention according to claim 1 or 2, characterized in that, as the determination of whether the predetermined criterion is satisfied, the selection means compares the phoneme strings preceding and following the phoneme string of interest in the text information with the phoneme strings preceding and following each candidate phoneme string stored in the speech database; obtains, for the preceding phoneme string of each candidate, the number of phoneme symbols that match the preceding phoneme string of the phoneme string of interest; obtains, for the following phoneme string of each candidate, the number of phoneme symbols that match the following phoneme string of the phoneme string of interest; obtains, for each candidate, the sum of the match count in its preceding phoneme string and the match count in its following phoneme string; and determines whether a given candidate is the one for which this sum is largest. The selection means performs the determination of whether the difference between the two fundamental frequency values is within the predetermined range after determining whether the candidate is the one with the largest sum.
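The match-count criterion of claim 4 can be illustrated with a short sketch. The patent text does not say how matching symbols are aligned, so the version below assumes position-by-position comparison, with preceding strings aligned at the side adjacent to the phoneme string of interest. The function names and the dictionary layout of a candidate are hypothetical.

```python
from typing import Dict, List, Sequence


def count_matches(a: Sequence[str], b: Sequence[str]) -> int:
    """Count positions at which two phoneme sequences carry the same symbol."""
    return sum(1 for x, y in zip(a, b) if x == y)


def context_score(target_pre: List[str], target_post: List[str],
                  cand_pre: List[str], cand_post: List[str]) -> int:
    """Sum of matching symbols in the preceding and following strings."""
    # Assumption: preceding strings are compared from their last symbol
    # backwards, because that end touches the phoneme string of interest.
    pre = count_matches(target_pre[::-1], cand_pre[::-1])
    post = count_matches(target_post, cand_post)
    return pre + post


def best_by_context(target_pre: List[str], target_post: List[str],
                    candidates: List[Dict]) -> List[Dict]:
    """Return every candidate whose summed match count is maximal."""
    scores = [context_score(target_pre, target_post, c["pre"], c["post"])
              for c in candidates]
    top = max(scores)
    return [c for c, s in zip(candidates, scores) if s == top]
```

More than one candidate can share the maximum, which is why the sketch returns a list; the later fundamental frequency and formant checks would then decide among the survivors.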

[0010] The invention according to claim 5 is the invention according to claim 1 or 2, characterized in that, as the determination of whether the predetermined criterion is satisfied, the selection means compares the phoneme strings preceding and following the phoneme string of interest in the text information with the phoneme strings preceding and following each candidate phoneme string stored in the speech database; obtains, for the preceding phoneme string of each candidate, the number of phoneme symbols that match the preceding phoneme string of the phoneme string of interest; obtains, for the following phoneme string of each candidate, the number of phoneme symbols that match the following phoneme string of the phoneme string of interest; calculates, for each candidate and according to a preset rule, an evaluation value for the balance between the match count in its preceding phoneme string and the match count in its following phoneme string; and determines whether a given candidate is the one whose evaluation value indicates the best balance. The selection means performs the determination of whether the difference between the two fundamental frequency values is within the predetermined range after determining whether the candidate is the one with the best balance.
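Claim 5 leaves the balance rule open ("a preset rule"). As one possible reading, the sketch below scores a candidate by the smaller of its two match counts, so that a candidate matching well on only one side is penalized. This `min` rule and all names are assumptions for illustration, not the rule used in the patent.

```python
from typing import Dict, List, Tuple


def balance_score(pre_matches: int, post_matches: int) -> int:
    # Hypothetical rule: a candidate is only as good as its weaker side.
    return min(pre_matches, post_matches)


def best_by_balance(match_counts: Dict[str, Tuple[int, int]]) -> List[str]:
    """match_counts maps a candidate id to (pre_matches, post_matches).

    Returns the ids whose balance score is maximal, in sorted order.
    """
    scores = {cid: balance_score(pre, post)
              for cid, (pre, post) in match_counts.items()}
    top = max(scores.values())
    return sorted(cid for cid, s in scores.items() if s == top)
```

Under this rule a candidate with counts (2, 2) beats one with counts (5, 0), whereas the summed count of claim 4 would rank them the other way round.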

[0011] The invention according to claim 6 is the invention according to any one of claims 3 to 5, characterized in that, as the determination of whether the predetermined criterion is satisfied, the selection means also determines whether the difference between the average amplitude over a predetermined section near the end of the waveform data already determined as optimal by the previous selection operation and the average amplitude over a predetermined section near the beginning of the waveform data of each candidate phoneme string is within a predetermined range. The selection means performs this determination on the average amplitude difference after the determination on the preceding and following phoneme strings, and either after or before the determination of whether the difference between the two fundamental frequency values is within the predetermined range.

[0012] The invention according to claim 7 is the invention according to claim 6, characterized in that the average amplitude over the predetermined section near the end of the determined waveform data is calculated from waveform data weighted so that amplitude values become larger toward the end, and the average amplitude over the predetermined section near the beginning of the waveform data of each candidate phoneme string is calculated from waveform data weighted so that amplitude values become larger toward the beginning.
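The amplitude continuity check of claims 6 and 7 can be sketched as follows. The claims do not specify the weighting function, the section length, or the tolerance, so linear weights, a caller-supplied `section` length (at least 1 sample), and a caller-supplied `max_diff` are assumptions made here; the function names are hypothetical.

```python
from typing import Sequence


def weighted_mean_amplitude(samples: Sequence[float], toward_end: bool) -> float:
    """Average of absolute amplitudes after linear weighting.

    toward_end=True emphasizes the last samples (tail of the previous
    waveform); toward_end=False emphasizes the first samples (head of a
    candidate waveform).
    """
    n = len(samples)
    if n == 0:
        raise ValueError("section must contain at least one sample")
    # Assumed weighting: linear ramp from 1/n up to 1 at the joint side.
    weights = [(i + 1) / n for i in range(n)]
    if not toward_end:
        weights.reverse()
    return sum(w * abs(s) for w, s in zip(weights, samples)) / n


def amplitude_continuous(prev_wave: Sequence[float], cand_wave: Sequence[float],
                         section: int, max_diff: float) -> bool:
    """True if the weighted averages on both sides of the joint are close."""
    tail = weighted_mean_amplitude(prev_wave[-section:], toward_end=True)
    head = weighted_mean_amplitude(cand_wave[:section], toward_end=False)
    return abs(tail - head) <= max_diff
```

Weighting toward the joint means that samples closest to the connection point dominate the comparison, which is the effect claim 7 describes.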

[0013]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a speech synthesizer according to a first embodiment. In this figure, a speech database 1 stores waveform data for the phoneme strings used in speech synthesis, together with their feature data (data such as fundamental frequency, phoneme symbols, and formants).

[0014] Text information is output from text information input means (not shown) to a phoneme string output unit 2 and a phoneme string waveform selection unit 3. The phoneme string output unit 2 decomposes the input text information and outputs, to the phoneme string waveform selection unit 3, phoneme strings whose phoneme symbols equal those of phoneme strings stored in the speech database 1.

[0015] After receiving the phoneme strings from the phoneme string output unit 2, the phoneme string waveform selection unit 3 sequentially determines the phoneme string for which waveform data is to be generated. (In this specification, the phoneme string for which waveform data is currently being generated is called the "phoneme string of interest".) Once it has determined a given phoneme string of interest, the phoneme string waveform selection unit 3 extracts from the speech database 1 the candidate phoneme strings for that phoneme string of interest (phoneme strings stored in the speech database 1 whose phoneme symbols equal those of the phoneme string of interest), selects one of the extracted candidates as the optimal phoneme string waveform data for the phoneme string of interest, and outputs it.

[0016] The waveform data connection unit 6 connects the waveform data input from the phoneme string waveform selection unit 3 and outputs the result to the speech output unit 7. The speech output unit 7 outputs the text information as speech based on the connected waveform data input from the waveform data connection unit 6.
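The data flow between the blocks of FIG. 1 can be summarized in a few lines. This is only a wiring sketch: `decompose` and `select_waveform` are placeholder callables standing in for the phoneme string output unit 2 and the phoneme string waveform selection unit 3, and their signatures are assumptions.

```python
from typing import Callable, List, Sequence

Waveform = List[float]


def synthesize(text: str,
               decompose: Callable[[str], Sequence[str]],
               select_waveform: Callable[[str, Waveform], Waveform]) -> Waveform:
    """Process phoneme strings in input order and connect their waveforms."""
    output: Waveform = []
    previous: Waveform = []
    for phoneme_string in decompose(text):        # phoneme string output unit 2
        # The previously selected waveform is passed along because later
        # checks compare each candidate against it.
        previous = select_waveform(phoneme_string, previous)  # selection unit 3
        output.extend(previous)                   # waveform data connection unit 6
    return output                                 # handed to speech output unit 7
```

Because each phoneme string is handled as soon as it arrives, output can begin before the whole text has been processed, which fits the real-time goal stated earlier in the document.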

[0017] The phoneme string waveform selection unit 3 has fundamental frequency pattern creating means 4 and selection means 5. The fundamental frequency pattern creating means 4 receives text information from the text information input means (not shown), performs morphological analysis on it to divide it into a plurality of phoneme strings each having one accent, and creates a fundamental frequency pattern for each of these one-accent phoneme strings.

[0018] The selection means 5 narrows down the number of candidate phoneme strings by comparing the fundamental frequency of the waveform data of each candidate phoneme string extracted from the speech database 1 with the value of the fundamental frequency pattern created as described above, determining whether the difference between the two is within a predetermined range, and further determining whether each candidate phoneme string satisfies a predetermined criterion. One such criterion is whether, with respect to phoneme symbols (the entire phoneme symbol, the first phoneme symbol, or the last phoneme symbol), the phoneme strings before and after the phoneme string of interest in the text information match the phoneme strings before and after each candidate phoneme string stored in the speech database 1. The fundamental frequency determination makes it possible to select a phoneme string that is as close as possible to the fundamental frequency pattern set in advance as the preferred fundamental frequency for natural speech synthesis. The determination on the phoneme symbols of the neighboring phoneme strings makes it possible to select a phoneme string whose preceding and following context resembles that of the phoneme string of interest as closely as possible.
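The three levels of context comparison mentioned above (entire phoneme symbol, first symbol, last symbol) can be sketched as a simple predicate and filter. The sketch assumes that neighboring phoneme strings are given as romanized strings in which each character stands for one phoneme symbol, and that a missing neighbor (for example, at the start of the text) is passed as `None` and skipped. All names are hypothetical.

```python
from typing import Dict, List, Optional


def context_match(target_neighbor: str, cand_neighbor: str, level: str) -> bool:
    """Compare two neighboring phoneme strings at the requested level."""
    if level == "whole":
        return target_neighbor == cand_neighbor
    if level == "first":
        return target_neighbor[:1] == cand_neighbor[:1]
    if level == "last":
        return target_neighbor[-1:] == cand_neighbor[-1:]
    raise ValueError(f"unknown level: {level}")


def filter_by_context(target_pre: Optional[str], target_post: Optional[str],
                      candidates: List[Dict[str, str]],
                      level: str) -> List[Dict[str, str]]:
    """Keep candidates whose stored neighbors match the target's neighbors."""
    kept = []
    for cand in candidates:
        # A side with no neighbor in the text is not checked.
        pre_ok = target_pre is None or context_match(target_pre, cand["pre"], level)
        post_ok = target_post is None or context_match(target_post, cand["post"], level)
        if pre_ok and post_ok:
            kept.append(cand)
    return kept
```

A caller could try the strict "whole" level first and fall back to "first" or "last" when it removes every candidate; whether the patent orders the levels this way is not stated in this section.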

[0019] If the above determinations still do not narrow the candidate phoneme strings down to one, the selection means 5 calculates the formant mean square error between the previously selected waveform data and each candidate waveform data, and selects the candidate with the smallest error as the optimal waveform data. The order of selection by the selection means 5, that is, the order in which it attends to the phoneme strings of the text information, is the input order of the text information. The candidate with the smallest formant change is selected for the following reason: formants change with the movements of the tongue, jaw, lips, and so on during human speech, and because these movements are continuous, formant values that are likewise continuous along the time axis are considered preferable for natural speech synthesis.
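The formant tie-break can be written as a minimum search over a mean square error. The sketch assumes each waveform's feature data exposes a fixed-length list of formant frequencies in Hz (for example F1 to F3 near the joint); this representation and the names below are assumptions, since the section does not define the formant data layout.

```python
from typing import Dict, List, Sequence


def formant_mse(prev_formants: Sequence[float],
                cand_formants: Sequence[float]) -> float:
    """Mean square error between two equally long formant vectors."""
    if len(prev_formants) != len(cand_formants) or not prev_formants:
        raise ValueError("formant vectors must be non-empty and equally long")
    squared = [(p - c) ** 2 for p, c in zip(prev_formants, cand_formants)]
    return sum(squared) / len(squared)


def pick_smoothest(prev_formants: Sequence[float],
                   candidates: List[Dict]) -> Dict:
    """Return the candidate whose formants are closest to the previous ones."""
    return min(candidates,
               key=lambda c: formant_mse(prev_formants, c["formants"]))
```

`min` returns the first candidate on an exact tie, so the result is deterministic for a given candidate order.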

[0020] Next, the operation of the first embodiment configured as described above will be explained. FIGS. 2 to 5 are flowcharts for explaining the operation of the phoneme string waveform selection unit 3. When the fundamental frequency pattern creating means 4 of the phoneme string waveform selection unit 3 receives text information from the text information input means (not shown), for example "It is sunny today, isn't it?" (in Japanese, "Kyo wa hare desu ne."), it performs morphological analysis on the text information and divides it into a plurality of phoneme strings each having one accent (step 1). This embodiment describes the case in which the one-accent phoneme strings are the three strings "kyo-wa", "hare", and "desune". The fundamental frequency pattern creating means 4 then creates a fundamental frequency pattern for each of the three phoneme strings "kyo-wa", "hare", and "desune" (step 2). For example, since "kyo-wa" has its accent at the position of "o", its fundamental frequency pattern is a mountain-shaped pattern with its peak at the position of "o".
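A mountain-shaped fundamental frequency pattern with its peak at the accent position can be approximated by a piecewise-linear contour. The text gives no numerical shape, so the triangular form, the base and peak frequencies, and the function name below are all assumptions chosen only to make the idea concrete.

```python
from typing import List


def mountain_pattern(n_positions: int, peak_index: int,
                     base_hz: float, peak_hz: float) -> List[float]:
    """Rise linearly from base_hz to peak_hz at peak_index, then fall back."""
    if not 0 <= peak_index < n_positions:
        raise ValueError("peak_index must lie inside the pattern")
    pattern = []
    for i in range(n_positions):
        # Number of steps between the peak and the edge on this side.
        span = peak_index if i < peak_index else n_positions - 1 - peak_index
        frac = 1.0 if span == 0 else 1.0 - abs(i - peak_index) / span
        pattern.append(base_hz + (peak_hz - base_hz) * frac)
    return pattern
```

For instance, `mountain_pattern(5, 2, 100.0, 140.0)` yields `[100.0, 120.0, 140.0, 120.0, 100.0]`, a contour that peaks at the third position in the way the pattern for "kyo-wa" peaks at "o".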

[0021] Meanwhile, the phoneme string output unit 2 also receives the text information "Kyo wa hare desu ne." from the text information input means (not shown), decomposes it, and outputs to the phoneme string waveform selection unit 3 phoneme strings whose phoneme symbols equal those of phoneme strings stored in the speech database 1. The phoneme strings output by the phoneme string output unit 2 do not necessarily coincide with the phoneme strings "kyo-wa", "hare", and "desune" obtained by the fundamental frequency pattern creating means 4 through morphological analysis; they may be, for example, "kyo-", "wa", "hare", "desu", and "ne". Here, however, to keep the explanation simple, the phoneme string output unit 2 is likewise assumed to output the three phoneme strings "kyo-wa", "hare", and "desune". The selection means 5 of the phoneme string waveform selection unit 3 receives these phoneme strings from the phoneme string output unit 2 (step 3).

[0022] Next, the selection means 5 first determines "kyo-wa" as the phoneme string of interest (step 4). The relationship between the phoneme string of interest and the candidate phoneme strings is explained concretely here. A candidate phoneme string is a candidate for the phoneme string in the text information that is currently being attended to in order to generate its speech waveform data (that is, the phoneme string of interest); it is not a candidate phoneme string with respect to a phoneme string in the text information that is not currently being attended to. Suppose the text information "Kyo wa hare desu ne." is decomposed into "kyo-wa", "hare", and "desune". The speech database 1 then contains a plurality of waveform data (possibly only one) with the phoneme symbols "kyo-wa", a plurality of waveform data with the phoneme symbols "hare", and a plurality of waveform data with the phoneme symbols "desune". When "kyo-wa" is currently the phoneme string of interest among the three, only the phoneme strings with the phoneme symbols "kyo-wa" in the speech database 1 are candidate phoneme strings; the phoneme strings with the phoneme symbols "hare" and those with the phoneme symbols "desune" are not. Likewise, when "hare" is currently the phoneme string of interest, only the phoneme strings with the phoneme symbols "hare" in the speech database 1 are candidate phoneme strings; those with the phoneme symbols "kyo-wa" and "desune" are not.

[0023] Accordingly, to create the waveform data for the decomposed phoneme strings "kyo-wa", "hare", and "desune" of the text information "Kyo wa hare desu ne.", "kyo-wa" is first determined as the phoneme string of interest, phoneme strings with the same phoneme symbols as "kyo-wa" are treated as candidate phoneme strings, and their waveform data is retrieved from the speech database 1. If there is more than one candidate waveform data, the optimal one must be selected; to decide which is optimal, the data on fundamental frequency, phoneme symbols, and formants described later are examined until the candidates are narrowed down to one. Once the optimal waveform data for "kyo-wa" has been selected, "hare" is determined as the next phoneme string of interest, the waveform data of the candidate phoneme strings "hare" with the same phoneme symbols is retrieved from the speech database 1, and the optimal waveform data is selected from among them. Similarly, once the optimal waveform data for "hare" has been selected, "desune" is determined as the phoneme string of interest, the waveform data of the candidate phoneme strings "desune" is retrieved from the speech database 1, and the optimal waveform data is selected from among them. In short, the phoneme string of interest is a phoneme string that the phoneme string output unit 2 outputs after decomposing the text information, and it indicates which part of the text information waveform data is currently being generated for. A candidate phoneme string, by contrast, is the name under which the waveform data to be generated is held in the speech database 1.
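The relationship between a phoneme string of interest and its candidates amounts to a keyed lookup: only database entries stored under the same phoneme symbols qualify. The sketch below assumes the speech database can be modeled as a dictionary from phoneme symbols to entries; the `Entry` fields and function name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    """One stored phoneme string: waveform plus feature data (assumed layout)."""
    symbol: str
    waveform: List[float]
    f0: List[float] = field(default_factory=list)
    formants: List[float] = field(default_factory=list)


def candidates_for(db: Dict[str, List[Entry]], target_symbol: str) -> List[Entry]:
    """Return every entry whose phoneme symbols equal the string of interest."""
    return list(db.get(target_symbol, []))
```

With a database holding three "kyo-wa" entries and two "hare" entries, `candidates_for(db, "kyo-wa")` returns only the three "kyo-wa" entries; the "hare" entries become candidates only when "hare" is the phoneme string of interest.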

[0024] Returning to step 4: after determining "kyo-wa" as the phoneme string of interest, the selection means 5 acquires from the speech database 1 the data (for example, data on fundamental frequency, phoneme symbols, formants, and so on) for all candidate phoneme strings "kyo-wa" for this phoneme string of interest (step 5). Here it is assumed that there are three of them, the first to third candidate phoneme strings. For the fundamental frequency, the selection means 5 then compares the values fzt of the pattern created by the fundamental frequency pattern creating means 4 (the fundamental frequency values at the peak position "o" and at several positions before and after it) with the corresponding values fzt1 of the first candidate phoneme string (step 6), and determines whether expression (1) below holds (step 7). The quantities a and b in expression (1) are fixed values set in advance.

[0025]

a ≤ fzt - fzt1 ≤ b ... (1)

In step 7, the selection means 5 determines "YES" if expression (1) holds for fzt and fzt1 at every position, and "NO" if expression (1) fails at even one position. The selection means 5 stores this result in its own storage means (step 8) and determines whether all candidate phoneme strings have been processed (step 9). In this case the second and third candidate phoneme strings remain, so the selection means 5 determines "NO" and, in step 6, compares each value fzt of the created pattern with the corresponding value fzt1 of the second candidate phoneme string.
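The check of expression (1) at every compared position can be sketched in a few lines. The bounds `a` and `b` are the preset fixed values from the text; the function name and the list representation of the fundamental frequency values are assumptions.

```python
from typing import Sequence


def satisfies_expression_1(pattern_f0: Sequence[float],
                           candidate_f0: Sequence[float],
                           a: float, b: float) -> bool:
    """True only if a <= fzt - fzt1 <= b holds at every compared position."""
    if len(pattern_f0) != len(candidate_f0):
        raise ValueError("pattern and candidate must have the same positions")
    return all(a <= fzt - fzt1 <= b
               for fzt, fzt1 in zip(pattern_f0, candidate_f0))
```

A single position outside the range is enough to return `False`, which corresponds to the "NO" result of step 7.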

【００２６】選択手段５は、以下同様にして第２及び第
３の候補音素列「kyo-wa」に対して判別を行い、ステッ
プ９で「ＹＥＳ」の判別を行った後、（１）式を満足す
る候補音素列が有ったか否かを判別する（ステップ１
０）。（１）式を満足したものが有れば、（１）式を満
足しなかった候補音素列を除去し（ステップ１１）、候
補音素列が１つに絞られたか否かを判別する（ステップ
１２）。そして、１つに絞られていれば、その候補音素
列を最適な波形データとして決定し（ステップ１３）、
ステップ５６で全ての注目音素列について選択が終了し
たか否かについて「ＮＯ」を判別し、ステップ４に戻っ
て次の注目音素列（「hare」）を決定する。上記の例の
場合、３つの候補音素列が全て（１）式を満足したもの
とする。したがって、ステップ１０での判別結果は「Ｙ
ＥＳ」となり、ステップ１１で除去する候補音素列の数
はゼロとなり、ステップ１２の判別結果は「ＮＯ」とな
る。そして、選択手段５は、注目音素列が先頭のもので
あるか否かを判別する（ステップ１４）。なお、ステッ
プ１０において、（１）式を満足するものが１つもなけ
れば、直ちにステップ１４に移行する。The selecting means 5 makes the same determination for the second and third candidate phoneme strings "kyo-wa". After determining "YES" in step 9, it determines whether any candidate phoneme string satisfied expression (1) (step 10). If any did, it removes the candidate phoneme strings that did not satisfy expression (1) (step 11) and determines whether the candidates have been narrowed down to one (step 12). If so, it determines that candidate phoneme string as the optimal waveform data (step 13), determines "NO" in step 56 as to whether selection has finished for all phoneme strings of interest, and returns to step 4 to determine the next phoneme string of interest ("hare"). In the above example, assume that all three candidate phoneme strings satisfy expression (1). The determination result in step 10 is therefore "YES", the number of candidate phoneme strings removed in step 11 is zero, and the determination result in step 12 is "NO". The selecting means 5 then determines whether the phoneme string of interest is the first one (step 14). If no candidate satisfies expression (1) in step 10, the process proceeds immediately to step 14.
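The narrowing rule of steps 10 to 12 (remove failing candidates only when at least one candidate passes) can be sketched as below. The helper name `narrow` and the candidate labels are hypothetical, chosen only for illustration.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def narrow(candidates: List[T], passes: Callable[[T], bool]) -> List[T]:
    """Steps 10-11: drop failing candidates only if at least one candidate passes."""
    kept = [c for c in candidates if passes(c)]
    # If nothing passes, the whole list is carried on to the next check unchanged.
    return kept if kept else list(candidates)


# Hypothetical per-candidate results of expression (1): all three pass,
# as in the "kyo-wa" example of the text.
results = {"cand1": True, "cand2": True, "cand3": True}
remaining = narrow(list(results), lambda name: results[name])
is_decided = len(remaining) == 1  # step 12: narrowed down to one?
```

With all three candidates passing, nothing is removed and `is_decided` is false, so processing continues with the next check, as in the example.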

【００２７】注目音素列「kyo-wa」は、入力テキスト情
報の先頭の音素列であるため、選択手段５はステップ１
４で「ＹＥＳ」を判別する。したがって、選択手段５は
ステップ２４に移行し、まず、全体の音素記号について
注目音素列「kyo-wa」の後音素列と第１の候補音素列
「kyo-wa」の後音素列とを比較して両者が一致するか否
かを判別し（ステップ２５）、その判別結果を記憶手段
に記憶する（ステップ２６）。そして、ステップ２７で
全ての候補音素列についての比較が終了したか否かにつ
いて「ＮＯ」の判別を行い、ステップ２４に戻って注目
音素列「kyo-wa」の後音素列と第２の候補音素列の後音
素列とを比較する。選択手段５は、以下、同様にして第
３の候補音素列「kyo-wa」までの比較判別を終了する
と、ステップ２７で「ＹＥＳ」を判別し、一致するもの
が有ったか否かを判別する（ステップ２８）。上記の例
の場合、第１の候補音素列「kyo-wa」の後音素列が「ha
jimete」、第２の候補音素列「kyo-wa」の後音素列が
「hima」、第３の候補音素列「kyo-wa」の後音素列が
「hen」であるものとする。したがって、ステップ２８
での判別結果は「ＮＯ」となる。なお、ステップ２８で
の判別結果が「ＹＥＳ」であれば、ステップ２５での判
別が不一致の候補音素列を除去し（ステップ２９）、候
補音素列が１つに絞られたか否かを判別する（ステップ
３０）。そして、１つに絞られていれば、その候補音素
列を最適な波形データとして決定する（ステップ３
１）。Since the phoneme string of interest "kyo-wa" is the first phoneme string of the input text information, the selecting means 5 determines "YES" in step 14. The selecting means 5 therefore proceeds to step 24 and first compares, over all phoneme symbols, the succeeding phoneme string of the phoneme string of interest "kyo-wa" with the succeeding phoneme string of the first candidate phoneme string "kyo-wa" to determine whether the two match (step 25), and stores the determination result in the storage means (step 26). In step 27 it determines "NO" as to whether the comparison has finished for all candidate phoneme strings, returns to step 24, and compares the succeeding phoneme string of the phoneme string of interest "kyo-wa" with that of the second candidate phoneme string. After completing the comparison up to the third candidate phoneme string "kyo-wa" in the same manner, the selecting means 5 determines "YES" in step 27 and determines whether any candidate matched (step 28). In the above example, assume that the succeeding phoneme string of the first candidate phoneme string "kyo-wa" is "hajimete", that of the second is "hima", and that of the third is "hen". The determination result in step 28 is therefore "NO". If the determination result in step 28 were "YES", the candidate phoneme strings found not to match in step 25 would be removed (step 29), and it would be determined whether the candidates have been narrowed down to one (step 30). If so, that candidate phoneme string is determined as the optimal waveform data (step 31).
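The whole-string comparison of steps 24 to 29 can be illustrated with the example values from the text. The `Candidate` record and the function `match_following` are hypothetical names; the sketch assumes each database entry records the phoneme string that followed it in the recorded speech.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str       # label for this database entry (hypothetical)
    following: str  # phoneme string that followed it in the recorded speech


def match_following(target_following: str, candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates whose following string equals the target's (steps 25, 29),
    or keep all of them if none matches (step 28 is "NO")."""
    matched = [c for c in candidates if c.following == target_following]
    return matched if matched else list(candidates)


candidates = [
    Candidate("kyo-wa #1", "hajimete"),
    Candidate("kyo-wa #2", "hima"),
    Candidate("kyo-wa #3", "hen"),
]
# In the input text "kyo-wa" is followed by "hare", which no candidate matches.
remaining = match_following("hare", candidates)
```

Because no stored context equals "hare", all three candidates survive this check, matching the "NO" result of step 28 in the example.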

【００２８】選択手段５は、ステップ２８で「ＮＯ」を
判別した後、注目音素列「kyo-wa」が先頭のものである
か否かを判別し（ステップ３２）、この判別結果は「Ｙ
ＥＳ」となるのでステップ４２に移行する。選択手段５
は、まず、注目音素列「kyo-wa」の後音素列の最初の１
音素記号と第１の候補音素列「kyo-wa」の後音素列の最
初の１音素記号とを比較して両者が一致するか否かを判
別し（ステップ４３）、その判別結果を記憶手段に記憶
する（ステップ４４）。そして、ステップ４５で全ての
候補音素列についての比較が終了したか否かについて
「ＮＯ」の判別を行い、ステップ４２に戻って注目音素
列「kyo-wa」の後音素列と第２の候補音素列の後音素列
とを比較する。選択手段５は、以下、同様にして第３の
候補音素列「kyo-wa」までの比較判別を終了すると、ス
テップ４５で「ＹＥＳ」を判別し、一致するものが有っ
たか否かを判別する（ステップ４６）。上記の例の場
合、第１乃至第３の候補音素列「kyo-wa」の各後音素列
「hajimete」、「hima」、「hen」の最初の音素記号は
いずれも「ｈ」であるから、注目音素列「kyo-wa」の後
音素列「hare」の最初の音素記号「ｈ」と一致する。し
たがって、ステップ４６の判別結果は「ＹＥＳ」とな
る。After determining "NO" in step 28, the selecting means 5 determines whether the phoneme string of interest "kyo-wa" is the first one (step 32). Since the result is "YES", the process proceeds to step 42. The selecting means 5 first compares the first phoneme symbol of the succeeding phoneme string of the phoneme string of interest "kyo-wa" with the first phoneme symbol of the succeeding phoneme string of the first candidate phoneme string "kyo-wa" to determine whether the two match (step 43), and stores the determination result in the storage means (step 44). In step 45 it determines "NO" as to whether the comparison has finished for all candidate phoneme strings, returns to step 42, and makes the same comparison for the second candidate phoneme string. After completing the comparison up to the third candidate phoneme string "kyo-wa" in the same manner, the selecting means 5 determines "YES" in step 45 and determines whether any candidate matched (step 46). In the above example, the first phoneme symbol of each of the succeeding phoneme strings "hajimete", "hima", and "hen" of the first to third candidate phoneme strings "kyo-wa" is "h", which matches the first phoneme symbol "h" of "hare", the succeeding phoneme string of the phoneme string of interest "kyo-wa". The determination result in step 46 is therefore "YES".

【００２９】選択手段５は、ステップ４６での判別結果
が「ＹＥＳ」であれば、ステップ４３での判別が不一致
の候補音素列を除去し（ステップ４７）、候補音素列が
１つに絞られたか否かを判別する（ステップ４８）。そ
して、１つに絞られていれば、その候補音素列を最適な
波形データとして決定する（ステップ４９）。上記の例
の場合、３つの候補音素列の全てについてステップ４３
の判別結果が「ＹＥＳ」となるため、ステップ４７で除
去する候補音素列の数はゼロとなり、ステップ４８の判
別結果は「ＮＯ」となる。If the determination result in step 46 is "YES", the selecting means 5 removes the candidate phoneme strings found not to match in step 43 (step 47) and determines whether the candidates have been narrowed down to one (step 48). If so, that candidate phoneme string is determined as the optimal waveform data (step 49). In the above example, the determination result in step 43 is "YES" for all three candidate phoneme strings, so the number of candidate phoneme strings removed in step 47 is zero and the determination result in step 48 is "NO".
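The looser comparison of steps 42 to 48, which looks only at the first phoneme symbol of the succeeding phoneme string, might be sketched as follows. As a simplifying assumption, the first character of the romanized string stands in for the first phoneme symbol; the function name and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def match_first_symbol(target_following: str, following_by_candidate: Dict[str, str]) -> List[str]:
    """Keep candidates whose following string starts with the same symbol as the
    target's following string; keep all candidates if none does."""
    matched = [
        name
        for name, following in following_by_candidate.items()
        if following[:1] == target_following[:1]  # first character stands in for the first phoneme symbol
    ]
    return matched if matched else list(following_by_candidate)


following = {"kyo-wa #1": "hajimete", "kyo-wa #2": "hima", "kyo-wa #3": "hen"}
remaining = match_first_symbol("hare", following)  # every stored context starts with "h"
narrowed_to_one = len(remaining) == 1              # step 48
```

All three stored contexts begin with "h", so none is removed and step 48 yields "NO", as in the text.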

【００３０】次いで、選択手段５は、注目音素列「kyo-
wa」が先頭の音素列であるか否かを判別し（ステップ５
０）、前回の選択動作により最適な波形データとして決
定済みの波形データと今回の注目音素列の第１の候補音
素列の波形データとの間のフォルマント平均自乗誤差Ｍ
seを演算する（ステップ５２）。この場合、注目音素列
「kyo-wa」は先頭音素列であるためステップ５０の判別
結果は「ＹＥＳ」となり、選択手段５は、予め設定され
ているフォルマントの値を、例えば自己の記憶手段から
取得する。これは、先頭音素列については決定済みの前
回波形データというものは未だ存在しないため、前回波
形データに相当するものを予め設定しておいてやる必要
があるからである。Next, the selecting means 5 determines whether the phoneme string of interest "kyo-wa" is the first phoneme string (step 50), and calculates the formant mean square error Mse between the waveform data already determined as optimal by the previous selection operation and the waveform data of the first candidate phoneme string of the current phoneme string of interest (step 52). In this case, since the phoneme string of interest "kyo-wa" is the first phoneme string, the determination result in step 50 is "YES", and the selecting means 5 acquires preset formant values, for example from its own storage means. This is because no previously determined waveform data exists yet for the first phoneme string, so values corresponding to the previous waveform data must be set in advance.

【００３１】したがって、選択手段５は、ステップ５１
で設定フォルマントの値を取得した後、ステップ５２で
この設定フォルマントの値を用いて下式（２）により平
均自乗誤差Ｍseを演算する（ステップ５２）。なお、本
実施形態では、音声データベース１に格納されている各
フォルマントの値は、第１〜第３フォルマントについて
の３種類の値であるものとし、前回の選択動作により最
適な波形データとして決定済みの波形データの第１〜第
３フォルマント値をｆ11，f12，ｆ13と表し、今回の注
目音素列の一の候補音素列の波形データの第１〜第３フ
ォルマント値をｆ21，f22，ｆ23と表す。また、今回の
注目音素列は先頭の音素列であるので、ｆ11，f12，ｆ1
3の各フォルマント値は前記設定フォルマントの各値と
なる。Accordingly, after acquiring the preset formant values in step 51, the selecting means 5 uses them to calculate the mean square error Mse by the following expression (2) (step 52). In this embodiment, the formant values stored in the speech database 1 are assumed to be three values, for the first to third formants. The first to third formant values of the waveform data already determined as optimal by the previous selection operation are denoted f11, f12, and f13, and those of the waveform data of one candidate phoneme string of the current phoneme string of interest are denoted f21, f22, and f23. Since the current phoneme string of interest is the first phoneme string, f11, f12, and f13 take the preset formant values.

【００３２】[0032]

【数１】そして、選択手段５は、ステップ５２の判別結果を自己
の記憶手段に記憶し（ステップ５３）、全ての候補音素
列が終了したか否かを判別する（ステップ５４）。この
場合は、まだ第２及び第３の候補音素列が残っているの
で、選択手段５は「ＮＯ」と判別し、ステップ５２に戻
って決定済み波形データと第２の候補音素列の波形デー
タとの間のフォルマント平均自乗誤差Ｍseを演算する。[Equation 1] The selecting means 5 then stores the result of step 52 in its own storage means (step 53) and determines whether all candidate phoneme strings have been processed (step 54). In this case, the second and third candidate phoneme strings still remain, so the selecting means 5 determines "NO", returns to step 52, and calculates the formant mean square error Mse between the determined waveform data and the waveform data of the second candidate phoneme string.

【００３３】選択手段５は、以下同様にして第２及び第
３の候補音素列「kyo-wa」についてステップ５２の演算
を行い、ステップ５４で「ＹＥＳ」の判別を行った後、
フォルマント平均自乗誤差Ｍseの値が最小の候補音素列
を最適の波形データとして選択する（ステップ５５）。
上記の場合、例えば第２の候補音素列「kyo-wa」を最適
の音素列として選択するものとする。そして、選択手段
５は、全ての注目音素列について選択が終了したか否か
を判別するが、この時点では未だ「hare」及び「desun
e」が残っているので「ＮＯ」を判別する。The selecting means 5 performs the calculation of step 52 for the second and third candidate phoneme strings "kyo-wa" in the same manner and, after determining "YES" in step 54, selects the candidate phoneme string with the smallest formant mean square error Mse as the optimal waveform data (step 55). In the above case, assume for example that the second candidate phoneme string "kyo-wa" is selected as the optimal phoneme string. The selecting means 5 then determines whether selection has finished for all phoneme strings of interest; since "hare" and "desune" still remain at this point, it determines "NO".
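Steps 52 to 55 can be sketched as below. Expression (2) is not reproduced in this text, so the sketch assumes the usual definition of a mean square error over the three formants, Mse = ((f11 − f21)² + (f12 − f22)² + (f13 − f23)²) / 3; the exact form in the original may differ. The function names and all formant values are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def formant_mse(prev: Sequence[float], cand: Sequence[float]) -> float:
    """Mean square error between two sets of formant values (assumed form of expression (2))."""
    if len(prev) != len(cand):
        raise ValueError("both waveforms need the same number of formant values")
    return sum((p - c) ** 2 for p, c in zip(prev, cand)) / len(prev)


def select_min_mse(prev: Sequence[float], candidates: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Step 55: pick the candidate with the smallest formant mean square error."""
    best = min(candidates, key=lambda name: formant_mse(prev, candidates[name]))
    return best, formant_mse(prev, candidates[best])


# Hypothetical preset formants (f11, f12, f13), used because "kyo-wa" is the first
# phoneme string and no previously determined waveform data exists yet.
preset = (500.0, 1500.0, 2500.0)
# Hypothetical (f21, f22, f23) for each candidate.
candidates = {
    "kyo-wa #1": (700.0, 1200.0, 2600.0),
    "kyo-wa #2": (520.0, 1480.0, 2450.0),
    "kyo-wa #3": (600.0, 1700.0, 2300.0),
}
best_name, best_mse = select_min_mse(preset, candidates)
```

For later phoneme strings, `prev` would hold the formant values of the waveform data chosen in the previous selection rather than the preset values.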

【００３４】したがって、選択手段５は、ステップ４に
戻り、次の注目音素列として「hare」を決定する。この
後のステップ５〜１３の動作は既述したものと同様であ
るため、重複した説明を省略する。さて、ステップ１２
の判別結果が「ＮＯ」であったとすると、注目音素列
「hare」は先頭の音素列ではないため、選択手段５はス
テップ１４で「ＮＯ」を判別し、ステップ１５に移行す
る。ステップ１５〜２２は既述したステップ２４〜３１
と略同様であるため説明を省略する（異なるのは、比較
する音素列が前音素列か後音素列かの点だけであ
る。）。ステップ２１の判別結果が「ＮＯ」であったと
すると、選択手段５はステップ２３で注目音素列が最後
の音素列であるか否かを判別するが、注目音素列「har
e」は最後の音素列ではないため、その判別結果は「Ｎ
Ｏ」となり、ステップ２４に移行する。ステップ２４〜
３１については既述してあるため、その説明を省略す
る。The selecting means 5 therefore returns to step 4 and determines "hare" as the next phoneme string of interest. The subsequent operations of steps 5 to 13 are the same as those already described, so the description is not repeated. Suppose the determination result in step 12 is "NO". Since the phoneme string of interest "hare" is not the first phoneme string, the selecting means 5 determines "NO" in step 14 and proceeds to step 15. Steps 15 to 22 are substantially the same as steps 24 to 31 described above, so their description is omitted (the only difference is whether the phoneme string compared is the preceding or the succeeding one). Suppose the determination result in step 21 is "NO". The selecting means 5 then determines in step 23 whether the phoneme string of interest is the last phoneme string. Since "hare" is not the last phoneme string, the result is "NO" and the process proceeds to step 24. Steps 24 to 31 have already been described, so their description is omitted.

【００３５】ステップ３０の判別結果が「ＮＯ」であっ
たとすると、注目音素列「hare」は先頭の音素列ではな
いため、選択手段５はステップ３２で「ＮＯ」を判別
し、ステップ３３に移行する。ステップ３３〜４０は既
述したステップ４２〜４９と略同様であるため説明を省
略する（異なるのは、比較対象が前音素列の最後の１音
素記号か後音素列の最初の１音素記号かの点だけであ
る。）。ステップ３９の判別結果が「ＮＯ」であったと
すると、選択手段５はステップ４１で注目音素列が最後
の音素列であるか否かを判別するが、注目音素列「har
e」は最後の音素列ではないため、その判別結果は「Ｎ
Ｏ」となり、ステップ４２に移行する。ステップ４２〜
４９については既述してあるため、その説明を省略す
る。Suppose the determination result in step 30 is "NO". Since the phoneme string of interest "hare" is not the first phoneme string, the selecting means 5 determines "NO" in step 32 and proceeds to step 33. Steps 33 to 40 are substantially the same as steps 42 to 49 described above, so their description is omitted (the only difference is whether the comparison uses the last phoneme symbol of the preceding phoneme string or the first phoneme symbol of the succeeding phoneme string). Suppose the determination result in step 39 is "NO". The selecting means 5 then determines in step 41 whether the phoneme string of interest is the last phoneme string. Since "hare" is not the last phoneme string, the result is "NO" and the process proceeds to step 42. Steps 42 to 49 have already been described, so their description is omitted.

【００３６】ステップ４８の判別結果が「ＮＯ」であっ
たとすると、選択手段５は注目音素列「hare」が先頭の
音素列であるか否かを判別するが、この判別結果は「Ｎ
Ｏ」となるので、直ちにステップ５２の演算を行う。こ
の場合、「決定済み波形データ」としては、前記の第２
の候補音素列「kyo-wa」についての波形データが選択さ
れているので、この波形データと第１の候補音素列「ha
re」のフォルマント平均自乗誤差Ｍseを演算する。Suppose the determination result in step 48 is "NO". The selecting means 5 determines whether the phoneme string of interest "hare" is the first phoneme string; since the result is "NO", it immediately performs the calculation of step 52. In this case, the waveform data of the second candidate phoneme string "kyo-wa" has been selected as the "determined waveform data", so the formant mean square error Mse between this waveform data and the first candidate phoneme string "hare" is calculated.

【００３７】この後、選択手段５は、ステップ５４で
「ＹＥＳ」を判別した後、フォルマント平均自乗誤差Ｍ
seの値が最小の候補音素列を最適の波形データとして選
択し（ステップ５５）、更に全ての注目音素列について
選択が終了したか否かを判別する（ステップ５６）。こ
の時点では、未だ「desune」が残っているので「ＮＯ」
を判別する。Thereafter, after determining "YES" in step 54, the selecting means 5 selects the candidate phoneme string with the smallest formant mean square error Mse as the optimal waveform data (step 55), and then determines whether selection has finished for all phoneme strings of interest (step 56). Since "desune" still remains at this point, it determines "NO".

【００３８】したがって、選択手段５は、ステップ４に
戻り、次の注目音素列として「desune」を決定する。こ
の後のステップ５〜１３の説明は省略する。ステップ１
２の判別結果が「ＮＯ」であったとすると、注目音素列
「desune」は先頭の音素列ではないため、選択手段５は
ステップ１４で「ＮＯ」を判別し、ステップ１５に移行
する。ステップ１５〜２２の説明は省略する。The selecting means 5 therefore returns to step 4 and determines "desune" as the next phoneme string of interest. The description of the subsequent steps 5 to 13 is omitted. Suppose the determination result in step 12 is "NO". Since the phoneme string of interest "desune" is not the first phoneme string, the selecting means 5 determines "NO" in step 14 and proceeds to step 15. The description of steps 15 to 22 is omitted.

【００３９】ステップ２１の判別結果が「ＮＯ」であっ
たとすると、選択手段５はステップ２３で注目音素列が
最後の音素列であるか否かを判別するが、注目音素列
「desune」は最後の音素列であるため、その判別結果は
「ＹＥＳ」となり、ステップ３２に移行する。Suppose the determination result in step 21 is "NO". The selecting means 5 then determines in step 23 whether the phoneme string of interest is the last phoneme string. Since the phoneme string of interest "desune" is the last phoneme string, the result is "YES" and the process proceeds to step 32.

【００４０】注目音素列「desune」は先頭の音素列では
ないため、選択手段５はステップ３２で「ＮＯ」を判別
し、ステップ３３に移行する。ステップ３３〜４０の説
明は省略する。ステップ３９の判別結果が「ＮＯ」であ
ったとすると、選択手段５はステップ４１で注目音素列
が最後の音素列であるか否かを判別するが、注目音素列
「desune」は最後の音素列であるため、その判別結果は
「ＹＥＳ」となり、ステップ５０に移行する。Since the phoneme string of interest "desune" is not the first phoneme string, the selecting means 5 determines "NO" in step 32 and proceeds to step 33. The description of steps 33 to 40 is omitted. Suppose the determination result in step 39 is "NO". The selecting means 5 then determines in step 41 whether the phoneme string of interest is the last phoneme string. Since the phoneme string of interest "desune" is the last phoneme string, the result is "YES" and the process proceeds to step 50.

【００４１】ステップ５０〜５５の動作も記述したのと
同様であるため、その説明を省略する。そして、選択手
段５は、ステップ５６で「ＹＥＳ」を判別し、全ての動
作を終了する。The operations in steps 50 to 55 are the same as those described above, and thus the description thereof will be omitted. Then, the selecting means 5 determines "YES" in step 56, and ends all the operations.

【００４２】なお、先に、説明を分かりやすくするた
め、音素列出力部２は、基本周波数パターン作成手段４
が形態素解析に基づき分解したのと同じ３つの音素列
「kyo-wa」、「hare」、「desune」を出力するものとし
て説明を行ったが、音素列出力部２がこれとは異なる音
素列「kyo-」、「wa」、「hare」、「desu」、「ne」を
出力した場合（むしろ、この場合の方が通常である）も
同様に行えばよい。例えば、最初の注目音素列「kyo-」
の場合、音声データベース１から取り出した各候補音素
列「kyo-」の基本周波数の値と、「kyo-wa」の作成パタ
ーンにおける「kyo-」に対応する個所の基本周波数の値
とを前記の（１）式に適用してやればよい。Above, to keep the explanation simple, the phoneme string output unit 2 was described as outputting the same three phoneme strings "kyo-wa", "hare", and "desune" that the fundamental frequency pattern creating means 4 obtained by decomposition based on morphological analysis. The same procedure applies, however, when the phoneme string output unit 2 outputs different phoneme strings "kyo-", "wa", "hare", "desu", and "ne" (in fact, this is the more usual case). For example, for the first phoneme string of interest "kyo-", the fundamental frequency values of each candidate phoneme string "kyo-" retrieved from the speech database 1 and the fundamental frequency values at the positions corresponding to "kyo-" in the created pattern for "kyo-wa" are applied to expression (1).

【００４３】音素列波形選択部３は、上記のように動作
し、音声データベース１から最適の音素列の波形データ
を選択してこれを取り出し、波形データ接続部６に対し
て出力する。波形データ接続部６は、入力した各波形デ
ータを接続してこれを音声出力部７に出力し、音声出力
部７は、この入力した接続波形データに基づき、テキス
ト情報である「今日は晴れですね。」の音声出力を行
う。The phoneme string waveform selecting unit 3 operates as described above, selects and retrieves the waveform data of the optimal phoneme strings from the speech database 1, and outputs it to the waveform data connection unit 6. The waveform data connection unit 6 connects the input waveform data and outputs the result to the speech output unit 7, and the speech output unit 7 outputs, based on the connected waveform data, speech for the text information "今日は晴れですね。" ("It is sunny today, isn't it?").

【００４４】上述した図１の音声合成装置では、音素列
波形選択部３は、音声データベース１に格納されている
基本周波数、音素記号、フォルマント等のデータだけを
用いて最適な音素列波形を選択するようになっている。
したがって、複雑な韻律処理を行う必要がなくなり、処
理時間を短縮できると共に、メモリ容量も増加を抑制し
て充分なコストダウンを図ることができる。In the speech synthesizer of FIG. 1 described above, the phoneme string waveform selecting unit 3 selects the optimal phoneme string waveform using only the data stored in the speech database 1, such as fundamental frequency, phoneme symbols, and formants. Complicated prosody processing is therefore unnecessary, so the processing time can be shortened and the increase in memory capacity suppressed, which allows a substantial cost reduction.

【００４５】次に、本発明の第２の実施形態につき説明
する。この第２の実施形態の構成は、第１の実施形態と
同じ構成すなわち図１に示す構成であり、選択手段５が
音声データベース１から候補音素列を選択する際の判別
動作の順序が異なるものである。すなわち、第１の実施
形態では、図２乃至図５のフローチャートに示すよう
に、最初に基本周波数の比較を行った後に、前後の音素
列の比較を行い、その後にフォルマントの誤差を演算し
て最適の波形データを選択していたが、この第２の実施
形態では、最初に前後の音素列の比較を行い、その後に
基本周波数の比較を行い、更にその後にフォルマントの
誤差を演算するようにしている。Next, a second embodiment of the present invention will be described. The configuration of the second embodiment is the same as that of the first embodiment, namely the configuration shown in FIG. 1; what differs is the order of the determination operations performed when the selecting means 5 selects a candidate phoneme string from the speech database 1. In the first embodiment, as shown in the flowcharts of FIGS. 2 to 5, the fundamental frequencies are compared first, the preceding and succeeding phoneme strings are compared next, and the formant error is then calculated to select the optimal waveform data. In the second embodiment, the preceding and succeeding phoneme strings are compared first, the fundamental frequencies are compared next, and the formant error is calculated last.

【００４６】このように、最初に前後の音素列の比較を
行うことにより、第一に、テキストを音声合成するとき
の調音を最適化することができる。ここで「調音」と
は、連続する２つの音素について発声した場合に、この
２つの音素の波形のいずれにも属さない中間的な波形デ
ータ部分を指している。例えば、「アイウエオ」と連続
して発話する場合につき考えてみると、「ア」と「イ」
の間には「ア」にも「イ」にも属さない中間的な波形デ
ータが存在し、同様に、「イ」と「ウ」の間には「イ」
にも「ウ」にも属さない中間的な波形データが存在す
る。したがって、入力テキスト情報が「アイウエオ」
で、「イ」を音声データベースに格納されている複数の
候補音素列から選択する場合には、「イ」の前後にそれ
ぞれ「ア」、「ウ」が格納されている候補音素列を選択
するのが最も好ましいことになる。それ故、この第２の
実施形態では、前後の音素列の比較に最も高い優先度を
与え、この比較を最初に行うようにしている。Comparing the preceding and succeeding phoneme strings first in this way makes it possible, first of all, to optimize the articulation when text is synthesized into speech. Here, "articulation" refers to the intermediate portion of waveform data that, when two consecutive phonemes are uttered, belongs to neither phoneme's waveform. Consider, for example, uttering "a-i-u-e-o" (アイウエオ) continuously. Between "a" and "i" there is intermediate waveform data that belongs to neither "a" nor "i", and likewise between "i" and "u" there is intermediate waveform data that belongs to neither "i" nor "u". Therefore, when the input text information is "a-i-u-e-o" and "i" is to be selected from a plurality of candidate phoneme strings stored in the speech database, it is most preferable to select the candidate stored with "a" before it and "u" after it. For this reason, the second embodiment gives the highest priority to the comparison of the preceding and succeeding phoneme strings and performs it first.

【００４７】第二に、上記のように前後の音素列の比較
を最初に行うことにより、音声データベースに格納され
ている波形データに係る話者の自然な（ここでいう「自
然な」とは主観的な意味である。例えば、話者が訛りを
有する者であれば、第三者にとっては必ずしも自然であ
るとは言い難いが、話者にとっては「自然」である。）
イントネーションを引き出すことができる場合が多くな
る。Second, performing the comparison of the preceding and succeeding phoneme strings first, as described above, more often makes it possible to bring out the natural intonation of the speaker whose waveform data is stored in the speech database. ("Natural" is used here in a subjective sense. For example, if the speaker has a regional accent, the intonation may not sound natural to a third party, but it is "natural" for the speaker.)

【００４８】次に、この第２の実施形態の動作につき説
明する。図６乃至図１０は、この第２の実施形態におけ
る選択手段５の動作についてのフローチャートである。
これらの図において、図２乃至図５の各ステップと同様
の内容のステップには添字「a」を付し、また、特定の
ステップについて他の図面における連結個所を示す連結
符号「Ａ」，「Ｂ」，…，「Ｇ」には、「1」，「1
1」，「12」等の添字を付してある。これら図６乃至図
１０のフローチャートは、図２乃至図５のフローチャー
トといくつかのステップの順序が異なるだけであり、各
ステップの内容については図２乃至図５において既述し
てある。したがって、図６乃至図１０のフローチャート
の個々のステップについての詳しい説明は省略し、概略
のみを簡単に説明する。Next, the operation of the second embodiment will be described. FIGS. 6 to 10 are flowcharts of the operation of the selecting means 5 in the second embodiment. In these figures, steps with the same content as steps in FIGS. 2 to 5 carry the suffix "a", and the connection symbols "A", "B", ..., "G", which indicate where a given step connects to another drawing, carry suffixes such as "1", "11", and "12". The flowcharts of FIGS. 6 to 10 differ from those of FIGS. 2 to 5 only in the order of some steps, and the content of each step has already been described with reference to FIGS. 2 to 5. A detailed description of the individual steps in FIGS. 6 to 10 is therefore omitted, and only an outline is given.

【００４９】まず、音素列波形選択部３が図示を省略し
てあるテキスト情報入力手段からテキスト情報を入力す
ると、基本周波数パターン作成手段４は、このテキスト
情報に対して形態素解析を行って１つのアクセントを有
する複数の音素列に分割し（ステップ１a）、各音素列
の基本周波数パターンを作成する（ステップ２a）。一
方、音素列出力部２もこのテキスト情報を入力し、これ
を分解して、音声データベース１に格納されている音素
列と等しい音素記号の音素列を音素列波形選択部３に出
力するが、選択手段５は、このような音素列出力部２か
らの各音素列を入力する（ステップ３a）。そして、選
択手段５は、注目音素列を決定し（ステップ４a）、音
声データベース１から全ての候補音素列についてのデー
タを取得する（ステップ５a）。ここまでは、第１の実
施形態と同様の処理順序である。First, when the phoneme string waveform selecting unit 3 receives text information from text information input means (not shown), the fundamental frequency pattern creating means 4 performs morphological analysis on the text information to divide it into a plurality of phoneme strings each having one accent (step 1a), and creates a fundamental frequency pattern for each phoneme string (step 2a). Meanwhile, the phoneme string output unit 2 also receives the text information, decomposes it, and outputs to the phoneme string waveform selecting unit 3 phoneme strings whose phoneme symbols are identical to those of phoneme strings stored in the speech database 1; the selecting means 5 receives each of these phoneme strings from the phoneme string output unit 2 (step 3a). The selecting means 5 then determines the phoneme string of interest (step 4a) and acquires the data for all candidate phoneme strings from the speech database 1 (step 5a). Up to this point, the processing order is the same as in the first embodiment.

【００５０】次いで、選択手段５は、注目音素列がテキ
スト中の先頭のものであるか否かを判別し（ステップ１
４a）、先頭である場合にはステップ２４a〜３２aの処
理、あるいは更にステップ４２a〜４９aの処理を行う。
すなわち、全体の音素記号について注目音素列の後音素
列と候補音素列の後音素列とを比較することにより候補
音素列の数を絞り込み、それでも候補が一つに絞り込め
なければ、注目音素列の後音素列の最初の１音素記号と
候補音素列の後音素列の最初の１音素記号とを比較する
ことにより候補音素列の数を絞り込むようにする。Next, the selecting means 5 determines whether the phoneme string of interest is the first one in the text (step 14a). If it is, the selecting means 5 performs steps 24a to 32a and, where needed, steps 42a to 49a as well. That is, it narrows down the candidate phoneme strings by comparing, over all phoneme symbols, the succeeding phoneme string of the phoneme string of interest with that of each candidate phoneme string; if this does not leave a single candidate, it narrows them further by comparing the first phoneme symbol of the succeeding phoneme string of the phoneme string of interest with the first phoneme symbol of the succeeding phoneme string of each candidate phoneme string.

【００５１】一方、ステップ１４aの判別において、注
目音素列が先頭のものでない場合、選択手段５はステッ
プ１５a〜２３aの処理、あるいは更にステップ３３a〜
４１aの処理を行う。すなわち、全体の音素記号につい
て注目音素列の前音素列と候補音素列の前音素列とを比
較することにより候補音素列の数を絞り込み、それでも
候補が一つに絞り込めなければ、注目音素列の前音素列
の最後の１音素記号と候補音素列の前音素列の最後の１
音素記号とを比較することにより候補音素列の数を絞り
込むようにする。On the other hand, if it is determined in step 14a that the phoneme string of interest is not the first one, the selecting means 5 performs steps 15a to 23a and, where needed, steps 33a to 41a as well. That is, it narrows down the candidate phoneme strings by comparing, over all phoneme symbols, the preceding phoneme string of the phoneme string of interest with that of each candidate phoneme string; if this does not leave a single candidate, it narrows them further by comparing the last phoneme symbol of the preceding phoneme string of the phoneme string of interest with the last phoneme symbol of the preceding phoneme string of each candidate phoneme string.

【００５２】この後、選択手段５は、基本周波数につい
てパターンの値と候補音素列の値とを比較することによ
り候補音素列の数を絞り込む処理を行い（ステップ６a
〜１３a）、それでも候補を一つに絞り込めなければ、
前回の決定済み波形データとの間のフォルマント平均自
乗誤差を演算し、このフォルマント平均自乗誤差の値が
最小の候補音素列を選択するようにする（ステップ５０
a〜５６a）。Thereafter, the selecting means 5 narrows down the candidate phoneme strings by comparing, for the fundamental frequency, the values of the pattern with the values of each candidate phoneme string (steps 6a to 13a). If this still does not leave a single candidate, it calculates the formant mean square error against the previously determined waveform data and selects the candidate phoneme string with the smallest formant mean square error (steps 50a to 56a).
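The difference between the two embodiments is the order in which the same narrowing checks run, which determines which check has priority when they disagree. The following sketch illustrates this under strong simplifications: each check is reduced to a set of accepted candidates, the context check is collapsed into a single stage, and the final formant comparison is replaced by a placeholder. All names and values are hypothetical.

```python
from typing import Callable, List, Sequence, Set

Stage = Callable[[List[str]], List[str]]


def make_stage(accepted: Set[str]) -> Stage:
    """Build a narrowing stage that keeps accepted candidates, or all of them if none is accepted."""
    def stage(candidates: List[str]) -> List[str]:
        kept = [c for c in candidates if c in accepted]
        return kept if kept else list(candidates)
    return stage


def select(candidates: List[str], stages: Sequence[Stage],
           final_pick: Callable[[List[str]], str]) -> str:
    """Run the stages in order and stop as soon as one candidate remains."""
    remaining = list(candidates)
    for stage in stages:
        remaining = stage(remaining)
        if len(remaining) == 1:
            return remaining[0]
    return final_pick(remaining)


f0_stage = make_stage({"c1"})        # hypothetical: only c1 satisfies expression (1)
context_stage = make_stage({"c3"})   # hypothetical: only c3 has matching neighbours


def pick_first(remaining: List[str]) -> str:
    return remaining[0]              # placeholder for the minimum formant error choice


all_candidates = ["c1", "c2", "c3"]
first_embodiment = select(all_candidates, [f0_stage, context_stage], pick_first)
second_embodiment = select(all_candidates, [context_stage, f0_stage], pick_first)
```

With these inputs the first ordering settles on "c1" and the second on "c3", showing how running the context comparison first gives it the highest priority.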

【００５３】上述したように、第２の実施形態では、選
択手段５が、最初に、音素記号についてテキスト情報に
おける注目音素列の前後の音素列と音声データベースに
格納されている各候補音素列の前後の音素列とが一致す
るか否かの判別を行うようにしているので、テキストを
音声合成するときの調音を最適化することができ、更
に、音声データベース１に録音されている話者のイント
ネーションを引き出すことができる場合が多くなる。As described above, in the second embodiment the selecting means 5 first determines, in terms of phoneme symbols, whether the phoneme strings before and after the phoneme string of interest in the text information match the phoneme strings before and after each candidate phoneme string stored in the speech database. This optimizes the articulation when text is synthesized into speech and, in addition, more often makes it possible to bring out the intonation of the speaker recorded in the speech database 1.

【００５４】次に、本発明の第３の実施形態につき説明
する。この第３の実施形態も、第１の実施形態を示す図
１と同じ構成であり、選択手段５が音声データベース１
から候補音素列を選択する際の判別動作として波形デー
タの振幅（音量）についての演算を追加するようにした
ものである。すなわち、この実施形態における選択手段
５は、複数の候補音素列からいずれか一つを選択する場
合に、その波形データの先端部の振幅が、既に選択され
ている前回波形データの終端部の振幅になるべく近いも
のを選択するようにしている。これにより、一層、自然
で滑らかな音声合成を行うことが可能になる。Next, a third embodiment of the present invention will be described. The third embodiment also has the same configuration as FIG. 1, which shows the first embodiment, and adds a calculation on the amplitude (volume) of the waveform data to the determination operations performed when the selecting means 5 selects a candidate phoneme string from the speech database 1. That is, when selecting one of a plurality of candidate phoneme strings, the selecting means 5 in this embodiment selects the one whose waveform data has a leading-end amplitude as close as possible to the trailing-end amplitude of the previously selected waveform data. This enables even more natural and smooth speech synthesis.

【００５５】図１１は、本実施形態において波形データ
の振幅を求める際に行う処理についての説明図である。
図１１（ａ）は既に選択されている前回波形データ又は
音声データベース１に格納されている候補音素列の波形
データである。選択手段５は、まず、この図１１（ａ）
に示す波形データに対して絶対値フィルタ処理を行い、
負側の波形部分を正側に折り返すことにより、図１１
（ｂ）に示すように波形データを正側部分のみにより形
成する。次いで、選択手段５は、この図１１（ｂ）に示
す波形データに対し所定の小さな窓幅を有する窓を用い
て最大値フィルタ処理を行い、図１１（ｃ）に示すよう
な包絡を得るようにする。このような包絡を得るように
する理由は、波形データの振幅の平均値をできるだけ正
確に得るようにするためである。つまり、振幅とは波形
データのピーク点の高さを意味するが、もし図１１
（ｂ）の波形データに対して振幅の平均値を演算してし
まうと、波形データの谷の部分の高さが含まれてしまう
ために、振幅の平均値が不正確なものとなるからであ
る。FIG. 11 illustrates the processing performed to obtain the amplitude of waveform data in this embodiment. FIG. 11(a) shows waveform data, either the previously selected waveform data or the waveform data of a candidate phoneme string stored in the speech database 1. The selecting means 5 first applies absolute value filtering to the waveform data of FIG. 11(a), folding the negative side of the waveform over to the positive side so that the waveform data consists only of positive values, as shown in FIG. 11(b). It then applies maximum value filtering to the waveform data of FIG. 11(b), using a window of a predetermined small width, to obtain an envelope as shown in FIG. 11(c). The envelope is obtained so that the average amplitude of the waveform data can be determined as accurately as possible. Amplitude means the height of the peaks of the waveform data; if the average were computed directly on the waveform data of FIG. 11(b), the heights of the valleys would be included and the average amplitude would be inaccurate.
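The two filtering steps of FIG. 11 can be sketched as follows. This is an illustration only: the text specifies a "predetermined small window width" but not how the window is positioned or how the edges are handled, so the centered window and the truncation at the edges are assumptions, as are the function name and the sample values.

```python
from typing import List, Sequence


def envelope(samples: Sequence[float], window: int) -> List[float]:
    """Absolute value filtering followed by maximum value filtering with a sliding window."""
    if window < 1:
        raise ValueError("window must be at least 1 sample wide")
    rectified = [abs(s) for s in samples]   # fold the negative side up, as in FIG. 11(b)
    half = window // 2
    result: List[float] = []
    for i in range(len(rectified)):
        lo = max(0, i - half)               # window is truncated at the edges (assumption)
        hi = min(len(rectified), i + half + 1)
        result.append(max(rectified[lo:hi]))  # local maximum, giving the envelope of FIG. 11(c)
    return result


wave = [0.0, 0.5, -1.0, 0.4, 0.0, -0.3, 0.8, -0.2]  # hypothetical samples
env = envelope(wave, window=3)

mean_rectified = sum(abs(s) for s in wave) / len(wave)
mean_envelope = sum(env) / len(env)
```

Averaging the envelope gives a larger value than averaging the rectified waveform, because the valleys between peaks no longer pull the average down, which is the reason the text gives for computing the envelope first.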

【００５６】図１２は、本実施形態における波形データ
の振幅平均値の求め方についての説明図である。選択手
段５は、図１１で説明した各処理を行うことにより、ま
ず、前回の選択により決定済みの波形データの包絡ＤA
を得るようにする。その後、選択手段５は、対象とする
候補音素列の波形データの包絡ＤBを得るようにする。FIG. 12 illustrates how the average amplitude of waveform data is obtained in this embodiment. By performing the processing described with reference to FIG. 11, the selecting means 5 first obtains the envelope DA of the waveform data determined by the previous selection. It then obtains the envelope DB of the waveform data of the candidate phoneme string under consideration.

【００５７】The waveform data of a candidate phoneme string to be joined to the previously determined waveform data is preferably one whose amplitude at the leading end hB of the envelope DB is as close as possible to the amplitude at the trailing end eA of the envelope DA. However, if the decision were based solely on the amplitude at the leading end hB, data whose amplitude changes abruptly just after hB might be selected. In the present embodiment, therefore, the average amplitude over a section I that includes the trailing end eA is compared with the average amplitude over a section J that includes the leading end hB.

【００５８】In FIG. 12, i and j denote sampling positions in the respective waveform data, and x(i) and y(j) denote the amplitude values at those positions. Suppose that the trailing end eA of the envelope DA is at i = 100 and the leading end hB of the envelope DB is at j = 1, and that section I is i = 96 to 100 and section J is j = 1 to 5. The amplitude values in section I are then x(96) to x(100), and those in section J are y(1) to y(5). The selection means 5 obtains the average of x(96) to x(100) and the average of y(1) to y(5), and determines whether their difference is no greater than a fixed value. From the standpoint of selecting the best of several candidate phoneme strings, however, it is not appropriate to treat all of these amplitude values equally. When judging whether the amplitude near the trailing end eA and the amplitude near the leading end hB are similar, data closer to eA and hB is worth more, and data farther from them is worth less. In the present embodiment, therefore, the amplitude value x(100) in the envelope DA is given the largest weighting coefficient Wx(100), and x(96) is given the smallest, Wx(96). Likewise, in the envelope DB, y(1) is given the largest weighting coefficient Wy(1), and y(5) is given the smallest, Wy(5).

【００５９】Using equations (3) and (4) below, the selection means 5 calculates the average amplitude Zx over section I of the envelope DA and the average amplitude Zy over section J of the envelope DB. In the example shown in FIG. 12, i in equation (3) runs from 96 to 100 and j in equation (4) runs from 1 to 5. The selection means 5 then obtains the absolute value α of the difference between Zx and Zy and determines whether α is smaller than a preset value α0, that is, whether equation (5) holds. Candidate phoneme strings for which equation (5) does not hold are removed, and those for which it holds are retained.
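The weighted comparison described above can be sketched as follows. The exact form of equations (3) and (4) is not reproduced in this text, so the normalized weighted mean used here is an assumption, as are the function names and the linear weights 1 to 5. Only the comparison |Zx − Zy| < α0 and the ordering of the weights come from the description.

```python
def weighted_mean(values, weights):
    # Assumed form of equations (3)/(4): weighted sum divided by total weight.
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def amplitude_matches(x_tail, y_head, alpha0, weights=(1, 2, 3, 4, 5)):
    """x_tail holds x(96)..x(100); y_head holds y(1)..y(5).

    The weights are hypothetical; the text requires only that samples
    nearer the joint (x(100) and y(1)) receive the larger weights.
    """
    zx = weighted_mean(x_tail, weights)                    # largest weight on x(100)
    zy = weighted_mean(y_head, tuple(reversed(weights)))   # largest weight on y(1)
    return abs(zx - zy) < alpha0                           # equation (5)


print(amplitude_matches([1, 1, 1, 1, 4], [4, 1, 1, 1, 1], alpha0=0.5))  # True
print(amplitude_matches([2, 2, 2, 2, 2], [6, 6, 6, 6, 6], alpha0=0.5))  # False
```

In the first call both weighted means equal 2.0, so the candidate is retained; in the second they are 2.0 and 6.0, so it would be removed.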

【００６０】[0060]

【数２】

As described above, the third embodiment adds to the second embodiment a determination of whether the difference between two average amplitudes is within a predetermined range: the average amplitude over a predetermined section near the trailing end of the waveform data already determined as optimal by the previous selection, and the average amplitude over a predetermined section near the leading end of the waveform data of each candidate phoneme string. This determination on the difference in average amplitude may be made either before or after the determination of whether the difference in fundamental frequency is within a predetermined range (steps 6a to 12a), but it must always be made after the determination on the preceding and following phoneme strings (steps 15a to 48a). As already noted, this is because the determination on the preceding and following phoneme strings is important both for optimizing articulation and for giving priority to bringing out the intonation of the speaker of the speech database.

【００６１】FIGS. 13 and 14 are flowcharts showing part of the operation of the selection means 5 in the third embodiment. FIGS. 6 to 8 and FIG. 10, which are flowcharts for the second embodiment, also apply to the third embodiment. The operation of the third embodiment is outlined below with reference to these figures.

【００６２】First, as in the second embodiment, steps 1a to 5a and step 14a shown in FIG. 6 are performed. The candidates are then narrowed down according to whether the phoneme symbols of the phoneme strings preceding and following the phoneme string of interest match those of the phoneme strings preceding and following each candidate phoneme string stored in the speech database (steps 15a to 49a shown in FIGS. 7 and 8). If these steps do not reduce the candidates to one, the selection means 5 narrows them down by comparing fundamental frequencies, again as in the second embodiment (steps 6a to 13a shown in FIG. 13). The only difference between FIG. 9, the flowchart of the second embodiment, and FIG. 13, the flowchart of the third embodiment, is that the connector symbol below step 12a changes from F12 to F13.

【００６３】Next, if comparing fundamental frequencies does not reduce the candidates to one (that is, if the determination in step 12a is "NO"), the selection means 5 narrows down the candidates on the basis of the amplitude of the waveform data, as shown in FIG. 14.

【００６４】Specifically, the selection means 5 first calculates, by equation (3), the average amplitude Zx near the trailing end of the previously determined waveform data and, by equation (4), the average amplitude Zy near the leading end of the waveform data of the candidate phoneme string currently under consideration (steps 57 and 58). It then determines whether the absolute value of the difference between these averages is smaller than the set value α0, that is, whether equation (5) holds (step 59), and stores the result in its own storage means (step 60).

【００６５】The selection means 5 then determines whether steps 57 to 60 have been completed for all candidate phoneme strings (step 61); if any remain, it returns to step 57 and repeats the processing. Once all candidate phoneme strings have been processed, it determines whether any of them satisfied equation (5) (step 62) and, if so, removes the candidates that did not satisfy it (step 63). If every candidate phoneme string satisfies equation (5), nothing is removed. It then determines whether the candidates have been reduced to one (step 64) and, if so, decides to select that candidate as the optimal waveform data (step 65).
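The control flow of steps 57 to 65 can be sketched as follows. The function name, the `satisfies` callback standing in for the test of equation (5), and the returned pair are illustrative assumptions and do not appear in the flowchart of FIG. 14.

```python
def narrow_by_amplitude(candidates, satisfies):
    """Return (remaining_candidates, selected_or_None).

    `satisfies(c)` is a hypothetical predicate standing in for
    equation (5) evaluated on candidate c.
    """
    # Steps 57-61: evaluate and record the result for every candidate.
    results = [(c, satisfies(c)) for c in candidates]
    passed = [c for c, ok in results if ok]

    # Step 62: if no candidate satisfies equation (5), remove nothing.
    if not passed:
        return list(candidates), None

    # Step 63: remove the candidates that failed (possibly none).
    # Steps 64-65: select when exactly one candidate remains.
    selected = passed[0] if len(passed) == 1 else None
    return passed, selected


remaining, selected = narrow_by_amplitude(["c1", "c2", "c3"], lambda c: c == "c2")
print(remaining, selected)  # ['c2'] c2
```

When `selected` is `None`, the caller would continue with the later narrowing steps, as the description goes on to explain.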

【００６６】If, on the other hand, the determination in step 62 or step 64 is "NO", the selection means 5 performs steps 50a to 56a shown in FIG. 10, as in the second embodiment. In the third embodiment described above, the narrowing of candidates based on the amplitude of the waveform data (steps 57 to 65) is performed after the narrowing based on the comparison of fundamental frequencies (steps 6a to 13a), but it may instead be performed before it. In either case, it must always be performed after the narrowing based on the comparison of preceding and following phoneme strings (steps 15a to 48a).

【００６７】As described above, the third embodiment adds to the second embodiment a process that determines whether the difference between the average amplitude over a predetermined section near the trailing end of the waveform data already determined as optimal by the previous selection and the average amplitude over a predetermined section near the leading end of the waveform data of each candidate phoneme string is within a predetermined range. It can therefore produce more natural and smoother synthesized speech than the second embodiment.

【００６８】Next, a fourth embodiment of the present invention will be described. The fourth embodiment also has the same configuration as FIG. 1, which shows the first embodiment, and further improves on the second embodiment with respect to the determination performed when the selection means 5 selects a candidate phoneme string from the speech database 1. In the second embodiment, the operation of narrowing down candidates by comparing the phoneme strings preceding and following the phoneme string of interest with those preceding and following each candidate phoneme string is shown in FIGS. 7 and 8. Summarized, that operation is as follows (for the case where the phoneme string of interest is neither the first nor the last in the text information).

【００６９】First, the preceding phoneme string of the phoneme string of interest is compared with that of each candidate phoneme string to narrow down the candidates, and only if this fails to reduce them to one are the following phoneme strings compared. If the candidates still cannot be reduced to one, the last phoneme symbol of the preceding phoneme string of the phoneme string of interest is compared with the last phoneme symbol of the preceding phoneme string of each candidate; if more than one candidate still remains, the first phoneme symbol of the following phoneme string of the phoneme string of interest is compared with the first phoneme symbol of the following phoneme string of each candidate. In the second embodiment, therefore, if the comparison of preceding phoneme strings reduces the candidates to one, the following phoneme strings are never compared, so the selection made by the selection means 5 depends heavily on the preceding phoneme string.

【００７０】Such a selection, which depends heavily on the preceding phoneme string, does not always optimize articulation when the text is synthesized. A concrete example is described with reference to the table of FIG. 15. Suppose that the phoneme string output part 2 decomposes the text 「お早うございます」 ("good morning"), input from a text information input means (not shown), and outputs to the phoneme string waveform selection part 3 the phoneme strings "o", "ha", "yo-", "go", "zai", and "masu", whose phoneme symbols equal those of phoneme strings stored in the speech database 1, and that the phoneme string of interest is "yo-".

【００７１】In this case, the speech database 1 stores first to fourth candidate phoneme strings "yo-" taken from the text contents shown in FIG. 15, and their respective preceding phoneme strings and candidate phoneme strings are as shown in the figure. Here, the preceding phoneme string of the first candidate is "a" rather than "ha", and that of the fourth candidate is "u" rather than "ku". It often happens that waveform data is stored in the speech database 1 with a consonant removed in this way, regardless of the text content. When the optimal waveform data is selected from these first to fourth candidates "yo-", the second embodiment first compares the preceding phoneme string of the phoneme string of interest with that of each candidate in order to narrow down the candidates. The third candidate "yo-", whose preceding phoneme string "ha" matches the phoneme symbols of the preceding phoneme string of the phoneme string of interest, is therefore selected as the optimal waveform data, and at that point the first, second, and fourth candidates "yo-" are discarded.

【００７２】However, when 「お早うございます」 is synthesized using the waveform data of the third candidate "yo-", the join between "ha" and "yo-" may be good, but the join between "yo-" and "go" has not been considered at all and is not necessarily good. In the fourth embodiment, therefore, the selection means 5 selects the optimal waveform data on the basis of both the preceding and the following phoneme strings, rather than the preceding phoneme string alone.

【００７３】Specifically, the selection means 5 of the present embodiment compares the preceding and following phoneme strings of the phoneme string of interest with those of each candidate, counts the total number N of matching phoneme symbols, and selects as the optimal waveform data the waveform data of the candidate phoneme string for which N is largest. For example, the preceding phoneme string "a" of the first candidate matches the "a" in the preceding phoneme string "ha" of the phoneme string of interest, and the following phoneme string "go" of the first candidate matches the following phoneme string "go" of the phoneme string of interest, so the total number of matching phoneme symbols for the first candidate is N = 3. Calculating N for the other candidates in the same way gives N = 0 for the second candidate, N = 2 for the third, and N = 2 for the fourth. The first candidate "yo-", with the largest total N = 3, is therefore selected as the optimal waveform data.
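The counting of N can be sketched as follows. The alignment rule is an assumption inferred from the worked examples in the description: preceding phoneme strings are compared symbol by symbol from their ends, and following phoneme strings from their beginnings. The function names are hypothetical.

```python
def match_preceding(target_prev, cand_prev):
    # Assumed rule: align at the end (the side adjacent to the phoneme of interest).
    return sum(a == b for a, b in zip(reversed(target_prev), reversed(cand_prev)))


def match_following(target_next, cand_next):
    # Assumed rule: align at the start (the side adjacent to the phoneme of interest).
    return sum(a == b for a, b in zip(target_next, cand_next))


def total_matches(target_prev, target_next, cand_prev, cand_next):
    """Total number N of matching phoneme symbols for one candidate."""
    return match_preceding(target_prev, cand_prev) + match_following(target_next, cand_next)


# First candidate of the example: preceding "a" against "ha", following "go" against "go".
print(total_matches("ha", "go", "a", "go"))  # 3
```

The candidate with the largest N would then be kept, for example with `max(candidates, key=...)` over these totals.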

【００７４】This first candidate "yo-" was taken from the same text, 「お早うございます」, as the input text, so it is the most desirable waveform data for speech synthesis. Because it happened to be stored in the speech database 1 with its preceding phoneme string as "a", however, the second embodiment did not select it as the optimal waveform data. According to the fourth embodiment, the selection means 5 can select it as the optimal waveform data, as described above.

【００７５】Next, the operation of the fourth embodiment is described with reference to the flowcharts of FIGS. 16 and 17, limited to the selection based on the comparison of preceding and following phoneme strings described above. (FIG. 16 corresponds to FIG. 6 of the second embodiment, and FIG. 17 corresponds to FIGS. 7 and 8 of the second embodiment.) The rest of the operation is the same as in the second embodiment and is omitted.

【００７６】First, when the phoneme string waveform selection part 3 receives text information from a text information input means (not shown), the fundamental frequency pattern generating means 4 performs morphological analysis on the text information, divides it into a plurality of phoneme strings each having one accent (step 1a), and creates a fundamental frequency pattern for each phoneme string (step 2a). Meanwhile, the phoneme string output part 2 also receives the text information, decomposes it, and outputs to the phoneme string waveform selection part 3 phoneme strings whose phoneme symbols equal those of the phoneme strings stored in the speech database 1; the selection means 5 receives each of these phoneme strings from the phoneme string output part 2 (step 3a). The selection means 5 then determines the phoneme string of interest (step 4a) and obtains data on all candidate phoneme strings from the speech database 1 (step 5a). Up to this point, the processing order is the same as in the second embodiment and, indeed, the first embodiment.

【００７７】Next, the selection means 5 determines whether the phoneme string of interest is the first one in the text (step 501). If it is, no preceding phoneme string exists, so the selection means 5 compares the following phoneme string of the phoneme string of interest with that of the candidate phoneme string and stores the total number N of matching phoneme symbols in its own memory (step 502). If the phoneme string of interest is not the first one, the selection means 5 determines whether it is the last one (step 503). If it is not the last one, both a preceding and a following phoneme string must exist, so the selection means 5 compares the preceding and following phoneme strings of the phoneme string of interest with those of the candidate phoneme string and stores the total number N of matching phoneme symbols in its own memory (step 504). If the phoneme string of interest is the last one, no following phoneme string exists, so the selection means 5 compares the preceding phoneme string of the phoneme string of interest with that of the candidate phoneme string and stores the total number N of matching phoneme symbols in its own memory (step 505).

【００７８】In this way, the selection means 5 determines whether the total number N has been stored for all candidate phoneme strings (step 506); if not, it returns to step 501 and obtains and stores N for the next candidate phoneme string. Once N has been stored for all candidate phoneme strings, the selection means 5 removes every candidate phoneme string other than those for which N is largest (step 507).
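Steps 501 to 507 can be sketched as follows. Representing a missing neighbor as `None`, the tuple layout of the candidates, and the positional matching rule are all assumptions made for illustration; the candidate data in the usage example is hypothetical and is not the content of FIG. 15.

```python
def positional_matches(a, b, from_end=False):
    # Assumed matching rule: compare symbols position by position.
    if from_end:
        a, b = a[::-1], b[::-1]
    return sum(p == q for p, q in zip(a, b))


def count_n(target_prev, target_next, cand_prev, cand_next):
    """Steps 501-505: target_prev is None for the first phoneme string in
    the text, target_next is None for the last one."""
    n = 0
    if target_prev is not None:
        n += positional_matches(target_prev, cand_prev or "", from_end=True)
    if target_next is not None:
        n += positional_matches(target_next, cand_next or "")
    return n


def keep_best(target_prev, target_next, candidates):
    """Steps 506-507: keep only the candidates with the largest N.
    `candidates` is a non-empty list of (prev, next) pairs."""
    scored = [(count_n(target_prev, target_next, p, n), (p, n)) for p, n in candidates]
    best = max(score for score, _ in scored)
    return [cand for score, cand in scored if score == best]


# Hypothetical candidates for a phoneme string of interest with neighbors "ha" and "go".
print(keep_best("ha", "go", [("a", "go"), ("ha", "ta"), ("ku", "ni")]))  # [('a', 'go')]
```

If several candidates tie for the largest N, all of them are returned, which corresponds to the case where the candidates have not yet been reduced to one.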

【００７９】The selection means 5 determines whether this removal has reduced the candidates to one (step 508). If so, it determines that candidate to be the optimal waveform data (step 509) and performs step 56a (see FIG. 10). If the candidates have not been reduced to one, it performs step 6a (see FIG. 9) and the subsequent steps. That is, as in the second embodiment, the selection means 5 narrows down the candidate phoneme strings by comparing, for the fundamental frequency, the value of the pattern with the value of each candidate phoneme string (steps 6a to 13a). If the candidates still cannot be reduced to one, it calculates the formant mean square error with respect to the previously determined waveform data and selects the candidate phoneme string with the smallest formant mean square error (steps 50a to 56a).

【００８０】As described above, in the fourth embodiment the selection means 5 selects the optimal waveform data on the basis of both the preceding and the following phoneme strings, rather than the preceding phoneme string alone. Articulation that the second embodiment could not always optimize can therefore be optimized.

【００８１】Next, a fifth embodiment of the present invention will be described. The fifth embodiment further improves the determination performed by the selection means 5 of the fourth embodiment when selecting a candidate phoneme string. The fourth embodiment seeks to optimize articulation by comparing the phoneme strings preceding and following the phoneme string of interest with those preceding and following each candidate phoneme string and selecting, as the optimal waveform data, the candidate phoneme string with the largest total number of matching phoneme symbols. According to the inventor's research, however, there are cases in which even the fourth embodiment cannot optimize articulation sufficiently. A concrete example of such a case is described with reference to the table of FIG. 18.

【００８２】Suppose, for example, that the phoneme string output part 2 decomposes the text 「お早うございます」 ("good morning"), input from a text information input means (not shown), and outputs to the phoneme string waveform selection part 3 the phoneme strings "o", "ha", "yo-", "go", and "zaimasu", whose phoneme symbols equal those of phoneme strings stored in the speech database 1, and that the phoneme string of interest is "go". In this case, the speech database 1 stores first to fourth candidate phoneme strings "go" taken from the text contents shown in FIG. 18, and their respective preceding phoneme strings and candidate phoneme strings are as shown in the figure.

【００８３】Consider the total number N of matching phoneme symbols before and after each candidate phoneme string. For the first candidate, the preceding phoneme string "yo-" matches the preceding phoneme string "yo-" of the phoneme string of interest in all three characters, and the following phoneme string "za" matches the "za" of the following phoneme string "zaimasu" of the phoneme string of interest in two characters, so N is 5. For the second candidate, the number of characters that match between its preceding phoneme string "no" and the "yo-" of the phoneme string of interest is zero: the character "o" in "no" does not match the character "-" in "yo-", and the character "n" in "no" does not match the character "o" in "yo-". The number of characters that match between its following phoneme string "do-" and the following phoneme string "zaimasu" of the phoneme string of interest is also zero, so N for the second candidate is zero. In the same way, N is 3 for the third candidate and 7 for the fourth. According to the fourth embodiment, therefore, the "go" of the fourth candidate, with N = 7, is selected as the optimal waveform data.

【００８４】However, examining the text content associated with each candidate phoneme string shows that the text used for the first candidate is 「お早うございます」, identical to the input text. The "go" of the first candidate should therefore be selected as the optimal waveform data. Nevertheless, because the fourth embodiment uses the total number N of matching phoneme symbols before and after as its selection criterion, the "go" of the fourth candidate is selected. Analyzing the value N = 7 of the fourth candidate shows that the number of matching characters in the preceding phoneme string is zero and the number in the following phoneme string is 7, so the matches are heavily skewed toward the following phoneme string. For the first candidate, by contrast, N is 5, somewhat smaller than the fourth candidate's 7, but the number of matching characters is 3 on the preceding side and 2 on the following side, which can be regarded as roughly balanced. In other words, a large number of matching characters N is not by itself a sufficient criterion for selecting the optimal waveform data; the number of matching characters in the preceding phoneme string and the number in the following phoneme string must also be balanced.

[0085] Therefore, in this fifth embodiment, the selecting means 5 selects the optimal waveform data from among candidate phoneme strings for which the number of matching characters in the preceding phoneme string and the number of matching characters in the following phoneme string are each reasonably large and are reasonably balanced with each other. The selecting means 5 in this embodiment contains a fuzzy arithmetic circuit. Based on the preset fuzzy rules below, it calculates an evaluation value y representing the state of balance between the two counts (referred to in this specification as the "balance evaluation value") and uses y to select the optimal waveform data from among the candidate phoneme strings. In the fuzzy rules below, N1 is the number of matching characters in the preceding phoneme string, N2 is the number of matching characters in the following phoneme string, and S, M, and L are linguistic labels meaning Small, Middle, and Large, respectively. The selecting means 5 converts these linguistic labels into numerical values and selects the optimal waveform data from among the candidate phoneme strings whose value is at or above a fixed threshold.

IF N1 is S AND N2 is S THEN y is S
IF N1 is S AND N2 is L THEN y is M
IF N1 is L AND N2 is S THEN y is M
IF N1 is L AND N2 is L THEN y is L

[0086] Applying the above fuzzy rules to the first through fourth candidate phoneme strings in FIG. 18 gives balance evaluation values y of L, S, M, and M, respectively. According to the fifth embodiment, therefore, the "go" of the first candidate phoneme string, which is inherently the most preferable, is selected by the selecting means 5 as the optimal waveform data.
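The fuzzy rules above can be illustrated with a minimal Python sketch. The specification does not give the membership functions or the numeric values assigned to the labels, so everything numeric here is an assumption made for illustration: linear "Small" and "Large" memberships that saturate at a hypothetical count of 4, AND taken as the minimum, label values S = 0, M = 0.5, L = 1, and a weighted-average defuzzification. The function names are likewise hypothetical.

```python
def small(n, cap=4.0):
    """Assumed membership in 'Small': 1 at n = 0, falling linearly to 0 at n >= cap."""
    return max(0.0, 1.0 - n / cap)


def large(n, cap=4.0):
    """Assumed membership in 'Large': 0 at n = 0, rising linearly to 1 at n >= cap."""
    return min(1.0, n / cap)


# Assumed numeric values for the output labels S, M, L (not given in the source).
LABEL_VALUE = {"S": 0.0, "M": 0.5, "L": 1.0}

# The four rules from the text: (membership for N1, membership for N2, label for y).
RULES = [
    (small, small, "S"),
    (small, large, "M"),
    (large, small, "M"),
    (large, large, "L"),
]


def balance_evaluation(n1, n2):
    """Return a numeric balance evaluation value y in [0, 1] for counts N1 and N2."""
    weighted = 0.0
    total = 0.0
    for mu1, mu2, label in RULES:
        strength = min(mu1(n1), mu2(n2))  # AND taken as the minimum (assumption)
        weighted += strength * LABEL_VALUE[label]
        total += strength
    # Weighted average of the label values (assumed defuzzification method).
    return weighted / total if total > 0 else 0.0


# Counts quoted in the text: first candidate (N1 = 3, N2 = 2), fourth candidate (N1 = 0, N2 = 7).
y_first = balance_evaluation(3, 2)
y_fourth = balance_evaluation(0, 7)
```

Under these assumed memberships the first candidate scores higher than the fourth, which agrees with the L versus M ranking stated in the text, although the exact numbers depend entirely on the assumed functions.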

[0087] Next, the operation of the fifth embodiment is described with reference to the flowchart of FIG. 19, limited to the selection operation based on the comparison of the preceding and following phoneme strings described above (FIG. 19 corresponds to FIG. 17 of the fourth embodiment). The operation of the other parts is the same as in the second or fourth embodiment and is therefore omitted.

[0088] The selecting means 5 determines whether the phoneme string of interest is the first one in the text (step 501a). If it is the first, no preceding phoneme string exists, so the selecting means 5 compares the following phoneme string of the phoneme string of interest with the following phoneme string of the candidate phoneme string and stores the number N2 of matching phoneme symbols in its own memory (step 502a). If the phoneme string of interest is not the first, the selecting means 5 determines whether it is the last (step 503a). If it is not the last, both a preceding and a following phoneme string must exist, so the selecting means 5 compares the preceding and following phoneme strings of the phoneme string of interest with those of the candidate phoneme string and stores the respective numbers N1 and N2 of matching phoneme symbols in its own memory (step 504a). If the phoneme string of interest is the last, no following phoneme string exists, so the selecting means 5 compares the preceding phoneme string of the phoneme string of interest with the preceding phoneme string of the candidate phoneme string and stores the number N1 of matching phoneme symbols in its own memory (step 505a).
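The branching in steps 501a to 505a can be sketched in Python as follows. This is only an illustration: this passage does not say how matching phoneme symbols are counted, so the sketch assumes that matches are counted contiguously, starting at the boundary adjacent to the phoneme string of interest and stopping at the first mismatch. The function names and the use of None for a missing context are hypothetical.

```python
def count_matches_after(target_following, candidate_following):
    """Count matching symbols, moving away from the phoneme string of interest."""
    count = 0
    for a, b in zip(target_following, candidate_following):
        if a != b:
            break  # assumed: counting stops at the first mismatch
        count += 1
    return count


def count_matches_before(target_preceding, candidate_preceding):
    """Same count for the preceding context, scanned backwards from the boundary."""
    return count_matches_after(target_preceding[::-1], candidate_preceding[::-1])


def context_match_counts(target_prev, target_next, cand_prev, cand_next):
    """Return (N1, N2) for one candidate phoneme string.

    target_prev is None when the phoneme string of interest is first in the text,
    and target_next is None when it is last; the corresponding count is then None.
    """
    if target_prev is None:
        # Steps 501a and 502a: first in the text, only N2 is defined.
        return None, count_matches_after(target_next, cand_next)
    if target_next is None:
        # Steps 503a and 505a: last in the text, only N1 is defined.
        return count_matches_before(target_prev, cand_prev), None
    # Step 504a: both contexts exist.
    return (count_matches_before(target_prev, cand_prev),
            count_matches_after(target_next, cand_next))
```

Calling context_match_counts once per candidate phoneme string and storing the results corresponds to the loop that the flowchart closes at step 506a.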

[0089] In this way, the selecting means 5 determines whether the numbers N1 and N2 of matching phoneme symbols have been stored for all candidate phoneme strings (step 506a). If not, it returns to step 501a and obtains and stores the counts N1 and N2 for the next candidate phoneme string. Once N1 and N2 have been stored for all candidate phoneme strings, the selecting means 5 calculates the balance evaluation value y using the above fuzzy rules (step 510a). The fuzzy rules above apply to the case where the phoneme string of interest is neither the first nor the last. When it is the first or the last, the only variable in the condition part is N2 or N1, respectively, so the fuzzy rules become simpler. The selecting means 5 then removes all candidate phoneme strings other than those with the maximum balance evaluation value y (step 507a).

[0090] The selecting means 5 determines whether this removal has narrowed the candidates down to one (step 508a). If so, it determines that candidate to be the optimal waveform data (step 509a) and performs the processing of step 56a (see FIG. 10). If not, it performs the processing from step 6a onward (see FIG. 9). That is, as in the second or fourth embodiment, the selecting means 5 narrows down the candidate phoneme strings by comparing the fundamental frequency value of the pattern with that of each candidate phoneme string (steps 6a to 13a). If the candidates still cannot be narrowed down to one, it calculates the formant mean square error with respect to the previously determined waveform data and selects the candidate phoneme string with the smallest error (steps 50a to 56a).
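The three-stage narrowing described above (balance evaluation value, then fundamental frequency, then formant mean square error) can be sketched in Python as follows. This is a simplified illustration, not the flowchart itself: the candidate record layout, the single tolerance parameter, and the fallback when no candidate lies within the fundamental frequency range are all assumptions, and the formant error is supplied by a hypothetical callable rather than computed.

```python
def select_optimal(candidates, f0_target, f0_tolerance, formant_mse):
    """Pick one candidate by successive narrowing.

    candidates   : list of dicts with keys 'y' (balance evaluation value) and
                   'f0' (fundamental frequency); this layout is an assumption.
    f0_target    : value taken from the fundamental frequency pattern.
    f0_tolerance : half-width of the allowed range (assumed to be one number).
    formant_mse  : callable returning the formant mean square error between a
                   candidate and the previously determined waveform data.
    """
    # Step 507a: keep only the candidates with the maximum balance evaluation value.
    best_y = max(c["y"] for c in candidates)
    remaining = [c for c in candidates if c["y"] == best_y]
    if len(remaining) == 1:  # steps 508a and 509a
        return remaining[0]

    # Steps 6a to 13a (simplified): keep candidates whose fundamental frequency
    # lies within the allowed range of the pattern value.
    in_range = [c for c in remaining if abs(c["f0"] - f0_target) <= f0_tolerance]
    if in_range:  # assumption: if none are in range, keep the previous set
        remaining = in_range
    if len(remaining) == 1:
        return remaining[0]

    # Steps 50a to 56a: choose the smallest formant mean square error.
    return min(remaining, key=formant_mse)
```

Each stage runs only when the previous one leaves more than one candidate, so the more expensive formant comparison is reached only for ties.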

[0091] As described above, in the fifth embodiment the selecting means 5 selects the optimal waveform data from among candidate phoneme strings for which the numbers of matching characters in the preceding and following phoneme strings are each reasonably large and are reasonably balanced with each other. The fifth embodiment can therefore also optimize articulation in cases that the fourth embodiment could not always optimize.

[0092] From the viewpoint of articulation optimization, the fifth embodiment can be said to be superior to the fourth embodiment, but because it must perform fuzzy operations, its processing time is longer. When a shorter processing time is desired, the configuration of the fourth embodiment is therefore preferable. In addition, although the fifth embodiment described above uses fuzzy operations to calculate the balance evaluation value y, a configuration using another technique, such as a neural network, is also possible.

[0093]

[Effects of the Invention] As described above, according to the present invention, the phoneme string waveform selection unit comprises fundamental frequency pattern creating means that, based on a morphological analysis of the text information, creates a fundamental frequency pattern for each phoneme string having only one accent, and selecting means that compares each fundamental frequency of the candidate waveform data with the value of this fundamental frequency pattern and selects, as the optimal waveform data, candidate waveform data for which the difference between the two is within a predetermined range and which satisfies a predetermined criterion. Natural, smooth speech output is therefore obtained without complicated prosody processing. As a result, the present invention shortens the processing time needed to synthesize natural, smooth speech and suppresses the increase in memory capacity, achieving a substantial cost reduction.

[0094] Furthermore, by first determining, in terms of phoneme symbols, whether the phoneme strings before and after the phoneme string of interest in the text information match the phoneme strings before and after each candidate phoneme string stored in the speech database, the articulation of the synthesized text can be optimized, and in addition the intonation of the speaker recorded in the speech database 1 can be brought out in more cases.

[0095] Moreover, adding a process that determines whether the difference between the average amplitude value over a predetermined section near the end of the waveform data already determined as the optimal waveform data by the previous selection operation and the average amplitude value over a predetermined section near the beginning of the waveform data of each candidate phoneme string is within a predetermined range makes even more natural and smooth speech synthesis possible.

[0096] By comparing the phoneme strings before and after the phoneme string of interest with those before and after each candidate phoneme string and selecting, as the optimal waveform data, the candidate phoneme string with the largest total number of matching phoneme symbols, the articulation of the synthesized text can be optimized more accurately.

[0097] Finally, by selecting the optimal waveform data from among candidate phoneme strings for which the numbers of matching characters in the preceding and following phoneme strings are each reasonably large and are reasonably balanced with each other, the accuracy of articulation optimization can be improved further.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a speech synthesizer according to a first embodiment of the present invention.

FIG. 2 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the first embodiment.

FIG. 3 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the first embodiment.

FIG. 4 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the first embodiment.

FIG. 5 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the first embodiment.

FIG. 6 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the second embodiment.

FIG. 7 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the second embodiment.

FIG. 8 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the second embodiment.

FIG. 9 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the second embodiment.

FIG. 10 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the second embodiment.

FIG. 11 is an explanatory diagram of the processing performed when obtaining the amplitude of waveform data in the third embodiment.

FIG. 12 is an explanatory diagram of how the average amplitude value of waveform data is obtained in the third embodiment.

FIG. 13 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the third embodiment.

FIG. 14 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the third embodiment.

FIG. 15 is a chart for explaining the problem to be solved by the fourth embodiment.

FIG. 16 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the fourth embodiment.

FIG. 17 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the fourth embodiment.

FIG. 18 is a chart for explaining the problem to be solved by the fifth embodiment.

FIG. 19 is a flowchart for explaining the operation of the phoneme string waveform selection unit 3 in the fifth embodiment.

[Explanation of symbols]

1 Speech database
2 Phoneme string output unit
3 Phoneme string waveform selection unit
4 Fundamental frequency pattern creating means
5 Selecting means
6 Waveform data connection unit
7 Voice output unit

Claims

[Claims]

1. A speech synthesizer comprising: a speech database in which waveform data and feature data of phoneme strings used for speech synthesis are stored; a phoneme string output unit that receives text information for speech synthesis, decomposes the text information, and sequentially outputs phoneme strings whose phoneme symbols are identical to those of phoneme strings stored in the speech database; a phoneme string waveform selection unit that receives each phoneme string from the phoneme string output unit, takes the phoneme string for which waveform data is to be generated as a phoneme string of interest, determines phoneme strings having the same phoneme symbols as the phoneme string of interest to be candidate phoneme strings, extracts the waveform data and feature data of the candidate phoneme strings from the speech database, selects one of the extracted waveform data as the waveform data of the optimal phoneme string for the phoneme string of interest, and sequentially outputs the optimal waveform data for each phoneme string of interest; a waveform data connection unit that connects the waveform data of the phoneme strings from the phoneme string waveform selection unit; and a voice output unit that outputs the text information as speech based on the connected waveform data input from the waveform data connection unit, wherein the phoneme string waveform selection unit comprises: fundamental frequency pattern creating means that, based on a morphological analysis of the text information, divides the text information into units of phoneme strings each having only one accent and creates a fundamental frequency pattern for each of the divided phoneme strings; and selecting means that compares the fundamental frequency of the waveform data, included in the feature data of each candidate phoneme string for one phoneme string of interest, with the value of the fundamental frequency at a predetermined position of the fundamental frequency pattern of the one-accent phoneme string corresponding to that phoneme string of interest, selects as the optimal waveform data the waveform data of a candidate phoneme string for which the difference between the two is within a predetermined range and which satisfies a predetermined criterion, and sequentially outputs the optimal waveform data for each phoneme string of interest in accordance with the input order of the text information.

2. The speech synthesizer according to claim 1, wherein, when there are a plurality of candidate phoneme string waveform data for which the difference between the two is within the predetermined range and which satisfy the predetermined criterion, the selecting means obtains the formant mean square error between the waveform data already selected as the optimal waveform data by the previous selection operation and the waveform data of each candidate phoneme string, and selects the waveform data with the smallest error as the optimal waveform data.

3. The speech synthesizer according to claim 1 or 2, wherein the determination by the selecting means as to whether the predetermined criterion is satisfied is a determination as to whether, in terms of phoneme symbols, the phoneme strings before and after the phoneme string of interest in the text information match the phoneme strings before and after each candidate phoneme string stored in the speech database, and the determination as to whether the difference between the two fundamental frequency values is within the predetermined range is performed after this determination of matching of the preceding and following phoneme strings.

4. The speech synthesizer according to claim 1, wherein the determination by the selecting means as to whether the predetermined criterion is satisfied is performed by: comparing the phoneme strings before and after the phoneme string of interest in the text information with the phoneme strings before and after each candidate phoneme string stored in the speech database; obtaining, for each candidate phoneme string, the number of phoneme symbols that match between the preceding phoneme string of the phoneme string of interest and the preceding phoneme string of the candidate phoneme string; obtaining, for each candidate phoneme string, the number of phoneme symbols that match between the following phoneme string of the phoneme string of interest and the following phoneme string of the candidate phoneme string; obtaining, for each candidate phoneme string, the sum of the number of matching phoneme symbols in the preceding phoneme string and the number of matching phoneme symbols in the following phoneme string; and determining whether a given candidate phoneme string is the candidate phoneme string for which the sum is the maximum, and wherein the determination as to whether the difference between the two fundamental frequency values is within the predetermined range is performed after the determination as to whether the candidate phoneme string is the one for which the sum is the maximum.

5. The speech synthesizer according to claim 1, wherein the determination by the selecting means as to whether the predetermined criterion is satisfied is performed by: comparing the phoneme strings before and after the phoneme string of interest in the text information with the phoneme strings before and after each candidate phoneme string stored in the speech database; obtaining, for each candidate phoneme string, the number of phoneme symbols that match between the preceding phoneme string of the phoneme string of interest and the preceding phoneme string of the candidate phoneme string; obtaining, for each candidate phoneme string, the number of phoneme symbols that match between the following phoneme string of the phoneme string of interest and the following phoneme string of the candidate phoneme string; calculating, based on a preset rule, an evaluation value for the balance between the number of matching phoneme symbols in the preceding phoneme string and the number of matching phoneme symbols in the following phoneme string; and determining whether a given candidate phoneme string is the candidate phoneme string with the best balance, and wherein the determination as to whether the difference between the two fundamental frequency values is within the predetermined range is performed after the determination as to whether the candidate phoneme string is the one with the best balance.

6. The speech synthesizer according to claim 3, wherein the determination by the selecting means as to whether the predetermined criterion is satisfied further includes a determination as to whether the difference between the average amplitude value over a predetermined section near the end of the waveform data already determined as the optimal waveform data by the previous selection operation and the average amplitude value over a predetermined section near the beginning of the waveform data of each candidate phoneme string is within a predetermined range, and the determination concerning the difference between the average amplitude values is performed after the determination concerning the preceding and following phoneme strings and either before or after the determination as to whether the difference between the two fundamental frequency values is within the predetermined range.

7. The speech synthesizer according to claim 6, wherein the average amplitude value over the predetermined section near the end of the determined waveform data is calculated from waveform data weighted such that the amplitude values increase toward the end, and the average amplitude value over the predetermined section near the beginning of the waveform data of each candidate phoneme string is calculated from waveform data weighted such that the amplitude values increase toward the beginning.