JPH07181995A

JPH07181995A - Device and method for voice synthesis

Info

Publication number: JPH07181995A
Application number: JP5323648A
Authority: JP
Inventors: Kaoru Tsukamoto; 薫塚本
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1993-12-22
Filing date: 1993-12-22
Publication date: 1995-07-21

Abstract

PURPOSE: To generate a synthesized voice signal that sounds more like a real human voice and is closer to natural speech. CONSTITUTION: Two storage units are prepared: a voice segment data storage unit 14B, which stores environment-independent voice segment data generated by analyzing a voice signal uttered clearly, one sound at a time, so as to have no phonological environment; and a voice segment data storage unit 14A, which stores voice segment data with an extraction environment, generated by analyzing a voice signal uttered with a phonological environment. Selection information storage means 16 and 18, which store selection information on the voice segment data for each type of voice unit, are also prepared. For each voice unit in the phoneme sequence obtained by converting the input character information, the selection information storage means are referred to. If there is voice segment data with an extraction environment whose phonological environment is highly similar to that of the voice unit in the phoneme sequence and which connects well with the immediately preceding voice segment data, that data is selected; otherwise the environment-independent voice segment data is selected (13, 17, and 19).

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a voice synthesizer and a voice synthesis method that convert input character string information into voice and output it.

[0002]

[Prior Art] A voice synthesizer that takes character information (for example, text data) as input, converts it into voice, and outputs it has no restriction on its output vocabulary. It is therefore expected to find use in a variety of fields as a voice synthesis technology that replaces the record-and-playback type. For example, it can be used to read aloud text data created with a word processor or the like, or to read aloud target-language text data produced by a translation process.

[0003] FIG. 2 shows the configuration of a conventional voice synthesizer (a Japanese text-to-speech converter) that takes Japanese (mixed kanji-kana text) as input. An outline of the conventional apparatus is given below with reference to FIG. 2.

[0004] In FIG. 2, the text analysis unit 101 uses the pronunciation dictionary 102 to generate a phoneme/prosody symbol string from the mixed kanji-kana text supplied by the character information input unit 100. Here, the phoneme/prosody symbol string is an intermediate-language description, written as a character string, of the reading, accent, intonation, and so on of the input sentence. The reading and accent of each word are registered in the pronunciation dictionary 102, and the text analysis unit 101 generates the phoneme/prosody symbol string while referring to this dictionary.

[0005] Based on the phoneme/prosody symbol string, the synthesis parameter generation unit 103 generates the parameters for voice synthesis (called synthesis parameters): voice segments (the type of sound), phoneme durations (the length of each sound), and the fundamental frequency (voice pitch) pattern. Of these, a voice segment is the basic unit of voice that is concatenated to build the synthesized waveform, and it is generated from utterance data recorded when words or the like are spoken. In the following, a combination of basic speech elements such as CV (consonant-vowel) or VCV (vowel-consonant-vowel) is itself called a voice unit, and an element that realizes the waveform of a voice unit is called a voice segment. One voice unit corresponds, for example, to a set of several voice segments. The voice segment data is stored in the voice segment data storage unit 104, which is a ROM or the like. The synthesis parameter generation unit 103 identifies the voice units in the phoneme/prosody symbol string and retrieves the corresponding voice segment data.

[0006] The voice synthesis unit 105 generates a synthesized waveform based on the synthesis parameters generated by the synthesis parameter generation unit 103. The resulting synthesized voice signal is output as sound through a speaker or transmitted to another device over a line.

[0007] Humans adjust the shape of the vocal tract to each phoneme in order to utter different phonemes. In ordinary continuous speech such as conversation, however, the vocal tract cannot change shape abruptly, so each phoneme is influenced by its neighbors, and in the transition region between two phonemes the sound deviates from its nominal frequencies. This continuous change of acoustic properties in the region between phonemes is called coarticulation (articulatory coupling). In recent years, with the aim of improving the quality of synthesized voice, methods of generating synthesis parameters that take coarticulation into account have been considered for voice synthesizers as well.

[0008] A first conceivable method of generating synthesis parameters prepares, for one and the same voice unit such as a CV (consonant-vowel) or VCV (vowel-consonant-vowel), several variants that differ in the combination of voice segment data corresponding to the voice unit, that is, several voice units with different phonological environments (environments in which the preceding and following phonemes differ). It then selects and uses the variant that matches the phonological environment in the input sentence (and hence in the phoneme/prosody symbol string).

[0009] A second conceivable method starts from the observation that distortion at the junctions between voice units is unavoidable when voice is synthesized by concatenating voice units (and hence voice segments). To reduce the number of junctions themselves, it divides the input phoneme sequence so as to avoid joining at places where the connection distortion would be large, and selects voice units of arbitrary length (see, for example, the reference below).

[0010] Reference: Naoto Iwahashi and Yoshinori Sagisaka, "Subjective and Objective Evaluation of a Distortion-Minimizing Speech Synthesis Method," Proceedings of the Acoustical Society of Japan, 2-2-15, March 1992.

[0011]

[Problems to Be Solved by the Invention] The first and second methods of generating synthesis parameters described above, which take the coarticulation of speech into account, can be expected to improve the quality of synthesized voice in terms of naturalness and resemblance to a real human voice.

[0012] However, if the first method is applied to account for the phonological environment and the second method is also applied to reduce connection distortion (the number of junctions), then for each input sentence the task becomes one of finding the optimal choice among a large number of combinations of voice units, which requires a great deal of computation.
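To give a feel for the size of this search, the sketch below counts candidate syntheses under a deliberately simple model that is not taken from the source: a sentence of n phonemes may be cut into contiguous voice units at any of its n - 1 boundaries, and every resulting unit is assumed to have the same number k of stored variants. Under those assumptions the count is k(k + 1)^(n - 1); the function names are hypothetical.

```python
from itertools import product


def count_combinations(n_phonemes: int, variants_per_unit: int) -> int:
    """Closed form k * (k + 1) ** (n - 1) for the toy model described above."""
    if n_phonemes < 1:
        return 0
    return variants_per_unit * (variants_per_unit + 1) ** (n_phonemes - 1)


def count_by_enumeration(n_phonemes: int, variants_per_unit: int) -> int:
    """Brute-force check: try every cut pattern and multiply the variant choices."""
    if n_phonemes < 1:
        return 0
    total = 0
    # Each of the n - 1 boundaries is either a cut (True) or not (False).
    for cuts in product([False, True], repeat=n_phonemes - 1):
        n_units = 1 + sum(cuts)
        total += variants_per_unit ** n_units
    return total


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, count_combinations(n, 3))
```

Even with only three variants per unit, the count grows exponentially with sentence length, which is the computational burden the paragraph above describes.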

[0013] Moreover, given the growth in computation and the limits on memory, it is impossible to provide every phonological environment for every voice unit. When no voice unit with a matching phonological environment is available, one with a different environment must be substituted. But voice units that have undergone different coarticulation are acoustically different, so joining them increases the connection distortion and degrades the quality of the synthesized sound.

[0014] In short, improving sound quality requires taking the phonological environment of each voice unit into account, yet no appropriate measure is available when the input sentence calls for a voice unit whose phonological environment is not held in the storage unit.

[0015] The present invention has been made in view of the above, and aims to provide a voice synthesizer and a voice synthesis method that can generate, with a small memory capacity and a small amount of processing, a synthesized voice signal that sounds more like a real human voice and is close to natural speech.

[0016]

[Means for Solving the Problems] To solve these problems, the first aspect of the present invention provides the following means in a voice synthesizer that converts input character information into a voice signal.

[0017] (1) An environment-independent voice segment data storage unit that stores environment-independent voice segment data generated by analyzing a voice signal uttered clearly, one sound at a time, so as to have no phonological environment;
(2) a storage unit for voice segment data with extraction environment, which stores voice segment data with an extraction environment generated by analyzing a voice signal uttered so as to have a phonological environment;
(3) selection information storage means that stores selection information for these voice segment data for each type of voice unit; and
(4) synthesis parameter generating means that refers to the selection information storage means for each voice unit in the phoneme sequence into which the input character information has been converted and that, if there is voice segment data with an extraction environment that approximates the phonological environment of the voice unit in the phoneme sequence and connects well with the immediately preceding voice segment data, selects it, and otherwise selects the environment-independent voice segment data.

[0018] Preferably, the selection information storage means comprises (3-1) a voice unit dictionary that stores, for each voice unit, the phonological environment of the voice segment data with extraction environment, and (3-2) a connection-part information storage unit that stores, for each voice unit, connection-part information for the voice segment data with extraction environment. The synthesis parameter generating means then preferably comprises (4-1) a voice unit selection check unit that refers to the voice unit dictionary for each voice unit in the phoneme sequence into which the input character information has been converted and narrows down the candidates by how closely they approximate the phonological environment of the voice unit in the phoneme sequence, (4-2) a similarity calculation unit that obtains, from the contents of the connection-part information storage unit, the similarity between each remaining candidate voice segment data with extraction environment and the immediately preceding voice segment data, and (4-3) a synthesis parameter generation unit that narrows the candidates to a single item of voice segment data with extraction environment on the basis of the phonological environment and the connection-part similarity, and decides whether to select that data or the environment-independent voice segment data.

[0019] In the second aspect of the present invention, a voice synthesis method for converting input character information into a voice signal is configured as follows.

[0020] The method uses an environment-independent voice segment data storage unit that stores environment-independent voice segment data generated by analyzing a voice signal uttered clearly, one sound at a time, so as to have no phonological environment; a storage unit for voice segment data with extraction environment, which stores voice segment data with an extraction environment generated by analyzing a voice signal uttered so as to have a phonological environment; and selection information storage means that stores selection information for these voice segment data for each type of voice unit. For each voice unit in the phoneme sequence into which the input character information has been converted, the selection information storage means is referred to. If there is voice segment data with an extraction environment that approximates the phonological environment of the voice unit in the phoneme sequence and connects well with the immediately preceding voice segment data, it is selected; otherwise, the environment-independent voice segment data is selected.

[0021] Preferably, the selection information storage means is made up of a voice unit dictionary that stores, for each voice unit, the phonological environment of the voice segment data with extraction environment, and a connection-part information storage unit that stores, for each voice unit, connection-part information for the voice segment data with extraction environment. First, the voice unit dictionary is referred to for each voice unit in the phoneme sequence into which the input character information has been converted, and the candidates are narrowed down by how closely they approximate the phonological environment of the voice unit in the phoneme sequence. Next, the similarity between each remaining candidate voice segment data with extraction environment and the immediately preceding voice segment data is obtained from the contents of the connection-part information storage unit. Then, after the candidates have been narrowed to a single item on the basis of the phonological environment and the connection-part similarity, it is decided whether to select that voice segment data with extraction environment or the environment-independent voice segment data.

[0022]

[Operation] In the voice synthesizer and voice synthesis method according to the present invention, two storage units are prepared: an environment-independent voice segment data storage unit that stores environment-independent voice segment data generated by analyzing a voice signal uttered clearly, one sound at a time, so as to have no phonological environment, and a storage unit for voice segment data with extraction environment, which stores voice segment data with an extraction environment generated by analyzing a voice signal uttered so as to have a phonological environment. In addition, selection information storage means that stores selection information for these voice segment data for each type of voice unit is prepared.

[0023] Then, for each voice unit in the phoneme sequence into which the input character information has been converted, the selection information storage means is referred to. If there is voice segment data with an extraction environment that approximates the phonological environment of that voice unit and connects well with the immediately preceding voice segment data, it is selected; otherwise, the environment-independent voice segment data is selected.

[0024] Because synthesis thus gives weight to coarticulation and takes into account the phonological environments of both the input sentence (phoneme sequence) and the voice units (and hence the voice segment data), a synthesized voice that sounds like a real human voice and is closer to natural speech can be obtained. Even when no voice unit with a suitable phonological environment exists, a standard voice unit is available, so an unsuitable voice unit that would increase the connection distortion is never used for synthesis. Furthermore, since it suffices to prepare mainly the frequently occurring voice segment data with extraction environment, both the memory capacity and the amount of processing can be reduced.

[0025]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings. This embodiment, too, is directed to Japanese sentences (mixed kanji-kana text).

[0026] FIG. 1 is a block diagram showing the functional configuration of this embodiment. In FIG. 1, the embodiment comprises a character information input unit 10, a text analysis unit 11, a pronunciation dictionary 12, a synthesis parameter generation unit 13, a voice segment data storage unit 14, a voice synthesis unit 15, a voice unit dictionary 16, a distance value calculation unit 17, a connection-part acoustic parameter data storage unit 18, and a voice unit selection check unit 19.

[0027] The character information input unit 10, the text analysis unit 11, the pronunciation dictionary 12, and the voice synthesis unit 15 operate in the same way as their conventional counterparts.

[0028] The basic function of the synthesis parameter generation unit 13 is also the same as that of the conventional synthesis parameter generation unit 103: based on the phoneme/prosody symbol string, it generates the parameters for voice synthesis, namely voice segments, phoneme durations, and the fundamental frequency pattern.

[0029] In this embodiment, the voice segment data storage unit 14 used by the synthesis parameter generation unit 13 stores two kinds of data: voice segment data with extraction environment 14A, generated by analyzing naturally uttered voice data that naturally contains speech properties such as coarticulation, and environment-independent voice segment data 14B, generated by analyzing voice data uttered clearly, one sound at a time.

[0030] In this embodiment, the synthesis parameter generation unit 13 is accompanied by the voice unit selection check unit 19 and the distance value calculation unit 17. Together, the synthesis parameter generation unit 13, the voice unit selection check unit 19, and the distance value calculation unit 17 can be said to constitute the synthesis parameter generating means.

[0031] The voice unit selection check unit 19 searches the voice unit dictionary 16 using the phoneme sequence and selects the voice units whose phonological environment is most similar to that of the input phoneme sequence.

[0032] FIG. 3 shows the structure of the voice unit dictionary 16. The voice unit dictionary 16 comprises, for example, a pointer table 16A for voice units with extraction environment, a storage unit 16B for voice units with extraction environment, a pointer table 16C for environment-independent voice units, and a storage unit 16D for environment-independent voice units.

[0033] The storage unit 16B for voice units with extraction environment records each voice unit (hereinafter called a voice unit with extraction environment) together with the phoneme sequence (extraction environment) of the utterance from which the voice segment data with extraction environment 14A was generated, so that the phonological environment of that voice unit can be determined. It also records the address in the voice segment data storage unit 14 at which the voice segment data with extraction environment 14A for that voice unit is stored. Information on voice units that are the same but differ in phonological environment (hereinafter, voice units of the same type) is stored, for example, contiguously. The pointer table 16A for voice units with extraction environment is configured so that the start address of the group of entries for voice units of the same type stored in the storage unit 16B can be retrieved from the type (voice unit name) of the input voice unit.

[0034] In FIG. 3, "/i/iki" denotes the word-initial voice unit /i/ taken from the utterance "iiki", and "/i/NdeaN" denotes the word-initial voice unit /i/ taken from the utterance "iNdeaN" (N is used to distinguish the moraic nasal "ん" from the na-row syllables). P/i/ denotes the start address (pointer) of the group of entries for the word-initial voice unit /i/.

[0035] The pointer table 16C for environment-independent voice units and the storage unit 16D for environment-independent voice units serve the environment-independent voice segment data 14B. Their configuration is almost the same as that of the pointer table 16A and the storage unit 16B for voice units with extraction environment, so their description is omitted. Note that only one environment-independent voice unit exists for each type of voice unit.
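The dictionary layout described for FIG. 3 can be sketched as a pointer table over a contiguous entry list. This is only an illustration: the class and field names, and the shorthand "contextual" (with extraction environment) and "neutral" (environment-independent), are hypothetical, and list indices stand in for the memory addresses of the actual device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitEntry:
    unit: str              # voice unit name, e.g. "/i/"
    source_utterance: str  # extraction environment, e.g. "iiki"
    segment_address: int   # location of the segment data in storage unit 14


class VoiceUnitDictionary:
    """Toy model of dictionary 16: pointer tables over stored entries."""

    def __init__(self, contextual, neutral):
        # 16B: entries of the same unit type are stored next to each other.
        self._entries = sorted(contextual, key=lambda e: e.unit)
        # 16A: unit name -> (start index, count) into self._entries.
        self._pointers = {}
        for index, entry in enumerate(self._entries):
            start, count = self._pointers.get(entry.unit, (index, 0))
            self._pointers[entry.unit] = (start, count + 1)
        # 16C/16D: exactly one environment-independent entry per unit type.
        self._neutral = {e.unit: e for e in neutral}

    def contextual_entries(self, unit):
        """All entries with extraction environment for one unit type."""
        start, count = self._pointers.get(unit, (0, 0))
        return self._entries[start:start + count]

    def neutral_entry(self, unit):
        """The single environment-independent entry for one unit type."""
        return self._neutral[unit]
```

With entries for /i/ taken from "iiki" and "iNdeaN", `contextual_entries("/i/")` returns both in stored order, and a unit type with no contextual entry yields an empty list, which is the case where the environment-independent entry has to be used.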

[0036] The distance value calculation unit 17 calculates the distance value at the junction between a candidate voice unit with extraction environment and the already selected voice unit immediately preceding it (which is not necessarily a voice unit with extraction environment); in this embodiment the distance value serves as the measure of similarity. It does so by reading the connection-part acoustic parameter data from the connection-part acoustic parameter data storage unit 18. When the voice segment data with extraction environment 14A is generated from uttered natural voice, acoustic parameter data for the connection parts of each voice unit is generated at the same time, and this connection-part acoustic parameter data is stored in the connection-part acoustic parameter data storage unit 18.
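The text does not say which acoustic parameters or which distance measure the distance value calculation unit 17 uses. As a minimal sketch, assuming each connection part is described by a fixed-length vector of acoustic parameters and that a Euclidean distance is acceptable, the calculation could look like this (the function name is hypothetical):

```python
import math


def junction_distance(prev_tail, cand_head):
    """Euclidean distance between two equal-length acoustic parameter vectors.

    prev_tail: parameters at the end of the already selected voice unit.
    cand_head: parameters at the start of the candidate voice unit.
    A smaller value means a smoother junction (higher similarity).
    """
    if len(prev_tail) != len(cand_head):
        raise ValueError("parameter vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(prev_tail, cand_head)))
```

Any other measure over the stored connection-part data would fit the same role, provided a smaller value consistently means a better junction.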

[0037] On the basis of the phoneme sequence, the synthesis parameter generation unit 13 uses the phonological environment of each candidate voice unit narrowed down by the voice unit selection check unit 19 and the connection-part distance value (similarity) between voice units calculated by the distance value calculation unit 17. If there is a voice unit with extraction environment whose phonological environment and distance value are suitable, it retrieves the voice segment data with extraction environment 14A corresponding to that voice unit from the voice segment data storage unit 14; if there is none, it retrieves the environment-independent voice segment data 14B from the voice segment data storage unit 14.

[0038] The voice synthesizer of the embodiment, made up of the units functioning as described above, operates as a whole as shown in FIG. 4.

[0039] First, character information (text data) is read in (step 201). The character information is analyzed and divided into phrases, and each phrase is converted into a phoneme/prosody symbol string (step 202).

[0040] Then, proceeding along the phoneme sequence in the phoneme/prosody symbol string, one voice unit type at a time is taken as the target; the voice unit dictionary 16 is searched by that voice unit type, and the voice units with extraction environment belonging to that type are retrieved (step 203).

[0041] If one or more voice units with extraction environment are retrieved, the phonological environment of the input phoneme sequence is compared with that of each retrieved voice unit, and the voice units with the closest phonological environment are kept as candidates (step 204). In this selection, if any of the current voice units with extraction environment comes from the same utterance (extraction environment) as the voice unit with extraction environment selected immediately before, it is given priority. The closeness of the phonological environment is preferably judged by considering the match of the preceding phonemes as well as that of the following phonemes, but the following phonemes may be given priority over the preceding phonemes.
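Step 204 can be sketched as a scoring pass over the retrieved candidates. The scoring below is an assumption made for illustration: closeness is the number of matching phonemes counted outward from the unit, following context is ranked ahead of preceding context, and the same-utterance preference is applied as a tie-break among equally close candidates (the source does not say how the two criteria interact). The dictionary keys "prev", "next", and "utterance" are hypothetical.

```python
def count_matching(a, b):
    """Length of the common prefix of two phoneme sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def context_score(target_prev, target_next, cand_prev, cand_next):
    """Return (following matches, preceding matches).

    Tuples compare element by element, so following context dominates.
    """
    following = count_matching(target_next, cand_next)
    # Preceding context is compared outward from the unit, so reverse both.
    preceding = count_matching(target_prev[::-1], cand_prev[::-1])
    return (following, preceding)


def closest_candidates(target_prev, target_next, candidates, last_utterance=None):
    """Keep the candidates whose phonological environment is closest."""
    if not candidates:
        return []
    scored = [
        (context_score(target_prev, target_next, c["prev"], c["next"]), c)
        for c in candidates
    ]
    best = max(score for score, _ in scored)
    kept = [c for score, c in scored if score == best]
    # Prefer candidates cut from the same utterance as the previous choice.
    same = [
        c for c in kept
        if last_utterance is not None and c["utterance"] == last_utterance
    ]
    return same or kept
```

More than one candidate can survive this pass when several stored units share the same environment; the distance-based narrowing of the next step resolves that.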

[0042] Next, the distance value between the connection part of the already selected immediately preceding voice unit (a voice unit with extraction environment or an environment-independent voice unit) and the connection part of each candidate voice unit with extraction environment is calculated. If the processing of step 204 has left more than one candidate (that is, if several candidates have the same phonological environment), the candidates are narrowed to the single voice unit with extraction environment that has the smallest distance value (the highest similarity) (step 205). No distance value is calculated for a word-initial voice unit.

[0043] It is then judged whether the phonological environment and the connection distance value of the one remaining speech unit with extraction environment are appropriate (step 206). The condition on the phonological environment may be set as appropriate according to the number of speech units with extraction environment stored in the speech unit dictionary 16, the target sound quality, and so on. One example is that the phonemes following the speech unit (one, or two or more) match the phonemes at the corresponding positions of the input phoneme sequence. The connection distance value is judged by comparison with a predetermined threshold.
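The appropriateness test of step 206 might look like the sketch below. The parameter names and default values (`min_following_matches`, `max_distance`) are hypothetical, since the text leaves both the condition and the threshold to the implementer. Treating a unit with no following input phonemes as needing no match is also an assumption, made so that a word-final unit can pass.

```python
from typing import Optional


def is_appropriate(cand_following: str,
                   input_following: str,
                   distance: Optional[float],
                   min_following_matches: int = 1,
                   max_distance: float = 1.0) -> bool:
    """Check the phonological environment and the connection distance (step 206)."""
    matches = 0
    for x, y in zip(cand_following, input_following):
        if x != y:
            break
        matches += 1
    # A word-final unit has nothing after it, so nothing can be required to match.
    required = min(min_following_matches, len(input_following))
    if matches < required:
        return False
    # distance is None for the word-initial unit, where no distance is calculated.
    return distance is None or distance <= max_distance
```

For example, `is_appropriate("ssyai", "renai", 0.1)` is false because the first following phoneme already differs, whatever the distance.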

[0044] If the one speech unit with extraction environment that remains as the candidate is appropriate, that unit is selected and the speech segment data with extraction environment 14A corresponding to it is adopted (step 207).

[0045] On the other hand, if the one remaining candidate is inappropriate, or if no speech unit with extraction environment is found when the speech unit dictionary 16 is searched in step 203, the environment-independent speech segment data 14B is adopted (step 208). This data corresponds to the target speech unit type and was formed from speech uttered without coarticulation.

[0046] After that, it is judged whether the speech segment data to be adopted has been determined for all speech unit types of the target phrase, including the word-final speech unit type (step 209). If not, processing returns to step 203 described above.
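The loop formed by steps 203 to 209 can be outlined as follows. This is a control-flow sketch only: `pick_extracted` is a hypothetical callback standing in for steps 203 to 206 (dictionary search, narrowing, and the appropriateness check), and the tuple layout of a selection is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# (unit type, kind of segment data adopted, source utterance if any)
Selection = Tuple[str, str, Optional[str]]


def select_segments(
    unit_types: List[str],
    pick_extracted: Callable[[str, Optional[Selection]], Optional[str]],
) -> List[Selection]:
    """Decide, unit by unit, which speech segment data to adopt for one phrase."""
    selections: List[Selection] = []
    previous: Optional[Selection] = None
    for unit_type in unit_types:                      # repeated until step 209 is satisfied
        source = pick_extracted(unit_type, previous)  # steps 203 to 206
        if source is not None:
            chosen = (unit_type, "14A", source)       # step 207: data with extraction environment
        else:
            chosen = (unit_type, "14B", None)         # step 208: environment-independent data
        selections.append(chosen)
        previous = chosen
    return selections
```

Passing the previous selection to the callback lets it apply the same-utterance priority and the connection distance, both of which depend on the unit chosen immediately before.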

[0047] Then prosodic parameters (parameters specifying phoneme duration, fundamental frequency pattern, power, and so on) are also set based on the prosodic information in the phoneme/prosody symbol string and on the determined speech segment data (step 210).

[0048] The series of processes consisting of steps 203 to 210 is repeated for each phrase, although the per-phrase loop is not drawn in the figure. This series of processes may also be executed in parallel with the processing of step 202 or with the processing of step 211 described later.

[0049] Once the synthesis parameters (prosodic parameters, speech segment data, and so on) have been determined in this way, a speech signal is synthesized and output (steps 211 and 212). The output may be sound from a loudspeaker or transmission over a line to another device.

[0050] Next, the speech synthesis operation of the embodiment, in particular the determination of the speech segment data to be used, is described with a concrete example. The input sentence (phrase) is assumed to be "いられない" shown in FIG. 5(1), and the speech units are assumed to be based on VCV (vowel-consonant-vowel).

[0051] As shown in FIG. 5(2), this input sentence "いられない" is converted into the phoneme sequence "irarenai".
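The paragraphs that follow treat "irarenai" as the speech unit types [i], [ira], [are], [ena], and [ai]. A splitting rule that reproduces this decomposition can be sketched as below. The rule itself is inferred from that one example and is an assumption, as are the five-vowel set and the one-character-per-phoneme encoding. Consonants after the last vowel are not handled.

```python
from typing import List

VOWELS = frozenset("aiueo")  # assumed vowel inventory for romanized phonemes


def split_into_units(phonemes: str) -> List[str]:
    """Split a phoneme string into a word-initial unit plus vowel-to-vowel units.

    Each later unit runs from one vowel to the next vowel, so adjacent
    units overlap on their boundary vowel.
    """
    vowel_positions = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if not vowel_positions:
        raise ValueError("sequence contains no vowel")
    first = vowel_positions[0]
    units = [phonemes[: first + 1]]  # word-initial unit
    for left, right in zip(vowel_positions, vowel_positions[1:]):
        units.append(phonemes[left: right + 1])
    return units


print(split_into_units("irarenai"))  # ['i', 'ira', 'are', 'ena', 'ai']
```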

[0052] First, the word-initial speech unit type [i] becomes the target, the speech unit dictionary 16 is searched, and the speech units with extraction environment that have the unit /i/ at the beginning of a word are retrieved. Specifically, the extraction-environment speech unit pointer table 16A of the speech unit dictionary 16 is accessed using the information of the word-initial speech unit type [i] as an address to obtain the pointer value P/i/, and the extraction-environment speech unit storage section 16B is accessed with this pointer value P/i/ to retrieve the speech units with extraction environment.
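The two-stage access through table 16A and storage section 16B can be pictured with two dictionaries. The layout below is purely illustrative: the key format, the pointer value 0, and the representation of entries as strings are assumptions, and the listed entries are the four units named for FIG. 5(A) in the next paragraph.

```python
from typing import Dict, List, Tuple

# Hypothetical pointer table 16A: (position in word, unit type) -> pointer value.
POINTER_TABLE_16A: Dict[Tuple[str, str], int] = {("initial", "i"): 0}

# Hypothetical storage section 16B: pointer value -> units with extraction environment.
STORAGE_16B: Dict[int, List[str]] = {
    0: ["/i/iki", "/i/NdeaN", "/i/rechigau", "/i/rassyai"],
}


def lookup_extracted_units(position: str, unit_type: str) -> List[str]:
    """Follow the pointer for a unit type; an empty list means nothing was found."""
    pointer = POINTER_TABLE_16A.get((position, unit_type))
    if pointer is None:
        return []
    return STORAGE_16B[pointer]
```

An empty result corresponds to the case in which no unit with extraction environment is found, which leads to the environment-independent data being adopted.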

[0053] FIG. 5(A) shows the speech units with extraction environment retrieved at this point. The first unit retrieved, "/i/iki", and the second, "/i/NdeaN", drop out of candidacy when the third, "/i/rechigau", is examined, because in the third unit the phoneme "r" immediately after the speech unit /i/ matches the corresponding phoneme of the input phoneme sequence. At this point the third unit is the candidate. However, "/i/rassyai", which is retrieved later, matches the phonological environment of the input phoneme sequence better, so it replaces the candidate.

[0054] This operation is repeated, and here it is assumed that only "/i/rassyai" remains as the candidate. Because the unit is word-initial, no distance value is calculated, but the validity of the speech unit with extraction environment "/i/rassyai" is judged from its phonological environment. Although the result depends on the judgment condition, the unit is judged valid because two following phonemes match. Therefore, for the word-initial speech unit type [i], the speech segment data with extraction environment 14A for the word-initial unit /i/ taken from the coarticulated natural utterance "irassyai" is adopted.

[0055] Next, the word-medial speech unit type [ira] becomes the target and the speech unit dictionary 16 is searched. Suppose that, as shown in FIG. 5(B), "/ira/ssyai" and "s/ira/byouosi" are retrieved. The former, "/ira/ssyai", remains as the candidate, but because its following phoneme does not match the input phoneme sequence "irarenai", it is judged inappropriate in the validity check.

[0056] Therefore the environment-independent speech unit /ira/ is selected, and it is decided to adopt the corresponding environment-independent speech segment data 14B. This data is fetched as follows: the environment-independent speech unit pointer table 16C of the speech unit dictionary 16 is accessed using the information of the speech unit type [ira] as an address to obtain a pointer value; the environment-independent speech unit storage section 16D is accessed with this pointer value to obtain the storage address of the speech segment data 14B; and the speech segment data storage unit 14 is accessed at this address.
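The three accesses described above (table 16C, then storage section 16D, then storage unit 14) amount to a chain of lookups. The sketch below uses dictionaries to stand in for the three memories. The pointer value, the address, and the byte string are placeholders invented for illustration, not values from the patent.

```python
from typing import Dict

# Hypothetical contents; only the chain 16C -> 16D -> 14 follows the text.
POINTER_TABLE_16C: Dict[str, int] = {"ira": 7}          # unit type -> pointer value
STORAGE_16D: Dict[int, int] = {7: 0x2000}               # pointer value -> storage address
SEGMENT_STORAGE_14: Dict[int, bytes] = {0x2000: bytes([1, 2, 3])}  # address -> segment data 14B


def fetch_independent_segment(unit_type: str) -> bytes:
    """Fetch environment-independent speech segment data for a unit type."""
    pointer = POINTER_TABLE_16C[unit_type]   # first access: pointer table 16C
    address = STORAGE_16D[pointer]           # second access: storage section 16D
    return SEGMENT_STORAGE_14[address]       # third access: storage unit 14
```

A missing unit type raises `KeyError` here, reflecting the premise that an environment-independent unit is prepared for every unit type.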

[0057] Next, the following word-medial speech unit type [are] becomes the target. The details are omitted here, but suppose that, among the several speech units with extraction environment shown in FIG. 5(C), only "emoiw/are/nu" remains as the candidate and is judged valid, so that the speech segment data with extraction environment 14A for the unit /are/ taken from the coarticulated natural utterance "emoiwarenu" is adopted.

[0058] Next, the following word-medial speech unit type [ena] becomes the target, and a search of the speech unit dictionary 16 retrieves the four speech units with extraction environment shown in FIG. 5(D). Narrowing the candidates by agreement of the phonological environment with the input phoneme sequence leaves "katajik/ena/i" and "a/ena/i". The two candidates are therefore compared by the magnitude of the distortion (distance value) that results when each is connected to the unit /are/ of the previously selected speech unit with extraction environment "emoiw/are/nu".

[0059] FIG. 6 is a conceptual illustration of the distance value calculation. The connection acoustic parameter data CT at the tail of the speech unit /are/ in the speech unit with extraction environment "emoiw/are/nu" is taken out, the connection acoustic parameter data CH at the head of the speech unit /ena/ in "katajik/ena/i" or "a/ena/i" is taken out, the distance value between them is obtained, and the candidates are narrowed down to the one with the smaller distance value. Here the final phoneme "e" of the preceding unit /are/ is reached from a consonant, so "katajik/ena/i", in which a consonant precedes the initial phoneme "e" of the unit /ena/, has the smaller distance value and remains as the final candidate.

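The comparison between CT and CH described for FIG. 6 can be sketched with a Euclidean distance over parameter vectors. The patent does not specify the metric or the dimensionality of the connection acoustic parameter data, so both are assumptions here, and the numeric vectors are made up solely to show the mechanics.

```python
import math
from typing import Dict, List, Sequence


def connection_distance(ct: Sequence[float], ch: Sequence[float]) -> float:
    """Euclidean distance between tail parameters CT and head parameters CH."""
    if len(ct) != len(ch):
        raise ValueError("parameter vectors must have the same length")
    return math.dist(ct, ch)


# Hypothetical parameter vectors (illustrative values only).
ct_are: List[float] = [1.0, 2.0]                 # tail of /are/ in "emoiw/are/nu"
ch_ena: Dict[str, List[float]] = {
    "katajik/ena/i": [1.0, 2.5],                 # head of /ena/, consonant before "e"
    "a/ena/i": [4.0, 6.0],                       # head of /ena/, vowel before "e"
}

best = min(ch_ena, key=lambda name: connection_distance(ct_are, ch_ena[name]))
print(best)  # katajik/ena/i
```

`math.dist` requires Python 3.8 or later.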
[0060] Suppose that the speech unit with extraction environment "katajik/ena/i" passes the validity check. The speech segment data with extraction environment 14A for the unit /ena/ taken from the coarticulated natural utterance "katajikenai" is therefore adopted.

[0061] Next, the word-final speech unit type [ai] becomes the target. The details are omitted here, but among the several speech units with extraction environment shown in FIG. 5(E), only "katajiken/ai/" remains as the candidate and is judged valid, and the speech segment data with extraction environment 14A for the unit /ai/ taken from the coarticulated natural utterance "katajikenai" is adopted.

[0062] The speech segment data 14A stored in the speech segment data storage unit 14 is generated by cutting segments out of utterances of words and the like, and is based on VCV units. By referring to the speech unit dictionary 16, however, it is also possible to select speech segments extracted consecutively from one continuous utterance of a word or the like, that is, a group of speech segments that are phonemically contiguous. This saves the effort of evaluating combinations and suppresses connection distortion between segments. In this embodiment such a selection rule is provided as described above, so for the speech unit types [ena] and [ai] of the input phoneme sequence "irarenai", the units /ena/ and /ai/ extracted from the single word utterance "katajikenai" are applied, and the connection distortion between them is eliminated.
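Whether two selected units are consecutive pieces of one utterance can be tested from their positions. The sketch below assumes that each unit is stored with its source utterance and index range (the `Located` type is hypothetical) and that adjacent VCV units share their boundary vowel, as /ena/ and /ai/ do in "katajikenai".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Located:
    utterance: str  # phonemes of the source utterance, one character each (assumption)
    start: int      # index of the unit's first phoneme
    end: int        # index one past the unit's last phoneme


def continues_previous(previous: Located, current: Located) -> bool:
    """True when `current` directly follows `previous` in the same utterance.

    Adjacent VCV units overlap on one vowel, so the current unit starts
    on the last phoneme of the previous unit.
    """
    return (previous.utterance == current.utterance
            and current.start == previous.end - 1)


are = Located("emoiwarenu", 5, 8)    # "are"
ena = Located("katajikenai", 7, 10)  # "ena"
ai = Located("katajikenai", 9, 11)   # "ai"
print(continues_previous(ena, ai))   # True
print(continues_previous(are, ena))  # False
```

A selector could use this test to prefer a contiguous unit before computing any distance value.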

[0063] Thus, according to the above embodiment, emphasis is placed on coarticulation in speech, and synthesis takes into account the phonological environments of the input sentence (phoneme sequence) and of the speech units, so a synthesized speech signal that has the feel of a real voice and is closer to natural speech can be obtained.

[0064] In addition, standard speech units are prepared, so even when no speech unit with a suitable phonological environment exists, an unsuitable speech unit that would cause large connection distortion is never used for synthesis.

[0065] Furthermore, because standard speech units are prepared, there is no need to prepare a very large number of speech units that take the phonological environment into account, so both the memory capacity and the processing load can be reduced.

[0066] Moreover, obtaining a synthesized speech signal that has the feel of a real voice and is closer to natural speech relies mainly on the search for speech units, and only a few distance values need to be calculated, so the processing load is reduced in this respect as well. For example, even if the amount of speech segment data with extraction environment (and hence the number of speech units) is doubled, only the processing required to search the speech unit dictionary doubles; the overall processing load does not double. As a result, the speech segment data can easily be expanded to improve sound quality, and speech can be synthesized at a quality that makes full use of the capabilities of the device.

[0067] In addition, when a speech unit is selected, a speech unit extracted from the same utterance as the immediately preceding speech unit is selected preferentially, which also reduces connection distortion between speech units.

[0068] In the above embodiment the similarity of the connection between speech units (and hence between speech segment data) is judged by a distance value, but it may be judged by another parameter. The timing of the distance value calculation is also not limited to that of the embodiment; for example, the distance value may be calculated as a confirmation when a candidate is replaced on the basis of the degree of approximation of the phonological environment.

[0069] Furthermore, the input sentences to which the present invention applies are of course not limited to Japanese sentences.

【００７０】[0070]

[Effects of the Invention] As described above, according to the present invention, environment-independent speech segment data, generated by analyzing a speech signal uttered clearly one sound at a time so as to have no phonological environment, and speech segment data with extraction environment, generated by analyzing a speech signal uttered so as to have a phonological environment, are both prepared, and selection information for these speech segment data is prepared for each type of speech unit. The selection information is referred to for each speech unit in the phoneme sequence into which the input character information is converted. If there is speech segment data with extraction environment that is close to the phonological environment of the speech unit in the phoneme sequence and connects well with the immediately preceding speech segment data, it is selected; otherwise the environment-independent speech segment data is selected. This realizes a speech synthesis apparatus and a speech synthesis method that can produce, with a small memory capacity and a small processing load, a synthesized speech signal that has the feel of a real voice and is closer to natural speech.

[Brief description of drawings]

FIG. 1 is a functional block diagram of the embodiment.

FIG. 2 is a functional block diagram of a conventional apparatus.

FIG. 3 is an explanatory diagram showing the configuration of the speech unit dictionary of the embodiment.

FIG. 4 is a flowchart showing the speech synthesis operation of the embodiment.

FIG. 5 is an explanatory diagram of the operation for a specific input sentence in the embodiment.

FIG. 6 is a conceptual diagram of the distance value calculation method of the embodiment.

[Explanation of symbols]

13: synthesis parameter generation unit; 14: speech segment data storage unit; 14A: speech segment data with extraction environment; 14B: environment-independent speech segment data; 15: speech synthesis unit; 16: speech unit dictionary; 17: distance value calculation unit; 18: connection acoustic parameter data storage unit; 19: speech unit selection check unit.

Claims

[Claims]

1. A speech synthesis apparatus for converting input character information into a speech signal, comprising: an environment-independent speech segment data storage unit that stores environment-independent speech segment data generated by analyzing a speech signal uttered clearly one sound at a time so as to have no phonological environment; a speech segment data with extraction environment storage unit that stores speech segment data with extraction environment generated by analyzing a speech signal uttered so as to have a phonological environment; selection information storage means that stores selection information for these speech segment data for each type of speech unit; and synthesis parameter generating means that, for each speech unit in the phoneme sequence into which the input character information is converted, refers to the selection information storage means and selects speech segment data with extraction environment that is close to the phonological environment of the speech unit in the phoneme sequence and connects well with the immediately preceding speech segment data if such data exists, and otherwise selects the environment-independent speech segment data.

2. The speech synthesis apparatus according to claim 1, wherein the selection information storage means comprises a speech unit dictionary that stores, for each speech unit, the phonological environment of the speech segment data with extraction environment, and a connection information storage unit that stores connection information of the speech segment data with extraction environment; and wherein the synthesis parameter generating means comprises: a speech unit selection check unit that refers to the speech unit dictionary for each speech unit in the phoneme sequence into which the input character information is converted and narrows down the candidates based on the degree of approximation to the phonological environment of the speech unit in the phoneme sequence; a similarity calculation unit that obtains, from the stored contents of the connection information storage unit, the similarity between the narrowed-down speech segment data with extraction environment and the immediately preceding speech segment data; and a synthesis parameter generation unit that narrows the candidates of speech segment data with extraction environment down to one based on the phonological environment and the similarity of the connection, and then determines whether to select that speech segment data with extraction environment or the environment-independent speech segment data.

3. A speech synthesis method for converting input character information into a speech signal, wherein: an environment-independent speech segment data storage unit that stores environment-independent speech segment data generated by analyzing a speech signal uttered clearly one sound at a time so as to have no phonological environment, a speech segment data with extraction environment storage unit that stores speech segment data with extraction environment generated by analyzing a speech signal uttered so as to have a phonological environment, and selection information storage means that stores selection information for these speech segment data for each type of speech unit are provided; and, for each speech unit in the phoneme sequence into which the input character information is converted, the selection information storage means is referred to, and speech segment data with extraction environment that is close to the phonological environment of the speech unit in the phoneme sequence and connects well with the immediately preceding speech segment data is selected if such data exists, and the environment-independent speech segment data is selected otherwise.

4. The speech synthesis method according to claim 3, wherein the selection information storage means comprises a speech unit dictionary that stores, for each speech unit, the phonological environment of the speech segment data with extraction environment, and a connection information storage unit that stores connection information of the speech segment data with extraction environment; and wherein, for each speech unit in the phoneme sequence into which the input character information is converted, the speech unit dictionary is referred to and the candidates are narrowed down based on the degree of approximation to the phonological environment of the speech unit in the phoneme sequence, the similarity between the narrowed-down speech segment data with extraction environment and the immediately preceding speech segment data is obtained from the stored contents of the connection information storage unit, and, after the candidates of speech segment data with extraction environment are narrowed down to one based on the phonological environment and the similarity of the connection, it is determined whether to select that speech segment data with extraction environment or the environment-independent speech segment data.