JPH06250685A

JPH06250685A - Voice synthesis system and rule synthesis device

Info

Publication number: JPH06250685A
Application number: JP5031627A
Authority: JP
Inventors: Mitsuru Ebihara; 充海老原; Yasushi Ishikawa; 泰石川
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1993-02-22
Filing date: 1993-02-22
Publication date: 1994-09-09
Anticipated expiration: 2018-04-07
Also published as: JP3394281B2

Abstract

PURPOSE: To obtain a speech synthesis method and a rule-based synthesis device applicable to rule-based synthesis, which synthesizes high-quality speech from character strings according to rules. CONSTITUTION: The system comprises: a voiced sound source generating means 10, which generates a voiced sound source 104 by repeating an input one-pitch-length residual waveform 107 at intervals of the frame average pitch 100; an unvoiced sound source generating means 11; a fluctuation component generating means 13, which generates a fluctuation component 200 defined by the frame average power 101; a fluctuation component superimposing means 14, which superimposes the fluctuation component 200 on one pitch period of the voiced sound source waveform 104 and outputs a fluctuation-superimposed voiced sound source 201; and a vocal tract filter means 12, which drives a filter approximating the vocal tract characteristic 103, such as an LSP filter, with the source 201 or with the unvoiced sound source 105 from the means 11 and outputs synthesized speech 206.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech synthesis method and a rule-based synthesis device, and more particularly to a speech synthesis method and a rule-based synthesis device applied to rule-based synthesis, which converts text given as characters into speech.

[0002]

[Prior Art] Producing speech artificially, rather than directly from a human utterance, is called speech synthesis. Speech synthesis methods fall into three classes: the recording-and-editing method, the parameter-editing method, and the rule-based synthesis method. The recording-and-editing and parameter-editing methods concatenate and output pre-stored human speech as is, so they have the advantage of producing speech close to natural, but the vocabulary and sentence structures they can output are limited. A method cannot be called true speech synthesis unless the quality of the synthesized sound stays undegraded even when the type, sentence structure, and vocabulary of the response sentences vary widely from user to user. The rule-based synthesis method, in contrast, creates speech from a character string or phoneme symbol string according to phonetic and linguistic rules, and its vocabulary is unlimited. Unlike the other two methods, it therefore has the potential to achieve true speech synthesis.

[0003] The principle of speech synthesis by the rule-based synthesis method is to store feature parameters of small basic units, such as syllables, phonemes, or one-pitch-period waveforms, and to define precisely the rules for concatenating them and the rules for controlling prosodic information such as pitch and amplitude, so that any utterance can be synthesized from a sequence of phoneme symbols, syllable symbols, or characters. For the speech to be natural and easy to listen to, pitch and stress must change smoothly, the spectrum must vary smoothly over time, and pauses must be natural. In rule-based synthesis, therefore, the control rules for the acoustic parameters (control information and control mechanism), which are based on the phonetic and linguistic characteristics of natural speech, play an important role alongside the quality of the basic units used for synthesis.

[0004] A vocoder method has long been known as a speech synthesis method that can be applied to such rule-based synthesis and yields synthesized speech with an arbitrary pitch period, power, and duration. As described, for example, in "Digital Speech Processing" by Sadaoki Furui (Tokai University Press), the vocoder method analyzes the speech signal, separates the speech into sound source information and vocal tract information, and models each to obtain synthesized speech. Synthesized speech with a desired pitch period can be obtained relatively easily.

[0005] FIG. 8 is a block diagram showing a configuration example of a speech analysis-synthesis system based on the conventional vocoder method, which performs speech analysis and synthesis using linear prediction analysis. In the figure, the conventional system has a voiced sound source generating means 10, an unvoiced sound source generating means 11, and a vocal tract filter means 12. Its inputs are the frame average pitch 100, the frame average power 101, and the voiced/unvoiced information 102, which are sound source information obtained by separating and modeling the input speech from the analysis of the speech signal, and the vocal tract characteristic 103, which is vocal tract information obtained in the same way. It produces a voiced sound source 104 and an unvoiced sound source 105 as intermediate outputs and finally outputs synthesized speech 106.

[0006] The operation of the conventional vocoder-based speech analysis-synthesis system configured as above is as follows. In a voiced interval, as identified by the voiced/unvoiced information 102, the voiced sound source generating means 10 uses the frame average power 101 and the frame average pitch 100 to generate the voiced sound source 104, represented as an impulse train with a constant spacing equal to the frame average pitch. In an unvoiced interval, as identified by the voiced/unvoiced information 102, the unvoiced sound source generating means 11 uses the frame average power 101 to generate the unvoiced sound source 105, represented as white noise. The vocal tract filter means 12 drives a vocal tract filter approximating the vocal tract characteristic 103 with the voiced sound source 104 or the unvoiced sound source 105 and outputs the synthesized speech 106.
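The conventional scheme of FIG. 8 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the power-scaling convention, and the use of a direct-form all-pole filter as the vocal tract filter are assumptions made only for illustration.

```python
import random

def impulse_train(n_samples, pitch_period, power):
    """Voiced source 104: one impulse every `pitch_period` samples.

    Assumed convention: the impulse height is chosen so that the mean
    squared value over whole pitch periods equals `power`.
    """
    amp = (power * pitch_period) ** 0.5
    return [amp if n % pitch_period == 0 else 0.0 for n in range(n_samples)]

def white_noise(n_samples, power, rng):
    """Unvoiced source 105: Gaussian white noise with variance `power`."""
    sigma = power ** 0.5
    return [rng.gauss(0.0, sigma) for _ in range(n_samples)]

def all_pole_filter(excitation, a):
    """Stand-in vocal tract filter 1/A(z), A(z) = 1 + sum_k a[k] z^-(k+1)."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y

def synthesize_frame(voiced, pitch_period, power, a, n_samples, rng):
    """Select the source from the voiced/unvoiced flag, then filter it."""
    if voiced:
        source = impulse_train(n_samples, pitch_period, power)
    else:
        source = white_noise(n_samples, power, rng)
    return all_pole_filter(source, a)
```

Because every voiced pitch period is excited by an identical impulse, nothing in this source carries the fine period-to-period detail of natural speech.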

[0007]

[Problems to Be Solved by the Invention] Because the conventional vocoder method described above uses an impulse train as the sound source, the fine temporal features within each pitch period of voiced speech are lost. In addition, when the vocal tract characteristic is estimated inadequately, fine spectral features are lost and the quality of the synthesized speech deteriorates. Improved methods that retain the fine features of speech have been proposed to solve this problem, such as a method that uses as the sound source the residual waveform obtained by analyzing the speech with an inverse filter, and the multipulse method, which uses a sound source approximating the residual waveform.

[0008] These improved vocoder methods do make it possible to obtain synthesized sound of sufficiently high quality. However, the amount of data that must be stored becomes enormous. Moreover, the fine temporal features of speech tend to depend on power and pitch, and it has been pointed out that when these methods are applied to rule-based synthesis, the fine temporal features of the original speech are carried over unchanged into the synthesized speech even when the power or pitch is changed.

[0009] Meanwhile, as a method that uses no filter, the waveform superposition method has been proposed, which makes the pitch period controllable by representing the speech waveform directly, and it has made high-quality synthesized sound achievable. When applied to rule-based synthesis, however, this method has the same problems as the method that uses the residual waveform as the sound source.

[0010] Methods have also been proposed that superimpose a pre-stored fluctuation component of natural speech or of a natural sound source on the synthesized speech or synthesized sound source, or that superimpose a fluctuation component generated from random numbers on the synthesized sound source. Like the methods above, however, none of them takes into account that the fluctuation component tends to depend on power or pitch, so they do not generate a fluctuation component appropriate to the desired power or pitch.

[0011] The present invention was made to solve the problems described above, namely that methods which try to preserve the fine features of speech directly require an enormous amount of stored data, and that the fine temporal features do not remain the same as in the original when the power or pitch is changed. Its object is to provide a high-quality speech synthesis method that can be applied to rule-based synthesis.

[0012]

[Means for Solving the Problems] To achieve the above object, the speech synthesis method according to the first invention takes as input the vocal tract characteristic, frame average power, frame average pitch, and voiced/unvoiced information obtained by speech analysis, and has an unvoiced sound source generating means that generates an unvoiced sound source represented as white noise and a voiced sound source generating means that generates a voiced sound source consisting of an impulse, or a one-pitch-length residual waveform representative of the frame, repeated at every frame average pitch. The method is characterized by comprising: a fluctuation component generating means that receives the frame average power or the frame average pitch and generates a fluctuation component defined by the power or pitch; a fluctuation component superimposing means that superimposes the fluctuation component obtained from the fluctuation component generating means on one pitch period of the voiced sound source waveform in each frame-average-pitch interval; and a vocal tract filter means that receives the fluctuation-superimposed sound source waveform obtained by the fluctuation component superimposing means and obtains synthesized speech through a filter approximating the vocal tract characteristic.

[0013] The rule-based synthesis device according to the second invention receives a character string or phoneme symbol string and outputs synthesized speech according to phonetic and linguistic rules, based on pre-stored dictionary information, speech segment information, and the like. The device is characterized by comprising: a fluctuation component generating means that generates a fluctuation component defined by the power or pitch generated by rule according to the text; a fluctuation component superimposing means that superimposes the fluctuation component obtained from the fluctuation component generating means on one pitch period of the voiced sound source waveform in each pitch interval; and a vocal tract filter means that receives the fluctuation-superimposed sound source waveform obtained by the fluctuation component superimposing means and obtains synthesized speech through a filter approximating the vocal tract characteristic generated by rule.

[0014] The speech synthesis method according to the third invention is characterized by comprising: a fluctuation component generating means that receives the frame average power or the frame average pitch for a voiced interval identified by the voiced/unvoiced information and generates a fluctuation component defined by the power or pitch; a fluctuation component superimposing means that superimposes the fluctuation component obtained from the fluctuation component generating means on the vocal tract characteristic of each frame in synchronization with the frame average pitch; and a vocal tract filter means that receives the voiced sound source waveform in each pitch interval and obtains synthesized speech through a filter approximating the fluctuation-superimposed vocal tract characteristic obtained by the fluctuation component superimposing means.

[0015] The rule-based synthesis device according to the fourth invention is characterized by comprising: a fluctuation component generating means that generates a fluctuation component defined by the power or pitch generated by rule according to the text; a fluctuation component superimposing means that superimposes the fluctuation component obtained from the fluctuation component generating means on the vocal tract characteristic of each frame of a voiced interval generated by rule, in synchronization with the frame average pitch; and a vocal tract filter means that receives the voiced sound source waveform in each pitch interval and obtains synthesized speech through a filter approximating the fluctuation-superimposed vocal tract characteristic obtained by the fluctuation component superimposing means.

[0016] The speech synthesis method according to the fifth invention is characterized by comprising: a fluctuation component generating means that receives the frame average power or the frame average pitch and generates a fluctuation component defined by the power or pitch; a fluctuation component superimposing means that superimposes the fluctuation component obtained from the fluctuation component generating means on a one-pitch-length speech waveform in each frame-average-pitch interval; and a waveform superimposing means that obtains synthesized speech by superimposing the one-pitch-length fluctuation-superimposed speech waveforms obtained by the fluctuation component superimposing means at intervals of the frame average pitch.
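The two steps of the fifth invention, adding a fluctuation component to each one-pitch-length speech waveform and then superimposing the results at pitch intervals, can be sketched in Python as follows. The function names are hypothetical, and plain sample-wise addition is assumed for both steps; the patent text here does not specify windowing or any other detail of the superposition.

```python
def superimpose_fluctuation(waveform, fluctuation):
    """Add a fluctuation component to a one-pitch-length waveform, sample by sample."""
    return [w + f for w, f in zip(waveform, fluctuation)]

def overlap_add(segments, pitch_period):
    """Place segment i at offset i * pitch_period and sum wherever segments overlap."""
    total = max(
        (i * pitch_period + len(seg) for i, seg in enumerate(segments)),
        default=0,
    )
    out = [0.0] * total
    for i, seg in enumerate(segments):
        start = i * pitch_period
        for j, sample in enumerate(seg):
            out[start + j] += sample
    return out
```

With a segment length greater than `pitch_period`, neighbouring segments overlap and their samples add, which is how the spacing of the segments sets the pitch of the output without a vocal tract filter.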

[0017] Further, the rule-based synthesis device according to the sixth invention is characterized by comprising: a fluctuation component generating means that generates a fluctuation component defined by the power or pitch generated by rule according to the text; a fluctuation component superimposing means that superimposes the fluctuation component obtained from the fluctuation component generating means on a one-pitch-length speech waveform generated by rule, in each pitch interval; and a waveform superimposing means that obtains synthesized speech by superimposing the one-pitch-length fluctuation-superimposed speech waveforms obtained by the fluctuation component superimposing means at pitch intervals.

[0018]

[Operation] In the speech synthesis method according to the first invention, the fluctuation component generating means generates a fluctuation component defined by the power or pitch of the input speech, the fluctuation component superimposing means superimposes it on one pitch period of the voiced sound source waveform in each frame-average-pitch interval, and the vocal tract filter means receives the resulting fluctuation-superimposed sound source waveform and obtains synthesized speech through a filter approximating the vocal tract characteristic. Highly natural synthesized speech is therefore reproduced.

[0019] In the rule-based synthesis device according to the second invention, the fluctuation component generating means generates a fluctuation component defined by power or pitch, the fluctuation component superimposing means superimposes it on one pitch period of the voiced sound source waveform in each pitch interval, and the vocal tract filter means receives the resulting fluctuation-superimposed sound source waveform and produces synthesized speech through a filter approximating the vocal tract characteristic generated by rule. High-quality rule-synthesized sound is therefore obtained.

[0020] In the speech synthesis method according to the third invention, the fluctuation component generating means generates a fluctuation component defined by the power or pitch of the input speech, the fluctuation component superimposing means superimposes it on the vocal tract characteristic of each frame in synchronization with the frame average pitch, and the vocal tract filter means, which receives the voiced sound source waveform in each pitch interval, obtains synthesized speech through a filter approximating the fluctuation-superimposed vocal tract characteristic. Highly natural synthesized speech is therefore reproduced.

[0021] In the rule-based synthesis device according to the fourth invention, the fluctuation component generating means generates a fluctuation component defined by power or pitch, the fluctuation component superimposing means superimposes it on the vocal tract characteristic of each frame of a voiced interval generated by rule, in synchronization with the frame average pitch, and the vocal tract filter means, which receives the voiced sound source waveform in each pitch interval, produces synthesized speech through a filter approximating the fluctuation-superimposed vocal tract characteristic. High-quality rule-synthesized sound is therefore obtained.

[0022] In the speech synthesis method according to the fifth invention, the fluctuation component generating means generates a fluctuation component defined by the power or pitch of the input speech, the fluctuation component superimposing means superimposes it on a one-pitch-length speech waveform in each frame-average-pitch interval, and the waveform superimposing means superimposes the resulting one-pitch-length fluctuation-superimposed speech waveforms at intervals of the frame average pitch to obtain synthesized speech. Highly natural synthesized speech is therefore reproduced.

[0023] Further, in the rule-based synthesis device according to the sixth invention, the fluctuation component generating means generates a fluctuation component defined by power or pitch, the fluctuation component superimposing means superimposes it on a one-pitch-length speech waveform generated by rule in each pitch interval, and the waveform superimposing means superimposes the resulting one-pitch-length fluctuation-superimposed speech waveforms at pitch intervals to produce synthesized speech. High-quality rule-synthesized sound is therefore obtained.

[0024]

[Embodiments] Preferred embodiments of the present invention are described below with reference to the drawings. FIGS. 1 to 6 are block diagrams showing configuration examples of speech analysis-synthesis systems based on the speech synthesis method of the embodiments, or of rule-based synthesis devices of the embodiments. In the figures, parts identical or equivalent to those of the conventional speech analysis-synthesis system are given the same reference numerals, and their description is omitted.

[0025] Embodiment 1. FIG. 1 is a block diagram showing a configuration example of a first speech analysis-synthesis system based on the speech synthesis method of this embodiment. In the figure, the first speech analysis-synthesis system has a fluctuation component generating means 13 and a fluctuation component superimposing means 14 in addition to the voiced sound source generating means 10, the unvoiced sound source generating means 11, and the vocal tract filter means 12. Its inputs are the frame average pitch 100, the frame average power 101, and the voiced/unvoiced information 102, which are sound source information obtained by separating and modeling the input speech from the analysis of the speech signal, the vocal tract characteristic 103, which is vocal tract information obtained in the same way, and additionally a one-pitch-length residual waveform 107. In addition to the voiced sound source 104 and the unvoiced sound source 105, it produces a fluctuation component 200 and a fluctuation-superimposed voiced sound source 201 as intermediate outputs, and it finally outputs synthesized speech 206.

[0026] Next, the operation of the first speech analysis-synthesis system of this embodiment, configured as described above, is described. The fluctuation component generating means 13 receives the frame average power 101 and outputs a fluctuation component 200 defined by the power, for example as shown in FIG. 7. FIG. 7 is a graph showing the relationship between the frame average power and the fluctuation component.
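One way to realize a power-dependent fluctuation component is a lookup curve from frame average power to fluctuation amplitude, sketched below in Python. The curve of FIG. 7 is not reproduced in this text, so the breakpoints in `ASSUMED_CURVE`, the piecewise-linear interpolation, and the choice of uniform random samples are all hypothetical placeholders rather than the patent's values.

```python
import random

# Hypothetical (power, amplitude) breakpoints standing in for the curve of FIG. 7.
ASSUMED_CURVE = [(0.0, 0.0), (1.0, 0.05), (4.0, 0.10)]

def fluctuation_amplitude(power, curve=ASSUMED_CURVE):
    """Piecewise-linear lookup of the fluctuation amplitude for a frame average power."""
    if power <= curve[0][0]:
        return curve[0][1]
    for (p0, a0), (p1, a1) in zip(curve, curve[1:]):
        if power <= p1:
            t = (power - p0) / (p1 - p0)
            return a0 + t * (a1 - a0)
    return curve[-1][1]  # clamp above the last breakpoint

def generate_fluctuation(power, length, rng):
    """Fluctuation component 200: random samples scaled by the power-dependent amplitude."""
    amp = fluctuation_amplitude(power)
    return [amp * rng.uniform(-1.0, 1.0) for _ in range(length)]
```

A caller would pass the frame average power 101 and the number of samples in one pitch period, for example `generate_fluctuation(1.0, 80, random.Random(0))`.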

[0027] The unvoiced sound source generating means 11 generates the unvoiced sound source 105, consisting of white noise, when the voiced/unvoiced information 102 indicates unvoiced. The voiced sound source generating means 10 generates the voiced sound source 104 by repeating the one-pitch-length residual waveform 107 at intervals of the frame average pitch 100 when the voiced/unvoiced information 102 indicates voiced. The fluctuation component superimposing means 14 superimposes the fluctuation component 200 on one pitch period of the voiced sound source waveform 104 and outputs the fluctuation-superimposed voiced sound source 201. The vocal tract filter means 12 receives the fluctuation-superimposed voiced sound source 201 or the unvoiced sound source 105 and obtains synthesized speech 206 through a filter approximating the vocal tract characteristic 103, such as an LSP filter.
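The voiced path of Embodiment 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: the residual is truncated or zero-padded to the pitch period, the fluctuation is drawn fresh from a uniform distribution for every pitch period, and a direct-form all-pole filter replaces the LSP filter named in the text. All function and parameter names are hypothetical.

```python
import random

def repeat_residual(residual, pitch_period, n_periods):
    """Voiced source 104: one copy of the residual per pitch period.

    Assumption: the residual is truncated or zero-padded to `pitch_period`.
    """
    one_period = (list(residual) + [0.0] * pitch_period)[:pitch_period]
    return [one_period[:] for _ in range(n_periods)]

def add_fluctuation(periods, amplitude, rng):
    """Source 201: add an independent random fluctuation to every pitch period."""
    return [[s + amplitude * rng.uniform(-1.0, 1.0) for s in p] for p in periods]

def all_pole_filter(excitation, a):
    """Stand-in for the vocal tract filter 12: y[n] = x[n] - sum_k a[k] * y[n-1-k]."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y

def synthesize_voiced_frame(residual, pitch_period, n_periods, fluct_amp, a, rng):
    """Repeat the residual, superimpose the fluctuation, then filter."""
    periods = repeat_residual(residual, pitch_period, n_periods)
    fluctuated = add_fluctuation(periods, fluct_amp, rng)
    source_201 = [s for period in fluctuated for s in period]
    return all_pole_filter(source_201, a)
```

Setting `fluct_amp` to zero reduces the sketch to plain residual-excited synthesis, which makes the contribution of the fluctuation step easy to isolate when listening to or plotting the output.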

[0028] Embodiment 2. FIG. 2 is a block diagram showing a configuration example of a first rule-based synthesis device according to this embodiment. In the figure, the first rule-based synthesis device 20 consists of a text analysis means 21, a synthesis rule means 22, and a speech synthesis means 23. The speech synthesis means 23 consists of a voiced sound source generating means 10, an unvoiced sound source generating means 11, a vocal tract filter means 12, a fluctuation component generating means 13, and a fluctuation component superimposing means 14. The first rule-based synthesis device 20 receives input text 202 consisting of a character string or a phoneme symbol string and performs rule-based synthesis using dictionary information 203 and speech segment information 204.

[0029] Next, the operation of the first rule-based synthesis device of this embodiment, configured as described above, is described. In the rule-based synthesis device 20, which receives the input text 202 and outputs the synthesized speech 206, the text analysis means 21 analyzes the input text 202, given as characters, by referring to the pre-stored dictionary information 203 and outputs a text analysis result 205 containing the readings, accents, and similar attributes of the words. The synthesis rule means 22 receives the text analysis result 205 from the text analysis means 21, refers to the pre-stored speech segment information 204, and determines by rule and outputs the one-pitch-length residual waveform 107, the vocal tract characteristic 103, the frame average power 101, the frame average pitch 100, and the voiced/unvoiced information 102 used for speech synthesis.

[0030] The speech synthesis means 23 receives the one-pitch-length residual waveform 107, the vocal tract characteristic 103, the frame average power 101, the frame average pitch 100, and the voiced/unvoiced information 102. Its unvoiced sound source generating means 11 generates the unvoiced sound source 105, consisting of white noise, when the voiced/unvoiced information 102 indicates unvoiced. The voiced sound source generating means 10 generates the voiced sound source 104 by repeating the one-pitch-length residual waveform 107 at intervals of the frame average pitch 100 when the voiced/unvoiced information 102 indicates voiced. The fluctuation component generating means 13 receives the frame average power 101 and outputs a fluctuation component 200 defined by the power, for example as shown in FIG. 7. The fluctuation component superimposing means 14 superimposes the fluctuation component 200 on one pitch period of the voiced sound source 104 in each interval of the frame average pitch 100 and outputs the fluctuation-superimposed voiced sound source 201. The vocal tract filter means 12 receives the fluctuation-superimposed voiced sound source 201 or the unvoiced sound source 105 and obtains synthesized speech 206 through a filter approximating the vocal tract characteristic 103, such as an LSP filter.

【００３１】実施例３．図３は、本実施例に係る音声合
成方式による第２の音声分析合成系の構成例を示す構成
図である。図において、本実施例の第２の音声分析合成
系は、有声音源生成手段１０、無声音源生成手段１１、
声道フィルタ手段１２、ゆらぎ成分生成手段１３、ゆら
ぎ成分重畳手段１４を有している。また、フレーム平均
ピッチ１００、フレーム平均パワー１０１、有声無声情
報１０２、声道特性１０３、１ピッチ長残差波形１０７
を入力し、有声音源１０４、無声音源１０５、ゆらぎ成
分２００、ゆらぎ成分重畳声道特性２０７を中間出力す
るとともに、最終的には合成音声２０６を出力する。こ
のような構成は図１に示す本実施例の第１の音声分析合
成系の構成と同様であるが、第２の音声分析合成系のゆ
らぎ成分重畳手段１４は、ゆらぎ成分２００をフレーム
平均ピッチ１００に同期させて重畳してゆらぎ成分重畳
声道特性２０７を出力し、これを声道フィルタ手段１２
に入力することにより合成音声２０６を生成するところ
に特徴を有している。Example 3. FIG. 3 is a configuration diagram showing a configuration example of the second speech analysis-synthesis system using the speech synthesis method according to the present embodiment. In the figure, the second speech analysis-synthesis system has a voiced sound source generation means 10, an unvoiced sound source generation means 11, a vocal tract filter means 12, a fluctuation component generation means 13, and a fluctuation component superimposing means 14. It receives the frame average pitch 100, frame average power 101, voiced/unvoiced information 102, vocal tract characteristic 103, and 1-pitch-length residual waveform 107 as inputs, produces the voiced sound source 104, unvoiced sound source 105, fluctuation component 200, and fluctuation-component-superimposed vocal tract characteristic 207 as intermediate outputs, and finally outputs synthesized speech 206. This configuration is similar to that of the first speech analysis-synthesis system shown in FIG. 1. It differs in that the fluctuation component superimposing means 14 of the second system superimposes the fluctuation component 200 in synchronization with the frame average pitch 100 to output the fluctuation-component-superimposed vocal tract characteristic 207, which is input to the vocal tract filter means 12 to generate the synthesized speech 206.

【００３２】次に、上記の通り構成される本実施例の第
２の音声分析合成系の動作について説明する。ゆらぎ成
分生成手段１３は、フレーム平均パワー１０１を入力し
て、例えば、図７に示す様なパワーにより規定されるゆ
らぎ成分２００を出力する。無声音源生成手段１１は、
有声無声情報１０２により無声と判別されるときに白色
雑音からなる無声音源１０５を生成する。有声音源生成
手段１０は、有声無声情報１０２により有声と判別され
るときに１ピッチ長残差波形１０７をフレーム平均ピッ
チ１００間隔で繰り返した有声音源１０４を生成する。
さらに有声時において、ゆらぎ成分重畳手段１４は、声
道特性１０３のインパルス応答にゆらぎ成分２００をフ
レーム平均ピッチ１００に同期して重畳し、ゆらぎ成分
重畳声道特性２０７を出力する。声道フィルタ手段１２
は有声音源１０４または無声音源１０５を入力し、ゆら
ぎ成分重畳声道特性２０７を近似するフィルタにより合
成音声２０６を得る。Next, the operation of the second speech analysis-synthesis system of the present embodiment, configured as described above, will be described. The fluctuation component generation means 13 receives the frame average power 101 and outputs a fluctuation component 200 defined by the power, for example as shown in FIG. 7. The unvoiced sound source generation means 11 generates an unvoiced sound source 105 composed of white noise when the voiced/unvoiced information 102 indicates an unvoiced frame. The voiced sound source generation means 10 generates a voiced sound source 104 by repeating the 1-pitch-length residual waveform 107 at intervals of the frame average pitch 100 when the voiced/unvoiced information 102 indicates a voiced frame.
Further, in voiced frames, the fluctuation component superimposing means 14 superimposes the fluctuation component 200 on the impulse response of the vocal tract characteristic 103 in synchronization with the frame average pitch 100, and outputs the fluctuation-component-superimposed vocal tract characteristic 207. The vocal tract filter means 12 receives the voiced sound source 104 or the unvoiced sound source 105 and obtains synthesized speech 206 with a filter that approximates the fluctuation-component-superimposed vocal tract characteristic 207.
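
The difference from the first system is that the perturbation is applied to the filter rather than to the excitation. A minimal Python sketch of that idea follows. It assumes the vocal tract characteristic is available as a finite impulse response and that the fluctuation is Gaussian noise scaled by the frame power; both are assumptions made for illustration, as are the function names and the `ratio` parameter.

```python
import random


def convolve(x, h):
    """Direct-form convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def synthesize_with_perturbed_tract(source, pitch, impulse_response,
                                    power, rng, ratio=0.1):
    """Filter each pitch period with its own perturbed impulse response."""
    sigma = ratio * power ** 0.5  # assumed relation, not taken from FIG. 7
    out = [0.0] * (len(source) + len(impulse_response) - 1)
    for start in range(0, len(source), pitch):
        # Pitch-synchronous step: draw a new perturbation for this period.
        h_k = [h + rng.gauss(0.0, sigma) for h in impulse_response]
        segment = convolve(source[start:start + pitch], h_k)
        # Overlap-add the filtered period back at its original position.
        for i, value in enumerate(segment):
            out[start + i] += value
    return out


source = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
speech = synthesize_with_perturbed_tract(source, 3, [1.0, 0.5], 1.0,
                                         random.Random(0))
```

When `power` is zero every period is filtered with the same impulse response, so the result matches an ordinary convolution of the whole source with the unperturbed response.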

【００３３】実施例４．図４は、本実施例に係る第２の
規則合成装置の構成例を示す構成図である。図におい
て、本実施例の第２の規則合成装置２０は、文章解析手
段２１と、合成規則手段２２と、音声合成手段２３とか
ら構成され、音声合成手段２３は有声音源生成手段１０
と、無声音源生成手段１１と、声道フィルタ手段１２
と、ゆらぎ成分生成手段１３と、ゆらぎ成分重畳手段１
４とから構成されている。そして、第２の規則合成装置
２０は、入力文章２０２を入力し、辞書情報２０３と、
音声素片情報２０４とを用いて規則合成を行う。Example 4. FIG. 4 is a configuration diagram showing a configuration example of the second rule synthesis device according to the present embodiment. In the figure, the second rule synthesis device 20 is composed of a sentence analysis means 21, a synthesis rule means 22, and a speech synthesis means 23, and the speech synthesis means 23 is composed of a voiced sound source generation means 10, an unvoiced sound source generation means 11, a vocal tract filter means 12, a fluctuation component generation means 13, and a fluctuation component superimposing means 14. The second rule synthesis device 20 receives an input sentence 202 and performs rule synthesis using dictionary information 203 and speech segment information 204.

【００３４】次に、上記の通り構成される本実施例の第
２の規則合成装置の動作について説明する。入力文章２
０２を入力し、合成音声１０６を出力する規則合成装置
１７において、文章解析手段２１は、文字で与えられた
入力文章２０２を、あらかじめ記憶された辞書情報２０
３を参照して解析し、単語の読み・アクセント等の文章
解析結果２０５を出力する。合成規則手段２２は、上記
文章解析結果２０５を入力して、あらかじめ記憶された
音声素片情報２０４を参照し、規則によって音声合成に
用いる１ピッチ長残差波形１０７・声道特性１０３・フ
レーム平均パワー１０１・フレーム平均ピッチ１００・
有声無声情報１０２を決定し、音声合成手段２３に出力
する。Next, the operation of the second rule synthesis device of the present embodiment, configured as described above, will be described. In the rule synthesis device 17, which receives the input sentence 202 and outputs synthesized speech 106, the sentence analysis means 21 analyzes the input sentence 202, given as characters, by referring to the pre-stored dictionary information 203, and outputs a sentence analysis result 205 such as word readings and accents. The synthesis rule means 22 receives the sentence analysis result 205, refers to the pre-stored speech segment information 204, determines by rule the 1-pitch-length residual waveform 107, vocal tract characteristic 103, frame average power 101, frame average pitch 100, and voiced/unvoiced information 102 used for speech synthesis, and outputs them to the speech synthesis means 23.
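
The five quantities handed from the synthesis rule means 22 to the speech synthesis means 23 can be pictured as one record per frame. The Python sketch below is a hypothetical data layout, not something the patent defines: the class name, field names, list-based representations, and the choice of white-noise variance equal to the frame power are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class FrameParams:
    """Hypothetical per-frame record; comments give the reference numerals."""
    residual: list  # 1-pitch-length residual waveform (107)
    tract: list     # vocal tract characteristic (103), representation assumed
    power: float    # frame average power (101)
    pitch: int      # frame average pitch in samples (100)
    voiced: bool    # voiced/unvoiced information (102)


def select_source(frame, n_samples, rng):
    """Pick the excitation for one frame from the voiced/unvoiced flag."""
    if frame.voiced:
        # Voiced: repeat the residual at frame-average-pitch intervals.
        period = (list(frame.residual) + [0.0] * frame.pitch)[:frame.pitch]
        repeats = -(-n_samples // frame.pitch)  # ceiling division
        return (period * repeats)[:n_samples]
    # Unvoiced: white noise; scaling by the frame power is an assumption.
    sigma = frame.power ** 0.5
    return [rng.gauss(0.0, sigma) for _ in range(n_samples)]


frame = FrameParams(residual=[1.0, -1.0], tract=[1.0], power=1.0,
                    pitch=3, voiced=True)
excitation = select_source(frame, 7, random.Random(0))
```

Keeping the parameters in one record makes the boundary between rule-based parameter generation and waveform synthesis explicit, which is the division of labor the description draws between means 22 and means 23.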

【００３５】音声合成手段２３においては、上記１ピッ
チ長残差波形１０７・声道特性１０３・フレーム平均パ
ワー１０１・フレーム平均ピッチ１００・有声無声情報
１０２を入力し、音声合成手段２３の無声音源生成手段
１１は、有声無声情報１０２により無声と判別されると
きに白色雑音からなる無声音源１０５を生成する。ま
た、有声音源生成手段１０は、有声無声情報１０２によ
り有声と判別されるときに１ピッチ長残差波形１０７を
フレーム平均ピッチ１００間隔で繰り返した有声音源１
０４を生成する。ゆらぎ成分生成手段１３は、フレーム
平均パワー１０１を入力して、例えば、図７に示す様な
パワーにより規定されるゆらぎ成分２００を出力し、ゆ
らぎ成分重畳手段１４は、上記ゆらぎ成分２００をフレ
ーム平均ピッチ１００に同期して声道特性１０３のイン
パルス応答に重畳し、ゆらぎ成分重畳声道特性２０７を
出力する。声道フィルタ手段１２は有声音源１０４また
は無声音源１０５を入力し、ゆらぎ成分重畳声道特性２
０７を近似するフィルタにより合成音声２０６を得る。In the speech synthesis means 23, the 1-pitch-length residual waveform 107, the vocal tract characteristic 103, the frame average power 101, the frame average pitch 100, and the voiced/unvoiced information 102 are input. The unvoiced sound source generation means 11 of the speech synthesis means 23 generates an unvoiced sound source 105 composed of white noise when the voiced/unvoiced information 102 indicates an unvoiced frame. The voiced sound source generation means 10 generates a voiced sound source 104 by repeating the 1-pitch-length residual waveform 107 at intervals of the frame average pitch 100 when the voiced/unvoiced information 102 indicates a voiced frame. The fluctuation component generation means 13 receives the frame average power 101 and outputs the fluctuation component 200 defined by the power, for example as shown in FIG. 7. The fluctuation component superimposing means 14 superimposes the fluctuation component 200 on the impulse response of the vocal tract characteristic 103 in synchronization with the frame average pitch 100, and outputs the fluctuation-component-superimposed vocal tract characteristic 207. The vocal tract filter means 12 receives the voiced sound source 104 or the unvoiced sound source 105 and obtains synthesized speech 206 with a filter that approximates the fluctuation-component-superimposed vocal tract characteristic 207.

【００３６】実施例５．図５は、本実施例に係る音声合
成方式による第３の音声分析合成系の構成例を示す構成
図である。図において、本実施例の第３の音声分析合成
系は、ゆらぎ成分生成手段１３、ゆらぎ成分重畳手段１
４、波形重畳手段１５を有している。また、フレーム平
均ピッチ１００、フレーム平均パワー１０１、１ピッチ
長音声波形２０８を入力し、ゆらぎ成分２００、ゆらぎ
成分重畳音声波形２０９を中間出力するとともに、最終
的には合成音声２０６を出力する。Example 5. FIG. 5 is a configuration diagram showing a configuration example of the third speech analysis-synthesis system using the speech synthesis method according to the present embodiment. In the figure, the third speech analysis-synthesis system has a fluctuation component generation means 13, a fluctuation component superimposing means 14, and a waveform superimposing means 15. It receives the frame average pitch 100, the frame average power 101, and the 1-pitch-length speech waveform 208 as inputs, produces the fluctuation component 200 and the fluctuation-component-superimposed speech waveform 209 as intermediate outputs, and finally outputs synthesized speech 206.

【００３７】次に、上記の通り構成される本実施例の第
３の音声分析合成系の動作について説明する。ゆらぎ成
分生成手段１３は、フレーム平均パワー１０１を入力し
て、例えば、図７に示す様なパワーにより規定されるゆ
らぎ成分２００を出力する。ゆらぎ成分重畳手段１４
は、有声無声情報により有声と判別されるときに、上記
ゆらぎ成分生成手段１３からのゆらぎ成分２００を１ピ
ッチ長音声波形２０８に重畳し、ゆらぎ成分重畳音声波
形２０９を出力する。波形重畳手段１５は、そのゆらぎ
成分重畳音声波形２０９を入力し、フレーム平均ピッチ
１００間隔で重畳することより合成音声２０６を得る。Next, the operation of the third speech analysis-synthesis system of the present embodiment, configured as described above, will be described. The fluctuation component generation means 13 receives the frame average power 101 and outputs a fluctuation component 200 defined by the power, for example as shown in FIG. 7. When the voiced/unvoiced information indicates a voiced frame, the fluctuation component superimposing means 14 superimposes the fluctuation component 200 from the fluctuation component generation means 13 on the 1-pitch-length speech waveform 208 and outputs the fluctuation-component-superimposed speech waveform 209. The waveform superimposing means 15 receives the fluctuation-component-superimposed speech waveform 209 and obtains synthesized speech 206 by superimposing it at intervals of the frame average pitch 100.
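
The waveform superimposing step can be read as an overlap-add of successive one-pitch-length waveforms placed one frame average pitch apart. The Python sketch below illustrates that reading under the assumption that the perturbed waveforms are already available as lists of samples and that the pitch is constant across them; the function name is hypothetical.

```python
def overlap_add(waveforms, pitch):
    """Place the k-th waveform at offset k * pitch and sum the overlaps."""
    if not waveforms:
        return []
    # The output must reach the end of whichever waveform extends furthest.
    total = max(k * pitch + len(w) for k, w in enumerate(waveforms))
    out = [0.0] * total
    for k, waveform in enumerate(waveforms):
        for i, value in enumerate(waveform):
            out[k * pitch + i] += value
    return out


# Two three-sample waveforms placed two samples apart overlap by one sample.
speech = overlap_add([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]], pitch=2)
```

If each waveform is no longer than the pitch, no samples overlap and the result is simple concatenation with gaps of silence where a waveform is shorter than the pitch.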

【００３８】実施例６．図６は、本実施例に係る第３の
規則合成装置の構成例を示す構成図である。図におい
て、本実施例の第３の規則合成装置２０は、文章解析手
段２１と、合成規則手段２２と、音声合成手段２３とか
ら構成され、音声合成手段２３は、ゆらぎ成分生成手段
１３と、ゆらぎ成分重畳手段１４と、波形重畳手段１５
とから構成されている。そして、第３の規則合成装置２
０は、入力文章２０２を入力し、辞書情報２０３と、音
声素片情報２０４を用いて規則合成を行う。Example 6. FIG. 6 is a configuration diagram showing a configuration example of the third rule synthesis device according to the present embodiment. In the figure, the third rule synthesis device 20 is composed of a sentence analysis means 21, a synthesis rule means 22, and a speech synthesis means 23, and the speech synthesis means 23 is composed of a fluctuation component generation means 13, a fluctuation component superimposing means 14, and a waveform superimposing means 15. The third rule synthesis device 20 receives an input sentence 202 and performs rule synthesis using the dictionary information 203 and the speech segment information 204.

【００３９】次に、上記の通り構成される本実施例の第
３の規則合成装置の動作について説明する。入力文章２
０２を入力し、合成音声２０６を出力する規則合成装置
２０において、文章解析手段２１は、文字で与えられた
入力文章２０２を、あらかじめ記憶された辞書情報２０
３を参照して解析し、単語の読み・アクセント等の文章
解析結果２０５を出力する。合成規則手段２２は、上記
文章解析手段２１からの文章解析結果２０５を入力し
て、あらかじめ記憶された音声素片情報２０４を参照
し、規則によって音声合成に用いる１ピッチ長音声波形
２０８・フレーム平均ピッチ１００・フレーム平均パワ
ー１０１を決定し、音声合成手段２３に出力する。Next, the operation of the third rule synthesis device of the present embodiment, configured as described above, will be described. In the rule synthesis device 20, which receives the input sentence 202 and outputs synthesized speech 206, the sentence analysis means 21 analyzes the input sentence 202, given as characters, by referring to the pre-stored dictionary information 203, and outputs a sentence analysis result 205 such as word readings and accents. The synthesis rule means 22 receives the sentence analysis result 205 from the sentence analysis means 21, refers to the pre-stored speech segment information 204, determines by rule the 1-pitch-length speech waveform 208, the frame average pitch 100, and the frame average power 101 used for speech synthesis, and outputs them to the speech synthesis means 23.

【００４０】音声合成手段２３においては、上記１ピッ
チ長音声波形２０８・フレーム平均ピッチ１００・フレ
ーム平均パワー１０１を入力し、音声合成手段２３のゆ
らぎ成分生成手段１３はフレーム平均パワー１０１を入
力して、例えば、図７に示す様なパワーにより規定され
るゆらぎ成分２００を出力する。またし、ゆらぎ成分重
畳手段１４は、上記ゆらぎ成分２００を１ピッチ長音声
波形２０８に重畳し、ゆらぎ成分重畳音声波形２０９を
出力する。波形重畳手段１５はゆらぎ成分重畳音声波形
２０９を入力し、フレーム平均ピッチ１００の間隔での
重畳を行い、合成音声２０６を得る。In the speech synthesis means 23, the 1-pitch-length speech waveform 208, the frame average pitch 100, and the frame average power 101 are input. The fluctuation component generation means 13 of the speech synthesis means 23 receives the frame average power 101 and outputs the fluctuation component 200 defined by the power, for example as shown in FIG. 7. The fluctuation component superimposing means 14 superimposes the fluctuation component 200 on the 1-pitch-length speech waveform 208 and outputs a fluctuation-component-superimposed speech waveform 209. The waveform superimposing means 15 receives the fluctuation-component-superimposed speech waveform 209 and superimposes it at intervals of the frame average pitch 100 to obtain synthesized speech 206.

【００４１】その他の実施例．なお、上記実施例１〜実
施例６のゆらぎ成分生成手段１３の説明においては、ゆ
らぎ成分生成手段１３はフレーム平均パワー１０１を入
力して、パワーにより規定されるゆらぎ成分２００を出
力するものとして説明したが、これに限られるものでは
なく、例えば、フレーム平均ピッチを入力してピッチに
より規定されるゆらぎ成分を出力するものとして構成す
ることもできる。Other Embodiments. In the description of the fluctuation component generation means 13 in Examples 1 to 6 above, the fluctuation component generation means 13 was described as receiving the frame average power 101 and outputting the fluctuation component 200 defined by the power. However, the invention is not limited to this. For example, the means may be configured to receive the frame average pitch and output a fluctuation component defined by the pitch.

【００４２】また、上記実施例１〜実施例４の有声音源
生成手段１０の説明においては、有声音源生成手段１０
は、有声無声情報１０２により有声と判別されるときに
１ピッチ長残差波形１０７をフレーム平均ピッチ１００
間隔で繰り返した有声音源１０４を生成するものとして
説明したが、これに限られるものではなく、例えば、入
力されるパワーおよびピッチにより生成されるインパル
ス列を用いるものとして構成することもできる。Further, in the description of the voiced sound source generation means 10 in Examples 1 to 4 above, the voiced sound source generation means 10 was described as generating the voiced sound source 104 by repeating the 1-pitch-length residual waveform 107 at intervals of the frame average pitch 100 when the voiced/unvoiced information 102 indicates a voiced frame. However, the invention is not limited to this. For example, it may be configured to use an impulse train generated from the input power and pitch.
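
The impulse-train alternative can be sketched as follows. The patent says only that the train is generated from the input power and pitch; the particular amplitude rule used here (one impulse per period, scaled so that the mean square over a period equals the power) is an assumption chosen for illustration, and the function name is hypothetical.

```python
def impulse_train(power, pitch, n_periods):
    """One impulse per pitch period, scaled from the frame average power."""
    # Assumed scaling: amplitude**2 / pitch == power over each period.
    amplitude = (power * pitch) ** 0.5
    period = [amplitude] + [0.0] * (pitch - 1)
    return period * n_periods


source = impulse_train(power=1.0, pitch=4, n_periods=2)
# source == [2.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
```

Such a train could stand in for the repeated residual waveform as the voiced sound source, with the fluctuation component superimposed on it period by period in the same way.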

【００４３】[0043]

【発明の効果】以上説明したように、本発明の音声合成
方式によれば、入力される音声のパワーまたはピッチに
より規定されるゆらぎ成分を生成するようにし、そのゆ
らぎ成分を合成に用いるピッチ長の有声音源波形または
有声区間の声道特性にピッチ間隔毎に重畳し、あるい
は、ピッチ間隔毎に重畳した有声音声波形をピッチ間隔
毎に重畳するように構成したので、自然性の高い合成音
声を実現することができるという効果がある。As described above, according to the speech synthesis method of the present invention, a fluctuation component defined by the power or pitch of the input speech is generated and superimposed, at every pitch interval, on the pitch-length voiced sound source waveform used for synthesis or on the vocal tract characteristics of the voiced section. Alternatively, the voiced speech waveform on which the fluctuation component has been superimposed at every pitch interval is itself superimposed at every pitch interval. This configuration has the effect of producing highly natural synthesized speech.

【００４４】また、本発明の規則合成装置によれば、規
則により生成されたパワーまたはピッチにより規定され
るゆらぎ成分を生成するようにし、そのゆらぎ成分を合
成に用いるピッチ長の有声音源波形または有声区間の声
道特性にピッチ間隔毎に重畳し、あるいは、ピッチ長の
有声音声波形にピッチ間隔毎に重畳して得られた波形を
ピッチ間隔毎に重畳するように構成したので、高品質の
規則合成音を得ることができるという効果がある。Further, according to the rule synthesis device of the present invention, a fluctuation component defined by the power or pitch generated by rule is generated and superimposed, at every pitch interval, on the pitch-length voiced sound source waveform used for synthesis or on the vocal tract characteristics of the voiced section. Alternatively, the waveform obtained by superimposing the fluctuation component on the pitch-length voiced speech waveform at every pitch interval is itself superimposed at every pitch interval. This configuration has the effect of producing high-quality rule-synthesized speech.

[Brief description of drawings]

【図１】本実施例に係る音声合成方式による第１の音声
分析合成系の構成例を示す構成図である。FIG. 1 is a configuration diagram showing a configuration example of a first voice analysis / synthesis system by a voice synthesis method according to an embodiment.

【図２】本実施例に係る第１の規則合成装置の構成例を
示す構成図である。FIG. 2 is a configuration diagram showing a configuration example of a first rule synthesis device according to the present embodiment.

【図３】本実施例に係る音声合成方式による第２の音声
分析合成系の構成例を示す構成図である。FIG. 3 is a configuration diagram showing a configuration example of a second voice analysis / synthesis system by a voice synthesis system according to the present embodiment.

【図４】本実施例に係る第２の規則合成装置の構成例を
示す構成図である。FIG. 4 is a configuration diagram showing a configuration example of a second rule synthesizing device according to the present embodiment.

【図５】本実施例に係る音声合成方式による第３の音声
分析合成系の構成例を示す構成図である。FIG. 5 is a configuration diagram showing a configuration example of a third voice analysis / synthesis system by the voice synthesis system according to the present embodiment.

【図６】本実施例に係る第３の規則合成装置の構成例を
示す構成図である。FIG. 6 is a configuration diagram showing a configuration example of a third rule composition device according to the present embodiment.

【図７】フレーム平均パワーとゆらぎ成分生成手段によ
り生成されるゆらぎ成分との関係を示すグラフである。FIG. 7 is a graph showing the relationship between the frame average power and the fluctuation component generated by the fluctuation component generation means.

【図８】従来の音声合成方式による音声分析合成系の構
成例を示す構成図である。FIG. 8 is a configuration diagram showing a configuration example of a voice analysis / synthesis system according to a conventional voice synthesis system.

【符号の説明】１０有声音源生成手段１１無声音源生成手段１２声道フィルタ手段１３ゆらぎ成分生成手段１４ゆらぎ成分重畳手段１５波形重畳手段２０規則合成装置２１文章解析手段２２合成規則手段２３音声合成手段１００フレーム平均ピッチ１０１フレーム平均パワー１０２有声無声情報１０３声道特性１０４有声音源１０５無声音源１０６、２０６合成音声１０７１ピッチ長残差波形２００ゆらぎ成分２０１ゆらぎ成分重畳有声音源２０２入力文章２０３辞書情報２０４音声素片情報２０５文章解析結果２０７ゆらぎ成分重畳声道特性２０８１ピッチ長音声波形２０９ゆらぎ成分重畳音声波形
[Explanation of Reference Numerals]
10: voiced sound source generation means
11: unvoiced sound source generation means
12: vocal tract filter means
13: fluctuation component generation means
14: fluctuation component superimposing means
15: waveform superimposing means
20: rule synthesis device
21: sentence analysis means
22: synthesis rule means
23: speech synthesis means
100: frame average pitch
101: frame average power
102: voiced/unvoiced information
103: vocal tract characteristic
104: voiced sound source
105: unvoiced sound source
106, 206: synthesized speech
107: 1-pitch-length residual waveform
200: fluctuation component
201: fluctuation-component-superimposed voiced sound source
202: input sentence
203: dictionary information
204: speech segment information
205: sentence analysis result
207: fluctuation-component-superimposed vocal tract characteristic
208: 1-pitch-length speech waveform
209: fluctuation-component-superimposed speech waveform

─────────────────────────────────────────────────────

【手続補正書】[Procedure amendment]

【提出日】平成５年４月１９日[Submission date] April 19, 1993

【手続補正１】[Procedure Amendment 1]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】０００２[Name of item to be corrected] 0002

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【０００２】[0002]

【従来の技術】音声を直接人間の発声そのままによらな
いで、人工的に作り出すことを音声合成（ｓｐｅｅｃｈ
ｓｙｎｔｈｅｓｉｓ）という。音声合成方式として
は、録音編集方式、パラメータ編集方式、規則合成方式
の３種類に分類できる。このなかで、録音編集方式やパ
ラメータ編集方式は、予め記憶しておいた人の声をその
まま接続して出力する方式であるので、自然に近い音声
を合成することができるという利点があるが、出力可能
な語彙や文構造が限られてしまうという問題がある。こ
の点、規則合成方式は、文字列あるいは音素記号列から
音声学的ないし言語学的規則に基づいて音声を作り出す
方式であり、録音編集方式やパラメータ編集方式と異な
り、少ない記憶容量で任意の語彙の音声合成が可能とな
る。2. Description of the Related Art: Artificially producing speech, rather than relying directly on human utterance, is called speech synthesis. Speech synthesis methods can be classified into three types: the recording-editing method, the parameter-editing method, and the rule synthesis method. Of these, the recording-editing and parameter-editing methods connect and output pre-stored human speech as is, so they have the advantage of being able to synthesize near-natural speech, but the vocabulary and sentence structures that can be output are limited. The rule synthesis method, by contrast, creates speech from a character string or phoneme symbol string based on phonetic or linguistic rules. Unlike the recording-editing and parameter-editing methods, it can synthesize speech for any vocabulary with a small storage capacity.

【手続補正２】[Procedure Amendment 2]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】０００３[Name of item to be corrected] 0003

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【０００３】ここで示す規則合成方式とは、蓄えておく
単位として音節、音素、１ピッチ区間の波形などのよう
な、基本的な小さな単位の特徴パラメータを用い、その
かわりそれらを接続する規則や、ピッチ・振幅などの韻
律情報を制御する規則を精密に定めることにより、いか
なる言葉でも、音素、音節記号あるいは文字の系列から
合成できるようにしようとするものである。このとき、
音声が自然で聞きやすいものであるためには、ピッチや
ストレスの変化及び、スペクトルの時間的変化が滑らか
で、しかもポーズなどが自然でなければならない。した
がって、この規則合成方式の場合は、合成に用いる基本
単位の品質とともに、自然音声の音声学的ないし言語学
的特性に基づく、音響パラメータの制御規則（制御情報
と制御機構）が重要な役割を果たす。[0003] The rule synthesis method referred to here uses feature parameters of small basic units, such as syllables, phonemes, or one-pitch-period waveforms, as the stored units. In exchange, it precisely defines the rules for connecting them and the rules for controlling prosodic information such as pitch and amplitude, so that any utterance can be synthesized from a sequence of phonemes, syllable symbols, or characters. For the speech to be natural and easy to listen to, changes in pitch and stress and temporal changes in the spectrum must be smooth, and pauses and the like must be natural. Therefore, in the rule synthesis method, the control rules for acoustic parameters (control information and control mechanism), based on the phonetic or linguistic characteristics of natural speech, play an important role alongside the quality of the basic units used for synthesis.

Claims

[Claims]

1. A speech synthesis method having an unvoiced sound source generation means that receives vocal tract characteristics, frame average power, frame average pitch, and voiced/unvoiced information obtained by speech analysis and generates an unvoiced sound source represented by white noise, and a voiced sound source generation means that generates a voiced sound source consisting of repetitions, at every frame average pitch, of an impulse or of a 1-pitch-length residual waveform for the frame, the method being characterized by comprising: a fluctuation component generation means that receives the frame average power or the frame average pitch and generates a fluctuation component defined by the power or pitch; a fluctuation component superimposing means that superimposes the fluctuation component obtained by the fluctuation component generation means on one pitch period of the voiced sound source waveform for each frame average pitch section; and a vocal tract filter means that receives the fluctuation-component-superimposed sound source waveform obtained by the fluctuation component superimposing means and obtains synthesized speech with a filter that approximates the vocal tract characteristics.

2. A rule synthesis device that receives a character string or a phoneme symbol string and outputs synthesized speech according to phonetic or linguistic rules based on pre-stored dictionary information, speech segment information, and the like, the device being characterized by comprising: a fluctuation component generation means that generates a fluctuation component defined by the power or pitch generated by rule according to the sentence; a fluctuation component superimposing means that superimposes the fluctuation component obtained by the fluctuation component generation means on one pitch period of the voiced sound source waveform for each pitch section; and a vocal tract filter means that receives the fluctuation-component-superimposed sound source waveform obtained by the fluctuation component superimposing means and obtains synthesized speech with a filter that approximates the vocal tract characteristics generated by rule.

3. A speech synthesis method having an unvoiced sound source generation means that receives vocal tract characteristics, frame average power, frame average pitch, and voiced/unvoiced information obtained by speech analysis and generates an unvoiced sound source represented by white noise, and a voiced sound source generation means that generates a voiced sound source consisting of repetitions, at every frame average pitch, of an impulse or of a 1-pitch-length residual waveform for the frame, the method being characterized by comprising: a fluctuation component generation means that, for a voiced section determined by the voiced/unvoiced information, receives the frame average power or the frame average pitch and generates a fluctuation component defined by the power or pitch; a fluctuation component superimposing means that superimposes the fluctuation component obtained by the fluctuation component generation means on the vocal tract characteristics of each frame in synchronization with the frame average pitch; and a vocal tract filter means that receives the voiced sound source waveform for each pitch section and obtains synthesized speech with a filter that approximates the fluctuation-component-superimposed vocal tract characteristics obtained by the fluctuation component superimposing means.

4. A rule synthesis device that receives a character string or a phoneme symbol string and outputs synthesized speech according to phonetic or linguistic rules based on pre-stored dictionary information, speech segment information, and the like, the device being characterized by comprising: a fluctuation component generation means that generates a fluctuation component defined by the power or pitch generated by rule according to the sentence; a fluctuation component superimposing means that superimposes the fluctuation component obtained by the fluctuation component generation means on the vocal tract characteristics of each frame of a voiced section generated by rule, in synchronization with the frame average pitch; and a vocal tract filter means that receives the voiced sound source waveform for each pitch section and obtains synthesized speech with a filter that approximates the fluctuation-component-superimposed vocal tract characteristics obtained by the fluctuation component superimposing means.

5. A speech synthesis method that receives the frame average power, frame average pitch, and voiced/unvoiced information obtained by speech analysis, together with a 1-pitch-length speech waveform representing each frame in a voiced section determined by the voiced/unvoiced information, the method being characterized by comprising: a fluctuation component generation means that receives the frame average power or the frame average pitch and generates a fluctuation component defined by the power or pitch; a fluctuation component superimposing means that superimposes the fluctuation component obtained by the fluctuation component generation means on the 1-pitch-length speech waveform for each frame average pitch section; and a waveform superimposing means that obtains synthesized speech by superimposing the 1-pitch-length fluctuation-component-superimposed speech waveform obtained by the fluctuation component superimposing means at frame average pitch intervals.

6. A rule synthesis device that receives a character string or a phoneme symbol string and outputs synthesized speech according to phonetic or linguistic rules based on pre-stored dictionary information, speech segment information, and the like, the device being characterized by comprising: a fluctuation component generation means that generates a fluctuation component defined by the power or pitch generated by rule according to the sentence; a fluctuation component superimposing means that superimposes the fluctuation component obtained by the fluctuation component generation means on the 1-pitch-length speech waveform generated by rule for each pitch section; and a waveform superimposing means that obtains synthesized speech by superimposing the 1-pitch-length fluctuation-component-superimposed speech waveform obtained by the fluctuation component superimposing means at pitch intervals.