JPH0376480B2

Info

Publication number: JPH0376480B2
Application number: JP56027313A
Authority: JP
Inventors: Hiroshi Yasuda; Yoichi Tamura; Masakatsu Toyoshima
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1981-02-26
Filing date: 1981-02-26
Publication date: 1991-12-05
Also published as: JPS57141698A

Description

[Detailed description of the invention]

The present invention relates to a speech synthesis device that synthesizes speech on the basis of the phoneme segment synthesis method.
Among human speech sounds, voiced sounds that involve vibration of the vocal cords, such as "a", vary with speaking rate, but over a short interval of several tens of milliseconds they consist of repetitions of waveforms of similar shape. To synthesize a voiced sound, therefore, it suffices to prepare several unit waveforms that serve as the units of repetition (the length L of a unit waveform is called the pitch; see FIG. 1), for example a waveform for "a", a waveform for "i", and so on.
Each unit waveform is then reproduced repeatedly, the required number of times, in a predetermined order. To synthesize an unvoiced sound that does not involve vibration of the vocal cords, such as the initial sound of "shi", either the unvoiced sound is prepared as it is or, as with voiced sounds, a waveform of a certain length is prepared and reproduced several times in succession. This kind of speech synthesis is called the phoneme segment synthesis method, and the waveform that serves as the unit of repetition is called a phoneme segment. In this method, the quality of the synthesized speech therefore depends on how efficiently phoneme segments with good acoustic (physical) characteristics can be extracted from the original speech.
One conventionally known method of creating phoneme segment data is as follows. To extract phoneme segments from the original speech, the original speech signal is A-D converted and stored in a memory; when needed, the signal is read out of the memory and supplied to a display device, and a unit waveform to serve as a phoneme segment is extracted while the original speech waveform is displayed. For example, since section K in FIG. 2 is a repetition of similar waveforms, any one of the unit waveforms in that section is extracted as the phoneme segment.
With this method, however, any of those waveforms can serve as the phoneme segment, so there is no criterion for choosing among them. Moreover, because the waveforms differ slightly from one another, always extracting the segment from a fixed position may yield a segment that bears no relation to the physical properties of the speech.
This method also represents section K by the frequency characteristic of a single unit waveform, which degrades sound quality. When section K is subjected to a discrete Fourier transform, the frequency-amplitude characteristic, that is, the spectral intensity characteristic, shown in FIG. 3 is obtained. The sharp peaks and dips in this characteristic arise because section K is a repetition of similar waveforms; the phonemic information of section K is carried by the spectral envelope shown by the broken line. If the frequency characteristic of section K is represented by that of a single unit waveform, the spectral envelope of FIG. 3 is not obtained, and sound quality deteriorates. A method of creating phoneme segment data by this kind of extraction therefore cannot achieve speech synthesis faithful to the original speech.
In another method, LPC (linear predictive coding) analysis is performed on each section in which the speech signal can be regarded as stationary, and the phoneme segment is created by LPC synthesis so that its spectral characteristic is the spectral envelope obtained by the LPC analysis of that section. Because this method assumes an all-pole model, however, it approximates the peaks of the spectral envelope of section K better than the valleys, as shown in FIG. 4. Phoneme segments such as nasals, for which sharp valleys in the spectral envelope play an essential role in perception, therefore cannot be formed by this method.
A further method exploits the fact that the human ear is relatively insensitive to spectral phase over a short interval (several tens of milliseconds) in order to reduce the memory capacity required. When the phoneme segment data are stored, the segment is converted into frequency-domain data by a discrete Fourier transform, the phase of the amplitude data is assigned to either 0 or π, and an inverse discrete Fourier transform then forms a signal with a symmetric waveform (an even function) such as that shown in FIG. 5; only one half of the axially symmetric data is stored. When phoneme segment data are formed in this way, however, the values at the two ends e of the symmetric waveform (FIG. 5) differ from one voiced sound to another and are not a fixed value. When speech is synthesized from such data, the junction between different phoneme segments V₁ and V₂ may therefore be discontinuous, as shown in FIG. 6, which degrades the quality of the synthesized speech.
One method of giving a phoneme segment a pitch is to pad it with "0" data. As shown in FIG. 7, for example, when a phoneme segment shorter than the pitch P₀ is to be given the length of the pitch P₀, the missing portion is filled with "0" data to create a phoneme segment having the desired pitch P₀. Because "0" data have been added, however, the resulting data differ from the original phoneme segment data; the frequency characteristic of the segment changes, and the quality of the synthesized speech inevitably deteriorates.
To extract phoneme segments from the original speech and create phoneme segment data without impairing the acoustic characteristics of the original speech, the arrangement shown in FIG. 8 may be used, for example. The original speech is first converted into an analog speech signal (an electrical signal) by a microphone or the like and supplied to a gate 1, which cuts out a section of suitable duration l (about 20 msec).
The section cut out (FIG. 9A) is converted by an A-D converter 2 into a digital signal Sa of a predetermined number of bits. The sampling rate of the A-D converter 2 is set to about 6 kHz. A subsequent circuit 3 determines from the digital signal Sa whether the original speech is voiced or unvoiced and outputs a discrimination output V/VL; if the speech is voiced, it also extracts the pitch period P₀ and outputs the corresponding data. For convenience, this output is likewise denoted P₀.
Next, the spectral envelope carrying the phonemic information of the phoneme segment is extracted from the digital signal Sa. A cepstrum analyzer 10 is used for this purpose. The digital signal Sa is first converted by a discrete fast Fourier transform into a frequency-domain power spectrum, which has a more direct physical meaning for speech (step (a)). The outline of this power spectrum Sb is shown in FIG. 9B. In step (b) the logarithm of the absolute value of the power spectrum Sb is taken; then, to extract the information of the spectral envelope Sc of the original speech shown in FIG. 3, the result is regarded as a signal waveform and subjected to an inverse fast Fourier transform (step (c)). The waveform obtained by the inverse transform is the cepstrum Se of FIG. 9C. In step (d), on the basis of the pitch period P₀, the data in the high-quefrency part at or above P₀ are set to zero (FIG. 9D). The resulting low-quefrency part Sf is subjected to a fast Fourier transform again in step (e), which yields the low-frequency component Sg of the transform output of step (a). In step (f) this component is converted back from the logarithm, giving the spectral envelope Sh (the first digital signal) shown in FIG. 9E.
The spectral envelope Sh extracted by cepstral analysis approximates the peaks and the valleys of the spectral envelope Sc of the original data equally well. It therefore retains the phonemic information of the original speech adequately, so that a consistent sound quality is assured whatever the original speech.
As described above, the plurality of digitized data obtained by the A-D converter 2 are supplied to the cepstrum analyzer 10, which forms a single spectral envelope Sh for the original speech. This spectral envelope Sh, however, contains no information about the pitch period P₀.
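The cepstral smoothing described above can be sketched in a few lines. The sketch is illustrative only and is not taken from the patent: it uses a naive DFT in place of the fast Fourier transform, and the names `dft`, `cepstral_envelope`, and `pitch_samples` (the pitch period P₀ expressed in samples), as well as the small floor added before the logarithm, are assumptions.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for the FFT in the text)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def cepstral_envelope(frame, pitch_samples):
    """Return the smoothed power-spectrum envelope of one frame of samples."""
    n = len(frame)
    # Power spectrum of the frame.
    power = [abs(c) ** 2 for c in dft(frame)]
    # Logarithm; the 1e-12 floor avoids log(0) and is an assumption.
    log_power = [math.log(p + 1e-12) for p in power]
    # Inverse transform of the log spectrum gives the cepstrum.
    cepstrum = [c.real for c in dft(log_power, inverse=True)]
    # Zero the quefrencies at or above the pitch period. The mirrored
    # low-quefrency bins are kept so the smoothed log spectrum stays real.
    liftered = [
        c if min(q, n - q) < pitch_samples else 0.0
        for q, c in enumerate(cepstrum)
    ]
    # Transform back to the (smoothed) log spectrum, then undo the logarithm.
    smooth_log = [c.real for c in dft(liftered)]
    return [math.exp(v) for v in smooth_log]
```

With `pitch_samples` equal to 1 only the zero-quefrency term survives and the envelope is completely flat, while any value greater than half the frame length keeps every cepstral coefficient and returns the power spectrum essentially unchanged.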
Accordingly, in step (g) the spectral envelope Sh is reorganized on the basis of the pitch period P₀. The data that make up the spectral envelope Sh are reorganized (resampled) by interpolation into fs·P₀ data (the second digital signal), that is, the number of data contained in one pitch period P₀ as determined by the sampling frequency fs of the A-D converter 2.
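As a rough illustration of this resampling, the sketch below maps an envelope onto round(fs·P₀) points. The text says only that interpolation is used; linear interpolation, the handling of the end points, and the names `resample_envelope`, `fs_hz`, and `pitch_period_s` are assumptions.

```python
def resample_envelope(envelope, fs_hz, pitch_period_s):
    """Resample `envelope` to round(fs * P0) points by linear interpolation."""
    m = round(fs_hz * pitch_period_s)
    if m < 2 or len(envelope) < 2:
        raise ValueError("need at least two input and two output points")
    # Spacing of the output points measured in input-sample units.
    step = (len(envelope) - 1) / (m - 1)
    out = []
    for i in range(m):
        pos = i * step
        lo = min(int(pos), len(envelope) - 2)   # left neighbour
        frac = pos - lo                         # distance past the left neighbour
        out.append(envelope[lo] * (1 - frac) + envelope[lo + 1] * frac)
    return out
```

For example, at fs = 6 kHz a pitch period of 1.5 msec corresponds to 9 data points.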
In the next step (h), a phoneme-determining element and an element that converts the data into an odd function are added to the data making up the spectral envelope Sh reorganized on the basis of the pitch period P₀. That is, the data making up the spectral envelope Sh are phase-converted into data on the imaginary axis (the +π/2 and −π/2 axes), so that the output waveform is always an odd function, and the phase of each datum is assigned to either the +π/2 axis or the −π/2 axis so as to concentrate the spectrum at a predetermined quefrency. The quefrency at which the spectrum concentrates depends on how the phases are assigned, and this difference corresponds to a difference in phoneme. The assignment therefore adds a phonemic element to the spectral envelope Sh. When the original speech is unvoiced, the phases may be assigned using random numbers or the like; the assignment therefore refers to the voiced/unvoiced discrimination output V/VL.
After the phoneme-determining element and the function conversion element have been added, the spectral envelope Sh is subjected in step (i) to an inverse discrete Fourier transform, which converts the frequency-domain data into time-domain data (the digital signal Si for the phoneme segment). The digital signal Si is information-compressed time-domain data that correspond to the speech section cut out, that is, to the original speech, and that span one pitch period P₀; its output waveform is antisymmetric (FIG. 9F). Data indicating the number of repetitions n of the same data (n = l/P₀) are then added, and in step (k) the result is stored in a memory device as the final phoneme segment data.
The same phoneme segment data processing is carried out on the next speech section, and the operation is repeated until the speech signal ends. To synthesize speech, the segment data stored in the memory or the like are each used n times in succession, and this is done in a prescribed order to produce the required speech.
Creating phoneme segment data in this way has the following advantages. First, because the cepstrum analyzer 10 extracts from the speech signal a spectral envelope Sh that approximates the peaks and valleys of the spectral envelope Sc equally well, almost all of the phonemic information of the speech can be stored as data, and a consistent sound quality is obtained. Second, because the data forming the spectral envelope Sh are reorganized with the pitch period P₀ information added, the frequency characteristic of the phoneme segment is not degraded, unlike the conventional approach of padding the segment with "0" data entirely unrelated to it in order to match the desired pitch period; the quality of the synthesized speech does not fall, and a sound closer to the original speech is obtained. Third, because the data are converted so that the waveform of the phoneme segment is antisymmetric, both ends e of the waveform are always zero. The waveforms of different phoneme segments therefore join without discontinuity, and the conventional loss of sound quality does not occur.
The present invention proposes a speech synthesis device in which phoneme segment data formed, for example, by the means described above can be stored efficiently in a memory device.
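The claim that placing the spectral data on the imaginary axis yields an antisymmetric waveform with zero end points can be checked numerically. The sketch below is only an illustration under assumptions: `odd_segment`, `amplitudes`, and `signs` are hypothetical names, a naive inverse DFT stands in for the inverse transform, and the DC and Nyquist bins are simply left at zero.

```python
import cmath
import math


def odd_segment(amplitudes, signs):
    """Build one pitch period from amplitudes placed on the imaginary axis.

    amplitudes[k-1] is the magnitude of harmonic k; signs[k-1] is +1 for a
    phase of +pi/2 and -1 for a phase of -pi/2.
    """
    half = len(amplitudes)
    n = 2 * (half + 1)                      # DC and Nyquist bins stay at zero
    spectrum = [0j] * n
    for k, (a, s) in enumerate(zip(amplitudes, signs), start=1):
        spectrum[k] = 1j * s * a            # phase of +/- pi/2: purely imaginary
        spectrum[n - k] = -1j * s * a       # conjugate bin keeps the waveform real
    wave = []
    for t in range(n):
        acc = sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n))
        wave.append((acc / n).real)
    return wave
```

Whatever signs are chosen, the first sample is zero and sample t is the negative of sample n − t, so consecutive segments meet at zero.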
To this end, the phoneme segment data used in this invention are data whose analog waveform, when the data are converted to analog form, is a point-symmetric waveform made into an odd function as described above, and they are stored as differential PCM data. Since the analog waveform of the phoneme segment data is point-symmetric, as in FIG. 10 for example, only the phoneme segment data for the first half (half cycle) S_IF of the pair of point-symmetric waveforms are stored; the phoneme segment data for the remaining second half S_IB are restored from those of the first half S_IF.
When the data are restored and the number of sampling points is even, as in FIG. 11A, the phoneme segment data at the last sampling point q of the first half S_IF are not used to restore the second half S_IB. When the number of sampling points is odd, as in FIG. 11B, the phoneme segment data at the last sampling point q must also be used as the data for the first sampling point q of the second half S_IB when it is restored. As shown in the data storage format of FIG. 12, a code X is therefore assigned, separately from the phoneme segment data D_A, D_B, ..., to indicate whether the number of sampling points of the phoneme segment is even or odd, that is, whether the number of phoneme segment data D_A1, D_A2, ..., D_B1, ... is odd or even.
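The restoration of the second half from the stored first half can be illustrated as follows. This is a sketch on sample values rather than on the differential PCM data that are actually stored, and it assumes that "odd" and "even" refer to the number of sampling points in one full cycle; the function name `restore_full_cycle` and the flag `odd_count` are hypothetical.

```python
def restore_full_cycle(first_half, odd_count):
    """Rebuild one point-symmetric cycle from its stored first half.

    first_half[0] is the zero-valued start point. When `odd_count` is true the
    last stored sample is mirrored as well; otherwise it is the centre sample
    of the cycle and appears only once.
    """
    mirrored = first_half[1:] if odd_count else first_half[1:-1]
    # The second half is the first half reversed in time and inverted in sign.
    return list(first_half) + [-v for v in reversed(mirrored)]
```

For a six-sample cycle the stored half `[0, 3, 5, 0]` expands to `[0, 3, 5, 0, -5, -3]`; for a five-sample cycle the stored half `[0, 3, 5]` expands to `[0, 3, 5, -5, -3]`.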
As noted above, the phoneme segment data are stored as differential PCM. As FIG. 10 shows, however, the difference values (+7, −2, and so on) vary widely with the waveform of the phoneme segment, so in this example the number of bits of the differential PCM data is changed according to the magnitude of the difference value. In the embodiment, 8 bits are assigned when the absolute value of the difference is 7 or more, and 4 bits are assigned when it is smaller. The phoneme segment data D_A, D_B, ... may therefore have unequal bit lengths, as shown in FIG. 12.
When the data have unequal bit lengths, 4-bit data must be restored to 8-bit data. An identification code for this purpose is carried by the code X, which also identifies the number of data as described above. An example of the relationship between the number of bits, the number of data, and the code X is shown in Table 1 below.
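The choice between 4-bit and 8-bit differential PCM can be sketched as below. Several points are assumptions rather than statements of the patent: the width is chosen once per phoneme segment, two 4-bit values are packed per byte with the first value in the upper nibble, an odd number of 4-bit values is padded with a zero nibble, and the names `encode_segment` and `pack` are hypothetical.

```python
def encode_segment(samples):
    """Return (bits_per_delta, deltas) for one stored half-wave."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    # 8 bits as soon as any difference has an absolute value of 7 or more.
    bits = 8 if any(abs(d) >= 7 for d in deltas) else 4
    limit = 1 << (bits - 1)
    if any(not -limit <= d < limit for d in deltas):
        raise ValueError("difference does not fit in 8 bits")
    return bits, deltas


def pack(bits, deltas):
    """Pack two's-complement deltas into bytes, two per byte when bits == 4."""
    mask = (1 << bits) - 1
    codes = [d & mask for d in deltas]
    if bits == 8:
        return bytes(codes)
    if len(codes) % 2:
        codes.append(0)                     # pad the final nibble (assumption)
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))
```

A segment whose differences are all small, such as `[0, 3, 5, 4]`, then needs half a byte per difference, while one containing a difference of +7 falls back to one byte per difference.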

[Table]
The code X consists of 4 bits and is assigned according to the number of bits and the number of data. Table 1 also shows a code Y in addition to the code X. This is a repetition code that indicates how many times the phoneme segment data D_A, D_B, ... are to be used in succession during speech synthesis; since it consists of 4 bits in this example, it can specify a repetition count n from 0 to 15. The codes X and Y together therefore form 8-bit data (see FIG. 12).
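Combining the two 4-bit codes into one byte can be sketched as follows. The nibble order (X in the upper four bits, Y in the lower four) is an assumption, since FIG. 12 is not reproduced in the text, and `pack_xy` and `unpack_xy` are hypothetical names. The code X is treated as an opaque 4-bit value.

```python
def pack_xy(code_x, repeat_n):
    """Combine the 4-bit code X and the 4-bit repetition count into one byte."""
    if not 0 <= code_x <= 0xF:
        raise ValueError("code X must fit in 4 bits")
    if not 0 <= repeat_n <= 15:
        raise ValueError("repetition count must be between 0 and 15")
    return (code_x << 4) | repeat_n


def unpack_xy(byte):
    """Split one byte back into (code X, repetition count)."""
    return byte >> 4, byte & 0x0F
```

Under this assumed layout, X = 0101 with a repetition count of 12 is stored as the byte 5C in hexadecimal.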
The delimiter code Z, the identification codes X and Y, and the phoneme segment data D_A together make up a unit block of data (FIG. 12), and a large number of such blocks are stored in a memory device 20. Combining several unit blocks, and hence several sets of phoneme segment data, forms a speech piece such as a word or a complete sentence. A delimiter code Z is therefore placed at the beginning and end of each phoneme segment and at the end of each speech piece to mark these positions. The delimiter code Z is written in hexadecimal; in this example the delimiter code for a phoneme segment is "80" and that for a speech piece is "7F". The delimiter code Z thus takes either the phoneme segment code or the speech piece code. FIG. 12 shows an example of the storage format of the identification codes X and Y, the delimiter code Z, and the phoneme segment data D (D_A, D_B, ...). In this figure each block is 8-bit data, and the data are stored in the memory device in this format. The phoneme segment data D cover not only sounds such as "a", "i", and so on but also semi-voiced sounds, contracted sounds, and the like.
A speech synthesis device such as that shown in FIG. 13 may be used to synthesize the desired speech from the output of this memory device. In FIG. 13, 20 is the memory device in which the phoneme segment data D and so on are stored, 30 is an input device, and 40 is a sequence controller; 50 is a speech piece address table and 60 is a speech piece code table. The desired data read from the memory device 20 are supplied to a read register 21 and then to a data restorer 22, where 4-bit data are restored to 8-bit data.
For this purpose, the data taken into the register 21 are supplied to a data identification circuit 23, which determines the content of the data and identifies the codes X, Y, and Z; the result is applied to the sequence controller 40. The various control signals output by the controller 40 control the speech synthesis device as a whole.
The identification circuit 23 first determines the content of the data in the read register 21. If they are 8-bit data, they are supplied unchanged to a subsequent accumulator 24 (which includes a demodulation register) under a control signal Pa; if they are 4-bit data, they are restored to 8-bit data under the control signal Pa. When the 4-bit data are a positive value such as "0011", the upper four bits "0000" are added on the MSB side to restore 8-bit data, which are then supplied to the accumulator 24. Conversely, when they are a negative value such as "1100", the upper four bits "1111" are added on the MSB side to restore 8-bit data, which are then supplied to the accumulator 24.
The accumulator 24 reproduces the waveform. Since the phoneme segment data D correspond to the first half S_IF of the waveform S_I, the data corresponding to the second half S_IB are formed by adding the phoneme segment data D into the demodulation register of the accumulator 24 in a forward pass followed by a return pass. The number of additions of the data D is controlled by the control signal Pa, which is based on the detected identification code X for the number of data (see FIG. 11). The number of repetitions of the reproduced one-cycle phoneme segment data is determined by a control signal Pc, which is based on the detected repetition code Y.
The one-cycle phoneme segment data are demodulated into an analog speech signal by a demodulation register (D-A converter) 25, and a low-pass filter 26 then extracts the speech signal in the desired audio frequency band. If each addition in the demodulation register 25 is carried out as M additions at M times the speed, the apparent sampling rate is raised M-fold. The cutoff frequency of the low-pass filter 26 is then also M times higher, so the filtering can be done without a filter having a steep characteristic at the cutoff point.
In this way, the desired phoneme segment data D are read out and synthesized on the basis of data from the input device 30, and the speech signal of the desired speech piece or sentence is obtained at an output terminal 27.
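The 4-bit to 8-bit restoration and the forward-and-return accumulation described above can be sketched as follows. The function names and the `reuse_last` flag are hypothetical; in particular, how the code X maps onto `reuse_last` depends on Table 1 and FIG. 11, which are not reproduced in the text, so the flag is simply passed in.

```python
def sign_extend_nibble(nibble):
    """Restore a 4-bit two's-complement value to 8 bits by copying its sign bit."""
    return (nibble | 0xF0) if nibble & 0x8 else nibble


def to_signed8(byte):
    """Interpret an 8-bit pattern as a signed integer."""
    return byte - 256 if byte & 0x80 else byte


def reconstruct_cycle(deltas, reuse_last):
    """Accumulate the stored differences forward, then backward.

    For a point-symmetric waveform the differences of the second half are the
    differences of the first half in reverse order, so one stored half is
    enough. `reuse_last` says whether the last stored difference is added a
    second time on the return pass.
    """
    backward = deltas[::-1] if reuse_last else deltas[-2::-1]
    value = 0
    wave = [value]
    for d in list(deltas) + list(backward):
        value += d
        wave.append(value)
    return wave
```

The returned list starts and ends at zero; the final zero is also the starting point of the next cycle, which is why successive segments join without a step.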
As described above, in the speech synthesis device according to this invention the phoneme segment data are stored in the form of unequal-length bits, so the memory capacity can be made small without degrading sound quality. Conversely, more phoneme segment data can be stored in the memory, so sound quality can be improved further.

[Brief explanation of drawings]

FIG. 1 shows an example of the waveform of a voiced sound; FIG. 2 is an explanatory diagram of phoneme segment extraction; FIG. 3 is a power spectrum diagram of a phoneme segment; FIG. 4 is a power spectrum diagram based on LPC synthesis; FIG. 5 is an explanatory diagram of a waveform obtained by making a phoneme segment symmetric; FIG. 6 is an explanatory diagram of synthesis with this symmetric waveform; FIG. 7 is an explanatory diagram of pitch addition; FIG. 8 is a signal processing system diagram showing an example of a method of creating phoneme segment data; FIG. 9 is a waveform diagram used to explain its operation; FIGS. 10 and 11 are diagrams used to explain this invention; FIG. 12 shows an example of a data storage format; and FIG. 13 is a system diagram of the main part of an example of a speech synthesis device according to this invention.
10: cepstrum analyzer; P₀: pitch period; n: number of repetitions of P₀; Sh: spectral envelope; 20: memory device; 21: read register; 22: data restorer; 24: accumulator; 30: input device; 40: sequence controller.

Claims

[Claims]

1. A speech synthesis device that synthesizes speech on the basis of phoneme segment data formed by differential PCM, characterized in that the differential PCM data representing each phoneme segment are stored in a memory in the form of unequal-length bits together with a delimiter code for separating the phoneme segment data and a repetition code indicating the number of repetitions of the phoneme segment data, and in that the unequal-length bits, the delimiter code, and the repetition code corresponding to each phoneme segment are read out from the memory and the desired speech is synthesized by phoneme segment synthesis.