JP2614436B2

JP2614436B2 - Speech synthesizer

Info

Publication number: JP2614436B2
Application number: JP60031814A
Authority: JP
Inventors: 雅久清水; 徹北村
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1985-02-20
Filing date: 1985-02-20
Publication date: 1997-05-28
Anticipated expiration: 2012-05-28
Also published as: JPS61190397A

Description

[Detailed Description of the Invention]

A) Field of Industrial Application

The present invention relates to a speech synthesizer that synthesizes speech using parameters representing the characteristics of a sound source and a vocal tract.

B) Prior Art

Speech synthesis methods fall into two groups. Waveform-processing methods such as Δ-PCM and ADPCM compress the sound source waveform by differencing, adaptive quantization, and the like, store it as speech data, and use it for synthesis. Generation-source methods such as LPC and PARCOR, as shown in Japanese Patent Publication No. 52-7282, extract parameters representing the spectral envelope of the vocal tract and synthesize speech from those parameters and a sound source wave. At present, the generation-source methods, which offer a high data compression ratio, are predominant.

In general LPC and PARCOR methods, speech is analyzed to extract, at fixed time intervals, data such as the LPC or PARCOR coefficients, the fundamental frequency (pitch), and the amplitude, and this data is stored in memory as speech data. The fixed interval is called the frame period. At synthesis time, as shown schematically in FIG. 2, the sound source wave (1) is generated from the fundamental frequency (pitch) data and the amplitude stored as speech data. The speech waveform (4) is synthesized by feeding this sound source wave (1) into a digital filter (3) configured by the LPC or PARCOR coefficients (2).

Conventionally, impulse waves, triangular waves, residual waves, and the like have been used as this sound source wave. Of these, the residual wave reproduces the original speech more faithfully and gives better sound quality than an impulse wave or a triangular wave, but it has the drawback of requiring a large amount of memory to store the residual wave.

Various attempts are currently being made to reduce the memory needed to store the residual wave, but there has been a limit to how far the required memory can be reduced.

C) Problems to be Solved by the Invention

The present invention has been made in view of the above points. It basically uses an impulse wave, a triangular wave, or the like as the sound source, and reproduces the fundamental frequency of the original speech more faithfully while keeping the amount of memory small.

The problem is described here taking as an example the case where an impulse wave is used as the sound source.

FIG. 3 illustrates the conventional speech synthesis process. An impulse wave (5) is used as the sound source in voiced portions of the original speech, and white noise (6) is used in unvoiced portions. (7) is a switch that selects the sound source according to whether the sound is voiced or unvoiced.

FIG. 4 shows the sound source waveform in a voiced portion. In this figure the horizontal axis is time t and the vertical axis is amplitude A. The interval P between impulses is the pitch period, which determines the fundamental frequency.

In the conventional method, however, the impulse wave is generated at timings synchronized with the sampling frequency (for example, 10 kHz). The impulse interval (pitch) is therefore an integer multiple of 100 μsec, and the fundamental frequency can take only the values 10000/n Hz (n an integer). Consequently, even if original speech with a fundamental frequency pattern like that of FIG. 5 is analyzed and its fundamental frequency is extracted correctly, the synthesized speech has only the low-accuracy fundamental frequency pattern shown in FIG. 6.
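The restriction to 10000/n Hz can be illustrated with a short calculation. The following is a minimal Python sketch and is not part of the patent; the function names are hypothetical, and the 10 kHz rate is the example value given above.

```python
def representable_f0(fs_hz, f_min_hz, f_max_hz):
    """List the fundamental frequencies fs/n (n a whole number of samples) in a band."""
    n_max = int(fs_hz / f_min_hz) + 1
    return [fs_hz / n for n in range(1, n_max + 1)
            if f_min_hz <= fs_hz / n <= f_max_hz]


def nearest_f0(fs_hz, target_hz):
    """Round the pitch period to whole samples and return (n, resulting F0)."""
    n = max(1, round(fs_hz / target_hz))
    return n, fs_hz / n


# Around 200 Hz at a 10 kHz sampling rate only three values fall within 195-205 Hz.
band = representable_f0(10000, 195, 205)
# A pitch of 202 Hz cannot be represented and falls back to 200 Hz.
n, f0 = nearest_f0(10000, 202.0)
```

Under these assumptions the grid near 200 Hz has a spacing of roughly 4 Hz, which is the coarseness the passage attributes to the conventional method.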

D) Means for Solving the Problems

The speech synthesizer of the present invention extracts the fundamental frequency of speech and parameters representing the characteristics of the vocal tract from the original speech in advance and stores them in a memory, generates a sound source wave from the fundamental frequency read from the memory, and synthesizes speech by applying the sound source wave to a digital filter whose coefficients are the parameters representing the characteristics of the vocal tract. It is characterized in that the sound source wave consists of basic waveforms whose basic period is the reciprocal of the fundamental frequency, each basic waveform consists of two pulse waves adjacent to each other at a predetermined position, and, in each basic waveform, the amplitudes of the two adjacent pulse waves are controlled independently and variably so as to shift the center of gravity of the two pulse waves, whereby speech is synthesized with the frequency value varied within a predetermined frequency range centered on the fundamental frequency.

E) Operation

According to the speech synthesizer of the present invention, the sound source wave consists of basic waveforms whose basic period is the reciprocal of the fundamental frequency, and each basic waveform consists of two pulse waves adjacent to each other at a predetermined position. In each basic waveform, the amplitudes of the two adjacent pulse waves are controlled independently and variably so as to shift the center of gravity of the two pulse waves, and speech is synthesized with the frequency value varied within a predetermined frequency range centered on the fundamental frequency. This makes it possible to synthesize more natural human speech.

F) Embodiment

The sound source wave used in the speech synthesizer of the present invention consists, for example, of two consecutive impulse waves (a) and (b), as shown in FIG. 7. The ratio α (= Ab/Aa) of the amplitude Ab of the second impulse wave (b) to the amplitude Aa of the first impulse wave (a) is represented by, for example, 2 bits of data. When α takes the values 0, 1/3, 1, and 3, the sound source wave is as shown in FIG. 8 (i), (ii), (iii), and (iv) respectively, and the center of gravity indicated by the arrows moves between the two impulse waves (a) and (b). In FIG. 8 the horizontal axis is time and the vertical axis is waveform amplitude.
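The choice of α values can be checked numerically. The sketch below is an illustration rather than the patent's own procedure: it assumes that the center of gravity is the amplitude-weighted mean of the two pulse positions, that the pulses sit one sample apart, and that the 2-bit code indexes the four α values in the order listed (the table name and function name are hypothetical).

```python
from fractions import Fraction

# Hypothetical mapping from the 2-bit code to alpha = Ab / Aa (order assumed).
ALPHA_TABLE = [Fraction(0), Fraction(1, 3), Fraction(1), Fraction(3)]


def centroid_offset(alpha):
    """Offset of the center of gravity from pulse (a), in samples.

    With pulse (a) at position 0 and pulse (b) at position 1, the
    amplitude-weighted mean is (0*Aa + 1*Ab) / (Aa + Ab) = alpha / (1 + alpha).
    """
    return alpha / (1 + alpha)


offsets = [centroid_offset(a) for a in ALPHA_TABLE]
```

Under these assumptions the four α values place the center of gravity at 0, 1/4, 1/2, and 3/4 of a sample after pulse (a), that is, in even quarter-sample steps.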

In FIG. 8, with a sampling frequency of 8 kHz and 40 samples per period, the basic period is T = 5 msec (fundamental frequency F = 200 Hz). Assuming that waveforms like those of FIG. 8 follow one another, the center of gravity of each waveform is at the arrow position, and the fundamental frequency is then the reciprocal of the distance between the centers of gravity of successive waveforms, that is, of the distance between arrows (the basic period). Specifically, F1 = 1/T1, F2 = 1/T2, and F3 = 1/T3 are the fundamental frequencies.

Accordingly, T1 < T, T2 > T, and T3 > T. Converting these to frequencies and plotting frequency on the vertical axis against time on the horizontal axis gives FIG. 9. That is, the frequencies F1, F2, and F3 are distributed within a predetermined frequency range centered on a given fundamental frequency F, which makes it possible to synthesize more natural human speech.
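The relation between the distance between centers of gravity and the resulting frequency can be sketched as follows. This is an illustrative calculation and does not reproduce the specific T1, T2, and T3 of FIG. 8; it assumes that the center of gravity of a pulse pair lies α/(1+α) samples after pulse (a), so that α = 1/3 gives an offset of 0.25 sample. The function name is hypothetical.

```python
def effective_f0(fs_hz, period_samples, offset_prev, offset_next):
    """F0 implied by the distance between the centers of gravity of two successive pulse pairs.

    offset_prev and offset_next are the fractional-sample offsets of the
    center of gravity in the earlier and later pulse pair (assumed model).
    """
    distance_samples = period_samples + offset_next - offset_prev
    return fs_hz / distance_samples


FS_HZ = 8000   # sampling frequency from the text
PERIOD = 40    # samples per basic period, T = 5 msec

nominal = effective_f0(FS_HZ, PERIOD, 0.0, 0.0)    # both pairs with alpha = 0
lowered = effective_f0(FS_HZ, PERIOD, 0.0, 0.25)   # alpha goes 0 -> 1/3
raised = effective_f0(FS_HZ, PERIOD, 0.25, 0.0)    # alpha goes 1/3 -> 0
```

For comparison, changing the period by a whole sample (39 or 41 samples) moves the frequency to about 205.1 Hz or 195.1 Hz, whereas the quarter-sample shift above moves it by only about 1.25 Hz either side of 200 Hz.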

FIG. 1 shows an embodiment of the speech synthesizer of the present invention that performs the sound source wave control described above. The operation of this device is described next.

When a speech output instruction is given from outside to the memory read control unit (9), the memory read control unit (9) sets the start address of the speech data stored in the speech data memory (11) in the address counter (10) and transfers the contents of the speech data memory (11) sequentially to the buffer (12).

The speech data stored in the speech data memory (11) and the buffer (12) is laid out as shown in FIG. 10. In that figure, α is the amplitude ratio of the adjacent impulse waves described above, and it is used to control the fundamental frequency more finely by shifting the center of gravity of the impulse waves. P denotes the fundamental frequency data (pitch interval), A the amplitude data, and K0 to K9 the PARCOR coefficients. The speech data transferred to the buffer (12) is split into the parameters α, P, A, and K0 to K9, and at each fixed period (in synchronization with the frame) the values of A, α, and P are stored in the A register (13), the α register (14), and the P register (15) respectively.

Meanwhile, the PARCOR coefficients K0 to K9 are sent to the digital filter (16) and used as its coefficients. The impulse wave generation unit (17) generates an impulse wave in one of the forms of FIG. 8 (i) to (iv), as described above, according to the contents of the A register (13), the α register (14), and the P register (15).

The sound source switching unit (18) switches the sound source according to the contents of the P register (15), so that the output of the impulse wave generation unit (17) is supplied to the digital filter (16) for voiced sound and the output of the white noise generation unit (19) is supplied for unvoiced sound. The output of the digital filter (16) is output as synthesized speech through the DA converter (20).
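The signal path described above (source selection followed by the PARCOR filter) might be sketched as below. This is a simplified model under several assumptions that the text does not state: P = 0 is taken to mark an unvoiced frame, the pulse pair is scaled so that Aa + Ab = A, the filter is a standard all-pole lattice with one common sign convention for the coefficients, and pitch phase is not carried across frames. All function names are hypothetical.

```python
import random


def excitation(p, a, alpha, n, rng=random):
    """Return n excitation samples for one frame (assumed encoding: p == 0 is unvoiced)."""
    if p == 0:
        # White noise source for unvoiced frames.
        return [a * rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Assumed scaling: the two pulse amplitudes sum to A and Ab / Aa = alpha.
    aa = a / (1.0 + alpha)
    ab = a * alpha / (1.0 + alpha)
    out = [0.0] * n
    for start in range(0, n, p):
        out[start] = aa
        if start + 1 < n:
            out[start + 1] = ab
    return out


def lattice_step(k, b, x):
    """Process one sample through an all-pole lattice; b holds the delayed backward errors."""
    f = x
    for m in range(len(k) - 1, -1, -1):
        f -= k[m] * b[m]               # forward error of the next lower stage
        if m + 1 < len(b):
            b[m + 1] = b[m] + k[m] * f  # backward error, used at the next sample
    b[0] = f
    return f


def synthesize_frame(k, state, p, a, alpha, n):
    """Filter one frame of excitation; state is updated in place across frames."""
    return [lattice_step(k, state, x) for x in excitation(p, a, alpha, n)]


# With all coefficients zero the filter passes the pulse pairs through unchanged.
passthrough = synthesize_frame([0.0] * 10, [0.0] * 10, 4, 1.0, 1.0, 8)
# A single coefficient of 0.5 gives the first-order response 1, -0.5, 0.25.
b1 = [0.0]
first_order = [lattice_step([0.5], b1, x) for x in (1.0, 0.0, 0.0)]
```

The in-place state update works because the stages are visited from the highest order down, so each delayed value is read before it is overwritten.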

G) Effects of the Invention

As is clear from the above description, the speech synthesizer of the present invention extracts the fundamental frequency of speech and parameters representing the characteristics of the vocal tract from the original speech in advance and stores them in a memory, generates a sound source wave from the fundamental frequency read from the memory, and synthesizes speech by applying the sound source wave to a digital filter whose coefficients are the parameters representing the characteristics of the vocal tract. In this synthesizer, the sound source wave consists of basic waveforms whose basic period is the reciprocal of the fundamental frequency, and each basic waveform consists of two pulse waves adjacent to each other at a predetermined position. In each basic waveform, the amplitudes of the two adjacent pulse waves are controlled independently and variably so as to shift the center of gravity of the two pulse waves, and speech is synthesized with the frequency value varied within a predetermined frequency range centered on the fundamental frequency. More natural human speech can therefore be synthesized.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of the speech synthesizer of the present invention; FIGS. 2 and 3 are schematic diagrams each showing a speech synthesis method; FIG. 4 is a diagram of the impulse waveform used as a sound source; FIG. 5 is a fundamental frequency characteristic diagram of original speech; FIG. 6 is a fundamental frequency pattern diagram of speech synthesized by the conventional method; FIG. 7 is a diagram of the sound source waveform used in the present invention; FIGS. 8 (i), (ii), (iii), and (iv) are sound source waveform diagrams each showing a variation of the sound source waveform of FIG. 7; FIG. 9 is a fundamental frequency pattern diagram of speech synthesized by the speech synthesizer of the present invention; and FIG. 10 is a memory map.

(11): speech data memory; (13): A register; (14): α register; (15): P register; (16): digital filter; (17): impulse wave generation unit; (19): white noise generation unit.

Claims

(57) [Claims]

1. A speech synthesizer in which a fundamental frequency of speech and parameters representing characteristics of a vocal tract are extracted in advance from original speech and stored in a memory, a sound source wave is generated from the fundamental frequency read from the memory, and speech is synthesized by applying the sound source wave to a digital filter whose coefficients are the parameters representing the characteristics of the vocal tract, characterized in that the sound source wave consists of basic waveforms having the reciprocal of the fundamental frequency as their basic period, each basic waveform consists of two pulse waves adjacent to each other at a predetermined position, and, in each basic waveform, the amplitudes of the two adjacent pulse waves are controlled independently and variably so as to change the center of gravity of the two adjacent pulse waves, whereby speech is synthesized with the frequency value varied within a predetermined frequency range centered on the fundamental frequency.