JP2821276B2

JP2821276B2 - Speech synthesizer

Info

Publication number: JP2821276B2
Application number: JP3055472A
Authority: JP
Inventors: 潤亀谷; 世光友竹
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1991-02-28
Filing date: 1991-02-28
Publication date: 1998-11-05
Anticipated expiration: 2013-11-05
Also published as: JPH04273300A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech synthesizer that synthesizes speech, and more particularly to an analysis-synthesis type speech synthesizer that can increase the utterance speed.

[0002]

2. Description of the Related Art

FIG. 2 shows the configuration of a conventional analysis-synthesis type speech synthesizer capable of increasing the utterance speed. In this speech synthesizer 11, a host CPU (central processing unit) 12 specifies the utterance speed and selects the speech data to be synthesized. The host CPU 12 sends the speech data file 13 an instruction identifying the speech units to be synthesized. The speech data file 13 stores speech unit data; it reads out the specified speech unit data and outputs them to the speech data memory 14. The speech data memory 14 is a storage medium into which the speech unit data are expanded so that they can be accessed immediately when requested by the host CPU 12.

[0003] Of the speech unit data stored in the speech data memory 14, the vocal tract characteristic parameters are temporarily stored in a vocal tract characteristic parameter buffer 15, and the sound source parameters are temporarily stored in a sound source parameter buffer 16. A buffer controller 17 controls the input and output of data to and from these buffers 15 and 16. Under the control of the buffer controller 17, the vocal tract characteristic parameters and the sound source parameters are supplied to a speech synthesis filter 18, and the synthesized speech is output from a speech output terminal 19.

[0004] In such a conventional speech synthesizer 11, two techniques have been used to increase the utterance speed. In the first technique, the rate at which the vocal tract characteristic parameters and the sound source parameters are transferred to the speech synthesis filter 18 is raised, and the D/A (digital-to-analog) conversion period is shortened accordingly. In the second technique, the vocal tract characteristic parameters and the sound source parameters transferred to the speech synthesis filter 18 are thinned out frame by frame, which increases the utterance speed.

[0005]

3. Problems to Be Solved by the Invention

The first technique raises the utterance speed simply by running everything faster, much like fast-forward playback of a music tape. As a result, it not only speeds up the utterance but also raises the pitch of the synthesized speech, so that the voice as a whole sounds higher.
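The pitch rise follows from simple arithmetic. As a worked illustration, assume a voiced segment whose waveform repeats every P samples at the original D/A conversion rate f_s, lasts T seconds, and is replayed with both the parameter transfer rate and the D/A rate multiplied by a factor alpha > 1 (these symbols are introduced here for illustration and do not appear in the patent):

```latex
F_0 = \frac{f_s}{P}, \qquad
F_0' = \frac{\alpha f_s}{P} = \alpha F_0, \qquad
T' = \frac{T}{\alpha}
```

With alpha = 1.5, for example, a voice at F_0 = 120 Hz comes out at 180 Hz while lasting two thirds as long, so in this technique the speed-up and the pitch rise cannot be separated.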

[0006] The second technique avoids the problem of the synthesized voice becoming higher in pitch. However, because it chooses the frames to be dropped uniformly, it may also drop frames that ought to be kept: frames whose duration inherently changes little with speaking rate, such as those in consonant segments, and frames whose transient changes carry important information, such as those of plosives. Consequently, when the utterance speed is increased, phonemic information is lost and the quality of the synthesized speech deteriorates.
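The drawback of uniform thinning can be made concrete with a short Python sketch. The function decimate_uniform and the frame labels are hypothetical illustrations, not taken from the patent; the sketch only assumes, as stated above, that the conventional method chooses which frames to drop by position alone.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def decimate_uniform(frames: Sequence[T], speed: float) -> List[T]:
    """Keep evenly spaced frames so that about len(frames) / speed remain.

    Hypothetical model of the conventional second technique: which frames
    are dropped depends only on their index, never on their content.
    """
    if speed < 1.0:
        raise ValueError("speed must be >= 1.0")
    kept: List[T] = []
    next_pos = 0.0  # position (in input frames) of the next frame to keep
    for i, frame in enumerate(frames):
        if i >= next_pos:
            kept.append(frame)
            next_pos += speed
    return kept
```

With frames labeled ["a", "a", "t-closure", "t-burst", "a", "a"] and a speed of 2.0, the function returns ["a", "t-closure", "a"]: the single frame carrying the plosive burst is discarded even though it holds the most important transient information.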

[0007] Accordingly, an object of the present invention is to provide a speech synthesizer that can speed up synthesized speech without changing either the pitch or the sound quality of the synthesized speech.

[0008]

SUMMARY OF THE INVENTION

According to the invention of claim 1, a speech synthesizer is provided with:
(a) a speech data memory that stores, from among speech unit data, the data corresponding to the speech to be synthesized;
(b) parameter separating means for separating the unit data stored in the speech data memory into sound source parameters and cepstrum parameters;
(c) a sound source parameter buffer that stores the sound source parameters separated by the parameter separating means;
(d) a cepstrum parameter buffer that stores the cepstrum parameters separated by the parameter separating means;
(e) pitch determining means that examines the cepstrum parameters in the cepstrum parameter buffer frame by frame and determines whether the speech data in the examined frame has a pitch, according to whether a peak indicating the presence of a pitch exists in the higher-order cepstral coefficients of that speech data;
(f) utterance speed increase instructing means for instructing an increase in utterance speed;
(g) pitch determining means activating means that activates the determination by the pitch determining means when the utterance speed increase instructing means instructs an increase in utterance speed;
(h) pitch pulse position marking means that, when the pitch determining means is activated by the pitch determining means activating means, detects the pitches in the sound source parameters on the basis of the pitch period (the interval between pitches) obtained by the pitch determining means, and marks the positions of the pitch pulses;
(i) high-speed sound source parameter thinning means that thins out the sound source parameters, in units of the data between the pitch pulses marked by the pitch pulse position marking means, according to the speed increase instructed by the utterance speed increase instructing means; and
(j) a speech synthesis filter that, when an increase in utterance speed is instructed by the utterance speed increase instructing means, synthesizes a speech signal using the sound source parameters remaining after thinning from the sound source parameter buffer and the cepstrum parameters from the cepstrum parameter buffer.

[0009] The proportion of data thinned out in each pitch section is adjusted according to the requested speed of the synthesized speech, and the sped-up synthesized speech is output, thereby achieving the above object.
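The exact mapping from the requested speed to the thinning ratio is left open, and in the embodiment frames without a pitch are never thinned. Under the simplifying assumption that voiced frames occupy a fraction v of the utterance and that thinning shortens them by a factor s, the output lasts (1 - v) + v/s of the original duration. The Python sketch below (function names hypothetical) evaluates this and solves for the s needed to reach a target overall speed.

```python
def overall_duration_ratio(voiced_fraction: float, pitch_section_speed: float) -> float:
    """Return output duration / input duration.

    Assumes only voiced frames are thinned and that thinning shortens the
    voiced material by the factor pitch_section_speed.
    """
    if not 0.0 <= voiced_fraction <= 1.0:
        raise ValueError("voiced_fraction must be in [0, 1]")
    if pitch_section_speed < 1.0:
        raise ValueError("pitch_section_speed must be >= 1")
    unvoiced = 1.0 - voiced_fraction
    return unvoiced + voiced_fraction / pitch_section_speed


def required_pitch_section_speed(voiced_fraction: float, target_speed: float) -> float:
    """Speed factor needed inside pitch sections to reach target_speed overall."""
    if target_speed < 1.0:
        raise ValueError("target_speed must be >= 1")
    # Solve (1 - v) + v / s = 1 / target_speed for s.
    remaining = 1.0 / target_speed - (1.0 - voiced_fraction)
    if remaining <= 0.0:
        raise ValueError("target speed unreachable by thinning voiced frames alone")
    return voiced_fraction / remaining
```

For example, if 80% of the frames are voiced and a 1.25x overall speed is requested, the sketch gives s = 4/3, which corresponds to dropping roughly one pitch period in four inside voiced sections. Because unvoiced frames are kept intact, very high overall speeds become unreachable, and the function reports this instead of returning a meaningless factor.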

[0010]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention will now be described in detail with reference to an embodiment.

[0011] FIG. 1 shows the configuration of a speech synthesizer according to an embodiment of the present invention. Like the conventional speech synthesizer 11 described with reference to FIG. 2, this speech synthesizer 21 includes the speech data file 13 and the speech data memory 14. The sound source parameters in the speech unit data are temporarily stored in the sound source parameter buffer 16, and the cepstrum parameters are temporarily stored in a cepstrum parameter buffer 22.

[0012] A host CPU 23, which specifies the utterance speed and selects the speech data to be synthesized, is connected to the speech data file 13, the speech data memory 14, and a buffer controller 24. In addition to the sound source parameter buffer 16 and the cepstrum parameter buffer 22, the buffer controller 24 is connected to a pitch determination unit 25, a pitch detection unit 26, and a speech synthesis filter 27.

[0013] The pitch determination unit 25 uses the cepstrum parameters to determine whether a pitch component is present. Based on the determination by the pitch determination unit 25, the pitch detection unit 26 detects the pitches within a pitch section from the sound source parameters. The host CPU 23 selects the data used for synthesis, controls the utterance speed, and so on. The buffer controller 24 controls the input and output of data for the cepstrum parameter buffer 22 and the sound source parameter buffer 16. The speech synthesis filter 27 synthesizes speech using the cepstrum parameters and the sound source parameters, and outputs the speech from the speech output terminal 19.

[0014] The operation of the speech synthesizer 21 configured as above is described next. First, assume that the host CPU 23 instructs speech to be uttered at normal speed. In this case, the speech unit data for the speech to be synthesized are read from the speech data memory 14. Each data set, organized in frame units, is separated into cepstrum parameters and sound source parameters; the cepstrum parameters are stored in the cepstrum parameter buffer 22 and the sound source parameters in the sound source parameter buffer 16. Speech is then synthesized from these parameters and output from the speech output terminal 19. In other words, in this case the pitch determination unit 25 and the pitch detection unit 26 are not activated, and the data in each frame are processed without thinning to synthesize the speech.

[0015] When the host CPU 23 instructs an increase in utterance speed, the instruction is input to the buffer controller 24, and the pitch determination unit 25 is activated. The pitch determination unit 25 examines the parameters in the cepstrum parameter buffer 22 frame by frame and determines whether the speech data in each examined frame has a pitch.
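The frame-by-frame check can be sketched in Python as follows. Consistent with the claims, a frame is treated as voiced when the higher-order (high-quefrency) cepstral coefficients contain a pronounced peak, and the peak's index is then taken as the pitch period in samples. The search range f0_min to f0_max, the threshold, and the function name judge_pitch are assumptions for illustration rather than values given in the patent, and the sketch assumes the stored cepstrum is long enough to reach the pitch quefrency.

```python
from typing import Optional, Sequence


def judge_pitch(
    cepstrum: Sequence[float],
    fs: float,
    f0_min: float = 60.0,   # assumed lowest plausible fundamental (Hz)
    f0_max: float = 400.0,  # assumed highest plausible fundamental (Hz)
    threshold: float = 0.1,  # assumed minimum peak height for "voiced"
) -> Optional[int]:
    """Return the pitch period in samples if the frame looks voiced, else None.

    cepstrum[q] is the real-cepstrum coefficient at quefrency q (in samples).
    Low-order coefficients describe the spectral envelope; a voiced frame
    additionally shows a peak at q equal to the pitch period.
    """
    q_min = int(fs / f0_max)  # shortest plausible pitch period
    q_max = min(int(fs / f0_min), len(cepstrum) - 1)  # longest, clipped to the data
    if q_min > q_max:
        return None  # too few high-order coefficients to search
    peak_q = max(range(q_min, q_max + 1), key=lambda q: cepstrum[q])
    return peak_q if cepstrum[peak_q] >= threshold else None
```

At fs = 8000 Hz the search covers quefrencies 20 to 133, so a cepstrum truncated to its first 20 coefficients can never report a pitch; this is why the check relies on the higher-order coefficients.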

[0016] Whether a pitch is present is determined by checking whether the higher-order cepstral coefficients contain a peak representing a pitch. If a pitch exists in the frame, the pitch detection unit 26 is activated. It then performs pitch detection on the sound source parameters of the same frame, using the pitch period obtained by the pitch determination unit 25 as a guide, and marks the positions of the pitch pulses. If no pitch exists in the frame, the pitch determination unit 25 does not activate the pitch detection unit 26, so the sound source parameters of that frame are not thinned out.
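Pitch pulse marking guided by the cepstral period could look like the Python sketch below. It assumes the sound source parameters of a frame are available as a sequence of excitation (residual) samples and that the period comes from the cepstral check; the name mark_pitch_pulses and the tolerance window are hypothetical. The period does the coarse work, and the excitation is only searched in a small window around each expected pulse position.

```python
from typing import List, Sequence


def mark_pitch_pulses(excitation: Sequence[float], period: int, tolerance: int = 3) -> List[int]:
    """Return sample indices of pitch pulses in one voiced frame of excitation.

    The first pulse is the largest-magnitude sample within the first period;
    each later pulse is the largest-magnitude sample within +/- tolerance of
    the previous pulse position plus the given pitch period.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    n = len(excitation)
    if n == 0:
        return []
    first_window = range(0, min(period, n))
    pos = max(first_window, key=lambda i: abs(excitation[i]))
    marks = [pos]
    while True:
        expected = pos + period
        lo = max(expected - tolerance, pos + 1)  # always move forward
        hi = min(expected + tolerance, n - 1)
        if lo > hi:
            break  # next expected pulse lies beyond the frame
        pos = max(range(lo, hi + 1), key=lambda i: abs(excitation[i]))
        marks.append(pos)
    return marks
```

For an excitation of 100 samples with pulses at indices 5, 45, and 86 and a period of 40, the sketch marks [5, 45, 86]: the small tolerance window absorbs the one-sample jitter between the second and third pulses.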

[0017] In this way, the processing for speeding up the synthesized speech is performed frame by frame. Based on the utterance speed specified by the host CPU 23, the buffer controller 24 thins out the sound source parameters in units of the data between the pitch pulses marked by the pitch detection unit 26, thinning out more the higher the speed, and transfers the remaining parameters from the sound source parameter buffer 16 to the speech synthesis filter 27. The speech synthesis filter 27 synthesizes a speech signal from the frame-unit data transferred from the cepstrum parameter buffer 22 and the sound source parameter buffer 16, and outputs it from the speech output terminal 19.
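The thinning performed by the buffer controller 24 can be sketched in Python as the removal of whole pulse-to-pulse segments. The even-spreading rule (an accumulator that drops a period whenever a full period's worth of removal has built up) and the name thin_pitch_periods are assumptions; the text above only states that more is thinned as the speed rises. Because every kept segment is a complete original pitch period, the pulse spacing, and hence the pitch, is unchanged, in contrast to the fast-forward technique.

```python
from typing import List, Sequence


def thin_pitch_periods(excitation: Sequence[float], marks: Sequence[int], speed: float) -> List[float]:
    """Drop whole pitch periods (pulse-to-pulse segments) from voiced excitation.

    Segments are excitation[marks[k]:marks[k + 1]]. Samples before the first
    mark and from the last mark onward are always kept. Roughly 1 - 1/speed of
    the segments are removed, spread evenly; the speed-up is only approximate
    when periods differ in length.
    """
    if speed < 1.0:
        raise ValueError("speed must be >= 1.0")
    if len(marks) < 2:
        return list(excitation)  # no complete period to drop
    out: List[float] = list(excitation[: marks[0]])
    drop_rate = 1.0 - 1.0 / speed
    budget = 0.0  # accumulated fraction of a period that should be removed
    for start, end in zip(marks, marks[1:]):
        budget += drop_rate
        if budget >= 1.0:
            budget -= 1.0  # remove this entire period
            continue
        out.extend(excitation[start:end])
    out.extend(excitation[marks[-1]:])
    return out
```

With marks at 0, 10, 20, 30, and 40 in a 50-sample excitation and a speed of 2.0, the segments [10:20] and [30:40] are removed, leaving 30 samples whose remaining pulses are still 10 samples apart.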

[0018]

EFFECTS OF THE INVENTION

As described above, according to the invention of claim 1, the pitch determining means uses the cepstrum parameters separated by the parameter separating means to determine whether the speech data in each examined frame has a pitch, and the utterance speed is controlled by thinning out the sound source parameters in pitch units. Consequently, not only is the configuration of the speech synthesizer simple, but the processing required when the utterance speed is increased is also simple and places little load on the processing system.

Continuation of front page: (58) Fields searched (Int.Cl.⁶, DB name): G10L 9/00; G10L 9/16

Claims

(57) [Claims]

1. A speech synthesizer comprising: a speech data memory that stores, from among speech unit data, the data corresponding to the speech to be synthesized; parameter separating means for separating the unit data stored in the speech data memory into sound source parameters and cepstrum parameters; a sound source parameter buffer that stores the sound source parameters separated by the parameter separating means; a cepstrum parameter buffer that stores the cepstrum parameters separated by the parameter separating means; pitch determining means for examining the cepstrum parameters in the cepstrum parameter buffer frame by frame and determining whether the speech data in the examined frame has a pitch, according to whether a peak indicating the presence of a pitch exists in the higher-order cepstral coefficients of that speech data; utterance speed increase instructing means for instructing an increase in utterance speed; pitch determining means activating means for activating the determination by the pitch determining means when the utterance speed increase instructing means instructs an increase in utterance speed; pitch pulse position marking means for, when the pitch determining means is activated by the pitch determining means activating means, detecting the pitches of the sound source parameters on the basis of the pitch period, namely the interval between the pitches determined by the pitch determining means, and marking the positions of the pitch pulses; high-speed sound source parameter thinning means for thinning out the sound source parameters, in units of the data between the pitch pulses marked by the pitch pulse position marking means, according to the speed increase instructed by the utterance speed increase instructing means; and a speech synthesis filter for synthesizing a speech signal, when an increase in utterance speed is instructed by the utterance speed increase instructing means, using the sound source parameters remaining after thinning from the sound source parameter buffer and the cepstrum parameters from the cepstrum parameter buffer.