JPH0690638B2 - Speech analysis method

Info

Publication number: JPH0690638B2
Application number: JP61148418A
Authority: JP
Inventors: 寛治国澤; 博糸山
Original assignee: Matsushita Electric Works Ltd
Current assignee: Panasonic Electric Works Co Ltd
Priority date: 1986-06-25
Filing date: 1986-06-25
Publication date: 1994-11-14
Anticipated expiration: 2009-11-14
Also published as: JPS635398A

Description

[Detailed Description of the Invention] [Technical Field] The present invention relates to a speech analysis method for extracting feature parameters, such as vocal tract parameters, pitch parameters, and amplitude parameters, from a speech waveform.

[Background Art] In general, as shown in FIG. 4, a speech analysis and synthesis system comprises a speech analyzer X, which extracts feature parameters by analyzing the waveform of an input original speech Vi and outputs a speech reproduction code SC in which those feature parameters are encoded; a code storage and transmission means Y, which stores the speech reproduction code SC in a memory and reads it out and transmits it as needed; and a speech synthesizer Z, which synthesizes speech on the basis of the input speech reproduction code and outputs the synthesized speech from a speaker SP. One such analysis and synthesis method uses a linearly separable equivalent circuit model that separates the generation of the sound source from the articulation produced by vocal tract resonance. In a speech synthesizer Z based on this method, the residual waveform obtained by passing the original speech through a filter having the inverse characteristic of the vocal tract articulation equivalent filter is used as the sound source waveform, and the original speech is reproduced by controlling the vocal tract articulation equivalent filter, which is formed on the basis of the extracted vocal tract parameters. If the residual waveform obtained while the speech analyzer X extracts the vocal tract parameters is used unchanged as the sound source waveform, the synthesized speech waveform obtained by passing this sound source waveform through the vocal tract articulation equivalent filter is exactly the same as the original speech waveform.

However, using the residual waveform as the sound source waveform in this way requires a large amount of speech reproduction information (in particular, the information needed to reproduce the sound source waveform), so the storage and transmission system for that information becomes complex and large, and a practical system cannot be obtained. Conventionally, therefore, the speech reproduction information is compressed by generating an approximation of the residual waveform as the sound source waveform on the basis of three sound source control feature parameters extracted from the residual waveform: a pitch parameter P, a voiced/unvoiced switching parameter V/U, and an amplitude parameter A. Compared with using the residual waveform itself as the sound source waveform, this achieves substantial information compression, which makes hardware implementation easy and reduces cost. Because the sound source waveform is only an approximation generated from the compressed feature parameters (pitch parameter P and amplitude parameter A), however, the synthesized speech waveform is significantly degraded.

FIG. 5 shows an example of the speech synthesizer Z. The sound source generation unit 10 consists of an unvoiced sound generation unit 11, a voiced sound generation unit 12, a sound source switching unit 13, and an amplitude control unit 14, and generates a sound source signal whose waveform is set by the sound source control feature parameters P, V/U, and A. The sound source signal output from the sound source generation unit 10 is input to a vocal tract articulation equivalent filter 15, whose filter characteristic is set by the vocal tract parameters K (for example, partial correlation coefficients k_1 to k_n). The filter imparts the speech spectral envelope, that is, the macroscopic frequency characteristic, and a synthesized speech signal is obtained. Within the sound source generation unit 10, the unvoiced sound generation unit 11 generates an unvoiced sound source signal consisting of white noise, and the voiced sound generation unit 12 generates a voiced sound source signal of a given period by repeating a one-pitch waveform stored in memory at the interval set by the pitch parameter P. The sound source switching unit 13 selects whether the unvoiced or the voiced sound is output according to the voiced/unvoiced switching parameter V/U, and the amplitude control unit 14 controls the intensity (that is, the amplitude) of the selected signal on the basis of the amplitude parameter A and outputs the sound source signal.
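The statement that the residual, fed back through the vocal tract articulation equivalent filter, reproduces the original speech exactly can be illustrated with a minimal sketch. It assumes direct-form linear prediction coefficients `a` rather than the partial correlation coefficients mentioned in the text, and the function names `inverse_filter` and `synthesis_filter` are hypothetical, chosen only for this illustration.

```python
def inverse_filter(speech, a):
    """Prediction-error (inverse) filter: e[n] = s[n] - sum_k a[k] * s[n-1-k].

    `a` holds assumed direct-form predictor coefficients, not PARCOR values.
    Samples before the start of the signal are treated as zero.
    """
    residual = []
    for n in range(len(speech)):
        pred = sum(a[k] * speech[n - 1 - k]
                   for k in range(len(a)) if n - 1 - k >= 0)
        residual.append(speech[n] - pred)
    return residual


def synthesis_filter(source, a):
    """All-pole filter with the inverse characteristic:
    s[n] = e[n] + sum_k a[k] * s[n-1-k]."""
    out = []
    for n in range(len(source)):
        pred = sum(a[k] * out[n - 1 - k]
                   for k in range(len(a)) if n - 1 - k >= 0)
        out.append(source[n] + pred)
    return out
```

Passing a signal through `inverse_filter` and then `synthesis_filter` with the same coefficients returns the input up to floating-point rounding, which is why storing the full residual gives faithful playback at the cost of a large amount of data.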

When a voiced sound is synthesized, the one-pitch waveform stored in the voiced sound generation unit 12 is read out at the period set by the pitch parameter P to generate the voiced sound source signal. To keep the degradation described above as small as possible, it is therefore considered necessary to determine the feature parameters P and A so that the synthesized speech comes as close as possible to the original speech, taking into account what kind of sound source waveform that stored waveform will produce.
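The behavior of the sound source generation unit can be sketched roughly as follows. This is an illustration under stated assumptions, not the circuit of FIG. 5: the function name `generate_source`, the uniform random noise used for the unvoiced case, and the treatment of the pitch parameter as a period in samples are all hypothetical choices.

```python
import random


def generate_source(one_pitch_wave, pitch_period, amplitude, voiced,
                    n_samples, seed=0):
    """Return n_samples of excitation for one frame.

    voiced=True  : repeat the stored one-pitch waveform every
                   pitch_period samples (role of unit 12).
    voiced=False : emit white noise (role of unit 11); uniform noise
                   is an assumption made for this sketch.
    The result is scaled by `amplitude` (role of unit 14).
    """
    if voiced:
        out = [0.0] * n_samples
        for start in range(0, n_samples, pitch_period):
            for i, value in enumerate(one_pitch_wave):
                if start + i < n_samples:
                    out[start + i] += value
    else:
        rng = random.Random(seed)
        out = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    return [amplitude * value for value in out]
```

Because the voiced excitation is fully determined by the stored waveform, P, and A, any mismatch between these parameters and the true residual shows up directly in the synthesized speech.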

In conventional speech analysis methods, the pitch parameter P has been obtained by waveform processing methods (the waveform envelope method and the zero-crossing method), correlation processing methods (the autocorrelation method, the modified correlation method, and the AMDF method), spectrum processing methods (the cepstrum method and the period histogram method), and the like. These methods, however, merely try to find the pitch in the original speech. They take no account of the circumstances of the speech synthesizer Z, which uses an approximation of the residual waveform as its sound source waveform, and their pitch extraction accuracy cannot be called sufficiently high either. Consequently, when the speech synthesizer Z synthesizes speech from feature parameters extracted by such a conventional method, synthesized speech faithful to the original speech cannot be obtained.
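For comparison, one of the conventional approaches listed above, the autocorrelation method, can be sketched in its simplest form. The function name `autocorr_pitch` and the lag search bounds are assumptions for illustration; practical implementations add windowing, normalization, and voicing decisions that are omitted here.

```python
def autocorr_pitch(signal, min_lag, max_lag):
    """Return the lag (in samples) that maximizes the raw autocorrelation.

    Only the signal itself is examined; the waveform stored in the
    synthesizer plays no part in the estimate.
    """
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        val = sum(signal[n] * signal[n + lag]
                  for n in range(len(signal) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag
```

The estimate depends only on the analyzed signal, which is the limitation the text points out: nothing ties the result to the one-pitch waveform that the synthesizer will actually repeat.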

[Object of the Invention] The present invention has been made in view of the above points, and its object is to provide a speech analysis method capable of extracting feature parameters that yield synthesized speech faithful to the original speech.

[Disclosure of the Invention] (Embodiment) FIG. 1 shows a speech analyzer X using the method of the present invention. It comprises: a vocal tract parameter extraction means 1, which extracts the vocal tract simulation filter characteristic; a residual waveform extraction means 2, which is a filter set, using the vocal tract parameters extracted from the original speech by the vocal tract parameter extraction means 1, to the inverse of the characteristic of the vocal tract articulation equivalent filter; a pitch parameter extraction means 3, which obtains the cross-correlation waveform between the residual waveform extracted by passing the original speech through the residual waveform extraction means 2 and the one-pitch waveform used in synthesizing voiced sound, and derives a pitch parameter P by taking the distance between two peaks in this cross-correlation waveform as the pitch period; an amplitude parameter extraction means 4, which extracts an amplitude parameter A on the basis of a comparison between the energy of the residual waveform and that of the one-pitch waveform used in speech synthesis; and a parameter coding means 5, which encodes the feature parameters K, P, and A extracted by the extraction means 1, 3, and 4 and outputs them as the speech reproduction code SC.

The speech analysis method according to the present invention, like the conventional example, extracts from the original speech the feature parameters K, P, and A to be input to a speech synthesizer Z. That synthesizer uses a linearly separable equivalent circuit model that separates the generation of the sound source from the articulation produced by vocal tract resonance; when synthesizing voiced sound, it generates a sound source waveform from a prestored one-pitch waveform using the pitch parameter P, which corresponds to the repetition interval, and the amplitude parameter A, which corresponds to the intensity; and it obtains a synthesized speech signal by passing the sound source waveform through a vocal tract articulation equivalent filter whose characteristic is set by the vocal tract parameters K. The feature of the method lies in how the pitch parameter P is extracted from the residual waveform, which is obtained by passing the original speech through a filter set, using the vocal tract parameters K extracted from the original speech, to the inverse characteristic of the vocal tract articulation equivalent filter: the cross-correlation waveform between the residual waveform and the one-pitch waveform used in synthesizing voiced sound is obtained, and the distance between two peaks in that cross-correlation waveform is taken as the pitch period.
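The text says only that the amplitude parameter A is extracted from a comparison of the energy of the residual waveform with that of the one-pitch waveform; it does not give a formula. One plausible reading, offered purely as an assumption, is the scale factor that makes the energy of the scaled stored waveform equal to the average residual energy per pitch period. The function name `amplitude_parameter` is hypothetical.

```python
import math


def amplitude_parameter(residual, one_pitch_wave, pitch_period):
    """Assumed rule: A = sqrt(E_residual_per_period / E_stored_waveform).

    E_residual_per_period is the residual energy averaged over one pitch
    period; this specific formula is not taken from the source text.
    """
    e_stored = sum(v * v for v in one_pitch_wave)
    e_residual = sum(v * v for v in residual) * pitch_period / len(residual)
    return math.sqrt(e_residual / e_stored)
```

With this rule, a residual that is exactly the stored waveform scaled by 3 and repeated at the pitch period yields A = 3.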

The operation of the embodiment is described below. FIG. 2 is a flowchart showing the operation. The input original speech Vi is first supplied to the vocal tract parameter extraction means 1, which extracts the vocal tract simulation filter characteristic, and the vocal tract parameters K are extracted. The original speech Vi is also supplied to the residual waveform extraction means 2, a filter set, using the vocal tract parameters K extracted by the vocal tract parameter extraction means 1, to the inverse of the characteristic of the vocal tract articulation equivalent filter, and a residual waveform such as that shown in FIG. 3(a) is extracted by inverse filtering. This residual waveform is input to the pitch parameter extraction means 3, where the cross-correlation waveform between the residual waveform and the one-pitch waveform used in synthesizing voiced sound, as shown in FIG. 3(b), is obtained. The cross-correlation waveform is the sequence of cross-correlation coefficients, arranged in time order, between the one-pitch waveform used for voiced sound synthesis and the residual waveform obtained from the original speech, and the pitch parameter is extracted by taking the interval between two peaks in this cross-correlation waveform as the pitch period. Because the pitch parameter P extracted in this way takes into account the one-pitch waveform used when the speech synthesizer Z synthesizes voiced sound, the voiced sound source waveform generated from the one-pitch waveform and the pitch parameter P closely approximates the residual waveform, which is the desired sound source waveform, and synthesized speech faithful to the original speech can be obtained.
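The pitch extraction step can be sketched as follows. The text specifies only that the distance between two peaks of the cross-correlation waveform is taken as the pitch period; how the two peaks are selected is not stated, so the search strategy here (strongest value in the first `max_period` samples, then the strongest value between `min_period` and `max_period` samples later) is an assumption, as are the function names and the unnormalized correlation.

```python
def cross_correlation(residual, one_pitch_wave):
    """Slide the stored one-pitch waveform along the residual and return
    the sequence of (unnormalized) cross-correlation values."""
    m = len(one_pitch_wave)
    return [sum(residual[n + i] * one_pitch_wave[i] for i in range(m))
            for n in range(len(residual) - m + 1)]


def pitch_from_cross_correlation(residual, one_pitch_wave,
                                 min_period, max_period):
    """Return the distance in samples between two adjacent peaks of the
    cross-correlation waveform (assumed peak-picking rule, see lead-in)."""
    xcorr = cross_correlation(residual, one_pitch_wave)
    if len(xcorr) < 2 * max_period:
        raise ValueError("frame too short for the requested period range")
    # First peak: strongest match within the first max_period samples.
    first = max(range(max_period), key=lambda n: xcorr[n])
    # Second peak: strongest match one plausible pitch period later.
    second = max(range(first + min_period, first + max_period + 1),
                 key=lambda n: xcorr[n])
    return second - first
```

Since each peak marks a position where the residual best matches the stored waveform, repeating that waveform at the returned interval lines the synthetic excitation up with the residual, which is the property the text relies on.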

[Effects of the Invention] As described above, the present invention concerns a speech analysis method for extracting from the original speech the feature parameters to be input to a speech synthesizer that uses a linearly separable equivalent circuit model separating the generation of the sound source from the articulation produced by vocal tract resonance, that, when synthesizing voiced sound, generates a sound source waveform from a prestored one-pitch waveform using a pitch parameter corresponding to the repetition interval and an amplitude parameter corresponding to the intensity, and that obtains a synthesized speech signal by passing the sound source waveform through a vocal tract articulation equivalent filter whose characteristic is set by the vocal tract parameters. In this method, when the pitch parameter is extracted from the residual waveform obtained by passing the original speech through a filter set, using the vocal tract parameters extracted from the original speech, to the inverse characteristic of the vocal tract articulation equivalent filter, the cross-correlation waveform between the residual waveform and the one-pitch waveform used in synthesizing voiced sound is obtained, and the distance between two peaks in that cross-correlation waveform is taken as the pitch period. Because the pitch parameter extracted by the method of the present invention takes into account the one-pitch waveform used when the speech synthesizer synthesizes voiced sound, the waveform of the voiced sound source signal generated from this pitch parameter closely approximates the residual waveform, which is the desired sound source waveform, and synthesized speech faithful to the original speech can be obtained.

That is, the sound source input to the vocal tract articulation equivalent filter when voiced sound is synthesized is derived from the prestored one-pitch waveform, with its amplitude determined by the scale factor set by the amplitude parameter and its repetition period determined by the pitch period set by the pitch parameter. In the present invention, the pitch parameter is determined by relating the residual waveform obtained from the original speech waveform to the one-pitch waveform on which the voiced sound source is based, so a pitch parameter suited to the characteristics of the sound source used at synthesis can be obtained. In short, determining the pitch parameter from the original speech alone can yield a pitch parameter that is accurate for the original speech, but such a pitch parameter does not take the synthesis sound source into account, and synthesizing speech with it does not necessarily improve the quality of the synthesized speech. In the present invention, by contrast, the pitch parameter is determined in relation to the one-pitch waveform on which the synthesis sound source is based, which has the effect of improving the quality of the synthesized speech.

[Brief description of drawings]

FIG. 1 is a block diagram of a speech analyzer using the method of the present invention, FIGS. 2 and 3 are diagrams explaining its operation, FIG. 4 is a configuration diagram of a speech analysis and synthesis system, and FIG. 5 is a configuration diagram of a speech synthesizer. 1 is the vocal tract parameter extraction means, 2 is the residual waveform extraction means, 3 is the pitch parameter extraction means, 4 is the amplitude parameter extraction means, and 5 is the parameter coding means.

Claims

[Claims]

1. A speech analysis method for extracting from an original speech the parameters to be input to a speech synthesizer that uses a linearly separable equivalent circuit model separating the generation of a sound source from the articulation produced by vocal tract resonance, that, when synthesizing voiced sound, generates a sound source waveform from a prestored one-pitch waveform using a pitch parameter corresponding to its repetition interval and an amplitude parameter corresponding to its intensity, and that obtains a synthesized speech signal by passing the sound source waveform through a vocal tract articulation equivalent filter whose filter characteristic is set by vocal tract parameters, the method being characterized in that, when the pitch parameter is extracted from a residual waveform obtained by passing the original speech through a filter set, using the vocal tract parameters extracted from the original speech, to the inverse characteristic of the vocal tract articulation equivalent filter, a cross-correlation waveform between the residual waveform and the one-pitch waveform used in synthesizing voiced sound is obtained, and the distance between two peaks in the cross-correlation waveform is taken as the pitch period.