JPS61128299A

JPS61128299A - Voice analysis/analytic synthesization system

Info

Publication number: JPS61128299A
Application number: JP59250133A
Authority: JP
Inventors: マツツ　ユンクヴイスト; 藤崎　博也; 佐藤　泰雄; 杉田　忠靖; 花田　章夫
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1984-11-27
Filing date: 1984-11-27
Publication date: 1986-06-16
Also published as: JPH0339320B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

[Industrial Field of Application]

The present invention relates to a speech analysis / analysis-synthesis method, and in particular to one that analyzes speech, or analyzes and synthesizes speech, by combining a vocal cord sound source waveform model with linear predictive analysis, in which the parameters of the vocal cord sound source waveform model are determined by the so-called A-b-S (Analysis-by-Synthesis) technique so that the mean squared error is minimized.

[Conventional technology and problems]

Various methods have long been devised to compress the amount of speech-related information as far as possible for speech recognition, transmission, storage, and the like, while still allowing high-quality speech to be reproduced from that information. One of these is waveform coding, for example ADPCM, which encodes the speech waveform itself. In contrast to this, there is analysis-synthesis in the narrow sense, performed by a so-called vocoder.

In waveform coding, the speech signal is subjected to linear predictive analysis to obtain linear prediction coefficients and a prediction error, and the prediction error is quantized. For reproduction, the quantized prediction error drives a filter built from the linear prediction coefficients obtained in the analysis. The distortion of the reproduced speech in waveform coding is due to the quantization of the prediction error, and high-quality reproduced speech is obtained. However, the bit rate is, for example, 16 kbps to 64 kbps, so the amount of speech-related information is considerably large.

In analysis-synthesis, the speech production mechanism is modeled, focusing on the sound source signal and the acoustic filter characteristics of the articulatory organs. For example, the source signal of voiced sounds is approximated by a periodic impulse train, and the source signal of unvoiced sounds by white noise. Speech is then represented by, for example, voiced/unvoiced discrimination information, the pitch frequency of the periodic source, amplitude information, and linear filter characteristics. In other words, this can be viewed as modeling the prediction error, and the speech information can be compressed to, for example, about 1.2 kbps to 9.6 kbps. However, the quality of the synthesized speech is considerably lower than with the waveform coding described above.
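As a rough illustration of the conventional source model described above, the following Python sketch drives an all-pole filter with either a periodic impulse train (voiced) or white noise (unvoiced). The function names, the sign convention of the filter recursion, and the use of Gaussian noise are illustrative assumptions and are not taken from the patent.

```python
import random


def impulse_train(num_samples, pitch_period, amplitude=1.0):
    """Voiced excitation: one impulse every `pitch_period` samples."""
    return [amplitude if n % pitch_period == 0 else 0.0
            for n in range(num_samples)]


def white_noise(num_samples, amplitude=1.0, seed=0):
    """Unvoiced excitation: Gaussian white noise (seeded for repeatability)."""
    rng = random.Random(seed)
    return [amplitude * rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def all_pole_filter(excitation, coeffs):
    """Assumed recursion: s(n) = x(n) + sum_i coeffs[i-1] * s(n-i)."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, a in enumerate(coeffs, start=1):
            if n - i >= 0:
                acc += a * out[n - i]
        out.append(acc)
    return out
```

For example, `all_pole_filter(impulse_train(160, 80), [0.9])` yields two decaying responses spaced 80 samples apart, which is the kind of excitation the invention replaces with a shaped glottal pulse.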

For speech analysis or synthesis, a method is desired that requires only a small amount of speech-related information and yet yields synthesized speech of high quality close to that of the waveform coding described above.

[Means for solving problems]

To solve the above problems, the present invention models the sound source waveform not by approximating the source with pulses and noise signals, but by using a vocal cord sound source waveform model such as the Rosenberg waveform.

The invention further comprises means for obtaining, by the A-b-S (Analysis-by-Synthesis) technique, the four parameters that define this vocal cord sound source waveform model: pitch period, rise time, fall time, and amplitude. That is, the speech analysis / analysis-synthesis method of the present invention is a method that analyzes speech, or analyzes and synthesizes speech, based on information that models the sound source waveform, and is characterized by: having a speech synthesis system that generates a speech signal by a linear prediction filter driven by a sound source signal defined by at least four parameters relating to pitch period, rise time, fall time, and amplitude; comprising selection means for successively selecting the four parameters, means for obtaining the error between the input speech signal and the synthesized speech signal obtained by the linear prediction filter for the four parameters selected by the selection means, and means for optimizing the four parameters so that the error between the synthesized speech signal and the input speech signal becomes smaller and thereby determining the four parameters; and performing speech analysis, or speech analysis and synthesis, based on the four parameters and the linear prediction coefficients.

[Effect]

The present invention provides a speech synthesis system that generates a speech signal by a linear prediction filter driven by the source signal of a vocal cord sound source waveform model defined by four parameters: pitch period, rise time, fall time, and amplitude. For an input speech signal, the optimal values of the four parameters are determined by the A-b-S technique, which repeats the procedure of selecting the four parameters, performing linear predictive analysis, and obtaining the error between the synthesized speech signal and the input speech signal. These four parameters and the linear prediction coefficients are then taken as the information representing the speech. If the four parameters and the linear prediction coefficients are received or stored as required and speech is synthesized with the above model, high-quality speech can be analyzed and synthesized with a small amount of information. The invention is described below by way of an embodiment, with reference to the drawings.

[Example]

FIG. 1 is a block diagram of the configuration of an embodiment of the present invention; FIG. 2 is an explanatory diagram of the vocal cord sound source waveform model; FIG. 3 is a waveform diagram of an example in which the present invention is applied to synthesized speech; FIG. 4 is a waveform diagram of an example using the conventional method, for comparison with FIG. 3; and FIG. 5 is a waveform diagram of an example in which the present invention is applied to natural speech.

In FIG. 1, reference numeral 1 denotes a pitch period estimation unit, 2 an optimal parameter determination unit, 3 a parameter selection unit, 4 a vocal cord sound source waveform generation unit, and 5 a linear prediction analysis unit.

In modeling the sound source for analysis-synthesis, the present invention does not use an impulse as the periodic source but uses a vocal cord sound source waveform model. Human voices, for example, vary widely, from clear voices to hoarse voices. These variations are thought to reflect differences in the sound source, so uniformly approximating the source by an impulse makes it difficult to obtain reasonable results. Using a vocal cord sound source waveform model improves the approximation. The waveform may also be approximated by a triangular wave.

The vocal cord sound source waveform has, for example, the shape shown in FIG. 2. This waveform g(n) is expressed by the following equations.

(1) When $t_1 < t \le t_2$: $g(n) = 0$
(2) When $t_2 < t \le t_3$: [expression illegible in the source]
(3) When $t_3 < t \le t_4$: [expression illegible in the source]

This waveform can be represented by four parameters: the pitch period T, the rise-time ratio R, the fall-time ratio F, and the amplitude A, defined as follows.

$$T = t_4 - t_1, \qquad R = \frac{t_3 - t_2}{T}, \qquad F = \frac{t_4 - t_3}{T}, \qquad A = \alpha$$

The pitch period estimation unit 1 shown in FIG. 1 estimates this pitch period T from the input speech by any of the various conventionally known means. The estimated pitch period is supplied to the optimal parameter determination unit 2. For the rise-time ratio R, the fall-time ratio F, and the amplitude A, suitable initial values are set in advance and given to the optimal parameter determination unit 2. The parameter selection unit 3 first selects these four parameters and outputs them to the vocal cord sound source waveform generation unit 4.

The vocal cord sound source waveform generation unit 4 synthesizes and outputs a vocal cord sound source waveform signal such as that shown in FIG. 2 from these four parameters: pitch period T, rise-time ratio R, fall-time ratio F, and amplitude A. Although not illustrated, this output signal is corrected as necessary to take the so-called radiation characteristics into account, and is then supplied to the linear prediction analysis unit 5.
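The role of the waveform generation unit can be sketched in Python as follows. The expressions for g(n) are not legible in this text, so the sketch substitutes the commonly cited Rosenberg polynomial pulse (a cubic rise and a parabolic fall) as an assumption; the function name `glottal_pulse`, the sample-based time axis, and the placement of the zero segment at the start of the period are likewise illustrative choices.

```python
def glottal_pulse(T, R, F, A):
    """One period (T samples) of an assumed Rosenberg-style glottal pulse.

    T: pitch period in samples
    R: rise time as a fraction of T, i.e. (t3 - t2) / T
    F: fall time as a fraction of T, i.e. (t4 - t3) / T
    A: peak amplitude
    """
    rise = R * T
    fall = F * T
    closed = T - rise - fall  # zero segment, t1 < t <= t2
    if rise <= 0 or fall <= 0 or closed < 0:
        raise ValueError("need R > 0, F > 0 and R + F <= 1")

    pulse = []
    for n in range(T):
        if n < closed:
            g = 0.0                              # closed phase
        elif n < closed + rise:
            x = (n - closed) / rise              # 0 <= x < 1
            g = A * (3.0 * x * x - 2.0 * x ** 3)  # assumed cubic rise
        else:
            y = (n - closed - rise) / fall       # 0 <= y < 1
            g = A * (1.0 - y * y)                # assumed parabolic fall
        pulse.append(g)
    return pulse
```

Repeating the returned list gives a periodic source signal; changing R and F reshapes the pulse without changing the pitch period, which is what lets the optimizer fit different voice qualities.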

The linear prediction analysis unit 5 obtains the prediction error from this synthesized signal and the input speech signal, and outputs the linear prediction coefficients. The obtained prediction error is fed back to the optimal parameter determination unit 2.

To reduce this prediction error, the optimal parameter determination unit 2 instructs the parameter selection unit 3 to change the parameters defining the vocal cord sound source waveform little by little. The parameter selection unit 3 selects parameters with values different from the previous ones and outputs them to the vocal cord sound source waveform generation unit 4. This procedure is repeated to determine the optimal four parameters. That is, by using the so-called A-b-S technique, the four parameters (pitch period T, rise-time ratio R, fall-time ratio F, and amplitude A) are determined so that the mean squared error in the time domain is minimized. Extracting the parameters by the A-b-S technique allows more accurate analysis than determining them by, for example, inverse filtering.
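The feedback loop between units 2, 3, 4, and 5 can be sketched as a simple search. The text only says the parameters are changed little by little, so the coordinate search with step halving below is one possible strategy chosen for illustration, not the patented procedure. The `synthesize` callable is a hypothetical stand-in for waveform generation plus linear prediction filtering, and all names are assumptions.

```python
def mean_squared_error(target, candidate):
    """Time-domain mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(target, candidate)) / len(target)


def analysis_by_synthesis(target, synthesize, initial, step_sizes,
                          min_step=1e-6, max_iters=200):
    """Perturb one parameter at a time; keep changes that lower the error.

    target:     input signal to be matched
    synthesize: callable mapping a parameter dict to a synthesized signal
    initial:    dict of starting parameter values
    step_sizes: dict of initial perturbation sizes, same keys as `initial`
    """
    params = dict(initial)
    steps = dict(step_sizes)
    best_error = mean_squared_error(target, synthesize(params))
    names = list(params)

    for _ in range(max_iters):
        improved = False
        for name in names:
            for delta in (steps[name], -steps[name]):
                trial = dict(params)
                trial[name] += delta
                error = mean_squared_error(target, synthesize(trial))
                if error < best_error:
                    params, best_error, improved = trial, error, True
                    break
        if not improved:
            # No single-parameter change helped: refine the step sizes.
            steps = {name: step / 2.0 for name, step in steps.items()}
            if all(step < min_step for step in steps.values()):
                break
    return params, best_error
```

In a fuller sketch, `synthesize` would build the glottal source from T, R, F, and A and pass it through the fitted linear prediction filter before the error is computed.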

According to linear predictive analysis, which assumes an all-pole model for the speech signal, the speech signal s(n) is expressed by the following equation.

[Equation not present in the source text.]

Here, $a_i$ are the prediction coefficients, p is the prediction order, and $a_{p+1}$ is the gain. g(n) is assumed to be a white noise sequence.
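The all-pole relation described here can be written out as a worked equation. The following is a reconstruction from the symbols defined in the text ($a_i$, $p$, $a_{p+1}$, $g(n)$), not a copy of the original equation; in particular, the sign convention on the prediction sum is an assumption.

$$s(n) = \sum_{i=1}^{p} a_i\, s(n-i) + a_{p+1}\, g(n)$$

Under this reading, conventional linear predictive analysis treats $a_{p+1}\, g(n)$ as a spectrally flat excitation.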

However, in the GLPC method, which combines a vocal cord sound source waveform model with linear predictive analysis, g(n) is a known waveform and does not have a flat spectrum. The speech signal s(n) is therefore expressed by the following equation.

[Equation not present in the source text.]

Here, e(n) is a white noise sequence, and the error $E_g$ to be minimized is as follows.

[Equation not present in the source text.]
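The GLPC relation and its error criterion can be sketched as follows. This is a reconstruction from the symbols defined in the text; the sign convention of the prediction sum, the use of a plain sum of squares, and the summation range over the analysis frame are assumptions.

$$s(n) = \sum_{i=1}^{p} a_i\, s(n-i) + a_{p+1}\, g(n) + e(n)$$

$$E_g = \sum_{n} e^2(n) = \sum_{n} \Bigl[ s(n) - \sum_{i=1}^{p} a_i\, s(n-i) - a_{p+1}\, g(n) \Bigr]^2$$

For a fixed candidate $g(n)$, $E_g$ is quadratic in $a_1, \dots, a_{p+1}$, so these coefficients follow from a linear least-squares problem.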

The optimal parameter determination unit 2 shown in FIG. 1 determines the parameters that minimize this error $E_g$. The linear prediction coefficients $a_i$ are optimized with respect to the speech signal s(n) and g(n), and $a_{p+1}$ becomes the gain of g(n) that minimizes the error $E_g$.
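For a given source waveform g(n), the joint fit of the prediction coefficients and the source gain can be sketched as an ordinary least-squares problem. The Python below assumes the model s(n) = sum_i a_i s(n-i) + a_{p+1} g(n) + e(n) with that sign convention, solves the normal equations with a small Gaussian elimination, and uses hypothetical function names; it is a sketch of one way to compute these quantities, not the procedure disclosed in the patent.

```python
def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting (assumes a non-singular matrix)."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def glpc_fit(s, g, p):
    """Least-squares estimate of a_1..a_p and the gain a_{p+1}.

    Returns (prediction_coeffs, gain, squared_error), where squared_error
    is the sum of e(n)^2 over n = p .. len(s) - 1.
    """
    rows = [[s[n - i] for i in range(1, p + 1)] + [g[n]]
            for n in range(p, len(s))]
    targets = [s[n] for n in range(p, len(s))]
    m = p + 1
    # Normal equations: (X^T X) c = X^T y
    normal = [[sum(r[j] * r[k] for r in rows) for k in range(m)]
              for j in range(m)]
    rhs = [sum(r[j] * t for r, t in zip(rows, targets)) for j in range(m)]
    coeffs = solve_linear_system(normal, rhs)
    error = sum((t - sum(c * x for c, x in zip(coeffs, r))) ** 2
                for r, t in zip(rows, targets))
    return coeffs[:p], coeffs[p], error
```

The returned squared error is the quantity an outer search over the source parameters would try to reduce, refitting the coefficients for each candidate g(n).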

FIG. 3 shows the results of applying the present invention to synthesized speech and analyzing it by the GLPC method described above, in order to evaluate the method of the present invention. The waveform in FIG. 3(a) is the vocal cord sound source waveform used for synthesis, and FIG. 3(b) is the speech signal synthesized from it. FIG. 3(c) is the vocal cord sound source waveform estimated by GLPC for the speech signal of FIG. 3(b), and the speech signal resynthesized from it is shown in FIG. 3(d). FIG. 3(e) shows the error signal between the speech signal of FIG. 3(b), which was the subject of analysis, and the resynthesized speech signal of FIG. 3(d); the resulting SN ratio is 12.7 dB.
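The text does not state how the SN ratio is computed. A common definition, assumed here, is the ratio of the energy of the analyzed signal to the energy of the error signal, expressed in decibels. The function name `snr_db` is illustrative.

```python
import math


def snr_db(reference, resynthesized):
    """SN ratio in dB: reference energy over error energy (assumed definition)."""
    signal_energy = sum(x * x for x in reference)
    noise_energy = sum((x - y) ** 2 for x, y in zip(reference, resynthesized))
    if noise_energy == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(signal_energy / noise_energy)
```

Under this definition, an error signal carrying one hundredth of the reference energy corresponds to 20 dB.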

For comparison with FIG. 3, FIG. 4 shows an example in which the sound source for the same synthesized speech is approximated by a periodic impulse train, as is done conventionally. FIGS. 4(a) and 4(b) correspond to FIGS. 3(a) and 3(b), respectively, and show the vocal cord sound source waveform used for synthesis and the resulting synthesized speech signal to be analyzed. Considering only the pitch period and amplitude, the linear prediction filter is driven by an impulse train such as that in FIG. 4(c), and the resynthesized speech signal obtained is the signal shown in FIG. 4(d). The error between this signal and the original speech signal of FIG. 4(b) is shown in FIG. 4(e). The resulting SN ratio is 3.2 dB. By comparison, the method of the present invention improves the SN ratio substantially.

Of course, the present invention gives similarly good results for natural speech. FIG. 5 shows an example in which the present invention is applied to a natural speech signal of the vowel /a/. FIG. 5(a) is the speech waveform of the vowel /a/ that was analyzed, and FIG. 5(b) shows the speech signal analyzed and resynthesized using the present invention. The error, shown in FIG. 5(c), is extremely small. FIG. 5(d) shows the vocal cord sound source waveform estimated by GLPC in this case. Incidentally, the waveform shown in FIG. 5(e) is the vocal cord sound source waveform obtained by inverse filtering.

The present invention is particularly effective for voiced sounds. When unvoiced portions are to be analyzed and synthesized, the invention can be implemented in combination with a conventionally used method, for example by applying conventional waveform coding or analysis-synthesis to those portions only.

[Effect of the invention]

As explained above, the present invention makes it possible to compress the amount of speech-related information efficiently, reducing it substantially compared with waveform coding, while obtaining synthesized speech of higher quality than conventional LPC analysis-synthesis speech.

[Brief explanation of drawings]

FIG. 1 is a block diagram of the configuration of an embodiment of the present invention; FIG. 2 is an explanatory diagram of the vocal cord sound source waveform model; FIG. 3 is a waveform diagram of an example in which the present invention is applied to synthesized speech; FIG. 4 is a waveform diagram of an example using the conventional method, for comparison with FIG. 3; and FIG. 5 is a waveform diagram of an example in which the present invention is applied to natural speech. In the figures, 1 denotes a pitch period estimation unit, 2 an optimal parameter determination unit, 3 a parameter selection unit, 4 a vocal cord sound source waveform generation unit, and 5 a linear prediction analysis unit.

Claims

[Claims]

A speech analysis / analysis-synthesis method that analyzes speech, or analyzes and synthesizes speech, based on information that models a sound source waveform, characterized by: having a speech synthesis system that generates a speech signal by a linear prediction filter driven by a sound source signal defined by at least four parameters relating to pitch period, rise time, fall time, and amplitude; comprising selection means for successively selecting the four parameters, means for obtaining the error between the input speech signal and the synthesized speech signal obtained by the linear prediction filter for the four parameters selected by the selection means, and means for optimizing the four parameters so that the error between the synthesized speech signal and the input speech signal becomes smaller and thereby determining the four parameters; and performing speech analysis, or speech analysis and synthesis, based on the four parameters and the linear prediction coefficients.