JPH0414813B2


Info

Publication number: JPH0414813B2
Application number: JP59080855A
Authority: JP
Inventors: Yoshihisa Shiraki; Masaaki Yoda
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1984-04-20
Filing date: 1984-04-20
Publication date: 1992-03-16
Also published as: JPS60224341A

Abstract

PURPOSE: To obtain good sentence intelligibility even when encoding at a low bit rate, by dividing the time series of spectral parameters, extracted from the input speech in frame units, into segments each consisting of multiple frames and encoding them segment by segment. CONSTITUTION: Speech input from a terminal 11 passes through an LPF 12, is sampled periodically and converted to a digital signal by an A/D converter 13, and is input to an LPC analysis part 14. The analysis part 14 calculates the spectral parameter time series of the input speech in frame units. A segmentation part 15 divides this time series into segments on the basis of the rate of change of the spectrum, and a resampling means 16 resamples each segment so that the number of samples of the parameter time series in each segment is fixed. A matrix quantizing part 22 matches each resampled segment, segment by segment, against standard patterns from a memory 23 and outputs the standard pattern that best approximates the segment. A multiplexer 26 combines this pattern with a signal quantized by a quantizing part 27. The resulting speech code output is restored in an LPC synthesizing part 36, converted by a D/A converter 37, and output through an LPF 38.

Description

DETAILED DESCRIPTION OF THE INVENTION "Field of Industrial Application" The present invention relates to a speech encoding method that extracts spectral parameters of input speech and encodes them at a low bit rate.

"Prior Art" Conventionally, there are two methods for encoding speech at a low bit rate of 1000 bps or less: vector quantization and variable frame rate encoding. The vector quantization method quantizes the spectral parameter information of each frame to about 8 bits while keeping the frame unit (the speech analysis unit) constant, and is characterized by treating the parameters as a single vector. However, this method removes only spatial, that is, frequency, redundancy. Because the frame unit is constant, quality deteriorates sharply at 500 bps or less.

The variable frame rate encoding method, on the other hand, changes the frame unit (frame length) in response to temporal changes in the spectrum and thereby removes temporal redundancy; quality degrades little even when the average transmission rate is reduced to about 1/3. However, because this method inherently depends on the (linear) interpolation characteristics of the parameters, quality deteriorates sharply when the transmission rate falls to 25 frames per second (600 bps overall) or below.

An object of the present invention is to provide a speech encoding method whose output can be reproduced as speech with good sentence intelligibility even at a low bit rate of 600 bps or less.

"Structure of the Invention" The present invention removes not only the spatial (frequency) redundancy of the speech spectrum but also its temporal redundancy. To this end, the spectral parameters of the input speech are extracted frame by frame, the time series of these spectral parameters is divided, based on the rate of change of the spectrum, into segments each consisting of multiple frames, and each segment is resampled so that its spectral parameter time series has the same number of sample points. In other words, the spectral parameter time series of the speech is cut into segments, for example at phoneme or syllable boundaries, so as to obtain sequence patterns of spectral parameters that appear repeatedly in speech in time and preferably also in space (frequency). Each segment is then resampled so that its parameter time series has the same number of sample points, which normalizes the time length of each segment and removes temporal redundancy. The parameter time series resampled in this way is encoded segment by segment by matching it against standard patterns.

"Embodiment" FIG. 1 shows an embodiment of the speech encoding method of the present invention. Speech input from an input terminal 11 is band-limited by a low-pass filter 12 and input to an AD converter 13, where it is periodically sampled and converted into a digital signal. From the output of the AD converter 13, an LPC analysis unit 14 extracts the spectral parameters of the input speech frame by frame. The parameter time series calculated by LPC analysis is divided by a segmentation unit 15 into segments each consisting of a plurality of frames, based on the rate of change of the spectrum of the spectral parameter time series. The spectral parameters extracted from the input speech may be LPC coefficients, LAR, PARCOR coefficients, LPC cepstrum coefficients, LSP, or the like. In this embodiment, the weighted least-squares approximation coefficients of the LPC cepstrum coefficients were used, and the speech was divided into an average of 12 segments per second. For details of this segmentation, see, for example, Sagayama and Itakura: "Individuality information contained in dynamic measures of speech," Proceedings of the Spring Meeting of the Acoustical Society of Japan, 3-2-7 (1979).

The frequency (spectral) components contained in a speech waveform are generally not uniform in time and depend strongly on the phonemes being spoken (the content of the utterance). This non-uniformity arises even when the phonemes and words are the same, because the surrounding context at the time of speaking, the speaking rate, and so on vary greatly. On the other hand, it is known that even for such a highly non-uniform speech spectrum time series, the time points at which the rate of change reaches a local maximum are highly reproducible. The spectral parameter time series of the speech is therefore divided into segments based on the rate of change of the spectrum, that is, at phoneme and syllable boundaries, to obtain highly reproducible spectral parameters.
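The segmentation step can be illustrated with a minimal sketch. It is not the method of the cited Sagayama and Itakura paper: as an assumption, the change rate is taken here as the plain Euclidean distance between consecutive parameter vectors, and a boundary is placed at each local maximum above a hypothetical `threshold`. The function names are likewise illustrative.

```python
import math


def spectral_change_rate(frames):
    """Return one value per frame: the Euclidean distance between the
    parameter vector of that frame and that of the previous frame
    (0.0 for the first frame). This is a simplified stand-in for the
    dynamic measure used in the embodiment."""
    rate = [0.0] * len(frames)
    for t in range(1, len(frames)):
        rate[t] = math.dist(frames[t], frames[t - 1])
    return rate


def segment_boundaries(rate, threshold):
    """Frames where the change rate is a local maximum of at least threshold."""
    return [
        t
        for t in range(1, len(rate) - 1)
        if rate[t] >= threshold and rate[t] > rate[t - 1] and rate[t] >= rate[t + 1]
    ]


def split_segments(frames, threshold):
    """Cut the frame sequence at the detected boundaries."""
    cuts = [0] + segment_boundaries(spectral_change_rate(frames), threshold) + [len(frames)]
    return [frames[a:b] for a, b in zip(cuts, cuts[1:]) if b > a]


# Toy input: five frames of one steady sound followed by five of another.
frames = [[0.0]] * 5 + [[1.0]] * 5
segments = split_segments(frames, threshold=0.5)
print([len(s) for s in segments])
```

With real speech the threshold would have to be tuned so that the average segment rate matches the target (about 12 segments per second in the embodiment).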

Some of the resulting segments are long and some are short. So that the number of sample points of the parameter time series is constant (the same) in every segment, a resampling unit 16 resamples each segment at a predetermined number of points equally spaced in time. This normalizes the time length of every segment to the same length and removes temporal redundancy. The larger the number of resampling points per segment, the smaller the spectral distortion when the segment is restored. As shown in FIG. 2, when the number of resampling points per segment is 10 or more, the spectral distortion is held to 1 dB² or less. In FIG. 2, the horizontal axis is the number of resampling points per segment, the vertical axis is the spectral distortion after restoration, and the spectral parameter is LSP. In this embodiment, the resampling unit 16 uses LSP as the parameter and 10 resampling points.

FIG. 3 shows a specific example of the resampling unit 16. The segmented spectral time series from a terminal 17 is separated by a signal separation unit 18 into the spectral time series and the segment length (duration). The duration is input to a proportionality constant unit 19, which calculates the required proportionality constant from the predetermined number of resampling points (10 in the embodiment) and the input duration; that is, it divides the duration by the number of resampling points to obtain the resampling period, and sends this to a linear interpolation unit 21. The linear interpolation unit 21 resamples the segmented spectral time series from the signal separation unit 18 at the predetermined number of points with this resampling period. Because the resampling time points generally do not coincide with the sample time points of the spectral time series before resampling, the resampled values are obtained by linear interpolation of the spectral time series. That is, the resampled value $[L]_i$ of the spectral parameter time series at resampling time $T_i$ is obtained from the sample value $[L]_j$ at the original sample time $t_j$ immediately before $T_i$ and the sample value $[L]_{j+1}$ at the original sample time $t_{j+1}$ immediately after it, as

$$[L]_i = [L]_j + \left\{[L]_{j+1} - [L]_j\right\} \times \frac{T_i - t_j}{t_{j+1} - t_j}.$$
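The linear interpolation performed by the resampling unit can be sketched as follows. This is an illustration under stated assumptions, not the patented circuit: original frames are assumed to lie at unit spacing (so the denominator of the interpolation ratio is 1), and the resampling points are spread so that the first and last coincide with the first and last frame of the segment. The name `resample_segment` is hypothetical.

```python
def resample_segment(frames, num_points=10):
    """Linearly interpolate a segment (a list of parameter vectors, one per
    frame) to num_points equally spaced points. Requires num_points >= 2."""
    n = len(frames)
    if n == 1:
        # A one-frame segment has nothing to interpolate: repeat the frame.
        return [list(frames[0]) for _ in range(num_points)]
    step = (n - 1) / (num_points - 1)  # resampling period in frame units
    resampled = []
    for i in range(num_points):
        T = i * step                   # resampling time T_i
        j = min(int(T), n - 2)         # index of the frame just before T_i
        frac = T - j                   # (T_i - t_j) / (t_{j+1} - t_j), spacing 1
        resampled.append(
            [a + (b - a) * frac for a, b in zip(frames[j], frames[j + 1])]
        )
    return resampled


# A three-frame segment with one parameter per frame, resampled to 5 points.
print(resample_segment([[0.0], [1.0], [2.0]], num_points=5))
```

Whatever the segment duration, the output always has `num_points` vectors, which is what makes segments of different lengths directly comparable in the later matching step.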

Returning to FIG. 1, training speech is first input, segmented as described above, and resampled. The resulting time-normalized, highly reproducible spectral patterns are grouped and classified by similarity to create standard patterns, which are stored in a memory 23. In this way, standard patterns from which frequency (spatial) redundancy has been removed are obtained. Next, when ordinary speech input is quantized, a matrix quantization unit 22 matches each resampled segment, taken as a unit, against the standard patterns prepared in advance in the memory 23 and outputs the number of the most similar standard pattern. The same measure (distance) is computed both when generating the standard patterns and when matching. That is, a segment is treated as a matrix (the parameter time series arranged side by side), and the distance between matrices is defined as a weighted Euclidean distance. In this example, the 12th-order LSP parameters (L₁, L₂, …, L₁₂) and the (logarithmic) speech power P are arranged side by side at the 10 resampling points to form a 13×10 (segment) matrix.

$$\begin{pmatrix} P_1 & L_{1,1} & L_{2,1} & \cdots & L_{12,1} \\ P_2 & L_{1,2} & L_{2,2} & \cdots & L_{12,2} \\ \vdots & \vdots & \vdots & & \vdots \\ P_{10} & L_{1,10} & L_{2,10} & \cdots & L_{12,10} \end{pmatrix}$$

For how to create the standard patterns, see, for example, A. Buzo et al., "Speech Coding Based upon Vector Quantization," IEEE Trans. ASSP, Vol. ASSP-28, No. 5, pp. 562-574 (1980).
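The matching step of the matrix quantizer amounts to a nearest-neighbour search over the stored standard patterns. The sketch below assumes one weight per parameter and compares squared distances, which selects the same pattern as the distance itself; the patent does not give the weights, and the names `weighted_distance` and `quantize_segment` are hypothetical.

```python
def weighted_distance(a, b, weights):
    """Weighted squared Euclidean distance between two segment matrices.
    a and b are lists of sample points; each sample point is a list of
    parameters (power and LSPs), and weights holds one weight per parameter."""
    return sum(
        w * (x - y) ** 2
        for point_a, point_b in zip(a, b)
        for w, x, y in zip(weights, point_a, point_b)
    )


def quantize_segment(segment, codebook, weights):
    """Return the number (index) of the standard pattern closest to segment."""
    return min(
        range(len(codebook)),
        key=lambda k: weighted_distance(segment, codebook[k], weights),
    )


# Toy codebook with two standard patterns of one sample point and two parameters.
codebook = [[[0.0, 0.0]], [[1.0, 1.0]]]
print(quantize_segment([[0.9, 0.8]], codebook, weights=[1.0, 1.0]))
```

In the embodiment each pattern would hold 10 sample points of 13 parameters, and a 10-bit code would index a codebook of up to 1024 patterns.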

FIG. 4 plots the amount of information per segment on the horizontal axis and the CD (cepstral distance) on the vertical axis. Curve 24 is the case where power and spectrum are quantized separately, and curve 25 is the case where power and spectrum are treated as a single vector as described above. FIG. 4 shows that curve 25, which includes the power, achieves information compression of at least 3 bits per segment compared with curve 24, in which the power is separate.

The spectral time series is encoded by the matrix quantization unit 22 as described above, and this code is combined by a multiplexer 26 with the pitch information of the input speech and the duration information of each segment and then output. Here, the pitch information is the fundamental frequency of the speech, which determines the perceived pitch of the voice. The pitch information is detected, for example, by the method described in Saito and Nakata, "Fundamentals of Speech Information Processing" (Ohmsha, 1981, pp. 88-89). In this embodiment, the pitch information is smoothed, reduced to one point per segment, and quantized by a quantization unit 27 with 3-bit ADPCM. FIG. 5 shows the LPC analysis unit 14: the digital speech signal sequence from a terminal 28 is LPC-analyzed by an LPC analysis unit 29; the LPC parameters, the voiced/unvoiced decision sequence, and the speech power are output to a multiplexing unit 31, while the pitch is supplied to a pitch smoothing unit 32, smoothed, and then supplied to the multiplexing unit 31.

The segment duration is quantized by the quantization unit 27 to 2.5 bits, taking its frequency of occurrence into account. In the specific example described above, there are on average 12 segments per second, and the standard pattern is quantized with 10 bits, the pitch with 3 bits, and the duration with 2.5 bits. Curve 33 in FIG. 6 shows the result of matrix-quantizing 3000 samples (4 minutes of speech) obtained in this way and matching them against the generated standard patterns. In FIG. 6, the horizontal axis is the LSP bit rate in bits per second and the vertical axis is the spectral distortion; curve 34 is for the conventional vector quantization method. FIG. 6 shows that, for the same level of spectral distortion, the present invention needs only one quarter of the information of the vector quantization method, and hence one tenth of the information of the conventional scalar quantization method. In the above example, the rate is 12 × (10 + 3 + 2.5) = 186 bps, and sentence intelligibility was good.
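The bit-rate figure follows directly from the per-segment bit allocation. As a small worked check, with the hypothetical helper name `bits_per_second`:

```python
def bits_per_second(segments_per_second, pattern_bits, pitch_bits, duration_bits):
    """Average bit rate: every segment carries one standard-pattern number,
    one pitch code, and one duration code."""
    return segments_per_second * (pattern_bits + pitch_bits + duration_bits)


# Values from the embodiment: 12 segments/s, 10 + 3 + 2.5 bits per segment.
rate = bits_per_second(12, 10, 3, 2.5)
print(rate)
```

The fractional 2.5 bits for the duration is an average, consistent with the statement that the duration is quantized with its frequency of occurrence taken into account.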

The encoded speech output from the multiplexer 26 is transmitted or stored. For decoding, an LPC synthesis unit 36 obtains the standard pattern from the matrix quantization code by referring to the dictionary and restores from it, together with the segment duration information and the pitch information, a signal corresponding to the original LPC analysis output. This is converted to analog form by a DA converter 37 and output as an analog speech signal at an output terminal 39 through a low-pass filter 38.
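One part of decoding, undoing the time normalization, can be sketched as below. The text does not say how the standard pattern is stretched back to the transmitted duration, so linear interpolation is an assumption made here by symmetry with the encoder; `restore_segment` is a hypothetical name, and the pattern is assumed to hold at least two sample points.

```python
def restore_segment(pattern, duration):
    """Stretch a fixed-length standard pattern (a list of parameter vectors)
    back to `duration` frames by linear interpolation."""
    m = len(pattern)
    if duration == 1:
        return [list(pattern[0])]
    step = (m - 1) / (duration - 1)  # pattern points advanced per output frame
    frames = []
    for t in range(duration):
        pos = t * step
        j = min(int(pos), m - 2)     # pattern point just before this frame
        frac = pos - j
        frames.append(
            [a + (b - a) * frac for a, b in zip(pattern[j], pattern[j + 1])]
        )
    return frames


# A two-point pattern stretched to a three-frame segment.
print(restore_segment([[0.0], [2.0]], 3))
```

The restored per-frame parameters, together with the decoded pitch, would then drive the LPC synthesis.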

"Effects of the Invention" As described above, according to the present invention, good sentence intelligibility is obtained even at an extremely low rate of about 200 bps. The invention therefore has the advantage that it can be used for efficient use of transmission lines, for constructing communication channels with a high degree of privacy, and the like.

The pitch information and segment duration information need not be sent using the quantization described above; other quantization may be applied, or they may be sent without quantization.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing an example of the present invention; FIG. 2 is a diagram showing the relationship between the number of resampling points and spectral distortion; FIG. 3 is a block diagram showing a specific example of the parameter resampling unit 16; FIG. 4 is a diagram showing the relationship between the number of bits per segment and spectral distortion; FIG. 5 is a block diagram showing an example of the LPC analysis unit 14; and FIG. 6 is a diagram showing the relationship between bits per second and spectral distortion. 14: LPC analysis unit; 15: segmentation unit; 16: resampling unit; 22: matrix quantization unit; 23: standard pattern memory.

Claims


1. A speech encoding method comprising: means for extracting spectral parameters of input speech in frame units; means for dividing the time series of the extracted spectral parameters, based on the rate of change of the spectrum, into sections corresponding to phoneme or syllable boundaries to form segments each consisting of multiple frames; means for resampling the parameter time series of each divided segment so that the number of sample points of the parameter time series within that segment becomes the same predetermined number; means for encoding the resampled parameter time series, in units of segments, using standard patterns of spectral parameter time series; and means for encoding pitch information of the input speech and duration information of each segment.