JP3055901B2

JP3055901B2 - Audio signal encoding/decoding method and audio signal encoding device

Info

Publication number: JP3055901B2
Application number: JP63085191A
Authority: JP
Inventors: 一範小澤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1988-04-08
Filing date: 1988-04-08
Publication date: 2000-06-26
Anticipated expiration: 2015-06-26
Also published as: JPH01257999A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to a speech signal encoding method and a speech signal encoding device for efficiently encoding a speech signal at a low bit rate. In particular, it relates to a speech signal encoding method, and a device used for it, that divide speech non-uniformly on the basis of auditory characteristics and obtain and encode parameters representing the features of the speech signal in each divided section.

[Conventional technology]

Known methods for transmitting a speech signal at a low transmission bit rate, for example 8 kb/s or less, include pitch-prediction multi-pulse coding at around 8 kb/s and pitch-interpolation multi-pulse coding at around 4.8 kb/s. Both represent the excitation (sound source) signal as a combination of several pulses (a multi-pulse), represent the characteristics of the vocal cords with a digital filter, and obtain and transmit the excitation pulse information and the filter coefficients for each fixed time section (frame). The former is described in detail in, for example, Ozawa and Araseki, "High Quality Multi-pulse Speech Coder with Pitch Prediction" (Proc. ICASSP, paper 33.3, 1986) (Reference 1), and the latter in, for example, Ozawa and Araseki, "Low Bit Rate Multi-pulse Speech Coder with Natural Speech Quality" (Proc. ICASSP, paper 9.7, 1986) (Reference 2). To reduce the amount of transmitted information, these methods reduce the excitation pulse information to be transmitted by pitch prediction of the excitation pulse signal, or by obtaining a pulse train for only one pitch section within the frame.

[Problems to be solved by the invention]

However, in these conventional methods the section length over which the excitation pulses and filter coefficients are obtained is fixed (20 ms in References 1 and 2). In vowel sections, an almost periodic waveform continues and the features of the speech change little, yet information is transmitted every 20 ms, which is very inefficient. In consonant sections, on the other hand, the analysis cannot follow the rapid changes in the speech features, and the sound quality deteriorates. This problem is especially pronounced when the bit rate is considerably lower than 8 kb/s.

To describe the problem more concretely: as is well known, a vowel section is long, generally 100 to 300 ms depending on the speaking rate, and more than half of it can be regarded as a steady-state section in which the features of the speech signal hardly change. It is also known that syllable intelligibility hardly deteriorates even if the signal in the steady-state part of a vowel is suppressed to zero and no information at all is transmitted, although naturalness does deteriorate. Analyzing the signal and transmitting information for every short frame of about 20 ms, as in the conventional methods, is therefore very inefficient. In consonant sections, on the other hand, the speech features change so quickly that a 20 ms frame is too long for an accurate analysis that tracks the changes, and the quality of the reproduced speech deteriorates.

To alleviate these problems, a method has been proposed, for example in Chapter 10 of Markel and Gray, "Linear Prediction of Speech" (Springer-Verlag, 1976) (Reference 3), in which the frame length is made variable in integer multiples of a fixed section, based on the change in the frame-to-frame difference of spectra obtained with fixed-length frames of about 10 ms. This remedy also has problems. The frame length is varied using a feature parameter that does not correspond well to auditory perception, and the frame length can only be varied in units of the fixed section length, which leaves little freedom. As a result, when the number of lengthened frames is increased to lower the bit rate, the sound quality deteriorates considerably.

An object of the present invention is to provide a speech signal encoding method and a speech signal encoding device that can greatly reduce the amount of information required to transmit the excitation signal and that keep the perceptual degradation of the synthesized speech very small even when the bit rate is lowered substantially.

The speech signal encoding/decoding method of the present invention is characterized by: receiving a discrete speech signal; obtaining cepstrum parameters for each predetermined time section; calculating a measure that indicates the temporal change of the cepstrum parameters; dividing the speech signal into non-uniform time sections by searching for time positions that are near a time position where the measure has a local maximum and at which the measure falls below a predetermined threshold, and judging them to be segment boundaries; representing the excitation signal in all or some of the divided sections as a combination of a plurality of pulse trains and transmitting it; and restoring the excitation signal of the sections from the transmitted pulse trains to output a synthesized speech signal representing the speech signal.

The speech signal encoding device of the present invention is characterized by comprising: a segmentation measure calculation circuit that obtains cepstrum parameters for each predetermined time section from an input discrete speech signal sequence and calculates a measure indicating the temporal change of the cepstrum parameters; a segmentation circuit that segments the speech signal into non-uniform time sections by searching for time positions that are near a time position where the measure has a local maximum and at which the measure falls below a predetermined threshold, and judging them to be segment boundaries; a spectral parameter calculation circuit that calculates, from all or some of the divided sections, spectral parameters representing short-time spectral characteristics and a pitch parameter; an excitation pulse calculation circuit that calculates a combination of a plurality of pulse trains representing the excitation signal in all or some of the divided sections; and a multiplexer circuit that combines and outputs the spectral parameters, the pitch parameter, and the excitation pulse trains.

[Operation]

Because the speech signal is encoded as described above, the amount of information required to transmit the excitation signal can be greatly reduced, and encoding can be performed with very little auditory degradation of the synthesized speech even when the bit rate is lowered substantially.

By having the segmentation circuit, the spectral parameter calculation circuit, the excitation pulse calculation circuit, and the multiplexer circuit configured as described above, the speech signal encoding device can perform this encoding.

[Example]

Next, an embodiment of the present invention is described with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of an embodiment of the speech signal encoding method and speech signal encoding device according to the present invention. To aid understanding of the invention, FIG. 1 also shows the speech signal decoding device. FIG. 2 is a block diagram used to explain the principle of the speech signal encoding.

In the speech signal encoding/decoding method according to the embodiment of FIG. 1, the transmitting side receives a discrete speech signal, divides it into non-uniform sections by a method that corresponds well to auditory characteristics, represents the excitation signal in all or some of the divided sections as a combination of a plurality of pulse trains, and transmits it. The receiving side uses the pulse trains to restore the excitation signal of the sections and outputs a synthesized speech signal that represents the speech signal well.

First, the principle of the encoding according to the present invention is described with reference to FIG. 2(a). In the figure, the segmentation measure calculation section 400 receives the speech signal and, for each short time section (for example 5 ms), short enough that even consonants with rapidly changing features can be analyzed accurately, computes as the segmentation measure a measure that corresponds well to auditory characteristics and represents the temporal change of the speech features well. Here the dynamic measure D(t) of Equation (1) is used. First, LPC cepstrum coefficients C_i (1 ≤ i ≤ p) are extracted from the speech signal every short interval (for example 5 ms) as parameters that represent the spectral envelope well, and these are converted into the dynamic measure according to Equation (1). The dynamic measure is a parameter known in the field of speech recognition, and it represents well the temporal change of the phonetic features of a speech signal as perceived by the ear. It generally takes large values where the phonetic content of the speech signal changes quickly (consonants and consonant-vowel transitions) and small values in steady-state parts (such as the steady-state part of a vowel).

Here, a_i is [expression missing in the source text].

This calculation is described in detail in Furui, "On the Role of Spectral Transition for Speech Perception" (J. Acoustical Society of America, vol. 80, pp. 1016-1025, 1986) (Reference 4), so the details are omitted here. Equation (3), which includes the power term a_0, can also be used in place of Equation (1). Including the power term with Equation (3) makes it possible to take into account the temporal change of the power as well as of the spectral features, which improves the accuracy of the measure.
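Equations (1) to (3) do not survive in this text, so the following is only a sketch under stated assumptions. It assumes the regression-coefficient form commonly associated with Reference 4: a_i(t) is the least-squares slope of cepstrum coefficient C_i over a window of 2*n0+1 frames, and D(t) is the sum of the squared slopes. The function name, the window half-width `n0`, and the edge clamping are hypothetical choices, not taken from the patent.

```python
def dynamic_measure(cepstra, n0=2, include_power=False):
    """Sketch of a dynamic (spectral transition) measure per frame.

    cepstra[t][i] is coefficient i at frame t (for example one frame
    every 5 ms); index 0 is assumed to hold the power term a_0.
    """
    num_frames = len(cepstra)
    order = len(cepstra[0])
    first = 0 if include_power else 1          # Eq. (3)-like vs Eq. (1)-like
    denom = sum(n * n for n in range(-n0, n0 + 1))
    measure = []
    for t in range(num_frames):
        total = 0.0
        for i in range(first, order):
            slope = 0.0
            for n in range(-n0, n0 + 1):
                # Assumption: clamp the window at the ends of the utterance.
                u = min(max(t + n, 0), num_frames - 1)
                slope += n * cepstra[u][i]
            slope /= denom
            total += slope * slope
        measure.append(total)
    return measure
```

With this form, a constant cepstrum gives a measure of zero, and a coefficient that rises by 0.5 per frame gives 0.25 in the interior of the signal.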

The segmentation section 410 receives the dynamic measure D(t) and divides the speech signal non-uniformly (segmentation), using the measure of Equation (1) or (3). First, the speech signal is provisionally divided around each local maximum of the measure. As described in Reference 4, the few tens of milliseconds before and after a local maximum of the measure correspond roughly to the coarticulation between consonant and vowel (consonant to vowel, or vowel to consonant), and this part is reported to be perceptually very important for phoneme perception. The speech signal is therefore segmented at points where the measure has become reasonably and consistently small, excluding these perceptually important parts. FIG. 2(b) shows the result of the segmentation: the upper part shows the speech waveform, and the lower part shows an example of the dynamic measure and the segmentation.
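The boundary search can be sketched as follows. This is a minimal illustration, not the circuit's actual procedure: it assumes that a boundary is the first frame on either side of a peak, outside a protected guard region, where the measure drops below the threshold. The names `find_boundaries`, `threshold`, and `guard` are hypothetical, and peaks below the threshold are ignored by assumption.

```python
def find_boundaries(measure, threshold, guard=4):
    """Return sorted frame indices judged to be segment boundaries.

    guard is a hypothetical number of frames on each side of a peak that
    is never cut; it stands in for the perceptually important transition
    region of a few tens of milliseconds around each local maximum.
    """
    boundaries = set()
    last = len(measure) - 1
    for t in range(1, last):
        is_peak = measure[t - 1] < measure[t] >= measure[t + 1]
        if not is_peak or measure[t] < threshold:
            continue
        # Walk outward from the edge of the guard region until the
        # measure falls below the threshold.
        for step in (-1, 1):
            u = t + step * guard
            while 0 <= u <= last and measure[u] >= threshold:
                u += step
            if 0 <= u <= last:
                boundaries.add(u)
    return sorted(boundaries)
```

Because the boundaries follow the measure rather than a fixed grid, long steady-state vowels end up in long segments and rapid transitions in short ones.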

Next, the LPC/pitch analysis section 430 analyzes the speech signal of the whole segmented section, or of a part of it, to obtain LPC coefficients. When they are obtained from a part of the speech signal, the cepstrum obtained by the segmentation section 410 can also be converted to LPC coefficients by a well-known method. The pitch period is then calculated by a well-known method, and it is determined whether or not the segmented section is the steady-state part of a vowel. This decision can be made by checking whether the power within the segmentation section and the value of the autocorrelation function at a lag of one pitch period (the pitch gain) are larger than predetermined thresholds.
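The steady-state vowel decision described above can be sketched as a two-threshold test. The normalization of the pitch gain by the section energy, the function name, and both threshold parameters are assumptions made for illustration; the patent only states that power and pitch gain are compared with predetermined thresholds.

```python
def is_vowel_steady(samples, pitch_period, power_threshold, gain_threshold):
    """Decide whether a segment looks like the steady-state part of a vowel."""
    n = len(samples)
    energy = sum(x * x for x in samples)
    if n == 0 or energy == 0.0:
        return False
    power = energy / n
    # Autocorrelation at a lag of one pitch period, normalized by the
    # section energy (assumed normalization), used as the pitch gain.
    lagged = sum(samples[i] * samples[i + pitch_period]
                 for i in range(n - pitch_period))
    pitch_gain = lagged / energy
    return power > power_threshold and pitch_gain > gain_threshold
```

A strongly periodic, energetic segment passes both tests; silence fails the power test, and a segment with little correlation at the pitch lag fails the gain test.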

When the segmented section is the steady-state part of a vowel, the excitation calculation section 420 divides the segmentation section into subframes, one per pitch period, and calculates an excitation pulse train for one of these pitch sections. For the calculation of the excitation pulse train, refer to the specification of Japanese Patent Application No. 59-272435 (Reference 5).

For the other pitch sections, an amplitude correction coefficient is obtained for each pitch section so that the waveform of that pitch section is represented well.
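The text does not say how the amplitude correction coefficient is computed. One plausible reading, shown here purely as an assumption, is a least-squares gain: the single scale factor that best maps the waveform synthesized for the transmitted pitch section onto the waveform of another pitch section. The function and argument names are hypothetical.

```python
def amplitude_correction(target, reference):
    """Least-squares gain g that minimizes sum((target - g * reference)**2).

    target: waveform of the pitch section to be represented.
    reference: waveform synthesized from the transmitted pitch section.
    """
    energy = sum(r * r for r in reference)
    if energy == 0.0:
        return 0.0
    return sum(t * r for t, r in zip(target, reference)) / energy
```

For example, a target that is exactly half the reference yields a coefficient of 0.5.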

Therefore, according to the present invention, the number of excitation pulses in the one pitch section can be increased considerably even when the bit rate is much lower than in the conventional methods. Consequently, even though the other pitch sections are restored by amplitude correction or interpolation as described later, the excitation signal of the whole section can be represented well.

When the segmentation section is not the steady-state part of a vowel, on the other hand, the excitation pulse train is obtained over the whole section.

The information transmitted by the transmitting side consists of the amplitudes and positions of the excitation pulse train, segmentation information indicating the length of the segmented section, the pitch period, the decision information, and the amplitude correction coefficients. For the steady-state part of a vowel, the receiving side restores the pulse trains of the pitch sections that were not transmitted, either by changing the amplitudes and positions of the transmitted excitation pulse train smoothly from pitch period to pitch period or by interpolating the excitation signal between segmented sections, and thereby restores the excitation signal of the segmented section.
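A minimal sketch of the receiver-side restoration for a steady-state vowel section, assuming the simplest variant: the transmitted pulses are repeated at every pitch period and scaled by that pitch section's amplitude correction coefficient. The function name and the (position, amplitude) tuple layout are hypothetical, and smoother position changes or inter-segment interpolation are not modeled.

```python
def restore_excitation(pulses, pitch_period, section_length, gains):
    """Rebuild the excitation of a whole section from one pitch section.

    pulses: (position, amplitude) pairs, positions relative to the start
            of a pitch section.
    gains:  one amplitude correction coefficient per pitch section
            (typically 1.0 for the transmitted section itself).
    """
    excitation = [0.0] * section_length
    for k, gain in enumerate(gains):
        offset = k * pitch_period          # shift by whole pitch periods
        for position, amplitude in pulses:
            index = offset + position
            if 0 <= index < section_length:
                excitation[index] += gain * amplitude
    return excitation
```

Only one pulse train and a handful of scalar gains are needed to rebuild the excitation of a section many pitch periods long, which is where the bit-rate saving comes from.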

Next, the embodiment is described with reference to FIG. 1.

In FIG. 1, the transmitting side includes the speech signal encoding device and the receiving side includes the speech signal decoding device, with an appropriate transmission path between them.

The speech signal encoding device has: a segmentation circuit that extracts, from the input discrete speech signal sequence, feature parameters that correspond well to auditory characteristics and uses them to segment the speech signal into non-uniform time sections; a spectral parameter calculation circuit that calculates, from the divided speech signal, spectral parameters representing short-time spectral characteristics and a pitch parameter; an excitation pulse calculation circuit that calculates a combination of a plurality of pulse trains representing the excitation signal in all or some of the divided sections; and a multiplexer circuit that combines and outputs the spectral parameters, the pitch parameter, and the excitation pulse trains.

The speech signal decoding device has: a demultiplexer circuit that receives spectral parameters representing the short-time spectral characteristics of the speech signal, a pitch parameter, and excitation pulse trains representing the excitation signal, and separates them; an excitation restoration circuit that uses the pitch parameter and the excitation pulse trains to restore the excitation signal of an entire non-uniformly divided section; and a synthesis filter that synthesizes the speech signal of the section from the restored excitation signal.

The speech signal encoding and decoding are performed as follows.

In FIG. 1, which shows an embodiment of the present invention, a discrete speech signal is input at input terminal 500. The segmentation measure calculation circuit 505 performs the same calculation as the segmentation measure calculation section 400 of FIG. 2(a) and outputs the segmentation measure. The segmentation circuit 510 performs the same processing as the segmentation section 410 of FIG. 2(a): it segments the speech signal into non-uniform sections and outputs the segmented speech signal together with segmentation information representing the length of each segmentation section. The LPC/pitch calculation circuit 520 performs the same processing as the LPC/pitch analysis section 430 of FIG. 2(a): for the segmented speech signal it performs LPC analysis, calculates the pitch period, and decides whether the segmented section is the steady-state part of a vowel, and it outputs the LPC coefficients, the pitch period, and the decision information to the quantizer 530. The quantizer 530 quantizes this information with a predetermined number of bits, outputs it to the multiplexer 600, and also dequantizes it.

The weighting circuit 540 applies weighting to the segmented speech signal using that signal and the dequantized LPC coefficients; for the weighting method, refer to the weighting circuit (200) of Reference 5. The impulse response calculation circuit 560 calculates an impulse response from the dequantized LPC coefficients; for the method, refer to the impulse response calculation circuit (170) of Reference 5. The autocorrelation function calculation circuit 570 calculates the autocorrelation function of the impulse response and outputs it to the excitation pulse calculation circuit 580; for the method, refer to the autocorrelation function calculation circuit (180) of Reference 5. The cross-correlation function calculation circuit 550 calculates the cross-correlation function between the weighted signal and the impulse response and outputs it to the excitation pulse calculation circuit 580; for the method, refer to the cross-correlation function calculation circuit (210) of Reference 5.
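Reference 5 is not reproduced here, so the following only sketches how these two correlation functions are commonly used in a greedy multi-pulse search: each pulse is placed at the peak of the cross-correlation, its amplitude is the peak divided by the zero-lag autocorrelation, and its contribution is then subtracted. This is an assumed textbook-style procedure, not necessarily the method of circuits 550 to 580; all function names are hypothetical, and edge effects at the end of the section are ignored.

```python
def autocorrelation(h, max_lag):
    """R[lag] = sum_i h[i] * h[i + lag] for the impulse response h."""
    return [sum(h[i] * h[i + lag] for i in range(len(h) - lag))
            for lag in range(max_lag + 1)]


def cross_correlation(x, h):
    """phi[m] = sum_n x[n] * h[n - m] for the weighted signal x."""
    return [sum(x[n] * h[n - m] for n in range(m, min(len(x), m + len(h))))
            for m in range(len(x))]


def greedy_pulses(x, h, num_pulses):
    """Return (position, amplitude) pairs found one pulse at a time."""
    phi = cross_correlation(x, h)
    r = autocorrelation(h, len(h) - 1)   # assumes h is not all zeros
    pulses = []
    for _ in range(num_pulses):
        m = max(range(len(phi)), key=lambda k: abs(phi[k]))
        g = phi[m] / r[0]
        pulses.append((m, g))
        # Remove this pulse's contribution from the cross-correlation.
        for k in range(len(phi)):
            lag = abs(k - m)
            if lag < len(r):
                phi[k] -= g * r[lag]
    return pulses
```

The search works entirely on the two correlation sequences, which is why the encoder computes them once per section before the pulse calculation.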

When the segmented section is the steady-state part of a vowel, the excitation pulse calculation circuit 580 divides the section into subframes, one per pitch period, as described for FIG. 2(a), and calculates the excitation pulse train for the subframe near the center. For the other subframes it obtains one amplitude correction coefficient for the pulse train per subframe, also as described for FIG. 2(a). When the section is not the steady-state part of a vowel, it calculates the excitation pulse train over the whole section. For the calculation of the excitation pulse train, refer to the drive signal calculation circuit (220) of Reference 5. The quantizer 590 quantizes the amplitudes and positions of the excitation pulse train with a predetermined number of bits and outputs them to the multiplexer 600; for its operation, refer to the encoding circuit (230) of Reference 5. The multiplexer 600 combines and outputs the quantized excitation pulse train, LPC coefficients, pitch period, segmentation information, decision information, and amplitude correction coefficients.
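The set of items combined by the multiplexer can be pictured as one record per segment. The sketch below is only an illustration of that grouping: the class name, field names, and types are hypothetical, quantization and bit allocation are not modeled, and the consistency check simply encodes the rule that pulses lie within one pitch period for a steady-state vowel and within the whole segment otherwise.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SegmentPayload:
    segment_length: int                  # segmentation information (samples)
    is_vowel_steady: bool                # decision information
    pitch_period: int                    # samples
    lpc: List[float]                     # spectral parameters
    pulses: List[Tuple[int, float]]      # (position, amplitude) pairs
    amplitude_corrections: List[float] = field(default_factory=list)

    def pulse_span(self) -> int:
        """Length of the interval in which pulse positions may fall."""
        return self.pitch_period if self.is_vowel_steady else self.segment_length

    def is_consistent(self) -> bool:
        span = self.pulse_span()
        return all(0 <= position < span for position, _ in self.pulses)
```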

On the receiving side, the demultiplexer 610 separates and outputs the excitation pulse information, LPC coefficients, pitch period, segmentation information, decision information, and amplitude correction coefficients. The excitation pulse decoder 620 decodes the amplitudes and positions of the excitation pulse train. The LPC/pitch decoder 650 decodes the LPC coefficients and the pitch period. The excitation restorer 630 receives the decision information and the segmentation information and, when the section is the steady-state part of a vowel, restores and outputs the excitation signal of the whole segmentation section from the decoded excitation pulse train of one pitch section. To restore the excitation pulse train of a pitch section that was not transmitted, the positions are restored by shifting all the pulses of the pitch section by the pitch period, and the amplitudes are restored by multiplying by the amplitude correction coefficient. Other known methods include restoration by interpolation using the excitation pulse trains of adjacent segmentation sections; see Reference 5 for details. Other well-known methods can also be used. When the section is not the steady-state part of a vowel, the excitation signal of the whole section is generated from the received excitation pulse train and output.

The interpolator 640 uses the decoded LPC coefficients, the decision information, and the pitch period. When the segmentation section is the steady-state part of a vowel, it interpolates the LPC coefficients in the PARCOR coefficient domain at every pitch period so that the spectrum changes smoothly. When the section is not the steady-state part of a vowel, it outputs the coefficients to the synthesis filter 660 without interpolation, because outside the steady-state part of a vowel the spectral features of the speech signal change quickly and interpolation would instead introduce large distortion. The synthesis filter 660 uses the LPC coefficients, the restored excitation signal, and the segmentation information to synthesize the speech signal over the whole segmentation section and outputs it through terminal 670.
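Interpolating in the PARCOR domain can be sketched with the standard step-down and step-up recursions. The sign convention A(z) = 1 + sum_k a_k z^-k with k_m equal to the last coefficient of the order-m predictor is an assumption (conventions differ between texts), as are the function names and the linear weighting; the patent does not give these details.

```python
def lpc_to_parcor(a):
    """Step-down recursion: LPC a_1..a_p -> PARCOR k_1..k_p."""
    a = list(a)
    ks = []
    for m in range(len(a), 0, -1):
        k = a[m - 1]
        ks.append(k)
        if m > 1:
            denom = 1.0 - k * k          # requires |k| < 1 (stable filter)
            a = [(a[i] - k * a[m - 2 - i]) / denom for i in range(m - 1)]
    ks.reverse()
    return ks


def parcor_to_lpc(ks):
    """Step-up recursion: PARCOR k_1..k_p -> LPC a_1..a_p."""
    a = []
    for k in ks:
        m = len(a)
        a = [a[i] + k * a[m - 1 - i] for i in range(m)] + [k]
    return a


def interpolate_lpc(a_prev, a_next, weight):
    """Linear interpolation in the PARCOR domain; weight runs from 0 to 1."""
    k_prev = lpc_to_parcor(a_prev)
    k_next = lpc_to_parcor(a_next)
    mixed = [(1.0 - weight) * p + weight * n for p, n in zip(k_prev, k_next)]
    return parcor_to_lpc(mixed)
```

One practical reason for this choice of domain: if both endpoint filters have all |k_m| < 1, every linear mix does too, so the interpolated synthesis filter stays stable, which is not guaranteed when the LPC coefficients are interpolated directly.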

As described above, with this configuration the speech signal is segmented non-uniformly using feature parameters that correspond well to auditory characteristics; the spectral parameters are quantized by switching among several kinds of vector quantizers according to the spectral features of the segmented section; and, when the section is the steady-state part of a vowel, which is long in time and shows almost no change in speech features, the excitation pulse train is obtained for one pitch section of the section, whereas otherwise it is obtained over the whole section. The amount of information required to transmit the excitation signal can therefore be greatly reduced. Consequently, even when the bit rate is lowered substantially, the auditory degradation of the synthesized speech is very small and high naturalness is obtained.

The embodiment described above is only one embodiment of the present invention, and various modifications are conceivable.

For example, when the segmented section is the steady-state part of a vowel, the cross-correlation function calculation circuit 550 may calculate the cross-correlation function only for one pitch section near the center of the section rather than for the whole section, because the excitation pulse train is actually obtained for only one pitch section. With this method the performance degrades slightly, but the amount of computation is reduced to roughly P/N (where P is the pitch period and N is the length of the segmentation section of the steady-state part of the vowel).
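As a worked example of the P/N ratio, with purely illustrative numbers that are not taken from the patent (a pitch period of 8 ms and a steady-state vowel segment of 160 ms):

```latex
\frac{P}{N} = \frac{8\ \text{ms}}{160\ \text{ms}} = 0.05
```

Under these assumed values the cross-correlation would cost about one twentieth of the full-section computation.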

Besides the method of the embodiment above, other good well-known methods can be used to calculate the excitation pulses. For these, refer to K. Ozawa, "A Study of Pulse Search Algorithms for Multi-pulse Speech Codec Realization" (J. Selected Areas in Communications, pp., 1987) (Reference 6).

When the segmented section is a vowel stationary part, the position of the one pitch section for which the sound source pulse train is obtained need not be fixed. It can instead be found by searching for the pitch section that yields the best synthesized speech. This processing further improves the sound quality, at the cost of a slight increase in the amount of calculation. For a specific method, see Reference 5 cited above.
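One way to picture such a search is the following sketch. It is a deliberate simplification under assumed conditions and is not the procedure of Reference 5: instead of synthesizing speech from a pulse train, each candidate pitch section is simply repeated across the segment, and the candidate with the smallest squared error against the segment is kept. All names are hypothetical.

```python
def repetition_error(segment, start, pitch_period):
    """Squared error when segment[start:start + pitch_period] is tiled over the segment."""
    template = segment[start:start + pitch_period]
    error = 0.0
    for n, sample in enumerate(segment):
        # Align the tiling so that the template sits at its own position.
        predicted = template[(n - start) % pitch_period]
        error += (sample - predicted) ** 2
    return error


def best_pitch_section(segment, pitch_period):
    """Start index of the candidate pitch section with the smallest error.

    Candidates are taken every pitch_period samples (an assumption made
    here to keep the search small); ties go to the earliest candidate.
    """
    candidates = range(0, len(segment) - pitch_period + 1, pitch_period)
    return min(candidates, key=lambda s: repetition_error(segment, s, pitch_period))
```

The cost of the search grows with the number of candidates, which matches the slight increase in computation noted above.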

The coefficients of the synthesis filter 660 can also be interpolated on log area ratios or on other parameters. Furthermore, besides linear interpolation, other interpolation methods such as logarithmic interpolation can be used. For details of these methods, see B. S. Atal, "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave" (J. Acoust. Soc. America, pp. 637-655, 1971) (Reference 7).
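Interpolation on log area ratios can be sketched as follows. The sketch assumes the filter is described by reflection coefficients k with |k| < 1 and uses the convention g = ln((1 + k) / (1 - k)); the opposite sign convention is also in use, and the function names are hypothetical.

```python
import math


def reflection_to_lar(k):
    """Log area ratio of a reflection coefficient k, with |k| < 1."""
    return math.log((1.0 + k) / (1.0 - k))


def lar_to_reflection(g):
    """Inverse mapping; the magnitude stays below 1 unless |g| is large
    enough for tanh to saturate in floating point."""
    return math.tanh(g / 2.0)


def interpolate_on_lar(k_start, k_end, t):
    """Linearly interpolate two sets of reflection coefficients in the LAR domain.

    t = 0 reproduces k_start and t = 1 reproduces k_end.
    """
    result = []
    for ka, kb in zip(k_start, k_end):
        ga, gb = reflection_to_lar(ka), reflection_to_lar(kb)
        result.append(lar_to_reflection((1.0 - t) * ga + t * gb))
    return result
```

A reason to interpolate in this domain is that any interpolated log area ratio maps back to a reflection coefficient of magnitude below 1, so the interpolated synthesis filter remains stable.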

In addition, the quality of the synthesized speech is further improved if the receiving side varies the pitch period smoothly by interpolation.

[Effects of the Invention]

As described above, according to the audio signal encoding/decoding method of the present invention, when an audio signal is encoded, transmitted, and decoded, a synthesized audio signal that represents the original audio signal well can be obtained. Consequently, the deterioration in sound quality can be made smaller than with conventional methods that use fixed-length frames, or that vary the frame length based on the change between frames in the spectrum obtained with fixed-length frames.

Further, according to the present invention, the audio signal can be divided into non-uniform sections using feature parameters that correspond well to auditory characteristics. As a result, when a divided section is a vowel stationary part that is long in time and shows almost no change in speech characteristics, the sound source pulse train is obtained for only one pitch section within that section, whereas for any section other than a vowel stationary part the sound source pulse train is obtained over the entire section. The amount of information required to transmit the sound source signal can therefore be greatly reduced.

Therefore, even when the bit rate is greatly reduced, the present invention can perform encoding/decoding processing that yields highly natural synthesized speech with very little perceptual deterioration, and it is thus suited to efficiently encoding and decoding audio signals at low bit rates.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the audio signal encoding method and audio signal encoding device according to the present invention, and FIG. 2 shows a principle block diagram and waveform diagrams used to explain the present invention.

400: segmentation measure calculation unit
410: segmentation unit
420: sound source calculation unit
430: LPC and pitch analysis unit
505: segmentation measure calculation circuit
510: segmentation circuit
520: LPC and pitch calculation circuit
530, 590: quantizers
540: weighting circuit
550: cross-correlation function calculation circuit
560: impulse response calculation circuit
570: autocorrelation function calculation circuit
600: multiplexer
610: demultiplexer
620: decoder
630: sound source restorer
640: interpolator
650: LPC and pitch decoder
660: synthesis filter

Claims

(57) [Claims]

1. An audio signal encoding/decoding method comprising: receiving a discrete audio signal and obtaining cepstrum parameters at predetermined time intervals; calculating a measure indicating the temporal change of the cepstrum parameters; dividing the audio signal into non-uniform time sections by searching, in the vicinity of a time position at which the measure is a local maximum, for a time position at which the measure falls below a predetermined threshold, and setting that time position as a segment boundary; representing the sound source signal in all or some of the divided sections as a combination of a plurality of pulse trains and transmitting it; and restoring the sound source signal in the section using the transmitted pulse trains and outputting a synthesized audio signal representing the audio signal.

2. An audio signal encoding device comprising: a segmentation measure calculation circuit that obtains cepstrum parameters at predetermined time intervals from an input discrete audio signal sequence and calculates a measure indicating the temporal change of the cepstrum parameters; a segmentation circuit that divides the audio signal into non-uniform time sections by searching, in the vicinity of a time position at which the measure is a local maximum, for a time position at which the measure falls below a predetermined threshold, and setting that time position as a segment boundary; a spectrum parameter calculation circuit that calculates, from all or some of the divided sections, spectrum parameters representing short-time spectral characteristics and pitch parameters; a sound source pulse calculation circuit that calculates a combination of a plurality of pulse trains to represent the sound source signal in all or some of the divided sections; and a multiplexer circuit that combines and outputs the spectrum parameters, the pitch parameters, and the sound source pulse trains.
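The segmentation rule recited in the claims can be sketched as follows. This is one possible reading under stated assumptions: the measure is taken here as the Euclidean distance between the cepstrum vectors of successive frames, only local maxima at or above the threshold are considered, and from each such maximum the nearest frame on either side whose measure is below the threshold is taken as a boundary. The function names, the choice of distance and the two-sided search are illustrative assumptions, not details given in the claims.

```python
import math


def change_measure(cepstra):
    """Distance between cepstrum vectors of successive frames (assumed measure)."""
    measure = [0.0]  # no predecessor for the first frame
    for prev, cur in zip(cepstra, cepstra[1:]):
        measure.append(math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, cur))))
    return measure


def segment_boundaries(measure, threshold):
    """Frame indices chosen as segment boundaries."""
    boundaries = set()
    last = len(measure) - 1
    for i in range(1, last):
        is_peak = measure[i - 1] < measure[i] >= measure[i + 1]
        if not is_peak or measure[i] < threshold:
            continue
        # Walk outward from the peak to the nearest frame below the threshold;
        # the ends of the signal act as boundaries if no such frame is found.
        left = i
        while left > 0 and measure[left] >= threshold:
            left -= 1
        right = i
        while right < last and measure[right] >= threshold:
            right += 1
        boundaries.update((left, right))
    return sorted(boundaries)
```

Because the boundaries follow the peaks of the measure rather than a fixed frame grid, the resulting sections are non-uniform in length: short around rapid spectral transitions and long over stationary parts.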