JP2847730B2 - Audio coding method

Info

Publication number: JP2847730B2
Application number: JP1023255A
Authority: JP
Inventors: 一範小澤
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1989-02-01
Filing date: 1989-02-01
Publication date: 1999-01-20
Anticipated expiration: 2014-01-20
Also published as: JPH02203399A

Description

[Technical field]

The present invention relates to a speech coding method, and more particularly to a speech coding method for encoding a speech signal with high quality at a low bit rate, particularly 4.8 kb/s or less, with a relatively small amount of computation.

[Conventional technology]

A known method for encoding a speech signal at a low bit rate of about 4.8 kb/s is the pitch-interpolation multipulse method described in, for example, Japanese Patent Application No. 59-272435 (Reference 1) and Japanese Patent Application No. 60-178911 (Reference 2). In this method, the transmitting side extracts from each frame of the speech signal a spectral parameter representing its spectral characteristics and a pitch parameter representing its pitch, classifies the speech signal as either a voiced section or an unvoiced section, and transmits information for each kind of section as follows.

In a voiced section, one frame is divided into pitch sections, and the excitation signal of the frame is represented by the multipulses of just one of those pitch sections (the representative section). The amplitudes and positions of the multipulses in the representative section are transmitted together with the spectral and pitch parameters. In an unvoiced section, the excitation of one frame is represented by a small number of multipulses and a noise signal, and the amplitudes and positions of the multipulses and the gain and index of the noise signal are transmitted.

On the receiving side, in a voiced section, the multipulses of the representative section of the current frame and those of the representative section of an adjacent frame are used to interpolate multipulse amplitudes and positions. This restores the multipulses of the pitch sections other than the representative section, and hence the excitation signal of the frame. In an unvoiced section, the excitation signal of the frame is restored from the multipulses and the index and gain of the noise signal.

The receiving side then feeds the restored excitation signal into a synthesis filter built from the spectral parameters and outputs the resulting synthesized speech signal as the reproduced speech.

[Problems to be solved by the invention]

Conventionally, however, a fixed amount of information is assigned to every frame when the speech signal is encoded and the information needed for reproduction is transmitted to the receiving side. With such fixed per-frame allocation, it is not easy to lower the bit rate further while still securing the required reproduction quality.

Specifically, the conventional method described above assigns a fixed amount of information to each 20 msec frame (for example, 96 bits per frame at a bit rate of 4.8 kb/s). Fixed per-frame allocation is convenient for hardware implementation, but it makes further reduction of the bit rate difficult.
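The 96-bit figure follows directly from the frame length and the bit rate. A minimal sketch of that arithmetic (the function name is only illustrative):

```python
def bits_per_frame(bit_rate_bps: float, frame_ms: float) -> float:
    """Bits available in one frame when the bit rate is fixed."""
    return bit_rate_bps * frame_ms / 1000.0


# 4.8 kb/s with 20 msec frames, as in the conventional method
print(bits_per_frame(4800, 20))  # 96.0
```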

An object of the present invention is to solve the above problem and to provide a speech coding method that enables reproduction with good sound quality at a low bit rate with a relatively small amount of computation.

[Means for solving the problem]

The speech coding method of the present invention obtains, from an input discrete speech signal, a parameter representing a feature of the speech signal and segments the speech signal according to this feature parameter. When a section obtained by the segmentation is not a vowel section, the method determines the order of the spectral parameter representing the spectral envelope of the speech signal in that section and the number of multipulses representing its excitation signal, and obtains the spectral parameter and the amplitudes and positions of the multipulses. When the section is a vowel section, the method obtains a coefficient for predicting the signal of the vowel section from the signal of at least one section adjacent to it.

[Action]

The present invention is more flexible than methods that assign and transmit a fixed amount of information per frame. The speech signal is segmented according to feature parameters representing its characteristics, and the amount of information needed for human phoneme perception is assigned to each segment and transmitted. The amount of transmitted information is smaller than in the conventional method, and waveform correlation can also be exploited, so the bit rate can be reduced further than with the conventional method. Good sound quality can be maintained even in portions where the speech characteristics are changing, which are important for the perception of phonemes and naturalness.

[Example]

FIG. 1 is a block diagram showing one embodiment of the speech coding method according to the present invention. As shown in FIG. 1, this embodiment uses a speech coding apparatus made up of a buffer memory 110, connected to an input terminal 100 to which the speech signal is supplied, and the circuits that follow it.

An output terminal 300 is connected to the multiplexer 260, which forms the final stage of the speech coding apparatus of FIG. 1, and the output of the multiplexer 260 is sent to the receiving side through the output terminal 300 as the transmitted information. The apparatus may basically perform encoding by the multipulse coding method: the spectral parameters and the amplitudes and positions of the multipulses are input to the multiplexer 260 and transmitted.

Although it is thus based on the multipulse coding method, the speech coding apparatus of FIG. 1 also includes a feature extraction circuit 115 and a segmentation circuit 120. These are used to obtain, from the input discrete speech signal, feature parameters representing the characteristics of the speech signal and to segment (divide) the speech signal according to those parameters.

In addition, the apparatus includes switches 195 and 235, each connected to a line l running from the segmentation circuit 120 to the multiplexer 260. In this embodiment, the segmentation information obtained when the speech signal is segmented as described above is sent out on line l, and the switches 195 and 235 are switched synchronously in response to it. That is, the switches 195 and 235 are selectively set either to the terminals 195a and 235a or to the terminals 195b and 235b.

The switches 195 and 235 serve the following purpose. When a section obtained by the segmentation is not a vowel section, the order of the spectral parameter representing the spectral envelope of the speech signal in that section and the number of multipulses representing its excitation signal are determined, and the spectral parameter and the amplitudes and positions of the multipulses are obtained. When the section is a vowel section, a coefficient for predicting the signal of the vowel section from the signal of at least one section adjacent to it is obtained.

In encoding a speech signal and transmitting the information, the speech coding method according to the present invention thus obtains parameters representing features of the speech signal from the input discrete speech signal and segments the speech signal according to those feature parameters. When a section obtained by the segmentation is not a vowel section, it determines the order of the spectral parameter representing the spectral envelope of the speech signal in that section and the number of multipulses representing its excitation signal, and obtains the spectral parameter and the amplitudes and positions of the multipulses. When the section is a vowel section, it obtains a coefficient for predicting the signal of the vowel section from the signal of at least one section adjacent to it.

This is based on the following findings. A speech waveform carries phonemic information, but the amount of information needed to perceive it is not constant and varies from one section of the waveform to another. For example, it differs between vowel steady parts and transient parts. The information in a vowel steady part is not very important for the perception of vowels or consonants: even when the signal in the vowel steady part is set to zero, 100% vowel intelligibility is obtained and consonant intelligibility does not fall. On the other hand, when the amount of information in a transient part is reduced below a certain value, both vowel and consonant intelligibility fall.

Therefore, by dividing speech into various sections (for example, vowel steady parts, transient parts, and consonant parts) and assigning to each the amount of information needed for phoneme perception according to its characteristics, the bit rate can be reduced further than with the conventional method.

That is, a speech coding method with good sound quality can be realized at a bit rate of 4.8 kb/s or less with a relatively small amount of computation.

In more detail, the principle can be explained as follows.

First, when the segmentation circuit 120 of FIG. 1 performs segmentation, the speech signal is roughly segmented (divided) into, for example, vowel parts, vowel steady parts, transient parts, consonant parts, and silent parts. This uses feature parameters such as the power, spectral change rate, and pitch of the speech signal, obtained for each short interval of about 5 msec. The temporal change of these parameters over an interval of 20 to 30 msec can be used as well.

Specifically, segmentation can proceed as follows, for example. A section is classed as a vowel section when the power is large, the spectral change rate is small, the power and spectral change rate vary little over time, and the pitch is clear. A section of a certain duration at the centre of a vowel section is classed as a vowel steady part. Segments before and after a local maximum of the spectral change rate, where the power and spectral change rate vary greatly over time, are classed as transient parts. A section is classed as a consonant part when the power is small and the power and spectral change rate do not vary much over time, and as a silent part when the power is very small. The segmentation resolution in this case is 5 msec, but it can be set to any value by changing the length of the interval over which the feature parameters are computed.
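The rules above can be sketched as a small rule-based classifier. This is only an illustration: the patent gives no numeric thresholds, so every threshold below is hypothetical, the field names are invented for the example, the fallback case is an assumption, and the detection of local maxima of the spectral change rate and of the central "steady" part of a vowel is left out.

```python
from dataclasses import dataclass


@dataclass
class Features:
    power: float            # short-time power of one 5 msec interval
    spectral_change: float  # spectral change rate
    power_delta: float      # temporal change of power over 20-30 msec
    spectral_delta: float   # temporal change of the spectral change rate
    pitch_clear: bool       # True when a clear pitch was found


# Hypothetical thresholds, chosen only to make the sketch runnable.
SILENCE_POWER = 1e-4
LOW_POWER = 1e-2
SMALL_CHANGE = 0.1
LARGE_CHANGE = 0.5


def classify(f: Features) -> str:
    """Label one interval as silence, transient, vowel or consonant."""
    if f.power < SILENCE_POWER:
        return "silence"
    changing = f.power_delta > LARGE_CHANGE and f.spectral_delta > LARGE_CHANGE
    steady = f.power_delta < SMALL_CHANGE and f.spectral_delta < SMALL_CHANGE
    if changing:
        return "transient"
    if (f.power >= LOW_POWER and f.spectral_change < SMALL_CHANGE
            and steady and f.pitch_clear):
        return "vowel"
    if f.power < LOW_POWER:
        return "consonant"
    # Not covered by the stated rules: fall back to the class that
    # receives the most information (an assumption of this sketch).
    return "transient"
```

Consecutive intervals with the same label would then be merged into one segment, which is how the 5 msec resolution mentioned above arises.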

Segmentation methods other than this one can also be used, for example well-known methods from the field of speech recognition. For how to obtain the spectral change rate, see Furui, "The role of spectral change information in speech perception" (Acoustical Society of Japan, Hearing Research Group material H-85-6, 1985) (Reference 3), among others.

Each segmented part is encoded as follows on the basis of the multipulse coding method.

When the segmented section is a silent part, nothing is sent other than information indicating the segment length. When it is a consonant part, it is encoded by the conventional multipulse coding method of References 1 and 2, with L₁ excitation pulses and a spectral parameter of order M₁. When it is a transient part, it is likewise encoded by the conventional multipulse coding method, with a spectral parameter of order M₂ and L₂ pulses.

As also shown in Hanada and Ozawa, "Effects of vowel steady parts and transient parts on intelligibility in coded speech" (Proceedings of the Acoustical Society of Japan meeting, 2-1-17, October 1988) (Reference 4), among others, the transient part is an important section for phoneme perception and needs to be allocated a fairly large amount of information. Accordingly, L₂ > L₁ and M₂ ≥ M₁.
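The per-segment allocation described so far can be summarized as a lookup table. The sketch below is illustrative only: the numeric values of L₁, L₂, M₁ and M₂ are hypothetical (the patent constrains only their ordering), and the type and field names are invented for the example.

```python
from typing import NamedTuple, Optional


class Allocation(NamedTuple):
    lpc_order: Optional[int]  # order M of the spectral parameter, None if not sent
    pulses: Optional[int]     # number L of multipulses, None if not sent
    sends_pitch_gain: bool    # True when pitch period T and gain g are sent


# Hypothetical values; only L2 > L1 and M2 >= M1 come from the text.
L1, M1 = 4, 8
L2, M2 = 8, 10
assert L2 > L1 and M2 >= M1

ALLOCATION = {
    "silence": Allocation(None, None, False),      # segment length only
    "consonant": Allocation(M1, L1, False),        # conventional multipulse
    "transient": Allocation(M2, L2, False),        # conventional multipulse, more bits
    "vowel_steady": Allocation(None, None, True),  # segment length, T and g
}
```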

The order of the spectral parameter representing the spectral envelope of the speech signal and the number of multipulses representing the speech signal can be determined, for example, as described above.

When the segmented section is a vowel steady part, very little information needs to be transmitted from the standpoint of phoneme perception. Moreover, in a vowel steady part the speech waveform is strongly correlated from one pitch period to the next. Exploiting this correlation, only the segment length, the pitch period T, and an amplitude correction coefficient (gain) g for each pitch period are obtained and transmitted, and the receiving side restores the speech of the vowel steady part from this information.

Specifically, the pitch period T and the amplitude correction coefficient g can be obtained as follows.

T and g are calculated according to equation (1) for each predetermined short interval (N samples) that is shorter than the segment.
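Equation (1) is not reproduced in this text. A form consistent with the definitions of x(n), the past synthesized signal, w(n) and the error power E given in this passage, offered as a reconstruction and not as the patent's exact notation, is:

$$E = \sum_{n=1}^{N} \Bigl\{ \bigl[ x(n) - g\,\hat{x}(n-T) \bigr] * w(n) \Bigr\}^{2} \tag{1}$$

Here * denotes convolution and the sum runs over the N samples of the short interval.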

Here, x(n) is the input speech signal of the segment, and $\hat{x}(n - T)$ is the synthesized speech signal encoded and reproduced in the adjacent past segment. w(n) is the impulse response of the perceptual weighting filter; for details, see the weighting circuit in References 1 and 2. To obtain the T and g that minimize the error power E over the short interval, equation (1) is partially differentiated with respect to g and set to zero, which yields equation (2).
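Equation (2) is likewise not reproduced here. Writing $x_w(n) = x(n) * w(n)$ and $\hat{x}_w(n-T) = \hat{x}(n-T) * w(n)$ for the perceptually weighted signals (shorthand introduced for this reconstruction), setting the derivative of the squared weighted error with respect to g to zero gives:

$$g = \frac{\sum_{n=1}^{N} x_w(n)\,\hat{x}_w(n-T)}{\sum_{n=1}^{N} \hat{x}_w(n-T)^{2}} \tag{2}$$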

Substituting equation (2) into equation (1) gives equation (3).
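Equation (3) is not reproduced here either. With the weighted signals $x_w(n) = x(n) * w(n)$ and $\hat{x}_w(n-T) = \hat{x}(n-T) * w(n)$ (shorthand introduced for this reconstruction), substituting the optimal gain into the error power yields a constant first term minus a second term that depends on T:

$$E = \sum_{n=1}^{N} x_w(n)^{2} - \frac{\Bigl[\sum_{n=1}^{N} x_w(n)\,\hat{x}_w(n-T)\Bigr]^{2}}{\sum_{n=1}^{N} \hat{x}_w(n-T)^{2}} \tag{3}$$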

Since the first term of equation (3) is a constant, equation (1) is minimized by maximizing the second term. It therefore suffices to find the pair of g and T that maximizes the second term of equation (3). This pair of g and T is calculated and transmitted for each predetermined short interval of N samples.
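The search for T and g can be sketched as an exhaustive loop over candidate pitch periods. This is a minimal sketch under stated assumptions, not the patent's circuit: both input lists are taken to be already perceptually weighted, the function name and argument names are invented, and every candidate T is required to be at least the interval length N so that the delayed signal always lies in the past buffer.

```python
import math


def search_pitch_gain(xw, past_w, t_min, t_max):
    """Return (T, g) maximizing C^2 / D, where
    C = sum x_w(n) * p(n), D = sum p(n)^2 and p(n) is the past weighted
    synthesized signal delayed by T. past_w[-1] is the sample just
    before the current interval."""
    n_samples = len(xw)
    if t_min < n_samples or t_max > len(past_w):
        raise ValueError("need N <= t_min and t_max <= len(past_w)")
    best_t, best_g, best_score = t_min, 0.0, -1.0
    for t in range(t_min, t_max + 1):
        start = len(past_w) - t
        delayed = past_w[start:start + n_samples]
        c = sum(x * p for x, p in zip(xw, delayed))
        d = sum(p * p for p in delayed)
        if d > 0.0 and c * c / d > best_score:
            best_t, best_g, best_score = t, c / d, c * c / d
    return best_t, best_g


# Usage: a sinusoid with a 50-sample period, attenuated by 0.8 in the
# current 40-sample interval.
past = [math.sin(2 * math.pi * n / 50) for n in range(200)]
current = [0.8 * math.sin(2 * math.pi * (200 + n) / 50) for n in range(40)]
T, g = search_pitch_gain(current, past, 40, 70)
```

Because the criterion squares the correlation, a strongly negative correlation scores as well as a positive one; a practical coder might restrict the search to positive C, which is a design choice outside what the text states.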

Because this method requires a large amount of computation, the pitch period T may instead be obtained in advance, for example by the autocorrelation method, and substituted into equation (2) to calculate g. For calculating the pitch period by the autocorrelation method, see the pitch extraction circuit in References 1 and 2. To reduce the computation still further, g can be obtained from equation (4).
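Equation (4) is not reproduced in this text. One plausible form, assuming that g is chosen so that one pitch period of the past synthesized signal (power P₁), once scaled, has the same power as one pitch period of the current input signal (power P₂), is:

$$g = \sqrt{\frac{P_2}{P_1}} \tag{4}$$

This is a reconstruction from the definitions of P₁ and P₂ and may differ from the patent's exact expression.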

Here, P₁ is the power of one pitch period of the synthesized speech signal in the immediately preceding segment, and P₂ is the power of one pitch period of the input speech signal in the short interval of the current segment. With these methods, however, performance degrades as the amount of computation is reduced.

The operation is now described concretely with reference to FIG. 1. A speech signal is input from the input terminal 100 and stored in the buffer memory 110. The feature extraction circuit 115 obtains from the speech signal the feature parameters needed for segmentation. As stated in the explanation of the principle, these are the power, pitch, pitch gain, and spectral change rate. The feature extraction circuit 115 calculates them every 5 msec, as already described.
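As an illustration of one of these feature parameters, the short-time power can be computed per 5 msec interval as below. The 8 kHz sampling rate is an assumption (the text does not state one), and the function name is invented for the example.

```python
def short_time_power(samples, sample_rate_hz=8000, interval_ms=5.0):
    """Mean-square power of each consecutive, non-overlapping interval.
    A trailing partial interval is dropped."""
    n = int(sample_rate_hz * interval_ms / 1000)  # samples per interval
    return [
        sum(s * s for s in samples[i:i + n]) / n
        for i in range(0, len(samples) - n + 1, n)
    ]


# Two 5 msec intervals at 8 kHz: silence followed by a constant level of 2.0
print(short_time_power([0.0] * 40 + [2.0] * 40))  # [0.0, 4.0]
```

The temporal change used by the segmentation circuit could then be taken as the difference between these values across a 20 to 30 msec span.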

The segmentation circuit 120 receives the feature parameters and, also using their temporal change over intervals of 20 to 30 msec, segments the speech signal into silent parts, consonant parts, transient parts, and vowel steady parts by the method described in the explanation of the principle. The segmentation circuit 120 outputs the segment information to the multiplexer 260 and, when the segment is a silent part or a vowel steady part, the duration of the segment as well.

When the segmented section is neither a silent part nor a vowel steady part, the K-parameter calculation circuit 140 calculates K parameters of a predetermined order, by well-known LPC analysis, as the spectral parameters representing the spectral characteristics of the speech signal in the segment. For the specific calculation, see the K-parameter calculation circuit in References 1 and 2. K parameters are the same as PARCOR coefficients. As already stated, the order of the K parameters is M₁ when the segment is a consonant part and M₂ (M₁ ≤ M₂) when it is a transient part.
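For readers unfamiliar with LPC analysis, the sketch below shows one common way to obtain PARCOR (reflection) coefficients: the Levinson-Durbin recursion on the autocorrelation of the segment. It is not taken from References 1 and 2. The sign convention is an assumption (predictor written as x(n) ≈ Σ a_i x(n−i)); other texts use the opposite sign for k.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n - k], for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (k, a): reflection coefficients k[0..order-1] and
    predictor coefficients a[0..order-1] for x(n) ~ sum a[i-1] * x(n-i)."""
    a = [0.0] * (order + 1)  # a[0] is unused
    k = []
    err = r[0]               # prediction error power
    for m in range(1, order + 1):
        if err <= 0.0:
            raise ValueError("autocorrelation is not positive definite")
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        km = acc / err
        k.append(km)
        new_a = a[:]
        new_a[m] = km
        for i in range(1, m):
            new_a[i] = a[i] - km * a[m - i]
        a = new_a
        err *= 1.0 - km * km
    return k, a[1:]


# Autocorrelation of a first-order process with coefficient 0.5
k, a = levinson_durbin([1.0, 0.5, 0.25], 2)
```

In this example the second reflection coefficient comes out as zero, as expected for a first-order process. The order passed in would be M₁ or M₂ depending on the segment type.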

The K-parameter quantization circuit (K-parameter encoding circuit) 150 quantizes the K parameters with a predetermined number of quantization bits and outputs them to the multiplexer 260. It also dequantizes them, converts them into linear prediction coefficients a_i′, and outputs these.

The impulse response calculation circuit 170 uses the linear prediction coefficients a_i′ to calculate the impulse response h_w(n) of the perceptually weighted synthesis filter and outputs it to the autocorrelation function calculation circuit 180. The autocorrelation function calculation circuit 180 calculates the autocorrelation function R_hh(n) of that impulse response up to a predetermined lag and outputs it. For the operation of both circuits, see References 1 and 2, among others.
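The two calculations can be sketched as follows. The weighting itself is defined in References 1 and 2, which are not reproduced here, so this sketch assumes the form commonly used in multipulse coders: a weighting filter W(z) = A(z)/A(z/γ), which makes the weighted synthesis filter 1/A(z/γ) with A(z) = 1 − Σ a_i z^(−i). The parameter γ and the function names are assumptions of the example.

```python
def weighted_impulse_response(a, gamma, length):
    """Impulse response h_w(n), n = 0..length-1, of 1 / A(z / gamma),
    where A(z) = 1 - sum_{i>=1} a[i-1] * z^-i."""
    aw = [ai * gamma ** (i + 1) for i, ai in enumerate(a)]
    h = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0  # unit impulse at n = 0
        for i, c in enumerate(aw, start=1):
            if n - i >= 0:
                acc += c * h[n - i]
        h.append(acc)
    return h


def autocorrelation_of(h, max_lag):
    """R_hh(k) = sum over n of h(n) * h(n + k), for k = 0..max_lag."""
    return [sum(h[n] * h[n + k] for n in range(len(h) - k))
            for k in range(max_lag + 1)]


h_w = weighted_impulse_response([0.5], 1.0, 4)
r_hh = autocorrelation_of(h_w, 1)
```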

The subtracter 190 subtracts the output of the synthesis filter 281 from the speech signal x(n) in the segment and outputs the result to the weighting circuit 200. The weighting circuit 200 passes the result through the perceptual weighting filter, whose impulse response is w(n), to obtain the weighted signal x_w(n), which it outputs. For the weighting method, see References 1 and 2, as already noted.

The switch 195 is connected to terminal 195a (upper in the figure) when the segment is a consonant or transient part, and to terminal 195b (lower in the figure) when it is a vowel steady part.

When the segment is a consonant or transient part, the cross-correlation function calculation circuit 210 receives the weighted signal and the weighted impulse response, calculates their cross-correlation function up to a predetermined lag, and outputs it. For this calculation, see References 1 and 2, among others.
The excitation pulse calculation circuit (excitation signal calculation circuit) 220 obtains and outputs the amplitudes and positions of a predetermined number of multipulses within the segment. For the calculation of pulse amplitudes and positions, see References 1 and 2. As already stated, the number of pulses is L₁ when the segment is a consonant part and L₂ when it is a transient part, with (L₂ / segment length) ≥ (L₁ / segment length).
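The pulse calculation itself is deferred to References 1 and 2. As a hedged illustration of how a cross-correlation and an autocorrelation can drive such a search, the sketch below uses a widely known greedy scheme: place a pulse where the cross-correlation magnitude is largest, set its amplitude to that value divided by R_hh(0), subtract its contribution, and repeat. Whether the references use exactly this procedure is not confirmed by the text, and the function names are invented.

```python
def cross_correlation(xw, hw):
    """phi(m) = sum over n of x_w(n) * h_w(n - m), for m = 0..N-1."""
    n_samples = len(xw)
    return [
        sum(xw[n] * hw[n - m] for n in range(m, min(n_samples, m + len(hw))))
        for m in range(n_samples)
    ]


def multipulse_search(phi, r_hh, num_pulses):
    """Greedy search returning a list of (position, amplitude) pairs."""
    phi = list(phi)  # work on a copy
    pulses = []
    for _ in range(num_pulses):
        m = max(range(len(phi)), key=lambda i: abs(phi[i]))
        g = phi[m] / r_hh[0]
        pulses.append((m, g))
        # Remove the contribution of the new pulse from the correlation.
        for i in range(len(phi)):
            lag = abs(i - m)
            if lag < len(r_hh):
                phi[i] -= g * r_hh[lag]
    return pulses


# With a one-sample impulse response the pulses simply pick the largest samples.
phi = cross_correlation([0.0, 3.0, 0.0, -2.0, 0.5], [1.0])
pulses = multipulse_search(phi, [1.0, 0.0], 2)
```

The number of pulses passed in would be L₁ or L₂ depending on the segment type.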

Thus, in this embodiment, when a section obtained by the segmentation is a consonant or transient part, that is, not a vowel section, the order of the K parameters and the number of multipulses are determined, and the K parameters and the amplitudes and positions of the multipulses are obtained.

The encoding circuit 230 encodes the multipulse amplitudes g_i and positions m_i with a predetermined number of bits and outputs them. It also decodes them and outputs the result to the switch 235.

The switch 235 is connected to terminal 235a (left in the figure) when the segment is a consonant or transient part, and to terminal 235b (right in the figure) when it is a vowel steady part.

As described in the explanation of the principle, the period and gain calculation circuit 240 calculates the period T and gain g for each predetermined short interval according to equations (1) to (4) and outputs them. Here they are calculated every 5 to 10 msec. Alternatively, the average pitch period over the segment can be found, and T and g calculated once per average pitch period.

The encoding circuit 250 quantizes T and g with a predetermined number of quantization bits and outputs them to the multiplexer 260. It also decodes them and outputs the result to the switch 235.

The synthesis filter 281 receives the output of the switch 235. When the segment is a consonant or transient part, it drives the synthesis filter with the multipulses to compute the synthesized signal of the segment. When the segment is a vowel steady part, it computes the synthesized signal according to equation (5) for each short interval (5 to 10 msec) over which the period T and gain g were calculated. The receiving side can synthesize the speech signal in the same way.

$$\hat{x}(n) = g \cdot \hat{x}(n - T) \tag{5}$$

The synthesis filter 281 further obtains, for a predetermined number of samples, the influence signal that the synthesized signal exerts on the next segment, and outputs it to the subtracter 190. For calculating the influence signal, see References 1 and 2.
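Equation (5), read as x̂(n) = g · x̂(n − T), amounts to repeating the previously synthesized waveform one pitch period back, scaled by g. A minimal sketch of that recursion follows; the function name and the buffer handling are assumptions of the example, and when T is shorter than the interval the recursion reuses samples generated earlier in the same interval.

```python
def synthesize_vowel_interval(past, period, gain, length):
    """Generate `length` samples with x_hat(n) = gain * x_hat(n - period).
    `past` holds previously synthesized samples, most recent last, and
    must contain at least `period` samples."""
    buf = list(past)
    start = len(buf)
    for _ in range(length):
        buf.append(gain * buf[-period])
    return buf[start:]


print(synthesize_vowel_interval([1.0, 2.0, 3.0], 3, 0.5, 4))
# [0.5, 1.0, 1.5, 0.25]
```

The decoder can run the same function on its own past output, which is why only the segment length, T and g need to be transmitted for a vowel steady part.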

As described above, the speech signal is classified into several kinds of segment on the basis of its phonetic characteristics, and the amount of information needed for human phoneme perception is assigned to each segment and transmitted. In vowel steady parts, moreover, the amount of transmitted information is greatly reduced, and the speech is restored from past segments by exploiting waveform correlation. The bit rate can therefore be reduced to 4.8 kb/s, and synthesized speech of good quality is obtained even in portions where the speech characteristics are changing (voiced transient parts and transitions between vowels), which are important for the perception of phonemes and naturalness.

The embodiment described above is only one configuration of the present invention, and various modifications are conceivable.

For example, the embodiment classifies segments of the speech signal into vowel steady parts, consonant parts, transient parts, and silent parts, but the number of classes may be changed. Consonant parts may also be further divided into fricative parts and plosive parts.

In the embodiment, K parameters are coded as the spectral parameters and LPC analysis is used to obtain them. Other well-known parameters may be used instead, such as LSP, LPC cepstrum, cepstrum, improved cepstrum, generalized cepstrum, or mel cepstrum, and the analysis method best suited to each parameter may be used.

When a segment is a vowel steady-state part, the embodiment calculates the pitch period T and gain g from past segments, but they can also be calculated from the past and future segments on either side of the segment. For this method, see, for example, the paper by Ikeda and Itakura entitled "Coding of Speech Waveforms Using Block Decimation and Interpolation" (Acoustical Society of Japan, Technical Committee on Electroacoustics, Report EA88-15, 1988) (Reference 5). With this method, however, the amount of transmitted information is larger than with the method of the embodiment.

The pitch period T and gain g are first-order in the embodiment, but a higher order may also be used. With a third-order form, for example, equation (5) becomes:

$$\hat{x}(n) = g_1 \hat{x}(n - T - 1) + g_2 \hat{x}(n - T) + g_3 \hat{x}(n - T + 1) \qquad (6)$$

A higher order improves the sound quality of vowel steady-state parts but increases the amount of transmitted information.
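The third-order form can be sketched in the same way. The sketch assumes the three taps sit at lags T+1, T, and T-1, the usual arrangement for a three-tap pitch predictor, and that the delayed samples come from the already synthesized signal. The function name `extend_by_three_tap_prediction` and its arguments are hypothetical.

```python
def extend_by_three_tap_prediction(history, pitch_period, gains, num_samples):
    """Return num_samples new samples from a three-tap pitch predictor.

    x(n) = g1 * x(n - T - 1) + g2 * x(n - T) + g3 * x(n - T + 1)

    history -- past synthesized samples (assumed layout: oldest first)
    gains   -- the tuple (g1, g2, g3)
    """
    g1, g2, g3 = gains
    # T >= 2 keeps the newest tap (lag T - 1) strictly in the past.
    if pitch_period < 2 or len(history) < pitch_period + 1:
        raise ValueError("need pitch_period >= 2 and pitch_period + 1 past samples")
    signal = list(history)
    for _ in range(num_samples):
        n = len(signal)
        signal.append(
            g1 * signal[n - pitch_period - 1]
            + g2 * signal[n - pitch_period]
            + g3 * signal[n - pitch_period + 1]
        )
    return signal[len(history):]
```

Setting the gains to (0, g, 0) reduces this to the first-order case. The two extra gains are what raise the amount of transmitted information.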

In connection with all of the methods described above, instead of sending the pitch period as-is for every short interval, an average pitch period may be obtained over the segment and sent, together with the difference d between the average pitch period and the pitch period T obtained for each short interval. This reduces the amount of information needed to transmit the pitch period T.
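A minimal sketch of this average-plus-difference representation follows. It assumes integer pitch periods in samples and an average rounded to the nearest integer. The patent does not specify the quantization of either value, and the function names are hypothetical.

```python
def encode_pitch_periods(pitch_periods):
    """Split per-interval pitch periods T into a segment average and differences d."""
    # Rounding the average to an integer is an assumption made for this sketch.
    average = round(sum(pitch_periods) / len(pitch_periods))
    differences = [t - average for t in pitch_periods]
    return average, differences


def decode_pitch_periods(average, differences):
    """Rebuild the per-interval pitch periods from the average and the differences."""
    return [average + d for d in differences]
```

For the pitch periods 80, 82, 81, 81, the average is 81 and the differences are -1, 1, 0, 0. The differences span a much narrower range than the pitch periods themselves, which is why they would be expected to need fewer bits.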

In a vowel steady-state segment, the speech signal of one pitch interval may also be represented by multipulses and spectral parameters and transmitted at predetermined intervals, with the pitch period and gain described above transmitted for the remaining intervals. This further improves the sound quality of vowel steady-state parts but increases the amount of transmitted information.

As is well known in the field of digital signal processing, the autocorrelation function corresponds on the frequency axis to the power spectrum, and the cross-correlation function to the cross-power spectrum, so these functions can also be computed from the respective spectra. For these calculation methods, see, for example, the book by Oppenheim et al. entitled "Digital Signal Processing" (Prentice-Hall, 1975) (Reference 6).
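The correspondence between the autocorrelation function and the power spectrum can be checked with a short sketch. It uses a naive discrete Fourier transform from the standard library only, and zero-pads the signal so that the circular correlation implied by the transform matches the ordinary (linear) autocorrelation. All function names are hypothetical, and a practical coder would use an FFT instead.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform (inverse transform when inverse=True)."""
    size = len(values)
    sign = 1 if inverse else -1
    scale = 1.0 / size if inverse else 1.0
    return [
        scale * sum(values[k] * cmath.exp(sign * 2j * cmath.pi * m * k / size)
                    for k in range(size))
        for m in range(size)
    ]


def autocorrelation_direct(samples, max_lag):
    """Autocorrelation computed in the time domain for lags 0..max_lag."""
    return [
        sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        for lag in range(max_lag + 1)
    ]


def autocorrelation_via_power_spectrum(samples, max_lag):
    """Autocorrelation obtained as the inverse transform of the power spectrum."""
    # Padding with max_lag zeros prevents wrap-around for lags up to max_lag.
    padded = list(samples) + [0.0] * max_lag
    power_spectrum = [abs(c) ** 2 for c in dft(padded)]
    correlation = dft(power_spectrum, inverse=True)
    return [correlation[lag].real for lag in range(max_lag + 1)]
```

Both functions return the same values up to rounding error. The cross-correlation function can be obtained analogously from the cross-power spectrum, which this sketch does not cover.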

[Effect of the invention]

As described above, according to the present invention, a speech signal can be classified into several segments on the basis of feature parameters, and the necessary amount of information can be allocated to each segment and transmitted. The amount of transmitted information, and hence the bit rate, can also be reduced easily. The invention thus has the significant effect that synthesized speech of good quality is obtained even where the speech characteristics important to the perception of phonemes and naturalness are changing.

[Brief description of the drawings]

FIG. 1 is a block diagram showing one embodiment of the speech coding method according to the present invention.

100: input terminal
110: buffer memory
115: feature extraction circuit
120: segmentation circuit
140: K-parameter calculation circuit
150: K-parameter coding (K-parameter quantization circuit)
170: impulse response calculation circuit
180: autocorrelation function calculation circuit
195, 235: switches
195a, 195b, 235a, 235b: terminals
200: weighting circuit
210: cross-correlation function calculation circuit
220: sound source pulse calculation (sound source signal calculation) circuit
230, 250: coding circuits
240: period and gain calculation circuit
260: multiplexer
281: synthesis filter
300: output terminal

Continuation of the front page (58) Fields searched (Int. Cl.⁶, DB name): G10L 3/00 - 9/18, H03M 7/30, H04B 14/04, JICST (JOIS)

Claims

(57) [Claims]

1. A feature parameter representing a feature of a speech signal is obtained from an input discrete speech signal, and the speech signal is segmented in accordance with the feature parameter; when a section obtained by the segmentation is a section other than a vowel section, the order of the spectral parameter representing the spectral envelope of the speech signal of the section and the number of multipulses representing the sound source signal of the speech signal are determined, and the spectral parameter and the amplitudes and positions of the multipulses are obtained; and when a section obtained by the segmentation is a vowel section, a coefficient for predicting the signal of the vowel section is obtained from the signal of at least one section adjacent to the vowel section.