JPH0766733A

JPH0766733A - Highly efficient sound encoding device

Info

Publication number: JPH0766733A
Application number: JP23239193A
Authority: JP
Inventors: Norihiko Fuchigami; 徳彦渕上; Shoji Ueno; 昭治植野
Original assignee: Victor Company of Japan Ltd
Current assignee: Victor Company of Japan Ltd
Priority date: 1993-08-25
Filing date: 1993-08-25
Publication date: 1995-03-10
Anticipated expiration: 2014-07-12
Also published as: JP2917766B2

Abstract

PURPOSE: To provide a high-efficiency sound encoding device that reduces missed and false detections of transients and thereby suppresses pre-echo. CONSTITUTION: A transient detecting part 2 compares the segment power P[i] with, for instance, P[i-1] and P[i-2]. This prevents a transient located at the center of a segment from going undetected. By also comparing P[i] with P[i-3] and P[i-4], the detecting part prevents a long-period steady waveform from being misrecognized as a transient, even though the segment power of such a waveform sometimes drops to an extremely small value. Based on the detection flag (trans) from the detecting part 2, a windowing/orthogonal transformation part 3 orthogonally transforms the audio signal by a DCT, FFT or the like. It uses either the reference frame length T or the shortened frame length T/4, and divides the signal into plural sub-bands.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a high-efficiency audio coding apparatus that encodes an audio signal frame by frame, each frame having a finite length.

[0002]

[Description of the Related Art] High-efficiency audio coding, as used in the Mini Disc (MD), the Digital Compact Cassette (DCC), karaoke CDs and the like, is also called music compression because it compresses the amount of data in an audio signal. In such coding methods, a digital filter or an orthogonal transform divides the audio signal into a plurality of subbands. Psychoacoustic analysis in the frequency domain then determines the number of quantization bits for each subband. In the following description, the term "encode" is sometimes used to mean compression as well as coding.

[0003] FIGS. 6(a) to 6(d) show an example in which the frequency band is divided by an orthogonal transform. FIG. 6(a) shows 512 samples cut out of the 16-bit PCM audio signal to be encoded. The description assumes that the total amount of information enclosed by the rectangle in the figure is 16 bits × 512 = 8192 bits. The number of samples cut out and the PCM word length are, of course, not limited to these values.

[0004] FIG. 6(b) shows the signal of FIG. 6(a) after frequency conversion by an orthogonal transform such as the DCT (discrete cosine transform) or FFT (fast Fourier transform). The curve in the figure is the envelope of the frequency spectrum. If the orthogonal transform preserves the amount of information, this total amount of information can also be represented by the rectangular area in the figure. According to a psychoacoustic model, when the signal shown in FIG. 6(b) is present, it masks other signals below a certain level so that they become inaudible. That level can be defined as a curve, and this phenomenon is generally called the masking effect.

[0005] The masking curve derived from FIG. 6(b) is shown in FIG. 6(c). When the signal of FIG. 6(b) is requantized, the resulting quantization noise is inaudible to the human ear as long as its level does not exceed the level defined by the masking curve. Therefore, as shown in FIG. 6(d), the spectrum is divided into subbands of several data points each. Let S be the maximum signal level in each subband and N the allowable noise level taken from FIG. 6(c). If each subband is requantized with a number of bits that satisfies this S/N, the quantization noise is masked and cannot be heard.
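The S/N rule above can be made concrete with the usual rule of thumb that each bit of uniform quantization gains about 6.02 dB of S/N. The patent does not state how the bit count is derived, so the sketch below is only an illustration under that assumption. The function name bits_for_snr is hypothetical, and S and N are treated as amplitude levels.

```python
import math


def bits_for_snr(signal_level, allowed_noise_level, db_per_bit=6.02):
    """Smallest bit count whose quantization S/N reaches S/N (hypothetical helper).

    signal_level: S, the maximum amplitude in the subband
    allowed_noise_level: N, the noise amplitude allowed by the masking curve
    db_per_bit: assumed S/N gain per bit of uniform quantization
    """
    if allowed_noise_level >= signal_level:
        return 0  # the allowed noise already covers the signal: no bits needed
    required_db = 20.0 * math.log10(signal_level / allowed_noise_level)
    return math.ceil(required_db / db_per_bit)


print(bits_for_snr(1.0, 0.01))  # 40 dB required -> 7 bits
print(bits_for_snr(1.0, 0.1))   # 20 dB required -> 4 bits
```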

[0006] The rectangles in FIG. 6(d) show the amount of information required for compression and decompression. The irregular rectangle in the center of the figure represents the main information, and the thin rectangle at the bottom represents the auxiliary information. The auxiliary information includes the maximum value (scale value) and the number of quantization bits of each subband, both of which are needed for decoding. The total amount of information in FIG. 6(d) is therefore the sum of the main and auxiliary information. This total is only a fraction of the total in FIG. 6(a) or 6(b). Accordingly, as shown in FIG. 7, the above processing (steps S1 to S6) can be repeated for every section (a 512-sample section in this example). This encodes the signal with almost no degradation in sound quality.

[0007] FIGS. 8(a) and 8(b) show a general high-efficiency audio encoding apparatus and decoding apparatus, respectively. In the encoding apparatus of FIG. 8(a), a frame buffering unit 1 holds a 16-bit PCM audio signal, for example. A windowing/orthogonal transform unit 3 then cuts out 512 samples. It orthogonally transforms the audio signal of those samples by a DCT, FFT or the like, dividing it into a plurality of subbands.

[0008] A psychoacoustic analysis unit 4 then determines the number of quantization bits for each subband. Using that number of bits, a quantization/encoding unit 5 quantizes and encodes the audio signal of each subband produced by the orthogonal transform unit 3. A multiplexing unit 6 multiplexes the compressed data from the quantization/encoding unit 5 with the numbers of quantization bits determined by the psychoacoustic analysis unit 4, and outputs the result.

[0009] In the decoding apparatus of FIG. 8(b), a demultiplexing unit 7 separates the audio code from the numbers of quantization bits. A decoding/dequantization unit 8 decodes the audio code and dequantizes it using the numbers of quantization bits. An inverse orthogonal transform/windowing unit 9 and a frame buffering unit 10 then reproduce it as a 16-bit PCM audio signal.

[0010] Next, the relationship between the properties of actual audio signals and the results of encoding and decoding them is described. FIG. 9 shows a signal that is nearly stationary over the frame section. FIG. 9(a) is the original waveform, and FIG. 9(b) is the waveform after encoding and decoding (hereinafter simply the "processed waveform"). When the processing follows psychoacoustic principles as described above, the two signals are almost indistinguishable to the ear.

[0011] FIGS. 10(a) and 10(b), on the other hand, show the original and processed waveforms of a non-stationary signal whose amplitude (power) rises sharply within the frame section. As FIG. 10(b) shows, a noise component N far exceeding the original waveform appears before the power rises. Such noise N is generally called pre-echo. Noise that precedes the rise by more than about 1 to 3 msec is perceived as a noise echo accompanying the onset of the signal.

[0012] This noise arises because the section over which the signal power changes is shorter than the frame used as the analysis section. The quantization noise produced by encoding a frame has a frequency-amplitude characteristic like that shown in FIG. 6(c), but its frequency-phase characteristic is random. The time-domain quantization noise in the processed waveform is therefore distributed uniformly (stationarily) over the frame. As a result, as shown in FIG. 10(b), large noise N driven by the large-amplitude region also appears where the signal amplitude was originally small. For the same reason, noise also appears after a sharp fall in signal power. In that case, however, it is hard to detect. The preceding loud signal produces a stimulus in the auditory system whose aftereffect persists for a relatively long time (10 to 20 msec).

[0013] The most effective countermeasure against such pre-echo is to shorten the frame length, that is, to improve the time resolution of the processing. For example, when subband division uses an orthogonal transform, the frame length for the transform can be shortened from the standard length T to T/4 or the like, and four transforms performed, as shown in FIG. 11. This quadruples the time resolution. The noise caused by the non-stationary waveform is then confined to a shorter section, and the pre-echo is attenuated.
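The sketch below expresses T and T/4 in time units. It assumes a 44.1 kHz sampling rate and the 512-sample frame used elsewhere in this description. Neither the rate nor the helper name frame_duration_ms comes from the patent.

```python
def frame_duration_ms(num_samples, sample_rate_hz):
    """Duration of a block of samples in milliseconds (hypothetical helper)."""
    return 1000.0 * num_samples / sample_rate_hz


SAMPLE_RATE_HZ = 44_100  # assumed CD sampling rate
standard_ms = frame_duration_ms(512, SAMPLE_RATE_HZ)    # standard frame length T
short_ms = frame_duration_ms(512 // 4, SAMPLE_RATE_HZ)  # shortened frame length T/4
print(f"T = {standard_ms:.2f} ms, T/4 = {short_ms:.2f} ms")  # T = 11.61 ms, T/4 = 2.90 ms
```

Under these assumptions, and ignoring window overlap, the noise of a T/4 block can reach at most about 2.9 ms ahead of an onset. That is close to the 1 to 3 msec pre-echo span described earlier, instead of about 11.6 ms with T.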

[0014] FIG. 12 shows the original waveform of FIG. 10(a) processed with a 1/4 frame length. The pre-echo is clearly weaker than in the processed waveform of FIG. 10(b). When subband division does not use an orthogonal transform, the same effect is obtained by quantizing the time-domain samples (time waveform) of each subband over shorter sections.

[0015] The key point of such pre-echo countermeasures is how accurately the non-stationarity of the waveform (that is, a sharp rise in power) is detected. This detection is hereinafter called "transient detection". Shortening the frame length generally makes the per-frame auxiliary information described with FIG. 6(d) larger relative to the main information. The amount of information allocated to the main information therefore decreases. In addition, when an orthogonal transform is used, shortening the frame length degrades the frequency resolution and hence the accuracy with which psychoacoustics can be applied. For a stationary waveform, therefore, the longer the frame the better, and sound quality generally deteriorates when transient detection malfunctions.

[0016] A conventional transient detection method is described with reference to FIGS. 13 and 14. A frame length of about 10 to 20 msec is generally chosen. For transient detection, the samples of a frame are divided into, for example, m short segments of about 1 to 3 msec each. The total power P[i] of each segment i is then calculated as follows (steps S11 and S12).

[0017]

[Equation 1]
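Equation 1 itself is not reproduced in this text. This sketch assumes a common definition of a segment's total power, the sum of the squared samples in the segment; the patent does not confirm this form. The hypothetical helper below splits one frame into m equal segments and returns P[0] to P[m-1].

```python
def segment_powers(frame, num_segments):
    """Return P[i] = sum of x**2 over segment i (assumed form of Equation 1)."""
    if len(frame) % num_segments:
        raise ValueError("frame length must be a multiple of num_segments")
    seg_len = len(frame) // num_segments
    return [sum(x * x for x in frame[i * seg_len:(i + 1) * seg_len])
            for i in range(num_segments)]


print(segment_powers([1, 1, 2, 2, 0, 0, 3, -3], 4))  # [2, 8, 0, 18]
```

Consider a 512-sample frame at an assumed 44.1 kHz, which lasts about 11.6 ms. With m = 8 it gives 64-sample segments of about 1.45 ms, inside the 1 to 3 msec range given above.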

[0018] As shown in FIG. 14, a transient is judged by comparing the power ratio between a segment i and the adjacent segment (i-1) with a preset criterion value. For example, the following is evaluated for i = 0 ... m-1:

[0019]

P[i] / P[i-1] > At   (Condition 1)

where At is the criterion value and P[-1] is P[m-1] of the previous frame.

[0020] If this condition holds even once, a transient is judged to exist in the frame and the detection flag trans is set (steps S13 to S17). The criterion value At is generally chosen to be about 15 to 20 dB, as shown in FIG. 15(a). In high-efficiency coding, the average S/N ratio after encoding and decoding is often about 20 to 30 dB. It is therefore reasonable to set the criterion at the level where the pre-echo amplitude, as shown in FIG. 10(a), can no longer be ignored relative to the amplitude of the original waveform in that region. FIG. 10(b) shows the case where the pre-echo amplitude can be ignored. How the segment power ratio behaves in stationary waveforms is also taken into account.
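Condition 1 and the dB criterion can be combined into a short check. Because P[i] is a power, a criterion of At dB corresponds to a linear ratio of 10^(At/10), for example 100 for 20 dB. The following are assumptions, not part of the described method: the function names, the default of 20 dB, and the small eps guard against division by zero in silent segments.

```python
def db_to_power_ratio(db):
    """Convert a criterion in dB to a linear power ratio."""
    return 10.0 ** (db / 10.0)


def conventional_transient(powers, prev_last_power, at_db=20.0, eps=1e-12):
    """Condition 1: flag the frame if P[i] / P[i-1] > At for any i = 0 .. m-1.

    powers: P[0] .. P[m-1] of the current frame
    prev_last_power: P[m-1] of the previous frame, used as P[-1]
    """
    at = db_to_power_ratio(at_db)
    p = [prev_last_power] + list(powers)  # p[k + 1] == P[k]
    return any(p[k] / max(p[k - 1], eps) > at for k in range(1, len(p)))


print(conventional_transient([1.0, 1.0, 500.0, 500.0], 1.0))  # True: 500 > 100
print(conventional_transient([1.0, 2.0, 3.0, 3.0], 1.0))      # False
```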

[0021]

[Problems to be Solved by the Invention] However, the conventional transient detection method described above has the following two problems. (1) When a transient falls exactly at the center of a segment, as shown in FIG. 16, it can go undetected. In this situation the segment power rises to its maximum over exactly two segments, so the change per segment is halved. The power ratio P[i]/P[i-1] between adjacent segments may then fail to exceed the criterion value At. In this case the detection flag trans does not match the actual waveform, as shown in FIGS. 17(a) and 17(b).

[0022] (2) In a long-period stationary waveform such as that shown in FIG. 18, the segment power sometimes becomes extremely small locally. The power ratio P[i]/P[i-1] then exceeds the criterion value At, and the waveform may be misrecognized as a transient, as shown in FIGS. 19(a) and 19(b). The conventional transient detection method therefore suffers from missed or false detections, depending on the nature of the waveform.
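Both failure modes can be reproduced numerically. The segment powers below are invented for illustration. In the first list, a 21 dB step in per-sample power falls exactly at the middle of segment 3, so that segment holds the average of the levels before and after. In the second list, a steady signal has one segment whose power dips locally, as in FIG. 18. The sketch prints the largest ratio in dB at a given lag. Condition 1 compares the lag-1 ratio with At, assumed here to be 20 dB.

```python
import math


def ratios_db(powers, lag):
    """10*log10(P[i] / P[i-lag]) for every i >= lag."""
    return [10.0 * math.log10(powers[i] / powers[i - lag])
            for i in range(lag, len(powers))]


r = 10.0 ** (21.0 / 10.0)  # 21 dB step in per-sample power (invented value)
mid_segment_step = [1.0, 1.0, 1.0, (1.0 + r) / 2.0, r, r, r, r]
steady_with_dip = [50.0, 50.0, 50.0, 0.2, 50.0, 50.0, 50.0, 50.0]

print(round(max(ratios_db(mid_segment_step, 1)), 2))  # 18.02: below 20 dB, missed
print(round(max(ratios_db(mid_segment_step, 2)), 2))  # 21.0: P[i]/P[i-2] catches it
print(round(max(ratios_db(steady_with_dip, 1)), 2))   # 23.98: above 20 dB, false alarm
```

A step much larger than the criterion would still be caught. The miss happens for steps only slightly above At, because splitting the step across two segments costs the first ratio about 3 dB.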

[0023] In view of these conventional problems, an object of the present invention is to provide a high-efficiency audio coding apparatus that reduces missed and false detections of transients and thereby suppresses pre-echo.

[0024]

[Means for Solving the Problems] To achieve this object, the present invention computes two kinds of power ratio when detecting a transient. One is between the segment of interest and the adjacent segment. The other is between the segment of interest and segments separated from it by one or more segments. That is, the present invention provides a high-efficiency audio coding apparatus that encodes an audio signal by processing it frame by frame, each frame having a finite length. The apparatus is characterized by comprising control means that divides the audio signal into sections whose segment length is sufficiently shorter than the standard frame length, and calculates the total power of each segment. When the power ratio with the adjacent segment and the power ratio with a segment separated by one or more segments are each equal to or greater than a predetermined value, the control means causes the audio signal to be processed with a frame length shorter than the standard frame length.

[0025]

[Operation] In the present invention, a transient is detected when two power ratios are each equal to or greater than the predetermined value: the ratio between the segment and the adjacent segment, and the ratio between the segment and segments separated from it by one or more segments. Consequently, when a transient is located at the center of a segment, the power ratio with a segment separated by one or more segments reaches the criterion value, so missed detection is prevented. Also, when the segment power of a long-period stationary waveform becomes extremely small locally, the power ratio with segments separated by one or more segments stays at or below the criterion value. Misrecognition as a transient is therefore prevented.

[0026]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram of one embodiment of a high-efficiency audio coding apparatus according to the present invention. FIG. 2 illustrates the segments compared by the transient detection unit of FIG. 1. FIG. 3 is a flowchart of the transient detection processing of that unit. FIG. 4 shows a non-stationary waveform and its transient detection flag, and FIG. 5 shows a stationary waveform and its transient detection flag.

[0027] In FIG. 1, the frame buffering unit 1 holds a 16-bit PCM audio signal, for example. In this embodiment, the transient detection unit 2 compares segment powers P[i] not only between adjacent segments but also between segments separated by one or more segments. For example, as shown by the solid lines in FIG. 2, it compares P[i] with P[i-1] and P[i] with P[i-2]. This prevents a missed detection when the transient is located exactly at the center of a segment, as in FIG. 16:

[0028]

(P[i] / P[i-1] > At) or (P[i] / P[i-2] > At)   (Condition 2)

[0029] In addition, in this embodiment, P[i] is compared with P[i-3] and with P[i-4], as shown by the broken lines in FIG. 2. This prevents misrecognition of a transient in a long-period stationary waveform whose segment power sometimes becomes extremely small locally, as in FIG. 18:

[0030]

(P[i] / P[i-3] > At2) and (P[i] / P[i-4] > At2)   (Condition 3)

where At2 is a criterion value that may be equal to or different from At.
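Conditions 2 and 3 can be combined as in the flowchart of FIG. 3, where a segment sets trans only if both Condition 2 and Condition 3 hold. The sketch below applies that rule to a list of segment powers. P[-4] to P[-1] are supplied from the previous frame. The function names, the 20 dB defaults for At and At2, and the eps guard are illustrative assumptions. The segment powers in the usage lines are invented: a 21 dB step falling mid-segment, and a steady level with one local dip.

```python
def db_to_power_ratio(db):
    """Convert a criterion in dB to a linear power ratio."""
    return 10.0 ** (db / 10.0)


def _ratio(p, k, lag, eps):
    """P[i] / P[i-lag], guarded against silent (zero-power) segments."""
    return p[k] / max(p[k - lag], eps)


def proposed_transient(powers, history, at_db=20.0, at2_db=20.0, eps=1e-12):
    """Return True if some segment satisfies Condition 2 and Condition 3.

    powers: P[0] .. P[m-1] of the current frame
    history: [P[-4], P[-3], P[-2], P[-1]] from the previous frame
    """
    at, at2 = db_to_power_ratio(at_db), db_to_power_ratio(at2_db)
    p = list(history) + list(powers)  # p[k + 4] == P[k]
    for k in range(4, len(p)):
        cond2 = _ratio(p, k, 1, eps) > at or _ratio(p, k, 2, eps) > at
        cond3 = _ratio(p, k, 3, eps) > at2 and _ratio(p, k, 4, eps) > at2
        if cond2 and cond3:
            return True
    return False


r = 10.0 ** (21.0 / 10.0)  # invented 21 dB step falling mid-segment
print(proposed_transient([1.0, 1.0, 1.0, (1.0 + r) / 2.0, r, r, r, r], [1.0] * 4))       # True
print(proposed_transient([50.0, 50.0, 50.0, 0.2, 50.0, 50.0, 50.0, 50.0], [50.0] * 4))  # False
```

In the second case, Condition 2 still fires just after the dip. Condition 3, however, compares P[i] with the segments three and four back, which are at the normal steady level, so the frame is not flagged.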

[0031] The windowing/orthogonal transform unit 3 cuts out, for example, 512 samples of the PCM audio signal from the frame buffering unit 1. According to the detection flag trans from the transient detection unit 2, it then orthogonally transforms the audio signal of these samples by a DCT, FFT or the like. The transform uses either the standard frame length T or the shortened frame length T/4 shown in FIG. 11 and divides the signal into a plurality of subbands.
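The switching between T and T/4 in the windowing/orthogonal transform unit 3 can be sketched as follows. To stay self-contained, the sketch computes a plain unnormalized DCT-II directly from its definition. It omits the analysis window and block overlap that a practical coder would use. These simplifications and the function names are assumptions rather than details given in the patent.

```python
import math


def dct_ii(block):
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    size = len(block)
    return [sum(x * math.cos(math.pi / size * (n + 0.5) * k) for n, x in enumerate(block))
            for k in range(size)]


def transform_frame(frame, trans):
    """One transform of length T when trans is clear, four of length T/4 when set."""
    if not trans:
        return [dct_ii(frame)]
    if len(frame) % 4:
        raise ValueError("frame length must be divisible by 4")
    quarter = len(frame) // 4
    return [dct_ii(frame[j * quarter:(j + 1) * quarter]) for j in range(4)]


frame = [0.0] * 384 + [1.0] * 128  # hypothetical onset in the last quarter
print([len(b) for b in transform_frame(frame, trans=False)])  # [512]
print([len(b) for b in transform_frame(frame, trans=True)])   # [128, 128, 128, 128]
```

With trans set, the first three blocks contain only zero coefficients. Assuming the quantizer reproduces zeros exactly, the quantization noise of the onset stays in the last block instead of spreading over the whole 512-sample frame.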

[0032] The psychoacoustic analysis unit 4 determines the number of quantization bits for each subband. Using that number of bits, the quantization/encoding unit 5 quantizes and encodes the audio signal of each subband produced by the orthogonal transform unit 3. The multiplexing unit 6 multiplexes and outputs three items: the compressed data from the quantization/encoding unit 5, the flag trans detected by the transient detection unit 2, and the numbers of quantization bits determined by the psychoacoustic analysis unit 4. A decoder (not shown) separates and decodes these data.

[0033] The transient detection processing of the transient detection unit 2 is described with reference to FIG. 3. With a frame length of about 10 to 20 msec, the samples of the frame are divided into, for example, m short segments of about 1 to 3 msec each. The transient detection flag trans and the segment index i are reset, and the total powers P[-1] to P[-4] of the four preceding segments are loaded (step S21). The total power P[i] of each segment i is then calculated according to Equation 1 above (step S22).

[0034] Next, it is determined whether Condition 2 is satisfied (step S23). If it is not, the processing passes through steps S24 and S25 and repeats from step S22. If Condition 2 is satisfied in step S23, it is determined whether Condition 3 is satisfied (step S26). If Condition 3 is not satisfied, the processing passes through steps S24 and S25 and repeats from step S22. If it is satisfied, the detection flag trans is set (step S27) and the processing proceeds to step S28. If step S27 is not executed for any i, the processing proceeds to step S28 with trans left unset.
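Steps S21 to S28 can be sketched as a small stateful class that carries the last four segment powers over to the next frame as P[-4] to P[-1]. The class name is hypothetical. The sketch also makes several assumptions not stated in the patent:

- P[i] is the sum of squared samples.
- At and At2 default to 20 dB.
- An eps guard avoids division by zero.
- The history starts at zero, so a frame following digital silence is flagged.
- All segment powers are computed before scanning, so the history stays complete even when the scan stops at step S27.

```python
class TransientDetector:
    """Sketch of steps S21-S28 of FIG. 3 (hypothetical class name)."""

    def __init__(self, num_segments=8, at_db=20.0, at2_db=20.0, eps=1e-12):
        self.m = num_segments
        self.at = 10.0 ** (at_db / 10.0)
        self.at2 = 10.0 ** (at2_db / 10.0)
        self.eps = eps
        self.history = [0.0] * 4  # P[-4] .. P[-1]; zero means "preceded by silence"

    def process(self, frame):
        """Return the detection flag trans for one frame."""
        seg_len = len(frame) // self.m  # leftover samples are ignored in this sketch
        # S22, done for all segments up front: P[i] = sum of squared samples.
        powers = [sum(x * x for x in frame[i * seg_len:(i + 1) * seg_len])
                  for i in range(self.m)]
        # S21: prepend P[-4] .. P[-1], then keep the last four for the next frame.
        p = self.history + powers
        self.history = p[-4:]
        for k in range(4, len(p)):  # S24/S25: step through i = 0 .. m-1
            ratio = {lag: p[k] / max(p[k - lag], self.eps) for lag in (1, 2, 3, 4)}
            cond2 = ratio[1] > self.at or ratio[2] > self.at     # S23
            cond3 = ratio[3] > self.at2 and ratio[4] > self.at2  # S26
            if cond2 and cond3:
                return True  # S27: set trans, go to S28
        return False  # S28 reached with trans unset


det = TransientDetector()
quiet, loud = [0.01] * 512, [1.0] * 512
print([det.process(f) for f in (quiet, quiet, quiet[:256] + loud[:256], loud)])
# [True, False, True, False]
```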

[0035] According to the above embodiment, P[i] is compared with P[i-1] and with P[i-2], as shown by the solid lines in FIG. 2. This prevents missed detections for non-stationary waveforms, as shown in FIG. 4. P[i] is also compared with P[i-3] and with P[i-4], as shown by the broken lines in FIG. 2. This prevents misrecognition of transients in long-period stationary waveforms, as shown in FIG. 5. When subband division does not use an orthogonal transform, quantizing the time-domain samples (time waveform) of each subband over shorter sections gives the same effect as using the shortened frame length.

[0036]

[Effects of the Invention] As described above, according to the present invention, a transient is detected when two power ratios are each equal to or greater than a predetermined value: the ratio between the segment and the adjacent segment, and the ratio between the segment and segments separated from it by one or more segments. When a transient is located at the center of a segment, the power ratio with a segment separated by one or more segments therefore reaches the criterion value, so missed detection is prevented. Also, when the segment power of a long-period stationary waveform becomes extremely small locally, the power ratio with segments separated by one or more segments stays at or below the criterion value. Misrecognition as a transient is therefore prevented.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of one embodiment of a high-efficiency audio coding apparatus according to the present invention.

FIG. 2 illustrates the segments compared by the transient detection unit of FIG. 1.

FIG. 3 is a flowchart of the transient detection processing of the transient detection unit of FIG. 1.

FIG. 4 shows a non-stationary waveform and its transient detection flag.

FIG. 5 shows a stationary waveform and its transient detection flag.

FIG. 6 schematically illustrates a high-efficiency audio coding method.

FIG. 7 is a flowchart of the high-efficiency audio coding processing of FIG. 6.

FIG. 8 is a block diagram of a general high-efficiency audio encoding and decoding apparatus.

FIG. 9 shows an original waveform that is nearly stationary over the frame section and the same waveform after encoding and decoding.

FIG. 10 shows the original waveform of a non-stationary signal whose amplitude (power) rises sharply within the frame section, and the same waveform after encoding and decoding.

FIG. 11 illustrates the standard frame length and the shortened frame length.

FIG. 12 shows the original waveform of FIG. 10(a) processed with a 1/4 frame length.

FIG. 13 is a flowchart of conventional transient detection processing.

FIG. 14 illustrates the segments compared in the conventional method.

FIG. 15 illustrates the criterion values for conventional transient detection.

FIG. 16 illustrates a case where a transient is located at the center of a segment.

FIG. 17 shows the original waveform and transient detection flag for the case of FIG. 16.

FIG. 18 shows a long-period stationary waveform and its segment powers.

FIG. 19 shows the original waveform and transient detection flag for the case of FIG. 18.

[Explanation of Symbols]

1 Frame buffering unit
2 Transient detection unit (control means)
3 Windowing/orthogonal transform unit
4 Psychoacoustic analysis unit
5 Quantization/encoding unit
6 Multiplexing unit

Claims

[Claims]

1. A high-efficiency audio coding apparatus that encodes an audio signal by processing it frame by frame, each frame having a finite length, characterized by comprising control means that: divides the audio signal into segments whose segment length is sufficiently shorter than a standard frame length; calculates the total power of each segment; and, when the power ratio between the segment and an adjacent segment and the power ratio between the segment and a segment separated from it by one or more segments are each equal to or greater than a predetermined value, causes the audio signal to be processed with a frame length shorter than the standard frame length.