KR100827458B1

KR100827458B1 - Method for audio signal coding

Info

Publication number: KR100827458B1
Application number: KR1020060068733A
Authority: KR
Inventors: 이창준; 박영철; 윤대희
Original assignee: 엘지전자 주식회사
Priority date: 2006-07-21
Filing date: 2006-07-21
Publication date: 2008-05-06
Also published as: KR20080008897A

Abstract

The present invention relates to an audio encoding method. The method performs quantization and compression encoding based on a time/frequency transform while applying a psychoacoustic model during compression encoding of an audio signal, and comprises: obtaining a masking threshold for each band of the input audio signal; performing pre-echo control; obtaining a perceptual entropy (PE) value for block switching (window switching); and obtaining a PE value for bit allocation in each block.

Audio, coding, psychoacoustic model

Description

Audio coding method {METHOD FOR AUDIO SIGNAL CODING}

FIG. 1 is a diagram showing the structure of an audio encoder to which the present invention is applied.

FIG. 2 is a flowchart showing the pre-echo control and masking threshold calculation process in an audio encoding method according to an embodiment of the present invention.

<Explanation of reference numerals for the main parts of the drawings>

210: MDCT unit 220: FFT unit

230: psychoacoustic model unit 240: window conversion unit

250: quantization unit 260: encoding unit

270: bitstream formatting unit

The present invention relates to an audio encoding method.

The MPEG audio coding algorithms aim to compress an audio signal without loss of subjective sound quality, in order to reduce the large channel capacity needed to store and transmit audio. To this end they use perceptual coding, which is based on the characteristics of human hearing. Perceptual coding exploits two effects: the absolute threshold of hearing, which is the minimum level the ear can detect, and masking, in which one sound makes another sound hard to hear. The absolute threshold of hearing varies with frequency (pitch), and masking depends on the frequencies of the masking sound (masker) and of the sound that is masked and becomes inaudible (maskee). The frequency range over which masking occurs is called a critical band, and the signal-to-noise ratio (S/N) that can be perceived within a critical band is very low. MPEG audio coding therefore performs compression coding based on perceptual coding so that the quantization noise of the digital audio signal is placed within the critical bands and is not perceived.

MPEG audio thus compresses an audio signal using a lossy compression method together with a statistical lossless compression method. The lossy part relies on masking, one of the phenomena described by psychoacoustic theory, so that the discarded information is not perceived by the human ear. During encoding, the maximum allowable amount of noise at each frequency is therefore obtained through a complex procedure called the psychoacoustic model. Because of this, the psychoacoustic model plays a very important role in obtaining a high-quality audio output signal.

An object of the present invention is to provide an audio encoding method that raises the efficiency of the encoding process in an audio encoder and allows bits to be allocated effectively for encoding.

To achieve this object, an audio encoding method according to the present invention performs quantization and compression encoding based on a time/frequency transform while applying a psychoacoustic model during compression encoding of an audio signal, and comprises: obtaining a perceptual entropy (PE) value for block switching (window switching); determining whether to switch blocks according to that PE value; and obtaining a PE value for bit allocation in each block according to the result of the block-switching decision.

Also to achieve this object, an audio encoding method according to the present invention performs quantization and compression encoding based on a time/frequency transform while applying a psychoacoustic model during compression encoding of an audio signal, and comprises: obtaining a masking threshold for each band of the input audio signal; performing pre-echo control; obtaining a perceptual entropy (PE) value for block switching (window switching); and obtaining a PE value for bit allocation in each block.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings.

The present invention improves the efficiency of the bit allocation process of an audio encoder that uses MPEG psychoacoustic model II. It does so by performing the pre-echo control step, which is the last step in obtaining the masking threshold in the psychoacoustic model, separately for long blocks and short blocks.

FIG. 1 shows an embodiment of the structure of an audio encoder to which the audio encoding method of the present invention is applied. The audio encoder shown in FIG. 1 is an MPEG-based audio encoder. It includes an MDCT (Modified Discrete Cosine Transform) unit (110), an FFT (Fast Fourier Transform) unit (120) for the input audio signal, a psychoacoustic model unit (130), a window conversion unit (140), a quantization unit (150), an encoding unit (160), and a bitstream formatting unit (170). The quantization unit (150) includes a quantization and bit allocation unit (151) and a Huffman coding unit (152). The encoding unit (160) includes a TNS (Temporal Noise Shaping) unit (161), an Intensity/Coupling unit (162), a Prediction unit (163), and an M/S (Middle/Side) unit (164).

As shown in FIG. 1, the input audio signal is converted into a frequency-domain signal by the MDCT analysis filter (110) and is then encoded by various methods. At the same time, the psychoacoustic model unit (130) analyzes the perceptual characteristics of the input signal and determines the maximum allowable quantization noise at each frequency, which the bit allocation process needs. The bit allocation process then optimizes quantization so that, at the given bit rate, the quantization noise stays below the maximum allowable noise obtained from the psychoacoustic model as far as possible.

Because the psychoacoustic model analyzes the perceptual characteristics of the input signal in the frequency domain, it requires a frequency transform of the input signal. As FIG. 1 shows, the encoding path already performs a frequency transform through the MDCT analysis filter (110). However, most experimental results in psychoacoustics were obtained in the DFT (Discrete Fourier Transform) domain, so the MPEG standard recommends a separate FFT (Fast Fourier Transform) for the psychoacoustic model.

To prevent pre-echo, MPEG psychoacoustic model II recommends encoding with short windows and, in addition, a pre-echo control method: the per-band masking threshold of the current frame is compared with the thresholds of the two preceding frames, and the smaller value is applied as the masking threshold of the current frame.
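The pre-echo control rule described above can be sketched as a per-band minimum. This is a minimal illustration only: the function name `pre_echo_control` and the plain-list representation of thresholds are assumptions, and the weighting factors that the later equations apply to the previous frames are deliberately left out here.

```python
def pre_echo_control(nb, nb_prev, nb_prev2):
    """Return, per band, the smallest of the current-frame threshold and
    the thresholds of the two preceding frames (hypothetical helper)."""
    return [min(cur, p1, p2) for cur, p1, p2 in zip(nb, nb_prev, nb_prev2)]


# Example: an attack raises the current thresholds in band 0,
# so the smaller threshold of an earlier frame is kept instead.
thr = pre_echo_control([8.0, 3.0], [2.0, 5.0], [4.0, 9.0])
```

Keeping the smaller threshold forces the quantizer to spend more bits on the frame that contains the attack, which is what suppresses the audible pre-echo.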

The MPEG standard has the block-switching decision made from the perceptual entropy (PE) value, one of the outputs of the psychoacoustic model. For this purpose the FFT uses two window shapes, a long window and a short window. The MDCT analysis filter (110), in contrast, uses a long start window and a long stop window in addition to the long and short windows.

When the psychoacoustic model determines that the next block is a short block, so that block switching from long to short must take place, the MDCT analysis filter applies the long start window to the current block. When switching from short blocks back to long blocks, it applies the long stop window.
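The window choice described above can be sketched as a small selection function. The names `Window`, `select_mdct_window`, and `window_sequence` are hypothetical, and the case of a single long block squeezed between two short blocks is not covered by the text; the sketch treats it as a short block purely as an assumption.

```python
from enum import Enum


class Window(Enum):
    LONG = "long"
    LONG_START = "long_start"
    SHORT = "short"
    LONG_STOP = "long_stop"


def select_mdct_window(prev_is_short, current_is_short, next_is_short):
    """Pick the MDCT window for the current block from the block decisions."""
    if current_is_short:
        return Window.SHORT
    if prev_is_short and next_is_short:
        # Not described in the text; coded as short here by assumption.
        return Window.SHORT
    if next_is_short:
        return Window.LONG_START   # long -> short transition is coming
    if prev_is_short:
        return Window.LONG_STOP    # short -> long transition just happened
    return Window.LONG


def window_sequence(is_short):
    """Map a list of per-block short/long decisions to MDCT windows."""
    out = []
    for i, cur in enumerate(is_short):
        prev = is_short[i - 1] if i > 0 else False
        nxt = is_short[i + 1] if i + 1 < len(is_short) else False
        out.append(select_mdct_window(prev, cur, nxt))
    return out


windows = window_sequence([False, False, True, True, False, False])
```

For the example decisions, the second block receives the long start window and the fifth block the long stop window, bracketing the two short blocks.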

The operation of the audio encoder is described with reference to FIG. 1. The MDCT unit (110) is the MDCT analysis filter; it applies discrete cosine transform (DCT) processing to the input audio signal to convert it to the frequency domain. The FFT unit (120), as described above, converts the input audio signal to the frequency domain for psychoacoustic modeling. Here the input to the MDCT analysis filter and the FFT use the same window shape, which is selected by the preceding window conversion unit (140) using time-domain information. In other words, the FFT of the psychoacoustic model uses long start and long stop windows of the same shape as those used in the MDCT analysis filter.

As described above, the FFT and the MDCT are time/frequency transforms. They convert the time-domain audio signal into a frequency-domain audio signal, taking advantage of the fact that a frequency-domain signal is generally easier to encode than a time-domain signal. Because the length of the transform window is closely tied to frequency resolution, it must be chosen appropriately; the choice is supplied by the window conversion unit (140), which uses time-domain information.

The psychoacoustic model unit (130) models human auditory characteristics for perceptual coding of multichannel audio. It extracts the characteristics of the input audio, calculates for each band the level of quantization noise that human hearing cannot detect, and feeds this into the allocation of the bits needed for encoding so that encoding is as efficient as possible. The psychoacoustic modeling techniques and their implementation are the same as those used in existing audio coding algorithms based on psychoacoustic modeling.

The quantization unit (150) quantizes the audio signal: using the output of the psychoacoustic model unit (130), it assigns the optimal quantization levels for the given bit rate to the frequency spectrum compressed by the encoding unit (160). This is performed by the quantization and bit allocation unit (151). The quantized spectral values are represented with the allocated bits; to represent them with fewer bits, they are further encoded in a form from which the decoder can restore the original values. For example, the Huffman coding unit (152) applies Huffman coding to reduce the number of bits.

The encoding unit (160) compresses the audio signal using methods that reduce or predict the amplitude of the frequency spectrum supplied by the time/frequency transform unit, that is, the MDCT unit (110). For this purpose it uses the TNS unit (161), the Intensity/Coupling unit (162), the Prediction unit (163), and the M/S unit (164).

The TNS unit (161) minimizes quantization noise by predictively coding, in the frequency domain, the noise that arises during quantization. The Intensity/Coupling unit (162) implements a compression method based on the relationship between channels: for each channel pair, such as left and right, only the level difference of the other channel relative to one channel is transmitted, which reduces the amount of data actually transmitted. The Prediction unit (163) performs inter-frame prediction, predicting the spectrum of the current frame from the spectrum of the previous audio frame as a compression method along the time axis; transmitting only the prediction parameters and the prediction error reduces the amount of transmitted data. The M/S unit (164) reduces data by converting the left and right channel signals into M (Middle) and S (Side) channels. TNS, Intensity/Coupling, Prediction, and M/S are optional coding tools used to increase coding efficiency, while Huffman coding is a lossless coding step used to encode the quantized spectral information.
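As an illustration of the M/S conversion mentioned above, the following sketch uses the common sum/difference form, M = (L + R) / 2 and S = (L - R) / 2. The text does not give the exact formula, so this form and the function names `to_mid_side` and `from_mid_side` are assumptions.

```python
def to_mid_side(left, right):
    """Convert left/right spectral values to mid/side (assumed form)."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side


def from_mid_side(mid, side):
    """Invert the conversion: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Strongly correlated channels give a small side signal,
# which is what makes the representation cheaper to encode.
mid, side = to_mid_side([1.0, 0.5], [1.0, -0.5])
left, right = from_mid_side(mid, side)
```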

The bitstream formatting unit (170) generates the bitstream of the compressed audio data. That is, it assembles the header information, the spectral data, and the side information into a bitstream. This may include converting the audio ES (Elementary Stream) into a PES (Packetized Elementary Stream), a packetized bitstream, under external or user control.

As mentioned above, the MPEG standard has the block-switching (window-switching) decision made from the perceptual entropy (PE) value, one of the outputs of the psychoacoustic model. The PE value can be expressed as the sum of the per-band SMR (Signal-to-Mask Ratio, the ratio of signal to masking threshold). When the signal rises abruptly in the time domain, energy increases across the whole band in the frequency domain, and the masking threshold rises across the whole band with it. If, in pre-echo control, this raised masking threshold is compared with the masking threshold of the previous frame and the smaller value is used as the masking threshold of the current frame, the SMR in each band of the current frame increases sharply, and so the PE value increases as well. When the PE value exceeds a predefined threshold, the encoder switches from long blocks to short blocks.

However, the PE value for block switching is calculated from the energy and masking threshold of the long block, whereas pre-echo arises in short blocks. Consequently, even when the current frame has been determined to be a long block, pre-echo control lowers the masking threshold below the value actually estimated, so bit allocation gives these bands more bits than they actually need. Fewer bits are then left for the other bands, and at low bit rates the quantization noise in those bands can increase.

In other words, when pre-echo is prevented by splitting a long block into short blocks for encoding, no pre-echo occurs in the long block. If the long-block masking threshold is nevertheless compared with the values of the previous frames, the long block is encoded with a masking threshold smaller than the value actually calculated. This lowers the masking thresholds of the long-block bands and uses more bits than are actually needed. At low bit rates, fewer bits are then left for the other bands, and quantization noise increases.

The present invention provides a method for effective bit allocation in long blocks in MPEG psychoacoustic model II. To this end, the PE value for block switching and the PE value for bit allocation are calculated separately in psychoacoustic model II. Equation 1 gives the PE value.

Here, b is the band index, and (w_high[b]-w_low[b]) is the number of FFT (Fast Fourier Transform) coefficients in a band. thr[b] and e[b] are the per-band masking threshold and energy, respectively. According to Equation 1, the PE value is obtained from a sum over the bands that involves the number of FFT coefficients (w_high[b]-w_low[b]) obtained from the FFT of the input audio signal, the per-band masking threshold (thr[b]), and the per-band energy (e[b]).
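Equation 1 itself is not reproduced in this text. The sketch below assumes the form commonly given for psychoacoustic model II, PE = -Σ_b (w_high[b] - w_low[b]) · log10(thr[b] / (e[b] + 1)), which uses exactly the quantities defined above. The base-10 logarithm, the "+ 1" term, and the function name `perceptual_entropy` are assumptions, not taken from this document.

```python
import math


def perceptual_entropy(w_low, w_high, thr, e):
    """Assumed form of Equation 1: sum over bands b of
    -(w_high[b] - w_low[b]) * log10(thr[b] / (e[b] + 1))."""
    pe = 0.0
    for lo, hi, t, en in zip(w_low, w_high, thr, e):
        pe -= (hi - lo) * math.log10(t / (en + 1.0))
    return pe


# Two bands of four FFT coefficients each.
# Band 0: energy far above the threshold, so it contributes to PE.
# Band 1: threshold equals e + 1, so it contributes nothing.
pe = perceptual_entropy([0, 4], [4, 8], [1.0, 10.0], [99.0, 9.0])
```

In this form, a lower masking threshold for the same energy raises the PE value, which matches the behavior described for pre-echo control.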

The per-band masking threshold thr[b] used to obtain the PE value is calculated for long blocks and for short blocks. It is obtained from the absolute threshold of hearing, the masking thresholds of the previous frame and the frame before it, and predefined weights. Equation 2 below gives the masking threshold used to calculate the PE value for block switching.

Here, absthr[b], nb_l[b], and nb_ll[b] are the absolute threshold of hearing, the masking threshold of the previous frame, and the masking threshold of the frame before that, respectively, and rpelev and rpelev2 are predefined weights. nb[b] is the masking threshold calculated for the current frame up to the point just before pre-echo control. The PE value for block switching is calculated using thr[b].
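Equation 2 is likewise missing from this text. A plausible reading, consistent with the symbols defined above and with the pre-echo control of psychoacoustic model II, is thr[b] = max(absthr[b], min(nb[b], nb_l[b] · rpelev, nb_ll[b] · rpelev2)). The sketch below assumes that form; the function name and the default weight values are placeholders, not values stated in this document.

```python
def threshold_with_pre_echo_control(absthr, nb, nb_l, nb_ll,
                                    rpelev=2.0, rpelev2=16.0):
    """Assumed form of Equation 2, evaluated per band:
    max(absthr, min(nb, nb_l * rpelev, nb_ll * rpelev2)).
    The default weights are illustrative placeholders."""
    return [
        max(a, min(n, l * rpelev, ll * rpelev2))
        for a, n, l, ll in zip(absthr, nb, nb_l, nb_ll)
    ]


# Band 0: the current threshold (50) is capped by the weighted history (16).
# Band 1: the current threshold (0.5) is below absthr, so absthr (1) is used.
thr_eq2 = threshold_with_pre_echo_control(
    absthr=[1.0, 1.0], nb=[50.0, 0.5], nb_l=[10.0, 10.0], nb_ll=[1.0, 1.0]
)
```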

As described above, the audio encoding method of the present invention calculates the PE value for block switching and the PE value for bit allocation separately. The threshold used to calculate the PE value for bit allocation in a long block is obtained as in Equation 3 below.

Here, the per-band masking threshold thr[b] is obtained from the absolute threshold of hearing absthr[b] and the masking threshold nb[b] calculated for the current frame before pre-echo control, and this thr[b] is used to calculate the PE value for bit allocation in the long block. That is, the long-block masking threshold is obtained by comparison with the absolute threshold of hearing only, without any comparison against the masking thresholds of previous frames.
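Equation 3 is also not reproduced here. From the description, it compares only the current-frame threshold with the absolute threshold of hearing, so a natural reading is thr[b] = max(absthr[b], nb[b]). The sketch below assumes that form; the function name is hypothetical.

```python
def threshold_without_pre_echo_control(absthr, nb):
    """Assumed form of Equation 3, evaluated per band: max(absthr, nb).
    No comparison against thresholds of previous frames is made."""
    return [max(a, n) for a, n in zip(absthr, nb)]


# Band 0 keeps its current threshold (50) even if earlier frames were lower;
# band 1 is raised to the absolute threshold of hearing (1).
thr_eq3 = threshold_without_pre_echo_control([1.0, 1.0], [50.0, 0.5])
```

Because the history of earlier frames is ignored, a long block keeps the higher threshold estimated for the current frame and therefore asks for fewer bits.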

The masking threshold for short blocks, on the other hand, uses Equation 2 in order to prevent pre-echo. Thus both the masking threshold used to calculate the PE value for block switching and the masking threshold for short blocks are calculated with Equation 2.

In the audio encoding process according to the embodiment of the present invention, as shown in FIG. 1, audio data is received and windowed, and MDCT analysis is performed together with the FFT for psychoacoustic modeling. The audio signal is modeled on the basis of perceptual characteristics, and the allocation of quantization bits is determined and controlled from that model. The audio signal is compression-encoded from the output of the MDCT unit, and it is quantized by allocating quantization bits according to the psychoacoustic modeling result and applying Huffman coding. Finally, the bitstream of the compressed audio data is assembled and output.

FIG. 2 is a flowchart showing how pre-echo control is performed and the masking threshold is calculated in MPEG psychoacoustic model II during this audio encoding process.

The first step (S10) calculates the masking threshold thr[b] for each band, for long blocks and short blocks. The masking threshold thr[b] obtained in this step has not undergone pre-echo control and is used in the step that calculates the PE value for bit allocation in a long block (S51).

The second step (S20) performs pre-echo control for long blocks and short blocks, based on comparison with the masking thresholds of previous frames. The masking threshold thr[b] that results from pre-echo control in this step is used in the step that calculates the PE value for bit allocation in short blocks (S61).

The third step (S30) calculates, on the long block, the PE value for block switching (long block to short block). The PE value is calculated according to Equation 1, and the masking threshold used for it is calculated according to Equation 2.

The fourth step (S40) compares the calculated PE value with SWITCH_PE, a value predefined for block switching. If the calculated PE value does not exceed SWITCH_PE, the process proceeds to the fifth step (S51) and encodes with a long block. If it exceeds SWITCH_PE, the process proceeds to the eighth step (S61) and switches from long blocks to short blocks for encoding.
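The comparison in step S40 amounts to a single threshold test. In the sketch below, the function name `decide_block_type` and the numeric values are illustrative assumptions; the text does not state a value for SWITCH_PE.

```python
def decide_block_type(pe_switch, switch_pe):
    """Step S40: switch to short blocks only when the PE value
    for block switching exceeds SWITCH_PE."""
    return "short" if pe_switch > switch_pe else "long"


SWITCH_PE = 1800.0  # hypothetical value, not taken from this document

steady = decide_block_type(1200.0, SWITCH_PE)   # below the limit: stay long
attack = decide_block_type(2500.0, SWITCH_PE)   # above the limit: go short
```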

The fifth step (S51) calculates the PE value for bit allocation in the long block, using the masking threshold that has not undergone pre-echo control. The PE value is calculated according to Equation 1, and the masking threshold used for it is calculated according to Equation 3.

The sixth step (S52) obtains the masking threshold for each scale factor band in the long block, and the seventh step (S53) calculates the SMR (signal-to-mask ratio) for each scale factor band in the long block.
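The SMR calculation can be sketched as a per-band ratio of signal energy to masking threshold. How the thresholds are mapped onto scale factor bands is not described in the text, so the sketch assumes that per-band energies and thresholds are already available; the function names and the decibel conversion are assumptions.

```python
import math


def smr_per_band(energy, threshold):
    """Signal-to-mask ratio per scale factor band as a plain ratio."""
    return [e / t for e, t in zip(energy, threshold)]


def smr_per_band_db(energy, threshold):
    """The same ratio expressed in decibels (assumed convention)."""
    return [10.0 * math.log10(r) for r in smr_per_band(energy, threshold)]


ratios = smr_per_band([100.0, 4.0], [1.0, 4.0])
ratios_db = smr_per_band_db([100.0, 4.0], [1.0, 4.0])
```

A band whose energy equals its masking threshold has a ratio of 1 (0 dB) and needs no additional bits to keep quantization noise masked.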

The eighth step (S61), reached after block switching, calculates the PE value for bit allocation in the short blocks, using the masking threshold that has undergone pre-echo control. The PE value is calculated according to Equation 1, and the masking threshold used for it is calculated according to Equation 2.

The ninth step (S62) obtains the masking threshold for each scale factor band in the short blocks, and the tenth step (S63) calculates the SMR (signal-to-mask ratio) for each scale factor band in the short blocks.

As described above, to obtain separate PE values for bit allocation and for block switching in psychoacoustic model II, the long-block masking threshold is used in two forms: the value before pre-echo control for bit allocation and the value after pre-echo control for block switching. For long blocks, the values before and after pre-echo control are therefore kept separately so that each can be used. If a method other than the PE value is used for block switching, the block-switching part may be replaced by that method.
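The flow of FIG. 2 from S20 to S61 can be summarized in one self-contained sketch. It relies on assumed forms of Equations 1 to 3 (the equations are not reproduced in this text), and the `BlockData` container, the function names, and the default weights are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class BlockData:
    width: List[int]      # w_high[b] - w_low[b] per band
    e: List[float]        # per-band energy
    nb: List[float]       # current threshold before pre-echo control
    nb_l: List[float]     # threshold of the previous frame
    nb_ll: List[float]    # threshold of the frame before that
    absthr: List[float]   # absolute threshold of hearing


def thr_after_pre_echo(d, rpelev, rpelev2):
    # Assumed Equation 2.
    return [max(a, min(n, l * rpelev, ll * rpelev2))
            for a, n, l, ll in zip(d.absthr, d.nb, d.nb_l, d.nb_ll)]


def thr_before_pre_echo(d):
    # Assumed Equation 3.
    return [max(a, n) for a, n in zip(d.absthr, d.nb)]


def pe_value(d, thr):
    # Assumed Equation 1.
    return -sum(w * math.log10(t / (en + 1.0))
                for w, t, en in zip(d.width, thr, d.e))


def analyse_frame(long_blk, short_blks, switch_pe, rpelev=2.0, rpelev2=16.0):
    """Return the block type and the PE value(s) used for bit allocation."""
    # S20/S30: PE for block switching uses the pre-echo-controlled threshold.
    pe_switch = pe_value(long_blk, thr_after_pre_echo(long_blk, rpelev, rpelev2))
    if pe_switch <= switch_pe:
        # S40 -> S51: long block, threshold without pre-echo control.
        return "long", [pe_value(long_blk, thr_before_pre_echo(long_blk))]
    # S40 -> S61: short blocks, threshold with pre-echo control.
    return "short", [pe_value(b, thr_after_pre_echo(b, rpelev, rpelev2))
                     for b in short_blks]


blk = BlockData(width=[4], e=[99.0], nb=[10.0], nb_l=[0.5], nb_ll=[1.0],
                absthr=[0.1])
kind_long, pe_long = analyse_frame(blk, [blk], switch_pe=10.0)
kind_short, pe_short = analyse_frame(blk, [blk], switch_pe=5.0)
```

In the example, the switching PE is 8 while the long-block allocation PE is only 4, which illustrates the intended effect: a long block that stays long is no longer charged for the lowered, pre-echo-controlled threshold.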

With the audio encoding method of the present invention, pre-echo control is applied only to short blocks, which reduces pre-echo in short blocks, while long blocks use the masking threshold estimated for the current frame as is, so that bits are allocated effectively in long blocks.

Claims

An audio encoding method that performs quantization and compression encoding based on a time/frequency transform by applying a psychoacoustic model during compression encoding of an audio signal, the method comprising:

obtaining a PE (Perceptual Entropy) value for block switching (window switching);

determining whether to switch blocks according to the PE (Perceptual Entropy) value; and

obtaining a PE (Perceptual Entropy) value for bit allocation in each block according to the result of the block-switching decision.

The audio encoding method of claim 1, wherein the PE (Perceptual Entropy) value is obtained by using an FFT coefficient, a masking threshold for each band, and an energy value in one band.

An audio encoding method comprising:

obtaining a per-band masking threshold of the input audio signal;

performing pre-echo control; and

obtaining a PE (Perceptual Entropy) value for bit allocation for each block.

The audio encoding method of claim 3, wherein the PE (Perceptual Entropy) value is obtained by using the number of FFT coefficients in one band, a masking threshold for each band, and energy.

The audio encoding method of claim 3, wherein the PE (Perceptual Entropy) value for block switching and the PE (Perceptual Entropy) value for bit allocation are obtained based on different masking thresholds.

The audio encoding method of claim 3, wherein the PE (Perceptual Entropy) value for bit allocation is obtained for each block.

The audio encoding method of claim 3, wherein the masking threshold for obtaining the PE (Perceptual Entropy) value for bit allocation uses the value before or the value after pre-echo control, depending on the block.

The audio encoding method of claim 7, wherein the masking threshold for obtaining the PE (Perceptual Entropy) value for bit allocation uses the value before pre-echo control in a long block.

The audio encoding method of claim 7, wherein the masking threshold for obtaining the PE (Perceptual Entropy) value for bit allocation uses the value after pre-echo control in a short block.