KR20220048252A

KR20220048252A - Method and apparatus for encoding and decoding of audio signal using learning model and method and apparatus for training the learning model

Info

Publication number: KR20220048252A
Application number: KR1020200131189A
Authority: KR
Inventors: 장인선; 백승권; 성종모; 이미숙; 이태진; 임우택; 강홍구; 이지현; 이찬우; 한혜원
Original assignee: 한국전자통신연구원; 연세대학교 산학협력단
Priority date: 2020-10-12
Filing date: 2020-10-12
Publication date: 2022-04-19

Abstract

Disclosed are a method and apparatus for encoding and decoding an audio signal using a learning model, and a method and apparatus for training the learning model. The method for encoding the audio signal according to one embodiment of the present invention may comprise: dividing an input signal corresponding to the audio signal into a plurality of frequency bands; extracting a feature vector for each of the plurality of frequency bands; quantizing the feature vector generated for each of the plurality of frequency bands; and encoding the quantized feature vector into a bitstream and transmitting the bitstream to a decoding device. The present invention can thereby improve the performance of an audio codec.

Description

A method and apparatus for encoding and decoding an audio signal using a learning model, and a training method and apparatus for a learning model

The present invention relates to a method and apparatus for encoding and decoding an audio signal using an auto-regressive generative model, one class of deep learning-based learning models, and to a method and apparatus for training the learning model. More specifically, it relates to a method and apparatus for encoding and decoding an audio signal by dividing the frequency band of the audio signal and training a generative model according to the characteristics of each band.

Conventional audio coding achieves high quality even at relatively high compression rates by representing only perceivable information, exploiting the characteristics of human hearing. However, the delay required in the encoding process is long, and because the encoding method is not standardized, sound quality varies greatly with the system characteristics of each company and product, which makes consistent results difficult to obtain.

An audio codec can be broadly divided into i) encoding, which includes converting the input audio signal into frequency components and allocating bits using human auditory characteristics, and ii) decoding, which inversely transforms the transmitted frequency components back into a time-domain signal.

Here, the performance of an audio codec depends on how effectively bits are allocated. That is, the most important factor determining the performance of an audio codec is how to use the minimum amount of information for the frequency components obtained by the transform while allowing only distortion that a listener cannot perceive.

Recently, with the rapid development of deep learning, research has been attempted on using it in the encoding of audio signals. The simplest way to apply deep learning to an audio codec is to replace the transform to frequency components and the inverse transform with a deep learning network.

A simple structure known as an autoencoder can replace the encoding and decoding processes of an audio codec. However, the performance of an autoencoder-based audio codec is not higher than that of an audio codec using a deep learning network based on a generative model.

Conventional generative models include auto-regressive (AR) generative models such as WaveNet and WaveRNN. In these AR-based generative models, the input data consists of previously synthesized past samples together with auxiliary features, also called a conditional vector, that determine the characteristics of the samples to be generated. Passing this input through the deep learning network yields the sample at the next time step.

Therefore, how the auxiliary feature vector is determined is very important, and a technique is needed for compressing it effectively in the encoding apparatus. In addition, since complexity and performance are determined by the structure of the generative model, technology for this must be developed first.

WaveNet uses a dilated convolutional neural network to greatly enlarge the receptive field. In other words, WaveNet can effectively model the information of previous samples, and it generates speech by predicting the current sample using the mel-spectrogram of the previous samples as a conditional vector.

SampleRNN, another generative model, is widely used for speech generation. It generates speech by estimating conditional vectors at various time resolutions using a multi-layer recurrent neural network.

SampleRNN has produced efficient results for generating speech-based signals. However, because a general audio signal differs from a speech signal, which reflects the characteristics of human speech production, the conditional vector needs to be changed.

In encoding an audio signal with a conventional generative model, a wide-band signal covering the human audible range (0-20 kHz) must be represented, so the generative model has been applied to the full-band audio signal. However, to improve the performance of the audio codec, a technique is required that divides the frequency band and trains the generative model to fit the characteristics of each band, rather than applying the generative model to the full-band audio signal.

Unlike a conventional AR-based generative model that uses only a deep learning network, the present invention divides the input signal into several frequency band signals and then independently builds a generative model suited to the characteristics of the time-domain signal corresponding to each band. It thereby provides a method and apparatus for encoding an audio signal that can improve the performance of an audio codec by minimizing the amount of information required to represent the auxiliary feature vector.

A method of encoding an audio signal according to an embodiment of the present invention may include: dividing an input signal corresponding to the audio signal into a plurality of frequency bands; extracting a feature vector for each of the plurality of frequency bands; quantizing the feature vector generated for each of the plurality of frequency bands; and encoding the quantized feature vectors into a bitstream and transmitting the bitstream to a decoding apparatus.

The dividing of the input signal into the plurality of frequency bands may divide the input signal using a band-pass filter that removes interference between the frequency bands.

The extracting of the feature vector may generate the feature vector by applying a domain transform to the input signal divided into the plurality of frequency bands.

The quantizing of the feature vector may estimate the amount of information required for each of the plurality of frequency bands in consideration of human auditory characteristics and the total number of bits available for quantization, and may quantize the feature vector by allocating bits to each of the plurality of frequency bands according to the estimated amount of information.

A method of decoding an audio signal according to an embodiment of the present invention may include: receiving a bitstream from an encoding apparatus; decoding the bitstream to determine a quantized feature vector for each of a plurality of frequency bands; inputting the quantized feature vector for each of the plurality of frequency bands to a learning model to generate an output signal for each of the plurality of frequency bands; and combining the output signals generated for the plurality of frequency bands.

The learning model has a network structure consisting of an input layer, a hidden layer having a plurality of weights, and an output layer, and, for time-series input data, may generate output data for the current time point from the input data of previous time points.

The generating of the output signal may determine the audio samples of the output signal through the learning model, using the input signal and the quantized feature vector for each of the plurality of frequency bands as auxiliary information.

A method of training a learning model used for decoding an audio signal according to an embodiment of the present invention may include: receiving a bitstream from an encoding apparatus; decoding the bitstream to determine a quantized feature vector for each of a plurality of frequency bands; inputting the quantized feature vector for each of the plurality of frequency bands to a learning model to generate an output signal for each of the plurality of frequency bands; and training the learning model by comparing, for each of the plurality of frequency bands, the output signal with the input signal that was input to the encoding apparatus.

The learning model has a network structure consisting of an input layer, a hidden layer having a plurality of weights, and an output layer, and, for time-series input data, may generate output data for the current time point from the input data of previous time points.

The training of the learning model may update the weights so that the output value of a loss function that computes the difference between the output signal and the input signal is minimized.
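The weight update described above can be illustrated with a deliberately small stand-in. The sketch below is not the network of the invention: it assumes a linear auto-regressive predictor in place of the deep learning model, a mean squared error loss, and plain gradient descent. The names `train_linear_ar` and `mse` are hypothetical.

```python
def train_linear_ar(target, order, lr, epochs):
    """Fit weights w so that sum(w[k] * target[t-1-k]) approximates target[t].

    A linear predictor stands in for the learning model (assumption);
    the loss is the mean squared difference between output and input.
    """
    w = [0.0] * order
    steps = len(target) - order
    for _ in range(epochs):
        grad = [0.0] * order
        for t in range(order, len(target)):
            past = [target[t - 1 - k] for k in range(order)]
            err = sum(wk * p for wk, p in zip(w, past)) - target[t]
            for k in range(order):
                # Derivative of the mean squared error with respect to w[k].
                grad[k] += 2.0 * err * past[k] / steps
        # Gradient descent step: move the weights against the gradient.
        w = [wk - lr * g for wk, g in zip(w, grad)]
    return w


def mse(target, w):
    """Mean squared prediction error of weights w on the target signal."""
    order = len(w)
    errs = [sum(w[k] * target[t - 1 - k] for k in range(order)) - target[t]
            for t in range(order, len(target))]
    return sum(e * e for e in errs) / len(errs)


# Usage: a decaying signal x[t] = 0.9 * x[t-1] is recovered by a 1-tap predictor.
signal = [0.9 ** t for t in range(50)]
weights = train_linear_ar(signal, order=1, lr=0.5, epochs=200)
```

In a per-band setup, one such training loop would run independently for each frequency band, with each band's own signal as the target.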

A method of decoding an audio signal according to an embodiment of the present invention may include: receiving a bitstream from an encoding apparatus; decoding the bitstream to determine, for each of a plurality of frequency bands, a feature vector representing the characteristics of the audio samples; generating, for each of the plurality of frequency bands, an audio sample at time t using the feature vector and the audio samples before time t; and generating an output signal by combining the audio samples generated for the plurality of frequency bands.

The generating of the audio sample at time t for each of the plurality of frequency bands may generate the audio sample for each band by inputting the feature vector to a pre-trained deep learning-based learning model.

In an apparatus for decoding an audio signal according to an embodiment of the present invention, the decoding apparatus includes a processor, and the processor may receive a bitstream from an encoding apparatus, decode the bitstream to determine a quantized feature vector for each of a plurality of frequency bands, and input the quantized feature vector for each of the plurality of frequency bands to a learning model to generate an output signal for each of the plurality of frequency bands.

The processor may determine the audio samples of the output signal through the learning model, using the quantized feature vector for each of the plurality of frequency bands as auxiliary information.

According to an embodiment of the present invention, unlike a conventional AR-based generative model that uses only a deep learning network, the input signal is divided into several frequency band signals and a generative model is then built independently to suit the characteristics of the time-domain signal corresponding to each band. This minimizes the amount of information required to represent the auxiliary feature vector and thereby improves the performance of the audio codec.

According to an embodiment of the present invention, the generative model can be configured flexibly according to the characteristics of each frequency band, so the audio signal can be decoded with low complexity, and high performance is achieved while the delay in the encoding process is minimized. The generative model of the present invention is also applicable to fields other than audio codecs, and to any signal expressed as a time series, not only audio or speech signals.

FIG. 1 is a diagram illustrating an encoding apparatus and a decoding apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram schematically illustrating an encoding and decoding process according to an embodiment of the present invention.
FIG. 3 is a diagram schematically illustrating a training process of a generative model according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a training process of a generative model according to an embodiment of the present invention.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. However, since various changes may be made to the embodiments, the scope of the patent application is not restricted or limited by these embodiments. It should be understood that all modifications, equivalents, and substitutes for the embodiments are included in the scope of the rights.

The terms used in the embodiments are used for the purpose of description only and should not be construed as limiting. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to indicate that a feature, number, step, operation, component, part, or combination thereof described in the specification exists, and should be understood not to preclude the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

In addition, in the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. In describing the embodiments, a detailed description of related known technology is omitted where it is judged that it would unnecessarily obscure the gist of the embodiments.

FIG. 1 is a diagram illustrating an encoding apparatus and a decoding apparatus according to an embodiment of the present invention.

The present invention relates to a method and apparatus for encoding and decoding an audio signal using an auto-regressive learning model, one class of deep learning-based learning models, and to a method and apparatus for training the learning model. It encodes and decodes an audio signal by dividing the frequency band of the audio signal and training a learning model according to the characteristics of each band.

In the present invention, the learning model is a deep learning-based network model consisting of an input layer, a hidden layer having a plurality of weights, and an output layer. The learning model generates output data for input data such that the output follows the distribution of the training data.

The learning model is trained to generate, for time-series input data, output data for the current time point using the input data of previous time points as auxiliary information. For example, a generative model may be used as the learning model. In other words, in the present invention the learning model is trained, taking the decoded result as input data, to generate an output signal 104 identical to the input signal 103 of the encoding apparatus 101.

Referring to FIG. 1, the encoding apparatus 101 extracts a feature vector from the input signal 103, encodes the feature vector into a bitstream, and transmits the bitstream to the decoding apparatus 102. The decoding apparatus 102 decodes the bitstream, inputs the result to the trained learning model, and generates the output signal 104. In FIG. 1, the input signal 103 is the original audio signal used by the encoding apparatus 101, and the output signal 104 is the audio signal reconstructed by the decoding apparatus 102.

The encoding apparatus 101 and the decoding apparatus 102 each correspond to a processor; they may correspond to the same processor or to different processors. The encoding apparatus 101 performs the encoding method of the present invention, and the decoding apparatus 102 performs the decoding method and the training method of the present invention.

In the present invention, rather than simply encoding the input signal 103 as it is, the encoding apparatus 101 divides the input signal 103 by frequency band, and the decoding apparatus 102 builds and trains a learning model for each frequency band. This lowers the complexity of decoding and improves the performance of the audio codec while minimizing the delay in encoding.

FIG. 2 is a diagram schematically illustrating an encoding and decoding process according to an embodiment of the present invention.

In the frequency band division process 201, the encoding apparatus 101 divides the input signal 103 into a plurality of frequency bands. Specifically, the encoding apparatus 101 divides the input signal 103 into the plurality of frequency bands using a band-pass filter that removes interference between the frequency bands.
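As a rough illustration of band division, the sketch below splits a short real signal into non-overlapping bands by masking its discrete Fourier transform. This is an assumption made for clarity: the text specifies only a band-pass filter, not an ideal (brick-wall) one, and a practical codec would more likely use a filter bank. The function names are hypothetical, and the naive DFT is only suitable for short frames.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each time-domain sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def split_into_bands(signal, num_bands):
    """Split a real signal into num_bands time-domain band signals.

    Each DFT bin is assigned to exactly one band, so the bands do not
    overlap and their sum reconstructs the original signal.
    """
    n = len(signal)
    spectrum = dft(signal)
    half = n // 2  # index of the highest positive-frequency bin
    bands = []
    for b in range(num_bands):
        lo = b * (half + 1) // num_bands
        hi = (b + 1) * (half + 1) // num_bands  # exclusive upper edge
        masked = [0j] * n
        for k in range(n):
            freq = min(k, n - k)  # fold negative-frequency bins onto 0..half
            if lo <= freq < hi:
                masked[k] = spectrum[k]
        bands.append(idft(masked))
    return bands


# Usage: a low tone plus a high tone separates into two bands.
frame = [math.sin(2 * math.pi * 2 * t / 32) + 0.5 * math.sin(2 * math.pi * 12 * t / 32)
         for t in range(32)]
low_band, high_band = split_into_bands(frame, 2)
```

Because the bands partition the spectrum, adding the band signals sample by sample recovers the input, which mirrors the decoder's final step of combining the per-band outputs.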

In the per-band feature vector extraction process 202, the encoding apparatus 101 extracts a feature vector for each of the plurality of frequency bands. Specifically, the encoding apparatus 101 generates the feature vector by applying a domain transform to the input signal 103 divided into the plurality of frequency bands.

For example, the feature vector may be an acoustic feature corresponding to the input signal 103, such as a mel-spectrogram or LPC (Linear Predictive Coding) features. As another example, the feature vector may be a latent vector extracted from the input signal 103 by an autoencoder.
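To make the LPC option concrete, the following sketch computes linear prediction coefficients for one frame with the autocorrelation method and the Levinson-Durbin recursion. The text does not say how LPC features would be computed, so the method, the frame handling, and the function names are assumptions.

```python
def autocorrelation(frame, max_lag):
    """r[lag] = sum over t of frame[t] * frame[t + lag]."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(frame, order):
    """Levinson-Durbin recursion on the frame's autocorrelation.

    Returns coefficients a[0..order-1] such that frame[t] is
    approximated by sum(a[k] * frame[t - 1 - k]).
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # a[0] is unused; a[1..order] are the predictor taps
    err = r[0]               # prediction error energy so far
    for i in range(1, order + 1):
        if err <= 0.0:
            break            # silent frame: leave the remaining taps at zero
        # Reflection coefficient for order i, using the order i-1 taps.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]


# Usage: a signal obeying x[t] = 0.9 * x[t-1] yields a first tap near 0.9.
coeffs = lpc([0.9 ** t for t in range(200)], order=2)
```

Applied per band, each band signal would yield its own short coefficient vector, which is what would then be quantized.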

In the feature vector quantization 203, the encoding apparatus 101 quantizes the feature vector generated for each of the plurality of frequency bands. Specifically, the encoding apparatus 101 estimates the amount of information required for each frequency band in consideration of human auditory characteristics and the total number of bits available for quantization, and quantizes the feature vectors by allocating bits to each frequency band according to the estimated amount of information.

For example, fewer quantization bits are allocated to the feature vector of a frequency band that is difficult to perceive given human auditory characteristics. In other words, the number of bits allocated to a frequency band within the audible range (e.g., 20 Hz to 20,000 Hz) may be set larger than the number of bits allocated to a frequency band outside the audible range.

That is, when a frequency band falls within the audible range, the amount of information required for that band is estimated to be higher than for a band outside the audible range, and more bits are allocated to the band within the audible range than to the band outside it.
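The allocation rule can be sketched as a proportional split of a fixed bit budget. The per-band weights below are hypothetical placeholders for the estimated amount of information; the text does not give a formula for that estimate, and largest-remainder rounding is one possible choice for keeping the total exact.

```python
def allocate_bits(band_weights, total_bits):
    """Split total_bits across bands in proportion to band_weights.

    band_weights are assumed, non-negative estimates of how much
    information each band needs. Largest-remainder rounding keeps the
    allocation summing exactly to total_bits.
    """
    weight_sum = sum(band_weights)
    if weight_sum <= 0:
        raise ValueError("at least one band must have a positive weight")
    shares = [total_bits * w / weight_sum for w in band_weights]
    bits = [int(s) for s in shares]  # round every share down first
    leftover = total_bits - sum(bits)
    # Hand the remaining bits to the bands with the largest fractional parts.
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: shares[i] - bits[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        bits[i] += 1
    return bits


# Usage: four bands, the last one weighted lowest (hardest to perceive).
allocation = allocate_bits([4, 3, 2, 1], total_bits=20)
```

A band judged to be outside the audible range could be given a weight of zero, in which case it receives no bits at all.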

In the encoding process 204, the encoding apparatus 101 encodes the quantized feature vectors into a bitstream and transmits it to the decoding apparatus 102. Here, the encoding apparatus 101 generates a bitstream for each frequency band and transmits it to the decoding apparatus 102.
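One simple way to picture quantization followed by bitstream encoding is uniform scalar quantization with fixed-width bit packing. This is an assumed scheme for illustration only: the text does not specify the quantizer or the bitstream syntax, and the value range of [-1, 1] and all function names are hypothetical.

```python
def quantize(value, bits, lo=-1.0, hi=1.0):
    """Map a value in [lo, hi] to an integer index with 2**bits levels (bits >= 1)."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)


def dequantize(index, bits, lo=-1.0, hi=1.0):
    """Map an index back to the center value of its quantization level."""
    levels = (1 << bits) - 1
    return lo + index / levels * (hi - lo)


def pack(indices, bits):
    """Concatenate fixed-width indices into bytes, most significant bit first."""
    acc = 0
    for idx in indices:
        acc = (acc << bits) | idx
    total = bits * len(indices)
    pad = (-total) % 8  # zero bits appended to reach a whole number of bytes
    return (acc << pad).to_bytes((total + pad) // 8, "big")


def unpack(data, bits, count):
    """Recover count fixed-width indices from bytes produced by pack."""
    pad = (-(bits * count)) % 8
    acc = int.from_bytes(data, "big") >> pad
    mask = (1 << bits) - 1
    return [(acc >> (bits * (count - 1 - i))) & mask for i in range(count)]


# Usage: one band's feature vector, quantized with 4 bits per element.
features = [-1.0, -0.5, 0.0, 0.5, 1.0]
indices = [quantize(v, 4) for v in features]
band_bitstream = pack(indices, 4)
restored = [dequantize(i, 4) for i in unpack(band_bitstream, 4, len(indices))]
```

With one such byte string per band, and a bit width per band taken from the allocation step, the decoder can recover each band's quantized feature vector independently.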

In the decoding process 205, the decoding apparatus 102 receives the bitstream from the encoding apparatus 101 and decodes it to determine the quantized feature vectors for each of the plurality of frequency bands.

In the process 206 using the generative model, the decoding apparatus 102 inputs the feature vector for each frequency band to the learning model to generate an output signal for each frequency band. The decoding apparatus 102 then combines the output signals generated for the frequency bands to obtain the final output signal 104.

The decoding apparatus 102 generates the output signal 104 through the learning model, using the quantized feature vector as auxiliary information. For example, the learning model may be a generative model.

FIG. 3 is a diagram schematically illustrating a training process of a generative model according to an embodiment of the present invention.

The decoding apparatus 102 trains the learning model auto-regressively so that, for a time series of audio samples, it generates the audio sample at the current time point using the audio samples of previous time points and the feature vector for the audio sample at the current time point.

That is, the decoding apparatus 102 is trained to generate an output signal identical to the input signal 300 that is input to the encoding apparatus 101. Specifically, to generate the audio sample of the output signal corresponding to time t, the decoding apparatus 102 inputs the quantized feature vector corresponding to time t to the learning model.

The decoding apparatus 102 generates the output signal through the learning model, using the quantized feature vector as auxiliary information. Specifically, the decoding apparatus 102 uses the learning model to generate the audio sample at time t from the audio samples generated before time t and the quantized feature vector at time t. For example, the learning model may be an auto-regressive (AR) generative model.
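The sample-by-sample generation described above can be sketched as a loop that feeds each newly drawn sample back in as history. The `toy_model` function is a hypothetical stand-in for a trained network and carries no claim about the actual model; it only returns a probability distribution over eight sample levels so that the loop is runnable.

```python
import random


def generate(conditioning, predict_distribution, receptive_field, rng):
    """Draw one sample per time step from p(x_t | previous samples, h_t).

    conditioning holds one feature value h_t per output sample, and
    receptive_field (>= 1) limits how much history the model sees.
    """
    samples = []
    for h_t in conditioning:
        history = samples[-receptive_field:]
        probs = predict_distribution(history, h_t)
        # Sample the next level according to the predicted probabilities.
        x_t = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        samples.append(x_t)
    return samples


def toy_model(history, h_t, levels=8):
    """Hypothetical stand-in for a trained network (assumption).

    Favors levels close to the midpoint of the last sample and the
    conditioning value, and returns normalized probabilities.
    """
    center = h_t if not history else (history[-1] + h_t) / 2
    weights = [1.0 / (1.0 + (level - center) ** 2) for level in range(levels)]
    total = sum(weights)
    return [w / total for w in weights]


# Usage: ten samples conditioned on a constant feature value of 3.
output = generate([3] * 10, toy_model, receptive_field=4, rng=random.Random(0))
```

The sequential dependence is why auto-regressive decoding cannot be parallelized over time within a band, although separate bands can be generated independently of one another.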

For example, the learning model of the present invention may generate the output signal according to Equation 1.

Here, h denotes the quantized feature vector and x_t denotes the audio sample at time t. p denotes the probability of the audio samples given the quantized feature vector. That is, according to Equation 1, the learning model of the present invention can generate an audio sample by determining the probability of the audio sample at time t from the audio samples before time t and the quantized feature vector.
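Equation 1 itself is not reproduced in this text. Given the symbol definitions above, it presumably takes the standard conditional auto-regressive form used by WaveNet-style models, shown here as an assumption, with T (the number of samples) introduced only for this illustration:

```latex
p(x \mid h) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}, h\right)
```

Each factor is the distribution of the sample at time t given all earlier samples and the quantized feature vector h, which matches the description of determining the probability of the sample at time t from the samples before t and h.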

In the process of generating audio samples through the learning model, the quantized feature vector is upsampled to the frame length so that its resolution matches the frame length of the input signal. Specifically, the decoding device 102 may upsample the quantized feature vector by inputting it to a transposed convolution network.
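The upsampling step can be illustrated with a minimal NumPy sketch of a one-dimensional transposed convolution. It assumes one scalar feature per frame, a stride equal to the frame length, and a fixed all-ones kernel; the names `transposed_conv1d`, `frame_len`, and `features` are hypothetical, and in practice the kernel would be learned and the features would be vectors.

```python
import numpy as np

def transposed_conv1d(h, kernel, stride):
    """Scatter each input value, scaled by the kernel, into a longer output."""
    h = np.asarray(h, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    out = np.zeros((len(h) - 1) * stride + len(kernel))
    for i, value in enumerate(h):
        start = i * stride
        out[start:start + len(kernel)] += value * kernel
    return out

frame_len = 4                             # assumed samples per frame
features = np.array([1.0, -2.0, 0.5])     # one (toy) feature per frame
kernel = np.ones(frame_len)               # assumed fixed kernel, normally learned
upsampled = transposed_conv1d(features, kernel, stride=frame_len)
```

With a kernel as long as the stride, each frame-level value is spread over exactly one frame, so the upsampled sequence has one conditioning value per audio sample.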

The decoding device 102 then adds i) the upsampled result and ii) the result of a dilated convolution over the audio samples before time t, and applies an activation unit (e.g., a gated activation unit) to the sum to determine the output signal.
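As a rough illustration of this step, the sketch below adds an upsampled conditioning signal to a causal dilated convolution of past samples and passes the sum through a tanh/sigmoid gate. It is only a sketch under assumptions: the function names, the weights, and the choice to add the same conditioning signal to both the filter and gate paths are hypothetical and are not taken from the specification.

```python
import numpy as np

def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - (k + 1) * dilation], using past samples only."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for k, w in enumerate(weights):
        shift = (k + 1) * dilation
        if shift >= len(x):
            continue  # this tap reaches beyond the start of the signal
        y[shift:] += w * x[:len(x) - shift]
    return y

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_layer(past, cond, w_filter, w_gate, dilation):
    """Gated activation: tanh(filter path) * sigmoid(gate path)."""
    f = causal_dilated_conv(past, w_filter, dilation) + cond
    g = causal_dilated_conv(past, w_gate, dilation) + cond
    return np.tanh(f) * sigmoid(g)

x = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])   # toy past samples
cond = np.zeros(6)                              # toy upsampled features
out = gated_layer(x, cond, w_filter=[1.0, 0.5], w_gate=[0.0, 0.0], dilation=2)
```

Because every tap is shifted by at least one dilation step, the value at time t never depends on the sample at time t itself, which is what keeps the layer auto-regressive.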

The generative models 321 and 322 shown in FIG. 3 are auto-regressive generative models given as examples of the learning model; the learning model used in the present invention is not limited to a generative model.

Specifically, the decoding device 102 receives from the encoding device 101 the bitstreams 301 and 302 corresponding to the feature vectors extracted from the input signal 300. The encoding device 101 generates a bitstream for each frequency band and transmits it to the decoding device 102. That is, in FIG. 3, the bitstream 301 and the bitstream 302 are bitstreams for feature vectors corresponding to different frequency bands.

In the decoding process 310, the decoding device 102 decodes the bitstream of each frequency band to determine the quantized feature vector for that band. The decoded quantized feature vectors are input to the learning models 321 and 322, each of which is generated for one frequency band.

The decoding device 102 may also configure the structure of the learning model differently for each frequency band. For example, the lower the frequency band, the larger the receptive field of the learning model on the time axis may be set and the fewer hidden layers the learning model may have.

That is, a wider receptive field can be secured even with a small number of layers. For example, in a low frequency band, the decoding device 102 may enlarge the receptive field by increasing the dilation. The current audio sample can therefore be determined by considering audio samples spanning a relatively longer time than in a high frequency band.

Likewise, the higher the frequency band, the smaller the receptive field of the learning model may be set and the more hidden layers the learning model may have. Because the learning model is applied according to the characteristics of each frequency band, the overall complexity is reduced. For example, in a high frequency band, the decoding device 102 may increase the number of layers in order to consider the audio samples densely on the time axis.

The receptive field of the learning model is the size of the region of input neurons that affect one neuron of the output layer in a neural network composed of an input layer, hidden layers, and an output layer.

That is, the lower the frequency band to which a generative model corresponds, the larger its receptive field is set, while its subsampling interval or dilation factor is set longer than that of a generative model corresponding to a higher frequency band.

Likewise, the higher the frequency band to which a generative model corresponds, the smaller its receptive field is set, while its subsampling interval or dilation factor is set shorter than that of a generative model corresponding to a lower frequency band. By setting the layers and the time-axis receptive field of the learning model differently for each frequency band, the decoding device 102 can reduce the complexity of the learning model.
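The trade-off between layer count and dilation can be made concrete with a small calculation. The sketch below uses the standard receptive-field formula for a stack of dilated convolutions, 1 + (kernel size - 1) x (sum of dilations); the kernel size and the two dilation schedules are hypothetical values chosen only to show the effect and do not come from the specification.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked dilated convolutions, one layer per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)

# Hypothetical schedules: the low band uses few layers with large dilation factors,
# the high band uses more layers with small dilation factors.
low_band_dilations = [1, 4, 16, 64]
high_band_dilations = [1, 2, 4, 8, 16, 32]

low_rf = receptive_field(2, low_band_dilations)     # 4 layers
high_rf = receptive_field(2, high_band_dilations)   # 6 layers
```

In this toy configuration the low-band model covers 86 samples with 4 layers, while the high-band model covers 64 samples with 6 layers, which is the pattern described for the per-band models.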

In addition, by setting the layers and the time-axis receptive field of the generative model differently, the amount of information required to represent the feature vector in the learning model of each frequency band can be minimized.

Referring to FIG. 3, the decoding device 102 may obtain the output signals 331 and 332 using the per-band generative models 321 and 322. The decoding device 102 may then train the learning models by dividing the input signal 300 that is input to the encoding device 101 into frequency bands and comparing the per-band input signals 351 and 352 with the output signals 331 and 332.

Specifically, using the input signal that is input to the encoding device 101 as the ground-truth label, the decoding device 102 trains the learning model so as to minimize the value of a loss function that compares the output signal with the input signal for each frequency band.

For example, the decoding device 102 trains the generative model based on the output signal 331 of the generative model 321 for frequency band A and the input signal 351 for frequency band A. Specifically, the decoding device 102 uses the loss function 341 for frequency band A to calculate the difference between the output signal 331 of the generative model 321 for frequency band A and the input signal 351 for frequency band A, and updates the weights of the learning model 321 so that the difference is minimized.
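The per-band training loop can be sketched in a few lines. The example below is a toy under stated assumptions, not the method of the specification: each band's "model" is a single scalar weight applied to the features, the loss is a mean squared error, and the update is plain gradient descent. The names `train_step`, `bands`, and the per-band scales are hypothetical.

```python
import numpy as np

def mse_loss(output, target):
    """Mean squared difference between a band's output and its input signal."""
    return float(np.mean((output - target) ** 2))

def train_step(weight, features, target, lr=0.1):
    """One gradient-descent step for a toy model: output = weight * features."""
    output = weight * features
    grad = float(np.mean(2.0 * (output - target) * features))
    return weight - lr * grad, mse_loss(output, target)

rng = np.random.default_rng(0)
bands = {"A": 0.8, "B": -0.3}            # hypothetical true scale per band
weights = {name: 0.0 for name in bands}  # one independent model per band
history = {name: [] for name in bands}

for name, true_scale in bands.items():
    features = rng.standard_normal(256)
    target = true_scale * features       # band-limited input signal (label)
    for _ in range(50):
        weights[name], loss = train_step(weights[name], features, target)
        history[name].append(loss)
```

Each band keeps its own weight and its own loss history, mirroring the separate loss functions computed per frequency band; an actual implementation would replace the scalar weight with the auto-regressive network and its optimizer.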

FIG. 4 is a flowchart illustrating the training process of a generative model according to an embodiment of the present invention.

In step 401, the decoding device receives from the encoding device a bitstream corresponding to the feature vector of the input signal. The encoding device generates a bitstream for each frequency band and transmits it to the decoding device, and the decoding device uses the per-band bitstreams.

In step 402, the decoding device decodes the bitstream to determine a quantized feature vector for each of the plurality of frequency bands. Specifically, the decoding device decodes the bitstream to obtain, for each of the plurality of frequency bands, a feature vector representing the characteristics of the audio samples.

The decoding device then generates a learning model for each frequency band. Specifically, the lower the frequency band, the larger the decoding device sets the time-axis receptive field of the learning model and the fewer hidden layers it gives the learning model.

In step 403, the decoding device generates an output signal for each of the plurality of frequency bands from the quantized feature vectors, using a learning model that reconstructs the decoded input signal to generate the output signal.

In other words, the decoding device generates the output signal from the learning model by using the decoded quantized feature vector as additional information for the learning model. Specifically, for each of the plurality of frequency bands, the decoding device determines the audio sample at time t through the learning model from the audio samples generated before time t and the feature vector at time t, and generates the output signal by combining the determined audio samples.
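The sample-by-sample generation and the final combination of bands can be outlined as follows. This is a sketch under assumptions: `toy_predictor` is a hypothetical deterministic stand-in for the trained learning model (a real model would output a probability distribution and sample from it), the features are assumed to be already upsampled to one value per sample, and the bands are combined by simple addition, which the text does not specify.

```python
import numpy as np

def generate_band(features, predict_sample, context=4):
    """Generate one band: sample t depends on earlier samples and the feature at t."""
    samples = []
    for h_t in features:
        past = np.array(samples[-context:], dtype=float)
        samples.append(predict_sample(past, h_t))
    return np.array(samples)

def toy_predictor(past, h_t):
    """Hypothetical stand-in for the trained model: blend of last sample and feature."""
    prev = past[-1] if past.size else 0.0
    return 0.5 * prev + h_t

band_features = {
    "low": np.array([1.0, 0.0, 0.0, 0.0]),
    "high": np.array([0.0, 1.0, 0.0, 0.0]),
}
band_outputs = {name: generate_band(f, toy_predictor)
                for name, f in band_features.items()}
output_signal = sum(band_outputs.values())   # assumed combination: add the bands
```

Each band is generated independently by its own loop, so the per-band models could run in parallel before the combination step.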

In step 404, the decoding device trains the learning model by comparing the output signal with the input signal for each of the plurality of frequency bands. Specifically, the decoding device trains the learning model by updating the weights of the per-band learning model so that the value of a loss function that calculates the difference between the input signal and the output signal for each frequency band is minimized.

Meanwhile, the method according to the present invention may be written as a program executable on a computer and may be implemented on various recording media such as magnetic storage media, optical reading media, and digital storage media.

Implementations of the various techniques described herein may be realized in digital electronic circuitry, or in computer hardware, firmware, software, or combinations thereof. Implementations may be realized as a computer program product, that is, a computer program tangibly embodied in an information carrier, for example a machine-readable storage device (computer-readable medium) or a propagated signal, for processing by, or for controlling the operation of, a data processing apparatus such as a programmable processor, a computer, or multiple computers. A computer program, such as the computer program(s) described above, may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be processed on one computer or on multiple computers at one site, or distributed across multiple sites and interconnected by a communication network.

Processors suitable for processing a computer program include, by way of example, both general-purpose and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor receives instructions and data from a read-only memory, a random access memory, or both. The elements of a computer may include at least one processor that executes instructions and one or more memory devices that store instructions and data. Generally, a computer may also include, or be coupled to receive data from, transmit data to, or both, one or more mass storage devices for storing data, for example magnetic disks, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include, by way of example, semiconductor memory devices; magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and ROM (Read Only Memory), RAM (Random Access Memory), flash memory, EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), and the like. The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.

In addition, a computer-readable medium may be any available medium that can be accessed by a computer, and may include both computer storage media and transmission media.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment may also be implemented in multiple embodiments, either individually or in any suitable subcombination. Furthermore, although features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Likewise, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain cases, multitasking and parallel processing may be advantageous. Further, the separation of various device components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and devices can generally be integrated together in a single software product or packaged into multiple software products.

Meanwhile, the embodiments of the present invention disclosed in this specification and the drawings are merely specific examples presented to aid understanding and are not intended to limit the scope of the present invention. It will be apparent to those of ordinary skill in the art to which the present invention pertains that other modifications based on the technical spirit of the present invention can be implemented in addition to the embodiments disclosed herein.

101: encoding device
102: decoding device

Claims

1. A method for encoding an audio signal, the method comprising:
dividing an input signal corresponding to the audio signal into a plurality of frequency bands;
extracting a feature vector for each of the plurality of frequency bands;
quantizing the feature vector generated for each of the plurality of frequency bands; and
encoding the quantized feature vectors into a bitstream and transmitting the bitstream to a decoding device.

2. The method of claim 1, wherein dividing the input signal into the plurality of frequency bands comprises dividing the input signal into the plurality of frequency bands using a band-pass filter that removes interference between the frequency bands.

3. The method of claim 1, wherein extracting the feature vector comprises generating the feature vector by applying a domain transform to the input signal divided into the plurality of frequency bands.

4. The method of claim 1, wherein quantizing the feature vector comprises quantizing the feature vector by estimating the amount of information required for each of the plurality of frequency bands in consideration of human auditory characteristics and the total number of bits available for quantization, and allocating bits to each of the plurality of frequency bands according to the estimated amount of information.

5. A method for decoding an audio signal, the method comprising:
receiving a bitstream from an encoding device;
determining a quantized feature vector for each of a plurality of frequency bands by decoding the bitstream;
generating an output signal for each of the plurality of frequency bands by inputting the quantized feature vectors for each of the plurality of frequency bands into a learning model; and
combining the output signals generated for each of the plurality of frequency bands.

6. The method of claim 5, wherein the learning model has a network structure consisting of an input layer, a hidden layer having a plurality of weights, and an output layer, and generates, for time-series input data, output data for a current time from input data of previous times.

7. The method of claim 5, wherein generating the output signal comprises determining an audio sample of the output signal through the learning model, using the quantized feature vector for each frequency band as additional information.

8. A method for training a learning model used for decoding an audio signal, the method comprising:
receiving a bitstream from an encoding device;
determining a quantized feature vector for each of a plurality of frequency bands by decoding the bitstream;
generating an output signal for each of the plurality of frequency bands by inputting the quantized feature vectors for each of the plurality of frequency bands into a learning model; and
training the learning model by comparing, for each of the plurality of frequency bands, the output signal with the input signal that is input to the encoding device.

9. The method of claim 8, wherein the learning model has a network structure consisting of an input layer, a hidden layer having a plurality of weights, and an output layer, and generates, for time-series input data, output data for a current time from input data of previous times.

10. The method of claim 9, wherein training the learning model comprises updating the weights so that an output value of a loss function that calculates a difference between the output signal and the input signal is minimized.

11. A method for decoding an audio signal, the method comprising:
receiving a bitstream from an encoding device;
decoding the bitstream to determine a feature vector representing a feature of an audio sample for each of a plurality of frequency bands;
generating, for each of the plurality of frequency bands, an audio sample at time t using the feature vector and audio samples before time t; and
generating an output signal by combining the audio samples generated for each of the plurality of frequency bands.

12. The method of claim 11, wherein generating the audio sample at time t comprises generating the audio sample for each of the plurality of frequency bands by inputting the feature vector to a pre-trained deep learning-based learning model.

13. The method of claim 12, wherein the learning model has a network structure consisting of an input layer, a hidden layer having a plurality of weights, and an output layer, and generates, for time-series input data, output data for a current time from input data of previous times.

14. An apparatus for decoding an audio signal, the apparatus comprising a processor,
wherein the processor is configured to receive a bitstream from an encoding device, decode the bitstream to determine a quantized feature vector for each of a plurality of frequency bands, and generate an output signal for each of the plurality of frequency bands by inputting the quantized feature vectors for each of the plurality of frequency bands into a learning model.

15. The apparatus of claim 14, wherein the learning model has a network structure consisting of an input layer, a hidden layer having a plurality of weights, and an output layer, and generates, for time-series input data, output data for a current time from input data of previous times.

16. The apparatus of claim 14, wherein the processor is configured to determine an audio sample of the output signal through the learning model, using the quantized feature vector for each frequency band as additional information.