KR20220109299A

KR20220109299A - Method for Encoding and Decoding an Audio Signal Using a Neural Network Model, and an Encoder and Decoder Performing the Method

Info

Publication number: KR20220109299A
Application number: KR1020210152153A
Authority: KR
Inventors: 성종모; 백승권; 이태진; 임우택; 장인선
Original assignee: 한국전자통신연구원
Priority date: 2021-01-28
Filing date: 2021-11-08
Publication date: 2022-08-04

Abstract

Disclosed are a method for encoding and decoding an audio signal using a learning model, and an encoder and a decoder that perform the method. According to one embodiment of the present invention, the method may include: extracting pitch information of the audio signal; determining, based on the pitch information, a dilation factor of a first dilated neural network block that extracts a feature map from the audio signal; determining a receptive field of the first dilated neural network block according to the dilation factor; generating a first feature map of the audio signal using the first dilated neural network block; determining a second feature map by inputting the first feature map to a second dilated neural network block that processes a feature map; and converting the second feature map and the pitch information into a bitstream. The present invention can effectively remove long-term redundancy in the audio signal.

Description

Method for Encoding and Decoding an Audio Signal Using a Neural Network Model, and an Encoder and Decoder Performing the Method

The present invention relates to a method for encoding and decoding an audio signal using a neural network model, and to an encoder and a decoder that perform the method. More specifically, it relates to an encoding and decoding technique that removes the redundancy inherent in an audio signal by using a neural network model that exploits pitch information of the audio signal.

Recently, as artificial intelligence technology has advanced, it has been applied to various fields such as the processing of speech, audio, language, and video signals, and research on it is being actively conducted. A representative example is a technique that extracts features of an audio signal using a deep-learning-based autoencoder and restores the audio signal from the extracted features.

However, when a conventional artificial intelligence model is used to restore an audio signal, computational complexity may increase, and the short-term redundancy and long-term redundancy inherent in the audio signal may be removed inefficiently. A technique for solving these problems is therefore required.

The present invention provides a method and an apparatus that can effectively remove the long-term redundancy inherent in an audio signal during encoding and decoding by variably determining the dilation factor of a neural network model using pitch information of the audio signal.

The present invention also provides a method and an apparatus that can improve the quality of the reconstructed audio signal and reduce computational complexity by determining the dilation factor of the neural network model using pitch information of the audio signal.

A method of encoding an audio signal using a neural network model according to an embodiment of the present invention may include: extracting pitch information of the audio signal; determining, based on the pitch information, a dilation factor for the receptive field of a first dilated neural network block that extracts a feature map from the audio signal; generating a first feature map of the audio signal using the first dilated neural network block; determining a second feature map by inputting the first feature map to a second dilated neural network block that processes the first feature map; and quantizing each of the second feature map and the pitch information and converting them into a bitstream.

Generating the first feature map may include converting the number of channels of the audio signal and inputting the result to the first dilated neural network block to generate the first feature map, and determining the second feature map may further include converting the number of channels of the determined second feature map.

Determining the second feature map may include down-sampling the first feature map to reduce its dimension, and inputting the down-sampled first feature map to the second dilated neural network block to determine the second feature map.

The dilation factor of the first dilated neural network block may be determined by approximating the receptive field of the first dilated neural network block to the pitch information.

The dilation factor of the second dilated neural network block may be fixed to a predetermined value, and the receptive field of the second dilated neural network block may be determined by that fixed dilation factor.

The method may further include quantizing each of the second feature map and the pitch information, and converting into the bitstream may include multiplexing the quantized second feature map and the quantized pitch information into a single bitstream.

A method of decoding an audio signal using a neural network model according to an embodiment of the present invention may include: extracting a second feature map of the audio signal and pitch information of the audio signal from a bitstream received from an encoder; reconstructing a first feature map by inputting the second feature map to a second dilated neural network block; determining a dilation factor of a first dilated neural network block based on the pitch information; and reconstructing the audio signal from the first feature map using the first dilated neural network block.

Reconstructing the first feature map may include converting the number of channels of the second feature map and inputting the result to the second dilated neural network block to reconstruct the first feature map, and reconstructing the audio signal may further include converting the number of channels of the reconstructed audio signal so that it equals the number of channels of the original input signal.

Reconstructing the audio signal may include up-sampling the first feature map to expand its dimension, and inputting the up-sampled first feature map to the first dilated neural network block to determine the audio signal.

The dilation factor of the first dilated neural network block may be determined by approximating the receptive field of the first dilated neural network block to the pitch information.

The dilation factor of the second dilated neural network block may be fixed to a predetermined value, and the receptive field of the second dilated neural network block may be determined by that dilation factor.

Extracting the second feature map and the pitch information of the audio signal may further include inverse-quantizing each of the second feature map and the pitch information.

An encoder that performs the method of encoding an audio signal according to an embodiment of the present invention includes a processor. The processor may: extract pitch information of the audio signal; determine, based on the pitch information, a dilation factor of a first dilated neural network block that extracts a feature map from the audio signal; generate a first feature map of the audio signal using the first dilated neural network block; determine a second feature map by inputting the first feature map to a second dilated neural network block that processes a feature map; and quantize each of the second feature map and the pitch information and convert them into a bitstream.

The processor may down-sample the first feature map to reduce its dimension, and input the down-sampled first feature map to the second dilated neural network block to determine the second feature map.

A decoder that performs the method of decoding an audio signal according to an embodiment of the present invention includes a processor. The processor may: extract a second feature map of the audio signal and pitch information of the audio signal from a bitstream received from an encoder; reconstruct a first feature map by inputting the second feature map to a second dilated neural network block that reconstructs a feature map; determine, based on the pitch information, a dilation factor of a first dilated neural network block that reconstructs an audio signal from a feature map; and reconstruct the audio signal from the first feature map using the first dilated neural network block.

The processor may up-sample the first feature map to expand its dimension, and input the up-sampled first feature map to the first dilated neural network block to determine the audio signal.

According to an embodiment of the present invention, by variably determining the dilation factor of a dilated neural network model using pitch information of the audio signal, the long-term redundancy inherent in the audio signal can be effectively removed in neural-network-based audio encoding and decoding.

In addition, according to an embodiment of the present invention, determining the dilation factor of the dilated neural network model from pitch information of the audio signal yields a neural network encoding and decoding model that adapts to the characteristics of the audio signal. This can improve the quality of the reconstructed audio signal and reduce computational complexity compared with a conventional dilated neural network model that has a fixed dilation factor.

FIG. 1 shows an encoder and a decoder according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating how an encoding method and a decoding method according to an embodiment of the present invention are processed.
FIGS. 3A and 3B are diagrams illustrating the layer structure of a neural network model according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the layer structure of a neural network model determined according to pitch information according to an embodiment of the present invention.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. Because various changes may be made to the embodiments, the scope of the patent application is not restricted or limited by these embodiments. All modifications, equivalents, and substitutes for the embodiments should be understood to fall within the scope of the rights.

The terms used in the embodiments are used for description only and should not be construed as limiting. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" indicate that a feature, number, step, operation, component, part, or combination thereof described in the specification is present, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

In the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure number, and duplicate descriptions of them are omitted. In describing the embodiments, a detailed description of related known technology is omitted where it is judged that it could unnecessarily obscure the gist of the embodiments.

FIG. 1 shows an encoder and a decoder according to an embodiment of the present invention.

The present invention may relate to a technique that, in encoding and decoding an audio signal, determines the receptive field of an artificial-intelligence-based neural network model using pitch information of the audio signal, and encodes and decodes the audio signal through that neural network model, thereby reducing the short-term redundancy and long-term redundancy that arise when the audio signal is encoded and decoded.

The encoder and the decoder that perform the encoding method and the decoding method of the present invention may be devices that include a processor, such as a smartphone, a desktop computer, or a laptop computer. The encoder and the decoder may be different electronic devices or the same electronic device.

The encoding and decoding model of the present invention may be a neural network model based on deep learning. For example, the encoding and decoding model may be an autoencoder composed of convolutional neural networks. The encoding and decoding models usable in the present invention are not limited to the described examples, and various kinds of neural network models may be used.

The neural network model may include an input layer, hidden layers, and an output layer, and each layer may include a plurality of nodes. The nodes of each layer may be computed as the product of the nodes of the previous layer and a matrix of weights. The weights of the matrix between layers may be updated while the neural network model is trained. In a convolutional neural network in particular, a filter, which is a weight matrix, is used to compute the feature map of each layer. The feature map of each layer is generally computed with a plurality of filters, and the number of filters used is called the number of channels.

The neural network model generates output data for input data; the input layer may correspond to the input data of the neural network model, and the output layer may correspond to its output data. In the present invention, the input data and the output data may each be a vector representing an audio signal of a fixed length (a frame), and when they consist of a plurality of audio frames, the input and output data may be expressed as a two-dimensional matrix.

The feature map of each layer of the neural network model may be a one-dimensional vector, a two-dimensional matrix, or a multi-dimensional tensor that represents characteristics of the audio signal. For example, a feature map may be data obtained by an operation between the input data, or the feature map of the previous layer, and the weight filters of the layer. The receptive field of the neural network model is the number of input nodes used to compute the value of each node of the output layer, and is determined by the length of the weight filters and the number of layers that make up the model. The receptive field of a dilated neural network model is additionally determined by the dilation factor. How the receptive field of the neural network model depends on the dilation factor is described later with reference to FIGS. 3A, 3B, and 4.
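As a point of reference, the dependence of the receptive field on filter length, number of layers, and dilation factor can be sketched as follows. The sketch assumes stride-1 one-dimensional convolutions that all share one filter length; the function name and the example dilation values are illustrative and are not taken from the specification.

```python
from typing import Sequence


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field, in input samples, of stacked stride-1 dilated convolutions.

    `dilations` holds one dilation factor per layer (an illustrative assumption).
    """
    # Each layer with dilation d widens the field by (kernel_size - 1) * d samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Three layers with filter length 3: no dilation versus doubling dilation factors.
plain = receptive_field(3, [1, 1, 1])    # 7 samples
dilated = receptive_field(3, [1, 2, 4])  # 15 samples
```

With the same number of layers and the same filter length, larger dilation factors cover a longer span of the input, which is the property the dilated blocks rely on.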

The number of channels of the input signal depends on how the original signal is represented. For example, mono and stereo audio signals have 1 and 2 channels respectively, and an RGB color image signal has 3 channels. In a convolutional neural network, the number of channels of an output feature map is determined by the number of convolution filters used to compute that output feature map.

The pitch information of an audio signal may mean information that indicates the periodicity of the audio signal. For example, pitch information expresses the periodicity inherent in the input audio signal, is used in typical audio compressors to model the long-term redundancy of the signal, and may mean the pitch lag of each frame. That is, pitch information may generally be defined as the difference between a given time point and the earlier time point found by searching for the time point at which the correlation between the audio signal at the given time point and the audio signal at earlier time points is greatest. The searched time points may include time points within the current audio signal frame and time points in previous frames.

Referring to FIG. 1, the encoder of the present invention may encode an input signal to generate a bitstream, and the decoder may generate an output signal from the bitstream received from the encoder. The input signal may mean the original audio signal received by the encoder, and the output signal may mean the audio signal reconstructed by the decoder. The specific operations of encoding and decoding an audio signal using the learning model are described later with reference to FIG. 2.

FIG. 2 is a diagram illustrating how an encoding method and a decoding method according to an embodiment of the present invention are processed.

According to an embodiment of the present invention, a neural network model including a channel conversion block (201), a first dilated neural network block (202), a down-sampling block (203), a second dilated neural network block (204), and a channel conversion block (205) may be used to encode the input signal.

In the pitch information extraction (206) operation, the encoder (101) may extract pitch information of the audio signal. For example, the encoder (101) may extract the pitch information by computing the normalized autocorrelation of the audio signal frame at each lag within a predetermined pitch lag search range and then finding the lag with the maximum value. The detailed method of extracting pitch information is not limited to the described example.
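The normalized-autocorrelation search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the encoder's actual implementation: the function name, the search range, and the test signal are hypothetical, and the search is restricted to a single buffer rather than also covering previous frames.

```python
import math


def pitch_lag(frame, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the largest normalized autocorrelation."""
    best_lag, best_score = min_lag, -math.inf
    for lag in range(min_lag, max_lag + 1):
        current = frame[lag:]                # samples at the current time points
        earlier = frame[:len(frame) - lag]   # the same samples, `lag` time points earlier
        num = sum(x * y for x, y in zip(current, earlier))
        den = math.sqrt(sum(x * x for x in current) * sum(y * y for y in earlier))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical test signal: a sine wave that repeats every 50 samples.
frame = [math.sin(2 * math.pi * n / 50) for n in range(400)]
lag = pitch_lag(frame, 20, 80)  # 50
```

Normalizing by the energy of both segments keeps the score within [-1, 1], so lags are compared fairly even though the overlapping segment gets shorter as the lag grows.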

In the quantization (207) operation, the encoder (101) may quantize the extracted pitch information to a value that can be represented with a predetermined number of bits. The encoder (101) may also convert the quantized pitch information into a bitstream.
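One simple way to represent a pitch lag with a predetermined number of bits is uniform scalar quantization over the search range. The specification does not state which quantizer is used, so the sketch below is an assumption, and the function names, range, and bit count are hypothetical.

```python
def quantize_lag(lag, min_lag, max_lag, bits):
    """Map a lag to an integer index that fits in `bits` bits (uniform quantizer)."""
    levels = (1 << bits) - 1                 # largest index representable with `bits` bits
    step = (max_lag - min_lag) / levels
    index = round((lag - min_lag) / step)
    return max(0, min(levels, index))        # clip lags outside the range


def dequantize_lag(index, min_lag, max_lag, bits):
    """Inverse mapping used on the decoder side."""
    levels = (1 << bits) - 1
    step = (max_lag - min_lag) / levels
    return min_lag + index * step


# Hypothetical range 20..275 with 8 bits gives a step of exactly one sample.
index = quantize_lag(50, 20, 275, 8)           # 30
restored = dequantize_lag(index, 20, 275, 8)   # 50.0
```

Because the decoder only sees the quantized value, the encoder should derive its dilation factor from the quantized pitch as well, so that both sides use the same factor.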

The encoder (101) may determine the dilation factor of the first dilated neural network block (202) based on the quantized pitch information. The receptive field of the first dilated neural network block (202) may be determined by the filter length, the number of layers, and the dilation factor. The filter length and the number of layers are fixed in advance when the neural network model is designed, whereas the dilation factor is computed for every audio frame from the quantized pitch information.
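The per-frame choice of dilation factor can be sketched as follows. The specification states only that the receptive field is approximated to the pitch information. The rule below, which assumes one shared dilation factor across all stride-1 layers and rounds to the nearest integer, is one possible reading, and the function name and parameter values are hypothetical.

```python
def dilation_from_pitch(pitch_lag, kernel_size, num_layers):
    """Pick a dilation factor whose receptive field is close to the pitch lag.

    Assumes every layer uses the same dilation d, so that
    receptive_field = 1 + num_layers * (kernel_size - 1) * d.
    """
    span_per_unit = num_layers * (kernel_size - 1)
    dilation = max(1, round((pitch_lag - 1) / span_per_unit))
    field = 1 + span_per_unit * dilation
    return dilation, field


# Fixed design parameters (hypothetical): filter length 3, two layers.
d_short, rf_short = dilation_from_pitch(50, 3, 2)    # (12, 49)
d_long, rf_long = dilation_from_pitch(120, 3, 2)     # (30, 121)
```

The filter length and the layer count stay fixed while the dilation factor follows the pitch, so a longer pitch lag is covered without adding layers or lengthening filters.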

The first dilated neural network block (202) is a convolutional neural network that computes a new output feature map from an input feature map, and may be a neural network block whose dilation factor is variably determined according to the pitch information. The first dilated neural network block (202) may be distinguished from the second dilated neural network block (204), whose dilation factor is fixed.

Unlike a conventional dilated neural network with a fixed dilation factor, the present invention does not excessively increase the number of layers or the filter length of the neural network block to obtain a wide receptive field. Instead, it variably determines the dilation factor of the first dilated neural network block (202) based on the pitch information. A receptive field large enough for long-term modeling can thus be obtained with relatively few layers, which reduces computational complexity.

For example, the channel conversion blocks (201, 205), the down-sampling block (203), and the first and second dilated neural network blocks (202, 204) used in the encoder (101) may be the encoder-side components of an autoencoder built from convolutional neural networks, and the channel conversion blocks (212, 216), the up-sampling block (214), and the first and second dilated neural network blocks (215, 213) used in the decoder (102) may be the decoder-side components of that autoencoder.

For example, in the encoder (101), the channel conversion block (201) may be a neural network block that applies a convolution with many filters (as many as the channels of the output feature map) to an input audio signal that has only one or two channels, thereby extracting various features contained in the input signal and outputting a channel-converted feature map.

The first dilated neural network block (202) used in the encoder (101) may be a neural network block that applies a dilated convolution, with a dilation factor based on the quantized pitch information, to the channel-converted feature map output by the channel conversion block (201), and outputs a first feature map from which the long-term redundancy inherent in the audio signal has been removed. The first feature map is the feature map output by the first dilated neural network block used in the encoder (101); it may be used as the input data of the second dilated neural network block and is distinct from the second feature map, which is the output data of the second dilated neural network block. The second feature map may mean the feature map obtained when the first feature map is processed by the second dilated neural network block.
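A toy, single-channel illustration of why a dilation factor matched to the pitch helps remove long-term redundancy is given below. The filter weights here are set by hand purely for demonstration; in the described model they would be learned, and the function name is hypothetical.

```python
def dilated_conv1d(x, weights, dilation):
    """'Valid' 1-D dilated convolution (cross-correlation form, single channel)."""
    span = (len(weights) - 1) * dilation     # distance between first and last tap
    return [
        sum(w * x[n + i * dilation] for i, w in enumerate(weights))
        for n in range(len(x) - span)
    ]


# A signal that repeats every 4 samples (pitch lag of 4).
x = [0, 1, 2, 3] * 5

# Taps spaced one period apart: each output is x[n + 4] - x[n], which is zero.
matched = dilated_conv1d(x, [-1, 1], 4)

# Taps spaced one sample apart leave the periodic structure in the output.
unmatched = dilated_conv1d(x, [-1, 1], 1)
```

When the tap spacing equals the period, a two-tap filter already predicts each sample from the one a period earlier, so the residual vanishes for this idealized signal.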

The down-sampling block (203) used in the encoder (101) may be a neural network block that applies, for example, a strided convolution or a convolution combined with pooling to the first feature map output by the first dilated neural network block (202), and outputs a down-sampled feature map whose dimension is reduced relative to the input feature map.
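The dimension reduction achieved by a strided convolution can be sketched for a single channel as follows. The averaging weights and the stride of 2 are illustrative assumptions; the actual block would use learned filters over multiple channels.

```python
def strided_conv1d(x, weights, stride):
    """'Valid' 1-D convolution evaluated only at every `stride`-th position."""
    k = len(weights)
    return [
        sum(w * x[n + i] for i, w in enumerate(weights))
        for n in range(0, len(x) - k + 1, stride)
    ]


# Eight input samples become four output samples with stride 2.
x = [1, 2, 3, 4, 5, 6, 7, 8]
down = strided_conv1d(x, [0.5, 0.5], 2)  # [1.5, 3.5, 5.5, 7.5]
```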

The second dilated neural network block (204) used in the encoder (101) may be a neural network block that applies a dilated convolution with a fixed dilation factor to the feature map output by the down-sampling block (203), and outputs a second feature map from which the short-term redundancy inherent in the audio signal has been removed. The encoder (101) may determine the second feature map from the down-sampled first feature map using the second dilated neural network block. The second feature map may be smaller than the first feature map.

The channel transformation block 205 used in the encoder 101 may be a neural network block that applies a convolution with a predetermined number of filters to the second feature map output by the second expandable neural network block 204, and outputs a latent feature map whose channels have been transformed for quantization.

The channel transformation block 205 may transform the channels of the second feature map. That is, because the channels of the second feature map are set to correspond to the filter length of the second expandable neural network block (e.g., the number of weight filters used in the l-th layer to determine a weight filter of the (l+1)-th layer), the channel transformation block 205 may transform the channels of the second feature map into the channels of the input signal.

In the quantization operation 208, the encoder 101 may quantize the latent feature map output by the channel transformation block 205 into values that can be represented with a predetermined number of bits. The encoder 101 may also convert the quantized latent feature map into a bitstream.
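The specification does not state which quantizer is used. As a minimal sketch, the example below assumes a uniform scalar quantizer over a fixed range [`lo`, `hi`] and a fixed-width binary encoding; the function names, the range, and the bit layout are all hypothetical.

```python
import numpy as np

def quantize(z, n_bits, lo=-1.0, hi=1.0):
    """Map values in [lo, hi] to integer indices 0 .. 2**n_bits - 1."""
    levels = 2 ** n_bits
    z = np.clip(np.asarray(z, dtype=float), lo, hi)
    return np.round((z - lo) / (hi - lo) * (levels - 1)).astype(int)

def dequantize(idx, n_bits, lo=-1.0, hi=1.0):
    """Map integer indices back to reconstruction values in [lo, hi]."""
    levels = 2 ** n_bits
    return lo + np.asarray(idx) * (hi - lo) / (levels - 1)

def to_bitstring(idx, n_bits):
    """Write each index as a fixed-width binary word and concatenate."""
    return "".join(format(int(i), f"0{n_bits}b") for i in idx)

indices = quantize([-1.0, 0.4, 1.0], n_bits=2)
print(indices, to_bitstring(indices, 2))
```

A learned codec could equally use a vector quantizer or entropy coding; the uniform quantizer is chosen here only because it is the simplest way to show a value being limited to a predetermined number of bits.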

In the multiplexing operation 209, the encoder 101 multiplexes the quantized pitch information bitstream and the quantized latent feature map bitstream and outputs the overall bitstream.

According to an embodiment of the present invention, a neural network model including a channel transformation block 212, a first expandable neural network block 215, an up-sampling block 214, a second expandable neural network block 213, and a channel transformation block 216 may be used to decode the audio signal.

In the demultiplexing operation 210, the decoder 102 demultiplexes the overall bitstream received from the encoder 101 and extracts the quantized pitch information bitstream and the quantized latent feature map bitstream, respectively.
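The framing of the overall bitstream is not specified. One simple possibility, shown here purely as an assumption, is to prefix the pitch payload with its length so that the decoder can split the two parts again. The function names and the 4-byte big-endian length field are hypothetical.

```python
import struct

def multiplex(pitch_bits: bytes, latent_bits: bytes) -> bytes:
    """Concatenate both payloads behind a 4-byte length of the pitch payload."""
    return struct.pack(">I", len(pitch_bits)) + pitch_bits + latent_bits

def demultiplex(stream: bytes):
    """Split a stream produced by multiplex() back into its two payloads."""
    (n_pitch,) = struct.unpack(">I", stream[:4])
    return stream[4:4 + n_pitch], stream[4 + n_pitch:]

stream = multiplex(b"\x12", b"\xab\xcd")
print(demultiplex(stream))
```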

In the dequantization operation 217, the decoder 102 dequantizes the quantized pitch information bitstream to extract the quantized pitch information. In the dequantization operation 211, the decoder 102 dequantizes the quantized latent feature map bitstream to extract the quantized latent feature map.

The channel transformation block 212 used in the decoder 102 may be a neural network block that applies a convolution with a predetermined number of filters to the quantized latent feature map obtained through the dequantization process, and outputs a second feature map in which the short-term redundancy inherent in the audio signal is restored.

The channel transformation block 212 may transform the channels of the second feature map. Specifically, the channel transformation block 212 may transform the channels of the second feature map so that they correspond to the filter length of the second expandable neural network block (e.g., the number of weight filters used in the l-th layer to determine a weight filter of the (l+1)-th layer).

The second expandable neural network block 213 used in the decoder 102 may be a neural network block that restores the down-sampled feature map by applying a dilated convolution with a fixed expansion factor to the second feature map output by the channel transformation block 212.

The up-sampling block 214 used in the decoder 102 may be a neural network block that applies, for example, a deconvolution or a sub-pixel convolution to the down-sampled feature map output by the second expandable neural network block 213, and restores a first feature map whose dimension is expanded relative to the input feature map.
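In a sub-pixel convolution, an ordinary convolution first produces r times as many channels, and those channels are then interleaved along the time axis to give an r times longer feature map. The sketch below shows only the interleaving step in one dimension; the function name `pixel_shuffle_1d` and the channel ordering (channel index `c * r + j` holds phase `j` of output channel `c`) are assumptions for illustration.

```python
import numpy as np

def pixel_shuffle_1d(x, r):
    """Rearrange an array of shape (C * r, N) into shape (C, N * r).

    out[c, t * r + j] = x[c * r + j, t]
    """
    x = np.asarray(x)
    c_r, n = x.shape
    if c_r % r != 0:
        raise ValueError("number of channels must be divisible by r")
    c = c_r // r
    # (C, r, N) -> (C, N, r) -> (C, N * r): phases become adjacent samples.
    return x.reshape(c, r, n).transpose(0, 2, 1).reshape(c, n * r)

x = np.array([[0, 1, 2],
              [10, 11, 12]])
print(pixel_shuffle_1d(x, 2))   # two channels of length 3 -> one of length 6
```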

The first expandable neural network block 215 used in the decoder 102 may be a neural network block that applies a dilated convolution, whose expansion factor is based on the quantized pitch information, to the first feature map output by the up-sampling block 214, and outputs a channel-transformed feature map in which the long-term redundancy inherent in the audio signal is restored.

The channel transformation block 216 used in the decoder 102 may be a neural network block that restores the input audio signal by applying, to the channel-transformed feature map output by the first expandable neural network block, a convolution that has as many filters as the original input audio signal has channels.

The channel transformation block 216 may transform the channels of the restored output signal. For example, because the channels of the restored output signal may correspond to the filter length of the first expandable neural network block (the number of weight filters used in the l-th layer to determine a weight filter of the (l+1)-th layer), the channel transformation block 216 may transform the channels of the output signal into mono or stereo channels so that they correspond to the channels of the input signal.

Model parameters, such as the convolution filters and biases, of all neural network blocks used in the encoder 101 and the decoder 102 may be trained by comparing the audio signal restored by the decoder 102 with the original audio signal input to the encoder 101. That is, the model parameters of the channel transformation blocks 201, 205, 212, and 216, the down-sampling block 203, the up-sampling block 214, the first expandable neural network blocks 202 and 215, and the second expandable neural network blocks 204 and 213 may be updated so that the difference between the audio signal restored by the decoder 102 and the audio signal input to the encoder 101 is minimized.
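The specification says only that the difference between the input and the restored signal is minimized and does not name a particular loss. As one possible instance, assuming a mean squared error over N samples, with θ standing for all model parameters of the encoder and decoder blocks, x[n] for the input signal, and x̂_θ[n] for the restored signal (all of these symbols are introduced here for illustration), the training objective could be written as:

```latex
\theta^{*} = \arg\min_{\theta} \; \frac{1}{N} \sum_{n=0}^{N-1} \bigl( x[n] - \hat{x}_{\theta}[n] \bigr)^{2}
```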

For example, the receptive fields of the first expandable neural network blocks 202 and 215 and the second expandable neural network blocks 204 and 213 may be determined from the expansion factors according to Equation 1 below.

In Equation 1, r denotes the receptive field of an expandable neural network block 202, 204, 215, or 213, and L denotes the total number of layers included in that block. k_l denotes the length of the convolution filter between the l-th layer and the (l+1)-th layer, and may have the same value regardless of the layer. d_l denotes the expansion factor of the l-th layer and may, for example, be determined according to Equation 2 below. When the number of layers and the length of the weight filter are fixed, the receptive field of an expandable neural network block can be expressed as a function of the expansion factor, as in Equation 1.
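The image of Equation 1 is not reproduced in this text. The standard receptive-field expression for a stack of dilated convolutions, written with the symbols defined above, is given below as a reconstruction rather than a verbatim copy. It assumes that L counts the convolutional transitions between layers; under that assumption it yields the receptive fields of 5 and 7 described for FIGS. 3A and 3B (filter length 3, two transitions, expansion factors 1, 1 and 1, 2 respectively).

```latex
r = 1 + \sum_{l=1}^{L} (k_l - 1)\, d_l
```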

Referring to Equation 2, the expansion factor of the l-th layer may be set to twice the expansion factor of the (l-1)-th layer. However, the relationship between the expansion factor of the (l-1)-th layer and that of the l-th layer is not limited to this example.
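The image of Equation 2 is likewise missing here. Written out from the sentence above, the doubling relationship and the closed form that follows from it (the closed form is a direct consequence, not a quotation) are:

```latex
d_l = 2\, d_{l-1} \quad (l = 2, \dots, L), \qquad \text{so that} \qquad d_l = 2^{\,l-1}\, d_1
```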

For example, the expansion factor for each layer of the first expandable neural network block may be determined based on the pitch information of the audio signal, whereas the expansion factor for each layer of the second expandable neural network blocks 204 and 213 may be set in advance to a fixed value regardless of the audio signal.

For example, in the expansion factor determination processes 204 and 217 of the encoder 101 and the decoder 102, Equations 3 and 4 below may be used to determine the expansion factor of the first expandable neural network blocks 202 and 215 based on the pitch information of the audio signal.

In Equation 3, r denotes the receptive field of the first expandable neural network block 202, 215, and p denotes the quantized pitch delay of the audio signal. In the present invention, to reduce long-term redundancy, the receptive field of the first expandable neural network block 202, 215 may be determined so as to match the pitch delay of the audio signal.

In Equation 4, d_1 denotes the expansion factor of the first layer of the first expandable neural network block 202, 215. k denotes the length of the convolution filter between the l-th layer and the (l+1)-th layer. L denotes the total number of layers included in the first expandable neural network block 202, 215. Equation 4 involves a rounding operation. From the expansion factor d_1 of the first layer, the expansion factors of the remaining layers can be obtained through the relationship defined in Equation 2.
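The images of Equations 3 and 4 are not available in this text, so the following is only a sketch of one derivation that is consistent with the surrounding definitions. It assumes the receptive field has the form r = 1 + (k - 1) * (d_1 + ... + d_L), that every layer doubles the previous expansion factor, and that r is set approximately equal to the pitch delay p. Solving for d_1 then gives d_1 = round((p - 1) / ((k - 1) * (2^L - 1))). The function names, the round-half-up rule, and the lower clamp of 1 are assumptions made for this example.

```python
import math

def first_expansion_factor(pitch_lag, kernel_len, num_layers):
    """Expansion factor of the first layer so that the receptive field ~ pitch_lag."""
    denom = (kernel_len - 1) * (2 ** num_layers - 1)
    d1 = math.floor((pitch_lag - 1) / denom + 0.5)   # round half up
    return max(1, d1)                                # assumed clamp: at least 1

def expansion_factors(pitch_lag, kernel_len, num_layers):
    """All per-layer expansion factors, doubling from layer to layer."""
    d1 = first_expansion_factor(pitch_lag, kernel_len, num_layers)
    return [d1 * 2 ** l for l in range(num_layers)]

def receptive_field(kernel_len, factors):
    """Receptive field of stacked dilated convolutions with equal kernel length."""
    return 1 + (kernel_len - 1) * sum(factors)

factors = expansion_factors(pitch_lag=64, kernel_len=3, num_layers=4)
print(factors, receptive_field(3, factors))
```

Because of the rounding, the resulting receptive field only approximates the pitch delay: in the call above the factors are 2, 4, 8, 16 and the receptive field is 61 samples for a pitch delay of 64.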

In the channel transformation process 219, the decoder 102 may transform the channels of the restored output signal. For example, because the channels of the restored output signal may correspond to the filter length of the first expandable neural network block (the number of weight filters used in the l-th layer to determine a weight filter of the (l+1)-th layer), the decoder 102 may transform the channels of the output signal into mono or stereo channels so that they correspond to the channels of the input signal.

FIGS. 3A and 3B are diagrams illustrating the layer structure of a learning model according to an embodiment of the present invention.

In each of FIGS. 3A and 3B, the filter length of all layers 301-303 and 311-313 (that is, the number of weight filters 304, 314 used in the l-th layer to determine a weight filter 304, 314 of the (l+1)-th layer) may be set to 3. FIG. 3A illustrates, as a layer structure, how the weight filter 304 of the output layer is determined when the receptive field 305 of the learning model is 5 and the expansion factor of the learning model is set to 1 in all layers 301-303.

Referring to FIG. 3A, three weight filters 304 in the input layer 301 may be used to determine a weight filter 304 of the hidden layer 302, and three weight filters 304 in the hidden layer 302 may be used to determine a weight filter 304 of the output layer 303. As a result, five weight filters 304 in the input layer 301 may be used to determine one weight filter 304 in the output layer 303. That is, FIG. 3A shows the case where the receptive field 305 of the learning model is 5.

FIG. 3B illustrates, as a layer structure, how the weight filter 314 of the output layer is determined when the receptive field 315 of the learning model is 7 and the expansion factor of the learning model is 1 in the hidden layer and 2 in the output layer. That is, FIG. 3B shows a case where the expansion factor increases from layer to layer. For example, FIG. 3B may be an example of a dilated convolutional neural network, and FIG. 3A may be an example of an ordinary convolutional neural network.

Referring to FIG. 3B, three weight filters 314 in the input layer 311 may be used to determine a weight filter 314 of the hidden layer 312, and three weight filters 314 in the hidden layer 312 may be used to determine a weight filter 314 of the output layer 313. As a result, seven weight filters 314 in the input layer 311 may be used to determine one weight filter 314 in the output layer 313. That is, FIG. 3B shows the case where the receptive field 315 of the learning model is 7.
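The counts of five and seven input positions in FIGS. 3A and 3B can be checked by enumerating which input offsets reach a single output position. The helper `input_taps` below is a hypothetical name introduced for this check; it assumes a filter length shared by all layers and one expansion factor per convolution.

```python
def input_taps(kernel_len, factors):
    """Return the sorted input offsets that influence one output position."""
    offsets = {0}
    for d in factors:
        # Each convolution with expansion factor d widens every existing
        # offset by 0, d, 2d, ..., (kernel_len - 1) * d.
        offsets = {o + i * d for o in offsets for i in range(kernel_len)}
    return sorted(offsets)

print(len(input_taps(3, [1, 1])))   # FIG. 3A: expansion factor 1 in both steps
print(len(input_taps(3, [1, 2])))   # FIG. 3B: expansion factors 1 and 2
```

The first call covers offsets 0 to 4 and the second covers offsets 0 to 6, matching the receptive fields of 5 and 7 stated for the two figures.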

FIG. 4 is a diagram illustrating the layer structure of a learning model determined according to pitch information, according to an embodiment of the present invention.

FIG. 4 shows a case where the pitch delay 405 (e.g., p) is set to 3 and the filter length of all layers 401-403 is set to 2. Referring to FIG. 4, the expansion factor of the first layer 401 may be determined to be 1 based on the pitch delay. Then, according to the expansion factor of the input layer 401, the expansion factor of the hidden layer 402 may be determined to be 2 and the expansion factor of the output layer may be determined to be 4. Accordingly, the receptive field of the learning model may be determined to be 4.

Meanwhile, the method according to the present invention may be written as a computer-executable program and may be implemented on various recording media such as magnetic storage media, optical reading media, and digital storage media.

Implementations of the various techniques described herein may be realized in digital electronic circuitry, or in computer hardware, firmware, software, or combinations thereof. Implementations may be realized as a computer program product, that is, a computer program tangibly embodied in an information carrier such as a machine-readable storage device (computer-readable medium) or a propagated signal, for processing by, or for controlling the operation of, a data processing apparatus such as a programmable processor, a computer, or multiple computers. A computer program, such as the computer program(s) described above, may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be processed on one computer or on multiple computers at one site, or distributed across multiple sites and interconnected by a communication network.

Processors suitable for processing a computer program include, by way of example, both general-purpose and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor receives instructions and data from a read-only memory, a random access memory, or both. Elements of a computer may include at least one processor that executes instructions and one or more memory devices that store instructions and data. Generally, a computer may also include one or more mass storage devices for storing data, for example magnetic disks, magneto-optical disks, or optical disks, or may be coupled to receive data from them, transmit data to them, or both. Information carriers suitable for embodying computer program instructions and data include, by way of example, semiconductor memory devices; magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; ROM (Read Only Memory); RAM (Random Access Memory); flash memory; EPROM (Erasable Programmable ROM); and EEPROM (Electrically Erasable Programmable ROM). The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.

In addition, a computer-readable medium may be any available medium that can be accessed by a computer, and may include both computer storage media and transmission media.

Although this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments, either separately or in any suitable subcombination. Furthermore, although features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various device components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and devices can generally be integrated together in a single software product or packaged into multiple software products.

Meanwhile, the embodiments of the present invention disclosed in this specification and the drawings merely present specific examples to aid understanding and are not intended to limit the scope of the present invention. It will be apparent to those of ordinary skill in the art to which the present invention pertains that, in addition to the embodiments disclosed herein, other modifications based on the technical idea of the present invention can be practiced.

101: encoder
102: decoder

Claims

A method of encoding an audio signal using a learning model, the method comprising:
extracting pitch information of the audio signal;
determining, based on the pitch information, an expansion factor for a receptive field of a first expandable neural network block that extracts a feature map from the audio signal;
generating a first feature map of the audio signal using the first expandable neural network block for which the expansion factor has been determined;
determining a second feature map by inputting the first feature map to a second expandable neural network block that processes the first feature map; and
converting the second feature map and the pitch information into a bitstream.

The encoding method of claim 1,
wherein the generating of the first feature map comprises
generating the first feature map by transforming the channels of the audio signal and inputting the result to the first expandable neural network block, and
wherein the determining of the second feature map
further comprises transforming the number of channels of the determined second feature map.

The encoding method of claim 1,
wherein the determining of the second feature map comprises
performing, on the first feature map, down-sampling that reduces the dimension of the first feature map, and determining the second feature map by inputting the down-sampled first feature map to the second expandable neural network block.

The encoding method of claim 1,
wherein the determining of the expansion factor comprises
determining the expansion factor by approximating the receptive field of the first expandable neural network block to the pitch information.

The encoding method of claim 1,
wherein the expansion factor of the second expandable neural network block is predetermined as a fixed value, and
wherein the receptive field of the second expandable neural network block
is determined according to the expansion factor of the second expandable neural network block.

The encoding method of claim 1,
further comprising quantizing each of the second feature map and the pitch information,
wherein the converting into the bitstream comprises
multiplexing the quantized second feature map and the quantized pitch information into the bitstream.

A method of decoding an audio signal using a learning model, the method comprising:
extracting a second feature map of the audio signal and pitch information of the audio signal from a bitstream received from an encoder;
restoring a first feature map by inputting the second feature map to a second expandable neural network block that restores a feature map;
determining, based on the pitch information, an expansion factor for a receptive field of a first expandable neural network block that restores an audio signal from a feature map; and
restoring the audio signal from the first feature map using the first expandable neural network block for which the expansion factor has been determined.

The decoding method of claim 7,
wherein the restoring of the first feature map comprises
restoring the first feature map by transforming the number of channels of the second feature map and inputting the result to the second expandable neural network block, and
wherein the restoring of the audio signal
further comprises transforming the number of channels of the restored audio signal to be equal to the number of channels of the input signal of the encoder.

The decoding method of claim 7,
wherein the restoring of the audio signal comprises
performing, on the first feature map, up-sampling that expands the dimension of the first feature map, and determining the audio signal by inputting the up-sampled first feature map to the first expandable neural network block.

The decoding method of claim 7,
wherein the expansion factor
is determined, in the encoder, by approximating the receptive field of the first expandable neural network block to the pitch information.

The decoding method of claim 7,
wherein the expansion factor of the second expandable neural network block is predetermined as a fixed value, and
wherein the receptive field of the second expandable neural network block
is determined according to the expansion factor of the second expandable neural network block.

The decoding method of claim 7,
wherein the extracting of the second feature map and the pitch information of the audio signal
further comprises dequantizing each of the second feature map and the pitch information.

An encoder for performing a method of encoding an audio signal,
the encoder comprising a processor,
wherein the processor is configured to:
extract pitch information of the audio signal; determine, based on the pitch information, an expansion factor for a receptive field of a first expandable neural network block that extracts a feature map from the audio signal; generate a first feature map of the audio signal using the first expandable neural network block for which the expansion factor has been determined; determine a second feature map by inputting the first feature map to a second expandable neural network block that processes the first feature map; and convert the second feature map and the pitch information into a bitstream.

The encoder of claim 13,
wherein the processor is configured to
perform, on the first feature map, down-sampling that reduces the dimension of the first feature map, and determine the second feature map by inputting the down-sampled first feature map to the second expandable neural network block.

The encoder of claim 13,
wherein the processor is configured to
determine the expansion factor by approximating the receptive field of the first expandable neural network block to the pitch information.

The encoder of claim 13,
wherein the expansion factor of the second expandable neural network block is predetermined as a fixed value, and
wherein the receptive field of the second expandable neural network block
is determined according to the expansion factor of the second expandable neural network block.