KR20190060628A

KR20190060628A - Method and apparatus of audio signal encoding using weighted error function based on psychoacoustics, and audio signal decoding using weighted error function based on psychoacoustics

Info

Publication number: KR20190060628A
Application number: KR1020170173405A
Authority: KR
Inventors: 성종모; 김민제; 시바라만 아스윈; 젠 카이
Original assignee: 한국전자통신연구원; 더 트러스티즈 오브 인디애나 유니버시티
Priority date: 2017-11-24
Filing date: 2017-12-15
Publication date: 2019-06-03
Also published as: KR102556098B1

Abstract

An objective of the present invention is to use a perceptually weighted error function, based on human auditory characteristics, either to provide improved audio signal quality at the same model complexity or to provide the same level of audio signal quality at a lower model complexity. According to an embodiment of the present invention, a neural network applied to an audio signal encoding method using an audio signal encoding apparatus comprises: a step of generating a masking threshold for a first audio signal before training; a step of calculating, based on the masking threshold, a weight matrix to be applied to the frequency components of the first audio signal; a step of generating a weighted error function by modifying a preset error function using the weight matrix; and a step of generating a second audio signal by applying parameters learned using the weighted error function to the first audio signal.

Description

The present invention relates to a method and apparatus for encoding an audio signal using a psychoacoustic-based weighted error function, and to a method and apparatus for decoding an audio signal.

The following embodiments relate to a method and apparatus for encoding an audio signal using a psychoacoustic-based weighted error function, and to a method and apparatus for decoding an audio signal. More specifically, they relate to an audio signal encoding method and apparatus, and an audio signal decoding method and apparatus, that apply parameters learned using a weighted error function that takes human auditory characteristics into account.

Recently, speech and audio codecs for various purposes and applications have been developed by standardization bodies such as ITU-T, MPEG, and 3GPP. Most audio codecs are based on psychoacoustic models that exploit various human auditory characteristics. Speech codecs are mainly based on a speech production model, but they also exploit human perceptual characteristics to improve subjective quality.

As described above, conventional speech and audio codecs use methods based on human auditory characteristics to effectively control the quantization noise generated in the encoding stage.

According to one embodiment, a perceptually weighted error function based on human auditory characteristics can be used to provide improved audio signal quality at the same model complexity.

According to one embodiment, a perceptually weighted error function based on human auditory characteristics can be used to provide the same level of audio signal quality at a lower model complexity.

According to one aspect, a neural network applied to an audio signal encoding method using an audio signal encoding apparatus may comprise: generating a masking threshold for a first audio signal before training; calculating, based on the masking threshold, a weight matrix to be applied to each frequency component of the first audio signal; generating a weighted error function by modifying a preset error function using the weight matrix; and generating a second audio signal by applying parameters learned using the weighted error function to the first audio signal.

The weight matrix may include weights to be applied to the frequency components of the first audio signal, and each weight may be set to be inversely proportional to the masking threshold for the first audio signal and proportional to the magnitude of the corresponding frequency component of the first audio signal.

The neural network may further comprise performing a perceptual quality evaluation by comparing the generated second audio signal with the first audio signal.

The perceptual quality evaluation may include an objective evaluation such as PESQ (Perceptual Evaluation of Speech Quality), POLQA (Perceptual Objective Listening Quality Assessment), or PEAQ (Perceptual Evaluation of Audio Quality), or a subjective evaluation such as MOS (Mean Opinion Score) or MUSHRA (Multiple Stimuli with Hidden Reference and Anchor).

The neural network may determine, based on the perceptual quality evaluation, whether the topology of the model can be adjusted.

When determining whether the topology can be adjusted, if the result of the perceptual quality evaluation does not satisfy a preset quality requirement, the parameters may be retrained using a model with increased complexity; if the result satisfies the preset quality requirement, the parameters may be retrained using a model with reduced complexity, within the quality requirement.
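The retraining rule above can be sketched as a small decision function. This is a minimal sketch, not the patented procedure: the function name `adjust_topology`, the fixed `step`, the `min_nodes` floor, and the policy of changing every hidden layer by the same amount are all hypothetical choices made for illustration.

```python
def adjust_topology(hidden_sizes, quality_score, required_score,
                    step=16, min_nodes=8):
    """Return hidden-layer sizes for the next round of retraining.

    Hypothetical policy: widen every hidden layer when the quality
    requirement is missed, otherwise try a narrower (cheaper) model.
    """
    if quality_score < required_score:
        # Requirement not met: retrain with a more complex model.
        return [n + step for n in hidden_sizes]
    # Requirement met: retrain with a less complex model.
    return [max(min_nodes, n - step) for n in hidden_sizes]


# Example with made-up scores on a MOS-like scale.
grown = adjust_topology([64, 32, 64], quality_score=2.9, required_score=3.5)
shrunk = adjust_topology([64, 32, 64], quality_score=4.1, required_score=3.5)
```

In practice the reduced model would be retrained and evaluated again, and kept only if it still meets the quality requirement.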

According to one aspect, an audio signal encoding method performed by an audio signal encoding apparatus to which a neural network is applied may comprise: receiving an input audio signal; generating a dimension-reduced latent vector of the input audio signal based on parameters of one or more hidden layers of the neural network, the parameters having been learned using the neural network; and encoding the generated latent vector to output a bitstream.

In generating the dimension-reduced latent vector of the input audio signal, the latent vector may be generated based on the learned parameters when adjustment of the model topology, which includes the number of hidden layers and the number of nodes, is not possible or not necessary, and may be generated based on parameters retrained by applying an adjusted topology when adjustment of the model topology is possible.

The encoding of the latent vector may binarize the latent vector in order to transmit the bitstream over a channel.
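One way to picture the binarization step is uniform scalar quantization of each latent value followed by concatenation of the fixed-length codes. The source does not specify the binarization scheme, so the sketch below is an assumption throughout: `binarize`, `debinarize`, the 4-bit resolution, and the value range [0, 1] are hypothetical.

```python
def binarize(latent, bits=4, lo=0.0, hi=1.0):
    """Quantize each latent value in [lo, hi] to `bits` bits and
    concatenate the codes into a string of '0'/'1' characters."""
    levels = (1 << bits) - 1
    codes = []
    for v in latent:
        clipped = min(max(v, lo), hi)
        index = round((clipped - lo) / (hi - lo) * levels)
        codes.append(format(index, "0{}b".format(bits)))
    return "".join(codes)


def debinarize(bitstream, bits=4, lo=0.0, hi=1.0):
    """Inverse mapping used on the decoder side to restore the latent vector."""
    levels = (1 << bits) - 1
    return [lo + int(bitstream[i:i + bits], 2) / levels * (hi - lo)
            for i in range(0, len(bitstream), bits)]


latent = [0.0, 0.4, 1.0]           # made-up code-layer activations
bitstream = binarize(latent)       # 3 values x 4 bits = 12 bits
restored = debinarize(bitstream)   # approximates `latent`
```

The restored values differ from the originals by at most half a quantization step, which is the quantization noise the decoder has to tolerate.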

According to one aspect, a neural network applied to an audio signal decoding method using an audio signal decoding apparatus may comprise: generating a masking threshold for a first audio signal before training; calculating, based on the masking threshold, a weight matrix to be applied to each frequency component of the first audio signal; generating a weighted error function by modifying a preset error function using the weight matrix; and generating a second audio signal by applying parameters learned using the weighted error function to the first audio signal.

When determining whether the topology can be adjusted, if the result of the perceptual quality evaluation does not satisfy a preset quality requirement, the neural network may retrain the parameters using a model with increased complexity; if the result satisfies the preset quality requirement, it may retrain the parameters using a model with reduced complexity, within the quality requirement.

According to one aspect, an audio signal decoding method performed by an audio signal decoding apparatus to which a neural network is applied may comprise: receiving a bitstream in which a latent vector is encoded, the latent vector having been generated by applying parameters learned through the neural network to an input audio signal; restoring the latent vector from the received bitstream; and decoding an output audio signal from the restored latent vector using one or more hidden layers of the neural network to which the learned parameters are applied.

The latent vector may be generated based on the learned parameters when adjustment of the topology, which includes the number of hidden layers and the number of nodes belonging to each hidden layer, is not possible or not necessary, and may be generated based on parameters retrained by applying an adjusted topology when adjustment of the topology is possible.

In the audio signal decoding method, the encoding of the latent vector may be a binarization performed in order to transmit the bitstream over a channel.

According to one aspect, an audio signal encoding apparatus to which a neural network is applied may include a processor and a memory containing one or more instructions executable by the processor. When the one or more instructions are executed by the processor, the apparatus receives an input audio signal, generates a dimension-reduced latent vector of the input audio signal based on parameters of one or more hidden layers of the neural network learned using the neural network, and encodes the generated latent vector to output a bitstream.

According to one aspect, an audio signal decoding apparatus to which a neural network is applied may include a processor and a memory containing one or more instructions executable by the processor. When the one or more instructions are executed by the processor, the apparatus receives a bitstream in which a latent vector is quantized, the latent vector having been generated by applying parameters learned through the neural network to an input audio signal, restores the latent vector from the received bitstream, and decodes an output audio signal from the restored latent vector using one or more hidden layers of the neural network to which the learned parameters are applied.

FIG. 1 is a diagram illustrating the structure of an autoencoder including three hidden layers, according to an embodiment.
FIG. 2 is a diagram illustrating a simultaneous masking effect, according to an embodiment.
FIG. 3 is a diagram illustrating audible and inaudible regions in consideration of the masking effect, according to an embodiment.
FIG. 4 is a diagram illustrating a learning process of a neural network using a weighted error function, according to an embodiment.
FIG. 5 is a diagram illustrating an audio signal encoding apparatus, a channel, and an audio signal decoding apparatus, according to an embodiment.
FIG. 6 is a diagram illustrating a learning process of a neural network applied to an audio signal encoding method using an audio signal encoding apparatus, according to an embodiment.
FIG. 7 is a diagram illustrating an audio signal encoding method performed by an audio signal encoding apparatus to which a neural network is applied, according to an embodiment.
FIG. 8 is a diagram illustrating a learning process of a neural network applied to an audio signal decoding method using an audio signal decoding apparatus, according to an embodiment.
FIG. 9 is a diagram illustrating an audio signal decoding method performed by an audio signal decoding apparatus to which a neural network is applied, according to an embodiment.
FIG. 10 is a diagram illustrating an audio signal encoding apparatus to which a neural network is applied, according to an embodiment.
FIG. 11 is a diagram illustrating an audio signal decoding apparatus to which a neural network is applied, according to an embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustration only and may be modified and implemented in various forms. Accordingly, the embodiments are not limited to the specific forms disclosed, and the scope of this specification includes changes, equivalents, and alternatives that fall within the technical idea.

Terms such as first and second may be used to describe various components, but such terms should be interpreted only as distinguishing one component from another. For example, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component.

When a component is referred to as being "connected" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may be present in between.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to specify the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in this specification, are not to be interpreted in an idealized or overly formal sense.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating the structure of an autoencoder including three hidden layers, according to an embodiment.

According to one embodiment, audio signal encoding and decoding may be performed by applying a neural network. The neural network may include various deep learning and machine learning models. Specifically, the neural network may include the autoencoding method as a deep learning model.

A neural network solves an optimization problem that minimizes an error function (also called a cost function). The optimization is performed by an iterative learning algorithm, through which parameters that minimize the error can be found.

For example, a neural network applied to input data may predict output data as in Equation 1 below. Here, $X \in \mathbb{R}^{D \times N}$ denotes N input data of dimension D, $\hat{Y} \in \mathbb{R}^{R \times N}$ denotes N output data of dimension R, and the parameter set represents the model of the neural network applied to the input data.

The difference between the predicted output data $\hat{Y}$ and the target data $Y$ represents the error, and the error function may be defined as in Equation 2 below. Here, the target data $Y \in \mathbb{R}^{R \times N}$ are N target data of dimension R.

Here, $Y_{:,n}$ and $\hat{Y}_{:,n}$ denote the n-th column vectors of the target data and the output data, respectively. Accordingly, the total error over the N data may be calculated as in Equation 3 below.
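As a sketch of how Equations 1 to 3 fit together, the block below writes them out under stated assumptions. The symbols $\mathcal{F}$ (network function), $\mathbb{W}$ (parameter set), $\mathcal{D}$ (per-sample error), and $\mathcal{E}$ (total error) are illustrative names, and the squared Euclidean distance is only one possible choice of per-sample error.

```latex
% Eq. 1 (assumed form): the network maps the inputs to predicted outputs
\hat{Y} = \mathcal{F}(X; \mathbb{W}), \qquad
X \in \mathbb{R}^{D \times N}, \quad \hat{Y} \in \mathbb{R}^{R \times N}

% Eq. 2 (assumed form): error between the n-th target and output columns
\mathcal{D}\bigl(Y_{:,n} \,\|\, \hat{Y}_{:,n}\bigr)
  = \bigl\lVert Y_{:,n} - \hat{Y}_{:,n} \bigr\rVert_2^2

% Eq. 3 (assumed form): total error summed over the N data
\mathcal{E} = \sum_{n=1}^{N} \mathcal{D}\bigl(Y_{:,n} \,\|\, \hat{Y}_{:,n}\bigr)
```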

The parameters can therefore be adjusted to minimize the total error, and an iterative learning algorithm can find the parameters that minimize it. Various neural network models that perform iterative learning algorithms exist. Among them, the autoencoding method can be effective for learning latent patterns from input data.

Hereinafter, a neural network to which the autoencoding method is applied is described according to an embodiment. However, the embodiments are not limited to the autoencoding method, and models to which other neural networks are applied may also be included.

According to one embodiment, in an autoencoder the target data Y and the input data X may have the same size, which can make the autoencoder effective for dimensionality reduction. When the autoencoder includes L hidden layers, a fully-connected deep autoencoder can be defined recursively as in Equation 4 below.

Here, $W^{(l)}$ and $b^{(l)}$ denote the weights and the bias of the $l$-th layer, respectively, and the parameter set of the model consists of the weights and biases of all layers.
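The recursive, layer-by-layer definition can be illustrated with a small forward pass in which each layer applies an affine map followed by a nonlinearity to the previous layer's output. This is a minimal sketch under assumptions not stated in the source: the sigmoid activation, the 4-3-2-3-4 topology, and the helper names `dense`, `forward`, and `make_layer` with their deterministic toy weights are all hypothetical.

```python
import math


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def dense(weights, bias, x):
    """Affine map W x + b, where `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def forward(layers, x):
    """Apply the layers one after another; return every layer's activation."""
    activations = [x]
    for weights, bias in layers:
        activations.append(sigmoid(dense(weights, bias, activations[-1])))
    return activations


def make_layer(n_out, n_in):
    """Deterministic toy weights, used only so the example runs."""
    weights = [[0.1 * ((i + j) % 3 - 1) for j in range(n_in)]
               for i in range(n_out)]
    return weights, [0.0] * n_out


sizes = [4, 3, 2, 3, 4]  # input, hidden 1, code layer, hidden 3, output
layers = [make_layer(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
activations = forward(layers, [0.2, 0.4, 0.6, 0.8])
code = activations[2]    # dimension-reduced representation (2 values)
```

The input and output have the same size, while the middle activation is the narrow code that an encoder would transmit.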

FIG. 1 is a diagram illustrating an autoencoder including three hidden layers, according to an embodiment. The bottom layer 110 is the input layer to which data is input, the top layer 120 is the output layer from which data is output, and the layers in between are hidden layers 130. Nodes belong to the input layer, the output layer, or a hidden layer, and among them node 140 represents a bias node.

According to one embodiment, the autoencoder may be a compression system that includes an encoding unit, consisting of the input layer and hidden layers 1 and 2, and a decoding unit, consisting of hidden layer 3 and the output layer. Hidden layer 2 may be the code layer that compresses the input data.

The encoding unit of the autoencoder, which acts as a compression system, may generate a dimension-reduced representation in the code layer when the code layer has fewer nodes than the input dimension. In addition, using an error function set for the target data Y and the input data X as in Equation 5 below, the decoding unit of the autoencoder may reconstruct the input data from the dimension-reduced representation.

To obtain an effective code layer, the number of hidden layers and the number of nodes in each hidden layer can be increased. However, there is a trade-off between the complexity of the autoencoder and the quality of the output data: when the complexity of the autoencoder is high, battery consumption and memory capacity can become problems.

According to another embodiment, the autoencoder may be a denoising autoencoder that generates a noise-free clean signal from a noisy signal corrupted by noise. The noise may include additive noise, reverberation, and band-pass filtering.

A denoising autoencoder that generates a clean signal from a noisy signal may be expressed as in Equation 6 below. Here, Y is the magnitude spectrum of the clean signal and X is the magnitude spectrum of the noisy signal. X may be expressed as the result of applying a corruption function to Y. In Equation 6, the network function approximates the inverse of the corruption function.

For example, when the clean signal is corrupted by additive noise, it can be more effective to train the denoising autoencoder to estimate an ideal mask that removes the additive noise, as in Equation 7 below, than to estimate the clean signal directly.

Here, the ideal mask may be used to remove the additive noise from the noisy signal by means of a Hadamard (element-wise) product, as in Equation 8 below.

The estimated mask may be an estimate of the ideal ratio mask, which may be expressed as in Equation 9 below. Here, Q is the magnitude spectrum of the additive noise, and when the clean signal is corrupted by additive noise the corruption function may be defined as in Equation 10.
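The masking step can be sketched numerically. The sketch below assumes one common magnitude-domain definition of the ideal ratio mask, Y / (Y + Q), and assumes that the noisy magnitude is the sum X = Y + Q; the source equations are not reproduced here, so both are assumptions, and the function names and values are hypothetical.

```python
def ideal_ratio_mask(clean_mag, noise_mag):
    """Element-wise Y / (Y + Q); 0 where both magnitudes are zero."""
    return [[y / (y + q) if (y + q) > 0 else 0.0
             for y, q in zip(y_row, q_row)]
            for y_row, q_row in zip(clean_mag, noise_mag)]


def apply_mask(mask, noisy_mag):
    """Hadamard (element-wise) product of the mask and the noisy spectrum."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, noisy_mag)]


# Toy magnitude spectra: rows are frequency bins, columns are frames.
Y = [[1.0, 0.0], [3.0, 2.0]]   # clean signal
Q = [[1.0, 2.0], [1.0, 2.0]]   # additive noise
X = [[y + q for y, q in zip(y_row, q_row)] for y_row, q_row in zip(Y, Q)]

mask = ideal_ratio_mask(Y, Q)
Y_hat = apply_mask(mask, X)    # recovers Y under the additive assumption
```

Because every mask value lies in [0, 1], the network's regression target is bounded, which is one reason mask estimation tends to be easier to learn than direct spectrum estimation.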

The denoising autoencoder may learn the denoising function using a structure similar to Equation 4. However, because the corruption function is very complex, many hidden layers and nodes may be required. A trade-off between model complexity and performance can therefore arise.

To address this trade-off, an autoencoding method based on an error function that exploits human auditory characteristics is described in detail below. Applying such an error function can lower the model complexity or improve the performance of the autoencoder.

FIG. 2 illustrates a simultaneous masking effect, according to an embodiment.

Graph 210 shows the audible sound pressure level in decibels (dB) as a function of the frequency of an audio signal in a quiet environment. For example, the threshold is lowest around the 4 kHz frequency band, and the threshold of the audible sound pressure level rises toward lower or higher frequencies. More specifically, most people cannot perceive even a relatively loud tonal signal of about 30 dB at 30 Hz, but can perceive a relatively quiet tonal signal of about 10 dB at 1 kHz.

Graph 230 shows the threshold curve as modified by a tonal signal present at 1 kHz. That is, the tonal signal raises graph 210 around 1 kHz. Consequently, a signal whose level is below graph 230 cannot be perceived by a person.

Masker 220 denotes the tonal signal at 1 kHz, and a maskee denotes a signal masked by the masker. For example, maskees may include signals whose level is below graph 230.

FIG. 3 is a diagram illustrating audible and inaudible regions in consideration of the masking effect, according to an embodiment.

FIG. 3 superimposes the spectrum of an input audio signal on the graph of FIG. 2. The audible region 340 is where the spectrum 360 of the input audio signal lies above the masking threshold curve at the given frequency. The inaudible region 350 is where the spectrum 360 of the input audio signal lies below the masking threshold curve at the given frequency.

For example, at 30 Hz graph 310 is above graph 360, so that frequency falls in the inaudible region 350; at 10 kHz graph 310 is again above graph 360, so it also falls in the inaudible region 350; and at 4 kHz graph 360 is above graph 310, so it falls in the audible region 340.
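The audible and inaudible regions can be computed by comparing the signal spectrum with the masking threshold bin by bin. The sketch below uses made-up decibel values chosen only to reproduce the three cases in the text; `audible_bins` and all numbers are hypothetical.

```python
def audible_bins(spectrum_db, threshold_db):
    """True where the signal level exceeds the masking threshold."""
    return [s > t for s, t in zip(spectrum_db, threshold_db)]


freqs_hz = [30, 1000, 4000, 10000]
spectrum_db = [20.0, 40.0, 35.0, 5.0]    # input signal level per bin
threshold_db = [60.0, 45.0, 0.0, 15.0]   # masking threshold per bin

audible = audible_bins(spectrum_db, threshold_db)
regions = {f: ("audible" if a else "inaudible")
           for f, a in zip(freqs_hz, audible)}
```

With these values only the 4 kHz bin is audible, matching the description of the 30 Hz, 4 kHz, and 10 kHz cases.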

According to one embodiment, the error function may be modified so that training of the autoencoder gives more consideration to the audible region than to the inaudible region, depending on the characteristics of the input audio signal. That is, when the magnitude of a particular frequency component of the input audio signal is smaller than the corresponding masking threshold, a listener is relatively insensitive to errors in that component. Conversely, when the magnitude of a particular frequency component is larger than the corresponding masking threshold, a listener is relatively more sensitive to errors in that component.

Accordingly, for training an autoencoder that takes human auditory characteristics into account, the modified error function may be expressed as in Equation 11 below. The modified error function is a weighted error function obtained by modifying a preset error function.

Here, H denotes the weight matrix, which may contain the weights applied to the individual frequency components of the input audio signal. For example, when a given coefficient of a given sample has a large masking threshold, the corresponding weight may be relatively small. As another example, when a given coefficient of a given sample has a small masking threshold, the corresponding weight may be relatively large.
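The role of the weight matrix H can be illustrated with a short sketch. It assumes, purely for illustration, that the preset error function is a sum of squared errors and that the weighting is applied element by element. The exact form of Equation 11 is not reproduced here, and the name `weighted_squared_error` is hypothetical.

```python
import numpy as np

def weighted_squared_error(Y, Z, H):
    """Sum of squared errors between target Y and prediction Z, weighted element-wise by H.

    Assumed form for illustration only: E = sum(H * (Y - Z)**2).
    """
    Y, Z, H = (np.asarray(a, dtype=float) for a in (Y, Z, H))
    if not (Y.shape == Z.shape == H.shape):
        raise ValueError("Y, Z and H must have the same shape")
    return float(np.sum(H * (Y - Z) ** 2))

# Rows are samples (frames), columns are frequency coefficients; values are made up.
Y = [[1.0, 2.0], [3.0, 4.0]]
Z = [[1.0, 1.0], [1.0, 4.0]]
H = [[1.0, 0.5], [2.0, 1.0]]
weighted = weighted_squared_error(Y, Z, H)                  # 0.5 * 1 + 2.0 * 4
unweighted = weighted_squared_error(Y, Z, np.ones((2, 2)))  # 1 + 4
```

A coefficient with a small weight contributes less to the total, so the same reconstruction error is penalized less where the masking threshold is high.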

FIG. 4 is a diagram illustrating a learning process of a neural network using a weighted error function according to an embodiment.

According to one embodiment, the frequency spectrum of the first audio signal may be obtained for analysis frames of a preset length through time-frequency analysis 401. The frequency spectrum may be obtained by applying a filter bank or the MDCT (Modified Discrete Cosine Transform) to the first audio signal, or by another method.

Here, the first audio signal may denote a training audio signal used to learn the parameters. The analysis frames of a preset length may be frames of the first audio signal that are 10 ms to 50 ms long, although other frame lengths may also be used.
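The framing step can be sketched as follows. The text names a filter bank or the MDCT as the transform; this sketch substitutes a windowed DFT magnitude as a simple stand-in, and the 20 ms frame length, 10 ms hop and the name `frame_spectra` are assumptions chosen for illustration.

```python
import numpy as np

def frame_spectra(signal, sample_rate, frame_ms=20.0, hop_ms=10.0):
    """Split a signal into fixed-length frames and return one magnitude spectrum per frame."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one analysis frame")
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    frames = np.stack([
        signal[i * hop_len: i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    # DFT magnitude used here in place of the filter bank / MDCT named in the text.
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 1000.0 * t)   # 1 s test tone at 1 kHz
Y = frame_spectra(tone, sample_rate)      # shape: (frames, frequency bins)
```

Each row of `Y` then plays the role of the spectrum of one analysis frame.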

Thus, through time-frequency analysis 401, the frequency spectrum Y of the first audio signal, or the frequency spectrum of the first audio signal corrupted by noise, may be generated.

According to one embodiment, a masking threshold for the first audio signal may be calculated through psychoacoustic analysis 402. That is, the masking threshold may be calculated by analyzing the auditory characteristics of the first audio signal.

The psychoacoustic analysis may use MPEG PAM-I or MPEG PAM-II, or another psychoacoustic analysis method. For example, temporal masking may be used in addition to the simultaneous masking effect.

According to one embodiment, a weight matrix to be applied to each frequency component of the first audio signal may be calculated (403) using the masking threshold obtained through psychoacoustic analysis 402.

Here, the weight matrix contains the weights to be applied to the frequency components of the first audio signal, and each weight may be set to be inversely proportional to the masking threshold for the first audio signal and proportional to the magnitude of the corresponding frequency component. That is, the weight may be determined by the signal-to-mask ratio (SMR), which denotes the ratio between the magnitude of each frequency component of the first audio signal and the magnitude of the masking threshold at that component.

For example, the weight of the weight matrix for each frequency component of the first audio signal may be set to be inversely proportional to the corresponding masking threshold and proportional to the magnitude of that frequency component.

According to one embodiment, when the masking threshold is expressed in decibels, each weight of the weight matrix may be expressed by the relation in Equation 12 below. However, according to one embodiment, the conversion between the masking threshold and the weight may use a linear-scale or log-scale transformation, and other scale transformations may also be used.
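One possible SMR-based weighting can be sketched as follows. The mapping used here, a logarithm of one plus the linear-scale signal-to-mask ratio, is an assumption chosen only because it grows with the component level and shrinks with the masking threshold. It is not presented as Equation 12, and the name `smr_weights` is hypothetical.

```python
import numpy as np

def smr_weights(power_db, mask_db):
    """Map per-coefficient levels and masking thresholds (both in dB) to weights.

    Assumed mapping for illustration: w = log10(1 + 10**(0.1 * (power_db - mask_db))).
    """
    power_db = np.asarray(power_db, dtype=float)
    mask_db = np.asarray(mask_db, dtype=float)
    smr_db = power_db - mask_db            # signal-to-mask ratio in dB
    return np.log10(1.0 + 10.0 ** (0.1 * smr_db))

# The same 30 dB component under a low, an equal, and a high masking threshold.
w = smr_weights([30.0, 30.0, 30.0], [0.0, 30.0, 60.0])
```

Under this assumed mapping, a component far above its masking threshold receives a large weight, while a component far below it receives a weight close to zero.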

According to one embodiment, the preset error function may be converted into the weighted error function through error weighting 404. Here, the weighted error function expressed as in Equation 11 applies the weights and may be used to learn the parameters of the model.

Here, the model may include a neural network, for example an autoencoder. The model may also have a topology, which may include an input layer, one or more hidden layers, an output layer, and the nodes contained in each layer.

According to one embodiment, in parameter learning 405, the parameters of the model may be learned using the weighted error function. For example, learning may be performed on the topology of an initial model, and the predicted audio spectrum Z may be output using the topology of the model for which learning is complete.

Here, the predicted audio spectrum Z may be generated by applying the learned model parameters to the frequency spectrum Y of the first audio signal or to the frequency spectrum of the first audio signal corrupted by noise.

According to one embodiment, perceptual quality evaluation 406 may assess quality by comparing the predicted audio spectrum with the first audio signal or with the frequency spectrum Y of the first audio signal.

Here, the perceptual quality evaluation may use an objective measure such as PESQ (Perceptual Evaluation of Speech Quality), POLQA (Perceptual Objective Listening Quality Assessment), or PEAQ (Perceptual Evaluation of Audio Quality), or a subjective measure such as MOS (Mean Opinion Score) or MUSHRA (Multiple Stimuli with Hidden Reference and Anchor). The perceptual quality evaluation is not limited to these and may include other quality measures.

Perceptual quality evaluation 406 may denote a quality evaluation of the currently learned model. Based on the quality evaluation, it may be determined whether model requirements, such as a preset quality and model complexity, have been satisfied. Based on the quality evaluation, it may also be determined whether the model topology, and hence the model complexity, can be adjusted. Here, the complexity of the model may be positively correlated with the number of hidden layers and the number of nodes.

According to one embodiment, when the topology of the model cannot be adjusted or it is determined that no adjustment is needed (407), the parameters of the currently learned model may be stored and learning may end. The determination of whether the topology of the model can be adjusted may differ depending on the field in which the model is applied and the requirements for the model.

According to one embodiment, when it is determined that the topology of the model can be adjusted (407), the topology may be updated (408) and the learning process described above may be repeated. The determination of whether the topology of the model can be adjusted may differ depending on the field in which the model is applied and the requirements for the model.

For example, when the quality requirement is not satisfied, the topology of the model may be adjusted to increase the model complexity, and the learning process described above may be repeated. That is, when the result of the perceptual quality evaluation does not satisfy the preset quality requirement, the parameters may be relearned using a model of increased complexity. Alternatively, when the quality requirement is not satisfied but the model complexity cannot be increased any further, the parameters of the currently learned model may be stored and learning may end.

As another example, when the quality requirement is satisfied, the topology of the model may be adjusted to reduce the model complexity while staying within the quality requirement, and the learning process described above may be repeated. That is, when the result of the perceptual quality evaluation satisfies the preset quality requirement, the parameters may be relearned using a model of reduced complexity that still meets the quality requirement.
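The two cases above (grow the model when quality falls short, shrink it when quality is met) can be sketched as a simple search over candidate topologies. Everything named here is hypothetical: `train_and_score` stands in for parameter learning 405 followed by perceptual quality evaluation 406, and topology is reduced to a single hidden-layer size for brevity.

```python
def adjust_topology(train_and_score, sizes, start, target):
    """Search candidate model sizes (sorted by increasing complexity).

    train_and_score(size) is assumed to train a model of that size and return
    a quality score, where higher is better. Returns (size, score, requirement_met).
    """
    idx = start
    score = train_and_score(sizes[idx])
    if score >= target:
        # Requirement met: try smaller models while they still meet it.
        while idx > 0:
            smaller_score = train_and_score(sizes[idx - 1])
            if smaller_score < target:
                break
            idx -= 1
            score = smaller_score
    else:
        # Requirement not met: increase complexity until it is met or
        # no larger candidate remains.
        while score < target and idx < len(sizes) - 1:
            idx += 1
            score = train_and_score(sizes[idx])
    return sizes[idx], score, score >= target

# Toy stand-in for training plus evaluation: quality grows with model size.
def fake_score(size):
    return size / 100.0

grown = adjust_topology(fake_score, [32, 64, 128, 256], start=1, target=1.0)
shrunk = adjust_topology(fake_score, [32, 64, 128, 256], start=3, target=1.0)
stuck = adjust_topology(fake_score, [32, 64, 128, 256], start=0, target=5.0)
```

The third call shows the case in which the requirement is not met but complexity cannot be increased further, so the last trained model is kept.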

According to one embodiment, signal processing may be performed on an input audio signal based on the parameters of the learned model. The signal processing may include, but is not limited to, compression, noise removal, encoding, and decoding.

FIG. 5 is a diagram illustrating an audio signal encoding apparatus, a channel, and an audio signal decoding apparatus according to an embodiment.

According to one embodiment, the audio signal encoding apparatus 510 to which a neural network is applied may receive an input audio signal. A model learned through the neural network may be applied to the audio signal encoding apparatus 510. Here, the neural network may include an autoencoder.

The audio signal encoding apparatus 510 may include an autoencoder encoding unit 511 and a quantization unit 512. Here, the autoencoder encoding unit 511 may include the layers from the input layer to the code layer, where the code layer may denote a designated hidden layer.

The audio signal encoding apparatus 510 may generate a dimension-reduced latent vector of the received input audio signal based on the hidden-layer parameters learned using the neural network. The generated latent vector may be quantized or encoded by the quantization unit 512 and output as a bitstream. Here, the quantization unit 512 may perform binarization so that the bitstream can be transmitted over the transmission channel.

The output bitstream may be transmitted to the audio signal decoding apparatus 520 over the transmission channel 530.

The audio signal decoding apparatus 520 may include an autoencoder decoding unit 521 and an inverse quantization unit 522. Here, the autoencoder decoding unit 521 may include the layers from a designated hidden layer to the output layer.

The audio signal decoding apparatus 520 may restore the latent vector by inverse-quantizing or decoding, in the inverse quantization unit 522, the bitstream received over the transmission channel 530. Using the restored latent vector, the autoencoder decoding unit 521 may decode, or compute, the output audio signal. Here, the inverse quantization unit 522 may include a process of binarizing the bitstream.

The parameters of the autoencoder encoding unit 511 and the autoencoder decoding unit 521 may be the parameters learned as described for FIG. 4. See FIG. 4 for details of the learned parameters.
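The path from encoder to channel to decoder in FIG. 5 can be sketched as follows. This is an illustration under several assumptions that are not stated in the text: a single-layer encoder and decoder, a tanh code layer, 8-bit uniform scalar quantization, and random matrices `W_enc` and `W_dec` standing in for the learned parameters. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 16, 4
# Random placeholders for the parameters that would be learned as in FIG. 4.
W_enc = 0.1 * rng.standard_normal((n_code, n_in))
W_dec = 0.1 * rng.standard_normal((n_in, n_code))

def to_latent(x):
    """Encoder side: input frame -> dimension-reduced latent vector in (-1, 1)."""
    return np.tanh(W_enc @ x)

def quantize(latent):
    """Uniform 8-bit quantization of values in (-1, 1) to integer codes 0..255."""
    return np.round((latent + 1.0) * 127.5).astype(np.uint8)

def dequantize(codes):
    """Inverse quantization: integer codes back to approximate latent values."""
    return codes.astype(float) / 127.5 - 1.0

def encode(x):
    """Latent vector -> quantized codes -> bitstream (array of 0/1 values)."""
    return np.unpackbits(quantize(to_latent(x)))

def decode(bits):
    """Bitstream -> restored latent vector -> output frame."""
    return W_dec @ dequantize(np.packbits(bits))

x = np.linspace(-1.0, 1.0, n_in)   # made-up input frame
bits = encode(x)                   # 4 latent values at 8 bits each
x_hat = decode(bits)
```

The latent dimension and the number of bits per value together set the bit rate, which is why a smaller code layer or a coarser quantizer trades quality for rate.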

According to one embodiment, by using a perceptually weighted error function based on human auditory characteristics, improved audio signal quality may be provided at the same model complexity, or the same level of audio signal quality may be provided at lower model complexity. The embodiment may therefore be used in an audio codec for compressing and restoring audio signals.

FIG. 6 is a diagram illustrating a learning process of a neural network applied to an audio signal encoding method using an audio signal encoding apparatus according to an embodiment.

In step 601, the audio signal encoding apparatus may generate a masking threshold for a first audio signal before learning. Here, the first audio signal may denote a training audio signal used to train the neural network.

The masking threshold may be calculated by analyzing the auditory characteristics of the first audio signal. The psychoacoustic analysis may use MPEG PAM-I or MPEG PAM-II, or another psychoacoustic analysis method. For example, temporal masking may be used in addition to the simultaneous masking effect.

In step 602, the audio signal encoding apparatus may calculate, based on the generated masking threshold, a weight matrix to be applied to each frequency component of the first audio signal. Here, the weight matrix contains the weights to be applied to the frequency components of the first audio signal, and each weight may be set to be inversely proportional to the masking threshold for the first audio signal and proportional to the magnitude of the corresponding frequency component. That is, the weight may be determined by the signal-to-mask ratio (SMR), which denotes the ratio between the magnitude of each frequency component of the first audio signal and the magnitude of the masking threshold at that component.

For example, the weight of the weight matrix for each frequency component of the first audio signal may be set to be inversely proportional to the corresponding masking threshold and proportional to the magnitude of that frequency component.

In step 603, the audio signal encoding apparatus may generate a weighted error function by modifying a preset error function using the weight matrix. Here, the weighted error function applies the weights and may be used to learn the parameters of the model.

Here, the model may include a neural network, and the neural network may include, for example, an autoencoder. The model may also have a topology, which may include an input layer, one or more hidden layers, an output layer, and the nodes contained in each layer.

In step 604, the audio signal encoding apparatus may generate a second audio signal by applying, to the first audio signal, the parameters learned using the weighted error function. Here, the parameters of the model may be learned using the weighted error function. For example, learning may be performed on the topology of an initial model, and a predicted audio signal may be output using the topology of the model whose learning has been completed by an iterative learning algorithm. Here, the predicted audio signal may denote the second audio signal.
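Steps 603 and 604 can be sketched as iterative learning on a weighted error. The sketch assumes a one-hidden-layer autoencoder without biases, a tanh code layer, plain gradient descent, and an element-wise weighted squared error; none of these choices is stated in the text, and the data and weights below are random placeholders rather than real spectra or masking-based weights.

```python
import numpy as np

def train_weighted_autoencoder(Y, H, n_hidden=4, lr=0.02, steps=300, seed=0):
    """Gradient descent on the mean weighted squared error sum(H * (Z - Y)**2) / N.

    Y: (N, F) training spectra, H: (N, F) weights. Returns (W1, W2, losses).
    """
    rng = np.random.default_rng(seed)
    n_samples, n_freq = Y.shape
    W1 = 0.1 * rng.standard_normal((n_freq, n_hidden))   # encoder weights
    W2 = 0.1 * rng.standard_normal((n_hidden, n_freq))   # decoder weights
    losses = []
    for _ in range(steps):
        C = np.tanh(Y @ W1)              # code layer activations
        Z = C @ W2                       # predicted spectra (second signal)
        diff = Z - Y
        losses.append(float(np.sum(H * diff ** 2) / n_samples))
        dZ = 2.0 * H * diff / n_samples  # gradient of the loss w.r.t. Z
        dW2 = C.T @ dZ
        dA = (dZ @ W2.T) * (1.0 - C ** 2)  # back through tanh
        dW1 = Y.T @ dA
        W1 -= lr * dW1
        W2 -= lr * dW2
    return W1, W2, losses

rng = np.random.default_rng(1)
Y = rng.standard_normal((64, 8))            # placeholder training spectra
H = rng.uniform(0.1, 1.0, size=Y.shape)     # placeholder weight matrix
W1, W2, losses = train_weighted_autoencoder(Y, H)
```

Because the weights multiply the error before the gradient is taken, coefficients with larger weights pull the parameters harder during learning.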

Here, the frequency spectrum of the second audio signal may be generated by applying the learned model parameters to the frequency spectrum of the first audio signal or to the frequency spectrum of the first audio signal corrupted by noise.

A perceptual quality evaluation may be performed by comparing the second audio signal with the first audio signal. The perceptual quality evaluation may use an objective measure such as PESQ (Perceptual Evaluation of Speech Quality), POLQA (Perceptual Objective Listening Quality Assessment), or PEAQ (Perceptual Evaluation of Audio Quality), or a subjective measure such as MOS (Mean Opinion Score) or MUSHRA (Multiple Stimuli with Hidden Reference and Anchor). The perceptual quality evaluation is not limited to these and may include other quality measures.

The perceptual quality evaluation may denote a quality evaluation of the currently learned model. Based on the quality evaluation, it may be determined whether model requirements, such as a preset quality and model complexity, have been satisfied. Based on the quality evaluation, it may also be determined whether the model topology, and hence the model complexity, can be adjusted. Here, the complexity of the model may be positively correlated with the number of hidden layers and the number of nodes.

According to one embodiment, when the topology of the model cannot be adjusted or it is determined that no adjustment is needed, the parameters of the currently learned model may be stored and learning may end. The determination of whether the topology of the model can be adjusted may differ depending on the field in which the model is applied and the requirements for the model.

According to one embodiment, when it is determined that the topology of the model can be adjusted, the topology may be updated and the learning process described above may be repeated. The determination of whether the topology of the model can be adjusted may differ depending on the field in which the model is applied and the requirements for the model.

For example, when the quality requirement is satisfied, the topology of the model may be adjusted to reduce the model complexity while staying within the quality requirement, and the learning process described above may be repeated. That is, when the result of the perceptual quality evaluation satisfies the preset quality requirement, the parameters may be relearned using a model of reduced complexity that still meets the quality requirement.

According to one embodiment, signal processing may be performed on the first audio signal based on the parameters of the learned model. The signal processing may include, but is not limited to, compression, noise removal, encoding, and decoding.

FIG. 7 is a diagram illustrating an audio signal encoding method performed by an audio signal encoding apparatus to which a neural network is applied, according to an embodiment.

In step 701, the audio signal encoding apparatus may receive an input audio signal. A model learned through the neural network may be applied to the audio signal encoding apparatus. Here, the neural network may include an autoencoder.

In step 702, the audio signal encoding apparatus may generate a dimension-reduced latent vector of the input audio signal based on the hidden-layer parameters learned using the neural network. The process of learning the hidden-layer parameters has been described in detail above.

According to one embodiment, when the topology of the model, including the number of hidden layers and the number of nodes, cannot be adjusted or does not need to be adjusted, the latent vector may be generated based on the learned parameters. The determination of whether the topology of the model can be adjusted may differ depending on the field in which the model is applied and the requirements for the model.

According to one embodiment, when the topology of the model, including the number of hidden layers and the number of nodes, can be adjusted, the latent vector may be generated based on parameters relearned with the adjusted topology. The determination of whether the topology of the model can be adjusted may differ depending on the field in which the model is applied and the requirements for the model.

For example, when the quality requirement is not satisfied, the topology of the model may be adjusted to increase the model complexity, and the parameter learning process may be repeated. That is, when the result of the perceptual quality evaluation does not satisfy the preset quality requirement, the parameters may be relearned using a model of increased complexity. Alternatively, when the quality requirement is not satisfied but the model complexity cannot be increased any further, the parameters of the currently learned model may be stored and learning may end.

As another example, when the quality requirement is satisfied, the topology of the model may be adjusted to reduce the model complexity while staying within the quality requirement, and the parameter learning process may be repeated. That is, when the result of the perceptual quality evaluation satisfies the preset quality requirement, the parameters may be relearned using a model of reduced complexity that still meets the quality requirement.

In step 703, the audio signal encoding apparatus may encode the generated latent vector and output a bitstream. Here, the latent vector may be quantized or encoded for transmission over a transmission channel and output as the bitstream.

FIG. 8 is a diagram illustrating a learning process of a neural network applied to an audio signal decoding method using an audio signal decoding apparatus according to an embodiment.

In step 801, the audio signal decoding apparatus may generate a masking threshold for a first audio signal before learning. Here, the first audio signal may denote a training audio signal used to train the neural network.

In step 802, the audio signal decoding apparatus may calculate, based on the generated masking threshold, a weight matrix to be applied to each frequency component of the first audio signal. Here, the weight matrix contains the weights to be applied to the frequency components of the first audio signal, and each weight may be set to be inversely proportional to the masking threshold for the first audio signal and proportional to the magnitude of the corresponding frequency component. That is, the weight may be determined by the signal-to-mask ratio (SMR), which denotes the ratio between the magnitude of each frequency component of the first audio signal and the magnitude of the masking threshold at that component.

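For example, the weight of the weight matrix for each frequency component of the first audio signal may be set to be inversely proportional to the corresponding masking threshold and proportional to the magnitude of that frequency component.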

단계(803)에서, 오디오 신호 복호화 장치는 가중 행렬을 이용하여 미리 설정된 오류 함수를 수정한 가중된 오류 함수를 생성할 수 있다. 여기서, 가중된 오류 함수는 가중치를 적용하여 모델의 파라미터의 학습에 이용될 수 있다. In step 803, the audio signal decoding apparatus may generate a weighted error function by modifying a preset error function using a weighting matrix. Here, the weighted error function can be used to learn the parameters of the model by applying weights.

단계(804)에서, 오디오 신호 복호화 장치는 가중된 오류 함수를 이용하여 학습된 파라미터를 제1 오디오 신호에 적용하여 제2 오디오 신호를 생성할 수 있다. 여기서, 모델의 파라미터는 가중된 오류 함수를 이용하여 학습될 수 있다. 예를 들면, 초기 모델의 토폴로지에 대해서 학습이 수행될 수 있으며, 반복적인 학습 알고리즘에 의해 학습이 완료된 모델의 토폴로지를 이용하여 예측된 오디오 신호는 출력될 수 있다. 여기서, 예측된 오디오 신호는 제2 오디오 신호를 나타낼 수 있다.In step 804, the audio signal decoding apparatus may apply the learned parameter to the first audio signal using the weighted error function to generate the second audio signal. Here, the parameters of the model can be learned using a weighted error function. For example, learning can be performed on the topology of the initial model, and the predicted audio signal can be output using the topology of the model that has been learned by the iterative learning algorithm. Here, the predicted audio signal may represent the second audio signal.

일 실시예에 따르면, 학습된 모델의 파라미터에 기초하여, 입력 오디오 신호에 대해 압축/잡음제거/부호화/복호화와 같은 신호 처리는 수행될 수 있다. 이때, 신호 처리는 이에 한정되지 않는다.According to one embodiment, signal processing such as compression / noise cancellation / encoding / decoding may be performed on the input audio signal based on the parameters of the learned model. At this time, the signal processing is not limited thereto.

도 9는 일 실시예에 따른, 뉴럴 네트워크가 적용된 오디오 신호 복호화 장치가 수행하는 오디오 신호 복호화 방법을 나타낸 도면이다.9 is a diagram illustrating an audio signal decoding method performed by an audio signal decoding apparatus to which a neural network is applied, according to an embodiment.

단계(901)에서, 오디오 신호 복호화 장치는 뉴럴 네트워크를 통해 학습된 파라미터를 입력 오디오 신호에 적용하여 생성된 잠재벡터가 부호화된 비트스트림을 수신할 수 있다. 이때, 오디오 신호 복호화 장치는 뉴럴 네트워크를 통해 학습된 모델이 적용될 수 있다. 여기서, 뉴럴 네트워크는 Autoencoder을 포함할 수 있다. In step 901, the audio signal decoding apparatus may receive the bitstream encoded with the latent vector generated by applying the learned parameters through the neural network to the input audio signal. At this time, the audio signal decoding apparatus can be applied with a learned model through a neural network. Here, the neural network may include an autoencoder.

단계(902)에서, 오디오 신호 복호화 장치는 수신한 비트스트림으로부터 잠재벡터를 복원할 수 있다. 여기서, 비트스트림은 전송 채널을 통해 전송되기 위해 잠재벡터를 양자화 또는 부호화하여 생성될 수 있다.In step 902, the audio signal decoding apparatus can recover the potential vector from the received bitstream. Here, a bitstream may be generated by quantizing or encoding a potential vector for transmission over a transmission channel.

In step 903, the audio signal decoding apparatus may decode an output audio signal from the restored latent vector using a hidden layer to which the learned parameters are applied. The process of learning the hidden layer parameters has been described in detail above.
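Steps 901 to 903 can be sketched as follows. This is a minimal sketch under several assumptions that are not stated in the embodiments: each latent dimension is carried by one bit, the bits are mapped to the values -1 and +1, the decoder consists of two fully connected layers with a tanh activation, and randomly initialized matrices stand in for the learned parameters. All names and dimensions are hypothetical.

```python
import numpy as np

def restore_latent(bitstream: bytes, latent_dim: int) -> np.ndarray:
    """Step 902: unpack the received bytes into bits and map {0, 1} to {-1.0, +1.0}."""
    bits = np.unpackbits(np.frombuffer(bitstream, dtype=np.uint8))[:latent_dim]
    return bits.astype(np.float64) * 2.0 - 1.0

def decode_frame(latent: np.ndarray, layers) -> np.ndarray:
    """Step 903: pass the latent vector through the decoder layers.

    layers is a list of (weight, bias) pairs; tanh is applied after every
    layer except the last one, which is left linear.
    """
    h = latent
    for i, (w, b) in enumerate(layers):
        h = h @ w + b
        if i < len(layers) - 1:
            h = np.tanh(h)
    return h

# Random stand-ins for learned parameters (assumption: two decoder layers).
rng = np.random.default_rng(0)
latent_dim, hidden_dim, frame_dim = 16, 32, 64
decoder_layers = [
    (rng.standard_normal((latent_dim, hidden_dim)) * 0.1, np.zeros(hidden_dim)),
    (rng.standard_normal((hidden_dim, frame_dim)) * 0.1, np.zeros(frame_dim)),
]

received = bytes([0b10110010, 0b01001101])      # step 901: hypothetical 16-bit payload
latent = restore_latent(received, latent_dim)   # step 902
output_frame = decode_frame(latent, decoder_layers)  # step 903
```

In a real system the decoder layers would hold the parameters learned with the weighted error function, and the bit-to-value mapping would have to match whatever quantization the encoder used.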

FIG. 10 is a diagram illustrating an audio signal encoding apparatus to which a neural network is applied, according to an embodiment.

The audio signal encoding apparatus 1000 may include a processor 1010 and a memory 1020. The memory 1020 may include one or more instructions executable by the processor.

The processor 1010 may receive an input audio signal. Here, a model learned through a neural network may be applied to the processor 1010 of the audio signal encoding apparatus 1000. The neural network may include an autoencoder.

The processor 1010 of the audio signal encoding apparatus 1000 may generate a dimension-reduced latent vector of the input audio signal based on hidden layer parameters learned using the neural network. The process of learning the hidden layer parameters has been described in detail above.

The processor 1010 of the audio signal encoding apparatus 1000 may encode the generated latent vector and output a bitstream. Here, the latent vector may be quantized or encoded for transmission over a transmission channel and output as the bitstream.
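The operations of the processor 1010 can be sketched as follows. This is a minimal sketch rather than the implementation of the embodiments: it assumes an encoder made of two fully connected tanh layers, one-bit quantization of each latent dimension by its sign, and randomly initialized matrices in place of the learned hidden layer parameters. All names and dimensions are hypothetical.

```python
import numpy as np

def encode_frame(frame: np.ndarray, layers) -> np.ndarray:
    """Map an input frame to a dimension-reduced latent vector.

    layers is a list of (weight, bias) pairs; tanh keeps every latent
    value within [-1, 1].
    """
    h = frame
    for w, b in layers:
        h = np.tanh(h @ w + b)
    return h

def latent_to_bitstream(latent: np.ndarray) -> bytes:
    """Binarize the latent vector by sign and pack the bits into bytes."""
    bits = (latent > 0).astype(np.uint8)
    return np.packbits(bits).tobytes()

# Random stand-ins for learned parameters (assumption: 64 -> 32 -> 16).
rng = np.random.default_rng(0)
frame_dim, hidden_dim, latent_dim = 64, 32, 16
encoder_layers = [
    (rng.standard_normal((frame_dim, hidden_dim)) * 0.1, np.zeros(hidden_dim)),
    (rng.standard_normal((hidden_dim, latent_dim)) * 0.1, np.zeros(latent_dim)),
]

frame = rng.standard_normal(frame_dim)        # stand-in for one input audio frame
latent = encode_frame(frame, encoder_layers)  # dimension reduction: 64 -> 16
bitstream = latent_to_bitstream(latent)       # 16 bits packed into 2 bytes
```

Sign binarization is only one possible choice; the embodiments state that the latent vector may be quantized or encoded, so a multi-bit quantizer or an entropy coder could take the place of `latent_to_bitstream`.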

FIG. 11 is a diagram illustrating an audio signal decoding apparatus to which a neural network is applied, according to an embodiment.

The audio signal decoding apparatus 1100 may include a processor 1110 and a memory 1120. The memory 1120 may include one or more instructions executable by the processor.

The processor 1110 may receive a bitstream in which a latent vector is encoded, the latent vector having been generated by applying parameters learned through a neural network to an input audio signal. Here, a model learned through the neural network may be applied to the processor 1110 of the audio signal decoding apparatus 1100. The neural network may include an autoencoder.

The processor 1110 of the audio signal decoding apparatus 1100 may restore the latent vector from the received bitstream. Here, the bitstream may have been generated by quantizing or encoding the latent vector for transmission over a transmission channel.

The processor 1110 of the audio signal decoding apparatus 1100 may decode an output audio signal from the restored latent vector using a hidden layer to which the learned parameters are applied. The process of learning the hidden layer parameters has been described in detail above.

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatuses, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as a parallel processor, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

110: Input layer
120: Output layer
130: Hidden layer

Claims

1. A neural network applied to an audio signal encoding method using an audio signal encoding apparatus, the neural network comprising:
generating a masking threshold for a first audio signal before learning;
calculating, based on the masking threshold, a weight matrix to be applied to each frequency component of the first audio signal;
generating a weighted error function by modifying a preset error function using the weight matrix; and
generating a second audio signal by applying, to the first audio signal, parameters learned using the weighted error function.

2. The neural network of claim 1,
wherein the weight matrix includes a weight to be applied to each frequency component of the first audio signal, and
the weight is set in proportion to the masking threshold for the first audio signal and in proportion to the magnitude of each frequency component of the first audio signal.

3. The neural network of claim 1,
wherein the neural network performs a perceptual quality evaluation by comparing the generated second audio signal with the first audio signal.

4. The neural network of claim 3,
wherein the perceptual quality evaluation includes:
an objective evaluation such as Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Assessment (POLQA), or Perceptual Evaluation of Audio Quality (PEAQ), or
a subjective evaluation such as Mean Opinion Score (MOS) or Multiple Stimuli with Hidden Reference and Anchor (MUSHRA).

5. The neural network of claim 3,
wherein the neural network determines, based on the perceptual quality evaluation, whether the topology included in the model can be adjusted.

6. The neural network of claim 5,
wherein, in determining whether the topology can be adjusted,
if the result of the perceptual quality evaluation does not satisfy a preset quality requirement, the parameters are re-learned using a model with increased complexity, and
if the result of the perceptual quality evaluation satisfies the preset quality requirement, the parameters are re-learned using a model with reduced complexity within the range that satisfies the quality requirement.

7. An audio signal encoding method performed by an audio signal encoding apparatus to which a neural network is applied, the method comprising:
receiving an input audio signal;
generating a dimension-reduced latent vector of the input audio signal based on hidden layer parameters learned using the neural network, the neural network including one or more hidden layers; and
encoding the generated latent vector and outputting a bitstream.

8. The method of claim 7,
wherein generating the dimension-reduced latent vector of the input audio signal based on the hidden layer parameters learned using the neural network comprises:
if adjustment of the topology of the model, which includes the number of hidden layers and the number of nodes, is impossible or unnecessary, generating the latent vector based on the learned parameters; and
if the topology of the model is adjustable, generating the latent vector based on parameters re-learned by applying the adjusted topology.

9. The method of claim 7,
wherein encoding the latent vector comprises
binarizing the latent vector so that the bitstream can be transmitted over a channel.

10. A neural network applied to an audio signal decoding method using an audio signal decoding apparatus, the neural network comprising:
generating a masking threshold for a first audio signal before learning;
calculating, based on the masking threshold, a weight matrix to be applied to each frequency component of the first audio signal;
generating a weighted error function by modifying a preset error function using the weight matrix; and
generating a second audio signal by applying, to the first audio signal, parameters learned using the weighted error function.

11. The neural network of claim 10,
wherein the weight matrix includes a weight to be applied to each frequency component of the first audio signal, and
the weight is set in proportion to the masking threshold for the first audio signal and in proportion to the magnitude of each frequency component of the first audio signal.

12. The neural network of claim 10,
wherein the neural network performs a perceptual quality evaluation by comparing the generated second audio signal with the first audio signal.

13. The neural network of claim 12,
wherein the perceptual quality evaluation includes:
an objective evaluation such as Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Assessment (POLQA), or Perceptual Evaluation of Audio Quality (PEAQ), or
a subjective evaluation such as Mean Opinion Score (MOS) or Multiple Stimuli with Hidden Reference and Anchor (MUSHRA).

14. The neural network of claim 12,
wherein the neural network determines, based on the perceptual quality evaluation, whether the topology included in the model can be adjusted.

15. The neural network of claim 14,
wherein, in determining whether the topology can be adjusted,
if the result of the perceptual quality evaluation does not satisfy a preset quality requirement, the parameters are re-learned using a model with increased complexity, and
if the result of the perceptual quality evaluation satisfies the preset quality requirement, the parameters are re-learned using a model with reduced complexity within the range that satisfies the quality requirement.

16. An audio signal decoding method performed by an audio signal decoding apparatus to which a neural network is applied, the method comprising:
receiving a bitstream in which a latent vector is encoded, the latent vector having been generated by applying parameters learned through the neural network to an input audio signal;
restoring the latent vector from the received bitstream; and
decoding an output audio signal from the restored latent vector using a hidden layer to which the learned parameters are applied, the neural network including one or more hidden layers.

17. The method of claim 16,
wherein the generated latent vector is
generated based on the learned parameters if adjustment of the topology, which includes the number of hidden layers and the number of nodes belonging to each hidden layer, is impossible or unnecessary, and
generated based on parameters re-learned by applying an adjusted topology if the topology is adjustable.

18. The method of claim 16,
wherein the latent vector is encoded by
binarization so that the bitstream can be transmitted over a channel.

19. An audio signal encoding apparatus to which a neural network is applied,
the audio signal encoding apparatus comprising a processor and a memory including one or more instructions executable by the processor,
wherein, when the one or more instructions are executed, the processor:
receives an input audio signal;
generates a dimension-reduced latent vector of the input audio signal based on hidden layer parameters learned using the neural network, the neural network including one or more hidden layers; and
encodes the generated latent vector and outputs a bitstream.

20. An audio signal decoding apparatus to which a neural network is applied,
the audio signal decoding apparatus comprising a processor and a memory including one or more instructions executable by the processor,
wherein, when the one or more instructions are executed, the processor:
receives a bitstream in which a latent vector is encoded, the latent vector having been generated by applying parameters learned through the neural network to an input audio signal;
restores the latent vector from the received bitstream; and
decodes an output audio signal from the restored latent vector using a hidden layer to which the learned parameters are applied, the neural network including one or more hidden layers.