KR102579470B1

KR102579470B1 - Method And Apparatus for Processing Audio Signal

Info

Publication number: KR102579470B1
Application number: KR1020200056492A
Authority: KR
Inventors: 이미숙; 백승권; 성종모; 이태진; 최진수; 김민제; 젠 카이
Original assignee: 한국전자통신연구원 (Electronics and Telecommunications Research Institute); 더 트러스티즈 오브 인디애나 유니버시티 (The Trustees of Indiana University)
Priority date: 2020-01-28
Filing date: 2020-05-12
Publication date: 2023-09-18
Also published as: KR20210096542A

Abstract

A method and apparatus for processing an audio signal are disclosed. A method of processing an audio signal according to an embodiment of the present invention may include: obtaining a final audio signal for an initial audio signal by using a plurality of neural network models, each of which encodes and decodes an input audio signal to generate an output audio signal; calculating, in the time domain, the difference between the initial audio signal and the final audio signal; converting the initial audio signal and the final audio signal into mel spectra; calculating, in the frequency domain, the difference between the mel spectrum of the initial audio signal and that of the final audio signal; training the plurality of neural network models based on the results calculated in the time domain and the frequency domain; and generating, from the initial audio signal by using the trained neural network models, a new final audio signal distinct from the final audio signal.

Description

{Method And Apparatus for Processing Audio Signal}

The present invention relates to a method and apparatus for processing an audio signal and, more specifically, to a method and apparatus that process an audio signal by using a psychoacoustic model to calculate the loss function used to train a neural network model that encodes and decodes the audio signal.

When an audio signal is encoded and the encoded signal is then decoded to restore it, signal loss in the process causes the restored audio signal to differ from the audio signal that was originally input.

To reduce this loss, neural audio coding, which applies neural network models from deep learning (one branch of artificial intelligence) to the encoding and decoding of audio signals, is being actively researched. However, a technique is still needed that takes psychoacoustic factors into account when training a neural network model to minimize the loss of the audio signal.

The present invention provides a method and apparatus that, when processing an audio signal with a neural network model that encodes and decodes the audio signal, minimize the loss of the audio signal by using a psychoacoustic model while the neural network model is being trained.

The present invention also provides a method and apparatus that improve the quality of the restored audio signal by training the neural network model that encodes and decodes the audio signal so as to minimize the noise generated during encoding.

A method of processing an audio signal according to an embodiment of the present invention may include: obtaining a final audio signal for an initial audio signal by using a plurality of neural network models, each of which encodes and decodes an input audio signal to generate an output audio signal; calculating, in the time domain, the difference between the initial audio signal and the final audio signal; converting the initial audio signal and the final audio signal into mel spectra; calculating, in the frequency domain, the difference between the mel spectrum of the initial audio signal and that of the final audio signal; training the plurality of neural network models based on the results calculated in the time domain and the frequency domain; and generating, from the initial audio signal by using the trained neural network models, a new final audio signal distinct from the final audio signal.

The training of the neural network models may update the parameters included in the neural network models so that the sum of the result calculated in the time domain and the result calculated in the frequency domain is minimized.
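The two-term objective described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the distance measure, so mean squared error is assumed for both terms, and the triangular mel filterbank, the Hann window, and all function names (`mel_filterbank`, `mel_spectrum`, `time_plus_mel_loss`) are hypothetical choices rather than the patent's implementation.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula (assumed; the text does not name one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, freqs.size))
    for k in range(n_mels):
        lo, mid, hi = edges[k], edges[k + 1], edges[k + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mel_spectrum(frame, bank):
    """Power spectrum of one windowed frame, pooled into mel bands."""
    n_fft = 2 * (bank.shape[1] - 1)
    power = np.abs(np.fft.rfft(frame * np.hanning(frame.size), n=n_fft)) ** 2
    return bank @ power

def time_plus_mel_loss(initial, final, bank):
    """Sum of a time-domain term and a mel-spectrum term (both MSE, by assumption)."""
    time_term = np.mean((initial - final) ** 2)
    mel_term = np.mean((mel_spectrum(initial, bank) - mel_spectrum(final, bank)) ** 2)
    return time_term + mel_term
```

Training would then adjust the model parameters to drive `time_plus_mel_loss` toward its minimum; how the two terms are scaled relative to each other is not stated in this passage.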

The plurality of neural network models may be arranged in sequence, such that the i-th neural network model generates its output audio signal by taking, as its input audio signal, the difference between the output audio signal and the input audio signal of the (i-1)-th neural network model.

The final audio signal may be the audio signal obtained by summing the output audio signals of the plurality of neural network models.

A method of processing an audio signal according to an embodiment of the present invention may include: obtaining a final audio signal for an initial audio signal by using a plurality of neural network models, each of which encodes and decodes an input audio signal to generate an output audio signal; obtaining a power spectral density and a masking threshold for the initial audio signal through a psychoacoustic model; determining, for each frequency, a weight according to the relationship between the masking threshold and the power spectral density; calculating, for each frequency and based on the determined weight, the difference between the power spectral density of the initial audio signal and the power spectral density of the final audio signal; training the neural network models according to the calculated result; and generating, from the initial audio signal by using the trained neural network models, a new final audio signal distinct from the final audio signal.

The training of the neural network models may update the parameters included in the neural network models so that the calculated result is minimized.

The masking threshold may be a criterion, determined by the psychoacoustic model in consideration of the sound pressure of the initial audio signal, for masking the noise generated during the encoding and decoding performed by the neural network models.

The determining of the weight may set the weight at a given frequency higher as the power spectral density of the initial audio signal becomes larger relative to the masking threshold, and set the weight at that frequency lower as the masking threshold becomes larger relative to the power spectral density of the initial audio signal.
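A minimal sketch of such a weighting is given below. The passage only fixes the direction of the relationship (more weight where the signal stands above the masking threshold, less where the threshold dominates); the specific logarithmic formula, the squared-difference distance, the use of decibel inputs, and the names `perceptual_weights` and `weighted_psd_loss` are all assumptions made for illustration.

```python
import numpy as np

def perceptual_weights(psd_db, mask_db):
    """Per-frequency weights that grow as the PSD rises above the masking threshold.

    Both inputs are in dB. The log(1 + ratio) form is a hypothetical choice:
    it is monotonically increasing in (psd_db - mask_db) and stays positive.
    """
    ratio = 10.0 ** ((psd_db - mask_db) / 10.0)
    return np.log10(1.0 + ratio)

def weighted_psd_loss(psd_initial_db, psd_final_db, mask_db):
    """Weighted squared PSD difference, averaged over frequency (assumed distance)."""
    w = perceptual_weights(psd_initial_db, mask_db)
    return np.mean(w * (psd_initial_db - psd_final_db) ** 2)

# Bin 0: signal 20 dB above the mask (audible). Bin 1: signal 20 dB below it.
psd = np.array([60.0, 40.0])
mask = np.array([40.0, 60.0])
weights = perceptual_weights(psd, mask)
```

With these inputs the first bin receives a much larger weight than the second, so the same reconstruction error costs more where it is likely to be heard.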

A method of processing an audio signal according to an embodiment of the present invention may include: obtaining a final audio signal for an initial audio signal by using a plurality of neural network models, each of which encodes and decodes an input audio signal to generate an output audio signal; obtaining a masking threshold for the initial audio signal through a psychoacoustic model; identifying, in the final audio signal, the noise generated during the encoding and decoding of the initial audio signal; calculating, for each frequency, the difference between the masking threshold and the noise included in the final audio signal; training the neural network models according to the calculated result; and generating, from the initial audio signal by using the trained neural network models, a new final audio signal distinct from the final audio signal.

The masking threshold may be a criterion, determined by the psychoacoustic model in consideration of the sound pressure of the initial audio signal, for masking the noise generated during the encoding and decoding performed by the neural network models.
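The noise-versus-threshold comparison in this embodiment might look like the sketch below. Several points are assumptions, not statements from the text: the noise is taken to be the sample-wise difference between the initial and final signals, its spectrum is estimated from one Hann-windowed frame, and the per-frequency differences are aggregated with a hinge so that only noise exceeding the threshold is penalized. The masking threshold itself is treated as a given input (`mask_db`), since the psychoacoustic model that produces it is not detailed here.

```python
import numpy as np

def psd_db(frame, eps=1e-12):
    """Power spectrum of one Hann-windowed frame, in dB (eps avoids log of zero)."""
    spectrum = np.fft.rfft(frame * np.hanning(frame.size))
    return 10.0 * np.log10(np.abs(spectrum) ** 2 + eps)

def noise_to_mask_loss(initial, final, mask_db):
    """Average amount (in dB) by which coding noise exceeds the masking threshold.

    mask_db must have frame.size // 2 + 1 entries, one per rfft bin.
    """
    noise_db = psd_db(initial - final)   # assumed definition of the coding noise
    excess = noise_db - mask_db          # per-frequency difference
    return np.mean(np.maximum(excess, 0.0))  # hinge aggregation (assumption)
```

Under this aggregation, noise that stays below the threshold at every frequency contributes nothing to the loss, which matches the idea that masked noise is inaudible.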

A method of processing an audio signal according to an embodiment of the present invention may include: obtaining a final audio signal for an initial audio signal by using a plurality of neural network models, each of which encodes and decodes an input audio signal to generate an output audio signal; calculating the difference between the initial audio signal and the final audio signal by using a first loss function, which calculates the difference between the two signals in the time domain, and a second loss function, which calculates the difference between their mel spectra in the frequency domain; determining a power spectral density and a masking threshold of the initial audio signal by using a psychoacoustic model; calculating the difference between the initial audio signal and the final audio signal through a third loss function, which calculates that difference in the frequency domain based on the relationship between the power spectral density of the initial audio signal and the masking threshold; updating the parameters included in the plurality of neural network models based on the results calculated through the first to third loss functions; and generating, from the initial audio signal by using the neural network models with the updated parameters, a new final audio signal distinct from the final audio signal.

The masking threshold may mask the noise generated during the encoding and decoding performed by the neural network models, in consideration of the sound pressure of the initial audio signal determined by the psychoacoustic model.

The calculating of the difference between the initial audio signal and the final audio signal through the third loss function may include: determining, for each frequency, a weight according to the relationship between the masking threshold and the power spectral density; and calculating, for each frequency and based on the determined weight, the difference between the power spectral density of the initial audio signal and the power spectral density of the final audio signal through the third loss function.

A method of processing an audio signal according to an embodiment of the present invention may include: a) obtaining a final audio signal for an initial audio signal by using a plurality of neural network models, each of which encodes and decodes an input audio signal to generate an output audio signal; b) calculating, in the time domain, the difference between the initial audio signal and the final audio signal; c) calculating, in the frequency domain, the difference between the mel spectra of the initial audio signal and the final audio signal; d) determining a masking threshold by using a psychoacoustic model; e) calculating, in the frequency domain, the difference between the noise of the final audio signal and the masking threshold of the initial audio signal determined through the psychoacoustic model; updating the parameters included in the plurality of neural network models based on the results calculated in steps b), c), and d); and generating a new final audio signal from the initial audio signal by using the neural network models with the updated parameters.

The masking threshold may be a criterion, determined by the psychoacoustic model in consideration of the sound pressure of the initial audio signal, for masking the noise generated during the encoding and decoding performed by the neural network models.

A method of processing an audio signal according to an embodiment of the present invention may include: obtaining a final audio signal for an initial audio signal by using a plurality of neural network models, each of which encodes and decodes an input audio signal to generate an output audio signal; training the plurality of neural network models by using i) a first loss function that calculates, in the time domain, the difference between the initial audio signal and the final audio signal, ii) a second loss function that calculates, in the frequency domain, the difference between the mel spectra of the initial audio signal and the final audio signal, iii) a third loss function that calculates, in the frequency band, the difference between the initial audio signal and the final audio signal based on the relationship between the power spectral density of the initial audio signal and the masking threshold, both determined through a psychoacoustic model, and iv) a fourth loss function that calculates, in the frequency band, the difference between the noise included in the final audio signal and the masking threshold of the initial audio signal determined through the psychoacoustic model; and generating, from the initial audio signal by using the trained neural network models, a new final audio signal distinct from the final audio signal.
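One way to write a training objective that uses all four loss functions is as a combination of the individual terms. The notation below is an assumption introduced for illustration: the symbols L_1 to L_4 stand for the first to fourth loss functions, and the blending weights lambda_k are hypothetical, since the text does not say how the terms are scaled. Setting every lambda_k to 1 gives a plain sum, which is the form described for the two-term case.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_1 \mathcal{L}_1
  + \lambda_2 \mathcal{L}_2
  + \lambda_3 \mathcal{L}_3
  + \lambda_4 \mathcal{L}_4,
\qquad \lambda_k \ge 0 .
```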

An apparatus for processing an audio signal according to an embodiment of the present invention includes a processor, and the processor may: obtain a final audio signal for an initial audio signal by using a plurality of neural network models, each of which encodes and decodes an input audio signal to generate an output audio signal; train the plurality of neural network models by using i) a first loss function that calculates, in the time domain, the difference between the initial audio signal and the final audio signal, ii) a second loss function that calculates, in the frequency domain, the difference between the mel spectra of the initial audio signal and the final audio signal, iii) a third loss function that calculates, in the frequency band, the difference between the initial audio signal and the final audio signal based on the relationship between the power spectral density of the initial audio signal and the masking threshold, both determined through a psychoacoustic model, and iv) a fourth loss function that calculates, in the frequency band, the difference between the noise included in the final audio signal and the masking threshold of the initial audio signal determined through the psychoacoustic model; and generate, from the initial audio signal by using the trained neural network models, a new final audio signal distinct from the final audio signal.

An apparatus for processing an audio signal according to an embodiment of the present invention includes a processor, and the processor may: obtain a final audio signal for an initial audio signal by using a plurality of neural network models, each of which encodes and decodes an input audio signal to generate an output audio signal; calculate the difference between the initial audio signal and the final audio signal by using at least one of i) a first loss function that calculates, in the time domain, the difference between the initial audio signal and the final audio signal, ii) a second loss function that calculates, in the frequency domain, the difference between the mel spectra of the initial audio signal and the final audio signal, iii) a third loss function that calculates, in the frequency band, the difference between the initial audio signal and the final audio signal based on the relationship between the power spectral density of the initial audio signal and the masking threshold, both determined through a psychoacoustic model, and iv) a fourth loss function that calculates, in the frequency band, the difference between the noise included in the final audio signal and the masking threshold of the initial audio signal determined through the psychoacoustic model; train the plurality of neural network models based on the calculated result; and generate, from the initial audio signal by using the trained neural network models, a new final audio signal distinct from the final audio signal.

According to an embodiment of the present invention, when an audio signal is processed with a neural network model that encodes and decodes the audio signal, the loss of the audio signal can be minimized by using a psychoacoustic model while the neural network model is being trained.

In addition, according to an embodiment of the present invention, the quality of the restored audio signal can be improved by training the neural network model that encodes and decodes the audio signal so as to minimize the noise generated during encoding.

Figure 1 is a diagram showing the structure of an apparatus for processing an audio signal according to an embodiment of the present invention.
Figure 2 is a diagram showing the relationship among the neural network models and the structure of a neural network model according to an embodiment of the present invention.
Figure 3 is a diagram showing the structure of the loss functions that calculate the difference between the initial audio signal and the final audio signal generated by the neural network models according to an embodiment of the present invention.
Figure 4 is a diagram showing the noise produced with and without the loss functions according to an embodiment of the present invention.
Figure 5 is a flow chart showing a method of processing an audio signal according to an embodiment of the present invention.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. However, various changes may be made to the embodiments, and the scope of the patent application is not restricted or limited by them. All modifications, equivalents, and substitutes of the embodiments should be understood to fall within the scope of rights.

The terms used in the embodiments are for description only and should not be construed as limiting. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in this application, are not to be interpreted in an idealized or overly formal sense.

In the description with reference to the accompanying drawings, identical components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. In describing the embodiments, detailed descriptions of related known technology are omitted where they would unnecessarily obscure the gist of the embodiments.

Figure 1 is a diagram showing the structure of an apparatus for processing an audio signal according to an embodiment of the present invention.

To reduce the loss of the audio signal that occurs during encoding and decoding, the present invention processes the audio signal by training the neural network model that performs the encoding and decoding with a loss function based on a psychoacoustic model (PAM).

The apparatus for processing an audio signal of the present invention may include a processor, and the processor included in the processing apparatus may perform the method of processing an audio signal. In the present invention, encoding may refer to the process of converting an audio signal into a code vector, and decoding may refer to the process of restoring an audio signal from a code vector.

Here, an audio signal in the broad sense is a concept distinguished from a video signal and refers to a signal that can be identified by hearing when played back; in the narrow sense, it is a concept distinguished from a speech signal and refers to a signal with little or no speech characteristics. In the present invention, an audio signal should be interpreted in the broad sense, and may be understood in the narrow sense when it is used in distinction from a speech signal.

Referring to Figure 1, the neural network models (102-104) used in the present invention are implemented in the processing apparatus. Each neural network model (102-104) encodes its input audio signal to generate a code vector and quantizes the code vector. The neural network model (102-104) then decodes the quantized code vector to generate an output audio signal that restores the input audio signal.
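The encode, quantize, decode sequence of a single model can be illustrated with a deliberately simple stand-in. Everything in this sketch is hypothetical: `ToyCodec` replaces the neural encoder and decoder with a fixed orthonormal linear transform, and the quantizer is plain uniform rounding with step size `step`. It is meant only to show the data flow, not the patent's network or quantizer.

```python
import numpy as np

class ToyCodec:
    """Linear stand-in for one encoder/quantizer/decoder stage (hypothetical)."""

    def __init__(self, frame_size, code_size, step=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Orthonormal columns, so decode(encode(x)) is a projection of x.
        basis, _ = np.linalg.qr(rng.standard_normal((frame_size, code_size)))
        self.basis = basis
        self.step = step

    def encode(self, frame):
        """Audio frame -> code vector."""
        return self.basis.T @ frame

    def quantize(self, code):
        """Round each code coefficient to the nearest multiple of step."""
        return self.step * np.round(code / self.step)

    def decode(self, code):
        """Code vector -> restored audio frame."""
        return self.basis @ code

    def __call__(self, frame):
        return self.decode(self.quantize(self.encode(frame)))
```

In this toy, the only source of reconstruction error when `code_size` equals `frame_size` is the rounding step, which plays the role of the quantization noise discussed in the text.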

Referring to Figure 1, the processing apparatus obtains the final audio signal from the initial audio signal by using a plurality of neural network models (102-104) arranged in sequence. Specifically, the i-th neural network model generates its output audio signal by taking, as its input audio signal, the difference between the output audio signal and the input audio signal of the (i-1)-th neural network model.

For example, the first neural network model (102) generates an output audio signal by taking the initial audio signal input to the processing apparatus as its input audio signal, and the second neural network model (103) generates an output audio signal by taking, as its input audio signal, the difference between the initial audio signal and the output audio signal of the first neural network model.

When there are N neural network models, the N-th neural network model (104) generates an output audio signal by taking, as its input audio signal, the difference between the input audio signal and the output audio signal of the (N-1)-th neural network model. Accordingly, the final audio signal for the initial audio signal input to the processing apparatus corresponds to the sum of the output audio signals of the plurality of neural network models (102-104).
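The cascade described above can be sketched in a few lines. The sketch assumes the "difference" passed to the next model is input minus output (the sign is not stated explicitly, but this is the choice that makes the sum of outputs approximate the initial signal). The function names and the rounding quantizers used as stand-in models are hypothetical.

```python
import numpy as np

def run_cascade(initial, models):
    """Feed each stage the residual left by the previous one and sum the outputs.

    Returns the final signal and the residual left after the last stage.
    """
    stage_input = np.asarray(initial, dtype=float)
    final = np.zeros_like(stage_input)
    for model in models:
        stage_output = model(stage_input)
        final += stage_output
        stage_input = stage_input - stage_output  # residual for the next stage
    return final, stage_input

def make_quantizer(step):
    """Stand-in 'model': rounds its input to multiples of step."""
    return lambda x: step * np.round(x / step)

initial = np.array([0.337, -1.262, 2.718])
stages = [make_quantizer(1.0), make_quantizer(0.1), make_quantizer(0.01)]
final, residual = run_cascade(initial, stages)
```

Because each stage only has to code what the earlier stages missed, the summed output gets closer to the initial signal as stages are added; here the leftover residual is bounded by half the finest step.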

Each neural network model (102-104) may consist of a plurality of layers that contain parameters. In the present invention, a neural network model may correspond to an autoencoder implemented as a convolutional neural network (CNN). However, the neural network model of the present invention may be implemented in various forms and is not limited to this example. The specific structure of the neural network model is described later with reference to Figure 2.

The neural network models are trained to reduce the difference between the final audio signal and the initial audio signal. Specifically, the processing device updates the parameters of the plurality of neural network models so that the result of a loss function that computes the difference between the final audio signal and the initial audio signal is minimized. In other words, the loss function serves as the criterion for training the neural network models.

The processing device may determine the difference between the final audio signal and the initial audio signal by feeding into the loss function the difference between the input audio signal and the output audio signal of each of the neural network models.

In computing, through the loss function, the difference between the input audio signal and the output audio signal of each neural network model, the processing device may use at least one of the following: a first loss function for the time domain; a second loss function for the frequency domain; a third loss function based on the relationship between the power spectral density of the initial audio signal and the masking threshold according to a psychoacoustic model; and a fourth loss function based on the relationship between the masking threshold according to the psychoacoustic model and the noise generated in the quantization process.

As a first example, the processing device may use the first loss function to compute, in the time domain, the difference between the input audio signal and the output audio signal of each neural network model, and sum the results computed for the individual models to determine the difference between the final audio signal and the initial audio signal.

As a second example, the processing device may convert the input audio signal and the output audio signal of each neural network model into mel spectra, use the second loss function to compute, in the frequency domain, the difference between the converted input and output audio signals, and sum the results computed for the individual models to determine the difference between the final audio signal and the initial audio signal.

As a third example, the processing device may obtain a masking threshold for the initial audio signal through the psychoacoustic model. The processing device may also obtain the power spectral density (PSD) of the initial audio signal through the psychoacoustic model.

Here, the masking threshold derives from psychoacoustic theory. It is a criterion for masking the noise generated in the quantization process of each neural network model, and it exploits the property of human hearing that quiet audio components adjacent to a loud audio component are not readily perceived.

That is, in generating the final audio signal, the processing device may mask noise by taking into account the sound pressure level of the initial audio signal determined through the psychoacoustic model and removing, for each frequency, noise whose sound pressure level is below the masking threshold.

Through the third loss function, the processing device may compute the difference between the initial audio signal and the final audio signal in the frequency domain, based on the relationship between the masking threshold and the per-frequency power spectral density of the initial audio signal determined by the psychoacoustic model, and sum the results computed for the individual neural network models to determine the difference between the final audio signal and the initial audio signal.

Here, the psychoacoustic model (PAM) is a model, based on psychoacoustic theory, that is used to compute the masking effect by generating the per-frequency power spectral density of the initial audio signal and determining the masking threshold from that power spectral density. The power spectral density is the distribution of the energy or power of the audio signal over frequency.

As a fourth example, the processing device may use the fourth loss function to compute, in the frequency domain, the difference between the noise of the final audio signal and the masking threshold of the initial audio signal determined through the psychoacoustic model, and sum the results computed for the individual neural network models to determine the difference between the final audio signal and the initial audio signal.

The specific calculation of the third and fourth loss functions is described later with reference to FIG. 3.

The processing device may determine the difference between the final audio signal and the initial audio signal using at least one of the first to fourth loss functions. The processing device may then update the parameters of the plurality of neural network models so as to minimize the difference computed through the selected loss function or functions.

The processing device may obtain the final audio signal by processing the initial audio signal with the updated neural network models.

FIG. 2 is a diagram showing the relationship among the neural network models and the structure of a neural network model according to an embodiment of the present invention.

FIG. 2(a) shows the relationship among the plurality of neural network models used in the present invention. FIG. 2(b) shows the structure of a single neural network model.

In FIG. 2(a) and (b), s denotes the initial audio signal, s⁽ⁱ⁾ denotes the input audio signal of the i-th neural network model, and ŝ⁽ⁱ⁾ denotes the output audio signal of the i-th neural network model. As shown in FIG. 2(a), the i-th neural network model receives, as its input audio signal, the difference (s⁽ⁱ⁻¹⁾ - ŝ⁽ⁱ⁻¹⁾) between the input audio signal and the output audio signal of the (i-1)-th neural network model, and generates the output audio signal ŝ⁽ⁱ⁾.

Each neural network model, which receives the input audio signal s⁽ⁱ⁾, may include an encoder that performs encoding, a code (h⁽ⁱ⁾) obtained by quantizing the code vector generated by encoding the input audio signal, and a decoder that performs decoding. The encoder and the decoder may correspond to layers of the neural network model.

Referring to FIG. 2(b), the encoder of the neural network model encodes the input audio signal frame by frame to generate a code vector. For example, as shown in FIG. 2(b), the encoder may use bottleneck ResNet blocks, which are based on ResNet, a CNN-based classification model.

The neural network model may generate a quantized code (h⁽ⁱ⁾) by quantizing and entropy-coding the code vector (z⁽ⁱ⁾) produced by the encoder. The decoder of the neural network model may then use the quantized code (h⁽ⁱ⁾) to generate an output audio signal (ŝ⁽ⁱ⁾) that reconstructs the input audio signal (s⁽ⁱ⁾). As with the encoder, bottleneck ResNet blocks may be used for the decoder. However, the architecture used in the neural network model is not limited to ResNet.

For example, a neural network model to which ResNet is applied may encode and decode the input audio signal according to the values listed in Table 1 below.

In Table 1, the input shape and output shape of each layer of the neural network model are given as (frame length, channels), and the kernel shape is given as (kernel size, input channels, output channels).

FIG. 3 is a diagram showing the structure of the loss function that computes the difference between the initial audio signal and the final audio signal generated by the neural network models according to an embodiment of the present invention.

Referring to FIG. 3, when the processing device processes the initial audio signal (s) through the plurality of neural network models and obtains the final audio signal (ŝ⁽¹⁾ + ŝ⁽²⁾ + ... + ŝ⁽ᴺ⁾) as the sum of the output audio signals of the individual models, it may feed the difference between the input audio signal and the output audio signal of each neural network model into the loss function (302).

The processing device may update the parameters of each neural network model using the sum (301) of the differences between the input and output audio signals of the neural network models. Through process (307), the processing device generates the final audio signal for the initial audio signal while updating the parameters so that the result of the loss function (302) is minimized, and can then obtain the final audio signal with the neural network models whose parameters minimize the result of the loss function (302).

That is, the processing device trains the plurality of neural network models by updating their parameters so that the result of the loss function (302) is minimized, and the neural network models whose parameters minimize the result of the loss function (302) correspond to the trained neural network models.

The loss function (302) may include a first loss function (303) for the time domain, a second loss function (304) for the frequency domain, a third loss function (305) based on the relationship between the power spectral density of the initial audio signal and the masking threshold according to the psychoacoustic model, and a fourth loss function (306) based on the relationship between the masking threshold according to the psychoacoustic model and the noise generated in the quantization process.

That is, the processing device may obtain the result of the loss function (302) for the difference between the initial audio signal and the final audio signal using at least one of the first to fourth loss functions (303-306). The loss function (302) built from at least one of the first to fourth loss functions (303-306) may be defined according to Equation 1 below.

Here, ℒ denotes the loss function (302) determined from the first to fourth loss functions (ℒ₁, ℒ₂, ℒ₃, ℒ₄), and λ₁, λ₂, λ₃ and λ₄ are weights that either select which of the first to fourth loss functions are used or balance them, since the loss functions (ℒ₁, ℒ₂, ℒ₃, ℒ₄) are expressed in different units.

For example, when λ₁, λ₂ and λ₃ are 0, the processing device computes the difference between the initial audio signal and the final audio signal using only the fourth loss function (ℒ₄). Alternatively, when λ₁, λ₂, λ₃ and λ₄ are all greater than 0, the processing device computes the difference between the initial audio signal and the final audio signal by summing the weighted results of the first to fourth loss functions (ℒ₁, ℒ₂, ℒ₃, ℒ₄).
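Equation 1 is not reproduced in this text. From the description of the weights, it presumably is a weighted sum of the four loss terms, total = λ₁·ℒ₁ + λ₂·ℒ₂ + λ₃·ℒ₃ + λ₄·ℒ₄. The sketch below illustrates that reading only: the function name `total_loss` and the per-term loss values are hypothetical, while the weights 60, 5, 1 and 5 are the ones the description reports for FIG. 4(b).

```python
def total_loss(losses, weights):
    """Weighted sum of the four loss terms; a zero weight switches a term off."""
    if len(losses) != len(weights):
        raise ValueError("need exactly one weight per loss term")
    return sum(w * l for w, l in zip(weights, losses))

# Hypothetical values of the first to fourth loss terms for one training batch.
losses = [0.02, 0.5, 1.5, 3.0]

# Only the fourth loss function is active.
only_fourth = total_loss(losses, weights=[0, 0, 0, 1])

# All four loss functions are active, using the weights reported for FIG. 4(b).
all_terms = total_loss(losses, weights=[60, 5, 1, 5])
```

Setting a weight to zero removes the corresponding term, which is how the same formula covers both the selection and the balancing roles described above.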

The first loss function (303) computes the difference between the initial audio signal and the final audio signal in the time domain. That is, the processing device may feed the difference between the input audio signal and the output audio signal of each neural network model into the first loss function (303) to compute the time-domain difference between the initial audio signal and the final audio signal. The first loss function (303) computes the difference between the input audio signal and the output audio signal according to Equation 2 below.

In Equation 2, T corresponds to the time length of the frame that is the unit of encoding and decoding for the neural network model, and t corresponds to a specific time in the initial audio signal. That is, s_t⁽ⁱ⁾ and ŝ_t⁽ⁱ⁾ denote the input audio signal and the output audio signal at time t.

The index i denotes the i-th of the N successive neural network models. For each neural network model (i) of the N models, the first loss function (303) sums, over every time (t), the squared time-domain difference between the input audio signal and the output audio signal, and outputs the total.

Consequently, the smaller the result of the first loss function (303), the more accurately the final audio signal reconstructs the initial audio signal, so the processing device trains the neural network models to minimize the result of the first loss function (303).
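Equation 2 is not reproduced in this text. From the definitions above, it presumably sums the squared difference (s_t⁽ⁱ⁾ - ŝ_t⁽ⁱ⁾)² over all models i and all times t in the frame. A minimal sketch of that reading follows; the function name `time_domain_loss` and the sample values are hypothetical.

```python
import numpy as np

def time_domain_loss(stage_inputs, stage_outputs):
    """Sum over models i and samples t of the squared input/output difference."""
    return float(sum(np.sum((s_i - s_hat_i) ** 2)
                     for s_i, s_hat_i in zip(stage_inputs, stage_outputs)))

# Two hypothetical models, one frame of T = 4 samples each.
stage_inputs = [np.array([1.0, 0.0, -1.0, 0.5]), np.array([0.1, 0.0, -0.1, 0.0])]
stage_outputs = [np.array([0.9, 0.0, -0.9, 0.5]), np.array([0.1, 0.0, 0.0, 0.0])]

loss = time_domain_loss(stage_inputs, stage_outputs)
perfect = time_domain_loss(stage_inputs, stage_inputs)
```

A perfect reconstruction in every stage gives a loss of zero, which matches the statement that a smaller result means a more accurate reconstruction.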

The second loss function (304) computes, in the frequency domain, the difference between the mel spectra of the initial audio signal and the final audio signal. Specifically, the processing device converts the initial audio signal and the final audio signal into mel spectra. A mel spectrum is obtained by converting the frequency units of an audio signal into mel units.
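The text does not say which hertz-to-mel mapping is used. As an illustration only, the sketch below uses one widely used convention, mel = 2595 · log10(1 + f / 700); the choice of this particular formula and the function names are assumptions, not something the description specifies.

```python
import numpy as np

def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mel units (assumed convention: 2595 * log10(1 + f / 700))."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel under the same assumed convention."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

freqs_hz = np.array([0.0, 500.0, 1000.0, 4000.0, 8000.0])
mels = hz_to_mel(freqs_hz)
round_trip = mel_to_hz(mels)
```

The mapping is roughly linear at low frequencies and compressive at high frequencies, so a mel-domain loss weights low-frequency detail more heavily than a linear-frequency loss would.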

That is, the processing device may feed the difference between the mel spectra of the input audio signal and the output audio signal of each neural network model into the second loss function (304) to compute the frequency-domain difference between the initial audio signal and the final audio signal. The second loss function (304) computes the difference between the input audio signal and the output audio signal according to Equation 3 below.

In Equation 3, F corresponds to the frequency range of the frame that is the unit of encoding and decoding for the neural network model, and f corresponds to a specific frequency (resolution bin) within F. y_f⁽ⁱ⁾ and ŷ_f⁽ⁱ⁾ denote the mel spectrum of the input audio signal and the mel spectrum of the output audio signal at frequency f.

The index i denotes the i-th of the N successive neural network models. For each neural network model (i) of the N models, the second loss function (304) sums, over every frequency (f), the squared difference between the mel spectra of the input audio signal and the output audio signal, and outputs the total.
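Equation 3 is not reproduced in this text. Based on the definitions above (mel spectra y_f⁽ⁱ⁾ and ŷ_f⁽ⁱ⁾, frequency range F, N models), one consistent reconstruction is the following; the summation limits and the symbol for the second loss function are assumptions:

```latex
\mathcal{L}_2 \;=\; \sum_{i=1}^{N} \sum_{f=1}^{F} \left( y_f^{(i)} - \hat{y}_f^{(i)} \right)^{2}
```

It has the same form as the time-domain loss, with mel-spectrum bins taking the place of time samples.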

Consequently, the smaller the result of the second loss function (304), the more accurately the final audio signal reconstructs the initial audio signal, so the processing device trains the neural network models to minimize the result of the second loss function (304).

The third loss function (305) computes the difference between the initial audio signal and the final audio signal in the frequency domain, based on the relationship between the power spectral density of the initial audio signal and the masking threshold determined through the psychoacoustic model.

To use the third loss function (305), the processing device may obtain the power spectral density and the masking threshold of the initial audio signal through the psychoacoustic model. Through the third loss function (305), the processing device may determine a weight for each frequency from the relationship between the masking threshold and the power spectral density, and then, using that weight, compute for each frequency the difference between the power spectral density of the initial audio signal and that of the final audio signal.

Specifically, the processing device may determine the weight that expresses the relationship between the power spectral density of the initial audio signal and the masking threshold according to Equation 4 below.

In Equation 4 above, w denotes the weight that expresses, at a specific frequency, the relationship between the power spectral density and the masking threshold. m denotes the masking threshold, and p denotes the power spectral density of the initial audio signal.

According to Equation 4, the larger the power spectral density of the initial audio signal is relative to the masking threshold at a given frequency, the harder that component is to reconstruct, so the processing device assigns it a higher weight. Conversely, the larger the masking threshold is relative to the power spectral density of the initial audio signal, the lower the weight.
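Equation 4 is not reproduced in this text. The sketch below uses one form that behaves as described, w = log10(10^(0.1·p) / 10^(0.1·m) + 1) with p and m in decibels. That exact formula is an assumption chosen to match the stated behavior, and the function name `masking_weight` and the sample values are hypothetical.

```python
import numpy as np

def masking_weight(psd_db, mask_db):
    """Assumed form of the weight: log10(10^(0.1 p) / 10^(0.1 m) + 1), with p and m in dB."""
    p = np.asarray(psd_db, dtype=float)
    m = np.asarray(mask_db, dtype=float)
    # 10^(0.1 p) / 10^(0.1 m) is the signal-to-mask ratio on a linear power scale.
    return np.log10(10.0 ** (0.1 * (p - m)) + 1.0)

# Three hypothetical frequency bins: PSD above, equal to, and below the masking threshold.
psd_db = np.array([60.0, 40.0, 20.0])
mask_db = np.array([30.0, 40.0, 50.0])
w = masking_weight(psd_db, mask_db)
```

With this form the weight grows with the signal-to-mask ratio and approaches zero, without becoming negative, for components far below the threshold.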

The processing device then uses the per-frequency weights in the third loss function (305) to compute the difference between the power spectral density of the initial audio signal and that of the final audio signal. Specifically, the third loss function (305) is determined according to Equation 5.

In Equation 5 above, f denotes a specific frequency, and x_f⁽ⁱ⁾ and x̂_f⁽ⁱ⁾ denote the power spectral density of the input audio signal and of the output audio signal of the neural network model, respectively. w_f denotes the weight determined for that frequency.

The index i denotes the i-th of the N successive neural network models. For each neural network model (i) of the N models, the third loss function (305) multiplies, at every frequency (f), the squared difference between the power spectral densities of the input audio signal and the output audio signal by the weight, sums the products, and outputs the total.
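Equation 5 is not reproduced in this text. Based on the definitions above (power spectral densities x_f⁽ⁱ⁾ and x̂_f⁽ⁱ⁾, per-frequency weight w_f, N models), one consistent reconstruction is the following; the summation limits and the symbol for the third loss function are assumptions:

```latex
\mathcal{L}_3 \;=\; \sum_{i=1}^{N} \sum_{f} w_f \left( x_f^{(i)} - \hat{x}_f^{(i)} \right)^{2}
```

Compared with an unweighted spectral loss, frequencies whose power spectral density lies well above the masking threshold dominate the sum.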

Accordingly, by using the third loss function (305) with weights derived from the psychoacoustic model, the processing device gives components that are hard to reconstruct more importance than other components, which can improve how faithfully the initial audio signal is reconstructed.

However, the output audio signal of a neural network model trained with the third loss function (305) may fail to mask large noise. The processing device can address this problem with the fourth loss function (306).

The fourth loss function (306) computes, in the frequency domain, the difference between the noise contained in the final audio signal and the masking threshold of the initial audio signal determined through the psychoacoustic model. Here, noise means the logarithmic power spectral density (logarithmic PSD) of the difference between the initial audio signal and the final audio signal.

The processing device may compute the difference between the initial audio signal and the final audio signal based on the relationship between the previously obtained masking threshold and the noise generated while encoding and decoding the initial audio signal.

Specifically, the processing device identifies, in the final audio signal, the noise generated while encoding and decoding the initial audio signal, and uses the fourth loss function (306) to compute, for each frequency, the difference between the masking threshold and the noise contained in the final audio signal, as in Equation 6 below.

In Equation 6 above, n_f⁽ⁱ⁾ and m_f⁽ⁱ⁾ denote the noise and the masking threshold at a specific frequency (f), respectively. For each neural network model, the processing device may find the frequency at which the difference between the masking threshold and the noise is smallest, and thereby determine the minimum difference between the masking threshold and the noise. For each neural network model, the fourth loss function (306) takes the sum over frequencies of the difference between the noise and the masking threshold, subtracts the minimum difference so determined, and then outputs the sum of these per-model results.
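Equation 6 is not reproduced in this text. The sketch below follows the verbal description: per model, sum the amount by which the noise exceeds the masking threshold over all frequencies, then subtract the smallest margin between threshold and noise. Clipping both differences at zero is an assumption not stated in the text, added here so that noise already below the threshold adds nothing to the first term. The function name `noise_to_mask_loss` and all values are hypothetical.

```python
import numpy as np

def noise_to_mask_loss(noise_db, mask_db):
    """Per model: sum_f max(n_f - m_f, 0) - min_f max(m_f - n_f, 0); then summed over models."""
    total = 0.0
    for n, m in zip(noise_db, mask_db):
        n = np.asarray(n, dtype=float)
        m = np.asarray(m, dtype=float)
        excess = np.maximum(n - m, 0.0)      # noise above the threshold (potentially audible)
        headroom = np.maximum(m - n, 0.0)    # margin by which noise stays below the threshold
        total += float(np.sum(excess) - np.min(headroom))
    return total

# One hypothetical model with three frequency bins (values in dB).
mask_db = [np.array([30.0, 40.0, 50.0])]
audible = noise_to_mask_loss([np.array([35.0, 38.0, 45.0])], mask_db)
masked = noise_to_mask_loss([np.array([25.0, 30.0, 40.0])], mask_db)
```

In this sketch the first term penalizes any bin where noise pokes above the threshold, while the subtracted term rewards widening the tightest margin, so minimizing the loss pushes the whole noise spectrum under the masking curve.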

That is, by updating the parameters of the neural network models to minimize the result of the fourth loss function (306), the processing device can reduce the noise generated while encoding and decoding the initial audio signal.

By training the neural network models to minimize the result of at least one of the first to fourth loss functions (303-306), the processing device can generate the final audio signal for the initial audio signal with the trained neural network models.

FIG. 4 is a diagram showing the resulting noise depending on which loss functions are used, according to an embodiment of the present invention.

FIG. 4(a) is a graph showing the relationship between the noise of the final audio signal, obtained with neural network models trained using the first to third loss functions, and the masking threshold of the initial audio signal.

FIG. 4(b) is a graph showing the relationship between the noise of the final audio signal, obtained with neural network models trained using the first to fourth loss functions, and the masking threshold of the initial audio signal.

Specifically, FIG. 4(a) corresponds to the case where, in Equation 1 above, λ₁, λ₂ and λ₃ are greater than 0 and λ₄ is 0. FIG. 4(b) corresponds to the case where λ₁, λ₂, λ₃ and λ₄ are all greater than 0, namely λ₁ = 60, λ₂ = 5, λ₃ = 1 and λ₄ = 5.

Referring to FIG. 4(a), in section 401 noise that exceeds the masking threshold remains unmasked. In this case, the quality of the final audio signal is degraded by noise that was not present in the initial audio signal.

Referring to FIG. 4(b), because the neural network models are trained on the relationship between the noise and the masking threshold through the fourth loss function, noise exceeding the masking threshold, as seen in FIG. 4(a), does not occur.

FIG. 5 is a flowchart of a method of processing an audio signal according to an embodiment of the present invention.

In step 501, the processing device obtains a final audio signal for an initial audio signal using a plurality of neural network models that each encode and decode an input audio signal to generate an output audio signal.

Here, the neural network models (102-104) are cascaded: the i-th neural network model takes as its input audio signal the difference between the input audio signal and the output audio signal of the (i-1)-th neural network model, and generates an output audio signal from it.

In step 502, the processing device may feed the difference between the input audio signal and the output audio signal of each neural network model into the first loss function to compute the time-domain difference between the initial audio signal and the final audio signal.

In step 503, the processing device may feed the difference between the mel spectra of the input audio signal and the output audio signal of each neural network model into the second loss function to compute the frequency-domain difference between the initial audio signal and the final audio signal.

In step 504, the processing device obtains the power spectral density and the masking threshold of the initial audio signal through the psychoacoustic model in order to use the third loss function.

Then, through the third loss function, the processing device may determine a weight for each frequency according to the relationship between the masking threshold and the power spectral density and, based on the determined weight, calculate for each frequency the difference between the power spectral density of the initial audio signal and the power spectral density of the final audio signal.
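A weighted power-spectral-density comparison of this kind can be sketched as follows. The sketch rests on several assumptions that are not taken from this section: the weight is computed as log10(10^(0.1(p − m)) + 1) from the power spectral density p and masking threshold m in dB, the squared difference is taken between dB-scale spectra, and the masking threshold is supplied as a ready-made array (here a flat placeholder) rather than derived from a real psychoacoustic model. The function names are hypothetical.

```python
import numpy as np


def psd_db(x, n_fft):
    """Power spectral density of one Hann-windowed frame, in dB."""
    spec = np.fft.rfft(x * np.hanning(n_fft), n=n_fft)
    return 10.0 * np.log10(np.abs(spec) ** 2 + 1e-12)


def perceptual_weight(psd, mask):
    """Assumed weight: large where the PSD exceeds the mask, near 0 where the mask dominates."""
    return np.log10(10.0 ** (0.1 * (psd - mask)) + 1.0)


def weighted_psd_loss(initial, final, mask, n_fft):
    """Assumed third loss: per-frequency squared PSD difference, scaled by the weight."""
    p_init = psd_db(initial, n_fft)
    p_final = psd_db(final, n_fft)
    w = perceptual_weight(p_init, mask)
    return float(np.sum(w * (p_init - p_final) ** 2))


n_fft = 512
rng = np.random.default_rng(0)
t = np.arange(n_fft)
initial = np.sin(2 * np.pi * 0.05 * t) + 0.01 * rng.standard_normal(n_fft)
final = initial + 0.05 * rng.standard_normal(n_fft)
mask = np.zeros(n_fft // 2 + 1)  # placeholder threshold (0 dB at every frequency)

w_audible = perceptual_weight(np.array([20.0]), np.array([0.0]))[0]
w_masked = perceptual_weight(np.array([0.0]), np.array([20.0]))[0]
print(w_audible, w_masked, weighted_psd_loss(initial, final, mask, n_fft))
```

With this weight, frequencies where the signal sits well above the masking threshold contribute most to the loss, while frequencies where the threshold dominates contribute little.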

In step 505, the audio signal processing device identifies, in the final audio signal, the noise generated during the encoding and decoding of the initial audio signal, and calculates, through the fourth loss function as in Equation 6, the difference between the masking threshold and the noise included in the final audio signal for each frequency.
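The comparison between coding noise and the masking threshold can be sketched as follows. Equation 6 is not reproduced in this section, so the form used here is an assumption rather than the patent's formula: the noise is taken as the initial signal minus the final signal, and the loss averages only the amount by which the noise spectrum exceeds the threshold at each frequency. The masking threshold is again a placeholder array, and all names are hypothetical.

```python
import numpy as np


def psd_db(x, n_fft):
    """Power spectral density of one Hann-windowed frame, in dB."""
    spec = np.fft.rfft(x * np.hanning(n_fft), n=n_fft)
    return 10.0 * np.log10(np.abs(spec) ** 2 + 1e-12)


def noise_mask_loss(initial, final, mask_db, n_fft):
    """Assumed fourth loss: mean excess of the noise PSD over the masking threshold."""
    noise_db = psd_db(initial - final, n_fft)      # noise introduced by coding
    excess = np.maximum(noise_db - mask_db, 0.0)   # only audible noise is penalized
    return float(np.mean(excess))


n_fft = 512
rng = np.random.default_rng(0)
initial = rng.standard_normal(n_fft)
noisy_final = initial + rng.standard_normal(n_fft)
mask_db = np.zeros(n_fft // 2 + 1)  # placeholder threshold (0 dB everywhere)

print(noise_mask_loss(initial, initial, mask_db, n_fft))
print(noise_mask_loss(initial, noisy_final, mask_db, n_fft))
```

Under this assumed form, noise that stays below the threshold at every frequency yields a loss of zero, which matches the idea that masked noise need not be reduced further.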

In step 506, the processing device may train the neural network models so as to minimize the result of at least one of the first to fourth loss functions. Specifically, the processing device may update the parameters included in the plurality of neural network models so as to minimize the difference between the final audio signal and the initial audio signal calculated through at least one of the first to fourth loss functions.

For example, the processing device may determine the difference between the final audio signal and the initial audio signal using only the first and second loss functions, using only the third loss function, using only the fourth loss function, or using all of the first to fourth loss functions.
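The selection of loss functions described above can be sketched as a weighted sum over whichever terms are enabled. This is an illustrative assumption about how the terms might be combined; the names `combined_loss`, `time`, `mel`, `psd_weighted`, and `noise_mask`, the unit weights, and the loss values are placeholders, not measured results.

```python
def combined_loss(losses, weights):
    """Weighted sum over the loss terms whose names appear in `weights`."""
    unknown = set(weights) - set(losses)
    if unknown:
        raise KeyError(f"no loss value for: {sorted(unknown)}")
    return sum(weights[name] * losses[name] for name in weights)


# Placeholder values standing in for the four loss functions on one batch.
losses = {"time": 0.02, "mel": 0.5, "psd_weighted": 1.2, "noise_mask": 0.3}

only_first_second = combined_loss(losses, {"time": 1.0, "mel": 1.0})
only_third = combined_loss(losses, {"psd_weighted": 1.0})
all_four = combined_loss(
    losses,
    {"time": 1.0, "mel": 1.0, "psd_weighted": 1.0, "noise_mask": 1.0},
)
print(only_first_second, only_third, all_four)
```

The scalar returned by `combined_loss` is the quantity a training loop would minimize when updating the model parameters.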

In step 507, the processing device may generate a final audio signal by processing the initial audio signal using the plurality of updated neural network models.

Meanwhile, the method according to the present invention may be written as a program executable on a computer and may be implemented on various recording media such as magnetic storage media, optical reading media, and digital storage media.

Implementations of the various techniques described herein may be realized in digital electronic circuitry, or in computer hardware, firmware, software, or combinations thereof. Implementations may be realized as a computer program product, that is, a computer program tangibly embodied in an information carrier, such as a machine-readable storage device (computer-readable medium) or a propagated signal, for processing by, or for controlling the operation of, a data processing apparatus such as a programmable processor, a computer, or multiple computers. A computer program, such as the computer program(s) described above, may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be processed on one computer or on multiple computers at one site, or distributed across multiple sites and interconnected by a communication network.

Processors suitable for processing a computer program include, by way of example, both general-purpose and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor receives instructions and data from a read-only memory, a random access memory, or both. Elements of a computer may include at least one processor that executes instructions and one or more memory devices that store instructions and data. Generally, a computer may also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or may be coupled to receive data from them, transmit data to them, or both. Information carriers suitable for embodying computer program instructions and data include, by way of example, semiconductor memory devices; magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and ROM (Read Only Memory), RAM (Random Access Memory), flash memory, EPROM (Erasable Programmable ROM), and EEPROM (Electrically Erasable Programmable ROM). The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.

In addition, a computer-readable medium may be any available medium that can be accessed by a computer, and may include both computer storage media and transmission media.

Although this specification contains details of many specific implementations, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments, either separately or in any suitable sub-combination. Furthermore, although features may be described as operating in a particular combination and may even be initially claimed as such, one or more features from a claimed combination may in some cases be excluded from that combination, and the claimed combination may be changed to a sub-combination or a variation of a sub-combination.

Likewise, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, in order to obtain desirable results. In certain cases, multitasking and parallel processing may be advantageous. In addition, the separation of the various device components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and devices may generally be integrated together in a single software product or packaged into multiple software products.

Meanwhile, the embodiments of the present invention disclosed in this specification and the drawings merely present specific examples to aid understanding and are not intended to limit the scope of the present invention. It is obvious to those of ordinary skill in the art to which the present invention pertains that, in addition to the embodiments disclosed herein, other modifications based on the technical idea of the present invention can be implemented.

101: Audio processing device
102-104: Neural network models

Claims

(Deleted)

A processing method comprising:
obtaining a final audio signal for an initial audio signal using a plurality of neural network models that encode and decode an input audio signal to generate an output audio signal;
obtaining a power spectral density and a masking threshold for the initial audio signal through a psychoacoustic model;
determining a weight according to the relationship between the masking threshold and the power spectral density for each frequency;
calculating, for each frequency, a difference between the power spectral density of the initial audio signal and the power spectral density of the final audio signal based on the determined weight;
training the neural network models according to the calculated result; and
generating, from the initial audio signal, a new final audio signal that is distinct from the final audio signal using the trained neural network models,
wherein the plurality of neural network models are in a consecutive relationship in which the i-th neural network model generates an output audio signal by using, as its input audio signal, the difference between the output audio signal of the (i-1)-th neural network model and the input audio signal of the (i-1)-th neural network model,
wherein the masking threshold is a criterion for masking noise generated during the encoding and decoding performed by the neural network models, in consideration of the sound pressure of the initial audio signal determined by the psychoacoustic model, and
wherein the final audio signal corresponds to an audio signal that is the sum of the output audio signals of the respective neural network models.

The processing method of claim 5,
wherein training the neural network models comprises updating parameters included in the neural network models so that the calculated result is minimized.

(Deleted)

The processing method of claim 5,
wherein determining the weight comprises determining a higher weight at a specific frequency the greater the power spectral density of the initial audio signal is relative to the masking threshold, and determining a lower weight at the specific frequency the greater the masking threshold is relative to the power spectral density of the initial audio signal.

A processing method comprising:
obtaining a final audio signal for an initial audio signal using a plurality of neural network models that encode and decode an input audio signal to generate an output audio signal;
obtaining a masking threshold for the initial audio signal through a psychoacoustic model;
identifying, in the final audio signal, noise generated during the encoding and decoding of the initial audio signal;
calculating, for each frequency, a difference between the masking threshold and the noise included in the final audio signal;
training the neural network models according to the calculated result; and
generating, from the initial audio signal, a new final audio signal that is distinct from the final audio signal using the trained neural network models,
wherein the plurality of neural network models are in a consecutive relationship in which the i-th neural network model generates an output audio signal by using, as its input audio signal, the difference between the output audio signal of the (i-1)-th neural network model and the input audio signal of the (i-1)-th neural network model,
wherein the masking threshold is a criterion for masking noise generated during the encoding and decoding performed by the neural network models, in consideration of the sound pressure of the initial audio signal determined by the psychoacoustic model, and
wherein the final audio signal corresponds to an audio signal that is the sum of the output audio signals of the respective neural network models.

The processing method of claim 9,
wherein training the neural network models comprises updating parameters included in the neural network models so that the calculated result is minimized.

(Deleted)

A processing method comprising:
obtaining a final audio signal for an initial audio signal using a plurality of neural network models that encode and decode an input audio signal to generate an output audio signal;
determining a power spectral density and a masking threshold of the initial audio signal using a psychoacoustic model;
calculating the difference between the initial audio signal and the final audio signal through i) a first loss function that calculates the difference between the initial audio signal and the final audio signal in the time domain, ii) a second loss function that calculates the difference in mel spectrum between the initial audio signal and the final audio signal in the frequency domain, and iii) a third loss function that calculates the difference between the initial audio signal and the final audio signal in the frequency domain based on the relationship between the power spectral density of the initial audio signal and the masking threshold;
updating parameters included in the plurality of neural network models based on the results calculated through the first to third loss functions; and
generating, from the initial audio signal, a new final audio signal that is distinct from the final audio signal using the neural network models with updated parameters,
wherein the plurality of neural network models are in a consecutive relationship in which the i-th neural network model generates an output audio signal by using, as its input audio signal, the difference between the output audio signal of the (i-1)-th neural network model and the input audio signal of the (i-1)-th neural network model,
wherein the masking threshold is a criterion for masking noise generated during the encoding and decoding performed by the neural network models, in consideration of the sound pressure of the initial audio signal determined by the psychoacoustic model, and
wherein the final audio signal corresponds to an audio signal that is the sum of the output audio signals of the respective neural network models.

(Deleted)

The processing method of claim 12,
wherein calculating the difference between the initial audio signal and the final audio signal through the third loss function comprises:
determining a weight according to the relationship between the masking threshold and the power spectral density for each frequency; and
calculating, for each frequency, the difference between the power spectral density of the initial audio signal and the power spectral density of the final audio signal through the third loss function based on the determined weight.

The processing method of claim 14,
wherein determining the weight comprises determining a higher weight at a specific frequency the greater the power spectral density of the initial audio signal is relative to the masking threshold, and determining a lower weight at the specific frequency the greater the masking threshold is relative to the power spectral density of the initial audio signal.

A processing method comprising:
a) obtaining a final audio signal for an initial audio signal using a plurality of neural network models that encode and decode an input audio signal to generate an output audio signal;
b) calculating the difference between the initial audio signal and the final audio signal in the time domain;
c) calculating the difference in mel spectrum between the initial audio signal and the final audio signal in the frequency domain;
d) determining a masking threshold using a psychoacoustic model;
e) calculating, in the frequency domain, the difference between the noise of the final audio signal and the masking threshold of the initial audio signal determined through the psychoacoustic model;
updating parameters included in the plurality of neural network models based on the results calculated in steps b), c), and d); and
generating a new final audio signal from the initial audio signal using the neural network models with updated parameters,
wherein the plurality of neural network models are in a consecutive relationship in which the i-th neural network model generates an output audio signal by using, as its input audio signal, the difference between the output audio signal of the (i-1)-th neural network model and the input audio signal of the (i-1)-th neural network model,
wherein the masking threshold is a criterion for masking noise generated during the encoding and decoding performed by the neural network models, in consideration of the sound pressure of the initial audio signal determined by the psychoacoustic model, and
wherein the final audio signal corresponds to an audio signal that is the sum of the output audio signals of the respective neural network models.

(Deleted)

A processing method comprising:
obtaining a final audio signal for an initial audio signal using a plurality of neural network models that encode and decode an input audio signal to generate an output audio signal;
training the plurality of neural network models using i) a first loss function that calculates the difference between the initial audio signal and the final audio signal in the time domain, ii) a second loss function that calculates the difference in mel spectrum between the initial audio signal and the final audio signal in the frequency domain, iii) a third loss function that calculates the difference between the initial audio signal and the final audio signal in a frequency band based on the relationship between the power spectral density of the initial audio signal and a masking threshold determined through a psychoacoustic model, and iv) a fourth loss function that calculates, in the frequency band, the difference between the noise included in the final audio signal and the masking threshold of the initial audio signal determined through the psychoacoustic model; and
generating, from the initial audio signal, a new final audio signal that is distinct from the final audio signal using the trained neural network models,
wherein the plurality of neural network models are in a consecutive relationship in which the i-th neural network model generates an output audio signal by using, as its input audio signal, the difference between the output audio signal of the (i-1)-th neural network model and the input audio signal of the (i-1)-th neural network model,
wherein the masking threshold is a criterion for masking noise generated during the encoding and decoding performed by the neural network models, in consideration of the sound pressure of the initial audio signal determined by the psychoacoustic model, and
wherein the final audio signal corresponds to an audio signal that is the sum of the output audio signals of the respective neural network models.

An audio signal processing device comprising a processor,
wherein the processor is configured to:
obtain a final audio signal for an initial audio signal using a plurality of neural network models that encode and decode an input audio signal to generate an output audio signal;
train the plurality of neural network models using i) a first loss function that calculates the difference between the initial audio signal and the final audio signal in the time domain, ii) a second loss function that calculates the difference in mel spectrum between the initial audio signal and the final audio signal in the frequency domain, iii) a third loss function that calculates the difference between the initial audio signal and the final audio signal in a frequency band based on the relationship between the power spectral density of the initial audio signal and a masking threshold determined through a psychoacoustic model, and iv) a fourth loss function that calculates, in the frequency band, the difference between the noise included in the final audio signal and the masking threshold of the initial audio signal determined through the psychoacoustic model; and
generate, from the initial audio signal, a new final audio signal that is distinct from the final audio signal using the trained neural network models,
wherein the plurality of neural network models are in a consecutive relationship in which the i-th neural network model generates an output audio signal by using, as its input audio signal, the difference between the output audio signal of the (i-1)-th neural network model and the input audio signal of the (i-1)-th neural network model,
wherein the masking threshold is a criterion for masking noise generated during the encoding and decoding performed by the neural network models, in consideration of the sound pressure of the initial audio signal determined by the psychoacoustic model, and
wherein the final audio signal corresponds to an audio signal that is the sum of the output audio signals of the respective neural network models.

An audio signal processing device comprising a processor,
wherein the processor is configured to:
obtain a final audio signal for an initial audio signal using a plurality of neural network models that encode and decode an input audio signal to generate an output audio signal;
calculate the difference between the initial audio signal and the final audio signal using at least one loss function among i) a first loss function that calculates the difference between the initial audio signal and the final audio signal in the time domain, ii) a second loss function that calculates the difference in mel spectrum between the initial audio signal and the final audio signal in the frequency domain, iii) a third loss function that calculates the difference between the initial audio signal and the final audio signal in a frequency band based on the relationship between the power spectral density of the initial audio signal and a masking threshold determined through a psychoacoustic model, and iv) a fourth loss function that calculates, in the frequency band, the difference between the noise included in the final audio signal and the masking threshold of the initial audio signal determined through the psychoacoustic model;
train the plurality of neural network models based on the calculated result; and
generate, from the initial audio signal, a new final audio signal that is distinct from the final audio signal using the trained neural network models,
wherein the plurality of neural network models are in a consecutive relationship in which the i-th neural network model generates an output audio signal by using, as its input audio signal, the difference between the output audio signal of the (i-1)-th neural network model and the input audio signal of the (i-1)-th neural network model,
wherein the masking threshold is a criterion for masking noise generated during the encoding and decoding performed by the neural network models, in consideration of the sound pressure of the initial audio signal determined by the psychoacoustic model, and
wherein the final audio signal corresponds to an audio signal that is the sum of the output audio signals of the respective neural network models.