KR20210105688A - Method and apparatus for reconstructing speech signal without noise from input speech signal including noise using machine learning model

Info

Publication number: KR20210105688A
Application number: KR1020200020468A
Authority: KR
Inventors: 김기준
Original assignee: 라인플러스 주식회사
Priority date: 2020-02-19
Filing date: 2020-02-19
Publication date: 2021-08-27

Abstract

Provided is a speech signal processing method that inputs the amplitude signal of an input speech signal including noise to a pre-trained machine learning model and obtains, as the model's output, a mask to be applied to the amplitude signal; inputs a first phase signal of a first frequency band, among the phase signals of the input speech signal, to the machine learning model and obtains, as the model's output, a mask to be applied to the fast Fourier transform (FFT) coefficients of the input speech signal; and restores a noise-removed speech signal based on the input speech signal and the masks obtained from the machine learning model.

Description

METHOD AND APPARATUS FOR RECONSTRUCTING SPEECH SIGNAL WITHOUT NOISE FROM INPUT SPEECH SIGNAL INCLUDING NOISE USING MACHINE LEARNING MODEL

The embodiments relate to a method and apparatus for restoring (reconstructing) a noise-removed speech signal from an input speech signal that includes noise and, in particular, to a method and apparatus for restoring the noise-removed speech signal by applying a machine learning model to the amplitude signal and the phase signal that constitute the input speech signal.

Recently, as interest has grown in Internet calls such as VoIP and in the development and provision of content that uses other voice/sound signals, interest in techniques for removing noise from speech signals has also grown.

A speech signal generally has a bandwidth of 8 kHz or 16 kHz, and techniques for removing noise from speech signals have been developed with such 8 kHz or 16 kHz signals as their target. However, these noise removal techniques are difficult to apply to high-quality speech signals with a bandwidth of 24 kHz or more (i.e., full band). In other words, existing noise removal techniques are difficult to apply to services that provide full-band speech signals, and raising the sampling rate in order to remove noise from such full-band signals causes the amount of computation to increase sharply.

In addition, when only the amplitude signal is restored and the noisy phase signal is used as-is, a properly denoised speech signal cannot be restored if the noise is severe (e.g., when the SNR is 0 dB or less). In particular, it is difficult to effectively remove the noise contained in the amplitude signal and the phase signal of a speech signal with a high sampling rate.

Korean Patent Laid-Open Publication No. 10-2018-0067608 (published June 20, 2018) describes a method and apparatus for determining a noise signal and a method and apparatus for removing voice noise.

The information described above is provided only to aid understanding; it may include content that does not form part of the prior art, and it may not include what the prior art could suggest to a person skilled in the art.

One embodiment may provide a method of processing a speech signal that inputs the amplitude signal of an input speech signal including noise to a pre-trained machine learning (ML) model and obtains, as output, a mask to be applied to the amplitude signal; inputs a first phase signal of a first frequency band, among the phase signals of the input speech signal, to the same machine learning model and obtains, as output, a mask to be applied to the FFT coefficients of the input speech signal; and restores a noise-removed speech signal based on the input speech signal and the masks obtained from the machine learning model.

One embodiment may provide a method that uses a machine learning model to obtain a mask to be applied to the FFT coefficients of the input speech signal, as a mask for the FFT coefficients of a first phase signal of a first frequency band that is the low band (i.e., low-frequency band) of the phase signal of the input speech signal; obtains a mask to be applied to a first amplitude signal of a second frequency band of the amplitude signal of the input speech signal, and a mask to be applied to a second amplitude signal of a third frequency band that is higher than the second frequency band; and restores a noise-removed speech signal based on the phase signal of the high band (i.e., high-frequency band) outside the first frequency band and on the input speech signal to which the masks obtained from the machine learning model have been applied.

In one aspect, there is provided a method of processing a speech signal including noise using a computer system, the method comprising: obtaining an amplitude (magnitude) signal and a phase signal from an input speech signal that includes noise and on which a fast Fourier transform (FFT) has been performed; inputting the amplitude signal to a machine learning model pre-trained to estimate a mask to be applied to the input speech signal in order to remove the noise, and obtaining, as an output of the machine learning model, a mask to be applied to the amplitude signal; inputting a first phase signal of a first frequency band among the phase signals to the machine learning model, and obtaining, as an output of the machine learning model, a mask to be applied to the FFT coefficients of the input speech signal; and restoring the noise-removed speech signal based on the input speech signal and the masks obtained from the machine learning model.

The first phase signal may be the signal of a relatively low frequency band among the phase signals, and the phase signal of the frequency bands other than the first frequency band may be used as-is in restoring the noise-removed speech signal.

The obtaining of the mask to be applied to the amplitude signal may include: inputting a first amplitude signal of a second frequency band among the amplitude signals to the machine learning model, and obtaining, as an output of the machine learning model, a first mask to be applied to the first amplitude signal; dividing a second amplitude signal of a third frequency band, which is higher than the second frequency band, into amplitude signals of a plurality of bandwidth sections; calculating an average energy for each of the divided amplitude signals; and inputting the calculated average energies to the machine learning model, and obtaining, as an output of the machine learning model, a second mask to be applied to the second amplitude signal.

The second amplitude signal may be divided into the amplitude signals of the plurality of bandwidth sections by dividing the third frequency band in units of the Bark scale.
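The per-band averaging described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the band edges in `HIGH_BAND_EDGES_HZ`, the 960-point FFT, and the function name `band_average_energy` are assumptions made for the example. The edges loosely follow commonly tabulated Bark critical-band edges, and the partition actually used in the embodiment may differ.

```python
import numpy as np

# Hypothetical band edges (Hz) covering the 8-24 kHz range. They loosely follow
# commonly tabulated Bark critical-band edges; the patent's partition may differ.
HIGH_BAND_EDGES_HZ = [8000, 9500, 12000, 15500, 24000]


def band_average_energy(magnitude, sample_rate, n_fft, edges_hz=HIGH_BAND_EDGES_HZ):
    """Return the mean of |X[k]|^2 over the FFT bins of each band [lo, hi)."""
    # Centre frequency of each one-sided FFT bin.
    freqs = np.arange(len(magnitude)) * sample_rate / n_fft
    energies = []
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        in_band = (freqs >= lo) & (freqs < hi)
        energies.append(np.mean(magnitude[in_band] ** 2))
    return np.array(energies)


# Example: a flat amplitude spectrum of 2.0 gives an average energy of 4.0 per band.
example_energies = band_average_energy(np.full(481, 2.0), sample_rate=48000, n_fft=960)
```

With this kind of grouping the model receives one value per band instead of one value per bin, which is where the reduction in input size comes from.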

The first mask may be an ideal ratio mask (IRM) for the first amplitude signal, and the second mask may be an IRM for the calculated average energies. In the restoring of the noise-removed speech signal, the first mask may be multiplied by the first amplitude signal, and the second mask may be multiplied by the second amplitude signal.
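For orientation, the sketch below shows one widely used definition of the IRM and one way a per-band mask could be spread over the bins of the second amplitude signal before multiplication. The text here does not spell out either detail, so both functions, their names, and the square-root form of the IRM are assumptions for illustration only. At inference time the model would estimate the mask; the clean and noise amplitudes are only available when building training targets.

```python
import numpy as np


def ideal_ratio_mask(clean_amp, noise_amp, eps=1e-12):
    """One common IRM definition: sqrt(S^2 / (S^2 + N^2)), with values in [0, 1]."""
    s2 = clean_amp ** 2
    n2 = noise_amp ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))


def expand_band_mask(band_mask, band_slices, n_bins):
    """Spread one gain per band over the FFT bins that the band covers."""
    per_bin = np.ones(n_bins)  # bins outside every band are left unchanged
    for gain, bins in zip(band_mask, band_slices):
        per_bin[bins] = gain
    return per_bin


# Example: two bands covering bins 2-3 and 4-5 of a 6-bin spectrum.
per_bin_mask = expand_band_mask([0.5, 0.25], [slice(2, 4), slice(4, 6)], n_bins=6)
```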

The obtaining of the first mask may include: calculating a predetermined number of Mel-frequency cepstral coefficients (MFCCs) based on the first amplitude signal; and inputting the calculated MFCCs to the machine learning model in order to obtain the first mask.

The obtaining of the first mask may include: calculating a zero crossing rate (ZCR) based on the first amplitude signal; and inputting the calculated ZCR to the machine learning model in order to obtain the first mask.

The input speech signal may be a signal having a bandwidth of 24 kHz, the first frequency band may be the band from 0 to less than 4 kHz, the second frequency band may be the band from 0 to less than 8 kHz, and the third frequency band may be the band from 8 kHz to less than 24 kHz.
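To make the band boundaries concrete, the following sketch maps each half-open band to FFT bin indices. The 48 kHz sampling rate matches the 24 kHz bandwidth stated above, but the 960-point FFT (50 Hz per bin) and the helper name `band_to_bins` are assumptions chosen only to give round numbers.

```python
def band_to_bins(f_lo_hz, f_hi_hz, sample_rate=48000, n_fft=960):
    """Return the range of FFT bin indices k with f_lo <= k * sample_rate / n_fft < f_hi."""
    first = -(-f_lo_hz * n_fft // sample_rate)  # ceiling division
    stop = -(-f_hi_hz * n_fft // sample_rate)
    return range(first, stop)


first_band = band_to_bins(0, 4000)       # phase signal fed to the model
second_band = band_to_bins(0, 8000)      # first amplitude signal
third_band = band_to_bins(8000, 24000)   # second amplitude signal
```

Under these assumptions the first band covers bins 0 to 79, the second band covers bins 0 to 159, and the third band covers bins 160 to 479.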

The restoring of the noise-removed speech signal may include: estimating a noise-removed amplitude signal by multiplying the amplitude signal by the mask to be applied to the amplitude signal; estimating a noise-removed phase signal of the first frequency band by multiplying the FFT coefficients of the input speech signal by the mask to be applied to the first phase signal; restoring the FFT coefficients of the noise-removed speech signal based on the noise-removed amplitude signal, the noise-removed phase signal of the first frequency band, and the phase signal of the frequency bands other than the first frequency band; and restoring the noise-removed speech signal by performing an inverse fast Fourier transform (IFFT) based on the restored FFT coefficients.
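The four restoration steps above can be sketched for a single frame as follows. This is an illustrative reading rather than the patent's implementation: the function name `reconstruct_frame`, the one-sided FFT layout, and in particular taking the angle of the masked FFT coefficients as the "noise-removed phase signal" are assumptions.

```python
import numpy as np


def reconstruct_frame(noisy_fft, amp_mask, cirm_low, n_fft):
    """Restore one time-domain frame from one-sided FFT coefficients and masks.

    noisy_fft: complex one-sided FFT coefficients of the noisy frame
    amp_mask:  real mask, one value per bin, for the amplitude signal
    cirm_low:  complex mask for the lowest len(cirm_low) bins (first band)
    """
    n_low = len(cirm_low)
    # Step 1: noise-removed amplitude = amplitude * mask.
    clean_amp = np.abs(noisy_fft) * amp_mask
    # Step 2: low-band phase taken from the masked FFT coefficients (assumption);
    # the bins above the first band keep the noisy phase unchanged.
    phase = np.angle(noisy_fft)
    phase[:n_low] = np.angle(cirm_low * noisy_fft[:n_low])
    # Step 3: rebuild the FFT coefficients from amplitude and phase.
    clean_fft = clean_amp * np.exp(1j * phase)
    # Step 4: inverse FFT back to the time domain.
    return np.fft.irfft(clean_fft, n=n_fft)
```

As a sanity check, passing masks that are all ones returns the original frame, since neither the amplitude nor the phase is altered.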

The input speech signal may be a signal having a bandwidth of 24 kHz, the first frequency band may be the band from 0 to less than 4 kHz, and the frequency band other than the first frequency band may be the band from 4 kHz to less than 24 kHz.

The machine learning model may be composed of fully connected layers.
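As a rough picture of what a model built only from fully connected layers looks like, the sketch below runs a forward pass through a small untrained network. The layer sizes, the ReLU hidden activation, the sigmoid output, and the names `init_layers` and `forward` are all hypothetical. A sigmoid output suits a real-valued ratio mask in [0, 1]; a complex mask would need a different output layer.

```python
import numpy as np


def init_layers(sizes, seed=0):
    """Create random (weights, bias) pairs for consecutive layer sizes (untrained)."""
    rng = np.random.default_rng(seed)
    return [
        (rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(features, layers):
    """Fully connected forward pass: ReLU on hidden layers, sigmoid on the output."""
    x = features
    for weights, bias in layers[:-1]:
        x = np.maximum(x @ weights + bias, 0.0)
    weights, bias = layers[-1]
    return 1.0 / (1.0 + np.exp(-(x @ weights + bias)))


# Hypothetical sizes: 200 input features, two hidden layers, 160 mask values.
layers = init_layers([200, 64, 64, 160])
mask = forward(np.zeros(200), layers)
```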

The machine learning model may be a model further trained to estimate voice activity detection (VAD) information for distinguishing speech from silence, and the method of processing the speech signal may further include obtaining VAD information for the input speech signal as an output of the machine learning model, wherein the VAD information for the input speech signal may be used in the frequency restoration of the noise-removed speech signal.

The mask to be applied to the first phase signal may be a complex ideal ratio mask (CIRM) for the FFT coefficients of the first phase signal of the first frequency band, and in the restoring of the noise-removed speech signal, the CIRM may be multiplied by the FFT coefficients of the input speech signal.
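The defining property of a CIRM is that multiplying it by the noisy complex spectrum yields the clean complex spectrum, so it corrects phase as well as amplitude. The sketch below computes such a mask from known clean and noisy FFT coefficients, as one would when preparing training targets. The function name and the small `eps` regulariser are assumptions, and any compression or scaling of the mask used in practice is omitted.

```python
import numpy as np


def complex_ideal_ratio_mask(clean_fft, noisy_fft, eps=1e-12):
    """Return the complex mask M with M * Y approximately equal to S.

    S (clean_fft) and Y (noisy_fft) are complex FFT coefficients; M = S * conj(Y) / |Y|^2.
    """
    return clean_fft * np.conj(noisy_fft) / (np.abs(noisy_fft) ** 2 + eps)


noisy = np.array([1.0 + 1.0j, 2.0 - 1.0j])
clean = np.array([0.5 + 0.2j, 1.0 - 1.0j])
cirm = complex_ideal_ratio_mask(clean, noisy)
restored = cirm * noisy  # recovers the clean coefficients up to eps
```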

The amplitude signal and the phase signal may be signals in units of predetermined frequency bins, and the noise-removed speech signal may be restored in units of those frequency bins.

The input speech signal that is input to the machine learning model may consist of a plurality of frames.

In another aspect, there is provided a computer system for processing a speech signal including noise, the computer system comprising at least one processor implemented to execute computer-readable instructions, wherein the at least one processor: obtains an amplitude (magnitude) signal and a phase signal from an input speech signal that includes noise and on which a fast Fourier transform (FFT) has been performed; inputs the amplitude signal to a machine learning model pre-trained to estimate a mask to be applied to the input speech signal in order to remove the noise, and obtains, as an output of the machine learning model, a mask to be applied to the amplitude signal; inputs a first phase signal of a first frequency band among the phase signals to the machine learning model, and obtains, as an output of the machine learning model, a mask to be applied to the FFT coefficients of the input speech signal; and restores the noise-removed speech signal based on the input speech signal and the masks obtained from the machine learning model.

In yet another aspect, there is provided a method of processing a speech signal including noise using a computer system, the method comprising: obtaining an amplitude (magnitude) signal and a phase signal from an input speech signal that includes noise and on which a fast Fourier transform (FFT) has been performed; inputting the phase signal to a machine learning model pre-trained to estimate a mask to be applied to the input speech signal in order to remove the noise, and obtaining, as an output of the machine learning model, a mask to be applied to the FFT coefficients of the input speech signal, as a mask for the FFT coefficients of a first phase signal of a first frequency band among the phase signals; inputting the amplitude signal to the machine learning model, and obtaining, as an output of the machine learning model, a first mask to be applied to a first amplitude signal of a second frequency band among the amplitude signals and a second mask to be applied to a second amplitude signal of a third frequency band that is higher than the second frequency band; and restoring the noise-removed speech signal based on the phase signal of the frequency bands other than the first frequency band, the product of the mask and the FFT coefficients of the input speech signal, the product of the first mask and the first amplitude signal, and the product of the second mask and the second amplitude signal.

The amplitude signal and the phase signal in the low frequency band of the input speech signal are restored using the machine learning model, whereas the amplitude signal in the high frequency band is restored using the machine learning model in units of the Bark scale and the phase signal in the high frequency band is used as-is. This reduces the amount of computation required to restore the noise-removed speech signal.

Using a deep neural network (DNN) composed only of fully connected layers as the machine learning model for restoring the noise-removed speech signal reduces the amount of computation required for the restoration.

By reducing the amount of computation required to remove noise from a high-quality speech signal such as a full-band speech signal, a noise removal method that can be implemented even on a device such as a mobile terminal can be provided.

FIG. 1 illustrates a method of restoring a noise-removed speech signal by processing a speech signal including noise, according to an embodiment.
FIG. 2 illustrates the structure of a computer system that restores a noise-removed speech signal by processing a speech signal including noise, according to an embodiment.
FIG. 3 is a flowchart illustrating a method of restoring a noise-removed speech signal by processing a speech signal including noise, according to an embodiment.
FIG. 4 is a flowchart illustrating a method of obtaining a mask to be applied to an amplitude signal using a machine learning model, according to an example.
FIG. 5 is a flowchart illustrating a method of determining the parameters input to a machine learning model when generating a mask to be applied to a first amplitude signal of a second frequency band among the amplitude signals, according to an example.
FIG. 6 is a flowchart illustrating a method of restoring a noise-removed speech signal using an input speech signal and a mask from a machine learning model, according to an example.
FIG. 7 illustrates a mask estimated by a machine learning model, according to an example.
FIG. 8 illustrates a method of restoring a noise-removed speech signal by processing a speech signal including noise, according to an example.
FIG. 9 illustrates the parameters input to a machine learning model and the structure of the machine learning model, according to an example.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. Like reference numerals in the drawings denote like elements.

FIG. 1 illustrates a method of restoring a noise-removed speech signal by processing a speech signal including noise, according to an embodiment.

Referring to FIG. 1, a method of obtaining a restored, noise-removed speech signal 120 from an input speech signal 105 that contains both noise and speech is described.

The input speech signal 105 may be, for example, a full-band speech signal having a bandwidth of 24 kHz (or more). The input speech signal 105 may be a signal with a sampling rate of 48 kHz.

The method of processing the input speech signal 105 according to the embodiment may be performed by the computer system 100 described later.

A fast Fourier transform (FFT) may be performed on the input speech signal 105, and an amplitude signal representing the amplitude (magnitude) of the input speech signal 105 and a phase signal representing its phase may be obtained from the FFT-transformed input speech signal 105.
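A minimal sketch of this decomposition, assuming a single 960-sample frame at a 48 kHz sampling rate and a one-sided FFT (the frame length, the test tone, and the function name are illustrative choices, not values from the patent):

```python
import numpy as np


def amplitude_and_phase(frame):
    """Split the one-sided FFT of a real frame into amplitude and phase signals."""
    spectrum = np.fft.rfft(frame)
    return np.abs(spectrum), np.angle(spectrum), spectrum


sample_rate = 48000
n_fft = 960  # 20 ms frame, so the bin spacing is 50 Hz
t = np.arange(n_fft) / sample_rate
frame = np.sin(2 * np.pi * 1000 * t)  # 1 kHz test tone

amplitude, phase, spectrum = amplitude_and_phase(frame)
peak_bin = int(np.argmax(amplitude))  # 1 kHz / 50 Hz per bin = bin 20
```

The two signals together carry the same information as the complex FFT coefficients, since `amplitude * np.exp(1j * phase)` reproduces `spectrum`.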

As illustrated, in the embodiment the amplitude signal in the low band (e.g., below 8 kHz) and the phase signal in the low band (e.g., below 4 kHz) may be restored using the machine learning model 110. The amplitude signal in the high band (e.g., 8 kHz and above), on the other hand, may be divided in units of the Bark scale, and the average energy of each divided amplitude signal may be input to the machine learning model 110 for restoration. The phase signal in the high band (e.g., 4 kHz and above) may be used directly, without special processing, to restore the speech signal 120. In addition, together with the low-band amplitude signal, the Mel-frequency cepstral coefficient(s) (MFCCs) generated from that low-band amplitude signal may be input to the machine learning model 110 as parameters.

The machine learning model 110 may be implemented, for example, as an artificial neural network (e.g., a CNN and/or a DNN).

The output of the machine learning model 110 may be a mask to be applied to the input speech signal 105 (or to at least part of the input speech signal 105). The speech signal 120 may be restored using the masked input speech signal 105 and the high-band phase signal. For example, the FFT coefficients of the speech signal 120 may be restored from the masked input speech signal 105 and the high-band phase signal, and the speech signal 120 may then be restored by performing an inverse fast Fourier transform (IFFT).

In the embodiment, both the amount of computation needed to restore the noise-removed speech signal 120 from the high-band amplitude signal and the amount needed to restore it from the high-band phase signal can be significantly reduced (compared with, for example, feeding the high-band amplitude signal and the high-band phase signal directly into the machine learning model 110).

Accordingly, even when the computer system 100 is an electronic device such as a mobile terminal rather than a computing device such as a high-performance PC or server, the embodiment can use the computer system 100 to effectively remove noise from the full-band speech signal 105.

In the embodiment, as illustrated, only the phase signal of the low band, for example the 0 to 4 kHz band, may be set as an input to the machine learning model 110. According to the results of listening evaluation experiments, even when the amplitude signal is free of noise (clean), the speech signal may be perceived by the user as noisy if the phase signal contains noise. However, even when the phase signal contains noise, the speech signal may be perceived by the user as clean as long as the phase signal in the 0 to 4 kHz band is free of noise. Therefore, in the embodiment, only the phase signal of the low band, which has the dominant influence on the noise perceived by the user, may be used as an input to the machine learning model 110. This allows the size of the network of the machine learning model 110 to be reduced.

In addition, in accordance with psychoacoustic theory, the embodiment divides the high-band (e.g., 8 to 24 kHz) amplitude signal, which corresponds to the part of the speech signal that the user perceives at low resolution, in units of the Bark scale and uses the average energy of each divided amplitude signal as an input parameter to the machine learning model 110. This reduces the amount of computation in the machine learning model 110 for the high-band amplitude signal.

According to psychoacoustic theory, a user perceives the high-band amplitude signal in units of specific bandwidth sections, so even if the speech signal is restored in units of those sections, the user may not notice any degradation in sound quality or any noise. Likewise, the user may hardly notice changes in the high-band phase signal. Therefore, the embodiment can achieve results similar to those of a method that fully restores the 24 kHz full-band speech signal, with a smaller amount of computation.

A more specific method of processing the input speech signal 105 including noise to restore the noise-removed speech signal 120 is described in more detail below with reference to FIGS. 2 to 9.

FIG. 2 illustrates the structure of a computer system that restores a noise-removed speech signal by processing a speech signal including noise, according to an embodiment.

The illustrated computer system 100 may correspond to the computer system 100 described above with reference to FIG. 1. The computer system 100 may be an electronic device equipped with a lightweight inference model (e.g., the machine learning model 110) for removing noise from the input speech signal 105. Alternatively, unlike what is illustrated, the computer system 100 may be a device that obtains the noise-removed speech signal 120 from the input speech signal 105 using a machine learning model 110 that resides on an electronic device or server external to the computer system 100. In this case, the computer system 100 may obtain the speech signal 120 through communication with the external electronic device or server.

The computer system 100 may include, for example, a personal computer (PC), a laptop computer, a smartphone, a tablet, a wearable computer, or an Internet of Things (IoT) device. As an example, the computer system 100 may be a device such as a mobile terminal and may not correspond to a computing device such as a high-performance PC or server.

The computer system 100 may include a communication unit 210 and a processor 220. The computer system 100 may include a microphone 230 for receiving the speech signal 105 from a user and a speaker 240 for outputting the noise-removed speech signal 120. The microphone 230 may generate a speech signal having a full-band bandwidth from speech input by the user or from outside. The speaker 240 may be configured to output a speech signal having a full-band bandwidth.

In addition, although not illustrated, the computer system 100 may further include a display that shows information input by the user and/or information or content provided in response to a user request.

The communication unit 210 may be a device through which the computer system 100 communicates with other servers or devices. That is, the communication unit 210 may be a hardware module of the computer system 100, such as a network interface card, a network interface chip, or a networking interface port, or a software module, such as a network device driver or a networking program, that transmits and receives data and/or information to and from other servers or devices.

The processor 220 may manage the components of the computer system 100 and may execute programs or applications used by the computer system 100. For example, the processor 220 may obtain the speech signal 105 that is input through the microphone 230 or that was input previously, process the obtained speech signal 105 using the machine learning model 110, and generate the speech signal 120 in which noise has been removed from the speech signal 105. The processor 220 may handle the computation needed to execute the programs or applications and to process the data required for these operations. The processor 220 may be at least one processor of the computer system 100 or at least one core within a processor.

Although not illustrated, the computer system 100 may include a memory. The memory is a computer-readable recording medium and may include random access memory (RAM), read-only memory (ROM), and a permanent mass storage device such as a disk drive. Here, the ROM and the permanent mass storage device may be separate from the memory and included as a separate permanent storage device. An operating system and at least one piece of program code may be stored in the memory. These software components may be loaded from a computer-readable recording medium separate from the memory. Such a separate computer-readable recording medium may include a floppy drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, and the like. In another embodiment, the software components may be loaded into the memory through the communication unit 210 rather than from a computer-readable recording medium.

The processor 220 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processor 220 by the memory or the communication unit 210. For example, the processor 220 may be configured to execute received instructions according to program code loaded into the memory. Through these operations of the processor 220, the computer system 100 may generate the speech signal 120 in which noise has been removed from the speech signal 105.

As an example, the processor 220 may obtain an amplitude signal and a phase signal from the FFT-transformed input speech signal 105 including noise. The processor 220 may input the amplitude signal to the machine learning model 110, which is pre-trained to estimate a mask to be applied to the input speech signal 105 in order to remove the noise, and may obtain a mask to be applied to the amplitude signal as an output of the machine learning model 110. The processor 220 may also input, to the machine learning model 110, a first phase signal of a first frequency band among the phase signals, and may obtain a mask to be applied to the FFT coefficients of the input speech signal 105 as an output of the machine learning model 110. The processor 220 may reconstruct the noise-removed speech signal 120 based on the input speech signal 105 and the masks obtained from the machine learning model 110.

The operations described above may be performed by at least one software and/or hardware module configured by the processor 220.

The machine learning model 110 may be a model pre-trained to estimate a mask to be applied to the input speech signal 105 in order to remove the noise included in the input speech signal 105. The machine learning model 110 may have been trained on a set of training input speech signals for which the correct answer is known. The machine learning model 110 may be implemented as a deep neural network (DNN). The machine learning model 110 may include a plurality of layers constituting the DNN and, to reduce the amount of computation, may consist only of fully connected layers. The machine learning model 110 is described in more detail below with reference to FIG. 9.
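The idea of a mask-estimating network built only from fully connected layers can be sketched as follows. This is a minimal illustration, not the model of FIG. 9: the layer sizes, the ReLU and sigmoid activations, and the random weights are all assumptions chosen only to make the example runnable.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_dense(rng, n_in, n_out):
    # Hypothetical initialization; a trained model would load learned weights.
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

def forward(params, features):
    """Run a stack of fully connected layers and return a mask in (0, 1)."""
    h = features
    for w, b in params[:-1]:
        h = relu(h @ w + b)
    w, b = params[-1]
    return sigmoid(h @ w + b)  # one gain per output frequency bin

rng = np.random.default_rng(0)
layer_sizes = [181, 128, 128, 160]  # assumed sizes, not taken from the source
params = [init_dense(rng, a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
mask = forward(params, rng.standard_normal(layer_sizes[0]))
```

A sigmoid output is used here because a ratio-style mask is naturally bounded between 0 and 1; the source does not state which output activation is used.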

A more specific method of using the computer system 100 to process the input speech signal 105 including noise and reconstruct the noise-removed speech signal 120 is described in more detail below with reference to FIGS. 3 to 9.

The description of the technical features given above with reference to FIG. 1 applies equally to FIG. 2, so redundant description is omitted.

In the detailed description that follows, operations performed by components of the computer system 100 or the processor 220, or by applications or programs executed by the computer system 100 or the processor 220, may for convenience be described as operations performed by the computer system 100.

FIG. 3 is a flowchart illustrating a method of processing a speech signal including noise to reconstruct a noise-removed speech signal, according to an embodiment.

In step 310, the computer system 100 may obtain an amplitude signal and a phase signal from the FFT-transformed input speech signal 105 including noise. The input speech signal 105 may be a speech signal input to the computer system 100 through the microphone 230 or a speech signal previously stored in the computer system 100. The noise included in the input speech signal 105 may be noise contained in the amplitude signal and/or the phase signal, and may represent any sound other than the intended speech signal that the user wishes to hear.

The input speech signal 105 may be a signal having a (full-band) bandwidth of 24 kHz. The sampling rate of the input speech signal 105 may be 48 kHz.
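Step 310 can be illustrated with a short sketch that takes one frame sampled at 48 kHz and derives the amplitude and phase signals from its FFT. The frame length of 960 samples (20 ms) and the Hann window are assumptions; the source does not specify either.

```python
import numpy as np

SAMPLE_RATE = 48_000  # Hz, as stated in the text
FRAME_LEN = 960       # assumed 20 ms frame; not specified in the source

# Center frequency of each real-FFT bin: 0, 50, ..., 24000 Hz.
freqs = np.arange(FRAME_LEN // 2 + 1) * (SAMPLE_RATE / FRAME_LEN)

def analyze_frame(frame):
    """Return (amplitude, phase) of one frame from its real FFT."""
    window = np.hanning(len(frame))         # assumed window choice
    spectrum = np.fft.rfft(frame * window)  # complex FFT coefficients
    return np.abs(spectrum), np.angle(spectrum)

# A 1 kHz test tone stands in for the microphone input.
t = np.arange(FRAME_LEN) / SAMPLE_RATE
frame = np.sin(2 * np.pi * 1000.0 * t)
amplitude, phase = analyze_frame(frame)
```

With a 48 kHz sampling rate the highest representable frequency is 24 kHz, which is why the last FFT bin sits at the 24 kHz full-band limit.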

In step 320, the computer system 100 may input the amplitude signal of the input speech signal 105 to the machine learning model 110, which is pre-trained to estimate a mask to be applied to the input speech signal 105 in order to remove the noise it contains, and may obtain a mask to be applied to the amplitude signal as an output of the machine learning model 110.

A specific method of obtaining the mask to be applied to the amplitude signal is described in more detail below with reference to FIGS. 4 and 5.

The amplitude signal of the input speech signal 105 (or a parameter derived from that amplitude signal) may serve as an input parameter for inference in the machine learning model 110. By applying the mask output by the machine learning model 110 to the amplitude signal of the input speech signal 105 (e.g., by multiplication), a noise-removed amplitude signal can be obtained.

In step 330, the computer system 100 may input, to the machine learning model 110, a first phase signal of a first frequency band among the phase signals of the input speech signal 105, and may obtain a mask to be applied to the FFT coefficients of the input speech signal 105 as an output of the machine learning model 110. The first phase signal of the first frequency band may be the phase signal corresponding to a relatively low frequency band (low band) among the phase signals. For example, the first frequency band may be the band from 0 to less than 4 kHz, and the first phase signal may be the phase signal of that band. Meanwhile, the phase signal of the frequency bands other than the first frequency band (i.e., the phase signal of the relatively high frequency band (high band)) may be used as is in reconstructing the noise-removed speech signal 120.

The first phase signal (or a parameter derived from the first phase signal) may serve as an input parameter for inference in the machine learning model 110. By applying the mask output by the machine learning model 110 to the FFT coefficients of the input speech signal 105 (e.g., by multiplication), a noise-removed phase signal can be obtained.

Meanwhile, the phase signal of the frequency bands other than the first frequency band among the phase signals of the input speech signal 105 may not be input to the machine learning model 110. That is, for example, the phase signal of the high band from 4 kHz to less than 24 kHz may not be input to the machine learning model 110. As described above, listening evaluation experiments showed that the high-band phase signal has little effect on the noise perceived by the user, so the high-band phase signal may be withheld from the machine learning model 110 in order to reduce its computation. In other words, the high-band phase signal may be used unchanged to reconstruct the noise-removed speech signal 120.
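The band split for the phase signal can be sketched as below: bins below 4 kHz form the first phase signal that is fed to the model, and the remaining bins are kept aside untouched. The FFT size of 960 is an assumption, and `split_phase` is a hypothetical helper name.

```python
import numpy as np

SAMPLE_RATE = 48_000     # Hz, as stated in the text
N_FFT = 960              # assumed FFT size
PHASE_CUTOFF_HZ = 4_000  # upper edge of the first frequency band

freqs = np.arange(N_FFT // 2 + 1) * (SAMPLE_RATE / N_FFT)
low_band = freqs < PHASE_CUTOFF_HZ  # boolean selector for 0 <= f < 4 kHz

def split_phase(phase):
    """Return (first_phase, high_band_phase); only the first goes to the model."""
    return phase[low_band], phase[~low_band]
```

Under these assumptions only 80 of the 481 phase bins reach the model, which is where the reduction in computation comes from.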

In step 340, the computer system 100 may reconstruct the noise-removed speech signal 120 based on the input speech signal 105 and the masks obtained from the machine learning model 110. For example, the computer system 100 may apply the mask obtained in step 320 to the amplitude signal and may apply the mask obtained in step 330 to the FFT coefficients of the input speech signal 105. The computer system 100 may reconstruct the noise-removed speech signal 120 based on the signals to which these masks have been applied and on the high-band phase signal that was not input to the machine learning model 110.
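For a single frame, step 340 might be sketched as follows. Several details are assumptions rather than statements from the source: the mask for the FFT coefficients is treated as complex-valued, the denoised low-band phase is taken as the angle of the masked coefficients, and windowing with overlap-add between frames is omitted.

```python
import numpy as np

N_FFT = 960     # assumed FFT size
N_LOW = 80      # bins below 4 kHz at 48 kHz sampling with N_FFT = 960

def reconstruct_frame(spectrum, amp_mask, fft_mask_low):
    """Rebuild a time-domain frame from masked amplitude and phase."""
    amplitude = np.abs(spectrum) * amp_mask  # apply the amplitude mask
    phase = np.angle(spectrum)               # high-band phase reused as is
    # Low band: phase of the FFT coefficients after applying their mask.
    phase[:N_LOW] = np.angle(spectrum[:N_LOW] * fft_mask_low)
    clean_spectrum = amplitude * np.exp(1j * phase)
    return np.fft.irfft(clean_spectrum, n=N_FFT)

rng = np.random.default_rng(0)
noisy = rng.standard_normal(N_FFT)
spectrum = np.fft.rfft(noisy)
# Identity masks: the output should match the input frame.
restored = reconstruct_frame(spectrum,
                             np.ones(N_FFT // 2 + 1),
                             np.ones(N_LOW, dtype=complex))
```

Identity masks are used in the usage lines only to show that the analysis and synthesis path is lossless; in practice the masks would come from the model.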

A specific method of reconstructing the noise-removed speech signal 120 is described in more detail below with reference to FIG. 6.

According to the embodiment, noise removal for the input speech signal 105 input to the computer system 100 may be performed in real time. For example, when a user makes a call using the computer system 100, noise removal may be performed in real time on the speech signal corresponding to the user's utterance or on a speech signal received by the computer system 100. Alternatively, according to the embodiment, noise removal may be performed on an input speech signal 105 that is obtained (i.e., loaded) from storage in the computer system 100.

Meanwhile, the machine learning model 110 may be a model further trained to estimate voice activity detection (VAD) information for distinguishing speech from silence in the input speech signal 105. Although not shown as a step in FIG. 3, the computer system 100 may obtain VAD information for the input speech signal 105 as an output of the machine learning model 110. The VAD information obtained for the input speech signal 105 may be used for frequency restoration when reconstructing the noise-removed speech signal 120. The VAD information may be used in the machine learning model 110 to improve the quality of the reconstructed speech signal 120 (i.e., to improve noise-removal performance).

The machine learning model 110 may include a plurality of layers for estimating the VAD information, and the output of each of these layers may be used as an input to the frequency restoration network. For example, the output of each layer for estimating the VAD information may be used as an input to the layer that determines the mask obtained in step 320 and to the layer that determines the mask obtained in step 330.

The description of the technical features given above with reference to FIGS. 1 and 2 applies equally to FIG. 3, so redundant description is omitted.

FIG. 4 is a flowchart illustrating a method of obtaining a mask to be applied to an amplitude signal using a machine learning model, according to an example.

A specific method of obtaining the mask to be applied to the amplitude signal is described with reference to steps 410 to 440 of FIG. 4.

In step 410, the computer system 100 may input, to the machine learning model 110, a first amplitude signal of a second frequency band among the amplitude signals of the input speech signal 105, and may obtain a first mask to be applied to the first amplitude signal as an output of the machine learning model 110. The first amplitude signal of the second frequency band may be the amplitude signal corresponding to the low band among the amplitude signals of the input speech signal 105. For example, the second frequency band may be the band from 0 to less than 8 kHz, and the first amplitude signal may be the amplitude signal of that band.

The first amplitude signal (and/or a parameter derived from the first amplitude signal) may serve as an input parameter for inference in the machine learning model 110. By applying the first mask output by the machine learning model 110 to the first amplitude signal (e.g., by multiplication), a noise-removed amplitude signal (i.e., a noise-removed first amplitude signal) can be obtained.

The first mask may be an ideal ratio mask (IRM) for the first amplitude signal. As described above, the first mask may be applied to the first amplitude signal by multiplying the two.
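The source does not define the IRM. As a rough illustration, one commonly used definition computes, per frequency bin, the ratio of clean speech power to the sum of clean and noise power, raised to an exponent that is often 0.5. The sketch below uses that common form as an assumption; it shows how a training target could be computed when the clean and noise components are known, and how a mask is applied by multiplication.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5, eps=1e-12):
    """Common IRM form (an assumption here): (S^2 / (S^2 + N^2)) ** beta."""
    clean_pow = clean_mag ** 2
    noise_pow = noise_mag ** 2
    return (clean_pow / (clean_pow + noise_pow + eps)) ** beta

def apply_mask(amplitude, mask):
    """Apply a mask to an amplitude signal by elementwise multiplication."""
    return amplitude * mask

clean = np.array([1.0, 0.0, 3.0])
noise = np.array([0.0, 1.0, 4.0])
irm = ideal_ratio_mask(clean, noise)
```

Bins dominated by speech get a gain near 1 and bins dominated by noise get a gain near 0, so multiplication attenuates mostly the noisy bins.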

In step 420, the computer system 100 may divide a second amplitude signal of a third frequency band, which is a higher frequency band than the second frequency band, among the amplitude signals of the input speech signal 105 into amplitude signals of a plurality of bandwidth sections. The second amplitude signal of the third frequency band may be the amplitude signal corresponding to the high band among the amplitude signals of the input speech signal 105. For example, the third frequency band may be the band from 8 kHz to less than 24 kHz, and the second amplitude signal may be the amplitude signal of that band.

For example, the computer system 100 may divide the second amplitude signal into amplitude signals of a plurality of bandwidth sections by dividing the third frequency band of the second amplitude signal in units of at least one Bark scale band.

The Bark scale may be a scale based on psychoacoustics. With respect to the characteristics of sound that humans can distinguish with the auditory organs, such as loudness, pitch, duration, and timbre, it may be a scale for telling different sounds apart so that those characteristics can be expressed concretely.

According to psychoacoustic theory, a user perceives a high-band amplitude signal (e.g., in the 8 to 24 kHz range) at low resolution, so the amplitude signal corresponding to the high-band speech signal may be divided in Bark scale units into amplitude signals of a plurality of bandwidth sections. According to the Bark scale, the bandwidth sections may be, for example, 8000 to 9600 Hz, 9600 to 12000 Hz, 12000 to 15600 Hz, 15600 to 20000 Hz, and 20000 to 24000 Hz, and the amplitude signal corresponding to the high-band speech signal may be divided into the amplitude signals of these sections.

In step 430, the computer system 100 may calculate the average energy of each of the amplitude signals divided in step 420. For each divided amplitude signal, the computer system 100 may calculate the average energy (i.e., the mean of the frequency energy) over the bandwidth section to which that amplitude signal belongs.
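Steps 420 and 430 can be sketched as follows, using the example band edges given in the text. Two points are assumptions: the FFT size of 960, and the definition of a bin's energy as its squared amplitude.

```python
import numpy as np

SAMPLE_RATE = 48_000  # Hz, as stated in the text
N_FFT = 960           # assumed FFT size
# Example bandwidth sections from the text, in Hz.
BAND_EDGES_HZ = [8000, 9600, 12000, 15600, 20000, 24000]

freqs = np.arange(N_FFT // 2 + 1) * (SAMPLE_RATE / N_FFT)

def band_average_energy(amplitude):
    """Return one average energy per high-band bandwidth section."""
    energies = []
    for lo, hi in zip(BAND_EDGES_HZ[:-1], BAND_EDGES_HZ[1:]):
        in_band = (freqs >= lo) & (freqs < hi)
        # Energy is taken as squared amplitude (an assumption).
        energies.append(np.mean(amplitude[in_band] ** 2))
    return np.array(energies)

energies = band_average_energy(np.full(N_FFT // 2 + 1, 2.0))
```

Under these assumptions the 320 high-band bins collapse to 5 model inputs, which illustrates why the high band costs so little computation.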

In step 440, the computer system 100 may input the average energy calculated in step 430 to the machine learning model 110 and may obtain a second mask to be applied to the second amplitude signal as an output of the machine learning model 110.

In this way, the average energy calculated in step 430 may serve as an input parameter for inference in the machine learning model 110. By applying the second mask output by the machine learning model 110 to the second amplitude signal (e.g., by multiplication), a noise-removed amplitude signal (i.e., a noise-removed second amplitude signal) can be obtained.

The second mask may be an IRM for the average energy calculated in step 430. As described above, the second mask may be applied to the second amplitude signal by multiplying the two.
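Because the second mask is estimated per bandwidth section while the second amplitude signal has many bins per section, the mask has to be expanded to bin resolution before multiplication. The source does not say how; the sketch below assumes the simplest option, a constant gain across each section, and assumes an FFT size of 960.

```python
import numpy as np

SAMPLE_RATE = 48_000  # Hz, as stated in the text
N_FFT = 960           # assumed FFT size
BAND_EDGES_HZ = [8000, 9600, 12000, 15600, 20000, 24000]  # from the text

freqs = np.arange(N_FFT // 2 + 1) * (SAMPLE_RATE / N_FFT)

def apply_band_mask(amplitude, band_mask):
    """Scale every bin in a section by that section's gain (assumed scheme)."""
    out = amplitude.copy()
    for gain, lo, hi in zip(band_mask, BAND_EDGES_HZ[:-1], BAND_EDGES_HZ[1:]):
        in_band = (freqs >= lo) & (freqs < hi)
        out[in_band] *= gain
    return out

band_mask = np.array([0.1, 0.2, 0.3, 0.4, 0.5])  # hypothetical model output
masked = apply_band_mask(np.ones(N_FFT // 2 + 1), band_mask)
```

Bins below 8 kHz are left untouched here because they are handled by the first mask.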

In the embodiment, as in step 410, the low-band amplitude signal may be input to the machine learning model 110 (i.e., the low-band amplitude signal may be an input parameter to the machine learning model 110), and noise removal may be carried out through inference by the machine learning model 110.

For the high-band amplitude signal, however, as in steps 420 to 440, the average energy of each of the amplitude signals divided into the plurality of bandwidth sections is calculated and used as the input parameter to the machine learning model 110, which reduces the computation the machine learning model 110 performs for the high-band amplitude signal.

The description of the technical features given above with reference to FIGS. 1 to 3 applies equally to FIG. 4, so redundant description is omitted.

FIG. 5 is a flowchart illustrating a method of determining the parameters input to a machine learning model when the model is used to generate a mask to be applied to the first amplitude signal of the second frequency band among the amplitude signals, according to an example.

A method of determining the parameters input to the machine learning model 110 based on the first amplitude signal of the second frequency band, which corresponds to the low band among the amplitude signals, is described with reference to steps 510-1 to 520-2 of FIG. 5.

In step 510-1, the computer system 100 may calculate a predetermined number of Mel-frequency cepstral coefficients (MFCCs) based on the first amplitude signal of the second frequency band.

In step 520-1, the computer system 100 may input the calculated MFCCs to the machine learning model 110 in order to obtain the first mask to be applied to the first amplitude signal.

That is, together with the first amplitude signal, the MFCCs of the first amplitude signal may serve as input parameters for inference in the machine learning model 110. A predetermined number (e.g., 20) of coefficients may be calculated for the first amplitude signal and input to the machine learning model 110. The MFCCs may provide information about the overall spectral shape of the first amplitude signal.

An MFCC may be a coefficient needed to represent a speech signal as a feature vector. For example, the MFCCs may be features of the first amplitude signal.

The MFCCs may be calculated (extracted) from the first amplitude signal based on the Mel scale, which reflects the characteristic of the cochlea of resolving speech in relatively low frequency bands well and speech in high frequency bands poorly. The MFCCs may be calculated by dividing the first amplitude signal into a plurality of sections according to the Mel scale and computing over each section.
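A compact way to see how MFCCs follow from the first amplitude signal is the sketch below: triangular filters spaced evenly on the Mel scale divide the 0 to 8 kHz band into sections, the log energy of each section is taken, and a DCT-II yields the coefficients. The number of Mel filters (26), the FFT size, the Hz-to-Mel formula variant, and the unnormalized DCT are assumptions; only the count of 20 coefficients comes from the text's example.

```python
import numpy as np

SAMPLE_RATE = 48_000  # Hz, as stated in the text
N_FFT = 960           # assumed FFT size
LOW_BAND_HZ = 8_000   # upper edge of the second frequency band
N_MELS = 26           # assumed number of Mel sections
N_MFCC = 20           # example coefficient count from the text

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

freqs = np.arange(N_FFT // 2 + 1) * (SAMPLE_RATE / N_FFT)
low_freqs = freqs[freqs < LOW_BAND_HZ]  # bins of the first amplitude signal

def mel_filterbank(bin_freqs, n_mels, f_max):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(f_max), n_mels + 2))
    bank = np.zeros((n_mels, len(bin_freqs)))
    for i in range(n_mels):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

BANK = mel_filterbank(low_freqs, N_MELS, LOW_BAND_HZ)

def mfcc(first_amplitude):
    """Return N_MFCC coefficients for one low-band amplitude frame."""
    log_energy = np.log(BANK @ (first_amplitude ** 2) + 1e-10)
    n = np.arange(N_MELS)
    k = np.arange(N_MFCC)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * N_MELS))  # DCT-II
    return dct_basis @ log_energy

coeffs = mfcc(np.ones(len(low_freqs)))
```

The filters are narrow at low frequencies and wide at high frequencies, which mirrors the cochlear resolution the Mel scale is meant to model.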

In step 510-2, the computer system 100 may calculate a zero crossing rate (ZCR) based on the first amplitude signal of the second frequency band.

In step 520-2, the computer system 100 may input the calculated ZCR to the machine learning model 110 in order to obtain the first mask to be applied to the first amplitude signal.

That is, together with the first amplitude signal, the ZCR of the first amplitude signal may serve as an input parameter for inference in the machine learning model 110. The ZCR may be calculated by analyzing the first amplitude signal along the time axis, and may provide information about the noise contained in the time-axis component of the first amplitude signal. The ZCR is the rate of sign changes along the (speech) signal, that is, the rate at which the signal crosses zero and its sign flips.
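The ZCR described above can be computed in a few lines. The sketch assumes it is evaluated on the time-domain samples of a frame and normalized by the number of adjacent sample pairs; the source does not state the normalization.

```python
import numpy as np

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(samples)
    crossings = np.count_nonzero(signs[1:] != signs[:-1])
    return crossings / (len(samples) - 1)

alternating = zero_crossing_rate(np.array([1.0, -1.0, 1.0, -1.0]))
monotone = zero_crossing_rate(np.array([1.0, 2.0, 3.0]))
```

Noise-like frames tend to change sign far more often than voiced speech, which is why the ZCR carries information about time-domain noise.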

The computer system 100 may input the first amplitude signal, the MFCCs of the first amplitude signal, and the ZCR of the first amplitude signal to the machine learning model 110, and may obtain the first mask to be applied to the first amplitude signal as an output of the machine learning model 110.
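Assembling the three low-band inputs into a single model input might look like the sketch below. Simple concatenation, the ordering, and the sizes (160 amplitude bins from an assumed 960-point FFT, 20 MFCCs, 1 ZCR) are assumptions; `build_low_band_features` is a hypothetical helper name.

```python
import numpy as np

def build_low_band_features(first_amplitude, mfccs, zcr):
    """Concatenate amplitude bins, MFCCs, and the scalar ZCR into one vector."""
    return np.concatenate([first_amplitude, mfccs, [zcr]])

features = build_low_band_features(np.ones(160), np.zeros(20), 0.25)
```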

이상, 도 1 내지 도 4를 참조하여 전술된 기술적 특징에 대한 설명은, 도 5에 대해서도 그대로 적용될 수 있으므로 중복되는 설명은 생략한다.As described above, the description of the technical features described above with reference to FIGS. 1 to 4 can be applied to FIG. 5 as it is, and thus a redundant description will be omitted.

FIG. 6 is a flowchart illustrating a method of reconstructing a noise-removed speech signal using an input speech signal and masks from a machine learning model, according to an example.

A method of reconstructing the noise-removed speech signal is described in detail below with reference to steps 610 to 640 of FIG. 6.

In step 610, the computer system 100 may estimate a denoised amplitude signal (an amplitude signal from which noise has been removed) by multiplying the amplitude signal of the input speech signal 105 by the mask to be applied to that amplitude signal. For example, as described above, the computer system 100 may multiply the first mask obtained in step 410 by the first amplitude signal and multiply the second mask obtained in step 440 by the second amplitude signal, and may take the resulting amplitude signal as the estimate of the denoised amplitude signal. Multiplying the first mask by the first amplitude signal yields an estimate of the denoised first amplitude signal, and multiplying the second mask by the second amplitude signal yields an estimate of the denoised second amplitude signal.

In step 620, the computer system 100 may estimate a denoised phase signal of the first frequency band (a phase signal from which noise has been removed) by multiplying the FFT coefficients of the input speech signal 105 by the mask, obtained in step 330, that is to be applied to the first phase signal. Based on the denoised phase signal of the first frequency band estimated in step 620 and the phase signal of the frequency bands other than the first frequency band, the computer system 100 may obtain the denoised phase signal of the input speech signal 105.

The mask to be applied to the first phase signal may be a complex ideal ratio mask (CIRM) for the FFT coefficients of the first phase signal of the first frequency band. The CIRM may be multiplied by the FFT coefficients of the input speech signal 105.

In step 630, the computer system 100 may reconstruct the FFT coefficients of the noise-removed speech signal 120 based on the denoised amplitude signal, the denoised phase signal of the first frequency band, and the phase signal of the frequency bands other than the first frequency band. In other words, the computer system 100 may reconstruct the FFT coefficients of the noise-removed speech signal 120 based on the denoised amplitude signal of the input speech signal 105 obtained in step 610 and the denoised phase signal of the input speech signal 105 obtained in step 620.

In step 640, the computer system 100 may reconstruct the noise-removed speech signal 120 by performing an inverse fast Fourier transform (IFFT) on the reconstructed FFT coefficients.
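Steps 610 to 640 can be sketched for a single frame as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `reconstruct_frame`, its argument layout, the use of a one-sided spectrum (`numpy.fft.rfft`/`irfft`), and the per-band bin ranges passed as `bark_bins` are all hypothetical.

```python
import numpy as np

def reconstruct_frame(noisy_fft, irm_low, irm_bark, cirm, bark_bins, n_fft=1024):
    """Apply amplitude and phase masks to one frame and return time samples.

    noisy_fft : one-sided complex spectrum of the noisy frame
    irm_low   : per-bin mask for the low-band amplitude (first mask)
    irm_bark  : one gain per high-band section (second mask)
    cirm      : complex mask for the low-band FFT coefficients
    bark_bins : list of (start_bin, end_bin) pairs, one per section (assumed layout)
    """
    noisy_fft = np.asarray(noisy_fft, dtype=complex)
    magnitude = np.abs(noisy_fft)
    phase = np.angle(noisy_fft)

    # Step 610: multiply the amplitude signal by the masks.
    clean_mag = magnitude.copy()
    clean_mag[:len(irm_low)] *= np.asarray(irm_low, dtype=float)
    for gain, (lo, hi) in zip(irm_bark, bark_bins):
        clean_mag[lo:hi] *= gain

    # Step 620: multiply the low-band FFT coefficients by the CIRM and keep
    # only the phase of the product; other bins keep the noisy phase.
    clean_phase = phase.copy()
    n_phase = len(cirm)
    clean_phase[:n_phase] = np.angle(np.asarray(cirm) * noisy_fft[:n_phase])

    # Step 630: rebuild the FFT coefficients from amplitude and phase.
    clean_fft = clean_mag * np.exp(1j * clean_phase)

    # Step 640: inverse FFT back to the time domain.
    return np.fft.irfft(clean_fft, n=n_fft)
```

With all-ones masks the function returns the input frame unchanged, which is a convenient sanity check before plugging in model outputs.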

The reconstructed speech signal 120 may be output by the computer system 100, for example, through the speaker 240.

In one example, the input speech signal 105 may be a signal having a bandwidth of 24 kHz, the first frequency band (the low band) of the phase signal may be the band from 0 kHz up to but not including 4 kHz, and the frequency band other than the first frequency band (the high band) may be the band from 4 kHz up to but not including 24 kHz.

The description of the technical features given above with reference to FIGS. 1 to 5 applies equally to FIG. 6, and a redundant description is therefore omitted.

As described later with reference to FIG. 8, the amplitude signal and the phase signal of the input speech signal 105 described in the embodiments may be signals in units of predetermined frequency bins. The noise-removed speech signal 120 may also be reconstructed in units of these frequency bins. In other words, in the embodiments, the low band of the speech signal (4 kHz or less (or less than 4 kHz) for the phase signal, and 8 kHz or less (or less than 8 kHz) for the amplitude signal) may be reconstructed per frequency bin using the machine learning model 110. The high band (above 4 kHz (or 4 kHz and above) for the phase signal, and above 8 kHz (or 8 kHz and above) for the amplitude signal) may be reconstructed in units of the Bark scale (amplitude signal) or by using the noisy speech signal as it is (phase signal).

Meanwhile, in another embodiment, rather than directly inputting to the machine learning model 110 the input speech signal 105 (amplitude signal and/or phase signal) already divided into the first to third frequency bands, the input speech signal 105 may be input to the machine learning model 110 as a whole, and the machine learning model 110 may be targeted to output a mask for the input speech signal 105 (amplitude signal and/or phase signal) of each of the first to third frequency bands.

For example, the computer system 100 may input the phase signal of the input speech signal 105 to the machine learning model 110, which is pre-trained to estimate a mask to be applied to the input speech signal 105 in order to remove noise from the input speech signal 105. As an output of the machine learning model 110, the computer system 100 may obtain a mask (e.g., a CIRM) for the FFT coefficients of the first phase signal of the first frequency band among the phase signal, which is to be applied to the FFT coefficients of the input speech signal 105. That is, the machine learning model 110 may be targeted to output a mask for the FFT coefficients of the first phase signal of the first frequency band when the phase signal of the input speech signal 105 is input.

In addition, the computer system 100 may input the amplitude signal of the input speech signal 105 to the machine learning model 110, and may obtain, as an output of the machine learning model 110, a first mask to be applied to the first amplitude signal of the second frequency band among the amplitude signal and a second mask to be applied to the second amplitude signal of the third frequency band, which is a higher frequency band than the second frequency band. That is, the machine learning model 110 may be targeted to output the first mask and the second mask when the amplitude signal of the input speech signal 105 is input.

The computer system 100 may reconstruct the noise-removed speech signal 120 based on the phase signal of the frequency bands other than the first frequency band among the phase signal of the input speech signal 105, the product of the mask (CIRM) and the FFT coefficients of the input speech signal 105, the product of the first mask and the first amplitude signal, and the product of the second mask and the second amplitude signal.

Regarding the input parameters of the machine learning model 110, the acquisition of the masks by the machine learning model 110, and the reconstruction of the speech signal 120, the description of the technical features given above with reference to FIGS. 1 to 6 applies as it is, and a redundant description is therefore omitted.

FIG. 7 illustrates masks estimated by a machine learning model, according to an example.

Each of <a> to <d> shown in FIG. 7 may represent an example of a mask (IRM or CIRM) estimated through inference by the machine learning model 110. In other words, each of <a> to <d> may represent an optimal value estimated by the machine learning model 110.

Multiplying the input speech signal 105 by the mask may suppress the noise contained in the input speech signal 105.

In <a> to <d>, for example, the x-axis may represent frequency (or time), and the y-axis may represent the value by which the input speech signal 105 is multiplied.

The shape of the mask estimated by the machine learning model 110 and the magnitude of its values may differ from those illustrated, depending on the input parameters of the machine learning model 110 described above and on the result of the estimation by the machine learning model 110.

The description of the technical features given above with reference to FIGS. 1 to 6 applies equally to FIG. 7, and a redundant description is therefore omitted.

FIG. 8 illustrates a method of processing a speech signal containing noise to reconstruct a noise-removed speech signal, according to an example.

With reference to FIG. 8, the method of obtaining the reconstructed, noise-removed speech signal 120 from the input speech signal 105, which contains both noise and speech, is described in more detail.

The method is described below following processes 1 to 18 shown in FIG. 8. Processes 1 to 18 may correspond to the steps described with reference to FIGS. 3 to 6. In FIG. 8, the machine learning model 110 is shown as a DNN.

1. The computer system 100 may perform an FFT on the noisy input speech signal 105. The input speech signal 105 is an audio signal that may contain a noise-free speech signal and noise (e.g., background noise). A 1024-point FFT may be performed. The sampling rate of the input speech signal 105 may be 48 kHz, the FFT frame size may be 20 ms (960 samples), and the hop size may be 2.5 ms (120 samples, that is, 12.5% of the frame length).

2. For the FFT of the input speech signal 105, the computer system 100 may obtain the amplitude (magnitude) signal by taking the square root of the sum of the squares of the real part and the imaginary part of each FFT coefficient.

3. The computer system 100 may obtain the phase signal by calculating the arctangent from the real part and the imaginary part of each FFT coefficient. The phase signal may be calculated to take values in the range -pi to pi so that the original speech signal can be reconstructed.

4. The computer system 100 may set the amplitude signal of the 0 to 8 kHz band (the first amplitude signal of the second frequency band described above) as an input of the DNN 110. To improve the accuracy of training and inference of the DNN 110, the computer system 100 may additionally calculate MFCCs and a ZCR for this input amplitude signal and input them to the DNN 110 as well. The MFCCs may be calculated over the 0 to 8 kHz range as 20 coefficients. The ZCR may be calculated using the input amplitude signal along the time axis. The MFCCs may provide information about the overall spectral shape of the amplitude signal, and the ZCR may provide information about the noise contained in the time-axis component of the amplitude signal (e.g., information about how much noise it contains).

5. For the amplitude signal of the 8 to 24 kHz band (the second amplitude signal of the third frequency band described above), the computer system 100 may calculate the average energy of each bandwidth section in units of the Bark scale. This reflects the psychoacoustic observation that humans perceive high-band speech signals at low resolution. According to the predefined Bark scale, the sections may include, for example, 8000 to 9600 Hz, 9600 to 12000 Hz, 12000 to 15600 Hz, 15600 to 20000 Hz, and 20000 to 24000 Hz, and the average frequency energy may be calculated for each section.

6. The computer system 100 may set the average energy values of the five bandwidth sections calculated in process 5 as input values of the DNN 110.

7. The computer system 100 may set the phase signal of the low band of 0 to 4 kHz (the first phase signal of the first frequency band described above) as an input of the DNN 110. According to the results of listening tests, even when the amplitude signal is clean (free of noise), the speech signal may be perceived by the user as noisy if the phase signal contains noise. However, even if the phase signal contains noise, the speech signal may be perceived by the user as clean as long as the phase signal in the 0 to 4 kHz band contains no noise. Therefore, in the embodiments, only the phase signal of the 0 to 4 kHz band may be used as an input (input parameter) of the DNN 110, so that the size of the network constituting the DNN 110 can be reduced.

8. The input of the DNN 110 may consist of the five kinds of parameters described above (amplitude signal, phase signal, Bark scale energy, MFCCs, and ZCR). To exploit the temporal continuity of the speech signal, the parameters of the current frame and of the seven preceding frames may be used, so that the parameters of eight frames in total may be used as the input of the DNN 110. In addition, to reduce the amount of computation, the DNN 110 may consist only of fully connected layers.

The masks output by the DNN 110 for restoring the frequency content of the speech signal 120 may consist of i) an IRM for the amplitude signal of the 0 to 8 kHz band (the second frequency band), ii) an IRM for the Bark scale energy of the amplitude signal of the 8 to 24 kHz band (the third frequency band), and iii) a CIRM for the FFT coefficients of the 0 to 4 kHz band (the first frequency band), for example the FFT coefficients of the input speech signal 105 in that band or the FFT coefficients of the first phase signal. In addition, as auxiliary information for improving inference performance, the DNN 110 may further include a network that outputs VAD information, through which additional information about the speech signal is learned. The output of each layer of the network that outputs the VAD information may be used as an input of the frequency restoration network used in reconstructing the speech signal 120.

9. As one of the outputs of the DNN 110, an IRM for the amplitude signal of the 0 to 8 kHz band (the second frequency band) may be output.

10. The computer system 100 may estimate the denoised amplitude signal by multiplying the IRM from process 9 by the amplitude signal of the input speech signal 105 (that is, the first amplitude signal).

11. As one of the outputs of the DNN 110, an IRM for the Bark scale energy of the amplitude signal of the 8 to 24 kHz band (the third frequency band) may be output.

12. The computer system 100 may estimate the denoised amplitude signal by multiplying the IRM from process 11 by the amplitude signal of the input speech signal 105 (that is, the second amplitude signal). Here, the computer system 100 may group the second amplitude signal according to the section ranges of the calculated Bark scale IRM and multiply by the IRM group by group. That is, the IRM may be multiplied by the second amplitude signal so that the average energy is adjusted for each Bark scale bandwidth section.

13. As one of the outputs of the DNN 110, a CIRM for the phase signal of the 0 to 4 kHz band (the first frequency band), that is, the first phase signal, may be output.

14. The computer system 100 may calculate the product of the CIRM and the FFT coefficients of the input speech signal 105 (or of the first phase signal). The product is a complex value that carries both amplitude and phase information. Of these, the amplitude information is discarded, and only the phase information, extracted using the arctangent, is used to reconstruct the speech signal 120.

15. For the phase signal of the 4 to 24 kHz band (the frequency band other than the first frequency band), the computer system 100 may use the input phase signal as it is in reconstructing the speech signal 120.

16. As one of the outputs of the DNN 110, VAD information may be output. As described above, the VAD information may be an additional parameter for improving the inference performance of the DNN 110.

17. The computer system 100 may reconstruct the FFT coefficients of the speech signal 120 using the (denoised) amplitude signal and phase signal estimated in processes 10, 12, and 14 described above. The computer system 100 may then reconstruct the speech signal 120 by performing an IFFT on the reconstructed FFT coefficients.

18. The computer system 100 may output the speech signal 120 reconstructed from the result of the IFFT.
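The feature extraction in processes 1 to 7 can be sketched per frame as follows. This is a hedged sketch: the function name `frame_features`, the zero-padding of a 960-sample frame to 1024 points, the absence of an analysis window, and the definition of "average energy" as the mean squared magnitude are assumptions not stated in the text. MFCC and ZCR computation are left out for brevity.

```python
import numpy as np

SAMPLE_RATE = 48_000           # Hz, from process 1
N_FFT = 1024                   # 1024-point FFT, from process 1
FRAME_LEN = 960                # 20 ms at 48 kHz
BARK_EDGES_HZ = [8000, 9600, 12000, 15600, 20000, 24000]  # from process 5

def frame_features(frame):
    """Return (low-band magnitude, low-band phase, high-band section energies)."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.fft.rfft(frame, n=N_FFT)            # zero-pads 960 -> 1024 (assumed)

    # Process 2: magnitude = sqrt(real^2 + imag^2).
    magnitude = np.sqrt(spectrum.real ** 2 + spectrum.imag ** 2)
    # Process 3: phase in [-pi, pi] via the two-argument arctangent.
    phase = np.arctan2(spectrum.imag, spectrum.real)

    hz_per_bin = SAMPLE_RATE / N_FFT
    n_mag = int(8000 / hz_per_bin)                    # bins below 8 kHz
    n_phase = int(4000 / hz_per_bin)                  # bins below 4 kHz

    # Process 5: average energy per high-band section.
    freqs = np.arange(magnitude.size) * hz_per_bin
    bark_energy = np.array([
        np.mean(magnitude[(freqs >= lo) & (freqs < hi)] ** 2)
        for lo, hi in zip(BARK_EDGES_HZ[:-1], BARK_EDGES_HZ[1:])
    ])
    return magnitude[:n_mag], phase[:n_phase], bark_energy
```

Using `numpy.arctan2` rather than a one-argument arctangent keeps the phase in the full -pi to pi range that process 3 requires for reconstruction.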

The description of the technical features given above with reference to FIGS. 1 to 7 applies equally to FIG. 8, and a redundant description is therefore omitted.

FIG. 9 illustrates the parameters input to the machine learning model and the structure of the machine learning model, according to an example.

In FIG. 9, the machine learning model 110 is shown as a DNN.

As illustrated, the input speech signal 105 input to the machine learning model 110 may consist of a plurality of frames. For example, to exploit the temporal continuity of the speech signal (audio signal), the machine learning model 110 may perform inference using, as its input, the parameters of the current frame and of the seven preceding frames (that is, eight frames in total).

For each frame of the input speech signal 105, as described above, the amplitude signal, the phase signal, the Bark scale energy, the MFCCs, and the ZCR may be the input parameters of the machine learning model 110.

For each of the eight frames, the machine learning model 110 may estimate the IRM from the amplitude signal and the CIRM from the phase signal. The machine learning model 110 may also estimate VAD information based on the input.

The network for estimating the IRM, the network for estimating the CIRM, and the network for estimating the VAD information may each consist of a plurality of fully connected layers.

As illustrated, the output of each layer of the network that estimates the VAD information may be used as an input of the frequency restoration network used in reconstructing the speech signal 120. In other words, the output of each layer of the network that estimates the VAD information may be input to the layers of the network for estimating the CIRM and to the layers of the network for estimating the IRM. The accuracy of the CIRM and IRM estimates may thus be improved by the VAD information supplied in this way.

The DNN of FIG. 9 is described in more detail below.

In FIG. 9, the number in parentheses next to each input parameter of the DNN may indicate the number of FFT bins. For the amplitude, "170" may indicate the number of FFT bins of the amplitude signal. For example, when a 1024-point FFT is performed on a signal sampled at 48000 Hz, the resolution of one frequency bin is 48000/1024 = 46.875 Hz. Expressing the 8 kHz low band of the amplitude signal (the second frequency band) at this resolution gives 8000/46.875 = approximately 170.67, so the amplitude signal input to the DNN may be represented by 170 frequency bins. The low band of the phase signal (the first frequency band) is 4 kHz, so 85 frequency bins, half of the 170, may be used for the phase signal. Accordingly, the phase input parameter of the DNN in FIG. 9 may be labeled Phase (85). The Bark scale energy is labeled 5, as illustrated, because the high band (the third frequency band) is divided into five bandwidth sections. The values 20 and 1 for the MFCCs and the ZCR may indicate the number of calculated MFCC coefficients and ZCR values, respectively.
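The bin counts above can be checked with a few lines of arithmetic. The total input size at the end assumes that all five parameter types are simply concatenated across the eight frames; the source gives the per-parameter counts and the frame count but does not state the total, so that figure is an inference.

```python
SAMPLE_RATE_HZ = 48_000
N_FFT = 1024

hz_per_bin = SAMPLE_RATE_HZ / N_FFT            # 46.875 Hz per bin
amplitude_bins = int(8_000 / hz_per_bin)       # 170 (8000 / 46.875 = 170.67, truncated)
phase_bins = int(4_000 / hz_per_bin)           # 85

BARK_SECTIONS, MFCC_COEFFS, ZCR_VALUES, FRAMES = 5, 20, 1, 8

features_per_frame = (amplitude_bins + phase_bins
                      + BARK_SECTIONS + MFCC_COEFFS + ZCR_VALUES)   # 281
# Assumption: plain concatenation of all eight frames.
stacked_input_size = FRAMES * features_per_frame                    # 2248
```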

The number in parentheses for each layer of the illustrated DNN may indicate the number of nodes in that layer. In the embodiments, the DNN may be designed, based on experiments, so that the number of nodes gradually decreases from the input toward the output. Alternatively, the number of nodes in each layer of the DNN may differ from what is illustrated.

In training for the VAD information, a noisy input speech signal may be labeled 1 for sections corresponding to speech and 0 for sections corresponding to non-speech. The output corresponding to the VAD information may thus be trained to be 0 or 1, and the last layer for learning/inferring the VAD information may therefore have only one node. The 64 nodes that first receive the input for learning/inferring the VAD information may be trained to distinguish the characteristics of speech sections from those of non-speech sections. In this way, the DNN can learn features specific to the input speech signal, and because the output of the layers for learning/inferring the VAD information is used as an input to the layers for learning/inferring the IRM and the CIRM, the performance of the DNN in learning/inferring the IRM and the CIRM can be improved.
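The way VAD-branch activations feed the mask branches can be sketched as a small fully connected forward pass. Only the 64-node VAD layer, the single VAD output node, and the input/output bin counts come from the text; the hidden width of 256, the single hidden layer per branch, the activations, the shared mask trunk, the real/imaginary layout of the CIRM output, and the random untrained weights are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N_INPUT = 8 * (170 + 85 + 5 + 20 + 1)   # eight stacked frames (assumed concatenation)
HIDDEN = 256                            # hypothetical mask-branch width

def dense(n_in, n_out):
    # Random, untrained weights; a real model would load trained values.
    return rng.normal(0.0, 0.05, size=(n_in, n_out)), np.zeros(n_out)

params = {
    "vad_hidden": dense(N_INPUT, 64),          # 64-node VAD layer (from the text)
    "vad_out": dense(64, 1),                   # single VAD output node (from the text)
    "mask_hidden": dense(N_INPUT + 64, HIDDEN),
    "irm_low": dense(HIDDEN, 170),
    "irm_bark": dense(HIDDEN, 5),
    "cirm": dense(HIDDEN, 2 * 85),             # real and imaginary parts (assumed layout)
}

def layer(name, x):
    w, b = params[name]
    return x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(features):
    vad_h = relu(layer("vad_hidden", features))
    vad = sigmoid(layer("vad_out", vad_h))
    # VAD-branch activations are concatenated into the mask-branch input.
    mask_h = relu(layer("mask_hidden", np.concatenate([features, vad_h])))
    irm_low = sigmoid(layer("irm_low", mask_h))
    irm_bark = sigmoid(layer("irm_bark", mask_h))
    cirm_ri = layer("cirm", mask_h)
    cirm = cirm_ri[:85] + 1j * cirm_ri[85:]
    return irm_low, irm_bark, cirm, vad
```

Concatenating the VAD hidden activations into the mask branch is one simple way to realize the described connection; the figure may wire the layers differently.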

IRM과 CIRM을 학습/추론하기 위한 레이어에 대해서는 입력 음성 신호의 특성을 나타내는 피쳐가 입력됨으로써, 노이즈가 제거된 음성 신호에 해당하는 IRM 및 CIRM이 추론될 수 있다. 말하자면, IRM과 CIRM을 학습/추론하기 위한 레이어는 이러한 IRM 및 CIRM을 추론하도록 학습될 수 있다. 예컨대, IRM과 CIRM을 학습/추론하기 위한 레이어는 입력 음성 신호의 이전의 프레임과 현재의 프레임의 음성 신호의 특성에 기반하여 IRM 및 CIRM 을 추론하도록 학습될 수 있고, 이러한 학습의 결과에 따라 IRM 및 CIRM 을 추론할 수 있다.For the layer for learning/inferring IRM and CIRM, features representing the characteristics of the input voice signal are input, so that IRM and CIRM corresponding to the voice signal from which the noise has been removed can be inferred. In other words, a layer for learning/inferring IRM and CIRM can be learned to infer these IRM and CIRM. For example, a layer for learning/inferring IRM and CIRM may be trained to infer IRM and CIRM based on the characteristics of the voice signal of the previous frame and the current frame of the input voice signal, and IRM according to the result of such learning and CIRM can be inferred.

The technical features described above with reference to FIGS. 1 to 8 apply equally to FIG. 9, so redundant description is omitted.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that execute on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The medium may store a computer-executable program persistently, or may store it temporarily for execution or download. The medium may also be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like configured to store program instructions. Other examples include recording media or storage media managed by an app store that distributes applications, or by sites, servers, and the like that supply or distribute various other software.

Although the embodiments have been described with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of a system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the following claims.

Claims

A method for processing, using a computer system, a voice signal including noise, the method comprising:
obtaining an amplitude signal and a phase signal from an input voice signal including noise on which Fast Fourier Transform (FFT) has been performed;
inputting the amplitude signal to a machine learning model pre-trained to estimate a mask to be applied to the input speech signal to remove the noise, and obtaining a mask to be applied to the amplitude signal as an output of the machine learning model;
inputting a first phase signal of a first frequency band among the phase signals to the machine learning model, and obtaining a mask to be applied to the FFT coefficients of the input speech signal as an output of the machine learning model; and
restoring a noise-removed speech signal based on the input speech signal and the masks obtained from the machine learning model.

The method of claim 1, wherein
the first phase signal is a signal of a relatively low frequency band among the phase signals, and
a phase signal of a frequency band other than the first frequency band among the phase signals is used as-is in restoring the noise-removed speech signal.

The method of claim 1, wherein
obtaining the mask to be applied to the amplitude signal comprises:
inputting a first amplitude signal of a second frequency band among the amplitude signals to the machine learning model, and obtaining a first mask to be applied to the first amplitude signal as an output of the machine learning model;
dividing a second amplitude signal of a third frequency band, which is a higher frequency band than the second frequency band, among the amplitude signals into amplitude signals of a plurality of bandwidth sections;
calculating an average energy for each of the divided amplitude signals; and
inputting the calculated average energy to the machine learning model, and obtaining a second mask to be applied to the second amplitude signal as an output of the machine learning model.

The method of claim 3, wherein
the second amplitude signal is divided into the amplitude signals of the plurality of bandwidth sections by dividing the third frequency band in units of the Bark scale.

The method of claim 3, wherein
the first mask is an ideal ratio mask (IRM) for the first amplitude signal,
the second mask is an IRM for the calculated average energy, and
in restoring the noise-removed speech signal,
the first mask is multiplied by the first amplitude signal and the second mask is multiplied by the second amplitude signal.

The method of claim 3, wherein
obtaining the first mask comprises:
calculating a predetermined number of Mel-Frequency Cepstral Coefficients (MFCCs) based on the first amplitude signal; and
inputting the calculated MFCCs to the machine learning model to obtain the first mask.

The method of claim 3, wherein
obtaining the first mask comprises:
calculating a Zero Crossing Rate (ZCR) based on the first amplitude signal; and
inputting the calculated ZCR to the machine learning model to obtain the first mask.

The method of claim 3, wherein
the input speech signal is a signal having a bandwidth of 24 kHz,
the first frequency band is a band of 0 kHz or more and less than 4 kHz,
the second frequency band is a band of 0 kHz or more and less than 8 kHz, and
the third frequency band is a band of 8 kHz or more and less than 24 kHz.

The method of claim 1, wherein
restoring the noise-removed speech signal comprises:
estimating a noise-removed amplitude signal by multiplying the amplitude signal by the mask to be applied to the amplitude signal;
estimating a noise-removed phase signal of the first frequency band by multiplying the FFT coefficients of the input speech signal by the mask to be applied to the first phase signal;
restoring FFT coefficients of the noise-removed speech signal based on the noise-removed amplitude signal, the noise-removed phase signal of the first frequency band, and a phase signal of a frequency band other than the first frequency band among the phase signals; and
reconstructing the noise-removed speech signal by performing an Inverse Fast Fourier Transform (IFFT) based on the restored FFT coefficients.

The method of claim 9, wherein
the input speech signal is a signal having a bandwidth of 24 kHz,
the first frequency band is a band of 0 kHz or more and less than 4 kHz, and
the frequency band other than the first frequency band is a band of 4 kHz or more and less than 24 kHz.

The method of claim 1, wherein
the machine learning model is composed of fully connected layers.

The method of claim 1, wherein
the machine learning model is further trained to estimate Voice Activity Detection (VAD) information for distinguishing between speech and silence,
the method further comprising obtaining VAD information for the input speech signal as an output of the machine learning model,
wherein the VAD information for the input speech signal is used in the frequency restoration of the noise-removed speech signal.

The method of claim 1, wherein
the mask to be applied to the first phase signal is a complex ideal ratio mask (CIRM) for the FFT coefficients of the first phase signal of the first frequency band, and
in restoring the noise-removed speech signal,
the CIRM is multiplied by the FFT coefficients of the input speech signal.

The method of claim 1, wherein
the amplitude signal and the phase signal are signals in units of a predetermined frequency bin, and
the noise-removed speech signal is restored in units of the frequency bin.

The method of claim 1, wherein
the input speech signal input to the machine learning model is composed of a plurality of frames.

A program recorded on a computer-readable recording medium for performing the method of any one of claims 1 to 15.

A computer system for processing a voice signal including noise, the computer system comprising:
at least one processor implemented to execute computer-readable instructions,
wherein the at least one processor is configured to:
obtain an amplitude signal and a phase signal from a noisy input speech signal on which a Fast Fourier Transform (FFT) has been performed;
input the amplitude signal to a machine learning model pre-trained to estimate a mask to be applied to the input speech signal to remove the noise, and obtain a mask to be applied to the amplitude signal as an output of the machine learning model;
input a first phase signal to the machine learning model, and obtain a mask to be applied to the FFT coefficients of the input speech signal as an output of the machine learning model; and
restore a noise-removed speech signal based on the input speech signal and the masks obtained from the machine learning model.

A method for processing, using a computer system, a voice signal including noise, the method comprising:
obtaining an amplitude signal and a phase signal from an input voice signal including noise on which Fast Fourier Transform (FFT) has been performed;
inputting the phase signal to a machine learning model pre-trained to estimate a mask to be applied to the input speech signal to remove the noise, and obtaining, as an output of the machine learning model, a mask to be applied to the FFT coefficients of the input speech signal, the mask being a mask for the FFT coefficients of a first phase signal of a first frequency band among the phase signals;
inputting the amplitude signal to the machine learning model, and obtaining, as an output of the machine learning model, a first mask to be applied to a first amplitude signal of a second frequency band among the amplitude signals and a second mask to be applied to a second amplitude signal of a third frequency band, which is a higher frequency band than the second frequency band; and
restoring a noise-removed speech signal based on a phase signal of a frequency band other than the first frequency band among the phase signals, a product of the mask and the FFT coefficients of the input speech signal, a product of the first mask and the first amplitude signal, and a product of the second mask and the second amplitude signal.