KR102401959B1

KR102401959B1 - Joint training method and apparatus for deep neural network-based dereverberation and beamforming for sound event detection in multi-channel environment

Info

Publication number: KR102401959B1
Application number: KR1020200070856A
Authority: KR
Inventors: 장준혁; 노경진
Original assignee: 한양대학교 산학협력단
Priority date: 2020-06-11
Filing date: 2020-06-11
Publication date: 2022-05-25
Also published as: KR20210153919A; WO2021251627A1

Abstract

A method and apparatus for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal are presented. A computer-implemented joint training method according to an embodiment includes: removing reverberation from a multi-channel acoustic signal captured in an environment with noise and reverberation, using a deep neural network-based WPE (Weighted Prediction Error) dereverberation technique; removing noise from the dereverberated multi-channel acoustic signal using a deep neural network-based MVDR (Minimum Variance Distortionless Response) beamforming technique; and inputting the resulting denoised, enhanced single-channel acoustic signal into a sound event detection model based on a convolutional recurrent neural network. The dereverberation model that removes the reverberation, the beamforming model that performs the beamforming, and the sound event detection model may operate as a single combined neural network.

Description

JOINT TRAINING METHOD AND APPARATUS FOR DEEP NEURAL NETWORK-BASED DEREVERBERATION AND BEAMFORMING FOR SOUND EVENT DETECTION IN MULTI-CHANNEL ENVIRONMENT

The following embodiments relate to a method and apparatus for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal.

Sound event detection is a technique that finds the onset and offset of a sound occurring in a given microphone input signal and classifies the type of that sound (e.g., coughing, keyboard typing, speech, or a ringing bell).

Non-Patent Document 1 studied various methods for systematically combining deep neural network-based dereverberation and beamforming modules, and ran speech recognition experiments in environments with noise and reverberation. Non-Patent Document 2 proposed a method that trains deep neural network-based dereverberation and beamforming modules jointly with an x-vector-based speaker verification module. Non-Patent Document 3 proposed a method that uses a convolutional recurrent neural network to detect sound events and estimate the direction of the sound at the same time.

Conventional sound event detection research has mainly sought either to improve performance by changing the structure of the neural network model itself (e.g., a convolutional recurrent neural network) or to perform another function (e.g., direction estimation) at the same time. Unlike the speech signals used in the related fields of speech recognition and speaker verification, acoustic signals have highly diverse characteristics, so there have been few attempts to improve sound event detection performance by enhancing the acoustic signal with a front-end preprocessing module.

As a result, the conventional techniques simply use a convolutional recurrent neural network-based sound event detection model; they are not robust in noisy and reverberant environments and do not exploit the advantage of using a multi-channel acoustic signal.

(Non-Patent Document 1) “Integrating neural network based beamforming and weighted prediction error dereverberation,” L. Drude, C. Boeddeker, J. Heymann, R. Haeb-Umbach, K. Kinoshita, M. Delcroix, and T. Nakatani, University of Paderborn & NTT, Interspeech, 2018.
(Non-Patent Document 2) “Joint optimization of neural acoustic beamforming and dereverberation with x-vectors for robust speaker verification,” J.-Y. Yang and J.-H. Chang, Hanyang University, Interspeech, 2019.
(Non-Patent Document 3) “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, Tampere University of Technology, IEEE Journal of Selected Topics in Signal Processing, 2018.

The embodiments describe a method and apparatus for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal. More specifically, they provide a technique for building a sound event detection system that is robust in environments with noise and reverberation by using a multi-channel acoustic signal.

An object of the embodiments is to provide a method and apparatus for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal, so that improved performance can be obtained in environments where noise and reverberation are present.

A computer-implemented method for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal according to an embodiment includes: removing reverberation from a multi-channel acoustic signal captured in an environment with noise and reverberation, using a deep neural network-based WPE (Weighted Prediction Error) dereverberation technique; removing noise from the dereverberated multi-channel acoustic signal using a deep neural network-based MVDR (Minimum Variance Distortionless Response) beamforming technique; and inputting the denoised, enhanced single-channel acoustic signal into a sound event detection model based on a convolutional recurrent neural network. The dereverberation model that removes the reverberation, the beamforming model that performs the beamforming, and the sound event detection model may operate as a single combined neural network.

The method may further include jointly training the dereverberation model, the beamforming model, and the sound event detection model using an error computed from the output of the sound event detection model and a target.

Removing reverberation from the multi-channel acoustic signal using the deep neural network-based WPE dereverberation technique may include: converting the multi-channel acoustic signal from the time domain to the frequency domain through a short-time Fourier transform (STFT); and enhancing the resulting STFT coefficients with a spectral mask estimated using the deep neural network-based WPE dereverberation technique.

In removing noise from the dereverberated multi-channel acoustic signal using the deep neural network-based MVDR beamforming technique, the STFT coefficients of the dereverberated multi-channel acoustic signal may be enhanced by two kinds of spectral masks estimated using the deep neural network-based MVDR beamforming technique.

In inputting the denoised, enhanced single-channel acoustic signal into the convolutional recurrent neural network-based sound event detection model, the STFT coefficients of that signal may be converted into log-scale mel filter bank features and then input to the sound event detection model.

In inputting the denoised, enhanced single-channel acoustic signal into the convolutional recurrent neural network-based sound event detection model, the convolution module of the convolutional recurrent neural network may connect two paths in parallel and then combine them: one path that performs convolution considering only the frequency axis, and one path that performs convolution considering the frequency axis and the time axis sequentially.

In jointly training the dereverberation model, the beamforming model, and the sound event detection model using the error computed from the output of the sound event detection model and the target, the loss function of the neural network may be set to focal loss to mitigate imbalance in the amount of data.

An apparatus for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal according to another embodiment includes: a dereverberation model that removes reverberation from a multi-channel acoustic signal captured in an environment with noise and reverberation, using a deep neural network-based WPE (Weighted Prediction Error) dereverberation technique; a beamforming model that removes noise from the dereverberated multi-channel acoustic signal using a deep neural network-based MVDR (Minimum Variance Distortionless Response) beamforming technique; and a sound event detection model, based on a convolutional recurrent neural network, that receives the denoised, enhanced single-channel acoustic signal as input. The dereverberation model, the beamforming model, and the sound event detection model may operate as a single combined neural network.

The apparatus may further include a training unit that jointly trains the dereverberation model, the beamforming model, and the sound event detection model using an error computed from the output of the sound event detection model and a target.

The dereverberation model may convert the multi-channel acoustic signal from the time domain to the frequency domain through a short-time Fourier transform (STFT), and the resulting STFT coefficients may be enhanced by a spectral mask estimated using the deep neural network-based WPE dereverberation technique.

In the sound event detection model, the STFT coefficients of the denoised, enhanced single-channel acoustic signal may be converted into log-scale mel filter bank features and then input to the convolutional recurrent neural network-based sound event detection model.

According to the embodiments, it is possible to provide a method and apparatus for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal, so that improved performance can be obtained in environments where noise and reverberation are present.

FIG. 1 is a diagram illustrating a method for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal according to an embodiment.
FIG. 2 is a flowchart of a method for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal according to an embodiment.
FIG. 3 is a block diagram of an apparatus for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal according to an embodiment.
FIG. 4 is a diagram illustrating a convolutional recurrent neural network model according to an embodiment.

Hereinafter, embodiments are described with reference to the accompanying drawings. The described embodiments may, however, be modified into various other forms, and the scope of the present embodiments is not limited to the embodiments described below. The embodiments are provided to explain the present disclosure more completely to those of ordinary skill in the art. The shapes and sizes of elements in the drawings may be exaggerated for clarity.

The following embodiments aim to obtain improved performance in environments with noise and reverberation by jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal.

FIG. 1 is a diagram illustrating a method for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal according to an embodiment.

FIG. 1 shows the method for jointly training the deep neural network-based dereverberation model 110, beamforming model 120, and sound event detection model 130 using a multi-channel acoustic signal. First, the dereverberation model 110 converts the multi-channel acoustic signal, captured in an environment with noise and reverberation, from the time domain to the frequency domain through a short-time Fourier transform (STFT). The resulting STFT coefficients are enhanced by a spectral mask estimated using the deep neural network-based WPE (Weighted Prediction Error) dereverberation technique.

Next, in the beamforming model 120, the STFT coefficients of the dereverberated, enhanced multi-channel acoustic signal are enhanced by two kinds of spectral masks (a source mask and a noise mask) estimated using the deep neural network-based MVDR (Minimum Variance Distortionless Response) beamforming technique.

Finally, the STFT coefficients of the denoised, enhanced single-channel acoustic signal are converted into log-scale mel filter bank features and input to the convolutional recurrent neural network-based sound event detection model 130.

The three models, namely the dereverberation model 110, the beamforming model 120, and the sound event detection model 130, operate as a single combined neural network and are trained jointly using the error computed from the output of the sound event detection model 130 and the target. Focal loss is used in the error computation to mitigate the degradation in learning caused by the imbalanced amounts of acoustic data.

In experiments, the deep neural network-based dereverberation model 110, beamforming model 120, and sound event detection model 130, jointly trained with focal loss as the loss function, showed improved sound event detection performance in environments with noise and reverberation.

FIG. 2 is a flowchart of a method for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal according to an embodiment.

Referring to FIG. 2, a computer-implemented method for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal according to an embodiment includes: removing reverberation from a multi-channel acoustic signal captured in an environment with noise and reverberation, using a deep neural network-based WPE (Weighted Prediction Error) dereverberation technique (S110); removing noise from the dereverberated multi-channel acoustic signal using a deep neural network-based MVDR (Minimum Variance Distortionless Response) beamforming technique (S120); and inputting the denoised, enhanced single-channel acoustic signal into a sound event detection model based on a convolutional recurrent neural network (S130). The dereverberation model that removes the reverberation, the beamforming model that performs the beamforming, and the sound event detection model may operate as a single combined neural network.

The method may further include jointly training the dereverberation model, the beamforming model, and the sound event detection model using an error computed from the output of the sound event detection model and a target (S140).

Each step of the joint training method according to an embodiment is described below.

The joint training method according to an embodiment is described in more detail by taking the joint training apparatus according to an embodiment as an example.

FIG. 3 is a block diagram of an apparatus for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal according to an embodiment.

Referring to FIG. 3, the joint training apparatus 300 according to an embodiment may include a dereverberation model 310, a beamforming model 320, and a sound event detection model 330. Depending on the embodiment, the joint training apparatus 300 may further include a training unit 340.

In step S110, the dereverberation model 310 may remove reverberation from the multi-channel acoustic signal, captured in an environment with noise and reverberation, using the deep neural network-based WPE (Weighted Prediction Error) dereverberation technique. Here, the dereverberation model 310 converts the multi-channel acoustic signal from the time domain to the frequency domain through a short-time Fourier transform (STFT), after which the resulting STFT coefficients may be enhanced by a spectral mask estimated using the deep neural network-based WPE dereverberation technique.
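The time-to-frequency conversion in step S110 can be sketched as follows. This is a minimal pure-Python illustration, assuming a Hann window, a naive DFT, and hypothetical frame length and hop values; the source does not specify any of these settings, and the function names are illustrative only.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Return STFT coefficients as a list of frames, each holding
    frame_len // 2 + 1 complex bins (one-sided spectrum)."""
    # Hann window (an assumed choice; the source does not name a window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT of the windowed frame at frequency bin k.
            bins.append(sum(
                seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)))
        frames.append(bins)
    return frames


def multichannel_stft(channels, frame_len=8, hop=4):
    """Apply the STFT to each microphone channel independently."""
    return [stft(ch, frame_len, hop) for ch in channels]


# Two hypothetical microphone channels carrying the same sinusoid,
# whose frequency falls exactly on bin 2 of an 8-point frame.
tone = [math.cos(2 * math.pi * 2 * n / 8) for n in range(32)]
Y = multichannel_stft([tone, [0.5 * s for s in tone]])
```

The result `Y` is indexed as channel, frame, frequency bin, which matches the (d, t, f) indexing used for the STFT coefficients in the signal model.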

In step S120, the beamforming model 320 may remove noise from the dereverberated multi-channel acoustic signal using the deep neural network-based MVDR (Minimum Variance Distortionless Response) beamforming technique. Here, in the beamforming model 320, the STFT coefficients of the dereverberated multi-channel acoustic signal may be enhanced by two kinds of spectral masks estimated using the deep neural network-based MVDR beamforming technique.
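The text states that two masks are estimated but does not spell out how they enter the beamformer. One common mask-based MVDR formulation, assumed here rather than taken from the source, weights the spatial covariance matrices by the source and noise masks and computes the filter as w = (Φ_N⁻¹ Φ_S) u / trace(Φ_N⁻¹ Φ_S), where u selects a reference microphone. The sketch below handles a single frequency bin with two microphones; all function names and the toy data are hypothetical.

```python
def outer(y):
    """Outer product y y^H of a length-2 complex vector."""
    return [[y[i] * y[j].conjugate() for j in range(2)] for i in range(2)]


def masked_covariance(frames, mask):
    """Mask-weighted spatial covariance for one frequency bin.
    frames: list of [y_ch0, y_ch1]; mask: per-frame weights in [0, 1]."""
    total = sum(mask) or 1.0
    cov = [[0j, 0j], [0j, 0j]]
    for y, m in zip(frames, mask):
        o = outer(y)
        for i in range(2):
            for j in range(2):
                cov[i][j] += m * o[i][j] / total
    return cov


def inverse_2x2(a, eps=1e-6):
    """Inverse of a 2x2 complex matrix; eps is diagonal loading for stability."""
    p, q, r, s = a[0][0] + eps, a[0][1], a[1][0], a[1][1] + eps
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


def matmul_2x2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mvdr_weights(frames, source_mask, noise_mask, ref=0):
    """Assumed mask-based MVDR filter with reference microphone `ref`."""
    phi_s = masked_covariance(frames, source_mask)
    phi_n = masked_covariance(frames, noise_mask)
    num = matmul_2x2(inverse_2x2(phi_n), phi_s)
    trace = num[0][0] + num[1][1]
    return [num[0][ref] / trace, num[1][ref] / trace]


def beamform(frames, w):
    """Apply w^H y to every frame, giving a single-channel output."""
    return [sum(w[d].conjugate() * y[d] for d in range(2)) for y in frames]


# Toy data: the source reaches both microphones in phase, the noise in
# anti-phase; the masks mark which frames belong to which.
frames = [[1 + 0j, 1 + 0j]] * 4 + [[1 + 0j, -1 + 0j]] * 4
source_mask = [1.0] * 4 + [0.0] * 4
noise_mask = [0.0] * 4 + [1.0] * 4
w = mvdr_weights(frames, source_mask, noise_mask)
out = beamform(frames, w)
```

In this toy case the source frames pass through unchanged and the noise frames are cancelled, which is the distortionless behavior that gives MVDR its name.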

In step S130, the denoised, enhanced single-channel acoustic signal may be input to the sound event detection model 330, which is based on a convolutional recurrent neural network. The dereverberation model 310, the beamforming model 320, and the sound event detection model 330 may operate as a single combined neural network. Here, the STFT coefficients of the denoised, enhanced single-channel acoustic signal may be converted into log-scale mel filter bank features and then input to the convolutional recurrent neural network-based sound event detection model 330.
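The conversion of STFT coefficients into log-scale mel filter bank features can be sketched as below. The mel formula is the widely used 2595·log10(1 + f/700) convention; the number of filters, the sample rate, and the small constant added before the logarithm are assumptions for illustration, since the source does not give these values.

```python
import math


def hz_to_mel(hz):
    """Common mel-scale mapping (an assumed convention)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_mels, n_bins, sample_rate):
    """Triangular filters whose centres are equally spaced on the mel scale.
    Returns n_mels rows, each with n_bins weights in [0, 1]."""
    nyquist = sample_rate / 2.0
    mel_points = [hz_to_mel(nyquist) * i / (n_mels + 1)
                  for i in range(n_mels + 2)]
    hz_points = [mel_to_hz(m) for m in mel_points]
    bin_hz = [nyquist * k / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, centre, hi = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (centre - lo)
            falling = (hi - f) / (hi - centre)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def log_mel(stft_frame, bank, eps=1e-10):
    """Log of the mel-weighted power spectrum of one STFT frame."""
    power = [abs(c) ** 2 for c in stft_frame]
    return [math.log(sum(w * p for w, p in zip(row, power)) + eps)
            for row in bank]


# Hypothetical sizes: 4 mel bands over 17 one-sided bins at 16 kHz.
bank = mel_filter_bank(4, 17, 16000)
features = log_mel([1 + 0j] * 17, bank)
```

Each STFT frame yields one feature vector of length `n_mels`; stacking these vectors over time gives the time-frequency input of the convolutional recurrent network.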

In the convolution module of its convolutional recurrent neural network, the sound event detection model 330 may connect two paths in parallel and then combine them: one path that performs convolution considering only the frequency axis, and one path that performs convolution considering the frequency axis and the time axis sequentially.
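The two-path convolution module can be illustrated with a toy sketch. It assumes one-dimensional kernels of odd length, zero ("same") padding, ReLU activations, and combination by stacking the two outputs as feature channels; the source does not specify kernel shapes, activations, or how the paths are combined, so all of these are assumptions, and the function names are hypothetical.

```python
def conv_freq(x, kernel):
    """Cross-correlate each time frame along the frequency axis.
    x: list of frames, each a list of frequency values."""
    pad = len(kernel) // 2
    out = []
    for frame in x:
        padded = [0.0] * pad + list(frame) + [0.0] * pad
        out.append([sum(kernel[k] * padded[f + k] for k in range(len(kernel)))
                    for f in range(len(frame))])
    return out


def conv_time(x, kernel):
    """Cross-correlate each frequency bin along the time axis."""
    transposed = [list(col) for col in zip(*x)]
    filtered = conv_freq(transposed, kernel)
    return [list(row) for row in zip(*filtered)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [[max(0.0, v) for v in row] for row in x]


def parallel_conv_block(x, k_freq_a, k_freq_b, k_time):
    """Path A: frequency-only convolution.
    Path B: frequency convolution followed by time convolution.
    The two outputs are stacked as two feature channels (assumed)."""
    path_a = relu(conv_freq(x, k_freq_a))
    path_b = relu(conv_time(relu(conv_freq(x, k_freq_b)), k_time))
    return [path_a, path_b]


# Toy input: 3 time frames by 4 frequency bins.
x = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0],
     [2.0, 2.0, 2.0, 2.0]]
identity = [0.0, 1.0, 0.0]
out = parallel_conv_block(x, identity, identity, identity)
```

With identity kernels both paths return the input unchanged, which makes it easy to check that the two paths preserve the time-frequency shape before being combined.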

In step S140, the training unit 340 may jointly train the dereverberation model 310, the beamforming model 320, and the sound event detection model 330 using the error computed from the output of the sound event detection model 330 and the target. Here, the training unit 340 may set the loss function of the neural network to focal loss to mitigate imbalance in the amount of data.
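Focal loss down-weights examples the model already classifies well, so that rare or hard examples contribute more to the error. A minimal sketch of the binary form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), is given below. The values γ = 2 and α = 0.25 are commonly used defaults and are assumptions here, as is the per-frame, per-class averaging; the source does not state the settings it used.

```python
import math


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over a flat list of predictions.
    probs: predicted event probabilities in (0, 1); targets: 0 or 1."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        p_t = p if y == 1 else 1.0 - p           # probability of the true label
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# A confident correct prediction contributes far less than a confident miss.
easy = focal_loss([0.9], [1])
hard = focal_loss([0.1], [1])
```

Setting `gamma=0` and `alpha=0.5` reduces the expression to half of the ordinary binary cross-entropy, which is a convenient sanity check.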

The method and apparatus for jointly training deep neural network-based dereverberation, beamforming, and sound event detection models using a multi-channel acoustic signal according to an embodiment are described in more detail below by way of example.

In this embodiment, drawing on existing techniques from the fields of speech recognition and speaker verification, the combined dereverberation, beamforming, and sound event detection modules operating on a multi-channel acoustic signal were trained with a single loss function based on focal loss. Unlike the prior art, the sub-modules of the convolutional recurrent neural network were modified, which yielded an additional performance improvement. In addition, whereas prior deep neural network-based dereverberation estimates the STFT coefficients directly, this embodiment estimates a spectral mask with values between 0 and 1, which makes training smoother.
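The mask-based enhancement can be illustrated with a short sketch. It assumes the mask is produced by a ReLU whose output is capped at 1, and that the mask is multiplied element-wise with the STFT coefficients; the function names and the sample values are hypothetical.

```python
def clipped_relu(v):
    """ReLU followed by an upper limit of 1, so the result lies in [0, 1]."""
    return min(max(v, 0.0), 1.0)


def apply_mask(stft_coeffs, raw_outputs):
    """Multiply each STFT coefficient by its mask value.
    raw_outputs are the unbounded network outputs before clipping."""
    return [clipped_relu(r) * c for c, r in zip(stft_coeffs, raw_outputs)]


# Negative outputs suppress a bin entirely; outputs above 1 leave it unchanged.
enhanced = apply_mask([2 + 2j, 1j, -3 + 0j], [-0.5, 0.25, 7.0])
```

Because the mask only scales each bin by a factor in [0, 1], the network's target range is bounded regardless of how large the STFT coefficients are.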

Example

Signal model and deep neural network-based WPE dereverberation

Assuming that the acoustic signal is captured by D microphones in an environment with noise and reverberation, the signal arriving at each microphone can be expressed as the sum of the reverberant acoustic signal and the noise, as follows.

[Equation 1]

(1)

Here, x and n denote the STFT (short-time Fourier transform) coefficients of the acoustic signal degraded by reverberation and of the noise signal; t, f, and d denote the time frame index, frequency-bin index, and microphone index, respectively; and y denotes the microphone input signal, that is, the sum of the acoustic signal and the noise signal. The superscripts (early) and (late) denote the early reflection signal and the late reflection signal, respectively. Each is a version of the original signal degraded by the reverberant characteristics of the space: the former arrives after a short delay and is assumed to be an acceptable degradation, whereas the latter persists for a relatively long time, is heavily degraded, and is regarded as the component to be removed. The signal to be obtained through dereverberation is computed with a linear prediction filter as follows.
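Written out, the signal model described above takes roughly the following form. This is a sketch inferred from the symbol definitions in the text and may differ in detail from Equation 1; in particular, whether the noise term is also split into early and late parts cannot be determined from the description, so it is left unsplit here.

```latex
y_{t,f,d} = x_{t,f,d} + n_{t,f,d}, \qquad
x_{t,f,d} = x^{(\mathrm{early})}_{t,f,d} + x^{(\mathrm{late})}_{t,f,d}
```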

[Equation 2]

Here, the left-hand side is the estimate of the early reflection signal obtained by linear prediction, the delay symbol is the delay of the linear prediction algorithm, and the stacked input vector and G are stacked representations of, respectively, the STFT coefficients of the microphone input signal and the linear prediction filter coefficients, collected over a range of past frames relative to the current frame t. The classical WPE reverberation removal technique estimates the linear prediction filter, which is used to estimate the late reflection component of the input signal, in the following iterative manner.

[Equation 3]

[Equation 4]

[Equation 5]

[Equation 6]

Here, the power term is the power of the estimated early reflection signal in the time-frequency domain, and R is the order of the linear prediction filter.
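The iterative estimation described above can be sketched for a single frequency bin as follows. This is a minimal illustration of the textbook WPE update, not the patent's exact Equations 3 to 6: the function names, the default `delay` and `taps` values, the zero padding at the start of the signal, and the small diagonal loading are all assumptions introduced here.

```python
import numpy as np

def stack_past_frames(Y, delay, taps):
    """Row t holds [y[t-delay], y[t-delay-1], ..., y[t-delay-taps+1]], zero-padded."""
    T, D = Y.shape
    stacked = np.zeros((T, D * taps), dtype=Y.dtype)
    for k in range(taps):
        shift = delay + k
        if shift < T:
            stacked[shift:, k * D:(k + 1) * D] = Y[:T - shift]
    return stacked

def wpe_iteration(Y, X_hat, delay=3, taps=5, eps=1e-8):
    """One WPE update for one frequency bin (hypothetical helper).

    Y, X_hat: complex arrays of shape (T, D) = (frames, microphones).
    Returns the updated early-reflection estimate and the prediction filter G.
    """
    Y_tilde = stack_past_frames(Y, delay, taps)
    # Power of the current early-reflection estimate, averaged over microphones.
    power = np.maximum(np.mean(np.abs(X_hat) ** 2, axis=1), eps)
    weighted = Y_tilde / power[:, None]
    # Power-weighted correlation matrix and cross-correlation with the input.
    corr_matrix = weighted.T @ Y_tilde.conj()
    corr_vector = weighted.T @ Y.conj()
    # Diagonal loading for numerical stability (an assumption of this sketch).
    corr_matrix = corr_matrix + eps * np.eye(corr_matrix.shape[0])
    G = np.linalg.solve(corr_matrix, corr_vector)
    # Subtract the predicted late reverberation: x_hat_t = y_t - G^H y_tilde_t.
    return Y - Y_tilde @ G.conj(), G
```

Calling `wpe_iteration(Y, Y)` performs the first pass with the observed signal as the initial estimate; feeding the returned estimate back in as `X_hat` repeats the iteration.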

In contrast, WPE reverberation removal using a deep neural network replaces part of the classical WPE algorithm with neural network logic. Specifically, the step in Equation 3 that estimates the power of the early reflection signal is replaced by a deep neural network, which takes the power of the microphone input signal as input and is trained to estimate the power of the component from which the late reflections have been removed. The network is thus trained to remove late reverberation from both the acoustic component and the noise component. To make training smoother, this embodiment does not estimate the STFT coefficients directly, since their range of values is relatively wide. Instead, it estimates a spectral mask with values between 0 and 1 and multiplies the mask by the STFT coefficients to obtain the enhanced STFT coefficients. To keep the mask between 0 and 1, the output layer of the mask-estimation network applies a ReLU (Rectified Linear Unit) function and then caps the output at 1. Once the network is trained, it is used to compute a power estimate of the early reflection signal for each microphone channel. These estimates are averaged over all channels to give a power estimate that replaces the left-hand side of Equation 3. Equations 4 to 6 are then applied, and finally Equation 2 yields the estimated STFT coefficients of the early reflection signal.
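The mask bounding and channel averaging described above can be sketched as follows. The network itself is left out: `raw_output` stands in for the pre-activation output of a hypothetical mask-estimation network, and applying the mask to the STFT coefficients before taking the power is one reading of the text, stated here as an assumption.

```python
import numpy as np

def clipped_relu(z):
    """ReLU followed by an upper cap at 1, so the mask lies in [0, 1]."""
    return np.minimum(np.maximum(z, 0.0), 1.0)

def early_power_from_mask(Y, raw_output):
    """Channel-averaged power estimate of the early reflection signal.

    Y: complex STFT of shape (T, F, D) = (frames, bins, microphones).
    raw_output: real array of the same shape (hypothetical network output).
    """
    mask = clipped_relu(raw_output)
    enhanced = mask * Y                      # masked STFT coefficients
    per_channel_power = np.abs(enhanced) ** 2
    return per_channel_power.mean(axis=-1)   # average over microphones
```

The returned array has shape (T, F) and plays the role of the power estimate that the classical algorithm would otherwise compute iteratively.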

Deep neural network-based MVDR beamforming

The classical MVDR beamforming technique aims to minimize the power of the residual noise in the beamformer output while keeping the output free of distortion. Solving this minimization problem yields the following MVDR gain.

[Equation 7]

Here, the two PSD terms are the power spectral density (PSD) matrices of the acoustic component and the noise component, respectively, and the remaining vector is a one-hot vector that selects the output channel. The MVDR beamformer output is obtained by multiplying the MVDR gain by the input signal, as expressed below.

[Equation 8]

Here, the left-hand side is the beamformer output signal in the time-frequency domain.
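As an illustration of the MVDR gain and beamformer output for one frequency bin, the sketch below uses a widely used PSD-based closed form with a one-hot reference vector. Whether this matches the patent's Equations 7 and 8 term by term is an assumption, and the function names and the small `eps` guard are hypothetical.

```python
import numpy as np

def mvdr_gain(psd_target, psd_noise, ref_channel=0, eps=1e-10):
    """MVDR gain h = (Phi_N^{-1} Phi_X) u / trace(Phi_N^{-1} Phi_X).

    psd_target, psd_noise: complex (D, D) PSD matrices for one frequency bin.
    ref_channel: index set to 1 in the one-hot output-selection vector u.
    """
    D = psd_target.shape[0]
    ratio = np.linalg.solve(psd_noise, psd_target)   # Phi_N^{-1} Phi_X
    u = np.zeros(D)
    u[ref_channel] = 1.0
    return (ratio @ u) / (np.trace(ratio) + eps)

def apply_beamformer(h, Y):
    """Beamformer output h^H y_t for every frame; Y has shape (T, D)."""
    return Y @ h.conj()
```

For a rank-one target PSD built from a steering vector a, this gain passes the target unchanged at the reference channel (h^H a equals the reference entry of a), which is the distortionless property the text describes.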

In contrast, the MVDR beamformer using a deep neural network replaces part of the classical beamformer algorithm with neural network logic. This embodiment uses a spectral mask-based MVDR beamformer: a deep neural network estimates time-frequency spectral masks for the acoustic component and the noise component, and the estimated masks are then used to compute the PSD matrices of the two components. Because the masks are estimated independently for each microphone channel, as many masks as there are microphones are computed for the acoustic component and for the noise component. To keep the mask values between 0 and 1, the output layer of the mask-estimation network uses a sigmoid function. Because separate spectral masks are estimated for the acoustic component and the noise component, a separate mask-estimation network is built for each. Once the networks are trained, the PSD matrices of the acoustic and noise components can be estimated with the following equation.

[Equation 9]

Here, the mask term is the average mask, obtained by averaging over all channels the mask estimates that the deep neural network produces for each microphone channel.

Substituting the PSD matrices obtained in this way into Equation 7 gives the MVDR gain. The mask-estimation network is trained to take the LPS (log-scale power spectra) computed from the input signal as input and to output the mask.
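The mask-based PSD estimate can be sketched for one frequency bin as below. The masks are taken as given (they would come from the mask-estimation networks), and normalizing by the sum of the average mask is a common convention assumed here, not something the text confirms.

```python
import numpy as np

def masked_psd(Y, masks, eps=1e-10):
    """Mask-weighted PSD matrix for one frequency bin.

    Y: complex array (T, D) of STFT coefficients (frames, microphones).
    masks: real array (T, D) with one mask value in [0, 1] per channel.
    """
    avg_mask = masks.mean(axis=1)            # average mask over channels
    weighted = Y * avg_mask[:, None]
    psd = weighted.T @ Y.conj()              # sum_t m_t * y_t y_t^H
    return psd / max(avg_mask.sum(), eps)    # normalization is an assumption
```

Calling this once with the acoustic-component masks and once with the noise-component masks gives the two PSD matrices needed for the MVDR gain.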

Convolutional recurrent neural network-based acoustic recognition

After the deep neural network-based WPE reverberation removal and MVDR beamforming have been applied in sequence, the resulting single-channel enhanced STFT coefficients are passed to the convolutional recurrent neural network-based acoustic recognition model. As in the prior art, the input to the network is a log-scale mel filter bank, and the convolutional recurrent neural network model is used to find the start and end positions of each sound and to estimate its type. The mel filter bank is the set of energy values computed by applying mel filters to the magnitudes of the STFT coefficients, where a mel filter is a nonlinear filter that mimics the human ear.
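A log-scale mel filter bank feature of the kind described above can be sketched as follows. The triangular filter shape, the 2595 log10(1 + f/700) mel formula, and the parameter values in the usage note are common conventions assumed here, not values given in the text.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters of shape (n_mels, n_fft // 2 + 1)."""
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # Filter edges are equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    edges = mel_to_hz(mel_points)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        fbank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def log_mel(stft, fbank, eps=1e-10):
    """Log mel energies from the STFT magnitudes; stft has shape (T, F)."""
    return np.log(np.abs(stft) @ fbank.T + eps)
```

For example, `log_mel(X, mel_filterbank(40, 1024, 48000))` maps a (T, 513) single-channel STFT `X` to a (T, 40) feature matrix.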

FIG. 3 is a diagram illustrating a convolutional recurrent neural network model according to an embodiment.

FIG. 3 compares the convolutional recurrent neural network model of the prior art with that of an embodiment. In this embodiment, the convolutional part of the convolutional recurrent neural network used in the prior art is modified to improve performance. The existing convolutional part extracts time-frequency features with several layers of 3x3 convolution filters. This embodiment instead connects two branches in parallel, one that convolves along the frequency axis only and one that convolves along the frequency axis and then the time axis, and combines their outputs, which extracts features more effectively and yields improved performance.

Joint training method for deep neural network-based reverberation removal, beamforming, and acoustic recognition models

The deep neural network-based reverberation removal, beamforming, and acoustic recognition models consist entirely of differentiable operations, so they can be connected into a single neural network and trained jointly. Focal loss may be used as the loss function for error backpropagation during training. By nature, acoustic data ranges from very short to very long sounds, and because, unlike speech, it has only recently begun to be collected and curated systematically for research, the amount of data varies greatly. Neural network training is generally efficient only when every class has the same amount of data, but in acoustic recognition the imbalance in data volume has always made training difficult, and various methods have been proposed to mitigate it. This embodiment introduces focal loss, originally proposed in the image domain, into acoustic recognition to compensate automatically for the data imbalance and thereby improve performance.

[Equation 10]

[Equation 11]

Here, gt denotes the ground truth: 1 means that a sound is present at the given time, and 0 means that no sound is present. p is the sound presence probability estimated by the neural network model, and the remaining symbol is the focusing parameter of the focal loss.

Focal loss is a loss function that works by down-weighting examples that are classified very easily. A class with a large amount of data is generally classified very easily, in which case the sound presence probability p estimated by the model becomes very large. According to Equation 10, when p is very large, the focusing parameter forces the error to contribute only a small amount.
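The behavior described above can be illustrated with the standard binary focal loss. This sketch follows the usual published form and is not a transcription of the patent's Equations 10 and 11; the name `gamma` for the focusing parameter, its default of 2.0, and the `eps` clamp are assumptions.

```python
import math

def focal_loss(p, gt, gamma=2.0, eps=1e-12):
    """Binary focal loss for one prediction.

    p: estimated sound presence probability in (0, 1).
    gt: ground truth, 1 if the sound is present and 0 otherwise.
    gamma: focusing parameter; gamma = 0 reduces to cross entropy.
    """
    p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
    if gt == 1:
        return -((1.0 - p) ** gamma) * math.log(p)
    return -(p ** gamma) * math.log(1.0 - p)
```

With `gt = 1`, a confident prediction such as `p = 0.9` is scaled by `(1 - 0.9) ** gamma`, so it contributes far less than it would under cross entropy, while a poor prediction such as `p = 0.1` keeps most of its loss.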

Meanwhile, because both the WPE reverberation removal and MVDR beamforming algorithms consist of operations on complex-valued STFT coefficients, implementing joint training in practice requires computing the real and imaginary parts of the complex numbers separately. In particular, the complex matrix inversions that appear in Equations 6 and 7 can be handled with the following formulas.

[Equation 12]

[Equation 13]

Here, the matrix being inverted is a complex matrix, and the two matrices that make it up are real matrices, each of which is invertible.
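One well-known way to invert a complex matrix using only real matrix operations is shown below, writing the complex matrix as A + iB with A and B real and invertible. The identity is standard linear algebra; whether it is the exact form of Equations 12 and 13 is an assumption, as are the names used.

```python
import numpy as np

def complex_inverse_via_real(A, B):
    """Real and imaginary parts of (A + iB)^{-1}, using real arithmetic only.

    Uses (A + iB)^{-1} = (A + B A^{-1} B)^{-1} - i (B + A B^{-1} A)^{-1},
    which requires A and B to be invertible real matrices.
    """
    A_inv = np.linalg.inv(A)
    B_inv = np.linalg.inv(B)
    real_part = np.linalg.inv(A + B @ A_inv @ B)
    imag_part = -np.linalg.inv(B + A @ B_inv @ A)
    return real_part, imag_part
```

Because every step is a real-valued inverse or product, the computation can be expressed in frameworks whose differentiable operations are defined only for real tensors.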

Experimental results

To evaluate this embodiment, the public TAU Spatial Sound Events 2019-Microphone Array dataset was used, which consists of acoustic signals recorded with a multi-channel microphone in indoor environments with noise and reverberation. The dataset was recorded with a tetrahedral 4-channel microphone array and consists of 500 files. Each file is one minute long, with a sampling frequency of 48,000 Hz and a signal-to-noise ratio of 30 dB, so the noise is relatively mild. To evaluate performance in a noisier environment, noise recorded indoors was added at a signal-to-noise ratio of 10 dB and used for additional evaluation. The dataset covers 11 sound classes: throat clearing, coughing, door knocking, door closing, drawer opening and closing, laughter, keyboard typing, keys being placed on a table, page turning, phone ringing, and speech, with 20 files used per class. Because the effectiveness of neural network training depends heavily on the amount of data, two data augmentation methods, pitch shifting and block mixing, were used to improve it.

Acoustic recognition algorithms are generally evaluated with two metrics, the F-score and the error rate (ER). These metrics can be computed in several ways; in this embodiment they were computed by stepping through non-overlapping 1-second segments and comparing the result with the ground truth in each segment. The F-score can be defined as follows.

[Equation 14]

Here, K is the total number of segments, TP is the number of true positives (the ground truth and the result are both 1), FP is the number of false positives (the ground truth is 0 but the result is 1), and FN is the number of false negatives (the ground truth is 1 but the result is 0).
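A segment-based F-score built from these counts can be sketched as follows. The formula 2 TP / (2 TP + FP + FN), summed over all segments, is the common segment-based definition and is assumed here to match Equation 14; the data layout (one list of 0/1 class labels per segment) is hypothetical.

```python
def segment_counts(reference, predicted):
    """Per-segment (TP, FP, FN) counts from 0/1 labels per class."""
    counts = []
    for ref_seg, pred_seg in zip(reference, predicted):
        tp = sum(1 for r, p in zip(ref_seg, pred_seg) if r == 1 and p == 1)
        fp = sum(1 for r, p in zip(ref_seg, pred_seg) if r == 0 and p == 1)
        fn = sum(1 for r, p in zip(ref_seg, pred_seg) if r == 1 and p == 0)
        counts.append((tp, fp, fn))
    return counts

def segment_f_score(reference, predicted):
    """F-score accumulated over all segments."""
    counts = segment_counts(reference, predicted)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    # The score is undefined when nothing is active or predicted; 0.0 is used here.
    return 2 * tp / denom if denom else 0.0
```

Each inner list holds one entry per sound class for a single 1-second segment, so `reference[k][c]` is 1 when class `c` is active in segment `k`.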

Next, the ER may be defined as follows.

[Equation 15]

Here, N(k) is the total number of entries whose ground truth is 1, and S(k), D(k), and I(k) stand for substitution, deletion, and insertion, respectively, and may be defined as follows.

[Equation 16]

[Equation 17]

[Equation 18]

Under ideal conditions, the F-score is 1 and the ER is 0.
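The error rate can be sketched in the same segment-based style. The per-segment definitions S = min(FN, FP), D = max(0, FN - FP), and I = max(0, FP - FN) are the commonly used ones and are assumed here to correspond to Equations 16 to 18; the data layout is the same hypothetical one-list-per-segment format.

```python
def segment_error_rate(reference, predicted):
    """Segment-based error rate: (S + D + I) summed over segments, divided by N."""
    total_errors = 0
    total_active = 0
    for ref_seg, pred_seg in zip(reference, predicted):
        fp = sum(1 for r, p in zip(ref_seg, pred_seg) if r == 0 and p == 1)
        fn = sum(1 for r, p in zip(ref_seg, pred_seg) if r == 1 and p == 0)
        substitutions = min(fn, fp)       # a miss paired with a false alarm
        deletions = max(0, fn - fp)       # remaining misses
        insertions = max(0, fp - fn)      # remaining false alarms
        total_errors += substitutions + deletions + insertions
        total_active += sum(ref_seg)      # N(k): ground-truth entries equal to 1
    # With no active ground-truth entries the rate is undefined; 0.0 is used here.
    return total_errors / total_active if total_active else 0.0
```

A perfect prediction gives an error rate of 0, consistent with the ideal case stated above, and the value can exceed 1 when there are many insertions.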

Table 1 shows the F-score and ER results on the TAU Spatial Sound Events 2019-Microphone Array dataset according to an embodiment.

[Table 1]

Table 2 shows the F-score and ER results on the TAU Spatial Sound Events 2019-Microphone Array dataset with indoor noise added at 10 dB, according to an embodiment.

[Table 2]

The experiments show that modifying the convolutional part of the acoustic recognition model improved the F-score by 1 to 2% over the prior art, with the ER at a similar level. Joint training with the deep neural network-based WPE reverberation removal and MVDR beamforming brought a substantial further improvement, and performance was best when all three modules were trained jointly. In particular, in the 10 dB signal-to-noise ratio environment, where noise and reverberation are relatively strong, the F-score improved by about 10% and the ER by about 0.2 compared with the prior art. In addition, using focal loss as the loss function during joint training gave somewhat better performance than using conventional cross entropy.

When sound captured by a multi-channel microphone in an environment with noise and reverberation is processed with the joint training technique of this embodiment, which combines deep neural network-based WPE reverberation removal, MVDR beamforming, and the acoustic recognition model, improved performance can be expected in acoustic recognition-based applications, particularly those for indoor use on smartphones, AI speakers, robots, CCTV, and similar devices.

As described above, the embodiments relate to a method of building a robust acoustic recognition system using multi-channel acoustic signals in an environment with noise and reverberation. According to the embodiments, the method can be applied to smartphones, AI speakers, robots, CCTV, and other devices that require acoustic recognition in such environments to improve recognition performance. To reflect the characteristics of a particular microphone, optimized performance can be obtained either by training on acoustic signals collected with that microphone or by using an adaptation technique that adapts to the microphone characteristics.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an ALU (arithmetic logic unit), a digital signal processor, a microcomputer, an FPA (field programmable array), a PLU (programmable logic unit), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are combined in a different form from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A computer-implemented method for joint training of deep neural network-based reverberation removal, beamforming, and acoustic recognition models using multi-channel acoustic signals, the method comprising:
removing reverberation from a multi-channel acoustic signal input in an environment with noise and reverberation, using a deep neural network-based WPE (Weighted Prediction Error) reverberation removal technique;
removing noise from the multi-channel acoustic signal from which the reverberation has been removed, using a deep neural network-based MVDR (Minimum Variance Distortionless Response) beamforming technique; and
inputting the single-channel enhanced acoustic signal from which the noise has been removed into a convolutional recurrent neural network-based acoustic recognition model,
wherein the reverberation removal model that removes the reverberation, the beamforming model that performs the beamforming, and the acoustic recognition model operate as one combined neural network, and
the method further comprises jointly training the reverberation removal model, the beamforming model, and the acoustic recognition model using an error computed from the output of the acoustic recognition model and a target.

(Deleted)

A computer-implemented method for joint training of deep neural network-based reverberation removal, beamforming, and acoustic recognition models using multi-channel acoustic signals, the method comprising:
removing reverberation from a multi-channel acoustic signal input in an environment with noise and reverberation, using a deep neural network-based WPE (Weighted Prediction Error) reverberation removal technique;
removing noise from the multi-channel acoustic signal from which the reverberation has been removed, using a deep neural network-based MVDR (Minimum Variance Distortionless Response) beamforming technique; and
inputting the single-channel enhanced acoustic signal from which the noise has been removed into a convolutional recurrent neural network-based acoustic recognition model,
wherein the reverberation removal model that removes the reverberation, the beamforming model that performs the beamforming, and the acoustic recognition model operate as one combined neural network, and
wherein removing the reverberation from the multi-channel acoustic signal using the deep neural network-based WPE reverberation removal technique comprises:
converting the multi-channel acoustic signal from the time domain to the frequency domain through a short-time Fourier transform (STFT); and
enhancing the resulting frequency-domain STFT coefficients with a spectral mask estimated using the deep neural network-based WPE reverberation removal technique.

A computer-implemented method for joint training of deep neural network-based reverberation removal, beamforming, and acoustic recognition models using multi-channel acoustic signals, the method comprising:
removing reverberation from a multi-channel acoustic signal input in an environment with noise and reverberation, using a deep neural network-based WPE (Weighted Prediction Error) reverberation removal technique;
removing noise from the multi-channel acoustic signal from which the reverberation has been removed, using a deep neural network-based MVDR (Minimum Variance Distortionless Response) beamforming technique; and
inputting the single-channel enhanced acoustic signal from which the noise has been removed into a convolutional recurrent neural network-based acoustic recognition model,
wherein the reverberation removal model that removes the reverberation, the beamforming model that performs the beamforming, and the acoustic recognition model operate as one combined neural network, and
wherein removing the noise from the multi-channel acoustic signal from which the reverberation has been removed, using the deep neural network-based MVDR beamforming technique, comprises
enhancing the STFT coefficients of the multi-channel acoustic signal from which the reverberation has been removed with two types of spectral masks estimated using the deep neural network-based MVDR beamforming technique.

The method of claim 1,
wherein inputting the single-channel enhanced acoustic signal from which the noise has been removed into the convolutional recurrent neural network-based acoustic recognition model comprises
converting the STFT coefficients of the single-channel enhanced acoustic signal from which the noise has been removed into a log-scale mel filter bank and inputting the result into the convolutional recurrent neural network-based acoustic recognition model.

The method of claim 1,
wherein, in inputting the single-channel enhanced acoustic signal from which the noise has been removed into the convolutional recurrent neural network-based acoustic recognition model,
the convolution module of the convolutional recurrent neural network connects in parallel a branch that performs convolution along the frequency axis only and a branch that performs convolution along the frequency axis and then the time axis, and then combines them.

According to claim 1,
The step of jointly training the reverberation cancellation model, the beamforming model, and the acoustic recognition model using an error calculated from an output of the acoustic recognition model and a target comprises:
Setting the loss function of the neural network to the focal loss in order to alleviate the imbalance in the amount of data,
characterized in that, a combined learning method.
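
For reference, a binary focal loss of the kind named above can be written in a few lines. The sketch assumes per-frame, per-class sigmoid outputs and binary targets; the values `gamma=2.0` and `alpha=0.25` are commonly used defaults, not values stated in the claim, and the function name is hypothetical.

```python
import numpy as np

def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss.

    probs: predicted probabilities in [0, 1]; targets: 0/1 labels of the same shape.
    gamma down-weights well-classified examples; alpha balances the two classes.
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    p_t = np.where(targets == 1, probs, 1.0 - probs)      # probability of the true label
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary binary cross-entropy; raising `gamma` shrinks the contribution of easy examples, which is how the loss counteracts an imbalance between frequent and rare labels.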

In the combined learning apparatus of the deep neural network-based reverberation cancellation, beamforming, and acoustic recognition model using multi-channel acoustic signals,
A reverberation cancellation model that removes reverberation from multi-channel acoustic signals input in an environment where noise and reverberation exist, using a deep neural network-based WPE (Weighted Prediction Error) dereverberation technique;
a beamforming model that removes noise from the dereverberated multi-channel acoustic signal using a deep neural network-based MVDR (Minimum Variance Distortionless Response) beamforming technique; and
An acoustic recognition model, based on a convolutional recurrent neural network, that receives the enhanced single-channel acoustic signal from which the noise has been removed
including,
The reverberation cancellation model, the beamforming model, and the acoustic recognition model operate as one combined neural network,
A learning unit that jointly trains the reverberation cancellation model, the beamforming model, and the acoustic recognition model using an error calculated from an output of the acoustic recognition model and a target
Further comprising, a combined learning device.

delete

In the combined learning apparatus of the deep neural network-based reverberation cancellation, beamforming, and acoustic recognition model using multi-channel acoustic signals,
A reverberation cancellation model that removes reverberation from multi-channel acoustic signals input in an environment where noise and reverberation exist, using a deep neural network-based WPE (Weighted Prediction Error) dereverberation technique;
a beamforming model that removes noise from the dereverberated multi-channel acoustic signal using a deep neural network-based MVDR (Minimum Variance Distortionless Response) beamforming technique; and
An acoustic recognition model, based on a convolutional recurrent neural network, that receives the enhanced single-channel acoustic signal from which the noise has been removed
including,
The reverberation cancellation model, the beamforming model, and the acoustic recognition model operate as one combined neural network,
A learning unit that jointly trains the reverberation cancellation model, the beamforming model, and the acoustic recognition model using an error calculated from an output of the acoustic recognition model and a target
further comprising,
In the reverberation cancellation model,
The multi-channel acoustic signal is transformed from the time domain to the frequency domain through a short-time Fourier transform (STFT), and the resulting STFT coefficients are enhanced using a spectral mask estimated with the deep neural network-based WPE dereverberation technique,
characterized in that, a combined learning device.
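
The mask-driven WPE step described above can be sketched as a single-pass NumPy routine. This is an illustration under assumptions, not the claimed implementation: the mask is taken as given (in the claim it is estimated by a deep neural network), the mask is assumed to select the early, non-reverberant part of the signal so that its masked power serves as the WPE weighting term, and the tap count, prediction delay, and function name are hypothetical.

```python
import numpy as np

def mask_based_wpe(stft, mask, taps=5, delay=3, eps=1e-8):
    """One-pass WPE dereverberation sketch.

    stft: complex array, shape (channels, frames, freqs), with frames > delay + taps.
    mask: real array in [0, 1], shape (frames, freqs); assumed DNN output.
    Returns the dereverberated multi-channel STFT with the same shape as stft.
    """
    n_ch, n_frames, n_freqs = stft.shape
    out = np.empty_like(stft)
    for f in range(n_freqs):
        y = stft[:, :, f]  # (channels, frames)
        # Power of the mask-estimated early signal, averaged over channels.
        power = np.maximum(np.mean(np.abs(mask[:, f] * y) ** 2, axis=0), eps)
        # Stack delayed copies of the observation: (channels * taps, frames).
        y_tilde = np.zeros((n_ch * taps, n_frames), dtype=complex)
        for k in range(taps):
            shift = delay + k
            y_tilde[k * n_ch:(k + 1) * n_ch, shift:] = y[:, :n_frames - shift]
        weighted = y_tilde / power
        corr = weighted @ y_tilde.conj().T + eps * np.eye(n_ch * taps)
        cross = weighted @ y.conj().T
        filt = np.linalg.solve(corr, cross)  # prediction filter, (channels * taps, channels)
        out[:, :, f] = y - filt.conj().T @ y_tilde  # subtract predicted late reverberation
    return out
```

Because the output keeps all channels, it can be passed directly to a multi-channel beamformer, which matches the ordering of the dereverberation and beamforming stages in the claim.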