KR102505653B1

KR102505653B1 - Method and apparatus for integrated echo and noise removal using deep neural network

Info

Publication number: KR102505653B1
Application number: KR1020200138406A
Authority: KR
Inventors: 장준혁; 박송규
Original assignee: 한양대학교 산학협력단
Priority date: 2020-10-23
Filing date: 2020-10-23
Publication date: 2023-03-03
Also published as: KR20220053995A

Abstract

An apparatus for integrated echo and noise removal using a deep neural network according to an embodiment may include: a feature vector extractor that receives a microphone input signal and a far-end signal and extracts a first feature vector for the microphone input signal and a second feature vector for the far-end signal; a pre-trained first artificial neural network that takes the first feature vector and the second feature vector as input information and outputs a first echo signal and a first noise signal, which are primary estimates of the echo signal and the noise signal contained in the microphone input signal; a pre-trained second artificial neural network that takes the first feature vector, the second feature vector, the first noise signal, and the second noise signal as input information and outputs a second echo signal and a second noise signal, which are secondary estimates of the echo signal and the noise signal; a pre-trained third artificial neural network that takes the first feature vector, the second feature vector, the first noise signal, and the second noise signal as input information and outputs an optimal gain for removing the echo signal and the noise signal from the microphone input signal; and a speech synthesis unit that uses the optimal gain to output a final estimated speech signal in which the echo signal and the noise signal have been removed from the microphone input signal.

Description

Method and apparatus for integrated echo and noise removal using deep neural network

The present invention relates to a method and apparatus for integrated echo and noise removal using a deep neural network, and more particularly to a technique for obtaining a speech signal from which echo and noise have been removed by applying an adversarial learning technique to primarily estimated echo and noise signals.

Speech communication refers to technology that delivers a user's spoken voice to another party so that users can communicate with one another. It is used not only in ordinary telephony but also in conference calls, video calls, video conferencing, and other fields. To convey meaning accurately, only the speaker's clean speech signal should be transmitted. However, the speaker's voice cannot be delivered accurately when two or more speakers talk at the same time, or when acoustic echo occurs, that is, when the previous speaker's speech played back through the loudspeaker is picked up again by the microphone so that loudspeaker playback and microphone input repeat.

In addition, the environment in which a portable digital device records a sound source or receives a speech signal is more often one that contains various noises and ambient interference than a quiet one. To recognize the user's speech accurately, a technique for separating and removing noise from the input speech signal is therefore important.

Accordingly, for an accurate speech signal to reach the other party, not only the echo signal described above but also the various noise signals arising in the user's surroundings must be removed. Techniques that remove echo and noise signals jointly have recently been proposed.

Integrated speech noise and echo removal is a technique for removing the noise and echo contained in a speech signal. In general, a noise canceller and an echo canceller are designed independently and then connected in series, so that noise removal and echo removal are performed sequentially. However, the performance of such a cascade varies greatly with the order of the two cancellers. For example, when the noise canceller precedes the echo canceller, the nonlinear operations of the noise canceller degrade the echo canceller's performance. Conversely, when the echo canceller precedes the noise canceller, the noise spectrum that the noise canceller must estimate can be distorted during echo cancellation, which degrades noise estimation.

An integrated technique that removes noise and echo together can therefore be used. Conventionally, statistical-model-based integrated noise and echo removal, which exploits statistical information relating the speech signal to the noise and echo, has mainly been used. However, statistical-model-based speech enhancement performs considerably worse in non-stationary noise environments than in stationary ones. For example, in speech recognition, when a recognition model is trained on clean signals without noise and then tested on noisy signals, performance drops.

To address this drop, techniques that train the speech recognition model on noisy speech have been proposed. Such models are optimized for the noise environments seen during training: they perform well when tested in those environments, but their performance degrades when tested in noise environments that were not seen during training.

More recently, as artificial neural network technology has advanced, the deep neural network (DNN), a machine learning technique, has shown excellent performance in a variety of speech enhancement and speech recognition studies. A DNN performs well because its multiple hidden layers and hidden nodes effectively model the nonlinear relationship between input and output feature vectors. Techniques that use a DNN to remove noise and echo together are therefore being developed. Most conventional techniques either remove noise and echo one after the other or estimate the speech directly. The former suppresses part of the speech at each removal stage; the latter cannot estimate the speech accurately because of the residual echo and noise that remain.

Korean Patent Registration No. 10-1871604, 'Method and apparatus for estimating reverberation time based on multi-channel microphone using deep neural network' (published June 25, 2018)
Korean Patent Registration No. 10-1988504, 'Reinforcement learning method using virtual environment created by deep learning' (published June 5, 2019)

Accordingly, the method and apparatus for integrated echo and noise removal using a deep neural network according to an embodiment were devised to solve the problems described above, and aim to provide an apparatus and method capable of extracting a speech signal from which echo and noise have been removed more efficiently.

Specifically, the aim is to provide an integrated echo and noise removal apparatus and method that first estimates the noise signal and the echo signal from the microphone input signal and the far-end signal, then applies an adversarial learning technique to those primary estimates to produce final estimates of the noise and echo signals contained in the microphone input, and finally provides only the user's clean speech signal, with the noise and echo signals removed from the microphone input signal.

The second artificial neural network may include a second echo signal estimation artificial neural network that outputs the second echo signal and a second noise signal estimation artificial neural network that outputs the second noise signal.

When estimating the second echo signal, the second echo signal estimation artificial neural network may be trained with an adversarial learning technique so that the second noise signal does not influence the estimation of the second echo signal.

The loss function of the second echo signal estimation artificial neural network may include a first loss function, defined as the difference between the second echo signal and a first reference signal, and a second loss function, defined as the difference between the second noise signal and a second reference signal.

The second echo signal estimation artificial neural network may be trained so that the value of the first loss function is minimized and the value of the second loss function is maximized.
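
The minimize/maximize objective described above can be sketched as a single scalar value to be minimized. This is only an illustration under stated assumptions, not the patent's implementation: the use of mean squared error for both terms, the weight `lam`, and every function name are hypothetical.

```python
import numpy as np

def mse(estimate, reference):
    """Mean squared error between two arrays of equal shape."""
    return float(np.mean((np.asarray(estimate) - np.asarray(reference)) ** 2))

def echo_estimator_objective(echo_est, echo_ref, noise_est, noise_ref, lam=0.1):
    """Value to minimize when training the second echo signal estimator.

    The first loss (echo estimate vs. first reference) enters with a plus sign,
    so minimizing the objective minimizes it. The second loss (noise estimate
    vs. second reference) enters with a minus sign, so minimizing the objective
    maximizes it. `lam` is a hypothetical weight not given in the source.
    """
    first_loss = mse(echo_est, echo_ref)
    second_loss = mse(noise_est, noise_ref)
    return first_loss - lam * second_loss

# Toy values: a perfect echo estimate and a poor noise estimate.
value = echo_estimator_objective(
    echo_est=[1.0, 2.0], echo_ref=[1.0, 2.0],
    noise_est=[0.0, 0.0], noise_ref=[1.0, 1.0],
    lam=0.5,
)
```

In adversarial training setups, a sign-flipped term of this kind is often realized with a gradient reversal layer; this section does not say whether the patent does so.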

When estimating the second noise signal, the second noise signal estimation artificial neural network may be trained with an adversarial learning technique so that the second echo signal does not influence the estimation of the second noise signal.

The loss function of the second noise signal estimation artificial neural network may include a third loss function, defined as the difference between the second noise signal and a third reference signal, and a fourth loss function, defined as the difference between the second echo signal and a fourth reference signal.

The second noise signal estimation artificial neural network may be trained so that the value of the third loss function is minimized and the value of the fourth loss function is maximized.
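
One compact way to write the training goal of the second noise signal estimator is given below. The notation is an assumption introduced here for illustration: $\hat{N}_2$ and $\hat{D}_2$ denote the second noise and second echo signals, $R_3$ and $R_4$ the third and fourth reference signals, $\mathcal{L}_3$ and $\mathcal{L}_4$ the third and fourth loss functions, $\theta_N$ the estimator's parameters, and $\lambda > 0$ a hypothetical weight that the source does not specify.

```latex
\theta_N^{*} \;=\; \arg\min_{\theta_N}
\Big[\, \mathcal{L}_3\big(\hat{N}_2, R_3\big) \;-\; \lambda\, \mathcal{L}_4\big(\hat{D}_2, R_4\big) \Big]
```

Minimizing this expression drives $\mathcal{L}_3$ down and $\mathcal{L}_4$ up, which mirrors the roles that the first and second loss functions play for the echo estimator.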

The feature vector extractor may perform a short-time Fourier transform (STFT) to convert the time-domain microphone input signal and far-end speaker signal into frequency-domain signals, and may extract the log power spectrum (LPS) of the converted frequency-domain signals as the feature vectors.
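
The STFT-then-LPS feature extraction described above can be sketched in NumPy as follows. The frame length, hop size, Hann window, natural logarithm, and the small constant `eps` are all assumptions made for illustration; the source does not state these settings, and the random arrays merely stand in for real signals.

```python
import numpy as np

def stft(signal, frame_len=512, hop=256):
    """Short-time Fourier transform with a Hann window.

    Returns a complex array of shape (num_frames, frame_len // 2 + 1).
    The signal must contain at least `frame_len` samples.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    return np.fft.rfft(frames, axis=1)

def log_power_spectrum(spec, eps=1e-12):
    """Log power spectrum; eps (an assumption) avoids taking log(0)."""
    return np.log(np.abs(spec) ** 2 + eps)

# Toy inputs standing in for the microphone input and far-end signals.
rng = np.random.default_rng(0)
mic = rng.standard_normal(16000)
far_end = rng.standard_normal(16000)

first_feature = log_power_spectrum(stft(mic))       # from the microphone input
second_feature = log_power_spectrum(stft(far_end))  # from the far-end signal
```

Each row of `first_feature` and `second_feature` corresponds to one frame and each column to one frequency bin.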

The third artificial neural network may estimate a continuous optimal gain through regression. With the mean squared error (MSE) as its objective function, the third artificial neural network may be trained to minimize the difference between the target feature vector, which is the integrated noise and echo removal gain, and the optimal gain estimated by the third artificial neural network.
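
The MSE objective for the gain regression can be illustrated with a small numerical example. This section does not define how the target gain is computed, so the sketch assumes a magnitude-ratio target (clean speech magnitude over microphone magnitude, clipped to [0, 1]); that choice and all names below are hypothetical.

```python
import numpy as np

def target_gain(clean_mag, mic_mag, eps=1e-12):
    """Hypothetical target: clean-to-microphone magnitude ratio, clipped to [0, 1]."""
    return np.clip(clean_mag / (mic_mag + eps), 0.0, 1.0)

def mse_objective(estimated_gain, target):
    """Mean squared error between the estimated gain and the target gain."""
    return float(np.mean((estimated_gain - target) ** 2))

# Toy magnitudes indexed as (frame, frequency bin).
clean_mag = np.array([[1.0, 0.5], [0.0, 2.0]])
mic_mag = np.array([[2.0, 0.5], [1.0, 4.0]])

target = target_gain(clean_mag, mic_mag)
estimated = np.array([[0.5, 1.0], [0.5, 0.5]])  # stand-in for a network output
loss = mse_objective(estimated, target)
```

Only one of the four bins is wrong here (the estimate is 0.5 where the target is 0), so the loss is a quarter of 0.25.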

The first artificial neural network may include a first echo signal estimation artificial neural network that outputs the first echo signal, and a first noise signal estimation artificial neural network that is configured independently of the first echo signal estimation artificial neural network and outputs the first noise signal.

An apparatus for integrated echo and noise removal using a deep neural network according to another embodiment may include: a feature vector extractor that receives a microphone input signal and a far-end signal and extracts a first feature vector for the microphone input signal and a second feature vector for the far-end signal; a pre-trained first artificial neural network that takes the first feature vector and the second feature vector as input information and outputs a first echo signal and a first noise signal, which are primary estimates of the echo signal and the noise signal contained in the microphone input signal; and a pre-trained second artificial neural network that takes the first feature vector, the second feature vector, the first noise signal, and the second noise signal as input information and outputs a second echo signal and a second noise signal, which are secondary estimates of the echo signal and the noise signal.

A method for integrated echo and noise removal using a deep neural network according to another embodiment may include: a feature vector extraction step of extracting, from a speech signal comprising a microphone input signal and a far-end signal, a first feature vector for the microphone input signal and a second feature vector for the far-end signal; a step of using a pre-trained first artificial neural network to output, based on the first feature vector and the second feature vector, a first echo signal and a first noise signal, which are primary estimates of the echo signal and the noise signal contained in the microphone input signal; a step of using a pre-trained second artificial neural network to output, based on the first feature vector, the second feature vector, the first noise signal, and the second noise signal, a second echo signal and a second noise signal, which are secondary estimates of the echo signal and the noise signal; a step of using a pre-trained third artificial neural network to output, based on the first feature vector, the second feature vector, the first noise signal, and the second noise signal, an optimal gain for removing the echo signal and the noise signal from the speech signal; and a step of using the optimal gain to output a final estimated speech signal in which the echo signal and the noise signal have been removed from the microphone input signal.

The second artificial neural network may be configured to include a second echo signal estimation artificial neural network that outputs the second echo signal and a second noise signal estimation artificial neural network that outputs the second noise signal.

The step of estimating the second echo signal may estimate the second echo signal using an adversarial learning technique in which the second noise signal does not influence the estimation of the second echo signal, and the step of estimating the second noise signal may estimate the second noise signal using an adversarial learning technique in which the second echo signal does not influence the estimation of the second noise signal.

Unlike the prior art, the method for integrated echo and noise removal using a deep neural network according to an embodiment estimates the echo signal and the noise signal contained in the microphone input signal separately, while using an adversarial learning technique so that the two estimates influence each other as little as possible. As a result, it can extract the user's speech signal with the echo and noise signals removed more cleanly.

Accordingly, when a user's speech is collected through a microphone and processed in an environment where background noise and echo signals are present, such as AI speakers used at home, robots used in airports, speech recognition systems, and PC voice communication systems, the background noise and echo signals can be removed more efficiently, which improves speech quality and intelligibility.

FIG. 1 is a diagram showing the signals input to the integrated echo and noise removal apparatus in a speech communication environment where echo and noise are present.
FIG. 2 is a block diagram showing some components of the integrated echo and noise removal apparatus using a deep neural network according to an embodiment of the present invention.
FIG. 3 is a diagram showing the input information and the output information of the feature vector extractor of the present invention.
FIG. 4 is a diagram showing the input information and the output information of the first artificial neural network of the present invention.
FIG. 5 is a diagram for explaining the training method of the first echo signal estimation artificial neural network of the present invention.
FIG. 6 is a diagram for explaining the training method of the first noise signal estimation artificial neural network of the present invention.
FIG. 7 is a diagram showing the input information and the output information of the second artificial neural network of the present invention.
FIG. 8 is a diagram for explaining the training method of the second echo signal estimation artificial neural network of the present invention.
FIG. 9 is a diagram for explaining the training method of the second noise signal estimation artificial neural network of the present invention.
FIG. 10 is a diagram for explaining the input information and the output information of the third artificial neural network and the speech synthesis unit of the present invention.
FIG. 11 is a table showing the parameter settings of the RIR generator used for the experimental data of the present invention.
FIGS. 12 and 13 are diagrams comparing the training results of the present invention with those of other artificial neural network models.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings. In assigning reference numerals to the components in each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the embodiments, detailed descriptions of related known configurations or functions are omitted where they would obscure understanding of the embodiments. The embodiments are described below, but the technical idea of the present invention is not limited to them and may be modified and practiced in various ways by those skilled in the art.

The terms used in this specification are used to describe the embodiments and are not intended to limit the disclosed invention. Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this specification, terms such as "include", "comprise", or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "indirectly connected" with another element in between. Terms containing ordinals, such as "first" and "second", may be used to describe various components, but the components are not limited by these terms.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice them. Parts of the drawings unrelated to the description are omitted so that the invention can be explained clearly.

Speech enhancement is a technique that estimates clean speech by removing the noise and echo signals picked up by a microphone, and it is essential for speech applications such as speech recognition and speech communication. For example, when a speech recognition model is trained on clean signals without noise or echo and then tested on noisy signals, performance drops. Applying speech enhancement to remove noise and echo before recognition can improve recognition performance. Speech enhancement can also be used in speech communication to raise call quality by removing noise and echo so that speech is delivered clearly.

The embodiments described below relate to a technique that uses a deep neural network (DNN) to jointly remove the noise and echo present in speech. More specifically, to address the problem that a DNN does not train well from the far-end speaker signal and the microphone input alone, the embodiments first estimate the echo signal and the noise signal independently, and then use these estimates with an adversarial learning technique to estimate the echo signal and the noise signal a second time, so that both can be estimated more accurately.

The embodiments are described using the short-time Fourier transform (STFT) and the inverse short-time Fourier transform (ISTFT) as examples, but this is only one embodiment. Besides the STFT and ISTFT, the discrete Fourier transform (DFT) and inverse discrete Fourier transform (IDFT), or the fast Fourier transform (FFT) and inverse fast Fourier transform (IFFT), may also be used.

The technique for jointly removing the noise and echo contained in a speech signal using a deep neural network is described in more detail below.

FIG. 1 shows the signals input to the integrated echo and noise removal apparatus in a speech communication environment where echo and noise signals are present, and FIG. 2 is a block diagram showing some components of the integrated echo and noise removal apparatus using a deep neural network according to an embodiment of the present invention.

Referring to FIG. 1, the signal y(t) input to the microphone 10 may consist, as in Equation (1) below, of the sum of three components: s(t), the speech signal that the user inputs to the microphone 10; n(t), the noise signal generated by various sources in the space where the user is located; and d(t), the echo signal produced when the far-end signal output through the loudspeaker 20 is convolved with the room impulse response (RIR) between the microphone 10 and the loudspeaker 20 and re-enters the microphone 10.

Equation (1): $y(t) = s(t) + n(t) + d(t)$
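
The additive signal model described above can be simulated with toy data as follows. This is a sketch for intuition only: the random signals, the three-tap room impulse response, and the noise level are invented values, not settings from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 1000

s = rng.standard_normal(num_samples)        # near-end speech s(t) (toy data)
n = 0.1 * rng.standard_normal(num_samples)  # background noise n(t)
x = rng.standard_normal(num_samples)        # far-end signal x(t)

# Toy room impulse response: a direct path plus two weaker reflections.
rir = np.zeros(64)
rir[0], rir[20], rir[45] = 0.6, 0.3, 0.1

# Echo d(t): far-end signal convolved with the RIR, truncated to the signal length.
d = np.convolve(x, rir)[:num_samples]

# Microphone input y(t): sum of speech, noise, and echo.
y = s + n + d
```

The apparatus only observes `y` and `x`; the components `s`, `n`, and `d` are what the networks have to estimate.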

The integrated echo and noise removal apparatus 100 according to the present invention may use the microphone input signal y(t) and the far-end signal x(t) to output s'(t), an estimate of the speaker's speech signal s(t). Here, the microphone input signal containing noise and echo means a microphone input signal in which noise and echo are present simultaneously.

Referring to FIG. 2, the integrated echo and noise removal apparatus 100 using a deep neural network may include a feature vector extractor 110 that extracts feature vectors, a first artificial neural network 120 that estimates and outputs the first echo signal and the first noise signal, a second artificial neural network 130 that estimates and outputs the second echo signal and the second noise signal, a third artificial neural network 140 that outputs the integrated echo and noise removal gain, and a speech synthesis unit 150 that outputs the final estimated speech, which is an estimate of the user's speech.
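
As a rough illustration of the last stage, the sketch below applies a gain to a microphone spectrum. It assumes a real-valued, non-negative gain per time-frequency bin that scales the microphone magnitude while the microphone phase is reused; this section does not specify how the speech synthesis unit 150 handles phase, and the function name is hypothetical.

```python
import numpy as np

def apply_gain(mic_spec, gain):
    """Scale the microphone magnitude by the gain and keep the microphone phase."""
    magnitude = np.abs(mic_spec) * gain
    phase = np.angle(mic_spec)
    return magnitude * np.exp(1j * phase)

# Toy spectrum with one frame and two frequency bins.
mic_spec = np.array([[1.0 + 1.0j, 0.0 + 2.0j]])
gain = np.array([[0.5, 0.0]])  # stand-in for the third network's output

estimated_spec = apply_gain(mic_spec, gain)
```

An inverse transform (for example the ISTFT) would then return the estimated spectrum to the time domain.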

FIG. 3 is a diagram showing the input information fed to the feature vector extractor of the present invention and the output information it produces.

As shown in FIG. 3, the feature vector extractor 110 may include a Fourier transform unit 111 and an LPS extraction unit 112.

The Fourier transform unit 111 may transform the time-domain microphone input signal 11 and far-end signal 12 into frequency-domain expressions. Specifically, the Fourier transform unit 111 may apply the short-time Fourier transform (STFT) to s(t), d(t), and n(t) contained in the microphone input signal 11, converting them into S(k,l), D(k,l), and N(k,l), respectively, so that the microphone input signal is expressed in the frequency domain as in Equation (2) below.

Equation (2): $Y(k,l) = S(k,l) + D(k,l) + N(k,l)$

In Equation (2), k and l denote the frequency-bin index and the frame index, respectively. The short-time Fourier transform (STFT) is performed to extract the feature vectors fed to the artificial neural networks, which can make the computation more efficient.

The far-end signal 12 may likewise be converted from the time domain to a frequency-domain expression by the Fourier transform unit 111. Specifically, the far-end signal 12, x(t), may be expressed as X(k,l) by the short-time Fourier transform (STFT) described above.

To obtain the input feature vectors for the plurality of artificial neural networks 120, 130, and 140, the LPS extraction unit 112 of the feature vector extractor 110 may extract the log power spectrum (LPS) of the frequency-domain signals.

Specifically, the LPS extraction unit 112 may extract the log power spectrum of the signals obtained when the Fourier transform unit 111 transforms the microphone input signal 11 and the far-end signal 12, and may output the microphone input signal LPS 21, which corresponds to the first feature vector, and the far-end signal LPS 22, which corresponds to the second feature vector. The microphone input signal LPS 21 and the far-end signal LPS 22 output by the feature vector extractor 110 may be used as input feature vectors of the first artificial neural network 120, the second artificial neural network 130, and the third artificial neural network 140.
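As a rough illustration of the feature extraction described above, the sketch below computes the log power spectrum of a single frame. The Hann window, the natural logarithm, the small floor `eps`, and all function names are assumptions; the text does not specify them. A plain DFT is used instead of an FFT to keep the example dependency-free.

```python
import cmath
import math


def hann(n_len):
    """Periodic Hann window (assumed; the window type is not stated in the text)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_len) for n in range(n_len)]


def stft_frame(frame):
    """DFT of one windowed frame, returning the non-negative frequency bins 0..N/2."""
    n_len = len(frame)
    windowed = [f * w for f, w in zip(frame, hann(n_len))]
    return [
        sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / n_len) for n in range(n_len))
        for k in range(n_len // 2 + 1)
    ]


def log_power_spectrum(spectrum, eps=1e-12):
    """LPS per bin: log(|X(k,l)|^2 + eps); eps avoids log(0) (assumption)."""
    return [math.log(abs(c) ** 2 + eps) for c in spectrum]


# One 32-sample frame containing a sinusoid centred on bin 4.
frame = [math.cos(2 * math.pi * 4 * n / 32) for n in range(32)]
lps = log_power_spectrum(stft_frame(frame))
peak_bin = lps.index(max(lps))
```

In the apparatus the same computation would be applied to every frame of both the microphone input signal and the far-end signal, giving the two feature vectors that feed the networks.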

FIG. 4 is a diagram showing the input information fed to the first artificial neural network of the present invention and the output information it produces. FIG. 5 is a diagram for explaining the training method of the first echo signal estimation artificial neural network of the present invention, and FIG. 6 is a diagram for explaining the training method of the first noise signal estimation artificial neural network.

Referring to FIG. 4, the first artificial neural network 120 takes as input information the microphone input signal LPS 21 and the far-end signal LPS 22 output by the feature vector extractor 110, and produces as output information a first echo signal 31 and a first noise signal 32, which are estimates of the echo signal and the noise signal contained in the microphone input signal 11. The first artificial neural network 120 may include a training session (not shown) that performs training based on the input and output information, and an inference session (not shown) that infers the output information from the input information.

The training session of the first artificial neural network 120 performs training based on the input information 21, 22 and the output information 31, 32. The inference session uses the trained first artificial neural network 120 to analyze the input information 21, 22 arriving in real time and may output, as output information, the first echo signal 31 and the first noise signal 32, which estimate the echo signal and the noise signal contained in the microphone input signal 11. The first echo signal 31 contains the LPS information of the primarily estimated echo signal, and the first noise signal 32 may contain the LPS information of the primarily estimated noise signal.

Specifically, the first artificial neural network 120 according to the present invention may be trained by receiving, as input information, feature vectors of arbitrary length contained in the microphone input signal LPS 21 and the far-end signal LPS 22 output by the feature vector extractor 110, and estimating the noise signal and the echo signal frame by frame.

In one embodiment, the first artificial neural network 120 receives frame-level feature vectors arranged in time order as its input information, passes them through two hidden layers using nonlinear operations, and is finally trained to estimate the LPS of the echo signal and the noise signal at the output layer.
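The two-hidden-layer structure described in this embodiment can be sketched as a small feed-forward pass. The ReLU nonlinearity, the layer widths, the random initialization, and the function names are all assumptions made for illustration; the text states only that two hidden layers with nonlinear operations precede the output layer.

```python
import random


def dense(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    """Assumed nonlinearity; the text does not name the activation function."""
    return [max(0.0, a) for a in v]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero bias (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def estimate_lps(mic_lps, far_lps, layers):
    """Concatenate both feature vectors, apply the hidden layers, then a linear output."""
    h = mic_lps + far_lps
    for weights, bias in layers[:-1]:
        h = relu(dense(h, weights, bias))
    weights, bias = layers[-1]
    return dense(h, weights, bias)


rng = random.Random(0)
K = 5  # toy number of frequency bins
layers = [init_layer(2 * K, 8, rng), init_layer(8, 8, rng), init_layer(8, K, rng)]
echo_lps_estimate = estimate_lps([0.1] * K, [0.2] * K, layers)
```

An untrained network of this shape produces meaningless values; the point is only the data flow from the two LPS inputs to one LPS estimate per frequency bin.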

In addition, to increase the accuracy of its output information, the first artificial neural network 120 does not estimate the echo signal and the noise signal, the two sources of voice distortion, in a single artificial neural network. Instead, it may use a separately configured first echo signal estimation artificial neural network 121 and first noise signal estimation artificial neural network 122, so that network 121 estimates the echo signal and network 122 estimates the noise signal.

Referring to FIG. 5, to estimate the LPS of the echo signal more accurately, the first echo signal estimation artificial neural network 121 may be trained to minimize a loss function defined as the mean squared error between the LPS of the target echo signal contained in the microphone input signal 11, that is, the echo reference signal 41, and the LPS of the first echo signal 31 estimated by the network 121. Specifically, the loss function of the first echo signal estimation artificial neural network 121 may be defined as in Equation (3) below.

Equation (3) -

Likewise, to estimate the LPS of the noise signal accurately, the first noise signal estimation artificial neural network 122 may be trained, as shown in FIG. 6, to minimize a loss function defined as the mean squared error between the LPS of the target noise signal contained in the microphone input signal 11, that is, the noise reference signal 42, and the LPS of the first noise signal 32 estimated by the network 122. The loss function of the first noise signal estimation artificial neural network 122 may be defined as in Equation (4) below.

Equation (4) -

In Equations (3) and (4), M is the total number of frames used for training and K is the total number of frequency bins used for training after the STFT. The output information produced by the first artificial neural network 120 may be used as input information of the second artificial neural network 130. The second and third artificial neural networks are described in detail below.
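A mean squared error over M frames and K frequency bins, as described for the first-stage losses, might be computed as in the following sketch. The normalization by M times K and the function name are assumptions, since the exact form of Equations (3) and (4) is not reproduced in this text.

```python
def lps_mse(target_lps, estimated_lps):
    """Mean squared error between two M x K log-power-spectrum arrays.

    target_lps[l][k] is the reference LPS for frame l and frequency bin k;
    estimated_lps has the same shape. Dividing by M * K is an assumption.
    """
    num_frames = len(target_lps)          # M
    num_bins = len(target_lps[0])         # K
    squared_error = sum(
        (t - e) ** 2
        for target_row, estimate_row in zip(target_lps, estimated_lps)
        for t, e in zip(target_row, estimate_row)
    )
    return squared_error / (num_frames * num_bins)


# Toy example with M = 2 frames and K = 2 bins.
reference = [[1.0, 2.0], [3.0, 4.0]]
estimate = [[1.0, 2.0], [3.0, 2.0]]
loss = lps_mse(reference, estimate)
```

The same function would serve for both the echo and the noise estimator, with the echo reference signal 41 or the noise reference signal 42 as the target.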

FIG. 7 is a diagram showing the input information fed to the second artificial neural network of the present invention and the output information it produces. FIG. 8 is a diagram for explaining the training method of the second echo signal estimation artificial neural network of the present invention, FIG. 9 is a diagram for explaining the training method of the second noise signal estimation artificial neural network, and FIG. 10 is a diagram for explaining the third artificial neural network and the voice synthesis unit.

Referring to FIG. 7, the second artificial neural network 130 takes as input information the microphone input signal LPS 21 and the far-end signal LPS 22 output by the feature vector extractor 110, together with the first echo signal 31 and the first noise signal 32 output by the first artificial neural network 120, and produces as output information a second echo signal 51 and a second noise signal 52, which are second-stage estimates of the echo signal and the noise signal contained in the microphone input signal 11. The second artificial neural network 130 may include a training session (not shown) that performs training based on the input and output information, and an inference session (not shown) that infers the output information from the input information.

The training session of the second artificial neural network 130 can perform training based on the input information 31, 32, 21, 22 and the output information 51, 52. The inference session uses the trained second artificial neural network 130 to analyze the input information 31, 32, 21, 22 arriving in real time and may output, as output information, the second echo signal 51 and the second noise signal 52, which re-estimate the echo signal and the noise signal contained in the microphone input signal 11.

That is, the second echo signal 51 is a re-estimate of the echo signal contained in the microphone input signal 11, based on the first echo signal 31 primarily estimated by the first artificial neural network 120 and the other input information 32, 21, 22. The second noise signal 52 is a re-estimate of the noise signal contained in the microphone input signal 11, based on the first noise signal 32 primarily estimated by the first artificial neural network 120 and the other input information 31, 21, 22. Here, the second echo signal 51 may contain the LPS information of the secondarily estimated echo signal, and the second noise signal 52 may contain the LPS information of the secondarily estimated noise signal.

Because the LPS of the first echo signal 31 and the LPS of the first noise signal 32 output by the first artificial neural network 120 are estimated independently, without taking each other's components into account, their accuracy may be relatively low, and using this estimated echo and noise information as it is may cause voice distortion.

Therefore, removing the echo signal and the noise signal simultaneously is effective for preventing voice distortion. To implement this efficiently, the present invention does not estimate the echo signal and the noise signal, the two sources of voice distortion, in a single artificial neural network. Instead, an independent second echo signal estimation artificial neural network 131 and second noise signal estimation artificial neural network 132 estimate the echo signal and the noise signal, respectively, which prevents the echo signal and the noise signal from being estimated redundantly.

Specifically, each network may be trained with an adversarial training method. The second echo signal estimation artificial neural network 131 is trained so that its estimate of the echo signal approaches the actual value while it responds insensitively to the noise signal. The second noise signal estimation artificial neural network 132 is trained so that its estimate of the noise signal approaches the actual value while it responds insensitively to the echo signal.

Adversarial training here refers to the following method for cases where several factors can affect the accuracy of an estimated signal. For a component that can improve accuracy, training proceeds in the direction in which the estimate improves (the direction in which the value of the loss function decreases). For a component that can harm accuracy, training proceeds in the direction in which the artificial neural network becomes insensitive to that component (the direction in which the value of the loss function increases).

When estimation is performed in this way, the second echo signal estimation artificial neural network 131 concentrates on the echo signal and the second noise signal estimation artificial neural network 132 concentrates on the noise signal, so the second artificial neural network avoids redundant estimation when estimating the echo and noise signals. In addition, because networks 131 and 132 take into account the first echo signal 31 and the first noise signal 32, the echo and noise information already estimated by the first artificial neural network 120, they can estimate the echo signal and the noise signal more accurately than the first artificial neural network 120.

That is, as shown in FIG. 8, to estimate the LPS of the echo signal accurately, the second echo signal estimation artificial neural network 131 takes as a first loss function the difference between the LPS of the second echo signal 51 estimated by the network 131 and a first reference signal 61, which is information about the actual echo signal contained in the microphone input signal 11, and may be trained in the direction that minimizes the value of the first loss function. Accordingly, the parameters used to estimate the echo signal may be updated in the direction that minimizes the loss function, so that the LPS of the echo signal is estimated better.

In addition, when estimating the echo signal, the second echo signal estimation artificial neural network 131 is trained so that the noise signal does not affect the estimate. To this end, the difference between the LPS of the second noise signal 52 estimated through the second echo signal estimation artificial neural network 131 and the LPS of a second reference signal, which is the corresponding reference information, is taken as a second loss function, and training may be performed in the direction that maximizes the value of the second loss function.

Accordingly, for the parameters used to estimate the noise signal, the second echo signal estimation artificial neural network 131 is trained in the direction that maximizes the loss function, so that the LPS of the noise signal is not estimated well. The network 131 can thus be trained to be insensitive to noise-signal estimation performance while improving echo-signal estimation performance. The loss function of the second echo signal estimation artificial neural network 131 may consist of Equation (3) described above with Equation (5) below added.

Equation (5) -

Conversely, to estimate the LPS of the noise signal accurately, the second noise signal estimation artificial neural network 132 takes as a third loss function, as shown in FIG. 9, the difference between the LPS of the second noise signal 52 estimated by the network 132 and a third reference signal 62, which is information about the actual noise signal contained in the microphone input signal 11, and may be trained in the direction that minimizes the value of the third loss function. Accordingly, the parameters used to estimate the noise signal may be updated in the direction that minimizes the loss function, so that the LPS of the noise signal is estimated better.

In addition, when estimating the noise signal, the second noise signal estimation artificial neural network 132 is trained so that the echo signal does not affect the estimate. To this end, the difference between the LPS of the second echo signal 51 estimated through the second noise signal estimation artificial neural network 132 and the LPS of a fourth reference signal, which is the corresponding reference information, is taken as a fourth loss function, and training may be performed in the direction that maximizes the value of the fourth loss function.

Accordingly, for the parameters used to estimate the echo signal, the second noise signal estimation artificial neural network 132 is trained in the direction that maximizes the loss function, so that the LPS of the echo signal is not estimated well. The network 132 can thus be trained to be insensitive to echo-signal estimation performance while improving noise-signal estimation performance. The loss function of the second noise signal estimation artificial neural network 132 may consist of Equation (4) described above with Equation (6) below added.

Equation (6) -

During training, the second echo signal estimation artificial neural network 131 updates its parameters so that Equation (3) is minimized and Equation (5) is maximized, and the second noise signal estimation artificial neural network 132 updates its parameters so that Equation (4) is minimized and Equation (6) is maximized. Accordingly, the parameter update rule of Equation (7) below may be applied to network 131, and the parameter update rule of Equation (8) below may be applied to network 132.

Equation (7) -

Equation (8) -

In Equations (7) and (8), […] denotes the learning rate, and […] and […] are hyperparameters that can control the degree of gradient reversal during back-propagation.
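The idea of minimizing one loss while maximizing another through a reversed, scaled gradient can be shown on a toy problem. This is only a sketch under stated assumptions: the scalar parameter, the two quadratic losses, the names `learning_rate` and `reversal_weight`, and the update form itself are hypothetical stand-ins, since the exact Equations (7) and (8) are not reproduced in this text.

```python
def main_loss(theta, target=1.0):
    """Toy stand-in for the loss to be minimized (e.g. the echo estimation error)."""
    return (theta - target) ** 2


def adversarial_loss(theta, target=-1.0):
    """Toy stand-in for the loss to be maximized (e.g. the noise estimation error)."""
    return (theta - target) ** 2


def grad(f, theta, h=1e-6):
    """Central-difference numerical derivative."""
    return (f(theta + h) - f(theta - h)) / (2 * h)


def reversal_step(theta, learning_rate, reversal_weight):
    """Descend on the main loss and ascend on the adversarial loss.

    The adversarial gradient enters with a reversed sign, scaled by reversal_weight.
    """
    g_main = grad(main_loss, theta)
    g_adv = grad(adversarial_loss, theta)
    return theta - learning_rate * (g_main - reversal_weight * g_adv)


theta = 0.0
before = (main_loss(theta), adversarial_loss(theta))
for _ in range(200):
    theta = reversal_step(theta, learning_rate=0.1, reversal_weight=0.1)
after = (main_loss(theta), adversarial_loss(theta))
```

In this toy setting the main loss falls while the adversarial loss rises, which mirrors the intended behavior of each second-stage estimator. A reversal weight close to or above 1 would make the toy problem unbounded, which hints at why the degree of reversal is treated as a tunable hyperparameter.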

Once the second artificial neural network 130 outputs the second echo signal 51 and the second noise signal 52, which estimate the echo signal and the noise signal contained in the microphone input signal 11, the third artificial neural network 140 may output an optimal gain 70 for jointly removing the echo signal and the noise signal contained in the microphone input signal 11. It does so based on the LPS information of the second echo signal 51 and of the second noise signal 52, together with the microphone input signal LPS 21 and the far-end signal LPS 22 output by the feature vector extractor 110. The voice synthesis unit 150 may then use the optimal gain information to output a final estimated voice signal 80, in which the echo signal and the noise signal have been removed from the microphone input signal.

Specifically, the third artificial neural network 140 may expand the squared magnitude of Equation (2) described above as in Equation (9) below.

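Equation (9): $|Y(k,l)|^2 = |S(k,l)|^2 + |D(k,l)|^2 + |N(k,l)|^2 + S(k,l)D^*(k,l) + S^*(k,l)D(k,l) + S(k,l)N^*(k,l) + S^*(k,l)N(k,l) + D(k,l)N^*(k,l) + D^*(k,l)N(k,l)$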

In Equation (9), * denotes the conjugate operator. Because speech, echo, and noise are usually assumed to be statistically independent, all components other than the squared terms become 0. The mask, which is the optimal gain for estimating the speech, can therefore be expressed as in Equation (10) below and takes a value between 0 and 1.

Equation (10) -
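One common way to obtain a gain bounded between 0 and 1 from the three squared terms is a power-ratio mask, sketched below. Whether Equation (10) has exactly this form (for example, with or without a square root) cannot be determined from this text, so the formula, the floor `eps`, and the function name are assumptions.

```python
def ratio_mask(speech_power, echo_power, noise_power, eps=1e-12):
    """Assumed mask: speech power divided by the total power of speech, echo, and noise.

    With non-negative inputs the result always lies in [0, 1].
    eps guards against division by zero in silent bins (assumption).
    """
    return speech_power / (speech_power + echo_power + noise_power + eps)


# Per-bin toy powers |S|^2, |D|^2, |N|^2 for three frequency bins.
speech = [1.0, 0.0, 4.0]
echo = [1.0, 2.0, 0.0]
noise = [2.0, 1.0, 0.0]
mask = [ratio_mask(s, d, n) for s, d, n in zip(speech, echo, noise)]
```

A bin dominated by speech gets a mask near 1 and is passed almost unchanged, while a bin that contains only echo and noise gets a mask of 0 and is suppressed.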

The third artificial neural network 140 may be trained to increase the accuracy of the estimated mask. Specifically, the mean squared error between the mask estimated by the third artificial neural network 140 and the actual mask serving as the reference information is taken as the loss function, and training may proceed in the direction that minimizes the value of the loss function. The loss function is given by Equation (11) below.

Equation (11) -

The voice synthesis unit 150 may obtain the LPS of the user's speech signal by multiplying the microphone input signal LPS 21, which contains the noise and echo signals, by the estimated integrated noise and echo cancellation gain. However, because the estimated speech signal does not take phase information into account, Equation (12) below is used to multiply the magnitude of the user's speech signal estimated in the frequency domain by the phase information of the microphone input signal, after which the inverse short-time Fourier transform (ISTFT) is performed to obtain the waveform of the final speech signal with noise and echo removed. In other words, a final estimated speech signal from which noise and echo have been jointly removed can be obtained.

Equation (12) -
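The combination of an estimated magnitude with the microphone phase can be sketched per frequency bin as follows. Applying the gain directly to the spectral magnitude is an assumption made for illustration, the function name is hypothetical, and the final ISTFT step is omitted.

```python
import cmath


def synthesize_bins(mic_bins, gains):
    """Scale each microphone STFT bin's magnitude by its gain and reuse the mic phase."""
    out = []
    for y, g in zip(mic_bins, gains):
        magnitude = g * abs(y)        # estimated speech magnitude (assumed form)
        phase = cmath.phase(y)        # phase taken from the microphone input signal
        out.append(cmath.rect(magnitude, phase))
    return out


# Two toy complex bins of the microphone spectrum and their gains.
mic_bins = [3 + 4j, -2j]
estimated = synthesize_bins(mic_bins, [0.5, 1.0])
```

Because the phase is copied from the microphone signal, only the magnitude of each bin changes; an inverse STFT over all frames would then yield the time-domain waveform.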

FIGS. 11 to 13 show experimental data for explaining the effects of the present invention. FIG. 11 shows the parameter settings of the room impulse response (RIR) generator, and FIGS. 12 and 13 compare the results of the present invention with those of other artificial neural network models.

The experiments described with reference to FIGS. 11 to 13 were conducted using the TIMIT database, all of which consists of signals sampled at 16 kHz. For the experiments, data obtained by convolving the echo signal into the speech signal, together with noise signal data, were used to build a training dataset of 3,000 utterances and an evaluation dataset of 184 utterances. To generate multi-channel speech signals contaminated by noise and echo signals, RIRs were generated by simulating various room environments with an RIR generator toolkit, which produces a room impulse response (RIR) for a given room environment through simulation. Eighteen RIRs were applied to the training dataset and two RIRs to the evaluation dataset, and the room environments for RIR generation were set randomly according to the settings in the table shown in FIG. 11.

The ITU-T Recommendation P.501 database was used for the noise signals, and the noise was added at random to the evaluation speech dataset. For training, the signal-to-noise ratio (SNR) of each mixture was chosen at random from 4 dB, 8 dB, and 12 dB; for evaluation, it was fixed at 10 dB.
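Mixing noise into speech at a chosen SNR, as described above, amounts to rescaling the noise so that the power ratio matches the target. The following is a minimal sketch; the function names, the toy signals, and the seeded random choice are hypothetical and not part of the described experiments.

```python
import math
import random


def mean_power(signal):
    """Average power of a finite signal."""
    return sum(v * v for v in signal) / len(signal)


def scale_noise_to_snr(speech, noise, snr_db):
    """Rescale noise so that 10*log10(P_speech / P_noise) equals snr_db."""
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    return [gain * n for n in noise]


def mix_at_snr(speech, noise, snr_db):
    """Add the rescaled noise to the speech, sample by sample."""
    return [s + n for s, n in zip(speech, scale_noise_to_snr(speech, noise, snr_db))]


speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]

rng = random.Random(0)
train_snr_db = rng.choice([4, 8, 12])   # training: one of three SNRs chosen at random
eval_snr_db = 10                        # evaluation: fixed SNR

scaled = scale_noise_to_snr(speech, noise, eval_snr_db)
achieved_snr_db = 10 * math.log10(mean_power(speech) / mean_power(scaled))
noisy = mix_at_snr(speech, noise, eval_snr_db)
```

Measuring the SNR of the rescaled noise against the clean speech, as done in the last lines, is a quick way to confirm that the mixture has the intended level.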

When training the artificial neural network according to the present invention and the other artificial neural networks used for comparison, most settings were kept the same. First, the frame length and frame shift were set to 32 ms and 16 ms, respectively, settings commonly used in preprocessing algorithms such as noise or echo cancellation for signals sampled at 16 kHz in speech recognition or voice communication environments. All deep neural network models were trained with the Adam algorithm, and the deep neural networks of the preprocessing part were trained for 30 epochs with a mini-batch size of 256. The initial learning rate was set to 0.0001 and dropout to 10%.
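The frame settings above translate into sample counts as shown in this short sketch. The helper names are hypothetical, and dropping any trailing partial frame is an assumption.

```python
def frame_params(sample_rate_hz, frame_ms, shift_ms):
    """Convert frame length and frame shift from milliseconds to samples."""
    return sample_rate_hz * frame_ms // 1000, sample_rate_hz * shift_ms // 1000


def split_frames(signal, frame_len, hop):
    """Split a signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len + 1, hop)]


frame_len, hop = frame_params(16000, 32, 16)   # 16 kHz, 32 ms frames, 16 ms shift
one_second = [0.0] * 16000
frames = split_frames(one_second, frame_len, hop)
```

At 16 kHz these settings give 512-sample frames with 50% overlap, so one second of audio produces 61 full frames.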

To evaluate each artificial neural network model, the results for 184 utterances per noise type were analyzed using the utterances included in the evaluation dataset. Perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and echo return loss enhancement (ERLE) were used as metrics, and scores were measured separately for segments in which speech and echo are present simultaneously and segments in which only echo is present. The networks used for comparison were stacked-DNN and CRN, which are prior-art preprocessing algorithms related to the present invention that use deep neural networks.

PESQ takes values between -0.5 and 4.5, and STOI takes values between 0 and 1. ERLE has no fixed range; a higher score means that the echo was removed more effectively.
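The text does not give a formula for ERLE. A commonly used definition, assumed here, is the ratio in dB between the power of the microphone signal and the power of the residual after cancellation, measured over echo-only segments. The function name and the toy signals below are hypothetical.

```python
import math


def erle_db(mic, residual):
    """ERLE in dB: 10*log10(sum(mic^2) / sum(residual^2)).

    Both sequences should cover the same echo-only segment.
    """
    p_mic = sum(v * v for v in mic)
    p_res = sum(v * v for v in residual)
    return 10 * math.log10(p_mic / p_res)


# Toy example (hypothetical): the residual has one tenth of the amplitude.
mic_segment = [1.0, -1.0, 1.0, -1.0]
residual_segment = [0.1, -0.1, 0.1, -0.1]
score = erle_db(mic_segment, residual_segment)
```

A tenfold reduction in amplitude corresponds to a hundredfold reduction in power, so this example gives about 20 dB.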

Looking first at PESQ, which evaluates speech quality, every preprocessing algorithm based on a deep neural network improves speech quality in all noise environments compared with the unprocessed case, and among them the method proposed in the present invention achieves the highest score.

Turning to ERLE, the method proposed in the present invention shows the highest echo cancellation performance in echo-only segments in most noise environments. For bus and car noise, the stacked-DNN algorithm scores higher; however, when this is considered together with the scores in the speech segments, it can be seen that stacked-DNN removes more of the echo signal at the cost of some speech distortion.

For STOI, which evaluates speech intelligibility, the method proposed in the present invention also shows the highest performance in all environments. Taking the three objective metrics together, the scores are considerably improved over the prior art in almost all noise environments.

Therefore, when speech is collected through a microphone and processed in an environment where background noise and echo are present, the integrated noise and echo cancellation learning technique according to the present invention has the advantage that the user's speech can be estimated and extracted more accurately.

As described above, the embodiments are a speech enhancement technology that can yield better performance by removing noise and echo before speech recognition or voice communication is performed, and they may also be applied to improve voice call quality on mobile phone terminals or in services such as Voice Talk. In addition, speech recognition is now performed on a variety of Internet of Things (IoT) devices, not only in quiet environments but also in environments with ambient noise, and when sound is played through the speaker of an IoT device, that sound can re-enter the microphone and produce an echo. Removing noise and echo before speech recognition can therefore improve the performance of speech recognition on IoT devices. Furthermore, because the embodiments provide a high-quality enhanced speech signal, they can be applied to various voice communication technologies to provide clean speech.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known to and usable by those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents. Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims set out below.

100: integrated echo and noise cancellation apparatus
110: feature vector extraction unit
111: Fourier transform unit
112: LPS extraction unit
120: first artificial neural network
130: second artificial neural network
140: third artificial neural network
150: speech synthesis unit

Claims

An integrated echo and noise cancellation apparatus using a deep neural network, comprising:
a feature vector extraction unit that receives a microphone input signal and a far-end signal and extracts a first feature vector for the microphone input signal and a second feature vector for the far-end signal;
a pre-trained first artificial neural network that takes the first feature vector and the second feature vector as input information and outputs, as output information, a first echo signal and a first noise signal that are primary estimates of the echo signal and the noise signal contained in the microphone input signal;
a pre-trained second artificial neural network that takes the first feature vector, the second feature vector, the first echo signal, and the first noise signal as input information and outputs, as output information, a second echo signal and a second noise signal that are secondary estimates of the echo signal and the noise signal;
a pre-trained third artificial neural network that takes the first feature vector, the second feature vector, the second echo signal, and the second noise signal as input information and outputs, as output information, an optimal gain for removing the echo signal and the noise signal from the microphone input signal; and
a speech synthesis unit that uses the optimal gain to output a final estimated speech signal in which the echo signal and the noise signal have been removed from the microphone input signal,
wherein the second artificial neural network comprises a second echo signal estimation artificial neural network that outputs the second echo signal and a second noise signal estimation artificial neural network that outputs the second noise signal, and
wherein, when estimating the second echo signal, the second echo signal estimation artificial neural network is trained using an adversarial learning technique so that the second noise signal does not affect the estimation of the second echo signal.

(Deleted)

The integrated echo and noise cancellation apparatus using a deep neural network according to claim 1,
wherein the loss function of the second echo signal estimation artificial neural network comprises
a first loss function based on the difference between the second echo signal and a first reference signal, and a second loss function based on the difference between the second noise signal and a second reference signal.

The integrated echo and noise cancellation apparatus using a deep neural network according to claim 4,
wherein the second echo signal estimation artificial neural network
is trained to minimize the value of the first loss function and is trained to maximize the value of the second loss function.

The integrated echo and noise cancellation apparatus using a deep neural network according to claim 1,
wherein, when estimating the second noise signal, the second noise signal estimation artificial neural network
is trained using an adversarial learning technique so that the second echo signal does not affect the estimation of the second noise signal.

The integrated echo and noise cancellation apparatus using a deep neural network according to claim 6,
wherein the loss function of the second noise signal estimation artificial neural network comprises
a third loss function based on the difference between the second noise signal and a third reference signal, and a fourth loss function based on the difference between the second echo signal and a fourth reference signal.

The integrated echo and noise cancellation apparatus using a deep neural network according to claim 7,
wherein the second noise signal estimation artificial neural network
is trained to minimize the value of the third loss function and is trained to maximize the value of the fourth loss function.

The integrated echo and noise cancellation apparatus using a deep neural network according to claim 7,
wherein the feature vector extraction unit
performs a short-time Fourier transform (STFT) to convert the time-domain microphone input signal and far-end speaker signal into frequency-domain signals, and uses the log power spectrum (LPS) of the converted frequency-domain signals as a feature vector.

The integrated echo and noise cancellation apparatus using a deep neural network according to claim 7,
wherein the third artificial neural network
estimates a continuous optimal gain through regression and uses the mean squared error (MSE) as its objective function, the third artificial neural network being trained so as to minimize the difference between the integrated noise and echo cancellation gain, which is the target feature vector, and the optimal gain estimated by the third artificial neural network.

The integrated echo and noise cancellation apparatus using a deep neural network according to claim 1,
wherein the first artificial neural network comprises
a first echo signal estimation artificial neural network that outputs the first echo signal, and a first noise signal estimation artificial neural network that is configured independently of the first echo signal estimation artificial neural network and outputs the first noise signal.

(Deleted)

An integrated echo and noise cancellation method using a deep neural network, comprising:
a feature vector extraction step of extracting, from a speech signal including a microphone input signal and a far-end signal, a first feature vector for the microphone input signal and a second feature vector for the far-end signal;
outputting, using a pre-trained first artificial neural network and based on the first feature vector and the second feature vector, a first echo signal and a first noise signal that are primary estimates of the echo signal and the noise signal contained in the microphone input signal;
outputting, using a pre-trained second artificial neural network and based on the first feature vector, the second feature vector, the first echo signal, and the first noise signal, a second echo signal and a second noise signal that are secondary estimates of the echo signal and the noise signal;
outputting, using a pre-trained third artificial neural network and based on the first feature vector, the second feature vector, the second echo signal, and the second noise signal, an optimal gain for removing the echo signal and the noise signal from the speech signal; and
outputting, using the optimal gain, a final estimated speech signal in which the echo signal and the noise signal have been removed from the microphone input signal,
wherein the second artificial neural network comprises a second echo signal estimation artificial neural network that outputs the second echo signal and a second noise signal estimation artificial neural network that outputs the second noise signal, and
wherein the estimating of the second echo signal comprises using an adversarial learning technique so that the second noise signal does not affect the estimation of the second echo signal.

(Deleted)

The integrated echo and noise cancellation method using a deep neural network according to claim 13,
wherein the estimating of the second noise signal
comprises using an adversarial learning technique so that the second echo signal does not affect the estimation of the second noise signal.