KR102275656B1

KR102275656B1 - Method and apparatus for robust speech enhancement training using adversarial training

Info

Publication number: KR102275656B1
Application number: KR1020190119037A
Authority: KR
Inventors: 김남수; 배수현; 최인규; 김형용; 김석민; 나선필
Original assignee: 국방과학연구소
Priority date: 2019-09-26
Filing date: 2019-09-26
Publication date: 2021-07-09
Also published as: KR20210036692A

Abstract

A robust speech enhancement training method using an adversarial training model is disclosed. The method includes: extracting a feature vector from noisy speech; extracting a latent variable by feeding the extracted feature vector to a first artificial neural network; performing a first operation of outputting an estimated speech feature vector by feeding the extracted latent variable to a second artificial neural network, and a second operation of outputting an estimated noise feature vector by feeding the extracted latent variable to a third artificial neural network, the first and second operations being trained adversarially against each other; outputting a latent variable from which the noise component has been removed by the training; and generating restored speech based on the output latent variable.

Description

Robust speech enhancement training method and apparatus using an adversarial training model {METHOD AND APPARATUS FOR ROBUST SPEECH ENHANCEMENT TRAINING USING ADVERSARIAL TRAINING}

The present invention relates to a robust speech enhancement training method using an adversarial training model and an apparatus therefor, and more particularly to a robust speech enhancement training method and apparatus capable of improving speech quality by removing noise from noisy speech.

Speech enhancement is a technique for estimating clean speech from noisy speech. It is an important technology for a wide range of speech applications: in voice communication it helps improve intelligibility, and in speech recognition and similar tasks it serves as a preprocessing stage.

Early work relied heavily on statistical methods that estimate the noise during non-speech segments (segments containing only noise) and remove noise based on that estimate. However, the performance of these techniques tends to degrade when the noise is non-stationary (changes over time) or when the speech is heavily corrupted (low signal-to-noise ratio, SNR).

Recently, with the advance of deep learning, a variety of deep learning techniques have been applied to speech enhancement. In deep-learning-based speech enhancement, a model is trained with noisy speech as the input and the clean speech, before the noise was mixed in, as the target. This is a typical case of training a regression model.

In other words, existing deep learning speech enhancement techniques have focused only on designing a model that takes noisy speech as input and estimates the matching clean speech. How the input is represented in the hidden layers in the middle of the deep learning model has not been studied.

An object of the present invention is to provide a speech enhancement training method using an adversarial training model, and an apparatus therefor, in order to perform speech enhancement effectively.

According to an embodiment of the present invention, a robust speech enhancement training method using an adversarial training model includes: extracting a feature vector from noisy speech; extracting a latent variable by feeding the extracted feature vector to a first artificial neural network; performing a first operation of outputting an estimated speech feature vector by feeding the extracted latent variable to a second artificial neural network, and a second operation of outputting an estimated noise feature vector by feeding the extracted latent variable to a third artificial neural network, the first and second operations being trained adversarially against each other; outputting a latent variable from which the noise component has been removed by the training; and generating restored speech based on the output latent variable.

Here, during back-propagation in the third artificial neural network, the adversarial training may invert the sign of the gradient through a gradient reversal layer so that noise characteristics are removed from the extracted latent variable.

In addition, the first operation may estimate the original speech by decoding the extracted latent variable, and may output, from the speech feature vector, the magnitude spectrum of the estimated original speech.

According to an embodiment of the present invention, a speech enhancement training apparatus using an adversarial training model includes: a feature vector extractor that extracts a feature vector from noisy speech; an encoder that extracts a latent variable by feeding the extracted feature vector to a first artificial neural network; a decoder that outputs an estimated speech feature vector by feeding the extracted latent variable to a second artificial neural network; a noise latent variable removal unit that outputs an estimated noise feature vector by feeding the extracted latent variable to a third artificial neural network; and a speech restoration unit that generates restored speech. The decoder and the noise latent variable removal unit are trained adversarially against each other to output a latent variable from which the noise component has been removed, and the speech restoration unit may generate the restored speech based on the output latent variable.

Here, during back-propagation in the third artificial neural network, the adversarial training may invert the sign of the gradient through a gradient reversal layer so that noise characteristics are removed from the extracted latent variable.

According to an embodiment of the present invention, there is provided a computer-readable recording medium storing a computer program. The computer program may include instructions that, when executed by a processor, cause the processor to perform a speech enhancement training method including: extracting a feature vector from noisy speech; extracting a latent variable by feeding the extracted feature vector to a first artificial neural network; performing a first operation of outputting an estimated speech feature vector by feeding the extracted latent variable to a second artificial neural network, and a second operation of outputting an estimated noise feature vector by feeding the extracted latent variable to a third artificial neural network, the first and second operations being trained adversarially against each other; outputting a latent variable from which the noise component has been removed by the training; and generating restored speech based on the output latent variable.

According to an embodiment of the present invention, there is provided a computer program stored on a computer-readable recording medium. The computer program may include instructions that, when executed by a processor, cause the processor to perform a speech enhancement training method including: extracting a feature vector from noisy speech; extracting a latent variable by feeding the extracted feature vector to a first artificial neural network; performing a first operation of outputting an estimated speech feature vector by feeding the extracted latent variable to a second artificial neural network, and a second operation of outputting an estimated noise feature vector by feeding the extracted latent variable to a third artificial neural network, the first and second operations being trained adversarially against each other; outputting a latent variable from which the noise component has been removed by the training; and generating restored speech based on the output latent variable.

By directly accessing the latent variable of the noisy speech in the intermediate hidden layer of the deep learning model, which existing speech enhancement models have not done, the noise features can be removed and only the features of the original speech retained, so speech enhancement performance can be improved significantly.

FIG. 1 is a flowchart briefly illustrating the process of a speech enhancement training method according to an embodiment of the present invention.
FIG. 2 is a block diagram briefly showing the configuration of a speech enhancement training apparatus according to an embodiment of the present invention.

First, the terms used in this specification and the claims are general terms selected in consideration of their functions in the various embodiments of the present invention. However, these terms may vary depending on the intention of a person skilled in the art, legal or technical interpretation, the emergence of new technology, and the like. Some terms may also have been chosen arbitrarily by the applicant. Such terms may be interpreted according to the meaning defined in this specification and, where no specific definition is given, may be interpreted based on the overall content of this specification and common technical knowledge in the art.

In addition, the same reference numerals or symbols in the drawings attached to this specification denote parts or components that perform substantially the same function. For convenience of description and understanding, the same reference numerals or symbols are also used across different embodiments. That is, even if components with the same reference numeral appear in several drawings, those drawings do not necessarily represent a single embodiment.

In this specification and the claims, terms containing ordinals such as 'first' and 'second' may be used to distinguish components from one another. Such ordinals are used to tell apart identical or similar components, and their use should not be read as limiting the meaning of a term. For example, the order of use or arrangement of a component combined with an ordinal should not be construed as limited by that number. Where necessary, the ordinals may be used interchangeably.

In this specification, a singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as 'include' or 'comprise' are intended to specify the presence of a feature, number, step, operation, component, part, or combination thereof described in the specification, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In the embodiments of the present invention, when a part is said to be connected to another part, this includes not only a direct connection but also an indirect connection through another medium. Likewise, saying that a part includes a component means that it may further include other components, not that it excludes them, unless specifically stated otherwise.

Hereinafter, the present invention is described in more detail with reference to the accompanying drawings.

FIG. 1 is a flowchart briefly illustrating the process of a speech enhancement training method according to an embodiment of the present invention.

Referring to FIG. 1, in the robust speech enhancement training method using an adversarial training model, a feature vector is first extracted from speech mixed with noise (hereinafter, noisy speech) (S110).

Specifically, the input noisy speech is divided into frame signals of a fixed duration, and each frame signal is converted into a frequency-domain signal to obtain the magnitude spectrum, which is a speech feature vector.

For example, the noisy speech may be divided into frame signals of 16 to 32 ms each.
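The framing and frequency-domain conversion described above can be sketched as follows. This is a minimal illustration, not the implementation used in the patent: the 16 kHz sampling rate, the 32 ms frame with a 16 ms hop, the Hann window, and the function name `magnitude_spectrogram` are all assumptions chosen for the example.

```python
import numpy as np

def magnitude_spectrogram(signal, sample_rate=16000, frame_ms=32, hop_ms=16):
    """Split a waveform into overlapping frames and return (magnitude, phase) per frame.

    Assumed settings (not from the patent): Hann window, 50% overlap.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(signal) < frame_len:
        # Zero-pad very short inputs so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)      # one-sided FFT of each frame
    return np.abs(spectrum), np.angle(spectrum)

# Toy noisy input: a 440 Hz tone plus white noise, one second long.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
noisy = np.sin(2 * np.pi * 440 * t) + 0.3 * rng.standard_normal(16000)
mag, phase = magnitude_spectrogram(noisy)
print(mag.shape)  # (frames, frequency bins)
```

The phase is returned alongside the magnitude only because a later time-domain reconstruction needs some phase estimate; the method described here operates on the magnitude spectrum.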

For example, the obtained magnitude spectrum may itself be used as the feature vector, and the feature vector may also include features derived from it, such as pitch-related and timbre-related feature vectors.

The feature vector is typically computed using short-time Fourier transform (STFT) analysis or the like. Examples of feature vectors include mel-frequency cepstral coefficients (MFCC), spectral rolloff, spectral flux, and autocorrelation coefficients.

MFCCs are cepstral-domain coefficients obtained by dividing the spectrum of the audio signal into sub-bands with a mel-frequency filter bank that reflects auditory characteristics, taking the logarithm of the sub-band energies, and applying a discrete cosine transform (DCT). They are widely used in speech signal processing.

Spectral rolloff is the frequency below which 85% of the energy is distributed, counting up from the low-frequency end of the spectrum. Together with the spectral centroid, it is a feature vector that characterizes the distribution in the frequency domain.
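As a rough illustration of these two features, the sketch below computes the rolloff frequency and the centroid for a single one-sided magnitude frame. The function names and the assumption of an even FFT length are illustrative choices, not part of the patent.

```python
import numpy as np

def spectral_rolloff(mag_frame, sample_rate, fraction=0.85):
    """Frequency (Hz) below which `fraction` of the frame's energy lies."""
    energy = mag_frame ** 2
    cumulative = np.cumsum(energy)
    # First bin whose cumulative energy reaches the requested fraction.
    k = int(np.searchsorted(cumulative, fraction * cumulative[-1]))
    n_fft = 2 * (len(mag_frame) - 1)  # assumes an even FFT length
    return k * sample_rate / n_fft

def spectral_centroid(mag_frame, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of the frame."""
    n_fft = 2 * (len(mag_frame) - 1)
    freqs = np.arange(len(mag_frame)) * sample_rate / n_fft
    total = mag_frame.sum()
    return float((freqs * mag_frame).sum() / total) if total > 0 else 0.0

flat = np.ones(257)  # flat spectrum from a 512-point FFT
print(spectral_rolloff(flat, 16000), spectral_centroid(flat, 16000))
```

For a flat spectrum the rolloff lands at about 85% of the Nyquist frequency and the centroid at half of it, which is a quick sanity check on both definitions.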

Spectral flux expresses, for each frequency bin, the degree of change along the time axis; it is a parameter that measures local changes in the spectrum.
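One common way to turn this into a per-frame number is the Euclidean norm of the frame-to-frame magnitude difference. The patent does not fix a formula, so the definition below is an assumption used only for illustration.

```python
import numpy as np

def spectral_flux(mag):
    """L2 norm of the change between consecutive magnitude frames.

    `mag` has shape (frames, bins); the result has shape (frames - 1,).
    """
    diff = np.diff(mag, axis=0)           # per-bin change along the time axis
    return np.sqrt((diff ** 2).sum(axis=1))

mag = np.array([[1.0, 1.0],
                [1.0, 1.0],
                [4.0, 5.0]])
print(spectral_flux(mag))  # [0. 5.]
```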

The autocorrelation coefficients express the spectral distribution of the signal in the time domain; for example, the 1st through 12th coefficients may be used.
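A minimal sketch of computing the 1st through 12th autocorrelation coefficients of one frame follows. Removing the mean and normalizing by the lag-0 value are assumptions made here for illustration; the patent does not specify them.

```python
import numpy as np

def autocorrelation_coefficients(frame, max_lag=12):
    """Normalized autocorrelation of a frame at lags 1..max_lag."""
    frame = frame - frame.mean()
    r0 = np.dot(frame, frame)             # lag-0 energy used for normalization
    if r0 == 0:
        return np.zeros(max_lag)
    return np.array([
        np.dot(frame[:-lag], frame[lag:]) / r0
        for lag in range(1, max_lag + 1)
    ])

# An alternating +1/-1 frame is strongly anti-correlated at odd lags.
frame = np.tile([1.0, -1.0], 32)
coeffs = autocorrelation_coefficients(frame)
print(coeffs[:2])
```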

Next, the extracted feature vector is fed to the encoder artificial neural network (first artificial neural network) to extract a latent variable (latent feature) (S120). A latent variable is a hidden variable of the multi-layer artificial neural network that makes up the adversarial training model.

Next, a first operation is performed in which the extracted latent variable is fed to the decoder artificial neural network (second artificial neural network), which estimates speech, and an estimated speech feature vector is output (S130). In this first operation, the decoder artificial neural network receives the latent variable and outputs an estimated feature vector of the original speech.

Alongside this, a second operation is performed in which the extracted latent variable is fed to the noise estimation artificial neural network (third artificial neural network) and an estimated noise feature vector is output (S140). Here, a gradient reversal layer (GRL) is used so that training proceeds in the direction of removing the noise component from the latent variable.

The first operation and the second operation are trained against each other based on the adversarial training model (S150).

An adversarial training model is one of the representative unsupervised learning models. It refers to an artificial neural network model in which a generative model, which tries to generate data as close to real data as possible, and a discriminative model, which tries to detect the fake data, are trained against each other and improve competitively.

In an adversarial training model, the discriminative model is trained first and then the generative model is trained, and this process is repeated. A representative example is the generative adversarial network (GAN).

Such a GAN is often likened to a counterfeiter (corresponding to the generative model) trying to fool a counterfeit examiner (corresponding to the discriminative model), while the examiner tries to tell the counterfeiter's forged banknotes apart from genuine ones.

With a generative adversarial network (GAN), the speech enhancement training model can be updated continuously to become more accurate, improving its performance.

In the present invention, an adversarial training model means that, when training artificial neural networks, two models are trained against each other: one in the direction of maximizing an objective function and the other in the direction of minimizing it. That is, taking as input a latent variable in which speech and noise features are mixed, the decoder artificial neural network is trained to make its output as close as possible to the target speech, and the noise estimation artificial neural network is trained to make its output as close as possible to the original noise. However, because the GRL flips the sign of the error during back-propagation, training ultimately proceeds so that the noise can be estimated as poorly as possible.
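One way to write this opposing pair of objectives, offered as a sketch rather than as the patent's own formulation, is shown below. The symbols are assumptions introduced for illustration: $E$, $D$, and $N$ denote the encoder, decoder, and noise estimation networks with parameters $\theta_E$, $\theta_D$, $\theta_N$; $x$, $s$, and $n$ are the noisy, clean-speech, and noise feature vectors; the squared-error losses and the weight $\lambda > 0$ are assumed.

```latex
\begin{aligned}
\mathcal{L}_s(\theta_E, \theta_D) &= \lVert D(E(x)) - s \rVert^2, \qquad
\mathcal{L}_n(\theta_E, \theta_N) = \lVert N(E(x)) - n \rVert^2, \\
\hat{\theta}_N &= \arg\min_{\theta_N} \; \mathcal{L}_n(\theta_E, \theta_N), \\
(\hat{\theta}_E, \hat{\theta}_D) &= \arg\min_{\theta_E, \theta_D} \;
  \mathcal{L}_s(\theta_E, \theta_D) - \lambda \, \mathcal{L}_n(\theta_E, \theta_N).
\end{aligned}
```

Under these assumptions, a gradient reversal layer placed between $E$ and $N$ realizes both lines in a single backward pass: it acts as the identity in the forward direction and multiplies the gradient by $-\lambda$ on the way back, so $\theta_N$ descends on $\mathcal{L}_n$ while $\theta_E$ ascends on it.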

In the first and second operations, the adversarial training works as follows: during back-propagation in the third artificial neural network, the sign of the gradient is inverted by the gradient reversal layer, so that training removes the noise component from the extracted latent variable.

That is, because the latent variable output by the encoder passes through the gradient reversal layer, back-propagation drives training in a direction that makes the noises hard to distinguish from one another, and the noise characteristics are removed.

Accordingly, the second artificial neural network is trained to estimate the clean original speech well, while the third artificial neural network is trained in a direction in which the noise cannot be estimated well, so that the two kinds of training are carried out adversarially.

As a result, the encoder is trained to extract a latent variable from which the noise component has been removed and in which only the speech component remains.
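The effect of the gradient reversal layer on the encoder update can be sketched with a toy model. Everything in this example is an assumption made for illustration: the three networks are reduced to single linear maps (`We`, `Wd`, `Wn`), the losses are mean squared errors, the weight `lam` and learning rate are arbitrary, and the data are random stand-ins for magnitude features. It is not the architecture described in the patent.

```python
import numpy as np

def grl_backward(grad, lam):
    """Backward pass of a gradient reversal layer: flip the sign, scale by lam."""
    return -lam * grad

def gradients(params, x, speech, noise, lam):
    """Forward and backward pass of a linear encoder / decoder / noise estimator."""
    We, Wd, Wn = params["We"], params["Wd"], params["Wn"]
    z = x @ We                      # encoder: latent variable
    s_hat = z @ Wd                  # decoder: estimated speech features
    n_hat = z @ Wn                  # noise estimator (GRL is identity going forward)
    d_s = 2.0 * (s_hat - speech) / speech.size   # d(MSE speech) / d(s_hat)
    d_n = 2.0 * (n_hat - noise) / noise.size     # d(MSE noise) / d(n_hat)
    dz_speech = d_s @ Wd.T
    dz_noise = grl_backward(d_n @ Wn.T, lam)     # reversed before reaching the encoder
    grads = {
        "We": x.T @ (dz_speech + dz_noise),
        "Wd": z.T @ d_s,
        "Wn": z.T @ d_n,            # the noise head itself still minimizes its loss
    }
    losses = (np.mean((s_hat - speech) ** 2), np.mean((n_hat - noise) ** 2))
    return grads, losses

def train_step(params, x, speech, noise, lam=0.1, lr=0.01):
    grads, losses = gradients(params, x, speech, noise, lam)
    for name in params:
        params[name] -= lr * grads[name]
    return losses

rng = np.random.default_rng(0)
dim, latent = 8, 4
params = {
    "We": 0.1 * rng.standard_normal((dim, latent)),
    "Wd": 0.1 * rng.standard_normal((latent, dim)),
    "Wn": 0.1 * rng.standard_normal((latent, dim)),
}
speech = np.abs(rng.standard_normal((32, dim)))   # toy "clean" features
noise = np.abs(rng.standard_normal((32, dim)))    # toy "noise" features
x = speech + noise                                # toy "noisy" features
for _ in range(100):
    speech_loss, noise_loss = train_step(params, x, speech, noise)
print(speech_loss, noise_loss)
```

In this sketch the noise estimator and the decoder both receive ordinary gradients; only the path from the noise loss back into the encoder is reversed, which is what pushes the encoder toward latent variables that carry little information about the noise.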

Here, the first operation may decode the extracted latent variable to estimate the denoised speech, and may output the magnitude spectrum of the estimated speech as the speech feature vector.

Once training is complete, in the actual application stage the encoder artificial neural network outputs a latent variable from which the noise component has been removed and in which the speech component remains (S160). The decoder artificial neural network may then be used to output the speech magnitude spectrum.

Finally, the output estimated speech magnitude spectrum is converted back into a time-domain speech signal to generate the restored speech (S170).
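A magnitude spectrum alone does not determine a waveform, so the restoration step needs a phase estimate. The sketch below assumes the phase of the noisy input is reused, which is a common choice but is not stated in the patent, and resynthesizes the signal by windowed overlap-add. The function name, the Hann window, and the frame and hop lengths are likewise assumptions.

```python
import numpy as np

def reconstruct_waveform(est_mag, phase, frame_len=512, hop_len=256):
    """Inverse-FFT each frame and overlap-add with a Hann synthesis window.

    Assumes the analysis stage used the same Hann window and hop length.
    """
    spectrum = est_mag * np.exp(1j * phase)
    frames = np.fft.irfft(spectrum, n=frame_len, axis=1)
    window = np.hanning(frame_len)
    out_len = hop_len * (len(frames) - 1) + frame_len
    output = np.zeros(out_len)
    norm = np.zeros(out_len)
    for i, frame in enumerate(frames):
        start = i * hop_len
        output[start:start + frame_len] += frame * window
        norm[start:start + frame_len] += window ** 2
    # Divide by the accumulated squared window to undo the double windowing.
    return output / np.maximum(norm, 1e-8)

# Round trip on random data; the "estimated" magnitude is just the input's own.
rng = np.random.default_rng(0)
signal = rng.standard_normal(4096)
frame_len, hop_len = 512, 256
window = np.hanning(frame_len)
n_frames = 1 + (len(signal) - frame_len) // hop_len
frames = np.stack([
    signal[i * hop_len : i * hop_len + frame_len] * window
    for i in range(n_frames)
])
noisy_spec = np.fft.rfft(frames, axis=1)
est_mag = np.abs(noisy_spec)          # stand-in for the decoder output
rebuilt = reconstruct_waveform(est_mag, np.angle(noisy_spec), frame_len, hop_len)
```

In a real pipeline `est_mag` would come from the decoder, so the rebuilt signal would differ from the input; the round trip here only checks that the synthesis step is consistent with the analysis step away from the signal edges.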

FIG. 2 is a block diagram briefly showing the configuration of a speech enhancement training apparatus according to an embodiment of the present invention.

The speech enhancement training apparatus 100 according to an embodiment of the present invention includes a feature vector extractor 110, an encoder 120, a decoder 130, a noise latent variable removal unit 140, and a speech restoration unit 150.

The speech enhancement training apparatus 100 is a kind of computing device and may include a processor (not shown) capable of processing data. The processor may include hardware components such as a micro processing unit (MPU), a central processing unit (CPU), a graphics processing unit (GPU), a neural processing unit (NPU) or tensor processing unit (TPU), cache memory, and a data bus. It may further include software components such as an operating system and applications that perform specific purposes.

The feature vector extractor 110 is a component that prepares speech mixed with noise (hereinafter, noisy speech) for use as the input of a deep learning model.

Specifically, the feature vector extractor 110 may divide the noisy speech input to the speech enhancement training apparatus 100 into frame signals of a fixed duration and convert each frame signal into a frequency-domain signal to generate the magnitude spectrum, which is a speech feature vector.

The encoder 120 is a component that implements the encoder artificial neural network (first artificial neural network), which receives the generated magnitude spectrum from the feature vector extractor 110 and outputs a latent variable. The encoder 120 is also called a recognition network; put simply, it converts the input into an internal representation.

The decoder 130 receives the latent variable output by the encoder 120 and decodes it to estimate the denoised speech. The decoder 130 may implement the decoder artificial neural network (second artificial neural network), which can output the magnitude spectrum of the estimated original speech.

The decoder 130 is also called a generative network; put simply, it converts the internal representation into an output. The decoder 130 may compute which frame's latent variable to attend to and estimate the speech feature vector according to the resulting attention weights.

The noise latent variable removal unit 140 is a component that implements the noise estimation artificial neural network (third artificial neural network), which receives the latent variable output by the encoder 120 and outputs the magnitude spectrum of the estimated noise.

여기서, 디코더(130)와 잡음잠재변수 제거부(140)는 서로 적대적 학습을 통해 인공 신경망을 훈련시킬 수 있다.Here, the decoder 130 and the noise latent variable removal unit 140 may train the artificial neural network through adversarial learning.

구체적으로, 디코더(130)는 입력되는 잠재변수로부터 깨끗한 음성, 즉 원음을 추정하려고 하고, 잡음잠재변수 제거부(140)는 입력되는 잠재변수로부터 잡음을 추정하려고 한다.Specifically, the decoder 130 tries to estimate a clean voice, that is, an original sound from the input latent variable, and the noise latent variable remover 140 tries to estimate the noise from the input latent variable.

즉, 디코더(130)는 생성 모델로서, 원음과 최대한 비슷한 음성을 만들도록 학습되며, 잡음잠재변수 제거부(140)는 입력되는 잡음들 간의 특성을 최대한 구별하지 못하게 학습되면서, 서로 적대적으로 학습하게 된다.That is, as a generative model, the decoder 130 is trained to make a voice that is as similar to the original sound as possible, and the noise latent variable remover 140 is learned not to distinguish characteristics between input noises as much as possible, while learning to be hostile to each other. do.

이때, 잡음잠재변수 제거부(140)의 잡음 추정 인공 신경망은 그레이디언트 반전 레이어를 포함하고, 이 그레이디언트 반전 레이어를 통해 역전파(back-propagation) 과정에서 그레이디언트의 부호를 반대로 변환하게 된다.In this case, the noise estimation artificial neural network of the noise latent variable removal unit 140 includes a gradient reversal layer, and reversely converts the sign of the gradient in the back-propagation process through the gradient reversal layer. will do

즉, 잡음잠재변수 제거부(140)의 잡음 추정 인공 신경망은 잡음 성분을 잘 추정하지 않는 방향으로 훈련되며, 결과적으로 입력된 잠재변수가 잡음 성분을 가지지 않게 된다.That is, the noise estimation artificial neural network of the noise latent variable removal unit 140 is trained in a direction that does not estimate the noise component well, and as a result, the input latent variable does not have a noise component.

Through this adversarial learning, the encoder 120 learns and outputs a latent variable characterized in that the noise components, that is, everything other than the original sound component, are removed.
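The effect of the adversarial learning on the encoder can be made concrete with a scalar toy model. This is a sketch under stated assumptions, not the networks of the embodiment: the encoder, decoder, and noise estimator are each reduced to a single hypothetical weight (`a`, `b`, `c`), both losses are assumed to be squared errors, `lam` is a hypothetical weighting factor, and the sign reversal is assumed to apply to the gradient flowing from the noise estimator back into the encoder.

```python
from typing import Dict


def adversarial_gradients(a: float, b: float, c: float, x: float,
                          clean: float, noise: float,
                          lam: float = 1.0) -> Dict[str, float]:
    """Gradients for a scalar toy model: z = a*x, speech = b*z, noise_hat = c*z."""
    z = a * x                   # encoder output (latent variable)
    speech_err = b * z - clean  # decoder error, speech loss = 0.5 * speech_err**2
    noise_err = c * z - noise   # estimator error, noise loss = 0.5 * noise_err**2

    # Each head minimizes its own loss with an ordinary gradient.
    grad_decoder = speech_err * z
    grad_noise_estimator = noise_err * z

    # Gradients of the two losses with respect to the latent variable z.
    grad_z_speech = speech_err * b
    grad_z_noise = noise_err * c

    # The noise-branch gradient is sign-reversed before it reaches the encoder,
    # so the encoder is pushed toward latents that serve the decoder but not
    # the noise estimator.
    grad_encoder = (grad_z_speech - lam * grad_z_noise) * x

    return {
        "encoder": grad_encoder,
        "decoder": grad_decoder,
        "noise_estimator": grad_noise_estimator,
    }


grads = adversarial_gradients(a=1.0, b=1.0, c=1.0, x=2.0,
                              clean=1.0, noise=1.0, lam=0.5)
```

With `lam` set to 0 the noise branch has no influence on the encoder, which reduces the toy model to an ordinary encoder-decoder.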

The latent variable learned in this way and output from the encoder 120 is passed through the decoder 130, which outputs a magnitude spectrum in which only the characteristics of the original sound are preserved.

After this adversarial training is finished, the operation of the noise latent variable removal unit 140 is omitted in the actual estimation stage, and speech enhancement can be performed using only the encoder 120 and the decoder 130.

That is, after training, only the generative model of the adversarial learning model is used.
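The split between the training path and the estimation path described above might be organized as follows. This is a minimal sketch: the class `EnhancementModel`, its method names, and the toy stand-in networks are hypothetical and are used only to show that the noise estimator is not touched once training is finished.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Network = Callable[[Vector], Vector]


class EnhancementModel:
    """Holds the three networks; only two of them are used after training."""

    def __init__(self, encoder: Network, decoder: Network,
                 noise_estimator: Optional[Network] = None) -> None:
        self.encoder = encoder
        self.decoder = decoder
        self.noise_estimator = noise_estimator

    def training_forward(self, features: Vector) -> Tuple[Vector, Vector]:
        # Training path: both heads read the same latent variable.
        if self.noise_estimator is None:
            raise ValueError("noise estimator is required during training")
        latent = self.encoder(features)
        return self.decoder(latent), self.noise_estimator(latent)

    def enhance(self, features: Vector) -> Vector:
        # Estimation path: encoder and decoder only.
        return self.decoder(self.encoder(features))


# Toy stand-ins for trained networks (illustrative only).
def toy_encoder(v: Vector) -> Vector:
    return [0.5 * x for x in v]


def toy_decoder(z: Vector) -> Vector:
    return [2.0 * x for x in z]


def toy_noise_estimator(z: Vector) -> Vector:
    return [x - 0.5 for x in z]


model = EnhancementModel(toy_encoder, toy_decoder)
enhanced = model.enhance([1.0, 4.0])
```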

The speech restoration unit 150 may generate reconstructed speech by converting the magnitude spectrum of the speech feature vector output from the decoder 130 back into a time-domain speech signal.
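Converting a magnitude spectrum back to the time domain requires a phase for each frequency bin, and this passage does not state where that phase comes from. The sketch below assumes it is supplied separately (a common choice, assumed here, is to reuse the phase of the noisy input) and restores a single frame with an inverse discrete Fourier transform. The function names `dft` and `restore_frame` are hypothetical, and the overlap-add across frames that a complete system would need is omitted.

```python
import cmath
import math
from typing import List


def dft(frame: List[float]) -> List[complex]:
    """Discrete Fourier transform of one frame (used here to obtain test spectra)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def restore_frame(magnitude: List[float], phase: List[float]) -> List[float]:
    """Rebuild a time-domain frame from a magnitude spectrum and a phase spectrum."""
    n = len(magnitude)
    # Combine magnitude and phase into complex spectral coefficients.
    spectrum = [cmath.rect(m, p) for m, p in zip(magnitude, phase)]
    # Inverse DFT; the imaginary part is discarded because the signal is real.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]


frame = [0.0, 1.0, 0.0, -1.0]
spectrum = dft(frame)
restored = restore_frame([abs(c) for c in spectrum],
                         [cmath.phase(c) for c in spectrum])
```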

According to the various embodiments described above, the latent variable of the noisy speech is accessed directly in the intermediate hidden layer of the deep learning model, an approach not proposed by existing methods, so that the noise characteristics are removed and only the speech characteristics remain. The decoder can therefore remove noise components more effectively than existing models, which improves speech enhancement performance.

The control method according to the various embodiments described above may be implemented as a program and stored in various recording media. That is, a computer program that is processed by various processors and can execute the noise removal method described above may be used while stored in a recording medium.

As an example, a non-transitory computer readable medium may be provided that stores a program performing the steps of: i) extracting a feature vector from noisy speech; ii) extracting a latent variable by using the extracted feature vector as an input to a first artificial neural network (e.g., an encoder artificial neural network); iii) performing a first operation of outputting an estimated speech feature vector by using the extracted latent variable as an input to a second artificial neural network (e.g., a decoder artificial neural network) and a second operation of outputting an estimated noise feature vector by using the extracted latent variable as an input to a third artificial neural network (e.g., a noise estimation artificial neural network), the first operation and the second operation being trained through adversarial learning against each other; iv) outputting a latent variable from which the noise component has been removed by the learning; and v) generating reconstructed speech based on the output latent variable.

A non-transitory readable medium is not a medium that stores data for a short moment, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, the various applications or programs described above may be stored and provided in a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB, memory card, or ROM.

Although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described. Various modifications may be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or prospect of the present invention.

100: speech enhancement training apparatus 110: feature vector extraction unit
120: encoder 130: decoder
140: noise latent variable removal unit 150: speech restoration unit

Claims

1. A robust speech enhancement training method using an adversarial training model, the method comprising:
extracting a feature vector from noisy speech;
extracting, by using the extracted feature vector as an input to a first artificial neural network, a first latent variable characterized in that noise components other than the original sound component are removed;
performing a first operation of outputting an estimated original speech feature vector by using the extracted first latent variable as an input to a second artificial neural network, and a second operation of outputting, by using the extracted first latent variable as an input to a third artificial neural network, a noise feature vector in which a noise component has been removed from the first latent variable, wherein the second artificial neural network and the third artificial neural network are trained through adversarial learning;
receiving, by the second artificial neural network for which the learning has been completed, the first latent variable, and outputting a second latent variable in which a noise component is removed from the first latent variable; and
generating reconstructed speech based on the output second latent variable.

2. The method of claim 1,
wherein, in the adversarial learning,
the sign of the gradient is reversed during back-propagation in the third artificial neural network through a gradient reversal layer included in the adversarial learning model, so that the extracted first latent variable is learned so as to remove noise characteristics.

3. The method of claim 1,
wherein the first operation comprises
estimating the original sound by decoding the extracted first latent variable, and outputting, from the original speech feature vector, a magnitude spectrum of the estimated original sound.

4. A speech enhancement training apparatus using an adversarial training model, the apparatus comprising:
a feature vector extraction unit configured to extract a feature vector from noisy speech;
an encoder configured to use the extracted feature vector as an input to a first artificial neural network and to extract a first latent variable characterized in that noise components other than the original sound component are removed;
a decoder configured to output an estimated original speech feature vector by using the extracted first latent variable as an input to a second artificial neural network;
a noise latent variable removal unit configured to use the extracted first latent variable as an input to a third artificial neural network and to output a noise feature vector in which a noise component is removed from the first latent variable; and
a speech restoration unit configured to generate reconstructed speech,
wherein the second artificial neural network and the third artificial neural network
are trained through adversarial learning,
wherein, in the decoder,
the second artificial neural network for which the learning has been completed receives the first latent variable and outputs a second latent variable in which the noise component is removed from the first latent variable, and
wherein the speech restoration unit
generates the reconstructed speech based on the output second latent variable.

5. The apparatus of claim 4,
wherein, in the adversarial learning,
the sign of the gradient is reversed during back-propagation in the third artificial neural network through a gradient reversal layer included in the adversarial learning model, so that the extracted first latent variable is learned so as to remove noise characteristics.

6. The apparatus of claim 5,
wherein the decoder
estimates the original sound by decoding the extracted first latent variable, and outputs, from the original speech feature vector, a magnitude spectrum of the estimated original sound.

7. A computer-readable recording medium storing a computer program,
wherein the computer program comprises instructions that, when executed by a processor, cause the processor to perform a speech enhancement training method comprising:
extracting a feature vector from noisy speech;
extracting, by using the extracted feature vector as an input to a first artificial neural network, a first latent variable characterized in that noise components other than the original sound component are removed;
performing a first operation of outputting an estimated original speech feature vector by using the extracted first latent variable as an input to a second artificial neural network, and a second operation of outputting, by using the extracted first latent variable as an input to a third artificial neural network, a noise feature vector in which a noise component has been removed from the first latent variable, wherein the second artificial neural network and the third artificial neural network are trained through adversarial learning;
receiving, by the second artificial neural network for which the learning has been completed, the first latent variable, and outputting a second latent variable in which a noise component is removed from the first latent variable; and
generating reconstructed speech based on the output second latent variable.

8. A computer program stored in a computer-readable recording medium,
wherein the computer program comprises instructions that, when executed by a processor, cause the processor to perform a speech enhancement training method comprising:
extracting a feature vector from noisy speech;
extracting, by using the extracted feature vector as an input to a first artificial neural network, a first latent variable characterized in that noise components other than the original sound component are removed;
performing a first operation of outputting an estimated original speech feature vector by using the extracted first latent variable as an input to a second artificial neural network, and a second operation of outputting, by using the extracted first latent variable as an input to a third artificial neural network, a noise feature vector in which a noise component has been removed from the first latent variable, wherein the second artificial neural network and the third artificial neural network are trained through adversarial learning;
receiving, by the second artificial neural network for which the learning has been completed, the first latent variable, and outputting a second latent variable in which a noise component is removed from the first latent variable; and
generating reconstructed speech based on the output second latent variable.