KR20210134195A

KR20210134195A - Method and apparatus for voice recognition using statistical uncertainty modeling

Info

Publication number: KR20210134195A
Application number: KR1020200052981A
Authority: KR
Inventors: 김남수; 이현승; 김민찬
Original assignee: 서울대학교산학협력단
Priority date: 2020-04-30
Filing date: 2020-04-30
Publication date: 2021-11-09
Also published as: KR102363636B1

Abstract

The present invention relates to a method and apparatus for speech recognition using statistical uncertainty modeling. More specifically, the speech recognition method comprises: (1) a step of constructing an uncertainty-aware training (UAT) model, based on a deep neural network (DNN), that estimates a phonetic target from input speech while taking into account the uncertainty of the input speech distorted by noise; and (2) a step of performing speech recognition using the UAT model constructed in step (1). The invention can thereby exhibit excellent speech recognition performance even when noise absent from the training data appears.

Description

Speech recognition method and apparatus using statistical uncertainty modeling

The present invention relates to a speech recognition method and apparatus, and more particularly, to a speech recognition method and apparatus using statistical uncertainty modeling.

Recently, deep neural network (DNN)-based speech modeling has brought great progress to the field of speech recognition. However, the performance degradation that occurs when the input data is degraded by noise or similar factors remains a major challenge in the field.

Existing DNN-based techniques generate a deterministic output for a given input without considering the degree of uncertainty arising from the various sources of input information. As a result, they are prone to failure on data from environments that are not represented in the training dataset.

In particular, DNN-HMM-based speech recognizers have been studied extensively, and existing DNN-HMM-based speech recognizers generate a deterministic output from a given input. Among these, noise-robust speech recognizers mainly either place a network that deterministically maps noisy data to clean data in series with a speech recognizer trained only on clean data, or train the speech recognizer directly on noisy data.

However, in both cases, when noise that did not appear in the training data is encountered, the mapping fails and a large performance degradation is likely. There is therefore a need for a technique that can recognize speech accurately even in noisy environments by taking into account the degree of uncertainty arising from the various sources of input information.

Meanwhile, as prior art related to the present invention, Registered Patent No. 10-1740637 (title of the invention: Method and apparatus for speech recognition in a noisy environment using uncertainty; registration date: May 22, 2017) and others have been disclosed.

The present invention has been proposed to solve the above problems of previously proposed methods. Its object is to provide a speech recognition method and apparatus using statistical uncertainty modeling that exhibit excellent speech recognition performance even when noise absent from the training data appears. To this end, the invention takes into account the uncertainty of input speech distorted by noise: it directly measures the uncertainty of the input speech using speech uncertainty information, which represents the distribution of clean speech features, and environmental uncertainty information, which represents the probability distribution of a latent variable based on variational inference (VI), and it constructs a model trained to reflect this uncertainty effectively.

To achieve the above object, a speech recognition method using statistical uncertainty modeling according to a feature of the present invention is

a speech recognition method comprising:

(1) constructing a deep neural network (DNN)-based UAT (Uncertainty-Aware Training) model that estimates a phonetic target from input speech, taking into account the uncertainty of the input speech distorted by noise; and

(2) performing speech recognition using the UAT model constructed in step (1),

wherein step (1) comprises:

(1-1) training a CUN (Clean Uncertainty Network), which receives a feature of the input speech (distorted feature, y_t) and estimates the distribution of the clean speech feature (clean feature, x_t), and outputting speech uncertainty information;

(1-2) training an EUN (Environment Uncertainty Network), which estimates the phonetic target using the feature of the input speech and the speech uncertainty information output in step (1-1) and which models, by variational inference (VI), the probability distribution of a latent variable in the estimation process, and outputting environmental uncertainty information; and

(1-3) constructing a UAT model including a PN (Prediction Network) that estimates the phonetic target using the speech uncertainty information and the environmental uncertainty information.

Preferably, in step (1-1),

the mean and log variance of the distribution of the clean speech feature may be output as the speech uncertainty information.

Preferably, the EUN of step (1-2)

may be a modified VAE (Variational Autoencoder) that models the probability distribution of the latent variable output by the encoder.

Preferably, in step (1-2),

the mean and variance of the distribution of the latent variable may be output as the environmental uncertainty information.

Preferably, in step (1-3),

the UAT model may be constructed by training an integrated model, in which the CUN, the EUN, and the PN are concatenated, to estimate the phonetic target from the input speech.

More preferably, in step (1-3),

the UAT model may be constructed with improved performance by tuning the integrated model using the loss function of the PN.

To achieve the above object, a speech recognition apparatus using statistical uncertainty modeling according to a feature of the present invention is

a speech recognition apparatus comprising:

a learning unit that constructs a deep neural network (DNN)-based UAT (Uncertainty-Aware Training) model that estimates a phonetic target from input speech, taking into account the uncertainty of the input speech distorted by noise; and

a speech recognition unit that performs speech recognition using the UAT model constructed by the learning unit,

wherein the UAT model comprises:

a CUN (Clean Uncertainty Network) that receives a feature of the input speech (distorted feature, y_t), estimates the distribution of the clean speech feature (clean feature, x_t), and outputs speech uncertainty information;

an EUN (Environment Uncertainty Network) that estimates the phonetic target using the feature of the input speech and the speech uncertainty information output by the CUN, models, by variational inference (VI), the probability distribution of a latent variable in the estimation process, and outputs environmental uncertainty information; and

a PN (Prediction Network) that estimates the phonetic target using the speech uncertainty information and the environmental uncertainty information.

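Preferably, the CUN may output the mean and log variance of the distribution of the clean speech feature as the speech uncertainty information.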

Preferably, the EUN

may be a modified VAE (variational autoencoder) that models the probability distribution of the latent variable output by the encoder.

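Preferably, the EUN may output the mean and variance of the distribution of the latent variable as the environmental uncertainty information.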

Preferably, the UAT model

may be an integrated model in which the CUN, the EUN, and the PN are concatenated, constructed by training it to estimate the phonetic target from the input speech.

More preferably, the UAT model

may be one whose performance has been improved by tuning the integrated model using the loss function of the PN.

According to the speech recognition method and apparatus using statistical uncertainty modeling proposed in the present invention, the uncertainty of input speech distorted by noise is taken into account: the uncertainty of the input speech is measured directly using speech uncertainty information, which represents the distribution of clean speech features, and environmental uncertainty information, which represents the probability distribution of a latent variable based on variational inference (VI), and a model is constructed that is trained to reflect this uncertainty effectively. As a result, excellent speech recognition performance can be obtained even when noise absent from the training data appears.

FIG. 1 is a diagram illustrating the flow of a speech recognition method using statistical uncertainty modeling according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the configuration of the UAT model of a speech recognition method using statistical uncertainty modeling according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the detailed flow of step S100 in a speech recognition method using statistical uncertainty modeling according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the training process of step S100 as functional blocks in a speech recognition method using statistical uncertainty modeling according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining the training process of step S100 of a speech recognition method using statistical uncertainty modeling according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating the configuration of the UAT model in a speech recognition method using statistical uncertainty modeling according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the configuration of a speech recognition apparatus using statistical uncertainty modeling according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating the detailed configuration of the learning unit in a speech recognition apparatus using statistical uncertainty modeling according to an embodiment of the present invention.
FIG. 9 is a diagram showing the results of an experiment, using the CHiME-4 test set, comparing the speech recognition method and apparatus using statistical uncertainty modeling according to an embodiment of the present invention with other speech recognizers.
FIG. 10 is a diagram showing the results of an experiment, using the AURORA-4 test set, comparing the speech recognition method and apparatus using statistical uncertainty modeling according to an embodiment of the present invention with other speech recognizers.

Hereinafter, preferred embodiments will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can easily practice the invention. In describing the preferred embodiments in detail, however, a detailed description of a related known function or configuration is omitted where it is judged that it would unnecessarily obscure the gist of the present invention. In addition, the same reference numerals are used throughout the drawings for parts having similar functions and operations.

In addition, throughout the specification, when a part is said to be 'connected' to another part, this includes not only the case where it is 'directly connected' but also the case where it is 'indirectly connected' with another element interposed between them. Further, 'including' a certain component means that other components may be further included, rather than excluded, unless specifically stated otherwise.

FIG. 1 is a diagram illustrating the flow of a speech recognition method using statistical uncertainty modeling according to an embodiment of the present invention. As shown in FIG. 1, the method may be implemented to include a step (S100) of constructing a deep neural network-based UAT (Uncertainty-Aware Training) model 200 that estimates a phonetic target from input speech, taking into account the uncertainty of the input speech distorted by noise, and a step (S200) of performing speech recognition using the constructed UAT model 200.

The present invention relates to a speech recognition method using statistical uncertainty modeling, and the method according to a feature of the present invention may be implemented as software recorded on hardware that includes a memory and a processor. For example, the speech recognition method using statistical uncertainty modeling of the present invention may be stored and implemented on a personal computer, a notebook computer, a server computer, a PDA, a smartphone, a tablet PC, and the like. Hereinafter, for convenience of description, the subject performing each step may be omitted.

In step S100, a deep neural network (DNN)-based UAT model that estimates a phonetic target from input speech may be constructed, taking into account the uncertainty of the input speech distorted by noise. The detailed flow of step S100 is described later with reference to FIG. 3.

In step S200, speech recognition may be performed using the UAT model 200 constructed in step S100.

That is, in the speech recognition method using statistical uncertainty modeling according to an embodiment of the present invention, the UAT model 200 is implemented with a structure that, when modeling the pronunciation information of noisy data, makes the final estimate of that information by using two kinds of uncertainty: the uncertainty that arises when noisy data is mapped to clean data, and the uncertainty that arises when pronunciation information is estimated using the probability distribution of the clean data estimated from the noisy data (step S100). Speech recognition may then be performed using the implemented UAT model 200 (step S200).

Hereinafter, in order to describe the speech recognition method using statistical uncertainty modeling according to an embodiment of the present invention, uncertainty decoding (UD) and the variational autoencoder (VAE) are first reviewed.

First, uncertainty decoding is a method for constructing a noise-robust model. When training noise-robust speech recognition, a mismatch between the training data and the test data can degrade performance. When the training data is limited to particular environments and the test data comes from an environment that is not present in the training data, the accuracy of the estimates drops, which in turn degrades the performance of the speech recognizer. UD can compensate for this deficiency in the decoding process.

UD assumes that the mapping from a clean speech feature (clean feature, x_t) to a speech feature distorted by noise (distorted feature, y_t) is a stochastic process. Under the assumption that the acoustic model is trained on clean speech data and that the observed input is a distorted version of such clean speech data, the likelihood function for an HMM state can be modified as in Equation 1 below.

Here, x_t, y_t, and q_t denote the clean feature, the noisy feature, and the HMM state at frame t, respectively. Because the integral in Equation 1 is computationally intractable when implementing UD, conventional GMM-HMM systems estimated p(x_t|q_t), p(x_t|y_t), and p(x_t) in the above expression using Gaussians or Gaussian mixtures, whereas DNN-HMM systems computed p(q_t|y_t) by taking the expectation of p(q_t|x_t) over the clean feature distribution.
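The DNN-HMM approximation described above, taking the expectation of p(q_t|x_t) over the clean feature distribution, can be illustrated with a Monte Carlo estimate. The following is a minimal sketch rather than the procedure of any particular system: it assumes a diagonal Gaussian for p(x_t|y_t), and the linear-softmax classifier and the names `state_posterior` and `expected_state_posterior` are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def state_posterior(x, weights, biases):
    """Toy p(q | x): one linear layer followed by softmax (hypothetical classifier)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

def expected_state_posterior(mean, var, weights, biases, n_samples=1000, seed=0):
    """Monte Carlo estimate of E[p(q | x)] with x ~ N(mean, diag(var))."""
    rng = random.Random(seed)
    acc = [0.0] * len(biases)
    for _ in range(n_samples):
        # Draw one clean-feature sample from the assumed diagonal Gaussian.
        x = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, var)]
        post = state_posterior(x, weights, biases)
        acc = [a + p for a, p in zip(acc, post)]
    return [a / n_samples for a in acc]

# Three hypothetical HMM states, two-dimensional features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
p_uncertain = expected_state_posterior([0.5, -0.2], [0.1, 0.1], W, b, n_samples=200)
```

With zero variance the estimate reduces to the posterior evaluated at the mean, which corresponds to a deterministic recognizer that ignores the uncertainty.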

Next, the VAE is a type of autoencoder that reconstructs its input vector at the output, with a structure in which the middle hidden layer consists of latent variables, which are random variables.

A VAE consists of two networks: an encoder and a decoder. The encoder network takes an input vector and estimates the posterior distribution of the latent variable conditioned on that input. A latent variable sampled from the estimated distribution is fed to the decoder, and the decoder's output reconstructs the input.
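The sampling step between encoder and decoder can be sketched as follows. This is only an illustration under assumptions not stated in the text: the encoder is taken to output a mean and a log variance per latent dimension, and the sample is drawn with the commonly used reparameterization trick. The function name `sample_latent` is hypothetical.

```python
import math
import random

def sample_latent(mean, log_var, rng):
    """Draw z = mean + sigma * eps, with eps ~ N(0, 1) per dimension.

    sigma is recovered from the log variance as exp(0.5 * log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

rng = random.Random(0)
z = sample_latent([0.0, 1.0], [0.0, -2.0], rng)  # one draw of a 2-D latent variable
```

Writing the sample as a deterministic function of the mean, the log variance, and an independent noise term is what allows gradients to flow back to the encoder during back-propagation.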

A VAE is trained by optimizing the encoder and decoder networks together with the error back-propagation algorithm, and its objective function is defined as in Equation 2 below.

Here, q_φ(z|x) denotes the probability that the encoder network generates the latent variable z from a given input x, p_θ(x|z) denotes the probability that the decoder network reconstructs the input x from the latent variable, and p_θ(z) denotes the prior probability of the latent variable given the parameters of the decoder network. D_KL(q_φ(z|x)||p_θ(z)) is the Kullback-Leibler divergence, which measures the difference between the distribution of z generated given x and the prior distribution of z; it acts as a regularizer that keeps the distribution of the generated latent variable as close as possible to the prior. E_{q_φ(z|x)}[log p_θ(x|z)], on the other hand, is the reconstruction error: the cross-entropy error between the distribution of z generated given the input x and the distribution with which x is generated from z.
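The two terms described here, the KL regularizer and the reconstruction term, combine in the standard VAE objective as E_q[log p(x|z)] - D_KL(q(z|x) || p(z)). The sketch below evaluates that objective under two assumptions that the text does not state: the encoder distribution is a diagonal Gaussian parameterized by mean and log variance, and the prior is a standard normal, for which the KL term has a closed form. The function names are hypothetical.

```python
import math

def kl_to_standard_normal(mean, log_var):
    """D_KL(N(mean, diag(exp(log_var))) || N(0, I)) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))

def elbo(recon_log_likelihood, mean, log_var):
    """Evidence lower bound: reconstruction term minus the KL regularizer."""
    return recon_log_likelihood - kl_to_standard_normal(mean, log_var)

# An encoder output that equals the prior incurs no KL penalty.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
```

Training maximizes this bound, so the KL term pulls the encoder output toward the prior while the reconstruction term rewards latent variables from which the input can be recovered.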

The speech recognition method using statistical uncertainty modeling according to an embodiment of the present invention proposes a training technique that uses a deep neural network to account for the inherent uncertainty of input speech features distorted by noise.

To apply the conventional UD approach to a deep learning-based acoustic model, the likelihood can be expressed in the form of Equation 3 below.

The conditional probability p(x_t|y_t) allows the likelihood to take into account the uncertainty arising from the distorted input speech feature (distorted feature, y_t). If this conditional probability follows a parametric distribution under a particular probabilistic model, the likelihood can be regarded as determined by the parameters of that distribution. Therefore, under the assumption that the distribution of x_t follows a particular probabilistic model, Equation 3 above can be expressed as Equation 4 below in terms of ξ_xt, the parameters of x_t.

Here, g_qt denotes a mapping function. However, a problem remains in applying Equation 4 to a deep learning-based acoustic model. Because the conditional distribution in Equation 4 is estimated from the training data, it cannot be applied reliably when the training data and the test data differ. The present invention therefore introduces a latent variable z_t to supplement the information that cannot be explained when y_t is mapped to its corresponding HMM state q_t. Introducing it into the expression yields Equation 5 below, whose final form follows from the chain rule.

In the above expression, the new conditional distribution depends on x_t and z_t. By the same reasoning as before, the likelihood can be expressed as a function of the parameters of the clean speech feature and of the latent variable, as in Equation 6 below.

It is assumed that p(x_t|y_t) and p(z_t|x_t,y_t) follow Gaussian distributions with no correlation between their components.
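A practical consequence of the no-correlation assumption is that the joint log density is a sum of independent one-dimensional terms, so each distribution is fully described by one mean and one variance per component. The following is a minimal sketch of that log density; parameterizing it by log variance and the name `diag_gaussian_log_pdf` are illustrative choices, not taken from the text.

```python
import math

def diag_gaussian_log_pdf(x, mean, log_var):
    """log N(x; mean, diag(exp(log_var))) for uncorrelated components."""
    return -0.5 * sum(
        math.log(2.0 * math.pi) + lv + (xi - m) ** 2 / math.exp(lv)
        for xi, m, lv in zip(x, mean, log_var)
    )

# The joint log density of two components equals the sum of the per-component values.
joint = diag_gaussian_log_pdf([0.3, -1.0], [0.0, 0.5], [0.2, -0.4])
parts = (diag_gaussian_log_pdf([0.3], [0.0], [0.2])
         + diag_gaussian_log_pdf([-1.0], [0.5], [-0.4]))
```

The negative of this quantity is one plausible training criterion for a network that predicts a Gaussian over clean features, although the text does not specify the loss that is used.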

Under this assumption, the UAT model 200 proposed in the present invention may be built by combining three independent DNNs and training them jointly.

FIG. 2 is a diagram illustrating the configuration of the UAT model 200 of the speech recognition method and apparatus using statistical uncertainty modeling according to an embodiment of the present invention. As shown in FIG. 2, the UAT model 200 implemented and used in the speech recognition method and apparatus according to an embodiment of the present invention may be composed of a combination of three independent DNNs. The first network, the Clean Uncertainty Network (CUN) 210, can represent the uncertainty that arises in the process of estimating clean speech. The second network, the Environment Uncertainty Network (EUN) 220, can represent the uncertainty that the preceding CUN 210 could not resolve because of unseen factors. The last network, the Prediction Network (PN) 230, can predict the HMM state from the uncertainty information produced by the CUN 210 and the EUN 220.

More specifically, the UAT model 200 may include: a CUN (Clean Uncertainty Network) 210 that receives a feature of the input speech (distorted feature, y_t), estimates the distribution of the clean speech feature (clean feature, x_t), and outputs speech uncertainty information; an EUN (Environment Uncertainty Network) 220 that estimates the phonetic target using the feature of the input speech and the speech uncertainty information output by the CUN 210, models, by variational inference (VI), the probability distribution of a latent variable in the estimation process, and outputs environmental uncertainty information; and a PN (Prediction Network) 230 that estimates the phonetic target using the speech uncertainty information and the environmental uncertainty information. The detailed flow of constructing the UAT model 200 through training is described below with reference to FIG. 3.
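The data flow among the three networks can be sketched as a toy forward pass. This is a schematic under strong simplifying assumptions, not the architecture of the embodiment: each network is reduced to a single linear layer with random weights, only the encoder side of the EUN is shown, and all names, layer sizes, and the choice to pass log variance through exp are hypothetical.

```python
import math
import random

def linear(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (toy initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def uat_forward(y, params, feat_dim, latent_dim):
    """Toy forward pass CUN -> EUN -> PN for one frame y."""
    # CUN: distorted feature y -> mean and log variance of the clean feature.
    cun_out = linear(y, *params["cun"])
    x_mean, x_log_var = cun_out[:feat_dim], cun_out[feat_dim:]
    # EUN: y plus the CUN statistics -> mean and variance of the latent variable.
    eun_out = linear(y + x_mean + x_log_var, *params["eun"])
    z_mean = eun_out[:latent_dim]
    z_var = [math.exp(v) for v in eun_out[latent_dim:]]  # exp keeps the variance positive
    # PN: both sets of uncertainty statistics -> posterior over HMM states.
    pn_in = x_mean + x_log_var + z_mean + z_var
    return softmax(linear(pn_in, *params["pn"]))

rng = random.Random(0)
feat_dim, latent_dim, n_states = 4, 2, 3
params = {
    "cun": init_layer(2 * feat_dim, feat_dim, rng),
    "eun": init_layer(2 * latent_dim, 3 * feat_dim, rng),
    "pn": init_layer(n_states, 2 * feat_dim + 2 * latent_dim, rng),
}
posterior = uat_forward([0.2, -0.1, 0.4, 0.0], params, feat_dim, latent_dim)
```

The point of the sketch is the wiring: the PN sees distribution parameters (means and variances) rather than a single point estimate of the clean feature, so the amount of uncertainty itself becomes an input to the state prediction.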

FIG. 3 is a diagram illustrating the detailed flow of step S100 in a speech recognition method using statistical uncertainty modeling according to an embodiment of the present invention. As shown in FIG. 3, step S100 may be implemented to include a step (S110) of training the CUN (Clean Uncertainty Network) 210 and outputting speech uncertainty information, a step (S120) of training the EUN (Environment Uncertainty Network) 220 and outputting environmental uncertainty information, and a step (S130) of constructing the UAT model 200 including the PN (Prediction Network) 230 that estimates the phonetic target using the speech uncertainty information and the environmental uncertainty information.

FIG. 4 is a functional block diagram of the training process of step S100 in the speech recognition method using statistical uncertainty modeling according to an embodiment of the present invention. As shown in FIG. 4, training in step S100 first trains the CUN 210 and the EUN 220 (S110 and S120), and then trains the PN 230 using the outputs of the trained CUN 210 and EUN 220. Finally, the three networks are combined and jointly fine-tuned using the loss function of the PN 230 to improve overall performance (S130).
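The staged training order described above can be summarized as a small schedule. The following is a minimal sketch under stated assumptions, not the implementation of the invention: the `Stage` type, the function name, the labels "S130a" and "S130b", and the choice to keep already-trained networks fixed until the joint fine-tuning stage are all hypothetical and are not stated in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str         # label of the training stage (labels are illustrative)
    trainable: tuple  # networks whose weights are updated in this stage
    loss: str         # loss that drives the update


def uat_training_schedule():
    """Return the training stages in the order given in the description.

    Assumption: networks trained in an earlier stage stay fixed until the
    final joint fine-tuning stage.
    """
    return [
        Stage("S110: train CUN", ("CUN",), "CUN loss"),
        Stage("S120: train EUN", ("EUN",), "EUN loss"),
        Stage("S130a: train PN on CUN/EUN outputs", ("PN",), "PN loss"),
        Stage("S130b: joint fine-tuning", ("CUN", "EUN", "PN"), "PN loss"),
    ]


if __name__ == "__main__":
    for stage in uat_training_schedule():
        print(stage.name, "->", ", ".join(stage.trainable), "using", stage.loss)
```

Iterating over the returned list gives the order in which each network would be updated; only the last stage updates all three networks together, and it does so with the PN loss.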

FIG. 5 is a diagram for explaining the training process of step S100 of the speech recognition method using statistical uncertainty modeling according to an embodiment of the present invention, and FIG. 6 is a flowchart showing the configuration of the UAT model 200 in the same method. Each sub-step of step S100 is described in detail below with reference to FIGS. 5 and 6.

In step S110, the Clean Uncertainty Network (CUN) 210, which receives the distorted feature (y_t) of the input speech and estimates the distribution of the clean feature (x_t), is trained so that speech uncertainty information can be output. More specifically, in step S110, the mean and log-variance of the distribution of the clean speech feature may be output as the speech uncertainty information.

That is, the CUN 210 outputs the parameters of the clean speech feature distribution for the distorted input speech feature, and may be trained to maximize the likelihood of the estimated distribution. More specifically, the CUN 210 is a model that receives a noisy speech feature and outputs the Gaussian mean and log-variance of the distribution of the clean speech feature. Of these outputs, the mean represents the clean speech feature to which the input is mapped, and the variance represents the uncertainty of the input. The loss function used for training is given by Equation 7 below.
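As an illustration only, a network that outputs a Gaussian mean and log-variance is commonly trained to maximize likelihood by minimizing the Gaussian negative log-likelihood. The sketch below assumes a diagonal Gaussian for a single frame; the function name and the averaging over feature dimensions are assumptions, and the exact form of Equation 7 may differ.

```python
import math


def gaussian_nll(clean, mean, log_var):
    """Average negative log-likelihood of `clean` under a diagonal Gaussian.

    `mean` and `log_var` stand in for the two CUN outputs for one frame;
    all three arguments are equal-length lists of floats.
    """
    if not (len(clean) == len(mean) == len(log_var)):
        raise ValueError("clean, mean and log_var must have the same length")
    total = 0.0
    for x, mu, lv in zip(clean, mean, log_var):
        # 0.5 * [log(2*pi) + log(sigma^2) + (x - mu)^2 / sigma^2]
        total += 0.5 * (math.log(2.0 * math.pi) + lv + (x - mu) ** 2 * math.exp(-lv))
    return total / len(clean)
```

With this form, a frame whose mean estimate is far from the clean feature is penalized less if the network also reports a larger variance, which is how the variance output comes to express the uncertainty of the input.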

In step S120, the Environment Uncertainty Network (EUN) 220 is trained so that environmental uncertainty information can be output. The EUN 220 estimates the phonetic target using the features of the input speech and the speech uncertainty information output in step S110, and models the probability distribution of a latent variable during the estimation by variational inference (VI). More specifically, the EUN 220 of step S120 may be a modified variational autoencoder (VAE) that models the probability distribution of the latent variable output from the encoder. In addition, in step S120, the mean and variance of the latent variable distribution may be output as the environmental uncertainty information.

That is, the EUN 220 is a network that estimates the phonetic target using the output of the CUN 210 and the features of the input speech, and may be a variational inference (VI) model that modifies the structure of a variational autoencoder (VAE) as in Equation 8 below so as to model a latent random variable at an intermediate stage.

The integral over x_t can be rewritten as a function of ξ_xt, the parameters of x_t. If the VAE structure is redesigned accordingly, the inputs of both the encoder and the decoder θ of the VAE are conditioned on the clean speech feature parameters. The loss function of the EUN 220 may follow Equation 9 below, similar to the loss function of a VAE.

The mean and variance of the latent variable distribution output by the EUN 220 may be used as a measure of environmental uncertainty.
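As an illustration of a VAE-style loss of the kind referred to above, the sketch below combines a prediction term with a Kullback-Leibler term on a diagonal Gaussian latent variable. It is a sketch under assumptions, not Equation 9 itself: the standard normal prior, the log-variance parameterization, the `kl_weight` parameter, and all function names are hypothetical.

```python
import math
import random


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var))


def sample_latent(mean, log_var, rng):
    """Reparameterized sample z = mean + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]


def eun_style_loss(target_log_prob, mean, log_var, kl_weight=1.0):
    """Negative evidence lower bound for one frame.

    `target_log_prob` is the log-probability the decoder assigns to the
    correct phonetic target given a sampled latent vector.
    """
    return -target_log_prob + kl_weight * kl_to_standard_normal(mean, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    z = sample_latent([0.0, 1.0], [0.0, -2.0], rng)
    print(z, eun_style_loss(math.log(0.5), [0.0, 1.0], [0.0, -2.0]))
```

The latent mean and variance that enter the KL term are the same quantities the description uses as the environmental uncertainty information.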

In step S130, the UAT model 200 including the Prediction Network (PN) 230, which estimates the phonetic target using the speech uncertainty information and the environmental uncertainty information, may be configured. More specifically, in step S130, an integrated model in which the CUN 210, the EUN 220, and the PN 230 are concatenated is trained to estimate the phonetic target from the input speech, thereby configuring the UAT model 200. In addition, in step S130, the integrated model may be tuned using the loss function of the PN 230 to configure a UAT model 200 with improved performance.

That is, the PN 230 is a model that estimates the phonetic target using the parameters of the probability distributions output by the CUN 210 and the EUN 220. The PN 230 can generate the final output of the entire network while taking the uncertainty of the data into account. The input vector of the PN 230 consists of the mean and variance of the clean speech (the speech uncertainty information output from the CUN 210) and the latent variable parameters (the environmental uncertainty information output from the EUN 220), which allows the UAT model 200 to reflect the uncertainty of both the clean speech and the environmental information.
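The composition of the PN input vector described above can be sketched as a simple concatenation. This is a minimal illustration under assumptions: the function name, the ordering of the four parts, and the use of a log-variance for the CUN part are hypothetical and are not specified in this description.

```python
def build_pn_input(cun_mean, cun_log_var, eun_mean, eun_var):
    """Concatenate the CUN and EUN distribution parameters for one frame.

    cun_mean, cun_log_var: speech uncertainty information (clean feature).
    eun_mean, eun_var: environmental uncertainty information (latent variable).
    """
    if len(cun_mean) != len(cun_log_var):
        raise ValueError("CUN mean and log-variance must have the same length")
    if len(eun_mean) != len(eun_var):
        raise ValueError("EUN mean and variance must have the same length")
    return list(cun_mean) + list(cun_log_var) + list(eun_mean) + list(eun_var)


if __name__ == "__main__":
    print(build_pn_input([1.0, 2.0], [0.1, 0.2], [3.0], [0.5]))
```

The resulting vector has twice the clean feature dimension plus twice the latent dimension, so the PN sees both point estimates and their spreads for every frame.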

The UAT model 200 may be configured by combining the three networks above, the CUN 210, the EUN 220, and the PN 230, in that order, and completing training by fine-tuning with a cross-entropy criterion over the HMM states.
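A cross-entropy criterion over HMM states can be sketched as follows. This is a generic illustration, assuming the PN produces one unnormalized score (logit) per HMM state for each frame and that a forced alignment supplies the target state index; the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over the HMM state scores of one frame."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def frame_cross_entropy(logits, state_index):
    """Cross-entropy between the predicted posterior and a one-hot target state."""
    return -log_softmax(logits)[state_index]


def mean_cross_entropy(batch_logits, batch_states):
    """Average the per-frame cross-entropy over a batch of frames."""
    losses = [frame_cross_entropy(l, s) for l, s in zip(batch_logits, batch_states)]
    return sum(losses) / len(losses)
```

In the joint fine-tuning stage, a loss of this kind would be back-propagated through the PN and onward into the EUN and CUN, since all three networks are combined into one model.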

FIG. 7 is a diagram illustrating the configuration of the speech recognition apparatus 100 using statistical uncertainty modeling according to an embodiment of the present invention. As shown in FIG. 7, the speech recognition apparatus 100 may include: a learning unit 110 that configures a deep neural network (DNN)-based uncertainty-aware training (UAT) model for estimating a phonetic target from input speech in consideration of the uncertainty of the input speech distorted by noise; and a speech recognition unit 120 that performs speech recognition using the UAT model 200 configured by the learning unit 110. The learning unit 110 may perform step S100, and the speech recognition unit 120 may perform step S200.

The learning unit 110 may configure the UAT model 200. As shown in FIG. 2, the UAT model 200 may include: a Clean Uncertainty Network (CUN) 210 that receives the distorted feature (y_t) of the input speech, estimates the distribution of the clean feature (x_t), and outputs speech uncertainty information; an Environment Uncertainty Network (EUN) 220 that estimates the phonetic target using the features of the input speech and the speech uncertainty information output from the CUN 210, models the probability distribution of a latent variable during the estimation by variational inference (VI), and outputs environmental uncertainty information; and a Prediction Network (PN) 230 that estimates the phonetic target using the speech uncertainty information and the environmental uncertainty information. The UAT model 200 is an integrated model in which the CUN 210, the EUN 220, and the PN 230 are concatenated, and may be configured by training it to estimate the phonetic target from the input speech.

FIG. 8 is a diagram illustrating the detailed configuration of the learning unit 110 in the speech recognition apparatus 100 using statistical uncertainty modeling according to an embodiment of the present invention. As shown in FIG. 8, the learning unit 110 may include: a CUN learning module 111 that performs step S110 to train the CUN 210; an EUN learning module 112 that performs step S120 to train the EUN 220; and a UAT model configuration module 113 that performs step S130 to configure the UAT model 200.

Experiment

To verify the performance of the speech recognition method and apparatus using statistical uncertainty modeling according to an embodiment of the present invention, experiments were performed on the Aurora-4 and CHiME-4 datasets.

Both datasets are used to evaluate robust speech recognizers. The Aurora-4 dataset was recorded under various SNR and noise conditions, and the CHiME-4 dataset consists of data recorded in real environments together with data to which reverberation and noise were added by simulation. For the experimental setup, the feature extraction, GMM-HMM training, alignment, and ASR decoding modules of the Kaldi toolkit were used.

Five existing speech recognizer models were used as baselines. The configurations of the three models that serve as the main points of comparison are as follows.

First, the DNN-Conventional model does not model uncertainty. It consists of a network that regresses noisy data to clean data and a network that estimates the phonetic target from the regressed clean data.

In the DNN-ID model, the parameters of a Gaussian distribution are output when noisy data is mapped to clean data, and the prediction network estimates the phonetic target using the mean and variance output by the CUN 210. In this case the uncertainty of the clean feature is reflected to some extent, but unlike the UAT model 200 proposed in the present invention, information on the uncertainty that arises when predicting from uncertain features is not reflected.

In the VAE-Conventional model, the CUN 210 is modeled with a VAE structure, and the PN 230 estimates the phonetic target using the distribution of the latent variable of the CUN 210. Like the DNN-ID model, this model reflects only the uncertainty of the noisy feature.

The experimental results were compared in terms of the word error rate (WER) of each model.
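For reference, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the recognized text and the reference transcript, divided by the number of reference words. The sketch below is a minimal illustration with a hypothetical function name; the experiments described here used the Kaldi toolkit for decoding, whose own scoring tools may differ in text normalization details.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent between two whitespace-tokenized strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.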

FIG. 9 shows the results of an experiment comparing the speech recognition method and apparatus using statistical uncertainty modeling according to an embodiment of the present invention with other speech recognizers on the CHiME-4 test set, and FIG. 10 shows the results of the same comparison on the Aurora-4 test set. All results in FIGS. 9 and 10 are given as WER (%). In FIG. 9, sets A, B, C, and D denote commonly used test environments: A is the clean, matched-channel condition (test condition 1); B is the noisy, matched-channel condition (test conditions 2-7); C is the clean, mismatched-channel condition (test condition 8); and D is the noisy, mismatched-channel condition (test conditions 9-14).
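The grouping of test conditions 1 to 14 into sets A to D described above can be expressed as a small lookup. The sketch below is illustrative only: the function names are hypothetical, and averaging per-condition WERs within a set (rather than pooling word counts across conditions) is an assumption about how per-set figures are formed.

```python
def condition_set(condition):
    """Map a test condition number (1-14) to its set letter A-D."""
    if condition == 1:
        return "A"  # clean, matched channel
    if 2 <= condition <= 7:
        return "B"  # noisy, matched channel
    if condition == 8:
        return "C"  # clean, mismatched channel
    if 9 <= condition <= 14:
        return "D"  # noisy, mismatched channel
    raise ValueError(f"unknown test condition: {condition}")


def average_wer_by_set(wer_by_condition):
    """Average per-condition WERs (%) within each set."""
    grouped = {}
    for condition, wer in wer_by_condition.items():
        grouped.setdefault(condition_set(condition), []).append(wer)
    return {name: sum(values) / len(values) for name, values in sorted(grouped.items())}
```

Given a dictionary that maps condition numbers to WER values, `average_wer_by_set` returns one averaged figure for each set that is present.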

Referring to FIGS. 9 and 10, across datasets from various environments, the DNN-ID and VAE-Conventional models, which take the uncertainty of the clean feature into account, perform better overall than the DNN-Baseline and DNN-Conventional models, which do not model uncertainty. In addition, the UAT model 200 proposed in the present invention performs better than the other models across various noise environments.

As described above, according to the speech recognition method and apparatus using statistical uncertainty modeling proposed in the present invention, the uncertainty of input speech distorted by noise is taken into account. The uncertainty of the input speech is measured directly using speech uncertainty information, which represents the distribution of the clean speech feature, and environmental uncertainty information, which represents the probability distribution of a latent variable obtained by variational inference (VI). By configuring a model trained to effectively reflect this uncertainty, excellent speech recognition performance can be achieved even when noise absent from the training data appears.

The present invention is a technique that measures the degree of uncertainty when data containing uncertainty due to noise or other environmental factors must be processed, and uses that measure to mitigate the resulting performance degradation. Accordingly, although the UAT model 200 of the present invention models the uncertainty of speech data in the field of DNN-HMM-based speech recognition, the algorithm is expected to be applicable to the processing of other kinds of data as well, such as images and text.

Meanwhile, the present invention may include a computer-readable medium containing program instructions for performing operations implemented on various communication terminals. For example, the computer-readable medium may include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM and DVD; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Such a computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and configured to implement the present invention, or may be known and available to those skilled in computer software. For example, they may include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The present invention described above may be modified or applied in various ways by those of ordinary skill in the art to which it pertains, and the scope of the technical idea of the present invention should be defined by the claims below.

100: speech recognition apparatus
110: learning unit
111: CUN learning module
112: EUN learning module
113: UAT model configuration module
120: speech recognition unit
200: UAT model
210: CUN
220: EUN
230: PN
S100: configuring a deep neural network-based UAT model that estimates a phonetic target from input speech in consideration of the uncertainty of the input speech distorted by noise
S110: training the CUN (Clean Uncertainty Network) to output speech uncertainty information
S120: training the EUN (Environment Uncertainty Network) to output environmental uncertainty information
S130: configuring the UAT model including the PN (Prediction Network) that estimates the phonetic target using the speech uncertainty information and the environmental uncertainty information
S200: performing speech recognition using the configured UAT model

Claims

A speech recognition method using statistical uncertainty modeling, comprising:
(1) configuring a deep neural network (DNN)-based uncertainty-aware training (UAT) model that estimates a phonetic target from input speech in consideration of the uncertainty of the input speech distorted by noise; and
(2) performing speech recognition using the UAT model 200 configured in step (1),
wherein step (1) comprises:
(1-1) training a Clean Uncertainty Network (CUN) 210, which receives the distorted feature (y_t) of the input speech and estimates the distribution of the clean feature (x_t), to output speech uncertainty information;
(1-2) training an Environment Uncertainty Network (EUN) 220, which estimates the phonetic target using the features of the input speech and the speech uncertainty information output in step (1-1) and models the probability distribution of a latent variable during the estimation by variational inference (VI), to output environmental uncertainty information; and
(1-3) configuring the UAT model 200 including a Prediction Network (PN) 230 that estimates the phonetic target using the speech uncertainty information and the environmental uncertainty information.

The speech recognition method using statistical uncertainty modeling of claim 1, wherein in step (1-1),
the mean and log-variance of the distribution of the clean speech feature are output as the speech uncertainty information.

The speech recognition method using statistical uncertainty modeling of claim 1, wherein the EUN 220 of step (1-2)
is a modified variational autoencoder (VAE) that models the probability distribution of the latent variable output from the encoder.

The speech recognition method using statistical uncertainty modeling of claim 1, wherein in step (1-2),
the mean and variance of the distribution of the latent variable are output as the environmental uncertainty information.

The speech recognition method using statistical uncertainty modeling of claim 1, wherein in step (1-3),
the UAT model 200 is configured by training an integrated model, in which the CUN 210, the EUN 220, and the PN 230 are concatenated, to estimate the phonetic target from the input speech.

The speech recognition method using statistical uncertainty modeling of claim 5, wherein in step (1-3),
the UAT model 200 with improved performance is configured by tuning the integrated model using the loss function of the PN 230.

A speech recognition apparatus 100 using statistical uncertainty modeling, comprising:
a learning unit 110 that configures a deep neural network (DNN)-based uncertainty-aware training (UAT) model for estimating a phonetic target from input speech in consideration of the uncertainty of the input speech distorted by noise; and
a speech recognition unit 120 that performs speech recognition using the UAT model 200 configured by the learning unit 110,
wherein the UAT model 200 comprises:
a Clean Uncertainty Network (CUN) 210 that receives the distorted feature (y_t) of the input speech, estimates the distribution of the clean feature (x_t), and outputs speech uncertainty information;
an Environment Uncertainty Network (EUN) 220 that estimates the phonetic target using the features of the input speech and the speech uncertainty information output from the CUN 210, models the probability distribution of a latent variable during the estimation by variational inference (VI), and outputs environmental uncertainty information; and
a Prediction Network (PN) 230 that estimates the phonetic target using the speech uncertainty information and the environmental uncertainty information.

The speech recognition apparatus 100 using statistical uncertainty modeling of claim 7, wherein the CUN 210
outputs the mean and log-variance of the distribution of the clean speech feature as the speech uncertainty information.

The speech recognition apparatus 100 using statistical uncertainty modeling of claim 7, wherein the EUN 220
is a modified variational autoencoder (VAE) that models the probability distribution of the latent variable output from the encoder.

The speech recognition apparatus 100 using statistical uncertainty modeling of claim 7, wherein the EUN 220
outputs the mean and variance of the distribution of the latent variable as the environmental uncertainty information.

The speech recognition apparatus 100 using statistical uncertainty modeling of claim 7, wherein the UAT model 200
is an integrated model in which the CUN 210, the EUN 220, and the PN 230 are concatenated, and is configured by training it to estimate the phonetic target from the input speech.

The speech recognition apparatus 100 using statistical uncertainty modeling of claim 11, wherein the UAT model 200
has its performance improved by tuning the integrated model using the loss function of the PN 230.