KR20100056859A

KR20100056859A - Voice recognition apparatus and method

Info

Publication number: KR20100056859A
Application number: KR1020080115852A
Authority: KR
Inventors: 김홍국; 윤재삼; 박지훈
Original assignee: 광주과학기술원
Priority date: 2008-11-20
Filing date: 2008-11-20
Publication date: 2010-05-28
Also published as: KR101610708B1

Abstract

PURPOSE: A speech recognition apparatus and method are provided that compensate the acoustic model for the noise remaining after mask-based multi-channel sound source separation, thereby improving speech recognition performance. CONSTITUTION: A voice mask estimator (20) estimates a voice mask from a multi-channel speech signal. A noise component remover (30) removes noise components using the estimated voice mask. A speech synthesizer (40) synthesizes speech from the multi-channel speech signal from which the noise components have been removed. An acoustic model generator (200) estimates a signal-to-noise ratio (SNR) and a noise model using the voice mask, and generates an acoustic model adapted to the noise.

Description

Voice recognition apparatus and method

The present invention relates to a speech recognition apparatus and method operating in a noisy environment, and more particularly to a speech recognition apparatus and method that remove the noise components remaining after processing by a multi-channel sound source separation technique.

Speech is the most basic and natural means of human communication. The human auditory system can select and perceive only the desired sound in a variety of noisy environments. To do so, it uses the inter-aural time difference (ITD) and the inter-aural level difference (ILD) of the signals arriving through the two auditory nerves to find the direction of the source producing the desired sound, and then separates that sound from the sounds produced by other sources.

Similarly, computational auditory scene analysis (CASA), a mask-based multi-channel sound source separation technique, uses the ITD and ILD between the signals input to two microphones. A time-frequency mask is computed from the ITD and ILD, and the desired speech signal is separated by applying the mask to the noisy speech signal.

FIG. 1 shows an example of speech synthesized after noise removal using a Gaussian kernel-based mask. Gaussian kernel-based mask estimation is described in detail in D. L. Wang and G. J. Brown, Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, IEEE Press, Wiley-Interscience, 2006, and in N. Roman, D. L. Wang, and G. J. Brown, "Speech segregation based on sound localization," J. Acoust. Soc. Amer., vol. 114, no. 4, pp. 2236-2252, July 2003.

FIG. 1(a) shows a noise-free speech signal, FIG. 1(b) shows the speech signal with added noise (0 dB SNR), and FIG. 1(c) shows the speech signal after noise removal with the Gaussian kernel-based mask.

As FIG. 1 shows, comparing the noise-removed signal (c) with the noise-free speech signal (a) reveals that noise still remains. Because using a signal with such residual noise can degrade speech recognition performance, the residual noise needs to be compensated for.

Thus, in the conventional approach, an ideal voice mask cannot be obtained when the voice mask is estimated in a real environment, so residual noise arises and limits speech recognition performance.

In other words, a speech signal denoised by multi-channel sound source separation still contains noise, which limits the achievable improvement in speech recognition performance. The limitation is especially severe in low signal-to-noise ratio environments.

Accordingly, there is a strong need for a technique that further improves speech recognition performance by efficiently using the voice mask obtained in the conventional approach to derive the noise characteristics and estimate an acoustic model adapted to the noise.

The present invention has been proposed to solve the above problems and to meet the above need, and its object is to provide a speech recognition apparatus and method that improve speech recognition performance in a noisy environment.

To achieve the above object, a speech recognition apparatus according to an embodiment of the present invention includes: a voice mask estimator that estimates a voice mask from a multi-channel speech signal; a noise component remover that removes noise components using the voice mask estimated by the voice mask estimator; a speech synthesizer that synthesizes speech from the multi-channel speech signal from which the noise component remover has removed the noise components; an acoustic model generator that estimates a noise model and a signal-to-noise ratio using the voice mask and generates a noise-adapted acoustic model using the noise model and the signal-to-noise ratio; a feature extractor that extracts speech features from the speech signal output by the speech synthesizer; and a decoder that obtains a speech recognition result using the speech features obtained by the feature extractor and the noise-adapted acoustic model.

Here, the voice mask estimator includes: a gammatone filtering unit that separates an externally input speech signal into several frequency bands; an inter-channel time difference estimator that estimates the time difference between microphone channels from the signals separated by the gammatone filtering unit; an inter-channel level difference estimator that estimates the level difference between microphone channels from the signals separated by the gammatone filtering unit; and a voice mask calculator that obtains the voice mask using the inter-channel time difference and the inter-channel level difference.

Here, the acoustic model generator includes: a noise mask calculator that obtains a noise mask from the voice mask estimated by the voice mask estimator; a mask-based noise model estimator that obtains a noise model using the noise mask and the gammatone-filtered signal; a mask-based signal-to-noise ratio estimator that obtains a signal-to-noise ratio using the voice mask and the noise mask; and an acoustic model adaptor that obtains a noise-adapted acoustic model using the estimated noise model and the estimated signal-to-noise ratio.

Here, the mask-based noise model estimator includes: a noise mask calculator that obtains a noise mask using the voice mask; a speech remover that removes the speech component by multiplying the gammatone-filtered signal by the noise mask; a noise synthesizer that synthesizes a noise signal from the signal from which the speech component has been removed; a feature extractor that extracts features of the noise signal; and a noise model calculator that obtains a noise model using the features of the noise signal.

Here, the mask-based signal-to-noise ratio estimator includes: a voice mask average calculator that obtains the average value of the voice mask over the frequency channels; a noise mask average calculator that obtains the average value of the noise mask over the frequency channels; a noise frame detector that detects noise frames using the voice mask average value; and a signal-to-noise ratio calculator that obtains, for the noise frames, the mean of the frequency-channel voice mask averages and the mean of the frequency-channel noise mask averages, and obtains the signal-to-noise ratio as the ratio between these means.

In addition, a speech recognition method according to an embodiment of the present invention includes: estimating a voice mask from a multi-channel speech signal; removing noise components using the estimated voice mask; synthesizing speech from the multi-channel speech signal from which the noise components have been removed; estimating a noise model and a signal-to-noise ratio using the voice mask, and generating a noise-adapted acoustic model using the noise model and the signal-to-noise ratio; extracting speech features from the synthesized speech signal; and obtaining a speech recognition result using the speech features and the noise-adapted acoustic model.

Here, the voice mask estimation step includes: separating an externally input speech signal into several frequency bands using gammatone filtering; estimating the time difference between microphone channels from the separated signals; estimating the level difference between microphone channels from the separated signals; and obtaining the voice mask using the inter-channel time difference and the inter-channel level difference.

Here, the acoustic model generation step includes: obtaining a noise mask from the estimated voice mask; estimating a noise model using the noise mask and the gammatone-filtered signal; estimating a signal-to-noise ratio using the voice mask and the noise mask; and obtaining a noise-adapted acoustic model using the estimated noise model and the estimated signal-to-noise ratio.

Here, the step of estimating the noise model includes: obtaining a noise mask using the voice mask; removing the speech component by multiplying the gammatone-filtered signal by the noise mask; synthesizing a noise signal from the signal from which the speech component has been removed; extracting features of the noise signal; and obtaining a noise model using the features of the noise signal.

Here, the step of estimating the signal-to-noise ratio includes: obtaining the average value of the voice mask over the frequency channels; obtaining the average value of the noise mask over the frequency channels; detecting noise frames using the voice mask average value; and obtaining, for the noise frames, the mean of the frequency-channel voice mask averages and the mean of the frequency-channel noise mask averages, and obtaining the signal-to-noise ratio as the ratio between these means.

According to the present invention, speech recognition performance is improved by compensating the acoustic model for the noise left by the mask-based multi-channel sound source separation technique used for speech recognition.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to the specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. Like reference numerals are used for like elements throughout the description of the drawings.

Terms such as first, second, A, and B may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be termed a second component, and similarly a second component may be termed a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may be present in between. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other component is present in between.

The terminology used in this application is used only to describe particular embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to indicate the presence of a feature, number, step, operation, component, part, or combination thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 2 is a block diagram of an embodiment of a speech recognition apparatus according to the present invention.

The speech recognition apparatus according to the present invention includes an A/D converter (10), a voice mask estimator (20), a noise component remover (30), a speech synthesizer (40), an acoustic model generator (200), a feature extractor (310), and a Viterbi decoder (320). The part indicated by reference numeral 100 is the multi-channel sound source separation module, and the part indicated by reference numeral 300 is the speech decoding unit.

The A/D converter (10) converts the multi-channel speech signals input from external microphones (microphone 1 and microphone 2), for example a right-channel speech signal and a left-channel speech signal, into digital signals.

The voice mask estimator (20) is connected to the output of the A/D converter (10), receives the digitized multi-channel speech signal from the A/D converter (10), and estimates a voice mask from it. FIG. 3 shows the detailed configuration of the voice mask estimator (20).

Referring to FIG. 3, the voice mask estimator (20) includes a gammatone filtering unit (21), an inter-channel time difference estimator (22), a level calculator (23), an inter-channel level difference estimator (24), and a voice mask calculator (25).

The gammatone filtering unit (21) separates an externally input speech signal into several frequency bands. In this embodiment, for example, the gammatone filtering unit (21) separates the signals input from the two microphones by frequency band. The gammatone filtering unit (21) may be implemented as a filter bank.

The inter-channel time difference estimator (22) estimates the time difference between microphone channels from the signals separated by the gammatone filtering unit (21). The level calculator (23) obtains a signal level by taking the absolute value of each sample of the signals separated by the gammatone filtering unit (21) and summing the results.

The inter-channel level difference estimator (24) estimates the level difference between microphone channels by taking the difference between the signal levels output by the level calculator (23).
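The level computation and the inter-channel level difference described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and taking a plain difference of linear levels is an assumption: the text does not state whether the difference is formed in the linear or the logarithmic domain.

```python
def frame_level(samples):
    """Signal level of one band-limited frame: sum of absolute sample values."""
    return sum(abs(x) for x in samples)


def inter_channel_level_difference(frame_ch1, frame_ch2):
    """Level difference between two microphone channels for one band and frame.

    Assumption: a plain difference of linear levels (not a dB ratio).
    """
    return frame_level(frame_ch1) - frame_level(frame_ch2)
```

A positive result means the first channel carries more energy in that band and frame than the second.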

The voice mask calculator (25) then calculates the voice mask using the inter-channel time difference and the inter-channel level difference. For example, the voice mask calculator (25) calculates the voice mask based on a Gaussian kernel. The voice mask takes a value between 0 and 1. Hereafter, the voice mask is expressed as a function of a frequency channel index and a frame index.

Referring again to FIG. 2, the noise component remover (30) uses the voice mask estimated by the voice mask estimator (20) to remove the noise components from the noisy speech signal. The speech synthesizer (40) synthesizes speech from the noise-removed signal output by the noise component remover (30). The speech synthesizer (40) preferably uses the method disclosed in M. Weintraub, A Theory and Computational Model of Monaural Auditory Sound Separation, Ph.D. Thesis, Stanford University, 1985.

In this way, the speech recognition apparatus according to the present invention first removes noise components from the speech signal using a multi-channel sound source separation technique. However, the speech signal output by the speech synthesizer (40) still contains residual noise components, as shown in FIG. 1(c).

To remove the residual noise components, the present invention efficiently estimates a noise model and a signal-to-noise ratio using the voice mask already used in the multi-channel sound source separation, and improves speech recognition performance by estimating a noise-adapted acoustic model from the noise model and the signal-to-noise ratio. To estimate the noise-adapted acoustic model, the present invention includes the acoustic model generator (200).

The acoustic model generator (200) generates an acoustic model that compensates for the residual noise components of the speech signal. Specifically, the acoustic model generator (200) estimates an acoustic model adapted to the noisy environment (noise-corrupted model) by applying SNR weights to an acoustic model trained in a noise-free environment (clean-trained model) and to a noise model.

To this end, the acoustic model generator (200) includes a component that estimates the noise model (mask-based noise model estimation; MBNME) and a component that estimates the SNR (mask-based SNR estimation; MBSE). The detailed configuration of the acoustic model generator (200) is described below.

FIG. 4 shows the configuration of the acoustic model generator (200) of FIG. 2.

Referring to FIG. 4, the acoustic model generator (200) includes a mask-based noise model estimator (210), a mask-based signal-to-noise ratio (SNR) estimator (220), an acoustic model adaptor (230), a noise-free acoustic model (240), and a noise-adapted acoustic model (250).

The mask-based noise model estimator (210) computes a noise mask from the voice mask and obtains a noise model using the noise mask and the gammatone-filtered signal. Specifically, the mask-based noise model estimator (210) synthesizes a noise signal using the noise mask derived from the voice mask obtained by the voice mask estimator (20), and estimates the noise model from that noise signal.

FIG. 5 is a block diagram of an embodiment of the mask-based noise model estimator (210) of FIG. 4. The mask-based noise model estimator (210) estimates the noise model using the voice mask information provided by the voice mask estimator (20) and the signal of any one microphone filtered by the gammatone filtering unit (21).

Referring to FIG. 5, the mask-based noise model estimator (210) consists of a noise mask calculator (211), a speech remover (212), a noise synthesizer (213), a feature extractor (214), and a noise model calculator (215). To estimate the noise model, the noise mask calculator (211) first obtains the noise mask from the voice mask. Specifically, the noise mask calculator (211) obtains the noise mask by subtracting the voice mask obtained in FIG. 3 from the value 1, as in Equation 1.
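The subtraction described for Equation 1 can be sketched in Python as follows. The nested-list layout (one list of per-channel values per frame) and the function name are assumptions made for illustration only.

```python
def noise_mask(voice_mask):
    """Return the noise mask, 1 minus the voice mask, for every time-frequency unit.

    voice_mask[j][i] is the voice mask value in [0, 1] for frame j and
    frequency channel i (layout assumed for this sketch).
    """
    return [[1.0 - m for m in frame] for frame in voice_mask]
```

Because the voice mask lies between 0 and 1, the noise mask does too, and the two masks sum to 1 in every time-frequency unit.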

Next, the speech remover (212) removes the speech component by multiplying the gammatone-filtered signal of microphone 1 by the noise mask. Any channel of the multi-channel signal may be used here: although the signal of microphone 1 is used in this embodiment, the signal of microphone 2 or any other channel of the multi-channel signal may be used instead.
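As a rough sketch of this multiplication, the following Python function scales each time-frequency unit of one gammatone-filtered channel by the corresponding noise mask value. The data layout and the function name are hypothetical; the text only states that the mask and the filtered signal are multiplied.

```python
def remove_speech(noise_mask, band_frames):
    """Scale every time-frequency unit of one channel by the noise mask.

    noise_mask[j][i]  : noise mask value for frame j, frequency channel i
    band_frames[j][i] : list of gammatone-filtered samples for that unit
    (this nested-list layout is an assumption made for the sketch)
    """
    return [
        [[n * x for x in unit] for n, unit in zip(mask_row, frame_row)]
        for mask_row, frame_row in zip(noise_mask, band_frames)
    ]
```

Units dominated by speech have a noise mask value near 0 and are attenuated, while noise-dominated units pass almost unchanged.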

The noise synthesizer (213) then synthesizes a noise signal from the signal from which the speech component has been removed and outputs the noise signal to the feature extractor (214). The feature extractor (214) extracts Mel-frequency cepstral coefficients (MFCCs) of order k = 1, …, K from the noise signal. The MFCC order K may be chosen differently depending on the system and application. The noise model calculator (215) obtains the mean and the variance of the extracted MFCCs as in Equation 2 and uses them as the noise model. The mean and the variance in Equation 2 are taken over the total number of frames.
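The mean and variance computation described for Equation 2 can be sketched as below. Equation 2 itself is not reproduced in this text, so dividing by the total number of frames for the variance (rather than by that number minus one) is an assumption, as are the function name and the input layout.

```python
def noise_model(mfcc_frames):
    """Per-coefficient mean and variance of MFCC vectors over all frames.

    mfcc_frames[j][k] is the k-th MFCC of frame j.
    Assumption: the variance divides by the total number of frames.
    """
    n = len(mfcc_frames)
    order = len(mfcc_frames[0])
    mean = [sum(f[k] for f in mfcc_frames) / n for k in range(order)]
    var = [
        sum((f[k] - mean[k]) ** 2 for f in mfcc_frames) / n
        for k in range(order)
    ]
    return mean, var
```

The MFCC extraction itself is outside this sketch; any front end that yields one coefficient vector per frame would fit.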

Referring again to FIG. 4, the mask-based SNR estimator (220) obtains the signal-to-noise ratio using the voice mask and the noise mask. Specifically, it estimates the SNR as the ratio between the average value of the voice mask obtained by the voice mask estimator (20) and the average value of the noise mask obtained from the mask-based noise model estimator (210).

FIG. 6 is a block diagram of an embodiment of the mask-based SNR estimator (220) of FIG. 4. The signal-to-noise ratio is estimated using the voice mask calculated by the voice mask estimator (20) of FIG. 2 and the noise mask calculated by the mask-based noise model estimator (210) of FIG. 5.

Referring to FIG. 6, the mask-based SNR estimator (220) consists of a voice mask average calculator (221), a noise mask average calculator (222), a noise frame detector (223), and a signal-to-noise ratio calculator (224).

The voice mask average calculator (221) obtains, using Equation 3, the average value of the voice mask over the frequency channels from the voice mask estimated by the voice mask estimator (20) of FIG. 2.

In Equation 3, the average is taken over the number of channels of the gammatone filtering unit. The number of channels may be 32 and may vary with the application or implementation. Similarly, the noise mask average calculator (222) obtains, using Equation 4, the average value of the noise mask over the frequency channels from the noise mask calculated by the mask-based noise model estimator (210) of FIG. 4.
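The per-frame averaging over frequency channels described for Equations 3 and 4 can be sketched with one Python helper applied to either mask. The function name and the list layout are assumptions made for illustration.

```python
def channel_mean(mask):
    """Average a mask over the frequency channels, one value per frame.

    mask[j][i] is the mask value for frame j and frequency channel i;
    the number of channels is taken from the length of each frame.
    """
    return [sum(frame) / len(frame) for frame in mask]
```

Calling it on the voice mask gives the per-frame voice mask average, and calling it on the noise mask gives the per-frame noise mask average; with 32 gammatone channels each frame would hold 32 values.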

To obtain an accurate signal-to-noise ratio, the noise frame detector (223) detects noise frames using the voice mask average value obtained by the voice mask average calculator (221). To this end, the noise frame detector (223) first obtains, as in Equation 5, the time-frequency mean and the variance of the voice mask from the initial 20 frames.

Here, the number of initial frames is 20 in this embodiment. This number may vary with the application or implementation, and the present invention is not limited to it. Next, the noise frame detector 223 sets a constant chosen so that about 90% of the initial 20 frames (that is, about 18 frames) are selected as noise frames, and computes the threshold as in [Equation 6].

Next, the noise frame detector 223 uses [Equation 7] to collect the set of frames whose speech mask frequency-time average does not exceed this threshold. These frames form the noise frame set.

The signal-to-noise ratio calculator 224 then computes, from [Equation 8], the speech mask frequency average g_S and the noise mask frequency average g_N over the noise frame set, starting from the per-frame speech mask and noise mask frequency averages.

Here, each average is normalized by the number of frames belonging to the noise frame set. Finally, the signal-to-noise ratio (SNR), g, is obtained as in [Equation 9].
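The SNR estimation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's exact formulas: the function names `channel_mean` and `estimate_snr`, the mask layout `mask[t][c]` (frame `t`, channel `c`), and the threshold form "mean plus `alpha` times standard deviation" are all assumptions made for this sketch. The value `alpha=1.28` is chosen only because, for Gaussian-distributed values, roughly 90% fall below the mean plus 1.28 standard deviations.

```python
from statistics import mean, pstdev


def channel_mean(mask):
    """Average a mask over frequency channels, giving one value per frame.

    mask[t][c] is the mask value of frame t in gamma-tone channel c.
    """
    return [sum(frame) / len(frame) for frame in mask]


def estimate_snr(speech_mask, noise_mask, n_init=20, alpha=1.28):
    """Return (snr, noise_frames) for a speech mask and a noise mask."""
    # Per-frame averages over the frequency channels.
    s_bar = channel_mean(speech_mask)
    n_bar = channel_mean(noise_mask)

    # Mean and spread of the speech mask average over the initial frames.
    init = s_bar[:n_init]
    mu = mean(init)
    sigma = pstdev(init)

    # ASSUMPTION: threshold = mean + alpha * standard deviation.
    threshold = mu + alpha * sigma

    # Noise frames: speech mask average does not exceed the threshold.
    noise_frames = [t for t, v in enumerate(s_bar) if v <= threshold]
    if not noise_frames:
        raise ValueError("no noise frames detected")

    # Averages over the noise frame set, then their ratio.
    g_s = mean([s_bar[t] for t in noise_frames])
    g_n = mean([n_bar[t] for t in noise_frames])
    if g_n == 0:
        raise ValueError("noise mask average is zero")
    return g_s / g_n, noise_frames
```

The ratio is returned on a linear scale, following the text's description of the SNR as a ratio between the two averages; converting to decibels would be a separate step.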

Referring back to FIG. 4, the acoustic model adaptor 230 obtains the noise-adapted acoustic model 250 from the noise model estimated by the mask-based noise model estimator 210 and the signal-to-noise ratio estimated by the mask-based SNR estimator 220. In other words, using the estimated noise model and the estimated signal-to-noise ratio g, the acoustic model adaptor 230 adapts the noise-free acoustic model 240, trained in advance on clean speech, to the noise and estimates the new model 250 as in [Equation 10].

For details, see M. Gales and S. J. Young, "Robust continuous speech recognition using parallel model combination," IEEE Trans. Speech and Audio Proc., vol. 4, no. 5, pp. 352-359, Sept. 1996.
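As a rough illustration of how a clean-speech model and a noise model can be combined, the sketch below uses the log-add approximation commonly associated with parallel model combination, in which only the means are combined in the log-spectral domain. It is not a reproduction of [Equation 10]. The function name `adapt_means_log_add` and the use of `g` as a linear gain on the speech term are assumptions for this example. Models stored as MFCCs would first need to be mapped to the log-spectral domain and back, which is omitted here.

```python
import math


def adapt_means_log_add(clean_mean, noise_mean, g=1.0):
    """Combine clean-speech and noise means given in the log-spectral domain.

    ASSUMPTION: log-add approximation, mean only; variances are left
    unchanged and g acts as a linear gain on the speech term.
    """
    return [
        math.log(g * math.exp(s) + math.exp(n))
        for s, n in zip(clean_mean, noise_mean)
    ]
```

When the noise mean is far below the speech mean, the adapted mean stays close to the clean mean; when the two are equal and `g` is 1, the adapted mean rises by log 2.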

Referring back to FIG. 2, the speech decoding unit 300 extracts speech features, for example Mel-Frequency Cepstral Coefficients (MFCCs), from the speech signal output by the speech synthesizer 40, and performs Viterbi decoding using those features and the acoustic model output by the acoustic model generator 200 to produce the speech recognition result.

Specifically, the speech decoding unit 300 includes a feature extractor 310 and a Viterbi decoder 320. The feature extractor 310 extracts speech features, for example MFCCs, from the speech signal output by the speech synthesizer 40, from which noise has already been removed in a first pass. The Viterbi decoder 320 matches the MFCCs extracted by the feature extractor 310 against the noise-adapted acoustic model obtained by the acoustic model adaptor 230 and returns the word or sentence string with the highest probability as the speech recognition result. Viterbi decoding is a standard way to obtain a speech recognition result: it pattern-matches the MFCCs against the acoustic models, selects the acoustic model with the greatest probability, and outputs the phoneme, word, or sentence string corresponding to that model.
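For readers unfamiliar with Viterbi decoding, the following minimal sketch finds the most likely state sequence of a small hidden Markov model in the log domain. The function name `viterbi` and the argument layout (`log_emit[t][j]` as the log-likelihood of frame `t` under state `j`) are assumptions for illustration. A real recognizer would compute the emission scores from MFCCs and Gaussian mixtures and would search over a word network rather than a single model.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return (best_path, best_log_score) for an HMM given log-probabilities.

    log_init[j]     : log-probability of starting in state j
    log_trans[i][j] : log-probability of moving from state i to state j
    log_emit[t][j]  : log-likelihood of frame t under state j
    """
    n_states = len(log_init)
    score = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    back = []  # back[t-1][j] = best predecessor of state j at frame t

    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + log_emit[t][j])
            ptr.append(best_i)
        score = new_score
        back.append(ptr)

    # Trace back from the best final state.
    last = max(range(n_states), key=lambda j: score[j])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]
```

Working in log-probabilities avoids numerical underflow on long utterances, and forbidden transitions can be represented with negative infinity.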

FIG. 7 is a flowchart of a method for improving speech recognition performance in a speech recognition apparatus according to the present invention.

Referring to FIG. 7, the speech recognition apparatus determines in step 410 whether a multi-channel speech signal has been input. Specifically, the apparatus determines whether a speech signal is being input from each of a plurality of microphones. When a multi-channel speech signal is input, the apparatus estimates a speech mask from the multi-channel speech signal in step 420.

Specifically, in step 420 the speech recognition apparatus applies gamma-tone filtering to the speech signal received from the outside to separate it into several frequency bands. The apparatus estimates the time difference and the level difference between microphone channels from the separated signals, and uses these two differences to estimate the speech mask.

After estimating the speech mask, the speech recognition apparatus removes the noise components from the multi-channel speech signal using the speech mask in step 430, and synthesizes speech from the noise-removed multi-channel speech signal. After synthesizing the speech, the apparatus extracts speech features, for example Mel-Frequency Cepstral Coefficients (MFCCs), from the synthesized speech in step 440.

Then, in step 450, the speech recognition apparatus estimates a noise model and a signal-to-noise ratio using the speech mask, and generates an acoustic model adapted to the noise.

To estimate the noise model, the speech recognition apparatus derives a noise mask from the speech mask and removes the speech components by multiplying the gamma-tone filtered signal by the noise mask. The apparatus synthesizes a noise signal from the speech-removed signal and extracts features, namely MFCCs, from the synthesized noise signal. The apparatus then estimates the noise model from these features.
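The noise model estimation steps above can be sketched as follows. Several points are assumptions made for this illustration only: the noise mask is taken as the complement of the speech mask, the noise model is a single diagonal Gaussian (per-dimension mean and variance of the feature vectors), and the helper names are hypothetical. Noise signal synthesis and MFCC extraction are omitted, so `gaussian_noise_model` simply takes already-extracted feature vectors.

```python
def noise_mask_from_speech_mask(speech_mask):
    """ASSUMPTION: the noise mask is the complement of the speech mask."""
    return [[1.0 - m for m in frame] for frame in speech_mask]


def apply_mask(mask, band_signal):
    """Multiply each gamma-tone band value by its mask value.

    Both arguments are indexed as [frame][channel].
    """
    return [[m * x for m, x in zip(mask_frame, sig_frame)]
            for mask_frame, sig_frame in zip(mask, band_signal)]


def gaussian_noise_model(features):
    """Per-dimension mean and variance of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n
           for d in range(dim)]
    return mean, var
```

Applying the noise mask suppresses time-frequency cells dominated by speech, so the features passed to `gaussian_noise_model` describe mainly the residual noise.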

To estimate the signal-to-noise ratio, the speech recognition apparatus can use the ratio between the average values of the speech mask and the noise mask. To this end, the apparatus computes the average of the speech mask over the frequency channels and the average of the noise mask over the frequency channels. It then detects noise frames using the speech mask average: it computes a threshold for noise frames based on the frequency-time average of the speech mask, and classifies as noise frames those frames whose speech mask frequency-time average does not exceed the threshold. Finally, the apparatus computes the speech mask average and the noise mask average over these noise frames and takes the ratio between them as the signal-to-noise ratio.

After estimating the noise model and the signal-to-noise ratio in this way, the speech recognition apparatus uses them, together with the noise-free acoustic model trained in advance on clean speech, to estimate the acoustic model adapted to the noise.

In step 470, the speech recognition apparatus obtains the speech recognition result by performing Viterbi decoding using the noise-adapted acoustic model and the speech features.

FIG. 8 shows the recognition performance of the present invention as word error rate (%), with female read speech as the noise, measured on a word speech database (S. Kim, S. Oh, H.-Y. Jung, H.-B. Jeong, and J.-S. Kim, "Common speech database collection," Proceedings of the Acoustical Society of Korea, vol. 21, no. 1, pp. 21-24, July 2002). The acoustic models of the speech recognition system were trained on 18,240 word utterances, and 570 word utterances were used for the recognition test. Head-related transfer functions were applied so that the speech signal was located at 0° and the noise signal at 10°, 20°, and 40°. The speech recognition system is based on triphone hidden Markov models: each triphone is represented by a left-to-right model with three states, each state has four Gaussian mixture components, and the vocabulary contains 2,250 words. The systems compared in FIG. 8 are the baseline using MFCCs without noise processing (Baseline), mask-based multi-channel source separation (MMSS), and the present invention (MMSS+AMC, where AMC stands for acoustic model combination). With MMSS alone, the word error rate is very low at high signal-to-noise ratios but relatively high at low signal-to-noise ratios. This is attributed to the residual noise that remains after mask-based multi-channel source separation. The present invention, which compensates for this residual noise, lowers the word error rate considerably even at low signal-to-noise ratios. As a result, the present invention reduced the word error rate by 52.14% relative to the mask-based multi-channel source separation technique.
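A relative reduction such as the one quoted above is conventionally computed as the drop in error rate divided by the reference error rate. The helper below is a small sketch of that arithmetic; the function name and the example numbers are hypothetical and are not taken from FIG. 8.

```python
def relative_reduction(reference_wer, improved_wer):
    """Relative word error rate reduction in percent."""
    if reference_wer <= 0:
        raise ValueError("reference error rate must be positive")
    return 100.0 * (reference_wer - improved_wer) / reference_wer


# Hypothetical example: a drop from 20% to 10% is a 50% relative reduction.
example = relative_reduction(20.0, 10.0)
```

A relative figure of this kind says how much of the reference system's errors were removed, which is different from the absolute difference in percentage points.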

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

FIG. 1 shows an example of a synthesized speech signal from which noise has been removed using a Gaussian kernel-based mask.

FIG. 2 is a block diagram of an embodiment of a speech recognition apparatus according to the present invention.

FIG. 3 is a block diagram of an embodiment of the speech mask estimator of FIG. 2.

FIG. 4 is a block diagram of an embodiment of the acoustic model generator of FIG. 2.

FIG. 5 is a block diagram of an embodiment of the mask-based noise model estimator of FIG. 4.

FIG. 6 is a block diagram of an embodiment of the mask-based SNR estimator of FIG. 4.

FIG. 7 is a flowchart of a speech recognition method according to an embodiment of the present invention.

FIG. 8 shows the recognition performance of the present invention as word error rate (%), with female read speech as the noise, measured on a word speech database.

<Explanation of symbols for main parts of the drawings>

10: A/D converter

20: Speech mask estimator

30: Noise component remover

40: Speech synthesizer

200: Acoustic model generator

310: Feature extractor

320: Viterbi decoder

Claims

1. A speech recognition apparatus comprising:

a speech mask estimator for estimating a speech mask from a multi-channel speech signal;

a noise component remover for removing noise components using the speech mask estimated by the speech mask estimator;

a speech synthesizer for synthesizing speech using the multi-channel speech signal from which the noise components have been removed by the noise component remover;

an acoustic model generator for estimating a noise model and a signal-to-noise ratio using the speech mask, and for generating an acoustic model adapted to noise using the noise model and the signal-to-noise ratio;

a feature extractor for extracting a speech feature from the speech signal output from the speech synthesizer; and

a decoding unit for obtaining a speech recognition result using the speech feature obtained by the feature extractor and the acoustic model adapted to noise.

2. The apparatus of claim 1, wherein the speech mask estimator comprises:

a gamma-tone filtering unit for separating a speech signal input from the outside into several frequency bands;

an inter-channel time difference estimator for estimating a time difference between microphone channels from the signal separated by the gamma-tone filtering unit;

an inter-channel level difference estimator for estimating a level difference between microphone channels from the signal separated by the gamma-tone filtering unit; and

a speech mask calculator for obtaining a speech mask using the time difference between the microphone channels and the level difference between the microphone channels.

3. The apparatus of claim 2, wherein the acoustic model generator comprises:

a noise mask calculator for obtaining a noise mask from the speech mask estimated by the speech mask estimator;

a mask-based noise model estimator for obtaining a noise model using the noise mask and a gamma-tone filtered signal;

a mask-based signal-to-noise ratio estimator for obtaining a signal-to-noise ratio using the speech mask and the noise mask; and

an acoustic model adaptor for obtaining an acoustic model adapted to noise using the estimated noise model and the estimated signal-to-noise ratio.

4. The apparatus of claim 3, wherein the mask-based noise model estimator comprises:

a noise mask calculator for obtaining a noise mask using the speech mask;

a speech remover for removing a speech component by multiplying the gamma-tone filtered signal by the noise mask;

a noise synthesizer for synthesizing a noise signal from the signal from which the speech component has been removed;

a feature extractor for extracting a feature of the noise signal; and

a noise model calculator for obtaining a noise model using the feature of the noise signal.

5. The apparatus of claim 4, wherein the mask-based signal-to-noise ratio estimator comprises:

a speech mask averaging unit for obtaining an average value of the speech mask over the frequency channels;

a noise mask averaging unit for obtaining an average value of the noise mask over the frequency channels;

a noise frame detector for detecting a noise frame using the average value of the speech mask; and

a signal-to-noise ratio calculator for obtaining, for the noise frame, a speech mask frequency average value and a noise mask frequency average value, and for obtaining a signal-to-noise ratio as the ratio between these average values.

6. A speech recognition method comprising:

estimating a speech mask from a multi-channel speech signal;

removing noise components using the estimated speech mask;

synthesizing speech using the multi-channel speech signal from which the noise components have been removed;

estimating a noise model and a signal-to-noise ratio using the speech mask, and generating an acoustic model adapted to noise using the noise model and the signal-to-noise ratio;

extracting a speech feature from the synthesized speech signal; and

obtaining a speech recognition result using the speech feature and the acoustic model adapted to noise.

7. The method of claim 6, wherein estimating the speech mask comprises:

separating a speech signal received from the outside into several frequency bands using gamma-tone filtering;

estimating a time difference between microphone channels from the separated signal;

estimating a level difference between microphone channels from the separated signal; and

obtaining a speech mask using the time difference between the microphone channels and the level difference between the microphone channels.

8. The method of claim 6, wherein generating the acoustic model comprises:

obtaining a noise mask from the estimated speech mask;

estimating a noise model using the noise mask and a gamma-tone filtered signal;

estimating a signal-to-noise ratio using the speech mask and the noise mask; and

obtaining an acoustic model adapted to noise using the estimated noise model and the estimated signal-to-noise ratio.

9. The method of claim 8, wherein estimating the noise model comprises:

obtaining a noise mask using the speech mask;

removing a speech component by multiplying the gamma-tone filtered signal by the noise mask;

synthesizing a noise signal from the signal from which the speech component has been removed;

extracting a feature of the noise signal; and

obtaining a noise model using the feature of the noise signal.

10. The method of claim 9, wherein estimating the signal-to-noise ratio comprises:

obtaining an average value of the speech mask over the frequency channels;

obtaining an average value of the noise mask over the frequency channels;

detecting a noise frame using the average value of the speech mask; and

obtaining, for the noise frame, a speech mask frequency average value and a noise mask frequency average value, and obtaining a signal-to-noise ratio as the ratio between these average values.