KR20000025292A - Method for extracting voice characteristic suitable for core word detection in noise circumstance - Google Patents

Method for extracting voice characteristic suitable for core word detection in noise circumstance

Info

Publication number
KR20000025292A
Authority
KR
South Korea
Prior art keywords
cepstrum
noise
signal
voice
local
Prior art date
Application number
KR1019980042317A
Other languages
Korean (ko)
Inventor
이교혁
강현우
Original Assignee
김영환
현대전자산업 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 김영환, 현대전자산업 주식회사 filed Critical 김영환
Priority to KR1019980042317A priority Critical patent/KR20000025292A/en
Publication of KR20000025292A publication Critical patent/KR20000025292A/en

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/0308Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/84Detection of presence or absence of voice signals for discriminating voice from noise

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

PURPOSE: To provide a method of extracting speech features suited to keyword detection in a noisy environment. CONSTITUTION: When speech is input (ST1), the spectral tilt of the input speech is flattened and the dynamic range of the signal is suppressed by pre-emphasis, which raises the signal-to-noise ratio (ST2). The pre-emphasized signal is multiplied by a window function (ST3), and a synthesized signal is obtained by passing it through the filter of an LPC analysis (ST4). Cepstral analysis is then performed to obtain a local mean of the cepstrum (ST5), and the noise-induced distortion is compensated using the local cepstral mean (ST7).

Description

Speech Feature Extraction Method for Keyword Detection in Noisy Environments

The present invention relates to speech recognition, and more particularly to a noise-compensation method suited to keyword detection in speech recognition, intended for applications in which keyword detection must be performed in a noisy environment.

In general, speech recognition is the technology of recognizing human speech. Within it, keyword detection (keyword spotting) is the technique of detecting a predetermined set of keywords in continuous, naturally pronounced speech with no restriction on vocabulary. Keyword detection avoids both the inconvenience that isolated-word recognition imposes on the user and the poor performance of full continuous-speech recognition, and it can be applied effectively in the many fields where detecting only the key terms of a continuous utterance is enough to convey the meaning, for example telephone switching and directory services or various information-retrieval services.

Speech recognition over the telephone network is one of the most promising applications of speech recognition. It raises many problems, however, that never arise when recognizing high-quality speech in a laboratory environment. In the real environments where speech recognition, including keyword detection, is deployed, there are distortion caused by the limited channel bandwidth, distortion caused by the handset microphone characteristics, and ambient background noise. Effective removal of this ambient noise is essential before any speech recognition system can be used in real life, and improving recognition performance over the telephone network therefore requires effective compensation for the various distortions and for the background noise.

If the various distortions, such as those caused by the limited channel bandwidth and by the handset microphone characteristics, are regarded together as a single channel distortion, then speech received over the telephone network can be modeled as the input speech subjected to channel distortion plus additive noise. Assuming that the characteristics of this distortion do not change while the user is speaking, the distortion can be modeled by a single linear time-invariant filter.

Expressed in the time domain and in the frequency domain, this distortion is given by Equations 1 and 2, respectively:

z(n) = x(n) * h(n)    (Equation 1)

Z(ω) = X(ω) H(ω)    (Equation 2)

where x(n) is the input speech, h(n) is the impulse response of the channel (the spectral envelope function), and z(n) is the observed speech signal, expressed as the convolution of the input speech with the channel response. Equation 2 restates the time-domain signal of Equation 1 in the frequency domain.

Expressing Equations 1 and 2 in the log-frequency domain and in its inverse Fourier transform domain, the cepstrum domain (the word "cepstrum" is formed by reversing the first syllable of "spectrum"), gives Equations 3 and 4:

log Z(ω) = log X(ω) + log H(ω)    (Equation 3)

z_c = x_c + h_c    (Equation 4)

where z_c is the observed cepstral vector, h_c is the cepstral vector of the channel filter, and x_c is the cepstral vector of the input speech. Removing h_c, which appears in Equation 4 as an unknown bias term, can be expected to improve recognition performance. Distortion-compensation methods proposed for this purpose include Cepstral Mean Subtraction (CMS), Codeword-Dependent Cepstral Normalization (CDCN), Signal Bias Removal (SBR), and RASTA (RelAtive SpecTrAl) processing. All of these are general-purpose compensation methods that do not take the special situation of keyword detection into account; keyword detection calls for a more specific distortion-compensation method.

Among the conventional distortion-compensation methods, the purpose of the CMS method is to remove the bias term h_c of Equation 4 from the observed cepstral vectors. Writing z_t for the observed cepstral vector of frame t, the mean vector is given by Equation 5:

m = (1/T) Σ_{t=1..T} z_t    (Equation 5)

where T is the length of the observation sequence in frames. The vector with the channel distortion compensated by the CMS method is then given by Equation 6:

ẑ_t = z_t - m    (Equation 6)

A vector compensated in this way is unaffected by the channel-induced bias.

This method has the drawback, however, that in both training and recognition the mean vector must be computed over the entire input utterance, including the keyword portion, which makes real-time processing difficult.

Moreover, when this method is used for keyword detection, the mean is taken over the entire input utterance, so the non-keyword speech influences the modeling of the keyword models and degrades their precision.
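
As a concrete illustration, the CMS procedure of Equations 5 and 6 can be sketched in Python as follows. The function and variable names are illustrative, not taken from the patent:

```python
def cms(cepstra):
    """Cepstral Mean Subtraction: subtract the utterance-wide mean
    (Equation 5) from every observed cepstral vector (Equation 6)."""
    T = len(cepstra)                       # number of frames in the utterance
    dim = len(cepstra[0])                  # cepstral order
    # Equation 5: mean vector over the whole utterance
    mean = [sum(frame[d] for frame in cepstra) / T for d in range(dim)]
    # Equation 6: subtract the mean from each frame
    return [[frame[d] - mean[d] for d in range(dim)] for frame in cepstra]

# A constant additive cepstral bias models a fixed channel (Equation 4):
clean = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
bias = [0.5, -0.2]
observed = [[c + b for c, b in zip(frame, bias)] for frame in clean]
compensated = cms(observed)
```

The usage at the end illustrates the point of the method: because the bias is constant over the utterance, compensating the biased features yields the same result as compensating the clean ones, so the channel bias is removed exactly.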

The present invention is proposed to resolve these conventional problems. Its object is to provide a speech-feature extraction method suited to keyword detection in a noisy environment, one that removes the ambient noise in speech recognition so that keywords can be detected.

To achieve this object, the speech-feature extraction method for keyword detection in a noisy environment according to the present invention is characterized by performing:

a speech-feature extraction step that, when speech is input, extracts a local mean of the cepstrum through cepstral analysis of the observations; and a distortion-compensation step that, after the speech-feature extraction step, compensates for the noise-induced distortion using the local cepstral mean.

Fig. 1 is a block diagram of a speech recognition system to which the present invention is applied.

Fig. 2 is a flowchart showing the speech-feature extraction method for keyword detection in a noisy environment according to the present invention.

<Description of reference numerals for the main parts of the drawings>

10: speech-feature extractor    20: distortion compensator

30: HMM model unit    40: pattern recognizer

An embodiment according to the technical idea of the speech-feature extraction method described above is now explained in detail with reference to the accompanying drawings.

The present invention proposes Local Cepstral Mean Subtraction (LCMS) as a distortion-compensation method matched to the characteristics of keyword detection. Like CMS, the LCMS method aims to remove the bias term h_c of Equation 4 from the observation vectors. Whereas CMS defines the bias as the mean over the entire input utterance, LCMS takes a moving average of the observation vectors as the mean vector. The moving average taken from the observed cepstral vectors is given by Equation 7:

m_t = (1/T_L) Σ_{τ=t-T_L+1..t} z_τ    (Equation 7)

where T_L is the length in frames of the window over which the moving average is taken. The vector with the channel distortion compensated by the LCMS method is then given by Equation 8:

ẑ_t = z_t - m_t    (Equation 8)

Because the LCMS method takes a moving average, it can remove much of the effect of the channel distortion even when that distortion is not fixed but changes slowly over time. Like CMS, LCMS removes the mean of the input speech feature vectors along with the channel bias, but it has the advantage of allowing real-time processing. When this method is used for keyword detection, the moving average serves as the mean vector, so the non-keyword portions have almost no influence on the modeling of the keyword models. Compared with conventional distortion-compensation methods, this makes LCMS better suited to keyword detection, where the precision of the keyword models matters most.
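
The LCMS update can be sketched the same way. The trailing window below (the current frame plus the T_L - 1 preceding frames) is an assumption that matches the real-time claim; the patent does not fix the exact window placement or length:

```python
def lcms(cepstra, window=100):
    """Local Cepstral Mean Subtraction: subtract a moving average
    (Equation 7) from each observed cepstral vector (Equation 8).
    `window` plays the role of T_L (an illustrative default)."""
    dim = len(cepstra[0])
    out = []
    for t, frame in enumerate(cepstra):
        lo = max(0, t - window + 1)        # trailing window of at most T_L frames
        local = cepstra[lo:t + 1]
        # Equation 7: local mean over the window ending at frame t
        mean = [sum(f[d] for f in local) / len(local) for d in range(dim)]
        # Equation 8: subtract the local mean
        out.append([frame[d] - mean[d] for d in range(dim)])
    return out
```

Because each output frame depends only on the current and past frames, the subtraction can run as frames arrive, unlike CMS, which must see the whole utterance first.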

Fig. 1 is a block diagram of a speech recognition system to which the present invention is applied.

As shown there, the system comprises: a speech-feature extractor 10 that extracts features useful for recognition from the input speech; a distortion compensator 20 that compensates for noise-induced distortion at the output of the speech-feature extractor 10; a Hidden Markov Model (HMM) unit 30 that probabilistically models the changing statistical characteristics of the input speech; and a pattern recognizer 40 that, according to the models of the HMM unit 30, recognizes patterns in the noise-compensated feature vectors output by the distortion compensator 20 and outputs the recognized words.

Fig. 2 is a flowchart showing the speech-feature extraction method for keyword detection in a noisy environment according to the present invention.

As shown there, the method performs a speech-feature extraction stage (ST1 - ST5) that, when speech is input, extracts the local mean of the cepstrum through cepstral analysis of the observations, followed by a distortion-compensation stage (ST6, ST7) that compensates for the noise-induced distortion using the local cepstral mean.

When speech is input, pre-emphasis is first performed: flattening the spectral tilt suppresses the dynamic range of the signal and raises the signal-to-noise ratio (SNR) (ST1, ST2). Windowing is then performed by multiplying the pre-emphasized signal by a window function (ST3). Next, Linear Prediction Coefficient (LPC) analysis is carried out, in which the speech is represented as an excitation signal passed through the articulation filter obtained by the linear prediction method (ST4).
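
The pre-emphasis and windowing steps (ST2, ST3) are commonly realized as a first-order difference filter followed by a Hamming window. The coefficient 0.97 and the Hamming shape are conventional defaults, not values stated in the patent:

```python
import math

def preemphasize(signal, a=0.97):
    """First-order pre-emphasis, y[n] = x[n] - a * x[n-1]: boosts high
    frequencies, flattening the spectral tilt of voiced speech (ST2)."""
    return [signal[0]] + [signal[n] - a * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(length):
    """Hamming window coefficients for one analysis frame."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def window_frame(frame):
    """Multiply one frame by the window function (ST3)."""
    w = hamming(len(frame))
    return [s * wn for s, wn in zip(frame, w)]
```

The window tapers the frame edges toward roughly 0.08 of their value while leaving the center untouched, which reduces the spectral leakage that a rectangular cut would cause before the LPC analysis of ST4.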

Cepstral analysis is then performed so that a local mean can be obtained; the local mean at this point is given by Equation 7 above (ST5).
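
One standard way to obtain the cepstral coefficients of ST5 from the LPC analysis of ST4 is the LPC-to-cepstrum recursion sketched below. The patent does not specify which cepstral analysis it uses, so this is an assumed realization; it takes prediction coefficients for A(z) = 1 - a_1 z^-1 - ... - a_p z^-p and ignores the gain term:

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients a[0..p-1] into LPC-cepstral coefficients
    c[1..n_ceps] via the standard recursion
        c_n = a_n + (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k},
    with a_n taken as 0 for n > p and the gain term omitted."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)               # c[0] unused (gain term omitted)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0  # the a_n term, zero past order p
        for k in range(1, n):
            if n - k <= p:                 # a_{n-k} exists only up to order p
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

For a single-pole model a = [0.5], the recursion reproduces the known closed form c_n = 0.5^n / n, which is a quick way to check an implementation.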

The local mean of the cepstrum is then computed and subtracted from each input cepstral vector to obtain the channel-compensated cepstral vectors (ST6, ST7).

In this way the present invention removes ambient noise in speech recognition and thereby detects keywords.

Although a preferred embodiment of the present invention has been described above, various changes, modifications, and equivalents may be employed, and the invention can equally be applied with appropriate modification of the embodiment. The foregoing description therefore does not limit the scope of the invention, which is defined by the limits of the following claims.

As described above, the speech-feature extraction method for keyword detection in a noisy environment according to the present invention uses the LCMS method, a distortion-compensation method matched to the characteristics of keyword detection, to obtain cepstral vectors in which the channel distortion is compensated, with the effect that keywords can be detected with the ambient noise removed.

Claims (4)

1. A speech-feature extraction method suitable for keyword detection in a noisy environment, comprising: a speech-feature extraction step of extracting, when speech is input, a local mean of the cepstrum through cepstral analysis of the observations; and a distortion-compensation step, performed after the speech-feature extraction step, of compensating for noise-induced distortion using the local cepstral mean.

2. The method of claim 1, wherein the speech-feature extraction step comprises: a pre-emphasis step of flattening the spectral tilt of the input speech and suppressing the dynamic range of the signal, thereby raising the signal-to-noise ratio; a windowing step of multiplying the pre-emphasized signal by a window function; a linear-prediction-coefficient analysis step of obtaining a synthesized signal by passing the windowed signal through an articulation filter represented by the linear prediction method; and a cepstral-analysis step of performing cepstral analysis on the LPC-analyzed signal to obtain the local mean.

3. The method of claim 1, wherein, in the distortion-compensation step, the local mean of the cepstrum is obtained over the frame length used for the local average as m_t = (1/T_L) Σ_{τ=t-T_L+1..t} z_τ (Equation 7), where z_τ is the observed cepstral vector, T_L is the frame length over which the local mean is taken, and m_t is the local mean.

4. The method of claim 1, wherein, in the distortion-compensation step, the noise-induced distortion is compensated by subtracting the local mean from each input cepstral vector to obtain a cepstral vector in which the channel distortion is compensated, and this vector is used for keyword detection.
KR1019980042317A 1998-10-09 1998-10-09 Method for extracting voice characteristic suitable for core word detection in noise circumstance KR20000025292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1019980042317A KR20000025292A (en) 1998-10-09 1998-10-09 Method for extracting voice characteristic suitable for core word detection in noise circumstance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1019980042317A KR20000025292A (en) 1998-10-09 1998-10-09 Method for extracting voice characteristic suitable for core word detection in noise circumstance

Publications (1)

Publication Number Publication Date
KR20000025292A true KR20000025292A (en) 2000-05-06

Family

ID=19553540

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1019980042317A KR20000025292A (en) 1998-10-09 1998-10-09 Method for extracting voice characteristic suitable for core word detection in noise circumstance

Country Status (1)

Country Link
KR (1) KR20000025292A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100421013B1 (en) * 2001-08-10 2004-03-04 삼성전자주식회사 Speech enhancement system and method thereof
US7613611B2 (en) 2004-11-04 2009-11-03 Electronics And Telecommunications Research Institute Method and apparatus for vocal-cord signal recognition
KR20150060300A (en) * 2013-11-26 2015-06-03 현대모비스 주식회사 System for command operation using speech recognition and method thereof


Similar Documents

Publication Publication Date Title
US7197456B2 (en) On-line parametric histogram normalization for noise robust speech recognition
Hirsch et al. A new approach for the adaptation of HMMs to reverberation and background noise
EP1794746A2 (en) Method of training a robust speaker-independent speech recognition system with speaker-dependent expressions and robust speaker-dependent speech recognition system
Chen et al. Cepstrum derived from differentiated power spectrum for robust speech recognition
Nandkumar et al. Dual-channel iterative speech enhancement with constraints on an auditory-based spectrum
Seltzer et al. Robust bandwidth extension of noise-corrupted narrowband speech.
Haton Automatic speech recognition: A Review
KR100738341B1 (en) Apparatus and method for voice recognition using vocal band signal
KR20000025292A (en) Method for extracting voice characteristic suitable for core word detection in noise circumstance
Hirsch HMM adaptation for applications in telecommunication
Neumeyer et al. Training issues and channel equalization techniques for the construction of telephone acoustic models using a high-quality speech corpus
Chien et al. Estimation of channel bias for telephone speech recognition
Upadhyay et al. Robust recognition of English speech in noisy environments using frequency warped signal processing
Solé-Casals et al. A non-linear VAD for noisy environments
Chen et al. Robust MFCCs derived from differentiated power spectrum
Upadhyay et al. Auditory driven subband speech enhancement for automatic recognition of noisy speech
Benaroya et al. Experiments in audio source separation with one sensor for robust speech recognition
Seyedin et al. Robust MVDR-based feature extraction for speech recognition
Hsieh et al. Enhancing the complex-valued acoustic spectrograms in modulation domain for creating noise-robust features in speech recognition
Tsai et al. A new feature extraction front-end for robust speech recognition using progressive histogram equalization and multi-eigenvector temporal filtering.
KR100304109B1 (en) Method for modified cepstral mean subraction using low resolution data deletion
Kaur et al. Correlative consideration concerning feature extraction techniques for speech recognition—a review
Athanaselis et al. Signal Enhancement for Continuous Speech Recognition
Sunitha et al. NOISE ROBUST SPEECH RECOGNITION UNDER NOISY ENVIRONMENTS.
KR100281569B1 (en) Deferred Sequential Cepstrum Mean Subtraction Method

Legal Events

Date Code Title Description
WITN Withdrawal due to no request for examination