KR20040073145A - Performance enhancement method of speech recognition system

Info

Publication number: KR20040073145A
Application number: KR1020030009128A
Authority: KR
Inventors: 신원호
Original assignee: 엘지전자 주식회사
Priority date: 2003-02-13
Filing date: 2003-02-13
Publication date: 2004-08-19

Abstract

PURPOSE: A method for improving the performance of a speech recognizer is provided that, when computing recognition probabilities, emphasizes the portions less affected by noise relative to the portions more affected by it, thereby preventing recognition performance from degrading sharply in noisy environments. CONSTITUTION: The system obtains noise characteristics from the silent section before speech begins (S1, S2). It detects the speech section using a speech detection algorithm and multiplies the detected section by subband or feature vector weights (S3-S5). Once the feature vector weights have been determined, the system applies per-frame weights in power form to obtain the overall probability value for the input speech (S6-S8).

Description

PERFORMANCE ENHANCEMENT METHOD OF SPEECH RECOGNITION SYSTEM

The present invention relates to a technique for improving the performance of a speech recognizer, and more particularly to a method suited to reducing the effect of noise in the recognizer's probability calculation.

In speech recognition, the common characteristics of the training data are reflected in the trained model. Ideally, the model would be built only from linguistic features, with the characteristics of the surrounding environment and the speaker excluded. In practice, however, the influence of the surrounding environment is reflected in the trained model.

Various attempts have therefore been made to overcome the mismatch between the actual test environment and the training environment. Examples include training models in a variety of environments, deriving noise-robust feature vectors, applying noise compensation (or removal), and enhancing the speech signal. Combinations of these methods have also been applied.

Nevertheless, speech recognition performance degrades sharply in noisy environments, making it difficult to output correct recognition results.

Accordingly, a first object of the present invention is to reduce the influence of noise in the probability calculation of the speech recognizer.

A second object of the present invention is to make effective use of the information that is less affected by noise, whether among subband feature vectors or among different types of feature vector.

A third object of the present invention is to reduce the influence of noise along the time axis as well, in the same manner as above.

FIG. 1 is a signal flow diagram illustrating the processing of the method for improving the performance of a speech recognizer according to the present invention.

According to a first feature of the present invention, the calculated recognition probability value becomes robust against noise.

According to a second feature of the present invention, the bands or feature vectors less contaminated by noise can be exploited when calculating the recognition rate.

According to a third feature of the present invention, the portions less contaminated by noise can be emphasized in both the frequency and time domains.

The method for improving the performance of a speech recognizer according to the present invention comprises: a first process of obtaining noise characteristics from the silent section before speech begins, in order to determine subband or feature vector weights; a second process of detecting the speech section and multiplying by weights, either per subband or by extracting different feature vectors and suitably adjusting the weights among them, with the weights set so that the portions less affected by noise are emphasized; and a third process in which, after the weight for the feature vector of each frame has been determined, per-frame weights are applied again in power form to obtain the overall probability value for the input speech, giving more weight to frames less affected by noise. The method is described in detail below with reference to the accompanying FIG. 1.

The present invention is based on a speech recognition system using a Hidden Markov Model (HMM). Although there are various other approaches to speech recognition, the HMM-based approach is the most widely used.

HMM-based speech recognition outputs, as the recognition result, the model that yields the highest probability value for the input speech. A probability value must therefore be calculated for the input speech. To do so, feature vectors are extracted from the input speech at fixed intervals, and their likelihoods are multiplied together repeatedly.
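The scoring step described above can be sketched as follows. This is a deliberately simplified illustration, not the recognizer of the invention: it assumes the per-frame likelihoods for each model are already available and ignores HMM state transitions and alignment. The function names and the example values are hypothetical. The product of likelihoods is computed as a sum of logarithms to avoid numerical underflow.

```python
import math


def sequence_log_prob(frame_likelihoods):
    """Multiply per-frame likelihoods by summing their logarithms."""
    return sum(math.log(p) for p in frame_likelihoods)


def recognize(model_frame_likelihoods):
    """Return the name of the model with the highest overall probability.

    model_frame_likelihoods maps a model name to its per-frame likelihoods
    (hypothetical input; a real HMM decoder would also search over states).
    """
    return max(
        model_frame_likelihoods,
        key=lambda name: sequence_log_prob(model_frame_likelihoods[name]),
    )


# Illustrative values only.
scores = {
    "yes": [0.8, 0.7, 0.9],
    "no": [0.4, 0.6, 0.5],
}
best = recognize(scores)
```

Working in the log domain also makes the weighting schemes discussed later convenient, since raising a likelihood to a power becomes multiplying its logarithm by that power.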

In general, when speech recognition is performed in a noisy environment (here, mainly the effect of additive noise is considered), the extracted feature vectors are affected by the noise. From the standpoint of the feature vector as well, it is advantageous to perform recognition centered on the portions least affected by noise wherever possible. The degree to which a feature vector is affected, however, depends on the characteristics of the noise. When the noise is broadly distributed, as with white noise, there is little that can be done, but real-world noise is generally concentrated in certain bands.

Even so, a band in which noise is concentrated is often wide relative to the band in which speech is present, so the noise cannot simply be removed. Moreover, although the benefit depends on how much recognition-relevant information the noise-affected portion carries, emphasizing the portions relatively less affected by noise helps improve recognition performance. Even when the noise is white, if it is severe it is preferable to place more weight on bands with a relatively high SNR (signal-to-noise ratio).

Building on the above and with reference to FIG. 1, the method of calculating the probability for the input speech is described in more detail below, focusing on how the feature vectors and frames are weighted.

Because the feature vector weights and the frame weights are set according to the characteristics of the noise when calculating the probability for the input speech, the noise characteristics must be considered first. They are generally obtained from the silent section before speech begins (S1, S2).
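As a minimal sketch of steps S1 and S2, the noise characteristics could be estimated by averaging the subband energies over the leading silent frames. The source does not specify the estimator, so the averaging, the function name `estimate_noise`, and the input layout (one list of subband energies per frame) are assumptions made for illustration.

```python
def estimate_noise(subband_energies, n_silence_frames):
    """Average each subband's energy over the leading silent frames.

    subband_energies: list of frames, each a list of per-subband energies
    (hypothetical layout). n_silence_frames: number of frames assumed to
    precede the start of speech.
    """
    silence = subband_energies[:n_silence_frames]
    if not silence:
        raise ValueError("at least one silent frame is required")
    n_bands = len(silence[0])
    return [
        sum(frame[band] for frame in silence) / len(silence)
        for band in range(n_bands)
    ]


# Two silent frames followed by one speech frame (illustrative values).
frames = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
noise = estimate_noise(frames, 2)
```

Only the frames before the assumed speech onset contribute, so the loud third frame does not affect the estimate.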

Next, the speech section is detected using a speech detection algorithm. Once the speech section has been detected, it is multiplied by the subband or feature vector weights (S3-S5).

There are two ways of applying the feature vector weights: multiplying by a weight per subband, and extracting different feature vectors and suitably adjusting the weights among them. In the per-subband method, the weights are considered together with the full-band feature vector.

In general, subband-style speech recognition can exploit its advantage when the noise present is concentrated in only some frequency bands. When the noise is instead spread over a wide band, the advantage is hard to exploit, because removing the bands in which noise is present loses too much information. Taking this into account, the information lost through weighting is compensated for by the full-band feature vector, while the feature vectors less affected by noise are separated by band and weighted when computing the probability value. This method is expressed by [Equation 1] below.

In short, [Equation 1] gives the likelihood value in state j for the t-th observation, in which the subband likelihoods and the fullband likelihood are each weighted by their respective weights.
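The exact form of [Equation 1] is not shown in this text, so the following is only one plausible reading: a log-linear combination in which each subband likelihood and the fullband likelihood is raised to its weight, which in the log domain becomes a weighted sum. The function and parameter names are hypothetical, and the weights in the usage lines are illustrative.

```python
import math


def combined_log_likelihood(subband_likelihoods, subband_weights,
                            fullband_likelihood, fullband_weight):
    """Weighted sum of log-likelihoods for one observation in one state.

    Assumed form (not taken from the source):
        sum_k w_k * log b_k(o_t) + w_full * log b_full(o_t)
    """
    if len(subband_likelihoods) != len(subband_weights):
        raise ValueError("one weight is required per subband")
    subband_term = sum(
        w * math.log(p)
        for p, w in zip(subband_likelihoods, subband_weights)
    )
    return subband_term + fullband_weight * math.log(fullband_likelihood)


# The second subband is treated as noisy and given zero weight; the
# fullband term still contributes information from the whole spectrum.
score = combined_log_likelihood([0.5, 0.25], [1.0, 0.0], 0.5, 1.0)
```

Under this reading, setting a subband weight to zero removes that band's contribution without discarding the fullband evidence, which matches the compensation idea described above.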

The method of extracting different feature vectors and suitably adjusting the weights among them is expressed by [Equation 2] below. One embodiment of this method uses the widely used Mel cepstrum feature vector together with the Root Cepstrum, which performs somewhat worse in quiet conditions but performs well in noise, and adjusts the weights between the two.

In this case, the weights between the two feature vectors can be determined from the local frame SNR, as shown in [Equation 3] below.
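Since [Equation 3] itself is not shown in this text, the mapping below is purely an assumed example of an SNR-driven weighting: a logistic function of the local frame SNR gives the Mel cepstrum weight, and the Root Cepstrum receives the remainder. The `midpoint_db` and `slope` parameters and both function names are hypothetical.

```python
import math


def local_snr_db(frame_energy, noise_energy):
    """Local frame SNR in decibels from frame and noise energies."""
    return 10.0 * math.log10(frame_energy / noise_energy)


def feature_weights(snr_db, midpoint_db=10.0, slope=0.3):
    """Return (mel_weight, root_weight), which sum to 1.

    Assumed mapping, not from the source: high SNR favors the Mel
    cepstrum, low SNR favors the Root Cepstrum.
    """
    mel_weight = 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))
    return mel_weight, 1.0 - mel_weight


snr = local_snr_db(100.0, 1.0)
mel_w, root_w = feature_weights(snr)
```

The direction of the mapping follows the text: the Root Cepstrum is preferred as the frame becomes noisier, and the Mel cepstrum dominates in clean frames.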

After the weight for the feature vector of each frame has been determined as above, per-frame weights are applied again in power form to obtain the overall probability value for the input speech. That is, frames less affected by noise are given more weight, based on criteria such as the per-frame energy or the cepstral distance from the noise. This is expressed by [Equation 4] below, from which the overall probability value is calculated (S6-S8).

To obtain the per-frame weights, the probability value is estimated only after speech frames spanning at least a fixed duration (DurThr) have been input, as in the sixth step (S6). For isolated words the frame weights are obtained from all frames; for continuous sentences they are obtained at fixed time intervals. As described above, the frame weights are derived from measures such as the frame energy and the cepstral distance from the noise, and in either case the weight is determined using [Equation 5] below.

Here, the frame weight is constrained to a value between 0 and 2. The probability expression that takes both the feature vector weights and the per-frame weights into account is finally given by [Equation 6] below, where the first expression covers the subband weighting approach and the second covers the two-feature-vector approach.
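Applying a frame weight "in power form" means raising each frame's likelihood to its weight, which in the log domain is a weighted sum of log-likelihoods. The sketch below illustrates that idea together with the stated 0-to-2 range. How the raw weights are derived ([Equation 5]) is not shown in this text, so they are taken as given inputs, and the function names are hypothetical.

```python
import math


def clip_frame_weight(weight, low=0.0, high=2.0):
    """Constrain a frame weight to the range [0, 2] stated in the text."""
    return max(low, min(high, weight))


def weighted_total_log_prob(frame_likelihoods, frame_weights):
    """Overall log-probability with each frame likelihood raised to its weight.

    Equivalent to log(prod_t b(o_t) ** w_t) = sum_t w_t * log b(o_t).
    """
    if len(frame_likelihoods) != len(frame_weights):
        raise ValueError("one weight is required per frame")
    return sum(
        clip_frame_weight(w) * math.log(p)
        for p, w in zip(frame_likelihoods, frame_weights)
    )


# The first frame is treated as clean (weight 2), the second as noisy (weight 0).
total = weighted_total_log_prob([0.5, 0.5], [2.0, 0.0])
```

A weight above 1 emphasizes a frame, a weight below 1 de-emphasizes it, and a weight of 0 removes the frame from the overall score.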

As described in detail above, the present invention emphasizes, during the speech recognition calculation, the portions in the time or frequency domain that are less affected by noise relative to the portions that are more affected, and thereby prevents speech recognition performance from degrading sharply in noisy environments.

Claims

A method for improving the performance of a speech recognizer, comprising: a first process of detecting noise characteristics from a silent section; a second process of, when a speech section is detected and multiplied by weights, setting the weights so that the portions less affected by noise are emphasized; and a third process in which, after the weight for the feature vector of each frame has been determined, per-frame weights are applied in power form to obtain the overall probability value for the input speech, giving more weight to frames less affected by noise.

The method of claim 1, wherein the silent section is the silent section before speech begins.

The method of claim 1, wherein, when multiplying by the weights in the second process, the weights are multiplied per subband, or different feature vectors are extracted and the weights among them are adjusted.

The method of claim 1, wherein the second process compensates for the information lost through weighting by using the full-band feature vector, and calculates the probability value by separating the feature vectors less affected by noise by band and weighting them.

The method of claim 1, wherein, to obtain the per-frame weights in the third process, the probability value is estimated after speech frames spanning a predetermined interval have been input.