KR19990056313A

KR19990056313A - Speech Segment Detection Method in Speech Recognition System

Info

Publication number: KR19990056313A
Application number: KR1019970076307A
Authority: KR
Inventors: 박성희
Original assignee: 김영환; 현대전자산업 주식회사
Priority date: 1997-12-29
Filing date: 1997-12-29
Publication date: 1999-07-15

Abstract

The present invention relates to a method for detecting speech segments in a speech recognition system, in which a filter bank is used to compute the energy of the input signal in each frequency band and the correlation among the computed energies is examined so that speech segments can be detected accurately from the input signal. The method emphasizes the high-frequency region of the input signal with a high-pass filter, divides the emphasized signal into frames of a fixed size using a Hamming window, performs an FFT on each frame, obtains the energy corresponding to each frequency with triangular filters, and computes the correlation of these energies to derive a speech segment decision index. The input is detected as a speech signal when the index is equal to or greater than a threshold, and as a noise signal when it is equal to or less than the threshold. Compared with the prior art, speech segments can be detected accurately even when the signal contains relatively strong noise, and consonants that were previously hard to detect because they are difficult to distinguish from noise can also be detected to some extent.

Description

Speech Segment Detection Method in Speech Recognition System

The present invention relates to a method for detecting speech segments in a speech recognition system for recognizing human speech, in which a filter bank is used to compute the energy of the input signal in each frequency band and the correlation among the computed energies is examined so that speech segments can be detected accurately from the input signal.

In general, as shown in FIG. 1, a speech recognition system intended to recognize natural sounds such as human speech comprises: a speech segment detection unit (1) that detects speech segments in the input signal; a feature coefficient extraction unit (2) that uses MFCC coefficients to extract features from the speech segments detected by the speech segment detection unit (1); a speech recognition unit (3) that recognizes the speech signal using an HMM (Hidden Markov Model) and the VMS VQ (Variable Multi-Section Vector Quantization) algorithm; a database (4) that stores word model parameters trained on speech signals; and a post-processing unit (5) that judges the validity of the speech signal recognized by the speech recognition unit (3) and outputs the recognized word.

In a speech recognition system configured as above, detecting accurate speech segments in the input signal is the preprocessing stage of the system, and it is a critical task because the performance of the whole system depends on it.

In a noisy external environment, however, the noise components make it difficult to detect speech segments accurately.

Conventionally, therefore, speech segments have been detected by comparing information such as the peak energy, the average energy, or the zero-crossing rate of each frame of the input signal against a threshold computed from the background noise.
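The conventional approach described above can be sketched roughly as follows. This is a minimal illustration, not the prior-art implementation: the function names, the `margin` factor, and the choice to decide on average energy alone are all assumptions made for the example.

```python
def frame_energy(frame):
    """Average energy of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def is_speech_conventional(frame, noise_frames, margin=3.0):
    """Flag a frame as speech when its energy exceeds a noise-derived threshold.

    `margin` is a hypothetical tuning factor, not a value from the source.
    """
    noise_level = sum(frame_energy(f) for f in noise_frames) / len(noise_frames)
    return frame_energy(frame) > margin * noise_level
```

Because the threshold scales with the measured noise level, a low-energy consonant can easily fall below it once the background noise is strong, which is the weakness the description goes on to point out.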

This detection method works reasonably well when the background noise is low, but its performance drops markedly when the background noise is relatively strong.

In addition, consonants such as ㅅ (siot), ㅈ (jieut), ㅌ (tieut), and ㅋ (kieuk), which are not easily distinguished from noise components, were left out of the detected speech segment altogether and therefore could not be detected.

The present invention has been devised to solve the above problems. Its object is to provide a method for detecting speech segments in a speech recognition system that computes the energy of the input signal in each frequency band using a filter bank and uses the correlation of the computed energies to distinguish speech signals from noise signals, so that only the speech segments are effectively detected from the input signal.

To achieve this object, the method for detecting speech segments in a speech recognition system according to the present invention emphasizes the high-frequency region of the input signal with a high-pass filter, divides the emphasized signal into frames of a fixed size using a Hamming window, performs an FFT on each frame, obtains the energy corresponding to each frequency with triangular filters, computes the correlation of these energies to derive a speech segment decision index, and then detects the input as a speech signal when the index is equal to or greater than a threshold and as a noise signal when it is equal to or less than the threshold.

FIG. 1 is a block diagram of a general speech recognition system.

FIG. 2 is a flowchart of speech segment detection in a speech recognition system according to the present invention.

FIG. 3 is a diagram showing the filter bank according to the present invention.

<Description of reference numerals for the main parts of the drawings>

1: speech segment detection unit
2: feature coefficient extraction unit
3: speech recognition unit
4: database
5: post-processing unit

The method for detecting speech segments in a speech recognition system according to the present invention is described in detail below with reference to FIG. 2.

First, the high-frequency region of the signal input to the speech recognition system is emphasized using the high-pass filter of Equation 1 (S10).
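Equation 1 is not reproduced in this text, so the exact filter is unknown. As a sketch of step S10, the example below assumes the first-order pre-emphasis filter y[n] = x[n] - a·x[n-1] that is common in speech front ends, with a = 0.97; both the filter form and the coefficient are assumptions, and `pre_emphasize` is a hypothetical name.

```python
def pre_emphasize(signal, alpha=0.97):
    """First-order high-pass (pre-emphasis) filter: y[n] = x[n] - alpha * x[n-1].

    The filter form and alpha = 0.97 are assumptions; Equation 1 of the
    source is not available in this text.
    """
    if not signal:
        return []
    out = [signal[0]]  # the first sample has no predecessor and is passed through
    for prev, cur in zip(signal, signal[1:]):
        out.append(cur - alpha * prev)
    return out
```

With this filter a constant (low-frequency) input is almost cancelled, while a rapidly alternating input is nearly doubled in amplitude, which is the high-frequency emphasis the step calls for.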

Next, the input signal whose high-frequency region was emphasized in step S10 is divided (blocked) into 20 ms frames that overlap by 10 ms, using the Hamming window of Equation 2 (S11).
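The framing of step S11 might look like the sketch below. It uses the standard Hamming window definition, which is assumed to match Equation 2 (not reproduced here). The 8 kHz sample rate is also an assumption; at that rate a 20 ms frame is 160 samples, which fits inside the 256-point FFT used in the next step. The function names are hypothetical.

```python
import math


def hamming(n_samples):
    """Standard Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def split_into_frames(signal, sample_rate=8000, frame_ms=20, overlap_ms=10):
    """Cut `signal` into overlapping, Hamming-windowed frames.

    sample_rate = 8000 Hz is an assumption; the source states only the
    20 ms frame size and the 10 ms overlap.
    """
    frame_len = sample_rate * frame_ms // 1000           # 160 samples at 8 kHz
    hop = sample_rate * (frame_ms - overlap_ms) // 1000  # 80 samples at 8 kHz
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

Trailing samples that do not fill a whole frame are dropped in this sketch; padding the last frame with zeros would be an equally reasonable choice.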

A 256-point FFT (Fast Fourier Transform) is then performed on each frame produced in step S11 (S12), and the FFT result is passed through triangular filters as shown in FIG. 3, that is, a filter bank, to obtain the energy corresponding to each frequency (S13).
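Steps S12 and S13 can be illustrated with the dependency-free sketch below. FIG. 3 is not available in this text, so the number of filters (8) and their linear, half-overlapping spacing are assumptions; the actual filter bank may be laid out differently (for example on a mel scale). All function names are hypothetical.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def triangular_filter_bank(n_bands=8, n_bins=129):
    """Triangular filters with linearly spaced, half-overlapping bands.

    n_bands = 8 and the linear spacing are assumptions, not values from FIG. 3.
    """
    edges = [i * (n_bins - 1) / (n_bands + 1) for i in range(n_bands + 2)]
    bank = []
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        weights = []
        for k in range(n_bins):
            if lo < k <= mid:
                weights.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                weights.append((hi - k) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def band_energies(frame, n_fft=256, n_bands=8):
    """Zero-pad a frame to n_fft, take the power spectrum, apply the filter bank."""
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    spectrum = fft(padded)
    power = [abs(c) ** 2 for c in spectrum[:n_fft // 2 + 1]]
    bank = triangular_filter_bank(n_bands, len(power))
    return [sum(w * p for w, p in zip(weights, power)) for weights in bank]
```

Each frame is thus reduced to a short vector of band energies, which is the representation the later correlation step operates on.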

Then, using Equation 3, a correlation coefficient is computed between the signal that has passed through the filter bank in step S13 and a background noise signal prepared in advance (S14).
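Equation 3 is not reproduced in this text. One plausible reading of step S14 is a Pearson correlation between the band-energy vector of the current frame and the band-energy vector of the stored background noise; the sketch below makes that assumption, and `correlation_coefficient` is a hypothetical name.

```python
import math


def correlation_coefficient(x, y):
    """Pearson correlation of two equal-length sequences.

    Assumed stand-in for Equation 3. Returns 0.0 when either input is
    constant, since the coefficient is undefined in that case.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

Under this reading, a frame whose spectral shape resembles the background noise gives a coefficient near 1 regardless of its overall level, whereas a frame with a different spectral shape gives a lower value.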

Thereafter, a speech segment decision index is calculated from the correlation coefficient obtained in step S14 according to Equation 4 (S15), and this index is used to decide whether the input is a speech signal (S16).

That is, when the speech segment decision index is equal to or greater than a threshold, the input signal is detected as a speech signal (S17); when it is equal to or less than the threshold, it is detected as a noise signal (S18).
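The decision of steps S15 to S18 could be sketched as follows. Equation 4 is not reproduced in this text, so the index used here (1 minus the correlation with the background noise) is purely a hypothetical stand-in, as are the 0.5 threshold and the function names. The text assigns the case of exact equality to both outcomes; the sketch treats it as speech.

```python
def speech_index(correlation):
    """Hypothetical decision index: 0 when the frame matches the noise profile.

    Stand-in for Equation 4, which is not available in this text.
    """
    return 1.0 - correlation


def classify_frame(correlation, threshold=0.5):
    """Return "speech" when the index reaches the threshold, otherwise "noise"."""
    return "speech" if speech_index(correlation) >= threshold else "noise"


def speech_segments(labels):
    """Collapse per-frame labels into (first_frame, last_frame) speech intervals."""
    segments = []
    start = None
    for i, label in enumerate(labels):
        if label == "speech" and start is None:
            start = i                       # a speech run begins
        elif label != "speech" and start is not None:
            segments.append((start, i - 1))  # the run ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(labels) - 1))
    return segments
```

For example, per-frame correlations of 0.9, 0.2, 0.1, 0.95 and 0.3 would be labelled noise, speech, speech, noise, speech, giving speech intervals over frames 1 to 2 and frame 4.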

As described above, the present invention computes the correlation of the energy values of the individual frequency bands in the frequency domain and uses it to separate the speech signal from the input signal. Compared with the prior art, speech segments can therefore be detected accurately even when the signal contains relatively strong noise, and consonants that were previously hard to detect because they are difficult to distinguish from noise can also be detected to some extent.

Claims

A method for detecting a speech segment in a speech recognition system, comprising:

a first step of emphasizing the high-frequency region of an input signal using a high-pass filter;
a second step of dividing the input signal whose high-frequency region has been emphasized into frames of a predetermined size using a Hamming window;
a third step of performing an FFT on each frame of the input signal divided in the second step;
a fourth step of obtaining the energy corresponding to each frequency by applying a filter bank to the FFT result of the third step;
a fifth step of obtaining a correlation coefficient between the input signal that has passed through the filter bank in the fourth step and a background noise signal prepared in advance;
a sixth step of calculating a speech segment decision index using the correlation coefficient obtained in the fifth step; and
a seventh step of detecting that the input signal is a speech signal when the speech segment decision index calculated in the sixth step is equal to or greater than a threshold, and detecting that it is a noise signal when the index is equal to or less than the threshold.