KR100897555B1

KR100897555B1 - Apparatus and method of extracting speech feature vectors and speech recognition system and method employing the same

Info

Publication number: KR100897555B1
Application number: KR1020070017621A
Authority: KR
Inventors: 오광철; 정재훈; 정소영
Original assignee: Samsung Electronics Co., Ltd. (삼성전자주식회사)
Priority date: 2007-02-21
Filing date: 2007-02-21
Publication date: 2009-05-15
Also published as: KR20080077874A

Abstract

An apparatus and method for extracting speech feature vectors, and a speech recognition system and method employing the same, are disclosed. The speech feature vector extraction apparatus includes: an FFT processor that converts a speech signal organized in frames into a frequency-domain signal; a formant emphasis unit that emphasizes formants by suppressing the pitch harmonic component contained in each frequency component of the frequency-domain signal provided by the FFT processor; and a filter bank processor that band-pass filters the formant-emphasized frequency-domain signal using a plurality of mel-scale filter banks. The formant emphasis unit consists of a harmonic removal unit, which subtracts the magnitude of the neighboring lower frequency component from the magnitude of each frequency component and takes the absolute value of the difference to remove the pitch harmonic component, and a smoothing unit, which smooths each harmonic-suppressed frequency component using a local center of gravity.

Description

Apparatus and method of extracting speech feature vectors and speech recognition system and method employing the same

FIG. 1 is a block diagram showing the configuration of a speech recognition system in which the present invention is employed;

FIG. 2 is a block diagram showing the configuration of an embodiment of a speech feature vector extraction apparatus according to the present invention;

FIG. 3 is a block diagram showing the detailed configuration of the formant emphasis unit shown in FIG. 2;

FIGS. 4A and 4B show the spectrum of a vowel, to compare the performance of the present invention with that of the prior art;

FIGS. 5A and 5B show the spectrogram of one sentence, to compare the performance of the present invention with that of the prior art; and

FIGS. 6A and 6B show spectrograms of the filter bank outputs, to compare the performance of the present invention with that of the prior art.

The present invention relates to speech recognition, and more particularly to an apparatus and method that extract the feature vectors needed for speech recognition more accurately by suppressing pitch harmonic components so as to emphasize formants, and to a speech recognition system and method employing the same.

Speech recognition technology is steadily widening its range of applications, from personal mobile terminals to information appliances, computers, and high-capacity telephony servers. Because recognition performance is unstable and varies with the surrounding environment, various studies have pursued two goals: raising speech recognition performance itself, and preventing the recognition rate from degrading in noisy environments.

Among these, to prevent the recognition rate from degrading in noisy environments, a variety of techniques have been studied that transform the conventional mel-frequency cepstral coefficient (hereinafter 'MFCC') feature vector, linearly or nonlinearly and taking its temporal characteristics into account, during speech feature vector extraction, which is the first stage of speech recognition.

First, existing transformation algorithms that take the temporal characteristics of feature vectors into account include cepstral mean subtraction; mean-variance normalization (P. Pujol, D. Macho and C. Nadeu, "On real-time mean-variance normalization of speech recognition features," ICASSP, 2006, pp. 773-776); the RASTA (RelAtive SpecTrAl) algorithm (M. L. Shire et al., "Data-driven RASTA filters in reverberation," ICASSP, 2000, pp. 1627-1630); histogram normalization (F. Hilger and H. Ney, "Quantile based histogram equalization for noise robust large vocabulary speech recognition," IEEE Trans. Audio, Speech, Language Processing, vol. 14, no. 3, pp. 845-854); and the delta feature augmentation algorithm (J. di Martino, "On the use of high order derivatives for high performance alphabet recognition," ICASSP, 2002, pp. 953-956).

Techniques that transform feature vectors linearly include methods that transform the feature data along the time-frame axis using LDA (linear discriminant analysis) and PCA (principal component analysis) (Jeih-Weih Hung et al., "Optimization of temporal filters for constructing robust features in speech recognition," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 3, 2006, pp. 808-832).

Known methods that use nonlinear neural networks include the TempoRAl Patterns algorithm (hereinafter 'TRAP'; H. Hermansky and S. Sharma, "Temporal patterns in ASR of noisy speech," ICASSP, 1999, pp. 289-292) and the automatic speech attribute transcription algorithm (hereinafter 'ASAT'; Jinyu Li, Yu Tsao and Chin-Hui Lee, "A study on knowledge source integration for candidate rescoring in automatic speech recognition," ICASSP, 2005, pp. 837-840).

Attempts to raise speech recognition performance itself, on the other hand, have had limited success, because the MFCC feature vector is extracted from a spectrum that contains pitch harmonics, which have little relevance to speech recognition.

An object of the present invention is to provide an apparatus and method that extract the feature vectors needed for speech recognition more accurately by suppressing pitch harmonic components so as to emphasize formants.

Another object of the present invention is to provide a speech recognition system and method employing the speech feature vector extraction apparatus and method described above.

To achieve the first object, a speech feature vector extraction apparatus according to the present invention includes: an FFT processor that converts a speech signal organized in frames into a frequency-domain signal; a formant emphasis unit that emphasizes formants by suppressing the pitch harmonic component contained in each frequency component of the frequency-domain signal provided by the FFT processor; and a filter bank processor that band-pass filters the formant-emphasized frequency-domain signal using a plurality of mel-scale filter banks.

Preferably, the formant emphasis unit includes: a harmonic removal unit that subtracts the magnitude of the neighboring lower frequency component from the magnitude of each frequency component and takes the absolute value of the difference to remove the pitch harmonic component; and a smoothing unit that smooths each harmonic-suppressed frequency component using a local center of gravity.

To achieve the first object, a speech feature vector extraction method according to the present invention includes: converting a speech signal organized in frames into a frequency-domain signal; emphasizing formants by suppressing the pitch harmonic component contained in each frequency component of the frequency-domain signal; and band-pass filtering the formant-emphasized frequency-domain signal using a plurality of mel-scale filter banks.

To achieve the other object, a speech recognition system according to the present invention includes: a feature extraction unit that, for a frequency-domain signal organized in frames, obtains a formant-emphasized spectrum by suppressing the pitch harmonic component contained in each frequency component, and extracts a feature vector for speech recognition from the formant-emphasized spectrum; and a recognition unit that performs recognition on the extracted feature vector with reference to a database.

To achieve the other object, a speech recognition method according to the present invention includes: for a frequency-domain signal organized in frames, obtaining a formant-emphasized spectrum by suppressing the pitch harmonic component contained in each frequency component, and extracting a feature vector for speech recognition from the formant-emphasized spectrum; and performing recognition on the extracted feature vector with reference to a database.

Preferably, the speech feature vector extraction method and the speech recognition method may be implemented as a computer-readable recording medium on which a program for executing them on a computer is recorded.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the configuration of a speech recognition system in which the present invention is employed. The system includes a noise removal unit 110, a feature vector extraction unit 130, a recognition unit 150, and a database 170.

Referring to FIG. 1, the noise removal unit 110 removes noise from the input speech signal. Various known methods, for example spectral subtraction, can be applied to remove noise from the speech signal.

The feature vector extraction unit 130 transforms the spectrum nonlinearly to obtain a spectrum in which the pitch harmonic components are suppressed and the formants are emphasized, and extracts the MFCC feature vector from the resulting spectrum.

The recognition unit 150 computes a similarity for the feature vector extracted by the feature vector extraction unit 130, using the trained parameters stored in the database 170. The recognition unit 150 may use various speech recognition models, such as a hidden Markov model (HMM), dynamic time warping (DTW), or a neural network.

The database 170 stores the parameters of the model used by the recognition unit 150, which are trained in advance. When the recognition unit 150 uses a neural network model, the parameters stored in the database 170 are the weights of each node learned by the back-propagation (BP) algorithm. When the recognition unit 150 uses an HMM, the stored parameters are the state transition probabilities and the probability distribution of each state, learned by the Baum-Welch re-estimation algorithm.

FIG. 2 is a block diagram showing the configuration of an embodiment of the speech feature vector extraction apparatus according to the present invention. The apparatus includes a preprocessor 210, an FFT (Fast Fourier Transform) processor 230, a formant emphasis unit 250, a filter bank processor 270, and a DCT (Discrete Cosine Transform) processor 290.

Referring to FIG. 2, the preprocessor 210 divides the speech signal into frames, for example one frame of 20 to 30 ms every 10 ms, and performs pre-emphasis on each frame to emphasize the high-frequency components and thereby strengthen the consonant components. The pre-emphasized signal x(n) can be expressed as Equation 1 below.

$$x(n) = s(n) - \alpha\, s(n-1) \tag{1}$$

Here, s(n) is the speech signal and α is the pre-emphasis constant, for which 0.97 is typically used.
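The framing and pre-emphasis step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `split_frames` and `pre_emphasize` are hypothetical, and leaving the first sample of each frame unchanged (it has no predecessor within the frame) is an assumption the source does not address. The 20 ms frame, 10 ms step, and α = 0.97 follow values mentioned in the description.

```python
def split_frames(signal, sample_rate, frame_ms=20.0, hop_ms=10.0):
    """Cut a sample sequence into overlapping frames; an incomplete tail is dropped."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(list(signal[start:start + frame_len]))
        start += hop_len
    return frames


def pre_emphasize(frame, alpha=0.97):
    """Return x(n) = s(n) - alpha * s(n-1) for one frame.

    Assumption: the first sample is passed through unchanged.
    """
    return [
        frame[n] - (alpha * frame[n - 1] if n > 0 else 0.0)
        for n in range(len(frame))
    ]


if __name__ == "__main__":
    # 30 ms of a dummy ramp signal at an assumed 16 kHz sampling rate.
    signal = [float(n) for n in range(480)]
    frames = split_frames(signal, sample_rate=16000)
    emphasized = [pre_emphasize(frame) for frame in frames]
    print(len(frames), len(emphasized[0]))
```

Because pre-emphasis is a first-order difference, a slowly varying (low-frequency) input is strongly attenuated while rapid sample-to-sample changes pass almost unchanged, which is the high-frequency boost the description refers to.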

To prevent the frequency information from being distorted by abrupt changes at frame boundaries, the preprocessor 210 applies a window function to each high-frequency-emphasized frame, for example the Hamming window function h(n) given by Equation 2 below.

$$h(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{M}\right) \tag{2}$$

Here, M is the length of the Hamming window.
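As an illustration of the windowing step, the sketch below builds a Hamming window and applies it to a frame. It assumes the textbook Hamming coefficients 0.54 and 0.46 and the symmetric form whose cosine denominator is the window length minus one, so that both endpoints equal 0.08; the helper names `hamming_window` and `apply_window` are hypothetical.

```python
import math


def hamming_window(length):
    """Symmetric Hamming window: 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def apply_window(frame, window):
    """Multiply each sample of the frame by the matching window weight."""
    return [sample * weight for sample, weight in zip(frame, window)]


if __name__ == "__main__":
    window = hamming_window(320)  # 20 ms at an assumed 16 kHz sampling rate
    windowed = apply_window([1.0] * 320, window)
    print(min(windowed), max(windowed))
```

The window tapers the frame toward its edges, which reduces the spectral leakage that an abrupt cut at the frame boundary would otherwise cause.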

The FFT processor 230 applies an N-point FFT (Fast Fourier Transform) to the windowed signal to convert it into a frequency-domain signal. The N-point FFT can be expressed as Equation 3 below.

Here, f_s is the sampling frequency and k = 0, 1, ..., N-1.
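Since the transform equation itself is not reproduced in this text, the following sketch assumes the standard N-point DFT, X(k) = Σ x(n)·e^(−j2πnk/N), computed with a recursive radix-2 FFT, and the usual mapping of bin k to the frequency k·f_s/N. The function names and the zero-padding of a short frame up to N samples are assumptions made for illustration.

```python
import cmath


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out


def n_point_spectrum(frame, n_fft=512):
    """Zero-pad (or truncate) the frame to n_fft samples and return X(k), k = 0..n_fft-1."""
    padded = list(frame[:n_fft]) + [0.0] * max(0, n_fft - len(frame))
    return fft(padded)


def bin_frequency(k, n_fft, sample_rate):
    """Frequency in Hz represented by FFT bin k."""
    return k * sample_rate / n_fft


if __name__ == "__main__":
    spectrum = n_point_spectrum([1.0] * 320, n_fft=512)
    print(len(spectrum), abs(spectrum[0]), bin_frequency(256, 512, 16000))
```

In practice a library routine would normally replace the hand-written `fft`; it is spelled out here only to keep the example self-contained.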

The formant emphasis unit 250 obtains the magnitude of each frequency component from the FFT-processed signal provided by the FFT processor 230, removes the pitch harmonic component by subtracting the magnitude of the adjacent frequency component from the magnitude of each frequency component and taking the absolute value, and emphasizes the formants by locally smoothing the magnitudes of the harmonic-suppressed frequency components.

The filter bank processor 270 band-pass filters the formant-emphasized frequency-domain signal provided by the formant emphasis unit 250, using a plurality of filter banks whose bandwidths are divided on the mel scale so that, in keeping with human auditory characteristics, they are narrow in the low-frequency region and wide in the high-frequency region. In other words, within one frame, mel-scale filtering transforms the spectrum of the individual frequency components into a dimensional space that represents the features better. This mel-scale filtering can be expressed as Equation 4 below.

Here, E[j] denotes the output of filter bank j, J is the number of filter banks, and H_j(m) denotes the transfer function of filter bank j.
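The filtering equation is not reproduced in this text, so the sketch below assumes a common construction rather than the patent's exact one: triangular filters H_j(m) whose edges are equally spaced on the mel scale, using the widely used mapping mel(f) = 2595·log10(1 + f/700), and outputs E[j] computed as the weighted sum of the magnitude spectrum. All function names, and the use of magnitudes rather than powers, are assumptions.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(num_filters, n_fft, sample_rate):
    """Return num_filters triangular weight lists H_j(m) over bins m = 0..n_fft//2."""
    num_bins = n_fft // 2 + 1
    top_mel = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge frequencies, equally spaced on the mel axis.
    edges_hz = [
        mel_to_hz(top_mel * i / (num_filters + 1)) for i in range(num_filters + 2)
    ]
    bank = []
    for j in range(1, num_filters + 1):
        low, center, high = edges_hz[j - 1], edges_hz[j], edges_hz[j + 1]
        weights = []
        for m in range(num_bins):
            freq = m * sample_rate / n_fft
            if low < freq <= center:
                weights.append((freq - low) / (center - low))
            elif center < freq < high:
                weights.append((high - freq) / (high - center))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def filter_bank_outputs(magnitudes, bank):
    """E[j] = sum over m of |X(m)| * H_j(m), for each filter j."""
    return [sum(w * x for w, x in zip(weights, magnitudes)) for weights in bank]


if __name__ == "__main__":
    bank = mel_filter_bank(num_filters=20, n_fft=512, sample_rate=16000)
    flat_spectrum = [1.0] * 257
    print(filter_bank_outputs(flat_spectrum, bank))
```

With a flat input spectrum the outputs grow with the filter index, which makes the narrow-at-low, wide-at-high bandwidth allocation directly visible.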

The DCT processor 290 performs DCT processing on each filter bank signal provided by the filter bank processor 270 to extract the final MFCC feature vector. The MFCC feature vector widely used in current speech recognition technology is represented as a 12th-order vector per frame. The m-th order MFCC feature vector C(m) output by the DCT processing can be expressed as Equation 5 below.

Here, J is the number of filter banks and j indexes each filter bank.
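The DCT equation is likewise not reproduced in this text. A minimal sketch, assuming the conventional MFCC form C(m) = Σ_{j=1..J} log(E[j])·cos(πm(j − 0.5)/J) without any scaling factor, is shown below. The logarithmic compression, the small floor that guards against log(0), and the function name are assumptions.

```python
import math


def mfcc_from_filter_bank(energies, num_ceps=12, floor=1e-10):
    """Unnormalised DCT-II of the log filter-bank outputs, for m = 1..num_ceps."""
    num_filters = len(energies)
    log_energies = [math.log(max(e, floor)) for e in energies]
    return [
        sum(
            log_energies[j - 1] * math.cos(math.pi * m * (j - 0.5) / num_filters)
            for j in range(1, num_filters + 1)
        )
        for m in range(1, num_ceps + 1)
    ]


if __name__ == "__main__":
    # Dummy outputs of a 20-filter bank, rising with the filter index.
    energies = [1.0 + 0.5 * j for j in range(20)]
    print(mfcc_from_filter_bank(energies))
```

A perfectly flat set of filter-bank outputs yields coefficients that are all zero for m ≥ 1, because each cosine basis function sums to zero over the J filters; the coefficients therefore describe only the shape of the spectral envelope.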

FIG. 3 is a block diagram showing the detailed configuration of the formant emphasis unit 250 shown in FIG. 2. The unit includes a magnitude calculation unit 310, a harmonic removal unit 330, and a smoothing unit 350.

Referring to FIG. 3, the magnitude calculation unit 310 obtains the magnitude of each frequency component from the signal provided by the FFT processor 230. Because the signal provided by the FFT processor 230 is complex-valued, the magnitude of each frequency component is obtained by taking its modulus, which converts it to a real value.

The harmonic removal unit 330 suppresses the pitch harmonic component by subtracting the magnitude of the neighboring lower frequency component from the magnitude of each frequency component provided by the magnitude calculation unit 310 and taking the absolute value of the difference. This can be expressed as Equation 6 below.

Here, the quantity on the left-hand side of Equation 6 (its symbol is not reproduced in this text) denotes the k-th frequency component in which the pitch harmonic component is suppressed.
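The harmonic removal described above (magnitude of each bin minus the magnitude of its lower neighbor, followed by an absolute value) can be sketched as follows. The function names are hypothetical, and the handling of bin 0, which has no lower neighbor and is passed through unchanged here, is an assumption because the source does not specify the boundary case.

```python
def magnitudes(spectrum):
    """Convert complex FFT values X(k) into real magnitudes |X(k)|."""
    return [abs(value) for value in spectrum]


def suppress_pitch_harmonics(mags):
    """Return | |X(k)| - |X(k-1)| | for each bin k.

    Assumption: bin 0 has no lower neighbor and is kept as is.
    """
    return [
        abs(mags[k] - mags[k - 1]) if k > 0 else mags[k]
        for k in range(len(mags))
    ]


if __name__ == "__main__":
    spectrum = [3 + 4j, 0 + 5j, 2 + 0j, 0 - 2j]
    print(suppress_pitch_harmonics(magnitudes(spectrum)))
```

The operation is a rectified first difference along the frequency axis, so flat regions of the magnitude spectrum map to values near zero while bin-to-bin changes are retained.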

The smoothing unit 350 smooths each harmonic-suppressed frequency component using a local center of gravity. The smoothing can be expressed as Equations 7 and 8 below.

Here, the quantities in Equations 7 and 8 (the equations and some of their symbols are not reproduced in this text) are as follows: the smoothed k-th frequency component; U, the number of frequency components used to find the local center of gravity, i.e. the window length; a parameter related to the mean of the entire spectrum; N, the number of FFT points; and P, a parameter that adjusts the mean-related parameter so that it becomes larger than the mean of the entire spectrum.

FIGS. 4A and 4B show the spectrum of a vowel, to compare the performance of the present invention with that of the prior art. FIG. 4A shows the spectrum of the vowel according to the prior art, and FIG. 4B shows the spectrum of the vowel according to the present invention. With the prior art, the pitch harmonic components make it difficult to distinguish the second formant from the third formant, whereas with the present invention the two are clearly distinguished.

FIGS. 5A and 5B show the spectrogram of one sentence, to compare the performance of the present invention with that of the prior art. FIG. 5A shows the spectrogram of the sentence according to the prior art, and FIG. 5B shows the spectrogram of the sentence according to the present invention. Here too, the formant trajectories can be tracked accurately with the present invention. FIGS. 6A and 6B show spectrograms of the filter bank outputs for the same comparison. Again, with the present invention the spectrogram is stable, so the formant trajectories can be tracked more clearly.

Next, a speech recognition experiment on English utterances was carried out to verify the effect of the present invention. The speech data is the TIMIT corpus provided by the Linguistic Data Consortium (LDC) in the United States. Because this database is annotated with phoneme labels, it serves as a benchmark for measuring phoneme recognition performance. The database divides the United States into eight regions, collects the speech of speakers of each regional dialect, and consists of about 6,000 spoken sentences in total. The sentences are divided into training sentences and test sentences for use in speech recognition, and the experiment followed this division.

Phoneme recognition used HMMs, built with the HTK toolkit provided by the University of Cambridge, UK. For the performance comparison, the prior-art MFCC feature vector was a 39th-order feature vector built from 13th-order coefficients together with their delta coefficients and delta-delta (acceleration) coefficients, computed with a 20th-order filter bank. Each frame was 20 ms long, a 512-point FFT was applied, and the pre-emphasis constant was 0.97. For the present invention, a pre-emphasis constant of 0.97, a 20 ms Hamming window, a 512-point FFT, and a 20th-order filter bank were used. The window length U in the smoothing unit 350 was set to 5, so that the value of the current frequency component is compared with the values of the four surrounding frequency components when smoothing.

In addition, P was set to 4, so that a value four times the mean of the entire spectrum serves as the reference. This prevents small fluctuations from being greatly amplified by the local center-of-gravity smoothing and growing to resemble formant components when the spectral magnitude is very small, as in silent intervals.

Table 1 below shows the results of the phoneme recognition experiment, under the experimental conditions above, for the feature vectors obtained by the present invention and those obtained by the prior art.

[Table 1]

| | Recognition rate | Accuracy | # Hits | # Deletions | # Substitutions | # Insertions | # Total words |
|---|---|---|---|---|---|---|---|
| Prior art | 73.74 % | 70.48 % | 47,303 | 5,562 | 11,280 | 2,094 | 64,145 |
| Present invention | 74.23 % | 71.17 % | 47,618 | 5,500 | 11,027 | 1,963 | 64,145 |

Table 1 shows that the feature vector according to the present invention yields a phoneme recognition rate of 74.23 %, higher than the 73.74 % of the prior art, and that the accuracy, which also accounts for insertion errors, improves as well.
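The percentages in Table 1 can be reproduced from the raw counts. The sketch below assumes the scoring definitions commonly used with HTK, which the source does not state explicitly: recognition rate = hits / total, and accuracy = (hits − insertions) / total. The function names are hypothetical; the counts are those listed in Table 1.

```python
def percent_correct(hits, total):
    """Recognition rate in percent: share of reference labels that were hit."""
    return 100.0 * hits / total


def percent_accuracy(hits, insertions, total):
    """Accuracy in percent: like percent_correct, but penalising insertions."""
    return 100.0 * (hits - insertions) / total


RESULTS = {
    "prior art": {
        "hits": 47303, "deletions": 5562, "substitutions": 11280,
        "insertions": 2094, "total": 64145,
    },
    "present invention": {
        "hits": 47618, "deletions": 5500, "substitutions": 11027,
        "insertions": 1963, "total": 64145,
    },
}

if __name__ == "__main__":
    for name, r in RESULTS.items():
        # Hits, deletions and substitutions together account for every reference label.
        assert r["hits"] + r["deletions"] + r["substitutions"] == r["total"]
        correct = percent_correct(r["hits"], r["total"])
        accuracy = percent_accuracy(r["hits"], r["insertions"], r["total"])
        print(f"{name}: {correct:.2f} % correct, {accuracy:.2f} % accuracy")
```

Under these definitions the counts are internally consistent: hits, deletions, and substitutions sum to the total in both rows, and the computed percentages match the table to two decimal places.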

The present invention can also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, as well as media implemented in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present invention can be readily inferred by programmers in the technical field to which the present invention belongs.

As described above, the present invention suppresses the pitch harmonic components of the speech spectrum obtained by FFT processing to obtain a formant-emphasized speech spectrum, and extracts a more accurate feature vector from it for use in speech recognition, which has the advantage of improving speech recognition performance.

Although the present invention has been described with reference to the above embodiments, these are merely illustrative, and those of ordinary skill in the art to which the present invention belongs will understand that various modifications and other equivalent embodiments are possible. The true scope of technical protection of the present invention should therefore be determined by the technical spirit of the appended claims.

Claims

A speech feature vector extraction apparatus comprising:

an FFT processor that converts a speech signal organized in frames into a frequency-domain signal;

a formant emphasis unit that, for the frequency-domain signal provided by the FFT processor, emphasizes formants by suppressing the pitch harmonic component included in each frequency component, the suppression being performed by subtracting the magnitude of the neighboring lower frequency component from the magnitude of each frequency component and taking the absolute value of the result; and

a filter bank processor that performs band-pass filtering on the formant-emphasized frequency-domain signal using a plurality of mel-scale filter banks.

The apparatus of claim 1, wherein the formant emphasis unit smooths each frequency component in which the pitch harmonic component is suppressed, using a local center of gravity.

A speech feature vector extraction method comprising:

converting a speech signal organized in frames into a frequency-domain signal;

emphasizing formants in the frequency-domain signal by suppressing the pitch harmonic component included in each frequency component, the suppression being performed by subtracting the magnitude of the neighboring lower frequency component from the magnitude of each frequency component and taking the absolute value of the result; and

performing band-pass filtering on the formant-emphasized frequency-domain signal using a plurality of mel-scale filter banks.

(Deleted)

The method of claim 3, wherein the formant emphasizing step further comprises smoothing each frequency component in which the pitch harmonic component is suppressed, using a local center of gravity.

The method of claim 5, wherein the smoothing step is performed by an equation (not reproduced in this text) whose quantities are: the k-th frequency component in which the pitch harmonic component is suppressed; the smoothed k-th frequency component; U, the number of frequency components used to find the local center of gravity; and a parameter that is larger than the average of the entire spectrum.

A speech recognition system comprising:

a feature extractor that, for a frequency-domain signal organized in frames, obtains a formant-emphasized spectrum by suppressing the pitch harmonic component included in each frequency component, the suppression being performed by subtracting the magnitude of the neighboring lower frequency component from the magnitude of each frequency component and taking the absolute value of the result, and that extracts a feature vector for speech recognition from the formant-emphasized spectrum; and

a recognition unit that performs recognition on the extracted feature vector with reference to a database.

(Deleted)

The speech recognition system of claim 7, wherein the feature extractor smooths each frequency component of which the pitch harmonic component is suppressed using a local center of gravity.

A speech recognition method comprising:

for a frequency-domain signal organized in frames, obtaining a formant-emphasized spectrum by suppressing the pitch harmonic component included in each frequency component, the suppression being performed by subtracting the magnitude of the neighboring lower frequency component from the magnitude of each frequency component and taking the absolute value of the result, and extracting a feature vector for speech recognition from the formant-emphasized spectrum; and

performing recognition on the extracted feature vector with reference to a database.

The speech recognition method of claim 10, wherein the feature extraction step smoothes each frequency component whose pitch harmonic component is suppressed using a local center of gravity.

The speech recognition method of claim 11, wherein the smoothing step is performed by an equation (not reproduced in this text) that includes a parameter larger than the average of the entire spectrum.

A computer-readable recording medium on which is recorded a program for executing the speech feature vector extraction method according to any one of claims 3, 5 and 6.

A computer-readable recording medium on which is recorded a program for executing the speech recognition method according to any one of claims 10 to 12.