KR100450787B1

KR100450787B1 - Apparatus and Method for Extracting Speech Features by Dynamic Range Normalization of Spectrum

Info

Publication number: KR100450787B1
Application number: KR1019970025277A
Authority: KR
Inventors: 오광철; 김동국
Original assignee: 삼성전자주식회사
Priority date: 1997-06-18
Filing date: 1997-06-18
Publication date: 2005-05-03
Also published as: KR19990001828A

Abstract

The present invention relates to a method of extracting speech features from an input signal in a speech recognition system. In particular, the speech spectrum is normalized according to the amount of noise and a feature vector in cepstrum form is then obtained, so that the features can be used for speech recognition in the presence of noise. The apparatus comprises: a spectrum analysis unit (1) that analyzes the spectrum of the input speech signal frame by frame; a filter bank unit (2) that obtains a simplified spectrum through a mel-scale filter bank; a log compression unit (3) that reduces the dynamic range of the spectral signal; a noise section detection unit (4) that sets the section from which the information for obtaining the parameters of a sigmoid function is taken; a sigmoid parameter calculation unit (5) that obtains the parameters of the sigmoid function in order to normalize the dynamic range of the spectral signal; a sigmoid compression unit (6) that normalizes the spectral signal; and a discrete cosine transform unit (7) that obtains the cepstrum as the feature vector used by the recognition algorithm.

Description

Apparatus and method for extracting speech features by dynamic range normalization of the spectrum

The present invention relates to a method of extracting speech features from an input signal in a speech recognition system, and more particularly to an apparatus and method for extracting speech features by dynamic range normalization of the spectrum, in which the speech spectrum is normalized according to the amount of noise and a feature vector in cepstrum form is then obtained, so that the features can be used for speech recognition in the presence of noise.

Speech feature extraction produces patterns that allow the phonetic values of speech sounds to be distinguished from one another. A speech recognition algorithm recognizes speech by comparing patterns made up of the extracted features. A speech recognition system therefore performs well only when the speech features are extracted well.

In general, speech features are derived from the characteristics of the frequency spectrum of the signal. Methods of extracting speech features in a speech recognition system include the linear prediction coefficient cepstrum (LPC-cepstrum) method, a parametric method based on an all-pole model, and the mel-frequency cepstrum coefficient (hereinafter MFCC) method, a non-parametric method that uses the spectral information directly.

When background noise is present, the all-pole model of the LPC-cepstrum method produces large errors, so MFCC features are mainly used. As shown in FIG. 1, an apparatus that extracts speech features by the MFCC method comprises a spectral analysis unit (10) that extracts frequency spectrum information from the speech signal, a filter bank unit (20) that obtains a simplified spectral envelope from the resulting spectrum, a log compression unit (30) that compresses the amplitude of the simplified spectrum with a logarithmic function, and a discrete cosine transform unit (40) that obtains cepstrum coefficients by applying the discrete cosine transform (hereinafter DCT).

The conventional speech feature extraction method using the apparatus configured as above is as follows.

The spectral analysis unit (10) extracts frequency spectrum information from the speech signal, as shown in FIG. 2.

First, the signal passes through a pre-emphasis filter (11) that emphasizes its high-frequency portion and is stored in a buffer (12) in frames of about 10 msec. Once one frame of data has been collected, a Hamming window (13) is applied and frequency spectrum information is obtained with the fast Fourier transform (hereinafter FFT) (14). In FIG. 2, M denotes the number of samples in one frame; for data sampled at 8 kHz with 10 msec frames, M is 80. 2N is the FFT transform length, which usually spans about 20 to 30 msec.
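The front end described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, the pre-emphasis coefficient 0.97, the three-frame (240-sample) window and the 256-point FFT are taken from the experimental settings reported later in this description, and the recursive radix-2 FFT is just one simple way to compute the transform.

```python
import cmath
import math


def pre_emphasize(samples, coeff=0.97):
    """First-order high-frequency emphasis: y[t] = x[t] - coeff * x[t-1]."""
    if not samples:
        return []
    out = [samples[0]]
    for t in range(1, len(samples)):
        out.append(samples[t] - coeff * samples[t - 1])
    return out


def hamming(length):
    """Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def power_spectra(samples, hop=80, win=240, nfft=256):
    """Return one power spectrum (nfft // 2 + 1 bins) per 10 msec hop.

    hop=80 corresponds to M for 8 kHz data; win and nfft are assumptions
    borrowed from the experimental section.
    """
    emphasized = pre_emphasize(samples)
    window = hamming(win)
    spectra = []
    for start in range(0, len(emphasized) - win + 1, hop):
        frame = [emphasized[start + i] * window[i] for i in range(win)]
        frame += [0.0] * (nfft - win)  # zero-pad up to the FFT length
        bins = fft(frame)
        spectra.append([abs(c) ** 2 for c in bins[: nfft // 2 + 1]])
    return spectra
```

For an 8 kHz signal, each returned spectrum has 129 bins spaced 31.25 Hz apart.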

From the spectrum thus obtained, a simplified spectral envelope is obtained through the filter bank unit (20), which consists of roughly 10 to 20 mel-scaled filters. The amplitude of the simplified spectrum is compressed with a logarithmic function in the log compression unit (30), and the discrete cosine transform unit (40) then produces the cepstrum coefficients that form the speech feature vectors.
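The last two stages of this conventional pipeline, log compression and the discrete cosine transform, might look like the sketch below. The function names, the energy floor and the choice of 12 output coefficients are illustrative assumptions; the patent does not state how many cepstrum coefficients are kept.

```python
import math


def log_compress(energies, floor=1e-10):
    """Natural-log compression of filter-bank outputs (floor avoids log(0))."""
    return [math.log(max(e, floor)) for e in energies]


def dct_cepstrum(log_energies, num_coeffs=12):
    """DCT-II of the log filter-bank outputs.

    Coefficient 0 carries the overall level; higher coefficients describe
    progressively finer detail of the spectral envelope.
    """
    total = len(log_energies)
    return [
        sum(log_energies[k] * math.cos(math.pi * i * (k + 0.5) / total)
            for k in range(total))
        for i in range(num_coeffs)
    ]
```

A perfectly flat log spectrum produces a nonzero coefficient 0 and zeros elsewhere, which is why the cepstrum is a compact description of envelope shape.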

The feature vectors obtained by the MFCC method are themselves affected by background noise, which lowers the recognition rate; methods such as spectral subtraction and RASTA processing have been developed to compensate for this.

Spectral subtraction, a method widely used for noise removal, estimates the spectrum of the background noise and removes only the noise component from speech distorted by noise. Because it requires an accurate estimate of the background noise spectrum, it has not proven very effective.
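As a rough illustration of spectral subtraction, the sketch below subtracts an estimated noise power spectrum bin by bin. The function name and the spectral floor (`floor_ratio`) are assumptions added so the result stays positive; the patent does not describe a particular variant.

```python
def spectral_subtract(noisy_power, noise_power, floor_ratio=0.01):
    """Subtract an estimated noise power spectrum from a noisy one.

    Each bin is clamped to floor_ratio times the noisy power so that an
    overestimated noise level cannot produce negative power.
    """
    return [max(p - n, floor_ratio * p)
            for p, n in zip(noisy_power, noise_power)]
```

The second bin in a call such as `spectral_subtract([10.0, 4.0], [3.0, 5.0])` shows the weakness noted above: when the noise estimate is wrong, the clamp rather than the signal decides the output.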

RASTA processing exploits the change of the spectrum over time, applying the fact that the human auditory system obtains more information from intervals in which the frequency spectrum is changing. As shown in FIG. 3, the output of each frequency band of the log-compressed simplified spectrum is passed through a band-pass filter (50).

The band-pass filter (50) removes, from the values of each frequency band, the low-frequency components (usually below a few Hz) that contain slowly varying stationary background noise, and the time-varying noise components that change faster than speech does (usually tens of Hz).
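A band-pass filter of this kind can be sketched as a short IIR filter applied to the time trajectory of one log-compressed band. The coefficients below are the ones commonly cited for the RASTA filter in the literature (numerator 0.1 × [2, 1, 0, −1, −2], pole at 0.98); they are an assumption here and are not given in the patent.

```python
def rasta_filter(trajectory, pole=0.98):
    """Band-pass filter one band's log-spectral trajectory over frames.

    y[n] = pole * y[n-1] + 0.2*x[n] + 0.1*x[n-1] - 0.1*x[n-3] - 0.2*x[n-4]
    The numerator sums to zero, so constant (stationary) offsets decay away.
    """
    numerator = (0.2, 0.1, 0.0, -0.1, -0.2)
    out = []
    prev_y = 0.0
    for n in range(len(trajectory)):
        acc = pole * prev_y
        for j, coeff in enumerate(numerator):
            if n - j >= 0:
                acc += coeff * trajectory[n - j]
        out.append(acc)
        prev_y = acc
    return out
```

Feeding a constant trajectory produces a brief start-up transient that then decays toward zero, which is the intended rejection of stationary background levels.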

As described above, speech feature extraction methods for background noise work by removing the noise spectrum from a spectrum in which noise and speech are mixed.

In general, however, the speech component hardly grows as the noise grows, as shown in FIG. 4, so removing the noise spectrum also reduces the speech component in relative terms. This changes the shape of the speech feature vector and degrades speech recognition performance.

The present invention was devised to solve these problems of the prior art. Its object is to provide an apparatus and method for extracting speech features by dynamic range normalization of the spectrum, in which the speech spectrum is normalized according to the amount of noise and a feature vector in cepstrum form is then obtained, so that the features can be used for speech recognition in the presence of noise.

To achieve this object, the apparatus of the present invention comprises: a spectrum analysis unit (1) that analyzes the spectrum of the input speech signal frame by frame; a filter bank unit (2) that obtains a simplified spectrum through a mel-scale filter bank; a log compression unit (3) that reduces the dynamic range of the spectral signal; a noise section detection unit (4) that sets the section from which the information for obtaining the parameters of a sigmoid function is taken; a sigmoid parameter calculation unit (5) that obtains the parameters of the sigmoid function in order to normalize the dynamic range of the spectral signal; a sigmoid compression unit (6) that normalizes the spectral signal; and a discrete cosine transform unit (7) that obtains the cepstrum as the feature vector used by the recognition algorithm.

To achieve the same object, the method of the present invention comprises: a spectrum analysis process of extracting frequency spectrum information frame by frame from the input speech signal through the spectrum analysis unit (1); a filter bank process of obtaining a simplified spectral envelope from the extracted spectrum through the filter bank unit (2); a log compression process of compressing the dynamic range of the simplified spectral signal with a logarithmic function through the log compression unit (3); a noise section detection process of setting, through the noise section detection unit (4), the section from which the information for obtaining the parameters of a sigmoid function is taken; a sigmoid parameter calculation process of obtaining, through the sigmoid parameter calculation unit (5), the parameters of the sigmoid function in order to normalize the dynamic range of the spectral signal; a sigmoid compression process of normalizing the spectral signal through the sigmoid compression unit (6); and a discrete cosine transform process of obtaining, through the discrete cosine transform unit (7), the cepstrum of the sigmoid-normalized spectral signal as the feature vector used by the recognition algorithm.

The operating principle of the apparatus for extracting speech features by dynamic range normalization of the spectrum according to the present invention is described in detail below.

To prevent the speech spectrum component from being reduced when the background noise spectrum is removed, the present invention uses a function that automatically allocates the dynamic range of the spectral amplitude according to the amount of background noise.

The dynamic range of the spectrum passed through this function is therefore kept constant as the background noise varies, which eliminates the reduction in the dynamic range of the speech signal that occurs when only the noise spectrum is removed. In effect, a sigmoid compression stage is added to the conventional MFCC feature extraction scheme.

The present invention also extracts speech features by compensating for the reduction of the signal's dynamic range caused by background noise, an approach based on the characteristics of human hearing.

In the human auditory system, the hair cells are the part that converts the information analyzed into frequency-spectrum form by the cochlea of the inner ear into neural signals that the central nervous system of the brain can process.

Many hair cells respond to each frequency component. As shown in FIG. 6, they are divided, according to the period at which they generate neural pulses when no signal is present, into high spontaneous firing rate (hereinafter HSFR) hair cells with a short period and low spontaneous firing rate (hereinafter LSFR) hair cells with a long period. HSFR hair cells provide neural information in quiet, everyday conditions; their dynamic range is large and lies at low amplitudes.

As background noise increases, the amplitude of the frequency spectrum produced by the cochlea also increases; the HSFR hair cells then saturate and no longer provide information. In this case the LSFR hair cells become active and supply neural information to the central nervous system. In this way a person obtains neural information that allows speech to be recognized even when the background noise is loud.

This mechanism of the auditory system is approximated by a sigmoid function as in Equation 1 below; as shown in FIG. 7, the approximating sigmoid function has a shape similar to the characteristic of auditory hair cells.

[Equation 1]

Here, X(n,k) is the log-compressed spectral value of the k-th frequency band of the n-th frame, and r(n,k) is that value normalized by the sigmoid function. A is a constant that preserves the general magnitude of the spectrum, and α(k) and β(k) are the parameters that determine the shape of the sigmoid function for the k-th band.
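The equation itself is not reproduced in this text. One sigmoid form that is consistent with the symbol definitions above, offered purely as an assumption, is:

```latex
r(n,k) = \frac{A}{1 + \exp\!\bigl(-\alpha(k)\,\bigl[X(n,k) - \beta(k)\bigr]\bigr)}
```

Under this assumed form, α(k) controls the steepness of the curve and β(k) is the input level at which the output equals A/2. The patent's own Equation 1 may use a different but equivalent parametrization.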

Therefore, once these two parameters are obtained according to the magnitude of the background noise in each spectral band, the spectrum can be normalized with the sigmoid function. The equations for determining the parameters from the amount of background noise are obtained as follows.

First, the value X_c(n,k) at which the sigmoid function value r(n,k) equals 0.5A is given by Equation 2.

[Equation 2]

Next, if the difference between the X(n,k) at which r(n,k) = 0.9A and the X(n,k) at which r(n,k) = 0.1A is defined as the dynamic range DX(n,k) of the signal X(n,k), it is given by Equation 3.

[Equation 3]

Therefore, if X_c(n,k) and DX(n,k) are known for the background noise, the parameters of the sigmoid function, and hence its shape, can be determined as follows.

[Equation 4]
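Equations 2 to 4 are not reproduced in this text, so the sketch below works them out for one assumed parametrization, r = A / (1 + exp(−α(X − β))). Under that assumption r = 0.5A at X = β, and r reaches 0.9A and 0.1A at X = β ± ln(9)/α, so DX = 2·ln(9)/α. The function names are hypothetical, and the default amplitude of 50.0 is the value of A reported in the experiments.

```python
import math


def sigmoid_params(x_c, dx):
    """Return (alpha, beta) for the assumed sigmoid form.

    x_c is the input mapped to 0.5A; dx is the input span between the
    0.1A and 0.9A output levels.
    """
    alpha = 2.0 * math.log(9.0) / dx
    beta = x_c
    return alpha, beta


def sigmoid_normalize(x, alpha, beta, amplitude=50.0):
    """Assumed form of the normalization: A / (1 + exp(-alpha * (x - beta)))."""
    return amplitude / (1.0 + math.exp(-alpha * (x - beta)))
```

With `x_c = 60` and `dx = 20`, inputs of 50, 60 and 70 map to 0.1A, 0.5A and 0.9A respectively, matching the definitions of X_c and DX given in the text.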

X_c(n,k) and DX(n,k) are obtained for each band from the background noise as follows.

First, if the value of X(n,k) varies little over about 20 frames of the input signal, those frames are regarded as a background noise section and the band average X_avg(k) is computed.

While X(n,k) for the band continues to vary little in the subsequent input, the average X_avg(k) is updated; when a speech signal arrives, X_avg(k) is left unchanged, and X_c(n,k) and DX(n,k) are obtained as follows in order to compute the parameters of the sigmoid function.
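The noise-section logic just described could be sketched per band as below. The class name and the `max_spread` threshold are hypothetical: the patent says only that the variation must be small over about 20 frames and does not give a numerical criterion.

```python
from collections import deque


class BandNoiseTracker:
    """Tracks the background-noise average X_avg(k) for a single band k."""

    def __init__(self, window=20, max_spread=3.0):
        self.history = deque(maxlen=window)  # most recent X(n, k) values
        self.max_spread = max_spread         # assumed "small change" limit
        self.x_avg = None                    # None until noise is first seen

    def update(self, x):
        """Feed one X(n, k); return True if the window looks like noise.

        While the recent values stay within max_spread of each other the
        average is refreshed; otherwise (speech) it is left unchanged.
        """
        self.history.append(x)
        window_full = len(self.history) == self.history.maxlen
        if window_full and max(self.history) - min(self.history) <= self.max_spread:
            self.x_avg = sum(self.history) / len(self.history)
            return True
        return False
```

One tracker would be kept for each filter-bank band, and its `x_avg` read whenever `update` returns False after an average has been established.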

[Equation 5]

Here, X_max(k) is the maximum spectral magnitude of the k-th band, and a and b are the parameters of a linear function that can be determined experimentally from the relationship between X_avg(k) and the minimum of the input dynamic range of the sigmoid function. r is a variable that determines where the input X_c(n,k), at which the sigmoid function value equals 0.5A, is placed between the maximum of the dynamic range X_max(k) and b + aX_avg(k).
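Equation 5 is not reproduced in this text. One reading consistent with the description, offered as an assumption, is that the lower end of the input range is b + a·X_avg(k), that DX is the span from there up to X_max(k), and that X_c sits a fraction r of the way up that span. The function name is hypothetical.

```python
def dynamic_range_targets(x_avg, x_max, a, b, r):
    """Return (x_c, dx) from the band's noise average, under assumed Eq. 5.

    x_min = b + a * x_avg      (assumed lower end of the input range)
    dx    = x_max - x_min      (assumed dynamic range)
    x_c   = x_min + r * dx     (assumed position of the 0.5A point)
    """
    x_min = b + a * x_avg
    dx = x_max - x_min
    if dx <= 0.0:
        raise ValueError("noise level leaves no dynamic range below x_max")
    x_c = x_min + r * dx
    return x_c, dx
```

Under this reading, a louder background raises `x_min`, which narrows `dx` and shifts `x_c` upward, so the sigmoid's active region moves to higher amplitudes much as the LSFR hair cells take over from the HSFR cells.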

The dynamic range normalization step of the feature extraction method therefore operates as follows.

First, the background noise is located in the input signal and X_avg(k) is obtained. When a speech signal arrives, X_c(n,k) and DX(n,k) are obtained from X_avg(k) using Equation 5.

The shape of the sigmoid function is then determined by obtaining α(k) and β(k) from X_c(n,k) and DX(n,k) using Equation 4. The dynamic range of the frequency spectrum is then normalized by the sigmoid function of Equation 1 with these values of α(k) and β(k).

The speech feature extraction method by dynamic range normalization of the spectrum using the sigmoid function operates as follows.

First, the spectrum analysis unit (1) extracts frequency spectrum information from the speech signal frame by frame.

From the spectrum thus obtained, a simplified spectral envelope is obtained through the filter bank unit (2), which consists of roughly 10 to 20 mel-scaled filters.

The amplitude of the simplified spectrum is compressed with a logarithmic function in the log compression unit (3), and the noise section detection unit (4) sets the section from which the information for obtaining the parameters of the sigmoid function is taken.

The sigmoid parameter calculation unit (5) then obtains the parameters of the sigmoid function in order to normalize the dynamic range of the spectral signal, and the sigmoid compression unit (6) normalizes the spectral signal.

Finally, the discrete cosine transform unit (7) obtains, from the spectral signal normalized by the sigmoid function, the cepstrum used as the feature vector by the recognition algorithm.

To evaluate the performance of the speech feature extraction method by dynamic range normalization of the spectrum according to the present invention, the final spectrum in the presence of background noise was first obtained.

For this purpose, the variables in Equation 5 were set as follows.

X_max(k) = 78.0 for all k

a = 0.625

b = 27.5

r = 0.25

The constant A in Equation 1 was set to 50.0.
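To see what these settings imply, the sketch below plugs them into one assumed reading of the missing equations: lower range end b + a·X_avg, DX = X_max minus that lower end, X_c a fraction r of the way up, and a sigmoid of the form A / (1 + exp(−α(X − X_c))) with α = 2·ln(9)/DX. The function and constant names are hypothetical; only the numerical values come from the patent.

```python
import math

# Experimental settings reported in the patent.
X_MAX = 78.0
A_SLOPE = 0.625
B_OFFSET = 27.5
R_POSITION = 0.25
AMPLITUDE = 50.0


def normalize_band(x, x_avg):
    """Normalize one log-spectral value x given the band's noise average.

    All formulas here are assumptions standing in for Equations 1, 4 and 5.
    """
    x_min = B_OFFSET + A_SLOPE * x_avg
    dx = X_MAX - x_min
    if dx <= 0.0:
        raise ValueError("noise level leaves no dynamic range below X_MAX")
    x_c = x_min + R_POSITION * dx
    alpha = 2.0 * math.log(9.0) / dx
    return AMPLITUDE / (1.0 + math.exp(-alpha * (x - x_c)))
```

For a noise average of 40, this reading gives a lower end of 52.5, DX = 25.5 and X_c = 58.875, so an input of 58.875 maps to 25.0 (half of A). The same input value maps lower when the noise average is higher, because the whole curve shifts toward X_max.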

One frame was 10 msec, the Hamming window was applied over three frames, and a 256-point fast Fourier transform (FFT) was used. The pre-emphasis parameter was 0.97, and the filter bank was configured with 19 bands. Each filter is triangular: the center frequencies were obtained on a log scale, and each filter has the value "1" at its own center frequency and "0" at the center frequencies of the adjacent filters.
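A triangular filter bank with this shape can be sketched as follows. The patent says only that the center frequencies were obtained on a log scale; the mel formula used here (2595·log10(1 + f/700)) is a common choice and is an assumption, as are the function names and the 8 kHz sampling rate taken from the earlier framing example.

```python
import math


def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def triangular_filter_bank(num_bands=19, nfft=256, sample_rate=8000):
    """Return num_bands weight vectors over the nfft // 2 + 1 FFT bins.

    Each filter rises from 0 at the previous center to 1 at its own center
    and falls back to 0 at the next center.
    """
    top_mel = hz_to_mel(sample_rate / 2.0)
    # num_bands centers plus one extra edge on each side, equally spaced in mel.
    edges_hz = [mel_to_hz(top_mel * i / (num_bands + 1))
                for i in range(num_bands + 2)]
    bin_hz = sample_rate / nfft
    bank = []
    for band in range(1, num_bands + 1):
        left, center, right = edges_hz[band - 1], edges_hz[band], edges_hz[band + 1]
        weights = []
        for i in range(nfft // 2 + 1):
            freq = i * bin_hz
            if left < freq <= center:
                weights.append((freq - left) / (center - left))
            elif center < freq < right:
                weights.append((right - freq) / (right - center))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def apply_filter_bank(power_spectrum, bank):
    """Weighted sum of the power spectrum under each triangular filter."""
    return [sum(w * p for w, p in zip(weights, power_spectrum))
            for weights in bank]
```

Passing a 129-bin power spectrum through `apply_filter_bank` yields the 19 band values that are then log-compressed.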

FIG. 7 shows the frequency spectrum obtained by the conventional MFCC method, and FIG. 8 shows the frequency spectrum when white noise at 10 decibels (dB) is mixed in.

With noise mixed in, the frequency spectrum takes large values across all the high-frequency bands, which makes them difficult to tell apart. FIGS. 9 and 10, by contrast, show the spectrum obtained by the dynamic range normalization method of the present invention without noise and with white noise at 10 dB, respectively. As FIGS. 9 and 10 show, information could be obtained in the high-frequency bands even with noise mixed in; the information in the signal is thus preserved even in the presence of noise.

The same result appears in the following speech recognition experiment.

The recognition experiment was performed on Korean connected digits. The recognition algorithm was a continuous-mixture hidden Markov model (hereinafter HMM). As in the preceding experiment, white noise was added as background noise; the noisy data were produced by mixing in NOISEX noise data starting from a random point.
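Mixing noise into clean speech at a chosen signal-to-noise ratio, starting from a random point in the noise recording, might be done as in the sketch below. The function name is hypothetical, and the SNR definition used (ratio of average powers over the whole utterance) is an assumption; the patent does not say how its SNR values were measured.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db, rng=None):
    """Add a randomly positioned noise segment to speech at the given SNR.

    The noise list must be at least as long as the speech list.
    """
    rng = rng or random.Random()
    start = rng.randrange(len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(v * v for v in segment) / len(segment)
    # Scale the noise so that speech_power / scaled_noise_power hits the target.
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * v for s, v in zip(speech, segment)]
```

Passing a seeded `random.Random` instance as `rng` makes the chosen noise offset reproducible across runs.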

To compare performance with the method of the present invention, recognition experiments were also carried out in the same way with feature extraction by MFCC and by RASTA. The resulting error rates are shown in FIG. 11. The horizontal axis gives the amount of noise mixed in, expressed as the signal-to-noise ratio (SNR), and the vertical axis gives the error rate. An SNR of 30 dB here denotes clean speech. The error rate of MFCC rises sharply with noise, whereas the RASTA feature vectors hold their error rate to some extent even under considerable noise; below 15 dB, however, their error rate also rises sharply.

In contrast, with feature extraction by spectrum dynamic range normalization (hereinafter SDRN), the increase in error rate is suppressed even as more noise is mixed in. The SDRN feature vectors are therefore robust to noise.

As described in detail above, the present invention improves the speech feature extraction stage, which is the preprocessing stage of a speech recognition system. It compensates for the distortion of the signal's frequency spectrum caused by background noise by maintaining or normalizing the dynamic range of each spectral band.

The invention also applies the fact that, in the human auditory system, different types of hair cells produce the neural information depending on the level of background noise. The normalization range and position of the dynamic range are set differently according to the level of background noise, and extracting speech features from the frequency spectrum processed in this way minimizes the drop in recognition rate even when the background noise grows. Using this system therefore improves speech recognition performance.

In particular, when background noise such as office noise or music is present, normalizing the spectrum yields feature vectors similar to those obtained without noise, so the performance of the speech recognition system is maintained.

FIG. 1 is an exemplary diagram of the conventional speech feature extraction method (MFCC).

FIG. 2 is an exemplary diagram of spectral analysis using the fast Fourier transform (FFT).

FIG. 3 is an exemplary diagram of a feature extraction method using RASTA filtering.

FIG. 4 is a waveform diagram showing how the value of one band of the frequency spectrum changes over time.

FIG. 5 is a block diagram of the apparatus for extracting speech features by dynamic range normalization of the spectrum according to the present invention.

FIG. 6 is a waveform diagram of the firing rate characteristics of HSFR and LSFR hair cells.

FIG. 7 is a waveform diagram of the spectral characteristics obtained by MFCC for a clean speech signal.

FIG. 8 is a waveform diagram of the spectral characteristics obtained by MFCC for a speech signal mixed with white noise.

FIG. 9 is a waveform diagram of the spectral characteristics obtained by SDRN for a clean speech signal.

FIG. 10 is a waveform diagram of the spectral characteristics obtained by SDRN for a speech signal mixed with white noise.

FIG. 11 is an exemplary diagram comparing the error rates of the feature extraction methods in the connected-digit recognition experiment at different noise levels.

* Explanation of reference numerals for the main parts of the drawings *

1: spectrum analysis unit    2: filter bank unit

3: log compression unit    4: noise section detection unit

5: sigmoid parameter calculation unit    6: sigmoid compression unit

7: discrete cosine transform unit

Claims

1. An apparatus for extracting speech features by dynamic range normalization of a spectrum, comprising:

a spectrum analysis unit (1) that analyzes the spectrum of an input speech signal frame by frame;

a filter bank unit (2) that obtains a simplified spectrum, through a mel-scale filter bank, from the spectrum provided by the spectrum analysis unit;

a log compression unit (3) that reduces the dynamic range of the spectrum provided by the filter bank unit;

a noise section detection unit (4) that sets a section from which information for obtaining parameters of a sigmoid function is taken;

a sigmoid parameter calculation unit (5) that obtains the parameters of the sigmoid function, according to the magnitude of the background noise in the noise section, in order to normalize the dynamic range of the spectral signal;

a sigmoid compression unit (6) that normalizes the spectrum using the parameters of the sigmoid function provided by the sigmoid parameter calculation unit; and

a discrete cosine transform unit (7) that obtains, from the spectral signal normalized by the sigmoid function, a cepstrum as a feature vector used in a recognition algorithm.

2. A method for extracting speech features by dynamic range normalization of a spectrum, comprising:

a spectrum analysis process of extracting frequency spectrum information frame by frame from an input speech signal;

a filter bank process of obtaining a simplified spectral envelope from the extracted spectrum through a mel-scale filter bank;

a log compression process of compressing the dynamic range of the simplified spectral signal with a logarithmic function;

a noise section detection process of setting a section from which information for obtaining parameters of a sigmoid function is taken;

a sigmoid parameter calculation process of obtaining the parameters of the sigmoid function, according to the magnitude of the background noise in the noise section, in order to normalize the dynamic range of the spectral signal;

a sigmoid compression process of normalizing the spectral signal using the parameters of the sigmoid function; and

a discrete cosine transform process of obtaining, from the spectral signal normalized by the sigmoid function, a cepstrum as a feature vector used in a recognition algorithm.

3. The method of claim 2, wherein the noise section detection process sets, within the background noise, the section from which the information for obtaining the parameters of the sigmoid function is taken, where X_max(k) is the maximum spectral magnitude of the k-th band, and a and b are parameters of a linear function that can be determined experimentally from the relationship between X_avg(k) and the minimum of the input dynamic range of the sigmoid function.

4. The method of claim 2, wherein the sigmoid parameter calculation process determines the shape of the sigmoid function by obtaining its parameters α(k) and β(k), where X_c(n,k) is the input at which r(n,k) = 0.5A, and the dynamic range DX(n,k) of the signal X(n,k) is the difference between the X(n,k) at which r(n,k) = 0.9A and the X(n,k) at which r(n,k) = 0.1A.

5. The method of claim 2, wherein the sigmoid compression process normalizes the spectrum with the sigmoid function according to the magnitude of the background noise in each spectral band, where α(k) and β(k) are the parameters that determine the shape of the sigmoid function for the k-th band, X(n,k) is the log-compressed spectral value of the k-th frequency band of the n-th frame, and r(n,k) is that value normalized by the sigmoid function.