KR100381372B1

KR100381372B1 - Apparatus for feature extraction of speech signals

Info

Publication number: KR100381372B1
Application number: KR10-2001-0033915A
Authority: KR
Inventors: 김창민; 오상훈; 원영걸; 이수영
Original assignee: 주식회사 엑스텔테크놀러지
Priority date: 2001-06-15
Filing date: 2001-06-15
Publication date: 2003-04-26
Also published as: KR20020095731A

Abstract

The present invention relates to a speech feature extraction apparatus. Its object is to extract frequency and intensity information in a manner similar to the speech features extracted by the cochlea while accommodating a perceptual characteristic of hearing, namely the variation of sensitivity with the frequency of the auditory signal, so that the features are less affected by noise; to extract the speech features with a single feature vector composer and to reduce the extracted features so as to cut the data volume, thereby greatly lowering the hardware implementation cost; and to normalize the full set of feature data obtained within a speech segment with respect to time and magnitude, so that performance is not sensitive to loudness or utterance duration.

The invention comprises: feature vector composition means that detects, according to the perceptual characteristics of hearing, the frequency and intensity information of the per-band speech signals output from a plurality of band-pass filters and computes a speech feature vector with reduced sensitivity to noise; feature reduction means that reduces the dimension and number of the speech feature vectors output from the feature vector composition means; and normalization means that stores the speech feature vectors output from the feature reduction means for the duration of a speech segment and normalizes them with respect to time and intensity.

Description

Apparatus for feature extraction of speech signals

The present invention relates to a speech recognition apparatus and, more particularly, to a speech feature extraction apparatus that extracts the multiple frequency and intensity components of speech in a manner similar to the features extracted by the human cochlea, such that the extracted features are insensitive to ambient noise.

In general, as shown in FIG. 1, a speech recognition apparatus consists of a feature extraction unit (11) that extracts features from the input speech signal and a recognizer (12) that recognizes speech from the feature data extracted by the feature extraction unit (11).

The speech signal input to the feature extraction unit (11) first goes through a step that extracts features in a form suitable for speech recognition; the recognizer (12) then performs recognition on the result.

Various methods are used in the feature extraction unit (11); MFCC (Mel-Frequency Cepstrum Coefficient) and PLPCC (Perceptual Linear Prediction Cepstrum Coefficient) are representative.

For the recognizer (12), methods such as HMM (Hidden Markov Model), DTW (Dynamic Time Warping), and neural networks are widely used. However, implementing these feature extraction methods in hardware is costly, which makes it impractical to build a speech recognition device that can be readily applied in everyday use. Because these methods are difficult to implement as an ASIC, they must be run purely in software or on a digital signal processor (DSP), which raises the system implementation cost.

Methods that extract simpler speech features exist to reduce this hardware cost, but their performance degrades under variations in utterance and under adverse signal conditions (noise, microphone and channel distortion, room reverberation, and so on).

To address these obstacles to real-world use of speech recognition systems, Ghitza published the Ensemble Interval Histogram (hereinafter "EIH") model in 1994, a simpler feature extraction method that is less affected by adverse signal conditions and utterance variation.

The EIH model is a model of the human auditory system; it represents the speech signal as frequency and intensity information.

FIG. 2 is a block diagram of the prior-art EIH model. When a speech signal is input, the band-pass filters (BPF) (121∼124) decompose it, as the cochlea of the ear does, into components passed through several different frequency bands.

The nonlinear response of hearing to changes in loudness is realized through the nonlinearity of the band-pass filters (121∼124).

To extract frequency and intensity information, the level-crossing detection unit (141) compares the output of each band-pass filter (121∼124) against several level values and records where the output crosses each level.

The interval histogram unit (142) interprets the level-crossing information as frequency and intensity information estimated from the crossing times, and builds an interval histogram.

The interval histograms representing frequency and intensity information built from the outputs of the band-pass filters (121∼124) are summed by the adder (170), yielding matrix-form data EIH(t, f) that indicates what intensity the input speech signal has at which frequency at which time.

The EIH model extracts good features for recognition by simply mimicking the human auditory system, but hardware implementing it is complex because multiple level crossings must be measured at each filter output: with N band-pass filters and M level-crossing detectors connected to each filter, a total of M × N level-crossing detectors are required. How the level values are set is also an important parameter, and the performance of the EIH model varies considerably with the level values and their number.

The ZCPA (Zero-Crossing with Peak Amplitude) model was proposed to solve these problems. It simplifies the EIH model, is easy to implement in hardware, and requires no parameter setting.

Described in detail with reference to FIG. 3, this model consists of a plurality of band-pass filters (221∼224) that band-filter the speech signal; feature extraction units (230∼260), each including a zero-crossing detector (241), an interval histogram unit (242), a peak detector (243), and a nonlinear converter (244), that extract features from the speech signal passed by each band-pass filter (221∼224); and an adder (270) that sums the interval histograms output from the interval histogram unit (242) of each feature extraction unit.

In the ZCPA system configured in this way, when human speech is input, the band-pass filters (221∼224) pass the speech signal through several different frequency bands, as the cochlea of the human ear does, and the zero-crossing detector (241) detects the zero-crossing points of each band output. The interval histogram unit (242) then builds an interval histogram from the zero-crossing intervals detected by the zero-crossing detector (241) and the peak value within each zero-crossing interval detected by the peak detector (243).

The nonlinear response of human hearing to changes in loudness is implemented by passing the peak value extracted by the peak detector (243) through the nonlinear converter (244) before it is accumulated into the interval histogram. The interval histograms representing the frequency and intensity information at the output of each band-pass filter (221∼224) are summed by the adder (270), producing matrix-form data ZCPA(t, f) that indicates what intensity the input speech signal has at which frequency at which time.
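The ZCPA processing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the prior-art implementation: the function names, the frequency bin edges, and the compression log(1 + peak) used for the nonlinear converter are all hypothetical choices made for the example.

```python
import math

def upward_zero_crossings(x):
    """Indices n where the signal goes from negative to non-negative."""
    return [n for n in range(1, len(x)) if x[n - 1] < 0 <= x[n]]

def zcpa_band_histogram(x, sample_rate, bin_edges_hz):
    """Interval histogram for one band-pass output (ZCPA-style sketch)."""
    hist = [0.0] * (len(bin_edges_hz) - 1)
    crossings = upward_zero_crossings(x)
    for start, end in zip(crossings, crossings[1:]):
        freq = sample_rate / (end - start)   # reciprocal of the interval
        peak = max(x[start:end])             # single peak value in the interval
        for b in range(len(hist)):
            if bin_edges_hz[b] <= freq < bin_edges_hz[b + 1]:
                hist[b] += math.log(1.0 + peak)  # assumed compression function
                break
    return hist

def zcpa_frame(band_outputs, sample_rate, bin_edges_hz):
    """Sum the per-band histograms, as the adder (270) does."""
    total = [0.0] * (len(bin_edges_hz) - 1)
    for x in band_outputs:
        band_hist = zcpa_band_histogram(x, sample_rate, bin_edges_hz)
        for b, value in enumerate(band_hist):
            total[b] += value
    return total
```

Calling `zcpa_frame` once per frame and stacking the results over time would give a matrix of the kind the text calls ZCPA(t, f). Note that each band needs its own pass through the detector logic, which is the hardware duplication the invention sets out to avoid.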

Because the ZCPA model uses zero-crossing and peak detection in place of level crossing, it is much simpler than the EIH model while still extracting features adequate for recognition.

However, the prior-art ZCPA model builds a separate interval histogram after zero-crossing and peak detection at the output of each band-pass filter and then combines the histograms through an adder to output the matrix-form feature data, so it needs as many zero-crossing detectors, peak detectors, and histogram builders as there are band-pass filters. In addition, because it measures only a single peak value between zero-crossing points and uses it as the intensity information, the extracted features are likely to be strongly affected by noise. Finally, it does not take into account the variation of sensitivity with the frequency of the auditory signal, which is one of the perceptual characteristics of human hearing.

The present invention is intended to solve the above problems of the prior art. Its object is to provide a speech feature extraction apparatus that extracts frequency and intensity information in a manner similar to the speech features extracted by the cochlea while accommodating the variation of sensitivity with the frequency of the auditory signal, a perceptual characteristic of hearing, so that the features are less affected by noise; that extracts the speech features with a single feature vector composer and reduces the extracted features to cut the data volume, thereby greatly lowering the hardware implementation cost; and that normalizes the full set of feature data obtained within a speech segment with respect to time and magnitude, so that performance is not sensitive to loudness or utterance duration.

FIG. 1 is a block diagram schematically showing a general speech recognition apparatus;

FIG. 2 is a block diagram of a prior-art speech feature extraction apparatus using the EIH method;

FIG. 3 is a block diagram of a prior-art speech feature extraction apparatus using the ZCPA method;

FIG. 4 is a block diagram of the speech feature extraction apparatus according to the present invention;

FIG. 5 is an operational flowchart of the feature vector composition unit of FIG. 4;

FIG. 6a is a plot of the speech features output from the feature vector composition unit of FIG. 4 over one speech segment;

FIG. 6b is a plot of the speech features output from the feature reduction unit of FIG. 4 over one speech segment; and

FIG. 6c is a plot of the speech features output from the normalization unit of FIG. 4 over one speech segment.

Description of reference numerals for the main parts of the drawings

301∼304: band-pass filters          310: data connection unit

320: feature vector composition unit    321: zero-crossing detector

322: speech intensity detector       323: frequency sensitivity adjuster

324: nonlinear converter             325: feature accumulator

330: feature reduction unit          340: normalization unit

To achieve this object, the speech feature extraction apparatus of the invention, which divides an input speech signal into a plurality of different frequency bands through band-pass filters and extracts features of the input speech signal, is characterized by comprising: feature vector composition means that detects, according to the perceptual characteristics of hearing, the frequency and intensity information of each per-band speech signal output from the band-pass filters and computes a speech feature vector with reduced sensitivity to noise; feature reduction means that reduces the dimension and number of the speech feature vectors output from the feature vector composition means; and normalization means that stores the speech feature vectors output from the feature reduction means for the duration of a speech segment and normalizes them with respect to time and intensity.

The invention is described in detail below with reference to the accompanying drawings.

FIG. 4 is a block diagram of a speech feature extraction apparatus according to an embodiment of the invention. It consists of: first to N-th band-pass filters (301∼304) that divide the input speech signal into a plurality of different frequency bands; a data connection unit (310) that sequentially selects and outputs, in a set order, the speech signals output from the band-pass filters (301∼304); a feature vector composition unit (320) that detects the zero-crossing points and intensity information of each per-band speech signal received from the band-pass filters (301∼304) through the data connection unit (310) and applies the perceptual characteristics of hearing to the detected values to compute a speech feature vector (SD(f), f = 1, 2, …, N) with reduced sensitivity to noise; a feature reduction unit (330) that reduces the dimension of the speech feature vector SD(f) output from the feature vector composition unit (320), or converts the vectors collected over a given time into a single vector, and outputs the reduced feature vector (RSD(f), f = 1, 2, …, NF); and a normalization unit (340) that normalizes the feature vectors RSD(f) output from the feature reduction unit (330) with respect to time and intensity over a given speech segment and outputs the resulting data (FSD(t, f), t = 1, 2, …, NT; f = 1, 2, …, NF).

The feature vector composition unit (320) consists of: a zero-crossing detector (321) that detects the zero-crossing points of the speech signal passed by each band-pass filter (301∼304); a speech intensity detector (322) that detects the intensity of the speech signal passed by each band-pass filter (301∼304); a frequency sensitivity adjuster (323) that applies a frequency-dependent sensitivity, similar to that of the human auditory system, to the intensity detected by the speech intensity detector (322); a nonlinear converter (324) that nonlinearly transforms the intensity information output from the frequency sensitivity adjuster (323); and a feature accumulator (325) that accumulates the per-frequency intensity components produced by the nonlinear converter (324) to compute the feature vector.

The operation of this embodiment is described in more detail below with reference to FIGS. 4 to 6c.

The invention proceeds through a step of filtering the speech signal converted to a digital signal; a step of extracting, with a single feature vector composer, noise-robust speech features from the filtered signal in a manner similar to human perception; a step of reducing the size of the extracted feature result; and a step of normalizing the reduced speech features in time and magnitude.

FIG. 4 is a block diagram of the speech feature extraction apparatus according to the invention that implements this process.

The speech signal is input through a speech input device (not shown), converted to a digital signal, and fed to the plurality of band-pass filters (301∼304). Each of the band-pass filters (301∼304) has a different pass band, so the input speech signal is split and output according to each filter's band-pass characteristic.

The speech signals split according to the band-pass characteristics are selected by the data connection unit (310) in a set order and passed sequentially to the feature vector composition unit (320).

The feature vector composition unit (320) accumulates frequency and intensity features by detecting the zero-crossing points and intensity information of the sequentially input speech signals.

FIG. 5 is an operational flowchart of the feature vector composition unit (320). First, all feature vector components (SD(f), f = 1, 2, …, N) are initialized to 0 (S101), the index used by the data connection unit (310) to connect the band-pass filters (301∼304) is initialized (i = 0) (S102), and the speech signals are then input sequentially from the band-pass filters (301∼304) in the set order (S103).

For example, the feature vector composition unit (320) operates on the output of the first band-pass filter (301) and records the feature information in the feature vector SD(f); it then extracts the features from the output of the second band-pass filter (302) and accumulates them onto the previous feature vector.

To this end, when the i-th band-pass filter is connected to the feature vector composition unit (320), the zero-crossing detector (321) detects the zero-crossing points and the speech intensity detector (322) detects the speech intensity (S104, S106). The intensity is obtained in integral form over all of the data between zero-crossing points.

The frequency sensitivity adjuster (323) adjusts the intensity detected by the speech intensity detector (322) for frequency sensitivity, taking into account the relationship between stimulus intensity and response that auditory nerve cells exhibit in each frequency band. The nonlinear converter (324) then applies the nonlinear transformation characteristic of auditory nerve cells to the intensity information that has passed through the frequency sensitivity adjuster (323). The result is a feature value for a particular frequency.

The feature accumulator (325) adds the feature value to the corresponding frequency component (f) of the feature vector SD(f) (S107).

Computation of the feature vector SD(f) continues, with the data connection unit connecting to the next band-pass filter each time, until the signals output from all band-pass filters (301∼304) have been processed (S108, S109).
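The flow of FIG. 5 can be sketched as a single loop in which one shared composer visits the band outputs in turn. This is only an illustration under stated assumptions: the function name, the frequency bin edges, the per-bin `sensitivity` weights standing in for the frequency sensitivity adjuster, the use of a sum of absolute values as the "integral form" of the intensity, and the compression log(1 + v) are hypothetical choices not specified in the text.

```python
import math

def compose_feature_vector(band_outputs, sample_rate, bin_edges_hz, sensitivity):
    """Accumulate SD(f) over all band-pass outputs with one shared composer.

    band_outputs : list of sample lists, one per band-pass filter (one frame)
    bin_edges_hz : N + 1 ascending edges defining the N frequency bins
    sensitivity  : N weights, an assumed stand-in for the frequency sensitivity
    """
    sd = [0.0] * (len(bin_edges_hz) - 1)              # S101: clear SD(f)
    for x in band_outputs:                            # S102, S103, S108, S109: visit bands in order
        # S104: upward zero-crossing points of this band output
        crossings = [n for n in range(1, len(x)) if x[n - 1] < 0 <= x[n]]
        for start, end in zip(crossings, crossings[1:]):
            freq = sample_rate / (end - start)        # reciprocal of the interval
            # S106: intensity from all samples in the interval (absolute sum assumed)
            intensity = sum(abs(v) for v in x[start:end])
            for f in range(len(sd)):
                if bin_edges_hz[f] <= freq < bin_edges_hz[f + 1]:
                    # sensitivity weighting, assumed compression, then S107: accumulate
                    sd[f] += math.log(1.0 + sensitivity[f] * intensity)
                    break
    return sd
```

Because the bands are processed one after another into the same `sd` list, only one set of detector and accumulator logic is needed, which mirrors the single-composer arrangement described for the feature vector composition unit (320).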

To accommodate the perceptual characteristics of hearing and extract features that are less sensitive to noise, the feature vector composition unit (320) computes its output SD(f) according to Equation 1 below.

Here, x_k(n; m) is the output of the k-th filter in the speech frame referenced to a given time m; n is the time index within that frame; Z_k is the number of zero crossings in the output of the k-th filter; n_l is the l-th upward-going zero-crossing point; f_l is the index of the frequency obtained as the reciprocal of the time difference between the l-th and (l+1)-th zero-crossing points; and g_f(·) is a monotonically increasing function representing the relationship between stimulus intensity and response in auditory nerve cells.
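The body of Equation 1 is not reproduced in this text. One form consistent with the symbol definitions and with the description of an intensity taken in integral form between zero-crossing points is sketched below. It is an assumption offered for orientation only: the Kronecker delta used to select the frequency bin and the absolute value inside the inner sum are illustrative choices, not the patent's stated formula.

```latex
SD(f) \;=\; \sum_{k=1}^{N} \; \sum_{l=1}^{Z_k - 1}
  \delta_{f,\,f_l}\;
  g_{f_l}\!\left( \sum_{n=n_l}^{n_{l+1}-1} \bigl| x_k(n;\, m) \bigr| \right),
\qquad f = 1, 2, \ldots, N
```

In this reading, each pair of consecutive upward zero crossings contributes one compressed intensity value to the bin f_l given by the reciprocal of the interval length, and the contributions of all N filter outputs are added into the same vector.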

In particular, g_f(·) is determined as a function of frequency: in addition to the loudness-dependent nonlinear transformation g(·) considered in ZCPA, it takes into account the variation of sensitivity with the frequency of the auditory signal, a perceptual characteristic of hearing. The relationship between g_f(·) and the nonlinear function g(·), which depends only on magnitude and not on frequency, is expressed by Equation 2 below.

g_f(v) = g(E(ω) · v)    (Equation 2)

Here, E(ω) is the sensitivity as a function of frequency.

ZCPA uses only the peak value as the intensity information for the corresponding frequency and is therefore affected by noise. Here, as Equation 1 shows, all of the data between zero-crossing points are used in integral form, so the result is less affected by noise.
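A small worked example makes the difference concrete. It compares how much a single noise spike landing on the peak sample changes a peak-based measure and an integral-based measure of the same half-cycle. The sample values and the absolute-sum form of the integral are assumptions chosen for illustration, and the example covers only this one kind of disturbance.

```python
def peak_measure(segment):
    """Intensity as the single largest sample (peak-based, as in ZCPA)."""
    return max(segment)

def integral_measure(segment):
    """Intensity as the sum of absolute sample values (assumed integral form)."""
    return sum(abs(v) for v in segment)

def relative_change(measure, clean, noisy):
    """Relative change of a measure when noise is added to the segment."""
    return abs(measure(noisy) - measure(clean)) / measure(clean)

clean = [0.0, 0.5, 0.9, 1.0, 0.9, 0.5, 0.0]   # samples between two zero crossings
noisy = list(clean)
noisy[3] += 0.5                                # one impulsive noise sample on the peak

peak_change = relative_change(peak_measure, clean, noisy)          # 0.5 / 1.0
integral_change = relative_change(integral_measure, clean, noisy)  # 0.5 / 3.8
```

In this case the peak-based measure shifts by 50 percent while the integral-based measure shifts by about 13 percent, because the spike is diluted by the other samples in the interval.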

FIG. 6a is a graph of the output of the feature vector composition unit (320) over a speech segment. It plots SD(f) (f = 1, 2, …, 16) for the 45 frames obtained within one speech segment.

The SD(f) output from the feature vector composition unit (320) is input to the feature reduction unit (330), which reduces the dimension (N) of the vector or reduces the number of vectors over a given time.

The feature vector SD(f) received from the feature vector composition unit (320) is reduced using PCA (Principal Component Analysis), ICA (Independent Component Analysis), or a neural network, or a combination of these methods. An arithmetic averaging method is also used for speech feature reduction.

Using these reduction methods, the amount of data is reduced either by lowering the dimension of the vector from N to NF or by arithmetically combining the T vectors collected over a period T into a single vector.

Reducing the dimension of the vector here means using the correlation that exists among the frequency components within a feature vector to transform it into a feature vector whose components are only very weakly correlated.

Converting the feature vectors over the period T into a single feature vector by arithmetic computation means using the correlation of the speech signal along the time axis to transform them into a feature vector that adequately reflects the change over time.
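The arithmetic-averaging variant of the reduction can be sketched as follows. The function name and the choice to drop an incomplete trailing group are assumptions made for this example; the text does not specify how leftover frames are handled.

```python
def reduce_by_time_average(frames, t):
    """Average each run of t consecutive feature vectors into one vector.

    frames : list of equal-length feature vectors SD(f), one per frame
    t      : number of consecutive frames combined into one reduced vector
    An incomplete group at the end is dropped (an assumption of this sketch).
    """
    reduced = []
    for start in range(0, len(frames) - t + 1, t):
        group = frames[start:start + t]
        # zip(*group) walks the group component by component
        reduced.append([sum(component) / t for component in zip(*group)])
    return reduced
```

With 45 input frames of 16 components and t = 5, this returns 9 frames of 16 components, which matches the frame counts quoted for the plotted example. Whether the plotted example was actually produced by plain averaging is not stated.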

FIG. 6b is a three-dimensional graph of the output of the feature reduction unit (330) over the speech segment. It shows RSD(f) (f = 1, 2, …, 16), reduced to 9 frames by condensing the SD(f) (f = 1, 2, …, 16) of FIG. 6a along the feature extraction time axis.

The normalization unit (340) stores the feature vectors RSD(f) (f = 1, 2, …, NF) passed from the feature reduction unit (330) for the duration of the speech segment, normalizes them, and outputs the resulting data FSD(t, f) (t = 1, 2, …, NT; f = 1, 2, …, NF). Because the duration and loudness of an utterance differ from speaker to speaker, normalization is intended to absorb the resulting variation along the time axis of the feature vectors and in the magnitude of the feature components.

FIG. 6c is a graph of the output of the normalization unit (340) over the speech segment. Whereas RSD(f) in FIG. 6b has 9 frames, FSD(t, f) has 16 frames after time normalization. This speech input was uttered in less time than the average utterance duration, so normalization yields more FSD frames than RSD frames.
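The text does not say which normalization method is used. As one possible reading, the sketch below resamples the stored frames to a fixed count NT by linear interpolation along the time axis and scales the result by its largest component. Both choices, and the function names, are assumptions made for illustration; the input is assumed to be a non-empty list of non-negative feature vectors.

```python
def normalize_time(frames, nt):
    """Resample a list of feature vectors to exactly nt frames (linear interpolation)."""
    out = []
    last = len(frames) - 1
    for t in range(nt):
        pos = t * last / (nt - 1) if nt > 1 else 0.0   # position on the input time axis
        lo = int(pos)
        hi = min(lo + 1, last)
        w = pos - lo                                   # interpolation weight
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out

def normalize_intensity(frames):
    """Scale all components so that the largest one in the segment equals 1."""
    peak = max(max(row) for row in frames)
    if peak == 0:
        return [list(row) for row in frames]
    return [[v / peak for v in row] for row in frames]
```

Applying `normalize_time(rsd, 16)` to a 9-frame RSD stretches it to 16 frames, as in the plotted example, and a longer utterance would be compressed to the same 16 frames, so the recognizer always receives an NT × NF matrix.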

In another embodiment of the invention, the feature vector SD(f) output from the feature vector composition unit (320) is sent directly to the normalization unit (340), without feature reduction by the feature reduction unit (330), and the speech features are extracted through the normalization process.

The storage required for the speech feature extraction apparatus according to the invention to operate, assuming all data are represented in 8 bits, is calculated for each unit as follows.

The feature vector composition unit (320) needs N bytes to store the SD(f) (f = 1, 2, …, N) used for feature accumulation.

The feature reduction unit (330) needs N × T bytes to store the SD(f) (f = 1, 2, …, N) received from the feature vector composition unit (320) over the period T, plus NF bytes to store the reduction result RSD(f) (f = 1, 2, …, NF).

The normalization processing unit 340 requires (ST/T) × NF bytes to store RSD(f) (f = 1, 2, ..., NF) over the speech interval ST, and NT × NF bytes to store the normalized result FSD(t, f) (t = 1, 2, ..., NT; f = 1, 2, ..., NF).
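The storage figures above can be tallied with a short calculation. The sketch below is illustrative only: the function name `storage_bytes` and the example parameter values are hypothetical, T and ST are assumed to be counted in frames, and ST/T is rounded up when ST is not a multiple of T (the source does not say how a remainder is handled).

```python
def storage_bytes(n, t, nf, st, nt):
    """Return the per-unit and total storage in bytes at 8 bits per value.

    n  : number of SD(f) components (N)
    t  : number of SD frames held by the feature reduction unit (T)
    nf : number of RSD(f) components after reduction (NF)
    st : length of the speech interval (ST), same unit as t
    nt : number of frames after time normalization (NT)
    """
    reduced_frames = -(-st // t)          # ST/T, rounded up (assumption)
    composition = n                       # SD(f)
    reduction = n * t + nf                # SD(f) over T, plus RSD(f)
    normalization = reduced_frames * nf + nt * nf   # RSD over ST, plus FSD(t, f)
    return {
        "composition": composition,
        "reduction": reduction,
        "normalization": normalization,
        "total": composition + reduction + normalization,
    }


# Hypothetical values chosen so that ST/T = 9 and NT = 16, as in FIG. 6B and 6C.
budget = storage_bytes(n=16, t=4, nf=8, st=36, nt=16)
```

For these hypothetical values the three units need 16, 72, and 200 bytes respectively, 288 bytes in total.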

Although preferred embodiments of the present invention have been described above, it will be understood that those of ordinary skill in the art can make various modifications and changes without departing from the scope of the claims of the present invention.

As described above, the speech feature extraction apparatus according to the present invention, based on human perceptual function, extracts multiple frequency and intensity components of speech in a manner similar to the speech features extracted by the cochlea, while remaining insensitive to ambient noise and using a minimum of hardware. It therefore improves the performance of feature extraction, an essential step in speech recognition, and reduces hardware implementation cost.

Claims

A speech feature extraction apparatus that divides an input speech signal into a plurality of different frequency bands through band-pass filters to extract features of the input speech signal, the apparatus comprising:

feature vector composition means for detecting zero crossing points and intensity information of each speech signal output from each band-pass filter and applying the perceptual characteristics of hearing to the detected values to calculate a speech feature vector having low sensitivity to noise;

feature reduction means for reducing the dimension and number of the speech feature vectors output from the feature vector composition means; and

normalization processing means for storing the speech feature vectors output from the feature reduction means for the duration of the speech interval and normalizing them with respect to time and intensity.

The apparatus of claim 1, further comprising data connection means for sequentially transmitting the speech signals passed through each band-pass filter to the feature vector composition means in a set order, wherein only one feature vector composition means is used.

The apparatus of claim 1, wherein the feature vector composition means comprises:

a zero crossing detector for detecting zero crossing points of the speech signal band-passed by each band-pass filter;

a speech intensity detector for detecting the intensity of the speech signal band-passed by each band-pass filter;

a frequency sensitivity controller for adjusting, according to frequency, the sensitivity applied to the speech intensity detected by the speech intensity detector;

a nonlinear converter for nonlinearly converting the intensity information output from the frequency sensitivity controller; and

a feature accumulator for accumulating the intensity components of each frequency converted by the nonlinear converter and calculating a feature vector.

The apparatus of claim 3, wherein the feature vector SD(f) calculated by the feature accumulator is calculated by

[equation not reproduced in the source text]

where x_k(n; m) is the output of the k-th filter in the speech frame referenced to time m, n is the index representing time within the frame, Z_k denotes the zero crossings of the output of the k-th filter, N_l is the l-th upward zero crossing, f_l is the index of the reciprocal of the time difference between the l-th and (l + 1)-th zero crossings, and g_f(.) is a monotonically increasing function representing the relationship between the frequency and intensity of a stimulus and the response of auditory nerve cells.

The apparatus of claim 1, wherein the feature reduction means is any one of, or a combination of, Principal Component Analysis (PCA), Independent Component Analysis (ICA), and a neural network.

and normalization processing means for storing the speech feature vectors output from the feature vector composition means for the duration of the speech interval and normalizing them with respect to time and intensity.