KR950009331B1

KR950009331B1 - Voice recognizing method for using vector quantizer and two step recognition

Info

Publication number: KR950009331B1
Application number: KR1019890003289A
Authority: KR
Inventors: 이윤근
Original assignee: 엘지전자주식회사; 구자홍
Priority date: 1989-03-16
Filing date: 1989-03-16
Publication date: 1995-08-19
Also published as: KR900015063A

Abstract

The speech recognition method using vector quantization and two-stage recognition comprises the steps of: detecting speech features and storing reference speech data through a vector quantizer designed by the LBG algorithm; designating the codebook range and vector-quantizing the input speech to obtain the distortion; performing pre-recognition; and determining the entry having the minimum distance as the recognition result.

Description

Speech Recognition Method Using Vector Quantization and Two-stage Recognition of Speech Recognition System

FIG. 1 is an overall configuration diagram of the speech recognition system of the present invention.

FIG. 2 is a signal flow diagram of the speech storage process of the speech recognition system of the present invention.

FIGS. 3(a) and 3(b) are signal flow diagrams of the speech recognition process of the speech recognition system of the present invention.

* Explanation of symbols for main parts of the drawings

1: Microphone    2: Microphone/speaker interface

3: Low-pass filter sampling unit    4: A/D converter

5: Digital signal processing element    6: Address decoder

7: Buffer    8: Data ROM

9: Data RAM    10: Program ROM

11: Input/output decoding logic    12: Speech recognition application device

The present invention relates to a speech recognition system, and more particularly to a speech recognition method using vector quantization and two-stage recognition. When speech is stored, vector quantization reduces the amount of reference data. When speech is recognized, pre-recognition is first performed using the distortion that results from vector quantization, and dynamic programming matching is then performed only as needed, which shortens the computation time spent on dynamic programming matching.

Conventional speech recognition methods include linear matching and dynamic programming matching. Linear matching compensates linearly for the time-axis variation between the input speech signal and the stored speech signal before matching, whereas dynamic programming matching compensates for that variation with an optimal nonlinear alignment before matching.

In the conventional methods, however, linear matching suffers from poor recognition performance, while dynamic programming matching performs well but requires a large amount of computation, so the response time becomes long.

In view of these drawbacks, the present invention provides a speech recognition method using vector quantization and two-stage recognition that reduces the amount of reference data through vector quantization when speech is stored and, when speech is recognized, performs pre-recognition based on the vector quantization distortion followed by dynamic programming matching only as needed. This both improves recognition performance and shortens computation time. The invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is an overall configuration diagram of the speech recognition system of the present invention. As shown, the speech output that has passed through the microphone (1) and the microphone/speaker interface (2) is applied through the low-pass filter sampling unit (3) to the analog-to-digital (A/D) converter (4) and converted into a digital signal. After being applied to the digital signal processing element (5), it is stored in the data ROM (8), data RAM (9), and program ROM (10) through the address decoder (6) and buffer (7). The digital signal processing element (5) also controls the microphone/speaker interface (2) through the input/output decoding logic (11) and drives the speech recognition application device (12). Of the terminals of the digital signal processing element (5), XF is the clock signal generation terminal; the remaining terminals, whose symbols are not reproduced in this text, are the interrupt terminal, the interrupt acknowledge terminal, the data select terminal, the input/output select terminal, and the program select terminal.

The operation and effects of the present invention configured as described above are as follows.

First, in the speech storage process, the speech signal that enters the microphone/speaker interface (2) through the microphone (1) is divided into frames by the low-pass filter sampling unit (3) and then converted into a digital signal by the A/D converter (4).

The digital signal thus obtained is applied to the digital signal processing element (5), and its frequency spectrum is extracted by a fast Fourier transform (FFT). The spectrum is divided into fixed groups, and the resulting pattern is stored in the data ROM (8) and data RAM (9).

In the speech recognition process, likewise, speech applied to the microphone/speaker interface (2) through the microphone (1) is divided into frames by the low-pass filter sampling unit (3) and converted into a digital signal by the A/D converter (4). This signal is applied to the digital signal processing element (5), transformed by the FFT, and divided into fixed groups to form the input pattern.

The pattern thus obtained is compared one by one with the reference patterns stored in the data ROM (8) and data RAM (9), and the most similar one is extracted. The recognized information then drives the speech recognition application device (12) through the input/output decoding logic (11).

The process of storing speech recognition data using vector quantization is described below with reference to the signal flow diagram of FIG. 2.

First, the sampled digital speech signal is received, and the speech interval is detected using the energy and the zero-crossing rate. The speech interval has higher energy than the surrounding background, and when the speech begins with an unvoiced sound the zero-crossing rate is high.
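As an illustration only, the energy and zero-crossing test described above might be sketched in Python as follows. The function names, the fixed frame length, the two thresholds, and the rule that a frame counts as speech when either measure exceeds its threshold are all assumptions; the source does not state how the two measures are combined.

```python
from typing import List, Optional, Tuple


def frame_energy(frame: List[float]) -> float:
    """Sum of squared samples in one frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame: List[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def detect_speech(samples: List[float], frame_len: int,
                  energy_thr: float, zcr_thr: float) -> Optional[Tuple[int, int]]:
    """Return (first, last) indices of frames judged to be speech, or None.

    Assumed rule: a frame is speech if its energy OR its zero-crossing
    rate exceeds the corresponding (hypothetical) threshold.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    active = [i for i, f in enumerate(frames)
              if frame_energy(f) > energy_thr or zero_crossing_rate(f) > zcr_thr]
    if not active:
        return None
    return active[0], active[-1]
```

In practice the thresholds would be tuned against the background noise level; a low-energy frame with many zero crossings is what lets an unvoiced onset be caught.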

Because the high-frequency components of a speech signal generally have little power, pre-emphasis is applied to boost them: as expressed by H(z) = 1 - a z^-1, the immediately preceding sample multiplied by a constant a is subtracted from the current sample. To prevent distortion in the FFT, a Hamming window is then applied. The window, by which the samples of the frame are multiplied element by element, is

W(n) = 0.56 - 0.46 cos(2πn/N), 0 ≤ n ≤ N-1

W(n) = 0, elsewhere

where N is the number of samples in one frame.
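A minimal sketch of the pre-emphasis filter and the window, assuming the cosine argument is 2πn/N and that the sample before the first one is zero. The value a = 0.95 is a hypothetical default (the source gives only the symbol a). The coefficient 0.56 is taken from the text as printed; the commonly used Hamming coefficient is 0.54, so both coefficients are left as parameters.

```python
import math
from typing import List


def pre_emphasis(x: List[float], a: float = 0.95) -> List[float]:
    """y[n] = x[n] - a * x[n-1], i.e. H(z) = 1 - a z^-1 (x[-1] assumed 0)."""
    return [x[n] - a * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def window(N: int, c0: float = 0.56, c1: float = 0.46) -> List[float]:
    """W(n) = c0 - c1 * cos(2*pi*n/N) for 0 <= n <= N-1."""
    return [c0 - c1 * math.cos(2.0 * math.pi * n / N) for n in range(N)]


def apply_window(frame: List[float], w: List[float]) -> List[float]:
    """Multiply each sample by the matching window coefficient."""
    return [s * v for s, v in zip(frame, w)]
```

For example, `apply_window(pre_emphasis(frame), window(len(frame)))` would prepare one frame for the FFT stage.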

The FFT is then performed to obtain the power spectrum. The spectrum is divided into fixed groups, and the line spectra belonging to each group are summed, which yields the speech features.
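The grouping of spectral lines into features could be sketched as below. For self-containment this uses a direct DFT from the standard library rather than a true FFT, which gives the same power spectrum more slowly. Equal-width contiguous groups are an assumption, since the source says only that the spectrum is divided into fixed groups; the function names are hypothetical.

```python
import cmath
from typing import List


def power_spectrum(frame: List[float]) -> List[float]:
    """|X[k]|^2 for k = 0 .. N/2, computed with a direct DFT."""
    N = len(frame)
    spec = []
    for k in range(N // 2 + 1):
        X = sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        spec.append(abs(X) ** 2)
    return spec


def band_features(power: List[float], n_bands: int) -> List[float]:
    """Sum the spectral lines inside each of n_bands contiguous groups.

    Assumption: groups are of (nearly) equal width.
    """
    L = len(power)
    return [sum(power[b * L // n_bands:(b + 1) * L // n_bands])
            for b in range(n_bands)]
```

Each frame is thereby reduced to a feature vector of `n_bands` values, which is what the vector quantizer operates on.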

The speech features thus obtained are divided into as many groups as the desired number of representative vectors, the centroid of each group is computed, and these centroids are set as the initial codebook. An optimal codebook is then designed by the LBG algorithm. For each frame of the reference speech features, only the index into the designed codebook is stored, and the reference speech pattern is stored by repeating this operation until the speech storage process ends.
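The codebook design step can be illustrated with the following sketch, which is not the patent's implementation. It assumes the initial groups are contiguous runs of training frames, uses squared Euclidean distance, and iterates the nearest-neighbour and centroid-update steps of LBG until the total distortion stops improving by more than a relative tolerance `eps`. The names `design_codebook` and `encode`, and the parameters `eps` and `max_iter`, are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vs: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vs) for col in zip(*vs)]


def nearest(v: Vector, codebook: List[Vector]) -> Tuple[int, float]:
    """Index of the closest codeword and the squared distance to it."""
    j = min(range(len(codebook)), key=lambda c: sq_dist(v, codebook[c]))
    return j, sq_dist(v, codebook[j])


def design_codebook(train: List[Vector], k: int,
                    eps: float = 1e-3, max_iter: int = 50) -> List[List[float]]:
    """Centroids of k contiguous groups as the initial codebook, then
    LBG-style refinement. Requires len(train) >= k."""
    n = len(train)
    codebook = [centroid(train[g * n // k:(g + 1) * n // k]) for g in range(k)]
    prev = float("inf")
    for _ in range(max_iter):
        cells: List[List[Vector]] = [[] for _ in range(k)]
        total = 0.0
        for v in train:
            j, d = nearest(v, codebook)
            cells[j].append(v)
            total += d
        # Empty cells keep their previous codeword.
        codebook = [centroid(c) if c else list(codebook[j])
                    for j, c in enumerate(cells)]
        if prev - total <= eps * max(total, 1e-12):
            break
        prev = total
    return codebook


def encode(frames: List[Vector], codebook: List[Vector]) -> List[int]:
    """Per-frame codebook indices: the only data stored for a reference."""
    return [nearest(v, codebook)[0] for v in frames]
```

Storing `encode(frames, codebook)` together with the codebook, instead of the full feature vectors, is what reduces the reference data.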

The speech recognition process using vector quantization and two-stage recognition is described below with reference to the signal flow diagrams of FIGS. 3(a) and 3(b).

As in the speech storage process, the sampled digital speech signal is received, the speech interval is detected, pre-emphasis and the Hamming window are applied, and the FFT is performed to extract the speech features.

At this point, with the current reference entry number set to i = 1, the scale is set (the expression defining it is missing from this text) and the interval is set to scale × const; where the source's "smaller than" condition holds, the interval is set to const instead. The codebook is stored, and the current input frame number is set to n = 1, the reference index frame number to m, and the reference codebook number to l. Next, the set of codewords {cbno} = {index(m) : n × scale - interval < m < n × scale + interval} is selected, the vector quantization distortion is computed with the selected codewords, and the distortion is determined by taking the minimum Euclidean distance (M-indist) over the reference codewords (l) for input frame (n).
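The restricted codebook search might look like the sketch below. Because the expression for the scale is not reproduced in this text, `scale` and `interval` are simply passed in as parameters. Frame numbers are treated as 1-based to match the set definition, and falling back to the whole index sequence when the window selects no frame is an assumption added here, not something the source describes.

```python
import math
from typing import List, Sequence, Set

Vector = Sequence[float]


def allowed_codewords(index_seq: List[int], n: int,
                      scale: float, interval: float) -> Set[int]:
    """{cbno} = {index(m) : n*scale - interval < m < n*scale + interval},
    with reference frames m numbered from 1."""
    lo, hi = n * scale - interval, n * scale + interval
    return {index_seq[m - 1] for m in range(1, len(index_seq) + 1) if lo < m < hi}


def entry_distortion(inp: List[Vector], codebook: List[Vector],
                     index_seq: List[int], scale: float, interval: float) -> float:
    """Sum over input frames of the minimum Euclidean distance to the
    codewords allowed for that frame (the quantity called sum(i))."""
    total = 0.0
    for n, frame in enumerate(inp, start=1):
        cbno = allowed_codewords(index_seq, n, scale, interval)
        if not cbno:
            cbno = set(index_seq)  # assumed fallback: search all stored codewords
        total += min(math.dist(frame, codebook[j]) for j in cbno)
    return total
```

Restricting each frame to codewords used near the corresponding position in the reference is what keeps a time-reversed or otherwise misordered utterance from scoring well.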

Because the interval is set in this way, codewords that have no correspondence with the n-th input frame are kept from being matched, which prevents the distortion from being computed incorrectly during vector quantization. The effective codebook size is also reduced, so the amount of computation decreases and the response becomes faster.

In this way, vector quantization is performed for each reference entry by the above method, and the sum of the distortions, sum(i), is obtained each time. Once the distortion sums have been obtained for all input frames (n) and all reference entries (i), the N smallest are selected in ascending order as Dist(1), Dist(2), ..., Dist(N).

Among these, any distortion whose difference from the smallest distortion Dist(1) is equal to or greater than a certain threshold is excluded, and dynamic programming matching is performed only on the entries that precede it.
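The pre-recognition pruning described above can be sketched as follows. The function name `preselect` and the use of a dictionary mapping each entry to its distortion sum are assumptions for illustration.

```python
from typing import Dict, Hashable, List


def preselect(sums: Dict[Hashable, float], n_best: int,
              threshold: float) -> List[Hashable]:
    """Keep the n_best entries with the smallest distortion sums, then cut
    the list at the first one whose gap to Dist(1) reaches the threshold."""
    ranked = sorted(sums.items(), key=lambda kv: kv[1])[:n_best]
    if not ranked:
        return []
    best = ranked[0][1]  # Dist(1)
    keep = []
    for entry, dist in ranked:
        if dist - best >= threshold:
            break
        keep.append(entry)
    return keep
```

When only one entry survives, the second stage can be skipped entirely, which is where the saving in computation comes from.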

Thus, when the extracted speech features are vector-quantized with the codebook of each reference entry, preliminary speech recognition is performed by picking out the entries that have small distortion.

If the only reference entry selected here is the entry i for which sum(i) = Dist(1), the recognition process stops at this point, and the entry with distortion Dist(1) is the recognition result. If several reference entries (i) are selected, dynamic programming matching is performed on them, and the entry with the smallest distance is the recognition result.
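A sketch of the second stage follows, under stated assumptions. The source does not specify the local path constraints of its dynamic programming matching, so the common dynamic time warping recurrence with steps (i-1, j), (i, j-1), and (i-1, j-1) is used here. It is also assumed that each reference is available as a sequence of feature vectors, for example reconstructed from its stored codebook indices. The names `dtw_distance` and `recognize` are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def dtw_distance(a: List[Vector], b: List[Vector]) -> float:
    """Accumulated Euclidean cost of the best nonlinear time alignment."""
    INF = float("inf")
    D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[len(a)][len(b)]


def recognize(inp: List[Vector], candidates: List[Hashable],
              references: Dict[Hashable, List[Vector]]) -> Hashable:
    """Two-stage decision: a single surviving candidate is accepted as is;
    otherwise the candidate with the smallest DTW distance wins."""
    if len(candidates) == 1:
        return candidates[0]
    return min(candidates, key=lambda e: dtw_distance(inp, references[e]))
```

Here `candidates` would be the non-empty list of entries that survived pre-recognition, so the costly alignment runs only on those few.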

As described in detail above, the present invention reduces the reference data by vector quantization, which saves memory, and performs two-stage matching consisting of preliminary speech recognition followed by dynamic programming matching, which shortens the response time.

In addition, because the codebook range is restricted in the preliminary speech recognition process, the distortion is prevented from being computed incorrectly and the amount of computation is reduced further, which makes the response faster still.

Claims

A speech recognition method using vector quantization and two-stage recognition in a speech recognition system, characterized by comprising the steps of: detecting speech features and then storing reference speech data through vector quantization by the LBG algorithm; setting the interval for speech recognition to designate the codebook range, and vector-quantizing the input speech to compute the distortion; performing a preliminary speech recognition function of comparing the distortions over the input frames and reference entries and selecting only certain entries; and performing dynamic programming matching on the selected entries and determining the entry having the minimum distance as the recognition result.