KR100482313B1

KR100482313B1 - Speech Recognition Method Using Dual Similarity Comparison

Info

Publication number: KR100482313B1
Application number: KR1019960077679A
Authority: KR
Inventors: 임근옥
Original assignee: 엘지전자 주식회사
Priority date: 1996-12-30
Filing date: 1996-12-30
Publication date: 2005-07-21
Also published as: KR19980058355A

Abstract

The present invention relates to a speech recognition method using dual similarity comparison, in which first and second similar words are determined during speech recognition and the difference between the two similar words is compared with a threshold, so that the similarity of phonetically similar words is re-examined before recognition is completed.

The speech recognition method comprises: a first step of receiving a voice input signal; a second step of extracting a feature vector of the voice input signal by obtaining a cepstrum from the voice input signal in units of a predetermined time; a third step of vector-quantizing the cepstrum using a codebook; a fourth step of recognizing first and second similar words by comparing, using reference word models, the similarity with the quantized data produced by the vector quantization; a fifth step of obtaining a differential cepstrum for the first and second similar words when the difference between the first and second similar words is not greater than a predetermined threshold; a sixth step of performing the second through fifth steps to vector-quantize the differential cepstrum again and re-compare the similarity with the resulting quantized data to obtain a final recognition result; and a seventh step of selecting, when the difference between the first and second similar words is greater than the threshold, whichever of the first and second similar words has the greater similarity to the quantized data of the voice input signal as the final recognition result.

Description

Speech recognition method using dual similarity comparison

The present invention relates to a speech recognition system and method using dual similarity comparison, and more particularly to a system and method in which first and second similar words are determined during speech recognition and the difference between the two similar words is compared with a threshold, so that the similarity of phonetically similar words is re-examined before recognition is completed.

In general, speech recognition is a pattern classification task: given an input pattern in the form of a speech waveform, it classifies the input as the closest of a set of reference patterns. Speech recognition can be divided into a training process that generates reference word models, as shown in FIG. 1, and a recognition process that performs recognition using the generated reference word models, as shown in FIG. 2.

The training process divides the input speech waveform into overlapping time intervals and extracts a feature vector from each. A feature vector used for speech recognition should be sensitive to differences in the important characteristics of the two patterns being compared and insensitive to irrelevant variation such as changes in the surrounding environment. For this reason, the cepstrum, which is simple to compute and gives good recognition performance, has been widely used as a feature vector. One technique used in this kind of feature analysis is vector quantization: a codebook of N multidimensional feature vectors is built by a clustering method, and the feature vectors obtained from the speech waveform are then compared with the N code vectors and quantized to the nearest code vector, thereby generating the reference word models. This technique introduces some distortion, but it provides a good tool that is simple to apply to feature analysis.
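The quantization step described above can be sketched as a nearest-neighbor lookup against a codebook. This is a minimal illustration, not the patent's implementation: the squared Euclidean distance, the function names `nearest_code_index` and `quantize`, and the toy two-dimensional codebook are all assumptions, and the clustering that builds the codebook is taken as already done.

```python
import math


def nearest_code_index(vector, codebook):
    """Return the index of the code vector closest to `vector`.

    Squared Euclidean distance is an assumed distortion measure;
    the source does not name one.
    """
    best_index, best_dist = 0, math.inf
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest code vector."""
    return [nearest_code_index(frame, codebook) for frame in frames]


# Hypothetical codebook of N = 3 code vectors and three feature vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.4)]
symbols = quantize(frames, codebook)  # one codebook index per frame
```

Replacing each frame by a codebook index is where the distortion mentioned above comes from: every frame is represented by its nearest code vector rather than by its exact value.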

The recognition process is divided into a feature analysis step, which uses vector quantization, and a pattern classification step. In the feature analysis step, feature vectors are extracted as described above and then vector-quantized using the codebook. In the pattern classification step, recognition is performed by measuring the similarity between the input speech and the reference word models. Four families of methods have been used for pattern classification. The first is pattern matching using dynamic programming; the second is statistical modeling such as the Hidden Markov Model (HMM); the third uses neural networks; and the fourth uses knowledge-based systems. Pattern matching using dynamic programming selects a typical reference pattern for each input and uses an optimal nonlinear time-alignment method such as Dynamic Time Warping (DTW) to select the reference pattern closest to the input pattern. The Hidden Markov Model starts from the assumption that speech can be modeled statistically, builds a probabilistic model from an ensemble of training data, and applies it to pattern classification. A neural network builds a multilayer network from units called perceptrons, which model neural structure, in an attempt to learn the pattern-matching ability of the human brain. A knowledge-based system, such as an expert system, starts from the idea of applying to machines the rules that people have learned about speech. Of these methods, the one most widely used at present is the Hidden Markov Model, which uses a probabilistic model.
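As an illustration of the first family, dynamic-programming pattern matching, the following is a minimal textbook DTW sketch over one-dimensional sequences. It is not taken from the patent: the absolute-difference local cost, the unconstrained step pattern, and the names `dtw_distance` and `closest_reference` are assumptions made for the example.

```python
import math


def dtw_distance(a, b):
    """Accumulated cost of the best nonlinear time alignment of a and b."""
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b holds
                cost[i][j - 1],      # b advances while a holds
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


def closest_reference(pattern, references):
    """Return the name of the reference pattern with the smallest DTW cost."""
    return min(references, key=lambda name: dtw_distance(pattern, references[name]))


# Hypothetical reference patterns; real systems align feature-vector sequences.
references = {"up": [1, 2, 3], "down": [3, 2, 1]}
best = closest_reference([1, 1, 2, 3, 3], references)
```

Because the alignment may stretch or compress either sequence, an input spoken more slowly than the reference can still match it at zero cost.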

As described above, a conventional speech recognition system first obtains, in the training process, the reference word models to be recognized, using feature vectors for the set of words to be recognized. Then, in the recognition process, it compares the feature vectors of the input speech with the reference word models obtained in training and recognizes the most similar word.

However, such a conventional speech recognition system has a high likelihood of misrecognition when phonetically similar words are recognized using a single feature vector. Moreover, using an additional feature vector to reduce such misrecognition increases the amount of recognition processing.

The present invention is intended to solve these conventional problems by using a dual similarity comparison function to increase the recognition rate while hardly increasing the amount of recognition processing.

Accordingly, an object of the present invention is to provide a speech recognition system and method using dual similarity comparison that can improve recognition performance, without increasing the amount of recognition processing, by using two similar words to re-examine the similarity of phonetically similar words.

To achieve the above object, the speech recognition method of the present invention comprises: a first step of receiving a voice input signal; a second step of extracting a feature vector of the voice input signal by obtaining a cepstrum from the voice input signal in units of a predetermined time; a third step of vector-quantizing the cepstrum using a codebook; a fourth step of recognizing first and second similar words by comparing, using reference word models, the similarity with the quantized data produced by the vector quantization; a fifth step of obtaining a differential cepstrum for the first and second similar words when the difference between the first and second similar words is not greater than a predetermined threshold; a sixth step of performing the second through fifth steps to vector-quantize the differential cepstrum again and re-compare the similarity with the resulting quantized data to obtain a final recognition result; and a seventh step of selecting, when the difference between the first and second similar words is greater than the threshold, whichever of the first and second similar words has the greater similarity to the quantized data of the voice input signal as the final recognition result.

The predetermined time is 20 ms.

The above and other objects and advantages of the present invention will become more apparent from the following detailed description of an embodiment of the present invention.

Hereinafter, the most preferred embodiment of the present invention is described with reference to the accompanying drawings, in sufficient detail that a person having ordinary skill in the art to which the invention pertains can readily practice it.

FIG. 3 is a block diagram of a speech recognition system using dual similarity comparison according to an embodiment of the present invention. As shown in FIG. 3, the system broadly comprises: a voice input unit (10), such as a microphone, that converts a voice signal into an electrical signal and inputs it; a voice recognition unit (20) that recognizes the voice signal input from the voice input unit (10); a control unit (30) that controls the speech recognition process performed by hardware including the voice recognition unit (20); and an error/message delivery unit (40) that delivers errors and messages according to the recognition result output from the control unit (30).

FIG. 4 is an operational flowchart of the speech recognition method using dual similarity comparison according to an embodiment of the present invention. The operation shown in FIG. 4 is described below with reference to the hardware configuration of FIG. 3.

When a voice signal is input through the voice input unit (10) (S10), the voice recognition unit (20) extracts feature vectors by computing cepstral coefficients from the input voice signal (S20). Each feature vector is computed over a predetermined speech interval, for example 20 ms. The cepstral coefficients can be obtained by transforming the voice signal into the frequency domain with a Fourier transform, taking the logarithm, applying an inverse Fourier transform, and then applying a window function. Because cepstral coefficients are simple to compute and give good recognition performance, they are used as the feature vector in many recognition systems.
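The cepstrum computation described above (Fourier transform, logarithm, inverse Fourier transform, window) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the patent's implementation: it uses a naive O(n²) DFT to stay within the standard library, takes the logarithm of the magnitude spectrum (the real cepstrum), and treats the final window as a simple truncation to the first `num_coeffs` coefficients. The names `dft` and `real_cepstrum` are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform; adequate for one short frame."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n)
        )
        out.append(total / n if inverse else total)
    return out


def real_cepstrum(frame, num_coeffs):
    """Fourier transform -> log magnitude -> inverse transform -> truncate."""
    spectrum = dft(frame)
    # The small constant guards against log(0) on empty frequency bins.
    log_magnitude = [math.log(abs(s) + 1e-12) for s in spectrum]
    cepstrum = dft(log_magnitude, inverse=True)
    # Keeping only the first coefficients acts as a rectangular window
    # in the cepstral domain (an assumed choice of window).
    return [c.real for c in cepstrum[:num_coeffs]]


# A scaled unit impulse has a flat spectrum, so only coefficient 0 is non-zero.
frame = [2.0] + [0.0] * 7
coeffs = real_cepstrum(frame, 4)
```

At an assumed 8 kHz sampling rate, a 20 ms interval would be a 160-sample frame. A practical implementation would use a fast Fourier transform in place of the naive one shown here.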

Next, the voice recognition unit (20) receives codebook data from the control unit (30) (S70) and vector-quantizes the cepstrum using the codebook (S30). The voice recognition unit (20) then receives the reference word models from the control unit (30) and compares them with the quantized data to compute similarities, recognizing the word with the highest similarity value and the word with the second-highest similarity value (S40). It then determines whether the difference between the word with the highest similarity value and the word with the second-highest similarity value is greater than a threshold, which is a value set experimentally (S50). If the difference between the two similar words is greater than the threshold, the word with the highest similarity value is recognized (S60). The threshold is chosen in consideration of both recognition accuracy and recognition processing speed.
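The decision in steps S40 to S60 can be sketched as follows. This is a minimal illustration that assumes each reference word already has a numeric similarity score where higher means more similar. The function names, the example words, and the threshold value are hypothetical, and how the scores themselves are computed is outside the sketch.

```python
def top_two(scores):
    """Return the (word, score) pairs with the highest and second-highest scores.

    Assumes `scores` holds at least two words.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[0], ranked[1]


def first_pass_decision(scores, threshold):
    """Accept the best word if it clearly beats the runner-up (S50, S60).

    Returns (word, None) when the margin exceeds the threshold, otherwise
    (None, (best, second)) to signal that the two candidates need the
    second comparison.
    """
    (best_word, best_score), (second_word, second_score) = top_two(scores)
    if best_score - second_score > threshold:
        return best_word, None
    return None, (best_word, second_word)


# Hypothetical similarity scores for three reference words.
clear = first_pass_decision({"one": 0.9, "nine": 0.5, "five": 0.2}, threshold=0.1)
close = first_pass_decision({"one": 0.9, "nine": 0.85, "five": 0.2}, threshold=0.1)
```

A larger threshold sends more utterances to the second comparison, which is the accuracy versus processing-speed trade-off noted above.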

If, however, the difference between the two similar words is not greater than the threshold, the above process is repeated for the two similar words: a differential cepstrum is obtained for the two similar words and vector-quantized again, and the similarity is compared on the differential cepstrum to obtain the final recognition result. Here, the differential cepstrum can be obtained by algebraically subtracting the cepstrum of the preceding interval from the cepstrum obtained in the current 20 ms speech interval. The differential cepstrum is widely used in speaker-independent recognition systems. The control unit (30) causes the recognition result obtained by the voice recognition unit (20) to be delivered through the error/message delivery unit (40) to the processors and controllers that need it.
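The differential cepstrum described above is a frame-to-frame difference of cepstral vectors. A minimal sketch follows. The function name is hypothetical, and dropping the first frame because it has no predecessor is an assumption, since the source does not say how the first interval is handled.

```python
def differential_cepstrum(cepstra):
    """Subtract each frame's cepstrum from that of the following frame.

    `cepstra` is a list of per-frame coefficient lists. The result has
    one entry fewer, because the first frame has no preceding frame.
    """
    deltas = []
    for previous, current in zip(cepstra, cepstra[1:]):
        deltas.append([c - p for c, p in zip(current, previous)])
    return deltas


# Three hypothetical frames with two cepstral coefficients each.
cepstra = [[1.0, 2.0], [1.5, 1.0], [3.0, 1.0]]
deltas = differential_cepstrum(cepstra)
```

The resulting vectors would then be vector-quantized and scored against the two candidate words only, which keeps the added processing small.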

As described above, the embodiment of the present invention provides a speech recognition system and method using dual similarity comparison that improves recognition performance by re-examining the similarity of phonetically similar words without increasing the amount of recognition processing. The method of the present invention can be applied to any particular feature vector or recognition method.

From the foregoing, those skilled in the art will appreciate that various changes and modifications are possible without departing from the technical spirit of the present invention. Accordingly, the technical scope of the present invention should not be limited to what is set out in the detailed description of the specification but should be determined by the claims.

FIG. 1 is a flowchart showing a conventional training process for generating reference word models.

FIG. 2 is a flowchart showing a conventional recognition process that performs recognition using reference word models.

FIG. 3 is a block diagram of a speech recognition system using dual similarity comparison according to an embodiment of the present invention.

FIG. 4 is a flowchart of a speech recognition method using dual similarity comparison according to an embodiment of the present invention.

* Description of reference numerals for the main parts of the drawings *

10: voice input unit
20: voice recognition unit
30: control unit
40: error/message delivery unit

Claims

A speech recognition method using dual similarity comparison, comprising:

a first step of receiving a voice input signal;

a second step of extracting a feature vector of the voice input signal by obtaining a cepstrum from the voice input signal in units of a predetermined time;

a third step of vector-quantizing the cepstrum using a codebook;

a fourth step of recognizing first and second similar words by comparing, using reference word models, the similarity with the quantized data produced by the vector quantization;

a fifth step of obtaining a differential cepstrum for the first and second similar words when the difference between the first and second similar words is not greater than a predetermined threshold;

a sixth step of performing the second through fifth steps to vector-quantize the differential cepstrum again and re-compare the similarity with the resulting quantized data to obtain a final recognition result; and

a seventh step of selecting, when the difference between the first and second similar words is greater than the threshold, whichever of the first and second similar words has the greater similarity to the quantized data of the voice input signal as the final recognition result.

The method of claim 1,

wherein the predetermined time is 20 ms.