KR20010082390A - The Performance Improvement of Speech recognition system using Hierarchical Classification Method and Optimal Candidate Model - Google Patents


Info

Publication number
KR20010082390A
Authority
KR
South Korea
Prior art keywords
hmm
cepstrum
speech recognition
delta
energy
Prior art date
Application number
KR1020010019448A
Other languages
Korean (ko)
Inventor
전화성
Original Assignee
전화성
에스엘투(주)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 전화성, 에스엘투(주) filed Critical 전화성
Priority to KR1020010019448A priority Critical patent/KR20010082390A/en
Publication of KR20010082390A publication Critical patent/KR20010082390A/en

Links

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/08 — Speech classification or search
    • G10L 15/14 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 — Hidden Markov Models [HMMs]
    • G10L 15/148 — Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PURPOSE: A broad classification method for HMMs (Hidden Markov Models) that improves speech recognition speed, together with a method for detecting ideal candidates, is provided to improve the recognition speed of a speech recognition system. CONSTITUTION: Speech input in A-law coded form is converted to linear PCM, and endpoints are detected using energy. Codebooks are generated from five feature streams: a 12th-order Mel-cepstrum, a 12th-order delta cepstrum, a 12th-order acceleration cepstrum, a 3rd-order merged energy (energy + delta energy + delta-delta energy), and a 5th-order PLP (Perceptual Linear Prediction) with RASTA cepstrum. The codebooks are formed by vector quantization: the Mel-cepstrum, delta cepstrum, acceleration cepstrum, and PLP streams each generate a 256-entry codebook, and the merged energy stream generates a 64-entry codebook. The one-dimensional code index sequences produced by the codebooks are passed through the HMMs. The number of states and the number of streams are reduced to improve the speed of the speech recognition system: an N-state HMM that selects candidates is placed in front of the full-state HMM, and both the N-state HMMs and the full-state HMMs are built for the full vocabulary of 1,445 words.

Description

The Performance Improvement of Speech Recognition System Using Hierarchical Classification Method and Optimal Candidate Model (Broad Classification of HMM Models and Ideal Candidate Detection for Faster Speech Recognition)

Speech recognition methods currently under study include artificial neural networks (ANN), the Dynamic Time Warping (DTW) algorithm based on temporal alignment, and the Hidden Markov Model (HMM), a probabilistic approach. Of these, HMM-based methods are known to achieve the highest recognition rates. HMM-based recognizers can in turn be broadly divided by recognition unit into phoneme-based and word-based systems.

As speech recognition technology has been commercialized, speed has emerged as a larger problem than recognition rate. In practice, a single speech recognition server often has to handle more than 60 lines, meaning 60 recognition threads running concurrently; under such load, real-time performance is hard to achieve.

Research on speed improvement has focused mainly on recognizers whose recognition unit is the phoneme. With phoneme units, the underlying structure is a lexicon tree, so search time is inherently shorter than with word units, and tree-search techniques have been studied extensively. A word-unit recognizer, by contrast, must essentially perform a depth-first search and pass through the HMM of every word, so recognition necessarily takes longer. In addition, each HMM occupies memory during recognition, and the memory footprint of word-unit models is far larger than that of phoneme-unit models, which widens the speed gap further. Despite these drawbacks, research on word-unit recognizers remains necessary: for word recognition, phoneme-unit recognizers generally achieve markedly lower recognition rates than word-unit recognizers.

The present invention addresses speed improvement for word-unit recognizers with these characteristics. Its purpose is to establish a speed improvement method that preserves the high recognition rate of word-unit recognition.

To improve speed, a hierarchical broad classification technique is applied to the HMM models. Recognition processing time is usually dominated by the search over the recognition models. The broad classification technique of the present invention reduces the search space by applying HMM models with different numbers of states in multiple stages. Candidates are first selected using HMMs with few states; experiments were performed on several small-state models, adjusting how often the correct word was included among the candidates, to find the optimal configuration. For these small-state HMMs, the number of candidates and the number of associated feature parameters are varied flexibly to find the most favorable case.

Figure 1. Concept of the broad classification technique


Figure 2. Structure of the HMM

The HMM is the probabilistic model most commonly used for speech recognition. Its components are the number of states, the transitions, and the transition probabilities. As is typical in speech recognition, a left-to-right model, which transitions from left to right over time, or a similar topology is used. In this structure the HMM serves as a reference pattern for the target word and models its temporal variation. For sentence-independent recognition, the fully connected (ergodic) topology shown in Figure 2(b) is used; in this model all states are fully connected, so a transition is possible to any state, including back to the current state. Consequently, although not exact, each state automatically forms a broad phonetic class from the training sentences, and during recognition the output probability is computed while transitioning through the state sequence most similar to the input, which has the advantage that explicit phoneme-level segmentation is unnecessary.
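As an illustration (not taken from the patent itself), the left-to-right topology described above can be encoded as a transition matrix in which each state either self-loops or advances to the next state; the uniform initialization and the optional skip transition are arbitrary choices for this sketch:

```python
import numpy as np

def left_to_right_transitions(n_states, allow_skip=False):
    """Build a left-to-right HMM transition matrix: each state may
    self-loop or advance to the next state (optionally skipping one).
    Probabilities are initialized uniformly over the allowed targets."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        targets = [i]                       # self-loop
        if i + 1 < n_states:
            targets.append(i + 1)           # advance one state
        if allow_skip and i + 2 < n_states:
            targets.append(i + 2)           # skip one state
        for j in targets:
            A[i, j] = 1.0 / len(targets)
    return A

# e.g. a 7-state model, the size used for two-syllable words in the text
A = left_to_right_transitions(7)
```

An ergodic topology, by contrast, would simply allow nonzero probability in every cell of the matrix.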

Figure 3. Structure and stages of the speech recognition system

Speech recognition proceeds through speech-segment detection, feature parameter extraction, vector quantization, and the HMM models, and finally selects the result with the largest likelihood ratio.

The speech recognition system is divided into a speech-segment detection stage, an acoustic analysis stage, a recognition-unit segmentation stage, and a recognition stage. The speech segment to be recognized is detected using the zero-crossing rate and log energy, and information for recognition is then extracted through preprocessing and feature parameter extraction. This information is represented by a small number of representative vectors through vector quantization, and the resulting one-dimensional code index sequence is used to train the HMM models. The recognizer described here was trained on 1,445 stock-name words using HMMs.
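A minimal sketch of the energy-based endpoint detection described above (the frame length, threshold, and the simple first/last-active-frame rule are illustrative assumptions, not the patent's parameters; the zero-crossing-rate term is omitted for brevity):

```python
import numpy as np

def detect_endpoints(signal, frame_len=160, energy_floor=1e-10, thresh_db=-30.0):
    """Return (start_sample, end_sample) of the speech segment: the span
    of frames whose log energy is within thresh_db of the loudest frame."""
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    # per-frame log energy in dB, floored to avoid log(0) on silence
    log_e = 10.0 * np.log10(np.maximum((frames ** 2).mean(axis=1), energy_floor))
    active = np.where(log_e > log_e.max() + thresh_db)[0]
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```

A production detector would additionally use the zero-crossing rate to catch low-energy fricatives at word boundaries, as the text indicates.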

The system converts speech input in A-law coded form to linear PCM, then detects endpoints using energy. Five feature parameter streams (5 streams) are then used: a 12th-order Mel-cepstrum, a 12th-order delta cepstrum, a 12th-order acceleration cepstrum, a 3rd-order merged energy (energy + delta energy + delta-delta energy), and a 5th-order PLP (Perceptual Linear Prediction) with RASTA cepstrum, which reflects auditory characteristics. Codebooks are generated from these streams.
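The A-law to linear PCM conversion mentioned above is the standard G.711 expansion; the sketch below is a plain-Python rendering of that well-known algorithm (the function name and 16-bit output scale are our choices, not the patent's):

```python
def alaw_to_linear(a_val):
    """Expand one 8-bit G.711 A-law byte to a linear PCM sample
    (16-bit scale, i.e. magnitudes up to 32256)."""
    a_val ^= 0x55                          # undo even-bit inversion
    mantissa = (a_val & 0x0F) << 4         # 4 mantissa bits
    segment = (a_val & 0x70) >> 4          # 3 segment (exponent) bits
    if segment == 0:
        value = mantissa + 8
    elif segment == 1:
        value = mantissa + 0x108
    else:
        value = (mantissa + 0x108) << (segment - 1)
    return value if (a_val & 0x80) else -value
```

Decoding a whole A-law byte stream is then a per-byte map of this function before endpoint detection and feature extraction.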

The codebooks are generated through vector quantization (VQ), which compresses the information by representing each word with a small number of representative vectors. The sum of the distortion distances to the nearest representative vectors is used as the score, and the method can also be used in text-dependent mode. The Mel-cepstrum, delta cepstrum, acceleration cepstrum, and PLP streams each generate a 256-entry codebook, and the merged energy stream generates a 64-entry codebook. The output is a codeword sequence, expressed as a one-dimensional code index sequence.
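The quantization step described above, mapping each feature frame to its nearest codeword and accumulating the distortion, can be sketched as follows (the toy codebook in the usage note stands in for the patent's trained 256-entry codebooks):

```python
import numpy as np

def quantize(frames, codebook):
    """Map each feature frame (T x D) to the index of its nearest
    codeword in the codebook (K x D), yielding the one-dimensional
    code index sequence that is fed to the HMMs."""
    # squared Euclidean distance between every frame and every codeword
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def total_distortion(frames, codebook):
    """Sum of squared distances to the nearest codewords — the
    distortion score mentioned in the text."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).sum()
```

For example, with a 2-codeword codebook `[[0,0],[1,1]]`, frames near the origin map to index 0 and frames near `(1,1)` map to index 1, giving the code index sequence directly.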

The one-dimensional code index sequence produced by the codebooks is then passed through the HMM models. Two factors affect the speed of the speech recognition system. The first is the long time required to pass through the full-state HMMs of all 1,445 words. A full-state HMM has 7 states for two-syllable words, 15 states for three- or four-syllable words, and 21 states for words of five or more syllables; much time is spent passing through models of this size, since HMM processing time is proportional to the number of states. Second, the number of feature parameter streams also affects system speed: the present invention uses five streams, and the speed with five streams differs greatly from the speed with two.

To overcome this, N-state HMMs with few states, used to select candidates, are placed in front of the full-state HMMs. Because both the number of states and the number of feature parameter streams are reduced, this stage does not take long, and only the selected candidates pass through the full-state HMMs. For this purpose, the small N-state HMMs and the full-state HMMs are each built for all 1,445 words.

For the small-state candidate HMMs, a state count N must be found that minimizes time while still selecting suitable candidates. Candidate values of N were defined as 1, 2, 3, 4, and 5; since two-syllable words have 7 states in the full-state HMMs, N was tested only up to 5. Candidate extraction accuracy is shown in Figures 4 and 5. The analysis shows that the 3-state HMM is the most efficient: extraction accuracy differs little among 3, 4, and 5 states, but speed differs considerably, while 1 and 2 states are too weak to serve for candidate extraction.

The index sequence first passes through the 1,445 three-state HMMs to obtain likelihood ratio values, from which N candidates are extracted. The full-state HMMs are then applied to the N extracted candidates, and the word with the maximum likelihood ratio is recognized as the result. When extracting candidates, experiments were run with the number of applied feature parameters classified into various cases; the number of candidates and the type and number of feature parameters were varied to obtain the most optimized result, and candidate selection can be organized in one or two stages.
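The two-stage search just described can be summarized in code; the scoring callables below are placeholders standing in for the 3-state and full-state HMM likelihood computations, and the default candidate count is an arbitrary choice for the sketch:

```python
def two_stage_recognize(code_seq, coarse_score, full_score, vocab, n_candidates=10):
    """Stage 1: score every word with the cheap coarse (e.g. 3-state) model
    and keep the top-N candidates.  Stage 2: rescore only those candidates
    with the expensive full-state model and return the best word."""
    ranked = sorted(vocab, key=lambda w: coarse_score(w, code_seq), reverse=True)
    candidates = ranked[:n_candidates]
    return max(candidates, key=lambda w: full_score(w, code_seq))
```

The speedup comes from the second stage touching only N words instead of all 1,445; the risk, tuned in the experiments above, is that the correct word is pruned in stage 1.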

The present invention is a broad classification multi-stage recognizer that takes both the feature parameters and the number of states into account, and it is highly effective in improving speed.

Systems that prioritize recognition rate usually use word-unit recognizers. Because of the speed problem, however, such recognizers could not be used in settings such as multi-line access environments, where phoneme-unit recognizers were used instead. The significance of the present invention is that it solves the speed problem in a word-unit recognizer whose recognition rate is stable.

Claims (2)

In the speech recognition process, a broad classification technique is used that varies the number of HMM states and the number of feature parameters.
The technique is applied to multi-line dial-up telephone services and embedded systems, where the speed problem of heavy computation is critical.
KR1020010019448A 2001-04-12 2001-04-12 The Performance Improvement of Speech recognition system using Hierarchical Classification Method and Optimal Candidate Model KR20010082390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020010019448A KR20010082390A (en) 2001-04-12 2001-04-12 The Performance Improvement of Speech recognition system using Hierarchical Classification Method and Optimal Candidate Model


Publications (1)

Publication Number Publication Date
KR20010082390A true KR20010082390A (en) 2001-08-30

Family

ID=19708129

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020010019448A KR20010082390A (en) 2001-04-12 2001-04-12 The Performance Improvement of Speech recognition system using Hierarchical Classification Method and Optimal Candidate Model

Country Status (1)

Country Link
KR (1) KR20010082390A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10770070B2 (en) 2018-06-07 2020-09-08 Hyundai Motor Company Voice recognition apparatus, vehicle including the same, and control method thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0527794A (en) * 1991-07-19 1993-02-05 Kokusai Denshin Denwa Co Ltd <Kdd> Speech recognition system
JPH05188989A (en) * 1992-01-13 1993-07-30 Oki Electric Ind Co Ltd Speech recognizing method
KR100289315B1 (en) * 1996-11-20 2001-05-02 박종섭 Voice recognition method using hidden markov model



Similar Documents

Publication Publication Date Title
US11996097B2 (en) Multilingual wakeword detection
CN108305634B (en) Decoding method, decoder and storage medium
US20110077943A1 (en) System for generating language model, method of generating language model, and program for language model generation
CN101436403B (en) Method and system for recognizing tone
Hacioglu et al. On lexicon creation for Turkish LVCSR
Gulzar et al. A systematic analysis of automatic speech recognition: an overview
Michel et al. Blind phoneme segmentation with temporal prediction errors
Mariani Recent advances in speech processing
KR20050083547A (en) Voice processing device and method, recording medium, and program
Watrous et al. Learning phonetic features using connectionist networks
Fu et al. A survey on Chinese speech recognition
Tasnia et al. An overview of bengali speech recognition: Methods, challenges, and future direction
Lei et al. Incorporating tone-related MLP posteriors in the feature representation for Mandarin ASR.
JP2001312293A (en) Method and device for voice recognition, and computer- readable storage medium
JP2000075886A (en) Statistical language model generator and voice recognition device
KR20010082390A (en) The Performance Improvement of Speech recognition system using Hierarchical Classification Method and Optimal Candidate Model
Garud et al. Development of hmm based automatic speech recognition system for Indian english
Phillips et al. Modelling context dependency in acoustic-phonetic and lexical representations
Vazirnezhad et al. Hybrid statistical pronunciation models designed to be trained by a medium-size corpus
Nwe et al. Myanmar language speech recognition with hybrid artificial neural network and hidden Markov model
Vyas et al. Study of Speech Recognition Technology and its Significance in Human-Machine Interface
Jalalvand et al. A classifier combination approach for Farsi accents recognition
Lee et al. Speaker‐independent phoneme recognition using hidden Markov models
Ganesh et al. Syllable based continuous speech recognizer with varied length maximum likelihood character segmentation
Greibus et al. Speech keyword spotting with rule based segmentation

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E601 Decision to refuse application