KR20040001733A

KR20040001733A - Apparatus for calculating an Observation Probability for a search of hidden markov model

Info

Publication number: KR20040001733A
Application number: KR1020020037052A
Authority: KR
Inventors: 박현우; 장호랑; 홍근철; 김성재
Original assignee: 삼성전자주식회사
Priority date: 2002-06-28
Filing date: 2002-06-28
Publication date: 2004-01-07
Also published as: KR100464420B1

Abstract

PURPOSE: An apparatus for calculating the observation probability for a hidden Markov model search is provided to execute efficiently the observation probability computation, which accounts for the largest share of operations, and to reduce power consumption. CONSTITUTION: A storage device stores the means of parameters extracted from representative phonemes and the degree of distribution of those means. A subtractor (405) obtains the difference between the mean and a feature extracted from the voice signal to be recognized. A multiplier (406) multiplies the output of the subtractor by the degree of distribution provided from the storage device. A squarer (407) squares the result of the multiplication. An accumulator (408) accumulates the output of the squarer.

Description

Apparatus for calculating an Observation Probability for a search of hidden markov model

The present invention relates to speech recognition devices and, more particularly, to an observation probability computing device for hidden Markov model search.

Speech recognition can be applied to almost every electronic product encountered in daily life, and its use is expected to expand gradually, starting with low-cost e-toy products.

IBM was the first company to present a commercial technology related to speech recognition, demonstrating its efficiency by first applying hidden Markov models to character recognition (1997.06, US 5,636,291).

This recognition method is divided into three parts: a pre-processor part, a front-end part, and a modeling part. The pre-processor part recognizes the lexemes of the characters to be processed.

The front-end part extracts the feature values, or parameters, to be compared from the recognized lexemes.

The modeling part, meanwhile, uses the extracted parameters to build, through a training phase, a model that serves as the criterion for accurately judging recognized characters. Based on the recognized lexemes, it also decides which of the predefined characters is taken as the recognized character.

IBM later disclosed speech recognition systems and methods using hidden Markov models that can be used in a wider range of applications (1998.08, US 5,799,278). This technique uses hidden Markov models in the speech recognition of isolated words; it concerns a method and a speech recognition system that are trained to recognize phonetically different words and that use hidden Markov models suited to recognizing many words.

In implementing such a speech recognition device, the computation time required for recognition must be shortened. It has been observed that, in a speech recognition device using a hidden Markov model, the observation probability calculation accounts for about 62% of the computation, so its speed needs to be improved.

The object of the present invention is to meet this need by providing an improved observation probability calculating device.

FIG. 1 is a block diagram showing the configuration of a general speech recognition system.

FIG. 2 schematically shows a method of obtaining a state sequence for an arbitrary syllable.

FIG. 3 schematically shows the word recognition process.

FIG. 4 is a block diagram showing the configuration of the observation probability calculating device according to the present invention.

FIG. 5 is provided to aid understanding of the selection of the bit resolution.

FIG. 6 shows an application example of the observation probability calculating device according to the present invention.

FIG. 7 is a block diagram schematically showing the process of receiving control commands and data in the device shown in FIG. 6.

FIG. 8 is a timing diagram schematically showing the process of receiving control commands and data in the device shown in FIG. 6.

To achieve the above object, the observation probability calculating device according to the present invention comprises:

a storage device for storing the mean of parameters extracted from representative phonemes and the degree of distribution (1/σ) of the mean;

a subtractor for calculating the difference between the mean provided from the storage device and a parameter (feature) extracted from the speech signal to be recognized;

a multiplier for multiplying the output of the subtractor by the degree of distribution provided from the storage device;

a squarer for squaring the multiplication result of the multiplier; and

an accumulator for accumulating the output of the squarer.

The observation probability calculating device of the present invention preferably further comprises registers for buffering the degree of distribution, the mean, and the parameters, respectively.

More preferably, the device further comprises a register for buffering the accumulation result of the accumulator. The configuration and operation of the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the configuration of a general speech recognition system. In the system shown in FIG. 1, the A/D block (101) converts the speech signal, which arrives as a continuous signal, into a digital signal that is easy to process.

The pre-emphasis block (102) emphasizes high-frequency components so that, given the characteristics of speech, pronunciations can be distinguished more clearly. The digitized speech signal is split into units of a fixed number of samples for processing; here it is divided into units of 240 samples (30 ms).
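The pre-emphasis and framing steps can be sketched as follows. This is a minimal illustration, not the patent's implementation: the first-order filter form, the coefficient `alpha = 0.97`, and the use of non-overlapping frames are assumptions, since the text specifies only that high frequencies are emphasized and that frames hold 240 samples (30 ms, which implies an 8 kHz sampling rate).

```python
FRAME_LEN = 240  # samples per 30 ms frame, as stated in the text


def pre_emphasize(samples, alpha=0.97):
    """First-order high-frequency boost: y[n] = x[n] - alpha * x[n-1].

    The filter form and alpha are assumed values, not taken from the patent.
    """
    if not samples:
        return []
    out = [float(samples[0])]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


def split_frames(samples, frame_len=FRAME_LEN):
    """Split into consecutive non-overlapping frames (assumption),
    dropping any trailing partial frame."""
    count = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]
```

A constant input is attenuated after the first sample, which is the expected behaviour of a filter that suppresses low frequencies relative to high ones.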

Because the cepstrum derived from the spectrum and the energy are commonly used as the feature vector for hidden Markov models, operations to compute them are required; the block that computes the energy and the spectrum is the Energy Calculation block (103). To obtain the energy, the instantaneous energy over 30 ms is computed in the time domain using an energy formula, given as Equation 1.

<Equation 1>

This energy value is used to decide whether the current input signal is speech or noise. To obtain the spectrum in the frequency domain, the fast Fourier transform (FFT), widely used in signal processing, is applied. The spectrum is obtained through a 256-point complex FFT, which can be expressed as Equation 2.

<Equation 2>

Once the energy result has been used to decide between speech and noise, and the signal is found to be speech, the start and end of the utterance must be determined. This is done by the FindEndPoint block (104). When one valid word has been delimited in this way, only the corresponding spectrum data is stored in the buffer. The buffer block (105) therefore holds only the valid speech signal, with the noise portions of the speaker's utterance removed.
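The energy-based speech/noise decision and end-point search can be sketched as below. Equation 1 is not reproduced in this text, so the root-mean-square form used here is an assumption; it is at least consistent with the Energy Calc. operation counts in Table 1 (240 multiplications, 239 additions, one division, and one square root per 240-sample frame). The fixed `threshold` and the function names are hypothetical.

```python
import math


def frame_energy(frame):
    """RMS energy of one non-empty frame (assumed form of Equation 1)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def find_end_points(frames, threshold):
    """Return (first, last) indices of frames whose energy exceeds
    the threshold, or None if no frame is classified as speech.

    A single fixed threshold is an illustrative assumption.
    """
    speech = [i for i, frame in enumerate(frames)
              if frame_energy(frame) > threshold]
    if not speech:
        return None
    return speech[0], speech[-1]
```

Only the frames between the returned indices would then be kept in the buffer, matching the description of the buffer block (105).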

The Mel-filter block (106) performs Mel-filtering, which filters the spectrum into 32 bands, as a preprocessing step for obtaining the cepstrum from the spectral values.

This step yields spectral values for the 32 bands. Converting these frequency-domain values back to the time domain gives the cepstrum, the parameter used by the hidden Markov model. The inverse discrete cosine transform (IDCT) is used for this conversion to the time domain (IDCT block, 107).

The cepstrum and energy values are used for the hidden Markov model search, but because the energy and cepstrum values differ too much in magnitude (by a factor on the order of 10²), their scales must be adjusted. This scaling is performed by the Scale block (108), using a logarithm. In addition, the Cepstral Window block (109) separates periodicity and energy from the Mel-cepstrum values and improves the noise characteristics. The noise characteristics are improved using Equation 3.

<Equation 3>

Here, Sin_Table can be constructed as in Equation 4.

<Equation 4>

When the above operations are complete, the Normalization block (110) normalizes the energy values, which are the ninth data item of each frame, to values within a fixed range, as follows.

<Equation 5>

As in Equation 5, the largest value among the ninth data items of all frames is found; subtracting this value from the energy data of every frame, as in Equation 6, yields the normalized energy.
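The two normalization steps described here (find the maximum energy, then subtract it from every frame) can be sketched as follows. The list-of-lists layout and the names are assumptions for illustration; only the position of the energy as the ninth element and the max-then-subtract procedure come from the text.

```python
ENERGY_INDEX = 8  # ninth data item of each frame (0-based index 8)


def normalize_energy(frames, energy_index=ENERGY_INDEX):
    """Subtract the largest energy found in any frame from each frame's energy.

    Returns new frame lists; the input is left unmodified.
    """
    if not frames:
        return []
    max_energy = max(frame[energy_index] for frame in frames)
    normalized = []
    for frame in frames:
        copy = list(frame)
        copy[energy_index] = frame[energy_index] - max_energy
        normalized.append(copy)
    return normalized
```

After normalization the loudest frame has energy 0 and all others are negative, so the values fall in a bounded range regardless of the recording level.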

<Equation 6>

In general, increasing the number of kinds of parameters (feature values) is a common way to raise the recognition rate. The most widely used method is to take, in addition to the feature values of each frame, the difference in feature values between frames as a further feature. The Dynamic Feature block (111) computes this delta cepstrum and adopts it as the secondary feature value. The difference between cepstra is computed as in Equation 7.

<Equation 7>

In general, the computation uses two frames before and two frames after the current frame. When this operation is complete, as many delta cepstrum values have been generated as there are cepstrum values.
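A delta computation over two frames on each side can be sketched as below. Equation 7 is not reproduced in this text, so the regression-style weighting used here is an assumption (a common choice for delta features), as is the clamping of indices at the first and last frame. Only the window of two frames before and after, and the one-delta-per-cepstrum output size, come from the text.

```python
def delta_features(frames, window=2):
    """Regression-style delta over `window` frames before and after each frame.

    delta[t][j] = sum_n n * (c[t+n][j] - c[t-n][j]) / (2 * sum_n n^2)
    This weighting is an assumed form, not the patent's Equation 7.
    """
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        row = []
        for j in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, last)][j]  # clamp at the final frame
                behind = frames[max(t - n, 0)][j]    # clamp at the first frame
                acc += n * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas
```

For a feature that rises by exactly 1 per frame, interior frames get a delta of 1.0, which is the slope one would expect.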

Through the above steps, the feature values that are the target of the hidden Markov model search are extracted.

From these feature values, a word search is performed using predetermined hidden Markov models. The hidden Markov model search has three main stages. The first is the observation probability calculation block (112). The search and decision process is fundamentally based on probability: it finds the syllable that is probabilistically the closest match. The probability values divide broadly into observation probabilities and transition probabilities; these are accumulated, and the syllable sequence with the largest accumulated probability is selected. The observation probability can be expressed as Equation 8.

<Equation 8>

Here, dbx is the stochastic distance between the reference mean and the feature value extracted from the input signal. The smaller this distance, the larger the probability. The stochastic distance is given by Equation 9.

<Equation 9>

Here, m denotes the mean of the parameter and Feature denotes the parameter value extracted from the input signal. p is the precision, representing the degree of distribution (variance, 1/σ²), and lw is the log weight. i indexes the mixture, which represents a typical form of the phoneme. For example, to raise recognition accuracy, representative values must be obtained from many speakers; if these are classified, for one phoneme, into several groups that each share a common type, i is the index denoting the representative value of each group. k indexes the frames and j indexes the parameters. For reference, the number of frames depends on the type of word, and mixtures can be classified in various ways according to typical human pronunciation patterns. The log weight appears as a subtraction because the weighting in the linear domain is converted into a weighting in the log domain.
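The distance calculation described by these symbol definitions can be sketched as follows. Equations 8 and 9 are not reproduced in this text, so two points are assumptions: that the per-mixture score is `lw[i]` minus the weighted squared distance, and that the best mixture is chosen with a maximum. The function names are hypothetical.

```python
def mixture_distance(precision, mean, feature):
    """Sum over parameters j of p[j] * (m[j] - feature[j])**2 for one mixture."""
    return sum(p * (m - f) ** 2 for p, m, f in zip(precision, mean, feature))


def observation_score(precisions, means, log_weights, feature):
    """Best log-domain score over mixtures i for one frame's feature vector.

    Assumed form: score[i] = lw[i] - distance[i]; the sign convention and
    the use of max over mixtures are not confirmed by the source.
    """
    scores = [
        lw - mixture_distance(p, m, feature)
        for p, m, lw in zip(precisions, means, log_weights)
    ]
    return max(scores)
```

With this convention a feature vector that coincides with a mixture mean has zero distance, so its score equals that mixture's log weight, in line with the statement that a smaller distance means a larger probability.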

The observation probabilities computed in this way are the probabilities that the phonemes of each predefined word syllable are observed, and each phoneme has a different probability value. Once the observation probabilities for all phonemes have been determined, they are applied to the predefined State Machine Sequence block (113) to find the best-fitting phoneme sequence. In general, each state machine of a hidden Markov model for isolated-word recognition is a sequence built from the feature values of each phoneme of the word to be recognized.

FIG. 2 schematically shows a method of obtaining a state sequence for an arbitrary syllable; it shows the state machine sequence for the syllable "크".

Assuming that the syllable "크" consists of three sequential states (S1, S2, S3), FIG. 2 shows the process of starting from the initial state (S0), passing through S1 and S2, and finally reaching S3. In FIG. 2, moving to the right within the same state represents a delay, and this delay is speaker dependent. That is, in some cases the syllable "크" is produced in a very short time, while in others it takes relatively long. The longer a syllable lasts, the longer the delay in each state. In FIG. 2, Sil denotes silence.

If the user pronounces "크", this state sequence will have the largest probability. Many state sequences like the one in FIG. 2 exist, and because a probability computation on the input signal is carried out for every state sequence, a large amount of computation is required.

Finally, when the probabilistic computation (state sequence processing per phoneme) is complete for all phonemes, a probability value is stored in the final state of each phoneme. In FIG. 2, progress through each state is governed by computing the Alpha value while selecting the maximum branch according to Equation 10. The Alpha value is an accumulation of observation probabilities, obtained from the previous observation probabilities and the inter-phoneme transition probabilities found in advance by empirical experiment.

<Equation 10>

Here, State.Alpha is the newly computed accumulated probability and State.Alpha_prev is the probability accumulated so far. trans_prob[0] is the probability of a transition from state Sn to Sn (e.g., S0→S0), and trans_prob[1] is the probability of a transition from state Sn to state Sn+1 (e.g., S0→S1). o_prob is the observation probability computed for the current state. The Find Maximum Likelihood block (114) selects the recognized word based on the final accumulated probability of each phoneme, obtained as in Equation 10; the word with the largest probability is selected as the recognized word.
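The Alpha update described above can be sketched as one time step of a left-to-right, maximum-branch update. Equation 10 is not reproduced in this text, so the exact form is an assumption: scores are treated as log-domain values that add, each state keeps a `(stay, advance)` pair corresponding to `trans_prob[0]` and `trans_prob[1]`, and a state can be entered only from itself or from its left neighbour. The list-based layout is hypothetical.

```python
NEG_INF = float("-inf")


def update_alphas(alpha_prev, trans_prob, o_prob):
    """One frame of a left-to-right max-branch update in the log domain.

    alpha_prev[n] : score accumulated in state n before this frame
    trans_prob[n] : (stay, advance) log transition scores of state n
    o_prob[n]     : observation score of state n for this frame
    """
    alpha = []
    for n in range(len(alpha_prev)):
        stay = alpha_prev[n] + trans_prob[n][0]
        if n > 0:
            enter = alpha_prev[n - 1] + trans_prob[n - 1][1]
        else:
            enter = NEG_INF  # the first state has no left neighbour
        alpha.append(max(stay, enter) + o_prob[n])
    return alpha
```

Starting with all of the score in the first state, one update leaves the first two states reachable and every later state at negative infinity, which mirrors the step-by-step progression S0, S1, S2, S3 shown in FIG. 2.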

The process of recognizing the word "KBS" is described as an example.

The word "KBS" consists of the three syllables "케이" (kei), "비" (bi), and "에스" (eseu). The syllable "케이" consists of the three phonemes "크", "에", and "이"; the syllable "비" consists of the phonemes "브" and "이"; and the syllable "에스" consists of the three phonemes "이", "에", and "스".

The word "KBS" thus consists of the eight phonemes "크", "에", "이", "브", "이", "이", "에", and "스", and is recognized from the observation probability of each phoneme and the transition probabilities between phonemes.

That is, to recognize the word "KBS", the eight phonemes "크", "에", "이", "브", "이", "이", "에", and "스" must each be recognized as accurately as possible, and on that basis "KBS", the word whose phoneme sequence is the closest match, must be selected.
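The decomposition of "KBS" into syllables and phonemes can be written out as a small lookup. The table contents are copied from the text; the dictionary layout and function name are illustrative assumptions.

```python
# Syllable-to-phoneme table for the "KBS" example, as listed in the text.
SYLLABLE_PHONEMES = {
    "케이": ["크", "에", "이"],
    "비": ["브", "이"],
    "에스": ["이", "에", "스"],
}


def word_to_phonemes(syllables):
    """Concatenate the phoneme lists of the given syllables in order."""
    phonemes = []
    for syllable in syllables:
        phonemes.extend(SYLLABLE_PHONEMES[syllable])
    return phonemes


KBS_PHONEMES = word_to_phonemes(["케이", "비", "에스"])
```

The result has eight entries, which is the phoneme sequence the state machine must match for this word.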

First, the observation probability is computed for each phoneme of the input speech signal. To do so, the similarity, that is, the probability, with respect to each representative phoneme stored in the database is computed, and the probability of the best-matching representative phoneme becomes the observation probability. For example, for the phoneme "크", all representative phonemes stored in the database are compared, and the representative phoneme "크" showing the largest probability is selected.

Once the observation probability has been computed for each phoneme of the input speech signal, that is, once the representative phonemes for each phoneme have been determined, the input speech signal is applied to state machine sequences built from these representative phonemes to determine the best-fitting sequence. The state machine sequence consists of the eight phonemes "크", "에", "이", "브", "이", "이", "에", and "스", and the word "KBS", for which the observation probabilities of these phonemes and their accumulated value are largest, is selected. Each phoneme is in turn made up of three states.

For example, to recognize the word "KBS", the observation probability calculation block (112) computes the observation probabilities for the eight phonemes "크", "에", "이", "브", "이", "이", "에", and "스", and the state machine (113) selects "KBS", the word for which the observation probabilities of the phonemes and their accumulated value are largest.

Many existing speech recognition products implement the above functions in software (C/C++) or in assembly code and run them on a general-purpose processor.

Another approach is to implement them in dedicated hardware (ASIC, Application Specific Integrated Circuit). Each approach has advantages and disadvantages. Software processing takes relatively long to compute, but it is flexible, so functions can be changed easily.

Dedicated hardware, by contrast, is faster and consumes less power than software processing, but it is inflexible, so functions cannot be changed.

The present invention therefore proposes a device that suits the easily modified software approach while supporting relatively fast processing.

Table 1 shows the number of operations required for each function when a general-purpose processor is used with software processing. The counts are numbers of operations such as multiplication, addition, logarithm, and exponentiation, not actual instruction counts.

<Table 1> Operation counts per function (function groups: Pre-processing; Mel-filtering & Cepstrum; HMM)

Function        Multiplication  Addition  Division  Square root   LOG  Total operations
Pre-emphasis               160       160         0            0     0               320
Energy Calc.               240       239         1            1     0               481
FFT                      4,096     6,144         0            0     0            10,240
Mel-Filter                 234       202         0            0    32               468
IDCT                       288       279         0            0     0               567
Scaling                      9         0         0            0     0                 9
Cepstr.                     36         1         9            0     0                46
Observ. Prob.           43,200    45,600         0            0     0            88,800
State Machine                0       600         0            0     1               601
Sum                     48,263    53,225        10            1    33           101,532

As Table 1 shows, general speech recognition processing requires about 100,000 operations in total, of which 88,800, or about 87.5% of the exact total of 101,532, are observation probability operations.
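The share can be checked directly from the total-operations figures in Table 1. The short calculation below sums the per-function totals and divides; measured against the exact total rather than the rounded figure of 100,000, the observation probability share comes to about 87.5%. The dictionary is simply a transcription of the table.

```python
# Per-function totals transcribed from the total-operations row of Table 1.
TOTALS = {
    "Pre-emphasis": 320,
    "Energy Calc.": 481,
    "FFT": 10240,
    "Mel-Filter": 468,
    "IDCT": 567,
    "Scaling": 9,
    "Cepstr.": 46,
    "Observ. Prob.": 88800,
    "State Machine": 601,
}

grand_total = sum(TOTALS.values())
observation_share = TOTALS["Observ. Prob."] / grand_total

print(grand_total, f"{observation_share:.1%}")  # 101532 87.5%
```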

In practice, when the above algorithm is implemented on the ARM processor family, a widely used general-purpose processor, about 36 million instructions are executed to perform the entire function, of which about 33 million are spent on the hidden Markov model search. Table 2 breaks down, by function block, the instructions required when speech recognition is performed on an actual ARM processor.

<Table 2>

Function Block                       Number of instruction cycles  Percentage (%)
Observation Probability calculation                    22,267,200           61.7%
State Machine Update                                   11,183,240           30.7%
FFT                                                       910,935           2.50%
Find Maximum Likelihood                                   531,640           1.46%
Mel-Filtering/IDCT/Scaling                                473,630           1.30%
Dynamic Feature                                           283,181           0.78%
Pre-emphasis & Energy Calculation                         272,037           0.75%
Cepstral Window & Normalize                               156,061           0.43%
Find End Point                                            123,050           0.30%
Total                                                  36,400,974         100.00%

As Table 2 shows, the observation probability calculation also accounts for about 62% of the instructions executed. Supporting this part, which executes the most instructions, with a dedicated device can therefore increase processing speed and reduce power consumption.

The present invention accordingly proposes a dedicated device that can perform the observation probability calculation with few instructions, that is, in few cycles.

To improve the observation probability calculation, the present invention presents a dedicated device that can perform, with a single instruction, the operation of Equation 11, which is the most computation-intensive part of the stochastic distance formula of Equation 9.

<Equation 11>

Here, p[i][j] is the precision, representing the degree of distribution (variance, 1/σ²); mean[i][j] is the mean value of each phoneme; and feature[k][j] is the parameter value for the phoneme, namely the energy and the cepstrum. In Equation 11, mean[i][j] - feature[k][j] expresses how far, as a distance, the parameter of the input phoneme lies from the predefined representative parameter, and it is squared to obtain an absolute stochastic distance. Multiplying this by the variance term gives an objective estimate of the real distance. The representative parameter values are obtained empirically from a large body of speech data; the more speech data collected from different speakers, the better the recognition rate.

In the present invention, however, the operation of Equation 12 is performed in order to maximize the recognition rate given the limits of the hardware, namely the limited data width (16 bits).

<Equation 12>

Here, unlike the variance term 1/σ² in Equation 11, p[i][j] is 1/σ, which represents the degree of distribution. The reason for using the degree of distribution 1/σ instead of the variance term 1/σ² is as follows.

수학식 9에 의하면 (m[i][j]-feature[i][j]을 자승 연산한 결과와 p[i][j]를 곱셈 연산하고 있으나 수학식 12에 의하면 p[i][j]·(m[i][j]-feature[i][j])을 연산한 결과를 자승연산하고 있다.According to Equation 9, the result of squared operation of (m [i] [j] -feature [i] [j] and p [i] [j] is multiplied, but according to Equation 12, p [i] [j ] The result of calculating (m [i] [j] -feature [i] [j]) is squared.

수학식 9에 의하면 p[i][j]를 표현하기 위하여 자승 연산할 결과와 같은 정도의 비트 해상도가 필요하지만 수학식 12에 의하면 (m[i][j]-feature[i][j])의 결과와 같은 정도의 비트 해상도만이 필요하다는 것을 의미한다.Equation (9) requires the same bit resolution as the result of the squared operation to express p [i] [j], but according to Equation (12) (m [i] [j] -feature [i] [j] This means that only a bit resolution of the same level is required.

In other words, to maintain 16-bit resolution, Equation 9 requires 32 bits to express p[i][j], whereas Equation 12 requires only 16 bits. Moreover, because Equation 12 squares the result of p[i][j]·(m[i][j] - feature[i][j]), it achieves an effect similar to using 1/σ² as in Equation 9.
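By way of illustration, the equivalence described above can be checked numerically. The following is a minimal sketch based only on the verbal description of Equations 9 and 12 in this text (the equations themselves are not reproduced here); the function names and the sample values are hypothetical.

```python
import math

def distance_square_then_scale(mean, feature, sigma):
    """Equation 9 style as described: square the difference, then multiply by 1/sigma^2."""
    precision = 1.0 / sigma ** 2          # assumed precision for this ordering
    return precision * (mean - feature) ** 2

def distance_scale_then_square(mean, feature, sigma):
    """Equation 12 style as described: multiply the difference by 1/sigma, then square."""
    precision = 1.0 / sigma               # assumed precision for this ordering
    return (precision * (mean - feature)) ** 2

# Hypothetical sample values, chosen only for illustration.
a = distance_square_then_scale(1.5, 1.0, 0.25)
b = distance_scale_then_square(1.5, 1.0, 0.25)
assert math.isclose(a, b)
```

In exact arithmetic the two orderings give the same value; the difference lies only in how many bits the stored precision needs, since 1/σ has a much smaller dynamic range than 1/σ².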

FIG. 4 is a block diagram showing the configuration of the observation probability calculating apparatus according to the present invention. The apparatus shown in FIG. 4 includes a subtractor 405, a multiplier 406, a squarer 407, and an accumulator 408. Reference numerals 402, 403, 404, and 409 denote registers.

The external storage device 401 is a database-type storage device that stores the precision, mean, and feature values for all representative phonemes. Here, precision represents the degree of distribution (1/σ), mean is the average of the parameters (energy and cepstrum) representing each representative phoneme, and feature[k][j] is the parameter value of the input phoneme, namely its energy and cepstrum.

In the apparatus shown in FIG. 4, the subtractor 405 first obtains the difference between the mean and the feature, and the multiplier 406 multiplies this difference by the degree of distribution (1/σ) to obtain the actual distance. The squarer 407 then squares the product to obtain an absolute difference, and the accumulator 408 adds it to the values accumulated for the preceding parameters.

That is, the result expressed by Equation 12 is obtained at the multiplier 406, and the result of the operation expressed by Equation 9 is obtained at the accumulator 408.

The external storage device stores p[i][j], mean[i][j], and feature[i][j], which are supplied sequentially to the registers 402, 403, and 404 in a predetermined order. The predetermined order is set so that i and j increase sequentially.

As i and j are varied, p[i][j], mean[i][j], and feature[i][j] are supplied sequentially to the registers 402, 403, and 404, and the finally accumulated observation probability is obtained in the register 409.
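The data flow through the subtractor, multiplier, squarer, and accumulator can be sketched in software as follows. This is an illustrative floating-point model only, not the hardware itself; the function names are hypothetical, and the feature is assumed to be a single frame vector indexed by j.

```python
def observation_score(precision, mean, feature):
    """Accumulate (precision[j] * (mean[j] - feature[j]))**2 over the parameters j."""
    acc = 0.0                           # plays the role of register 409
    for p, m, f in zip(precision, mean, feature):
        diff = m - f                    # subtractor 405
        scaled = p * diff               # multiplier 406
        acc += scaled * scaled          # squarer 407 feeding accumulator 408
    return acc

def score_all_phonemes(precision, mean, feature):
    """Run the same accumulation for every representative phoneme i."""
    return [observation_score(precision[i], mean[i], feature)
            for i in range(len(mean))]

# Hypothetical values: two parameters for one phoneme.
example = observation_score([2.0, 1.0], [1.5, 0.0], [1.0, 2.0])
```

With the hypothetical values above, the two terms are (2.0 × 0.5)² and (1.0 × −2.0)², so the accumulated value is 5.0.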

Through this cumulative calculation, the phoneme that is probabilistically most similar obtains the largest value. The registers 402, 403, 404, and 409 at the first and last stages of the operation are used to stabilize the data.

In the apparatus shown in FIG. 4, the bit resolution of the data may vary with the processor architecture, and the larger the number of bits, the more precise the calculation result. However, because bit resolution affects circuit size, an appropriate resolution must be chosen with the recognition rate in mind.

FIG. 5 is provided to aid understanding of how the bit resolution is chosen. As one example, FIG. 5 shows the internal bit resolutions for a processor with 16-bit resolution. The truncation at each stage follows from the 16-bit data width limit and is chosen to minimize performance degradation. Using the apparatus proposed in the present invention yields a substantial improvement in processing speed over using a general-purpose processor alone.

The feature and the mean each consist of a 4-bit integer part and a 12-bit fractional part. The subtractor 405 subtracts one from the other to produce a result that likewise consists of a 4-bit integer part and a 12-bit fractional part.

The precision consists of a 7-bit integer part and a 9-bit fractional part. The multiplier 406 multiplies the precision by the output of the subtractor 405 to produce a result consisting of a 10-bit integer part and a 6-bit fractional part.

The squarer 407 squares the output of the multiplier 406 to produce a result consisting of a 20-bit integer part and a 12-bit fractional part, and the accumulator 408 adds and scales these results to produce a value consisting of a 21-bit integer part and an 11-bit fractional part.
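The bit widths given above can be traced with a small integer model. The sketch below is a hedged illustration under several assumptions that the text does not state: values are signed integers, truncation is done by an arithmetic right shift, the "scale" step before accumulation is a one-bit right shift, and overflow or saturation is ignored. The function names are hypothetical.

```python
def to_fixed(x, frac_bits):
    """Convert a real number to an integer with the given number of fractional bits."""
    return int(round(x * (1 << frac_bits)))

def observation_term(mean, feature, precision):
    """Return one squared term in the 21.11 format described for the accumulator."""
    m = to_fixed(mean, 12)        # 4-bit integer, 12-bit fraction
    f = to_fixed(feature, 12)     # 4-bit integer, 12-bit fraction
    p = to_fixed(precision, 9)    # 7-bit integer, 9-bit fraction
    diff = m - f                  # subtractor 405: still 12 fractional bits
    prod = (p * diff) >> 15       # multiplier 406: 21 fractional bits cut to 6
    sq = prod * prod              # squarer 407: 12 fractional bits
    return sq >> 1                # assumed scaling to 11 fractional bits

# Hypothetical values: (2.0 * (1.5 - 1.0))**2 = 1.0
term = observation_term(1.5, 1.0, 2.0)
as_real = term / (1 << 11)
```

For these sample inputs the integer result is 2048, which is 1.0 when read with 11 fractional bits. The truncation after the multiplier is where most of the precision is given up, which is consistent with the remark that each truncation is a compromise imposed by the 16-bit data width.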

Table 3 compares running a widely used hidden Markov model speech recognition algorithm on a general-purpose processor (ARM series) with running it on a dedicated processor that incorporates the observation probability calculating apparatus proposed in the present invention.

<Table 3>

Processor                                        Cycles        Time (20 MHz clock)
ARM processor                                    36,400,974    1.82 s
With observation probability calculating unit    15,151,534    0.758 s
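The times in Table 3 follow directly from the cycle counts at a 20 MHz clock, as the short check below illustrates. The function name is hypothetical; the cycle counts and the clock rate are taken from the table.

```python
CLOCK_HZ = 20_000_000  # 20 MHz clock, as stated in Table 3

def run_time_seconds(cycles, clock_hz=CLOCK_HZ):
    """Execution time for a given cycle count at a fixed clock rate."""
    return cycles / clock_hz

arm_time = run_time_seconds(36_400_974)        # general-purpose ARM processor
dedicated_time = run_time_seconds(15_151_534)  # with the dedicated unit
cycle_ratio = 15_151_534 / 36_400_974          # fraction of cycles still needed
```

The ratio works out to a little under 0.42, that is, the dedicated unit removes more than half of the cycles.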

As Table 3 shows, the general-purpose processor takes about 36 million cycles to perform the speech recognition function, whereas the dedicated processor incorporating the dedicated apparatus needs only about 15 million cycles, roughly half. Near-real-time speech recognition is therefore possible; put differently, the same performance as the general-purpose processor can be achieved at a lower clock frequency, which brings a considerable benefit in power consumption as well. For reference, the relationship between power consumption and clock frequency can be expressed as in Equation 13.

<Equation 13>

Here, P is the power consumption and C is the capacitance of the circuit. f represents the overall switching activity of the signals in the circuit, which is determined almost entirely by the clock speed. V is the supply voltage. Therefore, halving the clock speed theoretically halves the power consumption.
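The proportionality between power and clock frequency can be illustrated with the commonly used dynamic power relation P = C·f·V², which matches the symbol definitions given here. Equation 13 itself is not reproduced in this text and may include a constant factor, so the sketch below is an assumption; the capacitance and voltage values are hypothetical.

```python
import math

def dynamic_power(capacitance, frequency, voltage):
    """Assumed dynamic power relation: P = C * f * V^2."""
    return capacitance * frequency * voltage ** 2

# Hypothetical circuit values, chosen only to show the scaling.
full_speed = dynamic_power(1e-9, 20e6, 3.3)
half_speed = dynamic_power(1e-9, 10e6, 3.3)
assert math.isclose(half_speed / full_speed, 0.5)
```

Any constant factor in front of the expression cancels in the ratio, so the conclusion that halving f halves P holds either way, provided C and V are unchanged.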

As shown in FIG. 4, the apparatus of the present invention stores in the external storage device 401 the mean parameters of the representative phonemes for each type of speaker, obtained in advance by empirical methods, together with the transition probability values, the degrees of distribution, and the parameters extracted from newly input speech. These data are first stored in the registers 402, 403, and 404 inside the dedicated apparatus; this minimizes signal changes caused by changes in external data and is closely related to power consumption. Of the data stored in the internal registers, the parameter extracted from the input speech (feature) and the prestored mean parameter (mean) are passed to the subtractor 405, which computes the difference between them.

This result is multiplied by the precision, which represents the degree of distribution (1/σ), in the multiplier 406, and the squarer 407 then computes the effective probabilistic distance. Because this value covers only the current parameter in time among the many speech parameter frames that form a word, it must be added by the accumulator 408 to the probabilistic distance computed so far. The register 409 is used together with the accumulator 408 for this accumulation, and the data stored in the register is fed back to the accumulator 408 for the next operation.

These registers should be used not only for the accumulation but also to minimize signal transitions. The above process is applied identically to each predefined phoneme, and the value is stored in the corresponding storage location for each phoneme and state. Consequently, once the operations for all parameters of the input word are complete, the largest of the values accumulated per phoneme for each word identifies the word that is probabilistically most similar. Determining the final recognized word from these accumulated values is performed by the existing processor.
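The final selection step, which the text assigns to the existing processor, amounts to picking the candidate with the largest accumulated value. A minimal sketch follows; the function name, the candidate words, and the scores are hypothetical.

```python
def recognize(accumulated):
    """Return the candidate whose accumulated value is largest, as the text describes."""
    return max(accumulated, key=accumulated.get)

# Hypothetical accumulated values for three candidate words.
best = recognize({"yes": 3.2, "no": 7.9, "stop": 5.1})
```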

FIG. 6 shows an example application of the observation probability calculating apparatus according to the present invention. The apparatus shown in FIG. 6 is a dedicated processor for speaker-independent speech recognition and uses a three-bus system. The observation probability calculating apparatus according to the present invention is implemented inside the HMM module 628 shown in FIG. 6, and the component modules share three data buses (two read buses and one write bus) and two OPcode buses for operation codes.

In FIG. 6, the control unit (Ctrl Unit, 602) is a general-purpose processor; REG FILE (604) is a module that serves as a register file; ALU (606) is a module that performs arithmetic and logic operations; MAC (608) is a module that performs multiply-and-accumulate operations; B SHIFTER (610) is a module that performs barrel shifting; FFT (612) is a module that performs FFT operations; SQRT (614) is a module that performs square and square-root operations; TIMER (616) is a module that performs a timer function; and CLKGEN (618) is a module that generates clocks. CLKGEN (618) receives a clock signal supplied from inside or outside the apparatus shown in FIG. 6, generates the clock signals supplied to each component module shown in FIG. 6, and in particular adjusts the clock speed to reduce power consumption.

Likewise, PMEM (620), PMIF (622), EXIF (624), MEMIF (626), HMM (628), SIF (630), UART (632), GPIO (634), CODEC IF (636), and CODEC (640) are modules that perform the functions of program memory, program memory interface, external interface, memory interface, hidden Markov model operation, synchronous serial interface, asynchronous serial interface, general-purpose input/output, codec interface, and codec, respectively. In particular, the HMM (628) searches for words from the feature values using a predetermined hidden Markov model.

In addition, the external bus 452 is a bus for the data interface with external memory. The EXIF (624) supports DMA (direct memory access). In particular, the HMM (628) includes the apparatus of FIG. 4 for calculating observation probabilities.

A controller (decoder, not shown) inside each component receives commands over the instruction buses (OPcode buses 648 and 650), decodes them, and performs the required operation. That is, the controller inside the HMM (628) receives a command over the control instruction buses (OPcode bus0 and bus1), decodes it, and controls the observation probability calculating apparatus of the present invention, as shown in FIG. 4, to perform the observation probability calculation. Data are supplied over the two read buses 642 and 644 and output over the single write bus 646.

The apparatus shown in FIG. 6 includes a program memory (PMEM, 620), and programs are loaded into the program memory (PMEM, 620) through the external interface (EXIF, 624).

The HMM (628) receives control commands issued by the control unit (Ctrl Unit, 602) shown in FIG. 6 over the two OPcode buses 648 and 650; its internal controller (not shown) decodes the received command and controls the observation probability calculating apparatus shown in FIG. 4 to perform the observation probability calculation.

The control unit (Ctrl Unit, 602) either decodes a control command itself and directs the specified operation, or controls the operation of the component modules through OPcode bus0 and bus1 (648, 650). The component modules share OPcode bus0 and bus1 and read buses A and B.

When the control unit (Ctrl Unit, 602) exercises direct control, it fetches a control command from the program memory (PMEM, 620), decodes it, reads the operands (the data to be operated on) needed for the operation, and stores them in the register file (REG FILE, 604). It then performs the operation using the ALU (Arithmetic Logic Unit, 606) for control logic, the MAC (Multiply and Accumulate, 608) for multiplication and accumulation, the B SHIFTER (Barrel Shifter, 610) for barrel shifts, or the SQRT (Square and Root, 614) for square and square-root operations, and stores the result back in the register file (REG FILE, 604).

When the control unit (Ctrl Unit, 602) does not exercise direct control, OPcode bus0 and bus1 (648, 650) are used. Instead of decoding the control command fetched from the program memory (PMEM, 620), the control unit (Ctrl Unit, 602) places the fetched command on OPcode bus0 (648) and then on OPcode bus1 (650).

The same control command is applied to OPcode bus0 (648) and OPcode bus1 (650) in turn, one clock apart. When a control command appears on OPcode bus0 (648), each component module determines whether the command is addressed to it; if so, the module decodes it and enters a standby state, ready to perform the operation specified by the command. For this purpose, the component modules have decoders for decoding control commands. One clock later, when the same control command appears on OPcode bus1 (650), the module carries out the control needed to perform the specified operation. RT and ET signal lines are assigned to indicate whether the control code on each OPcode bus (648, 650) is enabled.

In FIG. 8, the topmost signal is the clock signal (CLK), followed in order by the control command applied to OPcode bus0, the control command applied to OPcode bus1, the RT signal, the ET signal, the data applied to read bus A, and the data applied to read bus B.

When a control command is applied to OPcode bus0 (648) and enabled by the RT signal, one of the component modules of FIG. 6 recognizes it, decodes it, and enters the standby state. When the same control command is subsequently applied to OPcode bus1 (650) and enabled by the ET signal, that module performs the operation specified by the command. Specifically, it takes in the data applied to read bus A and read bus B, performs the specified operation, and outputs the result over the write bus.
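The two-phase command handling described above (decode on OPcode bus0 when RT is active, execute one clock later on OPcode bus1 when ET is active) can be modelled roughly as follows. This is a behavioral sketch under assumptions, not the circuit: the class name, the opcode names "ADD", "SUB", and "MUL", and the operand values are all hypothetical.

```python
class ComponentModule:
    """Hypothetical model of one component module attached to the OPcode buses."""

    def __init__(self, name, operations):
        self.name = name
        self.operations = operations   # opcode -> function of (read_a, read_b)
        self.pending = None            # opcode decoded and waiting in standby

    def clock(self, bus0, rt, bus1, et, read_a=None, read_b=None):
        """Advance one clock; return the write-bus value, or None if idle."""
        result = None
        # Execute phase: the same opcode is now on OPcode bus1 and ET is active.
        if et and self.pending is not None and bus1 == self.pending:
            result = self.operations[self.pending](read_a, read_b)
            self.pending = None
        # Decode phase: an opcode on OPcode bus0 with RT active is checked.
        if rt and bus0 in self.operations:
            self.pending = bus0
        return result

alu = ComponentModule("ALU", {"ADD": lambda a, b: a + b, "SUB": lambda a, b: a - b})
mac = ComponentModule("MAC", {"MUL": lambda a, b: a * b})

# Clock 1: "SUB" on OPcode bus0 with RT active; only the ALU enters standby.
first = [m.clock("SUB", True, None, False) for m in (alu, mac)]
# Clock 2: "SUB" on OPcode bus1 with ET active; operands arrive on read buses A and B.
second = [m.clock(None, False, "SUB", True, 5, 3) for m in (alu, mac)]
```

Checking the execute phase before the decode phase within a clock lets a module accept a new command on OPcode bus0 in the same cycle in which it executes the previous one; whether the actual hardware overlaps commands this way is not stated in the text.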

With the observation probability calculating apparatus of the present invention, the observation probability calculation, which accounts for the largest share of operations in a hidden Markov model search, can be performed efficiently.

The dedicated observation probability calculating apparatus for hidden Markov model search implemented through the present invention was devised to speed up the speech recognition function. Because it reduces the instruction count by more than 50% compared with not using the apparatus, the same function can be processed in the same time at a lower clock speed, and power consumption can be reduced to one half.

The apparatus can also be used for other probabilistic operations that use hidden Markov models.

Claims

1. An observation probability calculating apparatus comprising:

a storage device for storing the mean of parameters extracted from representative phonemes and the degree of distribution (1/σ) of the mean;

a subtractor for calculating the difference between the mean provided from the storage device and a feature extracted from a speech signal to be recognized; and

a multiplier for multiplying the output of the subtractor by the degree of distribution provided from the storage device.

2. The apparatus of claim 1, wherein, when i is an index indicating the type of representative phoneme and j is an index indicating the parameter number, the storage device stores precision[i][j] and mean[i][j] and provides them in a predetermined order;

the subtractor computes the difference between mean[i][j] and feature[i][j] in the predetermined order; and

the multiplier multiplies precision[i][j] by the subtraction result of the subtractor in the predetermined order.

3. The apparatus of claim 2, further comprising a squarer for squaring the multiplication result of the multiplier.

4. The apparatus of claim 3, further comprising registers for buffering precision[i][j], mean[i][j], and feature[i][j], respectively.

5. The apparatus of claim 3, further comprising an accumulator for accumulating the output of the squarer.

6. The apparatus of claim 5, further comprising a register for buffering the accumulation result of the accumulator.