KR100322731B1 - Voice recognition method and method of normalizing time of voice pattern adapted therefor

Info

Publication number: KR100322731B1
Application number: KR1019950043908A
Authority: KR
Inventors: 김경선; 권철중
Original assignee: 윤종용; 삼성전자 주식회사
Priority date: 1995-11-27
Filing date: 1995-11-27
Publication date: 2002-06-20
Also published as: KR970029327A

Abstract

PURPOSE: A speech recognition method and a speech pattern time normalization method suited to it are provided, which correct time warping in a speech pattern using spectral change information and energy change information of the speech, improving the recognition rate of a recognition system. CONSTITUTION: Time warping is removed from an input speech pattern to convert it, through time normalization, into a speech pattern with a length similar to that of the reference patterns. Feature information is extracted from the normalized speech pattern. The distances between the speech pattern and the reference patterns are calculated using this feature information, and the reference pattern closest to the speech pattern is provided as the recognition result.

Description

Speech recognition method and speech pattern time normalization method suitable therefor

The present invention relates to a speech recognition method and, more particularly, to an improved speech recognition method that raises the recognition rate of a speech recognition system by using spectral change information and energy change information of the speech to correct time warping of a speech pattern, for example the length variations and discontinuities caused by differences in speaking rate and an unnatural speaking manner, and to a speech pattern time normalization method suitable for it.

As shown in FIG. 1, a conventional speech recognition method computes the distance between each reference pattern stored in the speech recognition system and the input test pattern, and outputs the reference pattern closest to the test pattern as the recognition result.
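The conventional scheme described above can be sketched as nearest-reference matching. This is a minimal illustration, not the method of FIG. 1: it assumes patterns are lists of feature vectors with equal frame counts and uses a frame-wise Euclidean distance; the names `pattern_distance` and `recognize` are hypothetical.

```python
import math


def pattern_distance(a, b):
    """Mean frame-wise Euclidean distance between two equal-length patterns.

    Assumption for illustration: both patterns have the same number of frames.
    """
    if len(a) != len(b):
        raise ValueError("patterns must have the same number of frames")
    return sum(math.dist(u, v) for u, v in zip(a, b)) / len(a)


def recognize(test_pattern, references):
    """Return the label of the stored reference pattern closest to test_pattern."""
    return min(
        references,
        key=lambda label: pattern_distance(test_pattern, references[label]),
    )
```

The equal-length requirement in this sketch is exactly where utterance-length differences cause trouble: a test pattern spoken much slower or faster than the reference cannot be compared frame by frame without some form of time alignment or normalization.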

The range of test patterns that can occur differs from speaker to speaker, from one speaking environment to another, and from word to word, so there is always some probability of misrecognition.

Table 1 shows the per-word misrecognition rate caused by differences in utterance length in a speaker-dependent isolated-word recognition system. The figures in Table 1 were measured on commands that control the windows of a personal computer.

Table 2 shows the per-word misrecognition rate of a speech recognition system that additionally uses two sets of auxiliary reference patterns whose lengths differ from the base reference pattern by 50%.

As Table 2 shows, using the auxiliary reference patterns improves the misrecognition rate by up to 58.7%.

Raising recognition performance further would require several sets of auxiliary reference patterns, but in a practical speaker-dependent recognition system the reference pattern set is limited to one because of cost and because of the response time imposed by the amount of computation.

In fact, a recognition system that uses two auxiliary reference patterns requires three times the memory and a computation speed about 2.3 times higher than a system that does not. In particular, when the vocabulary is large, having the user build auxiliary reference patterns by pronouncing each word again at different speeds and intonations is hardly practical because of the speaking burden it imposes. Moreover, to raise the recognition rate and broaden the range of users, that is, to realize a speaker-independent system, which is the ultimate goal of speech recognition, the mismatch in utterance length caused by speaking habits and by the user's surroundings must be resolved.

Meanwhile, existing methods based on DTW (Dynamic Time Warping), NN (Neural Network), and HMM (Hidden Markov Modeling) address misrecognition caused by utterance-length differences only by finding the maximum probability value through passive pattern matching against the given pattern, so their effect is limited.

As Table 1 shows, when the length differs from the base reference pattern by 70% or more, the misrecognition rate is 19% or higher, which reveals the limitation of the algorithms themselves.

This is because, when the existing methods are applied to abnormal or unnatural utterances, distortion of the speech feature pattern itself leads to wrong results.

For example, in the DTW method the limit on the time difference over which a feature obtained in the current frame is assumed to persist is about 30 to 100%, although it varies from system to system. An input pattern whose utterance-length difference exceeds this limit can yield a wrong recognition result.
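To make the DTW limit concrete, the following is a minimal textbook DTW with a simple band constraint, written as an illustration rather than as the algorithm of any particular system. The parameter `max_stretch` is a hypothetical stand-in for the 30 to 100% limit mentioned above: it bounds how far the warping path may leave the diagonal, as a fraction of the shorter pattern.

```python
import math


def dtw_distance(a, b, max_stretch=1.0):
    """Accumulated DTW cost between two sequences of feature vectors.

    The warping path must satisfy |i - j| <= band, where
    band = int(max_stretch * min(len(a), len(b))).
    Returns math.inf when no path fits inside the band.
    """
    n, m = len(a), len(b)
    band = int(max_stretch * min(n, m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band around the diagonal are evaluated.
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

With a three-frame pattern and a six-frame version of the same pattern, `max_stretch=1.0` still finds a zero-cost alignment, while `max_stretch=0.5` leaves no admissible path. This is the situation the text describes: once the length difference exceeds what the matcher allows, the match fails regardless of how similar the content is.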

In the HMM method, the difference in utterance length is absorbed by the observation or transition probabilities of particular states. When the difference is large, the nature of the HMM method makes the recognition probability very small, which causes misrecognition.

Among the NN-based methods, TDNN (Time Delay Neural Network) was proposed to deal with utterance-length differences. This method finds the features that maximize the discrimination between the overall input patterns and the specific pattern used for training, and updates and stores the weights accordingly.

If the utterance lengths of the input patterns differ greatly, the features used to increase discrimination may be confined to silence or to a particular vowel rather than to the desired regions, such as transitions between phonemes or whole vowels, which causes misrecognition.

Therefore, when a speech input arrives whose time warping exceeds the limit tolerated by the speech recognition system in use, the time warping should be removed from the input pattern itself so that it becomes a pattern with a length similar to the reference pattern or to the patterns used for training.

The present invention was devised to meet this need. Its object is to provide an improved speech recognition method that raises the recognition rate by correcting the length variations and discontinuities caused by differences in speaking rate and an unnatural speaking manner, so that the input becomes a pattern with a length similar to the reference pattern.

Another object of the present invention is to provide a speech pattern time normalization method suitable for the above speech recognition method.

To achieve the above object, the speech recognition method according to the present invention comprises:

a time normalization process of removing the time warping of an input speech pattern and converting it into a speech pattern with a length similar to the reference pattern;

a process of extracting feature information from the speech pattern normalized by the time normalization process; and

a recognition process of computing the distance to each reference pattern using the feature information of the speech pattern and providing the reference pattern with the smallest distance as the recognition result.

To achieve the above object, the time normalization method of the present invention,

which removes the time warping of an input speech pattern and converts it into a speech pattern with a length similar to the reference pattern, comprises:

a feature pattern extraction process of calculating feature information of the speech pattern;

an energy change calculation process of calculating energy change information based on the feature information obtained in the feature pattern extraction process;

a spectral change calculation process of calculating spectral change information based on the feature information obtained in the feature pattern extraction process;

a deletion section detection process of detecting, within the speech section, the parts with small spectral change and the parts with small energy change, based on the energy change information calculated in the energy change calculation process and the spectral change information calculated in the spectral change calculation process; and

a pattern update process of generating a new input pattern by a weighted average of the spectral vectors of the deletion section and the neighboring input vectors. The present invention is described in detail below with reference to the accompanying drawings.

Human speech consists broadly of three kinds of sound: consonants, vowels, and silence. When such speech is analyzed in the frequency domain, it is generally known that the spectral characteristics change rapidly in consonant sections, consonant/vowel transition sections, and speech/noise transition sections, but not in silent sections or vowel sections.

Building on this observation, when abnormal speech arrives, for example speech pronounced longer or shorter than necessary, speech with pauses longer than necessary, or speech whose detected start and end points are inconsistent, the present invention takes the length of the reference pattern as the normal length and adjusts the sections with little spectral change so that their length becomes normal.

FIG. 2 is a flowchart of the speech recognition method according to the present invention. FIG. 2 differs from the method shown in FIG. 1 in the time normalization process applied to the speech signal. When a speech input arrives with more time warping than the speech recognizer tolerates, the time normalization process removes the time warping using features extracted from the input pattern itself. The result is a speech pattern with characteristics similar to the reference pattern or to the patterns used for training.

FIG. 3 is a flowchart showing the speech pattern time normalization process of FIG. 2 in more detail. The method shown in FIG. 3 comprises a speech section detection process (300); an energy calculation process (302) that obtains, for the detected speech section, the energy E_Pi of each pulse section, the energy E_Ni of each noise section, the average energy of all pulse sections, and the average energy of the noise sections; a feature information calculation process (304); an energy change calculation process (306); a spectral change calculation process (308); a noise section compression process (310); a deletion section detection process (312); and a pattern update process (314).

The speech section detection process (300) picks out the part of the signal received through the microphone that corresponds to speech. A speech section consists of pulse sections (Pi), which correspond to actual speech pulses, and noise sections (Ni), which correspond to silence.

The energy calculation process (302) obtains, for the detected speech section, the energy E_Pi of each pulse section, the energy E_Ni of each noise section, the average energy of all pulse sections, and the average energy of the noise sections.

These values are then used to remove the noise attached before and after the speech section detected in the speech section detection process (300).

The feature information calculation process (304) extracts the speech feature pattern. In the present invention, the cepstrum is extracted using LPC parameters.
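The text does not state how the cepstrum is derived from the LPC parameters. One standard route, shown here only as a sketch, is the well-known recursion for the all-pole model H(z) = 1 / (1 - Σ a_k z^-k). The sign convention for the predictor coefficients and the function name `lpc_to_cepstrum` are assumptions, and the gain term c_0 is omitted.

```python
def lpc_to_cepstrum(a, order):
    """Convert LPC predictor coefficients to cepstral coefficients c_1..c_order.

    a[k - 1] holds a_k of the assumed model H(z) = 1 / (1 - sum_k a_k z^-k).
    Recursion: c_n = a_n + sum_{k=1}^{n-1} (k / n) * c_k * a_{n-k},
    with a_n taken as 0 for n greater than the LPC order.
    """
    p = len(a)
    c = []
    for n in range(1, order + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c
```

For a first-order model with a_1 = 0.5 the recursion gives c_n = 0.5^n / n, which matches the series expansion of -ln(1 - 0.5 z^-1). A 10th-order cepstrum, as used later in the description, would correspond to `order=10`.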

For the speech section readjusted in the energy calculation process (302), the energy change calculation process (306) obtains the per-frame energy D₀, the first-order difference energy D₁, the second-order difference energy D₂, and the average of each, using the following equations.

Here, i is the frame number, 0 ≤ i < I,

n is the sample index within a frame, 0 ≤ n < N, where N is the frame size,

x(i,n) is the n-th sample of the i-th frame, and

d is the direct-current (DC) value.
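The equations for D₀, D₁, and D₂ are not reproduced in this text, so the following is only one plausible reading of the definitions above, not the patent's formula. It assumes D₀(i) is the sum of squared DC-removed samples of frame i, and that D₁ and D₂ are backward differences of D₀ and D₁; all function names are hypothetical.

```python
def frame_energies(frames, dc=0.0):
    """Assumed D0: sum over n of (x(i, n) - d)^2 for each frame i."""
    return [sum((x - dc) ** 2 for x in frame) for frame in frames]


def backward_difference(values):
    """values[i] - values[i - 1]; the first element is defined as 0.0."""
    return [0.0] + [values[i] - values[i - 1] for i in range(1, len(values))]


def energy_change(frames, dc=0.0):
    """Return (D0, D1, D2, means), where means holds the average of each series."""
    d0 = frame_energies(frames, dc)
    d1 = backward_difference(d0)
    d2 = backward_difference(d1)
    means = tuple(sum(v) / len(v) for v in (d0, d1, d2))
    return d0, d1, d2, means
```

Under these assumptions, frames with nearly constant energy (steady vowels, silence) produce D₁ and D₂ values close to zero, which is the property the later deletion section detection relies on.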

The spectral change calculation process (308) calculates spectral change information from the 10th-order LPC cepstrum values obtained for each frame.

To this end, the present invention newly proposes the following spectral difference measure.

Here, C(i,n) is the n-th element of the cepstrum vector of the i-th frame. N is the order of the cepstrum, which is 10. W[n] are the 10 cepstrum weights, with values from 0.7 to 1.3.

V₁ and V₂ are the spectral difference measures.
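The proposed measure itself is not reproduced in this text. As a hedged illustration of the general idea (a weighted comparison of cepstrum vectors across frames), the sketch below uses a weighted squared difference between frame i and frame i - lag. Treating lag 1 and lag 2 as stand-ins for V₁ and V₂ is an assumption, as are the function names.

```python
def weighted_cepstral_difference(c_a, c_b, weights):
    """Sum over n of W[n] * (C_a[n] - C_b[n])^2 (assumed form, for illustration)."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, c_a, c_b))


def spectral_change(cepstra, weights, lag=1):
    """Per-frame change between frame i and frame i - lag.

    The first `lag` frames have no predecessor and are assigned 0.0.
    """
    out = [0.0] * len(cepstra)
    for i in range(lag, len(cepstra)):
        out[i] = weighted_cepstral_difference(cepstra[i], cepstra[i - lag], weights)
    return out
```

In a setup matching the description, each cepstrum vector would have 10 elements and `weights` would hold 10 values between 0.7 and 1.3.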

The noise section compression process (310) adjusts the lengths of the silent sections between speech sections and of the noise sections before and after the speech section, using the cepstrum vectors obtained in the feature information calculation process (304), the energy difference measures D₁ and D₂ calculated in the energy change calculation process (306), and a noise codebook built before the actual speech input arrives.

This relies on the fact that the average energy and the energy change are small in noise sections. Because the noise codebook is generated at almost the same time as the speech input pattern currently being processed, noise sections can be detected efficiently.

The deletion section detection process (312) uses the spectral difference measures V₁, V₂ and the energy difference measures D₁, D₂ to find the parts of the speech section where the spectral change and the energy change are small, and detects the parts whose speech features are to be removed using the following equation.

Here, Vth1, Vth2, and Vth3 are thresholds on the spectral change, and the deletion section is determined by the pattern comparison function f(Vth1, Vth2, Vth3, V₁, V₂).

The fricative sound codebook is prepared in advance to improve discrimination from noise. As shown in FIG. 4, this series of deletion section detection steps detects the hatched rectangular parts to be removed from the deletion candidate sections, that is, the vowel or silent sections in which the spectral change is small.
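The pattern comparison function f and the codebook lookups are not specified in this text. The sketch below illustrates only the first step, finding deletion candidate sections as runs of frames whose spectral change and energy change both stay under a threshold. The single thresholds `v_th` and `d_th`, the minimum run length `min_len`, and the function name are all hypothetical simplifications of the three-threshold scheme.

```python
def low_change_runs(v1, d1, v_th, d_th, min_len=3):
    """Return (start, end) pairs, end exclusive, of runs of low-change frames.

    A frame counts as low-change when v1[i] < v_th and |d1[i]| < d_th.
    Runs shorter than min_len frames are ignored.
    """
    n = min(len(v1), len(d1))
    runs, start = [], None
    for i in range(n):
        quiet = v1[i] < v_th and abs(d1[i]) < d_th
        if quiet and start is None:
            start = i
        elif not quiet and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and n - start >= min_len:
        runs.append((start, n))
    return runs
```

Which part of each candidate run is actually removed (the hatched part in FIG. 4) would depend on the reference pattern length and on the comparison function, neither of which this sketch models.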

The pattern update process (314) creates a new input pattern by a weighted average of the spectral vectors of the deletion section and the neighboring input vectors. The result is passed to the ordinary speech recognition process.
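The weights and the exact neighborhood used in the pattern update are not given here. As a rough sketch of shortening a section by weighted averaging, the code below merges adjacent frame pairs inside a section, roughly halving its length. The pairing scheme, the `weight` parameter, and the function name `merge_section` are assumptions made for illustration.

```python
def merge_section(pattern, start, end, weight=0.5):
    """Shorten frames [start, end) by averaging adjacent pairs of vectors.

    Each pair (a, b) becomes weight * a + (1 - weight) * b.
    An unpaired last frame of the section is kept unchanged.
    """
    merged = []
    i = start
    while i + 1 < end:
        a, b = pattern[i], pattern[i + 1]
        merged.append([weight * x + (1.0 - weight) * y for x, y in zip(a, b)])
        i += 2
    if i < end:
        merged.append(list(pattern[i]))
    # Frames outside the section are passed through untouched.
    return pattern[:start] + merged + pattern[end:]
```

Averaging rather than simply dropping frames keeps the shortened section spectrally continuous with its neighbors, which fits the stated goal of correcting length without introducing new discontinuities.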

Table 3 shows the per-word misrecognition rate caused by differences in utterance length when the speech pattern time normalization method according to the present invention is applied to a speaker-dependent isolated-word recognition system. The figures in Table 3 were measured on commands that control the windows of a personal computer.

Table 4 compares the characteristics as the utterance length varies when the speech pattern time normalization method according to the present invention is applied to a speech recognition system that additionally uses two sets of auxiliary reference patterns whose lengths differ from the reference pattern by 50%.

As described above, the speech recognition method according to the present invention improves the recognition rate of a speech recognition system by correcting the length variations and discontinuities caused by differences in speaking rate and an unnatural speaking manner.

FIG. 1 is a flowchart of a conventional speech recognition method.

FIG. 2 is a flowchart of the speech recognition method proposed by the present invention.

FIG. 3 is a flowchart showing the speech pattern time normalization process of FIG. 2 in more detail.

FIG. 4 is a diagram schematically showing the sections to be removed within a speech section.

Claims

A speech recognition method that obtains the distance between the feature information of each reference pattern and the feature information of an input pattern and provides the reference pattern with the smallest distance as the recognition result, the method comprising:

a time normalization process of removing the time warping of the input speech pattern and converting it into a speech pattern with a length similar to the reference pattern;

a process of extracting feature information from the speech pattern normalized by the time normalization process; and

a recognition process of computing the distance to each reference pattern using the feature information of the speech pattern and providing the reference pattern with the smallest distance as the recognition result.

A speech pattern time normalization method that removes the time warping of an input speech pattern and converts it into a speech pattern with a length similar to the reference pattern, the method comprising:

a feature pattern extraction process of calculating feature information of the speech pattern;

an energy change calculation process of calculating energy change information based on the feature information obtained in the feature pattern extraction process;

a spectral change calculation process of calculating spectral change information based on the feature information obtained in the feature pattern extraction process;

a deletion section detection process of detecting, within the speech section, the parts with small spectral change and the parts with small energy change, based on the energy change information calculated in the energy change calculation process and the spectral change information calculated in the spectral change calculation process; and

a pattern update process of generating a new input pattern by a weighted average of the spectral vectors of the deletion section and the neighboring input vectors.

The method of claim 2,

further comprising a noise section compression process that adjusts the lengths of the silent sections between speech sections and of the noise sections before and after the speech section, based on the feature information obtained in the feature pattern extraction process, the energy change information calculated in the energy change calculation process, and a noise codebook, and provides the result to the deletion section detection process.

The method of claim 2, wherein the energy change calculation process

obtains the per-frame energy D₀, the first-order difference energy D₁, the second-order difference energy D₂, and the average of each by the following equations,

where i is the frame number, 0 ≤ i < I,

n is the sample index within a frame, 0 ≤ n < N, where N is the frame size,

x(i,n) is the n-th sample of the i-th frame, and

d is the direct-current (DC) value.

The method of claim 2, wherein the spectral change calculation process calculates the spectral change information S₁, S₂ and the spectral difference measures V₁, V₂ by the following equations,

where C(i,n) is the n-th element of the cepstrum vector of the i-th frame,

N is the order of the cepstrum, and

W[n] is the cepstrum weight, with a value from 0.7 to 1.3.