KR100293465B1

KR100293465B1 - Speech recognition method

Info

Publication number: KR100293465B1
Application number: KR1019980015109A
Authority: KR
Inventors: 이윤근; 김기백; 이병수; 이종석
Original assignee: 구자홍; 엘지전자 주식회사
Priority date: 1998-04-28
Filing date: 1998-04-28
Publication date: 2001-07-12
Also published as: KR19990081262A

Abstract

PURPOSE: A speech recognition method is provided in which oblique lines with a slope of 1 are drawn from the start grid point and the end grid point and shifted to both sides by a predetermined value to set a window, so that only the grid points between the two shifted lines are used for matching and the window is easy to set. CONSTITUTION: A speech recognition system generates a two-dimensional rectangular coordinate system having M × N grid points in order to match two sequences. The system detects the start grid point (1, 1) and the end grid point (M, N) of the coordinate system, draws an oblique line with a slope of 1 from each of them, and shifts the lines by a predetermined value to define a search section for matching. The system calculates the distance between the two features at each grid point of each row within the search section and selects the path with the minimum distance as the optimal path. The system repeats this selection for all rows within the search section. The system then divides the minimum cumulative distance at the end grid point (M, N) by the sum of the lengths of the two sequences to obtain the final matching score.

Description

Speech Recognition Method {SPEECH RECOGNITION METHOD}

The present invention relates to a speech recognition method that recognizes a speech signal by comparing a test pattern of an input speech signal with reference patterns of speech signals stored in a database.

The simplest speech recognition technique is speaker-dependent isolated word recognition. It recognizes only the voice of the person who trained it, and only speech uttered in units of words (or short sentences). Many speech recognition algorithms for this purpose are already known; they can be broadly divided into speech section detection, feature extraction, and matching.

In speaker-dependent isolated word recognition, time-axis normalization of the test pattern and the reference pattern greatly affects the speech recognition rate.

In general, the time spans of the test pattern and the reference pattern do not coincide, so the time scale is determined by nonlinear time warping. The time scale is determined with the Dynamic Time Warping (DTW) technique, with which word boundary detection, nonlinear time alignment, and recognition are performed simultaneously. Recognition errors caused by errors in word boundary detection and time alignment therefore do not arise. That is, even when the same person utters the same word, the utterance stretches and contracts nonlinearly along the time axis because the speaking rate differs each time. DTW is a time-axis normalization method that removes this variation by comparing the test pattern with a reference pattern and determining the similarity between the two patterns.

Various DTW methods have been proposed. One of them measures the spectral distance between the test pattern and each reference pattern stored in a database and selects the reference pattern with the smallest spectral distance to the test pattern as the recognized pattern. That is, the digital signal of each speech pattern is divided into frames, linear prediction coefficients (LPC) are calculated for each frame and converted into cepstrum coefficients, which serve as the feature vector of the test pattern, and the result is compared with each reference pattern stored in advance to measure the spectral distance. However, this DTW method responds slowly and, because it uses a floating-point format, requires a large storage capacity.

To solve this problem, another method sets a window to limit the search range for the optimal path and converts the analyzed speech pattern into a feature set in integer format. The feature set of the test pattern in integer format is compared with the reference patterns stored in the database to measure the spectral distance between the test pattern and each reference pattern, and the reference pattern with the smallest spectral distance to the test pattern is selected as the recognized pattern. This DTW method needs less storage and responds faster than the floating-point method. However, it connects the start point and the end point and sets a window whose width (swath) around the connecting line is an odd integer. When the test pattern and the reference pattern have different numbers of frames, the slope is therefore difficult to calculate, and it is difficult to check which grid points fall inside the window. In addition, division is difficult when converting to integer format, and because a divider must be used for the division, the logic becomes complex and the amount of computation increases.

The present invention has been made to solve the above problems. An object of the present invention is to provide a speech recognition method that needs neither a slope calculation nor a divider, by drawing oblique lines with a slope of 1 from the start grid point and the end grid point and setting the window by shifting these lines to both sides by a predetermined value, namely the number of frames divided by an integer multiple of 2.

FIG. 1 is a block diagram of a configuration for performing the speech recognition method according to the present invention.

FIG. 2 is a coordinate diagram showing how the search section is set according to the present invention.

FIG. 3 is a DTW coordinate diagram according to the present invention.

Explanation of reference numerals for the main parts of the drawings

11: microphone    12: A/D converter

13: speech section detector    14: feature extractor

15: reference data memory    16: DTW unit

To achieve the above object, the speech recognition method compares a test pattern of an input speech with the feature set of each reference pattern stored in advance in a database and outputs the reference pattern whose features are most similar to the test pattern as the recognized speech, and comprises: a first step of generating a two-dimensional rectangular coordinate system having M × N grid points (M and N being the numbers of frames of the test pattern and the reference pattern, respectively) in order to match the two sequences, namely the features of the test pattern and the feature set of the reference pattern; a second step of detecting the start grid point and the end grid point of the coordinate system, drawing an oblique line with a slope of 1 from each of them, and shifting the lines to both sides by N (the number of frames) divided by an integer multiple of 2 to determine a search section for matching; a third step of calculating the distance between the two features at each grid point of each row within the search section and selecting the path with the minimum distance between the two features as the optimal path; a fourth step of repeating the third step for all rows of the search section; and a fifth step of, once the fourth step has been performed, calculating a final matching score by dividing the minimum cumulative distance at the end grid point by the sum of the two sequence lengths.

When the minimum cumulative distance at a grid point of a row within the search section exceeds the integer range, it is replaced with the maximum integer value.

Each grid point (m, n) of each row within the search section holds the minimum cumulative distance up to the m-th feature and the n-th feature of the two sequences, that is, of the test pattern and the reference pattern.

When this DTW method is used for speech recognition, the window is easy to set because no slope calculation is needed, and the amount of computation is reduced, so the response is faster.

Other objects, features, and advantages of the present invention will become apparent from the detailed description of the embodiments with reference to the accompanying drawings.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a speech recognition system for performing the speech recognition method. The system consists of a microphone (11), an analog/digital (A/D) converter (12), a speech section detector (13), a feature extractor (14), a reference data memory (15), and a DTW unit (16).

In the present invention configured as above, reference patterns of speech signals are stored first. When a speech signal is input through the microphone (11), the A/D converter (12) converts it into a digital signal and outputs it to the speech section detector (13). The speech section detector (13) divides the digital speech signal into short segments (frames) and then uses the energy, the zero crossing rate, and the duration information of each frame to detect the start point and the end point of the actually uttered speech section in the input signal. The feature extractor (14) then extracts the spectral coefficients of the frames in the speech section as features to form a reference pattern and stores the reference pattern in the reference data memory (15). This operation is repeated for every speech signal to be recognized, so that the reference patterns form a database in the reference data memory (15).
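The speech section detection described above can be sketched as follows. This is a minimal illustration, not the detector of the patent: the function names, the non-overlapping framing, the thresholds `energy_thr` and `zcr_thr`, and the way the duration check `min_frames` is applied are all assumptions, since the text only states that frame energy, zero crossing rate, and duration information are used.

```python
def split_frames(samples, frame_len):
    """Split samples into non-overlapping frames, dropping a trailing partial frame."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def frame_energy(frame):
    """Sum of squared sample values in one frame."""
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (frame_len >= 2)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def detect_speech(samples, frame_len, energy_thr, zcr_thr, min_frames=1):
    """Return (first, last) active frame indices, or None if no section is found.

    A frame counts as active when its energy or its zero crossing rate
    exceeds the (hypothetical) thresholds; sections shorter than
    min_frames are rejected as a simple duration check.
    """
    frames = split_frames(samples, frame_len)
    active = [i for i, f in enumerate(frames)
              if frame_energy(f) > energy_thr or zero_crossing_rate(f) > zcr_thr]
    if not active or active[-1] - active[0] + 1 < min_frames:
        return None
    return active[0], active[-1]
```

For example, a signal with two silent frames, two loud alternating frames, and two more silent frames yields the frame range of the loud part.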

Likewise, when a spoken speech signal is to be recognized, the speech input through the microphone (11) passes through the A/D converter (12), the speech section detector (13), and the feature extractor (14): the start and end points of the actually uttered speech section are detected in the input signal, and the spectral coefficients of the frames in the detected section are extracted as features to form the test pattern of the input speech. The DTW unit (16) then compares the test pattern with each of the one or more reference patterns stored in the reference data memory (15) and outputs the reference pattern whose features are most similar to the test pattern as the recognized speech.
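The selection of the most similar reference pattern can be sketched as a simple minimum search. The function `recognize`, the dictionary of labelled references, and the pluggable `score` callable are hypothetical names introduced for illustration; in the method described here the score would be the DTW matching score, where a smaller value means a closer match.

```python
def recognize(test_pattern, references, score):
    """Return (label, score) of the reference pattern closest to test_pattern.

    references: dict mapping a label to a stored reference pattern.
    score: callable (reference, test) -> number, smaller means more similar.
    """
    best_label, best_score = None, None
    for label, reference in references.items():
        s = score(reference, test_pattern)
        if best_score is None or s < best_score:
            best_label, best_score = label, s
    return best_label, best_score
```

Passing the scoring function as a parameter keeps the search independent of how the distance is computed.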

That is, the DTW unit (16) first denotes the lengths of the two sequences to be matched as M and N, as shown in FIG. 2. If the larger of M and N is more than twice the smaller, the two patterns are very unlikely to be the same word, so the recognition process is skipped and the matching distortion is set to a predetermined maximum value. This excludes such a registered word when the similarity of candidate words is judged.
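The length check can be written without any division. The sketch below is illustrative: `prescreen` and `MAX_DISTORTION` are hypothetical names, and the value chosen for the predetermined maximum (the largest 16-bit signed integer) is an assumption, since the text does not give it.

```python
MAX_DISTORTION = 2**15 - 1  # assumed value for the "predetermined maximum"


def prescreen(M, N):
    """Return MAX_DISTORTION when one length exceeds twice the other, else None.

    None means the pair passes the check and DTW matching should proceed.
    """
    longer, shorter = (M, N) if M >= N else (N, M)
    if longer > (shorter << 1):  # shorter * 2 via a left shift
        return MAX_DISTORTION
    return None
```

A pair of lengths 10 and 5 passes, because 10 is exactly twice 5 and not more; lengths 11 and 5 are rejected.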

Then, to match the two sequences, a two-dimensional rectangular coordinate system with M × N grid points is created. If the frame counts of the stored reference pattern and the test pattern are compared and the pattern with more frames is placed on the M axis, the calculation proceeds smoothly.

Here, the speech patterns R and T are sequences of feature vectors obtained by feature extraction and can be expressed as follows.

$R = [R_1, R_2, \ldots, R_m, \ldots, R_M]$

$T = [T_1, T_2, \ldots, T_n, \ldots, T_N]$

The patterns R and T run along the m axis and the n axis, respectively, and the correspondence between the feature vectors of the two speech patterns is expressed as a series of points C(k), as shown in FIG. 3.

$F = C(1), C(2), \ldots, C(k), \ldots, C(K)$

Here, $C(k) = (m(k), n(k))$, and F is called the warping function; it maps the time axis of the test pattern onto the time axis of the reference pattern.

The DTW unit (16) uses the warping function to find the optimal path m = W(n) with the minimum distance.

To reduce unnecessary computation, a window that limits the search range for the optimal path is set. Utterances by the same person do not vary greatly, so the window can restrict the section in which the optimal path is searched.

In the present invention, the window is determined as follows so that it is easy to set without a slope calculation and so that the reduced amount of computation gives a faster response.

First, oblique lines with a slope of 1 are drawn from the start grid point (1, 1) and the end grid point (M, N). When these lines are shifted to both sides by a predetermined value ($N/2^n$, where N is the number of frames and n is a natural number; n = 2 is most suitable), the grid points between the two shifted lines form the section that is searched for matching. Because the window width is set to $N/2^n$, dividing N by an integer multiple of 2 needs only a shifter rather than a complex divider, which is simple and efficient. N may be the number of frames of either the test pattern or the reference pattern.
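The window test can be sketched with integer additions, comparisons, and one right shift. Several points here are assumptions rather than statements of the patent: the shift amount is taken to be the frame count right-shifted by a small exponent (for the preferred value 2 this is a quarter of the frame count), the line through (1, 1) is read as n - m = 0 and the line through (M, N) as n - m = N - M, and each line is moved outward so that the band between them widens on both sides. The names `window_bounds` and `in_window` are hypothetical.

```python
def window_bounds(M, N, shift=2, frames=None):
    """Return (lo, hi) such that grid point (m, n) is searched iff lo <= n - m <= hi.

    frames is the frame count used for the offset; the text allows either
    pattern's frame count, and N is used here by default.
    """
    if frames is None:
        frames = N
    offset = frames >> shift          # frame count divided by 2**shift, no divider
    lo = min(0, N - M) - offset       # lower slope-1 boundary, moved outward
    hi = max(0, N - M) + offset       # upper slope-1 boundary, moved outward
    return lo, hi


def in_window(m, n, M, N, shift=2):
    """True if 1-based grid point (m, n) lies on the grid and inside the window."""
    lo, hi = window_bounds(M, N, shift)
    return 1 <= m <= M and 1 <= n <= N and lo <= n - m <= hi
```

Because both boundaries have a slope of 1, membership depends only on the difference n - m, so no slope has to be computed and the boundaries always pass through grid points.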

Within the searched window, grid point (m, n) holds the minimum cumulative distance up to the m-th feature and the n-th feature of the two sequences. The feature values are scaled to integers between 0 and 5000.

To avoid excessive compression or expansion within a particular time span, local path constraints are imposed; FIG. 3 shows one embodiment of how the local path is determined. In FIG. 3, the warping function advances one step at a time from one of three possible directions. That is, grid point (m, n) can be reached in three ways: directly from grid point (m-1, n-1), or indirectly from grid point (m-1, n) or from grid point (m, n-1).

Each arrival grid point (m, n) is associated with a weight $W_{mn}$, such as the Euclidean or cepstral distance between the m-th frame of pattern R and the n-th frame of pattern T. The weight $W_{mn}$ is applied to each indirect path and the weight $2W_{mn}$ to the direct path, so the distance between the two features at grid point (m, n) is defined as $d_{m,n}$ in Equation 1 below.

[Equation 1]

$D_{m,n} = \min\{\, D_{m-1,n-1} + 2d_{m,n},\; D_{m-1,n} + d_{m,n},\; D_{m,n-1} + d_{m,n} \,\}$

$d_{m,n} = \sum_{i=1}^{P} \left| a^{i}_{1,m} - a^{i}_{2,n} \right|$

Here, $D_{m,n}$ is the minimum cumulative distance at grid point (m, n), $a^{i}_{1,m}$ is the i-th value of the m-th feature of the first sequence, $a^{i}_{2,n}$ is the i-th value of the n-th feature of the second sequence, and P is the order of the feature.

That is, the distance between two features is obtained by summing the differences between the values of corresponding orders of the two features. The minimum cumulative distance at grid point (m, n) is calculated as in Equation 1, and when the result exceeds the integer range it is replaced with the maximum integer value.
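One update of the cumulative distance can be sketched as below. The names `frame_distance`, `accumulate`, and `INT_MAX` are hypothetical, the absolute difference is assumed as the per-order difference, and a 16-bit signed range is assumed for the integer format, which the text does not specify.

```python
INT_MAX = 2**15 - 1  # assumed integer range (16-bit signed)


def frame_distance(a, b):
    """Sum of absolute differences over the orders of two integer feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def accumulate(d_diag, d_left, d_below, d):
    """Minimum cumulative distance at (m, n) from its three predecessors.

    d_diag is the value at (m-1, n-1) (direct path, weight 2d);
    d_left at (m-1, n) and d_below at (m, n-1) (indirect paths, weight d).
    Results beyond the assumed integer range are replaced by INT_MAX.
    """
    best = min(d_diag + 2 * d, d_left + d, d_below + d)
    return min(best, INT_MAX)
```

Python integers do not overflow, so clamping after the addition is enough here; on fixed-width hardware the range check would have to be made before or during the addition.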

Then, starting from the bottom row and moving upward one row at a time, the minimum cumulative distance is obtained for the grid points inside the search range. Because the minimum cumulative distances of the row immediately below are needed to obtain those of the current row, they are stored.

The final matching score is the minimum cumulative distance at grid point (M, N) divided by the sum of the two sequence lengths (M + N).
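Putting the steps together, the row-by-row search and the final score can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the name `dtw_score`, the 16-bit `INT_MAX`, the band reading lo <= n - m <= hi with an offset of N right-shifted by `shift`, and the initial value 2d at the start grid point (1, 1) are all assumptions. Only the previous row is kept, as the text describes.

```python
INT_MAX = 2**15 - 1  # assumed integer range (16-bit signed)


def frame_distance(a, b):
    """Sum of absolute differences between two integer feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dtw_score(R, T, shift=2):
    """Windowed DTW matching score between feature sequences R (m axis) and T (n axis)."""
    M, N = len(R), len(T)
    offset = N >> shift                  # window offset obtained with a shift
    lo = min(0, N - M) - offset          # searched points satisfy lo <= n - m <= hi
    hi = max(0, N - M) + offset
    prev = [INT_MAX] * (M + 1)           # row n-1; index 0 is padding
    for n in range(1, N + 1):            # bottom row first, then upward
        cur = [INT_MAX] * (M + 1)
        for m in range(max(1, n - hi), min(M, n - lo) + 1):
            d = frame_distance(R[m - 1], T[n - 1])
            if m == 1 and n == 1:
                cur[m] = min(2 * d, INT_MAX)   # assumed start value
                continue
            best = min(prev[m - 1] + 2 * d,    # direct move from (m-1, n-1)
                       cur[m - 1] + d,         # indirect move from (m-1, n)
                       prev[m] + d)            # indirect move from (m, n-1)
            cur[m] = min(best, INT_MAX)        # saturate at the integer maximum
        prev = cur                             # keep only the row just computed
    return prev[M] / (M + N)                   # normalize by the two lengths
```

Identical sequences score 0, and a sequence matched against a time-stretched copy of itself also scores 0 as long as the warping stays inside the window.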

As described above, in the speech recognition method according to the present invention, oblique lines with a slope of 1 are drawn from the start grid point and the end grid point and shifted to both sides by a predetermined value ($N/2^n$, where N is the number of frames and n is a natural number) to set the window that serves as the search section, and only the grid points between the two lines are used for matching. The window is therefore easy to set without a complicated slope calculation, and because the window boundary lines run from grid point to grid point, checking grid points is also easy. In addition, no complex divider is needed, so the amount of computation is reduced and the response is faster.

Claims

1. A speech recognition method that compares a test pattern of an input speech with the feature set of each reference pattern stored in advance in a database and outputs the reference pattern whose features are most similar to the test pattern as the recognized speech, the method comprising:

a first step of generating a two-dimensional rectangular coordinate system having M × N grid points (M and N being the numbers of frames of the test pattern and the reference pattern, respectively) in order to match the two sequences, namely the features of the test pattern and the feature set of the reference pattern;

a second step of detecting the start grid point (1, 1) and the end grid point (M, N) of the coordinate system, drawing an oblique line with a slope of 1 from each of the detected grid points, and shifting the oblique lines to both sides by a predetermined value to determine a search section for matching;

a third step of calculating the distance between the two features at each grid point of each row in the search section and selecting the path with the minimum distance between the two features as the optimal path;

a fourth step of repeating the third step for all rows of the search section; and

a fifth step of, once the fourth step has been performed, calculating a final matching score by dividing the minimum cumulative distance at the end grid point (M, N) by the sum (M + N) of the two sequence lengths.

2. The speech recognition method of claim 1, wherein the distance between the two features at each grid point is obtained by summing the differences between the values of corresponding orders of the two features, and the minimum cumulative distance is defined by the following equation:

$D_{m,n} = \min\{\, D_{m-1,n-1} + 2d_{m,n},\; D_{m-1,n} + d_{m,n},\; D_{m,n-1} + d_{m,n} \,\}$

where $D_{m,n}$ is the minimum cumulative distance at grid point (m, n),

$d_{m,n}$ is the distance between the two features at grid point (m, n),

$d_{m,n} = \sum_{i=1}^{P} \left| a^{i}_{1,m} - a^{i}_{2,n} \right|$,

$a^{i}_{1,m}$: the i-th value of the m-th feature of the first sequence,

$a^{i}_{2,n}$: the i-th value of the n-th feature of the second sequence,

P: the order of the feature.

3. The speech recognition method of claim 2, wherein the minimum cumulative distance at each grid point (m, n) is replaced with the maximum integer value when it exceeds the integer range.

4. The speech recognition method of claim 2, wherein each grid point (m, n) of each row in the search section holds the minimum cumulative distance up to the m-th feature and the n-th feature of the two sequences, that is, of the test pattern and the reference pattern.

5. The speech recognition method of claim 4, wherein each grid point (m, n) of each row in the search section repeatedly generates a new path value as a function of at least one of the distance value for moving directly from the previous grid point (m-1, n-1) and the distance values for moving indirectly from the two neighboring grid points (m-1, n) and (m, n-1).

6. The speech recognition method of claim 5, wherein the minimum cumulative distances of the row immediately below are stored in order to obtain the minimum cumulative distances of the current row.

7. The predetermined value by which the oblique lines are shifted to both sides in the second step is $N/2^n$ (where N is the number of frames and n is a natural number).