KR0176882B1

KR0176882B1 - Search method of connected speech recognition

Info

Publication number: KR0176882B1
Application number: KR1019960000108A
Authority: KR
Inventors: 김민성
Original assignee: 구자홍; 엘지전자주식회사
Priority date: 1996-01-05
Filing date: 1996-01-05
Publication date: 1999-04-01
Also published as: KR970060043A

Abstract

The present invention relates to a search method for connected speech recognition. The conventional search method traces every sentence path that the grammar can generate, selects the path with the maximum similarity, and recognizes the input speech as the word sequence compared along that path, so the search takes a long time. To solve this problem, the invention provides a search method consisting of three steps. In the first step, when T frames of speech are input, features are extracted, an approximate word sequence is found, and an average similarity (Psc) is obtained by running the recognition process on that approximate word sequence. In the second step, the Viterbi recognition process is run from the first frame up to a predetermined frame (S), the matching similarity of each path searched so far is compared with the average similarity (Psc), and paths whose similarity is lower than Psc are removed. In the third step, once repetition of the second step has carried the Viterbi recognition process through the last frame of the input speech, the path with the maximum similarity among the paths that were not removed is found, and the word sequence compared along that path is recovered by word-sequence backtracking and output as the recognized sentence. In effect, a word sequence approximating the input is selected by summing only the quantization errors of the input, and paths that match with lower similarity than this approximate word sequence are removed during the search, which reduces recognition time.

Description

Search method of connected speech recognition

FIG. 1 is a structural diagram of a conventional connected speech recognition grammar.

FIG. 2 is a diagram showing a conventional connected speech word matching process.

FIG. 3 is a schematic flowchart explaining the search operation of connected speech recognition according to the present invention.

FIG. 4 is a structural diagram explaining the path removal operation in the present invention.

FIG. 5 is an operation flowchart showing the recognition process in the present invention.

FIG. 6 is an operation flowchart explaining the process of recognizing an approximate word sequence in the present invention.

FIG. 7 is an explanatory diagram of the word matching process in the present invention.

The present invention relates to a search method for connected speech recognition, and in particular to one that reduces the number of paths to be searched: when speech is input, an approximate word sequence is found first, and paths with lower similarity than this word sequence are removed during the search.

As shown in FIG. 1, which depicts the grammar structure of conventional connected speech recognition, the conventional search method recognizes speech using a grammatical structure with a finite number of states.

For example, the set of words recognizable on the path from state 0 to state 1 is limited to W01, and the set of words recognizable on the path from state 1 to state 2 is W12.

In this way, the word set usable on every transition between states is determined in advance. A sentence permitted by this grammar is therefore one formed from the word sequence allowed on the states and state transitions leading from the start state to the end state.
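The grammar described above can be sketched as a table of transitions, each carrying its allowed word set. This is only an illustration: the words in `W01` and `W12`, the two-transition layout, and the function name `sentences` are assumptions made for the example, and the grammar is assumed to be acyclic.

```python
# Hypothetical word sets for the two transitions named in the text.
W01 = ["call", "dial"]      # allowed on the transition state 0 -> state 1
W12 = ["home", "office"]    # allowed on the transition state 1 -> state 2

# (from_state, to_state) -> allowed word set
GRAMMAR = {(0, 1): W01, (1, 2): W12}
START, END = 0, 2


def sentences(state=START):
    """Yield every word sequence on a path from `state` to END (acyclic grammar assumed)."""
    if state == END:
        yield []
        return
    for (src, dst), words in GRAMMAR.items():
        if src == state:
            for word in words:
                for rest in sentences(dst):
                    yield [word] + rest


all_sentences = list(sentences())
print(len(all_sentences))  # number of sentence paths an exhaustive search must match
```

The number of sentence paths is the product of the word-set sizes along each route, which is why matching every path becomes expensive as the grammar grows.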

The recognition process assumes that the input speech is a sentence generated by this grammar, and recognizes it by tracing and matching every sentence path the grammar can generate.

The connected speech word matching process is described below with reference to FIG. 2.

At the start of the sentence, words W1 through Wr are compared; then, as the word following W1, words Wn through W1 are compared.

All sentence paths are searched in this way, the path with the maximum similarity is selected, and the input speech is recognized as the word sequence compared along that path.

The Viterbi algorithm is one method of comparing word models against the acoustic features of the input speech along paths such as those in FIG. 2.

As described above, the conventional search method for connected speech recognition traces every sentence path the grammar can generate, selects the path with the maximum similarity, and recognizes the input speech as the word sequence compared along that path. The search therefore takes a long time.

An object of the present invention is to solve this problem by providing a search method for connected speech recognition that reduces the number of search paths: when speech is input, an approximate word sequence is found first, and paths with lower similarity than this word sequence are removed during the search.

To achieve this object, the search method for connected speech recognition consists of: a first step in which, when T frames of speech are input, features are extracted, an approximate word sequence is found, and an average similarity (Psc) is obtained by running the recognition process on that approximate word sequence; a second step in which the Viterbi recognition process is run from the first frame up to a predetermined frame (S), the matching similarity of each path searched so far is compared with the average similarity (Psc), and paths whose similarity is lower than Psc are removed; and a third step in which, once repetition of the second step has carried the Viterbi recognition process through the last frame of the input speech, the path with the maximum similarity among the paths that were not removed is found, and the word sequence compared along that path is recovered by word-sequence backtracking and output as the recognized sentence.

The operation and effects of the present invention are described in detail below with reference to the accompanying FIGS. 3 to 7.

FIG. 3 is a schematic flowchart explaining the operation of the present invention. As shown there, when speech is input, an approximate word sequence is selected first and the average similarity (Psc) of the selected approximate connected words is obtained. Then, during the sentence search, any path whose matching similarity is lower than the average similarity (Psc) is removed.

In more detail, unlike the conventional method, which searches all possible sentences when input speech arrives, a sentence that approximately matches the input speech is selected in advance, and the average similarity (Psc) for that sentence is obtained through the connected word recognizer.

If, during the sentence search, the matching similarity on a particular path is lower than the average similarity (Psc), matching along that path proceeds no further.

Therefore, if the approximate sentence (word sequence) is very similar to the input, the number of sentence paths that must be matched is reduced.

As shown in FIG. 4, if the average similarity obtained when the input speech is matched against Word1 is lower than the average similarity for the approximate sentence, then, as marked by the X, matching is not carried out against Wordn, Wordm, Wordl, and the other words that would be matched on this path, nor against any word sequence that follows them.

The connected speech recognition process is described in more detail below with reference to FIG. 5.

First, when speech is input, features are extracted and an approximate word sequence is found. The average similarity (Psc) is then obtained by running the recognition process on that approximate word sequence.

The average similarity (Psc) is obtained by dividing the similarity produced by the recognizer by the length of the input speech.
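A minimal sketch of this normalization follows. The function name `average_similarity`, the use of a log-likelihood-style score, and the numbers are assumptions for illustration; the text states only that the recognizer's similarity is divided by the input length.

```python
def average_similarity(total_similarity: float, num_frames: int) -> float:
    """Per-frame similarity: the recognizer's total score divided by the input length."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return total_similarity / num_frames


# Hypothetical example: the approximate word sequence scores -420.0 over T = 120 frames.
psc = average_similarity(-420.0, 120)
print(psc)
```

Dividing by the length puts the threshold on a per-frame scale, so it can be compared against a path that has been matched for only part of the input.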

For example, if the input consists of T frames, recognition runs the conventional Viterbi recognition process starting from the first frame.

However, the process is not run through to frame T, the end of the input. It is run only up to frame S, where S is less than T, and the matching similarity of each path searched so far is compared with the average similarity (Psc); paths whose similarity is lower than Psc are removed.

The Viterbi process is then run for another S frames and paths are removed again, and this is repeated.

In this way the entire sentence is searched, comparing through to the end of the input while removing low-similarity paths every S frames.

Among the paths that were not removed, the path with the maximum similarity is found, and the word sequence compared along that path is recovered by word-sequence backtracking and output as the recognized sentence.
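The pruning schedule described above can be sketched as follows. This is a simplified illustration, not the patented implementation: the Viterbi recursion is replaced by precomputed per-frame similarity scores for each sentence path, each path's running score is normalized by the number of frames processed before it is compared with Psc, and no pruning is applied at the final frame. The name `pruned_search` and all data are hypothetical.

```python
def pruned_search(frame_scores, psc, block):
    """Advance all paths `block` frames at a time, dropping paths below `psc`.

    frame_scores: dict mapping a path (tuple of words) to its per-frame
                  similarity scores; all lists have the same length T.
    psc:          per-frame average similarity of the approximate word sequence.
    block:        number of frames S between pruning passes.
    Returns the surviving path with the highest total score, or None.
    """
    total_frames = len(next(iter(frame_scores.values())))
    totals = {path: 0.0 for path in frame_scores}
    for start in range(0, total_frames, block):
        end = min(start + block, total_frames)
        # Stand-in for running the Viterbi process over frames start..end-1.
        for path in totals:
            totals[path] += sum(frame_scores[path][start:end])
        # Remove paths whose average similarity so far is below Psc.
        if end < total_frames:
            totals = {p: s for p, s in totals.items() if s / end >= psc}
    if not totals:
        return None
    return max(totals, key=totals.get)


scores = {
    ("call", "home"):   [-1.0, -1.0, -1.0, -1.0, -1.0, -1.0],
    ("call", "office"): [-1.0, -1.0, -4.0, -4.0, -4.0, -4.0],
    ("dial", "home"):   [-5.0, -5.0, -1.0, -1.0, -1.0, -1.0],
}
best = pruned_search(scores, psc=-2.0, block=2)
print(best)
```

In this example ("dial", "home") is dropped after the first two frames and ("call", "office") after four, so only one path is matched through to the end. A path that starts poorly is removed even if its later frames would have scored well, which is the trade-off that comes with pruning early.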

Within the above operation, the approximate word sequence recognition process is described below with reference to FIG. 6.

First, a vector quantizer is obtained from training speech through a quantizer design process, and the N most frequently occurring codewords are obtained for each word.

The frequently occurring codewords are found by quantizing the training speech for each word, sorting the codewords by occurrence count in descending order, and taking the N that occur most often.

The minimum and maximum length of the speech for each word are also stored, taken from the same training speech.
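The per-word training statistics described above could be collected as in the sketch below. It assumes the training speech has already been quantized, so each utterance is a list of codeword indices with one index per frame; the function name `word_statistics`, the dictionary layout, and the sample data are hypothetical.

```python
from collections import Counter


def word_statistics(training, n):
    """Collect the N most frequent codewords and the min/max length for each word.

    training: dict mapping a word to a list of utterances, where each
              utterance is a list of codeword indices (one per frame).
    """
    stats = {}
    for word, utterances in training.items():
        counts = Counter(cw for utt in utterances for cw in utt)
        lengths = [len(utt) for utt in utterances]
        stats[word] = {
            # most_common sorts by occurrence count, largest first.
            "codewords": [cw for cw, _ in counts.most_common(n)],
            "min_len": min(lengths),
            "max_len": max(lengths),
        }
    return stats


training = {"home": [[3, 3, 7, 3], [3, 7, 7, 1, 3]]}
stats = word_statistics(training, n=2)
print(stats["home"])
```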

With this in place, when speech is input, features are first extracted and quantized. A word sequence that approximately matches the input speech is then selected from the quantized codewords of the input and the error incurred when quantizing with each codeword.

Word matching on the input speech proceeds as shown in FIG. 7. First, starting from the beginning of the input, over the interval from each word's minimum length to its maximum length, the quantization errors of the frequently occurring codewords for that word are summed and divided by the length, which gives the similarity for that word.

This is computed for every word, and the word with the minimum cumulative error is selected as the first word.

The process is then restarted from the end point of the first matched word to select the second word.

Repeating this process until the end of the input yields a word sequence that approximately matches the entire speech.
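The word-by-word selection above can be sketched as a greedy segmentation. Several details here are assumptions, since the text does not fix them: features are scalars for brevity, a frame's quantization error for a word is taken as its distance to the nearest of that word's frequent codewords, every length between the minimum and maximum is tried and the length with the smallest mean error is kept, and the loop stops when no word fits in the remaining frames. All names and data are hypothetical.

```python
def approximate_word_sequence(frames, codebook, stats):
    """Greedily pick, segment by segment, the word with the smallest mean quantization error.

    frames:   list of scalar feature values, one per frame.
    codebook: dict mapping codeword index -> scalar codeword value.
    stats:    dict mapping word -> {"codewords": [...], "min_len": int, "max_len": int}.
    """
    def frame_error(x, codewords):
        # Error of quantizing x with the nearest of the word's frequent codewords.
        return min(abs(x - codebook[cw]) for cw in codewords)

    sequence, start = [], 0
    while start < len(frames):
        best = None  # (mean_error, word, length)
        for word, s in stats.items():
            longest = min(s["max_len"], len(frames) - start)
            for length in range(s["min_len"], longest + 1):
                segment = frames[start:start + length]
                mean_err = sum(frame_error(x, s["codewords"]) for x in segment) / length
                if best is None or mean_err < best[0]:
                    best = (mean_err, word, length)
        if best is None:  # no word fits in the remaining frames
            break
        sequence.append(best[1])
        start += best[2]  # the next word starts where this one ended
    return sequence


codebook = {0: 0.0, 1: 1.0, 2: 5.0, 3: 6.0}
stats = {
    "call": {"codewords": [0, 1], "min_len": 2, "max_len": 3},
    "home": {"codewords": [2, 3], "min_len": 2, "max_len": 3},
}
frames = [0.1, 0.9, 0.0, 5.1, 5.9, 5.0]
approx = approximate_word_sequence(frames, codebook, stats)
print(approx)
```

Only sums of quantization errors are needed, with no model evaluation, which is consistent with the text's point that the approximate sequence is cheap to obtain before the full search begins.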

As described in detail above, the present invention selects a word sequence approximating the input by summing only the quantization errors of the input, and removes from the search any path that matches with lower similarity than this approximate word sequence, thereby reducing recognition time.

Claims

A search method for connected speech recognition, comprising: a first step of, when T frames of speech are input, extracting features, finding an approximate word sequence, and obtaining an average similarity (Psc) by running a recognition process on the approximate word sequence; a second step of running the Viterbi recognition process from the first frame up to a predetermined frame (S), comparing the matching similarity of each path searched so far with the average similarity (Psc), and removing paths whose similarity is lower than the average similarity (Psc); and a third step of, once repetition of the second step has carried the Viterbi recognition process through the last frame of the input speech, finding the path with the maximum similarity among the paths that were not removed, recovering the word sequence compared along that path by word-sequence backtracking, and outputting it as the recognized sentence.

The method of claim 1, wherein recognizing the approximate word sequence comprises: a first step of obtaining a vector quantizer from training speech through a quantizer design process, obtaining N frequently occurring codewords for each word, and storing the minimum and maximum speech length for each word; and a second step of, when speech is input in that state, first extracting and quantizing features, and then selecting a word sequence that approximately matches the input speech from the quantized codewords of the input and the error incurred when quantizing with each codeword.

The method of claim 1, wherein, when speech is input, word matching comprises: a first step of summing, over the interval from the minimum length to the maximum length of each word measured from the start point of the input, all the quantization errors of the codewords that frequently occur for that word, and dividing by the length to obtain the similarity for that word; and a second step of performing the first step for all words, selecting the word with the minimum cumulative error, and repeating this until the end of the input to obtain a word sequence that approximately matches the entire speech.