KR100450396B1

KR100450396B1 - Tree search based speech recognition method and speech recognition system thereby

Info

Publication number: KR100450396B1
Application number: KR10-2001-0065149A
Authority: KR
Inventors: 정호영
Original assignee: 한국전자통신연구원
Priority date: 2001-10-22
Filing date: 2001-10-22
Publication date: 2004-09-30
Also published as: KR20030033394A

Abstract

The present invention provides a tree-based speech recognition method, and a large-vocabulary continuous speech recognition system using it, that reduce the amount of computation in sentence-level continuous speech recognition by applying a language model preview (look-ahead) during tree search to remove low-probability roots in advance. The word determined for the speech input at time t is extracted; for each root, the language model probabilities of the words under that root are summed to obtain an expectation indicating how likely, linguistically, the root is to follow the word determined at time t; and each root's expectation is compared with a set threshold so that roots with a low expectation are excluded from the search. This improves recognition speed, and because the language model is applied with all the words under each root taken into account, the accompanying loss of recognition performance is reduced.

Description

Tree Search-based Speech Recognition Method and Large-Scale Continuous Speech Recognition System Using the Same {TREE SEARCH BASED SPEECH RECOGNITION METHOD AND SPEECH RECOGNITION SYSTEM THEREBY}

The present invention relates to a large-vocabulary speech recognition system, and more particularly to a tree-search-based speech recognition method that can reduce the amount of computation in sentence-level continuous speech recognition by applying a language model preview during tree-based search to remove low-probability roots in advance, and to a large-vocabulary continuous speech recognition system using the method.

Speech recognition generally refers to the process of converting an acoustic signal obtained through a microphone or telephone into a word, a set of words, or a sentence. Since its feasibility was first demonstrated in the 1950s, the technology has advanced rapidly and is emerging as an interface for communication between humans and next-generation computers. For example, replacing manual input with voice input can make input faster and more convenient.

Speech recognition research is usually divided into three areas. The first is isolated word recognition, which recognizes words pronounced clearly and separately. The second is connected word recognition, which recognizes connected words pronounced more naturally and so compensates for the shortcomings of isolated word recognition. The third is continuous speech recognition, which recognizes naturally pronounced continuous speech in sentence units.

FIG. 1 shows a conventional continuous speech recognition system, corresponding to the third of the three areas described above.

As shown, the conventional continuous speech recognition system comprises a voice input unit 10 that receives speech and converts it into an electrical signal; a feature extraction unit 20 that converts the speech signal from the voice input unit 10 into feature variables for speech recognition; a speech recognition unit 30 that uses pre-trained acoustic and language models to search over time for, and output, the word sequence that best matches the input speech; an acoustic model storage unit 40 that stores a statistical model of speech built from the speech of many speakers; a language model storage unit 50 that stores a language model built statistically from text in the recognition domain; and a recognition result output unit 60 that provides the recognition result using the output of the speech recognition unit 30.

In such a continuous speech recognition system, the speech recognition unit 30 applies one of various search methods to find the models, among the acoustic and language models, that match the input speech signal. FIG. 2 shows a typical search method, in which the language model is applied at between-word transitions: once a word has been determined, the language model is used when searching for the most likely next word. This method works for ordinary continuous speech recognition, but in large-vocabulary continuous speech recognition, where the target words are laid out as along the vertical axis of FIG. 2, the search space grows in proportion to the vocabulary size.

To solve this problem, tree-based search methods were proposed, which group the target vocabulary into a tree according to pronunciation. FIG. 3 shows an example of a tree structure used for search. In FIG. 3, root "A" groups the words "마을" (maeul, village), "마음" (maeum, mind), and "마차" (macha, carriage), which share the phonemes "ㅁ" and "아"; "마을" and "마음" additionally share the phoneme "으". Because multiple target words share phonemes in this way, the search space is reduced.
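The phoneme sharing described above can be illustrated with a small prefix tree. This is only a sketch: the romanized phoneme sequences, the dictionary-based node layout, and the function names `build_lexical_tree` and `count_nodes` are assumptions made for illustration and do not come from the patent.

```python
def build_lexical_tree(lexicon):
    """Build a phoneme prefix tree from a {word: [phoneme, ...]} lexicon."""
    root = {"children": {}, "words": []}
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            # Reuse the child for this phoneme if an earlier word created it.
            node = node["children"].setdefault(ph, {"children": {}, "words": []})
        node["words"].append(word)  # the word ends at this node
    return root


def count_nodes(node):
    """Count phoneme nodes below `node` (the empty top node is not counted)."""
    return sum(1 + count_nodes(child) for child in node["children"].values())


# Hypothetical romanized phoneme sequences for the three words under root "A".
lexicon = {
    "마을": ["m", "a", "eu", "l"],
    "마음": ["m", "a", "eu", "m"],
    "마차": ["m", "a", "ch", "a"],
}

tree = build_lexical_tree(lexicon)
flat_nodes = sum(len(p) for p in lexicon.values())  # one node per phoneme per word
tree_nodes = count_nodes(tree)                      # shared prefixes stored once
print(flat_nodes, tree_nodes)
```

With these three assumed transcriptions, a flat layout needs 12 phoneme nodes while the tree needs 7, which is the kind of reduction in search space the paragraph refers to.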

In tree-based search methods to date, however, when the search path enters root "A" at some time t to search for the next word after the preceding word has been determined, the most likely word following the preceding word cannot be selected at that time t. That is, until now the search has proceeded from root "A", and only when a leaf node is reached is the language model applied, with the candidate dropped if its expectation is low.

Therefore, in the conventional tree-based search method, when the search path arrives at a root of the tree structure, many different words could follow the words recognized so far, so the language model cannot be applied immediately and every path must be searched.

As further prior art on continuous speech recognition, U.S. Patent No. 5,956,678 (title: Speech recognition apparatus and method using look-ahead scoring) proposes improving recognition speed by using the acoustic model, once a word is determined during the search, to exclude search paths that are unlikely to come next. Because the prediction relies on the acoustic model, however, the paths it can exclude are limited.

In addition, U.S. Patent No. 6,178,401 (title: Method for reducing search complexity in a speech recognition system) proposes improving recognition speed by performing a coarse search first and then using its result to search only a specific region precisely. Separately, the paper by Ortmanns et al. (title: Language-model look-ahead for large vocabulary speech recognition, ICSLP96, pp. 2095-2098) proposes computing, for a word determined during the search, the statistical language model values of all the words in a tree with multiple roots and distributing them to each branch of the tree structure, so that branches assigned low values are removed as the search continues. With this approach the number of search paths falls as the search progresses, but all paths must still be considered for an initial period.

The present invention has been proposed to solve the conventional problems described above. Its object is to provide a tree-based speech recognition method, and a large-vocabulary continuous speech recognition system using it, in which search speed is improved, when recognizing large-vocabulary speech spoken continuously in sentence units, by removing in advance the roots whose linguistic expectation with respect to the previously determined word is low.

FIG. 1 is a block diagram showing the configuration of a general continuous speech recognition system.

FIG. 2 is a diagram illustrating a general search method for continuous speech recognition.

FIG. 3 shows an example of a tree-based search structure.

FIG. 4 is a block diagram of a speech recognition system using the tree-based speech recognition method according to the present invention.

FIG. 5 is a flowchart showing the tree-search-based speech recognition process according to the present invention.

* Description of reference numerals for the main parts of the drawings *

100: voice input unit
200: feature extraction unit
300: speech recognition unit
400: acoustic model storage unit
500: language model storage unit
600: recognition result output unit
310: delay unit
320: tree-based search unit
330: intermediate result extraction unit
340: language model preview processing unit

As a means of achieving the above object, the present invention provides a tree-search-based speech recognition method for a large-vocabulary speech recognition system, comprising:

A) extracting the word determined for the speech input at time t;

B) obtaining, for each root of the search tree, an expectation indicating how likely, linguistically, that root is to follow the word determined at time t;

C) comparing the expectation of each root obtained in step B) with a set threshold; and

D) excluding roots whose expectation is lower than the threshold from the search for the next word at time t+1.

Further, the expectation of each root is obtained by performing the following steps for each root:

a) extracting the words belonging to the root to be entered at time t+1;

b) obtaining, for each word extracted in step a), its language model probability given the word determined at time t; and

c) adding up all the language model probabilities obtained in step b) to calculate the expectation.
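The per-root expectation and the threshold test can be sketched in a few lines. This is a minimal illustration under stated assumptions: the language model is taken to be a bigram table keyed by (previous word, word), and the root contents, the probabilities, the threshold of 0.25, and the names `root_expectations` and `roots_to_search` are all made up for the example.

```python
def root_expectations(roots, lm_prob, prev_word):
    """For each root, sum P(word | prev_word) over the words under it."""
    expectations = {}
    for root, words in roots.items():                              # words under this root
        probs = [lm_prob.get((prev_word, w), 0.0) for w in words]  # per-word LM probability
        expectations[root] = sum(probs)                            # the root's expectation
    return expectations


def roots_to_search(expectations, threshold):
    """Keep only roots whose expectation is not lower than the threshold."""
    return {root for root, e in expectations.items() if e >= threshold}


# Hypothetical data: two roots and made-up bigram probabilities P(w | "사탕").
roots = {"A": ["마을", "마음", "마차"], "N": ["지구", "지각", "지게"]}
lm_prob = {
    ("사탕", "마을"): 0.25, ("사탕", "마음"): 0.5, ("사탕", "마차"): 0.125,
    ("사탕", "지구"): 0.0625,
}

expectations = root_expectations(roots, lm_prob, "사탕")
active = roots_to_search(expectations, threshold=0.25)
print(expectations, active)
```

With these assumed numbers root "A" scores 0.875 and root "N" scores 0.0625, so only root "A" would be searched for the next word.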

A large-vocabulary speech recognition system using the above method comprises: a voice input unit that receives speech and converts it into an electrical signal; a feature extraction unit that converts the speech signal from the voice input unit into feature variables for speech recognition; a delay unit that passes on the output of the feature extraction unit after a predetermined time delay; a tree-based search unit that, based on a tree structure in which the vocabulary is organized by shared phonemes, determines the word sequence matching the speech input through the delay unit; an intermediate result extraction unit that extracts the word determined at each moment by the tree-based search unit; a language model preview processing unit that, for the determined word extracted by the intermediate result extraction unit, sums the language model probabilities of the words under each root of the tree structure to set a per-root expectation, and controls the tree-based search unit so that roots whose expectation is lower than a set value are excluded from the search; an acoustic model storage unit that stores a statistical model of speech built from the speech of many speakers; a language model storage unit that stores a language model built statistically from text in the recognition domain; and a recognition result output unit that provides the recognition result using the output of the tree-based search unit.

Hereinafter, the configuration and operation of the present invention are described in detail with reference to the accompanying drawings.

To summarize the invention briefly: once a preceding word has been determined, the language model probabilities of all the words under each root are summed to set that root's expectation with respect to the preceding word, and roots that are unlikely according to this expectation are excluded from the search, which speeds up the search for the next word.

Its operation is described in detail with reference to FIG. 4.

FIG. 4 shows a speech recognition system according to the present invention. It comprises a voice input unit 100 that receives speech and converts it into an electrical signal; a feature extraction unit 200 that converts the speech signal from the voice input unit 100 into feature variables for speech recognition; a speech recognition unit 300 that, once a preceding word is determined, sums the language model probabilities of the words under each root when recognizing the next speech input, removes the roots that are unlikely to follow the preceding word, and then performs tree-based search to determine the next word; an acoustic model storage unit 400 that stores a statistical model of speech built from the speech of many speakers; a language model storage unit 500 that stores a language model built statistically from text in the recognition domain; and a recognition result output unit 600 that provides the recognition result using the output of the speech recognition unit 300.

The speech recognition unit 300 comprises a K delay unit 310 that delays the signal output from the feature extraction unit 200 by a time K; a tree-based search unit 320 that, taking into account the acoustic and language models stored in the acoustic model storage unit 400 and the language model storage unit 500, determines probable words for the speech input through the K delay unit 310 on the basis of a tree structure in which the vocabulary is organized by shared phonemes; an intermediate result extraction unit 330 that extracts the word determined at each moment by the tree-based search unit 320; and a language model preview processing unit 340 that performs the language model preview on the determined word extracted by the intermediate result extraction unit 330 and excludes from the tree structure used for the search the roots representing words that are contextually unlikely to come next.
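The source does not say how the K delay unit is realized. One plausible reading, shown here purely as an assumption, is a fixed-length first-in first-out buffer that releases each feature frame K frames after it arrives; the class name `DelayLine` and the integer "frames" are hypothetical.

```python
from collections import deque


class DelayLine:
    """Release each pushed item K pushes later (a simple FIFO delay)."""

    def __init__(self, k):
        self.k = k
        self.buffer = deque()

    def push(self, frame):
        """Store `frame`; return the frame pushed K calls earlier, or None."""
        self.buffer.append(frame)
        if len(self.buffer) > self.k:
            return self.buffer.popleft()
        return None  # still filling: nothing has been delayed long enough


delay = DelayLine(k=2)
outputs = [delay.push(frame) for frame in range(5)]  # frames 0..4 stand in for feature vectors
print(outputs)
```

In this sketch the first K pushes return `None` while the buffer fills, after which every frame comes out exactly K steps late.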

The configuration operates as follows.

A speech signal entering through the voice input unit 100 is passed to the feature extraction unit 200, where feature parameters are extracted, and the feature values output by the feature extraction unit 200 are provided to the speech recognition unit 300.

The speech recognition unit 300 determines the corresponding words by applying the input feature values to the acoustic model and language model provided by the acoustic model storage unit 400 and the language model storage unit 500. The first input feature values are passed through the K delay unit 310 to the tree-based search unit 320, which uses the acoustic and language models to search over time for the word sequence that best matches the input speech.

The result of the tree-based search unit 320 is output to the recognition result output unit 600 and, at the same time, passed through the intermediate result extraction unit 330 to the language model preview processing unit 340. The language model preview processing unit 340 loads the trained language model, calculates for each root an expectation indicating how likely it is to follow the determined preceding word, removes the roots with low expectations, and provides the result to the tree-based search unit 320.

When searching for the next word, the tree-based search unit 320 searches the tree structure from which the language model preview processing unit 340 has removed the unlikely roots, and finds the matching next word sequence.
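The feedback between the search unit and the preview processing unit can be pictured as a loop in which each decided word prunes the roots for the next search. The sketch below is a toy under loud assumptions: the acoustic search is replaced by a per-position table of made-up scores, candidates are ranked by acoustic score times bigram probability, and the fallback that re-enables all roots when none survives is an addition of this example. None of these details are specified in the source.

```python
def search_sentence(slots, roots, lm_prob, threshold, first_context):
    """Toy word-by-word search with root pruning between words.

    slots: one {word: acoustic_score} dict per word position (a stand-in
    for the acoustic search). Returns (chosen words, candidates scored).
    """
    prev, chosen, scored = first_context, [], 0
    for acoustic in slots:
        # Preview: keep roots whose summed LM probability reaches the threshold.
        active = [r for r, words in roots.items()
                  if sum(lm_prob.get((prev, w), 0.0) for w in words) >= threshold]
        if not active:                    # assumption: never prune every root
            active = list(roots)
        candidates = [w for r in active for w in roots[r]]
        scored += len(candidates)
        # Assumed scoring rule: acoustic score times bigram probability.
        best = max(candidates,
                   key=lambda w: acoustic.get(w, 0.0) * lm_prob.get((prev, w), 0.0))
        chosen.append(best)
        prev = best                       # feed the decided word back
    return chosen, scored


# Hypothetical roots, bigram probabilities, and acoustic scores.
roots = {"A": ["마을", "마음", "마차"], "N": ["지구", "지각", "지게"]}
lm_prob = {
    ("사탕", "마을"): 0.25, ("사탕", "마음"): 0.5, ("사탕", "마차"): 0.125,
    ("사탕", "지구"): 0.0625,
    ("마음", "지구"): 0.5, ("마음", "지각"): 0.25, ("마음", "마을"): 0.125,
}
slots = [{"마음": 0.5, "마을": 0.25, "지구": 0.5}, {"지구": 0.5, "지각": 0.25}]

pruned = search_sentence(slots, roots, lm_prob, 0.25, "사탕")
unpruned = search_sentence(slots, roots, lm_prob, 0.0, "사탕")
print(pruned, unpruned)
```

On this toy input both runs pick the same two words, but the pruned run scores 6 candidate words where the unpruned run scores 12, which mirrors the intended trade: fewer roots searched without changing the result when the threshold is well chosen.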

That is, after some word "B" is determined at time t-1, all the words under root "A" are extracted and the language model probability of each of them given the determined word "B" is obtained. Adding up these language model probabilities gives the expectation of root "A". Expectations are obtained in the same way for the remaining roots up to "N", and when the search begins at the next time t, the search is prevented from entering any root whose expectation is at or below the threshold. The search space is thereby reduced and processing speed increases.
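Written as a formula, the calculation in this paragraph might look as follows. The symbols are assumptions introduced here, not notation from the patent: V(r) is the set of words under root r, P(w | B) is the language model probability of word w given the determined word B, θ is the threshold, and R_t is the set of roots still searched at time t.

```latex
E_r(B) = \sum_{w \in V(r)} P(w \mid B),
\qquad
\mathcal{R}_t = \{\, r \in \{A, \dots, N\} : E_r(B) > \theta \,\}
```

The strict inequality follows this paragraph, which blocks roots at or below the threshold; the summary steps and the claims instead exclude roots whose expectation is lower than the threshold, so the boundary case is not settled by the text. If every word belongs to exactly one root and P(· | B) sums to one over the vocabulary, the expectations of all roots also sum to one, so θ can be read as a minimum share of the probability mass following B.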

For example, in a tree structure like that of FIG. 3, suppose the preceding word is "사탕" (satang, candy). Contextually, the words under root "N", such as "지구" (jigu), "지각" (jigak), and "지게" (jige), are very unlikely to follow "사탕". Excluding root "N" in advance when searching for the next word therefore avoids wasting search time.

The recognition result output unit 600 combines the words determined as above and outputs the sentence corresponding to the input speech.

FIG. 5 is a flowchart showing the processing in time sequence to aid understanding of the tree-based search process according to the present invention; a detailed description of its operation is omitted to avoid repetition. As shown in FIG. 5, once a preceding word is determined, the invention applies the language model to each root with respect to that word and removes unlikely roots before the search, which further speeds up the search for the following word.

As described above, in speech recognition based on a tree structure, the present invention applies the language model in advance and excludes from the search those words that, from a linguistic standpoint, are unlikely to follow the current word, thereby improving recognition speed.

In addition, because the expectation of each root is calculated taking into account all the words under that root in the tree-based search structure, the loss of recognition performance that accompanies the speed-up is minimized.

Claims

A tree-search-based speech recognition method for a large-vocabulary speech recognition system that recognizes continuously input speech in sentence units, the method repeating the steps of:

A) when a word has been determined for given input speech, extracting the determined word;

B) obtaining, for each root of the search tree, an expectation value representing how likely, linguistically, that root is to follow the determined word;

C) comparing the expectation value of each root obtained in step B) with a set threshold;

D) removing from the search tree each root whose expectation value is lower than the threshold in the comparison; and

E) receiving the next speech input and searching the search tree from which the roots with low linguistic expectation have been removed, to determine the corresponding word.

The method of claim 1, wherein step B) of obtaining the expectation value for each root comprises:

a) extracting all the words belonging to each root of the search tree;

b) obtaining the language model probability of each word extracted in step a); and

c) calculating the expectation value of the corresponding root by adding up all the language model probabilities obtained in step b).

The method of claim 1 or 2, wherein the search tree arranges the phonemes of the recognition-target vocabulary in time series.

The method of claim 1 or 2, wherein a word is determined when the search, proceeding from a root of the search tree, reaches a leaf node.

A large-vocabulary speech recognition system using a tree-based search method, comprising:

a voice input unit that receives speech and converts it into an electrical signal;

a feature extraction unit that converts the speech signal from the voice input unit into feature variables for speech recognition;

a delay unit that passes on the output of the feature extraction unit after a predetermined time delay;

a tree-based search unit that determines a probable word for the speech input through the delay unit, based on a tree structure in which the vocabulary is organized by shared phonemes;

an intermediate result extraction unit that extracts the word determined at each moment by the tree-based search unit;

a language model preview processing unit that, for the determined word extracted by the intermediate result extraction unit, determines an expectation for each root of the tree structure by applying a language model, and controls the tree-based search unit so that roots whose expectation is lower than a set value are excluded from the search;

an acoustic model storage unit that stores a statistical model of speech built from the speech of many speakers;

a language model storage unit that stores a language model built statistically from text in the recognition domain; and

a recognition result output unit that provides a recognition result using the output of the tree-based search unit.

The system of claim 5, wherein the language model preview processing unit extracts, for each root of the search tree, all the words belonging to that root; obtains the language model probability of each extracted word given the previously determined word; calculates the expectation value of each root by adding up the language model probabilities of all the words belonging to it; and excludes roots whose expectation value is lower than a predetermined threshold from the next search.

In a large-vocabulary speech recognition system applying a tree-based search method, a recording medium storing a program for realizing:

a function of receiving, as the previously determined word, the word recognized by the tree-based search at a given speech recognition time t;

a function of extracting the words belonging to each root of the search tree;

a function of obtaining, by applying a language model, the language model probability of each word extracted from the search tree given the previously determined word;

a function of calculating the expectation value of each root by adding up the language model probabilities of the words belonging to that root of the search tree;

a function of comparing the expectation value obtained for each root with a preset threshold; and

a function of excluding, at the next speech recognition time t+1, each root whose expectation value is lower than the threshold as a result of the comparison.