KR20100067727A

KR20100067727A - Multi-search based speech recognition apparatus and its method

Info

Publication number: KR20100067727A
Application number: KR1020080126244A
Authority: KR
Inventors: 전형배; 조훈영; 김승희; 황규웅; 이일빈; 박준; 박상규
Original assignee: 한국전자통신연구원
Priority date: 2008-12-12
Filing date: 2008-12-12
Publication date: 2010-06-22
Also published as: KR101068120B1

Abstract

PURPOSE: A multi-search based speech recognition apparatus and method are provided that perform multiple searches on an input speech signal, improving recognition performance on the signal by using both an FSN (Finite State Network) scheme and an N-gram scheme. CONSTITUTION: A speech feature extraction block (102) extracts feature data from the input speech signal. A language model database (108) stores an FSN language model and an N-gram language model. A multi-search block (104) performs a first speech search and a second speech search in parallel, generates an integrated search network from their results, and outputs a speech recognition result according to a third speech search.

Description

Speech Recognition Device based on Multiple Searches and Its Method {MULTI-SEARCH BASED SPEECH RECOGNITION APPARATUS AND ITS METHOD}

The present invention relates to speech recognition techniques, and more particularly to a multi-search based speech recognition apparatus and method suited to performing speech recognition through multiple searches over an input speech signal.

The present invention is derived from research conducted as part of the IT Growth Engine Core Technology Development Program of the Ministry of Knowledge Economy and the Institute for Information Technology Advancement [Project management number: 2008-S-019-01, Project title: Development of portable Korean/English automatic interpretation technology].

As is well known, speech recognition represents the target domain to be recognized as a single search network and performs a search that finds, within the constraints of that network, the word sequence most similar to the input speech signal (speech data).

Several kinds of search network are used for speech recognition; the most widely used are the Finite State Network (hereinafter 'FSN') scheme and the N-gram language model scheme.

An FSN search network defines the target domain as a set of typical sentence expressions, defines those expressions as sentence patterns, and describes them as a word network. It is widely used when the domain to be recognized is limited, and it offers excellent recognition performance when only the defined sentences are spoken.

For example, in a service such as flight reservation, where sentences such as "I would like to book from Seoul to Jeju Island" and "After 3 p.m. tomorrow would be good" are typical, all the relevant sentences can be defined as sentence patterns and expressed as an FSN search network, through which the input speech signal is recognized.
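The idea of a word network built from sentence patterns can be sketched as a small state machine. The following is only an illustration: the states, the English vocabulary, and the `accepts` helper are hypothetical and do not come from the patent, and a real recognizer would score acoustic evidence along each path rather than simply accept or reject text.

```python
# Toy finite state network (FSN): state -> {allowed next word: next state}.
# The states and vocabulary are hypothetical, chosen only for illustration.
FSN = {
    "start": {"book": "s1"},
    "s1": {"seoul": "s2", "busan": "s2"},
    "s2": {"to": "s3"},
    "s3": {"jeju": "end", "seoul": "end"},
}


def accepts(words, network=FSN, start="start", final="end"):
    """Return True if the word sequence follows a path from start to final."""
    state = start
    for word in words:
        next_states = network.get(state, {})
        if word not in next_states:
            return False  # this word is not allowed after the current state
        state = next_states[word]
    return state == final


print(accepts(["book", "seoul", "to", "jeju"]))      # a defined sentence pattern
print(accepts(["book", "a", "seat", "to", "jeju"]))  # a variant the network does not define
```

The second call shows the limitation discussed later in the description: a sentence that departs even slightly from the defined patterns has no path through the network.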

The N-gram language model scheme, by contrast, is widely used when the domain to be recognized is much larger than in the FSN case and admits such varied expressions that sentence patterns are hard to define. A sentence corpus representing the domain is built, and the probability of occurrence is computed and stored for word sequences of a defined length in that corpus. The probability of a whole sentence is then defined by combining these precomputed word probabilities with the observation probability of the input speech signal.

Here, the probability of each word is called the language model probability, and in the N-gram scheme the probability of the output words given the input speech signal is defined by Equations 1 and 2 below.

In Equation 1, the output word sequence L is determined as the word sequence for which Pr(A|L)Pr(L) is largest. Pr(A|L) is the observation probability, which indicates how similar the input speech signal is to the general acoustic characteristics of particular words. These general acoustic characteristics are learned in advance from a training speech database and modeled as an acoustic model trained with, for example, an HMM (Hidden Markov Model).

Pr(L) is the language model probability: the probability of each word occurring, estimated from a large sentence corpus of the target domain. The occurrence probability of each word is modeled with the previously occurring words as constraints. The tri-gram language model, which fixes the preceding two words and models the probability of the current word, is the one most commonly used.
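The equation images did not survive in this text. Going only by the surrounding description, Equation 1 presumably selects the word sequence that maximizes the product of the observation probability and the language model probability, and a tri-gram model of the kind just described corresponds to the standard decomposition in the second line. Both lines are a reconstruction under that assumption, not the patent's own typesetting; the symbols \hat{L}, w_i and n are introduced here for illustration, and whether the second line matches the patent's Equation 2 is not confirmed by the text.

```latex
\begin{align}
\hat{L} &= \arg\max_{L} \Pr(A \mid L)\,\Pr(L) \\
\Pr(L) &\approx \prod_{i=1}^{n} \Pr(w_i \mid w_{i-2},\, w_{i-1}), \qquad L = w_1 w_2 \cdots w_n
\end{align}
```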

Thus, the FSN scheme is generally used when the target domain is small and the N-gram scheme when it is large. Comparing the two: in an FSN search network the words that may follow a given word are restricted by the defined sentence patterns, whereas in the N-gram scheme any word may follow a given word, and only the probability of two words occurring in succession differs. As a result, the N-gram scheme allows more natural and more varied sentence expressions than the FSN scheme.
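As a rough illustration of how word probabilities might be estimated from a sentence corpus, the sketch below counts tri-grams and divides by the count of the two-word context (a plain maximum-likelihood estimate). The corpus, the `<s>` and `</s>` padding tokens, and the function names are assumptions made for this example; the patent does not specify an estimation procedure, and practical systems add smoothing so that unseen word combinations do not receive zero probability.

```python
from collections import Counter


def train_trigram(sentences):
    """Count tri-grams and their two-word contexts in a list of sentences."""
    tri, context = Counter(), Counter()
    for sentence in sentences:
        # Pad so the first real word also has a two-word history.
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(words)):
            context[(words[i - 2], words[i - 1])] += 1
            tri[(words[i - 2], words[i - 1], words[i])] += 1
    return tri, context


def trigram_prob(tri, context, w1, w2, w3):
    """Maximum-likelihood estimate of Pr(w3 | w1, w2); 0.0 for an unseen context."""
    seen = context[(w1, w2)]
    return tri[(w1, w2, w3)] / seen if seen else 0.0


# Hypothetical three-sentence corpus.
corpus = ["when will you book", "when will you travel", "when will you book"]
tri, context = train_trigram(corpus)
print(trigram_prob(tri, context, "will", "you", "book"))
print(trigram_prob(tri, context, "will", "you", "travel"))
```

Unlike the FSN case, every word can in principle follow every other word here; the model only ranks continuations by how often they were observed.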

However, the FSN scheme conventionally used for speech recognition cannot accurately recognize undefined sentence patterns or sentences that deviate partly from a defined pattern (variant sentences). The N-gram scheme, for its part, has a search network that covers a vast space, so its recognition performance on a restricted target domain is lower than that of FSN search.

Accordingly, the present invention seeks to provide a multi-search based speech recognition apparatus and method that improve recognition performance on a speech signal by searching in parallel with the FSN scheme and the N-gram scheme and then re-searching over the integrated search network generated from those searches.

In one aspect, the present invention provides a multi-search based speech recognition apparatus for recognizing an input speech signal, comprising: a speech feature extraction block that extracts feature data from the input speech signal; an acoustic model database that stores acoustic models statistically modeling phonemes; a language model database that stores an FSN (Finite State Network) language model and an N-gram language model; and a multi-search block that performs in parallel, on the extracted feature data, a first speech search using the acoustic model and the FSN language model and a second speech search using the acoustic model and the N-gram language model, generates an integrated search network from the first and second speech searches, performs a third speech search using the generated integrated search network and the acoustic model, and outputs a speech recognition result according to the third speech search.

In another aspect, the present invention provides a multi-search based speech recognition method for recognizing an input speech signal, comprising: extracting feature data from the input speech signal; performing in parallel, on the extracted feature data, a first speech search using an acoustic model and an FSN (Finite State Network) language model and a second speech search using the acoustic model and an N-gram language model; generating an integrated search network from the first and second speech searches performed in parallel; and performing a third speech search on the feature data using the generated integrated search network and the acoustic model, and outputting the resulting speech recognition result.

Unlike conventional methods that recognize an input speech signal using a technique such as the FSN scheme or the N-gram scheme, the present invention runs speech searches using the FSN scheme and the N-gram scheme in parallel, generates an integrated search network from the resulting first and second word lattices, and re-runs the speech search over that network to output the recognition result. This multiple search over the FSN and N-gram schemes can improve the recognition rate for the input speech signal.

The technical gist of the present invention is to extract feature data from an input speech signal using a speech recognition apparatus that performs multiple searches; to perform an FSN speech search and an N-gram speech search in parallel on the extracted feature data; to generate an integrated search network from the word lattices those searches output; and to re-run the speech search using the integrated search network and the feature data and output the resulting recognition result. These technical means resolve the problems of the prior art.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a speech recognition apparatus suited to recognizing an input speech signal using multiple searchers according to a preferred embodiment of the present invention. The apparatus includes a speech feature extraction block (102), a multi-search block (104), an acoustic model database (106), and a language model database (108).

Referring to FIG. 1, the speech feature extraction block (102) extracts feature vectors for speech search. When a speech signal to be recognized is input, the block extracts feature vectors from it using an analysis technique such as MFCC (Mel-Frequency Cepstrum Coefficients), LPCC (Linear Prediction Cepstral Coefficients), EIH (Ensemble Interval Histogram), SMC (Short-time Modified Coherence), or PLP (Perceptual Linear Prediction), and passes those feature vectors to the multi-search block (104).

The multi-search block (104) uses the feature vectors (feature data) received from the speech feature extraction block (102) to perform an FSN speech search and an N-gram speech search in parallel. The FSN speech search searches the feature data using the acoustic model retrieved from the acoustic model database (106) and the FSN language model (that is, the FSN search network) retrieved from the language model database (108), and outputs its recognition result in word lattice form. At the same time, the N-gram speech search searches the feature data using the acoustic model retrieved from the acoustic model database (106) and the N-gram language model (that is, the N-gram search network) retrieved from the language model database (108), and likewise outputs its recognition result in word lattice form.

Here, outputting in word lattice form means that each of the FSN and N-gram speech searches outputs its top-ranked word sequence, whose observation probability, language model probability, and so on are highest, together with a number of next-ranked word sequences whose probabilities exceed a preset probability value (that is, sequences that are also likely).
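The selection rule described above (the best word sequence plus runners-up above a preset value) can be sketched as follows. This is a simplification under stated assumptions: a real word lattice is a graph in which hypotheses share partial paths, whereas this example uses a flat list of scored word sequences, and the function name, candidate sentences, and probabilities are all hypothetical.

```python
def to_word_lattice(hypotheses, threshold):
    """Keep the top-ranked word sequence plus runners-up scoring above threshold.

    hypotheses: list of (word_sequence, probability) pairs.
    """
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    if not ranked:
        return []
    best, rest = ranked[0], ranked[1:]
    # The best hypothesis is always kept; the others must clear the preset value.
    return [best] + [h for h in rest if h[1] > threshold]


# Hypothetical search output with made-up probabilities.
candidates = [
    (("when", "will", "you", "book"), 0.62),
    (("when", "will", "you", "look"), 0.21),
    (("then", "will", "you", "book"), 0.03),
]
print(to_word_lattice(candidates, threshold=0.1))
```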

The multi-search block (104) then generates, from the FSN and N-gram recognition results, an FSN-type integrated search network containing both word lattices. Based on this integrated search network (that is, the integrated language model) and the acoustic model retrieved from the acoustic model database (106), it re-runs the speech search on the feature data (feature vectors) received from the speech feature extraction block (102) and outputs the recognition result whose observation probability, language model probability, and so on are highest.

The acoustic model database (106) stores acoustic model information obtained by modeling the statistical behavior of Korean phonemes from a speech database with a technique such as HMM (Hidden Markov Model). This information is retrieved as needed and provided to the multi-search block (104).

The language model database (108) stores FSN language model information, N-gram language model information, and the like. Each is retrieved as needed and provided to the multi-search block (104).

Next, the multi-search block is described. In the multi-search based speech recognition apparatus configured as above, this block performs the FSN and N-gram speech searches in parallel and then re-runs the speech search over the resulting integrated search network to output the recognition result.

FIG. 2 is a block diagram of a multi-search block suited to performing speech recognition through multiple searches including the FSN scheme and the N-gram scheme according to the present invention. The multi-search block (104) includes a first speech search unit (202), a second speech search unit (204), an integrated search network generation unit (206), and a third speech search unit (208).

Referring to FIG. 2, the first speech search unit (202) includes an FSN speech search module. It searches the feature data of the speech signal using the acoustic model retrieved from the acoustic model database (106) and the FSN language model (that is, the FSN search network) retrieved from the language model database (108), and outputs the recognition result of the FSN speech search as a first word lattice.

The second speech search unit (204) includes an N-gram speech search module. It searches the feature data of the speech signal using the acoustic model retrieved from the acoustic model database (106) and the N-gram language model (that is, the N-gram search network) retrieved from the language model database (108), and outputs the recognition result of the N-gram speech search as a second word lattice.

Next, the integrated search network generation unit (206) generates an FSN-type integrated search network containing both word lattices, from the FSN recognition result (first word lattice) output by the first speech search unit (202) and the N-gram recognition result (second word lattice) output by the second speech search unit (204), and passes it to the third speech search unit (208). The integrated search network thus consists of the first and second word lattices, each containing the top-ranked word sequence found by its search together with a number of next-ranked word sequences whose probabilities exceed a preset probability value.

The third speech search unit (208) then re-runs the speech search on the feature data (feature vectors) received from the speech feature extraction block (102), based on the integrated search network (that is, the integrated language model) received from the integrated search network generation unit (206) and the acoustic model retrieved from the acoustic model database (106). From this re-run search it outputs the recognition result whose observation probability, language model probability, and so on are highest.
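The data flow through the four units of the multi-search block can be sketched as below. Everything here is a stand-in: the two search functions return fixed word sequences, the acoustic scores are made-up numbers, and the third search is reduced to picking the best-scoring word sequence among those in the merged lattices. The patent does not describe the internals at this level, so this shows only the parallel-then-merge-then-re-search structure, using Python's standard `concurrent.futures` module for the parallel step.

```python
from concurrent.futures import ThreadPoolExecutor


def multi_search(features, first_search, second_search, acoustic_score):
    """Run two searches in parallel, merge their lattices, then re-search."""
    # First (FSN) and second (N-gram) searches run in parallel on the same features.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future1 = pool.submit(first_search, features)
        future2 = pool.submit(second_search, features)
        lattice1, lattice2 = future1.result(), future2.result()
    # Integrated search network: every word sequence from either lattice,
    # with duplicates removed and order preserved.
    integrated = list(dict.fromkeys(lattice1 + lattice2))
    # Third search: restricted to the integrated network, keep the best-scoring path.
    return max(integrated, key=lambda words: acoustic_score(features, words))


# Hypothetical stand-ins for the real searchers and the acoustic model.
def fsn_search(features):
    return [("when", "will", "you", "book"), ("when", "will", "you", "look")]


def ngram_search(features):
    return [("when", "will", "you", "use", "it"), ("when", "will", "you", "book")]


def toy_acoustic_score(features, words):
    scores = {
        ("when", "will", "you", "book"): -10.0,
        ("when", "will", "you", "look"): -14.0,
        ("when", "will", "you", "use", "it"): -12.5,
    }
    return scores[words]


print(multi_search([0.1, 0.2], fsn_search, ngram_search, toy_acoustic_score))
```

Because the third search only considers word sequences that at least one of the first two searches found plausible, its search space is far smaller than that of a full N-gram network.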

Next, the recognition process is described. Using the speech recognition apparatus that performs multiple searches as described above, feature data is extracted from the input speech signal, an FSN speech search and an N-gram speech search are performed in parallel on the extracted feature data, an integrated search network is generated from the word lattices they output, the speech search is re-run using the integrated search network and the feature data, and the resulting recognition result is output.

FIG. 3 is a flowchart showing a process of performing speech recognition through multiple searches including the FSN scheme and the N-gram scheme according to an embodiment of the present invention.

Referring to FIG. 3, when a speech signal to be recognized is input to the multi-search based speech recognition apparatus (step 302), the speech feature extraction block (102) extracts feature vectors (feature data) from the input signal and passes them to the multi-search block (104) (step 304). The feature vectors may be extracted using, for example, any one of the MFCC, LPCC, EIH, SMC, and PLP techniques.

The first speech search unit (202) of the multi-search block (104) searches the feature data of the speech signal using the acoustic model retrieved from the acoustic model database (106) and the FSN language model (FSN search network) retrieved from the language model database (108) (step 306).

The first speech search unit (202) then outputs the recognition result of the FSN speech search as a first word lattice in word lattice form (step 308).

In parallel, the second speech search unit (204) of the multi-search block (104) searches the feature data of the speech signal using the acoustic model retrieved from the acoustic model database (106) and the N-gram language model (N-gram search network) retrieved from the language model database (108) (step 310).

The second speech search unit (204) then outputs the recognition result of the N-gram speech search as a second word lattice in word lattice form (step 312).

Next, the integrated search network generation unit (206) of the multi-search block (104) generates an FSN-type integrated search network containing both word lattices, from the FSN first word lattice output by the first speech search unit (202) and the N-gram second word lattice output by the second speech search unit (204), and passes it to the third speech search unit (208) (step 314).

The third speech search unit (208) of the multi-search block (104) then performs the final speech search (word sequence search) on the feature data (feature vectors) received from the speech feature extraction block (102), based on the integrated search network received from the integrated search network generation unit (206) and the acoustic model retrieved from the acoustic model database (106) (step 316).

Finally, the third speech search unit (208) outputs a speech recognition result containing the word sequence with the highest probability according to the final speech search over the integrated search network (step 318).

For example, FIGS. 4A and 4B illustrate a first example of the recognition results of the FSN speech search and the N-gram speech search according to the present invention. When the speech signal “그런데, 어느 분으로 예약 을 해 드릴까요?” ("By the way, in whose name shall I make the reservation?") is input, the first word lattice from the FSN speech search may appear as in FIG. 4A and the second word lattice from the N-gram speech search as in FIG. 4B. This is a case in which the word sequence of the spoken sentence is not defined in the FSN search network (FSN language model), or many similar, confusable word sequences exist there, while the N-gram search network (N-gram language model) assigns that word sequence a high probability. In the figures, the word sequence linked by blue arrows is the one closest to the correct recognition result. When the final speech search is performed over the integrated search network containing both word lattices, “그런데 어느 분으로 예약 해 드릴까요” from the second word lattice can be output as the recognition result.

FIGS. 5A and 5B illustrate a second example of the recognition results of the FSN speech search and the N-gram speech search according to the present invention. When the speech signal “네 언제 예약 하실 건가요?” ("Yes, when would you like to make the reservation?") is input, the first word lattice from the FSN speech search may appear as in FIG. 5A and the second word lattice from the N-gram speech search as in FIG. 5B. This is a case in which the word sequence of the spoken sentence is defined in the FSN search network (FSN language model), while in the N-gram search network (N-gram language model) its probability is lower than that of the competing word sequence “네 언제 이용 하실 건가요” ("Yes, when would you like to use it?"). In the figures, the word sequence linked by blue arrows is the one closest to the correct recognition result. When the final speech search is performed over the integrated search network containing both word lattices, “네 언제 예약 하실 건가요” from the first word lattice can be output as the recognition result.

Therefore, by performing the FSN speech search and the N-gram speech search in parallel on the input speech signal and then re-running the speech search over the integrated search network generated from them, the accuracy of the recognition result for the speech signal can be further improved.

The foregoing description presents preferred embodiments of the present invention, but the invention is not necessarily limited to them. Those of ordinary skill in the art to which the invention pertains will readily appreciate that various substitutions, modifications, and changes are possible without departing from the technical spirit of the invention.

FIG. 1 is a block diagram of a speech recognition apparatus suited to recognizing an input speech signal using multiple searchers according to a preferred embodiment of the present invention;

FIG. 2 is a block diagram of a multi-search block suited to performing speech recognition through multiple searches including the FSN scheme and the N-gram scheme according to the present invention;

FIG. 3 is a flowchart showing a process of performing speech recognition through multiple searches including the FSN scheme and the N-gram scheme according to an embodiment of the present invention;

FIGS. 4A and 4B illustrate a first example of the recognition results of the FSN speech search and the N-gram speech search according to the present invention;

FIGS. 5A and 5B illustrate a second example of the recognition results of the FSN speech search and the N-gram speech search according to the present invention.

<도면의 주요부분에 대한 부호의 설명><Description of the symbols for the main parts of the drawings>

102 : 음성 특징 추출 블록 104 : 다중 탐색 블록102: voice feature extraction block 104: multiple search block

106 : 음향 모델 데이터베이스 108 : 언어 모델 데이터베이스106: acoustic model database 108: language model database

202 : 제 1 음성 탐색부 204 : 제 2 음성 탐색부202: first voice search unit 204: second voice search unit

206 : 통합 탐색 네트워크 생성부 208 : 제 3 음성 탐색부206: integrated search network generation unit 208: third voice search unit

Claims

A multi-search based speech recognition apparatus for recognizing an input voice signal, comprising:

a voice feature extraction block that extracts feature data from the input voice signal;

an acoustic model database that stores acoustic models in which phonemes are statistically modeled;

a language model database that stores a Finite State Network (FSN) language model and an N-gram language model; and

a multi-search block that performs, in parallel on the extracted feature data, a first voice search using the acoustic model and the FSN language model and a second voice search using the acoustic model and the N-gram language model, generates an integrated search network according to the first voice search and the second voice search performed in parallel, performs a third voice search using the generated integrated search network and the acoustic model, and outputs a voice recognition result according to the third voice search.

The apparatus of claim 1,

wherein the multi-search block comprises:

a first voice search unit that outputs a first word lattice by performing the first voice search on the extracted feature data using the acoustic model and the FSN language model;

a second voice search unit that outputs a second word lattice by performing the second voice search on the extracted feature data using the acoustic model and the N-gram language model;

an integrated search network generation unit that integrates the first word lattice and the second word lattice to generate the integrated search network in the FSN method; and

a third voice search unit that performs the third voice search on the feature data using the generated integrated search network and the acoustic model, and outputs the voice recognition result accordingly.

The apparatus of claim 1 or 2,

wherein the feature data is a feature vector extracted using any one of Mel-Frequency Cepstrum Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), Ensemble Interval Histogram (EIH), Short-time Modified Coherence (SMC), and Perceptual Linear Prediction (PLP).

The apparatus of claim 1 or 2,

wherein the acoustic model is modeled by a Hidden Markov Model (HMM) technique.

The apparatus of claim 1 or 2,

wherein the first word lattice and the second word lattice each include, for the respective one of the FSN and N-gram voice searches, a top-ranked word string having the highest observation probability and language model probability, and a plurality of next-ranked word strings whose probabilities are higher than a preset probability value.

A multi-search based speech recognition method for recognizing an input voice signal, comprising:

extracting feature data from the input voice signal;

performing, in parallel on the extracted feature data, a first voice search using an acoustic model and a Finite State Network (FSN) language model and a second voice search using the acoustic model and an N-gram language model;

generating an integrated search network according to the first voice search and the second voice search performed in parallel; and

performing a third voice search on the feature data using the generated integrated search network and the acoustic model, and outputting a voice recognition result accordingly.

The method of claim 6,

wherein the extracting of the feature data extracts a feature vector corresponding to the voice signal using any one of Mel-Frequency Cepstrum Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), Ensemble Interval Histogram (EIH), Short-time Modified Coherence (SMC), and Perceptual Linear Prediction (PLP).

The method of claim 6,

wherein the performing in parallel comprises:

outputting a first word lattice by performing the first voice search on the extracted feature data using the acoustic model and the FSN language model; and

outputting a second word lattice by performing the second voice search on the extracted feature data using the acoustic model and the N-gram language model.

The method of claim 6,

wherein the generating of the integrated search network generates the integrated search network in the FSN method using, from each of the first voice search and the second voice search performed in parallel, a top-ranked word string having the highest observation probability and language model probability, and a plurality of next-ranked word strings whose probabilities are higher than a preset probability value.

The method of any one of claims 7 to 9,

wherein the acoustic model is modeled by a Hidden Markov Model (HMM) technique.