KR100764247B1

KR100764247B1 - Apparatus and Method for speech recognition with two-step search

Info

Publication number: KR100764247B1
Application number: KR1020060020754A
Authority: KR
Inventors: 고한석; 정석영
Original assignee: 고려대학교 산학협력단
Priority date: 2005-12-28
Filing date: 2006-03-06
Publication date: 2007-10-08
Also published as: KR20070070000A

Abstract

Disclosed are a speech recognition apparatus using a two-stage search and a method thereof.

The present invention includes a fast search unit that generates a plurality of candidate words by performing, on input speech, a Viterbi search using no more than a predetermined number of the Gaussian distributions included in a pool; an N-best candidate generator that extracts candidate words from among them in descending order of reliability; and a precision search unit that outputs a recognized word by performing, on the extracted candidate words, a Viterbi search using at least a predetermined number of the Gaussian distributions included in the pool.

According to the present invention, the speed of speech recognition can be improved without degrading the recognition rate, and the performance of the overall system can be improved.

Description

Apparatus and Method for speech recognition with two-step search

FIG. 1 is a block diagram of the present invention.

FIG. 2 is a detailed block diagram of FIG. 1.

FIG. 3 is a flowchart of the present invention.

FIG. 4 is a detailed flowchart of the fast search process of FIG. 3.

FIG. 5 is a detailed flowchart of the N-best candidate word extraction process of FIG. 3.

FIG. 6 is a detailed flowchart of the precise search process of FIG. 3.

<Description of the reference numerals for the main parts of the drawings>

100: fast search unit 110: N-best candidate generator

130: precision search unit

The present invention relates to speech recognition, and more particularly to a speech recognition apparatus and method using a two-stage search on an embedded platform.

Implementing speech recognizers on embedded systems has recently become a major issue. As advances in information and communication technology have increased the use of personal portable devices, and as products such as home appliances, vehicles, and toys incorporating advanced technology have been developed, interest in introducing voice interfaces has grown.

However, an embedded system has far more limited resources and a slower computation speed than a general PC development environment, so it is not easy to implement a large-vocabulary or continuous speech recognizer. In particular, conventional speech recognition methods detect the single word with the optimal probability value through one elaborate Viterbi search, which makes it difficult to achieve both high recognition performance and fast execution when recognizing a large vocabulary in a resource-limited embedded environment.

In addition, a conventional continuous density hidden Markov model (HMM) is not suitable for designing an acoustic model for an embedded system, because a large amount of memory cannot be allocated to it and it is not easy to minimize the resulting loss of recognition performance.

Therefore, conventional speech recognition methods cannot perform high-speed speech recognition while maintaining high recognition performance.

Accordingly, the first technical object of the present invention is to provide a speech recognition apparatus using a two-stage search that can perform speech recognition at high speed while maintaining high recognition performance.

The second technical object of the present invention is to provide a speech recognition method using a two-stage search, as applied in the above speech recognition apparatus.

To achieve the first technical object, the present invention includes a fast search unit that generates a plurality of candidate words by performing, on input speech, a Viterbi search using no more than a predetermined number of the Gaussian distributions included in a pool; an N-best candidate generator that extracts candidate words from among them in descending order of reliability; and a precision search unit that outputs a recognized word by performing, on the extracted candidate words, a Viterbi search using at least a predetermined number of the Gaussian distributions included in the pool.

To achieve the second technical object, the present invention includes the steps of generating a plurality of candidate words by performing, on input speech, a Viterbi search using no more than a predetermined number of the Gaussian distributions included in a pool; extracting candidate words from among them in descending order of reliability; and outputting a recognized word by performing, on the extracted candidate words, a Viterbi search using at least a predetermined number of the Gaussian distributions included in the pool.

Hereinafter, preferred embodiments of the present invention will be described with reference to the drawings.

The fast search unit 100 generates a plurality of candidate words by performing, on the input speech, a Viterbi search using no more than a predetermined number of the Gaussian distributions included in a pool. Here, the pool is the set of Gaussian distributions that can occur.

The N-best candidate generator 110 extracts candidate words, in descending order of reliability, from among the candidate words generated by the fast search unit 100.

The precision search unit 130 outputs a recognized word by performing, on the candidate words extracted by the N-best candidate generator 110, a Viterbi search using at least a predetermined number of the Gaussian distributions included in the pool.

The fast search unit 200 generates a plurality of candidate words by performing, on the input speech, a Viterbi search using no more than a predetermined number of the Gaussian distributions included in a pool. Here, the pool is the set of Gaussian distributions that can occur.

The fast search unit 200 includes a Mahalanobis distance calculator 201, a Gaussian selector 202, and a log-sum calculator 203.

The Mahalanobis distance calculator 201 computes Mahalanobis distance values for all Gaussian distributions included in the pool. Preferably, the Mahalanobis distance calculator 201 computes the Mahalanobis distance values between the feature vector extracted for each frame of the speech and all Gaussian distributions included in the pool.
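As a minimal sketch of this computation, assuming diagonal covariances (the text does not state the covariance structure), the squared Mahalanobis distance between one frame's feature vector and every Gaussian in the pool could be computed as follows. The names `mahalanobis_sq`, `distances_to_pool`, and `pool` are hypothetical and chosen only for illustration.

```python
def mahalanobis_sq(x, mean, var):
    """Squared Mahalanobis distance for a diagonal-covariance Gaussian (assumed form)."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))


def distances_to_pool(x, pool):
    """Return one distance per Gaussian in the pool for feature vector x."""
    return [mahalanobis_sq(x, mean, var) for mean, var in pool]


# Hypothetical two-Gaussian pool of (mean, variance) pairs in two dimensions.
pool = [([0.0, 0.0], [1.0, 1.0]), ([2.0, 2.0], [4.0, 1.0])]
frame = [1.0, 2.0]  # feature vector of one speech frame
dists = distances_to_pool(frame, pool)
```

Because every Gaussian in the pool is visited once per frame, the cost of this step grows with the pool size rather than with the vocabulary size.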

The Gaussian selector 202 selects no more than the predetermined number of Gaussian distributions from among those included in the pool, in descending order of Mahalanobis distance value. Here, the predetermined number is a threshold that allows fast recognition without lowering the speech recognition rate.

The log-sum calculator 203 performs a log-sum operation using the Gaussian distributions selected by the Gaussian selector 202.

The N-best candidate generator 210 extracts candidate words, in descending order of reliability, from among the candidate words generated by the fast search unit 200.

The N-best candidate generator 210 includes an NLLR verification unit 211 and a search space generator 212.

The NLLR verification unit 211 computes the reliability of the candidate words and selects those whose reliability is greater than or equal to a threshold. Here, the threshold is a value determined in advance by a person skilled in the art so as to ensure that the speech recognition result is reliable. Preferably, the NLLR verification unit 211 computes the normalized log likelihood ratio (NLLR) of each candidate word and selects the candidate words whose normalized log likelihood ratio is greater than or equal to the threshold. The selected candidate words are defined as the N-best candidate words.

The search space generator 212 outputs the candidate words selected by the NLLR verification unit 211 to the precision search unit 230.

The Gaussian cache storage unit 220 stores the computed Mahalanobis distance values. The Gaussian cache storage unit 220 may include a volatile memory device for storing the Mahalanobis distance values.

The precision search unit 230 outputs a recognized word by performing, on the candidate words extracted by the N-best candidate generator 210, a Viterbi search using at least a predetermined number of the Gaussian distributions included in the pool. Preferably, the precision search unit 230 performs the Viterbi search and outputs the recognized word after the utterance has been completed.

The precision search unit 230 includes a Gaussian cache application unit 231, a log-sum calculator 232, and a 1-best search unit 233.

The Gaussian cache application unit 231 reads the Mahalanobis distance values from the Gaussian cache storage unit 220 and outputs them to the log-sum calculator 232.

The log-sum calculator 232 performs the log-sum operation that forms part of the output probability calculation of the Viterbi search. The log-sum calculator 232 performs this operation using at least the predetermined number of Gaussian distributions.

The 1-best search unit 233 uses the result of the log-sum operation by the log-sum calculator 232 and the Mahalanobis distance values supplied by the Gaussian cache application unit 231 to extract the one word with the highest likelihood among the candidate words, and outputs the extracted word as the recognized word.

Preferably, the fast search unit 200 and the precision search unit 230 apply the same acoustic model.

First, speech is input (step 300). Preferably, this step (step 300) includes extracting a feature vector for each frame of the speech.

Next, a plurality of candidate words are generated by performing, on the input speech, a Viterbi search using no more than a predetermined number of the Gaussian distributions included in the pool (step 310). Here, the pool is the set of Gaussian distributions that can occur.

Once the candidate words have been generated, candidate words are extracted from among them in descending order of reliability (step 320).

Finally, a recognized word is output by performing, on the extracted candidate words, a Viterbi search using at least a predetermined number of the Gaussian distributions included in the pool (step 330).

Preferably, the step of generating the candidate words (step 310) and the step of outputting the recognized word (step 330) apply the same acoustic model.
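The overall flow of steps 310 to 330 can be sketched as a cheap first pass, a shortlist, and an expensive rescoring pass. In this sketch the two scoring callables stand in for the fast and precise Viterbi searches, and the score difference from the best word stands in for the reliability measure; all names, values, and the threshold are hypothetical.

```python
def two_stage_recognize(vocabulary, fast_score, precise_score, threshold):
    """Return the word chosen by a cheap first pass followed by rescoring."""
    # Step 310: score every word in the vocabulary with the cheap scorer.
    fast = {word: fast_score(word) for word in vocabulary}
    best = max(fast.values())
    # Step 320: keep words whose score lies within `threshold` of the best.
    # The best word always passes when threshold <= 0, so the shortlist is non-empty.
    shortlist = [word for word in vocabulary if fast[word] - best >= threshold]
    # Step 330: rescore only the shortlist with the expensive scorer.
    return max(shortlist, key=precise_score)


# Hypothetical log-likelihoods standing in for the two Viterbi searches.
fast_table = {"seoul": -10.5, "soul": -10.0, "busan": -40.0}
precise_table = {"seoul": -9.0, "soul": -12.0, "busan": -35.0}
result = two_stage_recognize(list(fast_table), fast_table.get, precise_table.get, -5.0)
```

In this toy input the first pass slightly prefers "soul", but both close candidates survive the shortlist and the second pass selects "seoul", which is the situation the rescoring stage is meant to handle.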

FIG. 4 is a detailed flowchart of the fast search process (step 310) of FIG. 3.

To guarantee a dependable speed for a complex large-vocabulary search, the fast search uses a reduced amount of acoustic modeling.

First, Mahalanobis distance values are computed for all Gaussian distributions included in the pool (step 411). This step (step 411) may be a step of computing the Mahalanobis distance values between the feature vector extracted for each frame of the speech and all of the Gaussian distributions.

Next, no more than the predetermined number of Gaussian distributions are selected from among those included in the pool, in descending order of Mahalanobis distance value (step 412).

Finally, a log-sum operation is performed using the selected Gaussian distributions (step 413). Preferably, this step (step 413) includes generating candidate words according to the output probability once the output probability has been calculated by the log-sum operation.
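Steps 412 and 413 can be sketched as a top-N selection followed by a log-domain sum. The text ranks the Mahalanobis distance values from the top, so this sketch assumes each value is a log-domain score for which larger means a closer match (for example, a negated distance plus a constant); that reading, the mixture weights, and the names `select_top` and `state_log_prob` are assumptions made for illustration.

```python
import math


def select_top(log_scores, n_select):
    """Indices of the n_select best-scoring Gaussians, best first (step 412)."""
    order = sorted(range(len(log_scores)), key=lambda i: log_scores[i], reverse=True)
    return order[:n_select]


def state_log_prob(log_scores, log_weights, selected):
    """Log of the weighted sum over the selected Gaussians (step 413)."""
    terms = [log_weights[i] + log_scores[i] for i in selected]
    peak = max(terms)  # shift by the largest term for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical per-Gaussian log scores for one frame, with equal mixture weights.
log_scores = [-1.0, -8.0, -2.0, -20.0]
log_weights = [math.log(0.25)] * 4
top2 = select_top(log_scores, 2)
approx = state_log_prob(log_scores, log_weights, top2)
full = state_log_prob(log_scores, log_weights, range(4))
```

With these numbers the two best Gaussians carry almost all of the probability mass, so dropping the rest changes the log output probability only slightly while halving the number of terms in the sum.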

FIG. 5 is a detailed flowchart of the N-best candidate word extraction process (step 320) of FIG. 3.

To prevent the recognition rate from falling, this process (step 320) produces N-best candidate words for the re-search process.

First, the reliability of the candidate words is computed (step 521). This step (step 521) may be a step of computing the normalized log likelihood ratio (NLLR) of the candidate words. Reliable candidate words must be selected to generate the N-best list. The NLLR method can be used to obtain a confidence measure efficiently. That is, the NLLR can be defined as the measure of reliability.

The normalized log likelihood ratio may be computed by Equation 1 below.

Here, NLLR_v is the normalized log likelihood ratio of the v-th candidate word, LK_v is the likelihood of the v-th candidate word, LL_v is the log likelihood of the v-th candidate word, LK_max is the maximum likelihood over all candidate words, and LL_max is the maximum log likelihood over all candidate words.

Next, it is determined whether the computed reliability is greater than or equal to a threshold (step 522). Here, the threshold is a value determined in advance by a person skilled in the art so as to ensure that the speech recognition result is reliable. The expression for determining reliable candidate words is as follows.

NLLR_v > Th

The threshold (Th) is determined in advance by experiment. The v-th words whose NLLR is larger than the threshold (Th) are selected as candidate words for the precise search.

Finally, the candidate words whose reliability is greater than or equal to the threshold are selected from among the candidate words (step 523). The selected candidate words are defined as the N-best candidate words.
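Steps 521 to 523 can be sketched as scoring each candidate against the best one and keeping those above the threshold. Equation 1 is not reproduced in this text, so the normalization used in `nllr` below (the log likelihood difference from the best candidate divided by the magnitude of LL_max) is an assumption for illustration only and should not be read as the patent's formula. The function names, the example log likelihoods, and the threshold value are likewise hypothetical.

```python
def nllr(ll, ll_max):
    """ASSUMED normalization: (LL_v - LL_max) / |LL_max|; not the patent's Equation 1."""
    # Undefined when ll_max is exactly 0; real log likelihoods are normally negative.
    return (ll - ll_max) / abs(ll_max)


def n_best(log_likelihoods, threshold):
    """Return candidate words with NLLR_v > threshold, most reliable first."""
    ll_max = max(log_likelihoods.values())
    scored = {word: nllr(ll, ll_max) for word, ll in log_likelihoods.items()}
    kept = [word for word, score in scored.items() if score > threshold]
    return sorted(kept, key=lambda word: scored[word], reverse=True)


# Hypothetical log likelihoods produced by the fast search.
candidates = {"seoul": -100.0, "soul": -104.0, "busan": -150.0}
shortlist = n_best(candidates, threshold=-0.1)
```

Under this assumed normalization the best candidate always scores 0, so any negative threshold keeps at least one word for the precise search.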

FIG. 6 is a detailed flowchart of the precise search process (step 330) of FIG. 3.

First, a log-sum operation is performed using at least the predetermined number of Gaussian distributions (step 631). The log-sum operation (step 631) is required for calculating the output probability in the Viterbi search.

Next, using the result of the above log-sum operation and the Mahalanobis distance values computed in the fast search process (step 310), the one word with the highest likelihood among the candidate words is extracted and output as the recognized word (step 632). The recognized word is defined as the 1-best word. This step (step 632) may perform the Viterbi search and output the recognized word after the utterance has been completed.

The search process required for the speech recognition of the present invention calls for selecting an acoustic model appropriate for use on an embedded platform.

The present invention adopts a semi-continuous HMM (SCHMM) modeling technique and can be implemented through a tied-mixture modeling method. In addition, a context-dependent model (triphone) is applied to reflect as fully as possible the coarticulation effects of the preceding and following context, and a decision-tree-based state tying technique may be applied to prevent data shortage when training the context-dependent (triphone) model while also handling contexts that do not appear in the training data. The acoustic model constructed in this way is about 60% of the size of the CHMM model, while its recognition performance drops by less than 1%.

The present invention uses the Viterbi search as its search method. In a Viterbi search, almost all of the computation time is spent on the output probability calculation. This calculation can be divided into a Mahalanobis distance computation for each Gaussian (a Euclidean distance computation that takes the variance into account) and a log-add operation over the resulting values. In the tied-mixture state output probability computation, the Gaussians in the pool, which is the set of Gaussians that can occur, are shared, and the output is computed by summing all of the distributions with their weights, so in practice the log-add operation requires the most computation time.
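For reference, a log-add of two log-domain values is commonly evaluated with the identity below, which keeps the larger term in the log domain. This is a standard identity rather than something stated in the source, and the symbols a and b are hypothetical names for two log-domain values.

```latex
\log\left(e^{a} + e^{b}\right) = \max(a, b) + \log\left(1 + e^{-\lvert a - b \rvert}\right)
```

Each log-add therefore costs one exponential and one logarithm, which is consistent with the observation that this operation dominates the run time when many Gaussians are summed per state.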

According to one embodiment of the present invention, the decoding time can be shortened by appropriately selecting the Gaussians that take part in the log sum. That is, when a feature vector is input for each frame, the Mahalanobis distance computation is performed for all Gaussians in the pool, the resulting values are sorted from the highest rank down, and as many as the threshold value are selected from the top. Because only these distributions take part in the log-sum operation, the amount of computation is reduced. The threshold value is chosen to suit the characteristics of the fast search and of the precise search.

Moreover, because the Mahalanobis computation uses the same feature vectors in both the fast search and the precise search, the values computed in the fast search can be applied unchanged in the precise search. The re-search time can therefore be shortened considerably.

Equation 3 is the expression for tied-mixture-based Gaussian selection.

Here, b_s is the selected Gaussian, i is the index, W_s is the weight, X_t is the observation of the current speech, μ_i is the mean, and Σ_i is the variance.

According to another embodiment of the present invention, the top-ranked Gaussian distributions may be selected for the fast search by computing Gaussian distance values. For example, the fast search may select only the top four distributions to compute the output probability.

According to yet another embodiment of the present invention, the Gaussian Mahalanobis distance values, sorted by magnitude, may be stored every frame in a cache, which is a temporary storage space, and then applied unchanged in the precise search.
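A minimal sketch of such a per-frame cache is shown below. It assumes the cached values are scores for which larger means a better match, and the class name `GaussianCache` and its methods are hypothetical. The fast search stores the ranked values once, and each search stage then reads as many of the top entries as it needs.

```python
class GaussianCache:
    """Per-frame store of (Gaussian index, score) pairs ranked best first."""

    def __init__(self):
        self._ranked = []  # one ranked list per frame, in frame order

    def store(self, scores):
        """Rank one frame's scores and append them to the cache."""
        ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
        self._ranked.append(ranked)

    def top(self, frame, n):
        """Return the n best (index, score) pairs cached for a frame."""
        return self._ranked[frame][:n]


cache = GaussianCache()
cache.store([-3.0, -1.0, -7.0, -2.0])  # computed once, during the fast search
fast_view = cache.top(0, 2)     # fast search reads a short list
precise_view = cache.top(0, 3)  # precise search reads a longer list, no recomputation
```

The toy sizes 2 and 3 stand in for the 4 and 32 Gaussians mentioned in the text; the point of the design is that the sort is paid for once per frame and both stages slice the same ranked list.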

The precise search performs a re-search over the N-best candidate words obtained from the fast search. For example, because there are at most about 20 N-best candidate words, a more elaborate search can be performed over a small search space. The precise search may also use a Viterbi search. For example, the output probability may be computed by selecting the top 32 Gaussian distributions, more than in the fast search.

According to yet another embodiment of the present invention, because the precise search is a re-search performed in non-real time on the result obtained after the utterance has already ended, it is important to minimize the time the user waits. If the duration of the speaker's utterance is taken as 1x real time (xRT), then 0.4 xRT after the end of the utterance is the longest a user can wait while still perceiving the response as real time.

According to yet another embodiment of the present invention, because both search stages in the implemented system use values obtained from the same acoustic model, the Gaussian distance values computed in the fast search can be used unchanged in the precise search. The re-search time can thus be shortened considerably. For example, because the precise search finishes within 0.4 xRT, the recognition result of a large-vocabulary system can be obtained within what the user perceives as real time.
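The 0.4 xRT budget can be checked with simple arithmetic: the real-time factor is the processing time divided by the utterance duration. The sketch below assumes times in seconds; the function names and the example durations are hypothetical.

```python
def real_time_factor(processing_sec, utterance_sec):
    """Processing time expressed as a multiple of the utterance duration (xRT)."""
    return processing_sec / utterance_sec


def within_wait_budget(rescoring_sec, utterance_sec, max_xrt=0.4):
    """True if the post-utterance re-search fits inside the 0.4 xRT wait budget."""
    return real_time_factor(rescoring_sec, utterance_sec) <= max_xrt


# Hypothetical 2-second utterance.
ok = within_wait_budget(0.6, 2.0)        # 0.3 xRT, inside the budget
too_slow = within_wait_budget(1.0, 2.0)  # 0.5 xRT, outside the budget
```

Because the budget scales with utterance length, a short command word leaves proportionally less time for the precise search than a long phrase does.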

Table 1 shows the speed and recognition rate of speech recognition for the conventional Viterbi search and for the two-stage search of the present invention.

Search process               Gaussians (fast search)  Gaussians (precise search)  Recognition rate (%)  Average execution time (μsec)
Two-stage search             4                        32                          93.31                 1,805,876
Two-stage search             8                        32                          93.64                 1,966,436
Two-stage search             16                       32                          93.72                 2,242,341
Two-stage search             32                       32                          93.76                 2,458,194
Conventional Viterbi search  32 (single search)                                   93.74                 2,418,897

Table 1 gives the results of an experiment in which, for comparison with the conventional Viterbi search method, the number of Gaussians selected in the fast search was increased from 4 to 32 while 32 were selected in the precise search. The average execution time was recorded for each configuration to examine the improvement in recognition speed.

These results show that the two-stage search has a slightly lower recognition rate than the conventional Viterbi search (93.31% versus 93.74% when the fast search uses 4 Gaussians), while its average execution time falls to about 0.7 times that of the conventional search (1,805,876 μsec versus 2,418,897 μsec). The overall recognition speed is thus considerably improved without a large drop in recognition performance.
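The comparison can be reproduced directly from the Table 1 figures for the fastest two-stage configuration (4 Gaussians in the fast search, 32 in the precise search) and the conventional search. The variable names are illustrative; the numbers are those reported in the table.

```python
two_stage_usec = 1_805_876      # two-stage search, 4 and 32 Gaussians
conventional_usec = 2_418_897   # conventional Viterbi search, 32 Gaussians

ratio = two_stage_usec / conventional_usec   # execution time relative to conventional
saving_pct = (1 - ratio) * 100               # share of execution time saved
accuracy_drop = 93.74 - 93.31                # recognition-rate difference in points
```

The ratio comes out at roughly 0.75, in line with the "about 0.7 times" figure quoted in the text, for a recognition-rate difference of under half a percentage point.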

Preferably, a program for executing the speech recognition method using the two-stage search of the present invention on a computer may be provided recorded on a computer-readable recording medium.

The present invention can be implemented in software. When implemented in software, the constituent means of the present invention are code segments that perform the necessary tasks. The program or code segments may be stored on a processor-readable medium or transmitted as a computer data signal combined with a carrier wave over a transmission medium or communication network.

Although the present invention has been described with reference to the embodiment shown in the drawings, this is merely illustrative, and those of ordinary skill in the art will understand that various modifications and variations of the embodiment are possible. Such modifications, however, should be regarded as falling within the technical scope of protection of the present invention. The true technical scope of protection of the present invention should therefore be determined by the technical spirit of the appended claims.

As described above, according to the present invention, when large-vocabulary speech recognition is implemented in a constrained computing environment such as an embedded platform, recognition is performed by applying a precise search to the candidate words generated by a fast search. The speed of speech recognition can therefore be improved without lowering the recognition rate, and at the same time the performance of the overall system can be improved.

Claims

A fast search unit configured to generate a plurality of candidate words by performing a Viterbi search using a specific number of Gaussian distributions (hereinafter referred to as "N1") included in the pool with respect to the input voice;

An N-best candidate generator configured to extract candidate words in an order of high reliability among the candidate words; And

The precision search unit outputs a recognized word by performing a Viterbi search using a specific number of Gaussian distributions (hereinafter referred to as "N2") included in the pool with respect to the extracted candidate words.

Speech recognition device using a two-step search comprising.

The method of claim 1,

N1 is less than or equal to the N2 voice recognition device using a two-step search.

The method of claim 2,

The fast search unit

A Mahalanobis distance calculator for calculating Mahalanobis distance values for all Gaussian distributions in the pool;

A Gaussian selector which selects the N1 Gaussian distributions in order of the Mahalanobis distance values among the Gaussian distributions included in the pool; And

And a log sum calculator configured to perform a log sum operation using the selected Gaussian distributions.

The method of claim 3, wherein

Further comprising a Gaussian cache storage for storing the calculated Mahalanobis distance values,

The precision search unit

And a Gaussian cache application unit for reading Mahalanobis distance values from the Gaussian cache storage unit.

The method of claim 3, wherein

The Mahalanobis distance calculation unit

And a Mahalanobis distance value between all the Gaussian distributions and the feature vector extracted for each frame of the voice.

The apparatus of claim 3,

wherein the precision search unit comprises:

a log sum calculator configured to perform a log sum operation using the N2 Gaussian distributions; and

a 1-best search unit configured to extract, using the log sum operation and the Mahalanobis distance values, the one word having the highest likelihood among the candidate words, and to output the extracted word as the recognized word.

The apparatus of claim 2,

wherein the N-best candidate generator comprises:

an NLLR verification unit configured to calculate the reliability of each candidate word and to select the candidate words whose reliability is greater than or equal to a threshold value; and

a search space generator configured to output the selected candidate words to the precision search unit.

The apparatus of claim 7,

wherein the NLLR verification unit

calculates a normalized log likelihood ratio (NLLR) for each candidate word and selects the candidate words whose normalized log likelihood ratio is greater than or equal to a threshold value.
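The threshold test can be sketched in Python as below. The NLLR values are taken as given inputs, since how they are computed is not shown here; the function name `select_candidates`, the example words, the scores, and the threshold are all made up for illustration.

```python
def select_candidates(nllr_scores, threshold):
    """Keep the words whose score is >= threshold, highest score first."""
    kept = [(word, s) for word, s in nllr_scores.items() if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in kept]

# Hypothetical NLLR values produced by the fast search.
nllr_scores = {"seoul": -0.1, "seoul station": -0.4, "busan": -2.5}
search_space = select_candidates(nllr_scores, threshold=-0.5)
```

A stricter threshold shrinks the search space handed to the precision search and a looser one enlarges it, so in this sketch the threshold directly controls the speed and accuracy trade-off.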

The apparatus of claim 2,

wherein the fast search unit and the precision search unit

apply the same acoustic model.

The apparatus of claim 2,

wherein the precision search unit

performs the Viterbi search and outputs the recognized word after the speech is completed.

A speech recognition method using a two-stage search, comprising:

generating a plurality of candidate words by performing, on input speech, a Viterbi search using a specific number (hereinafter "N1") of Gaussian distributions included in a pool;

extracting candidate words from the plurality of candidate words in descending order of reliability; and

outputting a recognized word by performing, on the extracted candidate words, a Viterbi search using a specific number (hereinafter "N2") of Gaussian distributions included in the pool.

The method of claim 11,

wherein N1 is less than or equal to N2.
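The overall two-stage flow with N1 less than or equal to N2 can be sketched as follows. This is a schematic only: the callable `score(word, n)` stands in for a full Viterbi search that uses `n` Gaussian distributions, `reliability` stands in for the reliability measure, and returning `None` when no candidate passes the threshold is an assumption. The function names, words, and scores are hypothetical.

```python
def two_stage_recognize(vocabulary, score, reliability, n1, n2, threshold):
    """Fast pass over the vocabulary, prune by reliability, precise pass on survivors."""
    if n1 > n2:
        raise ValueError("expected n1 <= n2")
    # Stage 1: cheap scoring of every word with n1 Gaussians.
    fast_scores = {word: score(word, n1) for word in vocabulary}
    # Candidate extraction: keep words whose reliability clears the threshold.
    confidence = reliability(fast_scores)
    candidates = [w for w in vocabulary if confidence[w] >= threshold]
    if not candidates:
        return None  # assumption: behaviour for an empty candidate list
    # Stage 2: costlier scoring with n2 Gaussians, restricted to the candidates.
    return max(candidates, key=lambda w: score(w, n2))

# Made-up scores standing in for Viterbi log likelihoods.
table = {
    ("call home", 2): -10.0, ("call hans", 2): -11.0, ("play music", 2): -30.0,
    ("call home", 8): -9.5, ("call hans", 8): -9.0, ("play music", 8): -28.0,
}

def toy_score(word, n):
    return table[(word, n)]

def gap_to_best(scores):
    """Hypothetical reliability: each score minus the best score."""
    best = max(scores.values())
    return {word: s - best for word, s in scores.items()}

vocabulary = ["call home", "call hans", "play music"]
result = two_stage_recognize(vocabulary, toy_score, gap_to_best, 2, 8, -5.0)
```

In this toy run the fast pass ranks "call home" first, but the precise pass, run only on the two surviving candidates, prefers "call hans".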

The method of claim 12,

wherein generating the candidate words comprises:

calculating Mahalanobis distance values for all Gaussian distributions included in the pool;

selecting the N1 Gaussian distributions, from among the Gaussian distributions included in the pool, in ascending order of the Mahalanobis distance values; and

performing a log sum operation using the selected Gaussian distributions.

The method of claim 13,

wherein calculating the Mahalanobis distance values comprises

calculating Mahalanobis distance values between the feature vector extracted for each frame of the speech and all the Gaussian distributions.

The method of claim 13,

wherein outputting the recognized word comprises:

performing a log sum operation using the N2 Gaussian distributions; and

extracting, using the log sum operation and the Mahalanobis distance values, the one word having the highest likelihood among the candidate words, and outputting the extracted word as the recognized word.

The method of claim 12,

wherein extracting the candidate words comprises:

calculating the reliability of each candidate word; and

selecting the candidate words whose reliability is greater than or equal to a threshold value.

The method of claim 16,

wherein calculating the reliability comprises

calculating a normalized log likelihood ratio of each candidate word.

The method of claim 17,

wherein calculating the normalized log likelihood ratio

uses an equation in which T is the number of frames of the speech to be recognized, NLLRv is the normalized log likelihood ratio of the v-th candidate word, LKv is the likelihood of the v-th candidate word, LLv is the log likelihood of the v-th candidate word, LKmax is the maximum likelihood over all candidate words, and LLmax is the maximum log likelihood over all candidate words.

The method of claim 12,

wherein generating the candidate words and outputting the recognized word

apply the same acoustic model.

The method of claim 12,

wherein outputting the recognized word comprises

performing the Viterbi search and outputting the recognized word after the speech is completed.

A computer-readable recording medium having recorded thereon a program for executing the method according to any one of claims 11 to 20.