KR101892736B1

KR101892736B1 - Apparatus and method for utterance verification based on word duration

Info

Publication number: KR101892736B1
Application number: KR1020150035245A
Authority: KR
Inventors: 최우용; 박전규; 이윤근
Original assignee: 한국전자통신연구원
Priority date: 2015-03-13
Filing date: 2015-03-13
Publication date: 2018-08-28
Also published as: KR20160109942A

Abstract

An utterance verification apparatus using real-time per-word duration modeling for a large-vocabulary speech recognition system according to the present invention includes: a training signal management unit that calculates, from a received training signal, the mean duration of each phoneme and the mean and variance of the duration of each word, and generates a regression model by modeling the relationship between the mean and variance of the word durations through regression analysis; a speech signal processing unit that applies a received speech signal to a context-dependent phoneme model and outputs a speech recognition result; and an utterance verification unit that, depending on a preset word-frequency threshold, generates a confidence measure based on either the per-word duration mean and variance or the per-phoneme mean durations and the regression model.

Description

APPARATUS AND METHOD FOR UTTERANCE VERIFICATION BASED ON WORD DURATION

The present invention relates to speech recognition, and more particularly to utterance verification for speech recognition.

A typical speech recognition system compares speech uttered by a user against a pre-registered database and takes the entry whose speech features are most similar as the recognition result. In this process, speech with similar features can be misrecognized. Moreover, even when speech corresponding to unregistered data is input, the most similar entry is still chosen as the recognition result, which can produce errors. Techniques for utterance verification in speech recognition systems are therefore being developed to increase the reliability of recognition results.

The purpose of utterance verification in a speech recognition system is to remove misrecognized words or sentences from the recognition result. Commonly used utterance verification methods include one based on log-likelihood ratios computed with an anti-phoneme model and one based on per-word duration modeling. However, when a conventional utterance verification method models durations on a per-word basis in a large-vocabulary speech recognition system, the training data is insufficient and most words are never trained. The per-word duration distribution must therefore be estimated through phoneme (triphone) based duration modeling, which raises two problems. First, some phonemes occur too rarely in the training data for their variance estimates to be trusted. Second, phoneme durations follow a gamma distribution, and even if the durations of the phonemes are assumed to be mutually independent, their sum, the word duration, cannot be said to follow a gamma distribution, so the probability density function of the word duration cannot be derived. Korean Patent Laid-Open Publication No. 10-2007-0061266 describes an improved technique for utterance verification, but it only raises reliability by automatically updating the threshold value and does not solve the problems described above.

Korean Patent Laid-Open Publication No. 10-2007-0061266

The problem to be solved by the present invention is to provide an utterance verification apparatus and method capable of modeling per-word durations in real time in a large-vocabulary speech recognition system.

An utterance verification apparatus using real-time per-word duration modeling for a large-vocabulary speech recognition system according to the present invention includes: a training signal management unit that calculates, from a received training signal, the mean duration of each phoneme and the mean and variance of the duration of each word, and generates a regression model by modeling the relationship between the mean and variance of the word durations through regression analysis; a speech signal processing unit that applies a received speech signal to a context-dependent phoneme model and outputs a speech recognition result; and an utterance verification unit that, depending on a preset word-frequency threshold, generates a confidence measure based on either the per-word duration mean and variance or the per-phoneme mean durations and the regression model.
The training signal management unit includes: a mean calculation unit that calculates the mean duration of each phoneme from the received training signal; a word duration model generation unit that estimates the per-word duration distribution by calculating the mean and variance of the duration of each word whose frequency in the received training signal is at or above a set threshold; a regression analysis unit that performs regression analysis on the mean and variance data of the per-word duration models calculated by the word duration model generation unit, thereby modeling the relationship between the mean and variance of the word durations; and a database that stores the per-phoneme mean durations calculated by the mean calculation unit and the model resulting from the regression analysis in the regression analysis unit.
The speech signal processing unit includes: a preprocessing unit that performs noise processing and speech-segment detection on the received speech signal and outputs speech data; a speech recognition unit that performs speech recognition by applying a previously trained and stored context-dependent phoneme model to the speech data output from the preprocessing unit, and provides the speech recognition result to the utterance verification unit; and a phoneme model storage unit that stores the trained context-dependent phoneme model and provides it to the speech recognition unit.
The utterance verification unit includes: a confidence measurement unit that compares the frequency of a word recognized in the speech recognition result of the speech recognition unit with a preset threshold and, according to the comparison, measures the confidence of the speech recognition based on either the per-word duration mean and variance or the per-phoneme mean durations and the regression model; an anti-phoneme model storage unit that stores the anti-phoneme model used by the confidence measurement unit for confidence measurement; and a determination unit that compares the speech recognition confidence value measured by the confidence measurement unit with a preset threshold to determine whether the speech recognition result is normal.
The confidence measurement unit calculates the log-likelihood ratio between the context-dependent phoneme model and the anti-phoneme model stored in the anti-phoneme model storage unit, calculates a per-word duration score, and measures the confidence of a word as a weighted sum of the calculated per-word duration score and the log-likelihood ratio.
Meanwhile, an utterance verification method using real-time per-word duration modeling for a large-vocabulary speech recognition system according to the present invention includes: calculating, from a received training signal, the mean duration of each phoneme and the mean and variance of the duration of each word, and generating a regression model by modeling the relationship between the mean and variance of the word durations through regression analysis; applying a received speech signal to a context-dependent phoneme model and outputting a speech recognition result; and, depending on a preset word-frequency threshold, generating a confidence measure based on either the per-word duration mean and variance or the per-phoneme mean durations and the regression model.
The step of generating the regression model includes: calculating the mean duration of each phoneme from the received training signal; estimating the per-word duration distribution by calculating the mean and variance of the duration of each word whose frequency in the received training signal is at or above a set threshold; performing regression analysis on the calculated mean and variance data of the per-word duration models, thereby modeling the relationship between the mean and variance of the word durations; and storing the calculated per-phoneme mean durations and the model resulting from the regression analysis.
The step of outputting the speech recognition result includes: performing noise processing and speech-segment detection on the received speech signal and outputting speech data; and performing speech recognition by applying a previously trained and stored context-dependent phoneme model to the output speech data, and then outputting the speech recognition result.
The step of generating the confidence measure includes: comparing the frequency of a word recognized in the speech recognition result with a preset threshold and, according to the comparison, measuring the confidence of the speech recognition based on either the per-word duration mean and variance or the per-phoneme mean durations and the regression model; and comparing the measured speech recognition confidence value with a preset threshold to determine whether the speech recognition result is normal.
The step of measuring the confidence includes: calculating the log-likelihood ratio between the context-dependent phoneme model and the anti-phoneme model stored in the anti-phoneme model storage unit, and calculating a per-word duration score; and measuring the confidence of a word as a weighted sum of the calculated per-word duration score and the log-likelihood ratio.

Whereas the conventional technique cannot model durations for every word because the recognition vocabulary in large-vocabulary speech recognition is so large, the utterance verification apparatus and method using real-time per-word duration modeling for a large-vocabulary speech recognition system according to the present invention use regression analysis to model the duration distribution even for words absent from the training signal. In addition, by estimating word duration distributions from the high-frequency words in the training signal, the present invention can increase the reliability of utterance verification in a large-vocabulary speech recognition system.

FIG. 1 is a block diagram illustrating an embodiment of an utterance verification apparatus using real-time per-word duration modeling for a large-vocabulary speech recognition system according to the present invention.
FIG. 2 is a flowchart illustrating the processing performed by the training signal management unit 110 of the utterance verification apparatus 100 using real-time per-word duration modeling for a large-vocabulary speech recognition system according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating the processing performed by the speech signal processing unit 120 of the utterance verification apparatus 100 using real-time per-word duration modeling for a large-vocabulary speech recognition system according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating the processing performed by the utterance verification unit 130 of the utterance verification apparatus 100 using real-time per-word duration modeling for a large-vocabulary speech recognition system according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The terms and words used in this specification were selected in consideration of their functions in the embodiments, and their meanings may vary according to the intent of the invention or customary practice. Accordingly, a term used in the embodiments below follows its definition where this specification defines it specifically, and otherwise should be interpreted in the sense generally recognized by those skilled in the art.

FIG. 1 is a block diagram illustrating an embodiment of an utterance verification apparatus using real-time per-word duration modeling for a large-vocabulary speech recognition system according to the present invention.

Referring to FIG. 1, the utterance verification apparatus 100 using real-time per-word duration modeling for a large-vocabulary speech recognition system includes a training signal management unit 110, a speech signal processing unit 120, and an utterance verification unit 130. The training signal management unit 110 estimates a regression model from the training signal as a preliminary process for utterance verification, and includes a mean calculation unit 111, a word duration model generation unit 112, a regression analysis unit 113, and a database 114. The speech signal processing unit 120 recognizes a speech signal produced by a user utterance, and includes a preprocessing unit 121, a speech recognition unit 122, and a phoneme model storage unit 123. The utterance verification unit 130 includes a confidence measurement unit 131, an anti-phoneme model storage unit 132, and a determination unit 133.

The mean calculation unit 111 calculates the mean duration of each phoneme from the received training signal. The training signal is the signal used to generate the regression model before speech signals from user utterances are recognized. The mean calculation unit 111 stores the calculated per-phoneme mean durations in the database 114.
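As a minimal sketch of the per-phoneme averaging described above, the following assumes the training signal has already been force-aligned into (phoneme, duration) pairs. The function name `phoneme_mean_durations`, the input layout, and the sample values are hypothetical and not taken from the patent.

```python
from collections import defaultdict


def phoneme_mean_durations(alignments):
    """Return {phoneme: mean duration} from (phoneme, duration_seconds) pairs.

    The alignment format is an assumption of this sketch; the source only
    states that a mean duration is calculated for each phoneme.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for phoneme, duration in alignments:
        totals[phoneme] += duration
        counts[phoneme] += 1
    return {p: totals[p] / counts[p] for p in totals}


# Hypothetical aligned training data (durations in seconds).
alignments = [("a", 0.08), ("b", 0.05), ("a", 0.12)]
phoneme_means = phoneme_mean_durations(alignments)
```

The resulting dictionary plays the role of the per-phoneme means that the text says are stored in the database 114.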

The word duration model generation unit 112 estimates the per-word duration distribution by calculating the mean and variance of the word duration (the word duration model) for each word whose frequency in the training signal is at or above a threshold. The word duration is the time taken to utter the word. The word duration model is the probability distribution of the word duration, for which a gamma distribution or a normal distribution can generally be used. The frequency threshold can be determined according to the characteristics of the training signal.
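The frequency-filtered estimation above could be sketched as follows. The name `word_duration_models`, the `min_count` parameter, and the use of the population variance are assumptions of this sketch; the source does not say which variance estimator is used.

```python
from collections import defaultdict
from statistics import mean, pvariance


def word_duration_models(word_durations, min_count):
    """Return {word: (mean, variance)} for words seen at least min_count times.

    word_durations is an iterable of (word, duration_seconds) pairs.
    Population variance is used here as an assumption of the sketch.
    """
    by_word = defaultdict(list)
    for word, duration in word_durations:
        by_word[word].append(duration)
    return {
        word: (mean(durations), pvariance(durations))
        for word, durations in by_word.items()
        if len(durations) >= min_count
    }


# Hypothetical training observations: "hello" is frequent, "rare" is not.
samples = [("hello", 0.4), ("hello", 0.6), ("hello", 0.5), ("rare", 0.3)]
models = word_duration_models(samples, min_count=3)
```

Words below the threshold are simply left out of the result, which matches the text's point that their variance estimates would not be trustworthy.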

The regression analysis unit 113 performs regression analysis on the mean and variance data of the per-word duration models calculated by the word duration model generation unit 112, thereby modeling the relationship between the mean and variance of the word durations. It treats the mean calculated by the word duration model generation unit 112 as the independent variable and the variance as the dependent variable, and performs linear or nonlinear regression depending on the data. The regression analysis unit 113 then stores the resulting regression model in the database 114.
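A minimal sketch of the linear case, with the mean as the independent variable and the variance as the dependent variable, is given below. It uses ordinary least squares in closed form; the function name `fit_mean_variance_regression` and the sample data are hypothetical, and the nonlinear option mentioned in the text is not covered.

```python
def fit_mean_variance_regression(means, variances):
    """Fit variance = slope * mean + intercept by ordinary least squares."""
    n = len(means)
    if n < 2 or n != len(variances):
        raise ValueError("need at least two (mean, variance) pairs")
    mean_x = sum(means) / n
    mean_y = sum(variances) / n
    sxx = sum((x - mean_x) ** 2 for x in means)
    if sxx == 0:
        raise ValueError("all means are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(means, variances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (mean, variance) pairs of frequent words, in s and s^2.
word_means = [0.2, 0.4, 0.6]
word_variances = [0.002, 0.004, 0.006]
slope, intercept = fit_mean_variance_regression(word_means, word_variances)
```

The pair `(slope, intercept)` stands in for the regression model that the text says is stored in the database 114.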

The database (DB) 114 stores the per-phoneme mean durations received from the mean calculation unit 111, the regression analysis result from the regression analysis unit 113, and the mean and variance data of the word duration models.

The preprocessing unit 121 performs noise processing and speech-segment detection on the received speech signal and outputs speech data. A speech signal produced by a user utterance can contain various kinds of noise depending on the surrounding environment. The preprocessing unit 121 therefore removes noise from the received speech signal, discards unnecessary segments to detect the speech segment, and generates the speech data, which it passes to the speech recognition unit 122.

The speech recognition unit 122 performs speech recognition by applying the speech data to the previously trained context-dependent phoneme model received from the phoneme model storage unit 123, and outputs the speech recognition result. It also outputs the duration of each word based on the recognition result. In a phoneme model, each registered model represents a phoneme, whereas in a word model each registered model represents a word. For example, to recognize the word '아버지' (father), a word model registers the three words '아버지' (father), '어머니' (mother), and '아들' (son) and then decides which word the input speech most resembles. A phoneme model instead registers 'ㅂ', 'ㅈ', 'ㅁ', 'ㄴ', 'ㄷ', 'ㄹ', 'ㅏ', 'ㅣ', 'ㅓ', and 'ㅡ', and recognizes '아버지' when the input speech is recognized as 'ㅏ', 'ㅂ', 'ㅓ', 'ㅈ', 'ㅣ'.

The phoneme model storage unit 123 stores the context-dependent phoneme model and passes it to the speech recognition unit 122. A context-independent phoneme model treats a phoneme as the same unit regardless of which phonemes precede or follow it, whereas a context-dependent phoneme model treats it as a different unit when the neighboring phonemes differ. For example, modeling the 'ㅓ' in '아버지' and the 'ㅓ' in '어머니' as the same phoneme is context-independent, and modeling them as different phonemes is context-dependent.
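To make the context-dependent distinction concrete, the sketch below expands a phoneme sequence into triphone labels. The `left-center+right` label notation, the `sil` boundary symbol, and the romanized phoneme symbols are conventions assumed for illustration; the source does not specify a labeling scheme.

```python
def to_triphones(phonemes, boundary="sil"):
    """Expand a phoneme sequence into left-center+right triphone labels."""
    labels = []
    for i, center in enumerate(phonemes):
        left = phonemes[i - 1] if i > 0 else boundary
        right = phonemes[i + 1] if i < len(phonemes) - 1 else boundary
        labels.append(f"{left}-{center}+{right}")
    return labels


# Romanized phonemes (assumed transcription) for '아버지' and '어머니'.
father = to_triphones(["a", "b", "eo", "j", "i"])
mother = to_triphones(["eo", "m", "eo", "n", "i"])

# The vowel 'eo' at index 2 is one unit in a context-independent model,
# but two distinct units here because its neighbors differ.
eo_in_father = father[2]
eo_in_mother = mother[2]
```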

The confidence measurement unit 131 receives the speech recognition result from the speech recognition unit 122 and compares the frequency of the recognized word with a preset threshold (the word-frequency threshold). The threshold can be determined according to the characteristics of the training signal. If the frequency of the recognized word is greater than the preset threshold, the confidence measurement unit 131 receives the per-word duration mean and variance from the database 114 and uses them to calculate the parameters of the gamma distribution as in Equation 1.

In Equation 1, Var(X) denotes the variance of the word duration and E(X) denotes the mean of the word duration.
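Equation 1 itself is not reproduced in this text. As a hedged illustration only, the sketch below uses the standard method-of-moments relations for a gamma distribution with shape k and scale θ, where E(X) = kθ and Var(X) = kθ², so that k = E(X)² / Var(X) and θ = Var(X) / E(X). Whether the patent parameterizes the distribution by scale or by rate is not known from the source, and the function name is hypothetical.

```python
def gamma_params_from_moments(mean, variance):
    """Return (shape, scale) of a gamma distribution matching mean and variance.

    Uses E(X) = shape * scale and Var(X) = shape * scale**2.
    The shape/scale parameterization is an assumption of this sketch.
    """
    if mean <= 0 or variance <= 0:
        raise ValueError("mean and variance must be positive")
    shape = mean * mean / variance
    scale = variance / mean
    return shape, scale


# Hypothetical word with mean duration 0.5 s and variance 0.05 s^2.
shape, scale = gamma_params_from_moments(0.5, 0.05)
```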

If the frequency of the recognized word is smaller than the preset threshold, the confidence measurement unit 131 receives the per-phoneme mean durations and the regression model from the database 114. It calculates the mean word duration as the sum of the mean durations of the word's phonemes, and applies that mean to the regression model to calculate the variance of the word duration. Estimating the distribution (mean and variance) of a word's duration directly requires the word to occur in the training data at least as often as the threshold. For words at or above the threshold, the distribution estimated during training can be used; for words below it, the mean is estimated as the sum of the phoneme mean durations and the variance is estimated by regression. The confidence measurement unit 131 then calculates the gamma distribution parameters using Equation 2.
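The two branches described above might be sketched as follows, assuming a linear regression model and assuming that only words meeting the frequency threshold were stored during training. All names (`duration_moments`, `word_models`, `phoneme_means`, `slope`, `intercept`) and values are hypothetical.

```python
def duration_moments(word, phonemes, word_models, phoneme_means,
                     slope, intercept):
    """Return (mean, variance) of the duration of a recognized word.

    word_models holds {word: (mean, variance)} only for frequent words,
    so membership stands in for the frequency-threshold comparison.
    """
    if word in word_models:
        # Frequent word: use the distribution estimated during training.
        return word_models[word]
    # Infrequent or unseen word: sum the phoneme means, then predict the
    # variance from the mean with the (assumed linear) regression model.
    mean = sum(phoneme_means[p] for p in phonemes)
    variance = slope * mean + intercept
    return mean, variance


word_models = {"hello": (0.5, 0.006)}
phoneme_means = {"a": 0.10, "b": 0.05, "eo": 0.09, "j": 0.06, "i": 0.10}

seen = duration_moments("hello", ["h", "e", "l", "o"], word_models,
                        phoneme_means, slope=0.01, intercept=0.001)
unseen = duration_moments("abeoji", ["a", "b", "eo", "j", "i"], word_models,
                          phoneme_means, slope=0.01, intercept=0.001)
```

A nonlinear regression model would replace only the line that computes `variance`.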

The confidence measurement unit 131 then calculates a per-word duration score from the word duration output by the speech recognition unit 122 and the calculated gamma distribution parameters. The per-word duration score indicates how trustworthy the recognition result is in light of the duration of the utterance, and serves as one of the measures for accepting or rejecting the recognition result. In the present invention, the weighted sum of the anti-phoneme model log-likelihood ratio and the per-word duration score can be used as the utterance verification measure.

The confidence measurement unit 131 calculates the log-likelihood ratio (LLR) between the context-dependent phoneme model and the anti-phoneme model. It then measures the confidence of the word (that is, produces the confidence measure) as the weighted sum of the per-word duration score and the log-likelihood ratio.
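The source does not give the exact form of the duration score or the weights. As one possible reading, the sketch below takes the duration score to be the log of the gamma density at the observed duration and combines it with a precomputed log-likelihood ratio using a single weight. The function names, the `weight` parameter, and the choice of log density are all assumptions.

```python
import math


def gamma_log_pdf(x, shape, scale):
    """Log density of a gamma distribution (shape/scale form) at x > 0."""
    if x <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("x, shape and scale must be positive")
    return ((shape - 1.0) * math.log(x) - x / scale
            - shape * math.log(scale) - math.lgamma(shape))


def word_confidence(duration, shape, scale, llr, weight=0.5):
    """Weighted sum of the duration score and the log-likelihood ratio.

    weight applies to the duration score and (1 - weight) to the LLR;
    this weighting scheme is an assumption of the sketch.
    """
    duration_score = gamma_log_pdf(duration, shape, scale)
    return weight * duration_score + (1.0 - weight) * llr


# With shape = scale = 1 the gamma density is exp(-x), so the log density
# at x = 2 is exactly -2.
score = gamma_log_pdf(2.0, 1.0, 1.0)
confidence = word_confidence(2.0, 1.0, 1.0, llr=4.0, weight=0.5)
```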

The anti-phoneme model storage unit 132 stores a context-independent anti-phoneme model (anti-phone model), which is the counterpart of the phoneme model. The context-independent anti-phoneme model can be an anti-model using all mixtures (all-mixture anti-model), an adapted anti-model, a discriminatively trained anti-model (discriminative anti-model), a vector quantization (VQ) based anti-model, and so on.

The determination unit 133 compares the confidence measure calculated by the confidence measurement unit 131 with a preset confidence threshold to determine whether the recognition result is normal, and decides whether to accept or reject it. If the confidence measure calculated from the received speech signal is greater than the confidence threshold, the determination unit 133 judges the recognition result to be normal and accepts it. If the confidence measure is smaller than the confidence threshold, the determination unit 133 judges the recognition result to be abnormal and rejects it.
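The accept/reject rule above reduces to a single comparison, sketched below. The function name and values are hypothetical, and the source does not say what happens when the confidence equals the threshold, so rejecting in that case is an assumption.

```python
def verify_utterance(confidence, threshold):
    """Return True to accept the recognition result, False to reject it.

    Equality is treated as a rejection; the source leaves this case open.
    """
    return confidence > threshold


accepted = verify_utterance(confidence=1.0, threshold=0.2)
rejected = verify_utterance(confidence=-3.5, threshold=0.2)
```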

FIG. 2 is a flowchart illustrating the processing performed by the training signal management unit 110 of the utterance verification apparatus 100 using real-time per-word duration modeling for a large-vocabulary speech recognition system according to an embodiment of the present invention.

As a preliminary process for utterance verification, the training signal management unit 110 of the utterance verification apparatus 100 estimates a regression model from the training signal, and calculates and stores the mean and variance of the duration of each word and the mean duration of each phoneme.

To this end, the training signal management unit 110 first receives the training signal (S201). The training signal is the signal used to generate the regression model before speech signals from user utterances are recognized. The training signal management unit 110 calculates the mean duration of each phoneme from the received training signal (S202) and stores the calculated per-phoneme mean durations in the database 114 (S203).

The training signal management unit 110 then compares the word frequency in the received training signal with the preset word-frequency threshold to determine whether the word frequency exceeds the threshold (S204). If the word frequency does not exceed the threshold (that is, it is at or below the threshold), the training signal management unit 110 ends the process.

On the other hand, if the word frequency in the received training signal is determined to exceed the preset word-frequency threshold, the training signal management unit 110 calculates the mean and variance of the word duration (S205). That is, for words whose frequency exceeds the threshold, the training signal management unit 110 calculates the mean and variance of each word's duration and thereby estimates the duration distribution of each word. The word-frequency threshold may be determined according to the characteristics of the training signal. The training signal management unit 110 then stores the calculated mean and variance of each word's duration in the database 114 (S206).
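The frequency check and the per-word statistics of steps S204 to S206 might look like the following sketch. The (word, duration) input format and the function name are hypothetical, and the choice of population variance (rather than sample variance) is an assumption, since the description does not say which estimator is used.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def word_duration_stats(word_durations, frequency_threshold):
    """Return {word: (mean, variance)} for sufficiently frequent words.

    word_durations: iterable of (word, duration_in_seconds) pairs
    (an assumed input format). Words whose count does not exceed
    frequency_threshold are skipped, mirroring step S204.
    """
    by_word = defaultdict(list)
    for word, duration in word_durations:
        by_word[word].append(duration)
    return {
        word: (fmean(ds), pvariance(ds))  # population variance: an assumption
        for word, ds in by_word.items()
        if len(ds) > frequency_threshold
    }
```

Words filtered out here have no stored mean or variance, which is why the verification stage needs a fallback for them.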

The training signal management unit 110 then performs regression analysis on the mean and variance data of the word durations (the word duration models) to model the relationship between the mean and the variance of the word duration (S207). The regression uses the mean as the independent variable and the variance as the dependent variable, and either linear or nonlinear regression is performed depending on the data. Once the regression model has been generated, the training signal management unit 110 stores it in the database 114 (S208).
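As an illustration of step S207, the following sketch fits the linear case by ordinary least squares, with the mean as the independent variable and the variance as the dependent variable. The description also allows nonlinear regression, which is not shown; the function names and the (slope, intercept) representation of the model are assumptions made for this example.

```python
def fit_linear_regression(means, variances):
    """Least-squares fit of variance = slope * mean + intercept.

    Requires at least two distinct mean values.
    """
    n = len(means)
    mean_x = sum(means) / n
    mean_y = sum(variances) / n
    sxx = sum((x - mean_x) ** 2 for x in means)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(means, variances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_variance(model, mean):
    """Apply the fitted model to a word-duration mean."""
    slope, intercept = model
    return slope * mean + intercept
```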

FIG. 3 is a flowchart illustrating the processing performed by the speech signal processing unit 120 of the utterance verification apparatus 100 using real-time word-duration modeling of a large-capacity speech recognition system according to an embodiment of the present invention.

Referring to FIG. 3, the speech signal processing unit 120 of the utterance verification apparatus 100 receives a speech signal produced by a user's utterance (S301). The speech signal processing unit 120 then performs preprocessing, which consists of noise processing and speech-section detection on the received speech signal (S302). A speech signal from a user's utterance may contain various kinds of noise depending on the surrounding environment. The speech signal processing unit 120 therefore removes noise from the received speech signal, discards unnecessary sections, and detects the speech section to generate speech data. Next, the speech signal processing unit 120 performs speech recognition by applying the preprocessed speech signal (the speech data) to a previously learned context-dependent phoneme model (S303). The speech signal processing unit 120 passes the speech recognition result to the utterance verification unit 130 (S304).

FIG. 4 is a flowchart illustrating the processing performed by the utterance verification unit 130 of the utterance verification apparatus 100 using real-time word-duration modeling of a large-capacity speech recognition system according to an embodiment of the present invention.

Referring to FIG. 4, the utterance verification unit 130 receives the speech recognition result from the speech signal processing unit 120 (S401). The utterance verification unit 130 then compares the frequency of each word recognized in the speech recognition result with the preset word-frequency threshold (S402). If the frequency of the recognized word is greater than the threshold, the utterance verification unit 130 retrieves the mean and variance of that word's duration from the database 114 (S403) and calculates the parameters of a gamma distribution from that mean and variance (S404).

If step S402 determines that the frequency of the recognized word is smaller than the preset threshold, the utterance verification unit 130 retrieves the mean phoneme durations and the regression model from the database 114 (S405). The utterance verification unit 130 calculates the mean word duration as the sum of the mean durations of the word's phonemes (S406) and applies that mean to the regression model to calculate the variance of the word duration (S407). It then calculates the gamma distribution parameters from the calculated mean and variance of the word duration (S404).
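The two branches of steps S402 to S407 and the parameter calculation of S404 can be sketched as follows. The description does not state how the gamma parameters are derived from the mean and variance; this sketch assumes moment matching, which follows from the gamma distribution's mean $k\theta$ and variance $k\theta^2$ for shape $k$ and scale $\theta$. All function names, the dictionary lookups standing in for the database 114, the linear (slope, intercept) form of the regression model, and the handling of a frequency exactly equal to the threshold are assumptions.

```python
def gamma_parameters(mean, variance):
    """Moment-matched gamma parameters (an assumed method).

    mean = shape * scale and variance = shape * scale**2, hence:
    """
    shape = mean * mean / variance
    scale = variance / mean
    return shape, scale


def duration_mean_and_variance(word, phonemes, word_count, threshold,
                               word_stats, phoneme_means, regression):
    """Pick the statistics source according to word frequency.

    word_stats: {word: (mean, variance)} for frequent words.
    phoneme_means: {phoneme: mean_duration}.
    regression: (slope, intercept) of a linear variance-vs-mean model.
    """
    if word_count > threshold:
        # Frequent word: use its stored statistics (S403).
        return word_stats[word]
    # Infrequent word (equality treated as infrequent, an assumption):
    # sum the phoneme means (S406), then predict the variance (S407).
    mean = sum(phoneme_means[p] for p in phonemes)
    slope, intercept = regression
    return mean, slope * mean + intercept
```

With this arrangement a word needs no stored duration statistics of its own, only the mean durations of its phonemes and the shared regression model.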

The utterance verification unit 130 then obtains the duration of each word from the received speech recognition result (S408) and calculates a per-word duration score from that duration and the calculated gamma distribution parameters (S409).
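One plausible form of the duration score in step S409 is the log of the gamma probability density evaluated at the observed word duration. The description does not specify the scoring function, so the log-density used here, the function name, and the shape/scale parameterization are assumptions made for illustration.

```python
import math


def gamma_log_density(duration, shape, scale):
    """Log of the gamma pdf at a positive duration.

    pdf(x) = x**(shape - 1) * exp(-x / scale)
             / (Gamma(shape) * scale**shape)
    """
    return ((shape - 1.0) * math.log(duration)
            - duration / scale
            - math.lgamma(shape)
            - shape * math.log(scale))
```

Under this assumed score, a recognized word whose duration lies far from the typical duration of that word receives a lower score than one close to it.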

Next, the utterance verification unit 130 calculates the log-likelihood ratio between the context-dependent phoneme model and the anti-phoneme model (S410). Once the log-likelihood ratio has been calculated, the utterance verification unit 130 measures the reliability of the word from the weighted sum of the per-word duration score and the log-likelihood ratio, producing a reliability measurement value (S411). The utterance verification unit 130 then compares the reliability measurement value with a preset reliability threshold to determine whether the measurement exceeds the threshold (S412). If the reliability measurement value calculated from the received speech signal is greater than the reliability threshold, the utterance verification unit 130 judges the recognition result to be normal and accepts it (S413). If the reliability measurement value is smaller than the reliability threshold, the utterance verification unit 130 judges the recognition result to be abnormal and rejects it (S414).
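Steps S411 to S414 amount to a weighted sum followed by a threshold test, as in the sketch below. The description does not give the weights; the convex combination controlled by a single `weight` parameter, the function names, and the rejection of a value exactly equal to the threshold are all assumptions for illustration.

```python
def reliability(duration_score, log_likelihood_ratio, weight):
    """Weighted sum of the two scores (S411).

    weight in [0, 1] is a hypothetical tuning parameter; the remaining
    mass (1 - weight) goes to the log-likelihood ratio.
    """
    return weight * duration_score + (1.0 - weight) * log_likelihood_ratio


def accept_recognition(reliability_value, reliability_threshold):
    """Accept (S413) when the value exceeds the threshold, else reject (S414).

    Equality is treated as rejection here, which is an assumption.
    """
    return reliability_value > reliability_threshold
```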

The present invention as described above can be written as a computer program. The code and code segments constituting the program can be readily inferred by a computer programmer skilled in the art. The program may be stored in a computer-readable recording medium or information storage medium, and the method of the present invention is implemented when the program is read and executed by a computer. The recording medium includes all types of computer-readable recording media.

Although the present invention has been described in detail with reference to preferred embodiments, it is not limited to those embodiments, and various modifications may be made by those of ordinary skill in the art within the scope of the technical idea of the present invention.

100: utterance verification apparatus using real-time word-duration modeling of a large-capacity speech recognition system
110: training signal management unit 111: mean calculation unit
112: word duration model generation unit 113: regression analysis unit
114: database 120: speech signal processing unit
121: preprocessing unit 122: speech recognition unit
123: phoneme model storage unit 130: utterance verification unit
131: reliability measurement unit 132: anti-phoneme model storage unit
133: determination unit

Claims

1. An utterance verification apparatus using real-time word-duration modeling of a large-capacity speech recognition system, comprising:
a training signal management unit that calculates the mean duration of each phoneme and the mean and variance of the duration of each word in a received training signal, and generates a regression model by modeling the relationship between the mean and variance of the word durations through regression analysis;
a speech signal processing unit that applies a received speech signal to a context-dependent phoneme model and outputs a speech recognition result; and
an utterance verification unit that, depending on a preset word-frequency threshold, generates a reliability measurement value based either on the mean and variance of the word duration or on the mean phoneme durations and the regression model,
wherein the training signal management unit comprises:
a mean calculation unit that calculates the mean duration of each phoneme from the received training signal;
a word duration model generation unit that estimates the duration distribution of each word by calculating the mean and variance of the word duration for words in the received training signal whose frequency is equal to or higher than a threshold; and
a regression analysis unit that performs regression analysis on the mean and variance data of the word duration models calculated by the word duration model generation unit, thereby modeling the relationship between the mean and variance of the word durations.

2. The apparatus according to claim 1,
wherein the training signal management unit further comprises
a database that stores the mean phoneme durations calculated by the mean calculation unit and the model resulting from the regression analysis in the regression analysis unit.

3. The apparatus of claim 2,
wherein the speech signal processing unit comprises:
a preprocessing unit that performs noise processing and speech-section detection on the received speech signal and outputs speech data;
a speech recognition unit that performs speech recognition by applying the speech data output from the preprocessing unit to a previously learned and stored context-dependent phoneme model, and provides the speech recognition result to the utterance verification unit; and
a phoneme model storage unit that stores the learned context-dependent phoneme model and provides the stored context-dependent phoneme model to the speech recognition unit.

4. The apparatus of claim 3,
wherein the utterance verification unit comprises:
a reliability measurement unit that compares the frequency of a word recognized in the speech recognition result of the speech recognition unit with the preset threshold and, depending on the comparison, measures the reliability of the speech recognition based either on the mean and variance of the word duration or on the mean phoneme durations and the regression model;
an anti-phoneme model storage unit that stores an anti-phoneme model used for reliability measurement in the reliability measurement unit; and
a determination unit that determines whether the speech recognition result is normal by comparing the reliability measurement value produced by the reliability measurement unit with a preset threshold.

5. The apparatus of claim 4,
wherein the reliability measurement unit
calculates the log-likelihood ratio between the context-dependent phoneme model and the anti-phoneme model stored in the anti-phoneme model storage unit, calculates a per-word duration score, and measures the reliability of the word using the weighted sum of the calculated duration score and the log-likelihood ratio.

6. An utterance verification method using real-time word-duration modeling of a large-capacity speech recognition system, comprising:
generating a regression model by calculating the mean duration of each phoneme and the mean and variance of the duration of each word in a received training signal, and modeling the relationship between the mean and variance of the word durations through regression analysis;
applying a received speech signal to a context-dependent phoneme model and outputting a speech recognition result; and
generating, depending on a preset word-frequency threshold, a reliability measurement value based either on the mean and variance of the word duration or on the mean phoneme durations and the regression model,
wherein the step of generating the regression model comprises:
calculating the mean duration of each phoneme from the received training signal;
estimating the duration distribution of each word by calculating the mean and variance of the word duration for words in the received training signal whose frequency is equal to or higher than a threshold; and
performing regression analysis on the calculated mean and variance data of the word durations, thereby modeling the relationship between the mean and variance of the word durations.

7. The method according to claim 6,
wherein the step of generating the regression model further comprises
storing the model resulting from the regression analysis.

8. The method according to claim 6,
wherein the step of outputting the speech recognition result comprises:
performing noise processing and speech-section detection on the received speech signal and outputting speech data; and
performing speech recognition by applying the output speech data to a previously learned and stored context-dependent phoneme model, and outputting the speech recognition result.

9. The method according to claim 6,
wherein the step of generating the reliability measurement value comprises:
comparing the frequency of a word recognized in the speech recognition result with the preset threshold and, depending on the comparison, measuring the reliability of the speech recognition based either on the mean and variance of the word duration or on the mean phoneme durations and the regression model; and
verifying whether the speech recognition result is normal by comparing the measured reliability value with a preset threshold.

10. The method of claim 9,
wherein the step of measuring the reliability comprises:
calculating the log-likelihood ratio between the context-dependent phoneme model and a stored anti-phoneme model, and calculating a per-word duration score; and
measuring the reliability of the word using the weighted sum of the calculated per-word duration score and the log-likelihood ratio.