KR100819848B1

KR100819848B1 - Apparatus and method for speech recognition using automatic update of threshold for utterance verification

Info

Publication number: KR100819848B1
Application number: KR1020060077948A
Authority: KR
Inventors: 강점자; 전형배
Original assignee: 한국전자통신연구원
Priority date: 2005-12-08
Filing date: 2006-08-18
Publication date: 2008-04-08
Also published as: KR20070061266A

Abstract

The present invention provides highly reliable utterance verification in a speech recognition system. Because the confidence value used as the decision criterion for utterance verification is correlated with, and therefore affected by, environmental factors (channel characteristics, speaker characteristics, and so on), the invention automatically updates a threshold that would otherwise be fixed at a constant. The apparatus comprises: a preprocessing unit that performs noise processing and speech-segment detection on the input speech and outputs speech data; a speech recognition unit that applies the speech data to a pre-trained context-dependent phoneme model, performs speech recognition, and outputs speech information; an environmental factor parameter calculation unit that uses the speech information to extract environmental factor parameters and scores; an input parameter extraction unit that applies a trained context-independent anti-phoneme model, a phone duration model, and other information (likelihood, N-best information, and so on) to extract input parameters for measuring per-word confidence; a confidence measurement unit that computes a confidence measure from the input parameters; a threshold determination unit that computes and updates a new threshold by applying an environmental factor value obtained from the average of the scores; and a decision unit that uses the updated threshold to accept or reject the recognition result.

Speech recognition, utterance verification, threshold

Description

Apparatus and method for speech recognition using automatic update of threshold for utterance verification

FIG. 1 is a diagram showing the configuration of a speech recognition apparatus using automatic threshold update for utterance verification according to the present invention.

FIG. 2 is a flowchart showing a speech recognition method using automatic threshold update for utterance verification according to the present invention.

FIG. 3 is a flowchart showing in more detail how the prior data values needed to compute a new threshold reflecting the environmental factor characteristics are calculated in the speech recognition method according to the present invention.

FIG. 4 is a flowchart showing in more detail how a new threshold reflecting the environmental factor characteristics is computed and updated in the speech recognition method according to the present invention.

* Explanation of reference numerals for the main parts of the drawings *

10: central control unit
20: preprocessing unit
30: speech recognition unit
40: environmental factor parameter calculation unit
50: input parameter extraction unit
60: confidence measurement unit
70: threshold determination unit
80: decision unit
90: context-dependent phoneme model
100: context-independent phoneme model
110: classifier model

The present invention relates to a speech recognition system and, more particularly, to a speech recognition apparatus and method that use automatic threshold updating for utterance verification, which decides whether to accept or reject a speech recognition result.

In a conventional speech recognition system, when a user speaks, the system searches the pre-registered data for the entry whose acoustic characteristics are most similar and returns it as the recognition result. As a result, registered entries whose characteristics differ only slightly are sometimes confused with one another, and when speech corresponding to unregistered data is input, the system still picks the most similar entry and returns it as the result, which frequently causes errors. An utterance verification function is therefore performed: a confidence measure is computed for each recognition result to decide whether to accept or reject it.

Speech recognition systems have recently been applied in many industries, such as information and communications, information processing, home appliances, and automobiles. Accordingly, to obtain reliable recognition results, utterance verification technology that rejects results with a high probability of misrecognition, even for sentences within the recognition vocabulary, is becoming increasingly important. The usual utterance verification method compares the computed confidence value with a preset threshold: the recognition result is accepted if the confidence value is greater than the threshold and rejected if it is less than or equal to the threshold.

However, in a conventional speech recognition system the confidence value used for utterance verification is correlated with, and affected by, environmental factors (channel characteristics, speaker characteristics, and so on). A threshold fixed at a constant therefore cannot faithfully reflect the variety of conditions that occur in practice.

In addition, each time the environment or task changes, the anti-model used for the alternative hypothesis in utterance verification has to be regenerated. Moreover, the most suitable threshold must either be set in advance through experiments on data collected in the real environment or be reset by an operator through tests on data sampled in the real environment.

Furthermore, because the utterance verification threshold is sensitive to environmental factors, continuing to perform utterance verification with a threshold that was set only once is undesirable from the standpoint of recognition reliability.

The present invention has been devised to solve the above problems. Its object is to provide highly reliable utterance verification in a speech recognition system by analyzing the environmental factor characteristics that affect the confidence value and automatically updating the threshold.

To achieve the above object, a speech recognition apparatus according to the present invention comprises: a preprocessing unit that performs noise processing and speech-segment detection on the input speech and outputs speech data; a speech recognition unit that applies the speech data to a pre-trained context-dependent phoneme model, performs speech recognition, and outputs speech information; an environmental factor parameter extraction unit that uses the speech information to extract environmental factor parameters and scores; an input parameter extraction unit that applies at least one of a pre-trained context-independent anti-phoneme model, a phone duration model, likelihood, and N-best information to extract input parameters for measuring per-word confidence; a confidence measurement unit that, when several input parameters are extracted, computes the confidence measure using the trained model of a predefined classifier and, when there is a single input parameter, uses the extracted value directly as the confidence measure; a threshold determination unit that computes and updates a new threshold by applying an environmental factor value obtained from the average of the scores; and a decision unit that uses the updated threshold to accept or reject the recognition result. The environmental factor value is at least one of the signal-to-noise ratio (SNR), which is a channel characteristic, and the speech energy and F0 formant magnitude, which are speaker characteristics. The context-independent anti-phoneme model is at least one of an anti-model using all mixtures (all-mixture anti-model), an adapted anti-model, a discriminatively trained anti-model (discriminative anti-model), and a vector quantization (VQ) based anti-model.


Meanwhile, to achieve the above object, a speech recognition method according to the present invention comprises: (a) performing noise processing on the input speech and detecting the speech segment; (b) applying the detected speech data to a pre-trained context-dependent phoneme model and performing speech recognition through a Viterbi search; (c) applying the recognized speech information to at least one of a pre-trained context-independent anti-phoneme model, a phone duration model, likelihood, and N-best information to compute input parameter values and environmental factor parameters; (d) computing a confidence measure from the computed input parameter values and normalizing it; (e) computing and updating a new threshold by applying an environmental factor value computed from the environmental factor parameters; and (f) comparing the updated threshold with the computed confidence measure to decide whether to accept or reject the speech recognition result. The environmental factor value is at least one of the signal-to-noise ratio (SNR), which is a channel characteristic, and the speech energy and F0 formant magnitude, which are speaker characteristics. The context-independent anti-phoneme model is at least one of an anti-model using all mixtures (all-mixture anti-model), an adapted anti-model, a discriminatively trained anti-model (discriminative anti-model), and a vector quantization (VQ) based anti-model.

Preferably, the input parameters in step (c) may be obtained either by generating and using an externally required model, such as a context-independent anti-phoneme model, or by using internal information only. When a context-independent anti-phoneme model is used, various types of anti-phoneme model, such as an all-mixture anti-model, an adapted anti-model, a discriminative anti-model, and a VQ (vector quantization) based anti-model, are combined with internal information (likelihood, N-best information, and so on). The case in which only one input parameter exists can also be handled.

Preferably, when several input parameters are used, step (d) comprises: training to generate the classifier model required by the classifier; using the trained model to compute the confidence measure for the training data together with its mean and standard deviation; and normalizing the computed confidence measure. When a single input parameter is used, step (d) comprises computing the confidence measure, mean, and standard deviation directly from the training data, without a training process, and normalizing.

Preferably, step (e) comprises: (e1) computing and setting initial values of the environmental factor characteristics from the computed environmental factor parameters; (e2) computing, in real time, a current score for each environmental factor parameter based on the current environment, and normalizing it; (e3) multiplying each normalized current score by the corresponding correlation coefficient of the environmental factor characteristic, computed from the environmental factor parameters, to produce a new score; (e4) summing all the new scores and taking their overall average to obtain the environmental factor value; (e5) multiplying the environmental factor value by a specific adaptation coefficient and adding the product to, or subtracting it from, the existing threshold to compute a new threshold; and (e6) replacing the existing threshold with the new threshold.
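Steps (e2) to (e5) can be written compactly as follows. The symbols are notation introduced here for illustration and do not appear in the patent: $x_i$ is the current score of environmental factor $i$, $\mu_i$ and $\sigma_i$ are its mean and standard deviation from the training data, $\rho_i$ is its correlation coefficient with the confidence measure, $N$ is the number of factors used, $\alpha$ is the adaptation coefficient, and $\theta$ is the threshold. The text states only that the product is added or subtracted, so the sign is left open.

```latex
\begin{align}
z_i &= \frac{x_i - \mu_i}{\sigma_i} && \text{(e2) normalized current score} \\
s_i &= \rho_i \, z_i && \text{(e3) new score} \\
E &= \frac{1}{N} \sum_{i=1}^{N} s_i && \text{(e4) environmental factor value} \\
\theta_{\text{new}} &= \theta_{\text{old}} \pm \alpha E && \text{(e5) updated threshold}
\end{align}
```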

Preferably, the initial values of the environmental factor characteristics in (e1) are at least one of the initial values for the signal-to-noise ratio, the initial values for the speech energy, and the initial values for the F0 formant magnitude.

Preferably, the initial values are at least one of the following, obtained from training data: the correlation coefficient between the confidence measure and the signal-to-noise ratio, and the mean and standard deviation of the signal-to-noise ratio; the correlation coefficient between the confidence measure and the speech energy, and the mean and standard deviation of the speech energy; and the correlation coefficient between the confidence measure and the F0 formant magnitude, and the mean and standard deviation of the F0 formant magnitude.

Preferably, step (e3) comprises: multiplying the normalized current score of the signal-to-noise ratio by the signal-to-noise ratio correlation coefficient to produce a first score; multiplying the normalized current score of the speech energy by the speech energy correlation coefficient to produce a second score; and multiplying the normalized current score of the F0 formant magnitude by the F0 formant magnitude correlation coefficient to produce a third score.

Preferably, step (f) comprises: accepting the recognition result if the comparison shows that the computed confidence measure is greater than the threshold; rejecting the recognition result if the computed confidence measure is smaller than the threshold; and, when the recognition result is accepted, operating the speech recognition system, or, when it is rejected, prompting the user by message or voice to repeat the utterance.
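A minimal sketch of the accept/reject decision in step (f) might look like the following. The function names, the returned strings, and the prompt wording are hypothetical. Treating a confidence equal to the threshold as a rejection follows the conventional rule described in the background section, which is an assumption for this step.

```python
def decide(confidence: float, threshold: float) -> str:
    """Accept when the confidence measure exceeds the threshold, else reject."""
    # Equality is treated as rejection (assumption, see lead-in).
    return "accept" if confidence > threshold else "reject"


def handle_result(word: str, confidence: float, threshold: float) -> str:
    """Return the action the system takes for a recognized word."""
    if decide(confidence, threshold) == "accept":
        # Accepted: the system acts on the recognized word.
        return f"EXECUTE: {word}"
    # Rejected: prompt the user to repeat the utterance.
    return "PROMPT: Please say that again."
```

For example, `handle_result("call home", 0.82, 0.54)` yields an execute action, while a confidence of 0.31 against the same threshold yields the prompt.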

Other objects, features, and advantages of the present invention will become apparent from the detailed description of the embodiments with reference to the accompanying drawings.

A preferred embodiment of the speech recognition apparatus and method using automatic threshold update for utterance verification according to the present invention is described below with reference to the accompanying drawings.

FIG. 1 is a diagram showing the configuration of a speech recognition apparatus using automatic threshold update for utterance verification according to the present invention.

As shown in FIG. 1, the apparatus comprises: a central control unit (10) that controls the overall system for speech recognition through the user's speech input and message/voice output; a preprocessing unit (20) that performs noise processing on the speech input from the central control unit (10), together with endpoint detection and feature extraction for speech-segment detection, and outputs speech data; a speech recognition unit (30) that applies the speech data to a pre-trained context-dependent phoneme model, performs speech recognition through a Viterbi search, and outputs speech information; an environmental factor parameter calculation unit (40) that uses the speech information to extract the environmental factor parameters, which carry channel and speaker characteristic information, and the scores used for updating the threshold; an input parameter extraction unit (50) that applies the speech information to a pre-trained context-independent anti-phoneme model or phone duration model and to other information (likelihood, N-best information, and so on) to extract various input parameters for measuring per-word confidence; a confidence measurement unit (60) that, when the input parameter extraction unit (50) produces several parameters and a classifier model is used, obtains the confidence measure from that model and, when there is a single input parameter and no classifier model, uses the extracted value directly as the confidence measure; a threshold determination unit (70) that, to update the threshold, normalizes the score parameter values to a normal distribution, applies to each normalized value the per-parameter correlation coefficient obtained from training data to obtain a new value, averages these values to obtain an environmental factor value, and applies that value to compute and update a new threshold; and a decision unit (80) that uses the newly updated threshold to accept or reject the recognition result.

Here, the parameters extracted by the input parameter extraction unit (50) may be obtained either by generating and using an externally required model, such as a context-independent anti-phoneme model, or by using internal information only. When a context-independent anti-phoneme model is used, various types of anti-phoneme model, such as an all-mixture anti-model, an adapted anti-model, a discriminative anti-model, and a VQ (vector quantization) based anti-model, are combined with internal information (likelihood, N-best information, and so on). A single input parameter may also be extracted.

Preferably, the environmental factor parameters used by the threshold determination unit (70) are the signal-to-noise ratio (SNR), which is a channel characteristic, and the speech energy and F0 formant magnitude, which are speaker characteristics.

The operation of the speech recognition method using automatic threshold update for utterance verification according to the present invention, configured as above, is described in detail below with reference to the accompanying drawings.

FIG. 2 is a flowchart showing a speech recognition method using automatic threshold update for utterance verification according to the present invention.

Referring to FIG. 2, when a user inputs speech into a system equipped with a speech recognition function, the central control unit (10) in the speech recognition system passes the speech to the preprocessing unit (20) (S100).

The preprocessing unit (20) then performs noise processing on the received speech, together with endpoint detection and feature extraction for speech-segment detection, and passes the result to the speech recognition unit (30) (S200). The speech recognition unit (30) applies the speech data output by the preprocessing unit (20) to the pre-trained context-dependent phoneme model (90) and performs speech recognition through a Viterbi search (S300). Speech recognition outputs the word score, the word boundary information, and the per-phoneme scores and boundary information for each word.
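The recognizer output listed above (word score, word boundaries, per-phoneme scores and boundaries) could be held in a structure such as the one sketched below. The class and field names are hypothetical, and expressing boundaries as frame indices with an exclusive end frame is an assumption; the patent does not specify a representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhonemeSegment:
    label: str        # phoneme identity
    score: float      # per-phoneme score from the recognizer
    start_frame: int  # boundary: first frame of the phoneme
    end_frame: int    # boundary: one past the last frame (assumption)


@dataclass
class WordHypothesis:
    word: str
    score: float      # word-level score
    start_frame: int  # word boundary: first frame
    end_frame: int    # word boundary: one past the last frame (assumption)
    phonemes: List[PhonemeSegment] = field(default_factory=list)

    def phoneme_durations(self) -> List[int]:
        """Duration of each phoneme in frames, e.g. for a phone duration model."""
        return [p.end_frame - p.start_frame for p in self.phonemes]
```

Downstream stages, such as input parameter extraction, can then read scores and durations from one object per recognized word.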

Next, the speech information output by speech recognition is applied to the various pre-trained context-independent anti-phoneme models (100) and to internal information (likelihood, N-best information, and so on) to compute the input parameters for measuring per-word confidence and the environmental factor parameters for updating the threshold (S400). The context-independent anti-phoneme model (100) may be an anti-model using all mixtures (all-mixture anti-model), an adapted anti-model, a discriminatively trained anti-model (discriminative anti-model), a VQ (vector quantization) based anti-model, or the like.

Next, when several input parameters have been computed, the confidence measure is computed and normalized using a classifier that combines the input parameters and its classifier model (110) (S500). When a single input parameter is used, no separate classifier or classifier model is needed. Here, confidence means the probability of the recognized phoneme or word relative to the probability that the utterance came from some other phoneme or word. In the confidence measurement (S500), when there are several input parameters, the classifier combines the input parameter values into a single value by multiplying them by the model coefficients, and this value is the confidence value. When a single input parameter is used, that parameter value is itself the confidence value.
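The two cases of the confidence measurement (S500) can be sketched as follows. The patent does not name the classifier, so a linear combination of the input parameters with model coefficients and an optional bias is assumed here purely for illustration. The function and argument names are hypothetical.

```python
from typing import Sequence


def confidence_measure(
    params: Sequence[float],
    coeffs: Sequence[float] = (),
    bias: float = 0.0,
) -> float:
    """Combine input parameters into one confidence value.

    Single parameter: the value is used directly, with no classifier.
    Several parameters: a linear model (an assumption) weights each
    parameter by its trained coefficient and sums the results.
    """
    if not params:
        raise ValueError("at least one input parameter is required")
    if len(params) == 1:
        return params[0]
    if len(coeffs) != len(params):
        raise ValueError("one coefficient per input parameter is required")
    return bias + sum(w * p for w, p in zip(coeffs, params))
```

With parameters `[1.0, 2.0]` and coefficients `[0.5, 0.25]`, the combined value is 1.0; a lone parameter `[0.7]` is returned unchanged.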

The method of computing the prior data values needed for computing and updating the new threshold (S600) is shown in more detail in FIG. 3.

Referring to FIG. 3, once the training data is determined (S510), the signal-to-noise ratio, speech energy, and F0 formant magnitude scores are obtained for the training data set, the mean and standard deviation of each are computed, and the scores are normalized and stored (S520). The input parameters are then determined (S530). If several input parameters are used, a model must be generated (S540), so models are trained (S550), the best of them is selected (S560), the confidence values obtained from that model are measured, and their mean and standard deviation are computed (S570) to normalize the confidence values (S580). If no model needs to be generated, the mean and standard deviation are computed (S570) for the confidence measure determined in the input parameter determination (S530), and the confidence values are normalized (S580). Finally, the correlation coefficient between the normalized confidence measure and the signal-to-noise ratio, the correlation coefficient between the normalized confidence measure and the speech energy, and the correlation coefficient with the normalized F0 formant magnitude are computed (S590) and stored for later use.
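The offline statistics of FIG. 3 (mean, standard deviation, normalization, and correlation with the confidence measure) can be sketched as below. This is an illustration under assumptions: the patent says only "correlation coefficient", so the Pearson coefficient is assumed, as is the population standard deviation. All function names and dictionary keys are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def z_normalize(values: List[float]) -> Tuple[List[float], float, float]:
    """Return z-scores plus the mean and standard deviation used (S520/S570-S580)."""
    mu, sigma = mean(values), pstdev(values)  # sigma must be non-zero
    return [(v - mu) / sigma for v in values], mu, sigma


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def train_environment_stats(
    confidences: List[float], factors: Dict[str, List[float]]
) -> Dict[str, Dict[str, float]]:
    """Per factor: mean, std, and correlation with normalized confidence (S590)."""
    norm_conf, _, _ = z_normalize(confidences)
    stats = {}
    for name, values in factors.items():
        norm_values, mu, sigma = z_normalize(values)
        stats[name] = {"mean": mu, "std": sigma, "corr": pearson(norm_conf, norm_values)}
    return stats
```

The resulting per-factor mean, standard deviation, and correlation coefficient correspond to the stored values that the online threshold update reads as its initial values.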

Next, a new threshold is computed and updated by applying the computed environmental factor parameters (S400) and the computed confidence measure (S500) (S600).

Here, the environmental factor characteristics are the factors that affect the utterance verification threshold: the signal-to-noise ratio (SNR), which is a channel characteristic, and the speech energy and F0 formant magnitude, which are speaker characteristics.

The method of computing and updating the new threshold that reflects the environmental factor characteristics (S600) is shown in more detail in FIG. 4.

Referring to FIG. 4, for each of the computed environmental factor parameters, initial values (correlation coefficient, mean, and standard deviation) are first computed and set for the signal-to-noise ratio, the speech energy, and the F0 formant magnitude (S610). Here, the initial values for an environmental factor characteristic are its correlation coefficient with the confidence measure and the associated mean and standard deviation.

Next, current scores for the signal-to-noise ratio, voice energy, and F0 formant size are each calculated in real time from the current environmental factors, and are normalized (S620).

Then each correlation coefficient for the environmental factor characteristics, calculated from the environmental factor parameters and the reliability measurement values, is multiplied by the corresponding current score to produce a new score (S630).

That is, the normalized current signal-to-noise ratio score is multiplied by the signal-to-noise ratio correlation coefficient to produce a first score, the normalized current voice energy score is multiplied by the voice energy correlation coefficient to produce a second score, and the normalized current F0 formant size score is multiplied by the F0 formant size correlation coefficient to produce a third score.

The first, second, and third scores thus calculated are summed and the overall average value for the environmental factors is calculated. The calculated average is defined as the environmental factor value (S640).
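Steps S620 to S640 can be written compactly as follows. The symbols are introduced here for illustration only and do not appear in the original text: $x_k$ is the current raw score of environmental factor $k$ (signal-to-noise ratio, voice energy, F0 formant size), $\mu_k$ and $\sigma_k$ are its stored average and standard deviation, $\rho_k$ is its stored correlation coefficient with the reliability measurement value, and $E$ is the environmental factor value. Z-score normalization is assumed, since the text does not name the normalization formula.

```latex
z_k = \frac{x_k - \mu_k}{\sigma_k}, \qquad
s_k = \rho_k \, z_k, \qquad
E = \frac{1}{3} \sum_{k=1}^{3} s_k
```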

The calculated environmental factor value is then multiplied by a specific adaptation coefficient, and the product is added to or subtracted from the existing threshold value to calculate and update the new threshold value (S650).
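The update described in S620 to S650 might be sketched in Python as below. This is an illustration under stated assumptions, not the patented implementation: z-score normalization is assumed, the text does not say when the product is added and when it is subtracted (so `sign` is left as a caller-supplied parameter), and the function names, dictionary keys, and numeric values are all hypothetical.

```python
def environmental_factor_value(current, stats, correlations):
    """Average of correlation-weighted, normalized current scores (S620 to S640).

    current:      factor name -> raw score measured in real time
    stats:        factor name -> (mean, std) stored from training
    correlations: factor name -> stored correlation coefficient
    """
    weighted = []
    for name, raw in current.items():
        mu, sigma = stats[name]
        z = (raw - mu) / sigma                    # S620: normalize (assumed z-score)
        weighted.append(correlations[name] * z)   # S630: multiply by coefficient
    return sum(weighted) / len(weighted)          # S640: overall average


def update_threshold(threshold, env_value, adaptation_coeff, sign=1):
    """S650: add (sign=+1) or subtract (sign=-1) the scaled environmental value."""
    return threshold + sign * adaptation_coeff * env_value


# Hypothetical numbers, chosen only to show the data flow.
current = {"snr": 20.0, "energy": 60.0, "f0": 150.0}
stats = {"snr": (15.0, 5.0), "energy": (60.0, 10.0), "f0": (120.0, 30.0)}
correlations = {"snr": 0.5, "energy": 0.2, "f0": -0.1}

env = environmental_factor_value(current, stats, correlations)
new_threshold = update_threshold(0.5, env, adaptation_coeff=0.3)
```

Keeping the sign and the adaptation coefficient as explicit arguments leaves both choices to the system designer, which matches how loosely the text specifies them.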

The newly updated threshold value is compared with the calculated reliability measurement value to decide whether to accept or reject the speech recognition result (S700).

That is, if the comparison (S700) shows that the threshold value is larger than the calculated reliability measurement value, the recognized result is regarded as correct and the recognition result is accepted (S1000); if the comparison (S700) shows that the calculated reliability measurement value is smaller than the threshold value, the recognition result is rejected (S800).

If the recognition result is accepted, the speech recognition system is operated (S1100); if the recognition result is rejected, the central control unit (10) outputs a message and a voice prompt to the user to induce the user to speak again for speech recognition (S900).

In this way, by reflecting the environmental factor characteristics in the threshold value in real time for utterance verification, a highly reliable speech recognition service can be provided.

The preferred embodiment of the present invention has been disclosed above through the detailed description and drawings. The terms are used only to describe the present invention, not to restrict their meaning or to limit the scope of the invention set forth in the claims. Those of ordinary skill in the art will therefore understand that various modifications and equivalent other embodiments are possible. Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

As described above, the speech recognition apparatus and method using automatic update of the threshold value for utterance verification according to the present invention have the following effects.

First, because the threshold value for utterance verification is updated automatically in the speech recognition system, environmental factor characteristics can be reflected in the system immediately even when the environment or task changes.

Second, reflecting information that is sensitive in real environments in the threshold value calculation makes it possible to provide highly reliable utterance verification.

Claims

A pre-processing unit for outputting voice data by performing noise processing of the input voice and detecting a voice section;

A speech recognition unit for performing speech recognition and outputting speech information by applying the speech data to a pre-learned context-dependent phoneme model;

An environmental factor parameter extracting unit for extracting environmental factor parameters and scores using the voice information;

An input parameter extracting unit for extracting an input parameter for measuring the reliability of each word by applying at least one of a previously trained context-independent anti-phoneme model, a phone duration model, likelihood, and N-best information;

A reliability measurement unit that, when several input parameters are extracted, calculates the reliability measurement value using a trained model of a predefined classifier and, when a single input parameter is extracted, uses the extracted reliability measurement value as is;

A threshold value determination unit for calculating and updating a new threshold value by applying an environmental factor value calculated through the average value of the scores;

A determination unit for determining acceptance or rejection of the recognition result using the updated threshold value,

wherein the environmental factor value is at least one of a signal-to-noise ratio (SNR), which is a channel characteristic, and voice energy and an F0 formant, which are speaker characteristics,

the speech recognition apparatus being characterized in that the context-independent anti-phoneme model is at least one of an Allmixture anti-model, an adaptive anti-model, a discriminative anti-model trained by discriminative learning, and a VQ (Vector Quantization)-based anti-model.

delete

(a) performing noise processing and speech section detection on the input speech;

(b) applying the detected voice data to a preset, trained context-dependent phoneme model to perform voice recognition through Viterbi search;

(c) calculating an input parameter value and an environmental factor parameter by applying the recognized speech information to at least one of a preset, trained context-independent anti-phoneme model, a phone duration model, likelihood, and N-best information;

(d) calculating and normalizing a reliability measure based on the calculated input parameter value;

(e) calculating and updating a new threshold value by applying the calculated environmental factor value based on the calculated environmental factor parameter;

(f) comparing the updated threshold value with the calculated reliability measure to determine acceptance or rejection according to a speech recognition result;

wherein the context-independent anti-phoneme model is at least one of an Allmixture anti-model, an adaptive anti-model, a discriminative anti-model trained by discriminative learning, and a VQ (Vector Quantization)-based anti-model.

delete

The method of claim 4, wherein step (d)

If the calculated input parameter is based on multiple input parameters, training to generate a classifier model for the classifier;

Calculating a reliability measurement value, an average, and a standard deviation of the training data using the training model calculated in the training step;

And normalizing the calculated reliability measure.

The method of claim 4, wherein step (d)

If the calculated input parameter is based on a single input parameter, calculating and normalizing reliability measurements, averages, and standard deviations directly from the training data.

The method of claim 4, wherein the method of obtaining prior parameter data for use in step (e) comprises:

Determining training data using a classifier model based on the calculated input parameter values,

Using the determined training data to calculate an environmental factor parameter value and a parameter to be used as input data;

Calculating reliability measurements, averages, and standard deviations of the calculated input parameters using a predefined classifier model;

Normalizing the calculated reliability measure;

Calculating a correlation coefficient between the normalized reliability measure and the signal-to-noise ratio, a correlation coefficient between the normalized reliability measure and the voice energy, and a correlation coefficient with the normalized F0 formant size.

The method of claim 4, wherein step (e)

(e1) calculating and setting an initial value of the environmental factor characteristics based on the environmental factor parameters calculated using the training data;

(e2) calculating and normalizing each current score for environmental factors based on the current environmental factors in real time,

(e3) calculating a new score by multiplying a correlation coefficient of each of the initialized environmental factor parameters with a current score of each of the environmental factor parameters obtained in real time;

(e4) calculating the environmental factor values by summing all the calculated new scores and calculating a total average value;

(e5) calculating a new threshold value by multiplying the calculated environmental factor value by a specific adaptation coefficient and adding the product to, or subtracting it from, the existing threshold value;

(e6) updating the existing threshold value with the calculated new threshold value.

The method of claim 9,

The initial value of the environmental factor characteristic of (e1) is at least one of the initial value of the signal-to-noise ratio, the initial value of the voice energy, and the initial value of the F0 formant size.

The method of claim 10,

wherein the initial value is at least one of a correlation coefficient with the reliability measurement value, an average, and a standard deviation.

The method of claim 9, wherein step (e3)

Calculating a first score by multiplying the normalized and calculated current score of the signal-to-noise ratio by the signal-to-noise ratio correlation coefficient;

Calculating a second score by multiplying the normalized and calculated current score of speech energy by the speech energy correlation coefficient;

And calculating a third score by multiplying the normalized current score of the F0 formant size by the F0 formant size correlation coefficient.

The method of claim 4, wherein step (f)

Accepting a recognition result when a threshold value is larger than a reliability measurement value calculated as a result of the comparison;

Rejecting a recognition result if the reliability measurement value calculated as a result of the comparison is smaller than a threshold value;

Operating a speech recognition system if the recognition result is accepted, and inducing the user to speak again for speech recognition through a message or speech if the recognition result is rejected.