KR20200129007A

KR20200129007A - Utterance verification device and method

Info

Publication number: KR20200129007A
Application number: KR1020190053344A
Authority: KR
Inventors: 정윤재; 조훈영
Original assignee: 주식회사 엔씨소프트
Priority date: 2019-05-07
Filing date: 2019-05-07
Publication date: 2020-11-17

Abstract

According to one embodiment of the present invention, a method for verifying utterance includes an operation of verifying utterance based on average phoneme recognition rank for utterance data and utterance transcription data. According to another embodiment of the present invention, the method may determine whether to perform utterance verification based on average phoneme recognition rank by measuring a misclassification distance for utterance data and utterance transcription data and using the measured misclassification distance.

Description

Utterance verification device and utterance verification method {UTTERANCE VERIFICATION DEVICE AND METHOD}

The following embodiments relate to an utterance verification device and an utterance verification method.

In a speech recognition system, the technique of deciding whether a speech recognition result should be judged correct or incorrect is called utterance verification.

Utterance verification methods use feature data output from the search unit of an HMM (Hidden Markov Model) speech recognizer (for example, anti-phoneme model LLR (Log Likelihood Ratio) scores, N-best LLR scores, combinations of LLR scores, and word duration), together with pattern classification methods such as the SVM (Support Vector Machine) that determine from these features whether a result is a misrecognition.

In a speech recognition system, the purpose of utterance verification is to reject out-of-vocabulary words and also to reject in-vocabulary words that are likely to have been misrecognized. To do so, the confidence of the recognized result is measured and used to decide whether to accept or reject the recognition. Confidence means the relative ratio of the recognized phoneme or word to the case in which the utterance was produced from a phoneme or word other than the recognized result. In general, confidence is measured from the viewpoint of statistical hypothesis testing.

In speech recognition systems, confidence for utterance verification is mainly computed on the basis of an anti-phoneme model. It measures the difference between the null hypothesis (H0), that speech recognition was performed correctly, and the alternative hypothesis (H1), that it was performed incorrectly. In utterance verification, a confidence measure is computed and compared with a predetermined threshold to decide whether to accept or reject the speech recognition result.

Measures based on the LRT (Likelihood Ratio Test) have been the most widely used confidence measures, and improvements that model the alternative hypothesis through training such as MCE (Minimum Classification Error) and MVE (Minimum Verification Error) have been proposed. Most current systems have moved away from a single measure and instead combine the candidate recognition results in the N-best list with several confidence measures.

According to an embodiment of the present invention, utterance verification based on average phoneme recognition rank may be performed on utterance data and utterance transcription data.

According to another embodiment of the present invention, a misclassification distance may be measured for the utterance data and the utterance transcription data, and the measured misclassification distance may be used to determine whether to perform utterance verification based on average phoneme recognition rank.

According to yet another embodiment of the present invention, utterance verification may be performed on the utterance data and the utterance transcription data based on the average phoneme recognition rank of the phoneme sequence of the utterance data and the utterance transcription data.

According to yet another embodiment of the present invention, utterance verification may be performed on utterance data whose speaking style differs from that of the utterance data used to train the acoustic model.

According to an embodiment of the present invention, an utterance verification method includes an operation of performing utterance verification based on average phoneme recognition rank on utterance data and utterance transcription data.

The method may further include an operation of measuring a misclassification distance for the utterance data and the utterance transcription data based on an anti-phoneme model, and an operation of comparing the measured misclassification distance with a preset threshold and determining, according to the comparison result, whether to perform utterance verification based on average phoneme recognition rank.

The utterance verification method may further include an operation of determining the appropriateness of the utterance data and the utterance transcription data based on the anti-phoneme model.

In the operation of determining whether to perform utterance verification based on average phoneme recognition rank, when the measured misclassification distance is greater than the preset threshold, the rank value of the rank-based utterance verification is set to the lowest rank; when the measured misclassification distance is smaller than the preset threshold, it is determined that utterance verification based on average phoneme recognition rank is to be performed on the utterance data and the utterance transcription data.
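The gating rule described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `gated_average_rank`, the `lowest_rank` parameter, and the callable `compute_average_rank` are hypothetical, and the handling of a distance exactly equal to the threshold is an assumption because the text only covers the "greater" and "smaller" cases.

```python
from typing import Callable


def gated_average_rank(
    misclassification_distance: float,
    distance_threshold: float,
    compute_average_rank: Callable[[], float],
    lowest_rank: float,
) -> float:
    """Return the rank value used for rank-based utterance verification.

    If the misclassification distance is greater than the threshold, the
    pair is assigned the lowest (worst) rank and the rank computation is
    skipped. Otherwise the average phoneme recognition rank is computed.
    """
    if misclassification_distance > distance_threshold:
        return lowest_rank
    # Assumption: a distance equal to the threshold is treated like the
    # "smaller" case; the source text does not specify this.
    return compute_average_rank()
```

Passing the rank computation as a callable keeps the expensive step from running for pairs that the anti-phoneme check has already rejected.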

The average phoneme recognition rank may be determined based on a pre-trained acoustic model.

According to another embodiment of the present invention, an utterance verification device includes a processor, and the processor performs utterance verification based on average phoneme recognition rank on utterance data and utterance transcription data.

The processor may measure a misclassification distance for the utterance data and the utterance transcription data based on an anti-phoneme model, compare the measured misclassification distance with a preset threshold, and determine, according to the comparison result, whether to perform utterance verification based on average phoneme recognition rank.

The processor may determine the appropriateness of the utterance data and the utterance transcription data based on the anti-phoneme model.

When the measured misclassification distance is greater than the preset threshold, the processor may set the rank value of the rank-based utterance verification to the lowest rank; when the measured misclassification distance is smaller than the preset threshold, the processor may determine that utterance verification based on average phoneme recognition rank is to be performed on the utterance data and the utterance transcription data.

According to an embodiment of the present invention, utterance verification based on average phoneme recognition rank can be performed on utterance data and utterance transcription data.

In addition, a misclassification distance can be measured for the utterance data and the utterance transcription data, and the measured misclassification distance can be used to determine whether to perform utterance verification based on average phoneme recognition rank.

In addition, utterance verification can be performed on the utterance data and the utterance transcription data based on the average phoneme recognition rank of the phoneme sequence of the utterance data and the utterance transcription data.

In addition, utterance verification can be performed on utterance data whose speaking style differs from that of the utterance data used to train the acoustic model.

FIG. 1 is a diagram illustrating the operation of an utterance verification device according to an embodiment.
FIG. 2 is a diagram showing the configuration of an utterance verification device according to an embodiment.
FIG. 3 is a diagram showing the distribution of utterance data-utterance transcription data pairs according to misclassification distance.
FIG. 4 is a diagram showing the distribution of utterance data-utterance transcription data pairs according to average phoneme recognition rank.
FIG. 5 is a flowchart showing an utterance verification method according to an embodiment.

The specific structural or functional descriptions of the embodiments according to the concept of the present invention disclosed in this specification are given only to describe those embodiments. The embodiments according to the concept of the present invention may be implemented in various forms and are not limited to the embodiments described in this specification.

Because the embodiments according to the concept of the present invention can be modified in various ways and can take various forms, the embodiments are illustrated in the drawings and described in detail in this specification. This is not intended to limit the embodiments to the specific disclosed forms; they include all changes, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms serve only to distinguish one component from another. For example, without departing from the scope of rights according to the concept of the present invention, a first component may be called a second component, and similarly a second component may be called a first component.

When a component is described as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between. In contrast, when a component is described as being "directly connected" or "directly coupled" to another component, it should be understood that no other component exists in between. Other expressions describing relationships between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

The terms used in this specification are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise.

In this specification, terms such as "include" or "have" specify the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this specification.

In the following description, the same reference symbols denote the same components, and unnecessary redundant descriptions and descriptions of known technologies are omitted.

In the embodiments of the present invention, 'communication', 'communication network', and 'network' may be used with the same meaning. These three terms refer to wired and wireless local-area and wide-area data transmission and reception networks over which files can be exchanged between a user terminal, the terminals of other users, and a download server.

Hereinafter, the present invention is described in detail by describing preferred embodiments of the present invention with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating the operation of an utterance verification device according to an embodiment.

Referring to FIG. 1, an utterance verification device 100 according to an embodiment receives utterance data 101 and utterance transcription data 102 and generates utterance verification result data 103.

The utterance data 101 and the utterance transcription data 102 are each input to the utterance verification device 100, and the utterance data 101 and the utterance transcription data 102 may be different types of data.

The utterance data 101 may be data in wave (WAV) format.

The utterance data 101 may be an utterance by a voice actor.

The utterance transcription data 102 may be data in text (TXT) format.

The utterance transcription data 102 may be a subtitle in plain-text form.

The utterance verification result data 103 may be output in the form of a score.

The utterance verification result data 103 may be output only when the utterance verification result is at or above a preset level (for example, when the pair is judged appropriate).

The acoustic model 104 is a model in which sound units have been learned with deep learning techniques and stored as knowledge, and it may be used by a decoder.

The acoustic model 104 may be modeled on the basis of the HMM (Hidden Markov Model), a probabilistic and statistical method, but is not limited thereto.

The acoustic model 104 may be a pre-trained model, but is not limited thereto.

The acoustic model 104 may be used to classify the utterance data 101 into phoneme units.

The acoustic model 104 may serve as the basis of the anti-phoneme model.

The average phoneme recognition rank may be determined based on the acoustic model 104.

FIG. 2 is a diagram showing the configuration of an utterance verification device according to an embodiment, FIG. 3 is a diagram showing the distribution of utterance data-utterance transcription data pairs according to misclassification distance, and FIG. 4 is a diagram showing the distribution of utterance data-utterance transcription data pairs according to average phoneme recognition rank.

Referring to FIG. 2, the utterance verification device 100 includes a processor 110, an interface module 120, and a database 130.

The processor 110, the interface module 120, and the database 130 that make up the utterance verification device 100 are connected to one another and can exchange data.

The processor 110 may include a memory (not shown) in which programs are stored.

The processor 110 may execute the programs stored in the included memory (not shown).

The processor 110 may be connected to a memory (not shown) in which programs are stored.

The processor 110 may execute the programs stored in the connected memory (not shown).

An operating program (for example, an OS) for operating the utterance verification device 100 may be stored in the memory (not shown).

The processor 110 may execute a program for managing information about the utterance verification device 100.

The processor 110 may execute a program for managing the operation of the utterance verification device 100.

The processor 110 may execute a program for managing the operation of the interface module 120.

The processor 110 may control the operation of the utterance verification device 100 using user input received by the interface module 120.

ⅰ) Forced alignment

Referring to FIGS. 1 and 2, the processor 110 may extract speech features from the utterance data 101 using 25 ms windows taken every 10 ms.

The processor 110 may extract the speech features from the utterance data 101 using speech feature extraction techniques such as MFCC (Mel Frequency Cepstral Coefficient) and FBANK (Filter BANK).
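To make the 25 ms window and 10 ms step concrete, the sketch below computes the sample ranges of the analysis frames that would be passed to an MFCC or FBANK extractor. The function name `frame_bounds`, the 16 kHz sample rate used in the usage note, and the choice to drop a trailing partial frame are assumptions for illustration; none of them is stated in the text.

```python
def frame_bounds(num_samples, sample_rate, window_ms=25.0, hop_ms=10.0):
    """Return (start, end) sample indices of each analysis frame.

    A new frame starts every `hop_ms` milliseconds and covers
    `window_ms` milliseconds, so consecutive frames overlap.
    Assumption: a trailing partial frame is dropped (no padding).
    """
    window = int(round(sample_rate * window_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must cover at least one sample")
    if num_samples < window:
        return []
    count = 1 + (num_samples - window) // hop
    return [(i * hop, i * hop + window) for i in range(count)]
```

For one second of audio at an assumed 16 kHz, `frame_bounds(16000, 16000)` yields 400-sample frames spaced 160 samples apart.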

The processor 110 may convert the utterance transcription data 102 into a phoneme sequence (phone sequence) using a pronunciation dictionary (not shown). The pronunciation dictionary lists words together with the phoneme sequence corresponding to the pronunciation of each word.
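The dictionary lookup can be sketched as below. The lexicon contents, the phoneme labels, and the whitespace tokenization are hypothetical and chosen only for illustration; the text does not specify a tokenizer or how words missing from the dictionary are handled.

```python
def to_phone_sequence(transcript, lexicon):
    """Convert a plain-text transcript to a flat phoneme sequence.

    `lexicon` maps each word to the list of phonemes for its
    pronunciation. Assumption: words are separated by whitespace and
    an unknown word is treated as an error.
    """
    phones = []
    for word in transcript.split():
        if word not in lexicon:
            raise KeyError(f"word not in pronunciation dictionary: {word!r}")
        phones.extend(lexicon[word])
    return phones


# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}
```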

The processor 110 may apply forced alignment to the speech features extracted from the utterance data 101 and the phoneme sequence converted from the utterance transcription data 102, thereby determining, for each phoneme in the phoneme sequence, the corresponding speech signal segment of the utterance data 101.
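As a rough illustration of what forced alignment produces, the sketch below assigns each frame to one phoneme of the sequence, in order, so that the summed frame scores are maximal. It is a simplified dynamic program under stated assumptions: per-frame phoneme log-scores are taken as given inputs, each phoneme covers at least one frame, and there are no HMM state topologies or transition probabilities. The name `force_align` and the input layout are hypothetical.

```python
import math


def force_align(frame_logp, phones):
    """Return one (start_frame, end_frame) pair per phoneme, end exclusive.

    frame_logp[t][p] is the log-score of phoneme p at frame t.
    """
    T, N = len(frame_logp), len(phones)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per phoneme")
    neg = -math.inf
    best = [[neg] * N for _ in range(T)]   # best score ending at (t, i)
    stay = [[True] * N for _ in range(T)]  # True: frame t-1 had phoneme i too
    best[0][0] = frame_logp[0][phones[0]]
    for t in range(1, T):
        for i in range(min(t + 1, N)):
            keep = best[t - 1][i]
            advance = best[t - 1][i - 1] if i > 0 else neg
            if advance > keep:
                best[t][i], stay[t][i] = advance, False
            else:
                best[t][i], stay[t][i] = keep, True
            best[t][i] += frame_logp[t][phones[i]]
    # Trace back from the last frame of the last phoneme.
    segments, end, i = [], T, N - 1
    for t in range(T - 1, 0, -1):
        if not stay[t][i]:
            segments.append((t, end))
            end, i = t, i - 1
    segments.append((0, end))
    segments.reverse()
    return segments
```

The returned segments are what the later per-phoneme scoring steps consume: each phoneme of the transcript is paired with the frames it is aligned to.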

ⅱ) Utterance verification based on the anti-phoneme model

The processor 110 may measure a misclassification distance for the utterance data 101 and the utterance transcription data 102 based on an anti-phoneme model. The anti-phoneme model may be based on the acoustic model 104.

The processor 110 may use the measured misclassification distance to measure how inappropriate the pairing of the utterance data 101 and the utterance transcription data 102 is.

The processor 110 may use the measured misclassification distance to determine the appropriateness of the utterance data 101 and the utterance transcription data 102.

When the measured misclassification distance is smaller than a preset value, the processor 110 may determine that the utterance data 101 and the utterance transcription data 102 are an appropriate match.

When the measured misclassification distance is greater than the preset value, the processor 110 may determine that the utterance data 101 and the utterance transcription data 102 are not an appropriate match.

The processor 110 may use the measured misclassification distance to perform utterance verification on the utterance data 101 and the utterance transcription data 102.

That is, the processor 110 may measure the misclassification distance between the speech features extracted from the utterance data 101 and the phoneme sequence converted from the utterance transcription data 102, use the measured misclassification distance to determine the appropriateness of the utterance data 101 and the utterance transcription data 102, and perform utterance verification using the determined appropriateness.

The better the utterance data 101 and the utterance transcription data 102 match, the smaller the misclassification distance.

The processor 110 may use [Equation 1] below to measure the misclassification distance.

In [Equation 1], one term denotes the phoneme recognition log-likelihood between the utterance data (u) and the utterance transcription data (t), and the other denotes the log-likelihood that the utterance data (u) is recognized as arbitrary utterance transcription data (that is, the anti-phoneme model).

In other words, the misclassification distance indicates how much greater the probability that the utterance data (u) input to the utterance verification device 100 is recognized as arbitrary utterance transcription data is than the probability that it is recognized as the utterance transcription data input to the utterance verification device 100.

The processor 110 may use [Equation 2] to calculate the phoneme recognition log-likelihood of [Equation 1].

[Equation 2] is defined over the i-th phoneme in the phoneme sequence of the utterance transcription data (t), the speech features corresponding to the i-th phoneme in the forced alignment result, and the probability that those speech features are recognized as that phoneme; N is the total number of phonemes in the utterance transcription data (t). That is, the phoneme recognition log-likelihood is the average of the log-likelihoods with which each phoneme is recognized by the acoustic model.
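The equations themselves are not reproduced in this text. One reading consistent with the definitions above is sketched here; the symbol names d, g, G, p_i, and x_i are assumed for illustration, and the exact form in the original may differ.

```latex
% Assumed notation: d = misclassification distance,
% g = phoneme recognition log-likelihood, G = anti-phoneme log-likelihood,
% p_i = i-th phoneme of the transcription, x_i = its aligned speech features.
d(u, t) = -\,g(u \mid t) + G(u \mid t)
\qquad \text{(cf. Equation 1)}

g(u \mid t) = \frac{1}{N} \sum_{i=1}^{N} \log P(p_i \mid x_i)
\qquad \text{(cf. Equation 2)}
```

Under this reading, a well-matched pair has a large g and a small G, so d is small, which agrees with the statement that the distance shrinks as the match improves.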

The processor 110 may use [Equation 3] to calculate the anti-phoneme log-likelihood of [Equation 1].

The per-phoneme term in [Equation 3] is the anti-phoneme model, which represents the log-likelihood that the i-th phoneme is recognized as any other phoneme.

However, the anti-phoneme model cannot be obtained directly from the acoustic model, so it is calculated indirectly from the recognition probabilities of the phonemes other than the input phoneme.

Using [Equation 4], the processor 110 may calculate the average of the recognition log-likelihood values of all phonemes in the full phoneme set P other than the i-th phoneme.
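The misclassification distance described in this section can be sketched numerically as follows. This is an illustration under assumptions, not the patented formula: the function name and input layout are hypothetical, the anti-phoneme score of each segment is taken as the plain average of the log-likelihoods of all other phonemes, and the anti-phoneme term is averaged over the N phonemes by analogy with the phoneme recognition log-likelihood.

```python
def misclassification_distance(segment_logp, phones):
    """Distance between an utterance and its transcription.

    segment_logp[i][p] is the recognition log-likelihood of phoneme p
    for the speech segment aligned to the i-th transcript phoneme;
    phones[i] is that transcript phoneme.
    """
    n = len(phones)
    if n == 0 or n != len(segment_logp):
        raise ValueError("need one score table per transcript phoneme")
    target_sum = 0.0
    anti_sum = 0.0
    for scores, phone in zip(segment_logp, phones):
        others = [v for p, v in scores.items() if p != phone]
        if not others:
            raise ValueError("phoneme set must contain other phonemes")
        target_sum += scores[phone]             # expected phoneme
        anti_sum += sum(others) / len(others)   # average over the rest
    g = target_sum / n  # average phoneme recognition log-likelihood
    G = anti_sum / n    # average anti-phoneme log-likelihood (assumed form)
    return G - g        # smaller means a better match
```

A transcript that matches the audio gives a high target score and a low anti-phoneme score, so the distance is strongly negative; a mismatched transcript pushes it upward.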

ⅲ) 평균 음소인식 순위 기초 발화검증Iii) Basic speech verification of average phoneme recognition ranking

도 3을 참조하면, 도 3(a)는 음향모델 학습 데이터와 유사한 발화 스타일의 발화 데이터-발화 전사문 데이터에 대한 오분류거리를 나타내는 그래프이고, 도 3(b)는 음향모델 학습 데이터와 상이한 발화 스타일의 발화 데이터-발화 전사문 데이터에 대한 오분류거리를 나타내는 그래프이다.Referring to FIG. 3, FIG. 3(a) is a graph showing the misclassification distance for speech data-speech transcription data in a speech style similar to the acoustic model training data, and FIG. 3(b) This is a graph showing the misclassification distance for speech data of speech style-speech transcription data.

도 3(a)를 참조하면, 도 3(a)의 그래프에는 적합한 발화 데이터-발화 전사문 데이터(200)와 부적합한 발화 데이터-발화 전사문 데이터(210)가 표현되어 있으며, 음향모델 학습 데이터와 유사한 발화 스타일의 발화 데이터에 대해서는 오분류거리가 -1.5인 지점(220)에서 적합한 발화 데이터-발화 전사문 데이터(200)와 부적합한 발화 데이터-발화 전사문 데이터(210)가 구별되며, 적합한 발화 데이터-발화 전사문 데이터(200)와 부적합한 발화 데이터-발화 전사문 데이터(210)가 겹치는 부분(230)이 매우 작은 것을 알 수 있다.3(a), in the graph of FIG. 3(a), suitable speech data-speech transcription data 200 and inappropriate speech data-speech transcription data 210 are represented, and acoustic model training data and For utterance data of a similar utterance style, suitable utterance data-utterance transcription data 200 and inappropriate utterance data-utterance transcription data 210 are distinguished at a point 220 with a misclassification distance of -1.5, and suitable utterance data It can be seen that the portion 230 where the -speech transcription data 200 and the inappropriate speech data-speech transcription data 210 overlap is very small.

도 3(b)를 참조하면, 도 3(b)의 그래프에는 적합한 발화 데이터-발화 전사문 데이터(240)와 부적합한 발화 데이터-발화 전사문 데이터(250)가 표현되어 있으며,Referring to FIG. 3(b), in the graph of FIG. 3(b), suitable utterance data-utterance transcription data 240 and inappropriate utterance data-utterance transcription data 250 are represented,

음향모델 학습 데이터와 상이한 발화 스타일의 발화 데이터에 대해서는 오분류거리가 -1.1인 지점(260)에서 적합한 발화 데이터-발화 전사문 데이터(240)와 부적합한 발화 데이터-발화 전사문 데이터(250)가 구별되며, 적합한 발화 데이터-발화 전사문 데이터(240)와 부적합한 발화 데이터-발화 전사문 데이터(250)가 겹치는 부분(270)이 도 3(a)와 비교할 때 매우 큰 것을 알 수 있다.For speech data of speech styles different from the acoustic model training data, appropriate speech data-speech transcription data 240 and inadequate speech data-speech transcription data 250 are distinguished at a point 260 with a misclassification distance of -1.1. It can be seen that the portion 270 where the appropriate speech data-speech transcription data 240 and the inappropriate speech data-speech transcription data 250 overlap is very large as compared to FIG. 3(a).

도 3(a)와 도 3(b)를 종합해보면, 음향모델 학습 데이터와 상이한 발화 스타일의 발화 데이터에 대해서는 반음소 모델 기초 발화검증의 성능이 하락하는 것을 알 수 있다. 3(a) and 3(b), it can be seen that the performance of the semiphone model basic speech verification decreases for speech data of speech styles different from the acoustic model training data.

도 4를 참조하면, 도 4(a)는 음향모델 학습데이터와 유사한 발화 스타일의 발화 데이터-발화 전사문 데이터에 대한 평균 음소인식 순위를 나타내는 그래프이고, 도 4(b)는 음향모델 학습 데이터와 상이한 발화 스타일의 발화 데이터-발화 전사문 데이터에 대한 평균 음소인식 순위를 나타내는 그래프이다.Referring to FIG. 4, FIG. 4(a) is a graph showing the average phoneme recognition ranking for speech data-speech transcription data in a speech style similar to the acoustic model training data, and FIG. 4(b) is a graph showing the acoustic model training data and It is a graph showing the average phoneme recognition ranking for speech data-speech transcription data of different speech styles.

Referring to FIG. 4(a), the graph shows appropriate pairs of utterance data and utterance transcription data (300) and inappropriate pairs of utterance data and utterance transcription data (310). For utterance data whose speaking style is similar to that of the acoustic model training data, the appropriate pairs (300) and the inappropriate pairs (310) are separated at the point (320) where the average phoneme recognition rank is 4, and the two have almost no overlapping region (330).

Referring to FIG. 4(b), the graph shows appropriate pairs of utterance data and utterance transcription data (340) and inappropriate pairs of utterance data and utterance transcription data (350). For utterance data whose speaking style differs from that of the acoustic model training data, the appropriate pairs (340) and the inappropriate pairs (350) are separated at the point (360) where the average phoneme recognition rank is 6, and the region (370) where the two overlap is very small.

Taking FIG. 4(a) and FIG. 4(b) together, the performance of utterance verification based on the average phoneme recognition rank degrades only slightly, even for utterance data whose speaking style differs from that of the acoustic model training data.

The results shown in FIGS. 3 and 4 are attributed to an overall drop in the recognition probability of the correct phoneme for utterance data whose characteristics differ from the speaking styles learned by the acoustic model. Although the phoneme recognition probability drops, the phoneme recognition rank is found to change little.
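The contrast between probability and rank can be illustrated with a toy sketch. It assumes, purely for illustration, that a speaking-style mismatch flattens the acoustic model's posterior over phonemes, modeled here with a softmax temperature. The logits, the temperature values, and the function names are hypothetical and are not taken from the patent.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a flatter distribution."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def rank_of(probs, index):
    """1-based rank of probs[index]; rank 1 is the most probable phoneme."""
    return 1 + sum(1 for p in probs if p > probs[index])


# Hypothetical logits for one speech segment over a 5-phoneme inventory.
# Index 0 is the phoneme given by the transcription.
logits = [4.0, 2.5, 1.0, 0.5, -1.0]

matched_style = softmax(logits, temperature=1.0)
mismatched_style = softmax(logits, temperature=3.0)  # flatter posterior (assumed effect of mismatch)

# The probability of the transcribed phoneme drops, but its rank is unchanged.
print(matched_style[0], rank_of(matched_style, 0))
print(mismatched_style[0], rank_of(mismatched_style, 0))
```

A likelihood-based score shifts with the flattened posterior, which is one way to read the larger overlap in FIG. 3(b), while a rank-based score is unaffected by any change that preserves the ordering of the phonemes.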

The processor (110) may perform utterance verification based on the average phoneme recognition rank.

The processor (110) may use the following [Equation 5] to measure the average phoneme recognition rank.

[Equation 5]

$$\bar{R}(u,t) = \frac{1}{N}\sum_{i=1}^{N} \operatorname{rank}(p_i \mid x_i)$$

Here, $\bar{R}(u,t)$ is the average phoneme recognition rank, and $x_i$ denotes the speech feature corresponding to the i-th phoneme in the forced alignment result. N is the total number of phonemes in the utterance transcription data (t). $\operatorname{rank}(p_i \mid x_i)$ denotes the recognition rank of the phoneme $p_i$ for the speech feature $x_i$.

The processor (110) may determine that the higher the average phoneme recognition rank of the phoneme sequence of the utterance transcription data (that is, the smaller the value of the average phoneme recognition rank $\bar{R}(u,t)$), the more appropriate the match between the utterance data and the utterance transcription data.
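The average phoneme recognition rank described above can be sketched as follows. This is a minimal sketch, assuming the acoustic model yields one score per phoneme for each force-aligned segment. The dictionary representation, the function names, and the example scores are hypothetical.

```python
from typing import Dict, Sequence


def phoneme_rank(posterior: Dict[str, float], phoneme: str) -> int:
    """1-based recognition rank of `phoneme`; rank 1 means it scored highest."""
    target = posterior[phoneme]
    return 1 + sum(1 for score in posterior.values() if score > target)


def average_phoneme_rank(
    posteriors: Sequence[Dict[str, float]], phonemes: Sequence[str]
) -> float:
    """Mean rank of the transcribed phonemes over their aligned segments."""
    if not phonemes or len(posteriors) != len(phonemes):
        raise ValueError("need one posterior per transcribed phoneme")
    ranks = [phoneme_rank(post, ph) for post, ph in zip(posteriors, phonemes)]
    return sum(ranks) / len(ranks)


# Hypothetical segment-level scores for the transcription "k a t".
posteriors = [
    {"k": 0.7, "a": 0.2, "t": 0.1},  # "k" is ranked 1st
    {"k": 0.3, "a": 0.5, "t": 0.2},  # "a" is ranked 1st
    {"k": 0.5, "a": 0.3, "t": 0.2},  # "t" is ranked 3rd
]
print(average_phoneme_rank(posteriors, ["k", "a", "t"]))  # (1 + 1 + 3) / 3
```

A smaller return value corresponds to a higher average rank, so a pair is judged more appropriate as the value approaches 1.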

ⅳ) Two-stage utterance verification

To address the problem that arises in anti-phoneme-model-based utterance verification for utterance data whose speaking style differs from that of the acoustic model training data, the processor (110) may combine utterance verification based on the anti-phoneme model with utterance verification based on the average phoneme recognition rank.

When the anti-phoneme model indicates that the utterance data and the utterance transcription data have at least a certain level of appropriateness, the processor (110) may perform two-stage utterance verification, in which utterance verification based on the anti-phoneme model is combined with utterance verification based on the average phoneme recognition rank.

The processor (110) may use the following [Equation 6] to combine utterance verification based on the anti-phoneme model with utterance verification based on the average phoneme recognition rank.

[Equation 6]

$$S(u,t) = \begin{cases} |P|, & d(u,t) \ge \tau \\ \bar{R}(u,t), & d(u,t) < \tau \end{cases}$$

In [Equation 6], when the misclassification distance $d(u,t)$ measured between the utterance data (u) and the utterance transcription data (t) is greater than or equal to the threshold $\tau$, the processor (110) assigns the lowest rank as the rank value. The lowest rank may be set to the size of the entire phoneme set P. The threshold $\tau$ may vary depending on the speaking style.

When the misclassification distance $d(u,t)$ is greater than or equal to the threshold $\tau$, the processor (110) may use the misclassification distance $d(u,t)$ between the utterance data (u) and the utterance transcription data (t) as the measure of utterance verification between the utterance data (u) and the utterance transcription data (t).

When the misclassification distance $d(u,t)$ is greater than or equal to the threshold $\tau$, the processor (110) may perform utterance verification based on the anti-phoneme model and output utterance verification result data.

When the misclassification distance $d(u,t)$ is smaller than the threshold $\tau$, the processor (110) may use the average phoneme recognition rank $\bar{R}(u,t)$ between the utterance data (u) and the utterance transcription data (t) as the measure of utterance verification between the utterance data (u) and the utterance transcription data (t). The threshold $\tau$ may vary depending on the speaking style.

When the misclassification distance $d(u,t)$ is smaller than the threshold $\tau$, the processor (110) may perform utterance verification based on the average phoneme recognition rank and output utterance verification result data.

The processor (110) may determine that the higher the recognition rank obtained from [Equation 6] (that is, the smaller its value), the more appropriate the match between the utterance data (u) and the utterance transcription data (t).
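The two-stage rule described above can be sketched as a small scoring function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the phoneme set size of 40 is an arbitrary example, and the threshold of -1.1 and rank cutoff of 6 simply reuse the separating points read from FIG. 3(b) and FIG. 4(b).

```python
def two_stage_score(
    misclassification_distance: float,
    average_rank: float,
    threshold: float,
    phoneme_set_size: int,
) -> float:
    """Return the rank-valued verification score; smaller means more appropriate."""
    if misclassification_distance >= threshold:
        # Stage 1 rejects the pair: assign the lowest possible rank, |P|.
        return float(phoneme_set_size)
    # Stage 1 accepts the pair: defer to the average phoneme recognition rank.
    return average_rank


def is_appropriate(score: float, rank_cutoff: float) -> bool:
    """Decide appropriateness by comparing the score with a rank cutoff (hypothetical)."""
    return score <= rank_cutoff


# Illustrative values only.
rejected = two_stage_score(-0.5, average_rank=2.0, threshold=-1.1, phoneme_set_size=40)
accepted = two_stage_score(-2.0, average_rank=2.0, threshold=-1.1, phoneme_set_size=40)
print(rejected, is_appropriate(rejected, rank_cutoff=6.0))
print(accepted, is_appropriate(accepted, rank_cutoff=6.0))
```

Mapping a stage-one rejection to the worst rank keeps the output on a single scale, so one rank cutoff can serve both branches.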

The interface module (120) may receive the utterance data (101) and the utterance transcription data (102) from outside the utterance verification device (100).

The interface module (120) may output the utterance verification result data (103).

The interface module (120) may be connected to the acoustic model (104) to transmit and receive data.

The interface module (120) may receive input from a user of the utterance verification device (100).

The database (130) may store the misclassification distance measured by the processor (110).

The database (130) may store data that the interface module (120) receives from the acoustic model (104).

The database (130) may store the utterance data (101) and the utterance transcription data (102) received by the interface module (120).

The database (130) may store each phoneme of the phoneme sequence and the corresponding speech signal interval of the utterance data (101), as determined by the processor (110) by applying forced alignment to the speech features extracted from the utterance data (101) and the phoneme sequence converted from the utterance transcription data (102).

The database (130) may store the utterance verification result data (103).

The database (130) may store values calculated by the processor (110).

The term 'device or module' as used herein denotes a logical unit of configuration, and it will be apparent to those skilled in the art to which the present invention pertains that it does not necessarily denote a physically separate component.

FIG. 5 is a flowchart showing an utterance verification method according to an embodiment.

Referring to FIG. 5, in the utterance verification method according to an embodiment, the utterance verification device acquires utterance data and utterance transcription data (400).

The utterance data may be data in wave (WAV) format, and the utterance transcription data may be data in text (TXT) format.

The utterance verification device determines the speech signal intervals (410).

Here, the utterance verification device may apply forced alignment to the speech features extracted from the utterance data and the phoneme sequence converted from the utterance transcription data, thereby determining each phoneme of the phoneme sequence and the corresponding speech signal interval of the utterance data (101).
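The forced alignment step can be illustrated with a minimal dynamic-programming sketch. It assumes frame-level log-probabilities per phoneme are already available and that every value is finite. It does not use HMM transition probabilities or any particular toolkit, so it is a simplified stand-in for the aligner an actual system would use. All names and the example values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def forced_align(
    frame_logprobs: Sequence[Dict[str, float]], phonemes: Sequence[str]
) -> List[Tuple[str, int, int]]:
    """Split the frames into consecutive segments, one per phoneme, in order.

    Returns (phoneme, start_frame, end_frame) with end_frame exclusive,
    maximizing the summed log-probability of each frame's assigned phoneme.
    """
    n_frames, n_phones = len(frame_logprobs), len(phonemes)
    if n_phones == 0 or n_frames < n_phones:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    # best[t][j]: best score with frame t assigned to the j-th phoneme.
    best = [[neg_inf] * n_phones for _ in range(n_frames)]
    advanced = [[False] * n_phones for _ in range(n_frames)]
    best[0][0] = frame_logprobs[0][phonemes[0]]
    for t in range(1, n_frames):
        for j in range(min(t, n_phones - 1) + 1):
            stay = best[t - 1][j]
            move = best[t - 1][j - 1] if j > 0 else neg_inf
            advanced[t][j] = move > stay
            best[t][j] = max(stay, move) + frame_logprobs[t][phonemes[j]]
    # Trace back from the last frame, which must belong to the last phoneme.
    segments = []
    j, end = n_phones - 1, n_frames
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][j]:
            segments.append((phonemes[j], t, end))
            end, j = t, j - 1
    segments.append((phonemes[j], 0, end))
    segments.reverse()
    return segments


# Hypothetical log-probabilities for four frames and the phoneme sequence "a b".
frames = [
    {"a": -0.1, "b": -3.0},
    {"a": -0.2, "b": -2.0},
    {"a": -2.5, "b": -0.1},
    {"a": -3.0, "b": -0.2},
]
print(forced_align(frames, ["a", "b"]))
```

Each returned segment identifies the frames whose speech features are paired with a phoneme in the later scoring steps.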

The utterance verification device measures the misclassification distance based on the anti-phoneme model (420).

Here, the utterance verification device may measure the misclassification distance for the utterance data and the utterance transcription data based on the anti-phoneme model.
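This passage does not restate the formula for the misclassification distance, so the sketch below assumes one common form from minimum classification error training: the smoothed score of the competing phonemes (standing in for the anti-phoneme model) minus the score of the transcribed phoneme, averaged over the aligned segments. The definition used in the patent may differ. The function name, the smoothing parameter `eta`, and the example log-likelihoods are hypothetical.

```python
import math
from typing import Dict, Sequence


def misclassification_distance(
    log_likelihoods: Sequence[Dict[str, float]],
    phonemes: Sequence[str],
    eta: float = 1.0,
) -> float:
    """Average over segments of (anti-phoneme score - transcribed phoneme score).

    Negative values mean the transcribed phonemes outscore their competitors.
    """
    if not phonemes or len(log_likelihoods) != len(phonemes):
        raise ValueError("need one score table per transcribed phoneme")
    total = 0.0
    for scores, target in zip(log_likelihoods, phonemes):
        competitors = [v for ph, v in scores.items() if ph != target]
        if not competitors:
            raise ValueError("each segment needs at least one competing phoneme")
        # Assumed anti-phoneme score: (1/eta) * log(mean(exp(eta * v))),
        # computed with the usual max-shift for numerical stability.
        peak = max(competitors)
        mean_exp = sum(math.exp(eta * (v - peak)) for v in competitors) / len(competitors)
        anti_score = peak + math.log(mean_exp) / eta
        total += anti_score - scores[target]
    return total / len(phonemes)


# Hypothetical log-likelihoods for a single aligned segment.
segment = [{"a": -0.1, "b": -3.0, "c": -3.0}]
print(misclassification_distance(segment, ["a"]))  # matching transcription
print(misclassification_distance(segment, ["b"]))  # mismatching transcription
```

Under this assumed form, a matching transcription yields a negative distance and a mismatching one a positive distance, which agrees with the negative separating points shown in FIG. 3.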

The utterance verification device determines the appropriateness of the utterance data and the utterance transcription data based on the anti-phoneme model (430).

Here, the utterance verification device may use the measured misclassification distance to determine the appropriateness of the utterance data and the utterance transcription data.

The utterance verification device compares the misclassification distance with the threshold (440).

Here, the threshold may vary depending on the speaking style.

The utterance verification device performs utterance verification based on the average phoneme recognition rank (450).

Here, when the misclassification distance is greater than the preset threshold, the utterance verification device may assign the lowest rank as the rank value of the utterance verification based on the average phoneme recognition rank. When the misclassification distance is smaller than the preset threshold, the utterance verification device may perform utterance verification based on the average phoneme recognition rank on the utterance data and the utterance transcription data.

The utterance verification device outputs utterance verification result data (460).

Here, the utterance verification device may output the utterance verification result only when the utterance verification result data is at or above a preset level (for example, when the pair is appropriate).
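The flow of FIG. 5 (steps 400 to 460) can be sketched as a single driver function. This is a minimal sketch, assuming the alignment, distance, and rank computations are supplied as callables. The `VerificationResult` type, the parameter names, and the stub values in the usage lines are hypothetical, and the sketch uses the greater-than-or-equal comparison from the description of [Equation 6].

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

Segment = Tuple[str, int, int]  # (phoneme, start_frame, end_frame)


@dataclass
class VerificationResult:
    misclassification_distance: float
    score: float  # rank-valued score; smaller means more appropriate
    appropriate: bool


def run_verification(
    utterance: Any,
    transcript: str,
    align: Callable[[Any, str], List[Segment]],
    measure_distance: Callable[[Any, List[Segment]], float],
    measure_average_rank: Callable[[Any, List[Segment]], float],
    threshold: float,
    rank_cutoff: float,
    phoneme_set_size: int,
) -> Optional[VerificationResult]:
    """Run steps 410 to 460 on data acquired in step 400; return None when nothing is output."""
    segments = align(utterance, transcript)           # step 410: forced alignment
    distance = measure_distance(utterance, segments)  # steps 420 and 430: anti-phoneme model
    if distance >= threshold:                         # step 440: compare with threshold
        score = float(phoneme_set_size)               # step 450: assign the lowest rank
    else:
        score = measure_average_rank(utterance, segments)  # step 450: rank-based verification
    result = VerificationResult(distance, score, score <= rank_cutoff)
    return result if result.appropriate else None     # step 460: output only when appropriate


# Usage with stub callables standing in for real components.
stub_align = lambda u, t: [("a", 0, 2)]
kept = run_verification(b"", "a", stub_align, lambda u, s: -2.0, lambda u, s: 1.5,
                        threshold=-1.1, rank_cutoff=6.0, phoneme_set_size=40)
dropped = run_verification(b"", "a", stub_align, lambda u, s: 0.3, lambda u, s: 1.5,
                           threshold=-1.1, rank_cutoff=6.0, phoneme_set_size=40)
print(kept, dropped)
```

Passing the three computations in as callables keeps the driver independent of any particular acoustic model or aligner.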

Although all of the components constituting the embodiments of the present invention have been described above as being combined into one or as operating in combination, the present invention is not necessarily limited to these embodiments. That is, within the scope of the object of the present invention, all of the components may be selectively combined into one or more units and operated.

In addition, although each of the components may be implemented as an independent piece of hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware. The codes and code segments constituting such a computer program can be readily inferred by those skilled in the art of the present invention.

Such a computer program may be stored in a computer-readable storage medium and read and executed by a computer, thereby implementing the embodiments of the present invention. Storage media for the computer program may include magnetic recording media, optical recording media, and the like.

In addition, terms such as "include", "comprise", or "have" used above mean that the corresponding component may be present unless specifically stated otherwise, and should therefore be interpreted as allowing further components to be included rather than as excluding other components.

All terms, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains, unless defined otherwise. Commonly used terms, such as those defined in dictionaries, should be interpreted as consistent with their meaning in the context of the related art and, unless explicitly defined in the present invention, are not to be interpreted in an idealized or excessively formal sense.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention.

Accordingly, the embodiments disclosed in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within an equivalent scope should be interpreted as falling within the scope of rights of the present invention.

100... Utterance verification device

Claims

1. An utterance verification method comprising:
performing utterance verification on utterance data and utterance transcription data based on an average phoneme recognition rank.

2. The method of claim 1, further comprising:
measuring a misclassification distance for the utterance data and the utterance transcription data based on an anti-phoneme model; and
comparing the measured misclassification distance with a preset threshold, and determining, according to the comparison result, whether to perform the utterance verification based on the average phoneme recognition rank.

3. The method of claim 2, further comprising:
determining the appropriateness of the utterance data and the utterance transcription data based on the anti-phoneme model.

4. The method of claim 2, wherein determining whether to perform the utterance verification based on the average phoneme recognition rank comprises:
when the measured misclassification distance is greater than the preset threshold, assigning the lowest rank as the rank value of the utterance verification based on the average phoneme recognition rank; and
when the measured misclassification distance is smaller than the preset threshold, determining to perform the utterance verification based on the average phoneme recognition rank on the utterance data and the utterance transcription data.

5. The method of claim 1, wherein the average phoneme recognition rank is determined based on a pre-trained acoustic model.

6. An utterance verification device comprising a processor,
wherein the processor performs utterance verification on utterance data and utterance transcription data based on an average phoneme recognition rank.

7. The device of claim 6, wherein the processor:
measures a misclassification distance for the utterance data and the utterance transcription data based on an anti-phoneme model; and
compares the measured misclassification distance with a preset threshold, and determines, according to the comparison result, whether to perform the utterance verification based on the average phoneme recognition rank.

8. The device of claim 7, wherein the processor determines the appropriateness of the utterance data and the utterance transcription data based on the anti-phoneme model.

9. The device of claim 7, wherein the processor:
when the measured misclassification distance is greater than the preset threshold, assigns the lowest rank as the rank value of the utterance verification based on the average phoneme recognition rank; and
when the measured misclassification distance is smaller than the preset threshold, determines to perform the utterance verification based on the average phoneme recognition rank on the utterance data and the utterance transcription data.

10. The device of claim 6, wherein the average phoneme recognition rank is determined based on a pre-trained acoustic model.

11. A computer-readable recording medium storing a program for performing the method of any one of claims 1 to 5.