KR20130048309A

KR20130048309A - Speech recognition method, decision tree construction method and apparatus for speech recognition

Info

Publication number: KR20130048309A
Application number: KR1020110113082A
Authority: KR
Inventors: 김영준
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2011-11-02
Filing date: 2011-11-02
Publication date: 2013-05-10
Also published as: KR101839950B1

Abstract

PURPOSE: A speech recognition method, a decision tree construction method for speech recognition, and a speech recognition apparatus are provided that perform both speech recognition and pronunciation error detection by building separate decision trees for phonemes that contain pronunciation errors and phonemes that do not. CONSTITUTION: A phoneme model generating unit (240) separates a plurality of training utterances into first phonemes, which contain no pronunciation errors, and second phonemes, which contain pronunciation errors, and constructs a first and a second decision tree from them, respectively. When a terminal device requests speech recognition for a user's voice, a service providing unit (230) performs speech recognition and error detection on the user's voice using the first and second decision trees. [Reference numerals] (210) Communication unit; (220) Storage unit; (230) Service providing unit; (231) Speech recognition module; (240) Phoneme model generating unit; (AA) Phoneme model

Description

Speech recognition method, decision tree construction method and apparatus for speech recognition

The present invention relates to speech recognition technology, and more particularly to a speech recognition method, a decision tree construction method for speech recognition, and an apparatus for speech recognition that can improve the detection of pronunciation errors commonly caused by native-language interference when a foreign language is spoken.

FIG. 1(a) shows the general sequence of a continuous speech recognition method. Referring to FIG. 1(a), when a speech signal uttered by a speaker is input, a feature vector extraction step produces from the signal a feature vector sequence O that compactly carries its acoustic information. From this, the probability P(O|W) that the word sequence W generates the feature vector sequence O is computed; this probability is determined by the phoneme model.
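For reference, the role of P(O|W) in decoding is usually written as below. This is the standard textbook formulation rather than something stated in the text above; in particular, the language model term P(W) is an assumption introduced here for illustration.

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
        = \arg\max_{W} P(O \mid W)\, P(W)
```

The last step holds because P(O) does not depend on W, so only the phoneme (acoustic) model term P(O|W) and the prior P(W) affect the choice of word sequence.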

From the viewpoint of a speech recognition system, an utterance forms a hierarchical structure: a sentence is obtained by concatenating words subject to linguistic constraints, and each word is obtained by combining phonemes defined in a pronunciation dictionary. If the phoneme itself is chosen as the basic unit of the phoneme model, however, the phonological variation that occurs within and between actual words cannot be reflected accurately.

The most common approach to this problem is the triphone model. A triphone model is a context-dependent model in which each distinct combination of a phoneme with its left and right context is represented as a separate phoneme model.
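As a small illustration of what "considering the left and right context" means, the following sketch expands a phoneme sequence into triphone labels. The `left-center+right` label notation, the function name `to_triphones`, and the `sil` boundary symbol are assumptions made for this example; the text does not prescribe a notation.

```python
from typing import List


def to_triphones(phones: List[str], boundary: str = "sil") -> List[str]:
    """Expand a phoneme sequence into context-dependent triphone labels.

    Each label has the form ``left-center+right``. The ``boundary`` symbol
    used to pad both ends is an assumption of this sketch.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# The same phoneme "ae" would get a different model in a different context.
print(to_triphones(["k", "ae", "t"]))
```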

With a triphone model, however, the number of model parameters becomes so large that the training data are insufficient. Some models may therefore remain untrained after the training stage, and such models cannot be used. Clustering methods, for example ones based on decision trees, are used to address this.
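The data shortage can be made concrete by counting how many possible triphones never occur in a training set. The sketch below is only illustrative: the three-phoneme inventory, the observed labels, and the function name `unseen_triphones` are hypothetical.

```python
from itertools import product
from typing import Iterable, Set


def unseen_triphones(phonemes: Iterable[str], observed: Iterable[str]) -> Set[str]:
    """Return every possible triphone label that has no training example."""
    phonemes = list(phonemes)
    possible = {f"{l}-{c}+{r}" for l, c, r in product(phonemes, repeat=3)}
    return possible - set(observed)


phonemes = ["a", "k", "t"]          # toy inventory (assumption)
observed = ["k-a+t", "t-a+k"]       # triphones seen in training (assumption)
missing = unseen_triphones(phonemes, observed)
print(len(missing), "of", len(phonemes) ** 3, "triphones have no training data")
```

The number of possible triphones grows with the cube of the inventory size, so an inventory of 40 phonemes (an assumed figure) would already give 40³ = 64,000 candidate models.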

The decision tree is a binary tree in which each node holds a question based on phonetic characteristics. In this method, the allophones derived from the same phoneme first form a single cluster; the cluster is then split by selecting, among the phonetic questions available for splitting it, the question that maximizes the similarity of the entire training data. When this similarity is computed, all allophones within a split cluster are assumed to be tied. Splitting of the binary tree stops when the similarity barely changes between before and after a split. Finally, if the training data available for the allophones assigned to a terminal node are insufficient, terminal nodes are merged in a way that minimizes the change in similarity. If an untrained (unseen) model is encountered during actual recognition, a decision tree such as the one in FIG. 1(b) is used to find a phonetically similar node, and the parameters of the model associated with that node are shared.
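The splitting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the patent: the "similarity" is taken to be the log-likelihood of the pooled data under one shared 1-D Gaussian, each allophone is summarized by hypothetical sufficient statistics, and the questions, names, and numbers are invented for the example. The merging of data-poor terminal nodes is omitted.

```python
import math
from typing import Dict, List, Set, Tuple

Stats = Tuple[float, float, float]   # (count, sum, sum of squares) of a 1-D feature
Question = Tuple[str, Set[str]]      # ("L" or "R" context, set of phonemes asked about)


def stats(n: int, mean: float, var: float) -> Stats:
    """Build sufficient statistics from a count, mean and variance."""
    return (n, n * mean, n * (var + mean * mean))


def log_likelihood(items: Dict[str, Stats]) -> float:
    """Log-likelihood of all data in a cluster under one tied Gaussian."""
    n = sum(s[0] for s in items.values())
    mean = sum(s[1] for s in items.values()) / n
    var = max(sum(s[2] for s in items.values()) / n - mean * mean, 1e-4)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)


def context(triphone: str, side: str) -> str:
    """Return the left or right context of a ``l-c+r`` label."""
    left, _, rest = triphone.partition("-")
    _, _, right = rest.partition("+")
    return left if side == "L" else right


def build_tree(items: Dict[str, Stats], questions: List[Question],
               threshold: float) -> List[List[str]]:
    """Greedily split a cluster; return the leaves as lists of tied labels."""
    base = log_likelihood(items)
    best_gain, best_split = 0.0, None
    for side, phone_set in questions:
        yes = {t: s for t, s in items.items() if context(t, side) in phone_set}
        no = {t: s for t, s in items.items() if t not in yes}
        if not yes or not no:
            continue  # the question does not divide this cluster
        gain = log_likelihood(yes) + log_likelihood(no) - base
        if gain > best_gain:
            best_gain, best_split = gain, (yes, no)
    # Stop when no question improves the likelihood by at least the threshold.
    if best_split is None or best_gain < threshold:
        return [sorted(items)]
    yes, no = best_split
    return build_tree(yes, questions, threshold) + build_tree(no, questions, threshold)


allophones = {
    "m-a+t": stats(100, 0.0, 1.0),
    "n-a+t": stats(100, 0.1, 1.0),
    "p-a+t": stats(100, 3.0, 1.0),
    "b-a+t": stats(100, 3.1, 1.0),
}
questions = [("L", {"m", "n"}), ("L", {"m"}), ("L", {"p"}), ("R", {"t"})]
for threshold in (0.1, 1.0, 300.0):
    print(threshold, build_tree(allophones, questions, threshold))
```

With these toy statistics, a larger stopping threshold yields fewer leaves, so more allophones end up sharing one model.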

Clustering with a decision tree, however, merges several models into one, which widens the variance of each model and therefore reduces the ability to detect each model accurately.

Moreover, in foreign language learning and assessment tailored to regional characteristics, it is important to detect accurately the phonemes that are mispronounced because of native-language interference. Conventional speech recognition technology does not take the detection of such phonemes into account, so accurate detection of pronunciation errors has been difficult.

The present invention aims to provide a speech recognition method, a decision tree construction method for speech recognition, and an apparatus for speech recognition that use decision trees constructed separately for error phonemes, which contain pronunciation errors, and normal phonemes, which do not, thereby improving speech recognition performance while also improving the detection of pronunciation errors caused by native-language interference when a foreign language is spoken.

As a means for solving the problem, the present invention provides a service device for speech recognition, comprising: a phoneme model generating unit that distinguishes, in a plurality of training utterances, first-type phonemes containing no pronunciation error from second-type phonemes containing a pronunciation error, constructs a first decision tree using the first-type phonemes, and constructs a second decision tree using the second-type phonemes; a storage unit that stores the first decision tree and the second decision tree; and a service providing unit that, when speech recognition of a user's voice is requested by a terminal device, performs speech recognition and error detection on the user's voice using the first decision tree and the second decision tree and provides the speech recognition result and the error detection result to the terminal device.

In the service device for speech recognition according to the present invention, the first decision tree and the second decision tree are implemented with a triphone model.

In the service device for speech recognition according to the present invention, the service providing unit provides to the terminal device, as the error detection result, error phoneme information that includes one or more of the normal pronunciation symbol, the erroneous pronunciation symbol, a correction method, and the cause of the error for an error phoneme contained in the user's voice.
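The error phoneme information listed above could be carried in a small record such as the one sketched here. The class name, field names, and the example values are all hypothetical; the text only names the four kinds of information and requires that at least one be present.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class ErrorPhonemeInfo:
    """Error detection result for one error phoneme (field names are assumptions)."""
    normal_symbol: Optional[str] = None      # normal pronunciation symbol
    error_symbol: Optional[str] = None       # erroneous pronunciation symbol
    correction_method: Optional[str] = None  # how to correct the pronunciation
    error_cause: Optional[str] = None        # why the error occurs

    def __post_init__(self) -> None:
        # "One or more of" the four items must be provided.
        if all(value is None for value in asdict(self).values()):
            raise ValueError("at least one field is required")

    def to_payload(self) -> Dict[str, str]:
        """Return only the fields that are set, e.g. for sending to a terminal."""
        return {k: v for k, v in asdict(self).items() if v is not None}


# Illustrative values only; not taken from the patent.
info = ErrorPhonemeInfo(normal_symbol="f", error_symbol="p",
                        error_cause="native-language interference")
print(info.to_payload())
```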

In the service device for speech recognition according to the present invention, when constructing the second decision tree, the phoneme model generating unit sets the threshold for splitting the decision tree higher than for the first decision tree, so that the second decision tree has fewer branches than the first decision tree.

In the service device for speech recognition according to the present invention, the second-type phonemes may include error phonemes arising from the pronunciation habits of a particular speaker.

In addition, as another means for solving the above problem, the present invention provides a terminal device for speech recognition, comprising: a storage unit that stores a phoneme model including a first decision tree constructed using first-type phonemes containing no pronunciation error and a second decision tree constructed using second-type phonemes containing a pronunciation error; an input unit for receiving a user request for speech recognition; an audio processing unit for receiving a user's voice; a control unit that, when a user request for speech recognition is input, receives the user's voice through the audio processing unit and performs speech recognition and error detection using the first decision tree and the second decision tree; and an output unit that outputs the results of the speech recognition and error detection.

In the terminal device for speech recognition according to the present invention, the control unit may further include a phoneme model generating module that classifies a plurality of training utterances into first-type phonemes and second-type phonemes, constructs the first decision tree using the first-type phonemes, and constructs the second decision tree using the second-type phonemes.

In the terminal device for speech recognition according to the present invention, when constructing the second decision tree, the phoneme model generating module sets the threshold for splitting the decision tree higher than for the first decision tree, so that the second decision tree has fewer branches than the first decision tree.

In the terminal device for speech recognition according to the present invention, the control unit may provide, as the error detection result, error phoneme information that includes one or more of the normal pronunciation symbol, the erroneous pronunciation symbol, a correction method, and the cause of the error for an error phoneme contained in the user's voice.

As yet another means for solving the above problem, the present invention may provide a decision tree construction method for speech recognition, comprising: collecting a plurality of training utterances; classifying the plurality of training utterances into first-type phonemes, which contain no pronunciation error, and second-type phonemes, which contain a pronunciation error; constructing a first decision tree using the first-type phonemes; and constructing a second decision tree using the second-type phonemes, wherein, when the second decision tree is constructed, the threshold for splitting the decision tree is set higher than when the first decision tree is constructed, so that the second decision tree has fewer branches than the first decision tree.

As yet another means for solving the above problem, the present invention may provide a speech recognition method, comprising: receiving speech to be recognized; and performing recognition and error detection on the first-type phonemes and second-type phonemes contained in the speech to be recognized, using a first decision tree constructed by clustering first-type phonemes, which contain no pronunciation error, and a second decision tree that is constructed by clustering second-type phonemes, which contain a pronunciation error, and that has fewer branches than the first decision tree.

In constructing a decision-tree-based phoneme model for speech recognition, the present invention separates, in the training speech, phonemes that contain no pronunciation error from phonemes that contain a pronunciation error and constructs a different decision tree for each, so that speech recognition and mispronunciation detection can be performed together. In addition, for phonemes containing pronunciation errors, the threshold for splitting the decision tree is set higher than the threshold applied to phonemes without pronunciation errors. This minimizes the number of models clustered for the error phonemes and prevents their variance from growing, so that errors in the input speech can be detected quickly and accurately.

FIG. 1(a) is a flowchart of a general continuous speech recognition method.
FIG. 1(b) is a schematic diagram of the process of clustering phoneme models with a decision tree in speech recognition technology.
FIG. 2 is a block diagram of a speech recognition system according to an embodiment of the present invention.
FIG. 3 is a block diagram of the configuration of a service device for speech recognition in the speech recognition system according to an embodiment of the present invention.
FIG. 4 is a flowchart of the speech recognition process in the speech recognition system according to an embodiment of the present invention.
FIG. 5 is a block diagram of the configuration of a terminal device for speech recognition according to another embodiment of the present invention.
FIG. 6 is a flowchart of a speech recognition process according to another embodiment of the present invention.
FIG. 7 is a flowchart of the decision tree construction method according to the present invention.
FIG. 8 is an exemplary diagram of a decision tree constructed according to the present invention.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the following description and the accompanying drawings, detailed descriptions of well-known functions or configurations that could obscure the subject matter of the present invention are omitted. Note also that the same components are denoted by the same reference numerals throughout the drawings wherever possible.

The terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. They should be interpreted with the meanings and concepts that accord with the technical idea of the present invention, on the principle that an inventor may appropriately define terms in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiment of the present invention and do not represent all of its technical ideas, so it should be understood that various equivalents and modifications that could replace them may exist at the time of filing of this application.

FIG. 2 is a block diagram of a speech recognition system that detects mispronunciation using decision trees according to the present invention. Referring to FIG. 2, the speech recognition system according to an embodiment of the present invention may comprise a terminal device (100), a service device (200), and a network (300).

That is, in an embodiment of the present invention, speech recognition may be provided by server-based computing. Server-based computing here means that the processing of the speech recognition service, which detects mispronunciation using decision trees according to the present invention, is carried out on some device connected through the network, while the terminal device performs only input and output. For convenience of description, the device that provides this speech recognition service is referred to below as the service device (200).

The service device (200) is a server device that provides, over the network (300), the speech recognition service that detects mispronunciation using decision trees according to the present invention to a plurality of terminal devices (100). More specifically, the service device (200) distinguishes first-type phonemes, which contain no pronunciation error (hereinafter, normal phonemes), from second-type phonemes, which contain a pronunciation error (hereinafter, error phonemes), and constructs a decision-tree-based phoneme model for each; for the error phonemes, the threshold for splitting the decision tree is set higher than for the normal phonemes. Here the error phonemes include pronunciation errors that can arise from native-language interference when a user speaks a foreign language. The decision tree constructed from the normal phonemes is called the first decision tree, and the decision tree constructed from the error phonemes is called the second decision tree. When the terminal device (100) requests speech recognition for a user's voice, the service device (200) performs speech recognition and error detection using the first and second decision trees and provides the results to the terminal device (100). For example, the service device (200) may perform speech recognition on the user's voice using the first decision tree and then use the second decision tree to detect pronunciation errors contained in the recognized voice.
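The text does not specify how the two trees are combined at recognition time. The sketch below shows one possible arrangement, purely as an assumption: the leaves of each tree are represented by 1-D Gaussian models, a speech segment is scored against both sets, and the segment is flagged as an error when an error-phoneme model scores best. The names `Gaussian`, `recognize_segment`, and the labels such as `f~p` are hypothetical.

```python
import math
from typing import Dict, NamedTuple, Tuple


class Gaussian(NamedTuple):
    """Toy 1-D model standing in for the model attached to a tree leaf."""
    mean: float
    var: float

    def log_pdf(self, x: float) -> float:
        return -0.5 * (math.log(2 * math.pi * self.var) + (x - self.mean) ** 2 / self.var)


def recognize_segment(x: float,
                      normal_models: Dict[str, Gaussian],
                      error_models: Dict[str, Gaussian]) -> Tuple[str, bool]:
    """Return the best-scoring label and whether it came from the error models."""
    best_normal = max(normal_models, key=lambda k: normal_models[k].log_pdf(x))
    best_error = max(error_models, key=lambda k: error_models[k].log_pdf(x))
    is_error = error_models[best_error].log_pdf(x) > normal_models[best_normal].log_pdf(x)
    return (best_error if is_error else best_normal), is_error


# Hypothetical models: "f~p" stands for "/f/ realized as /p/".
normal_models = {"f": Gaussian(0.0, 1.0), "s": Gaussian(-3.0, 1.0)}
error_models = {"f~p": Gaussian(3.0, 1.0)}

print(recognize_segment(0.2, normal_models, error_models))
print(recognize_segment(2.8, normal_models, error_models))
```

A real system would score sequences of feature vectors with HMM states rather than single scalar values; the scalar feature keeps the comparison between the two model sets easy to follow.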

The service device (200) may operate in a server-client computing manner or on a cloud computing basis. That is, it may provide the terminal device (100) with one or more of the computing resources, for example hardware and software, needed to run the speech recognition service that detects mispronunciation using decision trees.

The terminal device (100) may be any of the various kinds of devices used by a user, for example a PC (personal computer), a notebook computer, a mobile phone, a tablet PC, a navigation terminal, a smartphone, a PDA (personal digital assistant), a smart TV, a PMP (portable multimedia player), or a digital broadcast receiver. These are only examples; the term should be understood to cover any communication-capable device that has already been developed and commercialized or that will be developed in the future.

The terminal device (100) can be used by a user who requests the speech recognition service. In the speech recognition system according to the present invention, the terminal device (100) receives the user's voice from the user, transmits it to the service device (200) to request speech recognition, receives from the service device (200) the speech recognition result or error detection result for that voice, and outputs it to the user.

The network (300) provides a path for transmitting and receiving data between the service device (200) and the terminal device (100). The network (300) is an IP network that provides large-capacity data transmission and reception and seamless data services over the Internet Protocol (IP), and may be an All-IP network, an IP network structure that integrates different networks on the basis of IP. The network (300) may also comprise one or more of a wired network, a WiBro (Wireless Broadband) network, a third-generation mobile network including WCDMA, a 3.5-generation mobile network including HSDPA (High Speed Downlink Packet Access) and LTE networks, a fourth-generation mobile network including LTE Advanced, a satellite network, and a wireless LAN including a Wi-Fi network.

FIG. 3 is a block diagram showing the detailed configuration of the service device (200) that provides the speech recognition service in the speech recognition service system according to an embodiment of the present invention. FIG. 3 presents the configuration of the service device (200) in functional units; in an actual implementation these may be distributed over several server devices or implemented in a single server device.

Referring to FIG. 3, the service device (200) according to an embodiment of the present invention may comprise a communication unit (210), a storage unit (220), a service providing unit (230), and a phoneme model generating unit (240).

The communication unit (210) exchanges data with the terminal device (100) through the network (300).

The storage unit (220) stores the data and programs needed for the operation of the service device (200). In particular, to provide the speech recognition service that detects mispronunciation by classifying error types with decision trees according to the present invention, it stores the phoneme model for normal phonemes and for the error phonemes that may occur. The phoneme model comprises the first decision tree, constructed for the normal phonemes, and the second decision tree, constructed for the error phonemes. The storage unit (220) may include any kind of storage medium, such as RAM (random access memory), ROM (read-only memory), a hard disk drive (HDD), flash memory, CD-ROM, or DVD.

When the terminal device (100) requests speech recognition for a user's voice, the service providing unit (230) performs speech recognition and error detection using the first and second decision trees for the normal and error phonemes, and provides the terminal device (100) with the speech recognition result and the error detection result. The error detection result includes error phoneme information comprising one or more of the normal pronunciation symbol, the erroneous pronunciation symbol, a correction method, and the cause of the error for an error phoneme contained in the user's voice.

The service providing unit (230) may include a speech recognition module (231) for providing the speech recognition service that detects mispronunciation using decision trees according to the present invention.

When the terminal device (100) requests speech recognition for a user's voice, the speech recognition module (231) classifies states according to the first and second decision trees, constructed from the normal and error phonemes respectively, and thereby recognizes the words contained in the user's voice. In particular, when an error phoneme is detected using the second decision tree, the module controls the service device so that error phoneme information, comprising one or more of the normal pronunciation symbol, the erroneous pronunciation symbol, a correction method, and the cause of the error, is provided to the terminal device (100).

The speech recognition module (231) may be implemented in software, in hardware, or in a combination of the two. For example, it may be stored in the storage unit (220) as a program and implemented by being executed by the service providing unit (230).

The phoneme model generating unit (240) generates the phoneme model required for speech recognition. In the present invention in particular, it divides the training utterances collected for generating the phoneme model into normal phonemes and error phonemes and constructs a phoneme model for each. The phoneme model is decision-tree based; more specifically, it comprises the first decision tree, constructed from the normal phonemes, and the second decision tree, constructed from the error phonemes. When the second decision tree is constructed, the threshold for splitting the decision tree is set higher than when the first decision tree is constructed, so that the second decision tree has fewer branches than the first.

The service device 200 configured as described above constructs the first and second decision trees for normal phonemes and for the error phonemes that may occur, respectively. For error phonemes, it classifies the phonemes against a higher threshold than for normal phonemes and then clusters the classified phonemes.

The service device 200 then uses the first and second decision trees generated by the phoneme model generation unit 240 to perform speech recognition and error detection on the user voice transmitted from the terminal device 100.

FIG. 4 is a flowchart illustrating a method of providing a speech recognition service that detects erroneous pronunciation using decision trees according to the present invention.

Referring to FIG. 4, the service device 200 divides the plurality of learning voices collected for generating the phoneme model into normal phonemes and error phonemes, and then constructs the first and second decision trees for the normal phonemes and the error phonemes, respectively (S105). This is described in more detail with reference to FIG. 7. The service device 200 collects a plurality of learning voices (S305). It then divides the collected learning voices into normal phonemes, which do not include a pronunciation error, and error phonemes, which do (S310). Next, through the phoneme model generation unit 240, it constructs the first decision tree using the normal phonemes (S315) and constructs the second decision tree using the error phonemes (S320). Here, the error phonemes are classified against a higher threshold than the normal phonemes and then clustered, which yields a second decision tree with fewer branches than the first decision tree.

The phoneme model according to the present invention may be a triphone-based phoneme model. The number of branches in a decision tree depends on the threshold used to split the tree: the higher the threshold, the fewer the branches. In the present invention, the splitting threshold for error phonemes is set higher than that for normal phonemes, which minimizes the number of clustered models in the second decision tree used for error detection and prevents the variance from growing. FIG. 8 shows an example of the first and second decision trees constructed according to the present invention: FIG. 8(a) is an example of the first decision tree and FIG. 8(b) is an example of the second decision tree. As shown, the second decision tree has fewer branches and fewer leaf nodes than the first decision tree.
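The relationship between the splitting threshold and the number of branches can be illustrated with a minimal sketch. It is not the patented implementation: the gain measure (reduction in squared error of a single scalar feature), the context questions, the toy samples, and the two threshold values are all assumptions chosen for illustration. A real triphone system would typically use a likelihood-based gain over acoustic model states.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Context = Tuple[str, str]          # (left phoneme, right phoneme) of a triphone
Sample = Tuple[Context, float]     # context plus one scalar feature (toy stand-in)

@dataclass
class Node:
    samples: List[Sample]
    question: Optional[str] = None  # None means this node is a leaf (one cluster)
    yes: Optional["Node"] = None
    no: Optional["Node"] = None

def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (assumed impurity measure)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def build_tree(samples: List[Sample],
               questions: Dict[str, Callable[[Context], bool]],
               threshold: float) -> Node:
    """Split on the best question while its gain reaches the threshold."""
    parent = sse([v for _, v in samples])
    best = None
    for name, asks in questions.items():
        yes = [s for s in samples if asks(s[0])]
        no = [s for s in samples if not asks(s[0])]
        if not yes or not no:
            continue  # the question does not separate these samples
        gain = parent - sse([v for _, v in yes]) - sse([v for _, v in no])
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    node = Node(samples)
    if best is not None and best[0] >= threshold:
        node.question = best[1]
        node.yes = build_tree(best[2], questions, threshold)
        node.no = build_tree(best[3], questions, threshold)
    return node

def count_leaves(node: Node) -> int:
    if node.question is None:
        return 1
    return count_leaves(node.yes) + count_leaves(node.no)

# Hypothetical context questions and toy data, for illustration only.
QUESTIONS = {
    "L_is_vowel": lambda c: c[0] in "aeiou",
    "R_is_vowel": lambda c: c[1] in "aeiou",
    "L_is_nasal": lambda c: c[0] in "mn",
}
SAMPLES = [
    (("a", "t"), 1.0), (("e", "t"), 1.2), (("m", "a"), 3.0),
    (("n", "a"), 3.1), (("s", "a"), 5.0), (("s", "t"), 7.0),
]

first_tree = build_tree(SAMPLES, QUESTIONS, threshold=0.1)    # lower threshold
second_tree = build_tree(SAMPLES, QUESTIONS, threshold=10.0)  # higher threshold
```

On this toy data the lower threshold keeps splitting until four leaves remain, while the higher threshold stops after the first split and leaves two, which mirrors the difference between FIG. 8(a) and FIG. 8(b).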

Referring again to FIG. 4, when a voice is input to the terminal device 100 by the user (S110), the terminal device 100 requests the service device 200 to perform speech recognition on the user voice (S115). The service device 200 then performs speech recognition and error detection on the received user voice using the previously constructed first and second decision trees (S120).

The service device 200 then provides the speech recognition and error detection results to the terminal device 100 (S125). The error detection result may include error phoneme information comprising one or more of a normal phonetic symbol, an error phonetic symbol, a correction method, and an error cause for an error phoneme contained in the user voice.
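One way to represent the error phoneme information is sketched below. The class name, field names, and the `to_payload` helper are hypothetical; the patent only states that the information includes one or more of the four items. The example values are illustrative and do not come from the patent.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class ErrorPhonemeInfo:
    """Error phoneme information; every field is optional, but at least
    one must be set ("one or more of" in the description)."""
    normal_symbol: Optional[str] = None      # normal phonetic symbol
    error_symbol: Optional[str] = None       # error phonetic symbol
    correction_method: Optional[str] = None  # how to correct the pronunciation
    error_cause: Optional[str] = None        # why the error occurred

    def __post_init__(self) -> None:
        if all(value is None for value in asdict(self).values()):
            raise ValueError("at least one field must be provided")

    def to_payload(self) -> dict:
        """Return only the fields that are set, e.g. for sending to a terminal."""
        return {k: v for k, v in asdict(self).items() if v is not None}

# Illustrative values only.
info = ErrorPhonemeInfo(
    normal_symbol="r",
    error_symbol="l",
    error_cause="native-language interference",
)
payload = info.to_payload()
```

Leaving unset fields out of the payload keeps the result compact when only some of the four items are available for a given error phoneme.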

The terminal device 100 outputs the speech recognition result and the error detection result received from the service device 200 so that the user can check them (S130).

As described above, the present invention constructs the first and second decision trees separately for normal phonemes and error phonemes and performs speech recognition and error detection using both trees. This further improves speech recognition performance and further improves the detection of pronunciation errors, in particular those caused by native-language interference.

In another embodiment of the present invention, the speech recognition processing that detects erroneous pronunciation using decision trees may be performed on the terminal device 100.

FIG. 5 is a block diagram illustrating the configuration of a terminal device 100 for providing a speech recognition service that detects erroneous pronunciation using decision trees according to another embodiment of the present invention.

Referring to FIG. 5, the terminal device 100 according to another embodiment of the present invention may include an input unit 110, an output unit 120, an audio processor 130, a storage unit 140, and a controller 150.

The input unit 110 generates user input signals for controlling or operating the terminal device 100 in response to the user's manipulation, and may be implemented with various types of input means. For example, the input unit 110 may include one or more of a key input means, a touch input means, a gesture input means, and a voice input means. The key input means generates a signal corresponding to a key when that key is operated; examples are a keypad and a keyboard. The touch input means recognizes an input by sensing the user touching a specific area; examples are a touch pad, a touch screen, and a touch sensor. The gesture input means recognizes a designated user action, such as shaking or moving the terminal device, approaching the terminal device, or blinking, as a specific input signal, and may include one or more of a geomagnetic sensor, an acceleration sensor, a camera, an altimeter, a gyro sensor, and a proximity sensor.

The output unit 120 displays on a screen the user interface for the speech recognition service that detects erroneous pronunciation using decision trees, and outputs the speech recognition and error detection results. The output unit 120 may be implemented as, for example, any one of an LCD (Liquid Crystal Display), a TFT-LCD (Thin Film Transistor-Liquid Crystal Display), LEDs (Light Emitting Diodes), OLEDs (Organic Light Emitting Diodes), AMOLEDs (Active Matrix Organic Light Emitting Diodes), a flexible display, and a three-dimensional display.

The audio processor 130 handles voice input and output. In the present invention, it captures the user voice that is the target of speech recognition, converts it into a voice signal, and provides the signal to the controller 150.

The storage unit 140 stores the data and programs required for the operation of the terminal device 100, and basically may store the operating system (OS) of the terminal device 100 and one or more application programs. In addition, in the present invention, the storage unit 140 stores the decision tree based phoneme models needed for speech recognition and for detecting erroneous pronunciation using decision trees. The phoneme model may include a first decision tree constructed from normal phonemes and a second decision tree constructed from error phonemes. The storage unit 140 may include any type of storage medium, such as RAM (Random Access Memory), ROM (Read Only Memory), a hard disk drive (HDD), flash memory, a CD-ROM, or a DVD.

The controller 150 controls the overall operation of the terminal device 100. It basically runs on the operating system stored in the storage unit 140 to establish the basic platform environment of the terminal device 100, and executes application programs selected by the user to provide their functions. In this embodiment, when speech recognition is requested by the user through the input unit 110, the controller 150 performs speech recognition and error detection on the user voice input through the audio processor 130 using the pre-stored first and second decision trees, and outputs the result to the output unit 120. The controller 150 may include a speech recognition module 151.

The speech recognition module 151 performs speech recognition and erroneous pronunciation detection on the user voice using the first and second decision trees. For example, it recognizes each word of the user voice using the first decision tree and detects whether each word contains an erroneous pronunciation using the second decision tree. When an error phoneme is detected, it also extracts error phoneme information including one or more of a normal phonetic symbol, an error phonetic symbol, a correction method, and an error cause for the error phoneme.
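The two-step use of the trees described here (recognize each word with the first model, then check it with the second) can be sketched as follows. The description does not say how the two models' outputs are compared, so the decision rule below, which flags a word when the error model scores it higher than the normal model, is an assumption. The scoring functions, the templates, and the one-dimensional features are toy stand-ins, not acoustic models.

```python
from typing import Callable, Dict, List, Sequence

# A scorer maps (candidate word, features of one segment) to a score;
# higher means a better match. Both scorers are hypothetical stand-ins
# for models backed by the first and second decision trees.
Scorer = Callable[[str, Sequence[float]], float]

def recognize_and_detect(segments: List[Sequence[float]],
                         vocabulary: List[str],
                         normal_score: Scorer,
                         error_score: Scorer) -> List[Dict[str, object]]:
    results = []
    for features in segments:
        # Step 1: pick the word using the normal-phoneme model.
        word = max(vocabulary, key=lambda w: normal_score(w, features))
        # Step 2 (assumed rule): flag the word if the error-phoneme model
        # explains the segment better than the normal-phoneme model.
        flagged = error_score(word, features) > normal_score(word, features)
        results.append({"word": word, "has_error": flagged})
    return results

# Toy templates: one number per word for each model.
NORMAL = {"rice": 1.0, "light": 4.0}
ERROR = {"rice": 2.0, "light": 5.0}

def toy_normal(word: str, features: Sequence[float]) -> float:
    return -abs(features[0] - NORMAL[word])

def toy_error(word: str, features: Sequence[float]) -> float:
    return -abs(features[0] - ERROR[word])

results = recognize_and_detect([[1.1], [4.8]], ["rice", "light"],
                               toy_normal, toy_error)
```

With these toy inputs the first segment is recognized as "rice" without an error flag, and the second is recognized as "light" and flagged, because it lies closer to the error template than to the normal one.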

In addition, the controller 150 may further include a phoneme model generation module 152. The phoneme model generation module 152 separates normal phonemes from error phonemes and constructs the first and second decision trees, respectively. For error phonemes, it sets a higher tree-splitting threshold than for normal phonemes, classifies the phonemes accordingly, and clusters the classified phonemes, thereby forming a decision tree based phoneme model. In other words, when constructing the second decision tree from error phonemes, the module applies a higher threshold than when constructing the first decision tree from normal phonemes, so that the second decision tree has fewer branches than the first decision tree.

The first and second decision trees constructed by the phoneme model generation module 152 are used by the speech recognition module 151 of the controller 150 for speech recognition and erroneous pronunciation detection.

The speech recognition module 151 and the phoneme model generation module 152 may be implemented in software, in hardware, or in a combination of the two. For example, they may be stored in the storage unit 140 as programs and implemented by being executed by the controller 150.

FIG. 6 is a flowchart illustrating a speech recognition method performed by the terminal device 100 according to another embodiment of the present invention.

Referring to FIG. 6, the terminal device 100 stores a first decision tree constructed from normal phonemes and a second decision tree constructed from error phonemes (S205). The first and second decision trees may be constructed as follows.

As shown in FIG. 7, a plurality of learning voices are collected (S305); the collected learning voices are classified into normal phonemes, which do not include a pronunciation error, and error phonemes, which do (S310); the first decision tree is constructed using the normal phonemes (S315); and the second decision tree is constructed using the error phonemes (S320). When the second decision tree is constructed, the phonemes are classified and clustered by applying a higher threshold than for the first decision tree. As a result, as shown in FIG. 8, the second decision tree has fewer branches than the first decision tree, so the number of clustered models is minimized and the variance is prevented from growing.
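Steps S310 to S320 can be outlined as a small data-preparation sketch. The sample layout (a dictionary with a `has_pronunciation_error` flag), the function names, and the threshold numbers are hypothetical; the description only requires that learning voices be divided into two groups and that the second tree use the higher threshold.

```python
from typing import Dict, List, Tuple

def split_learning_voices(samples: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """S310: separate samples without a pronunciation error (normal)
    from samples with one (error). The flag name is an assumption."""
    normal, error = [], []
    for sample in samples:
        (error if sample["has_pronunciation_error"] else normal).append(sample)
    return normal, error

def check_thresholds(first_threshold: float, second_threshold: float) -> None:
    """Enforce the stated constraint: the second (error) tree must be
    split with a higher threshold than the first (normal) tree."""
    if second_threshold <= first_threshold:
        raise ValueError("second-tree threshold must exceed first-tree threshold")

# Toy corpus; the labels and texts are illustrative only.
corpus = [
    {"text": "rice", "has_pronunciation_error": False},
    {"text": "lice", "has_pronunciation_error": True},
    {"text": "light", "has_pronunciation_error": False},
]
normal_set, error_set = split_learning_voices(corpus)

# Placeholder values; suitable thresholds depend on the data and the gain measure.
FIRST_THRESHOLD, SECOND_THRESHOLD = 350.0, 700.0
check_thresholds(FIRST_THRESHOLD, SECOND_THRESHOLD)
```

Each resulting set would then be passed, with its own threshold, to whatever tree-building routine is used for S315 and S320.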

The terminal device 100 may receive the first and second decision trees constructed by the service device 200 over the network 300 and store them. Alternatively, when the phoneme model generation module 152 is built in, the terminal device 100 may construct and store the first and second decision trees itself through the phoneme model generation module 152, as described above.
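The two ways of obtaining the trees can be expressed as a simple selection. This is only a sketch: the function and parameter names are hypothetical, and preferring the built-in module over the network when both are available is an assumption, since the description presents the two options without ranking them.

```python
from typing import Callable, Optional, Tuple

Trees = Tuple[object, object]  # (first decision tree, second decision tree)

def obtain_trees(build_locally: Optional[Callable[[], Trees]],
                 fetch_from_service: Callable[[], Trees]) -> Trees:
    """Use the built-in generation module when present (assumed preference),
    otherwise receive the trees from the service device."""
    if build_locally is not None:
        return build_locally()
    return fetch_from_service()

# Stand-in callables; real ones would build trees or call the service.
trees_remote = obtain_trees(None, lambda: ("first@service", "second@service"))
trees_local = obtain_trees(lambda: ("first@local", "second@local"),
                           lambda: ("first@service", "second@service"))
```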

Referring again to FIG. 6, when a user voice is input (S210), the terminal device 100 recognizes each word contained in the user voice using the first decision tree, built from normal phonemes, and detects pronunciation errors in each of those words using the second decision tree, built from error phonemes (S215).

The terminal device 100 then outputs the speech recognition and error detection results to the user through the output unit 120 (S220). The error detection result may include error phoneme information comprising one or more of a normal phonetic symbol, an error phonetic symbol, a correction method, and an error cause for an error phoneme contained in the user voice.

FIG. 8 is an exemplary diagram showing an embodiment of the first and second decision trees applied to speech recognition according to the present invention.

FIG. 8(a) shows an example of the first decision tree, constructed from normal phonemes that contain no pronunciation error, and FIG. 8(b) shows an example of the second decision tree, constructed from error phonemes that contain a pronunciation error. Comparing the two, the splitting threshold for the second decision tree is set higher than that for the first decision tree, so the second decision tree has fewer final leaf nodes and fewer branches than the first. As a result, pronunciation errors can be detected quickly.

The speech recognition method and the decision tree construction method according to the present invention may be implemented as software readable by various computer means and recorded on a computer-readable recording medium. The recording medium may contain program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of the recording medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM (Random Access Memory), and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. Such hardware devices may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Preferred embodiments of the present invention have been disclosed in this specification and the drawings. It will nevertheless be apparent to those of ordinary skill in the art to which the invention pertains that, beyond the embodiments disclosed here, other modifications based on the technical idea of the invention can be practiced. In addition, although specific terms are used in this specification and the drawings, they are used only in a general sense to explain the technical content of the invention and to aid understanding, and are not intended to limit the scope of the invention.

The present invention is applicable to speech recognition, in particular to speech recognition for foreign language learning. In constructing a decision tree based phoneme model for speech recognition, phonemes in the learning voices that include a pronunciation error are separated from phonemes that do not, and a different decision tree is constructed for each, so that speech recognition and erroneous pronunciation detection can be performed together. Furthermore, the tree-splitting threshold for phonemes that include a pronunciation error is set higher than the threshold applied to phonemes that do not. This minimizes the number of models clustered for phonemes that include a pronunciation error and prevents the variance from growing, so that errors in the input voice can be detected quickly and accurately.

100: terminal device 110: input unit 120: output unit
130: audio processor 140: storage unit 150: controller
151: speech recognition module 152: phoneme model generation module
200: service device 210: communication unit 220: storage unit
230: service providing unit 231: speech recognition module
240: phoneme model generation unit 300: network

Claims

A service device for speech recognition, comprising:
a phoneme model generation unit that distinguishes, in a plurality of learning voices, first-type phonemes that do not include a pronunciation error from second-type phonemes that include a pronunciation error, constructs a first decision tree using the first-type phonemes, and constructs a second decision tree using the second-type phonemes;
a storage unit that stores the first decision tree and the second decision tree; and
a service providing unit that, when speech recognition of a user voice is requested from a terminal device, performs speech recognition and error detection on the user voice using the first decision tree and the second decision tree, and provides the speech recognition result and the error detection result to the terminal device.

The service device of claim 1,
wherein the first decision tree and the second decision tree are based on a triphone model.

The service device of claim 1, wherein the service providing unit
provides to the terminal device, as the error detection result, error phoneme information including one or more of a normal phonetic symbol, an error phonetic symbol, a correction method, and an error cause for an error phoneme included in the user voice.

The service device of claim 1, wherein the phoneme model generation unit,
in constructing the second decision tree, sets the threshold for splitting the decision tree higher than for the first decision tree, so that the second decision tree has fewer branches than the first decision tree.

The service device of claim 1,
wherein the second-type phonemes include error phonemes arising from the pronunciation habits of a particular speaker.

A terminal device for speech recognition, comprising:
a storage unit that stores a phoneme model including a first decision tree constructed using first-type phonemes that do not include a pronunciation error and a second decision tree constructed using second-type phonemes that include a pronunciation error;
an input unit that receives a user request for speech recognition;
an audio processor that receives a user voice;
a controller that, when the user request for speech recognition is input, receives the user voice through the audio processor and performs speech recognition and error detection using the first decision tree and the second decision tree; and
an output unit that outputs the results of the speech recognition and error detection.

The terminal device of claim 6, wherein the controller
further comprises a phoneme model generation module that distinguishes the first-type phonemes from the second-type phonemes in a plurality of learning voices, constructs the first decision tree using the first-type phonemes, and constructs the second decision tree using the second-type phonemes.

The terminal device of claim 6, wherein the phoneme model generation module,
in constructing the second decision tree, sets the threshold for splitting the decision tree higher than for the first decision tree, so that the second decision tree has fewer branches than the first decision tree.

The terminal device of claim 6, wherein the controller
provides, as the error detection result, error phoneme information including one or more of a normal phonetic symbol, an error phonetic symbol, a correction method, and an error cause for an error phoneme included in the user voice.

A method of constructing decision trees for speech recognition, comprising:
collecting a plurality of learning voices;
classifying the plurality of learning voices into first-type phonemes that do not include a pronunciation error and second-type phonemes that include a pronunciation error;
constructing a first decision tree using the first-type phonemes; and
constructing a second decision tree using the second-type phonemes,
wherein, in constructing the second decision tree, the threshold for splitting the decision tree is set higher than in constructing the first decision tree, so that the second decision tree has fewer branches than the first decision tree.

A speech recognition method, comprising:
receiving a voice to be recognized; and
recognizing the first-type phonemes and detecting, as errors, the second-type phonemes included in the voice to be recognized, using, respectively, a first decision tree constructed through clustering of first-type phonemes that do not include a pronunciation error and a second decision tree constructed through clustering of second-type phonemes that include a pronunciation error and having fewer branches than the first decision tree.