KR101840363B1

KR101840363B1 - Voice recognition apparatus and terminal device for detecting mispronounced phoneme, and method for training acoustic model

Info

Publication number: KR101840363B1
Application number: KR1020110115316A
Authority: KR
Inventors: 김영준
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2011-11-07
Filing date: 2011-11-07
Publication date: 2018-03-21
Also published as: KR20130050132A

Abstract

The present invention relates to a speech recognition system, a speech recognition apparatus thereof, and an acoustic model training method for improving the ability to distinguish erroneous phonemes caused by native-language interference from the standard phonemes of native speakers, and thereby improving pronunciation evaluation in foreign language learning. When speech recognition is performed on a user's input speech data using a pre-stored acoustic model and the recognition result is an error, it is checked whether the error is due to native-language interference. If it is, discriminative training of the acoustic model is performed in a direction that increases the difference between the recognition result and the erroneous pronunciation that native-language interference produces for the word sequence corresponding to the user's speech data.

Description

Terminal and speech recognition apparatus for detecting mispronunciation, and method for training an acoustic model therefor

The present invention relates to speech recognition and, more particularly, provides a terminal, a speech recognition apparatus, and an acoustic model training method therefor that improve the ability to distinguish erroneous pronunciation caused by native-language interference from the standard pronunciation of native speakers, thereby improving pronunciation evaluation in foreign language learning.

Speech recognition technology extracts phonological and linguistic information from the acoustic information contained in speech so that a machine can recognize it. By the type of speaker it can recognize, it is divided into speaker-dependent technology, which recognizes only specific speakers, and speaker-independent technology, which targets an unspecified population. By the form of utterance, it is divided into isolated-word, connected-word, continuous-sentence, and conversational continuous-sentence recognition; there is also keyword spotting, which detects and recognizes only specific vocabulary. By vocabulary size, it can further be classified into small-vocabulary recognition handling a few hundred words or fewer, medium-vocabulary recognition handling several thousand words, and large-vocabulary recognition capable of recognizing tens of thousands of words.

In outline, the speech recognition process is as follows. First, the feature vectors needed for recognition are extracted from the user's uttered speech through acoustic processing. LPC (Linear Predictive Coding) and MFCC (Mel Frequency Cepstral Coefficients) are mainly used as feature vectors. Next, the recognition result is obtained by comparing the extracted feature vectors with trained reference patterns. The most common approach to speech recognition is pattern recognition; representative methods are DTW (Dynamic Time Warping), which uses template-based pattern matching, and HMM (Hidden Markov Model), which uses statistical pattern recognition. Methods based on neural networks are also used.
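
The template-matching step mentioned above can be sketched as follows. This is a minimal illustration, not the method of the invention: it assumes each utterance has already been reduced to a sequence of scalar features (real LPC or MFCC features are vectors), and the names `dtw_distance`, `recognize`, and the template labels are hypothetical.

```python
def dtw_distance(a, b):
    """Accumulated DTW cost between two scalar sequences.

    The local cost is the absolute difference; a real system would use a
    distance between feature vectors (assumption for illustration).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognize(features, templates):
    """Return the label of the reference template closest to the input."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))


# Hypothetical reference patterns keyed by word label.
templates = {"up": [1, 2, 3], "down": [3, 2, 1]}
print(recognize([1, 2, 3, 3], templates))
```

Because DTW lets either sequence repeat a frame, an utterance spoken more slowly than its template can still align to it at zero cost.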

Here, the HMM approach builds models that represent the training data using the temporal statistical characteristics of the speech signal, and then adopts as the recognition result the probabilistic model with the highest similarity to the actual speech signal. It is easy to implement for isolated, connected, and continuous speech recognition and shows good recognition performance, so it is widely used in many applications.

The parameters of an HMM consist of the transition probabilities between states, the state-dependent output probabilities, and the initial state probabilities. To apply an HMM to speech recognition, three problems must be solved: first, computing the probability that a given sequence of observation symbols occurs; second, finding the optimal state sequence; and third, estimating the model parameters that best represent the observation sequence.

The first problem can be solved with the forward-backward algorithm, the second with the Viterbi algorithm, and the third with the Baum-Welch method or the segmental K-means method.
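
The first two problems can be illustrated with a small discrete HMM. The sketch below is an assumption-laden toy, not code from the source: `pi`, `A`, and `B` stand for the initial, transition, and output probabilities, the numeric values are invented, and only the forward pass of the forward-backward algorithm is shown because the forward pass alone yields the observation probability.

```python
def forward_probability(obs, pi, A, B):
    """Problem 1: probability of the observation sequence under the model."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * A[r][s] for r in range(n)) * B[s][o]
            for s in range(n)
        ]
    return sum(alpha)


def viterbi_path(obs, pi, A, B):
    """Problem 2: most likely state sequence for the observation sequence."""
    n = len(pi)
    delta = [pi[s] * B[s][obs[0]] for s in range(n)]
    backpointers = []
    for o in obs[1:]:
        # Best predecessor of each state, computed from the previous delta.
        prev = [max(range(n), key=lambda r: delta[r] * A[r][s]) for s in range(n)]
        delta = [delta[prev[s]] * A[prev[s]][s] * B[s][o] for s in range(n)]
        backpointers.append(prev)
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for prev in reversed(backpointers):
        state = prev[state]
        path.append(state)
    return path[::-1]


# Toy two-state model with three observation symbols (values are illustrative).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2]
print(forward_probability(obs, pi, A, B))
print(viterbi_path(obs, pi, A, B))
```

A practical implementation would work with log probabilities to avoid numerical underflow on long utterances.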

Continuous speech recognition, unlike isolated-word recognition, finds the sentence or sequence of words that corresponds to the speech signal. Its key components are the acoustic model, which is the probability of observing the acoustic signal given a word sequence, and the language model, which is the probability of the word sequence itself.

In particular, acoustic models often use the phoneme as the smallest unit, and among these the triphone model, which also takes the preceding and following phonemes into account, is known to perform best.

The ML (Maximum Likelihood) method is widely used to train acoustic models. Because it considers only the optimization of the model assigned to each unit, it gives little consideration to how different that model is from the other models, which lowers the recognition rate between similar sounds.
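
The limitation described above can be made concrete by writing the two kinds of training objective side by side. The notation is an assumption introduced here for illustration: $O_r$ is the feature sequence of training utterance $r$, $W_r$ its reference word sequence, $\lambda$ the acoustic model parameters, and $P(W)$ a language model probability. The discriminative objective shown is the maximum mutual information (MMI) criterion, given only as one well-known example; the source does not say which discriminative criterion is used.

```latex
\hat{\lambda}_{\mathrm{ML}} = \arg\max_{\lambda} \sum_{r} \log p_{\lambda}(O_r \mid W_r)

\hat{\lambda}_{\mathrm{MMI}} = \arg\max_{\lambda} \sum_{r} \log
  \frac{p_{\lambda}(O_r \mid W_r)\, P(W_r)}
       {\sum_{W} p_{\lambda}(O_r \mid W)\, P(W)}
```

The ML objective contains only the model of the reference transcription, whereas the denominator of the MMI objective sums over competing word sequences, so raising that objective also lowers the score of competitors relative to the reference.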

Furthermore, in foreign language learning, a typical application of speech recognition technology, the data used contain many errors. These include errors where the recognizer itself recognized the speech correctly, but the speaker learning the foreign language mispronounced a phoneme because of native-language interference.

Therefore, if phonemes mispronounced because of native-language interference can be distinguished from phonemes uttered by native speakers, performance in foreign language learning can be expected to improve further.

The present invention aims to provide a terminal, a speech recognition apparatus, and an acoustic model training method therefor that, in speech recognition, improve the ability to distinguish erroneous phonemes caused by native-language interference from the standard phonemes of native speakers, and thereby improve pronunciation evaluation in foreign language learning.

As a means for solving the problem, the present invention provides a speech recognition apparatus for detecting mispronunciation, comprising: a speech recognition unit that, in response to a speech recognition request from a terminal, performs speech recognition on the user's input speech data using a pre-stored acoustic model and outputs the speech recognition result to the terminal; an acoustic model learning unit that, when the speech recognition result is an error, checks whether the error is due to native-language interference and, if so, performs discriminative training of the acoustic model in a direction that increases the difference between the speech recognition result and the erroneous pronunciation that native-language interference produces for the speech data; and a storage unit that stores the acoustic model.

In the speech recognition apparatus according to the present invention, the storage unit may further store a phonological error dictionary that defines the erroneous phonemes that native-language interference can produce, and the acoustic model learning unit may compare the speech data with the phonological error dictionary and, when the speech data contain a phoneme that matches an erroneous phoneme in the dictionary, determine that the error is due to native-language interference.
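
The dictionary comparison described above might look like the following sketch. Everything concrete in it is an assumption made for illustration: the source does not specify the dictionary format, so `ERROR_DICT` is modeled as a mapping from a canonical phoneme to the substitutes that interference can produce, the entries are invented examples, and the comparison assumes the canonical and recognized phoneme sequences have equal length.

```python
# Hypothetical phonological error dictionary: canonical phoneme -> substitutes
# that native-language interference may produce (illustrative entries only).
ERROR_DICT = {
    "f": {"p"},
    "v": {"b"},
    "r": {"l"},
    "th": {"s"},
}


def interference_substitutions(canonical, recognized, error_dict=ERROR_DICT):
    """List (position, canonical, recognized) for mismatches found in the dictionary.

    Sequences of different length are not aligned in this sketch and yield
    an empty list.
    """
    if len(canonical) != len(recognized):
        return []
    return [
        (i, c, r)
        for i, (c, r) in enumerate(zip(canonical, recognized))
        if c != r and r in error_dict.get(c, ())
    ]


def is_interference_error(canonical, recognized, error_dict=ERROR_DICT):
    """True when at least one mismatch is a known interference substitution."""
    return bool(interference_substitutions(canonical, recognized, error_dict))


# "fine" spoken as "pine": the f -> p substitution is listed in the dictionary.
print(is_interference_error(["f", "ay", "n"], ["p", "ay", "n"]))
```

A mismatch that is not listed in the dictionary, such as a final "n" heard as "m" here, would fall through to the other branch and be treated as an error of the recognizer itself.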

In the speech recognition apparatus according to the present invention, when the error in the recognition result is not due to native-language interference, the acoustic model learning unit may judge it to be an error of speech recognition itself and perform discriminative training in a direction that increases the difference between the recognition result judged to be erroneous and the desired recognition result.

In the speech recognition apparatus according to the present invention, the speech recognition unit may further detect erroneous pronunciation in the user's speech data using the acoustic model and provide the detection result to the terminal.

As another means for solving the above problem, the present invention provides a terminal for detecting mispronunciation, comprising: an input unit for receiving a user's request; an audio processing unit that converts the user's speech signal into speech data and outputs it; a control unit including a speech recognition module that, when a speech recognition request arises from a user's request through the input unit or from an application program, performs speech recognition on the user's speech data output from the audio processing unit using a pre-stored acoustic model and outputs a speech recognition result, and an acoustic model learning module that, when the speech recognition result is an error, checks whether the error is due to native-language interference and, if so, performs discriminative training in a direction that increases the difference between the speech recognition result and the erroneous pronunciation that native-language interference produces for the word sequence corresponding to the user's speech data; an output unit that outputs the speech recognition result to the user; and a storage unit that stores the acoustic model.

In the terminal according to the present invention, the storage unit may further store a phonological error dictionary that defines the erroneous phonemes that native-language interference can produce, and the acoustic model learning module may compare the speech data with the phonological error dictionary and, when the speech data contain a phoneme that matches an erroneous phoneme in the dictionary, determine that the error is due to native-language interference.

The terminal according to the present invention may further include a communication unit that transmits and receives data over a network, and the control unit may receive, through the communication unit, a plurality of acoustic models and the phonological error dictionary generated by the speech recognition apparatus and store them in the storage unit.

In the terminal according to the present invention, the control unit may further include a preprocessing module that extracts feature vectors from the user's speech data output from the audio processing unit, and the speech recognition module may apply the feature vectors to the acoustic model and, by measuring similarity, extract a word sequence similar to the speech data.

In the terminal according to the present invention, the speech recognition module may further detect and provide the erroneous pronunciation contained in the recognized speech data.

According to the present invention, the acoustic model of a speech recognition system can be trained so that a user's erroneous pronunciation caused by native-language interference is clearly distinguished from the standard pronunciation of an actual native speaker. As a result, recognition performance, in particular the ability to distinguish incorrect from correct pronunciation, can be improved, which in turn improves the accuracy and reliability of pronunciation evaluation in language learning such as foreign language learning.

FIG. 1 is a block diagram schematically illustrating a speech recognition system according to the present invention.
FIG. 2 is a block diagram illustrating the configuration of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 3 is a block diagram illustrating the configuration of a terminal according to another embodiment of the present invention.
FIG. 4 is a flowchart illustrating a speech recognition process according to the present invention.
FIG. 5 is a flowchart illustrating a training process for the acoustic model in speech recognition according to the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In the following description and the accompanying drawings, detailed descriptions of well-known functions or configurations that could obscure the gist of the present invention are omitted. Note also that the same components are denoted by the same reference numerals wherever possible throughout the drawings.

Speech recognition according to the present invention can be applied to a variety of applications. For example, when applied to language learning, such as foreign language learning, it can improve language evaluation performance, but it is not limited to language learning. In addition, the speech recognition and acoustic model training according to the present invention may be implemented as a server-client type or as a stand-alone type on a terminal. The following embodiments are described with both types in mind.

FIG. 1 is a block diagram illustrating a speech recognition system according to an embodiment of the present invention, in which speech recognition is implemented through a server-client system.

Referring to FIG. 1, the speech recognition system of the present invention may comprise a speech recognition apparatus (100) and a terminal (200) that interwork through a network (10).

Here, the network (10) may be any type of network capable of data communication. For example, it may be an All-IP network, an IP network structure that integrates different networks on the basis of IP and that provides large-volume data transmission and reception and uninterrupted data service through the Internet Protocol (IP). The network (10) may also include one or more of a wired network, a WiBro (Wireless Broadband) network, a third-generation mobile network including WCDMA, a 3.5-generation mobile network including an HSDPA (High Speed Downlink Packet Access) network and an LTE network, a fourth-generation mobile network including LTE Advanced, a satellite network, and a wireless LAN including a Wi-Fi network.

The speech recognition apparatus (100) is a server device that provides a speech recognition service. In response to a request from the terminal (200), it performs speech recognition and erroneous-pronunciation detection on the input user speech based on the stored acoustic model, and provides the results to the terminal (200).

Here, the acoustic model is trained through discriminative training according to the present invention. Specifically, error data, that is, speech recognition results that contain recognition errors, are collected, and each collected item is classified as either an error caused by native-language interference or an error of the speech recognition processing itself. For an error due to native-language interference, the erroneous pronunciation that native-language interference produces for the word corresponding to the error data is extracted, and discriminative training is performed in a direction that increases the difference between that erroneous pronunciation and the recognition result. For an error of speech recognition itself, discriminative training is performed in a direction that increases the difference from the original correct answer. In other words, ordinary discriminative training retrains the model in a direction that increases the difference between the wrong answer and the correct answer, thereby reducing the region where the two overlap. However, when the recognition result has been judged erroneous because of a user's pronunciation error caused by native-language interference, training only in the direction that increases the difference between the correct and wrong answers cannot detect the pronunciation error caused by native-language interference. Accordingly, in the present invention, when the speech recognition result is judged erroneous because of a pronunciation error caused by the user's native-language interference, retraining is performed in a direction that increases the difference between the speech recognition result and the erroneous pronunciation produced by native-language interference, rather than the correct answer, thereby improving the detection of pronunciation errors caused by native-language interference.

By performing speech recognition with the acoustic model trained as described above, the speech recognition apparatus (100) can improve its detection of pronunciation errors caused by native-language interference, and can thereby improve the reliability of evaluation results in foreign language learning.

The speech recognition apparatus (100) may operate in conjunction with a language learning system, for example one for foreign languages, and may in particular be formed integrally with the language learning system.

In response to a user request, the terminal (200) senses the user's voice, generates speech data, and transmits the speech data to the speech recognition apparatus (100) to request speech recognition. The terminal (200) may itself perform part of the speech recognition processing, for example extracting feature vectors from the user's speech data, and transmit the extracted feature vectors to the speech recognition apparatus (100) to request speech recognition. The terminal (200) may then receive the speech recognition result and the error detection result from the speech recognition apparatus (100) and output them to the user. In addition, the terminal (200) may provide language learning and evaluation based on the received speech recognition result and error detection result.

The terminal (200) may be any of various types of information processing device used by a user, including, for example, a PC (Personal Computer), a notebook computer, a mobile phone, a tablet PC, a navigation terminal, a smartphone, a PDA (Personal Digital Assistant), a smart TV, a PMP (Portable Multimedia Player), and a digital broadcast receiver. These are only examples; the term should be interpreted as covering any communication-capable device that is currently developed and commercialized or that will be developed in the future.

FIG. 2 is a block diagram illustrating the detailed configuration of the speech recognition apparatus in a speech recognition system according to an embodiment of the present invention. FIG. 2 shows the configuration of the speech recognition apparatus (100) divided into functional units; the components may be distributed across multiple devices or implemented in a single device.

Referring to FIG. 2, the speech recognition apparatus (100) according to an embodiment of the present invention may include a speech recognition unit (110), an acoustic model learning unit (120), and a storage unit (130).

The speech recognition unit (110) receives a speech recognition request from the terminal (200), uses the acoustic model stored in the storage unit (130) to calculate the similarity between the user's speech data delivered from that terminal (200) and the reference patterns, and outputs a recognition result for the user's speech data on that basis. The recognition result may be delivered to the terminal (200). In addition, the speech recognition unit (110) detects the user's erroneous pronunciation in the user's speech data, taking the standard pronunciation of native speakers as the reference, and provides the detection result to the terminal (200). The acoustic model used for speech recognition and erroneous-pronunciation detection is one trained by the acoustic model learning unit (120).

The acoustic model learning unit (120) collects error data on speech recognition from the speech recognition unit (110) and trains the acoustic model to be used for speech recognition on the basis of the collected error data. More specifically, among the speech recognition results output by the speech recognition unit (110), the acoustic model learning unit (120) takes the user's speech data whose recognition result was judged erroneous and compares it with a pre-stored phonological error dictionary that defines the erroneous phonemes native-language interference can produce; when a phoneme matching the phonological error dictionary is present, it classifies the error data as an error due to native-language interference. For error data so classified, it extracts information on the erroneous pronunciation that native-language interference produces for the corresponding word sequence, and performs discriminative training of the acoustic model in a direction that increases the difference between the speech recognition result judged erroneous and that erroneous pronunciation.

Conversely, when the error data are not judged to be an error due to native-language interference, the acoustic model learning unit (120) judges them to be an error of speech recognition itself and performs discriminative training of the acoustic model in a direction that increases the difference between the recognition result judged erroneous and the correct recognition result actually required (the correct answer).
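
The two branches described above can be summarized as a routing step that decides which pair of hypotheses the discriminative training should push apart. The sketch below shows only that selection, under stated assumptions: `ErrorSample`, its fields, and `training_pair` are hypothetical names, phoneme sequences are represented as tuples of strings, and the actual parameter update of the acoustic model is not shown because the source does not specify it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Phonemes = Tuple[str, ...]


@dataclass
class ErrorSample:
    """One recognition result that was judged erroneous (hypothetical structure)."""
    recognized: Phonemes                      # what the recognizer output
    reference: Phonemes                       # the correct answer
    interference_variant: Optional[Phonemes]  # set only on a dictionary match


def training_pair(sample: ErrorSample):
    """Choose the two hypotheses whose difference training should increase."""
    if sample.interference_variant is not None:
        # Error due to native-language interference: separate the recognition
        # result from the interference-induced erroneous pronunciation.
        return ("interference", sample.recognized, sample.interference_variant)
    # Error of speech recognition itself: separate the erroneous recognition
    # result from the correct answer.
    return ("recognizer", sample.recognized, sample.reference)


sample = ErrorSample(
    recognized=("f", "ay", "n"),
    reference=("f", "ay", "n"),
    interference_variant=("p", "ay", "n"),
)
print(training_pair(sample)[0])
```

Keeping the routing separate from the update makes it possible to plug in whichever discriminative criterion the trainer uses without changing how errors are classified.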

The acoustic model retrained in this way by the acoustic model learning unit (120) is used by the speech recognition unit (110) when recognizing subsequent speech data, which improves the accuracy of later speech recognition results.

The storage unit (130) stores the data needed for speech recognition by the speech recognition unit (110) and for training by the acoustic model learning unit (120); more specifically, it stores a plurality of acoustic models and a phonological error dictionary of the erroneous pronunciations that native-language interference can produce.

The above describes an embodiment based on a server-client configuration. In another embodiment of the present invention, however, the speech recognition processing may be performed by the terminal (200) operating stand-alone.

FIG. 3 is a block diagram illustrating the configuration of a terminal (200) that performs the speech recognition function according to another embodiment of the present invention.

Referring to FIG. 3, the terminal (200) may include an input unit (210), a communication unit (220), an audio processing unit (230), an output unit (240), a storage unit (250), and a control unit (260).

The input unit (210) generates user input signals for controlling the terminal (200) or requesting a specific function according to the user's operation, and may be implemented with various kinds of input means. For example, the input unit (210) may include one or more of a key input means, a touch input means, a gesture input means, and a voice input means. A key input means, such as a keypad or keyboard, generates a signal corresponding to the key operated. A touch input means, such as a touch pad, touch screen, or touch sensor, recognizes an input by sensing the user touching a specific part. A gesture input means recognizes a designated user action, for example shaking or moving the terminal, approaching the terminal, or blinking, as a specific input signal, and may include one or more of a geomagnetic sensor, an acceleration sensor, a camera, an altimeter, a gyro sensor, and a proximity sensor.

The communication unit 220 is a means for transmitting and receiving data over the network 10. During speech recognition processing it can communicate with the speech recognition apparatus 100 as needed to receive the required acoustic models, the phonological error dictionary, and the programs for speech recognition and training.

The audio processing unit 230 performs audio processing such as outputting sound from the terminal 200 and sensing the user's voice to generate speech data, and works with voice sensing means (for example, a microphone (MIC)) and voice output means (for example, a speaker (SPK)). In particular, in the speech recognition processing according to the present invention, the audio processing unit 230 receives through the microphone (MIC) the signal of the user's voice to be recognized and converts it into speech data, which is digital data. It may additionally perform preprocessing such as amplification and filtering for noise removal.
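The amplification and noise filtering mentioned for the audio processing unit 230 can be illustrated with a minimal sketch. The description does not specify the gain rule or the filter type, so peak normalization and a centered moving average are assumptions chosen only for illustration, and the function names `amplify`, `moving_average`, and `preprocess` are hypothetical.

```python
from typing import List


def amplify(samples: List[float], target_peak: float = 0.9) -> List[float]:
    """Scale the signal so its largest absolute sample equals target_peak.

    Peak normalization is an assumed stand-in for the unspecified amplification.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def moving_average(samples: List[float], width: int = 3) -> List[float]:
    """Smooth the signal with a centered moving average (a crude low-pass filter).

    Windows are truncated at the edges rather than zero-padded.
    """
    half = width // 2
    smoothed = []
    for i in range(len(samples)):
        window = samples[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def preprocess(samples: List[float]) -> List[float]:
    """Filter first, then amplify, so the gain is computed on the cleaned signal."""
    return amplify(moving_average(samples))
```

A practical front end would more likely use a proper band-pass or spectral noise-suppression filter; the moving average is used here only because it is easy to verify by hand.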

The output unit 240 is a means for the interface between the terminal 200 and the user. For example, it outputs a GUI (Graphical User Interface) screen for checking the execution results of the terminal 200 and for user operation. In particular, in the present invention the output unit 240 may output a guidance message prompting the user to speak for speech recognition, the speech recognition result, and detection information on the user's pronunciation errors.

The output unit 240 may be implemented with various display means, for example any one of an LCD (Liquid Crystal Display), a TFT-LCD (Thin Film Transistor-Liquid Crystal Display), LED (Light Emitting Diodes), OLED (Organic Light Emitting Diodes), AMOLED (Active Matrix Organic Light Emitting Diodes), a flexible display, and a three-dimensional display.

The storage unit 250 is a means for storing the data and programs needed for the operation of the terminal 200, and basically stores the operating program (OS) of the terminal 200 and one or more application programs. The operating program (OS) and application programs stored in the storage unit 250 are executed by the control unit 260 to carry out the functions implemented in the terminal 200. In particular, in the present invention the storage unit 250 stores data and programs for speech recognition and for training the acoustic model. Specifically, the storage unit 250 stores an acoustic model for speech recognition and a phonological error dictionary for detecting pronunciation errors caused by native-language interference. The storage unit 250 may include any kind of storage medium, such as RAM (Random Access Memory), ROM (Read Only Memory), a hard disk (HDD, Hard Disk Drive), flash memory, CD-ROM, or DVD.

The control unit 260 controls the overall operation of the terminal 200. It basically runs on the operating program stored in the storage unit 250 to establish the basic execution environment of the terminal 200, and executes application programs selected by the user to provide arbitrary functions. In particular, when the user requests speech recognition through the input unit 210, or when a running application program (for example, a learning program) issues a speech recognition request, the control unit 260 performs speech recognition and erroneous-pronunciation detection on the user's speech data input through the audio processing unit 230, using the acoustic model stored in the storage unit 250. It then retrains the acoustic model using the recognition results confirmed to be erroneous, so that errors in the recognition results are minimized. The results of speech recognition and erroneous-pronunciation detection by the control unit 260 are output to the user through the output unit 240.

To this end, the control unit 260 may include a preprocessing module 261, a speech recognition module 262, and an acoustic model training module 263.

The preprocessing module 261 performs preprocessing for speech recognition on the user's speech data input from the audio processing unit 230. More specifically, it extracts a feature vector from the speech data.
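As a rough illustration of feature vector extraction, the sketch below splits the speech data into overlapping frames and reduces each frame to two numbers. The description does not name the feature type, so the choice of log energy and zero-crossing rate, the frame parameters, and all function names here are assumptions. Production recognizers typically use richer features such as mel-frequency cepstral coefficients.

```python
import math
from typing import List, Sequence, Tuple


def frame_signal(samples: Sequence[float], frame_len: int, hop: int) -> List[Sequence[float]]:
    """Split the signal into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def frame_features(frame: Sequence[float]) -> Tuple[float, float]:
    """Return (log energy, zero-crossing rate) for one frame of at least 2 samples."""
    energy = sum(s * s for s in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small floor avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return (log_energy, zcr)


def extract_features(samples: Sequence[float], frame_len: int = 4, hop: int = 2
                     ) -> List[Tuple[float, float]]:
    """Turn a waveform into a sequence of per-frame feature vectors."""
    return [frame_features(f) for f in frame_signal(samples, frame_len, hop)]
```

The point of the sketch is the shape of the data: a time-varying waveform goes in, and a much shorter sequence of compact vectors comes out, which is what the recognition module consumes.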

The speech recognition module 262 applies the feature vector of the user's speech data, delivered from the preprocessing module 261, to the multiple trained acoustic models stored in the storage unit 250 to compute similarities, and recognizes the user's speech data based on those similarities. The speech recognition module 262 also uses the acoustic models to detect erroneous pronunciation contained in the recognized speech data, and outputs information on the detected erroneous pronunciation through the output unit 240.
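One way to picture applying a feature vector to several acoustic models and choosing by similarity is sketched below. It assumes each model is a single diagonal-covariance Gaussian and uses log-likelihood as the similarity measure; neither choice is stated in the description, and the names `gaussian_log_likelihood` and `recognize` and the model layout are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Assumed model layout: label -> {"mean": [...], "var": [...]} (diagonal Gaussian).
Model = Dict[str, List[float]]


def gaussian_log_likelihood(x: Sequence[float], mean: Sequence[float],
                            var: Sequence[float]) -> float:
    """Log density of x under a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def recognize(feature: Sequence[float], models: Dict[str, Model]
              ) -> Tuple[str, Dict[str, float]]:
    """Score the feature vector against every model and return the best label."""
    scores = {label: gaussian_log_likelihood(feature, m["mean"], m["var"])
              for label, m in models.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Returning the full score table along with the best label is a deliberate choice: the gap between the top score and the runner-up is the quantity that discriminative training later tries to widen.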

The acoustic model training module 263 checks, based on the recognition results of the speech recognition module 262, whether each result is erroneous, and retrains the acoustic model stored in the storage unit 250 using the recognition results determined to be erroneous, that is, the error data. Specifically, for the user's speech data whose recognition result was determined to be erroneous, the module compares the speech data with the pre-stored phonological error dictionary, which defines the erroneous pronunciations that native-language interference can produce. If a pattern matching the dictionary exists, the error is classified as an error caused by native-language interference. For error data so classified, the module extracts information on the erroneous pronunciation that native-language interference produces for the word in question, and performs discriminative training of the acoustic model in the direction that increases the difference between the recognition result determined to be erroneous and that erroneous pronunciation. If, on the other hand, the error is not caused by native-language interference, it is judged to be an error of the speech recognition itself, and discriminative training of the acoustic model is performed in the direction that increases the difference between the erroneous recognition result and the correct recognition result actually required (the correct answer).

Next, the speech recognition method and the acoustic model training method according to the present invention, which are based on the configuration described above, are explained.

In one embodiment of the present invention, the speech recognition method and the acoustic model training method are carried out through cooperation between the speech recognition apparatus 100 and the terminal 200. In another embodiment, they may be carried out by the terminal 200 alone.

FIG. 4 is a flowchart showing the speech recognition process according to the present invention, and FIG. 5 is a flowchart showing the training process for the acoustic model in speech recognition according to the present invention.

First, referring to FIG. 4, acoustic models for speech recognition are generated by clustering a large number of speech data samples and are stored (S110). Step S110 may be performed by the speech recognition apparatus 100. The terminal 200 may instead receive the acoustic models from the speech recognition apparatus 100 and store them.

Next, the speech data of the user to be recognized is received (S120). The terminal 200 detects the user's voice directly through the audio processing unit 230 and converts it into speech data, while the speech recognition apparatus 100 receives the user's speech data by way of the terminal 200.

When the user's speech data is input, speech recognition is performed on it using the pre-stored acoustic models (S130). More specifically, a feature vector is extracted from the user's speech data and applied to each acoustic model, and the word string with the highest similarity to the user's speech data is extracted. Here, the feature vector is a condensed representation of the speech data, which is expressed as a waveform varying over time, that excludes unnecessary information and keeps only the characteristic signal of the waveform. The Viterbi algorithm or a similar algorithm may be used for this recognition. Step S130 may be performed by the speech recognition apparatus 100, by the speech recognition apparatus 100 and the terminal 200 in cooperation, or by the terminal 200 alone. In the cooperative case, for example, the terminal 200 extracts the feature vector from the user's speech data and transmits it to the speech recognition apparatus 100, and the speech recognition apparatus 100 applies the received feature vector to each acoustic model to extract the word string with the highest similarity.
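The Viterbi algorithm named for step S130 can be sketched on a toy hidden Markov model. The two phoneme states, the discrete observation symbols "a" and "b" (standing in for quantized feature vectors), and all probabilities below are made-up illustrative values, not parameters from the description.

```python
import math
from typing import Dict, List, Sequence, Tuple

LogTable = Dict[str, Dict[str, float]]


def viterbi(observations: Sequence[str], states: Sequence[str],
            log_start: Dict[str, float], log_trans: LogTable,
            log_emit: LogTable) -> Tuple[List[str], float]:
    """Return the most likely state sequence and its log probability."""
    if not observations:
        raise ValueError("observations must not be empty")
    # trellis[t][s] = (best log score of any path ending in s at time t, previous state)
    trellis = [{s: (log_start[s] + log_emit[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev_col = trellis[-1]
        col = {}
        for s in states:
            best_prev = max(states, key=lambda p: prev_col[p][0] + log_trans[p][s])
            score = prev_col[best_prev][0] + log_trans[best_prev][s] + log_emit[s][obs]
            col[s] = (score, best_prev)
        trellis.append(col)
    # Backtrack from the best final state through the stored predecessors.
    last = max(states, key=lambda s: trellis[-1][s][0])
    best_score = trellis[-1][last][0]
    path = [last]
    for col in reversed(trellis[1:]):
        path.append(col[path[-1]][1])
    path.reverse()
    return path, best_score


def log_table(table: Dict[str, Dict[str, float]]) -> LogTable:
    """Convert a table of probabilities to log probabilities."""
    return {k: {k2: math.log(p) for k2, p in row.items()} for k, row in table.items()}


# Toy model (illustrative values only).
STATES = ["r", "l"]
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = log_table({"r": {"r": 0.7, "l": 0.3}, "l": {"r": 0.3, "l": 0.7}})
LOG_EMIT = log_table({"r": {"a": 0.9, "b": 0.1}, "l": {"a": 0.1, "b": 0.9}})
```

With this toy model, `viterbi(["a", "a", "b"], STATES, LOG_START, LOG_TRANS, LOG_EMIT)` returns the path `["r", "r", "l"]`. Working in log probabilities keeps long utterances from underflowing, which is why the tables are converted up front.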

The speech recognition result is output to the user through the terminal 200 or provided to another application program of the terminal 200 (for example, a language learning program) (S140).

Meanwhile, in the present invention, the speech recognition apparatus 100 or the terminal 200 checks whether the speech recognition result is erroneous (S150). To do so, the speech recognition apparatus 100 or the terminal 200 may ask the user whether the recognition result is right or wrong and receive the answer as feedback from the user.

If the check shows that the speech recognition result is erroneous, that is, incorrect, the acoustic model training unit 120 of the speech recognition apparatus 100 or the acoustic model training module 263 of the terminal 200 collects error data from the recognition result confirmed to be erroneous and uses it to retrain the acoustic model so that errors are minimized (S160). Step S160 may be carried out as shown in FIG. 5.

Referring to FIG. 5, the acoustic model training unit 120 of the speech recognition apparatus 100 or the acoustic model training module 263 of the terminal 200 collects error data from the recognition result confirmed to be erroneous (S210).

The acoustic model training unit 120 of the speech recognition apparatus 100 or the acoustic model training module 263 of the terminal 200 then classifies the error in the recognition result as either an error caused by native-language interference or an error of the speech recognition itself (S220). To do so, the user's speech data whose recognition result was determined to be erroneous is compared with the pre-stored phonological error dictionary, which defines the erroneous pronunciations that native-language interference can produce. If a pattern matching the dictionary exists, the error is classified as an error caused by native-language interference. If no matching pattern exists, it is judged to be an error of the speech recognition itself.
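The classification in step S220 can be sketched as a lookup against a substitution table. The description does not give the dictionary format or the matching procedure, so the following assumes that the speech data has already been decoded into a phoneme sequence aligned one-to-one with the expected phonemes, and that the dictionary maps a target phoneme to the substitutions typical of native-language interference. The entries shown and the names `PHONOLOGICAL_ERROR_DICT`, `find_interference_patterns`, and `classify_error` are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical dictionary: target phoneme -> substitutions attributed to L1 interference.
PHONOLOGICAL_ERROR_DICT: Dict[str, Set[str]] = {
    "f": {"p"},
    "v": {"b"},
    "r": {"l"},
}


def find_interference_patterns(expected: Sequence[str], spoken: Sequence[str],
                               error_dict: Dict[str, Set[str]]
                               ) -> List[Tuple[int, str, str]]:
    """List (position, target, substitute) for every mismatch found in the dictionary.

    Assumes the two sequences are already aligned position by position.
    """
    matches = []
    for i, (target, actual) in enumerate(zip(expected, spoken)):
        if target != actual and actual in error_dict.get(target, set()):
            matches.append((i, target, actual))
    return matches


def classify_error(expected: Sequence[str], spoken: Sequence[str],
                   error_dict: Dict[str, Set[str]] = PHONOLOGICAL_ERROR_DICT) -> str:
    """Label an already-confirmed error as L1 interference or a plain recognition error."""
    if find_interference_patterns(expected, spoken, error_dict):
        return "l1_interference"
    return "recognition_error"
```

For example, `classify_error(["f", "ay", "n"], ["p", "ay", "n"])` yields `"l1_interference"` because the f-to-p substitution is in the table, while a mismatch that is not listed falls through to `"recognition_error"`. A real system would need an alignment step (for instance, edit-distance alignment) to handle inserted or deleted phonemes.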

It is then checked whether the error is an error of the speech recognition itself or an error caused by native-language interference (S230). If it is an error of the speech recognition itself, discriminative training of the acoustic model is performed in the direction that increases the difference between the correct answer for the recognition result, received as feedback, and the incorrect recognition result (S240).

Conversely, if step S230 finds an error caused by native-language interference, the acoustic model training unit 120 of the speech recognition apparatus 100 or the acoustic model training module 263 of the terminal 200 first extracts information on the erroneous pronunciation that native-language interference produces for the user's speech data (S250). That is, for the word string corresponding to the user's speech data, it extracts the erroneous pronunciation that commonly arises from interference by that user's native language.

Discriminative training of the acoustic model is then performed in the direction that increases the difference between the recognition result determined to be erroneous and the erroneous pronunciation produced by native-language interference (S260).
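The branch across steps S230 to S260 can be sketched with a deliberately simplified model. The description does not state the discriminative training criterion, so a perceptron-style margin update on linear scores is swapped in here as a stand-in for training a real acoustic model. The text also says only that the difference between the two items is increased, so which side is boosted in the S260 branch is an assumption: this sketch boosts the native-language-interference pronunciation and penalizes the erroneous recognition result. All names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Weights = Dict[str, List[float]]  # label -> linear weights (stand-in for an acoustic model)


def score(weights: Weights, label: str, features: Sequence[float]) -> float:
    """Linear score of a label for the given feature vector."""
    return sum(w * x for w, x in zip(weights[label], features))


def widen_margin(weights: Weights, features: Sequence[float],
                 boost: str, penalize: str, lr: float = 0.1) -> None:
    """One perceptron-style step: raise the score of `boost`, lower that of `penalize`."""
    weights[boost] = [w + lr * x for w, x in zip(weights[boost], features)]
    weights[penalize] = [w - lr * x for w, x in zip(weights[penalize], features)]


def retrain_on_error(weights: Weights, features: Sequence[float], recognized: str,
                     correct: str, l1_error: Optional[str] = None,
                     lr: float = 0.1) -> None:
    """Dispatch between S240 and S260 for one confirmed recognition error.

    l1_error is None when the dictionary lookup found no interference pattern.
    """
    if l1_error is None:
        # S240: error of the recognition itself; separate the correct answer
        # from the wrong recognition result.
        widen_margin(weights, features, boost=correct, penalize=recognized, lr=lr)
    else:
        # S260: native-language interference; separate the erroneous pronunciation
        # from the wrong recognition result (boost direction is an assumption).
        widen_margin(weights, features, boost=l1_error, penalize=recognized, lr=lr)
```

Each call moves the two competing scores apart for the given features, which is the common idea behind discriminative criteria such as minimum classification error training. The linear scorer is used only so that the effect of one update can be checked by hand.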

As described above, in the speech recognition system according to the present invention, the acoustic model can be trained so that a user's erroneous pronunciation caused by native-language interference is clearly distinguished from the standard pronunciation of an actual native speaker.

The acoustic model training method according to the present invention may be implemented as software readable by various computer means and recorded on a computer-readable recording medium. The recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in computer software. Examples of the recording medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM (Random Access Memory), and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. Such hardware devices may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

As described above, preferred embodiments of the present invention have been disclosed in this specification and the drawings, but it is obvious to those of ordinary skill in the art to which the invention pertains that other modifications based on the technical idea of the invention can be practiced in addition to the embodiments disclosed here. Furthermore, although specific terms are used in this specification and the drawings, they are used only in a general sense to explain the technical content of the invention easily and to aid understanding of the invention, and are not intended to limit its scope.

10: network 100: speech recognition apparatus 110: speech recognition unit
120: acoustic model training unit 130: storage unit 200: terminal
210: input unit 220: communication unit 230: audio processing unit
240: output unit 250: storage unit 260: control unit

Claims

A speech recognition apparatus for detecting erroneous pronunciation, comprising:
a storage unit that stores an acoustic model and a phonological error dictionary defining erroneous phonemes that can be produced by native-language interference;
a speech recognition unit that, in response to a speech recognition request from a terminal, performs speech recognition on input speech data of a user using the stored acoustic model and outputs a speech recognition result to the terminal; and
an acoustic model training unit that, when the speech recognition result is erroneous, compares the speech data with the phonological error dictionary, determines the error to be caused by native-language interference if the speech data includes a phoneme matching an erroneous phoneme in the phonological error dictionary, and, in that case, performs discriminative training of the acoustic model in the direction that increases the difference between the recognition result and the erroneous phoneme produced by native-language interference.

(Deleted)

The speech recognition apparatus of claim 1, wherein, if the error in the recognition result is not caused by native-language interference, the acoustic model training unit determines it to be an error of the speech recognition itself and performs discriminative training in the direction that increases the difference between the recognition result determined to be erroneous and the original correct answer.

The speech recognition apparatus of claim 1, wherein the apparatus further detects erroneous pronunciation in the user's speech data using the acoustic model and provides the erroneous-pronunciation detection result to the terminal.

A terminal comprising:
a storage unit that stores an acoustic model and a phonological error dictionary defining erroneous phonemes that can be produced by native-language interference;
an input unit that receives a user request;
an audio processing unit that converts a user's voice signal into speech data and outputs the speech data;
a control unit including a speech recognition module that, when speech recognition is requested by the user through the input unit or by an application program, performs speech recognition on the user's speech data output from the audio processing unit using the stored acoustic model and outputs a speech recognition result, and an acoustic model training module that, when the speech recognition result is erroneous, compares the speech data with the phonological error dictionary, determines the error to be caused by native-language interference if the speech data includes a phoneme matching an erroneous phoneme in the phonological error dictionary, and, in that case, performs discriminative training in the direction that increases the difference between the recognition result and the erroneous phoneme produced by native-language interference; and
an output unit that outputs the speech recognition result to the user.

(Deleted)

The terminal of claim 5, further comprising a communication unit that transmits and receives data over a network,
wherein the control unit receives the acoustic model for speech recognition and the phonological error dictionary through the communication unit and stores them in the storage unit.

The terminal of claim 5, wherein the control unit further includes a preprocessing module that extracts a feature vector from the user's speech data output from the audio processing unit,
and the speech recognition module applies the feature vector to the acoustic model and extracts a word string similar to the speech data through similarity measurement.

The terminal of claim 5, wherein the speech recognition module further detects erroneous pronunciation included in the recognized speech data.

In speech recognition using an acoustic model, an acoustic model training method comprising:
storing a phonological error dictionary defining erroneous phonemes that can be produced by native-language interference;
collecting speech data whose speech recognition result has been determined to be erroneous;
comparing the speech data with the phonological error dictionary and, if the speech data includes a phoneme matching an erroneous phoneme in the phonological error dictionary, determining the error for that speech data to be an error caused by native-language interference;
if the error is determined to be caused by native-language interference, extracting the erroneous phoneme produced by native-language interference for the word string corresponding to the speech data; and
performing discriminative training of the acoustic model in the direction that increases the difference between the speech recognition result and the erroneous phoneme produced by native-language interference.