KR20180062003A

KR20180062003A - Method of correcting speech recognition errors

Info

Publication number: KR20180062003A
Application number: KR1020160161799A
Authority: KR
Inventors: 이기영; 김영길
Original assignee: 한국전자통신연구원
Priority date: 2016-11-30
Filing date: 2016-11-30
Publication date: 2018-06-08

Abstract

A method of correcting speech recognition errors is disclosed. The method comprises: determining whether a current word is an error word affected by a speech recognition error, using a distance value between the position of the current word and the position of a previous word, both mapped into a vector space by a word embedding technique; if the current word is determined to be an error word, generating corrected-word candidates whose pronunciation is similar to that of the error word by referring to a word pronunciation dictionary; and determining, from the generated candidates, a corrected word that restores the error word, by referring to a general-domain language model and a specific-domain language model.

Description

METHOD OF CORRECTING SPEECH RECOGNITION ERRORS

The present invention relates to a method of correcting speech recognition errors based on uttered context, and more particularly to a method of correcting speech recognition errors in a speech-recognition-based application device.

Recently, various applied technologies that combine speech recognition with language processing have been developed, and they can be applied to fields such as automatic translation, automatic interpretation, simultaneous interpretation, and video interpretation.

Speech recognition produces recognition errors for a variety of reasons, and these errors are amplified in the application modules that take the recognition result as input. Many efforts have therefore been made to correct speech recognition errors.

In a conventional method of correcting speech recognition errors, the words of the recognized sentence are tagged to identify their parts of speech. If a part-of-speech sequence containing a particular part of speech is unusual, the corresponding word is judged to be erroneous. Candidate words with a similar pronunciation (similar-pronunciation words) are then selected for the error word, the most probable candidate is chosen, and the error word is replaced with it.

However, because most conventional speech recognition technology uses a large-scale language model, it is difficult to determine whether a particular sentence associated with a particular topic contains a speech recognition error. That is, even when such a sentence contains a recognition error, it often shows no semantic error, which makes the error difficult to correct accurately.

In addition, because morphological tagging is inaccurate, conventional approaches frequently attach the wrong part of speech to a correctly recognized word, mistake that word for an error word, and then "correct" it.

Accordingly, the problem to be solved by the present invention is to provide a speech recognition error correction method that can detect and correct error words, that is, words affected by a speech recognition error, in a recognized sentence associated with a particular topic.

To achieve the above object, a speech recognition error correction method according to one aspect of the present invention comprises: determining whether a current word is an error word affected by a speech recognition error, using a distance value between the position of the current word and the position of a previous word, both mapped into a vector space by a word embedding technique; if the current word is determined to be an error word, generating corrected-word candidates whose pronunciation is similar to that of the error word by referring to a word pronunciation dictionary; and determining, from the generated candidates, a corrected word that restores the error word, by referring to a general-domain language model and a specific-domain language model.

According to the present invention, a context window is defined, and the various context information within it is used to detect speech recognition errors efficiently. The context window can be defined differently depending on the application. For example, for lecture interpretation or video interpretation, the window spans from the start of the utterance to a given point in time and keeps growing until the lecture or video conversation ends. The invention thus exploits properties found within the context window, such as semantic similarity between words and repeated utterance of the same word, to correct speech recognition errors more accurately. The invention can also be extended to a variety of speech and language processing devices, such as those for lecture interpretation, video conversation translation, and video interpretation.

FIG. 1 is a hardware block diagram for implementing a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of the processor shown in FIG. 1.
FIG. 3 is a functional block diagram of the speech recognition error correction unit shown in FIG. 2.
FIG. 4 is a flowchart illustrating a speech recognition error correction method according to an embodiment of the present invention.

Various embodiments of the present invention are described below with reference to the accompanying drawings. The embodiments may be modified in various ways and may take various forms; specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the embodiments to particular forms, and the invention should be understood to include all modifications, equivalents, and alternatives falling within the spirit and technical scope of its various embodiments. In the description of the drawings, like reference numerals are used for like elements.

Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art and, unless expressly defined in the various embodiments of the present invention, are not to be interpreted in an idealized or overly formal sense.

Conventional error correction methods generally used part-of-speech context information to correct speech recognition errors. Recent speech recognition, however, relies on a large-scale language model, so a recognition error within a recognized sentence does not look like an error, and such methods can neither detect nor correct it.

Accordingly, the present invention provides a method of efficiently detecting and correcting speech recognition errors by defining a context window and using the various context information within it.

The context window can be defined differently depending on the application. For example, for lecture interpretation or video interpretation, the window spans from the start of the utterance to a given point in time, and its size grows until the lecture or video conversation ends. The present invention can therefore correct speech recognition errors more accurately by exploiting properties found within the context window, such as semantic similarity between the words of the uttered sentences and repeated utterance of the same word.

Meanwhile, the speech recognition apparatus of the present invention, which can correct speech recognition errors more accurately, may be mounted in, or implemented as, a variety of electronic devices.

The electronic device may be, for example, a robot equipped with artificial intelligence, or a user terminal or server having a communication function. The user terminal may be, for example, a smartphone, a tablet personal computer (PC), a mobile phone, a video phone, an e-book reader, a desktop PC, a laptop PC, a netbook computer, a personal digital assistant (PDA), a portable multimedia player (PMP), an MP3 player, a mobile medical device, a camera, or a wearable device (e.g., a head-mounted device (HMD) such as electronic glasses, electronic clothing, an electronic bracelet, an electronic necklace, an electronic appcessory, or a smart watch).

FIG. 1 is a block diagram for implementing a speech recognition apparatus 100 according to an embodiment of the present invention.

Referring to FIG. 1, the speech recognition apparatus 100 according to an embodiment of the present invention detects a speech recognition error, corrects the detected error, and outputs the corrected speech recognition result to an application apparatus 200.

The application apparatus 200 may be any kind of apparatus that provides an interpretation or translation service, such as automatic translation, automatic interpretation, simultaneous interpretation, video interpretation, lecture interpretation, or video conversation translation.

When the speech recognition apparatus 100 is applied to an electronic device, it includes, as shown in FIG. 1, one or more processors 110, a memory 130, a user input device 140, a user output device 150, and a storage 160, each of which can communicate over a system bus 120. The speech recognition apparatus 100 may also include a network interface connected to a network.

The processor 110 may be a central processing unit or a semiconductor device that executes processing instructions stored in the memory 130 and/or the storage 160.

The memory 130 and the storage 160 may include volatile or non-volatile storage media. For example, the memory 130 may include a ROM 131 and a RAM 133.

The speech recognition error correction method performed by the speech recognition apparatus 100 according to an embodiment of the present invention may be implemented as a non-transitory computer-readable medium having computer-executable instructions. In one embodiment, when executed by the processor 110, the computer-readable instructions may perform a method according to at least one aspect of the present invention.

FIG. 2 is a functional block diagram of the processor shown in FIG. 1.

Referring to FIG. 2, the processor 110 may include a speech recognition unit 112 and a speech recognition error correction unit 114. Both units may be implemented as logic and mounted inside the processor 110.

The speech recognition unit 112 refers to speech recognition training models (not shown) to recognize the user's uttered speech, and outputs a text sentence corresponding to the recognition result (a speech-recognized sentence, or recognized word sequence) to the speech recognition error correction unit 114. The speech recognition training models may include, for example, an acoustic model, a large-scale language model, and a pronunciation model, and may be stored in the storage 160 of FIG. 1 so that the speech recognition unit 112 can refer to them.

The speech recognition error correction unit 114 can detect speech recognition errors by referring to an utterance word database (114-7, shown in FIG. 3) and a word position database (114-9, shown in FIG. 3), both continuously updated within the defined context window, and to a word pronunciation dictionary (130, shown in FIG. 3).

The context window size may vary depending on the application apparatus 200 (shown in FIG. 1) that works with the speech recognition apparatus 100 of the present invention.

For example, when the speech recognition apparatus 100 works with an application apparatus 200 (shown in FIG. 1) that handles automatic interpretation or dialogue based on everyday conversation, the context window may be limited to the N sentences preceding the currently recognized sentence (N is a natural number of 1 or more).

As another example, when the speech recognition apparatus 100 works with an application apparatus 200 (shown in FIG. 1) that handles video interpretation, lecture interpretation, or the like, the context window may include every sentence uttered by the speaker since the start of the video conference or lecture.
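
The two window policies described above can be sketched as follows. This is a minimal illustration only; the class name `ContextWindow` and its methods are hypothetical and do not come from the patent. A bounded window keeps the N most recent sentences (the dialogue case), while an unbounded one keeps everything since the start (the lecture or video-conference case).

```python
from collections import deque
from typing import Deque, List, Optional


class ContextWindow:
    """Hypothetical holder for the recognized sentences that form the context."""

    def __init__(self, max_sentences: Optional[int] = None) -> None:
        # max_sentences=None -> unbounded window (lecture / video conference)
        # max_sentences=N    -> only the N most recent sentences (dialogue)
        self._sentences: Deque[List[str]] = deque(maxlen=max_sentences)

    def add(self, sentence: List[str]) -> None:
        """Append a newly recognized sentence (a list of words)."""
        self._sentences.append(sentence)

    def words(self) -> List[str]:
        """Return every word currently inside the window, oldest first."""
        return [word for sentence in self._sentences for word in sentence]


dialogue = ContextWindow(max_sentences=2)
lecture = ContextWindow()
for sentence in (["deep", "learning"], ["neural", "model"], ["training", "data"]):
    dialogue.add(sentence)
    lecture.add(sentence)
```

With three sentences added, the dialogue window retains only the last two, whereas the lecture window retains all three.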

FIG. 3 is a functional block diagram of the speech recognition error correction unit shown in FIG. 2.

Referring to FIG. 3, the speech recognition error correction unit 114 may include an error word determination unit 114-1, a corrected-word candidate generation unit 114-3, a corrected-word determination unit 114-5, an utterance word DB 114-7, a word position information DB 114-9, and a specific-domain language model DB 114-11. The DBs 114-7, 114-9, and 114-11 may be stored in the storage 160 of FIG. 1.

Error word determination unit (114-1)

The DBs 114-7, 114-9, and 114-11 may be DBs used to detect speech recognition errors. In addition, a word pronunciation dictionary 130 and a part-of-speech (POS) n-gram information DB 120 may be used to detect speech recognition errors, and both may also be stored in the storage 160 of FIG. 1.

The DBs 114-7, 114-9, and 114-11 may be updated in real time each time a sentence recognized by the speech recognition unit 112 is input. In contrast, the POS n-gram information DB 120 and the word pronunciation dictionary 130 may be DBs built in advance.

Specifically, the words contained in the recognized sentence may be stored in the utterance word DB 114-7 in real time.

The word position information DB 114-9 may store, in real time, the position information (or vector values) obtained by mapping (or projecting) the words of the recognized sentence into a word space. Alternatively, it may store, in real time, the position information (vector values) obtained by mapping (or projecting) into the word space a word class formed by clustering semantically related words among the words of the recognized sentence. The word space is defined as the vector space into which each word of the recognized sentence is mapped (or projected) by word embedding, a technique derived from neural network language models that represents word meaning by placing similar words close together in the vector space.

The specific-domain language model DB 114-11 may store, in real time, language modeling results for those words of the recognized sentences that relate to a specific domain within the context window. The language modeling results may be n-gram information for the domain-related words. Here, the specific domain may be a lecture on a particular topic (an online or live lecture), a video conference, or the like.
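
A real-time n-gram store of this kind can be sketched as an incrementally updated bigram counter. This is a minimal illustration under assumptions: the class `DomainBigramModel`, its method names, and the sentence-start marker `<s>` are hypothetical, and the text does not say which n or which storage layout is actually used.

```python
from collections import Counter
from typing import List


class DomainBigramModel:
    """Hypothetical bigram counts updated as each recognized sentence arrives."""

    def __init__(self) -> None:
        self.bigrams: Counter = Counter()   # (previous word, word) -> count
        self.contexts: Counter = Counter()  # previous word -> count

    def update(self, sentence: List[str]) -> None:
        """Add the bigrams of one newly recognized sentence."""
        tokens = ["<s>"] + sentence  # "<s>" marks the sentence start
        for prev, cur in zip(tokens, tokens[1:]):
            self.bigrams[(prev, cur)] += 1
            self.contexts[prev] += 1

    def context_frequency(self, prev: str, cur: str) -> int:
        """How often `cur` followed `prev` inside the context window so far."""
        return self.bigrams[(prev, cur)]


model = DomainBigramModel()
model.update(["machine", "learning"])
model.update(["deep", "learning"])
model.update(["machine", "learning"])
```

After these three updates, "learning" has followed "machine" twice, while a pairing such as "deep running" has never been observed.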

Unlike the DBs 114-7, 114-9, and 114-11, the POS n-gram information DB 120 is built in advance and stores POS n-gram information. This information defines the probability of the next word given the preceding n-1 words; examples of n-grams are bigrams (n=2) and trigrams (n=3).

Also unlike the DBs 114-7, 114-9, and 114-11, the word pronunciation dictionary 130 is built in advance and may store phonetic symbol information for each word of the recognized sentence.

The error word determination unit 114-1 refers to the DBs 114-7, 114-9, and 114-11, the POS n-gram information DB 120, and the word pronunciation dictionary 130 to determine which words of the recognized sentence are error words, that is, words likely to contain a speech recognition error.

To do so, the error word determination unit 114-1 calculates, for each word, a speech recognition error value that quantifies the likelihood of a recognition error and compares it with a threshold. If the error value exceeds the threshold, the word is determined to be an error word with a high likelihood of a recognition error.

The speech recognition error value (E) may be calculated by the following equation.

Variable A is the distance value, in the vector space, between the position information (vector value) of the word recognized at the current time (hereinafter, the "current word") and the position information (vector value) of the word class clustered up to the previous time; w1 is the weight assigned to A. The error word determination unit 114-1 can calculate the distance value A by referring to the word position information DB 114-9. In the fields where speech recognition is applied, a lecture or video conference has a specific domain (a specific topic), so the words a speaker utters are semantically related to one another and are mapped relatively close together in the word space. For this reason, the farther a word uttered at a given time is from the "class of words uttered so far", the more likely it is to be an error.
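
The distance value A can be illustrated with a small sketch. Two assumptions are made here that the text does not state: the clustered word class is summarized by the centroid of the previous word vectors, and distance is measured as cosine distance. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Mean vector of the previously uttered words (stand-in for the word class)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; larger means farther apart in the word space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat a zero vector as maximally uninformative
    return 1.0 - dot / (norm_u * norm_v)


def distance_a(current: Sequence[float], previous: Sequence[Sequence[float]]) -> float:
    """Distance between the current word and the class of earlier words."""
    return cosine_distance(current, centroid(previous))


previous_vectors = [[1.0, 0.0], [0.9, 0.1]]  # toy embeddings of on-topic words
on_topic = [1.0, 0.05]
off_topic = [0.0, 1.0]
```

In this toy setup the off-topic vector lies much farther from the centroid than the on-topic one, which is the property the description relies on.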

Variable B is the pronunciation similarity between the current word and the words recognized at earlier times (hereinafter, the "previous words"); w2 is the weight assigned to B. The error word determination unit 114-1 can refer to the word pronunciation dictionary 130 and calculate, by a probabilistic method, the similarity between the phonetic symbols of the current word and those of each previous word. In a meeting or lecture on a single topic, the speaker tends to repeat words that are common for that topic. Therefore, if the words collected in the utterance word DB 114-7 include words with similar pronunciations, at least one of them is likely to be a misrecognized error word. For example, "learning" and "running" have entirely different meanings but similar pronunciations. If both appear in the utterance word DB 114-7, a sentence containing either of them is likely the result of misrecognizing "learning" as "running" or, conversely, "running" as "learning". The pronunciation similarity between the current word and the previous words is therefore an important variable in judging speech recognition errors.
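
The text calls for a probabilistic similarity measure but does not specify it. As a stand-in, the sketch below uses a normalized edit distance over phoneme sequences; this substitution, the function names, and the toy ARPAbet-style lexicon entries are all assumptions made for illustration. Identical words are skipped so that a legitimately repeated word does not flag itself.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def pronunciation_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """1.0 for identical pronunciations, 0.0 for completely different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def similarity_b(current: str, previous: List[str], lexicon: Dict[str, List[str]]) -> float:
    """Highest similarity between the current word and any *different* earlier word."""
    scores = [pronunciation_similarity(lexicon[current], lexicon[word])
              for word in previous
              if word != current and word in lexicon]
    return max(scores, default=0.0)


# Toy pronunciation entries (hypothetical, stress markers omitted).
lexicon = {
    "learning": ["L", "ER", "N", "IH", "NG"],
    "running": ["R", "AH", "N", "IH", "NG"],
    "model": ["M", "AA", "D", "AH", "L"],
}
```

Here "learning" and "running" differ in two of five phonemes, so they score far higher against each other than either does against "model".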

Variable C is the part-of-speech n-gram probability value; w3 is the weight assigned to C. The error word determination unit 114-1 can calculate the POS n-gram probability for words by referring to the POS n-gram information DB 120. For a word that makes up an error, the probability of it appearing after the preceding n-1 words will be small; in other words, an error word will be found to follow the preceding n-1 words with relatively low frequency. The POS n-gram probability C can therefore also be an important variable for detecting speech recognition errors.
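
For the bigram case (n=2), the probability C can be sketched as a relative frequency read from pre-built counts. The function name, the tag labels, and the counts below are hypothetical; no smoothing is applied, which a practical system would likely need.

```python
from typing import Dict, Tuple


def pos_bigram_probability(prev_tag: str,
                           tag: str,
                           bigram_counts: Dict[Tuple[str, str], int],
                           context_counts: Dict[str, int]) -> float:
    """Estimate P(tag | prev_tag) as count(prev_tag, tag) / count(prev_tag)."""
    total = context_counts.get(prev_tag, 0)
    if total == 0:
        return 0.0  # unseen context: no evidence for any continuation
    return bigram_counts.get((prev_tag, tag), 0) / total


# Toy counts standing in for a pre-built POS n-gram DB.
bigram_counts = {("DET", "NOUN"): 8, ("DET", "VERB"): 2}
context_counts = {"DET": 10}
```

A determiner followed by a noun is common in these toy counts, while a determiner followed by a verb is rare, so a word tagged as a verb in that position would receive a low C.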

Thus, in one embodiment of the present invention, whether a word is an error word can be judged on the basis of three main factors: the distance value A, in the vector space, between the position information (vector value) of the current word and the position information (vector value) of the word class clustered up to the previous time; the pronunciation similarity B between the current word and the previous words; and the POS n-gram probability C. The weights w1, w2, and w3 assigned to the variables may be determined heuristically.
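
The equation for E is not reproduced in this text, so the sketch below is only one form consistent with the definitions: a weighted sum of the three factors compared against a threshold. It assumes that all factors are scaled to [0, 1] and that C enters as (1 - C), so that a low n-gram probability raises the error value; the published equation may differ. The weights and threshold are arbitrary placeholders.

```python
def error_value(a: float, b: float, c: float,
                w1: float = 0.4, w2: float = 0.3, w3: float = 0.3) -> float:
    """Assumed form: E = w1*A + w2*B + w3*(1 - C).

    a: embedding distance to the previous word class (larger = more suspicious)
    b: pronunciation similarity to earlier words     (larger = more suspicious)
    c: POS n-gram probability                        (smaller = more suspicious)
    """
    return w1 * a + w2 * b + w3 * (1.0 - c)


def is_error_word(value: float, threshold: float = 0.5) -> bool:
    """A word is flagged when its error value exceeds the threshold."""
    return value > threshold


suspicious = error_value(a=0.9, b=0.6, c=0.1)  # far off-topic, sounds like an earlier word
ordinary = error_value(a=0.1, b=0.0, c=0.8)    # on-topic, common POS sequence
```

Setting a weight to zero drops the corresponding factor, which corresponds to judging with only one or two of the three factors.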

Although this embodiment describes judging the likelihood of a recognition error by considering all three main factors, the likelihood may also be judged by considering only one or two of them.

Corrected-word candidate generation unit (114-3)

The corrected-word candidate generation unit 114-3 refers to the word pronunciation dictionary 130 to generate corrected-word candidates (or correct-answer candidates) for the error word that the error word determination unit 114-1 determined to contain a recognition error. That is, the corrected-word candidate generation unit 114-3 searches the word pronunciation dictionary 130 for words whose pronunciation is similar to that of the error word and outputs the words found as the corrected-word candidates.
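
Candidate generation can be sketched as a similarity search over the pronunciation dictionary. The sketch assumes phoneme-sequence similarity is measured with Python's `difflib.SequenceMatcher`; the function name, the cutoff, the candidate limit, and the toy lexicon are hypothetical and not taken from the patent.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def generate_candidates(error_word: str,
                        lexicon: Dict[str, List[str]],
                        min_similarity: float = 0.5,
                        limit: int = 5) -> List[str]:
    """Return words whose phoneme sequence resembles that of the error word."""
    target = lexicon[error_word]
    scored = []
    for word, phones in lexicon.items():
        if word == error_word:
            continue  # the error word cannot be its own correction
        similarity = SequenceMatcher(None, target, phones).ratio()
        if similarity >= min_similarity:
            scored.append((similarity, word))
    # Most similar first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:limit]]


# Toy pronunciation dictionary (hypothetical ARPAbet-style entries).
lexicon = {
    "running": ["R", "AH", "N", "IH", "NG"],
    "learning": ["L", "ER", "N", "IH", "NG"],
    "earning": ["ER", "N", "IH", "NG"],
    "model": ["M", "AA", "D", "AH", "L"],
}
candidates = generate_candidates("running", lexicon)
```

With this lexicon, "earning" and "learning" pass the cutoff for the error word "running", while "model" does not.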

Corrected-word determination unit (114-5)

The corrected-word determination unit 114-5 refers to the general-domain language model 140 and, in addition, to the specific-domain (specific-topic) language model 114-11 to determine the corrected word from among the candidates generated by the corrected-word candidate generation unit 114-3. This resolves the conventional problem that, when only the general-domain language model 140 is used, it is difficult to determine whether a particular sentence associated with a particular topic contains a speech recognition error.

As is well known, the general-domain language model (140) is a model trained on a large corpus. It may be configured to associate words with probabilities and/or other suitable score data (for example, conditional probabilities, scores, word counts, n-gram model data, frequency data, context frequencies, and the like) that can be used to rank vocabulary related to general topics against one another and to select the most suitable corrected vocabulary from the given corrected vocabulary candidates.

The specific-domain language model (114-11) may be generated by the language model generation unit (114-13). It may be configured to associate words with probabilities and/or other suitable score data (for example, conditional probabilities, scores, word counts, n-gram model data, frequency data, context frequencies, and the like) that can be used to rank word candidates related to the specific topic of a video conference or lecture against one another and to select the most suitable corrected vocabulary from the given corrected vocabulary candidates.
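As one illustration of the score data listed above, the following sketch derives a context frequency from bigram counts, once for a general corpus and once for a topic-specific corpus. It assumes that "context frequency" means how often a candidate follows the preceding word; the source does not fix this definition. The function names `bigram_counts` and `context_frequency` and both toy corpora are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def bigram_counts(sentences: Iterable[str]) -> Counter:
    """Count adjacent word pairs over whitespace-tokenized sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def context_frequency(counts: Counter, previous_word: str, candidate: str) -> int:
    """How often `candidate` directly follows `previous_word` in the corpus."""
    pair: Tuple[str, str] = (previous_word.lower(), candidate.lower())
    return counts[pair]  # Counter returns 0 for unseen pairs


# Toy corpora (hypothetical): a general one and a lecture-topic one.
general_lm = bigram_counts([
    "the model walks on the runway",
    "a language model predicts words",
    "she is a fashion model",
])
domain_lm = bigram_counts([
    "the language model scores each candidate",
    "a language model predicts words",
    "train the language model on lecture transcripts",
])

print(context_frequency(general_lm, "language", "model"))  # 1
print(context_frequency(domain_lm, "language", "model"))   # 3
```

In this toy data the topic-specific corpus supports the candidate "model" after "language" more strongly than the general corpus does, which is the kind of difference a second, domain-specific model is meant to capture.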

The corrected vocabulary determination unit (114-5) may compute, through Equation 2 below, a corrected vocabulary determination index used to determine the corrected vocabulary from among the corrected vocabulary candidates.

Here, context frequency 1 is the context frequency of each corrected vocabulary included in the corrected vocabulary candidates, obtained by referring to the specific-domain language model (114-11), and context frequency 2 is the context frequency of each corrected vocabulary included in the corrected vocabulary candidates, obtained by referring to the general-domain language model (140). w4 is a weight applied when computing the corrected vocabulary determination index. In consideration of the characteristics of a video conference or lecture, it may be set so that the context frequency based on the specific-domain language model (114-11) is weighted more heavily than the context frequency based on the general-domain language model (140).
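The exact form of Equation 2 is not reproduced in this text, so the following is only a sketch under a stated assumption: the corrected vocabulary determination index is taken to be a convex combination, w4 × (context frequency 1) + (1 − w4) × (context frequency 2). The functions `decision_index` and `choose_correction`, the default value of `w4`, and the sample frequencies are hypothetical.

```python
from typing import Dict, Tuple


def decision_index(cf_specific: float, cf_general: float, w4: float = 0.7) -> float:
    """Assumed form of the index: w4 * cf1 + (1 - w4) * cf2.

    cf_specific: context frequency 1 (specific-domain language model).
    cf_general:  context frequency 2 (general-domain language model).
    A w4 above 0.5 favors the specific-domain model, as the text suggests.
    """
    if not 0.0 <= w4 <= 1.0:
        raise ValueError("w4 must lie in [0, 1]")
    return w4 * cf_specific + (1.0 - w4) * cf_general


def choose_correction(candidates: Dict[str, Tuple[float, float]],
                      w4: float = 0.7) -> str:
    """Pick the candidate with the highest decision index.

    `candidates` maps each candidate word to (cf_specific, cf_general).
    """
    return max(candidates,
               key=lambda word: decision_index(*candidates[word], w4=w4))


# Hypothetical frequencies for two candidates of one error vocabulary.
candidates = {"model": (3, 1), "modal": (0, 2)}

print(choose_correction(candidates))           # model
print(choose_correction(candidates, w4=0.0))   # modal
```

With `w4 = 0.0` only the general-domain frequency counts and the selection flips to "modal", which shows how the weight shifts the decision toward the topic of the conference or lecture.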

Once the most suitable corrected vocabulary has been determined from the corrected vocabulary candidates according to Equation 2 above, the string in which the error vocabulary has been restored to the determined corrected vocabulary is input to the application device (200) shown in FIG. 1.

Meanwhile, the context frequency in Equation 2 above may be replaced with another quantity that can be used to select the most suitable corrected vocabulary from the corrected vocabulary candidates, such as a probability, a conditional probability, a score, a word count, n-gram model data, or frequency data.

FIG. 4 is a flowchart illustrating a speech recognition error correction method according to an embodiment of the present invention. Content that overlaps with the description of FIGS. 1 to 3 is described only briefly or omitted.

Referring to FIG. 4, first, in step S410, speech recognition is performed on the speech of a speaker in a video conference or lecture related to a specific topic.

Next, in step S420, it is determined, based on the speech recognition result, whether the recognized vocabulary is an error vocabulary in which a speech recognition error has occurred. The determination may be made by comparing the speech recognition error value (E) of the error vocabulary, calculated according to Equation 1 described above, with a threshold.

Next, in step S430, if the recognized vocabulary is confirmed to be an error vocabulary, the process proceeds to step S440. If the recognized vocabulary is confirmed not to be an error vocabulary, the process returns to before step S410 and repeats steps S410 to S420.

If the recognized vocabulary is confirmed to be an error vocabulary, then in step S440 corrected vocabulary candidates (or correct-answer vocabulary candidates) are generated for the error vocabulary by referring to the vocabulary pronunciation dictionary (130).

Next, in step S450, the corrected vocabulary is determined from among the generated corrected vocabulary candidates by referring to the general-domain language model (140) and, in addition, the specific-domain (specific-topic) language model (114-11). This completes the series of steps of the speech recognition error correction method.
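The control flow of steps S410 to S450 can be sketched as a loop over recognized vocabulary. This is an illustrative outline, not the patented implementation: the function `correct_stream` and its callable parameters are hypothetical, a vocabulary is assumed to be an error when its error value exceeds the threshold (the text only says the two are compared), and the stand-in callables in the demo replace Equation 1, the pronunciation dictionary lookup, and the Equation 2 selection.

```python
from typing import Callable, Iterable, List, Sequence


def correct_stream(
    recognized_words: Iterable[str],
    error_value: Callable[[str, Sequence[str]], float],
    threshold: float,
    generate_candidates: Callable[[str], List[str]],
    choose_correction: Callable[[List[str], Sequence[str]], str],
) -> List[str]:
    """Apply the S420-S450 decisions to each word produced by S410."""
    output: List[str] = []
    for word in recognized_words:          # S410: one recognized vocabulary
        history = tuple(output)            # previously accepted vocabulary
        # S420/S430: assumed rule, error if the error value exceeds the threshold.
        if error_value(word, history) <= threshold:
            output.append(word)            # not an error: continue with next word
            continue
        candidates = generate_candidates(word)          # S440
        if candidates:
            output.append(choose_correction(candidates, history))  # S450
        else:
            output.append(word)            # no candidate found: keep as recognized
    return output


# Stand-in callables for demonstration only (all hypothetical).
demo_error_value = lambda word, history: 1.0 if word == "modal" else 0.0
demo_generate = lambda word: ["model"]
demo_choose = lambda candidates, history: candidates[0]

result = correct_stream(
    ["language", "modal", "training"],
    demo_error_value, 0.5, demo_generate, demo_choose,
)
print(result)  # ['language', 'model', 'training']
```

Passing the three stages in as callables keeps the loop independent of how the error value, the candidate search, and the final selection are actually computed.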

The present invention has been described above with a focus on embodiments, but these are merely examples and do not limit the invention. A person of ordinary skill in the art to which the invention pertains will appreciate that various modifications and applications not illustrated above are possible without departing from the essential characteristics of the invention. For example, each component specifically shown in the embodiments may be modified and implemented. Differences relating to such modifications and applications should be construed as falling within the scope of the invention as defined in the appended claims.

Claims

A speech recognition error correction method comprising:
determining, using a distance value between the position of a current vocabulary and the position of a previous vocabulary mapped into a vector space according to a word embedding technique, whether the current vocabulary is an error vocabulary in which a speech recognition error has occurred;
when the current vocabulary is determined to be an error vocabulary, generating corrected vocabulary candidates having a pronunciation similar to that of the determined error vocabulary by referring to a vocabulary pronunciation dictionary; and
determining, from among the generated corrected vocabulary candidates, a corrected vocabulary to which the error vocabulary is restored, by referring to a general-domain language model and a specific-domain language model.