KR100704508B1

KR100704508B1 - Language model adaptation apparatus for Korean continuous speech recognition using an N-gram network, and method therefor

Info

Publication number: KR100704508B1
Application number: KR1020050037093A
Authority: KR
Inventors: 최준기; 이영직
Original assignee: 한국전자통신연구원
Priority date: 2004-12-14
Filing date: 2005-05-03
Publication date: 2007-04-09
Also published as: KR20060067096A

Abstract

The present invention relates to a language model adaptation apparatus and method for Korean continuous speech recognition using an N-gram network. During continuous speech recognition, the topic of the speech currently being recognized is estimated, and this information is used to search an N-gram network, which represents the semantic and syntactic distances between N-grams obtained from a large corpus, in order to adapt the language model. The apparatus comprises: a network generation module that generates an N-gram network using the similarity between documents in a large text corpus stored in a large text corpus DB; an N-gram network DB that stores the N-gram network generated by the network generation module; an information extraction module that extracts intermediate speech recognition results, detects acoustically stable words, extracts domain information for the speech currently being recognized, and extracts the N-grams that contain the stable words; a search module that looks up the N-grams extracted by the information extraction module in the N-gram network DB; and a real-time language model merging module that updates the language model in real time using the N-grams retrieved by the search module and applies the updated language model to speech recognition.

Speech recognition, continuous speech recognition, language model, language model adaptation, N-gram network, N-gram

Description

LANGUAGE MODEL ADAPTATION APPARATUS FOR KOREAN CONTINUOUS SPEECH RECOGNITION USING N-GRAM NETWORK AND METHOD THEREFOR

FIG. 1 is a functional block diagram showing the configuration of a language model adaptation apparatus for Korean continuous speech recognition using an N-gram network according to an embodiment of the present invention;

FIG. 2 is a detailed functional block diagram of the network generation module of FIG. 1;

FIG. 3 is a diagram explaining how the N-grams retrieved from the N-gram network are expanded around acoustically stable words at an intermediate stage of speech recognition;

FIG. 4 is a diagram explaining how the N-gram network is constructed;

FIG. 5 is an operation flowchart showing a language model adaptation method for Korean continuous speech recognition using an N-gram network according to an embodiment of the present invention; and

FIG. 6 is a detailed operation flowchart of the network generation process of FIG. 5.

<Explanation of reference numerals for the main parts of the drawings>

100: large text corpus DB
200: basic language model generation unit
300: basic language model DB
400: network generation module
500: N-gram network DB
600: acoustic model DB
700: continuous speech recognition unit
800: information extraction module
900: search module
1000: real-time language model merging module

The present invention relates to a language model adaptation apparatus and method for Korean continuous speech recognition using an N-gram network. More specifically, it relates to an apparatus and method that, during continuous speech recognition, estimate the topic of the speech currently being recognized and use this information to search an N-gram network, which represents the semantic and syntactic distances between N-grams obtained from a large corpus, so as to adapt the language model.

In general, the language model plays a very important role in continuous speech recognition: it is used to correct errors made by the acoustic model and to produce accurate recognition results. However, the N-gram model, which is currently the most widely used and serves as the de facto standard of the speech recognition industry, can give very different results depending on the domain of its training data. The best performance can be expected when the domain of the speech to be recognized matches the domain of the language model training data. Therefore, in speech recognition that covers many domains, such as broadcast news recognition, accuracy can be improved by a language model adaptation method that adapts the domain of the language model to each topic.

Conventional language model adaptation techniques extract domain information about the speech from the intermediate results of speech recognition and, on that basis, retrieve similar text sentences from a large body of training data to assemble adaptation data consisting only of the relevant domain. An adapted language model is then generated from this adaptation data. To compensate for the weakness of a language model built from only a small amount of text, a widely used approach has been to merge it with the language model generated from the existing large corpus to produce the final adapted language model.

However, because such conventional techniques derive the adapted language model from a small amount of adaptation data, the language model may improve for the parts of a sentence that carry topic information, but the parts unrelated to the topic of the adaptation data can be distorted, so the stable language model scores obtained from the existing large corpus sometimes become unstable. As a result, recognition improves for some sentences, but for a considerable share of sentences it does not. Overall, the performance gain obtained through language model adaptation has therefore not been very large.

Accordingly, the present invention has been made to solve the above problems of the prior art. Its object is to provide a language model adaptation apparatus and method for Korean continuous speech recognition using an N-gram network, in which the adaptation text corpus is implemented in the form of an N-gram network so that, once one N-gram has been retrieved, the other N-grams that can occur in the same sentence can be called up and used in sequence, thereby improving speech recognition performance in all cases without degrading recognition performance.

To achieve the above object, the language model adaptation apparatus for Korean continuous speech recognition using an N-gram network according to the present invention comprises: first storage means for storing a large text corpus used to generate an N-gram network and a basic language model; network generation means for generating an N-gram network using the similarity between documents in the large text corpus stored in the first storage means; second storage means for storing the N-gram network generated by the network generation means; information extraction means for extracting intermediate speech recognition results, detecting acoustically stable words, extracting domain information for the speech currently being recognized, and extracting the N-grams that contain the stable words; search means for looking up the N-grams extracted by the information extraction means in the second storage means; and real-time language model merging means for updating the language model in real time using the N-grams retrieved by the search means and applying the updated language model to speech recognition.

Likewise, to achieve the above object, the language model adaptation method for Korean continuous speech recognition using an N-gram network according to the present invention comprises: a first step of extracting a basic language model from a large text corpus; a second step of generating an N-gram network using the similarity between documents in the large text corpus stored in the constructed large text corpus DB; a third step of constructing an N-gram network DB that stores the N-gram network generated in the second step; a fourth step of extracting intermediate speech recognition results, detecting acoustically stable words, extracting domain information for the speech currently being recognized, and extracting the N-grams that contain the stable words; a fifth step of looking up the N-grams extracted in the fourth step in the constructed N-gram network DB; and a sixth step of updating the language model in real time using the N-grams retrieved in the fifth step and applying the updated language model to speech recognition.

Hereinafter, a language model adaptation apparatus and method for Korean continuous speech recognition using an N-gram network according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a functional block diagram showing the configuration of the language model adaptation apparatus for Korean continuous speech recognition using an N-gram network according to an embodiment of the present invention, and FIG. 2 is a detailed functional block diagram of the network generation module of FIG. 1.

As shown in FIG. 1, the language model adaptation apparatus for Korean continuous speech recognition using an N-gram network according to an embodiment of the present invention comprises: a large text corpus DB (100) for building the basic language model and the N-gram network; a basic language model generation unit (200) that generates the basic language model from the text corpus stored in the large text corpus DB (100); a basic language model DB (300) that stores the basic language model generated by the basic language model generation unit (200); a network generation module (400) that generates an N-gram network using the similarity between documents in the large text corpus stored in the large text corpus DB (100); an N-gram network DB (500) that stores the N-gram network generated by the network generation module (400); a continuous speech recognition unit (700) that performs continuous speech recognition using the basic language model stored in the basic language model DB (300) and the acoustic model stored in an acoustic model DB (600); an information extraction module (800) that extracts the intermediate speech recognition results of the continuous speech recognition unit (700), detects acoustically stable words, extracts domain information for the speech currently being recognized, and extracts the N-grams that contain the stable words; a search module (900) that looks up the N-grams extracted by the information extraction module (800) in the N-gram network DB (500); and a real-time language model merging module (1000) that updates the language model in real time using the N-grams retrieved by the search module (900) and applies the updated language model to speech recognition.

As shown in FIG. 2, the network generation module (400) comprises: a document clustering unit (401) that clusters documents using the similarity between documents in the large text corpus stored in the large text corpus DB (100); a per-cluster N-gram extraction unit (402) that extracts N-grams for each document cluster produced by the document clustering unit (401); and an N-gram combining unit (403) that combines the N-grams extracted by the per-cluster N-gram extraction unit (402) to generate the N-gram network and stores it in the N-gram network DB (500).

The language model adaptation apparatus and method for Korean continuous speech recognition using an N-gram network according to an embodiment of the present invention, configured as above, will now be described with reference to FIGS. 3 to 6.

First, the large text corpus DB (100) for generating the basic language model is constructed. This large corpus must contain a variety of documents covering many topics and domains, and is a rich collection of sentences with diverse sentence patterns.

The basic language model generation unit (200) generates the basic language model using the adaptive text corpus stored in the large text corpus DB (100), and stores the generated basic language model in the basic language model DB (300), thereby constructing the basic language model DB (300) (S102, S103).

Meanwhile, the network generation module (400) generates an N-gram network using the similarity between documents in the large text corpus stored in the large text corpus DB (100), and stores the generated N-gram network in the N-gram network DB (500), thereby constructing the N-gram network DB (500) (S100, S101).

The N-gram network has two kinds of links. The first is the position link, which connects adjacent words; an N-gram probability value is assigned to each connection between words. The second is the domain link, which groups together N-grams that occur in documents of similar domains. To create the domain links, as shown in FIG. 4, the documents in the large text corpus are clustered according to inter-document similarity, and the domain links are built from the clustered documents.

In more detail, the document clustering unit (401) of the network generation module (400) clusters documents using the similarity between documents in the large text corpus stored in the large text corpus DB (100) (S200). The per-cluster N-gram extraction unit (402) extracts N-grams for each document cluster produced by the document clustering unit (401) (S201). The N-gram combining unit (403) combines the N-grams extracted by the per-cluster N-gram extraction unit (402) to generate the N-gram network and stores it in the N-gram network DB (500) (S202).
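The three steps S200 to S202 can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the patent does not specify the similarity measure, the clustering algorithm, or the storage layout, so the bag-of-words cosine similarity, the greedy single-pass clustering, the `threshold` value, and all function names below are assumptions. Position links are stored as raw successor counts, which could be normalized into probability values.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_documents(docs, threshold=0.3):
    """S200 (assumed method): greedy single-pass clustering.

    A document joins the first cluster whose centroid is similar enough,
    otherwise it starts a new cluster. Returns lists of document indices.
    """
    clusters = []  # list of (centroid Counter, [document indices])
    for i, doc in enumerate(docs):
        vec = Counter(doc.split())
        for centroid, members in clusters:
            if cosine(vec, centroid) >= threshold:
                centroid.update(vec)
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_network(docs, n=2, threshold=0.3):
    """S201 and S202: extract N-grams per cluster and combine them."""
    clusters = cluster_documents(docs, threshold)
    domain_links = defaultdict(set)        # n-gram -> ids of clusters it occurs in
    position_links = defaultdict(Counter)  # n-gram -> counts of the n-gram that follows
    for cid, members in enumerate(clusters):
        for i in members:
            grams = ngrams(docs[i].split(), n)
            for g in grams:
                domain_links[g].add(cid)
            for g, nxt in zip(grams, grams[1:]):
                position_links[g][nxt] += 1
    return clusters, domain_links, position_links
```

With whitespace-tokenized input this keeps one structure per link type, so a lookup by N-gram gives both its neighbours and its clusters. Korean text would normally need morphological analysis before this step, which the sketch leaves out.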

The continuous speech recognition unit (700) performs continuous speech recognition using the basic language model stored in the basic language model DB (300) and the acoustic model stored in the acoustic model DB (600) (S104).

The information extraction module (800) extracts the intermediate speech recognition results of the continuous speech recognition unit (700) (S105), detects acoustically stable words (S106), extracts domain information for the speech currently being recognized, and extracts the N-grams that contain the stable words (S107).

In this way, the information extraction module (800) extracts domain information for the speech currently being recognized from the intermediate results of speech recognition. Because an ordinary N-gram has no means of representing domain information, it is very important to know the exact domain of the speech being recognized. Domain information is preferably obtained not from a single sentence but from several sentences that deal with the same topic. For broadcast news, therefore, the intermediate recognition results are examined per news item, each of which covers a single topic. The intermediate results of speech recognition are generally given as an N-best list or a lattice, and acoustically very stable words can be found in them. In terms of human speech perception, these are the words that are heard very clearly. Such acoustically stable words appear identically many times in similar segments of the N-best list, or correspond to the parts of a lattice where the branching factor becomes small. For the parts other than the acoustically stable words, the N-best list contains a wide variety of candidate word sequences for the corresponding segment. To find the acoustically stable words accurately, the acoustic model must be built robustly, and the first pass of speech recognition should be performed with a lower-order language model that is simpler than the higher-order language model.
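One way to read the N-best criterion described above is to count, for each word of the top hypothesis, how many hypotheses contain the same word in an overlapping time segment. The sketch below illustrates that reading only; the hypothesis format `(word, start, end)`, the function name `stable_words`, and the `min_ratio` threshold are assumptions, not details given in the patent.

```python
def stable_words(nbest, min_ratio=0.8):
    """Return words of the top hypothesis that recur across the N-best list.

    nbest: list of hypotheses, best first; each hypothesis is a list of
    (word, start_frame, end_frame) tuples. A word counts as acoustically
    stable when at least `min_ratio` of the hypotheses contain the same
    word in a time segment that overlaps its own.
    """
    top = nbest[0]
    stable = []
    for word, start, end in top:
        hits = sum(
            any(w == word and s < end and start < e for w, s, e in hyp)
            for hyp in nbest
        )
        if hits / len(nbest) >= min_ratio:
            stable.append(word)
    return stable
```

A lattice-based variant would instead look for nodes with a small branching factor, as the text notes; that case is not sketched here.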

Once the acoustically stable words have been determined from the intermediate recognition results, the information extraction module (800) extracts the N-grams that contain those words. If the overall language model is a trigram model, the unigrams, bigrams, and trigrams that contain the word are used as queries for the N-gram network search.
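As a small illustration of the query construction just described, the following sketch collects every unigram, bigram, and trigram of a hypothesis that contains a given stable word. The function name `queries_for_word` and its parameters are hypothetical.

```python
def queries_for_word(tokens, index, max_order=3):
    """All n-grams with n <= max_order that contain tokens[index]."""
    out = []
    for n in range(1, max_order + 1):
        first = max(0, index - n + 1)        # earliest window that still covers index
        last = min(index, len(tokens) - n)   # latest window that fits in the sentence
        for start in range(first, last + 1):
            out.append(tuple(tokens[start:start + n]))
    return out
```

For a trigram language model, `max_order=3` matches the unigram, bigram, and trigram queries mentioned in the text.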

The search module (900) looks up the N-grams extracted by the information extraction module (800) in the N-gram network DB (500) (S108, S109). The search module (900) is the part that searches the N-gram network for the large number of extracted N-grams. It first determines how similar the N-grams retrieved with unigram or bigram queries are to the domain of the speech currently being recognized, how much domain information the current sentences have in common, and how uniform the domain information of the sentence currently being recognized is (S108). This determination prevents the performance loss that could occur if the sentence being recognized were rescored with a language model updated to include domain information. That is, for sentences whose domain information is unclear, the language model is either not updated or the update weight is reduced, so that recognition performance does not degrade.
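The patent does not give a formula for this domain test. One simple way to sketch the idea, assuming each retrieved N-gram carries the set of document clusters (domains) it occurs in, is to let the N-grams vote and to use the share of the dominant domain as the update weight, skipping the update when no domain is clearly dominant. The function name `adaptation_weight` and the `min_share` threshold are assumptions.

```python
from collections import Counter


def adaptation_weight(query_domains, min_share=0.5):
    """Decide whether, and how strongly, to adapt the language model.

    query_domains: one set of domain (cluster) ids per retrieved n-gram.
    Returns (dominant_domain, weight). A weight of 0.0 means the domain
    information is too unclear and the language model is left unchanged.
    """
    votes = Counter(d for domains in query_domains for d in domains)
    total = sum(votes.values())
    if total == 0:
        return None, 0.0
    domain, count = votes.most_common(1)[0]
    share = count / total
    return (domain, share) if share >= min_share else (None, 0.0)
```

Returning a graded weight rather than a yes/no decision covers both options named in the text: not updating at all, or updating with a reduced weight.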

When searching the N-gram network, the search module (900) first looks up the N-grams that contain the acoustically stable words and then, using the inter-domain similarity between these N-grams, retrieves N-grams whose domains are similar and N-grams whose link information is similar, to build an adaptive N-gram set. As shown in FIG. 4, the N-gram network is built by linking word to word, so N-grams appear connected to one another and together form one large mass. In addition, correlation links exist between N-grams of similar domains, which makes it possible to retrieve N-grams whose domains are similar. The search uses both position-information similarity and domain-information similarity.
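The expansion described above can be sketched as a breadth-first walk that starts from the N-grams containing stable words, follows position links, and keeps only neighbours that share the target domain. This is an assumed reading, not the patent's algorithm: the dictionary layout (`position_links` mapping an N-gram to its successors, `domain_links` mapping an N-gram to its domain ids), the function name `expand_ngrams`, and the `max_size` cap are all hypothetical.

```python
from collections import deque


def expand_ngrams(seeds, position_links, domain_links, domain, max_size=1000):
    """Build an adaptive n-gram set by breadth-first expansion.

    A neighbour is kept only if it follows an already selected n-gram
    (position link) and occurs in the target domain (domain link).
    """
    selected = []
    seen = set()
    queue = deque(g for g in seeds if domain in domain_links.get(g, ()))
    while queue and len(selected) < max_size:
        gram = queue.popleft()
        if gram in seen:
            continue
        seen.add(gram)
        selected.append(gram)
        for nxt in position_links.get(gram, ()):
            if nxt not in seen and domain in domain_links.get(nxt, ()):
                queue.append(nxt)
    return selected
```

Using both dictionaries in the filter mirrors the statement that position similarity and domain similarity are used together during the search.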

The real-time language model merging module (1000) updates the language model in real time using the N-grams retrieved by the search module (900) and applies the updated language model to speech recognition (S110). The real-time language model merging module (1000) can build a training corpus from the retrieved N-grams and can assign new language model values by computing a weight for the domain. Accordingly, segments where the confidence of the acoustic model is low are given as long a language model as possible and a strong domain weight, while segments or words where the confidence of the acoustic model is high are given a short language model, and the language model is then recomputed. To make this method more robust, techniques such as utterance verification can be used.
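The patent states the policy (long language model and strong domain weight where acoustic confidence is low, short language model where it is high) but not the formula. The sketch below fills that gap with plain linear interpolation between the basic and adapted probabilities, which is a common technique and an assumption here, as are the 0.5 confidence cut-off and the function names.

```python
def merge_settings(acoustic_confidence, domain_weight, max_order=3):
    """Choose (n-gram order, interpolation weight) for one segment.

    Low acoustic confidence -> longest n-gram order and a strong domain
    weight; high acoustic confidence -> unigram order and a weak weight.
    """
    c = min(max(acoustic_confidence, 0.0), 1.0)  # clamp to [0, 1]
    order = max_order if c < 0.5 else 1
    lam = domain_weight * (1.0 - c)
    return order, lam


def interpolate(p_base, p_adapt, lam):
    """Linear interpolation of basic and adapted n-gram probabilities."""
    return (1.0 - lam) * p_base + lam * p_adapt
```

Because the weight shrinks to zero as acoustic confidence approaches one, well-recognized words keep their basic language model scores, which is in line with the stated aim of adapting without degrading recognition.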

FIG. 3 thus illustrates the process of selecting acoustically stable words from the intermediate speech recognition results, performing a domain test on those words, and, when there is confidence that the domain information can be used, expanding the language model and revising the speech recognition result.

Although the present invention has been described in detail above with reference to several embodiments, it is not necessarily limited to these embodiments and may be modified in various ways without departing from the technical spirit of the invention.

As described above, the language model adaptation apparatus and method for Korean continuous speech recognition using an N-gram network according to the present invention use an N-gram network and acoustically stable words in a language model adaptation method that updates the language model from the intermediate results of continuous speech recognition. With this method, the domain of the speech can be estimated accurately, and the domain information can be used efficiently to adapt the language model effectively. The present invention also has the effect of bringing a consistent improvement in speech recognition performance.

Claims

An apparatus for adapting a language model for Korean continuous speech recognition using an N-gram network, comprising:

first storage means for storing a large-scale text corpus used to generate a basic language model and an N-gram network;

network generating means for generating an N-gram network using the similarity between documents in the large-scale text corpus stored in the first storage means, the network generating means comprising a document clustering unit that clusters documents using the similarity between documents in the large-scale text corpus, a per-cluster N-gram extraction unit that extracts N-grams for each document cluster produced by the document clustering unit, and an N-gram combining unit that combines the N-grams extracted by the per-cluster N-gram extraction unit to generate the N-gram network;

second storage means for storing the N-gram network generated by the network generating means;

information extraction means for extracting an intermediate speech recognition result, detecting acoustically stable words, extracting domain information of the speech currently being recognized, and extracting the N-grams that contain the stable words;

search means for, when searching the N-gram network in the second storage means for the N-grams extracted by the information extraction means, first searching for the N-grams that contain the acoustically stable words and then, using the inter-domain similarity between these N-grams, retrieving N-grams having similar domains and N-grams having similar link information to create an adaptive N-gram set; and

real-time language model merging means for updating the language model in real time using the N-grams retrieved by the search means and applying the updated language model to speech recognition.

delete

The apparatus of claim 1,

wherein the search means determines how similar the N-grams retrieved using unigrams or bigrams are to the domain information of the speech currently being recognized, how much domain information the current sentences have in common, and how uniform the domain information of the sentence currently being recognized is.

delete

The apparatus of claim 1,

wherein the real-time language model merging means recomputes the language model by giving as long a language model as possible and a strong domain weight to segments in which the reliability of the acoustic model is low, and applying a short language model to segments or words in which the reliability of the acoustic model is high.

A method for adapting a language model for Korean continuous speech recognition using an N-gram network, comprising:

a first step of constructing a large-scale text corpus DB storing an adaptive text corpus for constructing an N-gram network;

a second step of generating an N-gram network using the similarity between documents in the large-scale text corpus, the second step comprising a first sub-step of clustering documents using the similarity between documents in the large-scale text corpus, a second sub-step of extracting N-grams for each document cluster produced in the first sub-step, and a third sub-step of combining the N-grams extracted in the second sub-step to generate the N-gram network;

a third step of constructing an N-gram network DB storing the N-gram network generated in the second step;

a fourth step of extracting an intermediate speech recognition result, detecting acoustically stable words, extracting domain information of the speech currently being recognized, and extracting the N-grams that contain the stable words;

a fifth step of, when searching the N-gram network in the constructed N-gram network DB for the N-grams extracted in the fourth step, first searching for the N-grams that contain the acoustically stable words and then, using the inter-domain similarity between these N-grams, retrieving N-grams having similar domains and N-grams having similar link information to create an adaptive N-gram set; and

a sixth step of updating the language model in real time using the N-grams retrieved in the fifth step and applying the updated language model to speech recognition.

delete

The method of claim 6,

wherein the fifth step determines how similar the N-grams retrieved using unigrams or bigrams are to the domain information of the speech currently being recognized, how much domain information the current sentences have in common, and how uniform the domain information of the sentence currently being recognized is.

delete

The method of claim 6,

wherein, in the sixth step, the language model is recomputed by giving as long a language model as possible and a strong domain weight to segments in which the reliability of the acoustic model is low, and applying a short language model to segments or words in which the reliability of the acoustic model is high.