KR100561225B1

KR100561225B1 - Real-time news collection System based on GUI environment and On-line Language Model Generation Service method

Info

Publication number: KR100561225B1
Application number: KR1020030092517A
Authority: KR
Inventors: 김현숙; 김상훈
Original assignee: 한국전자통신연구원
Priority date: 2003-12-17
Filing date: 2003-12-17
Publication date: 2006-03-15
Also published as: KR20050060795A

Abstract

The present invention relates to a user-friendly real-time article collection system and an online language model construction service method, both based on a GUI (Graphic User Interface) environment. They collect recent broadcast news and newspaper articles in real time and reflect that information in the language model and vocabulary dictionary, in order to improve broadcast news speech recognition performance and reduce the number of out-of-vocabulary words. The service method uses a system that builds a language model from articles collected by accessing the websites of news media, and comprises: accessing the websites of the selected news media and downloading articles in real time, by setting the news media (newspapers, broadcasters, etc.) to collect from and the articles to be collected from those media; a text conversion step of converting English, numerals, etc. contained in the collected articles into Hangul; tagging pseudo-morphemes in the newly collected article corpus; building a vocabulary dictionary, a language model, and a pronunciation dictionary for the newly collected articles; creating a new vocabulary dictionary by merging the vocabulary dictionary of the new article corpus with that of the existing corpus; and interpolating the existing language model with the newly built language model and transmitting the result to the speech recognition system.

Broadcast news speech recognition, real-time article collection, online language model construction service

Description

Real-time news collection system based on GUI environment and On-line Language Model Generation Service method

FIG. 1 is a block diagram for performing speech recognition according to the present invention;

FIG. 2 is a block diagram of a speech recognition system according to the present invention;

FIG. 3 is a flowchart of a general procedure for building a language model, a vocabulary dictionary, and a pronunciation dictionary;

FIG. 4 is a flowchart of the procedure for real-time article collection and online language model construction;

FIG. 5 is an example of the GUI screen used for the real-time article collection and online language model construction provided by the present invention.

* Description of the reference numerals for the main parts of the drawings *

10: article collection system; 11: language model generation system

12: broadcast news speech recognition system

The present invention relates to the technical field of using a computer to convert Korean broadcast news speech into text. More specifically, it relates to a real-time article collection system and an online language model construction service method based on a GUI environment. These provide an integrated environment for collecting recent broadcast news and newspaper articles in real time and reflecting that information in the language model and vocabulary dictionary, in order to improve broadcast news speech recognition performance and reduce the number of out-of-vocabulary words.

In speech recognition, the words registered in the dictionary are the basic units of recognition, and the number of words to be recognized determines both the difficulty of the recognition task and the recognition performance. Even when large broadcast news and newspaper corpora are available, the vocabulary dictionary used to generate a language model is built only from words that occur with high frequency in the text corpus. Low-frequency words are therefore left out of the vocabulary dictionary, and in the language model they are either excluded or marked as unknown. Even if an existing language model has been built from a large corpus, recognizing broadcast news, which covers each day's events and information, is practical only if words describing newly occurring events can be reflected in the recognizer's vocabulary dictionary and language model at any time.

Because the vocabulary dictionary and language model in a broadcast news recognition system are built from an existing corpus, they may not contain words for newly occurring events. Such new words are treated as out-of-vocabulary (OOV) words, and the more OOV words there are, the more recognition errors occur and the lower the recognition performance becomes.

Therefore, to keep using a broadcast news speech recognition system in practice, a method is needed to update the existing vocabulary dictionary and language model frequently so that words for each day's new events can be recognized.

A speech recognition system uses a combination of the acoustic and linguistic (language) information contained in an utterance to produce a meaningful transcription of that utterance. The language information used by the recognizer in a speech recognition system is collectively called the language model.

In the conventional approach, a language model is built from a large collection of broadcast news text, and then, at the time the recognizer is used, the latest newspaper articles are collected and the language model is adapted using, for example, MAP (maximum a posteriori) estimation.

However, in this approach article collection and language model adaptation are not linked and run automatically online; the article collection step and the language model adaptation step are built separately.

Therefore, an easy-to-use GUI service environment is needed in which the collection of the latest articles and the construction of the language model for the speech recognition system are integrated and run automatically.

The present invention is intended to solve the problems of the prior art. Its object is to provide a real-time article collection and online language model construction service method based on a GUI (Graphic User Interface) environment, which collects recent broadcast news and newspaper articles in real time and reflects that information in the language model and vocabulary dictionary, in order to improve broadcast news speech recognition performance and reduce the number of out-of-vocabulary words.

To achieve this object, the real-time article collection system based on a GUI environment according to the present invention is a system with a GUI screen suited to collecting articles provided by news media, so that a language model can be generated from the collected articles and, in conjunction with a speech recognition system, spoken news can be converted into text. The GUI screen includes: a menu window for selecting a plurality of news media and setting the type of articles provided by the selected media, the start/end dates of the articles, and the time slot, whereby the system automatically accesses the corresponding websites and collects the articles in real time; and a statistics window for automatically computing and displaying, from the collected articles, the total vocabulary size, the number of articles, and the number of unregistered words found by comparison with the already registered words. The collected articles are transmitted to a system for generating a language model.

To achieve this object, the real-time online language model construction service method based on a GUI environment according to the present invention is a service method using a system that builds a language model from articles collected by accessing the websites of news media. The method comprises: accessing the websites of the selected news media and downloading articles in real time, by setting the news media (newspapers, broadcasters, etc.) to collect from and the articles to be collected from those media; a text conversion step of converting English, numerals, etc. contained in the collected articles into Hangul; tagging pseudo-morphemes in the newly collected article corpus; building a vocabulary dictionary, a language model, and a pronunciation dictionary for the newly collected articles; creating a new vocabulary dictionary by merging the vocabulary dictionary of the new article corpus with that of the existing corpus; and interpolating the existing language model with the newly built language model and transmitting the result to the speech recognition system.

Here, in the merged vocabulary dictionary, nouns that appear in the newly collected corpus are preferably included in preference to words from the existing corpus.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of the systems used for real-time article collection and online language model construction according to the present invention.

FIG. 1 gives a brief, general description of a suitable computing environment in which the invention is practiced. Reference numeral 10 denotes the GUI-based article collection system of the invention. The article collection system 10 is built to collect articles by accessing media sites over the Internet in a GUI environment. As illustrated, it collects information from websites provided by broadcasters or newspapers, but the collection targets are not limited to broadcasters and newspapers.

The article collection system 10 accesses the sites provided by broadcasters or newspapers and collects articles in real time from the information stored on their storage servers 13 and 14. The language model generation system 11 generates a new language model from the newly collected articles, and the broadcast news speech recognition system 12 adopts this new model as its broadcast language model, which enables robust conversion of speech into text.

To build an online language model, the invention first runs the article collection system 10, which uses a GUI environment. To collect the broadcast news or newspaper articles of a specific date, the user works through the GUI screen to access the broadcast news article storage server 13 or the newspaper article storage server 14 provided by the website of the relevant news medium and downloads the articles.

The article collection system 10 converts English, numerals, etc. in the downloaded articles into Hangul, corrects numeral-reading errors, spacing errors, spelling errors, and the like, and then performs pseudo-morpheme tagging on the collected articles. The result is then sent to the language model generation system 11.
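The numeral-to-Hangul part of this conversion can be illustrated with a minimal sketch. It is not the patent's implementation: the function names are hypothetical, only Sino-Korean readings of plain integers are handled, and context-dependent readings (native Korean numerals with counters, decimals, phone numbers) as well as the English-to-Hangul conversion are left out.

```python
import re

DIGITS = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
SMALL_UNITS = ["", "십", "백", "천"]   # 10^0 .. 10^3 within a four-digit group
BIG_UNITS = ["", "만", "억", "조"]     # 10^0, 10^4, 10^8, 10^12
LIMIT = 10 ** 16                       # first value this sketch cannot read


def read_group(n):
    """Read 1..9999 in Sino-Korean, omitting '일' before 십/백/천."""
    out = []
    for pos in range(3, -1, -1):
        d = (n // 10 ** pos) % 10
        if d == 0:
            continue
        out.append(("" if d == 1 and pos > 0 else DIGITS[d]) + SMALL_UNITS[pos])
    return "".join(out)


def read_number(n):
    """Read a non-negative integer below 10**16 as a Sino-Korean numeral."""
    if not 0 <= n < LIMIT:
        raise ValueError("number out of range for this sketch")
    if n == 0:
        return "영"
    parts = []
    index = 0
    while n > 0:
        n, group = divmod(n, 10000)
        if group:
            # A leading 1 in the 만 group is conventionally dropped (10000 -> "만").
            drop_one = group == 1 and index == 1 and n == 0
            parts.append(("" if drop_one else read_group(group)) + BIG_UNITS[index])
        index += 1
    return "".join(reversed(parts))


def normalize_numbers(text):
    """Replace each ASCII digit run with its Sino-Korean reading."""
    def replace(match):
        value = int(match.group())
        # Digit runs too long for this sketch are left untouched.
        return read_number(value) if value < LIMIT else match.group()
    return re.sub(r"[0-9]+", replace, text)
```

A production normalizer would need rules or a tagger to choose between Sino-Korean and native readings; this sketch only shows the general shape of the step.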

The language model generation system 11 generates a new language model from the newly collected articles and interpolates it with the existing language model built from the large existing corpus. The result is the broadcast language model, which is sent to the broadcast news speech recognition system 12 and used for broadcast news recognition.
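The interpolation step can be sketched as a linear mixture, P(w) = lam * P_new(w) + (1 - lam) * P_old(w). The source does not state the interpolation weight or the exact scheme, so the weight `lam`, the function name, and the use of unigram distributions (instead of the trigram models the patent describes) are simplifying assumptions.

```python
def interpolate(p_old, p_new, lam=0.5):
    """Linearly interpolate two word distributions given as dicts word -> probability.

    lam is the weight of the new model; the value 0.5 is an arbitrary default.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    words = set(p_old) | set(p_new)
    return {
        w: lam * p_new.get(w, 0.0) + (1.0 - lam) * p_old.get(w, 0.0)
        for w in words
    }


# Toy distributions: "c" occurs only in the newly collected articles.
p_existing = {"a": 0.5, "b": 0.5}
p_latest = {"a": 0.2, "c": 0.8}
p_mixed = interpolate(p_existing, p_latest, lam=0.25)
```

Because both inputs sum to one, the mixture also sums to one, and a word seen only in the latest articles still receives non-zero probability.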

FIG. 2 is a block diagram of a general tree-based speech recognition system. Referring to the figure, speech input to the broadcast news speech recognition system 12 is converted by the feature extractor 21 into feature vectors that retain only the information useful for recognition. From these feature vectors, the search unit 22 uses the Viterbi algorithm to find the most probable word sequence, drawing on the language model, vocabulary dictionary, and pronunciation dictionary obtained in advance during training. For large-vocabulary recognition, the words to be recognized are organized as a tree, and the search unit 22 searches this tree. The post-processor 23 removes noise, symbols, and the like from the search result, recombines the output into syllable blocks, and outputs the final recognition result as text.
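For readers unfamiliar with the Viterbi algorithm mentioned here, the following is a minimal sketch over a generic hidden Markov model in the log domain. It is not the tree-structured lexicon search of the recognizer; the two-state model and all probabilities are made-up toy values, and the observation sequence is assumed to be non-empty.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable state sequence for a non-empty observation list."""
    # best[t][s] = (log score of the best path ending in s at time t, previous state)
    best = [{s: (log_start[s] + log_emit[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + log_emit[s][obs], prev)
        best.append(column)
    # Backtrace from the best final state to the first time step.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


def log_table(table):
    """Convert a nested dict of probabilities to log probabilities."""
    return {k: {kk: math.log(v) for kk, v in row.items()} for k, row in table.items()}


# Toy model (hypothetical values).
states = ["A", "B"]
log_start = {"A": math.log(0.6), "B": math.log(0.4)}
log_trans = log_table({"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}})
log_emit = log_table({"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}})
best_path = viterbi(["x", "y", "y"], states, log_start, log_trans, log_emit)
```

In a real recognizer the states would come from acoustic models arranged along the lexical tree, and the language model would contribute to the transition scores between words.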

Here, the language model is a technique specific to large-vocabulary continuous speech recognition and serves as the grammar of the recognizer. It is used to reconstruct word-level recognition results into sentences, and it reduces the search space by applying linguistic information to portions that cannot be recognized accurately because of acoustic ambiguity. Recent research on improving large-vocabulary continuous speech recognition has focused mainly on improving the language model.

The language models commonly used in speech recognition are those based on phrase structure grammar and statistical language models. A statistical language model expresses the relationships between words as probabilities.

A statistical language model can usually be estimated easily from a large number of text sentences in a given domain. Because it computes only the probability of a sentence rather than parsing the whole input, it can also recognize sentences that differ partly from the training sentences. The N-gram model is a representative statistical language model.

An N-gram language model is a grammar that defines the probability of the next word given the preceding N-1 words, and it performs very well when sufficient training data is available. Because of time and space complexity, simple models such as bigrams and trigrams over morphemes or eojeols (space-delimited word units) are mainly used.
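As a concrete instance of the N-gram definition, the sketch below estimates a bigram model (N = 2) by maximum likelihood: P(w | prev) = count(prev, w) / count(prev). The function name and the sentence boundary markers `<s>` and `</s>` are illustrative choices, and no smoothing is applied, which a practical model would need.

```python
from collections import Counter


def train_bigram(sentences):
    """Return a function prob(word, prev) estimated by maximum likelihood."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1

    def prob(word, prev):
        # Unseen histories get probability 0 in this unsmoothed sketch.
        if history_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / history_counts[prev]

    return prob


# Toy corpus of two tokenized sentences.
bigram_prob = train_bigram([["a", "b"], ["a", "c"]])
```

A trigram model follows the same pattern with a two-word history as the key.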

FIG. 3 is a flowchart of the general procedure for building a language model, a vocabulary dictionary, and a pronunciation dictionary.

Referring to the figure, building a more robust language model requires a text corpus with a low error rate. This calls for a text corpus preprocessing step that corrects numeral-reading errors, spacing errors, spelling errors, and the like in the broadcast news transcripts (S31).

The preprocessed text corpus is tagged at the morpheme level using a morphological analyzer, and the morphemes are then merged into pseudo-morpheme units to reduce the recognition errors that short morphemes cause during recognition (S32-S33).
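The source does not spell out the merging rule, so the following is only one possible heuristic, shown to make the idea of a pseudo-morpheme concrete: runs of consecutive morphemes shorter than `min_len` characters are joined into a single unit. The function name, the threshold, and the rule itself are assumptions, and the input is assumed to be the morphemes of a single eojeol so that merges never cross a word boundary.

```python
def merge_short_morphemes(morphemes, min_len=2):
    """Join runs of consecutive short morphemes into one pseudo-morpheme."""
    merged = []
    run = ""
    for morpheme in morphemes:
        if len(morpheme) < min_len:
            run += morpheme          # keep accumulating short morphemes
            continue
        if run:
            merged.append(run)       # flush the pending run before a long morpheme
            run = ""
        merged.append(morpheme)
    if run:
        merged.append(run)
    return merged


# "발표 + 하 + 였 + 다" becomes "발표 + 하였다" under this heuristic.
units = merge_short_morphemes(["발표", "하", "였", "다"])
```

A real system would more likely drive the merge from part-of-speech tags or corpus statistics than from length alone.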

Next, frequency information for each pseudo-morpheme is extracted from the text corpus, and the most frequently used words, about 64,000 or 65,000 of them, are selected to create the vocabulary dictionary. The vocabulary dictionary is then applied to the text corpus to extract the indexed N-gram list, and a trigram language model and a pronunciation dictionary are generated (S34-S36).
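The frequency-based vocabulary selection can be sketched as follows. The function name is hypothetical and the default of 64,000 follows the figure quoted in the text; ties are broken alphabetically here only to make the result deterministic.

```python
from collections import Counter


def build_vocabulary(corpus, max_size=64000):
    """Return the max_size most frequent tokens of a tokenized corpus."""
    counts = Counter(token for sentence in corpus for token in sentence)
    # Sort by descending count, then by token so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:max_size]]


# Toy corpus: "a" occurs 3 times, "b" twice, "c" once.
toy_vocab = build_vocabulary([["a", "b", "a"], ["c", "a", "b"]], max_size=2)
```

Tokens outside the returned list are the ones that end up excluded from the dictionary or mapped to an unknown symbol.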

The present invention collects newspaper articles or broadcast news over the Internet in real time and uses them to build a language model. To recognize broadcast news covering each day's events and information more robustly, an environment is needed in which words describing newly occurring events can be reflected in the recognizer's vocabulary dictionary and language model at any time. To provide such an environment, the invention performs real-time article collection and online language model construction through a GUI screen such as the one shown in FIG. 5.

FIG. 5 illustrates a GUI (Graphic User Interface) screen for real-time article collection and online language model construction according to the present invention. The illustrated GUI screen 51 has separate menu windows 52 for selecting newspapers and broadcasts.

For example, the newspaper menu window 52 has click boxes 53 for various newspapers and their publication dates, and the broadcast menu window 54 may have input fields 55 for entering the broadcaster and the start and end dates or times of its broadcasts. The GUI screen 51 also has a display window 56 that analyzes the collected articles and records statistics. For example, it has display windows for the total vocabulary size, the number of articles, and the number of unregistered words, as well as a display window for cumulative statistics.
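The figures shown in such a statistics window could be maintained as running totals, as in the sketch below. The class and field names are hypothetical, and "total vocabulary size" is interpreted here as the number of distinct words seen so far, which the source does not define precisely.

```python
class CollectionStats:
    """Running totals of the kind a statistics window might display."""

    def __init__(self, registered_words):
        self.registered = set(registered_words)   # words already in the dictionary
        self.article_count = 0
        self.vocabulary = set()                   # distinct words seen so far
        self.unregistered = set()                 # distinct words not yet registered

    def add_article(self, tokens):
        """Update the totals with one tokenized article."""
        self.article_count += 1
        self.vocabulary.update(tokens)
        self.unregistered.update(t for t in tokens if t not in self.registered)

    def summary(self):
        return {
            "articles": self.article_count,
            "total_vocabulary": len(self.vocabulary),
            "unregistered_words": len(self.unregistered),
        }


stats = CollectionStats(registered_words={"a", "b"})
stats.add_article(["a", "c"])
stats.add_article(["b", "c", "d"])
```

Keeping one `CollectionStats` object across collection runs gives the cumulative figures, while creating a fresh one per run gives per-run figures.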

With this GUI screen structure, when the user selects on the screen the broadcast news or newspaper articles of a specific date to collect and clicks the OK button, the service proceeds in the order shown in FIG. 4. The GUI screen of FIG. 5 can be extended to cover additional broadcasters and newspapers according to user needs. Likewise, depending on the broadcast time of the news to be recognized, the broadcast news to be collected can be extended to dawn, morning, noon, afternoon, and evening broadcasts.
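The selections made on the GUI screen can be thought of as a collection request. The sketch below models such a request and enumerates the items it implies; the class, its fields, and the slot labels are illustrative assumptions and do not reproduce the actual screen or any real website interface.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Time slots named in the description; the English labels are illustrative.
SLOTS = ("dawn", "morning", "noon", "afternoon", "evening")


@dataclass
class CollectionRequest:
    media: list            # names of the selected newspapers or broadcasters
    start: date            # first day to collect (inclusive)
    end: date              # last day to collect (inclusive)
    slots: tuple = SLOTS   # broadcast time slots to collect

    def targets(self):
        """Yield one (medium, day, slot) tuple per item to be collected."""
        for offset in range((self.end - self.start).days + 1):
            day = self.start + timedelta(days=offset)
            for medium in self.media:
                for slot in self.slots:
                    yield medium, day, slot


request = CollectionRequest(
    media=["Broadcaster A"],                 # placeholder name
    start=date(2003, 12, 16),
    end=date(2003, 12, 17),
    slots=("morning", "evening"),
)
collection_targets = list(request.targets())
```

Each yielded tuple would then be handed to whatever downloader fetches the corresponding article; that part depends on the individual websites and is not sketched here.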

Hereinafter, the method of real-time article collection and online language model construction according to the present invention is described with reference to FIG. 4.

For real-time article collection, the user selects the newspaper articles or broadcast news, the dates to collect, and so on in the menu windows of the GUI screen of FIG. 5, which is shown on the monitor of the article collection system 10 (S41).

To collect the articles of the selected media that meet these conditions, the system accesses the websites that provide the articles and downloads them (S42).

When the articles have been downloaded in step S42 and collection is complete, the article collection system 10 converts the collected articles into text. A text corpus with a low error rate is essential for building a more robust language model for the broadcast news speech recognition system 12. The article collection system 10 therefore converts English, numerals, etc. in the broadcast news transcripts into Hangul, corrects numeral-reading, spacing, and spelling errors, performs pseudo-morpheme tagging on the collected articles, and passes the result to the language model generation system 11 (S43-S45).

When pseudo-morpheme tagging of the collected articles is complete in step S45, the language model generation system 11 creates a vocabulary dictionary for the newly collected article corpus (S46), generates a language model for the newly collected articles (S47), and creates a new vocabulary dictionary by merging the vocabulary dictionary of the new article corpus with that of the existing corpus (S48).

The vocabulary dictionary is built by adding words in descending order of their frequency in the corpus. In the merged vocabulary dictionary, nouns that appear in the newly collected corpus (the latest article corpus) are preferably included in preference to words from the existing corpus.
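One way to realize this priority rule is sketched below: nouns from the new corpus are placed first, and the remaining slots are filled from the existing frequency-ranked vocabulary. The function name and the size limit are assumptions, and the source does not say how non-noun words from the new corpus are ranked, so this sketch leaves them out.

```python
def merge_vocabularies(new_nouns, old_ranked, max_size=64000):
    """Merge two word lists, giving nouns from the new corpus priority.

    new_nouns:  nouns from the latest article corpus, most frequent first
    old_ranked: existing vocabulary, most frequent first
    """
    merged = []
    seen = set()
    for word in list(new_nouns) + list(old_ranked):
        if len(merged) >= max_size:
            break
        if word not in seen:        # skip words already placed
            seen.add(word)
            merged.append(word)
    return merged


# "x" is a new noun; "a" appears in both lists and is kept once.
merged_vocab = merge_vocabularies(["x", "a"], ["a", "b", "c"], max_size=3)
```

With a fixed dictionary size, every new noun admitted this way displaces the lowest-ranked word of the existing vocabulary.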

When the vocabulary dictionary is complete in step S48, the existing large-scale language model and the newly built language model are interpolated using the vocabulary dictionary (S49), and the speech recognition system 12 then loads the interpolated language model (S50).

Although the embodiment is described with reference to a particular broadcast, as in the GUI screen of FIG. 5, it can be extended to continuously collect broadcast news from various time slots and apply it to the speech recognition system.

That is, within a single day a broadcaster schedules news in various time slots, such as dawn, morning, noon, afternoon, and evening broadcasts, and some of these programs repeat the same news items. Therefore, when collecting broadcast news to generate an online language model, the collection targets can be extended to include not only the broadcast of a particular time slot but also the available broadcast articles from earlier slots (for example, the dawn and noon broadcasts).

In addition, an option can be provided to collect articles daily, weekly, or over a specified period. The user can then download the most recent broadcast scripts from the Internet in real time as of the moment the recognizer is used, reinforce the broadcast corpus, and build the language model to obtain a robust speech recognition environment. Those skilled in the art can readily implement this from the embodiment of the present invention.

As described above, the present invention collects each day's news on events and information so that words describing newly occurring events can be reflected at any time in the vocabulary dictionary and language model of the speech recognition system, allowing a broadcast news speech recognition system to keep performing robustly in practice. By providing a service that collects newspaper articles or broadcast news over the Internet in real time and uses them to build a language model, it gives users of the speech recognition system a language model that covers the latest news. This in turn suppresses the occurrence of out-of-vocabulary words and stabilizes recognition performance.

What has been described above is only one embodiment of the user-friendly real-time article collection system and online language model construction method according to the present invention. The present invention is not limited to this embodiment. Its technical spirit extends to the range within which anyone of ordinary skill in the art to which the invention pertains can make various modifications without departing from the gist of the invention as claimed in the following claims.

Claims

A real-time article collection system based on a GUI environment, the system providing a GUI screen suited to collecting articles provided by news media so that a language model can be generated from the collected articles and, in conjunction with a speech recognition system, spoken news can be converted into text, wherein the GUI screen comprises:

a menu window for selecting a plurality of news media and setting the type of articles provided by the selected media and the start/end dates and time slot of the articles, whereby the corresponding websites are accessed automatically and the articles are collected in real time; and

a statistics window for automatically computing and displaying, based on the collected articles, the total vocabulary size, the number of articles, and the number of unregistered words found by comparison with the already registered words,

and wherein the system converts English, numerals, etc. in the collected articles into Hangul, corrects numeral-reading, spacing, and spelling errors to produce text, performs pseudo-morpheme tagging, and transmits the result to a system for generating a language model.

The real-time article collection system of claim 1, wherein the statistics window provided on the GUI screen further comprises a cumulative statistics window for recording the cumulative total vocabulary size, number of articles, and number of unregistered words.

A real-time online language model construction service method based on a GUI environment, using the system of claim 1 to build a language model from articles collected by accessing the websites of news media, the method comprising:

accessing the websites of the selected news media and downloading articles in real time, by setting the news media (newspapers, broadcasters, etc.) to collect from and the articles to be collected from those media;

a text conversion step of converting English, numerals, etc. contained in the collected articles into Hangul;

tagging pseudo-morphemes in the newly collected article corpus;

building a vocabulary dictionary, a language model, and a pronunciation dictionary for the newly collected articles;

creating a new merged vocabulary dictionary by merging the vocabulary dictionary of the latest article corpus with that of the existing corpus, with nouns that appear in the newly collected corpus included in preference to words from the existing corpus; and

interpolating the existing language model with the newly built language model and transmitting the result to the speech recognition system.

delete