KR20160060915A

KR20160060915A - Method for creating language model based on root and language processing apparatus thereof

Info

Publication number: KR20160060915A
Application number: KR1020140163104A
Authority: KR
Inventors: 김영준
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2014-11-21
Filing date: 2014-11-21
Publication date: 2016-05-31

Abstract

The present invention relates to a method for generation of a root-based language model and a language processing apparatus for the same. In particular, the method and the language processing apparatus are configured to extract each sentence, included in voice data, in a unigram form through language processing, analyze morphemes for each sentence in a unigram form, perform clustering based on roots of the morphemes, and match a cluster in an N-gram form, thereby overcoming a data shortage problem as in a class-based language model. Furthermore, the method and the language processing apparatus can be modified and utilized in such a way that using a linguistic relationship, an object which can be placed in front of a verb is mainly extracted in the case of the verb and the following relationship is extracted in the case of a noun having a postposition, thereby improving accuracy in language extraction. Furthermore, a new noun which is not present in a language model for voice recognition can be added in an N-gram form incorporating various expressions of a sentence, thereby providing higher voice recognition performance for a proper noun than that of a method of simply adding only the proper noun.

Description

Root-based language model generation method and language processing apparatus for the same

The present invention relates to speech recognition technology and, more particularly, to a root-based language model generation method that automatically generates a statistical language model using the root of each word phrase during speech recognition and natural language processing, and to a language processing apparatus for the same.

Efforts to make the interface between machines and people more convenient and natural through speech recognition technology have progressed steadily in Korea and abroad, and as a result the technology has advanced beyond simple word recognition to the point where naturally uttered speech can be processed. Thanks to continuous development in the latter half of the 20th century, speech recognition has reached a level at which it can be used in everyday life in various fields, but many technical problems remain to be solved before it can be actively used in the many applications we imagine.

Recently, efforts to provide users with more convenient services by using such natural-language speech recognition technology have been under way in Korea as well, led by telecommunications carriers and financial institutions. On smartphones, too, speech recognition support is becoming an essential function of wireless mobile devices, in order to ease the difficulty of input through a limited keyboard.

In general, speech recognition methods are divided into several types according to the form of utterance, including isolated word recognition, connected word recognition, continuous speech recognition, and keyword spotting. Unlike isolated word recognition, which recognizes individual words, continuous speech recognition finds the sentence or sequence of words that corresponds to a speech signal. As the number of words in the lexicon increases, the number of word sequences that can form a sentence increases greatly, and because of pronunciation variation between words, the more words there are, the higher the probability that a word is misrecognized as a similarly pronounced word.

In particular, a language model in speech recognition is a model built by statistically collecting the connectivity between words from a text corpus, so that a sentence uttered by the user is recognized as a correct sentence. Unigrams, bigrams, and trigrams are widely used in language models. A unigram uses the probability of the word itself and does not use the words that precede it. Bigrams and trigrams use probabilities that depend on the one and two immediately preceding words, respectively. Using such a language model allows grammatically valid word sequences to be recognized and minimizes the search space of words or sentences, which improves recognition performance and shortens search time.
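The unigram, bigram, and trigram probabilities described above can be illustrated with a minimal sketch. It assumes plain maximum-likelihood estimation from counts with no smoothing; the toy corpus and the function names `ngram_counts` and `mle_probability` are hypothetical and do not come from the specification.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_probability(tokens, history, word):
    """Maximum-likelihood estimate of P(word | history).

    history is a tuple holding 0 (unigram), 1 (bigram),
    or 2 (trigram) preceding words.
    """
    n = len(history) + 1
    if n == 1:
        # Unigram: relative frequency of the word, no context used.
        return Counter(tokens)[word] / len(tokens)
    numerator = ngram_counts(tokens, n)[history + (word,)]
    denominator = ngram_counts(tokens, n - 1)[history]
    return numerator / denominator if denominator else 0.0

# Hypothetical toy corpus (12 tokens).
corpus = "i want to eat i want to sleep i need to eat".split()
p_uni = mle_probability(corpus, (), "want")             # count(want) / 12
p_bi = mle_probability(corpus, ("want",), "to")         # count(want to) / count(want)
p_tri = mle_probability(corpus, ("want", "to"), "eat")  # count(want to eat) / count(want to)
```

Any word sequence that never occurs in the corpus receives probability zero under this estimate, which is the data shortage problem the following paragraphs address.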

However, when a statistical language model of the kind widely used in speech recognition and natural language processing is constructed, data shortage can occur, and when the target is variant language used on the web or colloquial language, the data shortage problem becomes more severe and degrades speech recognition performance.

Korean Registered Patent No. 10-0511247 B1, registered on August 23, 2005 (Title: Language modeling method in a speech recognition system)

Constructing a statistical language model of the kind widely used in conventional speech recognition and natural language processing can cause data shortage, and when the target is variant language used on the web or colloquial language, the data shortage problem becomes more severe and degrades speech recognition performance. To solve this problem, an object of the present invention is to provide a root-based language model generation method, and a language processing apparatus for the same, that language-process each sentence included in speech data to extract it in unigram form, analyze the morphemes of each sentence in unigram form, cluster around each root of the morphemes, and match the clusters in N-gram form.

To achieve the above object, the root-based language model generation method according to an embodiment of the present invention provides a computer-readable recording medium storing a program that executes the steps of: a language processing apparatus collecting at least one piece of speech data; the language processing apparatus separating the collected speech data into sentences and language-processing each separated sentence to extract it in unigram form; the language processing apparatus analyzing morphemes for each extracted sentence in unigram form; the language processing apparatus forming clusters around each root of the analyzed morphemes; and the language processing apparatus matching the formed clusters in N-gram form.

In addition, in the root-based language model generation method according to the present invention, there is provided a computer-readable recording medium storing a program in which the collecting step executes the steps of: the language processing apparatus collecting speech data including at least one of natural language processing, speech recognition, variant language, and colloquial language; and the language processing apparatus converting the collected speech data into text.

In addition, in the root-based language model generation method according to the present invention, there is provided a computer-readable recording medium storing a program in which the step of extracting in unigram form executes the language processing apparatus language-processing each sentence in corpus units or in word-spacing units.

In addition, in the root-based language model generation method according to the present invention, there is provided a computer-readable recording medium storing a program in which the step of analyzing morphemes executes the steps of: the language processing apparatus separating, in each language-processed sentence, the root from at least one of a postposition, an affix, a pre-final ending, and a final ending; and the language processing apparatus storing the separated roots classified by part of speech.

In addition, in the root-based language model generation method according to the present invention, there is provided a computer-readable recording medium storing a program in which the step of forming clusters executes the language processing apparatus clustering on the basis of the central part that carries the substantive meaning.

In addition, in the root-based language model generation method according to the present invention, there is provided a computer-readable recording medium storing a program in which the step of matching in N-gram form executes the steps of: the language processing apparatus identifying the clusters formed on the basis of each root; and the language processing apparatus combining the root in each cluster with the information obtained through morphological analysis to form an N-gram.

In addition, in the root-based language model generation method according to the present invention, there is provided a computer-readable recording medium storing a program that executes, after the step of matching in N-gram form, the step of the language processing apparatus automatically generating a language model based on the result matched in N-gram form.

A language processing apparatus according to an embodiment of the present invention comprises: a language collection module that collects at least one piece of speech data; and a language processing module that separates the speech data collected through the language collection module into sentences, language-processes each separated sentence to extract it in unigram form, analyzes morphemes for each extracted sentence in unigram form, forms clusters around each root of the analyzed morphemes, and matches the formed clusters in N-gram form.

According to the present invention, each sentence included in speech data is language-processed and extracted in unigram form, the morphemes of each sentence in unigram form are analyzed, and the roots of the morphemes are clustered and matched in N-gram form, so that the data shortage problem can be solved in a manner similar to a class-based language model.
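For reference, a class-based bigram model is commonly written as the approximation below, where c(w) denotes the class assigned to word w. Reading c(w) as the root cluster of w is an interpretation assumed here for illustration; the specification does not state this formula.

```latex
P(w_i \mid w_{i-1}) \approx P\bigl(c(w_i) \mid c(w_{i-1})\bigr)\, P\bigl(w_i \mid c(w_i)\bigr)
```

Because many surface forms share one class, the class-to-class term is estimated from far more observations than any individual word pair, which is how such models ease data shortage.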

In addition, linguistic relationships can be used in a modified manner, so that for a verb mainly the objects that can precede it are extracted, and for a noun with a postposition attached the relationship that follows it is extracted, thereby improving the accuracy of language extraction.
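One possible reading of the verb case can be sketched as follows. The sketch assumes that morphological analysis has already produced (root, suffix, part-of-speech) triples, and that an object is a noun carrying the object postposition 을 or 를 immediately before the verb; the tag names and the function `verb_object_pairs` are hypothetical.

```python
OBJECT_PARTICLES = ("을", "를")  # Korean object postpositions

def verb_object_pairs(sentence):
    """Pair each verb root with an object noun root directly before it.

    sentence is a list of (root, suffix, pos) triples; the pos labels
    "noun" and "verb" are illustrative, not from the specification.
    """
    pairs = []
    for prev, cur in zip(sentence, sentence[1:]):
        is_object = prev[2] == "noun" and prev[1] in OBJECT_PARTICLES
        if cur[2] == "verb" and is_object:
            pairs.append((prev[0], cur[0]))
    return pairs

# "나는 밥을 먹는다" (I eat rice), analyzed by hand for illustration.
sentence = [("나", "는", "noun"), ("밥", "을", "noun"), ("먹", "는다", "verb")]
pairs = verb_object_pairs(sentence)
```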

In addition, since a new noun that is not present in the language model for speech recognition can be added in N-gram form reflecting the various expressions of sentences, higher speech recognition performance for proper nouns can be obtained than with a method that simply adds only the proper noun.

FIG. 1 is a block diagram showing the configuration of a language processing apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a root-based language model generation method according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram illustrating a root-based language model generation method according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In the following description and the accompanying drawings, however, detailed descriptions of well-known functions or configurations that could obscure the gist of the present invention are omitted. Note also that the same components are denoted by the same reference numerals wherever possible throughout the drawings.

The terms and words used in this specification and the claims described below should not be construed as limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe his or her own invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiment of the present invention and do not represent all of its technical ideas, so it should be understood that various equivalents and modifications that could replace them may exist at the time of filing this application.

In the following, the language processing apparatus according to an embodiment of the present invention is described using, as a representative example, a mobile communication terminal that is connected to a communication network, automatically generates a root-based language model, and can apply the language model. The terminal is not limited to a mobile communication terminal, however, and the invention can be applied to various terminals such as any information and communication device, multimedia terminal, wired terminal, fixed terminal, or IP (Internet Protocol) terminal. The terminal can also be used to advantage when it is a mobile terminal with various mobile communication specifications, such as a mobile phone, PMP (Portable Multimedia Player), MID (Mobile Internet Device), smartphone, desktop, tablet PC, notebook, netbook, or other information and communication device.

In addition, a processor mounted in the language processing apparatus according to the present invention can process program instructions for executing the method according to the present invention. In one implementation, the processor may be a single-threaded processor, and in another implementation it may be a multi-threaded processor. Furthermore, the processor can process instructions stored in a memory or on a storage device.

The language processing apparatus that generates a root-based language model according to the embodiment of the present invention described above will now be described.

FIG. 1 is a block diagram showing the configuration of a language processing apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the language processing apparatus 100 according to the present invention includes a control unit 10, an input unit 20, a display unit 30, a storage unit 40, an audio processing unit 50, and a communication unit 60. Here, the control unit 10 includes a language collection module 11 and a language processing module 12, and the storage unit 40 includes speech data 41.

The input unit 20 receives various information such as numeric and character information, and delivers to the control unit 10 the signals that are input in connection with setting various functions and controlling the functions of the language processing apparatus 100. The input unit 20 may include at least one of a keypad and a touchpad that generate an input signal in response to a user's touch or operation. The input unit 20 may be combined with the display unit 30 in the form of a single touch panel (or touch screen) so that input and display functions are performed together. Besides input devices such as a keyboard, keypad, mouse, and joystick, any form of input means developed in the future may be used as the input unit 20. In particular, the input unit 20 according to the present invention detects the series of input signals that arise during root-based language model generation and delivers them to the control unit 10.

The display unit 30 displays information on the series of operating states and operation results that arise while the language processing apparatus 100 performs its functions. The display unit 30 can also display the menus of the language processing apparatus 100 and user data entered by the user. The display unit 30 may be implemented as a liquid crystal display (LCD), a thin film transistor LCD (TFT-LCD), a light emitting diode (LED) display, an organic LED (OLED) display, an active matrix OLED (AMOLED) display, a Retina display, a flexible display, a three-dimensional (3D) display, or the like. When the display unit 30 is implemented as a touch screen, it may perform some or all of the functions of the input unit 20. In particular, the display unit 30 according to the present invention outputs all screen information generated during root-based language model generation.

The storage unit 40 is a device for storing data. It includes a main memory device and an auxiliary memory device, and stores the application programs required for the functional operation of the language processing apparatus 100. The storage unit 40 may broadly include a program area and a data area. When the language processing apparatus 100 activates a function in response to a user's request, it executes the corresponding application programs under the control of the control unit 10 to provide that function. In particular, the storage unit 40 according to the present invention stores an operating system that boots the language processing apparatus 100, a program that collects speech data, a program that separates speech data into sentences, a program that language-processes each sentence to extract it in unigram form, a program that analyzes the morphemes of each sentence in unigram form, a program that forms clusters around each root of the morphemes, a program that matches the clusters in N-gram form, and the like. The storage unit 40 also stores speech data input from outside and speech data received from other devices.

The audio processing unit 50 includes a speaker (SPK) for reproducing and outputting audio signals, and delivers audio signals input from a microphone (MIC) to the control unit 10. The audio processing unit 50 can convert an analog audio signal input through the microphone into digital form and deliver it to the control unit 10. It can also convert a digital audio signal output from the control unit 10 into an analog signal and output it through the speaker. In particular, the audio processing unit 50 according to the present invention outputs sound effects or execution sounds generated during root-based language model generation.

The communication unit 60 transmits and receives data to and from other devices over a communication network (not shown). The communication unit 60 includes RF transmitting means that up-converts and amplifies the frequency of a signal to be transmitted, RF receiving means that low-noise amplifies a received signal and down-converts its frequency, and the like. The communication unit 60 may include at least one of a wireless communication module (not shown) and a wired communication module (not shown). The wired communication module transmits and receives data by wire. In particular, the communication unit 60 according to the present invention collects speech data in cooperation with other external devices.

Here, the communication network performs a series of data transmission and reception operations for data transfer and information exchange between the language processing apparatus 100 and other devices. Various types of communication networks may be used, for example wireless communication schemes such as wireless LAN (WLAN), Wi-Fi, WiBro, WiMAX, and High Speed Downlink Packet Access (HSDPA), or wired communication schemes such as Ethernet, xDSL (ADSL, VDSL), Hybrid Fiber Coax (HFC), Fiber to the Curb (FTTC), and Fiber to the Home (FTTH). The communication network is not limited to the schemes listed above and may include any other communication scheme that is widely known or developed in the future.

The control unit 10 may be a processing device that runs an operating system (OS) and drives each component. In particular, the control unit 10 according to the present invention collects speech data.

The control unit 10 separates the collected speech data into sentences and language-processes each separated sentence to extract it in unigram form. The control unit 10 then analyzes the morphemes of each extracted sentence in unigram form.

The control unit 10 forms clusters around each root of the analyzed morphemes and matches the formed clusters in N-gram form. The N-gram form may include, for example, a bigram form or a trigram form. The control unit 10 then automatically generates a language model based on the result matched in N-gram form.

To perform these functions of the language processing apparatus 100 more effectively, the control unit 10 is composed of a plurality of modules, which include the language collection module 11 and the language processing module 12.

The language collection module 11 collects and manages speech data input from outside and speech data received from other devices. Here, the language collection module 11 collects speech data including natural language processing, speech recognition, variant language, colloquial language, and the like, and converts the collected speech data into text.

The language collection module 11 separates the actual speech from noise in the collected speech data. That is, because speech data collected from outside is mixed with ambient noise, a noise removal process is performed to make speech recognition reliable. For example, to remove noise from the collected speech data, the language collection module 11 may apply a discrimination technique based on forward search, an estimation technique based on psychoacoustics, a removal technique based on improved spectral subtraction, or the like.
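The specification names improved spectral subtraction without giving details. As a rough illustration of the basic idea only, the sketch below subtracts a noise magnitude estimate from one frame's magnitude spectrum and clamps each bin to a small spectral floor. The function name, the parameters `alpha` and `floor`, and the sample values are all assumptions made for illustration.

```python
def spectral_subtract(noisy_mag, noise_mag, alpha=1.0, floor=0.01):
    """Basic magnitude spectral subtraction for a single frame.

    noisy_mag: magnitude spectrum of the noisy frame (one value per bin)
    noise_mag: estimated noise magnitude for the same bins
    alpha:     over-subtraction factor (hypothetical default)
    floor:     fraction of the noisy magnitude kept as a lower bound,
               so no bin becomes negative
    """
    return [max(y - alpha * n, floor * y) for y, n in zip(noisy_mag, noise_mag)]

# Hypothetical four-bin frame and a flat noise estimate.
frame = [1.0, 0.5, 0.2, 2.0]
noise = [0.25, 0.25, 0.25, 0.25]
cleaned = spectral_subtract(frame, noise)
```

In a full system the magnitudes would come from a short-time Fourier transform of the recording and the noise estimate from frames judged to contain no speech; both steps are omitted here.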

The language processing module 12 separates the speech data collected through the language collection module 11 into sentences and language-processes each separated sentence to extract it in unigram form. Here, the language processing module 12 language-processes each sentence in corpus units or in word-spacing units. A corpus contains text in a single language or in several languages, and can be used to perform statistical analysis and hypothesis testing, or to check the occurrence of linguistic rules within a particular language domain and validate those rules. For example, a corpus may also include part-of-speech annotations (verb, noun, adjective, and so on) for each word.
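The sentence separation and word-spacing-unit processing described above can be sketched as follows, assuming the speech has already been converted to text and that sentences end with `.`, `?`, or `!`. The function names and the sample text are hypothetical.

```python
import re

def split_sentences(text):
    """Split text after sentence-final punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

def to_unigrams(sentence):
    """Drop sentence-final punctuation and split on word spacing."""
    return sentence.rstrip(".?!").split()

# Hypothetical transcribed text: "I go to school. You like school."
text = "나는 학교에 간다. 너는 학교를 좋아한다."
sentences = split_sentences(text)
unigrams = [to_unigrams(s) for s in sentences]
```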

The language processing module 12 analyzes the morphemes of each extracted sentence in unigram form. Here, the language processing module 12 separates, in each language-processed sentence, the root from at least one of a postposition, an affix, a pre-final ending, and a final ending, and stores the separated roots classified by part of speech. That is, the language processing module 12 analyzes each word into one morpheme or several morphemes, a morpheme being the smallest unit that carries meaning within a language and that loses its meaning if it is analyzed any further.
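A real system would rely on a full morphological analyzer. Purely to illustrate the separation step, the sketch below strips a postposition from the end of each word using a small hand-written list, then stores the roots under a coarse label. The list `PARTICLES`, the labels "nominal" and "other", and both function names are assumptions; the heuristic is naive and will wrongly split nouns that merely end in one of these syllables.

```python
# Small illustrative list of Korean postpositions (not exhaustive).
PARTICLES = ("에서", "에게", "은", "는", "이", "가", "을", "를", "에", "의", "도")

def split_root(eojeol):
    """Return (root, particle); particle is '' when nothing matches.

    Longer particles are tried first so that '에서' wins over '에'.
    """
    for p in sorted(PARTICLES, key=len, reverse=True):
        if eojeol.endswith(p) and len(eojeol) > len(p):
            return eojeol[:-len(p)], p
    return eojeol, ""

def store_by_pos(eojeols):
    """Group roots under a coarse label: a word that carried a
    postposition is assumed to be nominal, anything else is 'other'."""
    store = {}
    for e in eojeols:
        root, particle = split_root(e)
        label = "nominal" if particle else "other"
        store.setdefault(label, []).append(root)
    return store

store = store_by_pos(["나는", "학교에", "간다"])
```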

Morphemes can also be classified into free morphemes, bound morphemes, lexical morphemes (content morphemes), and grammatical morphemes (formal morphemes). For example, a free morpheme is a morpheme that can be used on its own, typically one that makes up a noun or some adverbs, while a bound morpheme is a morpheme that cannot be used on its own; parts of speech composed of free and bound morphemes include morphemes, determiners, and verbs. A lexical morpheme is a morpheme that clarifies the grammatical relationship between morphemes and the meaning of other morphemes and constituents. In most cases it has an independent meaning, although it sometimes cannot stand alone, and it includes nouns, verbs, adjectives, and adverbs. A grammatical morpheme is a morpheme with a grammatical meaning that combines morphemes to form a meaningful linguistic expression, and it includes endings, affixes, and postpositions.

The language processing module 12 forms clusters centered on each root of the analyzed morphemes. Here, the language processing module 12 clusters on the basis of the central part that carries the substantive meaning. That is, the language processing module 12 extracts the root, which is the meaning-bearing part. The root is what results when a word is divided according to where its central meaning lies: it is the part of the word that carries the central meaning, and it is used together with affixes, which carry secondary meaning. With the exception of a few words, a root is indispensable in forming a word. A word consisting of a single root is called a simple word, a word formed by combining two or more roots is called a compound word, and a word formed by attaching an affix to a root is called a derived word.

The language processing module 12 matches the clusters in N-gram form. Here, the language processing module 12 identifies the clusters built on each root and combines the roots in the clusters with the information obtained through morphological analysis to form N-grams. The language processing module 12 then automatically generates a language model based on the result of the N-gram matching.

Meanwhile, the memory mounted in the language processing apparatus 100 stores information within the apparatus. In one implementation, the memory is a computer-readable medium. In one implementation, the memory may be a volatile memory unit, and in another implementation, the memory may be a non-volatile memory unit. In one implementation, the storage device is a computer-readable medium. In various different implementations, the storage device may include, for example, a hard disk device, an optical disk device, or some other mass storage device.

Although this specification and the drawings describe an exemplary apparatus configuration, implementations of the functional operations and subject matter described in this specification may be realized in other types of digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification may be realized as one or more computer program products, that is, one or more modules of computer program instructions encoded on a tangible program storage medium for execution by, or to control the operation of, the apparatus according to the present invention. The computer-readable medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter affecting a machine-readable propagated signal, or a combination of one or more of them.

FIG. 2 is a flowchart illustrating a root-based language model generation method according to an embodiment of the present invention, and FIG. 3 is an exemplary diagram illustrating the root-based language model generation method according to an embodiment of the present invention.

Referring to FIGS. 2 and 3, in the root-based language model generation method according to the present invention, the language processing apparatus 100 collects and manages, in step S11, voice data input from outside and voice data received from other devices. Here, the language collection module 11 collects voice data including natural language processing, speech recognition, modified language, colloquial speech, and the like, and converts the collected voice data into text.

In step S13, the language processing apparatus 100 separates the collected voice data into sentences and language-processes each separated sentence to extract it in unigram form. Here, the language processing apparatus 100 processes each sentence in units of a corpus or in units of word spacing. A corpus contains text in a single language or in several languages and can be used to perform statistical analysis and hypothesis testing, or to check the occurrence of linguistic rules within a specific language domain and validate those rules. For example, a corpus may also include part-of-speech tags (verb, noun, adjective, etc.) for each word.
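The sentence separation and spacing-unit extraction of step S13 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `split_sentences` and `extract_unigrams`, the punctuation-based sentence boundary rule, and the sample text are all assumptions introduced here, and the input is assumed to be text already converted from voice data.

```python
import re

def split_sentences(text):
    """Split transcribed text into sentences.

    Assumption: a sentence ends at '.', '?' or '!' followed by whitespace.
    """
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

def extract_unigrams(sentence):
    """Return the spacing-unit tokens of one sentence, without punctuation."""
    tokens = (tok.strip(".?!,") for tok in sentence.split())
    return [tok for tok in tokens if tok]

# Hypothetical transcribed text used only for illustration.
text = "어제 신청을 했었다. 오늘 통화 하는데 문제가 없다."
sentences = split_sentences(text)
unigrams = [extract_unigrams(s) for s in sentences]
```

Splitting on whitespace corresponds to the spacing-unit option mentioned above; a corpus-driven tokenizer would replace `extract_unigrams` without changing the later steps.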

In step S15, the language processing apparatus 100 analyzes the morphemes of each sentence extracted in unigram form. Here, the language processing apparatus 100 separates, in each language-processed sentence, the root from at least one of a postposition, an affix, a pre-final ending, or a final ending, and stores the separated roots classified by part of speech. That is, the language processing apparatus 100 analyzes each word into one morpheme or several morphemes, where a morpheme is the smallest unit of a language that carries meaning and loses that meaning if analyzed any further.

In addition, morphemes can be classified into free morphemes, bound morphemes, lexical morphemes (content morphemes), and grammatical morphemes (formal morphemes). For example, a free morpheme is a morpheme that can be used on its own, typically one that forms a noun or some adverbs; a bound morpheme is a morpheme that cannot be used on its own; and the parts of speech composed of free and bound morphemes include morphemes, adnominals, and verbs. A lexical morpheme is a morpheme that clarifies the grammatical relations between morphemes and the meaning of other morphemes and constituents; in most cases it has an independent meaning, although in some cases it cannot stand alone, and it includes nouns, verbs, adjectives, adverbs, and the like. A grammatical morpheme is a morpheme with grammatical meaning that combines morphemes into meaningful linguistic expressions, and it includes endings, affixes, and postpositions.

In step S17, the language processing apparatus 100 forms clusters centered on each root of the analyzed morphemes. Here, the language processing apparatus 100 clusters on the basis of the central part that carries the substantive meaning. That is, the language processing apparatus 100 extracts the root, which is the meaning-bearing part. The root is what results when a word is divided according to where its central meaning lies: it is the part of the word that carries the central meaning, and it is used together with affixes, which carry secondary meaning. With the exception of a few words, a root is indispensable in forming a word. A word consisting of a single root is called a simple word, a word formed by combining two or more roots is called a compound word, and a word formed by attaching an affix to a root is called a derived word.

In step S19, the language processing apparatus 100 matches the clusters in N-gram form. Here, the language processing apparatus 100 identifies the clusters built on each root and combines the roots in the clusters with the information obtained through morphological analysis to form N-grams. The language processing apparatus 100 then automatically generates a language model based on the result of the N-gram matching.
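The specification does not state how probabilities are estimated once the N-grams have been formed. As one possible reading, the sketch below assumes plain maximum-likelihood estimation over bigrams, P(w2 | w1) = count(w1, w2) / count(w1). The function name `bigram_mle` and the sample bigram list are hypothetical and serve only to illustrate how a statistical language model could be derived from matched bigrams.

```python
from collections import Counter

def bigram_mle(bigrams):
    """Estimate P(w2 | w1) by maximum likelihood from a list of (w1, w2) pairs.

    Assumption: no smoothing or back-off is applied; a production language
    model would normally add one of those.
    """
    pair_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)
    return {
        (w1, w2): count / first_counts[w1]
        for (w1, w2), count in pair_counts.items()
    }

# Hypothetical bigrams, as they might look after cluster-based matching.
bigrams = [
    ("신청", "했었다"),
    ("신청", "했었고"),
    ("신청", "했었는데"),
    ("신청", "하는데"),
    ("해지", "했었다"),
]
probs = bigram_mle(bigrams)
```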

That is, as shown in FIG. 3, the language processing apparatus 100 extracts each sentence of the voice data in unigram form. For example, the language processing apparatus 100 may extract the sentences in unigram form as '했었다, 했었고, 했었는데, 하는데, 신청하구, 신청했고, 신청을, …', which are inflected forms of the verb 하 ("do") and forms built on the noun 신청 ("application").

The language processing apparatus 100 then analyzes the morphemes of each sentence extracted in unigram form. For example, the language processing apparatus 100 separates '했었다' into the morphemes '하(VV) + 었 + 다', '했었고' into '하(VV) + 었 + 다', '했었는데' into '하(VV) + 었 + 는데', and '하는데' into '하(VV) + 는 + 다'. The language processing apparatus 100 also separates '신청하구' into the morphemes '신청(NN) + 하 + 구', '신청했고' into '신청(NN) + 하 + 었 + 고', and '신청을' into '신청(NN) + 을'.
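The separation of a root from its endings, and the storage of roots by part of speech, can be sketched with a small lookup table. This is only an illustration: `TOY_LEXICON`, `Analysis`, `analyze`, and `roots_by_pos` are hypothetical names, and the hand-written table stands in for a real morphological analyzer, which the specification does not name. The entries reuse decompositions given in the example above.

```python
from typing import NamedTuple, Optional

class Analysis(NamedTuple):
    root: str        # meaning-bearing part of the word
    pos: str         # part-of-speech tag of the root (VV = verb, NN = noun)
    suffixes: tuple  # endings, affixes, and postpositions split off the root

# Hand-written stand-in for a morphological analyzer (assumption).
TOY_LEXICON = {
    "했었다": Analysis("하", "VV", ("었", "다")),
    "했었는데": Analysis("하", "VV", ("었", "는데")),
    "신청하구": Analysis("신청", "NN", ("하", "구")),
    "신청했고": Analysis("신청", "NN", ("하", "었", "고")),
    "신청을": Analysis("신청", "NN", ("을",)),
}

def analyze(token: str) -> Optional[Analysis]:
    """Return the analysis of a token, or None if the token is unknown."""
    return TOY_LEXICON.get(token)

def roots_by_pos(tokens):
    """Collect the roots of the known tokens, grouped by part of speech."""
    store = {}
    for token in tokens:
        analysis = analyze(token)
        if analysis is None:
            continue  # unknown tokens are skipped in this sketch
        store.setdefault(analysis.pos, set()).add(analysis.root)
    return store

stored = roots_by_pos(["했었다", "신청을", "신청했고"])
```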

The language processing apparatus 100 performs root-based clustering based on the result of the morphological analysis. For example, the language processing apparatus 100 forms the cluster '하(VV): 했었다, 했었고, 했었는데, 하는데, 하였고' and the cluster '신청(NN): 하구, 하었고, 을'.
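Root-based clustering amounts to grouping surface forms under the root they share. The following sketch assumes the morphological analysis has already produced (surface form, root, part of speech) triples; the `ANALYZED` list and the function name `cluster_by_root` are hypothetical.

```python
from collections import defaultdict

# Assumed output of the morphological analysis step: (surface, root, pos).
ANALYZED = [
    ("했었다", "하", "VV"),
    ("했었고", "하", "VV"),
    ("했었는데", "하", "VV"),
    ("하는데", "하", "VV"),
    ("신청하구", "신청", "NN"),
    ("신청했고", "신청", "NN"),
    ("신청을", "신청", "NN"),
    ("했었다", "하", "VV"),  # a repeated form is stored only once
]

def cluster_by_root(analyzed):
    """Group surface forms by (root, pos), keeping first-seen order."""
    clusters = defaultdict(list)
    for surface, root, pos in analyzed:
        members = clusters[(root, pos)]
        if surface not in members:
            members.append(surface)
    return dict(clusters)

clusters = cluster_by_root(ANALYZED)
```

Keying the cluster on both root and part of speech keeps a verb root and an identically spelled noun root in separate clusters.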

Thereafter, the language processing apparatus 100 matches the clusters in N-gram form. For example, the language processing apparatus 100 may match them in bigram form among the N-gram forms. That is, the language processing apparatus 100 bigram-matches '신청 했었다' with '신청 했었고, 신청 했었는데, 신청 하는데'; '해지 했었고' with '해지 했었는데, 해지 하는데, 해지 했었다'; '연결 했었는데' with '연결 했었다, 연결 했었고, 연결 하는데'; and '통화 하는데' with '통화 했었다, 통화 했었고, 통화 했었는데'. Here 신청, 해지, 연결, and 통화 are the nouns "application", "cancellation", "connection", and "call".
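The bigram matching in this example can be read as expanding one observed bigram to every other member of the verb's cluster. The sketch below follows that reading; `expand_bigrams` and `HA_CLUSTER` are hypothetical names, and the rule that a bigram whose second word lies outside the cluster is left unchanged is an assumption of this sketch.

```python
def expand_bigrams(observed, cluster):
    """Expand an observed (noun, verb form) bigram across a root cluster.

    If the verb form belongs to the cluster, return one bigram per cluster
    member; otherwise return the observed bigram alone.
    """
    noun, verb_form = observed
    if verb_form not in cluster:
        return [observed]
    return [(noun, form) for form in cluster]

# Cluster of surface forms sharing the root 하(VV), taken from the example.
HA_CLUSTER = ["했었다", "했었고", "했었는데", "하는데"]

expanded = expand_bigrams(("신청", "했었다"), HA_CLUSTER)
unchanged = expand_bigrams(("통화", "끊었다"), HA_CLUSTER)
```

With this expansion, a noun seen with only one inflected form of a verb still contributes bigrams for the other forms in the cluster, which is how the approach addresses the data shortage problem.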

Computer-readable media suitable for storing computer program instructions and data include, for example, magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and semiconductor memory such as ROM (Read Only Memory), RAM (Random Access Memory), flash memory, EPROM (Erasable Programmable ROM), and EEPROM (Electrically Erasable Programmable ROM). The processor and memory may be supplemented by, or incorporated in, special-purpose logic circuitry. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. Such hardware devices may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

This specification contains many specific implementation details, but these should not be construed as limiting the scope of any invention or of what may be claimed; rather, they should be understood as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments, separately or in any suitable subcombination. Moreover, although features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Meanwhile, the embodiments of the present invention disclosed in this specification and the drawings merely present specific examples to aid understanding and are not intended to limit the scope of the present invention. It will be apparent to those of ordinary skill in the art to which the present invention pertains that, in addition to the embodiments disclosed herein, other modifications based on the technical idea of the present invention can be practiced.

The present invention language-processes each sentence included in voice data to extract it in unigram form, analyzes the morphemes of each sentence in unigram form, clusters around each root of the morphemes, and matches the clusters in N-gram form. Accordingly, the present invention can solve the data shortage problem in a manner similar to a class-based language model. In addition, by exploiting linguistic relationships, the method can be adapted so that for a verb it mainly extracts the objects that can precede the verb, and for a noun with an attached postposition it extracts the relationship with what follows, thereby improving the accuracy of language extraction. Furthermore, because a new noun that is absent from the language model for speech recognition can be added in N-gram form reflecting the various expressions of a sentence, higher speech recognition performance for proper nouns can be obtained than with a method that simply adds the proper noun alone. Because the invention has sufficient potential for commercialization or sale and can clearly be practiced in reality, it has industrial applicability.

100: language processing apparatus 10: control unit
11: language collection module 12: language processing module
20: input unit 30: display unit
40: storage unit 41: voice data
50: audio processing unit 60: communication unit

Claims

A root-based language model generation method, comprising:
collecting, by a language processing apparatus, at least one item of voice data;
separating, by the language processing apparatus, the collected voice data into sentences, language-processing each of the separated sentences, and extracting each sentence in unigram form;
analyzing, by the language processing apparatus, morphemes of each sentence extracted in unigram form;
forming, by the language processing apparatus, clusters centered on each root of the analyzed morphemes; and
matching, by the language processing apparatus, the formed clusters in N-gram form.

The method of claim 1, wherein the collecting comprises:
collecting, by the language processing apparatus, voice data including at least one of natural language processing, speech recognition, modified language, and colloquial speech; and
converting, by the language processing apparatus, the collected voice data into text.

The method of claim 1, wherein the extracting in unigram form comprises:
language-processing, by the language processing apparatus, each sentence in units of a corpus or in units of word spacing.

The method of claim 1, wherein the analyzing of the morphemes comprises:
separating, by the language processing apparatus, a root from at least one of a postposition, an affix, a pre-final ending, or a final ending in each language-processed sentence; and
storing, by the language processing apparatus, the separated roots classified by part of speech.

The method of claim 1, wherein the forming of the clusters comprises:
clustering, by the language processing apparatus, on the basis of the central part that carries the substantive meaning.

The method of claim 1, wherein the matching in N-gram form comprises:
identifying, by the language processing apparatus, the clusters built on each root; and
forming, by the language processing apparatus, N-grams by combining the roots in the clusters with the information obtained through morphological analysis.

The method of claim 1, further comprising, after the matching in N-gram form:
automatically generating, by the language processing apparatus, a language model based on the result of the N-gram matching.

A language processing apparatus, comprising:
a language collection module that collects at least one item of voice data; and
a language processing module that separates the voice data collected through the language collection module into sentences, language-processes each of the separated sentences to extract it in unigram form, analyzes morphemes of each sentence extracted in unigram form, forms clusters centered on each root of the analyzed morphemes, and matches the formed clusters in N-gram form.