KR20180092836A

KR20180092836A - System and method for character boundary recognition

Info

Publication number: KR20180092836A
Application number: KR1020180011249A
Authority: KR
Inventors: 맹성현; 박주희
Original assignee: 한국과학기술원
Priority date: 2017-02-08
Filing date: 2018-01-30
Publication date: 2018-08-20
Also published as: KR102097545B1

Abstract

The present invention relates to a method by which a character boundary recognition system recognizes character boundaries. Nouns are located in a Korean sentence based on stored Korean postpositions and noun words, and a basic character boundary is marked on each noun found. If adjacent nouns in the marked sentence are stored among the noun words as a single noun, or if multiple nouns fall within a single phrase or clause according to the sentence structure, the nouns are merged into one compound noun and the character boundary is extended. In this way, the character boundaries of nouns and compound nouns in a Korean sentence can be recognized using a dictionary and syntactic analysis.

Description

System and method for character boundary recognition

The present invention relates to a character boundary recognition system and method.

An entity appearing in a Korean sentence may consist of several words separated by spaces. Conventionally, various methods have been used to identify the character boundaries of entities made up of one or more words, and of compound nouns, in Korean sentences.

For example, named entities can be recognized using back-off n-gram features. A training corpus containing named entities is generated from the morphemes and words of input sentences, back-off n-gram model features are extracted from that corpus, and a candidate entity is selected for each word of the input sentence to determine the final named entity. The back-off n-gram training features extracted by a feature extractor are then used to train an entropy model, which is stored in a statistics DB. This approach has the drawback that sentence features must be extracted using back-off n-gram features and a learning model must then be trained on them.

Another approach is automatic corpus construction. The proposed named entity recognition method includes: building a named entity dictionary containing unstructured personal information; checking the results of a named entity search that targets one or more of the dictionary's headwords and user-input words, and extracting one or more snippets per data characteristic; tagging the extracted snippets with the corresponding named entity to obtain named entity training data; and determining, based on that training data, a learning model for recognizing named entities of unstructured personal information. This approach must run only a single dictionary-based search and then use the result as training data for the learning model.

Accordingly, the present invention provides a system and method for recognizing the character boundaries of entities and compound nouns in Korean sentences using a dictionary and syntactic analysis.

One aspect of the present invention for achieving the above technical object is a method by which a character boundary recognition system recognizes character boundaries.

The method includes: finding nouns in a received Korean sentence based on stored Korean postpositions and noun words; marking a basic character boundary on each noun found; and, in the Korean sentence marked with basic character boundaries, if adjacent nouns are stored among the noun words as a single noun, or if a plurality of nouns are contained in one phrase or clause according to the sentence structure, merging the plurality of nouns into one compound noun to extend the character boundary.

The method may further include extending the noun range, in the Korean sentence whose character boundary has been extended, using the stored Korean postpositions.

Finding the nouns may include: extracting part-of-speech information from the Korean sentence and dividing the sentence into morpheme units that carry the extracted part-of-speech information; and decomposing the morpheme-segmented sentence into eojeol (space-delimited word) units based on the Korean postpositions and grouping them into phrase units.

Dividing into morpheme units may include reducing the part-of-speech sequence based on the extracted part-of-speech information and predefined rules.

Marking the basic character boundary may include marking basic character boundaries on the phrase-segmented Korean sentence, based on the nouns found in the sentence and the stored noun words.

Extending the character boundary may include: searching, in the Korean sentence marked with basic character boundaries, whether an extended word containing all the words between two adjacent nouns is stored as a noun word; and, if the extended word is stored as a noun word, merging the boundaries of the two nouns to extend the character boundary.

After the step of adding to the storage module, the method may further include extending the noun range in the boundary-extended Korean sentence based on the Korean postpositions.

Another aspect of the present invention for achieving the above technical object is a system for recognizing character boundaries. The system includes: a morphological analysis module that receives a Korean sentence and divides it into morpheme units; a noun detection module that converts the morpheme-segmented sentence into a phrase-unit sentence based on stored Korean postpositions and marks character boundaries on the phrase-unit sentence based on stored noun words; and a neighbor noun merging module that re-searches the stored noun words for nouns in the phrase-unit sentence containing the marked boundaries and then merges adjacent nouns to finalize the character boundaries.

The system may include a first dictionary module that stores the Korean postpositions and a second dictionary module that stores the noun words. The noun words may include nouns, compound nouns, neologisms, and unregistered headwords.

The system may further include a noun addition module that, on receiving a sentence whose character boundaries have been finalized by the neighbor noun merging module, adds the neologisms and unregistered headwords to the second dictionary module based on the punctuation marks in the sentence.

The system may further include a noun range extension module that receives a sentence whose character boundaries have been finalized by the neighbor noun merging module and extends noun boundaries using the Korean postpositions stored in the first dictionary module.

According to the present invention, the character boundaries of common nouns and compound nouns that appear without explicit delimiters in Korean sentences can be clearly identified, and the result can be applied to various natural language processing tasks.

FIG. 1 is a structural diagram of a character boundary recognition system according to an embodiment of the present invention.
FIG. 2 is a flowchart of a character boundary recognition method according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram of intermediate processing results for character boundary recognition according to an embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings so that those of ordinary skill in the art can readily practice the invention. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here. To describe the invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Hereinafter, a character boundary recognition system and method according to an embodiment of the present invention are described in detail with reference to the drawings. First, the structure of the character boundary recognition system is described with reference to FIG. 1. The system identifies the character boundaries of entities consisting of one or more words, and of compound nouns, in Korean sentences, using a large volume of dictionary headwords collected from the web together with syntactic analysis accompanied by part-of-speech (POS) tagging.

FIG. 1 is a structural diagram of a character boundary recognition system according to an embodiment of the present invention.

As shown in FIG. 1, the character boundary recognition system 100, which is driven by at least one processor, includes a morphological analysis module 110, a noun detection module 120, a neighbor noun merging module 130, a noun addition module 140, a noun range extension module 150, a Korean postposition dictionary module 160, and a compound noun dictionary module 170.

The morphological analysis module 110 divides an externally input Korean sentence into a plurality of morpheme units carrying part-of-speech information and outputs them. In doing so, the module first reduces the part-of-speech sequence, which excludes unnecessary idioms and boilerplate phrases from the candidates, and then divides the sentence into morpheme units.

In this embodiment, the part-of-speech reduction is described as being performed with heuristic rules, but it is not necessarily limited to this. Because the morphological analysis module 110 can reduce parts of speech and divide a sentence into morpheme units in various ways, the embodiment does not restrict the description to any one method.

The noun detection module 120 receives the morpheme-unit sentence with part-of-speech information output by the morphological analysis module 110 and outputs it as a phrase-unit sentence. That is, the noun detection module 120 uses the Korean postpositions stored in the Korean postposition dictionary module 160 to decompose the morpheme-unit sentence into eojeol (space-delimited word) units and outputs a phrase-unit sentence. Because this decomposition can be carried out in various ways, the embodiment does not restrict the description to any one method.
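The text leaves the decomposition method open, so the following is only a minimal sketch of one possible reading: morphemes are accumulated until a morpheme found in the postposition dictionary closes the current phrase. The function name `split_into_phrases`, the list-of-strings input, and the tiny postposition set are assumptions made for illustration, and the sketch ignores the part-of-speech tags a real analyzer would use to tell a postposition from a homographic noun.

```python
def split_into_phrases(morphemes, postpositions):
    """Group morphemes into phrases, closing a phrase at each postposition.

    Returns a list of (content_morphemes, closing_postposition) pairs;
    the postposition is None when the sentence ends without one.
    Hypothetical helper: the patent does not prescribe this procedure.
    """
    phrases = []
    current = []
    for morpheme in morphemes:
        if morpheme in postpositions:
            if current:  # a postposition closes the phrase built so far
                phrases.append((current, morpheme))
                current = []
        else:
            current.append(morpheme)
    if current:  # trailing morphemes with no closing postposition
        phrases.append((current, None))
    return phrases


# Example based on the FIG. 3 fragment "프랑스식 이름은 마리 퀴리이다"
example = split_into_phrases(
    ["프랑스", "식", "이름", "은", "마리", "퀴리", "이다"],
    {"은", "는", "이", "가", "을", "를"},
)
```

In this example the first phrase holds the three units 프랑스, 식, and 이름 closed by the postposition 은, which matches the grouping the description later attributes to the fourth step of FIG. 3.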

In the phrase-unit sentence, the noun detection module 120 marks first-pass boundaries based on the noun words stored in the compound noun dictionary module 170.
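As an illustration of first-pass boundary marking, the sketch below looks up runs of tokens in a noun dictionary and records the matching spans. Greedy longest-match lookup, the `max_len` window, and the name `mark_basic_boundaries` are assumptions of this sketch; the source only says that boundaries are marked based on the stored noun words.

```python
def mark_basic_boundaries(tokens, noun_dict, max_len=4):
    """Return (start, end) token spans whose joined surface form is a stored noun.

    Greedy longest match is an assumption of this sketch, not a method
    stated in the patent. `end` is exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate first, shrinking toward a single token.
        for j in range(min(len(tokens), i + max_len), i, -1):
            if "".join(tokens[i:j]) in noun_dict:
                match_end = j
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end))
            i = match_end
    return spans


basic = mark_basic_boundaries(["사랑", "의", "집짓기", "운동"], {"사랑", "운동"})
```

With only 사랑 and 운동 in the dictionary, the first and last tokens are marked and the tokens between them are left unmarked, which is the starting state for the merging step described next.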

When the neighbor noun merging module 130 receives the phrase-unit sentence with first-pass boundaries generated by the noun detection module 120, it re-searches the nouns stored in the compound noun dictionary module 170 a second time. This covers cases in which a noun was not detected because it is not registered in the compound noun dictionary module 170, because it takes the form of a phrase or clause, or because of a typo or a variant foreign-word transliteration.

That is, in the bounded phrase-unit sentence generated by the noun detection module 120, the nouns stored in the compound noun dictionary module 170 are re-searched using a candidate that includes the words lying between neighboring detection results. The noun detection module 120 then merges the nouns using the result of the re-search.

For example, in the phrase '{사랑}의 집짓기 {운동}' (roughly, 'the house-building movement of love'), when the noun detection module 120 detects '{사랑}' and '{운동}', the compound noun dictionary module 170 is re-searched with '{사랑의 집짓기 운동}', which includes all the words between the two nouns, to check whether that word is registered. If it is registered, the boundaries are merged and the neighboring nouns are combined.
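The dictionary re-search can be sketched as follows, assuming detections are kept as character offsets into the raw sentence. For each pair of neighboring detections, the substring running from the start of the first to the end of the second is looked up, and the two spans are merged when it is found. The function name and the offset representation are assumptions of this sketch.

```python
def merge_adjacent_by_dictionary(text, spans, noun_dict):
    """Merge neighboring (start, end) character spans when the text covering
    both spans, including everything between them, is itself a stored noun.

    Hypothetical helper illustrating the first neighbor-merging pass.
    """
    merged = []
    for start, end in sorted(spans):
        if merged:
            prev_start, _ = merged[-1]
            candidate = text[prev_start:end]
            if candidate in noun_dict:
                merged[-1] = (prev_start, end)  # extend the previous boundary
                continue
        merged.append((start, end))
    return merged


text = "사랑의 집짓기 운동"
detected = [(0, 2), (8, 10)]  # '사랑' and '운동'
nouns = {"사랑", "운동", "사랑의 집짓기 운동"}
result = merge_adjacent_by_dictionary(text, detected, nouns)
```

When the full expression is absent from the dictionary, the two original spans are returned unchanged, so the pass never widens a boundary without dictionary support.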

The neighbor noun merging module 130 also merges neighboring detection results to finalize the character boundary, on the assumption that a single phrase or clause will not contain two or more nouns. For example, when several partial results are detected within one phrase, as in '{알아크}사 {모스크}에서' ('at the Al-Aqsa Mosque'), the neighbor noun merging module 130 merges the neighboring nouns into '{알아크사 모스크}'.
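The one-noun-per-phrase assumption can be sketched as a span union: every detection that falls inside the same phrase is replaced by a single span from the earliest start to the latest end. The character-offset representation, the explicit `phrases` argument, and the function name are assumptions of this sketch, and detections lying outside every listed phrase are simply dropped here.

```python
def merge_within_phrase(spans, phrases):
    """Collapse all detected spans inside each phrase into one span.

    spans and phrases are lists of (start, end) character offsets with
    exclusive ends. Hypothetical helper; assumes the phrases cover every
    detection that should be kept.
    """
    result = []
    for p_start, p_end in phrases:
        inside = [s for s in spans if p_start <= s[0] and s[1] <= p_end]
        if inside:
            result.append((min(s[0] for s in inside), max(s[1] for s in inside)))
    return result


phrase_text = "알아크사 모스크에서"
partial = [(0, 3), (5, 8)]  # '알아크' and '모스크'
merged = merge_within_phrase(partial, [(0, len(phrase_text))])
```

The merged span covers '알아크사 모스크' and stops before the trailing postposition, because the union ends at the end of the last detection rather than at the end of the phrase.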

When the noun addition module 140 receives from the neighbor noun merging module 130 a sentence whose neighboring nouns have been merged and whose character boundaries are finalized, it adds neologisms or unregistered headwords marked by punctuation to the dictionary. For example, when a neologism is set off in quotation marks within a sentence, as in "'혼밥혼술'이라는 용어가 사용되고 있다" ("the term '혼밥혼술' is in use"), the boundary of the noun is already fixed. Therefore, if the noun '혼밥혼술' is not registered in the compound noun dictionary module 170, the neologism is newly registered and the compound noun dictionary module 170 is updated.
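A minimal sketch of the quotation-based dictionary update follows. Which punctuation marks count as delimiters is not listed in the source, so the set of straight and curly quotes in `QUOTED`, like the function name `add_quoted_terms`, is an assumption made for illustration.

```python
import re

# Assumed delimiter set: straight and curly single/double quotes.
QUOTED = re.compile("['\"‘“]([^'\"’”]+)['\"’”]")


def add_quoted_terms(sentence, noun_dict):
    """Register quoted terms that are missing from noun_dict.

    Mutates noun_dict in place and returns the newly added terms in
    order of appearance. Hypothetical helper, not the patent's module.
    """
    added = []
    for term in QUOTED.findall(sentence):
        term = term.strip()
        if term and term not in noun_dict:
            noun_dict.add(term)
            added.append(term)
    return added


dictionary = {"용어"}
new_terms = add_quoted_terms("'혼밥혼술'이라는 용어가 사용되고 있다", dictionary)
```

Running the same sentence a second time adds nothing, since the term is already registered after the first call.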

The noun range extension module 150 receives the sentence whose character boundaries have been finalized by the neighbor noun merging module 130 and extends noun boundaries using the postpositions stored in the Korean postposition dictionary module 160. That is, relying on the fact that Korean postpositions always attach to the end of a phrase and serve to delimit it, the noun range extension module 150 extends and finalizes the noun boundary.

For example, if only '오보' is detected as a concept in '{오보}에를 연주하는 과정' ('the process of playing the oboe'), the range of the noun can be extended up to the postposition '를' to detect a different noun, '오보에' (oboe).
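The postposition-based extension can be sketched like this, assuming the detection is a character span and the enclosing eojeol ends at the next space. If that eojeol ends with a known postposition, the span is stretched to the character just before it. The function name and the longest-suffix-first ordering are assumptions of this sketch.

```python
def extend_to_postposition(text, span, postpositions):
    """Extend a detected noun span up to the postposition closing its eojeol.

    span is (start, end) in character offsets, end exclusive. Returns the
    original span when no stored postposition ends the eojeol.
    Hypothetical helper illustrating the range-extension step.
    """
    start, end = span
    word_end = text.find(" ", end)  # end of the space-delimited word
    if word_end == -1:
        word_end = len(text)
    word = text[start:word_end]
    # Prefer longer postpositions so multi-character ones win over suffixes.
    for josa in sorted(postpositions, key=len, reverse=True):
        if word.endswith(josa) and len(word) > len(josa):
            new_end = word_end - len(josa)
            if new_end >= end:
                return (start, new_end)
    return span


sentence = "오보에를 연주하는 과정"
extended = extend_to_postposition(sentence, (0, 2), {"를", "을", "은", "는"})
```

Here the detection '오보' grows to '오보에' because the eojeol '오보에를' ends with the stored postposition '를'.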

A method of recognizing character boundaries using the character boundary recognition system 100 described above is now described with reference to FIGS. 2 and 3.

FIG. 2 is a flowchart of a character boundary recognition method according to an embodiment of the present invention.

As shown in FIG. 2, when the character boundary recognition system 100 receives a Korean sentence from outside (S100), it divides the sentence into morpheme units and extracts part-of-speech information (S101). It then generates a morpheme-unit sentence that includes the part-of-speech information.

The character boundary recognition system 100 applies heuristic rules to the morpheme-unit sentence to perform a part-of-speech reduction procedure that removes idioms and boilerplate phrases from the sentence (S102). Once the reduced morpheme-unit sentence has been generated, the system uses the pre-stored Korean postpositions to decompose the sentence into eojeol units and then group it into phrase units (S103).

The character boundary recognition system 100 detects noun ranges by marking boundaries on the phrase-segmented sentence using the pre-stored compound noun dictionary (S104). After detecting noun ranges in step S104, the system performs a first neighbor noun merging procedure, which re-searches the dictionary with candidates that include the words between neighboring detection results and merges the neighboring nouns (S105). This is described first with reference to FIG. 3.

FIG. 3 is an exemplary diagram of intermediate processing results for character boundary recognition according to an embodiment of the present invention.

As shown in FIG. 3, if the input sentence is '본명은 마리아 살로메아 스크워도프스카이고, 프랑스식 이름은 마리 퀴리이다.' ('The birth name is Maria Salomea Skłodowska, and the French name is Marie Curie.'), then, as shown in the first step, part-of-speech information is extracted and a sentence grouped into morpheme units is generated. In the fourth step, it can be seen that the Korean postposition {은} follows the three phrase-level units {프랑스}, {식}, and {이름}.

Therefore, as shown in the seventh step, the neighboring nouns are merged using the Korean postposition, producing a compound noun such as {프랑스식이름} ('French name'). Here, the parts enclosed in "{" and "}" denote the boundaries of the entities (for example, personal names, place names, and titles of works) that the character boundary recognition system 100 is to detect.

After the first neighbor noun merging procedure in step S105 of FIG. 2, the character boundary recognition system 100 performs a second neighbor noun merging procedure (S106). For this procedure, the system relies on the assumption that a single phrase or clause will not contain two or more nouns.

For example, in the fifteenth step shown in FIG. 3, when several partial results are detected in one phrase, as in '{마리아 살로메}아 {스크워도프스카}', the neighboring nouns are merged into {마리아 살로메아 스크워도프스카}, as shown in the sixteenth step.

After the second neighbor noun merging, the character boundary recognition system 100 adds neologisms or unregistered headwords marked by punctuation to the dictionary (S107). It then extends the noun range using the pre-stored postpositions (S108). Through this procedure, the character boundaries of common nouns and compound nouns that appear without explicit delimiters in Korean sentences can be clearly identified and applied to various natural language processing tasks.

For example, suppose that a sentence containing ordinary compound nouns is input to the character boundary recognition system 100: '퇴임 이후 카터 재단을 설립한 뒤 민주주의 실현을 위해 제 3세계의 선거 감시 활동과 기니 벌레에 의한 드라쿤쿠르스 질병의 방재를 위해 힘썼다' ('After leaving office, he established the Carter Foundation and worked on election monitoring in the Third World to realize democracy and on preventing dracunculiasis caused by the Guinea worm'). The system then produces the result '{퇴임} 이후 {카터 재단}을 설립한 뒤 {민주주의 실현}을 위해 {제 3세계}의 {선거 감시 활동}과 {기니 벌레}에 의한 {드라쿤쿠르스 질병} 방재를 위해 힘썼다'.

Suppose also that a sentence containing technical terms made up of several words is input to the character boundary recognition system 100: '유럽의 외환 거래가 집중되어 있고 유로달러라는 역외 금융 시장이 발달했으며 신디케이티드 론이나 신용 파생상품 등 새로운 금융 상품이 끊임없이 개발되어 상품화에 성공했다.' ('Foreign exchange trading in Europe is concentrated there, an offshore financial market called the Eurodollar has developed, and new financial products such as syndicated loans and credit derivatives have been continually developed and successfully commercialized.'). The system then produces the result '{유럽}의 {외환 거래}가 집중되어 있고 {유로달러}라는 {역외 금융 시장}이 발달했으며 {신디케이티드 론}이나 {신용 파생상품} 등 새로운 {금융 상품}이 끊임없이 개발되어 상품화에 성공했다.'.

Suppose also that a sentence containing long entities or business names that include foreign words is input to the character boundary recognition system 100: '서울 삼성동에 그랜드 인터컨티넨탈 서울 파르나스, 코엑스 인터컨티넨탈 2개가 있으며 모두 GS 계열사인 파르나스호텔(주)에서 운영하고 있다.' ('In Samseong-dong, Seoul, there are two, Grand InterContinental Seoul Parnas and COEX InterContinental, both operated by Parnas Hotel Co., Ltd., a GS affiliate.'). The system then produces the result '{서울 삼성동}에 {그랜드 인터컨티넨탈 서울 파르나스}, {코엑스 인터컨티넨탈} 2개가 있으며 모두 {GS 계열사}인 {파르나스호텔(주)}에서 운영하고 있다.'.

Suppose also that a sentence containing a long foreign name that is not listed in the dictionary is input to the character boundary recognition system 100: '왕위를 계승한 다음 푸미폰은 삼촌인 차이낫 랑싯 프라유라삭디를 섭정 및 대리청정에 임명하고 학업을 마치기 위해 1947년 스위스로 돌아갔다.' ('After succeeding to the throne, Bhumibol appointed his uncle, Chainat Rangsit Prayurasakdi, as regent and acting ruler and returned to Switzerland in 1947 to finish his studies.'). The system then produces the result '{왕위}를 계승한 다음 {푸미폰}은 {삼촌}인 {차이낫 랑싯 프라유라삭디}를 {섭정} 및 {대리청정}에 임명하고 {학업}을 마치기 위해 1947년 {스위스}로 돌아갔다.'.

Although embodiments of the present invention have been described in detail above, the scope of the invention is not limited to them. Various modifications and improvements made by those skilled in the art using the basic concept of the invention defined in the following claims also fall within its scope.

Claims

A method by which a character boundary recognition system recognizes a character boundary, the method comprising:
finding a noun in a received Korean sentence based on stored Korean postpositions and noun words;
marking a basic character boundary on the found noun; and
if adjacent nouns in the Korean sentence marked with the basic character boundary are stored among the noun words as a single noun, or if a plurality of nouns are contained in one phrase or clause according to the sentence structure, merging the plurality of nouns into a single compound noun to extend the character boundary.

The method according to claim 1, further comprising:
extending a noun range, in the Korean sentence in which the character boundary has been extended, using the stored Korean postpositions.

The method of claim 2, wherein finding the noun comprises:
extracting part-of-speech information from the Korean sentence and dividing the Korean sentence into morpheme units that include the extracted part-of-speech information; and
decomposing the Korean sentence divided into morpheme units into eojeol units based on the Korean postpositions and grouping them into phrase units.

The method of claim 3, wherein dividing into morpheme units comprises:
reducing parts of speech based on the extracted part-of-speech information and a predefined rule.

The method of claim 4, wherein marking the basic character boundary comprises:
marking a basic character boundary on the Korean sentence grouped into phrase units, based on the noun found in the Korean sentence and the stored noun words.

The method of claim 2, wherein extending the character boundary comprises:
searching, in the Korean sentence marked with the basic character boundary, whether an extended word including all the words between two adjacent nouns is stored as a noun word; and
if the extended word is stored as a noun word, merging the boundaries of the two nouns to extend the character boundary.

The method according to claim 6, further comprising, after extending the character boundary:
if one phrase includes a plurality of character boundaries, merging the plurality of character boundaries to extend the character boundary.

The method of claim 7, wherein the step of outputting the determined sentence comprises:
extracting a predefined punctuation mark from the Korean sentence in which the character boundary is extended; and
adding the word marked by the extracted punctuation mark to a storage module that stores the noun words.

The method of claim 8, further comprising, after the step of adding to the storage module:
extending a noun range in the Korean sentence in which the character boundary is extended, based on the Korean postpositions.

A system for recognizing a character boundary, the system comprising:
a morphological analysis module that receives a Korean sentence and divides the Korean sentence into morpheme units;
a noun detection module that converts the Korean sentence divided into morpheme units by the morphological analysis module into a phrase-unit sentence based on stored Korean postpositions, and marks a character boundary on the phrase-unit sentence based on stored noun words; and
a neighbor noun merging module that re-searches for nouns in the phrase-unit sentence containing the character boundary marked by the noun detection module, using the stored noun words, and then merges adjacent nouns to finalize the character boundary.

The system of claim 10, further comprising:
a first dictionary module that stores the Korean postpositions; and
a second dictionary module in which the noun words are stored,
wherein the noun words comprise nouns, compound nouns, neologisms, and unregistered headwords.

The system of claim 11, further comprising:
a noun addition module that, on receiving a sentence whose character boundary has been finalized by the neighbor noun merging module, adds the neologisms and unregistered headwords to the second dictionary module based on the punctuation marks in the sentence.

The system of claim 12, further comprising:
a noun range extension module that extends the boundary of a noun using the Korean postpositions stored in the first dictionary module.

The system of claim 10, wherein the morphological analysis module:
extracts part-of-speech information from the Korean sentence and performs a part-of-speech reduction process based on the extracted part-of-speech information.