KR100886687B1

KR100886687B1 - Method and apparatus for auto-detecting of unregistered word in chinese language

Info

Publication number: KR100886687B1
Application number: KR1020070129360A
Authority: KR
Inventors: 윤창호; 권오욱; 오영순; 노윤형; 최승권; 서영애; 이기영; 양성일; 김창현; 김영길; 김운; 황영숙; 박은진
Original assignee: 한국전자통신연구원
Priority date: 2007-12-12
Filing date: 2007-12-12
Publication date: 2009-03-04

Abstract

A method and an apparatus for automatically extracting unregistered Chinese words are provided, which extract unregistered words from web documents that are the actual translation target by using HTML tag information, statistical information, single-syllable token information, and the like. On receiving a web document containing Chinese sentences, a removal unit (102) removes the HTML tags of the input web document, and a tag classification unit (104) classifies each sentence in the document for meta-tag or general-tag processing. An extraction unit (106) using general tags includes a single-syllable-centered extraction module (116), which extracts unregistered words based on single-syllable tokens, and a verb-centered extraction module (118), which extracts unregistered verbs consisting of four syllables. An extraction unit (108) using meta tags extracts unregistered words from the words included in the meta tag information. A morphological analysis unit (110) performs morphological analysis and outputs the result, and a root-centered extraction module (114) uses that result to extract unregistered words centered on roots.

Description

Method and apparatus for automatic extraction of unregistered Chinese words {METHOD AND APPARATUS FOR AUTO-DETECTING OF UNREGISTERED WORD IN CHINESE LANGUAGE}

The present invention relates to Chinese translation technology, and more particularly to a method and apparatus for automatically extracting unregistered Chinese words. Unlike the statistical proper-noun extraction methods built into existing machine translation systems, the invention targets the large volume of web documents that are actually to be translated and extracts unregistered Chinese words using information such as HTML tags, statistical information, and single-syllable tokens.

The present invention is derived from research conducted as part of the IT Growth Engine Technology Development Project of the Ministry of Information and Communication [Project management number: 2006-S-037-02, Project title: Development of application-specific Korean-Chinese-English automatic translation technology].

Existing research on extracting unregistered Chinese words has concentrated on proper nouns. Proper nouns, which include personal names, place names, and organization names, are relatively likely to be unregistered words in a machine translation system, so statistical methods are widely used to extract them automatically.

Using a corpus in which proper nouns are accurately tagged, tagging information is obtained for the first token, the middle tokens, and the last token of each proper noun. Left and right context information is added, an estimation module is built by automatic learning, and this module is then used to estimate proper nouns. Because ordinary Chinese words can be used as personal names, extraction performance for personal names is lower than for place names or organization names. This method requires a large training corpus annotated with proper-noun information, so its performance depends on the quantity and quality of that corpus. If a low-performing proper-noun estimation module is applied to a machine translation system, incorrect estimates are misapplied and translation quality degrades.

In a conventional Chinese translation system operating as described above, the unregistered words encountered by a Chinese-Korean machine translation system for web documents include not only proper nouns but also neologisms, abbreviations, and technical terms that are continually being coined on the web. Existing proper-noun estimation methods therefore cannot extract unregistered words such as neologisms.

The reason is that no training corpus exists and, even if one did, the components that make up a neologism have no distinguishing characteristics, so estimation is not possible.

Accordingly, the present invention provides a method and apparatus for automatically extracting unregistered Chinese words from the large volume of web documents that are actually to be translated, using HTML tag information, statistical information, and single-syllable token information.

A method according to an embodiment of the present invention includes: removing the HTML tags of an input web document when a web document containing Chinese sentences is received; classifying each sentence in the web document for meta-tag or general-tag processing; performing morphological analysis and outputting the analysis result; and extracting unregistered words using at least one of the following schemes: extracting unregistered words centered on roots using the analysis result, extracting unregistered words centered on single-syllable tokens, extracting unregistered verbs consisting of four syllables, extracting unregistered words by determining whether a single-syllable token can stand as a word, and extracting unregistered words using the words included in the meta tag information.

An apparatus according to an embodiment of the present invention includes: a removal unit that removes the HTML tags of an input web document when a web document containing Chinese sentences is received; a tag classification unit that classifies each sentence in the web document for meta-tag or general-tag processing; a morphological analysis unit that performs morphological analysis and outputs the analysis result; an extraction unit using general tags, which includes a root-centered extraction module that extracts unregistered words centered on roots using the analysis result, a single-syllable-centered extraction module that extracts unregistered words centered on single-syllable tokens, and a verb-centered extraction module that extracts unregistered verbs consisting of four syllables; an extraction unit that determines whether a single-syllable token can stand as a word and extracts unregistered words accordingly; and an extraction unit using meta tags, which extracts unregistered words using the words included in the meta tag information.

The effects obtained by representative aspects of the disclosed invention are briefly described as follows.

With the unregistered-word extraction scheme of the present invention, a dictionary for Chinese analysis can be built quickly and easily. In particular, reinforcing the dictionary with neologisms and unregistered predicates can greatly improve accuracy in a real web-document translation system.

Hereinafter, the operating principle of the present invention is described in detail with reference to the accompanying drawings. In the following description, a detailed explanation of a known function or configuration is omitted where it would unnecessarily obscure the subject matter of the invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intentions or customs of users or operators. Their definitions should therefore be based on the content of this specification as a whole.

The present invention targets the large volume of web documents that are actually to be translated. It obtains HTML tag information and statistical information from morphological analysis, determines whether a single-syllable token can stand as a word, and uses this information to extract unregistered Chinese words.

That is, sentences classified as meta tags by the tag classification module are sorted by token length, and words are extracted from each group using frequency information. Sentences classified as general tags are processed by a statistical extraction scheme comprising a module that extracts around roots, a module that extracts around runs of single-syllable tokens, and a module that extracts around consecutive tokens tagged as verbs. In addition, an extraction scheme based on whether a single-syllable token can stand as a word extracts unregistered words using a single-syllable token judgment module and an unregistered-word estimation module that uses its result.

FIG. 1 is a block diagram showing the structure of an apparatus for automatically extracting unregistered words according to a preferred embodiment of the present invention.

Referring to FIG. 1, the automatic unregistered-word extraction apparatus (100) includes an HTML detagging module (102), a tag classification module (104), an extraction unit (106) using general tags, and an extraction unit (108) using meta tags. The extraction unit (106) using general tags includes a statistical extraction unit (112) and an extraction unit (122) based on whether a single-syllable token can stand as a word.

The process of automatically extracting unregistered Chinese words with the apparatus (100) is as follows. When an actual web document, updated in real time, is input to the apparatus (100), the HTML detagging module (102) first removes the HTML tags from the web document and passes the result to the tag classification module (104). The tag classification module (104) classifies each sentence as meta tag, link tag, or general tag. Sentences classified as meta tags are processed by the extraction unit (108) using meta tags, and sentences classified as general tags are processed by the extraction unit (106) using general tags, which extracts the unregistered words in those sentences.
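
The detagging and tag classification steps can be illustrated with a minimal Python sketch. This is not the patent's implementation: the class name `TagClassifier`, the three bucket names, and the rule that anchor text counts as "link" text are assumptions made for illustration, and the source does not say how link-tag sentences are processed afterwards. Only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"meta", "br", "img", "hr", "input", "link"}


class TagClassifier(HTMLParser):
    """Strips HTML tags and sorts the remaining text into meta, link, and general buckets."""

    def __init__(self):
        super().__init__()
        self.meta = []      # keyword/description values taken from <meta> tags
        self.link = []      # text found inside <a> ... </a>
        self.general = []   # all other visible text
        self._stack = []    # names of the currently open tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description") and attrs.get("content"):
                self.meta.append(attrs["content"])
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop back to the matching open tag (tolerates unclosed inner tags).
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or "script" in self._stack or "style" in self._stack:
            return
        bucket = self.link if "a" in self._stack else self.general
        bucket.append(text)
```

After `feed()` is called on a page, `meta` would go to the meta-tag extraction path and `general` to the general-tag path described above.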

The extraction method using meta tag information exploits the meta information already prepared by the web document author to build up unregistered words quickly and easily. The statistical extraction method using morphological analysis results exploits the tendency of unregistered words such as neologisms to be split into single-syllable tokens, in order to raise the recall of unregistered-word extraction.

First, the processing in the extraction unit (108) using meta tags is described. When writing a web document, a web developer attaches keywords representative of the document's content to the keyword and description attributes of the meta tag in order to summarize the document. Most of these keywords are nouns, and many are neologisms or technical terms. Because large numbers of newly created web documents can be obtained quickly and easily, collecting the keyword values used in meta tags makes it possible to extract unregistered words with little effort, although not exhaustively. Classifying the keywords by token length and computing their frequencies allows single words and compound nouns to be extracted separately.

That is, the keywords are classified by token length. For a word consisting of one token, its frequency is computed, and if it occurs more often than a preset frequency it is extracted as an unregistered word. For a word consisting of two or more tokens, morphological analysis is first performed (132), and if the result is a noun, the word is extracted as a compound noun (134). The unregistered words extracted in this way are output as unregistered-word estimation result 3 (134).
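
The meta-tag path can be sketched as follows. The function name `extract_from_meta` and its parameters are hypothetical; `segment` and `is_noun` stand in for the morphological analyzer, which the source does not specify, and the strict "greater than" comparison against `min_freq` is one reading of "more often than a preset frequency".

```python
from collections import Counter


def extract_from_meta(keywords, segment, is_noun, min_freq=2):
    """Split meta-tag keywords into frequent one-token words and multi-token compound nouns.

    keywords: list of keyword strings collected from meta tags
    segment:  callable mapping a keyword to its list of tokens (assumed analyzer)
    is_noun:  callable telling whether the analyzer tags the keyword as a noun (assumed)
    """
    single_counts = Counter()
    compounds = set()
    for kw in keywords:
        tokens = segment(kw)
        if len(tokens) == 1:
            single_counts[kw] += 1          # one-token keyword: count occurrences
        elif len(tokens) >= 2 and is_noun(kw):
            compounds.add(kw)               # multi-token noun: compound noun candidate
    frequent = {w for w, n in single_counts.items() if n > min_freq}
    return frequent, compounds
```

Both returned sets would together correspond to the third estimation result in the description.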

Next, the processing in the extraction unit (106) using general tags is described. The statistical extraction unit (112) of the extraction unit (106) consists of three submodules: the root-centered extraction module (114), the single-syllable-token-centered extraction module (116), and the verb-centered extraction module (118). The three submodules (114, 116, 118) are independent of one another, and each has its own strengths. The root-centered extraction module (114) extracts unregistered words around roots. Here a root is a suffix that comes at the very end of a proper noun such as a place name, geographic feature, company, or school, for example "山, 節, 街頭, 村, 學校, 公司". The method exploits the fact that proper nouns are formed around such suffixes.

To this end, a high-frequency root dictionary is built from existing dictionaries and corpora. Roots are then looked up in the morphological analysis output using the root dictionary, and the span of the unregistered word around each matched root is estimated using part-of-speech information and word length. That span is extracted as an unregistered word.
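
A minimal sketch of root-centered extraction is given below. The source only states that part-of-speech information and word length bound the span; the concrete rule used here (grow leftward over tokens whose tag is in `allowed_pos`, up to `max_chars` characters) is an assumption, as are the function name and the tag labels.

```python
def extract_by_root(tokens, roots, allowed_pos=frozenset({"n"}), max_chars=6):
    """Return candidate words that end in a known root.

    tokens: list of (word, pos) pairs from the morphological analyzer
    roots:  set of high-frequency roots, e.g. {"村", "公司"}
    """
    candidates = []
    for i, (word, _pos) in enumerate(tokens):
        if word not in roots:
            continue
        start, length = i, len(word)
        # Grow leftward while the previous token has an allowed tag
        # and the candidate stays within the length budget.
        while start > 0:
            prev_word, prev_pos = tokens[start - 1]
            if prev_pos not in allowed_pos or length + len(prev_word) > max_chars:
                break
            start -= 1
            length += len(prev_word)
        if start < i:  # skip the bare root, which is already a dictionary entry
            candidates.append("".join(w for w, _ in tokens[start:i + 1]))
    return candidates
```

In practice the candidates would still need frequency filtering or manual review before being added to a dictionary.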

The single-syllable-token-centered extraction module (116) extracts unregistered words around single-syllable tokens. An unregistered word that is not in the dictionary tends to be split into single-syllable tokens during morphological analysis (110). For example, when a Chinese input sentence contains an unregistered word such as "深[…]", it is very likely to be analyzed as the separate single-syllable tokens "深/adjective […]/noun". A single-syllable token such as "深" is a common word, frequently used in Chinese as the adjective "deep", so an incorrect word analysis seriously degrades translation quality. A string that has been split into consecutive single-syllable tokens was most likely a single word originally. Such single-syllable token strings are therefore selected as unregistered-word candidates, their frequencies are computed, and those that occur more often than a preset frequency are extracted as unregistered words.
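
The single-syllable run heuristic can be sketched as below. The function names and the `min_run` and `min_freq` parameters are hypothetical, and the sketch assumes punctuation has already been removed from the token lists, since punctuation marks would otherwise count as one-character tokens.

```python
from collections import Counter


def single_syllable_runs(tokens, min_run=2):
    """Yield each maximal run of consecutive one-character tokens, joined into a string."""
    run = []
    for word in tokens:
        if len(word) == 1:
            run.append(word)
            continue
        if len(run) >= min_run:
            yield "".join(run)
        run = []
    if len(run) >= min_run:
        yield "".join(run)


def extract_frequent_runs(segmented_sentences, min_freq=2):
    """Keep the runs that occur more often than min_freq across all sentences."""
    counts = Counter(r for sent in segmented_sentences for r in single_syllable_runs(sent))
    return {r for r, n in counts.items() if n > min_freq}
```

The frequency filter is what separates a genuinely recurring unknown word from an accidental sequence of one-character function words.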

The verb-centered extraction module (118) extracts unregistered words around consecutive tokens tagged as verbs. Many Chinese predicates consist of four syllables. These include many four-character idioms, as well as many expressions that are not idioms but are used so often that they have become fixed predicates. Most such words are unregistered predicates, and they are extracted as four-syllable unregistered predicates using the frequency of consecutive verb strings. The outputs of the three submodules (114, 116, 118) are extracted as unregistered-word estimation result 1 (120).
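
A minimal sketch of the verb-centered step follows. It assumes the analyzer marks verbs with the tag "v" and treats any span of two or more consecutive verb tokens totalling exactly four characters as a candidate; both choices, and the function names, are illustrative assumptions rather than details from the source.

```python
from collections import Counter


def four_syllable_verb_strings(tagged, verb_tag="v"):
    """Collect four-character strings formed by two or more consecutive verb tokens."""
    out = []
    n = len(tagged)
    for i in range(n):
        text = ""
        for j in range(i, n):
            word, pos = tagged[j]
            if pos != verb_tag:
                break
            text += word
            if len(text) >= 4:
                # Exactly four characters, built from more than one token.
                if len(text) == 4 and j > i:
                    out.append(text)
                break
    return out


def extract_frequent_verbs(tagged_sentences, min_freq=2):
    """Keep the verb strings that occur more often than min_freq."""
    counts = Counter(s for sent in tagged_sentences for s in four_syllable_verb_strings(sent))
    return {s for s, n in counts.items() if n > min_freq}
```

A single four-character token tagged as a verb is skipped on purpose, because the analyzer can only emit it if it is already in the dictionary.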

Within the extraction unit (106) using general tags, the extraction unit (122) based on whether a single-syllable token can stand as a word complements the extraction scheme based on single-syllable token strings. Because that scheme selects only strings made up entirely of single-syllable tokens, it cannot extract an unregistered word made of a two-syllable token plus a one-syllable token, such as "鼠標[…]" (mouse pad), which is split into "鼠標/mouse" and "[…]/to lay". To solve this problem, training is carried out on an accurately tagged training corpus.

The training corpus used in the conventional method had to be annotated with proper-noun information, which limited it in both quantity and quality. The training corpus used in the present apparatus needs only word segmentation and part-of-speech information, so a large corpus is easy to obtain.

The single-syllable word judgment module (124) first uses the word segmentation information in the training corpus to compute statistics for each Chinese character, such as the probability that it forms a word on its own, the probability that it appears at the beginning of a word, and the probability that it appears at the end of a word. When a single syllable can stand as an independent word, the statistics are supplemented by the part-of-speech information of the words to its left and right. The module trained in this way judges whether an input single-syllable token can stand independently as a word.
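
The character-level statistics can be sketched as relative frequencies over a segmented corpus. This is a simplified assumption: the function name and the dictionary keys are hypothetical, and the part-of-speech refinement mentioned in the description is left out.

```python
from collections import Counter


def char_position_stats(segmented_corpus):
    """Estimate, per character, how often it stands alone, starts a word, or ends a word.

    segmented_corpus: iterable of sentences, each a list of words.
    Returns {char: {"alone": p, "first": p, "last": p}} as relative frequencies
    over all occurrences of the character.
    """
    total, alone, first, last = Counter(), Counter(), Counter(), Counter()
    for sentence in segmented_corpus:
        for word in sentence:
            for ch in word:
                total[ch] += 1
            if len(word) == 1:
                alone[word] += 1
            else:
                first[word[0]] += 1
                last[word[-1]] += 1
    return {
        ch: {"alone": alone[ch] / n, "first": first[ch] / n, "last": last[ch] / n}
        for ch, n in total.items()
    }
```

A low "alone" value for a character that the analyzer emitted as a one-character token is the signal that the token was probably split off by mistake.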

Then, in the single-syllable-token-centered unregistered-word estimation module (126), if the probability does not exceed a preset threshold, the token is judged unable to stand alone, that is, judged to have been split off incorrectly, and expansion to the left and right is attempted around this token using part-of-speech information. The unregistered-word candidates estimated by the expansion heuristics are classified by length and their frequencies are computed, and the result is extracted as unregistered-word estimation result 2 (128).
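
The expansion step can be sketched as below. The source does not spell out its expansion heuristics, so the rule used here (merge with an immediate neighbour whose tag is in `mergeable_pos`, within `max_chars` characters) is purely an assumption, as are the function name, the threshold value, and the tag labels.

```python
def expand_candidates(tagged, p_alone, threshold=0.3,
                      mergeable_pos=frozenset({"n"}), max_chars=4):
    """Build candidates around one-character tokens that are unlikely to stand alone.

    tagged:  list of (word, pos) pairs from the analyzer
    p_alone: callable giving the probability that a character is a word by itself
    """
    candidates = []
    for i, (word, _pos) in enumerate(tagged):
        if len(word) != 1 or p_alone(word) >= threshold:
            continue  # multi-character token, or plausible as an independent word
        left = right = i
        length = 1
        # Try to absorb the left neighbour.
        if i > 0 and tagged[i - 1][1] in mergeable_pos \
                and length + len(tagged[i - 1][0]) <= max_chars:
            left = i - 1
            length += len(tagged[i - 1][0])
        # Try to absorb the right neighbour.
        if i + 1 < len(tagged) and tagged[i + 1][1] in mergeable_pos \
                and length + len(tagged[i + 1][0]) <= max_chars:
            right = i + 1
        if left < right:
            candidates.append("".join(w for w, _ in tagged[left:right + 1]))
    return candidates
```

As in the description, the resulting candidates would then be grouped by length and filtered by frequency before being reported.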

FIG. 2 is a flowchart showing the automatic unregistered-word extraction procedure of the apparatus according to a preferred embodiment of the present invention.

Referring to FIG. 2, when a web document is input in step 200, the HTML tags are removed in step 202, and tag classification is performed for each sentence of the input web document in step 204. If the tag is determined to be a meta tag in step 206, the process proceeds to step 208, where words are classified by token length. If step 210 finds that the token length is one token, the process proceeds to step 212, where a word that appears in meta tags at least a preset number of times is extracted as a high-frequency word, and then to step 228. If the word has two or more tokens, the process proceeds to step 214, where morphological analysis is performed. If the analyzed result is a noun, the word is extracted as a compound noun, and the process proceeds to step 228.

If, on the other hand, the tag is determined to be a general tag in step 206, morphological analysis is performed in step 218. Unregistered words are then extracted around roots in step 220 and around single syllables in step 222. In step 224, verb-centered unregistered words are extracted, and in step 226, unregistered words centered on single-syllable tokens are extracted, after which the process proceeds to step 228. The order of steps 220 to 226 may vary depending on the implementation. Finally, in step 228, the words extracted as unregistered words are output as the unregistered-word estimation result.

As described above, the apparatus for extracting unregistered Chinese words from large volumes of web documents analyzes and exploits several distinct characteristics of unregistered words. By extracting as many accurate unregistered words as possible, including neologisms, abbreviations, technical terms, and unregistered verbs, it makes building a Chinese analysis dictionary easier and more efficient.

As described above, the present invention targets the large volume of web documents that are actually to be translated. It obtains HTML tag information and statistical information from morphological analysis, determines whether a single-syllable token can stand as a word, and uses this information to extract unregistered Chinese words.

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the invention. The scope of the present invention should therefore not be limited to the described embodiments, but should be determined by the claims set forth below and by their equivalents.

FIG. 1 is a block diagram showing the structure of an apparatus for automatically extracting unregistered words according to a preferred embodiment of the present invention;

FIG. 2 is a flowchart showing the automatic unregistered-word extraction procedure of the apparatus according to a preferred embodiment of the present invention;

FIG. 3 is a block diagram showing the structure of a location information server according to a preferred embodiment of the present invention.

<Explanation of reference numerals for the main parts of the drawings>

100: Automatic unregistered-word extraction apparatus
102: HTML tag removal
104: Tag classification
106: Extraction unit using general tags
108: Extraction unit using meta tags
110: Morphological analysis
112: Statistical extraction unit
114: Root-centered extraction
116: Single-syllable-token-centered extraction
118: Verb-centered extraction
122: Extraction unit based on whether a single-syllable token can stand as a word
124: Single-syllable word judgment module
126: Single-syllable-token-centered unregistered-word estimation module
130: High-frequency word extraction
132: Morphological analysis
134: Compound noun extraction

Claims

1. A method for automatically extracting unregistered Chinese words, comprising:

removing, when a web document containing Chinese sentences is received, the HTML tags of the input web document;

classifying each sentence in the web document for meta-tag or general-tag processing;

performing morphological analysis and outputting an analysis result; and

extracting unregistered words using at least one of: a scheme of extracting unregistered words centered on roots using the analysis result, a scheme of extracting unregistered words centered on single-syllable tokens, a scheme of extracting unregistered verbs consisting of four syllables, a scheme of extracting unregistered words by determining whether a single-syllable token can stand as a word, and a scheme of extracting unregistered words using the words included in the meta tag information.

2. The method of claim 1, comprising:

building a root dictionary using the analysis result and extracting unregistered words centered on the roots;

obtaining the frequency of single-syllable token strings and extracting a string as an unregistered word when it exceeds a preset frequency; and

obtaining the frequency of verb strings tagged as verbs and extracting four-syllable unregistered verbs.

3. The method of claim 1, comprising:

determining, using a training corpus, whether a single-syllable token can stand alone as a word; and

extracting unregistered words by expanding, using left and right context information, around tokens that cannot stand as single-syllable words.

4. The method of claim 1, comprising:

classifying the words included in the meta tag information by token length;

extracting, as an unregistered word, a word whose token length is 1 and which exceeds a preset frequency; and

extracting, as an unregistered word, a word whose token length is 2 or more and whose morphological analysis result is a noun.

5. An apparatus for automatically extracting unregistered Chinese words, comprising:

a removal unit that removes, when a web document containing Chinese sentences is received, the HTML tags of the input web document;

a tag classification unit that classifies each sentence in the web document for meta-tag or general-tag processing;

a morphological analysis unit that performs morphological analysis and outputs an analysis result;

an extraction unit using general tags, including a root-centered extraction module that extracts unregistered words centered on roots using the analysis result, a single-syllable-centered extraction module that extracts unregistered words centered on single-syllable tokens, and a verb-centered extraction module that extracts unregistered verbs consisting of four syllables;

an extraction unit that determines whether a single-syllable token can stand as a word and extracts unregistered words accordingly; and

an extraction unit using meta tags that extracts unregistered words using the words included in the meta tag information.

6. The apparatus of claim 5, wherein the extraction unit using general tags:

builds a root dictionary using the morphological analysis result and extracts unregistered words centered on the roots;

obtains the frequency of single-syllable token strings and extracts a string as an unregistered word when it exceeds a preset frequency; and

obtains the frequency of verb strings tagged as verbs and extracts four-syllable unregistered verbs.

7. The apparatus of claim 5, wherein the extraction unit based on whether a single-syllable token can stand as a word determines, using a training corpus, whether a single-syllable token can stand alone as a word, and extracts unregistered words by expanding, using left and right context information, around tokens that cannot stand as single-syllable words.

8. The apparatus of claim 5, wherein the extraction unit using meta tags:

classifies the words included in the meta tag information by token length and, when the token length is 1, extracts words that exceed a preset frequency as unregistered words; and

when the token length is 2 or more and the morphological analysis result is a noun, extracts the word as an unregistered word.