KR100631086B1

KR100631086B1 - Method and apparatus for text normalization using extensible markup language(xml)

Info

Publication number: KR100631086B1
Application number: KR1020050066713A
Authority: KR
Inventors: 박경현
Original assignee: 한국전자통신연구원
Priority date: 2005-07-22
Filing date: 2005-07-22
Publication date: 2006-10-04

Abstract

A device and a method for normalizing text using XML (eXtensible Markup Language) are provided. Applying XML makes text normalization easier to maintain and easier to tune to the characteristics of the input data. It also makes the dynamic characteristics of text normalization easy to reflect and lets the data be verified in advance, which reduces normalization errors and helps derive rules. A storage unit stores text sentences. A parser divides each stored text sentence into word units and parses the divided text data into semantic units for semantic analysis. An XML data converter converts the parsed text data, word unit by word unit, into XML data based on a predefined first XML DTD (Document Type Definition). An attribute setter reads the converted XML data, checks the preceding and following context, and sets an attribute value for each word unit in view of that context. Once an attribute value has been set for each word unit, an XML data generator generates the final XML data for converting the text data under text normalization.

Description

Method and apparatus for text normalization using extensible markup language (XML)

FIG. 1 is a flowchart illustrating a text normalization method using XML according to the present invention.

FIGS. 2A and 2B illustrate DTDs for generating XML data for text normalization according to the present invention.

FIG. 3 shows a preferred embodiment of the attribute values of word elements according to the present invention.

FIG. 4 shows a preferred embodiment of the final XML data for sentence conversion according to the present invention.

FIG. 5 shows a preferred embodiment in which the XML data of FIG. 4 is represented as a DOM (Document Object Model) tree.

* Explanation of reference numerals for main parts of the drawings *

10: sentence (<SENT>) element    20: word (<EOJEOL>) element

30: sub-element

The present invention relates to the field of speech and natural language processing and, more particularly, to a method and apparatus for text normalization using XML (extensible markup language).

Speech and natural language processing is the field that studies the characteristics of each language so that languages can be analyzed and processed by computer, develops language-processing algorithms from an engineering standpoint, and researches the linguistic knowledge those algorithms require.

Text normalization is a technique used in this field to build language models and to synthesize speech. Specifically, it converts symbols, numbers, English letters, and the like into Hangul (Korean script).

Text and voice are the typical means of conveying one's thoughts or intentions to another person. Text in particular has long been an effective medium for conveying thoughts between people who are far apart. With the development of e-mail and the Internet, thoughts can now be transmitted not only as text but also as images or voice. Expression through visible images in particular can convey one's thoughts more concisely and effectively, and can also make the recipient feel a greater sense of familiarity.

In addition, with the rapid development of computers and the Internet, services built on various images such as icons, animations, and avatars have appeared, and there is a marked tendency for users to own and use diverse images that suit their own tastes. Nevertheless, this series of processes has so far been carried out manually. Automatic conversion services, moreover, have converted text merely by comparing the surface form of the characters, so their vocabulary is limited and incorrect matches from such simple comparison distort the meaning.

In particular, owing to the nature of language, symbols, numbers, and the like can be converted in various ways depending on conditions including the context, and they are ambiguous in that several conversions may be possible even under a single condition.

Conventional text normalization methods therefore require tuning according to the characteristics of the input data. That is, conversion is performed by means of predefined data structures and rules.

However, with such methods, whenever tuning requires a rule to be modified or added or a data structure to be changed, the code itself must be modified. This is a major drawback from the standpoint of maintaining the text normalization.

Accordingly, the present invention has been devised to solve the above problems, and an object thereof is to provide a text normalization method and apparatus in which maintenance of the text normalization is made easy by applying XML (extensible markup language).

Another object of the present invention is to provide a text normalization method and apparatus in which tuning according to the characteristics of the input data can be performed more easily.

Yet another object of the present invention is to reduce text normalization errors and to help derive rules, by making it easy to reflect the dynamic characteristics of text normalization and by allowing the data to be verified in advance.

To achieve the above objects, a text normalization method using XML (extensible markup language) according to the present invention comprises a first step of parsing an input text sentence into semantic units and converting it into XML data, and a second step of generating final XML data by setting, in the converted XML data, attribute values that take the context of the sentence into account.

Preferably, the first step comprises separating the input text sentence into word (eojeol) units and parsing each separated word unit into semantic units on which semantic analysis is possible, and converting each parsed semantic unit into XML data based on a predefined first XML DTD (Document Type Definition).

Preferably, the parsing step parses the input sentence using at least one of morphological analysis, part-of-speech tagging, and syntactic analysis.

Preferably, the first XML DTD is formed by taking the sentence (<SENT>) element as the root element and forming at least one word (<EOJEOL>) element as a child element of the sentence element, and by forming a plurality of sub-elements for each of the word elements according to its meaning.

Preferably, each sub-element is one of Korean (<KOR>), English (<ENG>), Hanja (<CHI>), number (<NUM>), and symbol (<SYM>).

Preferably, the second step comprises reading the converted XML data and then setting an attribute value for each of the sub-elements, taking the context into account, according to a predefined second XML DTD (Document Type Definition); and generating, according to the applicable rules and in view of the set attribute values, the final XML data for converting the text data under text normalization.

Preferably, for each of the plurality of sub-elements: when the sub-element is a Korean (<KOR>) element, it is defined to include at least one of the attribute values yeal_han ("열한" style: the native Korean reading of 11), sip_il ("십일" style: the Sino-Korean reading of 11), and none (no information needed for conversion); when the sub-element is an English (<ENG>) element, it is defined to include at least one of email (e-mail address), web (web address), acronym (abbreviations and English words), unit (English unit), and none (pronounced letter by letter); when the sub-element is a Hanja (<CHI>) element, it is defined to include at least the attribute value none (converted to the corresponding Hangul); when the sub-element is a number (<NUM>) element, it is defined to include at least one of decimal (decimal point), range (range), score (score), exp (mathematical expression), time (time), date (date), tel (telephone/fax number), post (postal code), ip (IP address), card (card number), account (account number), event (event date), and none (a number with no fixed format, converted by referring to the following <KOR> node); and when the sub-element is a symbol (<SYM>) element, it is defined to include at least one attribute value such as one-byte (1-byte character) or two-byte (2-byte character).

To achieve the above objects, a text normalization apparatus using XML (extensible markup language) according to the present invention comprises: a storage unit that stores text sentences input from outside; a parser that separates the text sentences stored in the storage unit into word units and then parses the separated text data into semantic units on which semantic analysis is possible; an XML data converter that converts the text data parsed by the parser into XML data, word unit by word unit, based on a predefined first XML DTD (Document Type Definition); an attribute setter that reads the XML data converted by the XML data converter, determines the preceding and following context, and sets attribute values word unit by word unit in view of that context; and an XML data generator that, once the attribute setter has set an attribute value for each word unit, generates the final XML data for converting the text data under text normalization.

Other objects, features, and advantages of the present invention will become apparent from the detailed description of the embodiments with reference to the accompanying drawings.

Preferred embodiments of the text normalization method and apparatus using XML (extensible markup language) according to the present invention are described below with reference to the accompanying drawings.

FIG. 1 is a flowchart illustrating a text normalization method using XML according to the present invention.

As shown in FIG. 1, when text data is input (S100), a sentence analysis step is performed first, in which the sentences of the input text data are parsed into semantic units and converted into XML data (S200).

That is, in the sentence analysis step, the sentences of the input text data are first stored in the storage unit. The parser then separates them into word units by performing morphological analysis, part-of-speech tagging, syntactic analysis, and the like, and parses each separated word unit into semantic units on which semantic analysis is possible (S210).
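The parsing in S210 can be sketched roughly as follows. This is only an illustration: the source relies on morphological analysis, part-of-speech tagging, and syntactic analysis, whereas this sketch substitutes a simple character-class split. The names `TOKEN_PATTERN`, `split_eojeol`, and `parse_units` are hypothetical, and the group names merely borrow the sub-element tags that appear later in the description.

```python
import re

# Assumption: character classes stand in for real morphological analysis.
# Group names reuse the sub-element tags described in the text.
TOKEN_PATTERN = re.compile(
    r"(?P<KOR>[\uac00-\ud7a3]+)"    # Hangul syllables
    r"|(?P<ENG>[A-Za-z]+)"          # Latin letters
    r"|(?P<CHI>[\u4e00-\u9fff]+)"   # CJK unified ideographs (Hanja)
    r"|(?P<NUM>[0-9]+)"             # ASCII digits
    r"|(?P<SYM>\S)"                 # any other single non-space character
)


def split_eojeol(sentence):
    """Split a sentence into word (eojeol) units on whitespace."""
    return sentence.split()


def parse_units(eojeol):
    """Split one word unit into (tag, text) pairs, one per character class."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(eojeol)]


words = split_eojeol("오후 3시 KBS1")
parsed = [parse_units(w) for w in words]
```

A real system would replace `parse_units` with a morphological analyzer; the point here is only the shape of the output, a list of tagged units per word.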

The text data parsed into semantic units is then converted into XML data (S220). This conversion is performed by the XML data converter based on the first XML DTD (Document Type Definition) of FIG. 2A.

Referring to FIG. 2A, the first XML DTD has the sentence (<SENT>) element (10) as its root element and includes a plurality of word (<EOJEOL>) elements (20) as child elements of the sentence element (10). Each word element (20) can in turn be divided into a plurality of different sub-elements (30) according to meaning.

In this specification, for ease of explanation, five sub-elements (30) are used: Korean (<KOR>), English (<ENG>), Hanja (<CHI>), number (<NUM>), and symbol (<SYM>). This is only a preferred embodiment, and the set can be extended or changed freely by the designer.

Accordingly, once the input text data has been parsed into semantic units, each parsed item is converted into the corresponding XML data, passing through the root sentence (<SENT>) element (10) and its child word (<EOJEOL>) element (20). That is, the meaning of each parsed item is judged to determine which of the sub-elements (30) of the word element (20) it belongs to, namely Korean (<KOR>), English (<ENG>), Hanja (<CHI>), number (<NUM>), or symbol (<SYM>), and the item is tagged accordingly, whereby it is converted into XML data.
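The conversion in S220 might look like the following minimal sketch, which uses Python's standard `xml.etree.ElementTree`. The function name `to_xml` and the input shape (a list of word units, each a list of `(tag, text)` pairs) are assumptions made for illustration; the actual DTD of FIG. 2A is not reproduced in the text.

```python
import xml.etree.ElementTree as ET


def to_xml(parsed_sentence):
    """Build a SENT > EOJEOL > sub-element tree from tagged word units.

    parsed_sentence: list of word units, each a list of (tag, text) pairs,
    where tag is one of KOR, ENG, CHI, NUM, SYM (hypothetical input shape).
    """
    sent = ET.Element("SENT")
    for units in parsed_sentence:
        eojeol = ET.SubElement(sent, "EOJEOL")
        for tag, text in units:
            # Only the element is added here; no attribute is set yet.
            ET.SubElement(eojeol, tag).text = text
    return sent


parsed = [[("NUM", "3"), ("KOR", "시")], [("ENG", "KBS")]]
xml_text = ET.tostring(to_xml(parsed), encoding="unicode")
# xml_text ==
# '<SENT><EOJEOL><NUM>3</NUM><KOR>시</KOR></EOJEOL><EOJEOL><ENG>KBS</ENG></EOJEOL></SENT>'
```

Leaving the attributes unset at this stage mirrors the description: elements are tagged by meaning first, and context-dependent attributes are added in a later pass.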

In the sentence analysis step, only the XML element corresponding to its meaning is added to each parsed item; no attribute value is set. This is because setting an attribute value requires an accurate grasp of the preceding and following context, which is not easy to obtain during the sentence analysis step.

However, the attribute value of each word element (20) plays a very important role in the text normalization conversion, and so must be determined and set accurately.

Therefore, when the sentence analysis step is finished, a sentence conversion step is performed (S300): the converted XML data is read in based on the second XML DTD (Document Type Definition) for the sub-elements (30) shown in FIG. 2B, and attribute values that take the context into account are set, generating the final XML data for the text normalization conversion.

The attribute value of each word element can be determined in the sentence conversion step because the parsed data was converted into XML data in the sentence analysis step, which makes the preceding and following context easy to examine.

That is, in the sentence conversion step, the attribute setter first reads in the XML data converted in the sentence analysis step, determines the preceding and following context, and sets accurate attribute values in view of that context (S310).
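A context-aware attribute pass such as S310 could be sketched as below. Everything specific here is an assumption: the attribute name `type`, the two small counter-word lists, and the rule itself (a <KOR> unit that follows a <NUM> unit decides between the `yeal_han` and `sip_il` readings) are illustrative and are not taken from the figures.

```python
import xml.etree.ElementTree as ET

# Hypothetical counter-word lists; a real system would use a larger lexicon.
NATIVE_COUNTERS = ("시", "개", "명", "살")              # take native Korean numerals
SINO_COUNTERS = ("분", "초", "원", "년", "월", "일")    # take Sino-Korean numerals


def set_attributes(sent, attr="type"):
    """Set an attribute on NUM and KOR sub-elements using sibling context."""
    for eojeol in sent.iter("EOJEOL"):
        children = list(eojeol)
        # Pair each child with its preceding sibling (None for the first).
        for prev, cur in zip([None] + children[:-1], children):
            if cur.tag == "NUM":
                # Unformatted number: its reading depends on the next KOR node.
                cur.set(attr, "none")
            elif cur.tag == "KOR":
                text = cur.text or ""
                after_num = prev is not None and prev.tag == "NUM"
                if after_num and text.startswith(NATIVE_COUNTERS):
                    cur.set(attr, "yeal_han")
                elif after_num and text.startswith(SINO_COUNTERS):
                    cur.set(attr, "sip_il")
                else:
                    cur.set(attr, "none")
    return sent


sent = ET.fromstring(
    "<SENT><EOJEOL><NUM>11</NUM><KOR>시</KOR></EOJEOL>"
    "<EOJEOL><NUM>11</NUM><KOR>분</KOR></EOJEOL></SENT>"
)
set_attributes(sent)
kor_types = [k.get("type") for k in sent.iter("KOR")]
```

Because the data is already a tree, looking at the preceding sibling is a one-line operation, which is the advantage the description attributes to converting to XML before setting attributes.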

The attribute values under the second XML DTD (Document Type Definition) for the sub-elements (30) of FIG. 2B differ for each word element (20). Accordingly, for each word element (20), the applicable rule can be defined in view of the attribute value and the context, as shown in FIG. 3.

FIG. 3 shows a preferred embodiment of the attribute values of word elements according to the present invention. FIG. 3 is only a preferred embodiment of the attribute values of the word elements (20), given for ease of explanation; the designer may change or add to it.

Referring to FIG. 3, the word (<EOJEOL>) element (20) is divided into the sub-elements (30) Korean (<KOR>), English (<ENG>), Hanja (<CHI>), number (<NUM>), and symbol (<SYM>).

The Korean (<KOR>) element has at least one attribute value such as yeal_han ("열한" style: the native Korean reading of 11), sip_il ("십일" style: the Sino-Korean reading of 11), or none (no information needed for conversion).

The English (<ENG>) element has at least one attribute value such as email (e-mail address), web (web address), acronym (abbreviations and English words), unit (English unit), or none (pronounced letter by letter).

The Hanja (<CHI>) element has at least one attribute value such as none (converted to the corresponding Hangul).

The number (<NUM>) element has at least one attribute value such as decimal (decimal point), range (range), score (score), exp (mathematical expression), time (time), date (date), tel (telephone/fax number), post (postal code), ip (IP address), card (card number), account (account number), event (event date), or none (a number with no fixed format, converted by referring to the following <KOR> node).

The symbol (<SYM>) element has at least one attribute value such as one-byte (1-byte character) or two-byte (2-byte character).
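The attribute values listed above can be collected into a lookup table, for example to check a value before it is written to an element. The table contents come from the description; the names `ALLOWED_ATTRIBUTES` and `is_valid` are hypothetical, and the source states that the designer may extend these sets.

```python
# Attribute values per sub-element, as listed in the description of FIG. 3.
ALLOWED_ATTRIBUTES = {
    "KOR": {"yeal_han", "sip_il", "none"},
    "ENG": {"email", "web", "acronym", "unit", "none"},
    "CHI": {"none"},
    "NUM": {
        "decimal", "range", "score", "exp", "time", "date", "tel",
        "post", "ip", "card", "account", "event", "none",
    },
    "SYM": {"one-byte", "two-byte"},
}


def is_valid(tag, value):
    """Return True if value is a listed attribute value for the given tag."""
    return value in ALLOWED_ATTRIBUTES.get(tag, set())
```

Keeping the sets in data rather than in code reflects the maintenance argument of the document: adding a value means editing a table, not a function.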

Once the attribute setter has set an attribute value for each word element (20) in this way, the XML data generator generates the final XML data for converting the text data under text normalization, applying the applicable rules in view of the context-dependent attribute values, as shown in FIG. 4 (S320). Because the attribute value of the word element (20) plays the most important role in the text normalization conversion, it must be determined and set accurately.
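To illustrate why the attribute value matters for the eventual conversion, the sketch below reads a number in either the native Korean or the Sino-Korean style, depending on the attribute of the following <KOR> node. The source does not give its conversion rules, so the function names, the restriction to 1 through 99, and the use of prenominal native forms (한, 두, 세, 네, 스무) are all assumptions.

```python
SINO_DIGITS = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
NATIVE_ONES = ["", "한", "두", "세", "네", "다섯", "여섯", "일곱", "여덟", "아홉"]
NATIVE_TENS = ["", "열", "스물", "서른", "마흔", "쉰", "예순", "일흔", "여든", "아흔"]


def read_sino(n):
    """Sino-Korean reading of 1..99, e.g. 11 -> 십일."""
    tens, ones = divmod(n, 10)
    if tens == 0:
        tens_part = ""
    elif tens == 1:
        tens_part = "십"                      # 10..19 drop the leading 일
    else:
        tens_part = SINO_DIGITS[tens] + "십"
    return tens_part + SINO_DIGITS[ones]


def read_native(n):
    """Native Korean prenominal reading of 1..99, e.g. 11 -> 열한."""
    if n == 20:
        return "스무"                         # 20 has a special prenominal form
    tens, ones = divmod(n, 10)
    return NATIVE_TENS[tens] + NATIVE_ONES[ones]


def normalize_number(digits, kor_attr):
    """Read a digit string according to the attribute of the next KOR node."""
    n = int(digits)
    if not 1 <= n <= 99:
        raise ValueError("this sketch only covers 1..99")
    return read_native(n) if kor_attr == "yeal_han" else read_sino(n)
```

With this sketch, `normalize_number("11", "yeal_han")` gives 열한 and `normalize_number("11", "sip_il")` gives 십일, the two readings the attribute names are taken from.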

FIG. 5 shows a preferred embodiment in which the XML data of FIG. 4 is represented as a DOM (Document Object Model) tree.

As shown in FIG. 5, the XML data is organized as a tree. Unlike a database, it has no fixed, predefined set of tags, so new tags can easily be added as needed, which makes the data structure and the data processing highly flexible.
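A DOM view of such data can be produced with Python's standard `xml.dom.minidom`, as in the sketch below. The sample document and the attribute name `type` are assumptions (FIG. 4 and FIG. 5 are not reproduced in the text), and `dump` is a hypothetical helper that lists the tree with indentation.

```python
from xml.dom import minidom

# Hypothetical final XML data; the attribute name "type" is an assumption.
doc = minidom.parseString(
    '<SENT><EOJEOL><NUM type="none">11</NUM>'
    '<KOR type="yeal_han">시</KOR></EOJEOL></SENT>'
)


def dump(node, depth=0):
    """Return the DOM subtree as indented lines, one per element or text node."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        attrs = " ".join(f'{k}="{v}"' for k, v in node.attributes.items())
        label = node.tagName + (f" [{attrs}]" if attrs else "")
        lines.append("  " * depth + label)
        for child in node.childNodes:
            lines.extend(dump(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append("  " * depth + repr(node.data))
    return lines


tree_lines = dump(doc.documentElement)
```

Adding a new kind of node needs no schema migration in this representation: `doc.createElement` with a new tag name can be appended to any <EOJEOL> element, although a validating DTD would of course have to be updated to accept it.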

The best mode of the present invention has been disclosed above through the detailed description and drawings. The terms used herein serve only to describe the present invention and are not intended to limit its meaning or the scope of the invention set forth in the claims. Those of ordinary skill in the art will therefore understand that various modifications and other equivalent embodiments are possible. Accordingly, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

As described above, the text normalization method and apparatus using XML according to the present invention have the following effects.

First, applying XML to text normalization makes it easy to reflect the dynamic characteristics of text normalization and allows the data to be verified in advance, which helps reduce text normalization errors and derive rules.

Second, because the XML data is organized as a tree and, unlike a database, has no fixed, predefined set of tags, new tags can be added as needed, providing flexibility in the data structure and in data processing.

Third, using XML supports multiple data formats and can accommodate all existing data structures.

Claims

A text normalization method using XML (extensible markup language), comprising:

a first step of parsing an input text sentence into semantic units and converting it into XML data; and

a second step of generating final XML data by setting, in the converted XML data, attribute values that take the context of the sentence into account.

The method of claim 1, wherein the first step comprises:

dividing the input text sentence into word units and parsing each divided word unit into semantic units on which semantic analysis is possible; and

converting each of the parsed semantic units into XML data based on a predefined first XML DTD (Document Type Definition).

The method of claim 2,

wherein the parsing is performed using at least one of morphological analysis, part-of-speech tagging, and syntactic analysis of the input sentence.

The method of claim 2, wherein the first XML DTD is formed by:

taking a sentence (<SENT>) element as the root element and forming at least one word (<EOJEOL>) element as a child element of the sentence element; and

forming a plurality of sub-elements for each of the at least one word element according to its meaning.

The method of claim 4,

wherein each sub-element is one of Korean (<KOR>), English (<ENG>), Hanja (<CHI>), number (<NUM>), and symbol (<SYM>).

The method of claim 5, wherein the second step comprises:

reading the converted XML data and then setting an attribute value for each of the sub-elements, taking the context into account, according to a predefined second XML DTD (Document Type Definition); and

generating, according to the applicable rule and in view of the set attribute values, final XML data for converting the text data under text normalization.

The method of claim 6, wherein, for each of the plurality of sub-elements:

when the sub-element is a Korean (<KOR>) element, it is defined to include at least one of the attribute values yeal_han ("열한" style: the native Korean reading of 11), sip_il ("십일" style: the Sino-Korean reading of 11), and none (no information needed for conversion);

when the sub-element is an English (<ENG>) element, it is defined to include at least one of the attribute values email (e-mail address), web (web address), acronym (abbreviations and English words), unit (English unit), and none (pronounced letter by letter);

when the sub-element is a Hanja (<CHI>) element, it is defined to include at least the attribute value none (converted to the corresponding Hangul);

when the sub-element is a number (<NUM>) element, it is defined to include at least one of the attribute values decimal (decimal point), range (range), score (score), exp (mathematical expression), time (time), date (date), tel (telephone/fax number), post (postal code), ip (IP address), card (card number), account (account number), event (event date), and none (a number with no fixed format, converted by referring to the following <KOR> node); and

when the sub-element is a symbol (<SYM>) element, it is defined to include at least one attribute value such as one-byte (1-byte character) or two-byte (2-byte character).

A text normalization apparatus using XML (extensible markup language), comprising:

a storage unit that stores text sentences;

a parser that separates the text sentences stored in the storage unit into word units and parses the separated text data into semantic units on which semantic analysis is possible;

an XML data converter that converts the text data parsed by the parser into XML data, word unit by word unit, based on a predefined first XML DTD (Document Type Definition);

an attribute setter that reads the XML data converted by the XML data converter, determines the preceding and following context, and sets attribute values word unit by word unit in view of that context; and

an XML data generator that, once the attribute setter has set an attribute value for each word unit, generates final XML data for converting the text data under text normalization.

The apparatus of claim 8,

wherein the first XML DTD (Document Type Definition) has a tree structure in which a sentence element is the root element, a plurality of word elements are included as child elements of the sentence element, and each word element is composed of a plurality of different sub-elements according to meaning.