KR19990039091A

KR19990039091A - Method for defining dictionary and text corpus structures in a text dictionary management system (TDMS)

Info

Publication number: KR19990039091A
Application number: KR1019970059052A
Authority: KR
Inventors: 최기선; 한영석; 남기춘; 박동인; 이재성; 이운재
Original assignee: 박원훈; 한국과학기술연구원
Priority date: 1997-11-10
Filing date: 1997-11-10
Publication date: 1999-06-05
Also published as: KR100283023B1

Abstract

The present invention relates to a method for defining dictionary and text corpus structures in a text dictionary management system (TDMS). Specifically, so that the logical structures of various kinds of electronic dictionaries can be described as effectively as possible using SDML, the regular structure of a dictionary (the fixed-format part) is defined in advance as an SGML DTD and must be followed, while the remaining content-related part (the part whose format may change) can be defined freely using SGML's definition mechanisms, subject to SDML's restriction rules. A text corpus can be stored as plain text, which makes text corpus management possible. Defining dictionary and text corpus structures in this way yields a method that can serve as a standard for Korean documents as well as foreign ones and that is suited to building new dictionaries, managing them, and transforming them into new information.

Description

Method for defining dictionary and text corpus structures in a text dictionary management system

The present invention relates to a method for defining dictionary and text corpus structures in a text dictionary management system (TDMS). In particular, it defines these structures using SDML (Standard Dictionary Markup Language): so that the logical structures of various kinds of electronic dictionaries can be described as effectively as possible, the regular structure of a dictionary is given a fixed format and only the content-related part can be defined freely, and a text corpus can be stored as plain text so that the TDMS can also manage text corpora.

To transform and manipulate large amounts of computer-recorded document information effectively, document formats must be unified. As the references below show, document formats are currently converging on SGML (Standard Generalized Markup Language).

<References>

1. Industrial Advancement Administration (1993), "Document Description Language SGML," KSC 5913-1993, Korean Standards Association.

2. Bryan, M. (1988), SGML: An Author's Guide to the Standard Generalized Markup Language, Addison-Wesley.

3. Goldfarb, Charles F. (1990), "The SGML Handbook," Clarendon Press, Oxford.

4. Travis, B. E. and Waldt, D. C. (1995), "The SGML Implementation Guide," Springer-Verlag Berlin Heidelberg, Germany, pp. 3-20.

SGML is a metalanguage that defines the syntax of markup languages, and it can be used to define a wide variety of document structures. In other words, SGML defines the grammar in which a document's syntactic structure can be written, and a DTD (Document Type Definition) is such a structure written in that grammar.

FIG. 1 shows the definition of the Element declaration part of SGML.

Here, mdo marks the start of a markup declaration and is written "<!". "Element" is a keyword indicating that an SGML element is being defined. ps denotes a blank; ps+ denotes one or more blanks, and ps* denotes zero or more blanks.

element type is the name of the element being defined. omitted tag minimization indicates whether the start tag and end tag of that element type may be omitted. (declared content | content model) means that the declaration uses either declared content, which is built into SGML, or a content model, with which the user defines the structure of the document; the "|" sign means that exactly one of the two is chosen. Finally, mdc marks the end of the markup declaration and is written ">".
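The parts of an element declaration named above can be illustrated with a small parser. This is only a sketch under stated assumptions: the regular expression and the function name parse_element_declaration are hypothetical, the pattern covers only simple declarations of the kind discussed in this document, and it is not a full SGML parser.

```python
import re

# Hypothetical pattern: splits one simple SGML element declaration into the
# parts named in the text. It is an illustration, not a complete SGML grammar.
ELEMENT_DECL = re.compile(
    r"""
    <!ELEMENT                                     # mdo followed by the keyword
    \s+ (?P<element_type> [^\s()]+ )              # ps+ and the element type
    (?: \s+ (?P<minimization> [-O] \s+ [-O] ) )?  # optional omitted tag minimization
    \s+ (?P<content> \( .* \) [*+?]? | CDATA | RCDATA | EMPTY | ANY )
    \s* >                                         # ps* and mdc
    """,
    re.VERBOSE | re.DOTALL | re.IGNORECASE,
)


def parse_element_declaration(decl):
    """Return the element type, minimization, and content of a declaration."""
    match = ELEMENT_DECL.fullmatch(decl.strip())
    if match is None:
        raise ValueError(f"not a recognized element declaration: {decl!r}")
    return match.groupdict()


# A declaration without minimization and one with "- O" (end tag omissible).
plain = parse_element_declaration("<!ELEMENT body (#PCDATA)*>")
minimized = parse_element_declaration("<!ELEMENT entry - O (wname, body)>")
```

The second declaration, including its "- O" minimization and its content model, is an invented example and does not come from the figures.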

A wide variety of document formats can be defined as DTDs according to these SGML rules. For example, the SGML DTD shown in FIG. 2a is used to mark up a document such as the example in FIG. 2b.

In other words, SGML is the grammar of the language (the DTD) in which document structures are defined; a DTD is the structure of a document defined according to the SGML rules; and a marked-up document is an ordinary document whose structure has been marked with the tags defined in the DTD. A document marked up according to an SGML DTD is generally also called an SGML document.

For text documents in many fields, the TEI (Text Encoding Initiative) has proposed a number of standard formats, as described in the references below.

<References>

5. Ide, Nancy and Sperberg-McQueen, C. M. (1995), "The TEI: History, Goals, and Future," in Ide, Nancy and Veronis, Jean [20], pp. 5-15.

6. Ide, Nancy and Veronis, Jean (1995), editors, "Text Encoding Initiative Background and Context," Kluwer Academic Publishers, Dordrecht/Boston/London.

7. Sperberg-McQueen, C. M. and Burnard, Lou (1994), "Guidelines for Electronic Text Encoding and Interchange (TEI P3)," Vol. I and Vol. II, ACH, ACL, ALLC.

8. Sperberg-McQueen, C. M. and Burnard, Lou (1995), "The Design of the TEI Encoding Scheme," in Ide, Nancy and Veronis, Jean [20], pp. 17-39.

For example, FIG. 3 shows part of the dictionary-related DTD defined by the TEI. This definition describes a structure in which any of hom, sense, form, gramGrp, usg, def, etym, eg, and so on may follow the entry tag, in any order and any number of times.

However, the standards in the above references take only the form of documents, and Western documents in particular, as their basis, and the information needed to manage such documents is defined in the manner described in the following reference.

<References>

9. ISO/IEC (1994), "Information technology - Text and office system - Document Style Semantics and Specification Language (DSSSL) - Draft," ISO/IEC DIS 10179.2.

Adopting such foreign-document-oriented standards as the standard for Korean documents is problematic, and they are particularly limited when it comes to managing and transforming dictionary-like documents that contain densely aggregated information. Take the TEI structure shown in FIG. 3: hom could be read as "synonym", sense as "meaning", and so on, and used as tags, but the tags need to be defined precisely to fit Korean dictionaries. In particular, tags for the "academic field" and "dialect" labels found in Korean dictionaries are missing, and "etymology" lacks any distinction between Chinese-character, loanword, and native Korean origins.

In other words, the TEI does define a dictionary structure, and that structure is very generally applicable, but as explained above it gives no consideration to Korean dictionaries, so applying it to them is problematic.

This made clear the need for a standard notation for building new dictionary-format documents, managing them, and transforming them into new information.

To address this problem, Kang Beom-mo proposed a new structure suited to Korean dictionaries; part of that structure is shown hierarchically in the attached FIG. 4 (reference: Kang, Beam-mo (1996), "Using the TEI Scheme in compiling a Korean Dictionary," ALLC-ACH '96, Bergen, Norway).

However, the TEI scheme and Kang Beom-mo's scheme, which has the structure shown in FIG. 4, allow for too many cases. They are therefore computationally too complex for a practical text dictionary management system, and both the module that parses the structure and any use of the parsed result become difficult.

More specifically, in FIG. 4 the hom element one level below entry is used again, recursively, under hom itself, so the scheme allows that level to be repeated, in theory, an unlimited number of times. The computational complexity, that is, the processing overhead of parsing each level and handling its content, can technically be reduced by setting a limit on the recursion depth, but a large amount of computation is still required. In addition, the parsing module must check that a TEI document conforms to the grammar rules (DTD) defined by the TEI and, if it does, convert each piece of content into a form the computer can process easily and store it. That stored form is usually a tree. With a recursive definition, the possibility of unlimited repetition must be taken into account, which raises the problem of having to generate an unboundedly deep tree. Avoiding this requires a lower node to point back to a higher node, which in effect creates a cycle, so the structure is no longer a tree and needs further complex processing.
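The cycle described above can be made concrete by treating a DTD as a graph in which each element points to the elements its content model mentions. The following is a minimal sketch, assuming a hypothetical dictionary representation of the element graph; the two sample graphs are simplified guesses based on the text, not the actual content of FIG. 4 or FIG. 6.

```python
def find_recursion(children):
    """Return one cycle of element names as a list, or None if there is none.

    `children` maps each element name to the element names that its content
    model refers to (a hypothetical, simplified view of a DTD).
    """
    IN_PROGRESS, DONE = 1, 2
    state = {}   # element name -> IN_PROGRESS or DONE; absent means unvisited
    path = []    # chain of elements currently being expanded

    def visit(name):
        state[name] = IN_PROGRESS
        path.append(name)
        for child in children.get(name, ()):
            if state.get(child) == IN_PROGRESS:
                # The child is already on the current chain: a lower node
                # refers back to a higher node, so the structure has a cycle.
                return path[path.index(child):] + [child]
            if child not in state:
                cycle = visit(child)
                if cycle:
                    return cycle
        path.pop()
        state[name] = DONE
        return None

    for name in list(children):
        if name not in state:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Assumed shape of a recursive scheme: hom may contain hom again.
recursive_scheme = {"entry": ["hom"], "hom": ["hom", "sense"], "sense": []}
# Assumed shape of a non-recursive scheme with a fixed depth.
flat_scheme = {"entry": ["wname", "body"], "wname": [], "body": []}

cycle = find_recursion(recursive_scheme)
no_cycle = find_recursion(flat_scheme)
```

A parser that stores a tree per element definition can rely on a fixed maximum depth only when such a check finds no cycle.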

A Korean dictionary can also be modeled with a structure different from those of the TEI and Kang Beom-mo, namely the structure shown in FIG. 5, as in the references below. That structure, however, is so rigidly formalized for the sake of computational simplicity that it cannot express a moderate degree of variety (an intermediate level of variety, not the kind that causes implementation complexity such as the recursive structures in the TEI or Kang Beom-mo schemes). It is therefore also problematic to adopt as a general-purpose dictionary DTD.

<References>

10. Choi Byung-jin, Lee Jae-sung, Lee Un-jae, and Choi Ki-sun (1996), "Logical Structure of General Dictionaries for Standardization," Proceedings of the 8th Conference on Hangul and Korean Information Processing.

11. Choi Byung-jin, Lee Jae-sung, Lee Un-jae, and Choi Ki-sun (1996), "Logical Structure of Dictionary Entries for Building Machine-Readable Dictionaries," Cognitive Science (인지과학) 7-2.

Furthermore, existing methods of defining dictionary structure say nothing at all about how the management system should display each tag. This is said to allow a wider variety of management systems to be built, but it is a problem when the definition is to serve as the basis of a standard management system.

To solve these problems, the object of the present invention is to provide a method for defining dictionary and text corpus structures in a text dictionary management system (TDMS) that can serve as a standard for Korean documents as well as foreign ones and that is suited to building new dictionaries, managing them, and transforming them into new information.

FIG. 1 shows the definition of the Element declaration part of SGML.

FIG. 2a is an example of an SGML DTD.

FIG. 2b is an example of a document marked up with the SGML DTD of FIG. 2a.

FIG. 3 is an example of part of the dictionary-related DTD defined by the TEI.

FIG. 4 is an example of part of the Korean dictionary structure proposed by Kang Beom-mo.

FIG. 5 is an example of part of a Korean-language dictionary structure.

FIG. 6 is an example of a dictionary structure expressed in SDML according to the present invention.

FIG. 7 is an example showing the redefinition of the structure of the <body> part of FIG. 3.

FIG. 8 is an example of a text corpus structure expressed in SDML according to the present invention.

FIG. 9 shows the SGML content model definition scheme.

To achieve the above object, the present invention is characterized by defining dictionary and text corpus structures using SDML, in which the regular structure of a dictionary is given a fixed format and the content-related part can be defined freely, so that the logical structures of various kinds of electronic dictionaries can be described as effectively as possible, and in which a text corpus can be stored as plain text so that text corpora can be managed.

A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

First, in the SDML used by the present invention to define dictionary and text corpus structures, the fixed-format part is defined in advance as an SGML DTD and must be followed, while the part whose format may change can be defined freely using SGML's definition mechanisms, subject to SDML's restriction rules.

An SDML document consists broadly of three areas: a header area containing information such as the name and version of the dictionary; a definition area specifying the values that each attribute used inside the dictionary may take; and either an entry group area, which holds the actual content of the dictionary as a repetition of entries each containing a headword, or a text corpus file group area, in which the contents of text corpus files are repeated.

FIG. 6 and FIG. 8 show the dictionary and text corpus structures hierarchically using SDML as described above. The standard dictionary item is the top level. Below it are an item that serves as the header of the standard dictionary and holds information such as the dictionary's name or version, an item that defines the tag attributes available to user-defined tags in the standard dictionary, and an item that is a set of dictionary entries or text corpus files. The dictionary entry set item in turn contains, at the levels below it, a headword item and a description item attached to the dictionary headword.

The terms and names that appear in FIG. 6 to FIG. 8 are summarized in Table 1 below.

Table 1

Tag          Function and role
<sd>         Standard dictionary item
<sdHeader>   Header of the standard dictionary, holding information such as the dictionary's name and version
<sdFront>    Item defining the tag attributes available to user-defined tags in the standard dictionary
<group>      Set of dictionary entries or files
<entry>      Dictionary entry item
<wname>      Headword item of a dictionary entry
<body>       Item holding the description and other content attached to the dictionary headword
<tdmsfile>   Item holding the content of a file

In a document structure definition, a term or name used as a tag is generally enclosed in < > to distinguish it from other data: a name enclosed in < > marks the start of the item, and a name enclosed in </ > marks its end.

These terms and names are in fact defined in a DTD such as the one shown in FIG. 7 and may be defined differently depending on the intended purpose.

In the present invention, however, the following tags are defined as special and are kept distinct from ordinary (user-definable) tags.

<sd> represents the standard dictionary as a whole, and its components are <sdHeader>, <sdFront>, and <group>. <entry> is a child item of <group> and itself contains <wname> and <body> as child items (see FIG. 6).

Likewise, as shown in FIG. 8, <group> may contain <tdmsfile> instead of <entry> as its child item; semantically, the item then holds the content of a text corpus file rather than a dictionary entry.

The structure of the dictionary and text corpus according to the present invention is described below on the basis of the items listed in Table 1.

<sdHeader>, the header area, contains information such as the name and version of the dictionary. <sdFront>, the definition area, specifies the values that each attribute used inside the dictionary may take. <group>, the set area, holds the actual content of the dictionary and consists of a repetition of <entry> items, each containing a headword, or of <tdmsfile> items, each containing the content of a text corpus file.
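The fixed part of this hierarchy can be sketched with Python's standard xml.etree.ElementTree module. This is an illustration only: it treats the SDML document as well-formed XML, the function name build_dictionary and the sample dictionary name, version, and entries are invented, and the internal layout of <sdHeader> and <sdFront> is not specified in this document, so the header is stored here as plain text.

```python
import xml.etree.ElementTree as ET


def build_dictionary(name, version, entries):
    """Build the fixed <sd> skeleton for (headword, description) pairs."""
    sd = ET.Element("sd")
    header = ET.SubElement(sd, "sdHeader")
    header.text = f"{name} {version}"   # assumption: header layout is unspecified
    ET.SubElement(sd, "sdFront")        # attribute value definitions would go here
    group = ET.SubElement(sd, "group")
    for headword, description in entries:
        entry = ET.SubElement(group, "entry")
        ET.SubElement(entry, "wname").text = headword
        ET.SubElement(entry, "body").text = description
    return sd


# Invented sample data: two Korean headwords with English descriptions.
sd = build_dictionary("SampleDict", "1.0", [("사과", "apple"), ("배", "pear")])
serialized = ET.tostring(sd, encoding="unicode")
```

Only <body> carries free content here; everything above it follows the fixed order <sdHeader>, <sdFront>, <group>.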

In the present invention, only the content-related part of the dictionary structure defined in FIG. 6 can be defined freely. This freely definable [standard dictionary element] part is what <body> contains; it may consist of a single simple field, as in the default setting, or it may have a complex tree structure.

To this end, the user can define the dictionary structure that is needed and change the structure of <body> accordingly.

<body> is initially defined as follows.

<!ELEMENT body ........(#PCDATA)*>

Therefore, to change to a new structure, only this element is redefined (overwritten).

For example, as shown in the attached FIG. 7, the <body> structure can be changed into a new structure containing <품사> (part of speech) and <대역어> (target-language equivalent). Through such redefinition, each application program can specify and use the structure and tags it needs.
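The overwrite step can be pictured as replacing a single declaration while leaving the fixed ones untouched. The sketch below makes several assumptions: the dictionary of declarations and the function redefine_body are hypothetical, the <entry> and <wname> declarations and the new content model (품사, 대역어) are guesses rather than the content of FIG. 6 or FIG. 7, and the dotted span in the default <body> declaration is left out.

```python
# Assumed default declarations for the fixed part, written as DTD text.
default_declarations = {
    "entry": "<!ELEMENT entry (wname, body)>",
    "wname": "<!ELEMENT wname (#PCDATA)>",
    "body": "<!ELEMENT body (#PCDATA)*>",
}


def redefine_body(declarations, body_declaration, new_elements):
    """Return a copy in which only <body> is overwritten and new tags are added."""
    updated = dict(declarations)
    updated["body"] = body_declaration   # overwrite the one redefinable element
    updated.update(new_elements)         # declarations for the user-defined tags
    return updated


# Assumed redefinition: <body> becomes a part of speech followed by an equivalent.
custom_declarations = redefine_body(
    default_declarations,
    "<!ELEMENT body (품사, 대역어)>",
    {
        "품사": "<!ELEMENT 품사 (#PCDATA)>",
        "대역어": "<!ELEMENT 대역어 (#PCDATA)>",
    },
)
```

Because the function returns a copy, several applications can derive their own <body> structures from the same defaults.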

A new structure is generally defined in the same form as in SGML, but SDML imposes several restrictions. In outline, the rules are as follows.

Condition 1: SDML basically expresses the structure of a document as a tree, using the SGML content model definition scheme of FIG. 9. The basic constructs that may be used are sequence (,), alternation (|), repetition (*), one or more repetition (+), and optional (?). Free sequence (&) is excluded because it is not meaningful for determining a dictionary format.

Condition 2: Recursive tag structures, which can otherwise arise when expressing a document structure, are not allowed.

Condition 3: Tag names are used as field names in the editor for SDML-defined documents (SDE: Standard Dictionary Editor), so they should be chosen with this in mind.

Condition 4: When display information for the editor is used, it is included in the tag using the attribute names and attribute values specified by the TDMS. If no display information is specified, the editor uses its default display method.

Here, display information is the information used when each field is shown on screen, such as font size and typeface and the size or type of the input or display window. Attributes are used to define properties that a dictionary editor or similar tool can use internally, or things that are not well suited to being expressed as tags.

For example, the <동의어> (synonym) tag can have its content underlined when shown in the dictionary editor, and the location of the synonym's headword can be recorded as an attribute of the tag.

The default display method is to use a single-line input box when content is entered into a tag whose element type is defined as #PCDATA or the like. Although it may differ between implementations, it basically refers to the default values that the editor assigns initially.

Condition 5: If a tag has already been defined, its name should be reused unchanged wherever possible to maintain compatibility.

These are the restrictions that SDML imposes. Condition 5 is in effect intended to standardize tags.
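Condition 1 lends itself to a mechanical check on each content model. The following is a rough sketch: the function check_content_model and the sample content models are hypothetical, and the check only scans for connector and occurrence characters rather than parsing the model.

```python
# Constructs that Condition 1 allows in a content model, keyed by symbol.
ALLOWED_CONSTRUCTS = {
    ",": "sequence",
    "|": "alternation",
    "*": "repetition",
    "+": "one or more repetition",
    "?": "optional",
}


def check_content_model(model):
    """Return the allowed constructs used in `model`; reject free sequence."""
    if "&" in model:
        raise ValueError("free sequence (&) is excluded by Condition 1")
    return sorted({ALLOWED_CONSTRUCTS[ch] for ch in model if ch in ALLOWED_CONSTRUCTS})


# Hypothetical content models for a redefined <body>.
used = check_content_model("(품사, 대역어+)?")

try:
    check_content_model("(품사 & 대역어)")
    rejected = False
except ValueError:
    rejected = True
```

A check for Condition 2 would additionally have to follow the element names in each model and confirm that no element can contain itself.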

Turning to text corpus management: so that text corpora can be managed, the present invention replaces <entry> with <tdmsfile> in the SDML dictionary structure definition, stores the text corpus in this tag as plain text, and records relevant information in the header so that a text corpus can be distinguished from a dictionary. The structure of a text corpus expressed in SDML is shown in the attached FIG. 8.
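The corpus variant can be sketched with Python's standard xml.etree.ElementTree module. The assumptions here are significant: the document does not say how the header marks a corpus, so the kind attribute on <sdHeader> is purely hypothetical, as are the function names and the sample texts, and the SDML document is treated as well-formed XML.

```python
import xml.etree.ElementTree as ET


def build_corpus(name, version, file_texts):
    """Build an <sd> document whose <group> holds <tdmsfile> items."""
    sd = ET.Element("sd")
    # Hypothetical marker: the source only says the header records information
    # that distinguishes a corpus from a dictionary, not how it is recorded.
    header = ET.SubElement(sd, "sdHeader", {"kind": "corpus"})
    header.text = f"{name} {version}"
    ET.SubElement(sd, "sdFront")
    group = ET.SubElement(sd, "group")
    for text in file_texts:
        # Each file is stored as plain text; markup characters are escaped
        # automatically when the tree is serialized.
        ET.SubElement(group, "tdmsfile").text = text
    return sd


def kind_of(sd):
    """Read the hypothetical marker, defaulting to 'dictionary'."""
    return sd.find("sdHeader").get("kind", "dictionary")


corpus = build_corpus("SampleCorpus", "1.0", ["첫 번째 파일 내용", "x < y & z"])
corpus_text = ET.tostring(corpus, encoding="unicode")
restored = ET.fromstring(corpus_text)
```

Round-tripping through the serialized form returns the stored file text unchanged, which is the behavior wanted when a corpus is kept as plain text.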

The SDML used by the present invention to define dictionary and text corpus structures adopts most of SGML's structure notation unchanged so that the logical structures of various kinds of electronic dictionaries can be described as effectively as possible. That is, it uses SGML's ELEMENT structure definitions as they are and restricts the use of unnecessary features. Also, to manage electronic dictionaries as efficiently as possible while deferring, as far as possible, the definition of standard tags that are still uncertain, the regular structure of a dictionary is given a fixed format and only the content-related part can be defined freely. In keeping with this principle, management information is also carried within the existing SGML tag format: display information is expressed as attributes, which leaves tags free to be used for defining structure.

By restricting some features and taking account of the information used by the management system, the method of the present invention for defining dictionaries and text corpora can serve as an effective standard dictionary markup language for building new dictionary-format documents, managing them, and transforming them into new information.

Claims

1. A method for defining dictionary and text corpus structures in a text dictionary management system, characterized by using SDML, in which the regular structure of the dictionary (the fixed-format part) is defined in the form of an SGML DTD so as to have a fixed format, the remaining content-related part (the part whose format may change) can be defined freely using the methods defined in SGML, and a text corpus can be stored as plain text so that text corpora can be managed.

2. The method of claim 1, wherein the SDML consists of a header area containing information such as the name and version of the dictionary; a definition area specifying the values that each attribute used inside the dictionary may take; and either an entry group area, which holds the actual content of the dictionary as a repetition of entries each containing a headword, or a text corpus file group area in which the contents of text corpus files are repeated; and wherein a hierarchy is defined in which the standard dictionary item is the top level, the top level contains below it an item that serves as the header of the standard dictionary and holds information such as the name or version of the dictionary, an item that defines the tag attributes usable in user-defined tags of the standard dictionary, and an item that is a set of dictionary entries or text corpus files, and the dictionary entry set item contains below it a headword item and an item holding the description attached to the dictionary headword.

3. The method of claim 2, wherein the user can redefine the structure of the item holding the description attached to the dictionary headword.

4. The method of claim 3, wherein the structure of the item holding the description attached to the dictionary headword is redefined using the ELEMENT structure definition of SGML.