KR100283023B1 - Standard dictionary editing and file processing method in text dictionary management system

Info

Publication number: KR100283023B1
Application number: KR1019970059052A
Authority: KR
Inventors: 최기선; 한영석; 남기춘; 박동인; 이재성; 이운재
Original assignee: 윤덕용; 한국과학기술원
Priority date: 1997-11-10
Filing date: 1997-11-10
Publication date: 2001-03-02
Also published as: KR19990039091A

Abstract

The present invention relates to a method of defining dictionary and text corpus structures in a text dictionary management system (TDMS). So that the logical structures of various kinds of electronic dictionaries can be described as effectively as possible using SDML, the regular structure of a dictionary (the fixed-format part) is defined in advance as an SGML DTD and must always follow that format, while the remaining content-related part (the part whose format may change) can be defined freely using the mechanisms SGML provides, subject to the restriction rules of SDML. A text corpus can also be stored as plain text, which makes text corpus management possible. Defining dictionary and text corpus structures in this way yields a method that can serve as a standard for Korean documents as well as foreign ones and that is suited to building new dictionaries, managing them, and transforming them into new information.

Description

Method of defining dictionary and text corpus structures in a Text Dictionary Management System (TDMS)

The present invention relates to a method of defining dictionary and text corpus structures in a Text Dictionary Management System (TDMS). In particular, it defines dictionary and text corpus structures using SDML (Standard Dictionary Markup Language): so that the logical structures of various kinds of electronic dictionaries can be described as effectively as possible, the regular structure of a dictionary is given a fixed format and only the content-related part can be defined freely, and a text corpus can be stored as plain text so that the TDMS can also manage text corpora.

To transform and manipulate the large volume of document information recorded by computers effectively, document formats must be unified. As the references below show, document formats are currently converging on SGML (Standard Generalized Markup Language).

<References>

1. Industrial Advancement Administration (1993), "Document Description Language SGML," KSC 5913-1993, Korean Standards Association.

2. Bryan, M. (1988), SGML: An Author's Guide to the Standard Generalized Markup Language, Addison-Wesley.

3. Goldfarb, Charles F. (1990), "The SGML Handbook," Clarendon Press, Oxford.

4. Travis, B. E. and Waldt, D. C. (1995), "The SGML Implementation Guide," Springer-Verlag Berlin Heidelberg, Germany, pp. 3-20.

SGML is a metalanguage that defines the syntax of markup languages, and it can be used to define a wide variety of document structures. In other words, SGML defines the grammar with which a syntactic structure can be defined, and a DTD (Document Type Definition) is a syntactic structure defined using that grammar.

FIG. 1 shows the definition of the element declaration in SGML.

Here, mdo marks the start of a markup declaration and is written "<!". "ELEMENT" is the keyword indicating that an SGML element is being defined. ps denotes a blank; ps+ denotes one or more blanks, and ps* denotes zero or more blanks.

element type is the name of the element being defined. omitted tag minimization indicates whether the start tag and the end tag of the element type being defined may be omitted. (declared content | content model) means that either declared content, which is predefined within SGML, or a content model, with which the user defines the structure of the document, is used; the "|" sign means that one of the two is chosen. Finally, mdc marks the end of the markup declaration and is written ">".
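The parts of an element declaration described above can be illustrated with a small parser. The following Python sketch is a simplification written for this explanation, not the grammar of FIG. 1: the regular expression, the function name `parse_element_declaration`, and the sample declaration are hypothetical, and only the two-character form of omitted tag minimization (`-` or `O` for each of the start and end tags) is handled.

```python
import re

# Simplified shape: mdo "ELEMENT" ps+ element-type ps+ [minimization ps+] content ps* mdc
ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+"                               # mdo ("<!") and the ELEMENT keyword
    r"(?P<name>[A-Za-z][\w.-]*)\s+"               # element type: the name being defined
    r"(?:(?P<start>[-Oo])\s+(?P<end>[-Oo])\s+)?"  # optional omitted tag minimization
    r"(?P<content>.+?)\s*>",                      # declared content or content model, then mdc (">")
    re.DOTALL,
)

def parse_element_declaration(text):
    """Split one element declaration into its named parts."""
    match = ELEMENT_DECL.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an element declaration: {text!r}")
    return {
        "name": match.group("name"),
        # "O" means the tag may be omitted; "-" (or no minimization) means it is required.
        "start_tag_omissible": (match.group("start") or "-").upper() == "O",
        "end_tag_omissible": (match.group("end") or "-").upper() == "O",
        "content": match.group("content"),
    }

decl = parse_element_declaration("<!ELEMENT memo - O (to, from, body)>")
print(decl["name"], decl["content"])
```

In this hypothetical declaration the start tag of `memo` is required, its end tag may be omitted, and the content model is the sequence `(to, from, body)`.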

Following these SGML description rules, various document formats can be defined as DTDs. For example, an SGML DTD such as the one in FIG. 2a is used to mark up a document such as the one in FIG. 2b.

In other words, SGML is the grammar of the language (DTD) in which document structures are defined; a DTD is a definition of a document's structure written according to the SGML rules; and a markup document is an ordinary document marked up according to its structure with the tags defined in the DTD. A document marked up according to an SGML DTD is generally also called an SGML document.

For documents in various fields that are recorded as text, the TEI (Text Encoding Initiative) has proposed several standard forms, as described in the references below.

<References>

5. Ide, Nancy and Sperberg-McQueen, C. M. (1995), "The TEI: History, Goals, and Future," in Ide, Nancy and Veronis, Jean [20], pp. 5-15.

6. Ide, Nancy and Veronis, Jean (1995), editors, "Text Encoding Initiative: Background and Context," Kluwer Academic Publishers, Dordrecht/Boston/London.

7. Sperberg-McQueen, C. M. and Burnard, Lou (1994), "Guidelines for Electronic Text Encoding and Interchange (TEI P3)," Vol. I and Vol. II, ACH, ACL, ALLC.

8. Sperberg-McQueen, C. M. and Burnard, Lou (1995), "The Design of the TEI Encoding Scheme," in Ide, Nancy and Veronis, Jean [20], pp. 17-39.

For example, FIG. 3 shows part of the dictionary-related DTD defined by the TEI. This definition describes a structure in which, after the entry tag, any of hom, sense, form, gramGrp, usg, def, etym, eg, and so on may be used repeatedly in any order.

However, the standards in the references above take as their basis only the form of documents, in particular Western documents, and the information needed to manage such documents is defined in the manner described in the following reference.

<References>

9. ISO/IEC (1994), "Information technology - Text and office system - Document Style Semantics and Specification Language (DSSSL) - Draft," ISO/IEC DIS 10179.2.

Adopting these foreign-document-oriented standards as they are for Korean documents is problematic, and they are particularly limited when it comes to managing and transforming documents such as dictionaries, which contain densely accumulated information. Taking the TEI case in FIG. 3 as an example, hom could be read as a synonym, sense as a meaning, and so on, and used as tags, but the tags need to be defined so that they fit Korean dictionaries precisely. In particular, tags for "academic field," "dialect," and the like, which appear in Korean dictionaries, are missing, and "etymology" lacks the distinction between Chinese-character (hanja) words, loanwords, and native Korean words.

In other words, the TEI does define a dictionary structure, and that structure is very generally applicable, but as explained above it gives no consideration to Korean dictionaries and is therefore difficult to apply to them.

Hence the need became apparent for a standard notation with which dictionary-type documents can be newly built, managed, and transformed into new information.

To address this problem, Kang Beam-mo proposed a new structure suited to Korean dictionaries; FIG. 4 shows part of this structure in hierarchical form. (Reference: Kang, Beam-mo (1996), "Using the TEI Scheme in compiling a Korean Dictionary", ALLC-ACH '96, Bergen, Norway.)

However, the TEI scheme and Kang's scheme, which has the structure shown in FIG. 4, take so many cases into account that they are computationally too complex for a practical text dictionary management system to process, and both the module that parses the structure and the use of the parsed results pose many difficulties.

More specifically, in FIG. 4 the hom element one level below entry is used recursively beneath hom itself, so the scheme allows for cases in which that level is, in theory, repeated indefinitely. Technically, a limit on the recursion depth can be imposed to reduce the computational complexity (the processing overhead of parsing each level and handling its content), but a large amount of computation is still required. Furthermore, a parsing module must check whether a TEI document conforms to the grammar (DTD) defined by the TEI and, if it does, convert the content into a form that a computer can process easily and store it. The stored form is usually a tree, but with a recursive definition such as the one above the possibility of unbounded repetition must be considered, which raises the problem of generating an unboundedly deep tree. Avoiding this requires a lower node of the tree to point back to a higher node, so that a cycle is in effect formed and the structure is no longer a tree; handling it then requires further complex processing.
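The recursion problem can be made concrete by treating a DTD as a directed graph that links each element to the elements its content model mentions. The Python sketch below is illustrative only: `TEI_LIKE` is a hypothetical simplification in which hom may contain hom, not the actual content of FIG. 4, and `find_cycle` and `FLAT` are names introduced here.

```python
def find_cycle(children, root):
    """Return a cycle as a list whose first and last items are equal, or None if there is none."""
    path = []        # elements on the current root-to-node path, in order
    on_path = set()  # same elements, for fast membership tests
    done = set()     # elements already fully explored without finding a cycle

    def visit(name):
        if name in on_path:
            # The element reappears beneath itself: the "tree" is really a graph.
            return path[path.index(name):] + [name]
        if name in done:
            return None
        on_path.add(name)
        path.append(name)
        for child in children.get(name, ()):
            cycle = visit(child)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(name)
        done.add(name)
        return None

    return visit(root)

# Hypothetical element graphs: one recursive, one with a bounded depth.
TEI_LIKE = {"entry": ["hom", "sense"], "hom": ["hom", "sense"], "sense": ["def"]}
FLAT = {"entry": ["wname", "body"], "body": ["pos", "sense"]}

print(find_cycle(TEI_LIKE, "entry"))  # ['hom', 'hom']
print(find_cycle(FLAT, "entry"))      # None
```

A structure for which `find_cycle` returns None has a bounded depth, so a parser can store it as an ordinary tree.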

Alternatively, a Korean dictionary can be modeled with a structure different from the TEI or Kang schemes, namely the structure shown in FIG. 5 and described in the references below. However, because this structure represents the dictionary in an overly rigid form for the sake of computational simplicity, it cannot express a moderate degree of variety (an intermediate form, as opposed to the full variety seen in the TEI and Kang schemes, whose recursive structures and the like complicate implementation). It is therefore also unsuitable as a general-purpose dictionary DTD.

<References>

10. Byeong-jin Choi, Jae-Sung Lee, Woon-Jae Lee, Ki-Sun Choi (1996), "Logical Structure of General Dictionaries for Standardization," Proceedings of the 8th Conference on Hangul and Korean Information Processing.

11. Byeong-jin Choi, Jae-Sung Lee, Woon-Jae Lee, Ki-Sun Choi (1996), "Logical Structure of Dictionary Entries for Building Machine-Readable Dictionaries," Cognitive Science 7-2.

Moreover, existing methods of defining dictionary structures say nothing at all about how a management system should display each tag. This is said to leave room for a wider variety of management systems, but it is a drawback when the structure is to be used with a standard management system.

To solve the problems described above, the object of the present invention is to provide a method of defining dictionary and text corpus structures in a text dictionary management system (TDMS) that can serve as a standard for Korean documents as well as foreign ones and that is suited to building new dictionaries, managing them, and transforming them into new information.

FIG. 1 shows the definition of the element declaration in SGML.

FIG. 2a shows an example of an SGML DTD.

FIG. 2b shows an example of a document marked up with the SGML DTD of FIG. 2a.

FIG. 3 shows part of the dictionary-related DTD defined by the TEI.

FIG. 4 shows part of the Korean dictionary structure proposed by Kang Beam-mo.

FIG. 5 shows part of the structure of a Korean-language dictionary.

FIG. 6 shows an example of a dictionary structure expressed in SDML according to the present invention.

FIG. 7 shows an example of a redefinition of the <body> substructure of FIG. 3.

FIG. 8 shows an example of a text corpus structure expressed in SDML according to the present invention.

FIG. 9 shows the SGML content model definition scheme.

To achieve the above object, the present invention is characterized in that dictionary and text corpus structures are defined using SDML, in which the regular structure of a dictionary has a fixed format so that the logical structures of various kinds of electronic dictionaries can be described as effectively as possible, the remaining content-related part can be defined freely, and a text corpus can be stored as plain text so that text corpora can be managed.

A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

First, in the SDML used to define the dictionary and text corpus structures of the present invention, the fixed-format part is defined in advance as an SGML DTD and must always be followed, while the part whose format may change can be defined freely using the mechanisms SGML provides, subject to the restriction rules of SDML.

An SDML document consists broadly of a header area, which contains information such as the name and version of the dictionary; a definition area, which specifies the values that each attribute used within the dictionary may take; and either an entry group area, which holds the actual content of the dictionary as a repetition of entries each containing a headword, or a text corpus file group area, in which the contents of text corpus files are repeated.

FIGS. 6 and 8 show the dictionary and text corpus structures hierarchically using SDML as described above. The standard dictionary element is the top level. Beneath it are an element that serves as the header of the standard dictionary and holds information such as the dictionary's name and version, an element that defines the tag attributes that may be used in the user-defined tags of the standard dictionary, and an element that groups dictionary entries or text corpus files. The dictionary entry group element contains, in its lower levels, a headword element and an element for the description attached to the headword.

The terms and names appearing in FIGS. 6 to 8 are summarized briefly in Table 1 below.

Tag          Function and role
<sd>         Standard dictionary element
<sdHeader>   Header of the standard dictionary; holds information such as the dictionary's name and version
<sdFront>    Defines the tag attributes that may be used in the user-defined tags of the standard dictionary
<group>      Group of dictionary entries or files
<entry>      Dictionary entry
<wname>      Headword of a dictionary entry
<body>       Description and other content attached to the headword
<tdmsfile>   Content of a file

In a document structure definition, a term or name used as a tag is generally enclosed in angle brackets to distinguish it from other data: enclosing it in < > marks the start of the item, and enclosing it in </ > marks the end of the item.

The terms and names described above are in fact defined in a DTD such as the one shown in FIG. 7 and may be defined differently depending on the intended purpose.

In the present invention, however, the following tags are defined as special tags and are distinguished from ordinary (user-definable) tags.

That is, <sd> represents the standard dictionary as a whole and has <sdHeader>, <sdFront>, and <group> as its components. <entry> is a child of <group> and in turn has <wname> and <body> as its children (see FIG. 6).

Likewise, as shown in FIG. 8, <group> may have <tdmsfile> as a child instead of <entry>; semantically, this element holds the content of a text corpus file rather than a dictionary entry.

The structure of the dictionary and text corpus according to the present invention is now described with reference to the items in Table 1.

<sdHeader>, the header area, contains information such as the name and version of the dictionary. <sdFront>, the definition area, specifies the values that each attribute used within the dictionary may take. <group>, the group area, holds the actual content of the dictionary and consists of a repetition of <entry> elements, each containing a headword, or of <tdmsfile> elements, each containing the content of a text corpus file.
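The fixed part of this hierarchy can be sketched in Python. A fully tagged SDML instance with no tag minimization is also well-formed XML, so the standard `xml.etree.ElementTree` module stands in for an SGML parser here. The sample content, the plain-text header, the assumed child order (sdHeader, sdFront, group), and the helper `check_sdml` are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical SDML instance; the header text and entries are made up.
SAMPLE = """
<sd>
  <sdHeader>Sample dictionary, version 1.0</sdHeader>
  <sdFront></sdFront>
  <group>
    <entry><wname>apple</wname><body>a round fruit</body></entry>
    <entry><wname>book</wname><body>bound printed pages</body></entry>
  </group>
</sd>
"""

def check_sdml(root):
    """Check the fixed part of the structure: sd > (sdHeader, sdFront, group)."""
    if root.tag != "sd":
        return False
    if [child.tag for child in root] != ["sdHeader", "sdFront", "group"]:
        return False
    members = list(root.find("group"))
    # A group repeats either dictionary entries or text corpus files, not a mix.
    all_entries = all(m.tag == "entry" for m in members)
    all_files = all(m.tag == "tdmsfile" for m in members)
    if not (all_entries or all_files):
        return False
    # Each dictionary entry holds a headword followed by its description.
    return all([c.tag for c in m] == ["wname", "body"] for m in members if m.tag == "entry")

root = ET.fromstring(SAMPLE.strip())
headwords = [entry.findtext("wname") for entry in root.iter("entry")]
print(check_sdml(root), headwords)
```

Only `<body>` is left open in this check, which matches the idea that the regular structure is fixed while the content-related part is free.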

In the present invention, only the content-related part of the dictionary structure defined in FIG. 6 can be defined freely. The freely definable [standard dictionary element] part is the content of <body>; it may consist of a single simple field, as in the default definition, or it may have a complex tree structure.

To this end, the user can define whatever dictionary structure is needed and change the structure of <body> accordingly.

Initially, <body> is defined as follows.

<!ELEMENT body ........ (#PCDATA)*>

Accordingly, to change to a new structure, only this element is redefined (overwritten).

For example, as shown in FIG. 7, the <body> structure can be changed to a new structure containing <품사> (part of speech) and <대역어> (translation equivalent). Through such redefinition, each application can define and use the structure and tags it needs.
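Overwriting only the `<body>` declaration can be sketched as follows. This is a minimal illustration, assuming the DTD is held as a Python dictionary that maps element names to declaration strings. The declarations in `BASE_DTD` are simplified reconstructions with tag minimization left out, and the new content model `(품사, 대역어+)` is hypothetical because the content of FIG. 7 is not reproduced in the text; `redefine_body` is a name introduced here.

```python
# Simplified, assumed declarations for the fixed part of the structure.
BASE_DTD = {
    "sd": "<!ELEMENT sd (sdHeader, sdFront, group)>",
    "entry": "<!ELEMENT entry (wname, body)>",
    "wname": "<!ELEMENT wname (#PCDATA)>",
    "body": "<!ELEMENT body (#PCDATA)*>",  # default: a single simple field
}

def redefine_body(dtd, content_model, new_elements):
    """Return a copy of dtd in which only <body> is overwritten and new leaf elements are added."""
    updated = dict(dtd)  # the original DTD is left untouched
    updated["body"] = f"<!ELEMENT body {content_model}>"
    for name in new_elements:
        # New user-defined tags are declared as plain character data in this sketch.
        updated[name] = f"<!ELEMENT {name} (#PCDATA)>"
    return updated

custom = redefine_body(BASE_DTD, "(품사, 대역어+)", ["품사", "대역어"])
print(custom["body"])
```

Every declaration other than `body` is carried over unchanged, so an application-specific dictionary still shares the fixed outer structure.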

A new structure is in general defined in the same form as in SGML, but SDML imposes several restrictive rules, which are outlined below.

Condition 1: SDML basically represents the structure of a document as a tree using the SGML content model definition scheme of FIG. 9. The basic constructs that may be used are sequence (,), alternation (|), repetition (*), one-or-more repetition (+), and optional (?). Free sequence (&) is excluded because it is meaningless for determining the format of a dictionary.

Condition 2: Recursive tag structures, which can otherwise appear in the representation of a document structure, are not allowed.

Condition 3: Tag names are used as field names in the editor for documents defined in SDML (SDE: Standard Dictionary Editor), so they should be chosen with this in mind.

Condition 4: When display information for the editor is used, it is included in the tag using the attribute names and attribute values specified by the TDMS. If no display information is specified, the editor shows the field in the default display style.

Here, display information is the information used when each field is displayed on screen, such as the font size and typeface and the size or type of the input or display window. Attributes are used to define properties that a dictionary editor or similar tool can use internally, or things that are not quite suitable for expression as tags.

For example, the content of a <동의어> (synonym) tag can be underlined when the dictionary editor displays it on screen, and the location of the synonym's headword can be recorded as an attribute of the <동의어> tag.

The default display style is to use a single-line input window when content is entered into a tag whose element type is defined as #PCDATA or the like. It may vary from one implementation to another, but it basically refers to the default value that the editor assigns initially.

Condition 5: If a tag has already been defined, its name should be reused as is wherever possible so that compatibility is maintained.

These are the restrictive rules that apply to SDML. Condition 5 is in effect intended to standardize tags.
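Condition 1 lends itself to a mechanical check. The Python sketch below tokenizes a content model string and reports any use of the free sequence connector; it is an illustration under simplifying assumptions (no parameter entities, inclusions, or exclusions), and the names `TOKEN`, `ALLOWED_OPERATORS`, and `check_content_model` are hypothetical.

```python
import re

# Connectors and occurrence indicators that Condition 1 permits.
ALLOWED_OPERATORS = {",", "|", "*", "+", "?"}

# A token is a name (optionally a reserved name such as #PCDATA) or one punctuation mark.
TOKEN = re.compile(r"#?[^\s(),|&*+?]+|[(),|&*+?]")

def check_content_model(model):
    """Return (element_names, violations) for one content model string."""
    names, violations = [], []
    for token in TOKEN.findall(model):
        if token in ("(", ")"):
            continue  # grouping only
        if token == "&":
            violations.append("free sequence (&) is not allowed in SDML")
        elif token in ALLOWED_OPERATORS:
            continue
        elif not token.startswith("#"):
            names.append(token)  # an element name referenced by the model
    return names, violations

print(check_content_model("(품사, 대역어+)"))
print(check_content_model("(form & sense)*"))
```

The list of element names returned for each declaration is also the input needed to check Condition 2, since recursion shows up as an element that can reach itself through these references.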

Turning to text corpus management: so that text corpora can be managed, the present invention replaces <entry> with <tdmsfile> in the SDML dictionary structure definition, allows a text corpus to be stored in this tag as plain text, and records related information in the header so that text corpora and dictionaries can be told apart. The structure of a text corpus expressed in SDML is shown in FIG. 8.
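Storing a corpus as plain text inside `<tdmsfile>` elements might look like the following Python sketch. The helpers `build_corpus` and `is_corpus` are hypothetical. The text says the distinguishing information is recorded in the header but does not give its format, so the header string below is a made-up placeholder and `is_corpus` inspects the children of `<group>` as a stand-in.

```python
import xml.etree.ElementTree as ET

def build_corpus(title, texts):
    """Build an <sd> tree whose <group> holds one <tdmsfile> per text."""
    sd = ET.Element("sd")
    # Placeholder header; the real header format is not specified in the text.
    ET.SubElement(sd, "sdHeader").text = f"{title} (text corpus)"
    ET.SubElement(sd, "sdFront")
    group = ET.SubElement(sd, "group")
    for text in texts:
        # Each file's content is stored as plain character data.
        ET.SubElement(group, "tdmsfile").text = text
    return sd

def is_corpus(sd):
    """True if <group> is non-empty and holds only <tdmsfile> elements."""
    members = list(sd.find("group"))
    return bool(members) and all(m.tag == "tdmsfile" for m in members)

corpus = build_corpus("News sample", ["first text", "second text"])
serialized = ET.tostring(corpus, encoding="unicode")
print(is_corpus(corpus), serialized)
```

Because the outer structure is the same as for a dictionary, tools that handle `<sd>`, `<sdHeader>`, `<sdFront>`, and `<group>` need no changes to handle a corpus.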

The SDML used to define the dictionary and text corpus structures of the present invention, as described above, retains most of SGML's structure representation so that the logical structures of various kinds of electronic dictionaries can be described as effectively as possible. That is, it uses SGML's ELEMENT structure definition as is and restricts the use of unnecessary features. In addition, to make electronic dictionary management as efficient as possible while deferring, as far as possible, the definition of standard tags that are still unsettled, the regular structure of a dictionary is given a fixed format and only the content-related part can be defined freely. Following the same principle, management information is also carried within the existing SGML tag format. That is, display information is expressed in attributes, so that tags remain free to be used for defining structure.
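Carrying display information in attributes, with a fallback to a default display style, can be sketched in Python as follows. The attribute names `underline`, `input`, and `ref`, their values, and the helper `display_settings` are hypothetical; the actual attribute names and values are those specified by the TDMS, which the text does not list.

```python
import xml.etree.ElementTree as ET

# Hypothetical default display style: a single-line input window, no underline.
DEFAULT_DISPLAY = {"input": "single-line", "underline": "no"}

def display_settings(element):
    """Merge an element's display attributes over the editor defaults."""
    settings = dict(DEFAULT_DISPLAY)
    # Only known display attributes override defaults; other attributes
    # (such as a reference to another headword) are left to the application.
    settings.update({k: v for k, v in element.attrib.items() if k in DEFAULT_DISPLAY})
    return settings

synonym = ET.fromstring('<동의어 underline="yes" ref="entry-42">사과</동의어>')
plain = ET.fromstring("<대역어>apple</대역어>")

print(display_settings(synonym), synonym.get("ref"))
print(display_settings(plain))
```

The tag name still describes only structure (a synonym field), while the way it is shown and the headword it points to travel in attributes.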

By restricting some features and taking into account the information used by the management system, the method of defining dictionaries and text corpora according to the present invention can be used as an effective markup language for standard dictionaries when dictionary-type documents are newly built, managed, and transformed into new information.

Claims

1. A method of defining dictionary and text corpus structures in a text dictionary management system, wherein the dictionary and text corpus structures are defined using SDML, in which the regular structure (fixed-format part) of a dictionary is defined in the form of an SGML DTD and given a fixed format, the part related to the remaining contents can be defined freely using the mechanisms defined by SGML, and a text corpus can be stored as plain text so that text corpus management is possible.

2. The method according to claim 1, wherein the SDML consists of a header area containing information such as the name and version of the dictionary, a definition area specifying the values that each attribute used within the dictionary may take, and either an entry group area, which holds the actual content of the dictionary as a repetition of entries each containing a headword, or a text corpus file group area, in which the contents of text corpus files are repeated; and wherein the hierarchy is defined such that the standard dictionary element is the top level, beneath which are an element that serves as the header of the standard dictionary and holds information such as the dictionary's name and version, an element that defines the tag attributes that may be used in the user-defined tags of the standard dictionary, and an element that groups dictionary entries or text corpus files, the dictionary entry group element containing in its lower levels a headword element and an element for the description attached to the headword.

3. The method according to claim 2, wherein the user can redefine the structure of the element that contains the description and the like attached to the dictionary headword.

4. The method according to claim 3, wherein the redefinition of the structure of the element that contains the description and the like attached to the dictionary headword is performed using the ELEMENT structure definition of SGML.