KR100290665B1

KR100290665B1 - Method for storing and managing structured document in relational database

Info

Publication number: KR100290665B1
Application number: KR1019990001460A
Authority: KR
Inventors: 김태현; 김준하
Original assignee: 남궁석; 삼성에스디에스주식회사
Priority date: 1999-01-19
Filing date: 1999-01-19
Publication date: 2001-05-15
Also published as: KR20000051170A

Abstract

The present invention relates to a relational database management system for storing a structured document, written in a general-purpose language for electronic document standardization, in a relational database and retrieving it efficiently. The method comprises: parsing an electronic document to generate tokens corresponding to tags, content, and entities; storing, in a document table, a document number for the electronic document, order information for each token, and type information for the token at each position; storing, in a tag table, for each tag token, the document number, the tag's type information, relationship information with other tags, and the order information corresponding to the tag; storing, in a content table, for each content token, the document number, the order information corresponding to the content, and the content itself; storing, in an entity table, for each entity token, the document number, the order information corresponding to the entity, and the entity's name; and storing, in an entity content table corresponding to the entity table, the document number, the name of the document type definition, the entity's name, and the content corresponding to the entity. Because the information obtained by parsing an SGML document is reflected directly in the database model, and the document's order information is additionally stored and used, the usability of the model and the performance of systems built on it can be maximized.

Description

Method for storing and managing structured document in relational database

The present invention relates to a relational database management system for storing a structured document, written in a general-purpose language for electronic document standardization, in a relational database and retrieving it efficiently.

As computer-based text processing systems such as word processors, electronic publishing systems, and electronic newspaper production systems become more widespread, it is increasingly important to build and search databases that let document information, once written, be shared across heterogeneous systems regardless of the hardware environment, and to exchange that information between them.

Two international standards allow documents to be exchanged between heterogeneous systems and express the logical structure of a document: ODA (Office Document Architecture, International Standard (IS) 8613) and SGML (Standard Generalized Markup Language, IS 8879). ODA allows the logical structure and the layout structure of a document to be composed in parallel. SGML, in contrast, emphasizes consistency of the markup that expresses the logical structure of document information. ODA therefore restricts the document structures that can be written and is poorly suited to expressing complex documents, whereas SGML carries only a conceptual logical structure and can describe documents of any complexity. Because of this flexible, programmatic character, SGML can specify document content in greater depth than any other standard.

Consequently, SGML is gradually becoming the dominant choice, and research on processing SGML documents with databases has been under way. Most of that work, however, targets object-oriented databases. Storing SGML documents in a relational database management system, the kind most widely used in practice, requires a database model designed for that purpose. Because response time is critical when retrieving SGML documents stored in a database, the model must also be designed with retrieval speed in mind.

The technical problem addressed by the present invention is to provide a relational database management method for SGML documents that stores them in a relational database and, to improve retrieval performance, minimizes the number of queries needed to retrieve a document so that applications built on it perform as well as possible.

FIG. 1 is a diagram showing the relationships among objects related to the content of an SGML document according to the present invention.

FIG. 2 is a diagram showing an example SGML document parsed into tokens.

FIG. 3 shows the structure of the tables used in the database according to the present invention and the relationships among them.

To achieve the above object, a structured document database management method according to the present invention is a method of storing a structured document in a database, in a relational database management system that stores structured documents written in a general-purpose language for electronic document standardization and retrieves them efficiently. The method comprises: parsing the structured document to generate tokens corresponding to tags, content, and entities; storing, in a document table, a document number for the structured document, order information for each token, and type information for the token at each position; storing, in a tag table, for every tag token, the document number, the tag's type information, relationship information with other tags, and the order information corresponding to the tag; storing, in a content table, for every content token, the document number, the order information corresponding to the content, and the content; storing, in an entity table, for every entity token, the document number, the order information corresponding to the entity, and the entity's name; and storing, in an entity content table corresponding to the entity table, the document number, the name of the document type definition, the entity's name, and the content corresponding to the entity.

To achieve the above object, another structured document database management method according to the present invention comprises: parsing the structured document to generate tokens corresponding to tags, content, and entities; storing, in a document table, a document number for the structured document and order information for each token, together with, if the token is a tag, its type information, relationship information with other tags, and the order information of its start and end tags, if the token is content, that content, and if the token is an entity, the entity's name; and storing, in an entity content table, for each entity token, the name of the document type definition, the entity's name, and the content corresponding to the entity.

A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

A database schema describes the logical structure of a database; in SGML, a document type definition (DTD) file describes the logical structure of a document. A DTD describes a document's logical information in units smaller than the document itself. Because a single document consists of many pieces of information, those pieces need to be distinguished logically, and the logical structure expressed by a DTD's descriptive markup is well suited to this.

In general, storing SGML documents in a database requires the database to handle the following information:

(1) Elements within the document: tags, entities, content, attributes, and similar information

(2) Structural relationships between tags

(3) Order information within the document

By its nature, such information is known to be stored most efficiently in an object-oriented database management system. SGML documents define their own internal structure, and that structure can be handled efficiently with the object identifiers (OIDs) that most object-oriented database management systems provide. To store the structure of an SGML document in a relational database management system, by contrast, the information above must be expressed as tables, the logical storage form of a relational database, and the number of SQL (Structured Query Language) statements issued during document retrieval must be minimized to avoid degrading system performance.

Against this background, the present invention proposes a schema design that uses the document's order information to compensate for the structural speed disadvantage of a relational database management system when storing SGML documents.

As described above, the present invention is used to store SGML documents in a relational database management system in segments based on SGML characteristics. It can therefore be applied to the development of SGML-related software products and applications, such as an SGML repository manager that stores and manages SGML documents in a database.

As described above, the present invention is a database model for storing SGML documents in a relational database. Its main points are as follows.

(1) It presents a relational database model based on the document units produced by parsing an SGML document (hereinafter 'tokens'), so that SGML documents can be stored in a relational database.

(2) It stores order information within the document alongside the tokens and uses it during retrieval to improve system performance.

In the present invention, an SGML document is stored as follows: the tokens produced by parsing, which verifies that the document conforms to its document type definition (DTD), are stored in a table in order, and document information is stored according to each token's type. The set of these tokens is called the ESIS (Element Structure Information Set).

FIG. 1 shows the relationships among objects related to the content of an SGML document according to the present invention. In FIG. 1, tags express the structure of the SGML document, and most operations on an SGML document, such as retrieval, are based on them. In retrieval, however, the range of the document to be fetched cannot be left to depend on the tags' structural information alone, because doing so increases the number of queries required and can degrade system performance. To compensate, the table design of the present invention uses the order in which tokens are extracted during parsing as each token's key and stores that order information; using it sharply reduces the number of queries required and improves system performance.

Designing the tables on this basis gives the following.

The SGML document is parsed to generate tokens corresponding to tags, content, and entities, and the following tables are then populated. FIG. 3 shows the structure of these tables and their relationships.

The SGML document table stores the document number of the SGML document, the order information of each token, and the type of the token at each position. The tag table stores, for every tag token, the document number, the tag's type information, relationship information with other tags, and the order information corresponding to the tag. The relationship information stored in the tag table comprises the order information of the first child tag token, of the parent tag token, and of the next sibling tag token. The order information stored in the tag table comprises the order information of the tag token that marks the start and of the tag token that marks the end.

The content table stores, for every content token, the document number, the order information corresponding to the content, and the content. The entity table stores, for every entity token, the document number, the order information corresponding to the entity, and the entity's name. The entity content table (Entity_Content table) stores, corresponding to the entity table, the document number, the name of the document type definition, the entity's name, and the content corresponding to the entity.

SGML Document Table

(doc_id   integer       // document serial number
 seq_id   integer       // order information (serial number) of each token in the document
 type     varchar(255)  // type of each token)

Tag table

(doc_id        integer       // document serial number
 Tag_name      varchar(255)  // tag name
 first_child   integer       // order information of the first child tag
 next_sibling  integer       // order information of the next sibling tag
 parent        integer       // order information of the parent tag
 start_id      integer       // order information of the start_tag
 end_id        integer       // order information of the end_tag)

Content table

(doc_id   integer   // document serial number
 seq_id   integer   // order information within the document
 content  varchar2  // document content)

Entity table

(doc_id  integer       // document serial number
 name    varchar(255)  // entity name)

Entity content table (Entity_Content table)

(DTD_name  varchar(255)  // DTD name
 Ent_name  varchar(255)  // entity name
 content   varchar2      // entity content)
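
The five table definitions above can be tried out with Python's built-in `sqlite3` module. The following is a minimal sketch, not the schema of any particular product: the lowercase table names, the SQLite column types, and the primary keys are assumptions made for illustration, and `seq_id` is included in the entity table because the description and the sample data store it there even though the listing omits it.

```python
import sqlite3

# Column names follow the listings in the text. Table names, SQLite types
# and primary keys are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE sgml_document (
    doc_id  INTEGER,       -- document serial number
    seq_id  INTEGER,       -- position of the token within the document
    type    VARCHAR(255),  -- 'start_tag', 'end_tag', 'content' or 'entity'
    PRIMARY KEY (doc_id, seq_id)
);
CREATE TABLE tag (
    doc_id       INTEGER,
    tag_name     VARCHAR(255),
    first_child  INTEGER,  -- seq_id of the first child tag
    next_sibling INTEGER,  -- seq_id of the next sibling tag
    parent       INTEGER,  -- seq_id of the parent tag
    start_id     INTEGER,  -- seq_id of the start tag
    end_id       INTEGER,  -- seq_id of the end tag
    PRIMARY KEY (doc_id, start_id)
);
CREATE TABLE content (
    doc_id  INTEGER,
    seq_id  INTEGER,
    content TEXT,
    PRIMARY KEY (doc_id, seq_id)
);
CREATE TABLE entity (
    doc_id INTEGER,
    seq_id INTEGER,        -- assumed column, taken from the sample data
    name   VARCHAR(255),
    PRIMARY KEY (doc_id, seq_id)
);
CREATE TABLE entity_content (
    dtd_name VARCHAR(255),
    ent_name VARCHAR(255),
    content  TEXT
);
"""


def create_schema(conn):
    """Create the five tables of the first embodiment."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables)
```

Keying every table on `(doc_id, seq_id)` or `(doc_id, start_id)` reflects the idea in the text that the parse order serves as each token's key.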

The following is an example SGML document; the description that follows explains how this document is stored in each table according to the present invention.

<!DOCTYPE memo SYSTEM 'momo.dtd' [
<!ENTITY DOM 'Document Object Model'>]>
<memo>
<header>
<recipient>Hong Gil-dong</recipient>
<sender>Samsung</sender>
<date>July 24, 1998</date>
</header>
<body>
SRE supports the &DOM; established by the W3C.
</body>
<closing>
Thank you.
</closing>
</memo>

Parsing the SGML document above into tokens gives the result shown in FIG. 2. The circled number in front of each token in the figure is the token's serial number (seq_id).
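
The numbering of tokens can be illustrated with a small tokenizer. This is only a sketch under simplifying assumptions: it uses a regular expression instead of a validating SGML parser, it skips the DOCTYPE declaration and DTD checking, it trims surrounding whitespace from content, and the element names and text are written in English. The names `SAMPLE`, `TOKEN_RE`, and `tokenize` are hypothetical.

```python
import re

# Document instance only; the DOCTYPE declaration is left out of this sketch.
SAMPLE = """<memo>
<header>
<recipient>Hong Gil-dong</recipient>
<sender>Samsung</sender>
<date>July 24, 1998</date>
</header>
<body>
SRE supports the &DOM; established by the W3C.
</body>
<closing>
Thank you.
</closing>
</memo>"""

# One alternative per token kind: end tag, start tag, entity reference, text.
TOKEN_RE = re.compile(
    r"</(?P<end>[^>]+)>|<(?P<start>[^>]+)>|&(?P<entity>\w+);|(?P<text>[^<&]+)")

TYPE_NAMES = {"start": "start_tag", "end": "end_tag",
              "entity": "entity", "text": "content"}


def tokenize(sgml):
    """Return (seq_id, type, value) tuples, numbering tokens from 1."""
    tokens = []
    for match in TOKEN_RE.finditer(sgml):
        kind = match.lastgroup
        value = match.group(kind).strip()
        if kind == "text" and not value:
            continue  # whitespace between tags is not a content token
        tokens.append((len(tokens) + 1, TYPE_NAMES[kind], value))
    return tokens


tokens = tokenize(SAMPLE)
for seq_id, type_, value in tokens:
    print(seq_id, type_, value)
```

For the sample document this yields 21 tokens, with the entity reference `DOM` at position 15 between two content tokens.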

Table 1 shows the contents of the SGML document table: each token is assigned a serial number (seq_id) in order, together with its type ('start_tag', 'end_tag', 'content', or 'entity').

Table 2 shows the contents of the tag table. It stores the name of each tag token ('memo', 'header', 'recipient', 'sender', 'date', 'body', 'closing') and the serial numbers of its child, parent, and sibling tag tokens (first_child, parent, next_sibling). For the 'header' tag token, for example, the tag token that immediately follows it is the 'recipient' token with serial number 3, so '3' is stored in 'first_child'; the tag token immediately above it is the 'memo' token with serial number 1, so '1' is stored in 'parent'; and the next tag token at the same level is the 'body' token with serial number 13, so '13' is stored in 'next_sibling'. The serial numbers of each tag's start and end tokens (start_id, end_id) are stored as well.

[Table 1]
Document number (doc_id)  Serial number (seq_id)  Type (type)
1                         1                       start_tag
1                         2                       start_tag
1                         3                       start_tag
1                         4                       content
1                         5                       end_tag
1                         6                       start_tag
1                         7                       content
1                         8                       end_tag
1                         9                       start_tag
1                         10                      content
1                         11                      end_tag
1                         12                      end_tag
1                         13                      start_tag
1                         14                      content
1                         15                      entity
1                         16                      content
1                         17                      end_tag
1                         18                      start_tag
1                         19                      content
1                         20                      end_tag
1                         21                      end_tag

Table 3 shows the contents of the content table: the serial number (seq_id) of each piece of document content is stored together with the content itself. Table 4 shows the contents of the entity table: the serial number (seq_id) of the position where each entity occurs is stored together with the entity name (name). Table 5 shows the contents of the entity content table: each entity name (Ent_name) is stored together with its content.

[Table 2]
doc_id  tag_name   first_child  next_sibling  parent  start_id  end_id
1       memo       2            -             -       1         21
1       header     3            13            1       2         12
1       recipient  -            6             2       3         5
1       sender     -            9             2       6         8
1       date       -            -             2       9         11
1       body       -            18            1       13        17
1       closing    -            -             1       18        20
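
The relationship columns of the tag table can be derived from the token stream in a single pass. The sketch below assumes a stack-based approach, which the patent text does not spell out; `TOKENS` and `build_tag_rows` are hypothetical names, and the element names are written in English.

```python
# Tokens of the sample document as (seq_id, type, value).
TOKENS = [
    (1, "start_tag", "memo"), (2, "start_tag", "header"),
    (3, "start_tag", "recipient"), (4, "content", "Hong Gil-dong"),
    (5, "end_tag", "recipient"), (6, "start_tag", "sender"),
    (7, "content", "Samsung"), (8, "end_tag", "sender"),
    (9, "start_tag", "date"), (10, "content", "July 24, 1998"),
    (11, "end_tag", "date"), (12, "end_tag", "header"),
    (13, "start_tag", "body"), (14, "content", "SRE supports the"),
    (15, "entity", "DOM"), (16, "content", "established by the W3C."),
    (17, "end_tag", "body"), (18, "start_tag", "closing"),
    (19, "content", "Thank you."), (20, "end_tag", "closing"),
    (21, "end_tag", "memo"),
]


def build_tag_rows(tokens):
    """Derive tag-table rows from the token stream using a stack."""
    rows = {}        # start_id -> row
    stack = []       # start_ids of the elements that are currently open
    last_child = {}  # parent start_id -> start_id of its latest child
    for seq_id, type_, name in tokens:
        if type_ == "start_tag":
            parent = stack[-1] if stack else None
            rows[seq_id] = {"tag_name": name, "first_child": None,
                            "next_sibling": None, "parent": parent,
                            "start_id": seq_id, "end_id": None}
            if parent is not None:
                previous = last_child.get(parent)
                if previous is None:
                    rows[parent]["first_child"] = seq_id
                else:
                    rows[previous]["next_sibling"] = seq_id
                last_child[parent] = seq_id
            stack.append(seq_id)
        elif type_ == "end_tag":
            # The end tag closes the most recently opened element.
            rows[stack.pop()]["end_id"] = seq_id
    return list(rows.values())


tag_rows = build_tag_rows(TOKENS)
for row in tag_rows:
    print(row)
```

Only tag tokens take part in the relationships, so an element whose only child is text, such as `recipient`, keeps an empty `first_child`.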

[Table 3]
doc_id  seq_id  content
1       4       Hong Gil-dong
1       7       Samsung SDS
1       10      July 24, 1998
1       14      SRE supports the
1       16      established by the W3C.
1       19      Thank you.

[Table 4]
doc_id  seq_id  name
1       15      DOM

[Table 5]
DTD_name  ent_name  content
$$DTD1    DOM       Document Object Model

Retrieval of an SGML document stored under this database model proceeds as follows.

When the type of token to retrieve is selected, the serial numbers of the start token and end token of the matching token are looked up in the tag table. The tokens whose serial numbers lie between those of the start and end tokens are then fetched from the document table, and for each token found, the content is taken from the content table if the token type is content ('content'), or from the entity table and the entity content table if the token type is entity ('entity').

For example, retrieving the content of the 'header' element in the document above proceeds as follows.

a. From the tag table, obtain the serial number of the start token of 'header' ('start_id=2') and the serial number of its end token ('end_id=12').

b. From the SGML document table, use the serial numbers obtained in step a to find the tokens that belong to the header.

c. For each token extracted in step b, fetch its content from the content table if its type is content ('content'), or from the entity table and the entity content table if its type is entity ('entity').

Performing steps a, b, and c yields the content of the header, namely 'Hong Gil-dong Samsung July 24, 1998'.
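
Steps a to c can be sketched with Python's `sqlite3` module. The table layout below is trimmed to the columns the lookup needs, the data follows the sample document with English element names and text, and `fetch_text` is a hypothetical helper, not an API described in the patent. Joining the entity table to the entity content table on the entity name alone is also an assumption of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sgml_document (doc_id INTEGER, seq_id INTEGER, type TEXT);
CREATE TABLE tag (doc_id INTEGER, tag_name TEXT, start_id INTEGER, end_id INTEGER);
CREATE TABLE content (doc_id INTEGER, seq_id INTEGER, content TEXT);
CREATE TABLE entity (doc_id INTEGER, seq_id INTEGER, name TEXT);
CREATE TABLE entity_content (dtd_name TEXT, ent_name TEXT, content TEXT);
""")

# Token types of the sample document in parse order (seq_id 1 to 21).
TYPES = ("start_tag start_tag start_tag content end_tag start_tag content "
         "end_tag start_tag content end_tag end_tag start_tag content entity "
         "content end_tag start_tag content end_tag end_tag").split()
conn.executemany("INSERT INTO sgml_document VALUES (1, ?, ?)",
                 enumerate(TYPES, start=1))
conn.executemany("INSERT INTO tag VALUES (1, ?, ?, ?)", [
    ("memo", 1, 21), ("header", 2, 12), ("recipient", 3, 5), ("sender", 6, 8),
    ("date", 9, 11), ("body", 13, 17), ("closing", 18, 20)])
conn.executemany("INSERT INTO content VALUES (1, ?, ?)", [
    (4, "Hong Gil-dong"), (7, "Samsung"), (10, "July 24, 1998"),
    (14, "SRE supports the"), (16, "established by the W3C."),
    (19, "Thank you.")])
conn.execute("INSERT INTO entity VALUES (1, 15, 'DOM')")
conn.execute("INSERT INTO entity_content VALUES "
             "('$$DTD1', 'DOM', 'Document Object Model')")


def fetch_text(conn, doc_id, tag_name):
    """Return the text of the first element called tag_name (steps a to c)."""
    # Step a: start and end serial numbers from the tag table.
    start_id, end_id = conn.execute(
        "SELECT start_id, end_id FROM tag WHERE doc_id = ? AND tag_name = ?",
        (doc_id, tag_name)).fetchone()
    # Step b: tokens inside that range from the document table.
    tokens = conn.execute(
        "SELECT seq_id, type FROM sgml_document "
        "WHERE doc_id = ? AND seq_id BETWEEN ? AND ? ORDER BY seq_id",
        (doc_id, start_id, end_id)).fetchall()
    # Step c: resolve each content or entity token to its text.
    parts = []
    for seq_id, type_ in tokens:
        if type_ == "content":
            query = "SELECT content FROM content WHERE doc_id = ? AND seq_id = ?"
        elif type_ == "entity":
            query = ("SELECT ec.content FROM entity e "
                     "JOIN entity_content ec ON ec.ent_name = e.name "
                     "WHERE e.doc_id = ? AND e.seq_id = ?")
        else:
            continue  # tag tokens carry no text
        parts.append(conn.execute(query, (doc_id, seq_id)).fetchone()[0])
    return " ".join(parts)


print(fetch_text(conn, 1, "header"))
print(fetch_text(conn, 1, "body"))
```

In this sketch step c issues one query per content or entity token, so the query count grows with the size of the element being retrieved.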

Another embodiment, intended to improve system speed, is described below. It merges the tables designed in the previous embodiment into a single table. This may require somewhat more storage than the previous embodiment, but it improves processing speed by reducing the number of SQL queries.

The SGML document is parsed to generate tokens corresponding to tags, content, and entities, and the following tables are then populated.

The document table stores the document number of the structured document and the order information of each token, together with, if the token is a tag, its type information, relationship information with other tags, and the order information of its start and end tags, if the token is content, that content, and if the token is an entity, the entity's name. The entity content table stores, for each entity token, the name of the document type definition, the entity's name, and the content corresponding to the entity. The relationship information stored in the document table comprises the order information of the first child tag token, of the parent tag token, and of the next sibling tag token. The order information stored in the document table comprises the order information of the tag token that marks the start and of the tag token that marks the end.

SGML Document Table

(doc_id        integer       // document serial number
 Tag_name      varchar(255)  // tag name
 type          varchar(255)  // type of the token
 content       varchar2      // document content
 parent        integer       // order information (serial number) of the parent tag
 seq_id        integer       // order information of the token within the document
 end_id        integer       // order information of the end_tag
 first_child   integer       // order information of the first child tag
 next_sibling  integer       // order information of the next sibling tag)

Entity Content Table

(DTD_name  varchar(255)  // DTD name
 Ent_name  varchar(255)  // entity name
 content   varchar2      // entity content)

Table 6 shows the contents of the SGML document table, and Table 7 shows the contents of the entity content table.

[Table 6]
doc_id  tag_name   type       contents                 parent_id  seq_id  end_id  first_child  next_sibling
1       memo       start_tag  -                        -          1       21      2            -
1       header     start_tag  -                        1          2       12      3            13
1       recipient  start_tag  -                        2          3       5       -            6
1       recipient  contents   Hong Gil-dong            3          4       -       -            -
1       recipient  end_tag    -                        3          5       -       -            -
1       sender     start_tag  -                        2          6       8       -            9
1       sender     contents   Samsung SDS              6          7       -       -            -
1       sender     end_tag    -                        6          8       -       -            -
1       date       start_tag  -                        2          9       11      -            -
1       date       contents   July 24, 1998            9          10      -       -            -
1       date       end_tag    -                        9          11      -       -            -
1       header     end_tag    -                        2          12      -       -            -
1       body       start_tag  -                        1          13      17      -            18
1       body       contents   SRE supports the         13         14      -       -            -
1       body       entity     DOM                      13         15      -       -            -
1       body       contents   established by the W3C.  13         16      -       -            -
1       body       end_tag    -                        13         17      -       -            -
1       closing    start_tag  -                        1          18      20      -            -
1       closing    contents   Thank you.               18         19      -       -            -
1       closing    end_tag    -                        18         20      -       -            -
1       memo       end_tag    -                        1          21      -       -            -

Retrieval of an SGML document stored under this database model proceeds as follows.

When the type of token to retrieve is selected, the serial numbers of the start and end tokens of the matching token are obtained from the document table, and the content ('content') of all content tokens between those serial numbers is obtained with a single query; only if an entity ('entity') lies within the search range is its content fetched from the entity content table. Compared with the method of the previous embodiment, this reduces the number of queries and so shortens the system's response time.
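
The single-table lookup can be sketched in the same way. The rows below cover only the `body` and `closing` elements of the sample, with English element names and text; the token type value 'contents' and the entity name stored in the content column follow the sample data for this embodiment, and `fetch_text` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sgml_document (
    doc_id INTEGER, tag_name TEXT, type TEXT, content TEXT, parent INTEGER,
    seq_id INTEGER, end_id INTEGER, first_child INTEGER, next_sibling INTEGER);
CREATE TABLE entity_content (dtd_name TEXT, ent_name TEXT, content TEXT);
""")
conn.executemany("INSERT INTO sgml_document VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", [
    (1, "body", "start_tag", None, 1, 13, 17, None, 18),
    (1, "body", "contents", "SRE supports the", 13, 14, None, None, None),
    (1, "body", "entity", "DOM", 13, 15, None, None, None),
    (1, "body", "contents", "established by the W3C.", 13, 16, None, None, None),
    (1, "body", "end_tag", None, 13, 17, None, None, None),
    (1, "closing", "start_tag", None, 1, 18, 20, None, None),
    (1, "closing", "contents", "Thank you.", 18, 19, None, None, None),
    (1, "closing", "end_tag", None, 18, 20, None, None, None),
])
conn.execute("INSERT INTO entity_content VALUES "
             "('$$DTD1', 'DOM', 'Document Object Model')")


def fetch_text(conn, doc_id, tag_name):
    """Return the text of the first element called tag_name."""
    # The start-tag row carries both ends of the element's range.
    start_id, end_id = conn.execute(
        "SELECT seq_id, end_id FROM sgml_document "
        "WHERE doc_id = ? AND tag_name = ? AND type = 'start_tag'",
        (doc_id, tag_name)).fetchone()
    # One query returns every content and entity token in the range.
    rows = conn.execute(
        "SELECT type, content FROM sgml_document "
        "WHERE doc_id = ? AND seq_id BETWEEN ? AND ? "
        "AND type IN ('contents', 'entity') ORDER BY seq_id",
        (doc_id, start_id, end_id)).fetchall()
    parts = []
    for type_, text in rows:
        if type_ == "entity":
            # Only entity tokens need a further lookup.
            text = conn.execute(
                "SELECT content FROM entity_content WHERE ent_name = ?",
                (text,)).fetchone()[0]
        parts.append(text)
    return " ".join(parts)


print(fetch_text(conn, 1, "body"))
```

Here the number of queries depends on how many entity references the element contains rather than on how many content tokens it has.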

As described above, the present invention reflects the information obtained by parsing an SGML document (the ESIS, Element Structure Information Set), a step that the SGML paradigm necessarily involves, directly in the database model, and additionally stores and uses the document's order information, thereby maximizing the usability of the model and the performance of systems built on it.

Claims

1. A method for storing a structured document in a database, in a relational database management system for storing structured documents written in a general-purpose language for electronic document standardization and retrieving them efficiently, the method comprising:

parsing the structured document to generate tokens corresponding to tags, content, and entities;

storing, in a document table, a document number for the structured document, order information for each of the tokens, and type information of the token corresponding to each item of order information;

storing, in a tag table, for each of the tag tokens, the document number, type information of the tag, relationship information with other tags, and the order information corresponding to the tag;

storing, in a content table, for each of the content tokens, the document number, the order information corresponding to the content, and the content;

storing, in an entity table, for each of the entity tokens, the document number, the order information corresponding to the entity, and the name of the entity; and

storing, in an entity content table corresponding to the entity table, the document number, the name of the document type definition, the name of the entity, and the content corresponding to the entity.

2. The method of claim 1, wherein the relationship information with other tags stored in the tag table includes order information of a first child tag token, order information of a parent tag token, and order information of a next sibling tag token.

3. The method of claim 1, wherein the order information corresponding to a tag stored in the tag table includes order information of the tag token indicating the start and order information of the tag token indicating the end.

4. A method for managing a structured document database, the method comprising:

storing, in a document table, a document number for the structured document and order information for each token of the structured document, together with type information, relationship information with other tags, and order information of the start and end tags if the token is a tag, the content if the token is content, and the name of the entity if the token is an entity; and

storing, in an entity content table, corresponding to the entity tokens, the name of the document type definition, the name of the entity, and the content corresponding to the entity.

5. The method of claim 4, wherein the relationship information with other tags stored in the document table includes order information of a first child tag token, order information of a parent tag token, and order information of a next sibling tag token.

6. The method of claim 4, wherein the order information corresponding to the tag stored in the document table includes order information of the tag token indicating the start and order information of the tag token indicating the end.