KR100902255B1

KR100902255B1 - AN INSTALLATION FOR URIref COMPRESS IN WEB DOCUMENTS

Info

Publication number: KR100902255B1
Application number: KR1020070106193A
Authority: KR
Inventors: 엄봉수
Original assignee: 에스케이 텔레콤주식회사
Priority date: 2007-10-22
Filing date: 2007-10-22
Publication date: 2009-06-11
Also published as: KR20090040713A

Abstract

The present invention discloses an apparatus and method for compressing URI references in RDF/XML documents, by which web documents can be compressed efficiently.

The compression apparatus according to the present invention comprises: a URI parser that receives a web document and extracts data elements from the file; a URI analysis unit that manages URI extraction and registration for adding new entries, including determining whether URI information used as an attribute value is registered as a base URI and whether it is registered as a reference; a URI dictionary registration unit that stores URI information, URI index information, and short-name information, used to determine whether a URI is registered as a base; a URI reference dictionary registration unit that stores fragment identifiers and URI index information, used to determine whether a reference is registered; and a compression unit that performs substitution on the web document file input to the URI parser, based on the URI index information registered in the URI dictionary registration unit. The invention thereby uses minimal storage space for the URI references used as attribute values in a web document, saving storage space when the document is stored and network bandwidth when it is transmitted.

RDF/XML, URI, Reference, URIref, Attribute, Base, Index, Network

Description

AN INSTALLATION FOR URIref COMPRESS IN WEB DOCUMENTS

The present invention relates to a method of compressing URI references in web documents and, more particularly, to a method of compressing URI references in RDF/XML documents that can compress the URI references used as attribute values of a web document.

In general, RDF uses the Extensible Markup Language (XML) to express RDF statements in a machine-processable form. XML was designed so that anyone can design a document format and write documents in it. RDF defines a specific XML markup language, RDF/XML, for representing RDF information and exchanging it between machines.

RDF, the Resource Description Framework, is a language for representing information about resources on the World Wide Web. It is intended in particular for representing metadata about web resources, such as the title, author, and modification date of a web page, copyright and licensing information for a web document, or the date on which a shared resource becomes accessible. By generalizing the concept of a "web resource", however, RDF can also represent information about things that can be identified on the web even if they cannot be retrieved directly from it. Examples include information about items that can be bought from an online shopping site (such as product specifications, prices, and stock information) and a web user's preferences for information delivery.

Accordingly, RDF/XML documents have come into wide use, and as their use grows, compression is being attempted to reduce network bandwidth. Consider the XML compression methods disclosed so far. Existing XML compression techniques compress the tag of each element with dictionary encoding, which makes processing path expressions over the compressed XML data inefficient. To address this, reverse arithmetic encoding converts the path of each element into a distinct interval within the range [0.0, 1.0). A path expression is likewise converted into an interval within [0.0, 1.0), and the path expression is then processed efficiently by using the containment relationship between its interval and the intervals of the elements.

The most efficient compression technique also depends on the type of the data values. Among existing XML compression techniques, XMill has the user define the types of data values directly, while XGrind fixes the type to string and compresses data values using only Huffman encoding and dictionary encoding. An alternative approach applies an encoding suited to each data value according to its inferred data type. For data values inferred to be of integer or float type, an encoding that combines binary encoding with differential encoding reduces the partial decompression overhead incurred when processing range queries.

Conventional XML data compression thus supports direct and efficient querying of compressed XML data by using reverse arithmetic encoding together with type-dependent compression techniques. In addition, a type inference engine infers the type of each data value, and numeric data (integer or float) is compressed with a technique that preserves the ordering of the data values, which reduces the partial decompression overhead of range queries.

Such compression techniques reduce the disk space, query processing, and transmission overhead incurred in storing, searching, and transmitting XML data, and so contribute greatly to XML applications such as electronic commerce and Internet search. However, they compress only plain data. As a result, the conventional XML data compression described above cannot sufficiently compress the RDF/XML documents that have recently come into use. Because it does not compress URI references (URIrefs), which make up the largest part of an RDF/XML document, storing RDF/XML documents has so far required a large amount of memory, and transmitting them wastes network bandwidth.

The present invention was devised to solve these problems. An object of the invention is to provide an apparatus and method for compressing URI references in web documents that minimize the storage space required for a web document and reduce the network bandwidth wasted when the document is transmitted.

Another object of the invention is to provide an apparatus and method for compressing URI references in web documents that shorten the transmission time of a web document by compressing the URI references used as its attribute values.

To achieve these objects, an apparatus for compressing URI references in web documents according to a first aspect of the present invention is an apparatus for compressing a web document, comprising: a URI parser that receives the web document and extracts data elements from the file; a URI analysis unit that manages URI extraction and registration for adding new entries, including determining whether URI information used as an attribute value among the data extracted by the URI parser is registered as a base URI and whether it is registered as a reference; a URI dictionary registration unit that stores URI information, URI index information, and short-name information, used by the URI analysis unit to determine base registration; a URI reference dictionary registration unit that stores fragment identifiers and URI index information, used by the URI analysis unit to determine reference registration; and a compression unit that performs substitution on the web document file input to the URI parser, based on the URI index information registered in the URI dictionary registration unit.

Specifically, each time URI information is added to the URI dictionary registration unit, a unique integer value is assigned as its index in the URI dictionary, and the frequently used base URI is assigned an arbitrarily predefined specific number as its index in the URI dictionary.
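The index assignment just described can be illustrated with a minimal Python sketch. The class name `UriDictionary` and its methods are hypothetical, and the reserved value `0` for the base URI is simply the value the description later calls preferable; the patent does not prescribe a data structure.

```python
class UriDictionary:
    """Hypothetical URI dictionary: one unique integer index per URI."""

    BASE_INDEX = 0  # assumption: the fixed number reserved for the base URI

    def __init__(self, base_uri: str) -> None:
        # The base URI is registered up front under the reserved index.
        self._index_of = {base_uri: self.BASE_INDEX}
        self._next_index = self.BASE_INDEX + 1

    def register(self, uri: str) -> int:
        """Return the existing index of `uri`, or assign the next unused integer."""
        if uri not in self._index_of:
            self._index_of[uri] = self._next_index
            self._next_index += 1
        return self._index_of[uri]

    def lookup(self, uri: str):
        """Return the index of `uri`, or None if it is not registered."""
        return self._index_of.get(uri)
```

Registering the same URI twice returns the same index, so the dictionary can be consulted and extended in a single pass over the document.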

A method for compressing URI references in web documents according to a second aspect of the present invention, for achieving the above objects, is a method for compressing a web document comprising the steps of:
a) scanning the web document, determining whether a URI used as an attribute value is the base URI, and, if so, setting its URI index information to an arbitrarily predefined specific number;
b) if step a) determines that the URI used as the attribute value is not the base URI, determining whether matching URI information exists in the URI dictionary;
c) if step b) determines that matching URI information exists in the URI dictionary, determining whether a matching URI reference (URIref) exists in the URI reference dictionary, and proceeding to step f) if it does not;
d) if step c) determines that the URI reference exists, extracting the URI information from the URI reference dictionary and registering it in the URI dictionary, extracting the URI index information registered in the URI reference dictionary, and replacing the URI information scanned from the web document with that URI index information;
e) if step b) determines that no matching URI information exists in the URI dictionary, registering the URI information in the URI dictionary and generating new URI index information for it; and
f) registering a new entry in the URI reference dictionary using the fragment identifier of the URI information, based on the newly registered URI index information, and then returning to step d).
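One possible reading of steps a) to f) is sketched below in Python. Several points are assumptions rather than statements from the text: the function name `compress_reference`, the use of plain dicts for both dictionaries, the input already being split into a URI part and a fragment identifier, the reference dictionary being keyed by the pair (URI index, fragment), and the base-URI case continuing on to the reference lookup.

```python
BASE_INDEX = 0  # assumption: reserved index of the base URI


def compress_reference(uri, fragment, base_uri, uri_dict, ref_dict):
    """Return the reference index that would replace one attribute value.

    uri_dict maps URI -> URI index; ref_dict maps (URI index, fragment) -> reference index.
    Both dictionaries are updated in place when new entries are needed.
    """
    if uri == base_uri:
        # a) the base URI always uses the reserved index
        uri_index = BASE_INDEX
    else:
        # b) look for the URI in the URI dictionary
        uri_index = uri_dict.get(uri)
        if uri_index is None:
            # e) not found: register the URI under a new unique index
            uri_index = max(uri_dict.values(), default=BASE_INDEX) + 1
            uri_dict[uri] = uri_index

    # c) look for the reference in the URI reference dictionary
    key = (uri_index, fragment)
    if key not in ref_dict:
        # f) not found: register a new reference entry
        ref_dict[key] = len(ref_dict)

    # d) the stored index is what replaces the scanned value
    return ref_dict[key]
```

Called twice with the same URI and fragment, the function returns the same index, which is what lets repeated references shrink to a short token.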

The apparatus and method for compressing URI references in web documents according to the present invention use minimal storage space for the URI references used as attribute values in a web document, thereby saving storage space when the document is stored and network bandwidth when it is transmitted.

In addition, unlike existing formatted compression techniques, the present invention generates and registers the URI references used as attribute values and compresses the document on the basis of the registered URI references, which both simplifies the compression algorithm and increases compression efficiency.

The present invention is described in detail below with reference to the accompanying drawings.

The Uniform Resource Identifier (URI) used in the present invention is an identification scheme that can identify anything, including things that cannot be retrieved directly on the web. URIs include URLs (Uniform Resource Locators). A URI reference (URIref) may optionally carry a fragment identifier, and web documents such as RDF/XML use URI references to identify the information they express. In the embodiments below, RDF/XML documents are referred to collectively as web documents, because RDF/XML is the foundation for processing metadata and most efficiently provides interoperability between applications that exchange machine-understandable information on the web.

URI references take a variety of forms. For example, in the RDF/XML document of FIG. 3, the attribute values on lines 9, 12, 14, and 15 are URI references written in abbreviated form relative to the base URI 'http://www.w3.org/wine#'. That is, the complete URI reference for 'WineGrape' on line 9 is 'http://www.w3.org/wine#WineGrape'. How the base URI is applied depends on the kind of attribute: 'WineGrape' on line 9 and '#WineGrape' on line 15 are the same URI reference written in different ways. Furthermore, the attribute value on line 10 is written using an entity reference, and the attribute value on line 13 is written as a complete URI reference. The URI reference in an attribute value can therefore be expressed in several ways.
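As an illustration of how the abbreviated forms map to one complete URI reference, the following Python sketch expands an attribute value against the base URI. The function name `expand_attribute` is hypothetical, only the three forms named above are handled, and entity references are assumed to have been expanded by the XML parser beforehand.

```python
def expand_attribute(attr: str, value: str, base_uri: str) -> str:
    """Expand an abbreviated attribute value into a complete URI reference."""
    stem = base_uri.rstrip("#")  # the base URI in the example already ends in '#'
    if "://" in value:
        # Already a complete URI reference (like line 13 of FIG. 3).
        return value
    if attr == "rdf:ID":
        # rdf:ID carries a bare name (like 'WineGrape' on line 9).
        return f"{stem}#{value}"
    if value.startswith("#"):
        # rdf:about / rdf:resource carry a '#'-prefixed name (like line 15).
        return stem + value
    raise ValueError(f"unsupported form: {attr}={value!r}")
```

Normalizing both spellings to the same string before the dictionary lookup is what allows 'WineGrape' and '#WineGrape' to share one index.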

In short, a URI reference has the basic form http://www. ... / ..., but depending on the kind of attribute it may also appear in a form in which predefined parts such as 'http', 'www', and '/' are omitted, or be written using an entity reference. Compression of a document can therefore be defined on the basis of these URI references.

The present invention compresses web documents, that is, RDF/XML documents, on the basis of such URI references. FIG. 1 shows the apparatus for compressing an RDF/XML document. As shown, the apparatus comprises: a URI parser 101 that receives an RDF/XML document and extracts data elements from the file; a URI analysis unit 103 that manages URI extraction and registration for adding new entries, including determining whether URI information used as an attribute value among the data extracted by the URI parser 101 is registered as a base URI and whether it is registered as a reference; a URI dictionary registration unit 105 that stores URI information, URI index information, and short-name information, used by the URI analysis unit 103 to determine base registration; a URI reference dictionary registration unit 107 that stores fragment identifiers and URI index information, used by the URI analysis unit 103 to determine reference registration; and a compression unit 111 that performs substitution on the RDF/XML file input to the URI parser 101, based on the URI index information registered in the URI dictionary registration unit 105.

Each time URI information is added to the URI dictionary, it is assigned a unique integer value as its index in the URI dictionary. Because the base URI is used frequently, it is appropriate to give it a fixed, specific number. The fixed number may be '0' or any other predefined number or character, but in the present invention the index '0' is preferred. It is also appropriate to register the namespace prefix declarations in the top-level element of the RDF/XML document, together with the base URI, in the URI dictionary registration unit 105, because the base URI and the URIs of the namespace prefixes are very likely to be used as the URI part of URI references.
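Seeding the dictionary from the top-level element might look like the following Python sketch, which uses the standard library's `xml.etree.ElementTree.iterparse` with `start-ns` events to collect namespace declarations. The function name `seed_uri_dictionary` and the choice of a plain dict are assumptions; the order in which the namespace URIs receive indices 1, 2, ... is not specified by the description.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes xml:base under the predefined XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def seed_uri_dictionary(rdf_xml: str) -> dict:
    """Register the base URI (index 0) and the root element's namespace URIs."""
    namespaces = []
    root = None
    source = io.BytesIO(rdf_xml.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            namespaces.append(payload[1])  # payload is (prefix, uri)
        else:
            root = payload                 # first "start" event is the root element
            break

    index_of = {}
    base = root.get(XML_BASE) if root is not None else None
    if base is not None:
        index_of[base] = 0                 # assumption: reserved index for the base URI
    next_index = 1
    for uri in namespaces:
        if uri not in index_of:
            index_of[uri] = next_index
            next_index += 1
    return index_of
```

Because parsing stops at the first element, only declarations on the top-level element are registered, which matches the scope named in the paragraph above.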

As described above, the compression scheme configured in this way replaces each URI reference used as an attribute value with a unique integer value, and the RDF/XML document is compressed through this replacement. Because URI references make up a substantial part of an RDF/XML document, the effective compression efficiency is high, which minimizes the storage space needed to store the document and reduces the network bandwidth needed to transmit it.

The operation of the present invention is described in detail below with reference to the accompanying drawings. FIG. 2 is a flowchart illustrating the main operation of the invention. FIGS. 3 to 6 show the documents and dictionary contents used in an embodiment of the invention to illustrate the compression of an RDF/XML document.

First, in step S201, the URI parser 101 scans the RDF/XML document and extracts data elements such as tags, attributes, and data values. The URI analysis unit 103 then receives the URI information used as attribute values among those data elements and determines whether the currently input URI information is base URI information. That is, when an RDF/XML document such as that of FIG. 3 is received, the URI information used as attribute values in the document is input sequentially, and for each item it is determined whether the base URI is used.

As shown, the base URI is declared with 'xml:base', and its entry in the URI dictionary is 'http://www.w3.org/wine#'. As shown in FIG. 4, this base URI always has index '0'. As noted above, a specific number other than '0' may of course be used as the base URI index. The URI index information shown in FIG. 4 is given only to illustrate an embodiment; any arbitrarily defined number or character may be used as URI index information as needed. In this embodiment of the URI dictionary, the remaining entries are 'http://www.w3.org/food#' with index '1', 'http://www.w3.org/2002/07/owl#' with index '2', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#' with index '3', 'http://www.w3.org/2000/01/rdf-schema#' with index '4', and 'http://www.kaist.kr/owl#' with index '5'. Many further URIs may be registered in the dictionary, each with its own unique number.

The URI analysis unit 103 looks for the base URI information of the RDF/XML document. If it determines that the currently input URI information is base URI information, the process proceeds to step S215. That is, on confirming that the URI information is used as the base, the URI analysis unit 103 enables the compression unit 111 and notifies it that the URI index is '0'. The compression unit 111 then replaces the currently input base URI information with index '0' and outputs the result.

The URI analysis unit 103 continues to examine the URI information used as attribute values. If it determines in step S201 that the currently input URI information is not base URI information, the process proceeds to step S203, where the URI analysis unit 103 enables the URI dictionary registration unit 105. The URI dictionary registration unit 105 searches for whether the currently received URI information is already registered. As shown in FIG. 4, this search is performed on the short names in the URI dictionary.

In other words, the URI analysis unit 103 asks the URI dictionary registration unit 105 to search for whether the currently scanned URI information is registered under a short name in the URI dictionary. If the search result, that is, the determination in step S205, is that the currently scanned URI information is registered in the URI dictionary registration unit 105, the process proceeds to step S207 and the URI reference dictionary registration unit 107 is enabled. The URI reference dictionary registration unit 107 determines whether the URI information already registered in the URI dictionary registration unit 105 is also registered in the URI reference dictionary.

If, in step S209, the URI analysis unit 103 determines that the URI information exists in the URI reference dictionary registration unit 107, the process proceeds to step S215 and the URI index information registered in the URI reference dictionary is extracted. The URI analysis unit 103 passes the URI index information extracted from the URI reference dictionary registration unit 107 to the compression unit 111, and the compression unit 111 replaces the currently scanned URI information with that URI index information.

If, on the other hand, step S209 determines that no URI reference corresponding to the currently scanned URI information exists, that is, the URI reference dictionary registration unit 107 holds no matching URI reference information, the process proceeds to step S213. This is the case in which no index information exists for the URI information; the step adds the URI reference so that it can be replaced with a URI reference index by means of the URI reference dictionary.

To do so, the URI analysis unit 103 first extracts the URI from the URI reference, and then checks the URI reference dictionary using the URI index found in step S203 together with the fragment identifier. If the URI reference contains a fragment identifier, that is, a '#' symbol, the URI is extracted by excluding the fragment identifier. If it contains no '#' symbol, the URI is extracted by splitting at the last '/' symbol.
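The splitting rule can be written as a short Python function. The name `split_uri_reference` is hypothetical, and keeping the trailing '#' or '/' on the URI part is an assumption made so that the result matches dictionary entries such as 'http://www.w3.org/wine#'.

```python
def split_uri_reference(ref: str):
    """Split a URI reference into (URI part, fragment or last path segment)."""
    if "#" in ref:
        # A fragment identifier is present: everything after the first '#' is the fragment.
        uri, fragment = ref.split("#", 1)
        return uri + "#", fragment
    # No '#': split at the last '/' instead.
    head, sep, tail = ref.rpartition("/")
    return head + sep, tail
```

For instance, `split_uri_reference("http://www.w3.org/wine#WineGrape")` yields the URI part `http://www.w3.org/wine#` and the fragment `WineGrape`.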

When the base URI is used, however, the check takes into account that, as already explained, 'WineGrape' in rdf:ID on line 9 of FIG. 3 and '#WineGrape' in rdf:resource on line 15 are the same URI reference. Through this procedure the URI analysis unit 103 extracts fragment identifiers such as those shown in FIG. 5 and registers them in the URI reference dictionary registration unit 107.

As shown there, the URI information for 'WineGrape' and 'Wine', which rely on the frequently used base URI, already has '0' registered as its URI index number, whereas 'madeFromFruit', which was not yet registered in the URI reference dictionary, is registered as a URI reference and given new URI index information. For example, the URI index information of 'madeFromFruit' may be set to '5'.

Once the URI analysis unit 103 recognizes that registration in the URI reference dictionary is complete, it notifies the compression unit 111. The compression unit 111 performs the data substitution based on the URI index information of the URI information newly added to the URI reference dictionary. Through this procedure the compression unit 111 replaces the many URI values used as attribute values with the index information of their URI references. As shown in FIG. 6, replacing the URI information of the RDF/XML document with numeric or character index information greatly reduces the size of the data. Comparing the size of the RDF/XML document of FIG. 3 with that of the document compressed according to the invention in FIG. 6 gives a compression ratio of roughly 50%.
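The substitution performed by the compression unit can be approximated with a regular expression over attribute values, as in the Python sketch below. Everything here is an assumption made for illustration: the names `substitute` and `make_indexer`, the restriction to `rdf:ID`, `rdf:about`, and `rdf:resource`, the simplified indexer that numbers values by first appearance, and the output format, since the text does not reproduce the compressed document of FIG. 6.

```python
import re

# Matches rdf:ID="...", rdf:about="..." and rdf:resource="..." attribute values.
ATTR = re.compile(r'(rdf:(?:ID|about|resource))="([^"]*)"')


def make_indexer():
    """Return a function that assigns indices in order of first appearance."""
    table = {}

    def index_for(value: str) -> int:
        key = value.lstrip("#")  # treat 'Name' and '#Name' as the same reference
        return table.setdefault(key, len(table))

    return index_for


def substitute(document: str, index_for) -> str:
    """Replace each matched attribute value with its index."""
    return ATTR.sub(lambda m: f'{m.group(1)}="{index_for(m.group(2))}"', document)
```

A real implementation would also need to ship the dictionaries alongside the document so that the indices can be expanded again on the receiving side.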

Meanwhile, if the determination in step S205 is that the URI information is not registered in the URI dictionary, there is no URI index information and the URI reference cannot be compressed, so index information must be additionally registered in the URI dictionary. To this end, the process enters step S211, where the URI analysis unit 103 enables the URI dictionary registration unit 105. The URI dictionary registration unit 105 registers the new URI information in the URI dictionary described above, together with the short name and URI index information for that URI.

The dictionary registration method for URI information extracts a short name from the URI information shown in FIG. 4 and registers it. For example, when the URI information is 'http://www.w3.org/food#', which contains the '#' symbol, 'food', the URI information excluding the fragment identifier, is set as the short name as described above, and URI index information is assigned to it. In contrast, when the '#' symbol is absent, as in 'http://www.w3.org/2002/07/owl', the URI information is extracted relative to the last '/' symbol, so 'owl' is set as the short name and new URI index information is assigned.
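The short-name rule above might be sketched in Python as follows. The function name `short_name` is hypothetical, and the sketch assumes that once the '#' part is dropped, the short name is the text after the last '/' in both cases, which matches the two examples given but is not stated in that general form.

```python
def short_name(uri):
    """Derive a short name from namespace URI information (illustrative only)."""
    # Drop '#' and any fragment identifier that follows it.
    base = uri.split("#", 1)[0]
    # Assumption: the short name is whatever follows the last '/'.
    return base.rstrip("/").rsplit("/", 1)[-1]
```

For the two URIs quoted in the text, this yields 'food' and 'owl' respectively.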

The URI analysis unit 103 provides the URI information obtained through this procedure to the URI dictionary registration unit 105. The URI dictionary registration unit 105 stores the newly input URI information and assigns an index number for URI indexing. Each index number is new and unique and, as needed, may be an integer or a string of letters, digits, or a combination of them (an alphanumeric string). In addition to the URI dictionary registration, the URI analysis unit 103 also registers the URI index information in the URI reference dictionary in step S213.
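A minimal Python sketch of this kind of registration, under stated assumptions, is given here. The class `UriDictionary` and its methods are hypothetical names, and consecutive integers starting at 0 are only one possible index scheme, since the text also allows alphanumeric strings.

```python
class UriDictionary:
    """Hypothetical in-memory stand-in for the URI dictionary of FIG. 4."""

    def __init__(self):
        self._entries = {}  # URI -> (index, short name)

    def register(self, uri, short_name):
        """Return the index for uri, assigning a fresh one if it is new."""
        if uri not in self._entries:
            # Assumption: indices are consecutive integers starting at 0.
            self._entries[uri] = (len(self._entries), short_name)
        return self._entries[uri][0]

    def lookup(self, uri):
        """Return the index for uri, or None if it is not registered."""
        entry = self._entries.get(uri)
        return None if entry is None else entry[0]
```

Registering the same URI twice returns the index assigned the first time, so every stored URI keeps a single unique index.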

The compression unit 111 then proceeds to step S215 described above and performs data substitution based on the newly allocated URI index information. That is, replacing bulky URI information with numeric or character URI index information greatly reduces the data size. FIGS. 3 and 6 show the RDF/XML document before and after compression, respectively, and indicate that the size of the document compressed according to the present invention is reduced by nearly 50%.
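The substitution step could look roughly like the following Python sketch. The function name `substitute_indices`, the regular expression, and the choice of attributes (rdf:about, rdf:resource, rdf:ID) are assumptions made for illustration. The actual output format of FIG. 6 is not reproduced here.

```python
import re

# Assumption: only these three RDF attributes carry URI references.
ATTR = re.compile(r'(rdf:(?:about|resource|ID))="([^"]*)"')


def substitute_indices(document, index_by_value):
    """Replace attribute values that have an index with that index."""

    def swap(match):
        name, value = match.group(1), match.group(2)
        # Values without a registered index are left unchanged.
        return '{}="{}"'.format(name, index_by_value.get(value, value))

    return ATTR.sub(swap, document)
```

A call such as `substitute_indices(text, {"http://www.w3.org/food#madeFromFruit": "5"})` shortens every matching attribute value to its index and leaves all other markup intact.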

As described above, the present invention provides a compression method for RDF/XML documents, a widely used form of web document. Because this compression markedly reduces data size, it reduces wasted network bandwidth during data transmission and can thereby contribute to building communication infrastructure, giving it substantial industrial value.

FIG. 1 is a block diagram illustrating the main functions of an apparatus for compressing URI references in an RDF/XML document according to the present invention.

FIG. 2 is a flowchart illustrating the main operation of the apparatus of FIG. 1.

FIG. 3 is an RDF/XML document before compression, used to explain the compression method according to the present invention.

FIG. 4 is a diagram showing a URI dictionary according to the present invention.

FIG. 5 is a diagram showing a URI reference dictionary according to the present invention.

FIG. 6 is an RDF/XML document compressed according to the present invention.

<Explanation of reference numerals for main parts of the drawings>

101: URI parser 103: URI analysis unit

105: URI dictionary registration unit 107: URI reference dictionary registration unit

111: URI reference dictionary addition unit

Claims

An apparatus for compressing URI references in a web document, the apparatus comprising:

a URI parser for receiving a file corresponding to the web document and extracting data elements from the file;

a URI analysis unit for managing URI extraction and registration for adding new elements, including determining, for URI information used as an attribute value among the data extracted by the URI parser, whether it is registered as a base and whether it is registered as a reference;

a URI dictionary registration unit for storing URI information, URI index information, and short name information so that the URI analysis unit can determine base registration;

a URI reference dictionary registration unit for storing fragment identifiers and URI index information so that the URI analysis unit can determine reference registration; and

a compression unit for performing substitution on the web document file input to the URI parser based on the URI index information registered in the URI dictionary registration unit.

The apparatus of claim 1,

wherein each time arbitrary URI information is added to the URI dictionary registration unit, an integer value different from the integer values of the indices previously assigned to URI information already stored in the URI dictionary registration unit is assigned as the index of the added URI information in the URI dictionary.

The apparatus of claim 2,

wherein a base URI that is frequently used among the URI information is allocated an arbitrarily defined specific number as its index number in the URI dictionary.

The apparatus according to any one of claims 1 to 3,

wherein the web document is an RDF/XML document.

A method for compressing URI references in a web document, the method comprising:

a) scanning the web document to determine whether a URI used as an attribute value is a base URI and, if it is a base URI, setting its URI index information to an arbitrarily defined specific number;

b) if the determination in step a) is that the URI used as the attribute value is not a base URI, determining whether matching URI information exists in a URI dictionary;

c) if the determination in step b) is that matching URI information exists in the URI dictionary, determining whether a matching URI reference (URIref) exists in a URI reference dictionary and, if the URI reference does not exist, proceeding to step f);

d) if the determination in step c) is that the URI reference exists, extracting the URI information from the URI reference dictionary and registering it in the URI dictionary, extracting the URI index information registered in the URI reference dictionary, and replacing the URI information scanned from the web document with the URI index information;

e) if the determination in step b) is that no matching URI information exists in the URI dictionary, registering the URI information in the URI dictionary and generating new URI index information corresponding to it; and

f) registering a new element in the URI reference dictionary using a fragment identifier of the URI information based on the newly registered URI index information, and then returning to step d).

The method of claim 5,

wherein the URI information extracted from the URI reference dictionary in step d) is URI information excluding the '#' symbol of the fragment identifier.

The method of claim 5,

wherein the URI information extracted from the URI reference dictionary in step d) is URI information extracted relative to the last '/' symbol of the URI information.

The method according to any one of claims 5 to 7,

wherein the web document is an RDF/XML document.