KR100433584B1

KR100433584B1 - Method for product detailed information extraction of internet shopping mall with ontology and wrapper data

Info

Publication number: KR100433584B1
Application number: KR10-2000-0075438A
Authority: KR
Inventors: 김성훈; 장철수; 노명찬; 김중배; 이경호; 함호상
Original assignee: 한국전자통신연구원
Priority date: 2000-12-12
Filing date: 2000-12-12
Publication date: 2004-06-04
Also published as: KR20020045971A

Abstract

The present invention relates to a method for extracting detailed information about internet shopping mall products using an ontology and rule information.

According to the present invention, detailed information about a product, and not only its general, simple information, can be extracted by means of: a preprocessing step of converting an HTML document, fetched by a document collecting robot based on the URL of a shopping mall product detail page, into a document composed only of plain non-tag strings and the <TR, <P, and <BR tags; an ontology-based search step of extracting the classification names and ontology values of the product from the converted document based on the ontology information in a knowledge database; and a step of extracting only the model name of the product from the converted document.

As a result, detailed product information is obtained accurately and quickly, which simplifies work that otherwise requires a great deal of manpower and time.

Description

Method for product detailed information extraction of internet shopping mall with ontology and wrapper data

The present invention relates to a method for extracting detailed information about internet shopping mall products using an ontology and rule information and, more particularly, to such a method that quickly and accurately extracts detailed information about a product from an HTML document based on the ontology information in a knowledge database and the rules of a wrapper.

When a user wants to obtain some information from a plain-text document, or from a document returned by a query on the web, the user usually reads the document and picks out the desired information. If every one of the countless documents on the web had to be read in this way, the user would have to invest a great deal of time. However, if a document is organized as a database, or is a semi-structured document that is not a complete database but follows rules to some degree, the necessary information can be extracted automatically without the user reading every document. The technique of obtaining information from the many documents available on the web, or from documents stored as text outside the web, is called information extraction.

In information extraction of this kind, if the documents of interest share a common rule, that rule makes the information easy to obtain. Most documents on the web are built with HTML tags. Thus, if HTML documents present information in a consistent form using various tags, the rule can be identified and the necessary information extracted easily. For example, if in certain documents the document title always appears between the <B> and </B> tags, that rule alone is enough to extract the title automatically. Even where no HTML tags are involved, a rule such as "a chapter title follows the word 'Chap.'" can likewise be used to extract information automatically. To extract information automatically in this way, the rules present in the document must be recognized, and the component used for this is called a wrapper. With a wrapper, the necessary information can be extracted automatically from any document that conforms to the wrapper's rules, and extraction using wrappers alone is the most widely used of the existing information extraction methods. However, the form of a wrapper is determined by the form of the document. If a document follows no rules, or the documents provided change frequently, a wrapper cannot be constructed for them; moreover, it is impossible to extract information from the document on a semantic basis.
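The two example rules mentioned above (a title enclosed in B tags, and a chapter title following the word "Chap.") can be sketched as regular expressions. This is only an illustration of the prior-art wrapper idea; the names `TITLE_RULE`, `CHAPTER_RULE`, and `apply_rule` are hypothetical and do not come from the patent.

```python
import re

# Hypothetical wrapper rules, one regular expression per rule.
# Rule 1: the document title always sits between <B> and </B>.
TITLE_RULE = re.compile(r"<B>(.*?)</B>", re.IGNORECASE | re.DOTALL)
# Rule 2: a chapter title follows the word "Chap.".
CHAPTER_RULE = re.compile(r"Chap\.\s*(.+)")


def apply_rule(rule, text):
    """Return the first string captured by the rule, or None if no match."""
    match = rule.search(text)
    return match.group(1).strip() if match else None
```

A document that does not follow the rule simply yields `None`, which mirrors the limitation described above: a wrapper is only useful for documents whose layout matches its rules.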

Another method commonly used for information extraction relies on a dictionary such as an ontology. In this method, every entry in the dictionary is compared against the document from which information is sought, and any entry found in the document is extracted. However, this method requires a large dictionary and consumes a great deal of search time.
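A minimal sketch of this exhaustive dictionary approach is shown below, assuming plain substring matching; the function name `dictionary_extract` and the sample data are hypothetical. It makes the cost visible: the text is scanned once for every dictionary entry.

```python
def dictionary_extract(text, dictionary):
    """Return every dictionary entry that occurs in the text."""
    found = []
    for entry in dictionary:      # one substring scan of the text per entry
        if entry in text:
            found.append(entry)
    return found
```

With a large dictionary and many documents, the number of scans grows with the product of the two, which is the search-time drawback noted above.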

Furthermore, the greatest drawback of existing methods that extract product information from internet shopping malls using wrappers or ontologies is that the extracted information is simple in form. This is because the HTML documents of internet shopping malls differ in form from one another, which makes wrappers difficult to use, and even when a wrapper is used, only a simple one can be applied. In addition, most ontology-based systems take only the representative URL of a shopping mall as input and then collect and analyze all related HTML documents, which consumes a great deal of execution time.

To solve the above problems of the prior art, an object of the present invention is to provide a method for extracting detailed information about internet shopping mall products using an ontology and rule information, which quickly and accurately extracts the detailed information defined for a product and thereby allows the varied product information of internet shopping malls to be extracted quickly and easily.

FIG. 1 is an overall flowchart of a method for extracting detailed information about internet shopping mall products using an ontology and rule information according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a preprocessing procedure that converts an input HTML document into a form suitable for information extraction;

FIG. 3 is a flowchart illustrating a method of extracting ontology information from the preprocessed document based on the data in the product knowledge database; and

FIG. 4 is a flowchart illustrating a method of extracting only the model name from the detailed information of a product.

To achieve the above object, the present invention provides a method for extracting detailed information about internet shopping mall products using an ontology and rule information, performed by a detailed information extraction system having a knowledge database and a database, the method comprising: a first step of finding an HTML page related to the internet shopping mall product using the URL information of the product; a second step of preprocessing the HTML page into a document composed of specific tags and plain strings; a third step of extracting, from the preprocessed document, the detailed information of the product other than the model name using the ontology and synonyms in the knowledge database, and storing it in the database; and a fourth step of extracting only the model name based on rule information of the preprocessed document and storing it in the database.

Preferably, there is also provided a computer-readable recording medium storing a program executable by a computer to perform the method for extracting detailed information about internet shopping mall products using an ontology and rule information by means of a detailed information extraction system having a knowledge database and a database, the program comprising: a first step of finding an HTML page related to the internet shopping mall product using the URL information of the product; a second step of preprocessing the HTML page into a document composed of specific tags and plain strings; a third step of extracting, from the preprocessed document, the detailed information of the product other than the model name using the ontology and synonyms in the knowledge database, and storing it in the database; and a fourth step of extracting only the model name based on rule information of the preprocessed document and storing it in the database.

Hereinafter, a method for extracting detailed information about internet shopping mall products using an ontology and rule information according to an embodiment of the present invention will be described in more detail with reference to the accompanying drawings. FIG. 1 is an overall flowchart of the method according to an embodiment of the present invention.

First, the URL (Uniform Resource Locator: an internet address or domain name, hereinafter "URL") information of the shopping mall from which detailed information is to be extracted, as entered by the administrator of the detailed information extraction system, is received (S110). Using the received URL information, the administrator uses a document collecting robot to fetch the product information HTML document from the shopping mall (S120), and the fetched HTML document is then preprocessed into a document suitable for the ontology search and the model name search (S130). The preprocessed document first undergoes a search for detailed information using the classification names and ontology of the product stored in the knowledge database (S140), while a model name search step is performed to extract the model name, the most basic item of the product's detailed information (S150).
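The overall flow of steps S120 to S150 can be sketched as a small orchestration function. This is a hedged illustration only: the function name `extract_product_details`, its parameters, and the shape of the returned dictionary are assumptions, and the four stages are passed in as callables because the patent does not prescribe their interfaces.

```python
from __future__ import annotations

from typing import Callable


def extract_product_details(
    url: str,
    fetch: Callable[[str], str],
    preprocess: Callable[[str], list[str]],
    ontology_search: Callable[[list[str]], dict[str, str]],
    model_search: Callable[[list[str]], list[str]],
) -> dict:
    """Run the stages on the product page at `url` (the URL received in S110)."""
    html = fetch(url)                       # S120: document collecting robot
    lines = preprocess(html)                # S130: reduce to kept tags and text
    return {
        "url": url,
        "details": ontology_search(lines),  # S140: ontology-based search
        "models": model_search(lines),      # S150: rule-based model name search
    }
```

Passing the stages as parameters keeps the sketch independent of any particular crawler or database, both of which the description leaves unspecified.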

The purpose of the preprocessing step (S130) is to remove every tag other than <TR, <P, and <BR from the HTML document, leaving a document composed only of those tags and of strings unrelated to HTML tags, so that the amount of data to be searched is reduced and the time needed for information extraction is shortened.

In addition, the present invention specifies the model name extraction as a separate step that uses only a wrapper for three reasons: the model name is the most basic unit of a product's detailed information; unlike the other detailed information, it cannot be found through ontology values; and product HTML pages present it according to rules of the form used by the method of the present invention.

FIGS. 2 to 4 are flowcharts showing the detailed operations of the method for extracting detailed information about internet shopping mall products using an ontology and rule information according to the present invention. The method will now be described in detail with reference to these drawings. FIG. 2 is a flowchart illustrating the preprocessing procedure that converts an input HTML document into a form suitable for information extraction.

First, the HTML document found by the document collecting robot in the detailed information extraction system is split into lines (S210), and each line is checked for HTML tag characters (S220). If a line contains HTML tag characters, it is identified as a tag line (S230); if it does not, it is identified as a non-tag line (S270) and stored in the database (S280). When a tag line is identified, it is checked for the specific tag characters <TR, <P, and <BR (S240). If the line contains none of these specific tag characters, the line is removed (S250); if it does contain one, the line is stored in the database (S260). After the lines have been stored in this way, it is checked whether any lines remain to be input and converted (S290). If lines remain, the above procedure is repeated; otherwise, the preprocessing step ends.
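The line-by-line filter of FIG. 2 can be sketched as follows. Several details are assumptions rather than statements of the patent: a "<" character is taken to mark a tag line, matching is case-insensitive, and the kept lines are returned in a list instead of being written to a database. The names `KEPT_TAG_PREFIXES` and `preprocess` are hypothetical.

```python
from __future__ import annotations

# Tag prefixes that survive preprocessing, written as in the description.
KEPT_TAG_PREFIXES = ("<TR", "<P", "<BR")


def preprocess(html: str) -> list[str]:
    """Keep non-tag lines and lines containing <TR, <P or <BR; drop the rest."""
    kept = []
    for line in html.splitlines():              # S210: split into lines
        if "<" not in line:                     # S220: assumed tag test
            kept.append(line)                   # S270-S280: store non-tag line
            continue
        upper = line.upper()                    # assumed case-insensitive match
        if any(prefix in upper for prefix in KEPT_TAG_PREFIXES):  # S240
            kept.append(line)                   # S260: store kept tag line
        # otherwise the tag line is removed (S250)
    return kept
```

Because the description lists the tags as prefixes, a simple substring test like this one would also keep lines containing tags such as `<PRE`; a stricter tag-name comparison would be needed if that is not intended.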

FIG. 3 is a flowchart illustrating a method of extracting ontology information from the preprocessed document based on the data in the product knowledge database.

First, the detailed information extraction system retrieves from the product knowledge database the classification names related to the target product and the synonyms of those classification names (S301), and then reads the HTML document produced by preprocessing one line at a time (S302). The synonyms of each classification name are compared with the strings in the line that was read in order to find the closest matching synonym (S303), and it is determined whether that synonym is present in the line (S304). If the synonym is present, the ontology values related to the corresponding classification name are loaded from the knowledge database (S305) and compared with the text before and after the matched string to search for the ontology value (S307); if the synonym is not present, all ontologies of the target product are searched (S307). Next, it is determined whether an ontology value was found (S308). If so, the classification name and the ontology value found are stored in the database (S309); if not, all ontologies of the target product are searched (S307). Once the classification name and ontology value have been stored, it is checked whether any lines remain to be input and searched (S310). If lines remain, the above procedure is repeated; otherwise, the ontology search step ends.
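A rough sketch of the search in FIG. 3 follows. It rests on simplifying assumptions that are not in the patent: the knowledge database is modeled as an in-memory dictionary, "closest synonym" is reduced to a case-insensitive substring test, the whole line stands in for "the text before and after the matched string", and results are returned rather than stored. `Category`, `ontology_search`, and the helper are hypothetical names.

```python
from __future__ import annotations

from typing import NamedTuple


class Category(NamedTuple):
    synonyms: tuple[str, ...]   # labels that may announce the classification
    values: tuple[str, ...]     # ontology values allowed for it


def _values_in_line(line, names, knowledge):
    """Map each classification in `names` to its longest value found in the line."""
    low = line.lower()
    hits = {}
    for name in names:
        present = [v for v in knowledge[name].values if v.lower() in low]
        if present:
            hits[name] = max(present, key=len)
    return hits


def ontology_search(lines, knowledge):
    found = {}
    for line in lines:                                           # S302
        low = line.lower()
        matched = [name for name, cat in knowledge.items()
                   if any(s.lower() in low for s in cat.synonyms)]  # S303-S304
        hits = _values_in_line(line, matched, knowledge) if matched else {}
        if not hits:                                             # fall back to
            hits = _values_in_line(line, knowledge, knowledge)   # all ontologies
        found.update(hits)                                       # S309
    return found
```

Restricting the first pass to classifications whose synonym appears in the line is what keeps the search cheaper than scanning the whole ontology for every line; the full scan is only the fallback.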

FIG. 4 is a flowchart illustrating a method of extracting only the model name from the detailed information of a product. The document preprocessed according to the flowchart of FIG. 2 is read line by line (S410), and each line is checked for a "(" character (S420). If the line contains a "(" character, the string between the "(" and ")" characters is extracted (S430) and checked for a "-" character (S440); if the line contains no "(" character, the line itself is checked directly for a "-" character (S440). If a "-" character is present, the string extracted in step S430 is split on the "-" character (S450), and the resulting pieces are analyzed to determine whether they consist of a combination of English letters and digits (S460). If they do, the pieces are stored in the database as the model name (S470), and it is then determined whether any strings remain to be processed (S480). If strings remain, the above steps are repeated; otherwise, the model name search step ends. If, on the other hand, no "-" character is present, or the pieces do not consist of a combination of English letters and digits, processing proceeds directly to checking whether strings remain to be processed (S480).
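The model name rule of FIG. 4 can be sketched as below. The description leaves some points open, so the sketch makes labeled assumptions: candidates are whitespace-separated tokens, a token qualifies only if every piece around "-" is ASCII alphanumeric and the pieces together contain at least one letter and one digit, and the hyphenated token itself is kept as the model name. The function names are hypothetical, and a list is returned in place of a database write.

```python
import re


def _is_letter_digit_combo(parts):
    """True if all parts are ASCII alphanumeric and mix letters with digits."""
    joined = "".join(parts)
    return (
        all(p.isascii() and p.isalnum() for p in parts)
        and any(c.isalpha() for c in joined)
        and any(c.isdigit() for c in joined)
    )


def extract_model_names(lines):
    models = []
    for line in lines:                                   # S410
        inside = re.findall(r"\(([^()]*)\)", line)       # S420-S430
        candidates = inside if inside else [line]        # no "(": use the line
        for text in candidates:
            for token in text.split():                   # assumed tokenization
                if "-" not in token:                     # S440
                    continue
                parts = token.split("-")                 # S450
                if _is_letter_digit_combo(parts):        # S460
                    models.append(token)                 # S470
    return models
```

A lone "-" used as a separator, as in "Price - 100", is rejected because splitting it yields empty pieces, which fail the alphanumeric test.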

Although the invention has been described above on the basis of a preferred embodiment, the embodiment is intended to illustrate the invention, not to limit it. It will be apparent to those skilled in the art that various changes, modifications, and adjustments can be made to the embodiment without departing from the technical spirit of the invention. The scope of protection of the invention is therefore limited only by the appended claims and should be construed to cover all such changes, modifications, and adjustments.

As described above, according to the present invention, the use of semantically extended ontology-based data and synonyms allows the detailed information of a target product to be extracted accurately. In addition, locating the target HTML document from the target URL information shortens the time needed to find it, and deleting unnecessary tag information in the preprocessing procedure reduces the amount of data to be processed, thereby shortening the time required to extract detailed information.

Claims

A method for extracting detailed information about an internet shopping mall product using a detailed information extraction system having a knowledge database (a) and a database (b), the method comprising:

a first step of searching for a HyperText Markup Language (HTML) page related to the internet shopping mall product using the URL (Uniform Resource Locator) information of the product;

a second step of preprocessing the retrieved HTML page into a document composed of specific tags and plain strings;

a third step of comparing the classification names and their synonyms stored in the knowledge database (a) with the strings in a line read from the preprocessed document to search for the closest synonym;

a fourth step of determining, if the synonym found in the third step is present in the line that was read, whether a corresponding ontology value exists by comparing the ontology values stored in the knowledge database (a) with the corresponding string;

a fifth step of storing and updating the classification name and the ontology value in the database (b) if the determination in the fourth step finds that a corresponding ontology value exists; and

a sixth step of extracting detailed information on the internet shopping mall using the stored and updated database (b).

The method of claim 1, wherein the second step comprises:

splitting the HTML page into lines and then removing, line by line, the HTML tags other than the specific tags; and

storing and updating in the database (b) the lines that include only the specific tags and the lines that include no HTML tags.

(Deleted)

The method of claim 1, further comprising:

reading the document preprocessed in the second step and searching each line for a "(" character and a ")" character;

extracting the string between the "(" and ")" characters found, and then searching the extracted string for a "-" character; and

splitting the string on the "-" character found and, if the resulting strings consist of a combination of alphabetic characters and digits, recognizing them as model names and storing and updating them in the database (b).

The method of claim 1, wherein the specific tags include the <TR, <P, and <BR tags.

A computer-readable recording medium having recorded thereon a program for executing, on a computer, a method of extracting detailed information about an internet shopping mall product using a detailed information extraction system having a knowledge database (a) and a database (b).