KR20000037595A - System and method for automatically indexing product information of online stores

Info

Publication number: KR20000037595A
Application number: KR1019980052222A
Authority: KR
Inventors: 강대기; 이제선; 함호상; 박상봉
Original assignee: 정선종; 한국전자통신연구원
Priority date: 1998-12-01
Filing date: 1998-12-01
Publication date: 2000-07-05
Also published as: CN1255680A; JP2000172722A; KR100283103B1

Abstract

PURPOSE: A system and a method for automatically indexing product information of online stores are provided to enable online comparison shopping across stores on the web, by fetching product information including price data, storing it as one file, filtering noise from the file, and then automatically extracting the product information from the filtered file. CONSTITUTION: An electronic transaction information collector (11) gathers hypertext markup language (HTML) documents of online stores that include product information and stores them in an HTML document store (12). An HTML filter (13) filters the gathered documents, and a price information organizer (14) converts the filtered documents into a form suitable for the product information extraction subsystem, which includes a structured information organizer (15) and a heuristic interpreter (16). The organizer (15) extracts the product information by calling an analysis module if the type of a document corresponds to a previously analyzed type. The interpreter (16) extracts the product information from the documents from which the organizer (15) has failed to extract it. The extracted product information is stored in a price information data store (21).

Description

Method and system of automatic indexing of product information in online store

The present invention relates to a method and system for automatically indexing product information on online stores, which automatically extracts the product catalogs presented by online stores on the web so that online comparison shopping across multiple stores becomes possible.

In general, the coming electronic commerce environment differs fundamentally from the existing distribution order, and the existing price structure will change accordingly. In particular, given the nature of electronic commerce on the web and the Internet, which connect the entire world, price differences for a given product apply not only within one country but worldwide. In this situation, when the product a buyer wants shows little variation in quality, the buyer will want to purchase from the store that offers the most reasonable price. However, because the number of online stores grows exponentially, buyers have difficulty finding the products they want.

There are three conventional approaches. The first is to use existing search engines. This approach is unsuitable in many respects: it returns pages entirely unrelated to the product being sought, as well as spammed pages, and it ranks documents by the criteria of general information retrieval methodology rather than by price, which is the most important of the value criteria for choosing a product. The second is a directory service or search engine dedicated to product information retrieval and maintained by hand. Because it is maintained by people it can be fine-grained, but maintaining each store individually becomes difficult as the number of stores grows. The third is to connect to the stores in real time and fetch product information without using a database or a separate data store. This approach has the advantage of presenting the most reliable information, but even when parallel query techniques are used the search is slow and adds to network traffic.

Accordingly, an object of the present invention is to provide a method and system for automatically indexing product information on online stores that can overcome the above drawbacks by fetching product information, including prices, from online stores on the web through mirroring robot software, storing it as a single file, and extracting product information containing price information from documents that have been denoised and filtered in a preprocessing stage.

To achieve the above object, the system for automatically indexing product information on online stores according to the present invention comprises: mirroring robot software that acts as an electronic transaction information collector and store, navigating online stores on the web to collect and store hypertext markup language (HTML) documents containing product information; an HTML filter for removing unnecessary information from the documents stored in the HTML document store; a structured information organizer for determining the type of the information passed through the HTML filter and, when it corresponds to an already analyzed type, calling the analysis modules for that type to extract product information; a heuristic interpreter for extracting product information from documents that contain price information but that the structured information organizer failed to analyze; a noun dictionary table in which existing product information is stored; a noun dictionary manager for providing the information in the noun dictionary table to the structured information organizer and the heuristic interpreter and for maintaining the noun dictionary table; a price information data store for storing the product information extracted by the structured information organizer and the heuristic interpreter; a product information table generator for generating a database table from the price information data held in the price information data store; and a product information table for storing the price information data generated by the product information table generator.

To achieve the above object, the method for automatically indexing product information on online stores according to the present invention comprises: fetching the HTML documents of online stores with a robot; a preprocessing step of finding price information in the HTML documents and removing unnecessary information while keeping only the price information and other necessary information; reading the result of the preprocessing step and determining the page type; extracting product information by applying the algorithm suited to each determined type; and extracting product information by blind search from the price information that remains after the type-specific algorithms have been applied.

Looking at the conventional techniques, first, an approach based on a traditional search engine has difficulty satisfying the buyer's criteria. A buyer may have many criteria, but what is needed is the ability to specify to the search service a more reasonable price or the product specification the buyer wants. The present invention solves this technical problem by defining a product as a record 〈site ID, company name, product category name, main function, product name, model name, price, URL〉 and making each record searchable. Second, a search service maintained by hand has difficulty keeping a large number of stores up to date. In the present invention each step is performed automatically, so this problem does not arise. Third, real-time product information search can suffer from long search times. The present invention works with the web server through FastCGI and retrieves results directly from the database, without a separate initialization delay or network delay, and therefore responds quickly.
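The eight-field record and the price-oriented lookup it enables can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `ProductRecord`, the function `cheapest_offers`, the English field names, and the choice of an integer price in a single currency are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ProductRecord:
    """One product, following the eight-field record named in the text."""
    site_id: str
    company_name: str
    category_name: str
    main_function: str
    product_name: str
    model_name: str
    price: int  # assumption: integer amount in a single currency
    url: str


def cheapest_offers(records: Iterable[ProductRecord], model_name: str,
                    limit: Optional[int] = None) -> List[ProductRecord]:
    """Return the records for one model, cheapest first (hypothetical helper)."""
    wanted = model_name.lower()
    matches = [r for r in records if r.model_name.lower() == wanted]
    matches.sort(key=lambda r: r.price)
    return matches if limit is None else matches[:limit]
```

Sorting by the `price` field is what a general-purpose search engine, which ranks by textual relevance, does not offer.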

The key to solving the problem raised by the existing methods, namely finding a reasonable price across many stores, is to locate the documents in which online stores list their prices and to extract product information from them automatically. To this end, the present invention converts the documents fetched by the mirroring robot software into a form that the product information extractor can process easily, which reduces the burden on the extractor. Document denoising and filtering techniques are used for this conversion. The product information extractor then extracts product information records from the converted documents by an analysis that depends on the type of the product information page.

FIG. 1 is a block diagram of the system for automatically indexing product information on online stores according to the present invention.

FIG. 2 is a data flow diagram of the HTML filter used for preprocessing according to the present invention.

FIG. 3 is a data flow diagram of the structured information organizer used for product information extraction according to the present invention.

<Explanation of reference numerals for the main parts of the drawings>

11: electronic transaction information collector

12: HTML document store

13: HTML filter

14: price information organizer

15: structured information organizer

16: heuristic interpreter

17: noun dictionary table

18: noun dictionary manager

19: product information table

20: product information table generator

21: price information data store

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of the system for automatically indexing product information on online stores according to the present invention. The electronic transaction information collector (11) navigates online stores on the web, collects the Hyper Text Markup Language (hereinafter HTML) documents that contain product information, assembles them in the HTML document store (12), and passes them to the preprocessing stage. Preprocessing is performed by the HTML filter (13) and the price information organizer (14). Unnecessary documents are first filtered out of the collected HTML documents by the HTML filter (13), and the price information organizer (14) then converts the remainder into a form suitable for use by the product information extraction subsystem. The product information extraction subsystem consists of the structured information organizer (15) and the heuristic interpreter (16). The structured information organizer (15) determines the type of each incoming document and, when it corresponds to an already analyzed type, calls the analysis modules for that type to extract product information. Documents that contain price information but that the structured information organizer (15) failed to analyze are passed to the heuristic interpreter (16), which extracts their product information.

The structured information organizer (15) is built on data obtained by statistically analyzing web pages that contain price information and classifying them by type. According to this classification, such pages fall, by the way product information is laid out, into a catalog-like summary type, an itemized detail type, and a narrative detail type. By the way HTML tables are used, they fall into tables with header information, tables without header information, lists rather than tables, and simple enumerations that use neither tables nor lists. Where a table or list is used, each element may be in simple form, in a form combined with header information, or in a form combining two or more data items. A single online store may have one or more of these page types.
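The type determination and the fall-back from type-specific analysis to heuristic interpretation can be sketched as below. This is an illustration under stated assumptions: the type labels, the functions `classify_layout` and `extract_products`, and the rule that a header is recognized by a `<th>` tag are hypothetical, and the patent's statistically derived classification is not reproduced. Only the table/list/plain axis is modeled.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List

Record = Dict[str, str]
Analyzer = Callable[[str], List[Record]]


class _TagCollector(HTMLParser):
    """Collects the set of tag names that occur in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        self.seen.add(tag)  # HTMLParser already lowercases tag names


def classify_layout(html: str) -> str:
    """Assign one of four assumed layout labels to a page."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    if "table" in collector.seen:
        # Assumption: a header row is marked up with <th>.
        return "table_with_header" if "th" in collector.seen else "table_without_header"
    if collector.seen & {"ul", "ol", "dl"}:
        return "list"
    return "plain"


def extract_products(html: str, analyzers: Dict[str, Analyzer],
                     heuristic: Analyzer) -> List[Record]:
    """Try the analyzer registered for the page type, else fall back."""
    analyzer = analyzers.get(classify_layout(html))
    if analyzer is not None:
        records = analyzer(html)
        if records:
            return records
    # No analyzer for this type, or it found nothing: heuristic path.
    return heuristic(html)
```

Keeping the analyzers in a dictionary keyed by type means a newly analyzed page type can be supported by registering one more function, without touching the dispatch logic.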

The structured information organizer (15) and the heuristic interpreter (16) analyze product information more effectively with the help of a noun dictionary containing existing, frequently used product information. The noun dictionary is implemented as the noun dictionary table (17) and is maintained by the noun dictionary manager (18).

The product information extracted by the above process is stored in the price information data store (21). The stored price information data is then written to the product information table (19) by the product information table generator (20).
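As a rough sketch of the table generation step, the following uses Python's standard `sqlite3` module. The table name `product_info`, the column names, and the functions `build_product_table` and `cheapest` are assumptions for illustration; the patent does not name a database product or schema beyond the eight record fields.

```python
import sqlite3
from typing import Iterable, List, Tuple

# site ID, company, category, main function, product, model, price, URL
Row = Tuple[str, str, str, str, str, str, int, str]


def build_product_table(conn: sqlite3.Connection, rows: Iterable[Row]) -> int:
    """Create the table if needed, load the rows, and return the row count."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS product_info (
               site_id TEXT, company_name TEXT, category_name TEXT,
               main_function TEXT, product_name TEXT, model_name TEXT,
               price INTEGER, url TEXT)"""
    )
    conn.executemany(
        "INSERT INTO product_info VALUES (?, ?, ?, ?, ?, ?, ?, ?)", list(rows)
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM product_info").fetchone()[0]


def cheapest(conn: sqlite3.Connection, model_name: str) -> List[tuple]:
    """Comparison query: offers for one model, lowest price first."""
    return conn.execute(
        "SELECT site_id, price, url FROM product_info "
        "WHERE model_name = ? ORDER BY price ASC",
        (model_name,),
    ).fetchall()
```

Because queries are answered from the local table, no store has to be contacted at search time, which is the contrast the text draws with real-time retrieval.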

FIG. 2 is a data flow diagram of the HTML filter used for preprocessing according to the present invention, and illustrates the HTML filter (13), which is the core of the preprocessing stage. From each HTML document collected by the electronic transaction information collector (11), the document ID (Doc ID) (31), the URL (Uniform Resource Locator) (32), the hyperlinks (36), and the table information (35) are extracted. In addition, after scripts (33) and unnecessary tags have been removed, the price information is extracted (34). The extracted price information serves as the criterion for identifying product information during interpretation by the heuristic interpreter (16).
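The data flow of FIG. 2 can be approximated with Python's standard `html.parser`. The sketch below is an assumption-laden illustration: the class `HTMLFilter`, the function `filter_document`, the output dictionary keys, and above all the price pattern `PRICE_RE` (digits followed by 원, or preceded by $ or ₩) are invented for the example, and nested tables are not handled.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List

# Assumed price pattern; real stores would need a wider one.
PRICE_RE = re.compile(r"(?:[$₩]\s*\d[\d,]*(?:\.\d+)?|\d[\d,]*\s*원)")


class HTMLFilter(HTMLParser):
    """Collects hyperlinks, table cells, and visible text; skips scripts."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []
        self.tables: List[List[List[str]]] = []  # tables -> rows -> cells
        self.text: List[str] = []
        self._skip = 0          # depth inside <script>/<style>
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "table":
            self.tables.append([])
            self._in_cell = False
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
            self._in_cell = False
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        chunk = data.strip()
        if self._skip or not chunk:
            return
        self.text.append(chunk)
        if self._in_cell:
            row = self.tables[-1][-1]
            row[-1] = (row[-1] + " " + chunk).strip()


def filter_document(doc_id: int, url: str, html: str) -> Dict[str, object]:
    """Return the items FIG. 2 names: Doc ID, URL, links, tables, prices."""
    parser = HTMLFilter()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text)
    return {"doc_id": doc_id, "url": url, "links": parser.links,
            "tables": parser.tables, "prices": PRICE_RE.findall(text)}
```

Prices are searched only in the text that survives script removal, so a number that appears inside a script is not mistaken for a listed price.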

FIG. 3 is a data flow diagram of the structured information organizer used for product information extraction according to the present invention. For each token of an HTML document preprocessed by the HTML filter (13), the noun dictionary manager (18) is consulted to determine where the header information of a table or list is located. When the header information matches the token data it is determined explicitly; otherwise it can be determined implicitly by looking up product information already held in the noun dictionary. Once the header information has been interpreted by the header information interpreter (41), the table data interpreter (42) interprets the table data accordingly. When table data is extracted repeatedly, the extracted data becomes a candidate for product information. The product information validator (44) checks the validity of each candidate, and invalid product information is discarded. This double check raises the quality of the extracted information so that only correct product information remains. Once product information has been extracted, the record array synchronizer (43) synchronizes adjacent records so that the result can finally be reassembled as an array of records. The interpreted product information is stored in the database by the product information table generator (20) in the form 〈site ID, company name, product category name, main function, product name, model name, price, URL〉.
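The header interpretation, table data interpretation, and validation steps of FIG. 3 can be sketched as follows. Everything named here is hypothetical: the two dictionaries stand in for the noun dictionary table, the price pattern and the validity rule (a positive price plus a model or product name) are assumptions, and the record array synchronizer (43) is not modeled.

```python
import re
from typing import Dict, List, Optional, Tuple

# Stand-ins for the noun dictionary: header words and known product nouns.
HEADER_NOUNS = {"model": "model_name", "product": "product_name", "price": "price"}
KNOWN_NOUNS = {"x-100": "model_name", "laser printer": "product_name"}

# Assumed price format: a currency mark is required on one side.
PRICE_RE = re.compile(r"^(?:[$₩]\s*(\d[\d,]*)|(\d[\d,]*)\s*원)$")


def parse_price(cell: str) -> Optional[int]:
    match = PRICE_RE.match(cell.strip())
    if not match:
        return None
    return int((match.group(1) or match.group(2)).replace(",", ""))


def interpret_header(rows: List[List[str]]) -> Tuple[List[Optional[str]], int]:
    """Return (field per column, index of the first data row)."""
    if not rows:
        return [], 0
    explicit = [HEADER_NOUNS.get(cell.strip().lower()) for cell in rows[0]]
    if any(explicit):
        return explicit, 1          # explicit: first row is a header
    implicit: List[Optional[str]] = [None] * max(len(r) for r in rows)
    for row in rows:                # implicit: infer fields from the data
        for i, cell in enumerate(row):
            if implicit[i] is None:
                if parse_price(cell) is not None:
                    implicit[i] = "price"
                else:
                    implicit[i] = KNOWN_NOUNS.get(cell.strip().lower())
    return implicit, 0


def extract_records(rows: List[List[str]]) -> List[Dict[str, object]]:
    """Interpret each data row, then keep only candidates that validate."""
    fields, start = interpret_header(rows)
    records: List[Dict[str, object]] = []
    for row in rows[start:]:
        candidate: Dict[str, object] = {
            f: c.strip() for f, c in zip(fields, row) if f
        }
        price = parse_price(str(candidate.get("price", "")))
        has_name = candidate.get("model_name") or candidate.get("product_name")
        if price is None or price <= 0 or not has_name:
            continue                # validator: discard invalid candidates
        candidate["price"] = price
        records.append(candidate)
    return records
```

The candidate and validation passes are kept separate so that a row which happens to sit under a recognized header is still rejected when it carries no usable price or name.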

In other words, the present invention enables comparison purchasing across online stores on the web by performing the steps of: fetching the HTML documents of online stores with a robot; a preprocessing step of finding price information in the HTML documents and removing unnecessary information while keeping only the price information and other necessary information; reading the result of the preprocessing step and determining the page type; extracting product information by applying the algorithm suited to each determined type; and extracting product information by blind search from the price information that remains after the type-specific algorithms have been applied.
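The text does not spell out how the blind search over the remaining price information works. One possible reading, offered purely as an assumption, is to treat each leftover price as an anchor and look backward through the preceding tokens for the nearest noun known to the noun dictionary. The function `blind_search`, the `window` parameter, and the price pattern below are all hypothetical.

```python
import re
from typing import Dict, List, Set

# Assumed token-level price pattern (currency mark required).
PRICE_TOKEN_RE = re.compile(r"^(?:[$₩]\d[\d,]*|\d[\d,]*원)$")


def blind_search(tokens: List[str], known_nouns: Set[str],
                 window: int = 5) -> List[Dict[str, str]]:
    """Pair each price token with the nearest preceding known noun."""
    records: List[Dict[str, str]] = []
    for i, token in enumerate(tokens):
        if not PRICE_TOKEN_RE.match(token):
            continue
        # Scan at most `window` tokens back from the price.
        for j in range(i - 1, max(i - window, 0) - 1, -1):
            if tokens[j].lower() in known_nouns:
                records.append({"product_name": tokens[j], "price": token})
                break
    return records
```

A price with no known noun inside the window yields no record, which keeps the fallback conservative at the cost of missing products absent from the dictionary.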

The present invention described above is not limited to the foregoing embodiment and the accompanying drawings, since those of ordinary skill in the art to which the invention pertains can make various substitutions, modifications, and changes without departing from the technical spirit of the invention.

As described above, the present invention takes the HTML documents that present product information on online stores, which are organized in a free-form manner, and passes them through document collection by a mirroring robot, a preprocessor for denoising and information filtering, a structured information interpreter based on a classification of the formats in which products are presented in HTML documents, and a heuristic interpreter based on the tendencies of where product information is located in HTML documents, thereby automatically extracting product information records for comparison shopping. This enables comparison purchasing across online stores on the web in product information search engines, meta search engines, shopping agents, push solutions, and the like.

Claims

A system for automatically indexing product information on online stores, comprising:

an electronic transaction information collector for navigating online stores on the web and collecting hypertext markup language (HTML) documents containing product information;

an HTML document store for storing the documents collected by the electronic transaction information collector;

an HTML filter for removing unnecessary information from the documents stored in the HTML document store;

a structured information organizer for determining the type of the information passed through the HTML filter and, when it corresponds to an already analyzed type, calling the analysis modules for that type to extract product information;

a heuristic interpreter for extracting product information from documents that contain price information but that the structured information organizer failed to analyze;

a noun dictionary table in which existing product information is stored;

a noun dictionary manager for providing the information in the noun dictionary table to the structured information organizer and the heuristic interpreter, and for maintaining the information in the noun dictionary table;

a price information data store for storing the product information extracted by the structured information organizer and the heuristic interpreter;

a product information table generator for generating a database table from the price information data stored in the price information data store; and

a product information table for storing the price information data generated by the product information table generator.

The system of claim 1,

wherein the structured information organizer is configured on the basis of data obtained by statistically analyzing web pages having price information and classifying them by type.

The system of claim 2,

wherein the web pages having price information are classified, according to the way product information is arranged, into a catalog-like summary type, an itemized detail type, and a narrative detail type, and, according to the way HTML tables are used, into a type in which the table has header information, a type in which the table has no header information, a type composed of a list rather than a table, and a simple enumeration type that uses neither a table nor a list.

A method for automatically indexing product information on online stores, comprising:

fetching the HTML documents of online stores with a robot;

a preprocessing step of finding price information in the HTML documents and removing unnecessary information while keeping only the price information and other necessary information;

reading the result of the preprocessing step and determining the page type;

extracting product information by applying the algorithm suited to each determined type; and

extracting product information by blind search from the price information that remains after the type-specific algorithms have been applied.