KR20020084944A - Automatic information extraction method from a web text using mDTD rule

Info

Publication number: KR20020084944A
Application number: KR1020010024082A
Authority: KR
Inventors: 김동석; 서정연; 이근배
Original assignee: 김동석; 서정연; 이근배
Priority date: 2001-05-03
Filing date: 2001-05-03
Publication date: 2002-11-16
Also published as: KR100408855B1

Abstract

PURPOSE: A method for automatically extracting information from web documents using mDTD (modified Document Type Definition) grammar rules is provided, which conveniently and efficiently extracts large amounts of information from the vast information of a domain by using mDTD rules obtained through iterative machine learning. CONSTITUTION: The machine learning method comprises the steps of collecting web documents from the domain (S1), converting the web documents into text objects (S2), extracting example data from the text objects according to a previously written seed mDTD rule (S3), attaching morpheme tags to the example data (S4), and generating appropriate mDTD rules using the tagged example data (S5). The automatic extraction method comprises the steps of collecting web documents from the domain (S11), converting the web documents into text objects (S12), attaching morpheme tags to the text objects (S13), extracting a target by determining which of the mDTD rules generated by the machine learning process best fits the tagged text object (S14), and storing the extracted target in a domain database (S15).

Description

Automatic information extraction method from a web text using mDTD rule

The present invention relates to a method for automatically extracting information from web documents, and more particularly to such a method that uses simple mDTD grammar rules.

Web sites today offer an enormous amount of information on, for example, home appliances, travel, and finance. Specialized information providers collect this information and build databases from it so that it can be delivered to users who want specific information.

Under these circumstances, existing methods of extracting large amounts of information from an Internet domain for storage in a database are structurally complex, must be carried out by highly skilled engineers, and rely heavily on manual work, so extracting the required information takes a long time.

Moreover, as user preferences diversify, suppliers of manufactured goods and of various service products are increasingly building specialized databases. To extract the information to be stored in such databases flexibly, in varied forms, and quickly, a method is needed that extracts far more information than before while being more efficient and easier to use.

To solve these problems, an object of the present invention is to provide an easy-to-use method for automatically extracting web document information using mDTD grammar rules, which efficiently and automatically extracts large amounts of information from a domain. In this method, a simple mDTD grammar is defined; a simple seed mDTD rule is written for a specific domain in that grammar; examples are extracted from the domain using the seed mDTD rule; learned mDTD rules are generated through iterative machine learning; and the required data is extracted from the domain using the learned mDTD rules and stored in a database.

Another object of the present invention is to provide a system for automatically extracting web document information using mDTD grammar rules according to the above method, in order to build databases that can support data service systems in various applications such as e-commerce comparison shopping and product search.

FIG. 1 is a diagram showing the definition of the mDTD (modified document type definition) grammar according to the present invention.

FIG. 2 is a diagram showing seed mDTD rules for AV (audio and visual) products according to an embodiment of the present invention.

FIG. 3 is a diagram showing the algorithm of the machine learning methodology according to an embodiment of the present invention.

FIG. 4 is a diagram showing mDTD rules extended by machine learning, according to the algorithm of FIG. 3, from examples of AV products extracted using the seed mDTD rules.

FIG. 5 is a block diagram of a system for automatically extracting web document information using mDTD grammar rules according to an embodiment of the present invention.

FIG. 6 is a flowchart of a method for automatically extracting web document information using mDTD grammar rules according to an embodiment of the present invention.

To achieve the above objects, the present invention provides a method for automatically extracting web document information using mDTD grammar rules, comprising a machine learning method and an automatic extraction method. The machine learning method comprises: (a) collecting web documents from a domain; (b) a web document preprocessing step of converting the collected web documents into text objects; (c) extracting example data from the preprocessed text objects according to a pre-written seed mDTD rule; (d) a tag attaching step of attaching morpheme tags to the extracted example data; and (e) a machine learning step of generating mDTD rules appropriate for the domain using the tagged example data. The automatic extraction method comprises: (f) collecting web documents from the domain; (g) a web document preprocessing step of converting the collected web documents into text objects; (h) a tag attaching step of attaching morpheme tags to the preprocessed text objects; (i) an mDTD grammar rule applying step of extracting a target by determining which of the mDTD rules generated in the machine learning step best fits the tagged text object; and (j) storing the extracted target in a domain database.

In another aspect of the present invention, the machine learning step generates, from the extracted example data, the mDTD rule that represents the largest number of examples, deletes the examples covered by that rule, and repeats this sequence until all examples have been processed.

Another aspect of the present invention provides a system for automatically extracting web document information using mDTD grammar rules, comprising (a) a learner, (b) an automatic extractor, and (c) domain database storage means. The learner comprises: means for collecting web documents from a domain; a web document preprocessor for converting the collected web documents into text objects; an example extractor for extracting example data from the preprocessed text objects according to a pre-written seed mDTD rule; a tag attacher for attaching morpheme tags to the extracted example data; and a machine learner for generating mDTD rules appropriate for the domain using the tagged example data. The automatic extractor comprises mDTD grammar applying means, which uses the collecting means, web document preprocessor, and tag attacher of the learner to collect, preprocess, and tag web documents from the domain, and extracts a target by determining which of the mDTD rules generated by the machine learner best fits the resulting tagged text object. The extraction target is stored in the domain database storage means.

Hereinafter, the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 shows a grammar form, named mDTD, defined to extract only the necessary information from the vast amount of information in an Internet domain. Here, an Internet domain targeted for information extraction refers to information about a specific field published on the Internet, such as home appliances, travel information, financial information, or personal profile information, and the information to be extracted from a specific domain is called an extraction target. Taking personal profile information as an example, items of interest such as name, address, date of birth, and occupation are the extraction targets.

The mDTD grammar shown in FIG. 1 plays a role like that of a programming language that best describes a domain; it is defined in the manner of rewriting rules and is declarative in nature. The mDTD grammar definition is described in more detail below. The first line expresses the basic form of an mDTD grammar rule. A rule consists of a start indicator (<!) marking the beginning of the rule, a keyword, a rule name, opt, (content), occurrence_op, action, and an end indicator (>) marking the end of the rule.

As formally defined in FIG. 1, the keyword is one of TARGET, denoting an extraction target; ELEMENT, denoting an element; INSTANCE, denoting an instance object; and ENTITY, denoting lexical data.
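The rule form described above can be illustrated with a small parser. This is only a sketch: FIG. 1 is not reproduced in this text, so the concrete syntax assumed here (an optional bare word for opt, one of `?`, `*`, `+` for occurrence_op as in ordinary DTDs, and free text for action) is a guess, and the names `MdtdRule`, `parse_rule`, and the sample rule are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# The four keywords named in the description.
KEYWORDS = {"TARGET", "ELEMENT", "INSTANCE", "ENTITY"}

# ASSUMED concrete syntax (FIG. 1 is not available in this text):
#   <!KEYWORD name [opt] (content) [?|*|+] [action]>
RULE_RE = re.compile(
    r"<!\s*(?P<keyword>[A-Z]+)\s+"      # start indicator and keyword
    r"(?P<name>\w+)\s*"                 # rule name
    r"(?:(?P<opt>\w+)\s*)?"             # optional 'opt' field (assumed a bare word)
    r"\((?P<content>[^()]*)\)\s*"       # parenthesised content
    r"(?P<occurrence_op>[?*+])?\s*"     # optional occurrence operator (assumed DTD-like)
    r"(?P<action>[^>]*?)\s*>"           # optional action, then end indicator
)


@dataclass
class MdtdRule:
    keyword: str
    name: str
    opt: Optional[str]
    content: str
    occurrence_op: Optional[str]
    action: str


def parse_rule(text: str) -> MdtdRule:
    """Split one rule string into the fields listed in the description."""
    match = RULE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a rule in the assumed syntax: {text!r}")
    if match["keyword"] not in KEYWORDS:
        raise ValueError(f"unknown keyword: {match['keyword']!r}")
    return MdtdRule(**match.groupdict())


# Hypothetical rule, written only to exercise the parser.
rule = parse_rule("<!ELEMENT model (maker, name)+ extract>")
print(rule)
```

A regular expression is enough here because the rule is a flat sequence of fields; nested content models would need a real parser.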

For reference, the DTD grammar on which the mDTD grammar is based is defined at http://www.w3.org/xml/1998/06/xmlspec-report.

FIG. 2 shows seed mDTD rules written for the AV (audio and visual) product domain on the Internet using the mDTD grammar defined in FIG. 1. As shown in FIG. 2, the seed mDTD rules consist of rules for extracting the examples needed for machine learning from HTML documents in table form, and rules for extracting information about actual TV products. The seed mDTD rules must be written by the operator, but they allow examples to be obtained automatically from structured documents, and, through the machine learning described below, this basic framework can be applied in the same way to any domain.

This methodology of using seed mDTD rules is a general one that can be extended to almost any domain on the Internet, and is described in more detail below.

FIG. 3 shows the machine learning algorithm. The purpose of the machine learning is to gain the ability to handle a larger number of sites belonging to the domain: it extends seed mDTD rules such as those shown in FIG. 2 to generate mDTD rules that are used to extract information from almost all web sites belonging to the domain. The machine learning uses a sequential learning scheme, which repeatedly generates, one after another, the rule that matches the largest number of the input examples. The algorithm is described step by step below with reference to FIG. 3.

First, from the seed mDTD rules written by the operator and the examples extracted with those rules from table-form structured Internet documents, one rule representing the largest number of examples is generated. The examples covered by that rule are then deleted from the full set of examples, and the sequence of finding a representative rule and deleting its examples is repeated until all examples have been processed. Whether a rule covers an example is decided on the basis of string identity and grammatical morpheme identity. The applicant calls this machine learning method SML (Sequential mDTD Learning).
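The find-and-delete loop just described is a greedy sequential covering procedure, and a minimal sketch may make it concrete. The sketch below is not the algorithm of FIG. 3, which is not reproduced here. It assumes each example is a sequence of (string, morpheme tag) pairs and that a candidate rule is either the exact string sequence or the exact tag sequence of some example; those representations, the tag names, and the function names are all hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]                 # (surface string, morpheme tag)
Example = Tuple[Token, ...]
Rule = Tuple[str, Tuple[str, ...]]      # ("string" or "tag", sequence to match)


def candidates(example: Example) -> List[Rule]:
    """Two candidate rules per example: its string sequence and its tag sequence."""
    words = tuple(word for word, _ in example)
    tags = tuple(tag for _, tag in example)
    return [("string", words), ("tag", tags)]


def covers(rule: Rule, example: Example) -> bool:
    """A rule covers an example when the strings (or the tags) are identical."""
    kind, sequence = rule
    index = 0 if kind == "string" else 1
    return tuple(token[index] for token in example) == sequence


def sequential_learn(examples: List[Example]) -> List[Rule]:
    """Repeatedly pick the rule covering the most remaining examples, then delete them."""
    remaining = list(examples)
    rules: List[Rule] = []
    while remaining:
        pool = {rule for ex in remaining for rule in candidates(ex)}
        # sorted() makes tie-breaking deterministic; max() keeps the first best rule.
        best = max(sorted(pool), key=lambda r: sum(covers(r, ex) for ex in remaining))
        rules.append(best)
        remaining = [ex for ex in remaining if not covers(best, ex)]
    return rules


# Hypothetical tagged examples (the tag names are made up for illustration).
examples = [
    (("LG", "NNP"), ("TV", "NN")),
    (("Sony", "NNP"), ("TV", "NN")),
    (("29", "NUM"), ("inch", "NN")),
]
learned = sequential_learn(examples)
print(learned)
```

The loop always terminates because every candidate is derived from a remaining example and therefore covers at least that example, so each pass removes at least one example.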

FIG. 4 shows mDTD rules learned through the SML algorithm. The learned rules are added to the seed mDTD rules. They are added by creating a linking node (for example, a rule such as mInst1 shown in FIG. 4) and placing the actually learned result nodes beneath that linking node as its child nodes. A linking node is inserted in between when adding learned nodes in order to prevent many nodes from being attached directly to a single node. In addition, because the seed mDTD specifies the need for learning separately for each extraction target, the results are learned per extraction target.

FIG. 4 shows part of the learning results for the model name, product name, manufacturer, and product features in the AV domain. In general, however, the information users need is varied and fine-grained, so hundreds of rules may be learned per domain in order to obtain the various extraction targets.

FIG. 5 is a block diagram of a system that automatically extracts information from the Internet using mDTD rules. The system consists of a learner 10, an extractor 20, and a domain DB 30. The learner 10 consists of a web robot 11 that collects web documents from the domain, a web document preprocessor 12 that converts the collected web documents into text objects, an example extractor 13 that extracts examples from the preprocessed text objects, a morpheme tagger 14 that attaches morpheme tags to the extracted examples, and an SML learner 15.

The extractor 20 consists of a web robot 21, a web document preprocessor 22, and a morpheme tagger 24, which are the same as the corresponding components of the learner 10; an mDTD rule applier 25 that extracts targets by applying the mDTD rules learned by the learner 10; a target frame buffer 26 that stores the extraction targets; and a DB entry generator 27 that converts the stored targets into the data types of the database.

FIG. 6 is a flowchart of the method for automatically extracting web document information using mDTD rules. The method is described below with reference to FIGS. 5 and 6.

First, the mDTD rule machine learning method performed by the learner 10 begins at step 1, in which the web robot 11 downloads and collects web documents starting from initial URLs (uniform resource locators) associated with a specific domain. As explained above, a specific domain is a collective term for information about a specific field published on the Internet, so there are many URLs related to that field. Because the downloaded documents are encoded with HTML tags and may also contain some word-spacing errors, in step 2 the web document preprocessor 12 performs web document preprocessing, which corrects such errors and refines the documents into pure text objects.
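As an illustration of the preprocessing in step 2, the following sketch strips HTML markup and normalizes whitespace using Python's standard `html.parser` module, which tolerates unclosed tags. It is an assumption about what "refining into a pure text object" might involve, not the patented preprocessor; in particular it does not correct Korean word-spacing errors, and the names `_TextCollector` and `to_text_object` are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text and ignores the contents of script/style elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def to_text_object(html: str) -> str:
    """Return the text content of an HTML page with runs of whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return re.sub(r"\s+", " ", " ".join(collector.parts)).strip()


# Hypothetical table fragment with an unclosed <td>, as often found on real pages.
page = "<table><tr><td>LG  TV</td><td>29 inch</tr></table><script>x = 1</script>"
print(to_text_object(page))
```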

Next, in step 3, the example extractor 13 extracts example data from the preprocessed text objects according to the seed mDTD rules written in advance by the operator. In step 4, the morpheme tagger 14 attaches Korean morpheme tags to the extracted examples for use by the SML learner 15.

In step 5, the SML learner 15 uses the tag information and the string information itself to learn sequentially according to the algorithm shown in FIG. 3, and generates mDTD rules that are optimal for the domain. This completes the learning of the mDTD rules, and the generated mDTD rules are input to the extractor 20 in order to extract targets from the domain.

Next, the automatic information extraction method performed by the extractor 20 begins at step 11, in which the web robot 21 downloads and collects web documents. The documents downloaded by the web robot 21 are far more varied than those downloaded in the learner 10: the web robot 11 of the learner 10 downloads only structured documents in table form, whereas the web robot 21 also downloads general semi-structured documents related to the domain.

Because the web document preprocessor 22 and the morpheme tagger 24, which operate on the strings themselves, perform the same processing as the corresponding parts of the learner 10, the web document preprocessing step (step 12) and the morpheme tagging step (step 13) are carried out in the same way as steps 2 and 4, and the process proceeds to step 14.

In step 14, the mDTD rule applier 25 receives the mDTD rules machine-learned by the learner 10 and determines which of the learned mDTD rules best fits the text object input from the morpheme tagger 24. In step 15, the object so determined is stored in the extraction target frame buffer 26. The data stored in the target frame buffer 26 is treated purely as string data, with no distinction between numbers and characters.
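The text does not say how "best fit" is measured in step 14, so the sketch below simply assumes one plausible criterion: the fraction of token positions at which a rule's pattern equals the token's string or tag. The rule representation, the scoring function, the tag names, and the names `score` and `best_rule` are all hypothetical and are not taken from the patent.

```python
from typing import Optional, Sequence, Tuple

Token = Tuple[str, str]                       # (surface string, morpheme tag)
Rule = Tuple[str, str, Tuple[str, ...]]       # (target name, "string" or "tag", pattern)


def score(rule: Rule, tokens: Sequence[Token]) -> float:
    """ASSUMED fit measure: share of positions where pattern and token agree."""
    _, kind, pattern = rule
    if not pattern or len(pattern) != len(tokens):
        return 0.0
    index = 0 if kind == "string" else 1
    hits = sum(token[index] == expected for token, expected in zip(tokens, pattern))
    return hits / len(pattern)


def best_rule(rules: Sequence[Rule], tokens: Sequence[Token]) -> Optional[Rule]:
    """Return the highest-scoring rule, or None when nothing matches at all."""
    best = max(rules, key=lambda rule: score(rule, tokens), default=None)
    if best is None or score(best, tokens) == 0.0:
        return None
    return best


# Hypothetical learned rules and a tagged text object.
rules = [
    ("maker", "string", ("LG",)),
    ("model", "tag", ("NNP", "NUM")),
]
tokens = (("Sony", "NNP"), ("29", "NUM"))
chosen = best_rule(rules, tokens)
print(chosen)

# The target frame buffer keeps every value as a plain string, as the description states.
frame_buffer = {}
if chosen is not None:
    frame_buffer[chosen[0]] = " ".join(word for word, _ in tokens)
```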

In step 16, the DB entry generator 27 converts the string data to match the data types of the database schema, and in step 17 the converted data is stored in the domain database 30, completing the automatic information extraction.
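Step 16 amounts to a string-to-type conversion driven by the database schema. The following sketch shows one way that could look; the schema, the field names, and the helper names `parse_int` and `make_db_entry` are invented for illustration and do not come from the patent.

```python
import re
from typing import Any, Callable, Dict


def parse_int(text: str) -> int:
    """Keep only the digits, so '1,290,000' becomes 1290000."""
    digits = re.sub(r"[^0-9]", "", text)
    if not digits:
        raise ValueError(f"no digits in {text!r}")
    return int(digits)


# HYPOTHETICAL schema: field name -> converter from the buffered string.
SCHEMA: Dict[str, Callable[[str], Any]] = {
    "model": str.strip,
    "price": parse_int,
    "screen_inch": float,
}


def make_db_entry(frame: Dict[str, str],
                  schema: Dict[str, Callable[[str], Any]] = SCHEMA) -> Dict[str, Any]:
    """Convert the all-string target frame into typed values for the fields present."""
    return {field: convert(frame[field])
            for field, convert in schema.items()
            if field in frame}


# Every value in the target frame buffer is a string before conversion.
frame = {"model": " KN-29 ", "price": "1,290,000", "screen_inch": "29"}
entry = make_db_entry(frame)
print(entry)
```

Keeping the converters in a table separate from the extraction code means the same extractor can feed databases with different schemas.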

A database built in this way can serve many purposes. First of all, a per-product domain database can be built automatically, and on that basis backend services for systems such as price comparison and product search can be provided.

Furthermore, by specifying the need for learning per extraction target in various ways when writing the seed mDTD, the method can easily be extended to diverse fields such as person names, news articles, music, and movies, so its range of applications can be said to be unlimited.

According to the present invention, a simple mDTD grammar is defined, a simple seed mDTD rule is written for a specific domain in that grammar, examples are extracted from the domain using the seed mDTD rule, learned mDTD rules are generated through iterative machine learning, and the required data is extracted from the domain using the learned mDTD rules and stored in a database. The required information can thereby be extracted from the domain efficiently and automatically.

In addition, according to the present invention, the grammar used to extract data is provided as simple mDTD grammar rules, so the operator can use it easily. The method can therefore be readily extended to diverse fields such as person names, news articles, music, and movies, and its range of applications can be said to be unlimited.

In addition, according to the present invention, once the operator writes a simple seed mDTD for a specific Internet domain, those rules can be used, for example, to build a per-product domain database automatically, and on that basis backend services for systems such as price comparison and product search can be provided. The invention can therefore support data service systems in various applications such as e-commerce comparison shopping and product search.

Claims

1. A method for automatically extracting web document information using mDTD grammar rules, comprising a machine learning method and an automatic extraction method, wherein the machine learning method comprises:

(a) collecting a web document from a domain;

(b) a web document preprocessing step of converting the collected web document into a text object;

(c) extracting example data from the preprocessed text object according to a pre-written seed mDTD rule;

(d) a tag attaching step of attaching a morpheme tag to the extracted example data; and

(e) a machine learning step of generating mDTD rules appropriate for the domain using the tagged example data,

and wherein the automatic extraction method comprises:

(f) collecting a web document from the domain;

(g) a web document preprocessing step of converting the collected web document into a text object;

(h) a tag attaching step of attaching a morpheme tag to the preprocessed text object;

(i) an mDTD grammar rule applying step of extracting a target by determining which of the mDTD rules generated in the machine learning step best fits the tagged text object; and

(j) storing the extracted target in a domain database.

2. The method of claim 1, wherein step (e) generates, from the extracted example data, the mDTD rule representing the largest number of examples, deletes those examples, and repeats this sequence until all the examples have been processed.

3. The method of claim 1 or 2, wherein steps (b) and (g) comprise correcting tag errors and spacing errors occurring in the collected documents.

4. The method of any one of claims 1 to 3, wherein steps (a) and (f) are performed by a web robot, and the web documents collected in step (f) include a wider variety of semi-structured documents than the web documents collected in step (a).

5. The method of claim 1, further comprising, after step (i), a step (i') of storing the extracted target in an extraction target frame buffer, and a database entry generating step (i'') of converting the extraction target stored in the frame buffer into a data type of the database.

6. A learner for automatic web document information extraction using mDTD grammar rules, comprising:

means for collecting web documents from a domain;

a web document preprocessor for converting the collected web documents into text objects;

an example extractor for extracting example data from the preprocessed text objects according to a pre-written seed mDTD rule;

a tag attacher for attaching morpheme tags to the extracted example data; and

a machine learner for generating mDTD rules appropriate for the domain using the tagged example data,

wherein the machine learner generates, from the extracted example data, the mDTD rule representing the largest number of examples, deletes those examples, and repeats this sequence until all the examples have been processed.

7. An extractor for automatic web document information extraction using mDTD grammar rules, comprising:

means for collecting web documents from a domain;

a tag attacher for attaching morpheme tags to the preprocessed text object;

mDTD grammar applying means for extracting a target by determining which of the previously generated mDTD rules best fits the tagged text object;

a target frame buffer for storing the extracted target;

database entry generating means for converting the extraction target stored in the buffer into a data type of a database; and

domain database storage means for storing the converted data.

8. A system for automatically extracting web document information using mDTD grammar rules, comprising (a) a learner, (b) an automatic extractor, and (c) domain database storage means,

wherein the learner comprises:

means for collecting web documents from a domain;

a tag attacher for attaching morpheme tags to the extracted example data; and

wherein the automatic extractor comprises:

mDTD grammar applying means for extracting a target by determining which of the mDTD rules generated by the machine learner best fits a tagged text object produced by collecting, preprocessing, and tagging web documents from the domain using the collecting means, web document preprocessor, and tag attacher of the learner,

and wherein the extraction target is stored in the domain database storage means.

9. The system of claim 8, wherein the machine learner generates, from the extracted example data, the mDTD rule representing the largest number of examples, deletes those examples, and repeats this sequence until all the examples have been processed.

10. The system of claim 8 or 9, wherein the automatic extractor further comprises:

means for collecting web documents from the domain;

a web document preprocessor for converting the collected web documents into text objects; and

a tag attacher for attaching morpheme tags to the preprocessed text objects,

wherein the web documents collected by the collecting means of the automatic extractor include a wider variety of semi-structured documents than the web documents collected by the collecting means of the learner.

11. The system of claim 10, wherein the automatic extractor further comprises:

a target frame buffer for storing the target extracted by the mDTD grammar applying means; and

database entry generating means for converting the extraction target stored in the buffer into a data type of a database.