KR20020040145A - Selective Information Gathering System and Methods

Info

Publication number: KR20020040145A
Application number: KR1020000070086A
Authority: KR
Inventors: 유승현; 김영민
Original assignee: 김상배; 주식회사쓰리소프트
Priority date: 2000-11-23
Filing date: 2000-11-23
Publication date: 2002-05-30

Abstract

PURPOSE: A system and method for selectively collecting information are provided to extract only the desired information from information scattered across a network. CONSTITUTION: A manager user interface (100) designates the site to collect from, selects the information to be collected within that site, and designates where it is to be stored. That is, the manager user interface (100) creates rules covering the site information, the elements to collect, and the database information for storage, and stores the created rules in a database (300). A robot agent (200) connects to the target site on the basis of those rules, collects the information, and stores the collected information in the database (300). The database (300) stores the user interface information created by the manager user interface (100) and the information collected by the robot agent (200).

Description

Selective Information Gathering System and Methods

The present invention relates to a system and method for collecting information, and more particularly to an information collection system and method that extracts only the desired information from the information scattered across the Internet.

With the recent growth of the Internet, the number of content providers supplying information online is increasing, and the information they provide is scattered across the Internet. As the amount of information supplied over the Internet has surged, many people use search robots to find the information they need.

A conventional search robot collects HTML (HyperText Markup Language) files from the Internet in bulk and then sorts through them afterwards. Because the robot first gathers a large amount of related information, in which desired and undesired information are mixed, and only then picks out what the user wants from the collected material, this conventional method suffers from low search accuracy and long search times.

The technical problem addressed by the present invention is to solve these problems of the prior art by providing an information collection system and method that extracts only the desired information from the information scattered across a network.

FIG. 1 is a diagram showing an information system according to an embodiment of the present invention.

FIG. 2 is a diagram schematically showing a selective information collection method according to an embodiment of the present invention.

FIG. 3 is a diagram showing in detail a manager UI unit according to an embodiment of the present invention.

FIG. 4 is a diagram showing a category registration screen according to an embodiment of the present invention.

FIG. 5 is a diagram showing a category registration process according to an embodiment of the present invention.

FIG. 6 is a diagram showing a category collection screen according to an embodiment of the present invention.

FIG. 7 is a diagram showing a category collection process according to an embodiment of the present invention.

FIG. 8 is a diagram showing a parsing method according to an embodiment of the present invention.

FIG. 9A shows an example of ordinary HTML source code, and

FIG. 9B shows an example of tree nodes according to an embodiment of the present invention.

FIG. 10 is a diagram showing a rule generation screen according to an embodiment of the present invention.

FIG. 11 is a diagram showing a rule generation method according to an embodiment of the present invention.

FIG. 12 is a diagram showing in detail a robot agent according to an embodiment of the present invention.

FIG. 13 is a diagram showing an information retrieval method using the robot agent.

To achieve this object, a selective information collection method according to one feature of the present invention includes:

a first step of registering, as a category page, a page containing a list of the pages from which information is to be collected; a second step of registering, as a lower node of the category page, an information page listed on the category page that contains the content ultimately to be collected; a third step of registering as rules both the information about the content the user wants to collect from the information page and the location information for moving to the information page; and a fourth step of collecting the category pages to be collected from the site on the basis of the rules registered in the third step.

The selective information collection method according to one feature of the present invention may further include:

a fifth step of storing the rules registered in the third step in a database; and a sixth step of connecting to the site on the basis of the rules stored in the database, collecting the list of information pages, connecting to each information page, extracting the desired information, and storing the extracted information in the database.

The first step may include:

displaying the content of a site accessed by the user in a category URL view; and, when the user selects a page displayed in the category URL view, registering the content shown in the category URL view as a subcategory of the currently selected category.

The fourth step may include:

determining whether a category has been selected; determining whether the selected category has a registered subcategory; collecting the page of the selected category and setting the registered subcategory as a sample category; extracting URLs from the collected page of the selected category; and, when an extracted URL is similar to that of the sample category, creating a category that inherits from the sample category, setting its category attributes, and registering it.

The third step may include:

parsing the source code of the information page containing the content the user wants to collect to build a tree of nodes; and registering the location information within that tree as a rule.

The sixth step may include:

reading URL information from a URL information database and collecting the designated URL; reading rule information from a rule information database and building a parse tree; finding the elements in the parse tree that match the rules; and storing the found elements in a specific data structure.

A selective information collection system according to one feature of the present invention includes:

a manager user interface unit that designates the site to collect from, selects the collection elements to be gathered within the site, and creates rules for the database information in which the selected collection elements are to be stored; a robot agent that connects to the target site and collects information on the basis of the rules for the site information, collection elements, and database information designated by the manager user interface unit; and a database that stores the rule information created by the manager user interface unit and the information collected by the robot agent.

Here, the manager user interface unit may include:

a browsing unit that fetches resources from the Internet; a category management unit that registers and manages the categories to be collected; a category collection unit that automatically collects other categories on the same site, using the category registered by the category management unit as the reference; a parser that analyzes the resources fetched by the browsing unit and turns their tags into objects; and a rule generator that creates location information for the objects analyzed by the parser.

The database may include:

a site URL information database that stores information about site URLs; and a rule information database that stores location information for the elements to be collected from the sites.

The robot agent may include:

a document collection unit that reads site URL information from the URL information database and then reads the URL resources of each site over the network; a parser that parses the resources read by the document collection unit; and a rule filter that reads rule information from the rule information database, builds a parse tree, finds the elements in the parse tree that match the rules read from the rule information database, and stores them in a specific data structure.

Hereinafter, an embodiment of the present invention is described in detail with reference to the drawings.

As shown in FIG. 1, the information system according to an embodiment of the present invention includes a manager user interface (UI) unit 100, a robot agent 200, and a database 300.

The manager UI unit 100 designates the site to collect from, selects the information to be collected within that site, and designates where it is to be stored. That is, the manager UI unit 100 creates rules covering the site information, the elements to collect, and the database information for storage, and stores the created rules in the database 300.

The robot agent 200 connects to the target site on the basis of the rules for the site information, the elements to collect, and the DB information designated by the manager UI unit 100, collects the information, and stores it in the database 300.

The database 300 stores the UI information created by the manager UI unit 100 and the information collected by the robot agent 200.

The selective information collection method according to an embodiment of the present invention is outlined next.

FIG. 2 is a diagram schematically showing the information collection method according to an embodiment of the present invention.

According to an embodiment of the present invention, the user first decides which information to collect from which site, and registers the site by specifying the name, URL (uniform resource locator), group, and so on of the site from which information is to be collected (S100). Here, a group is designated in order to bundle sites of the same nature; depending on the group, the DB table in which collected data is stored may be set to the same table or to a different table.
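As an illustration of how a group could tie registered sites to a storage table, the following minimal sketch registers two sites in one group so that they share a table. The names `Site`, `GROUP_TABLES`, and `table_for`, the example URLs, and the table names are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass

# Hypothetical mapping from group name to DB table. Groups may share a table
# or use separate ones; here each group has its own.
GROUP_TABLES = {"computer": "computer_products", "appliance": "appliance_products"}


@dataclass(frozen=True)
class Site:
    """A registered site (step S100): name, URL, and group."""
    name: str
    url: str
    group: str


def table_for(site, group_tables=GROUP_TABLES):
    """Return the table in which data collected from this site is stored."""
    return group_tables[site.group]


sites = [
    Site("Shop A", "http://shop-a.example/", "computer"),
    Site("Shop B", "http://shop-b.example/", "computer"),
    Site("Shop C", "http://shop-c.example/", "appliance"),
]
tables = [table_for(s) for s in sites]
```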

Once the site is registered, the user navigates the registered site and registers, as a category, a page that lists the pages from which information is to be collected (S105). Here, a category page is a page that holds a list of the pages (information pages) from which information will be collected. For example, a page of a shopping mall site that shows a product list, such as a page listing refrigerators, is a category page.

Next, one information page is selected from the category page and registered as a lower node of the category (S110). Here, an information page is a web page that contains the content ultimately to be collected; for example, if product information is to be collected from a shopping mall, a page describing a particular product is an information page. The category pages and the information page are registered as a single vertical hierarchy, that is, in the form: first-level category -> second-level category -> third-level category -> ... -> information page.
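The vertical hierarchy described above can be pictured as a tree whose inner nodes are category pages and whose leaves are information pages. The sketch below is only an illustration under that assumption; the `Node` class, its fields, the `paths` helper, and the example URLs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A category page, or an information page when is_info_page is True."""
    name: str
    url: str
    is_info_page: bool = False
    children: list = field(default_factory=list)

    def add(self, child):
        # An information page ends the chain, so it cannot have lower nodes.
        if self.is_info_page:
            raise ValueError("an information page is a leaf")
        self.children.append(child)
        return child


def paths(node, prefix=()):
    """Yield each 'category -> ... -> information page' chain as a string."""
    here = prefix + (node.name,)
    if node.is_info_page:
        yield " -> ".join(here)
    for child in node.children:
        yield from paths(child, here)


root = Node("Home appliances", "http://shop.example/appliances")
fridges = root.add(Node("Refrigerators", "http://shop.example/appliances/fridges"))
fridges.add(Node("Fridge A", "http://shop.example/item/1", is_info_page=True))
chains = list(paths(root))
```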

Once the category page and the information page are registered (S105, S110), rules for the information to be collected are registered on the information page (S115). That is, the items to be collected from the information page, for example the product name, product price, manufacturer, and product description, are registered as rules.

Then the user moves to the category page immediately above the information page and registers the link information of the information page as a rule (S120). The category page contains a list of information pages, and this list carries the link information (for example, URLs) pointing to each information page. As described later, the rules created in steps S115 and S120 are used by the robot agent to reach the relevant pages and collect information.

Once the rules for the information to be collected and the rules for the link information (URLs) leading to the information pages have been registered in steps S115 and S120, all categories to be collected from the site are gathered (S125). As described later, each collected category is created by taking an already registered category as a sample category and inheriting all of its attributes, starting from categories at the same depth, so every newly collected category keeps exactly the rules that were already written.

When category collection is finished, the rules for the information to be collected and the rules for the link information (URLs) leading to the information pages are stored in the database 300 (S130). The rules may also be stored before the category collection step (S125).

After the rules for collecting the desired information have been stored in the database, the robot agent is started (S135). Once started, the robot agent loads the rules written for collecting the desired information from the database 300 into memory and prepares for collection (S140).

The robot agent reads the rules loaded in memory, connects to the site from which information is to be collected (S145), and collects the list of pages from which information must be gathered. When the list is complete, it uses the list to connect to the information pages that hold the information, extracts the desired information according to the registered rules, and stores it in the designated database (S150).

Once a round of information collection is complete, the robot agent collects the lists again and collects information from the information pages again, repeating the cycle.
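The control flow of steps S140 to S150 can be sketched as below. This is a minimal single-pass illustration, not the robot agent of the embodiment: the rule layout (`category_url`, `find_links`, `fields`, `table`), the injected `fetch` and `store` callables, and the in-memory `PAGES` stand-in for the network are all assumptions made for the example.

```python
def run_agent_once(rule_db, fetch, store):
    """One collection cycle: list pages first, then information pages."""
    rules = list(rule_db)                               # S140: load rules into memory
    for rule in rules:
        category_page = fetch(rule["category_url"])     # S145: connect to the site
        info_urls = rule["find_links"](category_page)   # collect the list of pages
        for url in info_urls:                           # S150: visit each information page
            info_page = fetch(url)
            record = {name: pick(info_page)
                      for name, pick in rule["fields"].items()}
            store(rule["table"], record)


# Hypothetical pre-parsed pages standing in for fetched documents.
PAGES = {
    "http://shop.example/fridges": {
        "links": ["http://shop.example/item/1", "http://shop.example/item/2"],
    },
    "http://shop.example/item/1": {"name": "Fridge A", "price": "500"},
    "http://shop.example/item/2": {"name": "Fridge B", "price": "650"},
}
RULES = [{
    "category_url": "http://shop.example/fridges",
    "table": "appliances",
    "find_links": lambda page: page["links"],
    "fields": {"name": lambda p: p["name"], "price": lambda p: p["price"]},
}]

saved = []
run_agent_once(RULES, PAGES.__getitem__, lambda table, rec: saved.append((table, rec)))
```

Calling `run_agent_once` repeatedly, for instance on a timer, would correspond to the repeated cycle described above.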

FIG. 3 is a diagram showing in detail the manager UI unit 100 according to an embodiment of the present invention.

As shown in FIG. 3, the manager UI unit according to an embodiment of the present invention includes a browsing unit 260, a category management unit 210, a category collection unit 220, an HTML parser 230, a rule generator 240, and a DB interface 250.

The browsing unit 260 fetches resources (HTML files and the like) from the Internet, and the category management unit 210 registers and manages the categories to be collected. The category collection unit 220 automatically collects other categories on the same site, using the category (classification criterion) registered by the category management unit 210 as the reference. The HTML parser 230 analyzes the resources fetched by the browsing unit 260 and turns their tags into objects, and the rule generator 240 creates location information for the analyzed objects. The DB interface 250 interfaces with the database 300.

The function of the category management unit 210 according to an embodiment of the present invention is described next in detail with reference to FIGS. 4 and 5.

FIG. 4 shows the category registration screen managed by the category management unit according to an embodiment of the present invention.

As shown in FIG. 4, the category registration screen consists of a category tree list view 211 on the left and a category URL view 212 on the right. The user navigates in the category URL view 212, just as when browsing the Internet in an ordinary browser, and moves to the page to be registered. When the page to be registered appears in the category URL view 212, the name and URL fields at the top are filled in automatically, as shown in FIG. 4. The name field receives the text under the mouse pointer as it is, and the URL field receives the URL under the pointer.

After confirming that the page to be registered is shown in the category URL view 212, the user clicks the register button on the category registration screen. The category is then displayed in the category tree list view 211 under its name at the desired position, that is, as a lower node ('Vacuum cleaners' in FIG. 4) of the currently selected node ('Home appliances' in FIG. 4). The category tree list view 211 also provides an interface for editing the attributes of a category.

The process of registering a category through the category URL view and the category tree list view is described in more detail with reference to FIG. 5.

When the user clicks the site registration menu on the registration screen shown in FIG. 4 (S200), the category management unit 210 displays a dialog box for site registration (S205). The user enters the site name, URL, and group name in the dialog box and attempts to connect to the site (S210, S215). Here, a group is something the user designates freely in order to bundle sites of the same nature. (For example, a user who wants to obtain computer product information from several shopping sites can enter 'computer' as the group and manage the results in a separate DB.)

When the connection to the desired site attempted in step S215 succeeds (S220), the category management unit 210 displays the content of the site in the category URL view 212 and fills the URL field of the category URL view with the current URL (S225, S230). If the user continues to navigate in the category URL view 212 and reaches another site, the URL field of the category URL view is likewise filled with the current URL (S235, S240, S245).

When the category register button is clicked in step S230, the category management unit 210 registers the content shown in the category URL view in the category tree list view 211 as a subcategory of the currently selected category (S255). That is, the category management unit 210 enters the category (page) to be registered in the category tree list view 211 under its name at the desired position, namely as a lower node ('Vacuum cleaners' in FIG. 4) of the currently selected node ('Home appliances' in FIG. 4).

The category management unit 210 then displays a category input dialog box and a sub-page setting dialog box (S260, S265).

The function of the category collection unit 220 according to an embodiment of the present invention is described next in detail with reference to FIGS. 6 and 7.

FIG. 6 shows the category collection screen managed by the category collection unit according to an embodiment of the present invention.

As shown in FIG. 6, when the user presses the category collection button 221 on the category collection screen, the category collection unit 220 takes the category already registered under the selected category (called the 'sample category') as the basis, searches the URLs on the selected category page, automatically extracts and collects the categories that follow a similar rule, and automatically registers the collected categories as subcategories.

Thus, according to an embodiment of the present invention, the user does not need to register every category by hand. The user only registers one or more categories at each level of the hierarchy (major, middle, minor, fine classification, and so on), selects the parent category of the categories to be collected, and clicks the category collection button; the category collection unit 220 then automatically collects and registers the subcategories of the selected category.

The category collection method performed by the category collection unit according to an embodiment of the present invention is described in more detail with reference to FIG. 7.

When the user clicks the category collection button (S300), the category collection unit 220 determines whether a category has been selected (S305). If no category has been selected in step S305, the category collection unit 220 displays a 'no category selected' message box (S310). If a category has been selected in step S305 (for example, 'Home appliances' in FIG. 6), the category collection unit 220 determines whether it has a registered subcategory (S315).

If a registered subcategory exists in step S315 (for example, 'Vacuum cleaners' in FIG. 6), the category collection unit 220 collects the web page of the selected category and sets the registered subcategory as the sample category (S320, S325).

The category collection unit 220 then determines whether the collected category page contains a URL to extract (S335); if so, it determines whether the extracted URL is similar to that of the sample category set in step S325 (S340). If the URL extracted in step S340 is similar to the sample category's URL, the category collection unit 220 creates a category that inherits from the sample category, sets its attributes, and registers it in the category tree list view (S345, S350, S355). In FIG. 6, 'Vacuum cleaners' is the sample category, and 'Gas ranges', 'Refrigerators', 'Microwave ovens', and others with a similar URL structure were collected as categories.
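Steps S340 to S355 can be sketched as follows. The specification does not define what makes two URLs "similar", so this example assumes one plausible criterion: the same host, the same path, and the same set of query parameter names. The function names, the dictionary layout of a category, and the example URLs are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def url_shape(url):
    """Reduce a URL to (host, path, sorted query parameter names)."""
    parts = urlparse(url)
    keys = tuple(sorted(parse_qs(parts.query, keep_blank_values=True)))
    return (parts.netloc, parts.path, keys)


def is_similar(sample_url, candidate_url):
    # Assumed similarity test: identical shape, but not the sample itself.
    return candidate_url != sample_url and url_shape(sample_url) == url_shape(candidate_url)


def collect_categories(sample, links):
    """Create one category per similar link, inheriting the sample's attributes."""
    collected = []
    for name, url in links:
        if is_similar(sample["url"], url):
            # Copy every attribute of the sample, then override name and URL.
            collected.append(dict(sample, name=name, url=url))
    return collected


sample = {
    "name": "Vacuum cleaners",
    "url": "http://shop.example/list.asp?cat=101",
    "rules": ["1.2.1"],
}
links = [
    ("Refrigerators", "http://shop.example/list.asp?cat=102"),
    ("Login", "http://shop.example/login.asp"),
    ("Gas ranges", "http://shop.example/list.asp?cat=103"),
]
found = collect_categories(sample, links)
```

Because each collected category is a copy of the sample with only the name and URL replaced, it carries the same rules as the sample, which mirrors the inheritance described above.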

If, on the other hand, there is no URL to extract in step S335, it is determined whether there is another category to move to (S360). If there is none, the process ends; if there is one (for example, 'Furniture', not shown in FIG. 6), the next category is selected and the process repeats from step S315.

다음은 도8, 도9a 및 도9b를 참조하여 본 발명의 실시예에 따른 HTML파서(230)의 기능을 상세히 설명한다.Next, referring to Figures 8, 9a and 9b will be described in detail the function of the HTML parser 230 according to an embodiment of the present invention.

본 발명의 실시예에 따른 HTML 파서(230)는 도9a에 도시한 바와 같은 일반적인 HTML 소스코드를 파싱하여 도9b에 도시한 바와 같은 트리 노드를 만든다. 본 발명의 실시예에 따르면 트리에서의 노드 위치 정보를 이용하여 원하는 위치의 원하는 정보를 추출한다. 본 발명의 실시예에 따른 파싱 방법은 HTML뿐만 아니라, XML, SGML등의 일반적인 마크 업 언어(Markup Language)들에 적용 가능하며, 본 발명의 실시예에서는 HTML을 마크 업 언어의 예로서 설명한다.The HTML parser 230 according to the embodiment of the present invention parses the general HTML source code as shown in Fig. 9A to create a tree node as shown in Fig. 9B. According to an embodiment of the present invention, desired information of a desired location is extracted using node location information in the tree. The parsing method according to an embodiment of the present invention is applicable to general markup languages such as XML and SGML as well as HTML. In the embodiment of the present invention, HTML is described as an example of a markup language.

먼저, HTML 파서(230)는 파싱을 하고자 하는 문서를 읽어서 메모리에 로드한 후, 메모리로부터 한 문자씩 읽고 메모리로부터 읽은 문자가 '<' 문자인지(즉, 마크 업 언어 태그의 시작 표시)를 판단한다.(S400, S405, S410, S415) 상기 단계에서 메모리로부터 판독한 문자가 '<' 문자인 경우에는 태그 이름을 추출한 후 추출한 태그 이름이 올바른 태그인가를 판단한다. (S425, S430) 상기 단계 S415에서 메모리로부터 판독한 문자가 '<'가 아닌 경우나 상기 단계 S430에서 태그 이름이 올바른 태그가 아닌 경우에는 판독한 문자를 콘텐츠로 처리한다. (S420)First, the HTML parser 230 reads a document to be parsed and loads it into the memory, and then reads one character from the memory one by one and determines whether the character read from the memory is a '<' character (that is, a start mark of a markup language tag). (S400, S405, S410, S415) If the character read from the memory in the step is a '<' character, it is determined whether the extracted tag name is the correct tag after extracting the tag name. (S425, S430) If the character read from the memory in step S415 is not '<' or if the tag name is not a correct tag in step S430, the read character is processed as content. (S420)

If the tag name is determined to be a valid tag in step S430, the parser creates a new element, finds the tag string and the attributes of the tag to build a tag node, and names the node so that the tree structure can be formed (S435, S440).

Meanwhile, if the extracted tag is a new tag, the parser determines whether it is a tag that can appear as a child node of the element currently being worked on (S445). If it can, the tag is registered as a child of the current element (S450); if it cannot, the parent element is set as the current element (S455).
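The character scan of steps S400 to S430 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `tokenize`, the token tuples, and the `KNOWN_TAGS` set (a small assumed list of valid tag names) are hypothetical, and attribute extraction is omitted.

```python
# Assumed subset of valid tag names; a real parser would use a full list.
KNOWN_TAGS = {"html", "head", "title", "body", "table", "tr", "td", "img", "p"}


def tokenize(source):
    """Scan the document one character at a time and split it into tokens.

    Returns a list of ("start", name), ("end", name) and ("content", text).
    """
    tokens = []
    text = []  # characters accumulated as content
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "<":  # possible start of a tag
            end = source.find(">", i)
            if end != -1:
                inner = source[i + 1:end].strip()
                closing = inner.startswith("/")
                parts = inner.lstrip("/").split()
                name = parts[0].lower() if parts else ""
                if name in KNOWN_TAGS:  # valid tag: emit a tag token
                    if text:
                        tokens.append(("content", "".join(text)))
                        text = []
                    tokens.append(("end" if closing else "start", name))
                    i = end + 1
                    continue
        # Not '<', or not a valid tag: treat the character as content.
        text.append(ch)
        i += 1
    if text:
        tokens.append(("content", "".join(text)))
    return tokens
```

In this sketch an unrecognized tag such as `<b>` falls through to the content branch, which mirrors the rule that an invalid tag name is processed as content.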

Next, the process of building a tree according to an embodiment of the present invention is briefly described with reference to FIGS. 9A and 9B.

First, the HTML parser examines the HTML document shown in FIG. 9A one character at a time and extracts tag names. Because the first tag name extracted in FIG. 9A is html, the parser sets its location information to '1', as shown in FIG. 9B, and sets the html tag as the node currently being worked on. The next tag in FIG. 9A is head, and since html can have child tags, the HTML parser registers head as a child node of the html node. Because head is registered as a child of the html node (node information 1), its location information is set to 1.1 by appending 1, as shown in FIG. 9B, and head becomes the node currently being worked on.

The next tag in FIG. 9A is title, the tag currently being worked on is head, and head is also a tag that can have child tags, so the HTML parser registers title as a child node of head and sets its location information to 1.1.1, as shown in FIG. 9B. The next tag is </title>, the end tag of title, so the parser sets the end-tag attribute of the title tag currently being worked on to true and sets the current node to head, the parent node of title. The tag after that is </head>, the end tag of head, so the parser likewise sets the end-tag attribute of the head node to true and sets the current node to html, the parent node of head. The next tag in FIG. 9A is body, so it is registered as a child node of html, and because body is the second child node of html, its location information is set to 1.2. Certain tags such as table can have child tags, whereas certain tags such as img cannot; by using this information and repeating the operations above, the document can be reconstructed into a tree of nodes carrying location information, as shown in FIG. 9B. In the embodiment, as described later, only the desired information can be extracted by using this location information.
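The numbering walk-through above (1, 1.1, 1.1.1, 1.2) can be reproduced with a short Python sketch. The `Node` class, `build_tree`, `locations`, and the `VOID_TAGS` set are hypothetical names introduced for illustration, and the sketch simplifies the child check to a single "cannot have children" list; it is one possible reading of the description, not the patented implementation.

```python
VOID_TAGS = {"img", "br", "hr"}  # assumed: tags that cannot have child tags


class Node:
    def __init__(self, name, location, parent=None):
        self.name = name
        self.location = location  # dotted location information, e.g. "1.2"
        self.parent = parent
        self.children = []
        self.closed = False  # the "end tag" attribute from the description


def build_tree(tags):
    """Build a tree from a list of ("start" | "end", tag_name) pairs."""
    root = None
    current = None
    for kind, name in tags:
        if kind == "start":
            if root is None:
                node = root = Node(name, "1")
            else:
                # The n-th child of a node appends ".n" to the parent's location.
                location = f"{current.location}.{len(current.children) + 1}"
                node = Node(name, location, current)
                current.children.append(node)
            if name in VOID_TAGS and node is not root:
                node.closed = True  # no children possible; stay on the parent
            else:
                current = node
        elif kind == "end" and current is not None and current.name == name:
            current.closed = True
            if current.parent is not None:
                current = current.parent  # move back up to the parent node
    return root


def locations(node):
    """Flatten the tree into a {location: tag name} mapping."""
    found = {node.location: node.name}
    for child in node.children:
        found.update(locations(child))
    return found
```

Feeding the tag sequence html, head, title, /title, /head, body into `build_tree` yields the locations 1, 1.1, 1.1.1, and 1.2 described for FIG. 9B.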

Next, the function of the rule generator 240 according to an embodiment of the present invention is described in detail with reference to FIGS. 10 and 11.

FIG. 10 shows a rule generation screen managed by the rule generator 240 according to an embodiment of the present invention.

According to the embodiment of the present invention, to collect specific information from HTML, the user works in the attribute editing window 242, as shown in FIG. 10. When the user selects a list from which information is to be collected in the category tree list view shown in FIG. 10 and clicks the attribute edit menu button, the page whose attributes are to be edited opens in the HTML view 241 on the right, and the attribute editing window 242 for editing rules opens at the lower right. When the user then moves the mouse to an element to be collected in the HTML view 241 and clicks to select it, the rule generator 240 displays the rule name, tag name, content, rule information, sub-rule name, and similar information for the selected area in the attribute editing window 242 at the bottom. If the information to be collected matches the information displayed in the attribute editing window 242 and the user clicks the rule registration button, the rule generator 240 registers the information displayed in the attribute editing window 242 as rule information. By repeating this operation, the user can keep creating rules for the information to be collected.

The rule generation method of the rule generator 240 according to an embodiment of the present invention is described in more detail with reference to FIG. 11.

The user selects the node for which a rule is to be created in the category tree view shown in FIG. 10 (S500) and then clicks the attribute edit menu (S505). The rule generator 240 accesses the URL of the selected node, retrieves the HTML, builds a parse tree for the retrieved HTML, and outputs it to the HTML view 241 (S510, S515, S520, S525).

When the user clicks a specific element displayed in the HTML view (S530), the rule generator 240 finds the corresponding node in the tree and then outputs the tag name, tag content, and rule information of the found node to the attribute editing window 242 (S535, S540, S545). When the user clicks the rule registration button for the information displayed in the attribute editing window, the rule generator 240 registers that information as a rule.
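As a rough illustration of steps S530 to S545, the Python sketch below looks up a node by its location information and registers it as a rule. How a mouse click is mapped to a tree node is not specified in the description, so the sketch simply assumes the clicked element's location string is already known; `Node`, `Rule`, `find_node`, and `register_rule` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    location: str  # dotted location information, e.g. "1.2.1"
    content: str = ""
    children: list = field(default_factory=list)


@dataclass(frozen=True)
class Rule:
    name: str      # rule name chosen by the user
    tag: str       # tag name of the selected element
    location: str  # location of the element in the parse tree


def find_node(root, location):
    """Depth-first search for the node with the given location."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.location == location:
            return node
        stack.extend(node.children)
    return None


def register_rule(rules, root, clicked_location, rule_name):
    """Find the clicked node in the tree and append a rule describing it."""
    node = find_node(root, clicked_location)
    if node is None:
        raise KeyError(clicked_location)
    rule = Rule(rule_name, node.tag, node.location)
    rules.append(rule)
    return rule
```

Storing only the tag name and location keeps the rule independent of the page's current text, so the same rule can be reapplied when the content at that location changes.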

Next, the function of the robot agent 200 according to an embodiment of the present invention is described in detail with reference to FIGS. 12 and 13.

The robot agent 200 reads from the database 300 the rule information that the manager UI unit 100 stored for selective collection. Specifically, the robot agent reads information about site URLs from the site URL information DB 310 and reads information about which parts of each site to collect from the site-specific rule information DB 320.

As shown in FIG. 12, the robot agent 200 includes a document collector 210, a parser 220, and a rule filter 230.

The document collector 210 first reads information about the sites to be visited from the site URL information DB 310 and then retrieves the URL of each site over the Internet or an intranet. The parser 220 parses the HTML document retrieved by the document collector 210. The rule filter 230 reads rule information from the site-specific rule information DB 320 and then builds an HTML parse tree. The rule filter then finds, in the parse tree, the elements that match the rules read from the site-specific rule information DB 320 and stores them in a specific data structure.

FIG. 13 shows a document collection method performed by the robot agent according to an embodiment of the present invention.

First, the robot agent 200 reads URL information from the URL information DB 310 and collects the designated URL (S600, S605, S610). The robot agent 200 then reads rule information from the rule information DB 320, builds an HTML parse tree, and determines whether the rule information read has reached the end of the parse tree (S620, S625).

If the rule information has not reached the end of the parse tree in step S630, the robot agent finds the element in the parse tree that matches the rule and stores the matching element in a specific data structure (S635, S640, S645). If the end of the tree is determined to have been reached in step S630, the selected information is stored at the designated location in the DB (S650), and the process is repeated.
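The collection loop of FIG. 13 can be sketched in Python as follows. This is only an illustration under stated assumptions: the parse tree is represented as a flat dictionary keyed by location information, rules are plain dictionaries, and `fetch_tree` is a hypothetical stand-in for the document collector and parser (no real network access or database is involved).

```python
def collect(urls, rules, fetch_tree):
    """Apply every rule to the parse tree of every URL.

    urls       -- iterable of URL strings (read from the URL information DB)
    rules      -- list of dicts with "name", "tag" and "location" keys
    fetch_tree -- hypothetical callable: url -> {location: {"tag", "content"}}
    Returns {url: {rule name: extracted content}}.
    """
    results = {}
    for url in urls:
        tree = fetch_tree(url)  # collect the page and build its parse tree
        record = {}
        for rule in rules:
            # Find the element at the rule's location and check its tag.
            element = tree.get(rule["location"])
            if element is not None and element["tag"] == rule["tag"]:
                record[rule["name"]] = element["content"]
        results[url] = record  # stands in for storing the data in the DB
    return results
```

Checking the tag name in addition to the location is a design choice of this sketch: it skips a rule when the page layout has changed and a different element now occupies that location.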

Although embodiments of the present invention have been described above, the present invention is not limited to these embodiments, and many modifications and variations are possible.

For example, the embodiments of the present invention have mainly described HTML as the markup language, but the invention can also be applied to XML and SGML.

As described above, according to the present invention, only the desired information can be extracted from the information scattered across a network, which makes both accurate and fast searching possible.

Claims

A method for selectively collecting information supplied through a network, the method comprising:

A first step of registering, as a category page, a page including a list of pages from which information is to be collected;

A second step of registering, as a lower node of the category page, an information page that most comprehensively includes the contents to be collected among the category pages;

A third step of registering, as a rule, information about the contents that a user wants to collect from the information page and location information for moving to the information page; and

A fourth step of collecting the category pages to be collected at a corresponding site based on the rules registered in the third step.

The method of claim 1, further comprising:

A fifth step of storing the rule registered in the third step in a database; and

A sixth step of accessing the site based on the rules stored in the database, collecting a list of information pages, accessing the information pages to extract the desired information, and storing the extracted information in the database.

The method of claim 1,

wherein the first step comprises:

Displaying contents of a site accessed by a user through a category URL view; and

When the user selects a page displayed through the category URL view, registering the content in the category URL view as a subcategory of the currently selected category.

The method of claim 1,

wherein the fourth step comprises:

Determining whether there is a selected category;

Determining whether there is a registered subcategory of the selected category;

Collecting pages of the selected category and setting the registered subcategories as sample categories;

Extracting a URL from the collected pages of the selected category; and

If the extracted URL is similar to the sample category, generating a category inherited from the sample category, setting its category attributes, and then registering the category.

The method of claim 4,

wherein the fourth step further comprises:

If there is no URL to be extracted from the collected pages of the selected category, determining whether there is a category to move to; and

Selecting the next category if there is a category to move to.

The method of claim 1,

wherein the third step comprises:

Creating a tree node by parsing the source code of the information that a user wants to collect from the information pages; and

Registering location information in the tree node as a rule.

The method of claim 6,

wherein creating the tree node comprises:

Extracting a tag name from the source code;

If the extracted tag is a valid tag, generating a new element and then creating a tag node by finding the tag string and the attributes of the tag;

Determining whether the extracted tag is a tag that can appear as a child node of the element currently being worked on; and

Registering the tag as a child of the current element if it can appear as a child, and setting the parent element as the current element if it cannot.

The method according to any one of claims 2 to 7,

wherein the sixth step comprises:

Reading URL information from a URL information database and collecting a designated URL;

Reading rule information from a rule information database to create a parse tree;

Finding an element in the parse tree that matches the rule; and

Storing the found element in a specific data structure.

A system for selectively collecting information supplied through a network, the system comprising:

A manager user interface unit for designating a site to be collected, selecting a collection element to be collected in the site, and generating a rule for the database information in which the selected collection element is to be stored;

A robot agent for accessing the site and collecting information based on the rules for site information, collection elements, and database information designated by the manager user interface unit; and

A database storing the rule information generated by the manager user interface unit and the information collected by the robot agent.

The system of claim 9,

wherein the manager user interface unit comprises:

A browsing unit for retrieving resources on the Internet;

A category manager for registering and managing categories to be collected;

A category collecting unit for automatically collecting other categories in the same site based on the category registered by the category manager;

A parser for analyzing the resources collected by the browsing unit and turning their tags into objects; and

A rule generator for generating location information of the objects analyzed by the parser.

The system of claim 9 or 10,

wherein the database comprises:

A site URL information database that stores information about site URLs; and

A rule information database that stores location information about the elements of the site that are to be collected.

The system of claim 11,

wherein the robot agent comprises:

A document collector for reading site URL information from the URL information database and retrieving the URL resources of each site through a network;

A parser for parsing a resource retrieved by the document collector; and

A rule filter for reading rule information from the rule information database, creating a parse tree, finding in the parse tree the elements that match the rules read from the rule information database, and storing them in a specific data structure.