KR20090033149A

KR20090033149A - Semantic web based index method and search engine using the same

Info

Publication number: KR20090033149A
Application number: KR1020080095268A
Authority: KR
Inventors: 조광현
Original assignee: 주식회사 시맨틱스; 조광현
Priority date: 2007-09-27
Filing date: 2008-09-29
Publication date: 2009-04-01
Also published as: KR101044633B1

Abstract

A semantic web based indexing method and a search engine using the same are provided. Multiple indexes are built for a single web page through the semantic web, and the meaning of the keyword a user enters is searched, so that information matching the user's intention is provided quickly. An indexing unit (210) converts collected web pages into semantic web pages and analyzes each semantic web page at the word, paragraph, and article levels to generate a plurality of indexes for one web page. An index database (206) stores the indexes of each web page. When a user enters a search term, a search agent (204) searches the index database and processes document searches based on the semantic web. The indexing unit consists of static and dynamic web page gathering agents (211, 212), a filtering agent (213), and an analysis agent (214).

Description

Semantic web-based indexing method and search engine using the same {SEMANTIC WEB BASED INDEX METHOD AND SEARCH ENGINE USING THE SAME}

The present invention relates to web search technology and, more particularly, to a semantic web based indexing method that builds a search database using the semantic web, and to a search engine using the same.

In general, conventional portal (search) sites such as Naver, Dreamwiz, Daum, and Yahoo consist of a database for classifying and storing web site information according to predetermined criteria, a search robot that continuously traverses the web and mechanically collects new web site information, and a search engine that turns the collected data into a database so that users of the portal (search) site can search it. When a user of the portal (search) site enters a keyword, the search engine searches the database and provides a list of sites similar to the keyword.

FIG. 1 is a diagram showing the overall structure of a general search engine.

Referring to FIG. 1, an internet search engine is an information retrieval system that enables searching of documents on the web, and can be broadly divided into data collection (S1), indexing (S2), and search (S3) parts. In the data collection (S1) part, a document collection program (12), called a spider or crawler, follows link information to collect web documents stored on computers around the world that are connected to the World Wide Web (WWW) network (11), and stores them in a database (13).

In the indexing (S2) part, to speed up searching and reduce the volume of data to be stored, the index module (14) stores index information for the collected web documents in the index database (16).

In the search (S3) part, each time a searcher (17) enters the desired information, the search engine (18) searches the index information stored in the index database (16), and the ranking system (20) ranks the search results and provides them to the searcher (17) in ranked order. To improve search performance, the search engine (18) controls the spider program (12) through the spider control (19), and the index module (14) and the analysis module (15) analyze the collected web documents and perform the indexing.

Such internet search engines are classified by search method into directory search engines, keyword search engines, and meta search engines. A directory search engine builds its database by classifying materials by subject or category and adding descriptions and evaluations. A keyword search engine collects web documents with a web document collection program, indexes the collected documents and stores them in the search engine's database, and then retrieves the desired information by matching keywords against the user's query. A meta search engine gathers the results that other search engines return for the searcher's query and presents them to the searcher. The searcher therefore obtains a variety of results, and because the engine aggregates the results of existing search engines, it has the advantage of not needing internal storage for the data.
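The keyword-matching approach described above can be sketched in a few lines of Python. This is only an illustration of a conventional inverted index, assuming whitespace tokenization and AND matching; the function names and sample pages are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def keyword_search(index, query):
    """Return ids of documents that contain every query token (AND matching)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample pages, for illustration only.
documents = {
    "page1": "semantic web search engine",
    "page2": "keyword search engine for the web",
    "page3": "cooking recipes",
}
index = build_inverted_index(documents)
hits = keyword_search(index, "web search")
```

Here each page contributes exactly one flat set of tokens to the index, which is the one-index-per-page situation the description goes on to criticize.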

Meanwhile, the Semantic Web was proposed to automate tasks such as searching, interpreting, and integrating information by computer: it gives well-defined meaning to information on the web so that computers, as well as people, can easily interpret the meaning of documents.

Unlike existing web documents, which are written mainly in natural language, Semantic Web documents carry meaning that has been assigned so that computers can interpret it easily. Automated agents and sophisticated search engines can use this assigned meaning to achieve a high level of automation and intelligence.

In the Semantic Web, a resource is expressed as a triple of resource, attribute, and attribute value, and RDF (Resource Description Framework) is defined as the framework for resource representation. SPARQL is used as the query language for retrieving resource information expressed in RDF, and it is also used as a protocol for transmitting queries in a client-server environment.
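The triple form can be illustrated with a plain Python stand-in. This sketch does not use RDF syntax or a SPARQL engine; the `Triple` type, the `match` helper, and the example.org resources are assumptions introduced only to show how a pattern with wildcards selects triples, loosely in the spirit of a query over triples.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A (resource, attribute, attribute value) statement."""
    resource: str
    attribute: str
    value: str


def match(triples, resource=None, attribute=None, value=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (resource is None or t.resource == resource)
        and (attribute is None or t.attribute == attribute)
        and (value is None or t.value == value)
    ]


# Hypothetical statements about two pages.
triples = [
    Triple("http://example.org/page1", "title", "Semantic Web Primer"),
    Triple("http://example.org/page1", "topic", "economy"),
    Triple("http://example.org/page2", "topic", "culture"),
]

# "Which resources have topic = economy?"
economy_pages = [t.resource for t in match(triples, attribute="topic", value="economy")]
```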

The general goal of an information retrieval system is to grasp the user's intention accurately and, through efficient retrieval, deliver the information and documents the user requires from a large volume of stored information without omission.

However, conventional search engines such as Google have no separate additional indexing process between gathering and indexing; pages are indexed in ‘A, B, C, D…’ order, so each web page has a single index.

Consequently, a conventional search engine returns irrelevant web pages for the keyword a user enters, which makes searching inconvenient, and the user must repeat keyword entry several times before obtaining the desired information.

The present invention has been proposed to solve the above problems. An object of the present invention is to provide a semantic web based indexing method, and a search engine using the same, that build indexes for a single web page in various ways through the semantic web, enabling meaning search (Meaning Search) on the keyword a user enters so that information matching the user's intention can be retrieved quickly.

To achieve the above object, the search engine of the present invention comprises: an indexing unit that converts collected web pages into semantic web pages and then analyzes them at the word, paragraph, and article levels to generate a plurality of indexes for a single web page, the indexing unit consisting of a gathering agent that collects web pages distributed on the Internet, processes them into semantic web pages, and stores them in a semantic web page database, a semantic analysis agent that extracts words, paragraphs, and articles from the semantic web pages stored in the semantic web page database, performs frequency counting, relationship setting, and graph processing for each level, and stores the result in a semantic web analysis database, a filtering agent that filters the semantic web pages stored in the semantic web analysis database separately at the word, paragraph, and article levels and stores them in a semantic web filtered database, a personality analysis agent that assigns a personality to the filtered semantic web pages, and a classification analysis agent that classifies the semantic web data to which a personality has been assigned according to the percentage (%) of each personality; an index database that stores the indexes of the web pages generated by the indexing unit; and a search agent that searches the index database in response to a search term entered by a user and processes document searches based on the semantic web.

To achieve the above object, the indexing method of the present invention comprises: a web page collection step of collecting web pages distributed on the Internet, processing them into semantic web pages, and storing them in a semantic web page database; a semantic analysis step of extracting words, paragraphs, and articles from the semantic web pages stored in the semantic web page database, performing frequency counting, relationship setting, and graph processing for each level, and storing the result in a semantic web analysis database; a filtering step of filtering the semantic web pages stored in the semantic web analysis database separately at the word, paragraph, and article levels and storing them in a semantic web filtered database; a personality analysis step of assigning a personality to the filtered semantic web pages; and a classification analysis step of classifying the semantic web pages to which a personality has been assigned according to the percentage (%) of each personality. The collected web pages are thereby converted into semantic web pages and analyzed at the word, paragraph, and article levels to generate a plurality of indexes for a single web page.

The web page collection step consists of a static web page collection step of collecting static web pages on the Internet, which have a fixed source format, such as newspapers, forums, and editorials, and are operated by static rules, and a dynamic web page collection step of collecting dynamic web pages on the Internet, such as blogs and general web pages.

The semantic analysis step consists of: extracting articles from the collected semantic web pages and performing frequency counting, relationship setting, and graph processing; extracting paragraphs from the collected semantic web pages and performing frequency counting, relationship setting, and graph processing; and extracting words from the collected semantic web pages and performing frequency counting, relationship setting, and graph processing.

The filtering step consists of: refining out the data to be deleted from the articles of the analyzed semantic web pages; refining out the data to be deleted from the paragraphs of the analyzed semantic web pages; and refining out the data to be deleted from the words of the analyzed semantic web pages.

In the indexing method using the semantic web according to the present invention, additional indexing takes place between gathering and indexing, so a single web page can have hundreds of indexes. These indexes are centered on words (Web Word) and paragraphs.

Thus, whereas a conventional search engine such as Google provides search results from a DB stored under a single indexing scheme, a search engine to which the present invention is applied holds hundreds of semantic web concepts capturing word meaning for a single web page, which makes meaning search (Meaning Search) possible.

The present invention and the technical objects achieved by practicing it will become clearer from the preferred embodiments described below. The following embodiments are given only to illustrate the present invention and are not intended to limit its scope.

FIG. 2 is a diagram showing the overall structure of a semantic web based search engine according to the present invention.

As shown in FIG. 2, the semantic web based search engine according to the present invention is implemented in a semantic web based search site (200) that a plurality of users (110) can access through the Internet (102). The semantic web based search site (200) consists of a client interface (202), a search agent (SA; 204), an index database (206), an indexing unit (210) that collects web pages from the static Internet (102-1) or the dynamic Internet (102-2), analyzes them, and indexes them, a policy agent (PA; 220), a doctor agent (DA; 222), and a monitoring agent (MA; 224). The indexing unit (210) consists of a static web page gathering agent (GA; 211), a dynamic web page gathering agent (GA; 212), a filtering agent (FA; 213), and an analysis agent (AA; 214).

Referring to FIG. 2, the search engine according to the present invention has a main solution group made up of seven agents. The policy agent (Policy Agent: PA; 220), which sits above all the other agents, performs the policy function of requesting and directing those agents to carry out specific functions.

The gathering agents (Gathering Agent: GA; 211, 212) collect web pages, the filtering agent (Filter Agent: FA; 213) refines the data (changes it into a usable form), and the analysis agent (Analysis Agent: AA; 214) analyzes the collected data and stores the indexes in the index database (206). The static web page gathering agent (211) collects web pages from the data of web pages (102-1) that have a fixed source format, such as newspapers, forums, and editorials, and are operated by static rules. The dynamic web page gathering agent (212) collects semantic web pages from the dynamic Internet (102-2), such as blogs and general web pages.

The search agent (Search Agent: SA; 204) handles ontology search and semantic web document search, the monitoring agent (Monitoring Agent: MA; 224) is a tool that detects calculation errors and monitors modified data, and the doctor agent (Doctor Agent: DA; 222) is responsible for checking updates and repairing errors.
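The relationship between the policy agent and the agents it directs can be sketched as a simple registry and dispatcher. The description does not say how requests are issued, so the `PolicyAgent` class, its `register` and `request` methods, and the two toy handlers below are assumptions made purely for illustration; only the abbreviations FA and SA come from the text.

```python
class PolicyAgent:
    """Hypothetical coordinator that forwards requests to registered agents."""

    def __init__(self):
        self._agents = {}

    def register(self, name, handler):
        """Register an agent under a short name such as 'FA' or 'SA'."""
        self._agents[name] = handler

    def request(self, name, payload):
        """Ask the named agent to perform its function on the payload."""
        if name not in self._agents:
            raise KeyError(f"no agent registered as {name!r}")
        return self._agents[name](payload)


policy = PolicyAgent()
# Toy stand-ins for the filtering agent (FA) and the search agent (SA).
policy.register("FA", lambda words: [w for w in words if w])
policy.register("SA", lambda query: query.lower().split())

cleaned = policy.request("FA", ["stock", "", "bank"])
terms = policy.request("SA", "Semantic Web")
```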

FIG. 3 is a flowchart showing the indexing procedure using the semantic web according to the present invention, and FIG. 4 shows an example of a semantic web solution that performs indexing using the semantic web according to the present invention.

As shown in FIG. 3, the indexing procedure using the semantic web according to the present invention consists of a web page collection step (S301), a semantic analysis step (S302), a filtering step (S303), a personality assignment step (S304), and a classification step (S305), and generates an index DB for semantic web search.

The web page collection step (S301) fetches web pages from the Internet, processes the unrefined data of dynamic and static web pages into web data for semantics, and stores it in the semantic web page database (402). To this end, the static gathering agent (GA.S.ST: Gathering Agent for Semantic pages at Static Web pages) solution (401a) collects web pages from the static Internet (102-1), which is made up of web pages that have a fixed source format, such as newspapers, forums, and editorials, and are operated by static rules, processes them into data for semantic pages, and stores them in the semantic web page database (402). The dynamic gathering agent (GA.S.D: Gathering Agent for Semantic pages at Dynamic Web pages) solution (401b) collects web pages from the dynamic Internet (102-2), which is made up of dynamic web pages such as blogs and general web pages, processes them into data for semantic pages, and stores them in the semantic web page database (402).
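A minimal sketch of the collection step follows, assuming that a page is routed to the static or dynamic gatherer by its host name and that "processing into data for semantic pages" is reduced to whitespace normalization. The description gives no such rule, so `STATIC_HOSTS`, `choose_gatherer`, `gather`, and the example URLs are hypothetical, and a dict stands in for the semantic web page database (402). No network access is performed.

```python
from urllib.parse import urlparse

# Hypothetical hosts treated as "static" sources (fixed layout, e.g. newspapers).
STATIC_HOSTS = {"news.example.com", "forum.example.com"}


def choose_gatherer(url):
    """Return 'static' for known fixed-format sources, otherwise 'dynamic'."""
    host = urlparse(url).hostname or ""
    return "static" if host in STATIC_HOSTS else "dynamic"


def gather(pages, database):
    """Store each (url, raw_text) pair in the page database with its source type."""
    for url, raw_text in pages:
        database[url] = {
            "source_type": choose_gatherer(url),
            "text": " ".join(raw_text.split()),  # normalize whitespace only
        }
    return database


semantic_page_db = gather(
    [
        ("http://news.example.com/a1", "Markets  rallied\n today."),
        ("http://blog.example.org/post", "My   weekend trip."),
    ],
    {},
)
```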

The semantic analysis step (S302) divides the semantic web pages collected in the web page collection step (S301) and stored in the semantic web page database (402) into article (Article; 403a), paragraph (Paragraph; 403b), and word (Word; 403c) levels and processes frequency and relationship analysis data. To this end, the semantic article analysis agent (AA.S.A: Analysis Agent for Semantic Article) solution (404a) extracts the articles (403a) of each web page from the collected semantic web page database (402), performs frequency counting, relationship setting, and graph processing, and stores the article analysis data (405a) in the semantic web analysis database (406). The semantic paragraph analysis agent (AA.S.P: Analysis Agent for Semantic Paragraph) solution (404b) extracts the paragraphs (403b) of each web page from the collected semantic web page database (402), performs frequency counting, relationship setting, and graph processing, and stores the paragraph analysis data (405b) in the semantic web analysis database (406). The semantic word analysis agent (AA.S.W: Analysis Agent for Semantic Word) solution (404c) extracts the words (403c) of each web page from the collected semantic web page database (402), performs frequency counting, relationship setting, and graph processing, and stores the word analysis data (405c) in the semantic web analysis database. The analysis data produced separately for each level are thus integrated and stored in the semantic web analysis database (406).
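The multi-level analysis can be sketched as follows. The description does not define how frequency counting or relationship setting is computed, so this example makes two assumptions: frequencies are plain token counts per article and per paragraph, and a "relationship" is a pair of words co-occurring in the same paragraph. All function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def split_paragraphs(article):
    """Split an article into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", article) if p.strip()]


def tokenize(text):
    """Return lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def analyze(article):
    """Return word frequencies for the whole article and for each paragraph."""
    paragraphs = split_paragraphs(article)
    return {
        "article": Counter(tokenize(article)),
        "paragraphs": [Counter(tokenize(p)) for p in paragraphs],
    }


def cooccurrence(article):
    """Count word pairs that appear together in a paragraph (assumed relation)."""
    pairs = Counter()
    for paragraph in split_paragraphs(article):
        words = sorted(set(tokenize(paragraph)))
        pairs.update(combinations(words, 2))
    return pairs


article = "Stock prices rose.\n\nBank stock prices fell later."
analysis = analyze(article)
relations = cooccurrence(article)
```

Because the counts are kept per level, one page yields several index entries (one article-level table, one table per paragraph, and the pair counts), which is the multiple-indexes-per-page idea in miniature.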

The filtering step (S303) demotes or refines out residue data to be deleted from the data stored in the semantic web analysis database (406). The semantic article filtering agent (FA.S.A: Filter Agent for Semantic Article) solution (407a) refines out the data to be deleted from the article analysis data (405a) in the stored semantic web analysis database (406) and stores the article filtered data (408a) in the semantic web filtered database (409).

The semantic paragraph filtering agent (FA.S.P: Filter Agent for Semantic Paragraph) solution (407b) refines out the data to be deleted from the paragraph analysis data (405b) in the semantic web analysis database (406) and stores the resulting paragraph filtered data (408b) in the semantic web filtered database (409). The semantic word filtering agent (FA.S.W: Filter Agent for Semantic Word) solution (407c) refines out the data to be deleted from the word analysis data (405c) in the semantic web analysis database (406) and stores the resulting word filtered data (408c) in the semantic web filtered database (409). The article filtered data (408a), paragraph filtered data (408b), and word filtered data (408c) filtered in this way are stored in the semantic web filtered database (409).
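The description does not state which data count as residue, so the following sketch simply assumes two criteria: words on a discard list (or below a minimum count) are dropped at the word level, and very short paragraphs are dropped at the paragraph level. `STOP_WORDS`, `filter_words`, and `filter_paragraphs` are hypothetical names.

```python
# Hypothetical discard list; the patent does not specify the criteria.
STOP_WORDS = {"the", "a", "of", "and"}


def filter_words(word_counts, stop_words=STOP_WORDS, min_count=1):
    """Keep words that are not on the discard list and meet the minimum count."""
    return {
        word: count
        for word, count in word_counts.items()
        if word not in stop_words and count >= min_count
    }


def filter_paragraphs(paragraphs, min_words=3):
    """Drop paragraphs too short to carry meaning (assumed criterion)."""
    return [p for p in paragraphs if len(p.split()) >= min_words]


word_filtered = filter_words({"the": 7, "stock": 3, "of": 4, "bank": 1}, min_count=2)
paragraph_filtered = filter_paragraphs(["Read more", "Bank stock prices fell later."])
```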

The personality assignment step (S304) assigns a personality, such as economy, politics, culture, or entertainment, to each semantic web page in the stored semantic web filtered database (409). The personality analysis agent (AA.C.S: Analysis Agent for Character at Semantic Web pages) solution (410) assigns a personality to the semantic web filtered data (409) and stores the personality analysis data (411) in the semantic web personality database (412).

The classification step (S305) groups the semantic web data to which a personality has been assigned and extracts it as classified analysis data. The classification agent (AA.G.S: Analysis Agent for Grouping at Semantic Web pages) solution (413) groups the personality-assigned semantic web data in the semantic web personality database (412) according to the percentage (%) of each personality and stores the classified analysis data (414) in the semantic web classification database (415), thereby generating the index database.
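Personality assignment and percentage-based grouping can be sketched under stated assumptions: each personality is represented by a hypothetical keyword set, a page's percentage for a personality is its share of all keyword hits, and a page joins a group when that percentage reaches a threshold. None of these rules, names, or values are specified in the description.

```python
# Hypothetical keyword sets per personality (category).
CATEGORY_KEYWORDS = {
    "economy": {"stock", "bank", "market"},
    "politics": {"election", "vote"},
    "culture": {"film", "music"},
}


def personality_percentages(word_counts):
    """Return each personality's share (%) of all keyword hits on a page."""
    hits = {
        category: sum(word_counts.get(word, 0) for word in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return {category: 0.0 for category in hits}
    return {category: 100.0 * n / total for category, n in hits.items()}


def group_by_personality(pages, threshold=50.0):
    """Group page ids under every personality whose share meets the threshold."""
    groups = {}
    for page_id, percentages in pages.items():
        for category, pct in percentages.items():
            if pct >= threshold:
                groups.setdefault(category, []).append(page_id)
    return groups


page1 = personality_percentages({"stock": 3, "bank": 1, "vote": 1, "weather": 4})
groups = group_by_personality({"page1": page1})
```

With the sample counts, four of the five keyword hits belong to economy, so page1 is 80% economy and 20% politics and lands in the economy group only.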

The present invention has been described above with reference to an embodiment shown in the drawings, but those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

FIG. 1 is a diagram showing the structure of a general search engine.

FIG. 2 is a diagram showing the overall structure of a semantic web based search engine according to the present invention.

FIG. 3 is a flowchart showing the indexing procedure using the semantic web according to the present invention.

FIG. 4 is a diagram showing a detailed example of indexing using the semantic web according to the present invention.

Claims

A semantic web based search engine comprising: an indexing unit that converts collected web pages into semantic web pages and then analyzes them at the word, paragraph, and article levels to generate a plurality of indexes for a single web page, the indexing unit consisting of a gathering agent that collects web pages distributed on the Internet, processes them into semantic web pages, and stores them in a semantic web page database, a semantic analysis agent that extracts words, paragraphs, and articles from the semantic web pages stored in the semantic web page database, performs frequency counting, relationship setting, and graph processing for each level, and stores the result in a semantic web analysis database, a filtering agent that filters the semantic web pages stored in the semantic web analysis database separately at the word, paragraph, and article levels and stores them in a semantic web filtered database, a personality analysis agent that assigns a personality to the filtered semantic web pages, and a classification analysis agent that classifies the semantic web data to which a personality has been assigned according to the percentage (%) of each personality;

an index database that stores the indexes of the web pages generated by the indexing unit; and

a search agent that searches the index database in response to a search term entered by a user and processes document searches based on the semantic web.

The semantic web based search engine of claim 1, further comprising:

a policy agent that is located above the agents and performs a policy function of requesting and directing the agents to carry out specific functions;

a monitoring agent that detects calculation errors and monitors modified data; and

a doctor agent that is responsible for checking updates and repairing errors.

The semantic web based search engine of claim 1, wherein the gathering agent consists of:

a static web page gathering agent that collects static web pages on the Internet, which have a fixed source format, such as newspapers, forums, and editorials, and are operated by static rules; and

a dynamic web page gathering agent that collects dynamic web pages on the Internet, such as blogs and general web pages.

A semantic web based indexing method comprising:

a web page collection step of collecting web pages distributed on the Internet, processing them into semantic web pages, and storing them in a semantic web page database;

a semantic analysis step of extracting words, paragraphs, and articles from the semantic web pages stored in the semantic web page database, performing frequency counting, relationship setting, and graph processing for each level, and storing the result in a semantic web analysis database;

a filtering step of filtering the semantic web pages stored in the semantic web analysis database separately at the word, paragraph, and article levels and storing them in a semantic web filtered database;

a personality analysis step of assigning a personality to the filtered semantic web pages; and

a classification analysis step of classifying the semantic web pages to which a personality has been assigned according to the percentage (%) of each personality,

wherein the collected web pages are converted into semantic web pages and analyzed at the word, paragraph, and article levels to generate a plurality of indexes for a single web page.

The method of claim 4, wherein the web page collection step comprises:

a static web page collection step of collecting static web pages on the Internet, which have a fixed source format, such as newspapers, forums, and editorials, and are operated by static rules; and

a dynamic web page collection step of collecting dynamic web pages on the Internet, such as blogs and general web pages.

The method of claim 4, wherein the semantic analysis step comprises:

extracting articles from the collected semantic web pages and performing frequency counting, relationship setting, and graph processing;

extracting paragraphs from the collected semantic web pages and performing frequency counting, relationship setting, and graph processing; and

extracting words from the collected semantic web pages and performing frequency counting, relationship setting, and graph processing.

The method of claim 6, wherein the filtering step comprises:

refining out the data to be deleted from the articles of the analyzed semantic web pages;

refining out the data to be deleted from the paragraphs of the analyzed semantic web pages; and

refining out the data to be deleted from the words of the analyzed semantic web pages.