KR20030082109A

KR20030082109A - Method and System for Providing Information and Retrieving Index Word using AND Operator

Info

Publication number: KR20030082109A
Application number: KR1020020020663A
Authority: KR
Inventors: 전석진
Original assignee: (주)메타웨이브
Priority date: 2002-04-16
Filing date: 2002-04-16
Publication date: 2003-10-22

Abstract

PURPOSE: An index word search and information providing system and method are provided that extract index words, calculate the frequency of each index word in specific documents, calculate scores for the documents from the calculated frequencies, and output the documents from the highest score to the lowest. CONSTITUTION: An index agent receives document data such as contents and URLs from a robot agent via an FA (facilitator) (S510). The index agent extracts index words from the received document data and calculates the frequency of each index word in the document (S520). The index agent requests the robot agent to transmit the next document (S530), and the robot agent checks whether the requested document is the last one in the URL database (S540). If the requested document is not the last, step S520 is repeated. Otherwise, once the frequency of each index word has been calculated for all the documents, all the documents including the index words are extracted and, at the same time, scores for the extracted documents are calculated using a score calculation equation (S550). The data on the documents including the index words is stored in an index database (S560).

Description

Method and System for Providing Information and Retrieving Index Word using AND Operator

The present invention relates to a system and method for index word search and information provision using an AND operator. More specifically, it extracts the index words to be searched, calculates how often each index word occurs in a given document, takes the product of those occurrence frequencies as the score of the document, and provides search results in descending order of document score.

In recent years, as most documents have come to be written on computers and distributed and obtained over communication networks, techniques for finding documents effectively have become increasingly important. Moreover, with the spread of the Internet, it has become common for ordinary people as well as experts to connect to a network to provide or obtain information, and the amount of information accessible through the Internet is growing exponentially. As a result, search engines (e.g., altavista, yahoo, infoseek ultra, dejanews, lycos, empas) have established themselves as the most successful applications on the Internet, an information warehouse and information-acquisition infrastructure of unprecedented scale.

Because the web was small, early Internet search engines had no need to build the little material available into a database, and early web search engines such as Yahoo used subject search, which is convenient to develop and to search when the database is small. For example, suppose that every menu, starting with the initial menu, has about 10 submenus and that the menu hierarchy supports 4 levels in total. Represented as a tree, it can hold 1,000 (10³) items. Adding one more level raises this to 10,000 (10⁴) items. However, current Internet search engines hold from 1 million to as many as 50 million records, so a subject search must pass through many levels before reaching the final item. If the user makes a wrong choice at even one level, the item cannot be found among the lower topics without going back up to a higher topic. As the Internet has continued to grow, smooth searching by subject search alone has become impossible. The number of records held by a search engine must grow with the rapidly expanding web, but the old approach, in which a person checks each homepage by hand and adds it as a record, cannot keep up with that growth. Even if hundreds of thousands of homepages were indexed by hand into a database, users would need a great deal of time and effort to search it through menus.

From this point, the concept of the robot agent (e.g., robots, wanderers, spiders, worms) was introduced to the Internet. A robot is a kind of automatic traversal program that automatically performs the searching and indexing of homepages previously done by hand and builds the results into a database. Most databases built by such robots are designed to support index word search, and from then on Internet search engines began to shift from subject search to index word search. That is, to find the desired information a user enters a search expression as index words, and the relationships among the entered index words are used to reach the relevant information through a Boolean query or a vector query.

In such prior art, the index of index words built in the system is searched in consideration of the relationships among the entered index words (i.e., weights between index words, etc.), and the matching information is provided to the user. Methods of reaching the desired information by considering the relationships among index words fall broadly into two groups: analyzing and storing the morphemes of unit index words in advance and extracting the index words related to a given index word, and finding index words using bigram or trigram information of the search index word. Search engines were developed on this basis so that Internet users can obtain the information they want quickly. A search engine travels the Internet at high speed on behalf of the user and finds information that matches the user's request. That is, the user states the desired information in the index word format supported by the search engine, and the search engine finds that information and provides it to the user.

However, because such conventional search-engine-based information retrieval has been developed with a focus on processing speed and stability, it has several problems. First, when ranking search results for an AND query, it calculates the occurrence frequency of each index word in a given document and scores the document by the index word with the lowest occurrence frequency. Accuracy is therefore poor, and the number of documents retrieved is so large that searching is difficult. For example, if two index words a and b occur in a document with frequencies tf_a and tf_b, where tf_a < tf_b, the score of the document is the lower frequency, tf_a. Because the score of a document retrieved with two or more index words joined by the AND operator is based only on the least frequent word, it is difficult to produce accurate search results. Second, when many Internet users request searches at the same time, search and response times grow and search efficiency falls.

Accordingly, the present invention has been devised to solve the problems of the prior art described above. An object of the invention is to provide a system and method for index word search and information provision using an AND operator which, when ranking search results for an AND query, calculates the occurrence frequency of each of a plurality of index words in a given document, multiplies those frequencies together, and uses the product as the score of the document when providing search results.
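The difference between the minimum-based scoring attributed to the prior art and the product-based scoring stated as the object of the invention can be illustrated with a small sketch. The function names and the two sample documents are hypothetical and are not taken from the patent; raw frequencies are used here only to keep the arithmetic visible.

```python
from math import prod

def min_score(freqs):
    """Prior-art style: the least frequent index word alone decides the score."""
    return min(freqs)

def product_score(freqs):
    """Style proposed in this document: multiply the frequencies together."""
    return prod(freqs)

# Hypothetical occurrence frequencies of index words a and b in two documents.
doc_x = {"a": 2, "b": 3}
doc_y = {"a": 2, "b": 40}

for name, doc in (("doc_x", doc_x), ("doc_y", doc_y)):
    freqs = list(doc.values())
    print(name, "min:", min_score(freqs), "product:", product_score(freqs))
```

Under the minimum rule both documents receive the same score even though one mentions b far more often; the product separates them.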

Another object of the present invention is to provide a system and method for index word search and information provision using an AND operator that shorten search time and provide highly accurate search results by calculating in advance the occurrence frequencies of all index words in each document and storing them in a database.

FIG. 1 is a block diagram showing the overall configuration of the index word search and information providing system according to the present invention.

FIG. 2 is an exemplary table structure showing an example of the per-index-word document scores stored in the index DB according to the present invention.

FIG. 3 is an exemplary table structure showing the document information for each index word according to the present invention.

FIG. 4 is a flowchart illustrating the document collection process according to the present invention.

FIG. 5 is a flowchart illustrating the process of calculating document scores for each index word according to the present invention.

FIG. 6 is a flowchart illustrating the processing of index words entered by a client according to the present invention.

FIG. 7 is an exemplary view showing an example of calculating the scores of documents that contain all of the index words according to the present invention.

♣ Explanation of reference numerals for the main parts of the drawings ♣

10: Internet 20: Robot agent

30: Index agent 40: Search agent

50: Client 100: Information providing system

To achieve these objects, the present invention provides a method of index word search and information provision that analyzes index words requested by a client and provides the desired information to the client as a list, the method comprising: accessing URLs on the Internet based on registered URL address information, collecting documents, and storing URL information including the URL address where each document is located and the contents of the document; receiving the URL information, calculating the index words of each document and their occurrence frequencies, calculating the score of the document for each index word using a score calculation formula, and storing document information for each index word; and receiving the document information for each index word based on the index words the client entered with the AND operator, multiplying together the scores of each document for the respective index words, building a search list in descending order of score, and providing it to the client.

The present invention also provides a system for index word search and information provision that analyzes index words requested by a client and provides the desired information to the client as a list, the system comprising: a robot agent that traverses web servers on the Internet and automatically collects, indexes, and stores in a database the information published on each homepage; an index agent, interconnected with the robot agent, that indexes the documents collected by the robot agent to extract index words, calculates the occurrence frequency of each index word in each document, calculates the score of the document for each index word using a score calculation formula, and stores the resulting document information in an index DB; and a search agent, interconnected with the index agent, that analyzes the index words entered by the client, extracts the documents containing each of the index words by referring to the index DB, multiplies together the scores of each document for the respective index words as stored in the index DB, builds a search list in descending order of document score, and provides it to the client.

Hereinafter, preferred embodiments of the system and method for index word search and information provision using an AND operator according to the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the overall configuration of the index word search and information providing system 100 (hereinafter, the 'information providing system') according to the present invention. The information providing system 100 comprises the Internet 10, a robot agent 20, an index agent 30, a search agent 40, and a client 50. The robot agent 20 includes a collection robot 22, a tracking robot 24, a management robot 26, a facilitator (hereinafter 'FA') 28, and a URL database 29; the index agent 30 includes an FA 32, an index module 34, and an index database 36; and the search agent 40 includes an FA 42 and a search server 44.

The robot agent 20 is connected to the Internet 10. It is an automatic traversal program that, instead of homepages being searched and indexed by hand, automatically searches and indexes them and builds the results into a database. The robot agent 20 includes a collection robot 22 that collects new information, a tracking robot 24 that tracks whether existing information has changed and collects the changed information, a management robot 26 that manages the URL database 29 so as to prevent duplicate collection and store optimal information, and an FA 28 that handles communication between agents and manages the related agents. The URL DB 29 stores the contents of each document and the URL information of where the document is located. The tracking robot 24 may include a robot that looks for new (unregistered) Internet homepages, a robot that tracks homepages whose contents have changed (including deleted homepages), and a robot that later revisits homepages to which a connection failed. Viewed only as something that connects to a web server and fetches data, the robot agent 20 appears to do the same job as a web browser. The difference is that a web browser displays the fetched data on screen and supports functions such as hyperlinks, whereas the robot agent 20 analyzes the data, extracts the URLs in it, and follows them to other URLs. Thus, once the robot agent 20 is started, it automatically finds homepages on the Internet and collects information. Typical robot agents 20 include spiders and crawlers. The robot agent 20 reads the entire contents of each web page it visits and visits all linked sites in turn. At regular intervals it revisits the sites it has visited before, checks whether each page has been updated, and collects the information if it has changed. If the page visited by the robot agent 20 links to other web documents, the robot automatically follows the links and collects information there. If there are no linked web documents, a location for the robot agent 20 to explore is designated as a starting point, and index collection proceeds from there.
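The step in which the robot agent analyzes a fetched page and extracts the URLs it links to can be sketched with the Python standard library. This is a minimal illustration under assumptions, not the patent's implementation: the class and function names and the sample page are hypothetical, and no network access is performed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links so they can be followed later.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Hypothetical page content.
page = '<a href="/dir/index.html">A</a> <a href="http://other.example/">B</a>'
print(extract_links("http://host.example/start.html", page))
```

A crawler built on this would push the returned URLs onto a queue of pages still to be visited.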

The index agent 30 is interconnected with the robot agent 20. It indexes the documents collected by the robot agent 20 to extract index words, calculates the occurrence frequency of each index word in each document, calculates the score of the document for each index word, and stores the results in the index database 36. These functions are performed by the index module 34. The indexing performed by the index module 34 consists of extracting index words from the collected information, calculating the score of each document for each extracted index word, and building an index that records where the information for each index word is located, which provides the basis for efficient information retrieval. Communication between the robot agent 20 and the index agent 30 is handled by their respective FAs 28 and 32. The index module 34 removes the tags (e.g., HTML tags) from the information (e.g., an HTML document) sent from the robot agent 20 through the FA 32, extracts index words from the remaining content, and stores the score of the document for each index word in the index database 36. Index words preferably exclude adverbs and verbs and consist mainly of nouns, adjectives, and noun forms of verbs. The method of calculating the score of a document for each index word is described in detail below.
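The tag removal and frequency counting performed by the index module can be sketched as follows. This is a simplified illustration under assumptions: a small hypothetical stop-word set stands in for the part-of-speech filtering the text prefers (nouns, adjectives, noun forms of verbs), which in practice would need a morphological analyzer, and all names are invented for the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keeps only the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical stop words; a stand-in for part-of-speech filtering.
STOP_WORDS = {"the", "a", "is", "and", "of"}

def term_frequencies(html):
    """Return a Counter mapping each candidate index word to its frequency."""
    words = re.findall(r"\w+", strip_tags(html).lower())
    return Counter(w for w in words if w not in STOP_WORDS)

doc = "<html><body><h1>Car</h1><p>The car is a fast car.</p></body></html>"
print(term_frequencies(doc))
```

The resulting per-document counts are the tf values that a score calculation formula would take as input.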

The search agent 40 is interconnected with the index agent 30. It analyzes the index words entered by the client 50, extracts the documents containing each of the index words by referring to the index DB 36, multiplies together the scores of each document for the respective index words as stored in the index DB 36, and provides search results in descending order of document score. These functions are performed by the search server 44, which includes a query analyzer 44a that handles query input, query validation, and query analysis; a thread manager 44b that creates threads for the queries analyzed by the query analyzer 44a, transforms queries, analyzes and generates results, and merges and re-ranks results; and a result generator 44c that arranges the information processed by the thread manager 44b into a list in ascending order according to URL, title, accuracy, and so on, to generate the final search list. In particular, the thread manager is programmed to multiply together the scores of each document for the respective index words and to generate the final search results ordered from the highest-scoring document.
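The ranking performed by the search agent for an AND query can be sketched as below: keep only documents that contain every query term, multiply their per-term scores, and sort by the product. The index layout (a dict from index word to a dict of document scores), the function name, and every score value are hypothetical and chosen only for illustration.

```python
from math import prod

def and_search(index, query_terms):
    """Rank documents containing all query terms by the product of their scores.

    index: dict mapping index word -> dict mapping doc_id -> score.
    Returns a list of (doc_id, score) pairs, highest score first.
    """
    postings = [index.get(term, {}) for term in query_terms]
    if not postings:
        return []
    # AND semantics: a document must appear under every query term.
    common = set(postings[0])
    for posting in postings[1:]:
        common &= set(posting)
    scored = [(doc, prod(p[doc] for p in postings)) for doc in common]
    # Highest product first; doc_id breaks ties deterministically.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored

# Hypothetical precomputed scores.
index = {
    "car": {"doc1": 0.04, "doc2": 0.02, "doc4": 0.08},
    "engine": {"doc1": 0.5, "doc3": 0.3, "doc4": 0.1},
}
print(and_search(index, ["car", "engine"]))
```

Because the per-term scores are looked up rather than recomputed, the query-time work is limited to an intersection, a few multiplications, and a sort.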

The client 50 is the computer used by an ordinary Internet user to connect to the search agent 40 of the present invention, and refers to a web browser installed on that computer, such as Netscape or Internet Explorer, or to other client software. The Internet user connects to the search agent 40 with the web browser and enters index words to receive the desired information. The client 50 also includes a user interface, which stores queries from the Internet user in a buffer so that the search server 44 can process them and lets the user browse the result buffer in which the results of the query are stored. As noted above, communication between the agents is handled by their respective FAs 28, 32, and 42.

FIG. 2 is an exemplary table structure showing an example of the per-index-word document scores stored in the index DB according to the present invention. For the purposes of describing this score table, the number of documents stored in the URL DB of the robot agent is limited to five, and the index words extracted from each document are limited to those shown in the table. These limits are only for ease of explanation; in practice there are of course many documents and many index words.

First, the index module of the index agent receives the document to be indexed from the robot agent through the FA and extracts the index words of that document. As noted above, index words preferably exclude adverbs and verbs and consist mainly of nouns, adjectives, and noun forms of verbs. As shown in FIG. 2, index words may be extracted in order of appearance, but they may also be extracted in descending order of occurrence frequency within the document. The index module extracts the index words of a document, counts how many times each index word occurs in that document, calculates the score of the document for each index word using a normalized score calculation formula, and stores the score in the index DB. The index DB also stores information on the documents that contain each index word. The score calculation formula preferably normalizes the occurrence frequency of an index word to a value between 0 and 1. One example of a score calculation formula is given by the following equation.

[Equation 1]

Here, tw stands for 'term weight' and is the relative importance of a specific index word for a given document. TF stands for 'Term Frequency' and is the ratio of the occurrence frequency of the index word in the given document to its maximum occurrence frequency across all documents, that is, the occurrence frequency rate of the index word. IDF stands for 'Inverse Document Frequency' and is the importance of the index word across all documents; the less often the index word appears across all documents, the larger the IDF value. Further, tf is the frequency of the index word in the given document, max tf is the maximum frequency of the index word across all documents, N is the total number of documents to be indexed, and n is the number of documents in which the index word appears. Alternatively, max tf may be set to an arbitrary reference value (e.g., 10, 50, or 100). The score calculation formula given above is only one example, and the score of a document for each index word can of course be calculated in other ways.
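Equation 1 itself is not reproduced in this text, so the following is only a sketch of one term weight that is consistent with the definitions above. It assumes tw = TF × IDF with TF = tf / max tf and IDF = log(N / n) / log(N). The logarithmic form of IDF and its normalization by log(N) are assumptions made here to keep the result between 0 and 1; they are not taken from the patent, and the function name is hypothetical.

```python
import math

def term_weight(tf, max_tf, n_docs, n_docs_with_term):
    """Assumed form: tw = (tf / max_tf) * log(N / n) / log(N).

    tf               frequency of the index word in the given document
    max_tf           maximum frequency of the index word over all documents,
                     or a fixed reference value such as 10, 50, or 100
    n_docs           N, total number of documents to be indexed (must be > 1)
    n_docs_with_term n, number of documents containing the index word (>= 1)
    """
    if n_docs <= 1 or n_docs_with_term < 1:
        raise ValueError("need N > 1 and n >= 1")
    tf_ratio = min(tf / max_tf, 1.0)  # cap at 1 when max_tf is a reference value
    idf = math.log(n_docs / n_docs_with_term) / math.log(n_docs)
    return tf_ratio * idf

# Hypothetical inputs: the rarer the index word, the larger the weight.
print(term_weight(tf=3, max_tf=10, n_docs=5, n_docs_with_term=4))
print(term_weight(tf=3, max_tf=10, n_docs=5, n_docs_with_term=1))
```

Under this assumed form, an index word that appears in every document gets a weight of 0, which matches the stated behavior that IDF grows as the word becomes rarer.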

For example, the score calculation formula described above is used to calculate the score of document 1 for the index word 'car'.

The value calculated above (i.e., 0.041) is the score of document 1 for the index word 'car'. After calculating this score, the index module calculates the score of document 1 for each of the remaining index words. Next, the index module calculates the scores of documents 2 through 5 for each index word in the same way as for document 1, and then stores the calculated per-index-word document scores in the index DB. The score of each document for each index word is as shown in FIG. 2; the fewer the documents that contain an index word, the higher the score assigned. The per-index-word document scores stored in the index DB are used when the search results for the index words entered by the client are compiled into a list.

FIG. 3 is an exemplary table structure showing the document information for each index word according to the present invention. Once the per-index-word document scores have been calculated in the process of FIG. 2, the index module builds the document information for each index word into a database on that basis and stores it. The document information for each index word includes the documents that contain the index word, the scores of those documents, and the URL addresses where the documents are located. For example, the index module extracts all documents (e.g., documents 1, 2, 4, and 5) that contain a specific index word (e.g., 'car') and stores the document information for that index word in the index DB based on the per-index-word document scores calculated in FIG. 2 (e.g., 0.041, 0.020, 0.081, and 0.012). As FIG. 3 shows, the fewer the documents that contain a specific index word, the higher the document scores.
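The per-index-word document information described for FIG. 3 resembles an inverted index whose entries carry a document, a score, and a URL. A minimal sketch of such a structure follows. The scores for 'car' are the example values quoted in the text; the class name, field names, and URLs are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Posting:
    doc_id: str   # document containing the index word
    score: float  # precomputed score of that document for the index word
    url: str      # where the document is located (hypothetical values below)

# Index word -> documents containing it.
index_db = {
    "car": [
        Posting("doc1", 0.041, "http://host.example/doc1.html"),
        Posting("doc2", 0.020, "http://host.example/doc2.html"),
        Posting("doc4", 0.081, "http://host.example/doc4.html"),
        Posting("doc5", 0.012, "http://host.example/doc5.html"),
    ],
}

def postings_for(db, term):
    """Return the postings of a term, highest score first."""
    return sorted(db.get(term, []), key=lambda p: p.score, reverse=True)

for posting in postings_for(index_db, "car"):
    print(posting.doc_id, posting.score, posting.url)
```

Keeping the URL next to the score lets the search list be assembled from the index alone, without a second lookup in the URL DB.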

The operation of the index word search and information providing system and method using an AND operator according to the present invention is described below in more detail with reference to the accompanying drawings.

FIG. 4 is a flowchart illustrating the document collection process according to the present invention. It is assumed that the robot agent has already used the tracking robot to trace the URLs to be searched and has stored them in the URL DB.

First, the information providing system extracts the URLs stored in the URL DB, selects the first URL to be searched, and launches the robot agent (S410). The collecting robot of the robot agent accesses the selected URL (S420) and determines whether that URL permits access by the collecting robot (S430). That is, based on the host name of the URL, the collecting robot accesses http://hostname/robots.txt, analyzes the contents of robots.txt, and determines whether it may access the URL. If access is permitted, it collects the documents of the site in accordance with the analyzed contents of robots.txt and stores the URL information (e.g., URL addresses) in the URL DB (S440). All URLs are converted from relative to absolute form before being stored; for example, the relative URL /dir/index.html is stored as the absolute URL http://hostname/dir/index.html. Extracted URLs are stored in the URL DB for later use, together with their related URLs. For example, if the URL to be stored is http://host/dir/subdir/file.html, the related URLs http://host/dir/subdir/, http://host/dir/, and http://host/ are stored as well. In addition, before an extracted URL is stored in the database, the system checks whether it is already registered so that no duplicate URL is registered.
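The URL handling in step S440 (relative-to-absolute conversion, storing related parent URLs, and rejecting duplicates) can be sketched with Python's standard `urllib.parse` module. The function and class names below (`to_absolute`, `related_urls`, `UrlStore`) are hypothetical, and an in-memory set stands in for the URL DB; the patent does not specify how these steps are implemented.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def to_absolute(page_url, link):
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(page_url, link)


def related_urls(url):
    """Return the URL followed by each of its parent directory URLs."""
    parts = urlsplit(url)
    path = parts.path or "/"
    urls = [urlunsplit((parts.scheme, parts.netloc, path, "", ""))]
    directory = path.rsplit("/", 1)[0] + "/"
    while True:
        candidate = urlunsplit((parts.scheme, parts.netloc, directory, "", ""))
        if candidate not in urls:
            urls.append(candidate)
        if directory == "/":
            break
        # Drop the last path segment to move one directory up.
        directory = directory.rstrip("/").rsplit("/", 1)[0] + "/"
    return urls


class UrlStore:
    """In-memory stand-in for the URL DB that refuses duplicates."""

    def __init__(self):
        self._seen = set()
        self.urls = []

    def add(self, url):
        """Store the URL; return False if it was already registered."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self.urls.append(url)
        return True


store = UrlStore()
for u in related_urls("http://host/dir/subdir/file.html"):
    store.add(u)
print(store.urls)
```

Checking membership in a set before appending keeps the duplicate test cheap even when many pages link to the same parent directories.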

Next, the information providing system moves to the next URL based on the URL information stored in the URL DB (S450) and determines whether that URL is the last URL stored in the URL DB (S460). If it is not the last URL, the process returns to step S420 and repeats the subsequent steps; if it is the last URL, all processing ends.
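The collection loop of FIG. 4 (access a URL, check robots.txt, collect, move to the next URL) can be sketched as follows. This is an illustration under stated assumptions rather than the patent's code: `collect`, `robots_by_host`, `fetch`, and the user-agent string `collector-robot` are hypothetical, and robots.txt contents are supplied as pre-fetched lines so that the sketch runs without network access. The permission check uses the standard `urllib.robotparser.RobotFileParser`.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def build_parser(robots_lines):
    """Parse the lines of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def collect(url_db, robots_by_host, fetch, user_agent="collector-robot"):
    """Visit each URL in order and collect the documents that robots.txt allows."""
    collected = {}
    parsers = {}
    for url in url_db:  # S450/S460: continue until the last URL
        host = urlsplit(url).netloc  # S420: identify the host being accessed
        if host not in parsers:
            parsers[host] = build_parser(robots_by_host.get(host, []))
        if not parsers[host].can_fetch(user_agent, url):  # S430
            continue
        collected[url] = fetch(url)  # S440: collect the document
    return collected


# Hypothetical inputs: two URLs on one host, one of them disallowed.
robots = {"host": ["User-agent: *", "Disallow: /private/"]}
urls = ["http://host/dir/index.html", "http://host/private/secret.html"]
pages = collect(urls, robots, fetch=lambda u: "content of " + u)
print(list(pages))
```

Passing `fetch` as a parameter keeps the permission logic separate from the actual download, which in a real crawler would be an HTTP request.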

FIG. 5 is a flowchart illustrating the process of calculating document scores for each index word according to the present invention.

First, the index agent sends a request to the robot agent through the FA and receives the document information to be indexed (e.g., the document content and the URL address where the document is located) (S510). That is, the robot agent keeps the document information collected by the tracking robot in the URL DB and, when the index agent requests it, extracts the corresponding document information from the URL DB and passes it to the index agent through the FA. Based on the document information received from the robot agent, the index agent extracts the index words of the document and calculates how often each index word occurs in the document (S520). Index words preferably exclude adverbs and verbs and are drawn mainly from nouns, adjectives, and nominalized verbs.
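Step S520 (extracting index words and counting their occurrences in one document) can be sketched as below. Selecting nouns, adjectives, and nominalized verbs would require a part-of-speech tagger or morphological analyzer, which the text does not name; as a stand-in, this sketch simply drops words listed in a caller-supplied exclusion set. The function name `index_word_frequencies` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def index_word_frequencies(text, excluded=frozenset()):
    """Count how often each candidate index word occurs in one document.

    `excluded` is a simplification standing in for part-of-speech filtering
    (adverbs, verbs, and so on); it is an assumption of this sketch.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(token for token in tokens if token not in excluded)


freqs = index_word_frequencies(
    "Car dealers sell the car quickly",
    excluded={"the", "sell", "quickly"},
)
print(freqs.most_common())
```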

After extracting the index words of one document and calculating their frequencies of occurrence, the index agent requests the next document from the robot agent (S530) and determines whether the requested document is the last document stored in the robot agent's URL DB (S540). If it is not the last document, the process returns to step S520 and repeats the subsequent steps. If it is the last document, the index agent totals the frequency with which each specific index word occurs in each document and, using the score calculation formula, extracts all documents containing the specific index word while calculating the score of each such document for that index word (S550). An example of the frequency of occurrence of a specific index word and of the corresponding document scores is shown in FIG. 2. Next, based on the per-index-word document scores calculated in step S550, the index module stores the document information for each index word (e.g., the documents containing the index word, the score of each such document, and the URL address where each document is located) in the index DB (S560). An example of the document information for each index word is shown in FIG. 3.

FIG. 6 is a flowchart illustrating the processing of index words entered by a client according to the present invention.

First, the client connects to the search server of the information providing system of the present invention (S610) and enters the index words to be searched, joined by the AND operator (S620). Index words are preferably entered mainly as nouns, adjectives, or nominalized verbs, excluding adverbs and verbs. The search agent (specifically, the search server) passes the index words entered by the client to the index agent and receives the document information for each index word from the index agent (S630). That is, the index agent extracts all documents containing the received index words, along with the score of each document for each index word, and passes this information to the search agent (S630).

The search server of the search agent (specifically, the thread manager) multiplies together the scores of each document for the individual index words, based on the per-index-word document information received from the index agent, to obtain the score of each document that contains all of the index words (S640). It then builds a search list in descending order of score and provides the list to the client (S650). FIG. 7 shows an example of calculating the scores of documents that contain all of the index words according to the present invention. For example, suppose the client enters 'car AND Avante' as the index words. The search agent sends 'car' and 'Avante' to the index agent, and the index agent searches the index DB, extracts the documents containing each index word, and selects the documents that contain both 'car' and 'Avante'. As shown in FIG. 3, the documents containing 'car' are Document 1, Document 2, Document 4, and Document 5, and the documents containing 'Avante' are Document 1, Document 2, Document 3, Document 4, and Document 5. Because the client entered the index words with the AND operator, however, only documents containing both 'car' and 'Avante' are to be selected. Document 3 contains 'Avante' but not 'car', so it is excluded from the output list. The index agent sends the extracted document information for each index word to the search agent, and the search agent multiplies together the per-index-word scores of each document, builds a search list in descending order of score, and provides the list to the client. As shown in FIG. 7, the combined scores increase in the order Document 5, Document 2, Document 1, Document 4, so the search list is presented in the order Document 4, Document 1, Document 2, Document 5.
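The AND search of steps S640 and S650 (keep only documents containing every index word, multiply their per-word scores, sort in descending order) can be sketched as follows. The 'car' scores are the example values given in the text; the 'Avante' scores are invented purely for illustration, since the values of FIG. 7 are not reproduced in the text. The function name `and_search` and the dictionary layout are hypothetical.

```python
from math import prod


def and_search(index_db, index_words):
    """Rank documents that contain all index words by the product of their scores."""
    per_word = [index_db.get(word, {}) for word in index_words]
    if not per_word:
        return []
    # AND semantics: a document must appear under every index word.
    common = set(per_word[0]).intersection(*per_word[1:])
    ranked = [(doc, prod(scores[doc] for scores in per_word)) for doc in common]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


index_db = {
    # Example scores quoted in the text for 'car'.
    "car": {"Document 1": 0.041, "Document 2": 0.020,
            "Document 4": 0.081, "Document 5": 0.012},
    # Hypothetical scores for 'Avante' (not taken from the patent).
    "Avante": {"Document 1": 0.030, "Document 2": 0.025, "Document 3": 0.050,
               "Document 4": 0.040, "Document 5": 0.010},
}

results = and_search(index_db, ["car", "Avante"])
for doc, score in results:
    print(doc, round(score, 6))
```

With these illustrative numbers, Document 3 drops out because it has no entry under 'car', and the remaining documents come out in the order Document 4, Document 1, Document 2, Document 5.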

(Experimental Example)

Using a document set recognized by the international information retrieval community (TREC, English) and a document set commonly used in Korea (HANTEC, Korean), the search results of a conventional AND-operator search method and of the AND-operator search method of the present invention were each compared against the standard results to measure and evaluate accuracy, as follows.

                 Conventional method        Method of the present invention
Document set     all        top 20          all                top 20
TREC             0.1215     0.2451          0.1429 (17.61%)    0.3085 (25.87%)
HANTEC           0.1012     0.1315          0.1183 (16.90%)    0.1565 (19.01%)

TREC is a document set of 1 million English documents and HANTEC is a document set of 300,000 Korean documents; each set comes with between 50 and roughly 100 standard queries. A list of standard documents relevant to each standard query is predefined. After the document set is indexed, the result list returned for each standard query is compared with the standard document list to determine how accurately the search engine finds the relevant documents. The table above shows part of the experimental results for TREC and HANTEC: the 'all' columns give the accuracy over the entire result list, and the 'top 20' columns give the accuracy measured against the standard document list for only the top 20 documents in the result list. The figures in parentheses for the method of the present invention indicate the improvement in accuracy relative to the conventional method. As the results show, the method of the present invention is more accurate than the conventional method, and the improvement is larger for the top 20 results, so more accurate search results can be provided to the client.
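The percentages in parentheses are consistent with a relative improvement computed as (new - old) / old. The short check below recomputes them from the accuracy values in the table; the function name `relative_improvement` and the dictionary layout are hypothetical, and the rounding to two decimals is an assumption about how the figures were reported.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a percentage."""
    return round((improved - baseline) / baseline * 100, 2)


# (conventional, present invention) accuracy pairs taken from the table.
results = {
    ("TREC", "all"): (0.1215, 0.1429),
    ("TREC", "top 20"): (0.2451, 0.3085),
    ("HANTEC", "all"): (0.1012, 0.1183),
    ("HANTEC", "top 20"): (0.1315, 0.1565),
}

improvements = {key: relative_improvement(*pair) for key, pair in results.items()}
for key, value in improvements.items():
    print(key, f"{value:.2f}%")
```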

The above description covers only one embodiment; the present invention is not limited to that embodiment and may be modified in various ways within the scope of the appended claims. For example, the shape and structure of each component specifically shown in the embodiment of the present invention may be modified in practice.

As described above, the index word search and information providing system and method using an AND operator according to the present invention determine the order of search results for an AND query by calculating the frequency of occurrence of each of the index words within a given document and taking the product of those frequencies as the score of the document. This shortens search time and yields highly accurate search results.

Claims

An index word search and information providing method using an AND operator, in which index words requested by a client are analyzed and the desired information is listed and provided to the client, the method comprising:

(a) collecting documents by accessing URLs on the Internet based on registered URL address information, and storing URL information including the URL address where each document is located and the contents of the document;

(b) receiving the URL information, extracting index words and calculating their frequency of occurrence in the document, calculating a score of the document for each index word using a score calculation formula, and storing document information for each index word; and

(c) receiving the document information for each index word based on the index words entered by the client using the AND operator, multiplying together the scores of the corresponding document for the respective index words, generating a search list in descending order of score, and providing the search list to the client.

The method of claim 1, wherein step (b) comprises:

(b1) requesting document information to be indexed;

(b2) receiving the document information, extracting the index words of the document to be indexed, and calculating the frequency of occurrence of each index word in the document;

(b3) calculating the score of the corresponding document for a specific index word using the score calculation formula; and

(b4) storing document information for each index word based on the calculated per-index-word document scores.

The method of claim 1 or 2, wherein the score calculation formula is

[formula not reproduced in the source text]

where tw is the relative importance of a specific index word for a given document, TF is the frequency of occurrence of the specific index word, IDF is the importance of the specific index word across the entire document collection, tf is the frequency of the specific index word in a given document, max tf is the maximum frequency of the specific index word across the entire document collection, N is the total number of documents to be indexed, and n is the number of documents in the collection in which the specific index word appears.

The method of claim 1 or 2, wherein the document information for each index word includes the documents containing the index word, the score of each document containing the index word, and the URL address where each document is located.

The method of claim 1, wherein step (c) comprises:

(c1) accessing a search server and entering index words using the AND operator;

(c2) receiving document information for each index word, including all documents containing each index word and the score of each document for each index word, and multiplying together the scores of the corresponding document for the respective index words; and

(c3) listing the documents containing the index words in descending order of score and providing the list to the client.

An index word search and information providing system using an AND operator, in which index words requested by a client are analyzed and the desired information is listed and provided to the client, the system comprising:

a robot agent that traverses web servers on the Internet and automatically collects, indexes, and stores in a database the various kinds of information posted on each homepage;

an index agent that is interconnected with the robot agent, indexes the documents collected by the robot agent to extract index words, calculates the frequency of occurrence of each index word in a document, calculates the score of the document for each index word using a score calculation formula, and stores the resulting document information in an index DB; and

a search agent that is interconnected with the index agent, analyzes the index words entered by the client, extracts the documents containing each index word by referring to the index DB, multiplies together the scores of each document for the respective index words stored in the index DB, generates a search list of the documents in descending order of the multiplied score, and provides the search list to the client.

The system of claim 6, wherein the score calculation formula is

[formula not reproduced in the source text]

where tw is the relative importance of a specific index word for a given document, TF is the frequency of occurrence of the specific index word, IDF is the importance of the specific index word across the entire document collection, tf is the frequency of the specific index word in a given document, max tf is the maximum frequency of the specific index word across the entire document collection, N is the total number of documents to be indexed, and n is the number of documents in the collection in which the specific index word appears.