KR20030069640A

KR20030069640A - System and method for getting information on hierarchical and conceptual clustering

Info

Publication number: KR20030069640A
Application number: KR1020020009550A
Authority: KR
Inventors: 이의범
Original assignee: 이의범
Priority date: 2002-02-22
Filing date: 2002-02-22
Publication date: 2003-08-27

Abstract

PURPOSE: An information retrieval system based on hierarchical and conceptual clustering, and a method thereof, are provided so that information can be classified and searched easily without additional search training. When a word-based or sentence-based web page search is requested through the web search CGI provided by a web server system, the search results are displayed grouped under annotations.

CONSTITUTION: A web server (110) receives search request information from the computer terminal of a service user connected to the Internet and transmits to that terminal the classified information retrieved through the several devices that cooperate with it over the Internet connection. A cluster search engine (120) receives the search request information from the web server (110), split into individual words or morphemes, and searches and classifies a number of web pages (300) by the received search terms. A filtering database (140) classifies and sorts the HTML documents or web pages grouped and combined by the clustering of the cluster search engine (120). An annotation generator (150) creates a suitable category name for each information group sorted in the filtering database (140) and arranges the groups in a tree.

Description

System and method for getting information on hierarchical and conceptual clustering

The present invention relates to an information retrieval system using a hierarchical, conceptual clustering technique. More specifically, a service user connects to a web server system from a computer terminal connected to the Internet and requests a word-based or sentence-based search through a web search CGI. A cluster engine that receives cluster information from the web search CGI performs a clustering search of web documents and, at the same time, uses a meta search carried out by other search engines to generate a hierarchical, conceptualized cluster result, which is returned through the web search CGI. The invention thus relates to an information retrieval system and method based on hierarchical or conceptual document clustering that displays the grouped search results, organized as a tree, on the service user's terminal.

The Internet, initially developed for military purposes, has spread to the general public, and with the growth of high-speed networks such as fiber-optic cable and ADSL it is becoming a mass medium that surpasses television and newspapers.

As Internet use has spread, the amount of information available through it has grown exponentially. The Internet is called a sea of information, but the amount of poor-quality information has grown along with the total.

Because there is a limit to finding information through domain names alone, information on the Internet is in practice searched through the many portal sites that offer various search engines.

The search engines offered by these portal sites use search robots, each developed for a particular type of search, to find and classify the information scattered across the Internet, and then display it sequentially in the Internet user's web browser.

In this conventional method, the user enters a specific word or a query sentence in the search expression or search term input window provided in the web browser. If the input is a word, the search robot of the search engine connected through the Internet returns summaries of the web pages that contain the word. If it is a query sentence, the words in the sentence are analyzed morpheme by morpheme, and the web pages found by the combination of those words are displayed by rank or in sequence.

Such a search extracts HTML documents registered on the Internet, or index results produced by the engine's own search robot, that can be classified by a combination of words or morphemes, and shows the user sequential results ordered by topic or by ranking according to search frequency. However, because it merely reads in documents that contain each word or morpheme, it lists duplicate pages and so retrieves unnecessary information.

In addition, the conventional vertical search method lists hundreds to thousands of pages or websites in a single column. Because the web browser can show only a limited number of results at a time, the user has to page through dozens of result pages one by one.

Accordingly, the present invention was devised to solve the above shortcomings and problems of the search systems and methods of conventional search engines. When a service user requests a word-based or sentence-based web page search through the web search CGI provided by the service provider's web server system, which the user reaches through the web browser of a computer terminal, a cluster engine receives the morpheme information of the word or sentence from the web search CGI. The cluster engine performs a clustering search of web documents on the Internet and, at the same time, generates a hierarchical, conceptualized clustering result from the results that a meta search obtains from other search engines. The web search CGI then displays the search results, grouped under annotations, in a tree like that of Windows Explorer. The object of the invention is to provide an information retrieval system and method based on hierarchical or conceptual document clustering that yields results resembling a classification made by hand and lets information be searched easily without separate search training.

FIG. 1 is a block diagram of the information retrieval system according to the present invention.

FIG. 2 is a configuration diagram of the clustering information retrieval system of the present invention.

FIG. 3 is a data processing flowchart of the information retrieval system according to the present invention.

FIGS. 4 and 5 are screen layouts according to the present invention:

FIG. 4 is a screen layout of the home page according to the present invention, and

FIG. 5 is a screen layout of the information search screen according to the present invention.

(Explanation of reference numerals for the main parts of the drawings)

100: server system
110: web server
111: web search CGI
120: cluster search engine
130: meta search engine
140: filtering database
150: annotation generator
160: cluster analyzer
200: client system
210: web browser
300: web page
400: search engine group

The above object is achieved by a hierarchical, conceptual clustering information retrieval system comprising: a web server including a web search CGI that receives a search request from the computer terminal of a service user connected to the Internet and clusters the search request information; a cluster search engine that performs a clustering search of many web pages on the Internet using the cluster information received from the web search CGI; a filtering database that extracts and sorts, by cluster category, the search information retrieved in real time by the cluster search engine; and an annotation generator that creates and assigns a suitable category name to each information group sorted in the filtering database.

A commonly known clustering method is hierarchical clustering, in which clusters are generated first and the search result shows the documents of the cluster whose centroid best matches the user query. The user query is compared with each cluster in a top-down or bottom-up manner to produce a result, and a search engine and ranking technique are used to output the documents closest to the query.
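The centroid matching described here can be illustrated with a minimal sketch. The patent does not specify a vector model or similarity measure, so the bag-of-words vectors, the cosine similarity, and all function names below are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector for one text (assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centroid(docs):
    """Cluster centroid taken as the summed term frequencies of the member documents."""
    total = Counter()
    for d in docs:
        total.update(tf_vector(d))
    return total

def best_cluster(query, clusters):
    """Return the name of the cluster whose centroid is most similar to the query."""
    q = tf_vector(query)
    return max(clusters, key=lambda name: cosine(q, centroid(clusters[name])))

# Hypothetical clusters, for illustration only.
clusters = {
    "real_estate": ["apartment prices rise", "apartment rental market"],
    "stocks": ["stock market index falls", "stock trading volume"],
}
print(best_cluster("apartment market", clusters))
```

Because cosine similarity ignores vector length, summing the member vectors gives the same ranking as averaging them.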

Clustering techniques include document clustering, which uses the index terms that make up a document, and word clustering, which uses the characteristics of adjacent words. Because the clustering unit differs (documents versus words), the two differ in how the feature vectors used for clustering are computed, but once the feature vectors are obtained, similar clustering algorithms apply.
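The difference between the two feature vectors can be sketched as follows. This is an illustrative assumption, not the patent's method: a document is represented by the counts of its own terms, and a word by the counts of the words found within a small window around it. The function names and the `window` parameter are hypothetical.

```python
from collections import Counter

def document_vector(doc):
    """Document clustering: a document is described by its own index terms."""
    return Counter(doc.lower().split())

def word_vector(word, docs, window=1):
    """Word clustering: a word is described by the words that appear next to it."""
    neighbors = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                # Count every token in the window except the word itself.
                neighbors.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return neighbors

docs = ["seoul apartment prices", "busan apartment prices"]
print(dict(document_vector(docs[0])))
print(dict(word_vector("apartment", docs)))
```

Both functions return sparse count vectors, so the same similarity measure and clustering routine can be applied to either.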

Clustering is thus an algorithm used in information retrieval to group similar objects, that is, documents or terms.

These clustering algorithms fall into three broad types.

The first type concerns the relationship between properties (words and features) and classes (clusters) and is divided into monothetic and polythetic algorithms. The second type defines the relationship between objects and classes and is divided into exclusive and overlapping algorithms.

The last type defines the relationship between classes and is divided into ordered (hierarchic) and unordered (simple partition) algorithms.

The technical feature of the information retrieval system by hierarchical or conceptual clustering of the present invention is that a cluster search engine driven by the hierarchical, conceptual clustering technique described above works together with the service provider's web server, which receives the service user's search request, to extract varied cluster search information and to display the classified, annotated information in real time in the service user's web browser as a tree like that of Windows Explorer.

The technical configuration for the above object and the effects of the information retrieval system and method by hierarchical, conceptual clustering of the present invention will be clearly understood from the following detailed description, which refers to the drawings showing a preferred embodiment of the invention.

First, FIG. 1 is a block diagram of the information retrieval system according to the present invention, and FIG. 2 is a configuration diagram of the clustering information retrieval system of the present invention.

As shown, the information retrieval system by hierarchical, conceptual clustering of the present invention comprises: a web server (110) that receives search request information from the computer terminal of a service user connected to the Internet and transmits to the computer terminal the classified information retrieved through the several devices that cooperate with it over the Internet connection; a cluster search engine (120) that receives the search request information from the web server (110), split into individual words or morphemes, and searches and classifies a number of web pages (300) by the received search terms; a filtering database (140) that classifies and sorts the HTML documents or web pages grouped and combined by the clustering of the cluster search engine (120); and an annotation generator (150) that creates a suitable category name for each information group sorted in the filtering database (140) so that the groups are listed in a tree.

The information retrieval system according to the present invention consists of the service provider's server system (100) and the service user's client system (200), connected to each other through the Internet. Search request information received by the web search CGI (111) of the server system (100) from the web browser (210) of the client system (200) is passed to the cluster analyzer (160), which splits it into individual words or, for a query sentence, into morphemes.
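The role of the cluster analyzer (160) can be approximated with a short sketch. The patent does not describe how morphemes are extracted, so the sketch below substitutes plain word tokenization with a small stop-word list; `analyze_query` and `STOP_WORDS` are hypothetical names, and a real morphological analyzer would take the place of this step.

```python
import re

# Hypothetical stop-word list; stands in for morphological analysis.
STOP_WORDS = {"the", "of", "in", "a", "an", "and", "for", "to"}

def analyze_query(query):
    """Split a word or sentence query into distinct search terms, preserving order."""
    tokens = re.findall(r"\w+", query.lower())
    seen, terms = set(), []
    for tok in tokens:
        if tok not in STOP_WORDS and tok not in seen:
            seen.add(tok)
            terms.append(tok)
    return terms

print(analyze_query("The price of apartments in Seoul"))
```

A single-word query passes through unchanged, which matches the description's distinction between word input and sentence input.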

The search terms extracted per unit in this way are sent to the cluster search engine (120), which clusters the many web pages and websites scattered across the Internet and combines the retrieved information in real time by category and by group.

In addition, the meta search engine (130) of the server system (100) extracts the corresponding search results from the index information of the other search engines (400) operated by the many portal sites connected to the Internet, and these results are integrated with the results from the cluster search engine (120) before transmission.
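One way to picture the integration of meta search results with the cluster search engine's own results is the sketch below. The record layout (dictionaries with `url` and `title` keys), the `sources` field, and the function name are assumptions made for illustration; the patent does not define a data format.

```python
def merge_results(cluster_hits, meta_hits_by_engine):
    """Merge the engine's own hits with hits from external engines, keyed by URL."""
    merged = {}
    for hit in cluster_hits:
        merged[hit["url"]] = {**hit, "sources": ["cluster"]}
    for engine, hits in meta_hits_by_engine.items():
        for hit in hits:
            if hit["url"] in merged:
                # Same page found again: record the extra source, keep one entry.
                merged[hit["url"]]["sources"].append(engine)
            else:
                merged[hit["url"]] = {**hit, "sources": [engine]}
    return list(merged.values())

# Hypothetical hits, for illustration only.
own = [{"url": "http://a.example", "title": "A"}]
meta = {"engine_x": [{"url": "http://a.example", "title": "A"},
                     {"url": "http://b.example", "title": "B"}]}
for hit in merge_results(own, meta):
    print(hit["url"], hit["sources"])
```

Keeping the list of sources per page is optional; it is included here because it could later serve as a simple ranking signal.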

The information retrieved by the cluster search engine (120) passes through the filtering database (140), which filters out duplicate HTML documents and web pages and allows categories to be formed for each class. The annotation generator (150), which works together with the filtering database (140), automatically generates an annotation describing each category.
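The duplicate filtering can be sketched as follows, assuming that two pages count as duplicates when their whitespace-normalized, lower-cased text is identical. That criterion, the `text` field, and the function name are illustrative assumptions; the patent does not state how duplicates are detected.

```python
import hashlib

def remove_duplicates(pages):
    """Keep the first page for each distinct normalized body text."""
    seen, unique = set(), []
    for page in pages:
        normalized = " ".join(page["text"].lower().split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

# Hypothetical pages, for illustration only.
pages = [
    {"url": "http://a.example", "text": "Apartment prices rise"},
    {"url": "http://mirror.example", "text": "apartment   prices rise"},
    {"url": "http://b.example", "text": "Stock index falls"},
]
print([p["url"] for p in remove_duplicates(pages)])
```

Exact-match hashing misses near duplicates; a similarity threshold would be needed to catch those.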

In this way, the information searched and classified by the cluster search engine (120) is given a suitable annotation per category and has unnecessary duplicate documents removed by the filtering database (140). It is then delivered by the web search CGI (111) to the web browser (210) of the client system (200), where it is displayed as a tree-shaped result window made up of document groups.

The technical feature of the service provider's server system (100) according to the present invention is that it clusters in real time the many items of information retrieved in response to a search term or search expression, in the form of a specific word or a sentence, received through the search term input window (220) of the web browser (210) of the client system (200), so that the information grouped by document content can be provided through the web browser (210) of the client system (200).

The search process and search method using the information retrieval system by hierarchical, conceptual clustering of the present invention are described below with reference to the flowchart of FIG. 3 and the screen layouts of FIGS. 4 and 5.

FIG. 3 is a data processing flowchart of the information retrieval system according to the present invention.

As shown, in the clustering-based information retrieval method of the present invention, a search request in the form of a specific word or sentence describing the information the service user wants is first received by the web search CGI (111) of the server system (100) from the web browser (210) of the client system (200) connected to the Internet (step S100). The received search request information is sent to the cluster analyzer (160) (step S101), which extracts in real time the search words, or the morphemes separated out of the query sentence, and sends them to the cluster search engine (120) (step S102).

At this point the meta search engine (130) requests a search for documents or web pages by index from the search engine group (400), which consists of at least one search engine connected to the server system (100) through the Internet (step S103), and the cluster search engine (120) separately searches the web pages on the Internet in real time (step S104).

The inter-document similarity of the search information from the meta search engine (130) and from the cluster search engine (120) is then analyzed in real time, and the search information is classified and grouped into meaningful document groups (step S105). The grouped search information is sent to the filtering database (140), where duplicate documents within each group are removed and the results are sorted in order (step S106).
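The similarity-based grouping of step S105 can be illustrated with a single-pass sketch. The patent does not name a grouping algorithm, so the greedy "join the first sufficiently similar group" rule, the cosine measure, and the `threshold` value are all assumptions made for this example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def group_documents(docs, threshold=0.3):
    """Assign each document to the first group whose seed document is similar enough."""
    groups = []  # each entry: (seed_vector, [member documents])
    for doc in docs:
        vec = Counter(doc.lower().split())
        for seed, members in groups:
            if cosine(vec, seed) >= threshold:
                members.append(doc)
                break
        else:
            # No existing group is close enough, so this document seeds a new one.
            groups.append((vec, [doc]))
    return [members for _, members in groups]

docs = ["apartment prices in seoul", "seoul apartment rental",
        "stock index falls", "stock index rises"]
for group in group_documents(docs):
    print(group)
```

The result depends on document order and on the threshold; an agglomerative hierarchical method would avoid the order dependence at higher cost.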

For the search information sorted into categories by similarity in the filtering database (140), the annotation generator (150) automatically generates an annotation for each group of each category (step S107). The search result for the query from the client system (200) is then output through the web search CGI (111) of the server system (100) to the web browser (210) of the client system (200) as a tree of hierarchical categories (step S108).
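How the annotation generator (150) chooses a name is not disclosed. As one plausible illustration only, the sketch below labels a group with the term that occurs in the most member documents, after excluding the query terms; `generate_label` and the fallback label "misc" are hypothetical.

```python
from collections import Counter

def generate_label(group_docs, query_terms=()):
    """Label a group with its most widespread term, skipping the query terms themselves."""
    counts = Counter()
    for doc in group_docs:
        counts.update(set(doc.lower().split()))  # document frequency, not raw count
    for term in query_terms:
        counts.pop(term, None)
    if not counts:
        return "misc"
    # Break ties alphabetically so the label is deterministic.
    return min(counts, key=lambda t: (-counts[t], t))

group = ["economy apartment prices", "economy apartment rental"]
print(generate_label(group, query_terms=["economy"]))
```

Excluding the query terms keeps every group from receiving the same label, since the query word appears in nearly all retrieved documents.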

Next, FIGS. 4 and 5 are screen layouts according to the present invention: FIG. 4 is a screen layout of the home page, and FIG. 5 is a screen layout of the information search screen.

As shown, in the information retrieval method by hierarchical, conceptual clustering of the present invention, the service user enters the word to be searched in the search term input window at the top center of the web browser and clicks "Search" on the right. As in FIG. 5, a tree of categories like that of Windows Explorer is then formed on one side of the web browser, and each category carries an annotation created by the annotation generator from the information in that category.

To the left of each annotation, a "+" sign is shown when a subcategory exists, and clicking the sign displays the annotations of the subcategories. An annotation whose category is fully expanded is marked with "-", so that categories can be repeatedly expanded and collapsed by selecting the "+" and "-" signs.
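The expand and collapse behaviour can be modelled with a small text renderer. The nested-dictionary tree, the `expanded` set of labels, and the function name are assumptions for illustration; the category names are taken from the FIG. 5 example in the description.

```python
def render_tree(node, expanded, depth=0):
    """Return display lines; '+' marks a collapsed category, '-' an expanded one."""
    children = node.get("children", [])
    if not children:
        marker = " "  # leaf: nothing to expand
    elif node["label"] in expanded:
        marker = "-"
    else:
        marker = "+"
    lines = ["  " * depth + marker + " " + node["label"]]
    if marker == "-":
        for child in children:
            lines.extend(render_tree(child, expanded, depth + 1))
    return lines

tree = {"label": "Economy", "children": [
    {"label": "Real Estate", "children": [
        {"label": "Apartment", "children": [{"label": "Video"}]},
    ]},
    {"label": "Economics"},
]}
print("\n".join(render_tree(tree, {"Economy", "Real Estate"})))
```

Clicking a sign corresponds to adding the label to `expanded` or removing it, then rendering again.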

The information retrieval method is described in more detail below, taking the screen layout of FIG. 5 as an example.

First, when the user enters the word 'economy' in the search term input window of the web browser and starts a web search, the search engine group shown beside the input window is used to retrieve many documents related to 'economy', which are displayed in order under categories with suitable annotations (Yonhap News, Real Estate, Seoul Economy, Economics, and so on). When the service user clicks the "+" sign to the left of the 'Real Estate' annotation under the 'Economy' category, the subcategories related to real estate are displayed, and the annotation 'Apartment', which has subcategories of its own, is marked with a "+" sign.

When the user clicks the "+" sign of the 'Apartment' annotation to expand its subdirectory and selects the 'Video' annotation, the documents finally retrieved for the category path 'Economy > Real Estate > Apartment > Video' are displayed on one side of the web browser.

To search for information in a different category, the user selects another category shown in the category pane on the left.

As described above, in the information retrieval system and method by hierarchical, conceptual clustering of the present invention, when an information search request made through the web browser of the client system is received by the web search CGI, the cluster engine receives the information split into words or morphemes, performs a clustering search of web documents on the Internet, and at the same time generates a hierarchical, conceptualized clustering result from the results that a meta search obtains from other search engines. The filtering database and the annotation generator assign suitable annotation names, and the resulting grouped categories are displayed as a tree. This yields search results resembling a classification made by hand and has the advantage that information can be searched easily without separate search training.

In addition, the clustering-based information retrieval system and method of the present invention do not require a classification scheme for the search results to be set up in advance, and they compute in real time the similarity of the information retrieved from the other search engines, which has the advantage of enabling fast and accurate information retrieval.

Claims

An information retrieval system by hierarchical and conceptual clustering, comprising: a web server (110) that receives search request information from the computer terminal of a service user connected to the Internet and transmits the retrieved, classified information to the computer terminal through a web search CGI (111) that cooperates with it over the Internet connection; a cluster analyzer (160) to which the search request information received from the web server (110) is transmitted and in which it is split into individual words or morphemes; a cluster search engine (120) that extracts a plurality of web pages (300) retrieved for the received search terms, including those retrieved through a meta search engine (130) from a search engine group (400) comprising at least one search engine, and clusters similar documents by category; a filtering database (140) that classifies and sorts the HTML document information or web pages grouped and combined by the clustering of the cluster search engine (120); and an annotation generator (150) that generates a suitable category name for each category sorted in the filtering database (140).

An information retrieval method by hierarchical, conceptual clustering, characterized by the sequential process of:

receiving, through the web search CGI of a server system, a search request in the form of a word or sentence describing the information that a service user wants to find, made through the web browser of a client system connected to the Internet (S100), transmitting the received search request information to a cluster analyzer (S101), and extracting in real time, in the cluster analyzer, the search words or the morphemes separated out of each sentence and transmitting them to a cluster search engine (S102);

requesting, by a meta search engine, a search for documents or web pages from a search engine group composed of at least one search engine connected to the server system through the Internet (S103), while the cluster search engine separately searches web pages on the Internet in real time (S104);

analyzing in real time the inter-document similarity of the search information from the meta search engine and from the cluster search engine, classifying and grouping the search information into meaningful document groups (S105), transmitting the grouped search information to a filtering database, and removing duplicate documents from each group and sorting the results in order (S106); and

automatically generating, through an annotation generator, an annotation for each group of each category for the search information sorted into categories by similarity in the filtering database (S107), and outputting the search result for the query of the client system as a tree of hierarchical categories on the web browser of the client system through the web search CGI of the server system (S108).