KR20010108845A

KR20010108845A - Term-based cluster management system and method for query processing in information retrieval

Info

Publication number: KR20010108845A
Application number: KR1020000029788A
Authority: KR
Inventors: 이택현; 기민호; 문병주; 유병선; 주영란
Original assignee: 기민호; 주식회사 티. 아이 시스템
Priority date: 2000-05-31
Filing date: 2000-05-31
Publication date: 2001-12-08
Also published as: KR100396826B1

Abstract

1. Technical field of the invention described in the claims

The present invention relates to a word cluster management apparatus for query processing in information retrieval, a method thereof, and a computer-readable recording medium on which a program for realizing the method is recorded.

2. Technical problem to be solved by the invention

The present invention aims to provide a word cluster management apparatus that, for an information retrieval service, builds word clusters linking each word to its related words on the basis of collected documents and uses those clusters to present words related to a search term, together with a method thereof and a computer-readable recording medium on which a program for realizing the method is recorded.

3. Summary of the solution

The present invention includes: cluster storage means for storing words grouped into word clusters so that related words linked to a word used in information retrieval can be provided; extraction means for receiving a document to be clustered, that is, a document from which related words are to be extracted for use in information retrieval, and extracting meaningful nouns from it; and cluster management means for extracting related words from the document with reference to the extracted nouns, calculating cluster coefficients, generating word clusters, and updating the cluster storage means.

4. Important uses of the invention

The present invention is used in information retrieval services and the like.

Description

TERMS-BASED CLUSTER MANAGEMENT SYSTEM AND METHOD FOR QUERY PROCESSING IN INFORMATION RETRIEVAL

The present invention relates to a word cluster management apparatus for query processing in information retrieval, a method thereof, and a computer-readable recording medium on which a program for realizing the method is recorded. More particularly, it relates to an apparatus and method that generate and display word clusters in order to help the user of an information retrieval system choose query terms and to make it easy to see where a query term is positioned among related words, and to a computer-readable recording medium on which a program for realizing the method is recorded.

Here, a word cluster is the set of related words for a single word.

As the volume of material available over the Internet grows exponentially, services for finding information are offered through several routes, such as directory services and search-engine-based retrieval services that index material from a large number of unspecified sites.

However, when a user's search term is simple or general, conventional search engines return an enormous number of results and thereby impose a cognitive burden. To find the desired material among such results, the user must search again with an expanded second query or check the results one by one.

In addition, each search engine uses its own notation for operators that perform the same function, so the user has to know the operators used by each search engine.

FIGS. 1 and 2 show examples of the help pages used in conventional search engines.

FIG. 1 is the help page for the search operators of Naver (www.naver.com), and FIG. 2 is the one provided by Yahoo (www.yahoo.co.kr). As these show, existing search engines offer many search modes, including Boolean searches with "AND", "OR", and "NOT" as well as truncation search, synonym search, positional search, and phrase search, so the user must be familiar with every search operator.

Using such a search engine presupposes that the user selects an accurate search term and uses the appropriate operators. Even the operators differ: for the "AND" operation some sites use "AND" while others use "과" (Korean for "and"), and for the "OR" operation one site uses "OR" while another uses "이거나" (Korean for "or"). Because information retrieval operators are not standardized in this way, the user must know the operators offered by the particular search engine, which makes the engines inconvenient to use.

Recently, systems have appeared that accept everyday natural language directly as the search query. Several information retrieval sites have already adopted such natural-language search to help beginners, but with this approach too the user must know the exact query terms to find the desired material, or must repeat the search several times over an enormous result set.

These problems are becoming more prominent with the rapid expansion of the Internet. Internet users now range from preschool children to the elderly, and from people seeking elementary knowledge to professionals who require highly specialized information, so the problems will clearly become more pronounced.

Thus, although the choice of query terms matters more than anything else in information retrieval, selecting accurate query terms is very difficult not only for people with no retrieval experience or beginners in the field concerned, but also for experts with extensive retrieval experience.

That is, conventional information retrieval systems provide various operators and several kinds of search modes for efficient retrieval, but all of them depend on the selection of accurate keywords, so the burden of choosing search terms remains even for users who have used search engines extensively.

The present invention was devised to solve the problems described above. Its object is to provide a word cluster management apparatus that, for an information retrieval service, builds word clusters linking each word to its related words on the basis of collected documents and uses those clusters to present words related to a search term, together with a method thereof and a computer-readable recording medium on which a program for realizing the method is recorded.

FIGS. 1 and 2 show examples of the help pages used in conventional search engines.

FIG. 3 is a block diagram of an embodiment of a word cluster management apparatus for query processing in information retrieval according to the present invention.

FIG. 4 is a detailed block diagram of an embodiment of the noun extractor in the word cluster management apparatus for query processing in information retrieval according to the present invention.

FIG. 5 is an explanatory diagram of an embodiment showing the list of noun nodes extracted by the noun extractor according to the present invention.

FIG. 6 is a block diagram of an embodiment of the cluster manager according to the present invention.

FIG. 7 is a block diagram of an embodiment of the query processor according to the present invention.

FIG. 8 shows an example of a processing result of the query processor according to the present invention.

FIG. 9 shows an example in which the word clustering relationships according to the present invention are displayed as a spiral.

FIG. 10 is a flowchart of an embodiment of the word cluster management method for query processing in information retrieval according to the present invention.

FIG. 11 is a flowchart of an embodiment of the process of extracting nouns from a collected document through morphological analysis according to the present invention.

FIG. 12 is a flowchart of an embodiment of the process of generating word clusters from the list of extracted nouns and updating the word cluster dictionary according to the present invention.

FIG. 13 is a flowchart of an embodiment of the process of extracting related words and generating word clusters from the generated noun node list according to the present invention.

FIG. 14 is a flowchart of an embodiment of the process of updating the word cluster dictionary with the generated word clusters according to the present invention.

* Explanation of reference numerals for the main parts of the drawings *

301: noun extractor 302: cluster manager

303: query processor 304: search engine

305: word cluster dictionary 306: Korean dictionary

307: English dictionary

To achieve the above object, the apparatus of the present invention is a word cluster management apparatus for query processing in information retrieval, comprising: cluster storage means for storing words grouped into word clusters so that related words linked to a word used in information retrieval can be provided; extraction means for receiving a document to be clustered, that is, a document from which related words are to be extracted for use in information retrieval, and extracting meaningful nouns from it; and cluster management means for extracting related words from the document with reference to the extracted nouns, calculating cluster coefficients, generating word clusters, and updating the cluster storage means.

The method of the present invention is a word cluster management method applied to a word cluster management apparatus for query processing in information retrieval, comprising: a first step of receiving a document to be used for information retrieval and extracting meaningful nouns from it; a second step of extracting related words from the document with reference to the extracted nouns, calculating cluster coefficients, and generating word clusters; and a third step of using the generated word clusters to update cluster storage means, which stores information so that related words linked to a word used in information retrieval can be provided.

The present invention also provides a computer-readable recording medium on which is recorded a program for realizing, in an information retrieval system having a processor: a first function of receiving a document to be used for information retrieval and extracting meaningful nouns from it; a second function of extracting related words from the document with reference to the extracted nouns, calculating cluster coefficients, and generating word clusters; and a third function of using the generated word clusters to update cluster storage means, which stores information so that related words linked to a word used in information retrieval can be provided.

When a vast amount of material is held and an information retrieval service must be provided, the present invention shows where the query term entered by the user sits among all words and which words are related to it, so that the user can expand it into a second query. This addresses the cognitive burden and disorientation that users experience when a poorly chosen query term produces a large number of results.

The present invention builds word clusters for the Internet, the so-called sea of information. For a query term that the user enters to find desired information, it shows the related words positioned above, below, to the left, and to the right of the term, together with words related to it in content, so that the position of the query term is easy to grasp. Furthermore, by building word clusters from retrieved documents and displaying the word relationships visually, it minimizes the problems that arise with existing search engines: the difficulty of selecting query terms, the cognitive burden of huge result sets, and the disorientation that occurs while the user searches for information.

The above objects, features, and advantages will become clearer from the following detailed description taken in conjunction with the accompanying drawings. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 3 is a block diagram of an embodiment of a word cluster management apparatus for query processing in information retrieval according to the present invention.

The word cluster management apparatus for query processing in information retrieval according to the present invention includes a noun extractor (301), a cluster manager (302), a query processor (303), a search engine (304), a word cluster dictionary (305), a Korean dictionary (306), and an English dictionary (307).

As shown in FIG. 3, the apparatus of this embodiment is built around the following three components.

The noun extractor (301), which works by morphological analysis, removes unnecessary information from a document whose keywords and related words are to be classified and clustered, extracts the text, splits the extracted text into tokens, and extracts a noun list that includes compound nouns. It stores unregistered words, that is, words not yet in the dictionaries, in temporary storage. It also includes a dictionary management function that continually updates the Korean dictionary (306) and the English dictionary (307), each of which comprises a thesaurus, a noun dictionary, a stopword dictionary, an adverb dictionary, a verb dictionary, an adjective dictionary, and so on, by adding frequently occurring unregistered words to the dictionary that suits their use.

The cluster manager (302) reads the words in the list passed from the noun extractor (301) in order, calculates the association coefficients between words, and generates the word clusters for the document. It then checks whether each generated word cluster is new. If the cluster is new, the cluster manager adds it to the existing word cluster dictionary (305); if the cluster already exists, it checks for synonyms and homonyms and then updates the word cluster dictionary (305).

The query processor (303) receives the user's query, stores the search history, passes the search to the search engine (304), and presents the search results together with related words obtained from the word cluster dictionary (305).
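The query processor's role can be illustrated with a minimal Python sketch. The class name, the callable search engine, and the dictionary layout (word mapped to a list of related word and coefficient pairs) are all assumptions made for illustration; the text specifies only that the query is logged, forwarded to the search engine, and answered together with related words from the word cluster dictionary.

```python
class QueryProcessor:
    """Illustrative stand-in for the query processor (303)."""

    def __init__(self, search_engine, cluster_dictionary):
        # search_engine: hypothetical callable, query -> list of results
        self.search = search_engine
        # cluster_dictionary: hypothetical layout, word -> [(related word, coefficient), ...]
        self.clusters = cluster_dictionary
        self.history = []  # search history kept by the query processor

    def handle(self, query, top_n=5):
        self.history.append(query)      # store the search history
        results = self.search(query)    # delegate the search itself
        # Rank related words by coefficient, strongest first (assumed ordering).
        ranked = sorted(self.clusters.get(query, []),
                        key=lambda pair: pair[1], reverse=True)
        related = [word for word, _ in ranked[:top_n]]
        return {"results": results, "related": related}
```

A caller would construct the processor once and call `handle` for each query; a query with no dictionary entry simply yields an empty list of related words.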

The present invention extracts words from continually produced documents, compares them with the existing clusters for those words, and creates new clusters or keeps existing ones up to date. By showing the user words that are related to the query in meaning and content, it lets the user see where the query term is positioned and guides the user toward accurate query terms and therefore optimal retrieval. Here, a cluster means a related word or a set of related words.

FIG. 4 is a detailed block diagram of an embodiment of the noun extractor in the word cluster management apparatus for query processing in information retrieval according to the present invention.

The word cluster management apparatus for query processing in information retrieval was described above with reference to FIG. 3. FIG. 4 shows the noun extractor (301) of that apparatus in detail. In this embodiment the noun extractor (301) extracts nouns through morphological analysis, but any other method of extracting nouns may also be used.

FIG. 4 is a block diagram of the morphological-analysis-based noun extractor (301) shown in FIG. 3. The noun extractor (301) consists of: a tag filter (401) that strips from the source document the tag information that is not needed for word clustering; a token extractor (402) that splits the text string remaining after tag removal into tokens; a Korean noun extraction processor (403) that, when a token is Korean, consults the Korean dictionary (306) and stores the noun in a noun node; an English noun extraction processor (404) that, when a token is English, consults the English dictionary (307) and stores the noun in a noun node; a Korean unregistered-word store (405) and an English unregistered-word store (406) that hold tokens not found in the dictionaries together with their occurrence counts; and a dictionary manager (407) that updates the Korean dictionary (306) and the English dictionary (307) with words whose occurrence count exceeds a predetermined number.
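As a rough illustration of the tag filter (401), the following Python sketch strips markup with the standard library's `html.parser`. It assumes the source documents are HTML and that script and style content is not wanted for clustering; the text itself says only that unnecessary tag information is removed, so both points are assumptions.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Collects visible text and drops tags, scripts, and styles."""

    SKIPPED = {"script", "style"}  # assumption: no clusterable text inside these

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_tags(document):
    """Return the plain text of an HTML document."""
    parser = TagFilter()
    parser.feed(document)
    parser.close()
    return parser.text()
```

The resulting plain text is what a token extractor would then split into tokens.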

The token extractor (402) reads the text produced by the tag filter (401) sentence by sentence and extracts tokens at the eojeol (space-delimited word) level, using the space, comma, and underline characters as delimiters. It determines whether each extracted token is Korean or English and passes it to the corresponding extraction processor (403 or 404).
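The token extractor's behaviour can be sketched in Python as follows. The three delimiters come from the text; the way a token is classified (any Hangul syllable in the range U+AC00 to U+D7A3 makes it Korean, otherwise any ASCII letter makes it English) is an assumption, since the text does not say how the language is determined.

```python
import re

# Space, comma, and underline, the three delimiters named in the text.
DELIMITERS = re.compile(r"[ ,_]+")
# Precomposed Hangul syllables; an assumed test for "Korean".
HANGUL = re.compile(r"[\uac00-\ud7a3]")
ASCII_LETTER = re.compile(r"[A-Za-z]")


def extract_tokens(sentence):
    """Split one sentence into word-level tokens."""
    return [tok for tok in DELIMITERS.split(sentence) if tok]


def classify(token):
    """Decide which extraction processor a token should be routed to."""
    if HANGUL.search(token):
        return "korean"
    if ASCII_LETTER.search(token):
        return "english"
    return "other"  # digits, punctuation, other scripts


def route(sentence):
    """Return (token, language) pairs in sentence order."""
    return [(tok, classify(tok)) for tok in extract_tokens(sentence)]
```

For example, `route("정보의 처리,search_engine")` yields two Korean tokens followed by the English tokens `search` and `engine`.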

The Korean noun extraction processor (403) and the English noun extraction processor (404) then extract nouns from the tokens through different procedures suited to the characteristics of each language.

The noun extraction performed by the Korean noun extraction processor (403) can be divided into four main steps.

First, noun-analysis exclusion information is used. An eojeol that begins with "같" or "보였", or that contains "다른" or "렇", never contains a noun and can therefore be excluded from noun analysis. A dictionary that collects such patterns is consulted, and words that cannot contain a noun are dropped.

Second, the eojeol is checked for a particle (josa). Using a reverse particle dictionary, the eojeol is compared from right to left to determine whether a particle is present. If one is found, the eojeol is split into the particle and the remainder, and the particle flag is turned on for compound-noun processing. This distinguishes a phrase such as "정보의 처리" ("processing of information"), which has a particle between the nouns and is not treated as a compound noun, from "정보처리" ("information processing"), which is.
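The particle check can be sketched as a longest-suffix match. This Python sketch assumes a tiny hypothetical particle list and uses `str.endswith` in place of a real reverse particle dictionary, so it only approximates the right-to-left comparison described in the text.

```python
# Hypothetical, deliberately small particle list; a real reverse particle
# dictionary would be much larger.
PARTICLES = ["에서", "으로", "의", "를", "을", "는", "은", "이", "가", "에"]

# Try longer particles first so that "에서" wins over "에".
_BY_LENGTH = sorted(PARTICLES, key=len, reverse=True)


def split_particle(eojeol):
    """Return (stem, particle); particle is "" when none is found."""
    for particle in _BY_LENGTH:
        # Require a non-empty stem so a bare particle is not split away.
        if len(eojeol) > len(particle) and eojeol.endswith(particle):
            return eojeol[: -len(particle)], particle
    return eojeol, ""
```

With this list, `split_particle("정보의")` returns `("정보", "의")`, which would turn the particle flag on, while `split_particle("처리")` returns the word unchanged. A naive suffix match also splits nouns that merely end in a particle-like syllable, which is one reason a dictionary check on the stem is still needed afterwards.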

Third, the Korean noun dictionary is used to check whether the token is a noun. If it is a noun but is registered as a stopword, that is, a word that occurs too frequently in general to be useful, processing moves on to the next token.

If the noun is not a stopword, the noun flag is examined. The noun flag is on when the token processed immediately before was a noun. If the noun flag is on, the token is stored as part of a compound noun; if it is off, the token is stored as a single noun. When the particle flag is on, the noun flag is not turned on even though a noun has just been processed.

If the token is not a noun, the noun flag set for compound-noun processing is turned off and the adverb, verb, and adjective dictionaries are consulted. If the token is found to be a predicate, it is judged not to be a noun and processing moves to the next token. Otherwise it may be a new noun, so it is stored in the Korean unregistered-word store (405).

Fourth, the extracted nouns are stored. As illustrated in FIG. 5, a linked-list structure is used so that compound nouns as well as single nouns can be stored: each node holds the token, its occurrence count, and, for a compound noun, the address of the next token. An unregistered word is treated as a single word, and the dictionary manager (407) is notified when its occurrence count exceeds a set value.
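The interplay of the noun flag, the particle flag, and the linked noun nodes can be sketched in Python. The sketch takes hypothetical (stem, has_particle) pairs as input, assumes the dictionaries are plain sets, and leaves the noun flag unchanged when a stopword is skipped, a detail the text does not settle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NounNode:
    token: str
    count: int = 1
    next: Optional["NounNode"] = None  # next token of a compound noun, if any


def tokens(head):
    """Return the tokens of one (possibly compound) noun as a list."""
    out, node = [], head
    while node is not None:
        out.append(node.token)
        node = node.next
    return out


def build_noun_nodes(items, noun_dict, stopwords):
    """items: (stem, has_particle) pairs in document order."""
    heads, tail, noun_flag = [], None, False
    for stem, has_particle in items:
        if stem not in noun_dict:
            noun_flag = False          # a non-noun breaks any compound
            continue
        if stem in stopwords:
            continue                   # assumption: flag left unchanged
        node = NounNode(stem)
        if noun_flag and tail is not None:
            tail.next = node           # extend the current compound noun
        else:
            heads.append(node)         # start a new single noun
        tail = node
        noun_flag = not has_particle   # a particle blocks compounding
    return heads


def merge_counts(heads):
    """Collapse identical nouns and accumulate their occurrence counts."""
    merged = {}
    for head in heads:
        key = tuple(tokens(head))
        if key in merged:
            merged[key].count += 1
        else:
            merged[key] = head
    return list(merged.values())
```

Fed the stems of "정보의 처리를 정보 처리를", the sketch stores "정보" and "처리" as single nouns for the first two words and links the last two into one compound, because only there is no particle between them.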

The English noun extraction processor (404) consults the English dictionary (307), which is built with stopwords excluded. If the token is a noun that is not a stopword, it is stored in a noun node and the flag is turned on.

If the token is not a noun, the flag is turned off and the stopword dictionary is checked. If the token is registered there, processing moves to the next token.

If the token is not in the stopword dictionary, it is checked to see whether it is a predicate. If it is, processing moves to the next token. If it is not, the flag is turned on, the token is stored as an English unregistered word, and processing continues with the next token.

This process is repeated to the end of the document to extract every noun in it, and unregistered words are separated out and stored in the English unregistered-word temporary store (406).

When the next word is processed, it is treated as part of a compound noun if the flag was left on by the previous step.

Through the above process, the morphological-analysis-based noun extractor (301) extracts every noun in the document that is needed for word clustering and stores it as a noun node.

The dictionary manager (407) registers in the Korean dictionary (306) and the English dictionary (307) those stored unregistered words that appear frequently across many documents, and thereby keeps both dictionaries continually up to date. An unregistered word is registered when it satisfies a condition set by the administrator.
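A minimal Python sketch of this promotion logic follows. It assumes the dictionary is a plain set and that the administrator's condition is a simple occurrence threshold; the text leaves the actual condition open, so the threshold is a stand-in.

```python
from collections import Counter


class DictionaryManager:
    """Illustrative stand-in for the dictionary manager (407)."""

    def __init__(self, dictionary, threshold=3):
        self.dictionary = dictionary   # set of registered words
        self.threshold = threshold     # hypothetical administrator-defined limit
        self.unregistered = Counter()  # unregistered word -> occurrence count

    def report(self, token):
        """Record one occurrence; return True if the word was just registered."""
        if token in self.dictionary:
            return False
        self.unregistered[token] += 1
        if self.unregistered[token] > self.threshold:
            self.dictionary.add(token)     # promote into the dictionary
            del self.unregistered[token]   # no longer unregistered
            return True
        return False
```

One manager instance per dictionary (Korean and English) would match the two separate unregistered-word stores described in the text.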

In the embodiment above, the Korean noun extraction processor (403) and the English noun extraction processor (404) operate differently in order to illustrate two alternatives. In the Korean case, the Korean dictionary (306) stores nouns whether or not they are stopwords, so the token is first checked against the noun dictionary and, if it is a noun, stopwords are then filtered out with the stopword dictionary. In the English case, the English dictionary (307) holds only the nouns that can be used in clusters, with stopwords already removed, so a token that is not found there is checked against the stopword dictionary in order to identify newly coined words. Either approach may be used.

FIG. 5 is an explanatory diagram of an embodiment showing the list of noun nodes extracted by the noun extractor according to the present invention.

For each noun extracted by the noun extractor (301), a node stores the token (501), the occurrence count (502), and, when the extracted noun is part of a compound noun, the address of the next token (503).

FIG. 6 is a block diagram of an embodiment of the cluster manager according to the present invention.

FIG. 6 shows the detailed functions of the cluster manager (302) of FIG. 3. The cluster manager receives the noun nodes generated by the noun extractor (301), extracts the related words of the document with the linked-word extractor (601), and generates the per-document word clusters (603). It then determines how the per-document word clusters (603) relate to the existing word cluster dictionary (305) and, through the update performed by the cluster update processor (602), continually builds up the word cluster dictionary (305).

The linked-word extractor (601) reads the words in the noun nodes, which are stored as a linked list, in order, calculates the association coefficients between words, and generates the per-document word clusters (603) for the document.

그 과정은 다음과 같은 다섯 과정으로 이루어져 있다.The process consists of five steps:

처음에는, 명사추출기(301)를 통하여 만들어진 명사노드에서 첫번째 단어로 이동하여 첫번째 단어를 중심어로 하고, 중심어를 제외한 단어중 첫번째 단어로 이동한다.Initially, the noun node created by the noun extractor 301 moves to the first word and the first word is the center word, and the first word is moved to the first word except the central word.

Second, the cosine coefficient of the word pair is calculated with [Equation 1] below, using the cluster target word (Ti), the word in the list (Tj), the frequency of the cluster target word (WTi), and the frequency of the word in the list (WTj). This gives the similarity between the words.

Third, the syntax-information coefficient between the words is calculated with [Equation 2] below, which counts how often word Tj appears in the same phrase as Ti. This gives the degree of concentration within the document.

Here, the frequency term denotes the number of times word Tj appears in the same phrase as word Ti.
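The raw count that the syntax-information coefficient relies on can be sketched as follows. The sketch assumes that a "phrase" is available as a list of nouns (for example, one list per sentence) and computes only the co-occurrence count; any normalization applied by [Equation 2] is not reproduced here. The function name `same_phrase_frequency` is hypothetical.

```python
def same_phrase_frequency(phrases: list[list[str]], ti: str, tj: str) -> int:
    """Count the phrases in which both ti and tj occur."""
    return sum(1 for phrase in phrases if ti in phrase and tj in phrase)


# Each inner list holds the nouns of one phrase (assumed unit: a sentence).
phrases = [
    ["search", "engine", "query"],
    ["query", "cluster"],
    ["search", "query", "cluster"],
]
print(same_phrase_frequency(phrases, "query", "cluster"))   # 2
print(same_phrase_frequency(phrases, "engine", "cluster"))  # 0
```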

Fourth, the cluster coefficient is calculated with [Equation 3] below from the cosine coefficient of [Equation 1] and the syntax-information coefficient of [Equation 2]. This derives the association between the words from the similarity and concentration obtained above.

Lastly, if the word currently being processed is the last word, the cluster for that word is generated. Otherwise, the cluster coefficient is calculated for the next comparison word, and the extractor moves to the next word, repeating the above processing up to the last word.

By repeating the above process, the linked words between words, that is, the related words, are extracted. On that basis, the second stage described next is performed: the cluster update by the cluster update processor (602), which determines the relationship between the clusters of the extracted words and the existing clusters and either creates new clusters or updates existing ones.

The first step determines whether the central word of the per-document word cluster (603) generated from the document already exists in the existing word cluster dictionary (305). If it exists, processing proceeds to the second step, which calculates the Jaccard coefficient; if it is new, processing branches to the sixth step, which adds it to the existing clusters.

The second step calculates the Jaccard coefficient from the central word of the newly generated cluster (Ti), the central word of the previously generated cluster (Tj), the number of words the clusters of Ti and Tj have in common, N(GTi ∩ GTj), and the total number of linked words in the two clusters, N(GTi ∪ GTj). The formula is given as [Equation 4] below.

Here, GTi is the cluster whose central word is Ti, and GTj is the cluster whose central word is Tj.

In the third step, if the calculated Jaccard coefficient is greater than the threshold J_T, the words are treated as synonyms; if it is smaller, they are treated as homonyms.
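The synonym/homonym decision can be illustrated with a short sketch. It assumes that [Equation 4] is the standard Jaccard ratio of the two quantities named in the text, N(GTi ∩ GTj) / N(GTi ∪ GTj), and that each cluster is represented as a set of words. The function names and the threshold value 0.3 are illustrative assumptions.

```python
def jaccard(cluster_i: set[str], cluster_j: set[str]) -> float:
    """N(GTi ∩ GTj) / N(GTi ∪ GTj); returns 0.0 when both clusters are empty."""
    union = cluster_i | cluster_j
    if not union:
        return 0.0
    return len(cluster_i & cluster_j) / len(union)


def classify(cluster_i: set[str], cluster_j: set[str], threshold: float) -> str:
    """Treat as a synonym above the threshold J_T, otherwise as a homonym."""
    return "synonym" if jaccard(cluster_i, cluster_j) > threshold else "homonym"


new_cluster = {"river", "water", "shore"}
existing_a = {"river", "water", "flood"}     # shares 2 of 4 distinct words
existing_b = {"loan", "money", "interest"}   # shares nothing

print(jaccard(new_cluster, existing_a))        # 0.5
print(classify(new_cluster, existing_a, 0.3))  # synonym
print(classify(new_cluster, existing_b, 0.3))  # homonym
```

A coefficient exactly equal to the threshold falls on the homonym side here, which matches the "not greater than" branch of the flowchart of FIG. 14.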

In the fourth step, the recalculated cluster coefficient P(Ti) is obtained with [Equation 5] below from the existing cluster coefficient (PTi), the newly generated cluster coefficient (P'Ti), and the number of documents used in the cluster calculation (n).

In the fifth step, the existing cluster index value is replaced using the recalculated cluster coefficient P(Ti).

In the sixth step, it is determined whether any cluster remains in the document. If none remains, processing continues with the next document; if one remains, generation and modification of the word cluster dictionary (305) continue.

This process is carried out through to the last of the word clusters extracted from the document, so that the existing clustering in the word cluster dictionary (305) is updated continuously.

FIG. 7 is a block diagram of one embodiment of a query processor according to the present invention.

FIG. 7 is a detailed view of the query processor (303) of FIG. 3. It consists of a query generator (701) that receives the query entered by the user and adds the operators used by the search engine (304) so that a conditional search can be performed; a search requester (702) that submits the generated query to the existing search engine (304) and extracts the search results; a query-related cluster expander (703) that looks up words related to the query in the word cluster dictionary (305) and expands the related words; a 2D/3D cluster display (704) that shows the extracted word clusters in the user's browser; and a personal search history manager (705) that manages each user's information-retrieval history.

The query processor (303) receives a query, modifies it in the query generator (701) so that a conditional search can be performed, and, through the search requester (702), submits it to the search engine (304) and receives the search results.

The personal search history manager (705) also receives the query, checks the user's search history, and stores it in the per-user search history store (706).

The query-related cluster expander (703) receives the processed query together with the search results from the search requester (702), also checks the user's queries in the per-user search history store (706), and expands the related words of the query using the word cluster dictionary (305). The 2D/3D cluster display (704) then presents the related words found in the word cluster dictionary (305) to the user along with the search results.
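The lookup performed by the query-related cluster expander (703) can be sketched as a dictionary query. The sketch assumes the word cluster dictionary maps each central word to its related words and their cluster coefficients; this in-memory layout, the function name `expand_query`, and the `top_k` parameter are assumptions made for illustration.

```python
def expand_query(
    query: str,
    cluster_dict: dict[str, dict[str, float]],
    top_k: int = 3,
) -> list[str]:
    """Return up to top_k related words for the query, strongest first."""
    related = cluster_dict.get(query, {})
    ranked = sorted(related.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:top_k]]


# Hypothetical dictionary content: central word -> {related word: coefficient}.
cluster_dict = {
    "cluster": {"clustering": 0.9, "dictionary": 0.4, "query": 0.7},
}
print(expand_query("cluster", cluster_dict, top_k=2))  # ['clustering', 'query']
print(expand_query("unknown", cluster_dict))           # []
```

A query that is not a central word in the dictionary simply yields no related words, so the search results can still be shown on their own.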

FIG. 8 shows an example of a processing result of the query processor according to the present invention.

FIG. 8 shows an example built with the query processor of the present invention. The cluster display (704) of the query processor (303) is implemented as a plug-in that can be embedded in the user's browser, so it can be used without a separate installation step. The central word is shown as a red dot in the middle, and the clusters related to the central word are drawn around it. The size of each circle varies with its degree of relevance to the central word. Right-clicking a word brings up an add-search-term button and an expand-word button. The add-search-term button combines the word with the existing search terms using the search condition selected under the combination condition, such as "AND", "OR", or "NOT". When the user clicks the expand-word button, a new cluster is displayed with that word as the central word.

The History panel at the bottom of the screen shows the history of word expansions; clicking a word there redisplays the cluster centered on that word. The expansion range can be set from 0 to 100% in steps of 20% and limits the range of associations shown when a word is expanded.

The search expression shows the history of words added with the add-search-term button; right-clicking a word removes it from the search expression. The operator symbol on the left indicates the operation, such as "AND(*)", "OR(+)", or "NOT(~)", and the topmost word is the initial query chosen by the user.
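How the search expression might be assembled from the initial query and the added words can be sketched as below. Only the operator symbols (`*`, `+`, `~`) come from the text; the flat, space-separated output format and the function name `build_search_expression` are assumptions, since the actual syntax expected by the search engine is not specified.

```python
OPERATOR_SYMBOLS = {"AND": "*", "OR": "+", "NOT": "~"}


def build_search_expression(initial: str, added: list[tuple[str, str]]) -> str:
    """Join the initial query with (operator, word) pairs in the order added."""
    parts = [initial]
    for operator, word in added:
        parts.append(f"{OPERATOR_SYMBOLS[operator]} {word}")
    return " ".join(parts)


added_terms = [("AND", "retrieval"), ("NOT", "astronomy")]
print(build_search_expression("cluster", added_terms))
# cluster * retrieval ~ astronomy

# Removing a word from the search expression amounts to dropping its pair.
added_terms = [pair for pair in added_terms if pair[1] != "astronomy"]
print(build_search_expression("cluster", added_terms))
# cluster * retrieval
```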

The classification dictionary shows where the central word sits within the overall classification, so that its position can be identified using an existing classification dictionary or thesaurus (degree of relevance).

FIG. 9 shows an example in which the word clustering relationships according to the present invention are displayed as a spiral.

FIG. 9 displays the word cluster in spiral form, showing an example expanded around the word in the middle as the central word. Here too, right-clicking a word displayed on the screen brings up the add-word and add-query buttons, whose functions are the same as described above.

FIG. 10 is a flowchart of one embodiment of a word cluster management method for query processing in information retrieval according to the present invention.

The method is carried out by the word cluster generation and display apparatus. The flow is as follows.

First, the noun extractor (301) extracts nouns from the collected documents through morphological analysis (1001).

Next, the cluster manager (302) generates word clusters from the list of extracted nouns according to the associations between words and updates the existing word cluster dictionary (305) (1002).

Finally, the query processor (303) receives a search query from the user, presents the search results, and presents related words by referring to the word cluster dictionary (305) (1003).

FIG. 11 is a flowchart of one embodiment of the process of extracting nouns from collected documents through morphological analysis according to the present invention.

This process is performed by the noun extractor (301).

The tag filter (401) separates the tag information from a collected document file and generates a text file (1101).

The token separator (402) receives input from the text file sentence by sentence and extracts tokens in units of space-delimited words (eojeol) (1102).

The token separator (402) checks whether the extracted token is Korean (Hangul) (1103). If it is, the Korean noun extraction processor (403) receives the token, extracts the noun by referring to the Korean dictionary (306), and stores it in the noun node list (1104).

If the token is not Korean, it is assumed to be English, and the English noun extraction processor (404) receives the token, extracts the noun by referring to the English dictionary (307), and stores it in the noun node list (1105).

When a received token appears to be a newly coined word, the Korean noun extraction processor (403) and the English noun extraction processor (404) first store it in the Korean unregistered-word store (405) or the English unregistered-word store (406), respectively. The dictionary manager (407) then checks whether the token satisfies a predetermined condition and, if so, registers it as a new noun in the Korean dictionary (306) or the English dictionary (307), so that the Korean noun extraction processor (403) and the English noun extraction processor (404) can use it in their processing.
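The flow of FIG. 11 can be sketched in simplified form. The sketch makes several assumptions that go beyond the text: tags are removed with a regular expression, a token counts as Korean if it contains a Hangul syllable (U+AC00 to U+D7A3), noun extraction is reduced to a plain dictionary lookup without any morphological analysis, and the "predetermined condition" for registering a new word is taken to be a minimum occurrence count. All function and parameter names are hypothetical.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(document: str) -> str:
    """Step 1101: replace markup tags with spaces, leaving plain text."""
    return TAG_PATTERN.sub(" ", document)


def is_hangul(token: str) -> bool:
    """Step 1103: true if the token contains a Hangul syllable."""
    return any("\uac00" <= ch <= "\ud7a3" for ch in token)


def extract_nouns(document, dictionaries, unregistered):
    """Steps 1102-1105: tokenize, pick the dictionary by language, keep nouns."""
    nouns = []
    for raw in strip_tags(document).split():
        token = raw.strip(".,;:!?\"'()").lower()
        if not token:
            continue
        lang = "ko" if is_hangul(token) else "en"
        if token in dictionaries[lang]:
            nouns.append(token)
        else:
            # Unknown token: remember it with its occurrence count.
            counts = unregistered[lang]
            counts[token] = counts.get(token, 0) + 1
    return nouns


def promote_unregistered(dictionaries, unregistered, min_count):
    """Dictionary manager: register tokens seen at least min_count times."""
    for lang, counts in unregistered.items():
        for token in [t for t, c in counts.items() if c >= min_count]:
            dictionaries[lang].add(token)
            del counts[token]


dictionaries = {"ko": {"검색"}, "en": {"search", "engine"}}
unregistered = {"ko": {}, "en": {}}
nouns = extract_nouns("<p>Search engine 검색 blorf blorf.</p>", dictionaries, unregistered)
pending = dict(unregistered["en"])   # snapshot before promotion
promote_unregistered(dictionaries, unregistered, min_count=2)
```

After the run, `nouns` holds the three dictionary words, the made-up token `blorf` was counted twice in `pending`, and promotion moved it into the English dictionary.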

FIG. 12 is a flowchart of one embodiment of the process of generating word clusters from the list of extracted nouns and updating the word cluster dictionary according to the present invention.

This process proceeds as follows.

Related words are extracted and word clusters are generated using the noun node list produced by the noun extractor (301) (1201).

The word cluster dictionary (305) is updated using the generated word clusters (1202).

FIG. 13 is a flowchart of one embodiment of the process of extracting related words and generating word clusters using the generated noun node list according to the present invention.

First, the first word of the noun node list produced by the noun extractor is taken as the central word (1301), and the word following the central word is taken as the comparison word (1302).

The cosine coefficient of the central word and the comparison word is computed to obtain the similarity (1303), and their syntax-information coefficient is computed to obtain the concentration (1304).

The cluster coefficient of the central word and the comparison word is obtained from the similarity and concentration (1305).

It is then determined whether the comparison word is the last word of the noun node list (1306). If it is not, the word following the current comparison word becomes the new comparison word (1307), and processing repeats from the step of computing the cosine coefficient to obtain the similarity (1303).

If the comparison word is the last word of the noun node list, a word cluster is generated for the current central word (1308), and it is checked whether the central word was the last word of the noun node list (1309).

If the central word was not the last word, the word following it becomes the new central word (1310), and processing repeats from the step of taking the word following the central word as the comparison word (1302).

If the central word was the last word, the process of extracting related words and generating word clusters using the noun node list (1201) ends.
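The loop structure of FIG. 13 can be sketched as two nested iterations over the noun list. Because [Equation 1] to [Equation 3] are not reproduced in this text, the coefficient calculation is passed in as a function; the `toy_coefficient` used in the example is a placeholder for demonstration only and is not the formula of the invention. The sketch also assumes the noun list holds distinct words.

```python
from typing import Callable


def build_document_clusters(
    nouns: list[str],
    coefficient: Callable[[str, str], float],
) -> dict[str, dict[str, float]]:
    """Build one cluster per central word, following steps 1301 to 1310."""
    clusters: dict[str, dict[str, float]] = {}
    for i, central in enumerate(nouns):           # 1301 / 1310: next central word
        cluster: dict[str, float] = {}
        for comparison in nouns[i + 1:]:          # 1302 / 1307: next comparison word
            # 1303 to 1305: similarity, concentration, cluster coefficient
            cluster[comparison] = coefficient(central, comparison)
        clusters[central] = cluster               # 1308: cluster for this central word
    return clusters


def toy_coefficient(a: str, b: str) -> float:
    """Placeholder score: 1.0 if the words share a first letter, else 0.0."""
    return 1.0 if a[0] == b[0] else 0.0


doc_clusters = build_document_clusters(["search", "system", "query"], toy_coefficient)
print(doc_clusters["search"])  # {'system': 1.0, 'query': 0.0}
print(doc_clusters["query"])   # {}
```

Since the comparison word always starts at the word after the central word, the last word of the list ends up with an empty cluster in this sketch.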

FIG. 14 is a flowchart of one embodiment of the process of updating the word cluster dictionary using the generated word clusters according to the present invention.

First, the generated per-document word clusters are received (1401), and an unprocessed word cluster is obtained (1402).

It is checked whether the central word of the obtained word cluster is a new word that does not exist in the word cluster dictionary (1403).

If it is a new word, processing proceeds from the step of updating the word cluster dictionary on the basis of this word cluster (1409).

If it is not a new word, the Jaccard coefficient used for the synonym decision is computed (1404), and it is determined whether the Jaccard coefficient is greater than a predetermined threshold (1405).

If the Jaccard coefficient is not greater than the threshold, the word is treated as a homonym (1408), and processing proceeds from the step of updating the word cluster dictionary (1409).

If the Jaccard coefficient is greater than the threshold, the word is treated as a synonym (1406), and the existing cluster coefficient is recalculated using the newly generated cluster coefficient (1407). Processing then proceeds from the step of updating the word cluster dictionary (1409).

Likewise, when the central word of the obtained word cluster is a new word that does not exist in the word cluster dictionary, processing proceeds from the step of updating the word cluster dictionary (1409).

Next, it is checked whether any unprocessed central word remains in the per-document word clusters (1410). If none remains, the process of updating the word cluster dictionary using the generated word clusters (1202) ends.

If an unprocessed central word remains, processing repeats from the step of obtaining an unprocessed word cluster (1402).
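The branching of FIG. 14 can be sketched as a function that decides, for each per-document cluster, which branch applies. The sketch deliberately stops at the decision: how a homonym is stored and how the coefficient is recalculated ([Equation 5]) are not reproduced in this text, so they are left out rather than guessed. Clusters are assumed to be sets of words, the Jaccard ratio is assumed to be the standard intersection-over-union form, and the names `plan_updates` and `overlap` are hypothetical.

```python
def overlap(a: set[str], b: set[str]) -> float:
    """Standard Jaccard ratio of two word sets (assumed form of Equation 4)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def plan_updates(
    dictionary: dict[str, set[str]],
    doc_clusters: dict[str, set[str]],
    threshold: float,
) -> dict[str, str]:
    """Map each central word to the branch taken in FIG. 14."""
    plan: dict[str, str] = {}
    for central, new_words in doc_clusters.items():               # 1402 / 1410
        if central not in dictionary:                             # 1403
            plan[central] = "new"                                 # straight to 1409
        elif overlap(dictionary[central], new_words) > threshold: # 1404 / 1405
            plan[central] = "synonym"                             # 1406, then 1407
        else:
            plan[central] = "homonym"                             # 1408
    return plan


dictionary = {
    "bank": {"money", "loan", "interest"},
    "query": {"search", "term"},
}
doc_clusters = {
    "bank": {"river", "shore", "water"},
    "query": {"search", "term", "engine"},
    "cluster": {"group"},
}
print(plan_updates(dictionary, doc_clusters, threshold=0.5))
# {'bank': 'homonym', 'query': 'synonym', 'cluster': 'new'}
```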

The present invention described above is not limited to the foregoing embodiments and the accompanying drawings. It will be apparent to those of ordinary skill in the art to which the invention pertains that various substitutions, modifications, and changes are possible without departing from the technical spirit of the invention.

As described above, when a vast amount of material is stored and an information retrieval service must be provided, the present invention shows where the user's query is located among all the words and which words are associated with it, so that a second query can be expanded. This resolves the cognitive-burden and disorientation problems caused by the large volume of results that a poorly chosen query produces.

In addition, the present invention builds word clustering over the Internet, often called a sea of information, and, for the query a user enters to find desired information, shows the related words above, below, left, and right of it as well as words related in meaning, making it easy to locate the query.

In addition, by building word clustering with reference to retrieved documents and showing word relationships visually, the present invention can minimize the problems that arise when using existing search engines: query selection, the cognitive burden of vast search results, and the disorientation that occurs while a user searches for information.

That is, by presenting the clustering between words, the present invention makes it possible to choose an accurate query and thereby reduces the burden of taking in a vast quantity of search results; and by identifying the position of a search term, it reduces the disorientation that occurs during information retrieval.

Claims

A word cluster management apparatus for query processing in information retrieval, comprising:

cluster storage means for storing words grouped into word clusters so that related words associated with a word used in information retrieval can be provided;

extraction means for receiving a document to be clustered for related-word extraction for use in information retrieval and extracting meaningful nouns from it; and

cluster management means for extracting related words from the document with reference to the extracted nouns, calculating cluster coefficients, generating word clusters, and updating the cluster storage means.

The apparatus of claim 1, further comprising:

query processing means for receiving a search query from a user and providing related words obtained by referring to the cluster storage means.

The apparatus of claim 2,

wherein the query processing means

provides the related words obtained by referring to the cluster storage means together with the search results for the query from the user.

The apparatus of claim 3,

wherein the query processing means comprises:

query generation means for receiving a query from the user and adding operators used by a search engine so that a conditional search can be performed;

search requesting means for submitting the generated query to an existing search engine;

related-cluster expansion means for obtaining words related to the query by referring to the cluster storage means; and

output means for providing the extracted word cluster and the search results to the user.

The apparatus of claim 4, further comprising:

history management means for receiving the query from the user and managing a personal information-retrieval history; and

history storage means for storing the personal information-retrieval history under the control of the history management means.

The apparatus of claim 1,

wherein the extraction means comprises:

dictionary information storage means for storing thesaurus information, noun information, stopword information, adverb information, verb information, adjective information, and the like, for use in identifying meaningful nouns;

a tag filter for receiving a document to be clustered for related-word extraction for use in information retrieval and separating out tag information that is unnecessary for word clustering;

token extracting means for splitting the text string extracted by the tag filter into tokens;

noun extracting means for examining the tokens separated by the token extracting means, extracting nouns, and storing them in a noun node list;

unregistered-word storage means for storing a token and its occurrence frequency as an unregistered word when the extracted token is not registered in the dictionary information storage means; and

dictionary management means for updating the dictionary information storage means by checking the occurrence counts of the tokens stored in the unregistered-word storage means.

The apparatus of claim 6,

wherein the dictionary information storage means, the noun extracting means, and the unregistered-word storage means

are managed and operated separately for each language used in information retrieval.

The apparatus of any one of claims 1 to 7,

wherein the cluster management means comprises:

linked-word extracting means for receiving the noun node list generated by the extraction means, extracting the related words of the document, and generating and storing per-document word clusters; and

cluster update processing means for updating the word clusters in the cluster storage means by determining the relationship between the per-document word clusters generated by the linked-word extracting means and the word clusters stored in the cluster storage means.

A word cluster management method applied to a word cluster management apparatus for query processing in information retrieval, comprising:

a first step of receiving a document to be used for information retrieval and extracting meaningful nouns;

a second step of extracting related words from the document with reference to the extracted nouns, calculating cluster coefficients, and generating a word cluster; and

a third step of updating, with the generated word cluster, the cluster storage means that stores word clusters so that related words associated with words used in information retrieval can be provided.

The method of claim 9, further comprising:

a fourth step of receiving a search query from a user and providing the user with related words obtained by referring to the cluster storage means.

The method of claim 9, further comprising:

a fourth step of receiving a search query from a user, obtaining related words by referring to the cluster storage means, and providing them to the user together with the search results.

The method of any one of claims 9 to 11,

wherein the first step comprises:

a fifth step of generating a text file by separating tag information from a collected document file;

a sixth step of receiving the text file sentence by sentence and extracting tokens word by word; and

a seventh step in which, according to the language of the extracted token, the noun extracting means for that language receives the token, extracts nouns with reference to the dictionary information storage means, and stores them in a noun node list.

A computer-readable recording medium having recorded thereon a program for realizing, in an information retrieval system equipped with a processor:

a first function of receiving a document to be used for information retrieval and extracting meaningful nouns;

a second function of extracting related words from the document with reference to the extracted nouns, calculating cluster coefficients, and generating a word cluster; and

a third function of updating, with the generated word cluster, the cluster storage means that stores word clusters so that related words associated with words used in information retrieval can be provided.