KR101059557B1

KR101059557B1 - Information retrieval method and computer-readable recording medium containing a program capable of performing the same

Info

Publication number: KR101059557B1
Application number: KR1020080138727A
Authority: KR
Inventors: 이반 베를로셰; 안태성; 이경일
Original assignee: 주식회사 솔트룩스
Priority date: 2008-12-31
Filing date: 2008-12-31
Publication date: 2011-08-26
Also published as: KR20100080099A

Abstract

Disclosed are an information retrieval method and a computer-readable recording medium containing a program capable of performing the method.

The information retrieval method according to the present invention includes: a data collection step of collecting documents and extracting keywords from the collected documents; a query input step in which a searcher inputs a query word; a related keyword selection step of calculating the topic rank of the extracted keywords with respect to the input query word and selecting, from among the keywords, related keywords associated with the query word; a word clustering step of vectorizing the selected related keywords into word vectors and forming the related keywords into word clusters; a cluster rank calculation step of calculating the rank of the formed word clusters; and a related keyword providing step of providing related keywords associated with the input query word using the calculated word cluster ranks and the keyword topic ranks.

Information retrieval, topic rank, word clustering, word vector

Description

Information retrieval method and computer-readable recording medium containing a program capable of performing the same

The present invention relates to a method for retrieving information and a computer-readable recording medium containing a program capable of performing the method, and more particularly to a method for providing related keywords by means of a topic rank and a computer-readable recording medium containing a program capable of performing that method.

Use of the Internet continues to grow, and with it the amount of information accessible through the Internet. Information retrieval has accordingly become both more necessary and more important. However, as the amount of information increases, it becomes harder for a searcher, that is, a user searching for information, to find exactly the information sought.

In the early days, Internet search results were produced by people who gathered information item by item and prioritized it, and those results were then provided in response to searchers' requests.

As the amount of information on the Internet became enormous, that approach reached its limit, and methods in which a search robot collects information and the information is classified by automated processing and then provided have become common. However, search results produced by such automated processing do not provide exactly what the searcher wants, so the searcher has the inconvenience of having to look through the search results again for the desired information.

The technical problem to be solved by the present invention is, in order to address the above problems, to provide an information retrieval method capable of satisfying a searcher's needs and a computer-readable recording medium containing a program capable of performing the method.

To solve the above technical problem, the present invention provides the following information retrieval method and a computer-readable recording medium containing a program capable of performing it.

The information retrieval method according to the present invention includes: a data collection step of collecting documents and extracting keywords from the collected documents; a query input step in which a searcher inputs a query word; a related keyword selection step of calculating the topic rank of the extracted keywords with respect to the input query word and selecting, from among the keywords, related keywords associated with the query word; a word clustering step of vectorizing the selected related keywords into word vectors and forming the related keywords into word clusters; a cluster rank calculation step of calculating the rank of the formed word clusters; and a related keyword providing step of providing related keywords associated with the input query word using the calculated word cluster ranks and the keyword topic ranks.

The topic rank TR(K, w) may be calculated by the following equation.

Here, K is the query word, w is a keyword, DF(K, w) is the document frequency of documents containing both K and w, DF(w) is the document frequency of documents containing w, p(w) is the probability that w appears in a document, and α and β are weights that are positive real numbers.

In the related keyword selection step, N keywords may be selected as the related keywords in descending order of topic rank.

In the word clustering step, the related keywords may be vectorized into word vectors by the following equation.

$\vec{w}_i = (TR_{i1}, TR_{i2}, \ldots, TR_{iN})$

Here, $\vec{w}_i$ is the vectorized i-th related keyword, and $TR_{ij}$ is the topic rank of the j-th related keyword with respect to the i-th related keyword (where i and j are integers between 1 and N).

In the word clustering step, the similarity of the related keywords may be measured by the following equation, and similar related keywords may be formed into word clusters.

$\mathrm{sim}(w_i, w_j) = \cos(\vec{w}_i, \vec{w}_j)$

Here, $\mathrm{sim}(w_i, w_j)$ is the similarity between $w_i$ and $w_j$, and $\cos(\vec{w}_i, \vec{w}_j)$ is the cosine between the vectors $\vec{w}_i$ and $\vec{w}_j$.

The rank of a word cluster may be the average of the topic ranks of the related keywords that make up the word cluster.

In the related keyword providing step, the word clusters may be provided in descending order of word cluster rank, and within each provided word cluster the related keywords may be provided in order of their topic rank.
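The presentation order described here can be sketched as follows. This is a minimal illustration only: the function name `order_for_display` and the dictionary-based inputs are assumptions made for the example, not structures named in the patent, and the ranks are taken as already computed.

```python
def order_for_display(clusters, cluster_rank, topic_rank):
    """Order clusters and their keywords for presentation.

    clusters:     dict mapping a cluster id to a list of related keywords
    cluster_rank: dict mapping a cluster id to its (precomputed) cluster rank
    topic_rank:   dict mapping a keyword to its topic rank TR(K, w)
    All three structures are hypothetical stand-ins for illustration.
    """
    # Clusters with the highest rank come first.
    ordered_ids = sorted(clusters, key=cluster_rank.get, reverse=True)
    # Inside each cluster, keywords with the highest topic rank come first.
    return [
        sorted(clusters[cid], key=topic_rank.get, reverse=True)
        for cid in ordered_ids
    ]
```

Sorting twice, once across clusters and once within each cluster, keeps the two ranking criteria separate, which matches the two-level ordering the paragraph describes.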

The method may further include a step of providing a related document, which is a document from which a related keyword selected by the searcher from among the provided related keywords was extracted.

According to the present invention, related keywords and documents matching a specific query word are provided to the searcher, and search performance can be improved by using the searcher's feedback information. In particular, because related keywords associated with the query word entered by the searcher are provided, the searcher can accurately select the desired information even without entering an exact query word.

Hereinafter, an information retrieval method according to embodiments of the present invention and a computer-readable recording medium containing a program capable of performing the method are described in detail with reference to the accompanying drawings. However, the present invention is not limited to the following embodiments, and a person of ordinary skill in the art will be able to implement the present invention in various other forms without departing from its technical spirit. That is, specific structural and functional descriptions are given only for the purpose of describing embodiments of the present invention; embodiments of the present invention may be implemented in various forms and should not be construed as limited to the embodiments described herein. Because the invention is not limited by the embodiments described herein, it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms such as first and second may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be named a second component, and similarly a second component may be named a first component.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present. By contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no intervening component is present. Other expressions describing relationships between components, such as "between" versus "directly between" and "adjacent to" versus "directly adjacent to", are to be interpreted in the same way.

The terminology used in this application is used only to describe particular embodiments and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify the presence of a stated feature, number, step, operation, component, or combination thereof, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

FIG. 1 is a schematic diagram showing the configuration of an information retrieval system for implementing an information retrieval method according to an embodiment of the present invention.

Referring to FIG. 1, the information retrieval system 1 is connected through a network 100 and broadly consists of a control unit 1000 and a storage unit 2000. The control unit 1000 may include a collection unit 1100, an analysis unit 1200, an indexing unit 1300, a topic rank processing unit 1400, a word clustering processing unit 1500, a cluster rank processing unit 1600, a language analysis unit 1700, a providing unit 1800, a user feedback processing unit 1900, and the like. The storage unit 2000 may include a main storage unit 2100, an index storage unit 2200, a rank storage unit 2300, a log storage unit 2400, and the like. The control unit 1000 is connected through the network 100 to Internet documents 10 or a searcher device 200 and is configured to collect and provide information.

The collection unit 1100 may collect Internet documents 10 through the network 100, translate them, and generate an Internet document structure for each Internet document 10. The detailed functions and configuration of the collection unit 1100 are described later. The term Internet document 10 collectively covers documents such as the various web pages on the Internet that carry information including text. Specifically, Internet documents 10 may include, for example, ordinary web pages, blogs, and news articles. Beyond these, anything that contains text or carries information that can be expressed as text may qualify. For example, posts in a particular community (known by names such as cafe, club, or society), web pages belonging to the websites of companies or individuals, news articles published on media or portal sites, and posts published on blogs may all be included in the Internet documents 10. An Internet document 10 may of course contain not only text but also multimedia data such as pictures, videos, and music. In particular, even an Internet document 10 composed mainly of multimedia data may include textual information such as a title.

The analysis unit 1200 may analyze the collected Internet documents 10, specifically the Internet document structures generated by the collection unit 1100, and generate analysis information for each, including keywords and a feature keyword vector. The indexing unit 1300 may index the collected Internet documents 10 and the analysis information to generate index information including the keywords and feature keyword vectors. The collected Internet documents 10, Internet document structures, analysis information, and index information may be stored in the main storage unit 2100, and the analysis unit 1200 and the indexing unit 1300 may receive information from the collection unit 1100 and the analysis unit 1200, respectively, or may use information stored in the main storage unit 2100. The detailed functions and configuration of the analysis unit 1200 are described later.

A feature keyword vector is generated for each Internet document 10, specifically for each individual Internet document structure, and expresses the characteristics of the information contained in the individual Internet document 10 in the form of a word vector. The word vector contains keywords that characterize the individual Internet document 10 and a weight for each keyword. The weights are obtained using, among other things, the term frequency (TF) of each keyword and the inverse document frequency (IDF), which is the inverse of the frequency with which each keyword appears in the set of Internet documents. Term frequency is the number of times a particular keyword occurs in an individual Internet document and measures how well that keyword represents the document's content. Inverse document frequency is the inverse of the proportion of Internet documents in the set in which a particular keyword appears; a keyword that appears in few Internet documents is better able to distinguish the documents in which it appears from other Internet documents.
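A feature keyword vector of this kind might be built roughly as follows. The text says only that the weights use TF and IDF, so the specific weighting tf × log(N / df), the function name `feature_keyword_vectors`, and the pre-tokenized input are all assumptions made for this sketch.

```python
import math
from collections import Counter


def feature_keyword_vectors(docs):
    """Build one {keyword: weight} vector per document.

    docs is a list of token lists (tokenization is assumed to have been done
    elsewhere). The weight tf * log(N / df) is one common TF-IDF variant,
    chosen here only for illustration.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each keyword appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # term frequency within this document
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors
```

With this weighting, a keyword that appears in every document gets weight 0, which reflects the point that widely shared keywords do little to distinguish one document from another.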

The topic rank processing unit 1400 may calculate the topic rank between keywords of the Internet documents 10, or the topic rank between a query word entered by the searcher through the searcher device 200 and keywords extracted from the Internet documents 10. It may also select related keywords for the query word on the basis of the calculated topic ranks. The method of calculating the topic rank is described later.

The word clustering processing unit 1500 may vectorize the selected related keywords into word vectors so that similar related keywords form word clusters. The method of forming word clusters is described later.

The cluster rank processing unit 1600 may obtain the rank of each word cluster by averaging the topic ranks of the related keywords included in the formed word cluster.

The language analysis unit 1700 may perform language analysis on a query word entered by the searcher through the searcher device 200 connected via the network 100. The language analysis unit 1700 may determine the language of the entered query word and, when the query word is a sentence or consists of multiple words, analyze it and select core query words.

The providing unit 1800 provides the searcher device 200 with the results that the information retrieval system 1 produces from the entered query word. In doing so, it may provide related keywords for the entered query word, or Internet documents associated with those related keywords.

The user feedback processing unit 1900 may store in the log storage unit 2400 the searcher's selections among the results provided by the providing unit 1800, and may reflect them in the selection of related keywords.

In addition to the main storage unit 2100, the storage unit 2000 includes an index storage unit 2200 that stores index information, a rank storage unit 2300 that stores topic ranks and cluster ranks, and a log storage unit 2400 that stores searcher information including searchers' log information. The main storage unit 2100, index storage unit 2200, rank storage unit 2300, and log storage unit 2400 may each be physically separate storage devices, or may be logical partitions of one or more storage devices.

FIG. 2 is a flowchart illustrating an information retrieval method according to an embodiment of the present invention.

Referring to FIGS. 1 and 2, Internet documents 10 are collected through the collection unit 1100 (S112). The collection unit 1100 may collect a specific kind of Internet document 10 or a broad range of Internet documents 10. For example, it may collect only specific kinds of Internet documents 10, such as news articles or blog posts, or it may also collect a broad range of Internet documents 10 such as ordinary web pages of companies or individuals, posts in particular communities, and multimedia data. This may be determined by the kind of service that the information retrieval system 1 is intended to provide to searchers.

The analysis unit 1200 analyzes the Internet documents 10 and extracts keywords representing each Internet document 10 (S114). Along with the keywords, a feature keyword vector expressing the characteristics of the information contained in the Internet document 10 in the form of a word vector may be formed. A keyword is at least one word that carries meaning in the Internet document 10 concerned, and the feature keyword vector is a word vector containing the keywords and their weights.

Meanwhile, when the searcher enters a query word through the searcher device 200 (S122), the language analysis unit 1700 analyzes the query word (S124). The searcher device 200 may be connected to the information retrieval system 1 through the network 100. The network 100 may include the wired or wireless Internet, a local area network, an intranet, and the like. The language analysis unit 1700 may analyze the input language, format, and so on of the query word entered by the searcher, identify at least one core query word, and use it in place of the query word in subsequent steps. Hereinafter, the term query word may refer to a single entered query word or a single identified core query word, or to a combination of at least two identified core query words. Likewise, the term query word may refer to a query word or core query word in the language in which it was entered, or to its translation into the main language processed by the information retrieval system 1.

Next, the topic rank TR(K, w) of each extracted keyword w with respect to the analyzed query word K is calculated (S130). The topic rank TR(K, w) may be calculated by the following equation.

Here, K is the query word, w is a keyword, DF(K, w) is the document frequency of documents containing both K and w, DF(w) is the document frequency of documents containing w, p(w) is the probability that w appears in a document, and α and β are weights that are positive real numbers.

More specifically, the topic rank TR(K, w) represents the degree of association of keyword w with query word K. DF(K, w) and DF(w) denote the document frequency of (K and w) and of (w), respectively. Document frequency means the number of collected documents that contain the keyword or query word in question (here, a document means an individual collected Internet document 10). That is, DF(K, w) is the frequency of documents containing both the query word K and the keyword w, and DF(w) is the frequency of documents containing the keyword w.

The first part of the topic rank TR(K, w) formula, DF(K, w)/DF(w), is the proportion of documents containing keyword w that also contain query word K. Therefore, the larger the value of DF(K, w)/DF(w), the stronger the association between keyword w and query word K.

However, when both DF(w) and DF(K, w) are small, for example when DF(w) is 1 and DF(K, w) is 1, it can be difficult to judge whether the association between keyword w and query word K is truly strong. That is, even if, among a great many documents, there is a rare document in which keyword w appears together with query word K, it is hard to conclude from this that keyword w and query word K are strongly associated.

Conversely, when both DF(w) and DF(K, w) are large, for example close to the total number of documents, it can also be difficult to judge whether the association between keyword w and query word K is truly strong. This merely means that both keyword w and query word K are used frequently, and it is difficult to infer an association from it. One example is the case where English words such as a, the, is, and of are chosen as keyword w and query word K.

The second part of the topic rank TR(K, w) formula, -p(w)log(p(w)), may be used to make the association between keyword w and query word K more accurate. Hereinafter, this second part, -p(w)log(p(w)), may be called the entropy part. By virtue of the entropy part, TR(K, w) may become 0 when p(w) is 0 or 1. The entropy part can therefore minimize the inaccuracy in the degree of association that may arise when p(w), the probability that keyword w is present in the documents, approaches 0 or 1.
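As a brief check of the behavior described here, the entropy part can be examined at its endpoints. The text does not state the base of the logarithm, so the natural logarithm is assumed, and the name f is introduced only for this illustration.

```latex
\begin{align*}
f(p) &= -p \ln p, \qquad 0 < p \le 1 \\
\lim_{p \to 0^{+}} f(p) &= 0, \qquad f(1) = 0 \\
f'(p) &= -\ln p - 1 = 0 \;\Longrightarrow\; p = e^{-1}, \qquad f(e^{-1}) = e^{-1} \approx 0.368
\end{align*}
```

Under that assumption the entropy part vanishes for keywords that appear in almost no documents or in almost all of them, and is largest for keywords that appear in roughly 37% of documents.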

That is, the topic rank TR(K, w) can improve accuracy by using the entropy part (the second part of the formula) to compensate for errors that may arise when the association between keyword w and query word K is calculated mechanically (the first part of the formula). α and β are weights assigned to set the relative influence of the first and second parts of the topic rank TR(K, w) formula, and positive real numbers may be used. For example, the topic rank TR(K, w) may be calculated using α = 3 and β = 2.
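The following sketch shows one way the two parts could be combined. The equation itself is not reproduced in this text, so how α and β enter is an assumption here: the two parts are multiplied, with α and β used as exponents, which is one form under which TR(K, w) is 0 when p(w) is 0 or 1. Estimating p(w) as DF(w) divided by the number of documents, the natural logarithm, and the function name `topic_rank` are likewise assumptions.

```python
import math


def topic_rank(df_kw, df_w, n_docs, alpha=3.0, beta=2.0):
    """Illustrative topic rank of keyword w for query word K.

    df_kw:  DF(K, w), number of documents containing both K and w
    df_w:   DF(w), number of documents containing w
    n_docs: total number of collected documents
    ASSUMED form: (DF(K, w) / DF(w)) ** alpha * (-p ln p) ** beta,
    with p = DF(w) / n_docs. The patent's actual equation may differ.
    """
    if df_w == 0 or n_docs == 0:
        return 0.0
    ratio = df_kw / df_w                      # first part of the formula
    p = df_w / n_docs                         # assumed estimate of p(w)
    # Entropy part; treated as 0 at the endpoints p = 0 and p = 1.
    entropy = 0.0 if p <= 0.0 or p >= 1.0 else -p * math.log(p)
    return (ratio ** alpha) * (entropy ** beta)
```

Under these assumptions a keyword that appears in every document scores 0 even if it always co-occurs with the query word, which is the stop-word case the text describes.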

The topic rank TR(K, w) is also calculated at the stage when the searcher inputs the query word K. Accordingly, the values of DF(K, w), DF(w), p(w), and so on for a particular keyword w may differ depending on when the search is made. In this way, search results that reflect the passage of time can be obtained.

After the topic rank processing unit 1400 calculates the topic rank of each keyword w with respect to the query word K, N keywords w (where N is a positive integer greater than 1) are selected as related keywords in descending order of topic rank (S142). The number of related keywords may be determined by the settings of the information retrieval system 1; for example, N may be set to 100 so that 100 related keywords are selected.
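Selecting the N highest-ranked keywords can be sketched as below. The function name `select_related_keywords` and the dictionary of precomputed topic ranks are assumptions for the example.

```python
import heapq


def select_related_keywords(topic_ranks, n=100):
    """Return up to n keywords with the largest topic rank, largest first.

    topic_ranks is a hypothetical dict mapping each keyword w to TR(K, w)
    for the current query word K.
    """
    return heapq.nlargest(n, topic_ranks, key=topic_ranks.get)
```

Using `heapq.nlargest` avoids sorting the full keyword set when N is much smaller than the number of extracted keywords.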

The word clustering processing unit 1500 may vectorize the N selected related keywords into word vectors and form word clusters (S144). A word cluster is a group formed by classifying the N selected related keywords so that similar keywords fall together. To compare the similarity of the N selected related keywords, they may be vectorized into word vectors by the following equation.

$$\vec{w}_i = (TR_{i1}, TR_{i2}, \ldots, TR_{iN})$$

Here, $\vec{w}_i$ is the vectorized i-th related keyword, and TR_ij is the topic rank of the j-th related keyword w_j with respect to the i-th related keyword w_i, where i and j are integers between 1 and N. That is, TR_ij can be expressed by the following equation.

$$TR_{ij} = TR(w_i, w_j)$$

Here, w_i and w_j are the i-th and j-th related keywords, respectively.

Through this vectorization, N word vectors are formed, one for each vectorized related keyword.
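A minimal sketch of this vectorization follows, assuming, as the definition of TR_ij suggests, that the j-th component of the i-th word vector is TR_ij. The pairwise table `TOY_TR` and the function `toy_topic_rank` are hypothetical stand-ins for the topic rank formula, which is not reproduced here.

```python
from typing import Callable


def vectorize_keywords(
    keywords: list[str],
    topic_rank: Callable[[str, str], float],
) -> dict[str, list[float]]:
    """Map each keyword w_i to the vector (TR(w_i, w_1), ..., TR(w_i, w_N))."""
    return {wi: [topic_rank(wi, wj) for wj in keywords] for wi in keywords}


# Hypothetical pairwise topic ranks; missing pairs default to 0.0.
TOY_TR = {
    ("galaxy", "galaxy"): 1.0, ("galaxy", "smartphone"): 0.8, ("galaxy", "lions"): 0.1,
    ("smartphone", "galaxy"): 0.8, ("smartphone", "smartphone"): 1.0,
    ("lions", "galaxy"): 0.1, ("lions", "lions"): 1.0,
}


def toy_topic_rank(wi: str, wj: str) -> float:
    """Look up a made-up TR(w_i, w_j) value for illustration."""
    return TOY_TR.get((wi, wj), 0.0)


vectors = vectorize_keywords(["galaxy", "smartphone", "lions"], toy_topic_rank)
for word, vec in vectors.items():
    print(word, vec)
```

Each of the N keywords yields an N-dimensional vector, so the result can be read as an N x N matrix of pairwise topic ranks.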

The similarity between the N vectorized related keywords may be measured by the following equation.

$$\mathrm{sim}(\vec{w}_i, \vec{w}_j) = \cos(\vec{w}_i, \vec{w}_j)$$

Here, $\mathrm{sim}(\vec{w}_i, \vec{w}_j)$ is the similarity between $\vec{w}_i$ and $\vec{w}_j$, and $\cos(\vec{w}_i, \vec{w}_j)$ is the cosine of the angle between the vectors $\vec{w}_i$ and $\vec{w}_j$.

In this way, the similarity of the N vectorized related keywords is measured, and similar related keywords among the N are grouped into word clusters.
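The text does not name a specific clustering algorithm, so the following sketch is only one possible reading: cosine similarity between word vectors, followed by a simple greedy grouping in which a keyword joins the first cluster whose first member is similar enough. The threshold of 0.8, the greedy strategy, and the sample vectors are all assumptions.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_keywords(vectors: dict[str, list[float]], threshold: float = 0.8) -> list[list[str]]:
    """Greedy grouping (assumed strategy): compare each keyword with the
    first member of every existing cluster and join the first match."""
    clusters: list[list[str]] = []
    for word, vec in vectors.items():
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([word])
    return clusters


# Hypothetical word vectors of pairwise topic ranks.
word_vectors = {
    "galaxy": [1.0, 0.8, 0.1, 0.0],
    "smartphone": [0.8, 1.0, 0.0, 0.1],
    "lions": [0.1, 0.0, 1.0, 0.7],
    "baseball": [0.0, 0.1, 0.7, 1.0],
}
print(cluster_keywords(word_vectors))
# [['galaxy', 'smartphone'], ['lions', 'baseball']]
```

Any other method that groups keywords by this similarity measure, such as hierarchical clustering, would fit the description equally well.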

After the word clusters are formed, the cluster rank processing unit 1600 calculates the cluster rank of each word cluster as the average of the topic ranks TR(K, w), with respect to the query, of the related keywords in that cluster (S146). The topic ranks TR(K, w) of the related keywords for the query were already obtained during topic rank calculation (S130) and related keyword selection (S142); these values may be stored in the rank storage unit 2300 and then read by the cluster rank processing unit 1600.

The providing unit 1800 provides related keywords to the searcher based on the cluster ranks and topic ranks obtained above (S148). That is, the related keywords of each word cluster may be provided in order of the cluster rank calculated by the cluster rank processing unit 1600, and within each word cluster the related keywords may be provided in order of topic rank. Compared with simply listing the N selected related keywords in order of their individual topic ranks, this lets the searcher distinguish the related keywords more easily.
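Steps S146 and S148 can be sketched together: the cluster rank is the mean topic rank of a cluster's keywords, clusters are listed from highest to lowest cluster rank, and keywords inside each cluster from highest to lowest topic rank. The function name `order_for_display` and the sample data are hypothetical.

```python
def order_for_display(
    clusters: list[list[str]],
    topic_ranks: dict[str, float],
) -> list[list[str]]:
    """Order clusters by mean topic rank, and keywords within a cluster by topic rank."""

    def cluster_rank(cluster: list[str]) -> float:
        # Cluster rank: average of the member keywords' topic ranks TR(K, w).
        return sum(topic_ranks[w] for w in cluster) / len(cluster)

    ranked_clusters = sorted(clusters, key=cluster_rank, reverse=True)
    return [sorted(c, key=topic_ranks.get, reverse=True) for c in ranked_clusters]


# Hypothetical clusters and topic ranks for one query.
sample_clusters = [["lions", "baseball"], ["smartphone", "galaxy"]]
sample_ranks = {"galaxy": 0.9, "smartphone": 0.7, "lions": 0.5, "baseball": 0.3}
print(order_for_display(sample_clusters, sample_ranks))
# [['galaxy', 'smartphone'], ['lions', 'baseball']]
```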

By providing related keywords that match the searcher's query in this way, the system can help the searcher accurately find the desired information even without entering a precise query.

Subsequently, when the searcher selects a specific related keyword from the related keywords provided through the searcher device 200, the providing unit 1800 may additionally provide related documents, that is, the Internet documents 10 from which that related keyword was extracted (S150). The related documents provided to the searcher may be sorted in descending order of the keyword's weight among the documents from which the keyword was extracted.
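A sketch of step S150, assuming the index keeps a per-document weight for every extracted keyword. The index layout, the function name `related_documents`, and the document identifiers are hypothetical.

```python
def related_documents(
    keyword: str,
    weights_by_keyword: dict[str, dict[str, float]],
) -> list[str]:
    """Return the documents the keyword was extracted from, highest keyword weight first."""
    weights = weights_by_keyword.get(keyword, {})
    return sorted(weights, key=weights.get, reverse=True)


# Hypothetical index: keyword -> {document id: weight of the keyword in that document}.
keyword_index = {"galaxy": {"post-17": 0.42, "post-03": 0.91, "post-28": 0.10}}
print(related_documents("galaxy", keyword_index))  # ['post-03', 'post-17', 'post-28']
```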

Which of the related keywords or Internet documents 10 provided through the searcher device 200 the searcher selected may be stored in the log storage unit 2400 through the user feedback processing unit 1900. This user feedback may be applied during related keyword selection (S142) so that the searcher's choices are reflected (S160). Specifically, when a related keyword's selection rate is low compared with other related keywords and does not exceed a predetermined threshold, the keyword may be judged to have low relevance and excluded from the related keyword results.
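The feedback filter in step S160 can be sketched as follows. The selection rate is assumed here to be selections divided by the number of times the keyword was shown, keywords that have never been shown are kept, and the threshold value is arbitrary; none of these details are specified in the text.

```python
def filter_by_feedback(
    keywords: list[str],
    shown: dict[str, int],
    selected: dict[str, int],
    threshold: float = 0.01,
) -> list[str]:
    """Drop keywords whose selection rate does not exceed the threshold."""
    kept = []
    for word in keywords:
        impressions = shown.get(word, 0)
        if impressions == 0:
            # No feedback logged yet (assumption: keep the keyword).
            kept.append(word)
            continue
        rate = selected.get(word, 0) / impressions
        if rate > threshold:
            kept.append(word)
    return kept


# Hypothetical log counts read from the log storage unit.
shown_counts = {"galaxy": 100, "weather": 100}
selected_counts = {"galaxy": 10, "weather": 0}
print(filter_by_feedback(["galaxy", "weather", "lions"], shown_counts, selected_counts))
# ['galaxy', 'lions']
```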

FIG. 3 is a flowchart illustrating an information retrieval method according to a modification of the embodiment of the present invention.

Referring to FIGS. 1 and 3, the modified embodiment differs from the embodiment shown in FIG. 2 in the topic rank calculation step (S126). After the Internet documents 10 are collected through the collection unit 1100 (S212), the analysis unit 1200 analyzes them, extracts keywords representing each Internet document 10, and stores the keywords in the index storage unit 2200 (S214). The topic ranks between the keywords are then calculated in advance and stored in the rank storage unit 2300 (S216). This calculation may be set to run each time keywords are extracted from a collected Internet document 10, or to run at specific points in time.

When the topic ranks between keywords are set to be calculated at specific points in time, the interval, such as once a day, once a week, or once a month, may be chosen in light of the volume of collected Internet documents 10, and the calculation may be scheduled for when the load on the information retrieval system 1 is relatively low, that is, when few searchers are using it, such as late at night.

Meanwhile, when the searcher inputs a query through the searcher device 200 (S222), the language analysis unit 1700 analyzes the query (S224), and the topic ranks that were calculated and stored in advance for the keyword corresponding to the query may be read from the rank storage unit 2300.

This avoids the time delay that could arise if the topic rank were calculated every time a query is input.
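The precompute-then-lookup flow of this modified embodiment can be sketched with an in-memory store. The class `RankStore`, its method names, and the toy rank function are hypothetical; a real system would persist the values and run `rebuild` as a scheduled batch job.

```python
from itertools import permutations
from typing import Callable


class RankStore:
    """In-memory stand-in for the rank storage unit (2300)."""

    def __init__(self) -> None:
        self._ranks: dict[str, dict[str, float]] = {}

    def rebuild(self, keywords: list[str], topic_rank: Callable[[str, str], float]) -> None:
        """Batch step (S216): precompute TR(k, w) for every ordered keyword pair."""
        ranks: dict[str, dict[str, float]] = {k: {} for k in keywords}
        for k, w in permutations(keywords, 2):
            ranks[k][w] = topic_rank(k, w)
        self._ranks = ranks

    def lookup(self, query: str) -> dict[str, float]:
        """Query time: return stored ranks for the query keyword without recomputing."""
        return dict(self._ranks.get(query, {}))


def toy_topic_rank(k: str, w: str) -> float:
    """Made-up rank function used only to fill the store in this sketch."""
    return 1.0 / (1 + abs(len(k) - len(w)))


store = RankStore()
store.rebuild(["ab", "abc"], toy_topic_rank)  # e.g. run nightly when load is low
print(store.lookup("ab"))  # {'abc': 0.5}
```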

FIGS. 4 and 5 are schematic diagrams showing the configurations of the collection unit and the analysis unit according to an embodiment of the present invention. Taking blog posts as an example, FIGS. 4 and 5 illustrate the process by which collected Internet documents are stored in a form that can be provided to a searcher.

FIG. 4 is a schematic diagram showing the configuration of the collection unit according to an embodiment of the present invention.

Referring to FIG. 4, because an Internet document 10 and its surrounding information may be written in various languages, the language determination module 1110 may first determine the language in which they are written, such as Korean, Japanese, Chinese, or English. A collection module composed of the Internet document collection module 1122 and the surrounding information collection module 1124 may then collect the Internet document 10 together with its surrounding information. When the Internet document 10 is, for example, a single blog post, the Internet document collection module 1122 may identify the address of a feed, such as RSS or ATOM, offered by the blog and collect the Internet document 10 from it. However, not every Internet document 10 is available through a feed address. For example, most blogs provide only a portion of their recent posts in the feed, and in such cases the Internet document 10 may be collected by extracting the post body. Likewise, the surrounding information collection module 1124 may extract and collect the surrounding information, including comments and trackbacks.

Because the Internet document 10 and its surrounding information collected in this way are gathered separately, unlike in their original form, they may go through a structuring process in the content restoration module 1130. For example, when blog posts are collected, they may be analyzed and restored through steps such as extracting the full body of the post, linking comment and trackback information, extracting the content of the existing HTML post, and structuring the result in RSS/ATOM format. In addition, depending on the language determined by the language determination module 1110, the automatic translation module 1140 may machine-translate an Internet document 10 and its surrounding information written in a language other than the target language into the target language. From the Internet document 10 and surrounding information restored by the content restoration module 1130 and the translation output of the automatic translation module 1140, the unit structure generation module 1150 may generate one Internet document structure per Internet document 10, for example per post in the case of a blog. The Internet document structure may be generated in a machine-processable format such as XML or RSS.
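As a rough illustration of an Internet document structure in XML form, the sketch below assembles one post together with its comments and trackbacks. The element and attribute names (`post`, `body`, `comments`, `trackbacks`, `url`, `lang`, `href`) and the sample values are invented for this example; the text does not define a schema.

```python
import xml.etree.ElementTree as ET


def build_post_structure(
    url: str,
    language: str,
    body: str,
    comments: list[str],
    trackbacks: list[str],
) -> str:
    """Serialize one blog post and its surrounding information as an XML string."""
    post = ET.Element("post", attrib={"url": url, "lang": language})
    ET.SubElement(post, "body").text = body

    comments_el = ET.SubElement(post, "comments")
    for comment in comments:
        ET.SubElement(comments_el, "comment").text = comment

    trackbacks_el = ET.SubElement(post, "trackbacks")
    for link in trackbacks:
        ET.SubElement(trackbacks_el, "trackback", attrib={"href": link})

    return ET.tostring(post, encoding="unicode")


post_xml = build_post_structure(
    url="http://blog.example.com/posts/1",
    language="ko",
    body="Full text of the post.",
    comments=["First comment", "Second comment"],
    trackbacks=["http://other.example.com/trackback/9"],
)
print(post_xml)
```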

The Internet document structure generated in this way is stored in the main storage unit 2100, and the analysis unit 1200 may either receive the Internet document structure directly from the collection unit 1100 or load it from the main storage unit 2100 for analysis.

FIG. 5 is a schematic diagram showing the configuration of the analysis unit according to an embodiment of the present invention.

Referring to FIG. 5, the analysis unit 1200 may generate analysis information by analyzing, with text mining techniques, the Internet document structures collected and generated by the collection unit 1100. The analysis unit 1200 receives an Internet document structure generated by the collection unit 1100 or stored in the main storage unit 2100, and the named entity analysis module 1210 analyzes it to extract the main named entities. Named entity analysis examines the text of the Internet document structure and extracts meaningful words such as person names, company names, product names, service names, and dates, using a named entity dictionary and extraction rules. The feature extraction module 1220 may then statistically analyze the extracted named entities and the information contained in the Internet document structure to extract keywords that represent the Internet document structure.
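Dictionary-plus-rules named entity extraction can be sketched in a few lines. The dictionary entries, the labels, and the single ISO-style date rule below are hypothetical; a production dictionary and rule set would be far larger and language-specific.

```python
import re

# Hypothetical named entity dictionary: surface form -> entity type.
ENTITY_DICTIONARY = {"Samsung": "COMPANY", "Galaxy": "PRODUCT"}

# Hypothetical extraction rule: dates written as YYYY-MM-DD.
DATE_RULE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text: str) -> list[tuple[str, str]]:
    """Return (entity, type) pairs found by dictionary lookup, then by the date rule."""
    entities = []
    for name, label in ENTITY_DICTIONARY.items():
        # Whole-word match so that longer words containing the name are ignored.
        if re.search(rf"\b{re.escape(name)}\b", text):
            entities.append((name, label))
    entities.extend((match.group(), "DATE") for match in DATE_RULE.finditer(text))
    return entities


print(extract_entities("Samsung announced the Galaxy on 2008-12-31."))
# [('Samsung', 'COMPANY'), ('Galaxy', 'PRODUCT'), ('2008-12-31', 'DATE')]
```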

The automatic classification module 1230 may automatically classify the Internet document structures. This classification may be based on a predefined category list 1232 and machine learning data 1234 prepared according to the category list 1232. Automatic classification performs well when the classes are clearly distinct, and its performance tends to drop when the classes are similar. Accuracy may fall especially in multi-level classification. For example, a system can classify documents into broad categories such as sports, society, and economy reasonably well, but classifying ball sports into baseball, volleyball, basketball, and so on may be comparatively less accurate. The automatic classification module 1230 may be implemented with an algorithm such as a Bayesian classifier or a support vector machine (SVM). Here, the automatic classification module 1230 may classify into top-level categories only, because classifying into multi-level categories lowers accuracy and increases the load that machine learning places on the system.

The automatic clustering module 1240 clusters the automatically classified Internet document structures within each classification category. Automatic clustering lets the system statistically group Internet document structures into arbitrary units. The automatic clustering module 1240 may be implemented using, for example, the K-means algorithm. For the Internet document structures clustered in this way, the information quantity measurement module 1250 may measure an information quantity index. The feature keyword vector generated by combining the keywords with this information quantity index is a word vector that represents each Internet document structure and may be used for searching. Analysis information including the extracted keywords and the generated feature keyword vectors may be stored back in the main storage unit 2100.

The Internet document structures, keywords, and other data stored in the main storage unit 2100 in this way may be provided in a form convenient for the searcher by the information retrieval method described above.

FIG. 6 is a screen showing an information search result produced by the information retrieval method according to an embodiment of the present invention.

Referring to FIG. 6, a list of related keywords is provided when "Samsung" is entered as the query. In the list, the entries shown in bold are the first related keyword of each word cluster, that is, the related keyword with the largest topic rank within that cluster, so that the word clusters can be told apart. This display method is only an example and may vary with the design of the information retrieval system. For example, the related keyword list may instead be presented with a line break between word clusters.

This confirms that relevant related keywords are provided efficiently, so that the searcher can find the desired information even when entering a query with a broad meaning.

Embodiments of the present invention can also be written as a program executable on a computer system. Such a program, read from a computer-readable recording medium on which it is recorded, can be executed on a digital computer system.

Examples of computer-readable recording media include ROM, RAM, CD-ROM, DVD-ROM, magnetic tape, floppy disks, and optical data storage devices, as well as media implemented in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium may also be distributed over network-connected computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present invention can be readily inferred by programmers skilled in the art to which the present invention pertains.

<Description of main parts of the drawings>

1: information retrieval system, 10: Internet document, 200: searcher device, 1000: control unit, 1100: collection unit, 1200: analysis unit, 1300: indexing unit, 1400: topic rank processing unit, 1500: word clustering processing unit, 1600: cluster rank processing unit, 1700: language analysis unit, 1800: providing unit, 1900: user feedback processing unit, 2000: storage unit, 2100: main storage unit, 2200: index storage unit, 2300: rank storage unit, 2400: log storage unit

Claims

A data collection step of collecting documents and extracting keywords of the collected documents;

A query input step of inputting a query word by a searcher;

A related keyword selection step of calculating a topic rank of each extracted keyword with respect to the input query word and selecting, from among the keywords, related keywords associated with the query word;

A word clustering step of vectorizing the selected related keywords into word vectors to form the related keywords into word clusters;

A cluster rank calculation step of calculating a rank of each formed word cluster; and

A related keyword providing step of providing related keywords related to the input query word using the calculated rank of the word cluster and the topic rank of the keywords.

The method according to claim 1,

wherein the topic rank TR(K, w) is calculated by the following formula,

where K is the query word, w is the keyword, DF(K, w) is the document frequency of documents containing both K and w, DF(w) is the document frequency of documents containing w, p(w) is the probability that w is present in a document, and α and β are weights that are positive real numbers.

The method of claim 2,

wherein the related keyword selection step selects N keywords as the related keywords in descending order of topic rank, where N is a positive integer greater than 1.

The method of claim 2,

wherein the word clustering step vectorizes the related keywords into word vectors according to the following equation,

$$\vec{w}_i = (TR_{i1}, TR_{i2}, \ldots, TR_{iN})$$

where $\vec{w}_i$ is the vectorized i-th related keyword and TR_ij is the topic rank of the j-th related keyword with respect to the i-th related keyword, i and j being integers between 1 and N.

5. The method of claim 4,

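wherein the word clustering step measures the similarity of the related keywords by the following equation and forms similar related keywords into word clusters,

$$\mathrm{sim}(\vec{w}_i, \vec{w}_j) = \cos(\vec{w}_i, \vec{w}_j)$$

where $\mathrm{sim}(\vec{w}_i, \vec{w}_j)$ is the similarity between $\vec{w}_i$ and $\vec{w}_j$, and $\cos(\vec{w}_i, \vec{w}_j)$ is the cosine value between the vectors $\vec{w}_i$ and $\vec{w}_j$.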

The method according to claim 1,

wherein the rank of the word cluster is the average of the topic ranks of the related keywords constituting the word cluster.

The method according to claim 6,

wherein the related keyword providing step provides the word clusters in order of the rank of the word clusters and, within each provided word cluster, provides the related keywords in order of their topic rank.

The method according to claim 1,

further comprising providing a related document, the related document being a document from which a related keyword selected by the searcher from among the provided related keywords was extracted.

A computer-readable recording medium containing a program capable of performing the information retrieval method according to any one of claims 1 to 8.