KR20100080100A

KR20100080100A - Method for searching information and computer readable medium storing thereof

Info

Publication number: KR20100080100A
Application number: KR1020080138728A
Authority: KR
Inventors: 안태성; 이반 베를로셰; 이경일
Original assignee: 주식회사 솔트룩스
Priority date: 2008-12-31
Filing date: 2008-12-31
Publication date: 2010-07-08
Also published as: KR101057075B1

Abstract

PURPOSE: An information retrieval method, and a computer-readable medium storing a program capable of performing it, are provided so that a user can accurately select desired information without entering an exact query. CONSTITUTION: A query is input by a searcher (S122). A topic rank between the query and each keyword, or between keywords, is calculated (S130). Related topics associated with the query are selected from among the keywords (S132). The degree of association of each selected related topic with the query is assigned as a weight to form a weight vector. Documents related to the query are provided using the weight vector (S150).

Description

Method for searching information and computer-readable recording medium storing a program capable of performing the same

The present invention relates to a method for retrieving information and to a computer-readable recording medium storing a program capable of performing the method, and more particularly to a method for forming related topics and a related query by means of a topic rank, and to a computer-readable recording medium storing a program capable of performing that method.

Use of the Internet continues to grow, and with it the amount of information accessible through the Internet. The need for information retrieval is therefore growing, as is its importance. However, as the amount of information increases, it becomes increasingly difficult for a searcher, that is, a user searching for information, to find exactly the information sought.

Initially, information retrieval results provided on the Internet were obtained by having people gather information and prioritize it item by item, and those results were provided in response to a searcher's request.

As the amount of information on the Internet became vast, the earlier method reached its limits, and methods in which information is collected by a search robot, classified through mechanical processing, and then provided have become common. However, the results of information retrieval based on such mechanical processing do not accurately provide what the searcher wants, which causes the inconvenience of having to look through the search results again for the desired information.

The technical problem to be solved by the present invention is to provide, in order to solve the above problem, an information retrieval method capable of satisfying a searcher's needs and a computer-readable recording medium storing a program capable of performing the method.

To solve the above technical problem, the present invention provides the following information retrieval method and a computer-readable recording medium storing a program capable of performing it.

The information retrieval method according to the present invention includes: a data collection step of collecting documents and extracting keywords from the collected documents; a query input step in which a query is input by a searcher; a related topic generation step of calculating a topic rank between the query and the keywords, or between the keywords, and selecting from among the keywords the related topics associated with the query; a related query generation step of forming a weight vector by assigning, as a weight, the degree of association of each selected related topic with the query; and a document retrieval step of providing documents associated with the query by using the weight vector.

The topic rank TR(K, w) between the query and a keyword and the topic rank TR(w_i, w_j) between keywords may be calculated by the following equation.

Here, K is the query; w, w_i, and w_j are keywords; DF(K, w) is the document frequency of documents containing both K and w; DF(w_i, w_j) is the document frequency of documents containing both w_i and w_j; DF(w) or DF(w_j) is the document frequency of documents containing w or w_j; p(w) or p(w_j) is the probability that w or w_j appears in a document; α and β are weights that are positive real numbers; and i and j are positive integers that differ from each other and do not exceed the number of extracted keywords.

In the related topic selection step, a connection relationship may be formed when the topic rank TR(K, w) between the query and a keyword, or the topic rank TR(w_i, w_j) between keywords, is equal to or greater than a predetermined value, and keywords that are M or fewer connection relationships away from the query may be selected as the related topics (M is a positive integer greater than 1).
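The selection rule above can be sketched as a breadth-first walk over a graph of connection relationships. This is only an illustrative sketch: the function name `select_related_topics`, the dictionary layout of `topic_rank`, the treatment of a connection as directed from x to y, and all sample values are assumptions, not taken from the source.

```python
from collections import deque


def select_related_topics(query, topic_rank, threshold, max_links):
    """Return {keyword: number of connections from the query}.

    topic_rank maps (x, y) pairs to TR(x, y). A connection x -> y is
    assumed to exist when TR(x, y) >= threshold (hypothetical layout).
    """
    # Build the connection relationships from the thresholded topic ranks.
    neighbors = {}
    for (x, y), tr in topic_rank.items():
        if tr >= threshold:
            neighbors.setdefault(x, []).append(y)

    # Breadth-first search records the smallest number of connections
    # needed to reach each keyword from the query.
    depth = {query: 0}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        if depth[node] == max_links:
            continue  # do not expand beyond M connections
        for nxt in neighbors.get(node, []):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)

    del depth[query]  # the query itself is not a related topic
    return depth


# Hypothetical topic ranks, for illustration only.
sample_ranks = {
    ("jaguar", "car"): 0.8,
    ("jaguar", "cat"): 0.6,
    ("car", "engine"): 0.7,
    ("engine", "piston"): 0.9,
    ("cat", "milk"): 0.1,
}
related = select_related_topics("jaguar", sample_ranks, threshold=0.5, max_links=2)
```

With M = 2, "piston" is left out because it is three connections from the query, and "milk" is left out because its topic rank falls below the threshold.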

The weight vector WM may be formed by the following equation so as to combine each related topic with the topic rank of the connection relationship formed from that related topic toward the query.

Here, w_k is the k-th related topic, TR_k is the topic rank of the connection relationship formed from the k-th related topic toward the query, N is the number of selected related topics, and k is an integer between 1 and N.

The weight vector WM may also be formed by the following equation so as to combine each related topic with the topic rank of the connection relationship formed from that related topic toward the query, while reflecting the depth, which is the number of connection relationships between the query and the related topic.

Here, w_k is the k-th related topic, TR_k is the topic rank of the connection relationship formed from the k-th related topic toward the query, depth_k is the depth, that is, the number of connection relationships between the query and the k-th related topic, dTR_k is the depth-weighted topic rank of the k-th related topic, N is the number of selected related topics, and k is an integer between 1 and N.
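The two weight-vector equations are not reproduced in this text, so the following is only a sketch of the idea under stated assumptions. It represents WM as a mapping from each related topic w_k to its weight. The plain form uses TR_k directly; the depth-weighted form assumes dTR_k = TR_k / depth_k, which is a hypothetical choice made purely for illustration and may differ from the equation in the original filing. The function name `weight_vector` and the sample values are likewise assumptions.

```python
def weight_vector(related_topics, use_depth=False):
    """Build WM as {w_k: weight} from (w_k, TR_k, depth_k) triples.

    use_depth=False: weight is TR_k.
    use_depth=True:  weight is TR_k / depth_k (ASSUMED form of dTR_k).
    """
    wm = {}
    for topic, tr, depth in related_topics:
        wm[topic] = tr / depth if use_depth else tr
    return wm


# Hypothetical related topics: (keyword, topic rank toward the query, depth).
topics = [("car", 0.8, 1), ("engine", 0.7, 2)]
wm_plain = weight_vector(topics)
wm_depth = weight_vector(topics, use_depth=True)
```

Under this assumed weighting, a topic two connections away from the query contributes half of its topic rank, so topics closer to the query dominate the related query.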

In the data collection step, a characteristic keyword vector, which is a word vector representing each document, may be formed using the keywords of the collected documents, and in the document retrieval step, documents related to the query may be selected and provided by comparing the characteristic keyword vectors with the weight vector.

The method may further include a user feedback step of analyzing the keywords of the documents selected by the searcher from among the documents provided in the document retrieval step, and removing from the related topics the keywords contained in documents whose selection rate is equal to or less than a predetermined threshold.
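The user feedback step can be sketched as follows. The source does not define how the selection rate is measured, so this sketch assumes it is the number of times a document was selected divided by the number of times it was provided; the function name `prune_related_topics`, the data layout, and the sample values are all hypothetical.

```python
def prune_related_topics(related_topics, doc_keywords, shown, clicked, threshold):
    """Drop related topics that occur in rarely selected documents.

    shown[doc]   : times the document was provided to searchers
    clicked[doc] : times the document was selected (missing means 0)
    Selection rate is ASSUMED to be clicked / shown.
    """
    removed = set()
    for doc, n_shown in shown.items():
        if n_shown == 0:
            continue  # never provided, so no evidence either way
        rate = clicked.get(doc, 0) / n_shown
        if rate <= threshold:
            removed |= doc_keywords.get(doc, set())
    return [t for t in related_topics if t not in removed]


# Hypothetical log data, for illustration only.
kept = prune_related_topics(
    related_topics=["car", "cat", "engine"],
    doc_keywords={"d1": {"car", "engine"}, "d2": {"cat"}},
    shown={"d1": 10, "d2": 10},
    clicked={"d1": 6},
    threshold=0.1,
)
```

Here document d2 was provided ten times and never selected, so its keyword "cat" is removed from the related topics.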

According to the present invention, a related query matching a specific query is formed so that related topics and related documents are provided to the searcher, and search performance can be improved by using the searcher's feedback information. In particular, because related topics and a related query associated with the query input by the searcher are formed, the searcher can accurately select the desired information even without entering an exact query.

Hereinafter, an information retrieval method according to embodiments of the present invention, and a computer-readable recording medium storing a program capable of performing it, are described in detail with reference to the accompanying drawings. The present invention is not limited to the following embodiments, however, and a person of ordinary skill in the art will be able to implement the invention in various other forms without departing from its technical spirit. That is, specific structural and functional descriptions are given only to illustrate embodiments of the invention; the embodiments may be practiced in various forms and should not be construed as limited to those described herein. Because the invention is not limited by the embodiments described herein, it should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

Terms such as first and second may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be called a second component, and similarly a second component may be called a first component.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may be present. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no intervening component is present. Other expressions describing relationships between components, such as "between" and "directly between", or "adjacent to" and "directly adjacent to", are to be interpreted in the same way.

The terms used in this application are used only to describe particular embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of the stated features, numbers, steps, operations, components, or combinations thereof, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

FIG. 1 is a schematic diagram showing the configuration of an information retrieval system for implementing an information retrieval method according to an embodiment of the present invention.

Referring to FIG. 1, the information retrieval system 1 is connected through a network 100 and broadly consists of a control unit 1000 and a storage unit 2000. The control unit 1000 may include a collection unit 1100, an analysis unit 1200, an indexing unit 1300, a topic rank processing unit 1400, a related query generation unit 1500, a search unit 1600, a language analysis unit 1700, a providing unit 1800, a user feedback processing unit 1900, and the like. The storage unit 2000 may include a main storage unit 2100, an index storage unit 2200, a log storage unit 2300, and the like. The control unit 1000 is configured to connect to Internet documents 10 or a searcher device 200 through the network 100 in order to collect and provide information.

The collection unit 1100 may collect Internet documents 10 through the network 100, translate them, and generate an Internet document structure for each Internet document 10. The detailed functions and configuration of the collection unit 1100 are described later. The term Internet document 10 collectively refers to documents, such as various web pages, that hold information including text on the Internet. Specifically, Internet documents 10 may include, for example, ordinary web pages, blogs, and news articles. Anything else that contains text, or holds information that can be represented as text, may also qualify. For example, Internet documents 10 may include posts in a particular community (known by names such as cafe, club, or hobby group), web pages on the web sites of companies or individuals, news articles published by media outlets or portal sites, and posts published on blogs. An Internet document 10 may of course contain not only text but also multimedia data such as pictures, video, and music. In particular, even an Internet document 10 composed mainly of multimedia data may include textual information such as a title.

The analysis unit 1200 may analyze the collected Internet documents 10, specifically the Internet document structures generated by the collection unit 1100, and generate analysis information for each, including keywords and a characteristic keyword vector. The indexing unit 1300 may index the collected Internet documents 10 and the analysis information to generate index information including the keywords and the characteristic keyword vectors. The collected Internet documents 10, the Internet document structures, the analysis information, and the index information may be stored in the main storage unit 2100, and the analysis unit 1200 and the indexing unit 1300 may receive information from the collection unit 1100 and the analysis unit 1200, respectively, or use the information stored in the main storage unit 2100. The detailed functions and configuration of the analysis unit 1200 are described later.

A characteristic keyword vector is generated for each Internet document 10, specifically for each individual Internet document structure, and represents the characteristics of the information contained in that Internet document 10 in the form of a word vector. The word vector contains keywords that characterize the individual Internet document 10 and a weight for each keyword. The weights are obtained using, for example, the term frequency (TF) of each keyword and the inverse document frequency (IDF), which is the inverse of how frequently each keyword appears in the set of Internet documents. Term frequency is the number of times a given keyword occurs in an individual Internet document, and is a measure of how well that keyword represents the document's content. Inverse document frequency is the inverse of the proportion of Internet documents in the set in which a given keyword appears; a keyword that appears in few Internet documents is better able to distinguish the documents in which it appears from other Internet documents.
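A minimal sketch of building characteristic keyword vectors is shown below. The source states only that the weights are obtained using TF and IDF, so the specific weighting TF × log(N / DF) is an assumption (it is one common TF-IDF form). The function name `characteristic_keyword_vectors` and the sample documents are hypothetical, and keyword extraction is assumed to have produced token lists already.

```python
import math
from collections import Counter


def characteristic_keyword_vectors(docs):
    """Return one {keyword: weight} vector per document.

    docs is a list of keyword lists. The weight is ASSUMED to be
    TF * log(N / DF), one common TF-IDF variant.
    """
    n_docs = len(docs)

    # Document frequency: number of documents containing each keyword.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # occurrences of each keyword in this document
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors


# Hypothetical keyword lists for two documents.
sample_docs = [["jaguar", "car", "car"], ["jaguar", "cat"]]
vectors = characteristic_keyword_vectors(sample_docs)
```

In this sample, "jaguar" appears in every document and so receives weight 0, while "car" appears twice in a single document and receives the largest weight, matching the point that keywords found in few documents distinguish them best.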

The topic rank processing unit 1400 may calculate the topic rank between keywords extracted from the Internet documents 10, or the topic rank between a query input by the searcher through the searcher device 200 and the keywords extracted from the Internet documents 10, and may select related topics from the calculated topic ranks. The calculated topic ranks may also be stored in the index storage unit 2200. The method of calculating the topic rank is described later.

The related query generation unit 1500 may use the related topics to generate a related query, which is a weight vector. The method of generating the related query is described later.

The search unit 1600 may select documents associated with the query input by the searcher by comparing the related query, which is the weight vector, with the characteristic keyword vectors.
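The comparison performed by the search unit can be sketched as a similarity ranking. The source says only that the weight vector and the characteristic keyword vectors are compared; cosine similarity is an assumed choice here, and the function names `cosine` and `rank_documents` and all sample values are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse {keyword: weight} vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(weight_vector, doc_vectors):
    """Return (doc_id, score) pairs with a positive score, best first."""
    scored = [(doc_id, cosine(weight_vector, vec)) for doc_id, vec in doc_vectors.items()]
    scored = [(doc_id, score) for doc_id, score in scored if score > 0.0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical related query (weight vector) and characteristic keyword vectors.
related_query = {"car": 0.8, "engine": 0.6}
doc_vectors = {
    "d1": {"car": 1.0, "engine": 1.0},
    "d2": {"cat": 1.0},
    "d3": {"car": 1.0},
}
ranked = rank_documents(related_query, doc_vectors)
```

Document d1 matches both related topics and ranks first, d3 matches one, and d2 shares no keyword with the related query and is not returned.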

The language analysis unit 1700 may perform language analysis on a query that the searcher inputs through the searcher device 200 connected via the network 100. The language analysis unit 1700 determines the language of the input query and, when the query is a sentence or consists of multiple words, may analyze it and select the core query terms.

The providing unit 1800 provides the searcher device 200 with the documents that the search unit 1600 has selected as being associated with the input query.

The user feedback processing unit 1900 may store, in the log storage unit 2300, the searcher's selections from the results provided by the providing unit 1800, and may reflect those selections in the selection of the related query.

In addition to the main storage unit 2100, the storage unit 2000 includes the index storage unit 2200, which stores the index information and the topic ranks, and the log storage unit 2300, which stores searcher information including the searchers' log information. The main storage unit 2100, the index storage unit 2200, and the log storage unit 2300 may each be a physically separate storage device, or may be logical partitions of one or more storage devices.

FIG. 2 is a flowchart illustrating an information retrieval method according to an embodiment of the present invention.

Referring to FIGS. 1 and 2, Internet documents 10 are collected through the collection unit 1100 (S112). The collection unit 1100 may collect a specific kind of Internet document 10 or a wide range of kinds. For example, it may collect a specific kind of Internet document 10, such as news articles or blog posts, or it may collect a wide range of Internet documents 10 that also includes ordinary web pages of companies or individuals, posts in particular communities, multimedia data, and so on. This may be determined by the type of service that the information retrieval system 1 is intended to provide to searchers.

The analysis unit 1200 analyzes the Internet documents 10 and extracts keywords representing each Internet document 10 (S114). Along with the keywords, a characteristic keyword vector is formed that represents, in the form of a word vector, the characteristics of the information contained in the Internet document 10 (S116). A keyword is at least one word that carries meaning in the corresponding Internet document 10, and the characteristic keyword vector is a word vector containing the keywords and their weights.

Meanwhile, when the searcher inputs a query through the searcher device 200 (S122), the language analysis unit 1700 analyzes the query (S124). The searcher device 200 may be connected to the information retrieval system 1 through the network 100. The network 100 may include the wired or wireless Internet, a local area network, an intranet, and the like. The language analysis unit 1700 may analyze the input language, format, and so on of the query entered by the searcher, identify at least one core query term, and use it in place of the query in subsequent steps. Hereinafter, the term "query" may refer to a single input query or a single identified core query term, or to a combination of at least two identified core query terms. Likewise, the term "query" may refer to a query or core query term in the language in which it was input, or to one translated into the main language processed by the information retrieval system 1.

Next, the topic rank TR(K, w) of each extracted keyword w with respect to the analyzed query K, and the topic rank TR(w_i, w_j) between keywords, are calculated (S130). The topic ranks TR(K, w) and TR(w_i, w_j) may be calculated by the following equation.

Here, K is the query; w, w_i, and w_j are keywords; DF(K, w) is the document frequency of documents containing both K and w; DF(w_i, w_j) is the document frequency of documents containing both w_i and w_j; DF(w) or DF(w_j) is the document frequency of documents containing w or w_j; p(w) or p(w_j) is the probability that w or w_j appears in a document; α and β are weights that are positive real numbers; and i and j are positive integers that differ from each other and do not exceed the number of extracted keywords. That is, w denotes any one of the extracted keywords, and w_i and w_j denote two different extracted keywords.

More specifically, the topic rank TR(K, w) represents the degree of association of the keyword w with the query K. Likewise, the topic rank TR(w_i, w_j) represents the degree of association of another keyword w_j with one keyword w_i. DF(K, w) and DF(w) denote the document frequency of (K and w) and of w, respectively. Similarly, DF(w_i, w_j) and DF(w_j) denote the document frequency of (w_i and w_j) and of w_j, respectively. Document frequency means the number of collected documents that contain the keyword or query in question (here, a document means an individual collected Internet document 10). That is, DF(K, w) is the number of documents containing both the query K and the keyword w, and DF(w) is the number of documents containing the keyword w. Likewise, DF(w_i, w_j) is the number of documents containing both of the two different keywords w_i and w_j, and DF(w_j) is the number of documents containing the keyword w_j.
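The document frequencies defined above amount to simple counts over the collected documents, as the following sketch shows. The function name `document_frequency`, the representation of each document as a set of terms, and the sample data are assumptions for illustration. Estimating p(y) as DF(y) divided by the number of documents is also an assumption; the source says only that p(y) is the probability that y appears in a document.

```python
def document_frequency(docs, *terms):
    """Count the documents that contain every one of the given terms.

    docs is a list of sets of terms; one term gives DF(y),
    two terms give the joint frequency DF(x, y).
    """
    return sum(1 for tokens in docs if all(term in tokens for term in terms))


# Hypothetical collection of three documents.
sample_docs = [{"jaguar", "car"}, {"jaguar", "cat"}, {"car", "engine"}]

df_car = document_frequency(sample_docs, "car")                   # DF(y)
df_jaguar_car = document_frequency(sample_docs, "jaguar", "car")  # DF(x, y)
p_car = df_car / len(sample_docs)  # ASSUMED estimate of p(y)
```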

Accordingly, TR(K, w) and TR(w_i, w_j), as well as DF(K, w) and DF(w_i, w_j), use the same formulas, with only K and w replaced by w_i and w_j. In the following description they are therefore written in functional form as TR(x, y) and DF(x, y), where x and y are each a query or a keyword. Likewise, DF(w) and DF(w_j) are written in functional form as DF(y), and p(w) and p(w_j) are written as p(y).

DF(x, y)/DF(y), the first part of the topic rank TR(x, y) formula, is the proportion of documents containing both x and y among the documents containing y. Therefore, the larger the value of DF(x, y)/DF(y), the stronger the association between x and y.

However, when DF(y) and DF(x, y) are both small, for example when DF(y) is 1 and DF(x, y) is 1, it can be difficult to judge whether the association between x and y is really strong. That is, even if, among a great many documents, there is a very rare document in which y appears together with x, this hardly shows that x and y are strongly associated.

Conversely, when DF(y) and DF(x, y) are both large, for example close to the total number of documents, it can also be difficult to judge whether the association between x and y is really strong. This merely means that both x and y are used frequently, and it is difficult to infer an association from it. One example is the case where English words such as a, the, is, and of are chosen as x and y.

The second part of the topic rank TR(x, y) formula, -p(y)log(p(y)), can be used to make the association between x and y more accurate. Hereinafter, this second part is referred to as the entropy part. Because of the entropy part, TR(x, y) can become 0 when p(y) is 0 or 1. The entropy part can therefore minimize the inaccuracy in the degree of association that may arise when p(y), the probability that y appears in a document, approaches 0 or 1.

In other words, the topic rank TR(x, y) improves accuracy by using the entropy part (the second part of the formula) to compensate for errors that can occur when the association between x and y is computed mechanically (the first part of the formula). α and β are weights that set the relative influence of the first and second parts of the TR(x, y) formula, and may be positive real numbers. For example, TR(x, y) can be calculated with α = 3 and β = 2.
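The behavior of the two parts can be illustrated with a minimal sketch. How the two parts are combined is an assumption here: the sketch multiplies them, with α and β applied as exponents, because this is one form in which TR(x, y) becomes 0 when p(y) is 0 or 1. Estimating p(y) as DF(y) divided by the total number of documents, and the name `topic_rank`, are likewise assumptions.

```python
import math


def topic_rank(df_xy, df_y, n_docs, alpha=3.0, beta=2.0):
    """Illustrative topic rank TR(x, y).

    ASSUMPTION: the two parts are combined as
        (DF(x, y) / DF(y)) ** alpha * (-p(y) * log(p(y))) ** beta
    with p(y) estimated as DF(y) / n_docs. This form is chosen only
    for illustration.
    """
    if df_y == 0 or n_docs == 0:
        return 0.0
    ratio = df_xy / df_y                      # first part
    p_y = df_y / n_docs
    # Entropy part; defined as 0 at the boundaries p(y) = 0 and p(y) = 1.
    entropy = -p_y * math.log(p_y) if 0.0 < p_y < 1.0 else 0.0
    return (ratio ** alpha) * (entropy ** beta)


rare = topic_rank(df_xy=1, df_y=1, n_docs=1000)            # y in a single document
common = topic_rank(df_xy=1000, df_y=1000, n_docs=1000)    # y in every document
typical = topic_rank(df_xy=50, df_y=100, n_docs=1000)
```

Under these assumptions the rare pair scores far lower than the typical pair even though its ratio DF(x, y)/DF(y) is 1, and the pair that appears in every document scores 0.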

The topic ranks TR(K, w) and TR(w_i, w_j) may be calculated when the searcher inputs the query (K). Alternatively, the topic ranks may be calculated in advance from the keywords (w) stored in the index storage unit (2200) and stored together in the index storage unit (2200). As a further alternative, TR(K, w) may be calculated when the query (K) is input, while TR(w_i, w_j) is calculated and stored in advance.

When the topic rank TR(K, w) or TR(w_i, w_j) is calculated in advance, the calculation may be set to run each time keywords are extracted from a collected Internet document (10), or to run at specific points in time. When the topic ranks between keywords are calculated at specific points in time, the interval, such as once a day, once a week, or once a month, can be chosen in view of the volume of collected Internet documents (10), and the calculation can be scheduled for a time when the load on the information retrieval system (1) is relatively low, that is, when few searchers are using it, such as late at night. Thereafter, when the searcher inputs the query (K) through the searcher device (200) (S122), the language analysis unit (1700) analyzes the query (S124), and the value calculated and stored in advance for the keyword corresponding to the query can be read from the rank storage unit (2300) and used as the topic rank TR(K, w).

Accordingly, the values of DF(K, w), DF(w), p(w), and so on for a particular keyword (w) may differ depending on when the search is performed. This makes it possible to obtain search results that reflect the passage of time.

When the topic rank TR(K, w) or TR(w_i, w_j) is calculated in advance, the time delay that would result from performing the calculation each time a query is input can be avoided.

After the topic ranks TR(K, w) and TR(w_i, w_j) are calculated, related topics are selected (S132). To select the related topics, a link is formed wherever the topic rank TR(K, w) or TR(w_i, w_j) is equal to or greater than a predetermined value. For example, a keyword (w) whose topic rank TR(K, w) with respect to the query (K) is equal to or greater than the predetermined value is defined as having a link with the query (K). Likewise, when the topic rank TR(w_i, w_j) is equal to or greater than the predetermined value, the keyword (w_i) and the keyword (w_j) are defined as having a link.
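The thresholding described above can be sketched as follows. The function name `build_links`, the dictionary layout, and the choice to store each link in both directions are assumptions made for the example; the threshold value is arbitrary.

```python
def build_links(topic_ranks, threshold):
    """Form links between pairs whose topic rank reaches the threshold.

    topic_ranks maps a pair (x, y) to TR(x, y).
    ASSUMPTION: a link is stored in both directions so that it can be
    followed from either end.
    """
    links = {}
    for (x, y), tr in topic_ranks.items():
        if tr >= threshold:          # "equal to or greater than" the value
            links.setdefault(x, {})[y] = tr
            links.setdefault(y, {})[x] = tr
    return links


ranks = {("K", "w1"): 0.8, ("K", "w5"): 0.1, ("w1", "w11"): 0.6}
links = build_links(ranks, threshold=0.5)
```

With this input the query K is linked with w1, and w1 with w11, while w5 forms no link because its topic rank is below the threshold.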

Once such links are defined, even a keyword that has no direct link with the query (K) can be linked to the query (K) through other keywords. For example, if the query (K) and the keyword (w₁) are linked, and the query (K) and the keyword (w₁₁) are not directly linked, but the keyword (w₁) and the keyword (w₁₁) are linked, then the query (K) and the keyword (w₁₁) are linked through the keyword (w₁). In this case, there are two links between the query (K) and the keyword (w₁₁), and this number of links is defined as the depth. Keywords that are within a predetermined number M of links from the query (K), that is, keywords whose depth from the query (K) is M or less, can be selected as related topics. This is described in detail later.
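Selecting keywords whose depth is M or less amounts to a bounded graph traversal. The sketch below uses a breadth-first search, which is one possible way to do this and is not stated in the text; the name `select_related_topics` and the adjacency layout are assumptions.

```python
from collections import deque


def select_related_topics(links, query, max_depth):
    """Return {keyword: depth} for keywords within max_depth links of query.

    links maps each node to the nodes it is linked with.
    ASSUMPTION: breadth-first search is used, so each keyword receives
    the smallest number of links separating it from the query.
    """
    depth = {query: 0}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue                      # do not expand past depth M
        for neighbour in links.get(node, {}):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    del depth[query]                      # the query itself is not a topic
    return depth


links = {
    "K": ["w1", "w3"],
    "w1": ["K", "w11"],
    "w3": ["K", "w34"],
    "w11": ["w1"],
    "w34": ["w3", "w341"],
    "w341": ["w34"],
}
related = select_related_topics(links, "K", max_depth=2)
```

With M = 2, w1 and w3 are selected at depth 1 and w11 and w34 at depth 2, while w341 at depth 3 is excluded.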

A related query, which is a weight vector, is generated by assigning a degree of association to each selected related topic as a weight (S134). The related query, that is, the weight vector WM, can be formed from the topic rank TR(K, w) or TR(w_i, w_j) according to the following equation.

Here, w_k is the k-th related topic, TR_k is the topic rank of the link formed from the k-th related topic in the direction of the query, N is the number of selected related topics, and k is an integer from 1 to N.

Alternatively, the related query, that is, the weight vector WM, can be formed according to the following equation by applying to the topic rank TR(K, w) or TR(w_i, w_j) the depth between the query (K) and the selected related topic.

A specific example of forming the related query, that is, the weight vector WM, is described in detail later.
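For orientation only, the two variants of the weight vector can be sketched as below. The equations themselves are not reproduced in this text, so both forms are assumptions: the plain variant pairs each related topic w_k with TR_k, and the depth-weighted variant divides TR_k by depth_k. The name `related_query` is hypothetical.

```python
def related_query(related, use_depth=False):
    """Build a weight vector WM as {w_k: weight}.

    related maps each related topic w_k to (TR_k, depth_k).
    ASSUMPTION: without depth the weight is TR_k; with depth it is
    TR_k / depth_k. The actual equations may differ.
    """
    weights = {}
    for topic, (tr, depth) in related.items():
        weights[topic] = tr / depth if use_depth else tr
    return weights


related = {"w3": (0.75, 1), "w34": (0.5, 2)}
wm_plain = related_query(related)
wm_depth = related_query(related, use_depth=True)
```

Under the assumed depth weighting, a topic two links away from the query contributes half of its topic rank, so more distant topics influence the search less.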

The search unit (1600) compares the weight vector WM formed for the query (K) with the characteristic keyword vectors and selects Internet documents (10) related to the query (K) (S140). The weight vector WM and the characteristic keyword vectors are all word vectors, so Internet documents (10) with a high degree of association can be selected by considering, for example, the distance or the angle between two word vectors. An Internet document (10) selected in this way for its high degree of association is referred to as a related document.

The providing unit (1800) provides the previously selected related documents to the searcher (S150). The related documents may be provided to the searcher as a list ordered by how close each document's characteristic keyword vector is to the weight vector WM. By forming a related query that matches the searcher's query in this way, the searcher can be helped to find the desired information accurately even without inputting an exact query.
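Comparing word vectors by angle and ordering documents by closeness can be sketched with cosine similarity. Cosine similarity is one common choice for the angle between two vectors and is used here as an assumption; the names `cosine` and `rank_documents` and the sparse dictionary representation are also assumptions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse word vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(wm, doc_vectors):
    """Return document ids ordered from closest to farthest from WM."""
    scored = [(cosine(wm, vec), doc_id) for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


wm = {"a": 1.0, "b": 1.0}
doc_vectors = {
    "d1": {"a": 1.0, "b": 1.0},
    "d2": {"a": 1.0},
    "d3": {"c": 1.0},
}
ordered = rank_documents(wm, doc_vectors)
```

A document whose characteristic keyword vector points in the same direction as WM is listed first, and a document sharing no keyword with WM is listed last.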

Which of the related documents provided through the searcher device (200) the searcher selected may be stored in the log storage unit (2300) through the user feedback processing unit (1900). This user feedback can be applied when the related query is generated (S134) so that the searcher's selections are reflected (S160). Specifically, when a related document's selection rate is low compared with other related documents and does not exceed a predetermined threshold, the document is judged to have a low degree of association, and this can additionally be reflected in the weights when the related query is generated.
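One way such feedback could enter the weights is sketched below. The text does not specify how the weights are adjusted, so multiplying the weight of keywords from rarely selected documents by a fixed `penalty` factor is an assumption, as are the function name `apply_feedback` and the shape of the logged counts.

```python
def apply_feedback(weights, shown, clicked, doc_keywords, threshold, penalty=0.5):
    """Lower the weight of keywords found in rarely selected documents.

    shown / clicked map a document id to how often it was provided and
    how often it was selected. doc_keywords maps a document id to its
    keywords.
    ASSUMPTION: a selection rate that does not exceed the threshold
    scales the weights of that document's keywords by `penalty`.
    """
    adjusted = dict(weights)
    penalised = set()
    for doc_id, n_shown in shown.items():
        if n_shown == 0:
            continue
        rate = clicked.get(doc_id, 0) / n_shown
        if rate <= threshold:
            penalised.update(doc_keywords.get(doc_id, ()))
    for term in penalised:
        if term in adjusted:
            adjusted[term] *= penalty
    return adjusted


weights = {"w1": 0.8, "w2": 0.4}
shown = {"d1": 10, "d2": 10}
clicked = {"d1": 5}
doc_keywords = {"d1": ["w1"], "d2": ["w2"]}
adjusted = apply_feedback(weights, shown, clicked, doc_keywords, threshold=0.1)
```

In this example d2 was provided ten times and never selected, so the weight of its keyword w2 is halved, while w1 is unchanged.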

Alternatively, the providing unit (1800) may present the selected related topics to the searcher so that their links with the query (K) are shown. In this case, when the searcher selects a particular related topic, the selected related topic can be set as the query (K), and new related topics can be selected and provided to the searcher. Alternatively, related documents can be selected and provided to the searcher as if the selected related topic had been input as the query (K).

FIGS. 3 and 4 are schematic diagrams showing the configuration of the collection unit and the analysis unit according to an embodiment of the present invention. Taking a blog post as an example, FIGS. 3 and 4 are used to describe how a collected Internet document is stored in a form that can be provided to the searcher.

FIG. 3 is a schematic diagram showing the configuration of the collection unit according to an embodiment of the present invention.

Referring to FIG. 3, because the Internet document (10) and its surrounding information may be written in various languages, the language determination module (1110) may first determine the language of composition, such as Korean, Japanese, Chinese, or English. A collection module consisting of the Internet document collection module (1122) and the surrounding information collection module (1124) may then collect the Internet document (10) together with its surrounding information. When the Internet document (10) is, for example, a single blog post, the Internet document collection module (1122) may collect it by identifying the address at which the blog provides a feed such as RSS or ATOM. However, a feed address is not available for every Internet document (10). For example, most blogs provide only part of their recent posts through the feed, and in such cases the Internet document (10) may be collected by extracting the post body. Likewise, the surrounding information collection module (1124) may extract and collect the surrounding information, including comments and trackbacks.
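Collecting posts from a feed can be illustrated with a minimal sketch that reads the items of an RSS 2.0 document using the Python standard library. The function name `parse_rss_items` and the output field names are assumptions; the element names `item`, `title`, `link`, and `description` are those of RSS 2.0, and an ATOM feed would need different element names.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(rss_text):
    """Extract title, link, and body text from each <item> of an RSS feed.

    ASSUMPTION: the feed is RSS 2.0 and the post text is carried in
    <description>; missing elements are returned as empty strings.
    """
    root = ET.fromstring(rss_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "body": item.findtext("description", default=""),
        })
    return posts


sample = """<rss version="2.0"><channel><title>Blog</title>
<item><title>First</title><link>http://example.com/1</link>
<description>Hello</description></item>
<item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>"""
posts = parse_rss_items(sample)
```

The second item has no description, which corresponds to the case where the feed carries only part of a post and the body would have to be extracted from the post page itself.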

Because the Internet document (10) and its surrounding information are collected separately, unlike in their original form, they may go through a structuring process in the content restoration module (1130). For example, when blog posts are collected, they may be analyzed and restored through steps such as extracting the full body of the post, linking comment and trackback information, extracting the content of the existing HTML post, and structuring it into RSS/ATOM format. In addition, depending on the language determined by the language determination module (1110), the automatic translation module (1140) may machine-translate an Internet document (10) and its surrounding information written in a language other than the language of provision into that language. From the Internet document (10) and surrounding information restored by the content restoration module (1130) and the translation result of the automatic translation module (1140), the unit structure generation module (1150) may generate an Internet document structure for each Internet document (10), for example for each post in the case of a blog. The Internet document structure may be generated in a machine-processable format, such as XML or RSS.

The Internet document structure generated in this way is stored in the main storage unit (2100), and the analysis unit (1200) may perform analysis either by receiving the Internet document structure directly from the collection unit (1100) or by loading the Internet document structure stored in the main storage unit (2100).

FIG. 4 is a schematic diagram showing the configuration of the analysis unit according to an embodiment of the present invention.

Referring to FIG. 4, the analysis unit (1200) may generate analysis information by analyzing, with text mining techniques, the Internet document structures collected and generated by the collection unit (1100). The analysis unit (1200) receives an Internet document structure generated by the collection unit (1100) or stored in the main storage unit (2100), and the named entity analysis module (1210) performs named entity analysis to extract the main named entities. Named entity analysis analyzes the text of the Internet document structure to extract meaningful words such as person names, company names, product names, service names, and dates, using a named entity dictionary and extraction rules. The feature extraction module (1220) may then statistically analyze the extracted named entities and the information contained in the Internet document structure to extract keywords that represent the Internet document structure.

The automatic classification module (1230) may automatically classify Internet document structures. This classification may be based on a predefined category list (1232) and machine learning data (1234) prepared for the category list (1232). Automatic classification performs well when the differences between categories are clear, and tends to perform worse when classifying within a group of similar items. In particular, accuracy may drop in multi-level classification. For example, a system can classify documents into broad categories such as sports, society, and economy reasonably well, but classifying ball sports into baseball, volleyball, basketball, and so on may be relatively less accurate. The automatic classification module (1230) may be implemented with algorithms such as a Bayesian classifier or an SVM (Support Vector Machine). The automatic classification module (1230) may classify into top-level categories only, because automatic classification into multi-level categories reduces accuracy and increases the load that machine learning places on the system.

The automatic clustering module (1240) clusters the automatically classified Internet document structures within each category. In automatic clustering, the system statistically groups Internet document structures into arbitrary units. The automatic clustering module (1240) may be implemented using, for example, the K-means algorithm. For the clustered Internet document structures, the information quantity measurement module (1250) may measure an information quantity index. The characteristic keyword vector, generated by combining the keywords with this information quantity index, is a word vector representing each Internet document structure and can be used for retrieval. The analysis information, including the extracted keywords and the generated characteristic keyword vector, may be stored back in the main storage unit (2100).

The Internet document structures stored in the main storage unit (2100) in this way can be provided, by the information retrieval method described above, in a form that is convenient for the searcher to use.

FIG. 5 is a conceptual diagram for explaining the process of selecting related topics according to an embodiment of the present invention.

Referring to FIG. 5, the relationships between the query (K) and a number of keywords (w₁, w₂, w₃, w₄, w₅, w₁₁, w₁₂, and so on, referred to collectively as w below) are shown. The relationships between the query (K) and the keywords (w), and among the keywords (w), are drawn as dotted or solid lines. A solid line indicates that TR(K, w) or TR(w_i, w_j) is equal to or greater than the predetermined value, so that a link is formed; a dotted line indicates that TR(K, w) or TR(w_i, w_j) is less than the predetermined value, so that no link is formed. That is, the query (K) is linked with the keywords (w₁, w₂, w₃, w₄) but not with the keyword (w₅). Similarly, the keyword (w₁) is linked with the keywords (w₁₁, w₁₃) but not with the keyword (w₁₂). The query (K) and the keyword (w₁₁) are therefore linked through the keyword (w₁). In this case, there are two solid lines, that is, two links, between the query (K) and the keyword (w₁₁), so the depth between them is 2.

By contrast, the query (K) and the keywords (w₃₄₁, w₃₄₂, w₃₄₃) are linked through the keyword (w₃) and the keyword (w₃₄). The depth between the query (K) and the keywords (w₃₄₁, w₃₄₂, w₃₄₃) is therefore 3. If the depth, that is, the number of links (M) used to select related topics, is set to 2, keywords connected to the query (K) only by thick solid lines (w₁₁, w₁₃, w₃₄, and so on) are selected as related topics, whereas the keywords (w₃₄₁, w₃₄₂, w₃₄₃), whose path to the query (K) includes a thin solid line, are not. Related topics for the query (K) can be selected through this process.

Although not shown, when the links from a given keyword (w) to the query (K) form two or more paths, the path with the smaller depth between the query (K) and that keyword (w) can be chosen; when there are paths of equal depth, one path can be chosen by considering the topic rank values of the links along them.

The topic rank TR_k of the link formed from the k-th related topic in the direction of the query, mentioned above in obtaining the weight vector WM (the related query), means, for example when the k-th related topic is the keyword (w₃₄), the topic rank between the keyword (w₃) and the keyword (w₃₄). This is because the keyword (w₃) lies on the link leading from the keyword (w₃₄) toward the query (K).
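The path choice and the resulting TR_k can be sketched together. The text says only that topic rank values are considered when depths are equal; choosing the candidate whose last link has the highest topic rank is an assumption, as are the name `choose_parents` and the data layout.

```python
def choose_parents(links, query, max_depth):
    """Pick, for each keyword, the link that leads toward the query.

    links maps each node to {neighbour: topic rank of the link}.
    Returns (depth, parent) where parent[w] = (previous node, TR_k).
    ASSUMPTION: the shallowest path wins; among equally shallow paths,
    the link with the highest topic rank is chosen.
    """
    depth = {query: 0}
    parent = {}
    frontier = [query]
    for d in range(1, max_depth + 1):
        best = {}
        for node in frontier:
            for neighbour, tr in links.get(node, {}).items():
                if neighbour in depth:
                    continue              # already reached by a shorter path
                if neighbour not in best or tr > best[neighbour][1]:
                    best[neighbour] = (node, tr)
        for neighbour, choice in best.items():
            depth[neighbour] = d
            parent[neighbour] = choice
        frontier = list(best)
    return depth, parent


links = {
    "K": {"w1": 0.9, "w2": 0.6, "y": 0.55},
    "w1": {"K": 0.9, "x": 0.5, "y": 0.95},
    "w2": {"K": 0.6, "x": 0.7},
    "x": {"w1": 0.5, "w2": 0.7},
    "y": {"K": 0.55, "w1": 0.95},
}
depth, parent = choose_parents(links, "K", max_depth=2)
```

Here x can be reached through w1 or w2 at the same depth, and the link from w2 is chosen because its topic rank is higher; y keeps its direct link with K even though its link with w1 has a higher topic rank, because the direct path is shallower.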

When the selected related topics are provided to the searcher as described above, the keywords connected only by thick solid lines (w₁₁, w₁₃, w₃₄, and so on) can be displayed around the query (K), similarly to FIG. 5. For example, when the user selects the keyword (w₃) from the provided related topics, related topics or related documents can be selected again and provided to the searcher as if the keyword (w₃) had been input as the query (K). In this case, the depth between the keywords (w₃₄₁, w₃₄₂, w₃₄₃) and the keyword (w₃) is 2, so they can be selected as related topics and provided to the searcher.

In this way, even when a query with a broad meaning is input, related topics or related documents can be provided efficiently so that the searcher can find the desired information.

In addition, embodiments of the present invention can be written as a program executable on a computer system. The program, read from a computer-readable recording medium on which it is recorded, can be executed on a digital computer system.

Examples of computer-readable recording media include ROM, RAM, CD-ROM, DVD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include media implemented in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over computer systems connected by a network, so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present invention can be readily inferred by programmers in the technical field to which the present invention belongs.

<Description of main parts of the drawings>

1: information retrieval system, 10: Internet document, 200: searcher device, 1000: control unit, 1100: collection unit, 1200: analysis unit, 1300: indexing unit, 1400: topic rank processing unit, 1500: related query generation unit, 1600: search unit, 1700: language analysis unit, 1800: providing unit, 1900: user feedback processing unit, 2000: storage unit, 2100: main storage unit, 2200: index storage unit, 2300: log storage unit

Claims

A data collection step of collecting documents and extracting keywords of the collected documents;

A query input step of inputting a query word by a searcher;

A related topic generation step of calculating a topic rank between the query and the keywords or between the keywords, and selecting, from among the keywords, related topics associated with the query;

A related query generation step of forming a weight vector by assigning to each selected related topic, as a weight, its degree of association with the query; and

A document retrieval step of providing documents related to the query using the weight vector.

The method of claim 1,

The topic rank TR(K, w) between the query and a keyword and the topic rank TR(w_i, w_j) between keywords are calculated by the following equation.

Here,

K is the query; w, w_i, and w_j are keywords; DF(K, w) is the document frequency of documents containing both K and w; DF(w_i, w_j) is the document frequency of documents containing both w_i and w_j; DF(w) or DF(w_j) is the document frequency of documents containing w or w_j; p(w) or p(w_j) is the probability that w or w_j appears in a document; α and β are weights that are positive real numbers; and i and j are distinct positive integers no greater than the number of extracted keywords.

The method of claim 2,

In the related topic generation step, a link is formed when the topic rank TR(K, w) between the query and a keyword or the topic rank TR(w_i, w_j) between keywords is equal to or greater than a predetermined value, and keywords that are within M links of the query are selected as the related topics (M being a positive integer greater than 1).

The method of claim 3,

The weight vector WM is formed by the following equation from the related topics and the topic rank of the link formed from each related topic in the direction of the query.

Here, w_k is the k-th related topic, TR_k is the topic rank of the link formed from the k-th related topic in the direction of the query, N is the number of selected related topics, and k is an integer from 1 to N.

The method of claim 3,

The weight vector WM is formed by the following equation from the related topics and the topic rank of the link formed from each related topic in the direction of the query, with the depth, which is the number of links between the query and the related topic, taken into account.

Here, w_k is the k-th related topic, TR_k is the topic rank of the link formed from the k-th related topic in the direction of the query, depth_k is the depth, that is, the number of links between the query and the k-th related topic, dTR_k is the depth-weighted topic rank of the k-th related topic, N is the number of selected related topics, and k is an integer from 1 to N.

The method of claim 1,

In the data collection step, the keywords of the collected documents are used to form characteristic keyword vectors, each of which is a word vector representing the corresponding document, and

In the document retrieval step, documents related to the query are selected and provided by comparing the characteristic keyword vectors with the weight vector.

The method according to claim 6,

The method further comprises analyzing the keywords of the documents selected by the searcher from among the documents provided in the document retrieval step, and removing from the related topics the keywords contained in documents whose selection rate is equal to or less than a predetermined threshold.

A computer-readable recording medium containing a program capable of performing the information retrieval method according to any one of claims 1 to 7.