KR20180101955A

KR20180101955A - Document scoring method and document searching system

Info

Publication number: KR20180101955A
Application number: KR1020170028554A
Authority: KR
Inventors: 김도전; 박성웅
Original assignee: 주식회사 수브이
Priority date: 2017-03-06
Filing date: 2017-03-06
Publication date: 2018-09-14
Also published as: KR102028155B1

Abstract

The present invention relates to a document scoring method and a system therefor. The method comprises: extracting query terms by analyzing a query statement input by a user, and extracting similar terms and similarities for the extracted query terms; looking up pre-stored user data and calculating weights for the query terms, the similar terms, and document classifications; reconstructing the query statement using the extracted query terms, similar terms, and similarities; performing document search and scoring using the reconstructed query statement; and outputting a document search result for the query statement. Because classifications are used internally even when the user searches with natural language only, the efficiency of the search system can be increased.

Description

DOCUMENT SCORING METHOD AND DOCUMENT SEARCHING SYSTEM

The present invention relates to a document scoring method and a document search system.

Various methods have been studied for document search systems, such as word embedding, which represents the words of natural language as vectors. Recently, in step with deep learning, the CBOW and skip-gram approaches have been actively developed.

These approaches were devised as preprocessing for deep learning, but they also produce meaningful results on their own, for example in extracting similar terms. However, applying word embedding directly to a system that requires real-time responses, such as a search system, involves high computational complexity. Appropriate preprocessing is therefore required before it can be introduced into a real-time system.

In addition, for documents to which a classification has been assigned, a search system can use natural language and the classification together. However, a user who does not understand and is not familiar with the classification scheme has difficulty making proper use of the classification in a search system and comes to prefer natural language.

Therefore, there is a need to increase the efficiency of the search system by having the classification used internally even when the user uses only natural language.

When a query statement is input by a user, such a search system provides the user with search results for that query statement, but this alone is not sufficient to reflect the user's intention in the search results. An appropriate way of reflecting the user's intention in the search results is therefore required.

Korean Laid-open Patent Publication No. 10-2014-0109729 (published September 16, 2014)

An object of the present invention is to provide a document scoring method and a document search system that increase the efficiency of the search system by using classifications internally when a user makes a search request using only natural language.

Another object of the present invention is to provide a document scoring method and a document search system that reflect the user's intention in document search.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above objects, a document scoring method according to an embodiment of the present invention comprises: extracting query terms by analyzing a query statement input by a user, and extracting similar terms and similarities for the extracted query terms; looking up pre-stored user data and calculating weights for the query terms, the similar terms, and document classifications; reconstructing the query statement using the extracted query terms, the similar terms, and the similarities; performing document search and scoring using the reconstructed query statement; and outputting a document search result for the query statement.

The user data stores adoption or non-adoption information input by the user for previous document search results for the query term.

The step of calculating the weights identifies, from the user data, the adopted documents and non-adopted documents selected by the user, and calculates the weights according to the frequency of adopted and non-adopted documents for each query term, each similar term, and each document classification.

The step of calculating the weights calculates the weights using a weight function that is unique to each query term, each similar term, and each document classification.

When no pre-stored user data exists, the step of calculating the weights uses the similarity of each similar term as its weight.

When no pre-stored user data exists, the step of calculating the weights may instead arbitrarily treat some of the documents found for the query term that have high document scores as adopted, and calculate the weights according to the frequency of adopted documents for each query term, each similar term, and each document classification.

The step of reconstructing the query statement reconstructs it by combining the query term and its similar terms in a logical OR relationship.

The step of performing document search and scoring comprises: searching an index repository, which stores the index structure of the indexed documents and the index data stored according to that index structure, for index terms corresponding to the query terms included in the reconstructed query statement; extracting the index term score assigned in advance to each index term found; and assigning a document score to the corresponding document using the weights calculated for the query terms and similar terms together with the extracted index term scores.

The step of calculating the document score assigns, as the document score of the corresponding document, the value obtained by multiplying the weight calculated for the query term or similar term by the index term score.
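The weight-times-score rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `score_document`, the dictionaries, and the choice to sum the products over all matched terms are hypothetical, since the text specifies only the multiplication. The index term scores are illustrative values.

```python
def score_document(matched_terms, term_weights, index_term_scores):
    """Combine term weights and index term scores into one document score.

    Each matched term contributes weight * index term score, as the text
    describes. Summing the contributions is an assumption of this sketch.
    """
    return sum(term_weights[t] * index_term_scores[t] for t in matched_terms)


# Hypothetical inputs: "car" is the query term, "motor" a similar term.
term_weights = {"car": 1.0, "motor": 0.381901}
index_term_scores = {"car": 0.8, "motor": 0.4}

doc_score = score_document(["car", "motor"], term_weights, index_term_scores)
print(doc_score)
```

A document that matches only the similar term would receive a lower score than one that matches the query term itself, which is the intended effect of weighting similar terms below the original term.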

The document scoring method according to an embodiment of the present invention further comprises: receiving adoption or non-adoption information from the user for the document search result; storing the input adoption or non-adoption information in the user data together with the identification information of the corresponding document; and adjusting the weights for the query terms, the similar terms, and the document classifications using the stored user data.

The step of adjusting the weights includes adjusting the parameter values of the weight functions for the query terms, the similar terms, and the document classifications.

The step of extracting the similar terms and similarities extracts, from the terms contained in the documents, similar terms whose similarity to the query term is equal to or greater than a predetermined threshold.

The step of outputting the document search result sorts and outputs the documents in the search result according to the document search condition for the query statement.

The step of outputting the document search result sorts the documents according to the magnitude of the document score assigned to each document in the search result.

To achieve the above objects, a document search system according to an embodiment of the present invention comprises: a computation module that analyzes a query statement input by a user to extract query terms together with similar terms and similarities for the query terms, looks up pre-stored user data to calculate weights for the extracted query terms, the similar terms, and document classifications, and reconstructs the query statement using the extracted query terms, the similar terms, and the similarities; and a search module that performs document search and scoring using the reconstructed query statement.

The search module scores the retrieved documents based on the weights.

According to the present invention, when a user makes a search request using only natural language, classifications are used internally, which increases the efficiency of the search system. In addition, the user's intention can be reflected in the document search results.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram illustrating the configuration of a document search system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the detailed configuration of the computation module of FIG. 1.
FIG. 3 is a diagram illustrating a graph of a weight function according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the detailed configuration of the search module of FIG. 1.
FIGS. 5 and 6 are diagrams illustrating the operation flow of a document scoring method according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating a computing system in which a method according to an embodiment of the present invention is executed.

Hereinafter, some embodiments of the present invention are described in detail with reference to the exemplary drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the embodiments, detailed descriptions of related known configurations or functions are omitted when they are judged to hinder understanding of the embodiments.

In describing the components of the embodiments, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

The document search system according to an embodiment of the present invention performs document search and scoring using a query statement input by a user and provides the document search result to the user, while reflecting the user's intention regarding the search result (for example, adoption or non-adoption) in the document search and scoring.

The document search system according to an embodiment of the present invention may be any system that retrieves one or more documents according to a query statement and provides them to a user. For example, it may be a patent search system, although it is not limited thereto.

FIG. 1 is a diagram illustrating the configuration of a document search system according to an embodiment of the present invention.

Referring to FIG. 1, the document search system 100 may include a computation module 110, a search module 130, and a search DB 150.

When a query statement is input by a user, the computation module 110 analyzes it to extract query terms and similar terms, calculates weights for the extracted query terms, the similar terms, and the document classifications, and reconstructs the query statement. The computation module 110 also provides the user with the document search result retrieved by the search module 130. An embodiment of the detailed configuration of the computation module 110 is described in more detail with reference to FIG. 2.

The search module 130 performs document search and scoring using the query statement input by the user or the query statement reconstructed by the computation module 110. An embodiment of the detailed configuration of the search module 130 is described in more detail with reference to FIG. 4.

The search DB 150 may store the data, algorithms, and the like required for the document search system 100 according to an embodiment of the present invention to operate.

Accordingly, the search DB 150 may include a document repository 151, an index repository 153, a term vector set repository 155, a similar term repository 157, and a user data repository 159.

The document repository 151 stores one or more items of document data. Each item of document data may be assigned a classification value, which is stored together with the document data.

An example of the classification values corresponding to each item of document data can be expressed as follows.

The index repository 153 may store an index structure and the index data stored according to that index structure.

The index repository 153 may also store, together with each index, the corresponding index term score. The index term score may be assigned to the index structure in the form of a field weight, or it may be assigned to the index data itself. A separate weight may also be assigned for each position (field) from which the index term was extracted.

For example, in a patent search system, an index term score of 1.0 may be assigned to index A (title), 0.8 to index B (abstract) and index C (claim), and 0.4 to index D (specification).

Here, the index term score may be given as a tf-idf value.
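As a rough illustration of the per-field index term scores in the patent search example, the sketch below looks up the score of a term from the fields in which it appears. The field names and score values follow the example in the text; the function name `index_term_score`, the whitespace tokenization, and the rule of taking the highest-scoring field are assumptions made only for this sketch.

```python
# Field scores taken from the patent search example in the text.
FIELD_SCORES = {"title": 1.0, "abstract": 0.8, "claim": 0.8, "specification": 0.4}


def index_term_score(term, doc_fields):
    """Return the score of the highest-scoring field that contains the term.

    Taking the maximum over fields is an assumption of this sketch; the text
    only says that each field may carry its own weight.
    """
    hits = [
        FIELD_SCORES[field]
        for field, text in doc_fields.items()
        if term in text.lower().split()
    ]
    return max(hits, default=0.0)


# Hypothetical document used only for illustration.
doc = {
    "title": "Hybrid car",
    "abstract": "A vehicle driven by a motor",
    "claim": "A vehicle comprising a battery",
    "specification": "The motor is coupled to the battery",
}
print(index_term_score("car", doc), index_term_score("motor", doc))
```

In a real index the scores would be computed once at indexing time and stored with the index data, rather than recomputed per query as in this sketch.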

The term vector set repository 155 may store terms (words and phrases) and their vectors.

Vectorization of terms (word embedding) may be performed to reduce dimensionality and, with it, computational complexity, and to extract similar terms through operations between terms (for example, cosine similarity). Term vectorization rests on the distributional hypothesis, which holds that words used in similar contexts share similar meanings.
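Cosine similarity between two term vectors can be computed as below. This is a minimal sketch using plain Python lists; the function name and the toy vectors are hypothetical, and a real system would use vectors produced by a trained embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, for illustration only.
car = [0.9, 0.1, 0.3]
vehicle = [0.8, 0.2, 0.4]
print(cosine_similarity(car, vehicle))
```

Vectors pointing in nearly the same direction give a value close to 1, which is why the measure is suited to ranking candidate similar terms.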

Term vectorization may be performed using frequency-based methods such as LSA, PCA, and LDA, or prediction-based methods such as CBOW, which predicts a word given its context, and skip-gram, which predicts the context given a word.

When a similar term and similarity extraction event is raised by the computation module 110, the term vector set repository 155 may extract the similar terms and similarities corresponding to the query term and provide them to the computation module 110.

To extract similar terms, the text of each document is extracted from the document repository 151 and preprocessed to remove tables, mathematical formulas, chemical formulas, and the like. Phrases and/or words are then recognized through morphological analysis, morphological classification, sentence segmentation, dependency analysis, phrase boundary recognition, and so on, and contexts and predictions are constructed from them in the skip-gram manner for training.

The similar term repository 157 may store, for the terms contained in the documents, the similar terms extracted for each query term and the similarity between the query term and each similar term. The similar term repository 157 may store only those similar terms whose similarity is equal to or greater than a predetermined threshold. The predetermined threshold is applied equally to all terms and is preferably set to a low value so that a sufficient number of similar terms is secured, while a value higher than this threshold is preferably used when similar terms are actually extracted. However, the invention is not limited thereto.
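The two-threshold idea (a low threshold when storing, a higher one when extracting) might look like the following. Everything here is an illustrative assumption: the function name, both threshold values, and all similarity values except the 0.381901 for 'motor', which appears elsewhere in this document.

```python
def filter_similar_terms(candidates, threshold):
    """Keep (term, similarity) pairs at or above the threshold, best first."""
    kept = [(term, sim) for term, sim in candidates if sim >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates for the query term 'car'.
candidates = [
    ("vehicle", 0.62),
    ("motor", 0.381901),
    ("automobile", 0.55),
    ("engine", 0.21),
    ("road", 0.05),
]

STORE_THRESHOLD = 0.2    # low, so that enough similar terms are kept in storage
EXTRACT_THRESHOLD = 0.3  # higher, applied when terms are actually extracted

stored = filter_similar_terms(candidates, STORE_THRESHOLD)
extracted = filter_similar_terms(stored, EXTRACT_THRESHOLD)
print([term for term, _ in extracted])
```

Keeping the stored list sorted in descending order of similarity means extraction with a higher threshold only needs to read a prefix of that list.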

The following example shows the similar terms and similarities for a query term built in JSON form (including an array format sorted in descending order of similarity).

The following example shows the similar terms and similarities for a query term built in RDB form, as a term table and a similar_term table.

<term table>

<similar_term table>

The user data repository 159 may store the adoption or non-adoption information input by the user for the documents retrieved using the query terms contained in the user's query statement.

The adoption or non-adoption information input by the user may be stored together with the identification number and classification information of the corresponding document.

The computation module 110 can therefore look up the information stored in the user data repository 159 and reflect it when calculating the weights for each query term and/or each classification.

FIG. 2 is a diagram illustrating the configuration of a document scoring apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating the detailed configuration of the computation module of FIG. 1.

Referring to FIG. 2, the computation module 110 according to an embodiment of the present invention may include a query term analysis unit 111, a weight calculation unit 113, a query generation unit 115, and an adoption determination unit 117.

First, when a query statement is input by a user, the query term analysis unit 111 analyzes it and extracts the query terms. The query term analysis unit 111 also extracts the similar terms and similarities for each query term from the term vector set repository 155.

For example, when the query statement is 'hybrid car', the query term analysis unit 111 may extract similar terms such as 'vehicle', 'motor', and 'automobile' for the query term 'car' contained in the query statement.

The query term analysis unit 111 stores the similar terms and similarities extracted for the query term in the similar term repository 157.

The weight calculation unit 113 calculates weights for the query terms and/or similar terms extracted by the query term analysis unit 111. The weight calculation unit 113 also calculates a weight for each document classification.

The weight calculation unit 113 may calculate a weight for each query term (including similar terms) and for each document classification using a weight function f(x), given by [Equation 1] below.

In [Equation 1], f(x) is the weight, x is the frequency of adopted and non-adopted documents, L is the maximum weight, and k is the slope. Here, the frequency x may be the number of adopted documents minus the number of non-adopted documents.
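The formula of [Equation 1] is not reproduced in this text. Given the symbols defined above and the reference elsewhere in this description to a logistic slope, one form consistent with those definitions is the standard logistic function centered at zero. This is an assumption, not a confirmed reconstruction of the original equation:

```latex
f(x) = \frac{L}{1 + e^{-kx}}
```

Under this assumed form, f(x) approaches the maximum weight L as adoptions dominate, approaches 0 as non-adoptions dominate, and equals L/2 when the two counts are equal (x = 0). A larger k makes the transition between these extremes steeper.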

For example, if the user has adopted 4 documents and rejected 0 for the query term 'car' in the input query statement, the weight of 'car' may be f(4-0). If the user has adopted 1 document and rejected 2 for 'vehicle', a similar term of 'car', the weight of 'vehicle' may be f(1-2). Likewise, if the user has adopted 2 documents and rejected 2 for the document classification 'Class A', the weight of 'Class A' may be f(2-2), and if the user has adopted 0 documents and rejected 5 for 'Class B', the weight of 'Class B' may be f(-5).
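The four cases in this example can be evaluated numerically once a concrete weight function is chosen. The sketch below assumes a logistic function with maximum weight L and slope k, which matches the symbol definitions given for the weight function but is not confirmed by this text; the values L = 1.0 and k = 1.0 are hypothetical.

```python
import math


def weight(x, L=1.0, k=1.0):
    """Assumed logistic weight function: L / (1 + exp(-k * x)).

    x is the number of adopted documents minus the number of non-adopted
    documents. L and k are hypothetical parameter values.
    """
    return L / (1.0 + math.exp(-k * x))


# (adopted, non-adopted) counts taken from the example in the text.
feedback = {
    "car": (4, 0),
    "vehicle": (1, 2),
    "Class A": (2, 2),
    "Class B": (0, 5),
}

weights = {key: weight(adopted - rejected) for key, (adopted, rejected) in feedback.items()}
for key, value in weights.items():
    print(key, round(value, 4))
```

With any increasing weight function, the ordering comes out as 'car' above 'Class A' above 'vehicle' above 'Class B', reflecting the net adoption counts 4, 0, -1, and -5.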

The slope k may be adjusted so that the similarity of the adopted documents is greater than the similarity of the non-adopted documents.

If the maximum similarity of the non-adopted documents is greater than the minimum similarity of the adopted documents, this is an error condition. The computation module 110 may then increase the slope of the weight function for the classifications assigned to the non-adopted documents, or decrease the slope of the weight function for the classifications assigned to the adopted documents. If adjusting the logistic slope for the classifications does not resolve the error condition, the computation module 110 may also adjust the slope of the weight function for the query terms.

The graph of the weight function f(x) calculated in this way can be represented as shown in FIG. 3.

Because the weights are determined by the frequency of adopted and non-adopted documents, no weight can be calculated in an initial search, for which no adopted or non-adopted documents exist.

In this case, the weight calculation unit 113 may set the initial weight values for the query terms and/or similar terms arbitrarily.

For example, in the initial search the weight calculation unit 113 may set the initial weight of a query term to the fixed value 1, and may set the weight of each similar term to the similarity between the query term and that similar term. Assuming that the similarity of the similar term 'motor' to the query term 'car' is 0.381901, the initial weight of 'motor' may be set to 0.381901.
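A minimal sketch of this initialization, assuming the similar terms arrive as a mapping from term to similarity (the function name and data layout are hypothetical):

```python
def initial_weights(query_term, similar_terms):
    """Initial weights when no user feedback exists yet.

    The query term gets the fixed weight 1.0; each similar term gets its
    similarity to the query term as its weight.
    """
    weights = {query_term: 1.0}
    weights.update(similar_terms)
    return weights


print(initial_weights("car", {"motor": 0.381901}))
```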

As another example, in the initial search the weight calculation unit 113 may calculate the initial weight of a query term on the assumption that the documents found for the query term are adopted documents.

In this case, some of the documents found for the query term that have high document scores may be arbitrarily treated as adopted.

Once the user has decided to adopt or reject the retrieved documents, the initial weights of the query terms and similar terms may be updated to the weights calculated from the frequency of adopted and non-adopted documents.

Meanwhile, when calculating the weights for document classifications, the weight calculation unit 113 may also calculate weights for the sub-classifications of each document classification in the classification hierarchy. The weight assigned to each document classification may then be applied by the search module 130 when assigning a document score to each document.

The query generation unit 115 reconstructs the query statement using the query terms extracted by the query term analysis unit 111 together with the similar terms and similarities. The query generation unit 115 may combine the one or more similar terms extracted for a query term with that query term in a logical OR relationship.

For example, when the query statement is 'hybrid car' and the similar terms extracted by the query term analysis unit 111 for the query term 'car' are 'vehicle', 'motor', and 'automobile', the query generation unit 115 may combine the query terms 'hybrid' and 'car' with the similar terms 'vehicle', 'motor', and 'automobile' to reconstruct the query statement as 'hybrid & (car + vehicle + motor + automobile)'.
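The reconstruction in this example can be sketched as follows. The operators `&` (AND) and `+` (OR) follow the notation of the example; the function name and the input layout are hypothetical.

```python
def rebuild_query(query_terms, similar_terms):
    """OR each query term with its similar terms, then AND the groups."""
    groups = []
    for term in query_terms:
        alternatives = [term] + similar_terms.get(term, [])
        if len(alternatives) == 1:
            groups.append(term)  # no similar terms: keep the term as is
        else:
            groups.append("(" + " + ".join(alternatives) + ")")
    return " & ".join(groups)


rebuilt = rebuild_query(
    ["hybrid", "car"],
    {"car": ["vehicle", "motor", "automobile"]},
)
print(rebuilt)  # hybrid & (car + vehicle + motor + automobile)
```

Keeping the original query term first in each OR group preserves the order shown in the example.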

Of course, when there are three or more query terms, the query statement generation unit 115 may extract similar terms and similarities for each combination of query terms and reconstruct the query statement for each combination.

When the document search result for the query statement input by the user is provided to the user, the adoption decision unit 117 may ask the user whether the provided search result is adopted or not adopted. When adoption or non-adoption information for the document search result is input by the user, the adoption decision unit 117 determines, based on that information, whether each of all or some of the documents included in the search result is adopted or not adopted, and stores the result in the user data repository 159.

At this time, the weight calculation unit 113 may adjust the weights (or the parameter values of the weight functions) previously calculated for each query term and each document classification, based on the frequencies of adopted and non-adopted documents determined by the adoption decision unit 117.

FIG. 4 is a diagram showing the detailed configuration of the search module of FIG. 1.

Referring to FIG. 4, the search module 130 according to an embodiment of the present invention may include a document search unit 131 and a document score calculation unit 135.

First, when a query statement is input by a user, the document search unit 131 may search for documents corresponding to the query statement through a search engine.

In addition, when the query statement is reconstructed by the query statement generation unit 115 of the operation module 110, the document search unit 131 may search for documents corresponding to the reconstructed query statement through the search engine.

Here, the search engine may search for documents in the index repository 153 using the query terms (and similar terms). Because the index repository 153 holds information on the indexed documents in an index structure (e.g., a B-Tree), the search engine can look up the index terms corresponding to the query terms (and similar terms) in that index structure.

The index repository 153 may also store, together with each indexed document, the index term scores for that document. Accordingly, when looking up index terms for the query terms and similar terms, the search engine can extract the index term score information for each index term found.
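A minimal sketch of such a lookup follows. It assumes, purely for illustration, that the index is an in-memory dictionary mapping each index term to the documents that contain it and their pre-assigned index term scores; the description mentions a B-Tree as one possible index structure, and a plain `dict` stands in for it here. The function name `lookup` and the document identifiers are hypothetical.

```python
def lookup(index, terms):
    """Return {doc_id: {term: index_term_score}} for every term found in the index.

    index: dict mapping index term -> {doc_id: score} (assumed layout).
    terms: query terms and similar terms taken from the reconstructed query.
    """
    hits = {}
    for term in terms:
        # Terms that are not index terms simply contribute nothing.
        for doc_id, score in index.get(term, {}).items():
            hits.setdefault(doc_id, {})[term] = score
    return hits


index = {
    "car": {"D1": 2.0},
    "vehicle": {"D1": 1.0, "D2": 3.0},
}
hits = lookup(index, ["car", "vehicle", "motor"])
print(hits)
```

Returning the scores alongside the matching documents means that scoring can proceed in the same pass as the search, without a second read of the index.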

The document score calculation unit 135 calculates a document score for each retrieved document at the same time as the document search unit 131 searches for documents corresponding to the query statement or the reconstructed query statement.

Here, the document score calculation unit 135 may calculate the document score for a document using the weights calculated for each query term and each similar term, and the index term scores of the index terms found using each query term and each similar term. In this case, the document score calculation unit 135 may calculate, as the document score of the document, the value obtained by multiplying the weight of each query term and each similar term by the corresponding index term score.

The document score calculation unit 135 may also calculate the document score of a document using the weight assigned to each document classification by the weight calculation unit 113.
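The scoring rule can be illustrated with a short Python sketch. The description states only that each term weight is multiplied by the corresponding index term score; summing those products over the matched terms, and applying the classification weight as a single multiplicative factor, are assumptions made for this example. The name `document_score` and the sample values are hypothetical.

```python
def document_score(term_weights, index_term_scores, class_weight=1.0):
    """Score one document from per-term weights and its index term scores.

    term_weights:      {term: weight} for query terms and similar terms.
    index_term_scores: {term: score} for index terms found in this document.
    class_weight:      weight of the document's classification (assumed to be
                       applied as a multiplicative factor; default 1.0).
    """
    total = 0.0
    for term, weight in term_weights.items():
        score = index_term_scores.get(term)
        if score is not None:
            # weight x index term score, as described in the text;
            # summation across terms is an assumption of this sketch.
            total += weight * score
    return class_weight * total


term_weights = {"hybrid": 1.0, "car": 1.0, "vehicle": 0.5}
doc_index_scores = {"hybrid": 2.0, "vehicle": 4.0}
print(document_score(term_weights, doc_index_scores))        # 4.0
print(document_score(term_weights, doc_index_scores, 0.5))   # 2.0
```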

The document search system 100 according to an embodiment of the present invention, configured as described above, may be implemented as an independent hardware device, or may run as at least one processor included in another hardware device such as a microprocessor or a general-purpose computer system.

The operation flow of the apparatus according to the present invention, configured as described above, is described in more detail below.

FIGS. 5 and 6 are diagrams showing operation flows of a method according to an embodiment of the present invention.

First, FIG. 5 shows the operation flow of a document scoring method according to an embodiment of the present invention.

Referring to FIG. 5, when a query statement is input by a user (S110), the operation module 110 of the search system analyzes the input query statement to extract query terms (S130), and extracts similar terms and similarities for the extracted query terms (S140). Here, the operation module 110 may extract the similar terms and similarities for the query terms based on information stored in the similar term vector repository. The operation module 110 may also store the extracted similar terms and similarities in the similar term repository 157.
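One way step S140 could work is sketched below. The similarity measure is not specified in this passage, so cosine similarity over stored term vectors is an assumption, as are the function names, the two-dimensional toy vectors, and the threshold value (the claims mention only a predetermined threshold).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_terms(query_term, vectors, threshold):
    """Return [(term, similarity)] with similarity >= threshold, best first.

    vectors: {term: vector}, a stand-in for the stored term vectors.
    """
    q = vectors[query_term]
    scored = [(t, cosine(q, v)) for t, v in vectors.items() if t != query_term]
    kept = [(t, s) for t, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


vectors = {
    "car": [1.0, 0.0],
    "automobile": [0.9, 0.1],
    "banana": [0.0, 1.0],
}
result = similar_terms("car", vectors, threshold=0.5)
print(result)
```

Because comparing a query term against every stored vector is costly, the results would typically be computed ahead of time and cached, which is consistent with storing them in the similar term repository 157.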

If this is the first search for the query statement input in step S110 (S120), the search module 130 may perform a document search based on the query statement input in step S110 (S121). When information such as adoption or non-adoption information is then input by the user for the documents retrieved in step S121, the operation module 110 stores the input information in the user data repository 159 as user data (S125). The user data stored in step S125 may be queried by the operation module 110 in step S150, described below.

Thereafter, the operation module 110 queries the user data stored in the user data repository 159 (S150) and calculates a weight for each query term (including similar terms) and/or a weight for each document classification (S160).

Here, the user data repository 159 may store the adoption or non-adoption information input by the user for previous search records.

Accordingly, the operation module 110 may identify the adopted and non-adopted documents stored in the user data repository 159 and calculate weights according to the frequencies of adopted and non-adopted documents for each query term and/or each classification.

In step S160, the weight may be calculated by a weight function f(x) unique to each query term and/or each classification.
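The form of the weight function f(x) is not given in this passage. The sketch below shows one plausible choice under stated assumptions: a smoothed ratio of adopted documents to all judged documents, falling back to an initial weight (for example, the similarity of a similar term) when no feedback exists yet. The function name and the add-one smoothing are illustrative, not taken from the source.

```python
def frequency_weight(adopted, non_adopted, initial_weight):
    """Weight for one query term (or classification) from feedback counts.

    adopted / non_adopted: number of adopted and non-adopted documents.
    initial_weight:        value used when there is no feedback at all.
    """
    total = adopted + non_adopted
    if total == 0:
        return initial_weight  # no user data yet: keep the initial value
    # Add-one smoothing keeps the weight strictly between 0 and 1
    # (an assumption of this sketch, not a rule from the description).
    return (adopted + 1) / (total + 2)


print(frequency_weight(0, 0, initial_weight=0.8))  # 0.8
print(frequency_weight(3, 1, initial_weight=0.8))  # 4/6, about 0.667
print(frequency_weight(0, 4, initial_weight=0.8))  # 1/6, about 0.167
```

With this choice, a term whose matching documents are mostly adopted moves toward 1, and a term whose documents are mostly rejected moves toward 0.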

The operation module 110 of the search system reconstructs the query statement using the query terms extracted in step S130 and the similar terms and similarities extracted in step S140 (S170).

At this time, the operation module 110 may combine the similar terms with the query term in a mutual logical-OR relationship.

When the query statement has been reconstructed by the operation module 110, the search module 130 searches for documents using the reconstructed query statement and scores the retrieved documents (S180).

The search module 130 may also score the retrieved documents at the same time as it searches for documents.

Here, the search module 130 may assign a document score to a document using the weights calculated for each query term and each similar term, and the index term scores of the index terms found using each query term and each similar term. In this case, the search module 130 may assign, as the document score of the document, the value obtained by multiplying the weight of each query term and each similar term by the corresponding index term score.

When document search and scoring by the search module 130 are complete, the operation module 110 may output the document search result (S190). For example, the document search result may be provided to the user sorted by the magnitude of the document score assigned to each document.

As another example, the operation module 110 may provide the user with those documents in the search result whose scores lie between the adopted-document score and the non-adopted-document score. As yet another example, depending on the search conditions, the operation module 110 may provide the user with documents whose document score is greater than the adopted-document score, or documents whose document score is less than the non-adopted-document score.
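The output options above can be sketched as follows. How the adopted-document score and the non-adopted-document score are derived is not stated in this passage, so the sketch treats each as a single reference value supplied by the caller; that reading, the function names, and the sample scores are all assumptions.

```python
def rank_documents(doc_scores):
    """Sort (doc_id, score) pairs by score, highest first."""
    return sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)


def documents_between(doc_scores, non_adopted_score, adopted_score):
    """Keep documents scoring strictly between the two reference scores."""
    low, high = sorted((non_adopted_score, adopted_score))
    return [(d, s) for d, s in rank_documents(doc_scores) if low < s < high]


doc_scores = {"A": 0.9, "B": 0.5, "C": 0.2, "D": 0.7}
print(rank_documents(doc_scores))
# [('A', 0.9), ('D', 0.7), ('B', 0.5), ('C', 0.2)]
print(documents_between(doc_scores, non_adopted_score=0.3, adopted_score=0.8))
# [('D', 0.7), ('B', 0.5)]
```

Documents in the band between the two reference scores are the ones the user has effectively not yet judged, which may be why the description singles them out.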

FIG. 6 is a diagram showing the flow of an operation for adjusting the weight function applied in step S160 of FIG. 5.

Referring to FIG. 6, when the document search and scoring for the query statement input by the user are complete, the operation module 110 may ask the user whether the provided search result is adopted or not adopted. The user can then choose, through the search result screen, to adopt or not adopt all or some of the retrieved documents. Of course, the user may also choose to adopt or not adopt all or some of the retrieved documents without being asked.

When adoption or non-adoption information for the search result provided to the user is input by the user (S210), the operation module 110 determines whether each document is adopted or not adopted based on the input information (S220) and may store that information in the user data repository 159 (S230). Of course, the search result provided to the user may include documents for which the user has input neither adoption nor non-adoption.
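Steps S210 to S230 amount to recording a per-document decision and tallying it. A minimal sketch follows, assuming each stored record is a tuple of document identifier, classification label, and decision, where the decision is `True` (adopted), `False` (non-adopted), or `None` (no input from the user). The record layout, the function name, and the classification labels are hypothetical.

```python
from collections import Counter


def tally_feedback(records):
    """Count adopted and non-adopted documents per classification.

    records: iterable of (doc_id, classification, decision) tuples, where
             decision is True, False, or None (assumed layout).
    """
    adopted, non_adopted = Counter(), Counter()
    for _doc_id, classification, decision in records:
        if decision is True:
            adopted[classification] += 1
        elif decision is False:
            non_adopted[classification] += 1
        # decision is None: the user gave no input, so nothing is counted
    return adopted, non_adopted


records = [
    ("D1", "class-A", True),
    ("D2", "class-A", False),
    ("D3", "class-B", True),
    ("D4", "class-A", None),
    ("D5", "class-A", True),
]
adopted, non_adopted = tally_feedback(records)
print(dict(adopted))      # {'class-A': 2, 'class-B': 1}
print(dict(non_adopted))  # {'class-A': 1}
```

The same tally could be keyed by query term instead of classification to obtain the per-term frequencies.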

Once adoption or non-adoption has been determined for all or some of the retrieved documents, the operation module 110 may adjust the values of the function that calculates weights according to the frequencies of adopted and non-adopted documents (S240). In this case, the weight for each query term and each classification may be updated.

Accordingly, at the next search the operation module 110 may reconstruct the query statement by applying the updated weights for each query term and each classification.

FIG. 7 is a diagram showing a computing system in which a method according to an embodiment of the present invention is executed.

Referring to FIG. 7, the computing system 1000 may include at least one processor 1100, a memory 1300, a user interface input device 1400, a user interface output device 1500, storage 1600, and a network interface 1700, connected through a bus 1200.

The processor 1100 may be a central processing unit (CPU) or a semiconductor device that processes instructions stored in the memory 1300 and/or the storage 1600. The memory 1300 and the storage 1600 may include various types of volatile or non-volatile storage media. For example, the memory 1300 may include ROM (Read Only Memory) and RAM (Random Access Memory).

Accordingly, the steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by the processor 1100, or in a combination of the two. The software module may reside in a storage medium (i.e., the memory 1300 and/or the storage 1600) such as RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, or a CD-ROM. An exemplary storage medium is coupled to the processor 1100, and the processor 1100 can read information from and write information to the storage medium. Alternatively, the storage medium may be integral to the processor 1100. The processor and the storage medium may reside within an application-specific integrated circuit (ASIC). The ASIC may reside within a user terminal. Alternatively, the processor and the storage medium may reside as discrete components within a user terminal.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention.

Accordingly, the embodiments disclosed in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within an equivalent scope should be construed as falling within the scope of rights of the present invention.

100: Document search system 110: Operation module
111: Query term analysis unit 113: Weight calculation unit
115: Query statement generation unit 117: Adoption decision unit
130: Search module 131: Document search unit
135: Document score calculation unit 150: Search DB
151: Document repository 153: Index repository
155: Term vector set repository 157: Similar term repository
159: User data repository

Claims

A document scoring method comprising:
extracting a query term by analyzing a query statement input by a user, and extracting a similar term and a similarity for the extracted query term;
querying pre-stored user data and calculating weights for the query term, the similar term, and a document classification;
reconstructing the query statement using the extracted query term, the similar term, and the similarity;
performing document search and scoring using the reconstructed query statement; and
outputting a document search result for the query statement.

The method according to claim 1,
wherein the user data stores adoption or non-adoption information input by a user for a previous document search result for the query term.

The method of claim 2,
wherein the step of calculating the weights comprises identifying, from the user data, the adopted and non-adopted documents selected by the user, and calculating the weights based on the frequencies of adopted and non-adopted documents for each of the query term, the similar term, and the document classification.

The method according to claim 1,
wherein the step of calculating the weights comprises calculating the weights using a weight function unique to each of the query term, the similar term, and the document classification.

The method according to claim 1,
wherein the step of calculating the weights comprises, if no pre-stored user data exists, using the similarity of the similar term as its weight.

The method according to claim 1,
wherein the step of calculating the weights comprises, if no pre-stored user data exists, treating the documents for the query term as adopted and calculating the weights according to the frequency of adopted documents for each document classification.

The method according to claim 1,
wherein the step of reconstructing the query statement comprises reconstructing the query statement by combining the query term and the similar term in a logical-OR relationship.

The method according to claim 1,
wherein performing the document search and scoring comprises:
searching an index repository, in which indexed documents and index data are stored according to an index structure, for an index term corresponding to a query term included in the reconstructed query statement;
extracting an index term score assigned in advance to the index term found; and
calculating a document score for the document using the weights calculated for the query term and the similar term and the extracted index term score.

The method of claim 8,
wherein the step of calculating the document score comprises assigning, as the document score of the document, the value obtained by multiplying the weights calculated for the query term and the similar term by the index term score.

The method according to claim 1, further comprising:
receiving adoption or non-adoption information on the document search result from the user;
storing the adoption or non-adoption information in the user data together with identification information of the corresponding document; and
adjusting the weights for the query term, the similar term, and the document classification using the stored user data.

The method of claim 10,
wherein the step of adjusting the weights comprises adjusting parameter values of the weight functions for the query term, the similar term, and the document classification.

The method according to claim 1,
wherein the step of extracting the similar term and the similarity comprises extracting, from among the terms included in documents, a similar term whose similarity to the query term is equal to or greater than a predetermined threshold.

A document search system comprising:
an operation module that extracts a query term, and a similar term and a similarity for the query term, by analyzing a query statement input by a user, calculates weights for the extracted query term, the similar term, and a document classification by querying pre-stored user data, and reconstructs the query statement using the extracted query term, the similar term, and the similarity; and
a search module that performs document search and scoring using the reconstructed query statement,
wherein the search module scores the retrieved documents based on the weights.

The system according to claim 13,
wherein the user data stores adoption or non-adoption information input by a user for a previous document search result for the query term.

The system according to claim 14,
wherein the operation module identifies, from the user data, the adopted and non-adopted documents selected by the user, and calculates the weights based on the frequencies of adopted and non-adopted documents for each of the query term, the similar term, and the document classification.

The system according to claim 13,
wherein the operation module, if no pre-stored user data exists, uses the similarity of the similar term as its weight.

The system according to claim 13,
wherein the operation module, if no pre-stored user data exists, arbitrarily deems both the query term and the similar term to have been adopted, and calculates the weights according to the frequency of adopted documents for each of the query term, the similar term, and the document classification.