KR101141498B1

KR101141498B1 - Informational retrieval method using a proximity language model and recording medium thereof

Info

Publication number: KR101141498B1
Application number: KR1020100003528A
Authority: KR
Inventors: 윤여걸
Original assignee: 주식회사 와이즈넛
Priority date: 2010-01-14
Filing date: 2010-01-14
Publication date: 2012-05-04
Also published as: KR20110083347A

Abstract

The present invention relates to an information retrieval method, an information extraction method, and a method of ranking search target documents, each using a query that includes a plurality of query words. In particular, the final rank of a document is determined by integrating the proximity scores of the query words within the document with the word-level emission probabilities of the words contained in the document. Because the proximity model is applied to the emission probability of each individual word, and the overall ranking function uses the resulting proximity-adjusted probabilities, more accurate and effective information retrieval and extraction are possible.

Description

Information retrieval using proximity language model {INFORMATIONAL RETRIEVAL METHOD USING A PROXIMITY LANGUAGE MODEL AND RECORDING MEDIUM THEREOF}

The present invention relates to an information retrieval method and, in particular, to the ranking module of a search engine.

In an environment where data pours out at an exponential rate and vast amounts of information accumulate, most companies and users recognize the importance of information retrieval as a gateway to information. Various search engines and associated search techniques are known as means of information retrieval. A search engine produces search results corresponding to a query over source data scattered across DBMSs, unstructured documents, websites, and the like. The main modules of such a search engine may include various collectors (DB Bridge, File Bridge, Web Bridge), an indexer, a morphological analyzer, a query analyzer, a ranking module, and a searcher.

The core of information retrieval is ranking each document by its relevance to the user query. Probabilistic models, including classical probabilistic models [23, 13, 24] and probabilistic language models [21, 15, 29], have been used to rank documents. These models assume that terms are completely independent, represent queries and documents as a “bag of words,” and rank documents mainly using statistics such as term frequency, inverse document frequency, and document length. Their drawback is that they do not consider the proximity of terms within a document.

Proximity refers to the closeness or compactness of the query terms that appear in a document. The more densely the terms cluster, the more likely they are to be topically related, and therefore the more relevant the document is to the concept expressed in the user query. Proximity can be regarded as an indirect measure of term dependence, since the proximity between terms strongly affects the dependence between them [2].

Several studies have sought to integrate a proximity factor into existing ranking functions [22, 3, 27, 1]. Their experimental results show that an appropriate proximity measure can improve the effectiveness of probabilistic ranking functions. However, these studies have two major drawbacks. The first is that they do not consider the proximity structure of individual query terms. For example, reference [27] presents two proximity mechanisms: a span-based method and a pair-based method.

The span-based method ranks documents according to the text span covering all the query terms, but it does not consider the internal structure of those terms. The pair-based method expresses the proximity score of a document directly in terms of the distance between the terms in each pair, but it likewise does not consider the proximity structure of individual terms.
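To make the distinction concrete, the two families of measures can be sketched as follows. This is a minimal illustration, not the formulation of [27]: the function names are hypothetical, the span-based variant shown is the shortest window containing every query term at least once, and the pair-based variant is the smallest distance between occurrences of two different query terms.

```python
from itertools import combinations


def positions(doc_tokens, query_terms):
    """Map each query term to the list of positions where it occurs."""
    pos = {t: [] for t in query_terms}
    for i, tok in enumerate(doc_tokens):
        if tok in pos:
            pos[tok].append(i)
    return pos


def min_cover_span(doc_tokens, query_terms):
    """Span-based: length of the shortest window containing every query term.

    Returns None if some query term does not occur in the document.
    """
    needed = set(query_terms)
    best = None
    for start in range(len(doc_tokens)):
        if doc_tokens[start] not in needed:
            continue
        seen = set()
        for end in range(start, len(doc_tokens)):
            if doc_tokens[end] in needed:
                seen.add(doc_tokens[end])
            if seen == needed:
                length = end - start + 1
                if best is None or length < best:
                    best = length
                break
    return best


def min_pair_distance(doc_tokens, query_terms):
    """Pair-based: smallest distance between two different query terms.

    Returns None if no pair of distinct query terms co-occurs.
    """
    pos = positions(doc_tokens, query_terms)
    dists = [abs(i - j)
             for a, b in combinations(sorted(set(query_terms)), 2)
             for i in pos[a] for j in pos[b]]
    return min(dists) if dists else None
```

Both functions return a single number per document, which is the limitation the text points out: neither says how well any individual query term is surrounded by the others.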

The other weakness of existing work is that it combines the proximity factor with the probabilistic model in a purely intuitive manner. Most studies combine the total proximity score of a document with the relevance score computed by an existing ranking function simply linearly, externally, and at the document level.

Based on this overview, the prior art is examined in more detail below:

1. Dependency Modeling

Contrary to the counterintuitive “bag of words” assumption, many efforts have been made to incorporate term dependency directly into probabilistic models. Early studies [6, 28] developed models that incorporate term dependency into the binary independence model (BIM). However, these models did not outperform the BIM as consistently as expected and were rarely used in practice. The reason is that the independence assumption in the BIM can in fact be replaced by a weaker linked dependence assumption [5].

Recently, as language modeling has gained popularity in information retrieval, studies have sought to incorporate term dependency into language models. For example, references [25, 17] present a general language model that accounts for ordered adjacent dependence by combining a bigram language model with a unigram language model. Reference [26] introduces a biterm language model that accounts for dependence between unordered adjacent word pairs. In addition, [10, 19] propose dependence language models that consider the relations among terms in a dependency tree, and [16] proposes an exponential language model that considers dependencies within each subset of query terms. These dependency models performed better than the unigram independence language model. Their main problem, however, is that modeling dependencies directly makes the parameter space very large. As the parameter space grows, parameter estimation becomes very difficult and sensitive to data sparsity and noise. In the end, this cancels out even the small gains obtainable from direct dependency modeling.

2. Phrase Indexing

Another line of research attempted to incorporate units larger than words, such as phrases, into the text representation. In such approaches, dependencies between words are captured indirectly. For example, references [8, 7] test the incorporation of syntactic and statistical phrases into text indexing. According to these studies, statistical phrases performed better than syntactic phrases. However, the improvement obtained from phrases was not consistent: it was substantial on some collections but marginal or even negative on others. Reference [18] re-examined the use of phrases for indexing and retrieval. When extracting statistical phrases, it treated every pair of non-function words occurring in adjacent positions in a number of documents as a phrase, and treated each phrase as an indexing unit in the same way as a single word. That study concludes that, when a suitable baseline ranking method is used, phrases do not significantly affect precision at high ranks.

As reference [10] points out, phrase indexing is ineffective for two reasons. First, phrases are inherently different from words, so applying the same weighting scheme to both is inappropriate. Second, an independence model is likely to over-score phrases systematically.

3. Proximity Modeling

Proximity modeling can be seen as another indirect way of capturing term dependency. Early studies [11, 14, 4] propose similar proximity models that measure the proximity of a document by the span and density of the query terms. That is, the narrower the text span containing all the query terms, the more relevant the document; and the more span instances of the query terms a document contains, the more relevant it is. However, these studies did not integrate their proximity measures into an effective probabilistic ranking model.

Recently, studies have sought to integrate the proximity factor into probabilistic ranking functions. For example, [22, 3] extended the BM25 ranking formula [24] with a term-proximity scoring part. The proximity part is a span-based measure: it scores each pair of query terms occurring within a text segment that covers all the query terms, in a form similar to that used for a single word. The BM25 ranking score and the proximity score are then linearly combined to obtain the ranking score of the document. This improved top precision and had a slight effect on average precision.

Reference [27] studied span-based measures and also proposed several pair-based proximity measures, which model proximity in terms of pairwise local distances between terms. It further combined the proximity factor with the BM25 ranking model and with the KL-divergence language model [15]. The integration of the proximity factor takes the form of an external score combination, which can be expressed as Equation 1.

In Equation 1, KL(q,d) is the ranking score obtained from the KL-divergence model, and δ(q,d) is the proximity distance measure of document d with respect to query q. Even with such a simple approach, [27] obtained better results by combining an appropriate proximity distance measure with the KL language model than were obtained with a more complex dependency language model such as [16].
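The body of Equation 1 is not reproduced in this text. As a hedged reconstruction, the external combination described in [27] is generally written in the following form, where R(q,d) denotes the combined ranking score and α is a smoothing constant from that work. Both symbols are assumptions here, since neither is defined in this section.

```latex
R(q,d) = \mathrm{KL}(q,d) + \log\bigl(\alpha + \exp(-\delta(q,d))\bigr)
```

Under this form a smaller proximity distance δ(q,d) yields a larger additive bonus, and the bonus is applied once per document, outside the language model score.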

In summary, compared with direct dependency modeling and phrase indexing, proximity modeling is an economical and more effective way for information retrieval to go beyond the “bag of words” assumption.

The inventor has long sought to overcome the above problems of the prior art and, in particular, to improve on existing work on proximity modeling, and has completed the present invention as a result.

[1] J. Bai, Y. Chang, H. Cui, Z. Zheng, G. Sun, and X. Li. Investigation of partial query proximity in web search. 2008.
[2] D. Beeferman, A. Berger, and J. Lafferty. A model of lexical attraction and repulsion. Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, pages 373-380, 1997.
[3] S. Buttcher and C. Clarke. Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval. Proceedings of the 14th Text Retrieval Conference (Gaithersburg, USA, November 2005).
[4] C. Clarke, G. Cormack, and E. Tudhope. Relevance ranking for one to three term queries. Information Processing and Management, 36(2):291-311, 2000.
[5] W. Cooper. Some Inconsistencies and Misidentified Modeling Assumptions in Probabilistic Information Retrieval. ACM Transactions on Information Systems, 13(1):100-111, 1995.
[6] W. Croft. Boolean queries and term dependencies in probabilistic retrieval models. Journal of the American Society for Information Science, 37(2):71-77, 1986.
[7] W. Croft, H. Turtle, and D. Lewis. The use of phrases and structured queries in information retrieval. Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, pages 32-45, 1991.
[8] J. Fagan. Experiments in Automatic Phrase Indexing For Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods. 1987.
[9] T. Ferguson. A Bayesian analysis of some nonparametric problems. Ann. Statist, 1(2):209-230, 1973.
[10] J. Gao, J. Nie, G. Wu, and G. Cao. Dependence language model for information retrieval. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 170-177, 2004.
[11] D. Hawking and P. Thistlewaite. Proximity operators-So near and yet so far. Proceedings of the 4th Text Retrieval Conference, pages 131-143, 1995.
[12] W. Hersh, C. Buckley, T. Leone, and D. Hickam. OHSUMED: an interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pages 192-201. Springer-Verlag New York, Inc. New York, NY, USA, 1994.
[13] K. Jones, S. Walker, and S. Robertson. A Probabilistic Model of Information Retrieval: Development and Status. University of Cambridge, Computer Laboratory, 1998.
[14] C. L. A. Clarke and G. Cormack. Shortest-Substring Retrieval and Ranking. ACM Transactions on Information Systems, 18(1):44-78, 2000.
[15] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 111-119, 2001.
[16] D. Metzler and W. Croft. A Markov random field model for term dependencies. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 472-479, 2005.
[17] D. Miller, T. Leek, and R. Schwartz. A hidden Markov model information retrieval system. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 214-221, 1999.
[18] M. Mitra, C. Buckley, A. Singhal, and C. Cardie. An analysis of statistical and syntactic phrases. Proceedings of RIAO-97, 5th International Conference “Recherche d'Information Assistée par Ordinateur”, pages 200-214, 1997.
[19] R. Nallapati and J. Allan. Capturing term dependencies using a language model based on sentence trees. Proceedings of the eleventh international conference on Information and knowledge management, pages 383-390, 2002.
[20] P. Ogilvie and J. Callan. Experiments Using the Lemur Toolkit. NIST Special Publication SP, pages 103-108, 2002.
[21] J. Ponte and W. Croft. A language modeling approach to information retrieval. ACM New York, NY, USA, 1998.
[22] Y. Rasolofo and J. Savoy. Term Proximity Scoring for Keyword-Based Retrieval Systems. Lecture Notes in Computer Science, pages 207-218, 2003.
[23] S. Robertson, S. Jones, et al. Relevance Weighting of Search Terms. Journal of the American Society for Information Science, 27(3):129-46, 1976.
[24] S. Robertson, S. Walker, and M. Beaulieu. Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track. NIST Special Publication SP, pages 253-264, 1999.
[25] F. Song and W. Croft. A general language model for information retrieval. Proceedings of the eighth international conference on Information and knowledge management, pages 316-321, 1999.
[26] M. Srikanth and R. Srihari. Biterm language models for document retrieval. Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 425-426, 2002.
[27] T. Tao and C. Zhai. An exploration of proximity measures in information retrieval. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 295-302, 2007.
[28] C. Yu, C. Buckley, K. Lam, and G. Salton. A Generalized Term Dependence Model in Information Retrieval. Information technology: research and development, 2(4):129-154, 1983.
[29] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2):179-214, 2004.

The basic object of the present invention is to provide a more accurate and efficient search method for information retrieval with a query consisting of multiple words. It also aims to present novel information retrieval, information extraction, and ranking methods in this regard.

More specifically, an object of the present invention is to provide a novel information retrieval method and information extraction method that, in applying a proximity model, derive an emission probability for each individual word. To this end, term proximity is integrated into the unigram language model so that the internal structure of the terms is taken into account and, as a result, the proximity information within a document is integrated internally rather than externally.

The present invention further aims to present this term-level integration of proximity on a solid mathematical foundation.

Other objects of the present invention that are not explicitly stated will additionally be considered to the extent that they can be readily inferred from the following detailed description and its effects.

To achieve the above objects, the present invention provides an information retrieval method that, for a query (q) containing a plurality of search terms entered by a client, ranks documents (d_l) from a large-scale document database and provides the search results to the client, the method comprising:

(a) calculating, for each word contained in a document, a word-level emission probability;

(b) obtaining a proximity centrality score between the query terms in the document and combining it with the emission probability, thereby calculating a word-level emission probability transformed by the proximity model; and

(c) determining the rank of each document by using the word-level emission probabilities obtained in step (b) in the overall ranking function, and displaying the search results on the client's screen according to the determined ranks. The information retrieval method using a proximity language model is characterized by including these steps.
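Steps (a) to (c) can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: this section does not give the formulas, so the Dirichlet-smoothed form of the emission probability, the use of the proximity centrality as pseudo-counts, the transform `base ** (-distance)`, and every name and default value (`mu`, `lam`, `base`, `collection_prob`) are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def proximity_centrality(tokens, query_terms, base=1.5):
    """Step (b), first half: score each query term by its closeness to the
    other query terms. The transform base ** (-distance) is an assumption."""
    pos = {t: [i for i, tok in enumerate(tokens) if tok == t]
           for t in set(query_terms)}
    scores = {}
    for t, t_pos in pos.items():
        total = 0.0
        for u, u_pos in pos.items():
            if u == t or not t_pos or not u_pos:
                continue
            d = min(abs(i - j) for i in t_pos for j in u_pos)
            total += base ** (-d)
        scores[t] = total
    return scores


def emission_probability(word, tokens, collection_prob, prox=None,
                         mu=2000.0, lam=6.0):
    """Steps (a) and (b): Dirichlet-smoothed unigram probability of `word`.

    With prox=None this is the plain word-level probability of step (a).
    With prox given, lam * prox[word] is added as pseudo-counts (assumed form).
    collection_prob must contain an entry for `word`.
    """
    counts = Counter(tokens)
    prox = prox or {}
    numer = counts[word] + lam * prox.get(word, 0.0) + mu * collection_prob[word]
    denom = len(tokens) + lam * sum(prox.values()) + mu
    return numer / denom


def rank(documents, query_terms, collection_prob):
    """Step (c): order documents by query log-likelihood under the
    proximity-adjusted emission probabilities, best first."""
    scored = []
    for doc_id, tokens in documents.items():
        prox = proximity_centrality(tokens, query_terms)
        score = sum(
            math.log(emission_probability(w, tokens, collection_prob, prox))
            for w in query_terms)
        scored.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(scored, reverse=True)]
```

In this sketch two documents with identical term counts receive different scores when the query terms sit closer together in one of them, because the proximity enters each word's probability rather than being added to the document score afterwards.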

In one embodiment of the information retrieval method using the proximity language model, step (a) preferably includes representing the query and the document as multinomial term-count vectors and using multinomial parameters.
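As a small illustration of this representation, a token sequence can be turned into a term-count vector over a fixed vocabulary, and the multinomial parameters can be estimated from it. The helper names and the maximum-likelihood estimate are assumptions made for this sketch; the section does not specify how the parameters are estimated.

```python
from collections import Counter


def term_count_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token sequence."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


def multinomial_mle(count_vector):
    """Maximum-likelihood multinomial parameters: counts divided by the total."""
    total = sum(count_vector)
    if total == 0:
        return [0.0] * len(count_vector)
    return [c / total for c in count_vector]
```

For example, with the vocabulary `["a", "b", "c"]` the document `a b a` becomes the vector `[2, 1, 0]`.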

In one embodiment of the information retrieval method using the proximity language model, the proximity centrality score of step (b) may be calculated by measuring the distances between pairs of query terms.

In one embodiment of the information retrieval method using the proximity language model, the distance between pairs of query terms is preferably the minimum pairwise distance.

In another embodiment of the information retrieval method using the proximity language model, the distance between pairs of query terms is preferably the average distance between query terms.

In yet another embodiment of the information retrieval method using the proximity language model, the pairwise distances formed by all the query terms in the document are preferably each converted into a proximity centrality score by a proximity transformation function and then summed.
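The three embodiments above differ only in how the pairwise distances are aggregated into a centrality score for one query term. A hedged sketch follows; the exponential transform `base ** (-distance)`, the choice to apply it after taking the minimum or the average, and all function and parameter names are assumptions, since this section names a proximity transformation function without defining it.

```python
def pair_distance(tokens, a, b):
    """Smallest positional distance between occurrences of terms a and b,
    or None if either term is absent."""
    pa = [i for i, t in enumerate(tokens) if t == a]
    pb = [i for i, t in enumerate(tokens) if t == b]
    if not pa or not pb:
        return None
    return min(abs(i - j) for i in pa for j in pb)


def transform(distance, base=1.5):
    """Assumed proximity transformation: decays as the distance grows."""
    return base ** (-distance)


def centrality(tokens, term, query_terms, mode="min", base=1.5):
    """Proximity centrality of `term` relative to the other query terms.

    mode="min": transform of the minimum pairwise distance.
    mode="avg": transform of the average pairwise distance.
    mode="sum": sum of the transformed pairwise distances.
    """
    others = [u for u in set(query_terms) if u != term]
    dists = [d for d in (pair_distance(tokens, term, u) for u in others)
             if d is not None]
    if not dists:
        return 0.0
    if mode == "min":
        return transform(min(dists), base)
    if mode == "avg":
        return transform(sum(dists) / len(dists), base)
    if mode == "sum":
        return sum(transform(d, base) for d in dists)
    raise ValueError(f"unknown mode: {mode}")
```

The summed variant is the only one that rewards a term for being close to several other query terms at once; the minimum variant looks only at its nearest neighbour.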

As another aspect for achieving the object of the present invention, the present invention provides a method of ranking search target documents in response to an input query containing a plurality of query words,

the method comprising determining the final rank of a document by integrating the proximity scores of the plurality of query words of the query within the document with the word-level emission probabilities of the words contained in the document.

As yet another aspect for achieving the object of the present invention, the present invention relates to an information retrieval method using a query containing a plurality of query words,

the method using a proximity language model and being characterized in that the final rank of a document is determined by integrating the proximity scores of the plurality of query words of the query within the search target document with the word-level emission probabilities of the words contained in that document.

In still another aspect of the present invention, there is provided a method of extracting information by ranking documents from the web or a document database for a query that is input by a client and contains a plurality of query terms,

the method comprising (c) determining the rank of each document by using the word-level emission probabilities obtained in step (b) in the overall ranking function of the documents, and extracting information according to the determined ranks; that is, an information extraction method using a proximity language model.

The present invention also relates to a computer-readable recording medium on which a computer program for executing the above methods is recorded, the recording medium comprising a module that calculates the word-level emission probabilities transformed by the proximity model.

According to the present invention as described above, the proximity model is applied at the level of individual words on the basis of their emission probabilities, and the overall ranking function uses this proximity model, so that more accurate and effective information retrieval and extraction can be performed. This is because the proximity modeling of the present invention, based on word-level emission probabilities, effectively reflects the proximity structure of the individual terms within a document in information retrieval.

Through the technique of integrating term proximity with the unigram language model, a ranking formula was obtained that effectively raises the proximity score of a query term when its position is close to the positions of other query terms. This term-level integration of proximity is mathematically grounded and was found to outperform the conventional, intuitive combination of proximity at the document level.

In addition, the present invention showed excellent performance not only for queries consisting of simple keywords but also for verbose queries and queries containing stopwords.

Even for effects not specifically mentioned in this specification, the potential effects expected from the technical features of the present invention are treated as if they were described in this specification.

FIG. 1 is a diagram illustrating an example of a term distance graph.
FIG. 2 is a diagram illustrating the parameter sensitivity of the PLM when P_MinDist is used, as an example of the proximity centrality measure according to the present invention.
※ The accompanying drawings are provided by way of reference to aid understanding of the technical idea of the present invention, and the scope of the invention is not limited by them.

Hereinafter, specific details for carrying out the present invention are described. The mathematical modeling described in detail below is intended to illustrate the idea of the invention. In an information retrieval method, an information extraction method, or a document ranking method for a query composed of a plurality of query terms, the present invention uses the proximity of the query terms that make up the query. The essence of the novel technical idea, however, is that the proximity scores computed for the query terms within a document are integrated, through mathematical statistics, probabilistic methods, and the like, with the word-level emission probabilities of the document and applied in the final ranking function. Accordingly, the scope of protection of the present invention is not limited by the specific equations or probability distributions exemplified below.

1. First, the proximity language model is described in detail:

<Unigram Language Model (ULM)>

Assume that a query (q) and a document (d_l) belong to a collection (C). If V = {w_1, w_2, ..., w_|V|} is the vocabulary set, then under the "bag of words" assumption the query and the document can each be represented as a vector of term counts as follows:

q = (q_1, q_2, ..., q_|V|),  d_l = (d_l,1, d_l,2, ..., d_l,|V|)

Here, q_i and d_l,i denote the frequency of the i-th word in the query (q) and in the document (d_l), respectively. Further, let θ_l = (θ_l,1, θ_l,2, ..., θ_l,|V|) be the parameters of the multinomial generative model for d_l, where θ_l,i is the emission probability of the term (w_i) in the vocabulary. The generation probability of the query (q) or the document (d_l) can then be expressed as follows.

Here, Σ_i q_i and Σ_i d_l,i are the total term counts in q and d_l, respectively.
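The generation probability referred to here is not reproduced in this text. As a hedged reconstruction, a multinomial model with parameters θ_l assigns the following probabilities to the count vectors of q and d_l. The symbol n_l for the document length is taken from its later use in this description; the exact typography of the original equation is an assumption.

```latex
p(q \mid \theta_l)
  = \frac{\left(\sum_{i=1}^{|V|} q_i\right)!}{\prod_{i=1}^{|V|} q_i!}
    \prod_{i=1}^{|V|} \theta_{l,i}^{\,q_i},
\qquad
p(d_l \mid \theta_l)
  = \frac{n_l!}{\prod_{i=1}^{|V|} d_{l,i}!}
    \prod_{i=1}^{|V|} \theta_{l,i}^{\,d_{l,i}},
\qquad
n_l = \sum_{i=1}^{|V|} d_{l,i}
```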

For each document (d_l), the parameter θ_l is estimated by treating the document as observed data. The relevance of the document (d_l) to the query (q) is then measured by the generation probability of q computed with the language model estimated from d_l. The key step is therefore to estimate the multinomial parameters θ_l from the given document (d_l). The simplest approach is maximum likelihood estimation, given by equation (4) below.
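As an illustration of this estimation step, the following minimal sketch computes a maximum likelihood unigram model for a document and the log generation probability of a query under it. It assumes the standard maximum likelihood estimate for a multinomial (term count divided by document length), since equation (4) is not reproduced in this text. The function names and the toy document are hypothetical, and the multinomial coefficient is dropped because it does not depend on the document.

```python
import math
from collections import Counter


def mle_unigram(doc_tokens):
    """Maximum likelihood estimate: theta_i = count_i / document length."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def query_log_likelihood(query_tokens, theta):
    """Log probability of the query tokens under the unigram model theta.

    Returns -inf as soon as a query word has zero probability.
    """
    log_prob = 0.0
    for word in query_tokens:
        p = theta.get(word, 0.0)
        if p == 0.0:
            return float("-inf")
        log_prob += math.log(p)
    return log_prob


# Hypothetical toy document, for illustration only.
doc = "the proximity model ranks the document".split()
theta = mle_unigram(doc)
print(query_log_likelihood(["proximity", "model"], theta))
print(query_log_likelihood(["unseen"], theta))
```

A query word that never occurs in the document drives the likelihood to zero, which is why the unsmoothed estimate is rarely used on its own.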

To deal with rare words and the data sparsity problem, various smoothing methods are usually applied to the estimate. Smoothing methods commonly interpolate the document-estimated language model with the collection-estimated language model [29].

<Integration with Proximity Information>

Next, the method of integrating the proximity information between query terms in a document into the unigram language model is described. When composing a query, a user tends to use several terms together to express a specific concept. The proximity information between terms indicates the closeness of the query terms that occur in a document. The closer together the terms occur in a document, the more likely they are to be topically related and the more likely they are to express the concept that the user's query is intended to convey.

For example, suppose there are two documents (d_a, d_b) that are identical in every other respect, except that the query terms (q) occur closer to one another in d_a than in d_b. Then d_a is considered more relevant to the query than d_b. In other words, if θ_a and θ_b are the language models estimated from d_a and d_b, respectively, the probability that q is generated from θ_a is considered higher than the probability that it is generated from θ_b.

To this end, the proximity centrality score of a query term (w_i) is treated as a weight on the emission probability (θ_l,i) of that query term. That is, the emission probability estimates of different terms are tied to one another and are proportional to the proximity centrality scores of the respective terms. Here, the proximity centrality of a term means the proximity between that term and all the other terms in the document; it is defined below.

In the present invention, knowledge or belief about the uncertainty of the parameters is expressed through a prior distribution, on the basis of Bayesian analysis. To express this belief, the conjugate of the distribution governed by the parameters is used; the natural conjugate of the multinomial distribution is the Dirichlet distribution.

Specifically, when B is a proximity centrality computation model and Prox_B(w_i) is the proximity centrality of the term (w_i) computed by it, a Dirichlet prior on θ_l is used with the hyperparameters u = (u_1, u_2, ..., u_|V|).

In equation (5), u_i = λ·Prox_B(w_i), and the normalization constant does not depend on the parameter θ_l. The Dirichlet distribution is a distribution over probabilities. In the present invention, the Dirichlet prior is used to express the belief that the estimated probabilities of the matching words in a document should be tied to one another so as to reflect their proximity structure.

Then, based on the document (d_l) and the proximity computation model, the posterior of θ_l can be estimated as follows.

By the property of the natural conjugate distribution, equation (6) is again a Dirichlet distribution, Dir(θ_l | u + d), where u + d = (d_l,1 + u_1, ..., d_l,|V| + u_|V|). The prior reflects the existing belief about the weights (θ_l), whereas the posterior of θ_l reflects the belief about θ_l as updated after examining the frequency information in the data (d_l). In other words, the posterior concentrates on a point that represents a compromise between the prior belief and the data.

Given the posterior distribution, the estimate of the word emission probabilities can be expressed by equation (7).
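Equation (7) itself is not reproduced in this text. As a hedged reconstruction, the mean of a Dirichlet posterior Dir(θ_l | u + d) is a standard result. The second equality additionally assumes that the hyperparameters take the form u_i = λ·Prox_B(w_i), which is inferred from the later description of λ as the weight of the proximity prior relative to the observed word counts.

```latex
\hat{\theta}_{l,i}
  = \frac{d_{l,i} + u_i}{\sum_{j=1}^{|V|} \left( d_{l,j} + u_j \right)}
  = \frac{d_{l,i} + \lambda\,\mathrm{Prox}_B(w_i)}
         {n_l + \lambda \sum_{j=1}^{|V|} \mathrm{Prox}_B(w_j)},
\qquad
n_l = \sum_{j=1}^{|V|} d_{l,j}
```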

In empirical Bayesian analysis, the hyperparameters (u) of the Dirichlet prior can still be estimated from the data, even though this runs counter to the intuitive meaning of "prior". Specifically, for each document (d_l), the corresponding proximity centrality model (B_l) is computed from d_l with respect to the given query, according to one of the specific measures introduced below. This B_l is then used to compute the word emission probabilities in equation (7). In this way, the proximity information within the document is incorporated into the estimated unigram document language model.

In the estimated document model above, the proximity information can be regarded as having been converted into word count information, which is the main quantity that a unigram language model is able to model. From another point of view, on the basis of the proximity model (B_l) of the document (d_l) with respect to the query (q), the document can be regarded as being transformed from its "bag of words" form into a pseudo "bag of words" document form (d_Bl).

In the pseudo "bag of words" document (d_Bl), the frequency of a matching term changes from d_l,i to d_l,i + λ·Prox_B(w_i), and the total document length changes from n_l to n_l + Σ_i λ·Prox_B(w_i). The problem of ranking d_l for the query (q) then becomes the problem of ranking d_Bl. Consequently, any language model based on the "bag of words" assumption (such as the query likelihood model or the KL divergence model) can be applied to d_Bl, and as a result the proximity information within the document is integrated in an internal manner.
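The pseudo-document view can be sketched in a few lines. This is a minimal illustration under the assumption that each matching term's count is increased by λ times its proximity centrality; the function name, the toy counts, and the toy centrality scores are hypothetical, and non-query terms are given a centrality of 0 as stated later in this description.

```python
def pseudo_document(counts, prox, lam):
    """Turn raw term counts into pseudo counts that absorb proximity.

    counts: term -> raw frequency in the document
    prox:   term -> proximity centrality (missing terms count as 0)
    lam:    weight of the proximity information relative to the counts
    """
    pseudo = {term: count + lam * prox.get(term, 0.0)
              for term, count in counts.items()}
    pseudo_length = sum(pseudo.values())
    return pseudo, pseudo_length


# Hypothetical toy values, for illustration only.
counts = {"proximity": 2, "model": 1, "the": 5}
prox = {"proximity": 1.0, "model": 0.5}

pseudo, length = pseudo_document(counts, prox, lam=6.0)
# Any bag-of-words estimator can now be applied to the pseudo counts.
theta = {term: count / length for term, count in pseudo.items()}
print(pseudo, length)
```

Because the proximity information is folded into the counts, the downstream estimator does not need to know that proximity was used at all.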

Although the Dirichlet distribution is used in one embodiment of the present invention, it is noted that other probability distributions may also be used.

Next, the present invention provides a proximity integrated ranking function within the KL divergence language modeling framework. To handle words that are not seen in a document, the proximity integrated language model is smoothed with the collection language model, p(w_i|C). For example, a collection-based Dirichlet prior, (μ·p(w_1|C), ..., μ·p(w_|V||C)), is applied for smoothing. This yields the smoothed proximity integrated estimate of equation (8) below.

The ranking function can then be expressed by equation (9) below.

Following the standard derivation for smoothing methods, the above expression can be rewritten as follows.

Here, p_s(w_i | d_l, u) is the probability of a word seen in the document (d_l) under the proximity model B_l, and α_d·p(w_i|C) is the probability of a word not seen in d_l.
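Equations (8) and (9) are not reproduced in this text, so the following is only a sketch of how a smoothed, proximity-integrated score might be computed. It assumes that the smoothed estimate adds λ·Prox_B(w) and μ·p(w|C) to each raw count, and it scores by query log-likelihood, which ranks documents in the same order as the KL divergence model when the query model is the maximum likelihood estimate. All names and toy values are hypothetical.

```python
import math


def smoothed_prob(word, counts, prox, coll_prob, lam, mu):
    """Assumed form: (count + lam*prox + mu*p(w|C)) / (n + lam*sum(prox) + mu)."""
    doc_len = sum(counts.values())
    prox_mass = lam * sum(prox.values())
    numerator = (counts.get(word, 0)
                 + lam * prox.get(word, 0.0)
                 + mu * coll_prob[word])
    return numerator / (doc_len + prox_mass + mu)


def score(query, counts, prox, coll_prob, lam=6.0, mu=2000.0):
    """Query log-likelihood under the smoothed proximity-integrated model."""
    return sum(math.log(smoothed_prob(w, counts, prox, coll_prob, lam, mu))
               for w in query)


# Two hypothetical documents with identical counts; the query terms x and y
# are assumed to occur closer together in d_a than in d_b.
coll_prob = {"x": 0.001, "y": 0.001, "z": 0.998}
counts = {"x": 1, "y": 1, "z": 98}
prox_a = {"x": 1.0, "y": 1.0}
prox_b = {"x": 0.1, "y": 0.1}

score_a = score(["x", "y"], counts, prox_a, coll_prob)
score_b = score(["x", "y"], counts, prox_b, coll_prob)
print(score_a > score_b)
```

With everything else equal, the document whose query terms have higher proximity centrality receives the higher score, which is the behavior motivated earlier with the documents d_a and d_b.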

In the above, the main role of the proximity factor is to adjust the parameters of the seen words in the document that match the query. This is conceptually very different from the collection-based prior used for smoothing, which assigns weight to the words not seen in the document.

2. Next, the term proximity measures are described in detail:

The key concept of the proximity language model (PLM) is the proximity centrality of a term, Prox_B(w_i). For a given query, proximity centrality expresses how important a term is in forming the overall proximity structure within a particular document.

The proximity centrality score of a non-query term is assumed to be 0. For a query term, proximity centrality is computed with a proximity measure that reflects how close the term is to the other query terms in the same document. Most previous work on proximity modeling, however, computes a single overall proximity score for a document with respect to a query; there is no well-established measure for computing a proximity score, or proximity centrality, for a specific individual term. The embodiments of the present invention develop several measures that can provide such term-specific centrality scores.

<Measuring Proximity via Pair Distance>

An intuitive way to capture the proximity of a term is to measure the distance between that query term and the other query terms in the document. The shorter its distance to the other terms, the more the term belongs to a region of high proximity, and the higher the proximity score it should receive. Implementing this idea involves two key points: first, how to define the distance between different terms in a document, and second, how to design a nonlinear function that maps that distance to the proximity score of the term.

First, the distance between a pair of terms in a document is defined. It can be measured from the positions at which the terms occur in the document. The main difficulty, however, is that each term of a pair may occur many times in the document.

For example, let Q be the set of distinct query terms contained in the query, and let each term (Q_i) have an associated set of positions at which it occurs in the document (D).

Let Dis(x, y; D) denote the pairwise distance between two terms, or between two term occurrences, in the document (D). The pairwise term distance is defined as the distance between the closest occurrences of the two terms in the document.

Here, |D| denotes the length of the document (D), and the remaining count symbol denotes the number of occurrences of the query term (Q_i) in D. The pair distance is symmetric, that is, Dis(Q_i, Q_j; D) = Dis(Q_j, Q_i; D).
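The closest-occurrence distance can be sketched as follows. The defining equation is not reproduced in this text, so the fallback to the document length |D| when one of the terms does not occur is an assumption suggested by the mention of |D| above; the function names and the toy token sequence are hypothetical.

```python
def positions(tokens, term):
    """All positions (0-based) at which term occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == term]


def pair_distance(positions_x, positions_y, doc_len):
    """Distance between the closest occurrences of two terms.

    Assumption: if either term is absent, the document length is returned
    as the largest possible distance.
    """
    if not positions_x or not positions_y:
        return doc_len
    return min(abs(px - py) for px in positions_x for py in positions_y)


# Hypothetical toy document, for illustration only.
tokens = "a x b c y d x e e y".split()
dist_xy = pair_distance(positions(tokens, "x"), positions(tokens, "y"), len(tokens))
dist_yx = pair_distance(positions(tokens, "y"), positions(tokens, "x"), len(tokens))
print(dist_xy, dist_yx)
```

The brute-force minimum over all position pairs is quadratic in the number of occurrences; a merge over two sorted position lists would be the usual choice for long documents.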

With the pairwise distance defined, a function is needed that converts the distance into a pairwise term proximity. The transformation function plays a very important role in setting the scale of the proportional ratio between the proximity scores of different matching terms. The following exponential function was used and tested, where data denotes the input distance and para denotes a parameter that controls the scale of the proximity scores between terms.

Accordingly, the pairwise term proximity can be expressed as follows:
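The two expressions announced here are not reproduced in this text. A reconstruction consistent with the rest of the description, in which the worked example uses f = 1.5^(-dist) and para is called the exponential weight of equation (15), would read as follows. The symbol Prox(Q_i, Q_j; D) for the pairwise term proximity is an assumed notation.

```latex
f(\mathit{data}) = \mathit{para}^{-\mathit{data}},
\qquad
\mathrm{Prox}(Q_i, Q_j; D) = f\big(\mathrm{Dis}(Q_i, Q_j; D)\big)
  = \mathit{para}^{-\mathrm{Dis}(Q_i, Q_j; D)}
```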

<Computation of Term's Proximity Centrality>

Based on the above, measures of the proximity centrality of a term are now defined. The present invention proposes and tests the following measures.

(1) Term Proximity based on Minimum Distance

In this embodiment, the proximity score of a query term is obtained by applying the proximity transformation function to the term's minimum pair distance, that is, the smallest distance between the query term and the other terms with which it forms pairs. This measure is denoted P_MinDist and is expressed as follows.

Here, f is the nonlinear monotonic function defined in equation (15).

(2) Term Proximity based on Average Distance

In this embodiment, the term's average pair distance, that is, the average distance between the term and the other terms with which it forms pairs, is used instead of the minimum distance to model the proximity of the term. This measure is denoted P_AveDist.

Here, n is the number of query terms that appear in D.

(3) Term Proximity Summed over Pair Proximity

In this embodiment, the proximity of a term is modeled as the sum of all the pairwise proximities that the term forms. This measure is denoted P_SumProx and is expressed as follows.

P_SumProx first converts each pair distance into a pair proximity score with the nonlinear function and then sums all the pairwise proximity scores.

For example, suppose there are four matching terms A, B, C, and D, with the pairwise distances between them as shown in FIG. 1. If the proximity transformation function is f = 1.5^(-dist), the three term proximity measures above give the proximity centrality scores shown in Table 1.

[Table 1: Term proximity scores computed by the different measures]

Measure      Prox(A)   Prox(B)   Prox(C)   Prox(D)
P_MinDist    1         1         0.07      0.13
P_AveDist    0.11      0.07      0.04      0.06
P_SumProx    1.17      1.10      0.11      0.23
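The three measures can be sketched together as follows. The defining equations are not reproduced in this text, so the code follows the verbal descriptions only: P_AveDist is assumed to average the distances from a term to each of the other query terms, and the transformation is assumed to be f(dist) = para^(-dist). The pairwise distances below are hypothetical and are not the distances of FIG. 1, so the output is not expected to match Table 1.

```python
def f(dist, para=1.5):
    """Assumed proximity transformation: para ** (-dist)."""
    return para ** (-dist)


def build_symmetric(pairs):
    """Expand {(a, b): distance} into a symmetric nested dictionary."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist


def p_min_dist(dist, term, para=1.5):
    """Transform the smallest distance from term to any other query term."""
    return f(min(dist[term].values()), para)


def p_ave_dist(dist, term, para=1.5):
    """Transform the average distance from term to the other query terms."""
    others = list(dist[term].values())
    return f(sum(others) / len(others), para)


def p_sum_prox(dist, term, para=1.5):
    """Transform every pair distance first, then sum the proximities."""
    return sum(f(d, para) for d in dist[term].values())


# Hypothetical pairwise distances between four matching terms.
dist = build_symmetric({
    ("A", "B"): 1, ("A", "C"): 5, ("A", "D"): 4,
    ("B", "C"): 6, ("B", "D"): 5, ("C", "D"): 7,
})

for term in "ABCD":
    print(term, p_min_dist(dist, term), p_ave_dist(dist, term), p_sum_prox(dist, term))
```

Since f is decreasing, P_AveDist can never exceed P_MinDist for the same term, and P_SumProx is never below P_MinDist because its sum includes the closest pair.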

3. Examples

<Data Sets and Experimental Setup>

The model of the present invention described above was evaluated on TREC data sets and on the OHSUMED collection (MEDLINE database abstracts, 1987-1991) [12], which was used in the TREC-9 filtering track. The TREC data sets are the well-known ad hoc collections AP88 (Associated Press News, 1988), WSJ90-92 (Wall Street Journal, 1990-92), and WSJ87-92 (Wall Street Journal, 1987-92). The details of these collections are given in Table 2.

[Table 2: Data set statistics]

Collection   # Docs    Length   Queries         # Qrels
AP88         79919     488      TOPIC251-300    1672
WSJ90-92     74520     514      TOPIC251-300    1064
WSJ87-92     173252    473      TOPIC151-200    3913
OHSUMED      348566    29       Ohsumed topic   3875

The three ad hoc TREC collections and their queries contain heterogeneous (polymorphic) documents and topics from different domains. The OHSUMED collection, in contrast, contains homogeneous (monomorphic) documents and topics that can all be regarded as belonging to one specific technical field (medicine).

To evaluate the effect of different queries on similar collections, two TREC topic sets, TOPIC 251-300 and TOPIC 151-200, were used on two overlapping WSJ document collections of different sizes. For the TREC ad hoc topics, only the "title" field of each topic was used as the query; for the OHSUMED topics, only the "description" field was used. The OHSUMED queries are inherently more verbose than the TREC topics. Relevance judgments for the OHSUMED collection were based on interactive search results that medical librarians and physicians marked as "definitely relevant" or "possibly relevant". In these examples, no distinction was made between definitely relevant and possibly relevant documents; both were treated as relevant.

All examples in this specification used the Lemur toolkit. All documents and queries in the collections were tokenized with a simple tokenizer that treats every symbol other than an English letter as a delimiter. Stemming was applied using Porter's algorithm. Stopwords were not removed at index time; instead, a minimal stopword list was used to remove the most frequent stopwords at query time.
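The preprocessing described above can be approximated with a short sketch. This is not the Lemur toolkit's implementation: the stopword list is a hypothetical stand-in for the unspecified minimal list, lowercasing is an added assumption, and Porter stemming is omitted to keep the example dependency-free.

```python
import re

# Hypothetical minimal stopword list; the actual list is not given in the text.
STOPWORDS = {"the", "of", "a", "and", "in", "to"}


def tokenize(text):
    """Split on every character that is not an English letter."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


def index_terms(text):
    """Index time: keep every token, including stopwords."""
    return tokenize(text)


def query_terms(text):
    """Query time: drop the most frequent stopwords."""
    return [token for token in tokenize(text) if token not in STOPWORDS]


print(index_terms("Wall-Street Journal, 1990"))
print(query_terms("The cost of insurance"))
```

Keeping stopwords in the index and filtering only at query time makes it possible to rerun the same index with and without stopword removal.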

Precision at the top ranks and average precision were used to evaluate the experimental results. Specifically, the metrics Pr@5, Pr@10, and MAP were used, which denote the precision over the top 5 documents, the precision over the top 10 documents, and the mean average precision, respectively.
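For reference, the three metrics can be computed as in the sketch below, which uses the common textbook definitions of precision at k and (mean) average precision. The function names, document identifiers, and relevance judgments are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical ranking and judgments, for illustration only.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}

print(precision_at_k(ranked, relevant, 5))
print(average_precision(ranked, relevant))
```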

<Baseline Models>

The performance of the proximity integrated language model (abbreviated PLM) was compared with that of the basic KL divergence language model (abbreviated LM), and also with the work in reference [27] that combines a language model with proximity (abbreviated LLM).

As shown in equation (1) above, reference [27] combines the proximity score of a document with the relevance score of the language model through an external linear combination at the document level. In the examples of the present invention, the minimum pair distance measure is used for the document proximity distance δ(Q, D), which achieved the best performance in [27]. Such a simple combination, however, does not take into account the scales and weights of the proximity score and the relevance score, and is therefore sensitive to the surface form of the formula used for the language model. For example, compare the following two derived forms of the KL language model:

The two expressions behave similarly for ranking, but because their score scales differ, they behave very differently when combined directly with a proximity score. In these examples, equation (21) was used as the KL ranking formula, although equation (20) is the more common form.

<Parameter Setting>

The following parameters were set for each model. In LM, the most important parameter is the prior collection sample size (μ). In all experiments, the prior collection sample size of reference [29] was set to 2000. The same value was used in LLM and PLM without further tuning.

In LLM, there is a parameter (γ) associated with the proximity model, as shown in equation (1) above. In these experiments, this parameter was set by optimizing performance on each test collection through a search of the parameter space.

The two most important parameters in PLM are λ and para. The parameter λ is the proximity argument for the proximity information, as shown in equation (7) above. It controls the proportional weight of the prior proximity factor relative to the word count information observed in the document. The parameter para is the exponential weight of the proximity transformation function, as shown in equation (15). It controls the numerical scale of the proximity scores of different terms; in other words, it controls the proportional ratio between the proximity scores of different query terms in a document. To set λ and para, the following parameter space is searched exhaustively to maximize performance on each collection.
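An exhaustive search over the two parameters can be sketched as a plain grid search. The actual parameter space is not reproduced in this text, so the grids below are hypothetical, and `toy_objective` is an arbitrary stand-in for running retrieval and measuring a metric such as MAP; it is not measured data.

```python
from itertools import product


def grid_search(evaluate, lambdas, paras):
    """Try every (lambda, para) pair and keep the best-scoring one.

    evaluate: callable (lam, para) -> score, higher is better
    Returns a (best score, best lambda, best para) tuple.
    """
    best = None
    for lam, para in product(lambdas, paras):
        current = evaluate(lam, para)
        if best is None or current > best[0]:
            best = (current, lam, para)
    return best


def toy_objective(lam, para):
    """Arbitrary placeholder with a single peak at lam=4.0, para=1.5."""
    return -((lam - 4.0) ** 2) - (para - 1.5) ** 2


# Hypothetical grids, for illustration only.
lambdas = [2.0, 4.0, 6.0, 8.0]
paras = [1.1, 1.3, 1.5, 1.7, 1.9]

best_score, best_lam, best_para = grid_search(toy_objective, lambdas, paras)
print(best_lam, best_para)
```

In a real experiment the callable would run retrieval over the whole topic set for each setting, so the size of the grid directly determines the cost of tuning.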

FIG. 2 shows the parameter sensitivity of the PLM on the test collections when P_MinDist is used as the term proximity centrality measure (the optimal parameter settings for P_AveDist and P_SumProx differ slightly from those for P_MinDist). Referring to FIG. 2, the best performance is obtained when the proximity argument (λ) is set to about 6, except on WSJ90-92. In addition, with this term proximity measure, the optimal value of λ is relatively insensitive to changes in para. However, para, the exponential weight that controls the ratio between different terms, affects ranking performance to some extent. For the three ad hoc collections, a relatively large para value of about 1.7 is preferable, whereas a smaller value is preferable for the OHSUMED collection.

<Performance Comparison>

Table 3 shows the best performance obtained with each ranking model. In Table 3, "PLM(P_MinDist)" denotes the PLM that uses P_MinDist as the term proximity measure.

First, the performance of the PLM is compared across the different term proximity centrality measures. In Table 3, an asterisk (*) marks a result that is significant at the 0.05 level under the Wilcoxon test relative to the baseline LM. Overall, P_SumProx and P_MinDist outperformed P_AveDist, and P_SumProx performed slightly better than, but similarly to, P_MinDist. Note that the results in Table 3 were obtained with stop words removed from the queries, as described in <Data Set and Setup of the Embodiment> above. The results may differ slightly when stop words are not removed; the results for that case are described below.

[Table 3: Performance Comparison by Ranking Model]

Model / Measure      AP88                          WSJ90-92
                     MAP      PR@5     PR@10       MAP      PR@5     PR@10
LM                   0.2070   0.3120   0.2620      0.1534   0.2160   0.1860
LLM                  0.2123   0.3280   0.2720      0.1589   0.2280   0.1940
PLM(P_MinDist)       0.2178   0.3440   0.2780      0.1751*  0.2320   0.1940
PLM(P_AveDist)       0.2167   0.3360   0.2740      0.1603   0.2320   0.1900
PLM(P_SumProx)       0.2203   0.3440   0.2840      0.1735*  0.2360*  0.2000*

Model / Measure      WSJ87-92                      OHSUMED
                     MAP      PR@5     PR@10       MAP      PR@5     PR@10
LM                   0.3351   0.5440   0.4960      0.2704   0.4889   0.4698
LLM                  0.3457   0.5640   0.5120      0.2651   0.4857   0.4587
PLM(P_MinDist)       0.3474   0.5680   0.5100      0.2983*  0.5365*  0.5095*
PLM(P_AveDist)       0.3446   0.5600   0.5060      0.2954*  0.5238*  0.5095*
PLM(P_SumProx)       0.3493   0.5720   0.5120      0.2984*  0.5397*  0.5154*
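As a reading aid, the relative MAP change of each model over the baseline LM can be computed directly from the MAP values reported in Table 3. The short script below copies those values for the LM, the LLM, and PLM(P_SumProx); the helper name `relative_gain` is illustrative.

```python
# MAP values copied from Table 3 (LM, LLM, and PLM with P_SumProx).
MAP = {
    "AP88":     {"LM": 0.2070, "LLM": 0.2123, "PLM(P_SumProx)": 0.2203},
    "WSJ90-92": {"LM": 0.1534, "LLM": 0.1589, "PLM(P_SumProx)": 0.1735},
    "WSJ87-92": {"LM": 0.3351, "LLM": 0.3457, "PLM(P_SumProx)": 0.3493},
    "OHSUMED":  {"LM": 0.2704, "LLM": 0.2651, "PLM(P_SumProx)": 0.2984},
}

def relative_gain(baseline, score):
    """Relative change of score over baseline, in percent."""
    return 100.0 * (score - baseline) / baseline

gains = {
    collection: {
        model: relative_gain(scores["LM"], value)
        for model, value in scores.items()
        if model != "LM"
    }
    for collection, scores in MAP.items()
}

for collection, by_model in gains.items():
    for model, gain in by_model.items():
        print(f"{collection:9s} {model:15s} {gain:+.2f}%")
```

The sign of the LLM entry for OHSUMED is negative, which matches the observation that the LLM falls below the LM on that collection.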

Next, the PLM, the proximity-integrated model, is compared with the LM and with the model that uses external linear score combination (LLM), as described in <Baseline Models> above. The LLM outperformed the basic language model on the three ad hoc collections but brought no improvement on the OHSUMED collection, which has more verbose topics; in fact, it performed worse than the LM there. The PLM outperformed both the LLM and the basic language model in terms of top precision and average precision. The PLM also performed very well on the verbose topics of OHSUMED.

Compared with the baseline models above, the main strengths of the proximity-integrated language model are as follows. In the external linear score combination method, the proximity score and the relevance score are combined in a post-processing manner, whereas in the PLM the proximity factor is integrated in a pre-processing manner. Moreover, with external score combination, the relevance score and the proximity score of a document are calculated separately. Because the two scores are entirely different in nature, they can only be combined in a highly intuitive way. Most importantly, when the proximity score is used externally at the document level, the proximity information contributed by different terms is mixed together and the total proximity information contained in the document is reduced. In the PLM, by contrast, proximity information is combined with document statistics such as term frequency in a distributed way, and the proximity information contributed by each query term is given a corresponding weight.

<The Influence of Stop Words in Queries>

It is well known that removing stop words generally helps information retrieval. In practice, however, it is often preferable not to remove stop words at all. A good ranking function should also perform well when the stop words in a user query are taken into account. Stop words can strongly affect proximity modeling: because they usually occur frequently in a document, they are likely to lie close to other words in that document. Since stop words can weaken the proximity mechanism, it is very important in proximity modeling to test whether a model can withstand their influence.

In the present invention, all queries containing at least one word from an existing stop word list were extracted from TOPIC251-300, yielding 23 queries in total. The different ranking models were then tested with these queries on the AP88 and WSJ90-92 collections. No stop word list was applied to indexing or to the query phrases in this test.
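The query selection step can be sketched as a simple filter that keeps every query containing at least one stop word. The query identifiers, query texts, and stop word list below are hypothetical toy data; the actual topics and the stop word list used in the experiment are not reproduced here, and whitespace tokenization is an assumption.

```python
def queries_with_stop_words(queries, stop_words):
    """Return the subset of queries that contain at least one stop word.

    queries maps a query id to its text; tokens are split on whitespace
    and compared case-insensitively (both are simplifying assumptions).
    """
    stop = {w.lower() for w in stop_words}
    return {
        qid: text
        for qid, text in queries.items()
        if any(token in stop for token in text.lower().split())
    }

# Toy data, for illustration only.
toy_queries = {
    "q1": "history of search engines",
    "q2": "proximity language model",
    "q3": "The role of smoothing",
}
toy_stop_words = ["of", "the"]

selected = queries_with_stop_words(toy_queries, toy_stop_words)
```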

[Table 4: Performance Comparison Considering Stop Words in Queries]

Model / Measure      MAP      Pr@5     Pr@10

AP88
LM                   0.1537   0.2957   0.2348
LLM                  0.1422   0.2609   0.2435
PLM(P_MinDist)       0.1575   0.2870   0.2391
PLM(P_AveDist)       0.1558   0.3030   0.2479
PLM(P_SumProx)       0.1607   0.2957   0.2565

WSJ90-92
LM                   0.1072   0.1652   0.1348
LLM                  0.1017   0.1391   0.1217
PLM(P_MinDist)       0.1139   0.1652   0.1391
PLM(P_AveDist)       0.1080   0.1739   0.1348
PLM(P_SumProx)       0.1158   0.1665   0.1522

Table 4 shows the performance of the ranking models on queries that contain stop words. Referring to Table 4, the LLM performs poorly when the stop words in the queries are taken into account, falling below the basic language model on both the AP88 and WSJ90-92 collections. The PLM, on the other hand, performed somewhat better than the basic language model.

Comparing the different term proximity measures used in the PLM shows that the improvement of P_MinDist over the LM drops slightly and that P_SumProx performs considerably better than P_MinDist. The problem with P_MinDist is that when a query contains a stop word such as "of", that word, because of its high frequency, occurs very close to each content word of the query. As a result, some query terms may be wrongly assigned high proximity scores. Overall, P_SumProx is the most suitable measure for the PLM and improves on the basic language model whether or not stop words are removed.
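The stop word effect on P_MinDist can be illustrated with a small sketch of the three measures. Several details are assumptions made for illustration and are not confirmed by this section: the pairwise distance is taken as the smallest positional gap between occurrences of two query terms, a term absent from the document is given a distance equal to the document length, and the proximity transformation is taken as para ** (-distance). The toy document and queries are hypothetical.

```python
from itertools import product

def term_positions(tokens, term):
    """All positions at which term occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def pair_distance(tokens, a, b):
    """Smallest positional gap between any occurrence of a and any occurrence of b.

    Falls back to the document length when either term is absent (assumption).
    """
    pos_a, pos_b = term_positions(tokens, a), term_positions(tokens, b)
    if not pos_a or not pos_b:
        return len(tokens)
    return min(abs(i - j) for i, j in product(pos_a, pos_b))

def transform(distance, para=1.7):
    """Assumed proximity transformation function: para ** (-distance)."""
    return para ** (-distance)

def proximity_scores(tokens, query, para=1.7):
    """Per-term proximity centrality; the query needs at least two distinct terms."""
    scores = {}
    for q in query:
        dists = [pair_distance(tokens, q, other) for other in query if other != q]
        scores[q] = {
            "P_MinDist": transform(min(dists), para),
            "P_AveDist": transform(sum(dists) / len(dists), para),
            "P_SumProx": sum(transform(d, para) for d in dists),
        }
    return scores

# Toy document: "bank" and "america" are far apart, but "of" sits next to both.
tokens = "the bank of the river is far from the office of america".split()

with_stop = proximity_scores(tokens, ["bank", "of", "america"])
without_stop = proximity_scores(tokens, ["bank", "america"])
```

In this toy case the P_MinDist score of "bank" rises sharply once "of" is part of the query, even though the two content words remain ten positions apart, which is the kind of spurious high score described above.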

In summary, the present invention provides a novel method of integrating a proximity factor into the unigram language model for information retrieval and information extraction. The novelty of the method lies in deriving the emission probability for each word unit, applying it in the proximity model, and then using that proximity model in the overall ranking function. In addition, the method can be viewed as treating the proximity centrality of the query terms as Dirichlet hyperparameters that weight the parameters of the multinomial document language model. The present invention rests on a solid mathematical foundation and outperformed the existing methods. Most importantly, it performed well not only on simple keyword queries but also on verbose queries and on queries containing stop words.

The scope of protection of the present invention is not limited to the embodiments explicitly described above. The probability distributions and ranking functions illustrated in the present invention may be replaced by other equivalent methods and may be modified in various ways. Furthermore, the scope of protection of the present invention is not limited by changes or substitutions that are obvious in the technical field to which the present invention belongs.

Claims

An information retrieval method for providing search results to a client by ranking documents d_l from a large document database with respect to a query q that is input by the client and includes a plurality of search terms, the method comprising:
(a) calculating, for each word included in a document, an emission probability in word units;
(b) obtaining a proximity centrality score between the query terms in the document and combining it with the emission probability, thereby calculating an emission probability in word units converted by the proximity model; and
(c) determining the rank of each document by using the emission probability in word units obtained in step (b) in the overall ranking function of the document, and displaying the search results on the screen of the client according to the determined rank; the information retrieval method using a proximity language model being characterized by proximity modeling based on the emission probability in word units.

The method of claim 1,
wherein step (a) includes representing the query and the document as multinomial term count vectors and using multinomial parameters.

The method of claim 1,
wherein calculating the proximity centrality score in step (b) comprises measuring the pairwise distance between the query terms.

The method of claim 3,
wherein the pairwise distance of the query terms is the minimum distance between pairs.

The method of claim 3,
wherein the pairwise distance of the query terms is the average distance between query terms.

The method of claim 3,
wherein the pairwise distances formed by all query terms in the document are each converted into a proximity centrality score by a proximity transformation function and then summed.

delete

An information extraction method for extracting information by ranking documents from a web or document database with respect to a query that is input by a client and includes a plurality of query terms, the method comprising:
(a) calculating, for each word included in a document, an emission probability in word units;
(b) obtaining a proximity centrality score between the query terms in the document and combining it with the emission probability, thereby calculating an emission probability in word units converted by the proximity model; and
(c) determining the rank of each document by using the emission probability in word units obtained in step (b) in the overall ranking function of the document, and extracting information according to the determined rank; the information extraction method using a proximity language model being characterized by proximity modeling based on the emission probability in word units.

A computer-readable recording medium having recorded thereon a computer program for executing the method of any one of claims 1 to 6 and 9, the recording medium comprising a module for calculating the emission probability in word units converted by the proximity model.