KR102435849B1

KR102435849B1 - Method for providing search result for non-text based object in documents

Info

Publication number: KR102435849B1
Application number: KR1020210000464A
Authority: KR
Inventors: 박승범; 장수현; 안근진
Original assignee: 호서대학교 산학협력단; 주식회사 리빈에이아이
Priority date: 2021-01-04
Filing date: 2021-01-04
Publication date: 2022-08-25
Also published as: KR20220099160A

Abstract

Provided is a method for searching for non-text-based objects included in documents, such as tables, images, and graphs, using an artificial intelligence-based search model.
A document gives each non-text-based object, such as a table, image, or graph, a description, and specific paragraphs within the document also describe that object. A passage for each object can therefore be generated by collecting this text-based information. The title or abstract of the document may also be included in the passage. The text passages generated in this way can be used to search for the tables, images, graphs, and other non-text-based objects included in the documents.
When an AI-based search model is used, better search results can be obtained than when only a search model based on unsupervised learning is used.

Description

Method for providing search result for non-text based object in documents

The present invention relates to a method for providing search results for non-text-based objects included in documents and, more specifically, to a method for providing search results for non-text-based objects such as tables, images, and graphs included in documents by using an information retrieval model trained on the basis of a weak supervision methodology.

Search technology has developed continuously since Google introduced its PageRank technique, which is founded on graph theory. Such search technology is based on unsupervised learning, so search is possible given nothing more than a document corpus. BM25 is a representative search model based on unsupervised learning, and it shows markedly improved performance when used together with the query expansion technique RM3. Among open-source implementations, Anserini is widely used in academia and in practice.

Meanwhile, as academic research has sought to apply artificial intelligence techniques to natural language processing, a variety of search models have been proposed. For example, deep learning-based search models such as DRMM, KNRM, and PACRR have been proposed. BERT, announced by Google in 2018, performed well across many natural language processing tasks, and research on using it as a transformer-based or language model-based search model has followed.

The Ad-Hoc Information Retrieval page of Papers With Code, a website that presents AI models with published open-source code for each field, shows the current state of the art (SOTA), that is, the best-performing search models, among AI-based search models and Anserini, a search model based on unsupervised learning. According to Jimmy Lin, a researcher at the University of Waterloo in Canada, the deep learning search models that preceded BERT, such as DRMM, KNRM, and PACRR, performed about the same as or worse than Anserini, a search model based on an unsupervised learning methodology, whereas the models proposed after BERT outperform Anserini (see Lin, Jimmy. "The Neural Hype, Justified! A Recantation."). This can also be confirmed on the leaderboard of the Ad-Hoc Information Retrieval page of Papers With Code mentioned above. These academic results show that an AI-based search model can improve the accuracy of search results.

However, AI-based search models are subject to several constraints.

Before an AI-based search model can be used for inference, it must first be trained, and this training requires a large amount of labeled data. Labeled data basically has to be prepared by humans, and given the amount of data needed for training, the cost of labeling is too high to be economical.

Another problem is that, whereas search models based on unsupervised learning generally have no difficulty with long documents, most AI-based search models limit the length of the documents they can process. For example, BERT can process at most 512 tokens. This poses no problem when the search target is a corpus of short texts, but it makes the models difficult to apply to long documents such as patents and papers.

Meanwhile, user convenience would be improved if a method were provided for searching for objects such as tables, images, and graphs included in documents by using text-based information retrieval technology.

[1] https://paperswithcode.com/task/ad-hoc-information-retrieval
[2] MacAvaney, Sean, et al. "CEDR: Contextualized embeddings for document ranking." Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019.
[3] Dai, Zhuyun, and Jamie Callan. "Deeper text understanding for IR with contextual neural language modeling." Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019.

The present invention aims to solve the problems described above by providing a method for providing search results for objects such as tables, images, and graphs included in documents in response to a text query.

According to one aspect of the present invention, there is provided a computer-implemented method for searching for objects included in documents using an AI-based search model, the method comprising: (a) generating, for a specific object included in a document, a passage corresponding to the object, the passage including a description of the object and a paragraph in which the object is mentioned; (b) forming an object-related corpus by collecting the passages generated in step (a) from a plurality of documents; (c) retrieving, by a search model based on an unsupervised methodology, N passages corresponding to an input query from the corpus formed in step (b); (d) re-ranking, by the AI-based search model and on the basis of the input query, the N passages retrieved in step (c); and (e) outputting, as search results, the objects corresponding to the N passages re-ranked in step (d).

According to another aspect of the present invention, there is provided an apparatus for searching for objects included in documents using an AI-based search model, the apparatus comprising: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions stored in the at least one memory, when executed by the at least one processor, cause the following steps to be performed: (a) generating, for a specific object included in a document, a passage corresponding to the object, the passage including a description of the object and a paragraph in which the object is mentioned; (b) forming an object-related corpus by collecting the passages generated in step (a) from a plurality of documents; (c) retrieving, by a search model based on an unsupervised methodology, N passages corresponding to an input query from the corpus formed in step (b); (d) re-ranking, by the AI-based search model and on the basis of the input query, the N passages retrieved in step (c); and (e) outputting, as search results, the objects corresponding to the N passages re-ranked in step (d).

According to yet another aspect of the present invention, there is provided a computer program for searching for objects included in documents using an AI-based search model, the computer program being stored in a non-transitory storage medium and causing a processor to perform: (a) generating, for a specific object included in a document, a passage corresponding to the object, the passage including a description of the object and a paragraph in which the object is mentioned; (b) forming an object-related corpus by collecting the passages generated in step (a) from a plurality of documents; (c) retrieving, by a search model based on an unsupervised methodology, N passages corresponding to an input query from the corpus formed in step (b); (d) re-ranking, by the AI-based search model and on the basis of the input query, the N passages retrieved in step (c); and (e) outputting, as search results, the objects corresponding to the N passages re-ranked in step (d).

According to the present invention, a method is provided for providing search results for objects included in documents in response to a text query.
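Steps (a) and (b) of the method can be illustrated with a minimal Python sketch. The class names, field names, and the sample document below are hypothetical and chosen only for illustration; the specification does not prescribe a data structure or a concatenation order.

```python
from dataclasses import dataclass, field


@dataclass
class DocObject:
    """A non-text object (table, image, graph) found in a document."""
    object_id: str                                      # hypothetical identifier
    caption: str                                        # description given to the object
    mentions: list[str] = field(default_factory=list)   # paragraphs that mention it


@dataclass
class Document:
    title: str
    abstract: str
    objects: list[DocObject]


def build_passage(doc: Document, obj: DocObject, with_context: bool = True) -> str:
    """Step (a): join the object's description and mentioning paragraphs,
    optionally preceded by the document title and abstract."""
    parts = [obj.caption, *obj.mentions]
    if with_context:
        parts = [doc.title, doc.abstract, *parts]
    return " ".join(p.strip() for p in parts if p.strip())


def build_corpus(docs: list[Document]) -> dict[str, str]:
    """Step (b): collect one passage per object across all documents."""
    return {o.object_id: build_passage(d, o) for d in docs for o in d.objects}


doc = Document(
    title="Battery aging study",
    abstract="We measure capacity fade.",
    objects=[
        DocObject(
            "doc1:fig2",
            "Fig. 2: capacity versus cycle count",
            ["As shown in Fig. 2, capacity drops after 500 cycles."],
        )
    ],
)
corpus = build_corpus([doc])
print(corpus["doc1:fig2"])
```

Keying the corpus by object identifier means that a retrieved passage can be mapped straight back to the table, image, or graph that should be shown as the search result.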

FIG. 1 is a flowchart illustrating a method for training an AI-based information retrieval model according to the present invention.
FIG. 2 is a flowchart illustrating an information retrieval method using the AI-based information retrieval model according to the present invention.
FIG. 3 is a diagram schematically illustrating an information retrieval model according to an example of the present invention.
FIG. 4 is a diagram schematically illustrating an apparatus for performing the information retrieval method using the AI-based information retrieval model according to the present invention.
FIG. 5 is a diagram comparing search results according to an embodiment of the present invention with the search results of another model.
FIG. 6 is a diagram comparing search results according to an embodiment of the present invention with the search results of yet another model.

Hereinafter, embodiments according to the present invention are described in detail with reference to the accompanying drawings. Identical or similar components are given identical or similar reference numerals, and redundant descriptions of them are omitted. In describing the embodiments disclosed herein, detailed descriptions of related known technologies are omitted where they are judged likely to obscure the gist of the embodiments. The accompanying drawings are intended only to aid understanding of the embodiments disclosed herein; they do not limit the technical idea disclosed herein, which should be understood to cover all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms including ordinals, such as "first" and "second", may be used to describe various components, but they are used only to distinguish one component from another, and the components are not limited by these terms. A singular expression includes the plural unless the context clearly indicates otherwise.

Terms such as "include", "comprise", and "have" as used herein should be understood as specifying the presence of the features, steps, components, or combinations thereof described in the specification, and not as excluding the possibility that one or more other features, steps, components, or combinations thereof are present or may be added.

FIG. 1 shows a method for training the AI-based information retrieval model according to the present invention, FIG. 2 shows the steps of performing inference using the AI-based information retrieval model according to the present invention, and FIG. 3 schematically shows an example of the information retrieval model according to the present invention. The present invention is described below with reference to these drawings.

[Training stage of the AI-based information retrieval model]

Information retrieval model

In this specification, information retrieval models are divided into 'search models based on an unsupervised methodology' and 'AI-based search models'. The former are information retrieval models based on statistical or other unsupervised learning methodologies, such as BM25 and QL (Query Likelihood). The latter are information retrieval models obtained through training, including the deep learning family, such as DRMM, KNRM, and PACRR, and the language model family, such as BERT.

In the present invention, the former, a search model based on an unsupervised learning methodology, is used in the training stage to prepare pseudo-labels and in the inference stage to retrieve passages from the corpus. The latter is used in the inference stage to re-rank the retrieved passages.

Information retrieval models with open-source implementations are listed on the Ad-Hoc Information Retrieval page of Papers With Code.

Search model based on an unsupervised learning methodology

BM25, QL (Query Likelihood), and others are known as search models based on an unsupervised methodology, and Anserini is widely used as an open-source implementation. Anserini is a search model that uses BM25 together with RM3, a query expansion methodology. Research indicates that Anserini performs about as well as deep learning search models such as DRMM, KNRM, and PACRR but worse than search models of the language model family such as BERT.

A search model based on an unsupervised learning methodology requires no training and determines the similarity between a query and a document on the basis of statistical theory and the like.
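As an illustration of how such a model scores a document from corpus statistics alone, the following is a minimal sketch of Okapi BM25. The IDF form with `1 +` inside the logarithm is one common variant, and `k1=1.2`, `b=0.75` are commonly cited defaults; neither is taken from this specification, and the function name and sample passages are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "table of battery capacity per cycle".split(),
    "image of the test bench".split(),
    "graph of capacity fade over time".split(),
]
scores = bm25_scores("battery capacity".split(), docs)
print(scores)
```

No labeled data is involved: every quantity in the score (term frequency, document frequency, document length) is counted directly from the corpus.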

AI-based search model

Known AI-based search models include deep learning search models such as DRMM, KNRM, and PACRR, and transformer-based or language model-based search models such as BERT. According to Papers With Code, the best-performing, that is, state-of-the-art (SOTA), search model as of the filing date of this patent appears to be CEDR, whose structure combines BERT, from the language model family, with a deep learning search model.

An AI-based search model is obtained by training it on training data to learn the similarity between queries and documents. In academic work, AI-based search models are trained on datasets that provide query-document relations, such as TREC, SQuAD, and MS MARCO.

Search models of the language model family, such as BERT, use a self-supervised learning methodology in the pre-training stage, but in the fine-tuning stage for information retrieval they, like deep learning search models, require supervised learning on a dataset of query-document relations.

Weak supervision methodology

Query-document relation datasets used in academic work, such as TREC, SQuAD, and MS MARCO, either constitute ground truth extracted from the search activity of real users or use document titles as queries. In practice, however, often only a document corpus exists, and queries entered by users to search it either do not exist or, if they do, are far too few to train an AI-based search model. The present invention provides a methodology that, when only a document corpus exists, generates pseudo-queries from each document, uses the generated pseudo-queries to form pseudo query-document relations and generate pseudo-labels, and uses these to train an AI-based search model. The weak supervision methodology for an AI-based search model according to the present invention broadly comprises the following steps: 1) generating pseudo-queries from each document in the document corpus; 2) forming pseudo query-document relations using the generated pseudo-queries and generating pseudo-labels on the basis of those relations; and 3) training the AI-based search model using the generated pseudo-labels. Each of these steps is described under its own heading below.

1) Generating pseudo-queries

One or more keywords are extracted from each document in the document corpus and taken as the pseudo-query. Known keyword extraction techniques are used to extract the keywords. Keyword extraction algorithms can likewise be broadly divided into techniques based on unsupervised learning and techniques based on supervised learning, and open-source implementations exist for many of them.
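As one possible illustration of the unsupervised route, the sketch below picks the highest-weighted TF-IDF terms of each document as its pseudo-query. TF-IDF is used here only as a stand-in for "a known keyword extraction technique"; the specification does not name a particular algorithm, and the function names, tokenizer, and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pseudo_queries(corpus, top_k=3):
    """Return, for each document, its top_k TF-IDF terms joined as a pseudo-query."""
    tokenized = [tokenize(t) for t in corpus]
    n_docs = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    queries = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Sort by weight (descending), then alphabetically so ties are deterministic.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        queries.append(" ".join(ranked[:top_k]))
    return queries


corpus = [
    "The table lists battery capacity. Battery capacity falls with each cycle.",
    "The image shows the test bench used in the study.",
    "The graph plots the temperature of the cell during the study.",
]
for q in pseudo_queries(corpus, top_k=2):
    print(q)
```

Terms that occur in every document, such as "the", receive a weight of zero and so are never preferred over document-specific terms.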

Most techniques extract a plurality of keywords or keyphrases from a document, but Doc2Query, a supervised technique that uses BERT to generate queries in the form of natural language sentences, has recently also been released as open source.

2) Generating pseudo-labels

The pseudo-query extracted from each document is given as input to a search model based on an unsupervised learning methodology, such as BM25, and M documents are retrieved from the document corpus. The document from which the pseudo-query was extracted is likely to rank near the top of the M documents, but it may not be among them at all.

Of the M retrieved documents, the top m (< M) documents are labeled as positive training data, and at least some of the remaining documents are labeled as negative training data. Depending on the AI-based search model, some of the others may additionally be labeled as neutral training data. In general, positive and negative data are required, whereas neutral data are not.

Here, the number of retrieved documents M, the number of positive training examples m, the number of negative training examples, and the number of neutral training examples are integers. They are hyperparameters of a kind and may be set differently depending on the characteristics of the document corpus. A developer can raise the accuracy of the AI-based search model by tuning these hyperparameters, experimentally or theoretically, to suit the characteristics of the document corpus.
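The labeling rule can be sketched as follows. The specification only requires that the top m documents be positive and that at least some of the rest be negative; taking the negatives from the bottom of the ranking and calling everything in between neutral is an assumption made for this sketch, as are the function name `pseudo_labels` and the parameter `n_neg`.

```python
def pseudo_labels(ranked_ids, m, n_neg):
    """Label the top m retrieved ids positive and the last n_neg ids negative.

    ranked_ids: the M document ids returned by the unsupervised model, best first.
    Ids that fall between the two groups are labeled neutral (an assumption).
    """
    if m + n_neg > len(ranked_ids):
        raise ValueError("m + n_neg must not exceed the number of retrieved documents")
    labels = {}
    for rank, doc_id in enumerate(ranked_ids):
        if rank < m:
            labels[doc_id] = "positive"
        elif rank >= len(ranked_ids) - n_neg:
            labels[doc_id] = "negative"
        else:
            labels[doc_id] = "neutral"
    return labels


# M = 5 retrieved documents, m = 2 positives, 2 negatives, 1 neutral.
example = pseudo_labels(["d3", "d7", "d1", "d9", "d4"], m=2, n_neg=2)
print(example)
```

Here M, m, and `n_neg` play the role of the hyperparameters described above and would be tuned per corpus.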

3) Training the AI-based search model

Because training an AI-based search model requires query-document relations, the training data is provided in the form of relations between a query used for retrieval and the documents retrieved by that query. For example, it may be given as a pair, (pseudo-query, positive document), or as a triple, (pseudo-query, positive document, negative document). A neutral document may of course be added to form a set of four.
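A minimal sketch of assembling such triples is shown below. Pairing every positive document with every negative document is one simple choice made for illustration; the specification does not say how positives and negatives are combined, and the function name and sample labels are hypothetical.

```python
from itertools import product


def make_triples(query, labels):
    """Build (pseudo-query, positive document, negative document) triples.

    labels: dict mapping a document id to "positive", "negative", or "neutral".
    """
    positives = [d for d, lab in labels.items() if lab == "positive"]
    negatives = [d for d, lab in labels.items() if lab == "negative"]
    return [(query, p, n) for p, n in product(positives, negatives)]


labels = {
    "d3": "positive",
    "d7": "positive",
    "d1": "neutral",
    "d9": "negative",
    "d4": "negative",
}
triples = make_triples("battery capacity", labels)
for t in triples:
    print(t)
```

Triples of this shape suit pairwise ranking losses, where the model is trained to score the positive document above the negative one for the same query.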

The amount of data needed for training is not precisely known and is mainly determined experimentally. The amount of data needed does, however, grow with the number of parameters in the model. For search models based on BERT-family language models, models pre-trained on Wikipedia and similar sources are generally publicly available, and fine-tuning a pre-trained model for a specific task such as information retrieval requires less data than pre-training.

In academic research, the amount of training data is determined by the datasets that can be obtained, whereas in practice it may be determined by the number of documents in the document corpus. If the document corpus contains too few documents for training, other documents from a similar field may be added to it, or the amount of data may be increased with data augmentation techniques.

When training an AI-based model, it is generally desirable to split the data into training data and validation data in order to verify performance. For example, 80% of the data may be used for training and the remaining 20% for validation.
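The 80/20 split mentioned above could be done as in the following sketch. Shuffling with a fixed seed is an assumption added here for reproducibility, and the function name is hypothetical.

```python
import random


def split_train_valid(examples, valid_ratio=0.2, seed=0):
    """Shuffle the examples and hold out valid_ratio of them for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_ratio)
    return shuffled[n_valid:], shuffled[:n_valid]


train, valid = split_train_valid(range(10))
print(len(train), len(valid))
```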

[Inference stage]

In the present invention, the inference stage is the process of providing search results in response to a query entered by a user. It can be broadly divided into the following steps: a retrieving step, a re-ranking step, and an output step. Each step is described in detail below.

Retrieving step

In this step, N documents are extracted from the document corpus being searched by a search model based on an unsupervised learning methodology. In general, inference with an AI-based search model takes considerably longer than with a search model based on an unsupervised learning methodology. Applying the AI-based search model to every document in the corpus would therefore take so long that user convenience would suffer. For this reason it is common first to retrieve a number of documents with the comparatively fast search model based on an unsupervised learning methodology and then to apply the AI-based search model only to the retrieved documents. Research into speeding up inference of AI-based search models so that the retrieving step can be omitted is ongoing, but inference does not yet appear to be fast enough from the standpoint of user convenience.

In this step accuracy is important, but recall is more important. In the subsequent re-ranking step, by contrast, accuracy is more important than recall. One way to raise recall is to retrieve several times more documents than the number to be presented in the search results. Another is to raise the recall of the search model based on an unsupervised learning methodology itself. With the DeepCT Index technique, which uses BERT, recall can be raised while still using the same search model based on an unsupervised learning methodology. The time needed for retrieval grows in proportion to the number of documents retrieved. With DeepCT Index, fewer documents need to be retrieved, which shortens search time, while the same level of recall is achieved as when several times as many documents are retrieved.
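The effect of retrieving more documents on recall can be made concrete with a small sketch. The function name `recall_at_n` and the sample ids are hypothetical; recall here is the fraction of relevant documents that appear among the first n retrieved.

```python
def recall_at_n(retrieved_ids, relevant_ids, n):
    """Fraction of the relevant ids found among the first n retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(relevant & set(retrieved_ids[:n]))
    return hits / len(relevant)


retrieved = ["a", "b", "c", "d"]
relevant = ["b", "d", "z"]
print(recall_at_n(retrieved, relevant, 2))
print(recall_at_n(retrieved, relevant, 4))
```

Because the cutoff only ever adds candidates, recall cannot decrease as n grows, which is why retrieving several times more documents than will be shown is a simple way to raise it, at the cost of more work for the re-ranker.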

As the search model based on the unsupervised learning methodology, the same search model used in the learning step may be used, or a different one. What matters is choosing a search model with high recall in the relevant field. In academic research, Anserini (BM25+RM3) is widely used as an open-source search model.

Re-ranking stage

In the retrieval step, the relevance between each of the N retrieved documents and the query entered by the user has been evaluated by the search model based on the unsupervised learning methodology, and the documents are sorted in order of that relevance. In this step, the relevance between the N retrieved documents and the query is evaluated again by the AI-based search model, and the documents are re-sorted, that is, re-ranked. This step is performed by the AI-based search model trained in the learning step described above.

Search result output

The N documents re-ranked by the AI-based search model are sorted in order of relevance and output as the search result.
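
The two-stage flow described above (fast retrieval of N candidates, then re-ranking of only those N) can be sketched as follows. This is a minimal illustration, not the implementation of the invention: `bm25_scores` uses one common BM25 variant with assumed parameters `k1` and `b`, documents and queries are assumed to be pre-tokenized lists of words, and `rerank_score` is a hypothetical callable standing in for the trained AI-based search model.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document against the tokenized query (BM25 variant)."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

def retrieve_then_rerank(query, docs, n, rerank_score):
    """Stage 1: keep the top-n documents by BM25. Stage 2: reorder only those n."""
    first = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: first[i], reverse=True)[:n]
    # rerank_score is a hypothetical stand-in for the slower AI-based model.
    return sorted(candidates, key=lambda i: rerank_score(query, docs[i]), reverse=True)
```

Because the slower model only sees the n candidates, its cost is bounded by n rather than by the size of the document bundle, which is the motivation given above for the two-stage design.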

[Handling of special cases]

Matters related to document length

Search models based on unsupervised learning methodologies generally run without problems even on long documents, whereas AI-based search models often limit the maximum document length they can process. In particular, the AI-based search models introduced since BERT limit the maximum number of tokens they can process. BERT, for example, can process at most 512 tokens.

When the AI-based search model limits the maximum document length it can process, each document in the document bundle (document corpus) can be divided into passages no longer than the limit, and the passages can serve as the search targets. In this specification, a collection of such short passages is called a 'corpus', to distinguish it from the document bundle made up of long documents.
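
A minimal sketch of dividing a document into passages under a length limit is shown below. It assumes paragraphs are separated by blank lines and counts whitespace-separated words as tokens; a real system would count tokens with the model's own tokenizer, so the limit here is only an approximation. The function name `split_into_passages` is illustrative.

```python
def split_into_passages(document, max_tokens=512):
    """Split a document into passages of at most max_tokens whitespace tokens."""
    passages = []
    for paragraph in document.split("\n\n"):
        tokens = paragraph.split()
        # A paragraph longer than the limit is cut into consecutive chunks.
        for start in range(0, len(tokens), max_tokens):
            passages.append(" ".join(tokens[start:start + max_tokens]))
    return passages
```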

The learning step is the same as described above, except that the passages in the corpus take the place of the documents. That is, pseudo-queries are extracted from the passages, pseudo-labels are generated using the search model based on the unsupervised learning methodology, and the AI-based search model is then trained. The AI-based search model thus learns from the relevance between a pseudo-query and a passage, not between a pseudo-query and an entire document.

The inference step requires preprocessing: each document in the document bundle is divided into passages to build the corpus, and the correspondence between passages and documents is stored in a form that can be looked up later. In the retrieval step, N passages are retrieved from the corpus based on the query entered by the user, and in the re-ranking step the N retrieved passages are re-ranked.

The search results must present documents, not passages, sorted in order of relevance. To this end, the passage-document correspondence stored during preprocessing is consulted, and the documents are ordered to match the relevance order of the re-ranked passages and provided as the search result. Because one document is split into several passages, the re-ranked result may contain more than one passage from the same document. In that case, the position of the document can be set by its most relevant passage.
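
The mapping from re-ranked passages back to documents can be sketched as follows. The names `documents_from_passages` and `passage_to_doc` are illustrative; the dictionary stands in for the passage-document correspondence stored during preprocessing.

```python
def documents_from_passages(ranked_passage_ids, passage_to_doc):
    """Order documents by their best-ranked passage, ignoring later passages."""
    seen = set()
    ordered_docs = []
    for pid in ranked_passage_ids:  # best passage first
        doc_id = passage_to_doc[pid]
        if doc_id not in seen:
            seen.add(doc_id)
            ordered_docs.append(doc_id)
    return ordered_docs
```

Since each document appears once, at the position of its highest-ranked passage, the output never contains more documents than there are re-ranked passages.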

Document expansion

A long document contains many paragraphs, and the paragraphs do not necessarily share the same topic. Consequently, when each paragraph is stored as a passage in the corpus, a paragraph may be scored as having low relevance even though it comes from a relevant document.

To prevent this, the title of the document, or a phrase equivalent to it, can be added to each passage taken from that document. That is, the same phrase is added to every passage taken from one document. In this specification, expanding a passage by adding such a relevant phrase is referred to as 'document expansion'. A phrase equivalent to the title can be generated, for example, with a text summarization technique. Text summarization techniques fall broadly into extractive summarization and abstractive summarization, and either can be applied. The technique for generating the added phrase is not limited to text summarization; any technique that can concisely express the subject of the document may be used.

For example, suppose the query entered by the user contains several keywords, some of which appear in the original passage while others appear only in the phrase added by document expansion. Because the added phrase reflects the subject of the whole document, the passage may well contain the content the user intends to find. A passage that would not have been retrieved without the added phrase can thus be retrieved thanks to document expansion. This raises the recall of the retrieval step, and, as explained above, high recall is important in the retrieval step.

If the phrase added by document expansion is too long, the expanded passage may exceed the length the AI-based search model can process. To minimize the resulting side effects, the added phrase is preferably placed at the front of the passage. Even if the tail of the expanded passage is cut off, the passage will still be extracted in the retrieval step as long as the keywords in the query entered by the user are not in that tail.
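
A minimal sketch of document expansion with the added phrase placed at the front is given below. It assumes whitespace-separated words as a rough proxy for model tokens, and `expand_passage` is an illustrative name. When the expanded passage exceeds the limit, only the tail of the original passage is cut, so the title survives.

```python
def expand_passage(title, passage, max_tokens=512):
    """Prepend the title to the passage, then truncate from the end if too long."""
    tokens = title.split() + passage.split()
    return " ".join(tokens[:max_tokens])
```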

Domain specialization (domain adaptiveness)

When an AI-based search model is trained on documents of a specific domain, it can serve as a search model specialized for that domain. For example, when BERT pre-trained on a general document bundle such as Wikipedia is fine-tuned on documents of a specific domain, it learns general relationships among vocabulary items during pre-training and learns the domain-specific vocabulary during fine-tuning.

A search model specialized for a specific domain may perform relatively worse on other domains, but its performance on that domain improves.

Ensemble search model

In the method described above, the result re-ranked by the AI-based search model is used as the final search result. Alternatively, the search result may be produced by an ensemble of the search model based on the unsupervised learning methodology and the AI-based search model. In that case, the final evaluation is expressed as Equation 1.

[Equation 1]

(Final evaluation) = a*(evaluation by the search model based on the unsupervised learning methodology) + (1-a)*(evaluation by the AI-based search model)

In Equation 1, a is a value between 0 and 1; it is a hyperparameter that can be tuned to give the best search results.

An ensemble model is not guaranteed to outperform a single model, so whether to adopt it can be decided per domain.
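
Equation 1 can be sketched in code as follows. The min-max normalization step is an assumption added here for illustration, since the two models usually produce scores on different scales; the source does not specify how the two evaluations are made comparable. The names `min_max` and `ensemble_ranking` are illustrative.

```python
def min_max(scores):
    """Rescale scores to the range 0..1 (assumed step, not from the source)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def ensemble_ranking(unsup_scores, ai_scores, a=0.5):
    """Rank items by a*(unsupervised score) + (1-a)*(AI-based score)."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie between 0 and 1")
    u, v = min_max(unsup_scores), min_max(ai_scores)
    final = [a * x + (1 - a) * y for x, y in zip(u, v)]
    return sorted(range(len(final)), key=lambda i: final[i], reverse=True)
```

Setting a to 1 reproduces the ranking of the unsupervised model alone, and setting it to 0 reproduces the ranking of the AI-based model alone, so the single-model cases are the two ends of the same hyperparameter.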

FIG. 4 shows a computer device for performing the search method according to the present invention.

Since the search method and the learning method according to the present invention have been described in detail with reference to FIGS. 1 to 3, the device (100) for performing the search method is described only briefly with reference to FIG. 4.

Referring to FIG. 4, the computer device (100) comprises a processor (110), a non-volatile storage unit (120) that stores programs and data, a volatile memory (130) that holds running programs, an input/output unit (140) that exchanges information with the user, and a bus that serves as the internal communication path among these components. The running programs may include an operating system and various applications. Although not shown, the device also includes a power supply unit.

In the learning step, the AI-based search model is trained in the memory (130) using the training data stored in the storage unit (120). In the inference step, the search model based on unsupervised learning (210) and the AI-based search model (220), both stored in the storage unit (120), are executed in the memory (130). The corpus is stored in the storage unit (120), and the search method is performed based on a query entered through the input/output unit (140).

[Example]

The following describes an example in which the search result providing method according to the present invention was applied to patent documents held by the Korea Institute of Industrial Technology.

AI-based search model

The open-source CEDR-KNRM model was used as the AI-based search model. Briefly, CEDR-KNRM is a structure that processes BERT and KNRM in parallel; details can be found in the related paper and the published source code. According to Papers With Code, it was the best-performing model at the time the example was implemented, and this was confirmed to be unchanged as of the filing date.

Corpus

The patent documents to be searched, that is, the document bundle, numbered approximately 3,000. Because every document exceeded the 512 tokens that BERT can process, the documents were divided by paragraph to form the corpus. The corpus contains approximately 120,000 passages.

The relationship between each passage and the patent document it was taken from is stored so that it can be looked up later in the inference step.

Pseudo-query generation

A keyword extraction technique was applied to the passages formed from each paragraph of the patent documents, and one or more keywords or phrases were extracted from each passage as pseudo-queries. The technique used was RAKE, for which a paper and open-source code are available.

After duplicate pseudo-queries were removed, approximately 40,000 pseudo-queries remained.
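
The idea of extracting key phrases as pseudo-queries and removing duplicates can be illustrated with the simplified, RAKE-style sketch below. It is not the RAKE implementation used in the example: the stopword list is a tiny English placeholder, phrase boundaries are found only at stopwords and basic punctuation, and all function names are illustrative.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption; a real list is far longer).
STOPWORDS = {"a", "an", "the", "of", "for", "and", "is", "to", "in", "by", "with"}

def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def top_phrases(text, k=1):
    """Score each phrase as the sum of degree/frequency of its words."""
    phrases = candidate_phrases(text)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    scored = {p: sum(degree[w] / freq[w] for w in p) for p in phrases}
    ranked = sorted(scored, key=lambda p: (-scored[p], p))
    return [" ".join(p) for p in ranked[:k]]

def pseudo_queries(passages, k=1):
    """Collect the top phrases of every passage, dropping duplicates."""
    seen, unique = set(), []
    for passage in passages:
        for query in top_phrases(passage, k):
            if query not in seen:
                seen.add(query)
                unique.append(query)
    return unique
```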

Forming query-document relationships

Each pseudo-query was used as input to BM25, a known search model based on an unsupervised learning methodology, to extract 300 passages from the corpus. With approximately 40,000 pseudo-queries, this gives 12,000,000 query-passage pairs in total (40,000 x 300), which is ample data for training the CEDR-KNRM model.

Pseudo-labeling

Of the 300 passages extracted for each pseudo-query, the top m passages were classified as positive training data and the bottom p passages as negative training data. Because the CEDR-KNRM model also uses neutral training data, the passages remaining after excluding the top m and bottom p were classified as neutral training data.
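
The pseudo-labeling rule described above can be sketched as follows. The function name `pseudo_label` and the label strings are illustrative; `ranked_passage_ids` is assumed to hold the passages extracted for one pseudo-query, best first.

```python
def pseudo_label(ranked_passage_ids, m, p):
    """Label the top m passages positive, the bottom p negative, the rest neutral."""
    n = len(ranked_passage_ids)
    if m < 0 or p < 0 or m + p > n:
        raise ValueError("m and p must be non-negative and m + p must not exceed the list length")
    labels = {}
    for rank, pid in enumerate(ranked_passage_ids):
        if rank < m:
            labels[pid] = "positive"
        elif rank >= n - p:
            labels[pid] = "negative"
        else:
            labels[pid] = "neutral"
    return labels
```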

Hyperparameters such as m and p were tuned by having the inventors review the search results of the trained search model to verify its performance. Because the dataset has no ground truth, part of the prepared training data could have been held out as validation data; however, since the inventors are able to verify results by hand in this field, performance was verified manually.

Inference step

When the user enters a query consisting of several keywords, 300 passages are first retrieved from the corpus using BM25. The 300 retrieved passages are then passed, together with the query entered by the user, to the trained CEDR-KNRM model, which re-ranks them. The patent documents corresponding to the 300 re-ranked passages are looked up, and the patent documents are ordered to match the order of the passages. When several retrieved passages belong to the same patent document, that document is placed according to its highest-ranked passage, and its lower-ranked passages are ignored when ordering the patent documents. The search result therefore contains 300 documents or fewer.

Comparison of search results

FIGS. 5 and 6 compare results obtained with the search method developed in this example against results obtained with other search methods.

FIG. 5 shows the results of searching the patents held by the Korea Institute of Industrial Technology with a query containing the two keywords "wheelchair" and "driving assistance"; the left side shows the results of the search method developed in this example, and the right side shows the results from KIPRIS. As shown, the search method developed in this example returns not only documents containing words that exactly match "wheelchair" and "driving assistance" but also patents on other devices usable by people with limited mobility, whereas KIPRIS returns only documents containing words that exactly match the two keywords.

FIG. 6 shows the results of searching the patents held by the Korea Institute of Industrial Technology with a query of six keywords, including "scanning sonar", "safety monitoring", "marine operation equipment", "marine monitoring", and "aquaculture monitoring"; the upper part shows the results of the search method developed in this example, and the lower part shows the results from Google patent search. Because Google patent search did not run when the query contained more than five keywords, the keyword "aquaculture monitoring" was omitted for that search. As shown, the search method developed in this example finds documents containing semantically similar words even when they do not exactly match the keywords in the query, whereas Google patent search returns no results. The same was true of KIPRIS (not shown).

As FIGS. 5 and 6 show, the search method according to the present invention, which uses an AI-based search model, can additionally find documents that do not contain the keywords of the input query but are semantically related to it.

[Application to searching tables, images, graphs, and the like included in documents]

For non-text-based objects included in a document, such as tables, images, and graphs, corresponding text-based descriptions exist in the document. In a paper, for example, a reference label and a brief description are placed above or below each table, image, or graph, and the body of the document contains a text-based description of the object. As another example, in a patent publication the detailed description of the specification includes a brief description of the drawings, as well as paragraphs describing each drawing. Depending on the type of object, reference labels such as Table 1, Figure 1, or FIG. 1 are assigned.

Based on this text-based information, a passage can be generated for each non-text-based object such as a table, image, or graph. For example, the passage for Table 1 of Document 1 can be generated by combining the description placed above or below the table with the paragraphs of Document 1 that describe Table 1. In addition, by applying the document expansion technique described above, the title, abstract, and the like of Document 1 can be added to the passage.
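
A minimal sketch of building a passage for one object is shown below. It assumes the caption and the paragraphs of the document are already available as strings and finds referencing paragraphs by matching the reference label as a whole token, so that "Table 1" does not match "Table 10". The function name `object_passage` and its parameters are illustrative.

```python
import re

def object_passage(label, caption, paragraphs, title=""):
    """Combine title, caption, and paragraphs that mention the label into one passage."""
    # Word boundaries keep "Table 1" from matching inside "Table 10".
    pattern = re.compile(r"\b" + re.escape(label) + r"\b")
    referencing = [p for p in paragraphs if pattern.search(p)]
    parts = [title, caption] + referencing
    return " ".join(part for part in parts if part)
```

The title is placed first, in line with the preference stated earlier for putting the phrase added by document expansion at the front of the passage.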

These passages form a corpus for the non-text-based objects included in the documents. The AI-based search model can be trained using only this corpus for non-text-based objects. Alternatively, the corpus for non-text-based objects can be used only as the search target, with a search model trained on the corpus composed of document paragraphs.

If the relationships among documents, non-text-based objects, and passages are stored in a table or index, the search result can provide not only the matching passages but also the non-text-based objects (tables, images, graphs, and the like) corresponding to those passages.
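
One possible shape for such a relation table is sketched below. The record type `Relation`, its field names, and the function `results_with_objects` are assumptions made for illustration; the source only states that the relationships are tabulated or indexed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    passage_id: str
    object_id: str      # e.g. an identifier for a table, image, or graph
    document_id: str

def results_with_objects(ranked_passage_ids, relations):
    """Attach the object and document to each re-ranked passage, keeping the order."""
    by_passage = {r.passage_id: r for r in relations}
    return [
        (pid, by_passage[pid].object_id, by_passage[pid].document_id)
        for pid in ranked_passage_ids
    ]
```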

The foregoing detailed description should not be construed as restrictive in any respect but should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention fall within its scope.

Claims

A computer-implemented method for searching for a non-text-based object included in a document using an artificial intelligence-based search model, the method comprising:
(a) for a given document containing non-text-based objects, generating a passage corresponding to a specific non-text-based object contained in the document, the passage including a description of the object and a paragraph in which the object is referenced, and creating a relationship among the document, the object, and the passage;
(b) forming an object-related corpus by collecting the passages generated in step (a) from a plurality of documents, together with a relation table among the documents, objects, and passages;
(c) retrieving N passages corresponding to an input query from the corpus formed in step (b) by a search model based on an unsupervised methodology;
(d) re-ranking the N passages retrieved in step (c) based on the input query by the AI-based search model; and
(e) outputting, as a search result, the N passages re-ranked in step (d), the objects corresponding thereto, and the documents including the passages and the objects.

The method according to claim 1, wherein in step (a) each passage is generated to further include the title of the document containing the specific object.

The method according to claim 1, wherein in step (a) each passage is generated to further include the abstract of the document containing the specific object.

The method according to claim 1, wherein the AI-based search model is trained on a corpus created by collecting the paragraphs of each document, that corpus being generated separately from the object-related corpus of step (b).

A device for searching for a non-text-based object included in a document using an artificial intelligence-based search model, the device comprising:
at least one processor; and
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions stored in the at least one memory, when executed by the at least one processor, cause the device to perform:
(a) for a given document containing non-text-based objects, generating a passage corresponding to a specific non-text-based object contained in the document, the passage including a description of the object and a paragraph in which the object is referenced, and creating a relationship among the document, the object, and the passage;
(b) forming an object-related corpus by collecting the passages generated in step (a) from a plurality of documents, together with a relation table among the documents, objects, and passages;
(c) retrieving N passages corresponding to an input query from the corpus formed in step (b) by a search model based on an unsupervised methodology;
(d) re-ranking the N passages retrieved in step (c) based on the input query by the AI-based search model; and
(e) outputting, as a search result, the N passages re-ranked in step (d), the objects corresponding thereto, and the documents including the passages and the objects.

The device according to claim 5, wherein each passage is generated to further include the title of the document containing the specific object.

The device according to claim 5, wherein each passage is generated to further include the abstract of the document containing the specific object.

The device according to claim 5, wherein the AI-based search model is trained on a corpus created by collecting the paragraphs of each document, that corpus being generated separately from the object-related corpus.

A computer program for searching for a non-text-based object included in a document using an artificial intelligence-based search model, the computer program being stored on a non-transitory storage medium and, when coupled to a processor, causing execution of a method comprising:
(a) for a given document containing non-text-based objects, generating a passage corresponding to a specific non-text-based object contained in the document, the passage including a description of the object and a paragraph in which the object is referenced, and creating a relationship among the document, the object, and the passage;
(b) forming an object-related corpus by collecting the passages generated in step (a) from a plurality of documents, together with a relation table among the documents, objects, and passages;
(c) retrieving N passages corresponding to an input query from the corpus formed in step (b) by a search model based on an unsupervised methodology;
(d) re-ranking the N passages retrieved in step (c) based on the input query by the AI-based search model; and
(e) outputting, as a search result, the N passages re-ranked in step (d), the objects corresponding thereto, and the documents including the passages and the objects.