KR20220117944A

KR20220117944A - Method for batch-processing of data preprocessing and training process for information retrieval model

Info

Publication number: KR20220117944A
Application number: KR1020210021003A
Authority: KR
Inventors: 이종원; 이호준
Original assignee: 호서대학교 산학협력단
Priority date: 2021-02-17
Filing date: 2021-02-17
Publication date: 2022-08-25
Also published as: KR102588268B1

Abstract

Provided is a method for batch-processing the data preprocessing and training procedures for an artificial intelligence-based information retrieval model. The batch-processing procedure comprises: when documents are given for an information retrieval model whose search targets are passages, dividing the documents into passages to generate a corpus; and training the information retrieval model based on a weak supervision methodology by using at least some of the passages in the corpus. Because the method batch-processes the preprocessing procedure of dividing given documents into passages short enough for the artificial intelligence-based information retrieval model to process, it can reduce the cost and time required for preprocessing. Moreover, because the method batch-processes the procedure of training the information retrieval model, based on the weak supervision methodology, with at least some of the passages in the corpus generated during preprocessing, it can reduce the cost of training the artificial intelligence-based information retrieval model.

Description

Method for batch-processing of data preprocessing and training process for information retrieval model

The present invention relates to a method for batch-processing the data preprocessing and training procedures for an information retrieval model. More specifically, for an information retrieval model whose search targets are passages, it relates to a method for batch-processing both the procedure of dividing given documents into passages to build a corpus and the procedure of training the information retrieval model, based on a weak supervision methodology, using at least some of the passages in the generated corpus.

Search technology has advanced steadily since Google introduced PageRank, a search technique based on graph theory. Such search technology is based on unsupervised learning, so search is possible given only a document corpus. A representative search model based on unsupervised learning is BM25, which shows considerably improved performance when combined with the query expansion technique RM3. Among open-source implementations, Anserini is widely used in both academia and industry.

Meanwhile, following academic research aimed at applying artificial intelligence techniques to natural language processing, various search models have been proposed. BERT, released by Google in 2018, performed well across many natural language processing tasks, and research on applying various language models to information retrieval has followed. According to Jimmy Lin, a researcher at the University of Waterloo in Canada, deep-learning information retrieval models that predate BERT performed similarly to or worse than Anserini, an information retrieval model based on unsupervised methodology, whereas information retrieval models proposed after BERT outperform Anserini (see Lin, Jimmy. "The Neural Hype, Justified! A Recantation.").

However, AI-based information retrieval models have several limitations. First, information retrieval models based on unsupervised learning generally handle long documents without difficulty, whereas most AI-based information retrieval models limit the length of the documents they can process. For example, BERT can process at most 512 tokens. This poses no problem when the search target is a corpus of short texts, but it makes such models difficult to apply to long documents such as patents and academic papers. Second, an AI-based search model must be trained before it can be used for inference, and this training requires a large amount of labeled data. Labeled data must essentially be produced by humans, and given the amount of data needed for training, the cost of labeling is too high to be economical.

To address the first problem mentioned above, a method has been proposed for AI-based information retrieval models in which passages, obtained by dividing a document into pieces small enough for the model to process, serve as the basic unit of search. However, the preprocessing that divides documents into passages and prepares them as search targets takes a great deal of time.
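As a minimal sketch of the passage-splitting idea, the following divides a document into windows of bounded length. It is illustrative only: the function name `split_into_passages`, the `stride` option, and the default of 256 tokens are assumptions, and whitespace-separated words stand in for the subword tokens that a model such as BERT would actually count.

```python
def split_into_passages(text, max_tokens=256, stride=None):
    """Split text into passages of at most max_tokens whitespace tokens.

    Assumption: whitespace tokens approximate model tokens. A real pipeline
    would count tokens with the target model's own tokenizer and leave room
    for the query and special tokens.
    """
    tokens = text.split()
    step = stride or max_tokens  # stride < max_tokens yields overlapping passages
    passages = []
    for start in range(0, len(tokens), step):
        passages.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the current window already reached the end of the document
    return passages
```

Choosing a stride smaller than the window makes neighboring passages overlap, which can reduce the chance that a relevant sentence is cut in half at a boundary.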

As one way to address the second problem mentioned above, a weak supervision methodology has been proposed. For example, Korean Patent No. 10-2197645, filed and registered by the present applicant, discloses a method in which keywords are extracted from each passage to be searched using a keyword extraction algorithm and used as a pseudo-query, pseudo-labels between pseudo-queries and passages are generated using an information retrieval model based on unsupervised learning, and an AI-based information retrieval model is trained with these weak labels. However, because this training process is carried out manually, it is costly.

[1] https://paperswithcode.com/task/ad-hoc-information-retrieval
[2] MacAvaney, Sean, et al. "CEDR: Contextualized embeddings for document ranking." Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019.
[3] Dai, Zhuyun, and Jamie Callan. "Deeper text understanding for IR with contextual neural language modeling." Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019.

The present invention aims to solve the problems described above by providing a method for batch-processing both a preprocessing procedure, in which given documents are divided into passages short enough for an AI-based information retrieval model to process and assembled into a corpus, and a procedure of training the information retrieval model, based on a weak supervision methodology, using at least some of the passages in the generated corpus.

To solve the above problems, according to one aspect of the present invention, there is provided a computer-implemented method for batch-processing the generation of a corpus for an information retrieval model and the training of the information retrieval model based on that corpus, the method comprising: (a) receiving parameters required for batch processing, the parameters being predetermined based on information about a plurality of documents to be searched; (b) generating a corpus by dividing each of the plurality of documents into passages based on the parameters received in step (a); (c) extracting, from each of at least some of the passages included in the corpus generated in step (b), keywords to be used as a pseudo-query; (d) retrieving, using the pseudo-query extracted in step (c) and an information retrieval model based on unsupervised learning, the number of passages specified in the parameters received in step (a); (e) generating pseudo-labels based on the hyperparameters specified in the parameters received in step (a); and (f) training an AI-based information retrieval model using the pseudo-labels generated in step (e).
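The control flow of steps (a) through (f) can be sketched as follows. This is a hypothetical skeleton, not the applicant's implementation: the names `BatchParams` and `run_batch`, the individual parameter fields, and the callable signatures are all assumptions chosen for illustration, and each step is injected as a function so that the sketch stays independent of any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchParams:
    """Step (a): parameters fixed once, before the batch run (hypothetical fields)."""
    passage_tokens: int  # passage length used in step (b)
    num_keywords: int    # keywords per pseudo-query in step (c)
    num_retrieved: int   # passages retrieved per pseudo-query in step (d)
    num_positive: int    # hyperparameter for step (e)
    num_negative: int    # hyperparameter for step (e)


def run_batch(documents, params, split, extract_keywords, retrieve, label, train):
    """Run steps (b) through (f) in sequence with no manual intervention."""
    # (b) divide every document into passages to form the corpus
    corpus = [p for doc in documents for p in split(doc, params.passage_tokens)]
    examples = []
    for passage in corpus:  # the method only requires "at least some" passages
        query = extract_keywords(passage, params.num_keywords)       # (c)
        ranked = retrieve(query, corpus, params.num_retrieved)       # (d)
        examples.extend(
            label(query, ranked, params.num_positive, params.num_negative)  # (e)
        )
    return train(examples)                                           # (f)
```

Passing the steps in as functions is only one possible design; it makes it easy to swap, for example, the keyword extractor or the unsupervised retriever without touching the batch driver.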

According to another aspect of the present invention, there is provided an apparatus for batch-processing the generation of a corpus for an information retrieval model and the training of the information retrieval model based on that corpus, the apparatus comprising: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions stored in the at least one memory cause the at least one processor to execute: (a) receiving parameters required for batch processing, the parameters being predetermined based on information about a plurality of documents to be searched; (b) generating a corpus by dividing each of the plurality of documents into passages based on the parameters received in step (a); (c) extracting, from each of at least some of the passages included in the corpus generated in step (b), keywords to be used as a pseudo-query; (d) retrieving, using the pseudo-query extracted in step (c) and an information retrieval model based on unsupervised learning, the number of passages specified in the parameters received in step (a); (e) generating pseudo-labels based on the hyperparameters specified in the parameters received in step (a); and (f) training an AI-based information retrieval model using the pseudo-labels generated in step (e).

According to yet another aspect of the present invention, there is provided a computer program stored in a non-transitory storage medium for batch-processing the generation of a corpus for an information retrieval model and the training of the information retrieval model based on that corpus, the computer program comprising instructions that cause a processor to execute: (a) receiving parameters required for batch processing, the parameters being predetermined based on information about a plurality of documents to be searched; (b) generating a corpus by dividing each of the plurality of documents into passages based on the parameters received in step (a); (c) extracting, from each of at least some of the passages included in the corpus generated in step (b), keywords to be used as a pseudo-query; (d) retrieving, using the pseudo-query extracted in step (c) and an information retrieval model based on unsupervised learning, the number of passages specified in the parameters received in step (a); (e) generating pseudo-labels based on the hyperparameters specified in the parameters received in step (a); and (f) training an AI-based information retrieval model using the pseudo-labels generated in step (e).

According to the present invention, the preprocessing procedure of dividing given documents into passages short enough for an AI-based information retrieval model to process, and assembling them into a corpus, can be batch-processed, so the time and cost required for preprocessing can be reduced.

In addition, according to the present invention, the procedure of training the information retrieval model, based on the weak supervision methodology, with at least some of the passages in the corpus generated during preprocessing can be batch-processed, so the cost of training the AI-based information retrieval model can be reduced.

FIG. 1 is a flowchart illustrating a conventional method for training an AI-based information retrieval model.
FIG. 2 is a flowchart illustrating an information retrieval method using a conventional AI-based information retrieval model.
FIG. 3 is a diagram schematically illustrating a conventional information retrieval model.
FIG. 4 is a diagram schematically illustrating an apparatus for performing an information retrieval method using an AI-based information retrieval model.
FIG. 5 is a flowchart of a method for batch-processing the data preprocessing and training procedures for an information retrieval model according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. Identical or similar components are given identical or similar reference numerals, and redundant descriptions of them are omitted. In describing the embodiments disclosed in this specification, detailed descriptions of related known technologies are omitted where they would obscure the gist of the embodiments. The accompanying drawings are provided only to aid understanding of the embodiments disclosed in this specification; they do not limit the technical idea disclosed herein, which should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms that include ordinals, such as "first" and "second", may be used to describe various components, but such terms are used only to distinguish one component from another, and the components are not limited by these terms. A singular expression includes the plural unless the context clearly indicates otherwise.

As used in this specification, terms such as "include", "comprise", or "have" are to be understood as specifying the presence of the features, steps, components, or combinations thereof described in the specification, and are not intended to exclude the possibility that one or more other features, steps, components, or combinations thereof are present or may be added.

To aid understanding of the present invention, the method of training an AI-based information retrieval model disclosed in Korean Patent No. 10-2197645, filed and registered by the present applicant, and the method of providing search results using that model are first described briefly with reference to FIGS. 1 to 4. The batch-processing method for data preprocessing and training according to the present invention is described afterward.

FIG. 1 shows a method of training an AI-based information retrieval model according to the present invention, FIG. 2 shows the steps of performing inference using the AI-based information retrieval model according to the present invention, and FIG. 3 schematically shows an example of an information retrieval model according to the present invention. The present invention is described below with reference to these drawings.

[Training stage of the AI-based information retrieval model]

Information retrieval model

In this specification, information retrieval models are divided into "search models based on unsupervised methodology" and "AI-based search models". The former refers to information retrieval models based on statistical or other unsupervised methodologies, such as BM25 and QL (Query Likelihood). The latter refers to information retrieval models obtained through training, including the deep-learning family (DRMM, KNRM, PACRR, and the like) and the language-model family (BERT and the like).

In the present invention, the former, the search model based on unsupervised methodology, is used in the training stage to prepare pseudo-labels and in the inference stage to retrieve passages from the corpus. The latter is used in the inference stage to re-rank the retrieved passages.

Information retrieval models with open-source implementations are listed under the Ad-Hoc Information Retrieval task on Papers With Code.

Search model based on unsupervised methodology

Known search models based on unsupervised methodology include BM25 and QL (Query Likelihood), and Anserini is widely used as an open-source implementation. Anserini is a search model that uses BM25 together with RM3, a query expansion methodology. According to published research, Anserini performs comparably to deep-learning search models such as DRMM, KNRM, and PACRR, but worse than language-model-family search models such as BERT.

A search model based on unsupervised methodology requires no training and determines the similarity between a query and a document on the basis of statistical theory and the like.

AI-based search model

Known AI-based search models include deep-learning search models such as DRMM, KNRM, and PACRR, and transformer or language-model-family search models such as BERT. According to Papers With Code, the best-performing, that is, SOTA (state-of-the-art), search model as of the filing date of this patent appears to be CEDR, which combines the language-model-family BERT with a deep-learning search model.

An AI-based search model is obtained by learning the similarity between queries and documents from training data. In academia, AI-based search models are trained on datasets that provide query-document relations, such as TREC, SQuAD, and MS MARCO.

Language-model-family search models such as BERT use self-supervised learning in the pre-training stage, but the fine-tuning stage that adapts them to the information retrieval task requires supervised learning on a query-document relation dataset, just as deep-learning search models do.

Weak supervision methodology

Query-document relation datasets used in academia, such as TREC, SQuAD, and MS MARCO, either correspond to ground truth derived from real users' search activity or use document titles as queries. In practice, however, often only a document corpus exists: queries entered by users to search it are either absent or, if present, far too few to train an AI-based search model. The present invention provides a methodology in which, when only a document corpus exists, a pseudo-query is generated from each document, pseudo query-document relations are formed using the generated pseudo-queries, pseudo-labels are generated, and an AI-based search model is trained with them. The weak supervision methodology for an AI-based search model according to the present invention broadly comprises the following steps: 1) generating a pseudo-query from each document in the document corpus; 2) forming pseudo query-document relations using the generated pseudo-queries and generating pseudo-labels based on those relations; and 3) training the AI-based search model using the generated pseudo-labels. Each step is described in its own section below.

1) Generating a pseudo-query

One or more keywords are extracted from each document in the document corpus and used as the pseudo-query. Keywords are extracted from a document with an existing keyword extraction technique. Keyword extraction algorithms can be broadly divided into techniques based on unsupervised learning and techniques based on supervised learning, and open-source implementations exist for many of them.
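The text does not prescribe a particular keyword extractor. As one unsupervised possibility, the sketch below scores each term of a passage by TF-IDF over the corpus and keeps the top-scoring terms as that passage's pseudo-query. The function name `extract_pseudo_queries`, the smoothing of the IDF term, and the alphabetical tie-break are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def extract_pseudo_queries(passages, num_keywords=3):
    """Return, for each passage, its top TF-IDF terms as a pseudo-query."""
    tokenized = [re.findall(r"\w+", p.lower()) for p in passages]
    n = len(tokenized)
    # document frequency: number of passages in which each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    queries = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # smoothed IDF; a term that occurs in every passage scores zero
        score = {t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf}
        # highest score first, ties broken alphabetically for determinism
        ranked = sorted(score, key=lambda t: (-score[t], t))
        queries.append(ranked[:num_keywords])
    return queries
```

For languages without whitespace word boundaries, or where morphology matters (Korean, for instance), the regular-expression tokenizer would need to be replaced with a suitable morphological analyzer.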

Most techniques extract a set of keywords or keyphrases from a document, but more recently a supervised technique called Doc2Query, which uses BERT to generate queries in the form of natural-language sentences, has also been released as open source.

2) Generating pseudo-labels

The pseudo-query extracted from each document is given as input to a search model based on unsupervised methodology, such as BM25, to retrieve M documents from the document corpus. The document from which the pseudo-query was extracted is likely to rank near the top of the M documents, but it may not be included at all.
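For illustration, a compact BM25 ranker over an in-memory list of passages could look like the sketch below. It is not the Anserini implementation referred to in this specification; the function name `bm25_rank`, the whitespace tokenization, and the values of `k1` and `b` are assumptions, and a production system would use an inverted index rather than scoring every passage.

```python
import math
from collections import Counter


def bm25_rank(query_terms, passages, top_m, k1=0.9, b=0.4):
    """Return the indices of the top_m passages for the query under BM25."""
    docs = [p.lower().split() for p in passages]
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scored = []
    for idx, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # query term absent from this passage
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scored.append((score, idx))
    # highest score first; ties keep the original corpus order
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:top_m]]
```

Calling `bm25_rank(pseudo_query, corpus, top_m=M)` for each pseudo-query yields the ranked list of M passages on which the labeling step operates.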

Of the M retrieved documents, the top m (< M) documents are labeled as positive training data, and at least some of the remaining documents are labeled as negative training data. Depending on the AI-based search model, some of the others may additionally be labeled as neutral training data. In general, positive and negative data are required, whereas neutral data are optional.

Here, the number of retrieved documents M, the number of positive training examples m, the number of negative training examples, and the number of neutral training examples are integers. They act as hyperparameters and may be set differently depending on the characteristics of the document corpus. A developer can raise the accuracy of the AI-based search model by tuning these hyperparameters, experimentally or theoretically, to suit the corpus.
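The labeling rule can be sketched as a small function over a ranked list. The text only requires that negatives come from "at least some" of the documents outside the top m; taking them from the bottom of the ranked list, and neutrals from just below the positives, is an assumption made here, as are the function name `make_pseudo_labels` and the string labels.

```python
def make_pseudo_labels(ranked_ids, num_positive, num_negative, num_neutral=0):
    """Assign pseudo-labels to a list of document ids ranked best-first.

    len(ranked_ids) plays the role of M and num_positive the role of m.
    """
    if num_positive < 1 or num_negative < 1 or num_neutral < 0:
        raise ValueError("need at least one positive and one negative")
    if num_positive + num_neutral + num_negative > len(ranked_ids):
        raise ValueError("label counts exceed the number of retrieved documents")
    labels = [(doc_id, "positive") for doc_id in ranked_ids[:num_positive]]
    # assumption: neutrals are the documents ranked just below the positives
    labels += [
        (doc_id, "neutral")
        for doc_id in ranked_ids[num_positive:num_positive + num_neutral]
    ]
    # assumption: negatives are the lowest-ranked retrieved documents
    labels += [
        (doc_id, "negative")
        for doc_id in ranked_ids[len(ranked_ids) - num_negative:]
    ]
    return labels
```

Because every count is an argument, the same function can be re-run with different hyperparameter settings when tuning for a particular corpus.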

3) Training the AI-based search model

Because training an AI-based search model requires query-document relations, the training data are provided as relations between the query used for retrieval and the documents retrieved by that query. For example, the data may be given as pairs, such as (pseudo-query, positive document), or as triples, such as (pseudo-query, positive document, negative document). Four items may also be associated when a neutral document is included.
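A minimal sketch of assembling such training records is shown below. The function name `build_training_examples` is hypothetical, and pairing every positive with every negative is only one possible choice; a real pipeline might instead sample a fixed number of negatives per positive.

```python
def build_training_examples(query, positives, negatives=()):
    """Build (query, positive) pairs or (query, positive, negative) triples."""
    if not negatives:
        # pair form: (pseudo-query, positive document)
        return [(query, pos) for pos in positives]
    # triple form: every positive is contrasted with every negative
    return [(query, pos, neg) for pos in positives for neg in negatives]
```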

The amount of data required for training is not precisely known and is mainly determined empirically. The amount needed does, however, grow with the number of parameters in the model. For search models based on BERT-family language models, models pre-trained on Wikipedia and similar corpora are generally publicly available, and fine-tuning a pre-trained model to a specific task such as information retrieval requires less data than pre-training.

In academic research, the amount of training data is determined by the datasets that are available, whereas in practice it may be determined by the number of documents in the document corpus. If the corpus contains too few documents for training, other documents from a similar field can be added to it, or the amount of data can be increased with data augmentation techniques.

When training an AI-based model, it is generally desirable to split the data into training data and validation data in order to verify performance. For example, 80% of the data may be used for training and the remaining 20% for validation.
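An 80/20 split of this kind might be implemented as in the following sketch. The function name `split_train_validation` and the fixed random seed are assumptions for illustration.

```python
import random


def split_train_validation(examples, train_fraction=0.8, seed=0):
    """Shuffle the examples and split them into training and validation sets."""
    shuffled = list(examples)               # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded for a reproducible split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

When several examples share one pseudo-query, it may be preferable to split by query rather than by example, so that the same query does not appear on both sides of the split.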

[Inference stage]

In the present invention, the inference stage refers to the process of providing search results in response to a query entered by a user. The inference stage can be broadly divided into a retrieving step, a re-ranking step, and an output step. Each step is described in detail below.

Retrieval step

In this step, N documents are extracted from the document collection to be searched by a search model based on an unsupervised learning methodology. In general, an AI-based search model takes considerably longer at inference than a search model based on an unsupervised learning methodology. Applying the AI-based search model to every document in the collection would therefore take too long and degrade user convenience. It is thus common to first retrieve a number of documents with the relatively fast unsupervised search model and then apply the AI-based search model only to the retrieved documents. Research aimed at speeding up inference in AI-based search models enough to omit the retrieval step is ongoing, but inference does not yet appear to be fast enough from the standpoint of user convenience.

In this step accuracy matters, but recall matters more. In the subsequent re-ranking step, by contrast, accuracy matters more than recall. One way to increase recall is to retrieve several times more documents than the number to be presented in the search results. Another is to increase the recall of the unsupervised search model itself. With a BERT-based technique called DeepCT Index, recall can be increased while still using the same unsupervised search model. Retrieving a large number of documents takes time that grows in proportion to the number of documents. With the DeepCT Index technique, the number of retrieved documents, and hence the search time, can be reduced while achieving the same level of recall as retrieving several times as many documents.

The unsupervised search model used here may be the same one used in the training stage, or a different one. What matters is selecting a search model with high recall in the relevant field. In academic research, Anserini (BM25+RM3) is widely used as an open-source search model.

Re-ranking step

Coming out of the retrieval step, the relevance between each of the N retrieved documents and the user's query has been evaluated by the unsupervised search model, and the documents are sorted in order of that relevance. In this step, the relevance between the N retrieved documents and the user's query is evaluated again by the AI-based search model, and the documents are then re-sorted, that is, re-ranked. This step is performed by the AI-based search model trained in the training stage described above.

Search result output

The N documents re-ranked by the AI-based search model are sorted in order of relevance and output as the search result.
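The retrieve-then-re-rank flow described above can be sketched as follows. Both scorers here are toy stand-ins chosen only so that the example runs: `term_overlap` plays the role of the fast unsupervised model and `jaccard` the role of the slower AI-based model. None of these names come from the text.

```python
def term_overlap(query, text):
    """Fast stand-in scorer: number of query terms that occur in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def jaccard(query, text):
    """Stand-in for the slower model: overlap normalised by vocabulary size."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t)


def retrieve_then_rerank(query, docs, fast_score, slow_score, n):
    """docs: dict doc_id -> text. Returns the n retrieved doc ids, re-ranked."""
    # Step 1: cheap scorer over the whole collection, keep the top n.
    candidates = sorted(docs, key=lambda d: fast_score(query, docs[d]), reverse=True)[:n]
    # Step 2: expensive scorer applied only to the n candidates.
    return sorted(candidates, key=lambda d: slow_score(query, docs[d]), reverse=True)


docs = {
    "d1": "neural ranking models for passage retrieval and many other unrelated topics",
    "d2": "passage retrieval",
    "d3": "cooking recipes",
}
result = retrieve_then_rerank("passage retrieval", docs, term_overlap, jaccard, n=2)
```

In this toy run the first step keeps `d1` and `d2`, and the second step moves `d2` ahead of `d1`.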

[Handling of special cases]

Matters related to document length

In general, a search model based on an unsupervised learning methodology runs without difficulty even on long documents, whereas AI-based search models often limit the maximum length of a document they can process. In particular, the AI-based search models introduced since BERT limit the maximum number of tokens that can be processed. BERT, for example, can process at most 512 tokens.

When the AI-based search model limits the maximum document length it can process, each document in the document collection (document corpus) can be divided into passages no longer than that limit, and the passages can be used as the search targets. In this specification, a collection of such short passages is called a 'corpus' to distinguish it from a document collection made up of long documents.

The training stage is the same as described above, except that the passages in the corpus take the place of the documents. That is, pseudo-queries are extracted from the passages, pseudo-labels are generated using a search model based on an unsupervised learning methodology, and the AI-based search model is then trained. The AI-based search model thus learns from the relevance between a pseudo-query and a passage rather than between a pseudo-query and a whole document.

The inference stage requires preprocessing in which each document in the document collection is divided into passages to build the corpus, and the correspondence between passages and documents is stored in a form that can be looked up later. In the retrieval step, N passages are retrieved from the corpus based on the query entered by the user, and in the re-ranking step the N retrieved passages are re-ranked.

The search results must present documents, not passages, sorted in order of relevance. To this end, the passage-to-document correspondence produced in the preprocessing step is consulted, and the documents are ordered to match the relevance order of the re-ranked passages and provided as the search result. Because one document is split into several passages, the re-ranked result may contain several passages from the same document. In that case, the position of the document may be made to correspond to the position of its most relevant passage.
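The mapping from re-ranked passages back to documents can be sketched as follows. The function name and the passage and document identifiers are hypothetical; the rule implemented is the one stated above, in which each document takes the position of its most relevant passage.

```python
def documents_from_ranked_passages(ranked_passage_ids, passage_to_doc):
    """Return document ids ordered by the rank of their best passage."""
    ordered_docs = []
    seen = set()
    for pid in ranked_passage_ids:   # most relevant passage first
        doc_id = passage_to_doc[pid]
        if doc_id not in seen:       # lower-ranked passages of a seen document are skipped
            seen.add(doc_id)
            ordered_docs.append(doc_id)
    return ordered_docs


# Toy correspondence table built during preprocessing.
passage_to_doc = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
doc_order = documents_from_ranked_passages(["p3", "p1", "p4", "p2"], passage_to_doc)
```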

Document expansion

A long document contains many paragraphs, and the paragraphs do not necessarily share the same topic. Consequently, when the paragraphs are split into passages and stored as a corpus, a paragraph may be scored as having low relevance even though it comes from a relevant document.

To prevent this problem, the title of the document, or an equivalent phrase, can be added to each passage split from that document. That is, the same phrase is added to every passage that comes from one document. In this specification, extending a passage by adding a relevant phrase in this way is referred to by the term 'document expansion'. A phrase equivalent to the document title can be generated, for example, using a text summarization technique. Text summarization techniques are broadly divided into extractive summarization and abstractive summarization, and either can be applied. The technique used to generate the phrase added for document expansion is not limited to text summarization; any technique that can concisely express the subject of the document may be used.

For example, a query entered by a user may contain several keywords, some of which appear in the original passage while others appear only in the phrase added by document expansion. Since the added phrase carries the subject of the whole document, such a passage may well contain what the user intended to find. A passage that would not have been retrieved without the added phrase can thus be retrieved thanks to document expansion. This can increase the recall of the retrieval step, and, as explained above, high recall is important in the retrieval step.

If the phrase added by document expansion is too long, the expanded passage may exceed the length that the AI-based search model can handle. To minimize the resulting side effects, the added phrase is preferably placed at the front of the passage. Even if the tail of the lengthened passage is cut off, the passage will still be extracted in the retrieval step as long as the keywords in the user's query are not located in the truncated tail.
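A minimal sketch of document expansion with the added phrase placed first, so that any truncation removes only the tail of the passage. Whitespace-separated words stand in for model tokens here, and `expand_passage` is a hypothetical name; a real system would count tokens with the tokenizer of the chosen model.

```python
def expand_passage(title_tokens, passage_tokens, max_tokens):
    """Prepend the title (or equivalent phrase) and truncate from the end."""
    expanded = list(title_tokens) + list(passage_tokens)
    return expanded[:max_tokens]   # the title survives; only the passage tail is cut


title = "bert ranking".split()
passage = "a b c d".split()
expanded = expand_passage(title, passage, max_tokens=5)
```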

Domain adaptiveness

When an AI-based search model is trained on documents from a specific domain, it can be used as a search model specialized for that domain. For example, when BERT pre-trained on a general document collection such as Wikipedia is fine-tuned on documents from a specific domain, it learns general relationships between words in the pre-training stage and learns the vocabulary specific to the domain during fine-tuning.

A search model specialized for a specific domain may perform relatively worse on other domains, but its performance on its own domain improves.

FIG. 4 shows a computer device for performing the search method according to the present invention.

Since the search method and the training method according to the present invention have been described in detail with reference to FIGS. 1 to 3, the device 100 for performing the search method is described only briefly with reference to FIG. 4.

Referring to FIG. 4, the computer device 100 includes a processor 110, a non-volatile storage unit 120 that stores programs and data, a volatile memory 130 that holds running programs, an input/output unit 140 that exchanges information with a user, and a bus that serves as the internal communication path between these components. The running programs may include an operating system and various applications. Although not shown, the device also includes a power supply unit.

In the training stage, the AI-based search model is trained in the memory 130 using the training data stored in the storage unit 120. In the inference stage, the unsupervised-learning-based search model 210 and the AI-based search model 220 stored in the storage unit 120 are executed in the memory 130. The corpus is stored in the storage unit 120, and the search method is performed based on a query entered through the input/output unit 140.

Hereinafter, with reference to FIG. 5, a method for batch-processing the generation of a corpus for an information retrieval model and the training of the information retrieval model based on that corpus is described as an embodiment of the present invention.

Step of receiving parameters required for batch processing (S10)

The parameters needed by each of the processes below are received so that those processes can be handled as a batch. The parameters may be received in advance, or the user may be prompted for them before each process starts. When they are received in advance, everything from data preprocessing through training of the information retrieval model can be processed as a single batch.
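One way to collect such parameters up front is a single immutable configuration object, sketched below. The class name, every field name, and the default values are assumptions for illustration. Only the 500-token passage size and the 50% overlap echo examples given elsewhere in this description; the other numbers are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchParams:
    """Hypothetical parameter set received once in step S10."""
    passage_tokens: int = 500      # maximum passage length (S20)
    overlap_ratio: float = 0.5     # overlap between consecutive passages (S20)
    keywords_per_query: int = 3    # keywords per pseudo-query (S30), placeholder
    retrieve_n: int = 100          # passages retrieved per pseudo-query (S40), placeholder
    positives_m: int = 3           # top-m passages labelled positive (S50), placeholder
    negatives_p: int = 3           # bottom-p passages labelled negative (S50), placeholder
    loss_threshold: float = 0.05   # stopping criterion for training (S60), placeholder


params = BatchParams(overlap_ratio=0.25)   # override a single value, keep the rest
```

Freezing the dataclass keeps the values fixed for the whole run, which matches the idea that the batch should not need further user input once it has started.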

Corpus generation process (S20)

Given the documents to be searched, each document is divided into passages that either have a predetermined size or are delimited according to a predetermined rule.

The predetermined size is chosen by taking into account the amount of text that the AI-based information retrieval model to be applied can process. For example, BERT can process at most 512 tokens, so the passage size is determined after subtracting the tokens reserved for the query and for special tokens. If, for instance, the query is limited to 11 tokens, that is, to roughly 11 keywords, a total of 12 tokens including the special token [CLS] must be set aside. In that case the maximum length of a passage in the corpus is limited to 500 tokens.
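The token budget in this example works out as 512 - 11 - 1 = 500, which the short sketch below reproduces. The function name is hypothetical. The sketch follows the accounting given in the text, which reserves only [CLS]; a BERT query-passage input normally also carries [SEP] separator tokens, which would reduce the passage budget slightly further.

```python
def max_passage_tokens(model_limit, max_query_tokens, special_tokens=1):
    """Tokens left for the passage after reserving query and special tokens."""
    remaining = model_limit - max_query_tokens - special_tokens
    if remaining <= 0:
        raise ValueError("query and special tokens already exhaust the model limit")
    return remaining


# Accounting from the text: 512-token limit, 11 query tokens, [CLS] only.
passage_budget = max_passage_tokens(model_limit=512, max_query_tokens=11)
```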

As an example of a predetermined rule, a document may be divided paragraph by paragraph. The paragraphs of a document are separated by line-break (return) characters, so the document can be split into paragraphs at those characters.
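A minimal sketch of the paragraph rule, assuming plain-text input in which each paragraph ends with a line break. The function name is hypothetical, and dropping blank lines is an added convenience that the text does not mention.

```python
def split_into_paragraphs(document):
    """Split plain text at line breaks and drop empty lines."""
    # str.splitlines() handles "\n", "\r\n" and "\r" alike.
    return [line.strip() for line in document.splitlines() if line.strip()]


paragraphs = split_into_paragraphs("First.\n\nSecond.\r\nThird.")
```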

When a document is divided into passages of a predetermined size, content that should be found together may end up split across different passages. To prevent this, the fixed-length passages are made to overlap partially. With a 50% overlap, for example, each passage begins with the last 50% of the preceding passage. The overlap ratio need not be 50%; the user may choose it freely in view of the characteristics of the corpus. For the corpus generation process to run as a batch, however, the user should preferably decide on and enter the overlap ratio at the outset.
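The overlapping split can be sketched as a sliding window over a token sequence. The function name and the handling of the final, possibly shorter window are assumptions, and integers stand in for real tokens.

```python
def split_with_overlap(tokens, passage_len, overlap_ratio=0.5):
    """Split `tokens` into windows of `passage_len` that overlap by `overlap_ratio`."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    stride = max(1, int(passage_len * (1 - overlap_ratio)))  # step between window starts
    passages = []
    for start in range(0, len(tokens), stride):
        passages.append(tokens[start:start + passage_len])
        if start + passage_len >= len(tokens):   # this window already reaches the end
            break
    return passages


windows = split_with_overlap(list(range(10)), passage_len=4, overlap_ratio=0.5)
```

With a 50% overlap and a length of 4, each window starts two tokens after the previous one, so its first half repeats the second half of its predecessor.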

Generation of document-passage relationship data

So that search results can later be provided per document rather than per passage, the relationship between each document and the passages extracted from it is generated and stored. This process can be run as part of the batch, following the preceding processes.
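A sketch of how the relationship data might be built while the corpus is generated. The function name and the `doc_id#index` passage-identifier scheme are assumptions made for illustration.

```python
def build_passage_map(doc_passages):
    """doc_passages: dict doc_id -> list of passage texts.

    Returns (corpus, passage_to_doc): passage id -> text, and passage id -> doc id.
    """
    corpus, passage_to_doc = {}, {}
    for doc_id, passages in doc_passages.items():
        for i, text in enumerate(passages):
            pid = f"{doc_id}#{i}"        # hypothetical passage identifier
            corpus[pid] = text
            passage_to_doc[pid] = doc_id
    return corpus, passage_to_doc


corpus, passage_to_doc = build_passage_map({"docA": ["intro", "body"], "docB": ["only"]})
```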

Generation of an index file for the unsupervised-learning-based information retrieval model

An unsupervised-learning-based information retrieval model uses a pre-built index file to speed up search. For the unsupervised-learning-based information retrieval model used in the retrieval step, an index file is generated from the passages in the corpus. This process can be run as part of the batch, following the preceding processes.
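As an illustration of what such an index holds, the sketch below builds a small in-memory inverted index that maps each term to the passages containing it. It is not the index format of any particular search toolkit; the function name and the whitespace tokenization are assumptions.

```python
from collections import defaultdict


def build_inverted_index(corpus):
    """corpus: dict passage_id -> text. Returns term -> {passage_id: term frequency}."""
    index = defaultdict(dict)
    for pid, text in corpus.items():
        for term in text.lower().split():
            index[term][pid] = index[term].get(pid, 0) + 1
    return dict(index)


index = build_inverted_index({"p1": "a b a", "p2": "b c"})
```

At query time only the postings of the query terms need to be read, which is why building the index once in advance speeds up every later search.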

Pseudo-query generation process (S30)

After each document has been divided into passages, a predetermined keyword extraction technique is applied to each passage to extract one or more keywords or phrases as a pseudo-query. The keyword extraction technique should preferably be fixed in advance so that this process, too, can run as part of the batch after the preceding processes.
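The text leaves the keyword extraction technique open. As one possible choice, the sketch below ranks the words of a passage by a simple TF-IDF score and takes the top k as the pseudo-query. TF-IDF, the smoothing used, and all names are assumptions for illustration only.

```python
import math
from collections import Counter


def extract_pseudo_query(passage, corpus, k=3):
    """Return the k words of `passage` with the highest TF-IDF score."""
    vocab_per_passage = [set(p.lower().split()) for p in corpus]
    tf = Counter(passage.lower().split())

    def tfidf(term):
        df = sum(term in vocab for vocab in vocab_per_passage)   # passages containing the term
        return tf[term] * math.log((len(corpus) + 1) / (df + 1))  # smoothed IDF

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(tf, key=lambda t: (-tfidf(t), t))[:k]


toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the mat",
]
pseudo_query = extract_pseudo_query(toy_corpus[0], toy_corpus, k=2)
```

In the toy corpus, "the" occurs in every passage and scores zero, while "cat" occurs in only one passage and ranks first.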

Query-document relationship formation process (S40)

Each extracted pseudo-query is used as input to a known search model based on an unsupervised learning methodology, which extracts a predetermined number of passages from the corpus. Again, for batch processing, the number of passages to be extracted from the corpus should preferably be decided and entered in advance.
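BM25 is named later in this description as the retrieval model for the inference stage, so the sketch below uses a small, self-contained BM25 scorer to fetch the top n passages for a pseudo-query. It implements one common BM25 variant. The values of `k1` and `b`, the whitespace tokenization, and the function name are assumptions; a production system would more likely query a pre-built index.

```python
import math
from collections import Counter


def bm25_top_n(query, corpus, n, k1=1.2, b=0.75):
    """corpus: dict passage_id -> text. Returns the ids of the n best-scoring passages."""
    tokenized = {pid: text.lower().split() for pid, text in corpus.items()}
    total = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / total
    df = Counter(term for toks in tokenized.values() for term in set(toks))

    def score(tokens):
        tf = Counter(tokens)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (total - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            s += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        return s

    ranked = sorted(tokenized, key=lambda pid: score(tokenized[pid]), reverse=True)
    return ranked[:n]


toy_corpus = {
    "p1": "deep learning for search",
    "p2": "search engines rank documents",
    "p3": "cooking pasta at home",
}
top_passages = bm25_top_n("deep search", toy_corpus, n=2)
```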

Pseudo-labeling process (S50)

Among the passages extracted for each pseudo-query, the top m passages are classified as positive training data and the bottom p passages as negative training data. If needed for training, the passages other than the top m and bottom p may be classified as neutral training data.
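The labeling rule above can be sketched as follows, with m and p as described in the text. The function name, the label strings, and the check that m + p does not exceed the number of retrieved passages are assumptions.

```python
def pseudo_label(ranked_passage_ids, m, p):
    """Label the top m passages positive, the bottom p negative, the rest neutral."""
    total = len(ranked_passage_ids)
    if m + p > total:
        raise ValueError("m + p must not exceed the number of retrieved passages")
    labels = {}
    for rank, pid in enumerate(ranked_passage_ids):   # rank 0 is the most relevant
        if rank < m:
            labels[pid] = "positive"
        elif rank >= total - p:
            labels[pid] = "negative"
        else:
            labels[pid] = "neutral"
    return labels


labels = pseudo_label(["a", "b", "c", "d", "e", "f"], m=2, p=2)
```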

Hyperparameters such as m and p are usually adjusted after training while reviewing the performance of the information retrieval model, but they should preferably be fixed in advance so that this process can run as part of the batch after the preceding processes.

Training process (S60)

The AI-based information retrieval model is trained using the generated pseudo-labels. Once the pseudo-labels are available, this process can run as part of the batch after the preceding processes. The criterion for stopping training should, however, preferably be fixed in advance. For example, training may be stopped when the training loss falls below a predetermined value.
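The loss-threshold stopping criterion can be sketched independently of any training framework. Here `train_step` is a hypothetical callable that performs one optimization step and returns the current loss, and `max_steps` is an added safety cap that the text does not mention.

```python
def train_until_converged(train_step, loss_threshold, max_steps=10_000):
    """Call train_step() until it returns a loss below loss_threshold."""
    loss = float("inf")
    for step in range(1, max_steps + 1):
        loss = train_step()
        if loss < loss_threshold:   # predetermined stopping criterion
            return step, loss
    return max_steps, loss          # safety cap reached without meeting the criterion


# Toy stand-in for a real training step: replays a fixed loss curve.
losses = iter([0.9, 0.5, 0.2, 0.05, 0.01])
steps_run, final_loss = train_until_converged(lambda: next(losses), loss_threshold=0.1)
```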

Inference stage

Once data preprocessing and training for the AI-based information retrieval model have been batch-processed as described above, the user can perform searches using the trained information retrieval model.

When the user enters a query consisting of several keywords, a predetermined number of passages are first retrieved from the corpus using BM25. The retrieved passages are then passed, together with the user's query, to the trained information retrieval model, which re-ranks them. The documents corresponding to the re-ranked passages are looked up, and the documents are ordered to match the order of the passages.

The foregoing detailed description should not be construed as restrictive in any respect and should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all modifications within the equivalent scope of the present invention are included in the scope of the present invention.

Claims

1. A computer-implemented method for batch-processing data preprocessing and training for an information retrieval model, in which generation of a corpus for the information retrieval model and training of the information retrieval model based on the corpus are processed as a batch, the method including:
(a) receiving parameters necessary for batch processing, the parameters being predetermined based on information on a plurality of documents to be searched;
(b) generating a corpus by dividing each of the plurality of documents into passages based on the parameters input in step (a);
(c) extracting a keyword to be used as a pseudo-query from each of at least some of the passages included in the corpus generated in step (b);
(d) extracting the number of passages specified in the parameters input in step (a), using the pseudo-query extracted in step (c) and an unsupervised-learning-based information retrieval model;
(e) generating a pseudo-label based on the hyperparameters specified in the parameters input in step (a); and
(f) training an artificial intelligence-based information retrieval model using the pseudo-label generated in step (e).

2. The method according to claim 1, characterized in that
the parameters required for the batch processing include a parameter related to a method for dividing documents in step (a), a parameter related to the keyword extracted in step (c), a parameter related to the number of passages to be extracted in step (d), and hyperparameters for generating the pseudo-label in step (e).

3. The method according to claim 2, characterized in that
the parameter related to the method for dividing the documents includes information on the size of the passages into which the documents are to be divided and on the overlap ratio between passages.

4. The method according to claim 1, characterized in that it further comprises
storing the relationship between each document and the passages divided from it.

5. An apparatus for batch-processing data preprocessing and training for an information retrieval model, in which generation of a corpus for the information retrieval model and training of the information retrieval model based on the corpus are processed as a batch, the apparatus including:
at least one processor; and
at least one memory for storing computer-executable instructions,
wherein the computer-executable instructions stored in the at least one memory, when executed by the at least one processor, cause the following steps to be performed:
(a) receiving parameters necessary for batch processing, the parameters being predetermined based on information on a plurality of documents to be searched;
(b) generating a corpus by dividing each of the plurality of documents into passages based on the parameters input in step (a);
(c) extracting a keyword to be used as a pseudo-query from each of at least some of the passages included in the corpus generated in step (b);
(d) extracting the number of passages specified in the parameters input in step (a), using the pseudo-query extracted in step (c) and an unsupervised-learning-based information retrieval model;
(e) generating a pseudo-label based on the hyperparameters specified in the parameters input in step (a); and
(f) training an artificial intelligence-based information retrieval model using the pseudo-label generated in step (e).

6. The apparatus according to claim 5, characterized in that
the parameters required for the batch processing include a parameter related to a method for dividing documents in step (a), a parameter related to the keyword extracted in step (c), a parameter related to the number of passages to be extracted in step (d), and hyperparameters for generating the pseudo-label in step (e).

7. The apparatus according to claim 6, characterized in that
the parameter related to the method for dividing the documents includes information on the size of the passages into which the documents are to be divided and on the overlap ratio between passages.

8. The apparatus according to claim 5, characterized in that it further comprises
storing the relationship between each document and the passages divided from it.

9. A computer program stored in a non-transitory storage medium for batch-processing data preprocessing and training for an information retrieval model, in which generation of a corpus for the information retrieval model and training of the information retrieval model based on the corpus are processed as a batch, the computer program including instructions that, when executed by a processor, cause the following steps to be performed:
(a) receiving parameters necessary for batch processing, the parameters being predetermined based on information on a plurality of documents to be searched;
(b) generating a corpus by dividing each of the plurality of documents into passages based on the parameters input in step (a);
(c) extracting a keyword to be used as a pseudo-query from each of at least some of the passages included in the corpus generated in step (b);
(d) extracting the number of passages specified in the parameters input in step (a), using the pseudo-query extracted in step (c) and an unsupervised-learning-based information retrieval model;
(e) generating a pseudo-label based on the hyperparameters specified in the parameters input in step (a); and
(f) training an artificial intelligence-based information retrieval model using the pseudo-label generated in step (e).