KR20210032253A

KR20210032253A - System and method for searching documents and providing an answer to a natural language question

Info

Publication number: KR20210032253A
Application number: KR1020190113798A
Authority: KR
Inventors: 백승빈; 이명기; 이정환
Original assignee: (주)플랜아이
Priority date: 2019-09-16
Filing date: 2019-09-16
Publication date: 2021-03-24
Also published as: KR102256007B1

Abstract

The present invention relates to a system for searching documents and providing responses through natural language queries. The system comprises: a first query morpheme analyzer that tokenizes a natural language query when the query is input by a user; a natural language query & document matching engine that uses the tokenized query to perform a similarity check against all documents in a database and selects at least one document; and an in-document natural language response location extraction engine that infers, for each selected document, the location of the response that matches the query.

Description

System and method for searching documents and providing an answer to a natural language question

The present invention relates to a system and method for searching documents and providing responses through natural language queries, and more particularly to a system and method that make it possible to search a large number of documents with a natural language query and to extract the location of the response within a document.

A general search system finds and lists the documents or content in a database whose form is most similar to the search term, but it cannot find a response to a natural language query (a query phrased the way a person would ask it in everyday life).

Document search means finding relevant documents in a set of documents based on a search keyword entered by the user. To find documents containing the search keyword quickly in a large document set, a structure called an inverted index is generally used. An extended version of the inverted index, the phrase search method, also pays attention to word positions: an inverted index is built first, and the position of each word is recorded in addition to the document number, which makes phrase search possible.
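The positional inverted index described above can be illustrated with a minimal sketch. The function names `build_positional_index` and `phrase_search` and the whitespace tokenization are assumptions made for illustration; the patent does not specify an implementation, and a real system would tokenize with a morpheme analyzer.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} (hypothetical helper)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Whitespace splitting stands in for real morphological analysis.
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents where the phrase terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # A candidate document must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # Term i of the phrase must sit at position start + i.
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    1: "natural language query search",
    2: "query language natural",
    3: "a natural language question",
}
index = build_positional_index(docs)
print(phrase_search(index, "natural language"))
```

Storing positions alongside document numbers is what lets the lookup reject document 2, which contains both words but not in the requested order.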

However, in ordinary document search the user must compose a search term that represents the document they need, browse the results and select the content that seems to contain it, and then look for the needed document within the selected content. Because this process depends on the individual's search skills, users may fail to find the information they need or may spend a great deal of time looking for it.

In the field of natural language processing (NLP), most models used to be based on recurrent neural networks (RNNs). Recently, however, transformer technology has emerged, which processes the input data in parallel all at once and can handle semantic relationships such as context. BERT (Bidirectional Encoder Representations from Transformers), a more advanced model built on the transformer encoder, has since appeared.

However, natural language processing requires a very large amount of computing resources and is therefore difficult to apply to searches over a large number of documents, such as web search.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2018, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

An object of the present invention is to make it possible to search a large number of documents with a natural language query and to extract the location of the response within a document.

Another object of the present invention is to make search by natural language query and extraction of the in-document response location possible for very large searches such as web search.

A further object of the present invention is to enable full-text search with natural language response location extraction even for a database that stores a large amount of data.

The problems to be solved by the present invention are not limited to the objects above. Other technical problems not explicitly stated above will be readily understood by those of ordinary skill in the art to which the present invention pertains from the configuration and operation of the invention described below.

To solve the above problems, the present invention includes the following configurations.

The present invention relates to a system for searching documents and providing responses through natural language queries, comprising: a first query morpheme analyzer that tokenizes a natural language query when the query is input by a user; a natural language query & document matching engine that uses the tokenized query to perform a similarity check against all documents in a database and selects at least one document; and an in-document natural language response location extraction engine that infers, for each selected document, the location of the response that matches the query.

The present invention further includes a second query morpheme analyzer that tokenizes the natural language query when the query is input by the user, and the in-document natural language response location extraction engine uses the query tokenized by the second query morpheme analyzer to infer, for each selected document, the location of the response that matches the query.

The first query morpheme analyzer of the present invention analyzes parts of speech in the natural language query, and the second query morpheme analyzer analyzes graphemes in the natural language query.

The in-document natural language response location extraction engine of the present invention performs embedding of the positions of paragraphs within the document for each selected document.

The in-document natural language response location extraction engine of the present invention uses the embedded paragraph positions to infer, for each selected document, the location of the response that matches the query.

The natural language query & document matching engine of the present invention selects at least one document using the tokenized natural language query and applies a scoring algorithm.

The present invention further includes a document morpheme analyzer that tokenizes and indexes a large number of collected documents and stores them in the database.

The present invention also relates to a method for searching documents and providing responses through natural language queries, comprising: tokenizing, in a first query morpheme analyzer, a natural language query when the query is input by a user; performing, in a natural language query & document matching engine, a similarity check against all documents in a database using the tokenized query and selecting at least one document; and inferring, in an in-document natural language response location extraction engine, the location of the response that matches the query for each selected document.

The method further includes tokenizing, in a second query morpheme analyzer, the natural language query when the query is input by the user, and the in-document natural language response location extraction engine uses the query tokenized by the second query morpheme analyzer to infer, for each selected document, the location of the response that matches the query.

The present invention may also be a computer program stored in a medium to execute the above method for searching documents and providing responses through natural language queries.

One effect of the present invention is that it makes it possible to search a large number of documents with a natural language query and to extract the location of the response within a document.

Another effect of the present invention is that search by natural language query and extraction of the in-document response location become possible for very large searches such as web search.

A further effect of the present invention is that full-text search with natural language response location extraction becomes possible even for a database that stores a large amount of data.

The effects of the present invention are not limited to those above. Other effects not explicitly stated above will be readily understood by those of ordinary skill in the art to which the present invention pertains from the configuration and operation of the invention described below.

FIG. 1 shows a general search system that finds and lists the documents or content in a database whose form is most similar to a search term.
FIG. 2 shows a general search system that searches documents by applying natural language processing.
FIG. 3 shows the system for searching documents and providing responses through natural language queries according to the present invention.
FIG. 4 shows the internal configuration of the in-document natural language response location extraction engine of the present invention.
FIG. 5 is a flowchart of the method for searching documents and providing responses through natural language queries according to the present invention.

The overall configuration and operation according to preferred embodiments of the present invention are described below. These embodiments are illustrative and do not limit the configuration and operation of the present invention. Other configurations and operations not explicitly shown in the embodiments may also be regarded as falling within the technical idea of the present invention where a person of ordinary skill in the art to which the invention pertains can readily understand them from the embodiments described below.

Unlike general-purpose search engines, full-text search engines have been applied to databases that store a small amount of data. The present invention makes a full-text search engine applicable to databases that store a large amount of data as well, and makes it possible to search by natural language query and to extract the location of the response within a document.

FIG. 1 shows a general search system that finds and lists the documents or content in a database whose form is most similar to a search term.

Referring to FIG. 1, a general search system for web search includes a document morpheme analyzer (100), a document database (200), a query morpheme analyzer (300), and a natural language query & document matching engine (400). For web search, a large number of documents (document 1, document 2, document 3, …) are collected from various sites. To make the collected documents easy to search, the document morpheme analyzer (100) performs morphological analysis, indexing, and similar processing on them and stores the results in the document database (200).

When a natural language query is received from a user, the query morpheme analyzer (300) tokenizes the query, mainly by performing morphological analysis. The natural language query & document matching engine (400) searches the documents stored in the document database (200): it uses the tokenized query to perform a similarity check against all documents in the database and selects at least one document.

To quickly find documents containing a search keyword among the large number of documents stored in the document database (200), an inverted index is generally used, and phrase search can also be enabled by recording word positions alongside document numbers in the inverted index. However, phrase search based on natural language processing over a large number of documents can consume excessive computing resources. To reduce this consumption, the present invention can split the search into stages, as described below.

The natural language query & document matching engine (400) can also select at least one document using TF-IDF (term frequency-inverse document frequency) or the BM25 algorithm as its scoring algorithm.

FIG. 2 shows a general search system that searches documents by applying natural language processing.

Referring to FIG. 2, for a document database (200) that stores a small number of documents, the in-document natural language response location extraction engine (500) can easily infer the response that matches a query and its location. For a database (200) that stores a large number of documents, however, the engine (500) may consume excessive computing resources to infer the matching response and its location. To reduce this consumption, the present invention can split the search into stages, as described below.

FIG. 3 shows the system for searching documents and providing responses through natural language queries according to the present invention.

Referring to FIG. 3, the present invention relates to a system for searching documents and providing responses through natural language queries, and includes a document morpheme analyzer (100), a document database (200), a first query morpheme analyzer (310), a second query morpheme analyzer (320), a natural language query & document matching engine (400), and an in-document natural language response location extraction engine (500).

When a natural language query is input by a user, the first query morpheme analyzer (310) tokenizes the query through morphological analysis. To quickly find documents related to the user's query among a large number of documents, the analysis must reflect the characteristics of the Korean language and not only the literal meaning of the text. Accordingly, the first query morpheme analyzer (310) analyzes parts of speech in the user's query, so that words with the same written form but different parts of speech are also taken into account.

The natural language query & document matching engine (400) can use the tokenized natural language query to perform a similarity check against all documents in the document database (200) and select at least one document. Because the first query morpheme analyzer (310) analyzes parts of speech in the user's query and accounts for words that share a written form but differ in part of speech, documents related to the user's query can be found more quickly among a large number of documents.

The natural language query & document matching engine (400) ranks the documents it finds by applying a scoring algorithm that uses the term frequency of the search keyword and the inverse document frequency. The TF-IDF (term frequency-inverse document frequency) algorithm or the BM25 algorithm can be used as the scoring algorithm.
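As a rough illustration of how term frequency and inverse document frequency combine into a ranking score, the following sketch implements one common formulation of BM25. The function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions for illustration; the patent names BM25 but does not specify parameters or a variant.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query (illustrative BM25)."""
    if not docs_tokens:
        return []
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [["cat", "sat"], ["dog", "barked", "loudly"], ["cat", "cat", "dog"]]
scores = bm25_scores(["cat"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In the patent's setting the token lists would come from the document morpheme analyzer and the first query morpheme analyzer rather than from hand-written lists.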

When a natural language query is input by a user, the second query morpheme analyzer (320) also tokenizes the query through morphological analysis. To locate the response related to the user's query accurately within the documents selected by the natural language query & document matching engine (400), it must be possible to index every individual character. Accordingly, the second query morpheme analyzer (320) analyzes and uses the graphemes of the user's query.
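One way to picture grapheme-level analysis of Korean text is to decompose each precomposed Hangul syllable into its letters using the arithmetic layout of the Unicode Hangul Syllables block. This is only a sketch: the function name `to_jamo` is hypothetical, and the patent does not say whether its second analyzer works on individual letters, on whole syllable characters, or on some other unit.

```python
# Letter tables in Unicode order for the Hangul Syllables block (U+AC00..U+D7A3).
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")          # 19 initial consonants
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")     # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

HANGUL_BASE = 0xAC00
HANGUL_LAST = 0xD7A3


def to_jamo(text):
    """Split Hangul syllables into letters; pass other characters through."""
    out = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            offset = code - HANGUL_BASE
            initial = offset // (21 * 28)
            vowel = (offset % (21 * 28)) // 28
            final = offset % 28
            out.append(CHOSEONG[initial])
            out.append(JUNGSEONG[vowel])
            if final:  # index 0 means the syllable has no final consonant
                out.append(JONGSEONG[final])
        else:
            out.append(ch)
    return out


print(to_jamo("한글"))
```

Tokens at this granularity give every character its own index position, which is the property the passage above asks for when locating a response precisely.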

The in-document natural language response location extraction engine (500) infers the location of the response that matches the query for each of the documents selected by the natural language query & document matching engine (400). For each selected document, it performs embedding of the positions of paragraphs within the document and then uses the embedded paragraph positions to infer the location of the matching response.
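The patent does not describe how the engine turns model outputs into a location. One widely used approach in BERT-style question answering is to score every token as a possible start and a possible end of the answer and then pick the highest-scoring valid span. The sketch below assumes that approach; the function name `best_span`, the `max_len` limit, and the example scores are hypothetical.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end) token indices maximizing start score + end score.

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        last = min(i + max_len, len(end_scores))
        for j in range(i, last):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


# Made-up per-token scores standing in for a reader model's output.
start = [0.1, 2.0, 0.3, 0.0]
end = [0.0, 0.5, 1.5, 3.0]
print(best_span(start, end, max_len=2))
```

The returned token indices can be mapped back to character offsets or paragraph numbers in the selected document to report where the response is.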

The in-document natural language response location extraction engine (500) does not embed paragraph positions for every document in the document database (200). Instead, it embeds paragraph positions only for the small number of selected documents and infers the location of the matching response from them. As a result, the present invention makes it possible to search a large number of documents with a natural language query and to extract the location of the response within a document.
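The staged design can be sketched as a cheap ranking pass over all documents followed by an expensive pass over only the top few. Everything in this sketch is a stand-in: `cheap_score` is a plain term count rather than TF-IDF or BM25, `locate_answer` is a word-overlap heuristic rather than a BERT reader, and `top_k` is a hypothetical parameter.

```python
def cheap_score(query_tokens, text):
    """Stage 1 stand-in: count occurrences of query tokens in the text."""
    words = text.lower().split()
    return sum(words.count(t) for t in query_tokens)


def locate_answer(query_tokens, doc_text):
    """Stage 2 stand-in for the neural reader: index of the best paragraph."""
    paragraphs = doc_text.split("\n\n")
    overlaps = [cheap_score(query_tokens, p) for p in paragraphs]
    return max(range(len(paragraphs)), key=overlaps.__getitem__)


def search(query, docs, top_k=2):
    """Rank all documents cheaply, then run the costly step on top_k only."""
    q = query.lower().split()
    ranked = sorted(docs, key=lambda d: cheap_score(q, docs[d]), reverse=True)
    selected = [d for d in ranked[:top_k] if cheap_score(q, docs[d]) > 0]
    return [(d, locate_answer(q, docs[d])) for d in selected]


docs = {
    "a": "intro text\n\nbert uses transformer encoders",
    "b": "cooking recipes",
    "c": "transformer models\n\nmore on bert and transformer",
}
print(search("bert transformer", docs))
```

The cost of the second stage then depends on `top_k` rather than on the size of the database, which is the resource saving the passage above describes.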

Web search, moreover, collects a large number of documents that differ in display format and in the type of site they come from. Even for such a large number of documents, the present invention can search by natural language query and extract the location of the response within a document.

FIG. 4 shows the internal configuration of the in-document natural language response location extraction engine of the present invention.

Referring to FIG. 4, the in-document natural language response location extraction engine (500) of the present invention can be implemented using a BERT (Bidirectional Encoder Representations from Transformers) model, in which the transformer is used bidirectionally. The BERT model is a trained model made of stacked transformer encoders, and embeddings can be produced for the tokens, for the position of each sentence, and for the position of each word within a sentence.
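In the BERT paper cited above, the input representation of each token is the element-wise sum of a token embedding, a segment (sentence) embedding, and a position embedding. The sketch below shows only that summation with tiny hand-written tables; the function name `embed` and all table values are hypothetical, and whether the patent's engine follows this exact scheme is an assumption.

```python
def embed(token_ids, segment_ids, tok_table, seg_table, pos_table):
    """Sum token, segment, and position vectors for each input position."""
    out = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [
            t + s + p
            for t, s, p in zip(tok_table[tok], seg_table[seg], pos_table[pos])
        ]
        out.append(vec)
    return out


# Toy lookup tables with 2-dimensional vectors (real models learn these).
tok_table = [[1.0, 0.0], [0.0, 1.0]]   # one row per vocabulary id
seg_table = [[0.1, 0.1], [0.2, 0.2]]   # segment 0 = query, segment 1 = document
pos_table = [[0.0, 0.0], [0.5, 0.5]]   # one row per sequence position

vectors = embed([1, 0], [0, 1], tok_table, seg_table, pos_table)
print(vectors)
```

The summed vectors are what the stacked transformer encoders consume, so the same token can be represented differently depending on which sentence it belongs to and where it sits.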

FIG. 5 is a flowchart of the method for searching documents and providing responses through natural language queries according to the present invention.

Referring to FIG. 5, the present invention relates to a method for searching documents and providing responses through natural language queries. The document morpheme analyzer (100) tokenizes and indexes a large number of documents and stores them in the document database (200) (S100), and the first query morpheme analyzer (310) tokenizes a natural language query when the query is input by a user (S200).

The natural language query & document matching engine (400) uses the tokenized natural language query to perform a similarity check against all documents in the document database (200) and selects at least one document (S300). Because the first query morpheme analyzer (310) analyzes parts of speech in the user's query and accounts for words that share a written form but differ in part of speech, documents related to the user's query can be found more quickly among a large number of documents.

When a natural language query is input by a user, the second query morpheme analyzer (320) tokenizes the query through morphological analysis (S400). To locate the response related to the user's query accurately within the documents selected by the natural language query & document matching engine (400), it must be possible to index every individual character. Accordingly, the second query morpheme analyzer (320) analyzes and uses the graphemes of the user's query.

The in-document natural language response location extraction engine uses the natural language query tokenized by the second query morpheme analyzer to infer, for each selected document, the location of the response that matches the query (S400).

The method of the present invention for searching documents and providing responses through natural language queries may also be implemented as a computer program stored in a medium.

100: document morpheme analyzer
200: document database
300: query morpheme analyzer
310: first query morpheme analyzer
320: second query morpheme analyzer
400: natural language query & document matching engine
500: in-document natural language response location extraction engine

Claims

A system for searching documents and providing responses through natural language queries, comprising:
a first query morpheme analyzer that tokenizes a natural language query when the query is input by a user;
a natural language query & document matching engine that uses the tokenized natural language query to perform a similarity check against all documents in a database and selects at least one document; and
an in-document natural language response location extraction engine that infers, for each of the selected documents, the location of a response matching the query.

The system of claim 1,
further comprising a second query morpheme analyzer that tokenizes the natural language query when the query is input by the user,
wherein the in-document natural language response location extraction engine uses the natural language query tokenized by the second query morpheme analyzer to infer, for each of the selected documents, the location of the response matching the query.

The system of claim 2,
wherein the first query morpheme analyzer analyzes parts of speech in the natural language query, and
the second query morpheme analyzer analyzes graphemes in the natural language query.

The system of claim 1,
wherein the in-document natural language response location extraction engine performs embedding of the positions of paragraphs within the document for each of the selected documents.

The system of claim 4,
wherein the in-document natural language response location extraction engine uses the embedded paragraph positions to infer, for each of the selected documents, the location of the response matching the query.

The system of claim 1,
wherein the natural language query & document matching engine selects at least one document using the tokenized natural language query and applies a scoring algorithm.

The system of claim 1,
further comprising a document morpheme analyzer that tokenizes and indexes a large number of collected documents and stores them in the database.

A method for searching documents and providing responses through natural language queries, comprising:
tokenizing, in a first query morpheme analyzer, a natural language query when the query is input by a user;
performing, in a natural language query & document matching engine, a similarity check against all documents in a database using the tokenized natural language query, and selecting at least one document; and
inferring, in an in-document natural language response location extraction engine, the location of a response matching the query for each of the selected documents.

The method of claim 8,
further comprising tokenizing, in a second query morpheme analyzer, the natural language query when the query is input by the user,
wherein the in-document natural language response location extraction engine uses the natural language query tokenized by the second query morpheme analyzer to infer, for each of the selected documents, the location of the response matching the query.

A computer program stored in a medium to execute the method of claim 8 or 9 for searching documents and providing responses through natural language queries.