KR20230101214A - System and Method of paragraph exploration and paragraph selection for Machine Reading Comprehension

Info

Publication number: KR20230101214A
Application number: KR1020210191123A
Authority: KR
Inventors: 이무봉; 김경환
Original assignee: (주)노스스타컨설팅; 이무봉; 김경환
Priority date: 2021-12-29
Filing date: 2021-12-29
Publication date: 2023-07-06

Abstract

The present invention is a system for searching for and selecting paragraphs on which machine reading comprehension is to be performed, and is executed by a computing device that uses a database storing search target documents and a control server with computing capability. The system comprises: a question input unit into which a preset question is input; a paragraph search unit that searches for and selects paragraphs using the similarity between the input question and the paragraphs of the search target documents; and a paragraph transmission unit that transmits the machine reading comprehension target paragraphs selected by the paragraph search unit to a machine reading comprehension model according to their transmission rank. The paragraph search unit comprises: a valid paragraph unit that calculates similarity scores between the question and the paragraphs and selects valid paragraphs; a paragraph group unit that groups the valid paragraphs and calculates group scores; a machine reading comprehension target paragraph unit that selects analysis target documents using the group scores and additionally selects machine reading comprehension target paragraphs within the analysis target documents; and a transmission rank calculation unit that calculates the transmission rank of the machine reading comprehension target paragraphs.

Description

Paragraph search and paragraph selection system and method for machine reading comprehension {System and Method of paragraph exploration and paragraph selection for Machine Reading Comprehension}

The present invention relates to a paragraph search and paragraph selection system and method, and more specifically to a paragraph search and paragraph selection system and method for machine reading comprehension.

Conventionally, to obtain specialized knowledge in a particular field online, a user entered a word or sentence of interest into an electronic library or Internet portal search. The user then had to open each result returned by the search engine to check whether it contained the information needed.

Knowledge retrieval in a specialized field is difficult because few books exist on the field and it uses many technical terms, which tends to make natural language processing of questions hard. For this reason, many systems collect books and related materials on a particular field, store them in a database, and make them searchable. More recently, there have been attempts to develop question answering systems using machine reading comprehension technology.

Machine reading comprehension refers to an artificial intelligence method in which a machine learns the ability to read and understand documents through questions and, when given a new question, infers and outputs the appropriate answer present in a document.

Recent progress in natural language processing has been driven by BERT (Bidirectional Encoder Representations from Transformers), a publicly released pre-trained language model. For Korean, whose morphology differs from that of English and for which BERT alone provides insufficient information, combining BERT with additional models trained on Korean corpora has achieved top performance in question answering.

When operating a machine reading comprehension service that answers questions, two approaches have been proposed for finding documents likely to contain the answer to the question given as model input: feeding every document in the document database into the model, or first screening the documents likely to contain the answer by similarity and then searching only those for the answer.

In such services, the screening step that finds candidate documents for a question has typically relied on vector-based cosine similarity or keyword-based similarity. Documents or paragraphs found through this similarity were then input to the machine reading comprehension model to infer the answer.

Operating a machine reading comprehension service with conventional technology requires sophisticated natural language preprocessing of the database, together with high-performing cosine similarity and keyword similarity techniques, in order to find answers quickly.

Moreover, a program that infers answers with a machine reading comprehension model consumes substantial system resources, so scanning a large number of documents to answer a single question is inefficient. For resource efficiency, a technique is needed that can find the paragraphs and documents likely to contain the answer even when the natural language preprocessing or similarity calculation is not highly sophisticated.

(Document 1) Korean Registered Patent Publication No. 10-2292040 (2021.08.13)

The paragraph search and paragraph selection system and method for machine reading comprehension according to the present invention address the following objectives.

First, to extract machine reading comprehension target paragraphs, that is, paragraphs similar to the input question, and deliver them to the machine reading comprehension model.

Second, to extract valid paragraphs while shortening the time needed to select the machine reading comprehension target paragraphs delivered to the model.

Third, to shorten the time needed to find the answer by using the machine reading comprehension target paragraphs.

The objectives of the present invention are not limited to those mentioned above, and other objectives not mentioned will be clearly understood by those skilled in the art from the description below.

The present invention is a system for searching for and selecting paragraphs on which machine reading comprehension is to be performed, executed by a computing device that uses a database storing search target documents and a control server with computing capability. The system comprises: a question input unit into which a preset question is input; a paragraph search unit that searches for and selects paragraphs using the similarity between the input question and the paragraphs of the search target documents; and a paragraph transmission unit that transmits the machine reading comprehension target paragraphs selected by the paragraph search unit to a machine reading comprehension model according to their transmission rank. The paragraph search unit may comprise: a valid paragraph unit that calculates similarity scores between the question and the paragraphs and selects valid paragraphs; a paragraph group unit that groups the valid paragraphs and calculates group scores; a machine reading comprehension target paragraph unit that selects analysis target documents using the group scores and additionally selects machine reading comprehension target paragraphs within the analysis target documents; and a transmission rank calculation unit that calculates the transmission rank of the machine reading comprehension target paragraphs.

In the present invention, the valid paragraph unit may include a similarity calculation unit that extracts all paragraphs contained in the search target documents and calculates a similarity score between each extracted paragraph and the input search question.

In the present invention, the similarity calculation unit may calculate the similarity score using vector-based cosine similarity or keyword-based similarity.

In the present invention, the valid paragraph unit may include a valid paragraph selection unit that sorts the paragraphs by the similarity scores calculated by the similarity calculation unit and selects either the paragraphs whose similarity score is at or above a preset value or a preset number of paragraphs in order of similarity score.

In the present invention, the paragraph group unit may include a paragraph grouping unit that groups the paragraphs selected by the valid paragraph unit by the document to which each paragraph belongs.

In the present invention, the paragraph group unit may include a group score calculation unit that calculates a group score as the average of the similarity scores of the paragraphs grouped by the paragraph grouping unit.

In the present invention, the machine reading comprehension target paragraph unit may include a document selection unit that sorts the documents by the group scores calculated by the paragraph group unit and selects documents whose group score is at or above a preset value as analysis target documents.

In the present invention, when the number of documents whose group score is at or above the preset value exceeds a preset maximum number of documents, the document selection unit may select only up to the preset maximum number of documents.
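The document selection rule described above (a group score threshold combined with a cap on the number of documents) can be sketched as follows. This is only an illustrative sketch: the function name `select_analysis_documents`, its parameters, and the sample scores are assumptions and do not appear in the source.

```python
def select_analysis_documents(
    group_scores: dict[str, float],
    min_group_score: float,
    max_documents: int,
) -> list[str]:
    """Return the IDs of the documents to analyze, best group score first."""
    # Sort documents by group score, highest first.
    ordered = sorted(group_scores.items(), key=lambda item: item[1], reverse=True)
    # Keep only documents at or above the preset group score.
    qualifying = [doc_id for doc_id, score in ordered if score >= min_group_score]
    # Cap the selection at the preset maximum number of documents.
    return qualifying[:max_documents]


# Hypothetical group scores, used only for illustration.
selected = select_analysis_documents(
    {"doc-a": 93.5, "doc-b": 88.0, "doc-c": 97.2, "doc-d": 91.0},
    min_group_score=90.0,
    max_documents=2,
)
print(selected)  # ['doc-c', 'doc-a']
```

Three documents pass the threshold in this example, but the cap keeps only the two with the highest group scores.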

In the present invention, the machine reading comprehension target paragraph unit may include a paragraph addition unit that, within each document selected by the document selection unit, sorts the remaining paragraphs other than the valid paragraphs by similarity score and adds a preset number of them as machine reading comprehension target paragraphs.
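A minimal sketch of the paragraph addition step for a single selected document is given below. It assumes that the remaining paragraphs are sorted in descending order of similarity score, which the text implies but does not state; the function name `add_paragraphs`, the integer paragraph IDs, and the sample scores are hypothetical.

```python
def add_paragraphs(
    paragraph_scores: dict[int, float],
    valid_ids: set[int],
    extra_count: int,
) -> list[int]:
    """Pick extra paragraphs from one selected document.

    paragraph_scores maps every paragraph ID in the document to its
    similarity score; valid_ids holds the paragraphs already chosen as valid.
    """
    # Consider only the paragraphs that were not selected as valid paragraphs.
    remaining = [
        (pid, score) for pid, score in paragraph_scores.items() if pid not in valid_ids
    ]
    # Assumed ordering: highest similarity score first.
    remaining.sort(key=lambda item: item[1], reverse=True)
    # Add at most the preset number of paragraphs.
    return [pid for pid, _ in remaining[:extra_count]]


# Hypothetical scores for one document; paragraph 1 is already a valid paragraph.
added = add_paragraphs({1: 95.0, 2: 70.0, 3: 82.0, 4: 60.0}, valid_ids={1}, extra_count=2)
print(added)  # [3, 2]
```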

In the present invention, the transmission rank calculation unit may calculate, for each machine reading comprehension target paragraph selected by the machine reading comprehension target paragraph unit, a transmission rank score according to Relation 1 below.

[Relation 1]

Transmission rank value = (paragraph rank value within group × 100) + group rank value
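Relation 1 can be illustrated with a short sketch. The source does not state whether smaller or larger values are transmitted first; the code below assumes that rank 1 is the best rank and that smaller transmission rank values are sent first. The function names and the sample tuples are hypothetical.

```python
def transmission_rank_value(paragraph_rank: int, group_rank: int) -> int:
    """Relation 1: (paragraph rank within group x 100) + group rank."""
    return paragraph_rank * 100 + group_rank


def order_for_transmission(
    paragraphs: list[tuple[str, int, int]],
) -> list[tuple[str, int, int]]:
    """Sort (paragraph_id, paragraph_rank, group_rank) tuples for sending.

    Assumption: a smaller transmission rank value is transmitted earlier.
    """
    return sorted(paragraphs, key=lambda p: transmission_rank_value(p[1], p[2]))


# Hypothetical paragraphs: (ID, rank within its group, rank of its group).
queue = order_for_transmission([("a", 2, 1), ("b", 1, 2), ("c", 1, 1)])
print([(pid, transmission_rank_value(pr, gr)) for pid, pr, gr in queue])
# [('c', 101), ('b', 102), ('a', 201)]
```

Under these assumptions the top-ranked paragraph of every group is sent before any second-ranked paragraph, and the values stay distinct as long as group rank values remain below 100.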

The present invention is also a method for searching for and selecting paragraphs on which machine reading comprehension is to be performed, executed by a computing device that uses a database storing search target documents and a control server with computing capability. The computing device performs: step S100, in which a preset question is input to a question input unit; step S200, in which a paragraph search unit searches for and selects paragraphs using the similarity between the input question and the paragraphs of the search target documents; and step S300, in which a paragraph transmission unit transmits the machine reading comprehension target paragraphs selected by the paragraph search unit to a machine reading comprehension model according to their transmission rank. Step S200 may comprise: step S210, in which a valid paragraph unit calculates similarity scores between the question and the paragraphs and selects valid paragraphs; step S220, in which a paragraph group unit groups the valid paragraphs and calculates group scores; step S230, in which a machine reading comprehension target paragraph unit selects analysis target documents using the group scores and additionally selects machine reading comprehension target paragraphs within the analysis target documents; and step S240, in which a transmission rank calculation unit calculates the transmission rank of the machine reading comprehension target paragraphs.

In the present invention, step S210 may include step S211, in which a similarity calculation unit extracts all paragraphs contained in the search target documents and calculates a similarity score between each extracted paragraph and the input search question.

In the present invention, the similarity score in step S211 may be calculated using vector-based cosine similarity or keyword-based similarity.

In the present invention, step S210 may include step S212, in which a valid paragraph selection unit sorts the paragraphs by the similarity scores calculated by the similarity calculation unit and selects either the paragraphs whose similarity score is at or above a preset value or a preset number of paragraphs in order of similarity score.

In the present invention, step S220 may include step S221, in which a paragraph grouping unit groups the paragraphs selected by the valid paragraph unit by the document to which each paragraph belongs.

In the present invention, step S220 may include step S222, in which a group score calculation unit calculates a group score as the average of the similarity scores of the paragraphs grouped by the paragraph grouping unit.

In the present invention, step S230 may include step S231, in which a document selection unit sorts the documents by the group scores calculated by the paragraph group unit and selects documents whose group score is at or above a preset value as analysis target documents.

In the present invention, when the number of documents whose group score is at or above the preset value exceeds a preset maximum number of documents, step S231 may select only up to the preset maximum number of documents.

In the present invention, step S230 may include step S232, in which a paragraph addition unit, within each document selected by the document selection unit, sorts the remaining paragraphs other than the valid paragraphs by similarity score and adds a preset number of them as machine reading comprehension target paragraphs.

In the present invention, step S240 may calculate, for each machine reading comprehension target paragraph selected in step S230, a transmission rank score according to Relation 1 below.

[Relation 1]

Transmission rank value = (paragraph rank value within group × 100) + group rank value

The present invention may be implemented, in combination with hardware, as a computer program stored on a computer-readable recording medium so that a computer executes the paragraph search and paragraph selection method for machine reading comprehension according to the present invention.

The paragraph search and paragraph selection system and method for machine reading comprehension according to the present invention have the following effects.

First, machine reading comprehension target paragraphs, that is, paragraphs similar to the input question, are extracted and delivered to the machine reading comprehension model.

Second, valid paragraphs are extracted while the time needed to select the machine reading comprehension target paragraphs delivered to the model is shortened.

Third, the time needed to find the answer is shortened by using the machine reading comprehension target paragraphs.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a block diagram of the paragraph search and paragraph selection system for machine reading comprehension according to the present invention.
FIG. 2 is a detailed block diagram of the paragraph search unit according to the present invention.
FIG. 3 is a flowchart of the paragraph search and paragraph selection method for machine reading comprehension according to the present invention.
FIG. 4 is a detailed flowchart of the paragraph search step according to the present invention.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice it. As those of ordinary skill in the art will readily understand, the embodiments described below may be modified in various forms without departing from the concept and scope of the present invention. Wherever possible, identical or similar parts are denoted by the same reference numerals in the drawings.

The terminology used in this specification is intended only to describe particular embodiments and is not intended to limit the present invention. The singular forms used herein include the plural forms unless the wording clearly indicates otherwise.

As used in this specification, "comprising" specifies a particular feature, region, integer, step, operation, element, and/or component, and does not exclude the presence or addition of other particular features, regions, integers, steps, operations, elements, components, and/or groups.

All terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are further interpreted as having meanings consistent with the related technical literature and the present disclosure, and are not interpreted in an idealized or overly formal sense unless so defined.

Expressions of direction used in this specification, such as front/back/left/right, up/down, and longitudinal/transverse, may be interpreted with reference to the directions shown in the drawings.

The present invention relates to a process for preprocessing the data to be input to a machine reading comprehension artificial intelligence model. When valid data is input to the machine reading comprehension model, the model is expected to produce more meaningful results.

Accordingly, the technical feature of the present invention is to select highly valid paragraphs while minimizing the time needed to extract the machine reading comprehension target paragraphs to be input to the machine reading comprehension model.

The key terms of the present invention are defined as follows.

A 'search target document' is any relevant document that has been acquired by various means, such as web crawling, and stored in the database.

An 'analysis target document' is a search target document that is analyzed in order to select machine reading comprehension target paragraphs.

A 'valid paragraph' is a paragraph with high similarity to the question, and is used to select the analysis target documents.

An 'additional paragraph' is a paragraph contained in an analysis target document that is not a valid paragraph but is additionally selected.

An 'excluded paragraph' is a valid paragraph contained in an analysis target document that is nevertheless excluded from analysis.

A 'group' is the set of valid paragraphs among the paragraphs contained in an analysis target document.

A 'group score' is the average of the similarity scores of the valid paragraphs in a group.

A 'machine reading comprehension target paragraph' is a paragraph transmitted to the machine reading comprehension model as input data. The term covers the valid paragraphs and the additional paragraphs, adjusted for excluded paragraphs where necessary.

The 'machine reading comprehension model' includes known machine reading comprehension models such as BERT.

The present invention is described below with reference to the drawings. Note that the drawings may be partly exaggerated in order to illustrate the features of the present invention. In that case, they should be interpreted in light of the specification as a whole.

FIG. 1 is a block diagram of the paragraph search and paragraph selection system for machine reading comprehension according to the present invention. FIG. 2 is a detailed block diagram of the paragraph search unit according to the present invention.

The present invention relates to a system for searching for and selecting paragraphs on which machine reading comprehension is to be performed, executed by a computing device that uses a database storing search target documents and a control server with computing capability.

The paragraph search and paragraph selection system for machine reading comprehension according to the present invention includes: a question input unit (100) into which a preset question is input; a paragraph search unit (200) that searches for and selects paragraphs using the similarity between the input question and the paragraphs of the search target documents; and a paragraph transmission unit (300) that transmits the machine reading comprehension target paragraphs selected by the paragraph search unit (200) to the machine reading comprehension model according to their transmission rank.

The paragraph search unit (200) according to the present invention includes: a valid paragraph unit (210) that calculates similarity scores between the question and the paragraphs and selects valid paragraphs; a paragraph group unit (220) that groups the valid paragraphs and calculates group scores; a machine reading comprehension target paragraph unit (230) that selects analysis target documents using the group scores and additionally selects machine reading comprehension target paragraphs within the analysis target documents; and a transmission rank calculation unit (240) that calculates the transmission rank of the machine reading comprehension target paragraphs.

The valid paragraph unit (210) according to the present invention and its sub-components, the similarity calculation unit (211) and the valid paragraph selection unit (212), are described below.

The valid paragraph unit (210) according to the present invention may calculate similarity scores between the question and the paragraphs and select valid paragraphs.

The valid paragraph unit (210) may include a similarity calculation unit (211) that extracts all paragraphs contained in the search target documents and calculates a similarity score between each extracted paragraph and the input search question.

The similarity calculation unit (211) according to the present invention may calculate the similarity score using vector-based cosine similarity or keyword-based similarity. Other scoring algorithms may also be used.
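As one illustration of vector-based cosine similarity, the sketch below scores a paragraph against a question using bag-of-words count vectors. It is not the scoring method of the source: the whitespace tokenization, the 0 to 100 scale, and the function names are assumptions made for the example, and Korean text would in practice usually need a morphological analyzer rather than whitespace splitting.

```python
import math
from collections import Counter


def cosine_similarity(question: str, paragraph: str) -> float:
    """Cosine similarity between bag-of-words count vectors, in [0, 1]."""
    # Assumption: lowercase whitespace tokens stand in for a real tokenizer.
    q_vec = Counter(question.lower().split())
    p_vec = Counter(paragraph.lower().split())
    # Dot product over the tokens the two texts share.
    dot = sum(q_vec[tok] * p_vec[tok] for tok in q_vec.keys() & p_vec.keys())
    q_norm = math.sqrt(sum(c * c for c in q_vec.values()))
    p_norm = math.sqrt(sum(c * c for c in p_vec.values()))
    # An empty text has no direction, so treat its similarity as zero.
    if q_norm == 0 or p_norm == 0:
        return 0.0
    return dot / (q_norm * p_norm)


def similarity_score(question: str, paragraph: str) -> float:
    """Scale the similarity to a 0-100 score (the scale is an assumption)."""
    return 100.0 * cosine_similarity(question, paragraph)


question = "what is machine reading comprehension"
paragraphs = [
    "machine reading comprehension is a task where a model answers questions",
    "the weather was pleasant throughout the week",
]
for text in paragraphs:
    print(round(similarity_score(question, text), 1), text)
```

The first paragraph shares several tokens with the question and receives a positive score, while the second shares none and scores zero.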

Table 1 below shows an embodiment of calculating the similarity scores.

The valid paragraph unit (210) may include a valid paragraph selection unit (212) that sorts the paragraphs by the similarity scores calculated by the similarity calculation unit (211) and selects either the paragraphs whose similarity score is at or above a preset value or a preset number (A) of paragraphs in order of similarity score.

Table 2 below shows an embodiment in which the similarity scores are sorted in descending order and the top 600 paragraphs are extracted.

As one example, in a first embodiment every paragraph with a similarity score of 90 or higher is selected as a valid paragraph. In the first embodiment, however, the number of valid paragraphs selected may vary.

As another example, in a second embodiment the paragraphs are sorted by similarity score and a fixed number, for example 600, are selected. In this case the lowest similarity score among the 600 paragraphs may be below 90, or it may be above 95.

As yet another example, a third embodiment combines the first and second embodiments: a preset number of paragraphs is selected from among those whose similarity score is at or above a preset value.
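The three selection embodiments above can be sketched with a single function whose two optional parameters correspond to the score threshold and the paragraph count. The `ScoredParagraph` type, the function name, and the sample data are hypothetical and serve only to illustrate the selection logic.

```python
from typing import NamedTuple, Optional


class ScoredParagraph(NamedTuple):
    doc_id: str
    paragraph_id: int
    score: float


def select_valid_paragraphs(
    scored: list[ScoredParagraph],
    min_score: Optional[float] = None,
    max_count: Optional[int] = None,
) -> list[ScoredParagraph]:
    """Select valid paragraphs by threshold, by count, or by both."""
    # Sort by similarity score, highest first.
    ranked = sorted(scored, key=lambda p: p.score, reverse=True)
    # First embodiment: keep paragraphs at or above the preset score.
    if min_score is not None:
        ranked = [p for p in ranked if p.score >= min_score]
    # Second embodiment: keep at most the preset number of paragraphs.
    if max_count is not None:
        ranked = ranked[:max_count]
    # Third embodiment: pass both arguments to apply both rules.
    return ranked


sample = [
    ScoredParagraph("doc-a", 1, 96.0),
    ScoredParagraph("doc-a", 2, 91.0),
    ScoredParagraph("doc-b", 1, 89.0),
    ScoredParagraph("doc-b", 2, 93.0),
]
by_threshold = select_valid_paragraphs(sample, min_score=90.0)
by_count = select_valid_paragraphs(sample, max_count=2)
combined = select_valid_paragraphs(sample, min_score=90.0, max_count=2)
print([p.score for p in by_threshold])  # [96.0, 93.0, 91.0]
print([p.score for p in by_count])      # [96.0, 93.0]
print([p.score for p in combined])      # [96.0, 93.0]
```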

The paragraph group unit (220) according to the present invention and its sub-components, the paragraph grouping unit (221) and the group score calculation unit (222), are described below.

The paragraph group unit (220) according to the present invention may group the valid paragraphs and calculate group scores.

The paragraph group unit (220) may include a paragraph grouping unit (221) that groups the paragraphs selected by the valid paragraph unit (210) by the document to which each paragraph belongs.

The paragraph group unit (220) may include a group score calculation unit (222) that calculates a group score as the average of the similarity scores of the paragraphs grouped by the paragraph grouping unit (221).

다음 표 3은 문서 ID로 그룹화 한 후, 그룹점수 즉 그룹 포함된 유효 문단의 유사도 점수들의 평균값(평균 유사도점수)을 산출하고, 그룹의 평균 유사도점수가 높은 상위 50개의 그룹순위(문서랭킹) 1~50위를 선택하는 실시예를 나타낸다.Table 3 below calculates the group score, that is, the average value (average similarity score) of the similarity scores of valid paragraphs included in the group after grouping by document ID, and ranks the top 50 groups with the highest average similarity score (document ranking) 1 Shows an example of selecting ~50th place.

본 발명에 있어서, '그룹화'는 같은 문서정보(문서ID)를 가진 문단들을 모으는 작업을 의미한다. In the present invention, 'grouping' means an operation of gathering paragraphs having the same document information (document ID).

문단의 그룹화를 실행하고 나서 그룹 내 문단들의 유사도점수에 평균을 매겨 그룹 점수를 계산하고, 그룹 점수로 그룹들의 순위를 부여할 수 있다.After grouping the paragraphs, the similarity scores of the paragraphs in the group are averaged to calculate the group score, and the group scores can be used to rank the groups.
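
Grouping by document ID and ranking by average similarity score can be sketched as follows. The function name `rank_groups`, the `(doc_id, score)` tuple layout, and the sample values are assumptions made for this illustration.

```python
from collections import defaultdict
from statistics import mean


def rank_groups(valid_paragraphs, top_n=None):
    """Group (doc_id, score) pairs by document and rank groups by mean score."""
    groups = defaultdict(list)
    for doc_id, score in valid_paragraphs:
        groups[doc_id].append(score)
    # Group score = average similarity score of the valid paragraphs in the group.
    group_scores = {doc_id: mean(scores) for doc_id, scores in groups.items()}
    ranked = sorted(group_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n] if top_n is not None else ranked


# Made-up sample data for illustration only.
valid = [("AAA", 96.0), ("AAA", 90.0), ("BBB", 95.0), ("CCC", 92.0), ("CCC", 99.0)]
ranking = rank_groups(valid, top_n=2)
```

In the Table 3 embodiment, `top_n` would correspond to the 50 highest-ranked groups.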

Hereinafter, the machine reading target paragraph unit 230 according to the present invention and its sub-components, the document selection unit 231 and the paragraph addition unit 232, will be described.

The machine reading target paragraph unit 230 according to the present invention may select analysis target documents using the group scores and additionally select machine reading target paragraphs within the analysis target documents.

The machine reading target paragraph unit 230 may include a document selection unit 231 that sorts the documents by the group scores calculated by the paragraph group unit 220 and selects, as analysis target documents, the documents whose group score is equal to or greater than a preset value.

If the number of documents whose group score is equal to or greater than the preset value exceeds a preset maximum number of documents (B), the document selection unit 231 may select only up to the preset maximum number of documents.

The document selection unit 231 sorts the groups in descending order of group score. If the number of groups exceeds the preset maximum number of documents (B), only the B groups with the highest group scores are extracted and the remaining groups are discarded.
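
The document selection rule (a group-score threshold capped at B documents) can be sketched as below. The function name `select_documents`, its parameter names, and the sample scores are hypothetical.

```python
def select_documents(group_scores, min_group_score, max_documents):
    """Keep documents whose group score meets the threshold, capped at max_documents.

    group_scores: {doc_id: group_score}
    min_group_score: preset threshold on the group score
    max_documents: preset maximum number of documents (B)
    """
    ranked = sorted(group_scores.items(), key=lambda item: item[1], reverse=True)
    eligible = [doc_id for doc_id, score in ranked if score >= min_group_score]
    # If more than B documents pass the threshold, only the B best are kept.
    return eligible[:max_documents]


# Made-up sample data for illustration only.
group_scores = {"AAA": 93.0, "BBB": 95.0, "CCC": 95.5, "DDD": 80.0}
capped = select_documents(group_scores, min_group_score=90.0, max_documents=2)
uncapped = select_documents(group_scores, min_group_score=90.0, max_documents=10)
```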

A typical conventional method finds paragraphs or documents and sends the resulting list to an answer inference unit to find the answer. In this step, by contrast, the paragraphs are grouped by document and the importance of the documents is compared.

A high average similarity score for a group, that is, for a document, means that the document to which the group belongs contains many paragraphs similar to the correct answer. Conversely, a document with a low average similarity score is unlikely to contain the correct answer, so excluding it is appropriate for efficient answer inference.

In the present invention, if the maximum number of documents (B) is not set, every document in the database could end up being extracted. The present invention therefore sets a maximum number of documents and removes documents that are unlikely to contain the correct answer, which shortens processing time.

The machine reading target paragraph unit 230 may include a paragraph addition unit 232 that, in each document selected by the document selection unit 231, sorts the similarity scores of the remaining paragraphs other than the valid paragraphs and adds a preset number of paragraphs as machine reading target paragraphs.

Table 4 below shows an embodiment in which additional paragraphs are added in the analysis target documents so that each analysis target document holds 40 machine reading target paragraphs.

Within each document, the number of paragraphs similar to the user's question may be expanded: paragraphs are added using the similarity scores calculated by the similarity calculation unit 211 so that each document holds a preset number of paragraphs.

Paragraphs selected by the valid paragraph selection unit 212 are called 'valid paragraphs', and paragraphs added by the paragraph addition unit 232 are called 'additional paragraphs'. Additional paragraphs may be added in order of similarity score. For example, if a document has 30 valid paragraphs, 10 additional paragraphs may be added in order of similarity score so that it holds 40 paragraphs in total.

This process ranks the paragraphs of each document (group) in descending order of similarity to the question. The previous step found the documents likely to contain the answer to the query; this step adds, within those documents, further paragraphs likely to contain the answer. To this end, the paragraphs of each document are ranked by similarity score.

If the paragraph that actually contains the correct answer was excluded because its similarity score was low in the ranking across all paragraphs, the paragraph addition unit increases the chance of finding the correct answer by adding additional paragraphs document by document.

This configuration can shorten the time needed to reach the paragraph containing the correct answer.

Meanwhile, if the number of 'valid paragraphs' in a document exceeds the preset number (for example, 40), the excess is excluded from analysis, and the paragraphs excluded in this way are called 'excluded paragraphs'. In the present invention, the 'excluded paragraphs' may be determined in order of similarity score.
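
The handling of valid, additional, and excluded paragraphs within one analysis target document can be sketched as follows. The function name `build_target_paragraphs`, its arguments, and the sample scores are hypothetical; the sketch assumes that every paragraph of the document already has a similarity score.

```python
def build_target_paragraphs(doc_paragraphs, valid_ids, per_document):
    """Return the paragraph IDs of one document to send to machine reading.

    doc_paragraphs: {para_id: similarity_score} for every paragraph in the document
    valid_ids: set of IDs already selected as valid paragraphs
    per_document: preset number of paragraphs to hold per document (e.g. 40)
    """
    valid = sorted(valid_ids, key=doc_paragraphs.get, reverse=True)
    if len(valid) >= per_document:
        # Surplus valid paragraphs become 'excluded paragraphs':
        # the lowest-scoring ones are dropped.
        return valid[:per_document]
    remaining = sorted(
        (pid for pid in doc_paragraphs if pid not in valid_ids),
        key=doc_paragraphs.get,
        reverse=True,
    )
    # 'Additional paragraphs' fill the document up to the preset count.
    additional = remaining[: per_document - len(valid)]
    return valid + additional


# Made-up sample data for illustration only.
doc = {1: 95.0, 2: 91.0, 3: 70.0, 4: 85.0, 5: 60.0}
filled = build_target_paragraphs(doc, valid_ids={1, 2}, per_document=4)
trimmed = build_target_paragraphs(doc, valid_ids={1, 2}, per_document=1)
```

With 30 valid paragraphs and `per_document=40`, the same function would add the 10 best remaining paragraphs, as in the example given above.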

Because there is no guarantee that the correct answer will be derived from a paragraph with a high similarity score, a paragraph may be excluded from the machine reading target paragraphs even though its similarity score is higher than that of a valid or additional paragraph belonging to another analysis target document. Likewise, a paragraph with a relatively low similarity score may still be included among the machine reading target paragraphs.

Hereinafter, the transmission order calculation unit 240 according to the present invention will be described.

The transmission order calculation unit 240 according to the present invention may calculate the transmission order (final score) of the machine reading target paragraphs.

For each machine reading target paragraph selected by the machine reading target paragraph unit 230, the transmission order calculation unit 240 may calculate the transmission order (final score) according to Relational Expression 1 below.

[Relational Expression 1]
Transmission order value = (paragraph rank value within group × 100) + group rank value

The transmission order is a value that determines the priority with which paragraphs are sent to machine reading. Once the transmission order scores are calculated and sorted in ascending order, the paragraphs are sent to the machine reading comprehension model starting from the lowest transmission order score.

Table 5 below shows an embodiment in which the transmission order (final score) has been calculated but not yet sorted in ascending order.

Table 5 shows that 40 paragraphs (machine reading target paragraphs) were selected for each document, from the document ranked 1st in the group ranking (document ranking), CCC, to the document ranked 50th, YYY. Within each document, its 40 machine reading target paragraphs are ordered by the transmission order (final score) that follows from their similarity scores.

Table 6 below shows the embodiment after the transmission order (final score) has been calculated and sorted in ascending order.

A paragraph's similarity rank among the 40 machine reading target paragraphs of its document carries the same weight as the same rank in any other document.

For example, by Relational Expression 1, the paragraph ranked 1st within any document precedes, in transmission order (final score), the paragraph ranked 2nd within any other document, regardless of group rank (document ranking). Among paragraphs that hold the same paragraph rank in their respective documents, however, the transmission order (final score) follows the group rank (document ranking).

Accordingly, the paragraphs ranked 1st within each document (one per analysis target document) are sent in order of group rank (document ranking). Next, the paragraphs ranked 2nd within each document are likewise sent in order of group rank (document ranking).
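
The transmission order formula recited in the claims, (paragraph rank within group × 100) + group rank, and the interleaving it produces can be sketched as follows. The function name `transmission_order`, the input layout, and the sample IDs are hypothetical.

```python
def transmission_order(selected):
    """Compute and sort transmission order scores.

    selected: {doc_id: (group_rank, [para_ids sorted by similarity, best first])}
    Returns (final_score, doc_id, para_id) tuples in ascending order of
    final_score = paragraph_rank_in_group * 100 + group_rank.
    """
    scored = []
    for doc_id, (group_rank, para_ids) in selected.items():
        for para_rank, para_id in enumerate(para_ids, start=1):
            scored.append((para_rank * 100 + group_rank, doc_id, para_id))
    # Ascending sort: the lowest final score is transmitted first.
    return sorted(scored)


# Made-up sample data: two documents with two target paragraphs each.
selected = {"CCC": (1, [7, 2]), "BBB": (2, [4, 9])}
order = transmission_order(selected)
```

Note that the multiplier 100 yields the ordering described above only while the group rank stays below 100, which holds in the 50-document embodiment; a larger document cap would need a larger multiplier.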

The machine reading target paragraphs may then be fed to the machine reading comprehension model as input data in the transmission order described above. The machine reading comprehension model looks for a paragraph that yields an answer good enough to be treated as correct. An answer is treated as correct when it falls within the correct-answer range, that is, at or above a preset level value.

Here, an embodiment is possible in which, once an answer within the correct-answer range is derived from the machine reading target paragraphs already input to the machine reading comprehension model, the remaining machine reading target paragraphs that have not yet been sent are no longer sent.

An embodiment is also possible in which, even if an answer within the correct-answer range is derived, the machine reading comprehension model derives the final answer only after a preset share of the machine reading target paragraphs, for example at least 50% of the selected machine reading target paragraphs, has been sent.
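
The two stopping embodiments above can be sketched with a single loop. Everything here is an assumption for illustration: `read_fn` is a hypothetical stand-in for the machine reading comprehension model that returns an answer with a confidence value, and keeping the highest-confidence answer as the final answer is one possible policy that the specification does not prescribe.

```python
import math


def run_machine_reading(ordered_paragraphs, read_fn, answer_threshold, min_fraction=0.0):
    """Send paragraphs in transmission order until an acceptable answer is found.

    read_fn(paragraph) -> (answer, confidence): hypothetical model interface.
    answer_threshold: preset level value defining the correct-answer range.
    min_fraction: share of paragraphs that must be sent before stopping
                  (0.5 corresponds to the 50% example).
    Returns ((answer, confidence), number_of_paragraphs_sent).
    """
    min_sent = math.ceil(len(ordered_paragraphs) * min_fraction)
    best = None
    for sent, paragraph in enumerate(ordered_paragraphs, start=1):
        answer, confidence = read_fn(paragraph)
        if best is None or confidence > best[1]:
            best = (answer, confidence)
        # Stop once an answer is in range and the minimum share has been sent.
        if best[1] >= answer_threshold and sent >= min_sent:
            return best, sent
    return best, len(ordered_paragraphs)


# Made-up confidences standing in for real model output.
confidences = {"p1": 0.2, "p2": 0.9, "p3": 0.95, "p4": 0.1}


def fake_reader(paragraph):
    return f"answer from {paragraph}", confidences[paragraph]


paragraphs = ["p1", "p2", "p3", "p4"]
early = run_machine_reading(paragraphs, fake_reader, answer_threshold=0.8)
delayed = run_machine_reading(paragraphs, fake_reader, answer_threshold=0.8, min_fraction=0.75)
```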

Meanwhile, the present invention may be implemented as a method invention for paragraph search and paragraph selection, specifically as a paragraph search and paragraph selection method for machine reading comprehension.

This method invention is substantially the same as the system invention described above and differs only in category. Configurations shared with the system invention are therefore covered by the description above, and the following description focuses on the gist of the method invention.

FIG. 3 is a flowchart of the paragraph search and paragraph selection method for machine reading comprehension according to the present invention. FIG. 4 is a detailed flowchart of the paragraph search step according to the present invention.

The present invention is a method of searching for and selecting paragraphs on which machine reading comprehension is to be performed, carried out by a computing device that uses a database storing search target documents and a control server having a computation function. The computing device performs: step S100, in which a preset question is input to the question input unit 100; step S200, in which the paragraph search unit 200 searches for and selects paragraphs using the similarity between the input question and the paragraphs belonging to the search target documents; and step S300, in which the paragraph transmission unit 300 sends the machine reading target paragraphs selected by the paragraph search unit 200 to the machine reading comprehension model according to the transmission order.

Step S200 according to the present invention includes: step S210, in which the valid paragraph unit 210 calculates the similarity scores between the question and the paragraphs and selects valid paragraphs; step S220, in which the paragraph group unit 220 groups the valid paragraphs and calculates group scores; step S230, in which the machine reading target paragraph unit 230 selects analysis target documents using the group scores and additionally selects machine reading target paragraphs within the analysis target documents; and step S240, in which the transmission order calculation unit 240 calculates the transmission order of the machine reading target paragraphs.

In the present invention, step S210 may include step S211, in which the similarity calculation unit 211 extracts all paragraphs included in the search target documents and calculates a similarity score between each extracted paragraph and the input search question.

In the present invention, the similarity score in step S211 may be calculated using vector-based cosine similarity or keyword-based similarity.
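
As a minimal sketch of the vector-based option, cosine similarity between a question vector and a paragraph vector can be computed as below. How the vectors are produced (for example, by an embedding model) is outside this sketch, and scaling the result to a 0 to 100 score is an assumption made only to match the point-style scores used in the examples; the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)


def similarity_score(question_vec, paragraph_vec):
    """Scale cosine similarity to a point score (the scaling is an assumption)."""
    return 100.0 * cosine_similarity(question_vec, paragraph_vec)


# Made-up vectors for illustration only.
score = similarity_score([1.0, 1.0], [1.0, 0.0])
```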

In the present invention, step S210 may include step S212, in which the valid paragraph selection unit 212 sorts the paragraphs by the similarity scores calculated by the similarity calculation unit 211 and selects either the paragraphs whose similarity score is equal to or greater than a preset value or a preset number of paragraphs in order of similarity score.

In the present invention, step S220 may include step S221, in which the paragraph grouping unit 221 groups the paragraphs selected by the valid paragraph unit 210 by the document to which each paragraph belongs.

In the present invention, step S220 may include step S222, in which the group score calculation unit 222 calculates a group score as the average of the similarity scores of the paragraphs grouped by the paragraph grouping unit 221.

In the present invention, step S230 may include step S231, in which the document selection unit 231 sorts the documents by the group scores calculated by the paragraph group unit 220 and selects, as analysis target documents, the documents whose group score is equal to or greater than a preset value.

In the present invention, in step S231, if the number of documents whose group score is equal to or greater than the preset value exceeds a preset maximum number of documents, only up to the preset maximum number of documents may be selected.

In the present invention, step S230 may include step S232, in which the paragraph addition unit 232, in each document selected by the document selection unit 231, sorts the similarity scores of the remaining paragraphs other than the valid paragraphs and adds a preset number of paragraphs as machine reading target paragraphs.

In the present invention, in step S240, a transmission order score may be calculated according to Relational Expression 1 below for each machine reading target paragraph selected in step S230.

[Relational Expression 1]
Transmission order value = (paragraph rank value within group × 100) + group rank value

Meanwhile, the present invention may also be implemented as a computer program. Specifically, the present invention may be implemented as a computer program that is combined with hardware and stored in a computer-readable recording medium in order to cause a computer to execute the paragraph search and paragraph selection method for machine reading comprehension according to the present invention.

The methods according to embodiments of the present invention may be implemented as programs readable by various computer means and recorded on a computer-readable recording medium. The recording medium may contain program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. Such hardware devices may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

The embodiments described in this specification and the accompanying drawings merely illustrate part of the technical idea included in the present invention. The embodiments disclosed in this specification are therefore intended to explain, not to limit, the technical idea of the present invention, and it is evident that the scope of that technical idea is not limited by these embodiments. All modifications and specific embodiments that those skilled in the art can readily infer within the scope of the technical idea contained in the specification and drawings of the present invention should be construed as falling within the scope of the present invention.

100: question input unit
200: paragraph search unit 210: valid paragraph unit
211: similarity calculation unit 212: valid paragraph selection unit
220: paragraph group unit 221: paragraph grouping unit
222: group score calculation unit 230: machine reading target paragraph unit
231: document selection unit 232: paragraph addition unit
240: transmission order calculation unit
300: paragraph transmission unit

Claims

A paragraph search and paragraph selection system for machine reading comprehension, implemented by a computing device that uses a database storing search target documents and a control server having a computation function, the system searching for and selecting paragraphs on which machine reading comprehension is to be performed and comprising:
a question input unit into which a preset question is input;
a paragraph search unit that searches for and selects paragraphs using the similarity between the input question and the paragraphs belonging to the search target documents; and
a paragraph transmission unit that sends the machine reading target paragraphs selected by the paragraph search unit to a machine reading comprehension model according to a transmission order,
wherein the paragraph search unit comprises:
a valid paragraph unit that calculates similarity scores between the question and the paragraphs and selects valid paragraphs;
a paragraph group unit that groups the valid paragraphs and calculates group scores;
a machine reading target paragraph unit that selects analysis target documents using the group scores and additionally selects machine reading target paragraphs within the analysis target documents; and
a transmission order calculation unit that calculates the transmission order of the machine reading target paragraphs.

The system of claim 1,
wherein the valid paragraph unit comprises
a similarity calculation unit that extracts all paragraphs included in the search target documents and
calculates a similarity score between each extracted paragraph and the input search question.

The system of claim 2,
wherein the similarity calculation unit calculates the similarity score using vector-based cosine similarity or keyword-based similarity.

The system of claim 2,
wherein the valid paragraph unit comprises
a valid paragraph selection unit that sorts the paragraphs by the similarity scores calculated by the similarity calculation unit and
selects either the paragraphs whose similarity score is equal to or greater than a preset value or a preset number of paragraphs in order of similarity score.

The system of claim 1,
wherein the paragraph group unit comprises
a paragraph grouping unit that groups the paragraphs selected by the valid paragraph unit by the document to which each paragraph belongs.

The system of claim 5,
wherein the paragraph group unit comprises
a group score calculation unit that calculates a group score as the average of the similarity scores of the paragraphs grouped by the paragraph grouping unit.

The system of claim 1,
wherein the machine reading target paragraph unit comprises
a document selection unit that sorts the documents by the group scores calculated by the paragraph group unit and
selects, as analysis target documents, the documents whose group score is equal to or greater than a preset value.

The system of claim 7,
wherein the document selection unit,
if the number of documents whose group score is equal to or greater than the preset value exceeds a preset maximum number of documents,
selects only up to the preset maximum number of documents.

The system of claim 7 or claim 8,
wherein the machine reading target paragraph unit comprises
a paragraph addition unit that, in each document selected by the document selection unit,
sorts the similarity scores of the remaining paragraphs other than the valid paragraphs and adds a preset number of paragraphs as machine reading target paragraphs.

The system of claim 1,
wherein the transmission order calculation unit
calculates, for each machine reading target paragraph selected by the machine reading target paragraph unit, a transmission order score according to Relational Expression 1 below.
[Relational Expression 1]
Transmission order value = (paragraph rank value within group × 100) + group rank value

A paragraph search and paragraph selection method for machine reading comprehension, performed by a computing device that uses a database storing search target documents and a control server having a computation function, for searching for and selecting paragraphs on which machine reading comprehension is to be performed, the method comprising, by the computing device:
step S100, in which a preset question is input to a question input unit;
step S200, in which a paragraph search unit searches for and selects paragraphs using the similarity between the input question and the paragraphs belonging to the search target documents; and
step S300, in which a paragraph transmission unit sends the machine reading target paragraphs selected by the paragraph search unit to a machine reading comprehension model according to a transmission order,
wherein step S200 comprises:
step S210, in which a valid paragraph unit calculates similarity scores between the question and the paragraphs and selects valid paragraphs;
step S220, in which a paragraph group unit groups the valid paragraphs and calculates group scores;
step S230, in which a machine reading target paragraph unit selects analysis target documents using the group scores and additionally selects machine reading target paragraphs within the analysis target documents; and
step S240, in which a transmission order calculation unit calculates the transmission order of the machine reading target paragraphs.

The method of claim 11,
wherein step S210 comprises
step S211, in which a similarity calculation unit extracts all paragraphs included in the search target documents and
calculates a similarity score between each extracted paragraph and the input search question.

The method of claim 12,
wherein the similarity score in step S211 is calculated using vector-based cosine similarity or keyword-based similarity.

The method of claim 12,
wherein step S210 comprises
step S212, in which a valid paragraph selection unit sorts the paragraphs by the similarity scores calculated by the similarity calculation unit and selects either the paragraphs whose similarity score is equal to or greater than a preset value or a preset number of paragraphs in order of similarity score.

The method of claim 11,
wherein step S220 comprises
step S221, in which a paragraph grouping unit groups the paragraphs selected by the valid paragraph unit by the document to which each paragraph belongs.

The method of claim 15,
wherein step S220 comprises
step S222, in which a group score calculation unit calculates a group score as the average of the similarity scores of the paragraphs grouped by the paragraph grouping unit.

The method of claim 11,
wherein step S230 comprises
step S231, in which a document selection unit sorts the documents by the group scores calculated by the paragraph group unit and selects, as analysis target documents, the documents whose group score is equal to or greater than a preset value.

The method of claim 17,
wherein, in step S231,
if the number of documents whose group score is equal to or greater than the preset value exceeds a preset maximum number of documents, only up to the preset maximum number of documents are selected.

The method of claim 17 or claim 18,
wherein step S230 comprises
step S232, in which a paragraph addition unit, in each document selected by the document selection unit, sorts the similarity scores of the remaining paragraphs other than the valid paragraphs and adds a preset number of paragraphs as machine reading target paragraphs.

The method of claim 11,
wherein, in step S240,
a transmission order score is calculated according to Relational Expression 1 below for each machine reading target paragraph selected in step S230.
[Relational Expression 1]
Transmission order value = (paragraph rank value within group × 100) + group rank value

A computer program combined with hardware and stored in a computer-readable recording medium to cause a computer to execute the paragraph search and paragraph selection method for machine reading comprehension according to claim 11.