KR20230045263A - Question Answering System and Method to extract infrastructure damage information from disaster report using weights

Info

Publication number: KR20230045263A
Application number: KR1020210127900A
Authority: KR
Inventors: 김형관; 김요한; 손지우
Original assignee: 연세대학교 산학협력단
Priority date: 2021-09-28
Filing date: 2021-09-28
Publication date: 2023-04-04

Abstract

The present invention relates to a disaster information question answering system in which a control server (10) with a computing function and a database (20) storing disaster report information are connected via a network, and the control server (10) analyzes the disaster report information using a BERT model. More particularly, it relates to a disaster information question answering system that extracts infrastructure damage information from a disaster report using disaster infrastructure weights. The control server (10) includes: a paragraph search unit (100) including a vector conversion unit (110) that converts input question information and disaster report information into question vectors and report paragraph vectors, respectively, a similarity comparison unit (120) that calculates the similarity between each question vector and each report paragraph vector, and a paragraph determination unit (130) that determines at least one report paragraph vector for each question vector according to the similarity calculated by the similarity comparison unit (120); and a question answering unit (200) including a training data selection unit (210) that selects the question vectors and report paragraph vectors determined by the paragraph search unit (100) as training data, a hyperparameter determination unit (220) that derives optimal hyperparameters for the training data, a weight calculation unit (230) that calculates disaster infrastructure weights by applying the hyperparameters to the BERT model and training it, and a damage information extraction unit (240) that extracts the answer to a question using the calculated disaster infrastructure weights. As a result, weights suited to disaster infrastructure are derived.

Description

Disaster Information Question Answering System and Method to extract infrastructure damage information from disaster report using weights

The present invention relates to a disaster information question answering system and question answering method. Specifically, it relates to a disaster information question answering system and method that extract infrastructure damage information from disaster reports using disaster infrastructure weights.

Research on extracting information from text data has long been conducted, but in the fields of disaster management and infrastructure management such research has mainly relied on social media.

Studies that use news articles or reports focus on the disaster situation itself and mainly aim to obtain spatiotemporal information about the disaster. In particular, no study has extracted information on infrastructure damage caused by disasters from the disaster reports that document them.

Studies using question answering systems have likewise focused on immediately providing information on the overall disaster situation. General question answering systems also provide answers in short-answer form.

Meanwhile, disaster reports, which are documents compiled about disasters, contain a mix of infrastructure damage information such as the damaged area, the type of damaged infrastructure, and the type of damage. Making use of this information is therefore important for establishing an appropriate infrastructure management plan in response to a disaster.

However, existing question answering systems that provide information in short-answer form cannot deliver all of this infrastructure damage information, so a new system that provides information as phrases or sentences is needed.

Natural language is the language we use in everyday life. Natural language processing (NLP) is the processing of human language (natural language) so that a computer can understand and analyze it.

NLP is used in speech recognition, content summarization, translation, user sentiment analysis, text classification (spam mail filtering, news article categorization), question answering systems, and chatbots.

Various deep learning models have been proposed for NLP tasks (named entity recognition, document classification, document summarization, etc.).

A language model is a model that assigns probabilities to word sequences (or sentences) in order to model the phenomenon of language.

In 2018, Google released the artificial intelligence (AI) language model BERT (Bidirectional Encoder Representations from Transformers). BERT is a new method for pre-training language representations: a general-purpose language understanding model is trained on a large text corpus such as Wikipedia, and that model is then applied to the actual NLP task of interest (question answering, etc.).

BERT is a pre-trained NLP language model. It is not a technique limited to a specific field but a general-purpose language model that performs well across NLP fields.

BERT is built on the Transformer, Google's self-attention neural network model. Once pre-trained, it can be applied to a variety of natural language problems with fine-tuning alone.

However, the pre-trained weights of BERT that Google derived from a large amount of text are not optimized for classifying the 'disaster damage information' that the present invention seeks to identify.

That is, although BERT's strong performance makes it well suited to building a question answering system, a BERT-based question answering system was still inadequate for extracting infrastructure damage information from disaster reports.

The first reason is that the weights of the pre-trained BERT are not optimized for extracting infrastructure damage information. The second reason is that a BERT-based question answering system cannot provide long pieces of information.

(Document 1) Korean Registered Patent Publication No. 10-1703116 (2017.01.31)

The disaster information question answering system and question answering method according to the present invention, which extract infrastructure damage information from disaster reports using disaster infrastructure weights, address the following problems.

First, the invention seeks to derive weights suited to disaster infrastructure rather than the existing general-purpose weights of the BERT model.

Second, using the extensive material contained in disaster reports, the invention seeks to provide answers about damage information not as short answers such as YES or NO but as information-rich phrases, clauses, or sentences.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the description below.

The present invention is a disaster information question answering system in which a control server having a computing function and a database storing disaster report information are connected via a network and the control server analyzes the disaster report information using a BERT model, wherein the control server includes: a paragraph search unit having a vector conversion unit that converts input question information and disaster report information into question vectors and report paragraph vectors, respectively, a similarity comparison unit that calculates the similarity between each question vector and each report paragraph vector, and a paragraph determination unit that determines at least one report paragraph vector for each question vector according to the similarity calculated by the similarity comparison unit; and a question answering unit having a training data selection unit that selects the question vectors and report paragraph vectors determined by the paragraph search unit as training data, a hyperparameter determination unit that derives optimal hyperparameters for the training data, a weight calculation unit that calculates disaster infrastructure weights by applying the hyperparameters to the BERT model and training it, and a damage information extraction unit that extracts the answer to a question using the calculated disaster infrastructure weights.

In the present invention, the question information input to the vector conversion unit of the paragraph search unit may be input sentence by sentence, and the disaster report information may be input paragraph by paragraph.

In the present invention, the similarity comparison unit of the paragraph search unit may calculate the cosine similarity between each question vector and each report paragraph vector and assign a similarity score to each report paragraph vector.

According to claim 3, the paragraph determination unit of the paragraph search unit may determine a preset number of report paragraph vectors in descending order of similarity score.
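
For illustration, the similarity score and the paragraph selection described above can be written compactly. The symbols below are notation introduced here and do not appear in the original text: q is a question vector, p_i is the i-th of M report paragraph vectors, s_i is its similarity score, and N is the preset number of paragraphs.

```latex
s_i = \cos(\mathbf{q}, \mathbf{p}_i)
    = \frac{\mathbf{q} \cdot \mathbf{p}_i}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{p}_i \rVert},
\qquad i = 1, \dots, M,
\qquad
\mathcal{P}_N(\mathbf{q}) = \{\, \mathbf{p}_i : s_i \text{ is among the } N \text{ largest scores} \,\}
```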

In the present invention, in the training data selection unit of the question answering unit, the training dataset may consist of questions, report paragraphs, and answers.

In the present invention, the answer may be set as a phrase, a clause, or a sentence.

In the present invention, the hyperparameter determination unit of the question answering unit may set the hyperparameters to a batch size of 1, 6 epochs, and a learning rate of 7e-6, and the weight calculation unit may calculate the disaster infrastructure weights using these hyperparameters.
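
As a minimal sketch, the hyperparameters stated above can be held in a small configuration object. The class name `FineTuneConfig`, the helper `total_update_steps`, and the example count of 100 are hypothetical and introduced only for illustration; the sketch also assumes one optimizer update per batch, which the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameter values taken from the text; the class itself is illustrative."""
    batch_size: int = 1
    epochs: int = 6
    learning_rate: float = 7e-6


def total_update_steps(num_examples: int, cfg: FineTuneConfig) -> int:
    """Number of optimizer updates, assuming one update per batch."""
    steps_per_epoch = -(-num_examples // cfg.batch_size)  # ceiling division
    return steps_per_epoch * cfg.epochs


cfg = FineTuneConfig()
# 100 is a placeholder example count, not a figure from the text.
steps = total_update_steps(num_examples=100, cfg=cfg)
print(cfg, steps)
```

With a batch size of 1, every training example triggers its own update, so the step count is simply the number of examples multiplied by the number of epochs.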

In the present invention, in the damage information extraction unit of the question answering unit, each word of the question and paragraph in the training dataset may be assigned a vector value as it passes through the hidden layers of the BERT model configured with the disaster infrastructure weights.

In the present invention, each word of the paragraph that has passed through the final hidden layer may be expressed as two probabilities via a softmax activation function.

In the present invention, these probabilities are the probability of being the start word of the answer and the probability of being the end word. In a first step, the word with the highest probability of being the start word is determined; in a second step, the word with the highest probability of being the end word is determined among the words that follow it; and in a third step, these two words and the words between them may be extracted as the answer.
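
The three steps above can be sketched as follows. This is an illustrative sketch only: the function names, the toy words, and the scores are made up, the softmax is assumed to normalize over the words of the paragraph, and the fallback to a single-word answer when the start word is the last word is an assumption not stated in the text.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extract_answer(words, start_scores, end_scores):
    """Select a start word, then an end word after it, and return the span."""
    start_probs = softmax(start_scores)
    end_probs = softmax(end_scores)
    # Step 1: the word most likely to be the start of the answer.
    start = max(range(len(words)), key=lambda i: start_probs[i])
    # Step 2: the most likely end word among the words after the start word.
    later = range(start + 1, len(words))
    # Assumption: if no word follows the start word, the answer is that one word.
    end = max(later, key=lambda i: end_probs[i]) if later else start
    # Step 3: the two words and everything between them form the answer.
    return " ".join(words[start:end + 1])


# Toy paragraph and made-up scores, for illustration only.
words = ["Hurricane", "damaged", "the", "coastal", "highway", "bridge", "yesterday"]
start_scores = [0.1, 0.2, 2.5, 0.3, 0.1, 0.0, -1.0]
end_scores = [3.0, 0.0, 0.1, 0.2, 0.5, 2.0, 0.1]
answer = extract_answer(words, start_scores, end_scores)
print(answer)
```

In the toy data the highest end score belongs to the first word, which precedes the start word; restricting the search to later words is what keeps the extracted span well-formed.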

The present invention is also a disaster information question answering method in which a control server having a computing function and a database storing disaster report information are connected via a network and the control server analyzes the disaster report information using a BERT model, the method comprising: a paragraph search step (S100) having step S110, in which the vector conversion unit of the control server converts input question information and disaster report information into question vectors and report paragraph vectors, respectively, step S120, in which the similarity comparison unit calculates the similarity between each question vector and each report paragraph vector, and step S130, in which the paragraph determination unit determines at least one report paragraph vector for each question vector according to the similarity calculated by the similarity comparison unit; and a question answering step (S200) having step S210, in which the training data selection unit selects the question vectors and report paragraph vectors determined by the paragraph search unit as training data, step S220, in which the hyperparameter determination unit derives optimal hyperparameters for the training data, step S230, in which the weight calculation unit calculates disaster infrastructure weights by applying the hyperparameters to the BERT model and training it, and step S240, in which the damage information extraction unit extracts the answer to a question using the calculated disaster infrastructure weights.

In the present invention, the question information input to the vector conversion unit in step S110 may be input sentence by sentence, and the disaster report information may be input paragraph by paragraph.

In the present invention, the similarity comparison unit in step S120 may calculate the cosine similarity between each question vector and each report paragraph vector and assign a similarity score to each report paragraph vector.

In the present invention, the paragraph determination unit in step S130 may determine a preset number of report paragraph vectors in descending order of similarity score.

In the present invention, in the training data selection unit in step S210, the training dataset may consist of questions, report paragraphs, and answers.

In the present invention, the hyperparameter determination unit in step S220 may set the hyperparameters to a batch size of 1, 6 epochs, and a learning rate of 7e-6, and the weight calculation unit in step S230 may calculate the disaster infrastructure weights using these hyperparameters.

In the present invention, in the damage information extraction unit in step S240, each word of the question and paragraph in the training dataset may be assigned a vector value as it passes through the hidden layers of the BERT model configured with the disaster infrastructure weights.

The present invention may be implemented, in combination with hardware, as a computer program stored on a computer-readable recording medium in order to cause a computer to execute the disaster information question answering method of extracting infrastructure damage information from a disaster report using disaster infrastructure weights according to the present invention.

The disaster information question answering system and question answering method according to the present invention, which extract infrastructure damage information from disaster reports using disaster infrastructure weights, have the following effects.

First, weights suited to disaster infrastructure are derived, rather than the existing general-purpose weights of the BERT model.

Second, using the extensive material contained in disaster reports, answers about damage information are provided not as short answers such as YES or NO but as information-rich phrases, clauses, or sentences.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a schematic diagram of the disaster information question answering system according to the present invention.
FIG. 2 shows the detailed configuration of the paragraph search unit of the disaster information question answering system according to the present invention.
FIG. 3 shows the detailed configuration of the question answering unit of the disaster information question answering system according to the present invention.
FIGS. 4 and 5 are flowcharts of the disaster information question answering method according to the present invention.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice it. As such a person will readily understand, the embodiments described below may be modified in various forms without departing from the concept and scope of the present invention. Wherever possible, identical or similar parts are denoted by the same reference numerals in the drawings.

The terminology used in this specification is intended only to refer to specific embodiments and is not intended to limit the present invention. The singular forms used herein include the plural forms unless the wording clearly indicates otherwise.

As used herein, "comprising" specifies particular features, regions, integers, steps, operations, elements, and/or components, and does not exclude the presence or addition of other features, regions, integers, steps, operations, elements, components, and/or groups.

All terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are further interpreted as having meanings consistent with the related technical literature and the present disclosure, and are not interpreted in an idealized or overly formal sense unless so defined.

The present invention builds on Google's BERT model. Because the BERT model has no weights specialized for disaster situations, it lacks accuracy when analyzing disaster-related messages; the invention is therefore characterized by deriving new disaster weights to increase the accuracy of disaster message analysis. In addition, new training data is constructed in order to derive the new disaster weights.

That is, the main elements that distinguish the present invention from the prior art are the disaster and infrastructure weights, which differ from the existing pre-trained weights, and the training data newly constructed to obtain them. The newly constructed training data differs from the data used to train existing question answering systems.

Conventional question answering datasets consist of a question, a passage, and an answer, in which the answer is a short answer made up of a small number of words. Weights obtained by training on such datasets are therefore not suited to obtaining long information such as infrastructure damage information.

Accordingly, the training data used in the present invention was constructed so that information of various lengths, such as phrases, clauses, and sentences, can be provided with high accuracy.

According to the experimental results described later, using the disaster and infrastructure weights obtained with the new training data yielded accuracy more than twice as high.

The present invention is described below with reference to the drawings. Note that the drawings may be partly exaggerated in order to explain the features of the invention; in such cases they should be interpreted in light of the overall purport of this specification.

FIG. 1 is a schematic diagram of the disaster information question answering system that extracts infrastructure damage information from disaster reports using disaster infrastructure weights according to the present invention.

As shown in FIG. 1, the present invention is a system in which, when a disaster report is input to the control server (10) together with a question asking about infrastructure damage, the computer analyzes the report and the question using the optimal disaster and infrastructure weights and extracts the response (answer), that is, the damage information.

The present invention relates to a disaster information question answering system in which a control server (10) having a computing function and a database (20) storing disaster report information are connected via a network and the control server (10) analyzes the disaster report information using a BERT model.

The control server (10) of the disaster information question answering system according to the present invention includes a paragraph search unit (100) and a question answering unit (200).

The paragraph search unit (100) according to the present invention includes a vector conversion unit (110) that converts input question information and disaster report information into question vectors and report paragraph vectors, respectively, a similarity comparison unit (120) that calculates the similarity between each question vector and each report paragraph vector, and a paragraph determination unit (130) that determines at least one report paragraph vector for each question vector according to the similarity calculated by the similarity comparison unit (120).

The question answering unit (200) according to the present invention includes a training data selection unit (210) that selects the question vectors and report paragraph vectors determined by the paragraph search unit (100) as training data, a hyperparameter determination unit (220) that derives optimal hyperparameters for the training data, a weight calculation unit (230) that calculates disaster infrastructure weights by applying the hyperparameters to the BERT model and training it, and a damage information extraction unit (240) that extracts the answer to a question using the calculated disaster infrastructure weights.

In the present invention, disaster reports may be stored in the database (20) in advance or in real time using various data collection techniques.

FIG. 2 shows the detailed configuration of the paragraph search unit of the disaster information question answering system according to the present invention.

In the present invention, the question information input to the vector conversion unit (110) of the paragraph search unit (100) may be input sentence by sentence, and the disaster report information may be input paragraph by paragraph.

The BERT model used in the present invention accepts at most 512 tokens as input. The longest paragraph used in the present invention has a length of 468, which is within this limit. In addition, because a paragraph covers a single topic, paragraph-level input is well suited to letting the model learn the context.
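
The length constraint above can be checked before a question and paragraph are passed to the model. The following is a sketch under stated assumptions: whitespace splitting stands in for BERT's WordPiece tokenizer, the three extra positions follow the usual BERT sentence-pair layout ([CLS] question [SEP] paragraph [SEP]), which the text does not describe, and the function name `fits_in_bert` is hypothetical.

```python
MAX_BERT_TOKENS = 512  # input limit stated in the text
SPECIAL_TOKENS = 3     # assumption: [CLS] question [SEP] paragraph [SEP]


def fits_in_bert(question_tokens, paragraph_tokens, max_tokens=MAX_BERT_TOKENS):
    """Return True if the question and paragraph fit in one BERT input."""
    total = len(question_tokens) + len(paragraph_tokens) + SPECIAL_TOKENS
    return total <= max_tokens


# Whitespace splitting is a stand-in for a real subword tokenizer.
question_tokens = "what infrastructure was damaged".split()
paragraph_tokens = ["word"] * 468  # a paragraph at the maximum length cited in the text
ok = fits_in_bert(question_tokens, paragraph_tokens)
print(ok)
```

Because the question shares the input with the paragraph, the margin between 468 and 512 is what leaves room for the question and the special tokens.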

The similarity comparison unit (120) of the paragraph search unit (100) according to the present invention may calculate the cosine similarity between each question vector and each report paragraph vector and assign a similarity score to each report paragraph vector. The paragraph determination unit (130) of the paragraph search unit (100) may determine a preset number of report paragraph vectors in descending order of similarity score.

This is done on the assumption that a question asking about infrastructure damage and a report paragraph containing infrastructure damage information will be highly similar, and this assumption was verified experimentally. As a result, the top N paragraphs (a preset number) with the highest similarity scores are retrieved, and the retrieved paragraphs and the question are input to the question answering unit (200).
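
The retrieval step described above can be sketched in a few lines. This is an illustration only: the function names and the toy three-dimensional vectors are hypothetical, how the vectors are produced is outside the sketch, and treating a zero vector as having similarity 0 is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def top_n_paragraphs(question_vec, paragraph_vecs, n):
    """Return (paragraph index, score) pairs for the n highest scores."""
    scored = [
        (i, cosine_similarity(question_vec, p))
        for i, p in enumerate(paragraph_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy vectors for illustration; real vectors come from the vector conversion unit.
question = [1.0, 0.0, 1.0]
paragraphs = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
selected = top_n_paragraphs(question, paragraphs, n=2)
print(selected)
```

The selected paragraph indices, together with the question, are what would be handed to the question answering unit.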

도 3은 본 발명에 따른 재난정보 질의응답시스템의 질의응답부의 세부 구성을 나타내며, BERT의 최적 재난 및 인프라 가중치를 도출하는 단계를 포함한다. 3 shows the detailed configuration of the question answering unit of the disaster information question answering system according to the present invention, and includes a step of deriving the optimal disaster and infrastructure weights of BERT.

구글이 발표한 딥러닝 기반 언어모델 BERT는 대량의 텍스트(BooksCorpus 800M + Wikipedia 2500M)를 이용한 학습을 통해 사전 학습 가중치를 정하였다. 그러나, 이러한 사전 학습 가중치는 재난보고서로부터 인프라 피해 정보를 효과적으로 추출하기에 적합하지 않았다.BERT, a deep learning-based language model announced by Google, determined pre-learning weights through learning using a large amount of text (BooksCorpus 800M + Wikipedia 2500M). However, these prior learning weights were not suitable for effectively extracting infrastructure damage information from disaster reports.

Therefore, in the present invention, a new dataset was built and optimal hyperparameters were determined in order to derive the optimal disaster and infrastructure weights, so that infrastructure damage information can be extracted precisely.

The optimal disaster infrastructure weights were calculated by applying the determined hyperparameters to BERT and training the model.

The training data for training the BERT model, the validation data for determining the hyperparameters, and the test data for checking the performance of the BERT model were built from reports published by the U.S. National Hurricane Center.

A question answering dataset of 517 items in total was constructed, comprising training data (348 items), validation data (87 items), and test data (81 items).

In the present invention, in the learning data selection unit 210 of the question answering unit 200, the training dataset may consist of questions, report paragraphs, and correct answers. The present invention provides one correct answer per question; that is, multiple correct answers are not provided.

What distinguishes the dataset of the present invention from existing ones is the form of the correct answer. Question answering datasets commonly used to build existing question answering systems (a representative example being the Stanford Question Answering Dataset, SQuAD) mainly provide correct answers as short, word-level answers.

However, infrastructure damage information in the reports is described in spans ranging from as short as a single phrase or clause to as long as multiple sentences, so the new dataset for the present invention had to be constructed differently from existing question answering datasets.

Accordingly, the correct answer according to the present invention may be set in units of phrases, clauses, or sentences. In other words, the correct answer is set in units of a phrase or larger, and the model is trained with these answers to calculate the optimal weights.

Table 1 below shows examples from an existing dataset and from the dataset used in the present invention.

Constructing the question answering dataset in this way is an element that differentiates the present invention from the prior art. This differentiated dataset also has a large influence on the derivation of the new weights.

To adjust the disaster infrastructure weights according to the present invention, it is necessary to set the optimal hyperparameters that determine how the BERT model is trained.

The main hyperparameters that can be adjusted in BERT are the batch size, the number of epochs, and the learning rate. Performance is evaluated on the validation data while each hyperparameter is adjusted, and the optimal combination of hyperparameters is found on that basis.

The batch size is the number of training examples used in one iteration. The number of epochs is the number of passes made over the entire training dataset. The learning rate is a tuning parameter of the optimization algorithm that determines the step size at each iteration while moving toward the minimum of the loss function.
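
As a small worked illustration of how these quantities relate, the number of weight updates follows from the dataset size, the batch size, and the number of epochs. The helper names below are hypothetical, and the sketch assumes one update per batch with a final partial batch counted as a full iteration.

```python
import math


def updates_per_epoch(num_examples, batch_size):
    """Iterations needed for one pass over the data (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


def total_updates(num_examples, batch_size, epochs):
    """Total iterations over all epochs, assuming one update per batch."""
    return updates_per_epoch(num_examples, batch_size) * epochs


# With 348 training examples, a batch size of 1 means 348 iterations per epoch.
per_epoch = updates_per_epoch(348, 1)
```

A smaller batch size therefore means more, noisier updates per epoch, which is one reason the learning rate and the number of epochs are usually tuned together with it.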

When training the BERT model, the usable range of batch sizes is limited by the performance of the computer used, in particular its discrete graphics card. With the hardware used for the present invention (GTX 1080), a batch size larger than 1 could not be used, so the batch size was fixed at 1. The number of epochs and the learning rate were then varied to find the values giving the highest accuracy on the validation data.

In addition, cross-validation was performed to prevent overfitting to the training data. The ratio of training data to validation data was kept fixed (for example, 4:1) while the assignment of items to the training and validation sets was varied, and the evaluation was repeated, for example, five times. This prevents the results from being biased by a particular choice of training data.
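
A 4:1 split rotated over five evaluations corresponds to ordinary 5-fold cross-validation, which can be sketched as below. This is an illustrative assumption about the procedure, not the authors' code: the function name is hypothetical, the folds are contiguous and unshuffled, and 435 is simply the 348 training items plus the 87 validation items.

```python
def k_fold_splits(num_items, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(num_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [num_items // k + (1 if i < num_items % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# 435 items in 5 folds: each fold trains on 348 items and validates on 87.
splits = list(k_fold_splits(435, 5))
```

In practice the items would normally be shuffled before splitting; the sketch omits this to keep the fold sizes easy to check.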

The BERT researchers reported that 3 or 4 epochs and a learning rate of 2e-5, 4e-5, or 5e-5 work well across natural language processing tasks in general. The hyperparameters obtained in the present invention differ from those suggested by the BERT researchers, which is attributed to the training data differing from conventional data.

In the question answering unit 200 according to the present invention, the hyperparameter determination unit 220 derived, through experiments on the validation data, a batch size of 1, 6 epochs, and a learning rate of 7e-6 as the optimal hyperparameters.
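
The search over epochs and learning rates can be pictured as a simple grid search. The sketch below is an assumption about how such a search might be organized: `select_hyperparameters`, the candidate grids, and `toy_validation_score` are all hypothetical, and the toy scorer is a synthetic stand-in deliberately shaped to peak at 6 epochs and 7e-6. It does not reproduce any measured validation accuracy.

```python
from itertools import product


def select_hyperparameters(epoch_grid, lr_grid, evaluate):
    """Return the (epochs, learning_rate, score) triple with the best score.

    `evaluate(epochs, lr)` is supplied by the caller and would normally
    train the model and return its cross-validated accuracy.
    """
    best = None
    for epochs, lr in product(epoch_grid, lr_grid):
        score = evaluate(epochs, lr)
        if best is None or score > best[2]:
            best = (epochs, lr, score)
    return best


def toy_validation_score(epochs, lr):
    """Synthetic placeholder, NOT a real result: highest at (6, 7e-6)."""
    return -abs(epochs - 6) - abs(lr - 7e-6) * 1e5


best = select_hyperparameters([3, 4, 5, 6, 7],
                              [5e-6, 7e-6, 2e-5],
                              toy_validation_score)
```

With the batch size fixed at 1, only two dimensions remain to be searched, which keeps an exhaustive grid affordable even when every evaluation requires a full training run.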

The weight calculation unit 230 according to the present invention may calculate the optimal disaster infrastructure weights using the hyperparameters derived as described above. The optimal disaster infrastructure weights are obtained by retraining the model with the optimal hyperparameters. The model is retrained on a total of 435 items, combining the training data (348 items) and the validation data (87 items) that were previously used to find the optimal hyperparameters through cross-validation.

A weight according to the present invention means a parameter of the model used to convert words into vector values so that the meaning of each word in the text data can be expressed numerically. That is, the vector value of a word changes with the weight values, and the result produced by the model changes accordingly.

The optimal disaster and infrastructure weights obtained by the above method are used by the damage information extraction unit 240 to extract damage information.

That is, in the damage information extraction unit 240 of the question answering unit 200, each word of the question and the paragraph in the training dataset is assigned a vector value as it passes through the hidden layers of the BERT model in which the optimal disaster infrastructure weights are set.

Each word of the paragraph that has passed through the final hidden layer can be expressed as two probabilities by means of a softmax activation function.

The two probabilities are the probability of being the start word of the correct answer and the probability of being its end word. As a first step, the word with the highest probability of being the start word is determined; as a second step, the word with the highest probability of being the end word among the words that follow it is determined; and as a third step, the two words and the words between them may be extracted as the correct answer.
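
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the token list, and the logit values are made up for the example, the softmax is taken across word positions, and the search for the end word starts at the start word itself so that one-word answers remain possible.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def extract_answer(tokens, start_logits, end_logits):
    """Pick the start word, then the best end word at or after it."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    # Step 1: word with the highest start probability.
    start = max(range(len(tokens)), key=lambda i: start_probs[i])
    # Step 2: highest end probability among words from the start onward.
    end = max(range(start, len(tokens)), key=lambda i: end_probs[i])
    # Step 3: the two words and everything between them form the answer.
    return tokens[start:end + 1]


# Hypothetical scores; position 0 has the largest end score but precedes
# the start word, so it is never considered as the end.
tokens = ["the", "bridge", "was", "washed", "out", "by", "surge"]
start_logits = [0.1, 3.0, 0.2, 0.1, 0.0, 0.0, 0.0]
end_logits = [4.0, 0.1, 0.0, 0.3, 3.0, 0.2, 1.0]
answer = extract_answer(tokens, start_logits, end_logits)
```

Restricting the end word to positions after the start word guarantees a well-formed span, which matters when answers are phrases or sentences rather than single words.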

The vector values of the words selected as the start word and the end word do not simply express the meaning of each word by itself; they express a meaning that takes the surrounding context into account.

The longer the information sought through the question answering system, the more intermediate words are needed to account for contextual meaning. Existing question answering datasets, whose answers contain few such words, are therefore not suitable for obtaining long information such as infrastructure damage information. In this respect, the present invention differs from the prior art.

In the present invention, new training data were selected in order to derive new disaster and infrastructure weights to replace the pre-trained weights used in existing question answering systems based on Google's BERT. In particular, unlike the training datasets mainly used in existing question answering systems (such as SQuAD), the correct answers in the new training data are set in the form of phrases or clauses, so that more accurate and more varied information can be provided to the user.

Table 2 below compares the experimental results obtained by applying BERT's existing weights (weights obtained using SQuAD) to the test data with the results obtained using the disaster infrastructure weights proposed in the present invention.

The F1 score is a metric that combines precision and recall, and it takes a relatively high value when precision and recall are balanced rather than skewed toward one side. The F1 score in Table 2 is obtained by comparing, word by word, the agreement between the model's predicted answer and the actual correct answer.
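
A word-level F1 score of this kind can be computed as sketched below. The sketch assumes whitespace tokenization and a SQuAD-style bag-of-words overlap; the description does not state which tokenization or text normalization was actually used, and the function name and example strings are hypothetical.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Word-level F1 between a predicted answer and the reference answer."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    # Multiset intersection counts each shared word at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Four shared words: precision 4/5, recall 4/7, F1 = 2/3.
score = token_f1("the bridge was washed out",
                 "bridge was washed out by storm surge")
```

Because the score rewards partial overlap, it suits answers that are phrases or sentences, where an exact-match criterion would be overly strict.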

As shown in Table 2, the F1 score obtained with Google's pre-trained weights is 39.2%, whereas the F1 score obtained with the disaster infrastructure weights according to the present invention is 86.4%, more than twice as high.

These results show that the present invention can extract infrastructure damage information from disaster reports with high accuracy. The information obtained through the present invention can be used to minimize infrastructure damage caused by future disasters, which in turn may reduce social and economic losses.

Meanwhile, the present invention may also be implemented as a disaster information question answering method. This is substantially the same invention as the disaster information question answering system described above, differing only in the category of the invention. Accordingly, description of the common elements is omitted and only the gist is described.

FIGS. 4 and 5 are flowcharts of the disaster information question answering method for extracting infrastructure damage information from a disaster report using disaster infrastructure weights according to the present invention.

The present invention is a disaster information question answering method in which a control server 10 having a computation function and a database 20 storing disaster report information are connected via a network, and the control server 10 analyzes the disaster report information using a BERT model. In the control server 10, the method comprises: a paragraph search step (S100) having a step S110 in which the vector conversion unit 110 converts input question information and disaster report information into a question vector and a report paragraph vector, respectively, a step S120 in which the similarity comparison unit 120 calculates the similarity between each question vector and each report paragraph vector, and a step S130 in which the paragraph determination unit 130 determines at least one report paragraph vector for each question vector according to the similarity calculated by the similarity comparison unit 120; and a question answering step (S200) having a step S210 in which the learning data selection unit 210 selects the question vector and the report paragraph vector determined by the paragraph search unit 100 as training data, a step S220 in which the hyperparameter determination unit 220 derives optimal hyperparameters according to the training data, a step S230 in which the weight calculation unit 230 calculates the disaster infrastructure weights by applying the hyperparameters to the BERT model and training the model, and a step S240 in which the damage information extraction unit 240 extracts the correct answer to the question by applying the calculated disaster infrastructure weights.

The present invention may also be implemented as a computer program. Specifically, the present invention may be implemented as a computer program stored in a computer-readable recording medium which, in combination with hardware, causes a computer to execute the disaster information selection method for analyzing disaster message information from social media using disaster weights according to the present invention.

The methods according to embodiments of the present invention may be implemented in the form of programs readable by various computer means and recorded on a computer-readable recording medium. The recording medium may contain program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. Such hardware devices may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

The embodiments described in this specification and the accompanying drawings merely illustrate, by way of example, part of the technical idea included in the present invention. The embodiments disclosed in this specification are therefore intended to explain, not to limit, the technical idea of the present invention, and it is evident that the scope of the technical idea of the present invention is not limited by these embodiments. All modifications and specific embodiments that those skilled in the art can readily infer within the scope of the technical idea contained in the specification and drawings of the present invention should be construed as falling within the scope of rights of the present invention.

10: control server 20: database
100: paragraph search unit 110: vector conversion unit
120: similarity comparison unit 130: paragraph determination unit
200: question answering unit 210: learning data selection unit
220: hyperparameter determination unit 230: weight calculation unit
240: damage information extraction unit

Claims

A disaster information question answering system for extracting infrastructure damage information from a disaster report using disaster infrastructure weights, in which a control server having a computation function and a database storing disaster report information are connected via a network and the control server analyzes the disaster report information using a BERT model, wherein the control server comprises:
a paragraph search unit having a vector conversion unit that converts input question information and disaster report information into a question vector and a report paragraph vector, respectively, a similarity comparison unit that calculates the similarity between each question vector and each report paragraph vector, and a paragraph determination unit that determines at least one report paragraph vector for each question vector according to the similarity calculated by the similarity comparison unit; and
a question answering unit having a learning data selection unit that selects the question vector and the report paragraph vector determined by the paragraph search unit as training data, a hyperparameter determination unit that derives optimal hyperparameters according to the training data, a weight calculation unit that calculates the disaster infrastructure weights by applying the hyperparameters to the BERT model and training the model, and a damage information extraction unit that extracts the correct answer to the question by applying the calculated disaster infrastructure weights.

The disaster information question answering system of claim 1,
wherein the question information input to the vector conversion unit of the paragraph search unit is input in units of sentences and the disaster report information is input in units of paragraphs.

The disaster information question answering system of claim 1,
wherein the similarity comparison unit of the paragraph search unit calculates the cosine similarity between each question vector and each report paragraph vector and assigns a similarity score to each report paragraph vector.

The disaster information question answering system of claim 3,
wherein the paragraph determination unit of the paragraph search unit determines a preset number of report paragraph vectors in descending order of similarity score among the report paragraph vectors.

The disaster information question answering system of claim 1,
wherein, in the learning data selection unit of the question answering unit,
the training dataset consists of questions, report paragraphs, and correct answers.

The disaster information question answering system of claim 5,
wherein the correct answer is set in units of phrases, clauses, or sentences.

The disaster information question answering system of claim 1,
wherein, in the question answering unit,
the hyperparameter determination unit determines the hyperparameters as a batch size of 1, 6 epochs, and a learning rate of 7e-6, and
the weight calculation unit calculates the disaster infrastructure weights using the determined hyperparameters.

The disaster information question answering system of claim 7,
wherein, in the damage information extraction unit of the question answering unit,
each word of the question and the paragraph of the training dataset is assigned a vector value as it passes through the hidden layers of the BERT model in which the disaster infrastructure weights are set.

The disaster information question answering system of claim 8,
wherein each word of the paragraph that has passed through the final hidden layer is expressed as two probabilities by a softmax activation function.

The disaster information question answering system of claim 9,
wherein the two probabilities are the probability of being the start word of the correct answer and the probability of being its end word, and wherein:
as a first step, the word with the highest probability of being the start word is determined;
as a second step, the word with the highest probability of being the end word among the words that follow it is determined; and
as a third step, the two words and the words between them are extracted as the correct answer.

A disaster information question answering method for extracting infrastructure damage information from a disaster report using disaster infrastructure weights, in which a control server having a computation function and a database storing disaster report information are connected via a network and the control server analyzes the disaster report information using a BERT model, the method comprising, in the control server:
a paragraph search step (S100) having a step S110 in which a vector conversion unit converts input question information and disaster report information into a question vector and a report paragraph vector, respectively, a step S120 in which a similarity comparison unit calculates the similarity between each question vector and each report paragraph vector, and a step S130 in which a paragraph determination unit determines at least one report paragraph vector for each question vector according to the similarity calculated by the similarity comparison unit; and
a question answering step (S200) having a step S210 in which a learning data selection unit selects the question vector and the report paragraph vector determined by the paragraph search unit as training data, a step S220 in which a hyperparameter determination unit derives optimal hyperparameters according to the training data, a step S230 in which a weight calculation unit calculates the disaster infrastructure weights by applying the hyperparameters to the BERT model and training the model, and a step S240 in which a damage information extraction unit extracts the correct answer to the question by applying the calculated disaster infrastructure weights.

The disaster information question answering method of claim 11,
wherein the question information input to the vector conversion unit in step S110 is input in units of sentences and the disaster report information is input in units of paragraphs.

The disaster information question answering method of claim 11,
wherein the similarity comparison unit in step S120 calculates the cosine similarity between each question vector and each report paragraph vector and assigns a similarity score to each report paragraph vector.

The disaster information question answering method of claim 13,
wherein the paragraph determination unit in step S130 determines a preset number of report paragraph vectors in descending order of similarity score among the report paragraph vectors.

The disaster information question answering method of claim 11,
wherein, in the learning data selection unit in step S210, the training dataset consists of questions, report paragraphs, and correct answers.

The disaster information question answering method of claim 15,
wherein the correct answer is set in units of phrases, clauses, or sentences.

The disaster information question answering method of claim 11,
wherein the hyperparameter determination unit in step S220 determines the hyperparameters as a batch size of 1, 6 epochs, and a learning rate of 7e-6, and
the weight calculation unit in step S230 calculates the disaster infrastructure weights using the determined hyperparameters.

The disaster information question answering method of claim 17,
wherein, in the damage information extraction unit in step S240,
each word of the question and the paragraph of the training dataset is assigned a vector value as it passes through the hidden layers of the BERT model in which the disaster infrastructure weights are set.

The disaster information question answering method of claim 18,
wherein each word of the paragraph that has passed through the final hidden layer is expressed as two probabilities by a softmax activation function.

The disaster information question answering method of claim 19,
wherein the two probabilities are the probability of being the start word of the correct answer and the probability of being its end word, and wherein:
as a first step, the word with the highest probability of being the start word is determined;
as a second step, the word with the highest probability of being the end word among the words that follow it is determined; and
as a third step, the two words and the words between them are extracted as the correct answer.

A computer program stored in a computer-readable recording medium which, in combination with hardware, causes a computer to execute the disaster information question answering method of claim 11 for extracting infrastructure damage information from a disaster report using disaster infrastructure weights.