KR102149701B1 - A method for mapping a natural language sentence to an SQL query

Info

Publication number: KR102149701B1
Application number: KR1020180170691A
Authority: KR
Inventors: 한욱신; 김현지; 조정호; 이유경; 홍기재
Original assignee: 포항공과대학교 산학협력단
Priority date: 2018-12-27
Filing date: 2018-12-27
Publication date: 2020-08-31
Also published as: KR20200080822A

Abstract

The present invention relates to a method for mapping semantically corresponding natural language sentences and SQL queries in text data. The method is performed on a computer and comprises: a) extracting mapping candidates of natural language sentences and SQL queries within the same document; b) calculating, for the natural language sentence and the SQL query of each mapping candidate, a distance based on their positions in the document; c) calculating a token correspondence score by matching tokens in the natural language sentence with tokens in the SQL query; d) calculating a semantic-level similarity between the sentence and the query by comparing the element-wise sum of the embedding vectors of the tokens in the natural language sentence with the element-wise sum of the embedding vectors of the tokens in the SQL query; and e) obtaining a mapping score using the distance, the token correspondence score, and the semantic-level similarity as features, and mapping natural language sentences to SQL queries by comparing the scores.

Description

{A method for mapping a natural language sentence to an SQL query}

The present invention relates to a method for mapping semantically corresponding natural language and SQL in text data, and more particularly to such a method using a Lexical-Syntactic-Semantic (LSS) similarity analysis technique.

The problem addressed is the following: given a set of SQL queries and a set of natural language sentences in text data, including web data, map to each SQL query the natural language sentence that semantically corresponds to it. This problem has many real-world applications. For example, a dataset of natural language and SQL query pairs can be built by automatically extracting corresponding natural language queries and SQL queries from web data.

Such datasets can be used, for example, to develop technology that converts natural language into SQL. The datasets currently available for machine-learning-based conversion of natural language into SQL are limited in that they are either small or contain only simple forms of SQL queries.

Most existing methods for collecting datasets of semantically corresponding natural language sentences and SQL queries involve manual work through crowdsourcing.

"Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, Luke Zettlemoyer (2017). Learning a Neural Semantic Parser from User Feedback. ACL, (1), 963-973." specifically describes a method of collecting data from SQL experts.

In addition, "Victor Zhong, Caiming Xiong, Richard Socher (2017). Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. CoRR, abs/1709.00103." uses a method in which SQL is generated automatically from templates and data is then collected through crowdsourcing.

These methods can collect highly reliable datasets, but because collection takes a long time and is expensive, they are limited in their ability to collect large datasets.

A method of collecting a dataset automatically is described in "Florin Brad, Radu Cristian Alexandru Iacob, Ionel-Alexandru Hosu, Traian Rebedea (2017). Dataset for a Neural Natural Language Interface for Databases (NNLIDB). IJCNLP, (1), 906-914." As a way of automatically extracting corresponding natural language sentences and SQL queries, it describes extracting SQL queries and the sentences that describe them from the Stack Exchange site.

That work collects, through a public API, queries over the entire database of the well-known question-and-answer platform Stack Exchange. However, apart from filtering with a few rules, such as discarding SQL queries that are 2,000 characters or longer, duplicate entries, and entries whose natural language part is empty, it does not verify that the natural language description and the SQL query actually correspond in meaning, so the resulting dataset has low reliability.

As described above, conventional methods of collecting datasets of natural language and SQL pairs either collect the data from a number of SQL experts or use templates together with crowdsourcing.

Collecting from SQL experts requires specialized personnel and therefore considerable cost and time, so the datasets collected this way to date are small.

The template-based method starts from templates in which only the data values of a natural language sentence and an SQL query are left blank. SQL queries are generated automatically by filling in the data, and crowd workers then read the natural language sentence of the template and write paraphrased sentences, which are collected. Datasets collected this way are larger than those collected from experts, but because they rely on templates they contain only simple forms of SQL.

To date, the technique available for automatically extracting corresponding natural language sentences and SQL queries is the method that uses the Stack Exchange site. This technique extracts data from a website with a uniform format, so (1) the natural language corresponding to each SQL query is at a fixed location and the pair can be extracted directly, and (2) there is no step that analyzes similarity in the extracted data to perform mapping. Consequently, the prior art cannot be extended to web data or text data in other formats because of (1), and has low reliability because of (2).

The problem to be solved by the present invention, in view of these problems of the prior art, is to provide a mapping method that can map natural language sentences to SQL queries by analyzing the similarity between the natural language sentences and the SQL queries present in text data.

The method of the present invention for mapping semantically corresponding natural language and SQL in text data is a method, performed on a computer, of mapping natural language sentences to SQL queries, comprising: a) extracting mapping candidates of natural language sentences and SQL queries within the same document; b) calculating, for the natural language sentence and the SQL query of each mapping candidate, a distance based on their positions in the document; c) calculating a token correspondence score by matching tokens in the natural language sentence with tokens in the SQL query; d) calculating a semantic-level similarity between the sentence and the query by comparing the element-wise sum of the embedding vectors of the tokens in the natural language sentence with the element-wise sum of the embedding vectors of the tokens in the SQL query; and e) obtaining a mapping score using the distance between the sentence and the query, the token correspondence score, and the semantic-level similarity as features, and mapping natural language sentences to SQL queries by comparing the scores.

In an embodiment of the present invention, step a) may extract the set of mapping candidates from a set of documents containing SQL queries by taking the Cartesian product of the set of natural language sentences and the set of SQL queries present in the same document.

In an embodiment of the present invention, step b) may define the position of each natural language sentence or SQL query in a document as the number of natural language sentences and SQL queries that precede it in the document and, when the position of the natural language sentence is N and the position of the SQL query is M, calculate the distance between the sentence and the query as |M-N|.

In an embodiment of the present invention, step c) may include: c-1) for the natural language sentence and the SQL query of each mapping candidate, matching identical tokens between the sentence and the query, and matching proper-noun tokens; c-2) for the natural language sentence and the SQL query of each mapping candidate, converting both into parse trees using a natural language syntactic parser and, from mapping rules on the parse trees and the matched-node information obtained in step c-1), additionally matching nodes of the sentence's parse tree with nodes of the query's parse tree; and c-3) from the token correspondence information obtained in steps c-1) and c-2), calculating the token correspondence score of the sentence and the query as (the number of mutually corresponding tokens in the natural language sentence and the SQL query) / (the total number of tokens in the natural language sentence and the SQL query).

In an embodiment of the present invention, step c-1) may, for the natural language sentence and the SQL query of each mapping candidate, first designate prepositions and conjunctions among the tokens of the sentence and the query as stop words and remove them; then compare the remaining tokens as strings and match a token in the sentence with a token in the SQL query when the two are exactly identical; and then, if any still-unmatched token appears in a proper noun dictionary, search the dictionary and match two different tokens that refer to the same proper noun.

In an embodiment of the present invention, step c-2) may, for the natural language sentence and the SQL query of each mapping candidate, generate two parse trees from the sentence and the query using a dependency-based syntactic parser, in which each node corresponds to a token and each edge represents a dependency relation between tokens; then, when two mutually matched nodes in the two trees each have an edge of the same type, and the child nodes connected by those edges have no matched counterpart and are not stop words, match the two child nodes with each other.

In an embodiment of the present invention, step e) may include: e-1) calculating the mapping score of each mapping candidate using an XGBoost model that takes the distance between the natural language sentence and the SQL query, the token correspondence score, and the semantic-level similarity as features and returns the probability that the mapping candidate is an actual mapping; and e-2) for each SQL query, mapping to it the natural language sentence with the highest mapping score among the mapping candidates.

By analyzing the similarity between the natural language sentences and the SQL queries present in text data and mapping sentences to queries, the present invention can automate the process of verifying pairs of natural language sentences and SQL queries that semantically correspond to each other. It can determine which natural language sentence semantically corresponds to each SQL query among the many sentences and queries mixed together in text data such as web data, and can therefore be used to collect datasets from text data.

In addition, because the natural language corresponding to an SQL query does not have to be at a fixed location, the present invention improves scalability. Because the mapping is made by analyzing similarity in the extracted data, it also improves reliability.

FIG. 1 is a flowchart of a mapping method according to a preferred embodiment of the present invention.

Hereinafter, the method of the present invention for mapping semantically corresponding natural language and SQL in text data is described in detail with reference to the accompanying drawing.

The embodiments of the present invention are provided to describe the invention more completely to those of ordinary skill in the art. The embodiments described below may be modified in various other forms, and the scope of the invention is not limited to them. Rather, these embodiments are provided to make the disclosure more thorough and complete and to fully convey the spirit of the invention to those skilled in the art.

The terms used in this specification are used to describe particular embodiments and are not intended to limit the invention. As used herein, a singular form may include the plural unless the context clearly indicates otherwise. In addition, "comprise" and/or "comprising", as used herein, specify the presence of the stated shapes, numbers, steps, operations, members, elements, and/or groups thereof, and do not exclude the presence or addition of one or more other shapes, numbers, operations, members, elements, and/or groups. As used herein, the term "and/or" includes any one of, and all combinations of one or more of, the listed items.

Although terms such as first and second are used in this specification to describe various members, regions, and/or parts, these members, parts, regions, layers, and/or parts are obviously not limited by those terms. The terms do not imply any particular order, position, or rank, and are used only to distinguish one member, region, or part from another. Accordingly, a first member, region, or part described below could be termed a second member, region, or part without departing from the teachings of the invention.

Hereinafter, embodiments of the present invention are described with reference to drawings that schematically illustrate them. In the drawings, variations of the illustrated shapes may be expected, for example depending on manufacturing techniques and/or tolerances. Accordingly, the embodiments should not be construed as limited to the particular shapes of the regions illustrated herein, but should include, for example, changes in shape that result from manufacturing.

FIG. 1 is a flowchart of a mapping method according to an embodiment of the present invention.

Referring to FIG. 1, the natural language-SQL mapping method of the present invention consists of: extracting mapping candidates of natural language sentences and SQL queries within the same document (S10); calculating, for the natural language sentence and the SQL query of each mapping candidate, a distance based on their positions in the document (S20); calculating a token correspondence score by matching tokens in the natural language sentence with tokens in the SQL query (S30); calculating a semantic-level similarity between the sentence and the query by comparing the element-wise sum of the embedding vectors of the tokens in the natural language sentence with the element-wise sum of the embedding vectors of the tokens in the SQL query (S40); and obtaining a mapping score using the distance, the token correspondence score, and the semantic-level similarity as features, and mapping natural language sentences to SQL queries by comparing the scores (S50).

The configuration and operation of the present invention configured as above are described in more detail below.

The present invention extracts natural language sentences from text and maps each extracted sentence to the SQL query that semantically corresponds to it. The method is performed on a computing device such as a computer, so each step is executed by a controller such as the computer's central processing unit.

First, in step S10, when a set of documents (S11) in which natural language sentences are to be mapped to semantically corresponding SQL queries is given as input, mapping candidates of natural language sentences and SQL queries within the same document are extracted.

Specifically, the set of natural language sentences in a document is extracted (S12), the set of SQL queries in the same document is extracted (S13), and the set of mapping candidates is obtained as the Cartesian product of the two sets (S14).
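As an illustrative sketch only, and not part of the claimed method, steps S12 to S14 can be pictured in Python as follows. The function name `extract_candidates` and the sample sentences and queries are assumptions made for this example; how sentences and queries are recognized in a document is not shown.

```python
from itertools import product


def extract_candidates(sentences, queries):
    """Return every (sentence, query) pair from a single document (S14)."""
    return list(product(sentences, queries))


# Hypothetical contents of one document, for illustration only.
sentences = [
    "List the names of all employees.",
    "Count the orders per customer.",
]
queries = [
    "SELECT name FROM employee",
    "SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id",
]

candidates = extract_candidates(sentences, queries)
print(len(candidates))  # 4 candidate pairs: 2 sentences x 2 queries
```

Because the product is taken per document, the number of candidates grows with the product of the sentence and query counts of each document rather than of the whole collection.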

A mapping score is then calculated for each natural language sentence and SQL query pair in the extracted set of mapping candidates.

To calculate the mapping score, the distance between the sentence and the query (S20), the token correspondence score (S30), and the semantic-level similarity (S40) are calculated. These three calculations are processed in parallel.

In step S20, the distance between the natural language sentence and the SQL query is calculated based on their positions in the document.

Specifically, the positions of the natural language sentence and the SQL query in the document are obtained (S21). The position of each natural language sentence or SQL query is defined as the number of natural language sentences and SQL queries that precede it in the document.

The Manhattan distance between the position of the natural language sentence and the position of the SQL query is then calculated (S22).

That is, when the position of the natural language sentence is N and the position of the SQL query is M, the distance between the sentence and the query is calculated as |M-N|.
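The position and distance definitions of steps S21 and S22 can be sketched as below. This is a minimal illustration under stated assumptions: the document is represented as an ordered list of `(kind, text)` items, each text is assumed to occur only once in the document, and the names `positions` and `distance` are hypothetical.

```python
def positions(items):
    """Map each item's text to the number of sentences and queries before it (S21)."""
    return {text: index for index, (_kind, text) in enumerate(items)}


def distance(items, sentence, query):
    """Return |M - N| for a sentence at position N and a query at position M (S22)."""
    pos = positions(items)
    return abs(pos[query] - pos[sentence])


# Hypothetical document: three sentences and two queries in reading order.
document = [
    ("nl", "A"),
    ("nl", "B"),
    ("sql", "Q1"),
    ("nl", "C"),
    ("sql", "Q2"),
]

print(distance(document, "A", "Q1"))  # 2
print(distance(document, "C", "Q1"))  # 1
```

Since positions are scalar, the Manhattan distance reduces to the absolute difference of the two indices.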

In step S30, a token correspondence score is calculated by matching tokens in the natural language sentence with tokens in the SQL query.

Specifically, in step S31, tokens in the sentence and the query that have the same lexical meaning are matched with each other.

First, prepositions and conjunctions are removed from the sentence and the query, and the remaining tokens are compared as strings so that exactly identical tokens are matched.

The removed prepositions and conjunctions are treated as stop words. Tokens that remain unmatched are then matched, using a synonym dictionary, with tokens that have the same meaning.

Next, for tokens still unmatched, a similarity score between tokens is calculated using n-gram and WordNet techniques, and two tokens are matched when their similarity score is at or above a threshold.
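The n-gram part of this similarity check might look like the following sketch. The description does not state which n-gram measure or threshold is used, so the Jaccard overlap of character trigrams and the threshold of 0.5 are assumptions chosen for illustration; the WordNet component is omitted.

```python
def char_ngrams(token, n=3):
    """Return the set of character n-grams of a lower-cased token."""
    token = token.lower()
    if len(token) < n:
        return {token}
    return {token[i:i + n] for i in range(len(token) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-grams (an assumed measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similar_enough(a, b, threshold=0.5):
    """Match two tokens when their similarity reaches the (assumed) threshold."""
    return ngram_similarity(a, b) >= threshold


print(similar_enough("employee", "employees"))  # True
print(similar_enough("employee", "orders"))     # False
```

A surface measure of this kind catches inflections and naming variants such as a plural column name, which exact string comparison misses.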

Finally, for tokens that remain unmatched, a proper noun dictionary is searched, and two tokens that refer to the same proper noun are matched with each other.
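A minimal sketch of the first and last passes of step S31 (stop-word removal, exact string matching, and proper-noun lookup) is shown below. The stop-word list, the alias dictionary `PROPER_NOUNS`, and the function name `lexical_matches` are all hypothetical, and the synonym-dictionary and similarity passes are left out for brevity.

```python
# Hypothetical, deliberately tiny resources for illustration.
STOP_WORDS = {"of", "in", "from", "and", "or", "by"}
PROPER_NOUNS = {"nyc": "New York City", "new_york": "New York City"}


def lexical_matches(nl_tokens, sql_tokens):
    """Match tokens by exact string equality, then by shared proper noun."""
    nl = [t.lower() for t in nl_tokens if t.lower() not in STOP_WORDS]
    sql = [t.lower() for t in sql_tokens if t.lower() not in STOP_WORDS]
    matches, used_nl, used_sql = [], set(), set()

    def match_pass(same):
        # Pair each unmatched sentence token with the first unmatched query token.
        for i, a in enumerate(nl):
            if i in used_nl:
                continue
            for j, b in enumerate(sql):
                if j not in used_sql and same(a, b):
                    matches.append((a, b))
                    used_nl.add(i)
                    used_sql.add(j)
                    break

    match_pass(lambda a, b: a == b)
    match_pass(lambda a, b: a in PROPER_NOUNS
               and PROPER_NOUNS[a] == PROPER_NOUNS.get(b))
    return matches


nl_tokens = ["names", "of", "customers", "in", "NYC"]
sql_tokens = ["SELECT", "names", "FROM", "customers", "WHERE", "city", "=", "new_york"]
print(lexical_matches(nl_tokens, sql_tokens))
# [('names', 'names'), ('customers', 'customers'), ('nyc', 'new_york')]
```

Running the cheaper exact pass first means the dictionary lookup is attempted only for tokens that are still unmatched, which mirrors the order given in the description.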

Next, in step S32, tokens that are syntactically equivalent within the SQL query and the natural language sentence are matched with each other.

First, two parse trees are generated from the sentence and the query using a dependency-based natural language syntactic parser.

Each token of the sentence or query corresponds to a node of the tree, and each dependency relation between tokens corresponds to an edge of the tree.

Next, when two mutually matched nodes in the two parse trees each have an edge of the same type leading to a child node, and those child nodes are not yet part of any matched pair and are not stop words, the tokens of the two child nodes are matched with each other.

In this way, all corresponding tokens between the natural language sentence and the SQL query are obtained.
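The child-matching rule of step S32 can be sketched as follows. This is an assumption-laden illustration: each parse tree is given as a dictionary from a node to its `(edge_label, child)` list, nodes are identified by token text (assumed unique within a tree), the edge labels are made up, and newly matched pairs are themselves expanded, which the description does not explicitly state. Producing the trees with an actual dependency parser is not shown.

```python
from collections import deque


def syntactic_matches(nl_tree, sql_tree, seed, stop_words=frozenset()):
    """Extend seed matches to children reached through edges of the same type."""
    matched_nl = {a for a, _ in seed}
    matched_sql = {b for _, b in seed}
    result = list(seed)
    queue = deque(seed)
    while queue:
        a, b = queue.popleft()
        for label_a, child_a in nl_tree.get(a, []):
            if child_a in matched_nl or child_a in stop_words:
                continue
            for label_b, child_b in sql_tree.get(b, []):
                if (label_b != label_a or child_b in matched_sql
                        or child_b in stop_words):
                    continue
                matched_nl.add(child_a)
                matched_sql.add(child_b)
                result.append((child_a, child_b))
                queue.append((child_a, child_b))  # assumed: expand new pairs too
                break
    return result


# Hypothetical trees: "names" is already matched lexically in both.
nl_tree = {"show": [("obj", "names")], "names": [("nmod", "staff")]}
sql_tree = {"select": [("obj", "names")], "names": [("nmod", "employee")]}
print(syntactic_matches(nl_tree, sql_tree, [("names", "names")]))
# [('names', 'names'), ('staff', 'employee')]
```

Here "staff" and "employee" share no surface form, but they are paired because both hang off the matched node "names" through an edge of the same type.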

Then, in step S33, the token correspondence score is calculated as (the number of mutually corresponding tokens in the natural language sentence and the SQL query) / (the total number of tokens in the natural language sentence and the SQL query).
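One possible reading of this ratio is sketched below. It assumes that the numerator counts matched tokens on both sides (two per matched pair), so that the score lies between 0 and 1; the description does not spell this out, nor whether stop words count toward the total. The function name is hypothetical.

```python
def token_correspondence_score(pairs, nl_tokens, sql_tokens):
    """Matched tokens on both sides divided by all tokens on both sides (S33)."""
    total = len(nl_tokens) + len(sql_tokens)
    if total == 0:
        return 0.0
    return 2 * len(pairs) / total


# Hypothetical example: 3 matched pairs, 4 sentence tokens, 6 query tokens.
pairs = [("names", "names"), ("customers", "customers"), ("nyc", "new_york")]
nl_tokens = ["names", "customers", "nyc", "please"]
sql_tokens = ["select", "names", "customers", "where", "city", "new_york"]
print(token_correspondence_score(pairs, nl_tokens, sql_tokens))  # 0.6
```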

In step S40, the semantic-level similarity between the sentence and the query is calculated by comparing the element-wise sum of the embedding vectors of the tokens in the natural language sentence with the element-wise sum of the embedding vectors of the tokens in the SQL query.

First, in the semantic similarity calculation of step S41, the tokens in the sentence and the query are converted into embedding vectors using word2vec.

Next, the element-wise sum of the embedding vectors of the tokens that make up each sentence or query is computed, so that each sentence or query is represented by a single vector.

Finally, the semantic-level similarity between the sentence and the query is calculated as the cosine similarity between the vector representing the sentence and the vector representing the query.
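The summation and cosine steps can be illustrated with the sketch below. The three-dimensional vectors in `EMBEDDINGS` are made-up stand-ins for word2vec embeddings, tokens without an embedding are simply skipped (an assumption), and the function names are hypothetical.

```python
import math

# Made-up toy vectors standing in for word2vec embeddings.
EMBEDDINGS = {
    "names": [1.0, 0.0, 0.0],
    "employee": [0.0, 1.0, 0.0],
    "staff": [0.0, 0.9, 0.1],
}


def sum_vector(tokens, embeddings, dim):
    """Element-wise sum of the embedding vectors of the given tokens."""
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:  # assumed: tokens without an embedding are skipped
            continue
        for i, value in enumerate(vector):
            total[i] += value
    return total


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


nl_vector = sum_vector(["names", "staff"], EMBEDDINGS, 3)
sql_vector = sum_vector(["select", "names", "employee"], EMBEDDINGS, 3)
similarity = cosine(nl_vector, sql_vector)
print(round(similarity, 3))
```

Because cosine similarity ignores vector length, summing the token vectors instead of averaging them does not change the result.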

In step S50, a mapping score is obtained using the distance between the sentence and the query, the token correspondence score, and the semantic-level similarity as features, and natural language sentences are mapped to SQL queries by comparing the scores.

Specifically, in step S51, for each pair in the extracted set of mapping candidates, an XGBoost model takes the distance between the natural language sentence and the SQL query, the token correspondence score, and the semantic-level similarity as features and calculates a mapping score that represents the probability that the candidate is an actual mapping.

The XGBoost model is trained on correct and incorrect mappings between natural language sentences and SQL queries collected through crowdsourcing.

Then, from the mapping scores calculated by the model for the mapping candidates, the natural language sentence with the highest mapping score for each SQL query is mapped to that query.
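The final selection can be sketched as a per-query maximum over the scored candidates. In the sketch below the scores are hypothetical numbers standing in for the probabilities returned by the trained model; training and calling the model are not shown, and the function name is an assumption.

```python
def best_sentence_per_query(scored_candidates):
    """For each SQL query, keep the sentence with the highest mapping score."""
    best = {}
    for sentence, query, score in scored_candidates:
        if query not in best or score > best[query][1]:
            best[query] = (sentence, score)
    return {query: sentence for query, (sentence, _score) in best.items()}


# Hypothetical (sentence, query, score) triples for one document.
scored = [
    ("List all employee names.", "SELECT name FROM employee", 0.91),
    ("Count orders per customer.", "SELECT name FROM employee", 0.12),
    ("List all employee names.", "SELECT COUNT(*) FROM orders", 0.20),
    ("Count orders per customer.", "SELECT COUNT(*) FROM orders", 0.78),
]

mapping = best_sentence_per_query(scored)
print(mapping["SELECT name FROM employee"])  # List all employee names.
```

Selecting per query means every SQL query receives exactly one sentence, while a sentence may be chosen for more than one query or for none.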

이처럼 본 발명은 어휘 레벨(Lexical level), 구문(Syntactic level), 의미 레벨(Semantic level)의 세 가지 레벨의 유사도 분석 기술을 이용하여 자연어 문장과 SQL 질의의 유사도를 계산할 수 있다. As described above, the present invention can calculate the similarity between a natural language sentence and an SQL query using three levels of similarity analysis technology: lexical level, syntactic level, and semantic level.

However, the relationship between the calculated similarity scores and whether a natural language sentence and an SQL query are actually mapped has not been established, so the scores cannot be used as they are.

To solve this problem, the present invention proposes a method that automatically approximates the relationship between the similarity scores and the presence or absence of a mapping by means of a classification machine learning model.

With a classification machine learning model, features can be added flexibly to suit the problem at hand, so higher accuracy can be expected.

In the present invention, the data needed to train the machine learning model was collected through crowdsourcing. Because a simple classification model is used, it can be trained on only a few hundred training examples, which is far less data than is required by deep-learning-based techniques for translating natural language queries into SQL queries.

In addition, once training is complete the model can be reused indefinitely, so this technique can collect far more data from continuously growing web data than was used for training.

The embodiments of the present invention disclosed in this specification and the drawings are merely specific examples presented to explain the technical content of the invention and to aid understanding of it, and are not intended to limit the scope of the invention. It is apparent to those of ordinary skill in the art to which the invention pertains that other modifications based on the technical idea of the invention can be implemented.

Claims

A method of mapping semantically corresponding natural language and SQL in text data, performed on a computer to map natural language sentences to SQL queries, the method comprising:
a) extracting mapping candidates of natural language sentences and SQL queries within the same document;
b) for the natural language sentence and the SQL query of each mapping candidate, calculating a distance based on their positions in the document, wherein the position of each natural language sentence or SQL query in the document is defined by the number of natural language sentences or SQL queries located in the document, and the distance between the natural language sentence and the SQL query is calculated as |M − N|, where N is the position of the natural language sentence and M is the position of the SQL query;
c) calculating a token correspondence score by matching tokens in the natural language sentence with tokens in the SQL query;
d) calculating a semantic-level similarity between the sentence and the query by comparing the element-wise sum vector of the embedding vectors of the tokens in the natural language sentence with that of the tokens in the SQL query; and
e) obtaining a mapping score using the distance between the natural language sentence and the SQL query, the token correspondence score, and the semantic-level similarity as features, and mapping the natural language sentence to the SQL query.

The method of claim 1,
wherein step a)
extracts, from a set of documents containing SQL queries, a set of mapping candidates as the Cartesian product of the set of natural language sentences and the set of SQL queries present in the same document.

(Deleted)

The method of claim 1,
wherein step c) comprises:
c-1) for the natural language sentence and the SQL query of each mapping candidate, matching identical tokens among the tokens in the natural language sentence and the tokens in the SQL query, and matching proper-noun tokens;
c-2) for the natural language sentence and the SQL query of each mapping candidate, converting the natural language sentence and the SQL query into parse trees using a natural language sentence parser, and additionally matching nodes of the natural language sentence parse tree with nodes of the SQL query parse tree based on the two parse trees, the mapping rules on the parse trees, and the node correspondence information obtained in step c-1); and
c-3) from the token correspondence information obtained in steps c-1) and c-2), calculating the token correspondence score of the natural language sentence and the SQL query as (the number of tokens in the natural language sentence and the SQL query that correspond to each other) / (the total number of tokens in the natural language sentence and the SQL query).

The method of claim 4,
wherein step c-1),
for the natural language sentence and the SQL query of each mapping candidate, first treats prepositions and conjunctions among the tokens in the natural language sentence and the tokens in the SQL query as stop words and removes them, then performs string comparison on the remaining tokens and matches a token in the natural language sentence with a token in the SQL query when the two match exactly, and, when a token that has not yet been matched exists in a proper noun dictionary, searches the proper noun dictionary and matches that token with a token referring to the same proper noun, if such a token exists.

The method of claim 4,
wherein step c-2),
for the natural language sentence and the SQL query of each mapping candidate, generates two parse trees by applying a dependency-based syntactic parser to the natural language sentence and the SQL query, each node of a tree corresponding to a token and each edge representing a dependency relation between tokens, and, when two mutually corresponding nodes in the two parse trees each have an edge of the same type and the child nodes connected by those edges have no corresponding node and are not stop words, matches the two child nodes with each other.

The method of claim 1,
wherein step e) comprises:
e-1) calculating the mapping score of each mapping candidate using an XGBoost model that returns the probability that the mapping candidate is an actual mapping, with the distance between the natural language sentence and the SQL query, the token correspondence score, and the semantic-level similarity as features; and
e-2) from the mapping scores of the mapping candidates, mapping to each SQL query the natural language sentence that has the highest mapping score for that SQL query.