KR102498403B1

KR102498403B1 - Apparatus and method for gathering a training set for a natural language to SQL system

Info

Publication number: KR102498403B1
Application number: KR1020210013642A
Authority: KR
Inventors: 한욱신; 강혁규; 김현지; 나인혁
Original assignee: 포항공과대학교 산학협력단
Priority date: 2021-01-29
Filing date: 2021-01-29
Publication date: 2023-02-09
Also published as: KR20220109978A

Abstract

The present invention relates to collecting and analyzing a training set for an NL2SQL system that converts natural language into SQL. According to the invention, table excerpts are generated by sampling from a target database, a user's natural language question about a generated table excerpt is translated into an SQL query and executed, and the result is visualized. This allows even users who are not SQL experts to collect a high-quality training set for an NL2SQL system, which saves cost while letting developers easily analyze the causes of errors.

Description

Training set collection device and method for a system that converts natural language to SQL

The present invention relates to a database management system, and more particularly to a system for converting natural language into SQL.

Given a relational database and a natural language question, natural language to SQL translation (NL2SQL) is the task of finding the SQL statement that corresponds to the question. Recently, many methods that apply deep learning to NL2SQL have been developed.

The main task in developing an NL2SQL system is designing a good encoder and decoder, but improving the quality of an NL2SQL system requires two further things: first, a large amount of high-quality labeled training data, and second, a tool for finding incorrect translations. This is because the accuracy of the system depends heavily on the training set.

Crowdsourcing is used to obtain large amounts of training data, but compared with other machine learning problems it is expensive for NL2SQL systems, because workers must be familiar with SQL queries and with the given database schema.

In addition, although representative NL2SQL models use an encoder-decoder architecture, it is difficult to determine what role each input feature plays in the output.

The inventors of the present invention have carried out research to solve these problems of data collection and analysis in prior-art NL2SQL systems. After much effort toward a natural language interface system for databases that saves cost by visualizing data so that workers can easily explore and collect it, and by allowing mistranslated queries to be analyzed, they arrived at the present invention.

It is an object of the present invention to provide an apparatus and a method for obtaining a high-quality training set at low cost and for analyzing mistranslated queries.

Other objects of the present invention that are not explicitly stated will additionally be considered to the extent that they can be readily inferred from the following detailed description and its effects.

A training set collection device for an NL2SQL system according to the present invention includes:

a table generation unit that generates table excerpts, each consisting of a part of a given database; a search unit that translates a user's natural language question posed about a table excerpt into an SQL query using an NL2SQL (Natural Language to SQL) system and displays the result of executing the SQL query; and an analysis unit that analyzes a training pair, that is, a pair of a natural language question and an SQL query, which the user marked as incorrect after the SQL query was executed, and performs model interpretation, error highlighting, or system suggestion.

The table generation unit performs relation sampling, attribute sampling, and tuple sampling on the given database to generate the table excerpt.

The device further includes: a history database for storing the results of executing the SQL query in the search unit; and a training pair database for storing the pairs of natural language question and SQL query that were marked as correct among those results.

The table generation unit generates a new table excerpt by perturbing the table excerpt, and the search unit verifies the SQL query against the new table excerpt.

The analysis unit performs the model interpretation by analyzing the incorrect training pair with the integrated gradients method and the conductance method.

The analysis unit performs the error highlighting by normalizing the SQL query of the incorrect training pair and a given correct query using query rewriting, comparing the syntactic structures of the two normalized queries, and visualizing the incorrect part of the SQL query of the incorrect training pair.

The analysis unit translates the user's natural language question using a plurality of NL2SQL systems other than the NL2SQL system, and recommends an NL2SQL system whose translated SQL query matches a given correct SQL query.

A training set collection method according to another embodiment of the present invention includes:

(a) generating table excerpts, each consisting of a part of a given database; (b) translating a user's natural language question posed about a table excerpt into an SQL query using an NL2SQL (Natural Language to SQL) system, and displaying the result of executing the SQL query; and (c) analyzing a training pair, that is, a pair of a natural language question and an SQL query, which the user marked as incorrect after the SQL query was executed, and performing model interpretation, error highlighting, or system suggestion.

Step (a) performs relation sampling, attribute sampling, and tuple sampling on the given database to generate the table excerpt.

The method further includes, after step (b): storing the result of executing the SQL query in a history database; and storing, in a training pair database, the pairs of natural language question and SQL query that were marked as correct among those results.

Step (a) generates a new table excerpt by perturbing the table excerpt, and step (b) verifies the SQL query against the new table excerpt.

Step (c) performs the model interpretation by analyzing the incorrect training pair with the integrated gradients method and the conductance method.

Step (c) performs the error highlighting by normalizing the SQL query of the incorrect training pair and a given correct query using query rewriting, comparing the syntactic structures of the two normalized queries, and visualizing the incorrect part of the SQL query of the incorrect training pair.

Step (c) translates the user's natural language question using a plurality of NL2SQL systems other than the NL2SQL system, and recommends an NL2SQL system whose translated SQL query matches a given correct SQL query.

According to the present invention, a high-quality training set for training a system that converts natural language into SQL can be collected.

In addition, by providing visualizations for analyzing the training data, the invention lets workers collect training data and correct faulty data more easily.

Even where an effect is not explicitly mentioned here, the effects described in the following specification that are expected from the technical features of the present invention, and their potential effects, are to be treated as if described in this specification.

FIG. 1 is a schematic structural diagram of a training set collection device according to a preferred embodiment of the present invention.
FIGS. 2 and 3 show an example execution of a training set collection device according to a preferred embodiment of the present invention.
FIG. 4 is a schematic flowchart of a training set collection method according to another preferred embodiment of the present invention.
※ The accompanying drawings are provided by way of reference to aid understanding of the technical idea of the present invention, and the scope of the present invention is not limited by them.

Hereinafter, the configuration of the present invention as taught by its various embodiments, and the effects resulting from that configuration, are described with reference to the drawings. In describing the present invention, detailed descriptions of related well-known functions are omitted where they are obvious to those skilled in the art and could unnecessarily obscure the gist of the invention.

Terms such as 'first' and 'second' may be used to describe various elements, but the elements are not limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a 'first element' may be named a 'second element', and similarly a 'second element' may be named a 'first element'. Singular expressions include plural expressions unless the context clearly indicates otherwise. Unless otherwise defined, terms used in the embodiments of the present invention are to be interpreted with the meanings commonly known to those of ordinary skill in the art.

FIG. 1 is a schematic structural diagram of a training set collection device according to a preferred embodiment of the present invention.

The training set collection device 1 for an NL2SQL system according to the present invention includes a table generation unit 10, a search unit 20, a collection unit 30, and an analysis unit 40. It may further include a training pair database 50 and a history database 60.

The table generation unit 10 generates table excerpts by sampling from a given database D.

To generate small, intuitive table excerpts, 1) relation sampling, 2) attribute sampling, and 3) tuple sampling are performed on database D.

Relation and attribute sampling use the accumulated SQL queries in order to compute the relevance between different relations and attributes. The more frequently two schema entries (that is, relations or attributes) appear together, the more closely related they are judged to be.

To this end, a connected subgraph is extracted from a schema graph in which each node is a relation or an attribute and each edge represents either a joinable relationship between two relations or the affiliation of an attribute with a relation. Specifically, a relation is first selected at random with probability proportional to its frequency in the SQL query log. New relations that are joinable with the previously selected relations are then selected at random, repeatedly, with a selection probability proportional to their co-occurrence frequency with the selected relations.

After the relations are selected, attributes are selected in a similar way, taking into account co-occurrence frequency with the selected relations.
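The relation sampling described above can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the function name `sample_relations`, the representation of the query log as a list of relation sets, the `joinable` adjacency dictionary, and the choice to sum co-occurrence counts over the already selected relations are all assumptions made for the example.

```python
import random
from collections import Counter
from itertools import combinations


def sample_relations(query_log, joinable, k, rng=None):
    """Pick up to k relations that form a connected subgraph of the schema graph.

    query_log: list of sets, each holding the relations referenced by one logged query
    joinable:  dict mapping a relation to the set of relations it can be joined with
    """
    rng = rng or random.Random()
    freq = Counter(r for q in query_log for r in q)
    cooc = Counter()
    for q in query_log:
        for a, b in combinations(sorted(q), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1

    # First relation: probability proportional to its frequency in the log.
    names = sorted(freq)
    chosen = [rng.choices(names, weights=[freq[n] for n in names])[0]]

    while len(chosen) < k:
        # Candidates: unchosen relations joinable with an already chosen one.
        cands = sorted({c for r in chosen for c in joinable.get(r, ())} - set(chosen))
        # Weight: co-occurrence with the chosen relations (summed; an assumption).
        weights = [sum(cooc[(c, r)] for r in chosen) for c in cands]
        if not cands or sum(weights) == 0:
            break
        chosen.append(rng.choices(cands, weights=weights)[0])
    return chosen
```

Attribute selection could reuse the same weighted draw, with attributes of the chosen relations as candidates.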

Once sampling is complete, the table excerpt is generated from the result by 1) performing outer joins among all the selected relations, 2) projecting onto all the selected attributes to obtain a table t_D, and 3) sampling some tuples from t_D.
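A minimal sketch of these three steps in plain Python follows. It assumes that rows are dictionaries with qualified column names such as `movie.mid`, that the outer join is a full outer join on a single key column, and that the relations are Movie, Cast, and Actor; the helper names `full_outer_join` and `make_excerpt` are hypothetical.

```python
import random


def full_outer_join(left, right, left_key, right_key):
    """Full outer join of two lists of dict rows; unmatched sides are padded with None."""
    left_cols = list(left[0]) if left else []
    right_cols = list(right[0]) if right else []
    out, matched_right = [], set()
    for l in left:
        hits = [i for i, r in enumerate(right)
                if l[left_key] is not None and r[right_key] == l[left_key]]
        matched_right.update(hits)
        if hits:
            out.extend({**l, **right[i]} for i in hits)
        else:
            out.append({**l, **dict.fromkeys(right_cols)})
    for i, r in enumerate(right):
        if i not in matched_right:
            out.append({**dict.fromkeys(left_cols), **r})
    return out


def make_excerpt(movie, cast, actor, attributes, n_tuples, rng=None):
    rng = rng or random.Random()
    # 1) Outer joins among the selected relations.
    joined = full_outer_join(movie, cast, "movie.mid", "cast.mid")
    joined = full_outer_join(joined, actor, "cast.aid", "actor.aid")
    # 2) Projection onto the selected attributes gives t_D.
    t_d = [{a: row[a] for a in attributes} for row in joined]
    # 3) Tuple sampling from t_D.
    return rng.sample(t_d, min(n_tuples, len(t_d)))
```

Because the join is an outer join, movies without a cast entry and actors without a movie still appear in t_D, padded with None.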

Table 1 below shows an example of a table excerpt generated from IMDB (Internet Movie Database).

[Table 1]

First, relation sampling selects three relations (Movie, Cast, and Actor), where Cast represents a many-to-many relationship between movies and actors. Next, attribute sampling selects six attributes. The table t_D is then obtained by outer-joining the three relations and projecting onto the six attributes. Finally, sampling five tuples from t_D yields the table excerpt shown in Table 1.

For a more intuitive table excerpt, the excerpt should contain tuples that share the same partial key value. Considering the key {Movie.mid, Actor.aid} in Table 1, the first two tuples have the same partial key value, 2627, and therefore refer to the same movie. This lets workers pose meaningful groupings and sets of queries over the table excerpt.
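The text does not say how tuples sharing a partial key value are obtained. One possible approach, shown here purely as an assumption, is to seed the sample with two tuples from a group that shares the key value and fill the remainder at random. The function name `sample_with_shared_key` and the row layout are hypothetical.

```python
import random
from collections import defaultdict


def sample_with_shared_key(rows, key_attr, n, rng=None):
    """Sample n rows so that, when possible, at least two share the same key_attr value."""
    rng = rng or random.Random()
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        if row[key_attr] is not None:
            groups[row[key_attr]].append(i)
    shared = [idxs for idxs in groups.values() if len(idxs) >= 2]
    if not shared or n < 2:
        # No group to seed from: fall back to plain random sampling.
        return rng.sample(rows, min(n, len(rows)))
    seed = rng.sample(rng.choice(shared), 2)
    rest = [i for i in range(len(rows)) if i not in seed]
    picked = seed + rng.sample(rest, min(n - 2, len(rest)))
    return [rows[i] for i in sorted(picked)]
```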

The search unit 20 explores the database by translating the user's question into an SQL query. For this purpose, the user can select an NL2SQL system and a database.

FIG. 2 is an example of a screen that the search unit 20 displays to the user.

The search unit 20 first randomly selects a table excerpt 210 from the selected database and displays it.

When the search unit 20 receives a natural language question q_nl 220 about the table excerpt 210, it translates the question into an SQL query q_sql using the selected NL2SQL system and visualizes the execution result 230.

For example, if the user enters the natural language question 220 “Find the titles of the movies which "Brad Pitt" starred in”, it is translated into the SQL query “SELECT T1.title FROM movie AS T1 JOIN cast AS T3 ON T1.mid = T3.msid JOIN actor AS T2 ON T3.aid = T2.aid WHERE T2.name = "Brad Pitt"”, and the search unit 20 finds the titles of the matching movies in the table excerpt 210 and outputs them 230. On seeing the result, the user checks 240 whether it is correct.

In this process, depending on the table excerpt, an incorrect query may produce the same answer as the correct query. For example, for the table excerpt t in Table 1, the user's question “How many performers are there?” may be mistranslated into an SQL query such as “SELECT count(*) FROM movie AS T1”. Although this query returns the number of movies, the user would judge the answer to be correct and mark the translation as correct, because the number of actors and the number of movies in Table 1 are both 3.

To reduce such errors, the table generation unit 10 generates another table excerpt t' to verify the SQL query. The table excerpt t' can be generated by perturbing the table excerpt t, and it can be given to a different user for verification.
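The effect of perturbation can be illustrated with a small sqlite3 sketch. For simplicity it models the movies and actors visible in the excerpt as two separate tables rather than one joined excerpt, and the row values and the particular perturbation (dropping one movie) are hypothetical. The two queries are the ones from the example above.

```python
import sqlite3


def run(rows_movie, rows_actor, sql):
    """Execute sql on a throwaway in-memory database holding the given rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE movie (mid INTEGER, title TEXT)")
    con.execute("CREATE TABLE actor (aid INTEGER, name TEXT)")
    con.executemany("INSERT INTO movie VALUES (?, ?)", rows_movie)
    con.executemany("INSERT INTO actor VALUES (?, ?)", rows_actor)
    result = con.execute(sql).fetchall()
    con.close()
    return result


correct = "SELECT count(*) FROM actor AS T1"
wrong = "SELECT count(*) FROM movie AS T1"

# Excerpt t: three movies and three actors, so both queries return 3.
movies = [(1, "M1"), (2, "M2"), (3, "M3")]
actors = [(10, "A1"), (11, "A2"), (12, "A3")]
same_on_t = run(movies, actors, correct) == run(movies, actors, wrong)

# Perturbed excerpt t': one movie removed, so the counts now differ.
perturbed_movies = movies[:-1]
same_on_t_prime = (run(perturbed_movies, actors, correct)
                   == run(perturbed_movies, actors, wrong))
```

On t the wrong query is indistinguishable from the correct one, while on t' the answers diverge and the mistranslation becomes visible to the verifying user.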

The SQL queries obtained in this way are ultimately queries over the original database, not over the table excerpt. This makes it possible to collect not only simple queries over a single table excerpt but also complex queries that include joins.

In the search unit 20, because the table excerpt is generated by executing SQL queries that include 1) a join of the selected tables, 2) a projection onto the selected columns, and 3) sampling, an SQL query q1 over the table excerpt can be converted into an SQL query q2 over the original database.

The conversion is performed by 1) placing the join predicates in the FROM clause of q1 in place of the excerpt table, and 2) replacing the columns of q1 with the corresponding columns of the original database.

For example, for Table 1, the question “Who acted in movie 120?” is translated by the NL2SQL system into the SQL query “SELECT actor_name FROM table_excerpt WHERE movie_title='120'”, which can in turn be converted into “SELECT actor.name FROM actor JOIN cast ON actor.aid = cast.aid JOIN movie ON cast.mid = movie.mid WHERE movie.title = '120'”.
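The two substitution steps can be sketched with plain string rewriting. This is a simplified illustration, not a SQL parser: it assumes the excerpt table name and column names occur as whole words and never inside string literals. The function name `rewrite_to_source` and the `column_map` argument are hypothetical.

```python
import re


def rewrite_to_source(q1, excerpt_name, join_clause, column_map):
    """Convert a query over the excerpt (q1) into a query over the original database (q2)."""
    # 1) Put the join expression in the FROM clause in place of the excerpt table.
    q2 = re.sub(rf"\b{re.escape(excerpt_name)}\b", lambda _m: join_clause, q1)
    # 2) Replace excerpt columns with the corresponding source columns.
    for excerpt_col, source_col in column_map.items():
        q2 = re.sub(rf"\b{re.escape(excerpt_col)}\b",
                    lambda _m, s=source_col: s, q2)
    return q2


q1 = "SELECT actor_name FROM table_excerpt WHERE movie_title='120'"
q2 = rewrite_to_source(
    q1,
    "table_excerpt",
    "actor JOIN cast ON actor.aid = cast.aid JOIN movie ON cast.mid = movie.mid",
    {"actor_name": "actor.name", "movie_title": "movie.title"},
)
```

A production version would operate on a parsed query tree so that aliases, nested queries, and literals are handled correctly.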

That is, the former query is shown to the user, while the latter, converted query is collected by the search unit 20. Both the translation over the table excerpt and the translation over the original database can of course be analyzed.

The collection unit 30 collects training pairs. Training pair collection runs as a background task while the user performs other work, so it is not visible to the user.

Training pairs that the user marks as correct in the search unit 20 are stored in the training pair database 50. Training pairs that the user marks as incorrect are stored in the history database 60 and are also passed to the analysis unit 40.

The collection unit 30 stores every triple (q_nl, q_sql, feedback), consisting of a natural language question, an SQL query, and the user's feedback, in the history database 60.
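The routing of feedback into the two stores might look like the following sketch. The table layouts, the textual feedback labels, and the function names `open_stores` and `record` are assumptions; the text only states which store receives which records.

```python
import sqlite3


def open_stores():
    """Create in-memory stand-ins for the history and training pair databases."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE history (q_nl TEXT, q_sql TEXT, feedback TEXT)")
    con.execute("CREATE TABLE training_pair (q_nl TEXT, q_sql TEXT)")
    return con


def record(con, q_nl, q_sql, is_correct):
    """Store one feedback triple; return the pair to analyze if it was marked incorrect."""
    feedback = "correct" if is_correct else "incorrect"
    # Every triple goes to the history store.
    con.execute("INSERT INTO history VALUES (?, ?, ?)", (q_nl, q_sql, feedback))
    if is_correct:
        # Only pairs marked correct become training pairs.
        con.execute("INSERT INTO training_pair VALUES (?, ?)", (q_nl, q_sql))
        return None
    return (q_nl, q_sql)  # handed on to the analysis step
```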

The analysis unit 40 lets the user select a question to analyze and receives the correct SQL query for the selected question. It then performs model interpretation, error highlighting, and system suggestion.

FIG. 3 is an example of a screen that the analysis unit 40 displays to the user.

For model interpretation, the analysis unit 40 uses the integrated gradients method, which evaluates how much each word w in the natural language question or the schema contributes to the model output.

The analysis unit 40 computes the contribution of a word w to each element of the output sequence generated by the decoder; the overall contribution of w is obtained by summing these contributions. Each word is visualized as text colored according to its contribution. For example, green may indicate a positive contribution and red a negative one.
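To make the integrated gradients idea concrete, the following sketch applies it to a toy model, a sigmoid of a weighted sum, instead of an NL2SQL encoder-decoder. Each input feature stands in for one word, the baseline is the all-zero input, and the path integral is approximated with a midpoint Riemann sum. The model, weights, and function names are assumptions chosen only for illustration.

```python
import math


def model(x, w):
    """Toy differentiable model: sigmoid of a weighted sum."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))


def grad(x, w):
    """Analytic gradient of the toy model with respect to its inputs."""
    y = model(x, w)
    return [y * (1.0 - y) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Attribution_i = (x_i - b_i) * average gradient along the straight path from b to x."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point, w)
        for i in range(len(x)):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]


x = [1.0, 0.5, -2.0]
w = [0.8, -0.3, 0.5]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w)
```

The attributions sum, up to discretization error, to the difference between the model output at the input and at the baseline. Positive and negative values correspond to the green and red coloring mentioned above.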

The analysis unit 40 may also use the conductance method to visualize the importance of the model's internal hidden units. The analysis unit 40 evaluates the conductance of each hidden state. Because there is one hidden state for each combination of sequence element and layer (for example, a sequence length of 5 and 3 layers give 5 × 3 = 15 hidden states), the conductance is visualized as a heat map whose x-axis is the elements of the sequence and whose y-axis is the layers.
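The heat map layout can be sketched as follows. The code does not compute conductance; it only arranges already computed per-state values into a layers-by-sequence grid and renders it as text. The layer-major ordering of the input list, the shading characters, and the function names are assumptions.

```python
def to_grid(conductances, seq_len, n_layers):
    """Arrange a flat, layer-major list of values into n_layers rows of seq_len columns."""
    if len(conductances) != seq_len * n_layers:
        raise ValueError("expected seq_len * n_layers values")
    return [conductances[l * seq_len:(l + 1) * seq_len] for l in range(n_layers)]


def render(grid, shades=" .:*#"):
    """Render the grid as text: x-axis is sequence position, y-axis is layer (top layer first)."""
    lo = min(min(row) for row in grid)
    hi = max(max(row) for row in grid)
    span = (hi - lo) or 1.0
    lines = []
    for row in reversed(grid):
        cells = [shades[min(len(shades) - 1, int((v - lo) / span * len(shades)))]
                 for v in row]
        lines.append("".join(cells))
    return "\n".join(lines)
```

With a sequence length of 5 and 3 layers, `to_grid` yields the 3 × 5 arrangement of the 15 hidden states described above.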

Because these methods make the model's behavior understandable, analyzing incorrect translations can improve the model and make troubleshooting easier.

In FIG. 3, the user can easily locate errors through the red lines 330 displayed on the scroll bar. Because all errors are stored in the history database 60, they can be loaded to find where the errors occurred.

To verify an error, the user enters the correct query 310. For “How many performers are there?”, the correct query 310 would be “SELECT count(*) FROM actor AS T1”.

Suppose the user asks for “the number of all performers”. The word “performers” in the natural language question corresponds to “actor”. If the contribution 320 of the word “performers” computed by the analysis unit 40 is too low, this indicates that the model does not sufficiently understand the word. In the example of FIG. 3, the contribution 320 of a word is expressed by color intensity. The model could then be revised by replacing its encoder with a pre-trained language model.

Next, the analysis unit 40 performs error highlighting.

The analysis unit 40 compares the selected SQL query with the given correct query and visualizes which part of the selected SQL query is wrong. To do so, it normalizes the two queries using query rewriting and compares the syntactic structures of the two normalized queries.
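A rough token-level approximation of error highlighting is sketched below. It substitutes a crude normalization (lowercasing and tokenizing) for real query rewriting and a token diff from Python's `difflib` for a comparison of syntactic structures, so it is a stand-in under those assumptions. The `[[...]]` marker and the function names are hypothetical.

```python
import difflib
import re


def normalize(sql):
    """Crude stand-in for query rewriting: tokenize and lowercase everything except string literals."""
    tokens = re.findall(r"'[^']*'|[A-Za-z_][\w.]*|\d+|[^\s\w]", sql)
    return [t if t.startswith("'") else t.lower() for t in tokens]


def highlight_errors(predicted, gold):
    """Return the normalized predicted query with differing spans wrapped in [[...]]."""
    p, g = normalize(predicted), normalize(gold)
    matcher = difflib.SequenceMatcher(a=p, b=g, autojunk=False)
    out = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        segment = " ".join(p[i1:i2])
        if tag == "equal":
            out.append(segment)
        elif tag in ("replace", "delete"):
            out.append(f"[[{segment}]]")
        else:  # "insert": the gold query has tokens missing from the prediction
            out.append("[[]]")
    return " ".join(out)


marked = highlight_errors("SELECT count(*) FROM movie AS T1",
                          "select count(*) from actor as t1")
```

In this example only the table name differs, so only `movie` is marked.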

Finally, the analysis unit 40 translates the natural language question using the other NL2SQL systems that the user did not select, and recommends an NL2SQL system that produced a correct translation. If no system produces a correct translation, the analysis unit 40 computes a similarity score between each system's translated SQL query and the ground truth and recommends the most accurate NL2SQL system. To compute the similarity score between two SQL queries, the analysis unit 40 normalizes both queries.
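The recommendation logic can be sketched as follows. The text does not specify the similarity score, so this example assumes a token-level `difflib.SequenceMatcher` ratio over lowercased queries, and it treats token identity after that normalization as a correct translation. The system names in the usage are placeholders.

```python
import difflib


def tokens(sql):
    """Very light normalization: lowercase and split on whitespace."""
    return sql.lower().split()


def similarity(sql_a, sql_b):
    """Token-level similarity in [0, 1]; an assumed scoring function."""
    matcher = difflib.SequenceMatcher(a=tokens(sql_a), b=tokens(sql_b), autojunk=False)
    return matcher.ratio()


def recommend(translations, gold):
    """translations: dict mapping a system name to its translated SQL query."""
    # Prefer a system whose translation matches the ground truth.
    for name, sql in translations.items():
        if tokens(sql) == tokens(gold):
            return name
    # Otherwise fall back to the most similar translation.
    return max(translations, key=lambda name: similarity(translations[name], gold))


gold = "SELECT count(*) FROM actor AS T1"
best = recommend(
    {"system_a": "SELECT count(*) FROM movie AS T1",
     "system_b": "select count(*) from actor as T1"},
    gold,
)
```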

To determine whether a translation is correct, the analysis unit 40 uses a multi-stage validation tool, which determines the semantic equivalence of the translated query and the given correct SQL query.

The analysis unit 40 first executes the two queries on the given database instance and on database instances obtained by database testing techniques, and compares the results. If the results differ, the two queries are determined not to be equivalent. If the results are the same, a prover is used to check the semantic equivalence of the two SQL queries. If the prover fails to decide equivalence, the analysis unit 40 can rewrite the two queries by query rewriting and compare their syntactic structures. An RDBMS (Relational Database Management System) or the like can be used for query rewriting.
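The staged check might be organized as in the sketch below. Several points are assumptions: results are compared as multisets (ignoring row order), the prover is an optional callable returning True, False, or None when undecided, the final stage uses a whitespace and case normalization in place of RDBMS-based rewriting, and the outcome "undetermined" for a failed final stage is not taken from the text. The helper `make_instance` exists only to build test databases.

```python
import sqlite3
from collections import Counter


def make_instance(n_movies, n_actors):
    """Build a small in-memory database instance for the example."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE movie (mid INTEGER)")
    con.execute("CREATE TABLE actor (aid INTEGER)")
    con.executemany("INSERT INTO movie VALUES (?)", [(i,) for i in range(n_movies)])
    con.executemany("INSERT INTO actor VALUES (?)", [(i,) for i in range(n_actors)])
    return con


def results_match(con, q1, q2):
    """Compare the two result sets as multisets of rows."""
    return Counter(con.execute(q1).fetchall()) == Counter(con.execute(q2).fetchall())


def validate(instances, translated, gold, prover=None, normalize=None):
    # Stage 1: execute both queries on every database instance.
    for con in instances:
        if not results_match(con, translated, gold):
            return "not equivalent"
    # Stage 2: ask a prover, if one is available.
    if prover is not None:
        verdict = prover(translated, gold)
        if verdict is not None:
            return "equivalent" if verdict else "not equivalent"
    # Stage 3: compare the queries after normalization.
    norm = normalize or (lambda q: " ".join(q.lower().split()))
    return "equivalent" if norm(translated) == norm(gold) else "undetermined"
```

Adding instances on which the two queries are more likely to disagree makes the first stage more discriminating.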

FIG. 4 is a schematic flowchart of a training set collection method according to another preferred embodiment of the present invention.

The training set collection method of the present invention may be performed by a training set collection device that includes one or more processors and a control unit.

To collect a training set, table excerpts are first generated by sampling from a given database (S10).

To generate a table excerpt, 1) relation sampling, 2) attribute sampling, and 3) tuple sampling can be performed on the database. The detailed method of generating the table is as described above.

Once the table excerpt has been generated, the user performs the table exploration step (S20).

In the table exploration step, the database is explored by translating the user's natural language question into an SQL query. For this purpose, the user can select an NL2SQL system and a database.

To reduce translation errors in the SQL queries, the table generation step (S10) and the exploration step (S20) can be repeated. In that case, a new table excerpt can be generated by perturbing the previous one, and the user can verify the query again.

After the exploration is finished, the training pairs, each a pair of a natural language question and an SQL query, are stored in a database (S30).

Training pairs that the user marks as correct are stored in the training pair database. Training pairs that the user marks as incorrect are stored in the history database, and the causes of their errors are analyzed.
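The routing of pairs into the two stores can be sketched as follows. The table names, column layout, and the use of a single SQLite connection for both stores are assumptions made only for this illustration.

```python
import sqlite3

def open_store():
    # Two tables stand in for the training pair database and the
    # history database (hypothetical layout).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE training_pairs (question TEXT, sql TEXT)")
    conn.execute("CREATE TABLE history (question TEXT, sql TEXT)")
    return conn

def store_pair(conn, question, sql, is_correct):
    # Pairs marked correct become training data; pairs marked incorrect
    # are kept in the history for later error analysis.
    table = "training_pairs" if is_correct else "history"
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (question, sql))

conn = open_store()
store_pair(conn, "Who earns more than 100?",
           "SELECT name FROM emp WHERE salary > 100", True)
store_pair(conn, "How many departments are there?",
           "SELECT COUNT(*) FROM emp", False)
```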

In the analysis step, model interpretation is performed (S40).

When the user selects a training pair marked as incorrect, the correct SQL query for that pair is received. For model interpretation, the integrated gradients method and the conductance method are used to analyze how much each word of the natural language question contributes to the output, as well as the hidden states inside the model.
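As a rough illustration of the integrated gradients idea, the sketch below attributes the output of a small differentiable function to its inputs by averaging gradients along a straight path from a baseline to the input. The toy function `f`, its hand-written gradient `grad_f`, and the all-zero baseline are assumptions chosen for the example; a real NL2SQL model would obtain gradients through automatic differentiation, and the conductance method is not shown.

```python
def integrated_gradients(grad_f, x, baseline, steps=100):
    """Approximate IG_i = (x_i - b_i) * mean of df/dx_i along the path
    from baseline b to input x, using the midpoint rule."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

# Toy stand-in for a model output as a function of three input features.
def f(v):
    return v[0] * v[1] + 2.0 * v[2]

def grad_f(v):
    return [v[1], v[0], 2.0]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
# The attributions should sum to f(x) - f(baseline), up to numerical error.
print(attributions, sum(attributions), f(x) - f(baseline))
```

The final line checks the completeness property of integrated gradients: the per-input attributions add up to the change in output between the baseline and the input.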

Error highlighting may also be performed in the analysis step (S50).

In the error highlighting step, the selected SQL query is compared with the given correct query. The syntactic structures of the two queries are compared, and the parts of the selected SQL query that are wrong are visualized.
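A simplified version of this comparison is sketched below. It substitutes a token-level diff for the structural comparison described here: both queries are crudely normalized (uppercased and tokenized by a regular expression) and then aligned with Python's `difflib`. The tokenizer and the function name `highlight_errors` are assumptions of the sketch, and uppercasing would also alter string literals, which a real normalizer would need to preserve.

```python
import difflib
import re

def tokenize(sql):
    # Crude normalization: uppercase, then split into identifiers,
    # numbers, comparison operators, and single punctuation characters.
    pattern = r"[A-Za-z_][A-Za-z_0-9.]*|\d+|[<>=!]+|[^\sA-Za-z_0-9]"
    return re.findall(pattern, sql.upper())

def highlight_errors(predicted, correct):
    """Return (tag, predicted tokens, correct tokens) for each differing span."""
    a, b = tokenize(predicted), tokenize(correct)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

print(highlight_errors(
    "select name from emp where salary > 100",
    "SELECT name FROM emp WHERE salary >= 100",
))
```

Each returned span identifies the tokens of the translated query that would be highlighted, together with the tokens of the correct query they should have been.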

Finally, a better system may be suggested based on the analysis results (S60).

The natural language question is translated using other NL2SQL systems that the user did not select, and an NL2SQL system that produces the correct translation is recommended. If no system produces a correct translation, a similarity score between each system's translated SQL query and the ground truth is computed, and the most accurate NL2SQL system is recommended.
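The recommendation logic can be sketched as follows. The similarity score is not defined in this section, so the sketch assumes a character-level `difflib.SequenceMatcher` ratio over whitespace- and case-normalized SQL text as a stand-in; the system names and the function name `recommend` are hypothetical.

```python
import difflib

def normalize(sql):
    # Collapse whitespace and ignore case (a simplifying assumption).
    return " ".join(sql.upper().split())

def recommend(translations, ground_truth):
    """translations maps a system name to its translated SQL query."""
    gt = normalize(ground_truth)
    # Prefer a system whose translation matches the ground truth exactly.
    for name, sql in translations.items():
        if normalize(sql) == gt:
            return name
    # Otherwise fall back to the highest similarity score.
    def score(name):
        return difflib.SequenceMatcher(None, normalize(translations[name]), gt).ratio()
    return max(translations, key=score)

ground_truth = "SELECT name FROM emp WHERE salary > 100"
translations = {
    "system_a": "SELECT name FROM emp WHERE salary > 10",
    "system_b": "SELECT dept FROM dept",
}
print(recommend(translations, ground_truth))
```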

With the training set collection device and method for an NL2SQL system according to the present invention as described above, high-quality training data for NL2SQL can be collected even by users who are not SQL experts, and developers can easily analyze incorrect SQL translations.

The scope of protection of the present invention is not limited to the description and wording of the embodiments explicitly set out above. It is further noted that the scope of protection of the present invention cannot be limited by modifications or substitutions that are obvious in the technical field to which the present invention belongs.

Claims

A training set collection device for an NL2SQL system, comprising:
a table creation unit that performs sampling on a given database and generates an excerpt table (table excerpt) composed of a part of the database;
a search unit that translates a user's natural language question, posed with respect to the excerpt table, into an SQL query using a Natural Language to SQL (NL2SQL) system and displays a result of executing the SQL query; and
an analysis unit that analyzes training pairs, each a pair of a natural language question and an SQL query, that the user has marked as incorrect based on the result of executing the SQL query, and that performs model interpretation, error highlighting, and suggestion of an NL2SQL system other than said NL2SQL system.

The device according to claim 1,
wherein the table creation unit performs relation sampling, attribute sampling, and tuple sampling on the given database to generate the excerpt table.

The device according to claim 1, further comprising:
a history database that stores the results of executing the SQL query in the search unit; and a training pair database that stores, among the results of executing the SQL query in the search unit, the pairs of natural language questions and SQL queries marked as correct.

The device according to claim 1,
wherein the table creation unit generates a new excerpt table by perturbing the excerpt table, and
wherein the search unit verifies the SQL query against the new excerpt table.

The device according to claim 1,
wherein the analysis unit performs the model interpretation by analyzing the incorrect training pair using an integrated gradients method and a conductance method.

The device according to claim 1,
wherein the analysis unit performs the error highlighting by normalizing the SQL query of the incorrect training pair and the given correct query using query rewriting, comparing the syntactic structures of the two normalized queries, and visualizing the incorrect part of the SQL query of the incorrect training pair.

The device according to claim 1,
wherein the analysis unit translates the user's natural language question using a plurality of NL2SQL systems other than said NL2SQL system, and suggests the other NL2SQL system whose translated SQL query matches the given correct SQL query.

A training set collection method for an NL2SQL system, performed by a training set collection device comprising one or more processors, the method comprising:
(a) performing sampling on a given database to generate an excerpt table (table excerpt) composed of a part of the database;
(b) translating a user's natural language question, posed with respect to the excerpt table, into an SQL query using a Natural Language to SQL (NL2SQL) system, and displaying a result of executing the SQL query; and
(c) analyzing training pairs, each a pair of a natural language question and an SQL query, that the user has marked as incorrect based on the result of executing the SQL query, and performing model interpretation, error highlighting, and suggestion of an NL2SQL system other than said NL2SQL system.

The method according to claim 8,
wherein step (a) performs relation sampling, attribute sampling, and tuple sampling on the given database to generate the excerpt table.

The method according to claim 8, further comprising,
after step (b), storing the results of executing the SQL query in a history database; and storing, among the results of executing the SQL query, the pairs of natural language questions and SQL queries marked as correct in a training pair database.

The method according to claim 8,
wherein step (a) generates a new excerpt table by perturbing the excerpt table, and
wherein step (b) verifies the SQL query against the new excerpt table.

The method according to claim 8,
wherein step (c) performs the model interpretation by analyzing the incorrect training pair using an integrated gradients method and a conductance method.

The method according to claim 8,
wherein step (c) performs the error highlighting by normalizing the SQL query of the incorrect training pair and the given correct query using query rewriting, comparing the syntactic structures of the two normalized queries, and visualizing the incorrect part of the SQL query of the incorrect training pair.

The method according to claim 8,
wherein step (c) translates the user's natural language question using a plurality of NL2SQL systems other than said NL2SQL system, and suggests the other NL2SQL system whose translated SQL query matches the given correct SQL query.