CN115203236B - Text-to-SQL generation method based on template retrieval - Google Patents


Info

Publication number
CN115203236B
Authority
CN
China
Prior art keywords
sql
template
column
database
text
Prior art date
Legal status
Active
Application number
CN202210836518.3A
Other languages
Chinese (zh)
Other versions
CN115203236A (en
Inventor
Che Wanxiang
Dou Longxu
Pan Mingyang
Zhao Yanyan
Liu Ting
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202210836518.3A
Publication of CN115203236A
Application granted
Publication of CN115203236B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text-to-SQL generation method based on template retrieval, relating to the technical field of data processing, addresses the slow decoding of long SQL statements in the prior art. While the parallelism of a non-autoregressive model improves time performance, such a model cannot observe context information of the target sequence during the generation stage. The method compensates for this shortcoming of the non-autoregressive model through template retrieval and repeated iterative generation; for structurally complex, long SQL statements, the decoding speed of this scheme is more than 50% higher than that of traditional methods. The template library of this scheme is extensible, easy to migrate, and fast at generation.

Description

Text-to-SQL generation method based on template retrieval
Technical Field
The invention relates to the technical field of data processing, in particular to a text-to-SQL generation method based on template retrieval.
Background
The text-to-SQL generation task is an important direction in semantic parsing. Its main content is as follows: given a database or table, the system generates an SQL statement consistent with the semantics of the user's description (or question), and then obtains the query result from the database or table. Most research on the text-to-SQL generation task follows an end-to-end generation paradigm, generally falling into the following categories: SQL generation based on fixed templates (e.g., SQLova, M-SQL), SQL generation based on grammars and transition systems (e.g., RATSQL), and SQL generation based on pre-trained models and constrained decoding (e.g., PICARD). However, while existing model architectures achieve relatively good time efficiency and decoding performance for structurally simple SQL statements, they decode slowly for structurally complex, long SQL statements.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problem of slow decoding of long SQL statements in the prior art, a text-to-SQL generation method based on template retrieval is provided.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the text-to-SQL generation method based on template retrieval comprises the following steps:
step one: acquiring a data set, wherein the data set comprises user questions, databases and SQL statements, and then semantically parsing the SQL statements in the data set to construct an SQL template library;
step two: acquiring the structure of the database, and retrieving from the SQL template library the SQL template most relevant to the user question, according to the user question and the structure of the database;
step three: concatenating the user question, the structure of the database and the SQL template most relevant to the user question to obtain a token sequence, and inputting the token sequence into a pre-trained language model for encoding to obtain an encoding vector for each token;
step four: selecting the encoding vector of the first token, and predicting through a feed-forward neural network the SQL sequence length consistent with the user question semantics;
step five: based on the SQL sequence length, decoding the encoding vectors with a non-autoregressive Transformer to obtain an SQL statement consistent with the user question semantics. A high-level sketch of the whole pipeline follows.
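Read as a whole, the five steps amount to the pipeline sketched below in Python. Every name in the sketch (serialize, retrieve_template, encoder, length_head, iterative_decode, decoder, tokenize) is an assumption introduced for exposition, not an identifier from the patent; the detailed description below sketches each piece.

```python
def text_to_sql(question, schema, template_library):
    # steps two/three input: concatenated question + database structure
    query = serialize(schema.tables, schema.columns, tokenize(question))
    template = retrieve_template(query, template_library)    # step two
    enc = encoder(query + " <TEMPLATE> " + template)         # step three
    length = length_head(enc[0]).argmax()                    # step four: FFN on first token
    return iterative_decode(decoder, enc, length)            # step five
```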
Further, the specific steps of step one are as follows:
replacing the tables, columns, values and ordering modes appearing in the SQL with specific marks, and deleting the on clauses in the SQL to obtain an SQL template, then removing duplicate SQL templates until all SQL statements are processed, obtaining the SQL template library;
replacing the tables, columns, values and ordering modes appearing in the SQL with specific marks is specifically:
replacing table names with [TAB];
replacing column names with [COL];
replacing limit clause values with [NUM] and other values with [VAL];
replacing the ordering mode with [ORD], wherein the ordering mode includes ascending and descending order, namely ASC and DESC.
Further, the retrieval is performed through a template retrieval model, which is based on a two-tower model and is obtained by introducing a loss function for optimization;
the input of the two-tower model comprises a template part and a query part;
the template part is a template in the SQL template library;
the query part is formed by concatenating the user question and the database structure, yielding a query sequence S;
the specific processing steps of the two-tower model are as follows:
the query sequence S and a template from the SQL template library are respectively fed into two independent pre-trained language models for encoding; the encoded results are passed through multi-layer feed-forward neural networks to obtain the encoding of the query and the encoding of the template, respectively; the cosine similarity between the two encodings is then computed, and the template in the SQL template library with the maximum cosine similarity is selected as the SQL template most relevant to the user question.
Further, the query sequence S is expressed as:
S = <TABLE> t_1 | t_2 | … | t_N | <COLUMN> c_1 | c_2 | … | c_M | <QUESTION> q_1 … q_n
wherein t_1 ~ t_N are the table names in the database, c_1 ~ c_M are the column names in the database, and q_1 ~ q_n are the tokens in the question; <TABLE>, <COLUMN> and <QUESTION> are special symbols identifying the table-name, column-name and question segments; N is the number of tables in the database, M is the number of columns in the database, and n is the number of tokens in the question;
the loss function is expressed as:
L = -log p(T+ | S) - Σ_{T-} log(1 - p(T- | S))
wherein S represents the query sequence, T+ and T- represent positive and negative templates, and p represents a conditional probability.
Furthermore, the input of the pre-trained language model in step three is obtained by concatenating the SQL template most relevant to the user question after the query sequence S;
the pre-trained language model in step three is obtained by adding new types of encodings on the basis of the original position encodings of the pre-trained language model;
the new types of encodings include:
table and column position encoding: each table name and column name in the input sequence corresponds to a separate code, numbered from 1, with all other positions marked 0;
table/column identification encoding: table names are represented by 1, column names by 2, and others by 0;
column type encoding: 0 represents other; 1 to 5 represent integer, string, floating-point, date and Boolean types respectively;
database matching encoding: the tables and columns in the database are matched against the tokens in the user question by string matching, wherein a complete match is marked 1, a partial match is marked 2, and other cases are marked 0.
Furthermore, the feed-forward neural network in step four is obtained through training; a cross-entropy loss function is used for optimization during training, and this cross-entropy loss is added to the overall model loss with a weight of 0.1.
In step five, decoding is performed through a segment-copying pointer network, and the SQL statement consistent with the user question semantics is expressed in the form of keywords plus range indexes;
the range index means that the table names, column names and condition values in the SQL statement are represented by the start and end position indexes of the corresponding segment in the input sequence.
Further, the non-autoregressive Transformer in step five is obtained by randomly initializing a Transformer, adding a pointer network, and training with a cross-entropy loss function.
Further, the specific steps of step five are as follows:
first, using <mask> symbols of the same length as the SQL sequence length obtained in step four as the input of the non-autoregressive Transformer, and computing self-attention with the encoding vector of each token to obtain the encoding vector of each <mask> symbol;
then iterating a preset number of times using the encoding vector of each <mask> symbol to generate the keywords and range indexes in the SQL statement;
finally, filling in the corresponding tables, columns and values according to the generated range indexes, and supplementing the missing on clauses to obtain the final SQL statement.
Further, the pre-trained language model is BERT, RoBERTa or Electra.
The beneficial effects of the invention are as follows:
while the parallelism of a non-autoregressive model improves time performance, such a model has the shortcoming that it cannot observe context information of the target sequence during the generation stage. The present method compensates for this shortcoming of the non-autoregressive model through template retrieval and repeated iterative generation; for structurally complex, long SQL statements, the decoding speed of this scheme is more than 50% higher than that of traditional methods. The template library of this scheme is extensible, easy to migrate, and fast at generation.
Drawings
FIG. 1 is an overall flow chart of the present application;
FIG. 2 is a diagram of the overall architecture of the model;
FIG. 3 is a diagram of the template retrieval model;
FIG. 4 is a schematic illustration of the template filling part;
FIG. 5 is a schematic diagram of multiple iterative decoding.
Detailed Description
The first embodiment is as follows: this embodiment is described with reference to FIG. 1, and discloses a text-to-SQL generation method based on template retrieval, comprising the following steps:
step one: acquiring a data set, wherein the data set comprises user questions, databases and SQL statements, and then semantically parsing the SQL statements in the data set to construct an SQL template library;
step two: acquiring the structure of the database, and retrieving from the SQL template library the SQL template most relevant to the user question, according to the user question and the structure of the database;
step three: concatenating the user question, the structure of the database and the SQL template most relevant to the user question to obtain a token sequence, and inputting the token sequence into a pre-trained language model for encoding to obtain an encoding vector for each token;
step four: selecting the encoding vector of the first token, and predicting through a feed-forward neural network the SQL sequence length consistent with the user question semantics;
step five: based on the SQL sequence length, decoding the encoding vectors with a non-autoregressive Transformer to obtain an SQL statement consistent with the user question semantics.
This scheme is suitable for text-to-SQL generation tasks and can be roughly divided into two parts: template retrieval and SQL generation. The specific flow is shown in FIG. 1, and the overall structure of the model is shown in FIG. 2.
1. Template library construction and retrieval
First, SQL templates are extracted from the SQL statements appearing in the semantic parsing data set, and a template library is constructed. The tables, columns, values, etc. appearing in the SQL are replaced with specific marks. The specific rules are: table names are replaced by [TAB], column names by [COL], other values by [VAL], the ordering mode (ascending/descending) by [ORD], and limit clause values by [NUM]. The on clauses in the SQL are deleted at the same time (they can be deduced from the primary-/foreign-key relationships between the tables in the from and join clauses). For example, from the SQL statement "select name from student where age > 18;" the SQL template "select [COL] from [TAB] where [COL] > [VAL];" is obtained. A minimal extraction sketch follows.
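The following Python sketch implements these replacement rules just far enough to cover the single-table example above; it is an illustration under stated assumptions, not the patented procedure (a real implementation would walk a proper SQL parse tree, map limit values to [NUM], and delete on clauses):

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|[<>=!]+|\d+(?:\.\d+)?|'[^']*'|\S")

def extract_template(sql, tables, columns):
    out = []
    for tok in TOKEN.findall(sql):
        low = tok.lower()
        if low in tables:
            out.append("[TAB]")            # table name
        elif low in columns:
            out.append("[COL]")            # column name
        elif low in ("asc", "desc"):
            out.append("[ORD]")            # ordering mode
        elif re.fullmatch(r"\d+(?:\.\d+)?|'[^']*'", tok):
            out.append("[VAL]")            # literal value
        else:
            out.append(tok)                # SQL keyword / operator
    return " ".join(out)

# "select name from student where age > 18;" ->
# "select [COL] from [TAB] where [COL] > [VAL] ;"
library = {extract_template("select name from student where age > 18;",
                            tables={"student"}, columns={"name", "age"})}
# duplicate templates collapse because the library is a set
```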
After the template library is constructed, a two-tower model is trained to obtain the template retrieval model; the model structure is shown in FIG. 3. The template part is all templates in the template library; the query part is the concatenation of the question and the database structure, in the following format:
S = <TABLE> t_1 | t_2 | … | t_N | <COLUMN> c_1 | c_2 | … | c_M | <QUESTION> q_1 … q_n
wherein t_i is a table name in the database, c_i is a column name in the database, and q_i is a token in the question; <TABLE>, <COLUMN> and <QUESTION> are special symbols identifying the table-name, column-name and question segments. A serialization sketch follows.
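A small helper matching the format above (the function name and sample inputs are assumptions for illustration):

```python
def serialize(tables, columns, question_tokens):
    # S = <TABLE> t1|...|tN | <COLUMN> c1|...|cM | <QUESTION> q1 ... qn
    return ("<TABLE>" + "|".join(tables)
            + "<COLUMN>" + "|".join(columns)
            + "<QUESTION>" + " ".join(question_tokens))

query = serialize(["student"], ["name", "age"], ["how", "old", "is", "tom"])
```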
The query sequence and the templates are fed into pre-trained language models for encoding. The pre-trained language model can be a self-encoding language model such as BERT, RoBERTa or Electra; the encoding result corresponding to the sentence-head identifier ([CLS] or [BOS]) is used, and the encoding of the query and the encoding of the template are obtained through multi-layer feed-forward neural networks, respectively. A sketch of one tower follows.
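One tower might look as follows, assuming the HuggingFace transformers API; the head width and depth are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel

class Tower(torch.nn.Module):
    # pre-trained encoder + feed-forward head over the sentence-head
    # ([CLS]/[BOS]) encoding, as described above
    def __init__(self, name="roberta-base", dim=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size
        self.head = torch.nn.Sequential(
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, dim))

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return F.normalize(self.head(cls), dim=-1)  # unit norm: dot product = cosine

query_tower, template_tower = Tower(), Tower()      # two independent encoders
```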
Finally, the cosine similarity of the two encodings is computed, the templates are ranked, and the SQL template with the maximum similarity is selected as the corresponding SQL template. During training, 3 to 5 negative templates are selected for each question, and the loss function is as follows:
L(S) = -log p(T+ | S) - Σ_{T-} log(1 - p(T- | S))
wherein S represents the query sequence and T+, T- represent positive and negative templates.
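Under the binary-contrastive reading of the loss given above (an assumption; the published formula is reproduced only as an image), a training step might look like:

```python
import torch

def retrieval_loss(q_vec, pos_vec, neg_vecs):
    # q_vec, pos_vec: (B, d); neg_vecs: (B, K, d) with K = 3..5 negatives
    p_pos = torch.sigmoid((q_vec * pos_vec).sum(-1))                 # p(T+ | S)
    p_neg = torch.sigmoid((q_vec.unsqueeze(1) * neg_vecs).sum(-1))   # p(T- | S)
    return -(torch.log(p_pos) + torch.log1p(-p_neg).sum(1)).mean()
```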
2. Template filling (SQL generation)
The template filling stage adopts a non-autoregressive Transformer structure; the module is divided into three parts: encoder, length module and decoder. The detailed structure is shown in FIG. 4.
the input of the coding part is consistent with the query part of the TEMPLATE retrieval module, and the obtained TEMPLATE (taking < TEMPLATE > as prefix) is queried in the previous stage after the query part is spliced. The encoder adopts the RoBERTa pre-training language model, and adds several new types of codes (added to the Embedding representation of the input sequence in the same way) based on the original position codes:
table and column position coding: each entity (table or column) of the database portion in the input sequence corresponds to a separate code, starting with 1 and marking the other portions as 0;
table, column identification code: the position corresponds to the table name by 1, the column name by 2, and the others by 0;
column type coding: 0 represents other, 1-6 represent integers, character strings, floating point numbers, dates and Boolean types respectively;
database matching coding: and matching the table and the column in the database with the word elements in the problem in a character string matching mode, wherein the complete matching is marked as 1, the partial matching is marked as 2, and the other cases are marked as 0.
After encoding, the encoding result of the sentence-head identifier is passed through one layer of feed-forward neural network for the length module's prediction; during training, the loss function of the length module is added to the overall model loss with a weight of 0.1. A sketch of the length loss follows.
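A minimal sketch of the length module and its 0.1-weighted loss; length_head, encoder_out, gold_lengths and decoder_loss are assumed names:

```python
import torch.nn.functional as F

length_logits = length_head(encoder_out[:, 0])   # one FFN layer on the [CLS] encoding
length_loss = F.cross_entropy(length_logits, gold_lengths)
total_loss = decoder_loss + 0.1 * length_loss    # 0.1 weight, as stated above
```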
In the decoding part, this scheme uses a segment-copying pointer network to complete the non-autoregressive decoding, and the SQL is expressed in the form of keywords plus range indexes. The range index means that the table names, column names and condition values in the SQL statement (shown in the table below) are represented by the start and end position indexes of the corresponding segment in the input sequence.
Table 1: position index representation
Figure GDA0004065709400000061
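Because Table 1 is reproduced only as an image in the original publication, the layout below is a hedged guess at the keyword-plus-range-index target format; all indices are illustrative:

```python
# input (token positions):   0       1       2      3  4  5      6     7   8  9  10
#                        <TABLE> student <COLUMN> name | age <QUESTION> how old is 18
# SQL keywords are generated from a vocabulary; schema items and values
# are copied as (start, end) index pairs into the input sequence.
target = ["select", ("copy", 3, 3),    # column "name"
          "from",   ("copy", 1, 1),    # table "student"
          "where",  ("copy", 5, 5),    # column "age"
          ">",      ("copy", 10, 10)]  # value from the question
```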
The decoder uses a randomly initialized Transformer plus a pointer-network-style generation module. The input part uses a number of <mask> symbols corresponding to the result of the length prediction module. In the model decoding stage, the SQL statement is generated over repeated iterations, and only the one or more tokens about which the model is most certain are generated each time. When decoding the SQL statement, a token may be produced by "copying" (copying the table name, column name or value referred to by the segment index) or by "generating" (an SQL keyword). An example of the decoding process is shown in FIG. 5; a decoding sketch follows.
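A mask-predict-style sketch of the iterative decoder, consistent with "generate only the most certain tokens each round"; the decoder signature and the reveal schedule are assumptions:

```python
import torch

@torch.no_grad()
def iterative_decode(decoder, enc_vecs, length, n_iters=4, mask_id=0):
    ys = torch.full((1, length), mask_id, dtype=torch.long)  # all <mask>
    fixed = torch.zeros(1, length, dtype=torch.bool)
    for t in range(n_iters):
        logits = decoder(ys, enc_vecs)            # self-attention over ys, cross-attention to enc_vecs
        probs, preds = logits.softmax(-1).max(-1)
        k = max(1, length * (t + 1) // n_iters)   # reveal more tokens each round
        top = probs.topk(k, dim=-1).indices
        fixed.scatter_(1, top, True)              # freeze the most certain positions
        ys = torch.where(fixed, preds, torch.full_like(preds, mask_id))
    return ys
```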
after decoding is completed, corresponding tables, columns and values are filled back according to the generated indexes, and missing on clauses are supplemented according to the tables appearing in SQL, so that a final SQL sentence is obtained.
The approach closest to this application is the PICARD model proposed by Torsten Scholak et al., which also uses a Transformer model to accomplish the text-to-SQL generation task; however, that approach is only applicable to the Spider dataset, and its decoding time efficiency is low. The M-SQL model proposed by Xiaoyu Zhang et al. is also a template-based text-to-SQL generation model, but its template is single and not universal, and it is relatively inefficient in time. The template library of this scheme is extensible, easy to migrate, and fast at generation.
It should be noted that the detailed description is merely for explaining and describing the technical solution of the present invention, and the scope of protection of the claims should not be limited thereto. All changes which come within the meaning and range of equivalency of the claims and the specification are to be embraced within their scope.

Claims (10)

1. A text-to-SQL generation method based on template retrieval, characterized by comprising the following steps:
step one: acquiring a data set, wherein the data set comprises user questions, databases and SQL statements, and then semantically parsing the SQL statements in the data set to construct an SQL template library;
step two: acquiring the structure of the database, and retrieving from the SQL template library the SQL template most relevant to the user question, according to the user question and the structure of the database;
step three: concatenating the user question, the structure of the database and the SQL template most relevant to the user question to obtain a token sequence, and inputting the token sequence into a pre-trained language model for encoding to obtain an encoding vector for each token;
step four: selecting the encoding vector of the first token, and predicting through a feed-forward neural network the SQL sequence length consistent with the user question semantics;
step five: based on the SQL sequence length, decoding the encoding vectors with a non-autoregressive Transformer to obtain an SQL statement consistent with the user question semantics.
2. The text-to-SQL generation method based on template retrieval according to claim 1, characterized in that the specific steps of step one are:
replacing the tables, columns, values and ordering modes appearing in the SQL with marks, and deleting the on clauses in the SQL to obtain an SQL template, then removing duplicate SQL templates until all SQL statements are processed, obtaining the SQL template library;
replacing the tables, columns, values and ordering modes in the SQL with marks is specifically:
replacing table names with [TAB];
replacing column names with [COL];
replacing limit clause values with [NUM] and other values with [VAL];
replacing the ordering mode with [ORD], wherein the ordering mode includes ascending and descending order, namely ASC and DESC.
3. The text-to-SQL generation method based on template retrieval according to claim 2, characterized in that the retrieval is performed through a template retrieval model, which is based on a two-tower model and is obtained by introducing a loss function for optimization;
the input of the two-tower model comprises a template part and a query part;
the template part is a template in the SQL template library;
the query part is formed by concatenating the user question and the database structure, yielding a query sequence S;
the specific processing steps of the two-tower model are as follows:
the query sequence S and a template from the SQL template library are respectively fed into two independent pre-trained language models for encoding; the encoded results are passed through multi-layer feed-forward neural networks to obtain the encoding of the query and the encoding of the template, respectively; the cosine similarity between the two encodings is then computed, and the template in the SQL template library with the maximum cosine similarity is selected as the SQL template most relevant to the user question.
4. The text-to-SQL generation method based on template retrieval according to claim 3, characterized in that the query sequence S is expressed as:
S = <TABLE> t_1 | t_2 | … | t_N | <COLUMN> c_1 | c_2 | … | c_M | <QUESTION> q_1 … q_n
wherein t_1 ~ t_N are the table names in the database, c_1 ~ c_M are the column names in the database, and q_1 ~ q_n are the tokens in the question; <TABLE>, <COLUMN> and <QUESTION> are special symbols identifying the table-name, column-name and question segments; N is the number of tables in the database, M is the number of columns in the database, and n is the number of tokens in the question;
the loss function is expressed as:
L = -log p(T+ | S) - Σ_{T-} log(1 - p(T- | S))
wherein S represents the query sequence, T+ and T- represent positive and negative templates, and p represents a conditional probability.
5. The text-to-SQL generation method based on template retrieval according to claim 4, characterized in that the input of the pre-trained language model in step three is obtained by concatenating the SQL template most relevant to the user question after the query sequence S;
the pre-trained language model in step three is obtained by adding new types of encodings on the basis of the original position encodings of the pre-trained language model;
the new types of encodings include:
table and column position encoding: each table name and column name in the input sequence corresponds to a separate code, numbered from 1, with all other positions marked 0;
table/column identification encoding: table names are represented by 1, column names by 2, and others by 0;
column type encoding: 0 represents other; 1 to 5 represent integer, string, floating-point, date and Boolean types respectively;
database matching encoding: the tables and columns in the database are matched against the tokens in the user question by string matching, wherein a complete match is marked 1, a partial match is marked 2, and other cases are marked 0.
6. The text-to-SQL generation method based on template retrieval according to claim 5, characterized in that the feed-forward neural network in step four is obtained through training; a cross-entropy loss function is used for optimization during training, and this cross-entropy loss is added to the overall model loss with a weight of 0.1.
7. The text-to-SQL generation method based on template retrieval according to claim 6, characterized in that the decoding in step five is performed through a segment-copying pointer network, and the SQL statement consistent with the user question semantics is expressed in the form of keywords plus range indexes;
the range index means that the table names, column names and condition values in the SQL statement are represented by the start and end position indexes of the corresponding segment in the input sequence.
8. The text-to-SQL generation method based on template retrieval according to claim 7, characterized in that the non-autoregressive Transformer in step five is obtained by randomly initializing a Transformer, adding a pointer network, and training with a cross-entropy loss function.
9. The text-to-SQL generation method based on template retrieval according to claim 8, characterized in that the specific steps of step five are:
first, using <mask> symbols of the same length as the SQL sequence length obtained in step four as the input of the non-autoregressive Transformer, and computing self-attention with the encoding vector of each token to obtain the encoding vector of each <mask> symbol;
then iterating a preset number of times using the encoding vector of each <mask> symbol to generate the keywords and range indexes in the SQL statement;
finally, filling in the corresponding tables, columns and values according to the generated range indexes, and supplementing the missing on clauses to obtain the final SQL statement.
10. The text-to-SQL generation method based on template retrieval according to claim 9, characterized in that the pre-trained language model is BERT, RoBERTa or Electra.
CN202210836518.3A 2022-07-15 2022-07-15 text-to-SQL generating method based on template retrieval Active CN115203236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210836518.3A CN115203236B (en) 2022-07-15 2022-07-15 text-to-SQL generating method based on template retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210836518.3A CN115203236B (en) 2022-07-15 2022-07-15 text-to-SQL generating method based on template retrieval

Publications (2)

Publication Number Publication Date
CN115203236A CN115203236A (en) 2022-10-18
CN115203236B true CN115203236B (en) 2023-05-12

Family

ID=83581938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210836518.3A Active CN115203236B (en) 2022-07-15 2022-07-15 text-to-SQL generating method based on template retrieval

Country Status (1)

Country Link
CN (1) CN115203236B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116303559B (en) * 2023-02-24 2024-02-23 广东爱因智能科技有限公司 Method, system and storage medium for controlling form question and answer


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10423649B2 (en) * 2017-04-06 2019-09-24 International Business Machines Corporation Natural question generation from query data using natural language processing system
CN111666575B (en) * 2020-04-15 2022-11-18 中国人民解放军战略支援部队信息工程大学 Text carrier-free information hiding method based on word element coding
CN112988785B (en) * 2021-05-10 2021-08-20 浙江大学 SQL conversion method and system based on language model coding and multitask decoding

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559556A (en) * 2021-02-25 2021-03-26 杭州一知智能科技有限公司 Language model pre-training method and system for table mode analysis and sequence mask
CN114637765A (en) * 2022-04-26 2022-06-17 阿里巴巴达摩院(杭州)科技有限公司 Man-machine interaction method, device and equipment based on form data

Also Published As

Publication number Publication date
CN115203236A (en) 2022-10-18

Similar Documents

Publication Publication Date Title
CN113642330B (en) Rail transit standard entity identification method based on catalogue theme classification
CN110119765B (en) Keyword extraction method based on Seq2Seq framework
CN112507065B (en) Code searching method based on annotation semantic information
CN110390049B (en) Automatic answer generation method for software development questions
CN110688854A (en) Named entity recognition method, device and computer readable storage medium
CN116151132B (en) Intelligent code completion method, system and storage medium for programming learning scene
CN110442880B (en) Translation method, device and storage medium for machine translation
CN113609824A (en) Multi-turn dialog rewriting method and system based on text editing and grammar error correction
CN112364132A (en) Similarity calculation model and system based on dependency syntax and method for building system
CN115935957B (en) Sentence grammar error correction method and system based on syntactic analysis
CN115203236B (en) text-to-SQL generating method based on template retrieval
CN115048447A (en) Database natural language interface system based on intelligent semantic completion
CN108664464B (en) Method and device for determining semantic relevance
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment
CN115168402A (en) Method and device for generating model by training sequence
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN112732862B (en) Neural network-based bidirectional multi-section reading zero sample entity linking method and device
CN112148879B (en) Computer readable storage medium for automatically labeling code with data structure
CN116521857A (en) Method and device for abstracting multi-text answer abstract of question driven abstraction based on graphic enhancement
CN114757181B (en) Method and device for training and extracting event of end-to-end event extraction model based on prior knowledge
CN113743095A (en) Chinese problem generation unified pre-training method based on word lattice and relative position embedding
CN113010676A (en) Text knowledge extraction method and device and natural language inference system
CN114201506B (en) Context-dependent semantic analysis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant