CN116521711A - Text-to-SQL method, system and medium - Google Patents

Text-to-SQL method, system and medium Download PDF

Info

Publication number
CN116521711A
CN116521711A CN202310502292.8A CN202310502292A CN116521711A CN 116521711 A CN116521711 A CN 116521711A CN 202310502292 A CN202310502292 A CN 202310502292A CN 116521711 A CN116521711 A CN 116521711A
Authority
CN
China
Prior art keywords
text
word
sql
information
input information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310502292.8A
Other languages
Chinese (zh)
Inventor
车万翔
王丁子睿
窦隆绪
王佳琪
夏文岳
陶洪铸
刘金波
李大鹏
付聪
黄运豪
张�杰
商敬安
潘琦
刘涛
贺春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
State Grid Tianjin Electric Power Co Ltd
Original Assignee
Harbin Institute of Technology
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
State Grid Tianjin Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology, State Grid Corp of China SGCC, China Electric Power Research Institute Co Ltd CEPRI, State Grid Tianjin Electric Power Co Ltd filed Critical Harbin Institute of Technology
Priority to CN202310502292.8A priority Critical patent/CN116521711A/en
Publication of CN116521711A publication Critical patent/CN116521711A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A Text-to-SQL method, system and medium relate to the technical field of databases, and aim at the problem of longer length and the problem of low accuracy in the process of generating SQL in the prior art.

Description

Text-to-SQL method, system and medium
Technical Field
The invention relates to the technical field of databases, in particular to a Text-to-SQL method, a system and a medium.
Background
Text-to-SQL generating task (Text-to-SQL) is a research direction which is paid attention to in the field of natural language processing, and the Text-to-SQL generating task specifically comprises the steps that a system generates an SQL sentence consistent with the problem semantics according to the problem of a user, and the sentence is executed in a database to obtain a corresponding result.
Currently, most research on Text-to-SQL is directed to end-to-end generation methods, such as SQL generation based on fixed templates (SQLova, M-SQL, etc.), SQL generation based on grammar and transfer systems (RAT-SQL, etc.), and SQL generation based on pre-training models and constraint decoding (PICARD, etc.). However, existing model architecture applications have limitations, such as being limited to single form, single domain, or single round Text-to-SQL problems. Although the PICARD model in the prior art uses a transducer to complete the task of generating text to SQL, the model structure is complex and is difficult to apply to multi-table, multi-field and multi-round application scenes. In addition, in a complex application scene, due to the problems and the longer database structure length, the problem of low accuracy in Text-to-SQL is caused.
Disclosure of Invention
The purpose of the invention is that: aiming at the problem of low accuracy in the Text-to-SQL caused by the longer length and the database structure in the Text-to-SQL, a Text-to-SQL method, a Text-to-SQL system and a Text-to-SQL medium are provided.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a Text-to-SQL method comprising the steps of:
step one: acquiring a Text-to-SQL question and a database structure, converting the Text-to-SQL question and the database structure into linear texts, and then splicing to obtain input information;
step two: aiming at input information, searching a column name in the input information, and labeling category information behind the column name;
step three: coding the input information marked with the category information to obtain a coding vector V, and obtaining a result text by utilizing autoregressive decoding based on the coding vector V;
step four: acquiring all words and the sequence of the words in the result text, then sequentially judging whether the grammar relation of SQL is satisfied between the current word and the next word according to each word in the result text, if so, retaining the current word,
step five: splicing the words reserved in the step four into a target text, counting the table names except the FROM keywords in the target text based on the target text, comparing the table names except the FROM keywords in the target text with the table names in the FROM keywords, and supplementing the missing table names in the FROM keywords based on the comparison result.
Further, the encoding the input information after labeling the category information specifically includes:
and coding the input information after the category information is marked by using a pre-training model, wherein the pre-training model is a BART algorithm model or a T5 algorithm model.
Further, the specific steps of obtaining the result text based on the coding vector V by utilizing autoregressive decoding are as follows:
and selecting the word with the largest weight from each round as the word generated currently through autoregressive decoding, so as to obtain a result text.
Further, the step of selecting the word with the largest weight from each round as the word currently generated through autoregressive decoding, and further obtaining the result text comprises the following specific steps:
step 3.1: mapping the coding vector V into word vectors through linear transformation based on the coding vector V, wherein the number of dimensions of the word vectors is consistent with the number of corresponding words in a word table, the numerical value on each dimension of the word vectors represents the selected weight of the word corresponding to the dimension, the word corresponding to the dimension with the largest weight is selected as a target word, the target word is coded to obtain a coding vector E, and then the coding vector E and the coding vector V are spliced and mapped to obtain a new coding vector;
step 3.2: and replacing the coding vector V in the third step by using the new coding vector, and repeating the steps until a terminator is generated, and determining a result text according to the target words determined in all rounds and the sequence of adjacent rounds.
Further, in the fourth step, if the grammatical relation of SQL is not satisfied between the current word and the next word, all weights in the corresponding rounds of the current word are ordered, and the current word is replaced in sequence from the word corresponding to the next largest weight until the grammatical relation of SQL is satisfied between the replaced word and the next word.
Further, the category information includes affiliation, foreign key relationship, table relationship, internal structure information, name link information, numeric link information, and history information.
Further, the labeling of the category information after the column name specifically includes: the affiliation behind the column names is labeled BL, the foreign key relationship behind the column names is labeled FK, the table relationship behind the column names is labeled FT, the internal structure information behind the column names is labeled TK, the name linking information behind the column names is labeled SL, the numerical value linking information behind the column names is labeled EL, and the history information behind the column names is labeled HI.
A Text-to-SQL system, comprising: the system comprises an input information acquisition module, a labeling module, a coding module, a grammar judging module and a complement module;
the input information acquisition module is used for acquiring a Text-to-SQL problem and a database structure, converting the Text-to-SQL problem and the database structure into linear texts and then splicing the linear texts to obtain input information;
the labeling module is used for searching column names in the input information aiming at the input information, and labeling category information behind the column names;
the coding module is used for coding the input information marked with the category information to obtain a coding vector V, and obtaining a result text by utilizing autoregressive decoding based on the coding vector V;
the grammar judging module is used for acquiring all words and the sequence of the words in the result text, then sequentially judging whether the grammar relation of SQL is satisfied between the current word and the next word according to each word in the result text, and if so, retaining the current word;
the completion module is used for splicing the reserved words into a target text, counting the table names except the FROM keywords in the target text based on the target text, comparing the table names except the FROM keywords in the target text with the table names in the FROM keywords, and supplementing the missing table names in the FROM keywords based on the comparison result.
Further, the encoding module is configured to obtain a result text by using autoregressive decoding based on the encoding vector V, which specifically includes:
the encoding module is used for selecting the word with the largest weight from each round as the word which is currently generated through autoregressive decoding, and further obtaining a result text.
A Text-to-SQL medium comprising a computer readable program embodied therein for performing the steps of any one of claims 1 to 7.
The beneficial effects of the invention are as follows:
the method aims at the problems and the problem that model memory information is difficult and information correspondence capability is weak when the length of a database structure is long, words meeting grammar rules are reserved through information marks and application of SQL grammar rules, words not meeting grammar rules are reserved, all weights in corresponding rounds of current words are reserved, the current words are ordered, the words corresponding to the next large weights are replaced in sequence until the grammar relationship of SQL is met between the replaced words and the next words, the replaced words are reserved, finally texts are spliced, the problem that memory information is difficult and the problem that the information correspondence capability is weak is solved, finally the problem that in the prior art, due to the fact that the length is long, the database structure is used, the problem that the accuracy is low is caused in Text-to-SQL is solved, and the accuracy of SQL generation is improved.
Drawings
FIG. 1 is a flow chart of the whole application;
FIG. 2 is a block diagram of the present application;
FIG. 3 is a schematic diagram of a process for decoding a generated SQL statement;
FIG. 4 is a schematic diagram of the SQL completion process.
Detailed Description
It should be noted in particular that, without conflict, the various embodiments disclosed herein may be combined with each other.
The first embodiment is as follows: referring to fig. 1, the Text-to-SQL method according to the present embodiment is characterized by comprising the following steps:
step one: acquiring a Text-to-SQL question and a database structure, converting the Text-to-SQL question and the database structure into linear texts, and then splicing to obtain input information;
step two: aiming at input information, acquiring a column name in the input information, and labeling category information behind the column name;
step three: coding the marked input information to obtain a coding vector V, and selecting a word with the largest weight from each round as a word currently generated by autoregressive decoding to obtain a result text;
step four: for each word in the result text, judging whether the grammar relation of SQL is satisfied between the current word and the next word in sequence, if so, reserving the current word, otherwise, sequencing all weights in the corresponding rounds of the current word, and sequentially replacing the current word from the word corresponding to the next largest weight until the grammar relation of SQL is satisfied between the replaced word and the next word, and reserving the replaced word; for example, if the result text includes 5 words, it is first determined whether the first word (assumed to be a) and the second word (assumed to be F) satisfy the grammatical relation of SQL, if so, the first word is reserved, and whether the second word and the third word in the result text satisfy the grammatical relation of SQL is continuously determined, if not, all the weights obtained in the first round of autoregressive decoding process and the word with the largest weight (in autoregressive decoding process) among the words corresponding to the weights are determined according to the first word in the autoregressive decoding process, if the first word in the result text has the weight of 0.4, the weight of b is 0.3, the weight of c is 0.2, and the weight of d is 0.1, so the word a corresponding to the weight of 0.4 is used as the first word.
When the grammar relation of SQL is not satisfied between the current word and the next word, the first word in the result text is exemplified, all weights in the first round of autoregressive decoding process are sequenced, the words are sequentially selected from the second largest weight according to the sequencing, whether the grammar relation of SQL is satisfied or not is judged by the words selected each time and the next word, if so, the first word is replaced by the words corresponding to the second weight, otherwise, whether the grammar relation of SQL is satisfied or not is judged by continuously selecting the words according to the sequencing (if A and F do not satisfy the grammar relation of SQL, whether B and F satisfy the grammar relation of SQL is judged, if so, B is used as the current word, namely B is reserved, otherwise, whether C and F satisfy the grammar relation of SQL is continuously judged).
Step five: splicing the words reserved in the step four into a target Text, counting table names except the FROM keywords in the target Text based on the target Text, comparing the table names except the FROM keywords in the target Text with the table names in the FROM keywords, and supplementing the missing table names in the FROM keywords based on a comparison result to finish Text-to-SQL.
In order to solve the limitation of the existing Text-to-SQL method on application scenes, the method based on structural awareness is provided for generating SQL. Under a complex application scene, the problem of long length and the database structure are solved, and the problem of low accuracy in SQL generation is solved. In order to solve the problem and reduce the learning difficulty of the model, the application also provides a structural marking, band-limited decoding and SQL completion mechanism.
The application is applicable to text-to-SQL generating tasks, and can be roughly divided into four parts: generating input with structural markers, model encoding, band-limited decoding, and generating SQL completions. The specific flow is shown in fig. 1, and the overall structure of the model is shown in fig. 2.
Band-limited decoding
Because of the complexity of the input information, the language model may generate SQL containing entities that are not present in the database when predicting SQL. The present application solves this problem by constructing a prefix tree based on a database structure, where each node in the book corresponds to a character in the dictionary, and the child node of each node corresponds to the successor state allowed by that node. After the prefix tree is obtained, generating a plurality of candidate results of the current step in each step in the bundle searching process, checking whether each result is legal successor state of the last step, and filtering out unsatisfied prediction results. This is repeated until the predicted result is an ending symbol and is a legal successor state.
Taking fig. 3 as an example, this is the process of generating SQL statements by model decoding. In the second step of generating, four candidates of the generated SELECT key are succeeded, and the FROM key cannot directly follow the SELECT key according to the SQL grammar rules, so that the SELECT key is rejected, and three other legal results are reserved. By band-limited decoding, the proportion of legal SQL generated by the model can be obviously improved.
SQL completion
JOIN keywords are often missed when generating answers, since they are not explicitly mentioned in the question. To improve the accuracy of SQL generation, text-to-SQL needs to have JOIN keys that are possibly missing in accordance with the database structure. Specifically, the method firstly constructs a pattern diagram based on a database structure, wherein each node in the diagram represents a table name or a column name, and each side corresponds to a structural relationship, such as a table-column containing relationship, a table-table external key relationship and the like. Based on this pattern diagram, the model attempts to find the table and column in the shortest path of the existing table and column in a JOIN-missing SQL.
Taking fig. 4 as an example, this is an SQL lacking a JOIN key, and tables MATCHES and column winner_id are not mentioned in the FROM clause. We infer these two information from their neighbors PLAYERS and RANKING. Because MATCHES is located on the path connecting the two tables, and winner_id is the primary key of MATCHES, a complete SQL statement can be completed in this way.
The second embodiment is as follows: the present embodiment is further described with respect to the first embodiment, and the difference between the present embodiment and the first embodiment is that the encoding of the input information after the labeling of the category information specifically includes:
and coding the input information after the category information is marked by using a pre-training model, wherein the pre-training model is a BART algorithm model or a T5 algorithm model.
Model coding
After the structural marks are marked, the method sends the constructed input into a pre-training model for encoding. The pre-trained language model may use a BART, T5, etc. self-encoding language model to generate each character in the answer step by means of autoregressive after encoding is completed.
And a third specific embodiment: this embodiment is further described with respect to the first embodiment, and the difference between this embodiment and the first embodiment is that the specific steps for obtaining the result text based on the encoded vector V by using autoregressive decoding are as follows:
and selecting the word with the largest weight from each round as the word generated currently through autoregressive decoding, so as to obtain a result text.
The specific embodiment IV is as follows: the third embodiment is further described, and the difference between the third embodiment and the third embodiment is that the word with the largest weight is selected as the word currently generated in each round through autoregressive decoding, and the specific steps of obtaining the result text are as follows:
step 3.1: mapping the coding vector V into word vectors through linear transformation based on the coding vector V, wherein the number of dimensions of the word vectors is consistent with the number of corresponding words in a word table, the numerical value on each dimension of the word vectors represents the selected weight of the word corresponding to the dimension, the word corresponding to the dimension with the largest weight is selected as a target word, the target word is coded to obtain a coding vector E, and then the coding vector E and the coding vector V are spliced and mapped to obtain a new coding vector;
step 3.2: and replacing the coding vector V in the third step by using the new coding vector, and repeating the steps until a terminator is generated, and determining a result text according to the target words determined in all rounds and the sequence of adjacent rounds.
Fifth embodiment: in the fourth step, if the grammatical relation of SQL is not satisfied between the current word and the next word, all weights in the corresponding rounds of the current word are ordered, and the current word is replaced in sequence from the word corresponding to the next largest weight until the grammatical relation of SQL is satisfied between the replaced word and the next word, and the replaced word is reserved.
Specific embodiment six: the present embodiment is further described with respect to the first embodiment, and the category information includes an attachment relationship, an external key relationship, a table relationship, internal structure information, name link information, numerical link information, and history information.
Seventh embodiment: the present embodiment is further described with respect to the sixth embodiment, and the difference between the present embodiment and the sixth embodiment is that the column name is followed by the category information, specifically: the affiliation behind the column names is labeled BL, the foreign key relationship behind the column names is labeled FK, the table relationship behind the column names is labeled FT, the internal structure information behind the column names is labeled TK, the name linking information behind the column names is labeled SL, the numerical value linking information behind the column names is labeled EL, and the history information behind the column names is labeled HI.
Generating structured tagged inputs
Because the database contains complex internal and external key relationships, the model has difficulty capturing and memorizing these relationships. The basic input information includes questions and database structures, and in order to reduce the difficulty of the model in understanding the input information, the method marks different marks in the input text to indicate different information. For example, if a column name is the foreign key of the table, a label shaped as < FK > will be marked thereafter. Specific marking information is shown in table 1.
TABLE 1
Eighth embodiment: a Text-to-SQL system, comprising: the system comprises an input information acquisition module, a labeling module, a coding module, a grammar judging module and a complement module;
the input information acquisition module is used for acquiring a Text-to-SQL problem and a database structure, converting the Text-to-SQL problem and the database structure into linear texts and then splicing the linear texts to obtain input information;
the labeling module is used for searching column names in the input information aiming at the input information, and labeling category information behind the column names;
the coding module is used for coding the input information marked with the category information to obtain a coding vector V, and obtaining a result text by utilizing autoregressive decoding based on the coding vector V;
the specific steps for obtaining the result text based on the coding vector V by utilizing autoregressive decoding are as follows:
selecting the word with the largest weight from each round as the word generated currently through autoregressive decoding, so as to obtain a result text;
the grammar judging module is used for acquiring all words and the sequence of the words in the result text, then sequentially judging whether the grammar relation of SQL is satisfied between the current word and the next word according to each word in the result text, if so, reserving the current word, otherwise, sequencing all weights in the corresponding wheel of the current word, sequentially replacing the current word from the word corresponding to the next largest weight until the grammar relation of SQL is satisfied between the replaced word and the next word, and reserving the replaced word;
the completion module is used for splicing reserved words into a text, counting table names except the FROM keywords in the text based on the spliced text, comparing the table names with the table names in the FROM keywords, and supplementing the missing table names in the FROM keywords based on a comparison result;
the category information includes affiliation, foreign key relationship, table relationship, internal structure information, name link information, numeric link information, and history information.
The labeling of category information behind the column names is specifically as follows: the affiliation is labeled BL, the foreign key relationship is labeled FK, the table relationship is labeled FT, the internal structure information is labeled TK, the name linking information is labeled SL, the numerical linking information is labeled EL, and the history information is labeled HI.
Detailed description nine: this embodiment is further described with respect to embodiment eight, and the difference between this embodiment and embodiment eight is that the specific steps of the encoding module are as follows:
step 1: coding the marked input information to obtain a coding vector V;
step 2: mapping the coding vector V into word vectors through linear transformation based on the coding vector V, wherein the number of dimensions of the word vectors is consistent with the number of corresponding words in a word table, the numerical value on each dimension of the word vectors represents the selected weight of the word corresponding to the dimension, then the word corresponding to the dimension with the largest weight is selected as a target word, then the target word is coded to obtain a coding vector E, and then the coding vector E and the coding vector V are spliced and mapped to obtain a new coding vector;
step 3: and (3) replacing the coding vector V in the step (1) by using the obtained new coding vector, and repeating the steps until a terminator is generated, and obtaining a result text according to the target words determined in all rounds and the sequence of adjacent rounds.
The autoregressive process is as follows:
step 1: coding the marked input information to obtain a coding vector V;
step 2: mapping the code vector V to word vectors through linear transformation based on the code vector V, wherein the number of dimensions of the word vectors is consistent with the number of corresponding words in a word table, the numerical value on each dimension of the word vectors represents the selected weight of the word corresponding to the dimension, then the word corresponding to the dimension with the largest weight is selected as the word which is currently generated, all the generated words are coded to obtain a code vector E, and then the code vector E and the code vector V are spliced and mapped to obtain a new code vector;
step 3: and (3) replacing the coding vector V in the step (1) by using the obtained new coding vector, and repeating the steps until a terminator is generated, and determining a result text according to the target words determined in all rounds and the sequence of adjacent rounds to obtain the result text.
Detailed description ten: a Text-to-SQL medium comprising a computer readable program embodied therein for performing the steps of any one of claims 1 to 7.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The solutions in the embodiments of the present application may be implemented in various computer languages, for example, object-oriented programming language Java, and an transliterated scripting language JavaScript, etc.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. A Text-to-SQL method, comprising the steps of:
step one: acquiring a Text-to-SQL question and a database structure, converting the Text-to-SQL question and the database structure into linear texts, and then splicing them to obtain input information;
step two: for the input information, searching for column names in the input information and labeling category information after each column name;
step three: coding the input information marked with the category information to obtain a coding vector V, and obtaining a result text by utilizing autoregressive decoding based on the coding vector V;
step four: acquiring all words and their order in the result text, then, for each word in the result text in sequence, judging whether the grammatical relation of SQL is satisfied between the current word and the next word, and if so, retaining the current word;
step five: splicing the words retained in step four into a target text, collecting the table names appearing outside the FROM clause in the target text, comparing them with the table names in the FROM clause, and supplementing the table names missing from the FROM clause based on the comparison result.
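As a rough illustration of step one, a question and a database structure can be flattened into one linear input text. The delimiters and schema rendering below are assumptions made for the demo; the claim does not fix a concrete serialization.

```python
def linearize_input(question, schema):
    # schema: {table_name: [column_names]} rendered as "table : col1 , col2"
    # and spliced onto the question with "|" separators (assumed format).
    parts = [f"{table} : {' , '.join(cols)}" for table, cols in schema.items()]
    return question + " | " + " | ".join(parts)

print(linearize_input(
    "list all student names",
    {"student": ["id", "name"], "school": ["id", "city"]},
))  # → list all student names | student : id , name | school : id , city
```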
2. The Text-to-SQL method of claim 1, wherein the encoding of the input information after labeling the category information specifically comprises:
encoding the input information labeled with the category information using a pre-trained model, wherein the pre-trained model is a BART model or a T5 model.
3. The Text-to-SQL method according to claim 1, wherein the specific step of obtaining the result text by autoregressive decoding based on the coding vector V is as follows:
selecting, in each round of autoregressive decoding, the word with the largest weight as the currently generated word, thereby obtaining the result text.
4. The Text-to-SQL method according to claim 3, wherein selecting the word with the largest weight in each round as the currently generated word by autoregressive decoding, and thereby obtaining the result text, comprises the steps of:
step 3.1: mapping the coding vector V into word vectors through linear transformation based on the coding vector V, wherein the number of dimensions of the word vectors is consistent with the number of corresponding words in a word table, the numerical value on each dimension of the word vectors represents the selected weight of the word corresponding to the dimension, the word corresponding to the dimension with the largest weight is selected as a target word, the target word is coded to obtain a coding vector E, and then the coding vector E and the coding vector V are spliced and mapped to obtain a new coding vector;
step 3.2: replacing the coding vector V in step three with the new coding vector, and repeating the above steps until a terminator is generated; the result text is determined according to the target words determined in all rounds and the order of adjacent rounds.
5. The Text-to-SQL method according to claim 1, wherein in step four, if the grammatical relation of SQL is not satisfied between the current word and the next word, all weights in the round corresponding to the current word are sorted, and the current word is replaced, starting from the word with the next-largest weight, until the grammatical relation of SQL is satisfied between the replacement word and the next word.
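The fallback rule of claim 5 can be sketched as follows: if the chosen word breaks an SQL grammar rule with the word that follows it, walk down that round's weight ranking and substitute the next-heaviest word until the pair is legal. Here `legal_after` is a hypothetical, drastically simplified stand-in for a real SQL grammar check.

```python
# Hypothetical toy grammar: which words may legally follow a given word.
legal_after = {
    "SELECT": {"name", "id", "*"},
    "FROM": {"student", "school"},
    "WHERE": {"id", "name"},
}

def repair_word(current, nxt, round_weights):
    # round_weights: {word: weight} for the round that produced `current`.
    if nxt in legal_after.get(current, {nxt}):
        return current                     # pair already grammatical: keep it
    for cand, _ in sorted(round_weights.items(), key=lambda kv: -kv[1]):
        if nxt in legal_after.get(cand, set()):
            return cand                    # first legal word by descending weight
    return current                         # no legal candidate found: keep original

weights = {"WHERE": 0.9, "FROM": 0.8, "SELECT": 0.1}
print(repair_word("WHERE", "student", weights))  # → FROM
```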
6. The Text-to-SQL method of claim 1, wherein the category information comprises affiliations, foreign key relationships, table relationships, internal structure information, name link information, numeric link information, and history information.
7. The Text-to-SQL method of claim 6, wherein the labeling of category information after the column names is specifically: the affiliation after a column name is labeled BL, the foreign key relationship is labeled FK, the table relationship is labeled FT, the internal structure information is labeled TK, the name linking information is labeled SL, the numerical value linking information is labeled EL, and the history information is labeled HI.
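A minimal sketch of how the tags of claim 7 might be appended after a column name. The category key names and the space-separated rendering are assumptions for illustration; the claim only fixes the tag abbreviations.

```python
# Tag abbreviations from claim 7; the dictionary keys are assumed names.
TAGS = {
    "affiliation": "BL",
    "foreign_key": "FK",
    "table_relation": "FT",
    "internal_structure": "TK",
    "name_link": "SL",
    "value_link": "EL",
    "history": "HI",
}

def label_column(column, categories):
    # Append one tag after the column name for each category that applies.
    return " ".join([column] + [TAGS[c] for c in categories if c in TAGS])

print(label_column("student.name", ["name_link", "foreign_key"]))  # → student.name SL FK
```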
8. A Text-to-SQL system, comprising: the system comprises an input information acquisition module, a labeling module, a coding module, a grammar judging module and a complement module;
the input information acquisition module is used for acquiring a Text-to-SQL question and a database structure, converting the Text-to-SQL question and the database structure into linear texts, and then splicing the linear texts to obtain input information;
the labeling module is used for searching for column names in the input information and labeling category information after each column name;
the coding module is used for coding the input information marked with the category information to obtain a coding vector V, and obtaining a result text by utilizing autoregressive decoding based on the coding vector V;
the grammar judging module is used for acquiring all words and the sequence of the words in the result text, then sequentially judging whether the grammar relation of SQL is satisfied between the current word and the next word according to each word in the result text, and if so, retaining the current word;
the completion module is used for splicing the retained words into a target text, collecting the table names appearing outside the FROM clause in the target text, comparing them with the table names in the FROM clause, and supplementing the table names missing from the FROM clause based on the comparison result.
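The completion module's behavior can be sketched as below: collect the table names referenced outside the FROM clause and add any that the FROM clause is missing. The regular expressions and the `KNOWN_TABLES` set are simplifying assumptions for the demo; a real system would parse the SQL rather than pattern-match it.

```python
import re

KNOWN_TABLES = {"student", "school"}  # assumed schema for the demo

def complete_from(sql):
    # Tables referenced anywhere as "table.column".
    used = {m.group(1) for m in re.finditer(r"\b(\w+)\.\w+", sql)
            if m.group(1) in KNOWN_TABLES}
    # Tables already listed in the FROM clause.
    m = re.search(r"\bFROM\s+([\w\s,]+?)(?:\s+WHERE\b|$)", sql)
    listed = {t.strip() for t in m.group(1).split(",")} if m else set()
    if not (used - listed):
        return sql                         # nothing missing: leave query as-is
    # Supplement the FROM clause with the missing table names.
    new_from = "FROM " + ", ".join(sorted(listed | used))
    return re.sub(r"\bFROM\s+[\w\s,]+?(?=\s+WHERE\b|$)", new_from, sql, count=1)

print(complete_from("SELECT student.name FROM school WHERE school.city = 'x'"))
# → SELECT student.name FROM school, student WHERE school.city = 'x'
```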
9. The Text-to-SQL system of claim 8, wherein the coding module obtains the result text by autoregressive decoding based on the coding vector V specifically by:
selecting, in each round of autoregressive decoding, the word with the largest weight as the currently generated word, thereby obtaining the result text.
10. A Text-to-SQL medium, characterized in that it comprises a computer-readable program for performing the steps of the method of any one of claims 1 to 7.
CN202310502292.8A 2023-05-06 2023-05-06 Text-to-SQL method, system and medium Pending CN116521711A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310502292.8A CN116521711A (en) 2023-05-06 2023-05-06 Text-to-SQL method, system and medium

Publications (1)

Publication Number Publication Date
CN116521711A true CN116521711A (en) 2023-08-01

Family

ID=87402693



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination