CN117827886A - Method for converting natural sentence into SQL sentence based on large language model - Google Patents

Info

Publication number: CN117827886A (application CN202410255498.XA)
Authority: CN (China)
Prior art keywords: sql, sentence, sequence, sentences, encoder
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN117827886B (en)
Inventors: 张煇, 剌昊跃, 李龙
Current and original assignees: Changhe Information Co., Ltd.; Beijing Changhe Digital Intelligence Technology Co., Ltd.
Application filed by Changhe Information Co., Ltd. and Beijing Changhe Digital Intelligence Technology Co., Ltd.; the application has been granted and published as CN117827886B.

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a method for converting natural sentences into SQL statements based on a large language model, relating to the technical field of natural language processing, and comprising the following steps: collecting natural sentences and their corresponding SQL statements from a plurality of open data sets to construct a training set; expanding the training set using templates based on grammar parsing and vocabulary replacement rules; constructing a sequence-to-sequence neural network, with an encoder that extracts features from the expanded training set and a decoder that generates the corresponding SQL statements; in the encoder, setting a table structure analysis component that parses the corresponding database table structure from the SQL statements in the training set; in the encoder, setting a database content analysis component that parses the corresponding database table contents from the SQL statements in the training set; and in the decoder, setting a database grammar rule set for checking and correcting grammar errors in the generated SQL statements. Aimed at the problem of low conversion accuracy from natural sentences to SQL statements in the prior art, the method improves the model's understanding capability and conversion accuracy.

Description

Method for converting natural sentence into SQL sentence based on large language model
Technical Field
The application relates to the technical field of natural language processing, and in particular to a method for converting natural sentences into SQL statements based on a large language model.
Background
In the present information age, data processing and database querying are important components of every industry. However, the complexity and technical threshold of the database query language SQL make it difficult for many non-technical users to understand and use, especially users who express their needs in natural language. Achieving accurate conversion of natural language into SQL statements is therefore an important and challenging problem. Traditional natural language processing techniques perform poorly in this respect: their conversion accuracy is low and cannot meet actual requirements.
In the prior art, the conversion accuracy from natural sentences to SQL statements is affected by many factors, including the ambiguity of language expression, the complexity of grammatical structure, and differences in database architecture. Traditional rule- or template-based methods often lack flexibility and generalization capability and struggle with complex language structures and semantic relationships. In addition, because the data flow is poorly modeled, generated SQL statements often contain grammatical or semantic errors, reducing conversion accuracy and reliability.
In the related art, for example, Chinese patent document CN117194478A provides a method for converting natural language sentences into SQL statements based on a fine-tuned large language model, which includes: fine-tuning a large language model; acquiring a natural language sentence to be converted; and inputting the natural language sentence into the fine-tuned large language model to obtain the converted SQL statement. However, direct fine-tuning struggles to adapt to the characteristics of the SQL language, so the conversion accuracy of this scheme needs further improvement.
Disclosure of Invention
1. Technical problem to be solved
Aiming at the problem of low conversion accuracy from natural sentences to SQL statements in the prior art, the application provides a method for converting natural sentences into SQL statements based on a large language model. It uses a table structure analysis component and a content analysis component in the encoder to parse semantic information of the database, and sets a database grammar rule set in the decoder to check and correct the grammar of generated statements, thereby improving conversion accuracy.
2. Technical proposal
The aim of the application is achieved by the following technical scheme.
The embodiments of this specification provide a method for converting natural sentences into SQL statements based on a large language model, comprising the following steps: collecting natural sentences and corresponding SQL statements from a plurality of open data sets to construct a training set; expanding the training set using templates based on grammar parsing and vocabulary replacement rules; constructing a sequence-to-sequence neural network, setting an encoder to extract features from the expanded training set and a decoder to generate corresponding SQL statements; in the encoder, setting a table structure analysis component that parses the corresponding database table structure, including table names, column names and data types, from the SQL statements in the training set; in the encoder, setting a database content analysis component that parses the corresponding database table contents, including the data value distribution, whether null values are present, and the value discrimination, from the SQL statements in the training set; and in the decoder, setting a database grammar rule set for checking and correcting grammar errors in the generated SQL statements.
Here, a natural sentence refers to a question or query that the user poses in natural language, such as "list the names and cell phone numbers of all students". The corresponding SQL statement is the database query statement matching the query intent expressed by the natural sentence, such as "SELECT name, phone FROM student". In this technical scheme, natural language query sentences and their corresponding SQL statements are collected from a plurality of open data sets to construct a training set. The training set contains many natural sentence instances that express query intents in different ways, together with their corresponding SQL implementations. The sequence-to-sequence neural network model is trained on this training set so that the model learns the mapping between natural sentences and SQL statements. The model analyzes the semantics of natural sentences through the encoder, and the decoder generates the corresponding SQL statements, realizing conversion from natural language to the database query language. Collecting a training set of natural-sentence/SQL-statement pairs is the key data foundation for achieving this conversion: the larger the training set and the richer its sentence expressions, the stronger the mapping capability the model can learn.
In this technical scheme, a sequence-to-sequence neural network model is built. The encoder contains a table structure analysis module and a content analysis module that can automatically parse SQL statements, extract database table structure information such as a student table or a course table, and obtain relationships among tables such as the sc (student-course) table. The decoder contains a database grammar rule set and can generate grammatically correct SQL statements. The model is trained on large-scale open data sets so that it learns the correspondence between natural language and SQL statements, realizing accurate conversion from natural language to SQL.
Here, a template based on grammar parsing refers to extracting the grammatical structure of natural language sentences using grammar rules to form reusable grammar templates, such as "list {column name}" or "find the {column name} of {condition}". A vocabulary replacement rule generates new sentences by replacing the words in a template according to their semantics, for example replacing "{column name}" with "student name" to generate the sentence "list student names". In this technical scheme, grammar templates representing the semantics are first extracted from the training set using grammar parsing; the templates contain variables representing changeable parts such as column names and table names. Vocabulary replacement rules are then defined to replace the variables in the templates with specific words for different column names, table names, and so on, thereby generating new training samples and expanding the training set. The expanded training set has wider grammar-template coverage and richer vocabulary combinations, providing more sufficient training corpus for the neural network model. This helps improve the model's ability to understand diverse sentences, enabling accurate conversion from natural language to SQL statements.
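By way of illustration, the following Python sketch shows one way the slot-filling expansion could work. The template strings, slot names, and vocabularies are illustrative assumptions, not taken from the patent.

```python
# Minimal sketch of template-based training-set expansion, assuming a simple
# slot-filling scheme; all templates and vocabularies below are illustrative.
from itertools import product

def expand_templates(templates, vocab):
    """Generate new (natural sentence, SQL) pairs by filling template slots."""
    samples = []
    for nl_tpl, sql_tpl in templates:
        # Cartesian product over the replacement vocabulary for each slot.
        for col, table in product(vocab["column"], vocab["table"]):
            nl = nl_tpl.format(column=col, table=table)
            sql = sql_tpl.format(column=col, table=table)
            samples.append((nl, sql))
    return samples

templates = [("list the {column} of all {table}s",
              "SELECT {column} FROM {table}")]
vocab = {"column": ["name", "phone"], "table": ["student", "course"]}
print(expand_templates(templates, vocab))
# [('list the name of all students', 'SELECT name FROM student'), ...]
```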
Sequence-to-sequence (Seq2Seq) is an end-to-end neural network architecture that takes one sequence as input and generates another sequence as output. This technical scheme constructs a sequence-to-sequence neural network model in which an Encoder module performs feature extraction on the input sequence of the natural language sentence and outputs a semantic feature sequence, and a Decoder module generates the output sequence of the corresponding SQL statement based on the semantic features output by the encoder. In this way, an end-to-end mapping from the natural language sequence to the SQL statement sequence is achieved. Compared with traditional word-by-word tagging and parsing methods, the sequence-to-sequence model better simulates the natural language understanding and generation process and can directly learn the correspondence between input and output sequence patterns without complex manual feature engineering, thereby realizing high-quality conversion from natural language to the database query language.
Here, the table structure analysis component is a module that analyzes and extracts structural information, such as database table names, column names, and data types, referenced in SQL statements. The database table structure refers to the structured information of a database table, including the table name, the columns it contains, and the data types of those columns. In the encoder module of the neural network, this technical scheme sets a table structure analysis component whose function is to parse the corresponding database table structure from the SQL statements of the training set and extract the table names, column names, data types, and other structural information involved in each statement. The parsed table structure provides the encoder with the database schema information related to the SQL statement, helping the encoder better understand the statement's semantics. The more thoroughly the encoder understands, the higher the quality of the SQL statements generated by the decoder. The table structure analysis component supplies database structure information to the sequence-to-sequence model and is an important means of using database semantic knowledge to improve conversion quality.
Here, the database content analysis component is a module that analyzes and extracts the data content information of the database table corresponding to an SQL statement, such as the distribution of data values and whether null values exist. Database table contents refer to the characteristics of the data stored in a database table, including the distribution of values, whether null values exist, and the discrimination of column values. In this technical scheme, a database content analysis component is also provided in the encoder; it parses and extracts content information, such as the distribution of data values and the presence of null values, from the data tables corresponding to the SQL statements. This content information reflects the characteristics of the table data itself and provides semantic knowledge about the columns a statement refers to. Once the encoder fully understands the table content information, it can analyze the column semantics involved in a statement more accurately, guiding the decoder to generate SQL statements consistent with the semantics. Compared with using structure information alone, additionally using content information helps the model understand the semantics of database tables more deeply and is an important means of improving conversion quality from the data perspective.
Specifically, the data value distribution refers to the range and distribution of values in a database column and reflects the overall characteristics of the column's values. Whether null values are contained refers to whether a database column has missing values and reflects the completeness of the column. The value discrimination refers to the degree of duplication among a column's values and reflects its ability to distinguish records. The database content analysis component in this technical scheme can parse the following content information from the database table corresponding to an SQL statement: the data value distribution, analyzing the range and distribution of values in each column, such as whether values deviate from a range or are distributed evenly; whether null values are contained, checking whether a column has missing null values; and the value discrimination, counting duplicate values in a column to judge the uniqueness of its values. Such content information helps the encoder deeply understand the specific semantics of the columns a statement involves: the data value distribution reflects the overall picture of a column's values, the presence of null values reflects its completeness, and the value discrimination reflects its uniqueness. Once the encoder has fully analyzed these column content characteristics, it can analyze the statement semantics more accurately, improving the quality of the decoded SQL statements.
Specifically, a database grammar rule set is a set of predefined rules for verifying the grammatical correctness of SQL statements, covering the key grammatical constructs of the SQL standard. In this technical scheme, a database grammar rule set is provided in the decoder module of the neural network to check whether generated SQL statements contain grammar errors. After the decoder generates an SQL statement, the statement is passed to the grammar rule set for verification, and the rule set checks the statement's grammatical correctness rule by rule according to the SQL language specification, for example checking whether the FROM clause specifies a table and whether the WHERE clause conditions are well-formed. If an error violating the grammar specification is detected, the SQL statement is revised according to the database grammar rule set and adjusted into the correct grammatical form. Adding the constraint of the database grammar rule set at the generation stage thus greatly improves the accuracy of the SQL statements output by the decoder and avoids grammar errors.
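As an illustration of such rule-based checking, the following Python sketch applies a small, assumed rule set to a generated statement. The patent does not disclose its actual rule set covering the SQL standard, so these rules are placeholders.

```python
import re

# Illustrative rule set: each rule is (description, predicate). These checks
# are assumptions for demonstration, not the patent's actual rules.
RULES = [
    ("SELECT requires FROM",
     lambda s: ("SELECT" not in s.upper()) or ("FROM" in s.upper())),
    ("balanced parentheses",
     lambda s: s.count("(") == s.count(")")),
    ("WHERE clause is not empty",
     lambda s: re.search(r"\bWHERE\s*$", s.strip(), re.I) is None),
]

def check_sql(sql):
    """Return the descriptions of every rule the generated statement violates."""
    return [name for name, ok in RULES if not ok(sql)]

print(check_sql("SELECT name, phone student"))      # ['SELECT requires FROM']
print(check_sql("SELECT name FROM student WHERE"))  # ['WHERE clause is not empty']
```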
Further, constructing the sequence-to-sequence neural network comprises: setting a bidirectional LSTM encoder comprising a forward LSTM unit and a backward LSTM unit, which perform forward-order and reverse-order semantic analysis of the natural sentence, respectively, to obtain forward-order and reverse-order encoder hidden state sequences; setting an attention-based decoder comprising an attention layer and an LSTM unit; the attention layer outputting attention weight vectors step by step from the obtained forward-order and reverse-order encoder hidden state sequences; for each time step, weighting the forward-order and reverse-order encoder hidden state sequences by the attention weight vector to obtain the decoder's semantic features for that time step; and the LSTM unit of the decoder generating the corresponding SQL statement from the obtained semantic features.
Here, the forward-order encoder hidden state sequence is the hidden state sequence obtained by encoding the input sequence in order with the forward LSTM unit; it reflects the forward semantic information of the input sequence. The reverse-order encoder hidden state sequence is the hidden state sequence obtained by encoding the input sequence in reverse order with the backward LSTM unit; it reflects the reverse semantic information of the input sequence. In this technical scheme, the encoder adopts a bidirectional LSTM structure comprising a forward LSTM unit and a backward LSTM unit. The forward LSTM generates a hidden state at each moment according to the order of the input sentence and connects them into the forward-order hidden state sequence, reflecting the forward semantic information. The backward LSTM generates the reverse-order hidden state sequence in reverse order, reflecting the reverse semantic information. The decoder connects the hidden state sequences of both directions simultaneously, integrating the forward and reverse semantic features of the sentence. Compared with a unidirectional LSTM, the bidirectional LSTM learns the semantics of the input sentence more comprehensively and can recognize semantic equivalence under word-order changes and similar variations. The combination of forward-order and reverse-order hidden state sequences therefore improves the encoder's ability to express sentence semantics, helping the decoder generate more accurate SQL statements.
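A minimal sketch of such a bidirectional encoder, assuming a PyTorch implementation with illustrative dimensions (the patent does not specify a framework):

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Bidirectional LSTM encoder producing forward- and reverse-order states."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True runs a forward and a backward LSTM over the input.
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        x = self.embed(token_ids)          # (batch, seq_len, emb_dim)
        outputs, _ = self.lstm(x)          # (batch, seq_len, 2 * hidden_dim)
        # Split into the forward-order (h1) and reverse-order (h2) sequences.
        h1, h2 = outputs.chunk(2, dim=-1)
        return h1, h2

enc = BiLSTMEncoder(vocab_size=10000)
h1, h2 = enc(torch.randint(0, 10000, (1, 12)))  # a 12-token sentence
print(h1.shape, h2.shape)  # torch.Size([1, 12, 64]) torch.Size([1, 12, 64])
```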
In particular, the decoder of the present scheme includes an attention layer that automatically learns and identifies information in the encoder that is more important to the current decoding step. The attention mechanism enables the decoder to focus on the relevant part of the input sentence in a scoring and weighting mode, and semantic information is effectively captured. The LSTM unit is arranged in the decoder, so that semantic content in the sequence generation process can be recorded and transferred. The gating structure of the LSTM can effectively capture long-distance dependency. The attention mechanism refines semantic features from the encoder output as inputs to the LSTM, which generates output tokens with semantic information. Their combination can take full advantage of both, the attention mechanism provides targeted semantics, and LSTM provides long-range dependent memory capabilities. The decoder accurately grasps the semantics of the input sentence based on the combined action of the attention mechanism and the LSTM, and generates a corresponding SQL sentence. Through semantic extraction and sentence generation in the modeling decoding stage, accurate conversion from natural language to SQL query is completed.
Here, the encoder hidden state sequence is the hidden state sequence representing the semantics that the encoder outputs after processing the input sequence. The attention weight vector is the vector of attention weight coefficients that the attention mechanism computes for the encoder's hidden state at each moment. In this technical scheme, the encoder's bidirectional LSTM generates the forward-order and reverse-order hidden state sequences, which contain the semantic information of the whole input sentence. For the current decoding time step, the attention layer computes the importance of each encoder hidden state to the current step, forming the attention weight vector. Specifically, the attention layer computes the correlation between the decoder's current state and each encoder hidden state as the weight. In the attention weight vector, encoder hidden states with higher weights represent input semantic information that is more important to the current decoding step. Finally, the encoder hidden state sequences are weighted and summed with the attention weight vector to obtain the semantic features of the current time step, which are input to the decoder's LSTM to guide the generation of the current target token. In this way, the encoding features are captured and refined through the encoder hidden state sequence and the attention weight vector.
Here, semantic features are the feature vectors representing the semantic information of a sentence, obtained by weighted fusion of the input sentence's encoded states. In this technical scheme, the attention mechanism generates an attention weight vector at the corresponding moment for each decoder time step. The forward and reverse hidden state sequences of the encoder are then weighted and summed using this weight vector. The weighted sum is a semantic feature vector that fuses the semantic information contained in the encoder's bidirectional hidden states and represents the overall semantic content of the input sentence. This semantic feature vector is passed to the decoder's LSTM unit as the input of the current decoding time step, and the LSTM unit generates the output token of the current time step under its guidance. In this way, the attention mechanism extracts and refines the encoded semantics and passes them to the decoder to generate output. Through the learning and transfer of semantic features, the semantics of the input sentence are effectively captured, guiding the generation of SQL queries consistent with those semantics.
Specifically, the decoder is provided with an LSTM unit to generate the corresponding SQL statement. The attention mechanism extracts semantic features from the encoder as the input to the decoder LSTM. The semantic features contain the overall semantic information of the input sentence and are the essential basis for generating the SQL statement. At each moment, the LSTM unit determines the best token to generate at the current position according to the current semantic features, and transfers semantic information to guide statement generation by capturing long-distance dependencies. The decoder LSTM generates each token of the SQL statement in turn, finally forming a complete SQL query. Guided by the semantic features, the generated SQL statement is semantically consistent with the input sentence. Compared with direct prediction, introducing semantic features improves the accuracy of the generated statements, and the LSTM structure applies long-distance semantic constraints during generation. The combination of the targeted semantics provided by the attention mechanism and the memory capability of the LSTM is the key to generating high-quality SQL statements.
Further, obtaining the semantic features of the decoder includes: setting linear projection layers to reduce the dimensionality of the encoder's forward-order and reverse-order hidden state sequences; inputting the encoder's forward-order hidden state sequence h1t and reverse-order hidden state sequence h2t into their corresponding linear projection layers; the linear projection layers projecting h1t and h2t and outputting the projected forward-order hidden state sequence h1t' and reverse-order hidden state sequence h2t'; and concatenating h1t' and h2t' in the feature dimension to obtain the encoder fusion state ht at the current time t.
Here, a linear projection layer is a neural network layer that projects its input into a lower-dimensional space by a linear transformation. In this technical scheme, the forward-order and reverse-order hidden state sequences generated by the encoder's bidirectional LSTM are generally of high dimensionality. To reduce the computational complexity of the subsequent attention mechanism, a linear projection layer is placed between the encoder and the attention layer. The linear projection layer applies a linear transformation to the input high-dimensional forward and reverse hidden states, projecting them into a lower-dimensional space. This reduces the cost of the attention mechanism's weighted computation over the hidden states and improves the model's efficiency. At the same time, the learned projection parameters can remove redundant information and extract effective semantic features. The linear projection layer thus both reduces computational complexity and refines semantic information, allowing the attention module to produce attention features more efficiently and accurately.
Specifically, the encoder's forward LSTM generates the forward-order hidden state sequence h1t, and the backward LSTM generates the reverse-order hidden state sequence h2t. A linear projection layer Matrix1 is defined for the forward-order sequence, and Matrix2 for the reverse-order sequence. The forward-order hidden state sequence h1t is linearly transformed by Matrix1 to obtain the projected forward-order hidden state sequence h1t'. The reverse-order hidden state sequence h2t is linearly transformed by Matrix2 to obtain the projected reverse-order hidden state sequence h2t'. Matrix1 and Matrix2 achieve dimensionality reduction and semantic extraction, so the resulting h1t' and h2t' are compressions and semantic refinements of the original sequences. The projected forward and reverse sequences h1t' and h2t' are input to the attention layer for weighted computation. Linear projection realizes this reduction and refinement, decreasing the attention layer's computation and improving efficiency. Finally, the attention-weighted semantic features are output as the decoder's input to generate SQL statements.
Specifically, the feature dimension is the dimensionality of a vector or tensor in the neural network and reflects the expressive power of the feature. In this application, the projected forward-order hidden state h1t' and reverse-order hidden state h2t' have the same dimensionality. The encoder state ht is defined as the fusion of the forward-order and reverse-order hidden states: concatenating h1t' and h2t' in the feature dimension realizes the fusion of the forward and reverse features. That is, if h1t' and h2t' are both 64-dimensional, then ht is 128-dimensional. ht therefore contains semantic information in both orders and is a comprehensive representation of the input sentence's semantics. The subsequent attention layer weights ht and outputs the semantic features. The final semantic features fuse the information of the bidirectional hidden states and have stronger semantic representation capability. Concatenation along the feature dimension thus realizes effective fusion of the bidirectional semantics.
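A sketch of the projection and feature-dimension concatenation, again assuming PyTorch; the 128-to-64 reduction is illustrative, chosen to match the 64-dimensional example above:

```python
import torch
import torch.nn as nn

# Matrix1/Matrix2 projections followed by concatenation in the feature dim.
matrix1 = nn.Linear(128, 64)        # projects the forward-order states h1t
matrix2 = nn.Linear(128, 64)        # projects the reverse-order states h2t

h1 = torch.randn(1, 12, 128)        # forward-order hidden state sequence h1t
h2 = torch.randn(1, 12, 128)        # reverse-order hidden state sequence h2t

h1p = matrix1(h1)                   # projected sequence h1t'
h2p = matrix2(h2)                   # projected sequence h2t'
ht = torch.cat([h1p, h2p], dim=-1)  # fusion state ht: 64 + 64 = 128 dims
print(ht.shape)                     # torch.Size([1, 12, 128])
```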
Further, obtaining the semantic features of the decoder further includes: setting a residual connection layer that adds the encoder's forward-order hidden state sequence h1t and reverse-order hidden state sequence h2t to the encoder fusion state ht, obtaining the encoder fusion feature h~t; inputting the fusion feature h~t to the attention layer of the decoder; and the decoder outputting the semantic features of the current time t based on the fusion feature h~t.
Specifically, the encoder generates the forward-order hidden state sequence h1t and the reverse-order hidden state sequence h2t, and the encoder fusion state ht is obtained by linear projection and concatenation. A residual connection layer is defined whose inputs are h1t, h2t and ht. The residual connection layer linearly transforms h1t and h2t, mapping them to the same dimensionality as ht, and then adds the transformed h1t and h2t to ht to realize the residual connection. The result of the residual addition is taken as the new encoder fusion feature h~t. The residual connection fuses the original forward and reverse states with the fusion state, so h~t contains both the progressively abstracted semantic information and the original semantic details. This further enhances the semantic representation capability of the encoder state. h~t serves as the input of the attention layer and is used to guide the decoder to generate SQL statements.
Specifically, the encoder outputs the fusion feature h~t, which contains the semantic information of the input sentence. h~t is input to the attention layer of the decoder. The attention layer computes the correlation between h~t and the current state of the decoder as the attention weight, then performs a weighted summation over h~t with these weights to generate the semantic feature vector ct. The calculation of the attention weights here incorporates the guidance of the decoder state, enabling dynamic selection of encoder features. The semantic information in ct comes from the encoder state h~t, weighted and fused according to the current decoding state. ct serves as the semantic input of the decoder at the current time t, guiding the generation of the corresponding SQL statement token. By repeating this process, the decoder generates the complete SQL statement step by step. The dynamic semantic fusion provided by the attention mechanism is the key to generating accurate SQL statements.
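The residual fusion and attention weighting could be sketched as follows, again assuming PyTorch. The additive-style scoring function is an assumption, since the patent only states that a correlation between the decoder state and the encoder states is computed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualAttention(nn.Module):
    """Residual fusion of raw encoder states plus attention over the result."""
    def __init__(self, raw_dim=128, fused_dim=128):
        super().__init__()
        # Map the raw forward/reverse states to the fusion-state dimension.
        self.res_fwd = nn.Linear(raw_dim, fused_dim)
        self.res_bwd = nn.Linear(raw_dim, fused_dim)
        self.score = nn.Linear(fused_dim * 2, 1)  # assumed additive-style scorer

    def fuse(self, h1, h2, ht):
        # Residual connection: transformed raw states are added to ht -> h~t.
        return ht + self.res_fwd(h1) + self.res_bwd(h2)

    def attend(self, h_fused, dec_state):
        # Score every encoder position against the current decoder state.
        seq_len = h_fused.size(1)
        dec = dec_state.unsqueeze(1).expand(-1, seq_len, -1)
        weights = F.softmax(self.score(torch.cat([h_fused, dec], -1)), dim=1)
        ct = (weights * h_fused).sum(dim=1)       # semantic feature ct
        return ct, weights

ra = ResidualAttention()
h1, h2 = torch.randn(1, 12, 128), torch.randn(1, 12, 128)
ht = torch.randn(1, 12, 128)                      # from projection + concat
ct, w = ra.attend(ra.fuse(h1, h2, ht), torch.randn(1, 128))
print(ct.shape, w.shape)  # torch.Size([1, 128]) torch.Size([1, 12, 1])
```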
Further, the LSTM unit of the decoder generating the corresponding SQL statement from the obtained semantic features includes: for each time step t, the decoder's LSTM unit generates the SQL statement token of the current time step t from the semantic features output by the attention layer at time step t, where the tokens are represented as one-hot vectors and include database table names and the column names of the database tables; and connecting the SQL statement token generated at time step t to the SQL statement tokens accumulated over time steps 1 to t-1, forming the preliminary SQL statement of the current time step t.
Specifically, a token is the smallest unit carrying semantic information in a language sequence; it may be a word, a phrase, a symbol, and so on. In this scheme, the SQL statement generated by the decoder is represented as a token sequence, where the tokens are represented by one-hot vectors to facilitate network learning and contain the key components of the SQL statement, such as database table names and column names. At each time step t, the decoder's LSTM unit predicts and generates the token of the current moment from the semantic feature ct output by the attention layer. The decoder predicts tokens step by step, finally forming the complete SQL query. The attention feature ct provides the semantic basis for predicting the current token, and generating a token sequence containing semantic information is the key to obtaining a correct SQL statement. Tokens and their one-hot representations thus serve as the basic units of statement generation; guided by the attention features, tokens are predicted step by step, realizing conversion from natural language to SQL queries.
Specifically, at each time step t the attention layer outputs the semantic feature ct based on the encoder features. An LSTM unit is provided in the decoder to generate SQL statement tokens. At time step t, ct is input to the LSTM unit, which predicts the token yt of the current time step from the semantic feature ct. yt adopts a one-hot vector representation, which facilitates network learning, and contains SQL keywords, table names, column names, and so on. The LSTM unit realizes the transfer of semantics by controlling the opening and closing of its gates. At the next time step, this process is repeated until a complete SQL statement is generated. The dynamic semantics provided by the attention mechanism guide the LSTM to predict tokens step by step, while the LSTM models long-distance dependencies. Combining the attention features with the LSTM improves the accuracy of token generation.
Specifically, at the first time step, the decoder generates the first token y1. From the second time step onward, the decoder generates the token yt for the current moment. A variable St is defined to store the preliminary SQL statement at the current time. At time t, yt is connected to the end of S(t-1) to form St, i.e. St = S(t-1) + yt, where S(t-1) stores the tokens generated in time steps 1 to t-1. Through this connection, St contains all tokens up to the current moment. When decoding is completed, St is the final, complete generated SQL statement. Incremental generation of the SQL statement is thus realized through step-by-step concatenation, and the dynamic semantics provided by the attention mechanism guide the incremental concatenation of tokens.
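A toy sketch of one decoding step and the incremental concatenation St = S(t-1) + yt, with an assumed miniature SQL vocabulary (the real vocabulary and trained weights are not part of this sketch):

```python
import torch
import torch.nn as nn

sql_vocab = ["SELECT", "name", "phone", ",", "FROM", "student", "<eos>"]

cell = nn.LSTMCell(input_size=128, hidden_size=128)
out_proj = nn.Linear(128, len(sql_vocab))       # scores over SQL tokens

def decode_step(ct, state, statement):
    """Consume the semantic feature ct and append one token y_t to S_{t-1}."""
    h, c = cell(ct, state)
    token_id = out_proj(h).argmax(dim=-1).item()  # greedy pick of y_t
    return (h, c), statement + [sql_vocab[token_id]]

state = (torch.zeros(1, 128), torch.zeros(1, 128))
S = []                                # S_0 is empty
for t in range(4):                    # pretend 4 attention features arrive
    ct = torch.randn(1, 128)          # would come from the attention layer
    state, S = decode_step(ct, state, S)
print(" ".join(S))                    # a (random, untrained) token sequence
```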
Further, generating the corresponding SQL statement further includes: performing an incremental search over the SQL statement generation process using a beam search algorithm; taking the preliminary SQL statement of the current time step t as the first candidate sequence of the beam search; based on the candidate statement set of the previous time step t-1, generating the candidate statement set of the current time step t by expanding each candidate with one token, where the candidate statement set represents the statement choices to be evaluated; computing the generation probability of each statement in the candidate statement set of the current time step t; selecting the top-ranked beams statements by generation probability to participate in the beam search of the next time step t+1; and repeating these steps until the final time step T is reached, then selecting the statement with the highest cumulative probability from the candidate statement sets of all time steps as the final SQL statement. The final time step T represents the termination moment of the traversal search process and is determined by the length of the input natural sentence. Selecting the highest cumulative probability means traversing the candidate statement sets of all time steps and computing, for each candidate statement, the cumulative probability from the starting time step to the final time step T.
Here, the beam search algorithm is a graph-based search algorithm that finds a near-optimal solution quickly by limiting the number of expanded nodes. The generation process refers to the process by which the neural network model generates tokens step by step from input to output. Incremental search refers to a search strategy that gradually expands the search range and approaches the optimal target step by step. This method uses beam search to perform an incremental search over the SQL statement generation process: at each step of generation, the top k most probable tokens are kept, and based on these tokens the next step's top k candidate tokens are generated. Through this incremental expansion, the likely optimal SQL statement can be found quickly. The beam size controls the breadth of the search and avoids a combinatorial explosion, while the dynamic semantics provided by the attention mechanism guide the direction of the search. Finally, several candidate SQL statements are output, and the one with the highest probability is selected as the final result. Incremental beam search improves decoding efficiency.
In the beam search, the first candidate sequence is the search starting point, i.e., the first candidate solution generated from the existing information in the initial state. At time step t, the decoder generates the token yt for that moment. yt is connected to the previous statement to form the preliminary SQL statement St, which contains all token information generated by the decoder up to the current moment. St is taken as the first candidate sequence of the beam search. As the starting point of the incremental search, St contains all currently available information; on the basis of St, the search space is expanded to find the best possible solution. The dynamic semantics provided by the attention mechanism are incorporated into St, and with St as the first candidate sequence, the incremental search expansion is performed.
Here, in the beam search algorithm, the candidate statement set is the set of all possible statement sequences retained at a given moment, representing the alternatives currently awaiting evaluation. At time step t-1, beam search yields several candidate statement sequences. At time step t, several possible next tokens are generated for each candidate sequence of t-1, and appending these tokens to the t-1 candidate sequences forms the candidate statements of the current time step t. These newly generated statement sequences constitute the candidate statement set at the current moment. The candidate statement set is scored and the top k highest-scoring statements are retained; this process is repeated until sequence generation is complete. The candidate statement set thus realizes incremental search through expansion, and the statement sequence with the highest probability is finally output as the prediction result. The attention mechanism provides semantic guidance for generating the candidate set.
Here, the generation probability in the statement generation task is the probability that the model generates a given statement; it reflects the statement's plausibility. The candidate statement set contains several generated statements, and for each statement its joint probability, i.e., the product of the probabilities of generating all its tokens, is computed as its generation probability. The probability of each token can be computed by the attention mechanism and the LSTM model. The more plausible a statement, the higher its generation probability, so comparing the generation probabilities of all statements evaluates their quality. The top k statements with the highest generation probabilities are retained for the next expansion, and this is repeated until the search completes. Finally, the statement with the highest probability is output as the prediction result; the generation-probability estimate guides the incremental search.
Specifically, the generation probability of each statement in the candidate statement set of the current time step t is computed, and all statements are sorted by generation probability. The top-ranked beams statements are selected, where beams is a preset hyperparameter controlling the search width. These selected statements participate in the search at the next time step t+1: at time t+1, new candidate statements are generated based on the beams statements, the probabilities of the newly generated statements are computed, and the top beams are again selected. This process is repeated until sequence generation is complete. Such incremental search quickly approaches the optimal solution, the evaluation of generation probabilities balances search efficiency and effectiveness, and the dynamic semantics provided by the attention mechanism are fused into the probability computation.
Specifically, the previous steps are repeated, performing beam search on the candidate statement set of each time step: the generation probability of each statement at each moment is computed, and the statements with higher probability are selected for the next step. When decoding reaches the final time step T, the search process ends, and the statements of all time steps constitute a global candidate statement set. Among these candidates, the statement with the largest cumulative generation probability, i.e., the product of the probabilities of all its tokens, is selected. This statement is the globally optimal, highest-probability result and is output as the finally predicted SQL query statement. The incremental search quickly approaches the optimal statement, and the evaluation of generation probabilities ensures that the most reasonable statement is obtained.
Let the length of the input natural language sentence be N tokens. The decoder generates the SQL statement step by step, with time steps corresponding to tokens, so the decoding termination time T is determined by the input sentence length N. When the time step reaches T, the decoding process ends. After all searches finish, the candidate statement set of every time step is available. For each candidate statement, the cumulative generation probability P is computed by multiplying the per-step probabilities from time 1 onward to obtain the joint probability, i.e., P = p1 × p2 × … × pT, where pt is the generation probability at time t. The statement with the highest cumulative probability is the globally optimal solution. Determining the termination time and computing the cumulative probability completes the whole incremental search process.
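A compact sketch of the beam search described above. Here step_probs is a stand-in for the decoder's per-step token distribution, and log-probabilities are accumulated instead of multiplying raw probabilities, an implementation detail assumed here to avoid numerical underflow:

```python
import math

def beam_search(step_probs, beams=3, T=5):
    """step_probs(prefix) -> {token: probability} for the next step."""
    candidates = [([], 0.0)]                     # (token list, log-prob sum)
    for _ in range(T):                           # up to the final time step T
        expanded = []
        for seq, logp in candidates:
            for tok, p in step_probs(seq).items():
                # Accumulating logs is equivalent to multiplying probabilities.
                expanded.append((seq + [tok], logp + math.log(p)))
        # Keep only the top `beams` highest-scoring statements.
        candidates = sorted(expanded, key=lambda x: x[1], reverse=True)[:beams]
    return candidates[0]                         # highest cumulative probability

toy = {"SELECT": 0.6, "name": 0.3, "FROM": 0.1}  # toy, prefix-independent dist.
best_seq, best_logp = beam_search(lambda seq: toy, beams=2, T=3)
print(best_seq)  # ['SELECT', 'SELECT', 'SELECT'] under this toy distribution
```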
Further, computing the generation probability of each statement in the candidate statement set of the current time step t includes: acquiring the candidate statement set of the current time step t; splitting each candidate statement to obtain the tokens that compose it; counting the occurrence frequency of each token in the training set; converting each token's occurrence frequency into its emission probability; and multiplying the emission probabilities of all of a statement's tokens to obtain that statement's generation probability.
Specifically, at the current time step t, the candidate statement set is obtained through beam search, and a variable candidates_t is defined to store it. Each candidate statement s in candidates_t is traversed: s is split using string operations to obtain all the tokens composing it. For example, if s is "SELECT * FROM table", the split result is ["SELECT", "*", "FROM", "table"]. The split tokens are stored in a list tokens. This process is repeated for every statement in candidates_t, finally yielding a two-dimensional list whose outer layer holds the candidate statements and whose inner layer holds each statement's tokens. Token-level operations, such as computing the generation probability, are then performed. Splitting statements into tokens enables this subsequent fine-grained processing.
Specifically, the emission probability refers, in a generative model, to the probability of an observation being emitted by the corresponding state; it reflects the relationship between states and observations. First, the occurrence count of each token in the training set is tallied and converted into a frequency. The token frequency, mapped into the (0, 1) interval, serves as the token's emission probability; a higher emission probability indicates a more common token. For each token in a candidate statement, its emission probability is looked up, and the emission probabilities of all the tokens are multiplied to obtain the statement's generation probability. The higher the generation probability, the more plausible the statement. Statement quality is thus evaluated through statistical computation of token emission probabilities, which provide an important basis for judging token plausibility.
Specifically, for each statement s in the candidate statement set, s is split to obtain all the tokens composing it: [w1, w2, …, wn]. The emission probability of each token wi, denoted emit_prob(wi), is looked up. The emission probabilities of all tokens are multiplied to obtain the generation probability of statement s: generation_prob(s) = emit_prob(w1) × emit_prob(w2) × … × emit_prob(wn). The higher the generation probability, the more plausible the statement s, because a combination of high-probability tokens better matches the statistical regularities of the training set. This process is repeated for every statement in the candidate set, the generation probabilities of all statements are compared, and the statements with higher probability are selected for the next round of expansion search. Statement quality is thus evaluated through the computation of token emission probabilities.
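A direct transcription of this scoring rule into Python; the emit_prob table below is a toy assumption:

```python
# Score a candidate statement as the product of its token emission
# probabilities, as described above.
def generation_prob(sentence, emit_prob):
    tokens = sentence.split()      # e.g. "SELECT * FROM table" -> 4 tokens
    prob = 1.0
    for w in tokens:
        prob *= emit_prob(w)       # multiply per-token emission probabilities
    return prob

emit_prob = lambda w: {"SELECT": 0.05, "*": 0.04, "FROM": 0.05,
                       "table": 0.01}.get(w, 0.001)
print(generation_prob("SELECT * FROM table", emit_prob))  # 1e-06
```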
Further, computing the emission probability of each token includes: adding 1 to each token's occurrence count for smoothing; counting the total number of tokens in the training set; and computing each token's occurrence probability as its emission probability from the total token count and each token's smoothed occurrence count.
Specifically, the training set is traversed and the occurrence count of each token w is tallied as count(w). Each token's count is incremented by 1: count'(w) = count(w) + 1. This is Laplace smoothing, which solves the zero-frequency problem. The total number N of tokens in the training set is also counted, together with the dictionary size (the number of distinct tokens). The smoothed probability of each token is then computed as its emission probability: emit_prob(w) = count'(w) / (N + dictionary size), which maps into the (0, 1) interval. Adding the dictionary size to the denominator completes the Laplace smoothing, alleviating the data sparsity problem and giving every token a reliable non-zero probability. A higher emission probability indicates a more common token. These emission probabilities provide the probabilistic basis for the subsequent computation of statement generation probabilities; statistics plus smoothing yield more accurate token emission probabilities.
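The smoothing formula can be transcribed as follows; the eight-token training sequence is a toy assumption:

```python
from collections import Counter

# Laplace (add-one) smoothing of token emission probabilities, following
# emit_prob(w) = (count(w) + 1) / (N + dictionary size) as given above.
def emission_probs(training_tokens):
    counts = Counter(training_tokens)
    N = len(training_tokens)       # total number of tokens in the training set
    V = len(counts)                # dictionary size (distinct tokens)
    return lambda w: (counts.get(w, 0) + 1) / (N + V)

emit_prob = emission_probs(["SELECT", "name", "FROM", "student",
                            "SELECT", "phone", "FROM", "student"])
print(emit_prob("SELECT"))   # (2+1)/(8+5) ~ 0.2308
print(emit_prob("WHERE"))    # unseen token still gets (0+1)/(8+5) ~ 0.0769
```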
Further, the table structure analysis component extracts the database table structure using the FROM and WHERE clauses of SQL statements. Specifically, the input is an SQL query statement. Extracting the FROM clause from the query statement yields the table names involved in the query; for example, from the clause "FROM table1, table2", the names "table1" and "table2" can be extracted. The query conditions can be extracted from the WHERE clause to determine the column names involved.
For example, from "WHERE table1.id = table2.ref_id", the columns "table1.id" and "table2.ref_id" can be extracted. The table names from the FROM clause and the column names from the WHERE clause are merged, and a table is generated that stores the correspondence between table names and column names. This table reflects the database table structure information involved in the SQL query. Repeatedly parsing different SQL statements yields more comprehensive table structure knowledge, which helps the sequence-to-sequence model convert SQL queries more accurately.
Further, the database table names are identified using a FROM algorithm; the database column names are identified using a WHERE algorithm; and the database table structure is generated from the identified table names and column names. Here, the FROM algorithm is an algorithm for extracting the table names involved in a query from the FROM clause of an SQL statement, and the WHERE algorithm is an algorithm for extracting the column names involved in a query from the WHERE clause. In this application, an SQL query statement is input; the FROM algorithm is applied to the FROM clause to identify the tables involved, and the WHERE algorithm is applied to the WHERE clause to identify the columns involved. The FROM algorithm extracts table names through string matching and similar techniques, while the WHERE algorithm extracts column names through grammar parsing and similar techniques. A table structure table is generated from the extracted table names and column names, storing their correspondence. Applying the FROM and WHERE algorithms thus yields the table structure knowledge, which is then applied to the sequence-to-sequence model.
Specifically, an SQL query statement is input; the FROM algorithm is applied to identify the table names involved in the query, such as table1 and table2; the WHERE algorithm is applied to identify the column names involved, such as table1.col1 and table2.col2. A dictionary is created to store the table structure: for each identified table name table_i, a key-value pair is created in the dictionary, table_i: [column name 1, column name 2, …, column name n], where the value is the list of column names related to table table_i. In the final dictionary the keys are table names and the values are the column names related to each table; this dictionary is the generated representation of the database table structure. More SQL statements are repeatedly parsed to continuously enrich the dictionary, finally obtaining table structure knowledge of the whole database. This knowledge is then applied to the sequence-to-sequence model.
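A regex-based sketch of the FROM/WHERE extraction; a production system would need a real SQL parser, and this sketch handles only simple single-level queries:

```python
import re

def extract_schema(sql, schema=None):
    """Accumulate {table name: set of column names} from one SQL statement."""
    schema = schema if schema is not None else {}
    # FROM algorithm: table names listed between FROM and WHERE (or the end).
    m = re.search(r"\bFROM\s+([\w\s,]+?)(?:\bWHERE\b|$)", sql, re.I)
    tables = [t.strip() for t in m.group(1).split(",")] if m else []
    for t in tables:
        schema.setdefault(t, set())
    # WHERE algorithm: qualified column references such as table1.id.
    for table, column in re.findall(r"(\w+)\.(\w+)", sql):
        schema.setdefault(table, set()).add(column)
    return schema

schema = extract_schema(
    "SELECT * FROM table1, table2 WHERE table1.id = table2.ref_id")
print(schema)   # {'table1': {'id'}, 'table2': {'ref_id'}}
```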
3. Advantageous effects
Compared with the prior art, the advantages of the present application are as follows:
the table structure analysis component and the content analysis component are arranged, semantic information of the database can be analyzed, including table names, column names, data distribution and the like, and an important basis is provided for conversion. By applying the sequence to the sequence model and the attention mechanism, the semantics of the input sentence can be captured, which is beneficial to accurately understanding the intention of the user. The grammar rule set of the database is set, so that grammar errors of generated sentences can be restrained and corrected, and sentence correctness is improved. By adopting the beam search, the search space can be enlarged, better candidate sentences can be selected, and the conversion effect is improved. The sentence generation probability is calculated, the quality of different candidate sentences can be evaluated, and sentences with higher probability are output. The accuracy calculation of the word element emission probability can improve the fidelity of sentence generation. And a large-scale training corpus is constructed from a plurality of open data sets, so that the generalization capability of model training is facilitated. The training set scale is expanded, so that the diversity of training data can be increased, and the model robustness is improved.
Drawings
The present specification will be further described by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. The embodiments are not limiting, in which like numerals represent like structures, wherein:
FIG. 1 is an exemplary flow chart of a method for converting natural language statements to SQL statements based on a large language model according to some embodiments of the present disclosure;
FIG. 2 is an exemplary flow diagram for converting natural language statements to SQL statements, according to some embodiments of the present description;
FIG. 3 is an exemplary flow chart for acquiring semantic features according to some embodiments of the present description;
FIG. 4 is an exemplary flow chart for retrieving a final SQL statement according to some embodiments of the present description.
Detailed Description
The method and system provided in the embodiments of the present specification are described in detail below with reference to the accompanying drawings.
FIG. 1 is an exemplary flow chart of a method for converting natural language sentences into SQL statements based on a large language model according to some embodiments of the present description: collecting natural sentences and corresponding SQL statements from a plurality of open data sets to construct a training set; expanding the training set using templates based on grammar parsing and vocabulary replacement rules; constructing a sequence-to-sequence neural network, setting an encoder to extract features from the expanded training set and a decoder to generate corresponding SQL statements; in the encoder, setting a table structure analysis component that parses the corresponding database table structure, including table names, column names and data types, from the SQL statements in the training set; in the encoder, setting a database content analysis component that parses the corresponding database table contents, including the data value distribution, whether null values are present, and the value discrimination, from the SQL statements in the training set; and in the decoder, setting a database grammar rule set for checking and correcting grammar errors in the generated SQL statements.
Specifically, training data including natural sentences and corresponding SQL sentences is collected from a plurality of open data sets to construct a training set. On the training set, the training set is expanded by utilizing templates based on grammar analysis and vocabulary replacement rules, which increases the size of the training data and the diversity of language expressions. A sequence-to-sequence neural network model is constructed, in which the encoder performs feature extraction on the extended training set and the decoder generates the corresponding SQL sentences. In the encoder, a table structure analysis component is provided, which analyzes the corresponding database table structure from the SQL sentences in the training set, including information such as table names, column names and data types. The encoder is further provided with a database content analysis component, which analyzes the corresponding database table contents from the SQL sentences in the training set, including information such as data value distribution, whether null values are contained, and value discrimination. In the decoder, a database grammar rule set is set for checking grammar errors in the generated SQL sentences and making the necessary corrections, so that grammatically correct SQL sentences are produced. Through end-to-end training, a neural network model is obtained that takes natural language as input and outputs the corresponding SQL sentence. At test time, a natural language query is input and the model generates a grammatically correct SQL sentence, realizing the conversion from natural language to SQL.
Data collection and preprocessing: first, a large number of natural language queries and corresponding SQL statements are collected from databases in different domains as training data. The data is preprocessed, including text cleaning, word segmentation and part-of-speech tagging, for subsequent model training and processing. The student course selection system database is connected to obtain all archived SQL query sentences, and SELECT-type query statements are filtered out from them, such as "SELECT * FROM course WHERE student_id=123". For each SQL statement, a corresponding natural language description is constructed manually by the database administrator, such as "list all courses selected by the student with number 123". Each <natural language, SQL> statement pair is added to the training set. The process is repeated to collect tens of thousands of mapping statement pairs, ensuring the diversity of the statement pairs as much as possible. The collected training set data is stored in a table or a text document for subsequent processing. The training data simulates the interaction corpus between users and a database in a real scene, and the training set covers the important mappings between daily SQL queries and natural language. The resulting large-scale, high-quality corpus can be used for effective model training.
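A minimal sketch of this collection step, assuming a hypothetical list of archived queries and a hypothetical describe callback standing in for the administrator's manual annotation:

import json

def collect_pairs(archived_queries, describe):
    # keep SELECT-type statements and pair each with an administrator-supplied description
    pairs = []
    for sql in archived_queries:
        if sql.strip().upper().startswith("SELECT"):
            pairs.append({"nl": describe(sql), "sql": sql})
    return pairs

archived = ["SELECT * FROM course WHERE student_id=123"]
pairs = collect_pairs(
    archived,
    describe=lambda s: "list all courses selected by the student with number 123")
# store the collected statement pairs in a text document for subsequent processing
with open("train_pairs.json", "w", encoding="utf-8") as f:
    json.dump(pairs, f, ensure_ascii=False, indent=2)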
Text cleaning: question marks, spaces, carriage returns and other irrelevant content in the natural language sentences are deleted; punctuation marks such as periods, commas and pause marks are deleted; and the SQL sentences are converted entirely to lowercase. Word segmentation: the Chinese word segmentation package jieba is loaded, and its cut interface is called for each preprocessed natural language sentence s: words = jieba.cut(s), resulting in word sequences like "list/number/order/123/student/chose/all/course". Word segmentation breaks natural language into word units, which provides a basis for subsequent part-of-speech tagging, sequence conversion and the like. Cleaning improves sentence quality and word segmentation extracts key information; together, the two form an important link in text preprocessing.
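A brief sketch of cleaning plus jieba word segmentation (the cleaning pattern and the example sentence are illustrative):

import re
import jieba

def clean(sentence):
    # strip question marks, punctuation and whitespace noise from the sentence
    return re.sub(r"[？?，,。.、\s]+", " ", sentence).strip()

def tokenize(sentence):
    # jieba.cut returns a generator of word units
    return list(jieba.cut(clean(sentence)))

print(tokenize("列出学号为123的学生选的所有课程"))
# e.g. ['列出', '学号', '为', '123', '的', '学生', '选', '的', '所有', '课程']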
Part-of-speech tagging: the Chinese part-of-speech tagging tool (jieba's posseg module) is loaded, and the segmented sentence is tagged: tags = pseg.cut(sentence), resulting in a part-of-speech sequence like "list/v | number/n | is/v | 123/num | student/n | chose/v | all/a | course/n". Building training data: the preprocessed natural language sentence, the word segmentation and part-of-speech tagging results, and the corresponding SQL sentence are organized into one training sample, i.e., (natural language sentence, word sequence, part-of-speech sequence, SQL sentence). Tens of thousands of such samples are constructed to obtain a complete training corpus. The corpus contains representations of the input text at several levels as well as the SQL expected as output, and can be used for training the sequence-to-sequence model. Part-of-speech tagging provides semantic and grammatical structure information of the sentence, enriching the input content.
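A sketch of the tagging and sample-assembly step using jieba's posseg module (the field names of the sample dictionary are illustrative):

import jieba.posseg as pseg

def build_sample(nl_sentence, sql):
    # tag the sentence; pseg.cut yields (word, part-of-speech flag) pairs
    pairs = list(pseg.cut(nl_sentence))
    return {
        "sentence": nl_sentence,
        "words": [p.word for p in pairs],     # word sequence
        "pos_tags": [p.flag for p in pairs],  # part-of-speech sequence
        "sql": sql,
    }

sample = build_sample("列出学号为123的学生选的所有课程",
                      "SELECT * FROM course WHERE student_id=123")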
Structural analysis: the natural language query enters the table structure analysis component, which is responsible for resolving information such as the database table names, column names and data distributions involved in the natural language. Through this step, the system can understand the target data and related fields of the user query, providing an important basis for subsequent SQL statement generation. Inputting a natural language query: "list names and application professions of all students older than 20 years old", and using regular expressions (a table-name pattern and a column-name pattern) to match the query sentence and extract: table name "student"; column names "name", "age", "application specialty".
import re

query = "list names and application professions of all students older than 20 years old"
# schematic patterns for illustration; real patterns are tailored to the query phrasing
table_pattern = r'(students?)'
column_pattern = r'(names?|age|application professions?)'
table = re.search(table_pattern, query).group(1)
columns = re.findall(column_pattern, query)
print(table)    # students
print(columns)  # ['names', 'age', 'application professions']
By matching with the regular expressions, key information about the table structure can be conveniently extracted from the query, providing input for subsequent processing. The table-structure knowledge base is queried to obtain the structure of the "student" table: student (table name); name, age, application specialty (column names).
Analyzing the query conditions: the query condition is "age above 20 years"; extraction using comparison words yields: age column, greater than 20; the data distribution knowledge obtained is: age range >20. Building the table structure representation: table name: student; column names: name, age, application specialty; data distribution: {age column: {range: ">20"}}.
The representation is organized using a dictionary in Python:
table_schema = {
    "table name": "student",
    "column names": ["name", "age", "application specialty"],
    "data distribution": {
        "age column": {"range": ">20"}
    }
}
The above table structure representation fully encodes knowledge of table relationships, column properties, data distribution and the like. This structured knowledge facilitates guiding the sequence-to-sequence model in generating SQL.
The table structure representation encodes the key information of the natural language query: the table "student", the columns "name", "age", "application specialty", and the condition "age>20". The representation is input into the sequence-to-sequence model: the table names and column names are mapped into vector representations, and the conditions are input as additional features. In the decoding process that generates the SQL: the FROM clause is predicted from the table name vector, the SELECT and WHERE clauses are predicted from the column name vectors, and the condition filtering clause is predicted from the condition vector. The table structure representation effectively guides the generation of the SQL statement. Input: "list the names and application specialties of students older than 20 years old"; output: "SELECT name, application specialty FROM student WHERE age>20". Accurate conversion from the natural language query to the SQL query is finally realized.
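As a simplified illustration (the schema_to_sql helper below is hypothetical and far simpler than the neural decoding described later), the structured representation can be mapped onto SQL clauses by slot filling:

def schema_to_sql(schema, output_columns, conditions):
    # fill the SELECT / FROM / WHERE slots from the table structure representation
    select_clause = "SELECT " + ", ".join(output_columns)
    from_clause = "FROM " + schema["table name"]
    where_clause = "WHERE " + " AND ".join(conditions)
    return " ".join([select_clause, from_clause, where_clause])

schema = {"table name": "student"}
print(schema_to_sql(schema, ["name", "application specialty"], ["age>20"]))
# SELECT name, application specialty FROM student WHERE age>20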
Content analysis: at the same time, the natural language query is also fed into a content analysis component, which is responsible for deep parsing of the semantic information in the natural language, including the user intent, query conditions and the like. Through this step, the system can understand the query requirement of the user more comprehensively, laying a foundation for generating an accurate SQL statement. Inputting a natural language query: "find the number and average score of computer specialty students who are older than 20 years old and whose scores rank in the top 50%". Content analysis: user intent: query/output; query subject: computer specialty student; condition 1: age above 20 years; condition 2: score ranked in the top 50%; output columns: number, average score. Semantic parsing uses natural language understanding techniques such as named entity recognition and dependency syntactic analysis. The content analysis result is finally output:
{ "user intention": "query",
"query subject": "computer professional student",
"condition 1": "age >20",
"condition 2": "score ranking percentage <0.5",
"output column": [ "number", "average score" ] }
The content representation encodes user query intent and semantic information, providing guidance for subsequent SQL transformations.
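A minimal rule-based sketch of this content analysis (the keyword patterns and the output-column vocabulary are hypothetical stand-ins for the named entity recognition and dependency parsing mentioned above):

import re

def analyze_content(query):
    # extract intent, conditions and output columns with simple patterns
    result = {"user intention": "query", "conditions": [], "output columns": []}
    m = re.search(r"older than (\d+)", query)
    if m:
        result["conditions"].append("age>" + m.group(1))
    m = re.search(r"top (\d+)%", query)
    if m:
        result["conditions"].append(
            "score ranking percentage<" + str(int(m.group(1)) / 100))
    for col in ("number", "average score"):  # hypothetical output-column vocabulary
        if col in query:
            result["output columns"].append(col)
    return result

print(analyze_content("find the number and average score of computer "
                      "specialty students older than 20 and ranked in the top 50%"))
# {'user intention': 'query', 'conditions': ['age>20',
#  'score ranking percentage<0.5'], 'output columns': ['number', 'average score']}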
Encoder processing: after the table structure analysis and the content analysis, the natural language query is processed by the encoder. The encoder uses a neural network model to convert the natural language query into an intermediate representation containing the key information extracted in the first two steps, such as the database structure and semantic information. Natural language query: "find the number and average score of computer specialty students who are older than 20 years old and whose scores rank in the top 50%". Table structure representation:
{"table name": "student information table",
 "column names": ["age", "score", "specialty", "number", "average score"]}
Content analysis result:
{"user intention": "query",
 "query subject": "computer specialty student",
 "condition 1": "age>20",
 "condition 2": "score ranking percentage<0.5",
 "output columns": ["number", "average score"]}
The table structure representation provides the database field information, and the content analysis result provides the query intent and condition information. Together, the two supply the structured semantic knowledge of the natural language query. This information is input into the encoder, guiding the generation of the SQL statement.
A ready-made Seq2Seq module in TensorFlow or PyTorch is imported; the encoder is defined as a double-layer LSTM and the decoder as a single-layer LSTM:
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense

# define the maximum lengths of the input and output sequences
max_input_len = 100
max_output_len = 80
embed_dim = 256  # the inputs below are word-vector sequences of this dimension

# define the encoder layers: double-layer LSTM
encoder_inputs = tf.keras.Input(shape=(max_input_len, embed_dim))
enc_layer1 = LSTM(64, return_sequences=True)
enc_layer2 = LSTM(32, return_sequences=True, return_state=True)
encoder_outputs, h, c = enc_layer2(enc_layer1(encoder_inputs))
encoder_states = [h, c]

# define the decoder layer: single-layer LSTM
# (its state size must match the final encoder layer to accept encoder_states)
decoder_inputs = tf.keras.Input(shape=(max_output_len, embed_dim))
dec_layer = LSTM(32, return_sequences=True, return_state=True)
decoder_outputs, _, _ = dec_layer(decoder_inputs, initial_state=encoder_states)

# build the Seq2Seq model
model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
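The listing above only builds the network; a minimal end-to-end training sketch (the loss choice and the commented-out toy tensors are assumptions, not part of the disclosure) could look like:

# project the decoder outputs to vocabulary logits and train end to end
vocab_size = 10000
logits = Dense(vocab_size)(decoder_outputs)
train_model = tf.keras.Model([encoder_inputs, decoder_inputs], logits)
train_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# enc_x: (batch, max_input_len, embed_dim) encoded inputs (hypothetical)
# dec_x: (batch, max_output_len, embed_dim) shifted decoder inputs (hypothetical)
# dec_y: (batch, max_output_len) integer token ids (hypothetical)
# train_model.fit([enc_x, dec_x], dec_y, batch_size=32, epochs=10)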
Mapping each word in the query into a word vector by using an Embedding layer, and obtaining a word vector sequence as a vector representation of the query:
import tensorflow as tf
from tensorflow.keras.layers import Embedding

input_text = "find the number and average score of computer specialty students older than 20 years old whose scores rank in the top 50%"
# set the vocabulary size, e.g. 10000 words
vocab_size = 10000
# set the dimension of the word vectors, e.g. 256
embed_dim = 256
# build the Embedding layer
embedding = Embedding(vocab_size, embed_dim)
# word segmentation of the input text yields a word index list (toy indices; full list omitted)
word_indexes = tf.constant([[123, 4567, 683]])
# the Embedding layer converts the word indexes into word vectors
word_vectors = embedding(word_indexes)
# word_vectors is the word vector sequence representation of the text, shape (1, 3, 256)
Using an Embedding layer for table names and column names to obtain vectors, using One-Hot coding for conditions, output columns and the like to obtain vectors, and splicing the vectors to obtain feature vectors of the structural representation:
import tensorflow as tf
from tensorflow.keras.layers import Embedding

# map table names and column names through embedding layers
table_embedding = Embedding(1000, 64)
column_embedding = Embedding(500, 32)
# integer ids from a name vocabulary, e.g. "student table" -> 0, "number" -> 1, "name" -> 2
table_ids = tf.constant([0])          # input table name
column_ids = tf.constant([[1, 2]])    # input column name list
table_vec = table_embedding(table_ids)        # shape (1, 64)
column_vecs = column_embedding(column_ids)    # shape (1, 2, 32)

# use One-Hot encoding for the conditions
conditions = ["age>20", "score<60"]
condition_vecs = tf.one_hot(tf.range(len(conditions)), len(conditions))  # shape (2, 2)

# splice the vectors into the feature vector of the structure representation
feature_vector = tf.concat([
    tf.reshape(table_vec, [1, -1]),
    tf.reshape(column_vecs, [1, -1]),
    tf.reshape(condition_vecs, [1, -1]),
], axis=-1)  # shape (1, 132)
The word vector sequence is input into a first LSTM layer, the structured feature vector is directly input into a second LSTM layer, and the encoder outputs an intermediate semantic vector as input to the decoder:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Embedding, Concatenate

max_len, vocab_size, embed_dim, feature_dim = 100, 10000, 256, 132

# encoder inputs
text_input = Input(shape=(max_len,))
table_input = Input(shape=(feature_dim,))

# word vector layer
embed = Embedding(vocab_size, embed_dim)
text_vec = embed(text_input)

# first LSTM layer encodes the text
lstm1 = LSTM(64, return_sequences=True)
text_encoding = lstm1(text_vec)

# second LSTM layer summarizes the text sequence; the structured feature
# vector is fused with it by concatenation to form the intermediate vector
lstm2 = LSTM(32)
seq_encoding = lstm2(text_encoding)
encoding = Concatenate()([seq_encoding, table_input])

# build the encoder model
encoder = Model([text_input, table_input], encoding)
# the decoder is omitted here ...
# the encoder output `encoding` serves as the decoder input
FIG. 2 is an exemplary flow chart of converting natural language statements to SQL statements according to some embodiments of the present description. Decoder generation: the intermediate representation produced by the encoder is passed to the decoder, which generates the corresponding SQL statement according to the intermediate representation and a preset database grammar rule set. In this process, the decoder also performs grammar checking and correction on the generated SQL statement, ensuring that the generated statement conforms to the SQL grammar specification.
Decoder input: the intermediate vector representation output by the encoder, plus a database grammar rule set comprising SQL keywords, grammar templates and the like. The intermediate vector output by the encoder has dimension 128 and encodes the semantic content of the natural language query. Rule 1: SELECT * FROM table WHERE conditions; rule 2: SELECT columns FROM table WHERE conditions; rule 3: SELECT agg(column) FROM table WHERE conditions. SQL keywords: SELECT, FROM, WHERE, GROUP BY, ORDER BY, LIMIT.
SQL statement templates:
# query template
SELECT columns FROM table WHERE conditions
# aggregation template
SELECT agg_func(column) FROM table WHERE conditions GROUP BY column
# ordering template
SELECT columns FROM table WHERE conditions ORDER BY column ASC/DESC
# paging template
SELECT columns FROM table WHERE conditions LIMIT offset, count
These keywords and templates define the grammar structure of the SQL language; based on this structured knowledge, the decoder generates the key components such as SELECT and FROM and selects the appropriate template. Encoder vector: a real-valued vector of length 128 represents the semantic content of the query "count the students with an average score greater than 85 points". Grammar rule set: keywords SELECT, FROM, WHERE, COUNT, including the template: SELECT COUNT(*) FROM table WHERE conditions. The decoder generates the SQL as follows: the sequence is initialized to [START]; the next word is predicted to be SELECT according to the encoder vector and the current state; SELECT COUNT(*) is generated according to the template rule; the WHERE condition is generated according to the encoder vector and the template; and finally the output is: SELECT COUNT(*) FROM students WHERE grade>85. The encoder vector provides the semantic content, the grammar rules provide the SQL structure, and the decoder combines both to generate correct SQL.
Decoder processing: using an LSTM-based Seq2Seq model, the next word is predicted at each time step according to the current state and the grammar rule set, and the generated word sequence is composed into an SQL statement:
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense

# decoder LSTM layer (state size assumed to match the encoder state)
decoder_lstm = LSTM(128, return_sequences=True, return_state=True)
# fully connected layer for predicting the next word
output_layer = Dense(vocab_size)

# initial state s0 comes from the encoder; start_vec is the [START] embedding
state = encoder_states
inputs = start_vec  # shape (1, 1, embed_dim)
sql_query = []

# generate the word sequence in a loop
for i in range(max_len):
    # advance one time step from the current state
    out, h, c = decoder_lstm(inputs, initial_state=state)
    # predict the next word from the current state
    # (the grammar rule set can mask out invalid words here)
    logits = output_layer(out[:, -1, :])
    next_id = int(tf.argmax(logits, axis=-1)[0])
    # add the predicted word to the query sequence (id_to_word maps ids to words)
    sql_query.append(id_to_word[next_id])
    # update the state and feed the prediction back as the next input
    state = [h, c]
    inputs = embedding(tf.constant([[next_id]]))

# compose the SQL query statement
sql_statement = " ".join(sql_query)
Grammar checking and correction: the statement grammar is checked, such as keyword order and parenthesis matching, and erroneous grammar is corrected by replacing or adding keywords or parentheses:
# generated SQL statement
sql = "SELECT * FROM table WHERE score>80 ORDER desc"

# grammar checking
def syntax_check(sql):
    # check keyword order: a WHERE clause must precede an ORDER clause
    if "WHERE" in sql and "ORDER" in sql and sql.index("WHERE") > sql.index("ORDER"):
        return False
    # check that parentheses match
    if sql.count("(") != sql.count(")"):
        return False
    # check that an ORDER clause takes the ORDER BY form
    if "ORDER" in sql and "ORDER BY" not in sql:
        return False
    return True

# correct grammar errors
if not syntax_check(sql):
    # rewrite the malformed ORDER clause into a valid ORDER BY clause
    sql = sql.replace("ORDER desc", "ORDER BY score DESC")

# output after correction
print(sql)
# SELECT * FROM table WHERE score>80 ORDER BY score DESC
Inputting a natural language query: "find the average score of students older than 20 years old". Encoder output vector: a length-128 vector encoding the semantic content of the query. The decoder generates the SQL as follows: initialization sequence [START]; the SELECT keyword is predicted from the encoder vector, and SELECT AVG(score) is generated from the grammar rules; the WHERE age>20 condition is predicted according to the encoder vector, and FROM students is generated according to the grammar rules. Final output: SELECT AVG(score) FROM students WHERE age>20. The decoder comprehensively utilizes the semantic representation of the encoder and the grammar knowledge of the database to accurately realize the conversion from natural language to SQL.
FIG. 3 is an exemplary flow chart for acquiring semantic features according to some embodiments of the present description. Result return and optimization: the generated SQL statement is sent to the database for execution, and the query result is obtained. The system verifies and analyzes the result; if there is a grammar error or the query result does not meet expectations, the system optimizes and adjusts, iterating over the whole flow. Generating the SQL statement: user query: "find the names of the students in the computer science class of 2021 whose scores rank in the top 10%"; generated statement: "SELECT name FROM students WHERE major='computer science' AND grade_rank<=0.1 ORDER BY grade DESC LIMIT 10". Verifying and executing the SQL: the statement is executed in the database, and an error prompt is returned if there is a grammar error. Analyzing the query result: it is checked whether the result meets expectations, such as verifying whether the returned records belong to the computer science class of 2021 and whether the number of results corresponds to the expected top 10% of students. If the verification fails: the condition range is adjusted, e.g., grade_rank<=0.1 instead of grade_rank<0.1; keywords are corrected, e.g., DESC is changed to ASC; and the verification and optimization iteration is repeated until an SQL query statement meeting expectations is generated.
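A minimal sketch of this execute-verify-adjust loop using sqlite3 (the database file, table and verification rule are illustrative assumptions):

import sqlite3

def execute_and_verify(sql, verify):
    # run the generated SQL and apply a caller-supplied verification rule
    conn = sqlite3.connect("school.db")  # illustrative database file
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as err:
        return False, str(err)  # grammar error: feed back for correction
    finally:
        conn.close()
    return verify(rows), rows

ok, result = execute_and_verify(
    "SELECT name FROM students WHERE major='computer science' AND grade_rank<=0.1",
    verify=lambda rows: len(rows) > 0)  # illustrative expectation check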
FIG. 4 is an exemplary flow chart for obtaining the final SQL statement according to some embodiments of the present description.
User query statement: "find the average score of students older than 20 years old". Semantic parsing: query intent: query; query subject: student; condition 1: age>20; required output: average score; other key information: none. The key semantic components of the query statement are extracted through semantic parsing, comprising: the query intent (query), the query subject (student), the condition (age>20) and the output (average score), which provides important semantic input for the subsequent encoding process and SQL conversion.
Performing statement vectorization: words such as "age" and "student" are mapped into word vectors by a method such as Word2Vec, and the word vectors are spliced to obtain the vector representation of the statement. Performing table structure vectorization: the "student table", "age" and "score" columns are encoded by a method such as One-Hot to obtain the vector representation of the table structure. Input encoder (e.g., LSTM): the statement vector representation and the table structure vector are taken as encoder inputs, and the encoder outputs an intermediate vector representation of the statement with a length of 128. The intermediate vector fully encodes the semantic content of the statement and the table structure information, and the decoder can generate the corresponding SQL statement from this vector.
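A sketch of the two vectorization steps using gensim's Word2Vec and a One-Hot table over a toy schema vocabulary (the corpus and dimensions are illustrative):

import numpy as np
from gensim.models import Word2Vec

# train word vectors on the tokenized corpus (toy corpus for illustration)
corpus = [["find", "average", "score", "students", "older", "than", "20"]]
w2v = Word2Vec(corpus, vector_size=100, window=5, min_count=1)

# statement vector: splice (concatenate) the word vectors of the sentence
words = ["students", "older", "than", "20"]
sentence_vec = np.concatenate([w2v.wv[w] for w in words])  # length 400

# table structure vector: One-Hot over a small schema vocabulary
schema_vocab = ["student table", "age", "score"]
one_hot = np.eye(len(schema_vocab))
table_vec = one_hot[schema_vocab.index("student table")]  # [1., 0., 0.]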
The encoder outputs the intermediate vector of the statement (a real-valued vector of length 128). Decoder input: the encoder vector plus the database grammar rule set. The decoder generates the SQL as follows:
initialization sequence: [START]; the SELECT keyword is predicted from the vector, and SELECT AVG(score) is generated from the rules; the WHERE age>20 condition is predicted from the vector, and FROM students is generated according to the rules; the final SQL statement is generated: "SELECT AVG(score) FROM students WHERE age>20". Executing this SQL statement in the database yields the query result. The decoder realizes the conversion from natural language to SQL through the encoder vectors and the grammar rules.
The generated SQL is executed; if the query result is correct, it is returned directly to the user. If the result is wrong: check whether the semantic parsing accurately extracted the query conditions, and check whether the encoding process accurately encoded the semantics. Error example: user query: "find the students in the computer science class of 2021 whose scores rank in the top 10%"; generated SQL: "SELECT * FROM students WHERE major='computer science' AND grade_rank<=0.1 LIMIT 10"; the verification result is unsatisfactory because the returned students are not restricted to the class of 2021. The semantic parsing is adjusted to add recognition of "class of 2021", and the encoder is adjusted to add an encoded representation of the class year. This is repeated until the SQL is generated correctly, after which the result is returned to the user; modules such as semantic parsing and encoding continue to be optimized to improve the overall query conversion quality.
The foregoing describes the invention and its embodiments schematically, and the description is not limiting; the invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The drawings depict only one embodiment of the invention, the actual construction is not limited thereto, and any reference numeral in the claims should not limit the claims concerned. Therefore, if a person of ordinary skill in the art, informed by this disclosure, devises structural manners and embodiments similar to the technical solution without creative effort and without departing from the gist of the invention, they shall all fall within the protection scope of this application. In addition, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The various elements recited in the product claims may also be implemented in software or hardware. The terms first, second, etc. are used to denote names only and do not denote any particular order.

Claims (10)

1. A method for converting natural sentences into SQL sentences based on a large language model, comprising:
collecting natural sentences and corresponding SQL sentences from a plurality of open data sets to construct a training set;
On the training set, expanding the training set by utilizing a template based on grammar analysis and vocabulary replacement rules;
constructing a neural network from sequence to sequence, setting an encoder to extract characteristics of the extended training set, and setting a decoder to generate a corresponding SQL sentence;
in the encoder, a table structure analysis component is arranged to analyze a corresponding database table structure from SQL sentences in a training set, wherein the database table structure comprises a table name, a column name and a data type;
in the encoder, a database content analysis component is arranged, and corresponding database table contents are analyzed from SQL sentences in the training set, the database table contents comprising data value distribution, whether null values are contained, and value discrimination;
in the decoder, a database grammar rule set is set for checking grammar errors of the generated SQL sentence and correcting.
2. The method for converting natural sentences to SQL sentences based on the large language model according to claim 1, wherein the method comprises the following steps:
constructing a sequence-to-sequence neural network, comprising:
setting a bidirectional LSTM encoder comprising a forward LSTM unit and a backward LSTM unit, which respectively carry out positive-sequence and reverse-sequence semantic analysis on the natural sentence to obtain positive-sequence and reverse-sequence encoder hidden state sequences;
setting a decoder based on an attention mechanism, wherein the decoder comprises an attention layer and an LSTM unit;
the attention layer outputs attention weight vectors step by step according to the obtained positive-sequence and reverse-sequence encoder hidden state sequences;
for each time step, weighting the positive-sequence and reverse-sequence encoder hidden state sequences with the attention weight vector to obtain the decoder semantic features of the corresponding time step;
and generating a corresponding SQL sentence by the LSTM unit of the decoder according to the obtained semantic features.
3. The method for converting natural sentences into SQL sentences based on the large language model according to claim 2, wherein the method comprises the following steps:
acquiring semantic features of a decoder, comprising:
setting linear projection layers for reducing the dimensionality of the positive-sequence and reverse-sequence hidden state sequences of the encoder;
inputting the positive-sequence hidden state sequence h_f of the encoder and the reverse-sequence hidden state sequence h_b into the corresponding linear projection layers respectively;
the linear projection layers project the positive-sequence hidden state sequence h_f and the reverse-sequence hidden state sequence h_b, and output the projected positive-sequence hidden state sequence h'_f and the projected reverse-sequence hidden state sequence h'_b;
splicing the projected positive-sequence hidden state sequence h'_f and the projected reverse-sequence hidden state sequence h'_b in the feature dimension to obtain the encoder fusion state s_t at the current time t.
4. The method for converting natural language sentences into SQL sentences based on the large language model according to claim 3, wherein the method comprises the following steps:
acquiring semantic features of the decoder, further comprising:
setting a residual connection layer so that the positive-sequence hidden state sequence h_f and the reverse-sequence hidden state sequence h_b of the encoder are joined to the encoder fusion state s_t, obtaining the encoder fusion feature f_t;
inputting the encoder fusion feature f_t into the attention layer of the decoder;
the decoder outputs the decoder semantic features at the current time t based on the fusion feature f_t.
5. The method for converting natural sentences into SQL sentences based on the large language model according to claim 2, wherein the method comprises the following steps:
the LSTM unit of the decoder generates a corresponding SQL sentence according to the obtained semantic features, comprising:
for each time step t, the LSTM unit of the decoder generates SQL sentence word elements of the current time step t according to semantic features output by the attention layer at the time step t, wherein the word elements are expressed by using one-hot vectors, and the word elements comprise database table names and column names of the database tables;
and connecting the SQL sentence word generated in the time step t to the SQL sentence word accumulated in the time steps 1 to t-1 to form a preliminary SQL sentence of the current time step t.
6. The method for converting natural language sentences into SQL sentences based on the large language model according to claim 5, wherein the method comprises the following steps:
generating a corresponding SQL statement, further comprising:
performing incremental search on the generating process of the SQL sentence by adopting a beam search algorithm;
acquiring a preliminary SQL sentence of the current time step t as a first candidate sequence of a beam search;
based on the candidate sentence set of the previous time step t-1, adding a word element through word element expansion, generating a candidate sentence set of the current time step t, wherein the candidate sentence set represents sentence selection to be evaluated;
calculating the generation probability of each sentence in a candidate sentence set of the current time step t;
selecting the beam-width sentences with the highest generation probability to participate in the beam search of the next time step t+1;
repeating the steps, traversing to a final time step T, and selecting the sentence with the highest cumulative probability from the candidate sentence set of all time steps as a final SQL sentence;
the final time step T represents the termination time of the traversal search process and is determined according to the length of the input natural sentence; the sentence with the highest cumulative probability is found by traversing the candidate sentence sets of all time steps and calculating, for each candidate sentence, the cumulative probability from the beginning time step to the final time step T.
7. The method for converting natural language sentences into SQL sentences based on the large language model according to claim 6, wherein the method comprises the following steps:
calculating the generation probability of each sentence in the candidate sentence set of the current time step t, wherein the generation probability comprises the following steps:
acquiring a candidate sentence set of the current time step t;
splitting each candidate sentence to obtain each word element forming the candidate sentence;
counting the occurrence frequency of each word element in the training set;
converting the calculated occurrence frequency of each word element into the emission probability of each word element;
and multiplying the emission probabilities of all the word elements of each sentence to obtain the generation probability of the sentence.
8. The method for converting natural language sentences into SQL sentences based on the large language model according to claim 7, wherein the method comprises the following steps:
calculating the emission probability of each word element comprises the following steps:
adding 1 to the occurrence frequency of each word element for smoothing;
counting the total number of the word elements in the training set;
and calculating the occurrence probability of each word element as the emission probability according to the total word element number and the occurrence frequency of each word element after the smoothing treatment.
9. The method for converting natural language sentences into SQL sentences based on the large language model according to any one of claims 1 to 8, wherein:
The table structure analysis component adopts FROM and WHERE clauses of SQL sentences to extract the database table structure.
10. The method for converting natural language sentences into SQL sentences based on the large language model according to claim 9, wherein the method comprises the following steps:
identifying the database table names from the FROM clauses;
identifying the database column names from the WHERE clauses;
and generating a database table structure according to the identified database table name and the database column name.
CN202410255498.XA 2024-03-06 2024-03-06 Method for converting natural sentence into SQL sentence based on large language model Active CN117827886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410255498.XA CN117827886B (en) 2024-03-06 2024-03-06 Method for converting natural sentence into SQL sentence based on large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410255498.XA CN117827886B (en) 2024-03-06 2024-03-06 Method for converting natural sentence into SQL sentence based on large language model

Publications (2)

Publication Number Publication Date
CN117827886A true CN117827886A (en) 2024-04-05
CN117827886B CN117827886B (en) 2024-04-30

Family

ID=90515682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410255498.XA Active CN117827886B (en) 2024-03-06 2024-03-06 Method for converting natural sentence into SQL sentence based on large language model

Country Status (1)

Country Link
CN (1) CN117827886B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111090664A (en) * 2019-07-18 2020-05-01 重庆大学 High-imitation person multi-mode dialogue method based on neural network
US20200301924A1 (en) * 2019-03-20 2020-09-24 Guangdong University Of Technology Method for constructing sql statement based on actor-critic network
US20200334233A1 (en) * 2019-04-18 2020-10-22 Sap Se One-shot learning for text-to-sql
CN112712804A (en) * 2020-12-23 2021-04-27 哈尔滨工业大学(威海) Speech recognition method, system, medium, computer device, terminal and application
CN113609922A (en) * 2021-07-13 2021-11-05 中国矿业大学 Continuous sign language sentence recognition method based on mode matching
CN115658729A (en) * 2022-11-02 2023-01-31 广东工业大学 Method for converting natural language into SQL (structured query language) statement based on pre-training model
CN116991869A (en) * 2023-07-24 2023-11-03 北京泰策科技有限公司 Method for automatically generating database query statement based on NLP language model

Also Published As

Publication number Publication date
CN117827886B (en) 2024-04-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant