CN117033423A - SQL generation method injecting optimal schema items and historical interaction information - Google Patents

SQL generation method injecting optimal schema items and historical interaction information

Info

Publication number
CN117033423A
CN117033423A
Authority
CN
China
Prior art keywords
natural language
sql
database
node
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311004567.1A
Other languages
Chinese (zh)
Inventor
雷思情
桑楠
李贞昊
唐思佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202311004567.1A
Publication of CN117033423A
Legal status: Pending

Classifications

    • G06F16/24522 Translation of natural language queries to structured queries
    • G06F16/2433 Query languages
    • G06F16/24526 Internal representations for queries
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. likelihood ratio
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/0455 Auto-encoder networks; encoder-decoder networks
    • G06N3/08 Learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an SQL generation method that injects optimal schema items and historical interaction information. It belongs to the semantic parsing field of natural language processing and addresses the difficulty the prior art has in generating the structured language SQL from complex, cross-domain Chinese natural language in real business-intelligence applications. The invention injects the schema items most strongly correlated with the Chinese natural language query into the model through a pre-screening cross encoder; it then uses the user's historical queries to simplify the user's complex query step by step; finally, it provides a complete SQL structure prediction model that generates a node sequence from the root node in depth-first order, fusing the features of the previous step, the parent node features, the context features, and the node type features, and maximizing the probability of the correct SQL statement for the natural language question. An SQL abstract syntax tree is generated and then traversed depth-first to produce the final SQL statement.

Description

SQL generation method injecting optimal schema items and historical interaction information
Technical Field
The invention belongs to the field of semantic parsing in natural language processing, and in particular relates to a technique for converting natural language into structured SQL statements.
Background
Semantic parsing is the task of mapping natural language queries into corresponding machine-executable logical forms. Converting complex natural language into SQL (Structured Query Language) statements is one branch of semantic parsing; it aims to break down the barrier between non-technical users and databases and to shorten the distance between users and data. The release of the large cross-domain datasets Spider and WikiSQL greatly advanced research on converting natural language into SQL, but these datasets are English-based. The Chinese dataset TableQA released through competition, together with DuSQL and the translated English dataset CSpider, has strongly promoted the development and deployment of SQL generation technology in Chinese scenarios.
However, current research results still cannot put the technology to real use in deployed scenarios, for the following reasons: (1) complex, spoken-form queries in Chinese natural language make it difficult for a machine to understand the user's actual query intent; (2) the databases designed for real deployed applications can be so large that their full schema information cannot be fed into the model; (3) changes in database structure degrade performance.
Early research mostly assumed a comprehensive grasp of the information contained in the database and fed all of it into the model; furthermore, in practical application scenarios users find it difficult to express their full query intent in a single sentence, which makes complex SQL statements hard to generate.
Disclosure of Invention
The invention aims to solve the problem of generating the structured language SQL from complex, cross-domain Chinese natural language in real business-intelligence applications. It provides an SQL generation method that injects optimal schema items and historical interaction information: a pre-screening cross encoder is designed to extract the optimal schema items, the user's historical interaction information is injected to simplify the generation of complex queries, and the final correct SQL statement is generated from an SQL syntax tree in depth-first order.
The technical solution adopted by the invention is an SQL generation method injecting optimal schema items and historical interaction information, comprising the following steps:
S1, a pre-screening cross encoder computes the schema items most relevant to the natural language query from the input natural language query and the corresponding database schema items;
S2, an SQL encoding module integrating historical interaction information constructs an interaction graph over the natural language and the historical queries from the optimal schema items output by the pre-screening cross encoder and the user's historical interaction information, and encodes the resulting interaction graph;
S3, the encoding result of step S2 is decoded based on the SQL structure prediction model.
The beneficial effects of the invention are as follows. The SQL generation method injecting optimal schema items and historical interaction information has these advantages:
1. When converting Chinese natural language queries into SQL in deployed real-world applications, the database may be too large to feed into the model in full. The pre-screening cross encoder first computes the database schema items (tables and columns) most relevant to the natural language query, and only the optimal schema items, the top k1 database tables and top k2 data columns, are injected into the model, ensuring a high recall rate while preventing the injection of excessive noise;
2. The SQL statements corresponding to complex natural language sentences are themselves very complex to generate, so the invention proposes that, before entering the final Chinese natural language query, the user first enters several simple, interdependent query sentences, on the basis of whose historical interaction information the final complex query is generated;
3. Relations between tokens of the input sequence are added to the Transformer-based computation, biasing the model toward these predefined relations during processing. They comprise: the relations R_S among database schema items, i.e., between database tables and columns; the linking relations R_QS between tokens of the Chinese natural language query and the data tables and data columns of the database schema; the dependency relations R_Q among the tokens of the natural language query; and the relations R_QH between tokens of the current natural language query and tokens of historical queries;
4. The method builds an SQL statement generation model for Chinese natural language that decodes a syntax tree in depth-first order. Decoding actions fall into two classes, generating grammar keywords and generating database schema items; the possible candidates are determined by the current node type and the parent node type, an LSTM computes the probability distribution, and the candidate with the highest probability is selected. This decoding scheme matches the structural characteristics of SQL statements and of structured data, avoids grammatical errors, supports generating SQL statements with complex structures including table joins and nested queries, and gives the model higher accuracy.
Drawings
FIG. 1 shows the technical steps of an implementation of the present invention;
FIG. 2 is a schematic flow diagram of the cross-lingual encoding module of the pre-screening cross encoder;
FIG. 3 is an exemplary diagram of the steps by which the pre-screening cross encoder computes database schema item probabilities;
FIG. 4 is an example of the values of k1 and k2;
FIG. 5 is a schematic illustration of the linking between a natural language query and the database schema;
FIG. 6 is a schematic diagram of co-reference relations between natural language query rounds;
FIG. 7 is a schematic flow chart of encoding the optimal schema items to obtain relation-aware encoding vectors;
FIG. 8 is a schematic diagram of the SQL decoding process based on the SQL syntax tree.
Detailed Description
The present invention is further explained below with reference to the drawings, to help those skilled in the art understand its technical content.
As shown in FIG. 1, the method of the invention comprises the following specific steps:
S1, the pre-screening cross encoder model screens out the top k1 database tables and top k2 database columns most strongly correlated with the natural language. k1 and k2 are two important hyperparameters that must be set manually: too large a setting admits many unnecessary schema items, while too small a setting may exclude important ones. To avoid the problem that a large-scale database in a real application scenario cannot be fed into the model in full, a cross encoder is designed that takes the natural language query and the database schema information as one input sequence. The pre-screening cross encoder model comprises a data preprocessing module, an encoding module, a column-enhanced table embedding module, and a loss calculation module.
The method specifically comprises the following sub-steps:
S11, the data preprocessing module preprocesses both the natural language and the database schema. User-entered Chinese natural language may contain spoken and non-normalized expressions: 2022 may be expressed as "last year", "year 22", or "twenty twenty-two", and the amount 100000 as "one hundred thousand", whereas the corresponding value is stored in the database as a normalized exact value; data of value types such as time, currency, and percentage are therefore uniformly normalized into exact values. Preprocessing of the database schema replaces database tables and columns with semantic names: the id column in the airports table, for instance, identifies the unique identifier of an airport, but the model cannot recognize its real meaning or whether it is a column related to the natural language query, so the semantic name (airport id) brings the schema item closer to its real semantics and eases the subsequent screening.
Specifically, semantic names replace the database schema items so that the real meaning of each database item can be recognized during screening and matching; the similarity between the Chinese natural language and the semantically represented database items is computed, and the schema items are classified. The data tables in the database and the concrete column names in each table are traversed; the real semantic name of a data table is restored from its comment, and that of a data column from the information of its table together with its column comment.
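As an illustration only, the value normalization of S11 can be sketched in Python as follows; the phrase table and the function name normalize_values are assumptions made for this example, not the implementation of the invention:

from datetime import date

# Toy sketch of S11 value normalization: rewrite a few spoken-form
# Chinese value expressions into the normalized exact values stored in
# the database. A real system would use a full number/date normalizer.
PHRASE_TABLE = {
    "去年": str(date.today().year - 1),  # "last year" -> e.g. "2024"
    "十万": "100000",                    # "one hundred thousand"
}

def normalize_values(query: str) -> str:
    # Longest phrases first so overlapping entries cannot clash.
    for phrase in sorted(PHRASE_TABLE, key=len, reverse=True):
        query = query.replace(phrase, PHRASE_TABLE[phrase])
    return query

print(normalize_values("去年从北京出发、预算十万以内的航班"))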
S12, the encoding module encodes the preprocessed natural language query and database schema items into vectors covering the Chinese natural language query, the historical queries, and the English database schema representation; the concatenated input is encoded with the cross-lingual pre-trained model XLM-R (Cross-lingual Language Model RoBERTa).
In detail, the N database tables of one database are denoted T = {t_1, t_2, ..., t_N}, the set of all tables, and all columns are denoted C, grouped by the table they belong to. For a database column, c_j^i denotes the vector representation of the j-th column of the i-th table t_i, and the column type τ_j^i of that column (number or character) is prepended to the column name. The unstructured natural language query, the historical queries, and the database schema represented by semantic names are then concatenated, with | as the delimiter, into a fused feature representation; this normalized fused representation X_concat serves as the input sequence of the PLM (Pre-trained Language Model). Because the input sequence mixes the Chinese natural language query with schema items expressed in English, this embodiment encodes it with the cross-lingual model XLM-R, obtaining semantic encoding vectors for the Chinese natural language query and for the semantic names of the database table and column names. The PLM splits a schema item into several tokens (for example, "source_airport" is split into "source" and "airport"), but as a condition for screening columns all information of a schema item should be fused into one encoding vector.
Therefore, this embodiment applies a BiLSTM layer to obtain an encoded representation of the hidden states of the tokenized schema items, represents each schema item as one whole embedding via pooling, aggregating its hidden states into a fixed-size representation, and adds a non-linear fully connected layer on top of the pooled representation so that the schema items can capture complex relations and patterns in the data. After this pooling, each table embedding is denoted T_i ∈ R^(1×d) and each column embedding C_j^i ∈ R^(1×d), where d is the hidden layer size. The detailed steps are shown in FIG. 2.
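A minimal PyTorch sketch of this pooling step follows; the mean pooling and the module name are assumptions standing in for the unspecified pooling operation:

import torch
import torch.nn as nn

class SchemaItemPooler(nn.Module):
    # Fuses the sub-token vectors of one schema item (a table or column
    # name) into a single embedding: BiLSTM over the PLM hidden states,
    # pooling, then a non-linear fully connected layer.
    def __init__(self, d: int):
        super().__init__()
        self.bilstm = nn.LSTM(d, d // 2, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(d, d), nn.Tanh())

    def forward(self, token_vecs: torch.Tensor) -> torch.Tensor:
        # token_vecs: (num_subtokens, d) for one schema item
        hidden, _ = self.bilstm(token_vecs.unsqueeze(0))  # (1, n, d)
        pooled = hidden.mean(dim=1)                       # (1, d)
        return self.fc(pooled)                            # item embedding

# e.g. the sub-tokens of "source_airport" collapse into one (1, d) vector
emb = SchemaItemPooler(256)(torch.randn(4, 256))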
S13, the table embedding module enhanced with column information: when the natural language explicitly mentions only a column name of a data table and not the table name, the column information is injected into the corresponding table encoding, so that the right table can still be identified when the natural language involves only column-related information. Column information is injected into the table embeddings by stacking multi-head scaled dot-product attention layers, and a feature fusion layer yields the column-information-enhanced table embeddings.
Specifically, as in the example of FIG. 3, the natural language mentions only "Beijing" and not the table name ("airports") that the column belongs to, in which case the table could be filtered out. The invention therefore injects column information into the table embedding, merging the embeddings of a table's columns into the table embedding with a multi-head attention mechanism, where the table embedding T_i serves as the query matrix of shape (1×d) and the column matrix C_i serves as both the key and value matrices of shape (n_i×d), with n_i the number of columns and h the number of heads. Each head has its own randomly initialized Q/K/V weight matrices, so the model attends to different positions. The attention weight of each column is computed separately, a weighted sum gives the column-aware table embedding produced by the multi-head attention, and a normalization function then yields the column-enhanced table embedding T̂_i.
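The column-to-table injection can be sketched as follows; the residual add plus LayerNorm is an assumption standing in for the feature fusion layer, whose exact form is not given:

import torch
import torch.nn as nn

class ColumnEnhancedTable(nn.Module):
    # The table embedding T_i attends over its column embeddings C_i with
    # multi-head scaled dot-product attention; the fused result is added
    # back to T_i and normalized to give the column-enhanced embedding.
    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, table: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
        # table: (1, d) is the query; cols: (n_i, d) are keys and values
        q, kv = table.unsqueeze(0), cols.unsqueeze(0)     # add batch dim
        fused, _ = self.attn(q, kv, kv)                   # (1, 1, d)
        return self.norm(table + fused.squeeze(0))        # column-aware T_i

t_hat = ColumnEnhancedTable(256)(torch.randn(1, 256), torch.randn(5, 256))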
S14, the loss calculation module. The screening can in fact be viewed as a similarity task between the natural language question and the tables, so a classification loss is ultimately computed. Because an SQL statement contains only a few of the database's tables and columns, the negative samples of the classification task outnumber the positive ones many times over; to handle this sample imbalance, the focal loss is adopted as the classification loss. The loss function of the pre-screening cross encoder is formed in a multi-task manner, as the sum of a table classification loss and a column classification loss.
In detail, two different MLP (Multi-Layer Perceptron) modules classify the enhanced table embeddings and the column embeddings respectively, computing each item's probability of being relevant to the natural language question:
P(t_i) = σ(MLP_t(T̂_i)),  P(c_j^i) = σ(MLP_c(C_j^i))
where all MLP weights are trainable parameters and σ(·) denotes softmax. In a real deployed scenario, the SQL query a user needs to write generally involves only a few tables of the database, so screening the data tables and columns produces many negative samples; to solve this sample-imbalance problem the invention adopts the focal loss. In a multi-task learning manner, the loss is computed over the data table classification and the data column classification:
L = Σ_(i=1..N) FL(P(t_i), y_i) + Σ_(i=1..N) Σ_(k=1..n_i) FL(P(c_k^i), y_k^i),  with  FL(p, y) = -y (1-p)^γ log p - (1-y) p^γ log(1-p)
where N is the number of all tables and n_i the number of columns of the i-th table; y_i is the true label of the i-th table, with y_i = 1 if the corresponding SQL statement contains this table and 0 otherwise; y_k^i is the true label of the k-th column of the i-th table, equal to 1 if the column is referenced in the corresponding SQL statement and 0 otherwise.
Finally, the relevance probabilities of the natural language to the different tables and columns are obtained, and in this embodiment the k1 tables and k2 columns with the highest probabilities are kept. k1 and k2 are two very important hyperparameters whose values must be chosen to guarantee a high recall rate while preventing the introduction of excessive noise. The re-ranked database schema items are restored to their original names, which the subsequent SQL generation modules use for linking, and the resulting optimal schema items serve as the input of those modules. The procedure is shown in FIG. 3.
For example, take the query sample "what is the flight number from Beijing?". As shown in FIG. 4, the tables and columns of the database are listed with the semantic names of the schema items in brackets. The correlations of the tables (airlines, airports, flights) with the query sample are (0.02, 0.93, 0.95); the column similarities are shown in FIG. 4, and the columns chiefly involved in the SQL statement are airports.airport_code, flights.flight_number, and flights.source_airport. To improve the screening performance, k1 and k2 are set relatively high: schema items whose relevance probability exceeds a given threshold, here 80%, are kept, and the threshold can be set according to the user's query needs. The screening in this example thus gives k1 = 2 and k2 = 4; k1 and k2 serve only to pre-screen the optimal schema items, and the schema items actually involved in the SQL statement are determined by the subsequent encoding part.
S2, the SQL encoding module integrating historical interaction information requires the user, when entering a natural language query, to first look in the historical queries for past generation records that better match the query intent, producing the final result over several rounds of interaction. The relations between the database schema items and the natural language, and between the natural language and the historical queries, are encoded, including the schema-item links and the dependencies within the natural language query; research has shown that representing these relations explicitly improves the expressiveness of the model. Once every relation is made explicit, a relation-aware encoder can encode them. The module specifically comprises four parts: a natural language query and database schema encoding module, a schema item linking module, a history interaction linking module, and a relation-aware graph encoding module. It specifically comprises the following sub-steps:
S21, the natural language query and database schema encoding module:
the module can be independently carried out, or the output result of the S1 module can be directly used as the input of the module, specifically: the pre-screened cross encoder has obtained the initial encoding vector of natural language and database mode items integrated with the optimal mode item and the history inquiry encoded by the cross-language pre-training encoder:
Q i represents the ith historical query statement, where |Q i I represents the total number of tokens in the ith historical natural language query,represents the j-th word vector representation, |k, in the i-th historical query natural language query 1 I represents the top k of the pre-screening in database schema 1 Watch(s)>An initial vector representation representing an ith table in the database, for a database column +.>Vector representation representing the jth column in the ith table, adding the above-listed type ++before column name>The type of the j-th column of the i-th table is shown as either a number or a character. All->Is the sum k of the numbers of all columns 2
S22, the links based on the pattern items are mainly aimed at the table, the column and the link alignment between the natural language query and the table of the database, wherein the relation is table, column, table and natural language subword, column and natural language subword and column. Representing relationships between words in a natural language query asThe relationship between natural language queries and database schema (including database columns and tables and database values) is expressed as +.>The relation between the database schema items such as foreign key information and the like is expressed as +.>The natural language and the database schema are processed as nodes, the links or relations between the natural language query and the database schema items are edges, then the whole input constitutes an interaction graph, because +.>The dependency between the syntax contains several relations, such as up to 55 syntactical dependencies contained in the SpaCy toolkit, all of which are injected into the inter-composition, possibly resulting in overfitting, thus abstracting the dependency between the syntax into three relations, forward, backward, none, respectively representing dependenciesDepending on the word preceding the current word, depending on the word following the current word, and no dependencies, such as "see, this lovely cat", "cat" forward depends on "lovely", "this" backward depends on "kitten".
Further preferably, let vertex A and vertex B be any two different vertices of the directed graph.
R_S denotes the relations among database schema items, i.e., among database tables and columns (table-table, table-column, and column-column relations, such as foreign keys).
TABLE 1: relations contained in R_S
Further, for the linking relations R_QS between the natural language query and the database schema: as shown in FIG. 5, "Beijing" and "flight number" in the natural language are linked with the city column of the airports table and the flightno column of the flights table respectively. An important value-linking step is also performed here when the database schema items and the tokens of the natural language query are correctly linked: "Beijing" is a value in the city column, and the values mentioned in the question are aligned with the schema. In R_QS, a complete match between a token and a database schema item is denoted EM (exact match) and a partial match PM (partial match).
TABLE 2: relations contained in R_QS
Little prior work analyzes the syntactic dependencies of the query question. A pre-trained model can handle dependencies to some extent, but experiments show that labeling them explicitly clearly improves the SQL generation effect. R_Q denotes the dependencies between the tokens of the natural language query; to prevent the total number of relations from causing overfitting, the relations are abstracted: when vertices A and B are both natural language tokens, Syntax-F means A has a forward syntactic dependency on B, Syntax-R means A has a reverse syntactic dependency on B, and Syntax-None means A and B have no syntactic dependency.
TABLE 3: relations contained in R_Q
Finally, R_QH denotes the relations between the tokens of the current natural language query and the tokens of the historical queries. If the user's natural language query joins several tables, it is difficult, in a practical application scenario, for the user to describe the real query intent accurately in one shot; the user is therefore expected to describe the requirement by building on the history of queries already issued, stacking them to generate more complex queries, which lets the model see the user's real intent more clearly. For example, the user wants to query "which flights go from Beijing to Shanghai with Air China as the airline"; the multi-round process of generating the complex query from the historical queries is shown in FIG. 6.
The relations contained in R_QH are: with vertex A a token of the current natural language query and vertex B a token of a historical query, Coref_Relation indicates that B and A co-refer, and Co_Relation indicates that the co-reference is regarded as one whole relation.
TABLE 4: relations contained in R_QH
S23, setting the relation features. With the above series of relations defined, the model must be made to perceive these predefined relations, so the encoder initializes every edge of the preliminary heterogeneous graph built from the nodes and edges as a vector representation. The key relation r_ij^K and the value relation r_ij^V are set by concatenating the features of every edge across the different graphs:
r_ij^K = r_ij^V = concat(m_ij^1 p_ij^1, ..., m_ij^R p_ij^R)
where i and j denote the i-th and j-th nodes, concat(·) denotes the concatenation operation, p_ij^r is a trainable edge vector representation of the r-th relation, and m_ij^r is a variable used for graph pruning: according to the input natural language query, every edge of the heterogeneous graph is binarily classified to record whether it needs to be kept; if not, m_ij sets the vector representation of the edge to an all-zero vector of the same dimension. R is the total number of relation edges.
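In implementation terms, the interaction graph reduces to a matrix of relation ids over all node pairs. The following Python sketch illustrates this; the relation names follow Tables 1 to 4 as described above, while the enum layout and the helper name are assumptions:

from enum import IntEnum, auto

class Rel(IntEnum):
    # One id per relation between an ordered node pair (A, B).
    NONE = 0
    FOREIGN_KEY = auto()       # R_S: column -> column foreign key
    TABLE_HAS_COLUMN = auto()  # R_S: table <-> one of its columns
    EXACT_MATCH = auto()       # R_QS: token == schema name (EM)
    PARTIAL_MATCH = auto()     # R_QS: partial overlap (PM)
    VALUE_MATCH = auto()       # R_QS: token matches a cell value
    SYNTAX_F = auto()          # R_Q: forward syntactic dependency
    SYNTAX_R = auto()          # R_Q: reverse syntactic dependency
    COREF = auto()             # R_QH: co-reference with a history token

def build_relation_matrix(nodes, rel_fn):
    # Dense |V| x |V| matrix of relation ids for the heterogeneous graph;
    # rel_fn decides the relation between any ordered node pair.
    n = len(nodes)
    return [[rel_fn(nodes[i], nodes[j]) for j in range(n)] for i in range(n)]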
S24, the constructed heterogeneous graph is encoded according to RGAT (Relation-aware Graph Attention Transformer), with the following formulas:
e_ij^(h) = x_i W_Q^(h) (x_j W_K^(h) + r_ij^K)^T / sqrt(d/H)
α_ij^(h) = softmax_j(e_ij^(h))
z_i^(h) = Σ_j α_ij^(h) (x_j W_V^(h) + r_ij^V)
z_i = concat(z_i^(1), ..., z_i^(H)) W_O
RGAT is stacked from several layers of multi-head self-attention; each layer contains H heads in total, and each head attends to different positions in the vectors through its own randomly initialized W_Q, W_K, W_V. The predefined relations are added on top of the original self-attention mechanism, so the new key matrix becomes x_j W_K + r_ij^K and the value matrix x_j W_V + r_ij^V, each obtained by linear transformation. Here W_Q, W_K, W_V are randomly initialized coefficient matrices set in each head; x_i and x_j are the i-th and j-th tokens of the input sequence X_init; e_ij^(h) is the computed attention score of the i-th token for the j-th token and α_ij^(h) the corresponding attention weight; z_i^(h) is the attention result of the h-th attention head, and the values of the H independent heads are concatenated and multiplied by the parameter matrix W_O to obtain the final multi-head attention value z_i.
A first add-and-normalize unit then adds the multi-head attention value of the i-th token to the composite semantic encoding vector of the i-th token and normalizes, giving ỹ_i, which is fed into a feed-forward neural network: ỹ_i passes through a linear layer, the result is processed by a ReLU activation function, and the output of a further linear layer is added back to ỹ_i by a second add-and-normalize unit, specifically:
ỹ_i = LayerNorm(x_i + z_i),  y_i = LayerNorm(ỹ_i + W_2 ReLU(W_1 ỹ_i))
Each token is processed in this way to produce y_i, and the y_i vectors of all tokens form the matrix of the final relation encoding X_encode. The overall process is shown in FIG. 7.
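One head of this relation-aware attention might be sketched as follows, consistent with the formulas above; the tensor shapes and the lookup of r^K and r^V from a relation-id matrix are assumptions:

import math
import torch
import torch.nn as nn

class RelationAwareAttention(nn.Module):
    # Single head: relation embeddings r^K, r^V, looked up from the
    # relation-id matrix of the interaction graph, are added to the keys
    # and values before the scaled dot-product attention.
    def __init__(self, d: int, num_relations: int):
        super().__init__()
        self.wq, self.wk, self.wv = (nn.Linear(d, d) for _ in range(3))
        self.rel_k = nn.Embedding(num_relations, d)
        self.rel_v = nn.Embedding(num_relations, d)
        self.d = d

    def forward(self, x: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # x: (n, d) node features; rel_ids: (n, n) relation-id matrix
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        rk, rv = self.rel_k(rel_ids), self.rel_v(rel_ids)  # (n, n, d)
        # e_ij = q_i . (k_j + r^K_ij) / sqrt(d)
        e = (q.unsqueeze(1) * (k.unsqueeze(0) + rk)).sum(-1) / math.sqrt(self.d)
        a = torch.softmax(e, dim=-1)
        # z_i = sum_j a_ij (v_j + r^V_ij)
        return (a.unsqueeze(-1) * (v.unsqueeze(0) + rv)).sum(dim=1)

z = RelationAwareAttention(64, 10)(torch.randn(7, 64), torch.randint(0, 10, (7, 7)))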
S3, SQL decoding module
SQL decoding performs depth-first, syntax-tree-based SQL decoding on the relation encoding tensor X_encode, giving a complete SQL structure prediction model. Starting from the root node, a node sequence is generated in depth-first order according to the SQL grammar rules, fusing the features of the previous step, the parent node features, the context features, and the node type features, and maximizing the probability of the correct SQL statement corresponding to the natural language question. First the SQL abstract syntax tree is generated, then a series of decoder actions are output through an LSTM network; the update of the tree-structured decoder's LSTM state is:
m_t, h_t = LSTM([a_(t-1); a_(p_t); h_(p_t); z_t; n_t], m_(t-1), h_(t-1))
where m_t is the LSTM cell state and h_t the output of the LSTM at step t; h_(p_t) is the output at the parent node of the current node; a_(t-1) is the embedded representation of the previous action (the actions being APPLYRULE, SELECTTABLE, and SELECTCOLUMN); a_(p_t) represents the action of the parent node of the current node; z_t is the context representation; p_t is the syntax-tree parent node of the current node; and n_t is the embedded representation of the current node type. The goal of the model is to maximize the probability of predicting the correct SQL statement, i.e. to maximize Π_t Prob(a_t | a_pre, X_encode), where Prob denotes the conditional probability of a_t given a_pre and X_encode, and a_pre represents all previously predicted actions.
The decoding comprises the following sub-steps:
S31, expanding a non-leaf node: when the generated node is a non-leaf node, it is expanded by a grammar rule; this action, APPLYRULE, applies a grammar rule with probability
Prob(a_t = APPLYRULE[R] | a_pre, X_encode) = softmax(g(h_t))
where g(·) is a 2-layer MLP with Tanh as the activation function and softmax(·) is the normalized exponential function;
S32, generating leaf nodes: when the generated node is a leaf node, a table name or a column name is selected from the database schema, the SELECTTABLE and SELECTCOLUMN actions respectively, both computed by the same formula:
Prob(a_t = SELECTTABLE[t_j] | a_pre, X_encode) = softmax_j(λ_j),  λ_j = h_t^T W_s t_j
where λ_j is the selection weight and t_j is the embedded representation of the j-th table (for SELECTCOLUMN the column embedding takes its place).
For example, the following SQL statement is generated:
SELECT COUNT(*)
FROM students
WHERE name....
The generation process is shown in FIG. 8: nodes are generated in depth-first order starting from the root node of the syntax tree, where students is the name of the data table and name is a column of that table. Generating an SQL grammar keyword or the terminator [END] is the APPLYRULE action; the actions generating a data table name and a data column name are called SELECTTABLE and SELECTCOLUMN respectively. The possible types of the successor nodes of each node are determined by the SQL grammar. When all branches end with [END], the SQL decoding process is complete; the syntax tree is then traversed in depth-first order to obtain the unique SQL statement, which is handed to the database execution engine to return the SQL query result.
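As a toy illustration of this depth-first expansion, the following sketch hard-codes a grammar that covers only the example statement and replaces the learned action scoring with a lookup, so that only the control flow of APPLYRULE, SELECTTABLE, and SELECTCOLUMN is shown:

# Grammar keywords are terminals (APPLYRULE); "table" and "column" are
# leaves filled by SELECTTABLE / SELECTCOLUMN; "..." is the elided value.
GRAMMAR = {
    "root": ["SELECT", "agg", "FROM", "table", "WHERE", "cond"],
    "agg":  ["COUNT(*)"],
    "cond": ["column", "...", "[END]"],
}

def decode(node: str, pick) -> list[str]:
    # Expand nodes depth-first; leaves select a table or column name.
    if node == "table":
        return [pick("SELECTTABLE")]        # e.g. "students"
    if node == "column":
        return [pick("SELECTCOLUMN")]       # e.g. "name"
    if node not in GRAMMAR:                 # grammar keyword or [END]
        return [node]
    out = []
    for child in GRAMMAR[node]:             # depth-first order
        out.extend(decode(child, pick))
    return out

tokens = decode("root", {"SELECTTABLE": "students", "SELECTCOLUMN": "name"}.get)
print(" ".join(tokens))  # SELECT COUNT(*) FROM students WHERE name ... [END]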
Those of ordinary skill in the art will recognize that the embodiments described here are intended to help the reader understand the principles of the invention, and that the scope of the invention is not limited to these specific statements and embodiments. Various modifications and variations will be apparent to those skilled in the art; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall be included in the scope of the claims of the invention.

Claims (8)

1. An SQL generation method injecting optimal schema items and historical interaction information, characterized by comprising the following steps:
S1, a pre-screening cross encoder computes the schema items most relevant to the natural language query from the input natural language query and the corresponding database schema items;
S2, an SQL encoding module integrating historical interaction information constructs an interaction graph over the natural language and the historical queries from the optimal schema items output by the pre-screening cross encoder and the user's historical interaction information, and encodes the resulting interaction graph;
S3, the encoding result of step S2 is decoded based on the SQL structure prediction model.
2. The SQL generation method injecting optimal schema items and historical interaction information according to claim 1, characterized in that step S1 specifically comprises:
S11, normalizing the numerical items in the input natural language query and replacing the database schema items with semantic names;
S12, concatenating the natural language query processed in step S11 with the database schema items and feeding the result into a cross-lingual pre-trained model to obtain the encoded representations of the hidden states of the tokenized schema items;
pooling the encoded representations of the hidden states of the tokenized schema items with a BiLSTM, representing each schema item as one whole embedding, and feeding the pooled result into a non-linear fully connected layer to obtain the table embeddings and the column embeddings;
S13, injecting the column embedding information into the table embeddings through stacked multi-head scaled dot-product attention layers and adding a feature fusion layer to obtain the column-embedding-information-enhanced table embeddings;
S14, classifying the column embeddings and the enhanced table embeddings with two different multi-layer perceptron modules and computing the probabilities that the natural language is relevant to the different columns and tables; from the relevance probabilities of the natural language to the different columns and tables, selecting the k1 tables and k2 columns with the highest probabilities to obtain the optimal schema items.
3. The SQL generation method injecting optimal schema items and historical interaction information according to claim 2, characterized in that replacing the database schema items with semantic names in step S11 specifically comprises: traversing the data tables in the database and the concrete column names in each table, restoring the real semantic name of a data table from its comment, and restoring the real semantic name of a data column from the information of its data table together with its column comment.
4. The SQL generation method injecting optimal schema items and historical interaction information according to claim 3, characterized in that step S2 specifically comprises:
S21, obtaining the initial encoding vectors of the natural language and the database schema items from the optimal schema items and the historical queries;
S22, treating the natural language tokens and the database schema items as nodes and the links or relations between the natural language query and the database schema items as edges, to construct a preliminary interaction graph;
S23, setting the key relation r_ij^K and the value relation r_ij^V by concatenating all edge features in the interaction graph;
S24, encoding the heterogeneous graph processed in step S23 with a relation-aware graph attention Transformer to obtain the final relation encoding X_encode.
5. The SQL generation method injecting optimal schema items and historical interaction information according to claim 4, characterized in that the links or relations between the natural language query and the database schema items in step S22 comprise: the relations between the tokens of the natural language query, denoted R_Q; the relations between the natural language query and the database schema, denoted R_QS; and the relations among the database schema items, denoted R_S.
6. The SQL generation method injecting optimal schema items and historical interaction information according to claim 5, characterized in that the dependencies in R_Q comprise: depending on a token before the current token, depending on a token after the current token, and no dependency.
7. The SQL generation method injecting optimal schema items and historical interaction information according to claim 5, characterized in that step S3 specifically comprises the following sub-steps:
S31, generating the node sequence from the root node in depth-first order; if the generated node is a non-leaf node, executing step S32, and if the generated node is a leaf node, executing step S33;
S32, expanding the non-leaf node into a grammar rule:
Prob(a_t = APPLYRULE[R] | a_pre, X_encode) = softmax(g(h_t))
where g(·) is a 2-layer multi-layer perceptron module with Tanh as the activation function, APPLYRULE denotes applying a grammar rule, a_pre represents all previously predicted actions, h_t is the output of the LSTM at step t, and a_t is the action predicted at step t;
S33, selecting a table name or a column name from the database schema, the SELECTTABLE and SELECTCOLUMN actions respectively, both computed by the same formula: Prob(a_t = SELECTTABLE[t_j] | a_pre, X_encode) = softmax_j(λ_j), where λ_j = h_t^T W_s t_j is the selection weight and t_j is the embedded representation of the j-th table;
S34, when all branches end with [END], the SQL abstract syntax tree generated through the LSTM network is complete; the SQL abstract syntax tree is then traversed depth-first to yield the SQL statement.
8. The SQL generation method injecting optimal schema items and historical interaction information according to claim 7, characterized in that the formula of the LSTM network update state is:
m_t, h_t = LSTM([a_(t-1); a_(p_t); h_(p_t); z_t; n_t], m_(t-1), h_(t-1))
where m_t is the LSTM cell state, h_t is the output of the LSTM at step t, h_(p_t) is the output at the parent node of the current node, a_(t-1) is the embedded representation of the previous action, a_(p_t) represents the action of the parent node of the current node, z_t is the context representation, p_t is the syntax-tree parent node of the current node, and n_t is the embedded representation of the current node type.
CN202311004567.1A 2023-08-10 2023-08-10 SQL generation method injecting optimal schema items and historical interaction information Pending CN117033423A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311004567.1A CN117033423A (en) SQL generation method injecting optimal schema items and historical interaction information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311004567.1A CN117033423A (en) SQL generation method injecting optimal schema items and historical interaction information

Publications (1)

Publication Number Publication Date
CN117033423A true CN117033423A (en) 2023-11-10

Family

ID=88631235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311004567.1A Pending CN117033423A (en) SQL generation method injecting optimal schema items and historical interaction information

Country Status (1)

Country Link
CN (1) CN117033423A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591543A (en) * 2024-01-19 2024-02-23 成都工业学院 SQL sentence generation method and device for Chinese natural language
CN117591543B (en) * 2024-01-19 2024-04-02 成都工业学院 SQL sentence generation method and device for Chinese natural language


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination